Punctuation Prediction using Conditional Random Fields



Punctuation Prediction using Conditional Random Fields

Semih Yavuz
Computer Science and Engineering
University of California, San Diego
La Jolla, California 92093-0404
[email protected]

Alican Nalci
Electrical and Computer Engineering
University of California, San Diego
La Jolla, California 92093-0404
[email protected]

Abstract—In this paper, we train Conditional Random Field (CRF) based classifiers to predict punctuation tags of English sentences. We employ two different optimization approaches, namely the Collins perceptron and stochastic gradient ascent (SGA); the Collins perceptron requires a correct implementation of the Viterbi algorithm, and SGA requires computing the gradient of the objective. The feature functions of our CRF models are designed by considering English punctuation rules. In order to capture the correlation between words and the corresponding punctuation symbols, feature function templates are created according to the part-of-speech (POS) tags of the words. For this purpose, a standalone POS tagger by the Stanford NLP group is run on the training and testing datasets. In the training phase of SGA, we perform a grid search on a validation set to find the best hyper-parameters, namely the regularization parameter (µ) and the learning rate (λ). Finally, in both experiments, the trained CRF model is tested on a separate test set once. The SGA method achieves 93.12% accuracy on the 25% of the training set held out for validation and 92.85% accuracy on the test set. The Collins perceptron achieves 92.01% accuracy on the validation set and 91.64% accuracy on the test set.

I. INTRODUCTION

In this project the aim is to correctly predict punctuation marks given a set of English sentences. For this purpose, we employ the idea of Conditional Random Fields (CRF), trained with the Collins perceptron (CP) and with Stochastic Gradient Ascent (SGA). The Collins perceptron and stochastic gradient ascent are used to obtain a generic model, a vector of weights w_j, for predicting the probability of a specific punctuation mark at a given place in a sentence. The goal in punctuation placement is similar to assigning a predefined set of labels in the correct order to an individual sentence. For example, if a sample sentence is "I went to school", the correct punctuation placement would be "SPACE SPACE SPACE PERIOD".

There are at least two factors that make the task of punctuation prediction difficult. One of them is that the input to the predictor is variable in size: one sentence can have a different number of words than another, so a predefined length for the input sequence cannot be assumed. This places a burden on the classifier, since the space of possible output sequences can be exponentially large. A second issue is that if a traditional classifier is used to make a prediction for each punctuation mark separately, then we lose too much information; this is true even if the classifier has access to the complete sentence [1]. Therefore, to overcome such issues and to achieve high accuracy compared to traditional classifiers, we use conditional random fields and exploit higher-level information such as the correlation between neighboring punctuation tags.

II. LOG-LINEAR MODELS

The method of linear-chain conditional random fields is an extension of logistic regression. It is a way to use a log-linear model for the sequence prediction task [1]. If we assume that x is an example drawn from a set X, and y is the corresponding label drawn from a finite set Y, then the log-linear model defines the probability of a specific label y given the example x as

p(y \mid x; w) = \frac{\exp \sum_{j=1}^{J} w_j F_j(x, y)}{Z(x, w)}    (1)

It should be noted that y and x are sequences of variable length, meaning that for a given example sentence, if the number of words is n, then the corresponding label sequence y will also be of length n. The expression F_j(x, y) is called a feature function and encodes correlation information between a specific example x and the label y. The index j in Equation 1 ranges over all J feature functions defined on x and y. Accordingly, each one of the J feature functions F_j(x, y) can measure a different type of relation between x and y. The scalar w_j is called the weight of the j-th feature function among the other J−1 feature functions. The sign of w_j can be negative or positive. If the weight w_j is positive, then the j-th feature function indicates that the label y is a plausible match for the sequence x. Intuitively, a negative w_j indicates the opposite. A zero value of w_j means that the feature function F_j is unimportant for deciding the label y given x.

The normalizing factor Z(x, w) in the denominator of Equation 1 is called the partition function; it is a constant given the example x and the weights w. It ensures that the probabilities p(y|x; w) sum to 1 over all labels, and it is defined as:

Z(x, w) = \sum_{y' \in Y} \exp \sum_{j=1}^{J} w_j F_j(x, y')

Intuitively, the denominator is just the sum over all possible values of the numerator in Equation 1, that is, a marginalization of \exp \sum_{j=1}^{J} w_j F_j(x, y) with respect to y. In log-linear models, there is exactly one weight value for each feature function F_j(x, y).

Feature functions can be further defined as boolean mappings on the space X × Y, where the output is in {0, 1}. However, in many practical applications feature functions are defined as a class. In light of this, we can define a class of feature functions of the form

F_j(x, y) = A_a(x) \, B_b(y)
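To make Equation 1 concrete, the following Python sketch (an illustrative reconstruction, not part of the original implementation, which was written in Matlab) evaluates p(y|x; w) by brute-force enumeration over a toy label set; the tag alphabet, weights, and the single feature function are assumptions made for the example only.

import itertools
import math

# Toy tag alphabet; an illustrative assumption, not the full label set of the paper.
TAGS = ["SPACE", "PERIOD"]

def brute_force_prob(x, y, w, feature_fns):
    """Evaluate p(y|x; w) of Equation 1 by enumerating every candidate label
    sequence y' of the same length as x (exponential cost, toy inputs only)."""
    def score(y_seq):
        return sum(wj * Fj(x, y_seq) for wj, Fj in zip(w, feature_fns))

    numerator = math.exp(score(y))
    # Partition function Z(x, w): the same numerator summed over all label sequences.
    Z = sum(math.exp(score(y_prime))
            for y_prime in itertools.product(TAGS, repeat=len(x)))
    return numerator / Z

# One illustrative feature function: counts how many words carry the SPACE tag.
F0 = lambda x, y: sum(1.0 for tag in y if tag == "SPACE")

print(brute_force_prob(["I", "went", "to", "school"],
                       ("SPACE", "SPACE", "SPACE", "PERIOD"),
                       w=[0.5], feature_fns=[F0]))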

III. CONDITIONAL RANDOM FIELDS

Conditional random fields (CRFs) are particularly useful for machine learning tasks where the input sequence is complex, like an English sentence, and the output label sequence is hard to determine using traditional per-word classifiers. What is meant by a complex input is that, if a sentence is to be punctuated using a log-linear model, a per-word classifier alone would lose too much information, because neighboring words in a sentence, or the first word of a sentence, are often strong punctuation determiners. Also, the neighboring tags of y can be exploited more effectively with CRFs than with per-word classifiers. Given the standard log-linear model in Equation 1, we want to find a parameter vector w := (w_1, w_2, ..., w_J) which results in the best possible prediction

\hat{y} = \arg\max_{y} p(y \mid x; w)    (2)

for each training example x. Before we describe the possible training methods for finding a good parameter vector w, we first need to resolve the decision problem of finding the most probable tag sequence \hat{y} for a sentence x given w. This is a computationally hard problem if approached naively, since there are exponentially many candidate tag sequences y to consider when computing \hat{y} in Equation 2. The inference problem is solved by exploiting a structural property of the feature functions in the linear-chain CRF model, namely the fact that low-level feature functions can depend on at most two tags, which must be consecutive. This is explained in more detail in the next section.

Our objective in training is to find the parameter vector w for which the conditional probability of the training examples (Equation 1) is maximized. To this end, we use the logarithm of the conditional likelihood (LCL) of the set of training examples as an objective function. More precisely, our goal is to maximize the LCL with respect to the parameter vector w. Formally, based on Equation 1, the LCL of a single example (x, y) is

LCL = \sum_{j=1}^{J} w_j F_j(x, y) - \log Z(x, w)

A. Linear Chain CRF

Learning methods for CRFs compute the optimal value of w, so that the best prediction \hat{y} can be found. As mentioned above, once the parameter vector w is determined, computing Equation 2 naively is computationally exhausting because the number of candidate tag sequences is exponential in the size of x. To make this computation tractable, we exploit a crucial property of the low-level feature functions imposed by the linear-chain CRF model: they depend on at most two tags. Now, to clarify how this property allows us to compute the argmax in Equation 2 in polynomial time, we explain what a low-level feature function is and how it operates as a building block of a feature function. We can decompose each feature function F_j(x, y) into a sum of low-level feature functions f_j(y_{i-1}, y_i, x, i), where each low-level feature function depends on the whole input sequence x, the current position i within the input sequence, and the two consecutive tags y_{i-1} and y_i. Low-level feature functions of the form f_j(y_{i-1}, y_i, x, i) are defined for i = 1, 2, ..., n+1, where n is the number of words in x and hence the number of tags required for x. In addition, we use two dummy tags y_0 = START and y_{n+1} = STOP for convenience at the beginning and end of each x. The reason why the f_j are defined up to i = n+1 is that there are n+1 consecutive tag pairs in x, namely (y_0, y_1), (y_1, y_2), ..., (y_n, y_{n+1}), on each of which every low-level feature function f_j must be well-defined. Having defined the low-level feature functions, F_j(x, y) is given as

F_j(x, y) = \sum_{i=1}^{n+1} f_j(y_{i-1}, y_i, x, i)    (3)

We now describe the efficient computation of the argmax using this decomposition of F_j(x, y). We can rewrite Equation 2 as follows:

\hat{y} = \arg\max_{y} \sum_{j=1}^{J} w_j F_j(x, y)
        = \arg\max_{y} \sum_{j=1}^{J} w_j \sum_{i=1}^{n+1} f_j(y_{i-1}, y_i, x, i)
        = \arg\max_{y} \sum_{i=1}^{n+1} g_i(y_{i-1}, y_i)

where we define g_i(y_{i-1}, y_i) as

g_i(y_{i-1}, y_i) = \sum_{j=1}^{J} w_j f_j(y_{i-1}, y_i, x, i)

Given x, i, and the vector w, the corresponding g_i function can be represented as an m by m matrix, where m is the number of different tags [2]. Computing this matrix takes O(m^2 J) time, since the number of different tag pairs is m^2 and each entry requires O(J) time, assuming we have O(1)-time access to f_j(y_{i-1}, y_i, x, i). Therefore, it takes O(n m^2 J) time to compute all g_i matrices for i = 1, 2, ..., n+1.
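As a concrete illustration of this step, here is a small Python sketch (again an illustrative reconstruction, not the authors' Matlab code) that precomputes the g_i matrices from a list of low-level feature functions supplied as callables f_j(y_prev, y_cur, x, i); the tag list and data layout are assumptions of the sketch.

import numpy as np

def compute_g_matrices(x, w, low_level_fns, tags):
    """Precompute g_i(u, v) = sum_j w_j * f_j(u, v, x, i) for i = 1..n+1.
    Returns a list of (m x m) matrices indexed by position; rows correspond to
    the previous tag u, columns to the current tag v. Cost is O(n m^2 J)."""
    n, m = len(x), len(tags)
    g = []
    for i in range(1, n + 2):            # positions 1..n+1 (the STOP transition included)
        gi = np.zeros((m, m))
        for a, u in enumerate(tags):
            for b, v in enumerate(tags):
                gi[a, b] = sum(wj * fj(u, v, x, i)
                               for wj, fj in zip(w, low_level_fns))
        g.append(gi)
    return g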

Now, we define U(k, v) to be the score of the best tag sequence from position 1 to position k, where the k-th tag is equal to v. This can be seen as finding the score of the most probable tag sequence restricted to the condition that the last tag is equal to v. This quantity will allow us to recover the best tag sequence for x. Formally,

U(k, v) = \max_{y_1, \ldots, y_{k-1}} \left( \sum_{i=1}^{k-1} g_i(y_{i-1}, y_i) + g_k(y_{k-1}, v) \right)

This can be rewritten as

U(k, v) = \max_{y_{k-1}} \max_{y_1, \ldots, y_{k-2}} \left( \sum_{i=1}^{k-2} g_i(y_{i-1}, y_i) + g_{k-1}(y_{k-2}, y_{k-1}) + g_k(y_{k-1}, v) \right)

Considering the definition of U(k-1, y_{k-1}) and writing u instead of y_{k-1} yields the following recursive formula

U(k, v) = \max_{u} \left[ U(k-1, u) + g_k(u, v) \right]    (4)

where the maximum is taken over all possible tags u. Again, by considering the definition of U(k, v), we can set the base case as follows:

U(0, v) = I(v = START )

for each tag value v. This dynamic programming algorithm takes O(m) time to compute each entry U(k, v), as it iterates over the m possible tags u. Since the U matrix has nm entries, computing the entire U matrix runs in O(m^2 n) time. One should note that the information needed to recover the best tag sequence \hat{y} = (\hat{y}_1, \hat{y}_2, ..., \hat{y}_n) is now available in the U matrix, and it can be extracted as follows: first compute

\hat{y}_n = \arg\max_{v} U(n, v)    (5)

and then recursively compute

\hat{y}_{k-1} = \arg\max_{u} \left[ U(k-1, u) + g_k(u, \hat{y}_k) \right]    (6)

for k = n, n-1, ..., 2. Having computed the U and g_i matrices, this final recovery step takes O(nm) time.

The algorithm described above exploits the specific structural property of the low-level feature functions via dynamic programming. Its overall running time is O(m^2 n J + m^2 n + mn), since computing the g_i matrices, computing the U matrix, and recovering the tag sequence \hat{y} take O(m^2 n J), O(m^2 n), and O(mn) time, respectively. Thanks to this dynamic programming technique, the inference problem, which is intractable by the naive approach, can be solved in polynomial time. Having resolved the inference problem, we can now proceed to describe the learning process, i.e., choosing the weight vector w.
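The following Python sketch implements the U(k, v) recursion and the backtracking of Equations 4 to 6 on top of the g_i matrices from the previous sketch. Treating START and STOP as extra symbols in the tag list and using negative infinity instead of the indicator base case are implementation choices of this sketch, not details taken from the paper.

import numpy as np

def viterbi_decode(g, tags, start="START", stop="STOP"):
    """Recover argmax_y sum_i g_i(y_{i-1}, y_i) with the U(k, v) recursion.
    `g` is the list of (m x m) matrices from compute_g_matrices, where the tag
    list passed there included the dummy START and STOP symbols. Runs in
    O(n m^2) time once the g matrices are available."""
    m = len(tags)
    n = len(g) - 1                       # g holds n+1 matrices for positions 1..n+1
    U = np.full((n + 2, m), -np.inf)     # U[k, v] for k = 0..n+1
    back = np.zeros((n + 2, m), dtype=int)
    U[0, tags.index(start)] = 0.0        # base case: the sequence must start at START

    for k in range(1, n + 2):
        for v in range(m):
            scores = U[k - 1, :] + g[k - 1][:, v]   # U(k-1, u) + g_k(u, v)
            back[k, v] = int(np.argmax(scores))
            U[k, v] = scores[back[k, v]]

    # Trace back from the STOP tag at position n+1.
    path = [tags.index(stop)]
    for k in range(n + 1, 0, -1):
        path.append(back[k, path[-1]])
    path.reverse()
    return [tags[t] for t in path[1:-1]]  # drop the dummy START/STOP tags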

B. Training via CRF Model

Given a set of training examples, our learning task is to determine the weight vector w for which the total conditional probability of the training examples is maximized. This is the same as maximizing the logarithm of the conditional likelihood (LCL) of the set of training examples. Let {(x_n, y_n)}_{n=1}^{N} be the set of all training examples. Then, our objective function is

LCL(w) = \sum_{n=1}^{N} \log p(y_n \mid x_n; w)


Putting,

p(y_n \mid x_n; w) = \frac{\exp \sum_{j=1}^{J} w_j F_j(x_n, y_n)}{Z(x_n, w)}

we get

LCL(w) = \sum_{n=1}^{N} \left( \sum_{j=1}^{J} w_j F_j(x_n, y_n) - \log Z(x_n, w) \right)

Moreover, in order to prevent our model from overfitting to the training data, L2 regularization is used. Embedding L2 regularization into our objective function, we obtain the following regularized objective:

RegLCL(w) = \sum_{n=1}^{N} \log p(y_n \mid x_n; w) - \mu \lVert w \rVert^2

In order to find the maximum value of our objective function, we perform gradient ascent. We describe two different methods for this purpose in the following two sections.

C. Training with Stochastic Gradient Ascent

In the stochastic gradient ascent method, we update the parameters based on single training examples. Thus, the partial derivative of the LCL with respect to each w_j is computed for a single training example. The partial derivative of the LCL with respect to w_j is

\frac{\partial}{\partial w_j} \log p(y \mid x; w) = F_j(x, y) - \frac{\partial}{\partial w_j} \log Z(x, w)    (7)
                                                  = F_j(x, y) - \frac{1}{Z(x, w)} \frac{\partial}{\partial w_j} Z(x, w)    (8)

where y is the true label of the training example x and j is the index of the component of the parameter vector w with respect to which we compute the partial derivative. Expanding Z(x, w), the last term becomes

\frac{\partial}{\partial w_j} \sum_{y'} \left[ \exp \sum_{j'} w_{j'} F_{j'}(x, y') \right]

and following the derivation in the lecture notes, we obtain

\frac{\partial}{\partial w_j} \log p(y \mid x; w) = F_j(x, y) - \sum_{y'} F_j(x, y') \, p(y' \mid x; w)    (9)
                                                  = F_j(x, y) - E_{y' \sim p(y' \mid x; w)}[F_j(x, y')]    (10)

In order to compute the partial derivative above, we need to evaluate the expectation in the formula. A naive approach to computing this expectation would take exponential time, as there are exponentially many different tag sequences y' over which the expectation is taken. In order to evaluate this expectation efficiently, in time polynomial in the size of the input, we introduce the concept of forward and backward vectors.

Let α(k, v) denote the unnormalized probability mass of the set of all partial tag sequences that end at position k with tag v. Note that, for each k, α(k, ·) is a vector of length m; it is called the forward vector.

Recall the expansion

Z(x, w) = \sum_{y} \exp \sum_{j=1}^{J} w_j F_j(x, y)
        = \sum_{y} \exp \sum_{j=1}^{J} w_j \sum_{i=1}^{n+1} f_j(y_{i-1}, y_i, x, i)
        = \sum_{y} \exp \sum_{i=1}^{n+1} g_i(y_{i-1}, y_i)

Recalling the definition of the forward vector α(k, v), we have

\alpha(k+1, v) = \sum_{y_1, \ldots, y_k} \exp \left[ \sum_{i=1}^{k} g_i(y_{i-1}, y_i) + g_{k+1}(y_k, v) \right]

Similar to the recursive formula we derived for the dynamic-programming computation of the U matrix in Section III.A, we have

\alpha(k+1, v) = \sum_{u} \alpha(k, u) \left[ \exp(g_{k+1}(u, v)) \right]

for all k = 0, 1, ..., n. The base case for the forward vectors is α(0, v) = I(v = START).

Note by definition that

Z(x, w) = α(n+ 1, STOP ) (11)

Similarly, the backward vector β(u, k) is defined as the unnormalized probability mass of the partial tag sequences starting at position k with tag value u. It satisfies the following similar recursive formula

\beta(u, k) = \sum_{v} \left[ \exp(g_{k+1}(u, v)) \right] \beta(v, k+1)

where the base case is β(u, n+1) = I(u = STOP) for each tag u.

Again by definition, we have

Z(x, w) = β(START, 0) (12)

Forward and backward vectors can be used to check whether Z(x, w) is computed correctly, since the values obtained from Equation 11 and Equation 12 must be the same.

As discussed in the lecture notes, forward and backward vectors have many useful properties; we specifically need the following one in order to efficiently compute the expectation E_{y' \sim p(y' \mid x; w)}[F_j(x, y')] in the update rule (10):

p(Y_k = u, Y_{k+1} = v \mid x, w) = \frac{\alpha(k, u) \left[ \exp(g_{k+1}(u, v)) \right] \beta(v, k+1)}{Z(x, w)}

We now derive the expectation using the forward and backward vectors. The expectation we want to compute is a weighted average of F_j(x, y) over all y with respect to the probability distribution y ~ p(y|x; w). Hence,

E[F_j(x, y)] = \sum_{y} p(y \mid x; w) F_j(x, y)
             = E_y \left[ \sum_{i=1}^{n+1} f_j(y_{i-1}, y_i, x, i) \right]
             = \sum_{i=1}^{n+1} E_y \left[ f_j(y_{i-1}, y_i, x, i) \right]

The last expression above can be rewritten as an expectation over the tags y_{i-1} and y_i at the consecutive positions i-1 and i, because by the structure of our low-level feature functions they depend on at most two tags, which must be adjacent:

E[F_j(x, y)] = \sum_{i=1}^{n+1} E_{y_{i-1}, y_i} \left[ f_j(y_{i-1}, y_i, x, i) \right]
             = \sum_{i=1}^{n+1} \sum_{y_{i-1}} \sum_{y_i} p(y_{i-1}, y_i \mid x; w) \, f_j(y_{i-1}, y_i, x, i)

Finally, using the formula above for the probability of the specific tags u and v at positions k and k+1 in terms of the forward and backward vectors, we conclude that

E_y[F_j(x, y)] = \sum_{i=1}^{n+1} \sum_{y_{i-1}} \sum_{y_i} f_j(y_{i-1}, y_i, x, i) \, \frac{\alpha(i-1, y_{i-1}) \left[ \exp(g_i(y_{i-1}, y_i)) \right] \beta(y_i, i)}{Z(x, w)}

Assuming O(1)-time access to the g_i matrices and the forward and backward vectors, computing the expectation via the last formula above takes O(nm^2) time. Once the g_i matrices are computed, it takes O(nm^2) time to compute the forward vectors, since the computation of α(k, v) for each position-tag pair (k, v) takes O(m) time (summing over all possible tags u) and there are exactly nm such computations. Similarly, computing the backward vectors β(u, k) takes O(nm^2) time in total. Recall that it takes O(nm^2 J) time in total to compute the g_i matrices. Finally, we conclude that computing the expectation required for the gradient takes O(nm^2 J + nm^2 + nm^2 + nm^2) time; simplifying the big-O expression, this is O(nm^2 J + nm^2) overall.
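A Python sketch (an illustrative reconstruction, not the authors' Matlab implementation) of the forward and backward recursions and of the expectation formula above; it works directly with exponentiated scores for clarity, whereas a real implementation should work in log space for numerical stability.

import numpy as np

def forward_backward(g, tags, start="START", stop="STOP"):
    """Compute forward vectors alpha(k, v), backward vectors beta(u, k), and the
    partition function Z(x, w) from the g_i matrices."""
    m, n = len(tags), len(g) - 1
    alpha = np.zeros((n + 2, m))
    beta = np.zeros((n + 2, m))
    alpha[0, :] = [1.0 if t == start else 0.0 for t in tags]
    beta[n + 1, :] = [1.0 if t == stop else 0.0 for t in tags]

    for k in range(0, n + 1):        # alpha(k+1, v) = sum_u alpha(k, u) exp g_{k+1}(u, v)
        alpha[k + 1, :] = alpha[k, :] @ np.exp(g[k])
    for k in range(n, -1, -1):       # beta(u, k) = sum_v exp g_{k+1}(u, v) beta(v, k+1)
        beta[k, :] = np.exp(g[k]) @ beta[k + 1, :]

    Z = alpha[n + 1, tags.index(stop)]   # should equal beta[0, START] as a sanity check
    return alpha, beta, Z

def expected_feature(fj, g, alpha, beta, Z, x, tags):
    """E_{y ~ p(y|x;w)}[F_j(x, y)] via the pairwise marginals
    p(Y_{i-1}=u, Y_i=v | x; w) = alpha(i-1, u) exp(g_i(u, v)) beta(v, i) / Z."""
    n, total = len(g) - 1, 0.0
    for i in range(1, n + 2):
        marg = (alpha[i - 1, :, None] * np.exp(g[i - 1]) * beta[i, None, :]) / Z
        for a, u in enumerate(tags):
            for b, v in enumerate(tags):
                total += marg[a, b] * fj(u, v, x, i)
    return total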

Therefore, in order to maximize our objective function, the LCL, via SGA, we perform the following update to each weight w_j:

w_j := w_j + \lambda \left( F_j(x, y) - E_{y' \sim p(y' \mid x; w)}[F_j(x, y')] \right)

where λ is the learning rate. In training with SGA, this update is performed for each w_j based on a randomly chosen single training example. In each epoch, each w_j is updated once per training example, with the training examples visited in random order.
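Putting the pieces together, one SGA update on a single training example might look like the following sketch; it reuses the illustrative helpers defined above, and the convention that the true tag sequence is padded with the dummy START and STOP tags is an assumption of the sketch.

def sga_step(w, x, y_true, low_level_fns, tags, lam=1e-3):
    """One stochastic gradient ascent update on a single example (x, y_true).
    y_true is assumed to be padded with the dummy tags, y_true[0] = START and
    y_true[n+1] = STOP, so that f_j(y_true[i-1], y_true[i], x, i) is defined."""
    n = len(x)
    g = compute_g_matrices(x, w, low_level_fns, tags)
    alpha, beta, Z = forward_backward(g, tags)
    for j, fj in enumerate(low_level_fns):
        Fj_observed = sum(fj(y_true[i - 1], y_true[i], x, i) for i in range(1, n + 2))
        Fj_expected = expected_feature(fj, g, alpha, beta, Z, x, tags)
        w[j] += lam * (Fj_observed - Fj_expected)   # the update rule above
    return w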

D. Collins Perceptron

The Collins perceptron algorithm assumes that all of the probability mass is placed on the most likely value \hat{y} [2]. That is, the most probable label sequence receives probability 1 and every other label sequence receives probability 0. This allows us to use the following approximation to simplify the stochastic gradient update rule:

p(y \mid x; w) \approx I(y = \hat{y}), \quad \text{where} \quad \hat{y} = \arg\max_{y} p(y \mid x; w)

This approximation simplifies the stochastic gradient update rule to

w_j := w_j + \lambda F_j(x, y) - \lambda F_j(x, \hat{y})

Given a training example (x, y), this update rule is applied to every component w_j of the weight vector w. In each epoch, w_j is updated once per training example. For the features whose values are higher for y than for \hat{y}, that is, F_j(x, y) is larger than F_j(x, \hat{y}), the value of w_j increases; otherwise w_j decreases. Intuitively, this update rule can be checked on the special case \hat{y} = y: then F_j(x, y) = F_j(x, \hat{y}) for every j and no weight changes, which is sensible because y is already the most probable tag sequence for x under the current weight vector w. We can fix the learning rate λ = 1, because otherwise all the weights would simply be multiplied by the same nonzero constant, which makes no difference to the argmax.
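For comparison with the SGA step, one Collins perceptron update on a single example can be sketched as follows, reusing the illustrative Viterbi and g-matrix helpers from Section III; the START/STOP padding convention is the same assumption as above.

def perceptron_step(w, x, y_true, low_level_fns, tags):
    """One Collins perceptron update on a single example (x, y_true), with the
    learning rate fixed to 1 as in the text; y_true is padded with START/STOP,
    and `tags` must include the START and STOP symbols used by viterbi_decode."""
    n = len(x)
    g = compute_g_matrices(x, w, low_level_fns, tags)
    y_hat = ["START"] + viterbi_decode(g, tags) + ["STOP"]
    for j, fj in enumerate(low_level_fns):
        Fj_true = sum(fj(y_true[i - 1], y_true[i], x, i) for i in range(1, n + 2))
        Fj_hat = sum(fj(y_hat[i - 1], y_hat[i], x, i) for i in range(1, n + 2))
        w[j] += Fj_true - Fj_hat
    return w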

IV. DATASET CRITERION

A. English Sentence Dataset

The English sentence dataset used in this project consists of e-mail messages collected from the Enron company database. There are 70,115 sentences provided for training and 28,026 for testing. The number of words per sentence is not uniform: for the training dataset, the maximum word count among the available sentences is 30, the minimum is 3, and the mean is 10. Figure 1 shows the histogram of the number of words per sentence in the training database.


Fig. 1: Histogram of the count of words per sentence in the training dataset

Figure 1 shows that there are more sentences with a word count in the range [5, 12] than in the range [15, 20]. Although it is not important for the training algorithm, we give the word-count distribution of the test-set sentences as well (Figure 2).


Fig. 2: Histogram of the count of words per sentence in the test dataset

B. Word Encoding

Given this training dataset, we aim to place punctuation marks correctly. For this reason, as explained in the previous section, we operate both on the POS-tagged sentences and on the raw data simultaneously. We implement a preprocessing step to remove unusable sentences, that is, sentences containing meaningless words are removed from the dataset. For easier processing of the sentences, we use the POS tagging framework developed by the Stanford Natural Language Processing Group. The POS-tagged sequences of the original raw sentences contain a label for each individual word, e.g. the label 'CC' for the word 'and', meaning that 'and' is tagged as a coordinating conjunction.
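As a toy illustration of the resulting parallel representation (the particular sentence, its Penn Treebank POS tags, and the list layout here are examples for exposition, not taken from the Enron data):

# One sentence in the parallel word / POS-tag / punctuation-label representation.
words  = ["I",   "went", "to", "school"]
pos    = ["PRP", "VBD",  "TO", "NN"]       # tags as produced by a Penn Treebank POS tagger
labels = ["SPACE", "SPACE", "SPACE", "PERIOD"]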

C. Label Encoding

Since we perform supervised learning, we are given the correct punctuation labels in a separate file for training. These training labels take the following string values: 'SPACE', 'COMMA', 'PERIOD', 'COLON', 'QUESTION_MARK', and 'EXCLAMATION_POINT'. For efficiency, we convert these string values to distinct integer values before running the algorithms, so that the comparisons in the training loop avoid time-consuming string operations.
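A minimal sketch of this encoding step; the particular integer assignment is arbitrary and only illustrative.

# Map the six punctuation label strings to distinct integers so that comparisons
# inside the training loop are integer comparisons rather than string comparisons.
LABELS = ["SPACE", "COMMA", "PERIOD", "COLON",
          "QUESTION_MARK", "EXCLAMATION_POINT"]
LABEL_TO_ID = {label: i for i, label in enumerate(LABELS)}
ID_TO_LABEL = {i: label for label, i in LABEL_TO_ID.items()}

encoded = [LABEL_TO_ID[t] for t in ["SPACE", "SPACE", "SPACE", "PERIOD"]]  # [0, 0, 0, 2]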

V. EXPERIMENTAL DESIGN

A. Feature Function Design

We define our set of low-level feature functions according to the following rule

f_{a,b}(y_{i-1}, y_i, x, i) = A_a(x, i) \, B_b(y_{i-1}, y_i, i)

where A_a(x, i) depends only on the individual words of the sentence x and their corresponding POS tags, and B_b(y_{i-1}, y_i, i) is the position-dependent component of the low-level feature function f_{a,b}, since it is defined on the two consecutive tags at positions i-1 and i of the tag sequence y. We extract a low-level feature function f_j for each pair of subscripts (a, b) with a ∈ S1 and b ∈ S2, where the sets S1 and S2 vary depending on the structure of the template we create. To illustrate this point, consider the following feature function

f_j(y_{i-1}, y_i, x, i) = A_a(x, i) \, B_b(y_{i-1}, y_i, i)

where

A_a(x, i) = I(POS(i) = a)

and

B_b(y_{i-1}, y_i, i) = I(y_{i-1} = b)

for each a ∈ S1 and b ∈ S2, where S1 is a certain subset of all POS tags and S2 is a certain subset of all punctuation tags. For this single template we use several different pairs of subscript sets (S1, S2), each of which is crossed to give |S1||S2| low-level feature functions. For example, for the template defined above and one pair of subscript clusters (S1, S2), the POS tags CC, NN, NNP, PRP, and RB are some of the elements of S1, and the punctuation tags COMMA, SPACE, and COLON are elements of S2. Using this template, we extract one low-level feature function for each pair (a, b), i.e. |S1||S2| in total.

Another template we use can be described as follows:

A_a(x, i) = I((POS(i-1), POS(i)) = a)

and

B_b(y_{i-1}, y_i, i) = I((y_{i-1}, y_i) = b)

for each a ∈ S1 and b ∈ S2, where S1 is a certain subset of all POS tag pairs and S2 is a certain subset of all punctuation tag pairs. Similar to the above, we use this template with several different pairs of subscript sets (S1, S2). For instance, for the template defined above and one pair of subscript clusters (S1, S2), the POS tag pairs (NN, NN), (NN, NNP), (NN, PRP), (PRP, NN), and (NNP, NNP) are some of the elements of S1, and the punctuation tag pairs (COMMA, COMMA), (COMMA, SPACE), (SPACE, COMMA), and (SPACE, SPACE) are example elements of the subscript cluster S2.

In addition to the two templates above, we also use the following more specific template

A_a(x, i) = I(POS(1) = a)

and

B_b(y_{i-1}, y_i, i) = I((y_{i-1}, y_i) = b)

where one example of S1 is the closed class of question words, corresponding to the POS tags WP, WRB, and MD, and the corresponding subscript set S2 consists of the single punctuation tag pair (QUESTION_MARK, STOP).

In this manner we create different feature function templates, and for each template we define template-dependent clusters of subscript-set pairs (S1, S2), the crossing of which, as described above, generates |S1||S2| low-level feature functions. There are 36 possible POS tags assigned by the POS tagger (listed in the Appendix), and hence 1296 different POS tag pairs. A naive approach to selecting the subscript-set pairs (S1, S2) would be to choose S1 as the set of all POS tag pairs and S2 as the set of all punctuation tag pairs in the second template defined above. This would generate 36 × 36 × 6 × 6 = 46656 low-level feature functions, most of which would be useless. Therefore, to reduce the computational complexity, we select only those subscript-set pairs (S1, S2) whose crossings are more likely to capture the correlation between POS tags or individual words and punctuation tags.
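The following sketch shows how low-level feature functions could be generated by crossing a pair of subscript sets (S1, S2) for the first template above; the subsets shown, and the pos_of helper that returns the POS tag of the i-th word, are illustrative assumptions rather than the exact clusters used in the experiments.

def make_template1_fns(S1, S2, pos_of):
    """Cross S1 x S2 into |S1||S2| low-level feature functions of the form
    f_{a,b}(y_prev, y_cur, x, i) = I(POS(i) = a) * I(y_prev = b)."""
    fns = []
    for a in S1:
        for b in S2:
            # Default arguments bind the current a, b to each generated function.
            def f_ab(y_prev, y_cur, x, i, a=a, b=b):
                return float(pos_of(x, i) == a and y_prev == b)
            fns.append(f_ab)
    return fns

# Illustrative subscript clusters from the text; x is assumed to be a list of
# (word, pos_tag) pairs, and positions are 1-based as in the paper.
low_level_fns = make_template1_fns(
    S1=["CC", "NN", "NNP", "PRP", "RB"],
    S2=["COMMA", "SPACE", "COLON"],
    pos_of=lambda x, i: x[i - 1][1] if 1 <= i <= len(x) else None)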

B. Collins Perceptron Design

In order to train the Collins perceptron, we divide the training set into two parts: 75% of the training samples are used for training the algorithm and the remaining 25% are used as a validation set. The Collins perceptron algorithm used in this project has two stopping conditions: the first is a validation accuracy check, so that if the validation accuracy decreases between two consecutive epochs the algorithm stops training; the second is a preset maximum number of epochs. The implemented procedure is given in Algorithm 1 below; the Viterbi algorithm it uses is described explicitly in Section III.A. The algorithm returns a weight value w_j for each feature function F_j(x, y). After the training step is completed, we use these weights to compute the accuracy on the test set using the same Viterbi algorithm, which returns the most probable tag sequence for x once w is fixed.


while epoch number < max epochs AND old validation accuracy < new validation accuracy do
    for each training sample (x, y) do
        G ← compute g_i matrices;
        U ← compute U matrix;
        ŷ ← Viterbi(U, G);
        for j = 1 to number of feature functions do
            w_j ← w_j + F_j(x, y) − F_j(x, ŷ);
        end
    end
    epoch number ← epoch number + 1;
    compute accuracy on validation set;
    update old and new validation accuracy;
end
Algorithm 1: Collins Perceptron Algorithm

C. Stochastic Gradient Ascent Design

Similar to the design of the Collins perceptron, the training set is divided into two parts: 75% of the training samples are used for training the algorithm and the remaining 25% are used as a validation set. To employ an early stopping condition that prevents overfitting, we select a fixed number of epochs. Therefore, the validation set is only used to extract information about the optimal values of λ and µ. For the stochastic gradient algorithm we conduct a grid search over the values µ ∈ {10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 0} and λ ∈ {10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 0}. With all the feature functions and the grid search loops, the computational time for finding the optimal λ and µ values is too costly, so we run the grid search in parallel using the parfor command in Matlab. The procedure used to train with stochastic gradient ascent is presented in Algorithm 2 below.

while epoch number < max epochs do
    for each training sample (x, y) do
        G ← compute g_i matrices;
        α ← compute forward vectors;
        β ← compute backward vectors;
        for j = 1 to number of feature functions do
            E ← E_{y' ∼ p(y'|x;w)}[F_j(x, y')];
            w_j ← w_j + λ F_j(x, y) − λ E;
        end
    end
    epoch number ← epoch number + 1;
end
Algorithm 2: Stochastic Gradient Ascent Algorithm

As given in Algorithm 2, for the stochastic gradient algorithm we calculate the G matrices and the forward and backward vectors for each training sample, and compute the expected value of each feature function given x.
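The grid search itself can be sketched as a simple nested loop (the authors parallelized it with Matlab's parfor; the train_fn and eval_fn callables here are placeholders standing in for the SGA training and validation-accuracy code):

def grid_search(train_fn, eval_fn,
                mus=(1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.0),
                lambdas=(1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.0)):
    """Return the (mu, lambda) pair with the highest validation accuracy.
    train_fn(mu, lam) must return a weight vector and eval_fn(w) its accuracy."""
    best_mu, best_lam, best_acc = None, None, -1.0
    for mu in mus:
        for lam in lambdas:
            acc = eval_fn(train_fn(mu, lam))
            if acc > best_acc:
                best_mu, best_lam, best_acc = mu, lam, acc
    return best_mu, best_lam, best_acc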

VI. EXPERIMENTAL RESULTS & DISCUSSION

A. Stochastic Gradient Ascent

The grid search yields the best learning rate λ = 10^{-3}. The value of the regularization parameter does not affect the accuracy of the SGA algorithm on the validation set; therefore, we select µ arbitrarily as one of the values in the grid search vector, namely µ = 10^{-1}. Figure 3 confirms that the optimal value of the learning rate λ is 10^{-3}, because the averaged LCL is at its maximum when λ = 10^{-3}, where the averaged LCL is defined as the LCL divided by the number of samples in the validation set. The value of λ that maximizes the averaged LCL obviously maximizes the LCL as well.

Fig. 3: Averaged LCL value versus learning rate, µ =10−1

After running the stochastic gradient ascent algorithm, we obtain satisfactory results in terms of LCL and accuracy behavior. For the best combination of hyper-parameters and epoch conditions, the maximum accuracy obtained on the validation set is 93.12%. The feature function weights w_j corresponding to the best configuration are then evaluated on the test set, where an accuracy of 92.85% is achieved.


While experimenting on the validation set, we noticed that we should employ an early stopping condition to prevent overfitting and performance loss. After the first epoch, the averaged LCL on the validation set is −18.01. This is shown in Figure 4.

Fig. 4: Averaged LCL versus number of epochs, λ = 10^{-3}

Figure 4 shows the number of epochs versus the value of the normalized LCL on the validation set. After the first epoch, the averaged LCL value that we are trying to maximize starts decreasing. This confirms that using early stopping instead of regularization is sensible. Furthermore, Figure 5 shows that the accuracy on the validation set decreases after the best accuracy of 93.12%, which is attained after the first epoch. This behavior also supports the observation made above: the SGA algorithm effectively finishes training in less than one epoch, and epochs after the first one may cause overfitting.

Fig. 5: Accuracy versus number of epochs on thevalidation set, λ = 10−3

Figure 6 shows the accuracy on the validation set versus the value of the learning rate λ. We observe that the best accuracy is obtained at the optimal value λ = 10^{-3}: the accuracy increases during the grid search until the optimum is reached, and it starts to decrease immediately for values larger than 10^{-3}.

Fig. 6: Averaged LCL versus learning rate, µ = 10−1

By the observations given above, we achieve our best results for the SGA algorithm with λ = 10^{-3} and an early stopping condition that terminates training after one epoch. We report the final tag-level accuracy of the SGA algorithm on the validation and test sets. These results are obtained by comparing each true tag sequence y with the sequence ŷ returned by the Viterbi algorithm; we then count the number of correct tag predictions and present the per-tag accuracies in Table I.

Tag            Validation                 Test
               Accuracy    Tag Count      Accuracy    Tag Count
Space          0.9727      145015         0.9673      235536
Period         0.9848      14315          0.9868      22492
Q. Mark        0.2813      2726           0.3982      4792
Comma          0.3485      7135           0.3884      10665
Colon          0.01        321            0           1174
Exc. Point     0           640            0           1066

TABLE I: Tag prediction accuracy on the validationand the test set using stochastic gradient ascent

As can be observed from our experimental results in Table I, training with SGA fails to predict almost every COLON and EXCLAMATION_POINT tag. Our first explanation for this result is that the feature functions used in our implementation are too weak to capture the patterns that these two punctuation tags form. Another explanation is that the frequencies of these two tags in the data are not sufficient for the learning algorithm to train the model.

B. Collins Perceptron

The Collins perceptron algorithm converges immediately after the first epoch, achieving an accuracy of 92.01% on the validation set. The feature function weights w_j computed by the algorithm are then used on the test set, resulting in 91.64% accuracy. Figure 7 shows the validation set accuracy, which does not change as we increase the number of epochs. This means that learning is completed in less than one epoch, during the first training cycle. Therefore, after one epoch the weights are no longer updated, and we get the same results regardless of the number of epochs.

Fig. 7: Accuracy on the validation set versus the number of epochs; learning is completed in less than one epoch, thus the w_j are not updated and we get the same accuracy for each number of epochs

Using the final values of the feature function weights from the Collins perceptron algorithm, we also compute the per-tag punctuation accuracies and present them in Table II.

Tag            Validation                 Test
               Accuracy    Tag Count      Accuracy    Tag Count
Space          0.9603      145015         0.9582      235536
Period         0.9881      14315          0.9798      22492
Q. Mark        0.2764      2726           0.2780      4792
Comma          0.3321      7135           0.3384      10665
Colon          0           321            0           1174
Exc. Point     0           640            0           1066

TABLE II: Tag prediction accuracy on the validation and the test set using the Collins perceptron

VII. CONCLUSIONS

In this study, we perform supervised learning to predict the correct punctuation of a given sentence. For this task, we use Conditional Random Fields (CRFs), which can be considered a general model of supervised learning. More specifically, we use linear-chain CRFs trained via two different optimization methods, namely the Collins perceptron and stochastic gradient ascent. The main difference between them is that SGA takes more time to run in practice even though the asymptotic time complexities are the same. This is because the Collins perceptron approximates the expected value required for the update rule, whereas SGA computes the exact value. Nevertheless, SGA results in better accuracy than the Collins perceptron. This can be interpreted as a trade-off between accuracy and practical computation time.

In training with SGA, we perform a grid search on the validation set, extracted from the training data, to find the optimal values of the hyper-parameters. Furthermore, we employ both early stopping and L2 regularization in order to prevent over-fitting. We also conduct a line search for the best learning rate under the early stopping condition, and we verify that the optimal value obtained this way agrees with the one found in the grid search.

For the Collins perceptron algorithm, the same feature functions are used. In this model the learning rate is fixed to 1. We use the accuracy on the validation set as our stopping condition, as described in the experimental design section.

In order to improve the running time, we take advantage of sparsity by determining the training examples for which A_a(x, i) is always zero, regardless of the B_b functions used to generate the low-level feature functions. We perform this task during the pre-computation of the feature functions F_j(x, y) over the entire training data. This causes no extra computation time, since each F_j(x, y) is computed during the training process anyway.

We obtain 91.64% overall tag accuracy on the test set using the Collins perceptron method, whereas SGA achieves 92.85%. The overall tag accuracy is not very informative, since the number of matching SPACE and PERIOD tags dominates this rate. Therefore, we also calculate per-tag accuracies on the test set and conclude that SGA yields better per-tag accuracies as well.


REFERENCES

[1] Elkan, Charles, ’Log-linear models and conditional randomfields’, Access: 02/12/2014.

[2] Elkan, Charles, ’Conditional Random Fields for Word Hy-phenation’, Access: 02/12/2014.

[3] Elkan, Charles, ’Maximum Likelihood, Logistic Regression,and Stochastic Gradient Training’, Access: 02/12/2014.

[4] POS tagging Abbreviations, http://www.computing.dcu.ie/~acahill/tagset.html, Access: 02/12/2014.

VIII. APPENDIX

A. Stanford NLP POS Tagger Abbreviations

POS Tag   Description
CC        Coordinating conjunction, e.g. and, but, or
CD        Cardinal number
DT        Determiner
EX        Existential there
FW        Foreign word
IN        Preposition or subordinating conjunction
JJ        Adjective
JJR       Adjective, comparative
JJS       Adjective, superlative
LS        List item marker
MD        Modal, e.g. can, could, might, may
NN        Noun, singular or mass
NNS       Noun, plural
NNP       Proper noun, singular
NNPS      Proper noun, plural
PDT       Predeterminer, e.g. all, both
POS       Possessive ending, e.g. nouns ending in 's
PRP       Personal pronoun, e.g. I, me, you, he
PRP$      Possessive pronoun, e.g. my, your, mine, yours
RB        Adverb
RBR       Adverb, comparative
RBS       Adverb, superlative
RP        Particle
SYM       Symbol
TO        to
UH        Interjection, e.g. uh, well, yes
VB        Verb, base form
VBD       Verb, past tense
VBG       Verb, gerund or present participle
VBN       Verb, past participle
VBP       Verb, non-3rd person singular present
VBZ       Verb, 3rd person singular present
WDT       Wh-determiner, e.g. which, that
WP        Wh-pronoun, e.g. what, who, whom
WP$       Possessive wh-pronoun
WRB       Wh-adverb, e.g. how, where, why

TABLE III: Stanford NLP Group POS Tagger Ab-breviations [2]
