\documentclass[11pt]{article}
\usepackage[utf8]{inputenc}
\usepackage{amsmath,amssymb}
\usepackage{booktabs}
\usepackage{hyperref}

\begin{document}
|
|
\section*{Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data}

\noindent\textbf{John Lafferty}$^{\dagger\ast}$ \hfill \textsc{lafferty@cs.cmu.edu}\\
\textbf{Andrew McCallum}$^{\ast\dagger}$ \hfill \textsc{mccallum@whizbang.com}\\
\textbf{Fernando Pereira}$^{\ast\ddagger}$ \hfill \textsc{fpereira@whizbang.com}\\[4pt]
$^{\ast}$WhizBang! Labs--Research, 4616 Henry Street, Pittsburgh, PA 15213 USA\\
$^{\dagger}$School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA\\
$^{\ddagger}$Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104 USA
|
|
|
|
|
|
\textbf{Abstract}

We present \emph{conditional random fields}, a framework for building probabilistic models to segment and label sequence data. Conditional random fields offer several advantages over hidden Markov models and stochastic grammars for such tasks, including the ability to relax strong independence assumptions made in those models. Conditional random fields also avoid a fundamental limitation of maximum entropy Markov models (MEMMs) and other discriminative Markov models based on directed graphical models, which can be biased towards states with few successor states. We present iterative parameter estimation algorithms for conditional random fields and compare the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data.
|
|
|
|
\textbf{1. Introduction}

The need to segment and label sequences arises in many different problems in several scientific fields. Hidden Markov models (HMMs) and stochastic grammars are well understood and widely used probabilistic models for such problems. In computational biology, HMMs and stochastic grammars have been successfully used to align biological sequences, find sequences homologous to a known evolutionary family, and analyze RNA secondary structure (Durbin et al., 1998). In computational linguistics and computer science, HMMs and stochastic grammars have been applied to a wide variety of problems in text and speech processing, including topic segmentation, part-of-speech (POS) tagging, information extraction, and syntactic disambiguation (Manning \& Schütze, 1999).
|
|
|
|
HMMs and stochastic grammars are generative models, assigning a joint probability to paired observation and label sequences; the parameters are typically trained to maximize the joint likelihood of training examples. To define a joint probability over observation and label sequences, a generative model needs to enumerate all possible observation sequences, typically requiring a representation in which observations are task-appropriate atomic entities, such as words or nucleotides. In particular, it is not practical to represent multiple interacting features or long-range dependencies of the observations, since the inference problem for such models is intractable.
|
|
|
|
This difficulty is one of the main motivations for looking at conditional models as an alternative. A conditional model specifies the probabilities of possible label sequences given an observation sequence. Therefore, it does not expend modeling effort on the observations, which at test time are fixed anyway. Furthermore, the conditional probability of the label sequence can depend on arbitrary, non-independent features of the observation sequence without forcing the model to account for the distribution of those dependencies. The chosen features may represent attributes at different levels of granularity of the same observations (for example, words and characters in English text), or aggregate properties of the observation sequence (for instance, text layout). The probability of a transition between labels may depend not only on the current observation, but also on past and future observations, if available. In contrast, generative models must make very strict independence assumptions on the observations, for instance conditional independence given the labels, to achieve tractability.
|
|
|
|
Maximum entropy Markov models (MEMMs) are conditional probabilistic sequence models that attain all of the above advantages (McCallum et al., 2000). In MEMMs, each source state\footnote{Output labels are associated with states; it is possible for several states to have the same label, but for simplicity in the rest of this paper we assume a one-to-one correspondence.} has an exponential model that takes the observation features as input, and outputs a distribution over possible next states. These exponential models are trained by an appropriate iterative scaling method in the maximum entropy framework. Previously published experimental results show MEMMs increasing recall and doubling precision relative to HMMs in a FAQ segmentation task.
|
|
|
|
MEMMs and other non-generative finite-state models based on next-state classifiers, such as discriminative Markov models (Bottou, 1991), share a weakness we call here the \emph{label bias problem}: the transitions leaving a given state compete only against each other, rather than against all other transitions in the model. In probabilistic terms, transition scores are the conditional probabilities of possible next states given the current state and the observation sequence. This per-state normalization of transition scores implies a ``conservation of score mass'' (Bottou, 1991) whereby all the mass that arrives at a state must be distributed among the possible successor states. An observation can affect which destination states get the mass, but not how much total mass to pass on. This causes a bias toward states with fewer outgoing transitions. In the extreme case, a state with a single outgoing transition effectively ignores the observation. In those cases, unlike in HMMs, Viterbi decoding cannot downgrade a branch based on observations after the branch point, and models with state-transition structures that have sparsely connected chains of states are not properly handled. The Markovian assumptions in MEMMs and similar state-conditional models insulate decisions at one state from future decisions in a way that does not match the actual dependencies between consecutive states.
|
|
|
|
This paper introduces \emph{conditional random fields} (CRFs), a sequence modeling framework that has all the advantages of MEMMs but also solves the label bias problem in a principled way. The critical difference between CRFs and MEMMs is that a MEMM uses per-state exponential models for the conditional probabilities of next states given the current state, while a CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence. Therefore, the weights of different features at different states can be traded off against each other.
|
|
|
|
We can also think of a CRF as a finite state model with unnormalized transition probabilities. However, unlike some other weighted finite-state approaches (LeCun et al., 1998), CRFs assign a well-defined probability distribution over possible labelings, trained by maximum likelihood or MAP estimation. Furthermore, the loss function is convex,\footnote{In the case of fully observable states, as we are discussing here; if several states have the same label, the usual local maxima of Baum--Welch arise.} guaranteeing convergence to the global optimum. CRFs also generalize easily to analogues of stochastic context-free grammars that would be useful in such problems as RNA secondary structure prediction and natural language processing.
|
|
|
|
|
|
\emph{Figure 1.} Label bias example, after (Bottou, 1991). For conciseness, we place observation--label pairs $o:l$ on transitions rather than states; a special symbol represents the null output label.
|
|
|
|
We present the model, describe two training procedures, and sketch a proof of convergence. We also give experimental results on synthetic data showing that CRFs solve the classical version of the label bias problem, and, more significantly, that CRFs perform better than HMMs and MEMMs when the true data distribution has higher-order dependencies than the model, as is often the case in practice. Finally, we confirm these results as well as the claimed advantages of conditional models by evaluating HMMs, MEMMs and CRFs with identical state structure on a part-of-speech tagging task.
|
|
|
|
\textbf{2. The Label Bias Problem}

Classical probabilistic automata (Paz, 1971), discriminative Markov models (Bottou, 1991), maximum entropy taggers (Ratnaparkhi, 1996), and MEMMs, as well as non-probabilistic sequence tagging and segmentation models with independently trained next-state classifiers (Punyakanok \& Roth, 2001), are all potential victims of the label bias problem.
|
|
|
|
For example, Figure 1 represents a simple finite-state model designed to distinguish between the two words \emph{rib} and \emph{rob}. Suppose that the observation sequence is \emph{r i b}. In the first time step, \emph{r} matches both transitions from the start state, so the probability mass gets distributed roughly equally among those two transitions. Next we observe \emph{i}. Both states 1 and 4 have only one outgoing transition. State 1 has seen this observation often in training; state 4 has almost never seen this observation; but like state 1, state 4 has no choice but to pass all its mass to its single outgoing transition, since it is not generating the observation, only conditioning on it. Thus, states with a single outgoing transition effectively ignore their observations. More generally, states with low-entropy next-state distributions will take little notice of observations. Returning to the example, the top path and the bottom path will be about equally likely, independently of the observation sequence. If one of the two words is slightly more common in the training set, the transitions out of the start state will slightly prefer its corresponding transition, and that word's state sequence will always win. This behavior is demonstrated experimentally in Section 5.
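The effect just described can be checked with a few lines of code. The sketch below is a toy version of the rib/rob automaton of Figure 1 with per-state normalized transition scores, as in a MEMM; all the numbers are made up for illustration. Because states 1 and 4 each have a single successor, the observation cannot change how much mass they pass on.

```python
# Toy rib/rob automaton with per-state normalized transitions (MEMM-style).
# States 1 and 4 have one successor each, so their rows are forced to 1.0
# for every observation: the observation 'i' is effectively ignored.
def memm_path_prob(obs, path, trans):
    p, state = 1.0, 0
    for o, nxt in zip(obs, path):
        p *= trans[state][o][nxt]   # per-state normalized: each row sums to 1
        state = nxt
    return p

trans = {
    0: {'r': {1: 0.5, 4: 0.5}},           # start: 'r' splits mass evenly
    1: {'i': {2: 1.0}, 'o': {2: 1.0}},    # single successor: obs ignored
    4: {'i': {5: 1.0}, 'o': {5: 1.0}},    # same, even for the "wrong" obs
    2: {'b': {3: 1.0}},
    5: {'b': {6: 1.0}},
}

obs = ['r', 'i', 'b']
rib_path, rob_path = [1, 2, 3], [4, 5, 6]
print(memm_path_prob(obs, rib_path, trans))   # 0.5
print(memm_path_prob(obs, rob_path, trans))   # 0.5: 'i' never mattered
```

Both paths end up with the same score even though only one is consistent with the middle observation, which is exactly the label bias behavior described above.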
|
|
|
|
Léon Bottou (1991) discussed two solutions for the label bias problem. One is to change the state-transition structure of the model. In the above example we could collapse states 1 and 4, and delay the branching until we get a discriminating observation. This operation is a special case of determinization (Mohri, 1997), but determinization of weighted finite-state machines is not always possible, and even when possible, it may lead to combinatorial explosion. The other solution mentioned is to start with a fully-connected model and let the training procedure figure out a good structure. But that would preclude the use of prior structural knowledge that has proven so valuable in information extraction tasks (Freitag \& McCallum, 2000).
|
|
|
|
Proper solutions require models that account for whole state sequences at once by letting some transitions ``vote'' more strongly than others depending on the corresponding observations. This implies that score mass will not be conserved: individual transitions can ``amplify'' or ``dampen'' the mass they receive. In the above example, the transitions from the start state would have a very weak effect on path score, while the transitions from states 1 and 4 would have much stronger effects, amplifying or damping depending on the actual observation, and a proportionally higher contribution to the selection of the Viterbi path.$^{3}$
|
|
|
|
In the related work section we discuss other heuristic model classes that account for state sequences globally rather than locally. To the best of our knowledge, CRFs are the only model class that does this in a purely probabilistic setting, with guaranteed global maximum likelihood convergence.
|
|
|
|
\textbf{3. Conditional Random Fields}

In what follows, $\mathbf{X}$ is a random variable over data sequences to be labeled, and $\mathbf{Y}$ is a random variable over corresponding label sequences. All components $\mathbf{Y}_i$ of $\mathbf{Y}$ are assumed to range over a finite label alphabet $\mathcal{Y}$. For example, $\mathbf{X}$ might range over natural language sentences and $\mathbf{Y}$ range over part-of-speech taggings of those sentences, with $\mathcal{Y}$ the set of possible part-of-speech tags. The random variables $\mathbf{X}$ and $\mathbf{Y}$ are jointly distributed, but in a discriminative framework we construct a conditional model $p(\mathbf{Y} \mid \mathbf{X})$ from paired observation and label sequences, and do not explicitly model the marginal $p(\mathbf{X})$.
|
|
\textbf{Definition.} \emph{Let $G = (V, E)$ be a graph such that $\mathbf{Y} = (\mathbf{Y}_v)_{v \in V}$, so that $\mathbf{Y}$ is indexed by the vertices of $G$. Then $(\mathbf{X}, \mathbf{Y})$ is a conditional random field in case, when conditioned on $\mathbf{X}$, the random variables $\mathbf{Y}_v$ obey the Markov property with respect to the graph: $p(\mathbf{Y}_v \mid \mathbf{X}, \mathbf{Y}_w, w \neq v) = p(\mathbf{Y}_v \mid \mathbf{X}, \mathbf{Y}_w, w \sim v)$, where $w \sim v$ means that $w$ and $v$ are neighbors in $G$.}
|
|
Thus, a CRF is a random field globally conditioned on the observation $\mathbf{X}$. Throughout the paper we tacitly assume that the graph $G$ is fixed. In the simplest and most important example for modeling sequences, $G$ is a simple chain or line: $G = (V = \{1, 2, \ldots, m\},\; E = \{(i, i+1)\})$. $\mathbf{X}$ may also have a natural graph structure; yet in general it is not necessary to assume that $\mathbf{X}$ and $\mathbf{Y}$ have the same graphical structure, or even that $\mathbf{X}$ has any graphical structure at all. However, in this paper we will be most concerned with sequences $\mathbf{X} = (\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n)$ and $\mathbf{Y} = (\mathbf{Y}_1, \mathbf{Y}_2, \ldots, \mathbf{Y}_n)$.

$^{3}$Weighted determinization and minimization techniques shift transition weights while preserving overall path weight (Mohri, 2000); their connection to this discussion deserves further study.
|
|
If the graph $G = (V, E)$ of $\mathbf{Y}$ is a tree (of which a chain is the simplest example), its cliques are the edges and vertices. Therefore, by the fundamental theorem of random fields (Hammersley \& Clifford, 1971), the joint distribution over the label sequence $\mathbf{Y}$ given $\mathbf{X}$ has the form given in equation (1).
|
|
|
|
|
|
As a particular case, we can construct an HMM-like CRF by defining one feature for each state pair $(y', y)$, and one feature for each state--observation pair $(y, x)$:
\begin{align*}
f_{y',y}(\langle u, v \rangle, \mathbf{y}|_{\langle u, v \rangle}, \mathbf{x}) &= \delta(\mathbf{y}_u, y')\,\delta(\mathbf{y}_v, y)\\
g_{y,x}(v, \mathbf{y}|_v, \mathbf{x}) &= \delta(\mathbf{y}_v, y)\,\delta(\mathbf{x}_v, x)\,.
\end{align*}
The corresponding parameters $\lambda_{y',y}$ and $\mu_{y,x}$ play a similar role to the (logarithms of the) usual HMM parameters $p(y' \mid y)$ and $p(x \mid y)$. Boltzmann chain models (Saul \& Jordan, 1996; MacKay, 1996) have a similar form but use a single normalization constant to yield a joint distribution, whereas CRFs use the observation-dependent normalization $Z(\mathbf{x})$ for conditional distributions.
|
|
|
|
Although it encompasses HMM-like models, the class of conditional random fields is much more expressive, because it allows arbitrary dependencies on the observation sequence.
\begin{equation}
p_\theta(\mathbf{y} \mid \mathbf{x}) \;\propto\; \exp\!\left( \sum_{e \in E,\, k} \lambda_k\, f_k(e, \mathbf{y}|_e, \mathbf{x}) \;+\; \sum_{v \in V,\, k} \mu_k\, g_k(v, \mathbf{y}|_v, \mathbf{x}) \right),
\end{equation}
where $\mathbf{x}$ is a data sequence, $\mathbf{y}$ a label sequence, and $\mathbf{y}|_S$ is the set of components of $\mathbf{y}$ associated with the vertices in subgraph $S$.
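Equation (1) can be made concrete with a brute-force sketch. The example below uses a three-position chain with one hand-picked edge feature and one vertex feature; every name, weight, and label in it is illustrative, not from the paper.

```python
# Minimal sketch of equation (1): unnormalized score = exp(sum of weighted
# edge and vertex features), normalized over all label sequences.
import math
from itertools import product

LABELS = ('A', 'B')
LAM, MU = 0.7, 1.3                     # toy weights lambda_1 and mu_1

def edge_feat(y_prev, y_cur):          # f: fires on the label pair (A, B)
    return 1.0 if (y_prev, y_cur) == ('A', 'B') else 0.0

def vertex_feat(y, x):                 # g: fires when label matches observation
    return 1.0 if y == x else 0.0

def score(y, x):                       # unnormalized exp(...) of equation (1)
    s = sum(LAM * edge_feat(y[i - 1], y[i]) for i in range(1, len(y)))
    s += sum(MU * vertex_feat(y[i], x[i]) for i in range(len(y)))
    return math.exp(s)

def p(y, x):                           # normalize by brute-force Z(x)
    Z = sum(score(cand, x) for cand in product(LABELS, repeat=len(x)))
    return score(y, x) / Z

x = ('A', 'B', 'B')
probs = {y: p(y, x) for y in product(LABELS, repeat=3)}
print(max(probs, key=probs.get))       # ('A', 'B', 'B')
```

Note that the normalizer depends on the observation sequence $\mathbf{x}$, which is exactly the observation-dependent $Z(\mathbf{x})$ contrasted with Boltzmann chain models above.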
|
|
|
|
We assume that the \emph{features} $f_k$ and $g_k$ are given and fixed. For example, a Boolean vertex feature $g_k$ might be true if the word $\mathbf{X}_i$ is upper case and the tag $\mathbf{Y}_i$ is ``proper noun.''

The parameter estimation problem is to determine the parameters $\theta = (\lambda_1, \lambda_2, \ldots; \mu_1, \mu_2, \ldots)$ from training data $\mathcal{D} = \{(\mathbf{x}^{(i)}, \mathbf{y}^{(i)})\}_{i=1}^{N}$ with empirical distribution $\widetilde{p}(\mathbf{x}, \mathbf{y})$.
|
|
In Section 4 we describe an iterative scaling algorithm that maximizes the log-likelihood objective function $\mathcal{O}(\theta)$:
\[
\mathcal{O}(\theta) \;=\; \sum_{i=1}^{N} \log p_\theta(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)})
\;\propto\; \sum_{\mathbf{x}, \mathbf{y}} \widetilde{p}(\mathbf{x}, \mathbf{y}) \log p_\theta(\mathbf{y} \mid \mathbf{x})\,.
\]
|
|
|
|
\emph{Figure 2.} Graphical structures of simple HMMs (left), MEMMs (center), and the chain-structured case of CRFs (right) for sequences. An open circle indicates that the variable is not generated by the model.
|
|
|
|
|
|
In addition, the features do not need to completely specify a state or observation, so one might expect that the model can be estimated from less training data. Another attractive property is the convexity of the loss function; indeed, CRFs share all of the convexity properties of general maximum entropy models.

For the remainder of the paper we assume that the dependencies of $\mathbf{Y}$, conditioned on $\mathbf{X}$, form a chain. To simplify some expressions, we add special start and stop states $\mathbf{Y}_0 = \texttt{start}$ and $\mathbf{Y}_{n+1} = \texttt{stop}$. Thus, we will be using the graphical structure shown in Figure 2. For a chain structure, the conditional probability of a label sequence can be expressed concisely in matrix form, which will be useful in describing the parameter estimation and inference algorithms in Section 4. Suppose that $p_\theta(\mathbf{Y} \mid \mathbf{X})$ is a CRF given by (1). For each position $i$ in the observation sequence $\mathbf{x}$, we define the $|\mathcal{Y}| \times |\mathcal{Y}|$ matrix random variable $M_i(\mathbf{x}) = [M_i(y', y \mid \mathbf{x})]$ by
\begin{align*}
M_i(y', y \mid \mathbf{x}) &= \exp\bigl(\Lambda_i(y', y \mid \mathbf{x})\bigr)\\
\Lambda_i(y', y \mid \mathbf{x}) &= \sum_k \lambda_k\, f_k(e_i, \mathbf{Y}|_{e_i} = (y', y), \mathbf{x}) + \sum_k \mu_k\, g_k(v_i, \mathbf{Y}|_{v_i} = y, \mathbf{x})\,,
\end{align*}
where $e_i$ is the edge with labels $(\mathbf{Y}_{i-1}, \mathbf{Y}_i)$ and $v_i$ is the vertex with label $\mathbf{Y}_i$. In contrast to generative models, conditional models like CRFs do not need to enumerate over all possible observation sequences $\mathbf{x}$, and therefore these matrices can be computed directly as needed from a given training or test observation sequence $\mathbf{x}$ and the parameter vector $\theta$. Then the normalization (partition function) $Z_\theta(\mathbf{x})$ is the $(\texttt{start}, \texttt{stop})$ entry of the product of these matrices:
\[
Z_\theta(\mathbf{x}) = \bigl( M_1(\mathbf{x})\, M_2(\mathbf{x}) \cdots M_{n+1}(\mathbf{x}) \bigr)_{\texttt{start},\, \texttt{stop}}\,.
\]
Using this notation, the conditional probability of a label sequence $\mathbf{y}$ is written as
\[
p_\theta(\mathbf{y} \mid \mathbf{x}) = \frac{\prod_{i=1}^{n+1} M_i(\mathbf{y}_{i-1}, \mathbf{y}_i \mid \mathbf{x})}{\Bigl( \prod_{i=1}^{n+1} M_i(\mathbf{x}) \Bigr)_{\texttt{start},\, \texttt{stop}}}\,,
\]
where $\mathbf{y}_0 = \texttt{start}$ and $\mathbf{y}_{n+1} = \texttt{stop}$.

\textbf{4. Parameter Estimation for CRFs}

We now describe two iterative scaling algorithms to find the parameter vector $\theta$ that maximizes the log-likelihood of the training data. Both algorithms are based on the improved iterative scaling (IIS) algorithm of Della Pietra et al. (1997); the proof technique based on auxiliary functions can be extended to show convergence of the algorithms for CRFs.

Iterative scaling algorithms update the weights as $\lambda_k \leftarrow \lambda_k + \delta\lambda_k$ and $\mu_k \leftarrow \mu_k + \delta\mu_k$. In particular, the IIS update $\delta\lambda_k$ for an edge feature $f_k$ is the solution of
\begin{align*}
\widetilde{E}[f_k] \;&\overset{\mathrm{def}}{=}\; \sum_{\mathbf{x}, \mathbf{y}} \widetilde{p}(\mathbf{x}, \mathbf{y}) \sum_{i=1}^{n+1} f_k(e_i, \mathbf{y}|_{e_i}, \mathbf{x})\\
&=\; \sum_{\mathbf{x}, \mathbf{y}} \widetilde{p}(\mathbf{x})\, p(\mathbf{y} \mid \mathbf{x}) \sum_{i=1}^{n+1} f_k(e_i, \mathbf{y}|_{e_i}, \mathbf{x})\; e^{\delta\lambda_k T(\mathbf{x}, \mathbf{y})}\,,
\end{align*}
where $T(\mathbf{x}, \mathbf{y})$ is the \emph{total feature count}
\[
T(\mathbf{x}, \mathbf{y}) \;\overset{\mathrm{def}}{=}\; \sum_{i,k} f_k(e_i, \mathbf{y}|_{e_i}, \mathbf{x}) + \sum_{i,k} g_k(v_i, \mathbf{y}|_{v_i}, \mathbf{x})\,.
\]
The equations for vertex feature updates $\delta\mu_k$ have similar form.

However, efficiently computing the exponential sums on the right-hand sides of these equations is problematic, because $T(\mathbf{x}, \mathbf{y})$ is a global property of $(\mathbf{x}, \mathbf{y})$, and dynamic programming will sum over sequences with potentially varying $T$. To deal with this, the first algorithm, Algorithm S, uses a ``slack feature.'' The second, Algorithm T, keeps track of partial $T$ totals.

For Algorithm S, we define the \emph{slack feature} by
\[
s(\mathbf{x}, \mathbf{y}) \;\overset{\mathrm{def}}{=}\; S \;-\; \sum_i \sum_k f_k(e_i, \mathbf{y}|_{e_i}, \mathbf{x}) \;-\; \sum_i \sum_k g_k(v_i, \mathbf{y}|_{v_i}, \mathbf{x})\,,
\]
where $S$ is a constant chosen so that $s(\mathbf{x}^{(i)}, \mathbf{y}) \geq 0$ for all $\mathbf{y}$ and all observation vectors $\mathbf{x}^{(i)}$ in the training set, thus making $T(\mathbf{x}, \mathbf{y}) = S$. Feature $s$ is ``global,'' that is, it does not correspond to any particular edge or vertex.
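Before turning to the forward and backward recurrences, the matrix form of $Z_\theta(\mathbf{x})$ above can be checked numerically. In the sketch below, the matrix entries are arbitrary positive numbers standing in for $\exp(\Lambda_i)$, with the start/stop rows and columns zeroed so that only valid paths contribute; the state names and sizes are illustrative.

```python
# Check that the (start, stop) entry of the product M_1 M_2 ... M_{n+1}
# equals a brute-force sum of path weights over interior labelings.
import numpy as np
from itertools import product

STATES = ['start', 'A', 'B', 'stop']
idx = {s: i for i, s in enumerate(STATES)}
n = 3                                       # interior sequence length

rng = np.random.default_rng(0)
Ms = []
for i in range(n + 1):                      # matrices M_1 .. M_{n+1}
    M = rng.uniform(0.5, 2.0, size=(4, 4))  # stand-ins for exp(Lambda_i)
    M[:, idx['start']] = 0.0                # nothing re-enters start
    M[idx['stop'], :] = 0.0                 # nothing leaves stop
    if i > 0:
        M[idx['start'], :] = 0.0            # start row used only in M_1
    if i < n:
        M[:, idx['stop']] = 0.0             # stop column only in M_{n+1}
    Ms.append(M)

# Z_theta(x) as the (start, stop) entry of the matrix product
Z = np.linalg.multi_dot(Ms)[idx['start'], idx['stop']]

# Brute force: sum path weights over all interior labelings y_1 .. y_n
Z_brute = 0.0
for path in product(['A', 'B'], repeat=n):
    full = ['start'] + list(path) + ['stop']
    w = 1.0
    for i in range(n + 1):
        w *= Ms[i][idx[full[i]], idx[full[i + 1]]]
    Z_brute += w

assert abs(Z - Z_brute) < 1e-9 * Z
```

The matrix product costs $O(n\,|\mathcal{Y}|^2)$ arithmetic per sequence, versus the exponentially many terms in the brute-force sum.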
|
|
For each index $i = 0, \ldots, n+1$ we now define the \emph{forward vectors} $\alpha_i(\mathbf{x})$ with base case
\[
\alpha_0(y \mid \mathbf{x}) = \begin{cases} 1 & \text{if } y = \texttt{start} \\ 0 & \text{otherwise} \end{cases}
\]
and recurrence
\[
\alpha_i(\mathbf{x}) = \alpha_{i-1}(\mathbf{x})\, M_i(\mathbf{x})\,.
\]
Similarly, the \emph{backward vectors} $\beta_i(\mathbf{x})$ are defined by
\[
\beta_{n+1}(y \mid \mathbf{x}) = \begin{cases} 1 & \text{if } y = \texttt{stop} \\ 0 & \text{otherwise} \end{cases}
\]
and
\[
\beta_i(\mathbf{x})^{\top} = M_{i+1}(\mathbf{x})\, \beta_{i+1}(\mathbf{x})\,.
\]

With these definitions, the update equations are
\[
\delta\lambda_k = \frac{1}{S} \log \frac{\widetilde{E} f_k}{E f_k}\,, \qquad
\delta\mu_k = \frac{1}{S} \log \frac{\widetilde{E} g_k}{E g_k}\,,
\]
where
\begin{align*}
E f_k &= \sum_{\mathbf{x}} \widetilde{p}(\mathbf{x}) \sum_{i=1}^{n+1} \sum_{y', y} f_k(e_i, \mathbf{y}|_{e_i} = (y', y), \mathbf{x})\; \frac{\alpha_{i-1}(y' \mid \mathbf{x})\, M_i(y', y \mid \mathbf{x})\, \beta_i(y \mid \mathbf{x})}{Z_\theta(\mathbf{x})}\\
E g_k &= \sum_{\mathbf{x}} \widetilde{p}(\mathbf{x}) \sum_{i=1}^{n} \sum_{y} g_k(v_i, \mathbf{y}|_{v_i} = y, \mathbf{x})\; \frac{\alpha_i(y \mid \mathbf{x})\, \beta_i(y \mid \mathbf{x})}{Z_\theta(\mathbf{x})}\,.
\end{align*}
The factors involving the forward and backward vectors in the above equations have the same meaning as for standard hidden Markov models. For example,
\[
p_\theta(\mathbf{Y}_i = y \mid \mathbf{x}) = \frac{\alpha_i(y \mid \mathbf{x})\, \beta_i(y \mid \mathbf{x})}{Z_\theta(\mathbf{x})}
\]
is the marginal probability of label $\mathbf{Y}_i = y$ given that the observation sequence is $\mathbf{x}$. This algorithm is closely related to the algorithm of Darroch and Ratcliff (1972), and MART algorithms used in image reconstruction.

The constant $S$ in Algorithm S can be quite large, since in practice it is proportional to the length of the longest training observation sequence. As a result, the algorithm may converge slowly, taking very small steps toward the maximum in each iteration. If the length of the observations $\mathbf{x}^{(i)}$ and the number of active features varies greatly, a faster-converging algorithm can be obtained by keeping track of feature totals for each observation sequence separately.

Let $T(\mathbf{x}) \overset{\mathrm{def}}{=} \max_{\mathbf{y}} T(\mathbf{x}, \mathbf{y})$. Algorithm T accumulates feature expectations into counters indexed by $T(\mathbf{x})$. More specifically, we use the forward--backward recurrences just introduced to compute the expectations $a_{k,t}$ of feature $f_k$ and $b_{k,t}$ of feature $g_k$ given that $T(\mathbf{x}) = t$. Then our parameter updates are $\delta\lambda_k = \log \beta_k$ and $\delta\mu_k = \log \gamma_k$, where $\beta_k$ and $\gamma_k$ are the unique positive roots of the following polynomial equations
\begin{equation}
\sum_{t=0}^{T_{\max}} a_{k,t}\, \beta_k^{\,t} = \widetilde{E} f_k\,, \qquad
\sum_{t=0}^{T_{\max}} b_{k,t}\, \gamma_k^{\,t} = \widetilde{E} g_k\,, \tag{2}
\end{equation}
which can be easily computed by Newton's method.

A single iteration of Algorithm S and Algorithm T has roughly the same time and space complexity as the well known Baum--Welch algorithm for HMMs. To prove convergence of our algorithms, we can derive an auxiliary function to bound the change in likelihood from below; this method is developed in detail by Della Pietra et al. (1997). The full proof is somewhat detailed; however, here we give an idea of how to derive the auxiliary function. To simplify notation, we assume only edge features $f_k$ with parameters $\lambda_k$.

Given two parameter settings $\theta = (\lambda_1, \lambda_2, \ldots)$ and $\theta' = (\lambda_1 + \delta\lambda_1, \lambda_2 + \delta\lambda_2, \ldots)$, we bound from below the change in the objective function with an \emph{auxiliary function} $A(\theta', \theta)$ as follows:
\begin{align*}
\mathcal{O}(\theta') - \mathcal{O}(\theta) &= \sum_{\mathbf{x}, \mathbf{y}} \widetilde{p}(\mathbf{x}, \mathbf{y}) \log \frac{p_{\theta'}(\mathbf{y} \mid \mathbf{x})}{p_\theta(\mathbf{y} \mid \mathbf{x})}
= (\theta' - \theta) \cdot \widetilde{E} f - \sum_{\mathbf{x}} \widetilde{p}(\mathbf{x}) \log \frac{Z_{\theta'}(\mathbf{x})}{Z_\theta(\mathbf{x})}\\
&\geq (\theta' - \theta) \cdot \widetilde{E} f - \sum_{\mathbf{x}} \widetilde{p}(\mathbf{x})\, \frac{Z_{\theta'}(\mathbf{x})}{Z_\theta(\mathbf{x})}\\
&= \delta\lambda \cdot \widetilde{E} f - \sum_{\mathbf{x}, \mathbf{y}} \widetilde{p}(\mathbf{x})\, p_\theta(\mathbf{y} \mid \mathbf{x})\, e^{\delta\lambda \cdot f(\mathbf{x}, \mathbf{y})}\\
&\geq \delta\lambda \cdot \widetilde{E} f - \sum_{\mathbf{x}, \mathbf{y}, k} \widetilde{p}(\mathbf{x})\, p_\theta(\mathbf{y} \mid \mathbf{x})\, \frac{f_k(\mathbf{x}, \mathbf{y})}{T(\mathbf{x})}\, e^{\delta\lambda_k T(\mathbf{x})}
\;\overset{\mathrm{def}}{=}\; A(\theta', \theta)\,,
\end{align*}
where the inequalities follow from the convexity of $-\log$ and $\exp$. Differentiating $A$ with respect to $\delta\lambda_k$ and setting the result to zero yields equation (2).
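The forward and backward recurrences above can be sketched in a few lines. In the example below the matrix entries are arbitrary positive stand-ins for the CRF potentials (no features or weights are modeled); the point is only that $\alpha_i$ and $\beta_i$ obey the stated recurrences and that the position-wise marginals $\alpha_i(y)\beta_i(y)/Z_\theta(\mathbf{x})$ sum to one.

```python
# Forward recurrence alpha_i = alpha_{i-1} M_i, backward recurrence
# beta_i^T = M_{i+1} beta_{i+1}, and marginals alpha_i * beta_i / Z.
# Matrix entries are arbitrary positive stand-ins for exp(Lambda_i).
import numpy as np

n = 3                                        # positions 1..n, plus start/stop
rng = np.random.default_rng(1)
Ms = [rng.uniform(0.5, 2.0, size=(4, 4)) for _ in range(n + 1)]  # M_1..M_{n+1}

alpha = np.zeros(4)
alpha[0] = 1.0                               # base case: unit mass on start
alphas = [alpha]
for i in range(n + 1):
    alphas.append(alphas[-1] @ Ms[i])        # alpha_i = alpha_{i-1} M_i

beta = np.zeros(4)
beta[3] = 1.0                                # base case: unit mass on stop
betas = [None] * (n + 2)
betas[n + 1] = beta
for i in range(n, -1, -1):
    betas[i] = Ms[i] @ betas[i + 1]          # beta_i^T = M_{i+1} beta_{i+1}

Z = alphas[n + 1][3]                         # (start, stop) total path mass
for i in range(1, n + 1):
    marginal = alphas[i] * betas[i] / Z      # p(Y_i = y | x) over states
    assert abs(marginal.sum() - 1.0) < 1e-9  # marginals sum to one
```

Both passes are linear in the sequence length, which is what makes the expectations $Ef_k$ and $Eg_k$ above computable without enumerating label sequences.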
|
|
|
|
| \textbf{5. Experiments} |
|
|
|
|
| We first discuss two sets of experiments with synthetic data |
| that highlight the differences between CRFs and MEMMs. |
| The first experiments are a direct verification of the label |
| bias problem discussed in Section 2. In the second set of |
| experiments, we generate synthetic data using randomly |
| chosen hidden Markov models, each of which is a mixture of a first-order and second-order model. Competing |
| _first-order_ models are then trained and compared on test |
| data. As the data becomes more second-order, the test error rates of the trained models increase. This experiment |
| corresponds to the common modeling practice of approximating complex local and long-range dependencies, as occur in natural data, by small-order Markov models. Our |
|
|
|
|
| 60 |
|
|
|
|
| 50 |
|
|
|
|
| 40 |
|
|
|
|
| 30 |
|
|
|
|
| 20 |
|
|
|
|
|
|
| 0 10 20 30 40 50 60 |
|
|
| HMM Error |
|
|
|
|
|
|
| 0 10 20 30 40 50 60 |
|
|
| HMM Error |
|
|
|
|
|
|
| 10 |
|
|
|
|
| 0 |
|
|
| 0 10 20 30 40 50 60 |
|
|
| CRF Error |
|
|
|
|
|
|
| 60 |
|
|
|
|
| 50 |
|
|
|
|
| 40 |
|
|
|
|
| 30 |
|
|
|
|
| 20 |
|
|
|
|
| 10 |
|
|
|
|
| 0 |
|
|
|
|
|
|
| 60 |
|
|
|
|
| 50 |
|
|
|
|
| 40 |
|
|
|
|
| 30 |
|
|
|
|
| 20 |
|
|
|
|
| 10 |
|
|
|
|
| 0 |
|
|
|
|
|
|
_Figure 3._ Plots of $2 \times 2$ error rates for HMMs, CRFs, and MEMMs on randomly generated synthetic data sets, as described in Section 5.2. As the data becomes “more second order,” the error rates of the test models increase. As shown in the left plot, the CRF typically significantly outperforms the MEMM. The center plot shows that the HMM outperforms the MEMM. In the right plot, each open square represents a data set with $\alpha < \frac{1}{2}$, and a solid circle indicates a data set with $\alpha \geq \frac{1}{2}$. The plot shows that when the data is mostly second order ($\alpha \geq \frac{1}{2}$), the discriminatively trained CRF typically outperforms the HMM. These experiments are not designed to demonstrate the advantages of the additional representational power of CRFs and MEMMs relative to HMMs.
|
|
|
|
|
|
results clearly indicate that even when the models are parameterized in exactly the same way, CRFs are more robust to inaccurate modeling assumptions than MEMMs or HMMs, and resolve the label bias problem, which affects the performance of MEMMs. To avoid confounding different effects, the MEMMs and CRFs in these experiments
| _do not_ use overlapping features of the observations. Finally, in a set of POS tagging experiments, we confirm the |
| advantage of CRFs over MEMMs. We also show that the |
| addition of overlapping features to CRFs and MEMMs allows them to perform much better than HMMs, as already |
| shown for MEMMs by McCallum et al. (2000). |
|
|
|
|
| \textbf{5.1 Modeling label bias} |
|
|
|
|
| We generate data from a simple HMM which encodes a |
| noisy version of the finite-state network in Figure 1. Each |
state emits its designated symbol with probability $29/32$ and any of the other symbols with probability $1/32$. We
| train both an MEMM and a CRF with the same topologies |
| on the data generated by the HMM. The observation features are simply the identity of the observation symbols. |
| In a typical run using 2 _,_ 000 training and 500 test samples, |
| trained to convergence of the iterative scaling algorithm, |
the CRF error is 4.6\% while the MEMM error is 42\%, showing that the MEMM fails to discriminate between the two branches.
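The data generator above can be sketched in a few lines, assuming the Figure 1 network distinguishes the two words “rib” and “rob” (the paper's label-bias example); the designated-vs-noise emission probabilities are those given in the text:

```python
import random

# Noisy version of the Figure 1 network: each state emits its designated
# symbol with probability 29/32 and each of the three other symbols with
# probability 1/32.
ALPHABET = "riob"

def noisy(symbol, rng):
    """Return `symbol` with probability 29/32, another symbol otherwise."""
    if rng.random() < 29 / 32:
        return symbol
    return rng.choice([s for s in ALPHABET if s != symbol])

def sample(rng):
    """Draw one (true label path, noisy observation string) pair."""
    word = rng.choice(["rib", "rob"])
    return word, "".join(noisy(s, rng) for s in word)

rng = random.Random(0)
train = [sample(rng) for _ in range(2000)]  # 2,000 training samples, as in the text
```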
|
|
|
|
| \textbf{5.2 Modeling mixed-order sources} |
|
|
| For these results, we use five labels, a-e ( _|Y|_ = 5), and 26 |
| observation values, A-Z ( _|X|_ = 26); however, the results |
| were qualitatively the same over a range of sizes for _Y_ and |
_X_ . We generate data from a mixed-order HMM with state transition probabilities given by
\[
p_\alpha(y_i \,|\, y_{i-1}, y_{i-2}) = \alpha\, p_2(y_i \,|\, y_{i-1}, y_{i-2}) + (1 - \alpha)\, p_1(y_i \,|\, y_{i-1})
\]
and, similarly, emission probabilities given by
\[
p_\alpha(x_i \,|\, y_i, x_{i-1}) = \alpha\, p_2(x_i \,|\, y_i, x_{i-1}) + (1 - \alpha)\, p_1(x_i \,|\, y_i)\,.
\]
Thus, for $\alpha = 0$ we have a standard first-order HMM. In order to limit the size
|
|
|
|
|
|
of the Bayes error rate for the resulting models, the conditional probability tables $p_\alpha$ are constrained to be sparse. In particular, $p_\alpha(\cdot \,|\, y, y')$ can have at most two nonzero entries for each $y, y'$, and $p_\alpha(\cdot \,|\, y, x')$ can have at most three nonzero entries for each $y, x'$. For each randomly generated model, a sample of 1,000 sequences of length 25 is generated for training and testing.
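Sampling from the mixed-order source is straightforward: at each step, flip an $\alpha$-coin to decide whether the second-order or the first-order table generates the next label, and likewise for the next observation. A sketch, with the (sparse) tables represented as plain dictionaries mapping a history to `(value, probability)` pairs (their random, sparse construction is not detailed here):

```python
import random

def draw(dist, rng):
    """Sample a value from a list of (value, probability) pairs."""
    r, acc = rng.random(), 0.0
    for v, p in dist:
        acc += p
        if r <= acc:
            return v
    return dist[-1][0]

def sample_sequence(p1_y, p2_y, p1_x, p2_x, y0, x0, alpha, length, rng):
    """Sample one (labels, observations) pair from the mixed-order HMM."""
    ys, xs = [y0], [x0]
    for _ in range(1, length):
        # transition: alpha * p2(y_i | y_{i-1}, y_{i-2}) + (1-alpha) * p1(y_i | y_{i-1})
        if rng.random() < alpha and len(ys) >= 2:
            y = draw(p2_y[(ys[-1], ys[-2])], rng)
        else:
            y = draw(p1_y[ys[-1]], rng)
        # emission: alpha * p2(x_i | y_i, x_{i-1}) + (1-alpha) * p1(x_i | y_i)
        if rng.random() < alpha:
            x = draw(p2_x[(y, xs[-1])], rng)
        else:
            x = draw(p1_x[y], rng)
        ys.append(y)
        xs.append(x)
    return ys, xs
```

Flipping a fresh $\alpha$-coin at every step is a valid sampler for the mixture distribution, and setting $\alpha = 0$ recovers a standard first-order HMM.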
|
|
|
|
| On each randomly generated training set, a CRF is trained |
| using Algorithm S. (Note that since the length of the sequences and number of active features is constant, Algorithms S and T are identical.) The algorithm is fairly slow |
| to converge, typically taking approximately 500 iterations |
| for the model to stabilize. On the 500 MHz Pentium PC |
| used in our experiments, each iteration takes approximately |
| 0.2 seconds. On the same data an MEMM is trained using |
| iterative scaling, which does not require forward-backward |
| calculations, and is thus more efficient. The MEMM training converges more quickly, stabilizing after approximately |
| 100 iterations. For each model, the Viterbi algorithm is |
| used to label a test set; the experimental results do not significantly change when using forward-backward decoding |
| to minimize the per-symbol error rate. |
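Viterbi decoding finds the highest-scoring label sequence with one dynamic-programming sweep. A generic sketch over per-position scores (for an HMM these are log-probabilities; for the CRFs and MEMMs compared here, the analogous log-potentials):

```python
def viterbi(log_emit, log_trans, log_init):
    """Most likely label sequence under additive per-position scores.

    log_emit[i][y]   : score of label y at position i (n x K)
    log_trans[yp][y] : score of moving from label yp to label y (K x K)
    log_init[y]      : score of starting in label y (length K)
    """
    n, K = len(log_emit), len(log_init)
    score = [log_init[y] + log_emit[0][y] for y in range(K)]
    back = []
    for i in range(1, n):
        new, ptr = [], []
        for y in range(K):
            best = max(range(K), key=lambda yp: score[yp] + log_trans[yp][y])
            new.append(score[best] + log_trans[best][y] + log_emit[i][y])
            ptr.append(best)
        score = new
        back.append(ptr)
    # Backtrace from the best final label.
    y = max(range(K), key=lambda yy: score[yy])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return path[::-1]
```

Replacing the `max` in the recursion with a log-sum-exp (and the backtrace with a backward pass) gives the forward-backward decoding mentioned above, which minimizes per-symbol error instead.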
|
|
|
|
| The results of several runs are presented in Figure 3. Each |
| plot compares two classes of models, with each point indicating the error rate for a single test set. As _α_ increases, the |
| error rates generally increase, as the first-order models fail |
| to fit the second-order data. The figure compares models |
parameterized as $\mu_y$, $\lambda_{y,y'}$, and $\lambda_{y,y',x}$; results for models parameterized as $\mu_y$, $\lambda_{y,y'}$, and $\mu_{y,x}$ are qualitatively the
same. As shown in the first graph, the CRF generally outperforms the MEMM, often by a wide margin of 10\%--20\% relative error. (The points for very small error rate, with
| _α <_ 0 _._ 01, where the MEMM does better than the CRF, |
| are suspected to be the result of an insufficient number of |
| training iterations for the CRF.) |
|
|
|
|
\begin{tabular}{lcc}
\toprule
model & error & oov error \\
\midrule
HMM & 5.69\% & 45.99\% \\
MEMM & 6.37\% & 54.61\% \\
CRF & 5.55\% & 48.05\% \\
MEMM$^{+}$ & 4.81\% & 26.99\% \\
CRF$^{+}$ & 4.27\% & 23.76\% \\
\bottomrule
\end{tabular}
|
|
|
|
$^{+}$Using spelling features.
|
|
|
|
_Figure 4._ Per-word error rates for POS tagging on the Penn treebank, using first-order models trained on 50\% of the 1.1 million word corpus. The oov rate is 5.45\%.
|
|
|
|
| \textbf{5.3 POS tagging experiments} |
|
|
|
|
| To confirm our synthetic data results, we also compared |
| HMMs, MEMMs and CRFs on Penn treebank POS tagging, where each word in a given input sentence must be |
| labeled with one of 45 syntactic tags. |
|
|
|
|
| We carried out two sets of experiments with this natural |
| language data. First, we trained first-order HMM, MEMM, |
and CRF models as in the synthetic data experiments, introducing parameters $\mu_{y,x}$ for each tag-word pair and $\lambda_{y,y'}$
| for each tag-tag pair in the training set. The results are consistent with what is observed on synthetic data: the HMM |
| outperforms the MEMM, as a consequence of the label bias |
problem, while the CRF outperforms the HMM. The error rates for training runs using a 50\%--50\% train-test split are shown in Figure 4; the results are qualitatively similar for other splits of the data. The error rates on out-of-vocabulary (oov) words, which are not observed in the training set, are reported separately.
|
|
|
|
| In the second set of experiments, we take advantage of the |
| power of conditional models by adding a small set of orthographic features: whether a spelling begins with a number or upper case letter, whether it contains a hyphen, and |
whether it ends in one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies. Here we find, as expected, that both the MEMM and the CRF benefit significantly from the use of these features, with the overall error rate reduced by around 25\%, and the out-of-vocabulary error rate reduced by around 50\%.
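The orthographic tests listed above are cheap per-word binary features; a sketch (the feature names are illustrative, not the paper's):

```python
# Suffix list as given in the text.
SUFFIXES = ["-ing", "-ogy", "-ed", "-s", "-ly", "-ion", "-tion", "-ity", "-ies"]

def spelling_features(word):
    """Binary orthographic features of a word's spelling."""
    feats = {
        "starts_with_digit": word[:1].isdigit(),
        "starts_with_upper": word[:1].isupper(),
        "contains_hyphen": "-" in word,
    }
    for s in SUFFIXES:
        feats["ends_with" + s] = word.endswith(s[1:])
    return feats
```

Because CRFs and MEMMs are conditional models, such overlapping, non-independent features can simply be added alongside the word-identity features without changing the training procedure.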
|
|
|
|
| One usually starts training from the all zero parameter vector, corresponding to the uniform distribution. However, |
| for these datasets, CRF training with that initialization is |
| much slower than MEMM training. Fortunately, we can |
| use the optimal MEMM parameter vector as a starting |
point for training the corresponding CRF. In Figure 4, MEMM$^{+}$ was trained to convergence in around 100 iterations. Its parameters were then used to initialize the training of CRF$^{+}$, which converged in 1,000 iterations. In contrast, training of the same CRF from the uniform distribution had not converged even after 2,000 iterations.
|
|
|
|
|
|
| \textbf{6. Further Aspects of CRFs} |
|
|
|
|
| Many further aspects of CRFs are attractive for applications and deserve further study. In this section we briefly |
| mention just two. |
|
|
|
|
| Conditional random fields can be trained using the exponential loss objective function used by the AdaBoost algorithm (Freund & Schapire, 1997). Typically, boosting is |
| applied to classification problems with a small, fixed number of classes; applications of boosting to sequence labeling |
| have treated each label as a separate classification problem |
| (Abney et al., 1999). However, it is possible to apply the |
| parallel update algorithm of Collins et al. (2000) to optimize the per-sequence exponential loss. This requires a |
| forward-backward algorithm to compute efficiently certain |
| feature expectations, along the lines of Algorithm T, except that each feature requires a separate set of forward and |
| backward accumulators. |
|
|
|
|
| Another attractive aspect of CRFs is that one can implement efficient feature selection and feature induction algorithms for them. That is, rather than specifying in advance which features of ( \textbf{X} _,_ \textbf{Y} ) to use, we could start from |
| feature-generating rules and evaluate the benefit of generated features automatically on data. In particular, the feature induction algorithms presented in Della Pietra et al. |
| (1997) can be adapted to fit the dynamic programming |
| techniques of conditional random fields. |
|
|
|
|
| \textbf{7. Related Work and Conclusions} |
|
|
|
|
| As far as we know, the present work is the first to combine |
| the benefits of conditional models with the global normalization of random field models. Other applications of exponential models in sequence modeling have either attempted |
| to build generative models (Rosenfeld, 1997), which involve a hard normalization problem, or adopted local conditional models (Berger et al., 1996; Ratnaparkhi, 1996; |
| McCallum et al., 2000) that may suffer from label bias. |
|
|
|
|
| Non-probabilistic local decision models have also been |
| widely used in segmentation and tagging (Brill, 1995; |
| Roth, 1998; Abney et al., 1999). Because of the computational complexity of global training, these models are only |
| trained to minimize the error of individual label decisions |
| assuming that neighboring labels are correctly chosen. Label bias would be expected to be a problem here too. |
|
|
|
|
| An alternative approach to discriminative modeling of sequence labeling is to use a permissive generative model, |
| which can only model local dependencies, to produce a |
| list of candidates, and then use a more global discriminative model to rerank those candidates. This approach is |
| standard in large-vocabulary speech recognition (Schwartz |
| & Austin, 1993), and has also been proposed for parsing |
| (Collins, 2000). However, these methods fail when the correct output is pruned away in the first pass. |
|
|
|
|
| Closest to our proposal are gradient-descent methods that |
| adjust the parameters of all of the local classifiers to minimize a smooth loss function (e.g., quadratic loss) combining loss terms for each label. If state dependencies are local, this can be done efficiently with dynamic programming |
| (LeCun et al., 1998). Such methods should alleviate label |
| bias. However, their loss function is not convex, so they |
| may get stuck in local minima. |
|
|
|
|
| Conditional random fields offer a unique combination of |
| properties: discriminatively trained models for sequence |
| segmentation and labeling; combination of arbitrary, overlapping and agglomerative observation features from both |
| the past and future; efficient training and decoding based |
| on dynamic programming; and parameter estimation guaranteed to find the global optimum. Their main current limitation is the slow convergence of the training algorithm |
| relative to MEMMs, let alone to HMMs, for which training |
| on fully observed data is very efficient. In future work, we |
| plan to investigate alternative training methods such as the |
| update methods of Collins et al. (2000) and refinements on |
| using a MEMM as starting point as we did in some of our |
| experiments. More general tree-structured random fields, |
| feature induction methods, and further natural data evaluations will also be investigated. |
|
|
|
|
| \textbf{Acknowledgments} |
|
|
|
|
We thank Yoshua Bengio, Léon Bottou, Michael Collins
| and Yann LeCun for alerting us to what we call here the label bias problem. We also thank Andrew Ng and Sebastian |
| Thrun for discussions related to this work. |
|
|
|
|
| \textbf{References} |
|
|
|
|
| Abney, S., Schapire, R. E., & Singer, Y. (1999). Boosting |
|
|
| applied to tagging and PP attachment. _Proc. EMNLP-_ |
| _VLC_ . New Brunswick, New Jersey: Association for |
| Computational Linguistics. |
| Berger, A. L., Della Pietra, S. A., & Della Pietra, V. J. |
|
|
| (1996). A maximum entropy approach to natural language processing. _Computational Linguistics_, _22_ . |
Bottou, L. (1991). _Une approche théorique de l'apprentissage connexionniste: Applications à la reconnaissance de la parole_. Doctoral dissertation, Université de Paris XI.
| Brill, E. (1995). Transformation-based error-driven learn |
| ing and natural language processing: a case study in part |
| of speech tagging. _Computational Linguistics_, _21_, 543– |
| 565. |
| Collins, M. (2000). Discriminative reranking for natural |
|
|
| language parsing. _Proc. ICML 2000_ . Stanford, California. |
| Collins, M., Schapire, R., & Singer, Y. (2000). Logistic re |
| gression, AdaBoost, and Bregman distances. _Proc. 13th_ |
| _COLT_ . |
| Darroch, J. N., & Ratcliff, D. (1972). Generalized iterative |
|
|
|
|
|
|
| scaling for log-linear models. _The Annals of Mathemat-_ |
| _ical Statistics_, _43_, 1470–1480. |
| Della Pietra, S., Della Pietra, V., & Lafferty, J. (1997). In |
| ducing features of random fields. _IEEE Transactions on_ |
| _Pattern Analysis and Machine Intelligence_, _19_, 380–393. |
| Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998). |
|
|
| _Biological sequence analysis: Probabilistic models of_ |
| _proteins and nucleic acids_ . Cambridge University Press. |
| Freitag, D., & McCallum, A. (2000). Information extrac |
| tion with HMM structures learned by stochastic optimization. _Proc. AAAI 2000_ . |
| Freund, Y., & Schapire, R. (1997). A decision-theoretic |
|
|
| generalization of on-line learning and an application to |
| boosting. _Journal of Computer and System Sciences_, _55_, |
| 119–139. |
| Hammersley, J., & Clifford, P. (1971). Markov fields on |
|
|
| finite graphs and lattices. Unpublished manuscript. |
| LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). |
|
|
| Gradient-based learning applied to document recognition. _Proceedings of the IEEE_, _86_, 2278–2324. |
| MacKay, D. J. (1996). Equivalence of linear Boltzmann |
|
|
| chains and hidden Markov models. _Neural Computation_, |
| _8_, 178–181. |
Manning, C. D., & Schütze, H. (1999). _Foundations of statistical natural language processing_. Cambridge, Massachusetts: MIT Press.
| McCallum, A., Freitag, D., & Pereira, F. (2000). Maximum |
|
|
| entropy Markov models for information extraction and |
| segmentation. _Proc. ICML 2000_ (pp. 591–598). Stanford, California. |
| Mohri, M. (1997). Finite-state transducers in language and |
|
|
| speech processing. _Computational Linguistics_, _23_ . |
| Mohri, M. (2000). Minimization algorithms for sequential |
|
|
| transducers. _Theoretical Computer Science_, _234_, 177– |
| 201. |
| Paz, A. (1971). _Introduction to probabilistic automata_ . |
| Academic Press. |
| Punyakanok, V., & Roth, D. (2001). The use of classifiers |
|
|
| in sequential inference. _NIPS 13_ . Forthcoming. |
| Ratnaparkhi, A. (1996). A maximum entropy model for |
|
|
| part-of-speech tagging. _Proc. EMNLP_ . New Brunswick, |
| New Jersey: Association for Computational Linguistics. |
| Rosenfeld, R. (1997). A whole sentence maximum entropy |
|
|
| language model. _Proceedings of the IEEE Workshop on_ |
| _Speech Recognition and Understanding_ . Santa Barbara, |
| California. |
| Roth, D. (1998). Learning to resolve natural language am |
| biguities: A unified approach. _Proc. 15th AAAI_ (pp. 806– |
| 813). Menlo Park, California: AAAI Press. |
| Saul, L., & Jordan, M. (1996). Boltzmann chains and hid |
| den Markov models. _Advances in Neural Information_ |
| _Processing Systems 7_ . MIT Press. |
| Schwartz, R., & Austin, S. (1993). A comparison of several |
|
|
| approximate algorithms for finding multiple (N-BEST) |
| sentence hypotheses. _Proc. ICASSP_ . Minneapolis, MN. |
|
|
|
|
|
|
| \end{document} |
|
|