\documentclass[11pt]{article}
\usepackage[utf8]{inputenc}
\usepackage{amsmath,amssymb}
\usepackage{booktabs}
\usepackage{hyperref}
\begin{document}
\section*{Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data}
\noindent\textbf{John Lafferty}$^{\dagger *}$ \hfill LAFFERTY@CS.CMU.EDU\\
\textbf{Andrew McCallum}$^{* \dagger}$ \hfill MCCALLUM@WHIZBANG.COM\\
\textbf{Fernando Pereira}$^{* \ddagger}$ \hfill FPEREIRA@WHIZBANG.COM

\noindent $^{*}$WhizBang! Labs--Research, 4616 Henry Street, Pittsburgh, PA 15213 USA\\
$^{\dagger}$School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA\\
$^{\ddagger}$Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104 USA
\section*{Abstract}
We present \emph{conditional random fields}, a framework for building probabilistic models to segment and label sequence data. Conditional random fields offer several advantages over hidden Markov models and stochastic grammars
for such tasks, including the ability to relax
strong independence assumptions made in those
models. Conditional random fields also avoid
a fundamental limitation of maximum entropy
Markov models (MEMMs) and other discriminative Markov models based on directed graphical models, which can be biased towards states
with few successor states. We present iterative
parameter estimation algorithms for conditional
random fields and compare the performance of
the resulting models to HMMs and MEMMs on
synthetic and natural-language data.
\section*{1. Introduction}
The need to segment and label sequences arises in many
different problems in several scientific fields. Hidden
Markov models (HMMs) and stochastic grammars are well
understood and widely used probabilistic models for such
problems. In computational biology, HMMs and stochastic grammars have been successfully used to align biological sequences, find sequences homologous to a known
evolutionary family, and analyze RNA secondary structure
(Durbin et al., 1998). In computational linguistics and
computer science, HMMs and stochastic grammars have
been applied to a wide variety of problems in text and
speech processing, including topic segmentation, part-of-speech (POS) tagging, information extraction, and syntactic disambiguation (Manning \& Schütze, 1999).
HMMs and stochastic grammars are generative models, assigning a joint probability to paired observation and label
sequences; the parameters are typically trained to maximize the joint likelihood of training examples. To define
a joint probability over observation and label sequences,
a generative model needs to enumerate all possible observation sequences, typically requiring a representation
in which observations are task-appropriate atomic entities,
such as words or nucleotides. In particular, it is not practical to represent multiple interacting features or long-range
dependencies of the observations, since the inference problem for such models is intractable.
This difficulty is one of the main motivations for looking at
conditional models as an alternative. A conditional model
specifies the probabilities of possible label sequences given
an observation sequence. Therefore, it does not expend
modeling effort on the observations, which at test time
are fixed anyway. Furthermore, the conditional probability of the label sequence can depend on arbitrary, non-independent features of the observation sequence without
forcing the model to account for the distribution of those
dependencies. The chosen features may represent attributes
at different levels of granularity of the same observations
(for example, words and characters in English text), or
aggregate properties of the observation sequence (for instance, text layout). The probability of a transition between
labels may depend not only on the current observation,
but also on past and future observations, if available. In
contrast, generative models must make very strict independence assumptions on the observations, for instance conditional independence given the labels, to achieve tractability.
Maximum entropy Markov models (MEMMs) are conditional probabilistic sequence models that attain all of the
above advantages (McCallum et al., 2000). In MEMMs,
each source state\footnote{Output labels are associated with states; it is possible for several states to have the same label, but for simplicity in the rest of this paper we assume a one-to-one correspondence.} has an exponential model that takes the observation features as input, and outputs a distribution over possible next states. These exponential models are trained by an appropriate iterative scaling method in the
maximum entropy framework. Previously published experimental results show MEMMs increasing recall and doubling precision relative to HMMs in a FAQ segmentation
task.
MEMMs and other non-generative finite-state models
based on next-state classifiers, such as discriminative
Markov models (Bottou, 1991), share a weakness we call
here the \emph{label bias problem}: the transitions leaving a given
state compete only against each other, rather than against
all other transitions in the model. In probabilistic terms,
transition scores are the conditional probabilities of possible next states given the current state and the observation sequence. This per-state normalization of transition
scores implies a “conservation of score mass” (Bottou,
1991) whereby all the mass that arrives at a state must be
distributed among the possible successor states. An observation can affect which destination states get the mass, but
not how much total mass to pass on. This causes a bias toward states with fewer outgoing transitions. In the extreme
case, a state with a single outgoing transition effectively
ignores the observation. In those cases, unlike in HMMs,
Viterbi decoding cannot downgrade a branch based on observations after the branch point, and models with state-transition structures that have sparsely connected chains of states are not properly handled. The Markovian assumptions in MEMMs and similar state-conditional models insulate decisions at one state from future decisions in a way that does not match the actual dependencies between consecutive states.
This paper introduces \emph{conditional random fields} (CRFs), a
sequence modeling framework that has all the advantages
of MEMMs but also solves the label bias problem in a
principled way. The critical difference between CRFs and
MEMMs is that a MEMM uses per-state exponential models for the conditional probabilities of next states given the
current state, while a CRF has a single exponential model
for the joint probability of the entire sequence of labels
given the observation sequence. Therefore, the weights of
different features at different states can be traded off against
each other.
We can also think of a CRF as a finite state model with unnormalized transition probabilities. However, unlike some
other weighted finite-state approaches (LeCun et al., 1998),
CRFs assign a well-defined probability distribution over
possible labelings, trained by maximum likelihood or MAP
estimation. Furthermore, the loss function is convex,\footnote{In the case of fully observable states, as we are discussing here; if several states have the same label, the usual local maxima of Baum--Welch arise.} guaranteeing convergence to the global optimum. CRFs also generalize easily to analogues of stochastic context-free grammars that would be useful in such problems as RNA secondary structure prediction and natural language processing.
\emph{Figure 1.} Label bias example, after (Bottou, 1991). For conciseness, we place observation--label pairs $o{:}l$ on transitions rather than states; a blank symbol represents the null output label.
We present the model, describe two training procedures and
sketch a proof of convergence. We also give experimental
results on synthetic data showing that CRFs solve the classical version of the label bias problem, and, more significantly, that CRFs perform better than HMMs and MEMMs
when the true data distribution has higher-order dependencies than the model, as is often the case in practice. Finally,
we confirm these results as well as the claimed advantages
of conditional models by evaluating HMMs, MEMMs and
CRFs with identical state structure on a part-of-speech tagging task.
\section*{2. The Label Bias Problem}
Classical probabilistic automata (Paz, 1971), discriminative Markov models (Bottou, 1991), maximum entropy
taggers (Ratnaparkhi, 1996), and MEMMs, as well as
non-probabilistic sequence tagging and segmentation models with independently trained next-state classifiers (Punyakanok \& Roth, 2001) are all potential victims of the label
bias problem.
For example, Figure 1 represents a simple finite-state
model designed to distinguish between the two words rib
and rob. Suppose that the observation sequence is r i b.
In the first time step, r matches both transitions from the
start state, so the probability mass gets distributed roughly
equally among those two transitions. Next we observe i.
Both states 1 and 4 have only one outgoing transition. State
1 has seen this observation often in training, state 4 has almost never seen this observation; but like state 1, state 4
has no choice but to pass all its mass to its single outgoing
transition, since it is not generating the observation, only
conditioning on it. Thus, states with a single outgoing transition effectively ignore their observations. More generally,
states with low-entropy next state distributions will take little notice of observations. Returning to the example, the
top path and the bottom path will be about equally likely,
independently of the observation sequence. If one of the
two words is slightly more common in the training set, the
transitions out of the start state will slightly prefer its corresponding transition, and that word’s state sequence will
always win. This behavior is demonstrated experimentally
in Section 5.
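The mass-conservation argument can be made concrete with a toy per-state-normalized model of the rib/rob automaton. The transition table and probabilities below are illustrative inventions, not the paper's trained model:

```python
# Toy illustration of the label bias problem on the "rib"/"rob"
# automaton of Figure 1 (scores are made up for illustration).
# Each state locally normalizes its outgoing transitions, so a
# state with a single successor passes on ALL its mass, no matter
# how unlikely the observation is under that branch.

# P(next_state | state, observation), locally normalized per state.
memm = {
    # From the start state, 'r' matches both branches about equally.
    (0, "r"): {1: 0.5, 4: 0.5},
    # States 1 and 4 each have one successor: probability 1 for ANY
    # observation, even one the branch has essentially never seen.
    (1, "i"): {2: 1.0}, (1, "o"): {2: 1.0},
    (4, "i"): {5: 1.0}, (4, "o"): {5: 1.0},
    (2, "b"): {3: 1.0},
    (5, "b"): {3: 1.0},
}

def path_prob(path, obs):
    p = 1.0
    for (s, s_next), o in zip(zip(path, path[1:]), obs):
        p *= memm[(s, o)][s_next]
    return p

obs = ["r", "i", "b"]                  # the observation clearly says "rib"
top = path_prob([0, 1, 2, 3], obs)     # the "rib" branch
bottom = path_prob([0, 4, 5, 3], obs)  # the "rob" branch
print(top, bottom)  # both 0.5: the 'i' vs. 'o' evidence is ignored
```

Both paths receive the same score whatever the middle observation is, so whichever branch training slightly prefers out of the start state always wins.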
Léon Bottou (1991) discussed two solutions for the label bias problem. One is to change the state-transition structure of the model. In the above example we could collapse
states 1 and 4, and delay the branching until we get a discriminating observation. This operation is a special case
of determinization (Mohri, 1997), but determinization of
weighted finite-state machines is not always possible, and
even when possible, it may lead to combinatorial explosion. The other solution mentioned is to start with a fully-connected model and let the training procedure figure out
a good structure. But that would preclude the use of prior
structural knowledge that has proven so valuable in information extraction tasks (Freitag \& McCallum, 2000).
Proper solutions require models that account for whole
state sequences at once by letting some transitions “vote”
more strongly than others depending on the corresponding
observations. This implies that score mass will not be conserved, but instead individual transitions can “amplify” or
“dampen” the mass they receive. In the above example, the
transitions from the start state would have a very weak effect on path score, while the transitions from states 1 and 4
would have much stronger effects, amplifying or damping
depending on the actual observation, and a proportionally
higher contribution to the selection of the Viterbi path.\footnote{Weighted determinization and minimization techniques shift transition weights while preserving overall path weight (Mohri, 2000); their connection to this discussion deserves further study.}
In the related work section we discuss other heuristic model
classes that account for state sequences globally rather than
locally. To the best of our knowledge, CRFs are the only
model class that does this in a purely probabilistic setting,
with guaranteed global maximum likelihood convergence.
\section*{3. Conditional Random Fields}
In what follows, $\mathbf{X}$ is a random variable over data sequences to be labeled, and $\mathbf{Y}$ is a random variable over corresponding label sequences. All components $\mathbf{Y}_i$ of $\mathbf{Y}$ are assumed to range over a finite label alphabet $\mathcal{Y}$. For example, $\mathbf{X}$ might range over natural language sentences and $\mathbf{Y}$ range over part-of-speech taggings of those sentences, with $\mathcal{Y}$ the set of possible part-of-speech tags. The random variables $\mathbf{X}$ and $\mathbf{Y}$ are jointly distributed, but in a discriminative framework we construct a conditional model $p(\mathbf{Y} \mid \mathbf{X})$ from paired observation and label sequences, and do not explicitly model the marginal $p(\mathbf{X})$.
\textbf{Definition.} \emph{Let $G = (V, E)$ be a graph such that $\mathbf{Y} = (\mathbf{Y}_v)_{v \in V}$, so that $\mathbf{Y}$ is indexed by the vertices of $G$. Then $(\mathbf{X}, \mathbf{Y})$ is a conditional random field in case, when conditioned on $\mathbf{X}$, the random variables $\mathbf{Y}_v$ obey the Markov property with respect to the graph: $p(\mathbf{Y}_v \mid \mathbf{X}, \mathbf{Y}_w, w \neq v) = p(\mathbf{Y}_v \mid \mathbf{X}, \mathbf{Y}_w, w \sim v)$, where $w \sim v$ means that $w$ and $v$ are neighbors in $G$.}
Thus, a CRF is a random field globally conditioned on the observation $\mathbf{X}$. Throughout the paper we tacitly assume that the graph $G$ is fixed. In the simplest and most important example for modeling sequences, $G$ is a simple chain or line: $G = (V = \{1, 2, \ldots, m\},\, E = \{(i, i+1)\})$.
$\mathbf{X}$ may also have a natural graph structure; yet in general it is not necessary to assume that $\mathbf{X}$ and $\mathbf{Y}$ have the same graphical structure, or even that $\mathbf{X}$ has any graphical structure at all. However, in this paper we will be most concerned with sequences $\mathbf{X} = (\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n)$ and $\mathbf{Y} = (\mathbf{Y}_1, \mathbf{Y}_2, \ldots, \mathbf{Y}_n)$.
If the graph $G = (V, E)$ of $\mathbf{Y}$ is a tree (of which a chain is the simplest example), its cliques are the edges and vertices. Therefore, by the fundamental theorem of random fields (Hammersley \& Clifford, 1971), the joint distribution over the label sequence $\mathbf{Y}$ given $\mathbf{X}$ has the form
\[
p_\theta(\mathbf{y} \mid \mathbf{x}) \propto
\exp\Big( \sum_{e \in E,\, k} \lambda_k\, f_k(e, \mathbf{y}|_e, \mathbf{x})
        + \sum_{v \in V,\, k} \mu_k\, g_k(v, \mathbf{y}|_v, \mathbf{x}) \Big) \, , \tag{1}
\]
where $\mathbf{x}$ is a data sequence, $\mathbf{y}$ a label sequence, and $\mathbf{y}|_S$ is the set of components of $\mathbf{y}$ associated with the vertices in subgraph $S$.

We assume that the \emph{features} $f_k$ and $g_k$ are given and fixed. For example, a Boolean vertex feature $g_k$ might be true if the word $\mathbf{X}_i$ is upper case and the tag $\mathbf{Y}_i$ is ``proper noun.''

The parameter estimation problem is to determine the parameters $\theta = (\lambda_1, \lambda_2, \ldots; \mu_1, \mu_2, \ldots)$ from training data $\mathcal{D} = \{(\mathbf{x}^{(i)}, \mathbf{y}^{(i)})\}_{i=1}^{N}$ with empirical distribution $\widetilde{p}(\mathbf{x}, \mathbf{y})$. In Section 4 we describe an iterative scaling algorithm that maximizes the log-likelihood objective function $\mathcal{O}(\theta)$:
\[
\mathcal{O}(\theta) = \sum_{i=1}^{N} \log p_\theta(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)})
\;\propto\; \sum_{\mathbf{x}, \mathbf{y}} \widetilde{p}(\mathbf{x}, \mathbf{y}) \log p_\theta(\mathbf{y} \mid \mathbf{x}) \, .
\]

As a particular case, we can construct an HMM-like CRF by defining one feature for each state pair $(y', y)$, and one feature for each state--observation pair $(y, x)$:
\begin{align*}
f_{y',y}(\langle u, v\rangle, \mathbf{y}|_{\langle u, v\rangle}, \mathbf{x}) &= \delta(\mathbf{y}_u, y')\, \delta(\mathbf{y}_v, y) \\
g_{y,x}(v, \mathbf{y}|_v, \mathbf{x}) &= \delta(\mathbf{y}_v, y)\, \delta(\mathbf{x}_v, x) \, .
\end{align*}
The corresponding parameters $\lambda_{y',y}$ and $\mu_{y,x}$ play a similar role to the (logarithms of the) usual HMM parameters $p(y' \mid y)$ and $p(x \mid y)$. Boltzmann chain models (Saul \& Jordan, 1996; MacKay, 1996) have a similar form but use a single normalization constant to yield a joint distribution, whereas CRFs use the observation-dependent normalization $Z(\mathbf{x})$ for conditional distributions.

\emph{Figure 2.} Graphical structures of simple HMMs (left), MEMMs (center), and the chain-structured case of CRFs (right) for sequences. An open circle indicates that the variable is not generated by the model.

Although it encompasses HMM-like models, the class of conditional random fields is much more expressive, because it allows arbitrary dependencies on the observation
sequence. In addition, the features do not need to specify
completely a state or observation, so one might expect that
the model can be estimated from less training data. Another
attractive property is the convexity of the loss function; indeed, CRFs share all of the convexity properties of general
maximum entropy models.
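As a sanity check on the HMM-like construction above, the following sketch (toy probabilities, not code from the paper) sets the edge and vertex weights to log HMM parameters and verifies by brute-force enumeration that the normalized CRF score reproduces the HMM's conditional distribution:

```python
import math
from itertools import product

# Sketch: an HMM-like chain CRF. With lambda[y', y] = log p(y | y') and
# mu[y, x] = log p(x | y), the unnormalized score exp(sum of feature
# weights) equals the HMM joint probability (including a start term),
# so the normalized CRF distribution matches the HMM posterior p(y|x).
# All numbers below are toy values for illustration.

states, symbols = ["a", "b"], ["A", "B"]
trans = {("a", "a"): 0.7, ("a", "b"): 0.3, ("b", "a"): 0.4, ("b", "b"): 0.6}
emit  = {("a", "A"): 0.9, ("a", "B"): 0.1, ("b", "A"): 0.2, ("b", "B"): 0.8}
start = {"a": 0.5, "b": 0.5}

lam = {k: math.log(v) for k, v in trans.items()}   # edge features
mu  = {k: math.log(v) for k, v in emit.items()}    # vertex features

def crf_score(y, x):
    """exp of the summed feature weights for label sequence y."""
    s = math.log(start[y[0]]) + mu[(y[0], x[0])]
    for i in range(1, len(y)):
        s += lam[(y[i - 1], y[i])] + mu[(y[i], x[i])]
    return math.exp(s)

x = ["A", "B", "A"]
# Z(x): brute-force sum over all label sequences of length |x|.
Z = sum(crf_score(list(y), x) for y in product(states, repeat=len(x)))
# Conditional probability of one labeling under the CRF:
p = crf_score(["a", "b", "a"], x) / Z
```

Because every unnormalized score is exactly the HMM joint $p(\mathbf{y}, \mathbf{x})$, dividing by $Z(\mathbf{x})$ recovers the HMM's $p(\mathbf{y} \mid \mathbf{x})$, illustrating why the observation-dependent normalization matters.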
For the remainder of the paper we assume that the dependencies of $\mathbf{Y}$, conditioned on $\mathbf{X}$, form a chain. To simplify some expressions, we add special start and stop states $\mathbf{Y}_0 = \texttt{start}$ and $\mathbf{Y}_{n+1} = \texttt{stop}$. Thus, we will be using the graphical structure shown in Figure 2. For a chain structure, the conditional probability of a label sequence can be expressed concisely in matrix form, which will be useful in describing the parameter estimation and inference algorithms in Section 4. Suppose that $p_\theta(\mathbf{Y} \mid \mathbf{X})$ is a CRF given by (1). For each position $i$ in the observation sequence $\mathbf{x}$, we define the $|\mathcal{Y}| \times |\mathcal{Y}|$ matrix random variable $M_i(\mathbf{x}) = [M_i(y', y \mid \mathbf{x})]$ by
\begin{align*}
M_i(y', y \mid \mathbf{x}) &= \exp\big(\Lambda_i(y', y \mid \mathbf{x})\big) \\
\Lambda_i(y', y \mid \mathbf{x}) &= \sum_k \lambda_k\, f_k(e_i, \mathbf{Y}|_{e_i} = (y', y), \mathbf{x})
  + \sum_k \mu_k\, g_k(v_i, \mathbf{Y}|_{v_i} = y, \mathbf{x}) \, ,
\end{align*}
where $e_i$ is the edge with labels $(\mathbf{Y}_{i-1}, \mathbf{Y}_i)$ and $v_i$ is the vertex with label $\mathbf{Y}_i$. In contrast to generative models, conditional models like CRFs do not need to enumerate over all possible observation sequences $\mathbf{x}$, and therefore these matrices can be computed directly as needed from a given training or test observation sequence $\mathbf{x}$ and the parameter vector $\theta$. Then the normalization (partition function) $Z_\theta(\mathbf{x})$ is the $(\texttt{start}, \texttt{stop})$ entry of the product of these matrices:
\[
Z_\theta(\mathbf{x}) = \big(M_1(\mathbf{x})\, M_2(\mathbf{x}) \cdots M_{n+1}(\mathbf{x})\big)_{\texttt{start}, \texttt{stop}} \, .
\]
Using this notation, the conditional probability of a label sequence $\mathbf{y}$ is written as
\[
p_\theta(\mathbf{y} \mid \mathbf{x}) =
\frac{\prod_{i=1}^{n+1} M_i(\mathbf{y}_{i-1}, \mathbf{y}_i \mid \mathbf{x})}
     {\big(\prod_{i=1}^{n+1} M_i(\mathbf{x})\big)_{\texttt{start}, \texttt{stop}}} \, ,
\]
where $\mathbf{y}_0 = \texttt{start}$ and $\mathbf{y}_{n+1} = \texttt{stop}$.

\section*{4. Parameter Estimation for CRFs}
We now describe two iterative scaling algorithms to find the parameter vector $\theta$ that maximizes the log-likelihood of the training data. Both algorithms are based on the improved iterative scaling (IIS) algorithm of Della Pietra et al.\ (1997); the proof technique based on auxiliary functions can be extended to show convergence of the algorithms for CRFs.

Iterative scaling algorithms update the weights as $\lambda_k \leftarrow \lambda_k + \delta\lambda_k$ and $\mu_k \leftarrow \mu_k + \delta\mu_k$. In particular, the IIS update $\delta\lambda_k$ for an edge feature $f_k$ is the solution of
\[
\widetilde{E}[f_k] \;\stackrel{\text{def}}{=}\;
\sum_{\mathbf{x}, \mathbf{y}} \widetilde{p}(\mathbf{x}, \mathbf{y}) \sum_{i=1}^{n+1} f_k(e_i, \mathbf{y}|_{e_i}, \mathbf{x})
\;=\; \sum_{\mathbf{x}, \mathbf{y}} \widetilde{p}(\mathbf{x})\, p_\theta(\mathbf{y} \mid \mathbf{x}) \sum_{i=1}^{n+1} f_k(e_i, \mathbf{y}|_{e_i}, \mathbf{x})\, e^{\delta\lambda_k T(\mathbf{x}, \mathbf{y})} \, ,
\]
where $T(\mathbf{x}, \mathbf{y})$ is the \emph{total feature count}
\[
T(\mathbf{x}, \mathbf{y}) \;\stackrel{\text{def}}{=}\;
\sum_{i,k} f_k(e_i, \mathbf{y}|_{e_i}, \mathbf{x}) + \sum_{i,k} g_k(v_i, \mathbf{y}|_{v_i}, \mathbf{x}) \, .
\]
The equations for vertex feature updates $\delta\mu_k$ have similar form.

However, efficiently computing the exponential sums on the right-hand sides of these equations is problematic, because $T(\mathbf{x}, \mathbf{y})$ is a global property of $(\mathbf{x}, \mathbf{y})$, and dynamic programming will sum over sequences with potentially varying $T$. To deal with this, the first algorithm, Algorithm S, uses a ``slack feature.'' The second, Algorithm T, keeps track of partial $T$ totals.

For Algorithm S, we define the \emph{slack feature} by
\[
s(\mathbf{x}, \mathbf{y}) \;\stackrel{\text{def}}{=}\;
S - \sum_i \sum_k f_k(e_i, \mathbf{y}|_{e_i}, \mathbf{x}) - \sum_i \sum_k g_k(v_i, \mathbf{y}|_{v_i}, \mathbf{x}) \, ,
\]
where $S$ is a constant chosen so that $s(\mathbf{x}^{(i)}, \mathbf{y}) \geq 0$ for all $\mathbf{y}$ and all observation vectors $\mathbf{x}^{(i)}$ in the training set, thus making $T(\mathbf{x}, \mathbf{y}) = S$. Feature $s$ is ``global,'' that is, it does not correspond to any particular edge or vertex.

For each index $i = 0, \ldots, n+1$ we now define the \emph{forward vectors} $\alpha_i(\mathbf{x})$ with base case
\[
\alpha_0(y \mid \mathbf{x}) = \begin{cases} 1 & \text{if } y = \texttt{start} \\ 0 & \text{otherwise} \end{cases}
\]
and recurrence
\[
\alpha_i(\mathbf{x}) = \alpha_{i-1}(\mathbf{x})\, M_i(\mathbf{x}) \, .
\]
Similarly, the \emph{backward vectors} $\beta_i(\mathbf{x})$ are defined by
\[
\beta_{n+1}(y \mid \mathbf{x}) = \begin{cases} 1 & \text{if } y = \texttt{stop} \\ 0 & \text{otherwise} \end{cases}
\]
and
\[
\beta_i(\mathbf{x})^\top = M_{i+1}(\mathbf{x})\, \beta_{i+1}(\mathbf{x}) \, .
\]
With these definitions, the update equations are
\[
\delta\lambda_k = \frac{1}{S} \log \frac{\widetilde{E} f_k}{E f_k} \, , \qquad
\delta\mu_k = \frac{1}{S} \log \frac{\widetilde{E} g_k}{E g_k} \, ,
\]
where
\begin{align*}
E f_k &= \sum_{\mathbf{x}} \widetilde{p}(\mathbf{x}) \sum_{i=1}^{n+1} \sum_{y', y} f_k(e_i, \mathbf{y}|_{e_i} = (y', y), \mathbf{x}) \times
  \frac{\alpha_{i-1}(y' \mid \mathbf{x})\, M_i(y', y \mid \mathbf{x})\, \beta_i(y \mid \mathbf{x})}{Z_\theta(\mathbf{x})} \\
E g_k &= \sum_{\mathbf{x}} \widetilde{p}(\mathbf{x}) \sum_{i=1}^{n} \sum_{y} g_k(v_i, \mathbf{y}|_{v_i} = y, \mathbf{x}) \times
  \frac{\alpha_i(y \mid \mathbf{x})\, \beta_i(y \mid \mathbf{x})}{Z_\theta(\mathbf{x})} \, .
\end{align*}
The factors involving the forward and backward vectors in the above equations have the same meaning as for standard hidden Markov models. For example,
\[
p_\theta(\mathbf{Y}_i = y \mid \mathbf{x}) = \frac{\alpha_i(y \mid \mathbf{x})\, \beta_i(y \mid \mathbf{x})}{Z_\theta(\mathbf{x})}
\]
is the marginal probability of label $\mathbf{Y}_i = y$ given that the observation sequence is $\mathbf{x}$. This algorithm is closely related to the algorithm of Darroch and Ratcliff (1972), and MART algorithms used in image reconstruction.

The constant $S$ in Algorithm S can be quite large, since in practice it is proportional to the length of the longest training observation sequence. As a result, the algorithm may converge slowly, taking very small steps toward the maximum in each iteration. If the length of the observations $\mathbf{x}^{(i)}$ and the number of active features varies greatly, a faster-converging algorithm can be obtained by keeping track of feature totals for each observation sequence separately.

Let $T(\mathbf{x}) \stackrel{\text{def}}{=} \max_{\mathbf{y}} T(\mathbf{x}, \mathbf{y})$. Algorithm T accumulates feature expectations into counters indexed by $T(\mathbf{x})$. More specifically, we use the forward--backward recurrences just introduced to compute the expectations $a_{k,t}$ of feature $f_k$ and $b_{k,t}$ of feature $g_k$ given that $T(\mathbf{x}) = t$. Then our parameter updates are $\delta\lambda_k = \log \beta_k$ and $\delta\mu_k = \log \gamma_k$, where $\beta_k$ and $\gamma_k$ are the unique positive roots to the following polynomial equations
\[
\sum_{t=0}^{T_{\max}} a_{k,t}\, \beta_k^{\,t} = \widetilde{E} f_k \, , \qquad
\sum_{t=0}^{T_{\max}} b_{k,t}\, \gamma_k^{\,t} = \widetilde{E} g_k \, , \tag{2}
\]
which can be easily computed by Newton's method.

A single iteration of Algorithm S and Algorithm T has roughly the same time and space complexity as the well known Baum--Welch algorithm for HMMs. To prove convergence of our algorithms, we can derive an auxiliary function to bound the change in likelihood from below; this method is developed in detail by Della Pietra et al.\ (1997). The full proof is somewhat detailed; however, here we give an idea of how to derive the auxiliary function. To simplify notation, we assume only edge features $f_k$ with parameters $\lambda_k$.

Given two parameter settings $\theta = (\lambda_1, \lambda_2, \ldots)$ and $\theta' = (\lambda_1 + \delta\lambda_1, \lambda_2 + \delta\lambda_2, \ldots)$, we bound from below the change in the objective function with an \emph{auxiliary function} $A(\theta', \theta)$ as follows:
\begin{align*}
\mathcal{O}(\theta') - \mathcal{O}(\theta)
&= \sum_{\mathbf{x}, \mathbf{y}} \widetilde{p}(\mathbf{x}, \mathbf{y}) \log \frac{p_{\theta'}(\mathbf{y} \mid \mathbf{x})}{p_\theta(\mathbf{y} \mid \mathbf{x})} \\
&= (\theta' - \theta) \cdot \widetilde{E} f - \sum_{\mathbf{x}} \widetilde{p}(\mathbf{x}) \log \frac{Z_{\theta'}(\mathbf{x})}{Z_\theta(\mathbf{x})} \\
&\geq (\theta' - \theta) \cdot \widetilde{E} f + 1 - \sum_{\mathbf{x}} \widetilde{p}(\mathbf{x})\, \frac{Z_{\theta'}(\mathbf{x})}{Z_\theta(\mathbf{x})} \\
&= \delta\lambda \cdot \widetilde{E} f + 1 - \sum_{\mathbf{x}} \widetilde{p}(\mathbf{x}) \sum_{\mathbf{y}} p_\theta(\mathbf{y} \mid \mathbf{x})\, e^{\delta\lambda \cdot f(\mathbf{x}, \mathbf{y})} \\
&\geq \delta\lambda \cdot \widetilde{E} f + 1 - \sum_{\mathbf{x}, \mathbf{y}, k} \widetilde{p}(\mathbf{x})\, p_\theta(\mathbf{y} \mid \mathbf{x})\, \frac{f_k(\mathbf{x}, \mathbf{y})}{T(\mathbf{x})}\, e^{\delta\lambda_k T(\mathbf{x})}
\;\stackrel{\text{def}}{=}\; A(\theta', \theta) \, ,
\end{align*}
where the inequalities follow from the convexity of $-\log$ and $\exp$. Differentiating $A$ with respect to $\delta\lambda_k$ and setting the result to zero yields equation (2).
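The matrix form of Section 3 and the forward and backward vectors above can be sketched concretely as follows. The random per-position scores stand in for the feature sums $\Lambda_i$, which in a real model would depend on $\mathbf{x}$ and the learned weights:

```python
import math
import random

# Sketch of the chain-CRF matrix computation: per-position matrices
# M_i(y', y | x), partition function Z(x) as the (start, stop) entry
# of their product, and forward/backward vectors whose products give
# per-position marginals. Toy random scores for illustration.

random.seed(0)
labels = ["start", "a", "b", "stop"]   # augmented label set
n = 3                                  # sequence length

def M():
    # M_i(y', y | x) = exp(Lambda_i); random scores stand in for the
    # feature sums, which would depend on x in a real model.
    return [[math.exp(random.uniform(-1, 1)) for _ in labels] for _ in labels]

Ms = [M() for _ in range(n + 1)]       # M_1 .. M_{n+1}

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Z(x) = (M_1 M_2 ... M_{n+1})[start, stop]
P = Ms[0]
for Mi in Ms[1:]:
    P = matmul(P, Mi)
Z = P[labels.index("start")][labels.index("stop")]

# Forward vectors: alpha_0 = indicator of start; alpha_i = alpha_{i-1} M_i.
alpha = [[1.0 if y == "start" else 0.0 for y in labels]]
for Mi in Ms:
    prev = alpha[-1]
    alpha.append([sum(prev[j] * Mi[j][k] for j in range(len(labels)))
                  for k in range(len(labels))])

# Backward vectors: beta_{n+1} = indicator of stop; beta_i = M_{i+1} beta_{i+1}.
beta = [[1.0 if y == "stop" else 0.0 for y in labels]]
for Mi in reversed(Ms):
    nxt = beta[0]
    beta.insert(0, [sum(Mi[j][k] * nxt[k] for k in range(len(labels)))
                    for j in range(len(labels))])

# Marginal p(Y_i = y | x) = alpha_i(y) beta_i(y) / Z; sums to 1 at each i.
i = 2
marg = [alpha[i][k] * beta[i][k] / Z for k in range(len(labels))]
```

Note that $\alpha_{n+1}(\texttt{stop})$ and $\beta_0(\texttt{start})$ both equal $Z(\mathbf{x})$, which is a convenient internal consistency check when implementing the recurrences.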
\section*{5. Experiments}
We first discuss two sets of experiments with synthetic data
that highlight the differences between CRFs and MEMMs.
The first experiments are a direct verification of the label
bias problem discussed in Section 2. In the second set of
experiments, we generate synthetic data using randomly
chosen hidden Markov models, each of which is a mixture of a first-order and second-order model. Competing
_first-order_ models are then trained and compared on test
data. As the data becomes more second-order, the test error rates of the trained models increase. This experiment
corresponds to the common modeling practice of approximating complex local and long-range dependencies, as occur in natural data, by small-order Markov models. Our
\emph{Figure 3.} Plots of $2 \times 2$ error rates for HMMs, CRFs, and MEMMs on randomly generated synthetic data sets, as described in Section 5.2. As the data becomes ``more second order,'' the error rates of the test models increase. As shown in the left plot, the CRF typically significantly outperforms the MEMM. The center plot shows that the HMM outperforms the MEMM. In the right plot, each open square represents a data set with $\alpha < \frac{1}{2}$, and a solid circle indicates a data set with $\alpha \geq \frac{1}{2}$. The plot shows that when the data is mostly second order ($\alpha \geq \frac{1}{2}$), the discriminatively trained CRF typically outperforms the HMM. These experiments are not designed to demonstrate the advantages of the additional representational power of CRFs and MEMMs relative to HMMs.
results clearly indicate that even when the models are parameterized in exactly the same way, CRFs are more robust to inaccurate modeling assumptions than MEMMs or HMMs, and resolve the label bias problem, which affects the performance of MEMMs. To avoid confusion of different effects, the MEMMs and CRFs in these experiments \emph{do not} use overlapping features of the observations. Finally, in a set of POS tagging experiments, we confirm the advantage of CRFs over MEMMs. We also show that the addition of overlapping features to CRFs and MEMMs allows them to perform much better than HMMs, as already shown for MEMMs by McCallum et al.\ (2000).
\subsection*{5.1 Modeling label bias}
We generate data from a simple HMM which encodes a noisy version of the finite-state network in Figure 1. Each state emits its designated symbol with probability $29/32$ and any of the other symbols with probability $1/32$. We train both an MEMM and a CRF with the same topologies on the data generated by the HMM. The observation features are simply the identity of the observation symbols. In a typical run using 2,000 training and 500 test samples, trained to convergence of the iterative scaling algorithm, the CRF error is 4.6\% while the MEMM error is 42\%, showing that the MEMM fails to discriminate between the two branches.
\subsection*{5.2 Modeling mixed-order sources}
For these results, we use five labels, a--e ($|\mathcal{Y}| = 5$), and 26 observation values, A--Z ($|\mathcal{X}| = 26$); however, the results were qualitatively the same over a range of sizes for $\mathcal{Y}$ and $\mathcal{X}$. We generate data from a mixed-order HMM with state transition probabilities given by $p_\alpha(y_i \mid y_{i-1}, y_{i-2}) = \alpha\, p_2(y_i \mid y_{i-1}, y_{i-2}) + (1 - \alpha)\, p_1(y_i \mid y_{i-1})$ and, similarly, emission probabilities given by $p_\alpha(x_i \mid y_i, x_{i-1}) = \alpha\, p_2(x_i \mid y_i, x_{i-1}) + (1 - \alpha)\, p_1(x_i \mid y_i)$. Thus, for $\alpha = 0$ we have a standard first-order HMM. In order to limit the size of the Bayes error rate for the resulting models, the conditional probability tables $p_2$ are constrained to be sparse. In particular, $p_2(\cdot \mid y, y')$ can have at most two nonzero entries, for each $y, y'$, and $p_2(\cdot \mid y, x')$ can have at most three nonzero entries for each $y, x'$. For each randomly generated model, a sample of 1,000 sequences of length 25 is generated for training and testing.
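The mixed-order source can be sketched as a small sampler. This is an illustrative toy, not the paper's generator: the tables are dense random distributions, the sparsity constraint on $p_2$ is omitted, and the first two labels are drawn uniformly for simplicity:

```python
import random

# Sketch of the mixed-order data source of Section 5.2: at each step,
# with probability alpha the transition (or emission) uses the
# second-order table p2, otherwise the first-order table p1 -- which
# is equivalent to sampling from the mixture
#   alpha * p2(.) + (1 - alpha) * p1(.).

random.seed(1)
Y = list("abcde")                      # 5 labels
X = [chr(c) for c in range(65, 91)]    # 26 observation symbols, A-Z

def rand_dist(keys):
    w = [random.random() for _ in keys]
    s = sum(w)
    return {k: wi / s for k, wi in zip(keys, w)}

p1_trans = {y: rand_dist(Y) for y in Y}                       # p1(y_i | y_{i-1})
p2_trans = {(y1, y2): rand_dist(Y) for y1 in Y for y2 in Y}   # p2(y_i | y_{i-1}, y_{i-2})
p1_emit  = {y: rand_dist(X) for y in Y}                       # p1(x_i | y_i)
p2_emit  = {(y, xp): rand_dist(X) for y in Y for xp in X}     # p2(x_i | y_i, x_{i-1})

def draw(dist):
    r, acc = random.random(), 0.0
    for k, p in dist.items():
        acc += p
        if r <= acc:
            return k
    return k  # guard against rounding

def sample(alpha, length=25):
    # First two labels drawn uniformly for simplicity.
    ys = [random.choice(Y), random.choice(Y)]
    xs = [draw(p1_emit[ys[0]]), draw(p1_emit[ys[1]])]
    for _ in range(2, length):
        trans = p2_trans[(ys[-1], ys[-2])] if random.random() < alpha else p1_trans[ys[-1]]
        y = draw(trans)
        emit = p2_emit[(y, xs[-1])] if random.random() < alpha else p1_emit[y]
        ys.append(y)
        xs.append(draw(emit))
    return ys, xs

ys, xs = sample(alpha=0.5)
```

With `alpha=0.0` the sampler degenerates to a standard first-order HMM, matching the paper's description of the $\alpha = 0$ case.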
On each randomly generated training set, a CRF is trained
using Algorithm S. (Note that since the length of the sequences and number of active features is constant, Algorithms S and T are identical.) The algorithm is fairly slow
to converge, typically taking approximately 500 iterations
for the model to stabilize. On the 500 MHz Pentium PC
used in our experiments, each iteration takes approximately
0.2 seconds. On the same data an MEMM is trained using
iterative scaling, which does not require forward-backward
calculations, and is thus more efficient. The MEMM training converges more quickly, stabilizing after approximately
100 iterations. For each model, the Viterbi algorithm is
used to label a test set; the experimental results do not significantly change when using forward-backward decoding
to minimize the per-symbol error rate.
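Viterbi decoding for such chain models is a standard max-sum dynamic program over the per-position log-scores; a minimal sketch with hand-picked illustrative scores (not the experimental setup above):

```python
# Sketch of Viterbi decoding over a chain model's per-position
# log-scores: dynamic programming for the single highest-scoring
# label sequence, with back-pointers for path recovery.

labels = ["a", "b", "c"]

def viterbi(log_M):
    """log_M[i][y_prev][y] is the log-score of label y at position i
    given previous label y_prev; log_M[0] uses a dummy 'start' row."""
    n = len(log_M)
    best = {y: log_M[0]["start"][y] for y in labels}
    back = []
    for i in range(1, n):
        new_best, ptr = {}, {}
        for y in labels:
            prev = max(labels, key=lambda yp: best[yp] + log_M[i][yp][y])
            new_best[y] = best[prev] + log_M[i][prev][y]
            ptr[y] = prev
        back.append(ptr)
        best = new_best
    # Trace back the best path from the best final label.
    y = max(labels, key=lambda yy: best[yy])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return list(reversed(path)), max(best.values())

# Toy scores that strongly prefer the sequence a -> b -> c.
hi, lo = 0.0, -5.0
log_M = [
    {"start": {"a": hi, "b": lo, "c": lo}},
    {yp: {"a": lo, "b": hi if yp == "a" else lo, "c": lo} for yp in labels},
    {yp: {"a": lo, "b": lo, "c": hi if yp == "b" else lo} for yp in labels},
]
path, score = viterbi(log_M)
print(path)  # ['a', 'b', 'c']
```

Replacing `max` with log-sum-exp in the same recursion yields the forward pass used for the per-symbol (forward-backward) decoding mentioned above.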
The results of several runs are presented in Figure 3. Each plot compares two classes of models, with each point indicating the error rate for a single test set. As $\alpha$ increases, the error rates generally increase, as the first-order models fail to fit the second-order data. The figure compares models parameterized as $\mu_y$, $\lambda_{y',y}$, and $\lambda_{y',y,x}$; results for models parameterized as $\mu_y$, $\lambda_{y',y}$, and $\mu_{y,x}$ are qualitatively the same. As shown in the first graph, the CRF generally outperforms the MEMM, often by a wide margin of 10\%--20\% relative error. (The points for very small error rate, with $\alpha < 0.01$, where the MEMM does better than the CRF, are suspected to be the result of an insufficient number of training iterations for the CRF.)
\begin{center}
\begin{tabular}{lrr}
\toprule
model & error & oov error \\
\midrule
HMM & 5.69\% & 45.99\% \\
MEMM & 6.37\% & 54.61\% \\
CRF & 5.55\% & 48.05\% \\
\midrule
MEMM$^+$ & 4.81\% & 26.99\% \\
CRF$^+$ & 4.27\% & 23.76\% \\
\bottomrule
\end{tabular}

\smallskip
$^+$\,Using spelling features
\end{center}
\textit{Figure 4.} Per-word error rates for POS tagging on the Penn treebank, using first-order models trained on 50\% of the 1.1 million word corpus. The oov rate is 5.45\%.
\textbf{5.3 POS tagging experiments}
To confirm our synthetic data results, we also compared
HMMs, MEMMs and CRFs on Penn treebank POS tagging, where each word in a given input sentence must be
labeled with one of 45 syntactic tags.
We carried out two sets of experiments with this natural
language data. First, we trained first-order HMM, MEMM,
and CRF models as in the synthetic data experiments, introducing parameters $\mu_{y,x}$ for each tag-word pair and $\lambda_{y',y}$
for each tag-tag pair in the training set. The results are consistent with what is observed on synthetic data: the HMM
outperforms the MEMM, as a consequence of the label bias
problem, while the CRF outperforms the HMM. The error rates for training runs using a 50\%-50\% train-test split
are shown in Figure 4; the results are qualitatively similar for other splits of the data. The error rates on out-of-vocabulary (oov) words, which are not observed in the
training set, are reported separately.
In the second set of experiments, we take advantage of the
power of conditional models by adding a small set of orthographic features: whether a spelling begins with a number or upper case letter, whether it contains a hyphen, and
whether it ends in one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies. Here we find, as
expected, that both the MEMM and the CRF benefit significantly from the use of these features, with the overall error
rate reduced by around 25%, and the out-of-vocabulary error rate reduced by around 50%.
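As a rough illustration, the orthographic predicates listed above can be extracted as binary features along the following lines (the feature names and function are our own, not the paper's):

```python
SUFFIXES = ("-ing", "-ogy", "-ed", "-s", "-ly", "-ion", "-tion", "-ity", "-ies")

def spelling_features(word):
    """Return the set of active orthographic features for a word:
    begins with a number or upper-case letter, contains a hyphen,
    or ends in one of the listed suffixes."""
    feats = {
        "init_number": word[0].isdigit(),
        "init_upper": word[0].isupper(),
        "has_hyphen": "-" in word,
    }
    for suf in SUFFIXES:
        feats["suffix" + suf] = word.endswith(suf[1:])
    return {name for name, active in feats.items() if active}
```

Because conditional models need not model how such overlapping features are generated, they can be added freely; a generative HMM cannot use them without complicating its independence structure.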
One usually starts training from the all zero parameter vector, corresponding to the uniform distribution. However,
for these datasets, CRF training with that initialization is
much slower than MEMM training. Fortunately, we can
use the optimal MEMM parameter vector as a starting
point for training the corresponding CRF. In Figure 4,
MEMM$^+$ was trained to convergence in around 100 iterations. Its parameters were then used to initialize the training of CRF$^+$, which converged in 1,000 iterations. In contrast, training of the same CRF from the uniform distribution had not converged even after 2,000 iterations.
\textbf{6. Further Aspects of CRFs}
Many further aspects of CRFs are attractive for applications and deserve further study. In this section we briefly
mention just two.
Conditional random fields can be trained using the exponential loss objective function used by the AdaBoost algorithm (Freund & Schapire, 1997). Typically, boosting is
applied to classification problems with a small, fixed number of classes; applications of boosting to sequence labeling
have treated each label as a separate classification problem
(Abney et al., 1999). However, it is possible to apply the
parallel update algorithm of Collins et al. (2000) to optimize the per-sequence exponential loss. This requires a
forward-backward algorithm to compute efficiently certain
feature expectations, along the lines of Algorithm T, except that each feature requires a separate set of forward and
backward accumulators.
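The forward-backward computation referred to above can be sketched for a linear chain: forward and backward accumulators over unnormalized edge scores yield the edge marginals from which feature expectations follow by summing feature values against them. A minimal sketch (not the paper's Algorithm T; all names and the dict-based representation are ours):

```python
def edge_marginals(start, trans):
    """Forward-backward on a linear chain with unnormalized scores.

    start[y] scores the first label; trans[i][yp][y] scores the edge
    between positions i and i+1.  Returns the normalizer Z and, for
    each edge, the marginal P(Y_i = yp, Y_{i+1} = y | x).
    """
    states = list(start)
    n = len(trans) + 1
    # forward accumulators: alpha[i][y] = mass of prefixes ending in y
    alpha = [dict(start)]
    for M in trans:
        prev = alpha[-1]
        alpha.append({y: sum(prev[yp] * M[yp][y] for yp in states)
                      for y in states})
    # backward accumulators: beta[i][y] = mass of suffixes from y
    beta = [dict() for _ in range(n)]
    beta[-1] = {y: 1.0 for y in states}
    for i in range(n - 2, -1, -1):
        M = trans[i]
        beta[i] = {yp: sum(M[yp][y] * beta[i + 1][y] for y in states)
                   for yp in states}
    Z = sum(alpha[-1][y] for y in states)
    marg = []
    for i, M in enumerate(trans):
        marg.append({(yp, y): alpha[i][yp] * M[yp][y] * beta[i + 1][y] / Z
                     for yp in states for y in states})
    return Z, marg
```

In the boosting-style setting described above, each feature would carry its own pair of such accumulators rather than sharing one set.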
Another attractive aspect of CRFs is that one can implement efficient feature selection and feature induction algorithms for them. That is, rather than specifying in advance which features of $(\mathbf{X}, \mathbf{Y})$ to use, we could start from
feature-generating rules and evaluate the benefit of generated features automatically on data. In particular, the feature induction algorithms presented in Della Pietra et al.
(1997) can be adapted to fit the dynamic programming
techniques of conditional random fields.
\textbf{7. Related Work and Conclusions}
As far as we know, the present work is the first to combine
the benefits of conditional models with the global normalization of random field models. Other applications of exponential models in sequence modeling have either attempted
to build generative models (Rosenfeld, 1997), which involve a hard normalization problem, or adopted local conditional models (Berger et al., 1996; Ratnaparkhi, 1996;
McCallum et al., 2000) that may suffer from label bias.
Non-probabilistic local decision models have also been
widely used in segmentation and tagging (Brill, 1995;
Roth, 1998; Abney et al., 1999). Because of the computational complexity of global training, these models are only
trained to minimize the error of individual label decisions
assuming that neighboring labels are correctly chosen. Label bias would be expected to be a problem here too.
An alternative approach to discriminative modeling of sequence labeling is to use a permissive generative model,
which can only model local dependencies, to produce a
list of candidates, and then use a more global discriminative model to rerank those candidates. This approach is
standard in large-vocabulary speech recognition (Schwartz
& Austin, 1993), and has also been proposed for parsing
(Collins, 2000). However, these methods fail when the correct output is pruned away in the first pass.
Closest to our proposal are gradient-descent methods that
adjust the parameters of all of the local classifiers to minimize a smooth loss function (e.g., quadratic loss) combining loss terms for each label. If state dependencies are local, this can be done efficiently with dynamic programming
(LeCun et al., 1998). Such methods should alleviate label
bias. However, their loss function is not convex, so they
may get stuck in local minima.
Conditional random fields offer a unique combination of
properties: discriminatively trained models for sequence
segmentation and labeling; combination of arbitrary, overlapping and agglomerative observation features from both
the past and future; efficient training and decoding based
on dynamic programming; and parameter estimation guaranteed to find the global optimum. Their main current limitation is the slow convergence of the training algorithm
relative to MEMMs, let alone to HMMs, for which training
on fully observed data is very efficient. In future work, we
plan to investigate alternative training methods such as the
update methods of Collins et al. (2000) and refinements on
using a MEMM as starting point as we did in some of our
experiments. More general tree-structured random fields,
feature induction methods, and further natural data evaluations will also be investigated.
\textbf{Acknowledgments}
We thank Yoshua Bengio, L\'eon Bottou, Michael Collins
and Yann LeCun for alerting us to what we call here the label bias problem. We also thank Andrew Ng and Sebastian
Thrun for discussions related to this work.
\textbf{References}
Abney, S., Schapire, R. E., \& Singer, Y. (1999). Boosting applied to tagging and PP attachment. \textit{Proc. EMNLP-VLC}. New Brunswick, New Jersey: Association for Computational Linguistics.

Berger, A. L., Della Pietra, S. A., \& Della Pietra, V. J. (1996). A maximum entropy approach to natural language processing. \textit{Computational Linguistics}, \textit{22}.

Bottou, L. (1991). \textit{Une approche th\'eorique de l'apprentissage connexionniste: Applications \`a la reconnaissance de la parole}. Doctoral dissertation, Universit\'e de Paris XI.

Brill, E. (1995). Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. \textit{Computational Linguistics}, \textit{21}, 543--565.

Collins, M. (2000). Discriminative reranking for natural language parsing. \textit{Proc. ICML 2000}. Stanford, California.

Collins, M., Schapire, R., \& Singer, Y. (2000). Logistic regression, AdaBoost, and Bregman distances. \textit{Proc. 13th COLT}.

Darroch, J. N., \& Ratcliff, D. (1972). Generalized iterative scaling for log-linear models. \textit{The Annals of Mathematical Statistics}, \textit{43}, 1470--1480.

Della Pietra, S., Della Pietra, V., \& Lafferty, J. (1997). Inducing features of random fields. \textit{IEEE Transactions on Pattern Analysis and Machine Intelligence}, \textit{19}, 380--393.

Durbin, R., Eddy, S., Krogh, A., \& Mitchison, G. (1998). \textit{Biological sequence analysis: Probabilistic models of proteins and nucleic acids}. Cambridge University Press.

Freitag, D., \& McCallum, A. (2000). Information extraction with HMM structures learned by stochastic optimization. \textit{Proc. AAAI 2000}.

Freund, Y., \& Schapire, R. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. \textit{Journal of Computer and System Sciences}, \textit{55}, 119--139.

Hammersley, J., \& Clifford, P. (1971). Markov fields on finite graphs and lattices. Unpublished manuscript.

LeCun, Y., Bottou, L., Bengio, Y., \& Haffner, P. (1998). Gradient-based learning applied to document recognition. \textit{Proceedings of the IEEE}, \textit{86}, 2278--2324.

MacKay, D. J. (1996). Equivalence of linear Boltzmann chains and hidden Markov models. \textit{Neural Computation}, \textit{8}, 178--181.

Manning, C. D., \& Sch\"utze, H. (1999). \textit{Foundations of statistical natural language processing}. Cambridge, Massachusetts: MIT Press.

McCallum, A., Freitag, D., \& Pereira, F. (2000). Maximum entropy Markov models for information extraction and segmentation. \textit{Proc. ICML 2000} (pp. 591--598). Stanford, California.

Mohri, M. (1997). Finite-state transducers in language and speech processing. \textit{Computational Linguistics}, \textit{23}.

Mohri, M. (2000). Minimization algorithms for sequential transducers. \textit{Theoretical Computer Science}, \textit{234}, 177--201.

Paz, A. (1971). \textit{Introduction to probabilistic automata}. Academic Press.

Punyakanok, V., \& Roth, D. (2001). The use of classifiers in sequential inference. \textit{NIPS 13}. Forthcoming.

Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. \textit{Proc. EMNLP}. New Brunswick, New Jersey: Association for Computational Linguistics.

Rosenfeld, R. (1997). A whole sentence maximum entropy language model. \textit{Proceedings of the IEEE Workshop on Speech Recognition and Understanding}. Santa Barbara, California.

Roth, D. (1998). Learning to resolve natural language ambiguities: A unified approach. \textit{Proc. 15th AAAI} (pp. 806--813). Menlo Park, California: AAAI Press.

Saul, L., \& Jordan, M. (1996). Boltzmann chains and hidden Markov models. \textit{Advances in Neural Information Processing Systems 7}. MIT Press.

Schwartz, R., \& Austin, S. (1993). A comparison of several approximate algorithms for finding multiple (N-BEST) sentence hypotheses. \textit{Proc. ICASSP}. Minneapolis, MN.
\end{document}
\end{document}