rain1024 committed on
Commit
36a70ab
·
1 Parent(s): 0ec913f

Add references folder with research papers (markdown, tex, source files)

Files changed (32)
  1. references/2001.icml.lafferty/paper.md +1498 -0
  2. references/2001.icml.lafferty/paper.tex +1497 -0
  3. references/2014.eacl.nguyen/paper.md +420 -0
  4. references/2014.eacl.nguyen/paper.tex +419 -0
  5. references/2018.naacl.vu/paper.md +147 -0
  6. references/2018.naacl.vu/paper.tex +302 -0
  7. references/2018.naacl.vu/source/VnCoreNLP.bbl +169 -0
  8. references/2018.naacl.vu/source/VnCoreNLP.tex +302 -0
  9. references/2018.naacl.vu/source/naacl_natbib.bst +1552 -0
  10. references/2018.naacl.vu/source/naaclhlt2018.sty +543 -0
  11. references/2020.emnlp.nguyen/paper.md +123 -0
  12. references/2020.emnlp.nguyen/paper.tex +301 -0
  13. references/2020.emnlp.nguyen/source/acl_natbib.bst +1975 -0
  14. references/2020.emnlp.nguyen/source/emnlp2020.sty +560 -0
  15. references/2020.emnlp.nguyen/source/emnlp2020_PhoBERT.bbl +227 -0
  16. references/2020.emnlp.nguyen/source/emnlp2020_PhoBERT.tex +301 -0
  17. references/2021.naacl.nguyen/paper.md +167 -0
  18. references/2021.naacl.nguyen/paper.tex +641 -0
  19. references/2021.naacl.nguyen/source/acl_natbib.bst +1979 -0
  20. references/2021.naacl.nguyen/source/minted.sty +1212 -0
  21. references/2021.naacl.nguyen/source/naacl2021.bbl +180 -0
  22. references/2021.naacl.nguyen/source/naacl2021.sty +310 -0
  23. references/2021.naacl.nguyen/source/naacl2021.tex +641 -0
  24. references/2021.naacl.nguyen/source/refs.bib +625 -0
  25. references/README.md +43 -0
  26. references/python_crfsuite.md +131 -0
  27. references/research_vietnamese_pos/README.md +145 -0
  28. references/research_vietnamese_pos/bibliography.bib +171 -0
  29. references/research_vietnamese_pos/papers.md +378 -0
  30. references/research_vietnamese_pos/sota.md +95 -0
  31. references/underthesea.md +137 -0
  32. references/universal_dependencies.md +100 -0
references/2001.icml.lafferty/paper.md ADDED
@@ -0,0 +1,1498 @@
---
title: "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data"
authors:
- "John D. Lafferty"
- "Andrew McCallum"
- "Fernando C. N. Pereira"
year: 2001
venue: "ICML"
url: "https://dl.acm.org/doi/10.5555/645530.655813"
---

# Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

**John Lafferty**†∗ LAFFERTY@CS.CMU.EDU
**Andrew McCallum**∗† MCCALLUM@WHIZBANG.COM
**Fernando Pereira**∗‡ FPEREIRA@WHIZBANG.COM

∗ WhizBang! Labs–Research, 4616 Henry Street, Pittsburgh, PA 15213 USA
† School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA
‡ Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104 USA
**Abstract**

We present _conditional random fields_, a framework for building probabilistic models to segment and label sequence data. Conditional random fields offer several advantages over hidden Markov models and stochastic grammars for such tasks, including the ability to relax strong independence assumptions made in those models. Conditional random fields also avoid a fundamental limitation of maximum entropy Markov models (MEMMs) and other discriminative Markov models based on directed graphical models, which can be biased towards states with few successor states. We present iterative parameter estimation algorithms for conditional random fields and compare the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data.
**1. Introduction**

The need to segment and label sequences arises in many different problems in several scientific fields. Hidden Markov models (HMMs) and stochastic grammars are well understood and widely used probabilistic models for such problems. In computational biology, HMMs and stochastic grammars have been successfully used to align biological sequences, find sequences homologous to a known evolutionary family, and analyze RNA secondary structure (Durbin et al., 1998). In computational linguistics and computer science, HMMs and stochastic grammars have been applied to a wide variety of problems in text and speech processing, including topic segmentation, part-of-speech (POS) tagging, information extraction, and syntactic disambiguation (Manning & Schütze, 1999).
HMMs and stochastic grammars are generative models, assigning a joint probability to paired observation and label sequences; the parameters are typically trained to maximize the joint likelihood of training examples. To define a joint probability over observation and label sequences, a generative model needs to enumerate all possible observation sequences, typically requiring a representation in which observations are task-appropriate atomic entities, such as words or nucleotides. In particular, it is not practical to represent multiple interacting features or long-range dependencies of the observations, since the inference problem for such models is intractable.
This difficulty is one of the main motivations for looking at conditional models as an alternative. A conditional model specifies the probabilities of possible label sequences given an observation sequence. Therefore, it does not expend modeling effort on the observations, which at test time are fixed anyway. Furthermore, the conditional probability of the label sequence can depend on arbitrary, non-independent features of the observation sequence without forcing the model to account for the distribution of those dependencies. The chosen features may represent attributes at different levels of granularity of the same observations (for example, words and characters in English text), or aggregate properties of the observation sequence (for instance, text layout). The probability of a transition between labels may depend not only on the current observation, but also on past and future observations, if available. In contrast, generative models must make very strict independence assumptions on the observations, for instance conditional independence given the labels, to achieve tractability.
Maximum entropy Markov models (MEMMs) are conditional probabilistic sequence models that attain all of the above advantages (McCallum et al., 2000). In MEMMs, each source state[1] has an exponential model that takes the observation features as input, and outputs a distribution over possible next states. These exponential models are trained by an appropriate iterative scaling method in the maximum entropy framework. Previously published experimental results show MEMMs increasing recall and doubling precision relative to HMMs in a FAQ segmentation task.

[1] Output labels are associated with states; it is possible for several states to have the same label, but for simplicity in the rest of this paper we assume a one-to-one correspondence.
MEMMs and other non-generative finite-state models based on next-state classifiers, such as discriminative Markov models (Bottou, 1991), share a weakness we call here the _label bias problem_: the transitions leaving a given state compete only against each other, rather than against all other transitions in the model. In probabilistic terms, transition scores are the conditional probabilities of possible next states given the current state and the observation sequence. This per-state normalization of transition scores implies a “conservation of score mass” (Bottou, 1991) whereby all the mass that arrives at a state must be distributed among the possible successor states. An observation can affect which destination states get the mass, but not how much total mass to pass on. This causes a bias toward states with fewer outgoing transitions. In the extreme case, a state with a single outgoing transition effectively ignores the observation. In those cases, unlike in HMMs, Viterbi decoding cannot downgrade a branch based on observations after the branch point, and models with state-transition structures that have sparsely connected chains of states are not properly handled. The Markovian assumptions in MEMMs and similar state-conditional models insulate decisions at one state from future decisions in a way that does not match the actual dependencies between consecutive states.
This paper introduces _conditional random fields_ (CRFs), a sequence modeling framework that has all the advantages of MEMMs but also solves the label bias problem in a principled way. The critical difference between CRFs and MEMMs is that a MEMM uses per-state exponential models for the conditional probabilities of next states given the current state, while a CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence. Therefore, the weights of different features at different states can be traded off against each other.
We can also think of a CRF as a finite state model with unnormalized transition probabilities. However, unlike some other weighted finite-state approaches (LeCun et al., 1998), CRFs assign a well-defined probability distribution over possible labelings, trained by maximum likelihood or MAP estimation. Furthermore, the loss function is convex,[2] guaranteeing convergence to the global optimum. CRFs also generalize easily to analogues of stochastic context-free grammars that would be useful in such problems as RNA secondary structure prediction and natural language processing.

[2] In the case of fully observable states, as we are discussing here; if several states have the same label, the usual local maxima of Baum–Welch arise.
_Figure 1._ Label bias example, after (Bottou, 1991). For conciseness, we place observation–label pairs _o_:_l_ on transitions rather than states; a designated symbol represents the null output label.
We present the model, describe two training procedures and sketch a proof of convergence. We also give experimental results on synthetic data showing that CRFs solve the classical version of the label bias problem, and, more significantly, that CRFs perform better than HMMs and MEMMs when the true data distribution has higher-order dependencies than the model, as is often the case in practice. Finally, we confirm these results as well as the claimed advantages of conditional models by evaluating HMMs, MEMMs and CRFs with identical state structure on a part-of-speech tagging task.
**2. The Label Bias Problem**

Classical probabilistic automata (Paz, 1971), discriminative Markov models (Bottou, 1991), maximum entropy taggers (Ratnaparkhi, 1996), and MEMMs, as well as non-probabilistic sequence tagging and segmentation models with independently trained next-state classifiers (Punyakanok & Roth, 2001) are all potential victims of the label bias problem.
For example, Figure 1 represents a simple finite-state model designed to distinguish between the two words rib and rob. Suppose that the observation sequence is r i b. In the first time step, r matches both transitions from the start state, so the probability mass gets distributed roughly equally among those two transitions. Next we observe i. Both states 1 and 4 have only one outgoing transition. State 1 has seen this observation often in training, state 4 has almost never seen this observation; but like state 1, state 4 has no choice but to pass all its mass to its single outgoing transition, since it is not generating the observation, only conditioning on it. Thus, states with a single outgoing transition effectively ignore their observations. More generally, states with low-entropy next state distributions will take little notice of observations. Returning to the example, the top path and the bottom path will be about equally likely, independently of the observation sequence. If one of the two words is slightly more common in the training set, the transitions out of the start state will slightly prefer its corresponding transition, and that word’s state sequence will always win. This behavior is demonstrated experimentally in Section 5.
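The arithmetic of this failure is easy to reproduce. The sketch below is ours, not the paper's: it normalizes hypothetical transition scores per state, as a MEMM would, and shows that a state with a single successor assigns that successor probability 1 whatever the observation says.

```python
# Per-state normalization, as in a MEMM: each state's outgoing scores are
# normalized into a probability distribution over its successors.
def normalize(scores):
    total = sum(scores.values())
    return {state: score / total for state, score in scores.items()}

# Hypothetical unnormalized scores for the branch states of Figure 1.
# State 1 (on the "rib" path) strongly prefers observation "i"; state 4
# (on the "rob" path) strongly prefers "o" -- but each has one successor.
scores = {
    ("state1", "i"): {"state2": 100.0},
    ("state1", "o"): {"state2": 0.01},
    ("state4", "i"): {"state5": 0.01},
    ("state4", "o"): {"state5": 100.0},
}

for (state, obs), successors in scores.items():
    print(state, obs, normalize(successors))
# Every printed distribution is {successor: 1.0}: normalization erases the
# observation preference, so the branch chosen at the start state can never
# be revised by later observations.
```

However strong the raw preference, per-state normalization divides it away; only the ratio among a state's own successors survives, which is exactly the "conservation of score mass" described above.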
Léon Bottou (1991) discussed two solutions for the label bias problem. One is to change the state-transition structure of the model. In the above example we could collapse states 1 and 4, and delay the branching until we get a discriminating observation. This operation is a special case of determinization (Mohri, 1997), but determinization of weighted finite-state machines is not always possible, and even when possible, it may lead to combinatorial explosion. The other solution mentioned is to start with a fully-connected model and let the training procedure figure out a good structure. But that would preclude the use of prior structural knowledge that has proven so valuable in information extraction tasks (Freitag & McCallum, 2000).
Proper solutions require models that account for whole state sequences at once by letting some transitions “vote” more strongly than others depending on the corresponding observations. This implies that score mass will not be conserved, but instead individual transitions can “amplify” or “dampen” the mass they receive. In the above example, the transitions from the start state would have a very weak effect on path score, while the transitions from states 1 and 4 would have much stronger effects, amplifying or damping depending on the actual observation, and a proportionally higher contribution to the selection of the Viterbi path.[3]
In the related work section we discuss other heuristic model classes that account for state sequences globally rather than locally. To the best of our knowledge, CRFs are the only model class that does this in a purely probabilistic setting, with guaranteed global maximum likelihood convergence.
**3. Conditional Random Fields**

In what follows, $\mathbf{X}$ is a random variable over data sequences to be labeled, and $\mathbf{Y}$ is a random variable over corresponding label sequences. All components $\mathbf{Y}_i$ of $\mathbf{Y}$ are assumed to range over a finite label alphabet $\mathcal{Y}$. For example, $\mathbf{X}$ might range over natural language sentences and $\mathbf{Y}$ range over part-of-speech taggings of those sentences, with $\mathcal{Y}$ the set of possible part-of-speech tags. The random variables $\mathbf{X}$ and $\mathbf{Y}$ are jointly distributed, but in a discriminative framework we construct a conditional model $p(\mathbf{Y} \mid \mathbf{X})$ from paired observation and label sequences, and do not explicitly model the marginal $p(\mathbf{X})$.

**Definition.** _Let $G = (V, E)$ be a graph such that $\mathbf{Y} = (\mathbf{Y}_v)_{v \in V}$, so that $\mathbf{Y}$ is indexed by the vertices of $G$. Then $(\mathbf{X}, \mathbf{Y})$ is a conditional random field in case, when conditioned on $\mathbf{X}$, the random variables $\mathbf{Y}_v$ obey the Markov property with respect to the graph: $p(\mathbf{Y}_v \mid \mathbf{X}, \mathbf{Y}_w, w \neq v) = p(\mathbf{Y}_v \mid \mathbf{X}, \mathbf{Y}_w, w \sim v)$, where $w \sim v$ means that $w$ and $v$ are neighbors in $G$._

Thus, a CRF is a random field globally conditioned on the observation $\mathbf{X}$. Throughout the paper we tacitly assume that the graph $G$ is fixed. In the simplest and most important example for modeling sequences, $G$ is a simple chain or line: $G = (V = \{1, 2, \ldots, m\},\ E = \{(i, i+1)\})$. $\mathbf{X}$ may also have a natural graph structure; yet in general it is not necessary to assume that $\mathbf{X}$ and $\mathbf{Y}$ have the same graphical structure, or even that $\mathbf{X}$ has any graphical structure at all. However, in this paper we will be most concerned with sequences $\mathbf{X} = (\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n)$ and $\mathbf{Y} = (\mathbf{Y}_1, \mathbf{Y}_2, \ldots, \mathbf{Y}_n)$.

[3] Weighted determinization and minimization techniques shift transition weights while preserving overall path weight (Mohri, 2000); their connection to this discussion deserves further study.
If the graph $G = (V, E)$ of $\mathbf{Y}$ is a tree (of which a chain is the simplest example), its cliques are the edges and vertices. Therefore, by the fundamental theorem of random fields (Hammersley & Clifford, 1971), the joint distribution over the label sequence $\mathbf{Y}$ given $\mathbf{X}$ has the form given in (1) below.

As a particular case, we can construct an HMM-like CRF by defining one feature for each state pair $(y', y)$, and one feature for each state–observation pair $(y, x)$:

$$f_{y',y}(\langle u, v\rangle, \mathbf{y}|_{\langle u, v\rangle}, \mathbf{x}) = \delta(\mathbf{y}_u, y')\,\delta(\mathbf{y}_v, y)$$
$$g_{y,x}(v, \mathbf{y}|_v, \mathbf{x}) = \delta(\mathbf{y}_v, y)\,\delta(\mathbf{x}_v, x).$$

The corresponding parameters $\lambda_{y',y}$ and $\mu_{y,x}$ play a similar role to the (logarithms of the) usual HMM parameters $p(y' \mid y)$ and $p(x \mid y)$. Boltzmann chain models (Saul & Jordan, 1996; MacKay, 1996) have a similar form but use a single normalization constant to yield a joint distribution, whereas CRFs use the observation-dependent normalization $Z(\mathbf{x})$ for conditional distributions.

Although it encompasses HMM-like models, the class of conditional random fields is much more expressive, because it allows arbitrary dependencies on the observation
$$p_\theta(\mathbf{y} \mid \mathbf{x}) \;\propto\; \exp\left( \sum_{e \in E,\,k} \lambda_k f_k(e, \mathbf{y}|_e, \mathbf{x}) + \sum_{v \in V,\,k} \mu_k g_k(v, \mathbf{y}|_v, \mathbf{x}) \right), \tag{1}$$

where $\mathbf{x}$ is a data sequence, $\mathbf{y}$ a label sequence, and $\mathbf{y}|_S$ is the set of components of $\mathbf{y}$ associated with the vertices in subgraph $S$.
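To make the weighted sum in (1) concrete, here is a small sketch of ours (the sentence, labels, and weights are invented for illustration): HMM-like edge and vertex features over a chain, and the unnormalized score $\exp(\cdot)$ they assign to one labeling.

```python
import math

# Unnormalized chain-CRF score in the spirit of Eq. (1): sum weighted edge
# features f_k over edges (i-1, i) and weighted vertex features g_k over
# positions i, then exponentiate.
x = ["The", "dog", "barks"]
y = ["DET", "NOUN", "VERB"]

# Hypothetical weights: lambda over label pairs, mu over (label, word shape).
lam = {("DET", "NOUN"): 1.2, ("NOUN", "VERB"): 0.8}
mu = {("DET", "is_capitalized"): 0.5, ("NOUN", "lowercase"): 0.3}

def unnormalized_score(x, y):
    s = 0.0
    for i in range(1, len(y)):                     # edge features f_k
        s += lam.get((y[i - 1], y[i]), 0.0)
    for word, label in zip(x, y):                  # vertex features g_k
        shape = "is_capitalized" if word[0].isupper() else "lowercase"
        s += mu.get((label, shape), 0.0)
    return math.exp(s)

print(unnormalized_score(x, y))
# Proportional to p(y | x); the normalizer Z(x) would require summing this
# quantity over all label sequences for x.
```

Note that the vertex features inspect the raw word (here, its capitalization) without any generative model of words, which is exactly the freedom the conditional formulation buys.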
We assume that the _features_ $f_k$ and $g_k$ are given and fixed. For example, a Boolean vertex feature $g_k$ might be true if the word $\mathbf{X}_i$ is upper case and the tag $\mathbf{Y}_i$ is “proper noun.”

The parameter estimation problem is to determine the parameters $\theta = (\lambda_1, \lambda_2, \ldots; \mu_1, \mu_2, \ldots)$ from training data $\mathcal{D} = \{(\mathbf{x}^{(i)}, \mathbf{y}^{(i)})\}_{i=1}^{N}$ with empirical distribution $\tilde{p}(\mathbf{x}, \mathbf{y})$.
In Section 4 we describe an iterative scaling algorithm that maximizes the log-likelihood objective function $\mathcal{O}(\theta)$:

$$\mathcal{O}(\theta) \;=\; \sum_{i=1}^{N} \log p_\theta(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}) \;\propto\; \sum_{\mathbf{x},\mathbf{y}} \tilde{p}(\mathbf{x}, \mathbf{y}) \log p_\theta(\mathbf{y} \mid \mathbf{x}).$$
_Figure 2._ Graphical structures of simple HMMs (left), MEMMs (center), and the chain-structured case of CRFs (right) for sequences. An open circle indicates that the variable is not generated by the model.
sequence. In addition, the features do not need to completely specify a state or observation, so one might expect that the model can be estimated from less training data. Another attractive property is the convexity of the loss function; indeed, CRFs share all of the convexity properties of general maximum entropy models.
For the remainder of the paper we assume that the dependencies of $\mathbf{Y}$, conditioned on $\mathbf{X}$, form a chain. To simplify some expressions, we add special start and stop states $\mathbf{Y}_0 = \texttt{start}$ and $\mathbf{Y}_{n+1} = \texttt{stop}$. Thus, we will be using the graphical structure shown in Figure 2. For a chain structure, the conditional probability of a label sequence can be expressed concisely in matrix form, which will be useful in describing the parameter estimation and inference algorithms in Section 4. Suppose that $p_\theta(\mathbf{Y} \mid \mathbf{X})$ is a CRF given by (1). For each position $i$ in the observation sequence $\mathbf{x}$, we define the $|\mathcal{Y}| \times |\mathcal{Y}|$ matrix random variable $M_i(\mathbf{x}) = [M_i(y', y \mid \mathbf{x})]$ by
Both algorithms are based on the improved iterative scaling (IIS) algorithm of Della Pietra et al. (1997); the proof technique based on auxiliary functions can be extended to show convergence of the algorithms for CRFs.
Iterative scaling algorithms update the weights as $\lambda_k \leftarrow \lambda_k + \delta\lambda_k$ and $\mu_k \leftarrow \mu_k + \delta\mu_k$ for appropriately chosen $\delta\lambda_k$ and $\delta\mu_k$. In particular, the IIS update $\delta\lambda_k$ for an edge feature $f_k$ is the solution of
$$\tilde{E}[f_k] \;\overset{\mathrm{def}}{=}\; \sum_{\mathbf{x},\mathbf{y}} \tilde{p}(\mathbf{x},\mathbf{y}) \sum_{i=1}^{n+1} f_k(e_i, \mathbf{y}|_{e_i}, \mathbf{x}) \;=\; \sum_{\mathbf{x},\mathbf{y}} \tilde{p}(\mathbf{x})\, p(\mathbf{y} \mid \mathbf{x}) \sum_{i=1}^{n+1} f_k(e_i, \mathbf{y}|_{e_i}, \mathbf{x})\, e^{\delta\lambda_k T(\mathbf{x},\mathbf{y})},$$

where $T(\mathbf{x},\mathbf{y})$ is the _total feature count_

$$T(\mathbf{x},\mathbf{y}) \;\overset{\mathrm{def}}{=}\; \sum_{i,k} f_k(e_i, \mathbf{y}|_{e_i}, \mathbf{x}) + \sum_{i,k} g_k(v_i, \mathbf{y}|_{v_i}, \mathbf{x}).$$
$$M_i(y', y \mid \mathbf{x}) = \exp\big(\Lambda_i(y', y \mid \mathbf{x})\big)$$
$$\Lambda_i(y', y \mid \mathbf{x}) = \sum_k \lambda_k f_k(e_i, \mathbf{Y}|_{e_i} = (y', y), \mathbf{x}) + \sum_k \mu_k g_k(v_i, \mathbf{Y}|_{v_i} = y, \mathbf{x}),$$
where $e_i$ is the edge with labels $(\mathbf{Y}_{i-1}, \mathbf{Y}_i)$ and $v_i$ is the vertex with label $\mathbf{Y}_i$. In contrast to generative models, conditional models like CRFs do not need to enumerate over all possible observation sequences $\mathbf{x}$, and therefore these matrices can be computed directly as needed from a given training or test observation sequence $\mathbf{x}$ and the parameter vector $\theta$. Then the normalization (partition function) $Z_\theta(\mathbf{x})$ is the $(\texttt{start}, \texttt{stop})$ entry of the product of these matrices:

$$Z_\theta(\mathbf{x}) = \big(M_1(\mathbf{x})\, M_2(\mathbf{x}) \cdots M_{n+1}(\mathbf{x})\big)_{\texttt{start},\,\texttt{stop}}.$$

Using this notation, the conditional probability of a label sequence $\mathbf{y}$ is written as
The equations for the vertex feature updates $\delta\mu_k$ have similar form.

However, efficiently computing the exponential sums on the right-hand sides of these equations is problematic, because $T(\mathbf{x},\mathbf{y})$ is a global property of $(\mathbf{x},\mathbf{y})$, and dynamic programming will sum over sequences with potentially varying $T$. To deal with this, the first algorithm, Algorithm S, uses a “slack feature.” The second, Algorithm T, keeps track of partial $T$ totals.

For Algorithm S, we define the _slack feature_ by
$$s(\mathbf{x}, \mathbf{y}) \;\overset{\mathrm{def}}{=}\; S \;-\; \sum_i \sum_k f_k(e_i, \mathbf{y}|_{e_i}, \mathbf{x}) \;-\; \sum_i \sum_k g_k(v_i, \mathbf{y}|_{v_i}, \mathbf{x}),$$
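The point of the slack feature is purely bookkeeping: it pads the total feature count of every labeling up to the same constant $S$. A tiny check of ours (with made-up binary feature values) shows the padded total is always $S$:

```python
# Illustrative check (not the paper's code): with the slack feature added,
# the total feature count T(x, y) equals the constant S for every labeling.
def total_count(feature_values):
    """T(x, y) without slack: the sum of all (binary) feature values."""
    return sum(feature_values)

S = 10  # chosen so the slack is nonnegative on every training example
for feature_values in ([1, 1, 0, 1], [1] * 7, [0, 0, 1]):
    t = total_count(feature_values)
    slack = S - t                 # value of the slack feature s(x, y)
    assert slack >= 0
    print(t + slack)              # prints 10 for every labeling
```

With $T(\mathbf{x},\mathbf{y})$ forced to the constant $S$, the factor $e^{\delta\lambda_k T}$ in the update equation no longer varies across sequences, which is what makes the closed-form updates below possible.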
$$p_\theta(\mathbf{y} \mid \mathbf{x}) \;=\; \frac{\prod_{i=1}^{n+1} M_i(\mathbf{y}_{i-1}, \mathbf{y}_i \mid \mathbf{x})}{\Big(\prod_{i=1}^{n+1} M_i(\mathbf{x})\Big)_{\texttt{start},\,\texttt{stop}}},$$

where $\mathbf{y}_0 = \texttt{start}$ and $\mathbf{y}_{n+1} = \texttt{stop}$.
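The matrix form above is directly computable. The sketch below is ours, not the paper's code: it builds hypothetical random matrices $M_i$ (in a real CRF they would come from the weighted features), takes $Z_\theta(\mathbf{x})$ as the $(\texttt{start},\texttt{stop})$ entry of their product, and checks that the resulting $p_\theta(\mathbf{y} \mid \mathbf{x})$ sums to 1 over all labelings.

```python
import numpy as np

# For a chain CRF, each position i carries a |Y| x |Y| matrix M_i(x) of
# unnormalized scores; Z_theta(x) and p_theta(y | x) are matrix products.
rng = np.random.default_rng(0)
n_labels, n_pos = 3, 4
# Hypothetical M_i(x) = exp(Lambda_i); Lambda_i would be the weighted
# feature sums lambda_k f_k + mu_k g_k in a real model.
M = [np.exp(rng.normal(size=(n_labels, n_labels))) for _ in range(n_pos + 1)]

# Fold start and stop into label index 0 for brevity.
start = stop = 0
Z = np.linalg.multi_dot(M)[start, stop]   # (start, stop) entry of the product

def prob(y):
    """p_theta(y | x) for a labeling y, with y_0 = start, y_{n+1} = stop."""
    path = [start] + list(y) + [stop]
    score = 1.0
    for i, Mi in enumerate(M):
        score *= Mi[path[i], path[i + 1]]
    return score / Z

# The probabilities of all |Y|^n label sequences sum to 1.
total = sum(prob(y) for y in np.ndindex(*(n_labels,) * n_pos))
print(round(total, 6))  # 1.0
```

The sum equals 1 exactly because summing the path products over all labelings reproduces the same matrix product that defines $Z_\theta(\mathbf{x})$.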
**4. Parameter Estimation for CRFs**

We now describe two iterative scaling algorithms to find the parameter vector $\theta$ that maximizes the log-likelihood of the training data.

where $S$ is a constant chosen so that $s(\mathbf{x}^{(i)}, \mathbf{y}) \geq 0$ for all $\mathbf{y}$ and all observation vectors $\mathbf{x}^{(i)}$ in the training set, thus making $T(\mathbf{x}, \mathbf{y}) = S$. Feature $s$ is “global,” that is, it does not correspond to any particular edge or vertex.

For each index $i = 0, \ldots, n+1$ we now define the _forward vectors_ $\alpha_i(\mathbf{x})$ with base case
$$\alpha_0(y \mid \mathbf{x}) = \begin{cases} 1 & \text{if } y = \texttt{start} \\ 0 & \text{otherwise} \end{cases}$$

and recurrence

$$\alpha_i(\mathbf{x}) = \alpha_{i-1}(\mathbf{x})\, M_i(\mathbf{x}).$$

Similarly, the _backward vectors_ $\beta_i(\mathbf{x})$ are defined by
$\beta_k$ and $\gamma_k$ are the unique positive roots of the following polynomial equations

$$\sum_{t=0}^{T_{\max}} a_{k,t}\, \beta_k^{\,t} = \tilde{E} f_k, \qquad \sum_{t=0}^{T_{\max}} b_{k,t}\, \gamma_k^{\,t} = \tilde{E} g_k, \tag{2}$$
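Equation (2) asks for the unique positive root of a polynomial with nonnegative coefficients, which a plain Newton iteration finds quickly. The snippet below is our sketch, with made-up coefficients $a_{k,t}$ and a made-up target value standing in for $\tilde{E} f_k$:

```python
# Solve sum_t a[t] * beta**t = target for the unique positive root by
# Newton's method; nonnegative a[t] make the left side increasing for
# beta > 0, so the positive root is unique.
def positive_root(a, target, beta=1.0, iters=50):
    for _ in range(iters):
        value = sum(c * beta**t for t, c in enumerate(a))
        deriv = sum(t * c * beta**(t - 1) for t, c in enumerate(a) if t > 0)
        beta -= (value - target) / deriv
    return beta

a = [1.0, 0.0, 1.0]          # hypothetical coefficients a_{k,t}
target = 5.0                 # hypothetical expectation standing in for E~ f_k
beta = positive_root(a, target)
print(round(beta, 4))        # root of 1 + b^2 = 5, i.e. b = 2.0
```

Once the root $\beta_k$ is found, the corresponding weight update is its logarithm, by analogy with the closed-form $\delta\lambda_k$ of Algorithm S.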
$$\beta_{n+1}(y \mid \mathbf{x}) = \begin{cases} 1 & \text{if } y = \texttt{stop} \\ 0 & \text{otherwise} \end{cases}$$

and

$$\beta_i(\mathbf{x})^\top = M_{i+1}(\mathbf{x})\, \beta_{i+1}(\mathbf{x}).$$

With these definitions, the update equations are

$$\delta\lambda_k = \frac{1}{S} \log \frac{\tilde{E} f_k}{E f_k}, \qquad \delta\mu_k = \frac{1}{S} \log \frac{\tilde{E} g_k}{E g_k},$$
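The forward and backward recurrences are one line each in code. Under the same hypothetical matrix setup as before (our sketch, not the paper's code), a standard sanity check is that both recurrences recover the partition function $Z_\theta(\mathbf{x})$:

```python
import numpy as np

# Forward/backward vectors for a chain with matrices M_1..M_{n+1}; start
# and stop are folded into label index 0 for brevity.
rng = np.random.default_rng(1)
n_labels, n_pos = 3, 4
M = [np.exp(rng.normal(size=(n_labels, n_labels))) for _ in range(n_pos + 1)]

alpha = np.zeros(n_labels)
alpha[0] = 1.0                          # alpha_0(y) = 1 iff y = start
for Mi in M:
    alpha = alpha @ Mi                  # alpha_i = alpha_{i-1} M_i

beta = np.zeros(n_labels)
beta[0] = 1.0                           # beta_{n+1}(y) = 1 iff y = stop
for Mi in reversed(M):
    beta = Mi @ beta                    # beta_i^T = M_{i+1} beta_{i+1}

Z = np.linalg.multi_dot(M)[0, 0]        # (start, stop) entry of the product
print(np.isclose(alpha[0], Z), np.isclose(beta[0], Z))  # True True
```

Both checks hold because each recurrence simply accumulates the same matrix product from opposite ends; the products $\alpha_{i-1} M_i \beta_i$ in the expectations below then reuse these vectors instead of enumerating label sequences.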
where

$$E f_k = \sum_{\mathbf{x}} \tilde{p}(\mathbf{x}) \sum_{i=1}^{n+1} \sum_{y',y} f_k(e_i, \mathbf{y}|_{e_i} = (y', y), \mathbf{x})\; \frac{\alpha_{i-1}(y' \mid \mathbf{x})\, M_i(y', y \mid \mathbf{x})\, \beta_i(y \mid \mathbf{x})}{Z_\theta(\mathbf{x})}$$

$$E g_k = \sum_{\mathbf{x}} \tilde{p}(\mathbf{x}) \sum_{i=1}^{n} \sum_{y} g_k(v_i, \mathbf{y}|_{v_i} = y, \mathbf{x})\; \frac{\alpha_i(y \mid \mathbf{x})\, \beta_i(y \mid \mathbf{x})}{Z_\theta(\mathbf{x})}.$$
923
+ which can be easily computed by Newton’s method.
924
+
925
+
926
+ A single iteration of Algorithm S and Algorithm T has
927
+ roughly the same time and space complexity as the well
928
+ known Baum-Welch algorithm for HMMs. To prove convergence of our algorithms, we can derive an auxiliary
929
+ function to bound the change in likelihood from below; this
930
+ method is developed in detail by Della Pietra et al. (1997).
931
+ The full proof is somewhat detailed; however, here we give
932
+ an idea of how to derive the auxiliary function. To simplify
933
+ notation, we assume only edge features _fk_ with parameters
934
+ _λk_ .
935
+
936
+ Given two parameter settings _θ_ = ( _λ_ 1 _, λ_ 2 _, . . ._ ) and _θ_ _[￿]_ =
937
+ ( _λ_ 1 + _δλ_ 1 _, λ_ 2 + _δλ_ 2 _, . . ._ ), we bound from below the change
938
+ in the objective function with an _auxiliary function A_ ( _θ_ _[￿]_ _, θ_ )
939
+ as follows
940
+
941
+
942
+
943
+ _αi−_ 1( _y_ _[￿]_ _|_ **x** ) _Mi_ ( _y_ _[￿]_ _, y |_ **x** ) _βi_ ( _y |_ **x** )
944
+
945
+ _Zθ_ ~~(~~ ~~**x**~~ ~~)~~
946
+
947
+
948
+
949
+ _O_ ( _θ_ _[￿]_ ) _−O_ ( _θ_ ) =
950
+
951
+
952
+
953
+ ￿
954
+
955
+
956
+ **x** _,_ **y**
957
+
958
+
959
+
960
+ ￿ _p_ ( **x** _,_ **y** ) log _[p][θ][￿]_ [(] **[y]** _[ |]_ **[ x]** [)]
961
+
962
+
963
+
964
+ ~~_p_~~ _θ_ ~~(~~ ~~**y**~~ ~~**x**~~ ~~)~~
965
+ _|_
966
+
967
+
968
+
969
+ = ( _θ_ _[￿]_ _−_ _θ_ ) _·_ _Ef_ [￿] _−_
970
+
971
+
972
+
973
+ ￿ _p_ ( **x** ) log _[Z][θ][￿]_ [(] **[x]** [)]
974
+
975
+
976
+
977
+ _Zθ_ ~~(~~ ~~**x**~~ ~~)~~
978
+
979
+
980
+
981
+ _αi_ ( _y_ **x** ) _βi_ ( _y_ **x** )
982
+ _|_ _|_ .
983
+
984
+ _Zθ_ ~~(~~ ~~**x**~~ ~~)~~
985
+
986
+
987
+
988
+ _≥_ ( _θ_ _[￿]_ _−_ _θ_ ) _·_ _Ef_ [￿] _−_
989
+
990
+
991
+
992
+ ￿
993
+
994
+
995
+ **x**
996
+
997
+ ￿
998
+
999
+
1000
+ **x**
1001
+
1002
+
1003
+
1004
+ ￿ _p_ ( **x** ) _[Z][θ][￿]_ [(] **[x]** [)]
1005
+
1006
+
1007
+
1008
+ ￿ _p_ ( **x** )
1009
+
1010
+
1011
+
1012
+ _Zθ_ ~~(~~ ~~**x**~~ ~~)~~
1013
+
1014
+
1015
+
1016
+ ￿
1017
+
1018
+
1019
+
1020
+ **x**
1021
+
1022
+ ￿
1023
+
1024
+
1025
+ **x** _,_ **y** _,k_
1026
+
1027
+
1028
+
1029
+ ￿ _p_ ( **x** ) _pθ_ ( **y** **x** ) _[f][k]_ [(] **[x]** _[,]_ **[ y]** [)]
1030
+ _|_ _T_ ~~(~~ ~~**x**~~ ~~)~~ _[e][δλ][k][T]_ [ (] **[x]** [)]
1031
+
1032
+
1033
+
1034
+ The factors involving the forward and backward vectors in
1035
+ the above equations have the same meaning as for standard
1036
+ hidden Markov models. For example,
1037
+
1038
+ _pθ_ ( **Y** _i_ = _y_ **x** ) = _αi_ ( _y |_ **x** ) _βi_ ( _y |_ **x** )
1039
+ _|_ _Zθ_ ~~(~~ ~~**x**~~ ~~)~~
1040
+
1041
+
1042
+ is the marginal probability of label **Y** _i_ = _y_ given that the
1043
+ observation sequence is **x** . This algorithm is closely related
1044
+ to the algorithm of Darroch and Ratcliff (1972), and MART
1045
+ algorithms used in image reconstruction.
1046
+
1047
+
1048
+ The constant _S_ in Algorithm S can be quite large, since in
1049
+ practice it is proportional to the length of the longest training observation sequence. As a result, the algorithm may
1050
+ converge slowly, taking very small steps toward the maximum in each iteration. If the length of the observations **x** [(] _[i]_ [)]
1051
+
1052
+ and the number of active features varies greatly, a fasterconverging algorithm can be obtained by keeping track of
1053
+ feature totals for each observation sequence separately.
1054
+
1055
+
1056
+ Let _T_ ( **x** ) = maxdef **y** _T_ ( **x** _,_ **y** ). Algorithm T accumulates
1057
+
1058
+ feature expectations into counters indexed by _T_ ( **x** ). More
1059
+ specifically, we use the forward-backward recurrences just
1060
+ introduced to compute the expectations _ak,t_ of feature _fk_
1061
+ and _bk,t_ of feature _gk_ given that _T_ ( **x** ) = _t_ . Then our parameter updates are _δλk_ = log _βk_ and _δµk_ = log _γk_, where
1062
+
1063
+
1064
+
1065
+ = _δλ ·_ _Ef_ [￿] _−_
1066
+
1067
+
1068
+ _≥_ _δλ ·_ _Ef_ [￿] _−_
1069
+
1070
+
1071
+
1072
+ _pθ_ ( **y** **x** ) _e_ _[δλ][·][f]_ [(] **[x]** _[,]_ **[y]** [)]
1073
+ _|_
1074
+
1075
+
1076
+
1077
+ ￿
1078
+
1079
+
1080
+ **y**
1081
+
1082
+
1083
+
1084
+ =def _A_ ( _θ_ _[￿]_ _, θ_ )
1085
+
1086
+ where the inequalities follow from the convexity of _−_ log
1087
+ and exp. Differentiating _A_ with respect to _δλk_ and setting
1088
+ the result to zero yields equation (2).
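The α–β recurrences and the marginal formula above are straightforward to implement. The sketch below is ours, not the paper's implementation (all function and variable names are invented): given the transition matrices _Mi_(**x**), it computes the forward vectors, backward vectors, the normalization _Zθ_(**x**), and the per-position marginals _pθ_(**Y**_i_ = _y_ | **x**).

```python
def forward_backward(M, start, stop):
    """Forward-backward for a linear-chain CRF (pure-Python sketch).

    M is a list of n+1 non-negative matrices (lists of lists) over an
    augmented label set that includes the `start` and `stop` indices:
    M[i][y_prev][y] = M_{i+1}(y_prev, y | x).
    """
    k = len(M[0])
    n1 = len(M)  # n + 1 factors

    # Forward vectors: alpha_0 is the indicator of `start`,
    # then alpha_i = alpha_{i-1} M_i.
    alpha = [[0.0] * k for _ in range(n1 + 1)]
    alpha[0][start] = 1.0
    for i in range(n1):
        for y in range(k):
            alpha[i + 1][y] = sum(alpha[i][yp] * M[i][yp][y] for yp in range(k))

    # Backward vectors: beta_{n+1} is the indicator of `stop`,
    # then beta_i = M_{i+1} beta_{i+1}.
    beta = [[0.0] * k for _ in range(n1 + 1)]
    beta[n1][stop] = 1.0
    for i in range(n1 - 1, -1, -1):
        for yp in range(k):
            beta[i][yp] = sum(M[i][yp][y] * beta[i + 1][y] for y in range(k))

    Z = alpha[n1][stop]  # normalization Z_theta(x)

    # Marginals p(Y_i = y | x) = alpha_i(y) beta_i(y) / Z.
    marg = [[alpha[i][y] * beta[i][y] / Z for y in range(k)]
            for i in range(n1 + 1)]
    return alpha, beta, Z, marg
```

With start/stop folded into the label set, `marg[i]` sums to one at every position and `Z` equals the (start, stop) entry of the product of the _Mi_.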
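The Algorithm T update needs the unique positive root of the polynomial in equation (2). A minimal Newton iteration for that step might look like the following (a sketch under the assumption that the coefficients _ak,t_ are non-negative with at least one positive coefficient at degree ≥ 1, which makes the polynomial monotone increasing for _β_ > 0; names are ours):

```python
def newton_root(a, target, beta0=1.0, tol=1e-10, max_iter=100):
    """Solve sum_t a[t] * beta**t = target for the unique positive root.

    Assumes a[t] >= 0 (the accumulated expectations a_{k,t}) and that
    some a[t] > 0 for t >= 1, so the polynomial is increasing and convex
    on beta > 0 and Newton's method converges from a positive start.
    """
    beta = beta0
    for _ in range(max_iter):
        f = sum(c * beta**t for t, c in enumerate(a)) - target
        fp = sum(t * c * beta ** (t - 1) for t, c in enumerate(a) if t > 0)
        step = f / fp
        beta -= step
        if abs(step) < tol:
            break
    return beta
```

The parameter update is then _δλk_ = log _βk_ (and analogously _δµk_ = log _γk_).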
**5. Experiments**

We first discuss two sets of experiments with synthetic data that highlight the differences between CRFs and MEMMs. The first experiments are a direct verification of the label bias problem discussed in Section 2. In the second set of experiments, we generate synthetic data using randomly chosen hidden Markov models, each of which is a mixture of a first-order and second-order model. Competing _first-order_ models are then trained and compared on test data. As the data becomes more second-order, the test error rates of the trained models increase. This experiment corresponds to the common modeling practice of approximating complex local and long-range dependencies, as occur in natural data, by small-order Markov models. Our results clearly indicate that even when the models are parameterized in exactly the same way, CRFs are more robust to inaccurate modeling assumptions than MEMMs or HMMs, and resolve the label bias problem, which affects the performance of MEMMs. To avoid confusion of different effects, the MEMMs and CRFs in these experiments _do not_ use overlapping features of the observations. Finally, in a set of POS tagging experiments, we confirm the advantage of CRFs over MEMMs. We also show that the addition of overlapping features to CRFs and MEMMs allows them to perform much better than HMMs, as already shown for MEMMs by McCallum et al. (2000).

_[Figure 3: three scatter plots comparing pairs of models, each axis ranging over 0–60% error; the plot images did not survive extraction.]_

_Figure 3._ Plots of 2×2 error rates for HMMs, CRFs, and MEMMs on randomly generated synthetic data sets, as described in Section 5.2. As the data becomes "more second order," the error rates of the test models increase. As shown in the left plot, the CRF typically significantly outperforms the MEMM. The center plot shows that the HMM outperforms the MEMM. In the right plot, each open square represents a data set with _α_ < 1/2, and a solid circle indicates a data set with _α_ ≥ 1/2. The plot shows that when the data is mostly second order (_α_ ≥ 1/2), the discriminatively trained CRF typically outperforms the HMM. These experiments are not designed to demonstrate the advantages of the additional representational power of CRFs and MEMMs relative to HMMs.
**5.1 Modeling label bias**

We generate data from a simple HMM which encodes a noisy version of the finite-state network in Figure 1. Each state emits its designated symbol with probability 29/32 and any of the other symbols with probability 1/32. We train both an MEMM and a CRF with the same topologies on the data generated by the HMM. The observation features are simply the identity of the observation symbols. In a typical run using 2,000 training and 500 test samples, trained to convergence of the iterative scaling algorithm, the CRF error is 4.6% while the MEMM error is 42%, showing that the MEMM fails to discriminate between the two branches.
**5.2 Modeling mixed-order sources**

For these results, we use five labels, a-e ( _|Y|_ = 5), and 26 observation values, A-Z ( _|X|_ = 26); however, the results were qualitatively the same over a range of sizes for _Y_ and _X_. We generate data from a mixed-order HMM with state transition probabilities given by

$$p_\alpha(y_i \mid y_{i-1}, y_{i-2}) = \alpha\, p_2(y_i \mid y_{i-1}, y_{i-2}) + (1 - \alpha)\, p_1(y_i \mid y_{i-1})$$

and, similarly, emission probabilities given by

$$p_\alpha(x_i \mid y_i, x_{i-1}) = \alpha\, p_2(x_i \mid y_i, x_{i-1}) + (1 - \alpha)\, p_1(x_i \mid y_i).$$

Thus, for _α_ = 0 we have a standard first-order HMM. In order to limit the size of the Bayes error rate for the resulting models, the conditional probability tables _pα_ are constrained to be sparse. In particular, _pα_(· | _y_, _y′_) can have at most two nonzero entries, for each _y_, _y′_, and _pα_(· | _y_, _x′_) can have at most three nonzero entries for each _y_, _x′_. For each randomly generated model, a sample of 1,000 sequences of length 25 is generated for training and testing.
On each randomly generated training set, a CRF is trained using Algorithm S. (Note that since the length of the sequences and number of active features is constant, Algorithms S and T are identical.) The algorithm is fairly slow to converge, typically taking approximately 500 iterations for the model to stabilize. On the 500 MHz Pentium PC used in our experiments, each iteration takes approximately 0.2 seconds. On the same data an MEMM is trained using iterative scaling, which does not require forward-backward calculations, and is thus more efficient. The MEMM training converges more quickly, stabilizing after approximately 100 iterations. For each model, the Viterbi algorithm is used to label a test set; the experimental results do not significantly change when using forward-backward decoding to minimize the per-symbol error rate.

The results of several runs are presented in Figure 3. Each plot compares two classes of models, with each point indicating the error rate for a single test set. As _α_ increases, the error rates generally increase, as the first-order models fail to fit the second-order data. The figure compares models parameterized as _µy_, _λy′,y_, and _λy′,y,x_; results for models parameterized as _µy_, _λy′,y_, and _µy,x_ are qualitatively the same. As shown in the first graph, the CRF generally outperforms the MEMM, often by a wide margin of 10%–20% relative error. (The points for very small error rate, with _α_ < 0.01, where the MEMM does better than the CRF, are suspected to be the result of an insufficient number of training iterations for the CRF.)
| model | error | oov error |
|---|---|---|
| HMM | 5.69% | 45.99% |
| MEMM | 6.37% | 54.61% |
| CRF | 5.55% | 48.05% |
| MEMM+ | 4.81% | 26.99% |
| CRF+ | 4.27% | 23.76% |

+Using spelling features

_Figure 4._ Per-word error rates for POS tagging on the Penn treebank, using first-order models trained on 50% of the 1.1 million word corpus. The oov rate is 5.45%.
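The two error rates reported in Figure 4 can be computed with a few lines. The sketch below is ours (names invented): out-of-vocabulary (oov) words are test words that never appear in the training set, and their errors are tallied separately.

```python
def error_rates(predictions, gold, words, train_vocab):
    """Per-word error rate, with oov words (test words absent from
    train_vocab) reported separately, as in Figure 4."""
    total = wrong = oov_total = oov_wrong = 0
    for p, g, w in zip(predictions, gold, words):
        total += 1
        err = p != g
        wrong += err
        if w not in train_vocab:
            oov_total += 1
            oov_wrong += err
    return wrong / total, oov_wrong / max(oov_total, 1)
```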
**5.3 POS tagging experiments**

To confirm our synthetic data results, we also compared HMMs, MEMMs and CRFs on Penn treebank POS tagging, where each word in a given input sentence must be labeled with one of 45 syntactic tags.

We carried out two sets of experiments with this natural language data. First, we trained first-order HMM, MEMM, and CRF models as in the synthetic data experiments, introducing parameters _µy,x_ for each tag-word pair and _λy′,y_ for each tag-tag pair in the training set. The results are consistent with what is observed on synthetic data: the HMM outperforms the MEMM, as a consequence of the label bias problem, while the CRF outperforms the HMM. The error rates for training runs using a 50%-50% train-test split are shown in Figure 4; the results are qualitatively similar for other splits of the data. The error rates on out-of-vocabulary (oov) words, which are not observed in the training set, are reported separately.

In the second set of experiments, we take advantage of the power of conditional models by adding a small set of orthographic features: whether a spelling begins with a number or upper case letter, whether it contains a hyphen, and whether it ends in one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies. Here we find, as expected, that both the MEMM and the CRF benefit significantly from the use of these features, with the overall error rate reduced by around 25%, and the out-of-vocabulary error rate reduced by around 50%.

One usually starts training from the all zero parameter vector, corresponding to the uniform distribution. However, for these datasets, CRF training with that initialization is much slower than MEMM training. Fortunately, we can use the optimal MEMM parameter vector as a starting point for training the corresponding CRF. In Figure 4, MEMM+ was trained to convergence in around 100 iterations. Its parameters were then used to initialize the training of CRF+, which converged in 1,000 iterations. In contrast, training of the same CRF from the uniform distribution had not converged even after 2,000 iterations.
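The orthographic feature set described above is simple to reproduce. A minimal sketch (ours; the feature names are invented, the tests themselves follow the list in the text):

```python
SUFFIXES = ("-ing", "-ogy", "-ed", "-s", "-ly", "-ion", "-tion", "-ity", "-ies")

def spelling_features(word):
    """Boolean orthographic features of a spelling: initial digit,
    initial upper-case letter, internal hyphen, and the listed suffixes."""
    feats = {
        "init_number": word[:1].isdigit(),
        "init_upper": word[:1].isupper(),
        "has_hyphen": "-" in word,
    }
    for suf in SUFFIXES:
        feats["suffix" + suf] = word.endswith(suf[1:])  # drop the leading "-"
    return feats
```

Each true feature fires as an overlapping observation feature for the MEMM and CRF models.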
**6. Further Aspects of CRFs**

Many further aspects of CRFs are attractive for applications and deserve further study. In this section we briefly mention just two.

Conditional random fields can be trained using the exponential loss objective function used by the AdaBoost algorithm (Freund & Schapire, 1997). Typically, boosting is applied to classification problems with a small, fixed number of classes; applications of boosting to sequence labeling have treated each label as a separate classification problem (Abney et al., 1999). However, it is possible to apply the parallel update algorithm of Collins et al. (2000) to optimize the per-sequence exponential loss. This requires a forward-backward algorithm to compute efficiently certain feature expectations, along the lines of Algorithm T, except that each feature requires a separate set of forward and backward accumulators.

Another attractive aspect of CRFs is that one can implement efficient feature selection and feature induction algorithms for them. That is, rather than specifying in advance which features of (**X**, **Y**) to use, we could start from feature-generating rules and evaluate the benefit of generated features automatically on data. In particular, the feature induction algorithms presented in Della Pietra et al. (1997) can be adapted to fit the dynamic programming techniques of conditional random fields.
**7. Related Work and Conclusions**

As far as we know, the present work is the first to combine the benefits of conditional models with the global normalization of random field models. Other applications of exponential models in sequence modeling have either attempted to build generative models (Rosenfeld, 1997), which involve a hard normalization problem, or adopted local conditional models (Berger et al., 1996; Ratnaparkhi, 1996; McCallum et al., 2000) that may suffer from label bias.

Non-probabilistic local decision models have also been widely used in segmentation and tagging (Brill, 1995; Roth, 1998; Abney et al., 1999). Because of the computational complexity of global training, these models are only trained to minimize the error of individual label decisions assuming that neighboring labels are correctly chosen. Label bias would be expected to be a problem here too.

An alternative approach to discriminative modeling of sequence labeling is to use a permissive generative model, which can only model local dependencies, to produce a list of candidates, and then use a more global discriminative model to rerank those candidates. This approach is standard in large-vocabulary speech recognition (Schwartz & Austin, 1993), and has also been proposed for parsing (Collins, 2000). However, these methods fail when the correct output is pruned away in the first pass.

Closest to our proposal are gradient-descent methods that adjust the parameters of all of the local classifiers to minimize a smooth loss function (e.g., quadratic loss) combining loss terms for each label. If state dependencies are local, this can be done efficiently with dynamic programming (LeCun et al., 1998). Such methods should alleviate label bias. However, their loss function is not convex, so they may get stuck in local minima.

Conditional random fields offer a unique combination of properties: discriminatively trained models for sequence segmentation and labeling; combination of arbitrary, overlapping and agglomerative observation features from both the past and future; efficient training and decoding based on dynamic programming; and parameter estimation guaranteed to find the global optimum. Their main current limitation is the slow convergence of the training algorithm relative to MEMMs, let alone to HMMs, for which training on fully observed data is very efficient. In future work, we plan to investigate alternative training methods such as the update methods of Collins et al. (2000) and refinements on using a MEMM as starting point as we did in some of our experiments. More general tree-structured random fields, feature induction methods, and further natural data evaluations will also be investigated.
**Acknowledgments**

We thank Yoshua Bengio, Léon Bottou, Michael Collins and Yann LeCun for alerting us to what we call here the label bias problem. We also thank Andrew Ng and Sebastian Thrun for discussions related to this work.
**References**

Abney, S., Schapire, R. E., & Singer, Y. (1999). Boosting applied to tagging and PP attachment. _Proc. EMNLP-VLC_. New Brunswick, New Jersey: Association for Computational Linguistics.

Berger, A. L., Della Pietra, S. A., & Della Pietra, V. J. (1996). A maximum entropy approach to natural language processing. _Computational Linguistics_, _22_.

Bottou, L. (1991). _Une approche théorique de l'apprentissage connexionniste: Applications à la reconnaissance de la parole_. Doctoral dissertation, Université de Paris XI.

Brill, E. (1995). Transformation-based error-driven learning and natural language processing: a case study in part of speech tagging. _Computational Linguistics_, _21_, 543–565.

Collins, M. (2000). Discriminative reranking for natural language parsing. _Proc. ICML 2000_. Stanford, California.

Collins, M., Schapire, R., & Singer, Y. (2000). Logistic regression, AdaBoost, and Bregman distances. _Proc. 13th COLT_.

Darroch, J. N., & Ratcliff, D. (1972). Generalized iterative scaling for log-linear models. _The Annals of Mathematical Statistics_, _43_, 1470–1480.

Della Pietra, S., Della Pietra, V., & Lafferty, J. (1997). Inducing features of random fields. _IEEE Transactions on Pattern Analysis and Machine Intelligence_, _19_, 380–393.

Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998). _Biological sequence analysis: Probabilistic models of proteins and nucleic acids_. Cambridge University Press.

Freitag, D., & McCallum, A. (2000). Information extraction with HMM structures learned by stochastic optimization. _Proc. AAAI 2000_.

Freund, Y., & Schapire, R. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. _Journal of Computer and System Sciences_, _55_, 119–139.

Hammersley, J., & Clifford, P. (1971). Markov fields on finite graphs and lattices. Unpublished manuscript.

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. _Proceedings of the IEEE_, _86_, 2278–2324.

MacKay, D. J. (1996). Equivalence of linear Boltzmann chains and hidden Markov models. _Neural Computation_, _8_, 178–181.

Manning, C. D., & Schütze, H. (1999). _Foundations of statistical natural language processing_. Cambridge, Massachusetts: MIT Press.

McCallum, A., Freitag, D., & Pereira, F. (2000). Maximum entropy Markov models for information extraction and segmentation. _Proc. ICML 2000_ (pp. 591–598). Stanford, California.

Mohri, M. (1997). Finite-state transducers in language and speech processing. _Computational Linguistics_, _23_.

Mohri, M. (2000). Minimization algorithms for sequential transducers. _Theoretical Computer Science_, _234_, 177–201.

Paz, A. (1971). _Introduction to probabilistic automata_. Academic Press.

Punyakanok, V., & Roth, D. (2001). The use of classifiers in sequential inference. _NIPS 13_. Forthcoming.

Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. _Proc. EMNLP_. New Brunswick, New Jersey: Association for Computational Linguistics.

Rosenfeld, R. (1997). A whole sentence maximum entropy language model. _Proceedings of the IEEE Workshop on Speech Recognition and Understanding_. Santa Barbara, California.

Roth, D. (1998). Learning to resolve natural language ambiguities: A unified approach. _Proc. 15th AAAI_ (pp. 806–813). Menlo Park, California: AAAI Press.

Saul, L., & Jordan, M. (1996). Boltzmann chains and hidden Markov models. _Advances in Neural Information Processing Systems 7_. MIT Press.

Schwartz, R., & Austin, S. (1993). A comparison of several approximate algorithms for finding multiple (N-BEST) sentence hypotheses. _Proc. ICASSP_. Minneapolis, MN.
references/2001.icml.lafferty/paper.tex ADDED
@@ -0,0 +1,1497 @@
1
+ \documentclass[11pt]{article}
2
+ \usepackage[utf8]{inputenc}
3
+ \usepackage{amsmath,amssymb}
4
+ \usepackage{booktabs}
5
+ \usepackage{hyperref}
6
+
7
+ \begin{document}
8
+
9
+ \section*{\textbf{Conditional Random Fields: Probabilistic Models} \textbf{for Segmenting and Labeling Sequence Data}}
10
+
11
+ \textbf{John Lafferty} _[†∗]_ LAFFERTY@CS.CMU.EDU
12
+ \textbf{Andrew McCallum} _[∗†]_ MCCALLUM@WHIZBANG.COM
13
+ \textbf{Fernando Pereira} _[∗‡]_ FPEREIRA@WHIZBANG.COM
14
+ _∗_ WhizBang! Labs–Research, 4616 Henry Street, Pittsburgh, PA 15213 USA
15
+
16
+ _†_ School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA
17
+
18
+ _‡_ Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104 USA
19
+
20
+
21
+
22
+ \textbf{Abstract}
23
+
24
+
25
+ We present _conditional random fields_, a framework for building probabilistic models to segment and label sequence data. Conditional random fields offer several advantages over hidden Markov models and stochastic grammars
26
+ for such tasks, including the ability to relax
27
+ strong independence assumptions made in those
28
+ models. Conditional random fields also avoid
29
+ a fundamental limitation of maximum entropy
30
+ Markov models (MEMMs) and other discriminative Markov models based on directed graphical models, which can be biased towards states
31
+ with few successor states. We present iterative
32
+ parameter estimation algorithms for conditional
33
+ random fields and compare the performance of
34
+ the resulting models to HMMs and MEMMs on
35
+ synthetic and natural-language data.
36
+
37
+
38
+ \textbf{1. Introduction}
39
+
40
+
41
+ The need to segment and label sequences arises in many
42
+ different problems in several scientific fields. Hidden
43
+ Markov models (HMMs) and stochastic grammars are well
44
+ understood and widely used probabilistic models for such
45
+ problems. In computational biology, HMMs and stochastic grammars have been successfully used to align biological sequences, find sequences homologous to a known
46
+ evolutionary family, and analyze RNA secondary structure
47
+ (Durbin et al., 1998). In computational linguistics and
48
+ computer science, HMMs and stochastic grammars have
49
+ been applied to a wide variety of problems in text and
50
+ speech processing, including topic segmentation, part-ofspeech (POS) tagging, information extraction, and syntactic disambiguation (Manning & Sch¨utze, 1999).
51
+
52
+
53
+ HMMs and stochastic grammars are generative models, assigning a joint probability to paired observation and label
54
+ sequences; the parameters are typically trained to maxi
55
+
56
+
57
+ mize the joint likelihood of training examples. To define
58
+ a joint probability over observation and label sequences,
59
+ a generative model needs to enumerate all possible observation sequences, typically requiring a representation
60
+ in which observations are task-appropriate atomic entities,
61
+ such as words or nucleotides. In particular, it is not practical to represent multiple interacting features or long-range
62
+ dependencies of the observations, since the inference problem for such models is intractable.
63
+
64
+
65
+ This difficulty is one of the main motivations for looking at
66
+ conditional models as an alternative. A conditional model
67
+ specifies the probabilities of possible label sequences given
68
+ an observation sequence. Therefore, it does not expend
69
+ modeling effort on the observations, which at test time
70
+ are fixed anyway. Furthermore, the conditional probability of the label sequence can depend on arbitrary, nonindependent features of the observation sequence without
71
+ forcing the model to account for the distribution of those
72
+ dependencies. The chosen features may represent attributes
73
+ at different levels of granularity of the same observations
74
+ (for example, words and characters in English text), or
75
+ aggregate properties of the observation sequence (for instance, text layout). The probability of a transition between
76
+ labels may depend not only on the current observation,
77
+ but also on past and future observations, if available. In
78
+ contrast, generative models must make very strict independence assumptions on the observations, for instance conditional independence given the labels, to achieve tractability.
79
+
80
+
81
Maximum entropy Markov models (MEMMs) are conditional probabilistic sequence models that attain all of the above advantages (McCallum et al., 2000). In MEMMs, each source state[1] has an exponential model that takes the observation features as input, and outputs a distribution over possible next states. These exponential models are trained by an appropriate iterative scaling method in the maximum entropy framework. Previously published experimental results show MEMMs increasing recall and doubling precision relative to HMMs in a FAQ segmentation task.

[1] Output labels are associated with states; it is possible for several states to have the same label, but for simplicity in the rest of this paper we assume a one-to-one correspondence.

MEMMs and other non-generative finite-state models based on next-state classifiers, such as discriminative Markov models (Bottou, 1991), share a weakness we call here the _label bias problem_: the transitions leaving a given state compete only against each other, rather than against all other transitions in the model. In probabilistic terms, transition scores are the conditional probabilities of possible next states given the current state and the observation sequence. This per-state normalization of transition scores implies a “conservation of score mass” (Bottou, 1991) whereby all the mass that arrives at a state must be distributed among the possible successor states. An observation can affect which destination states get the mass, but not how much total mass to pass on. This causes a bias toward states with fewer outgoing transitions. In the extreme case, a state with a single outgoing transition effectively ignores the observation. In those cases, unlike in HMMs, Viterbi decoding cannot downgrade a branch based on observations after the branch point, and models with state-transition structures that have sparsely connected chains of states are not properly handled. The Markovian assumptions in MEMMs and similar state-conditional models insulate decisions at one state from future decisions in a way that does not match the actual dependencies between consecutive states.

This paper introduces _conditional random fields_ (CRFs), a sequence modeling framework that has all the advantages of MEMMs but also solves the label bias problem in a principled way. The critical difference between CRFs and MEMMs is that a MEMM uses per-state exponential models for the conditional probabilities of next states given the current state, while a CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence. Therefore, the weights of different features at different states can be traded off against each other.

We can also think of a CRF as a finite state model with unnormalized transition probabilities. However, unlike some other weighted finite-state approaches (LeCun et al., 1998), CRFs assign a well-defined probability distribution over possible labelings, trained by maximum likelihood or MAP estimation. Furthermore, the loss function is convex,[2] guaranteeing convergence to the global optimum. CRFs also generalize easily to analogues of stochastic context-free grammars that would be useful in such problems as RNA secondary structure prediction and natural language processing.

[2] In the case of fully observable states, as we are discussing here; if several states have the same label, the usual local maxima of Baum-Welch arise.

_Figure 1._ Label bias example, after (Bottou, 1991). For conciseness, we place observation-label pairs o:l on transitions rather than states; the symbol ‘_’ represents the null output label.

We present the model, describe two training procedures and sketch a proof of convergence. We also give experimental results on synthetic data showing that CRFs solve the classical version of the label bias problem, and, more significantly, that CRFs perform better than HMMs and MEMMs when the true data distribution has higher-order dependencies than the model, as is often the case in practice. Finally, we confirm these results as well as the claimed advantages of conditional models by evaluating HMMs, MEMMs and CRFs with identical state structure on a part-of-speech tagging task.

\textbf{2. The Label Bias Problem}

Classical probabilistic automata (Paz, 1971), discriminative Markov models (Bottou, 1991), maximum entropy taggers (Ratnaparkhi, 1996), and MEMMs, as well as non-probabilistic sequence tagging and segmentation models with independently trained next-state classifiers (Punyakanok & Roth, 2001) are all potential victims of the label bias problem.

For example, Figure 1 represents a simple finite-state model designed to distinguish between the two words rib and rob. Suppose that the observation sequence is r i b. In the first time step, r matches both transitions from the start state, so the probability mass gets distributed roughly equally among those two transitions. Next we observe i. Both states 1 and 4 have only one outgoing transition. State 1 has seen this observation often in training, state 4 has almost never seen this observation; but like state 1, state 4 has no choice but to pass all its mass to its single outgoing transition, since it is not generating the observation, only conditioning on it. Thus, states with a single outgoing transition effectively ignore their observations. More generally, states with low-entropy next state distributions will take little notice of observations. Returning to the example, the top path and the bottom path will be about equally likely, independently of the observation sequence. If one of the two words is slightly more common in the training set, the transitions out of the start state will slightly prefer its corresponding transition, and that word’s state sequence will always win. This behavior is demonstrated experimentally in Section 5.

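The effect can be reproduced with a few lines of arithmetic. The sketch below is ours, with made-up transition probabilities, not numbers from the paper: it scores the two state paths of Figure 1 under a per-state-normalized model. The interior states each have a single outgoing transition and so contribute a factor of 1 whatever the observation, leaving the start-state preference to decide the path.

```python
# Hypothetical per-state-normalized scores for the Figure 1 network
# (illustrative numbers, not learned from data).

def next_state_probs(state, obs):
    # Start state: both branches begin with 'r', so the observation
    # cannot discriminate; assume "rib" was slightly more frequent
    # in training.
    if state == 0:
        return {1: 0.55, 4: 0.45}   # 1 = "rib" branch, 4 = "rob" branch
    # All other states have a single outgoing transition: per-state
    # normalization forces probability 1 regardless of the observation.
    single = {1: 2, 2: 3, 4: 5, 5: 3}
    return {single[state]: 1.0}

def path_prob(states, obs_seq):
    """Probability of a state path under the per-state-normalized model."""
    p = 1.0
    for s, s_next, o in zip(states, states[1:], obs_seq):
        p *= next_state_probs(s, o).get(s_next, 0.0)
    return p

rib_path = [0, 1, 2, 3]   # spells "rib"
rob_path = [0, 4, 5, 3]   # spells "rob"

# Even when the observation sequence is "rob", the "rib" path wins:
print(path_prob(rib_path, "rob"))  # 0.55
print(path_prob(rob_path, "rob"))  # 0.45
```

The interior observations `o` and `b` never change the comparison: the branch chosen at the start state always wins, which is exactly the label bias behavior described above.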
Léon Bottou (1991) discussed two solutions for the label bias problem. One is to change the state-transition structure of the model. In the above example we could collapse states 1 and 4, and delay the branching until we get a discriminating observation. This operation is a special case of determinization (Mohri, 1997), but determinization of weighted finite-state machines is not always possible, and even when possible, it may lead to combinatorial explosion. The other solution mentioned is to start with a fully-connected model and let the training procedure figure out a good structure. But that would preclude the use of prior structural knowledge that has proven so valuable in information extraction tasks (Freitag & McCallum, 2000).

Proper solutions require models that account for whole state sequences at once by letting some transitions “vote” more strongly than others depending on the corresponding observations. This implies that score mass will not be conserved, but instead individual transitions can “amplify” or “dampen” the mass they receive. In the above example, the transitions from the start state would have a very weak effect on path score, while the transitions from states 1 and 4 would have much stronger effects, amplifying or damping depending on the actual observation, and a proportionally higher contribution to the selection of the Viterbi path.[3]

In the related work section we discuss other heuristic model classes that account for state sequences globally rather than locally. To the best of our knowledge, CRFs are the only model class that does this in a purely probabilistic setting, with guaranteed global maximum likelihood convergence.

\textbf{3. Conditional Random Fields}

In what follows, $\mathbf{X}$ is a random variable over data sequences to be labeled, and $\mathbf{Y}$ is a random variable over corresponding label sequences. All components $\mathbf{Y}_i$ of $\mathbf{Y}$ are assumed to range over a finite label alphabet $\mathcal{Y}$. For example, $\mathbf{X}$ might range over natural language sentences and $\mathbf{Y}$ range over part-of-speech taggings of those sentences, with $\mathcal{Y}$ the set of possible part-of-speech tags. The random variables $\mathbf{X}$ and $\mathbf{Y}$ are jointly distributed, but in a discriminative framework we construct a conditional model $p(\mathbf{Y} \mid \mathbf{X})$ from paired observation and label sequences, and do not explicitly model the marginal $p(\mathbf{X})$.

\textbf{Definition.} _Let $G = (V, E)$ be a graph such that $\mathbf{Y} = (\mathbf{Y}_v)_{v \in V}$, so that $\mathbf{Y}$ is indexed by the vertices of $G$. Then $(\mathbf{X}, \mathbf{Y})$ is a conditional random field in case, when conditioned on $\mathbf{X}$, the random variables $\mathbf{Y}_v$ obey the Markov property with respect to the graph: $p(\mathbf{Y}_v \mid \mathbf{X}, \mathbf{Y}_w, w \neq v) = p(\mathbf{Y}_v \mid \mathbf{X}, \mathbf{Y}_w, w \sim v)$, where $w \sim v$ means that $w$ and $v$ are neighbors in $G$._

Thus, a CRF is a random field globally conditioned on the observation $\mathbf{X}$. Throughout the paper we tacitly assume that the graph $G$ is fixed. In the simplest and most important example for modeling sequences, $G$ is a simple chain or line: $G = (V = \{1, 2, \ldots, m\},\ E = \{(i, i+1)\})$. $\mathbf{X}$ may also have a natural graph structure; yet in general it is not necessary to assume that $\mathbf{X}$ and $\mathbf{Y}$ have the same graphical structure, or even that $\mathbf{X}$ has any graphical structure at all. However, in this paper we will be most concerned with sequences $\mathbf{X} = (\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n)$ and $\mathbf{Y} = (\mathbf{Y}_1, \mathbf{Y}_2, \ldots, \mathbf{Y}_n)$.

[3] Weighted determinization and minimization techniques shift transition weights while preserving overall path weight (Mohri, 2000); their connection to this discussion deserves further study.

If the graph $G = (V, E)$ of $\mathbf{Y}$ is a tree (of which a chain is the simplest example), its cliques are the edges and vertices. Therefore, by the fundamental theorem of random fields (Hammersley & Clifford, 1971), the joint distribution over the label sequence $\mathbf{Y}$ given $\mathbf{X}$ has the form

$$p_\theta(\mathbf{y} \mid \mathbf{x}) \;\propto\; \exp\Bigg( \sum_{e \in E,\,k} \lambda_k\, f_k(e, \mathbf{y}|_e, \mathbf{x}) \;+\; \sum_{v \in V,\,k} \mu_k\, g_k(v, \mathbf{y}|_v, \mathbf{x}) \Bigg), \qquad (1)$$

where $\mathbf{x}$ is a data sequence, $\mathbf{y}$ a label sequence, and $\mathbf{y}|_S$ is the set of components of $\mathbf{y}$ associated with the vertices in subgraph $S$.

We assume that the _features_ $f_k$ and $g_k$ are given and fixed. For example, a Boolean vertex feature $g_k$ might be true if the word $\mathbf{X}_i$ is upper case and the tag $\mathbf{Y}_i$ is “proper noun.”

The parameter estimation problem is to determine the parameters $\theta = (\lambda_1, \lambda_2, \ldots; \mu_1, \mu_2, \ldots)$ from training data $\mathcal{D} = \{(\mathbf{x}^{(i)}, \mathbf{y}^{(i)})\}_{i=1}^N$ with empirical distribution $\tilde{p}(\mathbf{x}, \mathbf{y})$. In Section 4 we describe an iterative scaling algorithm that maximizes the log-likelihood objective function $\mathcal{O}(\theta)$:

$$\mathcal{O}(\theta) \;=\; \sum_{i=1}^N \log p_\theta(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}) \;\propto\; \sum_{\mathbf{x}, \mathbf{y}} \tilde{p}(\mathbf{x}, \mathbf{y}) \log p_\theta(\mathbf{y} \mid \mathbf{x}).$$

As a particular case, we can construct an HMM-like CRF by defining one feature for each state pair $(y', y)$, and one feature for each state-observation pair $(y, x)$:

$$f_{y',y}(\langle u, v \rangle, \mathbf{y}|_{\langle u, v \rangle}, \mathbf{x}) = \delta(\mathbf{y}_u, y')\, \delta(\mathbf{y}_v, y), \qquad g_{y,x}(v, \mathbf{y}|_v, \mathbf{x}) = \delta(\mathbf{y}_v, y)\, \delta(\mathbf{x}_v, x).$$

The corresponding parameters $\lambda_{y',y}$ and $\mu_{y,x}$ play a similar role to the (logarithms of the) usual HMM parameters $p(y' \mid y)$ and $p(x \mid y)$. Boltzmann chain models (Saul & Jordan, 1996; MacKay, 1996) have a similar form but use a single normalization constant to yield a joint distribution, whereas CRFs use the observation-dependent normalization $Z(\mathbf{x})$ for conditional distributions.

Although it encompasses HMM-like models, the class of conditional random fields is much more expressive, because it allows arbitrary dependencies on the observation sequence. In addition, the features do not need to specify completely a state or observation, so one might expect that the model can be estimated from less training data. Another attractive property is the convexity of the loss function; indeed, CRFs share all of the convexity properties of general maximum entropy models.

_Figure 2._ Graphical structures of simple HMMs (left), MEMMs (center), and the chain-structured case of CRFs (right) for sequences. An open circle indicates that the variable is not generated by the model.

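For intuition, equation (1) on a small chain can be evaluated by brute force: score every label sequence by its exponentiated feature sum and normalize over all label sequences. The sketch below is our illustration, with made-up indicator features and weights, not code from the paper.

```python
import itertools, math

# Toy chain CRF in the spirit of equation (1): edge features f_k on label
# pairs and vertex features g_k on (label, observation) pairs.
# Feature definitions and weights are illustrative.

LABELS = ["A", "B"]

def edge_feats(y_prev, y_cur):
    # One indicator feature per label pair, as in the HMM-like CRF.
    return {("f", y_prev, y_cur): 1.0}

def vertex_feats(y, x):
    return {("g", y, x): 1.0}

def score(y_seq, x_seq, weights):
    """Unnormalized log-score: sum of weighted edge and vertex features."""
    s = 0.0
    for i in range(1, len(y_seq)):
        for k, v in edge_feats(y_seq[i - 1], y_seq[i]).items():
            s += weights.get(k, 0.0) * v
    for y, x in zip(y_seq, x_seq):
        for k, v in vertex_feats(y, x).items():
            s += weights.get(k, 0.0) * v
    return s

def prob(y_seq, x_seq, weights):
    """p_theta(y | x) by brute-force normalization over label sequences."""
    z = sum(math.exp(score(ys, x_seq, weights))
            for ys in itertools.product(LABELS, repeat=len(x_seq)))
    return math.exp(score(y_seq, x_seq, weights)) / z

weights = {("g", "A", "a"): 2.0, ("g", "B", "b"): 2.0, ("f", "A", "B"): 1.0}
x = ["a", "b"]
total = sum(prob(ys, x, weights)
            for ys in itertools.product(LABELS, repeat=len(x)))
print(round(total, 6))  # 1.0 -- the distribution is properly normalized
```

Because the single normalization is over whole label sequences, every transition’s weight competes against all others, which is exactly the global trade-off described above; brute-force enumeration is of course only feasible for tiny chains.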
For the remainder of the paper we assume that the dependencies of $\mathbf{Y}$, conditioned on $\mathbf{X}$, form a chain. To simplify some expressions, we add special start and stop states $\mathbf{Y}_0 = \textrm{start}$ and $\mathbf{Y}_{n+1} = \textrm{stop}$. Thus, we will be using the graphical structure shown in Figure 2. For a chain structure, the conditional probability of a label sequence can be expressed concisely in matrix form, which will be useful in describing the parameter estimation and inference algorithms in Section 4. Suppose that $p_\theta(\mathbf{Y} \mid \mathbf{X})$ is a CRF given by (1). For each position $i$ in the observation sequence $\mathbf{x}$, we define the $|\mathcal{Y}| \times |\mathcal{Y}|$ matrix random variable $M_i(\mathbf{x}) = [M_i(y', y \mid \mathbf{x})]$ by

$$M_i(y', y \mid \mathbf{x}) = \exp(\Lambda_i(y', y \mid \mathbf{x})), \qquad \Lambda_i(y', y \mid \mathbf{x}) = \sum_k \lambda_k\, f_k(e_i, \mathbf{Y}|_{e_i} = (y', y), \mathbf{x}) + \sum_k \mu_k\, g_k(v_i, \mathbf{Y}|_{v_i} = y, \mathbf{x}),$$

where $e_i$ is the edge with labels $(\mathbf{Y}_{i-1}, \mathbf{Y}_i)$ and $v_i$ is the vertex with label $\mathbf{Y}_i$. In contrast to generative models, conditional models like CRFs do not need to enumerate over all possible observation sequences $\mathbf{x}$, and therefore these matrices can be computed directly as needed from a given training or test observation sequence $\mathbf{x}$ and the parameter vector $\theta$. Then the normalization (partition function) $Z_\theta(\mathbf{x})$ is the $(\textrm{start}, \textrm{stop})$ entry of the product of these matrices:

$$Z_\theta(\mathbf{x}) = (M_1(\mathbf{x})\, M_2(\mathbf{x}) \cdots M_{n+1}(\mathbf{x}))_{\textrm{start}, \textrm{stop}}.$$

Using this notation, the conditional probability of a label sequence $\mathbf{y}$ is written as

$$p_\theta(\mathbf{y} \mid \mathbf{x}) = \frac{\prod_{i=1}^{n+1} M_i(\mathbf{y}_{i-1}, \mathbf{y}_i \mid \mathbf{x})}{\Big( \prod_{i=1}^{n+1} M_i(\mathbf{x}) \Big)_{\textrm{start}, \textrm{stop}}},$$

where $\mathbf{y}_0 = \textrm{start}$ and $\mathbf{y}_{n+1} = \textrm{stop}$.

\textbf{4. Parameter Estimation for CRFs}

We now describe two iterative scaling algorithms to find the parameter vector $\theta$ that maximizes the log-likelihood of the training data. Both algorithms are based on the improved iterative scaling (IIS) algorithm of Della Pietra et al. (1997); the proof technique based on auxiliary functions can be extended to show convergence of the algorithms for CRFs.

Iterative scaling algorithms update the weights as $\lambda_k \leftarrow \lambda_k + \delta\lambda_k$ and $\mu_k \leftarrow \mu_k + \delta\mu_k$ for appropriately chosen $\delta\lambda_k$ and $\delta\mu_k$. In particular, the IIS update $\delta\lambda_k$ for an edge feature $f_k$ is the solution of

$$\tilde{E}[f_k] \;\stackrel{\mathrm{def}}{=}\; \sum_{\mathbf{x}, \mathbf{y}} \tilde{p}(\mathbf{x}, \mathbf{y}) \sum_{i=1}^{n+1} f_k(e_i, \mathbf{y}|_{e_i}, \mathbf{x}) \;=\; \sum_{\mathbf{x}, \mathbf{y}} \tilde{p}(\mathbf{x})\, p(\mathbf{y} \mid \mathbf{x}) \sum_{i=1}^{n+1} f_k(e_i, \mathbf{y}|_{e_i}, \mathbf{x})\, e^{\delta\lambda_k T(\mathbf{x}, \mathbf{y})},$$

where $T(\mathbf{x}, \mathbf{y})$ is the _total feature count_

$$T(\mathbf{x}, \mathbf{y}) \;\stackrel{\mathrm{def}}{=}\; \sum_{i,k} f_k(e_i, \mathbf{y}|_{e_i}, \mathbf{x}) + \sum_{i,k} g_k(v_i, \mathbf{y}|_{v_i}, \mathbf{x}).$$

The equations for vertex feature updates $\delta\mu_k$ have similar form.

However, efficiently computing the exponential sums on the right-hand sides of these equations is problematic, because $T(\mathbf{x}, \mathbf{y})$ is a global property of $(\mathbf{x}, \mathbf{y})$, and dynamic programming will sum over sequences with potentially varying $T$. To deal with this, the first algorithm, Algorithm S, uses a “slack feature.” The second, Algorithm T, keeps track of partial $T$ totals.

For Algorithm S, we define the _slack feature_ by

$$s(\mathbf{x}, \mathbf{y}) \;\stackrel{\mathrm{def}}{=}\; S - \sum_i \sum_k f_k(e_i, \mathbf{y}|_{e_i}, \mathbf{x}) - \sum_i \sum_k g_k(v_i, \mathbf{y}|_{v_i}, \mathbf{x}),$$

where $S$ is a constant chosen so that $s(\mathbf{x}^{(i)}, \mathbf{y}) \geq 0$ for all $\mathbf{y}$ and all observation vectors $\mathbf{x}^{(i)}$ in the training set, thus making $T(\mathbf{x}, \mathbf{y}) = S$. Feature $s$ is “global,” that is, it does not correspond to any particular edge or vertex.

For each index $i = 0, \ldots, n+1$ we now define the _forward vectors_ $\alpha_i(\mathbf{x})$ with base case

$$\alpha_0(y \mid \mathbf{x}) = \begin{cases} 1 & \text{if } y = \text{start} \\ 0 & \text{otherwise} \end{cases}$$

and recurrence

$$\alpha_i(\mathbf{x}) = \alpha_{i-1}(\mathbf{x})\, M_i(\mathbf{x}).$$

Similarly, the _backward vectors_ $\beta_i(\mathbf{x})$ are defined by

$$\beta_{n+1}(y \mid \mathbf{x}) = \begin{cases} 1 & \text{if } y = \text{stop} \\ 0 & \text{otherwise} \end{cases}$$

and

$$\beta_i(\mathbf{x})^\top = M_{i+1}(\mathbf{x})\, \beta_{i+1}(\mathbf{x}).$$

With these definitions, the update equations are

$$\delta\lambda_k = \frac{1}{S} \log \frac{\tilde{E} f_k}{E f_k}, \qquad \delta\mu_k = \frac{1}{S} \log \frac{\tilde{E} g_k}{E g_k},$$

where

$$E f_k = \sum_{\mathbf{x}} \tilde{p}(\mathbf{x}) \sum_{i=1}^{n+1} \sum_{y', y} f_k(e_i, \mathbf{y}|_{e_i} = (y', y), \mathbf{x})\, \frac{\alpha_{i-1}(y' \mid \mathbf{x})\, M_i(y', y \mid \mathbf{x})\, \beta_i(y \mid \mathbf{x})}{Z_\theta(\mathbf{x})},$$

$$E g_k = \sum_{\mathbf{x}} \tilde{p}(\mathbf{x}) \sum_{i=1}^{n} \sum_{y} g_k(v_i, \mathbf{y}|_{v_i} = y, \mathbf{x})\, \frac{\alpha_i(y \mid \mathbf{x})\, \beta_i(y \mid \mathbf{x})}{Z_\theta(\mathbf{x})}.$$

The factors involving the forward and backward vectors in the above equations have the same meaning as for standard hidden Markov models. For example,

$$p_\theta(\mathbf{Y}_i = y \mid \mathbf{x}) = \frac{\alpha_i(y \mid \mathbf{x})\, \beta_i(y \mid \mathbf{x})}{Z_\theta(\mathbf{x})}$$

is the marginal probability of label $\mathbf{Y}_i = y$ given that the observation sequence is $\mathbf{x}$. This algorithm is closely related to the algorithm of Darroch and Ratcliff (1972), and MART algorithms used in image reconstruction.

The constant $S$ in Algorithm S can be quite large, since in practice it is proportional to the length of the longest training observation sequence. As a result, the algorithm may converge slowly, taking very small steps toward the maximum in each iteration. If the length of the observations $\mathbf{x}^{(i)}$ and the number of active features varies greatly, a faster-converging algorithm can be obtained by keeping track of feature totals for each observation sequence separately.

Let $T(\mathbf{x}) \stackrel{\mathrm{def}}{=} \max_{\mathbf{y}} T(\mathbf{x}, \mathbf{y})$. Algorithm T accumulates feature expectations into counters indexed by $T(\mathbf{x})$. More specifically, we use the forward-backward recurrences just introduced to compute the expectations $a_{k,t}$ of feature $f_k$ and $b_{k,t}$ of feature $g_k$ given that $T(\mathbf{x}) = t$. Then our parameter updates are $\delta\lambda_k = \log \beta_k$ and $\delta\mu_k = \log \gamma_k$, where $\beta_k$ and $\gamma_k$ are the unique positive roots to the following polynomial equations

$$\sum_{t=0}^{T_{\max}} a_{k,t}\, \beta_k^t = \tilde{E} f_k, \qquad \sum_{t=0}^{T_{\max}} b_{k,t}\, \gamma_k^t = \tilde{E} g_k, \qquad (2)$$

which can be easily computed by Newton’s method.

A single iteration of Algorithm S and Algorithm T has roughly the same time and space complexity as the well known Baum-Welch algorithm for HMMs. To prove convergence of our algorithms, we can derive an auxiliary function to bound the change in likelihood from below; this method is developed in detail by Della Pietra et al. (1997). The full proof is somewhat detailed; however, here we give an idea of how to derive the auxiliary function. To simplify notation, we assume only edge features $f_k$ with parameters $\lambda_k$.

Given two parameter settings $\theta = (\lambda_1, \lambda_2, \ldots)$ and $\theta' = (\lambda_1 + \delta\lambda_1, \lambda_2 + \delta\lambda_2, \ldots)$, we bound from below the change in the objective function with an _auxiliary function_ $A(\theta', \theta)$ as follows:

$$\begin{aligned} \mathcal{O}(\theta') - \mathcal{O}(\theta) &= \sum_{\mathbf{x}, \mathbf{y}} \tilde{p}(\mathbf{x}, \mathbf{y}) \log \frac{p_{\theta'}(\mathbf{y} \mid \mathbf{x})}{p_\theta(\mathbf{y} \mid \mathbf{x})} = (\theta' - \theta) \cdot \tilde{E} f - \sum_{\mathbf{x}} \tilde{p}(\mathbf{x}) \log \frac{Z_{\theta'}(\mathbf{x})}{Z_\theta(\mathbf{x})} \\ &\geq (\theta' - \theta) \cdot \tilde{E} f - \sum_{\mathbf{x}} \tilde{p}(\mathbf{x}) \frac{Z_{\theta'}(\mathbf{x})}{Z_\theta(\mathbf{x})} \\ &= \delta\lambda \cdot \tilde{E} f - \sum_{\mathbf{x}} \tilde{p}(\mathbf{x}) \sum_{\mathbf{y}} p_\theta(\mathbf{y} \mid \mathbf{x})\, e^{\delta\lambda \cdot f(\mathbf{x}, \mathbf{y})} \\ &\geq \delta\lambda \cdot \tilde{E} f - \sum_{\mathbf{x}, \mathbf{y}, k} \tilde{p}(\mathbf{x})\, p_\theta(\mathbf{y} \mid \mathbf{x})\, \frac{f_k(\mathbf{x}, \mathbf{y})}{T(\mathbf{x})}\, e^{\delta\lambda_k T(\mathbf{x})} \;\stackrel{\mathrm{def}}{=}\; A(\theta', \theta), \end{aligned}$$

where the inequalities follow from the convexity of $-\log$ and $\exp$. Differentiating $A$ with respect to $\delta\lambda_k$ and setting the result to zero yields equation (2).

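The matrix form above can be sketched directly: build each $M_i$ over the label set augmented with start and stop, take $Z_\theta(\mathbf{x})$ as the (start, stop) entry of the matrix product, and divide path products by it. The feature definitions and weights below are our illustrative assumptions, not the paper's.

```python
import math

# Sketch of the chain-CRF matrix form over states {start, A, B, stop}.
STATES = ["start", "A", "B", "stop"]

def M(i, x, n):
    """Matrix M_i(y', y | x): exp of weighted edge + vertex features.
    Weights are arbitrary illustrative values."""
    m = [[0.0] * len(STATES) for _ in STATES]
    for a, yp in enumerate(STATES):
        for b, y in enumerate(STATES):
            # Allowed transitions on the augmented chain.
            if i == 1:
                ok = yp == "start" and y in ("A", "B")
            elif i <= n:
                ok = yp in ("A", "B") and y in ("A", "B")
            else:  # i == n + 1: transition into the stop state
                ok = yp in ("A", "B") and y == "stop"
            if not ok:
                continue
            lam = 0.3 if (yp, y) == ("A", "B") else 0.0    # edge feature
            mu = 1.0 if i <= n and y == x[i - 1] else 0.0  # vertex feature
            m[a][b] = math.exp(lam + mu)
    return m

def matmul(P, Q):
    return [[sum(P[r][k] * Q[k][c] for k in range(len(Q)))
             for c in range(len(Q[0]))] for r in range(len(P))]

def Z(x):
    """Partition function: (start, stop) entry of M_1 ... M_{n+1}."""
    n = len(x)
    prod = M(1, x, n)
    for i in range(2, n + 2):
        prod = matmul(prod, M(i, x, n))
    return prod[STATES.index("start")][STATES.index("stop")]

def seq_prob(y, x):
    """p_theta(y | x): product of matrix entries along the path over Z(x)."""
    n = len(x)
    path = ["start"] + list(y) + ["stop"]
    num = 1.0
    for i in range(1, n + 2):
        num *= M(i, x, n)[STATES.index(path[i - 1])][STATES.index(path[i])]
    return num / Z(x)

x = ["A", "B", "A"]
total = sum(seq_prob((y1, y2, y3), x)
            for y1 in "AB" for y2 in "AB" for y3 in "AB")
print(round(total, 6))  # 1.0
```

Summing the path products via the matrix product is what makes $Z_\theta(\mathbf{x})$ computable in time linear in the sequence length, instead of exponential enumeration; the forward and backward vectors are just the partial row and column products of the same matrices.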
\textbf{5. Experiments}

We first discuss two sets of experiments with synthetic data that highlight the differences between CRFs and MEMMs. The first experiments are a direct verification of the label bias problem discussed in Section 2. In the second set of experiments, we generate synthetic data using randomly chosen hidden Markov models, each of which is a mixture of a first-order and second-order model. Competing _first-order_ models are then trained and compared on test data. As the data becomes more second-order, the test error rates of the trained models increase. This experiment corresponds to the common modeling practice of approximating complex local and long-range dependencies, as occur in natural data, by small-order Markov models. Our results clearly indicate that even when the models are parameterized in exactly the same way, CRFs are more robust to inaccurate modeling assumptions than MEMMs or HMMs, and resolve the label bias problem, which affects the performance of MEMMs. To avoid confusion of different effects, the MEMMs and CRFs in these experiments _do not_ use overlapping features of the observations. Finally, in a set of POS tagging experiments, we confirm the advantage of CRFs over MEMMs. We also show that the addition of overlapping features to CRFs and MEMMs allows them to perform much better than HMMs, as already shown for MEMMs by McCallum et al. (2000).

_Figure 3._ Plots of 2×2 error rates for HMMs, CRFs, and MEMMs on randomly generated synthetic data sets, as described in Section 5.2. As the data becomes “more second order,” the error rates of the test models increase. As shown in the left plot, the CRF typically significantly outperforms the MEMM. The center plot shows that the HMM outperforms the MEMM. In the right plot, each open square represents a data set with $\alpha < \frac{1}{2}$, and a solid circle indicates a data set with $\alpha \geq \frac{1}{2}$. The plot shows that when the data is mostly second order ($\alpha \geq \frac{1}{2}$), the discriminatively trained CRF typically outperforms the HMM. These experiments are not designed to demonstrate the advantages of the additional representational power of CRFs and MEMMs relative to HMMs.

+ \textbf{5.1 Modeling label bias}
1217
+
1218
+
1219
+ We generate data from a simple HMM which encodes a
1220
+ noisy version of the finite-state network in Figure 1. Each
1221
+ state emits its designated symbol with probability 29 _/_ 32
1222
+ and any of the other symbols with probability 1 _/_ 32. We
1223
+ train both an MEMM and a CRF with the same topologies
1224
+ on the data generated by the HMM. The observation features are simply the identity of the observation symbols.
1225
+ In a typical run using 2 _,_ 000 training and 500 test samples,
1226
+ trained to convergence of the iterative scaling algorithm,
1227
+ the CRF error is 4 _._ 6% while the MEMM error is 42%,
1228
+ showing that the MEMM fails to discriminate between the
1229
+ two branches.
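
The emission noise described above can be sketched in a few lines. The 29/32 noise level comes from the text; the symbol set is our assumption, taken from the rib/rob network of Figure 1 (not reproduced here), and the function name `emit` is ours:

```python
import random

# Symbols assumed from the rib/rob finite-state network of Figure 1.
SYMBOLS = ["r", "i", "o", "b"]

def emit(designated, rng):
    """Emit the state's designated symbol with probability 29/32,
    otherwise one of the other symbols uniformly at random."""
    if rng.random() < 29 / 32:
        return designated
    return rng.choice([s for s in SYMBOLS if s != designated])
```

Running this sampler along the two branches of the network produces the noisy training and test sequences used in the experiment.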
1230
+
1231
+
1232
+ \textbf{5.2 Modeling mixed-order sources}
1233
+
1234
+ For these results, we use five labels, a-e ( _|Y|_ = 5), and 26
1235
+ observation values, A-Z ( _|X|_ = 26); however, the results
1236
+ were qualitatively the same over a range of sizes for _Y_ and
1237
+ _X_ . We generate data from a mixed-order HMM with state
1238
+ transition probabilities given by _pα_ ( \textbf{y} _i_ _|_ \textbf{y} _i−_ 1 _,_ \textbf{y} _i−_ 2 ) = _α p_ 2 ( \textbf{y} _i_ _|_ \textbf{y} _i−_ 1 _,_ \textbf{y} _i−_ 2 ) + (1 _− α_ ) _p_ 1 ( \textbf{y} _i_ _|_ \textbf{y} _i−_ 1 ) and, similarly, emission probabilities given by _pα_ ( \textbf{x} _i_ _|_ \textbf{y} _i_ _,_ \textbf{x} _i−_ 1 ) = _α p_ 2 ( \textbf{x} _i_ _|_ \textbf{y} _i_ _,_ \textbf{x} _i−_ 1 ) + (1 _− α_ ) _p_ 1 ( \textbf{x} _i_ _|_ \textbf{y} _i_ ). Thus, for _α_ = 0 we have a standard first-order HMM. In order to limit the size
1239
+
1240
+
1241
+
1242
+ of the Bayes error rate for the resulting models, the conditional probability tables _pα_ are constrained to be sparse.
1243
+ In particular, _pα_ ( _· | y, y′_ ) can have at most two nonzero entries, for each _y, y′_, and _pα_ ( _· | y, x′_ ) can have at most three
+ nonzero entries for each _y, x′_ . For each randomly generated model, a sample of 1,000 sequences of length 25 is
1249
+ generated for training and testing.
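
As a rough sketch of this generating process (the label half only), the mixture of a second-order and a first-order table can be sampled as below. For brevity we use dense random tables rather than the sparse ones the paper constrains to, and all names here are ours:

```python
import random

LABELS = ["a", "b", "c", "d", "e"]

def random_table(keys, outcomes, rng):
    """Build a dense conditional probability table p(outcome | key)."""
    table = {}
    for k in keys:
        w = [rng.random() for _ in outcomes]
        z = sum(w)
        table[k] = {o: wi / z for o, wi in zip(outcomes, w)}
    return table

def sample_sequence(alpha, length, rng):
    """Sample labels from the mixture
    p_alpha(y_i | y_{i-1}, y_{i-2}) = alpha*p2(y_i | y_{i-1}, y_{i-2})
                                      + (1-alpha)*p1(y_i | y_{i-1})."""
    p1 = random_table(LABELS, LABELS, rng)
    p2 = random_table([(a, b) for a in LABELS for b in LABELS], LABELS, rng)
    seq = [rng.choice(LABELS), rng.choice(LABELS)]   # arbitrary first two labels
    for _ in range(length - 2):
        y1, y2 = seq[-1], seq[-2]
        weights = [alpha * p2[(y1, y2)][y] + (1 - alpha) * p1[y1][y]
                   for y in LABELS]
        seq.append(rng.choices(LABELS, weights=weights)[0])
    return seq
```

Setting `alpha=0` reduces the sampler to a plain first-order chain, matching the _α_ = 0 case in the text.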
1250
+
1251
+
1252
+ On each randomly generated training set, a CRF is trained
1253
+ using Algorithm S. (Note that since the length of the sequences and number of active features is constant, Algorithms S and T are identical.) The algorithm is fairly slow
1254
+ to converge, typically taking approximately 500 iterations
1255
+ for the model to stabilize. On the 500 MHz Pentium PC
1256
+ used in our experiments, each iteration takes approximately
1257
+ 0.2 seconds. On the same data an MEMM is trained using
1258
+ iterative scaling, which does not require forward-backward
1259
+ calculations, and is thus more efficient. The MEMM training converges more quickly, stabilizing after approximately
1260
+ 100 iterations. For each model, the Viterbi algorithm is
1261
+ used to label a test set; the experimental results do not significantly change when using forward-backward decoding
1262
+ to minimize the per-symbol error rate.
1263
+
1264
+
1265
+ The results of several runs are presented in Figure 3. Each
1266
+ plot compares two classes of models, with each point indicating the error rate for a single test set. As _α_ increases, the
1267
+ error rates generally increase, as the first-order models fail
1268
+ to fit the second-order data. The figure compares models
1269
+ parameterized as _µy_, _λy￿,y_, and _λy￿,y,x_ ; results for models
1270
+ parameterized as _µy_, _λy￿,y_, and _µy,x_ are qualitatively the
1271
+ same. As shown in the first graph, the CRF generally outperforms the MEMM, often by a wide margin of 10%–20%
1272
+ relative error. (The points for very small error rate, with
1273
+ _α <_ 0 _._ 01, where the MEMM does better than the CRF,
1274
+ are suspected to be the result of an insufficient number of
1275
+ training iterations for the CRF.)
1276
+
1277
+
1278
+ | model | error | oov error |
+ |---|---|---|
+ | HMM | 5.69% | 45.99% |
+ | MEMM | 6.37% | 54.61% |
+ | CRF | 5.55% | 48.05% |
+ | MEMM+ | 4.81% | 26.99% |
+ | CRF+ | 4.27% | 23.76% |
1282
+
1283
+
1284
+ +Using spelling features
1285
+
1286
+
1287
+ _Figure 4._ Per-word error rates for POS tagging on the Penn treebank, using first-order models trained on 50% of the 1.1 million
1288
+ word corpus. The oov rate is 5.45%.
1289
+
1290
+
1291
+ \textbf{5.3 POS tagging experiments}
1292
+
1293
+
1294
+ To confirm our synthetic data results, we also compared
1295
+ HMMs, MEMMs and CRFs on Penn treebank POS tagging, where each word in a given input sentence must be
1296
+ labeled with one of 45 syntactic tags.
1297
+
1298
+
1299
+ We carried out two sets of experiments with this natural
1300
+ language data. First, we trained first-order HMM, MEMM,
1301
+ and CRF models as in the synthetic data experiments, introducing parameters _µy,x_ for each tag-word pair and _λy￿,y_
1302
+ for each tag-tag pair in the training set. The results are consistent with what is observed on synthetic data: the HMM
1303
+ outperforms the MEMM, as a consequence of the label bias
1304
+ problem, while the CRF outperforms the HMM. The error rates for training runs using a 50%-50% train-test split
1305
+ are shown in Figure 4; the results are qualitatively similar for other splits of the data. The error rates on out-of-vocabulary (oov) words, which are not observed in the
1306
+ training set, are reported separately.
1307
+
1308
+
1309
+ In the second set of experiments, we take advantage of the
1310
+ power of conditional models by adding a small set of orthographic features: whether a spelling begins with a number or upper case letter, whether it contains a hyphen, and
1311
+ whether it ends in one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies. Here we find, as
1312
+ expected, that both the MEMM and the CRF benefit significantly from the use of these features, with the overall error
1313
+ rate reduced by around 25%, and the out-of-vocabulary error rate reduced by around 50%.
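
The orthographic features listed above can be sketched as a simple extractor; the feature list comes straight from the text, while the function name and dictionary encoding are ours:

```python
# Suffixes named in the text.
SUFFIXES = ("-ing", "-ogy", "-ed", "-s", "-ly", "-ion", "-tion", "-ity", "-ies")

def spelling_features(word):
    """Binary orthographic features of a spelling, as used by MEMM+ and CRF+."""
    feats = {
        "starts_with_digit": word[0].isdigit(),
        "starts_with_upper": word[0].isupper(),
        "contains_hyphen": "-" in word,
    }
    for suf in SUFFIXES:
        feats["ends_" + suf] = word.endswith(suf.lstrip("-"))
    return feats
```

Because conditional models need not model feature dependencies, such overlapping features can simply be added alongside the word identity.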
1314
+
1315
+
1316
+ One usually starts training from the all zero parameter vector, corresponding to the uniform distribution. However,
1317
+ for these datasets, CRF training with that initialization is
1318
+ much slower than MEMM training. Fortunately, we can
1319
+ use the optimal MEMM parameter vector as a starting
1320
+ point for training the corresponding CRF. In Figure 4,
+ MEMM+ was trained to convergence in around 100 iterations. Its parameters were then used to initialize the training of CRF+, which converged in 1,000 iterations. In contrast, training of the same CRF from the uniform distribution had not converged even after 2,000 iterations.
1322
+
1323
+
1324
+
1325
+ \textbf{6. Further Aspects of CRFs}
1326
+
1327
+
1328
+ Many further aspects of CRFs are attractive for applications and deserve further study. In this section we briefly
1329
+ mention just two.
1330
+
1331
+
1332
+ Conditional random fields can be trained using the exponential loss objective function used by the AdaBoost algorithm (Freund & Schapire, 1997). Typically, boosting is
1333
+ applied to classification problems with a small, fixed number of classes; applications of boosting to sequence labeling
1334
+ have treated each label as a separate classification problem
1335
+ (Abney et al., 1999). However, it is possible to apply the
1336
+ parallel update algorithm of Collins et al. (2000) to optimize the per-sequence exponential loss. This requires a
1337
+ forward-backward algorithm to compute efficiently certain
1338
+ feature expectations, along the lines of Algorithm T, except that each feature requires a separate set of forward and
1339
+ backward accumulators.
1340
+
1341
+
1342
+ Another attractive aspect of CRFs is that one can implement efficient feature selection and feature induction algorithms for them. That is, rather than specifying in advance which features of ( \textbf{X} _,_ \textbf{Y} ) to use, we could start from
1343
+ feature-generating rules and evaluate the benefit of generated features automatically on data. In particular, the feature induction algorithms presented in Della Pietra et al.
1344
+ (1997) can be adapted to fit the dynamic programming
1345
+ techniques of conditional random fields.
1346
+
1347
+
1348
+ \textbf{7. Related Work and Conclusions}
1349
+
1350
+
1351
+ As far as we know, the present work is the first to combine
1352
+ the benefits of conditional models with the global normalization of random field models. Other applications of exponential models in sequence modeling have either attempted
1353
+ to build generative models (Rosenfeld, 1997), which involve a hard normalization problem, or adopted local conditional models (Berger et al., 1996; Ratnaparkhi, 1996;
1354
+ McCallum et al., 2000) that may suffer from label bias.
1355
+
1356
+
1357
+ Non-probabilistic local decision models have also been
1358
+ widely used in segmentation and tagging (Brill, 1995;
1359
+ Roth, 1998; Abney et al., 1999). Because of the computational complexity of global training, these models are only
1360
+ trained to minimize the error of individual label decisions
1361
+ assuming that neighboring labels are correctly chosen. Label bias would be expected to be a problem here too.
1362
+
1363
+
1364
+ An alternative approach to discriminative modeling of sequence labeling is to use a permissive generative model,
1365
+ which can only model local dependencies, to produce a
1366
+ list of candidates, and then use a more global discriminative model to rerank those candidates. This approach is
1367
+ standard in large-vocabulary speech recognition (Schwartz
1368
+ & Austin, 1993), and has also been proposed for parsing
1369
+ (Collins, 2000). However, these methods fail when the correct output is pruned away in the first pass.
1370
+
1371
+
1372
+ Closest to our proposal are gradient-descent methods that
1373
+ adjust the parameters of all of the local classifiers to minimize a smooth loss function (e.g., quadratic loss) combining loss terms for each label. If state dependencies are local, this can be done efficiently with dynamic programming
1374
+ (LeCun et al., 1998). Such methods should alleviate label
1375
+ bias. However, their loss function is not convex, so they
1376
+ may get stuck in local minima.
1377
+
1378
+
1379
+ Conditional random fields offer a unique combination of
1380
+ properties: discriminatively trained models for sequence
1381
+ segmentation and labeling; combination of arbitrary, overlapping and agglomerative observation features from both
1382
+ the past and future; efficient training and decoding based
1383
+ on dynamic programming; and parameter estimation guaranteed to find the global optimum. Their main current limitation is the slow convergence of the training algorithm
1384
+ relative to MEMMs, let alone to HMMs, for which training
1385
+ on fully observed data is very efficient. In future work, we
1386
+ plan to investigate alternative training methods such as the
1387
+ update methods of Collins et al. (2000) and refinements on
1388
+ using a MEMM as starting point as we did in some of our
1389
+ experiments. More general tree-structured random fields,
1390
+ feature induction methods, and further natural data evaluations will also be investigated.
1391
+
1392
+
1393
+ \textbf{Acknowledgments}
1394
+
1395
+
1396
+ We thank Yoshua Bengio, Léon Bottou, Michael Collins
1397
+ and Yann LeCun for alerting us to what we call here the label bias problem. We also thank Andrew Ng and Sebastian
1398
+ Thrun for discussions related to this work.
1399
+
1400
+
1401
+ \textbf{References}
1402
+
1403
+
1404
+ Abney, S., Schapire, R. E., & Singer, Y. (1999). Boosting
1405
+
1406
+ applied to tagging and PP attachment. _Proc. EMNLP-_
1407
+ _VLC_ . New Brunswick, New Jersey: Association for
1408
+ Computational Linguistics.
1409
+ Berger, A. L., Della Pietra, S. A., & Della Pietra, V. J.
1410
+
1411
+ (1996). A maximum entropy approach to natural language processing. _Computational Linguistics_, _22_ .
1412
+ Bottou, L. (1991). _Une approche théorique de_
+ _l’apprentissage connexionniste: Applications à la_
+ _reconnaissance de la parole_ . Doctoral dissertation, Université
+ de Paris XI.
1416
+ Brill, E. (1995). Transformation-based error-driven learn
1417
+ ing and natural language processing: a case study in part
1418
+ of speech tagging. _Computational Linguistics_, _21_, 543–
1419
+ 565.
1420
+ Collins, M. (2000). Discriminative reranking for natural
1421
+
1422
+ language parsing. _Proc. ICML 2000_ . Stanford, California.
1423
+ Collins, M., Schapire, R., & Singer, Y. (2000). Logistic re
1424
+ gression, AdaBoost, and Bregman distances. _Proc. 13th_
1425
+ _COLT_ .
1426
+ Darroch, J. N., & Ratcliff, D. (1972). Generalized iterative
1427
+
1428
+
1429
+
1430
+ scaling for log-linear models. _The Annals of Mathemat-_
1431
+ _ical Statistics_, _43_, 1470–1480.
1432
+ Della Pietra, S., Della Pietra, V., & Lafferty, J. (1997). In
1433
+ ducing features of random fields. _IEEE Transactions on_
1434
+ _Pattern Analysis and Machine Intelligence_, _19_, 380–393.
1435
+ Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998).
1436
+
1437
+ _Biological sequence analysis: Probabilistic models of_
1438
+ _proteins and nucleic acids_ . Cambridge University Press.
1439
+ Freitag, D., & McCallum, A. (2000). Information extrac
1440
+ tion with HMM structures learned by stochastic optimization. _Proc. AAAI 2000_ .
1441
+ Freund, Y., & Schapire, R. (1997). A decision-theoretic
1442
+
1443
+ generalization of on-line learning and an application to
1444
+ boosting. _Journal of Computer and System Sciences_, _55_,
1445
+ 119–139.
1446
+ Hammersley, J., & Clifford, P. (1971). Markov fields on
1447
+
1448
+ finite graphs and lattices. Unpublished manuscript.
1449
+ LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998).
1450
+
1451
+ Gradient-based learning applied to document recognition. _Proceedings of the IEEE_, _86_, 2278–2324.
1452
+ MacKay, D. J. (1996). Equivalence of linear Boltzmann
1453
+
1454
+ chains and hidden Markov models. _Neural Computation_,
1455
+ _8_, 178–181.
1456
+ Manning, C. D., & Schütze, H. (1999). _Foundations of sta-_
1457
+
1458
+ _tistical natural language processing_ . Cambridge Massachusetts: MIT Press.
1459
+ McCallum, A., Freitag, D., & Pereira, F. (2000). Maximum
1460
+
1461
+ entropy Markov models for information extraction and
1462
+ segmentation. _Proc. ICML 2000_ (pp. 591–598). Stanford, California.
1463
+ Mohri, M. (1997). Finite-state transducers in language and
1464
+
1465
+ speech processing. _Computational Linguistics_, _23_ .
1466
+ Mohri, M. (2000). Minimization algorithms for sequential
1467
+
1468
+ transducers. _Theoretical Computer Science_, _234_, 177–
1469
+ 201.
1470
+ Paz, A. (1971). _Introduction to probabilistic automata_ .
1471
+ Academic Press.
1472
+ Punyakanok, V., & Roth, D. (2001). The use of classifiers
1473
+
1474
+ in sequential inference. _NIPS 13_ . Forthcoming.
1475
+ Ratnaparkhi, A. (1996). A maximum entropy model for
1476
+
1477
+ part-of-speech tagging. _Proc. EMNLP_ . New Brunswick,
1478
+ New Jersey: Association for Computational Linguistics.
1479
+ Rosenfeld, R. (1997). A whole sentence maximum entropy
1480
+
1481
+ language model. _Proceedings of the IEEE Workshop on_
1482
+ _Speech Recognition and Understanding_ . Santa Barbara,
1483
+ California.
1484
+ Roth, D. (1998). Learning to resolve natural language am
1485
+ biguities: A unified approach. _Proc. 15th AAAI_ (pp. 806–
1486
+ 813). Menlo Park, California: AAAI Press.
1487
+ Saul, L., & Jordan, M. (1996). Boltzmann chains and hid
1488
+ den Markov models. _Advances in Neural Information_
1489
+ _Processing Systems 7_ . MIT Press.
1490
+ Schwartz, R., & Austin, S. (1993). A comparison of several
1491
+
1492
+ approximate algorithms for finding multiple (N-BEST)
1493
+ sentence hypotheses. _Proc. ICASSP_ . Minneapolis, MN.
1494
+
1495
+
1496
+
1497
+ \end{document}
references/2014.eacl.nguyen/paper.md ADDED
@@ -0,0 +1,420 @@
1
+ ---
2
+ title: "RDRPOSTagger: A Ripple Down Rules-based Part-Of-Speech Tagger"
3
+ authors:
4
+ - "Dat Quoc Nguyen"
5
+ - "Dai Quoc Nguyen"
6
+ - "Dang Duc Pham"
7
+ - "Son Bao Pham"
8
+ year: 2014
9
+ venue: "EACL 2014 Demonstrations"
10
+ url: "https://aclanthology.org/E14-2005/"
11
+ ---
12
+
13
+ # **RDRPOSTagger: A Ripple Down Rules-based Part-Of-Speech Tagger**
14
+
15
+ **Dat Quoc Nguyen** [1] and **Dai Quoc Nguyen** [1] and **Dang Duc Pham** [2] and **Son Bao Pham** [1]
16
+
17
+ 1
18
+ Faculty of Information Technology
19
+ University of Engineering and Technology
20
+ Vietnam National University, Hanoi
21
+ {datnq, dainq, sonpb}@vnu.edu.vn
22
+ 2
23
+ L3S Research Center, Germany
24
+ pham@L3S.de
25
+
26
+
27
+
28
+ **Abstract**
29
+
30
+
31
+ This paper describes our robust, easy-to-use and language-independent toolkit
32
+ named RDRPOSTagger, which employs
33
+ an error-driven approach to automatically
34
+ construct a Single Classification Ripple
35
+ Down Rules tree of transformation rules
36
+ for POS tagging task. During the demonstration session, we will run the tagger on
37
+ data sets in 15 different languages.
38
+
39
+
40
+ **1** **Introduction**
41
+
42
+
43
+ As one of the most important tasks in Natural
44
+ Language Processing, Part-of-speech (POS) tagging is to assign a tag representing its lexical
45
+ category to each word in a text. Recently, POS
46
+ taggers employing machine learning techniques
47
+ are still mainstream toolkits obtaining state-of-the-art performances [1] . However, most of them are
48
+ time-consuming in learning process and require a
49
+ powerful computer for possibly training machine
50
+ learning models.
51
+ Turning to rule-based approaches, the most
52
+ well-known method is proposed by Brill (1995).
53
+ He proposed an approach to automatically learn
54
+ transformation rules for the POS tagging problem.
55
+ In the Brill’s tagger, a new selected rule is learned
56
+ on a context that is generated by all previous rules,
57
+ where a following rule will modify the outputs of
58
+ all the preceding rules. Hence, this procedure makes it difficult to control the interactions among
59
+ a large number of rules.
60
+ Our RDRPOSTagger is presented to overcome
61
+ the problems mentioned above. The RDRPOSTagger exploits a failure-driven approach to automatically restructure transformation rules in the
62
+ form of a Single Classification Ripple Down Rules
63
+ (SCRDR) tree (Richards, 2009). It accepts interactions between rules, but a rule only changes the
64
+
65
+
66
+ 1
67
+ http://aclweb.org/aclwiki/index.php?title=POS_Tagging_(State_of_the_art)
68
+
69
+
70
+ 17
71
+
72
+
73
+
74
+ outputs of some previous rules in a controlled context. All rules are structured in a SCRDR tree
75
+ which allows a new exception rule to be added
76
+ when the tree returns an incorrect classification.
77
+ A specific description of our new RDRPOSTagger
78
+ approach is detailed in (Nguyen et al., 2011).
79
+ Packaged in a 0.6MB zip file, implementations
80
+ in Python and Java can be found at the tagger’s
81
+ website _http://rdrpostagger.sourceforge.net/_ . The
82
+ following items exhibit properties of the tagger:
83
+
84
+ _•_ The RDRPOSTagger is easy to configure and
85
+ train. There are only two threshold parameters utilized to learn the rule-based model. Besides, the
86
+ tagger is very simple to use with standard input
87
+ and output, having clear usage and instructions
88
+ available on its website.
89
+
90
+ _•_ The RDRPOSTagger is language independent.
91
+ This POS tagging toolkit has been successfully
92
+ applied to English and Vietnamese. To train the
93
+ toolkit for other languages, users just provide a
94
+ lexicon of words and the most frequent associated
95
+ tags. Moreover, it can be easily combined with existing POS taggers to reach an even better result.
96
+
97
+ _•_ The RDRPOSTagger obtains very competitive
98
+ accuracies. On Penn WSJ Treebank corpus (Marcus et al., 1993), taking WSJ sections 0-18 as the
99
+ training set, the tagger achieves a competitive performance compared to other state-of-the-art English POS taggers on the test set of WSJ sections
100
+ 22-24. For Vietnamese, it outperforms all previous machine learning-based POS tagging systems
101
+ to obtain the highest result to date on the Vietnamese Treebank corpus (Nguyen et al., 2009).
102
+
103
+ _•_ The RDRPOSTagger is fast. For instance in
104
+ English, the time [2] taken to train the tagger on
105
+ the WSJ sections 0-18 is **40** minutes. The tagging
106
+ speed on the test set of the WSJ sections 22-24 is
107
+ **2800** words/second for the latest implementation in Python, whilst it is **92k** words/second
108
+
109
+
110
+ 2Training and tagging times are computed on a Windows7 OS computer of Core 2Duo 2.4GHz & 3GB of memory.
111
+
112
+
113
+
114
+ _Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics_, pages 17–20,
115
+ Gothenburg, Sweden, April 26-30 2014. © 2014 Association for Computational Linguistics
116
+
117
+
118
+ _Figure 1:_ A part of our SCRDR tree for English POS tagging.
119
+
120
+
121
+
122
+ for the implementation in Java.
123
+
124
+
125
+ **2** **SCRDR methodology**
126
+
127
+
128
+ A SCRDR tree (Richards, 2009) is a binary tree
129
+ with two distinct types of edges. These edges are
130
+ typically called _except_ and _if-not_ edges. Associated with each node in a tree is a _rule_ . A rule has
131
+ the form: _if α then β_ where _α_ is called the _condi-_
132
+ _tion_ and _β_ is referred to as the _conclusion_ .
133
+ Cases in SCRDR are evaluated by passing a
134
+ case to the root of the tree. At any node in the
135
+ tree, if the condition of a node _N_ ’s rule is satisfied by the case, the case is passed on to the exception child of _N_ using the _except_ link if it exists.
136
+ Otherwise, the case is passed on to the _N_ ’s _if-not_
137
+ child. The conclusion given by this process is the
138
+ conclusion from the last node in the SCRDR tree
139
+ which _fired_ (satisfied by the case). To ensure that
140
+ a conclusion is always given, the root node typically contains a trivial condition which is always
141
+ satisfied. This node is called the _default_ node.
142
+ A new node containing a new rule (i.e. a new exception rule) is added to an SCRDR tree when the
143
+ evaluation process returns the _wrong_ conclusion.
144
+ The new node is attached to the last node in the
145
+ evaluation path of the given case with the _except_
146
+ link if the last node is the _fired_ one. Otherwise, it
147
+ is attached with the _if-not_ link.
148
+ For example, with the SCRDR tree in Figure 1, given a case _“as/IN investors/NNS an-
149
+ _ticipate/VB a/DT recovery/NN”_ where _“antici-_
150
+ _pate/VB”_ is the current word and tag pair, the case
151
+ satisfies the conditions of the rules at nodes (0),
152
+ (1) and (3), it then is passed to the node (6) (utilizing except links). As the case does not satisfy the
153
+ condition of the rule at node (6), it will be transferred to node (7) using if-not link. Since the case
154
+ does not fulfill the conditions of the rules at nodes
155
+ (7) and (8), we have the evaluation path (0)-(1)-(3)-(6)-(7)-(8) with fired node (3). Therefore, the
156
+ tag for _“anticipate”_ is concluded as “VBP”.
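
The evaluation walk just described can be sketched as a small recursive data structure; the class and function names below are ours, and conditions are modeled as plain predicates over a case:

```python
class Node:
    """One SCRDR rule `if condition then conclusion`, with its two children."""
    def __init__(self, condition, conclusion, except_child=None, if_not_child=None):
        self.condition = condition            # predicate over a case
        self.conclusion = conclusion
        self.except_child = except_child      # followed when the rule fires
        self.if_not_child = if_not_child      # followed when it does not

def evaluate(root, case):
    """Return the conclusion of the last node on the evaluation path that fired."""
    node, conclusion = root, None
    while node is not None:
        if node.condition(case):
            conclusion = node.conclusion
            node = node.except_child
        else:
            node = node.if_not_child
    return conclusion
```

Because the default node's condition is always satisfied, `evaluate` always returns some conclusion.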
157
+
158
+
159
+
160
+ Rule (1) - the rule at node (1) - is the exception
161
+ rule [3] of the default rule (0). As node (2) is the if-not child node of node (1), the associated rule
162
+ (2) is also an exception rule of the rule (0). Similarly, both rules (3) and (4) are exception rules of
163
+ the rule (1) whereas all rules (6), (7) and (8) are
164
+ exception rules of the rule (3), and so on. Thus,
165
+ the exception structure of the SCRDR tree extends
166
+ to 4 levels: rules (1) and (2) at layer 1, rules (3),
167
+ (4) and (5) at layer 2, rules (6), (7) and (8) at layer
168
+ 3, and rule (9) at layer 4.
169
+
170
+
171
+ **3** **The RDRPOSTagger toolkit**
172
+
173
+
174
+ The toolkit consists of four main components: Utility, Initial-tagger, SCRDR-learner and
175
+ SCRDR-tagger.
176
+
177
+
178
+ **3.1** **The Utility**
179
+
180
+
181
+ The major functions of this component are to evaluate tagging performances (displaying accuracy
182
+ results), and to create a lexicon of words and the
183
+ most frequent associated tags as well as to extract
184
+ _Raw corpus_ from an input golden training corpus.
185
+
186
+
187
+ **3.2** **The Initial-tagger**
188
+
189
+
190
+ The initial-tagger developed in the RDRPOSTagger toolkit is based on the lexicon generated by the Utility component to assign a tag to each word. To deal with unknown
191
+ words, the initial-tagger utilizes several regular expressions or heuristics for English and Vietnamese
192
+ whereas the most frequent tag in the training corpus is exploited to label unknown-words when
193
+ adapting to other languages.
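
A minimal sketch of this lexicon-based initial tagging (the helper names are ours; the language-independent fallback is the overall most frequent tag, as the text describes):

```python
from collections import Counter, defaultdict

def build_lexicon(tagged_corpus):
    """Map each word to its most frequent tag in a (word, tag) training corpus."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def initial_tag(words, lexicon, default_tag):
    """Assign each word its lexicon tag, falling back to default_tag
    for unknown words (the language-independent case)."""
    return [lexicon.get(w, default_tag) for w in words]
```

For English and Vietnamese the fallback would instead be the regular-expression heuristics mentioned above.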
194
+
195
+
196
+ **3.3** **The SCRDR-learner**
197
+
198
+
199
+ The SCRDR-learner component uses a failure-driven method to automatically build a SCRDR
200
+ tree of transformation rules. Figure 3 describes the
201
+ learning process of the learner.
202
+
203
+
204
+ 3The default rule is the unique rule which is not an exception rule of any other rule. Every rule in layer _n_ is an exception rule of a rule in layer _n −_ 1.
205
+
206
+
207
+
208
+ 18
209
+
210
+
211
+ #12: _if_ next1stTag == **“object.next1stTag”** _then_ tag = **“correctTag”**
+ #14: _if_ prev1stTag == **“object.prev1stTag”** _then_ tag = **“correctTag”**
+ #18: _if_ word == **“object.word”** && next1stTag == **“object.next1stTag”** _then_ tag = **“correctTag”**
214
+
215
+
216
+ _Figure 2:_ Rule template examples.
217
+
218
+
219
+
220
+ _Figure 3:_ The diagram of the learning process of the learner.
221
+
222
+
223
+ The _Initialized corpus_ is returned by performing the Initial-tagger on the Raw corpus. By comparing the initialized one with the _Golden corpus_,
224
+ an _Object-driven dictionary_ of pairs ( _**O**_ _bject, cor-_
225
+ _rectTag_ ) is produced in which _Object_ captures the
226
+ 5-word window context covering the current word
227
+ and its tag in following format ( _previous_ 2 _[nd]_ _word_
228
+ _/ previous_ 2 _[nd]_ _tag, previous_ 1 _[st]_ _word / previous_
229
+ 1 _[st]_ _tag, word / currentTag, next_ 1 _[st]_ _word / next_ 1 _[st]_
230
+
231
+ _tag, next_ 2 _[nd]_ _word / next_ 2 _[nd]_ _tag_ ) from the initialized corpus, and the _correctTag_ is the corresponding tag of the current word in the golden corpus.
232
+ There are 27 _Rule templates_ applied for _Rule se-_
233
+ _lector_ which is to select the most suitable rules
234
+ to build the _SCRDR tree_ . Examples of the rule
235
+ templates are shown in figure 2 where elements
236
+ in bold will be replaced by concrete values from
237
+ _Object_ s in the object-driven dictionary to create
238
+ concrete rules. The SCRDR tree of rules is initialized by building the default rule and all exception
239
+ rules of the default one in form of _if currentTag =_
240
+ _“_ _**TAG**_ _” then tag = “_ _**TAG**_ _”_ at the layer-1 exception
241
+ structure, for example rules (1) and (2) in the figure 1, and the like. The learning approach to construct new exception rules to the tree is as follows:
242
+
243
+ _•_ At a node-F in the SCRDR tree, let _SO_ be
244
+ the set of Objects from the object-driven dictionary, which those Objects are fired at the node-F
245
+ but their initialized tags are incorrect (the _current-_
246
+ _Tag_ is not the _correctTag_ associated). It means that
247
+ node-F gives wrong conclusions to all Objects in
248
+ the _SO_ set.
249
+
250
+ _•_ In order to select a new exception rule of the
251
+ rule at node-F from all concrete rules which are
252
+
253
+
254
+
255
+ generated for all Objects in the _SO_ set, the selected rule have to satisfy constraints: (i) The rule
256
+ must be unsatisfied by cases for which node-F has
257
+ already given correct conclusions. This constraint
258
+ does not apply to node-F at layer-1 exception structure. (ii) The rule must associate to a highest score
259
+ value of subtracting B from A in comparison to
260
+ other ones, where A and B are the numbers of the
261
+ _SO_ ’s Objects which are correctly and incorrectly
262
+ concluded by the rule respectively. (iii) And the
263
+ highest value is not smaller than a given threshold.
264
+ The SCRDR-learner applies two threshold parameters: first threshold is to choose exception
265
+ rules at the layer-2 exception structure (e.g rules
266
+ (3), (4) and (5) in figure 1), and second threshold
267
+ is to select rules for higher exception layers.
268
+
269
+ _•_ The process to add new exception rules is repeated until there is no rule satisfying the constraints above. At each iteration, a new rule is
270
+ added to the current SCRDR tree to correct error
271
+ conclusions made by the tree.
272
+
273
+
274
+ **3.4** **The SCRDR-tagger**
275
+
276
+
277
+ The SCRDR-tagger component is to perform the
278
+ POS tagging on a raw text corpus where each line
279
+ is a sequence of words separated by white space
280
+ characters. The component labels the text corpus
281
+ by using the Initial-tagger. It slides due to a leftto-right direction on a 5-word window context to
282
+ generate a corresponding Object for each initially
283
+ tagged word. The Object is then classified by the
284
+ learned SCRDR tree model to produce final conclusion tag of the word as illustrated in the example in the section 2.
285
+
286
+
287
+ **4** **Evaluation**
288
+
289
+
290
+ The RDRPOSTagger has already been successfully applied to English and Vietnamese corpora.
291
+
292
+
293
+ **4.1** **Results for English**
294
+
295
+
296
+ Experiments for English employed the Penn WSJ
297
+ Treebank corpus to exploit the WSJ sections 0-18
298
+ (38219 sentences) for training, the WSJ sections
299
+ 19-21 (5527 sentences) for validation and the WSJ
300
+ sections 22-24 (5462 sentences) for test.
301
+ Using a lexicon created in the use of the train
302
+
303
+
304
+ 19
305
+
306
+
307
+ ing set, the Initial-tagger obtains an accuracy of
308
+ 93.51% on the test set. By varying the thresholds
309
+ on the validation set, we have found the most suitable values [4] of 3 and 2 to be used for evaluating
310
+ the RDRPOSTagger on the test set. Those thresholds return a SCRDR tree model of 2319 rules
311
+ in a 4-level exception structure. The training time
312
+ and tagging speed for those thresholds are mentioned in the introduction section. On the same test
313
+ set, the RDRPOSTagger achieves a performance at
314
+ 96.49% against 96.46% accounted for the state-ofthe-art POS tagger TnT (Brants, 2000).
315
+ For another experiment, only in training process: 1-time occurrence words in training set are
316
+ initially tagged as out-of-dictionary words. With
317
+ a learned tree model of 2418 rules, the tagger
318
+ reaches an accuracy of 96.51% on the test set.
319
+ Retraining the tagger utilizing another initial
320
+ tagger [5] developed in the Brill’s tagger (Brill,
321
+ 1995) instead of the lexicon-based initial one,
322
+ the RDRPOSTagger gains an accuracy result of
323
+ 96.57% which is slightly higher than the performance at 96.53% of the Brill’s.
324
+
325
+
326
+ **4.2** **Results for Vietnamese**
327
+
328
+
329
+ In the first Evaluation Campaign [6] on Vietnamese
330
+ Language Processing, the POS tagging track provided a golden training corpus of 28k sentences
331
+ (631k words) collected from two sources of the
332
+ national VLSP project and the Vietnam Lexicography Center, and a raw test corpus of 2100 sentences (66k words). The training process returned
333
+ a SCRDR tree of 2896 rules [7]. Obtaining the highest
334
+ performance on the test set, the RDRPOSTagger
335
+ surpassed all other participating systems.
336
+ We also carry out POS tagging experiments on
337
+ the golden corpus of 28k sentences and on the
338
+ Vietnamese Treebank of 10k sentences (Nguyen
339
+ et al., 2009) according to a 5-fold cross-validation
340
+ scheme [8]. The average accuracy results are presented in Table 1. Achieving an accuracy of
341
+ 92.59% on the Vietnamese Treebank, the RDR
342
+
343
+ [4] The thresholds 3 and 2 are reused for all other experiments in English and Vietnamese.
344
+ [5] The initial tagger gets a result of 93.58% on the test set.
345
+ [6] http://uet.vnu.edu.vn/rivf2013/campaign.html
346
+ [7] It took 100 minutes to construct the tree, leading to tagging speeds of 1100 words/second and 45k words/second for
347
+ the implementations in Python and Java, respectively, on a
348
+ computer with a Core 2 Duo 2.4GHz CPU & 3GB of memory.
349
+ [8] In each cross-validation run, one fold is selected as the test
350
+ set, and the 4 remaining folds are merged as the training set. The initial
351
+ tagger exploits a lexicon generated from the training set. In
352
+ training process, 1-time occurrence words are initially labeled
353
+ as out-of-lexicon words.
354
+
355
+
356
+
357
+ **Table 1:** Accuracy results for Vietnamese
358
+
359
+ | Corpus | Initial-tagger | RDRPOSTagger |
360
+ |--------|----------------|--------------|
361
+ | 28k | 91.18% | 93.42% |
362
+ | 10k | 90.59% | 92.59% |
363
+
364
+
365
+ POSTagger outperforms previous Maximum Entropy Model, Conditional Random Field and Support Vector Machine-based POS tagging systems
366
+ (Tran et al., 2009) on the same evaluation scheme.
367
+
368
+
369
+ **5** **Demonstration and Conclusion**
370
+
371
+
372
+ In addition to English and Vietnamese, in the
373
+ demonstration session, we will present promising
374
+ experimental results and run the RDRPOSTagger
375
+ for other languages including Bulgarian, Czech,
376
+ Danish, Dutch, French, German, Hindi, Italian,
377
+ Lao, Portuguese, Spanish, Swedish and Thai. We
378
+ will also let the audience contribute their own
379
+ data sets for retraining and testing the tagger.
380
+ In this paper, we describe the rule-based
381
+ POS tagging toolkit RDRPOSTagger to automatically construct transformation rules in the form
382
+ of the SCRDR exception structure. We believe that our robust, easy-to-use and language-independent toolkit RDRPOSTagger can be useful
383
+ for NLP/CL-related tasks.
384
+
385
+
386
+ **References**
387
+
388
+
389
+ Thorsten Brants. 2000. TnT: a statistical part-of-speech tagger. In _Proc. of 6th Applied Natural Language Processing Conference_, pages 224–231.
391
+ Eric Brill. 1995. Transformation-based error-driven
392
+ learning and natural language processing: a case
393
+ study in part-of-speech tagging. _Comput. Linguist._,
394
+ 21(4):543–565.
395
+ Mitchell P Marcus, Mary Ann Marcinkiewicz, and
396
+ Beatrice Santorini. 1993. Building a large annotated corpus of English: the Penn Treebank. _Comput._
397
+ _Linguist._, 19(2):313–330.
398
+ Phuong Thai Nguyen, Xuan Luong Vu, Thi
399
+ Minh Huyen Nguyen, Van Hiep Nguyen, and
400
+ Hong Phuong Le. 2009. Building a Large
401
+ Syntactically-Annotated Corpus of Vietnamese. In
402
+ _Proc. of LAW III workshop_, pages 182–185.
403
+ Dat Quoc Nguyen, Dai Quoc Nguyen, Son Bao Pham,
404
+ and Dang Duc Pham. 2011. Ripple Down Rules for
405
+ Part-of-Speech Tagging. In _Proc. of 12th CICLing -_
406
+ _Volume Part I_, pages 190–201.
407
+ Debbie Richards. 2009. Two decades of ripple down
408
+ rules research. _Knowledge Engineering Review_,
409
+ 24(2):159–184.
410
+ Oanh Thi Tran, Cuong Anh Le, Thuy Quang Ha, and
411
+ Quynh Hoang Le. 2009. An experimental study
412
+ on Vietnamese POS tagging. In _Proc. of the 2009 Inter-_
413
+ _national Conference on Asian Language Processing_,
414
+ pages 23–27.
415
+
416
+
417
+
419
+
420
+
references/2014.eacl.nguyen/paper.tex ADDED
@@ -0,0 +1,419 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ \documentclass[11pt]{article}
2
+ \usepackage[utf8]{inputenc}
3
+ \usepackage{amsmath,amssymb}
4
+ \usepackage{booktabs}
5
+ \usepackage{hyperref}
6
+ \usepackage{graphicx}
7
+
8
+ \begin{document}
9
+
10
+ \section*{\textbf{RDRPOSTagger: A Ripple Down Rules-based Part-Of-Speech Tagger}}
11
+
12
+ \textbf{Dat Quoc Nguyen} [1] and \textbf{Dai Quoc Nguyen} [1] and \textbf{Dang Duc Pham} [2] and \textbf{Son Bao Pham} [1]
13
+
14
+ 1
15
+ Faculty of Information Technology
16
+ University of Engineering and Technology
17
+ Vietnam National University, Hanoi
18
+ {datnq, dainq, sonpb}@vnu.edu.vn
19
+ 2
20
+ L3S Research Center, Germany
21
+ pham@L3S.de
22
+
23
+
24
+
25
+ \textbf{Abstract}
26
+
27
+
28
+ This paper describes our robust, easy-to-use and language-independent toolkit
29
+ namely RDRPOSTagger which employs
30
+ an error-driven approach to automatically
31
+ construct a Single Classification Ripple
32
+ Down Rules tree of transformation rules
33
+ for the POS tagging task. During the demonstration session, we will run the tagger on
34
+ data sets in 15 different languages.
35
+
36
+
37
+ \textbf{1} \textbf{Introduction}
38
+
39
+
40
+ As one of the most important tasks in Natural
41
+ Language Processing, Part-of-speech (POS) tagging assigns to each word in a text a tag
42
+ representing its lexical category. Recently, POS
43
+ taggers employing machine learning techniques
44
+ are still mainstream toolkits obtaining state-of-the-art performances [1]. However, most of them are
45
+ time-consuming in the learning process and require a
46
+ powerful computer to train machine
47
+ learning models.
48
+ Turning to rule-based approaches, the most
49
+ well-known method is proposed by Brill (1995).
50
+ He proposed an approach to automatically learn
51
+ transformation rules for the POS tagging problem.
52
+ In Brill’s tagger, each new rule is learned
53
+ on a context that is generated by all previous rules,
54
+ and a later rule modifies the outputs of
55
+ all the preceding rules. Hence, this procedure makes it difficult to control the interactions among
56
+ a large number of rules.
57
+ Our RDRPOSTagger is presented to overcome
58
+ the problems mentioned above. The RDRPOSTagger exploits a failure-driven approach to automatically restructure transformation rules in the
59
+ form of a Single Classification Ripple Down Rules
60
+ (SCRDR) tree (Richards, 2009). It accepts interactions between rules, but a rule only changes the
61
+
62
+
63
+ [1] http://aclweb.org/aclwiki/index.php?title=POS_Tagging_(State_of_the_art)
65
+
66
+
68
+
69
+
70
+
71
+ outputs of some previous rules in a controlled context. All rules are structured in a SCRDR tree
72
+ which allows a new exception rule to be added
73
+ when the tree returns an incorrect classification.
74
+ A specific description of our new RDRPOSTagger
75
+ approach is detailed in (Nguyen et al., 2011).
76
+ Packaged in a 0.6MB zip file, implementations
77
+ in Python and Java can be found at the tagger’s
78
+ website _http://rdrpostagger.sourceforge.net/_ . The
79
+ following items exhibit properties of the tagger:
80
+
81
+ _•_ The RDRPOSTagger is easy to configure and
82
+ train. There are only two threshold parameters utilized to learn the rule-based model. Besides, the
83
+ tagger is very simple to use with standard input
84
+ and output, having clear usage and instructions
85
+ available on its website.
86
+
87
+ _•_ The RDRPOSTagger is language independent.
88
+ This POS tagging toolkit has been successfully
89
+ applied to English and Vietnamese. To train the
90
+ toolkit for other languages, users just provide a
91
+ lexicon of words and the most frequent associated
92
+ tags. Moreover, it can be easily combined with existing POS taggers to reach an even better result.
93
+
94
+ _•_ The RDRPOSTagger obtains very competitive
95
+ accuracies. On Penn WSJ Treebank corpus (Marcus et al., 1993), taking WSJ sections 0-18 as the
96
+ training set, the tagger achieves a competitive performance compared to other state-of-the-art English POS taggers on the test set of WSJ sections
97
+ 22-24. For Vietnamese, it outperforms all previous machine learning-based POS tagging systems
98
+ to obtain an up-to-date highest result on the Vietnamese Treebank corpus (Nguyen et al., 2009).
99
+
100
+ _•_ The RDRPOSTagger is fast. For instance in
101
+ English, the time [2] taken to train the tagger on
102
+ the WSJ sections 0-18 is \textbf{40} minutes. The tagging
103
+ speed on the test set of the WSJ sections 22-24 is
104
+ \textbf{2800} words/second for the latest implementation in Python, whilst it is \textbf{92k} words/second
105
+
106
+
107
+ [2] Training and tagging times are computed on a Windows 7 computer with a Core 2 Duo 2.4GHz CPU & 3GB of memory.
108
+
109
+
110
+
111
+ _Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics_, pages 17–20,
112
+ Gothenburg, Sweden, April 26-30 2014. c _⃝_ 2014 Association for Computational Linguistics
113
+
114
+
115
+ _Figure 1:_ A part of our SCRDR tree for English POS tagging.
116
+
117
+
118
+
119
+ for the implementation in Java.
120
+
121
+
122
+ \textbf{2} \textbf{SCRDR methodology}
123
+
124
+
125
+ A SCRDR tree (Richards, 2009) is a binary tree
126
+ with two distinct types of edges. These edges are
127
+ typically called _except_ and _if-not_ edges. Associated with each node in a tree is a _rule_ . A rule has
128
+ the form: _if α then β_ where _α_ is called the _condi-_
129
+ _tion_ and _β_ is referred to as the _conclusion_ .
130
+ Cases in SCRDR are evaluated by passing a
131
+ case to the root of the tree. At any node in the
132
+ tree, if the condition of a node _N_ ’s rule is satisfied by the case, the case is passed on to the exception child of _N_ using the _except_ link if it exists.
133
+ Otherwise, the case is passed on to the _N_ ’s _if-not_
134
+ child. The conclusion given by this process is the
135
+ conclusion from the last node in the SCRDR tree
136
+ which _fired_ (satisfied by the case). To ensure that
137
+ a conclusion is always given, the root node typically contains a trivial condition which is always
138
+ satisfied. This node is called the _default_ node.
139
+ A new node containing a new rule (i.e. a new exception rule) is added to an SCRDR tree when the
140
+ evaluation process returns the _wrong_ conclusion.
141
+ The new node is attached to the last node in the
142
+ evaluation path of the given case with the _except_
143
+ link if the last node is the _fired_ one. Otherwise, it
144
+ is attached with the _if-not_ link.
145
+ For example with the SCRDR tree in the figure 1, given a case _“as/IN investors/NNS an-_
146
+ _ticipate/VB a/DT recovery/NN”_ where _“antici-_
147
+ _pate/VB”_ is the current word and tag pair, the case
148
+ satisfies the conditions of the rules at nodes (0),
149
+ (1) and (3), it then is passed to the node (6) (utilizing except links). As the case does not satisfy the
150
+ condition of the rule at node (6), it will be transferred to node (7) using if-not link. Since the case
151
+ does not fulfill the conditions of the rules at nodes
152
+ (7) and (8), we have the evaluation path (0)-(1)-(3)-(6)-(7)-(8) with fired node (3). Therefore, the
153
+ tag for _“anticipate”_ is concluded as “VBP”.
154
+
155
+
156
+
157
+ Rule (1) - the rule at node (1) - is the exception
158
+ rule [3] of the default rule (0). As node (2) is the if-not child node of node (1), the associated rule
159
+ (2) is also an exception rule of the rule (0). Similarly, both rules (3) and (4) are exception rules of
160
+ the rule (1) whereas all rules (6), (7) and (8) are
161
+ exception rules of the rule (3), and so on. Thus,
162
+ the exception structure of the SCRDR tree extends
163
+ to 4 levels: rules (1) and (2) at layer 1, rules (3),
164
+ (4) and (5) at layer 2, rules (6), (7) and (8) at layer
165
+ 3, and rule (9) at layer 4.
166
+
167
+
168
+ \textbf{3} \textbf{The RDRPOSTagger toolkit}
169
+
170
+
171
+ The toolkit consists of four main components: Utility, Initial-tagger, SCRDR-learner and
172
+ SCRDR-tagger.
173
+
174
+
175
+ \textbf{3.1} \textbf{The Utility}
176
+
177
+
178
+ The major functions of this component are to evaluate tagging performances (displaying accuracy
179
+ results), and to create a lexicon of words and the
180
+ most frequent associated tags as well as to extract
181
+ _Raw corpus_ from an input golden training corpus.
182
+
183
+
184
+ \textbf{3.2} \textbf{The Initial-tagger}
185
+
186
+
187
+ The initial-tagger developed in the RDRPOSTagger toolkit assigns a tag to each word based on the lexicon generated by the Utility component. To deal with unknown
188
+ words, the initial-tagger utilizes several regular expressions or heuristics for English and Vietnamese
189
+ whereas the most frequent tag in the training corpus is exploited to label unknown-words when
190
+ adapting to other languages.
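The lexicon-based initial tagging idea can be sketched as below: each known word receives its most frequent training-corpus tag, and unknown words fall back to the overall most frequent tag (the language-independent fallback mentioned above). A hypothetical sketch, not the toolkit's code:

```python
# Sketch of lexicon-based initial tagging with a most-frequent-tag fallback.
from collections import Counter, defaultdict

def build_lexicon(tagged_corpus):
    """tagged_corpus: list of (word, tag) pairs from the training corpus."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
        all_tags[tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return lexicon, all_tags.most_common(1)[0][0]

def initial_tag(words, lexicon, default_tag):
    # Unknown words get the most frequent tag of the whole training corpus.
    return [(w, lexicon.get(w, default_tag)) for w in words]

corpus = [("the", "DT"), ("dog", "NN"), ("runs", "VBZ"), ("the", "DT"),
          ("run", "NN"), ("run", "VB"), ("run", "NN")]
lexicon, default = build_lexicon(corpus)
print(initial_tag(["the", "run", "zebra"], lexicon, default))
# -> [('the', 'DT'), ('run', 'NN'), ('zebra', 'NN')]
```

For English and Vietnamese the paper instead uses regular expressions and heuristics for unknown words; the fallback above corresponds to the other-language setting.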
191
+
192
+
193
+ \textbf{3.3} \textbf{The SCRDR-learner}
194
+
195
+
196
+ The SCRDR-learner component uses a failure-driven method to automatically build a SCRDR
197
+ tree of transformation rules. Figure 3 describes the
198
+ learning process of the learner.
199
+
200
+
201
+ [3] The default rule is the unique rule which is not an exception rule of any other rule. Every rule in layer _n_ is an exception rule of a rule in layer _n −_ 1.
202
+
203
+
204
+
206
+
207
+
208
+ #12: _if_ next1stTag == \textbf{“object.next1stTag”} _then_ tag = \textbf{“correctTag”}
209
+ #14: _if_ prev1stTag == \textbf{“object.prev1stTag”} _then_ tag = \textbf{“correctTag”}
210
+ #18: _if_ word == \textbf{“object.word”} && next1stTag == \textbf{“object.next1stTag”} _then_ tag = \textbf{“correctTag”}
211
+
212
+
213
+ _Figure 2:_ Rule template examples.
214
+
215
+
216
+
217
+ _Figure 3:_ The diagram of the learning process of the learner.
218
+
219
+
220
+ The _Initialized corpus_ is returned by performing the Initial-tagger on the Raw corpus. By comparing the initialized one with the _Golden corpus_,
221
+ an _Object-driven dictionary_ of pairs ( _\textbf{O}_ _bject, cor-_
222
+ _rectTag_ ) is produced in which _Object_ captures the
223
+ 5-word window context covering the current word
224
+ and its tag in following format ( _previous_ 2 _[nd]_ _word_
225
+ _/ previous_ 2 _[nd]_ _tag, previous_ 1 _[st]_ _word / previous_
226
+ 1 _[st]_ _tag, word / currentTag, next_ 1 _[st]_ _word / next_ 1 _[st]_
227
+
228
+ _tag, next_ 2 _[nd]_ _word / next_ 2 _[nd]_ _tag_ ) from the initialized corpus, and the _correctTag_ is the corresponding tag of the current word in the golden corpus.
229
+ There are 27 _Rule templates_ applied for _Rule se-_
230
+ _lector_ which is to select the most suitable rules
231
+ to build the _SCRDR tree_ . Examples of the rule
232
+ templates are shown in figure 2 where elements
233
+ in bold will be replaced by concrete values from
234
+ _Object_ s in the object-driven dictionary to create
235
+ concrete rules. The SCRDR tree of rules is initialized by building the default rule and all exception
236
+ rules of the default one in form of _if currentTag =_
237
+ _“_ _\textbf{TAG}_ _” then tag = “_ _\textbf{TAG}_ _”_ at the layer-1 exception
238
+ structure, for example rules (1) and (2) in the figure 1, and the like. The learning approach to construct new exception rules to the tree is as follows:
239
+
240
+ _•_ At a node-F in the SCRDR tree, let _SO_ be
241
+ the set of Objects from the object-driven dictionary that are fired at node-F
242
+ but their initialized tags are incorrect (the
243
+ _currentTag_ is not the associated _correctTag_). It means that
244
+ node-F gives wrong conclusions to all Objects in
245
+ the _SO_ set.
246
+
247
+ _•_ In order to select a new exception rule of the
248
+ rule at node-F from all concrete rules which are
249
+
250
+
251
+
252
+ generated for all Objects in the _SO_ set, the selected rule has to satisfy the following constraints: (i) The rule
253
+ must be unsatisfied by cases for which node-F has
254
+ already given correct conclusions. This constraint
255
+ does not apply to node-F at layer-1 exception structure. (ii) The rule must associate to a highest score
256
+ value of subtracting B from A in comparison to
257
+ other ones, where A and B are the numbers of the
258
+ _SO_ ’s Objects which are correctly and incorrectly
259
+ concluded by the rule respectively. (iii) And the
260
+ highest value is not smaller than a given threshold.
261
+ The SCRDR-learner applies two threshold parameters: the first threshold is to choose exception
262
+ rules at the layer-2 exception structure (e.g. rules
263
+ (3), (4) and (5) in Figure 1), and the second threshold
264
+ is to select rules for higher exception layers.
265
+
266
+ _•_ The process to add new exception rules is repeated until there is no rule satisfying the constraints above. At each iteration, a new rule is
267
+ added to the current SCRDR tree to correct error
268
+ conclusions made by the tree.
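The rule-selection step, scoring each candidate rule by A − B (its correctly minus incorrectly concluded Objects in the SO set) and keeping it only if the score reaches the threshold, can be sketched as follows. The toy data and rule conditions are our own assumptions, not rules from the paper:

```python
# Sketch of exception-rule selection at a node: pick the candidate with the
# highest score A - B on the wrongly-tagged Objects (the SO set), and keep
# it only if the score reaches the threshold.
def select_rule(candidates, so_objects, threshold):
    """candidates: list of (condition_fn, conclusion_tag);
    so_objects: list of (object_dict, correct_tag) wrongly tagged here."""
    best, best_score = None, None
    for cond, conclusion in candidates:
        a = sum(1 for obj, gold in so_objects if cond(obj) and conclusion == gold)
        b = sum(1 for obj, gold in so_objects if cond(obj) and conclusion != gold)
        score = a - b
        if best_score is None or score > best_score:
            best, best_score = (cond, conclusion), score
    return best if best_score is not None and best_score >= threshold else None

so = [({"next1stTag": "DT"}, "VBP"), ({"next1stTag": "DT"}, "VBP"),
      ({"next1stTag": "NN"}, "VBP")]
candidates = [(lambda o: o["next1stTag"] == "DT", "VBP"),
              (lambda o: o["next1stTag"] == "NN", "VB")]
rule = select_rule(candidates, so, threshold=2)
print(rule[1] if rule else None)  # -> VBP (score A - B = 2 - 0 = 2)
```

Constraint (i) from the paper (the rule must not break already-correct conclusions) is omitted here for brevity.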
269
+
270
+
271
+ \textbf{3.4} \textbf{The SCRDR-tagger}
272
+
273
+
274
+ The SCRDR-tagger component is to perform the
275
+ POS tagging on a raw text corpus where each line
276
+ is a sequence of words separated by white space
277
+ characters. The component labels the text corpus
278
+ by using the Initial-tagger. It slides in a left-to-right direction over a 5-word window context to
279
+ generate a corresponding Object for each initially
280
+ tagged word. The Object is then classified by the
281
+ learned SCRDR tree model to produce the final tag of the word, as illustrated by the example in Section 2.
282
+
283
+
284
+ \textbf{4} \textbf{Evaluation}
285
+
286
+
287
+ The RDRPOSTagger has already been successfully applied to English and Vietnamese corpora.
288
+
289
+
290
+ \textbf{4.1} \textbf{Results for English}
291
+
292
+
293
+ Experiments for English employed the Penn WSJ
294
+ Treebank corpus to exploit the WSJ sections 0-18
295
+ (38219 sentences) for training, the WSJ sections
296
+ 19-21 (5527 sentences) for validation and the WSJ
297
+ sections 22-24 (5462 sentences) for test.
298
+ Using a lexicon created from the training set, the Initial-tagger obtains an accuracy of
305
+ 93.51% on the test set. By varying the thresholds
306
+ on the validation set, we have found the most suitable values [4] of 3 and 2 to be used for evaluating
307
+ the RDRPOSTagger on the test set. Those thresholds return a SCRDR tree model of 2319 rules
308
+ in a 4-level exception structure. The training time
309
+ and tagging speed for those thresholds are mentioned in the introduction section. On the same test
310
+ set, the RDRPOSTagger achieves an accuracy of
311
+ 96.49%, compared to 96.46% for the state-of-the-art POS tagger TnT (Brants, 2000).
312
+ In another experiment, during the training process only, words occurring once in the training set are
313
+ initially tagged as out-of-dictionary words. With
314
+ a learned tree model of 2418 rules, the tagger
315
+ reaches an accuracy of 96.51% on the test set.
316
+ Retraining the tagger with another initial
317
+ tagger [5] developed in Brill’s tagger (Brill,
318
+ 1995) instead of the lexicon-based one,
319
+ the RDRPOSTagger reaches an accuracy of
320
+ 96.57%, slightly higher than the 96.53% obtained by Brill’s tagger.
321
+
322
+
323
+ \textbf{4.2} \textbf{Results for Vietnamese}
324
+
325
+
326
+ In the first Evaluation Campaign [6] on Vietnamese
327
+ Language Processing, the POS tagging track provided a golden training corpus of 28k sentences
328
+ (631k words) collected from two sources of the
329
+ national VLSP project and the Vietnam Lexicography Center, and a raw test corpus of 2100 sentences (66k words). The training process returned
330
+ a SCRDR tree of 2896 rules [7]. Obtaining the highest
331
+ performance on the test set, the RDRPOSTagger
332
+ surpassed all other participating systems.
333
+ We also carry out POS tagging experiments on
334
+ the golden corpus of 28k sentences and on the
335
+ Vietnamese Treebank of 10k sentences (Nguyen
336
+ et al., 2009) according to a 5-fold cross-validation
337
+ scheme [8]. The average accuracy results are presented in Table 1. Achieving an accuracy of
338
+ 92.59% on the Vietnamese Treebank, the RDR
339
+
340
+ [4] The thresholds 3 and 2 are reused for all other experiments in English and Vietnamese.
341
+ [5] The initial tagger gets a result of 93.58% on the test set.
342
+ [6] http://uet.vnu.edu.vn/rivf2013/campaign.html
343
+ [7] It took 100 minutes to construct the tree, leading to tagging speeds of 1100 words/second and 45k words/second for
344
+ the implementations in Python and Java, respectively, on a
345
+ computer with a Core 2 Duo 2.4GHz CPU & 3GB of memory.
346
+ [8] In each cross-validation run, one fold is selected as the test
347
+ set, and the 4 remaining folds are merged as the training set. The initial
348
+ tagger exploits a lexicon generated from the training set. In
349
+ training process, 1-time occurrence words are initially labeled
350
+ as out-of-lexicon words.
351
+
352
+
353
+
354
+ \textbf{Table 1:} Accuracy results for Vietnamese
355
+
356
+ | Corpus | Initial-tagger | RDRPOSTagger |
357
+ |--------|----------------|--------------|
358
+ | 28k | 91.18% | 93.42% |
359
+ | 10k | 90.59% | 92.59% |
360
+
361
+
362
+ POSTagger outperforms previous Maximum Entropy Model, Conditional Random Field and Support Vector Machine-based POS tagging systems
363
+ (Tran et al., 2009) on the same evaluation scheme.
364
+
365
+
366
+ \textbf{5} \textbf{Demonstration and Conclusion}
367
+
368
+
369
+ In addition to English and Vietnamese, in the
370
+ demonstration session, we will present promising
371
+ experimental results and run the RDRPOSTagger
372
+ for other languages including Bulgarian, Czech,
373
+ Danish, Dutch, French, German, Hindi, Italian,
374
+ Lao, Portuguese, Spanish, Swedish and Thai. We
375
+ will also let the audience contribute their own
376
+ data sets for retraining and testing the tagger.
377
+ In this paper, we describe the rule-based
378
+ POS tagging toolkit RDRPOSTagger to automatically construct transformation rules in the form
379
+ of the SCRDR exception structure. We believe that our robust, easy-to-use and language-independent toolkit RDRPOSTagger can be useful
380
+ for NLP/CL-related tasks.
381
+
382
+
383
+ \textbf{References}
384
+
385
+
386
+ Thorsten Brants. 2000. TnT: a statistical part-of-speech tagger. In _Proc. of 6th Applied Natural Language Processing Conference_, pages 224–231.
388
+ Eric Brill. 1995. Transformation-based error-driven
389
+ learning and natural language processing: a case
390
+ study in part-of-speech tagging. _Comput. Linguist._,
391
+ 21(4):543–565.
392
+ Mitchell P Marcus, Mary Ann Marcinkiewicz, and
393
+ Beatrice Santorini. 1993. Building a large annotated corpus of English: the Penn Treebank. _Comput._
394
+ _Linguist._, 19(2):313–330.
395
+ Phuong Thai Nguyen, Xuan Luong Vu, Thi
396
+ Minh Huyen Nguyen, Van Hiep Nguyen, and
397
+ Hong Phuong Le. 2009. Building a Large
398
+ Syntactically-Annotated Corpus of Vietnamese. In
399
+ _Proc. of LAW III workshop_, pages 182–185.
400
+ Dat Quoc Nguyen, Dai Quoc Nguyen, Son Bao Pham,
401
+ and Dang Duc Pham. 2011. Ripple Down Rules for
402
+ Part-of-Speech Tagging. In _Proc. of 12th CICLing -_
403
+ _Volume Part I_, pages 190–201.
404
+ Debbie Richards. 2009. Two decades of ripple down
405
+ rules research. _Knowledge Engineering Review_,
406
+ 24(2):159–184.
407
+ Oanh Thi Tran, Cuong Anh Le, Thuy Quang Ha, and
408
+ Quynh Hoang Le. 2009. An experimental study
409
+ on Vietnamese POS tagging. In _Proc. of the 2009 Inter-_
410
+ _national Conference on Asian Language Processing_,
411
+ pages 23–27.
412
+
413
+
414
+
416
+
417
+
418
+
419
+ \end{document}
references/2018.naacl.vu/paper.md ADDED
@@ -0,0 +1,147 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ title: "VnCoreNLP: A Vietnamese Natural Language Processing Toolkit"
3
+ authors:
4
+ - "Thanh Vu"
5
+ - "Dat Quoc Nguyen"
6
+ - "Dai Quoc Nguyen"
7
+ - "Mark Dras"
8
+ - "Mark Johnson"
9
+ year: 2018
10
+ venue: "NAACL 2018 Demonstrations"
11
+ url: "https://aclanthology.org/N18-5012/"
12
+ ---
13
+
14
+ We present an easy-to-use and fast toolkit, namely VnCoreNLP---a Java NLP annotation pipeline for Vietnamese. Our VnCoreNLP supports key natural language processing (NLP) tasks including word segmentation, part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing, and obtains state-of-the-art (SOTA) results for these tasks.
15
+ We release VnCoreNLP to provide rich linguistic annotations to facilitate research work on Vietnamese NLP.
16
+ Our VnCoreNLP is open-source and available at: https://github.com/vncorenlp/VnCoreNLP.
17
+ # Introduction
18
+ Research on Vietnamese NLP has been actively explored in the last decade, boosted by the successes of the 4-year KC01.01/2006-2010 national project on Vietnamese language and speech processing (VLSP). Over the last 5 years, standard benchmark datasets for key Vietnamese NLP tasks are publicly available: datasets for word segmentation and POS tagging were released for the first VLSP evaluation campaign in 2013; a dependency treebank was published in 2014 [Nguyen2014NLDB]; and an NER dataset was released for the second VLSP campaign in 2016. So there is a need for building an NLP pipeline, such as the Stanford CoreNLP toolkit [manning-EtAl:2014:P14-5], for those key tasks to assist users and to support researchers and tool developers of downstream tasks.
19
+ Previous work built Vietnamese NLP pipelines by wrapping existing word segmenters and POS taggers including: JVnSegmenter [Y06-1028], vnTokenizer [Le2008], JVnTagger [NguyenPN2010] and vnTagger [lehong00526139]. However,
20
+ these word segmenters and POS taggers are no longer considered
21
+ SOTA models for Vietnamese [NguyenL2016,JCSCE]. The NNVLP toolkit for Vietnamese sequence labeling tasks was later built by applying a BiLSTM-CNN-CRF model [ma-hovy:2016:P16-1], but it was not compared against SOTA traditional feature-based models. In addition, NNVLP is slow, with a processing speed of about 300 words per second, which is not practical for real-world applications such as dealing with large-scale data.
22
+ *Figure 1: The overall system architecture of VnCoreNLP (image: VnCoreNLP_Architecture.pdf).*
26
+ In this paper, we present a Java NLP toolkit for Vietnamese, namely VnCoreNLP, which aims to facilitate Vietnamese NLP research by providing rich linguistic annotations through key NLP components of word segmentation, POS tagging, NER and dependency parsing. Figure [fig:diagram] describes the overall system architecture. The
27
+ following items highlight typical characteristics of VnCoreNLP:
31
+ - **Easy-to-use** -- All VnCoreNLP components are wrapped into a single .jar file, so users do not have to install external dependencies. Users can run processing pipelines from either the command-line or the Java API.
32
+ - **Fast** -- VnCoreNLP is fast, so it can be used for dealing with large-scale data. Also it benefits users suffering from limited computation resources (e.g. users from Vietnam).
33
+ - **Accurate** -- VnCoreNLP components obtain higher results than all previous published results on the same benchmark datasets.
34
+ # Basic usages
35
+ Our design goal is to make VnCoreNLP simple to setup and run from either the command-line or the Java API. Performing linguistic annotations for a given file can be done by using a simple command as in Figure [fig:command].
36
+     $ java -Xmx2g -jar VnCoreNLP.jar -fin input.txt -fout output.txt
38
+ Suppose that the file `input.txt` in Figure [fig:command] contains the sentence "Ông Nguyễn Khắc Chúc đang làm việc tại Đại học Quốc gia Hà Nội." (Mr Nguyen Khac Chuc is working at Vietnam National University Hanoi). Table [tab:expoutput] shows the output for this sentence in plain text form.
39
+ | Index | Word form | POS | NER | Head | DepLabel |
+ |---|---|---|---|---|---|
+ | 1 | Ông | Nc | O | 4 | sub |
+ | 2 | Nguyễn\_Khắc\_Chúc | Np | B-PER | 1 | nmod |
+ | 3 | đang | R | O | 4 | adv |
+ | 4 | làm\_việc | V | O | 0 | root |
+ | 5 | tại | E | O | 4 | loc |
+ | 6 | Đại\_học | N | B-ORG | 5 | pob |
+ | 7 | Quốc\_gia | N | I-ORG | 6 | nmod |
+ | 8 | Hà\_Nội | Np | I-ORG | 6 | nmod |
+ | 9 | . | CH | O | 4 | punct |
52
+ *Table 1: Output for the sentence "Ông Nguyễn Khắc Chúc đang làm việc tại Đại học Quốc gia Hà Nội." from file `input.txt` in Figure [fig:command]. The output is in a 6-column format representing word index, word form, POS tag, NER label, head index of the current word, and dependency relation type.*
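Downstream code can consume this 6-column output with a small reader-side helper. The following Python sketch is illustrative only (not part of the Java toolkit) and assumes tab-separated columns with blank lines between sentences:

```python
# Sketch of parsing VnCoreNLP's 6-column plain-text output: word index,
# word form, POS tag, NER label, head index, dependency relation.
def parse_output(text):
    sentences, current = [], []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:  # blank line separates sentences (our assumption)
            if current:
                sentences.append(current)
                current = []
            continue
        idx, form, pos, ner, head, dep = line.split("\t")
        current.append({"index": int(idx), "form": form, "pos": pos,
                        "ner": ner, "head": int(head), "dep": dep})
    if current:
        sentences.append(current)
    return sentences

sample = "1\tÔng\tNc\tO\t4\tsub\n2\tNguyễn_Khắc_Chúc\tNp\tB-PER\t1\tnmod\n"
tokens = parse_output(sample)[0]
print(tokens[1]["form"], tokens[1]["ner"])  # -> Nguyễn_Khắc_Chúc B-PER
```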
53
+ Similarly, we can also get the same output by using the API, as easily as in Listing [lst1].
54
+ Listing 1: Minimal code for an analysis pipeline.
+
+     VnCoreNLP pipeline = new VnCoreNLP();
+     Annotation annotation = new Annotation("Ông Nguyễn Khắc Chúc đang làm việc tại Đại học Quốc gia Hà Nội.");
+     pipeline.annotate(annotation);
+     String annotatedStr = annotation.toString();
59
+ In addition,
60
+ Listing [lst2] provides a more realistic and complete example code, presenting key components of the toolkit.
61
+ Here an annotation pipeline can be used for any text rather than just a single sentence, e.g. for a paragraph or entire news story.
62
+ # Components
63
+ This section briefly describes each component of VnCoreNLP. Note that our goal is not to develop a new approach or model for each component task. Here we focus on incorporating existing models into a single pipeline. In particular, except for a new model we develop for the language-dependent component of word segmentation, we apply traditional feature-based models which obtain SOTA results for English POS tagging, NER and dependency parsing to Vietnamese. The reason is based on a well-established belief in the literature that for a less-resourced language such as Vietnamese, we should consider using feature-based models to obtain fast and accurate performances, rather than using neural network-based models [King2015].
64
+ {1pt plus 1.0pt minus 1.0pt}
65
+ [float=tp,label=lst2,caption= {A simple and complete example code.}]
66
+ import vn.pipeline.*;
67
+ import java.io.*;
68
+ public class VnCoreNLPExample {
69
+ public static void main(String[] args) throws IOException {
70
+ // "wseg", "pos", "ner", and "parse" refer to as word segmentation, POS tagging, NER and dependency parsing, respectively.
71
+ String[] annotators = {"wseg", "pos", "ner", "parse"};
72
+ VnCoreNLP pipeline = new VnCoreNLP(annotators);
73
+ // Mr Nguyen Khac Chuc is working at Vietnam National University, Hanoi. Mrs Lan, Mr Chuc's wife, is also working at this university.
74
+ String str = "Ông Nguyễn Khắc Chúc đang làm việc tại Đại học Quốc gia Hà Nội. Bà Lan, vợ ông Chúc, cũng làm việc tại đây.";
75
+ Annotation annotation = new Annotation(str);
76
+ pipeline.annotate(annotation);
77
+ PrintStream outputPrinter = new PrintStream("output.txt");
78
+ pipeline.printToFile(annotation, outputPrinter);
79
+ // Users can get a single sentence to analyze individually
80
+ Sentence firstSentence = annotation.getSentences().get(0);
81
+ }
82
+ }
83
86
+ - **wseg** -- Unlike English, where white space is a strong indicator of word boundaries, in written Vietnamese white space is also used to separate syllables that constitute words. So word segmentation is referred to as the key first step in Vietnamese NLP. We have proposed a transformation rule-based learning model for Vietnamese word segmentation, which obtains better segmentation accuracy and speed than all previous word segmenters. See details in [NguyenNVDJ2018].
87
+ - **pos** -- To label words with their POS tag, we apply MarMoT which is a generic
88
+ CRF framework and a SOTA POS and morphological
89
+ tagger [mueller-schmid-schutze:2013:EMNLP].
90
+ - **ner** -- To recognize named entities, we apply a dynamic feature induction model that automatically optimizes feature combinations [choi:2016:N16-1].
91
+ - **parse** -- To perform dependency parsing, we apply the greedy version of a transition-based parsing model with selectional branching [choi2015ACL].
92
+ # Evaluation
93
+ We detail experimental results of the word segmentation (**wseg**) and POS tagging (**pos**) components of VnCoreNLP in [NguyenNVDJ2018] and [NguyenVNDJ-ALTA-2017], respectively. In particular, our word segmentation component gets the highest results in terms of both segmentation F1 score at 97.90% and speed at 62K words per second. Our POS tagging component also obtains the highest accuracy to date at 95.88% with a fast tagging speed at 25K words per second, and outperforms BiLSTM-CRF-based models.
94
+ Following subsections present evaluations for the NER (**ner**) and dependency parsing (**parse**) components.
95
+ ## Named entity recognition
96
+ We make a comparison between SOTA feature-based and neural network-based models, which, to the best of our knowledge, has not been done in any prior work on Vietnamese NER.
97
+ #### Dataset: The NER shared task at the 2016 VLSP workshop provides a set of 16,861 manually annotated sentences for training and development, and a set of 2,831 manually annotated sentences for test, with four NER labels: PER, LOC, ORG and MISC. Note that in both datasets, words are also supplied with gold POS tags. In addition, each word representing a full personal name is separated into the syllables that constitute it.
98
+ So this annotation scheme results in an unrealistic scenario for a pipeline evaluation because: (**i**) gold POS tags are not available in a real-world application, and (**ii**) in the standard annotation (and benchmark datasets) for Vietnamese word segmentation and POS tagging [nguyen-EtAl:2009:LAW-III], each full name is referred to as a word token (i.e., all word segmenters have been trained to output a full name as a word and all POS taggers have been trained to assign a label to the entire full-name).
99
+ For a more realistic scenario, we merge those contiguous syllables constituting a full name to form a word. Then we replace the gold POS tags with automatic tags predicted by our POS tagging component. From the set of 16,861 sentences, we sample 2,000 sentences for development and use the remaining 14,861 sentences for training.
100
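The merging step described above can be sketched as follows; this is our own illustration of the preprocessing, not the authors' code. Contiguous syllables labeled B-PER/I-PER are joined with underscores into a single word token (which keeps the B-PER label).

```java
import java.util.ArrayList;
import java.util.List;

public class PerMerger {
    // Joins each run of a B-PER syllable followed by I-PER syllables
    // into one underscore-separated word; other tokens pass through.
    public static List<String> merge(String[] syllables, String[] labels) {
        List<String> words = new ArrayList<>();
        StringBuilder current = null;
        for (int i = 0; i < syllables.length; i++) {
            if (labels[i].equals("I-PER") && current != null) {
                current.append("_").append(syllables[i]); // extend the name
            } else {
                if (current != null) words.add(current.toString());
                current = new StringBuilder(syllables[i]); // start a new token
            }
        }
        if (current != null) words.add(current.toString());
        return words;
    }

    public static void main(String[] args) {
        // "Ông Nguyễn/B-PER Khắc/I-PER Chúc/I-PER đang" -> "Ông Nguyễn_Khắc_Chúc đang"
        System.out.println(merge(
            new String[]{"Ông", "Nguyễn", "Khắc", "Chúc", "đang"},
            new String[]{"O", "B-PER", "I-PER", "I-PER", "O"}));
    }
}
```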
+ #### Models: We make an empirical comparison between the VnCoreNLP's NER component and the following neural network-based models:
101
104
+ - {BiLSTM-CRF} [HuangXY15] is a sequence labeling model which extends the BiLSTM model with a CRF layer.
105
+ - {BiLSTM-CRF + CNN-char}, i.e. {BiLSTM-CNN-CRF}, is an extension of {BiLSTM-CRF}, using CNN to derive character-based word representations [ma-hovy:2016:P16-1].
106
+ - {BiLSTM-CRF + LSTM-char} is an extension of {BiLSTM-CRF}, using BiLSTM to derive the character-based word representations [lample-EtAl:2016:N16-1].
107
+ - **BiLSTM-CRF+POS** is another extension of BiLSTM-CRF, incorporating embeddings of automatically predicted POS tags [reimers-gurevych:2017:EMNLP2017].
108
+ We use a well-known implementation of all BiLSTM-CRF-based models from [reimers-gurevych:2017:EMNLP2017], which is optimized for performance. We then follow [NguyenVNDJ-ALTA-2017, Section 3.4] to perform hyper-parameter tuning.
109
+ | **Model** | **F1** | **Speed** |
+ |---|---|---|
+ | VnCoreNLP | **88.55** | **18K** |
+ | BiLSTM-CRF | 86.48 | 2.8K |
+ | + CNN-char | 88.28 | 1.8K |
+ | + LSTM-char | 87.71 | 1.3K |
+ | BiLSTM-CRF+POS | 86.12 | \_ |
+ | + CNN-char | 88.06 | \_ |
+ | + LSTM-char | 87.43 | \_ |
+ Table [tab:ner]: F1 scores (in %) on the test set w.r.t. gold word-segmentation. "**Speed**" denotes the processing speed in number of words per second (for VnCoreNLP, we include the time POS tagging takes in the speed).
123
+ #### Main results: Table [tab:ner] presents the F1 score and speed of each model on the test set, where VnCoreNLP obtains the highest score at 88.55% with a fast speed at 18K words per second. In particular, VnCoreNLP is about 10 times faster than the second most accurate model, BiLSTM-CRF + CNN-char.
124
+ It is initially surprising that for such an isolated language as Vietnamese, where words are not inflected, using character-based representations helps produce 1+% improvements to the BiLSTM-CRF model. We find that the improvements to BiLSTM-CRF are mostly accounted for by the PER label. The reason turns out to be simple: about 50% of named entities are labeled with tag PER, so character-based representations are in fact able to capture common family, middle or given name syllables in "unknown" full-name words. Furthermore, we also find that BiLSTM-CRF-based models do not benefit from additional predicted POS tags. This is probably because BiLSTM can take word order into account, and without word inflection, all grammatical information in Vietnamese is conveyed through its fixed word order, so explicit predicted POS tags with noisy grammatical information are not helpful.
125
+ ## Dependency parsing
126
+ #### Experimental setup: We use the Vietnamese dependency treebank VnDT [Nguyen2014NLDB], consisting of 10,200 sentences, in our experiments. Following [NguyenALTA2016], we use the last 1,020 sentences of VnDT for test while the remaining sentences are used for training. Evaluation metrics are the labeled attachment score (LAS) and unlabeled attachment score (UAS).
127
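The two metrics can be stated precisely: UAS is the percentage of tokens whose predicted head matches the gold head, and LAS additionally requires the predicted dependency relation to match. A minimal sketch of the standard definitions (our own illustration, not the evaluation script used in the paper):

```java
public class AttachmentScores {
    // UAS: percentage of tokens attached to the correct head.
    public static double uas(int[] goldHeads, int[] predHeads) {
        int correct = 0;
        for (int i = 0; i < goldHeads.length; i++)
            if (goldHeads[i] == predHeads[i]) correct++;
        return 100.0 * correct / goldHeads.length;
    }

    // LAS: percentage of tokens with both the correct head
    // and the correct dependency relation label.
    public static double las(int[] goldHeads, String[] goldLabels,
                             int[] predHeads, String[] predLabels) {
        int correct = 0;
        for (int i = 0; i < goldHeads.length; i++)
            if (goldHeads[i] == predHeads[i]
                    && goldLabels[i].equals(predLabels[i])) correct++;
        return 100.0 * correct / goldHeads.length;
    }

    public static void main(String[] args) {
        // 3 of 4 heads correct -> UAS 75.0
        System.out.println(uas(new int[]{4, 1, 4, 0}, new int[]{4, 1, 3, 0}));
    }
}
```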
+ #### Main results: Table [tab:dep] compares the dependency parsing results of VnCoreNLP with results reported in prior work, using the same experimental setup. The first six rows present the scores with gold POS tags. The next two rows show scores of VnCoreNLP with automatic POS tags which are produced by our POS tagging component. The last row presents scores of the joint POS tagging and dependency parsing model jPTDP [NguyenCoNLL2017]. Table [tab:dep] shows that compared to previously published results, VnCoreNLP produces the highest LAS score. Note that previous results for other systems are reported without using additional information of automatically predicted NER labels. In this case, the LAS score for VnCoreNLP without automatic NER features (i.e. VnCoreNLP in Table [tab:dep]) is still higher than previous ones. Notably, we also obtain a fast parsing speed at 8K words per second.
128
+ | | **Model** | **LAS** | **UAS** | **Speed** |
+ |---|---|---|---|---|
+ | Gold POS | VnCoreNLP | **73.39** | 79.02 | \_ |
+ | | VnCoreNLP–NER | 73.21 | 78.91 | \_ |
+ | | BIST-bmstparser | 73.17 | **79.39** | \_ |
+ | | BIST-barchybrid | 72.53 | 79.33 | \_ |
+ | | MSTParser | 70.29 | 76.47 | \_ |
+ | | MaltParser | 69.10 | 74.91 | \_ |
+ | Auto POS | VnCoreNLP | **70.23** | 76.93 | 8K |
+ | | VnCoreNLP–NER | 70.10 | 76.85 | **9K** |
+ | | jPTDP | 69.49 | **77.68** | 700 |
+ Table [tab:dep]: LAS and UAS scores (in %) computed on all tokens (i.e. including punctuation) on the test set w.r.t. gold word-segmentation. "**Speed**" is defined as in Table [tab:ner]. The suffix "–NER" denotes the model without using automatically predicted NER labels as features. The results of MSTParser [McDonald2005OLT], MaltParser [Nivre2007], and the BiLSTM-based parsing models BIST-bmstparser and BIST-barchybrid [TACL885] are reported in [NguyenALTA2016]. The result of the jPTDP model for Vietnamese is mentioned in [NguyenVNDJ-ALTA-2017].
145
+ # Conclusion
146
+ In this paper, we have presented the VnCoreNLP toolkit, an easy-to-use, fast and accurate processing pipeline for Vietnamese NLP. VnCoreNLP provides core NLP steps including word segmentation, POS tagging, NER and dependency parsing. The current version of VnCoreNLP has been trained without any linguistic optimization, i.e. we only employ existing pre-defined features in the traditional feature-based models for POS tagging, NER and dependency parsing. So future work will focus on incorporating Vietnamese linguistic features into these feature-based models.
147
+ VnCoreNLP is released for research and educational purposes, and available at: https://github.com/vncorenlp/VnCoreNLP.
references/2018.naacl.vu/paper.tex ADDED
@@ -0,0 +1,302 @@
1
+ \documentclass[11pt,a4paper]{article}
2
+ \pdfoutput=1
3
+ \usepackage[hyperref]{naaclhlt2018}
4
+ \usepackage{times}
5
+ \usepackage{latexsym}
6
+
7
+ \usepackage{graphicx}
8
+ \usepackage{tabularx}
9
+ \usepackage{multirow}
10
+ %\usepackage{fixltx2e}
11
+ %\usepackage{enumitem}
12
+ \usepackage{marvosym}
13
+
14
+ \usepackage{vntex}
15
+ \usepackage[english]{babel}
16
+
17
+ %\usepackage[hidelinks]{hyperref}
18
+ \usepackage{url}
19
+
20
+
21
+ \usepackage{listings}
22
+
23
+ \lstset{
24
+ basicstyle=\footnotesize\ttfamily,
25
+ language=Java,
26
+ breaklines=true,
27
+ basicstyle=\footnotesize\ttfamily,
28
+ captionpos=b,
29
+ inputencoding=utf8,
30
+ escapeinside={\%*}{*)}
31
+ }
32
+
33
+ \aclfinalcopy % Uncomment this line for the final submission
34
+ %\def\aclpaperid{***} % Enter the acl Paper ID here
35
+
36
+ \setlength\titlebox{5.25cm}
37
+
38
+ \newcommand\BibTeX{B{\sc ib}\TeX}
39
+
40
+ \title{VnCoreNLP: A Vietnamese Natural Language Processing Toolkit}
41
+
42
+ \author{Thanh Vu$^1$, Dat Quoc Nguyen$^2$, Dai Quoc Nguyen$^3$, Mark Dras$^4$ \and Mark Johnson$^4$\\
43
+ $^1$Newcastle University, United Kingdom; $^2$The University of Melbourne, Australia; \\
44
+ $^3$Deakin University, Australia; $^4$Macquarie University, Australia \\
45
+ {\tt thanh.vu@newcastle.ac.uk}, {\tt dqnguyen@unimelb.edu.au},\\
46
+ {\tt dai.nguyen@deakin.edu.au}, {\tt\{mark.dras, mark.johnson\}@mq.edu.au}}
47
+
48
+ \date{}
49
+
50
+ \begin{document}
51
+ \maketitle
52
+ \begin{abstract}
53
+ We present an easy-to-use and fast toolkit, namely VnCoreNLP---a Java NLP annotation pipeline for Vietnamese. Our VnCoreNLP supports key natural language processing (NLP) tasks including word segmentation, part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing, and obtains state-of-the-art (SOTA) results for these tasks.
54
+ We release VnCoreNLP to provide rich linguistic annotations to facilitate research work on Vietnamese NLP.
55
+ Our VnCoreNLP is open-source and available at: \url{https://github.com/vncorenlp/VnCoreNLP}.
56
+ \end{abstract}
57
+
58
+
59
+
60
+ \section{Introduction}
61
+
62
+ Research on Vietnamese NLP has been actively explored in the last decade, boosted by the successes of the 4-year KC01.01/2006-2010 national project on Vietnamese language and speech processing (VLSP). Over the last 5 years, standard benchmark datasets for key Vietnamese NLP tasks are publicly available: datasets for word segmentation and POS tagging were released for the first VLSP evaluation campaign in 2013; a dependency treebank was published in 2014 \cite{Nguyen2014NLDB}; and an NER dataset was released for the second VLSP campaign in 2016. So there is a need for building an NLP pipeline, such as the Stanford CoreNLP toolkit \cite{manning-EtAl:2014:P14-5}, for those key tasks to assist users and to support researchers and tool developers of downstream tasks.
63
+
64
+ \newcite{NguyenPN2010} and \newcite{Le:2013:VOS} built Vietnamese NLP pipelines by wrapping existing word segmenters and POS taggers including: JVnSegmenter \cite{Y06-1028}, vnTokenizer \cite{Le2008}, JVnTagger \cite{NguyenPN2010} and vnTagger \cite{lehong00526139}. However,
65
+ these word segmenters and POS taggers are no longer considered
66
+ SOTA models for Vietnamese \cite{NguyenL2016,JCSCE}. \newcite{PhamPNP2017b} built the NNVLP toolkit for Vietnamese sequence labeling tasks by applying a BiLSTM-CNN-CRF model \cite{ma-hovy:2016:P16-1}. However, \newcite{PhamPNP2017b} did not make a comparison to SOTA traditional feature-based models. In addition, NNVLP is slow with a processing speed at about 300 words per second, which is not practical for real-world application such as dealing with large-scale data.
67
+
68
+
69
+ \setlength{\abovecaptionskip}{5pt plus 2pt minus 1pt}
70
+ \begin{figure}[!t]
71
+ \centering
72
+ \includegraphics[width=7.5cm]{VnCoreNLP_Architecture.pdf}
73
+ \caption{In pipeline architecture of VnCoreNLP, annotations are performed on an {\tt Annotation} object.}
74
+ \label{fig:diagram}
75
+ \end{figure}
76
+
77
+ In this paper, we present a Java NLP toolkit for Vietnamese, namely VnCoreNLP, which aims to facilitate Vietnamese NLP research by providing rich linguistic annotations through key NLP components of word segmentation, POS tagging, NER and dependency parsing. Figure \ref{fig:diagram} describes the overall system architecture. The
78
+ following items highlight typical characteristics of VnCoreNLP:
79
+
80
+ \begin{itemize}
81
+ \setlength{\itemsep}{5pt}
82
+ \setlength{\parskip}{0pt}
83
+ \setlength{\parsep}{0pt}
84
+
85
+ \item \textbf{Easy-to-use} -- All VnCoreNLP components are wrapped into a single .jar file, so users do not have to install external dependencies. Users can run processing pipelines from either the command-line or the Java API.
86
+
87
+ \item \textbf{Fast} -- VnCoreNLP is fast, so it can be used for dealing with large-scale data. Also it benefits users suffering from limited computation resources (e.g. users from Vietnam).
88
+
89
+
90
+ \item \textbf{Accurate} -- VnCoreNLP components obtain higher results than all previous published results on the same benchmark datasets.
91
+
92
+ \end{itemize}
93
+
94
+
95
+
96
+ \section{Basic usages}
97
+
98
+ Our design goal is to make VnCoreNLP simple to setup and run from either the command-line or the Java API. Performing linguistic annotations for a given file can be done by using a simple command as in Figure \ref{fig:command}.
99
+
100
+ \begin{figure}[ht]
101
+ {\footnotesize\ttfamily \$ java -Xmx2g -jar VnCoreNLP.jar -fin input.txt -fout output.txt}
102
+ \caption{Minimal command to run VnCoreNLP.}
103
+ \label{fig:command}
104
+ \end{figure}
105
+
106
+ Suppose that the file {\ttfamily input.txt} in Figure \ref{fig:command} contains a sentence ``Ông Nguyễn Khắc Chúc đang làm việc tại Đại học Quốc gia Hà Nội.'' (Mr\textsubscript{Ông} Nguyen Khac Chuc is\textsubscript{đang} working\textsubscript{làm\_việc} at\textsubscript{tại} Vietnam National\textsubscript{quốc\_gia} University\textsubscript{đại\_học} Hanoi\textsubscript{Hà\_Nội}). Table \ref{tab:expoutput} shows the output for this sentence in plain text form.
107
+
108
+ \begin{table}[ht]
109
+ \centering
110
+ \resizebox{8cm}{!}{
111
+ \begin{tabular}{l l l l l l}
112
+ 1 & Ông & Nc & O & 4 & sub \\
113
+ 2 & Nguyễn\_Khắc\_Chúc & Np & B-PER & 1 & nmod\\
114
+ 3 & đang & R & O & 4 & adv\\
115
+ 4 & làm\_việc & V & O & 0 & root\\
116
+ 5 & tại & E & O & 4 & loc\\
117
+ 6 & Đại\_học & N & B-ORG & 5 & pob\\
118
+ 7 & Quốc\_gia & N & I-ORG & 6 & nmod\\
119
+ 8 & Hà\_Nội & Np & I-ORG & 6 & nmod\\
120
+ 9 & . & CH & O & 4 & punct\\
121
+ \end{tabular}
122
+ }
123
+ \caption{The output in file {\ttfamily output.txt} for the sentence `Ông Nguyễn Khắc Chúc đang làm việc tại Đại học Quốc gia Hà Nội.'' from file {\ttfamily input.txt} in Figure \ref{fig:command}. The output is in a 6-column format representing word index, word form, POS tag, NER label, head index of the current word, and dependency relation type.}
124
+ \label{tab:expoutput}
125
+ \end{table}
126
+
127
+ Similarly, we can also get the same output by using the API as easy as in Listing \ref{lst1}.
128
+
129
+
130
+ \begin{lstlisting}[label=lst1,caption= {Minimal code for an analysis pipeline.}]
131
+ VnCoreNLP pipeline = new VnCoreNLP() ;
132
+ Annotation annotation = new Annotation("%*Ông Nguyễn Khắc Chúc đang làm việc tại Đại học Quốc gia Hà Nội.*)");
133
+ pipeline.annotate(annotation);
134
+ String annotatedStr = annotation.toString();
135
+ \end{lstlisting}
136
+
137
+ In addition,
138
+ Listing \ref{lst2} provides a more realistic and complete example code, presenting key components of the toolkit.
139
+ Here an annotation pipeline can be used for any text rather than just a single sentence, e.g. for a paragraph or entire news story.
140
+
141
+ \section{Components}
142
+
143
+ This section briefly describes each component of VnCoreNLP. Note that our goal is not to develop new approach or model for each component task. Here we focus on incorporating existing models into a single pipeline. In particular, except a new model we develop for the language-dependent component of word segmentation, we apply traditional feature-based models which obtain SOTA results for English POS tagging, NER and dependency parsing to Vietnamese. The reason is based on a well-established belief in the literature that for a less-resourced language such as Vietnamese, we should consider using feature-based models to obtain fast and accurate performances, rather than using neural network-based models \cite{King2015}.
144
+
145
+ \setlength{\textfloatsep}{1pt plus 1.0pt minus 1.0pt}
146
+ \begin{lstlisting}[float=tp,label=lst2,caption= {A simple and complete example code.}]
147
+ import vn.pipeline.*;
148
+ import java.io.*;
149
+ public class VnCoreNLPExample {
150
+ public static void main(String[] args) throws IOException {
151
+ // "wseg", "pos", "ner", and "parse" refer to as word segmentation, POS tagging, NER and dependency parsing, respectively.
152
+ String[] annotators = {"wseg", "pos", "ner", "parse"};
153
+ VnCoreNLP pipeline = new VnCoreNLP(annotators);
154
+ // Mr Nguyen Khac Chuc is working at Vietnam National University, Hanoi. Mrs Lan, Mr Chuc's wife, is also working at this university.
155
+ String str = %*"Ông Nguyễn Khắc Chúc đang làm việc tại Đại học Quốc gia Hà Nội. Bà Lan, vợ ông Chúc, cũng làm việc tại đây."*);
156
+ Annotation annotation = new Annotation(str);
157
+ pipeline.annotate(annotation);
158
+ PrintStream outputPrinter = new PrintStream("output.txt");
159
+ pipeline.printToFile(annotation, outputPrinter);
160
+ // Users can get a single sentence to analyze individually
161
+ Sentence firstSentence = annotation.getSentences().get(0);
162
+ }
163
+ }
164
+ \end{lstlisting}
165
+ %
166
+ \begin{itemize}
167
+ \setlength{\itemsep}{5pt}
168
+ \setlength{\parskip}{0pt}
169
+ \setlength{\parsep}{0pt}
170
+
171
+ \item \textbf{wseg} -- Unlike English where white space is a strong indicator of word boundaries, when written in Vietnamese white space is also used to separate syllables that constitute words. So word segmentation is referred to as the key first step in Vietnamese NLP{.}\ {W}e have proposed a transformation rule-based learning model for Vietnamese word segmentation, which obtains better segmentation accuracy and speed than all previous word segmenters. See details in \newcite{NguyenNVDJ2018}.
172
+
173
+
174
+ \item \textbf{pos} -- To label words with their POS tag, we apply MarMoT which is a generic
175
+ CRF framework and a SOTA POS and morphological
176
+ tagger \cite{mueller-schmid-schutze:2013:EMNLP}.\footnote{\url{http://cistern.cis.lmu.de/marmot/}}
177
+
178
+
179
+ \item \textbf{ner} -- To recognize named entities, we apply a dynamic feature induction model that automatically optimizes feature combinations \cite{choi:2016:N16-1}.\footnote{\url{https://emorynlp.github.io/nlp4j/components/named-entity-recognition.html}}
180
+
181
+
182
+ \item \textbf{parse} -- To perform dependency parsing, we apply the greedy version of a transition-based parsing model with selectional branching \cite{choi2015ACL}.\footnote{\url{https://emorynlp.github.io/nlp4j/components/dependency-parsing.html}}
183
+ \end{itemize}
184
+
185
+
186
+ \section{Evaluation}
187
+
188
+ We detail experimental results of the word segmentation (\textbf{wseg}) and POS tagging (\textbf{pos}) components of VnCoreNLP in \newcite{NguyenNVDJ2018} and \newcite{NguyenVNDJ-ALTA-2017}, respectively. In particular, our word segmentation component gets the highest results in terms of both segmentation F1 score at 97.90\% and speed at 62K words per second.\footnote{All speeds reported in this paper are computed on a personal computer of Intel Core i7 2.2 GHz.} Our POS tagging component also obtains the highest accuracy to date at 95.88\% with a fast tagging speed at 25K words per second, and outperforms BiLSTM-CRF-based models.
189
+ Following subsections present evaluations for the NER (\textbf{ner}) and dependency parsing (\textbf{parse}) components.
190
+
191
+ \subsection{Named entity recognition}\label{ssec:ner}
192
+
193
+ We make a comparison between SOTA feature-based and neural network-based models, which, to the best of our knowledge, has not been done in any prior work on Vietnamese NER.
194
+
195
+
196
+
197
+ \paragraph{Dataset:} The NER shared task at the 2016 VLSP workshop provides a set of 16,861 manually annotated sentences for training and development, and a set of 2,831 manually annotated sentences for test, with four NER labels PER, LOC, ORG and MISC. Note that in both datasets, words are also supplied with gold POS tags. In addition, each word representing a full personal name are separated into syllables that constitute the word.
198
+ So this annotation scheme results in an unrealistic scenario for a pipeline evaluation because: (\textbf{i}) gold POS tags are not available in a real-world application, and (\textbf{ii}) in the standard annotation (and benchmark datasets) for Vietnamese word segmentation and POS tagging \cite{nguyen-EtAl:2009:LAW-III}, each full name is referred to as a word token (i.e., all word segmenters have been trained to output a full name as a word and all POS taggers have been trained to assign a label to the entire full-name).
199
+
200
+ For a more realistic scenario, we merge those contiguous syllables constituting a full name to form a word.\footnote{Based on the gold label PER, contiguous syllables such as ``Nguyễn/B-PER'', ``Khắc/I-PER'' and ``Chúc/I-PER'' are merged to form a word as ``Nguyễn\_Khắc\_Chúc/B-PER.''} Then we replace the gold POS tags by automatic tags predicted by our POS tagging component. From the set of 16,861 sentences, we sample 2,000 sentences for development and using the remaining 14,861 sentences for training.
201
+
202
+
203
+
204
+ \paragraph{Models:} We make an empirical comparison between the VnCoreNLP's NER component and the following neural network-based models:
205
+ \begin{itemize}
206
+ \setlength{\itemsep}{5pt}
207
+ \setlength{\parskip}{0pt}
208
+ \setlength{\parsep}{0pt}
209
+
210
+ \item {BiLSTM-CRF} \cite{HuangXY15} is a sequence labeling model which extends the BiLSTM model with a CRF layer.
211
+
212
+ \item {BiLSTM-CRF + CNN-char}, i.e. {BiLSTM-CNN-CRF}, is an extension of {BiLSTM-CRF}, using CNN to derive character-based word representations \cite{ma-hovy:2016:P16-1}.%\footnote{\url{https://github.com/XuezheMax/LasagneNLP}}
213
+
214
+ \item {BiLSTM-CRF + LSTM-char} is an extension of {BiLSTM-CRF}, using BiLSTM to derive the character-based word representations \cite{lample-EtAl:2016:N16-1}.
215
+
216
+ \item BiLSTM-CRF\textsubscript{+POS} is another extension to BiLSTM-CRF, incorporating embeddings of automatically predicted POS tags \cite{reimers-gurevych:2017:EMNLP2017}.
217
+
218
+ \end{itemize}
219
+
220
+ We use a well-known implementation which is optimized for performance of all BiLSTM-CRF-based models from \newcite{reimers-gurevych:2017:EMNLP2017}.\footnote{\url{https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf}} We then follow \newcite[Section 3.4]{NguyenVNDJ-ALTA-2017} to perform hyper-parameter tuning.\footnote{We employ pre-trained Vietnamese word vectors from \url{https://github.com/sonvx/word2vecVN}.}
221
+
222
+ %\setlength{\abovecaptionskip}{3pt plus 2pt minus 1pt}
223
+ \setlength{\textfloatsep}{20.0pt plus 2.0pt minus 4.0pt}
224
+ \begin{table}[!t]
225
+ \centering
226
+ \begin{tabular}{l|c|l}
227
+ \hline
228
+ \textbf{Model} & \textbf{F1} & \textbf{Speed} \\
229
+ \hline
230
+ VnCoreNLP & \textbf{88.55} & \textbf{18K} \\
231
+ BiLSTM-CRF & 86.48 & 2.8K \\
232
+ \ \ \ \ \ + CNN-char & {88.28} & 1.8K \\
233
+ \ \ \ \ \ + LSTM-char & 87.71 & 1.3K \\
234
+ BiLSTM-CRF\textsubscript{+POS} & 86.12 & \_ \\
235
+ \ \ \ \ \ + CNN-char & 88.06 & \_ \\
236
+ \ \ \ \ \ + LSTM-char & 87.43 & \_ \\
237
+ \hline
238
+ \end{tabular}
239
+ \caption{F1 scores (in \%) on the test set w.r.t. gold word-segmentation. ``\textbf{Speed}'' denotes the processing speed of the number of words per second (for VnCoreNLP, we include the time POS tagging takes in the speed).}
240
+ \label{tab:ner}
241
+ \end{table}
242
+
243
+
244
+
245
+ \paragraph{Main results:} Table \ref{tab:ner} presents F1 score and speed of each model on the test set, where VnCoreNLP obtains the highest score at 88.55\% with a fast speed at 18K words per second. In particular, VnCoreNLP obtains 10 times faster speed than the second most accurate model BiLSTM-CRF + CNN-char.
246
+
247
+ It is initially surprising that for such an isolated language as Vietnamese where all words are not inflected, using character-based representations helps producing 1+\% improvements to the BiLSTM-CRF model. We find that the improvements to BiLSTM-CRF are mostly accounted for by the PER label. The reason turns out to be simple: about 50\% of named entities are labeled with tag PER, so character-based representations are in fact able to capture common family, middle or given name syllables in `unknown' full-name words. Furthermore, we also find that BiLSTM-CRF-based models do not benefit from additional predicted POS tags. It is probably because BiLSTM can take word order into account, while without word inflection, all grammatical information in Vietnamese is conveyed through its fixed word order, thus explicit predicted POS tags with noisy grammatical information are not helpful.
248
+
249
+
250
+ \subsection{Dependency parsing}\label{ssec:dep}
251
+
252
+ \paragraph{Experimental setup:} We use the Vietnamese dependency treebank VnDT \cite{Nguyen2014NLDB} consisting of 10,200 sentences in our experiments. Following \newcite{NguyenALTA2016}, we use the last 1020 sentences of VnDT for test while the remaining sentences are used for training. Evaluation metrics are the labeled attachment score (LAS) and unlabeled attachment score (UAS).
253
+
254
+
255
+
256
+ \paragraph{Main results:} Table \ref{tab:dep} compares the dependency parsing results of VnCoreNLP with results reported in prior work, using the same experimental setup. The first six rows present the scores with gold POS tags. The next two rows show scores of VnCoreNLP with automatic POS tags which are produced by our POS tagging component. The last row presents scores of the joint POS tagging and dependency parsing model jPTDP \protect\cite{NguyenCoNLL2017}. Table \ref{tab:dep} shows that compared to previously published results, VnCoreNLP produces the highest LAS score. Note that previous results for other systems are reported without using additional information of automatically predicted NER labels. In this case, the LAS score for VnCoreNLP without automatic NER features (i.e. VnCoreNLP\textsubscript{--NER} in Table \ref{tab:dep}) is still higher than previous ones. Notably, we also obtain a fast parsing speed at 8K words per second.
257
+
258
+
259
+ \begin{table}[!t]
260
+ \centering
261
+ \setlength{\tabcolsep}{0.5em}
262
+ \def\arraystretch{1.1}
263
+ \begin{tabular}{l|l|c|c|l }
264
+ \hline
265
+ \multicolumn{2}{c|}{\textbf{Model}} & \textbf{LAS} & \textbf{UAS} & \textbf{Speed} \\
266
+ \hline
267
+ \multirow{6}{*}{\rotatebox[origin=c]{90}{Gold POS}}
268
+ & VnCoreNLP & \textbf{73.39} & 79.02 & \_ \\
269
+ & VnCoreNLP\textsubscript{--NER} & 73.21 & 78.91 & \_ \\
270
+ & BIST-bmstparser & 73.17 & \textbf{79.39} & \_ \\
271
+ & BIST-barchybrid & 72.53 & 79.33 & \_ \\
272
+ & MSTParser & 70.29 & 76.47 & \_\\
273
+ & MaltParser & 69.10 & 74.91 & \_\\
274
+ \hline
275
+ \multirow{3}{*}{\rotatebox[origin=c]{90}{Auto POS}}
276
+ & VnCoreNLP & \textbf{70.23} & 76.93 & 8K \\
277
+ & VnCoreNLP\textsubscript{--NER} & 70.10 & 76.85 & \textbf{9K} \\
278
+ & jPTDP & 69.49 & \textbf{77.68} & 700 \\
279
+ \hline
280
+ \end{tabular}
281
+ \caption{LAS and UAS scores (in \%) computed on all tokens (i.e. including punctuation) on the test set w.r.t. gold word-segmentation. ``\textbf{Speed}'' is defined as in Table \ref{tab:ner}. The subscript ``--NER'' denotes the model without using automatically predicted NER labels as features. The results of the MSTParser \protect\cite{McDonald2005OLT}, MaltParser \protect\cite{Nivre2007}, and BiLSTM-based parsing models BIST-bmstparser and BIST-barchybrid \protect\cite{TACL885} are reported in \protect\newcite{NguyenALTA2016}. The result of the jPTDP model for Vietnamese is mentioned in \protect\newcite{NguyenVNDJ-ALTA-2017}.}% and detailed at \url{https://drive.google.com/drive/folders/0B5eBgc8jrKtpUmhhSmtFLWdrTzQ}.}
282
+ \label{tab:dep}
283
+ \end{table}
284
+
285
+
286
+
287
+
288
+
289
+ \section{Conclusion}
290
+
291
+
292
+ In this paper, we have presented the VnCoreNLP toolkit---an easy-to-use, fast and accurate processing pipeline for Vietnamese NLP. VnCoreNLP provides core NLP steps including word segmentation, POS tagging, NER and dependency parsing. Current version of VnCoreNLP has been trained without any linguistic optimization, i.e. we only employ existing pre-defined features in the traditional feature-based models for POS tagging, NER and dependency parsing. So future work will focus on incorporating Vietnamese linguistic features into these feature-based models.
293
+
294
+ VnCoreNLP is released for research and educational purposes, and available at: \url{https://github.com/vncorenlp/VnCoreNLP}.
295
+
296
+ % include your own bib file like this:
297
+ %\bibliographystyle{acl}
298
+ %\bibliography{acl2018}
299
+ \bibliography{Refs}
300
+ \bibliographystyle{naacl_natbib}
301
+
302
+ \end{document}
references/2018.naacl.vu/source/VnCoreNLP.bbl ADDED
@@ -0,0 +1,169 @@
1
+ \begin{thebibliography}{}
2
+ \expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi
3
+
4
+ \bibitem[{Choi(2016)}]{choi:2016:N16-1}
5
+ Jinho~D. Choi. 2016.
6
+ \newblock {Dynamic Feature Induction: The Last Gist to the State-of-the-Art}.
7
+ \newblock In {\em Proceedings of NAACL-HLT\/}. pages 271--281.
8
+
9
+ \bibitem[{Choi et~al.(2015)Choi, Tetreault, and Stent}]{choi2015ACL}
10
+ Jinho~D. Choi, Joel Tetreault, and Amanda Stent. 2015.
11
+ \newblock {It Depends: Dependency Parser Comparison Using A Web-based
12
+ Evaluation Tool}.
13
+ \newblock In {\em Proceedings of ACL-IJCNLP\/}. pages 387--396.
14
+
15
+ \bibitem[{Huang et~al.(2015)Huang, Xu, and Yu}]{HuangXY15}
16
+ Zhiheng Huang, Wei Xu, and Kai Yu. 2015.
17
+ \newblock Bidirectional {LSTM-CRF} models for sequence tagging.
18
+ \newblock {\em arXiv preprint\/} arXiv:1508.01991.
19
+
20
+ \bibitem[{King(2015)}]{King2015}
21
+ Benjamin~Philip King. 2015.
22
+ \newblock {\em Practical Natural Language Processing for Low-Resource
23
+ Languages\/}.
24
+ \newblock Ph.D. thesis, The University of Michigan.
25
+
26
+ \bibitem[{Kiperwasser and Goldberg(2016)}]{TACL885}
27
+ Eliyahu Kiperwasser and Yoav Goldberg. 2016.
28
+ \newblock {Simple and Accurate Dependency Parsing Using Bidirectional LSTM
29
+ Feature Representations}.
30
+ \newblock {\em Transactions of the Association for Computational Linguistics\/}
31
+ 4:313--327.
32
+
33
+ \bibitem[{Lample et~al.(2016)Lample, Ballesteros, Subramanian, Kawakami, and
34
+ Dyer}]{lample-EtAl:2016:N16-1}
35
+ Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and
36
+ Chris Dyer. 2016.
37
+ \newblock {Neural Architectures for Named Entity Recognition}.
38
+ \newblock In {\em Proceedings of NAACL-HLT\/}. pages 260--270.
39
+
40
+ \bibitem[{Le et~al.(2008)Le, Nguyen, Roussanaly, and Ho}]{Le2008}
41
+ Hong~Phuong Le, Thi Minh~Huyen Nguyen, Azim Roussanaly, and Tuong~Vinh Ho.
42
+ 2008.
43
+ \newblock {A hybrid approach to word segmentation of Vietnamese texts}.
44
+ \newblock In {\em Proceedings of LATA\/}. pages 240--249.
45
+
46
+ \bibitem[{Le et~al.(2013)Le, Do, Nguyen, and Nguyen}]{Le:2013:VOS}
47
+ Ngoc~Minh Le, Bich~Ngoc Do, Vi~Duong Nguyen, and Thi~Dam Nguyen. 2013.
48
+ \newblock {VNLP: An Open Source Framework for Vietnamese Natural Language
49
+ Processing}.
50
+ \newblock In {\em Proceedings of SoICT\/}. pages 88--93.
51
+
52
+ \bibitem[{Le-Hong et~al.(2010)Le-Hong, Roussanaly, Nguyen, and
53
+ Rossignol}]{lehong00526139}
54
+ Phuong Le-Hong, Azim Roussanaly, Thi Minh~Huyen Nguyen, and Mathias Rossignol.
55
+ 2010.
56
+ \newblock {An empirical study of maximum entropy approach for part-of-speech
57
+ tagging of Vietnamese texts}.
58
+ \newblock In {\em {Proceedings of TALN}\/}.
59
+
60
+ \bibitem[{Ma and Hovy(2016)}]{ma-hovy:2016:P16-1}
61
+ Xuezhe Ma and Eduard Hovy. 2016.
62
+ \newblock {End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF}.
63
+ \newblock In {\em Proceedings of ACL (Volume 1: Long Papers)\/}. pages
64
+ 1064--1074.
65
+
66
+ \bibitem[{Manning et~al.(2014)Manning, Surdeanu, Bauer, Finkel, Bethard, and
67
+ McClosky}]{manning-EtAl:2014:P14-5}
68
+ Christopher~D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven~J.
69
+ Bethard, and David McClosky. 2014.
70
+ \newblock The {Stanford} {CoreNLP} natural language processing toolkit.
71
+ \newblock In {\em Proceedings of ACL 2014 System Demonstrations\/}. pages
72
+ 55--60.
73
+
74
+ \bibitem[{McDonald et~al.(2005)McDonald, Crammer, and
75
+ Pereira}]{McDonald2005OLT}
76
+ Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005.
77
+ \newblock {Online Large-margin Training of Dependency Parsers}.
78
+ \newblock In {\em Proceedings of ACL\/}. pages 91--98.
79
+
80
+ \bibitem[{Mueller et~al.(2013)Mueller, Schmid, and
81
+ Sch\"{u}tze}]{mueller-schmid-schutze:2013:EMNLP}
82
+ Thomas Mueller, Helmut Schmid, and Hinrich Sch\"{u}tze. 2013.
83
+ \newblock {Efficient Higher-Order CRFs for Morphological Tagging}.
84
+ \newblock In {\em Proceedings of EMNLP\/}. pages 322--332.
85
+
86
+ \bibitem[{Nguyen et~al.(2006)Nguyen, Nguyen et~al.}]{Y06-1028}
87
+ Cam-Tu Nguyen, Trung-Kien Nguyen, et~al. 2006.
88
+ \newblock {Vietnamese Word Segmentation with CRFs and SVMs: An Investigation}.
89
+ \newblock In {\em Proceedings of PACLIC\/}. pages 215--222.
90
+
91
+ \bibitem[{Nguyen et~al.(2010)Nguyen, Phan, and Nguyen}]{NguyenPN2010}
92
+ Cam-Tu Nguyen, Xuan-Hieu Phan, and Thu-Trang Nguyen. 2010.
93
+ \newblock {JVnTextPro: A Java-based Vietnamese Text Processing Tool}.
94
+ \newblock \url{http://jvntextpro.sourceforge.net/}.
95
+
96
+ \bibitem[{Nguyen et~al.(2016{\natexlab{a}})Nguyen, Dras, and
97
+ Johnson}]{NguyenALTA2016}
98
+ Dat~Quoc Nguyen, Mark Dras, and Mark Johnson. 2016{\natexlab{a}}.
99
+ \newblock {An empirical study for Vietnamese dependency parsing}.
100
+ \newblock In {\em Proceedings of ALTA\/}. pages 143--149.
101
+
102
+ \bibitem[{Nguyen et~al.(2017{\natexlab{a}})Nguyen, Dras, and
103
+ Johnson}]{NguyenCoNLL2017}
104
+ Dat~Quoc Nguyen, Mark Dras, and Mark Johnson. 2017{\natexlab{a}}.
105
+ \newblock {A Novel Neural Network Model for Joint POS Tagging and Graph-based
106
+ Dependency Parsing}.
107
+ \newblock In {\em Proceedings of the CoNLL 2017 Shared Task\/}. pages 134--142.
108
+
109
+ \bibitem[{Nguyen et~al.(2014)Nguyen, Nguyen, Pham, Nguyen, and
110
+ Nguyen}]{Nguyen2014NLDB}
111
+ Dat~Quoc Nguyen, Dai~Quoc Nguyen, Son~Bao Pham, Phuong-Thai Nguyen, and Minh~Le
112
+ Nguyen. 2014.
113
+ \newblock {From Treebank Conversion to Automatic Dependency Parsing for
114
+ Vietnamese}.
115
+ \newblock In {\em {Proceedings of NLDB}\/}. pages 196--207.
116
+
117
+ \bibitem[{Nguyen et~al.(2018)Nguyen, Nguyen, Vu, Dras, and
118
+ Johnson}]{NguyenNVDJ2018}
119
+ Dat~Quoc Nguyen, Dai~Quoc Nguyen, Thanh Vu, Mark Dras, and Mark Johnson. 2018.
120
+ \newblock {A Fast and Accurate Vietnamese Word Segmenter}.
121
+ \newblock In {\em Proceedings of LREC\/}. page to appear.
122
+
123
+ \bibitem[{Nguyen et~al.(2017{\natexlab{b}})Nguyen, Vu, Nguyen, Dras, and
124
+ Johnson}]{NguyenVNDJ-ALTA-2017}
125
+ Dat~Quoc Nguyen, Thanh Vu, Dai~Quoc Nguyen, Mark Dras, and Mark Johnson.
126
+ 2017{\natexlab{b}}.
127
+ \newblock {Fro{m}\ {W}ord Segmentation to POS Tagging for Vietnamese}.
128
+ \newblock In {\em Proceedings of ALTA\/}. pages 108--113.
129
+
130
+ \bibitem[{Nguyen et~al.(2009)Nguyen, Vu et~al.}]{nguyen-EtAl:2009:LAW-III}
131
+ Phuong~Thai Nguyen, Xuan~Luong Vu, et~al. 2009.
132
+ \newblock {Building a Large Syntactically-Annotated Corpus of Vietnamese}.
133
+ \newblock In {\em Proceedings of LAW\/}. pages 182--185.
134
+
135
+ \bibitem[{Nguyen and Le(2016)}]{NguyenL2016}
136
+ Tuan-Phong Nguyen and Anh-Cuong Le. 2016.
137
+ \newblock {A Hybrid Approach to Vietnamese Word Segmentation}.
138
+ \newblock In {\em Proceedings of RIVF\/}. pages 114--119.
139
+
140
+ \bibitem[{Nguyen et~al.(2016{\natexlab{b}})Nguyen, Truong, Nguyen, and
141
+ Le}]{JCSCE}
142
+ Tuan~Phong Nguyen, Quoc~Tuan Truong, Xuan~Nam Nguyen, and Anh~Cuong Le.
143
+ 2016{\natexlab{b}}.
144
+ \newblock {An Experimental Investigation of Part-Of-Speech Taggers for
145
+ Vietnamese}.
146
+ \newblock {\em VNU Journal of Science: Computer Science and Communication
147
+ Engineering\/} 32(3):11--25.
148
+
149
+ \bibitem[{Nivre et~al.(2007)Nivre, Hall et~al.}]{Nivre2007}
150
+ Joakim Nivre, Johan Hall, et~al. 2007.
151
+ \newblock {MaltParser: A language-independent system for data-driven dependency
152
+ parsing}.
153
+ \newblock {\em Natural Language Engineering\/} 13(2):95--135.
154
+
155
+ \bibitem[{Pham et~al.(2017)Pham, Pham, Nguyen, and Le{-}Hong}]{PhamPNP2017b}
156
+ Thai{-}Hoang Pham, Xuan{-}Khoai Pham, Tuan{-}Anh Nguyen, and Phuong Le{-}Hong.
157
+ 2017.
158
+ \newblock {NNVLP: {A} Neural Network-Based Vietnamese Language Processing
159
+ Toolkit}.
160
+ \newblock In {\em Proceedings of the IJCNLP 2017 System Demonstrations\/}.
161
+ pages 37--40.
162
+
163
+ \bibitem[{Reimers and Gurevych(2017)}]{reimers-gurevych:2017:EMNLP2017}
164
+ Nils Reimers and Iryna Gurevych. 2017.
165
+ \newblock {Reporting Score Distributions Makes a Difference: Performance Study
166
+ of LSTM-networks for Sequence Tagging}.
167
+ \newblock In {\em Proceedings of EMNLP\/}. pages 338--348.
168
+
169
+ \end{thebibliography}
references/2018.naacl.vu/source/VnCoreNLP.tex ADDED
@@ -0,0 +1,302 @@
1
+ \documentclass[11pt,a4paper]{article}
2
+ \pdfoutput=1
3
+ \usepackage[hyperref]{naaclhlt2018}
4
+ \usepackage{times}
5
+ \usepackage{latexsym}
6
+
7
+ \usepackage{graphicx}
8
+ \usepackage{tabularx}
9
+ \usepackage{multirow}
10
+ %\usepackage{fixltx2e}
11
+ %\usepackage{enumitem}
12
+ \usepackage{marvosym}
13
+
14
+ \usepackage{vntex}
15
+ \usepackage[english]{babel}
16
+
17
+ %\usepackage[hidelinks]{hyperref}
18
+ \usepackage{url}
19
+
20
+
21
+ \usepackage{listings}
22
+
23
+ \lstset{
24
+ basicstyle=\footnotesize\ttfamily,
25
+ language=Java,
26
+ breaklines=true,
27
+ basicstyle=\footnotesize\ttfamily,
28
+ captionpos=b,
29
+ inputencoding=utf8,
30
+ escapeinside={\%*}{*)}
31
+ }
32
+
33
+ \aclfinalcopy % Uncomment this line for the final submission
34
+ %\def\aclpaperid{***} % Enter the acl Paper ID here
35
+
36
+ \setlength\titlebox{5.25cm}
37
+
38
+ \newcommand\BibTeX{B{\sc ib}\TeX}
39
+
40
+ \title{VnCoreNLP: A Vietnamese Natural Language Processing Toolkit}
41
+
42
+ \author{Thanh Vu$^1$, Dat Quoc Nguyen$^2$, Dai Quoc Nguyen$^3$, Mark Dras$^4$ \and Mark Johnson$^4$\\
43
+ $^1$Newcastle University, United Kingdom; $^2$The University of Melbourne, Australia; \\
44
+ $^3$Deakin University, Australia; $^4$Macquarie University, Australia \\
45
+ {\tt thanh.vu@newcastle.ac.uk}, {\tt dqnguyen@unimelb.edu.au},\\
46
+ {\tt dai.nguyen@deakin.edu.au}, {\tt\{mark.dras, mark.johnson\}@mq.edu.au}}
47
+
48
+ \date{}
49
+
50
+ \begin{document}
51
+ \maketitle
52
+ \begin{abstract}
53
+ We present an easy-to-use and fast toolkit, namely VnCoreNLP---a Java NLP annotation pipeline for Vietnamese. Our VnCoreNLP supports key natural language processing (NLP) tasks including word segmentation, part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing, and obtains state-of-the-art (SOTA) results for these tasks.
54
+ We release VnCoreNLP to provide rich linguistic annotations to facilitate research work on Vietnamese NLP.
55
+ Our VnCoreNLP is open-source and available at: \url{https://github.com/vncorenlp/VnCoreNLP}.
56
+ \end{abstract}
57
+
58
+
59
+
60
+ \section{Introduction}
61
+
62
+ Research on Vietnamese NLP has been actively explored in the last decade, boosted by the successes of the 4-year KC01.01/2006-2010 national project on Vietnamese language and speech processing (VLSP). Over the last 5 years, standard benchmark datasets for key Vietnamese NLP tasks have become publicly available: datasets for word segmentation and POS tagging were released for the first VLSP evaluation campaign in 2013; a dependency treebank was published in 2014 \cite{Nguyen2014NLDB}; and an NER dataset was released for the second VLSP campaign in 2016. So there is a need for building an NLP pipeline, such as the Stanford CoreNLP toolkit \cite{manning-EtAl:2014:P14-5}, for those key tasks to assist users and to support researchers and tool developers of downstream tasks.
63
+
64
+ \newcite{NguyenPN2010} and \newcite{Le:2013:VOS} built Vietnamese NLP pipelines by wrapping existing word segmenters and POS taggers including: JVnSegmenter \cite{Y06-1028}, vnTokenizer \cite{Le2008}, JVnTagger \cite{NguyenPN2010} and vnTagger \cite{lehong00526139}. However,
65
+ these word segmenters and POS taggers are no longer considered
66
+ SOTA models for Vietnamese \cite{NguyenL2016,JCSCE}. \newcite{PhamPNP2017b} built the NNVLP toolkit for Vietnamese sequence labeling tasks by applying a BiLSTM-CNN-CRF model \cite{ma-hovy:2016:P16-1}. However, \newcite{PhamPNP2017b} did not make a comparison to SOTA traditional feature-based models. In addition, NNVLP is slow with a processing speed at about 300 words per second, which is not practical for real-world application such as dealing with large-scale data.
67
+
68
+
69
+ \setlength{\abovecaptionskip}{5pt plus 2pt minus 1pt}
70
+ \begin{figure}[!t]
71
+ \centering
72
+ \includegraphics[width=7.5cm]{VnCoreNLP_Architecture.pdf}
73
+ \caption{In the pipeline architecture of VnCoreNLP, annotations are performed on an {\tt Annotation} object.}
74
+ \label{fig:diagram}
75
+ \end{figure}
76
+
77
+ In this paper, we present a Java NLP toolkit for Vietnamese, namely VnCoreNLP, which aims to facilitate Vietnamese NLP research by providing rich linguistic annotations through key NLP components of word segmentation, POS tagging, NER and dependency parsing. Figure \ref{fig:diagram} describes the overall system architecture. The
78
+ following items highlight typical characteristics of VnCoreNLP:
79
+
80
+ \begin{itemize}
81
+ \setlength{\itemsep}{5pt}
82
+ \setlength{\parskip}{0pt}
83
+ \setlength{\parsep}{0pt}
84
+
85
+ \item \textbf{Easy-to-use} -- All VnCoreNLP components are wrapped into a single .jar file, so users do not have to install external dependencies. Users can run processing pipelines from either the command-line or the Java API.
86
+
87
+ \item \textbf{Fast} -- VnCoreNLP is fast, so it can be used for dealing with large-scale data. It also benefits users with limited computational resources (e.g. users from Vietnam).
88
+
89
+
90
+ \item \textbf{Accurate} -- VnCoreNLP components outperform all previously published results on the same benchmark datasets.
91
+
92
+ \end{itemize}
93
+
94
+
95
+
96
+ \section{Basic usages}
97
+
98
+ Our design goal is to make VnCoreNLP simple to setup and run from either the command-line or the Java API. Performing linguistic annotations for a given file can be done by using a simple command as in Figure \ref{fig:command}.
99
+
100
+ \begin{figure}[ht]
101
+ {\footnotesize\ttfamily \$ java -Xmx2g -jar VnCoreNLP.jar -fin input.txt -fout output.txt}
102
+ \caption{Minimal command to run VnCoreNLP.}
103
+ \label{fig:command}
104
+ \end{figure}
105
+
106
+ Suppose that the file {\ttfamily input.txt} in Figure \ref{fig:command} contains a sentence ``Ông Nguyễn Khắc Chúc đang làm việc tại Đại học Quốc gia Hà Nội.'' (Mr\textsubscript{Ông} Nguyen Khac Chuc is\textsubscript{đang} working\textsubscript{làm\_việc} at\textsubscript{tại} Vietnam National\textsubscript{quốc\_gia} University\textsubscript{đại\_học} Hanoi\textsubscript{Hà\_Nội}). Table \ref{tab:expoutput} shows the output for this sentence in plain text form.
107
+
108
+ \begin{table}[ht]
109
+ \centering
110
+ \resizebox{8cm}{!}{
111
+ \begin{tabular}{l l l l l l}
112
+ 1 & Ông & Nc & O & 4 & sub \\
113
+ 2 & Nguyễn\_Khắc\_Chúc & Np & B-PER & 1 & nmod\\
114
+ 3 & đang & R & O & 4 & adv\\
115
+ 4 & làm\_việc & V & O & 0 & root\\
116
+ 5 & tại & E & O & 4 & loc\\
117
+ 6 & Đại\_học & N & B-ORG & 5 & pob\\
118
+ 7 & Quốc\_gia & N & I-ORG & 6 & nmod\\
119
+ 8 & Hà\_Nội & Np & I-ORG & 6 & nmod\\
120
+ 9 & . & CH & O & 4 & punct\\
121
+ \end{tabular}
122
+ }
123
+ \caption{The output in file {\ttfamily output.txt} for the sentence ``Ông Nguyễn Khắc Chúc đang làm việc tại Đại học Quốc gia Hà Nội.'' from file {\ttfamily input.txt} in Figure \ref{fig:command}. The output is in a 6-column format representing word index, word form, POS tag, NER label, head index of the current word, and dependency relation type.}
124
+ \label{tab:expoutput}
125
+ \end{table}
126
+
127
+ Similarly, we can also get the same output by using the API, as easily as in Listing \ref{lst1}.
128
+
129
+
130
+ \begin{lstlisting}[label=lst1,caption= {Minimal code for an analysis pipeline.}]
131
+ VnCoreNLP pipeline = new VnCoreNLP() ;
132
+ Annotation annotation = new Annotation("%*Ông Nguyễn Khắc Chúc đang làm việc tại Đại học Quốc gia Hà Nội.*)");
133
+ pipeline.annotate(annotation);
134
+ String annotatedStr = annotation.toString();
135
+ \end{lstlisting}
136
+
137
+ In addition,
138
+ Listing \ref{lst2} provides a more realistic and complete example code, presenting key components of the toolkit.
139
+ Here an annotation pipeline can be used for any text rather than just a single sentence, e.g. for a paragraph or entire news story.
140
+
141
+ \section{Components}
142
+
143
+ This section briefly describes each component of VnCoreNLP. Note that our goal is not to develop a new approach or model for each component task. Here we focus on incorporating existing models into a single pipeline. In particular, except for a new model we develop for the language-dependent component of word segmentation, we apply traditional feature-based models which obtain SOTA results for English POS tagging, NER and dependency parsing to Vietnamese. The reason is based on a well-established belief in the literature that for a less-resourced language such as Vietnamese, we should consider using feature-based models to obtain fast and accurate performance, rather than neural network-based models \cite{King2015}.
144
+
145
+ \setlength{\textfloatsep}{1pt plus 1.0pt minus 1.0pt}
146
+ \begin{lstlisting}[float=tp,label=lst2,caption= {A simple and complete example code.}]
147
+ import vn.pipeline.*;
148
+ import java.io.*;
149
+ public class VnCoreNLPExample {
150
+ public static void main(String[] args) throws IOException {
151
+ // "wseg", "pos", "ner", and "parse" refer to as word segmentation, POS tagging, NER and dependency parsing, respectively.
152
+ String[] annotators = {"wseg", "pos", "ner", "parse"};
153
+ VnCoreNLP pipeline = new VnCoreNLP(annotators);
154
+ // Mr Nguyen Khac Chuc is working at Vietnam National University, Hanoi. Mrs Lan, Mr Chuc's wife, is also working at this university.
155
+ String str = %*"Ông Nguyễn Khắc Chúc đang làm việc tại Đại học Quốc gia Hà Nội. Bà Lan, vợ ông Chúc, cũng làm việc tại đây."*);
156
+ Annotation annotation = new Annotation(str);
157
+ pipeline.annotate(annotation);
158
+ PrintStream outputPrinter = new PrintStream("output.txt");
159
+ pipeline.printToFile(annotation, outputPrinter);
160
+ // Users can get a single sentence to analyze individually
161
+ Sentence firstSentence = annotation.getSentences().get(0);
162
+ }
163
+ }
164
+ \end{lstlisting}
165
+ %
166
+ \begin{itemize}
167
+ \setlength{\itemsep}{5pt}
168
+ \setlength{\parskip}{0pt}
169
+ \setlength{\parsep}{0pt}
170
+
171
+ \item \textbf{wseg} -- Unlike English, where white space is a strong indicator of word boundaries, in written Vietnamese white space is also used to separate syllables that constitute words. So word segmentation is referred to as the key first step in Vietnamese NLP{.}\ {W}e have proposed a transformation rule-based learning model for Vietnamese word segmentation, which obtains better segmentation accuracy and speed than all previous word segmenters. See details in \newcite{NguyenNVDJ2018}.
172
+
173
+
174
+ \item \textbf{pos} -- To label words with their POS tag, we apply MarMoT which is a generic
175
+ CRF framework and a SOTA POS and morphological
176
+ tagger \cite{mueller-schmid-schutze:2013:EMNLP}.\footnote{\url{http://cistern.cis.lmu.de/marmot/}}
177
+
178
+
179
+ \item \textbf{ner} -- To recognize named entities, we apply a dynamic feature induction model that automatically optimizes feature combinations \cite{choi:2016:N16-1}.\footnote{\url{https://emorynlp.github.io/nlp4j/components/named-entity-recognition.html}}
180
+
181
+
182
+ \item \textbf{parse} -- To perform dependency parsing, we apply the greedy version of a transition-based parsing model with selectional branching \cite{choi2015ACL}.\footnote{\url{https://emorynlp.github.io/nlp4j/components/dependency-parsing.html}}
183
+ \end{itemize}
184
+
185
+
186
+ \section{Evaluation}
187
+
188
+ We detail experimental results of the word segmentation (\textbf{wseg}) and POS tagging (\textbf{pos}) components of VnCoreNLP in \newcite{NguyenNVDJ2018} and \newcite{NguyenVNDJ-ALTA-2017}, respectively. In particular, our word segmentation component gets the highest results in terms of both segmentation F1 score at 97.90\% and speed at 62K words per second.\footnote{All speeds reported in this paper are computed on a personal computer with an Intel Core i7 2.2 GHz CPU.} Our POS tagging component also obtains the highest accuracy to date at 95.88\% with a fast tagging speed at 25K words per second, and outperforms BiLSTM-CRF-based models.
189
+ Following subsections present evaluations for the NER (\textbf{ner}) and dependency parsing (\textbf{parse}) components.
190
+
191
+ \subsection{Named entity recognition}\label{ssec:ner}
192
+
193
+ We make a comparison between SOTA feature-based and neural network-based models, which, to the best of our knowledge, has not been done in any prior work on Vietnamese NER.
194
+
195
+
196
+
197
+ \paragraph{Dataset:} The NER shared task at the 2016 VLSP workshop provides a set of 16,861 manually annotated sentences for training and development, and a set of 2,831 manually annotated sentences for test, with four NER labels PER, LOC, ORG and MISC. Note that in both datasets, words are also supplied with gold POS tags. In addition, each word representing a full personal name is separated into syllables that constitute the word.
198
+ So this annotation scheme results in an unrealistic scenario for a pipeline evaluation because: (\textbf{i}) gold POS tags are not available in a real-world application, and (\textbf{ii}) in the standard annotation (and benchmark datasets) for Vietnamese word segmentation and POS tagging \cite{nguyen-EtAl:2009:LAW-III}, each full name is referred to as a word token (i.e., all word segmenters have been trained to output a full name as a word and all POS taggers have been trained to assign a label to the entire full-name).
199
+
200
+ For a more realistic scenario, we merge those contiguous syllables constituting a full name to form a word.\footnote{Based on the gold label PER, contiguous syllables such as ``Nguyễn/B-PER'', ``Khắc/I-PER'' and ``Chúc/I-PER'' are merged to form a word as ``Nguyễn\_Khắc\_Chúc/B-PER.''} Then we replace the gold POS tags with automatic tags predicted by our POS tagging component. From the set of 16,861 sentences, we sample 2,000 sentences for development and use the remaining 14,861 sentences for training.
201
+
202
+
203
+
204
+ \paragraph{Models:} We make an empirical comparison between the VnCoreNLP's NER component and the following neural network-based models:
205
+ \begin{itemize}
206
+ \setlength{\itemsep}{5pt}
207
+ \setlength{\parskip}{0pt}
208
+ \setlength{\parsep}{0pt}
209
+
210
+ \item {BiLSTM-CRF} \cite{HuangXY15} is a sequence labeling model which extends the BiLSTM model with a CRF layer.
211
+
212
+ \item {BiLSTM-CRF + CNN-char}, i.e. {BiLSTM-CNN-CRF}, is an extension of {BiLSTM-CRF}, using CNN to derive character-based word representations \cite{ma-hovy:2016:P16-1}.%\footnote{\url{https://github.com/XuezheMax/LasagneNLP}}
213
+
214
+ \item {BiLSTM-CRF + LSTM-char} is an extension of {BiLSTM-CRF}, using BiLSTM to derive the character-based word representations \cite{lample-EtAl:2016:N16-1}.
215
+
216
+ \item BiLSTM-CRF\textsubscript{+POS} is another extension to BiLSTM-CRF, incorporating embeddings of automatically predicted POS tags \cite{reimers-gurevych:2017:EMNLP2017}.
217
+
218
+ \end{itemize}
219
+
220
+ We use a well-known implementation which is optimized for performance of all BiLSTM-CRF-based models from \newcite{reimers-gurevych:2017:EMNLP2017}.\footnote{\url{https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf}} We then follow \newcite[Section 3.4]{NguyenVNDJ-ALTA-2017} to perform hyper-parameter tuning.\footnote{We employ pre-trained Vietnamese word vectors from \url{https://github.com/sonvx/word2vecVN}.}
221
+
222
+ %\setlength{\abovecaptionskip}{3pt plus 2pt minus 1pt}
223
+ \setlength{\textfloatsep}{20.0pt plus 2.0pt minus 4.0pt}
224
+ \begin{table}[!t]
225
+ \centering
226
+ \begin{tabular}{l|c|l}
227
+ \hline
228
+ \textbf{Model} & \textbf{F1} & \textbf{Speed} \\
229
+ \hline
230
+ VnCoreNLP & \textbf{88.55} & \textbf{18K} \\
231
+ BiLSTM-CRF & 86.48 & 2.8K \\
232
+ \ \ \ \ \ + CNN-char & {88.28} & 1.8K \\
233
+ \ \ \ \ \ + LSTM-char & 87.71 & 1.3K \\
234
+ BiLSTM-CRF\textsubscript{+POS} & 86.12 & \_ \\
235
+ \ \ \ \ \ + CNN-char & 88.06 & \_ \\
236
+ \ \ \ \ \ + LSTM-char & 87.43 & \_ \\
237
+ \hline
238
+ \end{tabular}
239
+ \caption{F1 scores (in \%) on the test set w.r.t. gold word-segmentation. ``\textbf{Speed}'' denotes the processing speed in words per second (for VnCoreNLP, the reported speed includes the time taken by POS tagging).}
240
+ \label{tab:ner}
241
+ \end{table}
242
+
243
+
244
+
245
+ \paragraph{Main results:} Table \ref{tab:ner} presents F1 score and speed of each model on the test set, where VnCoreNLP obtains the highest score at 88.55\% with a fast speed at 18K words per second. In particular, VnCoreNLP obtains 10 times faster speed than the second most accurate model BiLSTM-CRF + CNN-char.
246
+
247
+ It is initially surprising that for such an isolating language as Vietnamese, where words are not inflected, using character-based representations helps produce 1+\% improvements to the BiLSTM-CRF model. We find that the improvements to BiLSTM-CRF are mostly accounted for by the PER label. The reason turns out to be simple: about 50\% of named entities are labeled with tag PER, so character-based representations are in fact able to capture common family, middle or given name syllables in `unknown' full-name words. Furthermore, we also find that BiLSTM-CRF-based models do not benefit from additional predicted POS tags. This is probably because BiLSTM can take word order into account, while without word inflection, all grammatical information in Vietnamese is conveyed through its fixed word order, so explicit predicted POS tags with noisy grammatical information are not helpful.
248
+
249
+
250
+ \subsection{Dependency parsing}\label{ssec:dep}
251
+
252
+ \paragraph{Experimental setup:} We use the Vietnamese dependency treebank VnDT \cite{Nguyen2014NLDB} consisting of 10,200 sentences in our experiments. Following \newcite{NguyenALTA2016}, we use the last 1020 sentences of VnDT for test while the remaining sentences are used for training. Evaluation metrics are the labeled attachment score (LAS) and unlabeled attachment score (UAS).
253
+
254
+
255
+
256
+ \paragraph{Main results:} Table \ref{tab:dep} compares the dependency parsing results of VnCoreNLP with results reported in prior work, using the same experimental setup. The first six rows present the scores with gold POS tags. The next two rows show scores of VnCoreNLP with automatic POS tags which are produced by our POS tagging component. The last row presents scores of the joint POS tagging and dependency parsing model jPTDP \protect\cite{NguyenCoNLL2017}. Table \ref{tab:dep} shows that compared to previously published results, VnCoreNLP produces the highest LAS score. Note that previous results for other systems are reported without using additional information of automatically predicted NER labels. In this case, the LAS score for VnCoreNLP without automatic NER features (i.e. VnCoreNLP\textsubscript{--NER} in Table \ref{tab:dep}) is still higher than previous ones. Notably, we also obtain a fast parsing speed at 8K words per second.
257
+
258
+
259
+ \begin{table}[!t]
260
+ \centering
261
+ \setlength{\tabcolsep}{0.5em}
262
+ \def\arraystretch{1.1}
263
+ \begin{tabular}{l|l|c|c|l }
264
+ \hline
265
+ \multicolumn{2}{c|}{\textbf{Model}} & \textbf{LAS} & \textbf{UAS} & \textbf{Speed} \\
266
+ \hline
267
+ \multirow{6}{*}{\rotatebox[origin=c]{90}{Gold POS}}
268
+ & VnCoreNLP & \textbf{73.39} & 79.02 & \_ \\
269
+ & VnCoreNLP\textsubscript{--NER} & 73.21 & 78.91 & \_ \\
270
+ & BIST-bmstparser & 73.17 & \textbf{79.39} & \_ \\
271
+ & BIST-barchybrid & 72.53 & 79.33 & \_ \\
272
+ & MSTParser & 70.29 & 76.47 & \_\\
273
+ & MaltParser & 69.10 & 74.91 & \_\\
274
+ \hline
275
+ \multirow{3}{*}{\rotatebox[origin=c]{90}{Auto POS}}
276
+ & VnCoreNLP & \textbf{70.23} & 76.93 & 8K \\
277
+ & VnCoreNLP\textsubscript{--NER} & 70.10 & 76.85 & \textbf{9K} \\
278
+ & jPTDP & 69.49 & \textbf{77.68} & 700 \\
279
+ \hline
280
+ \end{tabular}
281
+ \caption{LAS and UAS scores (in \%) computed on all tokens (i.e. including punctuation) on the test set w.r.t. gold word-segmentation. ``\textbf{Speed}'' is defined as in Table \ref{tab:ner}. The subscript ``--NER'' denotes the model without using automatically predicted NER labels as features. The results of the MSTParser \protect\cite{McDonald2005OLT}, MaltParser \protect\cite{Nivre2007}, and BiLSTM-based parsing models BIST-bmstparser and BIST-barchybrid \protect\cite{TACL885} are reported in \protect\newcite{NguyenALTA2016}. The result of the jPTDP model for Vietnamese is mentioned in \protect\newcite{NguyenVNDJ-ALTA-2017}.}% and detailed at \url{https://drive.google.com/drive/folders/0B5eBgc8jrKtpUmhhSmtFLWdrTzQ}.}
282
+ \label{tab:dep}
283
+ \end{table}
284
+
285
+
286
+
287
+
288
+
289
+ \section{Conclusion}
290
+
291
+
292
+ In this paper, we have presented the VnCoreNLP toolkit---an easy-to-use, fast and accurate processing pipeline for Vietnamese NLP. VnCoreNLP provides core NLP steps including word segmentation, POS tagging, NER and dependency parsing. The current version of VnCoreNLP has been trained without any linguistic optimization, i.e. we only employ existing pre-defined features in the traditional feature-based models for POS tagging, NER and dependency parsing. Future work will focus on incorporating Vietnamese linguistic features into these feature-based models.
293
+
294
+ VnCoreNLP is released for research and educational purposes, and available at: \url{https://github.com/vncorenlp/VnCoreNLP}.
295
+
296
+ % include your own bib file like this:
297
+ %\bibliographystyle{acl}
298
+ %\bibliography{acl2018}
299
+ \bibliography{Refs}
300
+ \bibliographystyle{naacl_natbib}
301
+
302
+ \end{document}
references/2018.naacl.vu/source/naacl_natbib.bst ADDED
@@ -0,0 +1,1552 @@
+ %% Moved into NAACL 2018
+ %% M Mitchell 2017/Sep/30
+ %%
+ %% Output of docstrip was hand-edited to create doi links.
+ %% This file creates bib entries with \href commands in the title.
+ %% You must use the hyperref package, or define \href to do nothing.
+ %% Dan Gildea (gildea) 2016/10/30
+ %%
+ %% Modified further by Min-Yen Kan (knmnyn) on 2016/12/06 to add
+ %% visible DOIs that are also hyperref'ed, and use the newer
+ %% https://doi.org/ syntax.
+ %% Modified further by Min-Yen Kan (knmnyn) to use the URL field for
+ %% reference when no DOI is present.
+ %% Modified to use \url command for visible urls -- Dan Gildea 2017/4/12
+ %%
+ %% This is file `acl.bst',
+ %% generated with the docstrip utility.
+ %%
+ %% The original source files were:
+ %%
+ %% merlin.mbs (with options: `ay,nat,nm-revv1,jnrlst,keyxyr,dt-beg,yr-per,note-yr,num-xser,jnm-x,pre-pub,xedn')
+ %% ----------------------------------------
+ %% *** ACL bibliography stule for use with ACL proceedings or CL journal ***
+ %%
+ %% Copyright 1994-2002 Patrick W Daly
+ % ===============================================================
+ % IMPORTANT NOTICE:
+ % This bibliographic style (bst) file has been generated from one or
+ % more master bibliographic style (mbs) files, listed above.
+ %
+ % This generated file can be redistributed and/or modified under the terms
+ % of the LaTeX Project Public License Distributed from CTAN
+ % archives in directory macros/latex/base/lppl.txt; either
+ % version 1 of the License, or any later version.
+ % ===============================================================
+ % Name and version information of the main mbs file:
+ % \ProvidesFile{merlin.mbs}[2002/10/21 4.05 (PWD, AO, DPC)]
+ % For use with BibTeX version 0.99a or later
+ %-------------------------------------------------------------------
+ % This bibliography style file is intended for texts in ENGLISH
+ % This is an author-year citation style bibliography. As such, it is
+ % non-standard LaTeX, and requires a special package file to function properly.
+ % Such a package is natbib.sty by Patrick W. Daly
+ % The form of the \bibitem entries is
+ % \bibitem[Jones et al.(1990)]{key}...
+ % \bibitem[Jones et al.(1990)Jones, Baker, and Smith]{key}...
+ % The essential feature is that the label (the part in brackets) consists
+ % of the author names, as they should appear in the citation, with the year
+ % in parentheses following. There must be no space before the opening
+ % parenthesis!
+ % With natbib v5.3, a full list of authors may also follow the year.
+ % In natbib.sty, it is possible to define the type of enclosures that is
+ % really wanted (brackets or parentheses), but in either case, there must
+ % be parentheses in the label.
+ % The \cite command functions as follows:
+ % \citet{key} ==>> Jones et al. (1990)
+ % \citet*{key} ==>> Jones, Baker, and Smith (1990)
+ % \citep{key} ==>> (Jones et al., 1990)
+ % \citep*{key} ==>> (Jones, Baker, and Smith, 1990)
+ % \citep[chap. 2]{key} ==>> (Jones et al., 1990, chap. 2)
+ % \citep[e.g.][]{key} ==>> (e.g. Jones et al., 1990)
+ % \citep[e.g.][p. 32]{key} ==>> (e.g. Jones et al., p. 32)
+ % \citeauthor{key} ==>> Jones et al.
+ % \citeauthor*{key} ==>> Jones, Baker, and Smith
+ % \citeyear{key} ==>> 1990
+ %---------------------------------------------------------------------
+
+ ENTRY
+ { address
+ author
+ booktitle
+ chapter
+ doi
+ edition
+ editor
+ howpublished
+ institution
+ journal
+ key
+ month
+ note
+ number
+ organization
+ pages
+ publisher
+ school
+ series
+ title
+ type
+ url
+ volume
+ year
+ }
+ {}
+ { label extra.label sort.label short.list }
+ INTEGERS { output.state before.all mid.sentence after.sentence after.block }
+ FUNCTION {init.state.consts}
+ { #0 'before.all :=
+ #1 'mid.sentence :=
+ #2 'after.sentence :=
+ #3 'after.block :=
+ }
+ STRINGS { s t}
+ FUNCTION {output.nonnull}
+ { 's :=
+ output.state mid.sentence =
+ { ", " * write$ }
+ { output.state after.block =
+ { add.period$ write$
+ newline$
+ "\newblock " write$
+ }
+ { output.state before.all =
+ 'write$
+ { add.period$ " " * write$ }
+ if$
+ }
+ if$
+ mid.sentence 'output.state :=
+ }
+ if$
+ s
+ }
+ FUNCTION {output}
+ { duplicate$ empty$
+ 'pop$
+ 'output.nonnull
+ if$
+ }
+ FUNCTION {output.check}
+ { 't :=
+ duplicate$ empty$
+ { pop$ "empty " t * " in " * cite$ * warning$ }
+ 'output.nonnull
+ if$
+ }
+ FUNCTION {fin.entry}
+ { add.period$
+ write$
+ newline$
+ }
+
+ FUNCTION {new.block}
+ { output.state before.all =
+ 'skip$
+ { after.block 'output.state := }
+ if$
+ }
+ FUNCTION {new.sentence}
+ { output.state after.block =
+ 'skip$
+ { output.state before.all =
+ 'skip$
+ { after.sentence 'output.state := }
+ if$
+ }
+ if$
+ }
+ FUNCTION {add.blank}
+ { " " * before.all 'output.state :=
+ }
+
+ FUNCTION {date.block}
+ {
+ new.block
+ }
+
+ FUNCTION {not}
+ { { #0 }
+ { #1 }
+ if$
+ }
+ FUNCTION {and}
+ { 'skip$
+ { pop$ #0 }
+ if$
+ }
+ FUNCTION {or}
+ { { pop$ #1 }
+ 'skip$
+ if$
+ }
+ FUNCTION {new.block.checkb}
+ { empty$
+ swap$ empty$
+ and
+ 'skip$
+ 'new.block
+ if$
+ }
+ FUNCTION {field.or.null}
+ { duplicate$ empty$
+ { pop$ "" }
+ 'skip$
+ if$
+ }
+ FUNCTION {emphasize}
+ { duplicate$ empty$
+ { pop$ "" }
+ { "{\em " swap$ * "\/}" * }
+ if$
+ }
+ FUNCTION {tie.or.space.prefix}
+ { duplicate$ text.length$ #3 <
+ { "~" }
+ { " " }
+ if$
+ swap$
+ }
+
+ FUNCTION {capitalize}
+ { "u" change.case$ "t" change.case$ }
+
+ FUNCTION {space.word}
+ { " " swap$ * " " * }
+ % Here are the language-specific definitions for explicit words.
+ % Each function has a name bbl.xxx where xxx is the English word.
+ % The language selected here is ENGLISH
+ FUNCTION {bbl.and}
+ { "and"}
+
+ FUNCTION {bbl.etal}
+ { "et~al." }
+
+ FUNCTION {bbl.editors}
+ { "editors" }
+
+ FUNCTION {bbl.editor}
+ { "editor" }
+
+ FUNCTION {bbl.edby}
+ { "edited by" }
+
+ FUNCTION {bbl.edition}
+ { "edition" }
+
+ FUNCTION {bbl.volume}
+ { "volume" }
+
+ FUNCTION {bbl.of}
+ { "of" }
+
+ FUNCTION {bbl.number}
+ { "number" }
+
+ FUNCTION {bbl.nr}
+ { "no." }
+
+ FUNCTION {bbl.in}
+ { "in" }
+
+ FUNCTION {bbl.pages}
+ { "pages" }
+
+ FUNCTION {bbl.page}
+ { "page" }
+
+ FUNCTION {bbl.chapter}
+ { "chapter" }
+
+ FUNCTION {bbl.techrep}
+ { "Technical Report" }
+
+ FUNCTION {bbl.mthesis}
+ { "Master's thesis" }
+
+ FUNCTION {bbl.phdthesis}
+ { "Ph.D. thesis" }
+
+ MACRO {jan} {"January"}
+
+ MACRO {feb} {"February"}
+
+ MACRO {mar} {"March"}
+
+ MACRO {apr} {"April"}
+
+ MACRO {may} {"May"}
+
+ MACRO {jun} {"June"}
+
+ MACRO {jul} {"July"}
+
+ MACRO {aug} {"August"}
+
+ MACRO {sep} {"September"}
+
+ MACRO {oct} {"October"}
+
+ MACRO {nov} {"November"}
+
+ MACRO {dec} {"December"}
+
+ MACRO {acmcs} {"ACM Computing Surveys"}
+
+ MACRO {acta} {"Acta Informatica"}
+
+ MACRO {cacm} {"Communications of the ACM"}
+
+ MACRO {ibmjrd} {"IBM Journal of Research and Development"}
+
+ MACRO {ibmsj} {"IBM Systems Journal"}
+
+ MACRO {ieeese} {"IEEE Transactions on Software Engineering"}
+
+ MACRO {ieeetc} {"IEEE Transactions on Computers"}
+
+ MACRO {ieeetcad}
+ {"IEEE Transactions on Computer-Aided Design of Integrated Circuits"}
+
+ MACRO {ipl} {"Information Processing Letters"}
+
+ MACRO {jacm} {"Journal of the ACM"}
+
+ MACRO {jcss} {"Journal of Computer and System Sciences"}
+
+ MACRO {scp} {"Science of Computer Programming"}
+
+ MACRO {sicomp} {"SIAM Journal on Computing"}
+
+ MACRO {tocs} {"ACM Transactions on Computer Systems"}
+
+ MACRO {tods} {"ACM Transactions on Database Systems"}
+
+ MACRO {tog} {"ACM Transactions on Graphics"}
+
+ MACRO {toms} {"ACM Transactions on Mathematical Software"}
+
+ MACRO {toois} {"ACM Transactions on Office Information Systems"}
+
+ MACRO {toplas} {"ACM Transactions on Programming Languages and Systems"}
+
+ MACRO {tcs} {"Theoretical Computer Science"}
+ FUNCTION {bibinfo.check}
+ { swap$
+ duplicate$ missing$
+ {
+ pop$ pop$
+ ""
+ }
+ { duplicate$ empty$
+ {
+ swap$ pop$
+ }
+ { swap$
+ pop$
+ }
+ if$
+ }
+ if$
+ }
+ FUNCTION {bibinfo.warn}
+ { swap$
+ duplicate$ missing$
+ {
+ swap$ "missing " swap$ * " in " * cite$ * warning$ pop$
+ ""
+ }
+ { duplicate$ empty$
+ {
+ swap$ "empty " swap$ * " in " * cite$ * warning$
+ }
+ { swap$
+ pop$
+ }
+ if$
+ }
+ if$
+ }
+ STRINGS { bibinfo}
+ INTEGERS { nameptr namesleft numnames }
+
+ FUNCTION {format.names}
+ { 'bibinfo :=
+ duplicate$ empty$ 'skip$ {
+ 's :=
+ "" 't :=
+ #1 'nameptr :=
+ s num.names$ 'numnames :=
+ numnames 'namesleft :=
+ { namesleft #0 > }
+ { s nameptr
+ duplicate$ #1 >
+ { "{jj~}{ff~}{vv~}{ll}" }
+ { "{jj~}{ff~}{vv~}{ll}" }
+ if$
+ format.name$
+ bibinfo bibinfo.check
+ 't :=
+ nameptr #1 >
+ {
+ namesleft #1 >
+ { ", " * t * }
+ {
+ numnames #2 >
+ { "," * }
+ 'skip$
+ if$
+ s nameptr "{ll}" format.name$ duplicate$ "others" =
+ { 't := }
+ { pop$ }
+ if$
+ t "others" =
+ {
+ " " * bbl.etal *
+ }
+ {
+ bbl.and
+ space.word * t *
+ }
+ if$
+ }
+ if$
+ }
+ 't
+ if$
+ nameptr #1 + 'nameptr :=
+ namesleft #1 - 'namesleft :=
+ }
+ while$
+ } if$
+ }
+ FUNCTION {format.names.ed}
+ {
+ 'bibinfo :=
+ duplicate$ empty$ 'skip$ {
+ 's :=
+ "" 't :=
+ #1 'nameptr :=
+ s num.names$ 'numnames :=
+ numnames 'namesleft :=
+ { namesleft #0 > }
+ { s nameptr
+ "{ff~}{vv~}{ll}{, jj}"
+ format.name$
+ bibinfo bibinfo.check
+ 't :=
+ nameptr #1 >
+ {
+ namesleft #1 >
+ { ", " * t * }
+ {
+ numnames #2 >
+ { "," * }
+ 'skip$
+ if$
+ s nameptr "{ll}" format.name$ duplicate$ "others" =
+ { 't := }
+ { pop$ }
+ if$
+ t "others" =
+ {
+
+ " " * bbl.etal *
+ }
+ {
+ bbl.and
+ space.word * t *
+ }
+ if$
+ }
+ if$
+ }
+ 't
+ if$
+ nameptr #1 + 'nameptr :=
+ namesleft #1 - 'namesleft :=
+ }
+ while$
+ } if$
+ }
+ FUNCTION {format.key}
+ { empty$
+ { key field.or.null }
+ { "" }
+ if$
+ }
+
+ FUNCTION {format.authors}
+ { author "author" format.names
+ }
+ FUNCTION {get.bbl.editor}
+ { editor num.names$ #1 > 'bbl.editors 'bbl.editor if$ }
+
+ FUNCTION {format.editors}
+ { editor "editor" format.names duplicate$ empty$ 'skip$
+ {
+ "," *
+ " " *
+ get.bbl.editor
+ *
+ }
+ if$
+ }
+ FUNCTION {format.note}
+ {
+ note empty$
+ { "" }
+ { note #1 #1 substring$
+ duplicate$ "{" =
+ 'skip$
+ { output.state mid.sentence =
+ { "l" }
+ { "u" }
+ if$
+ change.case$
+ }
+ if$
+ note #2 global.max$ substring$ * "note" bibinfo.check
+ }
+ if$
+ }
+
+ FUNCTION {doilink}
+ { duplicate$ empty$
+ { pop$ "" }
+ { doi empty$
+ { url empty$
+ { skip$ }
+ { "\href{" url * "}{" * swap$ * "}" * }
+ if$
+ }
+ { "\href{https://doi.org/" doi * "}{" * swap$ * "}" * }
+ if$
+ }
+ if$
+ }
+
+ FUNCTION {format.doi}
+ { doi empty$
+ { "" }
+ { "\url{https://doi.org/" doi * "}" * }
+ if$
+ "doi" bibinfo.check
+ }
+
+ FUNCTION {format.url}
+ { doi empty$
+ {
+ url empty$
+ { "" }
+ { "\url{" url * "}" * }
+ if$
+ "url" bibinfo.check
+ }
+ { "" }
+ if$
+ }
+
+ FUNCTION {format.title}
+ { title
+ duplicate$ empty$ 'skip$
+ { "t" change.case$ doilink }
+ if$
+ "title" bibinfo.check
+ }
+ FUNCTION {format.full.names}
+ {'s :=
+ "" 't :=
+ #1 'nameptr :=
+ s num.names$ 'numnames :=
+ numnames 'namesleft :=
+ { namesleft #0 > }
+ { s nameptr
+ "{vv~}{ll}" format.name$
+ 't :=
+ nameptr #1 >
+ {
+ namesleft #1 >
+ { ", " * t * }
+ {
+ s nameptr "{ll}" format.name$ duplicate$ "others" =
+ { 't := }
+ { pop$ }
+ if$
+ t "others" =
+ {
+ " " * bbl.etal *
+ }
+ {
+ numnames #2 >
+ { "," * }
+ 'skip$
+ if$
+ bbl.and
+ space.word * t *
+ }
+ if$
+ }
+ if$
+ }
+ 't
+ if$
+ nameptr #1 + 'nameptr :=
+ namesleft #1 - 'namesleft :=
+ }
+ while$
+ }
+
+ FUNCTION {author.editor.key.full}
+ { author empty$
+ { editor empty$
+ { key empty$
+ { cite$ #1 #3 substring$ }
+ 'key
+ if$
+ }
+ { editor format.full.names }
+ if$
+ }
+ { author format.full.names }
+ if$
+ }
+
+ FUNCTION {author.key.full}
+ { author empty$
+ { key empty$
+ { cite$ #1 #3 substring$ }
+ 'key
+ if$
+ }
+ { author format.full.names }
+ if$
+ }
+
+ FUNCTION {editor.key.full}
+ { editor empty$
+ { key empty$
+ { cite$ #1 #3 substring$ }
+ 'key
+ if$
+ }
+ { editor format.full.names }
+ if$
+ }
+
+ FUNCTION {make.full.names}
+ { type$ "book" =
+ type$ "inbook" =
+ or
+ 'author.editor.key.full
+ { type$ "proceedings" =
+ 'editor.key.full
+ 'author.key.full
+ if$
+ }
+ if$
+ }
+
+ FUNCTION {output.bibitem}
+ { newline$
+ "\bibitem[{" write$
+ label write$
+ ")" make.full.names duplicate$ short.list =
+ { pop$ }
+ { * }
+ if$
+ "}]{" * write$
+ cite$ write$
+ "}" write$
+ newline$
+ ""
+ before.all 'output.state :=
+ }
+
+ FUNCTION {n.dashify}
+ {
+ 't :=
+ ""
+ { t empty$ not }
+ { t #1 #1 substring$ "-" =
+ { t #1 #2 substring$ "--" = not
+ { "--" *
+ t #2 global.max$ substring$ 't :=
+ }
+ { { t #1 #1 substring$ "-" = }
+ { "-" *
+ t #2 global.max$ substring$ 't :=
+ }
+ while$
+ }
+ if$
+ }
+ { t #1 #1 substring$ *
+ t #2 global.max$ substring$ 't :=
+ }
+ if$
+ }
+ while$
+ }
+
+ FUNCTION {word.in}
+ { bbl.in capitalize
+ " " * }
+
+ FUNCTION {format.date}
+ { year "year" bibinfo.check duplicate$ empty$
+ {
+ "empty year in " cite$ * "; set to ????" * warning$
+ pop$ "????"
+ }
+ 'skip$
+ if$
+ extra.label *
+ before.all 'output.state :=
+ after.sentence 'output.state :=
+ }
+ FUNCTION {format.btitle}
+ { title "title" bibinfo.check
+ duplicate$ empty$ 'skip$
+ {
+ emphasize
+ }
+ if$
+ }
+ FUNCTION {either.or.check}
+ { empty$
+ 'pop$
+ { "can't use both " swap$ * " fields in " * cite$ * warning$ }
+ if$
+ }
+ FUNCTION {format.bvolume}
+ { volume empty$
+ { "" }
+ { bbl.volume volume tie.or.space.prefix
+ "volume" bibinfo.check * *
+ series "series" bibinfo.check
+ duplicate$ empty$ 'pop$
+ { swap$ bbl.of space.word * swap$
+ emphasize * }
+ if$
+ "volume and number" number either.or.check
+ }
+ if$
+ }
+ FUNCTION {format.number.series}
+ { volume empty$
+ { number empty$
+ { series field.or.null }
+ { series empty$
+ { number "number" bibinfo.check }
+ { output.state mid.sentence =
+ { bbl.number }
+ { bbl.number capitalize }
+ if$
+ number tie.or.space.prefix "number" bibinfo.check * *
+ bbl.in space.word *
+ series "series" bibinfo.check *
+ }
+ if$
+ }
+ if$
+ }
+ { "" }
+ if$
+ }
+
+ FUNCTION {format.edition}
+ { edition duplicate$ empty$ 'skip$
+ {
+ output.state mid.sentence =
+ { "l" }
+ { "t" }
+ if$ change.case$
+ "edition" bibinfo.check
+ " " * bbl.edition *
+ }
+ if$
+ }
+ INTEGERS { multiresult }
+ FUNCTION {multi.page.check}
+ { 't :=
+ #0 'multiresult :=
+ { multiresult not
+ t empty$ not
+ and
+ }
+ { t #1 #1 substring$
+ duplicate$ "-" =
+ swap$ duplicate$ "," =
+ swap$ "+" =
+ or or
+ { #1 'multiresult := }
+ { t #2 global.max$ substring$ 't := }
+ if$
+ }
+ while$
+ multiresult
+ }
+ FUNCTION {format.pages}
+ { pages duplicate$ empty$ 'skip$
+ { duplicate$ multi.page.check
+ {
+ bbl.pages swap$
+ n.dashify
+ }
+ {
+ bbl.page swap$
+ }
+ if$
+ tie.or.space.prefix
+ "pages" bibinfo.check
+ * *
+ }
+ if$
+ }
+ FUNCTION {format.journal.pages}
+ { pages duplicate$ empty$ 'pop$
+ { swap$ duplicate$ empty$
+ { pop$ pop$ format.pages }
+ {
+ ":" *
+ swap$
+ n.dashify
+ "pages" bibinfo.check
+ *
+ }
+ if$
+ }
+ if$
+ }
+ FUNCTION {format.vol.num.pages}
+ { volume field.or.null
+ duplicate$ empty$ 'skip$
+ {
+ "volume" bibinfo.check
+ }
+ if$
+ number "number" bibinfo.check duplicate$ empty$ 'skip$
+ {
+ swap$ duplicate$ empty$
+ { "there's a number but no volume in " cite$ * warning$ }
+ 'skip$
+ if$
+ swap$
+ "(" swap$ * ")" *
+ }
+ if$ *
+ format.journal.pages
+ }
+
+ FUNCTION {format.chapter.pages}
+ { chapter empty$
+ 'format.pages
+ { type empty$
+ { bbl.chapter }
+ { type "l" change.case$
+ "type" bibinfo.check
+ }
+ if$
+ chapter tie.or.space.prefix
+ "chapter" bibinfo.check
+ * *
+ pages empty$
+ 'skip$
+ { ", " * format.pages * }
+ if$
+ }
+ if$
+ }
+
+ FUNCTION {format.booktitle}
+ {
+ booktitle "booktitle" bibinfo.check
+ emphasize
+ }
+ FUNCTION {format.in.ed.booktitle}
+ { format.booktitle duplicate$ empty$ 'skip$
+ {
+ editor "editor" format.names.ed duplicate$ empty$ 'pop$
+ {
+ "," *
+ " " *
+ get.bbl.editor
+ ", " *
+ * swap$
+ * }
+ if$
+ word.in swap$ *
+ }
+ if$
+ }
+ FUNCTION {format.thesis.type}
+ { type duplicate$ empty$
+ 'pop$
+ { swap$ pop$
+ "t" change.case$ "type" bibinfo.check
+ }
+ if$
+ }
+ FUNCTION {format.tr.number}
+ { number "number" bibinfo.check
+ type duplicate$ empty$
+ { pop$ bbl.techrep }
+ 'skip$
+ if$
+ "type" bibinfo.check
+ swap$ duplicate$ empty$
+ { pop$ "t" change.case$ }
+ { tie.or.space.prefix * * }
+ if$
+ }
+ FUNCTION {format.article.crossref}
+ {
+ word.in
+ " \cite{" * crossref * "}" *
+ }
+ FUNCTION {format.book.crossref}
+ { volume duplicate$ empty$
+ { "empty volume in " cite$ * "'s crossref of " * crossref * warning$
+ pop$ word.in
+ }
+ { bbl.volume
+ capitalize
+ swap$ tie.or.space.prefix "volume" bibinfo.check * * bbl.of space.word *
+ }
+ if$
+ " \cite{" * crossref * "}" *
+ }
+ FUNCTION {format.incoll.inproc.crossref}
+ {
+ word.in
+ " \cite{" * crossref * "}" *
+ }
+ FUNCTION {format.org.or.pub}
+ { 't :=
+ ""
+ address empty$ t empty$ and
+ 'skip$
+ {
+ t empty$
+ { address "address" bibinfo.check *
+ }
+ { t *
+ address empty$
+ 'skip$
+ { ", " * address "address" bibinfo.check * }
+ if$
+ }
+ if$
+ }
+ if$
+ }
+ FUNCTION {format.publisher.address}
+ { publisher "publisher" bibinfo.warn format.org.or.pub
+ }
+
+ FUNCTION {format.organization.address}
+ { organization "organization" bibinfo.check format.org.or.pub
+ }
+
+ FUNCTION {article}
+ { output.bibitem
+ format.authors "author" output.check
+ author format.key output
+ format.date "year" output.check
+ date.block
+ format.title "title" output.check
+ new.block
+ crossref missing$
+ {
+ journal
+ "journal" bibinfo.check
+ emphasize
+ "journal" output.check
+ add.blank
+ format.vol.num.pages output
+ }
+ { format.article.crossref output.nonnull
+ format.pages output
+ }
+ if$
+ new.block
+ format.note output
+ new.block
+ format.doi output
+ format.url output
+ fin.entry
+ }
+ FUNCTION {book}
+ { output.bibitem
+ author empty$
+ { format.editors "author and editor" output.check
+ editor format.key output
+ }
+ { format.authors output.nonnull
+ crossref missing$
+ { "author and editor" editor either.or.check }
+ 'skip$
+ if$
+ }
+ if$
+ format.date "year" output.check
+ date.block
+ format.btitle "title" output.check
+ crossref missing$
+ { format.bvolume output
+ new.block
+ format.number.series output
+ new.sentence
+ format.publisher.address output
+ }
+ {
+ new.block
+ format.book.crossref output.nonnull
+ }
+ if$
+ format.edition output
+ new.block
+ format.note output
+ new.block
+ format.doi output
+ format.url output
+ fin.entry
+ }
+ FUNCTION {booklet}
+ { output.bibitem
+ format.authors output
+ author format.key output
+ format.date "year" output.check
+ date.block
+ format.title "title" output.check
+ new.block
+ howpublished "howpublished" bibinfo.check output
+ address "address" bibinfo.check output
+ new.block
+ format.note output
+ new.block
+ format.doi output
+ format.url output
+ fin.entry
+ }
+
+ FUNCTION {inbook}
+ { output.bibitem
+ author empty$
+ { format.editors "author and editor" output.check
+ editor format.key output
+ }
+ { format.authors output.nonnull
+ crossref missing$
+ { "author and editor" editor either.or.check }
+ 'skip$
+ if$
+ }
+ if$
+ format.date "year" output.check
+ date.block
+ format.btitle "title" output.check
+ crossref missing$
+ {
+ format.publisher.address output
+ format.bvolume output
+ format.chapter.pages "chapter and pages" output.check
+ new.block
+ format.number.series output
+ new.sentence
+ }
+ {
+ format.chapter.pages "chapter and pages" output.check
+ new.block
+ format.book.crossref output.nonnull
+ }
+ if$
+ format.edition output
+ new.block
+ format.note output
+ new.block
+ format.doi output
+ format.url output
+ fin.entry
+ }
+
+ FUNCTION {incollection}
+ { output.bibitem
+ format.authors "author" output.check
+ author format.key output
+ format.date "year" output.check
+ date.block
+ format.title "title" output.check
+ new.block
+ crossref missing$
+ { format.in.ed.booktitle "booktitle" output.check
+ format.publisher.address output
+ format.bvolume output
+ format.number.series output
+ format.chapter.pages output
+ new.sentence
+ format.edition output
+ }
+ { format.incoll.inproc.crossref output.nonnull
+ format.chapter.pages output
+ }
+ if$
+ new.block
+ format.note output
+ new.block
+ format.doi output
+ format.url output
+ fin.entry
+ }
+ FUNCTION {inproceedings}
+ { output.bibitem
+ format.authors "author" output.check
+ author format.key output
+ format.date "year" output.check
+ date.block
+ format.title "title" output.check
+ new.block
+ crossref missing$
+ { format.in.ed.booktitle "booktitle" output.check
+ new.sentence
+ publisher empty$
+ { format.organization.address output }
+ { organization "organization" bibinfo.check output
+ format.publisher.address output
+ }
+ if$
+ format.bvolume output
+ format.number.series output
+ format.pages output
+ }
+ { format.incoll.inproc.crossref output.nonnull
+ format.pages output
+ }
+ if$
+ new.block
+ format.note output
+ new.block
+ format.doi output
+ format.url output
+ fin.entry
+ }
+ FUNCTION {conference} { inproceedings }
+ FUNCTION {manual}
+ { output.bibitem
+ format.authors output
+ author format.key output
+ format.date "year" output.check
+ date.block
+ format.btitle "title" output.check
+ organization address new.block.checkb
+ organization "organization" bibinfo.check output
+ address "address" bibinfo.check output
+ format.edition output
+ new.block
+ format.note output
+ new.block
+ format.doi output
+ format.url output
+ fin.entry
+ }
+
+ FUNCTION {mastersthesis}
+ { output.bibitem
+ format.authors "author" output.check
+ author format.key output
+ format.date "year" output.check
+ date.block
+ format.btitle
+ "title" output.check
+ new.block
+ bbl.mthesis format.thesis.type output.nonnull
+ school "school" bibinfo.warn output
+ address "address" bibinfo.check output
+ new.block
+ format.note output
+ new.block
+ format.doi output
+ format.url output
+ fin.entry
+ }
+
+ FUNCTION {misc}
+ { output.bibitem
+ format.authors output
+ author format.key output
+ format.date "year" output.check
+ date.block
+ format.title output
+ new.block
+ howpublished "howpublished" bibinfo.check output
+ new.block
+ format.note output
+ new.block
+ format.doi output
+ format.url output
+ fin.entry
+ }
+ FUNCTION {phdthesis}
+ { output.bibitem
+ format.authors "author" output.check
+ author format.key output
+ format.date "year" output.check
+ date.block
+ format.btitle
+ "title" output.check
+ new.block
+ bbl.phdthesis format.thesis.type output.nonnull
+ school "school" bibinfo.warn output
+ address "address" bibinfo.check output
+ new.block
+ format.note output
+ new.block
+ format.doi output
+ format.url output
+ fin.entry
+ }
+
+ FUNCTION {proceedings}
+ { output.bibitem
+ format.editors output
+ editor format.key output
+ format.date "year" output.check
+ date.block
+ format.btitle "title" output.check
+ format.bvolume output
+ format.number.series output
+ new.sentence
+ publisher empty$
+ { format.organization.address output }
+ { organization "organization" bibinfo.check output
+ format.publisher.address output
+ }
+ if$
+ new.block
+ format.note output
+ new.block
+ format.doi output
+ format.url output
+ fin.entry
+ }
+
+ FUNCTION {techreport}
+ { output.bibitem
+ format.authors "author" output.check
+ author format.key output
+ format.date "year" output.check
+ date.block
+ format.title
+ "title" output.check
+ new.block
+ format.tr.number output.nonnull
+ institution "institution" bibinfo.warn output
+ address "address" bibinfo.check output
+ new.block
+ format.note output
+ new.block
+ format.doi output
+ format.url output
+ fin.entry
+ }
+
+ FUNCTION {unpublished}
+ { output.bibitem
+ format.authors "author" output.check
+ author format.key output
+ format.date "year" output.check
+ date.block
+ format.title "title" output.check
+ new.block
+ format.note "note" output.check
+ new.block
+ format.doi output
+ format.url output
+ fin.entry
+ }
+
+ FUNCTION {default.type} { misc }
+ READ
+ FUNCTION {sortify}
+ { purify$
+ "l" change.case$
+ }
+ INTEGERS { len }
+ FUNCTION {chop.word}
+ { 's :=
+ 'len :=
+ s #1 len substring$ =
+ { s len #1 + global.max$ substring$ }
+ 's
+ if$
+ }
+ FUNCTION {format.lab.names}
+ { 's :=
+ "" 't :=
+ s #1 "{vv~}{ll}" format.name$
+ s num.names$ duplicate$
+ #2 >
+ { pop$
+ " " * bbl.etal *
+ }
+ { #2 <
+ 'skip$
+ { s #2 "{ff }{vv }{ll}{ jj}" format.name$ "others" =
+ {
+ " " * bbl.etal *
+ }
+ { bbl.and space.word * s #2 "{vv~}{ll}" format.name$
+ * }
+ if$
+ }
+ if$
+ }
+ if$
+ }
+
+ FUNCTION {author.key.label}
+ { author empty$
+ { key empty$
+ { cite$ #1 #3 substring$ }
+ 'key
+ if$
+ }
+ { author format.lab.names }
+ if$
+ }
+
+ FUNCTION {author.editor.key.label}
+ { author empty$
+ { editor empty$
+ { key empty$
+ { cite$ #1 #3 substring$ }
+ 'key
+ if$
+ }
+ { editor format.lab.names }
+ if$
+ }
+ { author format.lab.names }
+ if$
+ }
+
+ FUNCTION {editor.key.label}
+ { editor empty$
+ { key empty$
+ { cite$ #1 #3 substring$ }
+ 'key
+ if$
+ }
+ { editor format.lab.names }
+ if$
+ }
+
+ FUNCTION {calc.short.authors}
+ { type$ "book" =
+ type$ "inbook" =
+ or
+ 'author.editor.key.label
+ { type$ "proceedings" =
+ 'editor.key.label
+ 'author.key.label
+ if$
+ }
+ if$
+ 'short.list :=
+ }
+
+ FUNCTION {calc.label}
+ { calc.short.authors
+ short.list
+ "("
+ *
+ year duplicate$ empty$
+ short.list key field.or.null = or
+ { pop$ "" }
+ 'skip$
+ if$
+ *
+ 'label :=
+ }
+
+ FUNCTION {sort.format.names}
+ { 's :=
+ #1 'nameptr :=
+ ""
+ s num.names$ 'numnames :=
+ numnames 'namesleft :=
+ { namesleft #0 > }
+ { s nameptr
+ "{vv{ } }{ll{ }}{ ff{ }}{ jj{ }}"
+ format.name$ 't :=
+ nameptr #1 >
+ {
+ " " *
+ namesleft #1 = t "others" = and
+ { "zzzzz" * }
+ { t sortify * }
+ if$
+ }
+ { t sortify * }
+ if$
+ nameptr #1 + 'nameptr :=
+ namesleft #1 - 'namesleft :=
+ }
+ while$
+ }
+
+ FUNCTION {sort.format.title}
+ { 't :=
+ "A " #2
+ "An " #3
+ "The " #4 t chop.word
+ chop.word
+ chop.word
+ sortify
+ #1 global.max$ substring$
+ }
+ FUNCTION {author.sort}
+ { author empty$
+ { key empty$
+ { "to sort, need author or key in " cite$ * warning$
+ ""
+ }
+ { key sortify }
+ if$
+ }
+ { author sort.format.names }
+ if$
+ }
+ FUNCTION {author.editor.sort}
+ { author empty$
+ { editor empty$
+ { key empty$
+ { "to sort, need author, editor, or key in " cite$ * warning$
+ ""
+ }
+ { key sortify }
+ if$
+ }
+ { editor sort.format.names }
+ if$
+ }
+ { author sort.format.names }
+ if$
+ }
+ FUNCTION {editor.sort}
+ { editor empty$
+ { key empty$
+ { "to sort, need editor or key in " cite$ * warning$
+ ""
+ }
+ { key sortify }
+ if$
+ }
+ { editor sort.format.names }
+ if$
+ }
+ FUNCTION {presort}
+ { calc.label
+ label sortify
+ " "
+ *
+ type$ "book" =
+ type$ "inbook" =
+ or
+ 'author.editor.sort
+ { type$ "proceedings" =
+ 'editor.sort
+ 'author.sort
+ if$
+ }
+ if$
+ #1 entry.max$ substring$
+ 'sort.label :=
+ sort.label
+ *
+ " "
+ *
+ title field.or.null
+ sort.format.title
+ *
+ #1 entry.max$ substring$
+ 'sort.key$ :=
+ }
+
+ ITERATE {presort}
+ SORT
+ STRINGS { last.label next.extra }
+ INTEGERS { last.extra.num number.label }
+ FUNCTION {initialize.extra.label.stuff}
+ { #0 int.to.chr$ 'last.label :=
+ "" 'next.extra :=
+ #0 'last.extra.num :=
+ #0 'number.label :=
+ }
+ FUNCTION {forward.pass}
+ { last.label label =
+ { last.extra.num #1 + 'last.extra.num :=
1490
+ last.extra.num int.to.chr$ 'extra.label :=
1491
+ }
1492
+ { "a" chr.to.int$ 'last.extra.num :=
1493
+ "" 'extra.label :=
1494
+ label 'last.label :=
1495
+ }
1496
+ if$
1497
+ number.label #1 + 'number.label :=
1498
+ }
1499
+ FUNCTION {reverse.pass}
1500
+ { next.extra "b" =
1501
+ { "a" 'extra.label := }
1502
+ 'skip$
1503
+ if$
1504
+ extra.label 'next.extra :=
1505
+ extra.label
1506
+ duplicate$ empty$
1507
+ 'skip$
1508
+ { "{\natexlab{" swap$ * "}}" * }
1509
+ if$
1510
+ 'extra.label :=
1511
+ label extra.label * 'label :=
1512
+ }
1513
+ EXECUTE {initialize.extra.label.stuff}
1514
+ ITERATE {forward.pass}
1515
+ REVERSE {reverse.pass}
1516
+ FUNCTION {bib.sort.order}
1517
+ { sort.label
1518
+ " "
1519
+ *
1520
+ year field.or.null sortify
1521
+ *
1522
+ " "
1523
+ *
1524
+ title field.or.null
1525
+ sort.format.title
1526
+ *
1527
+ #1 entry.max$ substring$
1528
+ 'sort.key$ :=
1529
+ }
1530
+ ITERATE {bib.sort.order}
1531
+ SORT
1532
+ FUNCTION {begin.bib}
1533
+ { preamble$ empty$
1534
+ 'skip$
1535
+ { preamble$ write$ newline$ }
1536
+ if$
1537
+ "\begin{thebibliography}{}"
1538
+ write$ newline$
1539
+ "\expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi"
1540
+ write$ newline$
1541
+ }
1542
+ EXECUTE {begin.bib}
1543
+ EXECUTE {init.state.consts}
1544
+ ITERATE {call.type$}
1545
+ FUNCTION {end.bib}
1546
+ { newline$
1547
+ "\end{thebibliography}" write$ newline$
1548
+ }
1549
+ EXECUTE {end.bib}
1550
+ %% End of customized bst file
1551
+ %%
1552
+ %% End of file `acl.bst'.
references/2018.naacl.vu/source/naaclhlt2018.sty ADDED
@@ -0,0 +1,543 @@
1
+ % This is the LaTex style file for NAACL 2018. Major modifications include
2
+ % changing the color of the line numbers to a light gray; changing font size of abstract to be 10pt; changing caption font size to be 10pt.
3
+ % -- Meg Mitchell and Stephanie Lukin
4
+
5
+ % 2017: modified to support DOI links in bibliography. Now uses
6
+ % natbib package rather than defining citation commands in this file.
7
+ % Use with acl_natbib.bst bib style. -- Dan Gildea
8
+
9
+ % This is the LaTeX style for ACL 2016. It contains Margaret Mitchell's
10
+ % line number adaptations (ported by Hai Zhao and Yannick Versley).
11
+
12
+ % It is nearly identical to the style files for ACL 2015,
13
+ % ACL 2014, EACL 2006, ACL2005, ACL 2002, ACL 2001, ACL 2000,
14
+ % EACL 95 and EACL 99.
15
+ %
16
+ % Changes made include: adapt layout to A4 and centimeters, widen abstract
17
+
18
+ % This is the LaTeX style file for ACL 2000. It is nearly identical to the
19
+ % style files for EACL 95 and EACL 99. Minor changes include editing the
20
+ % instructions to reflect use of \documentclass rather than \documentstyle
21
+ % and removing the white space before the title on the first page
22
+ % -- John Chen, June 29, 2000
23
+
24
+ % This is the LaTeX style file for EACL-95. It is identical to the
25
+ % style file for ANLP '94 except that the margins are adjusted for A4
26
+ % paper. -- abney 13 Dec 94
27
+
28
+ % The ANLP '94 style file is a slightly modified
29
+ % version of the style used for AAAI and IJCAI, using some changes
30
+ % prepared by Fernando Pereira and others and some minor changes
31
+ % by Paul Jacobs.
32
+
33
+ % Papers prepared using the aclsub.sty file and acl.bst bibtex style
34
+ % should be easily converted to final format using this style.
35
+ % (1) Submission information (\wordcount, \subject, and \makeidpage)
36
+ % should be removed.
37
+ % (2) \summary should be removed. The summary material should come
38
+ % after \maketitle and should be in the ``abstract'' environment
39
+ % (between \begin{abstract} and \end{abstract}).
40
+ % (3) Check all citations. This style should handle citations correctly
41
+ % and also allows multiple citations separated by semicolons.
42
+ % (4) Check figures and examples. Because the final format is double-
43
+ % column, some adjustments may have to be made to fit text in the column
44
+ % column, some adjustments may have to be made to fit text in the column
45
+
46
+ % Place this in a file called aclap.sty in the TeX search path.
47
+ % (Placing it in the same directory as the paper should also work.)
48
+
49
+ % Prepared by Peter F. Patel-Schneider, liberally using the ideas of
50
+ % other style hackers, including Barbara Beeton.
51
+ % This style is NOT guaranteed to work. It is provided in the hope
52
+ % that it will make the preparation of papers easier.
53
+ %
54
+ % There are undoubtedly bugs in this style. If you make bug fixes,
55
+ % improvements, etc. please let me know. My e-mail address is:
56
+ % pfps@research.att.com
57
+
58
+ % Papers are to be prepared using the ``acl_natbib'' bibliography style,
59
+ % as follows:
60
+ % \documentclass[11pt]{article}
61
+ % \usepackage{acl2000}
62
+ % \title{Title}
63
+ % \author{Author 1 \and Author 2 \\ Address line \\ Address line \And
64
+ % Author 3 \\ Address line \\ Address line}
65
+ % \begin{document}
66
+ % ...
67
+ % \bibliography{bibliography-file}
68
+ % \bibliographystyle{acl_natbib}
69
+ % \end{document}
70
+
71
+ % Author information can be set in various styles:
72
+ % For several authors from the same institution:
73
+ % \author{Author 1 \and ... \and Author n \\
74
+ % Address line \\ ... \\ Address line}
75
+ % if the names do not fit well on one line use
76
+ % Author 1 \\ {\bf Author 2} \\ ... \\ {\bf Author n} \\
77
+ % For authors from different institutions:
78
+ % \author{Author 1 \\ Address line \\ ... \\ Address line
79
+ % \And ... \And
80
+ % Author n \\ Address line \\ ... \\ Address line}
81
+ % To start a separate ``row'' of authors use \AND, as in
82
+ % \author{Author 1 \\ Address line \\ ... \\ Address line
83
+ % \AND
84
+ % Author 2 \\ Address line \\ ... \\ Address line \And
85
+ % Author 3 \\ Address line \\ ... \\ Address line}
86
+
87
+ % If the title and author information does not fit in the area allocated,
88
+ % place \setlength\titlebox{<new height>} right after
89
+ % \usepackage{acl2015}
90
+ % where <new height> can be something larger than 5cm
91
+
92
+ % include hyperref, unless user specifies nohyperref option like this:
93
+ % \usepackage[nohyperref]{naaclhlt2018}
94
+ \newif\ifacl@hyperref
95
+ \DeclareOption{hyperref}{\acl@hyperreftrue}
96
+ \DeclareOption{nohyperref}{\acl@hyperreffalse}
97
+ \ExecuteOptions{hyperref} % default is to use hyperref
98
+ \ProcessOptions\relax
99
+ \ifacl@hyperref
100
+ \RequirePackage{hyperref}
101
+ \usepackage{xcolor} % make links dark blue
102
+ \definecolor{darkblue}{rgb}{0, 0, 0.5}
103
+ \hypersetup{colorlinks=true,citecolor=darkblue, linkcolor=darkblue, urlcolor=darkblue}
104
+ \else
105
+ % This definition is used if the hyperref package is not loaded.
106
+ % It provides a backup, no-op definition of \href.
107
+ % This is necessary because \href command is used in the acl_natbib.bst file.
108
+ \def\href#1#2{{#2}}
109
+ \fi
110
+
111
+ \typeout{Conference Style for NAACL-HLT 2018}
112
+
113
+ % NOTE: Some laser printers have a serious problem printing TeX output.
114
+ % These printing devices, commonly known as ``write-white'' laser
115
+ % printers, tend to make characters too light. To get around this
116
+ % problem, a darker set of fonts must be created for these devices.
117
+ %
118
+
119
+ \newcommand{\Thanks}[1]{\thanks{\ #1}}
120
+
121
+ % A4 modified by Eneko; again modified by Alexander for 5cm titlebox
122
+ \setlength{\paperwidth}{21cm} % A4
123
+ \setlength{\paperheight}{29.7cm}% A4
124
+ \setlength\topmargin{-0.5cm}
125
+ \setlength\oddsidemargin{0cm}
126
+ \setlength\textheight{24.7cm}
127
+ \setlength\textwidth{16.0cm}
128
+ \setlength\columnsep{0.6cm}
129
+ \newlength\titlebox
130
+ \setlength\titlebox{5cm}
131
+ \setlength\headheight{5pt}
132
+ \setlength\headsep{0pt}
133
+ \thispagestyle{empty}
134
+ \pagestyle{empty}
135
+
136
+
137
+ \flushbottom \twocolumn \sloppy
138
+
139
+ % We're never going to need a table of contents, so just flush it to
140
+ % save space --- suggested by drstrip@sandia-2
141
+ \def\addcontentsline#1#2#3{}
142
+
143
+ \newif\ifaclfinal
144
+ \aclfinalfalse
145
+ \def\aclfinalcopy{\global\aclfinaltrue}
146
+
147
+ %% ----- Set up hooks to repeat content on every page of the output doc,
148
+ %% necessary for the line numbers in the submitted version. --MM
149
+ %%
150
+ %% Copied from CVPR 2015's cvpr_eso.sty, which appears to be largely copied from everyshi.sty.
151
+ %%
152
+ %% Original cvpr_eso.sty available at: http://www.pamitc.org/cvpr15/author_guidelines.php
153
+ %% Original evershi.sty available at: https://www.ctan.org/pkg/everyshi
154
+ %%
155
+ %% Copyright (C) 2001 Martin Schr\"oder:
156
+ %%
157
+ %% Martin Schr"oder
158
+ %% Cr"usemannallee 3
159
+ %% D-28213 Bremen
160
+ %% Martin.Schroeder@ACM.org
161
+ %%
162
+ %% This program may be redistributed and/or modified under the terms
163
+ %% of the LaTeX Project Public License, either version 1.0 of this
164
+ %% license, or (at your option) any later version.
165
+ %% The latest version of this license is in
166
+ %% CTAN:macros/latex/base/lppl.txt.
167
+ %%
168
+ %% Happy users are requested to send [Martin] a postcard. :-)
169
+ %%
170
+ \newcommand{\@EveryShipoutACL@Hook}{}
171
+ \newcommand{\@EveryShipoutACL@AtNextHook}{}
172
+ \newcommand*{\EveryShipoutACL}[1]
173
+ {\g@addto@macro\@EveryShipoutACL@Hook{#1}}
174
+ \newcommand*{\AtNextShipoutACL@}[1]
175
+ {\g@addto@macro\@EveryShipoutACL@AtNextHook{#1}}
176
+ \newcommand{\@EveryShipoutACL@Shipout}{%
177
+ \afterassignment\@EveryShipoutACL@Test
178
+ \global\setbox\@cclv= %
179
+ }
180
+ \newcommand{\@EveryShipoutACL@Test}{%
181
+ \ifvoid\@cclv\relax
182
+ \aftergroup\@EveryShipoutACL@Output
183
+ \else
184
+ \@EveryShipoutACL@Output
185
+ \fi%
186
+ }
187
+ \newcommand{\@EveryShipoutACL@Output}{%
188
+ \@EveryShipoutACL@Hook%
189
+ \@EveryShipoutACL@AtNextHook%
190
+ \gdef\@EveryShipoutACL@AtNextHook{}%
191
+ \@EveryShipoutACL@Org@Shipout\box\@cclv%
192
+ }
193
+ \newcommand{\@EveryShipoutACL@Org@Shipout}{}
194
+ \newcommand*{\@EveryShipoutACL@Init}{%
195
+ \message{ABD: EveryShipout initializing macros}%
196
+ \let\@EveryShipoutACL@Org@Shipout\shipout
197
+ \let\shipout\@EveryShipoutACL@Shipout
198
+ }
199
+ \AtBeginDocument{\@EveryShipoutACL@Init}
200
+
201
+ %% ----- Set up for placing additional items into the submitted version --MM
202
+ %%
203
+ %% Based on eso-pic.sty
204
+ %%
205
+ %% Original available at: https://www.ctan.org/tex-archive/macros/latex/contrib/eso-pic
206
+ %% Copyright (C) 1998-2002 by Rolf Niepraschk <niepraschk@ptb.de>
207
+ %%
208
+ %% Which may be distributed and/or modified under the conditions of
209
+ %% the LaTeX Project Public License, either version 1.2 of this license
210
+ %% or (at your option) any later version. The latest version of this
211
+ %% license is in:
212
+ %%
213
+ %% http://www.latex-project.org/lppl.txt
214
+ %%
215
+ %% and version 1.2 or later is part of all distributions of LaTeX version
216
+ %% 1999/12/01 or later.
217
+ %%
218
+ %% In contrast to the original, we do not include the definitions for/using:
219
+ %% gridpicture, div[2], isMEMOIR[1], gridSetup[6][], subgridstyle{dotted}, labelfactor{}, gap{}, gridunitname{}, gridunit{}, gridlines{\thinlines}, subgridlines{\thinlines}, the {keyval} package, evenside margin, nor any definitions with 'color'.
220
+ %%
221
+ %% These are beyond what is needed for the NAACL style.
222
+ %%
223
+ \newcommand\LenToUnit[1]{#1\@gobble}
224
+ \newcommand\AtPageUpperLeft[1]{%
225
+ \begingroup
226
+ \@tempdima=0pt\relax\@tempdimb=\ESO@yoffsetI\relax
227
+ \put(\LenToUnit{\@tempdima},\LenToUnit{\@tempdimb}){#1}%
228
+ \endgroup
229
+ }
230
+ \newcommand\AtPageLowerLeft[1]{\AtPageUpperLeft{%
231
+ \put(0,\LenToUnit{-\paperheight}){#1}}}
232
+ \newcommand\AtPageCenter[1]{\AtPageUpperLeft{%
233
+ \put(\LenToUnit{.5\paperwidth},\LenToUnit{-.5\paperheight}){#1}}}
234
+ \newcommand\AtPageLowerCenter[1]{\AtPageUpperLeft{%
235
+ \put(\LenToUnit{.5\paperwidth},\LenToUnit{-\paperheight}){#1}}}%
236
+ \newcommand\AtPageLowishCenter[1]{\AtPageUpperLeft{%
237
+ \put(\LenToUnit{.5\paperwidth},\LenToUnit{-.96\paperheight}){#1}}}
238
+ \newcommand\AtTextUpperLeft[1]{%
239
+ \begingroup
240
+ \setlength\@tempdima{1in}%
241
+ \advance\@tempdima\oddsidemargin%
242
+ \@tempdimb=\ESO@yoffsetI\relax\advance\@tempdimb-1in\relax%
243
+ \advance\@tempdimb-\topmargin%
244
+ \advance\@tempdimb-\headheight\advance\@tempdimb-\headsep%
245
+ \put(\LenToUnit{\@tempdima},\LenToUnit{\@tempdimb}){#1}%
246
+ \endgroup
247
+ }
248
+ \newcommand\AtTextLowerLeft[1]{\AtTextUpperLeft{%
249
+ \put(0,\LenToUnit{-\textheight}){#1}}}
250
+ \newcommand\AtTextCenter[1]{\AtTextUpperLeft{%
251
+ \put(\LenToUnit{.5\textwidth},\LenToUnit{-.5\textheight}){#1}}}
252
+ \newcommand{\ESO@HookI}{} \newcommand{\ESO@HookII}{}
253
+ \newcommand{\ESO@HookIII}{}
254
+ \newcommand{\AddToShipoutPicture}{%
255
+ \@ifstar{\g@addto@macro\ESO@HookII}{\g@addto@macro\ESO@HookI}}
256
+ \newcommand{\ClearShipoutPicture}{\global\let\ESO@HookI\@empty}
257
+ \newcommand{\@ShipoutPicture}{%
258
+ \bgroup
259
+ \@tempswafalse%
260
+ \ifx\ESO@HookI\@empty\else\@tempswatrue\fi%
261
+ \ifx\ESO@HookII\@empty\else\@tempswatrue\fi%
262
+ \ifx\ESO@HookIII\@empty\else\@tempswatrue\fi%
263
+ \if@tempswa%
264
+ \@tempdima=1in\@tempdimb=-\@tempdima%
265
+ \advance\@tempdimb\ESO@yoffsetI%
266
+ \unitlength=1pt%
267
+ \global\setbox\@cclv\vbox{%
268
+ \vbox{\let\protect\relax
269
+ \pictur@(0,0)(\strip@pt\@tempdima,\strip@pt\@tempdimb)%
270
+ \ESO@HookIII\ESO@HookI\ESO@HookII%
271
+ \global\let\ESO@HookII\@empty%
272
+ \endpicture}%
273
+ \nointerlineskip%
274
+ \box\@cclv}%
275
+ \fi
276
+ \egroup
277
+ }
278
+ \EveryShipoutACL{\@ShipoutPicture}
279
+ \newif\ifESO@dvips\ESO@dvipsfalse
280
+ \newif\ifESO@grid\ESO@gridfalse
281
+ \newif\ifESO@texcoord\ESO@texcoordfalse
282
+ \newcommand*\ESO@griddelta{}\newcommand*\ESO@griddeltaY{}
283
+ \newcommand*\ESO@gridDelta{}\newcommand*\ESO@gridDeltaY{}
284
+ \newcommand*\ESO@yoffsetI{}\newcommand*\ESO@yoffsetII{}
285
+ \ifESO@texcoord
286
+ \def\ESO@yoffsetI{0pt}\def\ESO@yoffsetII{-\paperheight}
287
+ \edef\ESO@griddeltaY{-\ESO@griddelta}\edef\ESO@gridDeltaY{-\ESO@gridDelta}
288
+ \else
289
+ \def\ESO@yoffsetI{\paperheight}\def\ESO@yoffsetII{0pt}
290
+ \edef\ESO@griddeltaY{\ESO@griddelta}\edef\ESO@gridDeltaY{\ESO@gridDelta}
291
+ \fi
292
+
293
+
294
+ %% ----- Submitted version markup: Page numbers, ruler, and confidentiality. Using ideas/code from cvpr.sty 2015. --MM
295
+
296
+ \font\naaclhv = phvb at 8pt
297
+
298
+ %% Define vruler %%
299
+
300
+ %\makeatletter
301
+ \newbox\aclrulerbox
302
+ \newcount\aclrulercount
303
+ \newdimen\aclruleroffset
304
+ \newdimen\cv@lineheight
305
+ \newdimen\cv@boxheight
306
+ \newbox\cv@tmpbox
307
+ \newcount\cv@refno
308
+ \newcount\cv@tot
309
+ % NUMBER with left flushed zeros \fillzeros[<WIDTH>]<NUMBER>
310
+ \newcount\cv@tmpc@ \newcount\cv@tmpc
311
+ \def\fillzeros[#1]#2{\cv@tmpc@=#2\relax\ifnum\cv@tmpc@<0\cv@tmpc@=-\cv@tmpc@\fi
312
+ \cv@tmpc=1 %
313
+ \loop\ifnum\cv@tmpc@<10 \else \divide\cv@tmpc@ by 10 \advance\cv@tmpc by 1 \fi
314
+ \ifnum\cv@tmpc@=10\relax\cv@tmpc@=11\relax\fi \ifnum\cv@tmpc@>10 \repeat
315
+ \ifnum#2<0\advance\cv@tmpc1\relax-\fi
316
+ \loop\ifnum\cv@tmpc<#1\relax0\advance\cv@tmpc1\relax\fi \ifnum\cv@tmpc<#1 \repeat
317
+ \cv@tmpc@=#2\relax\ifnum\cv@tmpc@<0\cv@tmpc@=-\cv@tmpc@\fi \relax\the\cv@tmpc@}%
318
+ % \makevruler[<SCALE>][<INITIAL_COUNT>][<STEP>][<DIGITS>][<HEIGHT>]
319
+ \def\makevruler[#1][#2][#3][#4][#5]{\begingroup\offinterlineskip
320
+ \textheight=#5\vbadness=10000\vfuzz=120ex\overfullrule=0pt%
321
+ \global\setbox\aclrulerbox=\vbox to \textheight{%
322
+ {\parskip=0pt\hfuzz=150em\cv@boxheight=\textheight
323
+ \color{gray}
324
+ \cv@lineheight=#1\global\aclrulercount=#2%
325
+ \cv@tot\cv@boxheight\divide\cv@tot\cv@lineheight\advance\cv@tot2%
326
+ \cv@refno1\vskip-\cv@lineheight\vskip1ex%
327
+ \loop\setbox\cv@tmpbox=\hbox to0cm{{\naaclhv\hfil\fillzeros[#4]\aclrulercount}}%
328
+ \ht\cv@tmpbox\cv@lineheight\dp\cv@tmpbox0pt\box\cv@tmpbox\break
329
+ \advance\cv@refno1\global\advance\aclrulercount#3\relax
330
+ \ifnum\cv@refno<\cv@tot\repeat}}\endgroup}%
331
+ %\makeatother
332
+
333
+
334
+ \def\aclpaperid{***}
335
+ \def\confidential{NAACL-HLT 2018 Submission~\aclpaperid. Confidential Review Copy. DO NOT DISTRIBUTE.}
336
+
337
+ %% Page numbering, Vruler and Confidentiality %%
338
+ % \makevruler[<SCALE>][<INITIAL_COUNT>][<STEP>][<DIGITS>][<HEIGHT>]
339
+ \def\aclruler#1{\makevruler[14.17pt][#1][1][3][\textheight]\usebox{\aclrulerbox}}
340
+ \def\leftoffset{-2.1cm} %original: -45pt
341
+ \def\rightoffset{17.5cm} %original: 500pt
342
+ \ifaclfinal\else\pagenumbering{arabic}
343
+ \AddToShipoutPicture{%
344
+ \ifaclfinal\else
345
+ \AtPageLowishCenter{\thepage}
346
+ \aclruleroffset=\textheight
347
+ \advance\aclruleroffset4pt
348
+ \AtTextUpperLeft{%
349
+ \put(\LenToUnit{\leftoffset},\LenToUnit{-\aclruleroffset}){%left ruler
350
+ \aclruler{\aclrulercount}}
351
+ \put(\LenToUnit{\rightoffset},\LenToUnit{-\aclruleroffset}){%right ruler
352
+ \aclruler{\aclrulercount}}
353
+ }
354
+ \AtTextUpperLeft{%confidential
355
+ \put(0,\LenToUnit{1cm}){\parbox{\textwidth}{\centering\naaclhv\confidential}}
356
+ }
357
+ \fi
358
+ }
359
+
360
+ %%%% ----- End settings for placing additional items into the submitted version --MM ----- %%%%
361
+
362
+ %%%% ----- Begin settings for both submitted and camera-ready version ----- %%%%
363
+
364
+ %% Title and Authors %%
365
+
366
+ \newcommand\outauthor{
367
+ \begin{tabular}[t]{c}
368
+ \ifaclfinal
369
+ \bf\@author
370
+ \else
371
+ % Avoiding common accidental de-anonymization issue. --MM
372
+ \bf Anonymous ACL submission
373
+ \fi
374
+ \end{tabular}}
375
+
376
+ % Changing the expanded titlebox for submissions to 2.5 in (rather than 6.5cm)
377
+ % and moving it to the style sheet, rather than within the example tex file. --MM
378
+ \ifaclfinal
379
+ \else
380
+ \addtolength\titlebox{.25in}
381
+ \fi
382
+ % Mostly taken from deproc.
383
+ \def\maketitle{\par
384
+ \begingroup
385
+ \def\thefootnote{\fnsymbol{footnote}}
386
+ \def\@makefnmark{\hbox to 0pt{$^{\@thefnmark}$\hss}}
387
+ \twocolumn[\@maketitle] \@thanks
388
+ \endgroup
389
+ \setcounter{footnote}{0}
390
+ \let\maketitle\relax \let\@maketitle\relax
391
+ \gdef\@thanks{}\gdef\@author{}\gdef\@title{}\let\thanks\relax}
392
+ \def\@maketitle{\vbox to \titlebox{\hsize\textwidth
393
+ \linewidth\hsize \vskip 0.125in minus 0.125in \centering
394
+ {\Large\bf \@title \par} \vskip 0.2in plus 1fil minus 0.1in
395
+ {\def\and{\unskip\enspace{\rm and}\enspace}%
396
+ \def\And{\end{tabular}\hss \egroup \hskip 1in plus 2fil
397
+ \hbox to 0pt\bgroup\hss \begin{tabular}[t]{c}\bf}%
398
+ \def\AND{\end{tabular}\hss\egroup \hfil\hfil\egroup
399
+ \vskip 0.25in plus 1fil minus 0.125in
400
+ \hbox to \linewidth\bgroup\large \hfil\hfil
401
+ \hbox to 0pt\bgroup\hss \begin{tabular}[t]{c}\bf}
402
+ \hbox to \linewidth\bgroup\large \hfil\hfil
403
+ \hbox to 0pt\bgroup\hss
404
+ \outauthor
405
+ \hss\egroup
406
+ \hfil\hfil\egroup}
407
+ \vskip 0.3in plus 2fil minus 0.1in
408
+ }}
409
+
410
+ % margins and font size for abstract
411
+ \renewenvironment{abstract}%
412
+ {\centerline{\large\bf Abstract}%
413
+ \begin{list}{}%
414
+ {\setlength{\rightmargin}{0.6cm}%
415
+ \setlength{\leftmargin}{0.6cm}}%
416
+ \item[]\ignorespaces%
417
+ \@setsize\normalsize{12pt}\xpt\@xpt
418
+ }%
419
+ {\unskip\end{list}}
420
+
421
+ %\renewenvironment{abstract}{\centerline{\large\bf
422
+ % Abstract}\vspace{0.5ex}\begin{quote}}{\par\end{quote}\vskip 1ex}
423
+
424
+ % Resizing figure and table captions
425
+ \newcommand{\figcapfont}{\rm}
426
+ \newcommand{\tabcapfont}{\rm}
427
+ \renewcommand{\fnum@figure}{\figcapfont Figure \thefigure}
428
+ \renewcommand{\fnum@table}{\tabcapfont Table \thetable}
429
+ \renewcommand{\figcapfont}{\@setsize\normalsize{12pt}\xpt\@xpt}
430
+ \renewcommand{\tabcapfont}{\@setsize\normalsize{12pt}\xpt\@xpt}
431
+
432
+ \RequirePackage{natbib}
433
+ % for citation commands in the .tex, authors can use:
434
+ % \citep, \citet, and \citeyearpar for compatibility with natbib, or
435
+ % \cite, \newcite, and \shortcite for compatibility with older ACL .sty files
436
+ \renewcommand\cite{\citep} % to get "(Author Year)" with natbib
437
+ \newcommand\shortcite{\citeyearpar}% to get "(Year)" with natbib
438
+ \newcommand\newcite{\citet} % to get "Author (Year)" with natbib
439
+
440
+
441
+ % bibliography
442
+
443
+ \def\@up#1{\raise.2ex\hbox{#1}}
444
+
445
+ % Don't put a label in the bibliography at all. Just use the unlabeled format
446
+ % instead.
447
+ \def\thebibliography#1{\vskip\parskip%
448
+ \vskip\baselineskip%
449
+ \def\baselinestretch{1}%
450
+ \ifx\@currsize\normalsize\@normalsize\else\@currsize\fi%
451
+ \vskip-\parskip%
452
+ \vskip-\baselineskip%
453
+ \section*{References\@mkboth
454
+ {References}{References}}\list
455
+ {}{\setlength{\labelwidth}{0pt}\setlength{\leftmargin}{\parindent}
456
+ \setlength{\itemindent}{-\parindent}}
457
+ \def\newblock{\hskip .11em plus .33em minus -.07em}
458
+ \sloppy\clubpenalty4000\widowpenalty4000
459
+ \sfcode`\.=1000\relax}
460
+ \let\endthebibliography=\endlist
461
+
462
+
463
+ % Allow for a bibliography of sources of attested examples
464
+ \def\thesourcebibliography#1{\vskip\parskip%
465
+ \vskip\baselineskip%
466
+ \def\baselinestretch{1}%
467
+ \ifx\@currsize\normalsize\@normalsize\else\@currsize\fi%
468
+ \vskip-\parskip%
469
+ \vskip-\baselineskip%
470
+ \section*{Sources of Attested Examples\@mkboth
471
+ {Sources of Attested Examples}{Sources of Attested Examples}}\list
472
+ {}{\setlength{\labelwidth}{0pt}\setlength{\leftmargin}{\parindent}
473
+ \setlength{\itemindent}{-\parindent}}
474
+ \def\newblock{\hskip .11em plus .33em minus -.07em}
475
+ \sloppy\clubpenalty4000\widowpenalty4000
476
+ \sfcode`\.=1000\relax}
477
+ \let\endthesourcebibliography=\endlist
478
+
479
+ % sections with less space
480
+ \def\section{\@startsection {section}{1}{\z@}{-2.0ex plus
481
+ -0.5ex minus -.2ex}{1.5ex plus 0.3ex minus .2ex}{\large\bf\raggedright}}
482
+ \def\subsection{\@startsection{subsection}{2}{\z@}{-1.8ex plus
483
+ -0.5ex minus -.2ex}{0.8ex plus .2ex}{\normalsize\bf\raggedright}}
484
+ %% changed by KO to - values to get the initial parindent right
485
+ \def\subsubsection{\@startsection{subsubsection}{3}{\z@}{-1.5ex plus
486
+ -0.5ex minus -.2ex}{0.5ex plus .2ex}{\normalsize\bf\raggedright}}
487
+ \def\paragraph{\@startsection{paragraph}{4}{\z@}{1.5ex plus
488
+ 0.5ex minus .2ex}{-1em}{\normalsize\bf}}
489
+ \def\subparagraph{\@startsection{subparagraph}{5}{\parindent}{1.5ex plus
490
+ 0.5ex minus .2ex}{-1em}{\normalsize\bf}}
491
+
492
+ % Footnotes
493
+ \footnotesep 6.65pt %
494
+ \skip\footins 9pt plus 4pt minus 2pt
495
+ \def\footnoterule{\kern-3pt \hrule width 5pc \kern 2.6pt }
496
+ \setcounter{footnote}{0}
497
+
498
+ % Lists and paragraphs
499
+ \parindent 1em
500
+ \topsep 4pt plus 1pt minus 2pt
501
+ \partopsep 1pt plus 0.5pt minus 0.5pt
502
+ \itemsep 2pt plus 1pt minus 0.5pt
503
+ \parsep 2pt plus 1pt minus 0.5pt
504
+
505
+ \leftmargin 2em \leftmargini\leftmargin \leftmarginii 2em
506
+ \leftmarginiii 1.5em \leftmarginiv 1.0em \leftmarginv .5em \leftmarginvi .5em
507
+ \labelwidth\leftmargini\advance\labelwidth-\labelsep \labelsep 5pt
508
+
509
+ \def\@listi{\leftmargin\leftmargini}
510
+ \def\@listii{\leftmargin\leftmarginii
511
+ \labelwidth\leftmarginii\advance\labelwidth-\labelsep
512
+ \topsep 2pt plus 1pt minus 0.5pt
513
+ \parsep 1pt plus 0.5pt minus 0.5pt
514
+ \itemsep \parsep}
515
+ \def\@listiii{\leftmargin\leftmarginiii
516
+ \labelwidth\leftmarginiii\advance\labelwidth-\labelsep
517
+ \topsep 1pt plus 0.5pt minus 0.5pt
518
+ \parsep \z@ \partopsep 0.5pt plus 0pt minus 0.5pt
519
+ \itemsep \topsep}
520
+ \def\@listiv{\leftmargin\leftmarginiv
521
+ \labelwidth\leftmarginiv\advance\labelwidth-\labelsep}
522
+ \def\@listv{\leftmargin\leftmarginv
523
+ \labelwidth\leftmarginv\advance\labelwidth-\labelsep}
524
+ \def\@listvi{\leftmargin\leftmarginvi
525
+ \labelwidth\leftmarginvi\advance\labelwidth-\labelsep}
526
+
527
+ \abovedisplayskip 7pt plus2pt minus5pt%
528
+ \belowdisplayskip \abovedisplayskip
529
+ \abovedisplayshortskip 0pt plus3pt%
530
+ \belowdisplayshortskip 4pt plus3pt minus3pt%
531
+
532
+ % Less leading in most fonts (due to the narrow columns)
533
+ % The choices were between 1-pt and 1.5-pt leading
534
+ \def\@normalsize{\@setsize\normalsize{11pt}\xpt\@xpt}
535
+ \def\small{\@setsize\small{10pt}\ixpt\@ixpt}
536
+ \def\footnotesize{\@setsize\footnotesize{10pt}\ixpt\@ixpt}
537
+ \def\scriptsize{\@setsize\scriptsize{8pt}\viipt\@viipt}
538
+ \def\tiny{\@setsize\tiny{7pt}\vipt\@vipt}
539
+ \def\large{\@setsize\large{14pt}\xiipt\@xiipt}
540
+ \def\Large{\@setsize\Large{16pt}\xivpt\@xivpt}
541
+ \def\LARGE{\@setsize\LARGE{20pt}\xviipt\@xviipt}
542
+ \def\huge{\@setsize\huge{23pt}\xxpt\@xxpt}
543
+ \def\Huge{\@setsize\Huge{28pt}\xxvpt\@xxvpt}
references/2020.emnlp.nguyen/paper.md ADDED
@@ -0,0 +1,123 @@
1
+ ---
2
+ title: "PhoBERT: Pre-trained language models for Vietnamese"
3
+ authors:
4
+ - "Dat Quoc Nguyen"
5
+ - "Anh Tuan Nguyen"
6
+ year: 2020
7
+ venue: "EMNLP Findings 2020"
8
+ url: "https://aclanthology.org/2020.findings-emnlp.92/"
9
+ ---
10
+
11
+ We present **PhoBERT** with two versions, PhoBERT-base and PhoBERT-large, the *first* public large-scale monolingual language models pre-trained for Vietnamese. Experimental results show that PhoBERT consistently outperforms the recent best pre-trained multilingual model XLM-R [conneau2019unsupervised] and improves the state-of-the-art in multiple Vietnamese-specific NLP tasks including Part-of-speech tagging, Dependency parsing, Named-entity recognition and Natural language inference. We release PhoBERT to facilitate future research and downstream applications for Vietnamese NLP. Our PhoBERT models are available at: https://github.com/VinAIResearch/PhoBERT.
12
+ # Introduction
13
+ Pre-trained language models, especially BERT [devlin-etal-2019-bert]---the Bidirectional Encoder Representations from Transformers [NIPS2017_7181], have recently become extremely popular and helped to produce significant improvement gains for various NLP tasks. The success of pre-trained BERT and its variants has largely been limited to the English language. For other languages, one could retrain a language-specific model using the BERT architecture [abs-1906-08101,vries2019bertje,vu-xuan-etal-2019-etnlp,2019arXiv191103894M] or employ existing pre-trained multilingual BERT-based models [devlin-etal-2019-bert,NIPS2019_8928,conneau2019unsupervised].
14
+ In terms of Vietnamese language modeling, to the best of our knowledge, there are two main concerns as follows:
17
+ - The Vietnamese Wikipedia corpus is the only data used to train monolingual language models [vu-xuan-etal-2019-etnlp], and it is also the only Vietnamese dataset included in the pre-training data used by all multilingual language models except XLM-R. It is worth noting that Wikipedia data is not representative of general language use, and the Vietnamese Wikipedia data is relatively small (1GB uncompressed), while pre-trained language models can be significantly improved by using more pre-training data [RoBERTa].
18
+ - All publicly released monolingual and multilingual BERT-based language models are not aware of the difference between Vietnamese syllables and word tokens. This ambiguity comes from the fact that white space is also used to separate syllables that constitute words when written in Vietnamese: prior work shows that about 85% of Vietnamese word types are composed of at least two syllables.
19
+ For example, the 6-syllable written text "Tôi là một nghiên cứu viên" (I am a researcher) forms 4 words "Tôi là một nghiên\_cứu\_viên".
20
+ Without doing a pre-processing step of Vietnamese word segmentation, those models directly apply Byte-Pair encoding (BPE) methods [sennrich-etal-2016-neural,kudo-richardson-2018-sentencepiece] to the syllable-level Vietnamese pre-training data. (Note that [vu-xuan-etal-2019-etnlp] do not publicly release any pre-trained BERT-based language model (https://github.com/vietnlp/etnlp); in particular, they release a set of 15K BERT-based word embeddings specialized only for the Vietnamese NER task.)
21
+ Intuitively, for word-level Vietnamese NLP tasks, those models pre-trained on syllable-level data might not perform as well as language models pre-trained on word-level data.
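To make the syllable-vs-word distinction concrete, here is a minimal sketch. It is illustrative only: the underscore-joined form simply mirrors the paper's example, whereas a real pipeline needs a trained Vietnamese word segmenter to produce it.

```python
# Syllable-level text: white space separates syllables, not words.
syllable_text = "Tôi là một nghiên cứu viên"

# Word-level text: the syllables of a multi-syllable word are joined
# with underscores, mirroring the paper's example ("I am a researcher").
word_text = "Tôi là một nghiên_cứu_viên"

syllables = syllable_text.split()
words = word_text.split()
print(len(syllables), len(words))  # 6 syllables, but only 4 word tokens
```

A syllable-level tokenizer therefore sees six whitespace-separated units where a word-level one sees four, which is exactly why BPE applied to unsegmented text cannot recover Vietnamese word boundaries.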
22
+ To handle the two concerns above, we train the *first* large-scale monolingual BERT-based "base" and "large" models using a 20GB *word-level* Vietnamese corpus.
23
+ We evaluate our models on four downstream Vietnamese NLP tasks: the common word-level ones of Part-of-speech (POS) tagging, Dependency parsing and Named-entity recognition (NER), and a language understanding task of Natural language inference (NLI) which can be formulated as either a syllable- or word-level task. Experimental results show that our models obtain state-of-the-art (SOTA) results on all these tasks.
24
+ Our contributions are summarized as follows:
25
27
+ - We present the *first* large-scale monolingual language models pre-trained for Vietnamese.
28
+ - Our models help produce SOTA performances on four downstream tasks of POS tagging, Dependency parsing, NER and NLI, thus showing the effectiveness of large-scale BERT-based monolingual language models for Vietnamese.
29
+ - To the best of our knowledge, we also perform the *first* set of experiments to compare monolingual language models with the recent best multilingual model XLM-R in multiple (i.e. four) different language-specific tasks. The experiments show that our models outperform XLM-R on all these tasks, thus convincingly confirming that dedicated language-specific models still outperform multilingual ones.
30
+ - We publicly release our models under the name PhoBERT which can be used with `fairseq` [ott2019fairseq] and `transformers` [Wolf2019HuggingFacesTS]. We hope that PhoBERT can serve as a strong baseline for future Vietnamese NLP research and applications.
31
+ # PhoBERT
32
+ This section outlines the architecture and describes the pre-training data and optimization setup that we use for PhoBERT.
33
+ **Architecture:**\ Our PhoBERT has two versions, PhoBERT$_{base}$ and PhoBERT$_{large}$, using the same architectures as BERT$_{base}$ and BERT$_{large}$, respectively. The PhoBERT pre-training approach is based on RoBERTa [RoBERTa], which optimizes the BERT pre-training procedure for more robust performance.
34
+ **Pre-training data:**\ To handle the first concern mentioned in Section [sec:intro], we use a 20GB pre-training dataset of uncompressed texts. This dataset is a concatenation of two corpora: (i) the first one is the Vietnamese Wikipedia corpus ($\sim$1GB), and (ii) the second corpus ($\sim$19GB) is generated by removing similar articles and duplication from a 50GB Vietnamese news corpus (https://github.com/binhvq/news-corpus, crawled from a wide range of news websites and topics). To solve the second concern,
35
+ we employ RDRSegmenter [nguyen-etal-2018-fast] from VnCoreNLP [vu-etal-2018-vncorenlp] to perform word and sentence segmentation on the pre-training dataset, resulting in $\sim$145M word-segmented sentences ($\sim$3B word tokens). Different from RoBERTa, we then apply `fastBPE` [sennrich-etal-2016-neural] to segment these sentences with subword units, using a vocabulary of 64K subword types. On average there are 24.4 subword tokens per sentence.
36
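The deduplication step mentioned above can be sketched as a normalized-hash filter that keeps the first occurrence of each article. The paper does not specify its exact similarity criterion, so treat this purely as an illustration of the filtering stage, not the authors' method.

```python
# Minimal sketch of corpus deduplication, assuming a simple approach:
# hash each article after whitespace/case normalization and keep the first
# occurrence. (Assumption: the real pipeline may use a fuzzier similarity
# test to catch "similar articles" as well as exact duplicates.)
import hashlib

def dedup(articles):
    seen, kept = set(), []
    for text in articles:
        key = hashlib.sha1(" ".join(text.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept

corpus = ["Tin tức hôm nay ...", "Tin  tức hôm nay ...", "Một bài báo khác"]
print(len(dedup(corpus)))  # the near-identical first two collapse into one
```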
+ **Optimization:**\ We employ the RoBERTa implementation in `fairseq` [ott2019fairseq]. We set a maximum length at 256 subword tokens, thus generating 145M $\times$ 24.4 / 256 $\approx$ 13.8M sentence blocks. Following [RoBERTa], we optimize the models using Adam [KingmaB14]. We use a batch size of 1024 across 4 V100 GPUs (16GB each) and a peak learning rate of 0.0004 for PhoBERT$_{base}$, and a batch size of 512 and a peak learning rate of 0.0002 for PhoBERT$_{large}$. We run for 40 epochs (here, the learning rate is warmed up for 2 epochs), thus resulting in 13.8M $\times$ 40 / 1024 $\approx$ 540K training steps for PhoBERT$_{base}$ and 1.08M training steps for PhoBERT$_{large}$. We pre-train PhoBERT$_{base}$ during 3 weeks, and then PhoBERT$_{large}$ during 5 weeks.
37
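The sentence-block and training-step counts above are simple arithmetic from the stated corpus and batch sizes; a quick sanity check:

```python
# Reproduce the back-of-envelope counts from the optimization setup above.
sentences = 145e6            # ~145M word-segmented sentences
subwords_per_sentence = 24.4
max_len = 256                # subword tokens per sentence block

blocks = sentences * subwords_per_sentence / max_len
print(f"{blocks / 1e6:.1f}M sentence blocks")            # ~13.8M

epochs = 40
steps_base = blocks * epochs / 1024   # PhoBERT_base, batch size 1024
steps_large = blocks * epochs / 512   # PhoBERT_large, batch size 512
print(f"{steps_base / 1e3:.0f}K steps (base)")           # ~540K
print(f"{steps_large / 1e6:.2f}M steps (large)")         # ~1.08M
```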
+ | **Task** | **\#training** | **\#valid** | **\#test** |
+ |---|---|---|---|
+ | POS tagging$^\dagger$ | 27,000 | 870 | 2,120 |
+ | Dep. parsing$^\dagger$ | 8,977 | 200 | 1,020 |
+ | NER$^\dagger$ | 14,861 | 2,000 | 2,831 |
+ | NLI$^\ddagger$ | 392,702 | 2,490 | 5,010 |
+ 
+ Table: Statistics of the downstream task datasets. ``\#training'', ``\#valid'' and ``\#test'' denote the size of the training, validation and test sets, respectively. $\dagger$ and $\ddagger$ refer to the dataset size as the numbers of sentences and sentence pairs, respectively.
47
+ | **POS tagging** (word-level) | Acc. | **Dependency parsing** (word-level) | LAS / UAS |
+ |---|---|---|---|
+ | RDRPOSTagger [nguyen-etal-2014-rdrpostagger] [$\clubsuit$] | 95.1 | \_ | \_ |
+ | BiLSTM-CNN-CRF [ma-hovy-2016-end] [$\clubsuit$] | 95.4 | VnCoreNLP-DEP [vu-etal-2018-vncorenlp] [$\bigstar$] | 71.38 / 77.35 |
+ | VnCoreNLP-POS [nguyen-etal-2017-word] [$\clubsuit$] | 95.9 | jPTDP-v2 [$\bigstar$] | 73.12 / 79.63 |
+ | jPTDP-v2 [nguyen-verspoor-2018-improved] [$\bigstar$] | 95.7 | jointWPD [$\bigstar$] | 73.90 / 80.12 |
+ | jointWPD [nguyen-2019-neural] [$\bigstar$] | 96.0 | Biaffine [DozatM17] [$\bigstar$] | 74.99 / 81.19 |
+ | XLM-R$_{base}$ (our result) | 96.2 | Biaffine w/ XLM-R$_{base}$ (our result) | 76.46 / 83.10 |
+ | XLM-R$_{large}$ (our result) | 96.3 | Biaffine w/ XLM-R$_{large}$ (our result) | 75.87 / 82.70 |
+ | PhoBERT$_{base}$ | 96.7 | Biaffine w/ PhoBERT$_{base}$ | **78.77** / **85.22** |
+ | PhoBERT$_{large}$ | **96.8** | Biaffine w/ PhoBERT$_{large}$ | 77.85 / 84.32 |
+ 
+ Table: Performance scores (in \%) on the POS tagging and Dependency parsing test sets. ``Acc.'', ``LAS'' and ``UAS'' abbreviate the Accuracy, the Labeled Attachment Score and the Unlabeled Attachment Score, respectively (here, all these evaluation metrics are computed on all word tokens, including punctuation). [$\clubsuit$] and [$\bigstar$] denote results reported by [nguyen-etal-2017-word] and [nguyen-2019-neural], respectively.
68
+ # Experimental setup
69
+ We evaluate the performance of PhoBERT on four downstream Vietnamese NLP tasks: POS tagging, Dependency parsing, NER and NLI.
70
+ ### Downstream task datasets
71
+ Table [tab:data] presents the statistics of the experimental datasets that we employ for downstream task evaluation.
72
+ For POS tagging, Dependency parsing and NER, we follow the VnCoreNLP setup [vu-etal-2018-vncorenlp], using standard benchmarks of the VLSP 2013 POS tagging dataset (https://vlsp.org.vn/vlsp2013/eval), the VnDT dependency treebank v1.1 [Nguyen2014NLDB] with POS tags predicted by VnCoreNLP, and the VLSP 2016 NER dataset [JCC13161].
73
+ For NLI, we use the manually-constructed Vietnamese validation and test sets from the cross-lingual NLI (XNLI) corpus v1.0 [conneau-etal-2018-xnli] where the Vietnamese training set is released as a machine-translated version of the corresponding English training set [N18-1101].
74
+ Unlike the POS tagging, Dependency parsing and NER datasets which provide the gold word segmentation, for NLI, we employ RDRSegmenter to segment the text into words before applying BPE to produce subwords from word tokens.
75
+ ### Fine-tuning
76
+ Following [devlin-etal-2019-bert], for POS tagging and NER, we append a linear prediction layer on top of the PhoBERT architecture (i.e. to the last Transformer layer of PhoBERT) w.r.t. the first subword of each word token. In our preliminary experiments, using the average of the contextualized embeddings of a word's subword tokens to represent the word produces slightly lower performance than using the contextualized embedding of the first subword.
77
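The first-subword alignment described above can be sketched as follows. The tokenizer here is a stand-in that splits on a fixed table, and the subword pieces and POS tags are illustrative; a real run would use PhoBERT's BPE tokenizer.

```python
# Sketch of aligning word-level tags to the first subword of each word,
# as in the tagging setup above. `subwords_of` is a hypothetical stand-in
# for a real BPE tokenizer; the pieces shown are not PhoBERT's actual output.
def subwords_of(word):
    table = {"nghiên_cứu_viên": ["nghiên_@@", "cứu_@@", "viên"]}
    return table.get(word, [word])

def first_subword_positions(words):
    positions, subwords = [], []
    for word in words:
        positions.append(len(subwords))  # index of this word's first subword
        subwords.extend(subwords_of(word))
    return subwords, positions

words = ["Tôi", "là", "một", "nghiên_cứu_viên"]
tags = ["P", "V", "M", "N"]  # illustrative word-level POS tags
subwords, positions = first_subword_positions(words)

# The linear prediction layer is applied only at these subword positions,
# so the word-level tag sequence is read off from them.
print(list(zip(positions, tags)))  # [(0, 'P'), (1, 'V'), (2, 'M'), (3, 'N')]
```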
+ For dependency parsing, following [nguyen-2019-neural], we employ a reimplementation of the state-of-the-art Biaffine dependency parser [DozatM17] from [ma-etal-2018-stack] with default optimal hyper-parameters.
78
+ We then extend this parser by replacing the pre-trained word embedding of each word in an input sentence by the corresponding contextualized embedding (from the last layer) computed for the first subword token of the word.
79
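The biaffine arc scoring at the heart of that parser can be sketched in a few lines of numpy. Dimensions and the random weights below are illustrative only; in the setup above, the input vectors would be PhoBERT's last-layer embeddings of each word's first subword, passed through the parser's own projection layers.

```python
# Minimal numpy sketch of biaffine arc scoring (Dozat & Manning, 2017):
# score(dep i, head j) = h_dep[i]^T U h_head[j] + u^T h_head[j].
# Toy sizes and random weights; not the parser's trained parameters.
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                            # 5 words, toy hidden size 8
H_dep = rng.standard_normal((n, d))    # dependent representations
H_head = rng.standard_normal((n, d))   # head representations
U = rng.standard_normal((d, d))        # bilinear weight matrix
u = rng.standard_normal(d)             # head-only bias term

# arc_scores[i, j] = score of word j being the head of word i
arc_scores = H_dep @ U @ H_head.T + H_head @ u
pred_heads = arc_scores.argmax(axis=1)  # greedy head choice per word
print(arc_scores.shape, pred_heads)
```

A full parser would add label scoring and a tree-constrained decoder on top of these raw arc scores.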
+ For POS tagging, NER and NLI, we employ `transformers` [Wolf2019HuggingFacesTS] to fine-tune PhoBERT for each task and each dataset independently. We use AdamW [loshchilov2018decoupled] with a fixed learning rate of 1e-5 and a batch size of 32 [RoBERTa]. We fine-tune for 30 training epochs, evaluate the task performance after each epoch on the validation set (here, early stopping is applied when there is no improvement after 5 continuous epochs), and then select the best model checkpoint to report the final result on the test set (note that each of our scores is an average over 5 runs with different random seeds).
80
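The checkpoint-selection loop described above (30 epochs, patience 5, keep the best validation score) can be sketched as:

```python
# Sketch of the fine-tuning model selection described above: train up to 30
# epochs, track the validation score after each epoch, stop after 5 epochs
# with no improvement, and keep the best checkpoint. The scores below are
# pre-baked stand-ins for real validation runs.
def select_checkpoint(val_scores, max_epochs=30, patience=5):
    best_epoch, best_score, no_improve = None, float("-inf"), 0
    for epoch, score in enumerate(val_scores[:max_epochs], start=1):
        if score > best_score:
            best_epoch, best_score, no_improve = epoch, score, 0
        else:
            no_improve += 1
            if no_improve >= patience:  # early stopping
                break
    return best_epoch, best_score

# Validation accuracies peaking at epoch 4, then plateauing.
scores = [70.1, 74.3, 77.8, 78.5, 78.2, 78.4, 78.1, 78.0, 77.9, 78.3]
print(select_checkpoint(scores))  # (4, 78.5); training stops at epoch 9
```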
+ | **NER** (word-level) | F$_1$ | **NLI** (syllable- or word-level) | Acc. |
+ |---|---|---|---|
+ | BiLSTM-CNN-CRF [$\blacklozenge$] | 88.3 | \_ | \_ |
+ | VnCoreNLP-NER [vu-etal-2018-vncorenlp] [$\blacklozenge$] | 88.6 | BiLSTM-max [conneau-etal-2018-xnli] | 66.4 |
+ | VNER [8713740] | 89.6 | mBiLSTM [ArtetxeS19] | 72.0 |
+ | BiLSTM-CNN-CRF + ETNLP [$\spadesuit$] | 91.1 | multilingual BERT [devlin-etal-2019-bert] [$\blacksquare$] | 69.5 |
+ | VnCoreNLP-NER + ETNLP [$\spadesuit$] | 91.3 | XLM$_{MLM+TLM}$ [NIPS2019_8928] | 76.6 |
+ | XLM-R$_{base}$ (our result) | 92.0 | XLM-R$_{base}$ [conneau2019unsupervised] | 75.4 |
+ | XLM-R$_{large}$ (our result) | 92.8 | XLM-R$_{large}$ [conneau2019unsupervised] | 79.7 |
+ | PhoBERT$_{base}$ | 93.6 | PhoBERT$_{base}$ | 78.5 |
+ | PhoBERT$_{large}$ | **94.7** | PhoBERT$_{large}$ | **80.0** |
+ 
+ Table: Performance scores (in \%) on the NER and NLI test sets. [$\blacklozenge$], [$\spadesuit$] and [$\blacksquare$] denote results reported by [vu-etal-2018-vncorenlp], [vu-xuan-etal-2019-etnlp] and [wu-dredze-2019-beto], respectively. Note that there are higher Vietnamese NLI results reported for XLM-R when fine-tuning on the concatenation of all 15 training datasets from the XNLI corpus (i.e. TRANSLATE-TRAIN-ALL: 79.5\% for XLM-R$_{base}$ and 83.4\% for XLM-R$_{large}$). However, those results might not be comparable as we only use the monolingual Vietnamese training data for fine-tuning.
102
+ # Experimental results
103
+ ### Main results
104
+ Tables [tab:posdep] and [tab:nernli] compare PhoBERT scores with the previous highest reported results, using the same experimental setup. It is clear that our PhoBERT helps produce new SOTA performance results for all four downstream tasks.
105
+ For POS tagging, the neural model jointWPD for joint POS tagging and dependency parsing [nguyen-2019-neural] and the feature-based model VnCoreNLP-POS [nguyen-etal-2017-word] are the two previous SOTA models, obtaining accuracies at about 96.0\%. PhoBERT obtains 0.8\% absolute higher accuracy than these two models.
106
+ For Dependency parsing, the previous highest parsing scores LAS and UAS are obtained by the Biaffine parser at 75.0\% and 81.2\%, respectively. PhoBERT helps boost the Biaffine parser with about 4\% absolute improvement, achieving a LAS at 78.8\% and a UAS at 85.2\%.
107
+ For NER, PhoBERT$_{large}$ produces 1.1 points higher F$_1$ than PhoBERT$_{base}$. In addition, PhoBERT$_{base}$ obtains 2+ points higher than the previous SOTA feature- and neural network-based models VnCoreNLP-NER [vu-etal-2018-vncorenlp] and BiLSTM-CNN-CRF [ma-hovy-2016-end], which are trained with the set of 15K BERT-based ETNLP word embeddings [vu-xuan-etal-2019-etnlp].
108
+ For NLI,
109
+ PhoBERT outperforms the multilingual BERT [devlin-etal-2019-bert] and the BERT-based cross-lingual model with a new translation language modeling objective XLM$_{MLM+TLM}$ [NIPS2019_8928] by large margins. PhoBERT also performs better than the recent best pre-trained multilingual model XLM-R, while using far fewer parameters than XLM-R: 135M (PhoBERT$_{base}$) vs. 250M (XLM-R$_{base}$); 370M (PhoBERT$_{large}$) vs. 560M (XLM-R$_{large}$).
110
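A back-of-envelope calculation shows where most of this parameter gap comes from: the embedding matrix scales with vocabulary size. The hidden size follows the standard "base" configuration, and the 64K and roughly 250K vocabulary sizes come from this paper and the XLM-R paper, respectively; the exact published totals also include the Transformer body and output projections.

```python
# Rough look at why PhoBERT is smaller than XLM-R at the same architecture
# size: embedding parameters grow linearly with the vocabulary. Figures are
# approximations, not the models' exact published parameter counts.
def embedding_params(vocab_size, hidden):
    return vocab_size * hidden

phobert_base_emb = embedding_params(64_000, 768)    # 64K subword vocabulary
xlmr_base_emb = embedding_params(250_000, 768)      # ~250K shared vocabulary

print(f"PhoBERT_base embeddings: {phobert_base_emb / 1e6:.1f}M")  # ~49M
print(f"XLM-R_base embeddings:   {xlmr_base_emb / 1e6:.1f}M")     # ~192M
# The Transformer body is the same size in both, so most of the
# 250M - 135M gap sits in the (tied) embedding matrices.
```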
+ ### Discussion
111
+ We find that PhoBERT$_{large}$ achieves 0.9\% lower dependency parsing scores than PhoBERT$_{base}$. One possible reason is that the last Transformer layer in the BERT architecture might not be the optimal one for encoding the richest information of syntactic structures [hewitt-manning-2019-structural,jawahar-etal-2019-bert]. Future work will study which of PhoBERT's Transformer layers contains richer syntactic information by evaluating the Vietnamese parsing performance from each layer.
112
+ Using more pre-training data can significantly improve the quality of the pre-trained language models [RoBERTa]. Thus it is not surprising that PhoBERT helps produce better performance than ETNLP on NER, and than the multilingual BERT and XLM$_{MLM+TLM}$ on NLI (here, PhoBERT uses 20GB of Vietnamese texts while those models employ the 1GB Vietnamese Wikipedia corpus).
113
+ Following the fine-tuning approach that we use for PhoBERT, we carefully fine-tune XLM-R for the remaining Vietnamese POS tagging, Dependency parsing and NER tasks (here, it is applied to the first sub-syllable token of the first syllable of each word). For fine-tuning XLM-R, we use a grid search on the validation set to select the AdamW learning rate from \{5e-6, 1e-5, 2e-5, 4e-5\} and the batch size from \{16, 32\}.
114
+ Tables [tab:posdep] and [tab:nernli] show that our PhoBERT also does better than XLM-R on these three word-level tasks.
115
+ It is worth noting that XLM-R uses a 2.5TB pre-training corpus which contains 137GB of Vietnamese texts (i.e. about 137 / 20 $\approx$ 7 times bigger than our pre-training corpus).
116
+ Recall that PhoBERT performs Vietnamese word segmentation to segment syllable-level sentences into word tokens before applying BPE to segment the word-segmented sentences into subword units, while XLM-R directly applies BPE to the syllable-level Vietnamese pre-training sentences.
117
+ This reconfirms that dedicated language-specific models still outperform multilingual ones [2019arXiv191103894M]. Note that [2019arXiv191103894M] only compare their model CamemBERT with XLM-R on the French NLI task.
118
+ # Conclusion
119
+ In this paper, we have presented the first large-scale monolingual PhoBERT language models pre-trained for Vietnamese. We demonstrate the usefulness of PhoBERT by showing that PhoBERT performs better than the recent best multilingual model XLM-R and helps produce the SOTA performances for four downstream Vietnamese NLP tasks of POS tagging, Dependency parsing, NER and NLI.
120
+ By publicly releasing PhoBERT models,
121
+ we hope that they can foster future research and applications in Vietnamese NLP.
122
references/2020.emnlp.nguyen/paper.tex ADDED
@@ -0,0 +1,301 @@
1
+ \documentclass[11pt,a4paper]{article}
2
+ \usepackage[hyperref]{emnlp2020}
3
+ \pdfoutput=1
4
+ \usepackage{times}
5
+ \usepackage{latexsym}
6
+ %\renewcommand{\UrlFont}{\ttfamily\small}
7
+
8
+ \usepackage{times}
9
+ \usepackage{latexsym}
10
+ \usepackage{amsmath}
11
+ \usepackage{url}
12
+ \usepackage{amssymb}
13
+ \usepackage{amsfonts}
14
+ \usepackage{graphicx}
15
+ \usepackage{tabularx}
16
+ \usepackage{multirow}
17
+ \usepackage{arydshln}
18
+ \usepackage{mathtools,nccmath}
19
+
20
+ \usepackage[utf8]{inputenc}
21
+ \usepackage[utf8]{vietnam}
22
+ \usepackage{enumitem}
23
+ % This is not strictly necessary, and may be commented out,
24
+ % but it will improve the layout of the manuscript,
25
+ % and will typically save some space.
26
+ %\usepackage{microtype}
27
+
28
+
29
+ \setlength{\textfloatsep}{15pt plus 5.0pt minus 5.0pt}
30
+ \setlength{\floatsep}{15pt plus 5.0pt minus 5.0pt}
31
+ %\setlength{\dbltextfloatsep }{15pt plus 2.0pt minus 3.0pt}
32
+ %\setlength{\dblfloatsep}{15pt plus 2.0pt minus 3.0pt}
33
+ %\setlength{\intextsep}{15pt plus 2.0pt minus 3.0pt}
34
+ \setlength{\abovecaptionskip}{3pt plus 1pt minus 1pt}
35
+
36
+ \aclfinalcopy % Uncomment this line for the final submission
37
+ %\def\aclpaperid{***} % Enter the acl Paper ID here
38
+
39
+ \setlength\titlebox{5cm}
40
+ % You can expand the titlebox if you need extra space
41
+ % to show all the authors. Please do not make the titlebox
42
+ % smaller than 5cm (the original size); we will check this
43
+ % in the camera-ready version and ask you to change it back.
44
+
45
+ \newcommand\BibTeX{B\textsc{ib}\TeX}
46
+
47
+
48
+ \title{PhoBERT: Pre-trained language models for Vietnamese}
49
+
50
+ \author{Dat Quoc Nguyen$^1$ \and Anh Tuan Nguyen$^{2,}$\thanks{\ \ Work done during internship at VinAI Research.} \\
51
+ $^1$VinAI Research, Vietnam; $^2$NVIDIA, USA\\
52
+ \tt{\normalsize v.datnq9@vinai.io, tuananhn@nvidia.com}}
53
+
54
+ \date{}
55
+
56
+ \begin{document}
57
+ \maketitle
58
+ \begin{abstract}
59
+ We present \textbf{PhoBERT} with two versions---PhoBERT\textsubscript{base} and PhoBERT\textsubscript{large}---the \emph{first} public large-scale monolingual language models pre-trained for Vietnamese. Experimental results show that PhoBERT consistently outperforms the recent best pre-trained multilingual model XLM-R \citep{conneau2019unsupervised} and improves the state-of-the-art in multiple Vietnamese-specific NLP tasks including Part-of-speech tagging, Dependency parsing, Named-entity recognition and Natural language inference. We release PhoBERT to facilitate future research and downstream applications for Vietnamese NLP. Our PhoBERT models are available at: \url{https://github.com/VinAIResearch/PhoBERT}.
60
+ \end{abstract}
61
+
62
+ \section{Introduction}\label{sec:intro}
63
+
64
+
65
+ Pre-trained language models, especially BERT \citep{devlin-etal-2019-bert}---the Bidirectional Encoder Representations from Transformers \citep{NIPS2017_7181}, have recently become extremely popular and helped to produce significant improvement gains for various NLP tasks. The success of pre-trained BERT and its variants has largely been limited to the English language. For other languages, one could retrain a language-specific model using the BERT architecture \citep{abs-1906-08101,vries2019bertje,vu-xuan-etal-2019-etnlp,2019arXiv191103894M} or employ existing pre-trained multilingual BERT-based models \citep{devlin-etal-2019-bert,NIPS2019_8928,conneau2019unsupervised}.
66
+
67
+ In terms of Vietnamese language modeling, to the best of our knowledge, there are two main concerns as follows:
68
+
69
+ \begin{itemize}[leftmargin=*]
70
+ \setlength\itemsep{-1pt}
71
+ \item The Vietnamese Wikipedia corpus is the only data used to train monolingual language models \citep{vu-xuan-etal-2019-etnlp}, and it also is the only Vietnamese dataset which is included in the pre-training data used by all multilingual language models except XLM-R. It is worth noting that Wikipedia data is not representative of a general language use, and the Vietnamese Wikipedia data is relatively small (1GB in size uncompressed), while pre-trained language models can be significantly improved by using more pre-training data \cite{RoBERTa}.
72
+
73
+ \item All publicly released monolingual and multilingual BERT-based language models are not aware of the difference between Vietnamese syllables and word tokens. This ambiguity comes from the fact that the white space is also used to separate syllables that constitute words when written in Vietnamese.\footnote{\newcite{DinhQuangThang2008} show that 85\% of Vietnamese word types are composed of at least two syllables.}
74
+ For example, a 6-syllable written text ``Tôi là một nghiên cứu viên'' (I am a researcher) forms 4 words ``Tôi\textsubscript{I} là\textsubscript{am} một\textsubscript{a} nghiên\_cứu\_viên\textsubscript{researcher}''. \\
75
+ Without doing a pre-process step of Vietnamese word segmentation, those models directly apply Byte-Pair encoding (BPE) methods \citep{sennrich-etal-2016-neural,kudo-richardson-2018-sentencepiece} to the syllable-level Vietnamese pre-training data.\footnote{Although performing word segmentation before applying BPE on the Vietnamese Wikipedia corpus, ETNLP \citep{vu-xuan-etal-2019-etnlp} in fact {does not publicly release} any pre-trained BERT-based language model (\url{https://github.com/vietnlp/etnlp}). In particular, \newcite{vu-xuan-etal-2019-etnlp} release a set of 15K BERT-based word embeddings specialized only for the Vietnamese NER task.}
76
+ Intuitively, for word-level Vietnamese NLP tasks, those models pre-trained on syllable-level data might not perform as good as language models pre-trained on word-level data.
77
+
78
+ \end{itemize}
79
+
80
+
81
+ To handle the two concerns above, we train the {first} large-scale monolingual BERT-based ``base'' and ``large'' models using a 20GB \textit{word-level} Vietnamese corpus.
82
+ We evaluate our models on four downstream Vietnamese NLP tasks: the common word-level ones of Part-of-speech (POS) tagging, Dependency parsing and Named-entity recognition (NER), and a language understanding task of Natural language inference (NLI) which can be formulated as either a syllable- or word-level task. Experimental results show that our models obtain state-of-the-art (SOTA) results on all these tasks.
83
+ Our contributions are summarized as follows:
84
+
85
+ \begin{itemize}[leftmargin=*]
86
+ \setlength\itemsep{-1pt}
87
+ \item We present the \textit{first} large-scale monolingual language models pre-trained for Vietnamese.
88
+
89
+ \item Our models help produce SOTA performances on four downstream tasks of POS tagging, Dependency parsing, NER and NLI, thus showing the effectiveness of large-scale BERT-based monolingual language models for Vietnamese.
90
+
91
+ \item To the best of our knowledge, we also perform the \textit{first} set of experiments to compare monolingual language models with the recent best multilingual model XLM-R in multiple (i.e. four) different language-specific tasks. The experiments show that our models outperform XLM-R on all these tasks, thus convincingly confirming that dedicated language-specific models still outperform multilingual ones.
92
+
93
+ \item We publicly release our models under the name PhoBERT which can be used with \texttt{fairseq} \citep{ott2019fairseq} and \texttt{transformers} \cite{Wolf2019HuggingFacesTS}. We hope that PhoBERT can serve as a strong baseline for future Vietnamese NLP research and applications.
94
+ \end{itemize}
95
+
96
+
97
+
98
+
99
+
100
+
101
+
102
+
103
+ \section{PhoBERT}
104
+
105
+ This section outlines the architecture and describes the pre-training data and optimization setup that we use for PhoBERT.
106
+
107
+ \vspace{3pt}
108
+
109
+ \noindent\textbf{Architecture:}\ Our PhoBERT has two versions, PhoBERT\textsubscript{base} and PhoBERT\textsubscript{large}, using the same architectures of BERT\textsubscript{base} and BERT\textsubscript{large}, respectively. PhoBERT pre-training approach is based on RoBERTa \citep{RoBERTa} which optimizes the BERT pre-training procedure for more robust performance.
110
+
111
+ \vspace{3pt}
112
+
113
+ \noindent\textbf{Pre-training data:}\ To handle the first concern mentioned in Section \ref{sec:intro}, we use a 20GB pre-training dataset of uncompressed texts. This dataset is a concatenation of two corpora: (i) the first one is the Vietnamese Wikipedia corpus ($\sim$1GB), and (ii) the second corpus ($\sim$19GB) is generated by removing similar articles and duplication from a 50GB Vietnamese news corpus.\footnote{\url{https://github.com/binhvq/news-corpus}, crawled from a wide range of news websites and topics.} To solve the second concern,
114
+ we employ RDRSegmenter \citep{nguyen-etal-2018-fast} from VnCoreNLP \citep{vu-etal-2018-vncorenlp} to perform word and sentence segmentation on the pre-training dataset, resulting in $\sim$145M word-segmented sentences ($\sim$3B word tokens). Different from RoBERTa, we then apply \texttt{fastBPE} \citep{sennrich-etal-2016-neural} to segment these sentences with subword units, using a vocabulary of 64K subword types. On average there are 24.4 subword tokens per sentence.
115
+
116
+ \vspace{3pt}
117
+
118
+ \noindent\textbf{Optimization:}\ We employ the RoBERTa implementation in \texttt{fairseq} \citep{ott2019fairseq}. We set a maximum length at 256 subword tokens, thus generating 145M $\times$ 24.4 / 256 $\approx$ 13.8M sentence blocks. Following \newcite{RoBERTa}, we optimize the models using Adam \citep{KingmaB14}. We use a batch size of 1024 across 4 V100 GPUs (16GB each) and a peak learning rate of 0.0004 for PhoBERT\textsubscript{base}, and a batch size of 512 and a peak learning rate of 0.0002 for PhoBERT\textsubscript{large}. We run for 40 epochs (here, the learning rate is warmed up for 2 epochs), thus resulting in 13.8M $\times$ 40 / 1024 $\approx$ 540K training steps for PhoBERT\textsubscript{base} and 1.08M training steps for PhoBERT\textsubscript{large}. We pre-train PhoBERT\textsubscript{base} during 3 weeks, and then PhoBERT\textsubscript{large} during 5 weeks.
119
+
120
+
121
+ \begin{table}[!t]
122
+ \centering
123
+ \begin{tabular}{l|l|l|l}
124
+ \hline
125
+ \textbf{Task} & \textbf{\#training} & \textbf{\#valid} & \textbf{\#test} \\
126
+ \hline
127
+
128
+ POS tagging$^\dagger$ & 27,000 & 870 & 2,120 \\
129
+ Dep. parsing$^\dagger$ & 8,977 & 200 & 1,020 \\
130
+ NER$^\dagger$ & 14,861 & 2,000 & 2,831\\
131
+ NLI$^\ddagger$ & 392,702 & 2,490 & 5,010\\
132
+ \hline
133
+ \end{tabular}
134
+ \caption{Statistics of the downstream task datasets. ``\#training'', ``\#valid'' and ``\#test'' denote the size of the training, validation and test sets, respectively. $\dagger$ and $\ddagger$ refer to the dataset size as the numbers of sentences and sentence pairs, respectively.}
135
+ \label{tab:data}
136
+ \end{table}
137
+
138
+
139
+ \begin{table*}[!ht]
140
+ \centering
141
+ \resizebox{15.5cm}{!}{
142
+ %\setlength{\tabcolsep}{0.3em}
143
+ \begin{tabular}{l|l|l|l}
144
+ \hline
145
+ \multicolumn{2}{c|}{\textbf{POS tagging} (word-level)} & \multicolumn{2}{c}{\textbf{Dependency parsing} (word-level)}\\
146
+ \hline
147
+ Model & Acc. & Model & LAS / UAS \\
148
+ \hline
149
+ RDRPOSTagger \citep{nguyen-etal-2014-rdrpostagger} [$\clubsuit$] & 95.1 & \_ & \_ \\
150
+
151
+ BiLSTM-CNN-CRF \citep{ma-hovy-2016-end} [$\clubsuit$] & 95.4 & VnCoreNLP-DEP \citep{vu-etal-2018-vncorenlp} [$\bigstar$] & 71.38 / 77.35 \\
152
+
153
+
154
+ VnCoreNLP-POS \citep{nguyen-etal-2017-word} [$\clubsuit$] & 95.9 &jPTDP-v2 [$\bigstar$] & 73.12 / 79.63 \\
155
+
156
+ jPTDP-v2 \citep{nguyen-verspoor-2018-improved} [$\bigstar$] & 95.7 &jointWPD [$\bigstar$] & 73.90 / 80.12 \\
157
+
158
+ jointWPD \citep{nguyen-2019-neural} [$\bigstar$] & 96.0 & Biaffine \citep{DozatM17} [$\bigstar$] & 74.99 / 81.19 \\
159
+
160
+ XLM-R\textsubscript{base} (our result) & 96.2 & Biaffine w/ XLM-R\textsubscript{base} (our result) & 76.46 / 83.10 \\
161
+
162
+ XLM-R\textsubscript{large} (our result) & 96.3 & Biaffine w/ XLM-R\textsubscript{large} (our result) & 75.87 / 82.70 \\
163
+
164
+ \hline
165
+ PhoBERT\textsubscript{base} & \underline{96.7} & Biaffine w/ PhoBERT\textsubscript{base} & \textbf{78.77} / \textbf{85.22} \\
166
+
167
+ PhoBERT\textsubscript{large} & \textbf{96.8} & Biaffine w/ PhoBERT\textsubscript{large} & \underline{77.85} / \underline{84.32} \\
168
+ \hline
169
+ \end{tabular}
170
+ }
171
+ \caption{Performance scores (in \%) on the POS tagging and Dependency parsing test sets. ``Acc.'', ``LAS'' and ``UAS'' abbreviate the Accuracy, the Labeled Attachment Score and the Unlabeled Attachment Score, respectively (here, all these evaluation metrics are computed on all word tokens, including punctuation).
172
+ [$\clubsuit$] and [$\bigstar$] denote
173
+ results reported by \newcite{nguyen-etal-2017-word} and \newcite{nguyen-2019-neural}, respectively.}
174
+ \label{tab:posdep}
175
+ \end{table*}
176
+
177
+
178
+
179
+ \section{Experimental setup}
180
+
181
+ We evaluate the performance of PhoBERT on four downstream Vietnamese NLP tasks: POS tagging, Dependency parsing, NER and NLI.
182
+
183
+
184
+ \subsubsection*{Downstream task datasets}
185
+
186
+ Table \ref{tab:data} presents the statistics of the experimental datasets that we employ for downstream task evaluation.
187
+ For POS tagging, Dependency parsing and NER, we follow the VnCoreNLP setup \citep{vu-etal-2018-vncorenlp}, using standard benchmarks of the VLSP 2013 POS tagging dataset,\footnote{\url{https://vlsp.org.vn/vlsp2013/eval}} the VnDT dependency treebank v1.1 \cite{Nguyen2014NLDB} with POS tags predicted by VnCoreNLP and the VLSP 2016 NER dataset \citep{JCC13161}.
188
+
189
+ For NLI, we use the manually-constructed Vietnamese validation and test sets from the cross-lingual NLI (XNLI) corpus v1.0 \citep{conneau-etal-2018-xnli} where the Vietnamese training set is released as a machine-translated version of the corresponding English training set \citep{N18-1101}.
190
+ Unlike the POS tagging, Dependency parsing and NER datasets which provide the gold word segmentation, for NLI, we employ RDRSegmenter to segment the text into words before applying BPE to produce subwords from word tokens.
191
+
192
+ \subsubsection*{Fine-tuning}
193
+
194
+ Following \newcite{devlin-etal-2019-bert}, for POS tagging and NER, we append a linear prediction layer on top of the PhoBERT architecture (i.e. to the last Transformer layer of PhoBERT) w.r.t. the first subword of each word token.\footnote{In our preliminary experiments, using the average of contextualized embeddings of subword tokens of each word to represent the word produces slightly lower performance than using the contextualized embedding of the first subword.}
195
+ For dependency parsing, following \newcite{nguyen-2019-neural}, we employ a reimplementation of the state-of-the-art Biaffine dependency parser \citep{DozatM17} from \newcite{ma-etal-2018-stack} with default optimal hyper-parameters. %\footnote{\url{https://github.com/XuezheMax/NeuroNLP2}}
196
+ We then extend this parser by replacing the pre-trained word embedding of each word in an input sentence by the corresponding contextualized embedding (from the last layer) computed for the first subword token of the word.
197
+
198
+ For POS tagging, NER and NLI, we employ \texttt{transformers} \cite{Wolf2019HuggingFacesTS} to fine-tune PhoBERT for each task and each dataset independently. We use AdamW \citep{loshchilov2018decoupled} with a fixed learning rate of 1.e-5 and a batch size of 32 \citep{RoBERTa}. We fine-tune in 30 training epochs, evaluate the task performance after each epoch on the validation set (here, early stopping is applied when there is no improvement after 5 continuous epochs), and then select the best model checkpoint to report the final result on the test set (note that each of our scores is an average over 5 runs with different random seeds). %Section \ref{sec:results} shows that using this relatively straightforward fine-tuning manner can lead to SOTA results. %Note that we might boost our downstream task performances even further by doing a more careful hyper-parameter tuning.
199
+
200
+
201
+
202
+
203
+ \begin{table*}[!ht]
204
+ \centering
205
+ \resizebox{15.5cm}{!}{
206
+ %\setlength{\tabcolsep}{0.3em}
207
+ \begin{tabular}{l|l|l|l}
208
+ \hline
209
+ \multicolumn{2}{c|}{\textbf{NER} (word-level)} & \multicolumn{2}{c}{\textbf{NLI} (syllable- or word-level)} \\
210
+
211
+ \hline
212
+ Model & F\textsubscript{1} & Model & Acc. \\
213
+ \hline
214
+ BiLSTM-CNN-CRF [$\blacklozenge$] & 88.3 & \_ & \_\\
215
+
216
+ VnCoreNLP-NER \citep{vu-etal-2018-vncorenlp} [$\blacklozenge$] & 88.6 & BiLSTM-max \citep{conneau-etal-2018-xnli} & 66.4 \\
217
+
218
+
219
+ VNER \citep{8713740} & 89.6 & mBiLSTM \citep{ArtetxeS19} & 72.0 \\
220
+
221
+ BiLSTM-CNN-CRF + ETNLP [$\spadesuit$] & 91.1 & multilingual BERT \citep{devlin-etal-2019-bert} [$\blacksquare$] & 69.5 \\
222
+
223
+ VnCoreNLP-NER + ETNLP [$\spadesuit$] & 91.3 & XLM\textsubscript{MLM+TLM} \citep{NIPS2019_8928} & 76.6 \\
224
+
225
+ XLM-R\textsubscript{base} (our result) & 92.0 & XLM-R\textsubscript{base} \citep{conneau2019unsupervised} & {75.4} \\
226
+
227
+ XLM-R\textsubscript{large} (our result) & 92.8 & XLM-R\textsubscript{large} \citep{conneau2019unsupervised} & \underline{79.7} \\
228
+
229
+ \hline
230
+ PhoBERT\textsubscript{base}& \underline{93.6} & PhoBERT\textsubscript{base}& {78.5} \\
231
+
232
+ PhoBERT\textsubscript{large}& \textbf{94.7} & PhoBERT\textsubscript{large}& \textbf{80.0} \\
233
+ \hline
234
+
235
+
236
+ \end{tabular}
237
+ }
238
+ \caption{Performance scores (in \%) on the NER and NLI test sets.
239
+ [$\blacklozenge$], [$\spadesuit$] and [$\blacksquare$] denote
240
+ results reported by \newcite{vu-etal-2018-vncorenlp}, \newcite{vu-xuan-etal-2019-etnlp} and \newcite{wu-dredze-2019-beto}, respectively.
241
+ %``mBiLSTM'' denotes a BiLSTM-based multilingual embedding model.
242
+ Note that there are higher Vietnamese NLI results reported for XLM-R when fine-tuning on the concatenation of all 15 training datasets from the XNLI corpus (i.e. TRANSLATE-TRAIN-ALL: 79.5\% for XLM-R\textsubscript{base} and 83.4\% XLM-R\textsubscript{large}). However, those results might not be comparable as we only use the monolingual Vietnamese training data for fine-tuning. }
243
+ \label{tab:nernli}
244
+ \end{table*}
245
+
246
+ \section{Experimental results}\label{sec:results}
247
+
248
+ \subsubsection*{Main results}
249
+
250
+ Tables \ref{tab:posdep} and \ref{tab:nernli} compare PhoBERT scores with the previous highest reported results, using the same experimental setup. It is clear that our PhoBERT helps produce new SOTA performance results for all four downstream tasks.
251
+
252
+ For \underline{POS tagging}, the neural model jointWPD for joint POS tagging and dependency parsing \citep{nguyen-2019-neural} and the feature-based model VnCoreNLP-POS \citep{nguyen-etal-2017-word} are the two previous SOTA models, obtaining accuracies at about 96.0\%. PhoBERT obtains 0.8\% absolute higher accuracy than these two models.
253
+
254
+ For \underline{Dependency parsing}, the previous highest parsing scores LAS and UAS are obtained by the Biaffine parser at 75.0\% and 81.2\%, respectively. PhoBERT helps boost the Biaffine parser with about 4\% absolute improvement, achieving a LAS at 78.8\% and a UAS at 85.2\%.
255
+
256
+
257
+ For \underline{NER}, PhoBERT\textsubscript{large} produces 1.1 points higher F\textsubscript{1} than PhoBERT\textsubscript{base}. In addition, PhoBERT\textsubscript{base} obtains 2+ points higher than the previous SOTA feature- and neural network-based models VnCoreNLP-NER \citep{vu-etal-2018-vncorenlp} and BiLSTM-CNN-CRF \citep{ma-hovy-2016-end} which are trained with the set of 15K BERT-based ETNLP word embeddings \citep{vu-xuan-etal-2019-etnlp}.
258
+
259
+ For \underline{NLI},
260
+ PhoBERT outperforms the multilingual BERT \citep{devlin-etal-2019-bert} and the BERT-based cross-lingual model with a new translation language modeling objective XLM\textsubscript{MLM+TLM} \citep{NIPS2019_8928} by large margins. PhoBERT also performs better than the recent best pre-trained multilingual model XLM-R but using far fewer parameters than XLM-R: 135M (PhoBERT\textsubscript{base}) vs. 250M (XLM-R\textsubscript{base}); 370M (PhoBERT\textsubscript{large}) vs. 560M (XLM-R\textsubscript{large}).
261
+
262
+
263
+
264
+
265
+
266
+
267
+
268
+ \subsubsection*{Discussion}
269
+
270
+ We find that PhoBERT\textsubscript{large} achieves 0.9\% lower dependency parsing scores than PhoBERT\textsubscript{base}. One possible reason is that the last Transformer layer in the BERT architecture might not be the optimal one which encodes the richest information of syntactic structures \cite{hewitt-manning-2019-structural,jawahar-etal-2019-bert}. Future work will study which PhoBERT's Transformer layer contains richer syntactic information by evaluating the Vietnamese parsing performance from each layer.
271
+
272
+ Using more pre-training data can significantly improve the quality of the pre-trained language models \cite{RoBERTa}. Thus it is not surprising that PhoBERT helps produce better performance than ETNLP on NER, and the multilingual BERT and XLM\textsubscript{MLM+TLM} on NLI (here, PhoBERT uses 20GB of Vietnamese texts while those models employ the 1GB Vietnamese Wikipedia corpus).
273
+
274
+ Following the fine-tuning approach that we use for PhoBERT, we carefully fine-tune XLM-R for the remaining Vietnamese POS tagging, Dependency parsing and NER tasks (here, it is applied to the first sub-syllable token of the first syllable of each word).\footnote{For fine-tuning XLM-R, we use a grid search on the validation set to select the AdamW learning rate from \{5e-6, 1e-5, 2e-5, 4e-5\} and the batch size from \{16, 32\}.}
275
+ Tables \ref{tab:posdep} and \ref{tab:nernli} show that our PhoBERT also does better than XLM-R on these three word-level tasks.
276
+ It is worth noting that XLM-R uses a 2.5TB pre-training corpus which contains 137GB of Vietnamese texts (i.e. about 137\ /\ 20 $\approx$ 7 times bigger than our pre-training corpus).
277
+ Recall that PhoBERT performs Vietnamese word segmentation to segment syllable-level sentences into word tokens before applying BPE to segment the word-segmented sentences into subword units, while XLM-R directly applies BPE to the syllable-level Vietnamese pre-training sentences.
278
+ This reconfirms that the dedicated language-specific models still outperform the multilingual ones \citep{2019arXiv191103894M}.\footnote{Note that \newcite{2019arXiv191103894M} only compare their model CamemBERT with XLM-R on the French NLI task.}
279
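The footnote above fixes the XLM-R fine-tuning hyper-parameters by a grid search on the validation set over AdamW learning rates \{5e-6, 1e-5, 2e-5, 4e-5\} and batch sizes \{16, 32\}. As a minimal sketch (not the authors' code), the selection loop just tries every learning-rate/batch-size pair and keeps the best-scoring configuration; `fake_eval` below is a hypothetical stand-in for an actual fine-tune-and-evaluate run:

```python
from itertools import product

# Hyper-parameter grid from the footnote above.
LEARNING_RATES = [5e-6, 1e-5, 2e-5, 4e-5]
BATCH_SIZES = [16, 32]

def select_best(evaluate):
    """Try every (lr, batch_size) pair with the supplied `evaluate`
    function (which returns a validation score) and keep the best."""
    best_score, best_cfg = float("-inf"), None
    for lr, bs in product(LEARNING_RATES, BATCH_SIZES):
        score = evaluate(lr, bs)
        if score > best_score:
            best_score, best_cfg = score, (lr, bs)
    return best_cfg, best_score

# Stand-in evaluation: a real run would fine-tune the model for each
# configuration and score the development set instead.
def fake_eval(lr, bs):
    return -abs(lr - 2e-5) * 1e4 - abs(bs - 32) / 100

best_cfg, best_score = select_best(fake_eval)
```

The loop is exhaustive (8 configurations here), which is practical only because the grid is small.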
+
280
+
281
+
282
+
283
+
284
+
285
+
286
+ \section{Conclusion}
287
+
288
+ In this paper, we have presented the first large-scale monolingual PhoBERT language models pre-trained for Vietnamese. We demonstrate the usefulness of PhoBERT by showing that PhoBERT performs better than the recent best multilingual model XLM-R and helps produce the SOTA performances for four downstream Vietnamese NLP tasks of POS tagging, Dependency parsing, NER and NLI.
289
+ By publicly releasing PhoBERT models, %\footnote{\url{https://github.com/VinAIResearch/PhoBERT}}
290
+ we hope that they can foster future research and applications in Vietnamese NLP. %Our PhoBERT and its usage are available at: \url{https://github.com/VinAIResearch/PhoBERT}.
291
+
292
+ {%\footnotesize
293
+ \bibliographystyle{acl_natbib}
294
+ \bibliography{REFs}
295
+ }
296
+
297
+
298
+
299
+
300
+
301
+ \end{document}
references/2020.emnlp.nguyen/source/acl_natbib.bst ADDED
@@ -0,0 +1,1975 @@
1
+ %%% acl_natbib.bst
2
+ %%% Modification of BibTeX style file acl_natbib_nourl.bst
3
+ %%% ... by urlbst, version 0.7 (marked with "% urlbst")
4
+ %%% See <http://purl.org/nxg/dist/urlbst>
5
+ %%% Added webpage entry type, and url and lastchecked fields.
6
+ %%% Added eprint support.
7
+ %%% Added DOI support.
8
+ %%% Added PUBMED support.
9
+ %%% Added hyperref support.
10
+ %%% Original headers follow...
11
+
12
+ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
13
+ %
14
+ % BibTeX style file acl_natbib_nourl.bst
15
+ %
16
+ % intended as input to urlbst script
17
+ % $ ./urlbst --hyperref --inlinelinks acl_natbib_nourl.bst > acl_natbib.bst
18
+ %
19
+ % adapted from compling.bst
20
+ % in order to mimic the style files for ACL conferences prior to 2017
21
+ % by making the following three changes:
22
+ % - for @incollection, page numbers now follow volume title.
23
+ % - for @inproceedings, address now follows conference name.
24
+ % (address is intended as location of conference,
25
+ % not address of publisher.)
26
+ % - for papers with three authors, use et al. in citation
27
+ % Dan Gildea 2017/06/08
28
+ % - fixed a bug with format.chapter - error given if chapter is empty
29
+ % with inbook.
30
+ % Shay Cohen 2018/02/16
31
+
32
+ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
33
+ %
34
+ % BibTeX style file compling.bst
35
+ %
36
+ % Intended for the journal Computational Linguistics (ACL/MIT Press)
37
+ % Created by Ron Artstein on 2005/08/22
38
+ % For use with <natbib.sty> for author-year citations.
39
+ %
40
+ % I created this file in order to allow submissions to the journal
41
+ % Computational Linguistics using the <natbib> package for author-year
42
+ % citations, which offers a lot more flexibility than <fullname>, CL's
43
+ % official citation package. This file adheres strictly to the official
44
+ % style guide available from the MIT Press:
45
+ %
46
+ % http://mitpress.mit.edu/journals/coli/compling_style.pdf
47
+ %
48
+ % This includes all the various quirks of the style guide, for example:
49
+ % - a chapter from a monograph (@inbook) has no page numbers.
50
+ % - an article from an edited volume (@incollection) has page numbers
51
+ % after the publisher and address.
52
+ % - an article from a proceedings volume (@inproceedings) has page
53
+ % numbers before the publisher and address.
54
+ %
55
+ % Where the style guide was inconsistent or not specific enough I
56
+ % looked at actual published articles and exercised my own judgment.
57
+ % I noticed two inconsistencies in the style guide:
58
+ %
59
+ % - The style guide gives one example of an article from an edited
60
+ % volume with the editor's name spelled out in full, and another
61
+ % with the editors' names abbreviated. I chose to accept the first
62
+ % one as correct, since the style guide generally shuns abbreviations,
63
+ % and editors' names are also spelled out in some recently published
64
+ % articles.
65
+ %
66
+ % - The style guide gives one example of a reference where the word
67
+ % "and" between two authors is preceded by a comma. This is most
68
+ % likely a typo, since in all other cases with just two authors or
69
+ % editors there is no comma before the word "and".
70
+ %
71
+ % One case where the style guide is not being specific is the placement
72
+ % of the edition number, for which no example is given. I chose to put
73
+ % it immediately after the title, which I (subjectively) find natural,
74
+ % and is also the place of the edition in a few recently published
75
+ % articles.
76
+ %
77
+ % This file correctly reproduces all of the examples in the official
78
+ % style guide, except for the two inconsistencies noted above. I even
79
+ % managed to get it to correctly format the proceedings example which
80
+ % has an organization, a publisher, and two addresses (the conference
81
+ % location and the publisher's address), though I cheated a bit by
82
+ % putting the conference location and month as part of the title field;
83
+ % I feel that in this case the conference location and month can be
84
+ % considered as part of the title, and that adding a location field
85
+ % is not justified. Note also that a location field is not standard,
86
+ % so entries made with this field would not port nicely to other styles.
87
+ % However, if authors feel that there's a need for a location field
88
+ % then tell me and I'll see what I can do.
89
+ %
90
+ % The file also produces to my satisfaction all the bibliographical
91
+ % entries in my recent (joint) submission to CL (this was the original
92
+ % motivation for creating the file). I also tested it by running it
93
+ % on a larger set of entries and eyeballing the results. There may of
94
+ % course still be errors, especially with combinations of fields that
95
+ % are not that common, or with cross-references (which I seldom use).
96
+ % If you find such errors please write to me.
97
+ %
98
+ % I hope people find this file useful. Please email me with comments
99
+ % and suggestions.
100
+ %
101
+ % Ron Artstein
102
+ % artstein [at] essex.ac.uk
103
+ % August 22, 2005.
104
+ %
105
+ % Some technical notes.
106
+ %
107
+ % This file is based on a file generated with the package <custom-bib>
108
+ % by Patrick W. Daly (see selected options below), which was then
109
+ % manually customized to conform with certain CL requirements which
110
+ % cannot be met by <custom-bib>. Departures from the generated file
111
+ % include:
112
+ %
113
+ % Function inbook: moved publisher and address to the end; moved
114
+ % edition after title; replaced function format.chapter.pages by
115
+ % new function format.chapter to output chapter without pages.
116
+ %
117
+ % Function inproceedings: moved publisher and address to the end;
118
+ % replaced function format.in.ed.booktitle by new function
119
+ % format.in.booktitle to output the proceedings title without
120
+ % the editor.
121
+ %
122
+ % Functions book, incollection, manual: moved edition after title.
123
+ %
124
+ % Function mastersthesis: formatted title as for articles (unlike
125
+ % phdthesis which is formatted as book) and added month.
126
+ %
127
+ % Function proceedings: added new.sentence between organization and
128
+ % publisher when both are present.
129
+ %
130
+ % Function format.lab.names: modified so that it gives all the
131
+ % authors' surnames for in-text citations for one, two and three
132
+ % authors and only uses "et. al" for works with four authors or more
133
+ % (thanks to Ken Shan for convincing me to go through the trouble of
134
+ % modifying this function rather than using unreliable hacks).
135
+ %
136
+ % Changes:
137
+ %
138
+ % 2006-10-27: Changed function reverse.pass so that the extra label is
139
+ % enclosed in parentheses when the year field ends in an uppercase or
140
+ % lowercase letter (change modeled after Uli Sauerland's modification
141
+ % of nals.bst). RA.
142
+ %
143
+ %
144
+ % The preamble of the generated file begins below:
145
+ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
146
+ %%
147
+ %% This is file `compling.bst',
148
+ %% generated with the docstrip utility.
149
+ %%
150
+ %% The original source files were:
151
+ %%
152
+ %% merlin.mbs (with options: `ay,nat,vonx,nm-revv1,jnrlst,keyxyr,blkyear,dt-beg,yr-per,note-yr,num-xser,pre-pub,xedn,nfss')
153
+ %% ----------------------------------------
154
+ %% *** Intended for the journal Computational Linguistics ***
155
+ %%
156
+ %% Copyright 1994-2002 Patrick W Daly
157
+ % ===============================================================
158
+ % IMPORTANT NOTICE:
159
+ % This bibliographic style (bst) file has been generated from one or
160
+ % more master bibliographic style (mbs) files, listed above.
161
+ %
162
+ % This generated file can be redistributed and/or modified under the terms
163
+ % of the LaTeX Project Public License Distributed from CTAN
164
+ % archives in directory macros/latex/base/lppl.txt; either
165
+ % version 1 of the License, or any later version.
166
+ % ===============================================================
167
+ % Name and version information of the main mbs file:
168
+ % \ProvidesFile{merlin.mbs}[2002/10/21 4.05 (PWD, AO, DPC)]
169
+ % For use with BibTeX version 0.99a or later
170
+ %-------------------------------------------------------------------
171
+ % This bibliography style file is intended for texts in ENGLISH
172
+ % This is an author-year citation style bibliography. As such, it is
173
+ % non-standard LaTeX, and requires a special package file to function properly.
174
+ % Such a package is natbib.sty by Patrick W. Daly
175
+ % The form of the \bibitem entries is
176
+ % \bibitem[Jones et al.(1990)]{key}...
177
+ % \bibitem[Jones et al.(1990)Jones, Baker, and Smith]{key}...
178
+ % The essential feature is that the label (the part in brackets) consists
179
+ % of the author names, as they should appear in the citation, with the year
180
+ % in parentheses following. There must be no space before the opening
181
+ % parenthesis!
182
+ % With natbib v5.3, a full list of authors may also follow the year.
183
+ % In natbib.sty, it is possible to define the type of enclosures that is
184
+ % really wanted (brackets or parentheses), but in either case, there must
185
+ % be parentheses in the label.
186
+ % The \cite command functions as follows:
187
+ % \citet{key} ==>> Jones et al. (1990)
188
+ % \citet*{key} ==>> Jones, Baker, and Smith (1990)
189
+ % \citep{key} ==>> (Jones et al., 1990)
190
+ % \citep*{key} ==>> (Jones, Baker, and Smith, 1990)
191
+ % \citep[chap. 2]{key} ==>> (Jones et al., 1990, chap. 2)
192
+ % \citep[e.g.][]{key} ==>> (e.g. Jones et al., 1990)
193
+ % \citep[e.g.][p. 32]{key} ==>> (e.g. Jones et al., p. 32)
194
+ % \citeauthor{key} ==>> Jones et al.
195
+ % \citeauthor*{key} ==>> Jones, Baker, and Smith
196
+ % \citeyear{key} ==>> 1990
197
+ %---------------------------------------------------------------------
198
+
199
+ ENTRY
200
+ { address
201
+ author
202
+ booktitle
203
+ chapter
204
+ edition
205
+ editor
206
+ howpublished
207
+ institution
208
+ journal
209
+ key
210
+ month
211
+ note
212
+ number
213
+ organization
214
+ pages
215
+ publisher
216
+ school
217
+ series
218
+ title
219
+ type
220
+ volume
221
+ year
222
+ eprint % urlbst
223
+ doi % urlbst
224
+ pubmed % urlbst
225
+ url % urlbst
226
+ lastchecked % urlbst
227
+ }
228
+ {}
229
+ { label extra.label sort.label short.list }
230
+ INTEGERS { output.state before.all mid.sentence after.sentence after.block }
231
+ % urlbst...
232
+ % urlbst constants and state variables
233
+ STRINGS { urlintro
234
+ eprinturl eprintprefix doiprefix doiurl pubmedprefix pubmedurl
235
+ citedstring onlinestring linktextstring
236
+ openinlinelink closeinlinelink }
237
+ INTEGERS { hrefform inlinelinks makeinlinelink
238
+ addeprints adddoiresolver addpubmedresolver }
239
+ FUNCTION {init.urlbst.variables}
240
+ {
241
+ % The following constants may be adjusted by hand, if desired
242
+
243
+ % The first set allow you to enable or disable certain functionality.
244
+ #1 'addeprints := % 0=no eprints; 1=include eprints
245
+ #1 'adddoiresolver := % 0=no DOI resolver; 1=include it
246
+ #1 'addpubmedresolver := % 0=no PUBMED resolver; 1=include it
247
+ #2 'hrefform := % 0=no crossrefs; 1=hypertex xrefs; 2=hyperref refs
248
+ #1 'inlinelinks := % 0=URLs explicit; 1=URLs attached to titles
249
+
250
+ % String constants, which you _might_ want to tweak.
251
+ "URL: " 'urlintro := % prefix before URL; typically "Available from:" or "URL":
252
+ "online" 'onlinestring := % indication that resource is online; typically "online"
253
+ "cited " 'citedstring := % indicator of citation date; typically "cited "
254
+ "[link]" 'linktextstring := % dummy link text; typically "[link]"
255
+ "http://arxiv.org/abs/" 'eprinturl := % prefix to make URL from eprint ref
256
+ "arXiv:" 'eprintprefix := % text prefix printed before eprint ref; typically "arXiv:"
257
+ "https://doi.org/" 'doiurl := % prefix to make URL from DOI
258
+ "doi:" 'doiprefix := % text prefix printed before DOI ref; typically "doi:"
259
+ "http://www.ncbi.nlm.nih.gov/pubmed/" 'pubmedurl := % prefix to make URL from PUBMED
260
+ "PMID:" 'pubmedprefix := % text prefix printed before PUBMED ref; typically "PMID:"
261
+
262
+ % The following are internal state variables, not configuration constants,
263
+ % so they shouldn't be fiddled with.
264
+ #0 'makeinlinelink := % state variable managed by possibly.setup.inlinelink
265
+ "" 'openinlinelink := % ditto
266
+ "" 'closeinlinelink := % ditto
267
+ }
268
+ INTEGERS {
269
+ bracket.state
270
+ outside.brackets
271
+ open.brackets
272
+ within.brackets
273
+ close.brackets
274
+ }
275
+ % ...urlbst to here
276
+ FUNCTION {init.state.consts}
277
+ { #0 'outside.brackets := % urlbst...
278
+ #1 'open.brackets :=
279
+ #2 'within.brackets :=
280
+ #3 'close.brackets := % ...urlbst to here
281
+
282
+ #0 'before.all :=
283
+ #1 'mid.sentence :=
284
+ #2 'after.sentence :=
285
+ #3 'after.block :=
286
+ }
287
+ STRINGS { s t}
288
+ % urlbst
289
+ FUNCTION {output.nonnull.original}
290
+ { 's :=
291
+ output.state mid.sentence =
292
+ { ", " * write$ }
293
+ { output.state after.block =
294
+ { add.period$ write$
295
+ newline$
296
+ "\newblock " write$
297
+ }
298
+ { output.state before.all =
299
+ 'write$
300
+ { add.period$ " " * write$ }
301
+ if$
302
+ }
303
+ if$
304
+ mid.sentence 'output.state :=
305
+ }
306
+ if$
307
+ s
308
+ }
309
+
310
+ % urlbst...
311
+ % The following three functions are for handling inlinelink. They wrap
312
+ % a block of text which is potentially output with write$ by multiple
313
+ % other functions, so we don't know the content a priori.
314
+ % They communicate between each other using the variables makeinlinelink
315
+ % (which is true if a link should be made), and closeinlinelink (which holds
316
+ % the string which should close any current link. They can be called
317
+ % at any time, but start.inlinelink will be a no-op unless something has
318
+ % previously set makeinlinelink true, and the two ...end.inlinelink functions
319
+ % will only do their stuff if start.inlinelink has previously set
320
+ % closeinlinelink to be non-empty.
321
+ % (thanks to 'ijvm' for suggested code here)
322
+ FUNCTION {uand}
323
+ { 'skip$ { pop$ #0 } if$ } % 'and' (which isn't defined at this point in the file)
324
+ FUNCTION {possibly.setup.inlinelink}
325
+ { makeinlinelink hrefform #0 > uand
326
+ { doi empty$ adddoiresolver uand
327
+ { pubmed empty$ addpubmedresolver uand
328
+ { eprint empty$ addeprints uand
329
+ { url empty$
330
+ { "" }
331
+ { url }
332
+ if$ }
333
+ { eprinturl eprint * }
334
+ if$ }
335
+ { pubmedurl pubmed * }
336
+ if$ }
337
+ { doiurl doi * }
338
+ if$
339
+ % an appropriately-formatted URL is now on the stack
340
+ hrefform #1 = % hypertex
341
+ { "\special {html:<a href=" quote$ * swap$ * quote$ * "> }{" * 'openinlinelink :=
342
+ "\special {html:</a>}" 'closeinlinelink := }
343
+ { "\href {" swap$ * "} {" * 'openinlinelink := % hrefform=#2 -- hyperref
344
+ % the space between "} {" matters: a URL of just the right length can cause "\% newline em"
345
+ "}" 'closeinlinelink := }
346
+ if$
347
+ #0 'makeinlinelink :=
348
+ }
349
+ 'skip$
350
+ if$ % makeinlinelink
351
+ }
352
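In plain terms, `possibly.setup.inlinelink` above picks a single link target per entry, preferring a DOI, then a PUBMED id, then an eprint, then an explicit `url` field, using the URL prefixes set in `init.urlbst.variables`. A hypothetical Python paraphrase of that priority rule (illustration only, not part of the .bst itself):

```python
# Prefix constants mirroring init.urlbst.variables.
DOI_URL = "https://doi.org/"
PUBMED_URL = "http://www.ncbi.nlm.nih.gov/pubmed/"
EPRINT_URL = "http://arxiv.org/abs/"

def link_for(doi=None, pubmed=None, eprint=None, url=None):
    """Return the one URL an entry should link to:
    DOI beats PUBMED beats eprint beats an explicit url field."""
    if doi:
        return DOI_URL + doi
    if pubmed:
        return PUBMED_URL + pubmed
    if eprint:
        return EPRINT_URL + eprint
    return url or ""
```

The .bst additionally wraps the chosen URL in `\href{...}{...}` (since `hrefform` is set to 2) rather than printing it.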
+ FUNCTION {add.inlinelink}
353
+ { openinlinelink empty$
354
+ 'skip$
355
+ { openinlinelink swap$ * closeinlinelink *
356
+ "" 'openinlinelink :=
357
+ }
358
+ if$
359
+ }
360
+ FUNCTION {output.nonnull}
361
+ { % Save the thing we've been asked to output
362
+ 's :=
363
+ % If the bracket-state is close.brackets, then add a close-bracket to
364
+ % what is currently at the top of the stack, and set bracket.state
365
+ % to outside.brackets
366
+ bracket.state close.brackets =
367
+ { "]" *
368
+ outside.brackets 'bracket.state :=
369
+ }
370
+ 'skip$
371
+ if$
372
+ bracket.state outside.brackets =
373
+ { % We're outside all brackets -- this is the normal situation.
374
+ % Write out what's currently at the top of the stack, using the
375
+ % original output.nonnull function.
376
+ s
377
+ add.inlinelink
378
+ output.nonnull.original % invoke the original output.nonnull
379
+ }
380
+ { % Still in brackets. Add open-bracket or (continuation) comma, add the
381
+ % new text (in s) to the top of the stack, and move to the close-brackets
382
+ % state, ready for next time (unless inbrackets resets it). If we come
383
+ % into this branch, then output.state is carefully undisturbed.
384
+ bracket.state open.brackets =
385
+ { " [" * }
386
+ { ", " * } % bracket.state will be within.brackets
387
+ if$
388
+ s *
389
+ close.brackets 'bracket.state :=
390
+ }
391
+ if$
392
+ }
393
+
394
+ % Call this function just before adding something which should be presented in
395
+ % brackets. bracket.state is handled specially within output.nonnull.
396
+ FUNCTION {inbrackets}
397
+ { bracket.state close.brackets =
398
+ { within.brackets 'bracket.state := } % reset the state: not open nor closed
399
+ { open.brackets 'bracket.state := }
400
+ if$
401
+ }
402
+
403
+ FUNCTION {format.lastchecked}
404
+ { lastchecked empty$
405
+ { "" }
406
+ { inbrackets citedstring lastchecked * }
407
+ if$
408
+ }
409
+ % ...urlbst to here
410
+ FUNCTION {output}
411
+ { duplicate$ empty$
412
+ 'pop$
413
+ 'output.nonnull
414
+ if$
415
+ }
416
+ FUNCTION {output.check}
417
+ { 't :=
418
+ duplicate$ empty$
419
+ { pop$ "empty " t * " in " * cite$ * warning$ }
420
+ 'output.nonnull
421
+ if$
422
+ }
423
+ FUNCTION {fin.entry.original} % urlbst (renamed from fin.entry, so it can be wrapped below)
424
+ { add.period$
425
+ write$
426
+ newline$
427
+ }
428
+
429
+ FUNCTION {new.block}
430
+ { output.state before.all =
431
+ 'skip$
432
+ { after.block 'output.state := }
433
+ if$
434
+ }
435
+ FUNCTION {new.sentence}
436
+ { output.state after.block =
437
+ 'skip$
438
+ { output.state before.all =
439
+ 'skip$
440
+ { after.sentence 'output.state := }
441
+ if$
442
+ }
443
+ if$
444
+ }
445
+ FUNCTION {add.blank}
446
+ { " " * before.all 'output.state :=
447
+ }
448
+
449
+ FUNCTION {date.block}
450
+ {
451
+ new.block
452
+ }
453
+
454
+ FUNCTION {not}
455
+ { { #0 }
456
+ { #1 }
457
+ if$
458
+ }
459
+ FUNCTION {and}
460
+ { 'skip$
461
+ { pop$ #0 }
462
+ if$
463
+ }
464
+ FUNCTION {or}
465
+ { { pop$ #1 }
466
+ 'skip$
467
+ if$
468
+ }
469
+ FUNCTION {new.block.checkb}
470
+ { empty$
471
+ swap$ empty$
472
+ and
473
+ 'skip$
474
+ 'new.block
475
+ if$
476
+ }
477
+ FUNCTION {field.or.null}
478
+ { duplicate$ empty$
479
+ { pop$ "" }
480
+ 'skip$
481
+ if$
482
+ }
483
+ FUNCTION {emphasize}
484
+ { duplicate$ empty$
485
+ { pop$ "" }
486
+ { "\emph{" swap$ * "}" * }
487
+ if$
488
+ }
489
+ FUNCTION {tie.or.space.prefix}
490
+ { duplicate$ text.length$ #3 <
491
+ { "~" }
492
+ { " " }
493
+ if$
494
+ swap$
495
+ }
496
+
497
+ FUNCTION {capitalize}
498
+ { "u" change.case$ "t" change.case$ }
499
+
500
+ FUNCTION {space.word}
501
+ { " " swap$ * " " * }
502
+ % Here are the language-specific definitions for explicit words.
503
+ % Each function has a name bbl.xxx where xxx is the English word.
504
+ % The language selected here is ENGLISH
505
+ FUNCTION {bbl.and}
506
+ { "and"}
507
+
508
+ FUNCTION {bbl.etal}
509
+ { "et~al." }
510
+
511
+ FUNCTION {bbl.editors}
512
+ { "editors" }
513
+
514
+ FUNCTION {bbl.editor}
515
+ { "editor" }
516
+
517
+ FUNCTION {bbl.edby}
518
+ { "edited by" }
519
+
520
+ FUNCTION {bbl.edition}
521
+ { "edition" }
522
+
523
+ FUNCTION {bbl.volume}
524
+ { "volume" }
525
+
526
+ FUNCTION {bbl.of}
527
+ { "of" }
528
+
529
+ FUNCTION {bbl.number}
530
+ { "number" }
531
+
532
+ FUNCTION {bbl.nr}
533
+ { "no." }
534
+
535
+ FUNCTION {bbl.in}
536
+ { "in" }
537
+
538
+ FUNCTION {bbl.pages}
539
+ { "pages" }
540
+
541
+ FUNCTION {bbl.page}
542
+ { "page" }
543
+
544
+ FUNCTION {bbl.chapter}
545
+ { "chapter" }
546
+
547
+ FUNCTION {bbl.techrep}
548
+ { "Technical Report" }
549
+
550
+ FUNCTION {bbl.mthesis}
551
+ { "Master's thesis" }
552
+
553
+ FUNCTION {bbl.phdthesis}
554
+ { "Ph.D. thesis" }
555
+
556
+ MACRO {jan} {"January"}
557
+
558
+ MACRO {feb} {"February"}
559
+
560
+ MACRO {mar} {"March"}
561
+
562
+ MACRO {apr} {"April"}
563
+
564
+ MACRO {may} {"May"}
565
+
566
+ MACRO {jun} {"June"}
567
+
568
+ MACRO {jul} {"July"}
569
+
570
+ MACRO {aug} {"August"}
571
+
572
+ MACRO {sep} {"September"}
573
+
574
+ MACRO {oct} {"October"}
575
+
576
+ MACRO {nov} {"November"}
577
+
578
+ MACRO {dec} {"December"}
579
+
580
+ MACRO {acmcs} {"ACM Computing Surveys"}
581
+
582
+ MACRO {acta} {"Acta Informatica"}
583
+
584
+ MACRO {cacm} {"Communications of the ACM"}
585
+
586
+ MACRO {ibmjrd} {"IBM Journal of Research and Development"}
587
+
588
+ MACRO {ibmsj} {"IBM Systems Journal"}
589
+
590
+ MACRO {ieeese} {"IEEE Transactions on Software Engineering"}
591
+
592
+ MACRO {ieeetc} {"IEEE Transactions on Computers"}
593
+
594
+ MACRO {ieeetcad}
595
+ {"IEEE Transactions on Computer-Aided Design of Integrated Circuits"}
596
+
597
+ MACRO {ipl} {"Information Processing Letters"}
598
+
599
+ MACRO {jacm} {"Journal of the ACM"}
600
+
601
+ MACRO {jcss} {"Journal of Computer and System Sciences"}
602
+
603
+ MACRO {scp} {"Science of Computer Programming"}
604
+
605
+ MACRO {sicomp} {"SIAM Journal on Computing"}
606
+
607
+ MACRO {tocs} {"ACM Transactions on Computer Systems"}
608
+
609
+ MACRO {tods} {"ACM Transactions on Database Systems"}
610
+
611
+ MACRO {tog} {"ACM Transactions on Graphics"}
612
+
613
+ MACRO {toms} {"ACM Transactions on Mathematical Software"}
614
+
615
+ MACRO {toois} {"ACM Transactions on Office Information Systems"}
616
+
617
+ MACRO {toplas} {"ACM Transactions on Programming Languages and Systems"}
618
+
619
+ MACRO {tcs} {"Theoretical Computer Science"}
620
+ FUNCTION {bibinfo.check}
621
+ { swap$
622
+ duplicate$ missing$
623
+ {
624
+ pop$ pop$
625
+ ""
626
+ }
627
+ { duplicate$ empty$
628
+ {
629
+ swap$ pop$
630
+ }
631
+ { swap$
632
+ pop$
633
+ }
634
+ if$
635
+ }
636
+ if$
637
+ }
638
+ FUNCTION {bibinfo.warn}
639
+ { swap$
640
+ duplicate$ missing$
641
+ {
642
+ swap$ "missing " swap$ * " in " * cite$ * warning$ pop$
643
+ ""
644
+ }
645
+ { duplicate$ empty$
646
+ {
647
+ swap$ "empty " swap$ * " in " * cite$ * warning$
648
+ }
649
+ { swap$
650
+ pop$
651
+ }
652
+ if$
653
+ }
654
+ if$
655
+ }
656
+ STRINGS { bibinfo}
657
+ INTEGERS { nameptr namesleft numnames }
658
+
659
+ FUNCTION {format.names}
660
+ { 'bibinfo :=
661
+ duplicate$ empty$ 'skip$ {
662
+ 's :=
663
+ "" 't :=
664
+ #1 'nameptr :=
665
+ s num.names$ 'numnames :=
666
+ numnames 'namesleft :=
667
+ { namesleft #0 > }
668
+ { s nameptr
669
+ duplicate$ #1 >
670
+ { "{ff~}{vv~}{ll}{, jj}" }
671
+ { "{ff~}{vv~}{ll}{, jj}" } % first name first for first author
672
+ % { "{vv~}{ll}{, ff}{, jj}" } % last name first for first author
673
+ if$
674
+ format.name$
675
+ bibinfo bibinfo.check
676
+ 't :=
677
+ nameptr #1 >
678
+ {
679
+ namesleft #1 >
680
+ { ", " * t * }
681
+ {
682
+ numnames #2 >
683
+ { "," * }
684
+ 'skip$
685
+ if$
686
+ s nameptr "{ll}" format.name$ duplicate$ "others" =
687
+ { 't := }
688
+ { pop$ }
689
+ if$
690
+ t "others" =
691
+ {
692
+ " " * bbl.etal *
693
+ }
694
+ {
695
+ bbl.and
696
+ space.word * t *
697
+ }
698
+ if$
699
+ }
700
+ if$
701
+ }
702
+ 't
703
+ if$
704
+ nameptr #1 + 'nameptr :=
705
+ namesleft #1 - 'namesleft :=
706
+ }
707
+ while$
708
+ } if$
709
+ }
710
+ FUNCTION {format.names.ed}
711
+ {
712
+ 'bibinfo :=
713
+ duplicate$ empty$ 'skip$ {
714
+ 's :=
715
+ "" 't :=
716
+ #1 'nameptr :=
717
+ s num.names$ 'numnames :=
718
+ numnames 'namesleft :=
719
+ { namesleft #0 > }
720
+ { s nameptr
721
+ "{ff~}{vv~}{ll}{, jj}"
722
+ format.name$
723
+ bibinfo bibinfo.check
724
+ 't :=
725
+ nameptr #1 >
726
+ {
727
+ namesleft #1 >
728
+ { ", " * t * }
729
+ {
730
+ numnames #2 >
731
+ { "," * }
732
+ 'skip$
733
+ if$
734
+ s nameptr "{ll}" format.name$ duplicate$ "others" =
735
+ { 't := }
736
+ { pop$ }
737
+ if$
738
+ t "others" =
739
+ {
740
+
741
+ " " * bbl.etal *
742
+ }
743
+ {
744
+ bbl.and
745
+ space.word * t *
746
+ }
747
+ if$
748
+ }
749
+ if$
750
+ }
751
+ 't
752
+ if$
753
+ nameptr #1 + 'nameptr :=
754
+ namesleft #1 - 'namesleft :=
755
+ }
756
+ while$
757
+ } if$
758
+ }
759
+ FUNCTION {format.key}
760
+ { empty$
761
+ { key field.or.null }
762
+ { "" }
763
+ if$
764
+ }
765
+
766
+ FUNCTION {format.authors}
767
+ { author "author" format.names
768
+ }
769
+ FUNCTION {get.bbl.editor}
770
+ { editor num.names$ #1 > 'bbl.editors 'bbl.editor if$ }
771
+
772
+ FUNCTION {format.editors}
773
+ { editor "editor" format.names duplicate$ empty$ 'skip$
774
+ {
775
+ "," *
776
+ " " *
777
+ get.bbl.editor
778
+ *
779
+ }
780
+ if$
781
+ }
782
+ FUNCTION {format.note}
783
+ {
784
+ note empty$
785
+ { "" }
786
+ { note #1 #1 substring$
787
+ duplicate$ "{" =
788
+ 'skip$
789
+ { output.state mid.sentence =
790
+ { "l" }
791
+ { "u" }
792
+ if$
793
+ change.case$
794
+ }
795
+ if$
796
+ note #2 global.max$ substring$ * "note" bibinfo.check
797
+ }
798
+ if$
799
+ }
800
+
801
+ FUNCTION {format.title}
802
+ { title
803
+ duplicate$ empty$ 'skip$
804
+ { "t" change.case$ }
805
+ if$
806
+ "title" bibinfo.check
807
+ }
808
+ FUNCTION {format.full.names}
809
+ {'s :=
810
+ "" 't :=
811
+ #1 'nameptr :=
812
+ s num.names$ 'numnames :=
813
+ numnames 'namesleft :=
814
+ { namesleft #0 > }
815
+ { s nameptr
816
+ "{vv~}{ll}" format.name$
817
+ 't :=
818
+ nameptr #1 >
819
+ {
820
+ namesleft #1 >
821
+ { ", " * t * }
822
+ {
823
+ s nameptr "{ll}" format.name$ duplicate$ "others" =
824
+ { 't := }
825
+ { pop$ }
826
+ if$
827
+ t "others" =
828
+ {
829
+ " " * bbl.etal *
830
+ }
831
+ {
832
+ numnames #2 >
833
+ { "," * }
834
+ 'skip$
835
+ if$
836
+ bbl.and
837
+ space.word * t *
838
+ }
839
+ if$
840
+ }
841
+ if$
842
+ }
843
+ 't
844
+ if$
845
+ nameptr #1 + 'nameptr :=
846
+ namesleft #1 - 'namesleft :=
847
+ }
848
+ while$
849
+ }
850
+
851
+ FUNCTION {author.editor.key.full}
852
+ { author empty$
853
+ { editor empty$
854
+ { key empty$
855
+ { cite$ #1 #3 substring$ }
856
+ 'key
857
+ if$
858
+ }
859
+ { editor format.full.names }
860
+ if$
861
+ }
862
+ { author format.full.names }
863
+ if$
864
+ }
865
+
866
+ FUNCTION {author.key.full}
867
+ { author empty$
868
+ { key empty$
869
+ { cite$ #1 #3 substring$ }
870
+ 'key
871
+ if$
872
+ }
873
+ { author format.full.names }
874
+ if$
875
+ }
876
+
877
+ FUNCTION {editor.key.full}
878
+ { editor empty$
879
+ { key empty$
880
+ { cite$ #1 #3 substring$ }
881
+ 'key
882
+ if$
883
+ }
884
+ { editor format.full.names }
885
+ if$
886
+ }
887
+
888
+ FUNCTION {make.full.names}
889
+ { type$ "book" =
890
+ type$ "inbook" =
891
+ or
892
+ 'author.editor.key.full
893
+ { type$ "proceedings" =
894
+ 'editor.key.full
895
+ 'author.key.full
896
+ if$
897
+ }
898
+ if$
899
+ }
900
+
901
+ FUNCTION {output.bibitem.original} % urlbst (renamed from output.bibitem, so it can be wrapped below)
902
+ { newline$
903
+ "\bibitem[{" write$
904
+ label write$
905
+ ")" make.full.names duplicate$ short.list =
906
+ { pop$ }
907
+ { * }
908
+ if$
909
+ "}]{" * write$
910
+ cite$ write$
911
+ "}" write$
912
+ newline$
913
+ ""
914
+ before.all 'output.state :=
915
+ }
916
+
917
+ FUNCTION {n.dashify}
918
+ {
919
+ 't :=
920
+ ""
921
+ { t empty$ not }
922
+ { t #1 #1 substring$ "-" =
923
+ { t #1 #2 substring$ "--" = not
924
+ { "--" *
925
+ t #2 global.max$ substring$ 't :=
926
+ }
927
+ { { t #1 #1 substring$ "-" = }
928
+ { "-" *
929
+ t #2 global.max$ substring$ 't :=
930
+ }
931
+ while$
932
+ }
933
+ if$
934
+ }
935
+ { t #1 #1 substring$ *
936
+ t #2 global.max$ substring$ 't :=
937
+ }
938
+ if$
939
+ }
940
+ while$
941
+ }
942
+
943
+ FUNCTION {word.in}
944
+ { bbl.in capitalize
945
+ " " * }
946
+
947
+ FUNCTION {format.date}
948
+ { year "year" bibinfo.check duplicate$ empty$
949
+ {
950
+ }
951
+ 'skip$
952
+ if$
953
+ extra.label *
954
+ before.all 'output.state :=
955
+ after.sentence 'output.state :=
956
+ }
957
+ FUNCTION {format.btitle}
958
+ { title "title" bibinfo.check
959
+ duplicate$ empty$ 'skip$
960
+ {
961
+ emphasize
962
+ }
963
+ if$
964
+ }
965
+ FUNCTION {either.or.check}
966
+ { empty$
967
+ 'pop$
968
+ { "can't use both " swap$ * " fields in " * cite$ * warning$ }
969
+ if$
970
+ }
971
+ FUNCTION {format.bvolume}
972
+ { volume empty$
973
+ { "" }
974
+ { bbl.volume volume tie.or.space.prefix
975
+ "volume" bibinfo.check * *
976
+ series "series" bibinfo.check
977
+ duplicate$ empty$ 'pop$
978
+ { swap$ bbl.of space.word * swap$
979
+ emphasize * }
980
+ if$
981
+ "volume and number" number either.or.check
982
+ }
983
+ if$
984
+ }
985
+ FUNCTION {format.number.series}
986
+ { volume empty$
987
+ { number empty$
988
+ { series field.or.null }
989
+ { series empty$
990
+ { number "number" bibinfo.check }
991
+ { output.state mid.sentence =
992
+ { bbl.number }
993
+ { bbl.number capitalize }
994
+ if$
995
+ number tie.or.space.prefix "number" bibinfo.check * *
996
+ bbl.in space.word *
997
+ series "series" bibinfo.check *
998
+ }
999
+ if$
1000
+ }
1001
+ if$
1002
+ }
1003
+ { "" }
1004
+ if$
1005
+ }
1006
+
1007
+ FUNCTION {format.edition}
1008
+ { edition duplicate$ empty$ 'skip$
1009
+ {
1010
+ output.state mid.sentence =
1011
+ { "l" }
1012
+ { "t" }
1013
+ if$ change.case$
1014
+ "edition" bibinfo.check
1015
+ " " * bbl.edition *
1016
+ }
1017
+ if$
1018
+ }
1019
+ INTEGERS { multiresult }
1020
+ FUNCTION {multi.page.check}
1021
+ { 't :=
1022
+ #0 'multiresult :=
1023
+ { multiresult not
1024
+ t empty$ not
1025
+ and
1026
+ }
1027
+ { t #1 #1 substring$
1028
+ duplicate$ "-" =
1029
+ swap$ duplicate$ "," =
1030
+ swap$ "+" =
1031
+ or or
1032
+ { #1 'multiresult := }
1033
+ { t #2 global.max$ substring$ 't := }
1034
+ if$
1035
+ }
1036
+ while$
1037
+ multiresult
1038
+ }
1039
+ FUNCTION {format.pages}
1040
+ { pages duplicate$ empty$ 'skip$
1041
+ { duplicate$ multi.page.check
1042
+ {
1043
+ bbl.pages swap$
1044
+ n.dashify
1045
+ }
1046
+ {
1047
+ bbl.page swap$
1048
+ }
1049
+ if$
1050
+ tie.or.space.prefix
1051
+ "pages" bibinfo.check
1052
+ * *
1053
+ }
1054
+ if$
1055
+ }
1056
+ FUNCTION {format.journal.pages}
1057
+ { pages duplicate$ empty$ 'pop$
1058
+ { swap$ duplicate$ empty$
1059
+ { pop$ pop$ format.pages }
1060
+ {
1061
+ ":" *
1062
+ swap$
1063
+ n.dashify
1064
+ "pages" bibinfo.check
1065
+ *
1066
+ }
1067
+ if$
1068
+ }
1069
+ if$
1070
+ }
1071
+ FUNCTION {format.vol.num.pages}
1072
+ { volume field.or.null
1073
+ duplicate$ empty$ 'skip$
1074
+ {
1075
+ "volume" bibinfo.check
1076
+ }
1077
+ if$
1078
+ number "number" bibinfo.check duplicate$ empty$ 'skip$
1079
+ {
1080
+ swap$ duplicate$ empty$
1081
+ { "there's a number but no volume in " cite$ * warning$ }
1082
+ 'skip$
1083
+ if$
1084
+ swap$
1085
+ "(" swap$ * ")" *
1086
+ }
1087
+ if$ *
1088
+ format.journal.pages
1089
+ }
1090
+
1091
+ FUNCTION {format.chapter}
1092
+ { chapter empty$
1093
+ 'format.pages
1094
+ { type empty$
1095
+ { bbl.chapter }
1096
+ { type "l" change.case$
1097
+ "type" bibinfo.check
1098
+ }
1099
+ if$
1100
+ chapter tie.or.space.prefix
1101
+ "chapter" bibinfo.check
1102
+ * *
1103
+ }
1104
+ if$
1105
+ }
1106
+
1107
+ FUNCTION {format.chapter.pages}
1108
+ { chapter empty$
1109
+ 'format.pages
1110
+ { type empty$
1111
+ { bbl.chapter }
1112
+ { type "l" change.case$
1113
+ "type" bibinfo.check
1114
+ }
1115
+ if$
1116
+ chapter tie.or.space.prefix
1117
+ "chapter" bibinfo.check
1118
+ * *
1119
+ pages empty$
1120
+ 'skip$
1121
+ { ", " * format.pages * }
1122
+ if$
1123
+ }
1124
+ if$
1125
+ }
1126
+
1127
+ FUNCTION {format.booktitle}
1128
+ {
1129
+ booktitle "booktitle" bibinfo.check
1130
+ emphasize
1131
+ }
1132
+ FUNCTION {format.in.booktitle}
1133
+ { format.booktitle duplicate$ empty$ 'skip$
1134
+ {
1135
+ word.in swap$ *
1136
+ }
1137
+ if$
1138
+ }
1139
+ FUNCTION {format.in.ed.booktitle}
1140
+ { format.booktitle duplicate$ empty$ 'skip$
1141
+ {
1142
+ editor "editor" format.names.ed duplicate$ empty$ 'pop$
1143
+ {
1144
+ "," *
1145
+ " " *
1146
+ get.bbl.editor
1147
+ ", " *
1148
+ * swap$
1149
+ * }
1150
+ if$
1151
+ word.in swap$ *
1152
+ }
1153
+ if$
1154
+ }
1155
+ FUNCTION {format.thesis.type}
1156
+ { type duplicate$ empty$
1157
+ 'pop$
1158
+ { swap$ pop$
1159
+ "t" change.case$ "type" bibinfo.check
1160
+ }
1161
+ if$
1162
+ }
1163
+ FUNCTION {format.tr.number}
1164
+ { number "number" bibinfo.check
1165
+ type duplicate$ empty$
1166
+ { pop$ bbl.techrep }
1167
+ 'skip$
1168
+ if$
1169
+ "type" bibinfo.check
1170
+ swap$ duplicate$ empty$
1171
+ { pop$ "t" change.case$ }
1172
+ { tie.or.space.prefix * * }
1173
+ if$
1174
+ }
1175
+ FUNCTION {format.article.crossref}
1176
+ {
1177
+ word.in
1178
+ " \cite{" * crossref * "}" *
1179
+ }
1180
+ FUNCTION {format.book.crossref}
1181
+ { volume duplicate$ empty$
1182
+ { "empty volume in " cite$ * "'s crossref of " * crossref * warning$
1183
+ pop$ word.in
1184
+ }
1185
+ { bbl.volume
1186
+ capitalize
1187
+ swap$ tie.or.space.prefix "volume" bibinfo.check * * bbl.of space.word *
1188
+ }
1189
+ if$
1190
+ " \cite{" * crossref * "}" *
1191
+ }
1192
+ FUNCTION {format.incoll.inproc.crossref}
1193
+ {
1194
+ word.in
1195
+ " \cite{" * crossref * "}" *
1196
+ }
1197
+ FUNCTION {format.org.or.pub}
1198
+ { 't :=
1199
+ ""
1200
+ address empty$ t empty$ and
1201
+ 'skip$
1202
+ {
1203
+ t empty$
1204
+ { address "address" bibinfo.check *
1205
+ }
1206
+ { t *
1207
+ address empty$
1208
+ 'skip$
1209
+ { ", " * address "address" bibinfo.check * }
1210
+ if$
1211
+ }
1212
+ if$
1213
+ }
1214
+ if$
1215
+ }
1216
+ FUNCTION {format.publisher.address}
1217
+ { publisher "publisher" bibinfo.warn format.org.or.pub
1218
+ }
1219
+
1220
+ FUNCTION {format.organization.address}
1221
+ { organization "organization" bibinfo.check format.org.or.pub
1222
+ }
1223
+
1224
+ % urlbst...
1225
+ % Functions for making hypertext links.
1226
+ % In all cases, the stack has (link-text href-url)
1227
+ %
1228
+ % make 'null' specials
1229
+ FUNCTION {make.href.null}
1230
+ {
1231
+ pop$
1232
+ }
1233
+ % make hypertex specials
1234
+ FUNCTION {make.href.hypertex}
1235
+ {
1236
+ "\special {html:<a href=" quote$ *
1237
+ swap$ * quote$ * "> }" * swap$ *
1238
+ "\special {html:</a>}" *
1239
+ }
1240
+ % make hyperref specials
1241
+ FUNCTION {make.href.hyperref}
1242
+ {
1243
+ "\href {" swap$ * "} {\path{" * swap$ * "}}" *
1244
+ }
1245
+ FUNCTION {make.href}
1246
+ { hrefform #2 =
1247
+ 'make.href.hyperref % hrefform = 2
1248
+ { hrefform #1 =
1249
+ 'make.href.hypertex % hrefform = 1
1250
+ 'make.href.null % hrefform = 0 (or anything else)
1251
+ if$
1252
+ }
1253
+ if$
1254
+ }
1255
+
1256
+ % If inlinelinks is true, then format.url should be a no-op, since it's
1257
+ % (a) redundant, and (b) could end up as a link-within-a-link.
1258
+ FUNCTION {format.url}
1259
+ { inlinelinks #1 = url empty$ or
1260
+ { "" }
1261
+ { hrefform #1 =
1262
+ { % special case -- add HyperTeX specials
1263
+ urlintro "\url{" url * "}" * url make.href.hypertex * }
1264
+ { urlintro "\url{" * url * "}" * }
1265
+ if$
1266
+ }
1267
+ if$
1268
+ }
1269
+
1270
+ FUNCTION {format.eprint}
1271
+ { eprint empty$
1272
+ { "" }
1273
+ { eprintprefix eprint * eprinturl eprint * make.href }
1274
+ if$
1275
+ }
1276
+
1277
+ FUNCTION {format.doi}
1278
+ { doi empty$
1279
+ { "" }
1280
+ { doiprefix doi * doiurl doi * make.href }
1281
+ if$
1282
+ }
1283
+
1284
+ FUNCTION {format.pubmed}
1285
+ { pubmed empty$
1286
+ { "" }
1287
+ { pubmedprefix pubmed * pubmedurl pubmed * make.href }
1288
+ if$
1289
+ }
1290
+
1291
+ % Output a URL. We can't use the more normal idiom (something like
1292
+ % `format.url output'), because the `inbrackets' within
1293
+ % format.lastchecked applies to everything between calls to `output',
1294
+ % so that `format.url format.lastchecked * output' ends up with both
1295
+ % the URL and the lastchecked in brackets.
1296
+ FUNCTION {output.url}
1297
+ { url empty$
1298
+ 'skip$
1299
+ { new.block
1300
+ format.url output
1301
+ format.lastchecked output
1302
+ }
1303
+ if$
1304
+ }
1305
+
1306
+ FUNCTION {output.web.refs}
1307
+ {
1308
+ new.block
1309
+ inlinelinks
1310
+ 'skip$ % links were inline -- don't repeat them
1311
+ {
1312
+ output.url
1313
+ addeprints eprint empty$ not and
1314
+ { format.eprint output.nonnull }
1315
+ 'skip$
1316
+ if$
1317
+ adddoiresolver doi empty$ not and
1318
+ { format.doi output.nonnull }
1319
+ 'skip$
1320
+ if$
1321
+ addpubmedresolver pubmed empty$ not and
1322
+ { format.pubmed output.nonnull }
1323
+ 'skip$
1324
+ if$
1325
+ }
1326
+ if$
1327
+ }
1328
+
1329
+ % Wrapper for output.bibitem.original.
1330
+ % If the URL field is not empty, set makeinlinelink to be true,
1331
+ % so that an inline link will be started at the next opportunity
1332
+ FUNCTION {output.bibitem}
1333
+ { outside.brackets 'bracket.state :=
1334
+ output.bibitem.original
1335
+ inlinelinks url empty$ not doi empty$ not or pubmed empty$ not or eprint empty$ not or and
1336
+ { #1 'makeinlinelink := }
1337
+ { #0 'makeinlinelink := }
1338
+ if$
1339
+ }
1340
+
1341
+ % Wrapper for fin.entry.original
1342
+ FUNCTION {fin.entry}
1343
+ { output.web.refs % urlbst
1344
+ makeinlinelink % ooops, it appears we didn't have a title for inlinelink
1345
+ { possibly.setup.inlinelink % add some artificial link text here, as a fallback
1346
+ linktextstring output.nonnull }
1347
+ 'skip$
1348
+ if$
1349
+ bracket.state close.brackets = % urlbst
1350
+ { "]" * }
1351
+ 'skip$
1352
+ if$
1353
+ fin.entry.original
1354
+ }
1355
+
1356
+ % Webpage entry type.
1357
+ % Title and url fields required;
1358
+ % author, note, year, month, and lastchecked fields optional
1359
+ % See references
1360
+ % ISO 690-2 http://www.nlc-bnc.ca/iso/tc46sc9/standard/690-2e.htm
1361
+ % http://www.classroom.net/classroom/CitingNetResources.html
1362
+ % http://neal.ctstateu.edu/history/cite.html
1363
+ % http://www.cas.usf.edu/english/walker/mla.html
1364
+ % for citation formats for web pages.
1365
+ FUNCTION {webpage}
1366
+ { output.bibitem
1367
+ author empty$
1368
+ { editor empty$
1369
+ 'skip$ % author and editor both optional
1370
+ { format.editors output.nonnull }
1371
+ if$
1372
+ }
1373
+ { editor empty$
1374
+ { format.authors output.nonnull }
1375
+ { "can't use both author and editor fields in " cite$ * warning$ }
1376
+ if$
1377
+ }
1378
+ if$
1379
+ new.block
1380
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$
1381
+ format.title "title" output.check
1382
+ inbrackets onlinestring output
1383
+ new.block
1384
+ year empty$
1385
+ 'skip$
1386
+ { format.date "year" output.check }
1387
+ if$
1388
+ % We don't need to output the URL details ('lastchecked' and 'url'),
1389
+ % because fin.entry does that for us, using output.web.refs. The only
1390
+ % reason we would want to put them here is if we were to decide that
1391
+ % they should go in front of the rather miscellaneous information in 'note'.
1392
+ new.block
1393
+ note output
1394
+ fin.entry
1395
+ }
1396
+ % ...urlbst to here
1397
+
1398
+
1399
+ FUNCTION {article}
1400
+ { output.bibitem
1401
+ format.authors "author" output.check
1402
+ author format.key output
1403
+ format.date "year" output.check
1404
+ date.block
1405
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
1406
+ format.title "title" output.check
1407
+ new.block
1408
+ crossref missing$
1409
+ {
1410
+ journal
1411
+ "journal" bibinfo.check
1412
+ emphasize
1413
+ "journal" output.check
1414
+ possibly.setup.inlinelink format.vol.num.pages output% urlbst
1415
+ }
1416
+ { format.article.crossref output.nonnull
1417
+ format.pages output
1418
+ }
1419
+ if$
1420
+ new.block
1421
+ format.note output
1422
+ fin.entry
1423
+ }
1424
+ FUNCTION {book}
1425
+ { output.bibitem
1426
+ author empty$
1427
+ { format.editors "author and editor" output.check
1428
+ editor format.key output
1429
+ }
1430
+ { format.authors output.nonnull
1431
+ crossref missing$
1432
+ { "author and editor" editor either.or.check }
1433
+ 'skip$
1434
+ if$
1435
+ }
1436
+ if$
1437
+ format.date "year" output.check
1438
+ date.block
1439
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
1440
+ format.btitle "title" output.check
1441
+ format.edition output
1442
+ crossref missing$
1443
+ { format.bvolume output
1444
+ new.block
1445
+ format.number.series output
1446
+ new.sentence
1447
+ format.publisher.address output
1448
+ }
1449
+ {
1450
+ new.block
1451
+ format.book.crossref output.nonnull
1452
+ }
1453
+ if$
1454
+ new.block
1455
+ format.note output
1456
+ fin.entry
1457
+ }
1458
+ FUNCTION {booklet}
1459
+ { output.bibitem
1460
+ format.authors output
1461
+ author format.key output
1462
+ format.date "year" output.check
1463
+ date.block
1464
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
1465
+ format.title "title" output.check
1466
+ new.block
1467
+ howpublished "howpublished" bibinfo.check output
1468
+ address "address" bibinfo.check output
1469
+ new.block
1470
+ format.note output
1471
+ fin.entry
1472
+ }
1473
+
1474
+ FUNCTION {inbook}
1475
+ { output.bibitem
1476
+ author empty$
1477
+ { format.editors "author and editor" output.check
1478
+ editor format.key output
1479
+ }
1480
+ { format.authors output.nonnull
1481
+ crossref missing$
1482
+ { "author and editor" editor either.or.check }
1483
+ 'skip$
1484
+ if$
1485
+ }
1486
+ if$
1487
+ format.date "year" output.check
1488
+ date.block
1489
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
1490
+ format.btitle "title" output.check
1491
+ format.edition output
1492
+ crossref missing$
1493
+ {
1494
+ format.bvolume output
1495
+ format.number.series output
1496
+ format.chapter "chapter" output.check
1497
+ new.sentence
1498
+ format.publisher.address output
1499
+ new.block
1500
+ }
1501
+ {
1502
+ format.chapter "chapter" output.check
1503
+ new.block
1504
+ format.book.crossref output.nonnull
1505
+ }
1506
+ if$
1507
+ new.block
1508
+ format.note output
1509
+ fin.entry
1510
+ }
1511
+
1512
+ FUNCTION {incollection}
1513
+ { output.bibitem
1514
+ format.authors "author" output.check
1515
+ author format.key output
1516
+ format.date "year" output.check
1517
+ date.block
1518
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
1519
+ format.title "title" output.check
1520
+ new.block
1521
+ crossref missing$
1522
+ { format.in.ed.booktitle "booktitle" output.check
1523
+ format.edition output
1524
+ format.bvolume output
1525
+ format.number.series output
1526
+ format.chapter.pages output
1527
+ new.sentence
1528
+ format.publisher.address output
1529
+ }
1530
+ { format.incoll.inproc.crossref output.nonnull
1531
+ format.chapter.pages output
1532
+ }
1533
+ if$
1534
+ new.block
1535
+ format.note output
1536
+ fin.entry
1537
+ }
1538
+ FUNCTION {inproceedings}
1539
+ { output.bibitem
1540
+ format.authors "author" output.check
1541
+ author format.key output
1542
+ format.date "year" output.check
1543
+ date.block
1544
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
1545
+ format.title "title" output.check
1546
+ new.block
1547
+ crossref missing$
1548
+ { format.in.booktitle "booktitle" output.check
1549
+ format.bvolume output
1550
+ format.number.series output
1551
+ format.pages output
1552
+ address "address" bibinfo.check output
1553
+ new.sentence
1554
+ organization "organization" bibinfo.check output
1555
+ publisher "publisher" bibinfo.check output
1556
+ }
1557
+ { format.incoll.inproc.crossref output.nonnull
1558
+ format.pages output
1559
+ }
1560
+ if$
1561
+ new.block
1562
+ format.note output
1563
+ fin.entry
1564
+ }
1565
+ FUNCTION {conference} { inproceedings }
1566
+ FUNCTION {manual}
1567
+ { output.bibitem
1568
+ format.authors output
1569
+ author format.key output
1570
+ format.date "year" output.check
1571
+ date.block
1572
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
1573
+ format.btitle "title" output.check
1574
+ format.edition output
1575
+ organization address new.block.checkb
1576
+ organization "organization" bibinfo.check output
1577
+ address "address" bibinfo.check output
1578
+ new.block
1579
+ format.note output
1580
+ fin.entry
1581
+ }
1582
+
1583
+ FUNCTION {mastersthesis}
1584
+ { output.bibitem
1585
+ format.authors "author" output.check
1586
+ author format.key output
1587
+ format.date "year" output.check
1588
+ date.block
1589
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
1590
+ format.title
1591
+ "title" output.check
1592
+ new.block
1593
+ bbl.mthesis format.thesis.type output.nonnull
1594
+ school "school" bibinfo.warn output
1595
+ address "address" bibinfo.check output
1596
+ month "month" bibinfo.check output
1597
+ new.block
1598
+ format.note output
1599
+ fin.entry
1600
+ }
1601
+
1602
+ FUNCTION {misc}
1603
+ { output.bibitem
1604
+ format.authors output
1605
+ author format.key output
1606
+ format.date "year" output.check
1607
+ date.block
1608
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
1609
+ format.title output
1610
+ new.block
1611
+ howpublished "howpublished" bibinfo.check output
1612
+ new.block
1613
+ format.note output
1614
+ fin.entry
1615
+ }
1616
+ FUNCTION {phdthesis}
1617
+ { output.bibitem
1618
+ format.authors "author" output.check
1619
+ author format.key output
1620
+ format.date "year" output.check
1621
+ date.block
1622
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
1623
+ format.btitle
1624
+ "title" output.check
1625
+ new.block
1626
+ bbl.phdthesis format.thesis.type output.nonnull
1627
+ school "school" bibinfo.warn output
1628
+ address "address" bibinfo.check output
1629
+ new.block
1630
+ format.note output
1631
+ fin.entry
1632
+ }
1633
+
1634
+ FUNCTION {proceedings}
1635
+ { output.bibitem
1636
+ format.editors output
1637
+ editor format.key output
1638
+ format.date "year" output.check
1639
+ date.block
1640
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
1641
+ format.btitle "title" output.check
1642
+ format.bvolume output
1643
+ format.number.series output
1644
+ new.sentence
1645
+ publisher empty$
1646
+ { format.organization.address output }
1647
+ { organization "organization" bibinfo.check output
1648
+ new.sentence
1649
+ format.publisher.address output
1650
+ }
1651
+ if$
1652
+ new.block
1653
+ format.note output
1654
+ fin.entry
1655
+ }
1656
+
1657
+ FUNCTION {techreport}
1658
+ { output.bibitem
1659
+ format.authors "author" output.check
1660
+ author format.key output
1661
+ format.date "year" output.check
1662
+ date.block
1663
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
1664
+ format.title
1665
+ "title" output.check
1666
+ new.block
1667
+ format.tr.number output.nonnull
1668
+ institution "institution" bibinfo.warn output
1669
+ address "address" bibinfo.check output
1670
+ new.block
1671
+ format.note output
1672
+ fin.entry
1673
+ }
1674
+
1675
+ FUNCTION {unpublished}
1676
+ { output.bibitem
1677
+ format.authors "author" output.check
1678
+ author format.key output
1679
+ format.date "year" output.check
1680
+ date.block
1681
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
1682
+ format.title "title" output.check
1683
+ new.block
1684
+ format.note "note" output.check
1685
+ fin.entry
1686
+ }
1687
+
1688
+ FUNCTION {default.type} { misc }
1689
+ READ
1690
+ FUNCTION {sortify}
1691
+ { purify$
1692
+ "l" change.case$
1693
+ }
1694
+ INTEGERS { len }
1695
+ FUNCTION {chop.word}
1696
+ { 's :=
1697
+ 'len :=
1698
+ s #1 len substring$ =
1699
+ { s len #1 + global.max$ substring$ }
1700
+ 's
1701
+ if$
1702
+ }
1703
+ FUNCTION {format.lab.names}
1704
+ { 's :=
1705
+ "" 't :=
1706
+ s #1 "{vv~}{ll}" format.name$
1707
+ s num.names$ duplicate$
1708
+ #2 >
1709
+ { pop$
1710
+ " " * bbl.etal *
1711
+ }
1712
+ { #2 <
1713
+ 'skip$
1714
+ { s #2 "{ff }{vv }{ll}{ jj}" format.name$ "others" =
1715
+ {
1716
+ " " * bbl.etal *
1717
+ }
1718
+ { bbl.and space.word * s #2 "{vv~}{ll}" format.name$
1719
+ * }
1720
+ if$
1721
+ }
1722
+ if$
1723
+ }
1724
+ if$
1725
+ }
1726
+
1727
+ FUNCTION {author.key.label}
1728
+ { author empty$
1729
+ { key empty$
1730
+ { cite$ #1 #3 substring$ }
1731
+ 'key
1732
+ if$
1733
+ }
1734
+ { author format.lab.names }
1735
+ if$
1736
+ }
1737
+
1738
+ FUNCTION {author.editor.key.label}
1739
+ { author empty$
1740
+ { editor empty$
1741
+ { key empty$
1742
+ { cite$ #1 #3 substring$ }
1743
+ 'key
1744
+ if$
1745
+ }
1746
+ { editor format.lab.names }
1747
+ if$
1748
+ }
1749
+ { author format.lab.names }
1750
+ if$
1751
+ }
1752
+
1753
+ FUNCTION {editor.key.label}
1754
+ { editor empty$
1755
+ { key empty$
1756
+ { cite$ #1 #3 substring$ }
1757
+ 'key
1758
+ if$
1759
+ }
1760
+ { editor format.lab.names }
1761
+ if$
1762
+ }
1763
+
1764
+ FUNCTION {calc.short.authors}
1765
+ { type$ "book" =
1766
+ type$ "inbook" =
1767
+ or
1768
+ 'author.editor.key.label
1769
+ { type$ "proceedings" =
1770
+ 'editor.key.label
1771
+ 'author.key.label
1772
+ if$
1773
+ }
1774
+ if$
1775
+ 'short.list :=
1776
+ }
1777
+
1778
+ FUNCTION {calc.label}
1779
+ { calc.short.authors
1780
+ short.list
1781
+ "("
1782
+ *
1783
+ year duplicate$ empty$
1784
+ short.list key field.or.null = or
1785
+ { pop$ "" }
1786
+ 'skip$
1787
+ if$
1788
+ *
1789
+ 'label :=
1790
+ }
1791
+
1792
+ FUNCTION {sort.format.names}
1793
+ { 's :=
1794
+ #1 'nameptr :=
1795
+ ""
1796
+ s num.names$ 'numnames :=
1797
+ numnames 'namesleft :=
1798
+ { namesleft #0 > }
1799
+ { s nameptr
1800
+ "{ll{ }}{ ff{ }}{ jj{ }}"
1801
+ format.name$ 't :=
1802
+ nameptr #1 >
1803
+ {
1804
+ " " *
1805
+ namesleft #1 = t "others" = and
1806
+ { "zzzzz" * }
1807
+ { t sortify * }
1808
+ if$
1809
+ }
1810
+ { t sortify * }
1811
+ if$
1812
+ nameptr #1 + 'nameptr :=
1813
+ namesleft #1 - 'namesleft :=
1814
+ }
1815
+ while$
1816
+ }
1817
+
1818
+ FUNCTION {sort.format.title}
1819
+ { 't :=
1820
+ "A " #2
1821
+ "An " #3
1822
+ "The " #4 t chop.word
1823
+ chop.word
1824
+ chop.word
1825
+ sortify
1826
+ #1 global.max$ substring$
1827
+ }
1828
+ FUNCTION {author.sort}
1829
+ { author empty$
1830
+ { key empty$
1831
+ { "to sort, need author or key in " cite$ * warning$
1832
+ ""
1833
+ }
1834
+ { key sortify }
1835
+ if$
1836
+ }
1837
+ { author sort.format.names }
1838
+ if$
1839
+ }
1840
+ FUNCTION {author.editor.sort}
1841
+ { author empty$
1842
+ { editor empty$
1843
+ { key empty$
1844
+ { "to sort, need author, editor, or key in " cite$ * warning$
1845
+ ""
1846
+ }
1847
+ { key sortify }
1848
+ if$
1849
+ }
1850
+ { editor sort.format.names }
1851
+ if$
1852
+ }
1853
+ { author sort.format.names }
1854
+ if$
1855
+ }
1856
+ FUNCTION {editor.sort}
1857
+ { editor empty$
1858
+ { key empty$
1859
+ { "to sort, need editor or key in " cite$ * warning$
1860
+ ""
1861
+ }
1862
+ { key sortify }
1863
+ if$
1864
+ }
1865
+ { editor sort.format.names }
1866
+ if$
1867
+ }
1868
+ FUNCTION {presort}
1869
+ { calc.label
1870
+ label sortify
1871
+ " "
1872
+ *
1873
+ type$ "book" =
1874
+ type$ "inbook" =
1875
+ or
1876
+ 'author.editor.sort
1877
+ { type$ "proceedings" =
1878
+ 'editor.sort
1879
+ 'author.sort
1880
+ if$
1881
+ }
1882
+ if$
1883
+ #1 entry.max$ substring$
1884
+ 'sort.label :=
1885
+ sort.label
1886
+ *
1887
+ " "
1888
+ *
1889
+ title field.or.null
1890
+ sort.format.title
1891
+ *
1892
+ #1 entry.max$ substring$
1893
+ 'sort.key$ :=
1894
+ }
1895
+
1896
+ ITERATE {presort}
1897
+ SORT
1898
+ STRINGS { last.label next.extra }
1899
+ INTEGERS { last.extra.num number.label }
1900
+ FUNCTION {initialize.extra.label.stuff}
1901
+ { #0 int.to.chr$ 'last.label :=
1902
+ "" 'next.extra :=
1903
+ #0 'last.extra.num :=
1904
+ #0 'number.label :=
1905
+ }
1906
+ FUNCTION {forward.pass}
1907
+ { last.label label =
1908
+ { last.extra.num #1 + 'last.extra.num :=
1909
+ last.extra.num int.to.chr$ 'extra.label :=
1910
+ }
1911
+ { "a" chr.to.int$ 'last.extra.num :=
1912
+ "" 'extra.label :=
1913
+ label 'last.label :=
1914
+ }
1915
+ if$
1916
+ number.label #1 + 'number.label :=
1917
+ }
1918
+ FUNCTION {reverse.pass}
1919
+ { next.extra "b" =
1920
+ { "a" 'extra.label := }
1921
+ 'skip$
1922
+ if$
1923
+ extra.label 'next.extra :=
1924
+ extra.label
1925
+ duplicate$ empty$
1926
+ 'skip$
1927
+ { year field.or.null #-1 #1 substring$ chr.to.int$ #65 <
1928
+ { "{\natexlab{" swap$ * "}}" * }
1929
+ { "{(\natexlab{" swap$ * "})}" * }
1930
+ if$ }
1931
+ if$
1932
+ 'extra.label :=
1933
+ label extra.label * 'label :=
1934
+ }
1935
+ EXECUTE {initialize.extra.label.stuff}
1936
+ ITERATE {forward.pass}
1937
+ REVERSE {reverse.pass}
1938
+ FUNCTION {bib.sort.order}
1939
+ { sort.label
1940
+ " "
1941
+ *
1942
+ year field.or.null sortify
1943
+ *
1944
+ " "
1945
+ *
1946
+ title field.or.null
1947
+ sort.format.title
1948
+ *
1949
+ #1 entry.max$ substring$
1950
+ 'sort.key$ :=
1951
+ }
1952
+ ITERATE {bib.sort.order}
1953
+ SORT
1954
+ FUNCTION {begin.bib}
1955
+ { preamble$ empty$
1956
+ 'skip$
1957
+ { preamble$ write$ newline$ }
1958
+ if$
1959
+ "\begin{thebibliography}{" number.label int.to.str$ * "}" *
1960
+ write$ newline$
1961
+ "\expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi"
1962
+ write$ newline$
1963
+ }
1964
+ EXECUTE {begin.bib}
1965
+ EXECUTE {init.urlbst.variables} % urlbst
1966
+ EXECUTE {init.state.consts}
1967
+ ITERATE {call.type$}
1968
+ FUNCTION {end.bib}
1969
+ { newline$
1970
+ "\end{thebibliography}" write$ newline$
1971
+ }
1972
+ EXECUTE {end.bib}
1973
+ %% End of customized bst file
1974
+ %%
1975
+ %% End of file `compling.bst'.
references/2020.emnlp.nguyen/source/emnlp2020.sty ADDED
@@ -0,0 +1,560 @@
1
+ % This is the LaTeX style file for EMNLP 2020, based off of ACL 2020.
2
+
3
+ % Addressing bibtex issues mentioned in https://github.com/acl-org/acl-pub/issues/2
4
+ % Other major modifications include
5
+ % changing the color of the line numbers to a light gray; changing font size of abstract to be 10pt; changing caption font size to be 10pt.
6
+ % -- M Mitchell and Stephanie Lukin
7
+
8
+ % 2017: modified to support DOI links in bibliography. Now uses
9
+ % natbib package rather than defining citation commands in this file.
10
+ % Use with acl_natbib.bst bib style. -- Dan Gildea
11
+
12
+ % This is the LaTeX style for ACL 2016. It contains Margaret Mitchell's
13
+ % line number adaptations (ported by Hai Zhao and Yannick Versley).
14
+
15
+ % It is nearly identical to the style files for ACL 2015,
16
+ % ACL 2014, EACL 2006, ACL2005, ACL 2002, ACL 2001, ACL 2000,
17
+ % EACL 95 and EACL 99.
18
+ %
19
+ % Changes made include: adapt layout to A4 and centimeters, widen abstract
20
+
21
+ % This is the LaTeX style file for ACL 2000. It is nearly identical to the
22
+ % style files for EACL 95 and EACL 99. Minor changes include editing the
23
+ % instructions to reflect use of \documentclass rather than \documentstyle
24
+ % and removing the white space before the title on the first page
25
+ % -- John Chen, June 29, 2000
26
+
27
+ % This is the LaTeX style file for EACL-95. It is identical to the
28
+ % style file for ANLP '94 except that the margins are adjusted for A4
29
+ % paper. -- abney 13 Dec 94
30
+
31
+ % The ANLP '94 style file is a slightly modified
32
+ % version of the style used for AAAI and IJCAI, using some changes
33
+ % prepared by Fernando Pereira and others and some minor changes
34
+ % by Paul Jacobs.
35
+
36
+ % Papers prepared using the aclsub.sty file and acl.bst bibtex style
37
+ % should be easily converted to final format using this style.
38
+ % (1) Submission information (\wordcount, \subject, and \makeidpage)
39
+ % should be removed.
40
+ % (2) \summary should be removed. The summary material should come
41
+ % after \maketitle and should be in the ``abstract'' environment
42
+ % (between \begin{abstract} and \end{abstract}).
43
+ % (3) Check all citations. This style should handle citations correctly
44
+ % and also allows multiple citations separated by semicolons.
45
+ % (4) Check figures and examples. Because the final format is double-
46
+ % column, some adjustments may have to be made to fit text in the column
47
+ % or to choose full-width (\figure*} figures.
48
+
49
+ % Place this in a file called aclap.sty in the TeX search path.
50
+ % (Placing it in the same directory as the paper should also work.)
51
+
52
+ % Prepared by Peter F. Patel-Schneider, liberally using the ideas of
53
+ % other style hackers, including Barbara Beeton.
54
+ % This style is NOT guaranteed to work. It is provided in the hope
55
+ % that it will make the preparation of papers easier.
56
+ %
57
+ % There are undoubtably bugs in this style. If you make bug fixes,
58
+ % improvements, etc. please let me know. My e-mail address is:
59
+ % pfps@research.att.com
60
+
61
+ % Papers are to be prepared using the ``acl_natbib'' bibliography style,
62
+ % as follows:
63
+ % \documentclass[11pt]{article}
64
+ % \usepackage{acl2000}
65
+ % \title{Title}
66
+ % \author{Author 1 \and Author 2 \\ Address line \\ Address line \And
67
+ % Author 3 \\ Address line \\ Address line}
68
+ % \begin{document}
69
+ % ...
70
+ % \bibliography{bibliography-file}
71
+ % \bibliographystyle{acl_natbib}
72
+ % \end{document}
73
+
74
+ % Author information can be set in various styles:
75
+ % For several authors from the same institution:
76
+ % \author{Author 1 \and ... \and Author n \\
77
+ % Address line \\ ... \\ Address line}
78
+ % if the names do not fit well on one line use
79
+ % Author 1 \\ {\bf Author 2} \\ ... \\ {\bf Author n} \\
80
+ % For authors from different institutions:
81
+ % \author{Author 1 \\ Address line \\ ... \\ Address line
82
+ % \And ... \And
83
+ % Author n \\ Address line \\ ... \\ Address line}
84
+ % To start a seperate ``row'' of authors use \AND, as in
85
+ % \author{Author 1 \\ Address line \\ ... \\ Address line
86
+ % \AND
87
+ % Author 2 \\ Address line \\ ... \\ Address line \And
88
+ % Author 3 \\ Address line \\ ... \\ Address line}
89
+
90
+ % If the title and author information does not fit in the area allocated,
91
+ % place \setlength\titlebox{<new height>} right after
92
+ % \usepackage{acl2015}
93
+ % where <new height> can be something larger than 5cm
94
+
95
+ % include hyperref, unless user specifies nohyperref option like this:
96
+ % \usepackage[nohyperref]{naaclhlt2018}
97
+ \newif\ifacl@hyperref
98
+ \DeclareOption{hyperref}{\acl@hyperreftrue}
99
+ \DeclareOption{nohyperref}{\acl@hyperreffalse}
100
+ \ExecuteOptions{hyperref} % default is to use hyperref
101
+ \ProcessOptions\relax
102
+ \ifacl@hyperref
103
+ \RequirePackage{hyperref}
104
+ \usepackage{xcolor} % make links dark blue
105
+ \definecolor{darkblue}{rgb}{0, 0, 0.5}
106
+ \hypersetup{colorlinks=true,citecolor=darkblue, linkcolor=darkblue, urlcolor=darkblue}
107
+ \else
108
+ % This definition is used if the hyperref package is not loaded.
109
+ % It provides a backup, no-op definiton of \href.
110
+ % This is necessary because \href command is used in the acl_natbib.bst file.
111
+ \def\href#1#2{{#2}}
112
+ % We still need to load xcolor in this case because the lighter line numbers require it. (SC/KG/WL)
113
+ \usepackage{xcolor}
114
+ \fi
115
+
116
+ \typeout{Conference Style for EMNLP 2020}
117
+
118
+ % NOTE: Some laser printers have a serious problem printing TeX output.
119
+ % These printing devices, commonly known as ``write-white'' laser
120
+ % printers, tend to make characters too light. To get around this
121
+ % problem, a darker set of fonts must be created for these devices.
122
+ %
123
+
124
+ \newcommand{\Thanks}[1]{\thanks{\ #1}}
125
+
126
+ % A4 modified by Eneko; again modified by Alexander for 5cm titlebox
127
+ \setlength{\paperwidth}{21cm} % A4
128
+ \setlength{\paperheight}{29.7cm}% A4
129
+ \setlength\topmargin{-0.5cm}
130
+ \setlength\oddsidemargin{0cm}
131
+ \setlength\textheight{24.7cm}
132
+ \setlength\textwidth{16.0cm}
133
+ \setlength\columnsep{0.6cm}
134
+ \newlength\titlebox
135
+ \setlength\titlebox{5cm}
136
+ \setlength\headheight{5pt}
137
+ \setlength\headsep{0pt}
138
+ \thispagestyle{empty}
139
+ \pagestyle{empty}
140
+
141
+
142
+ \flushbottom \twocolumn \sloppy
143
+
144
+ % We're never going to need a table of contents, so just flush it to
145
+ % save space --- suggested by drstrip@sandia-2
146
+ \def\addcontentsline#1#2#3{}
147
+
148
+ \newif\ifaclfinal
149
+ \aclfinalfalse
150
+ \def\aclfinalcopy{\global\aclfinaltrue}
151
+
152
+ %% ----- Set up hooks to repeat content on every page of the output doc,
153
+ %% necessary for the line numbers in the submitted version. --MM
154
+ %%
155
+ %% Copied from CVPR 2015's cvpr_eso.sty, which appears to be largely copied from everyshi.sty.
156
+ %%
157
+ %% Original cvpr_eso.sty available at: http://www.pamitc.org/cvpr15/author_guidelines.php
158
+ %% Original evershi.sty available at: https://www.ctan.org/pkg/everyshi
159
+ %%
160
+ %% Copyright (C) 2001 Martin Schr\"oder:
161
+ %%
162
+ %% Martin Schr"oder
163
+ %% Cr"usemannallee 3
164
+ %% D-28213 Bremen
165
+ %% Martin.Schroeder@ACM.org
166
+ %%
167
+ %% This program may be redistributed and/or modified under the terms
168
+ %% of the LaTeX Project Public License, either version 1.0 of this
169
+ %% license, or (at your option) any later version.
170
+ %% The latest version of this license is in
171
+ %% CTAN:macros/latex/base/lppl.txt.
172
+ %%
173
+ %% Happy users are requested to send [Martin] a postcard. :-)
174
+ %%
175
+ \newcommand{\@EveryShipoutACL@Hook}{}
176
+ \newcommand{\@EveryShipoutACL@AtNextHook}{}
177
+ \newcommand*{\EveryShipoutACL}[1]
178
+ {\g@addto@macro\@EveryShipoutACL@Hook{#1}}
179
+ \newcommand*{\AtNextShipoutACL@}[1]
180
+ {\g@addto@macro\@EveryShipoutACL@AtNextHook{#1}}
181
+ \newcommand{\@EveryShipoutACL@Shipout}{%
182
+ \afterassignment\@EveryShipoutACL@Test
183
+ \global\setbox\@cclv= %
184
+ }
185
+ \newcommand{\@EveryShipoutACL@Test}{%
186
+ \ifvoid\@cclv\relax
187
+ \aftergroup\@EveryShipoutACL@Output
188
+ \else
189
+ \@EveryShipoutACL@Output
190
+ \fi%
191
+ }
192
+ \newcommand{\@EveryShipoutACL@Output}{%
193
+ \@EveryShipoutACL@Hook%
194
+ \@EveryShipoutACL@AtNextHook%
195
+ \gdef\@EveryShipoutACL@AtNextHook{}%
196
+ \@EveryShipoutACL@Org@Shipout\box\@cclv%
197
+ }
198
+ \newcommand{\@EveryShipoutACL@Org@Shipout}{}
199
+ \newcommand*{\@EveryShipoutACL@Init}{%
200
+ \message{ABD: EveryShipout initializing macros}%
201
+ \let\@EveryShipoutACL@Org@Shipout\shipout
202
+ \let\shipout\@EveryShipoutACL@Shipout
203
+ }
204
+ \AtBeginDocument{\@EveryShipoutACL@Init}
205
+
206
+ %% ----- Set up for placing additional items into the submitted version --MM
207
+ %%
208
+ %% Based on eso-pic.sty
209
+ %%
210
+ %% Original available at: https://www.ctan.org/tex-archive/macros/latex/contrib/eso-pic
211
+ %% Copyright (C) 1998-2002 by Rolf Niepraschk <niepraschk@ptb.de>
212
+ %%
213
+ %% Which may be distributed and/or modified under the conditions of
214
+ %% the LaTeX Project Public License, either version 1.2 of this license
215
+ %% or (at your option) any later version. The latest version of this
216
+ %% license is in:
217
+ %%
218
+ %% http://www.latex-project.org/lppl.txt
219
+ %%
220
+ %% and version 1.2 or later is part of all distributions of LaTeX version
221
+ %% 1999/12/01 or later.
222
+ %%
223
+ %% In contrast to the original, we do not include the definitions for/using:
224
+ %% gridpicture, div[2], isMEMOIR[1], gridSetup[6][], subgridstyle{dotted}, labelfactor{}, gap{}, gridunitname{}, gridunit{}, gridlines{\thinlines}, subgridlines{\thinlines}, the {keyval} package, evenside margin, nor any definitions with 'color'.
225
+ %%
226
+ %% These are beyond what is needed for the NAACL/ACL style.
227
+ %%
228
+ \newcommand\LenToUnit[1]{#1\@gobble}
229
+ \newcommand\AtPageUpperLeft[1]{%
230
+ \begingroup
231
+ \@tempdima=0pt\relax\@tempdimb=\ESO@yoffsetI\relax
232
+ \put(\LenToUnit{\@tempdima},\LenToUnit{\@tempdimb}){#1}%
233
+ \endgroup
234
+ }
235
+ \newcommand\AtPageLowerLeft[1]{\AtPageUpperLeft{%
236
+ \put(0,\LenToUnit{-\paperheight}){#1}}}
237
+ \newcommand\AtPageCenter[1]{\AtPageUpperLeft{%
238
+ \put(\LenToUnit{.5\paperwidth},\LenToUnit{-.5\paperheight}){#1}}}
239
+ \newcommand\AtPageLowerCenter[1]{\AtPageUpperLeft{%
240
+ \put(\LenToUnit{.5\paperwidth},\LenToUnit{-\paperheight}){#1}}}%
241
+ \newcommand\AtPageLowishCenter[1]{\AtPageUpperLeft{%
242
+ \put(\LenToUnit{.5\paperwidth},\LenToUnit{-.96\paperheight}){#1}}}
243
+ \newcommand\AtTextUpperLeft[1]{%
244
+ \begingroup
245
+ \setlength\@tempdima{1in}%
246
+ \advance\@tempdima\oddsidemargin%
247
+ \@tempdimb=\ESO@yoffsetI\relax\advance\@tempdimb-1in\relax%
248
+ \advance\@tempdimb-\topmargin%
249
+ \advance\@tempdimb-\headheight\advance\@tempdimb-\headsep%
250
+ \put(\LenToUnit{\@tempdima},\LenToUnit{\@tempdimb}){#1}%
251
+ \endgroup
252
+ }
253
+ \newcommand\AtTextLowerLeft[1]{\AtTextUpperLeft{%
254
+ \put(0,\LenToUnit{-\textheight}){#1}}}
255
+ \newcommand\AtTextCenter[1]{\AtTextUpperLeft{%
256
+ \put(\LenToUnit{.5\textwidth},\LenToUnit{-.5\textheight}){#1}}}
257
+ \newcommand{\ESO@HookI}{} \newcommand{\ESO@HookII}{}
258
+ \newcommand{\ESO@HookIII}{}
259
+ \newcommand{\AddToShipoutPicture}{%
260
+ \@ifstar{\g@addto@macro\ESO@HookII}{\g@addto@macro\ESO@HookI}}
261
+ \newcommand{\ClearShipoutPicture}{\global\let\ESO@HookI\@empty}
262
+ \newcommand{\@ShipoutPicture}{%
263
+ \bgroup
264
+ \@tempswafalse%
265
+ \ifx\ESO@HookI\@empty\else\@tempswatrue\fi%
266
+ \ifx\ESO@HookII\@empty\else\@tempswatrue\fi%
267
+ \ifx\ESO@HookIII\@empty\else\@tempswatrue\fi%
268
+ \if@tempswa%
269
+ \@tempdima=1in\@tempdimb=-\@tempdima%
270
+ \advance\@tempdimb\ESO@yoffsetI%
271
+ \unitlength=1pt%
272
+ \global\setbox\@cclv\vbox{%
273
+ \vbox{\let\protect\relax
274
+ \pictur@(0,0)(\strip@pt\@tempdima,\strip@pt\@tempdimb)%
275
+ \ESO@HookIII\ESO@HookI\ESO@HookII%
276
+ \global\let\ESO@HookII\@empty%
277
+ \endpicture}%
278
+ \nointerlineskip%
279
+ \box\@cclv}%
280
+ \fi
281
+ \egroup
282
+ }
283
+ \EveryShipoutACL{\@ShipoutPicture}
284
+ \newif\ifESO@dvips\ESO@dvipsfalse
285
+ \newif\ifESO@grid\ESO@gridfalse
286
+ \newif\ifESO@texcoord\ESO@texcoordfalse
287
+ \newcommand*\ESO@griddelta{}\newcommand*\ESO@griddeltaY{}
288
+ \newcommand*\ESO@gridDelta{}\newcommand*\ESO@gridDeltaY{}
289
+ \newcommand*\ESO@yoffsetI{}\newcommand*\ESO@yoffsetII{}
290
+ \ifESO@texcoord
291
+ \def\ESO@yoffsetI{0pt}\def\ESO@yoffsetII{-\paperheight}
292
+ \edef\ESO@griddeltaY{-\ESO@griddelta}\edef\ESO@gridDeltaY{-\ESO@gridDelta}
293
+ \else
294
+ \def\ESO@yoffsetI{\paperheight}\def\ESO@yoffsetII{0pt}
295
+ \edef\ESO@griddeltaY{\ESO@griddelta}\edef\ESO@gridDeltaY{\ESO@gridDelta}
296
+ \fi
297
+
298
+
299
+ %% ----- Submitted version markup: Page numbers, ruler, and confidentiality. Using ideas/code from cvpr.sty 2015. --MM
300
+
301
+ \font\aclhv = phvb at 8pt
302
+
303
+ %% Define vruler %%
304
+
305
+ %\makeatletter
306
+ \newbox\aclrulerbox
307
+ \newcount\aclrulercount
308
+ \newdimen\aclruleroffset
309
+ \newdimen\cv@lineheight
310
+ \newdimen\cv@boxheight
311
+ \newbox\cv@tmpbox
312
+ \newcount\cv@refno
313
+ \newcount\cv@tot
314
+ % NUMBER with left flushed zeros \fillzeros[<WIDTH>]<NUMBER>
315
+ \newcount\cv@tmpc@ \newcount\cv@tmpc
316
+ \def\fillzeros[#1]#2{\cv@tmpc@=#2\relax\ifnum\cv@tmpc@<0\cv@tmpc@=-\cv@tmpc@\fi
317
+ \cv@tmpc=1 %
318
+ \loop\ifnum\cv@tmpc@<10 \else \divide\cv@tmpc@ by 10 \advance\cv@tmpc by 1 \fi
319
+ \ifnum\cv@tmpc@=10\relax\cv@tmpc@=11\relax\fi \ifnum\cv@tmpc@>10 \repeat
320
+ \ifnum#2<0\advance\cv@tmpc1\relax-\fi
321
+ \loop\ifnum\cv@tmpc<#1\relax0\advance\cv@tmpc1\relax\fi \ifnum\cv@tmpc<#1 \repeat
322
+ \cv@tmpc@=#2\relax\ifnum\cv@tmpc@<0\cv@tmpc@=-\cv@tmpc@\fi \relax\the\cv@tmpc@}%
323
+ % \makevruler[<SCALE>][<INITIAL_COUNT>][<STEP>][<DIGITS>][<HEIGHT>]
324
+ \def\makevruler[#1][#2][#3][#4][#5]{\begingroup\offinterlineskip
325
+ \textheight=#5\vbadness=10000\vfuzz=120ex\overfullrule=0pt%
326
+ \global\setbox\aclrulerbox=\vbox to \textheight{%
327
+ {\parskip=0pt\hfuzz=150em\cv@boxheight=\textheight
328
+ \color{gray}
329
+ \cv@lineheight=#1\global\aclrulercount=#2%
330
+ \cv@tot\cv@boxheight\divide\cv@tot\cv@lineheight\advance\cv@tot2%
331
+ \cv@refno1\vskip-\cv@lineheight\vskip1ex%
332
+ \loop\setbox\cv@tmpbox=\hbox to0cm{{\aclhv\hfil\fillzeros[#4]\aclrulercount}}%
333
+ \ht\cv@tmpbox\cv@lineheight\dp\cv@tmpbox0pt\box\cv@tmpbox\break
334
+ \advance\cv@refno1\global\advance\aclrulercount#3\relax
335
+ \ifnum\cv@refno<\cv@tot\repeat}}\endgroup}%
336
+ %\makeatother
337
+
338
+
339
+ \def\aclpaperid{***}
340
+ \def\confidential{\textcolor{black}{EMNLP 2020 Submission~\aclpaperid. Confidential Review Copy. DO NOT DISTRIBUTE.}}
341
+
342
+ %% Page numbering, Vruler and Confidentiality %%
343
+ % \makevruler[<SCALE>][<INITIAL_COUNT>][<STEP>][<DIGITS>][<HEIGHT>]
344
+
345
+ % SC/KG/WL - changed line numbering to gainsboro
346
+ \definecolor{gainsboro}{rgb}{0.8, 0.8, 0.8}
347
+ %\def\aclruler#1{\makevruler[14.17pt][#1][1][3][\textheight]\usebox{\aclrulerbox}} %% old line
348
+ \def\aclruler#1{\textcolor{gainsboro}{\makevruler[14.17pt][#1][1][3][\textheight]\usebox{\aclrulerbox}}}
349
+
350
+ \def\leftoffset{-2.1cm} %original: -45pt
351
+ \def\rightoffset{17.5cm} %original: 500pt
352
+ \ifaclfinal\else\pagenumbering{arabic}
353
+ \AddToShipoutPicture{%
354
+ \ifaclfinal\else
355
+ \AtPageLowishCenter{\textcolor{black}{\thepage}}
356
+ \aclruleroffset=\textheight
357
+ \advance\aclruleroffset4pt
358
+ \AtTextUpperLeft{%
359
+ \put(\LenToUnit{\leftoffset},\LenToUnit{-\aclruleroffset}){%left ruler
360
+ \aclruler{\aclrulercount}}
361
+ \put(\LenToUnit{\rightoffset},\LenToUnit{-\aclruleroffset}){%right ruler
362
+ \aclruler{\aclrulercount}}
363
+ }
364
+ \AtTextUpperLeft{%confidential
365
+ \put(0,\LenToUnit{1cm}){\parbox{\textwidth}{\centering\aclhv\confidential}}
366
+ }
367
+ \fi
368
+ }
369
+
370
+ %%%% ----- End settings for placing additional items into the submitted version --MM ----- %%%%
371
+
372
+ %%%% ----- Begin settings for both submitted and camera-ready version ----- %%%%
373
+
374
+ %% Title and Authors %%
375
+
376
+ \newcommand\outauthor{
377
+ \begin{tabular}[t]{c}
378
+ \ifaclfinal
379
+ \bf\@author
380
+ \else
381
+ % Avoiding common accidental de-anonymization issue. --MM
382
+ \bf Anonymous EMNLP submission
383
+ \fi
384
+ \end{tabular}}
385
+
386
+ % Changing the expanded titlebox for submissions to 2.5 in (rather than 6.5cm)
387
+ % and moving it to the style sheet, rather than within the example tex file. --MM
388
+ \ifaclfinal
389
+ \else
390
+ \addtolength\titlebox{.25in}
391
+ \fi
392
+ % Mostly taken from deproc.
393
+ \def\maketitle{\par
394
+ \begingroup
395
+ \def\thefootnote{\fnsymbol{footnote}}
396
+ \def\@makefnmark{\hbox to 0pt{$^{\@thefnmark}$\hss}}
397
+ \twocolumn[\@maketitle] \@thanks
398
+ \endgroup
399
+ \setcounter{footnote}{0}
400
+ \let\maketitle\relax \let\@maketitle\relax
401
+ \gdef\@thanks{}\gdef\@author{}\gdef\@title{}\let\thanks\relax}
402
+ \def\@maketitle{\vbox to \titlebox{\hsize\textwidth
403
+ \linewidth\hsize \vskip 0.125in minus 0.125in \centering
404
+ {\Large\bf \@title \par} \vskip 0.2in plus 1fil minus 0.1in
405
+ {\def\and{\unskip\enspace{\rm and}\enspace}%
406
+ \def\And{\end{tabular}\hss \egroup \hskip 1in plus 2fil
407
+ \hbox to 0pt\bgroup\hss \begin{tabular}[t]{c}\bf}%
408
+ \def\AND{\end{tabular}\hss\egroup \hfil\hfil\egroup
409
+ \vskip 0.25in plus 1fil minus 0.125in
410
+ \hbox to \linewidth\bgroup\large \hfil\hfil
411
+ \hbox to 0pt\bgroup\hss \begin{tabular}[t]{c}\bf}
412
+ \hbox to \linewidth\bgroup\large \hfil\hfil
413
+ \hbox to 0pt\bgroup\hss
414
+ \outauthor
415
+ \hss\egroup
416
+ \hfil\hfil\egroup}
417
+ \vskip 0.3in plus 2fil minus 0.1in
418
+ }}
419
+
420
+ % margins and font size for abstract
421
+ \renewenvironment{abstract}%
422
+ {\centerline{\large\bf Abstract}%
423
+ \begin{list}{}%
424
+ {\setlength{\rightmargin}{0.6cm}%
425
+ \setlength{\leftmargin}{0.6cm}}%
426
+ \item[]\ignorespaces%
427
+ \@setsize\normalsize{12pt}\xpt\@xpt
428
+ }%
429
+ {\unskip\end{list}}
430
+
431
+ %\renewenvironment{abstract}{\centerline{\large\bf
432
+ % Abstract}\vspace{0.5ex}\begin{quote}}{\par\end{quote}\vskip 1ex}
433
+
434
+ % Resizing figure and table captions - SL
435
+ \newcommand{\figcapfont}{\rm}
436
+ \newcommand{\tabcapfont}{\rm}
437
+ \renewcommand{\fnum@figure}{\figcapfont Figure \thefigure}
438
+ \renewcommand{\fnum@table}{\tabcapfont Table \thetable}
439
+ \renewcommand{\figcapfont}{\@setsize\normalsize{12pt}\xpt\@xpt}
440
+ \renewcommand{\tabcapfont}{\@setsize\normalsize{12pt}\xpt\@xpt}
441
+ % Support for interacting with the caption, subfigure, and subcaption packages - SL
442
+ \usepackage{caption}
443
+ \DeclareCaptionFont{10pt}{\fontsize{10pt}{12pt}\selectfont}
444
+ \captionsetup{font=10pt}
445
+
446
+ \RequirePackage{natbib}
447
+ % for citation commands in the .tex, authors can use:
448
+ % \citep, \citet, and \citeyearpar for compatibility with natbib, or
449
+ % \cite, \newcite, and \shortcite for compatibility with older ACL .sty files
450
+ \renewcommand\cite{\citep} % to get "(Author Year)" with natbib
451
+ \newcommand\shortcite{\citeyearpar}% to get "(Year)" with natbib
452
+ \newcommand\newcite{\citet} % to get "Author (Year)" with natbib
453
+
454
+ % DK/IV: Workaround for annoying hyperref pagewrap bug
455
+ %\RequirePackage{etoolbox}
456
+ %\patchcmd\@combinedblfloats{\box\@outputbox}{\unvbox\@outputbox}{}{\errmessage{\noexpand patch failed}}
457
+
458
+ % bibliography
459
+
460
+ \def\@up#1{\raise.2ex\hbox{#1}}
461
+
462
+ % Don't put a label in the bibliography at all. Just use the unlabeled format
463
+ % instead.
464
+ \def\thebibliography#1{\vskip\parskip%
465
+ \vskip\baselineskip%
466
+ \def\baselinestretch{1}%
467
+ \ifx\@currsize\normalsize\@normalsize\else\@currsize\fi%
468
+ \vskip-\parskip%
469
+ \vskip-\baselineskip%
470
+ \section*{References\@mkboth
471
+ {References}{References}}\list
472
+ {}{\setlength{\labelwidth}{0pt}\setlength{\leftmargin}{\parindent}
473
+ \setlength{\itemindent}{-\parindent}}
474
+ \def\newblock{\hskip .11em plus .33em minus -.07em}
475
+ \sloppy\clubpenalty4000\widowpenalty4000
476
+ \sfcode`\.=1000\relax}
477
+ \let\endthebibliography=\endlist
478
+
479
+
480
+ % Allow for a bibliography of sources of attested examples
481
+ \def\thesourcebibliography#1{\vskip\parskip%
482
+ \vskip\baselineskip%
483
+ \def\baselinestretch{1}%
484
+ \ifx\@currsize\normalsize\@normalsize\else\@currsize\fi%
485
+ \vskip-\parskip%
486
+ \vskip-\baselineskip%
487
+ \section*{Sources of Attested Examples\@mkboth
488
+ {Sources of Attested Examples}{Sources of Attested Examples}}\list
489
+ {}{\setlength{\labelwidth}{0pt}\setlength{\leftmargin}{\parindent}
490
+ \setlength{\itemindent}{-\parindent}}
491
+ \def\newblock{\hskip .11em plus .33em minus -.07em}
492
+ \sloppy\clubpenalty4000\widowpenalty4000
493
+ \sfcode`\.=1000\relax}
494
+ \let\endthesourcebibliography=\endlist
495
+
496
+ % sections with less space
497
+ \def\section{\@startsection {section}{1}{\z@}{-2.0ex plus
498
+ -0.5ex minus -.2ex}{1.5ex plus 0.3ex minus .2ex}{\large\bf\raggedright}}
499
+ \def\subsection{\@startsection{subsection}{2}{\z@}{-1.8ex plus
500
+ -0.5ex minus -.2ex}{0.8ex plus .2ex}{\normalsize\bf\raggedright}}
501
+ %% changed by KO to - values to get teh initial parindent right
502
+ \def\subsubsection{\@startsection{subsubsection}{3}{\z@}{-1.5ex plus
503
+ -0.5ex minus -.2ex}{0.5ex plus .2ex}{\normalsize\bf\raggedright}}
504
+ \def\paragraph{\@startsection{paragraph}{4}{\z@}{1.5ex plus
505
+ 0.5ex minus .2ex}{-1em}{\normalsize\bf}}
506
+ \def\subparagraph{\@startsection{subparagraph}{5}{\parindent}{1.5ex plus
507
+ 0.5ex minus .2ex}{-1em}{\normalsize\bf}}
508
+
509
+ % Footnotes
510
+ \footnotesep 6.65pt %
511
+ \skip\footins 9pt plus 4pt minus 2pt
512
+ \def\footnoterule{\kern-3pt \hrule width 5pc \kern 2.6pt }
513
+ \setcounter{footnote}{0}
514
+
515
+ % Lists and paragraphs
516
+ \parindent 1em
517
+ \topsep 4pt plus 1pt minus 2pt
518
+ \partopsep 1pt plus 0.5pt minus 0.5pt
519
+ \itemsep 2pt plus 1pt minus 0.5pt
520
+ \parsep 2pt plus 1pt minus 0.5pt
521
+
522
+ \leftmargin 2em \leftmargini\leftmargin \leftmarginii 2em
523
+ \leftmarginiii 1.5em \leftmarginiv 1.0em \leftmarginv .5em \leftmarginvi .5em
524
+ \labelwidth\leftmargini\advance\labelwidth-\labelsep \labelsep 5pt
525
+
526
+ \def\@listi{\leftmargin\leftmargini}
527
+ \def\@listii{\leftmargin\leftmarginii
528
+ \labelwidth\leftmarginii\advance\labelwidth-\labelsep
529
+ \topsep 2pt plus 1pt minus 0.5pt
530
+ \parsep 1pt plus 0.5pt minus 0.5pt
531
+ \itemsep \parsep}
532
+ \def\@listiii{\leftmargin\leftmarginiii
533
+ \labelwidth\leftmarginiii\advance\labelwidth-\labelsep
534
+ \topsep 1pt plus 0.5pt minus 0.5pt
535
+ \parsep \z@ \partopsep 0.5pt plus 0pt minus 0.5pt
536
+ \itemsep \topsep}
537
+ \def\@listiv{\leftmargin\leftmarginiv
538
+ \labelwidth\leftmarginiv\advance\labelwidth-\labelsep}
539
+ \def\@listv{\leftmargin\leftmarginv
540
+ \labelwidth\leftmarginv\advance\labelwidth-\labelsep}
541
+ \def\@listvi{\leftmargin\leftmarginvi
542
+ \labelwidth\leftmarginvi\advance\labelwidth-\labelsep}
543
+
544
+ \abovedisplayskip 7pt plus2pt minus5pt%
545
+ \belowdisplayskip \abovedisplayskip
546
+ \abovedisplayshortskip 0pt plus3pt%
547
+ \belowdisplayshortskip 4pt plus3pt minus3pt%
548
+
549
+ % Less leading in most fonts (due to the narrow columns)
550
+ % The choices were between 1-pt and 1.5-pt leading
551
+ \def\@normalsize{\@setsize\normalsize{11pt}\xpt\@xpt}
552
+ \def\small{\@setsize\small{10pt}\ixpt\@ixpt}
553
+ \def\footnotesize{\@setsize\footnotesize{10pt}\ixpt\@ixpt}
554
+ \def\scriptsize{\@setsize\scriptsize{8pt}\viipt\@viipt}
555
+ \def\tiny{\@setsize\tiny{7pt}\vipt\@vipt}
556
+ \def\large{\@setsize\large{14pt}\xiipt\@xiipt}
557
+ \def\Large{\@setsize\Large{16pt}\xivpt\@xivpt}
558
+ \def\LARGE{\@setsize\LARGE{20pt}\xviipt\@xviipt}
559
+ \def\huge{\@setsize\huge{23pt}\xxpt\@xxpt}
560
+ \def\Huge{\@setsize\Huge{28pt}\xxvpt\@xxvpt}
references/2020.emnlp.nguyen/source/emnlp2020_PhoBERT.bbl ADDED
@@ -0,0 +1,227 @@
1
+ \begin{thebibliography}{34}
2
+ \expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi
3
+
4
+ \bibitem[{Artetxe and Schwenk(2019)}]{ArtetxeS19}
5
+ Mikel Artetxe and Holger Schwenk. 2019.
6
+ \newblock {Massively Multilingual Sentence Embeddings for Zero-Shot
7
+ Cross-Lingual Transfer and Beyond}.
8
+ \newblock \emph{{TACL}}, 7:597--610.
9
+
10
+ \bibitem[{Conneau et~al.(2020)Conneau, Khandelwal, Goyal, Chaudhary, Wenzek,
11
+ Guzm{\'a}n, Grave, Ott, Zettlemoyer, and Stoyanov}]{conneau2019unsupervised}
12
+ Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume
13
+ Wenzek, Francisco Guzm{\'a}n, Edouard Grave, Myle Ott, Luke Zettlemoyer, and
14
+ Veselin Stoyanov. 2020.
15
+ \newblock \href {https://arxiv.org/pdf/1911.02116v1.pdf} {{Unsupervised
16
+ Cross-lingual Representation Learning at Scale}}.
17
+ \newblock In \emph{Proceedings of ACL}, pages 8440--8451.
18
+
19
+ \bibitem[{Conneau and Lample(2019)}]{NIPS2019_8928}
20
+ Alexis Conneau and Guillaume Lample. 2019.
21
+ \newblock {Cross-lingual Language Model Pretraining}.
22
+ \newblock In \emph{Proceedings of NeurIPS}, pages 7059--7069.
23
+
24
+ \bibitem[{Conneau et~al.(2018)Conneau, Rinott, Lample, Schwenk, Stoyanov,
25
+ Williams, and Bowman}]{conneau-etal-2018-xnli}
26
+ Alexis Conneau, Ruty Rinott, Guillaume Lample, Holger Schwenk, Ves Stoyanov,
27
+ Adina Williams, and Samuel~R. Bowman. 2018.
28
+ \newblock {XNLI}: Evaluating cross-lingual sentence representations.
29
+ \newblock In \emph{Proceedings of EMNLP}, pages 2475--2485.
30
+
31
+ \bibitem[{Cui et~al.(2019)Cui, Che, Liu, Qin, Yang, Wang, and
32
+ Hu}]{abs-1906-08101}
33
+ Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, and
34
+ Guoping Hu. 2019.
35
+ \newblock {Pre-Training with Whole Word Masking for Chinese BERT}.
36
+ \newblock \emph{arXiv preprint}, arXiv:1906.08101.
37
+
38
+ \bibitem[{Devlin et~al.(2019)Devlin, Chang, Lee, and
39
+ Toutanova}]{devlin-etal-2019-bert}
40
+ Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019.
41
+ \newblock {BERT}: Pre-training of deep bidirectional transformers for language
42
+ understanding.
43
+ \newblock In \emph{Proceedings of NAACL}, pages 4171--4186.
44
+
45
+ \bibitem[{Dozat and Manning(2017)}]{DozatM17}
46
+ Timothy Dozat and Christopher~D. Manning. 2017.
47
+ \newblock {Deep Biaffine Attention for Neural Dependency Parsing}.
48
+ \newblock In \emph{Proceedings of ICLR}.
49
+
50
+ \bibitem[{Hewitt and Manning(2019)}]{hewitt-manning-2019-structural}
51
+ John Hewitt and Christopher~D. Manning. 2019.
52
+ \newblock {A} structural probe for finding syntax in word representations.
53
+ \newblock In \emph{Proceedings of NAACL}, pages 4129--4138.
54
+
55
+ \bibitem[{Jawahar et~al.(2019)Jawahar, Sagot, and
56
+ Seddah}]{jawahar-etal-2019-bert}
57
+ Ganesh Jawahar, Beno{\^\i}t Sagot, and Djam{\'e} Seddah. 2019.
58
+ \newblock What does {BERT} learn about the structure of language?
59
+ \newblock In \emph{Proceedings of ACL}, pages 3651--3657.
60
+
61
+ \bibitem[{Kingma and Ba(2014)}]{KingmaB14}
62
+ Diederik~P. Kingma and Jimmy Ba. 2014.
63
+ \newblock {Adam: {A} Method for Stochastic Optimization}.
64
+ \newblock \emph{arXiv preprint}, arXiv:1412.6980.
65
+
66
+ \bibitem[{Kudo and Richardson(2018)}]{kudo-richardson-2018-sentencepiece}
67
+ Taku Kudo and John Richardson. 2018.
68
+ \newblock {{S}entence{P}iece: A simple and language independent subword
69
+ tokenizer and detokenizer for Neural Text Processing}.
70
+ \newblock In \emph{Proceedings of EMNLP: System Demonstrations}, pages 66--71.
71
+
72
+ \bibitem[{Liu et~al.(2019)Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Lewis,
73
+ Zettlemoyer, and Stoyanov}]{RoBERTa}
74
+ Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer
75
+ Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019.
76
+ \newblock {RoBERTa: {A} Robustly Optimized {BERT} Pretraining Approach}.
77
+ \newblock \emph{arXiv preprint}, arXiv:1907.11692.
78
+
79
+ \bibitem[{Loshchilov and Hutter(2019)}]{loshchilov2018decoupled}
80
+ Ilya Loshchilov and Frank Hutter. 2019.
81
+ \newblock {Decoupled Weight Decay Regularization}.
82
+ \newblock In \emph{Proceedings of ICLR}.
83
+
84
+ \bibitem[{Ma and Hovy(2016)}]{ma-hovy-2016-end}
85
+ Xuezhe Ma and Eduard Hovy. 2016.
86
+ \newblock End-to-end sequence labeling via bi-directional {LSTM}-{CNN}s-{CRF}.
87
+ \newblock In \emph{Proceedings of ACL}, pages 1064--1074.
88
+
89
+ \bibitem[{Ma et~al.(2018)Ma, Hu, Liu, Peng, Neubig, and
90
+ Hovy}]{ma-etal-2018-stack}
91
+ Xuezhe Ma, Zecong Hu, Jingzhou Liu, Nanyun Peng, Graham Neubig, and Eduard
92
+ Hovy. 2018.
93
+ \newblock {Stack-Pointer Networks for Dependency Parsing}.
94
+ \newblock In \emph{Proceedings of ACL}, pages 1403--1414.
95
+
96
+ \bibitem[{{Martin} et~al.(2020){Martin}, {Muller}, {Ortiz Su{\'a}rez},
97
+ {Dupont}, {Romary}, {Villemonte de la Clergerie}, {Seddah}, and
98
+ {Sagot}}]{2019arXiv191103894M}
99
+ Louis {Martin}, Benjamin {Muller}, Pedro~Javier {Ortiz Su{\'a}rez}, Yoann
100
+ {Dupont}, Laurent {Romary}, {\'E}ric {Villemonte de la Clergerie}, Djam{\'e}
101
+ {Seddah}, and Beno{\^\i}t {Sagot}. 2020.
102
+ \newblock {CamemBERT: a Tasty French Language Model}.
103
+ \newblock In \emph{Proceedings of ACL}, pages 7203--7219.
104
+
105
+ \bibitem[{Nguyen(2019)}]{nguyen-2019-neural}
106
+ Dat~Quoc Nguyen. 2019.
107
+ \newblock A neural joint model for {V}ietnamese word segmentation, {POS}
108
+ tagging and dependency parsing.
109
+ \newblock In \emph{Proceedings of ALTA}, pages 28--34.
110
+
111
+ \bibitem[{Nguyen et~al.(2014{\natexlab{a}})Nguyen, Nguyen, Pham, and
112
+ Pham}]{nguyen-etal-2014-rdrpostagger}
113
+ Dat~Quoc Nguyen, Dai~Quoc Nguyen, Dang~Duc Pham, and Son~Bao Pham.
114
+ 2014{\natexlab{a}}.
115
+ \newblock {RDRPOSTagger: A Ripple Down Rules-based Part-Of-Speech Tagger}.
116
+ \newblock In \emph{Proceedings of the Demonstrations at EACL}, pages 17--20.
117
+
118
+ \bibitem[{Nguyen et~al.(2014{\natexlab{b}})Nguyen, Nguyen, Pham, Nguyen, and
119
+ Nguyen}]{Nguyen2014NLDB}
120
+ Dat~Quoc Nguyen, Dai~Quoc Nguyen, Son~Bao Pham, Phuong-Thai Nguyen, and Minh~Le
121
+ Nguyen. 2014{\natexlab{b}}.
122
+ \newblock {From Treebank Conversion to Automatic Dependency Parsing for
123
+ Vietnamese}.
124
+ \newblock In \emph{{Proceedings of NLDB}}, pages 196--207.
125
+
126
+ \bibitem[{Nguyen et~al.(2018)Nguyen, Nguyen, Vu, Dras, and
+ Johnson}]{nguyen-etal-2018-fast}
+ Dat~Quoc Nguyen, Dai~Quoc Nguyen, Thanh Vu, Mark Dras, and Mark Johnson. 2018.
+ \newblock {A Fast and Accurate Vietnamese Word Segmenter}.
+ \newblock In \emph{Proceedings of LREC}, pages 2582--2587.
+
+ \bibitem[{Nguyen and Verspoor(2018)}]{nguyen-verspoor-2018-improved}
+ Dat~Quoc Nguyen and Karin Verspoor. 2018.
+ \newblock An improved neural network model for joint {POS} tagging and
+ dependency parsing.
+ \newblock In \emph{Proceedings of the {C}o{NLL} 2018 Shared Task}, pages
+ 81--91.
+
+ \bibitem[{Nguyen et~al.(2017)Nguyen, Vu, Nguyen, Dras, and
+ Johnson}]{nguyen-etal-2017-word}
+ Dat~Quoc Nguyen, Thanh Vu, Dai~Quoc Nguyen, Mark Dras, and Mark Johnson. 2017.
+ \newblock From word segmentation to {POS} tagging for {V}ietnamese.
+ \newblock In \emph{Proceedings of ALTA}, pages 108--113.
+
+ \bibitem[{Nguyen et~al.(2019{\natexlab{a}})Nguyen, Ngo, Vu, Tran, and
+ Nguyen}]{JCC13161}
+ Huyen Nguyen, Quyen Ngo, Luong Vu, Vu~Tran, and Hien Nguyen.
+ 2019{\natexlab{a}}.
+ \newblock {VLSP Shared Task: Named Entity Recognition}.
+ \newblock \emph{Journal of Computer Science and Cybernetics}, 34(4):283--294.
+
+ \bibitem[{Nguyen et~al.(2019{\natexlab{b}})Nguyen, Dong, and Nguyen}]{8713740}
+ Kim~Anh Nguyen, Ngan Dong, and Cam-Tu Nguyen. 2019{\natexlab{b}}.
+ \newblock {Attentive Neural Network for Named Entity Recognition in
+ Vietnamese}.
+ \newblock In \emph{Proceedings of RIVF}.
+
+ \bibitem[{Ott et~al.(2019)Ott, Edunov, Baevski, Fan, Gross, Ng, Grangier, and
+ Auli}]{ott2019fairseq}
+ Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng,
+ David Grangier, and Michael Auli. 2019.
+ \newblock {fairseq: A Fast, Extensible Toolkit for Sequence Modeling}.
+ \newblock In \emph{Proceedings of NAACL-HLT 2019: Demonstrations}, pages
+ 48--53.
+
+ \bibitem[{Sennrich et~al.(2016)Sennrich, Haddow, and
+ Birch}]{sennrich-etal-2016-neural}
+ Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016.
+ \newblock {Neural Machine Translation of Rare Words with Subword Units}.
+ \newblock In \emph{Proceedings of ACL}, pages 1715--1725.
+
+ \bibitem[{Thang et~al.(2008)Thang, Phuong, Huyen, Tu, Rossignol, and
+ Luong}]{DinhQuangThang2008}
+ Dinh~Quang Thang, Le~Hong Phuong, Nguyen Thi~Minh Huyen, Nguyen~Cam Tu, Mathias
+ Rossignol, and Vu~Xuan Luong. 2008.
+ \newblock {Word segmentation of Vietnamese texts: a comparison of approaches}.
+ \newblock In \emph{Proceedings of LREC}, pages 1933--1936.
+
+ \bibitem[{Vaswani et~al.(2017)Vaswani, Shazeer, Parmar, Uszkoreit, Jones,
+ Gomez, Kaiser, and Polosukhin}]{NIPS2017_7181}
+ Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
+ Aidan~N Gomez, {\L}ukasz Kaiser, and Illia Polosukhin. 2017.
+ \newblock {Attention is All you Need}.
+ \newblock In \emph{Advances in Neural Information Processing Systems 30}, pages
+ 5998--6008.
+
+ \bibitem[{de~Vries et~al.(2019)de~Vries, van Cranenburgh, Bisazza, Caselli, van
+ Noord, and Nissim}]{vries2019bertje}
+ Wietse de~Vries, Andreas van Cranenburgh, Arianna Bisazza, Tommaso Caselli,
+ Gertjan van Noord, and Malvina Nissim. 2019.
+ \newblock {BERTje: A Dutch BERT Model}.
+ \newblock \emph{arXiv preprint}, arXiv:1912.09582.
+
+ \bibitem[{Vu et~al.(2018)Vu, Nguyen, Nguyen, Dras, and
+ Johnson}]{vu-etal-2018-vncorenlp}
+ Thanh Vu, Dat~Quoc Nguyen, Dai~Quoc Nguyen, Mark Dras, and Mark Johnson. 2018.
+ \newblock {VnCoreNLP: A Vietnamese Natural Language Processing Toolkit}.
+ \newblock In \emph{Proceedings of NAACL: Demonstrations}, pages 56--60.
+
+ \bibitem[{Vu et~al.(2019)Vu, Vu, Tran, and Jiang}]{vu-xuan-etal-2019-etnlp}
+ Xuan-Son Vu, Thanh Vu, Son Tran, and Lili Jiang. 2019.
+ \newblock {ETNLP}: A visual-aided systematic approach to select pre-trained
+ embeddings for a downstream task.
+ \newblock In \emph{Proceedings of RANLP}, pages 1285--1294.
+
+ \bibitem[{Williams et~al.(2018)Williams, Nangia, and Bowman}]{N18-1101}
+ Adina Williams, Nikita Nangia, and Samuel Bowman. 2018.
+ \newblock {A Broad-Coverage Challenge Corpus for Sentence Understanding through
+ Inference}.
+ \newblock In \emph{Proceedings of NAACL}, pages 1112--1122.
+
+ \bibitem[{Wolf et~al.(2019)Wolf, Debut, Sanh, Chaumond, Delangue, Moi, Cistac,
+ Rault, Louf, Funtowicz, and Brew}]{Wolf2019HuggingFacesTS}
+ Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue,
+ Anthony Moi, Pierric Cistac, Tim Rault, R{\'e}mi Louf, Morgan Funtowicz, and
+ Jamie Brew. 2019.
+ \newblock {HuggingFace's Transformers: State-of-the-art Natural Language
+ Processing}.
+ \newblock \emph{arXiv preprint}, arXiv:1910.03771.
+
+ \bibitem[{Wu and Dredze(2019)}]{wu-dredze-2019-beto}
+ Shijie Wu and Mark Dredze. 2019.
+ \newblock Beto, bentz, becas: The surprising cross-lingual effectiveness of
+ {BERT}.
+ \newblock In \emph{Proceedings of EMNLP-IJCNLP}, pages 833--844.
+
+ \end{thebibliography}
references/2020.emnlp.nguyen/source/emnlp2020_PhoBERT.tex ADDED
@@ -0,0 +1,301 @@
+ \documentclass[11pt,a4paper]{article}
+ \usepackage[hyperref]{emnlp2020}
+ \pdfoutput=1
+ \usepackage{times}
+ \usepackage{latexsym}
+ %\renewcommand{\UrlFont}{\ttfamily\small}
+
+ \usepackage{amsmath}
+ \usepackage{url}
+ \usepackage{amssymb}
+ \usepackage{amsfonts}
+ \usepackage{graphicx}
+ \usepackage{tabularx}
+ \usepackage{multirow}
+ \usepackage{arydshln}
+ \usepackage{mathtools,nccmath}
+
+ \usepackage[utf8]{inputenc}
+ \usepackage[utf8]{vietnam}
+ \usepackage{enumitem}
+ % This is not strictly necessary, and may be commented out,
+ % but it will improve the layout of the manuscript,
+ % and will typically save some space.
+ %\usepackage{microtype}
+
+ \setlength{\textfloatsep}{15pt plus 5.0pt minus 5.0pt}
+ \setlength{\floatsep}{15pt plus 5.0pt minus 5.0pt}
+ %\setlength{\dbltextfloatsep }{15pt plus 2.0pt minus 3.0pt}
+ %\setlength{\dblfloatsep}{15pt plus 2.0pt minus 3.0pt}
+ %\setlength{\intextsep}{15pt plus 2.0pt minus 3.0pt}
+ \setlength{\abovecaptionskip}{3pt plus 1pt minus 1pt}
+
+ \aclfinalcopy % Uncomment this line for the final submission
+ %\def\aclpaperid{***} % Enter the acl Paper ID here
+
+ \setlength\titlebox{5cm}
+ % You can expand the titlebox if you need extra space
+ % to show all the authors. Please do not make the titlebox
+ % smaller than 5cm (the original size); we will check this
+ % in the camera-ready version and ask you to change it back.
+
+ \newcommand\BibTeX{B\textsc{ib}\TeX}
+
+ \title{PhoBERT: Pre-trained language models for Vietnamese}
+
+ \author{Dat Quoc Nguyen$^1$ \and Anh Tuan Nguyen$^{2,}$\thanks{\ \ Work done during internship at VinAI Research.} \\
+ $^1$VinAI Research, Vietnam; $^2$NVIDIA, USA\\
+ \tt{\normalsize v.datnq9@vinai.io, tuananhn@nvidia.com}}
+
+ \date{}
+
+ \begin{document}
+ \maketitle
+ \begin{abstract}
+ We present \textbf{PhoBERT} with two versions---PhoBERT\textsubscript{base} and PhoBERT\textsubscript{large}---the \emph{first} public large-scale monolingual language models pre-trained for Vietnamese. Experimental results show that PhoBERT consistently outperforms the recent best pre-trained multilingual model XLM-R \citep{conneau2019unsupervised} and improves the state-of-the-art in multiple Vietnamese-specific NLP tasks including Part-of-speech tagging, Dependency parsing, Named-entity recognition and Natural language inference. We release PhoBERT to facilitate future research and downstream applications for Vietnamese NLP. Our PhoBERT models are available at: \url{https://github.com/VinAIResearch/PhoBERT}.
+ \end{abstract}
+
+ \section{Introduction}\label{sec:intro}
+
+ Pre-trained language models, especially BERT \citep{devlin-etal-2019-bert}---the Bidirectional Encoder Representations from Transformers \citep{NIPS2017_7181}---have recently become extremely popular and helped to produce significant improvement gains for various NLP tasks. The success of pre-trained BERT and its variants has largely been limited to the English language. For other languages, one could retrain a language-specific model using the BERT architecture \citep{abs-1906-08101,vries2019bertje,vu-xuan-etal-2019-etnlp,2019arXiv191103894M} or employ existing pre-trained multilingual BERT-based models \citep{devlin-etal-2019-bert,NIPS2019_8928,conneau2019unsupervised}.
+
+ In terms of Vietnamese language modeling, to the best of our knowledge, there are two main concerns as follows:
+
+ \begin{itemize}[leftmargin=*]
+ \setlength\itemsep{-1pt}
+ \item The Vietnamese Wikipedia corpus is the only data used to train monolingual language models \citep{vu-xuan-etal-2019-etnlp}, and it is also the only Vietnamese dataset included in the pre-training data used by all multilingual language models except XLM-R. It is worth noting that Wikipedia data is not representative of general language use, and the Vietnamese Wikipedia data is relatively small (1GB uncompressed), while pre-trained language models can be significantly improved by using more pre-training data \cite{RoBERTa}.
+
+ \item All publicly released monolingual and multilingual BERT-based language models are not aware of the difference between Vietnamese syllables and word tokens. This ambiguity comes from the fact that white space is also used to separate the syllables that constitute words in written Vietnamese.\footnote{\newcite{DinhQuangThang2008} show that 85\% of Vietnamese word types are composed of at least two syllables.}
+ For example, the 6-syllable written text ``Tôi là một nghiên cứu viên'' (I am a researcher) forms 4 words ``Tôi\textsubscript{I} là\textsubscript{am} một\textsubscript{a} nghiên\_cứu\_viên\textsubscript{researcher}''. \\
+ Without performing a preprocessing step of Vietnamese word segmentation, those models directly apply Byte-Pair encoding (BPE) methods \citep{sennrich-etal-2016-neural,kudo-richardson-2018-sentencepiece} to the syllable-level Vietnamese pre-training data.\footnote{Although performing word segmentation before applying BPE on the Vietnamese Wikipedia corpus, ETNLP \citep{vu-xuan-etal-2019-etnlp} in fact {does not publicly release} any pre-trained BERT-based language model (\url{https://github.com/vietnlp/etnlp}). In particular, \newcite{vu-xuan-etal-2019-etnlp} release a set of 15K BERT-based word embeddings specialized only for the Vietnamese NER task.}
+ Intuitively, for word-level Vietnamese NLP tasks, those models pre-trained on syllable-level data might not perform as well as language models pre-trained on word-level data.
+
+ \end{itemize}
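The syllable-vs-word distinction above can be illustrated with a toy greedy longest-match word segmenter. This is a sketch only: real segmenters such as RDRSegmenter are far more sophisticated, and the two-entry lexicon below is invented for the paper's own example sentence.

```python
# Toy illustration of Vietnamese word segmentation (NOT RDRSegmenter):
# greedy longest-match over a small, hypothetical multi-syllable lexicon.
# Syllables forming one word are joined with "_", as in the paper's example
# "Tôi là một nghiên cứu viên" -> "Tôi là một nghiên_cứu_viên".

LEXICON = {("nghiên", "cứu", "viên"), ("nghiên", "cứu")}  # hypothetical
MAX_WORD_SYLLABLES = 3

def segment(syllables):
    words, i = [], 0
    while i < len(syllables):
        # Try the longest multi-syllable match first.
        for n in range(min(MAX_WORD_SYLLABLES, len(syllables) - i), 1, -1):
            if tuple(syllables[i:i + n]) in LEXICON:
                words.append("_".join(syllables[i:i + n]))
                i += n
                break
        else:  # no multi-syllable match: the syllable itself is a word
            words.append(syllables[i])
            i += 1
    return words

print(segment("Tôi là một nghiên cứu viên".split()))
```

With the toy lexicon, the 6 syllables are grouped into the 4 words used in the example above.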
+
+ To handle the two concerns above, we train the {first} large-scale monolingual BERT-based ``base'' and ``large'' models using a 20GB \textit{word-level} Vietnamese corpus.
+ We evaluate our models on four downstream Vietnamese NLP tasks: the common word-level ones of Part-of-speech (POS) tagging, Dependency parsing and Named-entity recognition (NER), and a language understanding task of Natural language inference (NLI) which can be formulated as either a syllable- or word-level task. Experimental results show that our models obtain state-of-the-art (SOTA) results on all these tasks.
+ Our contributions are summarized as follows:
+
+ \begin{itemize}[leftmargin=*]
+ \setlength\itemsep{-1pt}
+ \item We present the \textit{first} large-scale monolingual language models pre-trained for Vietnamese.
+
+ \item Our models help produce SOTA performances on four downstream tasks of POS tagging, Dependency parsing, NER and NLI, thus showing the effectiveness of large-scale BERT-based monolingual language models for Vietnamese.
+
+ \item To the best of our knowledge, we also perform the \textit{first} set of experiments to compare monolingual language models with the recent best multilingual model XLM-R in multiple (i.e. four) different language-specific tasks. The experiments show that our models outperform XLM-R on all these tasks, thus convincingly confirming that dedicated language-specific models still outperform multilingual ones.
+
+ \item We publicly release our models under the name PhoBERT which can be used with \texttt{fairseq} \citep{ott2019fairseq} and \texttt{transformers} \cite{Wolf2019HuggingFacesTS}. We hope that PhoBERT can serve as a strong baseline for future Vietnamese NLP research and applications.
+ \end{itemize}
+
+ \section{PhoBERT}
+
+ This section outlines the architecture and describes the pre-training data and optimization setup that we use for PhoBERT.
+
+ \vspace{3pt}
+
+ \noindent\textbf{Architecture:}\ Our PhoBERT has two versions, PhoBERT\textsubscript{base} and PhoBERT\textsubscript{large}, using the same architectures as BERT\textsubscript{base} and BERT\textsubscript{large}, respectively. The PhoBERT pre-training approach is based on RoBERTa \citep{RoBERTa}, which optimizes the BERT pre-training procedure for more robust performance.
+
+ \vspace{3pt}
+
+ \noindent\textbf{Pre-training data:}\ To handle the first concern mentioned in Section \ref{sec:intro}, we use a 20GB pre-training dataset of uncompressed texts. This dataset is a concatenation of two corpora: (i) the Vietnamese Wikipedia corpus ($\sim$1GB), and (ii) a second corpus ($\sim$19GB) generated by removing similar articles and duplicates from a 50GB Vietnamese news corpus.\footnote{\url{https://github.com/binhvq/news-corpus}, crawled from a wide range of news websites and topics.} To solve the second concern,
+ we employ RDRSegmenter \citep{nguyen-etal-2018-fast} from VnCoreNLP \citep{vu-etal-2018-vncorenlp} to perform word and sentence segmentation on the pre-training dataset, resulting in $\sim$145M word-segmented sentences ($\sim$3B word tokens). Different from RoBERTa, we then apply \texttt{fastBPE} \citep{sennrich-etal-2016-neural} to segment these sentences into subword units, using a vocabulary of 64K subword types. On average there are 24.4 subword tokens per sentence.
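The segment-then-BPE preprocessing described above can be sketched as follows. This is a toy illustration only: the paper uses RDRSegmenter and fastBPE, while the greedy matcher and the tiny subword vocabulary below are invented for the example, with "@@" marking non-final pieces in the fastBPE style.

```python
# Toy sketch of PhoBERT's preprocessing: a word-segmented sentence is split
# into subword units by greedy longest-match against a tiny, hypothetical
# subword vocabulary (stand-in for the real 64K-type BPE vocabulary).

TOY_SUBWORD_VOCAB = {"Tôi", "là", "một", "nghiên_cứu", "_viên", "nghiên"}

def to_subwords(word, vocab):
    """Greedy longest-match segmentation of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            piece = word[start:end]
            # Fall back to a single character so segmentation always succeeds.
            if piece in vocab or end - start == 1:
                pieces.append(piece)
                start = end
                break
    # fastBPE-style continuation marker on every non-final piece.
    return [p + "@@" for p in pieces[:-1]] + [pieces[-1]]

# Word-segmented sentence from the paper: "Tôi là một nghiên_cứu_viên".
sentence = ["Tôi", "là", "một", "nghiên_cứu_viên"]
subwords = [p for w in sentence for p in to_subwords(w, TOY_SUBWORD_VOCAB)]
print(subwords)
```

Only the multi-syllable word is split further here; single-syllable words that are in the vocabulary stay whole.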
+
+ \vspace{3pt}
+
+ \noindent\textbf{Optimization:}\ We employ the RoBERTa implementation in \texttt{fairseq} \citep{ott2019fairseq}. We set a maximum length of 256 subword tokens, thus generating 145M $\times$ 24.4 / 256 $\approx$ 13.8M sentence blocks. Following \newcite{RoBERTa}, we optimize the models using Adam \citep{KingmaB14}. We use a batch size of 1024 across 4 V100 GPUs (16GB each) and a peak learning rate of 0.0004 for PhoBERT\textsubscript{base}, and a batch size of 512 and a peak learning rate of 0.0002 for PhoBERT\textsubscript{large}. We run for 40 epochs (here, the learning rate is warmed up for 2 epochs), thus resulting in 13.8M $\times$ 40 / 1024 $\approx$ 540K training steps for PhoBERT\textsubscript{base} and 1.08M training steps for PhoBERT\textsubscript{large}. We pre-train PhoBERT\textsubscript{base} for 3 weeks, and subsequently PhoBERT\textsubscript{large} for 5 weeks.
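The block and step counts quoted in the optimization paragraph follow directly from the corpus statistics and batch sizes, and can be checked with a few lines of arithmetic:

```python
# Sanity-check the pre-training arithmetic quoted in the paper.
sentences = 145_000_000   # ~145M word-segmented sentences
avg_subwords = 24.4       # average subword tokens per sentence
max_len = 256             # maximum sequence length in subword tokens

blocks = sentences * avg_subwords / max_len
print(f"{blocks / 1e6:.1f}M sentence blocks")  # ~13.8M

epochs = 40
steps_base = blocks * epochs / 1024    # PhoBERT-base: batch size 1024
steps_large = blocks * epochs / 512    # PhoBERT-large: batch size 512
print(f"{steps_base / 1e3:.0f}K steps (base), "
      f"{steps_large / 1e6:.2f}M steps (large)")
```

This reproduces the quoted ~13.8M blocks, ~540K steps for the base model and ~1.08M steps for the large model.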
+
+ \begin{table}[!t]
+ \centering
+ \begin{tabular}{l|l|l|l}
+ \hline
+ \textbf{Task} & \textbf{\#training} & \textbf{\#valid} & \textbf{\#test} \\
+ \hline
+ POS tagging$^\dagger$ & 27,000 & 870 & 2,120 \\
+ Dep. parsing$^\dagger$ & 8,977 & 200 & 1,020 \\
+ NER$^\dagger$ & 14,861 & 2,000 & 2,831\\
+ NLI$^\ddagger$ & 392,702 & 2,490 & 5,010\\
+ \hline
+ \end{tabular}
+ \caption{Statistics of the downstream task datasets. ``\#training'', ``\#valid'' and ``\#test'' denote the size of the training, validation and test sets, respectively. $\dagger$ and $\ddagger$ refer to the dataset size as the numbers of sentences and sentence pairs, respectively.}
+ \label{tab:data}
+ \end{table}
+
+ \begin{table*}[!ht]
+ \centering
+ \resizebox{15.5cm}{!}{
+ %\setlength{\tabcolsep}{0.3em}
+ \begin{tabular}{l|l|l|l}
+ \hline
+ \multicolumn{2}{c|}{\textbf{POS tagging} (word-level)} & \multicolumn{2}{c}{\textbf{Dependency parsing} (word-level)}\\
+ \hline
+ Model & Acc. & Model & LAS / UAS \\
+ \hline
+ RDRPOSTagger \citep{nguyen-etal-2014-rdrpostagger} [$\clubsuit$] & 95.1 & \_ & \_ \\
+ BiLSTM-CNN-CRF \citep{ma-hovy-2016-end} [$\clubsuit$] & 95.4 & VnCoreNLP-DEP \citep{vu-etal-2018-vncorenlp} [$\bigstar$] & 71.38 / 77.35 \\
+ VnCoreNLP-POS \citep{nguyen-etal-2017-word} [$\clubsuit$] & 95.9 & jPTDP-v2 [$\bigstar$] & 73.12 / 79.63 \\
+ jPTDP-v2 \citep{nguyen-verspoor-2018-improved} [$\bigstar$] & 95.7 & jointWPD [$\bigstar$] & 73.90 / 80.12 \\
+ jointWPD \citep{nguyen-2019-neural} [$\bigstar$] & 96.0 & Biaffine \citep{DozatM17} [$\bigstar$] & 74.99 / 81.19 \\
+ XLM-R\textsubscript{base} (our result) & 96.2 & Biaffine w/ XLM-R\textsubscript{base} (our result) & 76.46 / 83.10 \\
+ XLM-R\textsubscript{large} (our result) & 96.3 & Biaffine w/ XLM-R\textsubscript{large} (our result) & 75.87 / 82.70 \\
+ \hline
+ PhoBERT\textsubscript{base} & \underline{96.7} & Biaffine w/ PhoBERT\textsubscript{base} & \textbf{78.77} / \textbf{85.22} \\
+ PhoBERT\textsubscript{large} & \textbf{96.8} & Biaffine w/ PhoBERT\textsubscript{large} & \underline{77.85} / \underline{84.32} \\
+ \hline
+ \end{tabular}
+ }
+ \caption{Performance scores (in \%) on the POS tagging and Dependency parsing test sets. ``Acc.'', ``LAS'' and ``UAS'' abbreviate the Accuracy, the Labeled Attachment Score and the Unlabeled Attachment Score, respectively (here, all these evaluation metrics are computed on all word tokens, including punctuation).
+ [$\clubsuit$] and [$\bigstar$] denote
+ results reported by \newcite{nguyen-etal-2017-word} and \newcite{nguyen-2019-neural}, respectively.}
+ \label{tab:posdep}
+ \end{table*}
+
+ \section{Experimental setup}
+
+ We evaluate the performance of PhoBERT on four downstream Vietnamese NLP tasks: POS tagging, Dependency parsing, NER and NLI.
+
+ \subsubsection*{Downstream task datasets}
+
+ Table \ref{tab:data} presents the statistics of the experimental datasets that we employ for downstream task evaluation.
+ For POS tagging, Dependency parsing and NER, we follow the VnCoreNLP setup \citep{vu-etal-2018-vncorenlp}, using standard benchmarks of the VLSP 2013 POS tagging dataset,\footnote{\url{https://vlsp.org.vn/vlsp2013/eval}} the VnDT dependency treebank v1.1 \cite{Nguyen2014NLDB} with POS tags predicted by VnCoreNLP, and the VLSP 2016 NER dataset \citep{JCC13161}.
+
+ For NLI, we use the manually-constructed Vietnamese validation and test sets from the cross-lingual NLI (XNLI) corpus v1.0 \citep{conneau-etal-2018-xnli}, where the Vietnamese training set is released as a machine-translated version of the corresponding English training set \citep{N18-1101}.
+ Unlike the POS tagging, Dependency parsing and NER datasets, which provide gold word segmentation, for NLI we employ RDRSegmenter to segment the text into words before applying BPE to produce subwords from word tokens.
+
+ \subsubsection*{Fine-tuning}
+
+ Following \newcite{devlin-etal-2019-bert}, for POS tagging and NER, we append a linear prediction layer on top of the PhoBERT architecture (i.e. to the last Transformer layer of PhoBERT) w.r.t. the first subword of each word token.\footnote{In our preliminary experiments, using the average of the contextualized embeddings of a word's subword tokens to represent the word produces slightly lower performance than using the contextualized embedding of the first subword.}
+ For dependency parsing, following \newcite{nguyen-2019-neural}, we employ a reimplementation of the state-of-the-art Biaffine dependency parser \citep{DozatM17} from \newcite{ma-etal-2018-stack} with default optimal hyper-parameters. %\footnote{\url{https://github.com/XuezheMax/NeuroNLP2}}
+ We then extend this parser by replacing the pre-trained word embedding of each word in an input sentence with the corresponding contextualized embedding (from the last layer) computed for the first subword token of the word.
+
+ For POS tagging, NER and NLI, we employ \texttt{transformers} \cite{Wolf2019HuggingFacesTS} to fine-tune PhoBERT for each task and each dataset independently. We use AdamW \citep{loshchilov2018decoupled} with a fixed learning rate of 1e-5 and a batch size of 32 \citep{RoBERTa}. We fine-tune for 30 training epochs, evaluate the task performance after each epoch on the validation set (early stopping is applied when there is no improvement after 5 consecutive epochs), and then select the best model checkpoint to report the final result on the test set (note that each of our scores is an average over 5 runs with different random seeds). %Section \ref{sec:results} shows that using this relatively straightforward fine-tuning manner can lead to SOTA results. %Note that we might boost our downstream task performances even further by doing a more careful hyper-parameter tuning.
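The "first subword of each word token" convention used for fine-tuning can be sketched as an index computation: given a word sequence and a subword tokenizer, find, for each word, the position of its first subword in the flattened subword sequence, so that the classifier reads only those hidden states. The toy tokenizer below (splitting on "_") is hypothetical and merely stands in for PhoBERT's BPE.

```python
# Minimal sketch (not the actual transformers fine-tuning code) of selecting
# the first subword per word for token-level prediction heads.

def first_subword_indices(words, tokenize):
    """For each word, return the index of its first subword piece in the
    flattened subword sequence produced by `tokenize`."""
    indices, offset = [], 0
    for w in words:
        pieces = tokenize(w)
        indices.append(offset)   # the word is represented by this position
        offset += len(pieces)
    return indices

# Hypothetical subword tokenizer: splits on "_" just for illustration.
toy_tokenize = lambda w: w.split("_")

words = ["Tôi", "là", "một", "nghiên_cứu_viên"]
idx = first_subword_indices(words, toy_tokenize)
print(idx)  # word i would be classified from hidden_states[idx[i]]
```

In an actual fine-tuning loop, `hidden_states` gathered at these indices would be fed to the linear prediction layer, one logit vector per word rather than per subword.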
+
+ \begin{table*}[!ht]
+ \centering
+ \resizebox{15.5cm}{!}{
+ %\setlength{\tabcolsep}{0.3em}
+ \begin{tabular}{l|l|l|l}
+ \hline
+ \multicolumn{2}{c|}{\textbf{NER} (word-level)} & \multicolumn{2}{c}{\textbf{NLI} (syllable- or word-level)} \\
+ \hline
+ Model & F\textsubscript{1} & Model & Acc. \\
+ \hline
+ BiLSTM-CNN-CRF [$\blacklozenge$] & 88.3 & \_ & \_\\
+ VnCoreNLP-NER \citep{vu-etal-2018-vncorenlp} [$\blacklozenge$] & 88.6 & BiLSTM-max \citep{conneau-etal-2018-xnli} & 66.4 \\
+ VNER \citep{8713740} & 89.6 & mBiLSTM \citep{ArtetxeS19} & 72.0 \\
+ BiLSTM-CNN-CRF + ETNLP [$\spadesuit$] & 91.1 & multilingual BERT \citep{devlin-etal-2019-bert} [$\blacksquare$] & 69.5 \\
+ VnCoreNLP-NER + ETNLP [$\spadesuit$] & 91.3 & XLM\textsubscript{MLM+TLM} \citep{NIPS2019_8928} & 76.6 \\
+ XLM-R\textsubscript{base} (our result) & 92.0 & XLM-R\textsubscript{base} \citep{conneau2019unsupervised} & {75.4} \\
+ XLM-R\textsubscript{large} (our result) & 92.8 & XLM-R\textsubscript{large} \citep{conneau2019unsupervised} & \underline{79.7} \\
+ \hline
+ PhoBERT\textsubscript{base}& \underline{93.6} & PhoBERT\textsubscript{base}& {78.5} \\
+ PhoBERT\textsubscript{large}& \textbf{94.7} & PhoBERT\textsubscript{large}& \textbf{80.0} \\
+ \hline
+ \end{tabular}
+ }
+ \caption{Performance scores (in \%) on the NER and NLI test sets.
+ [$\blacklozenge$], [$\spadesuit$] and [$\blacksquare$] denote
+ results reported by \newcite{vu-etal-2018-vncorenlp}, \newcite{vu-xuan-etal-2019-etnlp} and \newcite{wu-dredze-2019-beto}, respectively.
+ %``mBiLSTM'' denotes a BiLSTM-based multilingual embedding model.
+ Note that there are higher Vietnamese NLI results reported for XLM-R when fine-tuning on the concatenation of all 15 training datasets from the XNLI corpus (i.e. TRANSLATE-TRAIN-ALL: 79.5\% for XLM-R\textsubscript{base} and 83.4\% for XLM-R\textsubscript{large}). However, those results might not be comparable as we only use the monolingual Vietnamese training data for fine-tuning.}
+ \label{tab:nernli}
+ \end{table*}
+
+ \section{Experimental results}\label{sec:results}
+
+ \subsubsection*{Main results}
+
+ Tables \ref{tab:posdep} and \ref{tab:nernli} compare PhoBERT scores with the previous highest reported results, using the same experimental setup. It is clear that our PhoBERT helps produce new SOTA performance results for all four downstream tasks.
+
+ For \underline{POS tagging}, the neural model jointWPD for joint POS tagging and dependency parsing \citep{nguyen-2019-neural} and the feature-based model VnCoreNLP-POS \citep{nguyen-etal-2017-word} are the two previous SOTA models, obtaining accuracies of about 96.0\%. PhoBERT obtains 0.8\% absolute higher accuracy than these two models.
+
+ For \underline{Dependency parsing}, the previous highest parsing scores LAS and UAS are obtained by the Biaffine parser at 75.0\% and 81.2\%, respectively. PhoBERT helps boost the Biaffine parser by about 4\% absolute, achieving an LAS of 78.8\% and a UAS of 85.2\%.
+
+ For \underline{NER}, PhoBERT\textsubscript{large} produces 1.1 points higher F\textsubscript{1} than PhoBERT\textsubscript{base}. In addition, PhoBERT\textsubscript{base} obtains 2+ points higher than the previous SOTA feature- and neural network-based models VnCoreNLP-NER \citep{vu-etal-2018-vncorenlp} and BiLSTM-CNN-CRF \citep{ma-hovy-2016-end}, which are trained with the set of 15K BERT-based ETNLP word embeddings \citep{vu-xuan-etal-2019-etnlp}.
+
+ For \underline{NLI},
+ PhoBERT outperforms the multilingual BERT \citep{devlin-etal-2019-bert} and the BERT-based cross-lingual model with a new translation language modeling objective XLM\textsubscript{MLM+TLM} \citep{NIPS2019_8928} by large margins. PhoBERT also performs better than the recent best pre-trained multilingual model XLM-R while using far fewer parameters: 135M (PhoBERT\textsubscript{base}) vs. 250M (XLM-R\textsubscript{base}); 370M (PhoBERT\textsubscript{large}) vs. 560M (XLM-R\textsubscript{large}).
+
+ \subsubsection*{Discussion}
+
+ We find that PhoBERT\textsubscript{large} achieves 0.9\% lower dependency parsing scores than PhoBERT\textsubscript{base}. One possible reason is that the last Transformer layer in the BERT architecture might not be the optimal one for encoding the richest information about syntactic structures \cite{hewitt-manning-2019-structural,jawahar-etal-2019-bert}. Future work will study which of PhoBERT's Transformer layers contains richer syntactic information by evaluating the Vietnamese parsing performance from each layer.
+
+ Using more pre-training data can significantly improve the quality of pre-trained language models \cite{RoBERTa}. Thus it is not surprising that PhoBERT helps produce better performance than ETNLP on NER, and than the multilingual BERT and XLM\textsubscript{MLM+TLM} on NLI (here, PhoBERT uses 20GB of Vietnamese texts while those models employ the 1GB Vietnamese Wikipedia corpus).
+
+ Following the fine-tuning approach that we use for PhoBERT, we carefully fine-tune XLM-R for the remaining Vietnamese POS tagging, Dependency parsing and NER tasks (here, it is applied to the first sub-syllable token of the first syllable of each word).\footnote{For fine-tuning XLM-R, we use a grid search on the validation set to select the AdamW learning rate from \{5e-6, 1e-5, 2e-5, 4e-5\} and the batch size from \{16, 32\}.}
+ Tables \ref{tab:posdep} and \ref{tab:nernli} show that our PhoBERT also performs better than XLM-R on these three word-level tasks.
+ It is worth noting that XLM-R uses a 2.5TB pre-training corpus which contains 137GB of Vietnamese texts (i.e. about 137\ /\ 20 $\approx$ 7 times bigger than our pre-training corpus).
+ Recall that PhoBERT performs Vietnamese word segmentation to segment syllable-level sentences into word tokens before applying BPE to segment the word-segmented sentences into subword units, while XLM-R directly applies BPE to the syllable-level Vietnamese pre-training sentences.
+ This reconfirms that dedicated language-specific models still outperform multilingual ones \citep{2019arXiv191103894M}.\footnote{Note that \newcite{2019arXiv191103894M} only compare their model CamemBERT with XLM-R on the French NLI task.}
+
+ \section{Conclusion}
+
+ In this paper, we have presented the first large-scale monolingual PhoBERT language models pre-trained for Vietnamese. We demonstrate the usefulness of PhoBERT by showing that PhoBERT performs better than the recent best multilingual model XLM-R and helps produce SOTA performances for four downstream Vietnamese NLP tasks of POS tagging, Dependency parsing, NER and NLI.
+ By publicly releasing PhoBERT models, %\footnote{\url{https://github.com/VinAIResearch/PhoBERT}}
+ we hope that they can foster future research and applications in Vietnamese NLP. %Our PhoBERT and its usage are available at: \url{https://github.com/VinAIResearch/PhoBERT}.
+
+ {%\footnotesize
+ \bibliographystyle{acl_natbib}
+ \bibliography{REFs}
+ }
+
+ \end{document}
references/2021.naacl.nguyen/paper.md ADDED
@@ -0,0 +1,167 @@
+ ---
+ title: "PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsing"
+ authors:
+ - "Linh The Nguyen"
+ - "Dat Quoc Nguyen"
+ year: 2021
+ venue: "NAACL 2021 Demonstrations"
+ url: "https://aclanthology.org/2021.naacl-demos.1/"
+ ---
+
+ We present the first multi-task learning model---named PhoNLP---for joint Vietnamese part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing. Experiments on Vietnamese benchmark datasets show that PhoNLP produces state-of-the-art results, outperforming a single-task learning approach that fine-tunes the pre-trained Vietnamese language model PhoBERT [phobert] for each task independently. We publicly release PhoNLP as an open-source toolkit under the Apache License 2.0.
+ Although we specify PhoNLP for Vietnamese, our PhoNLP training and evaluation command scripts can in fact directly work for other languages that have a pre-trained BERT-based language model and gold annotated corpora available for the three tasks of POS tagging, NER and dependency parsing.
+ We hope that PhoNLP can serve as a strong baseline and useful toolkit for future NLP research and applications not only to Vietnamese but also to other languages. Our PhoNLP is available at https://github.com/VinAIResearch/PhoNLP.
+ [Figure: the PhoNLP architecture (JointModel.pdf), shown together with the example sentence below, annotated for POS, NER and dependency parsing:]
+
+ | **ID** | **Form** | **POS** | **NER** | **Head** | **DepRel** |
+ |---|---|---|---|---|---|
+ | 1 | Đây | PRON | O | 2 | sub |
+ | 2 | là | VERB | O | 0 | root |
+ | 3 | Hà\_Nội | NOUN | B-LOC | 2 | vmod |
+ # Introduction
+ Vietnamese NLP research has developed significantly in recent years, boosted by the success of the national project on Vietnamese language and speech processing (VLSP) KC01.01/2006-2010 and the VLSP workshops that have run shared tasks since 2013. The fundamental tasks of POS tagging, NER and dependency parsing thus play important roles, providing useful features for many downstream application tasks such as machine translation [7800281], sentiment analysis [BANG20182016IIP0038], relation extraction [9287471], semantic parsing [vitext2sql], open information extraction [3155133.3155171] and question answering [NguyenNP_SWJ,3184558.3191535].
+ Thus, there is a need to develop NLP toolkits for linguistic annotations w.r.t. Vietnamese POS tagging, NER and dependency parsing.
+ VnCoreNLP [vu-etal-2018-vncorenlp] is the previous public toolkit employing traditional feature-based machine learning models to handle these Vietnamese NLP tasks. However, VnCoreNLP is no longer considered state-of-the-art because its results are significantly outperformed by those obtained when fine-tuning PhoBERT---the current state-of-the-art monolingual pre-trained language model for Vietnamese [phobert]. Note that there are no publicly available fine-tuned BERT-based models for the three Vietnamese tasks. Even if there were, a potential drawback is that an NLP package wrapping such fine-tuned BERT-based models would take a large storage space, i.e. three times larger than the storage space used by a single BERT model [devlin-etal-2019-bert], so it would not be suitable for practical applications that require a smaller storage space. Joint multi-task learning is a promising solution as it might help reduce the storage space. In addition, POS tagging, NER and dependency parsing are related tasks: POS tags are essential input features for dependency parsing and are also used as additional features for NER. Joint multi-task learning thus might also help improve performance over single-task learning [Ruder2019Neural].
+ In this paper, we present a new multi-task learning model---named PhoNLP---for joint POS tagging, NER and dependency parsing. In particular, given an input sentence of words, an encoding layer generates contextualized word embeddings that represent the input words. These contextualized word embeddings are fed into a POS tagging layer, which is in fact a linear prediction layer [devlin-etal-2019-bert], to predict POS tags for the corresponding input words. Each predicted POS tag is then represented by two "soft" embeddings that are later fed into the NER and dependency parsing layers separately.
+ More specifically, based on both the contextualized word embeddings and the "soft" POS tag embeddings, the NER layer uses a linear-chain CRF predictor [Lafferty:2001] to predict NER labels for the input words, while the dependency parsing layer uses a Biaffine classifier [DozatM17] to predict dependency arcs between the words and another Biaffine classifier to label the predicted arcs.
+ Our contributions are summarized as follows:
34
+ - To the best of our knowledge, PhoNLP is the first proposed model to jointly learn POS tagging, NER and dependency parsing for Vietnamese.
37
+ - We discuss a data leakage issue in the Vietnamese benchmark datasets that has not been pointed out before. Experiments show that PhoNLP obtains state-of-the-art performance results, outperforming PhoBERT-based single-task learning.
38
+ - We publicly release PhoNLP as an open-source toolkit that is simple to setup and efficiently run from both the command-line and Python API. We hope that PhoNLP can serve as a strong baseline and useful toolkit for future NLP research and downstream applications.
39
+ # Model description
40
+ Figure [fig:architecture] illustrates our PhoNLP architecture that can be viewed as a mixture of a BERT-based encoding layer and three decoding layers of POS tagging, NER and dependency parsing.
41
+ ## Encoder & Contextualized embeddings
42
+ Given an input sentence consisting of $n$ word tokens $w_1, w_2, ..., w_n$, the encoding layer employs PhoBERT to generate contextualized latent feature embeddings $\mathbf{e}_{i}$, each representing the $i^{th}$ word $w_i$:
+ $$
+ \mathbf{e}_{i} = \mathrm{PhoBERT}_{\mathrm{base}}(w_{1:n}, i)
+ $$
46
+ In particular, the encoding layer employs the PhoBERT$_{\mathrm{base}}$ version. Because PhoBERT uses BPE [sennrich-etal-2016-neural] to segment the input sentence into subword units, the encoding layer in fact represents the $i^{th}$ word $w_i$ by using the contextualized embedding of its first subword.
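As an illustration of this first-subword convention, here is a minimal pure-Python sketch of the index bookkeeping; `toy_bpe` below is a made-up stand-in for PhoBERT's actual BPE segmenter, used only to make the mapping concrete.

```python
# Sketch: represent each word by the embedding of its FIRST subword.
# The BPE segmentation below is illustrative, not PhoBERT's actual output.

def first_subword_indices(words, segment):
    """Map each word to the index of its first subword in the flattened
    subword sequence produced by `segment` (a word -> subwords function)."""
    indices, offset = [], 0
    for w in words:
        pieces = segment(w)
        indices.append(offset)      # first subword of this word
        offset += len(pieces)
    return indices

# Toy segmenter: split words longer than 4 characters into two pieces.
toy_bpe = lambda w: [w[:4], "@@" + w[4:]] if len(w) > 4 else [w]

words = ["Tôi", "đang", "làm_việc", "tại", "VinAI", "."]
idx = first_subword_indices(words, toy_bpe)   # -> [0, 1, 2, 4, 5, 7]
# Word w_i is then represented by subword_embeddings[idx[i]].
```

The same bookkeeping applies regardless of which subword vocabulary is used, since only the first piece of each word is selected.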
47
+ ## POS tagging
48
+ Following a common manner when fine-tuning a pre-trained language model for a sequence labeling task [devlin-etal-2019-bert], the POS tagging layer is a linear prediction layer appended on top of the encoder. In particular, the POS tagging layer feeds the contextualized word embeddings $\mathbf{e}_{i}$ into a feed-forward network (FFNN$_{\mathrm{POS}}$) followed by a $\mathrm{softmax}$ predictor for POS tag prediction:
+ $$
+ \mathbf{p}_{i} = \mathrm{softmax}(\mathrm{FFNN}_{\mathrm{POS}}(\mathbf{e}_{i}))
+ $$
+ where the output layer size of FFNN$_{\mathrm{POS}}$ is the number of POS tags. Based on the probability vectors $\mathbf{p}_{i}$, a cross-entropy objective loss $\mathcal{L}_{\mathrm{POS}}$ is calculated for POS tagging during training.
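The softmax-plus-cross-entropy step can be sketched in plain Python; the logits below are toy values, not actual FFNN outputs.

```python
import math

def softmax(z):
    m = max(z)                                  # subtract max for stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(p, gold_index):
    """Negative log-probability of the gold tag."""
    return -math.log(p[gold_index])

# Toy: 3 POS tags, one word's prediction-layer output (logits).
logits = [2.0, 0.5, -1.0]
p = softmax(logits)          # probability vector over the tag set
loss = cross_entropy(p, 0)   # gold tag is tag 0
```

Summing this per-word loss over a sentence (and batch) gives the POS tagging training objective.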
53
+ ## NER
54
+ The NER layer creates a sequence of vectors $\mathbf{v}_{1:n}$ in which each $\mathbf{v}_{i}$ is obtained by concatenating the contextualized word embedding $\mathbf{e}_{i}$ and a ``soft'' POS tag embedding $\mathbf{t}_{i}^{(1)}$:
+ $$
+ \mathbf{v}_{i} = \mathbf{e}_{i} \circ \mathbf{t}_{i}^{(1)}
+ $$
+ where, following prior work on soft label embeddings, the ``soft'' POS tag embedding $\mathbf{t}_{i}^{(1)}$ is computed by multiplying a label weight matrix $\mathbf{W}^{(1)}$ with the corresponding probability vector $\mathbf{p}_{i}$:
+ $$
+ \mathbf{t}_{i}^{(1)} = \mathbf{W}^{(1)}\mathbf{p}_{i}
+ $$
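The label-weight-matrix-times-probability-vector product amounts to a probability-weighted mix of per-tag embedding columns. A minimal sketch with toy 2-dimensional embeddings for 3 hypothetical tags:

```python
def soft_tag_embedding(W, p):
    """t = W p: rows of W are embedding dimensions, columns are POS tags,
    so t is the probability-weighted average of the per-tag columns."""
    return [sum(W[d][k] * p[k] for k in range(len(p)))
            for d in range(len(W))]

# Toy: 2-dim embeddings for 3 tags (one column per tag).
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 2.0]]
p = [0.5, 0.25, 0.25]          # predicted tag distribution for one word
t = soft_tag_embedding(W, p)   # -> [1.0, 0.75]
# The task-specific input vector is then the concatenation of the
# contextualized word embedding and t.
```

Because `p` is a full distribution rather than a one-hot vector, uncertainty in the POS prediction is passed on to the downstream layers.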
62
+ The NER layer then passes each vector $\mathbf{v}_{i}$ into a feed-forward network (FFNN$_{\mathrm{NER}}$):
+ $$
+ \mathbf{h}_{i} = \mathrm{FFNN}_{\mathrm{NER}}(\mathbf{v}_{i})
+ $$
+ where the output layer size of FFNN$_{\mathrm{NER}}$ is the number of BIO-based NER labels.
67
+ The NER layer feeds the output vectors $\mathbf{h}_{i}$ into a linear-chain CRF predictor for NER label prediction [Lafferty:2001]. A cross-entropy loss $\mathcal{L}_{\mathrm{NER}}$ is calculated for NER during training, while the Viterbi algorithm is used for inference.
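Viterbi decoding for a linear-chain CRF can be sketched as follows; the BIO label set, emission scores, and transition scores below are toy values, not learned parameters.

```python
def viterbi(emissions, transitions):
    """Best label sequence under a linear-chain score: per-position emission
    scores plus pairwise transition scores, maximized by dynamic programming."""
    n, L = len(emissions), len(emissions[0])
    score = list(emissions[0])
    back = []
    for i in range(1, n):
        prev, score, ptr = score, [], []
        for y in range(L):
            bp = max(range(L), key=lambda yp: prev[yp] + transitions[yp][y])
            score.append(prev[bp] + transitions[bp][y] + emissions[i][y])
            ptr.append(bp)
        back.append(ptr)
    y = max(range(L), key=lambda k: score[k])
    path = [y]
    for ptr in reversed(back):   # follow backpointers
        y = ptr[y]
        path.append(y)
    return path[::-1]

NEG_INF = -1e9
# Labels: 0 = O, 1 = B, 2 = I; forbid the invalid transition O -> I.
transitions = [[0, 0, NEG_INF],
               [0, 0, 0],
               [0, 0, 0]]
emissions = [[0, 2, 0],    # word 1: B scores highest
             [0, 0, 2],    # word 2: I scores highest
             [2, 0, 0]]    # word 3: O scores highest
best = viterbi(emissions, transitions)   # -> [1, 2, 0], i.e. B I O
```

The transition matrix is what lets the CRF rule out invalid BIO sequences (such as an I directly after an O) that independent per-word softmax predictions could produce.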
70
+ ## Dependency parsing
71
+ The dependency parsing layer creates vectors $\mathbf{u}_{1:n}$ in which each $\mathbf{u}_{i}$ is obtained by concatenating $\mathbf{e}_{i}$ and another ``soft'' POS tag embedding $\mathbf{t}_{i}^{(2)}$:
+ $$
+ \mathbf{u}_{i} = \mathbf{e}_{i} \circ \mathbf{t}_{i}^{(2)}, \qquad \mathbf{t}_{i}^{(2)} = \mathbf{W}^{(2)}\mathbf{p}_{i}
+ $$
+ Following the deep Biaffine parser [DozatM17], the dependency parsing layer uses four FFNNs to split $\mathbf{u}_{i}$ into *head* and *dependent* representations:
+ $$
+ \mathbf{h}_{i}^{(arc\text{-}head)} = \mathrm{FFNN}_{arc\text{-}head}(\mathbf{u}_{i}), \qquad
+ \mathbf{h}_{i}^{(arc\text{-}dep)} = \mathrm{FFNN}_{arc\text{-}dep}(\mathbf{u}_{i})
+ $$
+ $$
+ \mathbf{h}_{i}^{(label\text{-}head)} = \mathrm{FFNN}_{label\text{-}head}(\mathbf{u}_{i}), \qquad
+ \mathbf{h}_{i}^{(label\text{-}dep)} = \mathrm{FFNN}_{label\text{-}dep}(\mathbf{u}_{i})
+ $$
79
+ To predict potential dependency arcs, based on the input vectors $\mathbf{h}_{i}^{(arc\text{-}dep)}$ and $\mathbf{h}_{j}^{(arc\text{-}head)}$, the parsing layer uses a Biaffine classifier's variant [qi-etal-2018-universal] that additionally takes into account the distance and relative ordering between two words to produce a probability distribution of arc heads for each word.
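The core biaffine idea, a bilinear form between a dependent representation and each candidate head representation, can be sketched as follows; the matrix `U` and the vectors are toy values (and the distance/ordering terms of the variant are omitted).

```python
def biaffine_score(h_dep, h_head, U):
    """s(j -> i) = h_dep_i^T U h_head_j : bilinear score for head j of word i."""
    return sum(h_dep[a] * sum(U[a][b] * h_head[b] for b in range(len(h_head)))
               for a in range(len(h_dep)))

# Toy 2-dimensional representations; U = identity reduces to a dot product.
U = [[1, 0], [0, 1]]
root = [2, 0]                            # ROOT's head representation
cand_heads = [root, [0, 1], [1, 1]]
dep = [1, 0]                             # one word's dependent representation
scores = [biaffine_score(dep, h, U) for h in cand_heads]   # -> [2, 0, 1]
# A softmax over `scores` gives this word's distribution over arc heads.
```

In the full model `U` is learned, and an extra bias term per head is typically included; this sketch keeps only the bilinear term.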
81
+ For inference, the Chu–Liu/Edmonds' algorithm is used to find a maximum spanning tree [chuliu,Edmonds].
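Chu–Liu/Edmonds itself is more involved; for tiny sentences the same maximum-spanning-tree objective can be brute-forced over all head assignments, which makes the objective concrete. The scores below are toy values, and `best_tree` is only a stand-in for the real algorithm.

```python
from itertools import product

def is_tree(heads):
    """heads[i] is the head of word i+1; 0 denotes ROOT. True iff every word
    reaches ROOT without a cycle."""
    for i in range(1, len(heads) + 1):
        seen, node = set(), i
        while node != 0:
            if node in seen:
                return False          # cycle
            seen.add(node)
            node = heads[node - 1]
    return True

def best_tree(score):
    """Exhaustive maximum spanning tree (tiny n only): pick a head per word,
    0 = ROOT, maximizing the total arc score over valid trees."""
    n = len(score)
    best, best_heads = float("-inf"), None
    for heads in product(range(n + 1), repeat=n):
        if not is_tree(heads):
            continue
        total = sum(score[i][heads[i]] for i in range(n))
        if total > best:
            best, best_heads = total, heads
    return list(best_heads)

# Toy 3-word sentence: score[i][h] = score of head h (0 = ROOT) for word i+1.
score = [[1, 0, 9, 0],   # word 1 prefers head 2
         [8, 0, 0, 0],   # word 2 prefers ROOT
         [0, 0, 7, 0]]   # word 3 prefers head 2
tree = best_tree(score)  # -> [2, 0, 2]
```

Brute force is exponential in sentence length; Chu–Liu/Edmonds finds the same maximum spanning tree in polynomial time, which is why it is used in practice.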
82
+ The parsing layer also uses another Biaffine classifier to label the predicted arcs, based on the input vectors $\mathbf{h}_{i}^{(label\text{-}dep)}$ and $\mathbf{h}_{j}^{(label\text{-}head)}$. An objective loss $\mathcal{L}_{\mathrm{DEP}}$ is computed during training by summing a cross-entropy loss for unlabeled dependency parsing and another cross-entropy loss for dependency label prediction, based on gold arcs and arc labels.
83
+ ## Joint multi-task learning
84
+ The final training objective loss $\mathcal{L}$ of our model PhoNLP is the weighted sum of the POS tagging loss $\mathcal{L}_{\mathrm{POS}}$, the NER loss $\mathcal{L}_{\mathrm{NER}}$ and the dependency parsing loss $\mathcal{L}_{\mathrm{DEP}}$:
+ $$
+ \mathcal{L} = \lambda_1\mathcal{L}_{\mathrm{POS}} + \lambda_2\mathcal{L}_{\mathrm{NER}} + (1 - \lambda_1 - \lambda_2)\mathcal{L}_{\mathrm{DEP}}
+ $$
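With the optimal weights reported later in the paper (λ1 = 0.4, λ2 = 0.2), the weighted sum is simply:

```python
def joint_loss(l_pos, l_ner, l_dep, lam1=0.4, lam2=0.2):
    """Weighted sum of the three task losses; the weights sum to 1."""
    return lam1 * l_pos + lam2 * l_ner + (1 - lam1 - lam2) * l_dep

# Toy loss values for one batch (not real training losses).
loss = joint_loss(0.5, 0.3, 0.2)   # -> 0.4*0.5 + 0.2*0.3 + 0.4*0.2 = 0.34
```

Tying the third weight to 1 − λ1 − λ2 keeps the overall loss scale fixed while the grid search varies only two hyper-parameters.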
88
+ #### Discussion: Our PhoNLP can be viewed as an extension of previous joint POS tagging and dependency parsing models [hashimoto-etal-2017-joint,li-etal-2018-joint-learning,nguyen-verspoor-2018-improved,NguyenALTA2019,kondratyuk-straka-2019-75], where we additionally incorporate a CRF-based prediction layer for NER. Unlike the earlier models that use BiLSTM-based encoders to extract contextualized feature embeddings, we use a BERT-based encoder. The model of [kondratyuk-straka-2019-75] also employs a BERT-based encoder; however, different from PhoNLP, where we construct a hierarchical architecture over the POS tagging and dependency parsing layers, it does not make use of POS tag embeddings for dependency parsing.
89
+ # Experiments
90
+ ## Setup
91
+ ### Datasets
92
+ To conduct experiments, we use the benchmark datasets of the VLSP 2013 POS tagging dataset, the VLSP 2016 NER dataset [JCC13161] and the VnDT dependency treebank v1.1, following the setup used by the VnCoreNLP toolkit [vu-etal-2018-vncorenlp]. Here, VnDT is converted from the Vietnamese constituent treebank [nguyen-etal-2009-building].
93
+ #### Data leakage issue: We further discover a data leakage issue that has not been pointed out before: all sentences from the VLSP 2016 NER dataset and the VnDT treebank are included in the VLSP 2013 POS tagging dataset. In particular, 90+% of the sentences in the NER and dependency parsing validation and test sets also appear in the POS tagging training set.
94
+ To handle this issue, we re-split the VLSP 2013 POS tagging dataset: the POS tagging validation/test set now only contains sentences that appear in the union of the NER and dependency parsing validation/test sets (i.e. the validation/test sentences for NER and dependency parsing only appear in the POS tagging validation/test set).
95
+ In addition, there are 594 duplicated sentences in the VLSP 2013 POS tagging dataset (here, sentence duplication is not found in the union of the NER and dependency parsing sentences). Thus we have to perform duplication removal on the POS tagging dataset.
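The duplication removal and leakage-aware re-split can be sketched with plain set operations; the sentence IDs below are toy values, and these helpers are illustrative, not the actual VLSP data-processing code.

```python
def deduplicate(sentences):
    """Drop duplicate sentences, keeping first occurrences in order."""
    seen, kept = set(), []
    for s in sentences:
        if s not in seen:
            seen.add(s)
            kept.append(s)
    return kept

def resplit(pos_sentences, ner_dep_eval_sentences):
    """Assign a POS sentence to validation/test only if it also appears in the
    union of the NER and dependency parsing validation/test sets; everything
    else stays in the POS training set."""
    eval_set = set(ner_dep_eval_sentences)
    train = [s for s in pos_sentences if s not in eval_set]
    evaluation = [s for s in pos_sentences if s in eval_set]
    return train, evaluation

corpus = deduplicate(["s1", "s2", "s1", "s3"])   # -> ["s1", "s2", "s3"]
train, evaluation = resplit(corpus, ["s2"])      # -> (["s1", "s3"], ["s2"])
```

This guarantees that no POS training sentence leaks into any of the three tasks' evaluation sets.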
96
+ Table [tab:Datasets] details the statistics of the experimental datasets.
97
+ | **Task** | **#train** | **#valid** | **#test** |
+ |---|---|---|---|
+ | POS tagging (leakage) | 27000 | 870 | 2120 |
+ | POS tagging (re-split) | 23906 | 2009 | 3481 |
+ | NER | 14861 | 2000 | 2831 |
+ | Dependency parsing | 8977 | 200 | 1020 |
+ 
+ ``POS tagging (leakage)'' and ``POS tagging (re-split)'' refer to the statistics for POS tagging before and after re-splitting & sentence duplication removal, respectively.
113
+ ### Implementation
114
+ PhoNLP is implemented based on PyTorch [NEURIPS2019_9015], employing the PhoBERT encoder implementation available from the transformers library [wolf-etal-2020-transformers] and the Biaffine classifier implementation from [qi-etal-2018-universal]. We set both the label weight matrices $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$ to have 100 rows, resulting in 100-dimensional soft POS tag embeddings. In addition, following [DozatM17], the FFNNs in equations [equa:fc6]--[equa:fc9] use 400-dimensional output layers.
115
+ We use the AdamW optimizer [loshchilov2018decoupled] with a fixed batch size of 32, and train for 40 epochs. The training sets differ in size, with the POS tagging training set being the largest at 23906 sentences. Thus, for each training epoch, we repeatedly sample from the NER and dependency parsing training sets to fill the gaps between the training set sizes. We perform a grid search to select the initial AdamW learning rate, $\lambda_1$ and $\lambda_2$, finding optimal values of 1e-5, 0.4 and 0.2, respectively. Here, we compute the average of the POS tagging accuracy, NER F-score and dependency parsing LAS score after each training epoch on the validation sets. We select the model checkpoint that produces the highest average score over the validation sets to apply to the test sets. Each reported score is an average over 5 runs with different random seeds.
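The per-epoch repeat-sampling can be sketched with the standard library; `fill_to_size` is a hypothetical helper (not PhoNLP's actual data loader), using the training-set sizes reported above.

```python
import random

def fill_to_size(dataset, target_size, rng):
    """Repeat-sample a smaller training set so that each epoch iterates over
    the same number of examples for every task."""
    if len(dataset) >= target_size:
        return list(dataset)
    extra = rng.choices(dataset, k=target_size - len(dataset))  # with replacement
    return list(dataset) + extra

rng = random.Random(0)
# POS tagging is the largest training set (23906 sentences); pad the others.
ner_epoch = fill_to_size(list(range(14861)), 23906, rng)
dep_epoch = fill_to_size(list(range(8977)), 23906, rng)
```

Re-sampling each epoch means the padding examples differ across epochs, so no fixed subset of the smaller sets is over-weighted throughout training.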
117
+ ## Results
118
+ Table [tab:results] presents results obtained for our PhoNLP and compares them with those of a baseline approach of single-task training. For the single-task training approach: (i) We follow a common approach to fine-tune a pre-trained language model for POS tagging, appending a linear prediction layer on top of PhoBERT, as briefly described in Section [ssec:pos]. (ii) For NER, instead of a linear prediction layer, we append a CRF prediction layer on top of PhoBERT. (iii) For dependency parsing, predicted POS tags are produced by the learned single-task POS tagging model; these POS tags are then represented by embeddings that are concatenated with the corresponding PhoBERT-based contextualized word embeddings, resulting in a sequence of input vectors for the Biaffine-based classifiers for dependency parsing [qi-etal-2018-universal]. Here, the single-task training approach is based on the PhoBERT$_{\mathrm{base}}$ version, employing the same hyper-parameter tuning and model selection strategy that we use for PhoNLP.
119
+ | **Setup** | **Model** | **POS** | **NER** | **LAS** | **UAS** |
+ |---|---|---|---|---|---|
+ | Leak. | Single-task | 96.7 | 93.69 | 78.77 | 85.22 |
+ | Leak. | PhoNLP | **96.76** | **94.41** | **79.11** | **85.47** |
+ | Re-spl | Single-task | 93.68 | 93.69 | 77.89 | 84.78 |
+ | Re-spl | PhoNLP | **93.88** | **94.51** | **78.17** | **84.95** |
131
+ Note that PhoBERT helps produce state-of-the-art results for multiple Vietnamese NLP tasks (including but not limited to POS tagging, NER and dependency parsing in a single-task training strategy), and obtains higher performance results than VnCoreNLP.
132
+ However, in both the PhoBERT and VnCoreNLP papers [phobert,vu-etal-2018-vncorenlp], results for POS tagging and dependency parsing are reported w.r.t. the data leakage issue. Our ``Single-task'' results in Table [tab:results] regarding ``Re-spl'' (i.e. the data re-split and duplication removal for POS tagging to avoid the data leakage issue) can be viewed as new PhoBERT results for a proper experimental setup. Table [tab:results] shows that in both setups ``Leak.'' and ``Re-spl'', our joint multi-task training approach PhoNLP performs better than the PhoBERT-based single-task training approach, thus resulting in state-of-the-art performances for the three tasks of Vietnamese POS tagging, NER and dependency parsing.
133
+ # PhoNLP toolkit
134
+ We present in this section a basic usage of our PhoNLP toolkit.
135
+ We make PhoNLP simple to setup, i.e. users can install PhoNLP from either source or pip (e.g. `pip3 install phonlp`). We also aim to make PhoNLP simple to run from both the command-line and the Python API. For example, annotating a corpus with POS tagging, NER and dependency parsing can be performed by using a simple command as in Figure [fig:command].
136
+ Assume that the input file ``input.txt'' in Figure [fig:command] contains a sentence ``Tôi đang làm\_việc tại VinAI .'' (I am working at VinAI). Table [tab:format] shows the annotated output in plain text form for this sentence. Similarly, we also get the same output by using the Python API, as simply as in Figure [fig:code].
138
+ Furthermore, commands to (re-)train and evaluate PhoNLP using gold annotated corpora are detailed in the PhoNLP GitHub repository. Note that it is absolutely possible to directly employ our PhoNLP (re-)training and evaluation command scripts for other languages that have gold annotated corpora available for the three tasks and a pre-trained BERT-based language model available from the transformers library.
139
+ `python3 run_phonlp.py --save_dir ./pretrained_phonlp --mode annotate --input_file input.txt --output_file output.txt`
142
+ | Index | Word form | POS | NER | Head | DepRel |
+ |---|---|---|---|---|---|
+ | 1 | Tôi | P | O | 3 | sub |
+ | 2 | đang | R | O | 3 | adv |
+ | 3 | làm\_việc | V | O | 0 | root |
+ | 4 | tại | E | O | 3 | loc |
+ | 5 | VinAI | Np | B-ORG | 4 | pob |
+ | 6 | . | CH | O | 3 | punct |
+ 
+ Annotated output in plain text form for the sentence ``Tôi đang làm\_việc tại VinAI .'' from the input file ``input.txt'' in Figure [fig:command]. The output is formatted with 6 columns representing word index, word form, POS tag, NER label, head index of the current word and its dependency relation type.
151
+ #### Speed test: We perform a sole CPU-based speed test using a personal computer with an Intel Core i5 8265U 1.6GHz CPU & 8GB of memory. For a GPU-based speed test, we employ a machine with a single NVIDIA RTX 2080Ti GPU. For performing the three NLP tasks jointly, PhoNLP obtains a speed of 15 sentences per second for the CPU-based test and 129 sentences per second for the GPU-based test, with an average of 23 word tokens per sentence and a batch size of 8.
152
+ ```python
+ import phonlp
+ # Automatically download the pretrained PhoNLP model
+ # and save it in a local machine folder
+ phonlp.download(save_dir='./pretrained_phonlp')
+ # Load the pretrained PhoNLP model
+ model = phonlp.load(save_dir='./pretrained_phonlp')
+ # Annotate a corpus
+ model.annotate(input_file='input.txt', output_file='output.txt')
+ # Annotate a sentence
+ model.print_out(model.annotate(text='Tôi đang làm_việc tại VinAI .'))
+ ```
165
+ # Conclusion and future work
166
+ We have presented the first multi-task learning model PhoNLP for joint POS tagging, NER and dependency parsing in Vietnamese. Experiments on Vietnamese benchmark datasets show that PhoNLP outperforms its strong fine-tuned PhoBERT-based single-task training baseline, producing state-of-the-art performance results. We publicly release PhoNLP as an easy-to-use open-source toolkit and hope that PhoNLP can facilitate future NLP research and applications.
167
+ In future work, we will also apply PhoNLP to other languages.
references/2021.naacl.nguyen/paper.tex ADDED
@@ -0,0 +1,641 @@
1
+ % This must be in the first 5 lines to tell arXiv to use pdfLaTeX, which is strongly recommended.
2
+ \pdfoutput=1
3
+ % In particular, the hyperref package requires pdfLaTeX in order to break URLs across lines.
4
+
5
+ \documentclass[11pt]{article}
6
+
7
+ % Remove the "review" option to generate the final version.
8
+ \usepackage{naacl2021}
9
+
10
+ % Standard package includes
11
+ \usepackage{times}
12
+ \usepackage{latexsym}
13
+
14
+ %\renewcommand{\UrlFont}{\ttfamily\small}
15
+
16
+
17
+ %\renewcommand{\UrlFont}{\ttfamily\small}
18
+
19
+ \usepackage{amsmath}
20
+ \usepackage{url}
21
+ \usepackage{amssymb}
22
+ \usepackage{amsfonts}
23
+ \usepackage{graphicx}
24
+ \usepackage{tabularx}
25
+ \usepackage{multirow}
26
+ \usepackage{arydshln}
27
+ \usepackage{mathtools,nccmath}
28
+ \usepackage{listings}
29
+
30
+ \usepackage[T5]{fontenc}
31
+ %\usepackage[utf8]{vietnam}
32
+ \usepackage{enumitem}
33
+ %\usepackage{ulem}
34
+ \usepackage{todonotes}
35
+ % \usepackage[usenames,dvipsnames]{color}
36
+ \usepackage{cancel}
37
+ \usepackage[draft]{minted}
38
+
39
+ % This is not strictly necessary, and may be commented out,
40
+ % but it will improve the layout of the manuscript,
41
+ % and will typically save some space.
42
+ \usepackage{microtype}
43
+
44
+
45
+
46
+ \makeatletter
47
+ \def\PYGdefault@reset{\let\PYGdefault@it=\relax \let\PYGdefault@bf=\relax%
48
+ \let\PYGdefault@ul=\relax \let\PYGdefault@tc=\relax%
49
+ \let\PYGdefault@bc=\relax \let\PYGdefault@ff=\relax}
50
+ \def\PYGdefault@tok#1{\csname PYGdefault@tok@#1\endcsname}
51
+ \def\PYGdefault@toks#1+{\ifx\relax#1\empty\else%
52
+ \PYGdefault@tok{#1}\expandafter\PYGdefault@toks\fi}
53
+ \def\PYGdefault@do#1{\PYGdefault@bc{\PYGdefault@tc{\PYGdefault@ul{%
54
+ \PYGdefault@it{\PYGdefault@bf{\PYGdefault@ff{#1}}}}}}}
55
+ \def\PYGdefault#1#2{\PYGdefault@reset\PYGdefault@toks#1+\relax+\PYGdefault@do{#2}}
56
+
57
+ \expandafter\def\csname PYGdefault@tok@w\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.73,0.73,0.73}{##1}}}
58
+ \expandafter\def\csname PYGdefault@tok@c\endcsname{\let\PYGdefault@it=\textit\def\PYGdefault@tc##1{\textcolor[rgb]{0.25,0.50,0.50}{##1}}}
59
+ \expandafter\def\csname PYGdefault@tok@cp\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.74,0.48,0.00}{##1}}}
60
+ \expandafter\def\csname PYGdefault@tok@k\endcsname{\let\PYGdefault@bf=\textbf\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
61
+ \expandafter\def\csname PYGdefault@tok@kp\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
62
+ \expandafter\def\csname PYGdefault@tok@kt\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.69,0.00,0.25}{##1}}}
63
+ \expandafter\def\csname PYGdefault@tok@o\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
64
+ \expandafter\def\csname PYGdefault@tok@ow\endcsname{\let\PYGdefault@bf=\textbf\def\PYGdefault@tc##1{\textcolor[rgb]{0.67,0.13,1.00}{##1}}}
65
+ \expandafter\def\csname PYGdefault@tok@nb\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
66
+ \expandafter\def\csname PYGdefault@tok@nf\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.00,1.00}{##1}}}
67
+ \expandafter\def\csname PYGdefault@tok@nc\endcsname{\let\PYGdefault@bf=\textbf\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.00,1.00}{##1}}}
68
+ \expandafter\def\csname PYGdefault@tok@nn\endcsname{\let\PYGdefault@bf=\textbf\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.00,1.00}{##1}}}
69
+ \expandafter\def\csname PYGdefault@tok@ne\endcsname{\let\PYGdefault@bf=\textbf\def\PYGdefault@tc##1{\textcolor[rgb]{0.82,0.25,0.23}{##1}}}
70
+ \expandafter\def\csname PYGdefault@tok@nv\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.10,0.09,0.49}{##1}}}
71
+ \expandafter\def\csname PYGdefault@tok@no\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.53,0.00,0.00}{##1}}}
72
+ \expandafter\def\csname PYGdefault@tok@nl\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.63,0.63,0.00}{##1}}}
73
+ \expandafter\def\csname PYGdefault@tok@ni\endcsname{\let\PYGdefault@bf=\textbf\def\PYGdefault@tc##1{\textcolor[rgb]{0.60,0.60,0.60}{##1}}}
74
+ \expandafter\def\csname PYGdefault@tok@na\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.49,0.56,0.16}{##1}}}
75
+ \expandafter\def\csname PYGdefault@tok@nt\endcsname{\let\PYGdefault@bf=\textbf\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
76
+ \expandafter\def\csname PYGdefault@tok@nd\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.67,0.13,1.00}{##1}}}
77
+ \expandafter\def\csname PYGdefault@tok@s\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
78
+ \expandafter\def\csname PYGdefault@tok@sd\endcsname{\let\PYGdefault@it=\textit\def\PYGdefault@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
79
+ \expandafter\def\csname PYGdefault@tok@si\endcsname{\let\PYGdefault@bf=\textbf\def\PYGdefault@tc##1{\textcolor[rgb]{0.73,0.40,0.53}{##1}}}
80
+ \expandafter\def\csname PYGdefault@tok@se\endcsname{\let\PYGdefault@bf=\textbf\def\PYGdefault@tc##1{\textcolor[rgb]{0.73,0.40,0.13}{##1}}}
81
+ \expandafter\def\csname PYGdefault@tok@sr\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.73,0.40,0.53}{##1}}}
82
+ \expandafter\def\csname PYGdefault@tok@ss\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.10,0.09,0.49}{##1}}}
83
+ \expandafter\def\csname PYGdefault@tok@sx\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
84
+ \expandafter\def\csname PYGdefault@tok@m\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
85
+ \expandafter\def\csname PYGdefault@tok@gh\endcsname{\let\PYGdefault@bf=\textbf\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.00,0.50}{##1}}}
86
+ \expandafter\def\csname PYGdefault@tok@gu\endcsname{\let\PYGdefault@bf=\textbf\def\PYGdefault@tc##1{\textcolor[rgb]{0.50,0.00,0.50}{##1}}}
87
+ \expandafter\def\csname PYGdefault@tok@gd\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.63,0.00,0.00}{##1}}}
88
+ \expandafter\def\csname PYGdefault@tok@gi\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.63,0.00}{##1}}}
89
+ \expandafter\def\csname PYGdefault@tok@gr\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{1.00,0.00,0.00}{##1}}}
90
+ \expandafter\def\csname PYGdefault@tok@ge\endcsname{\let\PYGdefault@it=\textit}
91
+ \expandafter\def\csname PYGdefault@tok@gs\endcsname{\let\PYGdefault@bf=\textbf}
92
+ \expandafter\def\csname PYGdefault@tok@gp\endcsname{\let\PYGdefault@bf=\textbf\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.00,0.50}{##1}}}
93
+ \expandafter\def\csname PYGdefault@tok@go\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.53,0.53,0.53}{##1}}}
94
+ \expandafter\def\csname PYGdefault@tok@gt\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.27,0.87}{##1}}}
95
+ \expandafter\def\csname PYGdefault@tok@err\endcsname{\def\PYGdefault@bc##1{\setlength{\fboxsep}{0pt}\fcolorbox[rgb]{1.00,0.00,0.00}{1,1,1}{\strut ##1}}}
96
+ \expandafter\def\csname PYGdefault@tok@kc\endcsname{\let\PYGdefault@bf=\textbf\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
97
+ \expandafter\def\csname PYGdefault@tok@kd\endcsname{\let\PYGdefault@bf=\textbf\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
98
+ \expandafter\def\csname PYGdefault@tok@kn\endcsname{\let\PYGdefault@bf=\textbf\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
99
+ \expandafter\def\csname PYGdefault@tok@kr\endcsname{\let\PYGdefault@bf=\textbf\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
100
+ \expandafter\def\csname PYGdefault@tok@bp\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
101
+ \expandafter\def\csname PYGdefault@tok@fm\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.00,1.00}{##1}}}
102
+ \expandafter\def\csname PYGdefault@tok@vc\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.10,0.09,0.49}{##1}}}
103
+ \expandafter\def\csname PYGdefault@tok@vg\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.10,0.09,0.49}{##1}}}
104
+ \expandafter\def\csname PYGdefault@tok@vi\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.10,0.09,0.49}{##1}}}
105
+ \expandafter\def\csname PYGdefault@tok@vm\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.10,0.09,0.49}{##1}}}
106
+ \expandafter\def\csname PYGdefault@tok@sa\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
107
+ \expandafter\def\csname PYGdefault@tok@sb\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
108
+ \expandafter\def\csname PYGdefault@tok@sc\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
109
+ \expandafter\def\csname PYGdefault@tok@dl\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
110
+ \expandafter\def\csname PYGdefault@tok@s2\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
111
+ \expandafter\def\csname PYGdefault@tok@sh\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
112
+ \expandafter\def\csname PYGdefault@tok@s1\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
113
+ \expandafter\def\csname PYGdefault@tok@mb\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
114
+ \expandafter\def\csname PYGdefault@tok@mf\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
115
+ \expandafter\def\csname PYGdefault@tok@mh\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
116
+ \expandafter\def\csname PYGdefault@tok@mi\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
117
+ \expandafter\def\csname PYGdefault@tok@il\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
118
+ \expandafter\def\csname PYGdefault@tok@mo\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
119
+ \expandafter\def\csname PYGdefault@tok@ch\endcsname{\let\PYGdefault@it=\textit\def\PYGdefault@tc##1{\textcolor[rgb]{0.25,0.50,0.50}{##1}}}
120
+ \expandafter\def\csname PYGdefault@tok@cm\endcsname{\let\PYGdefault@it=\textit\def\PYGdefault@tc##1{\textcolor[rgb]{0.25,0.50,0.50}{##1}}}
121
+ \expandafter\def\csname PYGdefault@tok@cpf\endcsname{\let\PYGdefault@it=\textit\def\PYGdefault@tc##1{\textcolor[rgb]{0.25,0.50,0.50}{##1}}}
122
+ \expandafter\def\csname PYGdefault@tok@c1\endcsname{\let\PYGdefault@it=\textit\def\PYGdefault@tc##1{\textcolor[rgb]{0.25,0.50,0.50}{##1}}}
123
+ \expandafter\def\csname PYGdefault@tok@cs\endcsname{\let\PYGdefault@it=\textit\def\PYGdefault@tc##1{\textcolor[rgb]{0.25,0.50,0.50}{##1}}}
124
+
125
+ \def\PYGdefaultZbs{\char`\\}
126
+ \def\PYGdefaultZus{\char`\_}
127
+ \def\PYGdefaultZob{\char`\{}
128
+ \def\PYGdefaultZcb{\char`\}}
129
+ \def\PYGdefaultZca{\char`\^}
130
+ \def\PYGdefaultZam{\char`\&}
131
+ \def\PYGdefaultZlt{\char`\<}
132
+ \def\PYGdefaultZgt{\char`\>}
133
+ \def\PYGdefaultZsh{\char`\#}
134
+ \def\PYGdefaultZpc{\char`\%}
135
+ \def\PYGdefaultZdl{\char`\$}
136
+ \def\PYGdefaultZhy{\char`\-}
137
+ \def\PYGdefaultZsq{\char`\'}
138
+ \def\PYGdefaultZdq{\char`\"}
139
+ \def\PYGdefaultZti{\char`\~}
140
+ % for compatibility with earlier versions
141
+ \def\PYGdefaultZat{@}
142
+ \def\PYGdefaultZlb{[}
143
+ \def\PYGdefaultZrb{]}
144
+ \makeatother
145
+
146
+
147
+
148
+ \makeatletter
149
+ \def\PYG@reset{\let\PYG@it=\relax \let\PYG@bf=\relax%
150
+ \let\PYG@ul=\relax \let\PYG@tc=\relax%
151
+ \let\PYG@bc=\relax \let\PYG@ff=\relax}
152
+ \def\PYG@tok#1{\csname PYG@tok@#1\endcsname}
153
+ \def\PYG@toks#1+{\ifx\relax#1\empty\else%
154
+ \PYG@tok{#1}\expandafter\PYG@toks\fi}
155
+ \def\PYG@do#1{\PYG@bc{\PYG@tc{\PYG@ul{%
156
+ \PYG@it{\PYG@bf{\PYG@ff{#1}}}}}}}
157
+ \def\PYG#1#2{\PYG@reset\PYG@toks#1+\relax+\PYG@do{#2}}
158
+
159
+ \expandafter\def\csname PYG@tok@w\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.73,0.73,0.73}{##1}}}
160
+ \expandafter\def\csname PYG@tok@c\endcsname{\let\PYG@it=\textit\def\PYG@tc##1{\textcolor[rgb]{0.25,0.50,0.50}{##1}}}
161
+ \expandafter\def\csname PYG@tok@cp\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.74,0.48,0.00}{##1}}}
162
+ \expandafter\def\csname PYG@tok@k\endcsname{\let\PYG@bf=\textbf\def\PYG@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
163
+ \expandafter\def\csname PYG@tok@kp\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
164
+ \expandafter\def\csname PYG@tok@kt\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.69,0.00,0.25}{##1}}}
165
+ \expandafter\def\csname PYG@tok@o\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
166
+ \expandafter\def\csname PYG@tok@ow\endcsname{\let\PYG@bf=\textbf\def\PYG@tc##1{\textcolor[rgb]{0.67,0.13,1.00}{##1}}}
167
+ \expandafter\def\csname PYG@tok@nb\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
168
+ \expandafter\def\csname PYG@tok@nf\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.00,0.00,1.00}{##1}}}
169
+ \expandafter\def\csname PYG@tok@nc\endcsname{\let\PYG@bf=\textbf\def\PYG@tc##1{\textcolor[rgb]{0.00,0.00,1.00}{##1}}}
170
+ \expandafter\def\csname PYG@tok@nn\endcsname{\let\PYG@bf=\textbf\def\PYG@tc##1{\textcolor[rgb]{0.00,0.00,1.00}{##1}}}
171
+ \expandafter\def\csname PYG@tok@ne\endcsname{\let\PYG@bf=\textbf\def\PYG@tc##1{\textcolor[rgb]{0.82,0.25,0.23}{##1}}}
172
+ \expandafter\def\csname PYG@tok@nv\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.10,0.09,0.49}{##1}}}
173
+ \expandafter\def\csname PYG@tok@no\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.53,0.00,0.00}{##1}}}
174
+ \expandafter\def\csname PYG@tok@nl\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.63,0.63,0.00}{##1}}}
175
+ \expandafter\def\csname PYG@tok@ni\endcsname{\let\PYG@bf=\textbf\def\PYG@tc##1{\textcolor[rgb]{0.60,0.60,0.60}{##1}}}
176
+ \expandafter\def\csname PYG@tok@na\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.49,0.56,0.16}{##1}}}
+ \expandafter\def\csname PYG@tok@nt\endcsname{\let\PYG@bf=\textbf\def\PYG@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
+ \expandafter\def\csname PYG@tok@nd\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.67,0.13,1.00}{##1}}}
+ \expandafter\def\csname PYG@tok@s\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
+ \expandafter\def\csname PYG@tok@sd\endcsname{\let\PYG@it=\textit\def\PYG@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
+ \expandafter\def\csname PYG@tok@si\endcsname{\let\PYG@bf=\textbf\def\PYG@tc##1{\textcolor[rgb]{0.73,0.40,0.53}{##1}}}
+ \expandafter\def\csname PYG@tok@se\endcsname{\let\PYG@bf=\textbf\def\PYG@tc##1{\textcolor[rgb]{0.73,0.40,0.13}{##1}}}
+ \expandafter\def\csname PYG@tok@sr\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.73,0.40,0.53}{##1}}}
+ \expandafter\def\csname PYG@tok@ss\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.10,0.09,0.49}{##1}}}
+ \expandafter\def\csname PYG@tok@sx\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
+ \expandafter\def\csname PYG@tok@m\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
+ \expandafter\def\csname PYG@tok@gh\endcsname{\let\PYG@bf=\textbf\def\PYG@tc##1{\textcolor[rgb]{0.00,0.00,0.50}{##1}}}
+ \expandafter\def\csname PYG@tok@gu\endcsname{\let\PYG@bf=\textbf\def\PYG@tc##1{\textcolor[rgb]{0.50,0.00,0.50}{##1}}}
+ \expandafter\def\csname PYG@tok@gd\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.63,0.00,0.00}{##1}}}
+ \expandafter\def\csname PYG@tok@gi\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.00,0.63,0.00}{##1}}}
+ \expandafter\def\csname PYG@tok@gr\endcsname{\def\PYG@tc##1{\textcolor[rgb]{1.00,0.00,0.00}{##1}}}
+ \expandafter\def\csname PYG@tok@ge\endcsname{\let\PYG@it=\textit}
+ \expandafter\def\csname PYG@tok@gs\endcsname{\let\PYG@bf=\textbf}
+ \expandafter\def\csname PYG@tok@gp\endcsname{\let\PYG@bf=\textbf\def\PYG@tc##1{\textcolor[rgb]{0.00,0.00,0.50}{##1}}}
+ \expandafter\def\csname PYG@tok@go\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.53,0.53,0.53}{##1}}}
+ \expandafter\def\csname PYG@tok@gt\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.00,0.27,0.87}{##1}}}
+ \expandafter\def\csname PYG@tok@err\endcsname{\def\PYG@bc##1{\setlength{\fboxsep}{0pt}\fcolorbox[rgb]{1.00,0.00,0.00}{1,1,1}{\strut ##1}}}
+ \expandafter\def\csname PYG@tok@kc\endcsname{\let\PYG@bf=\textbf\def\PYG@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
+ \expandafter\def\csname PYG@tok@kd\endcsname{\let\PYG@bf=\textbf\def\PYG@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
+ \expandafter\def\csname PYG@tok@kn\endcsname{\let\PYG@bf=\textbf\def\PYG@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
+ \expandafter\def\csname PYG@tok@kr\endcsname{\let\PYG@bf=\textbf\def\PYG@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
+ \expandafter\def\csname PYG@tok@bp\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
+ \expandafter\def\csname PYG@tok@fm\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.00,0.00,1.00}{##1}}}
+ \expandafter\def\csname PYG@tok@vc\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.10,0.09,0.49}{##1}}}
+ \expandafter\def\csname PYG@tok@vg\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.10,0.09,0.49}{##1}}}
+ \expandafter\def\csname PYG@tok@vi\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.10,0.09,0.49}{##1}}}
+ \expandafter\def\csname PYG@tok@vm\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.10,0.09,0.49}{##1}}}
+ \expandafter\def\csname PYG@tok@sa\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
+ \expandafter\def\csname PYG@tok@sb\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
+ \expandafter\def\csname PYG@tok@sc\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
+ \expandafter\def\csname PYG@tok@dl\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
+ \expandafter\def\csname PYG@tok@s2\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
+ \expandafter\def\csname PYG@tok@sh\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
+ \expandafter\def\csname PYG@tok@s1\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
+ \expandafter\def\csname PYG@tok@mb\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
+ \expandafter\def\csname PYG@tok@mf\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
+ \expandafter\def\csname PYG@tok@mh\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
+ \expandafter\def\csname PYG@tok@mi\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
+ \expandafter\def\csname PYG@tok@il\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
+ \expandafter\def\csname PYG@tok@mo\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
+ \expandafter\def\csname PYG@tok@ch\endcsname{\let\PYG@it=\textit\def\PYG@tc##1{\textcolor[rgb]{0.25,0.50,0.50}{##1}}}
+ \expandafter\def\csname PYG@tok@cm\endcsname{\let\PYG@it=\textit\def\PYG@tc##1{\textcolor[rgb]{0.25,0.50,0.50}{##1}}}
+ \expandafter\def\csname PYG@tok@cpf\endcsname{\let\PYG@it=\textit\def\PYG@tc##1{\textcolor[rgb]{0.25,0.50,0.50}{##1}}}
+ \expandafter\def\csname PYG@tok@c1\endcsname{\let\PYG@it=\textit\def\PYG@tc##1{\textcolor[rgb]{0.25,0.50,0.50}{##1}}}
+ \expandafter\def\csname PYG@tok@cs\endcsname{\let\PYG@it=\textit\def\PYG@tc##1{\textcolor[rgb]{0.25,0.50,0.50}{##1}}}
+
+ \def\PYGZbs{\char`\\}
+ \def\PYGZus{\char`\_}
+ \def\PYGZob{\char`\{}
+ \def\PYGZcb{\char`\}}
+ \def\PYGZca{\char`\^}
+ \def\PYGZam{\char`\&}
+ \def\PYGZlt{\char`\<}
+ \def\PYGZgt{\char`\>}
+ \def\PYGZsh{\char`\#}
+ \def\PYGZpc{\char`\%}
+ \def\PYGZdl{\char`\$}
+ \def\PYGZhy{\char`\-}
+ \def\PYGZsq{\char`\'}
+ \def\PYGZdq{\char`\"}
+ \def\PYGZti{\char`\~}
+ % for compatibility with earlier versions
+ \def\PYGZat{@}
+ \def\PYGZlb{[}
+ \def\PYGZrb{]}
+ \makeatother
+
+
+
+
+
+ \setlength{\textfloatsep}{15pt plus 5.0pt minus 3.0pt}
+ \setlength{\floatsep}{15pt plus 5.0pt minus 3.0pt}
+ %\setlength{\dbltextfloatsep }{15pt plus 2.0pt minus 3.0pt}
+ %\setlength{\dblfloatsep}{15pt plus 2.0pt minus 3.0pt}
+ %\setlength{\intextsep}{15pt plus 2.0pt minus 3.0pt}
+ \setlength{\abovecaptionskip}{5pt plus 1pt minus 1pt}
+
+ % If the title and author information does not fit in the area allocated, uncomment the following
+ %
+ %\setlength\titlebox{<dim>}
+ %
+ % and set <dim> to something 5cm or larger.
+
+ %\setlength\titlebox{5cm}
+
+ %\setlength{\textfloatsep}{15pt plus 5.0pt minus 5.0pt}
+ %\setlength{\floatsep}{15pt plus 5.0pt minus 5.0pt}
+ %\setlength{\dbltextfloatsep }{15pt plus 2.0pt minus 3.0pt}
+ %\setlength{\dblfloatsep}{15pt plus 2.0pt minus 3.0pt}
+ %\setlength{\intextsep}{15pt plus 2.0pt minus 3.0pt}
+ %\setlength{\abovecaptionskip}{5pt plus 1pt minus 1pt}
+
+ % If the title and author information does not fit in the area allocated, uncomment the following
+ %
+ \setlength\titlebox{5cm}
+ %
+ % and set <dim> to something 5cm or larger.
+
+ \title{PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsing}
+
+ % Author information can be set in various styles:
+ % For several authors from the same institution:
+ \author{Linh The Nguyen \and Dat Quoc Nguyen\\
+ VinAI Research, Hanoi, Vietnam \\
+ \tt{\normalsize \{v.linhnt140, v.datnq9\}@vinai.io}}
+ % if the names do not fit well on one line use
+ % Author 1 \\ {\bf Author 2} \\ ... \\ {\bf Author n} \\
+ % For authors from different institutions:
+ % \author{Author 1 \\ Address line \\ ... \\ Address line
+ % \And ... \And
+ % Author n \\ Address line \\ ... \\ Address line}
+ % To start a separate ``row'' of authors use \AND, as in
+ % \author{Author 1 \\ Address line \\ ... \\ Address line
+ % \AND
+ % Author 2 \\ Address line \\ ... \\ Address line \And
+ % Author 3 \\ Address line \\ ... \\ Address line}
+
+ %\author{ }
+
+ \begin{document}
+ \maketitle
+
+ \begin{abstract}
+ We present the first multi-task learning model---named PhoNLP---for joint Vietnamese part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing. Experiments on Vietnamese benchmark datasets show that PhoNLP produces state-of-the-art results, outperforming a single-task learning approach that fine-tunes the pre-trained Vietnamese language model PhoBERT \cite{phobert} for each task independently. We publicly release PhoNLP as an open-source toolkit under the Apache License 2.0.
+ Although we present PhoNLP for Vietnamese, our PhoNLP training and evaluation command scripts can in fact work directly for any other language that has a pre-trained BERT-based language model and gold annotated corpora available for the three tasks of POS tagging, NER and dependency parsing.
+ We hope that PhoNLP can serve as a strong baseline and useful toolkit for future NLP research and applications, not only for Vietnamese but also for other languages. Our PhoNLP is available at \url{https://github.com/VinAIResearch/PhoNLP}.
+ \end{abstract}
+
+ \vspace{-5pt}
+
+ \begin{figure*}[!t]
+ \centering
+ \includegraphics[width=12.5cm]{JointModel.pdf}
+ {\small
+ \begin{tabular}{crllll}
+ \hline
+ \textbf{ID} & \textbf{Form} & \textbf{POS} & \textbf{NER} & \textbf{Head} & \textbf{DepRel} \\
+ \hline
+ 1 & Đây\textsubscript{This} & PRON & O & 2 & sub \\
+ 2 & là\textsubscript{is} & VERB & O & 0 & root \\
+ 3 & Hà\_Nội\textsubscript{Ha\_Noi} & NOUN & B-LOC & 2 & vmod \\
+ \hline
+ \end{tabular}
+ }
+ \caption{Illustration of our PhoNLP model.}
+ \label{fig:architecture}
+ \end{figure*}
+
+
+
+ \section{Introduction}
+
+ Vietnamese NLP research has grown significantly in recent years, boosted by the success of the national project on Vietnamese language and speech processing (VLSP) KC01.01/2006-2010 and the VLSP workshops that have run shared tasks since 2013.\footnote{\url{https://vlsp.org.vn/}} The fundamental tasks of POS tagging, NER and dependency parsing play important roles, providing useful features for many downstream application tasks such as machine translation \cite{7800281}, sentiment analysis \cite{BANG20182016IIP0038}, relation extraction \cite{9287471}, semantic parsing \cite{vitext2sql}, open information extraction \cite{3155133.3155171} and question answering \cite{NguyenNP_SWJ,3184558.3191535}. % \cite{7800281,BANG20182016IIP0038,3155133.3155171,9287471}.
+ Thus, there is a need to develop NLP toolkits that provide linguistic annotations for Vietnamese POS tagging, NER and dependency parsing.
+
+ VnCoreNLP \cite{vu-etal-2018-vncorenlp} is the previous public toolkit, employing traditional feature-based machine learning models to handle these Vietnamese NLP tasks. However, VnCoreNLP is no longer state-of-the-art: its results are significantly outperformed by those obtained by fine-tuning PhoBERT---the current state-of-the-art monolingual pre-trained language model for Vietnamese \cite{phobert}. Note that there are no publicly available fine-tuned BERT-based models for the three Vietnamese tasks. Even if there were, a potential drawback is that an NLP package wrapping such fine-tuned BERT-based models would require a large amount of storage, i.e. three times the storage used by a single BERT model \cite{devlin-etal-2019-bert}, and thus would not be suitable for practical applications that require a small storage footprint. Joint multi-task learning is a promising solution as it might help reduce the storage space. In addition, POS tagging, NER and dependency parsing are related tasks: POS tags are essential input features for dependency parsing and are also used as additional features for NER. Joint multi-task learning thus might also improve performance over single-task learning \cite{Ruder2019Neural}.
+
+
+ In this paper, we present a new multi-task learning model---named PhoNLP---for joint POS tagging, NER and dependency parsing. In particular, given an input sentence of words, an encoding layer generates contextualized word embeddings that represent the input words. These contextualized word embeddings are fed into a POS tagging layer, which is a linear prediction layer \cite{devlin-etal-2019-bert}, to predict POS tags for the corresponding input words. Each predicted POS tag is then represented by two ``soft'' embeddings that are later fed into the NER and dependency parsing layers separately.
+ More specifically, based on both the contextualized word embeddings and the ``soft'' POS tag embeddings, the NER layer uses a linear-chain CRF predictor \cite{Lafferty:2001} to predict NER labels for the input words, while the dependency parsing layer uses a Biaffine classifier \cite{DozatM17} to predict dependency arcs between the words and another Biaffine classifier to label the predicted arcs.
+ %To the best of our knowledge, our PhoNLP is the first proposed model to jointly learn POS tagging, NER and dependency parsing for Vietnamese. Experiments on Vietnamese benchmark datasets show that PhoNLP produces state-of-the-art results.
+ Our contributions are summarized as follows:
+
+ %\vspace{-2pt}
+
+ \begin{itemize}[leftmargin=*]
+ \setlength\itemsep{-1pt}
+ \item To the best of our knowledge, PhoNLP is the first proposed model to jointly learn POS tagging, NER and dependency parsing for Vietnamese.
+ \item We discuss a data leakage issue in the Vietnamese benchmark datasets that has not been pointed out before. Experiments show that PhoNLP obtains state-of-the-art performance results, outperforming PhoBERT-based single-task learning.
+ \item We publicly release PhoNLP as an open-source toolkit that is simple to set up and efficient to run from both the command line and the Python API. We hope that PhoNLP can serve as a strong baseline and useful toolkit for future NLP research and downstream applications.
+ \end{itemize}
+
+
+
+ \section{Model description}
+
+ Figure \ref{fig:architecture} illustrates our PhoNLP architecture, which can be viewed as a combination of a BERT-based encoding layer and three decoding layers for POS tagging, NER and dependency parsing.
+
+ \subsection{Encoder \& Contextualized embeddings}
+
+ Given an input sentence consisting of $n$ word tokens $w_1, w_2, ..., w_n$, the encoding layer employs PhoBERT to generate contextualized latent feature embeddings $\mathbf{e}_{i}$, each representing the $i^{th}$ word $w_i$:
+
+ \begin{equation}
+ \mathbf{e}_{i} = \mathrm{PhoBERT\textsubscript{base}}\big({w}_{1:n}, i\big)
+ \end{equation}
+
+ In particular, the encoding layer employs the \textbf{PhoBERT\textsubscript{base}} version. Because PhoBERT uses BPE \cite{sennrich-etal-2016-neural} to segment the input sentence into subword units, the encoding layer in fact represents the $i^{th}$ word $w_i$ by the contextualized embedding of its first subword.
+
+ \subsection{POS tagging}\label{ssec:pos}
+
+ Following common practice for fine-tuning a pre-trained language model on a sequence labeling task \cite{devlin-etal-2019-bert}, the POS tagging layer is a linear prediction layer appended on top of the encoder. In particular, the POS tagging layer feeds the contextualized word embeddings $\mathbf{e}_{i}$ into a feed-forward network (FFNN\textsubscript{POS}) followed by a $\mathsf{softmax}$ predictor for POS tag prediction:
+
+ \begin{equation}
+ \mathbf{p}_{i} = \mathsf{softmax}\big(\mathrm{FFNN\textsubscript{POS}}\big(\mathbf{e}_{i}\big)\big) \label{eq2}
+ \end{equation}
+
+ \noindent where the output layer size of FFNN\textsubscript{POS} is the number of POS tags. Based on the probability vectors $\mathbf{p}_{i}$, a cross-entropy objective loss \textbf{$\mathcal{L}_{\text{POS}}$} is calculated for POS tagging during training.
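To make the linear-prediction-plus-softmax step above concrete, here is a minimal, dependency-free Python sketch. It is illustrative only (PhoNLP itself is implemented in PyTorch), and the function names and toy dimensions are hypothetical:

```python
import math

def softmax(logits):
    # Numerically stable softmax turning logits into a probability vector.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pos_tag_probs(e_i, W, b):
    # Linear prediction layer: p_i = softmax(W e_i + b),
    # where W has one row per POS tag in the tag set.
    logits = [sum(w * x for w, x in zip(row, e_i)) + bias
              for row, bias in zip(W, b)]
    return softmax(logits)

# Toy example: a 4-dimensional "contextualized embedding" scored over 3 tags.
p_i = pos_tag_probs(
    [0.1, -0.2, 0.3, 0.0],
    [[0.5, 0.1, 0.0, 0.2], [0.0, 0.3, -0.1, 0.4], [0.2, 0.2, 0.2, 0.2]],
    [0.0, 0.0, 0.0],
)
```

During training, the cross-entropy loss is computed against the gold tag's probability in `p_i`.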
+
+
+ \subsection{NER}\label{ssec:ner}
+
+ The NER layer creates a sequence of vectors $\mathbf{v}_{1:n}$ in which each $\mathbf{v}_{i}$ is obtained by concatenating the contextualized word embedding $\mathbf{e}_{i}$ and a ``soft'' POS tag embedding $\mathbf{t}_{i}^{(1)}$:
+
+ \begin{equation}
+ \mathbf{v}_{i} = \mathbf{e}_{i} \circ \mathbf{t}_{i}^{(1)}
+ \label{equa:ner}
+ \end{equation}
+
+ \noindent where following \newcite{hashimoto-etal-2017-joint}, the ``soft'' POS tag embedding $\mathbf{t}_{i}^{(1)}$ is computed by multiplying a label weight matrix $\mathbf{W}^{(1)}$ with the corresponding probability vector $\mathbf{p}_{i}$:
+
+ \begin{equation*}
+ \mathbf{t}_{i}^{(1)} = \mathbf{W}^{(1)}\mathbf{p}_{i}
+ \end{equation*}
+
+ The NER layer then passes each vector $\mathbf{v}_{i}$ into a FFNN (FFNN\textsubscript{NER}):
+
+ \begin{equation}
+ \mathbf{h}_{i} = \mathrm{FFNN\textsubscript{NER}}\big(\mathbf{v}_{i}\big) \label{eq4}
+ \end{equation}
+
+ \noindent where the output layer size of FFNN\textsubscript{NER} is the number of BIO-based NER labels.
+
+
+ The NER layer feeds the output vectors $\mathbf{h}_{i}$ into a linear-chain CRF predictor for NER label prediction \cite{Lafferty:2001}. A cross-entropy loss \textbf{$\mathcal{L}_{\text{NER}}$} is calculated for NER during training, while the Viterbi algorithm is used for inference.
+
+ \subsection{Dependency parsing}
+
+ The dependency parsing layer creates vectors $\mathbf{z}_{1:n}$ in which each $\mathbf{z}_{i}$ is obtained by concatenating $\mathbf{e}_{i}$ and another ``soft'' POS tag embedding $\mathbf{t}_{i}^{(2)}$:
+
+ \begin{eqnarray}
+ \mathbf{z}_{i} &=& \mathbf{e}_{i} \circ \mathbf{t}_{i}^{(2)} \label{equa:posdep} \\
+ \mathbf{t}_{i}^{(2)} &=& \mathbf{W}^{(2)}\mathbf{p}_{i} \nonumber
+ \end{eqnarray}
+
+ Following \newcite{DozatM17}, the dependency parsing layer uses FFNNs to split $\mathbf{z}_{i}$ into \emph{head} and \emph{dependent} representations:
+
+ \begin{eqnarray}
+ \mathbf{h}_{i}^{(\textsc{a-h})} &=& \mathrm{FFNN}_{\text{Arc-Head}}\big(\mathbf{z}_{i}\big) \label{equa:fc6} \\
+ \mathbf{h}_{i}^{(\textsc{a-d})} &=& \mathrm{FFNN}_{\text{Arc-Dep}}\big(\mathbf{z}_{i}\big) \\
+ \mathbf{h}_{i}^{(\textsc{l-h})} &=& \mathrm{FFNN}_{\text{Label-Head}}\big(\mathbf{z}_{i}\big) \\
+ \mathbf{h}_{i}^{(\textsc{l-d})} &=& \mathrm{FFNN}_{\text{Label-Dep}}\big(\mathbf{z}_{i}\big) \label{equa:fc9}
+ \end{eqnarray}
+
+
+ To predict potential dependency arcs, based on the input vectors $\mathbf{h}_{i}^{(\textsc{a-h})}$ and $\mathbf{h}_{j}^{(\textsc{a-d})}$, the parsing layer uses a variant of the Biaffine classifier \cite{qi-etal-2018-universal} that additionally takes into account the distance and relative ordering between two words to produce a probability distribution of arc heads for each word. %\footnote{We utilize an implementation of the Biaffine classifier's variant \cite{qi-etal-2018-universal} from \newcite{qi-etal-2020-stanza}.}
+ For inference, the Chu–Liu/Edmonds' algorithm is used to find a maximum spanning tree \cite{chuliu,Edmonds}.
+ The parsing layer also uses another Biaffine classifier to label the predicted arcs, based on the input vectors $\mathbf{h}_{i}^{(\textsc{l-h})}$ and $\mathbf{h}_{j}^{(\textsc{l-d})}$. During training, an objective loss \textbf{$\mathcal{L}_{\text{DEP}}$} is computed by summing a cross-entropy loss for unlabeled dependency parsing and another cross-entropy loss for dependency label prediction, based on gold arcs and arc labels.
+
+
+ %\begin{eqnarray}
+ % {s}_{i,j} &=&\mathrm{Biaff\textsuperscript{(A)}}\Big(\mathbf{h}_{i}^{(\textsc{a-h})}, \mathbf{h}_{j}^{(\textsc{a-d})}\Big) \\
+ %\mathrm{Biaff\textsuperscript{(A)}}\big(\mathbf{x}, \mathbf{y}\big) &=& \mathbf{x}^{\mathsf{T}} \mathbf{U}_1 \mathbf{y} + \mathbf{w}_1^{\mathsf{T}}(\mathbf{x} \circ \mathbf{y}) + {b}_1 \nonumber
+ %\end{eqnarray}
+ %
+ %\noindent where $\mathbf{U}_1$, $\mathbf{w}_1$ and $b_1$ are a $k\times 1 \times k$ tensor, a $2k$-dimensional vector and a bias scalar, respectively
+ %(here, $k$ is the size of the {head} and {dependent} representations).
+ %
+ %\begin{eqnarray}
+ % \mathbf{s}_{i,j} &=&\mathrm{Biaff\textsuperscript{(L)}}\Big(\mathbf{h}_{i}^{(\textsc{l-h})}, \mathbf{h}_{j}^{(\textsc{l-d})}\Big) \\
+ %\mathrm{Biaff\textsuperscript{(L)}}\big(\mathbf{x}, \mathbf{y}\big) &=& \mathbf{x}^{\mathsf{T}} \mathbf{U}_2 \mathbf{y} + \mathbf{W}_2(\mathbf{x} \circ \mathbf{y}) + \mathbf{b}_2 \nonumber
+ %\end{eqnarray}
+ %
+ %\noindent where $\mathbf{U}_2$, $\mathbf{W}_2$ and $\mathbf{b}_2$ are a $k\times l \times k$ tensor, a $l \times 2k$ matrix and a bias vector, respectively (here, $l$ is the number of dependency labels).
+
+
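For reference, a plain biaffine arc scorer of the form $s = \mathbf{x}^{\mathsf{T}}\mathbf{U}\mathbf{y} + \mathbf{w}^{\mathsf{T}}(\mathbf{x} \circ \mathbf{y}) + b$ (the basic form underlying the classifier variant above, without the distance and ordering features) can be sketched in dependency-free Python; the names and toy sizes below are hypothetical:

```python
def biaffine_score(x, y, U, w, b):
    # s = x^T U y + w^T (x concatenated with y) + b, where x is the head
    # representation and y is the dependent representation.
    Uy = [sum(u * yj for u, yj in zip(row, y)) for row in U]
    bilinear = sum(xi * v for xi, v in zip(x, Uy))
    linear = sum(wi * zi for wi, zi in zip(w, x + y))
    return bilinear + linear + b

# Toy example with 2-dimensional head/dependent representations.
s = biaffine_score([1.0, 0.0], [0.0, 1.0],
                   U=[[0.0, 2.0], [1.0, 0.0]],
                   w=[0.1, 0.1, 0.1, 0.1], b=0.5)
```

Scoring every head/dependent pair this way yields an $n \times n$ arc score matrix, over which the maximum spanning tree is decoded at inference time.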
+ \subsection{Joint multi-task learning}
+
+ The final training objective loss \textbf{$\mathcal{L}$} of our model PhoNLP is the weighted sum of the POS tagging loss {$\mathcal{L}_{\text{POS}}$}, the NER loss {$\mathcal{L}_{\text{NER}}$} and the dependency parsing loss {$\mathcal{L}_{\text{DEP}}$}:
+
+ \begin{equation}
+ \textbf{$\mathcal{L}$} = \lambda_1\mathcal{L}_{\text{POS}} + \lambda_2\mathcal{L}_{\text{NER}} + (1 - \lambda_1 - \lambda_2)\mathcal{L}_{\text{DEP}}
+ \end{equation}
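The weighted sum can be written directly; in the sketch below, the default weights 0.4 and 0.2 are the values that the paper later reports as optimal on the validation sets, and the function name is hypothetical:

```python
def joint_loss(l_pos, l_ner, l_dep, lam1=0.4, lam2=0.2):
    # Weighted sum of the three task losses; the three weights sum to 1,
    # so the dependency parsing weight is determined by lam1 and lam2.
    return lam1 * l_pos + lam2 * l_ner + (1.0 - lam1 - lam2) * l_dep
```

Constraining the weights to sum to 1 means only two hyper-parameters need tuning, which keeps the grid search over $\lambda_1$ and $\lambda_2$ small.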
+
+ \paragraph{Discussion:} Our PhoNLP can be viewed as an extension of previous joint POS tagging and dependency parsing models \cite{hashimoto-etal-2017-joint,li-etal-2018-joint-learning,nguyen-verspoor-2018-improved,NguyenALTA2019,kondratyuk-straka-2019-75}, where we additionally incorporate a CRF-based prediction layer for NER. Unlike \newcite{hashimoto-etal-2017-joint}, \newcite{nguyen-verspoor-2018-improved}, \newcite{li-etal-2018-joint-learning} and \newcite{NguyenALTA2019}, which use BiLSTM-based encoders to extract contextualized feature embeddings, we use a BERT-based encoder. \newcite{kondratyuk-straka-2019-75} also employ a BERT-based encoder. However, unlike PhoNLP, where we construct a hierarchical architecture over the POS tagging and dependency parsing layers, \newcite{kondratyuk-straka-2019-75} do not make use of POS tag embeddings for dependency parsing.\footnote{In our preliminary experiments, not feeding the POS tag embeddings into the dependency parsing layer decreases the performance.}
+
+
+
+
+ \section{Experiments}
+
+ \subsection{Setup}
+
+ \subsubsection{Datasets}
+
+ To conduct experiments, we use the benchmark VLSP 2013 POS tagging dataset,\footnote{\url{https://vlsp.org.vn/vlsp2013/eval}} the VLSP 2016 NER dataset \cite{JCC13161} and the VnDT dependency treebank v1.1 \cite{Nguyen2014NLDB}, following the setup used by the VnCoreNLP toolkit \cite{vu-etal-2018-vncorenlp}. Here, VnDT is converted from the Vietnamese constituent treebank \cite{nguyen-etal-2009-building}.
+
+
+ \paragraph{Data leakage issue:} We further discover a data leakage issue that has not been pointed out before: all sentences from the VLSP 2016 NER dataset and the VnDT treebank are included in the VLSP 2013 POS tagging dataset. In particular, over 90\% of the sentences from both the validation and test sets for NER and dependency parsing are included in the POS tagging training set, resulting in an unrealistic evaluation scenario in which POS tags are used as input features for NER and dependency parsing.
+
+ To handle this issue, we re-split the VLSP 2013 POS tagging dataset: the POS tagging validation/test set now only contains sentences that appear in the union of the NER and dependency parsing validation/test sets (i.e. the validation/test sentences for NER and dependency parsing only appear in the POS tagging validation/test set).
+ In addition, there are 594 duplicated sentences in the VLSP 2013 POS tagging dataset (here, no sentence duplication is found in the union of the NER and dependency parsing sentences). Thus we also remove these duplicates from the POS tagging dataset.
+ Table \ref{tab:Datasets} details the statistics of the experimental datasets.
+
+ \begin{table}[!t]
+ \centering
+ \resizebox{7.5cm}{!}{
+ \begin{tabular}{l|l|l|l}
+ \hline
+ \textbf{Task} & \textbf{\#train} & \textbf{\#valid} & \textbf{\#test} \\
+ \hline
+ {POS tagging (leakage)} & {27000} & {870} & {2120} \\
+ \hdashline
+ POS tagging (re-split) & 23906 & 2009 & 3481\\
+ \hline
+ NER & 14861 & 2000 & 2831 \\
+ \hline
+ Dependency parsing & 8977 & 200 & 1020 \\
+ \hline
+ \end{tabular}
+ }
+ \caption{Dataset statistics. \textbf{\#train}, \textbf{\#valid} and \textbf{\#test} denote the numbers of training, validation and test sentences, respectively. Here, ``{POS tagging (leakage)}'' and ``POS tagging (re-split)'' refer to the statistics for POS tagging before and after re-splitting \& sentence duplication removal, respectively.}
+ \label{tab:Datasets}
+ \end{table}
+
+ \subsubsection{Implementation}
+
+ PhoNLP is implemented based on PyTorch \cite{NEURIPS2019_9015}, employing the PhoBERT encoder implementation available from the $\mathrm{transformers}$ library \cite{wolf-etal-2020-transformers} and the Biaffine classifier implementation from \newcite{qi-etal-2020-stanza}. We set both label weight matrices $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$ to have 100 rows, resulting in 100-dimensional soft POS tag embeddings. In addition, following \newcite{qi-etal-2018-universal,qi-etal-2020-stanza}, the FFNNs in equations \ref{equa:fc6}--\ref{equa:fc9} use 400-dimensional output layers.
+
+ We use the AdamW optimizer \cite{loshchilov2018decoupled} with a fixed batch size of 32, and train for 40 epochs. The training sets differ in size, with the POS tagging training set being the largest at 23906 sentences. Thus, for each training epoch, we repeatedly sample from the NER and dependency parsing training sets to fill the gaps between the training set sizes. We perform a grid search to select the initial AdamW learning rate, $\lambda_1$ and $\lambda_2$, finding optimal values of 1e-5, 0.4 and 0.2, respectively. Here, we compute the average of the POS tagging accuracy, NER F\textsubscript{1}-score and dependency parsing LAS score after each training epoch on the validation sets. We select the model checkpoint that produces the highest average validation score to apply to the test sets. Each of our reported scores is an average over 5 runs with different random seeds.
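The per-epoch oversampling described above (repeatedly sampling from the smaller NER and dependency parsing training sets to match the POS tagging training set size) can be sketched as follows. This is an illustrative sketch with hypothetical names, not the actual PhoNLP data loader:

```python
import random

def fill_to_size(dataset, target_size, rng=None):
    # Repeatedly sample (with replacement) from a smaller training set so
    # that every task contributes the same number of sentences per epoch.
    rng = rng or random.Random(0)
    if len(dataset) >= target_size:
        return list(dataset)
    extra = [rng.choice(dataset) for _ in range(target_size - len(dataset))]
    return list(dataset) + extra

# Toy example: pad a 3-sentence "NER set" up to a 7-sentence "POS set" size.
epoch_ner = fill_to_size(["s1", "s2", "s3"], 7)
```

Each epoch can reseed the sampler so the oversampled sentences vary across epochs while the set sizes stay aligned.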
+
+ %\subsubsection{Baseline single-task training}
+
+ %We also conduct experiments for a single-task training strategy. We follow a common approach to fine-tune a pre-trained language model for POS tagging, appending a linear prediction layer on top of PhoBERT, as briefly described in Section \ref{ssec:pos}. For NER, instead of a linear prediction layer, we append a CRF prediction layer on top of PhoBERT. For dependency parsing, predicted POS tags are produced by the learned single-task POS tagging model; then POS tags are represented by embeddings that are concatenated with the corresponding contextualized word embeddings, resulting in a sequence of input vectors for the Biaffine-based classifiers \cite{qi-etal-2018-universal}.
+
+
+
+
+
+
+
+
+
+
+
+ \subsection{Results}
+
+ %\subsubsection*{Main results}
+
+
+
+ Table \ref{tab:results} presents the results obtained by our PhoNLP and compares them with those of a baseline single-task training approach. For the single-task training approach: (i) for POS tagging, we follow a common approach to fine-tune a pre-trained language model, appending a linear prediction layer on top of PhoBERT, as briefly described in Section \ref{ssec:pos}; (ii) for NER, instead of a linear prediction layer, we append a CRF prediction layer on top of PhoBERT; (iii) for dependency parsing, predicted POS tags are produced by the learned single-task POS tagging model, and these POS tags are then represented by embeddings that are concatenated with the corresponding PhoBERT-based contextualized word embeddings, resulting in a sequence of input vectors for the Biaffine-based classifiers \cite{qi-etal-2018-universal}. Here, the single-task training approach is based on the PhoBERT\textsubscript{base} version, employing the same hyper-parameter tuning and model selection strategy that we use for PhoNLP.
+
+ \begin{table}[!t]
+ \centering
+ \def\arraystretch{1.2}
+ \resizebox{7.5cm}{!}{
+ \begin{tabular}{ll|l|l|l|l}
+ \hline
+ & \textbf{Model} & \textbf{POS} & \textbf{NER} & \textbf{LAS} & \textbf{UAS} \\
+ \hline
+ \multirow{2}{*}{\rotatebox[origin=c]{90}{{Leak.}}}& Single-task & 96.7$^\dagger$ & 93.69 & 78.77$^\dagger$ & 85.22$^\dagger$ \\
+ \cdashline{2-6}
+ & PhoNLP & \textbf{96.76} & \textbf{94.41} & \textbf{79.11} & \textbf{85.47}\\
+ \hline
+ \hline
+ \multirow{2}{*}{\rotatebox[origin=c]{90}{{Re-spl}}}& Single-task & 93.68 & 93.69 & 77.89 & 84.78 \\
+ \cdashline{2-6}
+ & PhoNLP & \textbf{93.88} & \textbf{94.51} & \textbf{78.17} & \textbf{84.95} \\
+ %& PhoNLP & \textbf{93.88}$^{*}$ & \textbf{94.51}$^{**}$ & \textbf{78.17}$^{*}$ & \textbf{84.95} \\
+ \hline
+ \end{tabular}
+ }
+ \caption{Performance results (in \%) on the test sets for POS tagging (i.e. accuracy), NER (i.e. F\textsubscript{1}-score) and dependency parsing (i.e. LAS and UAS scores). ``Leak.'' abbreviates ``leakage'', denoting the results obtained w.r.t. the data leakage issue. ``Re-spl'' denotes the results obtained w.r.t. the data re-split and duplication removal for POS tagging to avoid the data leakage issue. ``Single-task'' refers to the single-task training approach.
+ $\dagger$ denotes scores taken from the PhoBERT paper \cite{phobert}. Note that ``Single-task'' NER is not affected by the data leakage issue.
+ %Here, $^{*}$ and $^{**}$ denote the statistically significant differences between ``Single-task'' and PhoNLP at p $\leq$ 0.05 and p $\leq$ 0.01, respectively.
+ }
+ \label{tab:results}
+ \end{table}
+
+ Note that PhoBERT helps produce state-of-the-art results for multiple Vietnamese NLP tasks (including but not limited to POS tagging, NER and dependency parsing in a single-task training strategy), obtaining higher performance than VnCoreNLP. However, in both the PhoBERT and VnCoreNLP papers \cite{phobert,vu-etal-2018-vncorenlp}, the results for POS tagging and dependency parsing are reported w.r.t. the data leakage issue. Our ``Single-task'' results in Table \ref{tab:results} under ``Re-spl'' (i.e. the data re-split and duplication removal for POS tagging to avoid the data leakage issue) can thus be viewed as new PhoBERT results for a proper experimental setup. Table \ref{tab:results} shows that in both the ``Leak.'' and ``Re-spl'' setups, our joint multi-task training approach PhoNLP performs better than the PhoBERT-based single-task training approach, thus producing state-of-the-art performance for the three tasks of Vietnamese POS tagging, NER and dependency parsing.
+
560
+
561
+
562
+
563
+ \section{PhoNLP toolkit}
564
+
565
+ We present in this section a basic usage of our PhoNLP toolkit.
566
+ We make PhoNLP simple to setup, i.e. users can install PhoNLP from either source or $\mathsf{pip}$ (e.g. $\mathsf{pip3\ install\ phonlp}$). We also aim to make PhoNLP simple to run from both the command-line and the Python API. For example, annotating a corpus with POS tagging, NER and dependency parsing can be performed by using a simple command as in Figure \ref{fig:command}.
567
+
568
+ Assume that the input file ``{\ttfamily input.txt}'' in Figure \ref{fig:command} contains a sentence ``Tôi đang làm\_việc tại VinAI .'' (I\textsubscript{Tôi} am\textsubscript{đang} working\textsubscript{làm\_việc} at\textsubscript{tại} VinAI). Table \ref{tab:format} shows the annotated output in plain text form for this
569
+ sentence. Similarly, we also get the same output by using the Python API as simple as in Figure \ref{fig:code}.
570
+ Furthermore, commands to (re-)train and evaluate PhoNLP using gold annotated corpora are detailed in the PhoNLP GitHub repository. Note that it is absolutely possible to directly employ our PhoNLP (re-)training and evaluation command scripts for other languages that have gold annotated corpora available for the three tasks and a pre-trained BERT-based language model available from the $\mathrm{transformers}$ library.
571
+
572
+ %\setcounter{figure}{1}
573
+ \begin{figure}[!t]
574
+ %{\footnotesize\ttfamily python3 phonlp.py {-}{-}save\_dir model\_folder\_path {-}{-}mode annotate {-}{-}input\_file path\_to\_input\_file {-}{-}output\_file path\_to\_output\_file}
575
+ {\ttfamily python3 run\_phonlp.py {-}{-}save\_dir ./pretrained\_phonlp {-}{-}mode \\ annotate {-}{-}input\_file input.txt {-}{-}output\_file output.txt}
576
+ \caption{Minimal command to run PhoNLP. Here ``save\_dir'' denote the path to the local machine folder that stores the pre-trained PhoNLP model.}
577
+ \label{fig:command}
578
+ \end{figure}
579
+
580
+ \begin{table}[!t]
581
+ \centering
582
+ %\resizebox{7.5cm}{!}{
583
+ \begin{tabular}{llllll}
584
+ 1 & Tôi & P & O & 3 & sub \\
585
+ 2 & đang & R & O & 3 & adv \\
586
+ 3 & làm\_việc & V & O & 0 & root \\
587
+ 4 & tại & E & O & 3 & loc \\
588
+ 5 & VinAI & Np & B-ORG & 4 & pob \\
589
+ 6 & . & CH & O & 3 & punct \\
590
+ \end{tabular}
591
+ %}
592
+ \caption{The output in the output file ``{\ttfamily output.txt}'' for the sentence ``Tôi đang làm\_việc tại VinAI .'' from the input file ``{\ttfamily input.txt}'' in Figure \ref{fig:command}. The output is formatted with 6 columns representing word index, word form, POS tag, NER label, head index of the current word and its dependency relation type.}
593
+ \label{tab:format}
594
+ \end{table}
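The 6-column plain-text output shown above is easy to post-process. Below is a minimal parsing sketch; the function and field names are ours for illustration and are not part of the PhoNLP toolkit:

```python
# Minimal sketch (not part of PhoNLP): parse the toolkit's 6-column
# plain-text output -- word index, word form, POS tag, NER label,
# head index, and dependency relation -- into a list of dicts.

def parse_phonlp_output(text):
    tokens = []
    for line in text.strip().splitlines():
        idx, form, pos, ner, head, deprel = line.split()
        tokens.append({
            "index": int(idx),
            "form": form,
            "pos": pos,
            "ner": ner,
            "head": int(head),   # 0 marks the dependency root
            "deprel": deprel,
        })
    return tokens

example = """\
1 Tôi P O 3 sub
2 đang R O 3 adv
3 làm_việc V O 0 root
4 tại E O 3 loc
5 VinAI Np B-ORG 4 pob
6 . CH O 3 punct"""

tokens = parse_phonlp_output(example)
root = next(t for t in tokens if t["head"] == 0)
print(root["form"])  # prints "làm_việc"
```

Downstream code can then recover, e.g., the dependency root or the named-entity spans directly from the parsed records.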
595
+
596
+ \paragraph{Speed test:} We perform a CPU-only speed test using a personal computer with an Intel Core i5 8265U 1.6GHz CPU \& 8GB of memory. For a GPU-based speed test, we employ a machine with a single NVIDIA RTX 2080Ti GPU. When performing the three NLP tasks jointly, PhoNLP achieves a speed of {15 sentences per second} in the CPU-based test and {129 sentences per second} in the GPU-based test, with an average of 23 word tokens per sentence and a batch size of 8.
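As a back-of-the-envelope aid to interpreting these numbers, the reported sentence-level speeds can be converted into approximate token throughput; the arithmetic below simply combines the figures stated in this paragraph:

```python
# Back-of-the-envelope conversion of the reported PhoNLP speeds
# (sentences per second) into approximate token throughput, using
# the stated average of 23 word tokens per sentence (batch size 8).
avg_tokens_per_sent = 23
cpu_sents_per_sec = 15    # Intel Core i5 8265U CPU
gpu_sents_per_sec = 129   # NVIDIA RTX 2080Ti GPU

cpu_tokens_per_sec = cpu_sents_per_sec * avg_tokens_per_sent  # 345
gpu_tokens_per_sec = gpu_sents_per_sec * avg_tokens_per_sent  # 2967
speedup = gpu_sents_per_sec / cpu_sents_per_sec               # ~8.6x
print(cpu_tokens_per_sec, gpu_tokens_per_sec, round(speedup, 1))
```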
597
+
598
+ %\setcounter{figure}{2}
599
+ \begin{figure*}[!t]
600
+ %\begin{minted}{python}
601
+ %import phonlp
602
+ %# Automatically download the pre-trained PhoNLP model
603
+ %# and save it in a local machine folder
604
+ %phonlp.download(save_dir='./pretrained_phonlp')
605
+ %# Load the pre-trained PhoNLP model
606
+ %model = phonlp.load(save_dir='./pretrained_phonlp')
607
+ %# Annotate a corpus
608
+ %model.annotate(input_file='input.txt', output_file='output.txt')
609
+ %# Annotate a sentence
610
+ %model.print_out(model.annotate(text="Tôi đang làm_việc tại VinAI ."))
611
+ %\end{minted}
612
+ \begin{Verbatim}[commandchars=\\\{\}]
613
+ \PYG{k+kn}{import} \PYG{n+nn}{phonlp}
614
+ \PYG{c+c1}{\PYGZsh{} Automatically download the pre\PYGZhy{}trained PhoNLP model}
615
+ \PYG{c+c1}{\PYGZsh{} and save it in a local machine folder}
616
+ \PYG{n}{phonlp}\PYG{o}{.}\PYG{n}{download}\PYG{p}{(}\PYG{n}{save\PYGZus{}dir}\PYG{o}{=}\PYG{l+s+s1}{\PYGZsq{}./pretrained\PYGZus{}phonlp\PYGZsq{}}\PYG{p}{)}
617
+ \PYG{c+c1}{\PYGZsh{} Load the pre\PYGZhy{}trained PhoNLP model}
618
+ \PYG{n}{model} \PYG{o}{=} \PYG{n}{phonlp}\PYG{o}{.}\PYG{n}{load}\PYG{p}{(}\PYG{n}{save\PYGZus{}dir}\PYG{o}{=}\PYG{l+s+s1}{\PYGZsq{}./pretrained\PYGZus{}phonlp\PYGZsq{}}\PYG{p}{)}
619
+ \PYG{c+c1}{\PYGZsh{} Annotate a corpus}
620
+ \PYG{n}{model}\PYG{o}{.}\PYG{n}{annotate}\PYG{p}{(}\PYG{n}{input\PYGZus{}file}\PYG{o}{=}\PYG{l+s+s1}{\PYGZsq{}input.txt\PYGZsq{}}\PYG{p}{,} \PYG{n}{output\PYGZus{}file}\PYG{o}{=}\PYG{l+s+s1}{\PYGZsq{}output.txt\PYGZsq{}}\PYG{p}{)}
621
+ \PYG{c+c1}{\PYGZsh{} Annotate a sentence}
622
+ \PYG{n}{model}\PYG{o}{.}\PYG{n}{print\PYGZus{}out}\PYG{p}{(}\PYG{n}{model}\PYG{o}{.}\PYG{n}{annotate}\PYG{p}{(}\PYG{n}{text}\PYG{o}{=}\PYG{l+s+s2}{\PYGZdq{}Tôi đang làm\PYGZus{}việc tại VinAI .\PYGZdq{}}\PYG{p}{))}
623
+ \end{Verbatim}
624
+ %\vspace{-5pt}
625
+ \caption{A simple and complete code example for using PhoNLP in Python.}
626
+ \label{fig:code}
627
+ \end{figure*}
628
+
629
+ \section{Conclusion and future work}
630
+
631
+
632
+ We have presented the first multi-task learning model PhoNLP for joint POS tagging, NER and dependency parsing in Vietnamese. Experiments on Vietnamese benchmark datasets show that PhoNLP outperforms its strong fine-tuned PhoBERT-based single-task training baseline, producing state-of-the-art performance results. We publicly release PhoNLP as an easy-to-use open-source toolkit and hope that PhoNLP can facilitate future NLP research and applications. % such as in question answering and dialogue systems.
633
+ In future work, we will also apply PhoNLP to other languages.
634
+
635
+ %Although we specify PhoNLP for Vietnamese, the PhoNLP (re-)training and evaluation command scripts in fact can directly work for other languages that have gold annotated corpora available for the three tasks of POS tagging, NER and dependency parsing, and a pre-trained BERT-based language model available from the $\mathrm{transformers}$ library. In future work, we will apply our PhoNLP toolkit to those languages.
636
+
637
+
638
+ \bibliography{refs}
639
+ \bibliographystyle{acl_natbib}
640
+
641
+ \end{document}
references/2021.naacl.nguyen/source/acl_natbib.bst ADDED
@@ -0,0 +1,1979 @@
1
+ %%% Modification of BibTeX style file acl_natbib_nourl.bst
2
+ %%% ... by urlbst, version 0.7 (marked with "% urlbst")
3
+ %%% See <http://purl.org/nxg/dist/urlbst>
4
+ %%% Added webpage entry type, and url and lastchecked fields.
5
+ %%% Added eprint support.
6
+ %%% Added DOI support.
7
+ %%% Added PUBMED support.
8
+ %%% Added hyperref support.
9
+ %%% Original headers follow...
10
+
11
+ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
12
+ %
13
+ % BibTeX style file acl_natbib_nourl.bst
14
+ %
15
+ % intended as input to urlbst script
16
+ % $ ./urlbst --hyperref --inlinelinks acl_natbib_nourl.bst > acl_natbib.bst
17
+ %
18
+ % adapted from compling.bst
19
+ % in order to mimic the style files for ACL conferences prior to 2017
20
+ % by making the following three changes:
21
+ % - for @incollection, page numbers now follow volume title.
22
+ % - for @inproceedings, address now follows conference name.
23
+ % (address is intended as location of conference,
24
+ % not address of publisher.)
25
+ % - for papers with three authors, use et al. in citation
26
+ % Dan Gildea 2017/06/08
27
+ %
28
+ % - fixed a bug with format.chapter - error given if chapter is empty
29
+ % with inbook.
30
+ % Shay Cohen 2018/02/16
31
+ %
32
+ % - sort "van Noord" under "v" not "N"
33
+ % this is what previous ACL style files did and is pretty standard
34
+ % Dan Gildea 2019/04/12
35
+
36
+ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
37
+ %
38
+ % BibTeX style file compling.bst
39
+ %
40
+ % Intended for the journal Computational Linguistics (ACL/MIT Press)
41
+ % Created by Ron Artstein on 2005/08/22
42
+ % For use with <natbib.sty> for author-year citations.
43
+ %
44
+ % I created this file in order to allow submissions to the journal
45
+ % Computational Linguistics using the <natbib> package for author-year
46
+ % citations, which offers a lot more flexibility than <fullname>, CL's
47
+ % official citation package. This file adheres strictly to the official
48
+ % style guide available from the MIT Press:
49
+ %
50
+ % http://mitpress.mit.edu/journals/coli/compling_style.pdf
51
+ %
52
+ % This includes all the various quirks of the style guide, for example:
53
+ % - a chapter from a monograph (@inbook) has no page numbers.
54
+ % - an article from an edited volume (@incollection) has page numbers
55
+ % after the publisher and address.
56
+ % - an article from a proceedings volume (@inproceedings) has page
57
+ % numbers before the publisher and address.
58
+ %
59
+ % Where the style guide was inconsistent or not specific enough I
60
+ % looked at actual published articles and exercised my own judgment.
61
+ % I noticed two inconsistencies in the style guide:
62
+ %
63
+ % - The style guide gives one example of an article from an edited
64
+ % volume with the editor's name spelled out in full, and another
65
+ % with the editors' names abbreviated. I chose to accept the first
66
+ % one as correct, since the style guide generally shuns abbreviations,
67
+ % and editors' names are also spelled out in some recently published
68
+ % articles.
69
+ %
70
+ % - The style guide gives one example of a reference where the word
71
+ % "and" between two authors is preceded by a comma. This is most
72
+ % likely a typo, since in all other cases with just two authors or
73
+ % editors there is no comma before the word "and".
74
+ %
75
+ % One case where the style guide is not being specific is the placement
76
+ % of the edition number, for which no example is given. I chose to put
77
+ % it immediately after the title, which I (subjectively) find natural,
78
+ % and is also the place of the edition in a few recently published
79
+ % articles.
80
+ %
81
+ % This file correctly reproduces all of the examples in the official
82
+ % style guide, except for the two inconsistencies noted above. I even
83
+ % managed to get it to correctly format the proceedings example which
84
+ % has an organization, a publisher, and two addresses (the conference
85
+ % location and the publisher's address), though I cheated a bit by
86
+ % putting the conference location and month as part of the title field;
87
+ % I feel that in this case the conference location and month can be
88
+ % considered as part of the title, and that adding a location field
89
+ % is not justified. Note also that a location field is not standard,
90
+ % so entries made with this field would not port nicely to other styles.
91
+ % However, if authors feel that there's a need for a location field
92
+ % then tell me and I'll see what I can do.
93
+ %
94
+ % The file also produces to my satisfaction all the bibliographical
95
+ % entries in my recent (joint) submission to CL (this was the original
96
+ % motivation for creating the file). I also tested it by running it
97
+ % on a larger set of entries and eyeballing the results. There may of
98
+ % course still be errors, especially with combinations of fields that
99
+ % are not that common, or with cross-references (which I seldom use).
100
+ % If you find such errors please write to me.
101
+ %
102
+ % I hope people find this file useful. Please email me with comments
103
+ % and suggestions.
104
+ %
105
+ % Ron Artstein
106
+ % artstein [at] essex.ac.uk
107
+ % August 22, 2005.
108
+ %
109
+ % Some technical notes.
110
+ %
111
+ % This file is based on a file generated with the package <custom-bib>
112
+ % by Patrick W. Daly (see selected options below), which was then
113
+ % manually customized to conform with certain CL requirements which
114
+ % cannot be met by <custom-bib>. Departures from the generated file
115
+ % include:
116
+ %
117
+ % Function inbook: moved publisher and address to the end; moved
118
+ % edition after title; replaced function format.chapter.pages by
119
+ % new function format.chapter to output chapter without pages.
120
+ %
121
+ % Function inproceedings: moved publisher and address to the end;
122
+ % replaced function format.in.ed.booktitle by new function
123
+ % format.in.booktitle to output the proceedings title without
124
+ % the editor.
125
+ %
126
+ % Functions book, incollection, manual: moved edition after title.
127
+ %
128
+ % Function mastersthesis: formatted title as for articles (unlike
129
+ % phdthesis which is formatted as book) and added month.
130
+ %
131
+ % Function proceedings: added new.sentence between organization and
132
+ % publisher when both are present.
133
+ %
134
+ % Function format.lab.names: modified so that it gives all the
135
+ % authors' surnames for in-text citations for one, two and three
136
+ % authors and only uses "et. al" for works with four authors or more
137
+ % (thanks to Ken Shan for convincing me to go through the trouble of
138
+ % modifying this function rather than using unreliable hacks).
139
+ %
140
+ % Changes:
141
+ %
142
+ % 2006-10-27: Changed function reverse.pass so that the extra label is
143
+ % enclosed in parentheses when the year field ends in an uppercase or
144
+ % lowercase letter (change modeled after Uli Sauerland's modification
145
+ % of nals.bst). RA.
146
+ %
147
+ %
148
+ % The preamble of the generated file begins below:
149
+ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
150
+ %%
151
+ %% This is file `compling.bst',
152
+ %% generated with the docstrip utility.
153
+ %%
154
+ %% The original source files were:
155
+ %%
156
+ %% merlin.mbs (with options: `ay,nat,vonx,nm-revv1,jnrlst,keyxyr,blkyear,dt-beg,yr-per,note-yr,num-xser,pre-pub,xedn,nfss')
157
+ %% ----------------------------------------
158
+ %% *** Intended for the journal Computational Linguistics ***
159
+ %%
160
+ %% Copyright 1994-2002 Patrick W Daly
161
+ % ===============================================================
162
+ % IMPORTANT NOTICE:
163
+ % This bibliographic style (bst) file has been generated from one or
164
+ % more master bibliographic style (mbs) files, listed above.
165
+ %
166
+ % This generated file can be redistributed and/or modified under the terms
167
+ % of the LaTeX Project Public License Distributed from CTAN
168
+ % archives in directory macros/latex/base/lppl.txt; either
169
+ % version 1 of the License, or any later version.
170
+ % ===============================================================
171
+ % Name and version information of the main mbs file:
172
+ % \ProvidesFile{merlin.mbs}[2002/10/21 4.05 (PWD, AO, DPC)]
173
+ % For use with BibTeX version 0.99a or later
174
+ %-------------------------------------------------------------------
175
+ % This bibliography style file is intended for texts in ENGLISH
176
+ % This is an author-year citation style bibliography. As such, it is
177
+ % non-standard LaTeX, and requires a special package file to function properly.
178
+ % Such a package is natbib.sty by Patrick W. Daly
179
+ % The form of the \bibitem entries is
180
+ % \bibitem[Jones et al.(1990)]{key}...
181
+ % \bibitem[Jones et al.(1990)Jones, Baker, and Smith]{key}...
182
+ % The essential feature is that the label (the part in brackets) consists
183
+ % of the author names, as they should appear in the citation, with the year
184
+ % in parentheses following. There must be no space before the opening
185
+ % parenthesis!
186
+ % With natbib v5.3, a full list of authors may also follow the year.
187
+ % In natbib.sty, it is possible to define the type of enclosures that is
188
+ % really wanted (brackets or parentheses), but in either case, there must
189
+ % be parentheses in the label.
190
+ % The \cite command functions as follows:
191
+ % \citet{key} ==>> Jones et al. (1990)
192
+ % \citet*{key} ==>> Jones, Baker, and Smith (1990)
193
+ % \citep{key} ==>> (Jones et al., 1990)
194
+ % \citep*{key} ==>> (Jones, Baker, and Smith, 1990)
195
+ % \citep[chap. 2]{key} ==>> (Jones et al., 1990, chap. 2)
196
+ % \citep[e.g.][]{key} ==>> (e.g. Jones et al., 1990)
197
+ % \citep[e.g.][p. 32]{key} ==>> (e.g. Jones et al., p. 32)
198
+ % \citeauthor{key} ==>> Jones et al.
199
+ % \citeauthor*{key} ==>> Jones, Baker, and Smith
200
+ % \citeyear{key} ==>> 1990
201
+ %---------------------------------------------------------------------
202
+
203
+ ENTRY
204
+ { address
205
+ author
206
+ booktitle
207
+ chapter
208
+ edition
209
+ editor
210
+ howpublished
211
+ institution
212
+ journal
213
+ key
214
+ month
215
+ note
216
+ number
217
+ organization
218
+ pages
219
+ publisher
220
+ school
221
+ series
222
+ title
223
+ type
224
+ volume
225
+ year
226
+ eprint % urlbst
227
+ doi % urlbst
228
+ pubmed % urlbst
229
+ url % urlbst
230
+ lastchecked % urlbst
231
+ }
232
+ {}
233
+ { label extra.label sort.label short.list }
234
+ INTEGERS { output.state before.all mid.sentence after.sentence after.block }
235
+ % urlbst...
236
+ % urlbst constants and state variables
237
+ STRINGS { urlintro
238
+ eprinturl eprintprefix doiprefix doiurl pubmedprefix pubmedurl
239
+ citedstring onlinestring linktextstring
240
+ openinlinelink closeinlinelink }
241
+ INTEGERS { hrefform inlinelinks makeinlinelink
242
+ addeprints adddoiresolver addpubmedresolver }
243
+ FUNCTION {init.urlbst.variables}
244
+ {
245
+ % The following constants may be adjusted by hand, if desired
246
+
247
+ % The first set allow you to enable or disable certain functionality.
248
+ #1 'addeprints := % 0=no eprints; 1=include eprints
249
+ #1 'adddoiresolver := % 0=no DOI resolver; 1=include it
250
+ #1 'addpubmedresolver := % 0=no PUBMED resolver; 1=include it
251
+ #2 'hrefform := % 0=no crossrefs; 1=hypertex xrefs; 2=hyperref refs
252
+ #1 'inlinelinks := % 0=URLs explicit; 1=URLs attached to titles
253
+
254
+ % String constants, which you _might_ want to tweak.
255
+ "URL: " 'urlintro := % prefix before URL; typically "Available from:" or "URL":
256
+ "online" 'onlinestring := % indication that resource is online; typically "online"
257
+ "cited " 'citedstring := % indicator of citation date; typically "cited "
258
+ "[link]" 'linktextstring := % dummy link text; typically "[link]"
259
+ "http://arxiv.org/abs/" 'eprinturl := % prefix to make URL from eprint ref
260
+ "arXiv:" 'eprintprefix := % text prefix printed before eprint ref; typically "arXiv:"
261
+ "https://doi.org/" 'doiurl := % prefix to make URL from DOI
262
+ "doi:" 'doiprefix := % text prefix printed before DOI ref; typically "doi:"
263
+ "http://www.ncbi.nlm.nih.gov/pubmed/" 'pubmedurl := % prefix to make URL from PUBMED
264
+ "PMID:" 'pubmedprefix := % text prefix printed before PUBMED ref; typically "PMID:"
265
+
266
+ % The following are internal state variables, not configuration constants,
267
+ % so they shouldn't be fiddled with.
268
+ #0 'makeinlinelink := % state variable managed by possibly.setup.inlinelink
269
+ "" 'openinlinelink := % ditto
270
+ "" 'closeinlinelink := % ditto
271
+ }
272
+ INTEGERS {
273
+ bracket.state
274
+ outside.brackets
275
+ open.brackets
276
+ within.brackets
277
+ close.brackets
278
+ }
279
+ % ...urlbst to here
280
+ FUNCTION {init.state.consts}
281
+ { #0 'outside.brackets := % urlbst...
282
+ #1 'open.brackets :=
283
+ #2 'within.brackets :=
284
+ #3 'close.brackets := % ...urlbst to here
285
+
286
+ #0 'before.all :=
287
+ #1 'mid.sentence :=
288
+ #2 'after.sentence :=
289
+ #3 'after.block :=
290
+ }
291
+ STRINGS { s t}
292
+ % urlbst
293
+ FUNCTION {output.nonnull.original}
294
+ { 's :=
295
+ output.state mid.sentence =
296
+ { ", " * write$ }
297
+ { output.state after.block =
298
+ { add.period$ write$
299
+ newline$
300
+ "\newblock " write$
301
+ }
302
+ { output.state before.all =
303
+ 'write$
304
+ { add.period$ " " * write$ }
305
+ if$
306
+ }
307
+ if$
308
+ mid.sentence 'output.state :=
309
+ }
310
+ if$
311
+ s
312
+ }
313
+
314
+ % urlbst...
315
+ % The following three functions are for handling inlinelink. They wrap
316
+ % a block of text which is potentially output with write$ by multiple
317
+ % other functions, so we don't know the content a priori.
318
+ % They communicate between each other using the variables makeinlinelink
319
+ % (which is true if a link should be made), and closeinlinelink (which holds
320
+ % the string which should close any current link. They can be called
321
+ % at any time, but start.inlinelink will be a no-op unless something has
322
+ % previously set makeinlinelink true, and the two ...end.inlinelink functions
323
+ % will only do their stuff if start.inlinelink has previously set
324
+ % closeinlinelink to be non-empty.
325
+ % (thanks to 'ijvm' for suggested code here)
326
+ FUNCTION {uand}
327
+ { 'skip$ { pop$ #0 } if$ } % 'and' (which isn't defined at this point in the file)
328
+ FUNCTION {possibly.setup.inlinelink}
329
+ { makeinlinelink hrefform #0 > uand
330
+ { doi empty$ adddoiresolver uand
331
+ { pubmed empty$ addpubmedresolver uand
332
+ { eprint empty$ addeprints uand
333
+ { url empty$
334
+ { "" }
335
+ { url }
336
+ if$ }
337
+ { eprinturl eprint * }
338
+ if$ }
339
+ { pubmedurl pubmed * }
340
+ if$ }
341
+ { doiurl doi * }
342
+ if$
343
+ % an appropriately-formatted URL is now on the stack
344
+ hrefform #1 = % hypertex
345
+ { "\special {html:<a href=" quote$ * swap$ * quote$ * "> }{" * 'openinlinelink :=
346
+ "\special {html:</a>}" 'closeinlinelink := }
347
+ { "\href {" swap$ * "} {" * 'openinlinelink := % hrefform=#2 -- hyperref
348
+ % the space between "} {" matters: a URL of just the right length can cause "\% newline em"
349
+ "}" 'closeinlinelink := }
350
+ if$
351
+ #0 'makeinlinelink :=
352
+ }
353
+ 'skip$
354
+ if$ % makeinlinelink
355
+ }
356
+ FUNCTION {add.inlinelink}
357
+ { openinlinelink empty$
358
+ 'skip$
359
+ { openinlinelink swap$ * closeinlinelink *
360
+ "" 'openinlinelink :=
361
+ }
362
+ if$
363
+ }
364
+ FUNCTION {output.nonnull}
365
+ { % Save the thing we've been asked to output
366
+ 's :=
367
+ % If the bracket-state is close.brackets, then add a close-bracket to
368
+ % what is currently at the top of the stack, and set bracket.state
369
+ % to outside.brackets
370
+ bracket.state close.brackets =
371
+ { "]" *
372
+ outside.brackets 'bracket.state :=
373
+ }
374
+ 'skip$
375
+ if$
376
+ bracket.state outside.brackets =
377
+ { % We're outside all brackets -- this is the normal situation.
378
+ % Write out what's currently at the top of the stack, using the
379
+ % original output.nonnull function.
380
+ s
381
+ add.inlinelink
382
+ output.nonnull.original % invoke the original output.nonnull
383
+ }
384
+ { % Still in brackets. Add open-bracket or (continuation) comma, add the
385
+ % new text (in s) to the top of the stack, and move to the close-brackets
386
+ % state, ready for next time (unless inbrackets resets it). If we come
387
+ % into this branch, then output.state is carefully undisturbed.
388
+ bracket.state open.brackets =
389
+ { " [" * }
390
+ { ", " * } % bracket.state will be within.brackets
391
+ if$
392
+ s *
393
+ close.brackets 'bracket.state :=
394
+ }
395
+ if$
396
+ }
397
+
398
+ % Call this function just before adding something which should be presented in
399
+ % brackets. bracket.state is handled specially within output.nonnull.
400
+ FUNCTION {inbrackets}
401
+ { bracket.state close.brackets =
402
+ { within.brackets 'bracket.state := } % reset the state: not open nor closed
403
+ { open.brackets 'bracket.state := }
404
+ if$
405
+ }
406
+
407
+ FUNCTION {format.lastchecked}
408
+ { lastchecked empty$
409
+ { "" }
410
+ { inbrackets citedstring lastchecked * }
411
+ if$
412
+ }
413
+ % ...urlbst to here
414
+ FUNCTION {output}
415
+ { duplicate$ empty$
416
+ 'pop$
417
+ 'output.nonnull
418
+ if$
419
+ }
420
+ FUNCTION {output.check}
421
+ { 't :=
422
+ duplicate$ empty$
423
+ { pop$ "empty " t * " in " * cite$ * warning$ }
424
+ 'output.nonnull
425
+ if$
426
+ }
427
+ FUNCTION {fin.entry.original} % urlbst (renamed from fin.entry, so it can be wrapped below)
428
+ { add.period$
429
+ write$
430
+ newline$
431
+ }
432
+
433
+ FUNCTION {new.block}
434
+ { output.state before.all =
435
+ 'skip$
436
+ { after.block 'output.state := }
437
+ if$
438
+ }
439
+ FUNCTION {new.sentence}
440
+ { output.state after.block =
441
+ 'skip$
442
+ { output.state before.all =
443
+ 'skip$
444
+ { after.sentence 'output.state := }
445
+ if$
446
+ }
447
+ if$
448
+ }
449
+ FUNCTION {add.blank}
450
+ { " " * before.all 'output.state :=
451
+ }
452
+
453
+ FUNCTION {date.block}
454
+ {
455
+ new.block
456
+ }
457
+
458
+ FUNCTION {not}
459
+ { { #0 }
460
+ { #1 }
461
+ if$
462
+ }
463
+ FUNCTION {and}
464
+ { 'skip$
465
+ { pop$ #0 }
466
+ if$
467
+ }
468
+ FUNCTION {or}
469
+ { { pop$ #1 }
470
+ 'skip$
471
+ if$
472
+ }
473
+ FUNCTION {new.block.checkb}
474
+ { empty$
475
+ swap$ empty$
476
+ and
477
+ 'skip$
478
+ 'new.block
479
+ if$
480
+ }
481
+ FUNCTION {field.or.null}
482
+ { duplicate$ empty$
483
+ { pop$ "" }
484
+ 'skip$
485
+ if$
486
+ }
487
+ FUNCTION {emphasize}
488
+ { duplicate$ empty$
489
+ { pop$ "" }
490
+ { "\emph{" swap$ * "}" * }
491
+ if$
492
+ }
493
+ FUNCTION {tie.or.space.prefix}
494
+ { duplicate$ text.length$ #3 <
495
+ { "~" }
496
+ { " " }
497
+ if$
498
+ swap$
499
+ }
500
+
501
+ FUNCTION {capitalize}
502
+ { "u" change.case$ "t" change.case$ }
503
+
504
+ FUNCTION {space.word}
505
+ { " " swap$ * " " * }
506
+ % Here are the language-specific definitions for explicit words.
507
+ % Each function has a name bbl.xxx where xxx is the English word.
508
+ % The language selected here is ENGLISH
509
+ FUNCTION {bbl.and}
510
+ { "and"}
511
+
512
+ FUNCTION {bbl.etal}
513
+ { "et~al." }
514
+
515
+ FUNCTION {bbl.editors}
516
+ { "editors" }
517
+
518
+ FUNCTION {bbl.editor}
519
+ { "editor" }
520
+
521
+ FUNCTION {bbl.edby}
522
+ { "edited by" }
523
+
524
+ FUNCTION {bbl.edition}
525
+ { "edition" }
526
+
527
+ FUNCTION {bbl.volume}
528
+ { "volume" }
529
+
530
+ FUNCTION {bbl.of}
531
+ { "of" }
532
+
533
+ FUNCTION {bbl.number}
534
+ { "number" }
535
+
536
+ FUNCTION {bbl.nr}
537
+ { "no." }
538
+
539
+ FUNCTION {bbl.in}
540
+ { "in" }
541
+
542
+ FUNCTION {bbl.pages}
543
+ { "pages" }
544
+
545
+ FUNCTION {bbl.page}
546
+ { "page" }
547
+
548
+ FUNCTION {bbl.chapter}
549
+ { "chapter" }
550
+
551
+ FUNCTION {bbl.techrep}
552
+ { "Technical Report" }
553
+
554
+ FUNCTION {bbl.mthesis}
555
+ { "Master's thesis" }
556
+
557
+ FUNCTION {bbl.phdthesis}
558
+ { "Ph.D. thesis" }
559
+
560
+ MACRO {jan} {"January"}
561
+
562
+ MACRO {feb} {"February"}
563
+
564
+ MACRO {mar} {"March"}
565
+
566
+ MACRO {apr} {"April"}
567
+
568
+ MACRO {may} {"May"}
569
+
570
+ MACRO {jun} {"June"}
571
+
572
+ MACRO {jul} {"July"}
573
+
574
+ MACRO {aug} {"August"}
575
+
576
+ MACRO {sep} {"September"}
577
+
578
+ MACRO {oct} {"October"}
579
+
580
+ MACRO {nov} {"November"}
581
+
582
+ MACRO {dec} {"December"}
583
+
584
+ MACRO {acmcs} {"ACM Computing Surveys"}
585
+
586
+ MACRO {acta} {"Acta Informatica"}
587
+
588
+ MACRO {cacm} {"Communications of the ACM"}
589
+
590
+ MACRO {ibmjrd} {"IBM Journal of Research and Development"}
591
+
592
+ MACRO {ibmsj} {"IBM Systems Journal"}
593
+
594
+ MACRO {ieeese} {"IEEE Transactions on Software Engineering"}
595
+
596
+ MACRO {ieeetc} {"IEEE Transactions on Computers"}
597
+
598
+ MACRO {ieeetcad}
599
+ {"IEEE Transactions on Computer-Aided Design of Integrated Circuits"}
600
+
601
+ MACRO {ipl} {"Information Processing Letters"}
602
+
603
+ MACRO {jacm} {"Journal of the ACM"}
604
+
605
+ MACRO {jcss} {"Journal of Computer and System Sciences"}
606
+
607
+ MACRO {scp} {"Science of Computer Programming"}
608
+
609
+ MACRO {sicomp} {"SIAM Journal on Computing"}
+
+ MACRO {tocs} {"ACM Transactions on Computer Systems"}
+
+ MACRO {tods} {"ACM Transactions on Database Systems"}
+
+ MACRO {tog} {"ACM Transactions on Graphics"}
+
+ MACRO {toms} {"ACM Transactions on Mathematical Software"}
+
+ MACRO {toois} {"ACM Transactions on Office Information Systems"}
+
+ MACRO {toplas} {"ACM Transactions on Programming Languages and Systems"}
+
+ MACRO {tcs} {"Theoretical Computer Science"}
+ FUNCTION {bibinfo.check}
+ { swap$
+ duplicate$ missing$
+ {
+ pop$ pop$
+ ""
+ }
+ { duplicate$ empty$
+ {
+ swap$ pop$
+ }
+ { swap$
+ pop$
+ }
+ if$
+ }
+ if$
+ }
+ FUNCTION {bibinfo.warn}
+ { swap$
+ duplicate$ missing$
+ {
+ swap$ "missing " swap$ * " in " * cite$ * warning$ pop$
+ ""
+ }
+ { duplicate$ empty$
+ {
+ swap$ "empty " swap$ * " in " * cite$ * warning$
+ }
+ { swap$
+ pop$
+ }
+ if$
+ }
+ if$
+ }
+ STRINGS { bibinfo}
+ INTEGERS { nameptr namesleft numnames }
+
+ FUNCTION {format.names}
+ { 'bibinfo :=
+ duplicate$ empty$ 'skip$ {
+ 's :=
+ "" 't :=
+ #1 'nameptr :=
+ s num.names$ 'numnames :=
+ numnames 'namesleft :=
+ { namesleft #0 > }
+ { s nameptr
+ duplicate$ #1 >
+ { "{ff~}{vv~}{ll}{, jj}" }
+ { "{ff~}{vv~}{ll}{, jj}" } % first name first for first author
+ % { "{vv~}{ll}{, ff}{, jj}" } % last name first for first author
+ if$
+ format.name$
+ bibinfo bibinfo.check
+ 't :=
+ nameptr #1 >
+ {
+ namesleft #1 >
+ { ", " * t * }
+ {
+ numnames #2 >
+ { "," * }
+ 'skip$
+ if$
+ s nameptr "{ll}" format.name$ duplicate$ "others" =
+ { 't := }
+ { pop$ }
+ if$
+ t "others" =
+ {
+ " " * bbl.etal *
+ }
+ {
+ bbl.and
+ space.word * t *
+ }
+ if$
+ }
+ if$
+ }
+ 't
+ if$
+ nameptr #1 + 'nameptr :=
+ namesleft #1 - 'namesleft :=
+ }
+ while$
+ } if$
+ }
+ FUNCTION {format.names.ed}
+ {
+ 'bibinfo :=
+ duplicate$ empty$ 'skip$ {
+ 's :=
+ "" 't :=
+ #1 'nameptr :=
+ s num.names$ 'numnames :=
+ numnames 'namesleft :=
+ { namesleft #0 > }
+ { s nameptr
+ "{ff~}{vv~}{ll}{, jj}"
+ format.name$
+ bibinfo bibinfo.check
+ 't :=
+ nameptr #1 >
+ {
+ namesleft #1 >
+ { ", " * t * }
+ {
+ numnames #2 >
+ { "," * }
+ 'skip$
+ if$
+ s nameptr "{ll}" format.name$ duplicate$ "others" =
+ { 't := }
+ { pop$ }
+ if$
+ t "others" =
+ {
+
+ " " * bbl.etal *
+ }
+ {
+ bbl.and
+ space.word * t *
+ }
+ if$
+ }
+ if$
+ }
+ 't
+ if$
+ nameptr #1 + 'nameptr :=
+ namesleft #1 - 'namesleft :=
+ }
+ while$
+ } if$
+ }
+ FUNCTION {format.key}
+ { empty$
+ { key field.or.null }
+ { "" }
+ if$
+ }
+
+ FUNCTION {format.authors}
+ { author "author" format.names
+ }
+ FUNCTION {get.bbl.editor}
+ { editor num.names$ #1 > 'bbl.editors 'bbl.editor if$ }
+
+ FUNCTION {format.editors}
+ { editor "editor" format.names duplicate$ empty$ 'skip$
+ {
+ "," *
+ " " *
+ get.bbl.editor
+ *
+ }
+ if$
+ }
+ FUNCTION {format.note}
+ {
+ note empty$
+ { "" }
+ { note #1 #1 substring$
+ duplicate$ "{" =
+ 'skip$
+ { output.state mid.sentence =
+ { "l" }
+ { "u" }
+ if$
+ change.case$
+ }
+ if$
+ note #2 global.max$ substring$ * "note" bibinfo.check
+ }
+ if$
+ }
+
+ FUNCTION {format.title}
+ { title
+ duplicate$ empty$ 'skip$
+ { "t" change.case$ }
+ if$
+ "title" bibinfo.check
+ }
+ FUNCTION {format.full.names}
+ {'s :=
+ "" 't :=
+ #1 'nameptr :=
+ s num.names$ 'numnames :=
+ numnames 'namesleft :=
+ { namesleft #0 > }
+ { s nameptr
+ "{vv~}{ll}" format.name$
+ 't :=
+ nameptr #1 >
+ {
+ namesleft #1 >
+ { ", " * t * }
+ {
+ s nameptr "{ll}" format.name$ duplicate$ "others" =
+ { 't := }
+ { pop$ }
+ if$
+ t "others" =
+ {
+ " " * bbl.etal *
+ }
+ {
+ numnames #2 >
+ { "," * }
+ 'skip$
+ if$
+ bbl.and
+ space.word * t *
+ }
+ if$
+ }
+ if$
+ }
+ 't
+ if$
+ nameptr #1 + 'nameptr :=
+ namesleft #1 - 'namesleft :=
+ }
+ while$
+ }
+
+ FUNCTION {author.editor.key.full}
+ { author empty$
+ { editor empty$
+ { key empty$
+ { cite$ #1 #3 substring$ }
+ 'key
+ if$
+ }
+ { editor format.full.names }
+ if$
+ }
+ { author format.full.names }
+ if$
+ }
+
+ FUNCTION {author.key.full}
+ { author empty$
+ { key empty$
+ { cite$ #1 #3 substring$ }
+ 'key
+ if$
+ }
+ { author format.full.names }
+ if$
+ }
+
+ FUNCTION {editor.key.full}
+ { editor empty$
+ { key empty$
+ { cite$ #1 #3 substring$ }
+ 'key
+ if$
+ }
+ { editor format.full.names }
+ if$
+ }
+
+ FUNCTION {make.full.names}
+ { type$ "book" =
+ type$ "inbook" =
+ or
+ 'author.editor.key.full
+ { type$ "proceedings" =
+ 'editor.key.full
+ 'author.key.full
+ if$
+ }
+ if$
+ }
+
+ FUNCTION {output.bibitem.original} % urlbst (renamed from output.bibitem, so it can be wrapped below)
+ { newline$
+ "\bibitem[{" write$
+ label write$
+ ")" make.full.names duplicate$ short.list =
+ { pop$ }
+ { * }
+ if$
+ "}]{" * write$
+ cite$ write$
+ "}" write$
+ newline$
+ ""
+ before.all 'output.state :=
+ }
+
+ FUNCTION {n.dashify}
+ {
+ 't :=
+ ""
+ { t empty$ not }
+ { t #1 #1 substring$ "-" =
+ { t #1 #2 substring$ "--" = not
+ { "--" *
+ t #2 global.max$ substring$ 't :=
+ }
+ { { t #1 #1 substring$ "-" = }
+ { "-" *
+ t #2 global.max$ substring$ 't :=
+ }
+ while$
+ }
+ if$
+ }
+ { t #1 #1 substring$ *
+ t #2 global.max$ substring$ 't :=
+ }
+ if$
+ }
+ while$
+ }
+
+ FUNCTION {word.in}
+ { bbl.in capitalize
+ " " * }
+
+ FUNCTION {format.date}
+ { year "year" bibinfo.check duplicate$ empty$
+ {
+ }
+ 'skip$
+ if$
+ extra.label *
+ before.all 'output.state :=
+ after.sentence 'output.state :=
+ }
+ FUNCTION {format.btitle}
+ { title "title" bibinfo.check
+ duplicate$ empty$ 'skip$
+ {
+ emphasize
+ }
+ if$
+ }
+ FUNCTION {either.or.check}
+ { empty$
+ 'pop$
+ { "can't use both " swap$ * " fields in " * cite$ * warning$ }
+ if$
+ }
+ FUNCTION {format.bvolume}
+ { volume empty$
+ { "" }
+ { bbl.volume volume tie.or.space.prefix
+ "volume" bibinfo.check * *
+ series "series" bibinfo.check
+ duplicate$ empty$ 'pop$
+ { swap$ bbl.of space.word * swap$
+ emphasize * }
+ if$
+ "volume and number" number either.or.check
+ }
+ if$
+ }
+ FUNCTION {format.number.series}
+ { volume empty$
+ { number empty$
+ { series field.or.null }
+ { series empty$
+ { number "number" bibinfo.check }
+ { output.state mid.sentence =
+ { bbl.number }
+ { bbl.number capitalize }
+ if$
+ number tie.or.space.prefix "number" bibinfo.check * *
+ bbl.in space.word *
+ series "series" bibinfo.check *
+ }
+ if$
+ }
+ if$
+ }
+ { "" }
+ if$
+ }
+
+ FUNCTION {format.edition}
+ { edition duplicate$ empty$ 'skip$
+ {
+ output.state mid.sentence =
+ { "l" }
+ { "t" }
+ if$ change.case$
+ "edition" bibinfo.check
+ " " * bbl.edition *
+ }
+ if$
+ }
+ INTEGERS { multiresult }
+ FUNCTION {multi.page.check}
+ { 't :=
+ #0 'multiresult :=
+ { multiresult not
+ t empty$ not
+ and
+ }
+ { t #1 #1 substring$
+ duplicate$ "-" =
+ swap$ duplicate$ "," =
+ swap$ "+" =
+ or or
+ { #1 'multiresult := }
+ { t #2 global.max$ substring$ 't := }
+ if$
+ }
+ while$
+ multiresult
+ }
+ FUNCTION {format.pages}
+ { pages duplicate$ empty$ 'skip$
+ { duplicate$ multi.page.check
+ {
+ bbl.pages swap$
+ n.dashify
+ }
+ {
+ bbl.page swap$
+ }
+ if$
+ tie.or.space.prefix
+ "pages" bibinfo.check
+ * *
+ }
+ if$
+ }
+ FUNCTION {format.journal.pages}
+ { pages duplicate$ empty$ 'pop$
+ { swap$ duplicate$ empty$
+ { pop$ pop$ format.pages }
+ {
+ ":" *
+ swap$
+ n.dashify
+ "pages" bibinfo.check
+ *
+ }
+ if$
+ }
+ if$
+ }
+ FUNCTION {format.vol.num.pages}
+ { volume field.or.null
+ duplicate$ empty$ 'skip$
+ {
+ "volume" bibinfo.check
+ }
+ if$
+ number "number" bibinfo.check duplicate$ empty$ 'skip$
+ {
+ swap$ duplicate$ empty$
+ { "there's a number but no volume in " cite$ * warning$ }
+ 'skip$
+ if$
+ swap$
+ "(" swap$ * ")" *
+ }
+ if$ *
+ format.journal.pages
+ }
+
+ FUNCTION {format.chapter}
+ { chapter empty$
+ 'format.pages
+ { type empty$
+ { bbl.chapter }
+ { type "l" change.case$
+ "type" bibinfo.check
+ }
+ if$
+ chapter tie.or.space.prefix
+ "chapter" bibinfo.check
+ * *
+ }
+ if$
+ }
+
+ FUNCTION {format.chapter.pages}
+ { chapter empty$
+ 'format.pages
+ { type empty$
+ { bbl.chapter }
+ { type "l" change.case$
+ "type" bibinfo.check
+ }
+ if$
+ chapter tie.or.space.prefix
+ "chapter" bibinfo.check
+ * *
+ pages empty$
+ 'skip$
+ { ", " * format.pages * }
+ if$
+ }
+ if$
+ }
+
+ FUNCTION {format.booktitle}
+ {
+ booktitle "booktitle" bibinfo.check
+ emphasize
+ }
+ FUNCTION {format.in.booktitle}
+ { format.booktitle duplicate$ empty$ 'skip$
+ {
+ word.in swap$ *
+ }
+ if$
+ }
+ FUNCTION {format.in.ed.booktitle}
+ { format.booktitle duplicate$ empty$ 'skip$
+ {
+ editor "editor" format.names.ed duplicate$ empty$ 'pop$
+ {
+ "," *
+ " " *
+ get.bbl.editor
+ ", " *
+ * swap$
+ * }
+ if$
+ word.in swap$ *
+ }
+ if$
+ }
+ FUNCTION {format.thesis.type}
+ { type duplicate$ empty$
+ 'pop$
+ { swap$ pop$
+ "t" change.case$ "type" bibinfo.check
+ }
+ if$
+ }
+ FUNCTION {format.tr.number}
+ { number "number" bibinfo.check
+ type duplicate$ empty$
+ { pop$ bbl.techrep }
+ 'skip$
+ if$
+ "type" bibinfo.check
+ swap$ duplicate$ empty$
+ { pop$ "t" change.case$ }
+ { tie.or.space.prefix * * }
+ if$
+ }
+ FUNCTION {format.article.crossref}
+ {
+ word.in
+ " \cite{" * crossref * "}" *
+ }
+ FUNCTION {format.book.crossref}
+ { volume duplicate$ empty$
+ { "empty volume in " cite$ * "'s crossref of " * crossref * warning$
+ pop$ word.in
+ }
+ { bbl.volume
+ capitalize
+ swap$ tie.or.space.prefix "volume" bibinfo.check * * bbl.of space.word *
+ }
+ if$
+ " \cite{" * crossref * "}" *
+ }
+ FUNCTION {format.incoll.inproc.crossref}
+ {
+ word.in
+ " \cite{" * crossref * "}" *
+ }
+ FUNCTION {format.org.or.pub}
+ { 't :=
+ ""
+ address empty$ t empty$ and
+ 'skip$
+ {
+ t empty$
+ { address "address" bibinfo.check *
+ }
+ { t *
+ address empty$
+ 'skip$
+ { ", " * address "address" bibinfo.check * }
+ if$
+ }
+ if$
+ }
+ if$
+ }
+ FUNCTION {format.publisher.address}
+ { publisher "publisher" bibinfo.warn format.org.or.pub
+ }
+
+ FUNCTION {format.organization.address}
+ { organization "organization" bibinfo.check format.org.or.pub
+ }
+
+ % urlbst...
+ % Functions for making hypertext links.
+ % In all cases, the stack has (link-text href-url)
+ %
+ % make 'null' specials
+ FUNCTION {make.href.null}
+ {
+ pop$
+ }
+ % make hypertex specials
+ FUNCTION {make.href.hypertex}
+ {
+ "\special {html:<a href=" quote$ *
+ swap$ * quote$ * "> }" * swap$ *
+ "\special {html:</a>}" *
+ }
+ % make hyperref specials
+ FUNCTION {make.href.hyperref}
+ {
+ "\href {" swap$ * "} {\path{" * swap$ * "}}" *
+ }
+ FUNCTION {make.href}
+ { hrefform #2 =
+ 'make.href.hyperref % hrefform = 2
+ { hrefform #1 =
+ 'make.href.hypertex % hrefform = 1
+ 'make.href.null % hrefform = 0 (or anything else)
+ if$
+ }
+ if$
+ }
+
+ % If inlinelinks is true, then format.url should be a no-op, since it's
+ % (a) redundant, and (b) could end up as a link-within-a-link.
+ FUNCTION {format.url}
+ { inlinelinks #1 = url empty$ or
+ { "" }
+ { hrefform #1 =
+ { % special case -- add HyperTeX specials
+ urlintro "\url{" url * "}" * url make.href.hypertex * }
+ { urlintro "\url{" * url * "}" * }
+ if$
+ }
+ if$
+ }
+
+ FUNCTION {format.eprint}
+ { eprint empty$
+ { "" }
+ { eprintprefix eprint * eprinturl eprint * make.href }
+ if$
+ }
+
+ FUNCTION {format.doi}
+ { doi empty$
+ { "" }
+ { doiprefix doi * doiurl doi * make.href }
+ if$
+ }
+
+ FUNCTION {format.pubmed}
+ { pubmed empty$
+ { "" }
+ { pubmedprefix pubmed * pubmedurl pubmed * make.href }
+ if$
+ }
+
+ % Output a URL. We can't use the more normal idiom (something like
+ % `format.url output'), because the `inbrackets' within
+ % format.lastchecked applies to everything between calls to `output',
+ % so that `format.url format.lastchecked * output' ends up with both
+ % the URL and the lastchecked in brackets.
+ FUNCTION {output.url}
+ { url empty$
+ 'skip$
+ { new.block
+ format.url output
+ format.lastchecked output
+ }
+ if$
+ }
+
+ FUNCTION {output.web.refs}
+ {
+ new.block
+ inlinelinks
+ 'skip$ % links were inline -- don't repeat them
+ {
+ output.url
+ addeprints eprint empty$ not and
+ { format.eprint output.nonnull }
+ 'skip$
+ if$
+ adddoiresolver doi empty$ not and
+ { format.doi output.nonnull }
+ 'skip$
+ if$
+ addpubmedresolver pubmed empty$ not and
+ { format.pubmed output.nonnull }
+ 'skip$
+ if$
+ }
+ if$
+ }
+
+ % Wrapper for output.bibitem.original.
+ % If the URL field is not empty, set makeinlinelink to be true,
+ % so that an inline link will be started at the next opportunity
+ FUNCTION {output.bibitem}
+ { outside.brackets 'bracket.state :=
+ output.bibitem.original
+ inlinelinks url empty$ not doi empty$ not or pubmed empty$ not or eprint empty$ not or and
+ { #1 'makeinlinelink := }
+ { #0 'makeinlinelink := }
+ if$
+ }
+
+ % Wrapper for fin.entry.original
+ FUNCTION {fin.entry}
+ { output.web.refs % urlbst
+ makeinlinelink % ooops, it appears we didn't have a title for inlinelink
+ { possibly.setup.inlinelink % add some artificial link text here, as a fallback
+ linktextstring output.nonnull }
+ 'skip$
+ if$
+ bracket.state close.brackets = % urlbst
+ { "]" * }
+ 'skip$
+ if$
+ fin.entry.original
+ }
+
+ % Webpage entry type.
+ % Title and url fields required;
+ % author, note, year, month, and lastchecked fields optional
+ % See references
+ % ISO 690-2 http://www.nlc-bnc.ca/iso/tc46sc9/standard/690-2e.htm
+ % http://www.classroom.net/classroom/CitingNetResources.html
+ % http://neal.ctstateu.edu/history/cite.html
+ % http://www.cas.usf.edu/english/walker/mla.html
+ % for citation formats for web pages.
+ FUNCTION {webpage}
+ { output.bibitem
+ author empty$
+ { editor empty$
+ 'skip$ % author and editor both optional
+ { format.editors output.nonnull }
+ if$
+ }
+ { editor empty$
+ { format.authors output.nonnull }
+ { "can't use both author and editor fields in " cite$ * warning$ }
+ if$
+ }
+ if$
+ new.block
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$
+ format.title "title" output.check
+ inbrackets onlinestring output
+ new.block
+ year empty$
+ 'skip$
+ { format.date "year" output.check }
+ if$
+ % We don't need to output the URL details ('lastchecked' and 'url'),
+ % because fin.entry does that for us, using output.web.refs. The only
+ % reason we would want to put them here is if we were to decide that
+ % they should go in front of the rather miscellaneous information in 'note'.
+ new.block
+ note output
+ fin.entry
+ }
+ % ...urlbst to here
+
+
+ FUNCTION {article}
+ { output.bibitem
+ format.authors "author" output.check
+ author format.key output
+ format.date "year" output.check
+ date.block
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
+ format.title "title" output.check
+ new.block
+ crossref missing$
+ {
+ journal
+ "journal" bibinfo.check
+ emphasize
+ "journal" output.check
+ possibly.setup.inlinelink format.vol.num.pages output% urlbst
+ }
+ { format.article.crossref output.nonnull
+ format.pages output
+ }
+ if$
+ new.block
+ format.note output
+ fin.entry
+ }
+ FUNCTION {book}
+ { output.bibitem
+ author empty$
+ { format.editors "author and editor" output.check
+ editor format.key output
+ }
+ { format.authors output.nonnull
+ crossref missing$
+ { "author and editor" editor either.or.check }
+ 'skip$
+ if$
+ }
+ if$
+ format.date "year" output.check
+ date.block
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
+ format.btitle "title" output.check
+ format.edition output
+ crossref missing$
+ { format.bvolume output
+ new.block
+ format.number.series output
+ new.sentence
+ format.publisher.address output
+ }
+ {
+ new.block
+ format.book.crossref output.nonnull
+ }
+ if$
+ new.block
+ format.note output
+ fin.entry
+ }
+ FUNCTION {booklet}
+ { output.bibitem
+ format.authors output
+ author format.key output
+ format.date "year" output.check
+ date.block
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
+ format.title "title" output.check
+ new.block
+ howpublished "howpublished" bibinfo.check output
+ address "address" bibinfo.check output
+ new.block
+ format.note output
+ fin.entry
+ }
+
+ FUNCTION {inbook}
+ { output.bibitem
+ author empty$
+ { format.editors "author and editor" output.check
+ editor format.key output
+ }
+ { format.authors output.nonnull
+ crossref missing$
+ { "author and editor" editor either.or.check }
+ 'skip$
+ if$
+ }
+ if$
+ format.date "year" output.check
+ date.block
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
+ format.btitle "title" output.check
+ format.edition output
+ crossref missing$
+ {
+ format.bvolume output
+ format.number.series output
+ format.chapter "chapter" output.check
+ new.sentence
+ format.publisher.address output
+ new.block
+ }
+ {
+ format.chapter "chapter" output.check
+ new.block
+ format.book.crossref output.nonnull
+ }
+ if$
+ new.block
+ format.note output
+ fin.entry
+ }
+
+ FUNCTION {incollection}
+ { output.bibitem
+ format.authors "author" output.check
+ author format.key output
+ format.date "year" output.check
+ date.block
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
+ format.title "title" output.check
+ new.block
+ crossref missing$
+ { format.in.ed.booktitle "booktitle" output.check
+ format.edition output
+ format.bvolume output
+ format.number.series output
+ format.chapter.pages output
+ new.sentence
+ format.publisher.address output
+ }
+ { format.incoll.inproc.crossref output.nonnull
+ format.chapter.pages output
+ }
+ if$
+ new.block
+ format.note output
+ fin.entry
+ }
+ FUNCTION {inproceedings}
+ { output.bibitem
+ format.authors "author" output.check
+ author format.key output
+ format.date "year" output.check
+ date.block
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
+ format.title "title" output.check
+ new.block
+ crossref missing$
+ { format.in.booktitle "booktitle" output.check
+ format.bvolume output
+ format.number.series output
+ format.pages output
+ address "address" bibinfo.check output
+ new.sentence
+ organization "organization" bibinfo.check output
+ publisher "publisher" bibinfo.check output
+ }
+ { format.incoll.inproc.crossref output.nonnull
+ format.pages output
+ }
+ if$
+ new.block
+ format.note output
+ fin.entry
+ }
+ FUNCTION {conference} { inproceedings }
+ FUNCTION {manual}
+ { output.bibitem
+ format.authors output
+ author format.key output
+ format.date "year" output.check
+ date.block
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
+ format.btitle "title" output.check
+ format.edition output
+ organization address new.block.checkb
+ organization "organization" bibinfo.check output
+ address "address" bibinfo.check output
+ new.block
+ format.note output
+ fin.entry
+ }
+
+ FUNCTION {mastersthesis}
+ { output.bibitem
+ format.authors "author" output.check
+ author format.key output
+ format.date "year" output.check
+ date.block
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
+ format.title
+ "title" output.check
+ new.block
+ bbl.mthesis format.thesis.type output.nonnull
+ school "school" bibinfo.warn output
+ address "address" bibinfo.check output
+ month "month" bibinfo.check output
+ new.block
+ format.note output
+ fin.entry
+ }
+
+ FUNCTION {misc}
+ { output.bibitem
+ format.authors output
+ author format.key output
+ format.date "year" output.check
+ date.block
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
+ format.title output
+ new.block
+ howpublished "howpublished" bibinfo.check output
+ new.block
+ format.note output
+ fin.entry
+ }
+ FUNCTION {phdthesis}
+ { output.bibitem
+ format.authors "author" output.check
+ author format.key output
+ format.date "year" output.check
+ date.block
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
+ format.btitle
+ "title" output.check
+ new.block
+ bbl.phdthesis format.thesis.type output.nonnull
+ school "school" bibinfo.warn output
+ address "address" bibinfo.check output
+ new.block
+ format.note output
+ fin.entry
+ }
+
+ FUNCTION {proceedings}
+ { output.bibitem
+ format.editors output
+ editor format.key output
+ format.date "year" output.check
+ date.block
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
+ format.btitle "title" output.check
+ format.bvolume output
+ format.number.series output
+ new.sentence
+ publisher empty$
+ { format.organization.address output }
+ { organization "organization" bibinfo.check output
+ new.sentence
+ format.publisher.address output
+ }
+ if$
+ new.block
+ format.note output
+ fin.entry
+ }
+
+ FUNCTION {techreport}
+ { output.bibitem
+ format.authors "author" output.check
+ author format.key output
+ format.date "year" output.check
+ date.block
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
+ format.title
+ "title" output.check
+ new.block
+ format.tr.number output.nonnull
+ institution "institution" bibinfo.warn output
+ address "address" bibinfo.check output
+ new.block
+ format.note output
+ fin.entry
+ }
+
+ FUNCTION {unpublished}
+ { output.bibitem
+ format.authors "author" output.check
+ author format.key output
+ format.date "year" output.check
+ date.block
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
+ format.title "title" output.check
+ new.block
+ format.note "note" output.check
+ fin.entry
+ }
+
+ FUNCTION {default.type} { misc }
+ READ
+ FUNCTION {sortify}
+ { purify$
+ "l" change.case$
+ }
+ INTEGERS { len }
+ FUNCTION {chop.word}
+ { 's :=
+ 'len :=
+ s #1 len substring$ =
+ { s len #1 + global.max$ substring$ }
+ 's
+ if$
+ }
+ FUNCTION {format.lab.names}
+ { 's :=
+ "" 't :=
+ s #1 "{vv~}{ll}" format.name$
+ s num.names$ duplicate$
+ #2 >
+ { pop$
+ " " * bbl.etal *
+ }
+ { #2 <
+ 'skip$
+ { s #2 "{ff }{vv }{ll}{ jj}" format.name$ "others" =
+ {
+ " " * bbl.etal *
+ }
+ { bbl.and space.word * s #2 "{vv~}{ll}" format.name$
+ * }
+ if$
+ }
+ if$
+ }
+ if$
+ }
+
+ FUNCTION {author.key.label}
+ { author empty$
+ { key empty$
+ { cite$ #1 #3 substring$ }
+ 'key
+ if$
+ }
+ { author format.lab.names }
+ if$
+ }
+
+ FUNCTION {author.editor.key.label}
+ { author empty$
+ { editor empty$
+ { key empty$
+ { cite$ #1 #3 substring$ }
+ 'key
+ if$
+ }
+ { editor format.lab.names }
+ if$
+ }
+ { author format.lab.names }
+ if$
+ }
+
+ FUNCTION {editor.key.label}
+ { editor empty$
+ { key empty$
+ { cite$ #1 #3 substring$ }
+ 'key
+ if$
+ }
+ { editor format.lab.names }
+ if$
+ }
+
+ FUNCTION {calc.short.authors}
+ { type$ "book" =
+ type$ "inbook" =
+ or
+ 'author.editor.key.label
+ { type$ "proceedings" =
+ 'editor.key.label
+ 'author.key.label
+ if$
+ }
+ if$
+ 'short.list :=
+ }
+
+ FUNCTION {calc.label}
+ { calc.short.authors
+ short.list
+ "("
+ *
+ year duplicate$ empty$
+ short.list key field.or.null = or
+ { pop$ "" }
+ 'skip$
+ if$
+ *
+ 'label :=
+ }
+
+ FUNCTION {sort.format.names}
+ { 's :=
+ #1 'nameptr :=
+ ""
+ s num.names$ 'numnames :=
+ numnames 'namesleft :=
+ { namesleft #0 > }
+ { s nameptr
+ "{vv{ } }{ll{ }}{ ff{ }}{ jj{ }}"
+ format.name$ 't :=
+ nameptr #1 >
+ {
+ " " *
+ namesleft #1 = t "others" = and
+ { "zzzzz" * }
+ { t sortify * }
+ if$
+ }
+ { t sortify * }
+ if$
+ nameptr #1 + 'nameptr :=
+ namesleft #1 - 'namesleft :=
+ }
+ while$
+ }
+
+ FUNCTION {sort.format.title}
+ { 't :=
+ "A " #2
+ "An " #3
+ "The " #4 t chop.word
+ chop.word
+ chop.word
+ sortify
+ #1 global.max$ substring$
+ }
+ FUNCTION {author.sort}
+ { author empty$
+ { key empty$
+ { "to sort, need author or key in " cite$ * warning$
+ ""
+ }
+ { key sortify }
+ if$
+ }
+ { author sort.format.names }
+ if$
+ }
+ FUNCTION {author.editor.sort}
+ { author empty$
+ { editor empty$
+ { key empty$
+ { "to sort, need author, editor, or key in " cite$ * warning$
+ ""
+ }
+ { key sortify }
+ if$
+ }
+ { editor sort.format.names }
+ if$
+ }
+ { author sort.format.names }
+ if$
+ }
+ FUNCTION {editor.sort}
+ { editor empty$
+ { key empty$
+ { "to sort, need editor or key in " cite$ * warning$
+ ""
+ }
+ { key sortify }
+ if$
+ }
+ { editor sort.format.names }
+ if$
+ }
+ FUNCTION {presort}
+ { calc.label
+ label sortify
+ " "
+ *
+ type$ "book" =
+ type$ "inbook" =
+ or
+ 'author.editor.sort
+ { type$ "proceedings" =
+ 'editor.sort
+ 'author.sort
+ if$
+ }
+ if$
+ #1 entry.max$ substring$
+ 'sort.label :=
+ sort.label
+ *
+ " "
+ *
+ title field.or.null
+ sort.format.title
+ *
+ #1 entry.max$ substring$
+ 'sort.key$ :=
+ }
+
+ ITERATE {presort}
+ SORT
+ STRINGS { last.label next.extra }
+ INTEGERS { last.extra.num number.label }
+ FUNCTION {initialize.extra.label.stuff}
+ { #0 int.to.chr$ 'last.label :=
+ "" 'next.extra :=
+ #0 'last.extra.num :=
+ #0 'number.label :=
+ }
+ FUNCTION {forward.pass}
+ { last.label label =
+ { last.extra.num #1 + 'last.extra.num :=
+ last.extra.num int.to.chr$ 'extra.label :=
+ }
+ { "a" chr.to.int$ 'last.extra.num :=
+ "" 'extra.label :=
+ label 'last.label :=
+ }
+ if$
+ number.label #1 + 'number.label :=
+ }
+ FUNCTION {reverse.pass}
+ { next.extra "b" =
+ { "a" 'extra.label := }
+ 'skip$
+ if$
+ extra.label 'next.extra :=
+ extra.label
+ duplicate$ empty$
+ 'skip$
+ { year field.or.null #-1 #1 substring$ chr.to.int$ #65 <
+ { "{\natexlab{" swap$ * "}}" * }
+ { "{(\natexlab{" swap$ * "})}" * }
+ if$ }
+ if$
+ 'extra.label :=
+ label extra.label * 'label :=
+ }
+ EXECUTE {initialize.extra.label.stuff}
+ ITERATE {forward.pass}
+ REVERSE {reverse.pass}
+ FUNCTION {bib.sort.order}
+ { sort.label
+ " "
+ *
+ year field.or.null sortify
+ *
+ " "
+ *
+ title field.or.null
+ sort.format.title
+ *
+ #1 entry.max$ substring$
+ 'sort.key$ :=
+ }
+ ITERATE {bib.sort.order}
+ SORT
+ FUNCTION {begin.bib}
+ { preamble$ empty$
+ 'skip$
+ { preamble$ write$ newline$ }
+ if$
+ "\begin{thebibliography}{" number.label int.to.str$ * "}" *
+ write$ newline$
+ "\expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi"
+ write$ newline$
+ }
+ EXECUTE {begin.bib}
+ EXECUTE {init.urlbst.variables} % urlbst
+ EXECUTE {init.state.consts}
+ ITERATE {call.type$}
+ FUNCTION {end.bib}
+ { newline$
+ "\end{thebibliography}" write$ newline$
+ }
+ EXECUTE {end.bib}
+ %% End of customized bst file
+ %%
+ %% End of file `compling.bst'.
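As a reading aid (not part of the committed file): the urlbst-extended style above adds url/doi/eprint handling and a custom `webpage` entry type, which per its own comments requires `title` and `url` and optionally takes `author`, `note`, `year`, `month`, and `lastchecked`. A minimal sketch of a `.bib` entry that would exercise it — the entry key and field values below are hypothetical:

```bibtex
% Hypothetical entry; with this .bst, the URL is emitted via \url{...}
% or as an inline \href link, depending on hrefform/inlinelinks.
@webpage{example-toolkit-web,
  title       = {Example NLP Toolkit Documentation},
  url         = {https://example.org/docs},
  year        = {2021},
  lastchecked = {2021-06-01},
}
```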
references/2021.naacl.nguyen/source/minted.sty ADDED
@@ -0,0 +1,1212 @@
+ %%
+ %% This is file `minted.sty',
+ %% generated with the docstrip utility.
+ %%
+ %% The original source files were:
+ %%
+ %% minted.dtx (with options: `package')
+ %% Copyright 2013--2017 Geoffrey M. Poore
+ %% Copyright 2010--2011 Konrad Rudolph
+ %%
+ %% This work may be distributed and/or modified under the
+ %% conditions of the LaTeX Project Public License, either version 1.3
+ %% of this license or (at your option) any later version.
+ %% The latest version of this license is in
+ %% http://www.latex-project.org/lppl.txt
+ %% and version 1.3 or later is part of all distributions of LaTeX
+ %% version 2005/12/01 or later.
+ %%
+ %% Additionally, the project may be distributed under the terms of the new BSD
+ %% license.
+ %%
+ %% This work has the LPPL maintenance status `maintained'.
+ %%
+ %% The Current Maintainer of this work is Geoffrey Poore.
+ %%
+ %% This work consists of the files minted.dtx and minted.ins
+ %% and the derived file minted.sty.
+ \NeedsTeXFormat{LaTeX2e}
+ \ProvidesPackage{minted}
+ [2017/09/03 v2.5.1dev Yet another Pygments shim for LaTeX]
+ \RequirePackage{keyval}
+ \RequirePackage{kvoptions}
+ \RequirePackage{fvextra}
+ \RequirePackage{ifthen}
+ \RequirePackage{calc}
+ \IfFileExists{shellesc.sty}
+ {\RequirePackage{shellesc}
+ \@ifpackagelater{shellesc}{2016/04/29}
+ {}
+ {\protected\def\ShellEscape{\immediate\write18 }}}
+ {\protected\def\ShellEscape{\immediate\write18 }}
+ \RequirePackage{ifplatform}
+ \RequirePackage{pdftexcmds}
+ \RequirePackage{etoolbox}
+ \RequirePackage{xstring}
+ \RequirePackage{lineno}
+ \RequirePackage{framed}
+ \AtEndPreamble{%
+ \@ifpackageloaded{color}{}{%
+ \@ifpackageloaded{xcolor}{}{\RequirePackage{xcolor}}}%
+ }
+ \DeclareVoidOption{chapter}{\def\minted@float@within{chapter}}
+ \DeclareVoidOption{section}{\def\minted@float@within{section}}
+ \DeclareBoolOption{newfloat}
+ \DeclareBoolOption[true]{cache}
+ \StrSubstitute{\jobname}{ }{_}[\minted@jobname]
+ \StrSubstitute{\minted@jobname}{*}{_}[\minted@jobname]
+ \StrSubstitute{\minted@jobname}{"}{}[\minted@jobname]
+ \StrSubstitute{\minted@jobname}{'}{_}[\minted@jobname]
+ \newcommand{\minted@cachedir}{\detokenize{_}minted-\minted@jobname}
+ \let\minted@cachedir@windows\minted@cachedir
+ \define@key{minted}{cachedir}{%
+ \@namedef{minted@cachedir}{#1}%
+ \StrSubstitute{\minted@cachedir}{/}{\@backslashchar}[\minted@cachedir@windows]}
+ \DeclareBoolOption{finalizecache}
+ \DeclareBoolOption{frozencache}
+ \let\minted@outputdir\@empty
+ \let\minted@outputdir@windows\@empty
+ \define@key{minted}{outputdir}{%
+ \@namedef{minted@outputdir}{#1/}%
+ \StrSubstitute{\minted@outputdir}{/}%
+ {\@backslashchar}[\minted@outputdir@windows]}
+ \DeclareBoolOption{kpsewhich}
+ \DeclareBoolOption{langlinenos}
+ \DeclareBoolOption{draft}
+ \DeclareComplementaryOption{final}{draft}
+ \ProcessKeyvalOptions*
+ \ifthenelse{\boolean{minted@newfloat}}{\RequirePackage{newfloat}}{\RequirePackage{float}}
+ \ifcsname tikzifexternalizing\endcsname
+ \tikzifexternalizing{\minted@drafttrue\minted@cachefalse}{}
+ \else
+ \ifcsname tikzexternalrealjob\endcsname
+ \minted@drafttrue
+ \minted@cachefalse
+ \else
+ \fi
+ \fi
+ \ifthenelse{\boolean{minted@finalizecache}}%
+ {\ifthenelse{\boolean{minted@frozencache}}%
+ {\PackageError{minted}%
+ {Options "finalizecache" and "frozencache" are not compatible}%
+ {Options "finalizecache" and "frozencache" are not compatible}}%
+ {}}%
+ {}
+ \ifthenelse{\boolean{minted@cache}}%
+ {\ifthenelse{\boolean{minted@frozencache}}%
+ {}%
+ {\AtEndOfPackage{\ProvideDirectory{\minted@outputdir\minted@cachedir}}}}%
+ {}
+ \newcommand{\minted@input}[1]{%
+ \IfFileExists{#1}%
+ {\input{#1}}%
+ {\PackageError{minted}{Missing Pygments output; \string\inputminted\space
+ was^^Jprobably given a file that does not exist--otherwise, you may need
+ ^^Jthe outputdir package option, or may be using an incompatible build
+ tool,^^Jor may be using frozencache with a missing file}%
+ {This could be caused by using -output-directory or -aux-directory
+ ^^Jwithout setting minted's outputdir, or by using a build tool that
+ ^^Jchanges paths in ways minted cannot detect,
+ ^^Jor using frozencache with a missing file.}}%
+ }
+ \newcommand{\minted@infile}{\minted@jobname.out.pyg}
+ \newcommand{\minted@cachelist}{}
+ \newcommand{\minted@addcachefile}[1]{%
+ \expandafter\long\expandafter\gdef\expandafter\minted@cachelist\expandafter{%
+ \minted@cachelist,^^J%
+ \space\space#1}%
+ \expandafter\gdef\csname minted@cached@#1\endcsname{}%
+ }
+ \newcommand{\minted@savecachelist}{%
+ \ifdefempty{\minted@cachelist}{}{%
+ \immediate\write\@mainaux{%
+ \string\gdef\string\minted@oldcachelist\string{%
+ \minted@cachelist\string}}%
+ }%
+ }
+ \newcommand{\minted@cleancache}{%
+ \ifcsname minted@oldcachelist\endcsname
+ \def\do##1{%
+ \ifthenelse{\equal{##1}{}}{}{%
+ \ifcsname minted@cached@##1\endcsname\else
+ \DeleteFile[\minted@outputdir\minted@cachedir]{##1}%
+ \fi
+ }%
+ }%
+ \expandafter\docsvlist\expandafter{\minted@oldcachelist}%
+ \else
+ \fi
+ }
+ \ifthenelse{\boolean{minted@draft}}%
+ {\AtEndDocument{%
+ \ifcsname minted@oldcachelist\endcsname
+ \StrSubstitute{\minted@oldcachelist}{,}{,^^J }[\minted@cachelist]
+ \minted@savecachelist
+ \fi}}%
+ {\ifthenelse{\boolean{minted@frozencache}}%
+ {\AtEndDocument{%
+ \ifcsname minted@oldcachelist\endcsname
+ \StrSubstitute{\minted@oldcachelist}{,}{,^^J }[\minted@cachelist]
+ \minted@savecachelist
+ \fi}}%
+ {\AtEndDocument{%
+ \minted@savecachelist
+ \minted@cleancache}}}%
+ \ifwindows
+ \providecommand{\DeleteFile}[2][]{%
+ \ifthenelse{\equal{#1}{}}%
+ {\IfFileExists{#2}{\ShellEscape{del #2}}{}}%
+ {\IfFileExists{#1/#2}{%
+ \StrSubstitute{#1}{/}{\@backslashchar}[\minted@windir]
+ \ShellEscape{del \minted@windir\@backslashchar #2}}{}}}
+ \else
+ \providecommand{\DeleteFile}[2][]{%
+ \ifthenelse{\equal{#1}{}}%
+ {\IfFileExists{#2}{\ShellEscape{rm #2}}{}}%
+ {\IfFileExists{#1/#2}{\ShellEscape{rm #1/#2}}{}}}
+ \fi
+ \ifwindows
+ \newcommand{\ProvideDirectory}[1]{%
+ \StrSubstitute{#1}{/}{\@backslashchar}[\minted@windir]
+ \ShellEscape{if not exist \minted@windir\space mkdir \minted@windir}}
+ \else
+ \newcommand{\ProvideDirectory}[1]{%
+ \ShellEscape{mkdir -p #1}}
+ \fi
+ \newboolean{AppExists}
+ \newread\minted@appexistsfile
+ \newcommand{\TestAppExists}[1]{
+ \ifwindows
+ \DeleteFile{\minted@jobname.aex}
+ \ShellEscape{for \string^\@percentchar i in (#1.exe #1.bat #1.cmd)
+ do set > \minted@jobname.aex <nul: /p
+ x=\string^\@percentchar \string~$PATH:i>> \minted@jobname.aex}
+ %$ <- balance syntax highlighting
+ \immediate\openin\minted@appexistsfile\minted@jobname.aex
+ \expandafter\def\expandafter\@tmp@cr\expandafter{\the\endlinechar}
+ \endlinechar=-1\relax
+ \readline\minted@appexistsfile to \minted@apppathifexists
+ \endlinechar=\@tmp@cr
+ \ifthenelse{\equal{\minted@apppathifexists}{}}
+ {\AppExistsfalse}
+ {\AppExiststrue}
+ \immediate\closein\minted@appexistsfile
+ \DeleteFile{\minted@jobname.aex}
+ \else
+ \ShellEscape{which #1 && touch \minted@jobname.aex}
+ \IfFileExists{\minted@jobname.aex}
+ {\AppExiststrue
+ \DeleteFile{\minted@jobname.aex}}
+ {\AppExistsfalse}
+ \fi
+ }
+ \newcommand{\minted@optlistcl@g}{}
+ \newcommand{\minted@optlistcl@g@i}{}
+ \let\minted@lang\@empty
+ \newcommand{\minted@optlistcl@lang}{}
+ \newcommand{\minted@optlistcl@lang@i}{}
+ \newcommand{\minted@optlistcl@cmd}{}
+ \newcommand{\minted@optlistfv@g}{}
+ \newcommand{\minted@optlistfv@g@i}{}
+ \newcommand{\minted@optlistfv@lang}{}
+ \newcommand{\minted@optlistfv@lang@i}{}
+ \newcommand{\minted@optlistfv@cmd}{}
+ \newcommand{\minted@configlang}[1]{%
+ \def\minted@lang{#1}%
+ \ifcsname minted@optlistcl@lang\minted@lang\endcsname\else
+ \expandafter\gdef\csname minted@optlistcl@lang\minted@lang\endcsname{}%
+ \fi
+ \ifcsname minted@optlistcl@lang\minted@lang @i\endcsname\else
+ \expandafter\gdef\csname minted@optlistcl@lang\minted@lang @i\endcsname{}%
+ \fi
+ \ifcsname minted@optlistfv@lang\minted@lang\endcsname\else
+ \expandafter\gdef\csname minted@optlistfv@lang\minted@lang\endcsname{}%
+ \fi
+ \ifcsname minted@optlistfv@lang\minted@lang @i\endcsname\else
+ \expandafter\gdef\csname minted@optlistfv@lang\minted@lang @i\endcsname{}%
+ \fi
+ }
+ \newcommand{\minted@addto@optlistcl}[2]{%
+ \expandafter\def\expandafter#1\expandafter{#1%
+ \detokenize{#2}\space}}
+ \newcommand{\minted@addto@optlistcl@lang}[2]{%
+ \expandafter\let\expandafter\minted@tmp\csname #1\endcsname
+ \expandafter\def\expandafter\minted@tmp\expandafter{\minted@tmp%
+ \detokenize{#2}\space}%
+ \expandafter\let\csname #1\endcsname\minted@tmp}
+ \newcommand{\minted@def@optcl}[4][]{%
+ \ifthenelse{\equal{#1}{}}%
+ {\define@key{minted@opt@g}{#2}{%
+ \minted@addto@optlistcl{\minted@optlistcl@g}{#3=#4}%
+ \@namedef{minted@opt@g:#2}{#4}}%
+ \define@key{minted@opt@g@i}{#2}{%
+ \minted@addto@optlistcl{\minted@optlistcl@g@i}{#3=#4}%
+ \@namedef{minted@opt@g@i:#2}{#4}}%
+ \define@key{minted@opt@lang}{#2}{%
+ \minted@addto@optlistcl@lang{minted@optlistcl@lang\minted@lang}{#3=#4}%
+ \@namedef{minted@opt@lang\minted@lang:#2}{#4}}%
+ \define@key{minted@opt@lang@i}{#2}{%
+ \minted@addto@optlistcl@lang{%
+ minted@optlistcl@lang\minted@lang @i}{#3=#4}%
+ \@namedef{minted@opt@lang\minted@lang @i:#2}{#4}}%
+ \define@key{minted@opt@cmd}{#2}{%
+ \minted@addto@optlistcl{\minted@optlistcl@cmd}{#3=#4}%
+ \@namedef{minted@opt@cmd:#2}{#4}}}%
+ {\define@key{minted@opt@g}{#2}[#1]{%
+ \minted@addto@optlistcl{\minted@optlistcl@g}{#3=#4}%
+ \@namedef{minted@opt@g:#2}{#4}}%
+ \define@key{minted@opt@g@i}{#2}[#1]{%
+ \minted@addto@optlistcl{\minted@optlistcl@g@i}{#3=#4}%
+ \@namedef{minted@opt@g@i:#2}{#4}}%
+ \define@key{minted@opt@lang}{#2}[#1]{%
+ \minted@addto@optlistcl@lang{minted@optlistcl@lang\minted@lang}{#3=#4}%
+ \@namedef{minted@opt@lang\minted@lang:#2}{#4}}%
+ \define@key{minted@opt@lang@i}{#2}[#1]{%
+ \minted@addto@optlistcl@lang{%
+ minted@optlistcl@lang\minted@lang @i}{#3=#4}%
+ \@namedef{minted@opt@lang\minted@lang @i:#2}{#4}}%
+ \define@key{minted@opt@cmd}{#2}[#1]{%
+ \minted@addto@optlistcl{\minted@optlistcl@cmd}{#3=#4}%
+ \@namedef{minted@opt@cmd:#2}{#4}}}%
+ }
+ \edef\minted@hashchar{\string#}
+ \edef\minted@dollarchar{\string$}
+ \edef\minted@ampchar{\string&}
+ \edef\minted@underscorechar{\string_}
+ \edef\minted@tildechar{\string~}
+ \edef\minted@leftsquarebracket{\string[}
+ \edef\minted@rightsquarebracket{\string]}
+ \newcommand{\minted@escchars}{%
+ \let\#\minted@hashchar
+ \let\%\@percentchar
+ \let\{\@charlb
+ \let\}\@charrb
+ \let\$\minted@dollarchar
+ \let\&\minted@ampchar
+ \let\_\minted@underscorechar
+ \let\\\@backslashchar
+ \let~\minted@tildechar
+ \let\~\minted@tildechar
+ \let\[\minted@leftsquarebracket
+ \let\]\minted@rightsquarebracket
+ } %$ <- highlighting
+ \newcommand{\minted@addto@optlistcl@e}[2]{%
+ \begingroup
+ \minted@escchars
+ \xdef\minted@xtmp{#2}%
+ \endgroup
+ \expandafter\minted@addto@optlistcl@e@i\expandafter{\minted@xtmp}{#1}}
+ \def\minted@addto@optlistcl@e@i#1#2{%
+ \expandafter\def\expandafter#2\expandafter{#2#1\space}}
+ \newcommand{\minted@addto@optlistcl@lang@e}[2]{%
+ \begingroup
+ \minted@escchars
+ \xdef\minted@xtmp{#2}%
+ \endgroup
+ \expandafter\minted@addto@optlistcl@lang@e@i\expandafter{\minted@xtmp}{#1}}
+ \def\minted@addto@optlistcl@lang@e@i#1#2{%
+ \expandafter\let\expandafter\minted@tmp\csname #2\endcsname
+ \expandafter\def\expandafter\minted@tmp\expandafter{\minted@tmp#1\space}%
+ \expandafter\let\csname #2\endcsname\minted@tmp}
+ \newcommand{\minted@def@optcl@e}[4][]{%
+ \ifthenelse{\equal{#1}{}}%
+ {\define@key{minted@opt@g}{#2}{%
+ \minted@addto@optlistcl@e{\minted@optlistcl@g}{#3=#4}%
+ \@namedef{minted@opt@g:#2}{#4}}%
+ \define@key{minted@opt@g@i}{#2}{%
+ \minted@addto@optlistcl@e{\minted@optlistcl@g@i}{#3=#4}%
+ \@namedef{minted@opt@g@i:#2}{#4}}%
+ \define@key{minted@opt@lang}{#2}{%
+ \minted@addto@optlistcl@lang@e{minted@optlistcl@lang\minted@lang}{#3=#4}%
+ \@namedef{minted@opt@lang\minted@lang:#2}{#4}}%
+ \define@key{minted@opt@lang@i}{#2}{%
+ \minted@addto@optlistcl@lang@e{%
+ minted@optlistcl@lang\minted@lang @i}{#3=#4}%
+ \@namedef{minted@opt@lang\minted@lang @i:#2}{#4}}%
+ \define@key{minted@opt@cmd}{#2}{%
+ \minted@addto@optlistcl@e{\minted@optlistcl@cmd}{#3=#4}%
+ \@namedef{minted@opt@cmd:#2}{#4}}}%
+ {\define@key{minted@opt@g}{#2}[#1]{%
+ \minted@addto@optlistcl@e{\minted@optlistcl@g}{#3=#4}%
+ \@namedef{minted@opt@g:#2}{#4}}%
+ \define@key{minted@opt@g@i}{#2}[#1]{%
+ \minted@addto@optlistcl@e{\minted@optlistcl@g@i}{#3=#4}%
+ \@namedef{minted@opt@g@i:#2}{#4}}%
+ \define@key{minted@opt@lang}{#2}[#1]{%
+ \minted@addto@optlistcl@lang@e{minted@optlistcl@lang\minted@lang}{#3=#4}%
+ \@namedef{minted@opt@lang\minted@lang:#2}{#4}}%
+ \define@key{minted@opt@lang@i}{#2}[#1]{%
+ \minted@addto@optlistcl@lang@e{%
+ minted@optlistcl@lang\minted@lang @i}{#3=#4}%
+ \@namedef{minted@opt@lang\minted@lang @i:#2}{#4}}%
+ \define@key{minted@opt@cmd}{#2}[#1]{%
+ \minted@addto@optlistcl@e{\minted@optlistcl@cmd}{#3=#4}%
+ \@namedef{minted@opt@cmd:#2}{#4}}}%
+ }
+ \newcommand{\minted@def@optcl@switch}[2]{%
+ \define@booleankey{minted@opt@g}{#1}%
+ {\minted@addto@optlistcl{\minted@optlistcl@g}{#2=True}%
+ \@namedef{minted@opt@g:#1}{true}}
+ {\minted@addto@optlistcl{\minted@optlistcl@g}{#2=False}%
+ \@namedef{minted@opt@g:#1}{false}}
+ \define@booleankey{minted@opt@g@i}{#1}%
+ {\minted@addto@optlistcl{\minted@optlistcl@g@i}{#2=True}%
+ \@namedef{minted@opt@g@i:#1}{true}}
+ {\minted@addto@optlistcl{\minted@optlistcl@g@i}{#2=False}%
+ \@namedef{minted@opt@g@i:#1}{false}}
+ \define@booleankey{minted@opt@lang}{#1}%
+ {\minted@addto@optlistcl@lang{minted@optlistcl@lang\minted@lang}{#2=True}%
+ \@namedef{minted@opt@lang\minted@lang:#1}{true}}
+ {\minted@addto@optlistcl@lang{minted@optlistcl@lang\minted@lang}{#2=False}%
+ \@namedef{minted@opt@lang\minted@lang:#1}{false}}
+ \define@booleankey{minted@opt@lang@i}{#1}%
+ {\minted@addto@optlistcl@lang{minted@optlistcl@lang\minted@lang @i}{#2=True}%
+ \@namedef{minted@opt@lang\minted@lang @i:#1}{true}}
+ {\minted@addto@optlistcl@lang{minted@optlistcl@lang\minted@lang @i}{#2=False}%
+ \@namedef{minted@opt@lang\minted@lang @i:#1}{false}}
+ \define@booleankey{minted@opt@cmd}{#1}%
+ {\minted@addto@optlistcl{\minted@optlistcl@cmd}{#2=True}%
+ \@namedef{minted@opt@cmd:#1}{true}}
+ {\minted@addto@optlistcl{\minted@optlistcl@cmd}{#2=False}%
+ \@namedef{minted@opt@cmd:#1}{false}}
+ }
+ \newcommand{\minted@def@optfv}[1]{%
+ \define@key{minted@opt@g}{#1}{%
+ \expandafter\def\expandafter\minted@optlistfv@g\expandafter{%
+ \minted@optlistfv@g#1={##1},}%
+ \@namedef{minted@opt@g:#1}{##1}}
+ \define@key{minted@opt@g@i}{#1}{%
+ \expandafter\def\expandafter\minted@optlistfv@g@i\expandafter{%
+ \minted@optlistfv@g@i#1={##1},}%
+ \@namedef{minted@opt@g@i:#1}{##1}}
+ \define@key{minted@opt@lang}{#1}{%
+ \expandafter\let\expandafter\minted@tmp%
+ \csname minted@optlistfv@lang\minted@lang\endcsname
+ \expandafter\def\expandafter\minted@tmp\expandafter{%
+ \minted@tmp#1={##1},}%
+ \expandafter\let\csname minted@optlistfv@lang\minted@lang\endcsname%
+ \minted@tmp
+ \@namedef{minted@opt@lang\minted@lang:#1}{##1}}
+ \define@key{minted@opt@lang@i}{#1}{%
+ \expandafter\let\expandafter\minted@tmp%
+ \csname minted@optlistfv@lang\minted@lang @i\endcsname
+ \expandafter\def\expandafter\minted@tmp\expandafter{%
+ \minted@tmp#1={##1},}%
+ \expandafter\let\csname minted@optlistfv@lang\minted@lang @i\endcsname%
+ \minted@tmp
+ \@namedef{minted@opt@lang\minted@lang @i:#1}{##1}}
+ \define@key{minted@opt@cmd}{#1}{%
+ \expandafter\def\expandafter\minted@optlistfv@cmd\expandafter{%
+ \minted@optlistfv@cmd#1={##1},}%
+ \@namedef{minted@opt@cmd:#1}{##1}}
+ }
+ \newcommand{\minted@def@optfv@switch}[1]{%
+ \define@booleankey{minted@opt@g}{#1}%
+ {\expandafter\def\expandafter\minted@optlistfv@g\expandafter{%
+ \minted@optlistfv@g#1=true,}%
+ \@namedef{minted@opt@g:#1}{true}}%
+ {\expandafter\def\expandafter\minted@optlistfv@g\expandafter{%
+ \minted@optlistfv@g#1=false,}%
+ \@namedef{minted@opt@g:#1}{false}}%
+ \define@booleankey{minted@opt@g@i}{#1}%
+ {\expandafter\def\expandafter\minted@optlistfv@g@i\expandafter{%
+ \minted@optlistfv@g@i#1=true,}%
+ \@namedef{minted@opt@g@i:#1}{true}}%
+ {\expandafter\def\expandafter\minted@optlistfv@g@i\expandafter{%
+ \minted@optlistfv@g@i#1=false,}%
+ \@namedef{minted@opt@g@i:#1}{false}}%
+ \define@booleankey{minted@opt@lang}{#1}%
+ {\expandafter\let\expandafter\minted@tmp%
+ \csname minted@optlistfv@lang\minted@lang\endcsname
+ \expandafter\def\expandafter\minted@tmp\expandafter{%
+ \minted@tmp#1=true,}%
+ \expandafter\let\csname minted@optlistfv@lang\minted@lang\endcsname%
+ \minted@tmp
+ \@namedef{minted@opt@lang\minted@lang:#1}{true}}%
+ {\expandafter\let\expandafter\minted@tmp%
+ \csname minted@optlistfv@lang\minted@lang\endcsname
+ \expandafter\def\expandafter\minted@tmp\expandafter{%
+ \minted@tmp#1=false,}%
+ \expandafter\let\csname minted@optlistfv@lang\minted@lang\endcsname%
+ \minted@tmp
+ \@namedef{minted@opt@lang\minted@lang:#1}{false}}%
+ \define@booleankey{minted@opt@lang@i}{#1}%
+ {\expandafter\let\expandafter\minted@tmp%
+ \csname minted@optlistfv@lang\minted@lang @i\endcsname
+ \expandafter\def\expandafter\minted@tmp\expandafter{%
+ \minted@tmp#1=true,}%
+ \expandafter\let\csname minted@optlistfv@lang\minted@lang @i\endcsname%
+ \minted@tmp
+ \@namedef{minted@opt@lang\minted@lang @i:#1}{true}}%
+ {\expandafter\let\expandafter\minted@tmp%
+ \csname minted@optlistfv@lang\minted@lang @i\endcsname
+ \expandafter\def\expandafter\minted@tmp\expandafter{%
+ \minted@tmp#1=false,}%
+ \expandafter\let\csname minted@optlistfv@lang\minted@lang @i\endcsname%
+ \minted@tmp
+ \@namedef{minted@opt@lang\minted@lang @i:#1}{false}}%
+ \define@booleankey{minted@opt@cmd}{#1}%
+ {\expandafter\def\expandafter\minted@optlistfv@cmd\expandafter{%
+ \minted@optlistfv@cmd#1=true,}%
+ \@namedef{minted@opt@cmd:#1}{true}}%
+ {\expandafter\def\expandafter\minted@optlistfv@cmd\expandafter{%
+ \minted@optlistfv@cmd#1=false,}%
+ \@namedef{minted@opt@cmd:#1}{false}}%
+ }
+ \newboolean{minted@isinline}
+ \newcommand{\minted@fvset}{%
+ \expandafter\fvset\expandafter{\minted@optlistfv@g}%
+ \expandafter\let\expandafter\minted@tmp%
+ \csname minted@optlistfv@lang\minted@lang\endcsname
+ \expandafter\fvset\expandafter{\minted@tmp}%
+ \ifthenelse{\boolean{minted@isinline}}%
+ {\expandafter\fvset\expandafter{\minted@optlistfv@g@i}%
+ \expandafter\let\expandafter\minted@tmp%
+ \csname minted@optlistfv@lang\minted@lang @i\endcsname
+ \expandafter\fvset\expandafter{\minted@tmp}}%
+ {}%
+ \expandafter\fvset\expandafter{\minted@optlistfv@cmd}%
+ }
+ \newcommand{\minted@def@opt}[2][]{%
+ \define@key{minted@opt@g}{#2}{%
+ \@namedef{minted@opt@g:#2}{##1}}
+ \define@key{minted@opt@g@i}{#2}{%
+ \@namedef{minted@opt@g@i:#2}{##1}}
+ \define@key{minted@opt@lang}{#2}{%
+ \@namedef{minted@opt@lang\minted@lang:#2}{##1}}
+ \define@key{minted@opt@lang@i}{#2}{%
+ \@namedef{minted@opt@lang\minted@lang @i:#2}{##1}}
+ \define@key{minted@opt@cmd}{#2}{%
+ \@namedef{minted@opt@cmd:#2}{##1}}
+ \ifstrempty{#1}{}{\@namedef{minted@opt@g:#2}{#1}}%
+ }
+ \newcommand{\minted@checkstyle}[1]{%
+ \ifcsname minted@styleloaded@\ifstrempty{#1}{default-pyg-prefix}{#1}\endcsname\else
+ \ifstrempty{#1}{}{\ifcsname PYG\endcsname\else\minted@checkstyle{}\fi}%
+ \expandafter\gdef%
+ \csname minted@styleloaded@\ifstrempty{#1}{default-pyg-prefix}{#1}\endcsname{}%
+ \ifthenelse{\boolean{minted@cache}}%
+ {\IfFileExists
+ {\minted@outputdir\minted@cachedir/\ifstrempty{#1}{default-pyg-prefix}{#1}.pygstyle}%
+ {}%
+ {%
+ \ifthenelse{\boolean{minted@frozencache}}%
+ {\PackageError{minted}%
+ {Missing style definition for #1 with frozencache}%
+ {Missing style definition for #1 with frozencache}}%
+ {\ifwindows
+ \ShellEscape{%
+ \MintedPygmentize\space -S \ifstrempty{#1}{default}{#1} -f latex
+ -P commandprefix=PYG#1
+ > \minted@outputdir@windows\minted@cachedir@windows\@backslashchar%
+ \ifstrempty{#1}{default-pyg-prefix}{#1}.pygstyle}%
+ \else
+ \ShellEscape{%
+ \MintedPygmentize\space -S \ifstrempty{#1}{default}{#1} -f latex
+ -P commandprefix=PYG#1
+ > \minted@outputdir\minted@cachedir/%
+ \ifstrempty{#1}{default-pyg-prefix}{#1}.pygstyle}%
+ \fi}%
+ }%
+ \begingroup
+ \let\def\gdef
+ \catcode\string``=12
+ \catcode`\_=11
+ \catcode`\-=11
+ \catcode`\%=14
+ \endlinechar=-1\relax
+ \minted@input{%
+ \minted@outputdir\minted@cachedir/\ifstrempty{#1}{default-pyg-prefix}{#1}.pygstyle}%
+ \endgroup
+ \minted@addcachefile{\ifstrempty{#1}{default-pyg-prefix}{#1}.pygstyle}}%
+ {%
+ \ifwindows
+ \ShellEscape{%
+ \MintedPygmentize\space -S \ifstrempty{#1}{default}{#1} -f latex
+ -P commandprefix=PYG#1 > \minted@outputdir@windows\minted@jobname.out.pyg}%
+ \else
+ \ShellEscape{%
+ \MintedPygmentize\space -S \ifstrempty{#1}{default}{#1} -f latex
+ -P commandprefix=PYG#1 > \minted@outputdir\minted@jobname.out.pyg}%
+ \fi
+ \begingroup
+ \let\def\gdef
+ \catcode\string``=12
+ \catcode`\_=11
+ \catcode`\-=11
+ \catcode`\%=14
+ \endlinechar=-1\relax
+ \minted@input{\minted@outputdir\minted@jobname.out.pyg}%
+ \endgroup}%
+ \ifstrempty{#1}{\minted@patch@PYGZsq}{}%
+ \fi
+ }
+ \ifthenelse{\boolean{minted@draft}}{\renewcommand{\minted@checkstyle}[1]{}}{}
+ \newcommand{\minted@patch@PYGZsq}{%
+ \ifcsname PYGZsq\endcsname
+ \expandafter\ifdefstring\expandafter{\csname PYGZsq\endcsname}{\char`\'}%
+ {\minted@patch@PYGZsq@i}%
+ {}%
+ \fi
+ }
+ \begingroup
+ \catcode`\'=\active
+ \gdef\minted@patch@PYGZsq@i{\gdef\PYGZsq{'}}
+ \endgroup
+ \ifthenelse{\boolean{minted@draft}}{}{\AtBeginDocument{\minted@patch@PYGZsq}}
+ \newcommand{\minted@def@opt@switch}[2][false]{%
+ \define@booleankey{minted@opt@g}{#2}%
+ {\@namedef{minted@opt@g:#2}{true}}%
+ {\@namedef{minted@opt@g:#2}{false}}
+ \define@booleankey{minted@opt@g@i}{#2}%
+ {\@namedef{minted@opt@g@i:#2}{true}}%
+ {\@namedef{minted@opt@g@i:#2}{false}}
+ \define@booleankey{minted@opt@lang}{#2}%
+ {\@namedef{minted@opt@lang\minted@lang:#2}{true}}%
+ {\@namedef{minted@opt@lang\minted@lang:#2}{false}}
+ \define@booleankey{minted@opt@lang@i}{#2}%
+ {\@namedef{minted@opt@lang\minted@lang @i:#2}{true}}%
+ {\@namedef{minted@opt@lang\minted@lang @i:#2}{false}}
+ \define@booleankey{minted@opt@cmd}{#2}%
+ {\@namedef{minted@opt@cmd:#2}{true}}%
+ {\@namedef{minted@opt@cmd:#2}{false}}%
+ \@namedef{minted@opt@g:#2}{#1}%
+ }
+ \def\minted@get@opt#1#2{%
+ \ifcsname minted@opt@cmd:#1\endcsname
+ \csname minted@opt@cmd:#1\endcsname
+ \else
+ \ifminted@isinline
+ \ifcsname minted@opt@lang\minted@lang @i:#1\endcsname
+ \csname minted@opt@lang\minted@lang @i:#1\endcsname
+ \else
+ \ifcsname minted@opt@g@i:#1\endcsname
+ \csname minted@opt@g@i:#1\endcsname
+ \else
+ \ifcsname minted@opt@lang\minted@lang:#1\endcsname
+ \csname minted@opt@lang\minted@lang:#1\endcsname
+ \else
+ \ifcsname minted@opt@g:#1\endcsname
+ \csname minted@opt@g:#1\endcsname
+ \else
+ #2%
+ \fi
+ \fi
+ \fi
+ \fi
+ \else
+ \ifcsname minted@opt@lang\minted@lang:#1\endcsname
+ \csname minted@opt@lang\minted@lang:#1\endcsname
+ \else
+ \ifcsname minted@opt@g:#1\endcsname
+ \csname minted@opt@g:#1\endcsname
+ \else
+ #2%
+ \fi
+ \fi
+ \fi
+ \fi
+ }%
+ \minted@def@optcl{encoding}{-P encoding}{#1}
+ \minted@def@optcl{outencoding}{-P outencoding}{#1}
+ \minted@def@optcl@e{escapeinside}{-P "escapeinside}{#1"}
+ \minted@def@optcl@switch{stripnl}{-P stripnl}
+ \minted@def@optcl@switch{stripall}{-P stripall}
+ \minted@def@optcl@switch{python3}{-P python3}
+ \minted@def@optcl@switch{funcnamehighlighting}{-P funcnamehighlighting}
+ \minted@def@optcl@switch{startinline}{-P startinline}
+ \ifthenelse{\boolean{minted@draft}}%
+ {\minted@def@optfv{gobble}}%
+ {\minted@def@optcl{gobble}{-F gobble:n}{#1}}
+ \minted@def@optcl{codetagify}{-F codetagify:codetags}{#1}
+ \minted@def@optcl{keywordcase}{-F keywordcase:case}{#1}
+ \minted@def@optcl@switch{texcl}{-P texcomments}
+ \minted@def@optcl@switch{texcomments}{-P texcomments}
+ \minted@def@optcl@switch{mathescape}{-P mathescape}
+ \minted@def@optfv@switch{linenos}
+ \minted@def@opt{style}
+ \minted@def@optfv{frame}
+ \minted@def@optfv{framesep}
+ \minted@def@optfv{framerule}
+ \minted@def@optfv{rulecolor}
+ \minted@def@optfv{numbersep}
+ \minted@def@optfv{numbers}
+ \minted@def@optfv{firstnumber}
+ \minted@def@optfv{stepnumber}
+ \minted@def@optfv{firstline}
+ \minted@def@optfv{lastline}
+ \minted@def@optfv{baselinestretch}
+ \minted@def@optfv{xleftmargin}
+ \minted@def@optfv{xrightmargin}
+ \minted@def@optfv{fillcolor}
+ \minted@def@optfv{tabsize}
+ \minted@def@optfv{fontfamily}
+ \minted@def@optfv{fontsize}
+ \minted@def@optfv{fontshape}
+ \minted@def@optfv{fontseries}
+ \minted@def@optfv{formatcom}
+ \minted@def@optfv{label}
+ \minted@def@optfv{labelposition}
+ \minted@def@optfv{highlightlines}
+ \minted@def@optfv{highlightcolor}
+ \minted@def@optfv{space}
+ \minted@def@optfv{spacecolor}
+ \minted@def@optfv{tab}
+ \minted@def@optfv{tabcolor}
+ \minted@def@optfv{highlightcolor}
+ \minted@def@optfv@switch{beameroverlays}
+ \minted@def@optfv@switch{curlyquotes}
+ \minted@def@optfv@switch{numberfirstline}
+ \minted@def@optfv@switch{numberblanklines}
+ \minted@def@optfv@switch{stepnumberfromfirst}
+ \minted@def@optfv@switch{stepnumberoffsetvalues}
+ \minted@def@optfv@switch{showspaces}
+ \minted@def@optfv@switch{resetmargins}
+ \minted@def@optfv@switch{samepage}
+ \minted@def@optfv@switch{showtabs}
+ \minted@def@optfv@switch{obeytabs}
+ \minted@def@optfv@switch{breaklines}
+ \minted@def@optfv@switch{breakbytoken}
+ \minted@def@optfv@switch{breakbytokenanywhere}
+ \minted@def@optfv{breakindent}
+ \minted@def@optfv{breakindentnchars}
+ \minted@def@optfv@switch{breakautoindent}
+ \minted@def@optfv{breaksymbol}
+ \minted@def@optfv{breaksymbolsep}
+ \minted@def@optfv{breaksymbolsepnchars}
+ \minted@def@optfv{breaksymbolindent}
+ \minted@def@optfv{breaksymbolindentnchars}
+ \minted@def@optfv{breaksymbolleft}
+ \minted@def@optfv{breaksymbolsepleft}
+ \minted@def@optfv{breaksymbolsepleftnchars}
+ \minted@def@optfv{breaksymbolindentleft}
+ \minted@def@optfv{breaksymbolindentleftnchars}
+ \minted@def@optfv{breaksymbolright}
+ \minted@def@optfv{breaksymbolsepright}
+ \minted@def@optfv{breaksymbolseprightnchars}
+ \minted@def@optfv{breaksymbolindentright}
+ \minted@def@optfv{breaksymbolindentrightnchars}
+ \minted@def@optfv{breakbefore}
+ \minted@def@optfv{breakbeforesymbolpre}
+ \minted@def@optfv{breakbeforesymbolpost}
+ \minted@def@optfv@switch{breakbeforegroup}
+ \minted@def@optfv{breakafter}
+ \minted@def@optfv@switch{breakaftergroup}
+ \minted@def@optfv{breakaftersymbolpre}
+ \minted@def@optfv{breakaftersymbolpost}
+ \minted@def@optfv@switch{breakanywhere}
+ \minted@def@optfv{breakanywheresymbolpre}
+ \minted@def@optfv{breakanywheresymbolpost}
+ \minted@def@opt{bgcolor}
+ \minted@def@opt@switch{autogobble}
+ \newcommand{\minted@encoding}{\minted@get@opt{encoding}{UTF8}}
703
+ \newenvironment{minted@snugshade*}[1]{%
704
+ \def\FrameCommand##1{\hskip\@totalleftmargin
705
+ \colorbox{#1}{##1}%
706
+ \hskip-\linewidth \hskip-\@totalleftmargin \hskip\columnwidth}%
707
+ \MakeFramed{\advance\hsize-\width
708
+ \@totalleftmargin\z@ \linewidth\hsize
709
+ \advance\labelsep\fboxsep
710
+ \@setminipage}%
711
+ }{\par\unskip\@minipagefalse\endMakeFramed}
712
+ \newsavebox{\minted@bgbox}
713
+ \newenvironment{minted@colorbg}[1]{%
714
+ \setlength{\OuterFrameSep}{0pt}%
715
+ \let\minted@tmp\FV@NumberSep
716
+ \edef\FV@NumberSep{%
717
+ \the\numexpr\dimexpr\minted@tmp+\number\fboxsep\relax sp\relax}%
718
+ \medskip
719
+ \begin{minted@snugshade*}{#1}}
720
+ {\end{minted@snugshade*}%
721
+ \medskip\noindent}
722
+ \newwrite\minted@code
723
+ \newcommand{\minted@savecode}[1]{
724
+ \immediate\openout\minted@code\minted@jobname.pyg\relax
725
+ \immediate\write\minted@code{\expandafter\detokenize\expandafter{#1}}%
726
+ \immediate\closeout\minted@code}
727
+ \newcounter{minted@FancyVerbLineTemp}
728
+ \newcommand{\minted@write@detok}[1]{%
729
+ \immediate\write\FV@OutFile{\detokenize{#1}}}
730
+ \newcommand{\minted@FVB@VerbatimOut}[1]{%
731
+ \setcounter{minted@FancyVerbLineTemp}{\value{FancyVerbLine}}%
732
+ \@bsphack
733
+ \begingroup
734
+ \FV@UseKeyValues
735
+ \FV@DefineWhiteSpace
736
+ \def\FV@Space{\space}%
737
+ \FV@DefineTabOut
738
+ \let\FV@ProcessLine\minted@write@detok
739
+ \immediate\openout\FV@OutFile #1\relax
740
+ \let\FV@FontScanPrep\relax
741
+ \let\@noligs\relax
742
+ \FV@Scan}
743
+ \newcommand{\minted@FVE@VerbatimOut}{%
744
+ \immediate\closeout\FV@OutFile\endgroup\@esphack
745
+ \setcounter{FancyVerbLine}{\value{minted@FancyVerbLineTemp}}}%
746
+ \ifcsname MintedPygmentize\endcsname\else
747
+ \newcommand{\MintedPygmentize}{pygmentize}
748
+ \fi
749
+ \newcounter{minted@pygmentizecounter}
750
+ \newcommand{\minted@pygmentize}[2][\minted@outputdir\minted@jobname.pyg]{%
751
+ \minted@checkstyle{\minted@get@opt{style}{default}}%
752
+ \stepcounter{minted@pygmentizecounter}%
753
+ \ifthenelse{\equal{\minted@get@opt{autogobble}{false}}{true}}%
754
+ {\def\minted@codefile{\minted@outputdir\minted@jobname.pyg}}%
755
+ {\def\minted@codefile{#1}}%
756
+ \ifthenelse{\boolean{minted@isinline}}%
757
+ {\def\minted@optlistcl@inlines{%
758
+ \minted@optlistcl@g@i
759
+ \csname minted@optlistcl@lang\minted@lang @i\endcsname}}%
760
+ {\let\minted@optlistcl@inlines\@empty}%
761
+ \def\minted@cmd{%
762
+ \ifminted@kpsewhich
763
+ \ifwindows
764
+ \detokenize{for /f "usebackq tokens=*"}\space\@percentchar\detokenize{a in (`kpsewhich}\space\minted@codefile\detokenize{`) do}\space
765
+ \fi
766
+ \fi
767
+ \MintedPygmentize\space -l #2
768
+ -f latex -P commandprefix=PYG -F tokenmerge
769
+ \minted@optlistcl@g \csname minted@optlistcl@lang\minted@lang\endcsname
770
+ \minted@optlistcl@inlines
771
+ \minted@optlistcl@cmd -o \minted@outputdir\minted@infile\space
772
+ \ifminted@kpsewhich
773
+ \ifwindows
774
+ \@percentchar\detokenize{a}%
775
+ \else
776
+ \detokenize{`}kpsewhich \minted@codefile\space
777
+ \detokenize{||} \minted@codefile\detokenize{`}%
778
+ \fi
779
+ \else
780
+ \minted@codefile
781
+ \fi}%
782
+ % For debugging, uncomment: %%%%
783
+ % \immediate\typeout{\minted@cmd}%
784
+ % %%%%
785
+ \ifthenelse{\boolean{minted@cache}}%
786
+ {%
787
+ \ifminted@frozencache
788
+ \else
789
+ \ifx\XeTeXinterchartoks\minted@undefined
790
+ \ifthenelse{\equal{\minted@get@opt{autogobble}{false}}{true}}%
791
+ {\edef\minted@hash{\pdf@filemdfivesum{#1}%
792
+ \pdf@mdfivesum{\minted@cmd autogobble(\ifx\FancyVerbStartNum\z@ 0\else\FancyVerbStartNum\fi-\ifx\FancyVerbStopNum\z@ 0\else\FancyVerbStopNum\fi)}}}%
793
+ {\edef\minted@hash{\pdf@filemdfivesum{#1}%
794
+ \pdf@mdfivesum{\minted@cmd}}}%
795
+ \else
796
+ \ifx\mdfivesum\minted@undefined
797
+ \immediate\openout\minted@code\minted@jobname.mintedcmd\relax
798
+ \immediate\write\minted@code{\minted@cmd}%
799
+ \ifthenelse{\equal{\minted@get@opt{autogobble}{false}}{true}}%
800
+ {\immediate\write\minted@code{autogobble(\ifx\FancyVerbStartNum\z@ 0\else\FancyVerbStartNum\fi-\ifx\FancyVerbStopNum\z@ 0\else\FancyVerbStopNum\fi)}}{}%
801
+ \immediate\closeout\minted@code
802
+ \edef\minted@argone@esc{#1}%
803
+ \StrSubstitute{\minted@argone@esc}{\@backslashchar}{\@backslashchar\@backslashchar}[\minted@argone@esc]%
804
+ \StrSubstitute{\minted@argone@esc}{"}{\@backslashchar"}[\minted@argone@esc]%
805
+ \edef\minted@tmpfname@esc{\minted@outputdir\minted@jobname}%
806
+ \StrSubstitute{\minted@tmpfname@esc}{\@backslashchar}{\@backslashchar\@backslashchar}[\minted@tmpfname@esc]%
807
+ \StrSubstitute{\minted@tmpfname@esc}{"}{\@backslashchar"}[\minted@tmpfname@esc]%
808
+ %Cheating a little here by using ASCII codes to write `{` and `}`
809
+ %in the Python code
810
+ \def\minted@hashcmd{%
811
+ \detokenize{python -c "import hashlib; import os;
812
+ hasher = hashlib.sha1();
813
+ f = open(os.path.expanduser(os.path.expandvars(\"}\minted@tmpfname@esc.mintedcmd\detokenize{\")), \"rb\");
814
+ hasher.update(f.read());
815
+ f.close();
816
+ f = open(os.path.expanduser(os.path.expandvars(\"}\minted@argone@esc\detokenize{\")), \"rb\");
817
+ hasher.update(f.read());
818
+ f.close();
819
+ f = open(os.path.expanduser(os.path.expandvars(\"}\minted@tmpfname@esc.mintedmd5\detokenize{\")), \"w\");
820
+ macro = \"\\edef\\minted@hash\" + chr(123) + hasher.hexdigest() + chr(125) + \"\";
821
+ f.write(\"\\makeatletter\" + macro + \"\\makeatother\\endinput\n\");
822
+ f.close();"}}%
823
+ \ShellEscape{\minted@hashcmd}%
824
+ \minted@input{\minted@outputdir\minted@jobname.mintedmd5}%
825
+ \else
826
+ \ifthenelse{\equal{\minted@get@opt{autogobble}{false}}{true}}%
827
+ {\edef\minted@hash{\mdfivesum file {#1}%
828
+ \mdfivesum{\minted@cmd autogobble(\ifx\FancyVerbStartNum\z@ 0\else\FancyVerbStartNum\fi-\ifx\FancyVerbStopNum\z@ 0\else\FancyVerbStopNum\fi)}}}%
829
+ {\edef\minted@hash{\mdfivesum file {#1}%
830
+ \mdfivesum{\minted@cmd}}}%
831
+ \fi
832
+ \fi
833
+ \edef\minted@infile{\minted@cachedir/\minted@hash.pygtex}%
834
+ \IfFileExists{\minted@infile}{}{%
835
+ \ifthenelse{\equal{\minted@get@opt{autogobble}{false}}{true}}{%
836
+ \minted@autogobble{#1}}{}%
837
+ \ShellEscape{\minted@cmd}}%
838
+ \fi
839
+ \ifthenelse{\boolean{minted@finalizecache}}%
840
+ {%
841
+ \edef\minted@cachefilename{listing\arabic{minted@pygmentizecounter}.pygtex}%
842
+ \edef\minted@actualinfile{\minted@cachedir/\minted@cachefilename}%
843
+ \ifwindows
844
+ \StrSubstitute{\minted@infile}{/}{\@backslashchar}[\minted@infile@windows]
845
+ \StrSubstitute{\minted@actualinfile}{/}{\@backslashchar}[\minted@actualinfile@windows]
846
+ \ShellEscape{move /y \minted@outputdir\minted@infile@windows\space\minted@outputdir\minted@actualinfile@windows}%
847
+ \else
848
+ \ShellEscape{mv -f \minted@outputdir\minted@infile\space\minted@outputdir\minted@actualinfile}%
849
+ \fi
850
+ \let\minted@infile\minted@actualinfile
851
+ \expandafter\minted@addcachefile\expandafter{\minted@cachefilename}%
852
+ }%
853
+ {\ifthenelse{\boolean{minted@frozencache}}%
854
+ {%
855
+ \edef\minted@cachefilename{listing\arabic{minted@pygmentizecounter}.pygtex}%
856
+ \edef\minted@infile{\minted@cachedir/\minted@cachefilename}%
857
+ \expandafter\minted@addcachefile\expandafter{\minted@cachefilename}}%
858
+ {\expandafter\minted@addcachefile\expandafter{\minted@hash.pygtex}}%
859
+ }%
860
+ \minted@inputpyg}%
861
+ {%
862
+ \ifthenelse{\equal{\minted@get@opt{autogobble}{false}}{true}}{%
863
+ \minted@autogobble{#1}}{}%
864
+ \ShellEscape{\minted@cmd}%
865
+ \minted@inputpyg}%
866
+ }
867
+ \def\minted@autogobble#1{%
868
+ \edef\minted@argone@esc{#1}%
869
+ \StrSubstitute{\minted@argone@esc}{\@backslashchar}{\@backslashchar\@backslashchar}[\minted@argone@esc]%
870
+ \StrSubstitute{\minted@argone@esc}{"}{\@backslashchar"}[\minted@argone@esc]%
871
+ \edef\minted@tmpfname@esc{\minted@outputdir\minted@jobname}%
872
+ \StrSubstitute{\minted@tmpfname@esc}{\@backslashchar}{\@backslashchar\@backslashchar}[\minted@tmpfname@esc]%
873
+ \StrSubstitute{\minted@tmpfname@esc}{"}{\@backslashchar"}[\minted@tmpfname@esc]%
874
+ %Need a version of open() that supports encoding under Python 2
875
+ \edef\minted@autogobblecmd{%
876
+ \ifminted@kpsewhich
877
+ \ifwindows
878
+ \detokenize{for /f "usebackq tokens=*" }\@percentchar\detokenize{a in (`kpsewhich} #1\detokenize{`) do}\space
879
+ \fi
880
+ \fi
881
+ \detokenize{python -c "import sys; import os;
882
+ import textwrap;
883
+ from io import open;
884
+ fname = }%
885
+ \ifminted@kpsewhich
886
+ \detokenize{sys.argv[1];}\space%
887
+ \else
888
+ \detokenize{os.path.expanduser(os.path.expandvars(\"}\minted@argone@esc\detokenize{\"));}\space%
889
+ \fi
890
+ \detokenize{f = open(fname, \"r\", encoding=\"}\minted@encoding\detokenize{\") if os.path.isfile(fname) else None;
891
+ t = f.readlines() if f is not None else None;
892
+ t_opt = t if t is not None else [];
893
+ f.close() if f is not None else None;
894
+ tmpfname = os.path.expanduser(os.path.expandvars(\"}\minted@tmpfname@esc.pyg\detokenize{\"));
895
+ f = open(tmpfname, \"w\", encoding=\"}\minted@encoding\detokenize{\") if t is not None else None;
896
+ fvstartnum = }\ifx\FancyVerbStartNum\z@ 0\else\FancyVerbStartNum\fi\detokenize{;
897
+ fvstopnum = }\ifx\FancyVerbStopNum\z@ 0\else\FancyVerbStopNum\fi\detokenize{;
898
+ s = fvstartnum-1 if fvstartnum != 0 else 0;
899
+ e = fvstopnum if fvstopnum != 0 else len(t_opt);
900
+ [f.write(textwrap.dedent(\"\".join(x))) for x in (t_opt[0:s], t_opt[s:e], t_opt[e:]) if x and t is not None];
901
+ f.close() if t is not None else os.remove(tmpfname);"}%
902
+ \ifminted@kpsewhich
903
+ \ifwindows
904
+ \space\@percentchar\detokenize{a}%
905
+ \else
906
+ \space\detokenize{`}kpsewhich #1\space\detokenize{||} #1\detokenize{`}%
907
+ \fi
908
+ \fi
909
+ }%
910
+ \ShellEscape{\minted@autogobblecmd}%
911
+ }
912
+ \newcommand{\minted@inputpyg}{%
913
+ \expandafter\let\expandafter\minted@PYGstyle%
914
+ \csname PYG\minted@get@opt{style}{default}\endcsname
915
+ \VerbatimPygments{\PYG}{\minted@PYGstyle}%
916
+ \ifthenelse{\boolean{minted@isinline}}%
917
+ {\ifthenelse{\equal{\minted@get@opt{breaklines}{false}}{true}}%
918
+ {\let\FV@BeginVBox\relax
919
+ \let\FV@EndVBox\relax
920
+ \def\FV@BProcessLine##1{\FancyVerbFormatLine{##1}}%
921
+ \minted@inputpyg@inline}%
922
+ {\minted@inputpyg@inline}}%
923
+ {\minted@inputpyg@block}%
924
+ }
925
+ \def\minted@inputpyg@inline{%
926
+ \ifthenelse{\equal{\minted@get@opt{bgcolor}{}}{}}%
927
+ {\minted@input{\minted@outputdir\minted@infile}}%
928
+ {\colorbox{\minted@get@opt{bgcolor}{}}{%
929
+ \minted@input{\minted@outputdir\minted@infile}}}%
930
+ }
931
+ \def\minted@inputpyg@block{%
932
+ \ifthenelse{\equal{\minted@get@opt{bgcolor}{}}{}}%
933
+ {\minted@input{\minted@outputdir\minted@infile}}%
934
+ {\begin{minted@colorbg}{\minted@get@opt{bgcolor}{}}%
935
+ \minted@input{\minted@outputdir\minted@infile}%
936
+ \end{minted@colorbg}}}
937
+ \newcommand{\minted@langlinenoson}{%
938
+ \ifcsname c@minted@lang\minted@lang\endcsname\else
939
+ \newcounter{minted@lang\minted@lang}%
940
+ \fi
941
+ \setcounter{minted@FancyVerbLineTemp}{\value{FancyVerbLine}}%
942
+ \setcounter{FancyVerbLine}{\value{minted@lang\minted@lang}}%
943
+ }
944
+ \newcommand{\minted@langlinenosoff}{%
945
+ \setcounter{minted@lang\minted@lang}{\value{FancyVerbLine}}%
946
+ \setcounter{FancyVerbLine}{\value{minted@FancyVerbLineTemp}}%
947
+ }
948
+ \ifthenelse{\boolean{minted@langlinenos}}{}{%
949
+ \let\minted@langlinenoson\relax
950
+ \let\minted@langlinenosoff\relax
951
+ }
952
+ \newcommand{\setminted}[2][]{%
953
+ \ifthenelse{\equal{#1}{}}%
954
+ {\setkeys{minted@opt@g}{#2}}%
955
+ {\minted@configlang{#1}%
956
+ \setkeys{minted@opt@lang}{#2}}}
957
+ \newcommand{\setmintedinline}[2][]{%
958
+ \ifthenelse{\equal{#1}{}}%
959
+ {\setkeys{minted@opt@g@i}{#2}}%
960
+ {\minted@configlang{#1}%
961
+ \setkeys{minted@opt@lang@i}{#2}}}
962
+ \setmintedinline[php]{startinline=true}
963
+ \setminted{tabcolor=black}
964
+ \newcommand{\usemintedstyle}[2][]{\setminted[#1]{style=#2}}
965
+ \begingroup
966
+ \catcode`\ =\active
967
+ \catcode`\^^I=\active
968
+ \gdef\minted@defwhitespace@retok{\def {\noexpand\FV@Space}\def^^I{\noexpand\FV@Tab}}%
969
+ \endgroup
970
+ \newcommand{\minted@writecmdcode}[1]{%
971
+ \immediate\openout\minted@code\minted@jobname.pyg\relax
972
+ \immediate\write\minted@code{\detokenize{#1}}%
973
+ \immediate\closeout\minted@code}
974
+ \newrobustcmd{\mintinline}[2][]{%
975
+ \begingroup
976
+ \setboolean{minted@isinline}{true}%
977
+ \minted@configlang{#2}%
978
+ \setkeys{minted@opt@cmd}{#1}%
979
+ \minted@fvset
980
+ \begingroup
981
+ \let\do\@makeother\dospecials
982
+ \catcode`\{=1
983
+ \catcode`\}=2
984
+ \catcode`\^^I=\active
985
+ \@ifnextchar\bgroup
986
+ {\minted@inline@iii}%
987
+ {\catcode`\{=12\catcode`\}=12
988
+ \minted@inline@i}}
989
+ \def\minted@inline@i#1{%
990
+ \endgroup
991
+ \def\minted@inline@ii##1#1{%
992
+ \minted@inline@iii{##1}}%
993
+ \begingroup
994
+ \let\do\@makeother\dospecials
995
+ \catcode`\^^I=\active
996
+ \minted@inline@ii}
997
+ \ifthenelse{\boolean{minted@draft}}%
998
+ {\newcommand{\minted@inline@iii}[1]{%
999
+ \endgroup
1000
+ \begingroup
1001
+ \minted@defwhitespace@retok
1002
+ \everyeof{\noexpand}%
1003
+ \endlinechar-1\relax
1004
+ \let\do\@makeother\dospecials
1005
+ \catcode`\ =\active
1006
+ \catcode`\^^I=\active
1007
+ \xdef\minted@tmp{\scantokens{#1}}%
1008
+ \endgroup
1009
+ \let\FV@Line\minted@tmp
1010
+ \def\FV@SV@minted@tmp{%
1011
+ \FV@Gobble
1012
+ \expandafter\FV@ProcessLine\expandafter{\FV@Line}}%
1013
+ \ifthenelse{\equal{\minted@get@opt{breaklines}{false}}{true}}%
1014
+ {\let\FV@BeginVBox\relax
1015
+ \let\FV@EndVBox\relax
1016
+ \def\FV@BProcessLine##1{\FancyVerbFormatLine{##1}}%
1017
+ \BUseVerbatim{minted@tmp}}%
1018
+ {\BUseVerbatim{minted@tmp}}%
1019
+ \endgroup}}%
1020
+ {\newcommand{\minted@inline@iii}[1]{%
1021
+ \endgroup
1022
+ \minted@writecmdcode{#1}%
1023
+ \RecustomVerbatimEnvironment{Verbatim}{BVerbatim}{}%
1024
+ \setcounter{minted@FancyVerbLineTemp}{\value{FancyVerbLine}}%
1025
+ \minted@pygmentize{\minted@lang}%
1026
+ \setcounter{FancyVerbLine}{\value{minted@FancyVerbLineTemp}}%
1027
+ \endgroup}}
1028
+ \newrobustcmd{\mint}[2][]{%
1029
+ \begingroup
1030
+ \minted@configlang{#2}%
1031
+ \setkeys{minted@opt@cmd}{#1}%
1032
+ \minted@fvset
1033
+ \begingroup
1034
+ \let\do\@makeother\dospecials
1035
+ \catcode`\{=1
1036
+ \catcode`\}=2
1037
+ \catcode`\^^I=\active
1038
+ \@ifnextchar\bgroup
1039
+ {\mint@iii}%
1040
+ {\catcode`\{=12\catcode`\}=12
1041
+ \mint@i}}
1042
+ \def\mint@i#1{%
1043
+ \endgroup
1044
+ \def\mint@ii##1#1{%
1045
+ \mint@iii{##1}}%
1046
+ \begingroup
1047
+ \let\do\@makeother\dospecials
1048
+ \catcode`\^^I=\active
1049
+ \mint@ii}
1050
+ \ifthenelse{\boolean{minted@draft}}%
1051
+ {\newcommand{\mint@iii}[1]{%
1052
+ \endgroup
1053
+ \begingroup
1054
+ \minted@defwhitespace@retok
1055
+ \everyeof{\noexpand}%
1056
+ \endlinechar-1\relax
1057
+ \let\do\@makeother\dospecials
1058
+ \catcode`\ =\active
1059
+ \catcode`\^^I=\active
1060
+ \xdef\minted@tmp{\scantokens{#1}}%
1061
+ \endgroup
1062
+ \let\FV@Line\minted@tmp
1063
+ \def\FV@SV@minted@tmp{%
1064
+ \FV@CodeLineNo=1\FV@StepLineNo
1065
+ \FV@Gobble
1066
+ \expandafter\FV@ProcessLine\expandafter{\FV@Line}}%
1067
+ \minted@langlinenoson
1068
+ \UseVerbatim{minted@tmp}%
1069
+ \minted@langlinenosoff
1070
+ \endgroup}}%
1071
+ {\newcommand{\mint@iii}[1]{%
1072
+ \endgroup
1073
+ \minted@writecmdcode{#1}%
1074
+ \minted@langlinenoson
1075
+ \minted@pygmentize{\minted@lang}%
1076
+ \minted@langlinenosoff
1077
+ \endgroup}}
1078
+ \ifthenelse{\boolean{minted@draft}}%
1079
+ {\newenvironment{minted}[2][]
1080
+ {\VerbatimEnvironment
1081
+ \minted@configlang{#2}%
1082
+ \setkeys{minted@opt@cmd}{#1}%
1083
+ \minted@fvset
1084
+ \minted@langlinenoson
1085
+ \begin{Verbatim}}%
1086
+ {\end{Verbatim}%
1087
+ \minted@langlinenosoff}}%
1088
+ {\newenvironment{minted}[2][]
1089
+ {\VerbatimEnvironment
1090
+ \let\FVB@VerbatimOut\minted@FVB@VerbatimOut
1091
+ \let\FVE@VerbatimOut\minted@FVE@VerbatimOut
1092
+ \minted@configlang{#2}%
1093
+ \setkeys{minted@opt@cmd}{#1}%
1094
+ \minted@fvset
1095
+ \begin{VerbatimOut}[codes={\catcode`\^^I=12},firstline,lastline]{\minted@jobname.pyg}}%
1096
+ {\end{VerbatimOut}%
1097
+ \minted@langlinenoson
1098
+ \minted@pygmentize{\minted@lang}%
1099
+ \minted@langlinenosoff}}
1100
+ \ifthenelse{\boolean{minted@draft}}%
1101
+ {\newcommand{\inputminted}[3][]{%
1102
+ \begingroup
1103
+ \minted@configlang{#2}%
1104
+ \setkeys{minted@opt@cmd}{#1}%
1105
+ \minted@fvset
1106
+ \VerbatimInput{#3}%
1107
+ \endgroup}}%
1108
+ {\newcommand{\inputminted}[3][]{%
1109
+ \begingroup
1110
+ \minted@configlang{#2}%
1111
+ \setkeys{minted@opt@cmd}{#1}%
1112
+ \minted@fvset
1113
+ \minted@pygmentize[#3]{#2}%
1114
+ \endgroup}}
1115
+ \newcommand{\newminted}[3][]{
1116
+ \ifthenelse{\equal{#1}{}}
1117
+ {\def\minted@envname{#2code}}
1118
+ {\def\minted@envname{#1}}
1119
+ \newenvironment{\minted@envname}
1120
+ {\VerbatimEnvironment
1121
+ \begin{minted}[#3]{#2}}
1122
+ {\end{minted}}
1123
+ \newenvironment{\minted@envname *}[1]
1124
+ {\VerbatimEnvironment\begin{minted}[#3,##1]{#2}}
1125
+ {\end{minted}}}
1126
+ \newcommand{\newmint}[3][]{
1127
+ \ifthenelse{\equal{#1}{}}
1128
+ {\def\minted@shortname{#2}}
1129
+ {\def\minted@shortname{#1}}
1130
+ \expandafter\newcommand\csname\minted@shortname\endcsname[2][]{
1131
+ \mint[#3,##1]{#2}##2}}
1132
+ \newcommand{\newmintedfile}[3][]{
1133
+ \ifthenelse{\equal{#1}{}}
1134
+ {\def\minted@shortname{#2file}}
1135
+ {\def\minted@shortname{#1}}
1136
+ \expandafter\newcommand\csname\minted@shortname\endcsname[2][]{
1137
+ \inputminted[#3,##1]{#2}{##2}}}
1138
+ \newcommand{\newmintinline}[3][]{%
1139
+ \ifthenelse{\equal{#1}{}}%
1140
+ {\def\minted@shortname{#2inline}}%
1141
+ {\def\minted@shortname{#1}}%
1142
+ \expandafter\newrobustcmd\csname\minted@shortname\endcsname{%
1143
+ \begingroup
1144
+ \let\do\@makeother\dospecials
1145
+ \catcode`\{=1
1146
+ \catcode`\}=2
1147
+ \@ifnextchar[{\endgroup\minted@inliner[#3][#2]}%
1148
+ {\endgroup\minted@inliner[#3][#2][]}}%
1149
+ \def\minted@inliner[##1][##2][##3]{\mintinline[##1,##3]{##2}}%
1150
+ }
1151
+ \ifthenelse{\boolean{minted@newfloat}}%
1152
+ {\@ifundefined{minted@float@within}%
1153
+ {\DeclareFloatingEnvironment[fileext=lol,placement=tbp]{listing}}%
1154
+ {\def\minted@tmp#1{%
1155
+ \DeclareFloatingEnvironment[fileext=lol,placement=tbp, within=#1]{listing}}%
1156
+ \expandafter\minted@tmp\expandafter{\minted@float@within}}}%
1157
+ {\@ifundefined{minted@float@within}%
1158
+ {\newfloat{listing}{tbp}{lol}}%
1159
+ {\newfloat{listing}{tbp}{lol}[\minted@float@within]}}
1160
+ \ifminted@newfloat\else
1161
+ \newcommand{\listingscaption}{Listing}
1162
+ \floatname{listing}{\listingscaption}
1163
+ \newcommand{\listoflistingscaption}{List of Listings}
1164
+ \providecommand{\listoflistings}{\listof{listing}{\listoflistingscaption}}
1165
+ \fi
1166
+ \AtEndOfPackage{%
1167
+ \ifthenelse{\boolean{minted@draft}}%
1168
+ {}%
1169
+ {%
1170
+ \ifthenelse{\boolean{minted@frozencache}}{}{%
1171
+ \ifnum\pdf@shellescape=1\relax\else
1172
+ \PackageError{minted}%
1173
+ {You must invoke LaTeX with the
1174
+ -shell-escape flag}%
1175
+ {Pass the -shell-escape flag to LaTeX. Refer to the minted.sty
1176
+ documentation for more information.}%
1177
+ \fi}%
1178
+ }%
1179
+ }
1180
+ \AtEndPreamble{%
1181
+ \ifthenelse{\boolean{minted@draft}}%
1182
+ {}%
1183
+ {%
1184
+ \ifthenelse{\boolean{minted@frozencache}}{}{%
1185
+ \TestAppExists{\MintedPygmentize}%
1186
+ \ifAppExists\else
1187
+ \PackageError{minted}%
1188
+ {You must have `pygmentize' installed
1189
+ to use this package}%
1190
+ {Refer to the installation instructions in the minted
1191
+ documentation for more information.}%
1192
+ \fi}%
1193
+ }%
1194
+ }
1195
+ \AfterEndDocument{%
1196
+ \ifthenelse{\boolean{minted@draft}}%
1197
+ {}%
1198
+ {\ifthenelse{\boolean{minted@frozencache}}%
1199
+ {}
1200
+ {\ifx\XeTeXinterchartoks\minted@undefined
1201
+ \else
1202
+ \DeleteFile[\minted@outputdir]{\minted@jobname.mintedcmd}%
1203
+ \DeleteFile[\minted@outputdir]{\minted@jobname.mintedmd5}%
1204
+ \fi
1205
+ \DeleteFile[\minted@outputdir]{\minted@jobname.pyg}%
1206
+ \DeleteFile[\minted@outputdir]{\minted@jobname.out.pyg}%
1207
+ }%
1208
+ }%
1209
+ }
1210
+ \endinput
1211
+ %%
1212
+ %% End of file `minted.sty'.
references/2021.naacl.nguyen/source/naacl2021.bbl ADDED
@@ -0,0 +1,180 @@
+ \begin{thebibliography}{29}
+ \expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi
+
+ \bibitem[{Bang and Sornlertlamvanich(2018)}]{BANG20182016IIP0038}
+ Tran~Sy Bang and Virach Sornlertlamvanich. 2018.
+ \newblock Sentiment classification for hotel booking review based on sentence
+ dependency structure and sub-opinion analysis.
+ \newblock \emph{IEICE Transactions on Information and Systems},
+ E101.D(4):909--916.
+
+ \bibitem[{Chu and Liu(1965)}]{chuliu}
+ Yoeng-Jin Chu and Tseng-Hong Liu. 1965.
+ \newblock {On the Shortest Arborescence of a Directed Graph}.
+ \newblock \emph{Science Sinica}, 14:1396--1400.
+
+ \bibitem[{Devlin et~al.(2019)Devlin, Chang, Lee, and
+ Toutanova}]{devlin-etal-2019-bert}
+ Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019.
+ \newblock {BERT}: Pre-training of deep bidirectional transformers for language
+ understanding.
+ \newblock In \emph{Proceedings of NAACL}, pages 4171--4186.
+
+ \bibitem[{Dozat and Manning(2017)}]{DozatM17}
+ Timothy Dozat and Christopher~D. Manning. 2017.
+ \newblock {Deep Biaffine Attention for Neural Dependency Parsing}.
+ \newblock In \emph{Proceedings of ICLR}.
+
+ \bibitem[{Edmonds(1967)}]{Edmonds}
+ Jack Edmonds. 1967.
+ \newblock {Optimum Branchings}.
+ \newblock \emph{Journal of Research of the National Bureau of Standards},
+ 71:233--240.
+
+ \bibitem[{Hashimoto et~al.(2017)Hashimoto, Xiong, Tsuruoka, and
+ Socher}]{hashimoto-etal-2017-joint}
+ Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. 2017.
+ \newblock {A Joint Many-Task Model: Growing a Neural Network for Multiple {NLP}
+ Tasks}.
+ \newblock In \emph{Proceedings of EMNLP}, pages 1923--1933.
+
+ \bibitem[{Kondratyuk and Straka(2019)}]{kondratyuk-straka-2019-75}
+ Dan Kondratyuk and Milan Straka. 2019.
+ \newblock {75 Languages, 1 Model: Parsing {U}niversal {D}ependencies
+ Universally}.
+ \newblock In \emph{Proceedings of EMNLP-IJCNLP}, pages 2779--2795.
+
+ \bibitem[{Lafferty et~al.(2001)Lafferty, McCallum, and Pereira}]{Lafferty:2001}
+ John~D. Lafferty, Andrew McCallum, and Fernando C.~N. Pereira. 2001.
+ \newblock {Conditional Random Fields: Probabilistic Models for Segmenting and
+ Labeling Sequence Data}.
+ \newblock In \emph{Proceedings of ICML}, pages 282--289.
+
+ \bibitem[{Le-Hong and Bui(2018)}]{3184558.3191535}
+ Phuong Le-Hong and Duc-Thien Bui. 2018.
+ \newblock {A Factoid Question Answering System for Vietnamese}.
+ \newblock In \emph{Companion Proceedings of the The Web Conference 2018}, page
+ 1049–1055.
+
+ \bibitem[{Li et~al.(2018)Li, He, Zhang, and Zhao}]{li-etal-2018-joint-learning}
+ Zuchao Li, Shexia He, Zhuosheng Zhang, and Hai Zhao. 2018.
+ \newblock {Joint Learning of {POS} and Dependencies for Multilingual
+ {U}niversal {D}ependency Parsing}.
+ \newblock In \emph{Proceedings of the {C}o{NLL} 2018 Shared Task}, pages
+ 65--73.
+
+ \bibitem[{Loshchilov and Hutter(2019)}]{loshchilov2018decoupled}
+ Ilya Loshchilov and Frank Hutter. 2019.
+ \newblock {Decoupled Weight Decay Regularization}.
+ \newblock In \emph{Proceedings of ICLR}.
+
+ \bibitem[{Nguyen et~al.(2020)Nguyen, Dao, and Nguyen}]{vitext2sql}
+ Anh~Tuan Nguyen, Mai~Hoang Dao, and Dat~Quoc Nguyen. 2020.
+ \newblock {A Pilot Study of Text-to-SQL Semantic Parsing for Vietnamese}.
+ \newblock In \emph{Findings of EMNLP 2020}, pages 4079--4085.
+
+ \bibitem[{Nguyen(2019)}]{NguyenALTA2019}
+ Dat~Quoc Nguyen. 2019.
+ \newblock {A neural joint model for Vietnamese word segmentation, POS tagging
+ and dependency parsing}.
+ \newblock In \emph{Proceedings of ALTA}, pages 28--34.
+
+ \bibitem[{Nguyen and Nguyen(2020)}]{phobert}
+ Dat~Quoc Nguyen and Anh~Tuan Nguyen. 2020.
+ \newblock {PhoBERT: Pre-trained language models for Vietnamese}.
+ \newblock In \emph{Findings of EMNLP 2020}, pages 1037--1042.
+
+ \bibitem[{Nguyen et~al.(2017)Nguyen, Nguyen, and Pham}]{NguyenNP_SWJ}
+ Dat~Quoc Nguyen, Dai~Quoc Nguyen, and Son~Bao Pham. 2017.
+ \newblock {Ripple Down Rules for Question Answering}.
+ \newblock \emph{Semantic Web}, 8(4):511--532.
+
+ \bibitem[{Nguyen et~al.(2014)Nguyen, Nguyen, Pham, Nguyen, and
+ Nguyen}]{Nguyen2014NLDB}
+ Dat~Quoc Nguyen, Dai~Quoc Nguyen, Son~Bao Pham, Phuong-Thai Nguyen, and Minh~Le
+ Nguyen. 2014.
+ \newblock {From Treebank Conversion to Automatic Dependency Parsing for
+ Vietnamese}.
+ \newblock In \emph{{Proceedings of NLDB}}, pages 196--207.
+
+ \bibitem[{Nguyen and Verspoor(2018)}]{nguyen-verspoor-2018-improved}
+ Dat~Quoc Nguyen and Karin Verspoor. 2018.
+ \newblock An improved neural network model for joint {POS} tagging and
+ dependency parsing.
+ \newblock In \emph{Proceedings of the {C}o{NLL} 2018 Shared Task}, pages
+ 81--91.
+
+ \bibitem[{Nguyen et~al.(2019)Nguyen, Ngo, Vu, Tran, and Nguyen}]{JCC13161}
+ Huyen Nguyen, Quyen Ngo, Luong Vu, Vu~Tran, and Hien Nguyen. 2019.
+ \newblock {VLSP Shared Task: Named Entity Recognition}.
+ \newblock \emph{Journal of Computer Science and Cybernetics}, 34(4):283--294.
+
+ \bibitem[{Nguyen et~al.(2009)Nguyen, Vu, Nguyen, Nguyen, and
+ Le}]{nguyen-etal-2009-building}
+ Phuong-Thai Nguyen, Xuan-Luong Vu, Thi-Minh-Huyen Nguyen, Van-Hiep Nguyen, and
+ Hong-Phuong Le. 2009.
+ \newblock {Building a Large Syntactically-Annotated Corpus of {V}ietnamese}.
+ \newblock In \emph{Proceedings of {LAW}}, pages 182--185.
+
+ \bibitem[{Paszke et~al.(2019)Paszke, Gross et~al.}]{NEURIPS2019_9015}
+ Adam Paszke, Sam Gross, et~al. 2019.
+ \newblock {PyTorch: An Imperative Style, High-Performance Deep Learning
+ Library}.
+ \newblock In \emph{Proceedings of NeurIPS 2019}, pages 8024--8035.
+
+ \bibitem[{Qi et~al.(2018)Qi, Dozat, Zhang, and
+ Manning}]{qi-etal-2018-universal}
+ Peng Qi, Timothy Dozat, Yuhao Zhang, and Christopher~D. Manning. 2018.
+ \newblock {U}niversal {D}ependency parsing from scratch.
+ \newblock In \emph{Proceedings of the {C}o{NLL} 2018 Shared Task}, pages
+ 160--170.
+
+ \bibitem[{Qi et~al.(2020)Qi, Zhang, Zhang, Bolton, and
+ Manning}]{qi-etal-2020-stanza}
+ Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher~D. Manning.
+ 2020.
+ \newblock {S}tanza: A python natural language processing toolkit for many human
+ languages.
+ \newblock In \emph{Proceedings of ACL: System Demonstrations}, pages 101--108.
+
+ \bibitem[{Ruder(2019)}]{Ruder2019Neural}
+ Sebastian Ruder. 2019.
+ \newblock \emph{Neural Transfer Learning for Natural Language Processing}.
+ \newblock Ph.D. thesis, National University of Ireland, Galway.
+
+ \bibitem[{Sennrich et~al.(2016)Sennrich, Haddow, and
+ Birch}]{sennrich-etal-2016-neural}
+ Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016.
+ \newblock {Neural Machine Translation of Rare Words with Subword Units}.
+ \newblock In \emph{Proceedings of ACL}, pages 1715--1725.
+
+ \bibitem[{To and Do(2020)}]{9287471}
+ Huong~Duong To and Phuc Do. 2020.
+ \newblock Extracting triples from vietnamese text to create knowledge graph.
+ \newblock In \emph{Proceedings of KSE}, pages 219--223.
+
+ \bibitem[{Tran et~al.(2016)Tran, Vu, Pham, Nguyen, and Nguyen}]{7800281}
+ Viet~Hong Tran, Huyen~Thuong Vu, Thu~Hoai Pham, Vinh~Van Nguyen, and Minh~Le
+ Nguyen. 2016.
+ \newblock {A reordering model for Vietnamese-English statistical machine
+ translation using dependency information}.
+ \newblock In \emph{Proceedings of RIVF}, pages 125--130.
+
+ \bibitem[{Truong et~al.(2017)Truong, Vo, and Nguyen}]{3155133.3155171}
+ Diem Truong, Duc-Thuan Vo, and Uyen~Trang Nguyen. 2017.
+ \newblock Vietnamese open information extraction.
+ \newblock In \emph{Proceedings of SoICT}, page 135–142.
+
+ \bibitem[{Vu et~al.(2018)Vu, Nguyen, Nguyen, Dras, and
+ Johnson}]{vu-etal-2018-vncorenlp}
+ Thanh Vu, Dat~Quoc Nguyen, Dai~Quoc Nguyen, Mark Dras, and Mark Johnson. 2018.
+ \newblock {VnCoreNLP: A Vietnamese Natural Language Processing Toolkit}.
+ \newblock In \emph{Proceedings of NAACL: Demonstrations}, pages 56--60.
+
+ \bibitem[{Wolf et~al.(2020)Wolf, Debut et~al.}]{wolf-etal-2020-transformers}
+ Thomas Wolf, Lysandre Debut, et~al. 2020.
+ \newblock {Transformers: State-of-the-Art Natural Language Processing}.
+ \newblock In \emph{Proceedings of EMNLP 2020: System Demonstrations}, pages
+ 38--45.
+
+ \end{thebibliography}
references/2021.naacl.nguyen/source/naacl2021.sty ADDED
@@ -0,0 +1,310 @@
+ % This is the LaTex style file for *ACL.
+ % The official sources can be found at
+ %
+ % https://github.com/acl-org/ACLPUB/
+ %
+ % This package is activated by adding
+ %
+ % \usepackage{acl}
+ %
+ % to your LaTeX file. When submitting your paper for review, add the "review" option:
+ %
+ % \usepackage[review]{acl}
+
+ \newif\ifacl@finalcopy
+ \DeclareOption{final}{\acl@finalcopytrue}
+ \DeclareOption{review}{\acl@finalcopyfalse}
+ \ExecuteOptions{final} % final copy is the default
+
+ % include hyperref, unless user specifies nohyperref option like this:
+ % \usepackage[nohyperref]{acl}
+ \newif\ifacl@hyperref
+ \DeclareOption{hyperref}{\acl@hyperreftrue}
+ \DeclareOption{nohyperref}{\acl@hyperreffalse}
+ \ExecuteOptions{hyperref} % default is to use hyperref
+ \ProcessOptions\relax
+
+ \typeout{Conference Style for ACL 2020}
+
+ \usepackage{xcolor}
+
+ \ifacl@hyperref
+ \PassOptionsToPackage{breaklinks}{hyperref}
+ \RequirePackage{hyperref}
+ % make links dark blue
+ \definecolor{darkblue}{rgb}{0, 0, 0.5}
+ \hypersetup{colorlinks=true, citecolor=darkblue, linkcolor=darkblue, urlcolor=darkblue}
+ \else
+ % This definition is used if the hyperref package is not loaded.
+ % It provides a backup, no-op definiton of \href.
+ % This is necessary because \href command is used in the acl_natbib.bst file.
+ \def\href#1#2{{#2}}
+ \usepackage{url}
+ \fi
+
+ \ifacl@finalcopy
+ % Hack to ignore these commands, which review mode puts into the .aux file.
+ \newcommand{\@LN@col}[1]{}
+ \newcommand{\@LN}[2]{}
+ \else
+ % Add draft line numbering via the lineno package
+ % https://texblog.org/2012/02/08/adding-line-numbers-to-documents/
+ \usepackage[switch,mathlines]{lineno}
+
+ % Line numbers in gray Helvetica 8pt
+ \font\aclhv = phvb at 8pt
+ \renewcommand\linenumberfont{\aclhv\color{lightgray}}
+
+ % Zero-fill line numbers
+ % NUMBER with left flushed zeros \fillzeros[<WIDTH>]<NUMBER>
+ \newcount\cv@tmpc@ \newcount\cv@tmpc
+ \def\fillzeros[#1]#2{\cv@tmpc@=#2\relax\ifnum\cv@tmpc@<0\cv@tmpc@=-\cv@tmpc@\fi
+ \cv@tmpc=1 %
+ \loop\ifnum\cv@tmpc@<10 \else \divide\cv@tmpc@ by 10 \advance\cv@tmpc by 1 \fi
+ \ifnum\cv@tmpc@=10\relax\cv@tmpc@=11\relax\fi \ifnum\cv@tmpc@>10 \repeat
+ \ifnum#2<0\advance\cv@tmpc1\relax-\fi
+ \loop\ifnum\cv@tmpc<#1\relax0\advance\cv@tmpc1\relax\fi \ifnum\cv@tmpc<#1 \repeat
+ \cv@tmpc@=#2\relax\ifnum\cv@tmpc@<0\cv@tmpc@=-\cv@tmpc@\fi \relax\the\cv@tmpc@}%
+ \renewcommand\thelinenumber{\fillzeros[3]{\arabic{linenumber}}}
+ \linenumbers
+
+ \setlength{\linenumbersep}{1.6cm}
+
+ % Bug: An equation with $$ ... $$ isn't numbered, nor is the previous line.
+
+ % Patch amsmath commands so that the previous line and the equation itself
+ % are numbered. Bug: multline has an extra line number.
+ % https://tex.stackexchange.com/questions/461186/how-to-use-lineno-with-amsmath-align
+ \usepackage{etoolbox} %% <- for \pretocmd, \apptocmd and \patchcmd
+
+ \newcommand*\linenomathpatch[1]{%
+ \expandafter\pretocmd\csname #1\endcsname {\linenomath}{}{}%
+ \expandafter\pretocmd\csname #1*\endcsname {\linenomath}{}{}%
+ \expandafter\apptocmd\csname end#1\endcsname {\endlinenomath}{}{}%
+ \expandafter\apptocmd\csname end#1*\endcsname {\endlinenomath}{}{}%
+ }
+ \newcommand*\linenomathpatchAMS[1]{%
+ \expandafter\pretocmd\csname #1\endcsname {\linenomathAMS}{}{}%
+ \expandafter\pretocmd\csname #1*\endcsname {\linenomathAMS}{}{}%
+ \expandafter\apptocmd\csname end#1\endcsname {\endlinenomath}{}{}%
+ \expandafter\apptocmd\csname end#1*\endcsname {\endlinenomath}{}{}%
+ }
+
+ %% Definition of \linenomathAMS depends on whether the mathlines option is provided
+ \expandafter\ifx\linenomath\linenomathWithnumbers
+ \let\linenomathAMS\linenomathWithnumbers
+ %% The following line gets rid of an extra line numbers at the bottom:
+ \patchcmd\linenomathAMS{\advance\postdisplaypenalty\linenopenalty}{}{}{}
+ \else
+ \let\linenomathAMS\linenomathNonumbers
+ \fi
+
+ \AtBeginDocument{%
+ \linenomathpatch{equation}%
+ \linenomathpatchAMS{gather}%
+ \linenomathpatchAMS{multline}%
+ \linenomathpatchAMS{align}%
+ \linenomathpatchAMS{alignat}%
+ \linenomathpatchAMS{flalign}%
+ }
+ \fi
+
+ \iffalse
+ \PassOptionsToPackage{
+ a4paper,
+ top=2.21573cm,left=2.54cm,
+ textheight=24.7cm,textwidth=16.0cm,
+ headheight=0.17573cm,headsep=0cm
+ }{geometry}
+ \fi
+ \PassOptionsToPackage{a4paper,margin=2.5cm}{geometry}
+ \RequirePackage{geometry}
+
+ \setlength\columnsep{0.6cm}
+ \newlength\titlebox
+ \setlength\titlebox{5cm}
+
+ \flushbottom \twocolumn \sloppy
+
+ % We're never going to need a table of contents, so just flush it to
+ % save space --- suggested by drstrip@sandia-2
+ \def\addcontentsline#1#2#3{}
+
+ \ifacl@finalcopy
+ \thispagestyle{empty}
+ \pagestyle{empty}
+ \else
+ \pagenumbering{arabic}
+ \fi
+
+ %% Title and Authors %%
141
+
142
+ \newcommand{\Thanks}[1]{\thanks{\ #1}}
143
+
144
+ \newcommand\outauthor{
145
+ \begin{tabular}[t]{c}
146
+ \ifacl@finalcopy
147
+ \bf\@author
148
+ \else
149
+ % Avoiding common accidental de-anonymization issue. --MM
150
+ \bf Anonymous NAACL-HLT 2021 submission
151
+ \fi
152
+ \end{tabular}}
153
+
154
+ % Mostly taken from deproc.
155
+ \def\maketitle{\par
156
+ \begingroup
157
+ \def\thefootnote{\fnsymbol{footnote}}
158
+ \def\@makefnmark{\hbox to 0pt{$^{\@thefnmark}$\hss}}
159
+ \twocolumn[\@maketitle] \@thanks
160
+ \endgroup
161
+ \setcounter{footnote}{0}
162
+ \let\maketitle\relax \let\@maketitle\relax
163
+ \gdef\@thanks{}\gdef\@author{}\gdef\@title{}\let\thanks\relax}
164
+ \def\@maketitle{\vbox to \titlebox{\hsize\textwidth
165
+ \linewidth\hsize \vskip 0.125in minus 0.125in \centering
166
+ {\Large\bf \@title \par} \vskip 0.2in plus 1fil minus 0.1in
167
+ {\def\and{\unskip\enspace{\rm and}\enspace}%
168
+ \def\And{\end{tabular}\hss \egroup \hskip 1in plus 2fil
169
+ \hbox to 0pt\bgroup\hss \begin{tabular}[t]{c}\bf}%
170
+ \def\AND{\end{tabular}\hss\egroup \hfil\hfil\egroup
171
+ \vskip 0.25in plus 1fil minus 0.125in
172
+ \hbox to \linewidth\bgroup\large \hfil\hfil
173
+ \hbox to 0pt\bgroup\hss \begin{tabular}[t]{c}\bf}
174
+ \hbox to \linewidth\bgroup\large \hfil\hfil
175
+ \hbox to 0pt\bgroup\hss
176
+ \outauthor
177
+ \hss\egroup
178
+ \hfil\hfil\egroup}
179
+ \vskip 0.3in plus 2fil minus 0.1in
180
+ }}
181
+
182
+ % margins and font size for abstract
183
+ \renewenvironment{abstract}%
184
+ {\centerline{\large\bf Abstract}%
185
+ \begin{list}{}%
186
+ {\setlength{\rightmargin}{0.6cm}%
187
+ \setlength{\leftmargin}{0.6cm}}%
188
+ \item[]\ignorespaces%
189
+ \@setsize\normalsize{12pt}\xpt\@xpt
190
+ }%
191
+ {\unskip\end{list}}
192
+
193
+ %\renewenvironment{abstract}{\centerline{\large\bf
194
+ % Abstract}\vspace{0.5ex}\begin{quote}}{\par\end{quote}\vskip 1ex}
195
+
196
+ % Resizing figure and table captions - SL
197
+ % Support for interacting with the caption, subfigure, and subcaption packages - SL
198
+ \RequirePackage{caption}
199
+ \DeclareCaptionFont{10pt}{\fontsize{10pt}{12pt}\selectfont}
200
+ \captionsetup{font=10pt}
201
+
202
+ \RequirePackage{natbib}
203
+ % for citation commands in the .tex, authors can use:
204
+ % \citep, \citet, and \citeyearpar for compatibility with natbib, or
205
+ % \cite, \newcite, and \shortcite for compatibility with older ACL .sty files
206
+ \renewcommand\cite{\citep} % to get "(Author Year)" with natbib
207
+ \newcommand\shortcite{\citeyearpar}% to get "(Year)" with natbib
208
+ \newcommand\newcite{\citet} % to get "Author (Year)" with natbib
209
+
210
+ % Bibliography
211
+
212
+ % Don't put a label in the bibliography at all. Just use the unlabeled format
213
+ % instead.
214
+ \def\thebibliography#1{\vskip\parskip%
215
+ \vskip\baselineskip%
216
+ \def\baselinestretch{1}%
217
+ \ifx\@currsize\normalsize\@normalsize\else\@currsize\fi%
218
+ \vskip-\parskip%
219
+ \vskip-\baselineskip%
220
+ \section*{References\@mkboth
221
+ {References}{References}}\list
222
+ {}{\setlength{\labelwidth}{0pt}\setlength{\leftmargin}{\parindent}
223
+ \setlength{\itemindent}{-\parindent}}
224
+ \def\newblock{\hskip .11em plus .33em minus -.07em}
225
+ \sloppy\clubpenalty4000\widowpenalty4000
226
+ \sfcode`\.=1000\relax}
227
+ \let\endthebibliography=\endlist
228
+
229
+
230
+ % Allow for a bibliography of sources of attested examples
231
+ \def\thesourcebibliography#1{\vskip\parskip%
232
+ \vskip\baselineskip%
233
+ \def\baselinestretch{1}%
234
+ \ifx\@currsize\normalsize\@normalsize\else\@currsize\fi%
235
+ \vskip-\parskip%
236
+ \vskip-\baselineskip%
237
+ \section*{Sources of Attested Examples\@mkboth
238
+ {Sources of Attested Examples}{Sources of Attested Examples}}\list
239
+ {}{\setlength{\labelwidth}{0pt}\setlength{\leftmargin}{\parindent}
240
+ \setlength{\itemindent}{-\parindent}}
241
+ \def\newblock{\hskip .11em plus .33em minus -.07em}
242
+ \sloppy\clubpenalty4000\widowpenalty4000
243
+ \sfcode`\.=1000\relax}
244
+ \let\endthesourcebibliography=\endlist
245
+
246
+ % sections with less space
247
+ \def\section{\@startsection {section}{1}{\z@}{-2.0ex plus
248
+ -0.5ex minus -.2ex}{1.5ex plus 0.3ex minus .2ex}{\large\bf\raggedright}}
249
+ \def\subsection{\@startsection{subsection}{2}{\z@}{-1.8ex plus
250
+ -0.5ex minus -.2ex}{0.8ex plus .2ex}{\normalsize\bf\raggedright}}
251
+ %% changed by KO to - values to get the initial parindent right
252
+ \def\subsubsection{\@startsection{subsubsection}{3}{\z@}{-1.5ex plus
253
+ -0.5ex minus -.2ex}{0.5ex plus .2ex}{\normalsize\bf\raggedright}}
254
+ \def\paragraph{\@startsection{paragraph}{4}{\z@}{1.5ex plus
255
+ 0.5ex minus .2ex}{-1em}{\normalsize\bf}}
256
+ \def\subparagraph{\@startsection{subparagraph}{5}{\parindent}{1.5ex plus
257
+ 0.5ex minus .2ex}{-1em}{\normalsize\bf}}
258
+
259
+ % Footnotes
260
+ \footnotesep 6.65pt %
261
+ \skip\footins 9pt plus 4pt minus 2pt
262
+ \def\footnoterule{\kern-3pt \hrule width 5pc \kern 2.6pt }
263
+ \setcounter{footnote}{0}
264
+
265
+ % Lists and paragraphs
266
+ \parindent 1em
267
+ \topsep 4pt plus 1pt minus 2pt
268
+ \partopsep 1pt plus 0.5pt minus 0.5pt
269
+ \itemsep 2pt plus 1pt minus 0.5pt
270
+ \parsep 2pt plus 1pt minus 0.5pt
271
+
272
+ \leftmargin 2em \leftmargini\leftmargin \leftmarginii 2em
273
+ \leftmarginiii 1.5em \leftmarginiv 1.0em \leftmarginv .5em \leftmarginvi .5em
274
+ \labelwidth\leftmargini\advance\labelwidth-\labelsep \labelsep 5pt
275
+
276
+ \def\@listi{\leftmargin\leftmargini}
277
+ \def\@listii{\leftmargin\leftmarginii
278
+ \labelwidth\leftmarginii\advance\labelwidth-\labelsep
279
+ \topsep 2pt plus 1pt minus 0.5pt
280
+ \parsep 1pt plus 0.5pt minus 0.5pt
281
+ \itemsep \parsep}
282
+ \def\@listiii{\leftmargin\leftmarginiii
283
+ \labelwidth\leftmarginiii\advance\labelwidth-\labelsep
284
+ \topsep 1pt plus 0.5pt minus 0.5pt
285
+ \parsep \z@ \partopsep 0.5pt plus 0pt minus 0.5pt
286
+ \itemsep \topsep}
287
+ \def\@listiv{\leftmargin\leftmarginiv
288
+ \labelwidth\leftmarginiv\advance\labelwidth-\labelsep}
289
+ \def\@listv{\leftmargin\leftmarginv
290
+ \labelwidth\leftmarginv\advance\labelwidth-\labelsep}
291
+ \def\@listvi{\leftmargin\leftmarginvi
292
+ \labelwidth\leftmarginvi\advance\labelwidth-\labelsep}
293
+
294
+ \abovedisplayskip 7pt plus2pt minus5pt%
295
+ \belowdisplayskip \abovedisplayskip
296
+ \abovedisplayshortskip 0pt plus3pt%
297
+ \belowdisplayshortskip 4pt plus3pt minus3pt%
298
+
299
+ % Less leading in most fonts (due to the narrow columns)
300
+ % The choices were between 1-pt and 1.5-pt leading
301
+ \def\@normalsize{\@setsize\normalsize{11pt}\xpt\@xpt}
302
+ \def\small{\@setsize\small{10pt}\ixpt\@ixpt}
303
+ \def\footnotesize{\@setsize\footnotesize{10pt}\ixpt\@ixpt}
304
+ \def\scriptsize{\@setsize\scriptsize{8pt}\viipt\@viipt}
305
+ \def\tiny{\@setsize\tiny{7pt}\vipt\@vipt}
306
+ \def\large{\@setsize\large{14pt}\xiipt\@xiipt}
307
+ \def\Large{\@setsize\Large{16pt}\xivpt\@xivpt}
308
+ \def\LARGE{\@setsize\LARGE{20pt}\xviipt\@xviipt}
309
+ \def\huge{\@setsize\huge{23pt}\xxpt\@xxpt}
310
+ \def\Huge{\@setsize\Huge{28pt}\xxvpt\@xxvpt}
references/2021.naacl.nguyen/source/naacl2021.tex ADDED
@@ -0,0 +1,641 @@
1
+ % This must be in the first 5 lines to tell arXiv to use pdfLaTeX, which is strongly recommended.
2
+ \pdfoutput=1
3
+ % In particular, the hyperref package requires pdfLaTeX in order to break URLs across lines.
4
+
5
+ \documentclass[11pt]{article}
6
+
7
+ % Remove the "review" option to generate the final version.
8
+ \usepackage{naacl2021}
9
+
10
+ % Standard package includes
11
+ \usepackage{times}
12
+ \usepackage{latexsym}
13
+
14
+ %\renewcommand{\UrlFont}{\ttfamily\small}
15
+
16
+
17
+ %\renewcommand{\UrlFont}{\ttfamily\small}
18
+
19
+ \usepackage{amsmath}
20
+ \usepackage{url}
21
+ \usepackage{amssymb}
22
+ \usepackage{amsfonts}
23
+ \usepackage{graphicx}
24
+ \usepackage{tabularx}
25
+ \usepackage{multirow}
26
+ \usepackage{arydshln}
27
+ \usepackage{mathtools,nccmath}
28
+ \usepackage{listings}
29
+
30
+ \usepackage[T5]{fontenc}
31
+ %\usepackage[utf8]{vietnam}
32
+ \usepackage{enumitem}
33
+ %\usepackage{ulem}
34
+ \usepackage{todonotes}
35
+ % \usepackage[usenames,dvipsnames]{color}
36
+ \usepackage{cancel}
37
+ \usepackage[draft]{minted}
38
+
39
+ % This is not strictly necessary, and may be commented out,
40
+ % but it will improve the layout of the manuscript,
41
+ % and will typically save some space.
42
+ \usepackage{microtype}
43
+
44
+
45
+
46
+ \makeatletter
47
+ \def\PYGdefault@reset{\let\PYGdefault@it=\relax \let\PYGdefault@bf=\relax%
48
+ \let\PYGdefault@ul=\relax \let\PYGdefault@tc=\relax%
49
+ \let\PYGdefault@bc=\relax \let\PYGdefault@ff=\relax}
50
+ \def\PYGdefault@tok#1{\csname PYGdefault@tok@#1\endcsname}
51
+ \def\PYGdefault@toks#1+{\ifx\relax#1\empty\else%
52
+ \PYGdefault@tok{#1}\expandafter\PYGdefault@toks\fi}
53
+ \def\PYGdefault@do#1{\PYGdefault@bc{\PYGdefault@tc{\PYGdefault@ul{%
54
+ \PYGdefault@it{\PYGdefault@bf{\PYGdefault@ff{#1}}}}}}}
55
+ \def\PYGdefault#1#2{\PYGdefault@reset\PYGdefault@toks#1+\relax+\PYGdefault@do{#2}}
56
+
57
+ \expandafter\def\csname PYGdefault@tok@w\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.73,0.73,0.73}{##1}}}
58
+ \expandafter\def\csname PYGdefault@tok@c\endcsname{\let\PYGdefault@it=\textit\def\PYGdefault@tc##1{\textcolor[rgb]{0.25,0.50,0.50}{##1}}}
59
+ \expandafter\def\csname PYGdefault@tok@cp\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.74,0.48,0.00}{##1}}}
60
+ \expandafter\def\csname PYGdefault@tok@k\endcsname{\let\PYGdefault@bf=\textbf\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
61
+ \expandafter\def\csname PYGdefault@tok@kp\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
62
+ \expandafter\def\csname PYGdefault@tok@kt\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.69,0.00,0.25}{##1}}}
63
+ \expandafter\def\csname PYGdefault@tok@o\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
64
+ \expandafter\def\csname PYGdefault@tok@ow\endcsname{\let\PYGdefault@bf=\textbf\def\PYGdefault@tc##1{\textcolor[rgb]{0.67,0.13,1.00}{##1}}}
65
+ \expandafter\def\csname PYGdefault@tok@nb\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
66
+ \expandafter\def\csname PYGdefault@tok@nf\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.00,1.00}{##1}}}
67
+ \expandafter\def\csname PYGdefault@tok@nc\endcsname{\let\PYGdefault@bf=\textbf\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.00,1.00}{##1}}}
68
+ \expandafter\def\csname PYGdefault@tok@nn\endcsname{\let\PYGdefault@bf=\textbf\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.00,1.00}{##1}}}
69
+ \expandafter\def\csname PYGdefault@tok@ne\endcsname{\let\PYGdefault@bf=\textbf\def\PYGdefault@tc##1{\textcolor[rgb]{0.82,0.25,0.23}{##1}}}
70
+ \expandafter\def\csname PYGdefault@tok@nv\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.10,0.09,0.49}{##1}}}
71
+ \expandafter\def\csname PYGdefault@tok@no\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.53,0.00,0.00}{##1}}}
72
+ \expandafter\def\csname PYGdefault@tok@nl\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.63,0.63,0.00}{##1}}}
73
+ \expandafter\def\csname PYGdefault@tok@ni\endcsname{\let\PYGdefault@bf=\textbf\def\PYGdefault@tc##1{\textcolor[rgb]{0.60,0.60,0.60}{##1}}}
74
+ \expandafter\def\csname PYGdefault@tok@na\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.49,0.56,0.16}{##1}}}
75
+ \expandafter\def\csname PYGdefault@tok@nt\endcsname{\let\PYGdefault@bf=\textbf\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
76
+ \expandafter\def\csname PYGdefault@tok@nd\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.67,0.13,1.00}{##1}}}
77
+ \expandafter\def\csname PYGdefault@tok@s\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
78
+ \expandafter\def\csname PYGdefault@tok@sd\endcsname{\let\PYGdefault@it=\textit\def\PYGdefault@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
79
+ \expandafter\def\csname PYGdefault@tok@si\endcsname{\let\PYGdefault@bf=\textbf\def\PYGdefault@tc##1{\textcolor[rgb]{0.73,0.40,0.53}{##1}}}
80
+ \expandafter\def\csname PYGdefault@tok@se\endcsname{\let\PYGdefault@bf=\textbf\def\PYGdefault@tc##1{\textcolor[rgb]{0.73,0.40,0.13}{##1}}}
81
+ \expandafter\def\csname PYGdefault@tok@sr\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.73,0.40,0.53}{##1}}}
82
+ \expandafter\def\csname PYGdefault@tok@ss\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.10,0.09,0.49}{##1}}}
83
+ \expandafter\def\csname PYGdefault@tok@sx\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
84
+ \expandafter\def\csname PYGdefault@tok@m\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
85
+ \expandafter\def\csname PYGdefault@tok@gh\endcsname{\let\PYGdefault@bf=\textbf\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.00,0.50}{##1}}}
86
+ \expandafter\def\csname PYGdefault@tok@gu\endcsname{\let\PYGdefault@bf=\textbf\def\PYGdefault@tc##1{\textcolor[rgb]{0.50,0.00,0.50}{##1}}}
87
+ \expandafter\def\csname PYGdefault@tok@gd\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.63,0.00,0.00}{##1}}}
88
+ \expandafter\def\csname PYGdefault@tok@gi\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.63,0.00}{##1}}}
89
+ \expandafter\def\csname PYGdefault@tok@gr\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{1.00,0.00,0.00}{##1}}}
90
+ \expandafter\def\csname PYGdefault@tok@ge\endcsname{\let\PYGdefault@it=\textit}
91
+ \expandafter\def\csname PYGdefault@tok@gs\endcsname{\let\PYGdefault@bf=\textbf}
92
+ \expandafter\def\csname PYGdefault@tok@gp\endcsname{\let\PYGdefault@bf=\textbf\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.00,0.50}{##1}}}
93
+ \expandafter\def\csname PYGdefault@tok@go\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.53,0.53,0.53}{##1}}}
94
+ \expandafter\def\csname PYGdefault@tok@gt\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.27,0.87}{##1}}}
95
+ \expandafter\def\csname PYGdefault@tok@err\endcsname{\def\PYGdefault@bc##1{\setlength{\fboxsep}{0pt}\fcolorbox[rgb]{1.00,0.00,0.00}{1,1,1}{\strut ##1}}}
96
+ \expandafter\def\csname PYGdefault@tok@kc\endcsname{\let\PYGdefault@bf=\textbf\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
97
+ \expandafter\def\csname PYGdefault@tok@kd\endcsname{\let\PYGdefault@bf=\textbf\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
98
+ \expandafter\def\csname PYGdefault@tok@kn\endcsname{\let\PYGdefault@bf=\textbf\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
99
+ \expandafter\def\csname PYGdefault@tok@kr\endcsname{\let\PYGdefault@bf=\textbf\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
100
+ \expandafter\def\csname PYGdefault@tok@bp\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
101
+ \expandafter\def\csname PYGdefault@tok@fm\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.00,0.00,1.00}{##1}}}
102
+ \expandafter\def\csname PYGdefault@tok@vc\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.10,0.09,0.49}{##1}}}
103
+ \expandafter\def\csname PYGdefault@tok@vg\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.10,0.09,0.49}{##1}}}
104
+ \expandafter\def\csname PYGdefault@tok@vi\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.10,0.09,0.49}{##1}}}
105
+ \expandafter\def\csname PYGdefault@tok@vm\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.10,0.09,0.49}{##1}}}
106
+ \expandafter\def\csname PYGdefault@tok@sa\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
107
+ \expandafter\def\csname PYGdefault@tok@sb\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
108
+ \expandafter\def\csname PYGdefault@tok@sc\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
109
+ \expandafter\def\csname PYGdefault@tok@dl\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
110
+ \expandafter\def\csname PYGdefault@tok@s2\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
111
+ \expandafter\def\csname PYGdefault@tok@sh\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
112
+ \expandafter\def\csname PYGdefault@tok@s1\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
113
+ \expandafter\def\csname PYGdefault@tok@mb\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
114
+ \expandafter\def\csname PYGdefault@tok@mf\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
115
+ \expandafter\def\csname PYGdefault@tok@mh\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
116
+ \expandafter\def\csname PYGdefault@tok@mi\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
117
+ \expandafter\def\csname PYGdefault@tok@il\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
118
+ \expandafter\def\csname PYGdefault@tok@mo\endcsname{\def\PYGdefault@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
119
+ \expandafter\def\csname PYGdefault@tok@ch\endcsname{\let\PYGdefault@it=\textit\def\PYGdefault@tc##1{\textcolor[rgb]{0.25,0.50,0.50}{##1}}}
120
+ \expandafter\def\csname PYGdefault@tok@cm\endcsname{\let\PYGdefault@it=\textit\def\PYGdefault@tc##1{\textcolor[rgb]{0.25,0.50,0.50}{##1}}}
121
+ \expandafter\def\csname PYGdefault@tok@cpf\endcsname{\let\PYGdefault@it=\textit\def\PYGdefault@tc##1{\textcolor[rgb]{0.25,0.50,0.50}{##1}}}
122
+ \expandafter\def\csname PYGdefault@tok@c1\endcsname{\let\PYGdefault@it=\textit\def\PYGdefault@tc##1{\textcolor[rgb]{0.25,0.50,0.50}{##1}}}
123
+ \expandafter\def\csname PYGdefault@tok@cs\endcsname{\let\PYGdefault@it=\textit\def\PYGdefault@tc##1{\textcolor[rgb]{0.25,0.50,0.50}{##1}}}
124
+
125
+ \def\PYGdefaultZbs{\char`\\}
126
+ \def\PYGdefaultZus{\char`\_}
127
+ \def\PYGdefaultZob{\char`\{}
128
+ \def\PYGdefaultZcb{\char`\}}
129
+ \def\PYGdefaultZca{\char`\^}
130
+ \def\PYGdefaultZam{\char`\&}
131
+ \def\PYGdefaultZlt{\char`\<}
132
+ \def\PYGdefaultZgt{\char`\>}
133
+ \def\PYGdefaultZsh{\char`\#}
134
+ \def\PYGdefaultZpc{\char`\%}
135
+ \def\PYGdefaultZdl{\char`\$}
136
+ \def\PYGdefaultZhy{\char`\-}
137
+ \def\PYGdefaultZsq{\char`\'}
138
+ \def\PYGdefaultZdq{\char`\"}
139
+ \def\PYGdefaultZti{\char`\~}
140
+ % for compatibility with earlier versions
141
+ \def\PYGdefaultZat{@}
142
+ \def\PYGdefaultZlb{[}
143
+ \def\PYGdefaultZrb{]}
144
+ \makeatother
145
+
146
+
147
+
148
+ \makeatletter
149
+ \def\PYG@reset{\let\PYG@it=\relax \let\PYG@bf=\relax%
150
+ \let\PYG@ul=\relax \let\PYG@tc=\relax%
151
+ \let\PYG@bc=\relax \let\PYG@ff=\relax}
152
+ \def\PYG@tok#1{\csname PYG@tok@#1\endcsname}
153
+ \def\PYG@toks#1+{\ifx\relax#1\empty\else%
154
+ \PYG@tok{#1}\expandafter\PYG@toks\fi}
155
+ \def\PYG@do#1{\PYG@bc{\PYG@tc{\PYG@ul{%
156
+ \PYG@it{\PYG@bf{\PYG@ff{#1}}}}}}}
157
+ \def\PYG#1#2{\PYG@reset\PYG@toks#1+\relax+\PYG@do{#2}}
158
+
159
+ \expandafter\def\csname PYG@tok@w\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.73,0.73,0.73}{##1}}}
160
+ \expandafter\def\csname PYG@tok@c\endcsname{\let\PYG@it=\textit\def\PYG@tc##1{\textcolor[rgb]{0.25,0.50,0.50}{##1}}}
161
+ \expandafter\def\csname PYG@tok@cp\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.74,0.48,0.00}{##1}}}
162
+ \expandafter\def\csname PYG@tok@k\endcsname{\let\PYG@bf=\textbf\def\PYG@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
163
+ \expandafter\def\csname PYG@tok@kp\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
164
+ \expandafter\def\csname PYG@tok@kt\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.69,0.00,0.25}{##1}}}
165
+ \expandafter\def\csname PYG@tok@o\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
166
+ \expandafter\def\csname PYG@tok@ow\endcsname{\let\PYG@bf=\textbf\def\PYG@tc##1{\textcolor[rgb]{0.67,0.13,1.00}{##1}}}
167
+ \expandafter\def\csname PYG@tok@nb\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
168
+ \expandafter\def\csname PYG@tok@nf\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.00,0.00,1.00}{##1}}}
169
+ \expandafter\def\csname PYG@tok@nc\endcsname{\let\PYG@bf=\textbf\def\PYG@tc##1{\textcolor[rgb]{0.00,0.00,1.00}{##1}}}
170
+ \expandafter\def\csname PYG@tok@nn\endcsname{\let\PYG@bf=\textbf\def\PYG@tc##1{\textcolor[rgb]{0.00,0.00,1.00}{##1}}}
171
+ \expandafter\def\csname PYG@tok@ne\endcsname{\let\PYG@bf=\textbf\def\PYG@tc##1{\textcolor[rgb]{0.82,0.25,0.23}{##1}}}
172
+ \expandafter\def\csname PYG@tok@nv\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.10,0.09,0.49}{##1}}}
173
+ \expandafter\def\csname PYG@tok@no\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.53,0.00,0.00}{##1}}}
174
+ \expandafter\def\csname PYG@tok@nl\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.63,0.63,0.00}{##1}}}
175
+ \expandafter\def\csname PYG@tok@ni\endcsname{\let\PYG@bf=\textbf\def\PYG@tc##1{\textcolor[rgb]{0.60,0.60,0.60}{##1}}}
176
+ \expandafter\def\csname PYG@tok@na\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.49,0.56,0.16}{##1}}}
177
+ \expandafter\def\csname PYG@tok@nt\endcsname{\let\PYG@bf=\textbf\def\PYG@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
178
+ \expandafter\def\csname PYG@tok@nd\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.67,0.13,1.00}{##1}}}
179
+ \expandafter\def\csname PYG@tok@s\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
180
+ \expandafter\def\csname PYG@tok@sd\endcsname{\let\PYG@it=\textit\def\PYG@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
181
+ \expandafter\def\csname PYG@tok@si\endcsname{\let\PYG@bf=\textbf\def\PYG@tc##1{\textcolor[rgb]{0.73,0.40,0.53}{##1}}}
182
+ \expandafter\def\csname PYG@tok@se\endcsname{\let\PYG@bf=\textbf\def\PYG@tc##1{\textcolor[rgb]{0.73,0.40,0.13}{##1}}}
183
+ \expandafter\def\csname PYG@tok@sr\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.73,0.40,0.53}{##1}}}
184
+ \expandafter\def\csname PYG@tok@ss\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.10,0.09,0.49}{##1}}}
185
+ \expandafter\def\csname PYG@tok@sx\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
186
+ \expandafter\def\csname PYG@tok@m\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
187
+ \expandafter\def\csname PYG@tok@gh\endcsname{\let\PYG@bf=\textbf\def\PYG@tc##1{\textcolor[rgb]{0.00,0.00,0.50}{##1}}}
188
+ \expandafter\def\csname PYG@tok@gu\endcsname{\let\PYG@bf=\textbf\def\PYG@tc##1{\textcolor[rgb]{0.50,0.00,0.50}{##1}}}
189
+ \expandafter\def\csname PYG@tok@gd\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.63,0.00,0.00}{##1}}}
190
+ \expandafter\def\csname PYG@tok@gi\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.00,0.63,0.00}{##1}}}
191
+ \expandafter\def\csname PYG@tok@gr\endcsname{\def\PYG@tc##1{\textcolor[rgb]{1.00,0.00,0.00}{##1}}}
192
+ \expandafter\def\csname PYG@tok@ge\endcsname{\let\PYG@it=\textit}
193
+ \expandafter\def\csname PYG@tok@gs\endcsname{\let\PYG@bf=\textbf}
194
+ \expandafter\def\csname PYG@tok@gp\endcsname{\let\PYG@bf=\textbf\def\PYG@tc##1{\textcolor[rgb]{0.00,0.00,0.50}{##1}}}
195
+ \expandafter\def\csname PYG@tok@go\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.53,0.53,0.53}{##1}}}
196
+ \expandafter\def\csname PYG@tok@gt\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.00,0.27,0.87}{##1}}}
197
+ \expandafter\def\csname PYG@tok@err\endcsname{\def\PYG@bc##1{\setlength{\fboxsep}{0pt}\fcolorbox[rgb]{1.00,0.00,0.00}{1,1,1}{\strut ##1}}}
198
+ \expandafter\def\csname PYG@tok@kc\endcsname{\let\PYG@bf=\textbf\def\PYG@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
199
+ \expandafter\def\csname PYG@tok@kd\endcsname{\let\PYG@bf=\textbf\def\PYG@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
200
+ \expandafter\def\csname PYG@tok@kn\endcsname{\let\PYG@bf=\textbf\def\PYG@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
201
+ \expandafter\def\csname PYG@tok@kr\endcsname{\let\PYG@bf=\textbf\def\PYG@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
202
+ \expandafter\def\csname PYG@tok@bp\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.00,0.50,0.00}{##1}}}
203
+ \expandafter\def\csname PYG@tok@fm\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.00,0.00,1.00}{##1}}}
204
+ \expandafter\def\csname PYG@tok@vc\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.10,0.09,0.49}{##1}}}
205
+ \expandafter\def\csname PYG@tok@vg\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.10,0.09,0.49}{##1}}}
206
+ \expandafter\def\csname PYG@tok@vi\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.10,0.09,0.49}{##1}}}
207
+ \expandafter\def\csname PYG@tok@vm\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.10,0.09,0.49}{##1}}}
208
+ \expandafter\def\csname PYG@tok@sa\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
209
+ \expandafter\def\csname PYG@tok@sb\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
210
+ \expandafter\def\csname PYG@tok@sc\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
211
+ \expandafter\def\csname PYG@tok@dl\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
212
+ \expandafter\def\csname PYG@tok@s2\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
213
+ \expandafter\def\csname PYG@tok@sh\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
214
+ \expandafter\def\csname PYG@tok@s1\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.73,0.13,0.13}{##1}}}
215
+ \expandafter\def\csname PYG@tok@mb\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
216
+ \expandafter\def\csname PYG@tok@mf\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
217
+ \expandafter\def\csname PYG@tok@mh\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
218
+ \expandafter\def\csname PYG@tok@mi\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
219
+ \expandafter\def\csname PYG@tok@il\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
220
+ \expandafter\def\csname PYG@tok@mo\endcsname{\def\PYG@tc##1{\textcolor[rgb]{0.40,0.40,0.40}{##1}}}
221
+ \expandafter\def\csname PYG@tok@ch\endcsname{\let\PYG@it=\textit\def\PYG@tc##1{\textcolor[rgb]{0.25,0.50,0.50}{##1}}}
222
+ \expandafter\def\csname PYG@tok@cm\endcsname{\let\PYG@it=\textit\def\PYG@tc##1{\textcolor[rgb]{0.25,0.50,0.50}{##1}}}
223
+ \expandafter\def\csname PYG@tok@cpf\endcsname{\let\PYG@it=\textit\def\PYG@tc##1{\textcolor[rgb]{0.25,0.50,0.50}{##1}}}
224
+ \expandafter\def\csname PYG@tok@c1\endcsname{\let\PYG@it=\textit\def\PYG@tc##1{\textcolor[rgb]{0.25,0.50,0.50}{##1}}}
225
+ \expandafter\def\csname PYG@tok@cs\endcsname{\let\PYG@it=\textit\def\PYG@tc##1{\textcolor[rgb]{0.25,0.50,0.50}{##1}}}
226
+
227
+ \def\PYGZbs{\char`\\}
228
+ \def\PYGZus{\char`\_}
229
+ \def\PYGZob{\char`\{}
230
+ \def\PYGZcb{\char`\}}
231
+ \def\PYGZca{\char`\^}
232
+ \def\PYGZam{\char`\&}
233
+ \def\PYGZlt{\char`\<}
234
+ \def\PYGZgt{\char`\>}
235
+ \def\PYGZsh{\char`\#}
236
+ \def\PYGZpc{\char`\%}
237
+ \def\PYGZdl{\char`\$}
238
+ \def\PYGZhy{\char`\-}
239
+ \def\PYGZsq{\char`\'}
240
+ \def\PYGZdq{\char`\"}
241
+ \def\PYGZti{\char`\~}
242
+ % for compatibility with earlier versions
243
+ \def\PYGZat{@}
244
+ \def\PYGZlb{[}
245
+ \def\PYGZrb{]}
246
+ \makeatother
247
+
248
+
249
+
250
+
251
+
252
+ \setlength{\textfloatsep}{15pt plus 5.0pt minus 3.0pt}
253
+ \setlength{\floatsep}{15pt plus 5.0pt minus 3.0pt}
254
+ %\setlength{\dbltextfloatsep }{15pt plus 2.0pt minus 3.0pt}
255
+ %\setlength{\dblfloatsep}{15pt plus 2.0pt minus 3.0pt}
256
+ %\setlength{\intextsep}{15pt plus 2.0pt minus 3.0pt}
257
+ \setlength{\abovecaptionskip}{5pt plus 1pt minus 1pt}
258
+
259
+ % If the title and author information does not fit in the area allocated, uncomment the following
260
+ %
261
+ %\setlength\titlebox{<dim>}
262
+ %
263
+ % and set <dim> to something 5cm or larger.
264
+
265
+ %\setlength\titlebox{5cm}
266
+
267
+ %\setlength{\textfloatsep}{15pt plus 5.0pt minus 5.0pt}
268
+ %\setlength{\floatsep}{15pt plus 5.0pt minus 5.0pt}
269
+ %\setlength{\dbltextfloatsep }{15pt plus 2.0pt minus 3.0pt}
270
+ %\setlength{\dblfloatsep}{15pt plus 2.0pt minus 3.0pt}
271
+ %\setlength{\intextsep}{15pt plus 2.0pt minus 3.0pt}
272
+ %\setlength{\abovecaptionskip}{5pt plus 1pt minus 1pt}
273
+
274
+ % If the title and author information does not fit in the area allocated, uncomment the following
275
+ %
276
+ \setlength\titlebox{5cm}
277
+ %
278
+ % and set <dim> to something 5cm or larger.
279
+
280
+ \title{PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsing}
281
+
282
+ % Author information can be set in various styles:
283
+ % For several authors from the same institution:
284
+ \author{Linh The Nguyen \and Dat Quoc Nguyen\\
285
+ VinAI Research, Hanoi, Vietnam \\
286
+ \tt{\normalsize \{v.linhnt140, v.datnq9\}@vinai.io}}
287
+ % if the names do not fit well on one line use
288
+ % Author 1 \\ {\bf Author 2} \\ ... \\ {\bf Author n} \\
289
+ % For authors from different institutions:
290
+ % \author{Author 1 \\ Address line \\ ... \\ Address line
291
+ % \And ... \And
292
+ % Author n \\ Address line \\ ... \\ Address line}
293
+ % To start a seperate ``row'' of authors use \AND, as in
294
+ % \author{Author 1 \\ Address line \\ ... \\ Address line
295
+ % \AND
296
+ % Author 2 \\ Address line \\ ... \\ Address line \And
297
+ % Author 3 \\ Address line \\ ... \\ Address line}
298
+
299
+ %\author{ }
300
+
301
+ \begin{document}
302
+ \maketitle
303
+
304
+ \begin{abstract}
305
+ We present the first multi-task learning model---named PhoNLP---for joint Vietnamese part-of-speech (POS) tagging, named entity recognition (NER) and dependency
306
+ parsing. Experiments on Vietnamese benchmark datasets show that PhoNLP produces state-of-the-art results, outperforming a single-task learning approach that fine-tunes the pre-trained Vietnamese language model PhoBERT \cite{phobert} for each task independently. We publicly release PhoNLP as an open-source toolkit under the Apache License 2.0.
307
+ Although we design PhoNLP for Vietnamese, our training and evaluation command scripts can in fact directly work for other languages that have a pre-trained BERT-based language model and gold annotated corpora available for the three tasks of POS tagging, NER and dependency parsing.
308
+ We hope that PhoNLP can serve as a strong baseline and useful toolkit for future NLP research and applications, not only for Vietnamese but also for other languages. Our PhoNLP is available at \url{https://github.com/VinAIResearch/PhoNLP}.
309
+ \end{abstract}
310
+
311
+ \vspace{-5pt}
312
+
313
+ \begin{figure*}[!t]
314
+ \centering
315
+ \includegraphics[width=12.5cm]{JointModel.pdf}
316
+ {\small
317
+ \begin{tabular}{crllll}
318
+ \hline
319
+ \textbf{ID} & \textbf{Form} & \textbf{POS} & \textbf{NER} & \textbf{Head} & \textbf{DepRel} \\
320
+ \hline
321
+ 1 & Đây\textsubscript{This} & PRON & O & 2 & sub \\
322
+ 2 & là\textsubscript{is} & VERB& O & 0 & root \\
323
+ 3 & Hà\_Nội\textsubscript{Ha\_Noi} & NOUN & B-LOC & 2 & vmod \\
324
+ \hline
325
+ \end{tabular}
326
+ }
327
+ \caption{Illustration of our PhoNLP model.}
328
+ \label{fig:architecture}
329
+ \end{figure*}
330
+
331
+
332
+
333
+ \section{Introduction}
334
+
335
+ Vietnamese NLP research has advanced significantly in recent years, boosted by the success of the national project on Vietnamese language and speech processing (VLSP) KC01.01/2006-2010 and by the VLSP workshops, which have run shared tasks since 2013.\footnote{\url{https://vlsp.org.vn/}} The fundamental tasks of POS tagging, NER and dependency parsing play important roles here, providing useful features for many downstream applications such as machine translation \cite{7800281}, sentiment analysis \cite{BANG20182016IIP0038}, relation extraction \cite{9287471}, semantic parsing \cite{vitext2sql}, open information extraction \cite{3155133.3155171} and question answering \cite{NguyenNP_SWJ,3184558.3191535}. % \cite{7800281,BANG20182016IIP0038,3155133.3155171,9287471}.
336
+ Thus, there is a need to develop NLP toolkits for linguistic annotations w.r.t. Vietnamese POS tagging, NER and dependency parsing.
337
+
338
+ VnCoreNLP \cite{vu-etal-2018-vncorenlp} is the previous public toolkit employing traditional feature-based machine learning models to handle these Vietnamese NLP tasks. However, VnCoreNLP is no longer state-of-the-art: its results are significantly outperformed by those obtained by fine-tuning PhoBERT---the current state-of-the-art monolingual pre-trained language model for Vietnamese \cite{phobert}. Note that there are no publicly available fine-tuned BERT-based models for the three Vietnamese tasks. Even if there were, a potential drawback is that an NLP package wrapping such fine-tuned BERT-based models would require a large storage space, i.e. three times that of a single BERT model \cite{devlin-etal-2019-bert}, making it unsuitable for practical applications with limited storage. Joint multi-task learning is a promising solution as it helps reduce the storage space. In addition, POS tagging, NER and dependency parsing are related tasks: POS tags are essential input features for dependency parsing and are also used as additional features for NER. Joint multi-task learning thus might also improve performance over single-task learning \cite{Ruder2019Neural}.
339
+
340
+
341
+ In this paper, we present a new multi-task learning model---named PhoNLP---for joint POS tagging, NER and dependency parsing. In particular, given an input sentence of words to PhoNLP, an encoding layer generates contextualized word embeddings that represent the input words. These contextualized word embeddings are fed into a POS tagging layer that is in fact a linear prediction layer \cite{devlin-etal-2019-bert} to predict POS tags for the corresponding input words. Each predicted POS tag is then represented by two ``soft'' embeddings that are later fed into NER and dependency parsing layers separately.
342
+ More specifically, based on both the contextualized word embeddings and the ``soft'' POS tag embeddings, the NER layer uses a linear-chain CRF predictor \cite{Lafferty:2001} to predict NER labels for the input words, while the dependency parsing layer uses a Biaffine classifier \cite{DozatM17} to predict dependency arcs between the words and another Biaffine classifier to label the predicted arcs.
343
+ %To the best of our knowledge, our PhoNLP is the first proposed model to jointly learn POS tagging, NER and dependency parsing for Vietnamese. Experiments on Vietnamese benchmark datasets show that PhoNLP produces state-of-the-art results.
344
+ Our contributions are summarized as follows:
345
+
346
+ %\vspace{-2pt}
347
+
348
+ \begin{itemize}[leftmargin=*]
349
+ \setlength\itemsep{-1pt}
350
+ \item To the best of our knowledge, PhoNLP is the first proposed model to jointly learn POS tagging, NER and dependency parsing for Vietnamese.
351
+ \item We discuss a data leakage issue in the Vietnamese benchmark datasets that has not been pointed out before. Experiments show that PhoNLP obtains state-of-the-art performance results, outperforming PhoBERT-based single-task learning.
352
+ \item We publicly release PhoNLP as an open-source toolkit that is simple to setup and efficiently run from both the command-line and Python API. We hope that PhoNLP can serve as a strong baseline and useful toolkit for future NLP research and downstream applications.
353
+ \end{itemize}
354
+
355
+
356
+
357
+ \section{Model description}
358
+
359
+ Figure \ref{fig:architecture} illustrates our PhoNLP architecture that can be viewed as a mixture of a BERT-based encoding layer and three decoding layers of POS tagging, NER and dependency parsing.
360
+
361
+ \subsection{Encoder \& Contextualized embeddings}
362
+
363
+ Given an input sentence consisting of $n$ word tokens $w_1, w_2, ..., w_n$, the encoding layer employs PhoBERT to generate contextualized latent feature embeddings $\mathbf{e}_{i}$ each representing the $i^{th}$ word $w_i$:
364
+
365
+ \begin{equation}
366
+ \mathbf{e}_{i} = \mathrm{PhoBERT\textsubscript{base}}\big({w}_{1:n}, i\big)
367
+ \end{equation}
368
+
369
+ In particular, the encoding layer employs the \textbf{PhoBERT\textsubscript{base}} version. Because PhoBERT uses BPE \cite{sennrich-etal-2016-neural} to segment the input sentence into subword units, the encoding layer in fact represents the $i^{th}$ word $w_i$ by using the contextualized embedding of its first subword.
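The first-subword convention above can be sketched as follows. This is a toy illustration, not PhoBERT's actual tokenizer output: `first_subword_vectors` is a hypothetical helper and the embedding values are made up.

```python
# Hypothetical sketch: represent each word by the contextualized vector
# of its FIRST subword after BPE segmentation.
def first_subword_vectors(subword_vecs, word_to_subwords):
    """subword_vecs: one vector per BPE subword token.
    word_to_subwords: for each word, the indices of its subwords."""
    return [subword_vecs[idxs[0]] for idxs in word_to_subwords]

# Toy example: the second word splits into two BPE pieces (indices 1, 2),
# so only the vector at index 1 represents it.
vecs = [[0.1], [0.2], [0.3]]
mapping = [[0], [1, 2]]
print(first_subword_vectors(vecs, mapping))  # -> [[0.1], [0.2]]
```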
370
+
371
+ \subsection{POS tagging}\label{ssec:pos}
372
+
373
+ Following a common manner when fine-tuning a pre-trained language model for a sequence labeling task \cite{devlin-etal-2019-bert}, the POS tagging layer is a linear prediction layer that is appended on top of the encoder. In particular, the POS tagging layer feeds the contextualized word embeddings $\mathbf{e}_{i}$ into a feed-forward network (FFNN\textsubscript{POS}) followed by a $\mathsf{softmax}$ predictor for POS tag prediction:
374
+
375
+ \begin{equation}
376
+ \mathbf{p}_{i} = \mathsf{softmax}\big(\mathrm{FFNN\textsubscript{POS}}\big(\mathbf{e}_{i}\big)\big) \label{eq2}
377
+ \end{equation}
378
+
379
+ \noindent where the output layer size of FFNN\textsubscript{POS} is the number of POS tags. Based on probability vectors $\mathbf{p}_{i}$, a cross-entropy objective loss \textbf{$\mathcal{L}_{\text{POS}}$} is calculated for POS tagging during training.
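As a concrete illustration of this linear prediction head, here is a minimal pure-Python sketch with toy weights (not PhoNLP's learned parameters): a single linear transform over the contextualized embedding followed by softmax.

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def pos_probs(e_i, W, b):
    # One output unit per POS tag: logits = W e_i + b, then softmax.
    logits = [sum(w * x for w, x in zip(row, e_i)) + b_j
              for row, b_j in zip(W, b)]
    return softmax(logits)

# Toy 2-dimensional embedding and a 2-tag tagset.
p = pos_probs([1.0, -1.0], W=[[0.5, 0.0], [0.0, 0.5]], b=[0.0, 0.0])
assert abs(sum(p) - 1.0) < 1e-9          # a valid probability distribution
```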
380
+
381
+
382
+ \subsection{NER}\label{ssec:ner}
383
+
384
+ The NER layer creates a sequence of vectors $\mathbf{v}_{1:n}$ in which each $\mathbf{v}_{i}$ results from concatenating the contextualized word embedding $\mathbf{e}_{i}$ and a ``soft'' POS tag embedding $\mathbf{t}_{i}^{(1)}$:
385
+
386
+ \begin{equation}
387
+ \mathbf{v}_{i} = \mathbf{e}_{i} \circ \mathbf{t}_{i}^{(1)}
388
+ \label{equa:ner}
389
+ \end{equation}
390
+
391
+ \noindent where following \newcite{hashimoto-etal-2017-joint}, the ``soft'' POS tag embedding $\mathbf{t}_{i}^{(1)}$ is computed by multiplying a label weight matrix $\mathbf{W}^{(1)}$ with the corresponding probability vector $\mathbf{p}_{i}$:
392
+
393
+ \begin{equation*}
394
+ \mathbf{t}_{i}^{(1)} = \mathbf{W}^{(1)}\mathbf{p}_{i}
395
+ \end{equation*}
396
+
397
+ The NER layer then passes each vector $\mathbf{v}_{i}$ into a FFNN (FFNN\textsubscript{NER}):
398
+
399
+ \begin{equation}
400
+ \mathbf{h}_{i} = \mathrm{FFNN\textsubscript{NER}}\big(\mathbf{v}_{i}\big) \label{eq4}
401
+ \end{equation}
402
+
403
+ \noindent where the output layer size of FFNN\textsubscript{NER} is the number of BIO-based NER labels.
404
+
405
+
406
+ The NER layer feeds the output vectors $\mathbf{h}_{i}$ into a linear-chain
407
+ CRF predictor for NER label prediction \cite{Lafferty:2001}. A cross-entropy loss \textbf{$\mathcal{L}_{\text{NER}}$} is calculated for NER during
408
+ training while the Viterbi algorithm is used for inference.
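The soft POS tag embedding and concatenation above amount to a matrix-vector product followed by vector concatenation. The sketch below uses toy dimensions and a 2x2 identity weight matrix, not PhoNLP's learned label weight matrix.

```python
# Hypothetical toy sketch: t_i = W p_i, then v_i = e_i concatenated with t_i.
def soft_tag_embedding(W, p):
    # W: (tag-embedding dim) x (number of POS tags); p: tag probabilities.
    return [sum(w * p_j for w, p_j in zip(row, p)) for row in W]

def ner_input(e_i, W, p_i):
    # Concatenate the word embedding with its soft POS tag embedding.
    return e_i + soft_tag_embedding(W, p_i)

v = ner_input([0.2, 0.4], W=[[1.0, 0.0], [0.0, 1.0]], p_i=[0.9, 0.1])
print(v)  # -> [0.2, 0.4, 0.9, 0.1]
```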
409
+
410
+ \subsection{Dependency parsing}
411
+
412
+ The dependency parsing layer creates vectors $\mathbf{z}_{1:n}$ in which each $\mathbf{z}_{i}$ results from concatenating $\mathbf{e}_{i}$ and another ``soft'' POS tag embedding $\mathbf{t}_{i}^{(2)}$:
413
+
414
+ \begin{eqnarray}
415
+ \mathbf{z}_{i} &=& \mathbf{e}_{i} \circ \mathbf{t}_{i}^{(2)} \label{equa:posdep} \\
416
+ \mathbf{t}_{i}^{(2)} &=& \mathbf{W}^{(2)}\mathbf{p}_{i} \nonumber
417
+ \end{eqnarray}
418
+
419
+ Following \newcite{DozatM17}, the dependency parsing layer uses FFNNs to split $\mathbf{z}_{i}$ into \emph{head} and \emph{dependent} representations:
420
+
421
+ \begin{eqnarray}
422
+ \mathbf{h}_{i}^{(\textsc{a-h})} &=& \mathrm{FFNN}_{\text{Arc-Head}}\big(\mathbf{z}_{i}\big) \label{equa:fc6} \\
423
+ \mathbf{h}_{i}^{(\textsc{a-d})} &=& \mathrm{FFNN}_{\text{Arc-Dep}}\big(\mathbf{z}_{i}\big) \\
424
+ \mathbf{h}_{i}^{(\textsc{l-h})} &=& \mathrm{FFNN}_{\text{Label-Head}}\big(\mathbf{z}_{i}\big) \\
425
+ \mathbf{h}_{i}^{(\textsc{l-d})} &=& \mathrm{FFNN}_{\text{Label-Dep}}\big(\mathbf{z}_{i}\big) \label{equa:fc9}
426
+ \end{eqnarray}
427
+
428
+
429
+ To predict potential dependency arcs, based on input vectors $\mathbf{h}_{i}^{(\textsc{a-h})}$ and $\mathbf{h}_{j}^{(\textsc{a-d})}$, the parsing layer uses a Biaffine classifier's variant \cite{qi-etal-2018-universal} that additionally takes into account the distance and relative ordering between two words to produce a probability distribution of
430
+ arc heads for each word. %\footnote{We utilize an implementation of the Biaffine classifier's variant \cite{qi-etal-2018-universal} from \newcite{qi-etal-2020-stanza}.}
431
+ For inference, the Chu–Liu/Edmonds' algorithm is used to find a maximum spanning tree \cite{chuliu,Edmonds}.
432
+ The parsing layer also uses another Biaffine classifier to label the predicted arcs, based on input vectors $\mathbf{h}_{i}^{(\textsc{l-h})}$ and $\mathbf{h}_{j}^{(\textsc{l-d})}$. During training, an objective loss \textbf{$\mathcal{L}_{\text{DEP}}$} is computed by summing a cross-entropy loss for unlabeled dependency parsing and another cross-entropy loss for dependency label prediction, based on gold arcs and arc labels.
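The core of the arc predictor is a biaffine score between a head representation and a dependent representation. The sketch below shows only the basic bilinear-plus-linear form with toy parameters; the actual layer is Stanza's variant, which additionally incorporates distance and relative-ordering terms.

```python
# Hedged sketch of a biaffine scorer (toy parameters, not PhoNLP's).
def biaffine_score(h_head, h_dep, U, w, b):
    # score = h_head^T U h_dep + w . (h_head ++ h_dep) + b
    Uh = [sum(u * d for u, d in zip(row, h_dep)) for row in U]
    bilinear = sum(h * x for h, x in zip(h_head, Uh))
    linear = sum(w_j * x for w_j, x in zip(w, h_head + h_dep))
    return bilinear + linear + b

# Toy parameters: U couples head dimension 0 with dependent dimension 1.
s = biaffine_score([1.0, 0.0], [0.0, 1.0],
                   U=[[0.0, 2.0], [0.0, 0.0]], w=[0.0] * 4, b=0.5)
print(s)  # -> 2.5
```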
433
+
434
+
435
+ %\begin{eqnarray}
436
+ % {s}_{i,j} &=&\mathrm{Biaff\textsuperscript{(A)}}\Big(\mathbf{h}_{i}^{(\textsc{a-h})}, \mathbf{h}_{j}^{(\textsc{a-d})}\Big) \\
437
+ %\mathrm{Biaff\textsuperscript{(A)}}\big(\mathbf{x}, \mathbf{y}\big) &=& \mathbf{x}^{\mathsf{T}} \mathbf{U}_1 \mathbf{y} + \mathbf{w}_1^{\mathsf{T}}(\mathbf{x} \circ \mathbf{y}) + {b}_1 \nonumber
438
+ %\end{eqnarray}
439
+ %
440
+ %\noindent where $\mathbf{U}_1$, $\mathbf{w}_1$ and $b_1$ are a $k\times 1 \times k$ tensor, a $2k$-dimensional vector and a bias scalar, respectively
441
+ %(here, $k$ is the size of the {head} and {dependent} representations).
442
+ %
443
+ %\begin{eqnarray}
444
+ % \mathbf{s}_{i,j} &=&\mathrm{Biaff\textsuperscript{(L)}}\Big(\mathbf{h}_{i}^{(\textsc{l-h})}, \mathbf{h}_{j}^{(\textsc{l-d})}\Big) \\
445
+ %\mathrm{Biaff\textsuperscript{(L)}}\big(\mathbf{x}, \mathbf{y}\big) &=& \mathbf{x}^{\mathsf{T}} \mathbf{U}_2 \mathbf{y} + \mathbf{W}_2(\mathbf{x} \circ \mathbf{y}) + \mathbf{b}_2 \nonumber
446
+ %\end{eqnarray}
447
+ %
448
+ %\noindent where $\mathbf{U}_2$, $\mathbf{W}_2$ and $\mathbf{b}_2$ are a $k\times l \times k$ tensor, a $l \times 2k$ matrix and a bias vector, respectively (here, $l$ is the number of dependency labels).
449
+
450
+
451
+ \subsection{Joint multi-task learning}
452
+
453
+ The final training objective loss \textbf{$\mathcal{L}$} of our model PhoNLP is the weighted sum of the POS tagging loss {$\mathcal{L}_{\text{POS}}$}, the NER loss {$\mathcal{L}_{\text{NER}}$} and the dependency parsing loss {$\mathcal{L}_{\text{DEP}}$}:
454
+
455
+ \begin{equation}
456
+ \textbf{$\mathcal{L}$} = \lambda_1\mathcal{L}_{\text{POS}} + \lambda_2\mathcal{L}_{\text{NER}} + (1 - \lambda_1 - \lambda_2)\mathcal{L}_{\text{DEP}}
457
+ \end{equation}
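This weighted sum is straightforward to compute; a minimal sketch follows, with the default weights set to the values the experiments later find optimal (0.4 and 0.2).

```python
# Minimal sketch of the joint objective: a weighted sum of the three
# task losses, with the three weights summing to one.
def joint_loss(l_pos, l_ner, l_dep, lam1=0.4, lam2=0.2):
    return lam1 * l_pos + lam2 * l_ner + (1.0 - lam1 - lam2) * l_dep

assert abs(joint_loss(1.0, 1.0, 1.0) - 1.0) < 1e-9  # weights sum to 1
```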
458
+
459
+ \paragraph{Discussion:} Our PhoNLP can be viewed as an extension of previous joint POS tagging and dependency parsing models \cite{hashimoto-etal-2017-joint,li-etal-2018-joint-learning,nguyen-verspoor-2018-improved,NguyenALTA2019,kondratyuk-straka-2019-75}, where we additionally incorporate a CRF-based prediction layer for NER. Unlike \newcite{hashimoto-etal-2017-joint}, \newcite{nguyen-verspoor-2018-improved}, \newcite{li-etal-2018-joint-learning} and \newcite{NguyenALTA2019} that use BiLSTM-based encoders to extract contextualized feature embeddings, we use a BERT-based encoder. \newcite{kondratyuk-straka-2019-75} also employ a BERT-based encoder. However, different from PhoNLP where we construct a hierarchical architecture over the POS tagging and dependency parsing layers, \newcite{kondratyuk-straka-2019-75} do not make use of POS tag embeddings for dependency parsing.\footnote{In our preliminary experiments, not feeding the POS tag embeddings into the dependency parsing layer decreases the performance.}
460
+
461
+
462
+
463
+
464
+ \section{Experiments}
465
+
466
+ \subsection{Setup}
467
+
468
+ \subsubsection{Datasets}
469
+
470
+ To conduct experiments, we use the benchmark datasets of the VLSP 2013 POS tagging dataset,\footnote{\url{https://vlsp.org.vn/vlsp2013/eval}} the VLSP 2016 NER dataset \cite{JCC13161} and the VnDT dependency treebank v1.1 \cite{Nguyen2014NLDB}, following the setup used by the VnCoreNLP toolkit \cite{vu-etal-2018-vncorenlp}. Here, VnDT is converted from the Vietnamese constituent treebank \cite{nguyen-etal-2009-building}.
471
+
472
+
473
+ \paragraph{Data leakage issue:} We further discover a data leakage issue that has not been pointed out before: all sentences from the VLSP 2016 NER dataset and the VnDT treebank are included in the VLSP 2013 POS tagging dataset. In particular, 90+\% of the sentences from both the validation and test sets for NER and dependency parsing are included in the POS tagging training set, resulting in an unrealistic evaluation scenario in which the POS tags used as input features for NER and dependency parsing have effectively been seen during POS tagging training.
474
+
475
+ To handle this issue, we re-split the VLSP 2013 POS tagging dataset: the POS tagging validation/test set now only contains sentences that appear in the union of the NER and dependency parsing validation/test sets (i.e. the validation/test sentences for NER and dependency parsing only appear in the POS tagging validation/test set).
476
+ In addition, there are 594 duplicated sentences in the VLSP 2013 POS tagging dataset (no sentence duplication is found in the union of the NER and dependency parsing sentences). We therefore also remove duplicates from the POS tagging dataset.
477
+ Table \ref{tab:Datasets} details the statistics of the experimental datasets.
478
+
479
+ \begin{table}[!t]
480
+ \centering
481
+ \resizebox{7.5cm}{!}{
482
+ \begin{tabular}{l|l|l|l}
483
+ \hline
484
+ \textbf{Task} & \textbf{\#train} & \textbf{\#valid} & \textbf{\#test} \\
485
+ \hline
486
+ {POS tagging (leakage)} & {27000} & {870} & {2120} \\
487
+ \hdashline
488
+ POS tagging (re-split) & 23906 & 2009 & 3481\\
489
+ \hline
490
+ NER & 14861 & 2000 & 2831 \\
491
+ \hline
492
+ Dependency parsing & 8977 & 200 & 1020 \\
493
+ \hline
494
+ \end{tabular}
495
+ }
496
+ \caption{Dataset statistics. \textbf{\#train}, \textbf{\#valid} and \textbf{\#test} denote the numbers of training, validation and test sentences, respectively. Here,
497
+ ``{POS tagging (leakage)}'' and ``POS tagging (re-split)'' refer to the statistics for POS tagging before and after re-splitting \& sentence duplication removal, respectively.}
498
+ \label{tab:Datasets}
499
+ \end{table}
500
+
501
+ \subsubsection{Implementation}
502
+
503
+ PhoNLP is implemented based on PyTorch \cite{NEURIPS2019_9015}, employing the PhoBERT encoder implementation available from the $\mathrm{transformers}$ library \cite{wolf-etal-2020-transformers} and the Biaffine classifier implementation from \newcite{qi-etal-2020-stanza}. We set both the label weight matrices $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$ to have 100 rows, resulting in 100-dimensional soft POS tag embeddings. In addition, following \newcite{qi-etal-2018-universal,qi-etal-2020-stanza}, FFNNs in equations \ref{equa:fc6}--\ref{equa:fc9} use 400-dimensional output layers.
504
+
505
+ We use the AdamW optimizer \cite{loshchilov2018decoupled} with a fixed batch size of 32, and train for 40 epochs. The training sets differ in size, with the POS tagging
506
+ training set being the largest at 23906 sentences. Thus, for each training epoch, we repeatedly sample from the NER and dependency parsing training sets to fill the gaps between the training set sizes. We perform a grid search to select the initial AdamW learning rate, $\lambda_1$ and $\lambda_2$, and find their optimal values to be 1e-5, 0.4 and 0.2, respectively. After each training epoch, we compute the average of the POS tagging accuracy, the NER F\textsubscript{1}-score and the dependency parsing LAS score on the validation sets. We select the model checkpoint that produces the highest average score over the validation sets and apply it to the test sets. Each of our reported scores is an average over 5 runs with different random seeds.
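The per-epoch balancing described above can be sketched as sampling with replacement from the smaller training sets until they match the size of the largest one. `fill_to_size` is a hypothetical helper for illustration, not PhoNLP's actual data loader.

```python
import random

def fill_to_size(dataset, target_size, rng):
    # Keep every original example, then sample extras with replacement
    # until the epoch-level dataset reaches the target size.
    extra = [rng.choice(dataset) for _ in range(target_size - len(dataset))]
    return dataset + extra

rng = random.Random(0)
# Grow the NER training set (14861 sentences) to the POS tagging
# training set size (23906 sentences) for one epoch.
ner_epoch = fill_to_size(list(range(14861)), 23906, rng)
print(len(ner_epoch))  # -> 23906
```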
507
+
508
+ %\subsubsection{Baseline single-task training}
509
+
510
+ %We also conduct experiments for a single-task training strategy. We follow a common approach to fine-tune a pre-trained language model for POS tagging, appending a linear prediction layer on top of PhoBERT, as briefly described in Section \ref{ssec:pos}. For NER, instead of a linear prediction layer, we append a CRF prediction layer on top of PhoBERT. For dependency parsing, predicted POS tags are produced by the learned single-task POS tagging model; then POS tags are represented by embeddings that are concatenated with the corresponding contextualized word embeddings, resulting in a sequence of input vectors for the Biaffine-based classifiers \cite{qi-etal-2018-universal}.
511
+
512
+
513
+
514
+
515
+
516
+
517
+
518
+
519
+
520
+
521
+
522
+ \subsection{Results}
523
+
524
+ %\subsubsection*{Main results}
525
+
526
+
527
+
528
+ Table \ref{tab:results} presents results obtained for our PhoNLP and compares them with those of a baseline approach of single-task training. For the single-task training approach: (i) We follow a common approach to fine-tune a pre-trained language model for POS tagging, appending a linear prediction layer on top of PhoBERT, as briefly described in Section \ref{ssec:pos}. (ii) For NER, instead of a linear prediction layer, we append a CRF prediction layer on top of PhoBERT. (iii) For dependency parsing, predicted POS tags are produced by the learned single-task POS tagging model; then POS tags are represented by embeddings that are concatenated with the corresponding PhoBERT-based contextualized word embeddings, resulting in a sequence of input vectors for the Biaffine-based classifiers for dependency parsing \cite{qi-etal-2018-universal}. Here, the single-task training approach is based on the PhoBERT\textsubscript{base} version, employing the same hyper-parameter tuning and model selection strategy that we use for PhoNLP.
529
+
530
+ \begin{table}[!t]
531
+ \centering
532
+ \def\arraystretch{1.2}
533
+ \resizebox{7.5cm}{!}{
534
+ \begin{tabular}{ll|l|l|l|l}
535
+ \hline
536
+ & \textbf{Model} & \textbf{POS} & \textbf{NER} & \textbf{LAS} & \textbf{UAS} \\
537
+ \hline
538
+ \multirow{2}{*}{\rotatebox[origin=c]{90}{{Leak.}}}& Single-task & 96.7$^\dagger$ & 93.69 & 78.77$^\dagger$ & 85.22$^\dagger$ \\
539
+ \cdashline{2-6}
540
+ & PhoNLP & \textbf{96.76} & \textbf{94.41} & \textbf{79.11} & \textbf{85.47}\\
541
+ \hline
542
+ \hline
543
+ \multirow{2}{*}{\rotatebox[origin=c]{90}{{Re-spl}}}& Single-task & 93.68 & 93.69 & 77.89 & 84.78 \\
544
+ \cdashline{2-6}
545
+ & PhoNLP & \textbf{93.88} & \textbf{94.51} & \textbf{78.17} & \textbf{84.95} \\
546
+ %& PhoNLP & \textbf{93.88}$^{*}$ & \textbf{94.51}$^{**}$ & \textbf{78.17}$^{*}$ & \textbf{84.95} \\
547
+ \hline
548
+ \end{tabular}
549
+ }
550
+ \caption{Performance results (in \%) on the test sets for POS tagging (i.e. accuracy), NER (i.e. F\textsubscript{1}-score) and dependency parsing (i.e. LAS and UAS scores). ``Leak.'' abbreviates ``leakage'', denoting the results obtained w.r.t. the data leakage issue. ``Re-spl'' denotes the results obtained w.r.t. the data re-split and duplication removal for POS tagging to avoid the data leakage issue. ``Single-task'' refers to the single-task training approach.
551
+ $\dagger$ denotes scores taken from the PhoBERT paper \cite{phobert}. Note that ``Single-task'' NER is not affected by the data leakage issue.
552
+ %Here, $^{*}$ and $^{**}$ denote the statistically significant differences between ``Single-task'' and PhoNLP at p $\leq$ 0.05 and p $\leq$ 0.01, respectively.
553
+ }
554
+ \label{tab:results}
555
+ \end{table}
556
+
557
+ Note that PhoBERT helps produce state-of-the-art results for multiple Vietnamese NLP tasks (including but not limited to POS tagging, NER and dependency parsing in a single-task training strategy), and obtains higher performance results than VnCoreNLP.
558
+ However, in both the PhoBERT and VnCoreNLP papers \cite{phobert,vu-etal-2018-vncorenlp}, results for POS tagging and dependency parsing are reported under the data leakage issue. Our ``Single-task'' results in Table \ref{tab:results} for ``Re-spl'' (i.e. the data re-split and duplication removal for POS tagging that avoids the data leakage issue) can thus be viewed as new PhoBERT results under a proper experimental setup. Table \ref{tab:results} shows that in both the ``Leak.'' and ``Re-spl'' setups, our joint multi-task training approach PhoNLP performs better than the PhoBERT-based single-task training approach, resulting in state-of-the-art performance for the three tasks of Vietnamese POS tagging, NER and dependency parsing.
559
+
560
+
561
+
562
+
563
+ \section{PhoNLP toolkit}
564
+
565
+ We present in this section a basic usage of our PhoNLP toolkit.
566
+ We make PhoNLP simple to set up, i.e. users can install PhoNLP from either source or $\mathsf{pip}$ (e.g. $\mathsf{pip3\ install\ phonlp}$). We also aim to make PhoNLP simple to run from both the command line and the Python API. For example, annotating a corpus with POS tagging, NER and dependency parsing can be performed with a simple command as in Figure \ref{fig:command}.
567
+
568
+ Assume that the input file ``{\ttfamily input.txt}'' in Figure \ref{fig:command} contains a sentence ``Tôi đang làm\_việc tại VinAI .'' (I\textsubscript{Tôi} am\textsubscript{đang} working\textsubscript{làm\_việc} at\textsubscript{tại} VinAI). Table \ref{tab:format} shows the annotated output in plain text form for this
569
+ sentence. Similarly, we obtain the same output by using the Python API, as shown in Figure \ref{fig:code}.
570
+ Furthermore, commands to (re-)train and evaluate PhoNLP using gold annotated corpora are detailed in the PhoNLP GitHub repository. Note that our PhoNLP (re-)training and evaluation command scripts can be directly employed for other languages that have gold annotated corpora available for the three tasks and a pre-trained BERT-based language model available from the $\mathrm{transformers}$ library.
571
+
572
+ %\setcounter{figure}{1}
573
+ \begin{figure}[!t]
574
+ %{\footnotesize\ttfamily python3 phonlp.py {-}{-}save\_dir model\_folder\_path {-}{-}mode annotate {-}{-}input\_file path\_to\_input\_file {-}{-}output\_file path\_to\_output\_file}
575
+ {\ttfamily python3 run\_phonlp.py {-}{-}save\_dir ./pretrained\_phonlp {-}{-}mode \\ annotate {-}{-}input\_file input.txt {-}{-}output\_file output.txt}
576
+ \caption{Minimal command to run PhoNLP. Here ``save\_dir'' denotes the path to the local machine folder that stores the pre-trained PhoNLP model.}
577
+ \label{fig:command}
578
+ \end{figure}
579
+
580
+ \begin{table}[!t]
581
+ \centering
582
+ %\resizebox{7.5cm}{!}{
583
+ \begin{tabular}{llllll}
584
+ 1 & Tôi & P & O & 3 & sub \\
585
+ 2 & đang & R & O & 3 & adv \\
586
+ 3 & làm\_việc & V & O & 0 & root \\
587
+ 4 & tại & E & O & 3 & loc \\
588
+ 5 & VinAI & Np & B-ORG & 4 & pob \\
589
+ 6 & . & CH & O & 3 & punct \\
590
+ \end{tabular}
591
+ %}
592
+ \caption{The output in the output file ``{\ttfamily output.txt}'' for the sentence ``Tôi đang làm\_việc tại VinAI .'' from the input file ``{\ttfamily input.txt}'' in Figure \ref{fig:command}. The output is formatted with 6 columns representing word index, word form, POS tag, NER label, head index of the current word and its dependency relation type.}
593
+ \label{tab:format}
594
+ \end{table}
595
+
596
+ \paragraph{Speed test:} We perform a sole CPU-based speed test using a personal computer with an Intel Core i5 8265U 1.6GHz CPU \& 8GB of memory. For a GPU-based speed test, we employ a machine with a single NVIDIA RTX 2080Ti GPU. Performing the three NLP tasks jointly, PhoNLP obtains speeds of {15 sentences per second} in the CPU-based test and {129 sentences per second} in the GPU-based test, with an average of 23 word tokens per sentence and a batch size of 8.
597
+
598
+ %\setcounter{figure}{2}
599
+ \begin{figure*}[!t]
600
+ %\begin{minted}{python}
601
+ %import phonlp
602
+ %# Automatically download the pre-trained PhoNLP model
603
+ %# and save it in a local machine folder
604
+ %phonlp.download(save_dir='./pretrained_phonlp')
605
+ %# Load the pre-trained PhoNLP model
606
+ %model = phonlp.load(save_dir='./pretrained_phonlp')
607
+ %# Annotate a corpus
608
+ %model.annotate(input_file='input.txt', output_file='output.txt')
609
+ %# Annotate a sentence
610
+ %model.print_out(model.annotate(text="Tôi đang làm_việc tại VinAI ."))
611
+ %\end{minted}
612
+ \begin{Verbatim}[commandchars=\\\{\}]
613
+ \PYG{k+kn}{import} \PYG{n+nn}{phonlp}
614
+ \PYG{c+c1}{\PYGZsh{} Automatically download the pre\PYGZhy{}trained PhoNLP model}
615
+ \PYG{c+c1}{\PYGZsh{} and save it in a local machine folder}
616
+ \PYG{n}{phonlp}\PYG{o}{.}\PYG{n}{download}\PYG{p}{(}\PYG{n}{save\PYGZus{}dir}\PYG{o}{=}\PYG{l+s+s1}{\PYGZsq{}./pretrained\PYGZus{}phonlp\PYGZsq{}}\PYG{p}{)}
617
+ \PYG{c+c1}{\PYGZsh{} Load the pre\PYGZhy{}trained PhoNLP model}
618
+ \PYG{n}{model} \PYG{o}{=} \PYG{n}{phonlp}\PYG{o}{.}\PYG{n}{load}\PYG{p}{(}\PYG{n}{save\PYGZus{}dir}\PYG{o}{=}\PYG{l+s+s1}{\PYGZsq{}./pretrained\PYGZus{}phonlp\PYGZsq{}}\PYG{p}{)}
619
+ \PYG{c+c1}{\PYGZsh{} Annotate a corpus}
620
+ \PYG{n}{model}\PYG{o}{.}\PYG{n}{annotate}\PYG{p}{(}\PYG{n}{input\PYGZus{}file}\PYG{o}{=}\PYG{l+s+s1}{\PYGZsq{}input.txt\PYGZsq{}}\PYG{p}{,} \PYG{n}{output\PYGZus{}file}\PYG{o}{=}\PYG{l+s+s1}{\PYGZsq{}output.txt\PYGZsq{}}\PYG{p}{)}
621
+ \PYG{c+c1}{\PYGZsh{} Annotate a sentence}
622
+ \PYG{n}{model}\PYG{o}{.}\PYG{n}{print\PYGZus{}out}\PYG{p}{(}\PYG{n}{model}\PYG{o}{.}\PYG{n}{annotate}\PYG{p}{(}\PYG{n}{text}\PYG{o}{=}\PYG{l+s+s2}{\PYGZdq{}Tôi đang làm\PYGZus{}việc tại VinAI .\PYGZdq{}}\PYG{p}{))}
623
+ \end{Verbatim}
624
+ %\vspace{-5pt}
625
+ \caption{A simple and complete example code for using PhoNLP in Python.}
626
+ \label{fig:code}
627
+ \end{figure*}
628
+
629
+ \section{Conclusion and future work}
630
+
631
+
632
+ We have presented the first multi-task learning model PhoNLP for joint POS tagging, NER and dependency parsing in Vietnamese. Experiments on Vietnamese benchmark datasets show that PhoNLP outperforms its strong fine-tuned PhoBERT-based single-task training baseline, producing state-of-the-art performance results. We publicly release PhoNLP as an easy-to-use open-source toolkit and hope that PhoNLP can facilitate future NLP research and applications. % such as in question answering and dialogue systems.
633
+ In future work, we will also apply PhoNLP to other languages.
634
+
635
+ %Although we specify PhoNLP for Vietnamese, the PhoNLP (re-)training and evaluation command scripts in fact can directly work for other languages that have gold annotated corpora available for the three tasks of POS tagging, NER and dependency parsing, and a pre-trained BERT-based language model available from the $\mathrm{transformers}$ library. In future work, we will apply our PhoNLP toolkit to those languages.
636
+
637
+
638
+ \bibliography{refs}
639
+ \bibliographystyle{acl_natbib}
640
+
641
+ \end{document}
references/2021.naacl.nguyen/source/refs.bib ADDED
@@ -0,0 +1,625 @@
1
+ @inproceedings{3184558.3191535,
2
+ author = {Le-Hong, Phuong and Bui, Duc-Thien},
3
+ title = {{A Factoid Question Answering System for Vietnamese}},
4
+ year = {2018},
5
+ booktitle = {Companion Proceedings of the The Web Conference 2018},
6
+ pages = {1049--1055},
7
+ }
8
+
9
+ @InProceedings{NguyenNMP11,
10
+ title = {{Automatic Ontology Construction from Vietnamese text}},
11
+ author = {Dai Quoc Nguyen and Dat Quoc Nguyen and Khoi Trong Ma and Son Bao Pham},
12
+ booktitle = {Proceedings of NLPKE},
13
+ year = {2011},
14
+ pages = {485--488}
15
+ }
16
+
17
+ @inproceedings{vitext2sql,
18
+ title = {{A Pilot Study of Text-to-SQL Semantic Parsing for Vietnamese}},
19
+ author = {Anh Tuan Nguyen and Mai Hoang Dao and Dat Quoc Nguyen},
20
+ booktitle = {Findings of EMNLP 2020},
21
+ year = {2020},
22
+ pages = {4079--4085}
23
+ }
24
+
25
+ @article{NguyenNP_SWJ,
26
+ title = {{Ripple Down Rules for Question Answering}},
27
+ author = {Nguyen, Dat Quoc and Nguyen, Dai Quoc and Pham, Son Bao},
28
+ journal = {Semantic Web},
29
+ volume = {8},
30
+ number = {4},
31
+ pages = {511--532},
32
+ year = {2017}
33
+ }
34
+
35
+ @inproceedings{3155133.3155171,
36
+ author = {Truong, Diem and Vo, Duc-Thuan and Nguyen, Uyen Trang},
37
+ title = {Vietnamese Open Information Extraction},
38
+ year = {2017},
39
+ booktitle = {Proceedings of SoICT},
40
+ pages = {135--142}
41
+ }
42
+
43
+ @PhdThesis{Ruder2019Neural,
44
+ title={Neural Transfer Learning for Natural Language Processing},
45
+ author={Ruder, Sebastian},
46
+ year={2019},
47
+ school={National University of Ireland, Galway}
48
+ }
49
+
50
+ @INPROCEEDINGS{7800281,
51
+ author={Viet Hong Tran and Huyen Thuong Vu and Thu Hoai Pham and Vinh Van Nguyen and Minh Le Nguyen},
52
+ booktitle={Proceedings of RIVF},
53
+ title={{A reordering model for Vietnamese-English statistical machine translation using dependency information}},
54
+ year={2016},
55
+ pages = {125--130}
56
+ }
57
+
58
+
59
+ @article{BANG20182016IIP0038,
60
+ title={Sentiment Classification for Hotel Booking Review Based on Sentence Dependency Structure and Sub-Opinion Analysis},
61
+ author={Tran Sy Bang and Virach Sornlertlamvanich},
62
+ journal={IEICE Transactions on Information and Systems},
63
+ volume={E101.D},
64
+ number={4},
65
+ pages = {909--916},
66
+ year={2018}
67
+ }
68
+
69
+ @INPROCEEDINGS{9287471,
70
+ author={Huong Duong To and Phuc Do},
71
+ booktitle={Proceedings of KSE},
72
+ title={Extracting triples from Vietnamese text to create knowledge graph},
73
+ year={2020},
74
+ pages = {219--223}}
75
+
76
+
77
+ @article{chuliu,
78
+ title={{On the Shortest Arborescence of a Directed Graph}},
79
+ author={Yoeng-Jin Chu and Tseng-Hong Liu},
80
+ journal={Science Sinica},
81
+ volume={14},
82
+ year = {1965},
83
+ pages={1396--1400}
84
+ }
85
+
86
+ @inproceedings{vielectra,
87
+ title = {{Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models}},
88
+ author = "Viet Bui The and Oanh Tran Thi and Phuong Le-Hong",
89
+ booktitle = "Proceedings of PACLIC 2020",
90
+ year = "2020"
91
+ }
92
+
93
+
94
+
95
+ @inproceedings{wolf-etal-2020-transformers,
96
+ title = {{Transformers: State-of-the-Art Natural Language Processing}},
97
+ author = "Thomas Wolf and Lysandre Debut and others",
98
+ booktitle = "Proceedings of EMNLP 2020: System Demonstrations",
99
+ year = "2020",
100
+ pages = "38--45"
101
+ }
102
+
103
+ @incollection{NEURIPS2019_9015,
104
+ title = {{PyTorch: An Imperative Style, High-Performance Deep Learning Library}},
105
+ author = {Paszke, Adam and Gross, Sam and
106
+ others},
107
+ booktitle = {Proceedings of NeurIPS 2019},
108
+ pages = {8024--8035},
109
+ year = {2019}
110
+ }
111
+
112
+ @inproceedings{nguyen-etal-2009-building,
113
+ title = {{Building a Large Syntactically-Annotated Corpus of {V}ietnamese}},
114
+ author = "Nguyen, Phuong-Thai and
115
+ Vu, Xuan-Luong and
116
+ Nguyen, Thi-Minh-Huyen and
117
+ Nguyen, Van-Hiep and
118
+ Le, Hong-Phuong",
119
+ booktitle = "Proceedings of {LAW}",
120
+ year = "2009",
121
+ pages = "182--185",
122
+ }
123
+
124
+ @inproceedings{li-etal-2018-joint-learning,
125
+ title = {{Joint Learning of {POS} and Dependencies for Multilingual {U}niversal {D}ependency Parsing}},
126
+ author = "Li, Zuchao and
127
+ He, Shexia and
128
+ Zhang, Zhuosheng and
129
+ Zhao, Hai",
130
+ booktitle = "Proceedings of the {C}o{NLL} 2018 Shared Task",
131
+ year = "2018",
132
+ pages = "65--73"
133
+ }
134
+
135
+ @article{Edmonds,
136
+ title={{Optimum Branchings}},
137
+ author={Jack Edmonds},
138
+ journal={Journal of Research of the National Bureau of Standards},
139
+ volume={71},
140
+ year = {1967},
141
+ pages={233--240}
142
+ }
143
+
144
+ @inproceedings{phobert,
145
+ title = {{PhoBERT: Pre-trained language models for Vietnamese}},
146
+ author = {Dat Quoc Nguyen and Anh Tuan Nguyen},
147
+ booktitle = "Findings of EMNLP 2020",
148
+ year = {2020},
149
+ pages = {1037--1042}
150
+ }
151
+
152
+ @inproceedings{kondratyuk-straka-2019-75,
153
+ title = {{75 Languages, 1 Model: Parsing {U}niversal {D}ependencies Universally}},
154
+ author = "Kondratyuk, Dan and
155
+ Straka, Milan",
156
+ booktitle = "Proceedings of EMNLP-IJCNLP",
157
+ year = "2019",
158
+ pages = "2779--2795"
159
+ }
160
+
161
+ @inproceedings{zhang-weiss-2016-stack,
162
+ title = "Stack-propagation: Improved Representation Learning for Syntax",
163
+ author = "Zhang, Yuan and
164
+ Weiss, David",
165
+ booktitle = "Proceedings of ACL",
166
+ year = "2016",
167
+ pages = "1557--1566",
168
+ }
169
+
170
+ @InProceedings{nguyenverspoorK18,
171
+ author = {Nguyen, Dat Quoc and Verspoor, Karin},
172
+ title = {{An Improved Neural Network Model for Joint {POS} Tagging and Dependency Parsing}},
173
+ booktitle = {Proceedings of the {CoNLL} 2018 Shared Task},
174
+ year = {2018},
175
+ pages = {81--91}
176
+ }
177
+
178
+ @InProceedings{NguyenALTA2019,
179
+ title = {{A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing}},
180
+ author = {Dat Quoc Nguyen},
181
+ booktitle = {Proceedings of ALTA},
182
+ year = {2019},
183
+ pages = {28--34}
184
+ }
185
+
186
+ @inproceedings{hashimoto-etal-2017-joint,
187
+ title = {{A Joint Many-Task Model: Growing a Neural Network for Multiple {NLP} Tasks}},
188
+ author = "Hashimoto, Kazuma and
189
+ Xiong, Caiming and
190
+ Tsuruoka, Yoshimasa and
191
+ Socher, Richard",
192
+ booktitle = "Proceedings of EMNLP",
193
+ year = "2017",
194
+ pages = "1923--1933"
195
+ }
196
+
197
+ @inproceedings{qi-etal-2020-stanza,
198
+ title = {{S}tanza: A Python Natural Language Processing Toolkit for Many Human Languages},
199
+ author = "Qi, Peng and
200
+ Zhang, Yuhao and
201
+ Zhang, Yuhui and
202
+ Bolton, Jason and
203
+ Manning, Christopher D.",
204
+ booktitle = "Proceedings of ACL: System Demonstrations",
205
+ year = "2020",
206
+ pages = "101--108"
207
+ }
208
+
209
+ @inproceedings{qi-etal-2018-universal,
210
+ title = "{U}niversal {D}ependency Parsing from Scratch",
211
+ author = "Qi, Peng and
212
+ Dozat, Timothy and
213
+ Zhang, Yuhao and
214
+ Manning, Christopher D.",
215
+ booktitle = "Proceedings of the {C}o{NLL} 2018 Shared Task",
216
+ year = "2018",
217
+ pages = "160--170"
218
+ }
219
+
220
+ @inproceedings{Lafferty:2001,
221
+ author = {Lafferty, John D. and McCallum, Andrew and Pereira, Fernando C. N.},
222
+ title = {{Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data}},
223
+ booktitle = {Proceedings of ICML},
224
+ year = {2001},
225
+ pages = {282--289}
226
+ }
227
+
228
+ @article{Wolf2019HuggingFacesTS,
229
+ title={{HuggingFace's Transformers: State-of-the-art Natural Language Processing}},
230
+ author={Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and R{\'e}mi Louf and Morgan Funtowicz and Jamie Brew},
231
+ journal={arXiv preprint},
232
+ year={2019},
233
+ volume={arXiv:1910.03771}
234
+ }
235
+
236
+ @inproceedings{kudo-richardson-2018-sentencepiece,
237
+ title = {{{S}entence{P}iece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing}},
238
+ author = "Kudo, Taku and
239
+ Richardson, John",
240
+ booktitle = "Proceedings of EMNLP: System Demonstrations",
241
+ year = "2018",
242
+ pages = "66--71"
243
+ }
244
+
245
+ @inproceedings{wang-etal-2019-tree,
246
+ title = {{Tree Transformer: Integrating Tree Structures into Self-Attention}},
247
+ author = "Wang, Yaushian and
248
+ Lee, Hung-Yi and
249
+ Chen, Yun-Nung",
250
+ booktitle = "Proceedings of EMNLP-IJCNLP",
251
+ year = "2019",
252
+ pages = "1061--1070",
253
+ }
254
+
255
+ @incollection{NIPS2017_7181,
256
+ title = {{Attention is All you Need}},
257
+ author = {Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, {\L}ukasz and Polosukhin, Illia},
258
+ booktitle = {Advances in Neural Information Processing Systems 30},
259
+ pages = {5998--6008},
260
+ year = {2017},
261
+ }
262
+
263
+
264
+ @inproceedings{jawahar-etal-2019-bert,
265
+ title = "What Does {BERT} Learn about the Structure of Language?",
266
+ author = "Jawahar, Ganesh and
267
+ Sagot, Beno{\^\i}t and
268
+ Seddah, Djam{\'e}",
269
+ booktitle = "Proceedings of ACL",
270
+ year = "2019",
271
+ pages = "3651--3657"
272
+ }
273
+
274
+ @inproceedings{hewitt-manning-2019-structural,
275
+ title = "{A} Structural Probe for Finding Syntax in Word Representations",
276
+ author = "Hewitt, John and
277
+ Manning, Christopher D.",
278
+ booktitle = "Proceedings of NAACL",
279
+ year = "2019",
280
+ pages = "4129--4138"
281
+ }
282
+
283
+ @inproceedings{ma-etal-2018-stack,
284
+ title = {{Stack-Pointer Networks for Dependency Parsing}},
285
+ author = "Ma, Xuezhe and
286
+ Hu, Zecong and
287
+ Liu, Jingzhou and
288
+ Peng, Nanyun and
289
+ Neubig, Graham and
290
+ Hovy, Eduard",
291
+ booktitle = "Proceedings of ACL",
292
+ year = "2018",
293
+ pages = "1403--1414"
294
+ }
295
+
296
+ @article{lewis2019mlqa,
297
+ title={{MLQA: Evaluating Cross-lingual Extractive Question Answering}},
298
+ author={Lewis, Patrick and O{\u{g}}uz, Barlas and Rinott, Ruty and Riedel, Sebastian and Schwenk, Holger},
299
+ journal={arXiv preprint},
300
+ volume={arXiv:1910.07475},
301
+ year={2019}
302
+ }
303
+
304
+ @InProceedings{Nguyen2014NLDB,
305
+ author = {Nguyen, Dat Quoc and Nguyen, Dai Quoc and Pham, Son Bao and Nguyen, Phuong-Thai and Nguyen, Minh Le},
306
+ title = {{From Treebank Conversion to Automatic Dependency Parsing for Vietnamese}},
307
+ booktitle = {{Proceedings of NLDB}},
308
+ year = {2014},
309
+ pages = {196--207}
310
+ }
311
+
312
+
313
+ @InProceedings{N18-1101,
314
+ author = "Williams, Adina
315
+ and Nangia, Nikita
316
+ and Bowman, Samuel",
317
+ title = {{A Broad-Coverage Challenge Corpus for
318
+ Sentence Understanding through Inference}},
319
+ booktitle = "Proceedings of NAACL",
320
+ year = "2018",
321
+ pages = "1112--1122",
322
+ }
323
+
324
+ @inproceedings{rajpurkar-etal-2016-squad,
325
+ title = "{SQ}u{AD}: 100,000+ Questions for Machine Comprehension of Text",
326
+ author = "Rajpurkar, Pranav and
327
+ Zhang, Jian and
328
+ Lopyrev, Konstantin and
329
+ Liang, Percy",
330
+ booktitle = "Proceedings of EMNLP",
331
+ year = "2016",
332
+ pages = "2383--2392",
333
+ }
334
+
335
+ @inproceedings{DozatM17,
336
+ author = {Timothy Dozat and
337
+ Christopher D. Manning},
338
+ title = {{Deep Biaffine Attention for Neural Dependency Parsing}},
339
+ booktitle = {Proceedings of ICLR},
340
+ year = {2017}
341
+ }
342
+
343
+ @inproceedings{DinhQuangThang2008,
344
+ author = {Dinh Quang Thang and Phuong, Le Hong and Nguyen Thi Minh Huyen and Tu, Nguyen Cam and Rossignol, Mathias and Luong, Vu Xuan},
345
+ booktitle = {Proceedings of LREC},
346
+ pages = {1933--1936},
347
+ title = {{Word segmentation of Vietnamese texts: a comparison of approaches}},
348
+ year = {2008}
349
+ }
350
+
351
+
352
+ @inproceedings{peters-etal-2018-deep,
353
+ title = {{Deep Contextualized Word Representations}},
354
+ author = "Peters, Matthew and
355
+ Neumann, Mark and
356
+ Iyyer, Mohit and
357
+ Gardner, Matt and
358
+ Clark, Christopher and
359
+ Lee, Kenton and
360
+ Zettlemoyer, Luke",
361
+ booktitle = "Proceedings of NAACL",
362
+ year = "2018",
363
+ pages = "2227--2237"
364
+ }
365
+
366
+ @article{abs-1906-08101,
367
+ author = {Yiming Cui and
368
+ Wanxiang Che and
369
+ Ting Liu and
370
+ Bing Qin and
371
+ Ziqing Yang and
372
+ Shijin Wang and
373
+ Guoping Hu},
374
+ title = {{Pre-Training with Whole Word Masking for Chinese BERT}},
375
+ journal={arXiv preprint},
376
+ volume = {arXiv:1906.08101},
377
+ year = {2019}
378
+ }
379
+
380
+ @inproceedings{le2019flaubert,
381
+ title={{FlauBERT: Unsupervised Language Model Pre-training for French}},
382
+ author={Hang Le and Lo\"ic Vial and others},
383
+ booktitle = {Proceedings of LREC},
384
+ year = "2020",
385
+ pages = {2479--2490}
386
+ }
387
+
388
+ @inproceedings{conneau2019unsupervised,
389
+ title={{Unsupervised Cross-lingual Representation Learning at Scale}},
390
+ author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
391
+ booktitle = {Proceedings of ACL},
392
+ year = "2020",
393
+ pages = {8440--8451},
394
+ url={https://arxiv.org/pdf/1911.02116v1.pdf}
395
+ }
396
+
397
+ @inproceedings{vu-xuan-etal-2019-etnlp,
398
+ title = "{ETNLP}: A Visual-Aided Systematic Approach to Select Pre-Trained Embeddings for a Downstream Task",
399
+ author = "Vu, Xuan-Son and
400
+ Vu, Thanh and
401
+ Tran, Son and
402
+ Jiang, Lili",
403
+ booktitle = "Proceedings of RANLP",
404
+ year = "2019",
405
+ pages = "1285--1294"
406
+ }
407
+
408
+ @inproceedings{nguyen-2019-neural,
409
+ title = "A neural joint model for {V}ietnamese word segmentation, {POS} tagging and dependency parsing",
410
+ author = "Nguyen, Dat Quoc",
411
+ booktitle = "Proceedings of ALTA",
412
+ year = "2019",
413
+ pages = "28--34",
414
+ }
415
+
416
+
417
+ @inproceedings{nguyen-etal-2017-word,
418
+ title = "From Word Segmentation to {POS} Tagging for {V}ietnamese",
419
+ author = "Nguyen, Dat Quoc and
420
+ Vu, Thanh and
421
+ Nguyen, Dai Quoc and
422
+ Dras, Mark and
423
+ Johnson, Mark",
424
+ booktitle = "Proceedings of ALTA",
425
+ year = "2017",
426
+ pages = "108--113",
427
+ }
428
+
429
+ @inproceedings{nguyen-verspoor-2018-improved,
430
+ title = "An Improved Neural Network Model for Joint {POS} Tagging and Dependency Parsing",
431
+ author = "Nguyen, Dat Quoc and
432
+ Verspoor, Karin",
433
+ booktitle = "Proceedings of the {C}o{NLL} 2018 Shared Task",
434
+ year = "2018",
435
+ pages = "81--91"
436
+ }
437
+
438
+ @inproceedings{ma-hovy-2016-end,
439
+ title = "End-to-end Sequence Labeling via Bi-directional {LSTM}-{CNN}s-{CRF}",
440
+ author = "Ma, Xuezhe and
441
+ Hovy, Eduard",
442
+ booktitle = "Proceedings of ACL",
443
+ year = "2016",
444
+ pages = "1064--1074",
445
+ }
446
+
447
+ @inproceedings{nguyen-etal-2014-rdrpostagger,
448
+ title = {{RDRPOSTagger: A Ripple Down Rules-based Part-Of-Speech Tagger}},
449
+ author = "Nguyen, Dat Quoc and
450
+ Nguyen, Dai Quoc and
451
+ Pham, Dang Duc and
452
+ Pham, Son Bao",
453
+ booktitle = "Proceedings of the Demonstrations at EACL",
454
+ year = "2014",
455
+ pages = "17--20"
456
+ }
457
+
458
+ @inproceedings{sennrich-etal-2016-neural,
459
+ title = {{Neural Machine Translation of Rare Words with Subword Units}},
460
+ author = "Sennrich, Rico and
461
+ Haddow, Barry and
462
+ Birch, Alexandra",
463
+ booktitle = "Proceedings of ACL",
464
+ year = "2016",
465
+ pages = "1715--1725",
466
+ }
467
+
468
+ @inproceedings{conneau-etal-2018-xnli,
469
+ title = "{XNLI}: Evaluating Cross-lingual Sentence Representations",
470
+ author = "Alexis Conneau and Ruty Rinott and Guillaume Lample and Holger Schwenk and Ves Stoyanov and Adina Williams and Samuel R. Bowman",
471
+ booktitle = "Proceedings of EMNLP",
472
+ year = "2018",
473
+ pages = "2475--2485"
474
+ }
475
+
476
+ @article{ArtetxeS19,
477
+ author = {Mikel Artetxe and
478
+ Holger Schwenk},
479
+ title = {{Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual
480
+ Transfer and Beyond}},
481
+ journal = {{TACL}},
482
+ volume = {7},
483
+ pages = {597--610},
484
+ year = {2019}
485
+ }
486
+
487
+ @inproceedings{NIPS2019_8928,
488
+ title = {{Cross-lingual Language Model Pretraining}},
489
+ author = {Conneau, Alexis and Lample, Guillaume},
490
+ booktitle = {Proceedings of NeurIPS},
491
+ pages = {7059--7069},
492
+ year = {2019},
493
+ }
494
+
495
+
496
+ @inproceedings{wu-dredze-2019-beto,
497
+ title = "Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of {BERT}",
498
+ author = "Wu, Shijie and
499
+ Dredze, Mark",
500
+ booktitle = "Proceedings of EMNLP-IJCNLP",
501
+ year = "2019",
502
+ pages = "833--844"
503
+ }
504
+
505
+ @inproceedings{2019arXiv191103894M,
506
+ author = {{Martin}, Louis and {Muller}, Benjamin and
507
+ {Ortiz Su{\'a}rez}, Pedro Javier and {Dupont}, Yoann and
508
+ {Romary}, Laurent and {Villemonte de la Clergerie}, {\'E}ric and
509
+ {Seddah}, Djam{\'e} and {Sagot}, Beno{\^\i}t},
510
+ title = "{CamemBERT: a Tasty French Language Model}",
511
+ booktitle = {Proceedings of ACL},
512
+ year = "2020",
513
+ pages = {7203--7219}
514
+ }
515
+
516
+ @ARTICLE{vries2019bertje,
517
+ title={{BERTje: A Dutch BERT Model}},
518
+ author={Wietse de Vries and Andreas van Cranenburgh and Arianna Bisazza and Tommaso Caselli and Gertjan van Noord and Malvina Nissim},
519
+ year={2019},
520
+ volume={arXiv:1912.09582},
521
+ journal={arXiv preprint}
522
+ }
523
+
524
+ @INPROCEEDINGS{loshchilov2018decoupled,
525
+ title={{Decoupled Weight Decay Regularization}},
526
+ author={Ilya Loshchilov and Frank Hutter},
527
+ booktitle={Proceedings of ICLR},
528
+ year={2019},
529
+ }
530
+
531
+ @INPROCEEDINGS{8713740,
532
+ author={Kim Anh Nguyen and Ngan Dong and Cam-Tu Nguyen},
533
+ booktitle={Proceedings of RIVF},
534
+ title={{Attentive Neural Network for Named Entity Recognition in Vietnamese}},
535
+ year={2019}
536
+ }
537
+
538
+ @article{JCC13161,
539
+ author = {Huyen Nguyen and Quyen Ngo and Luong Vu and Vu Tran and Hien Nguyen},
540
+ title = {{VLSP Shared Task: Named Entity Recognition}},
541
+ journal = {Journal of Computer Science and Cybernetics},
542
+ volume = {34},
543
+ number = {4},
544
+ year = {2019},
545
+ pages = {283--294}
546
+ }
547
+
548
+ @article{JCC13163,
549
+ author = {Minh Quang Nhat Pham},
550
+ title = {{A Feature-Based Model for Nested Named-Entity Recognition at VLSP-2018 NER Evaluation Campaign}},
551
+ journal = {Journal of Computer Science and Cybernetics},
552
+ volume = {34},
553
+ number = {4},
554
+ year = {2019},
555
+ pages = {311--321},
556
+ }
557
+
558
+ @article{KingmaB14,
559
+ author = {Diederik P. Kingma and
560
+ Jimmy Ba},
561
+ title = {{Adam: {A} Method for Stochastic Optimization}},
562
+ journal = {arXiv preprint},
563
+ volume = {arXiv:1412.6980},
564
+ year = {2014}
565
+ }
566
+
567
+ @inproceedings{ott2019fairseq,
568
+ title = {{fairseq: A Fast, Extensible Toolkit for Sequence Modeling}},
569
+ author = {Myle Ott and Sergey Edunov and Alexei Baevski and Angela Fan and Sam Gross and Nathan Ng and David Grangier and Michael Auli},
570
+ booktitle = {Proceedings of NAACL-HLT 2019: Demonstrations},
571
+ year = {2019},
572
+ pages={48--53}
573
+ }
574
+
575
+ @inproceedings{devlin-etal-2019-bert,
576
+ title = {{BERT}: Pre-training of Deep Bidirectional Transformers for Language Understanding},
577
+ author = "Devlin, Jacob and
578
+ Chang, Ming-Wei and
579
+ Lee, Kenton and
580
+ Toutanova, Kristina",
581
+ booktitle = "Proceedings of NAACL",
582
+ year = "2019",
583
+ pages = "4171--4186",
584
+ }
585
+
586
+ @article{RoBERTa,
587
+ author = {Yinhan Liu and
588
+ Myle Ott and
589
+ Naman Goyal and
590
+ Jingfei Du and
591
+ Mandar Joshi and
592
+ Danqi Chen and
593
+ Omer Levy and
594
+ Mike Lewis and
595
+ Luke Zettlemoyer and
596
+ Veselin Stoyanov},
597
+ title = {{RoBERTa: {A} Robustly Optimized {BERT} Pretraining Approach}},
598
+ journal = {arXiv preprint},
599
+ volume = {arXiv:1907.11692},
600
+ year = {2019}
601
+ }
602
+
603
+ @inproceedings{nguyen-etal-2018-fast,
604
+ title = {{A Fast and Accurate Vietnamese Word Segmenter}},
605
+ author = "Nguyen, Dat Quoc and
606
+ Nguyen, Dai Quoc and
607
+ Vu, Thanh and
608
+ Dras, Mark and
609
+ Johnson, Mark",
610
+ booktitle = "Proceedings of LREC",
611
+ year = "2018",
612
+ pages = "2582--2587"
613
+ }
614
+
615
+ @inproceedings{vu-etal-2018-vncorenlp,
616
+ title = {{VnCoreNLP: A Vietnamese Natural Language Processing Toolkit}},
617
+ author = "Vu, Thanh and
618
+ Nguyen, Dat Quoc and
619
+ Nguyen, Dai Quoc and
620
+ Dras, Mark and
621
+ Johnson, Mark",
622
+ booktitle = "Proceedings of NAACL: Demonstrations",
623
+ year = "2018",
624
+ pages = "56--60"
625
+ }
references/README.md ADDED
@@ -0,0 +1,43 @@
1
+ # References
2
+
3
+ Reference materials for Vietnamese POS Tagger (TRE-1).
4
+
5
+ ## Papers
6
+
7
+ | Folder | Title | Authors | Year |
8
+ |--------|-------|---------|------|
9
+ | [2001.icml.lafferty](2001.icml.lafferty/) | Conditional Random Fields | Lafferty, McCallum, Pereira | 2001 |
10
+ | [2014.eacl.nguyen](2014.eacl.nguyen/) | RDRPOSTagger | Nguyen et al. | 2014 |
11
+ | [2018.naacl.vu](2018.naacl.vu/) | VnCoreNLP | Vu, Nguyen et al. | 2018 |
12
+ | [2020.emnlp.nguyen](2020.emnlp.nguyen/) | PhoBERT | Nguyen & Nguyen | 2020 |
13
+ | [2021.naacl.nguyen](2021.naacl.nguyen/) | PhoNLP | Nguyen & Nguyen | 2021 |
14
+
15
+ Each paper folder contains:
16
+ - `paper.md` - Markdown with YAML front matter (for LLM/RAG)
17
+ - `paper.tex` - LaTeX source (original from arXiv or generated)
18
+ - `paper.pdf` - PDF file
19
+ - `source/` - Full arXiv source (if available)
20
+
21
+ ## Resources
22
+
23
+ | File | Title | Type |
24
+ |------|-------|------|
25
+ | [universal_dependencies.md](universal_dependencies.md) | Universal Dependencies | Annotation Framework |
26
+ | [underthesea.md](underthesea.md) | Underthesea | Vietnamese NLP Toolkit |
27
+ | [python_crfsuite.md](python_crfsuite.md) | python-crfsuite | CRF Library |
28
+
29
+ ## Research Notes
30
+
31
+ | Folder | Description |
32
+ |--------|-------------|
33
+ | [research_vietnamese_pos](research_vietnamese_pos/) | Literature review on Vietnamese POS tagging |
34
+
35
+ ## Vietnamese POS Tagging Benchmarks
36
+
37
+ | Model | Dataset | Accuracy | Year |
38
+ |-------|---------|----------|------|
39
+ | PhoNLP | VLSP 2013 | 96.91% | 2021 |
40
+ | PhoBERT-large | VLSP 2013 | 96.8% | 2020 |
41
+ | VnMarMoT | VLSP 2013 | 95.88% | 2018 |
42
+ | **TRE-1** | **UDD-1** | **95.89%** | **2026** |
43
+ | RDRPOSTagger | VLSP 2013 | 95.11% | 2014 |
references/python_crfsuite.md ADDED
@@ -0,0 +1,131 @@
1
+ ---
2
+ title: "python-crfsuite"
3
+ type: "resource"
4
+ url: "https://github.com/scrapinghub/python-crfsuite"
5
+ ---
6
+
7
+ ## Overview
8
+
9
+ python-crfsuite provides Python bindings for the CRFsuite conditional random field toolkit, enabling efficient sequence labeling in Python.
10
+
11
+ ## Key Information
12
+
13
+ | Field | Value |
14
+ |-------|-------|
15
+ | **GitHub** | https://github.com/scrapinghub/python-crfsuite |
16
+ | **PyPI** | https://pypi.org/project/python-crfsuite/ |
17
+ | **Documentation** | https://python-crfsuite.readthedocs.io/ |
18
+ | **License** | MIT (python-crfsuite), BSD (CRFsuite) |
19
+ | **Latest Version** | 0.9.12 (December 2025) |
20
+ | **Stars** | 771 |
21
+
22
+ ## Features
23
+
24
+ - **Fast Performance**: Faster than official SWIG wrapper
25
+ - **No External Dependencies**: CRFsuite bundled; NumPy/SciPy not required
26
+ - **Python 2 & 3 Support**: Works with both Python versions
27
+ - **Cython-based**: High-performance C++ bindings
28
+
29
+ ## Installation
30
+
31
+ ```bash
32
+ # Using pip
33
+ pip install python-crfsuite
34
+
35
+ # Using conda
36
+ conda install -c conda-forge python-crfsuite
37
+ ```
38
+
39
+ ## Usage
40
+
41
+ ### Training
42
+
43
+ ```python
44
+ import pycrfsuite
45
+
46
+ # Create trainer
47
+ trainer = pycrfsuite.Trainer(verbose=True)
48
+
49
+ # Add training data
50
+ for xseq, yseq in zip(X_train, y_train):
51
+     trainer.append(xseq, yseq)
52
+
53
+ # Set parameters
54
+ trainer.set_params({
55
+     'c1': 1.0,    # L1 regularization
56
+     'c2': 0.001,  # L2 regularization
57
+     'max_iterations': 100,
58
+     'feature.possible_transitions': True
59
+ })
60
+
61
+ # Train model
62
+ trainer.train('model.crfsuite')
63
+ ```
64
+
65
+ ### Inference
66
+
67
+ ```python
68
+ import pycrfsuite
69
+
70
+ # Load model
71
+ tagger = pycrfsuite.Tagger()
72
+ tagger.open('model.crfsuite')
73
+
74
+ # Predict
75
+ y_pred = tagger.tag(x_seq)
76
+ ```
77
+
78
+ ### Feature Format
79
+
80
+ Features are lists of strings in `name=value` format:
81
+
82
+ ```python
83
+ features = [
84
+     ['word=hello', 'pos=NN', 'is_capitalized=True'],
85
+     ['word=world', 'pos=NN', 'is_capitalized=False'],
86
+ ]
87
+ ```
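In practice, each token's feature list is produced by a small extraction function over a context window before being handed to the trainer or tagger. A minimal sketch (the feature names and the ±1-token window here are illustrative choices, not a prescribed template set):

```python
def word2features(sent, i):
    """Build the `name=value` feature list for token i of a token list."""
    word = sent[i]
    feats = [
        'bias',
        'word=' + word.lower(),
        'word.istitle=%s' % word.istitle(),
        'suffix3=' + word[-3:],
    ]
    # Context window of one token on each side, with boundary markers.
    feats.append('-1:word=' + sent[i - 1].lower() if i > 0 else 'BOS')
    feats.append('+1:word=' + sent[i + 1].lower() if i < len(sent) - 1 else 'EOS')
    return feats

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]
```

The output of `sent2features(sentence)` is exactly the shape `trainer.append(...)` expects at training time and `tagger.tag(...)` expects at inference time.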
88
+
89
+ ## Training Algorithms
90
+
91
+ | Algorithm | Description |
92
+ |-----------|-------------|
93
+ | `lbfgs` | Limited-memory BFGS (default) |
94
+ | `l2sgd` | SGD with L2 regularization |
95
+ | `ap` | Averaged Perceptron |
96
+ | `pa` | Passive Aggressive |
97
+ | `arow` | Adaptive Regularization of Weights |
98
+
99
+ ## Parameters (L-BFGS)
100
+
101
+ | Parameter | Default | Description |
102
+ |-----------|---------|-------------|
103
+ | `c1` | 0 | L1 regularization coefficient |
104
+ | `c2` | 1.0 | L2 regularization coefficient |
105
+ | `max_iterations` | unlimited | Maximum iterations |
106
+ | `num_memories` | 6 | Number of memories for L-BFGS |
107
+ | `epsilon` | 1e-5 | Convergence threshold |
108
+
109
+ ## Related Projects
110
+
111
+ - **sklearn-crfsuite**: Scikit-learn compatible wrapper
112
+ - **CRFsuite**: Original C++ implementation
113
+
114
+ ## Citation
115
+
116
+ ```bibtex
117
+ @misc{python-crfsuite,
118
+ author = {Scrapinghub},
119
+ title = {python-crfsuite: Python binding to CRFsuite},
120
+ year = {2014},
121
+ publisher = {GitHub},
122
+ url = {https://github.com/scrapinghub/python-crfsuite}
123
+ }
124
+
125
+ @misc{crfsuite,
126
+ author = {Okazaki, Naoaki},
127
+ title = {CRFsuite: A fast implementation of Conditional Random Fields},
128
+ year = {2007},
129
+ url = {http://www.chokkan.org/software/crfsuite/}
130
+ }
131
+ ```
references/research_vietnamese_pos/README.md ADDED
@@ -0,0 +1,145 @@
1
+ # Literature Review: Vietnamese POS Tagging
2
+
3
+ **Date**: 2026-01-31
4
+ **Project**: TRE-1 Vietnamese POS Tagger
5
+
6
+ ## Executive Summary
7
+
8
+ This literature review surveys the state of the art in Vietnamese part-of-speech (POS) tagging. The field has evolved from rule-based systems (RDRPOSTagger, 2014) through feature-based CRF models (VnCoreNLP/VnMarMoT, 2018) to transformer-based approaches (PhoBERT, 2020; PhoNLP, 2021). The current best result on the VLSP 2013 benchmark is 96.91% accuracy, achieved by the PhoBERT-based PhoNLP model.
9
+
10
+ ## Research Questions
11
+
12
+ - **RQ1**: What is the current state-of-the-art for Vietnamese POS tagging?
13
+ - **RQ2**: How do CRF-based methods compare to neural/transformer approaches?
14
+ - **RQ3**: What datasets and benchmarks exist for evaluation?
15
+ - **RQ4**: What are the remaining challenges and research gaps?
16
+
17
+ ## Methodology
18
+
19
+ - **Search sources**: ACL Anthology, Semantic Scholar, arXiv, VLSP
20
+ - **Search terms**: "Vietnamese POS tagging", "PhoBERT", "VnCoreNLP", "VLSP 2013"
21
+ - **Timeframe**: 2014-2026
22
+ - **Inclusion criteria**: Peer-reviewed papers on Vietnamese POS tagging
23
+
24
+ ## PRISMA Flow
25
+
26
+ - Records identified: 30+
27
+ - Duplicates removed: 5
28
+ - Records screened: 25
29
+ - Studies included: 8 key papers (5 fetched, 3 referenced)
30
+
31
+ ---
32
+
33
+ ## Findings
34
+
35
+ ### RQ1: State-of-the-Art Results
36
+
37
+ | Model | Year | Method | VLSP 2013 Acc | Notes |
38
+ |-------|------|--------|---------------|-------|
39
+ | **PhoNLP** | 2021 | PhoBERT + MTL | **96.91%** | Multi-task with NER, DP |
40
+ | PhoBERT-large | 2020 | Transformer | 96.8% | Pre-trained LM |
41
+ | vELECTRA | 2020 | ELECTRA | 96.77% | FPT.AI |
42
+ | PhoBERT-base | 2020 | Transformer | 96.7% | VinAI Research |
43
+ | VnMarMoT | 2018 | CRF | 95.88% | Part of VnCoreNLP |
44
+ | **TRE-1** | 2026 | CRF | 95.89%* | This work (UDD-1) |
45
+ | RDRPOSTagger | 2014 | Rules (RDR) | 95.11% | Ripple Down Rules |
46
+ | Neural BiLSTM-CRF | 2018 | BiLSTM-CRF | 93.52% | Without PLM |
47
+
48
+ *Note: TRE-1 evaluated on UDD-1, not VLSP 2013. Direct comparison limited.
49
+
50
+ ### RQ2: CRF vs Neural Methods
51
+
52
+ #### Traditional/CRF Approaches
53
+
54
+ **Strengths:**
55
+ - Fast inference (90K words/sec in Java)
56
+ - Interpretable feature templates
57
+ - Low resource requirements
58
+ - No GPU needed
59
+
60
+ **Limitations:**
61
+ - Manual feature engineering required
62
+ - Limited context window (typically ±2 tokens)
63
+ - Cannot leverage pre-trained embeddings
64
+
65
+ #### Transformer-Based Approaches
66
+
67
+ **Strengths:**
68
+ - Leverage large-scale pre-training (PhoBERT: 20GB Vietnamese text)
69
+ - Capture long-range dependencies
70
+ - State-of-the-art accuracy (+1-2% over CRF)
71
+ - Transfer learning benefits
72
+
73
+ **Limitations:**
74
+ - Require GPU for training/inference
75
+ - Slower inference
76
+ - Larger model size (135M-370M parameters)
77
+ - Need more training data to fine-tune effectively
78
+
79
+ ### RQ3: Datasets and Benchmarks
80
+
81
+ | Dataset | Sentences | Domain | Annotation | Access |
82
+ |---------|-----------|--------|------------|--------|
83
+ | **VLSP 2013** | 27,870 | News | Manual | Request |
84
+ | VietTreeBank | 10,000+ | Mixed | Manual | Research |
85
+ | **UDD-1** | 20,000 | Legal+News | Machine | HuggingFace |
86
+ | VnDT v1.1 | 3,000 | News | Manual | Research |
87
+
88
+ **Note on Data Leakage**: VLSP 2016 NER and VnDT sentences are included in VLSP 2013, causing potential leakage issues.
89
+
90
+ ### RQ4: Challenges and Research Gaps
91
+
92
+ 1. **Lexical Ambiguity**: Many Vietnamese words can be multiple POS (e.g., "năm" = NUM "five" or NOUN "year")
93
+
94
+ 2. **Word Segmentation Dependency**: POS tagging requires pre-segmented input; errors propagate
95
+
96
+ 3. **Domain Adaptation**: Models trained on news/legal may underperform on social media, conversational text
97
+
98
+ 4. **Rare Tags**: PART, X tags have limited samples, causing lower performance
99
+
100
+ 5. **Benchmark Standardization**: Multiple datasets with different tagsets make comparison difficult
101
+
102
+ ---
103
+
104
+ ## Related Work Synthesis
105
+
106
+ ### Rule-Based Methods
107
+
108
+ **RDRPOSTagger** (Nguyen et al., 2014) uses Ripple Down Rules to automatically learn transformation rules from training data. Achieves 95.11% on VLSP 2013. Fast and interpretable but requires manual feature selection.
109
+
110
+ ### Statistical/CRF Methods
111
+
112
+ **VnMarMoT** (part of VnCoreNLP, 2018) is a CRF-based tagger achieving 95.88% on VLSP 2013. Uses MarMoT architecture with hand-crafted features including word forms, prefixes/suffixes, and context windows.
113
+
114
+ **TRE-1** (this work) follows a similar CRF approach with 27 feature templates inspired by the underthesea library. It achieves 95.89% on the UDD-1 dataset.
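The window-based feature templates mentioned above can be sketched as follows (a hypothetical, simplified subset for illustration; this is not TRE-1's actual 27-template set):

```python
# Hypothetical, simplified window templates: each template is a tuple of
# relative token offsets whose surface forms are joined into one feature.
TEMPLATES = [(-2,), (-1,), (0,), (1,), (2,), (-1, 0), (0, 1)]

def apply_templates(tokens, i):
    """Expand every template at position i into a name=value feature string."""
    feats = []
    for offsets in TEMPLATES:
        parts = [tokens[i + o] if 0 <= i + o < len(tokens) else '<PAD>'
                 for o in offsets]
        name = 'T[' + ','.join(str(o) for o in offsets) + ']'
        feats.append(name + '=' + '|'.join(parts))
    return feats
```

For the segmented sentence `['Tôi', 'là', 'sinh_viên']`, position 1 yields features such as `T[0]=là` and `T[-1,0]=Tôi|là`, with `<PAD>` filling positions that fall outside the sentence.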
### Neural/Transformer Methods

**PhoBERT** (Nguyen & Nguyen, 2020) is a Vietnamese pre-trained language model based on the RoBERTa architecture, trained on 20GB of Vietnamese text. Fine-tuned for POS tagging, it achieves 96.8% on VLSP 2013.

**PhoNLP** (Nguyen & Nguyen, 2021) extends PhoBERT with multi-task learning for joint POS tagging, NER, and dependency parsing. It achieves a state-of-the-art 96.91% on VLSP 2013 by sharing representations across tasks.

---

## Recommendations for TRE-1

1. **Evaluate on VLSP 2013**: Enable direct comparison with prior work
2. **Consider PhoBERT Fine-tuning**: Could improve accuracy by ~1%
3. **Multi-domain Training**: Include social media data for robustness
4. **Error Analysis on Rare Tags**: Focus on improving PART, X, and DET

---

## References

1. Nguyen, D. Q., Nguyen, D. Q., Pham, D. D., & Pham, S. B. (2014). RDRPOSTagger: A Ripple Down Rules-based Part-Of-Speech Tagger. EACL. https://aclanthology.org/E14-2005/

2. Vu, T., Nguyen, D. Q., Nguyen, D. Q., Dras, M., & Johnson, M. (2018). VnCoreNLP: A Vietnamese Natural Language Processing Toolkit. NAACL. https://aclanthology.org/N18-5012/

3. Nguyen, D. Q., & Nguyen, A. T. (2020). PhoBERT: Pre-trained language models for Vietnamese. EMNLP Findings. https://aclanthology.org/2020.findings-emnlp.92/

4. Nguyen, L. T., & Nguyen, D. Q. (2021). PhoNLP: A joint multi-task learning model for Vietnamese POS tagging, NER and dependency parsing. NAACL. https://aclanthology.org/2021.naacl-demos.1/

5. VLSP 2013 POS Tagging Shared Task. https://vlsp.org.vn/vlsp2013/eval/ws-pos

6. NLP-progress Vietnamese. https://nlpprogress.com/vietnamese/vietnamese.html
references/research_vietnamese_pos/bibliography.bib ADDED
@@ -0,0 +1,171 @@
@inproceedings{nguyen-nguyen-2020-phobert,
  title = "{P}ho{BERT}: Pre-trained language models for {V}ietnamese",
  author = "Nguyen, Dat Quoc and Nguyen, Anh Tuan",
  booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
  month = nov,
  year = "2020",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2020.findings-emnlp.92",
  doi = "10.18653/v1/2020.findings-emnlp.92",
  pages = "1037--1042",
}

@inproceedings{vu-etal-2018-vncorenlp,
  title = "{V}n{C}ore{NLP}: A {V}ietnamese Natural Language Processing Toolkit",
  author = "Vu, Thanh and Nguyen, Dat Quoc and Nguyen, Dai Quoc and Dras, Mark and Johnson, Mark",
  booktitle = "Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations",
  month = jun,
  year = "2018",
  address = "New Orleans, Louisiana",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/N18-5012",
  doi = "10.18653/v1/N18-5012",
  pages = "56--60",
}

@inproceedings{nguyen-etal-2014-rdrpostagger,
  title = "{RDRPOS}Tagger: A Ripple Down Rules-based Part-Of-Speech Tagger",
  author = "Nguyen, Dat Quoc and Nguyen, Dai Quoc and Pham, Dang Duc and Pham, Son Bao",
  booktitle = "Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics",
  month = apr,
  year = "2014",
  address = "Gothenburg, Sweden",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/E14-2005",
  doi = "10.3115/v1/E14-2005",
  pages = "17--20",
}

@inproceedings{nguyen-nguyen-2021-phonlp,
  title = "{P}ho{NLP}: A joint multi-task learning model for {V}ietnamese part-of-speech tagging, named entity recognition and dependency parsing",
  author = "Nguyen, Linh The and Nguyen, Dat Quoc",
  booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations",
  month = jun,
  year = "2021",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2021.naacl-demos.1",
  doi = "10.18653/v1/2021.naacl-demos.1",
  pages = "1--7",
}

@misc{vlsp2013,
  title = "{VLSP} 2013 {POS} Tagging Shared Task",
  author = "{VLSP}",
  year = "2013",
  url = "https://vlsp.org.vn/vlsp2013/eval/ws-pos",
}

@inproceedings{lafferty2001crf,
  title = "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data",
  author = "Lafferty, John D. and McCallum, Andrew and Pereira, Fernando C. N.",
  booktitle = "Proceedings of the Eighteenth International Conference on Machine Learning (ICML)",
  pages = "282--289",
  year = "2001",
  url = "https://dl.acm.org/doi/10.5555/645530.655813",
}

@inproceedings{bui2020velecra,
  title = "Improving Sequence Tagging for Vietnamese Text using Transformer-based Neural Models",
  author = "Bui, Tuan Viet and Tran, Oanh Thi Kim and Le-Hong, Phuong",
  booktitle = "Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation (PACLIC)",
  year = "2020",
  url = "https://github.com/fpt-corp/vELECTRA",
}

@inproceedings{ma-hovy-2016-end,
  title = "End-to-end Sequence Labeling via Bi-directional {LSTM}-{CNN}s-{CRF}",
  author = "Ma, Xuezhe and Hovy, Eduard",
  booktitle = "Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics",
  month = aug,
  year = "2016",
  url = "https://aclanthology.org/P16-1101",
  doi = "10.18653/v1/P16-1101",
  pages = "1064--1074",
}

@inproceedings{nguyen2018neural,
  title = "Neural Sequence Labeling for Vietnamese POS Tagging and NER",
  author = "Nguyen, Hoang Anh DU and Nguyen, Kiem Hieu and Van, Victor",
  booktitle = "2019 IEEE-RIVF International Conference on Computing and Communication Technologies",
  year = "2019",
  url = "https://arxiv.org/abs/1811.03754",
  doi = "10.1109/RIVF.2019.8713710",
}

@misc{universaldependencies,
  title = "Universal Dependencies",
  author = "{Universal Dependencies Contributors}",
  year = "2024",
  url = "https://universaldependencies.org/",
}

@misc{underthesea,
  title = "Underthesea: Vietnamese NLP Toolkit",
  author = "{Underthesea Team}",
  year = "2024",
  url = "https://github.com/undertheseanlp/underthesea",
}

@article{tran2021bartpho,
  title = "BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese",
  author = "Tran, Nguyen Luong and Le, Duong Minh and Nguyen, Dat Quoc",
  journal = "arXiv preprint arXiv:2109.09701",
  year = "2021",
  url = "https://arxiv.org/abs/2109.09701",
}

@inproceedings{phan2022vit5,
  title = "{V}i{T}5: Pretrained Text-to-Text Transformer for {V}ietnamese Language Generation",
  author = "Phan, Long and Tran, Hieu and Nguyen, Hieu and Trinh, Trieu H.",
  booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop",
  month = jul,
  year = "2022",
  address = "Seattle, Washington",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2022.naacl-srw.18",
  pages = "136--142",
}

@inproceedings{tran2023videberta,
  title = "{V}i{D}e{BERT}a: A powerful pre-trained language model for {V}ietnamese",
  author = "Tran, Cong Dao and Pham, Nhut Huy and Nguyen, Anh and Hy, Truong-Son",
  booktitle = "Findings of the Association for Computational Linguistics: EACL 2023",
  month = may,
  year = "2023",
  address = "Dubrovnik, Croatia",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2023.findings-eacl.79",
  pages = "1071--1078",
}

@inproceedings{nguyen2023visobert,
  title = "{V}i{S}o{BERT}: A Pre-Trained Language Model for {V}ietnamese Social Media Text Processing",
  author = "Nguyen, Nam and Phan, Thang and Nguyen, Duc-Vu and Nguyen, Kiet",
  booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
  month = dec,
  year = "2023",
  address = "Singapore",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2023.emnlp-main.315",
  pages = "5191--5207",
}

@article{nguyen2023phogpt,
  title = "PhoGPT: Generative Pre-training for Vietnamese",
  author = "Nguyen, Dat Quoc and Nguyen, Linh The and Tran, Chi and Nguyen, Dung Ngoc and Phung, Dinh and Bui, Hung",
  journal = "arXiv preprint arXiv:2311.02945",
  year = "2023",
  url = "https://arxiv.org/abs/2311.02945",
}

@article{zheng2022vietnamese,
  title = "Deep Neural Networks Algorithm for Vietnamese Word Segmentation",
  author = "Zheng, Yi and others",
  journal = "Scientific Programming",
  volume = "2022",
  year = "2022",
  doi = "10.1155/2022/8187680",
  url = "https://doi.org/10.1155/2022/8187680",
}
references/research_vietnamese_pos/papers.md ADDED
@@ -0,0 +1,378 @@
# Paper Database: Vietnamese POS Tagging

## Paper 1: PhoBERT

- **Title**: PhoBERT: Pre-trained language models for Vietnamese
- **Authors**: Dat Quoc Nguyen, Anh Tuan Nguyen
- **Venue**: EMNLP Findings 2020
- **URL**: https://aclanthology.org/2020.findings-emnlp.92/
- **Citations**: 411+
- **Local**: [paper.pdf](../2020.emnlp.nguyen/paper.pdf)

### Summary
The first public large-scale monolingual language models pre-trained for Vietnamese. PhoBERT-base (135M params) and PhoBERT-large (370M params) are trained on 20GB of Vietnamese Wikipedia and news text using the RoBERTa architecture.

### Key Contributions
1. Release of pre-trained Vietnamese language models (PhoBERT-base, PhoBERT-large)
2. State-of-the-art results on 4 Vietnamese NLP tasks including POS tagging
3. Shows monolingual models outperform multilingual BERT for Vietnamese

### Methodology
- **Approach**: RoBERTa-style pre-training on Vietnamese text
- **Dataset**: 20GB Vietnamese text (Wikipedia + news)
- **Pre-training**: Masked language modeling, 40 epochs

### Results
| Task | Dataset | PhoBERT-base | PhoBERT-large |
|------|---------|--------------|---------------|
| POS Tagging | VLSP 2013 | 96.7% | 96.8% |
| NER | VLSP 2016 | 94.0% F1 | 94.5% F1 |

### Relevance to TRE-1
PhoBERT represents the transformer-based SOTA that TRE-1's CRF approach competes against. TRE-1 achieves 95.89% vs PhoBERT's 96.8%, a gap of ~1%.

---

## Paper 2: VnCoreNLP

- **Title**: VnCoreNLP: A Vietnamese Natural Language Processing Toolkit
- **Authors**: Thanh Vu, Dat Quoc Nguyen, Dai Quoc Nguyen, Mark Dras, Mark Johnson
- **Venue**: NAACL 2018 Demonstrations
- **URL**: https://aclanthology.org/N18-5012/
- **Citations**: 300+
- **Local**: [paper.pdf](../2018.naacl.vu/paper.pdf)

### Summary
Fast and accurate NLP pipeline for Vietnamese covering word segmentation, POS tagging, NER, and dependency parsing. The POS component (VnMarMoT) uses a CRF with the MarMoT architecture.

### Key Contributions
1. Integrated toolkit for the Vietnamese NLP pipeline
2. State-of-the-art results on standard benchmarks
3. Fast processing (8K-90K words/sec)

### Methodology
- **Approach**: CRF-based (MarMoT) for POS tagging
- **Features**: Word forms, prefixes, suffixes, context windows
- **Dataset**: VLSP 2013

### Results
| Task | Dataset | Accuracy/F1 |
|------|---------|-------------|
| Word Segmentation | - | 97.90% |
| POS Tagging | VLSP 2013 | 95.88% |
| NER | VLSP 2016 | 88.55% F1 |

### Relevance to TRE-1
VnMarMoT is the closest comparable CRF-based system. TRE-1 (95.89%) matches VnMarMoT's (95.88%) performance level using a similar feature-engineering approach.

---

## Paper 3: RDRPOSTagger

- **Title**: RDRPOSTagger: A Ripple Down Rules-based Part-Of-Speech Tagger
- **Authors**: Dat Quoc Nguyen, Dai Quoc Nguyen, Dang Duc Pham, Son Bao Pham
- **Venue**: EACL 2014 Demonstrations
- **URL**: https://aclanthology.org/E14-2005/
- **Citations**: 150+
- **Local**: [paper.pdf](../2014.eacl.nguyen/paper.pdf)

### Summary
Error-driven approach using Ripple Down Rules to automatically construct transformation rules for POS tagging. Language-independent, supporting 80+ languages including Vietnamese.

### Key Contributions
1. Novel error-driven rule learning approach
2. Fast training and inference
3. Multi-language support

### Methodology
- **Approach**: Single Classification Ripple Down Rules (SCRDR)
- **Learning**: Error-driven transformation rules
- **Features**: Template-based rules

### Results
| Language | Dataset | Accuracy |
|----------|---------|----------|
| English | Penn WSJ | 97.10% |
| Vietnamese | VietTreeBank | 95.11% |

### Relevance to TRE-1
RDRPOSTagger represents the earlier rule-based approach. TRE-1's CRF approach (95.89%) outperforms RDRPOSTagger (95.11%) by 0.78%.

---

## Paper 4: PhoNLP

- **Title**: PhoNLP: A joint multi-task learning model for Vietnamese POS tagging, NER and dependency parsing
- **Authors**: Linh The Nguyen, Dat Quoc Nguyen
- **Venue**: NAACL 2021 Demonstrations
- **URL**: https://aclanthology.org/2021.naacl-demos.1/
- **Citations**: 50+
- **Local**: [paper.pdf](../2021.naacl.nguyen/paper.pdf)

### Summary
First multi-task learning model for joint Vietnamese POS tagging, NER, and dependency parsing. Uses PhoBERT as the encoder with task-specific prediction layers.

### Key Contributions
1. Joint multi-task learning for Vietnamese NLP
2. Soft POS embeddings shared across tasks
3. State-of-the-art on all three tasks

### Methodology
- **Approach**: PhoBERT encoder + multi-task prediction heads
- **Architecture**: Shared encoder, task-specific CRF/Biaffine classifiers
- **Training**: Joint optimization on all tasks

### Results
| Task | Dataset | Score |
|------|---------|-------|
| POS Tagging | VLSP 2013 | 96.91% |
| NER | VLSP 2016 | 94.51% F1 |
| Dep Parsing | VnDT | 77.85% LAS |

### Relevance to TRE-1
PhoNLP represents the current SOTA (96.91%). TRE-1's CRF approach (95.89%) is 1.02% behind but requires no GPU and is significantly faster.

---

## Paper 5: CRF (Lafferty et al. 2001)

- **Title**: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
- **Authors**: John D. Lafferty, Andrew McCallum, Fernando C. N. Pereira
- **Venue**: ICML 2001
- **URL**: https://dl.acm.org/doi/10.5555/645530.655813
- **Citations**: 20,000+
- **Local**: [paper.pdf](../2001.icml.lafferty/paper.pdf)

### Summary
Foundational paper introducing Conditional Random Fields (CRFs) for sequence labeling. Addresses the label bias problem inherent in MEMMs and provides the theoretical foundation for discriminative sequence models.

### Key Contributions
1. CRF framework as an undirected graphical model for sequences
2. Solution to the label bias problem in directed models
3. Iterative parameter estimation algorithms

### Methodology
- **Approach**: Undirected graphical model with global normalization
- **Training**: Maximum conditional likelihood with L-BFGS
- **Inference**: Viterbi algorithm for the most likely sequence

### Relevance to TRE-1
This is the foundational algorithm used in TRE-1. Our implementation uses python-crfsuite with L-BFGS optimization, c1=1.0 (L1), c2=0.001 (L2).
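The Viterbi inference step can be made concrete with a toy decoder. The tag set and the transition/emission scores below are invented for illustration; in a real CRF they come from the learned feature weights:

```python
def viterbi(obs_scores, trans, tags):
    """Viterbi decoding: find the highest-scoring tag sequence given
    per-token emission scores and pairwise transition scores."""
    # best[t] = score of the best path ending in tag t at the current token
    best = {t: obs_scores[0][t] for t in tags}
    back = []
    for scores in obs_scores[1:]:
        prev_best, best, choices = best, {}, {}
        for t in tags:
            # Pick the best previous tag to transition from
            p = max(tags, key=lambda s: prev_best[s] + trans[(s, t)])
            best[t] = prev_best[p] + trans[(p, t)] + scores[t]
            choices[t] = p
        back.append(choices)
    # Recover the best path by walking the back-pointers
    last = max(tags, key=lambda t: best[t])
    path = [last]
    for choices in reversed(back):
        path.append(choices[path[-1]])
    return list(reversed(path))

tags = ["N", "V"]
trans = {("N", "N"): 0.0, ("N", "V"): 1.0, ("V", "N"): 1.0, ("V", "V"): -1.0}
# Emission scores for a 3-token sentence (toy values)
obs = [{"N": 2.0, "V": 0.0}, {"N": 0.0, "V": 2.0}, {"N": 2.0, "V": 0.0}]
print(viterbi(obs, trans, tags))  # ['N', 'V', 'N']
```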
---

## Paper 6: vELECTRA (Bui et al. 2020)

- **Title**: Improving Sequence Tagging for Vietnamese Text using Transformer-based Neural Models
- **Authors**: T. V. Bui, O. T. Tran, P. Le-Hong
- **Venue**: PACLIC 2020
- **URL**: https://github.com/fpt-corp/vELECTRA
- **Citations**: 50+

### Summary
Vietnamese ELECTRA model pre-trained on 60GB of text using the replaced token detection task. An alternative to PhoBERT with a different pre-training objective.

### Key Contributions
1. Vietnamese ELECTRA architecture
2. Pre-training on 60GB of Vietnamese text
3. Competitive results with PhoBERT

### Results
| Task | Dataset | Accuracy |
|------|---------|----------|
| POS Tagging | VLSP 2013 | 96.77% |

### Relevance to TRE-1
Another transformer baseline showing a ~1% gap over CRF approaches.

---

## Paper 7: Neural Sequence Labeling (Nguyen et al. 2018)

- **Title**: Neural Sequence Labeling for Vietnamese POS Tagging and NER
- **Authors**: DU Nguyen Hoang Anh, Hieu Nguyen Kiem, Victor Van
- **Venue**: RIVF 2019
- **arXiv**: https://arxiv.org/abs/1811.03754
- **Citations**: 13

### Summary
BiLSTM-CRF model with character and word embeddings for Vietnamese sequence labeling. A pre-PhoBERT-era neural approach.

### Results
| Task | Score |
|------|-------|
| POS Tagging | 93.52% |
| NER | 94.88% F1 |

### Relevance to TRE-1
Shows that without pre-trained LMs, neural approaches (93.52%) underperform a CRF with good features (95.89%).

---

## Paper 8: BiLSTM-CNN-CRF (Ma & Hovy 2016)

- **Title**: End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF
- **Authors**: Xuezhe Ma, Eduard Hovy
- **Venue**: ACL 2016
- **arXiv**: https://arxiv.org/abs/1603.01354
- **Citations**: 5,000+

### Summary
Seminal neural sequence labeling architecture combining BiLSTM, CNN character embeddings, and a CRF layer. Established neural+CRF as the dominant approach before transformers.

### Results
| Task | Dataset | Score |
|------|---------|-------|
| POS Tagging | Penn Treebank | 97.55% |
| NER | CoNLL 2003 | 91.21% F1 |

### Relevance to TRE-1
The neural+CRF architecture this paper introduced was later adapted for Vietnamese (Paper 7). Shows a potential path for TRE-1 improvement.

---

## Paper 9: BARTpho (Tran et al. 2021)

- **Title**: BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese
- **Authors**: Nguyen Luong Tran, Duong Minh Le, Dat Quoc Nguyen
- **Venue**: arXiv 2021
- **arXiv**: https://arxiv.org/abs/2109.09701
- **Citations**: 100+

### Summary
First large-scale monolingual sequence-to-sequence models pre-trained for Vietnamese. Two versions: BARTpho-syllable and BARTpho-word. Uses the BART "large" architecture with denoising pre-training.

### Key Contributions
1. First Vietnamese seq2seq pre-trained models
2. Two tokenization strategies (syllable vs word)
3. SOTA on Vietnamese summarization and translation

### Relevance to TRE-1
A seq2seq architecture is not directly applicable to sequence labeling, but it demonstrates advances in Vietnamese pre-training.

---

## Paper 10: ViT5 (Phan et al. 2022)

- **Title**: ViT5: Pretrained Text-to-Text Transformer for Vietnamese Language Generation
- **Authors**: Long Phan, Hieu Tran, Hieu Nguyen, Trieu H. Trinh
- **Venue**: NAACL 2022 (Student Research Workshop)
- **URL**: https://aclanthology.org/2022.naacl-srw.18/
- **arXiv**: https://arxiv.org/abs/2205.06457
- **Citations**: 50+

### Summary
T5-style encoder-decoder model for Vietnamese. Base (310M) and large (866M) versions trained on the CC100 Vietnamese corpus.

### Results
| Task | Dataset | Score |
|------|---------|-------|
| Summarization | WikiLingua | SOTA |
| NER | - | Competitive |

### Relevance to TRE-1
A text-to-text formulation could be applied to sequence labeling, but this is not the typical approach.

---

## Paper 11: ViDeBERTa (Tran et al. 2023)

- **Title**: ViDeBERTa: A powerful pre-trained language model for Vietnamese
- **Authors**: Cong Dao Tran, Nhut Huy Pham, Anh Nguyen, Truong-Son Hy
- **Venue**: EACL 2023 Findings
- **URL**: https://aclanthology.org/2023.findings-eacl.79/
- **arXiv**: https://arxiv.org/abs/2301.10439
- **Citations**: 30+

### Summary
DeBERTa-based Vietnamese model with xsmall, base, and large versions, trained on CC100 Vietnamese. More parameter-efficient than PhoBERT.

### Key Results
| Model | Params | POS (VLSP 2013) |
|-------|--------|-----------------|
| ViDeBERTa_base | 86M | ~96.8%* |
| PhoBERT_large | 370M | 96.8% |

*ViDeBERTa_base achieves similar results with only 23% of PhoBERT_large's parameters.

### Relevance to TRE-1
Demonstrates efficiency gains in Vietnamese PLMs. Could be an alternative to PhoBERT for neural approaches.

---

## Paper 12: ViSoBERT (Nguyen et al. 2023)

- **Title**: ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing
- **Authors**: Nam Nguyen, Thang Phan, Duc-Vu Nguyen, Kiet Nguyen
- **Venue**: EMNLP 2023
- **URL**: https://aclanthology.org/2023.emnlp-main.315/
- **arXiv**: https://arxiv.org/abs/2310.11166
- **Citations**: 20+

### Summary
First pre-trained model specifically for Vietnamese social media. Uses the XLM-R architecture trained on a social media corpus.

### Key Contributions
1. Domain-specific pre-training for social media
2. SOTA on emotion recognition, hate speech, sentiment analysis
3. Addresses the domain gap in existing Vietnamese PLMs

### Relevance to TRE-1
Highlights the domain adaptation challenge. TRE-1 (legal/news) may underperform on social media text.

---

## Paper 13: PhoGPT (Nguyen et al. 2023)

- **Title**: PhoGPT: Generative Pre-training for Vietnamese
- **Authors**: Dat Quoc Nguyen, Linh The Nguyen, Chi Tran, Dung Ngoc Nguyen, Dinh Phung, Hung Bui
- **Venue**: arXiv 2023
- **arXiv**: https://arxiv.org/abs/2311.02945
- **Citations**: 20+

### Summary
Open-source 4B-parameter generative model for Vietnamese: the PhoGPT-4B base model and the PhoGPT-4B-Chat fine-tuned version, trained on 102B Vietnamese tokens.

### Key Contributions
1. First large-scale Vietnamese GPT model
2. 8192 context length
3. Instruction-following capability (Chat variant)

### Relevance to TRE-1
Generative models are not directly applicable to sequence labeling, but represent the frontier of Vietnamese LLMs.

---

## Paper 14: Deep Neural Networks for Vietnamese Word Segmentation (Zheng et al. 2022)

- **Title**: Deep Neural Networks Algorithm for Vietnamese Word Segmentation
- **Authors**: Zheng et al.
- **Venue**: Scientific Programming 2022
- **DOI**: https://doi.org/10.1155/2022/8187680

### Summary
LSTM-based approach to Vietnamese word segmentation addressing combination and cross-ambiguity challenges.

### Relevance to TRE-1
Word segmentation is a prerequisite for POS tagging; errors in segmentation propagate to TRE-1.

---

## Comparison Summary

| Paper | Year | Method | VLSP 2013 Acc | Params | GPU Required |
|-------|------|--------|---------------|--------|--------------|
| RDRPOSTagger | 2014 | Rules | 95.11% | - | No |
| Neural Seq | 2018 | BiLSTM-CRF | 93.52% | ~10M | Yes |
| VnCoreNLP | 2018 | CRF | 95.88% | - | No |
| **TRE-1** | 2026 | CRF | 95.89%* | - | No |
| vELECTRA | 2020 | ELECTRA | 96.77% | 110M | Yes |
| PhoBERT-large | 2020 | RoBERTa | 96.8% | 370M | Yes |
| PhoNLP | 2021 | PhoBERT+MTL | **96.91%** | 135M+ | Yes |
| ViDeBERTa-base | 2023 | DeBERTa | ~96.8% | 86M | Yes |
| ViSoBERT | 2023 | XLM-R | - | 278M | Yes |
| PhoGPT | 2023 | GPT | - | 3.7B | Yes |

*TRE-1 evaluated on UDD-1, not VLSP 2013.
references/research_vietnamese_pos/sota.md ADDED
@@ -0,0 +1,95 @@
# State-of-the-Art: Vietnamese POS Tagging

**Last Updated**: 2026-01-31

## Current Best Results (VLSP 2013 Benchmark)

| Rank | Model | Year | Accuracy | Method | Params | Paper |
|------|-------|------|----------|--------|--------|-------|
| 1 | **PhoNLP** | 2021 | **96.91%** | PhoBERT + MTL | 135M+ | [Nguyen & Nguyen, 2021](https://aclanthology.org/2021.naacl-demos.1/) |
| 2 | ViDeBERTa-base | 2023 | ~96.8% | DeBERTa | 86M | [Tran et al., 2023](https://aclanthology.org/2023.findings-eacl.79/) |
| 3 | PhoBERT-large | 2020 | 96.8% | RoBERTa | 370M | [Nguyen & Nguyen, 2020](https://aclanthology.org/2020.findings-emnlp.92/) |
| 4 | vELECTRA | 2020 | 96.77% | ELECTRA | 110M | [Bui et al., 2020](https://github.com/fpt-corp/vELECTRA) |
| 5 | PhoBERT-base | 2020 | 96.7% | RoBERTa | 135M | [Nguyen & Nguyen, 2020](https://aclanthology.org/2020.findings-emnlp.92/) |
| 6 | VnMarMoT | 2018 | 95.88% | CRF | - | [Vu et al., 2018](https://aclanthology.org/N18-5012/) |
| 7 | RDRPOSTagger | 2014 | 95.11% | Rules (RDR) | - | [Nguyen et al., 2014](https://aclanthology.org/E14-2005/) |

## TRE-1 Position

| Model | Dataset | Accuracy | F1 (macro) |
|-------|---------|----------|------------|
| **TRE-1 v1.1** | UDD-1 | 95.89% | 92.71% |

**Note**: TRE-1 is evaluated on UDD-1 (20K sentences), not VLSP 2013. Direct comparison requires evaluation on the same benchmark.
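The gap between accuracy (95.89%) and macro F1 (92.71%) comes from rare tags: accuracy is dominated by frequent tags, while macro F1 averages per-tag F1 equally, so a few mis-tagged PART or X tokens pull it down. A toy computation (made-up tag sequences) shows the effect:

```python
def accuracy_and_macro_f1(gold, pred):
    """Token accuracy plus macro-averaged F1 over the gold tag set."""
    acc = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    f1s = []
    for tag in sorted(set(gold)):
        tp = sum(g == p == tag for g, p in zip(gold, pred))
        fp = sum(p == tag and g != tag for g, p in zip(gold, pred))
        fn = sum(g == tag and p != tag for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return acc, sum(f1s) / len(f1s)

# The single PART token is mis-tagged: macro F1 drops far more than accuracy.
gold = ["N", "N", "N", "N", "V", "V", "V", "PART"]
pred = ["N", "N", "N", "N", "V", "V", "V", "N"]
acc, macro = accuracy_and_macro_f1(gold, pred)
print(round(acc, 3), round(macro, 3))  # 0.875 0.63
```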
## Trends

### 1. Shift to Pre-trained Models (2020+)
- PhoBERT established the transformer-based SOTA
- ~1% accuracy gain over CRF methods
- Trend toward multi-task learning (PhoNLP)

### 2. Efficiency Focus (2023+)
- ViDeBERTa achieves PhoBERT-level accuracy with 4x fewer parameters
- The trade-off between accuracy and computational cost is becoming important
- Edge deployment considerations

### 3. Domain Adaptation
- ViSoBERT (EMNLP 2023) targets the social media domain
- General models underperform on domain-specific text
- Need for domain-specific pre-training

### 4. Generative Models (2023+)
- PhoGPT (4B params) represents the Vietnamese LLM frontier
- Potential for zero-shot/few-shot sequence labeling
- Not yet competitive with fine-tuned discriminative models

### 5. CRF Still Competitive
- VnMarMoT (95.88%) is very close to TRE-1 (95.89%)
- Advantages: fast, no GPU, interpretable
- Gap to SOTA: ~1%

## Performance by Method Type

| Method Type | Best Model | VLSP 2013 | Params | Speed | Resources |
|-------------|------------|-----------|--------|-------|-----------|
| **Multi-task** | PhoNLP | 96.91% | 135M+ | Slow | GPU |
| **Efficient** | ViDeBERTa | ~96.8% | 86M | Medium | GPU |
| **Transformer** | PhoBERT-large | 96.8% | 370M | Slow | GPU |
| **CRF** | VnMarMoT, TRE-1 | 95.88% | - | Fast | CPU |
| **Rules** | RDRPOSTagger | 95.11% | - | Fast | CPU |

## Open Challenges

1. **Rare Tags**: PART and X tags still underperform
2. **Domain Transfer**: News-trained models struggle on social media
3. **Word Segmentation**: Errors propagate to POS tagging
4. **Benchmark Fragmentation**: Multiple datasets complicate comparison

## Recommended Baselines for New Work

1. **SOTA comparison**: PhoNLP (96.91%)
2. **Efficient neural**: ViDeBERTa-base (~96.8%, 86M params)
3. **CRF baseline**: VnCoreNLP/VnMarMoT (95.88%)
4. **Rule baseline**: RDRPOSTagger (95.11%)
5. **Domain-specific**: ViSoBERT (social media)

## Vietnamese Pre-trained Models (2020-2024)

| Model | Year | Architecture | Params | Focus |
|-------|------|--------------|--------|-------|
| PhoBERT | 2020 | RoBERTa | 135M/370M | General |
| vELECTRA | 2020 | ELECTRA | 110M | General |
| BARTpho | 2021 | BART | Large | Seq2Seq |
| ViT5 | 2022 | T5 | 310M/866M | Text-to-Text |
| ViDeBERTa | 2023 | DeBERTa | 86M/304M | Efficient |
| ViSoBERT | 2023 | XLM-R | 278M | Social Media |
| PhoGPT | 2023 | GPT | 3.7B | Generative |

## Resources

- **NLP-progress Vietnamese**: https://nlpprogress.com/vietnamese/vietnamese.html
- **Papers With Code**: https://paperswithcode.com/task/part-of-speech-tagging
- **VLSP Resources**: https://vlsp.org.vn/resources
- **VinAI Research**: https://research.vinai.io/
- **Hugging Face Vietnamese**: https://huggingface.co/models?language=vi
references/underthesea.md ADDED
@@ -0,0 +1,137 @@
---
title: "Underthesea: Vietnamese NLP Toolkit"
type: "resource"
url: "https://github.com/undertheseanlp/underthesea"
---

## Overview

Underthesea is an open-source Python library providing a suite of tools for Vietnamese natural language processing.

## Key Information

| Field | Value |
|-------|-------|
| **Website** | https://undertheseanlp.com/ |
| **GitHub** | https://github.com/undertheseanlp/underthesea |
| **Documentation** | https://undertheseanlp.github.io/underthesea/ |
| **PyPI** | https://pypi.org/project/underthesea/ |
| **License** | GPL-3.0 |
| **Language** | Python 3.6+ |

## Features

### Core NLP Tasks

| Task | Description |
|------|-------------|
| Sentence Segmentation | Split text into sentences |
| Text Normalization | Normalize Vietnamese text |
| Word Tokenization | Segment words (with compound word support) |
| POS Tagging | Part-of-speech tagging |
| Chunking | Phrase grouping |
| NER | Named Entity Recognition |
| Dependency Parsing | Syntactic dependency analysis |
| Text Classification | Document classification |
| Sentiment Analysis | Sentiment detection |

### Advanced Features

- Machine Translation (Vietnamese ↔ English)
- Language Detection
- Text-to-Speech
- Conversational AI Agent

## Installation

```bash
# Basic installation
pip install underthesea

# With deep learning support
pip install "underthesea[deep]"

# With text-to-speech
pip install "underthesea[voice]"

# With AI agent
pip install "underthesea[agent]"
```

## Usage Examples

### Word Segmentation

```python
from underthesea import word_tokenize

text = "Chàng trai 9X Quảng Trị khởi nghiệp từ nấm sò"
tokens = word_tokenize(text)
# ['Chàng trai', '9X', 'Quảng Trị', 'khởi nghiệp', 'từ', 'nấm sò']
```

### POS Tagging

```python
from underthesea import pos_tag

text = "Tôi yêu Việt Nam"
tagged = pos_tag(text)
# [('Tôi', 'P'), ('yêu', 'V'), ('Việt Nam', 'Np')]
```

### Named Entity Recognition

```python
from underthesea import ner

text = "Bộ Công Thương xóa một tổng cục"
entities = ner(text)
# [('Bộ Công Thương', 'B-ORG'), ...]
```

### Text Classification

```python
from underthesea import classify

text = "HLV đầu tiên ở Premier League bị sa thải"
category = classify(text)
100
+ # 'Thể thao'
101
+ ```
102
+
103
+ ## POS Tag Set
104
+
105
+ Underthesea uses Vietnamese-specific POS tags:
106
+
107
+ | Tag | Description |
108
+ |-----|-------------|
109
+ | N | Noun |
110
+ | V | Verb |
111
+ | A | Adjective |
112
+ | P | Pronoun |
113
+ | Np | Proper noun |
114
+ | E | Preposition |
115
+ | C | Conjunction |
116
+ | R | Adverb |
117
+ | M | Numeral |
118
+ | L | Determiner |
119
+ | T | Particle |
120
+ | X | Unknown |
121
+
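As a minimal illustration of working with this tag set, the `(word, tag)` pairs that `pos_tag` returns (shown in the example above) can be filtered by tag with plain Python. The grouping of "content" tags below is an assumption made for this sketch, not part of underthesea's API:

```python
# (word, tag) pairs in the shape returned by pos_tag,
# taken from the example output above.
tagged = [("Tôi", "P"), ("yêu", "V"), ("Việt Nam", "Np")]

# Open-class tags from the table above (grouping chosen for this example).
CONTENT_TAGS = {"N", "Np", "V", "A"}

content_words = [word for word, tag in tagged if tag in CONTENT_TAGS]
print(content_words)  # ['yêu', 'Việt Nam']
```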
+ ## Related Projects
+
+ - [underthesea-core](https://github.com/undertheseanlp/underthesea-core) - Core algorithms
+ - [NLP-Vietnamese-progress](https://github.com/undertheseanlp/NLP-Vietnamese-progress) - Vietnamese NLP benchmarks
+
+ ## Citation
+
+ ```bibtex
+ @misc{underthesea,
+   author = {Underthesea Team},
+   title = {Underthesea: Vietnamese NLP Toolkit},
+   year = {2018},
+   publisher = {GitHub},
+   url = {https://github.com/undertheseanlp/underthesea}
+ }
+ ```
references/universal_dependencies.md ADDED
@@ -0,0 +1,100 @@
+ ---
+ title: "Universal Dependencies"
+ type: "resource"
+ url: "https://universaldependencies.org/"
+ ---
+
+ ## Overview
+
+ Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages.
+
+ ## Key Information
+
+ | Field | Value |
+ |-------|-------|
+ | **Website** | https://universaldependencies.org/ |
+ | **Languages** | 150+ languages |
+ | **Treebanks** | 200+ treebanks |
+ | **Contributors** | 600+ contributors |
+ | **Latest Version** | 2.17 (November 2025) |
+ | **License** | Various (mostly CC-BY) |
+
+ ## Purpose
+
+ UD aims to facilitate:
+ - Cross-linguistic research in NLP
+ - Multilingual parser development
+ - Typological studies
+ - Language learning applications
+
+ ## Annotation Layers
+
+ ### 1. Universal POS Tags (UPOS)
+
+ | Tag | Description | Example |
+ |-----|-------------|---------|
+ | ADJ | Adjective | big, old |
+ | ADP | Adposition | in, to |
+ | ADV | Adverb | very, well |
+ | AUX | Auxiliary | is, has |
+ | CCONJ | Coordinating conjunction | and, or |
+ | DET | Determiner | the, a |
+ | INTJ | Interjection | oh, wow |
+ | NOUN | Noun | house, cat |
+ | NUM | Numeral | one, 2 |
+ | PART | Particle | not, 's |
+ | PRON | Pronoun | I, she |
+ | PROPN | Proper noun | John, Paris |
+ | PUNCT | Punctuation | . , ? |
+ | SCONJ | Subordinating conjunction | if, that |
+ | SYM | Symbol | $, % |
+ | VERB | Verb | run, eat |
+ | X | Other | - |
+
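Because UPOS is a closed inventory of exactly 17 tags, checking whether a tagger's output uses valid universal tags reduces to set membership. A small sketch in plain Python (names are illustrative):

```python
# The 17 universal POS tags from the table above.
UPOS_TAGS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}

def is_valid_upos(tag: str) -> bool:
    """Return True if tag belongs to the closed UPOS inventory."""
    return tag in UPOS_TAGS

print(is_valid_upos("NOUN"))  # True
print(is_valid_upos("NN"))    # False: 'NN' is a language-specific XPOS tag
```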
+ ### 2. Morphological Features
+
+ - Case, Gender, Number, Person, Tense, Aspect, Mood, etc.
+
+ ### 3. Dependency Relations
+
+ - 37 universal syntactic relations (nsubj, obj, iobj, csubj, ccomp, xcomp, etc.)
+
+ ## CoNLL-U Format
+
+ Standard format for UD treebanks:
+
+ ```
+ # sent_id = 1
+ # text = They buy and sell books.
+ 1 They they PRON PRP _ 2 nsubj _ _
+ 2 buy buy VERB VBP _ 0 root _ _
+ 3 and and CCONJ CC _ 4 cc _ _
+ 4 sell sell VERB VBP _ 2 conj _ _
+ 5 books book NOUN NNS _ 2 obj _ _
+ 6 . . PUNCT . _ 2 punct _ _
+ ```
+
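The ten tab-separated columns above are easy to read programmatically. A minimal parsing sketch (illustrative only: it skips comment lines and does not handle multiword tokens, empty nodes, or multiple sentences):

```python
# The sentence above, with real tab separators between the 10 columns.
sample = """\
# sent_id = 1
# text = They buy and sell books.
1\tThey\tthey\tPRON\tPRP\t_\t2\tnsubj\t_\t_
2\tbuy\tbuy\tVERB\tVBP\t_\t0\troot\t_\t_
3\tand\tand\tCCONJ\tCC\t_\t4\tcc\t_\t_
4\tsell\tsell\tVERB\tVBP\t_\t2\tconj\t_\t_
5\tbooks\tbook\tNOUN\tNNS\t_\t2\tobj\t_\t_
6\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_
"""

def parse_conllu(text):
    """Parse one CoNLL-U sentence into a list of token dicts."""
    tokens = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        tokens.append({
            "id": int(cols[0]),
            "form": cols[1],
            "upos": cols[3],
            "head": int(cols[6]),
            "deprel": cols[7],
        })
    return tokens

tokens = parse_conllu(sample)
root = next(t for t in tokens if t["head"] == 0)
print(root["form"], root["upos"])  # buy VERB
```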
+ ## Vietnamese Treebanks
+
+ | Treebank | Sentences | Tokens |
+ |----------|-----------|--------|
+ | UD_Vietnamese-VTB | 3,000 | 43,754 |
+
+ ## Links
+
+ - **Website**: https://universaldependencies.org/
+ - **GitHub**: https://github.com/UniversalDependencies
+ - **Documentation**: https://universaldependencies.org/guidelines.html
+ - **Vietnamese UD**: https://universaldependencies.org/treebanks/vi_vtb/
+
+ ## BibTeX
+
+ ```bibtex
+ @inproceedings{nivre-etal-2020-universal,
+   title = "Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection",
+   author = "Nivre, Joakim and de Marneffe, Marie-Catherine and Ginter, Filip and others",
+   booktitle = "Proceedings of LREC 2020",
+   year = "2020",
+   pages = "4034--4043",
+ }
+ ```