Tiep Claude Opus 4.6 committed on
Commit ef06968 · 1 Parent(s): 1e215f7

Add references folder and research skills


- Add 12 research papers (PhoBERT, ViSoBERT, vELECTRA, SMTCE, RoBERTa, XLM-R, UIT-VSMEC, UIT-VSFC, etc.) with PDF, LaTeX source, and Markdown
- Add research synthesis: literature review, benchmark comparison, SOTA summary, bibliography
- Add references website (index.html) with Vellum-style leaderboard, paper database, benchmarks, citation network, and research methodology guide
- Add paper management tools (paper_db.py, fetch_papers.py, paper_db.json)
- Add Claude Code skills for paper-fetch, paper-research, paper-review, paper-write
- Configure git-lfs for binary files (PDF, PNG, etc.)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

This view is limited to 50 files because the commit contains too many changes.
Files changed (50)
  1. .claude/skills/paper-fetch/SKILL.md +1010 -0
  2. .claude/skills/paper-research/SKILL.md +502 -0
  3. .claude/skills/paper-review/SKILL.md +275 -0
  4. .claude/skills/paper-write/SKILL.md +402 -0
  5. .gitattributes +6 -0
  6. references/2007.rivf.hoang/paper.md +39 -0
  7. references/2018.kse.nguyen/paper.md +32 -0
  8. references/2019.arxiv.conneau/paper.md +35 -0
  9. references/2019.arxiv.conneau/paper.pdf +3 -0
  10. references/2019.arxiv.conneau/paper.tex +45 -0
  11. references/2019.arxiv.conneau/source/XLMR Paper/acl2020.bib +739 -0
  12. references/2019.arxiv.conneau/source/XLMR Paper/acl2020.sty +560 -0
  13. references/2019.arxiv.conneau/source/XLMR Paper/acl_natbib.bst +1975 -0
  14. references/2019.arxiv.conneau/source/XLMR Paper/appendix.tex +45 -0
  15. references/2019.arxiv.conneau/source/XLMR Paper/content/batchsize.pdf +3 -0
  16. references/2019.arxiv.conneau/source/XLMR Paper/content/capacity.pdf +3 -0
  17. references/2019.arxiv.conneau/source/XLMR Paper/content/datasize.pdf +3 -0
  18. references/2019.arxiv.conneau/source/XLMR Paper/content/dilution.pdf +3 -0
  19. references/2019.arxiv.conneau/source/XLMR Paper/content/langsampling.pdf +3 -0
  20. references/2019.arxiv.conneau/source/XLMR Paper/content/tables.tex +398 -0
  21. references/2019.arxiv.conneau/source/XLMR Paper/content/vocabsize.pdf +3 -0
  22. references/2019.arxiv.conneau/source/XLMR Paper/content/wikicc.pdf +3 -0
  23. references/2019.arxiv.conneau/source/XLMR Paper/texput.log +21 -0
  24. references/2019.arxiv.conneau/source/XLMR Paper/xlmr.bbl +285 -0
  25. references/2019.arxiv.conneau/source/XLMR Paper/xlmr.synctex +3 -0
  26. references/2019.arxiv.conneau/source/XLMR Paper/xlmr.tex +307 -0
  27. references/2019.arxiv.conneau/source/acl2020.bib +739 -0
  28. references/2019.arxiv.conneau/source/acl2020.sty +560 -0
  29. references/2019.arxiv.conneau/source/acl_natbib.bst +1975 -0
  30. references/2019.arxiv.conneau/source/appendix.tex +45 -0
  31. references/2019.arxiv.conneau/source/content/batchsize.pdf +3 -0
  32. references/2019.arxiv.conneau/source/content/capacity.pdf +3 -0
  33. references/2019.arxiv.conneau/source/content/datasize.pdf +3 -0
  34. references/2019.arxiv.conneau/source/content/dilution.pdf +3 -0
  35. references/2019.arxiv.conneau/source/content/langsampling.pdf +3 -0
  36. references/2019.arxiv.conneau/source/content/tables.tex +398 -0
  37. references/2019.arxiv.conneau/source/content/vocabsize.pdf +3 -0
  38. references/2019.arxiv.conneau/source/content/wikicc.pdf +3 -0
  39. references/2019.arxiv.conneau/source/texput.log +21 -0
  40. references/2019.arxiv.conneau/source/xlmr.bbl +285 -0
  41. references/2019.arxiv.conneau/source/xlmr.synctex +3 -0
  42. references/2019.arxiv.conneau/source/xlmr.tex +307 -0
  43. references/2019.arxiv.ho/paper.md +220 -0
  44. references/2019.arxiv.ho/paper.pdf +3 -0
  45. references/2019.arxiv.ho/paper.tex +239 -0
  46. references/2019.arxiv.ho/source/bibliography.bib +289 -0
  47. references/2019.arxiv.ho/source/images/DataProcessing.pdf +3 -0
  48. references/2019.arxiv.ho/source/images/cnnmodel.pdf +3 -0
  49. references/2019.arxiv.ho/source/images/con_matrix.png +3 -0
  50. references/2019.arxiv.ho/source/images/confusion_matrix.pdf +3 -0
.claude/skills/paper-fetch/SKILL.md ADDED
@@ -0,0 +1,1010 @@
---
name: paper-fetch
description: Fetch research papers from arXiv, ACL Anthology, or Semantic Scholar. Extracts to both Markdown (for LLM/RAG) and LaTeX (for compilation) formats.
argument-hint: "<paper-id-or-url>"
---

# Paper Fetcher

Fetch research papers and store them in the `references/` folder with extracted text content.

## Target

**Arguments**: $ARGUMENTS

The argument can be:
- **arXiv ID**: `2301.10140`, `arxiv:2301.10140`, or full URL
- **ACL Anthology ID**: `P19-1017`, `2023.acl-long.1`, or full URL
- **Semantic Scholar ID**: `s2:649def34...` or search query
- **DOI**: `10.18653/v1/P19-1017`
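A small dispatcher that routes the argument to the right fetcher can be sketched from the ID patterns above (the regexes here are illustrative assumptions, not a spec):

```python
import re

def classify_ref(arg: str) -> str:
    """Guess which source a paper reference belongs to.

    Pattern-matching sketch based on the ID formats listed above.
    """
    arg = arg.strip()
    if re.match(r"^10\.\d{4,9}/", arg):
        return "doi"
    if "arxiv.org" in arg or re.match(r"^(arxiv:)?\d{4}\.\d{4,5}(v\d+)?$", arg, re.I):
        return "arxiv"
    if ("aclanthology.org" in arg
            or re.match(r"^[A-Z]\d{2}-\d+$", arg)        # old-style: P19-1017
            or re.match(r"^\d{4}\.[a-z-]+\.\d+$", arg)):  # new-style: 2023.acl-long.1
        return "acl"
    if arg.startswith("s2:"):
        return "s2"
    return "search-query"

print(classify_ref("2301.10140"))            # arxiv
print(classify_ref("P19-1017"))              # acl
print(classify_ref("10.18653/v1/P19-1017"))  # doi
```

Order matters: the DOI check runs first because a DOI's numeric prefix could otherwise be confused with an arXiv ID.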

## Output Structure

Each paper gets its own folder named `year.<conference>.main-author`:

```
references/
  2020.emnlp.nguyen/    # PhoBERT paper (has arXiv source)
    paper.tex           # Original LaTeX source from arXiv
    paper.md            # Generated from LaTeX (with YAML front matter)
    paper.pdf           # PDF for reference
    source/             # Full arXiv source files
  2014.eacl.nguyen/     # RDRPOSTagger (no arXiv)
    paper.tex           # Generated from PDF
    paper.md            # Extracted from PDF (with YAML front matter)
    paper.pdf
```

### paper.md Format

Metadata stored in YAML front matter:

```markdown
---
title: "PhoBERT: Pre-trained language models for Vietnamese"
authors:
  - "Dat Quoc Nguyen"
  - "Anh Tuan Nguyen"
year: 2020
venue: "EMNLP Findings 2020"
url: "https://aclanthology.org/2020.findings-emnlp.92/"
arxiv: "2003.00744"
---

# Introduction
...
```
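A minimal way to read this front matter back, using only the standard library (a sketch that handles just the flat keys and simple lists used here; a YAML parser such as `pyyaml` is the robust choice):

```python
import re

def read_front_matter(md_text: str) -> dict:
    """Parse the YAML front matter block at the top of a paper.md file.

    Handles only the flat key/value and simple list fields shown above;
    use a real YAML parser for anything more complex.
    """
    match = re.match(r"^---\n(.*?)\n---\n", md_text, re.DOTALL)
    if not match:
        return {}
    meta, key = {}, None
    for line in match.group(1).splitlines():
        if line.lstrip().startswith("- ") and key:
            # List item belonging to the most recent bare key (e.g. authors:)
            meta.setdefault(key, []).append(line.lstrip()[2:].strip('"'))
        elif ":" in line:
            key, _, value = line.partition(":")
            key, value = key.strip(), value.strip().strip('"')
            if value:
                meta[key] = value
    return meta

sample = '---\ntitle: "PhoBERT"\nauthors:\n  - "Dat Quoc Nguyen"\nyear: 2020\n---\n\n# Introduction\n'
print(read_front_matter(sample)["title"])  # PhoBERT
```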

### Priority Order (arXiv papers)

1. **Download LaTeX source** from `arxiv.org/e-print/{id}` (tar.gz)
2. **Generate paper.md** from LaTeX (higher quality than PDF extraction)
3. **Download PDF** for reference

### Fallback (non-arXiv papers)

1. Download PDF
2. Extract paper.md from PDF (pymupdf4llm)
3. Generate paper.tex from Markdown

### Folder Naming Convention

Format: `{year}.{venue}.{first_author_lastname}`

| Paper ID | Folder Name |
|----------|-------------|
| `2020.findings-emnlp.92` | `2020.emnlp.nguyen` |
| `N18-5012` | `2018.naacl.vu` |
| `E14-2005` | `2014.eacl.nguyen` |
| `2301.10140` | `2023.arxiv.smith` |
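The last-name component strips Vietnamese diacritics down to ASCII, mirroring the `normalize_name` helper in the fetch script below:

```python
import unicodedata

def lastname_slug(full_name: str) -> str:
    """Lowercase ASCII slug of the last word of an author name.

    NFD-decompose, then drop combining marks (category Mn), so
    accented letters like 'ư' or 'ễ' reduce to their base letters.
    """
    parts = full_name.strip().split()
    lastname = parts[-1] if parts else full_name
    decomposed = unicodedata.normalize("NFD", lastname)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn").lower()

print(lastname_slug("Dat Quoc Nguyen"))  # nguyen
print(lastname_slug("Lê Hồng Phương"))   # phuong
```

Note that letters that are distinct code points rather than base-plus-mark (e.g. 'đ'/'Đ') survive NFD and would need an explicit mapping.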

### Format Comparison

| File | Source | Best For |
|------|--------|----------|
| `paper.tex` | arXiv e-print (original) or generated | Recompilation, precise formulas |
| `paper.md` | Converted from LaTeX or PDF | LLM/RAG, quick reading, GitHub |
| `source/` | Full arXiv source (if available) | Build, figures, bibliography |

---

## PDF to Markdown Extraction Methods

### Comparison of Methods

| Method | Table Quality | Speed | Dependencies | Best For |
|--------|---------------|-------|--------------|----------|
| **pymupdf4llm** | ★★★★☆ | Fast | pymupdf4llm | General papers, good tables |
| **pdfplumber** | ★★★★★ | Medium | pdfplumber | Complex tables, accuracy |
| **Marker** | ★★★★★ | Slow | marker-pdf, torch | Best quality, LaTeX formulas |
| **MinerU** | ★★★★★ | Slow | magic-pdf, torch | Academic papers, LaTeX output |
| **Nougat** | ★★★★★ | Slow | nougat-ocr, torch | arXiv papers, full LaTeX |
| PyMuPDF (basic) | ★★☆☆☆ | Fast | pymupdf | Simple text only |
| pdftotext | ★★☆☆☆ | Fast | poppler-utils | Basic extraction |

---
### Method 1: pymupdf4llm (Recommended)

Best balance of quality and speed. Produces GitHub-compatible markdown with proper table formatting.

```bash
uv run --with pymupdf4llm python -c "
import pymupdf4llm
import pathlib

pdf_path = 'references/{paper_id}/paper.pdf'
md_path = 'references/{paper_id}/paper.md'

# Extract with table support
md_text = pymupdf4llm.to_markdown(pdf_path)

pathlib.Path(md_path).write_text(md_text, encoding='utf-8')
print(f'Extracted to: {md_path}')
"
```

**Features:**
- Automatic table detection and markdown formatting
- Preserves document structure (headers, lists)
- Fast processing
- GitHub-compatible markdown output

**Advanced options:**
```python
import pymupdf4llm

# With page chunks and table info
md_text = pymupdf4llm.to_markdown(
    "paper.pdf",
    page_chunks=True,      # Get per-page chunks
    write_images=True,     # Extract images
    image_path="images/",  # Image output folder
)
```

---

### Method 2: pdfplumber (Best for Complex Tables)

Most accurate for papers with complex table structures.

```bash
uv run --with pdfplumber --with pandas python -c "
import pdfplumber
import pandas as pd

pdf_path = 'references/{paper_id}/paper.pdf'
md_path = 'references/{paper_id}/paper.md'

output = []
with pdfplumber.open(pdf_path) as pdf:
    for i, page in enumerate(pdf.pages):
        output.append(f'<!-- Page {i+1} -->\n')

        # Extract text
        text = page.extract_text() or ''

        # Extract tables
        tables = page.extract_tables()

        if tables:
            for j, table in enumerate(tables):
                # Convert to markdown table
                if table and len(table) > 0:
                    df = pd.DataFrame(table[1:], columns=table[0])
                    md_table = df.to_markdown(index=False)
                    text += f'\n\n**Table {j+1}:**\n{md_table}\n'

        output.append(text)
        output.append('\n\n---\n\n')

with open(md_path, 'w', encoding='utf-8') as f:
    f.write(''.join(output))

print(f'Extracted with tables to: {md_path}')
"
```

**Features:**
- Excellent table detection
- Handles complex multi-row/column tables
- Detailed control over extraction
- Can extract individual table cells

---

### Method 3: Marker (Best Quality - Deep Learning)

Uses deep learning for highest quality extraction. Best for papers with LaTeX, complex layouts.

```bash
# Install marker
uv pip install marker-pdf

# Convert PDF to markdown
marker_single "references/{paper_id}/paper.pdf" "references/{paper_id}/" --output_format markdown
```

Or via Python:

```bash
uv run --with marker-pdf python -c "
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

# Load models (first run downloads ~2GB)
models = create_model_dict()
converter = PdfConverter(artifact_dict=models)

# Convert
rendered = converter('references/{paper_id}/paper.pdf')
text, _, images = text_from_rendered(rendered)

with open('references/{paper_id}/paper.md', 'w') as f:
    f.write(text)
"
```

**Features:**
- AI-based layout detection
- Excellent table recognition
- LaTeX formula extraction
- Best for academic papers
- Handles multi-column layouts

**Note:** Requires GPU for best performance, downloads ~2GB models on first run.

---

### Method 4: Nougat (PDF to LaTeX - for arXiv papers)

Best for converting academic papers to full LaTeX source.

```bash
# Install nougat
uv pip install nougat-ocr

# Convert PDF to LaTeX/Markdown with math
nougat "references/{paper_id}/paper.pdf" -o "references/{paper_id}/" -m 0.1.0-base
```

Or via Python (a sketch; the CLI above is the supported interface, and the Python API may differ between nougat-ocr versions):

```bash
uv run --with nougat-ocr python -c "
from nougat import NougatModel

model = NougatModel.from_pretrained('facebook/nougat-base')
model.eval()

# Process PDF (check the nougat source for the exact inference helpers)
latex_output = model.predict('references/{paper_id}/paper.pdf')

with open('references/{paper_id}/paper.tex', 'w') as f:
    f.write(latex_output)
"
```

**Features:**
- Full LaTeX output with equations
- Trained on arXiv papers
- Best for math-heavy documents
- Outputs compilable LaTeX

---

### Method 5: MinerU (Best for Academic Papers)

Comprehensive tool for high-quality extraction with LaTeX formula support.

```bash
# Install MinerU (quote the extra so the shell doesn't expand the brackets)
pip install "magic-pdf[full]"

# Convert PDF
magic-pdf -p "references/{paper_id}/paper.pdf" -o "references/{paper_id}/"
```

**Features:**
- High formula recognition rate
- LaTeX-friendly output
- Table structure preservation
- Multi-format output (MD, JSON, LaTeX)

---

### Method 6: Hybrid Approach (Recommended for Academic Papers)

Combine methods for best results:

```python
# /// script
# requires-python = ">=3.9"
# dependencies = ["pymupdf4llm>=0.0.10", "pdfplumber>=0.10.0", "pandas>=2.0.0"]
# ///
"""
Hybrid PDF extraction: pymupdf4llm for text, pdfplumber for tables.
"""
import pymupdf4llm
import pdfplumber
import pandas as pd
import sys

def extract_tables_pdfplumber(pdf_path: str) -> dict:
    """Extract tables using pdfplumber (more accurate)."""
    tables_by_page = {}
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            tables = page.extract_tables()
            if tables:
                page_tables = []
                for table in tables:
                    if table and len(table) > 1:
                        try:
                            # Clean table data
                            cleaned = [[str(cell).strip() if cell else '' for cell in row] for row in table]
                            df = pd.DataFrame(cleaned[1:], columns=cleaned[0])
                            md_table = df.to_markdown(index=False)
                            page_tables.append(md_table)
                        except Exception as e:
                            print(f"Table error on page {i+1}: {e}")
                if page_tables:
                    tables_by_page[i] = page_tables
    return tables_by_page

def extract_hybrid(pdf_path: str, output_path: str):
    """Hybrid extraction: pymupdf4llm + pdfplumber tables."""

    # Get base markdown from pymupdf4llm
    md_text = pymupdf4llm.to_markdown(pdf_path)

    # Get accurate tables from pdfplumber
    tables = extract_tables_pdfplumber(pdf_path)

    # If pdfplumber found tables, we can append or replace
    if tables:
        md_text += "\n\n---\n\n## Extracted Tables (pdfplumber)\n\n"
        for page_num, page_tables in sorted(tables.items()):
            md_text += f"### Page {page_num + 1}\n\n"
            for i, table in enumerate(page_tables):
                md_text += f"**Table {i+1}:**\n\n{table}\n\n"

    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(md_text)

    print(f"Extracted to: {output_path}")
    print(f"Found {sum(len(t) for t in tables.values())} tables")

if __name__ == "__main__":
    pdf_path = sys.argv[1]
    output_path = sys.argv[2] if len(sys.argv) > 2 else pdf_path.replace('.pdf', '.md')
    extract_hybrid(pdf_path, output_path)
```

---

## Complete Fetch Script (Both Formats)

```python
# /// script
# requires-python = ">=3.9"
# dependencies = ["arxiv>=2.0.0", "requests>=2.28.0", "pymupdf4llm>=0.0.10"]
# ///
"""
Fetch paper and extract to both Markdown and LaTeX formats.
Folder naming: year.<venue>.main-author
"""
import arxiv
import pymupdf4llm
import requests
import json
import sys
import os
import re
import unicodedata

def normalize_name(name: str) -> str:
    """Normalize author name to lowercase ASCII."""
    # Get last name (last word)
    parts = name.strip().split()
    lastname = parts[-1] if parts else name

    # Remove accents and convert to lowercase
    normalized = unicodedata.normalize('NFD', lastname)
    ascii_name = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
    return ascii_name.lower()

def get_folder_name(year: int, venue: str, first_author: str) -> str:
    """Generate folder name: year.venue.author"""
    # Normalize venue name
    venue = venue.lower()
    venue = re.sub(r'findings-', '', venue)  # findings-emnlp -> emnlp
    venue = re.sub(r'-demos?', '', venue)    # naacl-demos -> naacl
    venue = re.sub(r'-main', '', venue)      # acl-main -> acl
    venue = re.sub(r'-long', '', venue)      # acl-long -> acl
    venue = re.sub(r'-short', '', venue)
    venue = re.sub(r'[^a-z0-9]', '', venue)  # Remove special chars

    # Normalize author
    author = normalize_name(first_author)

    return f"{year}.{venue}.{author}"

def convert_md_to_latex(md_text: str) -> str:
    """Convert Markdown to basic LaTeX document."""
    # LaTeX document header
    latex = r"""\documentclass[11pt]{article}
\usepackage[utf8]{inputenc}
\usepackage{amsmath,amssymb}
\usepackage{booktabs}
\usepackage{hyperref}
\usepackage{graphicx}

\begin{document}

"""
    # Convert Markdown to LaTeX
    content = md_text

    # Headers
    content = re.sub(r'^# (.+)$', r'\\section*{\1}', content, flags=re.MULTILINE)
    content = re.sub(r'^## (.+)$', r'\\subsection*{\1}', content, flags=re.MULTILINE)
    content = re.sub(r'^### (.+)$', r'\\subsubsection*{\1}', content, flags=re.MULTILINE)

    # Bold and italic
    content = re.sub(r'\*\*(.+?)\*\*', r'\\textbf{\1}', content)
    content = re.sub(r'\*(.+?)\*', r'\\textit{\1}', content)
    content = re.sub(r'_(.+?)_', r'\\textit{\1}', content)

    # Bullet points
    content = re.sub(r'^- (.+)$', r'\\item \1', content, flags=re.MULTILINE)

    # Code blocks
    content = re.sub(r'```\w*\n(.*?)\n```', r'\\begin{verbatim}\n\1\n\\end{verbatim}', content, flags=re.DOTALL)

    # Inline code
    content = re.sub(r'`([^`]+)`', r'\\texttt{\1}', content)

    # Convert markdown tables to LaTeX (basic)
    def convert_table(match):
        lines = match.group(0).strip().split('\n')
        if len(lines) < 2:
            return match.group(0)

        # Get header and determine columns
        header = lines[0]
        cols = header.count('|') - 1
        if cols <= 0:
            return match.group(0)

        latex_table = "\\begin{table}[h]\n\\centering\n"
        latex_table += "\\begin{tabular}{" + "l" * cols + "}\n\\toprule\n"

        for i, line in enumerate(lines):
            if '---' in line:
                continue
            cells = [c.strip() for c in line.split('|')[1:-1]]
            latex_table += " & ".join(cells) + " \\\\\n"
            if i == 0:
                latex_table += "\\midrule\n"

        latex_table += "\\bottomrule\n\\end{tabular}\n\\end{table}\n"
        return latex_table

    content = re.sub(r'\|.+\|[\s\S]*?\|.+\|', convert_table, content)

    latex += content
    latex += "\n\\end{document}\n"

    return latex

def convert_latex_to_md(tex_content: str) -> str:
    """Convert LaTeX to Markdown for LLM/RAG use."""
    md = tex_content

    # Remove LaTeX preamble (everything before \begin{document})
    doc_match = re.search(r'\\begin\{document\}', md)
    if doc_match:
        md = md[doc_match.end():]
    md = re.sub(r'\\end\{document\}.*', '', md, flags=re.DOTALL)

    # Remove comments
    md = re.sub(r'%.*$', '', md, flags=re.MULTILINE)

    # Convert sections
    md = re.sub(r'\\section\*?\{([^}]+)\}', r'# \1', md)
    md = re.sub(r'\\subsection\*?\{([^}]+)\}', r'## \1', md)
    md = re.sub(r'\\subsubsection\*?\{([^}]+)\}', r'### \1', md)
    md = re.sub(r'\\paragraph\*?\{([^}]+)\}', r'#### \1', md)

    # Convert formatting
    md = re.sub(r'\\textbf\{([^}]+)\}', r'**\1**', md)
    md = re.sub(r'\\textit\{([^}]+)\}', r'*\1*', md)
    md = re.sub(r'\\emph\{([^}]+)\}', r'*\1*', md)
    md = re.sub(r'\\texttt\{([^}]+)\}', r'`\1`', md)
    md = re.sub(r'\\underline\{([^}]+)\}', r'\1', md)

    # Convert citations and references (\cite, \citep, \citet)
    md = re.sub(r'\\cite[tp]?\{([^}]+)\}', r'[\1]', md)
    md = re.sub(r'\\ref\{([^}]+)\}', r'[\1]', md)
    md = re.sub(r'\\label\{[^}]+\}', '', md)

    # Convert URLs
    md = re.sub(r'\\url\{([^}]+)\}', r'\1', md)
    md = re.sub(r'\\href\{([^}]+)\}\{([^}]+)\}', r'[\2](\1)', md)

    # Convert lists
    md = re.sub(r'\\begin\{itemize\}', '', md)
    md = re.sub(r'\\end\{itemize\}', '', md)
    md = re.sub(r'\\begin\{enumerate\}', '', md)
    md = re.sub(r'\\end\{enumerate\}', '', md)
    md = re.sub(r'\\item\s*', '- ', md)

    # Convert math (keep as-is for LaTeX rendering)
    md = re.sub(r'\$\$([^$]+)\$\$', r'\n$$\1$$\n', md)
    md = re.sub(r'\\begin\{equation\*?\}(.*?)\\end\{equation\*?\}', r'\n$$\1$$\n', md, flags=re.DOTALL)
    md = re.sub(r'\\begin\{align\*?\}(.*?)\\end\{align\*?\}', r'\n$$\1$$\n', md, flags=re.DOTALL)

    # Convert tables (basic - keep structure)
    def convert_table(match):
        table_content = match.group(1)
        rows = re.split(r'\\\\', table_content)
        md_rows = []
        for i, row in enumerate(rows):
            cells = [c.strip() for c in re.split(r'&', row) if c.strip()]
            if cells:
                md_rows.append('| ' + ' | '.join(cells) + ' |')
                if i == 0:
                    md_rows.append('|' + '---|' * len(cells))
        return '\n'.join(md_rows)

    md = re.sub(r'\\begin\{tabular\}\{[^}]*\}(.*?)\\end\{tabular\}', convert_table, md, flags=re.DOTALL)

    # Remove common LaTeX commands
    md = re.sub(r'\\(small|large|Large|footnotesize|normalsize|tiny|huge)\b', '', md)
    md = re.sub(r'\\(hline|toprule|midrule|bottomrule|cline\{[^}]*\})', '', md)
    md = re.sub(r'\\(vspace|hspace|vskip|hskip)\{[^}]*\}', '', md)
    md = re.sub(r'\\(centering|raggedright|raggedleft)\b', '', md)
    md = re.sub(r'\\(newline|linebreak|pagebreak|newpage)\b', '\n', md)
    md = re.sub(r'\\\\', '\n', md)
    md = re.sub(r'\\[a-zA-Z]+\{[^}]*\}', '', md)  # Remove remaining commands with args
    md = re.sub(r'\\[a-zA-Z]+\b', '', md)  # Remove remaining commands without args

    # Clean up
    md = re.sub(r'\n{3,}', '\n\n', md)
    md = re.sub(r'^\s+', '', md, flags=re.MULTILINE)

    return md.strip()

def download_arxiv_source(arxiv_id: str, folder: str) -> str:
    """Download LaTeX source from arXiv e-print. Returns tex content or None."""
    import tarfile
    import gzip
    from io import BytesIO

    source_url = f"https://arxiv.org/e-print/{arxiv_id}"
    try:
        response = requests.get(source_url, allow_redirects=True)
        response.raise_for_status()

        content = response.content
        tex_content = None

        # Try to extract as tar.gz
        try:
            with tarfile.open(fileobj=BytesIO(content), mode='r:gz') as tar:
                tex_files = [m.name for m in tar.getmembers() if m.name.endswith('.tex')]

                # Extract all files for reference
                os.makedirs(f"{folder}/source", exist_ok=True)
                tar.extractall(path=f"{folder}/source")
                print(f"Extracted {len(tar.getmembers())} source files to {folder}/source/")

                # Find main tex file
                main_tex = None
                for name in tex_files:
                    if 'main' in name.lower():
                        main_tex = name
                        break
                if not main_tex and tex_files:
                    main_tex = tex_files[0]

                if main_tex:
                    with open(f"{folder}/source/{main_tex}", 'r', encoding='utf-8', errors='ignore') as f:
                        tex_content = f.read()
                    print(f"Main LaTeX: {main_tex}")

            if tex_content:
                return tex_content
        except tarfile.TarError:
            pass

        # Try as plain gzipped tex file
        try:
            tex_content = gzip.decompress(content).decode('utf-8', errors='ignore')
            if '\\documentclass' in tex_content or '\\begin{document}' in tex_content:
                print("Extracted LaTeX source (gzip)")
                return tex_content
        except Exception:
            pass

        # Try as plain tex file
        try:
            tex_content = content.decode('utf-8', errors='ignore')
            if '\\documentclass' in tex_content or '\\begin{document}' in tex_content:
                print("Extracted LaTeX source (plain)")
                return tex_content
        except Exception:
            pass

        print("Could not extract LaTeX source from arXiv")
        return None

    except Exception as e:
        print(f"Failed to download arXiv source: {e}")
        return None

def build_front_matter(title: str, authors: list, year: int, venue: str, url: str, arxiv_id: str = None) -> str:
    """Build YAML front matter for paper.md"""
    authors_yaml = '\n'.join(f'  - "{a}"' for a in authors)
    fm = f'''---
title: "{title}"
authors:
{authors_yaml}
year: {year}
venue: "{venue}"
url: "{url}"'''
    if arxiv_id:
        fm += f'\narxiv: "{arxiv_id}"'
    fm += '\n---\n\n'
    return fm

def fetch_arxiv(arxiv_id: str):
    """Fetch paper from arXiv. Priority: LaTeX source -> generate MD from it."""
    arxiv_id = re.sub(r'^(arxiv:|https?://arxiv\.org/(abs|pdf)/)', '', arxiv_id)
    # str.rstrip strips a character set, not a suffix -- remove '.pdf' explicitly
    arxiv_id = re.sub(r'\.pdf$', '', arxiv_id.rstrip('/'))

    # Get paper metadata first
    client = arxiv.Client()
    paper = next(client.results(arxiv.Search(id_list=[arxiv_id])))

    # Generate folder name: year.arxiv.author
    year = paper.published.year
    first_author = paper.authors[0].name if paper.authors else "unknown"
    folder_name = get_folder_name(year, "arxiv", first_author)
    folder = f"references/{folder_name}"
    os.makedirs(folder, exist_ok=True)
    print(f"Folder: {folder}")

    # Build front matter
    front_matter = build_front_matter(
        title=paper.title,
        authors=[a.name for a in paper.authors],
        year=year,
        venue="arXiv",
        url=paper.entry_id,
        arxiv_id=arxiv_id
    )

    # 1. Download LaTeX source first (priority)
    tex_content = download_arxiv_source(arxiv_id, folder)

    if tex_content:
        # Save paper.tex
        with open(f"{folder}/paper.tex", 'w', encoding='utf-8') as f:
            f.write(tex_content)
        print("Saved: paper.tex (original arXiv source)")

        # Generate paper.md from LaTeX with front matter
        md_text = convert_latex_to_md(tex_content)
        with open(f"{folder}/paper.md", 'w', encoding='utf-8') as f:
            f.write(front_matter + md_text)
        print("Generated: paper.md (from LaTeX)")
        has_source = True
    else:
        has_source = False

    # 2. Download PDF (always, for reference)
    pdf_path = f"{folder}/paper.pdf"
    paper.download_pdf(filename=pdf_path)
    print("Downloaded: paper.pdf")

    # 3. If no LaTeX source, extract from PDF
    if not has_source:
        md_text = pymupdf4llm.to_markdown(pdf_path)
        with open(f"{folder}/paper.md", 'w', encoding='utf-8') as f:
            f.write(front_matter + md_text)
        print("Extracted: paper.md (from PDF)")

        tex_content = convert_md_to_latex(md_text)
        with open(f"{folder}/paper.tex", 'w', encoding='utf-8') as f:
            f.write(tex_content)
        print("Generated: paper.tex (from PDF)")

    return folder

def parse_acl_id(paper_id: str) -> tuple:
    """Parse ACL paper ID to extract year and venue."""
    # New format: 2020.findings-emnlp.92, 2021.naacl-demos.1
    new_match = re.match(r'^(\d{4})\.([a-z\-]+)\.(\d+)$', paper_id)
    if new_match:
        year = int(new_match.group(1))
        venue = new_match.group(2)
        return year, venue

    # Old format: E14-2005, N18-5012, P19-1017
    old_match = re.match(r'^([A-Z])(\d{2})-\d+$', paper_id)
    if old_match:
        prefix = old_match.group(1)
        year_short = int(old_match.group(2))
        year = 2000 + year_short if year_short < 50 else 1900 + year_short

        venue_map = {
            'P': 'acl', 'N': 'naacl', 'E': 'eacl', 'D': 'emnlp',
            'C': 'coling', 'W': 'workshop', 'S': 'semeval', 'Q': 'tacl'
        }
        venue = venue_map.get(prefix, 'acl')
        return year, venue

    return None, paper_id

def fetch_acl(paper_id: str):
    """Fetch paper from ACL Anthology."""
    paper_id = re.sub(r'^https?://aclanthology\.org/', '', paper_id)
    # Strip a trailing slash, then a '.pdf' suffix (rstrip would eat ID characters)
    paper_id = re.sub(r'\.pdf$', '', paper_id.rstrip('/'))

    # Get metadata from ACL Anthology BibTeX
    bib_url = f"https://aclanthology.org/{paper_id}.bib"
    title = ""
    authors = []
    booktitle = ""
    try:
        bib_response = requests.get(bib_url)
        bib_text = bib_response.text

        # Extract title
        title_match = re.search(r'title\s*=\s*["{]([^"}]+)', bib_text)
        title = title_match.group(1) if title_match else ""

        # Extract all authors
        author_match = re.search(r'author\s*=\s*["{]([^"}]+)', bib_text)
        if author_match:
            authors_str = author_match.group(1)
            authors = [a.strip() for a in authors_str.split(' and ')]

        # Extract booktitle/venue
        booktitle_match = re.search(r'booktitle\s*=\s*["{]([^"}]+)', bib_text)
        booktitle = booktitle_match.group(1) if booktitle_match else ""

        # Extract year
        year_match = re.search(r'year\s*=\s*["{]?(\d{4})', bib_text)
        year = int(year_match.group(1)) if year_match else None
    except Exception:
        authors = ["unknown"]
        year = None

    first_author = authors[0] if authors else "unknown"

    # Parse venue from paper_id
    parsed_year, venue = parse_acl_id(paper_id)
    if year is None:
        year = parsed_year or 2020

    # Generate folder name
    folder_name = get_folder_name(year, venue, first_author)
    folder = f"references/{folder_name}"
    os.makedirs(folder, exist_ok=True)
    print(f"Folder: {folder}")

    # Build front matter
    front_matter = build_front_matter(
        title=title,
        authors=authors,
        year=year,
        venue=booktitle or venue.upper(),
        url=f"https://aclanthology.org/{paper_id}"
    )

    pdf_url = f"https://aclanthology.org/{paper_id}.pdf"
    pdf_path = f"{folder}/paper.pdf"

    response = requests.get(pdf_url, allow_redirects=True)
    response.raise_for_status()
    with open(pdf_path, 'wb') as f:
        f.write(response.content)
    print(f"Downloaded PDF: {pdf_path}")

    # Extract markdown from PDF
    md_text = pymupdf4llm.to_markdown(pdf_path)
    with open(f"{folder}/paper.md", 'w', encoding='utf-8') as f:
        f.write(front_matter + md_text)
    print("Extracted: paper.md")

    # Generate LaTeX from markdown
    tex_content = convert_md_to_latex(md_text)
    with open(f"{folder}/paper.tex", 'w', encoding='utf-8') as f:
        f.write(tex_content)
    print("Generated: paper.tex")

    return folder

def fetch_semantic_scholar(query: str):
    """Fetch paper via Semantic Scholar."""
    if re.match(r'^[0-9a-f]{40}$', query.replace('s2:', '')):
        paper_id = query.replace('s2:', '')
    else:
        url = "https://api.semanticscholar.org/graph/v1/paper/search"
        params = {"query": query, "limit": 1, "fields": "paperId,title"}
825
+ response = requests.get(url, params=params)
826
+ data = response.json()
827
+ if not data.get('data'):
828
+ print(f"No papers found for: {query}")
829
+ return None
830
+ paper_id = data['data'][0]['paperId']
831
+ print(f"Found: {data['data'][0]['title']}")
832
+
833
+ url = f"https://api.semanticscholar.org/graph/v1/paper/{paper_id}"
834
+ params = {"fields": "title,authors,abstract,year,openAccessPdf,externalIds,url,venue"}
835
+ response = requests.get(url, params=params)
836
+ response.raise_for_status()
837
+ data = response.json()
838
+
839
+ # Get year, venue, first author for folder name
840
+ year = data.get('year') or 2020
841
+ venue = data.get('venue') or 'unknown'
842
+ # Normalize venue - handle common patterns
843
+ if not venue or venue == 'unknown':
844
+ if 'ArXiv' in data.get('externalIds', {}):
845
+ venue = 'arxiv'
846
+ elif 'ACL' in data.get('externalIds', {}):
847
+ venue = 'acl'
848
+ else:
849
+ venue = 'paper'
850
+
851
+ authors = data.get('authors', [])
852
+ author_names = [a['name'] for a in authors]
853
+ first_author = author_names[0] if author_names else 'unknown'
854
+
855
+ # Generate folder name: year.venue.author
856
+ folder_name = get_folder_name(year, venue, first_author)
857
+ folder = f"references/{folder_name}"
858
+ os.makedirs(folder, exist_ok=True)
859
+ print(f"Folder: {folder}")
860
+
861
+ # Build front matter
862
+ front_matter = build_front_matter(
863
+ title=data.get('title', ''),
864
+ authors=author_names,
865
+ year=year,
866
+ venue=venue,
867
+ url=data.get('url', '')
868
+ )
869
+
870
+ pdf_info = data.get('openAccessPdf')
871
+ if pdf_info and pdf_info.get('url'):
872
+ pdf_url = pdf_info['url']
873
+ pdf_path = f"{folder}/paper.pdf"
874
+
875
+ pdf_response = requests.get(pdf_url, allow_redirects=True)
876
+ pdf_response.raise_for_status()
877
+ with open(pdf_path, 'wb') as f:
878
+ f.write(pdf_response.content)
879
+ print(f"Downloaded PDF: {pdf_path}")
880
+
881
+ # Extract markdown from PDF with front matter
882
+ md_text = pymupdf4llm.to_markdown(pdf_path)
883
+ with open(f"{folder}/paper.md", 'w', encoding='utf-8') as f:
884
+ f.write(front_matter + md_text)
885
+ print(f"Extracted: paper.md")
886
+
887
+ # Generate LaTeX
888
+ tex_content = convert_md_to_latex(md_text)
889
+ with open(f"{folder}/paper.tex", 'w', encoding='utf-8') as f:
890
+ f.write(tex_content)
891
+ print(f"Generated: paper.tex")
892
+ else:
893
+ print("No open access PDF available")
894
+ if 'ArXiv' in data.get('externalIds', {}):
895
+ return fetch_arxiv(data['externalIds']['ArXiv'])
896
+
897
+ return folder
898
+
899
+ if __name__ == "__main__":
900
+ if len(sys.argv) < 2:
901
+ print("Usage: uv run fetch_paper.py <paper_id_or_url_or_query>")
902
+ sys.exit(1)
903
+
904
+ query = ' '.join(sys.argv[1:])
905
+
906
+ if re.match(r'^\d{4}\.\d{4,5}', query) or 'arxiv.org' in query or query.startswith('arxiv:'):
907
+ fetch_arxiv(query)
908
+ elif re.match(r'^[A-Z]\d{2}-\d{4}$', query) or re.match(r'^\d{4}\.[a-z]+-', query) or 'aclanthology.org' in query:
909
+ fetch_acl(query)
910
+ else:
911
+ fetch_semantic_scholar(query)
912
+ ```
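The `__main__` dispatch above routes a query purely on its surface pattern: arXiv IDs look like `2301.10140`, old ACL Anthology IDs like `P19-1017`, new ones like `2023.acl-long.1`, and anything else falls through to a Semantic Scholar search. A standalone sketch of that routing logic (the `classify` helper is illustrative, not part of the script):

```python
import re

def classify(query: str) -> str:
    """Mirror the dispatch in __main__: arXiv ID, ACL Anthology ID, or free-text search."""
    if re.match(r'^\d{4}\.\d{4,5}', query) or 'arxiv.org' in query or query.startswith('arxiv:'):
        return 'arxiv'
    if (re.match(r'^[A-Z]\d{2}-\d{4}$', query) or re.match(r'^\d{4}\.[a-z]+-', query)
            or 'aclanthology.org' in query):
        return 'acl'
    return 'semantic_scholar'

print(classify("1810.04805"))       # arXiv-style numeric ID
print(classify("P19-1017"))         # old ACL Anthology ID
print(classify("2023.acl-long.1"))  # new ACL Anthology ID
```

Note the ordering matters: `2023.acl-long.1` fails the arXiv pattern (the segment after the dot is not digits) before matching the ACL one, so arXiv must be tested first.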
+
+ ---
+
+ ## Quick Commands
+
+ **Fetch and extract to both formats (MD + LaTeX):**
+ ```bash
+ mkdir -p references/E14-2005 && \
+ curl -L "https://aclanthology.org/E14-2005.pdf" -o references/E14-2005/paper.pdf && \
+ uv run --with pymupdf4llm python << 'EOF'
+ import pymupdf4llm
+
+ folder = 'references/E14-2005'
+ pdf_path = f'{folder}/paper.pdf'
+
+ # Extract to Markdown
+ md = pymupdf4llm.to_markdown(pdf_path)
+ open(f'{folder}/paper.md', 'w', encoding='utf-8').write(md)
+ print(f'Created: {folder}/paper.md')
+
+ # Convert to basic LaTeX
+ tex = f"""\\documentclass{{article}}
+ \\usepackage[utf8]{{inputenc}}
+ \\usepackage{{amsmath,booktabs}}
+ \\begin{{document}}
+ {md}
+ \\end{{document}}
+ """
+ open(f'{folder}/paper.tex', 'w', encoding='utf-8').write(tex)
+ print(f'Created: {folder}/paper.tex')
+ EOF
+ ```
+
+ **Using pdfplumber for complex tables:**
+ ```bash
+ uv run --with pdfplumber --with pandas python -c "
+ import pdfplumber
+ import pandas as pd
+
+ with pdfplumber.open('references/E14-2005/paper.pdf') as pdf:
+     for page in pdf.pages:
+         for table in page.extract_tables():
+             if table:
+                 print(pd.DataFrame(table[1:], columns=table[0]).to_markdown())
+ "
+ ```
+
+ ---
+
+ ## Troubleshooting Table Extraction
+
+ ### Problem: Tables not detected
+
+ **Solution 1:** Try a different extraction strategy with pymupdf4llm:
+ ```python
+ md = pymupdf4llm.to_markdown("paper.pdf", table_strategy="lines")  # or "text"
+ ```
+
+ **Solution 2:** Use pdfplumber with custom settings:
+ ```python
+ import pdfplumber
+ with pdfplumber.open("paper.pdf") as pdf:
+     page = pdf.pages[0]
+     tables = page.extract_tables(table_settings={
+         "vertical_strategy": "text",
+         "horizontal_strategy": "text",
+     })
+ ```
+
+ ### Problem: Table columns misaligned
+
+ **Solution:** Use pdfplumber's debug mode to visualize:
+ ```python
+ import pdfplumber
+ with pdfplumber.open("paper.pdf") as pdf:
+     page = pdf.pages[3]  # Page with table
+     im = page.to_image()
+     im.debug_tablefinder()
+     im.save("debug_table.png")
+ ```
+
+ ### Problem: Multi-column papers
+
+ **Solution:** Use Marker for best results:
+ ```bash
+ marker_single paper.pdf output/ --output_format markdown
+ ```
+
+ ---
+
+ ## Sources
+
+ - [pymupdf4llm](https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/) - Best for markdown with tables
+ - [pdfplumber](https://github.com/jsvine/pdfplumber) - Most accurate table extraction
+ - [Marker](https://github.com/datalab-to/marker) - Deep learning PDF to markdown
+ - [Camelot](https://camelot-py.readthedocs.io/) - Specialized table extraction
+ - [PDF Parsing Comparison Study](https://arxiv.org/html/2410.09871v1) - Academic comparison
.claude/skills/paper-research/SKILL.md ADDED
@@ -0,0 +1,502 @@
+ ---
+ name: paper-research
+ description: Research a topic systematically following ACL/NeurIPS/ICML best practices. Finds papers, builds citation networks, and synthesizes findings.
+ argument-hint: "<research-topic>"
+ dependencies:
+   - paper-fetch
+ ---
+
+ # Systematic Research Skill
+
+ Research a topic following best practices from ACL, NeurIPS, ICML, and systematic literature review (SLR) methodology.
+
+ **Integrates with**: `/paper-fetch` - automatically fetch and store important papers for full-text analysis.
+
+ ## Target
+
+ **Research Topic**: $ARGUMENTS
+
+ If no topic is specified, analyze the current project to identify relevant research topics.
+
+ ---
+
+ ## Research Methodology
+
+ Follow a systematic approach based on PRISMA guidelines and AI conference best practices.
+
+ ### Phase 1: Define Research Scope
+
+ #### 1.1 Formulate Research Questions
+
+ Use the PICO/PICo framework adapted for CS/NLP:
+
+ | Component | Description | Example |
+ |-----------|-------------|---------|
+ | **P**opulation | Task/Domain | Vietnamese POS tagging |
+ | **I**ntervention | Method/Approach | CRF, Transformers, BERT |
+ | **C**omparison | Baselines | Rule-based, HMM, BiLSTM |
+ | **O**utcome | Metrics | Accuracy, F1, inference speed |
+
+ **Template research questions:**
+ - RQ1: What is the current state-of-the-art for [task]?
+ - RQ2: What methods have been applied to [task]?
+ - RQ3: What are the main challenges and open problems?
+ - RQ4: What datasets and benchmarks exist?
+
+ #### 1.2 Define Search Terms
+
+ Create a comprehensive keyword list:
+
+ ```
+ Primary terms: [main task] (e.g., "POS tagging", "part-of-speech")
+ Method terms: [approaches] (e.g., "CRF", "neural", "transformer")
+ Domain terms: [language/domain] (e.g., "Vietnamese", "low-resource")
+ Synonyms: [alternatives] (e.g., "word tagging", "morphological analysis")
+ ```
+
+ Build search queries using Boolean operators:
+ ```
+ ("POS tagging" OR "part-of-speech") AND ("Vietnamese" OR "low-resource") AND ("CRF" OR "neural" OR "BERT")
+ ```
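A query string like the one above can be assembled mechanically from the keyword groups instead of typed by hand. A minimal sketch (the `build_query` helper is illustrative, not part of any tool in this repo):

```python
def build_query(*groups):
    """AND together OR-groups of keywords, quoting each term."""
    return ' AND '.join(
        '(' + ' OR '.join(f'"{term}"' for term in group) + ')'
        for group in groups
    )

q = build_query(
    ["POS tagging", "part-of-speech"],   # primary terms
    ["Vietnamese", "low-resource"],      # domain terms
    ["CRF", "neural", "BERT"],           # method terms
)
print(q)
```

Each positional argument becomes one parenthesized OR-group, so adding a synonym only means appending to a list.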
+
+ ---
+
+ ### Phase 2: Search for Papers
+
+ #### 2.1 Search Sources
+
+ Search these sources in order of priority:
+
+ | Source | Best For | URL |
+ |--------|----------|-----|
+ | **ACL Anthology** | NLP/CL papers | https://aclanthology.org |
+ | **Semantic Scholar** | AI/ML papers, citations | https://semanticscholar.org |
+ | **arXiv** | Preprints, latest work | https://arxiv.org |
+ | **Google Scholar** | Broad coverage | https://scholar.google.com |
+ | **DBLP** | CS bibliography | https://dblp.org |
+ | **Papers With Code** | SOTA benchmarks | https://paperswithcode.com |
+
+ #### 2.2 Search Commands
+
+ **ACL Anthology:**
+ ```bash
+ # Search via web
+ curl "https://aclanthology.org/search/?q=vietnamese+pos+tagging"
+ ```
+
+ **Semantic Scholar API:**
+ ```bash
+ # Search papers
+ curl "https://api.semanticscholar.org/graph/v1/paper/search?query=vietnamese+POS+tagging&limit=20&fields=title,year,citationCount,authors,abstract,openAccessPdf"
+
+ # Get paper details with citations
+ curl "https://api.semanticscholar.org/graph/v1/paper/{paper_id}?fields=title,abstract,citations,references"
+ ```
+
+ **arXiv API:**
+ ```bash
+ # Search arXiv
+ curl "http://export.arxiv.org/api/query?search_query=all:vietnamese+pos+tagging&max_results=20"
+ ```
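The arXiv endpoint returns an Atom XML feed rather than JSON; entry titles can be pulled out with the standard library alone. A sketch that parses a response you have already downloaded (the `titles_from_atom` helper is illustrative):

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace used by the arXiv API

def titles_from_atom(xml_text: str) -> list:
    """Extract <entry><title> values from an arXiv API Atom response."""
    root = ET.fromstring(xml_text)
    return [e.findtext(f"{ATOM}title", "").strip()
            for e in root.findall(f"{ATOM}entry")]
```

The same pattern extends to authors and abstracts by reading the other namespaced child elements of each `entry`.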
+
+ **Papers With Code:**
+ ```bash
+ # Check SOTA
+ curl "https://paperswithcode.com/api/v1/search/?q=vietnamese+pos+tagging"
+ ```
+
+ #### 2.3 Citation Network Exploration
+
+ Use these strategies to find related work:
+
+ 1. **Backward search**: Check references of key papers
+ 2. **Forward search**: Find papers that cite key papers
+ 3. **Author search**: Find other papers by same authors
+ 4. **Similar papers**: Use Semantic Scholar's recommendations
+
+ ```bash
+ # Get citations (papers that cite this paper)
+ curl "https://api.semanticscholar.org/graph/v1/paper/{paper_id}/citations?fields=title,year,citationCount&limit=50"
+
+ # Get references (papers this paper cites)
+ curl "https://api.semanticscholar.org/graph/v1/paper/{paper_id}/references?fields=title,year,citationCount&limit=50"
+ ```
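The two endpoints above wrap each neighbouring paper differently: `references` items nest the paper under `citedPaper`, `citations` items under `citingPaper` (per the Graph API docs). A small sketch that merges both one-hop directions from already-fetched JSON payloads (the `merge_neighbours` helper is illustrative):

```python
def merge_neighbours(references_json: dict, citations_json: dict) -> dict:
    """Map paperId -> title across backward (references) and forward (citations) hops."""
    seen = {}
    for payload, key in ((references_json, "citedPaper"),
                         (citations_json, "citingPaper")):
        for item in payload.get("data", []):
            paper = item.get(key) or {}
            pid = paper.get("paperId")
            if pid:
                seen[pid] = paper.get("title", "")
    return seen
```

Deduplicating on `paperId` matters here: a seminal paper often appears in both directions of the hop.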
+
+ #### 2.4 Discovery Tools
+
+ Use these tools for visual exploration:
+
+ | Tool | Purpose | URL |
+ |------|---------|-----|
+ | **Connected Papers** | Visual citation graph | https://connectedpapers.com |
+ | **Research Rabbit** | Paper recommendations | https://researchrabbit.ai |
+ | **Litmaps** | Citation mapping | https://litmaps.com |
+ | **Elicit** | AI paper search | https://elicit.com |
+ | **Inciteful** | Citation network | https://inciteful.xyz |
+
+ ---
+
+ ### Phase 3: Screen and Select Papers
+
+ #### 3.1 Inclusion/Exclusion Criteria
+
+ Define clear criteria:
+
+ **Include:**
+ - Published in peer-reviewed venue (ACL, EMNLP, NAACL, COLING, etc.)
+ - Relevant to research questions
+ - Published within timeframe (e.g., last 5-10 years)
+ - English language
+
+ **Exclude:**
+ - Non-peer-reviewed (unless highly cited preprint)
+ - Tangentially related
+ - Superseded by newer work
+ - Duplicate/extended versions (keep most comprehensive)
+
+ #### 3.2 Screening Process
+
+ 1. **Title/Abstract screening**: Quick relevance check
+ 2. **Full-text screening**: Detailed relevance assessment
+ 3. **Quality assessment**: Methodological rigor
+
+ Track using PRISMA flow:
+ ```
+ Records identified: N
+ Duplicates removed: N
+ Records screened: N
+ Records excluded: N
+ Full-text assessed: N
+ Studies included: N
+ ```
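Only three of these counts are independent decisions; the rest follow by subtraction, which also gives you a consistency check for free. A sketch (the `prisma_flow` helper is illustrative):

```python
def prisma_flow(identified: int, duplicates: int,
                excluded_on_screen: int, excluded_full_text: int) -> dict:
    """Derive the remaining PRISMA counts from the exclusion decisions."""
    screened = identified - duplicates
    full_text = screened - excluded_on_screen
    included = full_text - excluded_full_text
    assert 0 <= included <= full_text <= screened <= identified, "inconsistent counts"
    return {"identified": identified, "screened": screened,
            "full_text_assessed": full_text, "included": included}
```

If the assertion fires while you fill in the flow, one of the recorded exclusion counts is wrong.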
+
+ ---
+
+ ### Phase 3.5: Fetch Selected Papers
+
+ After screening, use the **paper-fetch** skill to download important papers for full-text analysis.
+
+ #### 3.5.1 Identify Papers to Fetch
+
+ Prioritize fetching:
+ 1. **Seminal papers**: Highly cited foundational work
+ 2. **SOTA papers**: Current best-performing methods
+ 3. **Directly relevant**: Papers closest to your research
+ 4. **Methodology papers**: Detailed method descriptions needed
+
+ #### 3.5.2 Fetch Papers Using paper-fetch Skill
+
+ For each selected paper, invoke `/paper-fetch` with the paper ID:
+
+ ```bash
+ # arXiv papers
+ /paper-fetch 2301.10140
+ /paper-fetch arxiv:1810.04805   # BERT paper
+
+ # ACL Anthology papers
+ /paper-fetch P19-1017
+ /paper-fetch 2023.acl-long.1
+
+ # Semantic Scholar (by title search)
+ /paper-fetch "BERT: Pre-training of Deep Bidirectional Transformers"
+ ```
+
+ #### 3.5.3 Batch Fetching
+
+ For multiple papers, fetch in sequence:
+
+ ```bash
+ # Create list of paper IDs to fetch
+ PAPERS=(
+     "1810.04805"   # BERT
+     "2003.00744"   # PhoBERT
+     "P19-1017"     # Example ACL paper
+ )
+
+ # Fetch each paper
+ for paper_id in "${PAPERS[@]}"; do
+     /paper-fetch $paper_id
+ done
+ ```
+
+ #### 3.5.4 Output Structure
+
+ Each fetched paper creates a folder named `year.venue.author`:
+
+ ```
+ references/
+     2018.arxiv.devlin/       # BERT
+         paper.pdf            # Original PDF
+         paper.md             # Extracted text with front matter (for full-text search/analysis)
+         paper.tex            # Generated LaTeX
+     2020.arxiv.nguyen/       # PhoBERT
+         paper.pdf
+         paper.md
+         paper.tex
+     research_{topic}/        # Research synthesis (Phase 6)
+         README.md
+         papers.md
+         ...
+ ```
+
+ #### 3.5.5 Use Fetched Papers
+
+ After fetching, you can:
+ - **Read full text**: Open `paper.md` for detailed analysis
+ - **Search across papers**: Grep through all `paper.md` files
+ - **Extract quotes**: Copy relevant sections with page references
+ - **Verify claims**: Check the original source for accuracy
+
+ ```bash
+ # Search across all fetched papers
+ grep -r "CRF" references/*/paper.md
+
+ # Find specific methodology details
+ grep -r "feature template" references/*/paper.md
+ ```
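The same search can be done as a small Python helper when you want structured, per-paper results rather than raw grep output (the `search_papers` helper is illustrative and assumes the `references/*/paper.md` layout above):

```python
from pathlib import Path

def search_papers(term: str, root: str = "references") -> dict:
    """Case-insensitive line search over every <root>/*/paper.md; folder name -> matching lines."""
    hits = {}
    for md in sorted(Path(root).glob("*/paper.md")):
        matches = [line.strip()
                   for line in md.read_text(encoding="utf-8").splitlines()
                   if term.lower() in line.lower()]
        if matches:
            hits[md.parent.name] = matches
    return hits
```

Keying the result on the folder name means each hit already tells you which paper (year, venue, first author) it came from.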
+
+ ---
+
+ ### Phase 4: Extract and Organize Information
+
+ #### 4.1 Create Paper Database
+
+ For each paper, extract:
+
+ ```markdown
+ ## Paper: [Title]
+
+ - **Authors**: [Names]
+ - **Venue**: [Conference/Journal] [Year]
+ - **URL**: [Link]
+ - **Citations**: [Count]
+
+ ### Summary
+ [2-3 sentence summary]
+
+ ### Key Contributions
+ 1. [Contribution 1]
+ 2. [Contribution 2]
+
+ ### Methodology
+ - **Approach**: [Method name/type]
+ - **Dataset**: [Dataset used]
+ - **Metrics**: [Evaluation metrics]
+
+ ### Results
+ | Dataset | Metric | Score |
+ |---------|--------|-------|
+ | [Name] | [Acc] | [XX%] |
+
+ ### Strengths
+ - [Strength 1]
+
+ ### Limitations
+ - [Limitation 1]
+
+ ### Relevance to Our Work
+ [How this paper relates to current project]
+ ```
+
+ #### 4.2 Comparison Table
+
+ Create a summary table:
+
+ ```markdown
+ | Paper | Year | Method | Dataset | Accuracy | F1 | Key Innovation |
+ |-------|------|--------|---------|----------|------|----------------|
+ | [1] | 2023 | BERT | UDD | 97.2% | 96.8 | Fine-tuning |
+ | [2] | 2022 | CRF | VLSP | 95.5% | 94.1 | Feature eng. |
+ ```
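Once the per-paper records exist, rows like these can be rendered from the database instead of typed by hand. A minimal sketch (the record field names are illustrative, not a fixed schema):

```python
def comparison_row(p: dict) -> str:
    """Render one paper record as a Markdown comparison-table row."""
    return "| {ref} | {year} | {method} | {dataset} | {acc} | {f1} | {note} |".format(**p)

row = comparison_row({"ref": "[1]", "year": 2023, "method": "BERT", "dataset": "UDD",
                      "acc": "97.2%", "f1": "96.8", "note": "Fine-tuning"})
print(row)
```

Generating rows this way keeps the table and the paper database (Phase 4.1) from drifting apart.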
+
+ ---
+
+ ### Phase 5: Synthesize Findings
+
+ #### 5.1 Thematic Analysis
+
+ Organize findings by themes (not chronologically):
+
+ ```markdown
+ ## Related Work Synthesis
+
+ ### Traditional Approaches
+ - Rule-based methods: [Summary]
+ - Statistical methods (HMM, CRF): [Summary]
+
+ ### Neural Approaches
+ - RNN/LSTM-based: [Summary]
+ - Transformer-based: [Summary]
+
+ ### Vietnamese-Specific Work
+ - [Summary of Vietnamese NLP research]
+
+ ### Datasets and Benchmarks
+ - [Available resources]
+
+ ### Open Challenges
+ - [Remaining problems]
+ ```
+
+ #### 5.2 Gap Analysis
+
+ Identify what's missing:
+
+ ```markdown
+ ## Research Gaps
+
+ 1. **Methodological gaps**: [What methods haven't been tried?]
+ 2. **Data gaps**: [What data is missing?]
+ 3. **Evaluation gaps**: [What isn't being measured?]
+ 4. **Domain gaps**: [What domains lack coverage?]
+ ```
+
+ #### 5.3 SOTA Summary
+
+ ```markdown
+ ## State-of-the-Art
+
+ ### Current Best Results
+ | Task | Dataset | SOTA Model | Score | Paper |
+ |------|---------|------------|-------|-------|
+ | POS | UDD | PhoBERT | 97.2% | [Ref] |
+
+ ### Trends
+ - [Trend 1: e.g., "Shift from CRF to Transformers"]
+ - [Trend 2: e.g., "Increasing use of pre-trained models"]
+ ```
+
+ ---
+
+ ### Phase 6: Document Research
+
+ #### 6.1 Output Structure
+
+ Save research to `references/` with fetched papers and synthesis:
+
+ ```
+ references/
+     # Fetched papers (via /paper-fetch), named year.venue.author
+     2018.arxiv.devlin/       # BERT paper
+         paper.pdf
+         paper.md             # Full text for analysis
+         paper.tex            # Generated LaTeX
+     2020.arxiv.nguyen/       # PhoBERT paper
+         paper.pdf
+         paper.md
+         paper.tex
+
+     # Research synthesis (this skill)
+     research_vietnamese_pos/
+         README.md            # Research summary & findings
+         papers.md            # Paper database with notes
+         comparison.md        # Comparison tables
+         bibliography.bib     # BibTeX references
+         sota.md              # State-of-the-art summary
+ ```
+
+ #### 6.2 Research Report Template
+
+ ```markdown
+ # Literature Review: [Topic]
+
+ **Date**: [YYYY-MM-DD]
+ **Research Questions**: [RQs]
+
+ ## Executive Summary
+ [1 paragraph overview]
+
+ ## Methodology
+ - **Search sources**: [List]
+ - **Search terms**: [Keywords]
+ - **Timeframe**: [Date range]
+ - **Inclusion criteria**: [Criteria]
+
+ ## PRISMA Flow
+ - Records identified: N
+ - Studies included: N
+
+ ## Findings
+
+ ### RQ1: [Question]
+ [Answer with citations]
+
+ ### RQ2: [Question]
+ [Answer with citations]
+
+ ## State-of-the-Art
+ [Current best methods/results]
+
+ ## Research Gaps
+ [Identified opportunities]
+
+ ## Recommendations
+ [Suggested directions]
+
+ ## References
+ [Bibliography]
+ ```
+
+ ---
+
+ ## Best Practices (ACL/NeurIPS/ICML)
+
+ ### DO:
+ - **Explain differences**: "Related work should not just list prior work, but explain how the proposed work differs" (NeurIPS guidelines)
+ - **Be comprehensive**: Cover all major approaches and methods
+ - **Be fair**: Acknowledge contributions of prior work
+ - **Be current**: Include recent work (but contemporaneous papers within 2 months are excused)
+ - **Include proper citations**: Use DOIs or ACL Anthology links (ACL requirement)
+
+ ### DON'T:
+ - Just list papers without synthesis
+ - Ignore non-English language work
+ - Miss seminal papers in the field
+ - Cherry-pick only papers that support your position
+ - Dismiss work as "obvious in retrospect"
+
+ ### Quality Checks:
+ - [ ] All major approaches covered
+ - [ ] Recent work (last 2-3 years) included
+ - [ ] Seminal papers cited
+ - [ ] Fair characterization of prior work
+ - [ ] Clear connection to your work
+ - [ ] Proper citation format
+
+ ---
+
+ ## Quick Reference: API Endpoints
+
+ ```bash
+ # Semantic Scholar - Search
+ curl "https://api.semanticscholar.org/graph/v1/paper/search?query=QUERY&limit=20&fields=title,year,authors,citationCount,abstract"
+
+ # Semantic Scholar - Paper details
+ curl "https://api.semanticscholar.org/graph/v1/paper/PAPER_ID?fields=title,abstract,citations,references,tldr"
+
+ # Semantic Scholar - Author papers
+ curl "https://api.semanticscholar.org/graph/v1/author/AUTHOR_ID/papers?fields=title,year,venue"
+
+ # arXiv - Search
+ curl "http://export.arxiv.org/api/query?search_query=QUERY&max_results=20"
+
+ # DBLP - Search
+ curl "https://dblp.org/search/publ/api?q=QUERY&format=json"
+ ```
+
+ ---
+
+ ## References
+
+ Based on guidelines from:
+ - [ACL Rolling Review Author Guidelines](https://aclrollingreview.org/authors)
+ - [NeurIPS Reviewer Guidelines](https://neurips.cc/Conferences/2025/ReviewerGuidelines)
+ - [ICML Paper Guidelines](https://icml.cc/Conferences/2024/PaperGuidelines)
+ - [How-to conduct a systematic literature review (CS)](https://www.sciencedirect.com/science/article/pii/S2215016122002746)
+ - [PRISMA Statement](https://www.prisma-statement.org/)
+ - [Semantic Scholar API](https://api.semanticscholar.org/api-docs/)
+ - [ACL Anthology](https://aclanthology.org)
.claude/skills/paper-review/SKILL.md ADDED
@@ -0,0 +1,275 @@
+ ---
2
+ name: paper-review
3
+ description: Review research papers following ACL/EMNLP conference standards. Provides structured feedback with soundness, excitement, and overall assessment scores.
4
+ argument-hint: "[file-path]"
5
+ ---
6
+
7
+ # Academic Paper Review (ACL/EMNLP Standards)
8
+
9
+ Review papers following ACL Rolling Review (ARR) guidelines and best practices from top NLP conferences.
10
+
11
+ ## Target File
12
+
13
+ Review the file: $ARGUMENTS
14
+
15
+ If no file specified, review `TECHNICAL_REPORT.md` in the project root.
16
+
17
+ ## Review Process
18
+
19
+ ### Step 1: Reading Strategy (Two-Pass Method)
20
+
21
+ 1. **First Pass (Skim)**: Read abstract, introduction, and conclusion first to understand research questions, scope, and claimed contributions
22
+ 2. **Second Pass (Deep Dive)**: Evaluate technical soundness, methodology, evidence quality, and reproducibility
23
+
24
+ ### Step 2: Research Relevant Papers
25
+
26
+ Before completing the review, research the current state of the field to properly contextualize the contribution.
27
+
28
+ #### 2.1 Identify Key Topics
29
+ Extract from the paper:
30
+ - Main task/problem (e.g., "Vietnamese POS tagging", "named entity recognition")
31
+ - Methods used (e.g., "CRF", "transformer", "BERT-based")
32
+ - Dataset/benchmark names
33
+ - Baseline systems mentioned
34
+
35
+ #### 2.2 Search for Related Work
36
+ Use web search to find:
37
+
38
+ 1. **State-of-the-art results** on the same task/dataset:
39
+ - Search: "[task name] [dataset name] benchmark results [current year]"
40
+ - Search: "[task name] state of the art [current year]"
41
+
42
+ 2. **Competing approaches**:
43
+ - Search: "[task name] [alternative method] comparison"
44
+ - Search: "best [task name] models [current year]"
45
+
46
+ 3. **Prior work by same authors** (for context on research trajectory):
47
+ - Search author names + institution
48
+
49
+ 4. **Survey papers** for comprehensive background:
50
+ - Search: "[task name] survey" OR "[task name] review paper"
51
+
52
+ 5. **Datasets and benchmarks**:
53
+ - Search: "[dataset name] leaderboard" OR "[dataset name] benchmark"
54
+
55
+ #### 2.3 Verify Claims
56
+ Cross-check the paper's claims against found literature:
57
+ - Are baseline comparisons fair and up-to-date?
58
+ - Are cited SOTA numbers accurate?
59
+ - Is related work coverage comprehensive?
60
+ - Are there significant missing references?
61
+
62
+ #### 2.4 Document Findings
63
+ Record relevant papers found during research:
64
+
65
+ ```markdown
66
+ ## Related Work Research
67
+
68
+ ### Papers Found
69
+ | Paper | Year | Method | Results | Relevance |
70
+ |-------|------|--------|---------|-----------|
71
+ | [Title] | [Year] | [Method] | [Key metric] | [Why relevant] |
72
+
73
+ ### Missing from Related Work
74
+ - [Paper that should have been cited]
75
+
76
+ ### SOTA Verification
77
+ - Claimed SOTA: [what paper claims]
78
+ - Actual SOTA: [what you found]
79
+ - Gap: [difference if any]
80
+ ```
81
+
82
+ ### Step 3: Write Review
83
+
84
+ With both the paper content and research context, write the formal review following the ARR structure below.
85
+
86
+ ## ARR Review Form Structure
87
+
88
+ Provide your review in the following structure:
89
+
90
+ ```markdown
91
+ ## Paper Summary
92
+
93
+ [2-3 sentences describing what the paper is about, helping editors understand the topic]
94
+
95
+ ## Summary of Strengths
96
+
97
+ Major reasons to publish this paper at a selective *ACL venue:
98
+
99
+ 1. [Strength 1 - be specific, reference sections/tables]
100
+ 2. [Strength 2]
101
+ 3. [Strength 3]
102
+
103
+ ## Summary of Weaknesses
104
+
105
+ Numbered concerns that prevent prioritizing this work:
106
+
107
+ 1. [Weakness 1 - specific and actionable]
108
+ 2. [Weakness 2]
109
+ 3. [Weakness 3]
110
+
111
+ ## Scores
112
+
113
+ ### Soundness: [1-5]
114
+ - 5: Excellent - No major issues, claims well-supported
115
+ - 4: Good - Minor issues that don't affect main claims
116
+ - 3: Acceptable - Some issues but core contributions valid
117
+ - 2: Poor - Significant issues undermine key claims
118
+ - 1: Major Issues - Not sufficiently thorough for publication
119
+
120
+ ### Excitement: [1-5]
121
+ - 5: Highly Exciting - Would recommend to others, transformational
122
+ - 4: Exciting - Important contribution to the field
123
+ - 3: Moderately Exciting - Incremental but solid work
124
+ - 2: Somewhat Boring - Limited novelty or impact
125
+ - 1: Not Exciting - Routine work with minimal contribution
126
+
127
+ ### Overall Assessment: [1-5]
128
+ - 5: Award consideration (top 2.5%)
129
+ - 4: Strong accept - main conference
130
+ - 3: Borderline - Findings track appropriate
131
+ - 2: Resubmit next cycle - substantial revisions needed
132
+ - 1: Do not resubmit
133
+
134
+ ### Reproducibility: [1-5]
135
+ - 5: Could reproduce results exactly
136
+ - 4: Could mostly reproduce, minor variation expected
137
+ - 3: Partial reproduction possible
138
+ - 2: Significant details missing
139
+ - 1: Cannot reproduce
140
+
141
+ ### Confidence: [1-5]
142
+ - 5: Expert - positive my evaluation is correct
143
+ - 4: High - familiar with related work
144
+ - 3: Medium - read related papers but not expert
145
+ - 2: Low - educated guess
146
+ - 1: Not my area
147
+
148
+ ## Detailed Comments
149
+
150
+ ### Technical Soundness
151
+ [Evaluate methodology, experimental design, statistical validity]
152
+
153
+ ### Novelty and Contribution
154
+ [Assess originality - don't dismiss work just because method is "simple" or results seem "obvious in retrospect"]
155
+
156
+ ### Clarity and Presentation
157
+ [Focus on substance, not style - note if non-native English but don't penalize]
158
+
159
+ ### Reproducibility Assessment
160
+ [Check for: dataset details, hyperparameters, code availability, training configuration]
161
+
162
+ ### Limitations and Ethics
163
+ [Evaluate if authors adequately discuss limitations and potential negative impacts]
164
+
165
+ ## Related Work Research
166
+
167
+ ### Papers Found
168
+ | Paper | Year | Method | Results | Relevance |
169
+ |-------|------|--------|---------|-----------|
170
+ | [Title] | [Year] | [Method] | [Key metric] | [Why relevant] |
171
+
172
+ ### Missing Citations
173
+ [Important papers not cited that should be referenced]
174
+
175
+ ### SOTA Verification
176
+ - **Claimed**: [what the paper claims as baseline/SOTA]
177
+ - **Actual**: [current SOTA from your research]
178
+ - **Assessment**: [whether claims are accurate]
179
+
180
+ ## Questions for Authors
181
+
182
+ [Specific questions that could be addressed in author response - do NOT ask for new experiments]
183
+
184
+ ## Minor Issues
185
+
186
+ [Typos, formatting, missing references - not grounds for rejection]
187
+
188
+ ## Suggestions for Improvement
189
+
190
+ [Constructive, actionable recommendations to strengthen the work]
191
+ ```
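The 1-5 scales above are machine-checkable. As an illustrative sketch (the function name `validate_review` and the dict keys are assumptions for illustration, not part of this skill), a reviewing tool could flag out-of-range scores and the score-content misalignment described under issue I2:

```python
# Illustrative sketch: check that a filled-in review uses the 1-5 scales
# above, and flag score-content misalignment (ARR issue I2), where a low
# soundness score is given without any stated technical flaws.

SCALE_FIELDS = ("soundness", "excitement", "reproducibility", "confidence")

def validate_review(review: dict) -> list[str]:
    """Return a list of problems found in a review dict."""
    problems = []
    for field in SCALE_FIELDS:
        score = review.get(field)
        if not isinstance(score, int) or not 1 <= score <= 5:
            problems.append(f"{field} must be an integer in 1-5, got {score!r}")
    # Low soundness scores MUST cite specific technical flaws.
    if review.get("soundness", 5) <= 2 and not review.get("technical_flaws"):
        problems.append("soundness <= 2 but no technical flaws are listed (issue I2)")
    return problems

report = validate_review({"soundness": 2, "excitement": 4,
                          "reproducibility": 3, "confidence": 3})
print(report)
```

The same idea extends to the other checklists: any "MUST" in the guidelines can become an automated lint before the review is submitted.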
192
+
193
+ ## Review Principles (ACL/EMNLP Best Practices)
194
+
195
+ ### DO:
196
+ - **Be specific**: Reference particular sections, equations, tables, or line numbers
197
+ - **Be constructive**: Suggest how to improve, not just what's wrong
198
+ - **Be kind**: Write the review you would like to receive
199
+ - **Justify scores**: Low soundness scores MUST cite specific technical flaws
200
+ - **Consider diverse contributions**: Novel methodology, insightful analysis, new resources, theoretical advances
201
+ - **Evaluate claimed contributions**: A paper only needs sufficient evidence for its stated claims
202
+
203
+ ### DO NOT:
204
+ - Reject because results aren't SOTA (ask "state of which art?")
205
+ - Dismiss work as "obvious in retrospect" without prior empirical validation
206
+ - Demand experiments beyond the paper's stated scope
207
+ - Criticize for not using deep learning (method diversity is valuable)
208
+ - Reject resource papers (datasets are as important as models)
209
+ - Reject work on non-English languages
210
+ - Penalize simple methods (many of the most-cited papers use simple methods)
211
+ - Use sarcasm or dismissive language
212
+ - Generate AI-written review content (violates ACL policy)
213
+
214
+ ### Common Review Problems to Avoid (ARR Issue Codes):
215
+ - **I1**: Lack of specificity - vague criticisms without examples
216
+ - **I2**: Score-content misalignment - low scores without stated flaws
217
+ - **I3**: Unprofessional tone - harsh or dismissive language
218
+ - **I4**: Demanding out-of-scope work
219
+ - **I5**: Ignoring author responses without explanation
220
+
221
+ ## Evaluation Checklist
222
+
223
+ ### Methodology
224
+ - [ ] Research questions clearly stated
225
+ - [ ] Methods appropriate for research questions
226
+ - [ ] Baselines appropriate and fairly compared
227
+ - [ ] Statistical significance properly addressed
228
+ - [ ] Limitations of approach acknowledged
229
+
230
+ ### Experiments
231
+ - [ ] Datasets properly described (source, size, splits, preprocessing)
232
+ - [ ] Evaluation metrics appropriate for the task
233
+ - [ ] Training details sufficient for reproduction
234
+ - [ ] Ablation studies or analysis provided
235
+ - [ ] Results support the claims made
236
+
237
+ ### Presentation
238
+ - [ ] Abstract accurately summarizes contributions
239
+ - [ ] Introduction motivates the problem
240
+ - [ ] Related work comprehensive and fair
241
+ - [ ] Figures/tables readable and informative
242
+ - [ ] Conclusion matches actual contributions
243
+
244
+ ### Related Work Verification (from Step 2 Research)
245
+ - [ ] Key prior work on same task is cited
246
+ - [ ] Baseline comparisons use current methods
247
+ - [ ] SOTA claims are accurate and up-to-date
248
+ - [ ] No significant missing references
249
+ - [ ] Fair characterization of competing approaches
250
+
251
+ ### Responsible NLP
252
+ - [ ] Limitations section present and substantive
253
+ - [ ] Potential negative impacts discussed
254
+ - [ ] Data collection ethics addressed (if applicable)
255
+ - [ ] Bias considerations mentioned (if applicable)
256
+
257
+ ## Score Calibration Guidelines
258
+
259
+ **Soundness vs Excitement**: These are orthogonal. A paper can be:
260
+ - High soundness, low excitement: Solid but incremental
261
+ - Low soundness, high excitement: Interesting idea but flawed execution
262
+ - Both should be reflected independently
263
+
264
+ **Overall Assessment**: Consider:
265
+ - Does the paper advance the field?
266
+ - Would the NLP community benefit from this work?
267
+ - Are the claimed contributions adequately supported?
268
+
269
+ ## References
270
+
271
+ Based on guidelines from:
272
+ - [ACL Rolling Review Reviewer Guidelines](https://aclrollingreview.org/reviewerguidelines)
273
+ - [ARR Review Form](https://aclrollingreview.org/reviewform)
274
+ - [EMNLP 2020: How to Write Good Reviews](https://2020.emnlp.org/blog/2020-05-17-write-good-reviews/)
275
+ - [ACL 2023 Review Process](https://2023.aclweb.org/blog/review-basics/)
.claude/skills/paper-write/SKILL.md ADDED
@@ -0,0 +1,402 @@
1
+ ---
2
+ name: paper-write
3
+ description: Write or improve technical reports following ACL/EMNLP conference standards. Generates publication-ready sections with proper structure and formatting.
4
+ argument-hint: "[section] or [output-file]"
5
+ ---
6
+
7
+ # Technical Paper Writing (ACL/EMNLP Standards)
8
+
9
+ Write or improve technical papers following ACL Rolling Review guidelines and best practices from top NLP conferences.
10
+
11
+ ## Target
12
+
13
+ **Arguments**: $ARGUMENTS
14
+
15
+ - If argument is a section name (abstract, introduction, methodology, experiments, related-work, conclusion, limitations), generate that section
16
+ - If argument is a file path, write the complete paper to that file
17
+ - If no argument, analyze the project and generate a complete TECHNICAL_REPORT.md
18
+
19
+ ## Writing Process
20
+
21
+ ### Step 1: Project Analysis
22
+
23
+ Before writing, analyze the codebase to understand:
24
+
25
+ 1. **Core Contribution**: What is the main technical contribution?
26
+ 2. **Methodology**: What algorithms/models/approaches are used?
27
+ 3. **Data**: What datasets are used for training/evaluation?
28
+ 4. **Results**: What are the key metrics and findings?
29
+ 5. **Implementation**: What are the technical details (hyperparameters, architecture)?
30
+
31
+ Search for:
32
+ - README.md, CLAUDE.md for project overview
33
+ - Training scripts for methodology details
34
+ - Evaluation scripts for metrics
35
+ - Config files for hyperparameters
36
+ - Model files for architecture
37
+
38
+ ### Step 2: Research Context
39
+
40
+ Use web search to contextualize the contribution:
41
+
42
+ 1. **State-of-the-art**: Search "[task] state of the art [year]"
43
+ 2. **Benchmarks**: Search "[dataset] benchmark leaderboard"
44
+ 3. **Related methods**: Search "[method] [task] comparison"
45
+ 4. **Prior work**: Search key paper titles for citations
46
+
47
+ ### Step 3: Write Paper
48
+
49
+ Follow the ACL paper structure below.
50
+
51
+ ---
52
+
53
+ ## ACL Paper Structure
54
+
55
+ ### 1. Title
56
+
57
+ - Concise and informative (max 12 words)
58
+ - Include: task, method, language/domain if specific
59
+ - Avoid: "Novel", "New", "Improved" without substance
60
+
61
+ **Good examples**:
62
+ - "Vietnamese POS Tagging with Conditional Random Fields"
63
+ - "BERT-based Named Entity Recognition for Legal Documents"
64
+
65
+ ### 2. Abstract (max 200 words)
66
+
67
+ Structure in 4-5 sentences:
68
+
69
+ ```
70
+ [Problem/Motivation] [Task] remains challenging due to [specific challenges].
71
+ [Approach] We present [method name], a [brief description of approach].
72
+ [Key Innovation] Our method [key differentiator from prior work].
73
+ [Results] Experiments on [dataset] show [main result with number].
74
+ [Impact/Availability] [Code/model availability statement].
75
+ ```
76
+
77
+ **Tips**:
78
+ - Be specific with numbers: "achieves 95.89% accuracy" not "achieves high accuracy"
79
+ - Avoid vague claims: "outperforms baselines" → "outperforms VnCoreNLP by 2.1%"
80
+ - Include reproducibility info if possible
81
+
82
+ ### 3. Introduction (1-1.5 pages)
83
+
84
+ **Paragraph 1: Problem & Motivation**
85
+ - What is the task? Why is it important?
86
+ - What are the real-world applications?
87
+
88
+ **Paragraph 2: Challenges**
89
+ - What makes this problem difficult?
90
+ - What are the specific challenges for this language/domain?
91
+
92
+ **Paragraph 3: Existing Approaches & Limitations**
93
+ - What methods have been tried?
94
+ - What are their limitations?
95
+
96
+ **Paragraph 4: Our Approach**
97
+ - What is your method?
98
+ - How does it address the limitations?
99
+
100
+ **Paragraph 5: Contributions** (bulleted list, max 3)
101
+ ```markdown
102
+ Our main contributions are:
103
+ - We propose [method] for [task], achieving [result]
104
+ - We release [dataset/model/code] for [purpose]
105
+ - We provide [analysis/insights] showing [finding]
106
+ ```
107
+
108
+ **Paragraph 6: Paper Organization** (optional)
109
+ - Brief roadmap of remaining sections
110
+
111
+ ### 4. Related Work (0.5-1 page)
112
+
113
+ Organize by themes, not chronologically:
114
+
115
+ ```markdown
116
+ ## Related Work
117
+
118
+ ### [Theme 1: e.g., "Traditional Approaches"]
119
+ [Discussion of rule-based, statistical methods...]
120
+
121
+ ### [Theme 2: e.g., "Neural Methods"]
122
+ [Discussion of deep learning approaches...]
123
+
124
+ ### [Theme 3: e.g., "Vietnamese NLP"]
125
+ [Discussion of language-specific work...]
126
+ ```
127
+
128
+ **Tips**:
129
+ - Cite 15-30 papers for a full paper
130
+ - Be fair to prior work - acknowledge their contributions
131
+ - Clearly state how your work differs
132
+ - Use ACL Anthology for citations when available
133
+
134
+ ### 5. Methodology (1.5-2 pages)
135
+
136
+ **5.1 Problem Formulation**
137
+ - Formal definition with mathematical notation
138
+ - Input/output specification
139
+
140
+ **5.2 Model Architecture**
141
+ - High-level overview (with figure if helpful)
142
+ - Detailed description of each component
143
+
144
+ **5.3 Feature Engineering** (if applicable)
145
+ - List all features with clear notation
146
+ - Justify feature choices
147
+
148
+ **5.4 Training**
149
+ - Loss function
150
+ - Optimization algorithm
151
+ - Hyperparameters (in table format)
152
+
153
+ ```markdown
154
+ | Parameter | Value | Description |
155
+ |-----------|-------|-------------|
156
+ | Learning rate | 0.001 | Adam optimizer |
157
+ | Batch size | 32 | Training batch |
158
+ | Epochs | 100 | Maximum iterations |
159
+ ```
160
+
161
+ ### 6. Experimental Setup (0.5-1 page)
162
+
163
+ **6.1 Datasets**
164
+
165
+ ```markdown
166
+ | Dataset | Train | Dev | Test | Domain |
167
+ |---------|-------|-----|------|--------|
168
+ | [Name] | [N] | [N] | [N] | [Domain] |
169
+ ```
170
+
171
+ Include:
172
+ - Source and citation
173
+ - Preprocessing steps
174
+ - Train/dev/test split rationale
175
+
176
+ **6.2 Baselines**
177
+ - List all baseline systems with citations
178
+ - Brief description of each
179
+
180
+ **6.3 Evaluation Metrics**
181
+ - Define each metric mathematically
182
+ - Justify metric choices for the task
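For classification tasks, "define each metric mathematically" can be as simple as stating the standard definitions once, where $TP$, $FP$, and $FN$ are true positives, false positives, and false negatives:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2PR}{P + R}
```

For multi-class tasks, also state whether macro- or micro-averaging is used, since the two can differ substantially on imbalanced data.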
183
+
184
+ **6.4 Implementation Details**
185
+ - Framework/library versions
186
+ - Hardware used
187
+ - Random seeds for reproducibility
188
+
189
+ ### 7. Results (1-1.5 pages)
190
+
191
+ **7.1 Main Results**
192
+
193
+ Present main comparison table:
194
+
195
+ ```markdown
196
+ | Model | Accuracy | Precision | Recall | F1 |
197
+ |-------|----------|-----------|--------|-----|
198
+ | Baseline 1 | X.XX | X.XX | X.XX | X.XX |
199
+ | Baseline 2 | X.XX | X.XX | X.XX | X.XX |
200
+ | **Ours** | **X.XX** | **X.XX** | **X.XX** | **X.XX** |
201
+ ```
202
+
203
+ **7.2 Analysis**
204
+ - Why does your method work?
205
+ - Per-class/category breakdown
206
+ - Statistical significance (p-values if applicable)
207
+
208
+ **7.3 Ablation Study**
209
+ - What happens when you remove components?
210
+ - Which features/components contribute most?
211
+
212
+ **7.4 Error Analysis**
213
+ - Common error patterns
214
+ - Failure cases with examples
215
+ - Linguistic analysis if applicable
216
+
217
+ ### 8. Discussion (optional, 0.5 page)
218
+
219
+ - Broader implications of findings
220
+ - Comparison with concurrent work
221
+ - Unexpected observations
222
+
223
+ ### 9. Conclusion (0.5 page)
224
+
225
+ **Paragraph 1: Summary**
226
+ - Restate the problem and your approach
227
+ - Highlight main results
228
+
229
+ **Paragraph 2: Limitations** (can be separate section)
230
+ - Honest assessment of limitations
231
+ - What doesn't your method handle well?
232
+
233
+ **Paragraph 3: Future Work**
234
+ - 2-3 concrete directions for future research
235
+
236
+ ### 10. Limitations Section (REQUIRED)
237
+
238
+ ACL requires a dedicated "Limitations" section. Include:
239
+
240
+ - Data limitations (domain, size, annotation quality)
241
+ - Method limitations (assumptions, failure cases)
242
+ - Evaluation limitations (metrics, benchmarks)
243
+ - Scope limitations (languages, tasks)
244
+
245
+ ```markdown
246
+ ## Limitations
247
+
248
+ This work has several limitations:
249
+
250
+ 1. **Data**: Our model is trained on [domain] data and may not generalize to [other domains].
251
+
252
+ 2. **Method**: [Specific limitation of the approach].
253
+
254
+ 3. **Evaluation**: We evaluate only on [dataset]; performance on other benchmarks is unknown.
255
+
256
+ 4. **Scope**: Our work focuses on [language/task]; extension to [other scenarios] requires further investigation.
257
+ ```
258
+
259
+ ### 11. Ethics Statement (if applicable)
260
+
261
+ Address:
262
+ - Data collection ethics
263
+ - Potential misuse
264
+ - Bias considerations
265
+ - Environmental impact (for large models)
266
+
267
+ ---
268
+
269
+ ## Formatting Guidelines
270
+
271
+ ### Page Limits
272
+ - **Long paper**: 8 pages content + unlimited references
273
+ - **Short paper**: 4 pages content + unlimited references
274
+ - Limitations, ethics, acknowledgments don't count
275
+
276
+ ### Required Elements
277
+ - [ ] Title (15pt bold)
278
+ - [ ] Abstract (max 200 words)
279
+ - [ ] Sections numbered with Arabic numerals
280
+ - [ ] Limitations section (after conclusion, before references)
281
+ - [ ] References (unnumbered)
282
+
283
+ ### Figures & Tables
284
+ - Number sequentially (Figure 1, Table 1)
285
+ - Captions below figures, above tables (10pt)
286
+ - Reference all figures/tables in text
287
+ - Ensure grayscale readability
288
+
289
+ ### Citations
290
+ - Use ACL Anthology when available
291
+ - Format: (Author, Year) or Author (Year)
292
+ - Include DOIs when available
293
+
294
+ ---
295
+
296
+ ## Writing Tips
297
+
298
+ ### DO:
299
+ - **Be specific**: Use numbers, not vague claims
300
+ - **Be honest**: Acknowledge limitations
301
+ - **Be concise**: Every sentence should add value
302
+ - **Be clear**: Define terms, explain notation
303
+ - **Be fair**: Give credit to prior work
304
+
305
+ ### DON'T:
306
+ - Oversell contributions
307
+ - Hide negative results
308
+ - Use excessive jargon
309
+ - Make claims without evidence
310
+ - Ignore reviewer guidelines
311
+
312
+ ### Common Mistakes to Avoid:
313
+ 1. Abstract that doesn't match paper content
314
+ 2. Introduction that's too long/detailed
315
+ 3. Related work that's just a list of papers
316
+ 4. Methodology without enough detail to reproduce
317
+ 5. Results without error analysis
318
+ 6. Missing or superficial limitations section
319
+
320
+ ---
321
+
322
+ ## Section Templates
323
+
324
+ ### Abstract Template
325
+ ```
326
+ [Task] is [importance/challenge]. Existing methods [limitation].
327
+ We present [method], which [key innovation].
328
+ Our approach [brief description].
329
+ Experiments on [dataset] demonstrate [main result].
330
+ [Additional contribution: code/data release].
331
+ ```
332
+
333
+ ### Introduction Contribution Template
334
+ ```markdown
335
+ Our main contributions are as follows:
336
+ - We propose **[Method Name]**, a [brief description] for [task] that [key advantage].
337
+ - We conduct extensive experiments on [datasets], achieving [specific result] and outperforming [baselines] by [margin].
338
+ - We release our [code/model/data] at [URL] to facilitate future research.
339
+ ```
340
+
341
+ ### Conclusion Template
342
+ ```
343
+ We presented [method] for [task]. Our approach [key innovation].
344
+ Experiments on [dataset] show [main findings].
345
+ Our analysis reveals [key insight].
346
+
347
+ Limitations include [honest limitations].
348
+
349
+ Future work includes [2-3 specific directions].
350
+ ```
351
+
352
+ ---
353
+
354
+ ## Checklist Before Submission
355
+
356
+ ### Content
357
+ - [ ] Abstract summarizes all key points
358
+ - [ ] Introduction clearly states contributions
359
+ - [ ] Related work is comprehensive and fair
360
+ - [ ] Methodology has enough detail to reproduce
361
+ - [ ] Experiments include baselines and ablations
362
+ - [ ] Results include error analysis
363
+ - [ ] Limitations section is substantive
364
+ - [ ] Conclusion matches actual contributions
365
+
366
+ ### Formatting
367
+ - [ ] Within page limits
368
+ - [ ] All figures/tables referenced in text
369
+ - [ ] All citations properly formatted
370
+ - [ ] No orphaned section headers
371
+ - [ ] Consistent notation throughout
372
+
373
+ ### Reproducibility
374
+ - [ ] Hyperparameters specified
375
+ - [ ] Dataset details provided
376
+ - [ ] Random seeds mentioned
377
+ - [ ] Code/data availability stated
378
+
379
+ ---
380
+
381
+ ## Output Format
382
+
383
+ Generate the paper in Markdown format with:
384
+
385
+ 1. Clear section headers (## for main sections, ### for subsections)
386
+ 2. Tables using Markdown table syntax
387
+ 3. Math using LaTeX notation ($...$ for inline, $$...$$ for display)
388
+ 4. Code blocks for algorithms/features
389
+ 5. Proper citation placeholders: (Author, Year)
390
+
391
+ Save to the specified output file or TECHNICAL_REPORT.md by default.
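As a sketch of these output rules (the section list and default file name mirror the guidance above; the function name `build_skeleton` is illustrative only), a generator might emit:

```python
# Illustrative sketch: assemble a Markdown skeleton following the output
# rules above (## headers for main sections, TODO placeholders per section).

SECTIONS = ["Abstract", "Introduction", "Related Work", "Methodology",
            "Experimental Setup", "Results", "Conclusion", "Limitations"]

def build_skeleton(title: str) -> str:
    """Return a Markdown skeleton with one ## header per section."""
    lines = [f"# {title}", ""]
    for section in SECTIONS:
        lines += [f"## {section}", "", "TODO", ""]
    return "\n".join(lines)

skeleton = build_skeleton("My Technical Report")
print(skeleton.splitlines()[2])  # → "## Abstract"
```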
392
+
393
+ ---
394
+
395
+ ## References
396
+
397
+ Based on guidelines from:
398
+ - [ACL Paper Formatting Guidelines](https://acl-org.github.io/ACLPUB/formatting.html)
399
+ - [ACL Rolling Review Author Guidelines](http://aclrollingreview.org/authors)
400
+ - [Tips for Writing NLP Papers](https://medium.com/@vered1986/tips-for-writing-nlp-papers-9c729a2f9e1f)
401
+ - [Stanford Tips for Writing Technical Papers](https://cs.stanford.edu/people/widom/paper-writing.html)
402
+ - [EMNLP 2024 Call for Papers](https://2024.emnlp.org/calls/main_conference_papers/)
.gitattributes ADDED
@@ -0,0 +1,6 @@
1
+ *.pdf filter=lfs diff=lfs merge=lfs -text
2
+ *.png filter=lfs diff=lfs merge=lfs -text
3
+ *.jpg filter=lfs diff=lfs merge=lfs -text
4
+ *.jpeg filter=lfs diff=lfs merge=lfs -text
5
+ *.gif filter=lfs diff=lfs merge=lfs -text
6
+ *.synctex filter=lfs diff=lfs merge=lfs -text
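These rules route binary files to git-lfs by glob pattern. A rough sketch of the matching logic (Python's `fnmatch` approximates, but does not fully implement, gitattributes pattern semantics such as directory anchoring and `**`):

```python
# Rough sketch: decide whether a path matches one of the LFS patterns
# above. Patterns without a slash match against the basename only.
from fnmatch import fnmatch
import posixpath

LFS_PATTERNS = ["*.pdf", "*.png", "*.jpg", "*.jpeg", "*.gif", "*.synctex"]

def tracked_by_lfs(path: str) -> bool:
    name = posixpath.basename(path)
    return any(fnmatch(name, pat) for pat in LFS_PATTERNS)

print(tracked_by_lfs("references/2019.arxiv.conneau/paper.pdf"))  # True
print(tracked_by_lfs("references/2019.arxiv.conneau/paper.md"))   # False
```

In the real repository these decisions are made by git itself; `git lfs track "*.pdf"` is what appends such lines to .gitattributes.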
references/2007.rivf.hoang/paper.md ADDED
@@ -0,0 +1,39 @@
1
+ ---
2
+ title: "A Comparative Study on Vietnamese Text Classification Methods"
3
+ authors:
4
+ - "Cong Duy Vu Hoang"
5
+ - "Dien Dinh"
6
+ - "Le Nguyen Nguyen"
7
+ - "Quoc Hung Ngo"
8
+ year: 2007
9
+ venue: "IEEE RIVF 2007"
10
+ url: "https://ieeexplore.ieee.org/document/4223084/"
11
+ ---
12
+
13
+ # A Comparative Study on Vietnamese Text Classification Methods
14
+
15
+ ## Abstract
16
+
17
+ This paper presents two different approaches for Vietnamese text classification: Bag of Words (BOW) and statistical N-gram language modeling. On a Vietnamese news corpus, these approaches achieve over 95% accuracy on average, with an average classification time of 79 minutes for about 14,000 documents.
18
+
19
+ ## Key Contributions
20
+
21
+ 1. Introduced the VNTC (Vietnamese News Text Classification) corpus
22
+ 2. Compared BOW and N-gram language model approaches for Vietnamese text classification
23
+ 3. Demonstrated SVM effectiveness for Vietnamese text
24
+
25
+ ## Results
26
+
27
+ | Method | Accuracy |
28
+ |--------|----------|
29
+ | N-gram LM | 97.1% |
30
+ | SVM Multi | 93.4% |
31
+ | BOW + SVM | ~92% |
32
+
33
+ ## Dataset
34
+
35
+ - VNTC: Vietnamese News Text Classification Corpus
36
+ - 10 topics: Politics, Lifestyle, Science, Business, Law, Health, World, Sports, Culture, Technology
37
+ - Available: https://github.com/duyvuleo/VNTC
38
+
39
+ *Full text available at IEEE Xplore*
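Each reference file carries YAML frontmatter like the block above (title, authors, year, venue, url). As an illustrative sketch (not taken from the repo's paper_db.py), simple scalar fields can be read without a YAML dependency; list fields such as `authors` would need a real YAML parser:

```python
# Illustrative sketch: pull simple scalar fields out of the '--- ... ---'
# frontmatter block used by the reference paper.md files. List fields
# (e.g. 'authors') are not handled here.

def read_frontmatter(text: str) -> dict:
    fields = {}
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return fields
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter block
        if ":" in line and not line.startswith((" ", "-")):
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip().strip('"')
    return fields

sample = '---\ntitle: "A Comparative Study"\nyear: 2007\nvenue: "IEEE RIVF 2007"\n---\n# Body'
print(read_frontmatter(sample)["venue"])  # → IEEE RIVF 2007
```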
references/2018.kse.nguyen/paper.md ADDED
@@ -0,0 +1,32 @@
1
+ ---
2
+ title: "UIT-VSFC: Vietnamese Students' Feedback Corpus for Sentiment Analysis"
3
+ authors:
4
+ - "Kiet Van Nguyen"
5
+ - "Vu Duc Nguyen"
6
+ - "Phu Xuan-Vinh Nguyen"
7
+ - "Tham Thi-Hong Truong"
8
+ - "Ngan Luu-Thuy Nguyen"
9
+ year: 2018
10
+ venue: "KSE 2018"
11
+ url: "https://ieeexplore.ieee.org/document/8573337/"
12
+ ---
13
+
14
+ # UIT-VSFC: Vietnamese Students' Feedback Corpus for Sentiment Analysis
15
+
16
+ ## Abstract
17
+
18
+ Vietnamese Students' Feedback Corpus (UIT-VSFC) is a resource consisting of over 16,000 sentences which are human-annotated with two different tasks: sentiment-based and topic-based classifications.
19
+
20
+ The corpus has an inter-annotator agreement of 91.20% for the sentiment-based task and 71.07% for the topic-based task.
21
+
22
+ Baseline models built with the Maximum Entropy classifier achieved a sentiment F1-score of approximately 88% and a topic F1-score of over 84%.
23
+
24
+ ## Dataset Statistics
25
+
26
+ - Total sentences: 16,175
27
+ - Sentiment labels: Positive, Negative, Neutral
28
+ - Topic labels: Multiple categories
29
+ - Domain: Vietnamese university student feedback
30
+ - Available at: https://huggingface.co/datasets/uitnlp/vietnamese_students_feedback
31
+
32
+ *Full text available at IEEE Xplore*
references/2019.arxiv.conneau/paper.md ADDED
@@ -0,0 +1,35 @@
1
+ ---
2
+ title: "Unsupervised Cross-lingual Representation Learning at Scale"
3
+ authors:
4
+ - "Alexis Conneau"
5
+ - "Kartikay Khandelwal"
6
+ - "Naman Goyal"
7
+ - "Vishrav Chaudhary"
8
+ - "Guillaume Wenzek"
9
+ - "Francisco Guzmán"
10
+ - "Edouard Grave"
11
+ - "Myle Ott"
12
+ - "Luke Zettlemoyer"
13
+ - "Veselin Stoyanov"
14
+ year: 2019
15
+ venue: "arXiv"
16
+ url: "https://arxiv.org/abs/1911.02116"
17
+ arxiv: "1911.02116"
18
+ ---
19
+
20
24
+ # Supplementary materials
25
+ # Languages and statistics for CC-100 used by XLM-R
26
+ In this section we present the list of languages in the CC-100 corpus we created for training XLM-R. We also report statistics such as the number of tokens and the size of each monolingual corpus.
27
+ \label{sec:appendix_A}
28
+ \insertDataStatistics
29
+
30
+ \newpage
31
+ # Model Architectures and Sizes
32
+ As we showed in section 5, capacity is an important parameter for learning strong cross-lingual representations. In the table below, we list multiple monolingual and multilingual models used by the research community and summarize their architectures and total number of parameters.
33
+ \label{sec:appendix_B}
34
+
35
+ \insertParameters
references/2019.arxiv.conneau/paper.pdf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:bf2fbb1aa1805ab6f892a4a421ffd4d7575df37343980b9a3729855577d2d8a1
3
+ size 398981
references/2019.arxiv.conneau/paper.tex ADDED
@@ -0,0 +1,45 @@
1
+ \documentclass[11pt,a4paper]{article}
2
+ \usepackage[hyperref]{acl2020}
3
+ \usepackage{times}
4
+ \usepackage{latexsym}
5
+ \renewcommand{\UrlFont}{\ttfamily\small}
6
+
7
+ % This is not strictly necessary, and may be commented out,
8
+ % but it will improve the layout of the manuscript,
9
+ % and will typically save some space.
10
+ \usepackage{microtype}
11
+ \usepackage{graphicx}
12
+ \usepackage{subfigure}
13
+ \usepackage{booktabs} % for professional tables
14
+ \usepackage{url}
15
+ \usepackage{times}
16
+ \usepackage{latexsym}
17
+ \usepackage{array}
18
+ \usepackage{adjustbox}
19
+ \usepackage{multirow}
20
+ % \usepackage{subcaption}
21
+ \usepackage{hyperref}
22
+ \usepackage{longtable}
23
+ \usepackage{bibentry}
+ \usepackage{xspace} % required by the \xlmr and \mbert macros below
24
+ \newcommand{\xlmr}{\textit{XLM-R}\xspace}
25
+ \newcommand{\mbert}{mBERT\xspace}
26
+ \input{content/tables}
27
+
28
+ \begin{document}
29
+ \nobibliography{acl2020}
30
+ \bibliographystyle{acl_natbib}
31
+ \appendix
32
+ \onecolumn
33
+ \section*{Supplementary materials}
34
+ \section{Languages and statistics for CC-100 used by \xlmr}
35
+ In this section we present the list of languages in the CC-100 corpus we created for training \xlmr. We also report statistics such as the number of tokens and the size of each monolingual corpus.
36
+ \label{sec:appendix_A}
37
+ \insertDataStatistics
38
+
39
+ \newpage
40
+ \section{Model Architectures and Sizes}
41
+ As we showed in section 5, capacity is an important parameter for learning strong cross-lingual representations. In the table below, we list multiple monolingual and multilingual models used by the research community and summarize their architectures and total number of parameters.
42
+ \label{sec:appendix_B}
43
+
44
+ \insertParameters
45
+ \end{document}
references/2019.arxiv.conneau/source/XLMR Paper/acl2020.bib ADDED
@@ -0,0 +1,739 @@
1
+ @inproceedings{koehn2007moses,
2
+ title={Moses: Open source toolkit for statistical machine translation},
3
+ author={Koehn, Philipp and Hoang, Hieu and Birch, Alexandra and Callison-Burch, Chris and Federico, Marcello and Bertoldi, Nicola and Cowan, Brooke and Shen, Wade and Moran, Christine and Zens, Richard and others},
4
+ booktitle={Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions},
5
+ pages={177--180},
6
+ year={2007},
7
+ organization={Association for Computational Linguistics}
8
+ }
9
+
10
+ @article{xie2019unsupervised,
11
+ title={Unsupervised data augmentation for consistency training},
12
+ author={Xie, Qizhe and Dai, Zihang and Hovy, Eduard and Luong, Minh-Thang and Le, Quoc V},
13
+ journal={arXiv preprint arXiv:1904.12848},
14
+ year={2019}
15
+ }
16
+
17
+ @article{baevski2018adaptive,
18
+ title={Adaptive input representations for neural language modeling},
19
+ author={Baevski, Alexei and Auli, Michael},
20
+ journal={arXiv preprint arXiv:1809.10853},
21
+ year={2018}
22
+ }
23
+
24
+ @article{wu2019emerging,
25
+ title={Emerging Cross-lingual Structure in Pretrained Language Models},
26
+ author={Wu, Shijie and Conneau, Alexis and Li, Haoran and Zettlemoyer, Luke and Stoyanov, Veselin},
27
+ journal={ACL},
28
+ year={2019}
29
+ }
30
+
31
+ @inproceedings{grave2017efficient,
32
+ title={Efficient softmax approximation for GPUs},
33
+ author={Grave, Edouard and Joulin, Armand and Ciss{\'e}, Moustapha and J{\'e}gou, Herv{\'e} and others},
34
+ booktitle={Proceedings of the 34th International Conference on Machine Learning-Volume 70},
35
+ pages={1302--1310},
36
+ year={2017},
37
+ organization={JMLR. org}
38
+ }
39
+
40
+ @article{sang2002introduction,
41
+ title={Introduction to the CoNLL-2002 Shared Task: Language-Independent Named Entity Recognition},
42
+ author={Tjong Kim Sang, Erik F.},
43
+ journal={CoNLL},
44
+ year={2002}
45
+ }
46
+
47
+ @article{singh2019xlda,
48
+ title={XLDA: Cross-Lingual Data Augmentation for Natural Language Inference and Question Answering},
49
+ author={Singh, Jasdeep and McCann, Bryan and Keskar, Nitish Shirish and Xiong, Caiming and Socher, Richard},
50
+ journal={arXiv preprint arXiv:1905.11471},
51
+ year={2019}
52
+ }
53
+
54
+ @inproceedings{tjong2003introduction,
55
+ title={Introduction to the CoNLL-2003 shared task: language-independent named entity recognition},
56
+ author={Tjong Kim Sang, Erik F and De Meulder, Fien},
57
+ booktitle={CoNLL},
58
+ pages={142--147},
59
+ year={2003},
60
+ organization={Association for Computational Linguistics}
61
+ }
62
+
63
+ @misc{ud-v2.3,
64
+ title = {Universal Dependencies 2.3},
65
+ author = {Nivre, Joakim et al.},
66
+ url = {http://hdl.handle.net/11234/1-2895},
67
+ note = {{LINDAT}/{CLARIN} digital library at the Institute of Formal and Applied Linguistics ({{\'U}FAL}), Faculty of Mathematics and Physics, Charles University},
+ copyright = {Licence Universal Dependencies v2.3},
+ year = {2018} }
+
+
+ @article{huang2019unicoder,
+ title={Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks},
+ author={Huang, Haoyang and Liang, Yaobo and Duan, Nan and Gong, Ming and Shou, Linjun and Jiang, Daxin and Zhou, Ming},
+ journal={ACL},
+ year={2019}
+ }
+
+ @article{kingma2014adam,
+ title={Adam: A method for stochastic optimization},
+ author={Kingma, Diederik P and Ba, Jimmy},
+ journal={arXiv preprint arXiv:1412.6980},
+ year={2014}
+ }
+
+
+ @article{bojanowski2017enriching,
+ title={Enriching word vectors with subword information},
+ author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
+ journal={TACL},
+ volume={5},
+ pages={135--146},
+ year={2017},
+ publisher={MIT Press}
+ }
+
+ @article{werbos1990backpropagation,
+ title={Backpropagation through time: what it does and how to do it},
+ author={Werbos, Paul J},
+ journal={Proceedings of the IEEE},
+ volume={78},
+ number={10},
+ pages={1550--1560},
+ year={1990},
+ publisher={IEEE}
+ }
+
+ @article{hochreiter1997long,
+ title={Long short-term memory},
+ author={Hochreiter, Sepp and Schmidhuber, J{\"u}rgen},
+ journal={Neural computation},
+ volume={9},
+ number={8},
+ pages={1735--1780},
+ year={1997},
+ publisher={MIT Press}
+ }
+
+ @article{al2018character,
+ title={Character-level language modeling with deeper self-attention},
+ author={Al-Rfou, Rami and Choe, Dokook and Constant, Noah and Guo, Mandy and Jones, Llion},
+ journal={arXiv preprint arXiv:1808.04444},
+ year={2018}
+ }
+
+ @misc{dai2019transformerxl,
+ title={Transformer-{XL}: Language Modeling with Longer-Term Dependency},
+ author={Zihang Dai and Zhilin Yang and Yiming Yang and William W. Cohen and Jaime Carbonell and Quoc V. Le and Ruslan Salakhutdinov},
+ year={2019},
+ url={https://openreview.net/forum?id=HJePno0cYm},
+ }
+
+ @article{jozefowicz2016exploring,
+ title={Exploring the limits of language modeling},
+ author={Jozefowicz, Rafal and Vinyals, Oriol and Schuster, Mike and Shazeer, Noam and Wu, Yonghui},
+ journal={arXiv preprint arXiv:1602.02410},
+ year={2016}
+ }
+
+ @inproceedings{mikolov2010recurrent,
+ title={Recurrent neural network based language model},
+ author={Mikolov, Tom{\'a}{\v{s}} and Karafi{\'a}t, Martin and Burget, Luk{\'a}{\v{s}} and {\v{C}}ernock{\`y}, Jan and Khudanpur, Sanjeev},
+ booktitle={Eleventh Annual Conference of the International Speech Communication Association},
+ year={2010}
+ }
+
+ @article{gehring2017convolutional,
+ title={Convolutional sequence to sequence learning},
+ author={Gehring, Jonas and Auli, Michael and Grangier, David and Yarats, Denis and Dauphin, Yann N},
+ journal={arXiv preprint arXiv:1705.03122},
+ year={2017}
+ }
+
+ @article{sennrich2016edinburgh,
+ title={Edinburgh neural machine translation systems for {WMT} 16},
+ author={Sennrich, Rico and Haddow, Barry and Birch, Alexandra},
+ journal={arXiv preprint arXiv:1606.02891},
+ year={2016}
+ }
+
+ @inproceedings{howard2018universal,
+ title={Universal language model fine-tuning for text classification},
+ author={Howard, Jeremy and Ruder, Sebastian},
+ booktitle={Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
+ volume={1},
+ pages={328--339},
+ year={2018}
+ }
+
+ @inproceedings{unsupNMTartetxe,
+ title = {Unsupervised neural machine translation},
+ author = {Mikel Artetxe and Gorka Labaka and Eneko Agirre and Kyunghyun Cho},
+ booktitle = {International Conference on Learning Representations (ICLR)},
+ year = {2018}
+ }
+
+ @inproceedings{artetxe2017learning,
+ title={Learning bilingual word embeddings with (almost) no bilingual data},
+ author={Artetxe, Mikel and Labaka, Gorka and Agirre, Eneko},
+ booktitle={Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
+ volume={1},
+ pages={451--462},
+ year={2017}
+ }
+
+ @inproceedings{socher2013recursive,
+ title={Recursive deep models for semantic compositionality over a sentiment treebank},
+ author={Socher, Richard and Perelygin, Alex and Wu, Jean and Chuang, Jason and Manning, Christopher D and Ng, Andrew and Potts, Christopher},
+ booktitle={EMNLP},
+ pages={1631--1642},
+ year={2013}
+ }
+
+ @inproceedings{bowman2015large,
+ title={A large annotated corpus for learning natural language inference},
+ author={Bowman, Samuel R. and Angeli, Gabor and Potts, Christopher and Manning, Christopher D.},
+ booktitle={EMNLP},
+ year={2015}
+ }
+
+ @inproceedings{multinli:2017,
+ title = {A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference},
+ author = {Adina Williams and Nikita Nangia and Samuel R. Bowman},
+ booktitle = {NAACL},
+ year = {2017}
+ }
+
+ @article{paszke2017automatic,
+ title={Automatic differentiation in {PyTorch}},
+ author={Paszke, Adam and Gross, Sam and Chintala, Soumith and Chanan, Gregory and Yang, Edward and DeVito, Zachary and Lin, Zeming and Desmaison, Alban and Antiga, Luca and Lerer, Adam},
+ journal={NIPS 2017 Autodiff Workshop},
+ year={2017}
+ }
+
+ @inproceedings{conneau2018craminto,
+ title={What you can cram into a single vector: Probing sentence embeddings for linguistic properties},
+ author={Conneau, Alexis and Kruszewski, German and Lample, Guillaume and Barrault, Lo{\"\i}c and Baroni, Marco},
+ booktitle = {ACL},
+ year={2018}
+ }
+
+ @inproceedings{Conneau:2018:iclr_muse,
+ title={Word Translation without Parallel Data},
+ author={Alexis Conneau and Guillaume Lample and {Marc'Aurelio} Ranzato and Ludovic Denoyer and Herv{\'e} J{\'e}gou},
+ booktitle = {ICLR},
+ year={2018}
+ }
+
+ @article{johnson2017google,
+ title={Google's multilingual neural machine translation system: Enabling zero-shot translation},
+ author={Johnson, Melvin and Schuster, Mike and Le, Quoc V and Krikun, Maxim and Wu, Yonghui and Chen, Zhifeng and Thorat, Nikhil and Vi{\'e}gas, Fernanda and Wattenberg, Martin and Corrado, Greg and others},
+ journal={TACL},
+ volume={5},
+ pages={339--351},
+ year={2017},
+ publisher={MIT Press}
+ }
+
+ @article{radford2019language,
+ title={Language models are unsupervised multitask learners},
+ author={Radford, Alec and Wu, Jeffrey and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
+ journal={OpenAI Blog},
+ volume={1},
+ number={8},
+ year={2019}
+ }
+
+ @inproceedings{unsupNMTlample,
+ title = {Unsupervised machine translation using monolingual corpora only},
+ author = {Lample, Guillaume and Conneau, Alexis and Denoyer, Ludovic and Ranzato, Marc'Aurelio},
+ booktitle = {ICLR},
+ year = {2018}
+ }
+
+ @inproceedings{lample2018phrase,
+ title={Phrase-Based \& Neural Unsupervised Machine Translation},
+ author={Lample, Guillaume and Ott, Myle and Conneau, Alexis and Denoyer, Ludovic and Ranzato, Marc'Aurelio},
+ booktitle={EMNLP},
+ year={2018}
+ }
+
+ @article{hendrycks2016bridging,
+ title={Bridging nonlinearities and stochastic regularizers with Gaussian error linear units},
+ author={Hendrycks, Dan and Gimpel, Kevin},
+ journal={arXiv preprint arXiv:1606.08415},
+ year={2016}
+ }
+
+ @inproceedings{chang2008optimizing,
+ title={Optimizing Chinese word segmentation for machine translation performance},
+ author={Chang, Pi-Chuan and Galley, Michel and Manning, Christopher D},
+ booktitle={Proceedings of the third workshop on statistical machine translation},
+ pages={224--232},
+ year={2008}
+ }
+
+ @inproceedings{rajpurkar-etal-2016-squad,
+ title = "{SQ}u{AD}: 100,000+ Questions for Machine Comprehension of Text",
+ author = "Rajpurkar, Pranav and
+ Zhang, Jian and
+ Lopyrev, Konstantin and
+ Liang, Percy",
+ booktitle = "EMNLP",
+ month = nov,
+ year = "2016",
+ address = "Austin, Texas",
+ publisher = "Association for Computational Linguistics",
+ url = "https://www.aclweb.org/anthology/D16-1264",
+ doi = "10.18653/v1/D16-1264",
+ pages = "2383--2392",
+ }
+
+ @article{lewis2019mlqa,
+ title={MLQA: Evaluating Cross-lingual Extractive Question Answering},
+ author={Lewis, Patrick and O{\u{g}}uz, Barlas and Rinott, Ruty and Riedel, Sebastian and Schwenk, Holger},
+ journal={arXiv preprint arXiv:1910.07475},
+ year={2019}
+ }
+
+ @inproceedings{sennrich2015neural,
+ title={Neural machine translation of rare words with subword units},
+ author={Sennrich, Rico and Haddow, Barry and Birch, Alexandra},
+ booktitle={Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics},
+ pages = {1715--1725},
+ year={2016}
+ }
+
+ @article{eriguchi2018zero,
+ title={Zero-shot cross-lingual classification using multilingual neural machine translation},
+ author={Eriguchi, Akiko and Johnson, Melvin and Firat, Orhan and Kazawa, Hideto and Macherey, Wolfgang},
+ journal={arXiv preprint arXiv:1809.04686},
+ year={2018}
+ }
+
+ @article{smith2017offline,
+ title={Offline bilingual word vectors, orthogonal transformations and the inverted softmax},
+ author={Smith, Samuel L and Turban, David HP and Hamblin, Steven and Hammerla, Nils Y},
+ journal={International Conference on Learning Representations},
+ year={2017}
+ }
+
+ @article{artetxe2016learning,
+ title={Learning principled bilingual mappings of word embeddings while preserving monolingual invariance},
+ author={Artetxe, Mikel and Labaka, Gorka and Agirre, Eneko},
+ journal={Proceedings of EMNLP},
+ year={2016}
+ }
+
+ @article{ammar2016massively,
+ title={Massively multilingual word embeddings},
+ author={Ammar, Waleed and Mulcaire, George and Tsvetkov, Yulia and Lample, Guillaume and Dyer, Chris and Smith, Noah A},
+ journal={arXiv preprint arXiv:1602.01925},
+ year={2016}
+ }
+
+ @article{marcobaroni2015hubness,
+ title={Hubness and pollution: Delving into cross-space mapping for zero-shot learning},
+ author={Lazaridou, Angeliki and Dinu, Georgiana and Baroni, Marco},
+ journal={Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics},
+ year={2015}
+ }
+
+ @article{xing2015normalized,
+ title={Normalized Word Embedding and Orthogonal Transform for Bilingual Word Translation},
+ author={Xing, Chao and Wang, Dong and Liu, Chao and Lin, Yiye},
+ journal={Proceedings of NAACL},
+ year={2015}
+ }
+
+ @article{faruqui2014improving,
+ title={Improving Vector Space Word Representations Using Multilingual Correlation},
+ author={Faruqui, Manaal and Dyer, Chris},
+ journal={Proceedings of EACL},
+ year={2014}
+ }
+
+ @article{taylor1953cloze,
+ title={``Cloze procedure'': A new tool for measuring readability},
+ author={Taylor, Wilson L},
+ journal={Journalism Bulletin},
+ volume={30},
+ number={4},
+ pages={415--433},
+ year={1953},
+ publisher={SAGE Publications Sage CA: Los Angeles, CA}
+ }
+
+ @inproceedings{mikolov2013distributed,
+ title={Distributed representations of words and phrases and their compositionality},
+ author={Mikolov, Tomas and Sutskever, Ilya and Chen, Kai and Corrado, Greg S and Dean, Jeff},
+ booktitle={NIPS},
+ pages={3111--3119},
+ year={2013}
+ }
+
+ @article{mikolov2013exploiting,
+ title={Exploiting similarities among languages for machine translation},
+ author={Mikolov, Tomas and Le, Quoc V and Sutskever, Ilya},
+ journal={arXiv preprint arXiv:1309.4168},
+ year={2013}
+ }
+
+ @article{artetxe2018massively,
+ title={Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond},
+ author={Artetxe, Mikel and Schwenk, Holger},
+ journal={arXiv preprint arXiv:1812.10464},
+ year={2018}
+ }
+
+ @article{williams2017broad,
+ title={A broad-coverage challenge corpus for sentence understanding through inference},
+ author={Williams, Adina and Nangia, Nikita and Bowman, Samuel R},
+ journal={Proceedings of the 2nd Workshop on Evaluating Vector-Space Representations for NLP},
+ year={2017}
+ }
+
+ @inproceedings{conneau2018xnli,
+ author = "Conneau, Alexis
+ and Rinott, Ruty
+ and Lample, Guillaume
+ and Williams, Adina
+ and Bowman, Samuel R.
+ and Schwenk, Holger
+ and Stoyanov, Veselin",
+ title = "XNLI: Evaluating Cross-lingual Sentence Representations",
+ booktitle = "EMNLP",
+ year = "2018",
+ publisher = "Association for Computational Linguistics",
+ location = "Brussels, Belgium",
+ }
+
+ @article{wada2018unsupervised,
+ title={Unsupervised Cross-lingual Word Embedding by Multilingual Neural Language Models},
+ author={Wada, Takashi and Iwata, Tomoharu},
+ journal={arXiv preprint arXiv:1809.02306},
+ year={2018}
+ }
+
+ @article{xu2013cross,
+ title={Cross-lingual language modeling for low-resource speech recognition},
+ author={Xu, Ping and Fung, Pascale},
+ journal={IEEE Transactions on Audio, Speech, and Language Processing},
+ volume={21},
+ number={6},
+ pages={1134--1144},
+ year={2013},
+ publisher={IEEE}
+ }
+
+ @article{hermann2014multilingual,
+ title={Multilingual models for compositional distributed semantics},
+ author={Hermann, Karl Moritz and Blunsom, Phil},
+ journal={arXiv preprint arXiv:1404.4641},
+ year={2014}
+ }
+
+ @inproceedings{transformer17,
+ title = {Attention is all you need},
+ author = {Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin},
+ booktitle={Advances in Neural Information Processing Systems},
+ pages={6000--6010},
+ year = {2017}
+ }
+
+ @article{liu2019multi,
+ title={Multi-task deep neural networks for natural language understanding},
+ author={Liu, Xiaodong and He, Pengcheng and Chen, Weizhu and Gao, Jianfeng},
+ journal={arXiv preprint arXiv:1901.11504},
+ year={2019}
+ }
+
+ @article{wang2018glue,
+ title={GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},
+ author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R},
+ journal={arXiv preprint arXiv:1804.07461},
+ year={2018}
+ }
+
+ @article{radford2018improving,
+ title={Improving language understanding by generative pre-training},
+ author={Radford, Alec and Narasimhan, Karthik and Salimans, Tim and Sutskever, Ilya},
+ url={https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf},
+ year={2018}
+ }
+
+ @article{conneau2018senteval,
+ title={SentEval: An Evaluation Toolkit for Universal Sentence Representations},
+ author={Conneau, Alexis and Kiela, Douwe},
+ journal={LREC},
+ year={2018}
+ }
+
+ @article{devlin2018bert,
+ title={{BERT}: Pre-training of deep bidirectional transformers for language understanding},
+ author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
+ journal={NAACL},
+ year={2018}
+ }
+
+ @article{peters2018deep,
+ title={Deep contextualized word representations},
+ author={Peters, Matthew E and Neumann, Mark and Iyyer, Mohit and Gardner, Matt and Clark, Christopher and Lee, Kenton and Zettlemoyer, Luke},
+ journal={NAACL},
+ year={2018}
+ }
+
+ @article{ramachandran2016unsupervised,
+ title={Unsupervised pretraining for sequence to sequence learning},
+ author={Ramachandran, Prajit and Liu, Peter J and Le, Quoc V},
+ journal={arXiv preprint arXiv:1611.02683},
+ year={2016}
+ }
+
+ @inproceedings{kunchukuttan2018iit,
+ title={The IIT Bombay English-Hindi Parallel Corpus},
+ author={Kunchukuttan, Anoop and Mehta, Pratik and Bhattacharyya, Pushpak},
+ booktitle={LREC},
+ year={2018}
+ }
+
+ @article{wu2019beto,
+ title={Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT},
+ author={Wu, Shijie and Dredze, Mark},
+ journal={EMNLP},
+ year={2019}
+ }
+
+ @inproceedings{lample-etal-2016-neural,
+ title = "Neural Architectures for Named Entity Recognition",
+ author = "Lample, Guillaume and
+ Ballesteros, Miguel and
+ Subramanian, Sandeep and
+ Kawakami, Kazuya and
+ Dyer, Chris",
+ booktitle = "NAACL",
+ month = jun,
+ year = "2016",
+ address = "San Diego, California",
+ publisher = "Association for Computational Linguistics",
+ url = "https://www.aclweb.org/anthology/N16-1030",
+ doi = "10.18653/v1/N16-1030",
+ pages = "260--270",
+ }
+
+ @inproceedings{akbik2018coling,
+ title={Contextual String Embeddings for Sequence Labeling},
+ author={Akbik, Alan and Blythe, Duncan and Vollgraf, Roland},
+ booktitle = {COLING},
+ pages = {1638--1649},
+ year = {2018}
+ }
+
+ @inproceedings{tjong-kim-sang-de-meulder-2003-introduction,
+ title = "Introduction to the {C}o{NLL}-2003 Shared Task: Language-Independent Named Entity Recognition",
+ author = "Tjong Kim Sang, Erik F. and
+ De Meulder, Fien",
+ booktitle = "Proceedings of the Seventh Conference on Natural Language Learning at {HLT}-{NAACL} 2003",
+ year = "2003",
+ url = "https://www.aclweb.org/anthology/W03-0419",
+ pages = "142--147",
+ }
+
+ @inproceedings{tjong-kim-sang-2002-introduction,
+ title = "Introduction to the {C}o{NLL}-2002 Shared Task: Language-Independent Named Entity Recognition",
+ author = "Tjong Kim Sang, Erik F.",
+ booktitle = "{COLING}-02: The 6th Conference on Natural Language Learning 2002 ({C}o{NLL}-2002)",
+ year = "2002",
+ url = "https://www.aclweb.org/anthology/W02-2024",
+ }
+
+ @inproceedings{TIEDEMANN12.463,
+ author = {J{\"o}rg Tiedemann},
+ title = {Parallel Data, Tools and Interfaces in OPUS},
+ booktitle = {LREC},
+ year = {2012},
+ month = {may},
+ date = {23-25},
+ address = {Istanbul, Turkey},
+ editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Ugur Dogan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},
+ publisher = {European Language Resources Association (ELRA)},
+ isbn = {978-2-9517408-7-7},
+ language = {english}
+ }
+
+ @inproceedings{ziemski2016united,
+ title={The United Nations Parallel Corpus v1.0},
+ author={Ziemski, Michal and Junczys-Dowmunt, Marcin and Pouliquen, Bruno},
+ booktitle={LREC},
+ year={2016}
+ }
+
+ @article{roberta2019,
+ author = {Yinhan Liu and
+ Myle Ott and
+ Naman Goyal and
+ Jingfei Du and
+ Mandar Joshi and
+ Danqi Chen and
+ Omer Levy and
+ Mike Lewis and
+ Luke Zettlemoyer and
+ Veselin Stoyanov},
+ title = {RoBERTa: {A} Robustly Optimized {BERT} Pretraining Approach},
+ journal = {arXiv preprint arXiv:1907.11692},
+ year = {2019}
+ }
+
+
+ @article{tan2019multilingual,
+ title={Multilingual neural machine translation with knowledge distillation},
+ author={Tan, Xu and Ren, Yi and He, Di and Qin, Tao and Zhao, Zhou and Liu, Tie-Yan},
+ journal={ICLR},
+ year={2019}
+ }
+
+ @article{siddhant2019evaluating,
+ title={Evaluating the Cross-Lingual Effectiveness of Massively Multilingual Neural Machine Translation},
+ author={Siddhant, Aditya and Johnson, Melvin and Tsai, Henry and Arivazhagan, Naveen and Riesa, Jason and Bapna, Ankur and Firat, Orhan and Raman, Karthik},
+ journal={AAAI},
+ year={2019}
+ }
+
+ @inproceedings{camacho2017semeval,
+ title={{SemEval}-2017 task 2: Multilingual and cross-lingual semantic word similarity},
+ author={Camacho-Collados, Jose and Pilehvar, Mohammad Taher and Collier, Nigel and Navigli, Roberto},
+ booktitle={Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)},
+ pages={15--26},
+ year={2017}
+ }
+
+ @inproceedings{Pires2019HowMI,
+ title={How Multilingual is Multilingual BERT?},
+ author={Telmo Pires and Eva Schlinger and Dan Garrette},
+ booktitle={ACL},
+ year={2019}
+ }
+
+ @article{lample2019cross,
+ title={Cross-lingual language model pretraining},
+ author={Lample, Guillaume and Conneau, Alexis},
+ journal={NeurIPS},
+ year={2019}
+ }
+
+ @article{schuster2019cross,
+ title={Cross-Lingual Alignment of Contextual Word Embeddings, with Applications to Zero-shot Dependency Parsing},
+ author={Schuster, Tal and Ram, Ori and Barzilay, Regina and Globerson, Amir},
+ journal={NAACL},
+ year={2019}
+ }
+
+
+ @inproceedings{koehn2007moses,
+ title={Moses: Open source toolkit for statistical machine translation},
+ author={Koehn, Philipp and Hoang, Hieu and Birch, Alexandra and Callison-Burch, Chris and Federico, Marcello and Bertoldi, Nicola and Cowan, Brooke and Shen, Wade and Moran, Christine and Zens, Richard and others},
+ booktitle={Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions},
+ pages={177--180},
+ year={2007},
+ organization={Association for Computational Linguistics}
+ }
+
+ @article{wenzek2019ccnet,
+ title={CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data},
+ author={Wenzek, Guillaume and Lachaux, Marie-Anne and Conneau, Alexis and Chaudhary, Vishrav and Guzman, Francisco and Joulin, Armand and Grave, Edouard},
+ journal={arXiv preprint arXiv:1911.00359},
+ year={2019}
+ }
+
+ @inproceedings{zhou2016cross,
+ title={Cross-lingual sentiment classification with bilingual document representation learning},
+ author={Zhou, Xinjie and Wan, Xiaojun and Xiao, Jianguo},
+ booktitle={Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
+ pages={1403--1412},
+ year={2016}
+ }
+
+ @article{goyal2017accurate,
+ title={Accurate, large minibatch {SGD}: Training {ImageNet} in 1 hour},
+ author={Goyal, Priya and Doll{\'a}r, Piotr and Girshick, Ross and Noordhuis, Pieter and Wesolowski, Lukasz and Kyrola, Aapo and Tulloch, Andrew and Jia, Yangqing and He, Kaiming},
+ journal={arXiv preprint arXiv:1706.02677},
+ year={2017}
+ }
+
+ @article{arivazhagan2019massively,
+ title={Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges},
+ author={Arivazhagan, Naveen and Bapna, Ankur and Firat, Orhan and Lepikhin, Dmitry and Johnson, Melvin and Krikun, Maxim and Chen, Mia Xu and Cao, Yuan and Foster, George and Cherry, Colin and others},
+ journal={arXiv preprint arXiv:1907.05019},
+ year={2019}
+ }
+
+ @inproceedings{pan2017cross,
+ title={Cross-lingual name tagging and linking for 282 languages},
+ author={Pan, Xiaoman and Zhang, Boliang and May, Jonathan and Nothman, Joel and Knight, Kevin and Ji, Heng},
+ booktitle={Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
+ volume={1},
+ pages={1946--1958},
+ year={2017}
+ }
+
+ @article{raffel2019exploring,
+ title={Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
+ author={Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
+ year={2019},
+ journal={arXiv preprint arXiv:1910.10683},
+ }
+
+ @inproceedings{pennington2014glove,
+ author = {Jeffrey Pennington and Richard Socher and Christopher D. Manning},
+ booktitle = {EMNLP},
+ title = {GloVe: Global Vectors for Word Representation},
+ year = {2014},
+ pages = {1532--1543},
+ url = {http://www.aclweb.org/anthology/D14-1162},
+ }
+
+ @article{kudo2018sentencepiece,
+ title={{SentencePiece}: A simple and language independent subword tokenizer and detokenizer for neural text processing},
+ author={Kudo, Taku and Richardson, John},
+ journal={EMNLP},
+ year={2018}
+ }
+
+ @article{rajpurkar2018know,
+ title={Know What You Don't Know: Unanswerable Questions for SQuAD},
+ author={Rajpurkar, Pranav and Jia, Robin and Liang, Percy},
+ journal={ACL},
+ year={2018}
+ }
+
+ @article{joulin2017bag,
+ title={Bag of Tricks for Efficient Text Classification},
+ author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
+ journal={EACL 2017},
+ pages={427--431},
+ year={2017}
+ }
+
+ @inproceedings{kudo2018subword,
+ title={Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates},
+ author={Kudo, Taku},
+ booktitle={ACL},
+ pages={66--75},
+ year={2018}
+ }
+
+ @inproceedings{grave2018learning,
+ title={Learning Word Vectors for 157 Languages},
+ author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},
+ booktitle={LREC},
+ year={2018}
+ }
references/2019.arxiv.conneau/source/XLMR Paper/acl2020.sty ADDED
@@ -0,0 +1,560 @@
1
+ % This is the LaTex style file for ACL 2020, based off of ACL 2019.
2
+
3
+ % Addressing bibtex issues mentioned in https://github.com/acl-org/acl-pub/issues/2
4
+ % Other major modifications include
5
+ % changing the color of the line numbers to a light gray; changing font size of abstract to be 10pt; changing caption font size to be 10pt.
6
+ % -- M Mitchell and Stephanie Lukin
7
+
8
+ % 2017: modified to support DOI links in bibliography. Now uses
9
+ % natbib package rather than defining citation commands in this file.
10
+ % Use with acl_natbib.bst bib style. -- Dan Gildea
11
+
12
+ % This is the LaTeX style for ACL 2016. It contains Margaret Mitchell's
13
+ % line number adaptations (ported by Hai Zhao and Yannick Versley).
14
+
15
+ % It is nearly identical to the style files for ACL 2015,
16
+ % ACL 2014, EACL 2006, ACL2005, ACL 2002, ACL 2001, ACL 2000,
17
+ % EACL 95 and EACL 99.
18
+ %
19
+ % Changes made include: adapt layout to A4 and centimeters, widen abstract
20
+
21
+ % This is the LaTeX style file for ACL 2000. It is nearly identical to the
22
+ % style files for EACL 95 and EACL 99. Minor changes include editing the
23
+ % instructions to reflect use of \documentclass rather than \documentstyle
24
+ % and removing the white space before the title on the first page
25
+ % -- John Chen, June 29, 2000
26
+
27
+ % This is the LaTeX style file for EACL-95. It is identical to the
28
+ % style file for ANLP '94 except that the margins are adjusted for A4
29
+ % paper. -- abney 13 Dec 94
30
+
31
+ % The ANLP '94 style file is a slightly modified
32
+ % version of the style used for AAAI and IJCAI, using some changes
33
+ % prepared by Fernando Pereira and others and some minor changes
34
+ % by Paul Jacobs.
35
+
36
+ % Papers prepared using the aclsub.sty file and acl.bst bibtex style
37
+ % should be easily converted to final format using this style.
38
+ % (1) Submission information (\wordcount, \subject, and \makeidpage)
39
+ % should be removed.
40
+ % (2) \summary should be removed. The summary material should come
41
+ % after \maketitle and should be in the ``abstract'' environment
42
+ % (between \begin{abstract} and \end{abstract}).
43
+ % (3) Check all citations. This style should handle citations correctly
44
+ % and also allows multiple citations separated by semicolons.
45
+ % (4) Check figures and examples. Because the final format is double-
46
+ % column, some adjustments may have to be made to fit text in the column
47
+ % or to choose full-width (\figure*} figures.
48
+
49
+ % Place this in a file called aclap.sty in the TeX search path.
50
+ % (Placing it in the same directory as the paper should also work.)
51
+
52
+ % Prepared by Peter F. Patel-Schneider, liberally using the ideas of
53
+ % other style hackers, including Barbara Beeton.
54
+ % This style is NOT guaranteed to work. It is provided in the hope
55
+ % that it will make the preparation of papers easier.
56
+ %
57
+ % There are undoubtably bugs in this style. If you make bug fixes,
58
+ % improvements, etc. please let me know. My e-mail address is:
59
+ % pfps@research.att.com
60
+
61
+ % Papers are to be prepared using the ``acl_natbib'' bibliography style,
62
+ % as follows:
63
+ % \documentclass[11pt]{article}
64
+ % \usepackage{acl2000}
65
+ % \title{Title}
66
+ % \author{Author 1 \and Author 2 \\ Address line \\ Address line \And
67
+ % Author 3 \\ Address line \\ Address line}
68
+ % \begin{document}
69
+ % ...
70
+ % \bibliography{bibliography-file}
71
+ % \bibliographystyle{acl_natbib}
72
+ % \end{document}
73
+
74
+ % Author information can be set in various styles:
75
+ % For several authors from the same institution:
76
+ % \author{Author 1 \and ... \and Author n \\
77
+ % Address line \\ ... \\ Address line}
78
+ % if the names do not fit well on one line use
79
+ % Author 1 \\ {\bf Author 2} \\ ... \\ {\bf Author n} \\
80
+ % For authors from different institutions:
81
+ % \author{Author 1 \\ Address line \\ ... \\ Address line
82
+ % \And ... \And
83
+ % Author n \\ Address line \\ ... \\ Address line}
84
+ % To start a seperate ``row'' of authors use \AND, as in
85
+ % \author{Author 1 \\ Address line \\ ... \\ Address line
86
+ % \AND
87
+ % Author 2 \\ Address line \\ ... \\ Address line \And
88
+ % Author 3 \\ Address line \\ ... \\ Address line}
89
+
+ % If the title and author information does not fit in the area allocated,
+ % place \setlength\titlebox{<new height>} right after
+ % \usepackage{acl2015}
+ % where <new height> can be something larger than 5cm
+
+ % include hyperref, unless user specifies nohyperref option like this:
+ % \usepackage[nohyperref]{naaclhlt2018}
+ \newif\ifacl@hyperref
+ \DeclareOption{hyperref}{\acl@hyperreftrue}
+ \DeclareOption{nohyperref}{\acl@hyperreffalse}
+ \ExecuteOptions{hyperref} % default is to use hyperref
+ \ProcessOptions\relax
+ \ifacl@hyperref
+ \RequirePackage{hyperref}
+ \usepackage{xcolor} % make links dark blue
+ \definecolor{darkblue}{rgb}{0, 0, 0.5}
+ \hypersetup{colorlinks=true,citecolor=darkblue, linkcolor=darkblue, urlcolor=darkblue}
+ \else
+ % This definition is used if the hyperref package is not loaded.
+ % It provides a backup, no-op definition of \href.
+ % This is necessary because the \href command is used in the acl_natbib.bst file.
+ \def\href#1#2{{#2}}
+ % We still need to load xcolor in this case because the lighter line numbers require it. (SC/KG/WL)
+ \usepackage{xcolor}
+ \fi
+
+ \typeout{Conference Style for ACL 2019}
+
+ % NOTE: Some laser printers have a serious problem printing TeX output.
+ % These printing devices, commonly known as ``write-white'' laser
+ % printers, tend to make characters too light. To get around this
+ % problem, a darker set of fonts must be created for these devices.
+ %
+
+ \newcommand{\Thanks}[1]{\thanks{\ #1}}
+
+ % A4 modified by Eneko; again modified by Alexander for 5cm titlebox
+ \setlength{\paperwidth}{21cm} % A4
+ \setlength{\paperheight}{29.7cm}% A4
+ \setlength\topmargin{-0.5cm}
+ \setlength\oddsidemargin{0cm}
+ \setlength\textheight{24.7cm}
+ \setlength\textwidth{16.0cm}
+ \setlength\columnsep{0.6cm}
+ \newlength\titlebox
+ \setlength\titlebox{5cm}
+ \setlength\headheight{5pt}
+ \setlength\headsep{0pt}
+ \thispagestyle{empty}
+ \pagestyle{empty}
+
+
+ \flushbottom \twocolumn \sloppy
+
+ % We're never going to need a table of contents, so just flush it to
+ % save space --- suggested by drstrip@sandia-2
+ \def\addcontentsline#1#2#3{}
+
+ \newif\ifaclfinal
+ \aclfinalfalse
+ \def\aclfinalcopy{\global\aclfinaltrue}
+
+ %% ----- Set up hooks to repeat content on every page of the output doc,
+ %% necessary for the line numbers in the submitted version. --MM
+ %%
+ %% Copied from CVPR 2015's cvpr_eso.sty, which appears to be largely copied from everyshi.sty.
+ %%
+ %% Original cvpr_eso.sty available at: http://www.pamitc.org/cvpr15/author_guidelines.php
+ %% Original everyshi.sty available at: https://www.ctan.org/pkg/everyshi
+ %%
+ %% Copyright (C) 2001 Martin Schr\"oder:
+ %%
+ %% Martin Schr"oder
+ %% Cr"usemannallee 3
+ %% D-28213 Bremen
+ %% Martin.Schroeder@ACM.org
+ %%
+ %% This program may be redistributed and/or modified under the terms
+ %% of the LaTeX Project Public License, either version 1.0 of this
+ %% license, or (at your option) any later version.
+ %% The latest version of this license is in
+ %% CTAN:macros/latex/base/lppl.txt.
+ %%
+ %% Happy users are requested to send [Martin] a postcard. :-)
+ %%
+ \newcommand{\@EveryShipoutACL@Hook}{}
+ \newcommand{\@EveryShipoutACL@AtNextHook}{}
+ \newcommand*{\EveryShipoutACL}[1]
+ {\g@addto@macro\@EveryShipoutACL@Hook{#1}}
+ \newcommand*{\AtNextShipoutACL@}[1]
+ {\g@addto@macro\@EveryShipoutACL@AtNextHook{#1}}
+ \newcommand{\@EveryShipoutACL@Shipout}{%
+ \afterassignment\@EveryShipoutACL@Test
+ \global\setbox\@cclv= %
+ }
+ \newcommand{\@EveryShipoutACL@Test}{%
+ \ifvoid\@cclv\relax
+ \aftergroup\@EveryShipoutACL@Output
+ \else
+ \@EveryShipoutACL@Output
+ \fi%
+ }
+ \newcommand{\@EveryShipoutACL@Output}{%
+ \@EveryShipoutACL@Hook%
+ \@EveryShipoutACL@AtNextHook%
+ \gdef\@EveryShipoutACL@AtNextHook{}%
+ \@EveryShipoutACL@Org@Shipout\box\@cclv%
+ }
+ \newcommand{\@EveryShipoutACL@Org@Shipout}{}
+ \newcommand*{\@EveryShipoutACL@Init}{%
+ \message{ABD: EveryShipout initializing macros}%
+ \let\@EveryShipoutACL@Org@Shipout\shipout
+ \let\shipout\@EveryShipoutACL@Shipout
+ }
+ \AtBeginDocument{\@EveryShipoutACL@Init}
+
+ %% ----- Set up for placing additional items into the submitted version --MM
+ %%
+ %% Based on eso-pic.sty
+ %%
+ %% Original available at: https://www.ctan.org/tex-archive/macros/latex/contrib/eso-pic
+ %% Copyright (C) 1998-2002 by Rolf Niepraschk <niepraschk@ptb.de>
+ %%
+ %% Which may be distributed and/or modified under the conditions of
+ %% the LaTeX Project Public License, either version 1.2 of this license
+ %% or (at your option) any later version. The latest version of this
+ %% license is in:
+ %%
+ %% http://www.latex-project.org/lppl.txt
+ %%
+ %% and version 1.2 or later is part of all distributions of LaTeX version
+ %% 1999/12/01 or later.
+ %%
+ %% In contrast to the original, we do not include the definitions for/using:
+ %% gridpicture, div[2], isMEMOIR[1], gridSetup[6][], subgridstyle{dotted}, labelfactor{}, gap{}, gridunitname{}, gridunit{}, gridlines{\thinlines}, subgridlines{\thinlines}, the {keyval} package, evenside margin, nor any definitions with 'color'.
+ %%
+ %% These are beyond what is needed for the NAACL/ACL style.
+ %%
+ \newcommand\LenToUnit[1]{#1\@gobble}
+ \newcommand\AtPageUpperLeft[1]{%
+ \begingroup
+ \@tempdima=0pt\relax\@tempdimb=\ESO@yoffsetI\relax
+ \put(\LenToUnit{\@tempdima},\LenToUnit{\@tempdimb}){#1}%
+ \endgroup
+ }
+ \newcommand\AtPageLowerLeft[1]{\AtPageUpperLeft{%
+ \put(0,\LenToUnit{-\paperheight}){#1}}}
+ \newcommand\AtPageCenter[1]{\AtPageUpperLeft{%
+ \put(\LenToUnit{.5\paperwidth},\LenToUnit{-.5\paperheight}){#1}}}
+ \newcommand\AtPageLowerCenter[1]{\AtPageUpperLeft{%
+ \put(\LenToUnit{.5\paperwidth},\LenToUnit{-\paperheight}){#1}}}%
+ \newcommand\AtPageLowishCenter[1]{\AtPageUpperLeft{%
+ \put(\LenToUnit{.5\paperwidth},\LenToUnit{-.96\paperheight}){#1}}}
+ \newcommand\AtTextUpperLeft[1]{%
+ \begingroup
+ \setlength\@tempdima{1in}%
+ \advance\@tempdima\oddsidemargin%
+ \@tempdimb=\ESO@yoffsetI\relax\advance\@tempdimb-1in\relax%
+ \advance\@tempdimb-\topmargin%
+ \advance\@tempdimb-\headheight\advance\@tempdimb-\headsep%
+ \put(\LenToUnit{\@tempdima},\LenToUnit{\@tempdimb}){#1}%
+ \endgroup
+ }
+ \newcommand\AtTextLowerLeft[1]{\AtTextUpperLeft{%
+ \put(0,\LenToUnit{-\textheight}){#1}}}
+ \newcommand\AtTextCenter[1]{\AtTextUpperLeft{%
+ \put(\LenToUnit{.5\textwidth},\LenToUnit{-.5\textheight}){#1}}}
+ \newcommand{\ESO@HookI}{} \newcommand{\ESO@HookII}{}
+ \newcommand{\ESO@HookIII}{}
+ \newcommand{\AddToShipoutPicture}{%
+ \@ifstar{\g@addto@macro\ESO@HookII}{\g@addto@macro\ESO@HookI}}
+ \newcommand{\ClearShipoutPicture}{\global\let\ESO@HookI\@empty}
+ \newcommand{\@ShipoutPicture}{%
+ \bgroup
+ \@tempswafalse%
+ \ifx\ESO@HookI\@empty\else\@tempswatrue\fi%
+ \ifx\ESO@HookII\@empty\else\@tempswatrue\fi%
+ \ifx\ESO@HookIII\@empty\else\@tempswatrue\fi%
+ \if@tempswa%
+ \@tempdima=1in\@tempdimb=-\@tempdima%
+ \advance\@tempdimb\ESO@yoffsetI%
+ \unitlength=1pt%
+ \global\setbox\@cclv\vbox{%
+ \vbox{\let\protect\relax
+ \pictur@(0,0)(\strip@pt\@tempdima,\strip@pt\@tempdimb)%
+ \ESO@HookIII\ESO@HookI\ESO@HookII%
+ \global\let\ESO@HookII\@empty%
+ \endpicture}%
+ \nointerlineskip%
+ \box\@cclv}%
+ \fi
+ \egroup
+ }
+ \EveryShipoutACL{\@ShipoutPicture}
+ \newif\ifESO@dvips\ESO@dvipsfalse
+ \newif\ifESO@grid\ESO@gridfalse
+ \newif\ifESO@texcoord\ESO@texcoordfalse
+ \newcommand*\ESO@griddelta{}\newcommand*\ESO@griddeltaY{}
+ \newcommand*\ESO@gridDelta{}\newcommand*\ESO@gridDeltaY{}
+ \newcommand*\ESO@yoffsetI{}\newcommand*\ESO@yoffsetII{}
+ \ifESO@texcoord
+ \def\ESO@yoffsetI{0pt}\def\ESO@yoffsetII{-\paperheight}
+ \edef\ESO@griddeltaY{-\ESO@griddelta}\edef\ESO@gridDeltaY{-\ESO@gridDelta}
+ \else
+ \def\ESO@yoffsetI{\paperheight}\def\ESO@yoffsetII{0pt}
+ \edef\ESO@griddeltaY{\ESO@griddelta}\edef\ESO@gridDeltaY{\ESO@gridDelta}
+ \fi
+
+
+ %% ----- Submitted version markup: Page numbers, ruler, and confidentiality. Using ideas/code from cvpr.sty 2015. --MM
+
+ \font\aclhv = phvb at 8pt
+
+ %% Define vruler %%
+
+ %\makeatletter
+ \newbox\aclrulerbox
+ \newcount\aclrulercount
+ \newdimen\aclruleroffset
+ \newdimen\cv@lineheight
+ \newdimen\cv@boxheight
+ \newbox\cv@tmpbox
+ \newcount\cv@refno
+ \newcount\cv@tot
+ % NUMBER with left flushed zeros \fillzeros[<WIDTH>]<NUMBER>
+ \newcount\cv@tmpc@ \newcount\cv@tmpc
+ \def\fillzeros[#1]#2{\cv@tmpc@=#2\relax\ifnum\cv@tmpc@<0\cv@tmpc@=-\cv@tmpc@\fi
+ \cv@tmpc=1 %
+ \loop\ifnum\cv@tmpc@<10 \else \divide\cv@tmpc@ by 10 \advance\cv@tmpc by 1 \fi
+ \ifnum\cv@tmpc@=10\relax\cv@tmpc@=11\relax\fi \ifnum\cv@tmpc@>10 \repeat
+ \ifnum#2<0\advance\cv@tmpc1\relax-\fi
+ \loop\ifnum\cv@tmpc<#1\relax0\advance\cv@tmpc1\relax\fi \ifnum\cv@tmpc<#1 \repeat
+ \cv@tmpc@=#2\relax\ifnum\cv@tmpc@<0\cv@tmpc@=-\cv@tmpc@\fi \relax\the\cv@tmpc@}%
+ % \makevruler[<SCALE>][<INITIAL_COUNT>][<STEP>][<DIGITS>][<HEIGHT>]
+ \def\makevruler[#1][#2][#3][#4][#5]{\begingroup\offinterlineskip
+ \textheight=#5\vbadness=10000\vfuzz=120ex\overfullrule=0pt%
+ \global\setbox\aclrulerbox=\vbox to \textheight{%
+ {\parskip=0pt\hfuzz=150em\cv@boxheight=\textheight
+ \color{gray}
+ \cv@lineheight=#1\global\aclrulercount=#2%
+ \cv@tot\cv@boxheight\divide\cv@tot\cv@lineheight\advance\cv@tot2%
+ \cv@refno1\vskip-\cv@lineheight\vskip1ex%
+ \loop\setbox\cv@tmpbox=\hbox to0cm{{\aclhv\hfil\fillzeros[#4]\aclrulercount}}%
+ \ht\cv@tmpbox\cv@lineheight\dp\cv@tmpbox0pt\box\cv@tmpbox\break
+ \advance\cv@refno1\global\advance\aclrulercount#3\relax
+ \ifnum\cv@refno<\cv@tot\repeat}}\endgroup}%
+ %\makeatother
+
+
+ \def\aclpaperid{***}
+ \def\confidential{\textcolor{black}{ACL 2020 Submission~\aclpaperid. Confidential Review Copy. DO NOT DISTRIBUTE.}}
+
+ %% Page numbering, Vruler and Confidentiality %%
+ % \makevruler[<SCALE>][<INITIAL_COUNT>][<STEP>][<DIGITS>][<HEIGHT>]
+
+ % SC/KG/WL - changed line numbering to gainsboro
+ \definecolor{gainsboro}{rgb}{0.8, 0.8, 0.8}
+ %\def\aclruler#1{\makevruler[14.17pt][#1][1][3][\textheight]\usebox{\aclrulerbox}} %% old line
+ \def\aclruler#1{\textcolor{gainsboro}{\makevruler[14.17pt][#1][1][3][\textheight]\usebox{\aclrulerbox}}}
+
+ \def\leftoffset{-2.1cm} %original: -45pt
+ \def\rightoffset{17.5cm} %original: 500pt
+ \ifaclfinal\else\pagenumbering{arabic}
+ \AddToShipoutPicture{%
+ \ifaclfinal\else
+ \AtPageLowishCenter{\textcolor{black}{\thepage}}
+ \aclruleroffset=\textheight
+ \advance\aclruleroffset4pt
+ \AtTextUpperLeft{%
+ \put(\LenToUnit{\leftoffset},\LenToUnit{-\aclruleroffset}){%left ruler
+ \aclruler{\aclrulercount}}
+ \put(\LenToUnit{\rightoffset},\LenToUnit{-\aclruleroffset}){%right ruler
+ \aclruler{\aclrulercount}}
+ }
+ \AtTextUpperLeft{%confidential
+ \put(0,\LenToUnit{1cm}){\parbox{\textwidth}{\centering\aclhv\confidential}}
+ }
+ \fi
+ }
+
+ %%%% ----- End settings for placing additional items into the submitted version --MM ----- %%%%
+
+ %%%% ----- Begin settings for both submitted and camera-ready version ----- %%%%
+
+ %% Title and Authors %%
+
+ \newcommand\outauthor{
+ \begin{tabular}[t]{c}
+ \ifaclfinal
+ \bf\@author
+ \else
+ % Avoiding common accidental de-anonymization issue. --MM
+ \bf Anonymous ACL submission
+ \fi
+ \end{tabular}}
+
+ % Changing the expanded titlebox for submissions to 2.5 in (rather than 6.5cm)
+ % and moving it to the style sheet, rather than within the example tex file. --MM
+ \ifaclfinal
+ \else
+ \addtolength\titlebox{.25in}
+ \fi
+ % Mostly taken from deproc.
+ \def\maketitle{\par
+ \begingroup
+ \def\thefootnote{\fnsymbol{footnote}}
+ \def\@makefnmark{\hbox to 0pt{$^{\@thefnmark}$\hss}}
+ \twocolumn[\@maketitle] \@thanks
+ \endgroup
+ \setcounter{footnote}{0}
+ \let\maketitle\relax \let\@maketitle\relax
+ \gdef\@thanks{}\gdef\@author{}\gdef\@title{}\let\thanks\relax}
+ \def\@maketitle{\vbox to \titlebox{\hsize\textwidth
+ \linewidth\hsize \vskip 0.125in minus 0.125in \centering
+ {\Large\bf \@title \par} \vskip 0.2in plus 1fil minus 0.1in
+ {\def\and{\unskip\enspace{\rm and}\enspace}%
+ \def\And{\end{tabular}\hss \egroup \hskip 1in plus 2fil
+ \hbox to 0pt\bgroup\hss \begin{tabular}[t]{c}\bf}%
+ \def\AND{\end{tabular}\hss\egroup \hfil\hfil\egroup
+ \vskip 0.25in plus 1fil minus 0.125in
+ \hbox to \linewidth\bgroup\large \hfil\hfil
+ \hbox to 0pt\bgroup\hss \begin{tabular}[t]{c}\bf}
+ \hbox to \linewidth\bgroup\large \hfil\hfil
+ \hbox to 0pt\bgroup\hss
+ \outauthor
+ \hss\egroup
+ \hfil\hfil\egroup}
+ \vskip 0.3in plus 2fil minus 0.1in
+ }}
+
+ % margins and font size for abstract
+ \renewenvironment{abstract}%
+ {\centerline{\large\bf Abstract}%
+ \begin{list}{}%
+ {\setlength{\rightmargin}{0.6cm}%
+ \setlength{\leftmargin}{0.6cm}}%
+ \item[]\ignorespaces%
+ \@setsize\normalsize{12pt}\xpt\@xpt
+ }%
+ {\unskip\end{list}}
+
+ %\renewenvironment{abstract}{\centerline{\large\bf
+ % Abstract}\vspace{0.5ex}\begin{quote}}{\par\end{quote}\vskip 1ex}
+
+ % Resizing figure and table captions - SL
+ \newcommand{\figcapfont}{\rm}
+ \newcommand{\tabcapfont}{\rm}
+ \renewcommand{\fnum@figure}{\figcapfont Figure \thefigure}
+ \renewcommand{\fnum@table}{\tabcapfont Table \thetable}
+ \renewcommand{\figcapfont}{\@setsize\normalsize{12pt}\xpt\@xpt}
+ \renewcommand{\tabcapfont}{\@setsize\normalsize{12pt}\xpt\@xpt}
+ % Support for interacting with the caption, subfigure, and subcaption packages - SL
+ \usepackage{caption}
+ \DeclareCaptionFont{10pt}{\fontsize{10pt}{12pt}\selectfont}
+ \captionsetup{font=10pt}
+
+ \RequirePackage{natbib}
+ % for citation commands in the .tex, authors can use:
+ % \citep, \citet, and \citeyearpar for compatibility with natbib, or
+ % \cite, \newcite, and \shortcite for compatibility with older ACL .sty files
+ \renewcommand\cite{\citep} % to get "(Author Year)" with natbib
+ \newcommand\shortcite{\citeyearpar}% to get "(Year)" with natbib
+ \newcommand\newcite{\citet} % to get "Author (Year)" with natbib
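With the remappings above, the natbib commands and the older ACL aliases can be mixed freely in the paper source; a small illustrative fragment (the key `jones1990` is a placeholder):

```latex
% Illustrative only; `jones1990` is a placeholder citation key.
\citet{jones1990} proposed the method.        % textual: Author (Year)
\newcite{jones1990} proposed the method.      % same result via the older ACL alias
The method is widely used \citep{jones1990}.  % parenthetical: (Author, Year)
The method is widely used \cite{jones1990}.   % same, since \cite is remapped to \citep
```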
+
+ % DK/IV: Workaround for annoying hyperref pagewrap bug
+ % \RequirePackage{etoolbox}
+ % \patchcmd\@combinedblfloats{\box\@outputbox}{\unvbox\@outputbox}{}{\errmessage{\noexpand patch failed}}
+
+ % bibliography
+
+ \def\@up#1{\raise.2ex\hbox{#1}}
+
+ % Don't put a label in the bibliography at all. Just use the unlabeled format
+ % instead.
+ \def\thebibliography#1{\vskip\parskip%
+ \vskip\baselineskip%
+ \def\baselinestretch{1}%
+ \ifx\@currsize\normalsize\@normalsize\else\@currsize\fi%
+ \vskip-\parskip%
+ \vskip-\baselineskip%
+ \section*{References\@mkboth
+ {References}{References}}\list
+ {}{\setlength{\labelwidth}{0pt}\setlength{\leftmargin}{\parindent}
+ \setlength{\itemindent}{-\parindent}}
+ \def\newblock{\hskip .11em plus .33em minus -.07em}
+ \sloppy\clubpenalty4000\widowpenalty4000
+ \sfcode`\.=1000\relax}
+ \let\endthebibliography=\endlist
+
+
+ % Allow for a bibliography of sources of attested examples
+ \def\thesourcebibliography#1{\vskip\parskip%
+ \vskip\baselineskip%
+ \def\baselinestretch{1}%
+ \ifx\@currsize\normalsize\@normalsize\else\@currsize\fi%
+ \vskip-\parskip%
+ \vskip-\baselineskip%
+ \section*{Sources of Attested Examples\@mkboth
+ {Sources of Attested Examples}{Sources of Attested Examples}}\list
+ {}{\setlength{\labelwidth}{0pt}\setlength{\leftmargin}{\parindent}
+ \setlength{\itemindent}{-\parindent}}
+ \def\newblock{\hskip .11em plus .33em minus -.07em}
+ \sloppy\clubpenalty4000\widowpenalty4000
+ \sfcode`\.=1000\relax}
+ \let\endthesourcebibliography=\endlist
+
+ % sections with less space
+ \def\section{\@startsection {section}{1}{\z@}{-2.0ex plus
+ -0.5ex minus -.2ex}{1.5ex plus 0.3ex minus .2ex}{\large\bf\raggedright}}
+ \def\subsection{\@startsection{subsection}{2}{\z@}{-1.8ex plus
+ -0.5ex minus -.2ex}{0.8ex plus .2ex}{\normalsize\bf\raggedright}}
+ %% changed by KO to - values to get the initial parindent right
+ \def\subsubsection{\@startsection{subsubsection}{3}{\z@}{-1.5ex plus
+ -0.5ex minus -.2ex}{0.5ex plus .2ex}{\normalsize\bf\raggedright}}
+ \def\paragraph{\@startsection{paragraph}{4}{\z@}{1.5ex plus
+ 0.5ex minus .2ex}{-1em}{\normalsize\bf}}
+ \def\subparagraph{\@startsection{subparagraph}{5}{\parindent}{1.5ex plus
+ 0.5ex minus .2ex}{-1em}{\normalsize\bf}}
+
+ % Footnotes
+ \footnotesep 6.65pt %
+ \skip\footins 9pt plus 4pt minus 2pt
+ \def\footnoterule{\kern-3pt \hrule width 5pc \kern 2.6pt }
+ \setcounter{footnote}{0}
+
+ % Lists and paragraphs
+ \parindent 1em
+ \topsep 4pt plus 1pt minus 2pt
+ \partopsep 1pt plus 0.5pt minus 0.5pt
+ \itemsep 2pt plus 1pt minus 0.5pt
+ \parsep 2pt plus 1pt minus 0.5pt
+
+ \leftmargin 2em \leftmargini\leftmargin \leftmarginii 2em
+ \leftmarginiii 1.5em \leftmarginiv 1.0em \leftmarginv .5em \leftmarginvi .5em
+ \labelwidth\leftmargini\advance\labelwidth-\labelsep \labelsep 5pt
+
+ \def\@listi{\leftmargin\leftmargini}
+ \def\@listii{\leftmargin\leftmarginii
+ \labelwidth\leftmarginii\advance\labelwidth-\labelsep
+ \topsep 2pt plus 1pt minus 0.5pt
+ \parsep 1pt plus 0.5pt minus 0.5pt
+ \itemsep \parsep}
+ \def\@listiii{\leftmargin\leftmarginiii
+ \labelwidth\leftmarginiii\advance\labelwidth-\labelsep
+ \topsep 1pt plus 0.5pt minus 0.5pt
+ \parsep \z@ \partopsep 0.5pt plus 0pt minus 0.5pt
+ \itemsep \topsep}
+ \def\@listiv{\leftmargin\leftmarginiv
+ \labelwidth\leftmarginiv\advance\labelwidth-\labelsep}
+ \def\@listv{\leftmargin\leftmarginv
+ \labelwidth\leftmarginv\advance\labelwidth-\labelsep}
+ \def\@listvi{\leftmargin\leftmarginvi
+ \labelwidth\leftmarginvi\advance\labelwidth-\labelsep}
+
+ \abovedisplayskip 7pt plus2pt minus5pt%
+ \belowdisplayskip \abovedisplayskip
+ \abovedisplayshortskip 0pt plus3pt%
+ \belowdisplayshortskip 4pt plus3pt minus3pt%
+
+ % Less leading in most fonts (due to the narrow columns)
+ % The choices were between 1-pt and 1.5-pt leading
+ \def\@normalsize{\@setsize\normalsize{11pt}\xpt\@xpt}
+ \def\small{\@setsize\small{10pt}\ixpt\@ixpt}
+ \def\footnotesize{\@setsize\footnotesize{10pt}\ixpt\@ixpt}
+ \def\scriptsize{\@setsize\scriptsize{8pt}\viipt\@viipt}
+ \def\tiny{\@setsize\tiny{7pt}\vipt\@vipt}
+ \def\large{\@setsize\large{14pt}\xiipt\@xiipt}
+ \def\Large{\@setsize\Large{16pt}\xivpt\@xivpt}
+ \def\LARGE{\@setsize\LARGE{20pt}\xviipt\@xviipt}
+ \def\huge{\@setsize\huge{23pt}\xxpt\@xxpt}
+ \def\Huge{\@setsize\Huge{28pt}\xxvpt\@xxvpt}
references/2019.arxiv.conneau/source/XLMR Paper/acl_natbib.bst ADDED
@@ -0,0 +1,1975 @@
+ %%% acl_natbib.bst
+ %%% Modification of BibTeX style file acl_natbib_nourl.bst
+ %%% ... by urlbst, version 0.7 (marked with "% urlbst")
+ %%% See <http://purl.org/nxg/dist/urlbst>
+ %%% Added webpage entry type, and url and lastchecked fields.
+ %%% Added eprint support.
+ %%% Added DOI support.
+ %%% Added PUBMED support.
+ %%% Added hyperref support.
+ %%% Original headers follow...
+
+ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+ %
+ % BibTeX style file acl_natbib_nourl.bst
+ %
+ % intended as input to urlbst script
+ % $ ./urlbst --hyperref --inlinelinks acl_natbib_nourl.bst > acl_natbib.bst
+ %
+ % adapted from compling.bst
+ % in order to mimic the style files for ACL conferences prior to 2017
+ % by making the following three changes:
+ % - for @incollection, page numbers now follow volume title.
+ % - for @inproceedings, address now follows conference name.
+ % (address is intended as location of conference,
+ % not address of publisher.)
+ % - for papers with three authors, use et al. in citation
+ % Dan Gildea 2017/06/08
+ % - fixed a bug with format.chapter - error given if chapter is empty
+ % with inbook.
+ % Shay Cohen 2018/02/16
+
+ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+ %
+ % BibTeX style file compling.bst
+ %
+ % Intended for the journal Computational Linguistics (ACL/MIT Press)
+ % Created by Ron Artstein on 2005/08/22
+ % For use with <natbib.sty> for author-year citations.
+ %
+ % I created this file in order to allow submissions to the journal
+ % Computational Linguistics using the <natbib> package for author-year
+ % citations, which offers a lot more flexibility than <fullname>, CL's
+ % official citation package. This file adheres strictly to the official
+ % style guide available from the MIT Press:
+ %
+ % http://mitpress.mit.edu/journals/coli/compling_style.pdf
+ %
+ % This includes all the various quirks of the style guide, for example:
+ % - a chapter from a monograph (@inbook) has no page numbers.
+ % - an article from an edited volume (@incollection) has page numbers
+ % after the publisher and address.
+ % - an article from a proceedings volume (@inproceedings) has page
+ % numbers before the publisher and address.
+ %
+ % Where the style guide was inconsistent or not specific enough I
+ % looked at actual published articles and exercised my own judgment.
+ % I noticed two inconsistencies in the style guide:
+ %
+ % - The style guide gives one example of an article from an edited
+ % volume with the editor's name spelled out in full, and another
+ % with the editors' names abbreviated. I chose to accept the first
+ % one as correct, since the style guide generally shuns abbreviations,
+ % and editors' names are also spelled out in some recently published
+ % articles.
+ %
+ % - The style guide gives one example of a reference where the word
+ % "and" between two authors is preceded by a comma. This is most
+ % likely a typo, since in all other cases with just two authors or
+ % editors there is no comma before the word "and".
+ %
+ % One case where the style guide is not being specific is the placement
+ % of the edition number, for which no example is given. I chose to put
+ % it immediately after the title, which I (subjectively) find natural,
+ % and is also the place of the edition in a few recently published
+ % articles.
+ %
+ % This file correctly reproduces all of the examples in the official
+ % style guide, except for the two inconsistencies noted above. I even
+ % managed to get it to correctly format the proceedings example which
+ % has an organization, a publisher, and two addresses (the conference
+ % location and the publisher's address), though I cheated a bit by
+ % putting the conference location and month as part of the title field;
+ % I feel that in this case the conference location and month can be
+ % considered as part of the title, and that adding a location field
+ % is not justified. Note also that a location field is not standard,
+ % so entries made with this field would not port nicely to other styles.
+ % However, if authors feel that there's a need for a location field
+ % then tell me and I'll see what I can do.
+ %
+ % The file also produces to my satisfaction all the bibliographical
+ % entries in my recent (joint) submission to CL (this was the original
+ % motivation for creating the file). I also tested it by running it
+ % on a larger set of entries and eyeballing the results. There may of
+ % course still be errors, especially with combinations of fields that
+ % are not that common, or with cross-references (which I seldom use).
+ % If you find such errors please write to me.
+ %
+ % I hope people find this file useful. Please email me with comments
+ % and suggestions.
+ %
+ % Ron Artstein
+ % artstein [at] essex.ac.uk
+ % August 22, 2005.
+ %
+ % Some technical notes.
+ %
+ % This file is based on a file generated with the package <custom-bib>
+ % by Patrick W. Daly (see selected options below), which was then
+ % manually customized to conform with certain CL requirements which
+ % cannot be met by <custom-bib>. Departures from the generated file
+ % include:
+ %
+ % Function inbook: moved publisher and address to the end; moved
+ % edition after title; replaced function format.chapter.pages by
+ % new function format.chapter to output chapter without pages.
+ %
+ % Function inproceedings: moved publisher and address to the end;
+ % replaced function format.in.ed.booktitle by new function
+ % format.in.booktitle to output the proceedings title without
+ % the editor.
+ %
+ % Functions book, incollection, manual: moved edition after title.
+ %
+ % Function mastersthesis: formatted title as for articles (unlike
+ % phdthesis which is formatted as book) and added month.
+ %
+ % Function proceedings: added new.sentence between organization and
+ % publisher when both are present.
+ %
+ % Function format.lab.names: modified so that it gives all the
+ % authors' surnames for in-text citations for one, two and three
+ % authors and only uses "et al." for works with four authors or more
+ % (thanks to Ken Shan for convincing me to go through the trouble of
+ % modifying this function rather than using unreliable hacks).
+ %
+ % Changes:
+ %
+ % 2006-10-27: Changed function reverse.pass so that the extra label is
+ % enclosed in parentheses when the year field ends in an uppercase or
+ % lowercase letter (change modeled after Uli Sauerland's modification
+ % of nals.bst). RA.
+ %
+ %
+ % The preamble of the generated file begins below:
+ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+ %%
+ %% This is file `compling.bst',
+ %% generated with the docstrip utility.
+ %%
+ %% The original source files were:
+ %%
+ %% merlin.mbs (with options: `ay,nat,vonx,nm-revv1,jnrlst,keyxyr,blkyear,dt-beg,yr-per,note-yr,num-xser,pre-pub,xedn,nfss')
+ %% ----------------------------------------
+ %% *** Intended for the journal Computational Linguistics ***
+ %%
+ %% Copyright 1994-2002 Patrick W Daly
+ % ===============================================================
+ % IMPORTANT NOTICE:
+ % This bibliographic style (bst) file has been generated from one or
+ % more master bibliographic style (mbs) files, listed above.
+ %
+ % This generated file can be redistributed and/or modified under the terms
+ % of the LaTeX Project Public License Distributed from CTAN
+ % archives in directory macros/latex/base/lppl.txt; either
+ % version 1 of the License, or any later version.
+ % ===============================================================
+ % Name and version information of the main mbs file:
+ % \ProvidesFile{merlin.mbs}[2002/10/21 4.05 (PWD, AO, DPC)]
+ % For use with BibTeX version 0.99a or later
+ %-------------------------------------------------------------------
+ % This bibliography style file is intended for texts in ENGLISH
+ % This is an author-year citation style bibliography. As such, it is
+ % non-standard LaTeX, and requires a special package file to function properly.
+ % Such a package is natbib.sty by Patrick W. Daly
+ % The form of the \bibitem entries is
+ % \bibitem[Jones et al.(1990)]{key}...
+ % \bibitem[Jones et al.(1990)Jones, Baker, and Smith]{key}...
+ % The essential feature is that the label (the part in brackets) consists
+ % of the author names, as they should appear in the citation, with the year
+ % in parentheses following. There must be no space before the opening
+ % parenthesis!
+ % With natbib v5.3, a full list of authors may also follow the year.
+ % In natbib.sty, it is possible to define the type of enclosures that is
+ % really wanted (brackets or parentheses), but in either case, there must
+ % be parentheses in the label.
+ % The \cite command functions as follows:
+ % \citet{key} ==>> Jones et al. (1990)
+ % \citet*{key} ==>> Jones, Baker, and Smith (1990)
+ % \citep{key} ==>> (Jones et al., 1990)
+ % \citep*{key} ==>> (Jones, Baker, and Smith, 1990)
+ % \citep[chap. 2]{key} ==>> (Jones et al., 1990, chap. 2)
+ % \citep[e.g.][]{key} ==>> (e.g. Jones et al., 1990)
+ % \citep[e.g.][p. 32]{key} ==>> (e.g. Jones et al., p. 32)
+ % \citeauthor{key} ==>> Jones et al.
+ % \citeauthor*{key} ==>> Jones, Baker, and Smith
+ % \citeyear{key} ==>> 1990
+ %---------------------------------------------------------------------
+
+ ENTRY
+ { address
+ author
+ booktitle
+ chapter
+ edition
+ editor
+ howpublished
+ institution
+ journal
+ key
+ month
+ note
+ number
+ organization
+ pages
+ publisher
+ school
+ series
+ title
+ type
+ volume
+ year
+ eprint % urlbst
+ doi % urlbst
+ pubmed % urlbst
+ url % urlbst
+ lastchecked % urlbst
+ }
+ {}
+ { label extra.label sort.label short.list }
+ INTEGERS { output.state before.all mid.sentence after.sentence after.block }
231
+ % urlbst...
232
+ % urlbst constants and state variables
233
+ STRINGS { urlintro
234
+ eprinturl eprintprefix doiprefix doiurl pubmedprefix pubmedurl
235
+ citedstring onlinestring linktextstring
236
+ openinlinelink closeinlinelink }
237
+ INTEGERS { hrefform inlinelinks makeinlinelink
238
+ addeprints adddoiresolver addpubmedresolver }
239
+ FUNCTION {init.urlbst.variables}
240
+ {
241
+ % The following constants may be adjusted by hand, if desired
242
+
243
+ % The first set allow you to enable or disable certain functionality.
244
+ #1 'addeprints := % 0=no eprints; 1=include eprints
245
+ #1 'adddoiresolver := % 0=no DOI resolver; 1=include it
246
+ #1 'addpubmedresolver := % 0=no PUBMED resolver; 1=include it
247
+ #2 'hrefform := % 0=no crossrefs; 1=hypertex xrefs; 2=hyperref refs
248
+ #1 'inlinelinks := % 0=URLs explicit; 1=URLs attached to titles
249
+
250
+ % String constants, which you _might_ want to tweak.
251
+ "URL: " 'urlintro := % prefix before URL; typically "Available from:" or "URL":
252
+ "online" 'onlinestring := % indication that resource is online; typically "online"
253
+ "cited " 'citedstring := % indicator of citation date; typically "cited "
254
+ "[link]" 'linktextstring := % dummy link text; typically "[link]"
255
+ "http://arxiv.org/abs/" 'eprinturl := % prefix to make URL from eprint ref
256
+ "arXiv:" 'eprintprefix := % text prefix printed before eprint ref; typically "arXiv:"
257
+ "https://doi.org/" 'doiurl := % prefix to make URL from DOI
258
+ "doi:" 'doiprefix := % text prefix printed before DOI ref; typically "doi:"
259
+ "http://www.ncbi.nlm.nih.gov/pubmed/" 'pubmedurl := % prefix to make URL from PUBMED
260
+ "PMID:" 'pubmedprefix := % text prefix printed before PUBMED ref; typically "PMID:"
261
+
262
+ % The following are internal state variables, not configuration constants,
263
+ % so they shouldn't be fiddled with.
264
+ #0 'makeinlinelink := % state variable managed by possibly.setup.inlinelink
265
+ "" 'openinlinelink := % ditto
266
+ "" 'closeinlinelink := % ditto
267
+ }
268
+ INTEGERS {
269
+ bracket.state
270
+ outside.brackets
271
+ open.brackets
272
+ within.brackets
273
+ close.brackets
274
+ }
275
+ % ...urlbst to here
276
+ FUNCTION {init.state.consts}
277
+ { #0 'outside.brackets := % urlbst...
278
+ #1 'open.brackets :=
279
+ #2 'within.brackets :=
280
+ #3 'close.brackets := % ...urlbst to here
281
+
282
+ #0 'before.all :=
283
+ #1 'mid.sentence :=
284
+ #2 'after.sentence :=
285
+ #3 'after.block :=
286
+ }
287
+ STRINGS { s t}
288
+ % urlbst
289
+ FUNCTION {output.nonnull.original}
290
+ { 's :=
291
+ output.state mid.sentence =
292
+ { ", " * write$ }
293
+ { output.state after.block =
294
+ { add.period$ write$
295
+ newline$
296
+ "\newblock " write$
297
+ }
298
+ { output.state before.all =
299
+ 'write$
300
+ { add.period$ " " * write$ }
301
+ if$
302
+ }
303
+ if$
304
+ mid.sentence 'output.state :=
305
+ }
306
+ if$
307
+ s
308
+ }
309
+
+ % urlbst...
+ % The following three functions are for handling inlinelink. They wrap
+ % a block of text which is potentially output with write$ by multiple
+ % other functions, so we don't know the content a priori.
+ % They communicate between each other using the variables makeinlinelink
+ % (which is true if a link should be made), and closeinlinelink (which holds
+ % the string which should close any current link. They can be called
+ % at any time, but start.inlinelink will be a no-op unless something has
+ % previously set makeinlinelink true, and the two ...end.inlinelink functions
+ % will only do their stuff if start.inlinelink has previously set
+ % closeinlinelink to be non-empty.
+ % (thanks to 'ijvm' for suggested code here)
+ FUNCTION {uand}
+ { 'skip$ { pop$ #0 } if$ } % 'and' (which isn't defined at this point in the file)
+ FUNCTION {possibly.setup.inlinelink}
+ { makeinlinelink hrefform #0 > uand
+ { doi empty$ adddoiresolver uand
+ { pubmed empty$ addpubmedresolver uand
+ { eprint empty$ addeprints uand
+ { url empty$
+ { "" }
+ { url }
+ if$ }
+ { eprinturl eprint * }
+ if$ }
+ { pubmedurl pubmed * }
+ if$ }
+ { doiurl doi * }
+ if$
+ % an appropriately-formatted URL is now on the stack
+ hrefform #1 = % hypertex
+ { "\special {html:<a href=" quote$ * swap$ * quote$ * "> }{" * 'openinlinelink :=
+ "\special {html:</a>}" 'closeinlinelink := }
+ { "\href {" swap$ * "} {" * 'openinlinelink := % hrefform=#2 -- hyperref
+ % the space between "} {" matters: a URL of just the right length can cause "\% newline em"
+ "}" 'closeinlinelink := }
+ if$
+ #0 'makeinlinelink :=
+ }
+ 'skip$
+ if$ % makeinlinelink
+ }
+ FUNCTION {add.inlinelink}
+ { openinlinelink empty$
+ 'skip$
+ { openinlinelink swap$ * closeinlinelink *
+ "" 'openinlinelink :=
+ }
+ if$
+ }
+ FUNCTION {output.nonnull}
+ { % Save the thing we've been asked to output
+ 's :=
+ % If the bracket-state is close.brackets, then add a close-bracket to
+ % what is currently at the top of the stack, and set bracket.state
+ % to outside.brackets
+ bracket.state close.brackets =
+ { "]" *
+ outside.brackets 'bracket.state :=
+ }
+ 'skip$
+ if$
+ bracket.state outside.brackets =
+ { % We're outside all brackets -- this is the normal situation.
+ % Write out what's currently at the top of the stack, using the
+ % original output.nonnull function.
+ s
+ add.inlinelink
+ output.nonnull.original % invoke the original output.nonnull
+ }
+ { % Still in brackets. Add open-bracket or (continuation) comma, add the
+ % new text (in s) to the top of the stack, and move to the close-brackets
+ % state, ready for next time (unless inbrackets resets it). If we come
+ % into this branch, then output.state is carefully undisturbed.
+ bracket.state open.brackets =
+ { " [" * }
+ { ", " * } % bracket.state will be within.brackets
+ if$
+ s *
+ close.brackets 'bracket.state :=
+ }
+ if$
+ }
+
+ % Call this function just before adding something which should be presented in
+ % brackets. bracket.state is handled specially within output.nonnull.
+ FUNCTION {inbrackets}
+ { bracket.state close.brackets =
+ { within.brackets 'bracket.state := } % reset the state: not open nor closed
+ { open.brackets 'bracket.state := }
+ if$
+ }
+
+ FUNCTION {format.lastchecked}
+ { lastchecked empty$
+ { "" }
+ { inbrackets citedstring lastchecked * }
+ if$
+ }
+ % ...urlbst to here
+ FUNCTION {output}
+ { duplicate$ empty$
+ 'pop$
+ 'output.nonnull
+ if$
+ }
+ FUNCTION {output.check}
+ { 't :=
+ duplicate$ empty$
+ { pop$ "empty " t * " in " * cite$ * warning$ }
+ 'output.nonnull
+ if$
+ }
+ FUNCTION {fin.entry.original} % urlbst (renamed from fin.entry, so it can be wrapped below)
+ { add.period$
+ write$
+ newline$
+ }
+
+ FUNCTION {new.block}
+ { output.state before.all =
+ 'skip$
+ { after.block 'output.state := }
+ if$
+ }
+ FUNCTION {new.sentence}
+ { output.state after.block =
+ 'skip$
+ { output.state before.all =
+ 'skip$
+ { after.sentence 'output.state := }
+ if$
+ }
+ if$
+ }
+ FUNCTION {add.blank}
+ { " " * before.all 'output.state :=
+ }
+
+ FUNCTION {date.block}
+ {
+ new.block
+ }
+
+ FUNCTION {not}
+ { { #0 }
+ { #1 }
+ if$
+ }
+ FUNCTION {and}
+ { 'skip$
+ { pop$ #0 }
+ if$
+ }
+ FUNCTION {or}
+ { { pop$ #1 }
+ 'skip$
+ if$
+ }
+ FUNCTION {new.block.checkb}
+ { empty$
+ swap$ empty$
+ and
+ 'skip$
+ 'new.block
+ if$
+ }
+ FUNCTION {field.or.null}
+ { duplicate$ empty$
+ { pop$ "" }
+ 'skip$
+ if$
+ }
+ FUNCTION {emphasize}
+ { duplicate$ empty$
+ { pop$ "" }
+ { "\emph{" swap$ * "}" * }
+ if$
+ }
+ FUNCTION {tie.or.space.prefix}
+ { duplicate$ text.length$ #3 <
+ { "~" }
+ { " " }
+ if$
+ swap$
+ }
+
+ FUNCTION {capitalize}
+ { "u" change.case$ "t" change.case$ }
+
+ FUNCTION {space.word}
+ { " " swap$ * " " * }
+ % Here are the language-specific definitions for explicit words.
+ % Each function has a name bbl.xxx where xxx is the English word.
+ % The language selected here is ENGLISH
+ FUNCTION {bbl.and}
+ { "and"}
+
+ FUNCTION {bbl.etal}
+ { "et~al." }
+
+ FUNCTION {bbl.editors}
+ { "editors" }
+
+ FUNCTION {bbl.editor}
+ { "editor" }
+
+ FUNCTION {bbl.edby}
+ { "edited by" }
+
+ FUNCTION {bbl.edition}
+ { "edition" }
+
+ FUNCTION {bbl.volume}
+ { "volume" }
+
+ FUNCTION {bbl.of}
+ { "of" }
+
+ FUNCTION {bbl.number}
+ { "number" }
+
+ FUNCTION {bbl.nr}
+ { "no." }
+
+ FUNCTION {bbl.in}
+ { "in" }
+
+ FUNCTION {bbl.pages}
+ { "pages" }
+
+ FUNCTION {bbl.page}
+ { "page" }
+
+ FUNCTION {bbl.chapter}
+ { "chapter" }
+
+ FUNCTION {bbl.techrep}
+ { "Technical Report" }
+
+ FUNCTION {bbl.mthesis}
+ { "Master's thesis" }
+
+ FUNCTION {bbl.phdthesis}
+ { "Ph.D. thesis" }
+
+ MACRO {jan} {"January"}
+
+ MACRO {feb} {"February"}
+
+ MACRO {mar} {"March"}
+
+ MACRO {apr} {"April"}
+
+ MACRO {may} {"May"}
+
+ MACRO {jun} {"June"}
+
+ MACRO {jul} {"July"}
+
+ MACRO {aug} {"August"}
+
+ MACRO {sep} {"September"}
+
+ MACRO {oct} {"October"}
+
+ MACRO {nov} {"November"}
+
+ MACRO {dec} {"December"}
+
+ MACRO {acmcs} {"ACM Computing Surveys"}
+
+ MACRO {acta} {"Acta Informatica"}
+
+ MACRO {cacm} {"Communications of the ACM"}
+
+ MACRO {ibmjrd} {"IBM Journal of Research and Development"}
+
+ MACRO {ibmsj} {"IBM Systems Journal"}
+
+ MACRO {ieeese} {"IEEE Transactions on Software Engineering"}
+
+ MACRO {ieeetc} {"IEEE Transactions on Computers"}
+
+ MACRO {ieeetcad}
+ {"IEEE Transactions on Computer-Aided Design of Integrated Circuits"}
+
+ MACRO {ipl} {"Information Processing Letters"}
+
+ MACRO {jacm} {"Journal of the ACM"}
+
+ MACRO {jcss} {"Journal of Computer and System Sciences"}
+
+ MACRO {scp} {"Science of Computer Programming"}
+
+ MACRO {sicomp} {"SIAM Journal on Computing"}
+
+ MACRO {tocs} {"ACM Transactions on Computer Systems"}
+
+ MACRO {tods} {"ACM Transactions on Database Systems"}
+
+ MACRO {tog} {"ACM Transactions on Graphics"}
+
+ MACRO {toms} {"ACM Transactions on Mathematical Software"}
+
+ MACRO {toois} {"ACM Transactions on Office Information Systems"}
+
+ MACRO {toplas} {"ACM Transactions on Programming Languages and Systems"}
+
+ MACRO {tcs} {"Theoretical Computer Science"}
+ FUNCTION {bibinfo.check}
+ { swap$
+ duplicate$ missing$
+ {
+ pop$ pop$
+ ""
+ }
+ { duplicate$ empty$
+ {
+ swap$ pop$
+ }
+ { swap$
+ pop$
+ }
+ if$
+ }
+ if$
+ }
+ FUNCTION {bibinfo.warn}
+ { swap$
+ duplicate$ missing$
+ {
+ swap$ "missing " swap$ * " in " * cite$ * warning$ pop$
+ ""
+ }
+ { duplicate$ empty$
+ {
+ swap$ "empty " swap$ * " in " * cite$ * warning$
+ }
+ { swap$
+ pop$
+ }
+ if$
+ }
+ if$
+ }
+ STRINGS { bibinfo}
+ INTEGERS { nameptr namesleft numnames }
+
+ FUNCTION {format.names}
+ { 'bibinfo :=
+ duplicate$ empty$ 'skip$ {
+ 's :=
+ "" 't :=
+ #1 'nameptr :=
+ s num.names$ 'numnames :=
+ numnames 'namesleft :=
+ { namesleft #0 > }
+ { s nameptr
+ duplicate$ #1 >
+ { "{ff~}{vv~}{ll}{, jj}" }
+ { "{ff~}{vv~}{ll}{, jj}" } % first name first for first author
+ % { "{vv~}{ll}{, ff}{, jj}" } % last name first for first author
+ if$
+ format.name$
+ bibinfo bibinfo.check
+ 't :=
+ nameptr #1 >
+ {
+ namesleft #1 >
+ { ", " * t * }
+ {
+ numnames #2 >
+ { "," * }
+ 'skip$
+ if$
+ s nameptr "{ll}" format.name$ duplicate$ "others" =
+ { 't := }
+ { pop$ }
+ if$
+ t "others" =
+ {
+ " " * bbl.etal *
+ }
+ {
+ bbl.and
+ space.word * t *
+ }
+ if$
+ }
+ if$
+ }
+ 't
+ if$
+ nameptr #1 + 'nameptr :=
+ namesleft #1 - 'namesleft :=
+ }
+ while$
+ } if$
+ }
+ FUNCTION {format.names.ed}
+ {
+ 'bibinfo :=
+ duplicate$ empty$ 'skip$ {
+ 's :=
+ "" 't :=
+ #1 'nameptr :=
+ s num.names$ 'numnames :=
+ numnames 'namesleft :=
+ { namesleft #0 > }
+ { s nameptr
+ "{ff~}{vv~}{ll}{, jj}"
+ format.name$
+ bibinfo bibinfo.check
+ 't :=
+ nameptr #1 >
+ {
+ namesleft #1 >
+ { ", " * t * }
+ {
+ numnames #2 >
+ { "," * }
+ 'skip$
+ if$
+ s nameptr "{ll}" format.name$ duplicate$ "others" =
+ { 't := }
+ { pop$ }
+ if$
+ t "others" =
+ {
+
+ " " * bbl.etal *
+ }
+ {
+ bbl.and
+ space.word * t *
+ }
+ if$
+ }
+ if$
+ }
+ 't
+ if$
+ nameptr #1 + 'nameptr :=
+ namesleft #1 - 'namesleft :=
+ }
+ while$
+ } if$
+ }
+ FUNCTION {format.key}
+ { empty$
+ { key field.or.null }
+ { "" }
+ if$
+ }
+
+ FUNCTION {format.authors}
+ { author "author" format.names
+ }
+ FUNCTION {get.bbl.editor}
+ { editor num.names$ #1 > 'bbl.editors 'bbl.editor if$ }
+
+ FUNCTION {format.editors}
+ { editor "editor" format.names duplicate$ empty$ 'skip$
+ {
+ "," *
+ " " *
+ get.bbl.editor
+ *
+ }
+ if$
+ }
+ FUNCTION {format.note}
+ {
+ note empty$
+ { "" }
+ { note #1 #1 substring$
+ duplicate$ "{" =
+ 'skip$
+ { output.state mid.sentence =
+ { "l" }
+ { "u" }
+ if$
+ change.case$
+ }
+ if$
+ note #2 global.max$ substring$ * "note" bibinfo.check
+ }
+ if$
+ }
+
+ FUNCTION {format.title}
+ { title
+ duplicate$ empty$ 'skip$
+ { "t" change.case$ }
+ if$
+ "title" bibinfo.check
+ }
+ FUNCTION {format.full.names}
+ {'s :=
+ "" 't :=
+ #1 'nameptr :=
+ s num.names$ 'numnames :=
+ numnames 'namesleft :=
+ { namesleft #0 > }
+ { s nameptr
+ "{vv~}{ll}" format.name$
+ 't :=
+ nameptr #1 >
+ {
+ namesleft #1 >
+ { ", " * t * }
+ {
+ s nameptr "{ll}" format.name$ duplicate$ "others" =
+ { 't := }
+ { pop$ }
+ if$
+ t "others" =
+ {
+ " " * bbl.etal *
+ }
+ {
+ numnames #2 >
+ { "," * }
+ 'skip$
+ if$
+ bbl.and
+ space.word * t *
+ }
+ if$
+ }
+ if$
+ }
+ 't
+ if$
+ nameptr #1 + 'nameptr :=
+ namesleft #1 - 'namesleft :=
+ }
+ while$
+ }
+
+ FUNCTION {author.editor.key.full}
+ { author empty$
+ { editor empty$
+ { key empty$
+ { cite$ #1 #3 substring$ }
+ 'key
+ if$
+ }
+ { editor format.full.names }
+ if$
+ }
+ { author format.full.names }
+ if$
+ }
+
+ FUNCTION {author.key.full}
+ { author empty$
+ { key empty$
+ { cite$ #1 #3 substring$ }
+ 'key
+ if$
+ }
+ { author format.full.names }
+ if$
+ }
+
+ FUNCTION {editor.key.full}
+ { editor empty$
+ { key empty$
+ { cite$ #1 #3 substring$ }
+ 'key
+ if$
+ }
+ { editor format.full.names }
+ if$
+ }
+
+ FUNCTION {make.full.names}
+ { type$ "book" =
+ type$ "inbook" =
+ or
+ 'author.editor.key.full
+ { type$ "proceedings" =
+ 'editor.key.full
+ 'author.key.full
+ if$
+ }
+ if$
+ }
+
+ FUNCTION {output.bibitem.original} % urlbst (renamed from output.bibitem, so it can be wrapped below)
+ { newline$
+ "\bibitem[{" write$
+ label write$
+ ")" make.full.names duplicate$ short.list =
+ { pop$ }
+ { * }
+ if$
+ "}]{" * write$
+ cite$ write$
+ "}" write$
+ newline$
+ ""
+ before.all 'output.state :=
+ }
+
+ FUNCTION {n.dashify}
+ {
+ 't :=
+ ""
+ { t empty$ not }
+ { t #1 #1 substring$ "-" =
+ { t #1 #2 substring$ "--" = not
+ { "--" *
+ t #2 global.max$ substring$ 't :=
+ }
+ { { t #1 #1 substring$ "-" = }
+ { "-" *
+ t #2 global.max$ substring$ 't :=
+ }
+ while$
+ }
+ if$
+ }
+ { t #1 #1 substring$ *
+ t #2 global.max$ substring$ 't :=
+ }
+ if$
+ }
+ while$
+ }
+
+ FUNCTION {word.in}
+ { bbl.in capitalize
+ " " * }
+
+ FUNCTION {format.date}
+ { year "year" bibinfo.check duplicate$ empty$
+ {
+ }
+ 'skip$
+ if$
+ extra.label *
+ before.all 'output.state :=
+ after.sentence 'output.state :=
+ }
+ FUNCTION {format.btitle}
+ { title "title" bibinfo.check
+ duplicate$ empty$ 'skip$
+ {
+ emphasize
+ }
+ if$
+ }
+ FUNCTION {either.or.check}
+ { empty$
+ 'pop$
+ { "can't use both " swap$ * " fields in " * cite$ * warning$ }
+ if$
+ }
+ FUNCTION {format.bvolume}
+ { volume empty$
+ { "" }
+ { bbl.volume volume tie.or.space.prefix
+ "volume" bibinfo.check * *
+ series "series" bibinfo.check
+ duplicate$ empty$ 'pop$
+ { swap$ bbl.of space.word * swap$
+ emphasize * }
+ if$
+ "volume and number" number either.or.check
+ }
+ if$
+ }
+ FUNCTION {format.number.series}
+ { volume empty$
+ { number empty$
+ { series field.or.null }
+ { series empty$
+ { number "number" bibinfo.check }
+ { output.state mid.sentence =
+ { bbl.number }
+ { bbl.number capitalize }
+ if$
+ number tie.or.space.prefix "number" bibinfo.check * *
+ bbl.in space.word *
+ series "series" bibinfo.check *
+ }
+ if$
+ }
+ if$
+ }
+ { "" }
+ if$
+ }
+
+ FUNCTION {format.edition}
+ { edition duplicate$ empty$ 'skip$
+ {
+ output.state mid.sentence =
+ { "l" }
+ { "t" }
+ if$ change.case$
+ "edition" bibinfo.check
+ " " * bbl.edition *
+ }
+ if$
+ }
+ INTEGERS { multiresult }
+ FUNCTION {multi.page.check}
+ { 't :=
+ #0 'multiresult :=
+ { multiresult not
+ t empty$ not
+ and
+ }
+ { t #1 #1 substring$
+ duplicate$ "-" =
+ swap$ duplicate$ "," =
+ swap$ "+" =
+ or or
+ { #1 'multiresult := }
+ { t #2 global.max$ substring$ 't := }
+ if$
+ }
+ while$
+ multiresult
+ }
+ FUNCTION {format.pages}
+ { pages duplicate$ empty$ 'skip$
+ { duplicate$ multi.page.check
+ {
+ bbl.pages swap$
+ n.dashify
+ }
+ {
+ bbl.page swap$
+ }
+ if$
+ tie.or.space.prefix
+ "pages" bibinfo.check
+ * *
+ }
+ if$
+ }
+ FUNCTION {format.journal.pages}
+ { pages duplicate$ empty$ 'pop$
+ { swap$ duplicate$ empty$
+ { pop$ pop$ format.pages }
+ {
+ ":" *
+ swap$
+ n.dashify
+ "pages" bibinfo.check
+ *
+ }
+ if$
+ }
+ if$
+ }
+ FUNCTION {format.vol.num.pages}
+ { volume field.or.null
+ duplicate$ empty$ 'skip$
+ {
+ "volume" bibinfo.check
+ }
+ if$
+ number "number" bibinfo.check duplicate$ empty$ 'skip$
+ {
+ swap$ duplicate$ empty$
+ { "there's a number but no volume in " cite$ * warning$ }
+ 'skip$
+ if$
+ swap$
+ "(" swap$ * ")" *
+ }
+ if$ *
+ format.journal.pages
+ }
+
+ FUNCTION {format.chapter}
+ { chapter empty$
+ 'format.pages
+ { type empty$
+ { bbl.chapter }
+ { type "l" change.case$
+ "type" bibinfo.check
+ }
+ if$
+ chapter tie.or.space.prefix
+ "chapter" bibinfo.check
+ * *
+ }
+ if$
+ }
+
+ FUNCTION {format.chapter.pages}
+ { chapter empty$
+ 'format.pages
+ { type empty$
+ { bbl.chapter }
+ { type "l" change.case$
+ "type" bibinfo.check
+ }
+ if$
+ chapter tie.or.space.prefix
+ "chapter" bibinfo.check
+ * *
+ pages empty$
+ 'skip$
+ { ", " * format.pages * }
+ if$
+ }
+ if$
+ }
+
+ FUNCTION {format.booktitle}
+ {
+ booktitle "booktitle" bibinfo.check
+ emphasize
+ }
+ FUNCTION {format.in.booktitle}
+ { format.booktitle duplicate$ empty$ 'skip$
+ {
+ word.in swap$ *
+ }
+ if$
+ }
+ FUNCTION {format.in.ed.booktitle}
+ { format.booktitle duplicate$ empty$ 'skip$
+ {
+ editor "editor" format.names.ed duplicate$ empty$ 'pop$
+ {
+ "," *
+ " " *
+ get.bbl.editor
+ ", " *
+ * swap$
+ * }
+ if$
+ word.in swap$ *
+ }
+ if$
+ }
+ FUNCTION {format.thesis.type}
+ { type duplicate$ empty$
+ 'pop$
+ { swap$ pop$
+ "t" change.case$ "type" bibinfo.check
+ }
+ if$
+ }
+ FUNCTION {format.tr.number}
+ { number "number" bibinfo.check
+ type duplicate$ empty$
+ { pop$ bbl.techrep }
+ 'skip$
+ if$
+ "type" bibinfo.check
+ swap$ duplicate$ empty$
+ { pop$ "t" change.case$ }
+ { tie.or.space.prefix * * }
+ if$
+ }
+ FUNCTION {format.article.crossref}
+ {
+ word.in
+ " \cite{" * crossref * "}" *
+ }
+ FUNCTION {format.book.crossref}
+ { volume duplicate$ empty$
+ { "empty volume in " cite$ * "'s crossref of " * crossref * warning$
+ pop$ word.in
+ }
+ { bbl.volume
+ capitalize
+ swap$ tie.or.space.prefix "volume" bibinfo.check * * bbl.of space.word *
+ }
+ if$
+ " \cite{" * crossref * "}" *
+ }
+ FUNCTION {format.incoll.inproc.crossref}
+ {
+ word.in
+ " \cite{" * crossref * "}" *
+ }
+ FUNCTION {format.org.or.pub}
+ { 't :=
+ ""
+ address empty$ t empty$ and
+ 'skip$
+ {
+ t empty$
+ { address "address" bibinfo.check *
+ }
+ { t *
+ address empty$
+ 'skip$
+ { ", " * address "address" bibinfo.check * }
+ if$
+ }
+ if$
+ }
+ if$
+ }
+ FUNCTION {format.publisher.address}
+ { publisher "publisher" bibinfo.warn format.org.or.pub
+ }
+
+ FUNCTION {format.organization.address}
+ { organization "organization" bibinfo.check format.org.or.pub
+ }
+
+ % urlbst...
+ % Functions for making hypertext links.
+ % In all cases, the stack has (link-text href-url)
+ %
+ % make 'null' specials
+ FUNCTION {make.href.null}
+ {
+ pop$
+ }
+ % make hypertex specials
+ FUNCTION {make.href.hypertex}
+ {
+ "\special {html:<a href=" quote$ *
+ swap$ * quote$ * "> }" * swap$ *
+ "\special {html:</a>}" *
+ }
+ % make hyperref specials
+ FUNCTION {make.href.hyperref}
+ {
+ "\href {" swap$ * "} {\path{" * swap$ * "}}" *
+ }
+ FUNCTION {make.href}
+ { hrefform #2 =
+ 'make.href.hyperref % hrefform = 2
+ { hrefform #1 =
+ 'make.href.hypertex % hrefform = 1
+ 'make.href.null % hrefform = 0 (or anything else)
+ if$
+ }
+ if$
+ }
+
+ % If inlinelinks is true, then format.url should be a no-op, since it's
+ % (a) redundant, and (b) could end up as a link-within-a-link.
+ FUNCTION {format.url}
+ { inlinelinks #1 = url empty$ or
+ { "" }
+ { hrefform #1 =
+ { % special case -- add HyperTeX specials
+ urlintro "\url{" url * "}" * url make.href.hypertex * }
+ { urlintro "\url{" * url * "}" * }
+ if$
+ }
+ if$
+ }
+
1270
+ FUNCTION {format.eprint}
1271
+ { eprint empty$
1272
+ { "" }
1273
+ { eprintprefix eprint * eprinturl eprint * make.href }
1274
+ if$
1275
+ }
1276
+
1277
+ FUNCTION {format.doi}
1278
+ { doi empty$
1279
+ { "" }
1280
+ { doiprefix doi * doiurl doi * make.href }
1281
+ if$
1282
+ }
1283
+
1284
+ FUNCTION {format.pubmed}
1285
+ { pubmed empty$
1286
+ { "" }
1287
+ { pubmedprefix pubmed * pubmedurl pubmed * make.href }
1288
+ if$
1289
+ }
1290
+
1291
+ % Output a URL. We can't use the more normal idiom (something like
1292
+ % `format.url output'), because the `inbrackets' within
1293
+ % format.lastchecked applies to everything between calls to `output',
1294
+ % so that `format.url format.lastchecked * output' ends up with both
1295
+ % the URL and the lastchecked in brackets.
1296
+ FUNCTION {output.url}
1297
+ { url empty$
1298
+ 'skip$
1299
+ { new.block
1300
+ format.url output
1301
+ format.lastchecked output
1302
+ }
1303
+ if$
1304
+ }
1305
+
1306
+ FUNCTION {output.web.refs}
1307
+ {
1308
+ new.block
1309
+ inlinelinks
1310
+ 'skip$ % links were inline -- don't repeat them
1311
+ {
1312
+ output.url
1313
+ addeprints eprint empty$ not and
1314
+ { format.eprint output.nonnull }
1315
+ 'skip$
1316
+ if$
1317
+ adddoiresolver doi empty$ not and
1318
+ { format.doi output.nonnull }
1319
+ 'skip$
1320
+ if$
1321
+ addpubmedresolver pubmed empty$ not and
1322
+ { format.pubmed output.nonnull }
1323
+ 'skip$
1324
+ if$
1325
+ }
1326
+ if$
1327
+ }
1328
+
1329
+ % Wrapper for output.bibitem.original.
1330
+ % If the URL field is not empty, set makeinlinelink to be true,
1331
+ % so that an inline link will be started at the next opportunity
1332
+ FUNCTION {output.bibitem}
1333
+ { outside.brackets 'bracket.state :=
1334
+ output.bibitem.original
1335
+ inlinelinks url empty$ not doi empty$ not or pubmed empty$ not or eprint empty$ not or and
1336
+ { #1 'makeinlinelink := }
1337
+ { #0 'makeinlinelink := }
1338
+ if$
1339
+ }
1340
+
1341
+ % Wrapper for fin.entry.original
1342
+ FUNCTION {fin.entry}
1343
+ { output.web.refs % urlbst
1344
+ makeinlinelink % ooops, it appears we didn't have a title for inlinelink
1345
+ { possibly.setup.inlinelink % add some artificial link text here, as a fallback
1346
+ linktextstring output.nonnull }
1347
+ 'skip$
1348
+ if$
1349
+ bracket.state close.brackets = % urlbst
1350
+ { "]" * }
1351
+ 'skip$
1352
+ if$
1353
+ fin.entry.original
1354
+ }
1355
+
1356
+ % Webpage entry type.
1357
+ % Title and url fields required;
1358
+ % author, note, year, month, and lastchecked fields optional
1359
+ % See references
1360
+ % ISO 690-2 http://www.nlc-bnc.ca/iso/tc46sc9/standard/690-2e.htm
1361
+ % http://www.classroom.net/classroom/CitingNetResources.html
1362
+ % http://neal.ctstateu.edu/history/cite.html
1363
+ % http://www.cas.usf.edu/english/walker/mla.html
1364
+ % for citation formats for web pages.
1365
+ FUNCTION {webpage}
1366
+ { output.bibitem
1367
+ author empty$
1368
+ { editor empty$
1369
+ 'skip$ % author and editor both optional
1370
+ { format.editors output.nonnull }
1371
+ if$
1372
+ }
1373
+ { editor empty$
1374
+ { format.authors output.nonnull }
1375
+ { "can't use both author and editor fields in " cite$ * warning$ }
1376
+ if$
1377
+ }
1378
+ if$
1379
+ new.block
1380
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$
1381
+ format.title "title" output.check
1382
+ inbrackets onlinestring output
1383
+ new.block
1384
+ year empty$
1385
+ 'skip$
1386
+ { format.date "year" output.check }
1387
+ if$
1388
+ % We don't need to output the URL details ('lastchecked' and 'url'),
1389
+ % because fin.entry does that for us, using output.web.refs. The only
1390
+ % reason we would want to put them here is if we were to decide that
1391
+ % they should go in front of the rather miscellaneous information in 'note'.
1392
+ new.block
1393
+ note output
1394
+ fin.entry
1395
+ }
1396
+ % ...urlbst to here
1397
+
1398
+
1399
+ FUNCTION {article}
1400
+ { output.bibitem
1401
+ format.authors "author" output.check
+ author format.key output
+ format.date "year" output.check
+ date.block
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
+ format.title "title" output.check
+ new.block
+ crossref missing$
+ {
+ journal
+ "journal" bibinfo.check
+ emphasize
+ "journal" output.check
+ possibly.setup.inlinelink format.vol.num.pages output% urlbst
+ }
+ { format.article.crossref output.nonnull
+ format.pages output
+ }
+ if$
+ new.block
+ format.note output
+ fin.entry
+ }
+ FUNCTION {book}
+ { output.bibitem
+ author empty$
+ { format.editors "author and editor" output.check
+ editor format.key output
+ }
+ { format.authors output.nonnull
+ crossref missing$
+ { "author and editor" editor either.or.check }
+ 'skip$
+ if$
+ }
+ if$
+ format.date "year" output.check
+ date.block
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
+ format.btitle "title" output.check
+ format.edition output
+ crossref missing$
+ { format.bvolume output
+ new.block
+ format.number.series output
+ new.sentence
+ format.publisher.address output
+ }
+ {
+ new.block
+ format.book.crossref output.nonnull
+ }
+ if$
+ new.block
+ format.note output
+ fin.entry
+ }
+ FUNCTION {booklet}
+ { output.bibitem
+ format.authors output
+ author format.key output
+ format.date "year" output.check
+ date.block
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
+ format.title "title" output.check
+ new.block
+ howpublished "howpublished" bibinfo.check output
+ address "address" bibinfo.check output
+ new.block
+ format.note output
+ fin.entry
+ }
+
+ FUNCTION {inbook}
+ { output.bibitem
+ author empty$
+ { format.editors "author and editor" output.check
+ editor format.key output
+ }
+ { format.authors output.nonnull
+ crossref missing$
+ { "author and editor" editor either.or.check }
+ 'skip$
+ if$
+ }
+ if$
+ format.date "year" output.check
+ date.block
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
+ format.btitle "title" output.check
+ format.edition output
+ crossref missing$
+ {
+ format.bvolume output
+ format.number.series output
+ format.chapter "chapter" output.check
+ new.sentence
+ format.publisher.address output
+ new.block
+ }
+ {
+ format.chapter "chapter" output.check
+ new.block
+ format.book.crossref output.nonnull
+ }
+ if$
+ new.block
+ format.note output
+ fin.entry
+ }
+
+ FUNCTION {incollection}
+ { output.bibitem
+ format.authors "author" output.check
+ author format.key output
+ format.date "year" output.check
+ date.block
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
+ format.title "title" output.check
+ new.block
+ crossref missing$
+ { format.in.ed.booktitle "booktitle" output.check
+ format.edition output
+ format.bvolume output
+ format.number.series output
+ format.chapter.pages output
+ new.sentence
+ format.publisher.address output
+ }
+ { format.incoll.inproc.crossref output.nonnull
+ format.chapter.pages output
+ }
+ if$
+ new.block
+ format.note output
+ fin.entry
+ }
+ FUNCTION {inproceedings}
+ { output.bibitem
+ format.authors "author" output.check
+ author format.key output
+ format.date "year" output.check
+ date.block
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
+ format.title "title" output.check
+ new.block
+ crossref missing$
+ { format.in.booktitle "booktitle" output.check
+ format.bvolume output
+ format.number.series output
+ format.pages output
+ address "address" bibinfo.check output
+ new.sentence
+ organization "organization" bibinfo.check output
+ publisher "publisher" bibinfo.check output
+ }
+ { format.incoll.inproc.crossref output.nonnull
+ format.pages output
+ }
+ if$
+ new.block
+ format.note output
+ fin.entry
+ }
+ FUNCTION {conference} { inproceedings }
+ FUNCTION {manual}
+ { output.bibitem
+ format.authors output
+ author format.key output
+ format.date "year" output.check
+ date.block
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
+ format.btitle "title" output.check
+ format.edition output
+ organization address new.block.checkb
+ organization "organization" bibinfo.check output
+ address "address" bibinfo.check output
+ new.block
+ format.note output
+ fin.entry
+ }
+
+ FUNCTION {mastersthesis}
+ { output.bibitem
+ format.authors "author" output.check
+ author format.key output
+ format.date "year" output.check
+ date.block
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
+ format.title
+ "title" output.check
+ new.block
+ bbl.mthesis format.thesis.type output.nonnull
+ school "school" bibinfo.warn output
+ address "address" bibinfo.check output
+ month "month" bibinfo.check output
+ new.block
+ format.note output
+ fin.entry
+ }
+
+ FUNCTION {misc}
+ { output.bibitem
+ format.authors output
+ author format.key output
+ format.date "year" output.check
+ date.block
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
+ format.title output
+ new.block
+ howpublished "howpublished" bibinfo.check output
+ new.block
+ format.note output
+ fin.entry
+ }
+ FUNCTION {phdthesis}
+ { output.bibitem
+ format.authors "author" output.check
+ author format.key output
+ format.date "year" output.check
+ date.block
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
+ format.btitle
+ "title" output.check
+ new.block
+ bbl.phdthesis format.thesis.type output.nonnull
+ school "school" bibinfo.warn output
+ address "address" bibinfo.check output
+ new.block
+ format.note output
+ fin.entry
+ }
+
+ FUNCTION {proceedings}
+ { output.bibitem
+ format.editors output
+ editor format.key output
+ format.date "year" output.check
+ date.block
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
+ format.btitle "title" output.check
+ format.bvolume output
+ format.number.series output
+ new.sentence
+ publisher empty$
+ { format.organization.address output }
+ { organization "organization" bibinfo.check output
+ new.sentence
+ format.publisher.address output
+ }
+ if$
+ new.block
+ format.note output
+ fin.entry
+ }
+
+ FUNCTION {techreport}
+ { output.bibitem
+ format.authors "author" output.check
+ author format.key output
+ format.date "year" output.check
+ date.block
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
+ format.title
+ "title" output.check
+ new.block
+ format.tr.number output.nonnull
+ institution "institution" bibinfo.warn output
+ address "address" bibinfo.check output
+ new.block
+ format.note output
+ fin.entry
+ }
+
+ FUNCTION {unpublished}
+ { output.bibitem
+ format.authors "author" output.check
+ author format.key output
+ format.date "year" output.check
+ date.block
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
+ format.title "title" output.check
+ new.block
+ format.note "note" output.check
+ fin.entry
+ }
+
+ FUNCTION {default.type} { misc }
+ READ
+ FUNCTION {sortify}
+ { purify$
+ "l" change.case$
+ }
+ INTEGERS { len }
+ FUNCTION {chop.word}
+ { 's :=
+ 'len :=
+ s #1 len substring$ =
+ { s len #1 + global.max$ substring$ }
+ 's
+ if$
+ }
+ FUNCTION {format.lab.names}
+ { 's :=
+ "" 't :=
+ s #1 "{vv~}{ll}" format.name$
+ s num.names$ duplicate$
+ #2 >
+ { pop$
+ " " * bbl.etal *
+ }
+ { #2 <
+ 'skip$
+ { s #2 "{ff }{vv }{ll}{ jj}" format.name$ "others" =
+ {
+ " " * bbl.etal *
+ }
+ { bbl.and space.word * s #2 "{vv~}{ll}" format.name$
+ * }
+ if$
+ }
+ if$
+ }
+ if$
+ }
+
+ FUNCTION {author.key.label}
+ { author empty$
+ { key empty$
+ { cite$ #1 #3 substring$ }
+ 'key
+ if$
+ }
+ { author format.lab.names }
+ if$
+ }
+
+ FUNCTION {author.editor.key.label}
+ { author empty$
+ { editor empty$
+ { key empty$
+ { cite$ #1 #3 substring$ }
+ 'key
+ if$
+ }
+ { editor format.lab.names }
+ if$
+ }
+ { author format.lab.names }
+ if$
+ }
+
+ FUNCTION {editor.key.label}
+ { editor empty$
+ { key empty$
+ { cite$ #1 #3 substring$ }
+ 'key
+ if$
+ }
+ { editor format.lab.names }
+ if$
+ }
+
+ FUNCTION {calc.short.authors}
+ { type$ "book" =
+ type$ "inbook" =
+ or
+ 'author.editor.key.label
+ { type$ "proceedings" =
+ 'editor.key.label
+ 'author.key.label
+ if$
+ }
+ if$
+ 'short.list :=
+ }
+
+ FUNCTION {calc.label}
+ { calc.short.authors
+ short.list
+ "("
+ *
+ year duplicate$ empty$
+ short.list key field.or.null = or
+ { pop$ "" }
+ 'skip$
+ if$
+ *
+ 'label :=
+ }
+
+ FUNCTION {sort.format.names}
+ { 's :=
+ #1 'nameptr :=
+ ""
+ s num.names$ 'numnames :=
+ numnames 'namesleft :=
+ { namesleft #0 > }
+ { s nameptr
+ "{ll{ }}{ ff{ }}{ jj{ }}"
+ format.name$ 't :=
+ nameptr #1 >
+ {
+ " " *
+ namesleft #1 = t "others" = and
+ { "zzzzz" * }
+ { t sortify * }
+ if$
+ }
+ { t sortify * }
+ if$
+ nameptr #1 + 'nameptr :=
+ namesleft #1 - 'namesleft :=
+ }
+ while$
+ }
+
+ FUNCTION {sort.format.title}
+ { 't :=
+ "A " #2
+ "An " #3
+ "The " #4 t chop.word
+ chop.word
+ chop.word
+ sortify
+ #1 global.max$ substring$
+ }
+ FUNCTION {author.sort}
+ { author empty$
+ { key empty$
+ { "to sort, need author or key in " cite$ * warning$
+ ""
+ }
+ { key sortify }
+ if$
+ }
+ { author sort.format.names }
+ if$
+ }
+ FUNCTION {author.editor.sort}
+ { author empty$
+ { editor empty$
+ { key empty$
+ { "to sort, need author, editor, or key in " cite$ * warning$
+ ""
+ }
+ { key sortify }
+ if$
+ }
+ { editor sort.format.names }
+ if$
+ }
+ { author sort.format.names }
+ if$
+ }
+ FUNCTION {editor.sort}
+ { editor empty$
+ { key empty$
+ { "to sort, need editor or key in " cite$ * warning$
+ ""
+ }
+ { key sortify }
+ if$
+ }
+ { editor sort.format.names }
+ if$
+ }
+ FUNCTION {presort}
+ { calc.label
+ label sortify
+ " "
+ *
+ type$ "book" =
+ type$ "inbook" =
+ or
+ 'author.editor.sort
+ { type$ "proceedings" =
+ 'editor.sort
+ 'author.sort
+ if$
+ }
+ if$
+ #1 entry.max$ substring$
+ 'sort.label :=
+ sort.label
+ *
+ " "
+ *
+ title field.or.null
+ sort.format.title
+ *
+ #1 entry.max$ substring$
+ 'sort.key$ :=
+ }
+
+ ITERATE {presort}
+ SORT
+ STRINGS { last.label next.extra }
+ INTEGERS { last.extra.num number.label }
+ FUNCTION {initialize.extra.label.stuff}
+ { #0 int.to.chr$ 'last.label :=
+ "" 'next.extra :=
+ #0 'last.extra.num :=
+ #0 'number.label :=
+ }
+ FUNCTION {forward.pass}
+ { last.label label =
+ { last.extra.num #1 + 'last.extra.num :=
+ last.extra.num int.to.chr$ 'extra.label :=
+ }
+ { "a" chr.to.int$ 'last.extra.num :=
+ "" 'extra.label :=
+ label 'last.label :=
+ }
+ if$
+ number.label #1 + 'number.label :=
+ }
+ FUNCTION {reverse.pass}
+ { next.extra "b" =
+ { "a" 'extra.label := }
+ 'skip$
+ if$
+ extra.label 'next.extra :=
+ extra.label
+ duplicate$ empty$
+ 'skip$
+ { year field.or.null #-1 #1 substring$ chr.to.int$ #65 <
+ { "{\natexlab{" swap$ * "}}" * }
+ { "{(\natexlab{" swap$ * "})}" * }
+ if$ }
+ if$
+ 'extra.label :=
+ label extra.label * 'label :=
+ }
+ EXECUTE {initialize.extra.label.stuff}
+ ITERATE {forward.pass}
+ REVERSE {reverse.pass}
+ FUNCTION {bib.sort.order}
+ { sort.label
+ " "
+ *
+ year field.or.null sortify
+ *
+ " "
+ *
+ title field.or.null
+ sort.format.title
+ *
+ #1 entry.max$ substring$
+ 'sort.key$ :=
+ }
+ ITERATE {bib.sort.order}
+ SORT
+ FUNCTION {begin.bib}
+ { preamble$ empty$
+ 'skip$
+ { preamble$ write$ newline$ }
+ if$
+ "\begin{thebibliography}{" number.label int.to.str$ * "}" *
+ write$ newline$
+ "\expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi"
+ write$ newline$
+ }
+ EXECUTE {begin.bib}
+ EXECUTE {init.urlbst.variables} % urlbst
+ EXECUTE {init.state.consts}
+ ITERATE {call.type$}
+ FUNCTION {end.bib}
+ { newline$
+ "\end{thebibliography}" write$ newline$
+ }
+ EXECUTE {end.bib}
+ %% End of customized bst file
+ %%
+ %% End of file `compling.bst'.
references/2019.arxiv.conneau/source/XLMR Paper/appendix.tex ADDED
@@ -0,0 +1,45 @@
+ \documentclass[11pt,a4paper]{article}
+ \usepackage[hyperref]{acl2020}
+ \usepackage{times}
+ \usepackage{latexsym}
+ \renewcommand{\UrlFont}{\ttfamily\small}
+
+ % This is not strictly necessary, and may be commented out,
+ % but it will improve the layout of the manuscript,
+ % and will typically save some space.
+ \usepackage{microtype}
+ \usepackage{graphicx}
+ \usepackage{subfigure}
+ \usepackage{booktabs} % for professional tables
+ \usepackage{url}
+ \usepackage{times}
+ \usepackage{latexsym}
+ \usepackage{array}
+ \usepackage{adjustbox}
+ \usepackage{multirow}
+ % \usepackage{subcaption}
+ \usepackage{hyperref}
+ \usepackage{longtable}
+ \usepackage{bibentry}
+ \newcommand{\xlmr}{\textit{XLM-R}\xspace}
+ \newcommand{\mbert}{mBERT\xspace}
+ \input{content/tables}
+
+ \begin{document}
+ \nobibliography{acl2020}
+ \bibliographystyle{acl_natbib}
+ \appendix
+ \onecolumn
+ \section*{Supplementary materials}
+ \section{Languages and statistics for CC-100 used by \xlmr}
+ In this section we present the list of languages in the CC-100 corpus we created for training \xlmr. We also report statistics such as the number of tokens and the size of each monolingual corpus.
+ \label{sec:appendix_A}
+ \insertDataStatistics
+
+ \newpage
+ \section{Model Architectures and Sizes}
+ As we showed in section 5, capacity is an important parameter for learning strong cross-lingual representations. In the table below, we list multiple monolingual and multilingual models used by the research community and summarize their architectures and total number of parameters.
+ \label{sec:appendix_B}
+
+ \insertParameters
+ \end{document}
references/2019.arxiv.conneau/source/XLMR Paper/content/batchsize.pdf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0e0c4e1c156379efeba93f0c1a6717bb12ab0b2aa0bdd361a7fda362ff01442e
+ size 14673
references/2019.arxiv.conneau/source/XLMR Paper/content/capacity.pdf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:00087aeb1a14190e7800a77cecacb04e8ce1432c029e0276b4d8b02b7ff66edb
+ size 16459
references/2019.arxiv.conneau/source/XLMR Paper/content/datasize.pdf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5d07fdd658101ef6caf7e2808faa6045ab175315b6435e25ff14ecedac584118
+ size 26052
references/2019.arxiv.conneau/source/XLMR Paper/content/dilution.pdf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:80d1555811c23e2c521fbb007d84dfddb85e7020cc9333058368d3a1d63e240a
+ size 16376
references/2019.arxiv.conneau/source/XLMR Paper/content/langsampling.pdf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c2f2f95649a23b0a46f8553f4e0e29000aff1971385b9addf6f478acc5a516a3
+ size 15612
references/2019.arxiv.conneau/source/XLMR Paper/content/tables.tex ADDED
@@ -0,0 +1,398 @@
+
+
+
+ \newcommand{\insertXNLItable}{
+ \begin{table*}[h!]
+ \begin{center}
+ % \scriptsize
+ \resizebox{1\linewidth}{!}{
+ \begin{tabular}[b]{l ccc ccccccccccccccc c}
+ \toprule
+ {\bf Model} & {\bf D }& {\bf \#M} & {\bf \#lg} & {\bf en} & {\bf fr} & {\bf es} & {\bf de} & {\bf el} & {\bf bg} & {\bf ru} & {\bf tr} & {\bf ar} & {\bf vi} & {\bf th} & {\bf zh} & {\bf hi} & {\bf sw} & {\bf ur} & {\bf Avg}\\
+ \midrule
+ %\cmidrule(r){1-1}
+ %\cmidrule(lr){2-4}
+ %\cmidrule(lr){5-19}
+ %\cmidrule(l){20-20}
+
+ \multicolumn{19}{l}{\it Fine-tune multilingual model on English training set (Cross-lingual Transfer)} \\
+ %\midrule
+ \midrule
+ \citet{lample2019cross} & Wiki+MT & N & 15 & 85.0 & 78.7 & 78.9 & 77.8 & 76.6 & 77.4 & 75.3 & 72.5 & 73.1 & 76.1 & 73.2 & 76.5 & 69.6 & 68.4 & 67.3 & 75.1 \\
+ \citet{huang2019unicoder} & Wiki+MT & N & 15 & 85.1 & 79.0 & 79.4 & 77.8 & 77.2 & 77.2 & 76.3 & 72.8 & 73.5 & 76.4 & 73.6 & 76.2 & 69.4 & 69.7 & 66.7 & 75.4 \\
+ %\midrule
+ \citet{devlin2018bert} & Wiki & N & 102 & 82.1 & 73.8 & 74.3 & 71.1 & 66.4 & 68.9 & 69.0 & 61.6 & 64.9 & 69.5 & 55.8 & 69.3 & 60.0 & 50.4 & 58.0 & 66.3 \\
+ \citet{lample2019cross} & Wiki & N & 100 & 83.7 & 76.2 & 76.6 & 73.7 & 72.4 & 73.0 & 72.1 & 68.1 & 68.4 & 72.0 & 68.2 & 71.5 & 64.5 & 58.0 & 62.4 & 71.3 \\
+ \citet{lample2019cross} & Wiki & 1 & 100 & 83.2 & 76.7 & 77.7 & 74.0 & 72.7 & 74.1 & 72.7 & 68.7 & 68.6 & 72.9 & 68.9 & 72.5 & 65.6 & 58.2 & 62.4 & 70.7 \\
+ \bf XLM-R\textsubscript{Base} & CC & 1 & 100 & 85.8 & 79.7 & 80.7 & 78.7 & 77.5 & 79.6 & 78.1 & 74.2 & 73.8 & 76.5 & 74.6 & 76.7 & 72.4 & 66.5 & 68.3 & 76.2 \\
+ \bf XLM-R & CC & 1 & 100 & \bf 89.1 & \bf 84.1 & \bf 85.1 & \bf 83.9 & \bf 82.9 & \bf 84.0 & \bf 81.2 & \bf 79.6 & \bf 79.8 & \bf 80.8 & \bf 78.1 & \bf 80.2 & \bf 76.9 & \bf 73.9 & \bf 73.8 & \bf 80.9 \\
+ \midrule
+ \multicolumn{19}{l}{\it Translate everything to English and use English-only model (TRANSLATE-TEST)} \\
+ \midrule
+ BERT-en & Wiki & 1 & 1 & 88.8 & 81.4 & 82.3 & 80.1 & 80.3 & 80.9 & 76.2 & 76.0 & 75.4 & 72.0 & 71.9 & 75.6 & 70.0 & 65.8 & 65.8 & 76.2 \\
+ RoBERTa & Wiki+CC & 1 & 1 & \underline{\bf 91.3} & 82.9 & 84.3 & 81.2 & 81.7 & 83.1 & 78.3 & 76.8 & 76.6 & 74.2 & 74.1 & 77.5 & 70.9 & 66.7 & 66.8 & 77.8 \\
+ % XLM-en & Wiki & 1 & 1 & 00.0 & 00.0 & 00.0 & 00.0 & 00.0 & 00.0 & 00.0 \\
+ \midrule
+ \multicolumn{19}{l}{\it Fine-tune multilingual model on each training set (TRANSLATE-TRAIN)} \\
+ \midrule
+ \citet{lample2019cross} & Wiki & N & 100 & 82.9 & 77.6 & 77.9 & 77.9 & 77.1 & 75.7 & 75.5 & 72.6 & 71.2 & 75.8 & 73.1 & 76.2 & 70.4 & 66.5 & 62.4 & 74.2 \\
+ \midrule
+ \multicolumn{19}{l}{\it Fine-tune multilingual model on all training sets (TRANSLATE-TRAIN-ALL)} \\
+ \midrule
+ \citet{lample2019cross}$^{\dagger}$ & Wiki+MT & 1 & 15 & 85.0 & 80.8 & 81.3 & 80.3 & 79.1 & 80.9 & 78.3 & 75.6 & 77.6 & 78.5 & 76.0 & 79.5 & 72.9 & 72.8 & 68.5 & 77.8 \\
+ \citet{huang2019unicoder} & Wiki+MT & 1 & 15 & 85.6 & 81.1 & 82.3 & 80.9 & 79.5 & 81.4 & 79.7 & 76.8 & 78.2 & 77.9 & 77.1 & 80.5 & 73.4 & 73.8 & 69.6 & 78.5 \\
+ %\midrule
+ \citet{lample2019cross} & Wiki & 1 & 100 & 84.5 & 80.1 & 81.3 & 79.3 & 78.6 & 79.4 & 77.5 & 75.2 & 75.6 & 78.3 & 75.7 & 78.3 & 72.1 & 69.2 & 67.7 & 76.9 \\
+ \bf XLM-R\textsubscript{Base} & CC & 1 & 100 & 85.4 & 81.4 & 82.2 & 80.3 & 80.4 & 81.3 & 79.7 & 78.6 & 77.3 & 79.7 & 77.9 & 80.2 & 76.1 & 73.1 & 73.0 & 79.1 \\
+ \bf XLM-R & CC & 1 & 100 & \bf 89.1 & \underline{\bf 85.1} & \underline{\bf 86.6} & \underline{\bf 85.7} & \underline{\bf 85.3} & \underline{\bf 85.9} & \underline{\bf 83.5} & \underline{\bf 83.2} & \underline{\bf 83.1} & \underline{\bf 83.7} & \underline{\bf 81.5} & \underline{\bf 83.7} & \underline{\bf 81.6} & \underline{\bf 78.0} & \underline{\bf 78.1} & \underline{\bf 83.6} \\
+ \bottomrule
+ \end{tabular}
+ }
+ \caption{\textbf{Results on cross-lingual classification.} We report the accuracy on each of the 15 XNLI languages and the average accuracy. We specify the dataset D used for pretraining, the number of models \#M the approach requires and the number of languages \#lg the model handles. Our \xlmr results are averaged over five different seeds. We show that using the translate-train-all approach which leverages training sets from multiple languages, \xlmr obtains a new state of the art on XNLI of $83.6$\% average accuracy. Results with $^{\dagger}$ are from \citet{huang2019unicoder}. %It also outperforms previous methods on cross-lingual transfer.
+ \label{tab:xnli}}
+ \end{center}
+ % \vspace{-0.4cm}
+ \end{table*}
+ }
+
+ % Evolution of performance w.r.t number of languages
+ \newcommand{\insertLanguagesize}{
+ \begin{table*}[h!]
+ \begin{minipage}{0.49\textwidth}
+ \includegraphics[scale=0.4]{content/wiki_vs_cc.pdf}
+ \end{minipage}
+ \hfill
+ \begin{minipage}{0.4\textwidth}
+ \captionof{figure}{\textbf{Distribution of the amount of data (in MB) per language for Wikipedia and CommonCrawl.} The Wikipedia data used in open-source mBERT and XLM is not sufficient for the model to develop an understanding of low-resource languages. The CommonCrawl data we collect alleviates that issue and creates the conditions for a single model to understand text coming from multiple languages. \label{fig:lgs}}
+ \end{minipage}
+ % \vspace{-0.5cm}
+ \end{table*}
+ }
+
+ % Evolution of performance w.r.t number of languages
+ \newcommand{\insertXLMmorelanguages}{
+ \begin{table*}[h!]
+ \begin{minipage}{0.49\textwidth}
+ \includegraphics[scale=0.4]{content/evolution_languages}
+ \end{minipage}
+ \hfill
+ \begin{minipage}{0.4\textwidth}
+ \captionof{figure}{\textbf{Evolution of XLM performance on SeqLab, XNLI and GLUE as the number of languages increases.} While there are subtleties as to what languages lose more accuracy than others as we add more languages, we observe a steady decrease of the overall monolingual and cross-lingual performance. \label{fig:lgsunused}}
+ \end{minipage}
+ % \vspace{-0.5cm}
+ \end{table*}
+ }
+
+ \newcommand{\insertMLQA}{
+ \begin{table*}[h!]
+ \begin{center}
+ % \scriptsize
+ \resizebox{1\linewidth}{!}{
+ \begin{tabular}[h]{l cc ccccccc c}
+ \toprule
+ {\bf Model} & {\bf train} & {\bf \#lgs} & {\bf en} & {\bf es} & {\bf de} & {\bf ar} & {\bf hi} & {\bf vi} & {\bf zh} & {\bf Avg} \\
+ \midrule
+ BERT-Large$^{\dagger}$ & en & 1 & 80.2 / 67.4 & - & - & - & - & - & - & - \\
+ mBERT$^{\dagger}$ & en & 102 & 77.7 / 65.2 & 64.3 / 46.6 & 57.9 / 44.3 & 45.7 / 29.8 & 43.8 / 29.7 & 57.1 / 38.6 & 57.5 / 37.3 & 57.7 / 41.6 \\
+ XLM-15$^{\dagger}$ & en & 15 & 74.9 / 62.4 & 68.0 / 49.8 & 62.2 / 47.6 & 54.8 / 36.3 & 48.8 / 27.3 & 61.4 / 41.8 & 61.1 / 39.6 & 61.6 / 43.5 \\
+ XLM-R\textsubscript{Base} & en & 100 & 77.1 / 64.6 & 67.4 / 49.6 & 60.9 / 46.7 & 54.9 / 36.6 & 59.4 / 42.9 & 64.5 / 44.7 & 61.8 / 39.3 & 63.7 / 46.3 \\
+ \bf XLM-R & en & 100 & \bf 80.6 / 67.8 & \bf 74.1 / 56.0 & \bf 68.5 / 53.6 & \bf 63.1 / 43.5 & \bf 69.2 / 51.6 & \bf 71.3 / 50.9 & \bf 68.0 / 45.4 & \bf 70.7 / 52.7 \\
+ \bottomrule
+ \end{tabular}
+ }
+ \caption{\textbf{Results on MLQA question answering.} We report the F1 and EM (exact match) scores for zero-shot classification where models are fine-tuned on the English SQuAD dataset and evaluated on the 7 languages of MLQA. Results with $\dagger$ are taken from the original MLQA paper \citet{lewis2019mlqa}.
+ \label{tab:mlqa}}
+ \end{center}
+ \end{table*}
+ }
+
+ \newcommand{\insertNER}{
+ \begin{table}[t]
+ \begin{center}
+ % \scriptsize
+ \resizebox{1\linewidth}{!}{
+ \begin{tabular}[b]{l cc cccc c}
+ \toprule
+ {\bf Model} & {\bf train} & {\bf \#M} & {\bf en} & {\bf nl} & {\bf es} & {\bf de} & {\bf Avg}\\
+ \midrule
+ \citet{lample-etal-2016-neural} & each & N & 90.74 & 81.74 & 85.75 & 78.76 & 84.25 \\
+ \citet{akbik2018coling} & each & N & \bf 93.18 & 90.44 & - & \bf 88.27 & - \\
+ \midrule
+ \multirow{2}{*}{mBERT$^{\dagger}$} & each & N & 91.97 & 90.94 & 87.38 & 82.82 & 88.28\\
+ & en & 1 & 91.97 & 77.57 & 74.96 & 69.56 & 78.52\\
+ \midrule
+ \multirow{3}{*}{XLM-R\textsubscript{Base}} & each & N & 92.25 & 90.39 & 87.99 & 84.60 & 88.81\\
+ & en & 1 & 92.25 & 78.08 & 76.53 & 69.60 & 79.11\\
+ & all & 1 & 91.08 & 89.09 & 87.28 & 83.17 & 87.66 \\
+ \midrule
+ \multirow{3}{*}{\bf XLM-R} & each & N & 92.92 & \bf 92.53 & \bf 89.72 & 85.81 & 90.24\\
+ & en & 1 & 92.92 & 80.80 & 78.64 & 71.40 & 80.94\\
+ & all & 1 & 92.00 & 91.60 & 89.52 & 84.60 & 89.43 \\
+ \bottomrule
+ \end{tabular}
+ }
+ \caption{\textbf{Results on named entity recognition} on CoNLL-2002 and CoNLL-2003 (F1 score). Results with $\dagger$ are from \citet{wu2019beto}. Note that mBERT and \xlmr do not use a linear-chain CRF, as opposed to \citet{akbik2018coling} and \citet{lample-etal-2016-neural}.
+ \label{tab:ner}}
+ \end{center}
+ \vspace{-0.6cm}
+ \end{table}
+ }
+
+
+ \newcommand{\insertAblationone}{
+ \begin{table*}[h!]
+ \begin{minipage}[t]{0.3\linewidth}
+ \begin{center}
+ %\includegraphics[width=\linewidth]{content/xlmroberta_transfer_dilution.pdf}
+ \includegraphics{content/dilution}
+ \captionof{figure}{The transfer-interference trade-off: Low-resource languages benefit from scaling to more languages, until dilution (interference) kicks in and degrades overall performance.}
+ \label{fig:transfer_dilution}
+ \vspace{-0.2cm}
+ \end{center}
+ \end{minipage}
+ \hfill
+ \begin{minipage}[t]{0.3\linewidth}
+ \begin{center}
+ %\includegraphics[width=\linewidth]{content/xlmroberta_evolution.pdf}
+ \includegraphics{content/wikicc}
+ \captionof{figure}{Wikipedia versus CommonCrawl: An XLM-7 obtains significantly better performance when trained on CC, in particular on low-resource languages.}
+ \label{fig:curse}
+ \end{center}
+ \end{minipage}
+ \hfill
+ \begin{minipage}[t]{0.3\linewidth}
+ \begin{center}
+ % \includegraphics[width=\linewidth]{content/xlmroberta_evolution.pdf}
+ \includegraphics{content/capacity}
+ \captionof{figure}{Adding more capacity to the model alleviates the curse of multilinguality, but remains an issue for models of moderate size.}
+ \label{fig:capacity}
+ \end{center}
+ \end{minipage}
+ \vspace{-0.2cm}
+ \end{table*}
+ }
+
+
+ \newcommand{\insertAblationtwo}{
+ \begin{table*}[h!]
+ \begin{minipage}[t]{0.3\linewidth}
+ \begin{center}
+ %\includegraphics[width=\columnwidth]{content/xlmroberta_alpha_tradeoff.pdf}
+ \includegraphics{content/langsampling}
+ \captionof{figure}{On the high-resource versus low-resource trade-off: impact of batch language sampling for XLM-100.
+ \label{fig:alpha}}
+ \end{center}
+ \end{minipage}
+ \hfill
+ \begin{minipage}[t]{0.3\linewidth}
+ \begin{center}
+ %\includegraphics[width=\columnwidth]{content/xlmroberta_vocab.pdf}
+ \includegraphics{content/vocabsize.pdf}
+ \captionof{figure}{On the impact of vocabulary size at fixed capacity and with increasing capacity for XLM-100.
+ \label{fig:vocab}}
+ \end{center}
+ \end{minipage}
+ \hfill
+ \begin{minipage}[t]{0.3\linewidth}
+ \begin{center}
+ %\includegraphics[width=\columnwidth]{content/xlmroberta_batch_and_tok.pdf}
+ \includegraphics{content/batchsize.pdf}
+ \captionof{figure}{On the impact of large-scale training, and preprocessing simplification from BPE with tokenization to SPM on raw text data.
+ \label{fig:batch}}
+ \end{center}
+ \end{minipage}
+ \vspace{-0.2cm}
+ \end{table*}
+ }
+
+
+ % Multilingual vs monolingual
+ \newcommand{\insertMultiMono}{
+ \begin{table}[h!]
+ \begin{center}
+ % \scriptsize
+ \resizebox{1\linewidth}{!}{
+ \begin{tabular}[b]{l cc ccccccc c}
+ \toprule
+ {\bf Model} & {\bf D } & {\bf \#vocab} & {\bf en} & {\bf fr} & {\bf de} & {\bf ru} & {\bf zh} & {\bf sw} & {\bf ur} & {\bf Avg}\\
+ \midrule
+ \multicolumn{11}{l}{\it Monolingual baselines}\\
+ \midrule
+ \multirow{2}{*}{BERT} & Wiki & 40k & 84.5 & 78.6 & 80.0 & 75.5 & 77.7 & 60.1 & 57.3 & 73.4 \\
+ & CC & 40k & 86.7 & 81.2 & 81.2 & 78.2 & 79.5 & 70.8 & 65.1 & 77.5 \\
+ \midrule
+ \multicolumn{11}{l}{\it Multilingual models (cross-lingual transfer)}\\
+ \midrule
+ \multirow{2}{*}{XLM-7} & Wiki & 150k & 82.3 & 76.8 & 74.7 & 72.5 & 73.1 & 60.8 & 62.3 & 71.8 \\
+ & CC & 150k & 85.7 & 78.6 & 79.5 & 76.4 & 74.8 & 71.2 & 66.9 & 76.2 \\
+ \midrule
+ \multicolumn{11}{l}{\it Multilingual models (translate-train-all)}\\
+ \midrule
+ \multirow{2}{*}{XLM-7} & Wiki & 150k & 84.6 & 80.1 & 80.2 & 75.7 & 78 & 68.7 & 66.7 & 76.3 \\
+ & CC & 150k & \bf 87.2 & \bf 82.5 & \bf 82.9 & \bf 79.7 & \bf 80.4 & \bf 75.7 & \bf 71.5 & \bf 80.0 \\
+ % \midrule
+ % XLM (sw,ar) & CC & 60k & N & 2-3 & - & - & - & - & - & 00.0 & - & 00.0 \\
+ % XLM (ur,hi,ar) & CC & 60k & N & 2-3 & - & - & - & - & - & - & 00.0 & 00.0 \\
+ \bottomrule
+ \end{tabular}
+ }
+ \caption{\textbf{Multilingual versus monolingual models (BERT-BASE).} We compare the performance of monolingual models (BERT) versus multilingual models (XLM) on seven languages, using a BERT-BASE architecture. We choose a vocabulary size of 40k and 150k for monolingual and multilingual models.
+ \label{tab:multimono}}
+ \end{center}
+ \vspace{-0.4cm}
+ \end{table}
+ }
+
+ % GLUE benchmark results
+ \newcommand{\insertGlue}{
+ \begin{table}[h!]
+ \begin{center}
+ % \scriptsize
+ \resizebox{1\linewidth}{!}{
+ \begin{tabular}[b]{l|c|cccccc|c}
+ \toprule
+ {\bf Model} & {\bf \#lgs} & {\bf MNLI-m/mm} & {\bf QNLI} & {\bf QQP} & {\bf SST} & {\bf MRPC} & {\bf STS-B} & {\bf Avg}\\
+ \midrule
+ BERT\textsubscript{Large}$^{\dagger}$ & 1 & 86.6/- & 92.3 & 91.3 & 93.2 & 88.0 & 90.0 & 90.2 \\
+ XLNet\textsubscript{Large}$^{\dagger}$ & 1 & 89.8/- & 93.9 & 91.8 & 95.6 & 89.2 & 91.8 & 92.0 \\
+ RoBERTa$^{\dagger}$ & 1 & 90.2/90.2 & 94.7 & 92.2 & 96.4 & 90.9 & 92.4 & 92.8 \\
+ XLM-R & 100 & 88.9/89.0 & 93.8 & 92.3 & 95.0 & 89.5 & 91.2 & 91.8 \\
+ \bottomrule
+ \end{tabular}
+ }
+ \caption{\textbf{GLUE dev results.} Results with $^{\dagger}$ are from \citet{roberta2019}. We compare the performance of \xlmr to BERT\textsubscript{Large}, XLNet and RoBERTa on the English GLUE benchmark.
+ \label{tab:glue}}
+ \end{center}
+ \vspace{-0.4cm}
+ \end{table}
+ }
+
+
+ % Wiki vs CommonCrawl statistics
+ \newcommand{\insertWikivsCC}{
+ \begin{table*}[h]
+ \begin{center}
+ %\includegraphics[width=\linewidth]{content/wiki_vs_cc.pdf}
+ \includegraphics{content/datasize.pdf}
+ \captionof{figure}{Amount of data in GiB (log-scale) for the 88 languages that appear in both the Wiki-100 corpus used for mBERT and XLM-100, and the CC-100 used for XLM-R. CC-100 increases the amount of data by several orders of magnitude, in particular for low-resource languages.
+ \label{fig:wikivscc}}
+ \end{center}
+ % \vspace{-0.4cm}
+ \end{table*}
+ }
+
+ % Corpus statistics for CC-100
+ \newcommand{\insertDataStatistics}{
+ %\resizebox{1\linewidth}{!}{
+ \begin{table}[h!]
+ \begin{center}
+ \small
+ \begin{tabular}[b]{clrrclrr}
+ \toprule
+ \textbf{ISO code} & \textbf{Language} & \textbf{Tokens} (M) & \textbf{Size} (GiB) & \textbf{ISO code} & \textbf{Language} & \textbf{Tokens} (M) & \textbf{Size} (GiB)\\
+ \cmidrule(r){1-4}\cmidrule(l){5-8}
+ {\bf af }& Afrikaans & 242 & 1.3 &{\bf lo }& Lao & 17 & 0.6 \\
+ {\bf am }& Amharic & 68 & 0.8 &{\bf lt }& Lithuanian & 1835 & 13.7 \\
+ {\bf ar }& Arabic & 2869 & 28.0 &{\bf lv }& Latvian & 1198 & 8.8 \\
+ {\bf as }& Assamese & 5 & 0.1 &{\bf mg }& Malagasy & 25 & 0.2 \\
+ {\bf az }& Azerbaijani & 783 & 6.5 &{\bf mk }& Macedonian & 449 & 4.8 \\
+ {\bf be }& Belarusian & 362 & 4.3 &{\bf ml }& Malayalam & 313 & 7.6 \\
+ {\bf bg }& Bulgarian & 5487 & 57.5 &{\bf mn }& Mongolian & 248 & 3.0 \\
+ {\bf bn }& Bengali & 525 & 8.4 &{\bf mr }& Marathi & 175 & 2.8 \\
+ {\bf - }& Bengali Romanized & 77 & 0.5 &{\bf ms }& Malay & 1318 & 8.5 \\
+ {\bf br }& Breton & 16 & 0.1 &{\bf my }& Burmese & 15 & 0.4 \\
+ {\bf bs }& Bosnian & 14 & 0.1 &{\bf my }& Burmese & 56 & 1.6 \\
+ {\bf ca }& Catalan & 1752 & 10.1 &{\bf ne }& Nepali & 237 & 3.8 \\
+ {\bf cs }& Czech & 2498 & 16.3 &{\bf nl }& Dutch & 5025 & 29.3 \\
+ {\bf cy }& Welsh & 141 & 0.8 &{\bf no }& Norwegian & 8494 & 49.0 \\
+ {\bf da }& Danish & 7823 & 45.6 &{\bf om }& Oromo & 8 & 0.1 \\
+ {\bf de }& German & 10297 & 66.6 &{\bf or }& Oriya & 36 & 0.6 \\
+ {\bf el }& Greek & 4285 & 46.9 &{\bf pa }& Punjabi & 68 & 0.8 \\
+ {\bf en }& English & 55608 & 300.8 &{\bf pl }& Polish & 6490 & 44.6 \\
+ {\bf eo }& Esperanto & 157 & 0.9 &{\bf ps }& Pashto & 96 & 0.7 \\
+ {\bf es }& Spanish & 9374 & 53.3 &{\bf pt }& Portuguese & 8405 & 49.1 \\
+ {\bf et }& Estonian & 843 & 6.1 &{\bf ro }& Romanian & 10354 & 61.4 \\
315
+ {\bf eu }& Basque & 270 & 2.0 &{\bf ru }& Russian & 23408 & 278.0 \\
316
+ {\bf fa }& Persian & 13259 & 111.6 &{\bf sa }& Sanskrit & 17 & 0.3 \\
317
+ {\bf fi }& Finnish & 6730 & 54.3 &{\bf sd }& Sindhi & 50 & 0.4 \\
318
+ {\bf fr }& French & 9780 & 56.8 &{\bf si }& Sinhala & 243 & 3.6 \\
319
+ {\bf fy }& Western Frisian & 29 & 0.2 &{\bf sk }& Slovak & 3525 & 23.2 \\
320
+ {\bf ga }& Irish & 86 & 0.5 &{\bf sl }& Slovenian & 1669 & 10.3 \\
321
+ {\bf gd }& Scottish Gaelic & 21 & 0.1 &{\bf so }& Somali & 62 & 0.4 \\
322
+ {\bf gl }& Galician & 495 & 2.9 &{\bf sq }& Albanian & 918 & 5.4 \\
323
+ {\bf gu }& Gujarati & 140 & 1.9 &{\bf sr }& Serbian & 843 & 9.1 \\
324
+ {\bf ha }& Hausa & 56 & 0.3 &{\bf su }& Sundanese & 10 & 0.1 \\
325
+ {\bf he }& Hebrew & 3399 & 31.6 &{\bf sv }& Swedish & 77.8 & 12.1 \\
326
+ {\bf hi }& Hindi & 1715 & 20.2 &{\bf sw }& Swahili & 275 & 1.6 \\
327
+ {\bf - }& Hindi Romanized & 88 & 0.5 &{\bf ta }& Tamil & 595 & 12.2 \\
328
+ {\bf hr }& Croatian & 3297 & 20.5 &{\bf - }& Tamil Romanized & 36 & 0.3 \\
329
+ {\bf hu }& Hungarian & 7807 & 58.4 &{\bf te }& Telugu & 249 & 4.7 \\
330
+ {\bf hy }& Armenian & 421 & 5.5 &{\bf - }& Telugu Romanized & 39 & 0.3 \\
331
+ {\bf id }& Indonesian & 22704 & 148.3 &{\bf th }& Thai & 1834 & 71.7 \\
332
+ {\bf is }& Icelandic & 505 & 3.2 &{\bf tl }& Filipino & 556 & 3.1 \\
333
+ {\bf it }& Italian & 4983 & 30.2 &{\bf tr }& Turkish & 2736 & 20.9 \\
334
+ {\bf ja }& Japanese & 530 & 69.3 &{\bf ug }& Uyghur & 27 & 0.4 \\
335
+ {\bf jv }& Javanese & 24 & 0.2 &{\bf uk }& Ukrainian & 6.5 & 84.6 \\
336
+ {\bf ka }& Georgian & 469 & 9.1 &{\bf ur }& Urdu & 730 & 5.7 \\
337
+ {\bf kk }& Kazakh & 476 & 6.4 &{\bf - }& Urdu Romanized & 85 & 0.5 \\
338
+ {\bf km }& Khmer & 36 & 1.5 &{\bf uz }& Uzbek & 91 & 0.7 \\
339
+ {\bf kn }& Kannada & 169 & 3.3 &{\bf vi }& Vietnamese & 24757 & 137.3 \\
340
+ {\bf ko }& Korean & 5644 & 54.2 &{\bf xh }& Xhosa & 13 & 0.1 \\
341
+ {\bf ku }& Kurdish (Kurmanji) & 66 & 0.4 &{\bf yi }& Yiddish & 34 & 0.3 \\
342
+ {\bf ky }& Kyrgyz & 94 & 1.2 &{\bf zh }& Chinese (Simplified) & 259 & 46.9 \\
343
+ {\bf la }& Latin & 390 & 2.5 &{\bf zh }& Chinese (Traditional) & 176 & 16.6 \\
344
+
345
+ \bottomrule
346
+ \end{tabular}
347
+ \caption{\textbf{Languages and statistics of the CC-100 corpus.} We report the list of 100 languages and include the number of tokens (in millions) and the size of the data (in GiB) for each language. Note that we also include romanized variants of some non-Latin languages such as Bengali, Hindi, Tamil, Telugu and Urdu.\label{tab:datastats}}
348
+ \end{center}
349
+ \end{table}
350
+ %}
351
+ }
352
+
353
+
354
+ % Comparison of parameters for different models
355
+ \newcommand{\insertParameters}{
356
+ \begin{table*}[h!]
357
+ \begin{center}
358
+ % \scriptsize
359
+ %\resizebox{1\linewidth}{!}{
360
+ \begin{tabular}[b]{lrcrrrrrc}
361
+ \toprule
362
+ \textbf{Model} & \textbf{\#lgs} & \textbf{tokenization} & \textbf{L} & \textbf{$H_{m}$} & \textbf{$H_{ff}$} & \textbf{A} & \textbf{V} & \textbf{\#params}\\
363
+ \cmidrule(r){1-1}
364
+ \cmidrule(lr){2-3}
365
+ \cmidrule(lr){4-8}
366
+ \cmidrule(l){9-9}
367
+ % TODO: rank by number of parameters
368
+ BERT\textsubscript{Base} & 1 & WordPiece & 12 & 768 & 3072 & 12 & 30k & 110M \\
369
+ BERT\textsubscript{Large} & 1 & WordPiece & 24 & 1024 & 4096 & 16 & 30k & 335M \\
370
+ mBERT & 104 & WordPiece & 12 & 768 & 3072 & 12 & 110k & 172M \\
371
+ RoBERTa\textsubscript{Base} & 1 & bBPE & 12 & 768 & 3072 & 8 & 50k & 125M \\
372
+ RoBERTa & 1 & bBPE & 24 & 1024 & 4096 & 16 & 50k & 355M \\
373
+ XLM-15 & 15 & BPE & 12 & 1024 & 4096 & 8 & 95k & 250M \\
374
+ XLM-17 & 17 & BPE & 16 & 1280 & 5120 & 16 & 200k & 570M \\
375
+ XLM-100 & 100 & BPE & 16 & 1280 & 5120 & 16 & 200k & 570M \\
376
+ Unicoder & 15 & BPE & 12 & 1024 & 4096 & 8 & 95k & 250M \\
377
+ \xlmr\textsubscript{Base} & 100 & SPM & 12 & 768 & 3072 & 12 & 250k & 270M \\
378
+ \xlmr & 100 & SPM & 24 & 1024 & 4096 & 16 & 250k & 550M \\
379
+ GPT2 & 1 & bBPE & 48 & 1600 & 6400 & 32 & 50k & 1.5B \\
380
+ wide-mmNMT & 103 & SPM & 12 & 2048 & 16384 & 32 & 64k & 3B \\
381
+ deep-mmNMT & 103 & SPM & 24 & 1024 & 16384 & 32 & 64k & 3B \\
382
+ T5-3B & 1 & WordPiece & 24 & 1024 & 16384 & 32 & 32k & 3B \\
383
+ T5-11B & 1 & WordPiece & 24 & 1024 & 65536 & 32 & 32k & 11B \\
384
+ % XLNet\textsubscript{Large}$^{\dagger}$ & 1 & 89.8/- & 93.9 & 91.8 & 95.6 & 89.2 & 91.8 & 92.0 \\
385
+ % RoBERTa$^{\dagger}$ & 1 & 90.2/90.2 & 94.7 & 92.2 & 96.4 & 90.9 & 92.4 & 92.8 \\
386
+ % XLM-R & 100 & 88.4/88.5 & 93.1 & 92.2 & 95.1 & 89.7 & 90.4 & 91.5 \\
387
+ \bottomrule
388
+ \end{tabular}
389
+ %}
390
+ \caption{\textbf{Details on model sizes.}
391
+ We show the tokenization used by each Transformer model, the number of layers L, the hidden size of the model $H_{m}$, the dimension of the feed-forward layer $H_{ff}$, the number of attention heads A, the size of the vocabulary V and the total number of parameters \#params.
392
+ For Transformer encoders, the number of parameters can be approximated by $4LH_m^2 + 2LH_m H_{ff} + VH_m$.
393
+ GPT2 numbers are from \citet{radford2019language}, mm-NMT models are from the work of \citet{arivazhagan2019massively} on massively multilingual neural machine translation (mmNMT), and T5 numbers are from \citet{raffel2019exploring}. While \xlmr is among the largest models, partly due to its large embedding layer, it has a similar number of parameters to XLM-100 and remains significantly smaller than recently introduced Transformer models for multilingual MT and transfer learning. While this table gives more insight into the capacity of each model, note that it does not highlight other critical differences between the models.
394
+ \label{tab:parameters}}
395
+ \end{center}
396
+ \vspace{-0.4cm}
397
+ \end{table*}
398
+ }
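The caption above approximates a Transformer encoder's parameter count as $4LH_m^2 + 2LH_mH_{ff} + VH_m$. A quick sanity check of that formula against two rows of the table; the formula ignores biases, layer norms and position embeddings, so the estimates differ from the reported totals by a few percent:

```python
def approx_encoder_params(L, H_m, H_ff, V):
    """Approximate Transformer-encoder parameter count:
    4*L*H_m^2 (attention projections) + 2*L*H_m*H_ff (feed-forward)
    + V*H_m (token embeddings)."""
    return 4 * L * H_m**2 + 2 * L * H_m * H_ff + V * H_m

# BERT-Base: L=12, H_m=768, H_ff=3072, V=30k -> reported ~110M
print(approx_encoder_params(12, 768, 3072, 30_000))    # 107974656 (~108M)
# XLM-R (Large): L=24, H_m=1024, H_ff=4096, V=250k -> reported ~550M
print(approx_encoder_params(24, 1024, 4096, 250_000))  # 557989888 (~558M)
```

The XLM-R estimate also makes the caption's point about the embedding layer concrete: the $VH_m$ term alone contributes 256M of the roughly 558M parameters.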
references/2019.arxiv.conneau/source/XLMR Paper/content/vocabsize.pdf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e45090856dc149265ada0062c8c2456c3057902dfaaade60aa80905785563506
3
+ size 15677
references/2019.arxiv.conneau/source/XLMR Paper/content/wikicc.pdf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f0d7e959db8240f283922c3ca7c6de6f5ad3750681f27f4fcf35d161506a7a21
3
+ size 16304
references/2019.arxiv.conneau/source/XLMR Paper/texput.log ADDED
@@ -0,0 +1,21 @@
1
+ This is pdfTeX, Version 3.14159265-2.6-1.40.20 (TeX Live 2019) (preloaded format=pdflatex 2019.5.8) 7 APR 2020 17:41
2
+ entering extended mode
3
+ restricted \write18 enabled.
4
+ %&-line parsing enabled.
5
+ **acl2020.tex
6
+
7
+ ! Emergency stop.
8
+ <*> acl2020.tex
9
+
10
+ *** (job aborted, file error in nonstop mode)
11
+
12
+
13
+ Here is how much of TeX's memory you used:
14
+ 3 strings out of 492616
15
+ 102 string characters out of 6129482
16
+ 57117 words of memory out of 5000000
17
+ 4025 multiletter control sequences out of 15000+600000
18
+ 3640 words of font info for 14 fonts, out of 8000000 for 9000
19
+ 1141 hyphenation exceptions out of 8191
20
+ 0i,0n,0p,1b,6s stack positions out of 5000i,500n,10000p,200000b,80000s
21
+ ! ==> Fatal error occurred, no output PDF file produced!
references/2019.arxiv.conneau/source/XLMR Paper/xlmr.bbl ADDED
@@ -0,0 +1,285 @@
1
+ \begin{thebibliography}{40}
2
+ \expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi
3
+
4
+ \bibitem[{Akbik et~al.(2018)Akbik, Blythe, and Vollgraf}]{akbik2018coling}
5
+ Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018.
6
+ \newblock Contextual string embeddings for sequence labeling.
7
+ \newblock In \emph{COLING}, pages 1638--1649.
8
+
9
+ \bibitem[{Arivazhagan et~al.(2019)Arivazhagan, Bapna, Firat, Lepikhin, Johnson,
10
+ Krikun, Chen, Cao, Foster, Cherry et~al.}]{arivazhagan2019massively}
11
+ Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson,
12
+ Maxim Krikun, Mia~Xu Chen, Yuan Cao, George Foster, Colin Cherry, et~al.
13
+ 2019.
14
+ \newblock Massively multilingual neural machine translation in the wild:
15
+ Findings and challenges.
16
+ \newblock \emph{arXiv preprint arXiv:1907.05019}.
17
+
18
+ \bibitem[{Bowman et~al.(2015)Bowman, Angeli, Potts, and
19
+ Manning}]{bowman2015large}
20
+ Samuel~R. Bowman, Gabor Angeli, Christopher Potts, and Christopher~D. Manning.
21
+ 2015.
22
+ \newblock A large annotated corpus for learning natural language inference.
23
+ \newblock In \emph{EMNLP}.
24
+
25
+ \bibitem[{Conneau et~al.(2018)Conneau, Rinott, Lample, Williams, Bowman,
26
+ Schwenk, and Stoyanov}]{conneau2018xnli}
27
+ Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel~R.
28
+ Bowman, Holger Schwenk, and Veselin Stoyanov. 2018.
29
+ \newblock XNLI: Evaluating cross-lingual sentence representations.
30
+ \newblock In \emph{EMNLP}. Association for Computational Linguistics.
31
+
32
+ \bibitem[{Devlin et~al.(2018)Devlin, Chang, Lee, and
33
+ Toutanova}]{devlin2018bert}
34
+ Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018.
35
+ \newblock BERT: Pre-training of deep bidirectional transformers for language
36
+ understanding.
37
+ \newblock \emph{NAACL}.
38
+
39
+ \bibitem[{Grave et~al.(2018)Grave, Bojanowski, Gupta, Joulin, and
40
+ Mikolov}]{grave2018learning}
41
+ Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas
42
+ Mikolov. 2018.
43
+ \newblock Learning word vectors for 157 languages.
44
+ \newblock In \emph{LREC}.
45
+
46
+ \bibitem[{Huang et~al.(2019)Huang, Liang, Duan, Gong, Shou, Jiang, and
47
+ Zhou}]{huang2019unicoder}
48
+ Haoyang Huang, Yaobo Liang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, and
49
+ Ming Zhou. 2019.
50
+ \newblock Unicoder: A universal language encoder by pre-training with multiple
51
+ cross-lingual tasks.
52
+ \newblock \emph{ACL}.
53
+
54
+ \bibitem[{Johnson et~al.(2017)Johnson, Schuster, Le, Krikun, Wu, Chen, Thorat,
55
+ Vi{\'e}gas, Wattenberg, Corrado et~al.}]{johnson2017google}
56
+ Melvin Johnson, Mike Schuster, Quoc~V Le, Maxim Krikun, Yonghui Wu, Zhifeng
57
+ Chen, Nikhil Thorat, Fernanda Vi{\'e}gas, Martin Wattenberg, Greg Corrado,
58
+ et~al. 2017.
59
+ \newblock Google’s multilingual neural machine translation system: Enabling
60
+ zero-shot translation.
61
+ \newblock \emph{TACL}, 5:339--351.
62
+
63
+ \bibitem[{Joulin et~al.(2017)Joulin, Grave, Bojanowski, and Mikolov}]{joulin2017bag}
64
+ Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017.
65
+ \newblock Bag of tricks for efficient text classification.
66
+ \newblock \emph{EACL 2017}, page 427.
67
+
68
+ \bibitem[{Jozefowicz et~al.(2016)Jozefowicz, Vinyals, Schuster, Shazeer, and
69
+ Wu}]{jozefowicz2016exploring}
70
+ Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu.
71
+ 2016.
72
+ \newblock Exploring the limits of language modeling.
73
+ \newblock \emph{arXiv preprint arXiv:1602.02410}.
74
+
75
+ \bibitem[{Kudo(2018)}]{kudo2018subword}
76
+ Taku Kudo. 2018.
77
+ \newblock Subword regularization: Improving neural network translation models
78
+ with multiple subword candidates.
79
+ \newblock In \emph{ACL}, pages 66--75.
80
+
81
+ \bibitem[{Kudo and Richardson(2018)}]{kudo2018sentencepiece}
82
+ Taku Kudo and John Richardson. 2018.
83
+ \newblock SentencePiece: A simple and language-independent subword tokenizer
84
+ and detokenizer for neural text processing.
85
+ \newblock \emph{EMNLP}.
86
+
87
+ \bibitem[{Lample et~al.(2016)Lample, Ballesteros, Subramanian, Kawakami, and
88
+ Dyer}]{lample-etal-2016-neural}
89
+ Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and
90
+ Chris Dyer. 2016.
91
+ \newblock \href {https://doi.org/10.18653/v1/N16-1030} {Neural architectures
92
+ for named entity recognition}.
93
+ \newblock In \emph{NAACL}, pages 260--270, San Diego, California. Association
94
+ for Computational Linguistics.
95
+
96
+ \bibitem[{Lample and Conneau(2019)}]{lample2019cross}
97
+ Guillaume Lample and Alexis Conneau. 2019.
98
+ \newblock Cross-lingual language model pretraining.
99
+ \newblock \emph{NeurIPS}.
100
+
101
+ \bibitem[{Lewis et~al.(2019)Lewis, O\u{g}uz, Rinott, Riedel, and
102
+ Schwenk}]{lewis2019mlqa}
103
+ Patrick Lewis, Barlas O\u{g}uz, Ruty Rinott, Sebastian Riedel, and Holger
104
+ Schwenk. 2019.
105
+ \newblock MLQA: Evaluating cross-lingual extractive question answering.
106
+ \newblock \emph{arXiv preprint arXiv:1910.07475}.
107
+
108
+ \bibitem[{Liu et~al.(2019)Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Lewis,
109
+ Zettlemoyer, and Stoyanov}]{roberta2019}
110
+ Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer
111
+ Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019.
112
+ \newblock Roberta: {A} robustly optimized {BERT} pretraining approach.
113
+ \newblock \emph{arXiv preprint arXiv:1907.11692}.
114
+
115
+ \bibitem[{Mikolov et~al.(2013{\natexlab{a}})Mikolov, Le, and
116
+ Sutskever}]{mikolov2013exploiting}
117
+ Tomas Mikolov, Quoc~V Le, and Ilya Sutskever. 2013{\natexlab{a}}.
118
+ \newblock Exploiting similarities among languages for machine translation.
119
+ \newblock \emph{arXiv preprint arXiv:1309.4168}.
120
+
121
+ \bibitem[{Mikolov et~al.(2013{\natexlab{b}})Mikolov, Sutskever, Chen, Corrado,
122
+ and Dean}]{mikolov2013distributed}
123
+ Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg~S Corrado, and Jeff Dean.
124
+ 2013{\natexlab{b}}.
125
+ \newblock Distributed representations of words and phrases and their
126
+ compositionality.
127
+ \newblock In \emph{NIPS}, pages 3111--3119.
128
+
129
+ \bibitem[{Pennington et~al.(2014)Pennington, Socher, and
130
+ Manning}]{pennington2014glove}
131
+ Jeffrey Pennington, Richard Socher, and Christopher~D. Manning. 2014.
132
+ \newblock \href {http://www.aclweb.org/anthology/D14-1162} {GloVe: Global
133
+ vectors for word representation}.
134
+ \newblock In \emph{EMNLP}, pages 1532--1543.
135
+
136
+ \bibitem[{Peters et~al.(2018)Peters, Neumann, Iyyer, Gardner, Clark, Lee, and
137
+ Zettlemoyer}]{peters2018deep}
138
+ Matthew~E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark,
139
+ Kenton Lee, and Luke Zettlemoyer. 2018.
140
+ \newblock Deep contextualized word representations.
141
+ \newblock \emph{NAACL}.
142
+
143
+ \bibitem[{Pires et~al.(2019)Pires, Schlinger, and Garrette}]{Pires2019HowMI}
144
+ Telmo Pires, Eva Schlinger, and Dan Garrette. 2019.
145
+ \newblock How multilingual is multilingual BERT?
146
+ \newblock In \emph{ACL}.
147
+
148
+ \bibitem[{Radford et~al.(2018)Radford, Narasimhan, Salimans, and
149
+ Sutskever}]{radford2018improving}
150
+ Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018.
151
+ \newblock \href
152
+ {https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf}
153
+ {Improving language understanding by generative pre-training}.
154
+ \newblock \emph{URL
155
+ https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language\_understanding\_paper.pdf}.
156
+
157
+ \bibitem[{Radford et~al.(2019)Radford, Wu, Child, Luan, Amodei, and
158
+ Sutskever}]{radford2019language}
159
+ Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya
160
+ Sutskever. 2019.
161
+ \newblock Language models are unsupervised multitask learners.
162
+ \newblock \emph{OpenAI Blog}, 1(8).
163
+
164
+ \bibitem[{Raffel et~al.(2019)Raffel, Shazeer, Roberts, Lee, Narang, Matena,
165
+ Zhou, Li, and Liu}]{raffel2019exploring}
166
+ Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael
167
+ Matena, Yanqi Zhou, Wei Li, and Peter~J. Liu. 2019.
168
+ \newblock Exploring the limits of transfer learning with a unified text-to-text
169
+ transformer.
170
+ \newblock \emph{arXiv preprint arXiv:1910.10683}.
171
+
172
+ \bibitem[{Rajpurkar et~al.(2018)Rajpurkar, Jia, and Liang}]{rajpurkar2018know}
173
+ Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018.
174
+ \newblock Know what you don't know: Unanswerable questions for SQuAD.
175
+ \newblock \emph{ACL}.
176
+
177
+ \bibitem[{Rajpurkar et~al.(2016)Rajpurkar, Zhang, Lopyrev, and
178
+ Liang}]{rajpurkar-etal-2016-squad}
179
+ Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016.
180
+ \newblock \href {https://doi.org/10.18653/v1/D16-1264} {{SQ}u{AD}: 100,000+
181
+ questions for machine comprehension of text}.
182
+ \newblock In \emph{EMNLP}, pages 2383--2392, Austin, Texas. Association for
183
+ Computational Linguistics.
184
+
185
+ \bibitem[{Sang(2002)}]{sang2002introduction}
186
+ Erik~F Sang. 2002.
187
+ \newblock Introduction to the CoNLL-2002 shared task: Language-independent
188
+ named entity recognition.
189
+ \newblock \emph{CoNLL}.
190
+
191
+ \bibitem[{Schuster et~al.(2019)Schuster, Ram, Barzilay, and
192
+ Globerson}]{schuster2019cross}
193
+ Tal Schuster, Ori Ram, Regina Barzilay, and Amir Globerson. 2019.
194
+ \newblock Cross-lingual alignment of contextual word embeddings, with
195
+ applications to zero-shot dependency parsing.
196
+ \newblock \emph{NAACL}.
197
+
198
+ \bibitem[{Siddhant et~al.(2019)Siddhant, Johnson, Tsai, Arivazhagan, Riesa,
199
+ Bapna, Firat, and Raman}]{siddhant2019evaluating}
200
+ Aditya Siddhant, Melvin Johnson, Henry Tsai, Naveen Arivazhagan, Jason Riesa,
201
+ Ankur Bapna, Orhan Firat, and Karthik Raman. 2019.
202
+ \newblock Evaluating the cross-lingual effectiveness of massively multilingual
203
+ neural machine translation.
204
+ \newblock \emph{AAAI}.
205
+
206
+ \bibitem[{Singh et~al.(2019)Singh, McCann, Keskar, Xiong, and
207
+ Socher}]{singh2019xlda}
208
+ Jasdeep Singh, Bryan McCann, Nitish~Shirish Keskar, Caiming Xiong, and Richard
209
+ Socher. 2019.
210
+ \newblock XLDA: Cross-lingual data augmentation for natural language inference
211
+ and question answering.
212
+ \newblock \emph{arXiv preprint arXiv:1905.11471}.
213
+
214
+ \bibitem[{Socher et~al.(2013)Socher, Perelygin, Wu, Chuang, Manning, Ng, and
215
+ Potts}]{socher2013recursive}
216
+ Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher~D Manning,
217
+ Andrew Ng, and Christopher Potts. 2013.
218
+ \newblock Recursive deep models for semantic compositionality over a sentiment
219
+ treebank.
220
+ \newblock In \emph{EMNLP}, pages 1631--1642.
221
+
222
+ \bibitem[{Tan et~al.(2019)Tan, Ren, He, Qin, Zhao, and
223
+ Liu}]{tan2019multilingual}
224
+ Xu~Tan, Yi~Ren, Di~He, Tao Qin, Zhou Zhao, and Tie-Yan Liu. 2019.
225
+ \newblock Multilingual neural machine translation with knowledge distillation.
226
+ \newblock \emph{ICLR}.
227
+
228
+ \bibitem[{Tjong Kim~Sang and De~Meulder(2003)}]{tjong2003introduction}
229
+ Erik~F Tjong Kim~Sang and Fien De~Meulder. 2003.
230
+ \newblock Introduction to the CoNLL-2003 shared task: Language-independent
231
+ named entity recognition.
232
+ \newblock In \emph{CoNLL}, pages 142--147. Association for Computational
233
+ Linguistics.
234
+
235
+ \bibitem[{Vaswani et~al.(2017)Vaswani, Shazeer, Parmar, Uszkoreit, Jones,
236
+ Gomez, Kaiser, and Polosukhin}]{transformer17}
237
+ Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
238
+ Aidan~N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017.
239
+ \newblock Attention is all you need.
240
+ \newblock In \emph{Advances in Neural Information Processing Systems}, pages
241
+ 6000--6010.
242
+
243
+ \bibitem[{Wang et~al.(2018)Wang, Singh, Michael, Hill, Levy, and
244
+ Bowman}]{wang2018glue}
245
+ Alex Wang, Amapreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel~R
246
+ Bowman. 2018.
247
+ \newblock GLUE: A multi-task benchmark and analysis platform for natural
248
+ language understanding.
249
+ \newblock \emph{arXiv preprint arXiv:1804.07461}.
250
+
251
+ \bibitem[{Wenzek et~al.(2019)Wenzek, Lachaux, Conneau, Chaudhary, Guzman,
252
+ Joulin, and Grave}]{wenzek2019ccnet}
253
+ Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary,
254
+ Francisco Guzman, Armand Joulin, and Edouard Grave. 2019.
255
+ \newblock CCNet: Extracting high-quality monolingual datasets from web crawl
256
+ data.
257
+ \newblock \emph{arXiv preprint arXiv:1911.00359}.
258
+
259
+ \bibitem[{Williams et~al.(2017)Williams, Nangia, and
260
+ Bowman}]{williams2017broad}
261
+ Adina Williams, Nikita Nangia, and Samuel~R Bowman. 2017.
262
+ \newblock A broad-coverage challenge corpus for sentence understanding through
263
+ inference.
264
+ \newblock \emph{Proceedings of the 2nd Workshop on Evaluating Vector-Space
265
+ Representations for NLP}.
266
+
267
+ \bibitem[{Wu et~al.(2019)Wu, Conneau, Li, Zettlemoyer, and
268
+ Stoyanov}]{wu2019emerging}
269
+ Shijie Wu, Alexis Conneau, Haoran Li, Luke Zettlemoyer, and Veselin Stoyanov.
270
+ 2019.
271
+ \newblock Emerging cross-lingual structure in pretrained language models.
272
+ \newblock \emph{ACL}.
273
+
274
+ \bibitem[{Wu and Dredze(2019)}]{wu2019beto}
275
+ Shijie Wu and Mark Dredze. 2019.
276
+ \newblock Beto, bentz, becas: The surprising cross-lingual effectiveness of
277
+ bert.
278
+ \newblock \emph{EMNLP}.
279
+
280
+ \bibitem[{Xie et~al.(2019)Xie, Dai, Hovy, Luong, and Le}]{xie2019unsupervised}
281
+ Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc~V Le. 2019.
282
+ \newblock Unsupervised data augmentation for consistency training.
283
+ \newblock \emph{arXiv preprint arXiv:1904.12848}.
284
+
285
+ \end{thebibliography}
references/2019.arxiv.conneau/source/XLMR Paper/xlmr.synctex ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:420af1ab9f337834c49b93240fd9062be0a9f1bd9135878e6c96a6d128aa6856
3
+ size 865236
references/2019.arxiv.conneau/source/XLMR Paper/xlmr.tex ADDED
@@ -0,0 +1,307 @@
1
+
2
+ %
3
+ % File acl2020.tex
4
+ %
5
+ %% Based on the style files for ACL 2020, which were
6
+ %% Based on the style files for ACL 2018, NAACL 2018/19, which were
7
+ %% Based on the style files for ACL-2015, with some improvements
8
+ %% taken from the NAACL-2016 style
9
+ %% Based on the style files for ACL-2014, which were, in turn,
10
+ %% based on ACL-2013, ACL-2012, ACL-2011, ACL-2010, ACL-IJCNLP-2009,
11
+ %% EACL-2009, IJCNLP-2008...
12
+ %% Based on the style files for EACL 2006 by
13
+ %%e.agirre@ehu.es or Sergi.Balari@uab.es
14
+ %% and that of ACL 08 by Joakim Nivre and Noah Smith
15
+
16
+ \documentclass[11pt,a4paper]{article}
17
+ \usepackage[hyperref]{acl2020}
18
+ \usepackage{times}
19
+ \usepackage{latexsym}
20
+ \renewcommand{\UrlFont}{\ttfamily\small}
21
+
22
+ % This is not strictly necessary, and may be commented out,
23
+ % but it will improve the layout of the manuscript,
24
+ % and will typically save some space.
25
+ \usepackage{microtype}
26
+ \usepackage{graphicx}
27
+ \usepackage{subfigure}
28
+ \usepackage{booktabs} % for professional tables
29
+ \usepackage{url}
30
+ \usepackage{times}
31
+ \usepackage{latexsym}
32
+ \usepackage{array}
33
+ \usepackage{adjustbox}
34
+ \usepackage{multirow}
35
+ % \usepackage{subcaption}
36
+ \usepackage{hyperref}
37
+ \usepackage{longtable}
38
+
39
+ \input{content/tables}
40
+
41
+
42
+ \aclfinalcopy % Uncomment this line for the final submission
43
+ \def\aclpaperid{479} % Enter the acl Paper ID here
44
+
45
+ %\setlength\titlebox{5cm}
46
+ % You can expand the titlebox if you need extra space
47
+ % to show all the authors. Please do not make the titlebox
48
+ % smaller than 5cm (the original size); we will check this
49
+ % in the camera-ready version and ask you to change it back.
50
+
51
+ \newcommand\BibTeX{B\textsc{ib}\TeX}
52
+ \usepackage{xspace}
53
+ \newcommand{\xlmr}{\textit{XLM-R}\xspace}
54
+ \newcommand{\mbert}{mBERT\xspace}
55
+ \newcommand{\XX}{\textcolor{red}{XX}\xspace}
56
+
57
+ \newcommand{\note}[3]{{\color{#2}[#1: #3]}}
58
+ \newcommand{\ves}[1]{\note{ves}{red}{#1}}
59
+ \newcommand{\luke}[1]{\note{luke}{green}{#1}}
60
+ \newcommand{\myle}[1]{\note{myle}{cyan}{#1}}
61
+ \newcommand{\paco}[1]{\note{paco}{blue}{#1}}
62
+ \newcommand{\eg}[1]{\note{edouard}{orange}{#1}}
63
+ \newcommand{\kk}[1]{\note{kartikay}{pink}{#1}}
64
+
65
+ \renewcommand{\UrlFont}{\scriptsize}
66
+ \title{Unsupervised Cross-lingual Representation Learning at Scale}
67
+
68
+ \author{Alexis Conneau\thanks{\ \ Equal contribution.} \space\space\space
69
+ Kartikay Khandelwal\footnotemark[1] \space\space\space \AND
70
+ \bf Naman Goyal \space\space\space
71
+ Vishrav Chaudhary \space\space\space
72
+ Guillaume Wenzek \space\space\space
73
+ Francisco Guzm\'an \space\space\space \AND
74
+ \bf Edouard Grave \space\space\space
75
+ Myle Ott \space\space\space
76
+ Luke Zettlemoyer \space\space\space
77
+ Veselin Stoyanov \space\space\space \\ \\ \\
78
+ \bf Facebook AI
79
+ }
80
+
81
+ \date{}
82
+
83
+ \begin{document}
84
+ \maketitle
85
+ \begin{abstract}
86
+ This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed \xlmr, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6\% average accuracy on XNLI, +13\% average F1 score on MLQA, and +2.4\% F1 score on NER. \xlmr performs particularly well on low-resource languages, improving XNLI accuracy by 15.7\% for Swahili and 11.4\% for Urdu over previous XLM models. We also present a detailed empirical analysis of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high- and low-resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; \xlmr is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make our code, data and models publicly available.{\let\thefootnote\relax\footnotetext{\scriptsize Correspondence to {\tt \{aconneau,kartikayk\}@fb.com}}}\footnote{\url{https://github.com/facebookresearch/(fairseq-py,pytext,xlm)}}
87
+ \end{abstract}
88
+
89
+
90
+ \section{Introduction}
91
+
92
+ The goal of this paper is to improve cross-lingual language understanding (XLU) by carefully studying the effects of training unsupervised cross-lingual representations at a very large scale.
93
+ We present \xlmr, a Transformer-based multilingual masked language model pretrained on text in 100 languages, which obtains state-of-the-art performance on cross-lingual classification, sequence labeling and question answering.
94
+
95
+ Multilingual masked language models (MLMs) like \mbert~\cite{devlin2018bert} and XLM \cite{lample2019cross} have pushed the state of the art on cross-lingual understanding tasks by jointly pretraining large Transformer models~\cite{transformer17} on many languages. These models allow for effective cross-lingual transfer, as seen in a number of benchmarks including cross-lingual natural language inference~\cite{bowman2015large,williams2017broad,conneau2018xnli}, question answering~\cite{rajpurkar-etal-2016-squad,lewis2019mlqa}, and named entity recognition~\cite{Pires2019HowMI,wu2019beto}.
96
+ However, all of these studies pre-train on Wikipedia, which provides a relatively limited scale especially for lower resource languages.
97
+
98
+
99
+ In this paper, we first present a comprehensive analysis of the trade-offs and limitations of multilingual language models at scale, inspired by recent monolingual scaling efforts~\cite{roberta2019}.
100
+ We measure the trade-off between high-resource and low-resource languages and the impact of language sampling and vocabulary size.
101
+ %By training models with an increasing number of languages,
102
+ The experiments expose a trade-off as we scale the number of languages for a fixed model capacity: adding more languages improves cross-lingual performance on low-resource languages up to a point, after which the overall performance on monolingual and cross-lingual benchmarks degrades. We refer to this trade-off as the \emph{curse of multilinguality}, and show that it can be alleviated by simply increasing model capacity.
103
+ We argue, however, that this remains an important limitation for future XLU systems which may aim to improve performance with more modest computational budgets.
104
+
105
+ Our best model, XLM-RoBERTa (\xlmr), outperforms \mbert on cross-lingual classification by up to 23\% accuracy on low-resource languages.
106
+ %like Swahili and Urdu.
107
+ It outperforms the previous state of the art by 5.1\% average accuracy on XNLI, 2.42\% average F1-score on Named Entity Recognition, and 9.1\% average F1-score on cross-lingual Question Answering. We also evaluate monolingual fine-tuning on the GLUE and XNLI benchmarks, where \xlmr obtains results competitive with state-of-the-art monolingual models, including RoBERTa \cite{roberta2019}.
108
+ These results demonstrate, for the first time, that it is possible to have a single large model for all languages, without sacrificing per-language performance.
109
+ We will make our code, models and data publicly available, with the hope that this will help research in multilingual NLP and low-resource language understanding.
110
+
111
+ \section{Related Work}
112
+ From pretrained word embeddings~\citep{mikolov2013distributed, pennington2014glove} to pretrained contextualized representations~\citep{peters2018deep,schuster2019cross} and transformer based language models~\citep{radford2018improving,devlin2018bert}, unsupervised representation learning has significantly improved the state of the art in natural language understanding. Parallel work on cross-lingual understanding~\citep{mikolov2013exploiting,schuster2019cross,lample2019cross} extends these systems to more languages and to the cross-lingual setting in which a model is learned in one language and applied in other languages.
113
+
114
+ Most recently, \citet{devlin2018bert} and \citet{lample2019cross} introduced \mbert and XLM, masked language models trained on multiple languages without any cross-lingual supervision.
115
+ \citet{lample2019cross} propose translation language modeling (TLM) as a way to leverage parallel data and obtain a new state of the art on the cross-lingual natural language inference (XNLI) benchmark~\cite{conneau2018xnli}.
116
+ They further show strong improvements on unsupervised machine translation and pretraining for sequence generation. \citet{wu2019emerging} show that monolingual BERT representations are similar across languages, explaining in part the natural emergence of multilinguality in bottleneck architectures. Separately, \citet{Pires2019HowMI} demonstrated the effectiveness of multilingual models like \mbert on sequence labeling tasks. \citet{huang2019unicoder} showed gains over XLM using cross-lingual multi-task learning, and \citet{singh2019xlda} demonstrated the effectiveness of cross-lingual data augmentation for cross-lingual NLI. However, all of this work was at a relatively modest scale, in terms of the amount of training data, compared to our approach.
117
+
118
+ \insertWikivsCC
119
+
120
+ The benefits of scaling language model pretraining by increasing the size of the model as well as the training data have been extensively studied in the literature. For the monolingual case, \citet{jozefowicz2016exploring} show how large-scale LSTM models can obtain much stronger performance on language modeling benchmarks when trained on billions of tokens.
121
122
+ GPT~\cite{radford2018improving} also highlights the importance of scaling the amount of data, and RoBERTa~\cite{roberta2019} shows that training BERT longer on more data leads to a significant boost in performance. Inspired by RoBERTa, we show that mBERT and XLM are undertuned, and that simple improvements in the learning procedure of unsupervised MLM lead to much better performance. We train on cleaned CommonCrawl data~\cite{wenzek2019ccnet}, which increases the amount of data for low-resource languages by two orders of magnitude on average. Similar data has also been shown to be effective for learning high-quality word embeddings in multiple languages~\cite{grave2018learning}.
123
+
124
+
125
+ Several efforts have trained massively multilingual machine translation models from large parallel corpora. They uncover the trade-off between high-resource and low-resource languages and the problem of capacity dilution~\citep{johnson2017google,tan2019multilingual}. The work most similar to ours is \citet{arivazhagan2019massively}, which trains a single model in 103 languages on over 25 billion parallel sentences.
126
+ \citet{siddhant2019evaluating} further analyze the representations obtained by the encoder of a massively multilingual machine translation system and show that it obtains similar results to mBERT on cross-lingual NLI.
127
128
+ Our work, in contrast, focuses on the unsupervised learning of cross-lingual representations and their transfer to discriminative tasks.
129
+
130
+
131
+ \section{Model and Data}
132
+ \label{sec:model+data}
133
+
134
+ In this section, we present the training objective, languages, and data we use. We follow the XLM approach~\cite{lample2019cross} as closely as possible, only introducing changes that improve performance at scale.
135
+
136
+ \paragraph{Masked Language Models.}
137
+ We use a Transformer model~\cite{transformer17} trained with the multilingual MLM objective~\cite{devlin2018bert,lample2019cross} using only monolingual data. We sample streams of text from each language and train the model to predict the masked tokens in the input.
138
+ We apply subword tokenization directly on raw text data using SentencePiece~\cite{kudo2018sentencepiece} with a unigram language model~\cite{kudo2018subword}. We sample batches from different languages using the same sampling distribution as \citet{lample2019cross}, but with $\alpha=0.3$. Unlike \citet{lample2019cross}, we do not use language embeddings, which allows our model to better deal with code-switching. We use a large vocabulary size of 250K with a full softmax and train two different models: \xlmr\textsubscript{Base} (L = 12, H = 768, A = 12, 270M params) and \xlmr (L = 24, H = 1024, A = 16, 550M params). For all of our ablation studies, we use a BERT\textsubscript{Base} architecture with a vocabulary of 150K tokens. Appendix~\ref{sec:appendix_B} gives more details about the architectures of the different models referenced in this paper.
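The exponentially smoothed sampling distribution described above can be sketched as follows. This is a minimal illustration of $q_i \propto p_i^{\alpha}$, where $p_i$ is language $i$'s share of the corpus in sentences; the corpus sizes are made-up toy values, not the actual CC-100 statistics.

```python
# Sketch of the exponentially smoothed language-sampling distribution:
# q_i is proportional to p_i**alpha, where p_i is language i's share of
# the corpus in sentences. Corpus sizes below are toy values.
def sampling_distribution(n_sentences, alpha=0.3):
    total = sum(n_sentences.values())
    p = {lang: n / total for lang, n in n_sentences.items()}
    z = sum(pi ** alpha for pi in p.values())
    return {lang: (pi ** alpha) / z for lang, pi in p.items()}

# English is 1000x larger than Swahili in this toy corpus.
corpus = {"en": 1_000_000, "fr": 500_000, "sw": 1_000}
q = sampling_distribution(corpus, alpha=0.3)
# Smoothing upweights the low-resource language far beyond its raw share
# (~0.07% of sentences), while high-resource languages are downweighted.
print({lang: round(qi, 3) for lang, qi in q.items()})
```

Lowering $\alpha$ flattens the distribution further, which is why smaller values favor low-resource languages in the trade-off studied in Section 5.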
139
+
140
+ \paragraph{Scaling to a hundred languages.}
141
+ \xlmr is trained on 100 languages;
142
+ we provide a full list of languages and associated statistics in Appendix~\ref{sec:appendix_A}. Figure~\ref{fig:wikivscc} specifies the ISO codes of the 88 languages shared between \xlmr and XLM-100, the model from \citet{lample2019cross} trained on Wikipedia text in 100 languages.
143
+
144
+ Compared to previous work, we replace some languages with more commonly used ones such as romanized Hindi and traditional Chinese. In our ablation studies, we always include the 7 languages for which we have classification and sequence labeling evaluation benchmarks: English, French, German, Russian, Chinese, Swahili and Urdu. We chose this set as it covers a suitable range of language families and includes low-resource languages such as Swahili and Urdu.
145
+ We also consider larger sets of 15, 30, 60 and all 100 languages. When reporting results on high-resource and low-resource languages, we refer to the average of English and French results, and the average of Swahili and Urdu results, respectively.
146
+
147
+ \paragraph{Scaling the Amount of Training Data.}
148
+ Following \citet{wenzek2019ccnet}\footnote{\url{https://github.com/facebookresearch/cc_net}}, we build a clean CommonCrawl corpus in 100 languages. We use an internal language identification model in combination with the one from fastText~\cite{joulin2017bag}. We train a language model in each language and use it to filter documents as described in \citet{wenzek2019ccnet}. We consider one CommonCrawl dump for English and twelve dumps for all other languages, which significantly increases dataset sizes, especially for low-resource languages like Burmese and Swahili.
149
+
150
+ Figure~\ref{fig:wikivscc} shows the difference in size between the Wikipedia corpora used by mBERT and XLM-100 and the CommonCrawl corpus we use. As we show in Section~\ref{sec:multimono}, monolingual Wikipedia corpora are too small to enable unsupervised representation learning. Based on our experiments, we found that a few hundred MiB of text data is usually the minimum needed to learn a BERT model.
151
+
152
+ \section{Evaluation}
153
+ We consider four evaluation benchmarks.
154
+ For cross-lingual understanding, we use cross-lingual natural language inference, named entity recognition, and question answering. We use the GLUE benchmark to evaluate the English performance of \xlmr and compare it to other state-of-the-art models.
155
+
156
+ \paragraph{Cross-lingual Natural Language Inference (XNLI).}
157
+ The XNLI dataset comes with ground-truth dev and test sets in 15 languages, and a ground-truth English training set. The training set has been machine-translated to the remaining 14 languages, providing synthetic training data for these languages as well. We evaluate our model on cross-lingual transfer from English to other languages. We also consider three machine translation baselines: (i) \textit{translate-test}: dev and test sets are machine-translated to English and a single English model is used; (ii) \textit{translate-train} (per-language): the English training set is machine-translated to each language and we fine-tune a multilingual model on each training set; (iii) \textit{translate-train-all} (multi-language): we fine-tune a multilingual model on the concatenation of all training sets from translate-train. For the translations, we use the official data provided by the XNLI project.
158
159
+
160
+ \paragraph{Named Entity Recognition.}
161
+ % WikiAnn http://nlp.cs.rpi.edu/wikiann/
162
+ For NER, we consider the CoNLL-2002~\cite{sang2002introduction} and CoNLL-2003~\cite{tjong2003introduction} datasets in English, Dutch, Spanish and German. We fine-tune multilingual models either (1) on the English set to evaluate cross-lingual transfer, (2) on each set to evaluate per-language performance, or (3) on all sets to evaluate multilingual learning. We report the F1 score, and compare to baselines from \citet{lample-etal-2016-neural} and \citet{akbik2018coling}.
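For reference, CoNLL NER is scored with entity-level F1: a predicted entity counts as correct only when both its span and its type match a gold entity exactly. A minimal sketch of that metric over (start, end, type) tuples follows; the spans are toy examples, not data from the benchmark.

```python
# Entity-level F1 as used for CoNLL NER: exact span-and-type match,
# micro-averaged over entities. Toy example spans below.
def entity_f1(gold, pred):
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                       # exact span + type matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [(0, 2, "PER"), (5, 6, "LOC")]
pred = [(0, 2, "PER"), (5, 6, "ORG")]           # right span, wrong type
assert entity_f1(gold, pred) == 0.5
```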
163
+
164
+ \paragraph{Cross-lingual Question Answering.}
165
+ We use the MLQA benchmark from \citet{lewis2019mlqa}, which extends the English SQuAD benchmark to Spanish, German, Arabic, Hindi, Vietnamese and Chinese. We report the F1 score as well as the exact match (EM) score for cross-lingual transfer from English.
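MLQA inherits SQuAD-style scoring: exact match (EM) on the answer string and token-overlap F1. The sketch below is a simplified version of those metrics; the official evaluation script additionally strips punctuation and articles before comparing.

```python
# Simplified SQuAD/MLQA answer metrics: exact match and token-overlap F1.
# The official script also normalizes punctuation and articles.
from collections import Counter

def exact_match(pred, gold):
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())  # shared tokens, with multiplicity
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

assert exact_match("Paris", " paris ") == 1.0
assert abs(token_f1("the cat", "the black cat") - 0.8) < 1e-9
```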
166
+
167
+ \paragraph{GLUE Benchmark.}
168
+ Finally, we evaluate the English performance of our model on the GLUE benchmark~\cite{wang2018glue}, which gathers multiple classification tasks such as MNLI~\cite{williams2017broad}, SST-2~\cite{socher2013recursive}, and QNLI~\cite{rajpurkar2018know}. We use BERT\textsubscript{Large} and RoBERTa as baselines.
169
+
170
+ \section{Analysis and Results}
171
+ \label{sec:analysis}
172
+
173
+ In this section, we perform a comprehensive analysis of multilingual masked language models. We conduct most of the analysis on XNLI, which we found to be representative of our findings on other tasks. We then present the results of \xlmr on cross-lingual understanding and GLUE. Finally, we compare multilingual and monolingual models, and present results on low-resource languages.
174
+
175
+ \subsection{Improving and Understanding Multilingual Masked Language Models}
176
+ % prior analysis necessary to build \xlmr
177
+ \insertAblationone
178
+ \insertAblationtwo
179
+
180
+ Much of the work done on understanding the cross-lingual effectiveness of \mbert or XLM~\cite{Pires2019HowMI,wu2019beto,lewis2019mlqa} has focused on analyzing the performance of fixed pretrained models on downstream tasks. In this section, we present a comprehensive study of different factors that are important to \textit{pretraining} large scale multilingual models. We highlight the trade-offs and limitations of these models as we scale to one hundred languages.
181
+
182
+ \paragraph{Transfer-dilution Trade-off and Curse of Multilinguality.}
183
+ Model capacity (i.e., the number of parameters in the model) is constrained due to practical considerations such as memory and speed during training and inference. For a fixed-size model, the per-language capacity decreases as we increase the number of languages. While low-resource language performance can be improved by adding similar higher-resource languages during pretraining, the overall downstream performance suffers from this capacity dilution~\cite{arivazhagan2019massively}. Positive transfer and capacity dilution have to be traded off against each other.
184
+
185
+ We illustrate this trade-off in Figure~\ref{fig:transfer_dilution}, which shows XNLI performance vs the number of languages the model is pretrained on. Initially, as we go from 7 to 15 languages, the model is able to take advantage of positive transfer which improves performance, especially on low resource languages. Beyond this point the {\em curse of multilinguality}
186
+ kicks in and degrades performance across all languages. Specifically, the overall XNLI accuracy decreases from 71.8\% to 67.7\% as we go from XLM-7 to XLM-100. The same trend can be observed for models trained on the larger CommonCrawl Corpus.
187
+
188
+ The issue is even more prominent when the capacity of the model is small. To show this, we pretrain models on Wikipedia Data in 7, 30 and 100 languages. As we add more languages, we make the Transformer wider by increasing the hidden size from 768 to 960 to 1152. In Figure~\ref{fig:capacity}, we show that the added capacity allows XLM-30 to be on par with XLM-7, thus overcoming the curse of multilinguality. The added capacity for XLM-100, however, is not enough
189
+ and it still lags behind due to higher vocabulary dilution (recall from Section~\ref{sec:model+data} that we used a fixed vocabulary size of 150K for all models).
190
+
191
+ \paragraph{High-resource vs Low-resource Trade-off.}
192
+ The allocation of the model capacity across languages is controlled by several parameters: the training set size, the size of the shared subword vocabulary, and the rate at which we sample training examples from each language. We study the effect of sampling on the performance of high-resource (English and French) and low-resource (Swahili and Urdu) languages for an XLM-100 model trained on Wikipedia (we observe a similar trend for the construction of the subword vocabulary). Specifically, we investigate the impact of varying the $\alpha$ parameter, which controls the exponential smoothing of the language sampling rate. Similar to \citet{lample2019cross}, we use a sampling rate proportional to the number of sentences in each corpus. Models trained with higher values of $\alpha$ see batches of high-resource languages more often.
193
+ Figure~\ref{fig:alpha} shows that the higher the value of $\alpha$, the better the performance on high-resource languages, and vice-versa. When considering overall performance, we found $0.3$ to be an optimal value for $\alpha$, and use this for \xlmr.
194
+
195
+ \paragraph{Importance of Capacity and Vocabulary.}
196
+ In previous sections and in Figure~\ref{fig:capacity}, we showed the importance of scaling the model size as we increase the number of languages. Similar to the overall model size, we argue that scaling the size of the shared vocabulary (the vocabulary capacity) can improve the performance of multilingual models on downstream tasks. To illustrate this effect, we train XLM-100 models on Wikipedia data with different vocabulary sizes. We keep the overall number of parameters constant by adjusting the width of the Transformer. Figure~\ref{fig:vocab} shows that even with a fixed capacity, we observe a 2.8\% increase in XNLI average accuracy as we increase the vocabulary size from 32K to 256K. This suggests that multilingual models can benefit from allocating a higher proportion of the total number of parameters to the embedding layer, even though this reduces the size of the Transformer.
197
+ %With bigger models, we believe that using a vocabulary of up to 2 million tokens with an adaptive softmax~\cite{grave2017efficient,baevski2018adaptive} should improve performance even further, but we leave this exploration to future work.
198
+ For simplicity, and given the softmax computational constraints, we use a vocabulary of 250K tokens for \xlmr.
199
+
200
+ We further illustrate the importance of this parameter by training three models with the same Transformer architecture (BERT\textsubscript{Base}) but with different vocabulary sizes: 128K, 256K and 512K. We observe more than 3\% gains in overall accuracy on XNLI by simply increasing the vocabulary size from 128K to 512K.
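The vocabulary-capacity trade-off above is easy to quantify: with a fixed parameter budget, a larger shared vocabulary moves parameters from the Transformer body into the embedding matrix. The sketch below is back-of-the-envelope arithmetic; the 250K-vocabulary, hidden-size-1024, 550M-parameter figures come from the model description in Section 3, while the BERT-Base width of 768 is used purely for illustration.

```python
# Back-of-the-envelope view of "vocabulary capacity": at a fixed total budget,
# a bigger vocabulary means a bigger embedding matrix and a narrower body.
def embedding_params(vocab_size, hidden):
    # Tied input/output embeddings: one vocab_size x hidden matrix.
    return vocab_size * hidden

hidden = 768  # BERT-Base width, for illustration
for vocab in (32_000, 128_000, 256_000):
    print(f"vocab={vocab:>7,}: {embedding_params(vocab, hidden) / 1e6:6.1f}M embedding params")

# XLM-R: 250K vocabulary at hidden size 1024, out of ~550M total parameters.
share = embedding_params(250_000, 1024) / 550e6
print(f"embedding share of XLM-R parameters: {share:.0%}")
```

At 250K tokens the embeddings alone account for roughly 256M parameters, nearly half the 550M total, which is why the softmax cost caps the vocabulary size in practice.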
201
+
202
+ \paragraph{Larger-scale Datasets and Training.}
203
+ As shown in Figure~\ref{fig:wikivscc}, the CommonCrawl Corpus that we collected has significantly more monolingual data than the previously used Wikipedia corpora. Figure~\ref{fig:curse} shows that for the same BERT\textsubscript{Base} architecture, all models trained on CommonCrawl obtain significantly better performance.
204
+
205
+ Apart from scaling the training data, \citet{roberta2019} also showed the benefits of training MLMs longer. In our experiments, we observed similar effects of large-scale training, such as increasing batch size (see Figure~\ref{fig:batch}) and training time, on model performance. Specifically, we found that using validation perplexity as a stopping criterion for pretraining caused the multilingual MLM in \citet{lample2019cross} to be under-tuned. In our experience, performance on downstream tasks continues to improve even after validation perplexity has plateaued. Combining this observation with our implementation of the unsupervised XLM-MLM objective, we were able to improve the performance of \citet{lample2019cross} from 71.3\% to more than 75\% average accuracy on XNLI, which was on par with their supervised translation language modeling (TLM) objective. Based on these results, and given our focus on unsupervised learning, we decided not to use the supervised TLM objective for training our models.
206
+
207
+
208
+ \paragraph{Simplifying Multilingual Tokenization with SentencePiece.}
209
+ The different language-specific tokenization tools
210
+ used by mBERT and XLM-100 make these models more difficult to use on raw text. Instead, we train a SentencePiece model (SPM) and apply it directly on raw text data for all languages. We did not observe any loss in performance for models trained with SPM compared to models trained with language-specific preprocessing and byte-pair encoding (see Figure~\ref{fig:batch}), and hence use SPM for \xlmr.
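The unigram model behind SentencePiece scores a segmentation by the product of its piece probabilities and picks the best one by dynamic programming. The toy sketch below illustrates that search only; the real library learns the vocabulary and probabilities from data, and the hand-picked pieces and probabilities here are purely hypothetical ("▁" marks a word boundary, as in SentencePiece).

```python
import math

# Toy Viterbi segmentation in the spirit of SentencePiece's unigram LM.
# The vocabulary and probabilities are hand-picked for illustration only.
def segment(text, piece_probs):
    n = len(text)
    best = [-math.inf] * (n + 1)   # best log-prob of a segmentation of text[:i]
    back = [0] * (n + 1)           # split point achieving best[i]
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - 10), i):   # cap piece length at 10 chars
            piece = text[j:i]
            if piece in piece_probs and best[j] + math.log(piece_probs[piece]) > best[i]:
                best[i] = best[j] + math.log(piece_probs[piece])
                back[i] = j
    if best[n] == -math.inf:
        return None                # no segmentation with this vocabulary
    pieces, i = [], n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]

vocab = {"\u2581un": 0.05, "token": 0.02, "iz": 0.01, "ized": 0.008, "ed": 0.04}
print(segment("\u2581untokenized", vocab))  # highest-probability split
```

Because the model operates on raw character streams, the same procedure applies unchanged across all 100 languages, which is the simplification the paragraph above refers to.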
211
+
212
+ \subsection{Cross-lingual Understanding Results}
213
+ Based on these results, we adapt the setting of \citet{lample2019cross} and use a large Transformer model with 24 layers and 1024 hidden states, with a 250K vocabulary. We use the multilingual MLM loss and train our \xlmr model for 1.5 million updates on five hundred 32GB Nvidia V100 GPUs with a batch size of 8192. We leverage the SPM-preprocessed text data from CommonCrawl in 100 languages and sample languages with $\alpha=0.3$. In this section, we show that it outperforms all previous techniques on cross-lingual benchmarks while achieving performance on par with RoBERTa on the GLUE benchmark.
214
+
215
+
216
+ \insertXNLItable
217
+
218
+ \paragraph{XNLI.}
219
+ Table~\ref{tab:xnli} shows XNLI results and adds some additional details: (i) the number of models the approach induces (\#M), (ii) the data on which the model was trained (D), and (iii) the number of languages the model was pretrained on (\#lg). As we show in our results, these parameters significantly impact performance. Column \#M specifies whether model selection was done separately on the dev set of each language ($N$ models) or on the joint dev set of all the languages (single model). We observe a 0.6\% decrease in overall accuracy when we go from $N$ models to a single model (from 71.3 to 70.7). We encourage the community to adopt this setting. For cross-lingual transfer, while this approach is not fully zero-shot transfer, we argue that in real applications a small amount of supervised data is often available for validation in each language.
220
+
221
+ \xlmr sets a new state of the art on XNLI.
222
+ On cross-lingual transfer, \xlmr obtains 80.9\% accuracy, outperforming the XLM-100 and \mbert open-source models by 10.2\% and 14.6\% average accuracy. On the low-resource Swahili and Urdu languages, \xlmr outperforms XLM-100 by 15.7\% and 11.4\%, and \mbert by 23.5\% and 15.8\%. While \xlmr handles 100 languages, we also show that it outperforms the previous state of the art, Unicoder~\citep{huang2019unicoder} and XLM (MLM+TLM), which handle only 15 languages, by 5.5\% and 5.8\% average accuracy respectively. Using the multilingual training of translate-train-all, \xlmr further improves performance and reaches 83.6\% accuracy, a new overall state of the art for XNLI, outperforming Unicoder by 5.1\%. Multilingual training is similar to practical applications where training sets are available in various languages for the same task. In the case of XNLI, datasets have been translated, and translate-train-all can be seen as some form of cross-lingual data augmentation~\cite{singh2019xlda}, similar to back-translation~\cite{xie2019unsupervised}.
223
+
224
+ \insertNER
225
+ \paragraph{Named Entity Recognition.}
226
+ In Table~\ref{tab:ner}, we report results of \xlmr and \mbert on CoNLL-2002 and CoNLL-2003. We consider the LSTM + CRF approach from \citet{lample-etal-2016-neural} and the Flair model from \citet{akbik2018coling} as baselines. We evaluate the performance of the model on each of the target languages in three different settings: (i) train on English data only (en); (ii) train on data in the target language (each); (iii) train on data in all languages (all). Results of \mbert are reported from \citet{wu2019beto}. Note that we do not use a linear-chain CRF on top of \xlmr and \mbert representations, which gives an advantage to \citet{akbik2018coling}. Without the CRF, our \xlmr model still performs on par with the state of the art, outperforming \citet{akbik2018coling} on Dutch by $2.09$ points. On this task, \xlmr also outperforms \mbert by 2.42 F1 on average for cross-lingual transfer, and by 1.86 F1 when trained on each language. Training on all languages leads to an average F1 score of 89.43\%, outperforming the cross-lingual transfer approach by 8.49\%.
227
+
228
+ \paragraph{Question Answering.}
229
+ We also obtain new state of the art results on the MLQA cross-lingual question answering benchmark, introduced by \citet{lewis2019mlqa}. We follow their procedure by training on the English training data and evaluating on the 7 languages of the dataset.
230
+ We report results in Table~\ref{tab:mlqa}.
231
+ \xlmr obtains F1 and accuracy scores of 70.7\% and 52.7\% while the previous state of the art was 61.6\% and 43.5\%. \xlmr also outperforms \mbert by 13.0\% F1-score and 11.1\% accuracy. It even outperforms BERT-Large on English, confirming its strong monolingual performance.
232
+
233
+ \insertMLQA
234
+
235
+ \subsection{Multilingual versus Monolingual}
236
+ \label{sec:multimono}
237
+ In this section, we present results of multilingual XLM models against monolingual BERT models.
238
+
239
+ \paragraph{GLUE: \xlmr versus RoBERTa.}
240
+ Our goal is to obtain a multilingual model with strong performance on both cross-lingual understanding tasks and natural language understanding tasks for each language. To that end, we evaluate \xlmr on the GLUE benchmark. We show in Table~\ref{tab:glue} that \xlmr obtains better average dev performance than BERT\textsubscript{Large} by 1.6\% and reaches performance on par with XLNet\textsubscript{Large}. The RoBERTa model outperforms \xlmr by only 1.0\% on average. We believe future work can reduce this gap even further by alleviating the curse of multilinguality and vocabulary dilution. These results demonstrate the possibility of learning one model for many languages while maintaining strong performance on per-language downstream tasks.
241
+
242
+ \insertGlue
243
+
244
+ \paragraph{XNLI: XLM versus BERT.}
245
+ A recurrent criticism against multilingual models is that they obtain worse performance than their monolingual counterparts. In addition to the comparison of \xlmr and RoBERTa, we provide the first comprehensive study to assess this claim on the XNLI benchmark. We extend our comparison between multilingual XLM models and monolingual BERT models on 7 languages and compare performance in Table~\ref{tab:multimono}. We train 14 monolingual BERT models on Wikipedia and CommonCrawl (capped at 60 GiB),
246
+ %\footnote{For simplicity, we use a reduced version of our corpus by capping the size of each monolingual dataset to 60 GiB.}
247
+ and two XLM-7 models. We increase the vocabulary size of the multilingual model for a better comparison.
248
+ % To our surprise - and backed by further study on internal benchmarks -
249
+ We found that \textit{multilingual models can outperform their monolingual BERT counterparts}. Specifically, in Table~\ref{tab:multimono}, we show that for cross-lingual transfer, monolingual baselines outperform XLM-7 for both Wikipedia and CC by 1.6\% and 1.3\% average accuracy. However, by making use of multilingual training (translate-train-all) and leveraging training sets coming from multiple languages, XLM-7 can outperform the BERT models: our XLM-7 trained on CC obtains 80.0\% average accuracy on the 7 languages, while the average performance of BERT models trained on CC is 77.5\%. This is a surprising result that shows that the capacity of multilingual models to leverage training data coming from multiple languages for a particular task can overcome the capacity dilution problem to obtain better overall performance.
250
+
251
+
252
+ \insertMultiMono
253
+
254
+ \subsection{Representation Learning for Low-resource Languages}
255
+ We observed in Table~\ref{tab:multimono} that pretraining on Wikipedia for Swahili and Urdu performed similarly to a randomly initialized model, most likely due to the small size of the data for these languages. On the other hand, pretraining on CC improved performance by up to 10 points. This confirms our assumption that mBERT and XLM-100 rely heavily on cross-lingual transfer but do not model the low-resource languages as well as \xlmr. Specifically, in the translate-train-all setting, we observe that the biggest gains for XLM models trained on CC, compared to their Wikipedia counterparts, are on low-resource languages: 7\% and 4.8\% improvements on Swahili and Urdu, respectively.
256
+
257
+ \section{Conclusion}
258
+ In this work, we introduced \xlmr, our new state-of-the-art multilingual masked language model trained on 2.5 TB of newly created, clean CommonCrawl data in 100 languages. We showed that it provides strong gains over previous multilingual models like \mbert and XLM on classification, sequence labeling and question answering. We exposed the limitations of multilingual MLMs, in particular by uncovering the high-resource versus low-resource trade-off, the curse of multilinguality and the importance of key hyperparameters. We also showed the surprising effectiveness of multilingual models over monolingual models, and strong improvements on low-resource languages.
259
+ % \section*{Acknowledgements}
260
+
261
+
262
+ \bibliography{acl2020}
263
+ \bibliographystyle{acl_natbib}
264
+
265
+
266
+ \newpage
272
+ \clearpage
273
+ \appendix
274
+ \onecolumn
275
+ \section*{Appendix}
276
+ \section{Languages and statistics for CC-100 used by \xlmr}
277
+ In this section, we present the list of languages in the CC-100 corpus we created for training \xlmr. We also report statistics such as the number of tokens and the size of each monolingual corpus.
278
+ \label{sec:appendix_A}
279
+ \insertDataStatistics
280
+
281
+ % \newpage
282
+ \section{Model Architectures and Sizes}
283
+ As we showed in Section~\ref{sec:analysis}, capacity is an important parameter for learning strong cross-lingual representations. In the table below, we list multiple monolingual and multilingual models used by the research community and summarize their architectures and total number of parameters.
284
+ \label{sec:appendix_B}
285
+
286
+ \insertParameters
287
+
288
305
+
306
+
307
+ \end{document}
references/2019.arxiv.conneau/source/acl2020.bib ADDED
@@ -0,0 +1,739 @@
+ @inproceedings{koehn2007moses,
+ title={Moses: Open source toolkit for statistical machine translation},
+ author={Koehn, Philipp and Hoang, Hieu and Birch, Alexandra and Callison-Burch, Chris and Federico, Marcello and Bertoldi, Nicola and Cowan, Brooke and Shen, Wade and Moran, Christine and Zens, Richard and others},
+ booktitle={Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions},
+ pages={177--180},
+ year={2007},
+ organization={Association for Computational Linguistics}
+ }
+
+ @article{xie2019unsupervised,
+ title={Unsupervised data augmentation for consistency training},
+ author={Xie, Qizhe and Dai, Zihang and Hovy, Eduard and Luong, Minh-Thang and Le, Quoc V},
+ journal={arXiv preprint arXiv:1904.12848},
+ year={2019}
+ }
+
+ @article{baevski2018adaptive,
+ title={Adaptive input representations for neural language modeling},
+ author={Baevski, Alexei and Auli, Michael},
+ journal={arXiv preprint arXiv:1809.10853},
+ year={2018}
+ }
+
+ @article{wu2019emerging,
+ title={Emerging Cross-lingual Structure in Pretrained Language Models},
+ author={Wu, Shijie and Conneau, Alexis and Li, Haoran and Zettlemoyer, Luke and Stoyanov, Veselin},
+ journal={ACL},
+ year={2019}
+ }
+
+ @inproceedings{grave2017efficient,
+ title={Efficient softmax approximation for GPUs},
+ author={Grave, Edouard and Joulin, Armand and Ciss{\'e}, Moustapha and J{\'e}gou, Herv{\'e} and others},
+ booktitle={Proceedings of the 34th International Conference on Machine Learning-Volume 70},
+ pages={1302--1310},
+ year={2017},
+ organization={JMLR.org}
+ }
+
+ @article{sang2002introduction,
+ title={Introduction to the CoNLL-2002 Shared Task: Language-Independent Named Entity Recognition},
+ author={Tjong Kim Sang, Erik F.},
+ journal={CoNLL},
+ year={2002}
+ }
+
+ @article{singh2019xlda,
+ title={XLDA: Cross-Lingual Data Augmentation for Natural Language Inference and Question Answering},
+ author={Singh, Jasdeep and McCann, Bryan and Keskar, Nitish Shirish and Xiong, Caiming and Socher, Richard},
+ journal={arXiv preprint arXiv:1905.11471},
+ year={2019}
+ }
+
+ @inproceedings{tjong2003introduction,
+ title={Introduction to the CoNLL-2003 shared task: language-independent named entity recognition},
+ author={Tjong Kim Sang, Erik F and De Meulder, Fien},
+ booktitle={CoNLL},
+ pages={142--147},
+ year={2003},
+ organization={Association for Computational Linguistics}
+ }
+
+ @misc{ud-v2.3,
+ title = {Universal Dependencies 2.3},
+ author = {Nivre, Joakim and others},
+ url = {http://hdl.handle.net/11234/1-2895},
+ note = {{LINDAT}/{CLARIN} digital library at the Institute of Formal and Applied Linguistics ({{\'U}FAL}), Faculty of Mathematics and Physics, Charles University},
+ copyright = {Licence Universal Dependencies v2.3},
+ year = {2018}
+ }
+
+ @article{huang2019unicoder,
+ title={Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks},
+ author={Huang, Haoyang and Liang, Yaobo and Duan, Nan and Gong, Ming and Shou, Linjun and Jiang, Daxin and Zhou, Ming},
+ journal={ACL},
+ year={2019}
+ }
+
+ @article{kingma2014adam,
+ title={Adam: A method for stochastic optimization},
+ author={Kingma, Diederik P and Ba, Jimmy},
+ journal={arXiv preprint arXiv:1412.6980},
+ year={2014}
+ }
+
+ @article{bojanowski2017enriching,
+ title={Enriching word vectors with subword information},
+ author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
+ journal={TACL},
+ volume={5},
+ pages={135--146},
+ year={2017},
+ publisher={MIT Press}
+ }
+
+ @article{werbos1990backpropagation,
+ title={Backpropagation through time: what it does and how to do it},
+ author={Werbos, Paul J},
+ journal={Proceedings of the IEEE},
+ volume={78},
+ number={10},
+ pages={1550--1560},
+ year={1990},
+ publisher={IEEE}
+ }
+
+ @article{hochreiter1997long,
+ title={Long short-term memory},
+ author={Hochreiter, Sepp and Schmidhuber, J{\"u}rgen},
+ journal={Neural computation},
+ volume={9},
+ number={8},
+ pages={1735--1780},
+ year={1997},
+ publisher={MIT Press}
+ }
+
+ @article{al2018character,
+ title={Character-level language modeling with deeper self-attention},
+ author={Al-Rfou, Rami and Choe, Dokook and Constant, Noah and Guo, Mandy and Jones, Llion},
+ journal={arXiv preprint arXiv:1808.04444},
+ year={2018}
+ }
+
+ @misc{dai2019transformerxl,
+ title={Transformer-{XL}: Language Modeling with Longer-Term Dependency},
+ author={Zihang Dai and Zhilin Yang and Yiming Yang and William W. Cohen and Jaime Carbonell and Quoc V. Le and Ruslan Salakhutdinov},
+ year={2019},
+ url={https://openreview.net/forum?id=HJePno0cYm}
+ }
+
+ @article{jozefowicz2016exploring,
+ title={Exploring the limits of language modeling},
+ author={Jozefowicz, Rafal and Vinyals, Oriol and Schuster, Mike and Shazeer, Noam and Wu, Yonghui},
+ journal={arXiv preprint arXiv:1602.02410},
+ year={2016}
+ }
+
+ @inproceedings{mikolov2010recurrent,
+ title={Recurrent neural network based language model},
+ author={Mikolov, Tom{\'a}{\v{s}} and Karafi{\'a}t, Martin and Burget, Luk{\'a}{\v{s}} and {\v{C}}ernock{\`y}, Jan and Khudanpur, Sanjeev},
+ booktitle={Eleventh Annual Conference of the International Speech Communication Association},
+ year={2010}
+ }
+
+ @article{gehring2017convolutional,
+ title={Convolutional sequence to sequence learning},
+ author={Gehring, Jonas and Auli, Michael and Grangier, David and Yarats, Denis and Dauphin, Yann N},
+ journal={arXiv preprint arXiv:1705.03122},
+ year={2017}
+ }
+
+ @article{sennrich2016edinburgh,
+ title={Edinburgh neural machine translation systems for WMT 16},
+ author={Sennrich, Rico and Haddow, Barry and Birch, Alexandra},
+ journal={arXiv preprint arXiv:1606.02891},
+ year={2016}
+ }
+
+ @inproceedings{howard2018universal,
+ title={Universal language model fine-tuning for text classification},
+ author={Howard, Jeremy and Ruder, Sebastian},
+ booktitle={Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
+ volume={1},
+ pages={328--339},
+ year={2018}
+ }
+
+ @inproceedings{unsupNMTartetxe,
+ title = {Unsupervised neural machine translation},
+ author = {Mikel Artetxe and Gorka Labaka and Eneko Agirre and Kyunghyun Cho},
+ booktitle = {International Conference on Learning Representations (ICLR)},
+ year = {2018}
+ }
+
+ @inproceedings{artetxe2017learning,
+ title={Learning bilingual word embeddings with (almost) no bilingual data},
+ author={Artetxe, Mikel and Labaka, Gorka and Agirre, Eneko},
+ booktitle={Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
+ volume={1},
+ pages={451--462},
+ year={2017}
+ }
+
+ @inproceedings{socher2013recursive,
+ title={Recursive deep models for semantic compositionality over a sentiment treebank},
+ author={Socher, Richard and Perelygin, Alex and Wu, Jean and Chuang, Jason and Manning, Christopher D and Ng, Andrew and Potts, Christopher},
+ booktitle={EMNLP},
+ pages={1631--1642},
+ year={2013}
+ }
+
+ @inproceedings{bowman2015large,
+ title={A large annotated corpus for learning natural language inference},
+ author={Bowman, Samuel R. and Angeli, Gabor and Potts, Christopher and Manning, Christopher D.},
+ booktitle={EMNLP},
+ year={2015}
+ }
+
+ @inproceedings{multinli:2017,
+ title = {A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference},
+ author = {Adina Williams and Nikita Nangia and Samuel R. Bowman},
+ booktitle = {NAACL},
+ year = {2017}
+ }
+
+ @article{paszke2017automatic,
+ title={Automatic differentiation in {PyTorch}},
+ author={Paszke, Adam and Gross, Sam and Chintala, Soumith and Chanan, Gregory and Yang, Edward and DeVito, Zachary and Lin, Zeming and Desmaison, Alban and Antiga, Luca and Lerer, Adam},
+ journal={NIPS 2017 Autodiff Workshop},
+ year={2017}
+ }
+
+ @inproceedings{conneau2018craminto,
+ title={What you can cram into a single vector: Probing sentence embeddings for linguistic properties},
+ author={Conneau, Alexis and Kruszewski, German and Lample, Guillaume and Barrault, Lo{\"\i}c and Baroni, Marco},
+ booktitle = {ACL},
+ year={2018}
+ }
+
+ @inproceedings{Conneau:2018:iclr_muse,
+ title={Word Translation without Parallel Data},
+ author={Alexis Conneau and Guillaume Lample and {Marc'Aurelio} Ranzato and Ludovic Denoyer and Herv{\'e} J{\'e}gou},
+ booktitle = {ICLR},
+ year={2018}
+ }
+
+ @article{johnson2017google,
+ title={Google's multilingual neural machine translation system: Enabling zero-shot translation},
+ author={Johnson, Melvin and Schuster, Mike and Le, Quoc V and Krikun, Maxim and Wu, Yonghui and Chen, Zhifeng and Thorat, Nikhil and Vi{\'e}gas, Fernanda and Wattenberg, Martin and Corrado, Greg and others},
+ journal={TACL},
+ volume={5},
+ pages={339--351},
+ year={2017},
+ publisher={MIT Press}
+ }
+
+ @article{radford2019language,
+ title={Language models are unsupervised multitask learners},
+ author={Radford, Alec and Wu, Jeffrey and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
+ journal={OpenAI Blog},
+ volume={1},
+ number={8},
+ year={2019}
+ }
+
+ @inproceedings{unsupNMTlample,
+ title = {Unsupervised machine translation using monolingual corpora only},
+ author = {Lample, Guillaume and Conneau, Alexis and Denoyer, Ludovic and Ranzato, Marc'Aurelio},
+ booktitle = {ICLR},
+ year = {2018}
+ }
+
+ @inproceedings{lample2018phrase,
+ title={Phrase-Based \& Neural Unsupervised Machine Translation},
+ author={Lample, Guillaume and Ott, Myle and Conneau, Alexis and Denoyer, Ludovic and Ranzato, Marc'Aurelio},
+ booktitle={EMNLP},
+ year={2018}
+ }
+
+ @article{hendrycks2016bridging,
+ title={Bridging nonlinearities and stochastic regularizers with Gaussian error linear units},
+ author={Hendrycks, Dan and Gimpel, Kevin},
+ journal={arXiv preprint arXiv:1606.08415},
+ year={2016}
+ }
+
+ @inproceedings{chang2008optimizing,
+ title={Optimizing Chinese word segmentation for machine translation performance},
+ author={Chang, Pi-Chuan and Galley, Michel and Manning, Christopher D},
+ booktitle={Proceedings of the third workshop on statistical machine translation},
+ pages={224--232},
+ year={2008}
+ }
+
+ @inproceedings{rajpurkar-etal-2016-squad,
+ title = "{SQ}u{AD}: 100,000+ Questions for Machine Comprehension of Text",
+ author = "Rajpurkar, Pranav and
+ Zhang, Jian and
+ Lopyrev, Konstantin and
+ Liang, Percy",
+ booktitle = "EMNLP",
+ month = nov,
+ year = "2016",
+ address = "Austin, Texas",
+ publisher = "Association for Computational Linguistics",
+ url = "https://www.aclweb.org/anthology/D16-1264",
+ doi = "10.18653/v1/D16-1264",
+ pages = "2383--2392",
+ }
+
+ @article{lewis2019mlqa,
+ title={MLQA: Evaluating Cross-lingual Extractive Question Answering},
+ author={Lewis, Patrick and O{\u{g}}uz, Barlas and Rinott, Ruty and Riedel, Sebastian and Schwenk, Holger},
+ journal={arXiv preprint arXiv:1910.07475},
+ year={2019}
+ }
+
+ @inproceedings{sennrich2015neural,
+ title={Neural machine translation of rare words with subword units},
+ author={Sennrich, Rico and Haddow, Barry and Birch, Alexandra},
+ booktitle={Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics},
+ pages={1715--1725},
+ year={2016}
+ }
+
+ @article{eriguchi2018zero,
+ title={Zero-shot cross-lingual classification using multilingual neural machine translation},
+ author={Eriguchi, Akiko and Johnson, Melvin and Firat, Orhan and Kazawa, Hideto and Macherey, Wolfgang},
+ journal={arXiv preprint arXiv:1809.04686},
+ year={2018}
+ }
+
+ @article{smith2017offline,
+ title={Offline bilingual word vectors, orthogonal transformations and the inverted softmax},
+ author={Smith, Samuel L and Turban, David HP and Hamblin, Steven and Hammerla, Nils Y},
+ journal={International Conference on Learning Representations},
+ year={2017}
+ }
+
+ @article{artetxe2016learning,
+ title={Learning principled bilingual mappings of word embeddings while preserving monolingual invariance},
+ author={Artetxe, Mikel and Labaka, Gorka and Agirre, Eneko},
+ journal={Proceedings of EMNLP},
+ year={2016}
+ }
+
+ @article{ammar2016massively,
+ title={Massively multilingual word embeddings},
+ author={Ammar, Waleed and Mulcaire, George and Tsvetkov, Yulia and Lample, Guillaume and Dyer, Chris and Smith, Noah A},
+ journal={arXiv preprint arXiv:1602.01925},
+ year={2016}
+ }
+
+ @article{marcobaroni2015hubness,
+ title={Hubness and pollution: Delving into cross-space mapping for zero-shot learning},
+ author={Lazaridou, Angeliki and Dinu, Georgiana and Baroni, Marco},
+ journal={Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics},
+ year={2015}
+ }
+
+ @article{xing2015normalized,
+ title={Normalized Word Embedding and Orthogonal Transform for Bilingual Word Translation},
+ author={Xing, Chao and Wang, Dong and Liu, Chao and Lin, Yiye},
+ journal={Proceedings of NAACL},
+ year={2015}
+ }
+
+ @article{faruqui2014improving,
+ title={Improving Vector Space Word Representations Using Multilingual Correlation},
+ author={Faruqui, Manaal and Dyer, Chris},
+ journal={Proceedings of EACL},
+ year={2014}
+ }
+
+ @article{taylor1953cloze,
+ title={``Cloze procedure'': A new tool for measuring readability},
+ author={Taylor, Wilson L},
+ journal={Journalism Bulletin},
+ volume={30},
+ number={4},
+ pages={415--433},
+ year={1953},
+ publisher={SAGE Publications Sage CA: Los Angeles, CA}
+ }
+
+ @inproceedings{mikolov2013distributed,
+ title={Distributed representations of words and phrases and their compositionality},
+ author={Mikolov, Tomas and Sutskever, Ilya and Chen, Kai and Corrado, Greg S and Dean, Jeff},
+ booktitle={NIPS},
+ pages={3111--3119},
+ year={2013}
+ }
+
+ @article{mikolov2013exploiting,
+ title={Exploiting similarities among languages for machine translation},
+ author={Mikolov, Tomas and Le, Quoc V and Sutskever, Ilya},
+ journal={arXiv preprint arXiv:1309.4168},
+ year={2013}
+ }
+
+ @article{artetxe2018massively,
+ title={Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond},
+ author={Artetxe, Mikel and Schwenk, Holger},
+ journal={arXiv preprint arXiv:1812.10464},
+ year={2018}
+ }
+
+ @article{williams2017broad,
+ title={A broad-coverage challenge corpus for sentence understanding through inference},
+ author={Williams, Adina and Nangia, Nikita and Bowman, Samuel R},
+ journal={Proceedings of the 2nd Workshop on Evaluating Vector-Space Representations for NLP},
+ year={2017}
+ }
+
+ @InProceedings{conneau2018xnli,
+ author = "Conneau, Alexis
+ and Rinott, Ruty
+ and Lample, Guillaume
+ and Williams, Adina
+ and Bowman, Samuel R.
+ and Schwenk, Holger
+ and Stoyanov, Veselin",
+ title = "XNLI: Evaluating Cross-lingual Sentence Representations",
+ booktitle = "EMNLP",
+ year = "2018",
+ publisher = "Association for Computational Linguistics",
+ location = "Brussels, Belgium",
+ }
+
+ @article{wada2018unsupervised,
+ title={Unsupervised Cross-lingual Word Embedding by Multilingual Neural Language Models},
+ author={Wada, Takashi and Iwata, Tomoharu},
+ journal={arXiv preprint arXiv:1809.02306},
+ year={2018}
+ }
+
+ @article{xu2013cross,
+ title={Cross-lingual language modeling for low-resource speech recognition},
+ author={Xu, Ping and Fung, Pascale},
+ journal={IEEE Transactions on Audio, Speech, and Language Processing},
+ volume={21},
+ number={6},
+ pages={1134--1144},
+ year={2013},
+ publisher={IEEE}
+ }
+
+ @article{hermann2014multilingual,
+ title={Multilingual models for compositional distributed semantics},
+ author={Hermann, Karl Moritz and Blunsom, Phil},
+ journal={arXiv preprint arXiv:1404.4641},
+ year={2014}
+ }
+
+ @inproceedings{transformer17,
+ title = {Attention is all you need},
+ author = {Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin},
+ booktitle={Advances in Neural Information Processing Systems},
+ pages={6000--6010},
+ year = {2017}
+ }
+
+ @article{liu2019multi,
+ title={Multi-task deep neural networks for natural language understanding},
+ author={Liu, Xiaodong and He, Pengcheng and Chen, Weizhu and Gao, Jianfeng},
+ journal={arXiv preprint arXiv:1901.11504},
+ year={2019}
+ }
+
+ @article{wang2018glue,
+ title={GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding},
+ author={Wang, Alex and Singh, Amanpreet and Michael, Julian and Hill, Felix and Levy, Omer and Bowman, Samuel R},
+ journal={arXiv preprint arXiv:1804.07461},
+ year={2018}
+ }
+
+ @article{radford2018improving,
+ title={Improving language understanding by generative pre-training},
+ author={Radford, Alec and Narasimhan, Karthik and Salimans, Tim and Sutskever, Ilya},
+ url={https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf},
+ year={2018}
+ }
+
+ @article{conneau2018senteval,
+ title={SentEval: An Evaluation Toolkit for Universal Sentence Representations},
+ author={Conneau, Alexis and Kiela, Douwe},
+ journal={LREC},
+ year={2018}
+ }
+
+ @article{devlin2018bert,
+ title={{BERT}: Pre-training of deep bidirectional transformers for language understanding},
+ author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
+ journal={NAACL},
+ year={2018}
+ }
+
+ @article{peters2018deep,
+ title={Deep contextualized word representations},
+ author={Peters, Matthew E and Neumann, Mark and Iyyer, Mohit and Gardner, Matt and Clark, Christopher and Lee, Kenton and Zettlemoyer, Luke},
+ journal={NAACL},
+ year={2018}
+ }
+
+ @article{ramachandran2016unsupervised,
+ title={Unsupervised pretraining for sequence to sequence learning},
+ author={Ramachandran, Prajit and Liu, Peter J and Le, Quoc V},
+ journal={arXiv preprint arXiv:1611.02683},
+ year={2016}
+ }
+
+ @inproceedings{kunchukuttan2018iit,
+ title={The IIT Bombay English-Hindi Parallel Corpus},
+ author={Kunchukuttan, Anoop and Mehta, Pratik and Bhattacharyya, Pushpak},
+ booktitle={LREC},
+ year={2018}
+ }
+
+ @article{wu2019beto,
+ title={Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT},
+ author={Wu, Shijie and Dredze, Mark},
+ journal={EMNLP},
+ year={2019}
+ }
+
+ @inproceedings{lample-etal-2016-neural,
+ title = "Neural Architectures for Named Entity Recognition",
+ author = "Lample, Guillaume and
+ Ballesteros, Miguel and
+ Subramanian, Sandeep and
+ Kawakami, Kazuya and
+ Dyer, Chris",
+ booktitle = "NAACL",
+ month = jun,
+ year = "2016",
+ address = "San Diego, California",
+ publisher = "Association for Computational Linguistics",
+ url = "https://www.aclweb.org/anthology/N16-1030",
+ doi = "10.18653/v1/N16-1030",
+ pages = "260--270",
+ }
+
+ @inproceedings{akbik2018coling,
+ title={Contextual String Embeddings for Sequence Labeling},
+ author={Akbik, Alan and Blythe, Duncan and Vollgraf, Roland},
+ booktitle = {COLING},
+ pages = {1638--1649},
+ year = {2018}
+ }
+
+ @inproceedings{tjong-kim-sang-de-meulder-2003-introduction,
+ title = "Introduction to the {C}o{NLL}-2003 Shared Task: Language-Independent Named Entity Recognition",
+ author = "Tjong Kim Sang, Erik F. and
+ De Meulder, Fien",
+ booktitle = "Proceedings of the Seventh Conference on Natural Language Learning at {HLT}-{NAACL} 2003",
+ year = "2003",
+ url = "https://www.aclweb.org/anthology/W03-0419",
+ pages = "142--147",
+ }
+
+ @inproceedings{tjong-kim-sang-2002-introduction,
+ title = "Introduction to the {C}o{NLL}-2002 Shared Task: Language-Independent Named Entity Recognition",
+ author = "Tjong Kim Sang, Erik F.",
+ booktitle = "{COLING}-02: The 6th Conference on Natural Language Learning 2002 ({C}o{NLL}-2002)",
+ year = "2002",
+ url = "https://www.aclweb.org/anthology/W02-2024",
+ }
+
+ @InProceedings{TIEDEMANN12.463,
+ author = {Jörg Tiedemann},
+ title = {Parallel Data, Tools and Interfaces in OPUS},
+ booktitle = {LREC},
+ year = {2012},
+ month = {may},
+ date = {23-25},
+ address = {Istanbul, Turkey},
+ editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Ugur Dogan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},
+ publisher = {European Language Resources Association (ELRA)},
+ isbn = {978-2-9517408-7-7},
+ language = {english}
+ }
+
+ @inproceedings{ziemski2016united,
+ title={The United Nations Parallel Corpus v1.0},
+ author={Ziemski, Michal and Junczys-Dowmunt, Marcin and Pouliquen, Bruno},
+ booktitle={LREC},
+ year={2016}
+ }
+
+ @article{roberta2019,
+ author = {Yinhan Liu and
+ Myle Ott and
+ Naman Goyal and
+ Jingfei Du and
+ Mandar Joshi and
+ Danqi Chen and
+ Omer Levy and
+ Mike Lewis and
+ Luke Zettlemoyer and
+ Veselin Stoyanov},
+ title = {RoBERTa: {A} Robustly Optimized {BERT} Pretraining Approach},
+ journal = {arXiv preprint arXiv:1907.11692},
+ year = {2019}
+ }
+
+ @article{tan2019multilingual,
+ title={Multilingual neural machine translation with knowledge distillation},
+ author={Tan, Xu and Ren, Yi and He, Di and Qin, Tao and Zhao, Zhou and Liu, Tie-Yan},
+ journal={ICLR},
+ year={2019}
+ }
+
+ @article{siddhant2019evaluating,
+ title={Evaluating the Cross-Lingual Effectiveness of Massively Multilingual Neural Machine Translation},
+ author={Siddhant, Aditya and Johnson, Melvin and Tsai, Henry and Arivazhagan, Naveen and Riesa, Jason and Bapna, Ankur and Firat, Orhan and Raman, Karthik},
+ journal={AAAI},
+ year={2019}
+ }
+
+ @inproceedings{camacho2017semeval,
+ title={SemEval-2017 task 2: Multilingual and cross-lingual semantic word similarity},
+ author={Camacho-Collados, Jose and Pilehvar, Mohammad Taher and Collier, Nigel and Navigli, Roberto},
+ booktitle={Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)},
+ pages={15--26},
+ year={2017}
+ }
+
+ @inproceedings{Pires2019HowMI,
+ title={How Multilingual is Multilingual BERT?},
+ author={Telmo Pires and Eva Schlinger and Dan Garrette},
+ booktitle={ACL},
+ year={2019}
+ }
+
+ @article{lample2019cross,
+ title={Cross-lingual language model pretraining},
+ author={Lample, Guillaume and Conneau, Alexis},
+ journal={NeurIPS},
+ year={2019}
+ }
+
+ @article{schuster2019cross,
+ title={Cross-Lingual Alignment of Contextual Word Embeddings, with Applications to Zero-shot Dependency Parsing},
+ author={Schuster, Tal and Ram, Ori and Barzilay, Regina and Globerson, Amir},
+ journal={NAACL},
+ year={2019}
+ }
+
+ @article{wenzek2019ccnet,
+ title={CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data},
+ author={Wenzek, Guillaume and Lachaux, Marie-Anne and Conneau, Alexis and Chaudhary, Vishrav and Guzman, Francisco and Joulin, Armand and Grave, Edouard},
+ journal={arXiv preprint arXiv:1911.00359},
+ year={2019}
+ }
+
+ @inproceedings{zhou2016cross,
+ title={Cross-lingual sentiment classification with bilingual document representation learning},
+ author={Zhou, Xinjie and Wan, Xiaojun and Xiao, Jianguo},
+ booktitle={Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
+ pages={1403--1412},
+ year={2016}
+ }
+
+ @article{goyal2017accurate,
+ title={Accurate, large minibatch {SGD}: Training {ImageNet} in 1 hour},
+ author={Goyal, Priya and Doll{\'a}r, Piotr and Girshick, Ross and Noordhuis, Pieter and Wesolowski, Lukasz and Kyrola, Aapo and Tulloch, Andrew and Jia, Yangqing and He, Kaiming},
+ journal={arXiv preprint arXiv:1706.02677},
+ year={2017}
+ }
+
+ @article{arivazhagan2019massively,
+ title={Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges},
+ author={Arivazhagan, Naveen and Bapna, Ankur and Firat, Orhan and Lepikhin, Dmitry and Johnson, Melvin and Krikun, Maxim and Chen, Mia Xu and Cao, Yuan and Foster, George and Cherry, Colin and others},
+ journal={arXiv preprint arXiv:1907.05019},
+ year={2019}
+ }
+
+ @inproceedings{pan2017cross,
+ title={Cross-lingual name tagging and linking for 282 languages},
+ author={Pan, Xiaoman and Zhang, Boliang and May, Jonathan and Nothman, Joel and Knight, Kevin and Ji, Heng},
+ booktitle={Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
+ volume={1},
+ pages={1946--1958},
+ year={2017}
+ }
+
+ @article{raffel2019exploring,
+ title={Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
+ author={Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
+ journal={arXiv preprint arXiv:1910.10683},
+ year={2019}
+ }
+
+ @inproceedings{pennington2014glove,
+ author = {Jeffrey Pennington and Richard Socher and Christopher D. Manning},
+ booktitle = {EMNLP},
+ title = {GloVe: Global Vectors for Word Representation},
+ year = {2014},
+ pages = {1532--1543},
+ url = {http://www.aclweb.org/anthology/D14-1162},
+ }
+
+ @article{kudo2018sentencepiece,
+ title={{SentencePiece}: A simple and language independent subword tokenizer and detokenizer for neural text processing},
+ author={Kudo, Taku and Richardson, John},
+ journal={EMNLP},
+ year={2018}
+ }
+
+ @article{rajpurkar2018know,
+ title={Know What You Don't Know: Unanswerable Questions for SQuAD},
+ author={Rajpurkar, Pranav and Jia, Robin and Liang, Percy},
+ journal={ACL},
+ year={2018}
+ }
+
+ @article{joulin2017bag,
+ title={Bag of Tricks for Efficient Text Classification},
+ author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
+ journal={EACL 2017},
+ pages={427},
+ year={2017}
+ }
+
+ @inproceedings{kudo2018subword,
+ title={Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates},
+ author={Kudo, Taku},
+ booktitle={ACL},
+ pages={66--75},
+ year={2018}
+ }
+
+ @inproceedings{grave2018learning,
+ title={Learning Word Vectors for 157 Languages},
+ author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},
+ booktitle={LREC},
+ year={2018}
+ }
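
The entries above pair with the acl2020.sty and acl_natbib.bst files added alongside this .bib in the commit. As an illustrative sketch only (the document skeleton and file names follow this commit's layout and the comments inside acl2020.sty; the citation keys are taken from the entries above), a paper source would consume the bibliography like this:

```latex
% Minimal, hypothetical usage skeleton -- not a file in this commit.
\documentclass[11pt]{article}
\usepackage{acl2020}   % style file added in this commit; it uses natbib citation commands
\title{Example}
\author{First Author \and Second Author}
\begin{document}
\maketitle
Cross-lingual transfer is commonly evaluated on XNLI \citep{conneau2018xnli},
building on cross-lingual pretraining \citep{lample2019cross}.
\bibliographystyle{acl_natbib}
\bibliography{acl2020}  % the .bib file above
\end{document}
```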
references/2019.arxiv.conneau/source/acl2020.sty ADDED
@@ -0,0 +1,560 @@
+ % This is the LaTeX style file for ACL 2020, based off of ACL 2019.
+
+ % Addressing bibtex issues mentioned in https://github.com/acl-org/acl-pub/issues/2
+ % Other major modifications include
+ % changing the color of the line numbers to a light gray; changing font size of abstract to be 10pt; changing caption font size to be 10pt.
+ % -- M Mitchell and Stephanie Lukin
+
+ % 2017: modified to support DOI links in bibliography. Now uses
+ % natbib package rather than defining citation commands in this file.
+ % Use with acl_natbib.bst bib style. -- Dan Gildea
+
+ % This is the LaTeX style for ACL 2016. It contains Margaret Mitchell's
+ % line number adaptations (ported by Hai Zhao and Yannick Versley).
+
+ % It is nearly identical to the style files for ACL 2015,
+ % ACL 2014, EACL 2006, ACL 2005, ACL 2002, ACL 2001, ACL 2000,
+ % EACL 95 and EACL 99.
+ %
+ % Changes made include: adapt layout to A4 and centimeters, widen abstract
+
+ % This is the LaTeX style file for ACL 2000. It is nearly identical to the
+ % style files for EACL 95 and EACL 99. Minor changes include editing the
+ % instructions to reflect use of \documentclass rather than \documentstyle
+ % and removing the white space before the title on the first page
+ % -- John Chen, June 29, 2000
+
+ % This is the LaTeX style file for EACL-95. It is identical to the
+ % style file for ANLP '94 except that the margins are adjusted for A4
+ % paper. -- abney 13 Dec 94
+
+ % The ANLP '94 style file is a slightly modified
+ % version of the style used for AAAI and IJCAI, using some changes
+ % prepared by Fernando Pereira and others and some minor changes
+ % by Paul Jacobs.
+
+ % Papers prepared using the aclsub.sty file and acl.bst bibtex style
+ % should be easily converted to final format using this style.
+ % (1) Submission information (\wordcount, \subject, and \makeidpage)
+ % should be removed.
+ % (2) \summary should be removed. The summary material should come
+ % after \maketitle and should be in the ``abstract'' environment
+ % (between \begin{abstract} and \end{abstract}).
+ % (3) Check all citations. This style should handle citations correctly
+ % and also allows multiple citations separated by semicolons.
+ % (4) Check figures and examples. Because the final format is double-
+ % column, some adjustments may have to be made to fit text in the column
+ % or to choose full-width (\figure*) figures.
+
+ % Place this in a file called aclap.sty in the TeX search path.
+ % (Placing it in the same directory as the paper should also work.)
+
+ % Prepared by Peter F. Patel-Schneider, liberally using the ideas of
+ % other style hackers, including Barbara Beeton.
+ % This style is NOT guaranteed to work. It is provided in the hope
+ % that it will make the preparation of papers easier.
+ %
+ % There are undoubtedly bugs in this style. If you make bug fixes,
+ % improvements, etc. please let me know. My e-mail address is:
+ % pfps@research.att.com
+
+ % Papers are to be prepared using the ``acl_natbib'' bibliography style,
+ % as follows:
+ % \documentclass[11pt]{article}
+ % \usepackage{acl2000}
+ % \title{Title}
+ % \author{Author 1 \and Author 2 \\ Address line \\ Address line \And
+ % Author 3 \\ Address line \\ Address line}
+ % \begin{document}
+ % ...
+ % \bibliography{bibliography-file}
+ % \bibliographystyle{acl_natbib}
+ % \end{document}
+
+ % Author information can be set in various styles:
+ % For several authors from the same institution:
+ % \author{Author 1 \and ... \and Author n \\
+ % Address line \\ ... \\ Address line}
+ % if the names do not fit well on one line use
+ % Author 1 \\ {\bf Author 2} \\ ... \\ {\bf Author n} \\
+ % For authors from different institutions:
+ % \author{Author 1 \\ Address line \\ ... \\ Address line
+ % \And ... \And
+ % Author n \\ Address line \\ ... \\ Address line}
+ % To start a separate ``row'' of authors use \AND, as in
+ % \author{Author 1 \\ Address line \\ ... \\ Address line
+ % \AND
+ % Author 2 \\ Address line \\ ... \\ Address line \And
+ % Author 3 \\ Address line \\ ... \\ Address line}
+
+ % If the title and author information does not fit in the area allocated,
+ % place \setlength\titlebox{<new height>} right after
+ % \usepackage{acl2015}
+ % where <new height> can be something larger than 5cm
+
+ % include hyperref, unless user specifies nohyperref option like this:
+ % \usepackage[nohyperref]{naaclhlt2018}
+ \newif\ifacl@hyperref
+ \DeclareOption{hyperref}{\acl@hyperreftrue}
+ \DeclareOption{nohyperref}{\acl@hyperreffalse}
+ \ExecuteOptions{hyperref} % default is to use hyperref
+ \ProcessOptions\relax
+ \ifacl@hyperref
+ \RequirePackage{hyperref}
+ \usepackage{xcolor} % make links dark blue
+ \definecolor{darkblue}{rgb}{0, 0, 0.5}
+ \hypersetup{colorlinks=true,citecolor=darkblue, linkcolor=darkblue, urlcolor=darkblue}
+ \else
+ % This definition is used if the hyperref package is not loaded.
+ % It provides a backup, no-op definition of \href.
+ % This is necessary because the \href command is used in the acl_natbib.bst file.
+ \def\href#1#2{{#2}}
+ % We still need to load xcolor in this case because the lighter line numbers require it. (SC/KG/WL)
+ \usepackage{xcolor}
+ \fi
+
+ \typeout{Conference Style for ACL 2019}
+
+ % NOTE: Some laser printers have a serious problem printing TeX output.
+ % These printing devices, commonly known as ``write-white'' laser
+ % printers, tend to make characters too light. To get around this
+ % problem, a darker set of fonts must be created for these devices.
+ %
+
+ \newcommand{\Thanks}[1]{\thanks{\ #1}}
+
+ % A4 modified by Eneko; again modified by Alexander for 5cm titlebox
+ \setlength{\paperwidth}{21cm} % A4
+ \setlength{\paperheight}{29.7cm}% A4
+ \setlength\topmargin{-0.5cm}
+ \setlength\oddsidemargin{0cm}
+ \setlength\textheight{24.7cm}
+ \setlength\textwidth{16.0cm}
+ \setlength\columnsep{0.6cm}
+ \newlength\titlebox
+ \setlength\titlebox{5cm}
+ \setlength\headheight{5pt}
+ \setlength\headsep{0pt}
+ \thispagestyle{empty}
+ \pagestyle{empty}
+
+
+ \flushbottom \twocolumn \sloppy
+
+ % We're never going to need a table of contents, so just flush it to
+ % save space --- suggested by drstrip@sandia-2
+ \def\addcontentsline#1#2#3{}
+
+ \newif\ifaclfinal
+ \aclfinalfalse
+ \def\aclfinalcopy{\global\aclfinaltrue}
+
+ %% ----- Set up hooks to repeat content on every page of the output doc,
+ %% necessary for the line numbers in the submitted version. --MM
+ %%
+ %% Copied from CVPR 2015's cvpr_eso.sty, which appears to be largely copied from everyshi.sty.
+ %%
+ %% Original cvpr_eso.sty available at: http://www.pamitc.org/cvpr15/author_guidelines.php
+ %% Original everyshi.sty available at: https://www.ctan.org/pkg/everyshi
+ %%
+ %% Copyright (C) 2001 Martin Schr\"oder:
+ %%
+ %% Martin Schr"oder
+ %% Cr"usemannallee 3
+ %% D-28213 Bremen
+ %% Martin.Schroeder@ACM.org
+ %%
+ %% This program may be redistributed and/or modified under the terms
+ %% of the LaTeX Project Public License, either version 1.0 of this
+ %% license, or (at your option) any later version.
+ %% The latest version of this license is in
+ %% CTAN:macros/latex/base/lppl.txt.
+ %%
+ %% Happy users are requested to send [Martin] a postcard. :-)
+ %%
+ \newcommand{\@EveryShipoutACL@Hook}{}
+ \newcommand{\@EveryShipoutACL@AtNextHook}{}
+ \newcommand*{\EveryShipoutACL}[1]
+ {\g@addto@macro\@EveryShipoutACL@Hook{#1}}
+ \newcommand*{\AtNextShipoutACL@}[1]
+ {\g@addto@macro\@EveryShipoutACL@AtNextHook{#1}}
+ \newcommand{\@EveryShipoutACL@Shipout}{%
+ \afterassignment\@EveryShipoutACL@Test
+ \global\setbox\@cclv= %
+ }
+ \newcommand{\@EveryShipoutACL@Test}{%
+ \ifvoid\@cclv\relax
+ \aftergroup\@EveryShipoutACL@Output
+ \else
+ \@EveryShipoutACL@Output
+ \fi%
+ }
+ \newcommand{\@EveryShipoutACL@Output}{%
+ \@EveryShipoutACL@Hook%
+ \@EveryShipoutACL@AtNextHook%
+ \gdef\@EveryShipoutACL@AtNextHook{}%
+ \@EveryShipoutACL@Org@Shipout\box\@cclv%
+ }
+ \newcommand{\@EveryShipoutACL@Org@Shipout}{}
+ \newcommand*{\@EveryShipoutACL@Init}{%
+ \message{ABD: EveryShipout initializing macros}%
+ \let\@EveryShipoutACL@Org@Shipout\shipout
+ \let\shipout\@EveryShipoutACL@Shipout
+ }
+ \AtBeginDocument{\@EveryShipoutACL@Init}
+
+ %% ----- Set up for placing additional items into the submitted version --MM
+ %%
+ %% Based on eso-pic.sty
+ %%
+ %% Original available at: https://www.ctan.org/tex-archive/macros/latex/contrib/eso-pic
+ %% Copyright (C) 1998-2002 by Rolf Niepraschk <niepraschk@ptb.de>
+ %%
+ %% Which may be distributed and/or modified under the conditions of
+ %% the LaTeX Project Public License, either version 1.2 of this license
+ %% or (at your option) any later version. The latest version of this
+ %% license is in:
+ %%
+ %% http://www.latex-project.org/lppl.txt
+ %%
+ %% and version 1.2 or later is part of all distributions of LaTeX version
+ %% 1999/12/01 or later.
+ %%
+ %% In contrast to the original, we do not include the definitions for/using:
+ %% gridpicture, div[2], isMEMOIR[1], gridSetup[6][], subgridstyle{dotted}, labelfactor{}, gap{}, gridunitname{}, gridunit{}, gridlines{\thinlines}, subgridlines{\thinlines}, the {keyval} package, evenside margin, nor any definitions with 'color'.
+ %%
+ %% These are beyond what is needed for the NAACL/ACL style.
+ %%
+ \newcommand\LenToUnit[1]{#1\@gobble}
+ \newcommand\AtPageUpperLeft[1]{%
+ \begingroup
+ \@tempdima=0pt\relax\@tempdimb=\ESO@yoffsetI\relax
+ \put(\LenToUnit{\@tempdima},\LenToUnit{\@tempdimb}){#1}%
+ \endgroup
+ }
+ \newcommand\AtPageLowerLeft[1]{\AtPageUpperLeft{%
+ \put(0,\LenToUnit{-\paperheight}){#1}}}
+ \newcommand\AtPageCenter[1]{\AtPageUpperLeft{%
+ \put(\LenToUnit{.5\paperwidth},\LenToUnit{-.5\paperheight}){#1}}}
+ \newcommand\AtPageLowerCenter[1]{\AtPageUpperLeft{%
+ \put(\LenToUnit{.5\paperwidth},\LenToUnit{-\paperheight}){#1}}}%
+ \newcommand\AtPageLowishCenter[1]{\AtPageUpperLeft{%
+ \put(\LenToUnit{.5\paperwidth},\LenToUnit{-.96\paperheight}){#1}}}
+ \newcommand\AtTextUpperLeft[1]{%
+ \begingroup
+ \setlength\@tempdima{1in}%
+ \advance\@tempdima\oddsidemargin%
+ \@tempdimb=\ESO@yoffsetI\relax\advance\@tempdimb-1in\relax%
+ \advance\@tempdimb-\topmargin%
+ \advance\@tempdimb-\headheight\advance\@tempdimb-\headsep%
+ \put(\LenToUnit{\@tempdima},\LenToUnit{\@tempdimb}){#1}%
+ \endgroup
+ }
+ \newcommand\AtTextLowerLeft[1]{\AtTextUpperLeft{%
+ \put(0,\LenToUnit{-\textheight}){#1}}}
+ \newcommand\AtTextCenter[1]{\AtTextUpperLeft{%
+ \put(\LenToUnit{.5\textwidth},\LenToUnit{-.5\textheight}){#1}}}
+ \newcommand{\ESO@HookI}{} \newcommand{\ESO@HookII}{}
+ \newcommand{\ESO@HookIII}{}
+ \newcommand{\AddToShipoutPicture}{%
+ \@ifstar{\g@addto@macro\ESO@HookII}{\g@addto@macro\ESO@HookI}}
+ \newcommand{\ClearShipoutPicture}{\global\let\ESO@HookI\@empty}
+ \newcommand{\@ShipoutPicture}{%
+ \bgroup
+ \@tempswafalse%
+ \ifx\ESO@HookI\@empty\else\@tempswatrue\fi%
+ \ifx\ESO@HookII\@empty\else\@tempswatrue\fi%
+ \ifx\ESO@HookIII\@empty\else\@tempswatrue\fi%
+ \if@tempswa%
+ \@tempdima=1in\@tempdimb=-\@tempdima%
+ \advance\@tempdimb\ESO@yoffsetI%
+ \unitlength=1pt%
+ \global\setbox\@cclv\vbox{%
+ \vbox{\let\protect\relax
+ \pictur@(0,0)(\strip@pt\@tempdima,\strip@pt\@tempdimb)%
+ \ESO@HookIII\ESO@HookI\ESO@HookII%
+ \global\let\ESO@HookII\@empty%
+ \endpicture}%
+ \nointerlineskip%
+ \box\@cclv}%
+ \fi
+ \egroup
+ }
+ \EveryShipoutACL{\@ShipoutPicture}
+ \newif\ifESO@dvips\ESO@dvipsfalse
+ \newif\ifESO@grid\ESO@gridfalse
+ \newif\ifESO@texcoord\ESO@texcoordfalse
+ \newcommand*\ESO@griddelta{}\newcommand*\ESO@griddeltaY{}
+ \newcommand*\ESO@gridDelta{}\newcommand*\ESO@gridDeltaY{}
+ \newcommand*\ESO@yoffsetI{}\newcommand*\ESO@yoffsetII{}
+ \ifESO@texcoord
+ \def\ESO@yoffsetI{0pt}\def\ESO@yoffsetII{-\paperheight}
+ \edef\ESO@griddeltaY{-\ESO@griddelta}\edef\ESO@gridDeltaY{-\ESO@gridDelta}
+ \else
+ \def\ESO@yoffsetI{\paperheight}\def\ESO@yoffsetII{0pt}
+ \edef\ESO@griddeltaY{\ESO@griddelta}\edef\ESO@gridDeltaY{\ESO@gridDelta}
+ \fi
+
+
+ %% ----- Submitted version markup: Page numbers, ruler, and confidentiality. Using ideas/code from cvpr.sty 2015. --MM
+
+ \font\aclhv = phvb at 8pt
+
+ %% Define vruler %%
+
+ %\makeatletter
+ \newbox\aclrulerbox
+ \newcount\aclrulercount
+ \newdimen\aclruleroffset
+ \newdimen\cv@lineheight
+ \newdimen\cv@boxheight
+ \newbox\cv@tmpbox
+ \newcount\cv@refno
+ \newcount\cv@tot
+ % NUMBER with left flushed zeros \fillzeros[<WIDTH>]<NUMBER>
+ \newcount\cv@tmpc@ \newcount\cv@tmpc
+ \def\fillzeros[#1]#2{\cv@tmpc@=#2\relax\ifnum\cv@tmpc@<0\cv@tmpc@=-\cv@tmpc@\fi
+ \cv@tmpc=1 %
+ \loop\ifnum\cv@tmpc@<10 \else \divide\cv@tmpc@ by 10 \advance\cv@tmpc by 1 \fi
+ \ifnum\cv@tmpc@=10\relax\cv@tmpc@=11\relax\fi \ifnum\cv@tmpc@>10 \repeat
+ \ifnum#2<0\advance\cv@tmpc1\relax-\fi
+ \loop\ifnum\cv@tmpc<#1\relax0\advance\cv@tmpc1\relax\fi \ifnum\cv@tmpc<#1 \repeat
+ \cv@tmpc@=#2\relax\ifnum\cv@tmpc@<0\cv@tmpc@=-\cv@tmpc@\fi \relax\the\cv@tmpc@}%
+ % \makevruler[<SCALE>][<INITIAL_COUNT>][<STEP>][<DIGITS>][<HEIGHT>]
+ \def\makevruler[#1][#2][#3][#4][#5]{\begingroup\offinterlineskip
+ \textheight=#5\vbadness=10000\vfuzz=120ex\overfullrule=0pt%
+ \global\setbox\aclrulerbox=\vbox to \textheight{%
+ {\parskip=0pt\hfuzz=150em\cv@boxheight=\textheight
+ \color{gray}
+ \cv@lineheight=#1\global\aclrulercount=#2%
+ \cv@tot\cv@boxheight\divide\cv@tot\cv@lineheight\advance\cv@tot2%
+ \cv@refno1\vskip-\cv@lineheight\vskip1ex%
+ \loop\setbox\cv@tmpbox=\hbox to0cm{{\aclhv\hfil\fillzeros[#4]\aclrulercount}}%
+ \ht\cv@tmpbox\cv@lineheight\dp\cv@tmpbox0pt\box\cv@tmpbox\break
+ \advance\cv@refno1\global\advance\aclrulercount#3\relax
+ \ifnum\cv@refno<\cv@tot\repeat}}\endgroup}%
+ %\makeatother
+
+
+ \def\aclpaperid{***}
+ \def\confidential{\textcolor{black}{ACL 2020 Submission~\aclpaperid. Confidential Review Copy. DO NOT DISTRIBUTE.}}
+
+ %% Page numbering, Vruler and Confidentiality %%
+ % \makevruler[<SCALE>][<INITIAL_COUNT>][<STEP>][<DIGITS>][<HEIGHT>]
+
+ % SC/KG/WL - changed line numbering to gainsboro
+ \definecolor{gainsboro}{rgb}{0.8, 0.8, 0.8}
+ %\def\aclruler#1{\makevruler[14.17pt][#1][1][3][\textheight]\usebox{\aclrulerbox}} %% old line
+ \def\aclruler#1{\textcolor{gainsboro}{\makevruler[14.17pt][#1][1][3][\textheight]\usebox{\aclrulerbox}}}
+
+ \def\leftoffset{-2.1cm} %original: -45pt
+ \def\rightoffset{17.5cm} %original: 500pt
+ \ifaclfinal\else\pagenumbering{arabic}
+ \AddToShipoutPicture{%
+ \ifaclfinal\else
+ \AtPageLowishCenter{\textcolor{black}{\thepage}}
+ \aclruleroffset=\textheight
+ \advance\aclruleroffset4pt
+ \AtTextUpperLeft{%
+ \put(\LenToUnit{\leftoffset},\LenToUnit{-\aclruleroffset}){%left ruler
+ \aclruler{\aclrulercount}}
+ \put(\LenToUnit{\rightoffset},\LenToUnit{-\aclruleroffset}){%right ruler
+ \aclruler{\aclrulercount}}
+ }
+ \AtTextUpperLeft{%confidential
+ \put(0,\LenToUnit{1cm}){\parbox{\textwidth}{\centering\aclhv\confidential}}
+ }
+ \fi
+ }
+
+ %%%% ----- End settings for placing additional items into the submitted version --MM ----- %%%%
+
+ %%%% ----- Begin settings for both submitted and camera-ready version ----- %%%%
+
+ %% Title and Authors %%
+
+ \newcommand\outauthor{
+ \begin{tabular}[t]{c}
+ \ifaclfinal
+ \bf\@author
+ \else
+ % Avoiding common accidental de-anonymization issue. --MM
+ \bf Anonymous ACL submission
+ \fi
+ \end{tabular}}
+
+ % Changing the expanded titlebox for submissions to 2.5 in (rather than 6.5cm)
+ % and moving it to the style sheet, rather than within the example tex file. --MM
+ \ifaclfinal
+ \else
+ \addtolength\titlebox{.25in}
+ \fi
+ % Mostly taken from deproc.
+ \def\maketitle{\par
+ \begingroup
+ \def\thefootnote{\fnsymbol{footnote}}
+ \def\@makefnmark{\hbox to 0pt{$^{\@thefnmark}$\hss}}
+ \twocolumn[\@maketitle] \@thanks
+ \endgroup
+ \setcounter{footnote}{0}
+ \let\maketitle\relax \let\@maketitle\relax
+ \gdef\@thanks{}\gdef\@author{}\gdef\@title{}\let\thanks\relax}
+ \def\@maketitle{\vbox to \titlebox{\hsize\textwidth
+ \linewidth\hsize \vskip 0.125in minus 0.125in \centering
+ {\Large\bf \@title \par} \vskip 0.2in plus 1fil minus 0.1in
+ {\def\and{\unskip\enspace{\rm and}\enspace}%
+ \def\And{\end{tabular}\hss \egroup \hskip 1in plus 2fil
+ \hbox to 0pt\bgroup\hss \begin{tabular}[t]{c}\bf}%
+ \def\AND{\end{tabular}\hss\egroup \hfil\hfil\egroup
+ \vskip 0.25in plus 1fil minus 0.125in
+ \hbox to \linewidth\bgroup\large \hfil\hfil
+ \hbox to 0pt\bgroup\hss \begin{tabular}[t]{c}\bf}
+ \hbox to \linewidth\bgroup\large \hfil\hfil
+ \hbox to 0pt\bgroup\hss
+ \outauthor
+ \hss\egroup
+ \hfil\hfil\egroup}
+ \vskip 0.3in plus 2fil minus 0.1in
+ }}
+
+ % margins and font size for abstract
+ \renewenvironment{abstract}%
+ {\centerline{\large\bf Abstract}%
+ \begin{list}{}%
+ {\setlength{\rightmargin}{0.6cm}%
+ \setlength{\leftmargin}{0.6cm}}%
+ \item[]\ignorespaces%
+ \@setsize\normalsize{12pt}\xpt\@xpt
+ }%
+ {\unskip\end{list}}
+
+ %\renewenvironment{abstract}{\centerline{\large\bf
+ % Abstract}\vspace{0.5ex}\begin{quote}}{\par\end{quote}\vskip 1ex}
+
+ % Resizing figure and table captions - SL
+ \newcommand{\figcapfont}{\rm}
+ \newcommand{\tabcapfont}{\rm}
+ \renewcommand{\fnum@figure}{\figcapfont Figure \thefigure}
+ \renewcommand{\fnum@table}{\tabcapfont Table \thetable}
+ \renewcommand{\figcapfont}{\@setsize\normalsize{12pt}\xpt\@xpt}
+ \renewcommand{\tabcapfont}{\@setsize\normalsize{12pt}\xpt\@xpt}
+ % Support for interacting with the caption, subfigure, and subcaption packages - SL
+ \usepackage{caption}
+ \DeclareCaptionFont{10pt}{\fontsize{10pt}{12pt}\selectfont}
+ \captionsetup{font=10pt}
+
+ \RequirePackage{natbib}
+ % for citation commands in the .tex, authors can use:
+ % \citep, \citet, and \citeyearpar for compatibility with natbib, or
+ % \cite, \newcite, and \shortcite for compatibility with older ACL .sty files
+ \renewcommand\cite{\citep} % to get "(Author Year)" with natbib
+ \newcommand\shortcite{\citeyearpar}% to get "(Year)" with natbib
+ \newcommand\newcite{\citet} % to get "Author (Year)" with natbib
+
+ % DK/IV: Workaround for annoying hyperref pagewrap bug
+ % \RequirePackage{etoolbox}
+ % \patchcmd\@combinedblfloats{\box\@outputbox}{\unvbox\@outputbox}{}{\errmessage{\noexpand patch failed}}
+
+ % bibliography
+
+ \def\@up#1{\raise.2ex\hbox{#1}}
+
+ % Don't put a label in the bibliography at all. Just use the unlabeled format
+ % instead.
+ \def\thebibliography#1{\vskip\parskip%
+ \vskip\baselineskip%
+ \def\baselinestretch{1}%
+ \ifx\@currsize\normalsize\@normalsize\else\@currsize\fi%
+ \vskip-\parskip%
+ \vskip-\baselineskip%
+ \section*{References\@mkboth
+ {References}{References}}\list
+ {}{\setlength{\labelwidth}{0pt}\setlength{\leftmargin}{\parindent}
+ \setlength{\itemindent}{-\parindent}}
+ \def\newblock{\hskip .11em plus .33em minus -.07em}
+ \sloppy\clubpenalty4000\widowpenalty4000
+ \sfcode`\.=1000\relax}
+ \let\endthebibliography=\endlist
+
+
+ % Allow for a bibliography of sources of attested examples
+ \def\thesourcebibliography#1{\vskip\parskip%
+ \vskip\baselineskip%
+ \def\baselinestretch{1}%
+ \ifx\@currsize\normalsize\@normalsize\else\@currsize\fi%
+ \vskip-\parskip%
+ \vskip-\baselineskip%
+ \section*{Sources of Attested Examples\@mkboth
+ {Sources of Attested Examples}{Sources of Attested Examples}}\list
+ {}{\setlength{\labelwidth}{0pt}\setlength{\leftmargin}{\parindent}
+ \setlength{\itemindent}{-\parindent}}
+ \def\newblock{\hskip .11em plus .33em minus -.07em}
+ \sloppy\clubpenalty4000\widowpenalty4000
+ \sfcode`\.=1000\relax}
+ \let\endthesourcebibliography=\endlist
+
+ % sections with less space
+ \def\section{\@startsection {section}{1}{\z@}{-2.0ex plus
+ -0.5ex minus -.2ex}{1.5ex plus 0.3ex minus .2ex}{\large\bf\raggedright}}
+ \def\subsection{\@startsection{subsection}{2}{\z@}{-1.8ex plus
+ -0.5ex minus -.2ex}{0.8ex plus .2ex}{\normalsize\bf\raggedright}}
+ %% changed by KO to - values to get the initial parindent right
+ \def\subsubsection{\@startsection{subsubsection}{3}{\z@}{-1.5ex plus
+ -0.5ex minus -.2ex}{0.5ex plus .2ex}{\normalsize\bf\raggedright}}
+ \def\paragraph{\@startsection{paragraph}{4}{\z@}{1.5ex plus
+ 0.5ex minus .2ex}{-1em}{\normalsize\bf}}
+ \def\subparagraph{\@startsection{subparagraph}{5}{\parindent}{1.5ex plus
+ 0.5ex minus .2ex}{-1em}{\normalsize\bf}}
+
+ % Footnotes
+ \footnotesep 6.65pt %
+ \skip\footins 9pt plus 4pt minus 2pt
+ \def\footnoterule{\kern-3pt \hrule width 5pc \kern 2.6pt }
+ \setcounter{footnote}{0}
+
+ % Lists and paragraphs
+ \parindent 1em
+ \topsep 4pt plus 1pt minus 2pt
+ \partopsep 1pt plus 0.5pt minus 0.5pt
+ \itemsep 2pt plus 1pt minus 0.5pt
+ \parsep 2pt plus 1pt minus 0.5pt
+
+ \leftmargin 2em \leftmargini\leftmargin \leftmarginii 2em
+ \leftmarginiii 1.5em \leftmarginiv 1.0em \leftmarginv .5em \leftmarginvi .5em
+ \labelwidth\leftmargini\advance\labelwidth-\labelsep \labelsep 5pt
+
+ \def\@listi{\leftmargin\leftmargini}
+ \def\@listii{\leftmargin\leftmarginii
+ \labelwidth\leftmarginii\advance\labelwidth-\labelsep
+ \topsep 2pt plus 1pt minus 0.5pt
+ \parsep 1pt plus 0.5pt minus 0.5pt
+ \itemsep \parsep}
+ \def\@listiii{\leftmargin\leftmarginiii
+ \labelwidth\leftmarginiii\advance\labelwidth-\labelsep
+ \topsep 1pt plus 0.5pt minus 0.5pt
+ \parsep \z@ \partopsep 0.5pt plus 0pt minus 0.5pt
+ \itemsep \topsep}
+ \def\@listiv{\leftmargin\leftmarginiv
+ \labelwidth\leftmarginiv\advance\labelwidth-\labelsep}
+ \def\@listv{\leftmargin\leftmarginv
+ \labelwidth\leftmarginv\advance\labelwidth-\labelsep}
+ \def\@listvi{\leftmargin\leftmarginvi
+ \labelwidth\leftmarginvi\advance\labelwidth-\labelsep}
+
+ \abovedisplayskip 7pt plus2pt minus5pt%
+ \belowdisplayskip \abovedisplayskip
+ \abovedisplayshortskip 0pt plus3pt%
+ \belowdisplayshortskip 4pt plus3pt minus3pt%
+
+ % Less leading in most fonts (due to the narrow columns)
+ % The choices were between 1-pt and 1.5-pt leading
+ \def\@normalsize{\@setsize\normalsize{11pt}\xpt\@xpt}
+ \def\small{\@setsize\small{10pt}\ixpt\@ixpt}
+ \def\footnotesize{\@setsize\footnotesize{10pt}\ixpt\@ixpt}
+ \def\scriptsize{\@setsize\scriptsize{8pt}\viipt\@viipt}
+ \def\tiny{\@setsize\tiny{7pt}\vipt\@vipt}
+ \def\large{\@setsize\large{14pt}\xiipt\@xiipt}
+ \def\Large{\@setsize\Large{16pt}\xivpt\@xivpt}
+ \def\LARGE{\@setsize\LARGE{20pt}\xviipt\@xviipt}
+ \def\huge{\@setsize\huge{23pt}\xxpt\@xxpt}
+ \def\Huge{\@setsize\Huge{28pt}\xxvpt\@xxvpt}
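The usage template in the style file's comments (written for the older acl2000.sty) can be sketched as a complete driver file. This is a minimal, untested sketch: it assumes the package is loaded as `acl2020`, matching this commit's filename, and uses placeholder title, authors, and a hypothetical citation key.

```latex
% Minimal driver sketch for the style file above (assumptions:
% the file is saved as acl2020.sty, so it is loaded as acl2020;
% title, authors, and the citation key are placeholders).
\documentclass[11pt]{article}
\usepackage{acl2020}   % or \usepackage[nohyperref]{acl2020}
%\aclfinalcopy         % uncomment for the de-anonymized camera-ready
\title{Title}
\author{Author 1 \and Author 2 \\ Address line \\ Address line \And
        Author 3 \\ Address line \\ Address line}
\begin{document}
\maketitle
\begin{abstract}
Abstract text.
\end{abstract}
Body text with an author-year citation \citep{someref2020}.
\bibliography{bibliography-file}
\bibliographystyle{acl_natbib}
\end{document}
```

Without `\aclfinalcopy`, the style's shipout hooks add the gray line-number rulers, page numbers, and the confidentiality banner on every page, and `\outauthor` replaces the author block with "Anonymous ACL submission".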
references/2019.arxiv.conneau/source/acl_natbib.bst ADDED
@@ -0,0 +1,1975 @@
+ %%% acl_natbib.bst
+ %%% Modification of BibTeX style file acl_natbib_nourl.bst
+ %%% ... by urlbst, version 0.7 (marked with "% urlbst")
+ %%% See <http://purl.org/nxg/dist/urlbst>
+ %%% Added webpage entry type, and url and lastchecked fields.
+ %%% Added eprint support.
+ %%% Added DOI support.
+ %%% Added PUBMED support.
+ %%% Added hyperref support.
+ %%% Original headers follow...
+
+ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+ %
+ % BibTeX style file acl_natbib_nourl.bst
+ %
+ % intended as input to urlbst script
+ % $ ./urlbst --hyperref --inlinelinks acl_natbib_nourl.bst > acl_natbib.bst
+ %
+ % adapted from compling.bst
+ % in order to mimic the style files for ACL conferences prior to 2017
+ % by making the following three changes:
+ % - for @incollection, page numbers now follow volume title.
+ % - for @inproceedings, address now follows conference name.
+ % (address is intended as location of conference,
+ % not address of publisher.)
+ % - for papers with three authors, use et al. in citation
+ % Dan Gildea 2017/06/08
+ % - fixed a bug with format.chapter - error given if chapter is empty
+ % with inbook.
+ % Shay Cohen 2018/02/16
+
+ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+ %
+ % BibTeX style file compling.bst
+ %
+ % Intended for the journal Computational Linguistics (ACL/MIT Press)
+ % Created by Ron Artstein on 2005/08/22
+ % For use with <natbib.sty> for author-year citations.
+ %
+ % I created this file in order to allow submissions to the journal
+ % Computational Linguistics using the <natbib> package for author-year
+ % citations, which offers a lot more flexibility than <fullname>, CL's
+ % official citation package. This file adheres strictly to the official
+ % style guide available from the MIT Press:
+ %
+ % http://mitpress.mit.edu/journals/coli/compling_style.pdf
+ %
+ % This includes all the various quirks of the style guide, for example:
+ % - a chapter from a monograph (@inbook) has no page numbers.
+ % - an article from an edited volume (@incollection) has page numbers
+ % after the publisher and address.
+ % - an article from a proceedings volume (@inproceedings) has page
+ % numbers before the publisher and address.
+ %
+ % Where the style guide was inconsistent or not specific enough I
+ % looked at actual published articles and exercised my own judgment.
+ % I noticed two inconsistencies in the style guide:
+ %
+ % - The style guide gives one example of an article from an edited
+ % volume with the editor's name spelled out in full, and another
+ % with the editors' names abbreviated. I chose to accept the first
+ % one as correct, since the style guide generally shuns abbreviations,
+ % and editors' names are also spelled out in some recently published
+ % articles.
+ %
+ % - The style guide gives one example of a reference where the word
+ % "and" between two authors is preceded by a comma. This is most
+ % likely a typo, since in all other cases with just two authors or
+ % editors there is no comma before the word "and".
+ %
+ % One case where the style guide is not being specific is the placement
+ % of the edition number, for which no example is given. I chose to put
+ % it immediately after the title, which I (subjectively) find natural,
+ % and is also the place of the edition in a few recently published
+ % articles.
+ %
+ % This file correctly reproduces all of the examples in the official
+ % style guide, except for the two inconsistencies noted above. I even
+ % managed to get it to correctly format the proceedings example which
+ % has an organization, a publisher, and two addresses (the conference
+ % location and the publisher's address), though I cheated a bit by
+ % putting the conference location and month as part of the title field;
+ % I feel that in this case the conference location and month can be
+ % considered as part of the title, and that adding a location field
+ % is not justified. Note also that a location field is not standard,
+ % so entries made with this field would not port nicely to other styles.
+ % However, if authors feel that there's a need for a location field
+ % then tell me and I'll see what I can do.
+ %
+ % The file also produces to my satisfaction all the bibliographical
+ % entries in my recent (joint) submission to CL (this was the original
+ % motivation for creating the file). I also tested it by running it
+ % on a larger set of entries and eyeballing the results. There may of
+ % course still be errors, especially with combinations of fields that
+ % are not that common, or with cross-references (which I seldom use).
+ % If you find such errors please write to me.
+ %
+ % I hope people find this file useful. Please email me with comments
+ % and suggestions.
+ %
+ % Ron Artstein
+ % artstein [at] essex.ac.uk
+ % August 22, 2005.
+ %
+ % Some technical notes.
+ %
+ % This file is based on a file generated with the package <custom-bib>
+ % by Patrick W. Daly (see selected options below), which was then
+ % manually customized to conform with certain CL requirements which
+ % cannot be met by <custom-bib>. Departures from the generated file
+ % include:
+ %
+ % Function inbook: moved publisher and address to the end; moved
+ % edition after title; replaced function format.chapter.pages by
+ % new function format.chapter to output chapter without pages.
+ %
+ % Function inproceedings: moved publisher and address to the end;
+ % replaced function format.in.ed.booktitle by new function
+ % format.in.booktitle to output the proceedings title without
+ % the editor.
+ %
+ % Functions book, incollection, manual: moved edition after title.
+ %
+ % Function mastersthesis: formatted title as for articles (unlike
+ % phdthesis which is formatted as book) and added month.
+ %
+ % Function proceedings: added new.sentence between organization and
+ % publisher when both are present.
+ %
+ % Function format.lab.names: modified so that it gives all the
+ % authors' surnames for in-text citations for one, two and three
+ % authors and only uses "et al." for works with four authors or more
+ % (thanks to Ken Shan for convincing me to go through the trouble of
+ % modifying this function rather than using unreliable hacks).
+ %
+ % Changes:
+ %
+ % 2006-10-27: Changed function reverse.pass so that the extra label is
+ % enclosed in parentheses when the year field ends in an uppercase or
+ % lowercase letter (change modeled after Uli Sauerland's modification
+ % of nals.bst). RA.
+ %
+ %
+ % The preamble of the generated file begins below:
+ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
146
+ %%
147
+ %% This is file `compling.bst',
148
+ %% generated with the docstrip utility.
149
+ %%
150
+ %% The original source files were:
151
+ %%
152
+ %% merlin.mbs (with options: `ay,nat,vonx,nm-revv1,jnrlst,keyxyr,blkyear,dt-beg,yr-per,note-yr,num-xser,pre-pub,xedn,nfss')
153
+ %% ----------------------------------------
154
+ %% *** Intended for the journal Computational Linguistics ***
155
+ %%
156
+ %% Copyright 1994-2002 Patrick W Daly
157
+ % ===============================================================
158
+ % IMPORTANT NOTICE:
159
+ % This bibliographic style (bst) file has been generated from one or
160
+ % more master bibliographic style (mbs) files, listed above.
161
+ %
162
+ % This generated file can be redistributed and/or modified under the terms
163
+ % of the LaTeX Project Public License Distributed from CTAN
164
+ % archives in directory macros/latex/base/lppl.txt; either
165
+ % version 1 of the License, or any later version.
166
+ % ===============================================================
167
+ % Name and version information of the main mbs file:
168
+ % \ProvidesFile{merlin.mbs}[2002/10/21 4.05 (PWD, AO, DPC)]
169
+ % For use with BibTeX version 0.99a or later
170
+ %-------------------------------------------------------------------
171
+ % This bibliography style file is intended for texts in ENGLISH
172
+ % This is an author-year citation style bibliography. As such, it is
173
+ % non-standard LaTeX, and requires a special package file to function properly.
174
+ % Such a package is natbib.sty by Patrick W. Daly
175
+ % The form of the \bibitem entries is
176
+ % \bibitem[Jones et al.(1990)]{key}...
177
+ % \bibitem[Jones et al.(1990)Jones, Baker, and Smith]{key}...
178
+ % The essential feature is that the label (the part in brackets) consists
179
+ % of the author names, as they should appear in the citation, with the year
180
+ % in parentheses following. There must be no space before the opening
181
+ % parenthesis!
182
+ % With natbib v5.3, a full list of authors may also follow the year.
183
+ % In natbib.sty, it is possible to define the type of enclosures that is
184
+ % really wanted (brackets or parentheses), but in either case, there must
185
+ % be parentheses in the label.
186
+ % The \cite command functions as follows:
187
+ % \citet{key} ==>> Jones et al. (1990)
188
+ % \citet*{key} ==>> Jones, Baker, and Smith (1990)
189
+ % \citep{key} ==>> (Jones et al., 1990)
190
+ % \citep*{key} ==>> (Jones, Baker, and Smith, 1990)
191
+ % \citep[chap. 2]{key} ==>> (Jones et al., 1990, chap. 2)
192
+ % \citep[e.g.][]{key} ==>> (e.g. Jones et al., 1990)
193
+ % \citep[e.g.][p. 32]{key} ==>> (e.g. Jones et al., p. 32)
194
+ % \citeauthor{key} ==>> Jones et al.
195
+ % \citeauthor*{key} ==>> Jones, Baker, and Smith
196
+ % \citeyear{key} ==>> 1990
197
+ %---------------------------------------------------------------------
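The natbib behaviour listed above can be exercised with a minimal document (a sketch; the bibliography file `myrefs.bib` and the citation key `jones90` are placeholder names):

```latex
\documentclass{article}
\usepackage{natbib}  % required: this style emits natbib-format \bibitem labels
\begin{document}
\citet{jones90} argue that \dots      % Jones et al. (1990)
\dots as shown earlier \citep{jones90}. % (Jones et al., 1990)
\bibliographystyle{compling}
\bibliography{myrefs}
\end{document}
```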
198
+
+ ENTRY
+ { address
+ author
+ booktitle
+ chapter
+ edition
+ editor
+ howpublished
+ institution
+ journal
+ key
+ month
+ note
+ number
+ organization
+ pages
+ publisher
+ school
+ series
+ title
+ type
+ volume
+ year
+ eprint % urlbst
+ doi % urlbst
+ pubmed % urlbst
+ url % urlbst
+ lastchecked % urlbst
+ }
+ {}
+ { label extra.label sort.label short.list }
+ INTEGERS { output.state before.all mid.sentence after.sentence after.block }
+ % urlbst...
+ % urlbst constants and state variables
+ STRINGS { urlintro
+ eprinturl eprintprefix doiprefix doiurl pubmedprefix pubmedurl
+ citedstring onlinestring linktextstring
+ openinlinelink closeinlinelink }
+ INTEGERS { hrefform inlinelinks makeinlinelink
+ addeprints adddoiresolver addpubmedresolver }
+ FUNCTION {init.urlbst.variables}
+ {
+ % The following constants may be adjusted by hand, if desired
+
+ % The first set allow you to enable or disable certain functionality.
+ #1 'addeprints := % 0=no eprints; 1=include eprints
+ #1 'adddoiresolver := % 0=no DOI resolver; 1=include it
+ #1 'addpubmedresolver := % 0=no PUBMED resolver; 1=include it
+ #2 'hrefform := % 0=no crossrefs; 1=hypertex xrefs; 2=hyperref refs
+ #1 'inlinelinks := % 0=URLs explicit; 1=URLs attached to titles
+
+ % String constants, which you _might_ want to tweak.
+ "URL: " 'urlintro := % prefix before URL; typically "Available from:" or "URL":
+ "online" 'onlinestring := % indication that resource is online; typically "online"
+ "cited " 'citedstring := % indicator of citation date; typically "cited "
+ "[link]" 'linktextstring := % dummy link text; typically "[link]"
+ "http://arxiv.org/abs/" 'eprinturl := % prefix to make URL from eprint ref
+ "arXiv:" 'eprintprefix := % text prefix printed before eprint ref; typically "arXiv:"
+ "https://doi.org/" 'doiurl := % prefix to make URL from DOI
+ "doi:" 'doiprefix := % text prefix printed before DOI ref; typically "doi:"
+ "http://www.ncbi.nlm.nih.gov/pubmed/" 'pubmedurl := % prefix to make URL from PUBMED
+ "PMID:" 'pubmedprefix := % text prefix printed before PUBMED ref; typically "PMID:"
+
+ % The following are internal state variables, not configuration constants,
+ % so they shouldn't be fiddled with.
+ #0 'makeinlinelink := % state variable managed by possibly.setup.inlinelink
+ "" 'openinlinelink := % ditto
+ "" 'closeinlinelink := % ditto
+ }
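With the defaults set above, `hrefform` is `#2`, so the style typesets links with `\href`; a document using this configuration therefore needs the package that provides it (a sketch, assuming the defaults are kept):

```latex
% hrefform = 2 above selects \href links, which the hyperref package provides;
% hrefform = 1 emits raw HyperTeX \special commands instead, and
% hrefform = 0 disables link generation entirely.
\usepackage{hyperref}
```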
268
+ INTEGERS {
+ bracket.state
+ outside.brackets
+ open.brackets
+ within.brackets
+ close.brackets
+ }
+ % ...urlbst to here
+ FUNCTION {init.state.consts}
+ { #0 'outside.brackets := % urlbst...
+ #1 'open.brackets :=
+ #2 'within.brackets :=
+ #3 'close.brackets := % ...urlbst to here
+
+ #0 'before.all :=
+ #1 'mid.sentence :=
+ #2 'after.sentence :=
+ #3 'after.block :=
+ }
+ STRINGS { s t}
+ % urlbst
+ FUNCTION {output.nonnull.original}
+ { 's :=
+ output.state mid.sentence =
+ { ", " * write$ }
+ { output.state after.block =
+ { add.period$ write$
+ newline$
+ "\newblock " write$
+ }
+ { output.state before.all =
+ 'write$
+ { add.period$ " " * write$ }
+ if$
+ }
+ if$
+ mid.sentence 'output.state :=
+ }
+ if$
+ s
+ }
+
+ % urlbst...
+ % The following three functions are for handling inlinelink. They wrap
+ % a block of text which is potentially output with write$ by multiple
+ % other functions, so we don't know the content a priori.
+ % They communicate between each other using the variables makeinlinelink
+ % (which is true if a link should be made), and closeinlinelink (which holds
+ % the string which should close any current link). They can be called
+ % at any time, but start.inlinelink will be a no-op unless something has
+ % previously set makeinlinelink true, and the two ...end.inlinelink functions
+ % will only do their stuff if start.inlinelink has previously set
+ % closeinlinelink to be non-empty.
+ % (thanks to 'ijvm' for suggested code here)
+ FUNCTION {uand}
+ { 'skip$ { pop$ #0 } if$ } % 'and' (which isn't defined at this point in the file)
+ FUNCTION {possibly.setup.inlinelink}
+ { makeinlinelink hrefform #0 > uand
+ { doi empty$ adddoiresolver uand
+ { pubmed empty$ addpubmedresolver uand
+ { eprint empty$ addeprints uand
+ { url empty$
+ { "" }
+ { url }
+ if$ }
+ { eprinturl eprint * }
+ if$ }
+ { pubmedurl pubmed * }
+ if$ }
+ { doiurl doi * }
+ if$
+ % an appropriately-formatted URL is now on the stack
+ hrefform #1 = % hypertex
+ { "\special {html:<a href=" quote$ * swap$ * quote$ * "> }{" * 'openinlinelink :=
+ "\special {html:</a>}" 'closeinlinelink := }
+ { "\href {" swap$ * "} {" * 'openinlinelink := % hrefform=#2 -- hyperref
+ % the space between "} {" matters: a URL of just the right length can cause "\% newline em"
+ "}" 'closeinlinelink := }
+ if$
+ #0 'makeinlinelink :=
+ }
+ 'skip$
+ if$ % makeinlinelink
+ }
+ FUNCTION {add.inlinelink}
+ { openinlinelink empty$
+ 'skip$
+ { openinlinelink swap$ * closeinlinelink *
+ "" 'openinlinelink :=
+ }
+ if$
+ }
+ FUNCTION {output.nonnull}
+ { % Save the thing we've been asked to output
+ 's :=
+ % If the bracket-state is close.brackets, then add a close-bracket to
+ % what is currently at the top of the stack, and set bracket.state
+ % to outside.brackets
+ bracket.state close.brackets =
+ { "]" *
+ outside.brackets 'bracket.state :=
+ }
+ 'skip$
+ if$
+ bracket.state outside.brackets =
+ { % We're outside all brackets -- this is the normal situation.
+ % Write out what's currently at the top of the stack, using the
+ % original output.nonnull function.
+ s
+ add.inlinelink
+ output.nonnull.original % invoke the original output.nonnull
+ }
+ { % Still in brackets. Add open-bracket or (continuation) comma, add the
+ % new text (in s) to the top of the stack, and move to the close-brackets
+ % state, ready for next time (unless inbrackets resets it). If we come
+ % into this branch, then output.state is carefully undisturbed.
+ bracket.state open.brackets =
+ { " [" * }
+ { ", " * } % bracket.state will be within.brackets
+ if$
+ s *
+ close.brackets 'bracket.state :=
+ }
+ if$
+ }
+
+ % Call this function just before adding something which should be presented in
+ % brackets. bracket.state is handled specially within output.nonnull.
+ FUNCTION {inbrackets}
+ { bracket.state close.brackets =
+ { within.brackets 'bracket.state := } % reset the state: not open nor closed
+ { open.brackets 'bracket.state := }
+ if$
+ }
+
+ FUNCTION {format.lastchecked}
+ { lastchecked empty$
+ { "" }
+ { inbrackets citedstring lastchecked * }
+ if$
+ }
+ % ...urlbst to here
+ FUNCTION {output}
+ { duplicate$ empty$
+ 'pop$
+ 'output.nonnull
+ if$
+ }
+ FUNCTION {output.check}
+ { 't :=
+ duplicate$ empty$
+ { pop$ "empty " t * " in " * cite$ * warning$ }
+ 'output.nonnull
+ if$
+ }
+ FUNCTION {fin.entry.original} % urlbst (renamed from fin.entry, so it can be wrapped below)
+ { add.period$
+ write$
+ newline$
+ }
+
+ FUNCTION {new.block}
+ { output.state before.all =
+ 'skip$
+ { after.block 'output.state := }
+ if$
+ }
+ FUNCTION {new.sentence}
+ { output.state after.block =
+ 'skip$
+ { output.state before.all =
+ 'skip$
+ { after.sentence 'output.state := }
+ if$
+ }
+ if$
+ }
+ FUNCTION {add.blank}
+ { " " * before.all 'output.state :=
+ }
+
+ FUNCTION {date.block}
+ {
+ new.block
+ }
+
+ FUNCTION {not}
+ { { #0 }
+ { #1 }
+ if$
+ }
+ FUNCTION {and}
+ { 'skip$
+ { pop$ #0 }
+ if$
+ }
+ FUNCTION {or}
+ { { pop$ #1 }
+ 'skip$
+ if$
+ }
+ FUNCTION {new.block.checkb}
+ { empty$
+ swap$ empty$
+ and
+ 'skip$
+ 'new.block
+ if$
+ }
+ FUNCTION {field.or.null}
+ { duplicate$ empty$
+ { pop$ "" }
+ 'skip$
+ if$
+ }
+ FUNCTION {emphasize}
+ { duplicate$ empty$
+ { pop$ "" }
+ { "\emph{" swap$ * "}" * }
+ if$
+ }
+ FUNCTION {tie.or.space.prefix}
+ { duplicate$ text.length$ #3 <
+ { "~" }
+ { " " }
+ if$
+ swap$
+ }
+
+ FUNCTION {capitalize}
+ { "u" change.case$ "t" change.case$ }
+
+ FUNCTION {space.word}
+ { " " swap$ * " " * }
+ % Here are the language-specific definitions for explicit words.
+ % Each function has a name bbl.xxx where xxx is the English word.
+ % The language selected here is ENGLISH
+ FUNCTION {bbl.and}
+ { "and"}
+
+ FUNCTION {bbl.etal}
+ { "et~al." }
+
+ FUNCTION {bbl.editors}
+ { "editors" }
+
+ FUNCTION {bbl.editor}
+ { "editor" }
+
+ FUNCTION {bbl.edby}
+ { "edited by" }
+
+ FUNCTION {bbl.edition}
+ { "edition" }
+
+ FUNCTION {bbl.volume}
+ { "volume" }
+
+ FUNCTION {bbl.of}
+ { "of" }
+
+ FUNCTION {bbl.number}
+ { "number" }
+
+ FUNCTION {bbl.nr}
+ { "no." }
+
+ FUNCTION {bbl.in}
+ { "in" }
+
+ FUNCTION {bbl.pages}
+ { "pages" }
+
+ FUNCTION {bbl.page}
+ { "page" }
+
+ FUNCTION {bbl.chapter}
+ { "chapter" }
+
+ FUNCTION {bbl.techrep}
+ { "Technical Report" }
+
+ FUNCTION {bbl.mthesis}
+ { "Master's thesis" }
+
+ FUNCTION {bbl.phdthesis}
+ { "Ph.D. thesis" }
+
+ MACRO {jan} {"January"}
+
+ MACRO {feb} {"February"}
+
+ MACRO {mar} {"March"}
+
+ MACRO {apr} {"April"}
+
+ MACRO {may} {"May"}
+
+ MACRO {jun} {"June"}
+
+ MACRO {jul} {"July"}
+
+ MACRO {aug} {"August"}
+
+ MACRO {sep} {"September"}
+
+ MACRO {oct} {"October"}
+
+ MACRO {nov} {"November"}
+
+ MACRO {dec} {"December"}
+
+ MACRO {acmcs} {"ACM Computing Surveys"}
+
+ MACRO {acta} {"Acta Informatica"}
+
+ MACRO {cacm} {"Communications of the ACM"}
+
+ MACRO {ibmjrd} {"IBM Journal of Research and Development"}
+
+ MACRO {ibmsj} {"IBM Systems Journal"}
+
+ MACRO {ieeese} {"IEEE Transactions on Software Engineering"}
+
+ MACRO {ieeetc} {"IEEE Transactions on Computers"}
+
+ MACRO {ieeetcad}
+ {"IEEE Transactions on Computer-Aided Design of Integrated Circuits"}
+
+ MACRO {ipl} {"Information Processing Letters"}
+
+ MACRO {jacm} {"Journal of the ACM"}
+
+ MACRO {jcss} {"Journal of Computer and System Sciences"}
+
+ MACRO {scp} {"Science of Computer Programming"}
+
+ MACRO {sicomp} {"SIAM Journal on Computing"}
+
+ MACRO {tocs} {"ACM Transactions on Computer Systems"}
+
+ MACRO {tods} {"ACM Transactions on Database Systems"}
+
+ MACRO {tog} {"ACM Transactions on Graphics"}
+
+ MACRO {toms} {"ACM Transactions on Mathematical Software"}
+
+ MACRO {toois} {"ACM Transactions on Office Information Systems"}
+
+ MACRO {toplas} {"ACM Transactions on Programming Languages and Systems"}
+
+ MACRO {tcs} {"Theoretical Computer Science"}
+ FUNCTION {bibinfo.check}
+ { swap$
+ duplicate$ missing$
+ {
+ pop$ pop$
+ ""
+ }
+ { duplicate$ empty$
+ {
+ swap$ pop$
+ }
+ { swap$
+ pop$
+ }
+ if$
+ }
+ if$
+ }
+ FUNCTION {bibinfo.warn}
+ { swap$
+ duplicate$ missing$
+ {
+ swap$ "missing " swap$ * " in " * cite$ * warning$ pop$
+ ""
+ }
+ { duplicate$ empty$
+ {
+ swap$ "empty " swap$ * " in " * cite$ * warning$
+ }
+ { swap$
+ pop$
+ }
+ if$
+ }
+ if$
+ }
+ STRINGS { bibinfo}
+ INTEGERS { nameptr namesleft numnames }
+
+ FUNCTION {format.names}
+ { 'bibinfo :=
+ duplicate$ empty$ 'skip$ {
+ 's :=
+ "" 't :=
+ #1 'nameptr :=
+ s num.names$ 'numnames :=
+ numnames 'namesleft :=
+ { namesleft #0 > }
+ { s nameptr
+ duplicate$ #1 >
+ { "{ff~}{vv~}{ll}{, jj}" }
+ { "{ff~}{vv~}{ll}{, jj}" } % first name first for first author
+ % { "{vv~}{ll}{, ff}{, jj}" } % last name first for first author
+ if$
+ format.name$
+ bibinfo bibinfo.check
+ 't :=
+ nameptr #1 >
+ {
+ namesleft #1 >
+ { ", " * t * }
+ {
+ numnames #2 >
+ { "," * }
+ 'skip$
+ if$
+ s nameptr "{ll}" format.name$ duplicate$ "others" =
+ { 't := }
+ { pop$ }
+ if$
+ t "others" =
+ {
+ " " * bbl.etal *
+ }
+ {
+ bbl.and
+ space.word * t *
+ }
+ if$
+ }
+ if$
+ }
+ 't
+ if$
+ nameptr #1 + 'nameptr :=
+ namesleft #1 - 'namesleft :=
+ }
+ while$
+ } if$
+ }
+ FUNCTION {format.names.ed}
+ {
+ 'bibinfo :=
+ duplicate$ empty$ 'skip$ {
+ 's :=
+ "" 't :=
+ #1 'nameptr :=
+ s num.names$ 'numnames :=
+ numnames 'namesleft :=
+ { namesleft #0 > }
+ { s nameptr
+ "{ff~}{vv~}{ll}{, jj}"
+ format.name$
+ bibinfo bibinfo.check
+ 't :=
+ nameptr #1 >
+ {
+ namesleft #1 >
+ { ", " * t * }
+ {
+ numnames #2 >
+ { "," * }
+ 'skip$
+ if$
+ s nameptr "{ll}" format.name$ duplicate$ "others" =
+ { 't := }
+ { pop$ }
+ if$
+ t "others" =
+ {
+
+ " " * bbl.etal *
+ }
+ {
+ bbl.and
+ space.word * t *
+ }
+ if$
+ }
+ if$
+ }
+ 't
+ if$
+ nameptr #1 + 'nameptr :=
+ namesleft #1 - 'namesleft :=
+ }
+ while$
+ } if$
+ }
+ FUNCTION {format.key}
+ { empty$
+ { key field.or.null }
+ { "" }
+ if$
+ }
+
+ FUNCTION {format.authors}
+ { author "author" format.names
+ }
+ FUNCTION {get.bbl.editor}
+ { editor num.names$ #1 > 'bbl.editors 'bbl.editor if$ }
+
+ FUNCTION {format.editors}
+ { editor "editor" format.names duplicate$ empty$ 'skip$
+ {
+ "," *
+ " " *
+ get.bbl.editor
+ *
+ }
+ if$
+ }
+ FUNCTION {format.note}
+ {
+ note empty$
+ { "" }
+ { note #1 #1 substring$
+ duplicate$ "{" =
+ 'skip$
+ { output.state mid.sentence =
+ { "l" }
+ { "u" }
+ if$
+ change.case$
+ }
+ if$
+ note #2 global.max$ substring$ * "note" bibinfo.check
+ }
+ if$
+ }
+
+ FUNCTION {format.title}
+ { title
+ duplicate$ empty$ 'skip$
+ { "t" change.case$ }
+ if$
+ "title" bibinfo.check
+ }
+ FUNCTION {format.full.names}
+ {'s :=
+ "" 't :=
+ #1 'nameptr :=
+ s num.names$ 'numnames :=
+ numnames 'namesleft :=
+ { namesleft #0 > }
+ { s nameptr
+ "{vv~}{ll}" format.name$
+ 't :=
+ nameptr #1 >
+ {
+ namesleft #1 >
+ { ", " * t * }
+ {
+ s nameptr "{ll}" format.name$ duplicate$ "others" =
+ { 't := }
+ { pop$ }
+ if$
+ t "others" =
+ {
+ " " * bbl.etal *
+ }
+ {
+ numnames #2 >
+ { "," * }
+ 'skip$
+ if$
+ bbl.and
+ space.word * t *
+ }
+ if$
+ }
+ if$
+ }
+ 't
+ if$
+ nameptr #1 + 'nameptr :=
+ namesleft #1 - 'namesleft :=
+ }
+ while$
+ }
+
+ FUNCTION {author.editor.key.full}
+ { author empty$
+ { editor empty$
+ { key empty$
+ { cite$ #1 #3 substring$ }
+ 'key
+ if$
+ }
+ { editor format.full.names }
+ if$
+ }
+ { author format.full.names }
+ if$
+ }
+
+ FUNCTION {author.key.full}
+ { author empty$
+ { key empty$
+ { cite$ #1 #3 substring$ }
+ 'key
+ if$
+ }
+ { author format.full.names }
+ if$
+ }
+
+ FUNCTION {editor.key.full}
+ { editor empty$
+ { key empty$
+ { cite$ #1 #3 substring$ }
+ 'key
+ if$
+ }
+ { editor format.full.names }
+ if$
+ }
+
+ FUNCTION {make.full.names}
+ { type$ "book" =
+ type$ "inbook" =
+ or
+ 'author.editor.key.full
+ { type$ "proceedings" =
+ 'editor.key.full
+ 'author.key.full
+ if$
+ }
+ if$
+ }
+
+ FUNCTION {output.bibitem.original} % urlbst (renamed from output.bibitem, so it can be wrapped below)
+ { newline$
+ "\bibitem[{" write$
+ label write$
+ ")" make.full.names duplicate$ short.list =
+ { pop$ }
+ { * }
+ if$
+ "}]{" * write$
+ cite$ write$
+ "}" write$
+ newline$
+ ""
+ before.all 'output.state :=
+ }
+
+ FUNCTION {n.dashify}
+ {
+ 't :=
+ ""
+ { t empty$ not }
+ { t #1 #1 substring$ "-" =
+ { t #1 #2 substring$ "--" = not
+ { "--" *
+ t #2 global.max$ substring$ 't :=
+ }
+ { { t #1 #1 substring$ "-" = }
+ { "-" *
+ t #2 global.max$ substring$ 't :=
+ }
+ while$
+ }
+ if$
+ }
+ { t #1 #1 substring$ *
+ t #2 global.max$ substring$ 't :=
+ }
+ if$
+ }
+ while$
+ }
+
+ FUNCTION {word.in}
+ { bbl.in capitalize
+ " " * }
+
+ FUNCTION {format.date}
+ { year "year" bibinfo.check duplicate$ empty$
+ {
+ }
+ 'skip$
+ if$
+ extra.label *
+ before.all 'output.state :=
+ after.sentence 'output.state :=
+ }
+ FUNCTION {format.btitle}
+ { title "title" bibinfo.check
+ duplicate$ empty$ 'skip$
+ {
+ emphasize
+ }
+ if$
+ }
+ FUNCTION {either.or.check}
+ { empty$
+ 'pop$
+ { "can't use both " swap$ * " fields in " * cite$ * warning$ }
+ if$
+ }
+ FUNCTION {format.bvolume}
+ { volume empty$
+ { "" }
+ { bbl.volume volume tie.or.space.prefix
+ "volume" bibinfo.check * *
+ series "series" bibinfo.check
+ duplicate$ empty$ 'pop$
+ { swap$ bbl.of space.word * swap$
+ emphasize * }
+ if$
+ "volume and number" number either.or.check
+ }
+ if$
+ }
+ FUNCTION {format.number.series}
+ { volume empty$
+ { number empty$
+ { series field.or.null }
+ { series empty$
+ { number "number" bibinfo.check }
+ { output.state mid.sentence =
+ { bbl.number }
+ { bbl.number capitalize }
+ if$
+ number tie.or.space.prefix "number" bibinfo.check * *
+ bbl.in space.word *
+ series "series" bibinfo.check *
+ }
+ if$
+ }
+ if$
+ }
+ { "" }
+ if$
+ }
+
+ FUNCTION {format.edition}
+ { edition duplicate$ empty$ 'skip$
+ {
+ output.state mid.sentence =
+ { "l" }
+ { "t" }
+ if$ change.case$
+ "edition" bibinfo.check
+ " " * bbl.edition *
+ }
+ if$
+ }
+ INTEGERS { multiresult }
+ FUNCTION {multi.page.check}
+ { 't :=
+ #0 'multiresult :=
+ { multiresult not
+ t empty$ not
+ and
+ }
+ { t #1 #1 substring$
+ duplicate$ "-" =
+ swap$ duplicate$ "," =
+ swap$ "+" =
+ or or
+ { #1 'multiresult := }
+ { t #2 global.max$ substring$ 't := }
+ if$
+ }
+ while$
+ multiresult
+ }
+ FUNCTION {format.pages}
+ { pages duplicate$ empty$ 'skip$
+ { duplicate$ multi.page.check
+ {
+ bbl.pages swap$
+ n.dashify
+ }
+ {
+ bbl.page swap$
+ }
+ if$
+ tie.or.space.prefix
+ "pages" bibinfo.check
+ * *
+ }
+ if$
+ }
+ FUNCTION {format.journal.pages}
+ { pages duplicate$ empty$ 'pop$
+ { swap$ duplicate$ empty$
+ { pop$ pop$ format.pages }
+ {
+ ":" *
+ swap$
+ n.dashify
+ "pages" bibinfo.check
+ *
+ }
+ if$
+ }
+ if$
+ }
+ FUNCTION {format.vol.num.pages}
+ { volume field.or.null
+ duplicate$ empty$ 'skip$
+ {
+ "volume" bibinfo.check
+ }
+ if$
+ number "number" bibinfo.check duplicate$ empty$ 'skip$
+ {
+ swap$ duplicate$ empty$
+ { "there's a number but no volume in " cite$ * warning$ }
+ 'skip$
+ if$
+ swap$
+ "(" swap$ * ")" *
+ }
+ if$ *
+ format.journal.pages
+ }
+
+ FUNCTION {format.chapter}
+ { chapter empty$
+ 'format.pages
+ { type empty$
+ { bbl.chapter }
+ { type "l" change.case$
+ "type" bibinfo.check
+ }
+ if$
+ chapter tie.or.space.prefix
+ "chapter" bibinfo.check
+ * *
+ }
+ if$
+ }
+
+ FUNCTION {format.chapter.pages}
+ { chapter empty$
+ 'format.pages
+ { type empty$
+ { bbl.chapter }
+ { type "l" change.case$
+ "type" bibinfo.check
+ }
+ if$
+ chapter tie.or.space.prefix
+ "chapter" bibinfo.check
+ * *
+ pages empty$
+ 'skip$
+ { ", " * format.pages * }
+ if$
+ }
+ if$
+ }
+
+ FUNCTION {format.booktitle}
+ {
+ booktitle "booktitle" bibinfo.check
+ emphasize
+ }
+ FUNCTION {format.in.booktitle}
+ { format.booktitle duplicate$ empty$ 'skip$
+ {
+ word.in swap$ *
+ }
+ if$
+ }
+ FUNCTION {format.in.ed.booktitle}
+ { format.booktitle duplicate$ empty$ 'skip$
+ {
+ editor "editor" format.names.ed duplicate$ empty$ 'pop$
+ {
+ "," *
+ " " *
+ get.bbl.editor
+ ", " *
+ * swap$
+ * }
+ if$
+ word.in swap$ *
+ }
+ if$
+ }
+ FUNCTION {format.thesis.type}
+ { type duplicate$ empty$
+ 'pop$
+ { swap$ pop$
+ "t" change.case$ "type" bibinfo.check
+ }
+ if$
+ }
+ FUNCTION {format.tr.number}
+ { number "number" bibinfo.check
+ type duplicate$ empty$
+ { pop$ bbl.techrep }
+ 'skip$
+ if$
+ "type" bibinfo.check
+ swap$ duplicate$ empty$
+ { pop$ "t" change.case$ }
+ { tie.or.space.prefix * * }
+ if$
+ }
+ FUNCTION {format.article.crossref}
+ {
+ word.in
+ " \cite{" * crossref * "}" *
+ }
+ FUNCTION {format.book.crossref}
+ { volume duplicate$ empty$
+ { "empty volume in " cite$ * "'s crossref of " * crossref * warning$
+ pop$ word.in
+ }
+ { bbl.volume
+ capitalize
+ swap$ tie.or.space.prefix "volume" bibinfo.check * * bbl.of space.word *
+ }
+ if$
+ " \cite{" * crossref * "}" *
+ }
+ FUNCTION {format.incoll.inproc.crossref}
+ {
+ word.in
+ " \cite{" * crossref * "}" *
+ }
+ FUNCTION {format.org.or.pub}
+ { 't :=
+ ""
+ address empty$ t empty$ and
+ 'skip$
+ {
+ t empty$
+ { address "address" bibinfo.check *
+ }
+ { t *
+ address empty$
+ 'skip$
+ { ", " * address "address" bibinfo.check * }
+ if$
+ }
+ if$
+ }
+ if$
+ }
+ FUNCTION {format.publisher.address}
+ { publisher "publisher" bibinfo.warn format.org.or.pub
+ }
+
+ FUNCTION {format.organization.address}
+ { organization "organization" bibinfo.check format.org.or.pub
+ }
+
+ % urlbst...
+ % Functions for making hypertext links.
+ % In all cases, the stack has (link-text href-url)
+ %
+ % make 'null' specials
+ FUNCTION {make.href.null}
+ {
+ pop$
+ }
+ % make hypertex specials
+ FUNCTION {make.href.hypertex}
+ {
+ "\special {html:<a href=" quote$ *
+ swap$ * quote$ * "> }" * swap$ *
+ "\special {html:</a>}" *
+ }
+ % make hyperref specials
+ FUNCTION {make.href.hyperref}
+ {
+ "\href {" swap$ * "} {\path{" * swap$ * "}}" *
+ }
+ FUNCTION {make.href}
+ { hrefform #2 =
+ 'make.href.hyperref % hrefform = 2
+ { hrefform #1 =
+ 'make.href.hypertex % hrefform = 1
+ 'make.href.null % hrefform = 0 (or anything else)
+ if$
+ }
+ if$
+ }
+
+ % If inlinelinks is true, then format.url should be a no-op, since it's
+ % (a) redundant, and (b) could end up as a link-within-a-link.
+ FUNCTION {format.url}
+ { inlinelinks #1 = url empty$ or
+ { "" }
+ { hrefform #1 =
+ { % special case -- add HyperTeX specials
+ urlintro "\url{" url * "}" * url make.href.hypertex * }
+ { urlintro "\url{" * url * "}" * }
+ if$
+ }
+ if$
+ }
+
+ FUNCTION {format.eprint}
+ { eprint empty$
+ { "" }
+ { eprintprefix eprint * eprinturl eprint * make.href }
+ if$
+ }
+
+ FUNCTION {format.doi}
+ { doi empty$
+ { "" }
+ { doiprefix doi * doiurl doi * make.href }
+ if$
+ }
+
+ FUNCTION {format.pubmed}
+ { pubmed empty$
+ { "" }
+ { pubmedprefix pubmed * pubmedurl pubmed * make.href }
+ if$
+ }
+
+ % Output a URL. We can't use the more normal idiom (something like
+ % `format.url output'), because the `inbrackets' within
+ % format.lastchecked applies to everything between calls to `output',
+ % so that `format.url format.lastchecked * output' ends up with both
+ % the URL and the lastchecked in brackets.
+ FUNCTION {output.url}
+ { url empty$
+ 'skip$
+ { new.block
+ format.url output
+ format.lastchecked output
+ }
+ if$
+ }
+
+ FUNCTION {output.web.refs}
+ {
+ new.block
+ inlinelinks
+ 'skip$ % links were inline -- don't repeat them
+ {
+ output.url
+ addeprints eprint empty$ not and
+ { format.eprint output.nonnull }
+ 'skip$
+ if$
+ adddoiresolver doi empty$ not and
+ { format.doi output.nonnull }
+ 'skip$
+ if$
+ addpubmedresolver pubmed empty$ not and
+ { format.pubmed output.nonnull }
+ 'skip$
+ if$
+ }
+ if$
+ }
+
+ % Wrapper for output.bibitem.original.
+ % If the URL field is not empty, set makeinlinelink to be true,
+ % so that an inline link will be started at the next opportunity
+ FUNCTION {output.bibitem}
+ { outside.brackets 'bracket.state :=
+ output.bibitem.original
+ inlinelinks url empty$ not doi empty$ not or pubmed empty$ not or eprint empty$ not or and
+ { #1 'makeinlinelink := }
+ { #0 'makeinlinelink := }
+ if$
+ }
+
+ % Wrapper for fin.entry.original
+ FUNCTION {fin.entry}
+ { output.web.refs % urlbst
+ makeinlinelink % ooops, it appears we didn't have a title for inlinelink
+ { possibly.setup.inlinelink % add some artificial link text here, as a fallback
+ linktextstring output.nonnull }
+ 'skip$
+ if$
+ bracket.state close.brackets = % urlbst
+ { "]" * }
+ 'skip$
+ if$
+ fin.entry.original
+ }
+
+ % Webpage entry type.
+ % Title and url fields required;
+ % author, note, year, month, and lastchecked fields optional
+ % See references
+ % ISO 690-2 http://www.nlc-bnc.ca/iso/tc46sc9/standard/690-2e.htm
+ % http://www.classroom.net/classroom/CitingNetResources.html
+ % http://neal.ctstateu.edu/history/cite.html
+ % http://www.cas.usf.edu/english/walker/mla.html
+ % for citation formats for web pages.
1365
+ FUNCTION {webpage}
1366
+ { output.bibitem
1367
+ author empty$
1368
+ { editor empty$
1369
+ 'skip$ % author and editor both optional
1370
+ { format.editors output.nonnull }
1371
+ if$
1372
+ }
1373
+ { editor empty$
1374
+ { format.authors output.nonnull }
1375
+ { "can't use both author and editor fields in " cite$ * warning$ }
1376
+ if$
1377
+ }
1378
+ if$
1379
+ new.block
1380
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$
1381
+ format.title "title" output.check
1382
+ inbrackets onlinestring output
1383
+ new.block
1384
+ year empty$
1385
+ 'skip$
1386
+ { format.date "year" output.check }
1387
+ if$
1388
+ % We don't need to output the URL details ('lastchecked' and 'url'),
1389
+ % because fin.entry does that for us, using output.web.refs. The only
1390
+ % reason we would want to put them here is if we were to decide that
1391
+ % they should go in front of the rather miscellaneous information in 'note'.
1392
+ new.block
1393
+ note output
1394
+ fin.entry
1395
+ }
1396
+ % ...urlbst to here
1397
+
1398
+
1399
+ FUNCTION {article}
1400
+ { output.bibitem
1401
+ format.authors "author" output.check
1402
+ author format.key output
1403
+ format.date "year" output.check
1404
+ date.block
1405
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
1406
+ format.title "title" output.check
1407
+ new.block
1408
+ crossref missing$
1409
+ {
1410
+ journal
1411
+ "journal" bibinfo.check
1412
+ emphasize
1413
+ "journal" output.check
1414
+ possibly.setup.inlinelink format.vol.num.pages output% urlbst
1415
+ }
1416
+ { format.article.crossref output.nonnull
1417
+ format.pages output
1418
+ }
1419
+ if$
1420
+ new.block
1421
+ format.note output
1422
+ fin.entry
1423
+ }
1424
+ FUNCTION {book}
1425
+ { output.bibitem
1426
+ author empty$
1427
+ { format.editors "author and editor" output.check
1428
+ editor format.key output
1429
+ }
1430
+ { format.authors output.nonnull
1431
+ crossref missing$
1432
+ { "author and editor" editor either.or.check }
1433
+ 'skip$
1434
+ if$
1435
+ }
1436
+ if$
1437
+ format.date "year" output.check
1438
+ date.block
1439
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
1440
+ format.btitle "title" output.check
1441
+ format.edition output
1442
+ crossref missing$
1443
+ { format.bvolume output
1444
+ new.block
1445
+ format.number.series output
1446
+ new.sentence
1447
+ format.publisher.address output
1448
+ }
1449
+ {
1450
+ new.block
1451
+ format.book.crossref output.nonnull
1452
+ }
1453
+ if$
1454
+ new.block
1455
+ format.note output
1456
+ fin.entry
1457
+ }
1458
+ FUNCTION {booklet}
1459
+ { output.bibitem
1460
+ format.authors output
1461
+ author format.key output
1462
+ format.date "year" output.check
1463
+ date.block
1464
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
1465
+ format.title "title" output.check
1466
+ new.block
1467
+ howpublished "howpublished" bibinfo.check output
1468
+ address "address" bibinfo.check output
1469
+ new.block
1470
+ format.note output
1471
+ fin.entry
1472
+ }
1473
+
1474
+ FUNCTION {inbook}
1475
+ { output.bibitem
1476
+ author empty$
1477
+ { format.editors "author and editor" output.check
1478
+ editor format.key output
1479
+ }
1480
+ { format.authors output.nonnull
1481
+ crossref missing$
1482
+ { "author and editor" editor either.or.check }
1483
+ 'skip$
1484
+ if$
1485
+ }
1486
+ if$
1487
+ format.date "year" output.check
1488
+ date.block
1489
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
1490
+ format.btitle "title" output.check
1491
+ format.edition output
1492
+ crossref missing$
1493
+ {
1494
+ format.bvolume output
1495
+ format.number.series output
1496
+ format.chapter "chapter" output.check
1497
+ new.sentence
1498
+ format.publisher.address output
1499
+ new.block
1500
+ }
1501
+ {
1502
+ format.chapter "chapter" output.check
1503
+ new.block
1504
+ format.book.crossref output.nonnull
1505
+ }
1506
+ if$
1507
+ new.block
1508
+ format.note output
1509
+ fin.entry
1510
+ }
1511
+
1512
+ FUNCTION {incollection}
1513
+ { output.bibitem
1514
+ format.authors "author" output.check
1515
+ author format.key output
1516
+ format.date "year" output.check
1517
+ date.block
1518
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
1519
+ format.title "title" output.check
1520
+ new.block
1521
+ crossref missing$
1522
+ { format.in.ed.booktitle "booktitle" output.check
1523
+ format.edition output
1524
+ format.bvolume output
1525
+ format.number.series output
1526
+ format.chapter.pages output
1527
+ new.sentence
1528
+ format.publisher.address output
1529
+ }
1530
+ { format.incoll.inproc.crossref output.nonnull
1531
+ format.chapter.pages output
1532
+ }
1533
+ if$
1534
+ new.block
1535
+ format.note output
1536
+ fin.entry
1537
+ }
1538
+ FUNCTION {inproceedings}
1539
+ { output.bibitem
1540
+ format.authors "author" output.check
1541
+ author format.key output
1542
+ format.date "year" output.check
1543
+ date.block
1544
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
1545
+ format.title "title" output.check
1546
+ new.block
1547
+ crossref missing$
1548
+ { format.in.booktitle "booktitle" output.check
1549
+ format.bvolume output
1550
+ format.number.series output
1551
+ format.pages output
1552
+ address "address" bibinfo.check output
1553
+ new.sentence
1554
+ organization "organization" bibinfo.check output
1555
+ publisher "publisher" bibinfo.check output
1556
+ }
1557
+ { format.incoll.inproc.crossref output.nonnull
1558
+ format.pages output
1559
+ }
1560
+ if$
1561
+ new.block
1562
+ format.note output
1563
+ fin.entry
1564
+ }
1565
+ FUNCTION {conference} { inproceedings }
1566
+ FUNCTION {manual}
1567
+ { output.bibitem
1568
+ format.authors output
1569
+ author format.key output
1570
+ format.date "year" output.check
1571
+ date.block
1572
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
1573
+ format.btitle "title" output.check
1574
+ format.edition output
1575
+ organization address new.block.checkb
1576
+ organization "organization" bibinfo.check output
1577
+ address "address" bibinfo.check output
1578
+ new.block
1579
+ format.note output
1580
+ fin.entry
1581
+ }
1582
+
1583
+ FUNCTION {mastersthesis}
1584
+ { output.bibitem
1585
+ format.authors "author" output.check
1586
+ author format.key output
1587
+ format.date "year" output.check
1588
+ date.block
1589
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
1590
+ format.title
1591
+ "title" output.check
1592
+ new.block
1593
+ bbl.mthesis format.thesis.type output.nonnull
1594
+ school "school" bibinfo.warn output
1595
+ address "address" bibinfo.check output
1596
+ month "month" bibinfo.check output
1597
+ new.block
1598
+ format.note output
1599
+ fin.entry
1600
+ }
1601
+
1602
+ FUNCTION {misc}
1603
+ { output.bibitem
1604
+ format.authors output
1605
+ author format.key output
1606
+ format.date "year" output.check
1607
+ date.block
1608
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
1609
+ format.title output
1610
+ new.block
1611
+ howpublished "howpublished" bibinfo.check output
1612
+ new.block
1613
+ format.note output
1614
+ fin.entry
1615
+ }
1616
+ FUNCTION {phdthesis}
1617
+ { output.bibitem
1618
+ format.authors "author" output.check
1619
+ author format.key output
1620
+ format.date "year" output.check
1621
+ date.block
1622
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
1623
+ format.btitle
1624
+ "title" output.check
1625
+ new.block
1626
+ bbl.phdthesis format.thesis.type output.nonnull
1627
+ school "school" bibinfo.warn output
1628
+ address "address" bibinfo.check output
1629
+ new.block
1630
+ format.note output
1631
+ fin.entry
1632
+ }
1633
+
1634
+ FUNCTION {proceedings}
1635
+ { output.bibitem
1636
+ format.editors output
1637
+ editor format.key output
1638
+ format.date "year" output.check
1639
+ date.block
1640
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
1641
+ format.btitle "title" output.check
1642
+ format.bvolume output
1643
+ format.number.series output
1644
+ new.sentence
1645
+ publisher empty$
1646
+ { format.organization.address output }
1647
+ { organization "organization" bibinfo.check output
1648
+ new.sentence
1649
+ format.publisher.address output
1650
+ }
1651
+ if$
1652
+ new.block
1653
+ format.note output
1654
+ fin.entry
1655
+ }
1656
+
1657
+ FUNCTION {techreport}
1658
+ { output.bibitem
1659
+ format.authors "author" output.check
1660
+ author format.key output
1661
+ format.date "year" output.check
1662
+ date.block
1663
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
1664
+ format.title
1665
+ "title" output.check
1666
+ new.block
1667
+ format.tr.number output.nonnull
1668
+ institution "institution" bibinfo.warn output
1669
+ address "address" bibinfo.check output
1670
+ new.block
1671
+ format.note output
1672
+ fin.entry
1673
+ }
1674
+
1675
+ FUNCTION {unpublished}
1676
+ { output.bibitem
1677
+ format.authors "author" output.check
1678
+ author format.key output
1679
+ format.date "year" output.check
1680
+ date.block
1681
+ title empty$ 'skip$ 'possibly.setup.inlinelink if$ % urlbst
1682
+ format.title "title" output.check
1683
+ new.block
1684
+ format.note "note" output.check
1685
+ fin.entry
1686
+ }
1687
+
1688
+ FUNCTION {default.type} { misc }
1689
+ READ
1690
+ FUNCTION {sortify}
1691
+ { purify$
1692
+ "l" change.case$
1693
+ }
1694
+ INTEGERS { len }
1695
+ FUNCTION {chop.word}
1696
+ { 's :=
1697
+ 'len :=
1698
+ s #1 len substring$ =
1699
+ { s len #1 + global.max$ substring$ }
1700
+ 's
1701
+ if$
1702
+ }
1703
+ FUNCTION {format.lab.names}
1704
+ { 's :=
1705
+ "" 't :=
1706
+ s #1 "{vv~}{ll}" format.name$
1707
+ s num.names$ duplicate$
1708
+ #2 >
1709
+ { pop$
1710
+ " " * bbl.etal *
1711
+ }
1712
+ { #2 <
1713
+ 'skip$
1714
+ { s #2 "{ff }{vv }{ll}{ jj}" format.name$ "others" =
1715
+ {
1716
+ " " * bbl.etal *
1717
+ }
1718
+ { bbl.and space.word * s #2 "{vv~}{ll}" format.name$
1719
+ * }
1720
+ if$
1721
+ }
1722
+ if$
1723
+ }
1724
+ if$
1725
+ }
1726
+
1727
+ FUNCTION {author.key.label}
1728
+ { author empty$
1729
+ { key empty$
1730
+ { cite$ #1 #3 substring$ }
1731
+ 'key
1732
+ if$
1733
+ }
1734
+ { author format.lab.names }
1735
+ if$
1736
+ }
1737
+
1738
+ FUNCTION {author.editor.key.label}
1739
+ { author empty$
1740
+ { editor empty$
1741
+ { key empty$
1742
+ { cite$ #1 #3 substring$ }
1743
+ 'key
1744
+ if$
1745
+ }
1746
+ { editor format.lab.names }
1747
+ if$
1748
+ }
1749
+ { author format.lab.names }
1750
+ if$
1751
+ }
1752
+
1753
+ FUNCTION {editor.key.label}
1754
+ { editor empty$
1755
+ { key empty$
1756
+ { cite$ #1 #3 substring$ }
1757
+ 'key
1758
+ if$
1759
+ }
1760
+ { editor format.lab.names }
1761
+ if$
1762
+ }
1763
+
1764
+ FUNCTION {calc.short.authors}
1765
+ { type$ "book" =
1766
+ type$ "inbook" =
1767
+ or
1768
+ 'author.editor.key.label
1769
+ { type$ "proceedings" =
1770
+ 'editor.key.label
1771
+ 'author.key.label
1772
+ if$
1773
+ }
1774
+ if$
1775
+ 'short.list :=
1776
+ }
1777
+
1778
+ FUNCTION {calc.label}
1779
+ { calc.short.authors
1780
+ short.list
1781
+ "("
1782
+ *
1783
+ year duplicate$ empty$
1784
+ short.list key field.or.null = or
1785
+ { pop$ "" }
1786
+ 'skip$
1787
+ if$
1788
+ *
1789
+ 'label :=
1790
+ }
1791
+
1792
+ FUNCTION {sort.format.names}
1793
+ { 's :=
1794
+ #1 'nameptr :=
1795
+ ""
1796
+ s num.names$ 'numnames :=
1797
+ numnames 'namesleft :=
1798
+ { namesleft #0 > }
1799
+ { s nameptr
1800
+ "{ll{ }}{ ff{ }}{ jj{ }}"
1801
+ format.name$ 't :=
1802
+ nameptr #1 >
1803
+ {
1804
+ " " *
1805
+ namesleft #1 = t "others" = and
1806
+ { "zzzzz" * }
1807
+ { t sortify * }
1808
+ if$
1809
+ }
1810
+ { t sortify * }
1811
+ if$
1812
+ nameptr #1 + 'nameptr :=
1813
+ namesleft #1 - 'namesleft :=
1814
+ }
1815
+ while$
1816
+ }
1817
+
1818
+ FUNCTION {sort.format.title}
1819
+ { 't :=
1820
+ "A " #2
1821
+ "An " #3
1822
+ "The " #4 t chop.word
1823
+ chop.word
1824
+ chop.word
1825
+ sortify
1826
+ #1 global.max$ substring$
1827
+ }
1828
+ FUNCTION {author.sort}
1829
+ { author empty$
1830
+ { key empty$
1831
+ { "to sort, need author or key in " cite$ * warning$
1832
+ ""
1833
+ }
1834
+ { key sortify }
1835
+ if$
1836
+ }
1837
+ { author sort.format.names }
1838
+ if$
1839
+ }
1840
+ FUNCTION {author.editor.sort}
1841
+ { author empty$
1842
+ { editor empty$
1843
+ { key empty$
1844
+ { "to sort, need author, editor, or key in " cite$ * warning$
1845
+ ""
1846
+ }
1847
+ { key sortify }
1848
+ if$
1849
+ }
1850
+ { editor sort.format.names }
1851
+ if$
1852
+ }
1853
+ { author sort.format.names }
1854
+ if$
1855
+ }
1856
+ FUNCTION {editor.sort}
1857
+ { editor empty$
1858
+ { key empty$
1859
+ { "to sort, need editor or key in " cite$ * warning$
1860
+ ""
1861
+ }
1862
+ { key sortify }
1863
+ if$
1864
+ }
1865
+ { editor sort.format.names }
1866
+ if$
1867
+ }
1868
+ FUNCTION {presort}
1869
+ { calc.label
1870
+ label sortify
1871
+ " "
1872
+ *
1873
+ type$ "book" =
1874
+ type$ "inbook" =
1875
+ or
1876
+ 'author.editor.sort
1877
+ { type$ "proceedings" =
1878
+ 'editor.sort
1879
+ 'author.sort
1880
+ if$
1881
+ }
1882
+ if$
1883
+ #1 entry.max$ substring$
1884
+ 'sort.label :=
1885
+ sort.label
1886
+ *
1887
+ " "
1888
+ *
1889
+ title field.or.null
1890
+ sort.format.title
1891
+ *
1892
+ #1 entry.max$ substring$
1893
+ 'sort.key$ :=
1894
+ }
1895
+
1896
+ ITERATE {presort}
1897
+ SORT
1898
+ STRINGS { last.label next.extra }
1899
+ INTEGERS { last.extra.num number.label }
1900
+ FUNCTION {initialize.extra.label.stuff}
1901
+ { #0 int.to.chr$ 'last.label :=
1902
+ "" 'next.extra :=
1903
+ #0 'last.extra.num :=
1904
+ #0 'number.label :=
1905
+ }
1906
+ FUNCTION {forward.pass}
1907
+ { last.label label =
1908
+ { last.extra.num #1 + 'last.extra.num :=
1909
+ last.extra.num int.to.chr$ 'extra.label :=
1910
+ }
1911
+ { "a" chr.to.int$ 'last.extra.num :=
1912
+ "" 'extra.label :=
1913
+ label 'last.label :=
1914
+ }
1915
+ if$
1916
+ number.label #1 + 'number.label :=
1917
+ }
1918
+ FUNCTION {reverse.pass}
1919
+ { next.extra "b" =
1920
+ { "a" 'extra.label := }
1921
+ 'skip$
1922
+ if$
1923
+ extra.label 'next.extra :=
1924
+ extra.label
1925
+ duplicate$ empty$
1926
+ 'skip$
1927
+ { year field.or.null #-1 #1 substring$ chr.to.int$ #65 <
1928
+ { "{\natexlab{" swap$ * "}}" * }
1929
+ { "{(\natexlab{" swap$ * "})}" * }
1930
+ if$ }
1931
+ if$
1932
+ 'extra.label :=
1933
+ label extra.label * 'label :=
1934
+ }
1935
+ EXECUTE {initialize.extra.label.stuff}
1936
+ ITERATE {forward.pass}
1937
+ REVERSE {reverse.pass}
1938
+ FUNCTION {bib.sort.order}
1939
+ { sort.label
1940
+ " "
1941
+ *
1942
+ year field.or.null sortify
1943
+ *
1944
+ " "
1945
+ *
1946
+ title field.or.null
1947
+ sort.format.title
1948
+ *
1949
+ #1 entry.max$ substring$
1950
+ 'sort.key$ :=
1951
+ }
1952
+ ITERATE {bib.sort.order}
1953
+ SORT
1954
+ FUNCTION {begin.bib}
1955
+ { preamble$ empty$
1956
+ 'skip$
1957
+ { preamble$ write$ newline$ }
1958
+ if$
1959
+ "\begin{thebibliography}{" number.label int.to.str$ * "}" *
1960
+ write$ newline$
1961
+ "\expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi"
1962
+ write$ newline$
1963
+ }
1964
+ EXECUTE {begin.bib}
1965
+ EXECUTE {init.urlbst.variables} % urlbst
1966
+ EXECUTE {init.state.consts}
1967
+ ITERATE {call.type$}
1968
+ FUNCTION {end.bib}
1969
+ { newline$
1970
+ "\end{thebibliography}" write$ newline$
1971
+ }
1972
+ EXECUTE {end.bib}
1973
+ %% End of customized bst file
1974
+ %%
1975
+ %% End of file `compling.bst'.
references/2019.arxiv.conneau/source/appendix.tex ADDED
@@ -0,0 +1,45 @@
1
+ \documentclass[11pt,a4paper]{article}
2
+ \usepackage[hyperref]{acl2020}
3
+ \usepackage{times}
4
+ \usepackage{latexsym}
5
+ \renewcommand{\UrlFont}{\ttfamily\small}
6
+
7
+ % This is not strictly necessary, and may be commented out,
8
+ % but it will improve the layout of the manuscript,
9
+ % and will typically save some space.
10
+ \usepackage{microtype}
11
+ \usepackage{graphicx}
12
+ \usepackage{subfigure}
13
+ \usepackage{booktabs} % for professional tables
14
+ \usepackage{url}
15
+ \usepackage{times}
16
+ \usepackage{latexsym}
17
+ \usepackage{array}
18
+ \usepackage{adjustbox}
19
+ \usepackage{multirow}
20
+ % \usepackage{subcaption}
21
+ \usepackage{hyperref}
22
+ \usepackage{longtable}
23
+ \usepackage{bibentry}
24
+ \newcommand{\xlmr}{\textit{XLM-R}\xspace}
25
+ \newcommand{\mbert}{mBERT\xspace}
26
+ \input{content/tables}
27
+
28
+ \begin{document}
29
+ \nobibliography{acl2020}
30
+ \bibliographystyle{acl_natbib}
31
+ \appendix
32
+ \onecolumn
33
+ \section*{Supplementary materials}
34
+ \section{Languages and statistics for CC-100 used by \xlmr}
35
+ In this section we present the list of languages in the CC-100 corpus we created for training \xlmr. We also report statistics such as the number of tokens and the size of each monolingual corpus.
36
+ \label{sec:appendix_A}
37
+ \insertDataStatistics
38
+
39
+ \newpage
40
+ \section{Model Architectures and Sizes}
41
+ As we showed in Section 5, capacity is an important parameter for learning strong cross-lingual representations. In the table below, we list multiple monolingual and multilingual models used by the research community and summarize their architectures and total number of parameters.
42
+ \label{sec:appendix_B}
43
+
44
+ \insertParameters
45
+ \end{document}
references/2019.arxiv.conneau/source/content/batchsize.pdf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0e0c4e1c156379efeba93f0c1a6717bb12ab0b2aa0bdd361a7fda362ff01442e
3
+ size 14673
references/2019.arxiv.conneau/source/content/capacity.pdf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:00087aeb1a14190e7800a77cecacb04e8ce1432c029e0276b4d8b02b7ff66edb
3
+ size 16459
references/2019.arxiv.conneau/source/content/datasize.pdf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5d07fdd658101ef6caf7e2808faa6045ab175315b6435e25ff14ecedac584118
3
+ size 26052
references/2019.arxiv.conneau/source/content/dilution.pdf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:80d1555811c23e2c521fbb007d84dfddb85e7020cc9333058368d3a1d63e240a
3
+ size 16376
references/2019.arxiv.conneau/source/content/langsampling.pdf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c2f2f95649a23b0a46f8553f4e0e29000aff1971385b9addf6f478acc5a516a3
3
+ size 15612
references/2019.arxiv.conneau/source/content/tables.tex ADDED
@@ -0,0 +1,398 @@
1
+
2
+
3
+
4
+ \newcommand{\insertXNLItable}{
5
+ \begin{table*}[h!]
6
+ \begin{center}
7
+ % \scriptsize
8
+ \resizebox{1\linewidth}{!}{
9
+ \begin{tabular}[b]{l ccc ccccccccccccccc c}
10
+ \toprule
11
+ {\bf Model} & {\bf D }& {\bf \#M} & {\bf \#lg} & {\bf en} & {\bf fr} & {\bf es} & {\bf de} & {\bf el} & {\bf bg} & {\bf ru} & {\bf tr} & {\bf ar} & {\bf vi} & {\bf th} & {\bf zh} & {\bf hi} & {\bf sw} & {\bf ur} & {\bf Avg}\\
12
+ \midrule
13
+ %\cmidrule(r){1-1}
14
+ %\cmidrule(lr){2-4}
15
+ %\cmidrule(lr){5-19}
16
+ %\cmidrule(l){20-20}
17
+
18
+ \multicolumn{19}{l}{\it Fine-tune multilingual model on English training set (Cross-lingual Transfer)} \\
19
+ %\midrule
20
+ \midrule
21
+ \citet{lample2019cross} & Wiki+MT & N & 15 & 85.0 & 78.7 & 78.9 & 77.8 & 76.6 & 77.4 & 75.3 & 72.5 & 73.1 & 76.1 & 73.2 & 76.5 & 69.6 & 68.4 & 67.3 & 75.1 \\
22
+ \citet{huang2019unicoder} & Wiki+MT & N & 15 & 85.1 & 79.0 & 79.4 & 77.8 & 77.2 & 77.2 & 76.3 & 72.8 & 73.5 & 76.4 & 73.6 & 76.2 & 69.4 & 69.7 & 66.7 & 75.4 \\
23
+ %\midrule
24
+ \citet{devlin2018bert} & Wiki & N & 102 & 82.1 & 73.8 & 74.3 & 71.1 & 66.4 & 68.9 & 69.0 & 61.6 & 64.9 & 69.5 & 55.8 & 69.3 & 60.0 & 50.4 & 58.0 & 66.3 \\
25
+ \citet{lample2019cross} & Wiki & N & 100 & 83.7 & 76.2 & 76.6 & 73.7 & 72.4 & 73.0 & 72.1 & 68.1 & 68.4 & 72.0 & 68.2 & 71.5 & 64.5 & 58.0 & 62.4 & 71.3 \\
26
+ \citet{lample2019cross} & Wiki & 1 & 100 & 83.2 & 76.7 & 77.7 & 74.0 & 72.7 & 74.1 & 72.7 & 68.7 & 68.6 & 72.9 & 68.9 & 72.5 & 65.6 & 58.2 & 62.4 & 70.7 \\
27
+ \bf XLM-R\textsubscript{Base} & CC & 1 & 100 & 85.8 & 79.7 & 80.7 & 78.7 & 77.5 & 79.6 & 78.1 & 74.2 & 73.8 & 76.5 & 74.6 & 76.7 & 72.4 & 66.5 & 68.3 & 76.2 \\
28
+ \bf XLM-R & CC & 1 & 100 & \bf 89.1 & \bf 84.1 & \bf 85.1 & \bf 83.9 & \bf 82.9 & \bf 84.0 & \bf 81.2 & \bf 79.6 & \bf 79.8 & \bf 80.8 & \bf 78.1 & \bf 80.2 & \bf 76.9 & \bf 73.9 & \bf 73.8 & \bf 80.9 \\
29
+ \midrule
30
+ \multicolumn{19}{l}{\it Translate everything to English and use English-only model (TRANSLATE-TEST)} \\
31
+ \midrule
32
+ BERT-en & Wiki & 1 & 1 & 88.8 & 81.4 & 82.3 & 80.1 & 80.3 & 80.9 & 76.2 & 76.0 & 75.4 & 72.0 & 71.9 & 75.6 & 70.0 & 65.8 & 65.8 & 76.2 \\
33
+ RoBERTa & Wiki+CC & 1 & 1 & \underline{\bf 91.3} & 82.9 & 84.3 & 81.2 & 81.7 & 83.1 & 78.3 & 76.8 & 76.6 & 74.2 & 74.1 & 77.5 & 70.9 & 66.7 & 66.8 & 77.8 \\
34
+ % XLM-en & Wiki & 1 & 1 & 00.0 & 00.0 & 00.0 & 00.0 & 00.0 & 00.0 & 00.0 \\
35
+ \midrule
36
+ \multicolumn{19}{l}{\it Fine-tune multilingual model on each training set (TRANSLATE-TRAIN)} \\
37
+ \midrule
38
+ \citet{lample2019cross} & Wiki & N & 100 & 82.9 & 77.6 & 77.9 & 77.9 & 77.1 & 75.7 & 75.5 & 72.6 & 71.2 & 75.8 & 73.1 & 76.2 & 70.4 & 66.5 & 62.4 & 74.2 \\
39
+ \midrule
40
+ \multicolumn{19}{l}{\it Fine-tune multilingual model on all training sets (TRANSLATE-TRAIN-ALL)} \\
41
+ \midrule
42
+ \citet{lample2019cross}$^{\dagger}$ & Wiki+MT & 1 & 15 & 85.0 & 80.8 & 81.3 & 80.3 & 79.1 & 80.9 & 78.3 & 75.6 & 77.6 & 78.5 & 76.0 & 79.5 & 72.9 & 72.8 & 68.5 & 77.8 \\
43
+ \citet{huang2019unicoder} & Wiki+MT & 1 & 15 & 85.6 & 81.1 & 82.3 & 80.9 & 79.5 & 81.4 & 79.7 & 76.8 & 78.2 & 77.9 & 77.1 & 80.5 & 73.4 & 73.8 & 69.6 & 78.5 \\
44
+ %\midrule
45
+ \citet{lample2019cross} & Wiki & 1 & 100 & 84.5 & 80.1 & 81.3 & 79.3 & 78.6 & 79.4 & 77.5 & 75.2 & 75.6 & 78.3 & 75.7 & 78.3 & 72.1 & 69.2 & 67.7 & 76.9 \\
46
+ \bf XLM-R\textsubscript{Base} & CC & 1 & 100 & 85.4 & 81.4 & 82.2 & 80.3 & 80.4 & 81.3 & 79.7 & 78.6 & 77.3 & 79.7 & 77.9 & 80.2 & 76.1 & 73.1 & 73.0 & 79.1 \\
47
+ \bf XLM-R & CC & 1 & 100 & \bf 89.1 & \underline{\bf 85.1} & \underline{\bf 86.6} & \underline{\bf 85.7} & \underline{\bf 85.3} & \underline{\bf 85.9} & \underline{\bf 83.5} & \underline{\bf 83.2} & \underline{\bf 83.1} & \underline{\bf 83.7} & \underline{\bf 81.5} & \underline{\bf 83.7} & \underline{\bf 81.6} & \underline{\bf 78.0} & \underline{\bf 78.1} & \underline{\bf 83.6} \\
48
+ \bottomrule
49
+ \end{tabular}
50
+ }
51
+ \caption{\textbf{Results on cross-lingual classification.} We report the accuracy on each of the 15 XNLI languages and the average accuracy. We specify the dataset D used for pretraining, the number of models \#M the approach requires and the number of languages \#lg the model handles. Our \xlmr results are averaged over five different seeds. We show that using the translate-train-all approach which leverages training sets from multiple languages, \xlmr obtains a new state of the art on XNLI of $83.6$\% average accuracy. Results with $^{\dagger}$ are from \citet{huang2019unicoder}. %It also outperforms previous methods on cross-lingual transfer.
52
+ \label{tab:xnli}}
53
+ \end{center}
54
+ % \vspace{-0.4cm}
55
+ \end{table*}
56
+ }
57
+
58
+ % Evolution of performance w.r.t number of languages
59
+ \newcommand{\insertLanguagesize}{
60
+ \begin{table*}[h!]
61
+ \begin{minipage}{0.49\textwidth}
62
+ \includegraphics[scale=0.4]{content/wiki_vs_cc.pdf}
63
+ \end{minipage}
64
+ \hfill
65
+ \begin{minipage}{0.4\textwidth}
66
+ \captionof{figure}{\textbf{Distribution of the amount of data (in MB) per language for Wikipedia and CommonCrawl.} The Wikipedia data used in open-source mBERT and XLM is not sufficient for the model to develop an understanding of low-resource languages. The CommonCrawl data we collect alleviates that issue and creates the conditions for a single model to understand text coming from multiple languages. \label{fig:lgs}}
67
+ \end{minipage}
68
+ % \vspace{-0.5cm}
69
+ \end{table*}
70
+ }
71
+
72
+ % Evolution of performance w.r.t number of languages
73
+ \newcommand{\insertXLMmorelanguages}{
74
+ \begin{table*}[h!]
75
+ \begin{minipage}{0.49\textwidth}
76
+ \includegraphics[scale=0.4]{content/evolution_languages}
77
+ \end{minipage}
78
+ \hfill
79
+ \begin{minipage}{0.4\textwidth}
80
+ \captionof{figure}{\textbf{Evolution of XLM performance on SeqLab, XNLI and GLUE as the number of languages increases.} While there are subtlteties as to what languages lose more accuracy than others as we add more languages, we observe a steady decrease of the overall monolingual and cross-lingual performance. \label{fig:lgsunused}}
81
+ \end{minipage}
82
+ % \vspace{-0.5cm}
83
+ \end{table*}
84
+ }
85
+
86
+ \newcommand{\insertMLQA}{
87
+ \begin{table*}[h!]
88
+ \begin{center}
89
+ % \scriptsize
90
+ \resizebox{1\linewidth}{!}{
91
+ \begin{tabular}[h]{l cc ccccccc c}
92
+ \toprule
93
+ {\bf Model} & {\bf train} & {\bf \#lgs} & {\bf en} & {\bf es} & {\bf de} & {\bf ar} & {\bf hi} & {\bf vi} & {\bf zh} & {\bf Avg} \\
94
+ \midrule
95
+ BERT-Large$^{\dagger}$ & en & 1 & 80.2 / 67.4 & - & - & - & - & - & - & - \\
96
+ mBERT$^{\dagger}$ & en & 102 & 77.7 / 65.2 & 64.3 / 46.6 & 57.9 / 44.3 & 45.7 / 29.8 & 43.8 / 29.7 & 57.1 / 38.6 & 57.5 / 37.3 & 57.7 / 41.6 \\
97
+ XLM-15$^{\dagger}$ & en & 15 & 74.9 / 62.4 & 68.0 / 49.8 & 62.2 / 47.6 & 54.8 / 36.3 & 48.8 / 27.3 & 61.4 / 41.8 & 61.1 / 39.6 & 61.6 / 43.5 \\
98
+ XLM-R\textsubscript{Base} & en & 100 & 77.1 / 64.6 & 67.4 / 49.6 & 60.9 / 46.7 & 54.9 / 36.6 & 59.4 / 42.9 & 64.5 / 44.7 & 61.8 / 39.3 & 63.7 / 46.3 \\
99
+ \bf XLM-R & en & 100 & \bf 80.6 / 67.8 & \bf 74.1 / 56.0 & \bf 68.5 / 53.6 & \bf 63.1 / 43.5 & \bf 69.2 / 51.6 & \bf 71.3 / 50.9 & \bf 68.0 / 45.4 & \bf 70.7 / 52.7 \\
100
+ \bottomrule
101
+ \end{tabular}
102
+ }
103
+ \caption{\textbf{Results on MLQA question answering} We report the F1 and EM (exact match) scores for zero-shot classification where models are fine-tuned on the English Squad dataset and evaluated on the 7 languages of MLQA. Results with $\dagger$ are taken from the original MLQA paper \citet{lewis2019mlqa}.
104
+ \label{tab:mlqa}}
105
+ \end{center}
106
+ \end{table*}
107
+ }
108
+
109
+ \newcommand{\insertNER}{
110
+ \begin{table}[t]
111
+ \begin{center}
112
+ % \scriptsize
113
+ \resizebox{1\linewidth}{!}{
114
+ \begin{tabular}[b]{l cc cccc c}
115
+ \toprule
116
+ {\bf Model} & {\bf train} & {\bf \#M} & {\bf en} & {\bf nl} & {\bf es} & {\bf de} & {\bf Avg}\\
117
+ \midrule
118
+ \citet{lample-etal-2016-neural} & each & N & 90.74 & 81.74 & 85.75 & 78.76 & 84.25 \\
119
+ \citet{akbik2018coling} & each & N & \bf 93.18 & 90.44 & - & \bf 88.27 & - \\
120
+ \midrule
121
+ \multirow{2}{*}{mBERT$^{\dagger}$} & each & N & 91.97 & 90.94 & 87.38 & 82.82 & 88.28\\
122
+ & en & 1 & 91.97 & 77.57 & 74.96 & 69.56 & 78.52\\
123
+ \midrule
124
+ \multirow{3}{*}{XLM-R\textsubscript{Base}} & each & N & 92.25 & 90.39 & 87.99 & 84.60 & 88.81\\
125
+ & en & 1 & 92.25 & 78.08 & 76.53 & 69.60 & 79.11\\
126
+ & all & 1 & 91.08 & 89.09 & 87.28 & 83.17 & 87.66 \\
127
+ \midrule
128
+ \multirow{3}{*}{\bf XLM-R} & each & N & 92.92 & \bf 92.53 & \bf 89.72 & 85.81 & 90.24\\
129
+ & en & 1 & 92.92 & 80.80 & 78.64 & 71.40 & 80.94\\
130
+ & all & 1 & 92.00 & 91.60 & 89.52 & 84.60 & 89.43 \\
131
+ \bottomrule
132
+ \end{tabular}
133
+ }
134
+ \caption{\textbf{Results on named entity recognition} on CoNLL-2002 and CoNLL-2003 (F1 score). Results with $\dagger$ are from \citet{wu2019beto}. Note that mBERT and \xlmr do not use a linear-chain CRF, as opposed to \citet{akbik2018coling} and \citet{lample-etal-2016-neural}.
135
+ \label{tab:ner}}
136
+ \end{center}
137
+ \vspace{-0.6cm}
138
+ \end{table}
139
+ }
140
+
141
+
142
+ \newcommand{\insertAblationone}{
143
+ \begin{table*}[h!]
144
+ \begin{minipage}[t]{0.3\linewidth}
145
+ \begin{center}
146
+ %\includegraphics[width=\linewidth]{content/xlmroberta_transfer_dilution.pdf}
147
+ \includegraphics{content/dilution}
148
+ \captionof{figure}{The transfer-interference trade-off: Low-resource languages benefit from scaling to more languages, until dilution (interference) kicks in and degrades overall performance.}
149
+ \label{fig:transfer_dilution}
150
+ \vspace{-0.2cm}
151
+ \end{center}
152
+ \end{minipage}
153
+ \hfill
154
+ \begin{minipage}[t]{0.3\linewidth}
155
+ \begin{center}
156
+ %\includegraphics[width=\linewidth]{content/xlmroberta_evolution.pdf}
157
+ \includegraphics{content/wikicc}
158
+ \captionof{figure}{Wikipedia versus CommonCrawl: An XLM-7 obtains significantly better performance when trained on CC, in particular on low-resource languages.}
159
+ \label{fig:curse}
160
+ \end{center}
161
+ \end{minipage}
162
+ \hfill
163
+ \begin{minipage}[t]{0.3\linewidth}
164
+ \begin{center}
165
+ % \includegraphics[width=\linewidth]{content/xlmroberta_evolution.pdf}
166
+ \includegraphics{content/capacity}
167
+ \captionof{figure}{Adding more capacity to the model alleviates the curse of multilinguality, but remains an issue for models of moderate size.}
168
+ \label{fig:capacity}
169
+ \end{center}
170
+ \end{minipage}
171
+ \vspace{-0.2cm}
172
+ \end{table*}
173
+ }
174
+
175
+
176
+ \newcommand{\insertAblationtwo}{
177
+ \begin{table*}[h!]
178
+ \begin{minipage}[t]{0.3\linewidth}
179
+ \begin{center}
180
+ %\includegraphics[width=\columnwidth]{content/xlmroberta_alpha_tradeoff.pdf}
181
+ \includegraphics{content/langsampling}
182
+ \captionof{figure}{On the high-resource versus low-resource trade-off: impact of batch language sampling for XLM-100.
183
+ \label{fig:alpha}}
184
+ \end{center}
185
+ \end{minipage}
186
+ \hfill
187
+ \begin{minipage}[t]{0.3\linewidth}
188
+ \begin{center}
189
+ %\includegraphics[width=\columnwidth]{content/xlmroberta_vocab.pdf}
190
+ \includegraphics{content/vocabsize.pdf}
191
+ \captionof{figure}{On the impact of vocabulary size at fixed capacity and with increasing capacity for XLM-100.
192
+ \label{fig:vocab}}
193
+ \end{center}
194
+ \end{minipage}
195
+ \hfill
196
+ \begin{minipage}[t]{0.3\linewidth}
197
+ \begin{center}
198
+ %\includegraphics[width=\columnwidth]{content/xlmroberta_batch_and_tok.pdf}
199
+ \includegraphics{content/batchsize.pdf}
200
+ \captionof{figure}{On the impact of large-scale training, and preprocessing simplification from BPE with tokenization to SPM on raw text data.
201
+ \label{fig:batch}}
202
+ \end{center}
203
+ \end{minipage}
204
+ \vspace{-0.2cm}
205
+ \end{table*}
206
+ }
207
+
208
+
209
+ % Multilingual vs monolingual
210
+ \newcommand{\insertMultiMono}{
211
+ \begin{table}[h!]
212
+ \begin{center}
213
+ % \scriptsize
214
+ \resizebox{1\linewidth}{!}{
215
+ \begin{tabular}[b]{l cc ccccccc c}
216
+ \toprule
217
+ {\bf Model} & {\bf D } & {\bf \#vocab} & {\bf en} & {\bf fr} & {\bf de} & {\bf ru} & {\bf zh} & {\bf sw} & {\bf ur} & {\bf Avg}\\
218
+ \midrule
219
+ \multicolumn{11}{l}{\it Monolingual baselines}\\
220
+ \midrule
221
+ \multirow{2}{*}{BERT} & Wiki & 40k & 84.5 & 78.6 & 80.0 & 75.5 & 77.7 & 60.1 & 57.3 & 73.4 \\
222
+ & CC & 40k & 86.7 & 81.2 & 81.2 & 78.2 & 79.5 & 70.8 & 65.1 & 77.5 \\
223
+ \midrule
224
+ \multicolumn{11}{l}{\it Multilingual models (cross-lingual transfer)}\\
225
+ \midrule
226
+ \multirow{2}{*}{XLM-7} & Wiki & 150k & 82.3 & 76.8 & 74.7 & 72.5 & 73.1 & 60.8 & 62.3 & 71.8 \\
227
+ & CC & 150k & 85.7 & 78.6 & 79.5 & 76.4 & 74.8 & 71.2 & 66.9 & 76.2 \\
228
+ \midrule
229
+ \multicolumn{11}{l}{\it Multilingual models (translate-train-all)}\\
230
+ \midrule
231
+ \multirow{2}{*}{XLM-7} & Wiki & 150k & 84.6 & 80.1 & 80.2 & 75.7 & 78.0 & 68.7 & 66.7 & 76.3 \\
232
+ & CC & 150k & \bf 87.2 & \bf 82.5 & \bf 82.9 & \bf 79.7 & \bf 80.4 & \bf 75.7 & \bf 71.5 & \bf 80.0 \\
233
+ % \midrule
234
+ % XLM (sw,ar) & CC & 60k & N & 2-3 & - & - & - & - & - & 00.0 & - & 00.0 \\
235
+ % XLM (ur,hi,ar) & CC & 60k & N & 2-3 & - & - & - & - & - & - & 00.0 & 00.0 \\
236
+ \bottomrule
237
+ \end{tabular}
238
+ }
239
+ \caption{\textbf{Multilingual versus monolingual models (BERT-BASE).} We compare the performance of monolingual models (BERT) versus multilingual models (XLM) on seven languages, using a BERT-BASE architecture. We choose vocabulary sizes of 40k and 150k for monolingual and multilingual models, respectively.
240
+ \label{tab:multimono}}
241
+ \end{center}
242
+ \vspace{-0.4cm}
243
+ \end{table}
244
+ }
245
+
246
+ % GLUE benchmark results
247
+ \newcommand{\insertGlue}{
248
+ \begin{table}[h!]
249
+ \begin{center}
250
+ % \scriptsize
251
+ \resizebox{1\linewidth}{!}{
252
+ \begin{tabular}[b]{l|c|cccccc|c}
253
+ \toprule
254
+ {\bf Model} & {\bf \#lgs} & {\bf MNLI-m/mm} & {\bf QNLI} & {\bf QQP} & {\bf SST} & {\bf MRPC} & {\bf STS-B} & {\bf Avg}\\
255
+ \midrule
256
+ BERT\textsubscript{Large}$^{\dagger}$ & 1 & 86.6/- & 92.3 & 91.3 & 93.2 & 88.0 & 90.0 & 90.2 \\
257
+ XLNet\textsubscript{Large}$^{\dagger}$ & 1 & 89.8/- & 93.9 & 91.8 & 95.6 & 89.2 & 91.8 & 92.0 \\
258
+ RoBERTa$^{\dagger}$ & 1 & 90.2/90.2 & 94.7 & 92.2 & 96.4 & 90.9 & 92.4 & 92.8 \\
259
+ XLM-R & 100 & 88.9/89.0 & 93.8 & 92.3 & 95.0 & 89.5 & 91.2 & 91.8 \\
260
+ \bottomrule
261
+ \end{tabular}
262
+ }
263
+ \caption{\textbf{GLUE dev results.} Results with $^{\dagger}$ are from \citet{roberta2019}. We compare the performance of \xlmr to BERT\textsubscript{Large}, XLNet and RoBERTa on the English GLUE benchmark.
264
+ \label{tab:glue}}
265
+ \end{center}
266
+ \vspace{-0.4cm}
267
+ \end{table}
268
+ }
269
+
270
+
271
+ % Wiki vs CommonCrawl statistics
272
+ \newcommand{\insertWikivsCC}{
273
+ \begin{table*}[h]
274
+ \begin{center}
275
+ %\includegraphics[width=\linewidth]{content/wiki_vs_cc.pdf}
276
+ \includegraphics{content/datasize.pdf}
277
+ \captionof{figure}{Amount of data in GiB (log-scale) for the 88 languages that appear in both the Wiki-100 corpus used for mBERT and XLM-100, and the CC-100 used for XLM-R. CC-100 increases the amount of data by several orders of magnitude, in particular for low-resource languages.
278
+ \label{fig:wikivscc}}
279
+ \end{center}
280
+ % \vspace{-0.4cm}
281
+ \end{table*}
282
+ }
283
+
284
+ % Corpus statistics for CC-100
285
+ \newcommand{\insertDataStatistics}{
286
+ %\resizebox{1\linewidth}{!}{
287
+ \begin{table}[h!]
288
+ \begin{center}
289
+ \small
290
+ \begin{tabular}[b]{clrrclrr}
291
+ \toprule
292
+ \textbf{ISO code} & \textbf{Language} & \textbf{Tokens} (M) & \textbf{Size} (GiB) & \textbf{ISO code} & \textbf{Language} & \textbf{Tokens} (M) & \textbf{Size} (GiB)\\
293
+ \cmidrule(r){1-4}\cmidrule(l){5-8}
294
+ {\bf af }& Afrikaans & 242 & 1.3 &{\bf lo }& Lao & 17 & 0.6 \\
295
+ {\bf am }& Amharic & 68 & 0.8 &{\bf lt }& Lithuanian & 1835 & 13.7 \\
296
+ {\bf ar }& Arabic & 2869 & 28.0 &{\bf lv }& Latvian & 1198 & 8.8 \\
297
+ {\bf as }& Assamese & 5 & 0.1 &{\bf mg }& Malagasy & 25 & 0.2 \\
298
+ {\bf az }& Azerbaijani & 783 & 6.5 &{\bf mk }& Macedonian & 449 & 4.8 \\
299
+ {\bf be }& Belarusian & 362 & 4.3 &{\bf ml }& Malayalam & 313 & 7.6 \\
300
+ {\bf bg }& Bulgarian & 5487 & 57.5 &{\bf mn }& Mongolian & 248 & 3.0 \\
301
+ {\bf bn }& Bengali & 525 & 8.4 &{\bf mr }& Marathi & 175 & 2.8 \\
302
+ {\bf - }& Bengali Romanized & 77 & 0.5 &{\bf ms }& Malay & 1318 & 8.5 \\
303
+ {\bf br }& Breton & 16 & 0.1 &{\bf my }& Burmese & 15 & 0.4 \\
304
+ {\bf bs }& Bosnian & 14 & 0.1 &{\bf my }& Burmese & 56 & 1.6 \\
305
+ {\bf ca }& Catalan & 1752 & 10.1 &{\bf ne }& Nepali & 237 & 3.8 \\
306
+ {\bf cs }& Czech & 2498 & 16.3 &{\bf nl }& Dutch & 5025 & 29.3 \\
307
+ {\bf cy }& Welsh & 141 & 0.8 &{\bf no }& Norwegian & 8494 & 49.0 \\
308
+ {\bf da }& Danish & 7823 & 45.6 &{\bf om }& Oromo & 8 & 0.1 \\
309
+ {\bf de }& German & 10297 & 66.6 &{\bf or }& Oriya & 36 & 0.6 \\
310
+ {\bf el }& Greek & 4285 & 46.9 &{\bf pa }& Punjabi & 68 & 0.8 \\
311
+ {\bf en }& English & 55608 & 300.8 &{\bf pl }& Polish & 6490 & 44.6 \\
312
+ {\bf eo }& Esperanto & 157 & 0.9 &{\bf ps }& Pashto & 96 & 0.7 \\
313
+ {\bf es }& Spanish & 9374 & 53.3 &{\bf pt }& Portuguese & 8405 & 49.1 \\
314
+ {\bf et }& Estonian & 843 & 6.1 &{\bf ro }& Romanian & 10354 & 61.4 \\
315
+ {\bf eu }& Basque & 270 & 2.0 &{\bf ru }& Russian & 23408 & 278.0 \\
316
+ {\bf fa }& Persian & 13259 & 111.6 &{\bf sa }& Sanskrit & 17 & 0.3 \\
317
+ {\bf fi }& Finnish & 6730 & 54.3 &{\bf sd }& Sindhi & 50 & 0.4 \\
318
+ {\bf fr }& French & 9780 & 56.8 &{\bf si }& Sinhala & 243 & 3.6 \\
319
+ {\bf fy }& Western Frisian & 29 & 0.2 &{\bf sk }& Slovak & 3525 & 23.2 \\
320
+ {\bf ga }& Irish & 86 & 0.5 &{\bf sl }& Slovenian & 1669 & 10.3 \\
321
+ {\bf gd }& Scottish Gaelic & 21 & 0.1 &{\bf so }& Somali & 62 & 0.4 \\
322
+ {\bf gl }& Galician & 495 & 2.9 &{\bf sq }& Albanian & 918 & 5.4 \\
323
+ {\bf gu }& Gujarati & 140 & 1.9 &{\bf sr }& Serbian & 843 & 9.1 \\
324
+ {\bf ha }& Hausa & 56 & 0.3 &{\bf su }& Sundanese & 10 & 0.1 \\
325
+ {\bf he }& Hebrew & 3399 & 31.6 &{\bf sv }& Swedish & 77.8 & 12.1 \\
326
+ {\bf hi }& Hindi & 1715 & 20.2 &{\bf sw }& Swahili & 275 & 1.6 \\
327
+ {\bf - }& Hindi Romanized & 88 & 0.5 &{\bf ta }& Tamil & 595 & 12.2 \\
328
+ {\bf hr }& Croatian & 3297 & 20.5 &{\bf - }& Tamil Romanized & 36 & 0.3 \\
329
+ {\bf hu }& Hungarian & 7807 & 58.4 &{\bf te }& Telugu & 249 & 4.7 \\
330
+ {\bf hy }& Armenian & 421 & 5.5 &{\bf - }& Telugu Romanized & 39 & 0.3 \\
331
+ {\bf id }& Indonesian & 22704 & 148.3 &{\bf th }& Thai & 1834 & 71.7 \\
332
+ {\bf is }& Icelandic & 505 & 3.2 &{\bf tl }& Filipino & 556 & 3.1 \\
333
+ {\bf it }& Italian & 4983 & 30.2 &{\bf tr }& Turkish & 2736 & 20.9 \\
334
+ {\bf ja }& Japanese & 530 & 69.3 &{\bf ug }& Uyghur & 27 & 0.4 \\
335
+ {\bf jv }& Javanese & 24 & 0.2 &{\bf uk }& Ukrainian & 6.5 & 84.6 \\
336
+ {\bf ka }& Georgian & 469 & 9.1 &{\bf ur }& Urdu & 730 & 5.7 \\
337
+ {\bf kk }& Kazakh & 476 & 6.4 &{\bf - }& Urdu Romanized & 85 & 0.5 \\
338
+ {\bf km }& Khmer & 36 & 1.5 &{\bf uz }& Uzbek & 91 & 0.7 \\
339
+ {\bf kn }& Kannada & 169 & 3.3 &{\bf vi }& Vietnamese & 24757 & 137.3 \\
340
+ {\bf ko }& Korean & 5644 & 54.2 &{\bf xh }& Xhosa & 13 & 0.1 \\
341
+ {\bf ku }& Kurdish (Kurmanji) & 66 & 0.4 &{\bf yi }& Yiddish & 34 & 0.3 \\
342
+ {\bf ky }& Kyrgyz & 94 & 1.2 &{\bf zh }& Chinese (Simplified) & 259 & 46.9 \\
343
+ {\bf la }& Latin & 390 & 2.5 &{\bf zh }& Chinese (Traditional) & 176 & 16.6 \\
344
+
345
+ \bottomrule
346
+ \end{tabular}
347
+ \caption{\textbf{Languages and statistics of the CC-100 corpus.} We report the list of 100 languages and include the number of tokens (in millions) and the size of the data (in GiB) for each language. Note that we also include romanized variants of some non-Latin languages such as Bengali, Hindi, Tamil, Telugu and Urdu.\label{tab:datastats}}
348
+ \end{center}
349
+ \end{table}
350
+ %}
351
+ }
352
+
353
+
354
+ % Comparison of parameters for different models
355
+ \newcommand{\insertParameters}{
356
+ \begin{table*}[h!]
357
+ \begin{center}
358
+ % \scriptsize
359
+ %\resizebox{1\linewidth}{!}{
360
+ \begin{tabular}[b]{lrcrrrrrc}
361
+ \toprule
362
+ \textbf{Model} & \textbf{\#lgs} & \textbf{tokenization} & \textbf{L} & \textbf{$H_{m}$} & \textbf{$H_{ff}$} & \textbf{A} & \textbf{V} & \textbf{\#params}\\
363
+ \cmidrule(r){1-1}
364
+ \cmidrule(lr){2-3}
365
+ \cmidrule(lr){4-8}
366
+ \cmidrule(l){9-9}
367
+ % TODO: rank by number of parameters
368
+ BERT\textsubscript{Base} & 1 & WordPiece & 12 & 768 & 3072 & 12 & 30k & 110M \\
369
+ BERT\textsubscript{Large} & 1 & WordPiece & 24 & 1024 & 4096 & 16 & 30k & 335M \\
370
+ mBERT & 104 & WordPiece & 12 & 768 & 3072 & 12 & 110k & 172M \\
371
+ RoBERTa\textsubscript{Base} & 1 & bBPE & 12 & 768 & 3072 & 8 & 50k & 125M \\
372
+ RoBERTa & 1 & bBPE & 24 & 1024 & 4096 & 16 & 50k & 355M \\
373
+ XLM-15 & 15 & BPE & 12 & 1024 & 4096 & 8 & 95k & 250M \\
374
+ XLM-17 & 17 & BPE & 16 & 1280 & 5120 & 16 & 200k & 570M \\
375
+ XLM-100 & 100 & BPE & 16 & 1280 & 5120 & 16 & 200k & 570M \\
376
+ Unicoder & 15 & BPE & 12 & 1024 & 4096 & 8 & 95k & 250M \\
377
+ \xlmr\textsubscript{Base} & 100 & SPM & 12 & 768 & 3072 & 12 & 250k & 270M \\
378
+ \xlmr & 100 & SPM & 24 & 1024 & 4096 & 16 & 250k & 550M \\
379
+ GPT2 & 1 & bBPE & 48 & 1600 & 6400 & 32 & 50k & 1.5B \\
380
+ wide-mmNMT & 103 & SPM & 12 & 2048 & 16384 & 32 & 64k & 3B \\
381
+ deep-mmNMT & 103 & SPM & 24 & 1024 & 16384 & 32 & 64k & 3B \\
382
+ T5-3B & 1 & WordPiece & 24 & 1024 & 16384 & 32 & 32k & 3B \\
383
+ T5-11B & 1 & WordPiece & 24 & 1024 & 65536 & 32 & 32k & 11B \\
384
+ % XLNet\textsubscript{Large}$^{\dagger}$ & 1 & 89.8/- & 93.9 & 91.8 & 95.6 & 89.2 & 91.8 & 92.0 \\
385
+ % RoBERTa$^{\dagger}$ & 1 & 90.2/90.2 & 94.7 & 92.2 & 96.4 & 90.9 & 92.4 & 92.8 \\
386
+ % XLM-R & 100 & 88.4/88.5 & 93.1 & 92.2 & 95.1 & 89.7 & 90.4 & 91.5 \\
387
+ \bottomrule
388
+ \end{tabular}
389
+ %}
390
+ \caption{\textbf{Details on model sizes.}
391
+ We show the tokenization used by each Transformer model, the number of layers L, the number of hidden states of the model $H_{m}$, the dimension of the feed-forward layer $H_{ff}$, the number of attention heads A, the size of the vocabulary V and the total number of parameters \#params.
392
+ For Transformer encoders, the number of parameters can be approximated by $4LH_m^2 + 2LH_m H_{ff} + VH_m$.
393
+ GPT2 numbers are from \citet{radford2019language}, mm-NMT models are from the work of \citet{arivazhagan2019massively} on massively multilingual neural machine translation (mmNMT), and T5 numbers are from \citet{raffel2019exploring}. While \xlmr is among the largest models, partly due to its large embedding layer, it has a similar number of parameters to XLM-100, and remains significantly smaller than recently introduced Transformer models for multilingual MT and transfer learning. While this table gives more insight into the difference in capacity of each model, note that it does not highlight other critical differences between the models.
394
+ \label{tab:parameters}}
395
+ \end{center}
396
+ \vspace{-0.4cm}
397
+ \end{table*}
398
+ }
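The parameter approximation given in the table caption, $4LH_m^2 + 2LH_m H_{ff} + VH_m$, can be checked numerically against the reported model sizes. A minimal sketch (model shapes taken from the table; the helper name `approx_params` is ours, not from the paper):

```python
# Approximate Transformer-encoder parameter counts using the caption's formula:
# 4*L*Hm^2 (attention projections) + 2*L*Hm*Hff (feed-forward) + V*Hm (embeddings).
def approx_params(L, Hm, Hff, V):
    return 4 * L * Hm**2 + 2 * L * Hm * Hff + V * Hm

# Shapes (L, Hm, Hff, V) copied from the table above.
models = {
    "XLM-R_Base": (12, 768, 3072, 250_000),   # reported ~270M
    "XLM-R":      (24, 1024, 4096, 250_000),  # reported ~550M
    "BERT_Base":  (12, 768, 3072, 30_000),    # reported ~110M
}

for name, shape in models.items():
    print(f"{name}: ~{approx_params(*shape) / 1e6:.0f}M parameters")
```

For XLM-R the formula gives roughly 558M parameters, of which about 256M sit in the 250k-entry embedding matrix, which is why the caption singles out the large embedding layer.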
references/2019.arxiv.conneau/source/content/vocabsize.pdf ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e45090856dc149265ada0062c8c2456c3057902dfaaade60aa80905785563506
3
+ size 15677
references/2019.arxiv.conneau/source/content/wikicc.pdf ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f0d7e959db8240f283922c3ca7c6de6f5ad3750681f27f4fcf35d161506a7a21
3
+ size 16304
references/2019.arxiv.conneau/source/texput.log ADDED
@@ -0,0 +1,21 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ This is pdfTeX, Version 3.14159265-2.6-1.40.20 (TeX Live 2019) (preloaded format=pdflatex 2019.5.8) 7 APR 2020 17:41
2
+ entering extended mode
3
+ restricted \write18 enabled.
4
+ %&-line parsing enabled.
5
+ **acl2020.tex
6
+
7
+ ! Emergency stop.
8
+ <*> acl2020.tex
9
+
10
+ *** (job aborted, file error in nonstop mode)
11
+
12
+
13
+ Here is how much of TeX's memory you used:
14
+ 3 strings out of 492616
15
+ 102 string characters out of 6129482
16
+ 57117 words of memory out of 5000000
17
+ 4025 multiletter control sequences out of 15000+600000
18
+ 3640 words of font info for 14 fonts, out of 8000000 for 9000
19
+ 1141 hyphenation exceptions out of 8191
20
+ 0i,0n,0p,1b,6s stack positions out of 5000i,500n,10000p,200000b,80000s
21
+ ! ==> Fatal error occurred, no output PDF file produced!
references/2019.arxiv.conneau/source/xlmr.bbl ADDED
@@ -0,0 +1,285 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ \begin{thebibliography}{40}
2
+ \expandafter\ifx\csname natexlab\endcsname\relax\def\natexlab#1{#1}\fi
3
+
4
+ \bibitem[{Akbik et~al.(2018)Akbik, Blythe, and Vollgraf}]{akbik2018coling}
5
+ Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018.
6
+ \newblock Contextual string embeddings for sequence labeling.
7
+ \newblock In \emph{COLING}, pages 1638--1649.
8
+
9
+ \bibitem[{Arivazhagan et~al.(2019)Arivazhagan, Bapna, Firat, Lepikhin, Johnson,
10
+ Krikun, Chen, Cao, Foster, Cherry et~al.}]{arivazhagan2019massively}
11
+ Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson,
12
+ Maxim Krikun, Mia~Xu Chen, Yuan Cao, George Foster, Colin Cherry, et~al.
13
+ 2019.
14
+ \newblock Massively multilingual neural machine translation in the wild:
15
+ Findings and challenges.
16
+ \newblock \emph{arXiv preprint arXiv:1907.05019}.
17
+
18
+ \bibitem[{Bowman et~al.(2015)Bowman, Angeli, Potts, and
19
+ Manning}]{bowman2015large}
20
+ Samuel~R. Bowman, Gabor Angeli, Christopher Potts, and Christopher~D. Manning.
21
+ 2015.
22
+ \newblock A large annotated corpus for learning natural language inference.
23
+ \newblock In \emph{EMNLP}.
24
+
25
+ \bibitem[{Conneau et~al.(2018)Conneau, Rinott, Lample, Williams, Bowman,
26
+ Schwenk, and Stoyanov}]{conneau2018xnli}
27
+ Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel~R.
28
+ Bowman, Holger Schwenk, and Veselin Stoyanov. 2018.
29
+ \newblock Xnli: Evaluating cross-lingual sentence representations.
30
+ \newblock In \emph{EMNLP}. Association for Computational Linguistics.
31
+
32
+ \bibitem[{Devlin et~al.(2018)Devlin, Chang, Lee, and
33
+ Toutanova}]{devlin2018bert}
34
+ Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018.
35
+ \newblock Bert: Pre-training of deep bidirectional transformers for language
36
+ understanding.
37
+ \newblock \emph{NAACL}.
38
+
39
+ \bibitem[{Grave et~al.(2018)Grave, Bojanowski, Gupta, Joulin, and
40
+ Mikolov}]{grave2018learning}
41
+ Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas
42
+ Mikolov. 2018.
43
+ \newblock Learning word vectors for 157 languages.
44
+ \newblock In \emph{LREC}.
45
+
46
+ \bibitem[{Huang et~al.(2019)Huang, Liang, Duan, Gong, Shou, Jiang, and
47
+ Zhou}]{huang2019unicoder}
48
+ Haoyang Huang, Yaobo Liang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, and
49
+ Ming Zhou. 2019.
50
+ \newblock Unicoder: A universal language encoder by pre-training with multiple
51
+ cross-lingual tasks.
52
+ \newblock \emph{ACL}.
53
+
54
+ \bibitem[{Johnson et~al.(2017)Johnson, Schuster, Le, Krikun, Wu, Chen, Thorat,
55
+ Vi{\'e}gas, Wattenberg, Corrado et~al.}]{johnson2017google}
56
+ Melvin Johnson, Mike Schuster, Quoc~V Le, Maxim Krikun, Yonghui Wu, Zhifeng
57
+ Chen, Nikhil Thorat, Fernanda Vi{\'e}gas, Martin Wattenberg, Greg Corrado,
58
+ et~al. 2017.
59
+ \newblock Google’s multilingual neural machine translation system: Enabling
60
+ zero-shot translation.
61
+ \newblock \emph{TACL}, 5:339--351.
62
+
63
+ \bibitem[{Joulin et~al.(2017)Joulin, Grave, and Mikolov}]{joulin2017bag}
64
+ Armand Joulin, Edouard Grave, and Piotr Bojanowski~Tomas Mikolov. 2017.
65
+ \newblock Bag of tricks for efficient text classification.
66
+ \newblock \emph{EACL 2017}, page 427.
67
+
68
+ \bibitem[{Jozefowicz et~al.(2016)Jozefowicz, Vinyals, Schuster, Shazeer, and
69
+ Wu}]{jozefowicz2016exploring}
70
+ Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu.
71
+ 2016.
72
+ \newblock Exploring the limits of language modeling.
73
+ \newblock \emph{arXiv preprint arXiv:1602.02410}.
74
+
75
+ \bibitem[{Kudo(2018)}]{kudo2018subword}
76
+ Taku Kudo. 2018.
77
+ \newblock Subword regularization: Improving neural network translation models
78
+ with multiple subword candidates.
79
+ \newblock In \emph{ACL}, pages 66--75.
80
+
81
+ \bibitem[{Kudo and Richardson(2018)}]{kudo2018sentencepiece}
82
+ Taku Kudo and John Richardson. 2018.
83
+ \newblock Sentencepiece: A simple and language independent subword tokenizer
84
+ and detokenizer for neural text processing.
85
+ \newblock \emph{EMNLP}.
86
+
87
+ \bibitem[{Lample et~al.(2016)Lample, Ballesteros, Subramanian, Kawakami, and
88
+ Dyer}]{lample-etal-2016-neural}
89
+ Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and
90
+ Chris Dyer. 2016.
91
+ \newblock \href {https://doi.org/10.18653/v1/N16-1030} {Neural architectures
92
+ for named entity recognition}.
93
+ \newblock In \emph{NAACL}, pages 260--270, San Diego, California. Association
94
+ for Computational Linguistics.
95
+
96
+ \bibitem[{Lample and Conneau(2019)}]{lample2019cross}
97
+ Guillaume Lample and Alexis Conneau. 2019.
98
+ \newblock Cross-lingual language model pretraining.
99
+ \newblock \emph{NeurIPS}.
100
+
101
+ \bibitem[{Lewis et~al.(2019)Lewis, O\u{g}uz, Rinott, Riedel, and
102
+ Schwenk}]{lewis2019mlqa}
103
+ Patrick Lewis, Barlas O\u{g}uz, Ruty Rinott, Sebastian Riedel, and Holger
104
+ Schwenk. 2019.
105
+ \newblock Mlqa: Evaluating cross-lingual extractive question answering.
106
+ \newblock \emph{arXiv preprint arXiv:1910.07475}.
107
+
108
+ \bibitem[{Liu et~al.(2019)Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Lewis,
109
+ Zettlemoyer, and Stoyanov}]{roberta2019}
110
+ Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer
111
+ Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019.
112
+ \newblock Roberta: {A} robustly optimized {BERT} pretraining approach.
113
+ \newblock \emph{arXiv preprint arXiv:1907.11692}.
114
+
115
+ \bibitem[{Mikolov et~al.(2013{\natexlab{a}})Mikolov, Le, and
116
+ Sutskever}]{mikolov2013exploiting}
117
+ Tomas Mikolov, Quoc~V Le, and Ilya Sutskever. 2013{\natexlab{a}}.
118
+ \newblock Exploiting similarities among languages for machine translation.
119
+ \newblock \emph{arXiv preprint arXiv:1309.4168}.
120
+
121
+ \bibitem[{Mikolov et~al.(2013{\natexlab{b}})Mikolov, Sutskever, Chen, Corrado,
122
+ and Dean}]{mikolov2013distributed}
123
+ Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg~S Corrado, and Jeff Dean.
124
+ 2013{\natexlab{b}}.
125
+ \newblock Distributed representations of words and phrases and their
126
+ compositionality.
127
+ \newblock In \emph{NIPS}, pages 3111--3119.
128
+
129
+ \bibitem[{Pennington et~al.(2014)Pennington, Socher, and
130
+ Manning}]{pennington2014glove}
131
+ Jeffrey Pennington, Richard Socher, and Christopher~D. Manning. 2014.
132
+ \newblock \href {http://www.aclweb.org/anthology/D14-1162} {Glove: Global
133
+ vectors for word representation}.
134
+ \newblock In \emph{EMNLP}, pages 1532--1543.
135
+
136
+ \bibitem[{Peters et~al.(2018)Peters, Neumann, Iyyer, Gardner, Clark, Lee, and
137
+ Zettlemoyer}]{peters2018deep}
138
+ Matthew~E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark,
139
+ Kenton Lee, and Luke Zettlemoyer. 2018.
140
+ \newblock Deep contextualized word representations.
141
+ \newblock \emph{NAACL}.
142
+
143
+ \bibitem[{Pires et~al.(2019)Pires, Schlinger, and Garrette}]{Pires2019HowMI}
144
+ Telmo Pires, Eva Schlinger, and Dan Garrette. 2019.
145
+ \newblock How multilingual is multilingual bert?
146
+ \newblock In \emph{ACL}.
147
+
148
+ \bibitem[{Radford et~al.(2018)Radford, Narasimhan, Salimans, and
149
+ Sutskever}]{radford2018improving}
150
+ Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018.
151
+ \newblock \href
152
+ {https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf}
153
+ {Improving language understanding by generative pre-training}.
154
+ \newblock \emph{URL
155
+ https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language\_understanding\_paper.pdf}.
156
+
157
+ \bibitem[{Radford et~al.(2019)Radford, Wu, Child, Luan, Amodei, and
158
+ Sutskever}]{radford2019language}
159
+ Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya
160
+ Sutskever. 2019.
161
+ \newblock Language models are unsupervised multitask learners.
162
+ \newblock \emph{OpenAI Blog}, 1(8).
163
+
164
+ \bibitem[{Raffel et~al.(2019)Raffel, Shazeer, Roberts, Lee, Narang, Matena,
165
+ Zhou, Li, and Liu}]{raffel2019exploring}
166
+ Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael
167
+ Matena, Yanqi Zhou, Wei Li, and Peter~J. Liu. 2019.
168
+ \newblock Exploring the limits of transfer learning with a unified text-to-text
169
+ transformer.
170
+ \newblock \emph{arXiv preprint arXiv:1910.10683}.
171
+
172
+ \bibitem[{Rajpurkar et~al.(2018)Rajpurkar, Jia, and Liang}]{rajpurkar2018know}
173
+ Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018.
174
+ \newblock Know what you don't know: Unanswerable questions for squad.
175
+ \newblock \emph{ACL}.
176
+
177
+ \bibitem[{Rajpurkar et~al.(2016)Rajpurkar, Zhang, Lopyrev, and
178
+ Liang}]{rajpurkar-etal-2016-squad}
179
+ Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016.
180
+ \newblock \href {https://doi.org/10.18653/v1/D16-1264} {{SQ}u{AD}: 100,000+
181
+ questions for machine comprehension of text}.
182
+ \newblock In \emph{EMNLP}, pages 2383--2392, Austin, Texas. Association for
183
+ Computational Linguistics.
184
+
185
+ \bibitem[{Sang(2002)}]{sang2002introduction}
186
+ Erik~F Sang. 2002.
187
+ \newblock Introduction to the conll-2002 shared task: Language-independent
188
+ named entity recognition.
189
+ \newblock \emph{CoNLL}.
190
+
191
+ \bibitem[{Schuster et~al.(2019)Schuster, Ram, Barzilay, and
192
+ Globerson}]{schuster2019cross}
193
+ Tal Schuster, Ori Ram, Regina Barzilay, and Amir Globerson. 2019.
194
+ \newblock Cross-lingual alignment of contextual word embeddings, with
195
+ applications to zero-shot dependency parsing.
196
+ \newblock \emph{NAACL}.
197
+
198
+ \bibitem[{Siddhant et~al.(2019)Siddhant, Johnson, Tsai, Arivazhagan, Riesa,
199
+ Bapna, Firat, and Raman}]{siddhant2019evaluating}
200
+ Aditya Siddhant, Melvin Johnson, Henry Tsai, Naveen Arivazhagan, Jason Riesa,
201
+ Ankur Bapna, Orhan Firat, and Karthik Raman. 2019.
202
+ \newblock Evaluating the cross-lingual effectiveness of massively multilingual
203
+ neural machine translation.
204
+ \newblock \emph{AAAI}.
205
+
206
+ \bibitem[{Singh et~al.(2019)Singh, McCann, Keskar, Xiong, and
207
+ Socher}]{singh2019xlda}
208
+ Jasdeep Singh, Bryan McCann, Nitish~Shirish Keskar, Caiming Xiong, and Richard
209
+ Socher. 2019.
210
+ \newblock Xlda: Cross-lingual data augmentation for natural language inference
211
+ and question answering.
212
+ \newblock \emph{arXiv preprint arXiv:1905.11471}.
213
+
214
+ \bibitem[{Socher et~al.(2013)Socher, Perelygin, Wu, Chuang, Manning, Ng, and
215
+ Potts}]{socher2013recursive}
216
+ Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher~D Manning,
217
+ Andrew Ng, and Christopher Potts. 2013.
218
+ \newblock Recursive deep models for semantic compositionality over a sentiment
219
+ treebank.
220
+ \newblock In \emph{EMNLP}, pages 1631--1642.
221
+
222
+ \bibitem[{Tan et~al.(2019)Tan, Ren, He, Qin, Zhao, and
223
+ Liu}]{tan2019multilingual}
224
+ Xu~Tan, Yi~Ren, Di~He, Tao Qin, Zhou Zhao, and Tie-Yan Liu. 2019.
225
+ \newblock Multilingual neural machine translation with knowledge distillation.
226
+ \newblock \emph{ICLR}.
227
+
228
+ \bibitem[{Tjong Kim~Sang and De~Meulder(2003)}]{tjong2003introduction}
229
+ Erik~F Tjong Kim~Sang and Fien De~Meulder. 2003.
230
+ \newblock Introduction to the conll-2003 shared task: language-independent
231
+ named entity recognition.
232
+ \newblock In \emph{CoNLL}, pages 142--147. Association for Computational
233
+ Linguistics.
234
+
235
+ \bibitem[{Vaswani et~al.(2017)Vaswani, Shazeer, Parmar, Uszkoreit, Jones,
236
+ Gomez, Kaiser, and Polosukhin}]{transformer17}
237
+ Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
238
+ Aidan~N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017.
239
+ \newblock Attention is all you need.
240
+ \newblock In \emph{Advances in Neural Information Processing Systems}, pages
241
+ 6000--6010.
242
+
243
+ \bibitem[{Wang et~al.(2018)Wang, Singh, Michael, Hill, Levy, and
244
+ Bowman}]{wang2018glue}
245
+ Alex Wang, Amapreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel~R
246
+ Bowman. 2018.
247
+ \newblock Glue: A multi-task benchmark and analysis platform for natural
248
+ language understanding.
249
+ \newblock \emph{arXiv preprint arXiv:1804.07461}.
250
+
251
+ \bibitem[{Wenzek et~al.(2019)Wenzek, Lachaux, Conneau, Chaudhary, Guzman,
252
+ Joulin, and Grave}]{wenzek2019ccnet}
253
+ Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary,
254
+ Francisco Guzman, Armand Joulin, and Edouard Grave. 2019.
255
+ \newblock Ccnet: Extracting high quality monolingual datasets from web crawl
256
+ data.
257
+ \newblock \emph{arXiv preprint arXiv:1911.00359}.
258
+
259
+ \bibitem[{Williams et~al.(2017)Williams, Nangia, and
260
+ Bowman}]{williams2017broad}
261
+ Adina Williams, Nikita Nangia, and Samuel~R Bowman. 2017.
262
+ \newblock A broad-coverage challenge corpus for sentence understanding through
263
+ inference.
264
+ \newblock \emph{Proceedings of the 2nd Workshop on Evaluating Vector-Space
265
+ Representations for NLP}.
266
+
267
+ \bibitem[{Wu et~al.(2019)Wu, Conneau, Li, Zettlemoyer, and
268
+ Stoyanov}]{wu2019emerging}
269
+ Shijie Wu, Alexis Conneau, Haoran Li, Luke Zettlemoyer, and Veselin Stoyanov.
270
+ 2019.
271
+ \newblock Emerging cross-lingual structure in pretrained language models.
272
+ \newblock \emph{ACL}.
273
+
274
+ \bibitem[{Wu and Dredze(2019)}]{wu2019beto}
275
+ Shijie Wu and Mark Dredze. 2019.
276
+ \newblock Beto, bentz, becas: The surprising cross-lingual effectiveness of
277
+ bert.
278
+ \newblock \emph{EMNLP}.
279
+
280
+ \bibitem[{Xie et~al.(2019)Xie, Dai, Hovy, Luong, and Le}]{xie2019unsupervised}
281
+ Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc~V Le. 2019.
282
+ \newblock Unsupervised data augmentation for consistency training.
283
+ \newblock \emph{arXiv preprint arXiv:1904.12848}.
284
+
285
+ \end{thebibliography}
references/2019.arxiv.conneau/source/xlmr.synctex ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:420af1ab9f337834c49b93240fd9062be0a9f1bd9135878e6c96a6d128aa6856
3
+ size 865236
references/2019.arxiv.conneau/source/xlmr.tex ADDED
@@ -0,0 +1,307 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+
2
+ %
3
+ % File acl2020.tex
4
+ %
5
+ %% Based on the style files for ACL 2020, which were
6
+ %% Based on the style files for ACL 2018, NAACL 2018/19, which were
7
+ %% Based on the style files for ACL-2015, with some improvements
8
+ %% taken from the NAACL-2016 style
9
+ %% Based on the style files for ACL-2014, which were, in turn,
10
+ %% based on ACL-2013, ACL-2012, ACL-2011, ACL-2010, ACL-IJCNLP-2009,
11
+ %% EACL-2009, IJCNLP-2008...
12
+ %% Based on the style files for EACL 2006 by
13
+ %%e.agirre@ehu.es or Sergi.Balari@uab.es
14
+ %% and that of ACL 08 by Joakim Nivre and Noah Smith
15
+
16
+ \documentclass[11pt,a4paper]{article}
17
+ \usepackage[hyperref]{acl2020}
18
+ \usepackage{times}
19
+ \usepackage{latexsym}
20
+ \renewcommand{\UrlFont}{\ttfamily\small}
21
+
22
+ % This is not strictly necessary, and may be commented out,
23
+ % but it will improve the layout of the manuscript,
24
+ % and will typically save some space.
25
+ \usepackage{microtype}
26
+ \usepackage{graphicx}
27
+ \usepackage{subfigure}
28
+ \usepackage{booktabs} % for professional tables
29
+ \usepackage{url}
30
+ \usepackage{times}
31
+ \usepackage{latexsym}
32
+ \usepackage{array}
33
+ \usepackage{adjustbox}
34
+ \usepackage{multirow}
35
+ % \usepackage{subcaption}
36
+ \usepackage{hyperref}
37
+ \usepackage{longtable}
38
+
39
+ \input{content/tables}
40
+
41
+
42
+ \aclfinalcopy % Uncomment this line for the final submission
43
+ \def\aclpaperid{479} % Enter the acl Paper ID here
44
+
45
+ %\setlength\titlebox{5cm}
46
+ % You can expand the titlebox if you need extra space
47
+ % to show all the authors. Please do not make the titlebox
48
+ % smaller than 5cm (the original size); we will check this
49
+ % in the camera-ready version and ask you to change it back.
50
+
51
+ \newcommand\BibTeX{B\textsc{ib}\TeX}
52
+ \usepackage{xspace}
53
+ \newcommand{\xlmr}{\textit{XLM-R}\xspace}
54
+ \newcommand{\mbert}{mBERT\xspace}
55
+ \newcommand{\XX}{\textcolor{red}{XX}\xspace}
56
+
57
+ \newcommand{\note}[3]{{\color{#2}[#1: #3]}}
58
+ \newcommand{\ves}[1]{\note{ves}{red}{#1}}
59
+ \newcommand{\luke}[1]{\note{luke}{green}{#1}}
60
+ \newcommand{\myle}[1]{\note{myle}{cyan}{#1}}
61
+ \newcommand{\paco}[1]{\note{paco}{blue}{#1}}
62
+ \newcommand{\eg}[1]{\note{edouard}{orange}{#1}}
63
+ \newcommand{\kk}[1]{\note{kartikay}{pink}{#1}}
64
+
65
+ \renewcommand{\UrlFont}{\scriptsize}
66
+ \title{Unsupervised Cross-lingual Representation Learning at Scale}
67
+
68
+ \author{Alexis Conneau\thanks{\ \ Equal contribution.} \space\space\space
69
+ Kartikay Khandelwal\footnotemark[1] \space\space\space \AND
70
+ \bf Naman Goyal \space\space\space
71
+ Vishrav Chaudhary \space\space\space
72
+ Guillaume Wenzek \space\space\space
73
+ Francisco Guzm\'an \space\space\space \AND
74
+ \bf Edouard Grave \space\space\space
75
+ Myle Ott \space\space\space
76
+ Luke Zettlemoyer \space\space\space
77
+ Veselin Stoyanov \space\space\space \\ \\ \\
78
+ \bf Facebook AI
79
+ }
80
+
81
+ \date{}
82
+
83
+ \begin{document}
84
+ \maketitle
85
+ \begin{abstract}
+ This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed \xlmr, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6\% average accuracy on XNLI, +13\% average F1 score on MLQA, and +2.4\% F1 score on NER. \xlmr performs particularly well on low-resource languages, improving 15.7\% in XNLI accuracy for Swahili and 11.4\% for Urdu over previous XLM models. We also present a detailed empirical analysis of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; \xlmr is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make our code, data and models publicly available.{\let\thefootnote\relax\footnotetext{\scriptsize Correspondence to {\tt \{aconneau,kartikayk\}@fb.com}}}\footnote{\url{https://github.com/facebookresearch/(fairseq-py,pytext,xlm)}}
+ \end{abstract}
+
+
+ \section{Introduction}
+
+ The goal of this paper is to improve cross-lingual language understanding (XLU) by carefully studying the effects of training unsupervised cross-lingual representations at a very large scale.
+ We present \xlmr, a transformer-based multilingual masked language model pre-trained on text in 100 languages, which obtains state-of-the-art performance on cross-lingual classification, sequence labeling and question answering.
+
+ Multilingual masked language models (MLM) like \mbert~\cite{devlin2018bert} and XLM \cite{lample2019cross} have pushed the state-of-the-art on cross-lingual understanding tasks by jointly pretraining large Transformer models~\cite{transformer17} on many languages. These models allow for effective cross-lingual transfer, as seen in a number of benchmarks including cross-lingual natural language inference~\cite{bowman2015large,williams2017broad,conneau2018xnli}, question answering~\cite{rajpurkar-etal-2016-squad,lewis2019mlqa}, and named entity recognition~\cite{Pires2019HowMI,wu2019beto}.
+ However, all of these studies pre-train on Wikipedia, which provides relatively limited scale, especially for lower-resource languages.
+
+
+ In this paper, we first present a comprehensive analysis of the trade-offs and limitations of multilingual language models at scale, inspired by recent monolingual scaling efforts~\cite{roberta2019}.
+ We measure the trade-off between high-resource and low-resource languages and the impact of language sampling and vocabulary size.
+ %By training models with an increasing number of languages,
+ The experiments expose a trade-off as we scale the number of languages for a fixed model capacity: more languages lead to better cross-lingual performance on low-resource languages up until a point, after which the overall performance on monolingual and cross-lingual benchmarks degrades. We refer to this trade-off as the \emph{curse of multilinguality}, and show that it can be alleviated by simply increasing model capacity.
+ We argue, however, that this remains an important limitation for future XLU systems which may aim to improve performance with more modest computational budgets.
+
+ Our best model XLM-RoBERTa (\xlmr) outperforms \mbert on cross-lingual classification by up to 23\% accuracy on low-resource languages.
+ %like Swahili and Urdu.
+ It outperforms the previous state of the art by 5.1\% average accuracy on XNLI, 2.42\% average F1-score on Named Entity Recognition, and 9.1\% average F1-score on cross-lingual Question Answering. We also evaluate monolingual fine-tuning on the GLUE and XNLI benchmarks, where \xlmr obtains results competitive with state-of-the-art monolingual models, including RoBERTa \cite{roberta2019}.
+ These results demonstrate, for the first time, that it is possible to have a single large model for all languages, without sacrificing per-language performance.
+ We will make our code, models and data publicly available, with the hope that this will help research in multilingual NLP and low-resource language understanding.
+
+ \section{Related Work}
+ From pretrained word embeddings~\citep{mikolov2013distributed, pennington2014glove} to pretrained contextualized representations~\citep{peters2018deep,schuster2019cross} and transformer based language models~\citep{radford2018improving,devlin2018bert}, unsupervised representation learning has significantly improved the state of the art in natural language understanding. Parallel work on cross-lingual understanding~\citep{mikolov2013exploiting,schuster2019cross,lample2019cross} extends these systems to more languages and to the cross-lingual setting in which a model is learned in one language and applied in other languages.
+
+ Most recently, \citet{devlin2018bert} and \citet{lample2019cross} introduced \mbert and XLM, masked language models trained on multiple languages, without any cross-lingual supervision.
+ \citet{lample2019cross} propose translation language modeling (TLM) as a way to leverage parallel data and obtain a new state of the art on the cross-lingual natural language inference (XNLI) benchmark~\cite{conneau2018xnli}.
+ They further show strong improvements on unsupervised machine translation and pretraining for sequence generation. \citet{wu2019emerging} show that monolingual BERT representations are similar across languages, explaining in part the natural emergence of multilinguality in bottleneck architectures. Separately, \citet{Pires2019HowMI} demonstrated the effectiveness of multilingual models like \mbert on sequence labeling tasks. \citet{huang2019unicoder} showed gains over XLM using cross-lingual multi-task learning, and \citet{singh2019xlda} demonstrated the efficiency of cross-lingual data augmentation for cross-lingual NLI. However, all of this work was at a relatively modest scale, in terms of the amount of training data, as compared to our approach.
+
+ \insertWikivsCC
+
+ The benefits of scaling language model pretraining by increasing the size of the model as well as the training data have been extensively studied in the literature. For the monolingual case, \citet{jozefowicz2016exploring} show how large-scale LSTM models can obtain much stronger performance on language modeling benchmarks when trained on billions of tokens.
+ %[Kartikay: TODO; CHange the reference to GPT2]
+ GPT~\cite{radford2018improving} also highlights the importance of scaling the amount of data, and RoBERTa \cite{roberta2019} shows that training BERT longer on more data leads to a significant boost in performance. Inspired by RoBERTa, we show that mBERT and XLM are undertuned, and that simple improvements in the learning procedure of unsupervised MLM lead to much better performance. We train on cleaned CommonCrawl corpora~\cite{wenzek2019ccnet}, which increase the amount of data for low-resource languages by two orders of magnitude on average. Similar data has also been shown to be effective for learning high quality word embeddings in multiple languages~\cite{grave2018learning}.
+
+
+ Several efforts have trained massively multilingual machine translation models from large parallel corpora. They uncover the high-resource and low-resource trade-off and the problem of capacity dilution~\citep{johnson2017google,tan2019multilingual}. The work most similar to ours is \citet{arivazhagan2019massively}, which trains a single model in 103 languages on over 25 billion parallel sentences.
+ \citet{siddhant2019evaluating} further analyze the representations obtained by the encoder of a massively multilingual machine translation system and show that it obtains similar results to mBERT on cross-lingual NLI.
+ %, which performs much wors that the XLM models we study.
+ Our work, in contrast, focuses on the unsupervised learning of cross-lingual representations and their transfer to discriminative tasks.
+
+
+ \section{Model and Data}
+ \label{sec:model+data}
+
+ In this section, we present the training objective, languages, and data we use. We follow the XLM approach~\cite{lample2019cross} as closely as possible, only introducing changes that improve performance at scale.
+
+ \paragraph{Masked Language Models.}
+ We use a Transformer model~\cite{transformer17} trained with the multilingual MLM objective~\cite{devlin2018bert,lample2019cross} using only monolingual data. We sample streams of text from each language and train the model to predict the masked tokens in the input.
+ We apply subword tokenization directly on raw text data using SentencePiece~\cite{kudo2018sentencepiece} with a unigram language model~\cite{kudo2018subword}. We sample batches from different languages using the same sampling distribution as \citet{lample2019cross}, but with $\alpha=0.3$. Unlike \citet{lample2019cross}, we do not use language embeddings, which allows our model to better deal with code-switching. We use a large vocabulary size of 250K with a full softmax and train two different models: \xlmr\textsubscript{Base} (L = 12, H = 768, A = 12, 270M params) and \xlmr (L = 24, H = 1024, A = 16, 550M params). For all of our ablation studies, we use a BERT\textsubscript{Base} architecture with a vocabulary of 150K tokens. Appendix~\ref{sec:appendix_B} goes into more detail about the architecture of the different models referenced in this paper.
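The masking step of the multilingual MLM objective is not spelled out in the text above; as a rough illustration only, the following sketch uses BERT-style corruption (15\% of positions selected; of those, 80\% replaced by a mask token, 10\% by a random vocabulary token, 10\% left unchanged). These proportions are BERT's defaults and are assumed here, not stated in this paper.

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", seed=0):
    """BERT-style corruption: select ~15% of positions as prediction
    targets; of those, replace 80% with mask_token, 10% with a random
    vocabulary token, and keep 10% unchanged."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < 0.15:
            targets[i] = tok  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_token
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: position is a target but the token is left unchanged
    return corrupted, targets
```

The loss is then computed only over the positions recorded in `targets`, which is what "predict the masked tokens in the input" refers to.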
+
+ \paragraph{Scaling to a hundred languages.}
+ \xlmr is trained on 100 languages; we provide a full list of languages and associated statistics in Appendix~\ref{sec:appendix_A}. Figure~\ref{fig:wikivscc} specifies the ISO codes of 88 languages that are shared across \xlmr and XLM-100, the model from~\citet{lample2019cross} trained on Wikipedia text in 100 languages.
+
+ Compared to previous work, we replace some languages with more commonly used ones such as romanized Hindi and traditional Chinese. In our ablation studies, we always include the 7 languages for which we have classification and sequence labeling evaluation benchmarks: English, French, German, Russian, Chinese, Swahili and Urdu. We chose this set as it covers a suitable range of language families and includes low-resource languages such as Swahili and Urdu.
+ We also consider larger sets of 15, 30, 60 and all 100 languages. When reporting results on high-resource and low-resource languages, we refer to the average of English and French results, and the average of Swahili and Urdu results respectively.
+
+ \paragraph{Scaling the Amount of Training Data.}
+ Following~\citet{wenzek2019ccnet}\footnote{\url{https://github.com/facebookresearch/cc_net}}, we build a clean CommonCrawl Corpus in 100 languages. We use an internal language identification model in combination with the one from fastText~\cite{joulin2017bag}. We train language models in each language and use them to filter documents as described in \citet{wenzek2019ccnet}. We consider one CommonCrawl dump for English and twelve dumps for all other languages, which significantly increases dataset sizes, especially for low-resource languages like Burmese and Swahili.
+
+ Figure~\ref{fig:wikivscc} shows the difference in size between the Wikipedia Corpus used by mBERT and XLM-100, and the CommonCrawl Corpus we use. As we show in Section~\ref{sec:multimono}, monolingual Wikipedia corpora are too small to enable unsupervised representation learning. Based on our experiments, we found that a few hundred MiB of text data is usually the minimum needed to learn a BERT model.
+
+ \section{Evaluation}
+ We consider four evaluation benchmarks.
+ For cross-lingual understanding, we use cross-lingual natural language inference, named entity recognition, and question answering. We use the GLUE benchmark to evaluate the English performance of \xlmr and compare it to other state-of-the-art models.
+
+ \paragraph{Cross-lingual Natural Language Inference (XNLI).}
+ The XNLI dataset comes with ground-truth dev and test sets in 15 languages, and a ground-truth English training set. The training set has been machine-translated to the remaining 14 languages, providing synthetic training data for these languages as well. We evaluate our model on cross-lingual transfer from English to other languages. We also consider three machine translation baselines: (i) \textit{translate-test}: dev and test sets are machine-translated to English and a single English model is used; (ii) \textit{translate-train} (per-language): the English training set is machine-translated to each language and we fine-tune a multilingual model on each training set; (iii) \textit{translate-train-all} (multi-language): we fine-tune a multilingual model on the concatenation of all training sets from translate-train. For the translations, we use the official data provided by the XNLI project.
+ % In case we want to add more details about the CC-100 corpora : We train language models in each language and use it to filter documents as described in Wenzek et al. (2019). We additionally apply a filter based on type-token ratio score of 0.6. We consider one CommonCrawl snapshot (December, 2018) for English and twelve snapshots from all months of 2018 for all other languages, which significantly increases dataset sizes, especially for low-resource languages like Burmese and Swahili.
+
+ \paragraph{Named Entity Recognition.}
+ % WikiAnn http://nlp.cs.rpi.edu/wikiann/
+ For NER, we consider the CoNLL-2002~\cite{sang2002introduction} and CoNLL-2003~\cite{tjong2003introduction} datasets in English, Dutch, Spanish and German. We fine-tune multilingual models either (1) on the English set to evaluate cross-lingual transfer, (2) on each set to evaluate per-language performance, or (3) on all sets to evaluate multilingual learning. We report the F1 score, and compare to baselines from \citet{lample-etal-2016-neural} and \citet{akbik2018coling}.
+
+ \paragraph{Cross-lingual Question Answering.}
+ We use the MLQA benchmark from \citet{lewis2019mlqa}, which extends the English SQuAD benchmark to Spanish, German, Arabic, Hindi, Vietnamese and Chinese. We report the F1 score as well as the exact match (EM) score for cross-lingual transfer from English.
+
+ \paragraph{GLUE Benchmark.}
+ Finally, we evaluate the English performance of our model on the GLUE benchmark~\cite{wang2018glue}, which gathers multiple classification tasks, such as MNLI~\cite{williams2017broad}, SST-2~\cite{socher2013recursive}, or QNLI~\cite{rajpurkar2018know}. We use BERT\textsubscript{Large} and RoBERTa as baselines.
+
+ \section{Analysis and Results}
+ \label{sec:analysis}
+
+ In this section, we perform a comprehensive analysis of multilingual masked language models. We conduct most of the analysis on XNLI, which we found to be representative of our findings on other tasks. We then present the results of \xlmr on cross-lingual understanding and GLUE. Finally, we compare multilingual and monolingual models, and present results on low-resource languages.
+
+ \subsection{Improving and Understanding Multilingual Masked Language Models}
+ % prior analysis necessary to build \xlmr
+ \insertAblationone
+ \insertAblationtwo
+
+ Much of the work done on understanding the cross-lingual effectiveness of \mbert or XLM~\cite{Pires2019HowMI,wu2019beto,lewis2019mlqa} has focused on analyzing the performance of fixed pretrained models on downstream tasks. In this section, we present a comprehensive study of different factors that are important to \textit{pretraining} large-scale multilingual models. We highlight the trade-offs and limitations of these models as we scale to one hundred languages.
+
+ \paragraph{Transfer-dilution Trade-off and Curse of Multilinguality.}
+ Model capacity (i.e. the number of parameters in the model) is constrained due to practical considerations such as memory and speed during training and inference. For a fixed-size model, the per-language capacity decreases as we increase the number of languages. While low-resource language performance can be improved by adding similar higher-resource languages during pretraining, the overall downstream performance suffers from this capacity dilution~\cite{arivazhagan2019massively}. Positive transfer and capacity dilution have to be traded off against each other.
+
+ We illustrate this trade-off in Figure~\ref{fig:transfer_dilution}, which shows XNLI performance vs.\ the number of languages the model is pretrained on. Initially, as we go from 7 to 15 languages, the model is able to take advantage of positive transfer, which improves performance, especially on low-resource languages. Beyond this point, the {\em curse of multilinguality} kicks in and degrades performance across all languages. Specifically, the overall XNLI accuracy decreases from 71.8\% to 67.7\% as we go from XLM-7 to XLM-100. The same trend can be observed for models trained on the larger CommonCrawl Corpus.
+
+ The issue is even more prominent when the capacity of the model is small. To show this, we pretrain models on Wikipedia Data in 7, 30 and 100 languages. As we add more languages, we make the Transformer wider by increasing the hidden size from 768 to 960 to 1152. In Figure~\ref{fig:capacity}, we show that the added capacity allows XLM-30 to be on par with XLM-7, thus overcoming the curse of multilinguality. The added capacity for XLM-100, however, is not enough and it still lags behind due to higher vocabulary dilution (recall from Section~\ref{sec:model+data} that we used a fixed vocabulary size of 150K for all models).
+
+ \paragraph{High-resource vs Low-resource Trade-off.}
+ The allocation of the model capacity across languages is controlled by several parameters: the training set size, the size of the shared subword vocabulary, and the rate at which we sample training examples from each language. We study the effect of sampling on the performance of high-resource (English and French) and low-resource (Swahili and Urdu) languages for an XLM-100 model trained on Wikipedia (we observe a similar trend for the construction of the subword vocabulary). Specifically, we investigate the impact of varying the $\alpha$ parameter, which controls the exponential smoothing of the language sampling rate. Similar to~\citet{lample2019cross}, we use a sampling rate proportional to the number of sentences in each corpus. Models trained with higher values of $\alpha$ see batches of high-resource languages more often.
+ Figure~\ref{fig:alpha} shows that the higher the value of $\alpha$, the better the performance on high-resource languages, and vice versa. When considering overall performance, we found $0.3$ to be an optimal value for $\alpha$, and use this for \xlmr.
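In formula form, with $q_i$ the empirical share of language $i$ in the corpus, the smoothed sampling probability is $p_i = q_i^{\alpha} / \sum_j q_j^{\alpha}$. A minimal sketch of this rescaling (the sentence counts below are illustrative, not the paper's actual corpus sizes):

```python
def sampling_probs(sentence_counts, alpha=0.3):
    """Exponentially smoothed language sampling: q_i is the empirical share
    of language i, and p_i = q_i**alpha / sum_j q_j**alpha. alpha=1 recovers
    proportional sampling; alpha<1 upweights low-resource languages."""
    total = sum(sentence_counts.values())
    q = {lang: n / total for lang, n in sentence_counts.items()}
    z = sum(v ** alpha for v in q.values())
    return {lang: v ** alpha / z for lang, v in q.items()}

# Illustrative counts only (not the real CC-100 sizes).
counts = {"en": 1_000_000, "sw": 10_000}
proportional = sampling_probs(counts, alpha=1.0)
smoothed = sampling_probs(counts, alpha=0.3)
```

With these illustrative counts, Swahili's sampling probability rises from under 1\% to roughly 20\%, which is exactly the high-resource/low-resource dial that Figure~\ref{fig:alpha} sweeps.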
+
+ \paragraph{Importance of Capacity and Vocabulary.}
+ In previous sections and in Figure~\ref{fig:capacity}, we showed the importance of scaling the model size as we increase the number of languages. Similar to the overall model size, we argue that scaling the size of the shared vocabulary (the vocabulary capacity) can improve the performance of multilingual models on downstream tasks. To illustrate this effect, we train XLM-100 models on Wikipedia data with different vocabulary sizes. We keep the overall number of parameters constant by adjusting the width of the transformer. Figure~\ref{fig:vocab} shows that even with a fixed capacity, we observe a 2.8\% increase in XNLI average accuracy as we increase the vocabulary size from 32K to 256K. This suggests that multilingual models can benefit from allocating a higher proportion of the total number of parameters to the embedding layer even though this reduces the size of the Transformer.
+ %With bigger models, we believe that using a vocabulary of up to 2 million tokens with an adaptive softmax~\cite{grave2017efficient,baevski2018adaptive} should improve performance even further, but we leave this exploration to future work.
+ For simplicity and given the softmax computational constraints, we use a vocabulary of 250K for \xlmr.
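To see why vocabulary capacity dominates here, note that the token embedding matrix alone contributes $V \times H$ parameters. A quick back-of-the-envelope check against the sizes stated earlier (270M total for \xlmr\textsubscript{Base}, 250K vocabulary, hidden size 768; tied input/output embeddings assumed, which is our assumption rather than a statement from the paper):

```python
def embedding_params(vocab_size, hidden_size):
    # Parameters in a single (tied) token embedding / output projection matrix.
    return vocab_size * hidden_size

base_total = 270_000_000              # XLM-R Base total, as reported in the paper
emb = embedding_params(250_000, 768)  # 250K vocabulary, hidden size 768
share = emb / base_total
# With a 250K vocabulary, roughly 70% of the Base model's parameters
# sit in the embedding layer alone.
```

Under a fixed parameter budget, growing $V$ therefore directly shrinks the Transformer's width, which is the trade-off the vocabulary ablation in this section measures.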
+
+ We further illustrate the importance of this parameter by training three models with the same transformer architecture (BERT\textsubscript{Base}) but with different vocabulary sizes: 128K, 256K and 512K. We observe more than 3\% gains in overall accuracy on XNLI by simply increasing the vocabulary size from 128K to 512K.
+
+ \paragraph{Larger-scale Datasets and Training.}
+ As shown in Figure~\ref{fig:wikivscc}, the CommonCrawl Corpus that we collected has significantly more monolingual data than the previously used Wikipedia corpora. Figure~\ref{fig:curse} shows that for the same BERT\textsubscript{Base} architecture, all models trained on CommonCrawl obtain significantly better performance.
+
+ Apart from scaling the training data, \citet{roberta2019} also showed the benefits of training MLMs longer. In our experiments, we observed similar effects of large-scale training, such as increasing batch size (see Figure~\ref{fig:batch}) and training time, on model performance. Specifically, we found that using validation perplexity as a stopping criterion for pretraining caused the multilingual MLM in \citet{lample2019cross} to be under-tuned. In our experience, performance on downstream tasks continues to improve even after validation perplexity has plateaued. Combining this observation with our implementation of the unsupervised XLM-MLM objective, we were able to improve the performance of \citet{lample2019cross} from 71.3\% to more than 75\% average accuracy on XNLI, which was on par with their supervised translation language modeling (TLM) objective. Based on these results, and given our focus on unsupervised learning, we decided not to use the supervised TLM objective for training our models.
+
+
+ \paragraph{Simplifying Multilingual Tokenization with SentencePiece.}
+ The different language-specific tokenization tools used by mBERT and XLM-100 make these models more difficult to use on raw text. Instead, we train a SentencePiece model (SPM) and apply it directly on raw text data for all languages. We did not observe any loss in performance for models trained with SPM when compared to models trained with language-specific preprocessing and byte-pair encoding (see Figure~\ref{fig:batch}) and hence use SPM for \xlmr.
+
+ \subsection{Cross-lingual Understanding Results}
+ Based on these results, we adapt the setting of \citet{lample2019cross} and use a large Transformer model with 24 layers and 1024 hidden states, with a 250K vocabulary. We use the multilingual MLM loss and train our \xlmr model for 1.5 million updates on five hundred 32GB Nvidia V100 GPUs with a batch size of 8192. We leverage the SPM-preprocessed text data from CommonCrawl in 100 languages and sample languages with $\alpha=0.3$. In this section, we show that it outperforms all previous techniques on cross-lingual benchmarks while getting performance on par with RoBERTa on the GLUE benchmark.
+
+
+ \insertXNLItable
+
+ \paragraph{XNLI.}
+ Table~\ref{tab:xnli} shows XNLI results and adds some additional details: (i) the number of models the approach induces (\#M), (ii) the data on which the model was trained (D), and (iii) the number of languages the model was pretrained on (\#lg). As we show in our results, these parameters significantly impact performance. Column \#M specifies whether model selection was done separately on the dev set of each language ($N$ models), or on the joint dev set of all the languages (single model). We observe a 0.6 decrease in overall accuracy when we go from $N$ models to a single model, going from 71.3 to 70.7. We encourage the community to adopt this setting. For cross-lingual transfer, while this approach is not fully zero-shot transfer, we argue that in real applications, a small amount of supervised data is often available for validation in each language.
+
+ \xlmr sets a new state of the art on XNLI.
+ On cross-lingual transfer, \xlmr obtains 80.9\% accuracy, outperforming the XLM-100 and \mbert open-source models by 10.2\% and 14.6\% average accuracy. On the Swahili and Urdu low-resource languages, \xlmr outperforms XLM-100 by 15.7\% and 11.4\%, and \mbert by 23.5\% and 15.8\%. While \xlmr handles 100 languages, we also show that it outperforms the previous state of the art, Unicoder~\citep{huang2019unicoder} and XLM (MLM+TLM), which handle only 15 languages, by 5.5\% and 5.8\% average accuracy respectively. Using the multilingual training of translate-train-all, \xlmr further improves performance and reaches 83.6\% accuracy, a new overall state of the art for XNLI, outperforming Unicoder by 5.1\%. Multilingual training is similar to practical applications where training sets are available in various languages for the same task. In the case of XNLI, datasets have been translated, and translate-train-all can be seen as some form of cross-lingual data augmentation~\cite{singh2019xlda}, similar to back-translation~\cite{xie2019unsupervised}.
+
+ \insertNER
+ \paragraph{Named Entity Recognition.}
+ In Table~\ref{tab:ner}, we report results of \xlmr and \mbert on CoNLL-2002 and CoNLL-2003. We consider the LSTM + CRF approach from \citet{lample-etal-2016-neural} and the Flair model from \citet{akbik2018coling} as baselines. We evaluate the performance of the model on each of the target languages in three different settings: (i) train on English data only (en); (ii) train on data in the target language (each); (iii) train on data in all languages (all). Results of \mbert are reported from \citet{wu2019beto}. Note that we do not use a linear-chain CRF on top of \xlmr and \mbert representations, which gives an advantage to \citet{akbik2018coling}. Without the CRF, our \xlmr model still performs on par with the state of the art, outperforming \citet{akbik2018coling} on Dutch by $2.09$ points. On this task, \xlmr also outperforms \mbert by 2.42 F1 on average for cross-lingual transfer, and 1.86 F1 when trained on each language. Training on all languages leads to an average F1 score of 89.43\%, outperforming the cross-lingual transfer approach by 8.49\%.
+
+ \paragraph{Question Answering.}
+ We also obtain new state of the art results on the MLQA cross-lingual question answering benchmark, introduced by \citet{lewis2019mlqa}. We follow their procedure by training on the English training data and evaluating on the 7 languages of the dataset.
+ We report results in Table~\ref{tab:mlqa}.
+ \xlmr obtains F1 and EM scores of 70.7\% and 52.7\%, while the previous state of the art was 61.6\% and 43.5\%. \xlmr also outperforms \mbert by 13.0\% F1-score and 11.1\% EM. It even outperforms BERT\textsubscript{Large} on English, confirming its strong monolingual performance.
+
+ \insertMLQA
+
+
+ \subsection{Multilingual versus Monolingual}
+ \label{sec:multimono}
+ In this section, we present results of multilingual XLM models against monolingual BERT models.
+
+ \paragraph{GLUE: \xlmr versus RoBERTa.}
+ Our goal is to obtain a multilingual model with strong performance on both cross-lingual understanding tasks and natural language understanding tasks for each language. To that end, we evaluate \xlmr on the GLUE benchmark. We show in Table~\ref{tab:glue} that \xlmr obtains better average dev performance than BERT\textsubscript{Large} by 1.6\% and reaches performance on par with XLNet\textsubscript{Large}. The RoBERTa model outperforms \xlmr by only 1.0\% on average. We believe future work can reduce this gap even further by alleviating the curse of multilinguality and vocabulary dilution. These results demonstrate the possibility of learning one model for many languages while maintaining strong performance on per-language downstream tasks.
+
+ \insertGlue
+
+ \paragraph{XNLI: XLM versus BERT.}
+ A recurrent criticism against multilingual models is that they obtain worse performance than their monolingual counterparts. In addition to the comparison of \xlmr and RoBERTa, we provide the first comprehensive study to assess this claim on the XNLI benchmark. We extend our comparison between multilingual XLM models and monolingual BERT models on 7 languages and compare performance in Table~\ref{tab:multimono}. We train 14 monolingual BERT models on Wikipedia and CommonCrawl (capped at 60 GiB),
+ %\footnote{For simplicity, we use a reduced version of our corpus by capping the size of each monolingual dataset to 60 GiB.}
+ and two XLM-7 models. We increase the vocabulary size of the multilingual model for a better comparison.
+ % To our surprise - and backed by further study on internal benchmarks -
+ We found that \textit{multilingual models can outperform their monolingual BERT counterparts}. Specifically, in Table~\ref{tab:multimono}, we show that for cross-lingual transfer, monolingual baselines outperform XLM-7 for both Wikipedia and CC by 1.6\% and 1.3\% average accuracy. However, by making use of multilingual training (translate-train-all) and leveraging training sets coming from multiple languages, XLM-7 can outperform the BERT models: our XLM-7 trained on CC obtains 80.0\% average accuracy on the 7 languages, while the average performance of BERT models trained on CC is 77.5\%. This is a surprising result that shows that the capacity of multilingual models to leverage training data coming from multiple languages for a particular task can overcome the capacity dilution problem to obtain better overall performance.
+
+
+ \insertMultiMono
+
+ \subsection{Representation Learning for Low-resource Languages}
+ We observed in Table~\ref{tab:multimono} that pretraining on Wikipedia for Swahili and Urdu performed similarly to a randomly initialized model, most likely due to the small size of the data for these languages. On the other hand, pretraining on CC improved performance by up to 10 points. This confirms our assumption that mBERT and XLM-100 rely heavily on cross-lingual transfer but do not model the low-resource languages as well as \xlmr. Specifically, in the translate-train-all setting, we observe that the biggest gains for XLM models trained on CC, compared to their Wikipedia counterparts, are on low-resource languages; 7\% and 4.8\% improvement on Swahili and Urdu respectively.
+
+ \section{Conclusion}
+ In this work, we introduced \xlmr, our new state-of-the-art multilingual masked language model trained on 2.5 TB of newly created clean CommonCrawl data in 100 languages. We showed that it provides strong gains over previous multilingual models like \mbert and XLM on classification, sequence labeling and question answering. We exposed the limitations of multilingual MLMs, in particular by uncovering the high-resource versus low-resource trade-off, the curse of multilinguality and the importance of key hyperparameters. We also exposed the surprising effectiveness of multilingual models over monolingual models, and showed strong improvements on low-resource languages.
+ % \section*{Acknowledgements}
+
+
+ \bibliography{acl2020}
+ \bibliographystyle{acl_natbib}
+
+
+ \newpage
+ \clearpage
+ \appendix
+ \onecolumn
+ \section*{Appendix}
+ \section{Languages and statistics for CC-100 used by \xlmr}
+ In this section we present the list of languages in the CC-100 corpus we created for training \xlmr. We also report statistics such as the number of tokens and the size of each monolingual corpus.
+ \label{sec:appendix_A}
+ \insertDataStatistics
+
+ % \newpage
+ \section{Model Architectures and Sizes}
+ As we showed in Section~\ref{sec:analysis}, capacity is an important parameter for learning strong cross-lingual representations. In the table below, we list multiple monolingual and multilingual models used by the research community and summarize their architectures and total number of parameters.
+ \label{sec:appendix_B}
+
+ \insertParameters
+
+
+ \end{document}
references/2019.arxiv.ho/paper.md ADDED
@@ -0,0 +1,220 @@
+ ---
+ title: "Emotion Recognition for Vietnamese Social Media Text"
+ authors:
+ - "Vong Anh Ho"
+ - "Duong Huynh-Cong Nguyen"
+ - "Danh Hoang Nguyen"
+ - "Linh Thi-Van Pham"
+ - "Duc-Vu Nguyen"
+ - "Kiet Van Nguyen"
+ - "Ngan Luu-Thuy Nguyen"
+ year: 2019
+ venue: "arXiv"
+ url: "https://arxiv.org/abs/1911.09339"
+ arxiv: "1911.09339"
+ ---
+
+ \title{Emotion Recognition\\for Vietnamese Social Media Text}
+
+ \titlerunning{Emotion Recognition for Vietnamese Social Media Text}
+
+ \author{Vong Anh Ho\inst{1,4}\textsuperscript{(\Letter)} \and
+ Duong Huynh-Cong Nguyen\inst{1,4} \and
+ Danh Hoang Nguyen\inst{1,4} \and
+ \\Linh Thi-Van Pham\inst{2,4} \and
+ Duc-Vu Nguyen\inst{3,4} \and
+ \\Kiet Van Nguyen\inst{1,4} \and
+ Ngan Luu-Thuy Nguyen\inst{1,4}}
+
+ \authorrunning{Vong Anh Ho et al.}
+
+ \institute{University of Information Technology, VNU-HCM, Vietnam\\
+ \email{\{15521025,15520148,15520090\}@gm.uit.edu.vn, \{kietnv,ngannlt\}@uit.edu.vn}\\
+ \and
+ University of Social Sciences and Humanities, VNU-HCM, Vietnam\\
+ \email{vanlinhpham888@gmail.com}\\
+ \and
+ Multimedia Communications Laboratory, University of Information Technology, VNU-HCM, Vietnam\\
+ \email{vund@uit.edu.vn}\\
+ \and
+ Vietnam National University, Ho Chi Minh City, Vietnam}
+ \maketitle
+
+ \begin{abstract}
+
+ Emotion recognition or emotion prediction is a higher approach or a special case of sentiment analysis. In this task, the result is not produced in terms of either polarity: positive or negative or in the form of rating (from 1 to 5) but of a more detailed level of analysis in which the results are depicted in more expressions like sadness, enjoyment, anger, disgust, fear and surprise. Emotion recognition plays a critical role in measuring brand value of a product by recognizing specific emotions of customers' comments. In this study, we have achieved two targets. First and foremost, we built a standard **V**ietnamese **S**ocial **M**edia **E**motion **C**orpus (UIT-VSMEC) with exactly 6,927 emotion-annotated sentences, contributing to emotion recognition research in Vietnamese which is a low-resource language in natural language processing (NLP). Secondly, we assessed and measured machine learning and deep neural network models on our UIT-VSMEC corpus. As a result, the CNN model achieved the highest performance with the weighted F1-score of 59.74%. Our corpus is available at our research website: https://sites.google.com/uit.edu.vn/uit-nlp/corpora-projects
+
+ \keywords{Emotion Recognition \and Emotion Prediction \and Vietnamese \and Machine Learning \and Deep Learning \and CNN \and LSTM \and SVM.}
+ \end{abstract}
+
+ \input{sections/1-introduction.tex}
+ \input{sections/2-relatedwork.tex}
+ \input{sections/3-corpus.tex}
+ \input{sections/4-method.tex}
+ \input{sections/5-experiments.tex}
+ \input{sections/6-conclusion.tex}
+
+ # Acknowledgment
+ We would like to give our thanks to the NLP@UIT research group and the Citynow-UIT Laboratory of the University of Information Technology - Vietnam National University Ho Chi Minh City for their supports with pragmatic and inspiring advice.
+
+ \bibliographystyle{splncs04}
+
+ \begin{thebibliography}{10}
+ \providecommand{\url}[1]{`#1`}
+ \providecommand{\urlprefix}{URL }
+ \providecommand{\doi}[1]{https://doi.org/#1}
+
+ \bibitem{PlabanKumarBhowmick}
+ {Bhowmick}, P.K., {Basu}, A., {Mitra}, P.: {An Agreement Measure for
+ Determining Inter-Annotator Reliability of Human Judgements on Affective
+ Tex}. In: {Proceedings of the workshop on Human Judgements in Computational
+ Linguistics}. pp. 58--65. COLING 2008, Manchester, United Kingdom (2008)
+
+ \bibitem{Jointstockcompany}
+ company, J.S.: {The habit of using social networks of Vietnamese people 2018}.
+ brands vietnam, Ho Chi Minh City, Vietnam (2018)
+
+ \bibitem{Ekman1993}
+ {Ekman}, P.: In: {Facial expression and emotion}. vol.~48, pp. 384--392.
+ {American Psychologist} (1993)
+
+ \bibitem{Ekman}
+ {Ekman}, P.: In: {Emotions revealed: Recognizing faces and feelings to improve
+ communication and emotional life}. p.~2007. {Macmillan} (2012)
+
+ \bibitem{PaulEkman}
+ {Ekman}, P., {Ekman}, E., {Lama}, D.: In: {The Ekmans' Atlas of Emotion} (2018)
+
+ \bibitem{Kim}
+ {Kim}, Y.: {Convolutional Neural Networks for Sentence Classifications}. In:
+ {Proceedings of the 2014 Conference on Empirical Methods in Natural Language
+ Processing (EMNLP)}. pp. 1746--1751. { Association for Computational
+ Linguistics}, Doha, Qatar (2014)
+
+ \bibitem{Kiritchenko}
+ {Kiritchenko}, S., {Mohammad}, S.: {Using Hashtags to Capture Fine Emotion
+ Categories from Tweets}. In: {Computational Intelligence}. pp. 301--326
+ (2015)
+
+ \bibitem{RomanKlinger}
+ {Klinger}, R., {Clerc}, O.D., {Mohammad}, S.M., {Balahur}, A.: {IEST:WASSA-2018
+ Implicit Emotions Shared Task}. pp. 31--42. 2017 AFNLP, Brussels, Belgium
+ (2018)
+
+ \bibitem{BernhardKratzwald}
+ {Kratzwald}, B., {Ilic}, S., {Kraus}, M., S.~{Feuerriegel}, H.P.: {Decision
+ support with text-based emotion recognition: Deep learning for affective
+ computing}. pp. 24 -- 35. {Decision Support Systems} (2018)
+
+ \bibitem{SaifMohammad2017}
+ {Mohammad}, S., {Bravo-Marquez}, F.: {Emotion Intensities in Tweets}. In:
+ {Proceedings of the Sixth Joint Conference on Lexical and Computational
+ Semantics (*SEM)}. pp. 65--77. Association for Computational Linguistics,
+ Vancouver, Canada (2017)
+
+ \bibitem{Mohammad}
+ {Mohammad}, S.M.: {\#Emotional Tweets}. In: {First Joint Conference on Lexical
+ and Computational Semantics (*SEM)}. pp. 246--255. {Association for
+ Computational Linguistics}, Montreal, Canada (2012)
+
+ \bibitem{Mohammad2018}
+ {Mohammad}, S.M., {Bravo-Marquez}, F., {Salameh}, M., {Kiritchenko}, S.:
+ {SemEval-2018 task 1: Affect in tweets}. pp. 1--17. Proceedings of
+ International Workshop on Semantic Evaluation, New Orleans, Louisiana (2018)
+
+ \bibitem{SaifMohammad}
+ {Mohammad}, S.M., {Xiaodan}, Z., {Kiritchenko}, S., {Martin}, J.: {Sentiment,
+ emotion, purpose, and style in electoral tweets}. pp. 480--499. Information
+ Processing and Management: an International Journal (2015)
+
+ \bibitem{Nguyen}
+ Nguyen: {Vietnam has the 7th largest number of Facebook users in the world}.
+ Dan Tri newspaper (2018)
+
+ \bibitem{VLSPX}
+ {Nguyen}, H.T.M., {Nguyen}, H.V., {Ngo}, Q.T., {Vu}, L.X., {Tran}, V.M., {Ngo},
+ B.X., {Le}, C.A.: {VLSP Shared Task: Sentiment Analysis}. In: {Journal of
+ Computer Science and Cybernetics}. pp. 295--310 (2018)
+
+
+ \bibitem{KietVanNguyen}
+ {Nguyen}, K.V., {Nguyen}, V.D., {Nguyen}, P., {Truong}, T., {Nguyen}, N.L.T.:
+ {UIT-VSFC: Vietnamese Students’ Feedback Corpus for Sentiment Analysis}.
+ In: {2018 10th International Conference on Knowledge and Systems Engineering
+ (KSE)}. pp. 19--24. {IEEE}, Ho Chi Minh City, Vietnam (2018)
+
+ \bibitem{PhuNguyen}
+ {Nguyen}, P.X.V., {Truong}, T.T.H., {Nguyen}, K.V., {Nguyen}, N.L.T.: {Deep
+ Learning versus Traditional Classifiers on Vietnamese Students' Feedback
+ Corpus}. In: {2018 5th NAFOSTED Conference on Information and Computer
+ Science (NICS)}. pp. 75--80. Ho Chi Minh City, Vietnam (2018)
+
+ \bibitem{VuDucNguyen}
+ {Nguyen}, V.D., {Nguyen}, K.V., {Nguyen}, N.L.T.: {Variants of Long Short-Term
+ Memory for Sentiment Analysis on Vietnamese Students’ Feedback Corpus}. In:
+ {2018 10th International Conference on Knowledge and Systems Engineering
+ (KSE)}. pp. 306--311. IEEE, Ho Chi Minh City, Vietnam (2018)
+
+ \bibitem{AurelienGeron}
+ Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel,
+ O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J.,
+ Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.:
+ Scikit-learn: Machine learning in python. Journal of Machine Learning
+ Research **12**, 2825--2830 (2011)
+
+ \bibitem{CarloStrapparava}
+ {Strapparava}, C., {Mihalcea}, R.: {SemEval-2007 Task 14: Affective Text}. In:
+ {Proceedings of the 4th International Workshop on Semantic Evaluations
+ (SemEval-2007)}. pp. 70--74. { Association for Computational Linguistics},
+ Prague (2007)
+
+ \bibitem{TingweiWang}
+ {T. {Wang} and X. {Yang} and C. {Ouyang}}: {A Multi-emotion Classification
+ Method Based on BLSTM-MC in Code-Switching Text: 7th CCF International
+ Conference, NLPCC 2018, Hohhot, China, August 26–30, 2018, Proceedings,
+ Part II.} pp. 190--199. Natural Language Processing and Chinese Computing
+ (2018)
+
+ \bibitem{ZhongqingWang}
+ {Wang}, Z., {Li}, S.: {Overview of NLPCC 2018 Shared Task 1: Emotion Detection
+ in Code-Switching Text: 7th CCF International Conference, NLPCC 2018, Hohhot,
+ China, August 26–30, 2018, Proceedings, Part II}. pp. 429--433. Natural
+ Language Processing and Chinese Computing (2018)
+
+ \bibitem{Facial2007}
+ {Zhang}, S., {Wu}, Z., {Meng}, H., {Cai}, L.: Facial expression synthesis using
+ pad emotional parameters for a chinese expressive avatar. vol.~4738, pp.
+ 24--35 (09 2007)
+
+ \bibitem{YingjieZhang}
+ {Zhang}, Y., {Wallace}, B.C.: {A Sensitivity Analysis of (and Practitioners’
+ Guide to Convolutional}. pp. 253--263. 2017 AFNLP, Taipei, Taiwan (2017)
+
+ \bibitem
+ {Nguyen}, K.V., {Nguyen}, N.L.T., 2016, October. {Vietnamese transition-based dependency parsing with supertag features}. In 2016 Eighth International Conference on Knowledge and Systems Engineering (KSE) (pp. 175-180). IEEE.
+
+ \bibitem{nguyen2014treebank}
+ Nguyen, D.Q., Pham, S.B., Nguyen, P.T., Le Nguyen, M., et al.: From treebank conversion to automatic dependency parsing for vietnamese. In: International Conference on Applications of Natural Language to Data Bases/Information Systems. pp. 196–207. Springer (2014)
+
+ \bibitem{nguyen2016vietnamese}
+ Nguyen, K.V., Nguyen, N.L.T.: Vietnamese transition-based dependency parsing with supertag features. In: 2016 Eighth International Conference on Knowledge and Systems Engineering (KSE). pp. 175–180. IEEE (2016)
+
+ \bibitem{bach2018empirical}
+ Bach, N.X., Linh, N.D., Phuong, T.M.: An empirical study on pos tagging for vietnamese social media text. Computer Speech \& Language 50, 1–15 (2018)
+
+ \bibitem{nguyen2017word}
+ Nguyen, D.Q., Vu, T., Nguyen, D.Q., Dras, M., Johnson, M.: From word segmentation to pos tagging for vietnamese. arXiv preprint arXiv:1711.04951 (2017)
+
+ \bibitem{thao2007named}
+ Thao, P.T.X., Tri, T.Q., Dien, D., Collier, N.: Named entity recognition in vietnamese using classifier voting. ACM Transactions on Asian Language Information Processing (TALIP) 6(4), 1–18 (2007)
+
+ \bibitem{nguyen2016approach}
+ Nguyen, L.H., Dinh, D., Tran, P.: An approach to construct a named entity annotated english-vietnamese bilingual corpus. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) 16(2), 1–17 (2016)
+
+ \bibitem{Nguyen_2009}
+ Nguyen, D.Q., Nguyen, D.Q., Pham, S.B.: A vietnamese question answering system. 2009 International Conference on Knowledge and Systems Engineering (Oct 2009). https://doi.org/10.1109/kse.2009.42, http://dx.doi.org/10.1109/KSE.2009.42
+
+ \bibitem{le2018factoid}
+ Le-Hong, P., Bui, D.T.: A factoid question answering system for vietnamese. In: Companion Proceedings of The Web Conference 2018. pp. 1049–1055 (2018)
+
+ \end{thebibliography}
references/2019.arxiv.ho/paper.pdf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:583f61e2334e8547aba92a311850d5fb7b6dbac7301b0d9af9186c3ffb7aed60
+ size 365205
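The `paper.pdf` entry above is a Git LFS pointer file rather than the PDF itself: per the commit's git-lfs configuration for binaries, the repository stores only the spec version, a SHA-256 object id, and the byte size, while the actual file lives in LFS storage and is fetched by `git lfs pull`. A minimal sketch of parsing such a pointer with standard tools (the `pointer.txt` filename here is illustrative, not part of the commit):

```shell
# Write a pointer file in the Git LFS v1 format (contents copied from the diff above).
cat > pointer.txt <<'EOF'
version https://git-lfs.github.com/spec/v1
oid sha256:583f61e2334e8547aba92a311850d5fb7b6dbac7301b0d9af9186c3ffb7aed60
size 365205
EOF

# Extract the object id and size; git-lfs uses these to locate and
# verify the real binary when checking it out from the LFS remote.
oid=$(awk '$1 == "oid" {print $2}' pointer.txt)
size=$(awk '$1 == "size" {print $2}' pointer.txt)
echo "$oid $size"
```

After cloning this repository, `git lfs pull` replaces each such pointer with the binary whose SHA-256 matches the recorded `oid`.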
references/2019.arxiv.ho/paper.tex ADDED
@@ -0,0 +1,239 @@
+ \documentclass[runningheads]{llncs}
+ \usepackage{graphicx}
+ \usepackage{marvosym}
+ \usepackage{amsmath,amssymb,amsfonts}
+ \usepackage{algorithmic}
+ \usepackage{graphicx}
+ \usepackage{textcomp}
+ \usepackage[T5]{fontenc}
+ \usepackage[utf8]{inputenc}
+ \usepackage[vietnamese,english]{babel}
+ \usepackage{pifont}
+ \usepackage{float}
+ \usepackage{caption}
+ \usepackage{placeins}
+ \usepackage{array}
+ \newcolumntype{P}[1]{>{\centering\arraybackslash}p{#1}}
+ \newcolumntype{M}[1]{>{\centering\arraybackslash}m{#1}}
+ \usepackage{multirow}
+ \usepackage{hyperref}
+ \usepackage{amsmath}
+ \hypersetup{colorlinks, citecolor=blue, linkcolor=blue, urlcolor=blue}
+
+ \begin{document}
+ % \title{Emotion Recognition for Vietnamese Social Media Text}
+ \title{Emotion Recognition\\for Vietnamese Social Media Text}
+
+ \titlerunning{Emotion Recognition for Vietnamese Social Media Text}
+
+ \author{Vong Anh Ho\inst{1,4}\textsuperscript{(\Letter)} \and
+ Duong Huynh-Cong Nguyen\inst{1,4} \and
+ Danh Hoang Nguyen\inst{1,4} \and
+ \\Linh Thi-Van Pham\inst{2,4} \and
+ Duc-Vu Nguyen\inst{3,4} \and
+ \\Kiet Van Nguyen\inst{1,4} \and
+ Ngan Luu-Thuy Nguyen\inst{1,4}}
+
+ \authorrunning{Vong Anh Ho et al.}
+
+ \institute{University of Information Technology, VNU-HCM, Vietnam\\
+ \email{\{15521025,15520148,15520090\}@gm.uit.edu.vn, \{kietnv,ngannlt\}@uit.edu.vn}\\
+ \and
+ University of Social Sciences and Humanities, VNU-HCM, Vietnam\\
+ \email{vanlinhpham888@gmail.com}\\
+ \and
+ Multimedia Communications Laboratory, University of Information Technology, VNU-HCM, Vietnam\\
+ \email{vund@uit.edu.vn}\\
+ \and
+ Vietnam National University, Ho Chi Minh City, Vietnam}
+ \maketitle
+
+ \begin{abstract}
+
+ Emotion recognition or emotion prediction is a higher approach or a special case of sentiment analysis. In this task, the result is not produced in terms of either polarity: positive or negative or in the form of rating (from 1 to 5) but of a more detailed level of analysis in which the results are depicted in more expressions like sadness, enjoyment, anger, disgust, fear and surprise. Emotion recognition plays a critical role in measuring brand value of a product by recognizing specific emotions of customers' comments. In this study, we have achieved two targets. First and foremost, we built a standard \textbf{V}ietnamese \textbf{S}ocial \textbf{M}edia \textbf{E}motion \textbf{C}orpus (UIT-VSMEC) with exactly 6,927 emotion-annotated sentences, contributing to emotion recognition research in Vietnamese which is a low-resource language in natural language processing (NLP). Secondly, we assessed and measured machine learning and deep neural network models on our UIT-VSMEC corpus. As a result, the CNN model achieved the highest performance with the weighted F1-score of 59.74\%. Our corpus is available at our research website \footnote[1]{\url{ https://sites.google.com/uit.edu.vn/uit-nlp/corpora-projects}}.
+
+ \keywords{Emotion Recognition \and Emotion Prediction \and Vietnamese \and Machine Learning \and Deep Learning \and CNN \and LSTM \and SVM.}
+ \end{abstract}
+
+ \input{sections/1-introduction.tex}
+ \input{sections/2-relatedwork.tex}
+ \input{sections/3-corpus.tex}
+ \input{sections/4-method.tex}
+ \input{sections/5-experiments.tex}
+ \input{sections/6-conclusion.tex}
+
+ \section*{Acknowledgment}
+ We would like to give our thanks to the NLP@UIT research group and the Citynow-UIT Laboratory of the University of Information Technology - Vietnam National University Ho Chi Minh City for their supports with pragmatic and inspiring advice.
+
+ \bibliographystyle{splncs04}
+ % \bibliography{bibliography}
+ \begin{thebibliography}{10}
+ \providecommand{\url}[1]{\texttt{#1}}
+ \providecommand{\urlprefix}{URL }
+ \providecommand{\doi}[1]{https://doi.org/#1}
+
+ \bibitem{PlabanKumarBhowmick}
+ {Bhowmick}, P.K., {Basu}, A., {Mitra}, P.: {An Agreement Measure for
+ Determining Inter-Annotator Reliability of Human Judgements on Affective
+ Tex}. In: {Proceedings of the workshop on Human Judgements in Computational
+ Linguistics}. pp. 58--65. COLING 2008, Manchester, United Kingdom (2008)
+
+ \bibitem{Jointstockcompany}
+ company, J.S.: {The habit of using social networks of Vietnamese people 2018}.
+ brands vietnam, Ho Chi Minh City, Vietnam (2018)
+
+ \bibitem{Ekman1993}
+ {Ekman}, P.: In: {Facial expression and emotion}. vol.~48, pp. 384--392.
+ {American Psychologist} (1993)
+
+ \bibitem{Ekman}
+ {Ekman}, P.: In: {Emotions revealed: Recognizing faces and feelings to improve
+ communication and emotional life}. p.~2007. {Macmillan} (2012)
+
+ \bibitem{PaulEkman}
+ {Ekman}, P., {Ekman}, E., {Lama}, D.: In: {The Ekmans' Atlas of Emotion} (2018)
+
+ \bibitem{Kim}
+ {Kim}, Y.: {Convolutional Neural Networks for Sentence Classifications}. In:
+ {Proceedings of the 2014 Conference on Empirical Methods in Natural Language
+ Processing (EMNLP)}. pp. 1746--1751. { Association for Computational
+ Linguistics}, Doha, Qatar (2014)
+
+ \bibitem{Kiritchenko}
+ {Kiritchenko}, S., {Mohammad}, S.: {Using Hashtags to Capture Fine Emotion
+ Categories from Tweets}. In: {Computational Intelligence}. pp. 301--326
+ (2015)
+
+ \bibitem{RomanKlinger}
+ {Klinger}, R., {Clerc}, O.D., {Mohammad}, S.M., {Balahur}, A.: {IEST:WASSA-2018
+ Implicit Emotions Shared Task}. pp. 31--42. 2017 AFNLP, Brussels, Belgium
+ (2018)
+
+ \bibitem{BernhardKratzwald}
+ {Kratzwald}, B., {Ilic}, S., {Kraus}, M., S.~{Feuerriegel}, H.P.: {Decision
+ support with text-based emotion recognition: Deep learning for affective
+ computing}. pp. 24 -- 35. {Decision Support Systems} (2018)
+
+ \bibitem{SaifMohammad2017}
+ {Mohammad}, S., {Bravo-Marquez}, F.: {Emotion Intensities in Tweets}. In:
+ {Proceedings of the Sixth Joint Conference on Lexical and Computational
+ Semantics (*SEM)}. pp. 65--77. Association for Computational Linguistics,
+ Vancouver, Canada (2017)
+
+ \bibitem{Mohammad}
+ {Mohammad}, S.M.: {\#Emotional Tweets}. In: {First Joint Conference on Lexical
+ and Computational Semantics (*SEM)}. pp. 246--255. {Association for
+ Computational Linguistics}, Montreal, Canada (2012)
+
+ \bibitem{Mohammad2018}
+ {Mohammad}, S.M., {Bravo-Marquez}, F., {Salameh}, M., {Kiritchenko}, S.:
+ {SemEval-2018 task 1: Affect in tweets}. pp. 1--17. Proceedings of
+ International Workshop on Semantic Evaluation, New Orleans, Louisiana (2018)
+
+ \bibitem{SaifMohammad}
+ {Mohammad}, S.M., {Xiaodan}, Z., {Kiritchenko}, S., {Martin}, J.: {Sentiment,
+ emotion, purpose, and style in electoral tweets}. pp. 480--499. Information
+ Processing and Management: an International Journal (2015)
+
+ \bibitem{Nguyen}
+ Nguyen: {Vietnam has the 7th largest number of Facebook users in the world}.
+ Dan Tri newspaper (2018)
+
+ \bibitem{VLSPX}
+ {Nguyen}, H.T.M., {Nguyen}, H.V., {Ngo}, Q.T., {Vu}, L.X., {Tran}, V.M., {Ngo},
+ B.X., {Le}, C.A.: {VLSP Shared Task: Sentiment Analysis}. In: {Journal of
+ Computer Science and Cybernetics}. pp. 295--310 (2018)
+
+
+ \bibitem{KietVanNguyen}
+ {Nguyen}, K.V., {Nguyen}, V.D., {Nguyen}, P., {Truong}, T., {Nguyen}, N.L.T.:
+ {UIT-VSFC: Vietnamese Students’ Feedback Corpus for Sentiment Analysis}.
+ In: {2018 10th International Conference on Knowledge and Systems Engineering
+ (KSE)}. pp. 19--24. {IEEE}, Ho Chi Minh City, Vietnam (2018)
+
+ \bibitem{PhuNguyen}
+ {Nguyen}, P.X.V., {Truong}, T.T.H., {Nguyen}, K.V., {Nguyen}, N.L.T.: {Deep
+ Learning versus Traditional Classifiers on Vietnamese Students' Feedback
+ Corpus}. In: {2018 5th NAFOSTED Conference on Information and Computer
+ Science (NICS)}. pp. 75--80. Ho Chi Minh City, Vietnam (2018)
+
+ \bibitem{VuDucNguyen}
+ {Nguyen}, V.D., {Nguyen}, K.V., {Nguyen}, N.L.T.: {Variants of Long Short-Term
+ Memory for Sentiment Analysis on Vietnamese Students’ Feedback Corpus}. In:
+ {2018 10th International Conference on Knowledge and Systems Engineering
+ (KSE)}. pp. 306--311. IEEE, Ho Chi Minh City, Vietnam (2018)
+
+ \bibitem{AurelienGeron}
+ Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel,
+ O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J.,
+ Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.:
+ Scikit-learn: Machine learning in python. Journal of Machine Learning
+ Research \textbf{12}, 2825--2830 (2011)
+
+ \bibitem{CarloStrapparava}
+ {Strapparava}, C., {Mihalcea}, R.: {SemEval-2007 Task 14: Affective Text}. In:
+ {Proceedings of the 4th International Workshop on Semantic Evaluations
+ (SemEval-2007)}. pp. 70--74. { Association for Computational Linguistics},
+ Prague (2007)
+
+ \bibitem{TingweiWang}
+ {T. {Wang} and X. {Yang} and C. {Ouyang}}: {A Multi-emotion Classification
+ Method Based on BLSTM-MC in Code-Switching Text: 7th CCF International
+ Conference, NLPCC 2018, Hohhot, China, August 26–30, 2018, Proceedings,
+ Part II.} pp. 190--199. Natural Language Processing and Chinese Computing
+ (2018)
+
+ \bibitem{ZhongqingWang}
+ {Wang}, Z., {Li}, S.: {Overview of NLPCC 2018 Shared Task 1: Emotion Detection
+ in Code-Switching Text: 7th CCF International Conference, NLPCC 2018, Hohhot,
+ China, August 26–30, 2018, Proceedings, Part II}. pp. 429--433. Natural
+ Language Processing and Chinese Computing (2018)
+
+ \bibitem{Facial2007}
+ {Zhang}, S., {Wu}, Z., {Meng}, H., {Cai}, L.: Facial expression synthesis using
+ pad emotional parameters for a chinese expressive avatar. vol.~4738, pp.
+ 24--35 (09 2007)
+
+ \bibitem{YingjieZhang}
+ {Zhang}, Y., {Wallace}, B.C.: {A Sensitivity Analysis of (and Practitioners’
+ Guide to Convolutional}. pp. 253--263. 2017 AFNLP, Taipei, Taiwan (2017)
+
+ \bibitem
+ {Nguyen}, K.V., {Nguyen}, N.L.T., 2016, October. {Vietnamese transition-based dependency parsing with supertag features}. In 2016 Eighth International Conference on Knowledge and Systems Engineering (KSE) (pp. 175-180). IEEE.
+
+ \bibitem{nguyen2014treebank}
+ Nguyen, D.Q., Pham, S.B., Nguyen, P.T., Le Nguyen, M., et al.: From treebank conversion to automatic dependency parsing for vietnamese. In: International Conference on Applications of Natural Language to Data Bases/Information Systems. pp. 196–207. Springer (2014)
+
+ \bibitem{nguyen2016vietnamese}
+ Nguyen, K.V., Nguyen, N.L.T.: Vietnamese transition-based dependency parsing with supertag features. In: 2016 Eighth International Conference on Knowledge and Systems Engineering (KSE). pp. 175–180. IEEE (2016)
+
+
+ \bibitem{bach2018empirical}
+ Bach, N.X., Linh, N.D., Phuong, T.M.: An empirical study on pos tagging for vietnamese social media text. Computer Speech \& Language 50, 1–15 (2018)
+
+ \bibitem{nguyen2017word}
+ Nguyen, D.Q., Vu, T., Nguyen, D.Q., Dras, M., Johnson, M.: From word segmentation to pos tagging for vietnamese. arXiv preprint arXiv:1711.04951 (2017)
+
+ \bibitem{thao2007named}
+ Thao, P.T.X., Tri, T.Q., Dien, D., Collier, N.: Named entity recognition in vietnamese using classifier voting. ACM Transactions on Asian Language Information Processing (TALIP) 6(4), 1–18 (2007)
+
+ \bibitem{nguyen2016approach}
+ Nguyen, L.H., Dinh, D., Tran, P.: An approach to construct a named entity annotated english-vietnamese bilingual corpus. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) 16(2), 1–17 (2016)
+
+
+ \bibitem{Nguyen_2009}
+ Nguyen, D.Q., Nguyen, D.Q., Pham, S.B.: A vietnamese question answering system. 2009 International Conference on Knowledge and Systems Engineering (Oct 2009). https://doi.org/10.1109/kse.2009.42, http://dx.doi.org/10.1109/KSE.2009.42
+
+ \bibitem{le2018factoid}
+ Le-Hong, P., Bui, D.T.: A factoid question answering system for vietnamese. In: Companion Proceedings of The Web Conference 2018. pp. 1049–1055 (2018)
+
+
+ \end{thebibliography}
+
+
+ \end{document}
references/2019.arxiv.ho/source/bibliography.bib ADDED
@@ -0,0 +1,289 @@
1
+ @InProceedings{Kiritchenko,
2
+ title = "{Using Hashtags to Capture Fine Emotion Categories from Tweets}",
3
+ author = {S. {Kiritchenko} and S. {Mohammad}},
4
+ booktitle = "{Computational Intelligence}",
5
+ year = {2015},
6
+ pages = {301-326},
7
+ }
8
+
9
+ @InProceedings{BernhardKratzwald,
10
+ title = "{Decision support with text-based emotion recognition: Deep learning for affective computing}",
11
+ author = { B. {Kratzwald} and S. {Ilic} and M. {Kraus} and S. {Feuerriegel}, H. {Prendinger}},
12
+ year = {2018},
13
+ publisher = "{Decision Support Systems}",
14
+ pages = {24 - 35},
15
+ }
16
+
17
+ @InProceedings{ApurbaPaul,
18
+ title = "{Identification and Classification of Emotional Key Phrases from Psychological texts.}",
19
+ author = "{A. {Paul} and D. {Das}}",
20
+ booktitle = "{Proceedings of the ACL 2015 Workshop on Novel Computational Approaches to Keyphrase Extraction}",
21
+ year = {2015},
22
+ publisher = "{Association for Computational Linguistics}",
23
+ address = {Beijing, China},
24
+ pages = {32 - 38},
25
+ }
26
+
27
+ @InProceedings{CarloStrapparava,
28
+ title = "{SemEval-2007 Task 14: Affective Text}",
29
+ author = {C. {Strapparava} and R. {Mihalcea}},
30
+ booktitle = "{Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007)}",
31
+ year = {2007},
32
+ publisher = "{ Association for Computational Linguistics}",
33
+ address = {Prague},
34
+ pages = {70-74},
35
+ }
36
+ @InProceedings{tran2009,
37
+ author = {O. T. {Tran} and C. A. {Le} and T. Q. {Ha} and Q. H. {Le}},
38
+ booktitle = "{2009 International Conference on Asian Language Processing}",
39
+ title = "{An Experimental Study on Vietnamese POS Tagging}",
40
+ year = {2009},
41
+ pages = {23-27}
42
+ }
43
+
44
+ @InProceedings{Facial2007,
45
+ author = {S. {Zhang} and Z. {Wu} and H. {Meng} and L. {Cai}},
46
+ year = {2007},
47
+ month = {09},
48
+ pages = {24-35},
49
+ title = {Facial Expression Synthesis Using PAD Emotional Parameters for a Chinese Expressive Avatar},
50
+ volume = {4738}
51
+ }
52
+
53
+ @InProceedings{vncorenlp,
54
+ title = "{{V}n{C}ore{NLP}: A {V}ietnamese Natural Language Processing Toolkit}",
55
+ author = {T. {Vu} and Q. D. {Nguyen} and Q. D. {Nguyen} and M. {Dras} and M. {Johnson}},
56
+ booktitle = "{Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Demonstrations}",
57
+ year = {2018},
58
+ address = "{New Orleans, Louisiana}",
59
+ publisher = "{Association for Computational Linguistics}",
60
+ pages = {56-60}
61
+ }
62
+
63
+ @InProceedings{Ekman,
64
+ booktitle = "{Emotions revealed: Recognizing faces and feelings to improve communication and emotional life}",
65
+ author = {P. {Ekman}},
66
+ year = {2012},
67
+ publisher = "{Macmillan}",
68
+ pages = {2007},
69
+ }
70
+
71
+ @InProceedings{Ekman1993,
72
+ booktitle = "{Facial expression and emotion}",
73
+ author = {P. {Ekman}},
74
+ year = {1993},
75
+ publisher = "{American Psychologist}",
76
+ pages = {384-392},
77
+ volume = {48}
78
+ }
79
+
80
+ @InProceedings{VLSPX,
81
+ title = "{VLSP Shared Task: Sentiment Analysis}",
82
+ author = {H. T. M. {Nguyen} and H. V. {Nguyen} and Q. T. {Ngo} and L. X. {Vu} and V. M. {Tran} and B. X. {Ngo} and C. A. {Le}},
83
+ booktitle = "{Journal of Computer Science and Cybernetics}",
84
+ year = {2018},
85
+ pages = {295-310},
86
+ }
87
+
88
+ @InProceedings{KietVanNguyen,
89
+ title = "{UIT-VSFC: Vietnamese Students’ Feedback Corpus for Sentiment Analysis}",
90
+ author = {K. V. {Nguyen} and V. D. {Nguyen} and P. {Nguyen} and T. {Truong} and N. L. T. {Nguyen}},
91
+ booktitle = "{2018 10th International Conference on Knowledge and Systems Engineering (KSE)}",
92
+ year = {2018},
93
+ publisher = "{IEEE}",
94
+ pages = {19-24},
95
+ address = {Ho Chi Minh City, Vietnam},
96
+ }
97
+
98
+ @InProceedings{PhuNguyen,
99
+ title = "{Deep Learning versus Traditional Classifiers on Vietnamese Students' Feedback Corpus}",
100
+ author = { P. X. V. {Nguyen} and T. T. H. {Truong} and K. V. {Nguyen} and N. L. T. {Nguyen}},
101
+ booktitle = "{2018 5th NAFOSTED Conference on Information and Computer Science (NICS)}",
102
+ year = {2018},
103
+ pages = {75-80},
104
+ address = {Ho Chi Minh City, Vietnam},
105
+ }
106
+
107
+ @InProceedings{Kim,
108
+ title = "{Convolutional Neural Networks for Sentence Classifications}",
109
+ author = {Y. {Kim}},
110
+ booktitle = "{Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)}",
111
+ year = {2014},
112
+ publisher = "{ Association for Computational Linguistics}",
113
+ pages = {1746-1751},
114
+ address = {Doha, Qatar}
115
+ }
116
+
117
+ @InProceedings{Mohammad,
+ title = "{\#Emotional Tweets}",
+ author = {S. M. {Mohammad}},
+ booktitle = "{First Joint Conference on Lexical and Computational Semantics (*SEM)}",
+ year = {2012},
+ publisher = "{Association for Computational Linguistics}",
+ pages = {246-255},
+ address = {Montreal, Canada}
+ }
+
+ @InProceedings{PaulEkman,
+ booktitle = "{The Ekmans' Atlas of Emotion}",
+ author = {P. {Ekman} and E. {Ekman} and D. {Lama}},
+ year = {2018}
+ }
+
+ @InProceedings{PlabanKumarBhowmick,
+ title = "{An Agreement Measure for Determining Inter-Annotator Reliability of Human Judgements on Affective Text}",
+ author = {P. K. {Bhowmick} and A. {Basu} and P. {Mitra}},
+ booktitle = "{Proceedings of the Workshop on Human Judgements in Computational Linguistics}",
+ publisher = {COLING 2008},
+ year = {2008},
+ address = {Manchester, United Kingdom},
+ pages = {58-65},
+ }
+
+ @InProceedings{Nguyen,
+ title = "{Vietnam has the 7th largest number of Facebook users in the world}",
+ author = {Nguyen},
+ publisher = {Dan Tri newspaper},
+ year = {2018}
+ }
+
+ @Article{SaifMohammad,
+ title = "{Sentiment, emotion, purpose, and style in electoral tweets}",
+ author = {S. M. {Mohammad} and X. {Zhu} and S. {Kiritchenko} and J. {Martin}},
+ journal = {Information Processing and Management: an International Journal},
+ year = {2015},
+ pages = {480-499},
+ }
+
+ @InProceedings{Mohammad2018,
+ title = "{SemEval-2018 Task 1: Affect in Tweets}",
+ author = {S. M. {Mohammad} and F. {Bravo-Marquez} and M. {Salameh} and S. {Kiritchenko}},
+ booktitle = "{Proceedings of the 12th International Workshop on Semantic Evaluation}",
+ year = {2018},
+ pages = {1-17},
+ address = {New Orleans, Louisiana},
+ }
+
+ @InProceedings{SaifMohammad2017,
+ title = "{Emotion Intensities in Tweets}",
+ author = {S. {Mohammad} and F. {Bravo-Marquez}},
+ booktitle = "{Proceedings of the Sixth Joint Conference on Lexical and Computational Semantics (*SEM)}",
+ publisher = {Association for Computational Linguistics},
+ year = {2017},
+ pages = {65-77},
+ address = {Vancouver, Canada},
+ }
+
+ @InProceedings{smd1,
+ title = "{Social media data}",
+ author = {Science and Information Technology},
+ publisher = {Science and Information Technology - University of Information Technology},
+ year = {2016}
+ }
+
+ @InProceedings{TingweiWang,
+ title = "{A Multi-emotion Classification Method Based on BLSTM-MC in Code-Switching Text}",
+ author = {T. {Wang} and X. {Yang} and C. {Ouyang}},
+ booktitle = "{Natural Language Processing and Chinese Computing: 7th CCF International Conference, NLPCC 2018, Hohhot, China, August 26–30, 2018, Proceedings, Part II}",
+ year = {2018},
+ pages = {190-199},
+ }
+
+ @InProceedings{VenkateshDuppada,
+ title = "{SeerNet at SemEval-2018 Task 1: Domain Adaptation for Affect in Tweets}",
+ author = {V. {Duppada} and R. {Jain} and S. {Hiray}},
+ booktitle = "{Proceedings of the 12th International Workshop on Semantic Evaluation (SemEval-2018)}",
+ publisher = {Association for Computational Linguistics},
+ address = {New Orleans, Louisiana},
+ year = {2018},
+ pages = {18-23},
+ }
+ @Article{VoNgocPhu,
+ title = "{A Vietnamese adjective emotion dictionary based on exploitation of Vietnamese language characteristics}",
+ author = {P. N. {Vo} and C. T. {Vo} and T. T. {Vo} and D. D. {Nguyen}},
+ journal = {Artificial Intelligence Review},
+ year = {2018},
+ pages = {93-159},
+ }
+
+ @InProceedings{VuDucNguyen,
+ title = "{Variants of Long Short-Term Memory for Sentiment Analysis on Vietnamese Students’ Feedback Corpus}",
+ author = {V. D. {Nguyen} and K. V. {Nguyen} and N. L. T. {Nguyen}},
+ booktitle = "{2018 10th International Conference on Knowledge and Systems Engineering (KSE)}",
+ publisher = {IEEE},
+ address = {Ho Chi Minh City, Vietnam},
+ year = {2018},
+ pages = {306-311},
+ }
+ @InProceedings{Jointstockcompany,
+ title = "{The habit of using social networks of Vietnamese people 2018}",
+ author = {Joint Stock Company},
+ publisher = {Brands Vietnam},
+ address = {Ho Chi Minh City, Vietnam},
+ year = {2018}
+ }
+ @InProceedings{Yam,
+ title = "{Emotion Detection and Recognition from Text Using Deep Learning}",
+ author = {C. Y. {Yam}},
+ year = {2018},
+ publisher = {Developer blog}
+ }
+ @InProceedings{YingjieZhang,
+ title = "{A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification}",
+ author = {Y. {Zhang} and B. C. {Wallace}},
+ booktitle = "{Proceedings of the 8th International Joint Conference on Natural Language Processing}",
+ publisher = {Asian Federation of Natural Language Processing},
+ address = {Taipei, Taiwan},
+ year = {2017},
+ pages = {253-263},
+ }
+ @InProceedings{ZhongqingWang,
+ title = "{Overview of NLPCC 2018 Shared Task 1: Emotion Detection in Code-Switching Text}",
+ author = {Z. {Wang} and S. {Li}},
+ booktitle = "{Natural Language Processing and Chinese Computing: 7th CCF International Conference, NLPCC 2018, Hohhot, China, August 26–30, 2018, Proceedings, Part II}",
+ year = {2018},
+ pages = {429-433},
+ }
+ @InProceedings{RomanKlinger,
+ title = "{IEST: WASSA-2018 Implicit Emotions Shared Task}",
+ author = {R. {Klinger} and O. {De Clercq} and S. M. {Mohammad} and A. {Balahur}},
+ booktitle = "{Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis}",
+ publisher = {Association for Computational Linguistics},
+ address = {Brussels, Belgium},
+ year = {2018},
+ pages = {31-42},
+ }
+ @Article{joulin2016fasttext,
+ title = "{FastText.zip: Compressing text classification models}",
+ author = {Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, Herv{\'e} and Mikolov, Tomas},
+ journal = "{arXiv preprint arXiv:1612.03651}",
+ year = {2016},
+ }
+
+ @Article{AurelienGeron,
+ author = {Pedregosa, Fabian and Varoquaux, Ga\"{e}l and Gramfort, Alexandre and Michel, Vincent and Thirion, Bertrand and Grisel, Olivier and Blondel, Mathieu and Prettenhofer, Peter and Weiss, Ron and Dubourg, Vincent and Vanderplas, Jake and Passos, Alexandre and Cournapeau, David and Brucher, Matthieu and Perrot, Matthieu and Duchesnay, \'{E}douard},
+ title = {Scikit-learn: Machine Learning in Python},
+ journal = {Journal of Machine Learning Research},
+ volume = {12},
+ year = {2011},
+ pages = {2825-2830}
+ }
+
+ @InProceedings{KietVanNguyen1,
+ title = "{UIT-VSFC: Vietnamese Students’ Feedback Corpus for Sentiment Analysis}",
+ author = {K. V. {Nguyen} and V. D. {Nguyen} and P. {Nguyen} and T. {Truong} and N. L. T. {Nguyen}},
+ booktitle = "{2018 10th International Conference on Knowledge and Systems Engineering (KSE)}",
+ year = {2018},
+ pages = {19-24},
+ address = {Ho Chi Minh City, Vietnam},
+ }
+
+ @InProceedings{nguyen2016,
+ title = {Vietnamese transition-based dependency parsing with supertag features},
+ author = {Nguyen, Kiet V and Nguyen, Ngan Luu-Thuy},
+ booktitle = {2016 Eighth International Conference on Knowledge and Systems Engineering (KSE)},
+ pages = {175--180},
+ year = {2016},
+ organization = {IEEE}
+ }
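Several entries in the bibliography above originally carried misspelled field names (e.g. `booktiltle` for `booktitle`), which BibTeX silently ignores, dropping the venue from the rendered reference. A quick sanity check along these lines can catch such slips before compiling; this is a minimal sketch, and the field whitelist is illustrative rather than the full BibTeX field vocabulary.

```python
import re

# Illustrative whitelist of common BibTeX field names (not exhaustive).
KNOWN_FIELDS = {
    "title", "author", "booktitle", "journal", "publisher", "year",
    "pages", "volume", "number", "address", "organization", "note",
}

def check_fields(bibtex: str) -> list:
    """Return (line_number, field_name) pairs for unrecognized field names."""
    suspects = []
    for lineno, line in enumerate(bibtex.splitlines(), start=1):
        # A field assignment looks like:  fieldname = {...} or "..."
        m = re.match(r"\s*([A-Za-z]+)\s*=", line)
        if m and m.group(1).lower() not in KNOWN_FIELDS:
            suspects.append((lineno, m.group(1)))
    return suspects

entry = """@InProceedings{Yam,
  title = "{Emotion Detection and Recognition from Text Using Deep Learning}",
  booktiltle = "{Developer blog}",
  year = {2018},
}"""
print(check_fields(entry))  # [(3, 'booktiltle')]
```

Because BibTeX drops unknown fields without warning, a check like this surfaces typos that would otherwise only show up as a subtly incomplete reference list.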
references/2019.arxiv.ho/source/images/DataProcessing.pdf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0c2e2acb142678c169cb7523e8b6e151b726648fa3d048c149a8594e1791189d
+ size 10809
references/2019.arxiv.ho/source/images/cnnmodel.pdf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e96bee03b1dec742297dbdf7c3a9f1fb99386c3b13372549c9e4dba07a59e571
+ size 74015
references/2019.arxiv.ho/source/images/con_matrix.png ADDED
  • SHA256: a153882bb69258af0fbe136853563cec63d231b2b3b4a36a4f4d7f643db4b2b0
  • Pointer size: 131 Bytes
  • Size of remote file: 154 kB
references/2019.arxiv.ho/source/images/confusion_matrix.pdf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:523a704f183b8ebe06afda81dee17f7570f4782fd9b0a1ebcdd0f0710923fcd7
+ size 21538
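The binary files added above are stored as Git LFS pointers: three-line text files with `version`, `oid`, and `size` fields, per the LFS pointer spec referenced on the `version` line. As a minimal sketch (the function name `parse_lfs_pointer` is my own, not part of any library), such a pointer can be split into its fields like this:

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its version, oid, and size fields."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line is "<key> <value>", e.g. "size 10809".
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid value is prefixed with the hash algorithm, e.g. "sha256:<digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "oid_algo": algo,
        "oid": digest,
        "size": int(fields["size"]),
    }

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:0c2e2acb142678c169cb7523e8b6e151b726648fa3d048c149a8594e1791189d
size 10809
"""
info = parse_lfs_pointer(pointer)
print(info["oid_algo"], info["size"])  # sha256 10809
```

The `size` field is the byte count of the actual object, which is why the diffs above show only three-line additions regardless of how large the PDFs themselves are.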