Bamboo-1: Vietnamese Dependency Parser - Technical Report
Version: 1.0.0
Release Date: 2026-02-02
Model ID: undertheseanlp/bamboo-1
1. Introduction
Bamboo-1 is a Vietnamese dependency parser that implements the Trankit architecture (Nguyen et al., 2021), combining XLM-RoBERTa with a biaffine attention mechanism. The model is trained on the UDD-1 dataset and achieves state-of-the-art results on UD_Vietnamese-VTB (+8.5 UAS points over Trankit) and competitive performance on VnDT v1.1.
1.1 Key Features
- Multilingual Encoder: XLM-RoBERTa-base for robust Vietnamese text representation
- Biaffine Attention: Efficient arc and relation prediction (Dozat & Manning, 2017)
- Standalone Inference: No package installation required - download and run
- HuggingFace Integration: Easy model distribution and version control
2. Model Architecture
2.1 Overview
Input: Vietnamese sentence (whitespace tokenized)
                    ↓
┌───────────────────────────────────────────┐
│ XLM-RoBERTa Encoder                       │
│ (xlm-roberta-base, 768-dim)               │
└───────────────────────────────────────────┘
                    ↓
     Word-level pooling (first subword)
                    ↓
┌───────────────────────────────────────────┐
│ MLP Projections                           │
│ Arc-dep, Arc-head (500-dim)               │
│ Rel-dep, Rel-head (100-dim)               │
└───────────────────────────────────────────┘
                    ↓
┌───────────────────────────────────────────┐
│ Biaffine Attention                        │
│ Arc scores: (seq_len × seq_len)           │
│ Rel scores: (seq_len × seq_len × n_rels)  │
└───────────────────────────────────────────┘
                    ↓
Output: Dependency tree (heads + relations)
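The pooling step keeps only the first subword of each input word. Below is a minimal sketch of first-subword pooling, assuming a HuggingFace fast tokenizer; variable names are illustrative, not taken from the actual src code.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")

words = ["Tôi", "yêu", "Việt", "Nam"]  # whitespace-tokenized input
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**enc).last_hidden_state  # (1, n_subwords, 768)

# Index of the first subword of each word (special tokens map to None).
first_idx, seen = [], set()
for i, wid in enumerate(enc.word_ids()):
    if wid is not None and wid not in seen:
        seen.add(wid)
        first_idx.append(i)
word_repr = hidden[0, first_idx]  # (n_words, 768), one vector per word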
2.2 Component Details
| Component | Configuration |
|---|---|
| Encoder | xlm-roberta-base (12 layers, 768 hidden, 12 heads) |
| Arc MLP | 768 → 500 (LeakyReLU, dropout=0.33) |
| Rel MLP | 768 → 100 (LeakyReLU, dropout=0.33) |
| Biaffine (Arc) | 500-dim, bias_x=True, bias_y=False |
| Biaffine (Rel) | 100-dim, bias_x=True, bias_y=True |
| Output | 31-79 dependency relations (dataset-dependent; see Section 4.4) |
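To make the wiring concrete, here is a minimal PyTorch sketch of the scoring head with the dimensions from the table above. It follows the deep biaffine formulation of Dozat & Manning (2017); class and attribute names are illustrative, not taken from the actual src code.

import torch
import torch.nn as nn

class Biaffine(nn.Module):
    """Deep biaffine scorer: s(x, y) = x^T U y, with optional bias columns."""
    def __init__(self, n_in, n_out=1, bias_x=True, bias_y=True):
        super().__init__()
        self.bias_x, self.bias_y = bias_x, bias_y
        self.weight = nn.Parameter(torch.zeros(n_out, n_in + bias_x, n_in + bias_y))

    def forward(self, x, y):
        # x, y: (batch, seq_len, n_in)
        if self.bias_x:
            x = torch.cat([x, torch.ones_like(x[..., :1])], dim=-1)
        if self.bias_y:
            y = torch.cat([y, torch.ones_like(y[..., :1])], dim=-1)
        # -> (batch, n_out, seq_len, seq_len)
        return torch.einsum('bxi,oij,byj->boxy', x, self.weight, y)

class BiaffineHead(nn.Module):
    def __init__(self, hidden=768, n_arc=500, n_rel=100, n_rels=37, p=0.33):
        super().__init__()
        mlp = lambda d: nn.Sequential(nn.Linear(hidden, d), nn.LeakyReLU(0.1), nn.Dropout(p))
        self.arc_dep, self.arc_head = mlp(n_arc), mlp(n_arc)
        self.rel_dep, self.rel_head = mlp(n_rel), mlp(n_rel)
        self.arc_attn = Biaffine(n_arc, 1, bias_x=True, bias_y=False)
        self.rel_attn = Biaffine(n_rel, n_rels, bias_x=True, bias_y=True)

    def forward(self, h):
        # h: word-level encoder states, (batch, seq_len, hidden)
        s_arc = self.arc_attn(self.arc_dep(h), self.arc_head(h)).squeeze(1)
        s_rel = self.rel_attn(self.rel_dep(h), self.rel_head(h)).permute(0, 2, 3, 1)
        return s_arc, s_rel  # (B, L, L) arc scores, (B, L, L, n_rels) relation scores

At inference time, each token's head can be read off as the row-wise argmax of s_arc (or decoded with an MST algorithm to guarantee a well-formed tree), and its relation as the argmax of s_rel at the chosen head position.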
2.3 Parameter Count
| Component | Parameters |
|---|---|
| XLM-RoBERTa Encoder | ~278M |
| Biaffine Head | ~2.5M |
| Total | ~280M |
3. Training
3.1 Dataset
UDD-1 (Universal Dependency Dataset for Vietnamese)
| Split | Sentences | Tokens |
|---|---|---|
| Train | 18,282 | ~400K |
| Dev | 859 | ~19K |
| Test | 859 | ~19K |
Source: huggingface.co/datasets/undertheseanlp/UDD-1
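A quick way to inspect the data, assuming the dataset loads through the standard datasets API (the actual split names and columns on the Hub may differ):

from datasets import load_dataset

ds = load_dataset("undertheseanlp/UDD-1")
print(ds)  # expect train/dev/test splits matching the table above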
3.2 Hyperparameters
| Parameter | Value |
|---|---|
| Batch size | 32 |
| Encoder learning rate | 1e-5 |
| Head learning rate | 1e-4 |
| Optimizer | AdamW (weight_decay=0.01) |
| Warmup steps | 500 |
| LR scheduler | Linear decay |
| Max epochs | 100 |
| Early stopping patience | 10 |
| Dropout | 0.33 |
| Gradient clipping | 5.0 |
| Mixed precision | FP16 |
| Random seed | 42 |
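A sketch of the optimizer and scheduler setup implied by this table, assuming the model exposes the encoder and biaffine head as model.encoder and model.head (illustrative attribute names):

import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, total_steps, warmup_steps=500):
    # Two parameter groups: a smaller LR for the pretrained encoder,
    # a larger one for the randomly initialized biaffine head.
    groups = [
        {"params": model.encoder.parameters(), "lr": 1e-5},
        {"params": model.head.parameters(), "lr": 1e-4},
    ]
    optimizer = torch.optim.AdamW(groups, weight_decay=0.01)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps)
    return optimizer, scheduler

# Inside the training loop, gradients are clipped per step:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)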
3.3 Training Command
uv run src/train.py \
    --method trankit \
    --encoder xlm-roberta-base \
    --dataset udd1 \
    --batch-size 32 \
    --bert-lr 1e-5 \
    --head-lr 1e-4 \
    --warmup-steps 500 \
    --epochs 100 \
    --patience 10 \
    --fp16 \
    --wandb
3.4 Training Infrastructure
| Resource | XLM-R (UDD-1) | PhoBERT (VnDT) |
|---|---|---|
| GPU | NVIDIA RTX A4000 (16GB) | NVIDIA RTX 4090 (24GB) |
| Training time | ~4 hours | ~18 minutes |
| Epochs | 100 | 23 (early stop) |
| Cost | ~$0.68 | $0.13 |
| Framework | PyTorch 2.0+ | PyTorch 2.0+ |
4. Evaluation
4.1 Metrics
- UAS (Unlabeled Attachment Score): Percentage of tokens with correct head
- LAS (Labeled Attachment Score): Percentage of tokens with correct head AND relation label
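Both metrics reduce to token-level counts over (head, label) pairs. A minimal reference implementation (punctuation handling omitted):

def attachment_scores(gold, pred):
    """gold, pred: aligned lists of (head_index, deprel) per token."""
    assert len(gold) == len(pred) and gold
    uas = sum(gh == ph for (gh, _), (ph, _) in zip(gold, pred))
    las = sum(gh == ph and gr == pr for (gh, gr), (ph, pr) in zip(gold, pred))
    return 100.0 * uas / len(gold), 100.0 * las / len(gold)

# Example: the last token has the correct head but the wrong label.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod")]
print(attachment_scores(gold, pred))  # (100.0, 66.66...)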
4.2 Results on UDD-1 Test Set
| Model | UAS | LAS |
|---|---|---|
| Bamboo-1 (XLM-RoBERTa-base) | 55.42 | 41.19 |
Note: The lower performance on UDD-1 compared to UD_Vietnamese-VTB and VnDT may be attributed to:
- Different annotation guidelines and a larger relation set (79 relations vs. 37 in VTB), which makes labeled attachment harder
- Domain differences in the training data
- The intrinsic difficulty of the UDD-1 test split: the model was trained on the UDD-1 training set and evaluated on its held-out test set, so the gap cannot be explained by cross-dataset transfer
4.3 Comparison with Prior Work
Vietnamese Dependency Parsing Benchmarks
| Model | Dataset | UAS | LAS | Reference |
|---|---|---|---|---|
| Bamboo-1 (This Work) | UD_Vietnamese-VTB | 79.57 | 66.74 | - |
| Trankit (XLM-R large) | UD_Vietnamese-VTB | 71.07 | 65.37 | Nguyen et al., EACL 2021 |
| Trankit v0.3.1 | UD_Vietnamese-VTB | 70.96 | 64.76 | - |
| Stanza v1.1.1 | UD_Vietnamese-VTB | 53.63 | 48.16 | - |
| Bamboo-1 XLM-R (This Work) | VnDT v1.1 | 83.41 | 76.32 | - |
| Bamboo-1 PhoBERT (This Work) | VnDT v1.1 | 84.29 | 77.22 | - |
| PhoBERT-base + Biaffine | VnDT v1.1 | 85.22 | 78.77 | - |
| PhoBERT-large + Biaffine | VnDT v1.1 | 84.32 | 77.85 | - |
| Biaffine | VnDT v1.1 | 81.19 | 74.99 | - |
| VnCoreNLP | VnDT v1.0 | 79.02 | 73.39 | Vu et al., NAACL 2018 |
| PhoBERT + ELMo / Biaffine | VLSP 2020 | 84.65 | 76.27 | Doan, VLSP 2020 |
Notes:
- UD_Vietnamese-VTB: Universal Dependencies Vietnamese Treebank (~3,000 sentences)
- VnDT: Vietnamese Dependency Treebank (~10,200 sentences)
- VLSP 2020: Vietnamese Language and Speech Processing shared task dataset
- UDD-1: Our training dataset derived from Vietnamese UD annotations (~20,000 sentences)
Encoder Ablation Study (VnDT v1.1)
To investigate the performance gap on VnDT, we conducted an encoder ablation study:
| Encoder | VnDT UAS | VnDT LAS | Notes |
|---|---|---|---|
| XLM-RoBERTa-base | 83.41 | 76.32 | Baseline (Bamboo-1) |
| PhoBERT-base | 84.29 | 77.22 | +0.88 UAS, +0.90 LAS |
| PhoBERT-base (literature) | 85.22 | 78.77 | Reference |
Key Finding: PhoBERT's Vietnamese-specific pretraining provides measurable improvements on VnDT. The gap between XLM-RoBERTa and PhoBERT-base confirms that language-specific pretraining benefits Vietnamese parsing.
Trankit Architecture Comparison
Bamboo-1 follows the Trankit architecture (Nguyen et al., 2021):
| Component | Trankit (EACL 2021) | Bamboo-1 |
|---|---|---|
| Encoder | XLM-RoBERTa (base/large) | XLM-RoBERTa-base |
| Arc MLP dim | 500 | 500 |
| Rel MLP dim | 100 | 100 |
| Dropout | 0.33 | 0.33 |
| Biaffine | Deep Biaffine | Deep Biaffine |
| Training data | UD_Vietnamese-VTB | UDD-1 |
4.4 Dependency Relations
The number of relations depends on the training dataset:
- UDD-1: 79 relations (extended UD tagset)
- UD_Vietnamese-VTB: 37 relations (standard UD tagset)
- VnDT v1.1: 31 relations (VnDT-specific tagset)
Models trained with the UD tagset predict the following 37 Universal Dependencies relations:
acl, advcl, advmod, amod, appos, aux, case, cc, ccomp, clf,
compound, conj, cop, csubj, dep, det, discourse, fixed, flat,
iobj, list, mark, nmod, nsubj, nummod, obj, obl, orphan,
parataxis, punct, reparandum, root, vocative, xcomp,
det:pmod, nmod:poss, compound:redup
5. Inference
5.1 Standalone Usage (No Installation)
from huggingface_hub import hf_hub_download
import importlib.util

# Download inference script
inference_path = hf_hub_download(
    "undertheseanlp/bamboo-1",
    "src/inference.py"
)

# Load module
spec = importlib.util.spec_from_file_location("inference", inference_path)
inference = importlib.util.module_from_spec(spec)
spec.loader.exec_module(inference)

# Download model and create parser
parser = inference.download_and_load()

# Parse sentence
sent = parser.parse("Tôi yêu Việt Nam")
for token in sent:
    head = sent.get_head(token)
    print(f"{token.form} -> {head.form if head else 'ROOT'} ({token.deprel})")
Output:
Tôi -> yêu (nsubj)
yêu -> ROOT (root)
Việt -> yêu (obj)
Nam -> Việt (compound)
5.2 With src Package
from src import parse
sent = parse("Tôi yêu Việt Nam")
for token in sent:
    head = sent.get_head(token)
    print(f"{token.form} -> {head.form if head else 'ROOT'} ({token.deprel})")
5.3 Output Formats
ParsedSentence Object:
sent.text # Original text
sent.tokens # List of Token objects
sent.get_root() # Root token
sent.get_head(token) # Head of token
sent.get_dependents(token) # Dependents of token
sent.to_conllu() # CoNLL-U format
Token Object:
token.id # 1-indexed position
token.form # Word form
token.head # Head index (0 = ROOT)
token.deprel # Dependency relation
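A short example combining the two objects (parser is the object from Section 5.1; only the attributes listed above are used):

sent = parser.parse("Hà Nội là thủ đô của Việt Nam")

root = sent.get_root()
print(root.form, root.deprel)              # root token of the tree
for dep in sent.get_dependents(root):
    print(f"  {dep.form} ({dep.deprel})")  # immediate dependents of the root

with open("output.conllu", "w", encoding="utf-8") as f:
    f.write(sent.to_conllu())              # export in CoNLL-U format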
5.4 Requirements
torch>=2.0.0
transformers>=4.30.0
huggingface_hub>=0.20.0
6. Model Files
6.1 HuggingFace Repository
Repo: undertheseanlp/bamboo-1
| File | Size | Description |
|---|---|---|
| `bamboo-1.0.0-20260202-xlmr-udd1.pt` | 1.12 GB | XLM-RoBERTa model (UDD-1) |
| `bamboo-1.0.0-20260202-phobert-vndt.pt` | 0.55 GB | PhoBERT-base model (VnDT) |
| `src/inference.py` | 16 KB | Standalone inference script |
| `README.md` | 2 KB | Model card |
| `TECHNICAL_REPORT.md` | 12 KB | This technical report |
6.2 Checkpoint Contents
checkpoint = {
    'model': state_dict,    # Model weights
    'vocab': Vocabulary,    # Vocabulary object
    'config': {
        'method': 'trankit',
        'encoder': 'xlm-roberta-base',  # or 'vinai/phobert-base'
        'n_rels': 37,
    }
}
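The checkpoint can be inspected directly with torch.load. Because it pickles a Vocabulary object rather than plain tensors, PyTorch 2.6+ requires weights_only=False; in practice src/inference.py handles loading, so this is only an illustrative sketch:

import torch

ckpt = torch.load(
    "bamboo-1.0.0-20260202-xlmr-udd1.pt",
    map_location="cpu",
    weights_only=False,  # checkpoint stores a Vocabulary object, not just tensors
)
print(ckpt["config"])  # {'method': 'trankit', 'encoder': ..., 'n_rels': ...}
print(len(ckpt["model"]))  # number of entries in the state dict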
6.3 Versioning
Format: bamboo-{version}-{date}-{encoder}-{dataset}.pt
| Version | Date | Encoder | Dataset | UAS | LAS |
|---|---|---|---|---|---|
| 1.0.0 | 2026-02-02 | xlmr | udd1 | 79.57% (VTB) | 66.74% (VTB) |
| 1.0.0 | 2026-02-07 | phobert | vndt | 84.29% | 77.22% |
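Individual checkpoints can be fetched by filename with hf_hub_download, following this naming scheme:

from huggingface_hub import hf_hub_download

# e.g. the PhoBERT/VnDT checkpoint from the table above
path = hf_hub_download(
    "undertheseanlp/bamboo-1",
    "bamboo-1.0.0-20260202-phobert-vndt.pt",
)
print(path)  # local cache path to the downloaded .pt file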
7. Limitations
- Tokenization: Expects whitespace-tokenized input. For raw Vietnamese text, run a word segmenter first (e.g., underthesea.word_tokenize); see the sketch after this list.
- Sentence Length: Performance may degrade on very long sentences (>100 tokens).
- Domain: Trained on news/formal text; may perform differently on social media or informal text.
- GPU Memory: Requires ~4GB of GPU memory for inference with XLM-RoBERTa.
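For example, a raw-text preprocessing sketch using underthesea (parser is the object from Section 5.1; joining segmented words back with spaces keeps syllable-level tokens, matching the granularity in Appendix B):

from underthesea import word_tokenize

raw = "Hà Nội là thủ đô của Việt Nam."
words = word_tokenize(raw)  # e.g. ['Hà Nội', 'là', 'thủ đô', 'của', 'Việt Nam', '.']
sent = parser.parse(" ".join(words))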
8. References
Nguyen, M. V., et al. (2021). Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing. EACL 2021. Outstanding Demo Paper Award.
Dozat, T., & Manning, C. D. (2017). Deep Biaffine Attention for Neural Dependency Parsing. ICLR 2017.
Conneau, A., et al. (2020). Unsupervised Cross-lingual Representation Learning at Scale. ACL 2020. (XLM-RoBERTa)
Nguyen, D. Q., & Nguyen, A. T. (2020). PhoBERT: Pre-trained language models for Vietnamese. EMNLP 2020 Findings.
Vu, T., et al. (2018). VnCoreNLP: A Vietnamese Natural Language Processing Toolkit. NAACL 2018 Demonstrations.
Qi, P., et al. (2020). Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. ACL 2020 System Demonstrations.
9. Citation
@misc{bamboo1,
  title={Bamboo-1: Vietnamese Dependency Parser},
  author={Underthesea NLP},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/undertheseanlp/bamboo-1},
  version={1.0.0}
}
10. License
Apache License 2.0
Appendix A: Dependency Relations
| Relation | Description | Example |
|---|---|---|
| nsubj | Nominal subject | "Tôi yêu Việt Nam" |
| obj | Object | "Tôi yêu Việt Nam" |
| root | Root of sentence | "Tôi yêu Việt Nam" |
| compound | Compound | "Việt Nam" |
| nmod | Nominal modifier | "nhà của tôi" |
| amod | Adjectival modifier | "cô gái đẹp" |
| advmod | Adverbial modifier | "chạy nhanh" |
| case | Case marking | "của tôi" |
| det | Determiner | "một người" |
| punct | Punctuation | "Xin chào **!**" |
Appendix B: Example Outputs
Input 1
Hà Nội là thủ đô của Việt Nam
Output 1
# sent_id = 1
# text = Hà Nội là thủ đô của Việt Nam
1 Hà _ _ _ _ 5 nsubj _ _
2 Nội _ _ _ _ 1 flat _ _
3 là _ _ _ _ 5 cop _ _
4 thủ _ _ _ _ 5 compound _ _
5 đô _ _ _ _ 0 root _ _
6 của _ _ _ _ 8 case _ _
7 Việt _ _ _ _ 8 compound _ _
8 Nam _ _ _ _ 5 nmod _ _
Input 2
Em gái tôi học tiếng Anh
Output 2
# sent_id = 1
# text = Em gái tôi học tiếng Anh
1 Em _ _ _ _ 4 nsubj _ _
2 gái _ _ _ _ 1 nmod _ _
3 tôi _ _ _ _ 2 det:pmod _ _
4 học _ _ _ _ 0 root _ _
5 tiếng _ _ _ _ 4 obj _ _
6 Anh _ _ _ _ 5 compound _ _