
Bamboo-1: Vietnamese Dependency Parser - Technical Report

Version: 1.0.0
Release Date: 2026-02-02
Model ID: undertheseanlp/bamboo-1


1. Introduction

Bamboo-1 is a Vietnamese dependency parser built on the Trankit architecture (Nguyen et al., 2021), combining an XLM-RoBERTa encoder with a deep biaffine attention head. The model is trained on the UDD-1 dataset and achieves state-of-the-art results on UD_Vietnamese-VTB (+8.5 UAS points over Trankit) and competitive performance on VnDT v1.1.

1.1 Key Features

  • Multilingual Encoder: XLM-RoBERTa-base for robust Vietnamese text representation
  • Biaffine Attention: Efficient arc and relation prediction (Dozat & Manning, 2017)
  • Standalone Inference: No package installation required - download and run
  • HuggingFace Integration: Easy model distribution and version control

2. Model Architecture

2.1 Overview

Input: Vietnamese sentence (whitespace tokenized)
         ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚     XLM-RoBERTa Encoder             β”‚
β”‚     (xlm-roberta-base, 768-dim)     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         ↓
    Word-level pooling (first subword)
         ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚     MLP Projections                 β”‚
β”‚  Arc-dep, Arc-head (500-dim)        β”‚
β”‚  Rel-dep, Rel-head (100-dim)        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚     Biaffine Attention              β”‚
β”‚  Arc scores: (seq_len Γ— seq_len)    β”‚
β”‚  Rel scores: (seq_len Γ— seq_len Γ— n_rels) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         ↓
Output: Dependency tree (heads + relations)
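
The word-level pooling step above selects the encoder state of each word's first subword. Below is a minimal sketch of that operation, assuming a Hugging Face fast tokenizer whose word_ids() maps each subword to its word index; the function name and example are illustrative, not Bamboo-1's actual source:

import torch

def first_subword_pool(hidden, word_ids):
    # hidden:   (n_subwords, 768) encoder output for one sentence
    # word_ids: one entry per subword, None for special tokens,
    #           as returned by a fast tokenizer's word_ids()
    first, seen = [], set()
    for i, w in enumerate(word_ids):
        if w is not None and w not in seen:
            seen.add(w)
            first.append(i)
    return hidden[first]  # (n_words, 768)

hidden = torch.randn(7, 768)
word_ids = [None, 0, 1, 1, 2, 3, None]  # e.g. "<s> TΓ΄i yΓͺ+u Việt Nam </s>"
print(first_subword_pool(hidden, word_ids).shape)  # torch.Size([4, 768])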

2.2 Component Details

| Component | Configuration |
|-----------|---------------|
| Encoder | xlm-roberta-base (12 layers, 768 hidden, 12 heads) |
| Arc MLP | 768 β†’ 500 (LeakyReLU, dropout=0.33) |
| Rel MLP | 768 β†’ 100 (LeakyReLU, dropout=0.33) |
| Biaffine (Arc) | 500-dim, bias_x=True, bias_y=False |
| Biaffine (Rel) | 100-dim, bias_x=True, bias_y=True |
| Output | 31-79 dependency relations (dataset-dependent, see Section 4.4) |
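
As a concrete illustration of the biaffine configurations above, here is a minimal PyTorch sketch of the deep biaffine scorer of Dozat & Manning (2017). The dimensions mirror the table, but the code is an illustrative sketch, not Bamboo-1's actual source:

import torch
import torch.nn as nn

class Biaffine(nn.Module):
    # Deep biaffine scorer (Dozat & Manning, 2017). bias_x / bias_y
    # append a constant 1 feature, which folds the linear and bias
    # terms into a single bilinear product.
    def __init__(self, n_in, n_out=1, bias_x=True, bias_y=False):
        super().__init__()
        self.bias_x, self.bias_y = bias_x, bias_y
        self.weight = nn.Parameter(
            torch.zeros(n_out, n_in + bias_x, n_in + bias_y))

    def forward(self, x, y):
        # x: (batch, seq_len, n_in) dependent representations
        # y: (batch, seq_len, n_in) head representations
        if self.bias_x:
            x = torch.cat((x, torch.ones_like(x[..., :1])), dim=-1)
        if self.bias_y:
            y = torch.cat((y, torch.ones_like(y[..., :1])), dim=-1)
        # scores: (batch, n_out, seq_len, seq_len); squeeze is a no-op for n_out > 1
        return torch.einsum('bxi,oij,byj->boxy', x, self.weight, y).squeeze(1)

arc_attn = Biaffine(n_in=500, n_out=1, bias_x=True, bias_y=False)
rel_attn = Biaffine(n_in=100, n_out=79, bias_x=True, bias_y=True)  # e.g. 79 rels for UDD-1

arc_scores = arc_attn(torch.randn(2, 10, 500), torch.randn(2, 10, 500))
print(arc_scores.shape)  # torch.Size([2, 10, 10])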

2.3 Parameter Count

| Component | Parameters |
|-----------|------------|
| XLM-RoBERTa Encoder | ~278M |
| Biaffine Head | ~2.5M |
| Total | ~280M |
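
The encoder figure can be sanity-checked directly; a quick sketch using transformers (the exact count may vary slightly across library versions):

from transformers import AutoModel

enc = AutoModel.from_pretrained("xlm-roberta-base")
n_params = sum(p.numel() for p in enc.parameters())
print(f"{n_params / 1e6:.0f}M")  # ~278M for the encoder alone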

3. Training

3.1 Dataset

UDD-1 (Universal Dependency Dataset for Vietnamese)

| Split | Sentences | Tokens |
|-------|-----------|--------|
| Train | 18,282 | ~400K |
| Dev | 859 | ~19K |
| Test | 859 | ~19K |

Source: huggingface.co/datasets/undertheseanlp/UDD-1

3.2 Hyperparameters

| Parameter | Value |
|-----------|-------|
| Batch size | 32 |
| Encoder learning rate | 1e-5 |
| Head learning rate | 1e-4 |
| Optimizer | AdamW (weight_decay=0.01) |
| Warmup steps | 500 |
| LR scheduler | Linear decay |
| Max epochs | 100 |
| Early stopping patience | 10 |
| Dropout | 0.33 |
| Gradient clipping | 5.0 |
| Mixed precision | FP16 |
| Random seed | 42 |
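
A sketch of how these settings could be wired together with PyTorch and the transformers scheduler; build_optimizer, the 'encoder' attribute prefix, and total_steps are assumptions for illustration, not Bamboo-1's actual training code:

import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, total_steps, warmup_steps=500):
    # Two parameter groups: a small LR (1e-5) for the pretrained encoder,
    # a larger LR (1e-4) for the randomly initialised biaffine head.
    encoder_params, head_params = [], []
    for name, p in model.named_parameters():
        (encoder_params if name.startswith("encoder") else head_params).append(p)

    optimizer = torch.optim.AdamW(
        [{"params": encoder_params, "lr": 1e-5},
         {"params": head_params, "lr": 1e-4}],
        weight_decay=0.01)

    # 500 warmup steps, then linear decay to zero over training
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps)
    return optimizer, scheduler

# Per step: loss.backward();
# torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0);
# optimizer.step(); scheduler.step(); optimizer.zero_grad()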

3.3 Training Command

uv run src/train.py \
    --method trankit \
    --encoder xlm-roberta-base \
    --dataset udd1 \
    --batch-size 32 \
    --bert-lr 1e-5 \
    --head-lr 1e-4 \
    --warmup-steps 500 \
    --epochs 100 \
    --patience 10 \
    --fp16 \
    --wandb

3.4 Training Infrastructure

| Resource | XLM-R (UDD-1) | PhoBERT (VnDT) |
|----------|---------------|----------------|
| GPU | NVIDIA RTX A4000 (16GB) | NVIDIA RTX 4090 (24GB) |
| Training time | ~4 hours | ~18 minutes |
| Epochs | 100 | 23 (early stop) |
| Cost | ~$0.68 | $0.13 |
| Framework | PyTorch 2.0+ | PyTorch 2.0+ |

4. Evaluation

4.1 Metrics

  • UAS (Unlabeled Attachment Score): Percentage of tokens with correct head
  • LAS (Labeled Attachment Score): Percentage of tokens with correct head AND relation label
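
As a toy illustration of the two metrics (the function and example data are ours, not from the evaluation code):

def attachment_scores(gold, pred):
    # gold, pred: one (head, deprel) pair per token
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n
    las = sum(g == p for g, p in zip(gold, pred)) / n
    return 100 * uas, 100 * las

# "TΓ΄i yΓͺu Việt Nam": last token gets the right head but the wrong label
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "compound")]
pred = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "nmod")]
print(attachment_scores(gold, pred))  # (100.0, 75.0)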

4.2 Results on UDD-1 Test Set

| Model | UAS | LAS |
|-------|-----|-----|
| Bamboo-1 (XLM-RoBERTa-base) | 55.42 | 41.19 |

Note: The lower performance on UDD-1 compared to UD_Vietnamese-VTB and VnDT may be attributed to:

  • Different annotation guidelines and relation set (79 relations vs 37 in VTB)
  • Domain differences in the training data
  • The evaluation setup: scores are reported on the UDD-1 test split for a model trained on the UDD-1 training split

4.3 Comparison with Prior Work

Vietnamese Dependency Parsing Benchmarks

| Model | Dataset | UAS | LAS | Reference |
|-------|---------|-----|-----|-----------|
| Bamboo-1 (This Work) | UD_Vietnamese-VTB | 79.57 | 66.74 | - |
| Trankit (XLM-R large) | UD_Vietnamese-VTB | 71.07 | 65.37 | Nguyen et al., EACL 2021 |
| Trankit v0.3.1 | UD_Vietnamese-VTB | 70.96 | 64.76 | - |
| Stanza v1.1.1 | UD_Vietnamese-VTB | 53.63 | 48.16 | - |
| Bamboo-1 XLM-R (This Work) | VnDT v1.1 | 83.41 | 76.32 | - |
| Bamboo-1 PhoBERT (This Work) | VnDT v1.1 | 84.29 | 77.22 | - |
| PhoBERT-base + Biaffine | VnDT v1.1 | 85.22 | 78.77 | - |
| PhoBERT-large + Biaffine | VnDT v1.1 | 84.32 | 77.85 | - |
| Biaffine | VnDT v1.1 | 81.19 | 74.99 | - |
| VnCoreNLP | VnDT v1.0 | 79.02 | 73.39 | Vu et al., NAACL 2018 |
| PhoBERT + ELMo / Biaffine | VLSP 2020 | 84.65 | 76.27 | Doan, VLSP 2020 |

Notes:

  • UD_Vietnamese-VTB: Universal Dependencies Vietnamese Treebank (~3,000 sentences)
  • VnDT: Vietnamese Dependency Treebank (~10,200 sentences)
  • VLSP 2020: Vietnamese Language and Speech Processing shared task dataset
  • UDD-1: Our training dataset derived from Vietnamese UD annotations (~20,000 sentences)

Encoder Ablation Study (VnDT v1.1)

To investigate the performance gap on VnDT, we conducted an encoder ablation study:

| Encoder | VnDT UAS | VnDT LAS | Notes |
|---------|----------|----------|-------|
| XLM-RoBERTa-base | 83.41 | 76.32 | Baseline (Bamboo-1) |
| PhoBERT-base | 84.29 | 77.22 | +0.88 UAS, +0.90 LAS |
| PhoBERT-base (literature) | 85.22 | 78.77 | Reference |

Key Finding: PhoBERT's Vietnamese-specific pretraining yields a consistent, measurable gain over XLM-RoBERTa on VnDT (+0.88 UAS, +0.90 LAS), confirming that language-specific pretraining benefits Vietnamese parsing.

Trankit Architecture Comparison

Bamboo-1 follows the Trankit architecture (Nguyen et al., 2021):

| Component | Trankit (EACL 2021) | Bamboo-1 |
|-----------|---------------------|----------|
| Encoder | XLM-RoBERTa (base/large) | XLM-RoBERTa-base |
| Arc MLP dim | 500 | 500 |
| Rel MLP dim | 100 | 100 |
| Dropout | 0.33 | 0.33 |
| Biaffine | Deep Biaffine | Deep Biaffine |
| Training data | UD_Vietnamese-VTB | UDD-1 |

4.4 Dependency Relations

The number of relations depends on the training dataset:

  • UD_Vietnamese-VTB: 37 relations (UD tagset)
  • UDD-1: 79 relations (extended UD tagset)
  • VnDT v1.1: 31 relations (VnDT-specific tagset)

The UD-trained models predict Universal Dependencies relations such as:

acl, advcl, advmod, amod, appos, aux, case, cc, ccomp, clf,
compound, conj, cop, csubj, dep, det, discourse, fixed, flat,
iobj, list, mark, nmod, nsubj, nummod, obj, obl, orphan,
parataxis, punct, reparandum, root, vocative, xcomp,
det:pmod, nmod:poss, compound:redup

5. Inference

5.1 Standalone Usage (No Installation)

from huggingface_hub import hf_hub_download
import importlib.util

# Download inference script
inference_path = hf_hub_download(
    "undertheseanlp/bamboo-1",
    "src/inference.py"
)

# Load module
spec = importlib.util.spec_from_file_location("inference", inference_path)
inference = importlib.util.module_from_spec(spec)
spec.loader.exec_module(inference)

# Download model and create parser
parser = inference.download_and_load()

# Parse sentence
sent = parser.parse("TΓ΄i yΓͺu Việt Nam")
for token in sent:
    head = sent.get_head(token)
    print(f"{token.form} -> {head.form if head else 'ROOT'} ({token.deprel})")

Output:

TΓ΄i -> yΓͺu (nsubj)
yΓͺu -> ROOT (root)
Việt -> yΓͺu (obj)
Nam -> Việt (compound)

5.2 With src Package

from src import parse

sent = parse("TΓ΄i yΓͺu Việt Nam")
for token in sent:
    head = sent.get_head(token)
    print(f"{token.form} -> {head.form if head else 'ROOT'} ({token.deprel})")

5.3 Output Formats

ParsedSentence Object:

sent.text        # Original text
sent.tokens      # List of Token objects
sent.get_root()  # Root token
sent.get_head(token)       # Head of token
sent.get_dependents(token) # Dependents of token
sent.to_conllu()           # CoNLL-U format

Token Object:

token.id      # 1-indexed position
token.form    # Word form
token.head    # Head index (0 = ROOT)
token.deprel  # Dependency relation
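
Putting these together, a usage sketch built only from the attributes listed above (output abbreviated):

from src import parse

sent = parse("HΓ  Nα»™i lΓ  thα»§ Δ‘Γ΄ cα»§a Việt Nam")
root = sent.get_root()
print(root.form)                                    # root word of the sentence
print([d.form for d in sent.get_dependents(root)])  # direct dependents of the root
print(sent.to_conllu())                             # CoNLL-U block, as in Appendix B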

5.4 Requirements

torch>=2.0.0
transformers>=4.30.0
huggingface_hub>=0.20.0

6. Model Files

6.1 HuggingFace Repository

Repo: undertheseanlp/bamboo-1

| File | Size | Description |
|------|------|-------------|
| bamboo-1.0.0-20260202-xlmr-udd1.pt | 1.12 GB | XLM-RoBERTa model (UDD-1) |
| bamboo-1.0.0-20260202-phobert-vndt.pt | 0.55 GB | PhoBERT-base model (VnDT) |
| src/inference.py | 16 KB | Standalone inference script |
| README.md | 2 KB | Model card |
| TECHNICAL_REPORT.md | 12 KB | This technical report |

6.2 Checkpoint Contents

checkpoint = {
    'model': state_dict,      # Model weights
    'vocab': Vocabulary,      # Vocabulary object
    'config': {
        'method': 'trankit',
        'encoder': 'xlm-roberta-base',  # or 'vinai/phobert-base'
        'n_rels': 37,             # relation count is dataset-dependent (see Section 4.4)
    }
}
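
A loading sketch that follows this layout; TrankitParser is a hypothetical class name, and weights_only=False is needed on newer PyTorch versions because the checkpoint pickles a custom Vocabulary object:

import torch

ckpt = torch.load(
    "bamboo-1.0.0-20260202-xlmr-udd1.pt",
    map_location="cpu",
    weights_only=False,  # checkpoint contains a pickled Vocabulary object
)
print(ckpt["config"]["encoder"])  # "xlm-roberta-base"

# Rebuild the network and restore weights (class name hypothetical):
# model = TrankitParser(ckpt["config"], ckpt["vocab"])
# model.load_state_dict(ckpt["model"])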

6.3 Versioning

Format: bamboo-{version}-{date}-{encoder}-{dataset}.pt

| Version | Date | Encoder | Dataset | UAS | LAS |
|---------|------|---------|---------|-----|-----|
| 1.0.0 | 2026-02-02 | xlmr | udd1 | 79.57% (VTB) | 66.74% (VTB) |
| 1.0.0 | 2026-02-07 | phobert | vndt | 84.29% | 77.22% |

7. Limitations

  1. Tokenization: Expects whitespace-tokenized input. For raw Vietnamese text, run a word segmenter first (e.g., underthesea.word_tokenize); see the sketch after this list.

  2. Sentence Length: Performance may degrade on very long sentences (>100 tokens)

  3. Domain: Trained on news/formal text. May perform differently on social media or informal text.

  4. GPU Memory: Requires ~4GB GPU memory for inference with XLM-RoBERTa
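
A minimal sketch of the pre-processing step from limitation 1, assuming the underthesea package is installed; word_tokenize and its format="text" option are real underthesea APIs, while parser is the object created in Section 5.1. How segmented multi-syllable words (joined by underscores) map onto the parser's whitespace tokens depends on your pipeline, so treat this as illustrative:

from underthesea import word_tokenize

raw = "HΓ  Nα»™i lΓ  thα»§ Δ‘Γ΄ cα»§a Việt Nam"

# format="text" joins multi-syllable words with underscores,
# producing a single whitespace-tokenized string
segmented = word_tokenize(raw, format="text")
print(segmented)  # e.g. "HΓ _Nα»™i lΓ  thα»§_Δ‘Γ΄ cα»§a Việt_Nam"

sent = parser.parse(segmented)  # parser from Section 5.1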


8. References

  1. Nguyen, M. V., et al. (2021). Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing. EACL 2021. Outstanding Demo Paper Award.

  2. Dozat, T., & Manning, C. D. (2017). Deep Biaffine Attention for Neural Dependency Parsing. ICLR 2017.

  3. Conneau, A., et al. (2020). Unsupervised Cross-lingual Representation Learning at Scale. ACL 2020. (XLM-RoBERTa)

  4. Nguyen, D. Q., & Nguyen, A. T. (2020). PhoBERT: Pre-trained language models for Vietnamese. EMNLP 2020 Findings.

  5. Vu, T., et al. (2018). VnCoreNLP: A Vietnamese Natural Language Processing Toolkit. NAACL 2018 Demonstrations.

  6. Qi, P., et al. (2020). Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. ACL 2020 System Demonstrations.


9. Citation

@misc{bamboo1,
  title={Bamboo-1: Vietnamese Dependency Parser},
  author={Underthesea NLP},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/undertheseanlp/bamboo-1},
  version={1.0.0}
}

10. License

Apache License 2.0


Appendix A: Dependency Relations

| Relation | Description | Example |
|----------|-------------|---------|
| nsubj | Nominal subject | "TΓ΄i yΓͺu Việt Nam" |
| obj | Object | "TΓ΄i yΓͺu Việt Nam" |
| root | Root of sentence | "TΓ΄i yΓͺu Việt Nam" |
| compound | Compound | "Việt Nam" |
| nmod | Nominal modifier | "nhΓ  cα»§a tΓ΄i" |
| amod | Adjectival modifier | "cΓ΄ gΓ‘i Δ‘αΊΉp" |
| advmod | Adverbial modifier | "chαΊ‘y nhanh" |
| case | Case marking | "cα»§a tΓ΄i" |
| det | Determiner | "mα»™t người" |
| punct | Punctuation | "Xin chΓ o!" |

Appendix B: Example Outputs

Input 1

HΓ  Nα»™i lΓ  thα»§ Δ‘Γ΄ cα»§a Việt Nam

Output 1

# sent_id = 1
# text = HΓ  Nα»™i lΓ  thα»§ Δ‘Γ΄ cα»§a Việt Nam
1	HΓ 	_	_	_	_	5	nsubj	_	_
2	Nα»™i	_	_	_	_	1	flat	_	_
3	lΓ 	_	_	_	_	5	cop	_	_
4	thα»§	_	_	_	_	5	compound	_	_
5	Δ‘Γ΄	_	_	_	_	0	root	_	_
6	cα»§a	_	_	_	_	8	case	_	_
7	Việt	_	_	_	_	8	compound	_	_
8	Nam	_	_	_	_	5	nmod	_	_

Input 2

Em gÑi tôi học tiếng Anh

Output 2

# sent_id = 1
# text = Em gÑi tôi học tiếng Anh
1	Em	_	_	_	_	4	nsubj	_	_
2	gΓ‘i	_	_	_	_	1	nmod	_	_
3	tΓ΄i	_	_	_	_	2	det:pmod	_	_
4	học	_	_	_	_	0	root	_	_
5	tiαΊΏng	_	_	_	_	4	obj	_	_
6	Anh	_	_	_	_	5	compound	_	_