
Bamboo-1: Vietnamese Dependency Parser - Technical Report

Version: 1.0.0
Release Date: 2026-02-02
Model ID: undertheseanlp/bamboo-1


1. Introduction

Bamboo-1 is a Vietnamese dependency parser built on the Trankit architecture (Nguyen et al., 2021), combining an XLM-RoBERTa encoder with a deep biaffine attention head. The model is trained on the UDD-1 dataset and achieves state-of-the-art results on UD_Vietnamese-VTB (+8.5 UAS points over Trankit) and competitive performance on VnDT v1.1.

1.1 Key Features

  • Multilingual Encoder: XLM-RoBERTa-base for robust Vietnamese text representation
  • Biaffine Attention: Efficient arc and relation prediction (Dozat & Manning, 2017)
  • Standalone Inference: No package installation required - download and run
  • HuggingFace Integration: Easy model distribution and version control

2. Model Architecture

2.1 Overview

Input: Vietnamese sentence (whitespace tokenized)
         ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚     XLM-RoBERTa Encoder             β”‚
β”‚     (xlm-roberta-base, 768-dim)     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         ↓
    Word-level pooling (first subword)
         ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚     MLP Projections                 β”‚
β”‚  Arc-dep, Arc-head (500-dim)        β”‚
β”‚  Rel-dep, Rel-head (100-dim)        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚     Biaffine Attention              β”‚
β”‚  Arc scores: (seq_len Γ— seq_len)    β”‚
β”‚  Rel scores: (seq_len Γ— seq_len Γ— n_rels) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         ↓
Output: Dependency tree (heads + relations)
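
The word-level pooling step above selects the encoder state of each word's first subword. Below is a minimal sketch of that operation, assuming a Hugging Face fast tokenizer whose word_ids() maps each subword to its word index; the function name and example are illustrative, not Bamboo-1's actual source:

import torch

def first_subword_pool(hidden, word_ids):
    # hidden:   (n_subwords, 768) encoder output for one sentence
    # word_ids: one entry per subword, None for special tokens,
    #           as returned by a fast tokenizer's word_ids()
    first, seen = [], set()
    for i, w in enumerate(word_ids):
        if w is not None and w not in seen:
            seen.add(w)
            first.append(i)
    return hidden[first]  # (n_words, 768)

hidden = torch.randn(7, 768)
word_ids = [None, 0, 1, 1, 2, 3, None]  # e.g. "<s> TΓ΄i yΓͺ+u Việt Nam </s>"
print(first_subword_pool(hidden, word_ids).shape)  # torch.Size([4, 768])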

2.2 Component Details

| Component | Configuration |
|-----------|---------------|
| Encoder | xlm-roberta-base (12 layers, 768 hidden, 12 heads) |
| Arc MLP | 768 β†’ 500 (LeakyReLU, dropout=0.33) |
| Rel MLP | 768 β†’ 100 (LeakyReLU, dropout=0.33) |
| Biaffine (Arc) | 500-dim, bias_x=True, bias_y=False |
| Biaffine (Rel) | 100-dim, bias_x=True, bias_y=True |
| Output | 31-79 dependency relations (dataset-dependent, see Section 4.4) |
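
As a concrete illustration of the biaffine configurations above, here is a minimal PyTorch sketch of the deep biaffine scorer of Dozat & Manning (2017). The dimensions mirror the table, but the code is an illustrative sketch, not Bamboo-1's actual source:

import torch
import torch.nn as nn

class Biaffine(nn.Module):
    # Deep biaffine scorer (Dozat & Manning, 2017). bias_x / bias_y
    # append a constant 1 feature, which folds the linear and bias
    # terms into a single bilinear product.
    def __init__(self, n_in, n_out=1, bias_x=True, bias_y=False):
        super().__init__()
        self.bias_x, self.bias_y = bias_x, bias_y
        self.weight = nn.Parameter(
            torch.zeros(n_out, n_in + bias_x, n_in + bias_y))

    def forward(self, x, y):
        # x: (batch, seq_len, n_in) dependent representations
        # y: (batch, seq_len, n_in) head representations
        if self.bias_x:
            x = torch.cat((x, torch.ones_like(x[..., :1])), dim=-1)
        if self.bias_y:
            y = torch.cat((y, torch.ones_like(y[..., :1])), dim=-1)
        # scores: (batch, n_out, seq_len, seq_len); squeeze is a no-op for n_out > 1
        return torch.einsum('bxi,oij,byj->boxy', x, self.weight, y).squeeze(1)

arc_attn = Biaffine(n_in=500, n_out=1, bias_x=True, bias_y=False)
rel_attn = Biaffine(n_in=100, n_out=79, bias_x=True, bias_y=True)  # e.g. 79 rels for UDD-1

arc_scores = arc_attn(torch.randn(2, 10, 500), torch.randn(2, 10, 500))
print(arc_scores.shape)  # torch.Size([2, 10, 10])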

2.3 Parameter Count

| Component | Parameters |
|-----------|------------|
| XLM-RoBERTa Encoder | ~278M |
| Biaffine Head | ~2.5M |
| Total | ~280M |
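
The encoder figure can be sanity-checked directly; a quick sketch using transformers (the exact count may vary slightly across library versions):

from transformers import AutoModel

enc = AutoModel.from_pretrained("xlm-roberta-base")
n_params = sum(p.numel() for p in enc.parameters())
print(f"{n_params / 1e6:.0f}M")  # ~278M for the encoder alone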

3. Training

3.1 Dataset

UDD-1 (Universal Dependency Dataset for Vietnamese)

| Split | Sentences | Tokens |
|-------|-----------|--------|
| Train | 18,282 | ~400K |
| Dev | 859 | ~19K |
| Test | 859 | ~19K |

Source: huggingface.co/datasets/undertheseanlp/UDD-1

3.2 Hyperparameters

| Parameter | Value |
|-----------|-------|
| Batch size | 32 |
| Encoder learning rate | 1e-5 |
| Head learning rate | 1e-4 |
| Optimizer | AdamW (weight_decay=0.01) |
| Warmup steps | 500 |
| LR scheduler | Linear decay |
| Max epochs | 100 |
| Early stopping patience | 10 |
| Dropout | 0.33 |
| Gradient clipping | 5.0 |
| Mixed precision | FP16 |
| Random seed | 42 |
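
A sketch of how these settings could be wired together with PyTorch and the transformers scheduler; build_optimizer, the 'encoder' attribute prefix, and total_steps are assumptions for illustration, not Bamboo-1's actual training code:

import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, total_steps, warmup_steps=500):
    # Two parameter groups: a small LR (1e-5) for the pretrained encoder,
    # a larger LR (1e-4) for the randomly initialised biaffine head.
    encoder_params, head_params = [], []
    for name, p in model.named_parameters():
        (encoder_params if name.startswith("encoder") else head_params).append(p)

    optimizer = torch.optim.AdamW(
        [{"params": encoder_params, "lr": 1e-5},
         {"params": head_params, "lr": 1e-4}],
        weight_decay=0.01)

    # 500 warmup steps, then linear decay to zero over training
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps)
    return optimizer, scheduler

# Per step: loss.backward();
# torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0);
# optimizer.step(); scheduler.step(); optimizer.zero_grad()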

3.3 Training Command

uv run src/train.py \
    --method trankit \
    --encoder xlm-roberta-base \
    --dataset udd1 \
    --batch-size 32 \
    --bert-lr 1e-5 \
    --head-lr 1e-4 \
    --warmup-steps 500 \
    --epochs 100 \
    --patience 10 \
    --fp16 \
    --wandb

3.4 Training Infrastructure

| Resource | XLM-R (UDD-1) | PhoBERT (VnDT) |
|----------|---------------|----------------|
| GPU | NVIDIA RTX A4000 (16GB) | NVIDIA RTX 4090 (24GB) |
| Training time | ~4 hours | ~18 minutes |
| Epochs | 100 | 23 (early stop) |
| Cost | ~$0.68 | $0.13 |
| Framework | PyTorch 2.0+ | PyTorch 2.0+ |

4. Evaluation

4.1 Metrics

  • UAS (Unlabeled Attachment Score): Percentage of tokens with correct head
  • LAS (Labeled Attachment Score): Percentage of tokens with correct head AND relation label
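
As a toy illustration of the two metrics (the function and example data are ours, not from the evaluation code):

def attachment_scores(gold, pred):
    # gold, pred: one (head, deprel) pair per token
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n
    las = sum(g == p for g, p in zip(gold, pred)) / n
    return 100 * uas, 100 * las

# "TΓ΄i yΓͺu Việt Nam": last token gets the right head but the wrong label
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "compound")]
pred = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "nmod")]
print(attachment_scores(gold, pred))  # (100.0, 75.0)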

4.2 Results on UDD-1 Test Set

| Model | UAS | LAS |
|-------|-----|-----|
| Bamboo-1 (XLM-RoBERTa-base) | 55.42 | 41.19 |

Note: The lower performance on UDD-1 compared to UD_Vietnamese-VTB and VnDT may be attributed to:

  • Different annotation guidelines and relation set (79 relations vs 37 in VTB)
  • Domain differences in the training data
  • The evaluation setup: scores are reported on the UDD-1 test split for a model trained on the UDD-1 training split

4.3 Comparison with Prior Work

Vietnamese Dependency Parsing Benchmarks

| Model | Dataset | UAS | LAS | Reference |
|-------|---------|-----|-----|-----------|
| Bamboo-1 (This Work) | UD_Vietnamese-VTB | 79.57 | 66.74 | - |
| Trankit (XLM-R large) | UD_Vietnamese-VTB | 71.07 | 65.37 | Nguyen et al., EACL 2021 |
| Trankit v0.3.1 | UD_Vietnamese-VTB | 70.96 | 64.76 | - |
| Stanza v1.1.1 | UD_Vietnamese-VTB | 53.63 | 48.16 | - |
| Bamboo-1 XLM-R (This Work) | VnDT v1.1 | 83.41 | 76.32 | - |
| Bamboo-1 PhoBERT (This Work) | VnDT v1.1 | 84.29 | 77.22 | - |
| PhoBERT-base + Biaffine | VnDT v1.1 | 85.22 | 78.77 | - |
| PhoBERT-large + Biaffine | VnDT v1.1 | 84.32 | 77.85 | - |
| Biaffine | VnDT v1.1 | 81.19 | 74.99 | - |
| VnCoreNLP | VnDT v1.0 | 79.02 | 73.39 | Vu et al., NAACL 2018 |
| PhoBERT + ELMo / Biaffine | VLSP 2020 | 84.65 | 76.27 | Doan, VLSP 2020 |

Notes:

  • UD_Vietnamese-VTB: Universal Dependencies Vietnamese Treebank (~3,000 sentences)
  • VnDT: Vietnamese Dependency Treebank (~10,200 sentences)
  • VLSP 2020: Vietnamese Language and Speech Processing shared task dataset
  • UDD-1: Our training dataset derived from Vietnamese UD annotations (~20,000 sentences)

Encoder Ablation Study (VnDT v1.1)

To investigate the performance gap on VnDT, we conducted an encoder ablation study:

| Encoder | VnDT UAS | VnDT LAS | Notes |
|---------|----------|----------|-------|
| XLM-RoBERTa-base | 83.41 | 76.32 | Baseline (Bamboo-1) |
| PhoBERT-base | 84.29 | 77.22 | +0.88 UAS, +0.90 LAS |
| PhoBERT-base (literature) | 85.22 | 78.77 | Reference |

Key Finding: PhoBERT's Vietnamese-specific pretraining yields a consistent, measurable gain over XLM-RoBERTa on VnDT (+0.88 UAS, +0.90 LAS), confirming that language-specific pretraining benefits Vietnamese parsing.

Trankit Architecture Comparison

Bamboo-1 follows the Trankit architecture (Nguyen et al., 2021):

| Component | Trankit (EACL 2021) | Bamboo-1 |
|-----------|---------------------|----------|
| Encoder | XLM-RoBERTa (base/large) | XLM-RoBERTa-base |
| Arc MLP dim | 500 | 500 |
| Rel MLP dim | 100 | 100 |
| Dropout | 0.33 | 0.33 |
| Biaffine | Deep Biaffine | Deep Biaffine |
| Training data | UD_Vietnamese-VTB | UDD-1 |

4.4 Dependency Relations

The number of relations depends on the training dataset:

  • UD_Vietnamese-VTB: 37 relations (UD tagset)
  • UDD-1: 79 relations (extended UD tagset)
  • VnDT v1.1: 31 relations (VnDT-specific tagset)

The UD-trained models predict Universal Dependencies relations such as:

acl, advcl, advmod, amod, appos, aux, case, cc, ccomp, clf,
compound, conj, cop, csubj, dep, det, discourse, fixed, flat,
iobj, list, mark, nmod, nsubj, nummod, obj, obl, orphan,
parataxis, punct, reparandum, root, vocative, xcomp,
det:pmod, nmod:poss, compound:redup

5. Inference

5.1 Standalone Usage (No Installation)

from huggingface_hub import hf_hub_download
import importlib.util

# Download inference script
inference_path = hf_hub_download(
    "undertheseanlp/bamboo-1",
    "src/inference.py"
)

# Load module
spec = importlib.util.spec_from_file_location("inference", inference_path)
inference = importlib.util.module_from_spec(spec)
spec.loader.exec_module(inference)

# Download model and create parser
parser = inference.download_and_load()

# Parse sentence
sent = parser.parse("TΓ΄i yΓͺu Việt Nam")
for token in sent:
    head = sent.get_head(token)
    print(f"{token.form} -> {head.form if head else 'ROOT'} ({token.deprel})")

Output:

TΓ΄i -> yΓͺu (nsubj)
yΓͺu -> ROOT (root)
Việt -> yΓͺu (obj)
Nam -> Việt (compound)

5.2 With src Package

from src import parse

sent = parse("TΓ΄i yΓͺu Việt Nam")
for token in sent:
    head = sent.get_head(token)
    print(f"{token.form} -> {head.form if head else 'ROOT'} ({token.deprel})")

5.3 Output Formats

ParsedSentence Object:

sent.text        # Original text
sent.tokens      # List of Token objects
sent.get_root()  # Root token
sent.get_head(token)       # Head of token
sent.get_dependents(token) # Dependents of token
sent.to_conllu()           # CoNLL-U format

Token Object:

token.id      # 1-indexed position
token.form    # Word form
token.head    # Head index (0 = ROOT)
token.deprel  # Dependency relation
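
Putting these together, a usage sketch built only from the attributes listed above (output abbreviated):

from src import parse

sent = parse("HΓ  Nα»™i lΓ  thα»§ Δ‘Γ΄ cα»§a Việt Nam")
root = sent.get_root()
print(root.form)                                    # root word of the sentence
print([d.form for d in sent.get_dependents(root)])  # direct dependents of the root
print(sent.to_conllu())                             # CoNLL-U block, as in Appendix B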

5.4 Requirements

torch>=2.0.0
transformers>=4.30.0
huggingface_hub>=0.20.0

6. Model Files

6.1 HuggingFace Repository

Repo: undertheseanlp/bamboo-1

| File | Size | Description |
|------|------|-------------|
| bamboo-1.0.0-20260202-xlmr-udd1.pt | 1.12 GB | XLM-RoBERTa model (UDD-1) |
| bamboo-1.0.0-20260202-phobert-vndt.pt | 0.55 GB | PhoBERT-base model (VnDT) |
| src/inference.py | 16 KB | Standalone inference script |
| README.md | 2 KB | Model card |
| TECHNICAL_REPORT.md | 12 KB | This technical report |

6.2 Checkpoint Contents

checkpoint = {
    'model': state_dict,      # Model weights
    'vocab': Vocabulary,      # Vocabulary object
    'config': {
        'method': 'trankit',
        'encoder': 'xlm-roberta-base',  # or 'vinai/phobert-base'
        'n_rels': 37,             # relation count is dataset-dependent (see Section 4.4)
    }
}
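
A loading sketch that follows this layout; TrankitParser is a hypothetical class name, and weights_only=False is needed on newer PyTorch versions because the checkpoint pickles a custom Vocabulary object:

import torch

ckpt = torch.load(
    "bamboo-1.0.0-20260202-xlmr-udd1.pt",
    map_location="cpu",
    weights_only=False,  # checkpoint contains a pickled Vocabulary object
)
print(ckpt["config"]["encoder"])  # "xlm-roberta-base"

# Rebuild the network and restore weights (class name hypothetical):
# model = TrankitParser(ckpt["config"], ckpt["vocab"])
# model.load_state_dict(ckpt["model"])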

6.3 Versioning

Format: bamboo-{version}-{date}-{encoder}-{dataset}.pt

| Version | Date | Encoder | Dataset | UAS | LAS |
|---------|------|---------|---------|-----|-----|
| 1.0.0 | 2026-02-02 | xlmr | udd1 | 79.57% (VTB) | 66.74% (VTB) |
| 1.0.0 | 2026-02-07 | phobert | vndt | 84.29% | 77.22% |

7. Limitations

  1. Tokenization: Expects whitespace-tokenized input. For raw Vietnamese text, run a word segmenter first (e.g., underthesea.word_tokenize); see the sketch after this list.

  2. Sentence Length: Performance may degrade on very long sentences (>100 tokens)

  3. Domain: Trained on news/formal text. May perform differently on social media or informal text.

  4. GPU Memory: Requires ~4GB GPU memory for inference with XLM-RoBERTa
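
A minimal sketch of the pre-processing step from limitation 1, assuming the underthesea package is installed; word_tokenize and its format="text" option are real underthesea APIs, while parser is the object created in Section 5.1. How segmented multi-syllable words (joined by underscores) map onto the parser's whitespace tokens depends on your pipeline, so treat this as illustrative:

from underthesea import word_tokenize

raw = "HΓ  Nα»™i lΓ  thα»§ Δ‘Γ΄ cα»§a Việt Nam"

# format="text" joins multi-syllable words with underscores,
# producing a single whitespace-tokenized string
segmented = word_tokenize(raw, format="text")
print(segmented)  # e.g. "HΓ _Nα»™i lΓ  thα»§_Δ‘Γ΄ cα»§a Việt_Nam"

sent = parser.parse(segmented)  # parser from Section 5.1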


8. References

  1. Nguyen, M. V., et al. (2021). Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing. EACL 2021. Outstanding Demo Paper Award.

  2. Dozat, T., & Manning, C. D. (2017). Deep Biaffine Attention for Neural Dependency Parsing. ICLR 2017.

  3. Conneau, A., et al. (2020). Unsupervised Cross-lingual Representation Learning at Scale. ACL 2020. (XLM-RoBERTa)

  4. Nguyen, D. Q., & Nguyen, A. T. (2020). PhoBERT: Pre-trained language models for Vietnamese. EMNLP 2020 Findings.

  5. Vu, T., et al. (2018). VnCoreNLP: A Vietnamese Natural Language Processing Toolkit. NAACL 2018 Demonstrations.

  6. Qi, P., et al. (2020). Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. ACL 2020 System Demonstrations.


9. Citation

@misc{bamboo1,
  title={Bamboo-1: Vietnamese Dependency Parser},
  author={Underthesea NLP},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/undertheseanlp/bamboo-1},
  version={1.0.0}
}

10. License

Apache License 2.0


Appendix A: Dependency Relations

| Relation | Description | Example |
|----------|-------------|---------|
| nsubj | Nominal subject | "TΓ΄i yΓͺu Việt Nam" |
| obj | Object | "TΓ΄i yΓͺu Việt Nam" |
| root | Root of sentence | "TΓ΄i yΓͺu Việt Nam" |
| compound | Compound | "Việt Nam" |
| nmod | Nominal modifier | "nhΓ  cα»§a tΓ΄i" |
| amod | Adjectival modifier | "cΓ΄ gΓ‘i Δ‘αΊΉp" |
| advmod | Adverbial modifier | "chαΊ‘y nhanh" |
| case | Case marking | "cα»§a tΓ΄i" |
| det | Determiner | "mα»™t người" |
| punct | Punctuation | "Xin chΓ o!" |

Appendix B: Example Outputs

Input 1

HΓ  Nα»™i lΓ  thα»§ Δ‘Γ΄ cα»§a Việt Nam

Output 1

# sent_id = 1
# text = HΓ  Nα»™i lΓ  thα»§ Δ‘Γ΄ cα»§a Việt Nam
1	HΓ 	_	_	_	_	5	nsubj	_	_
2	Nα»™i	_	_	_	_	1	flat	_	_
3	lΓ 	_	_	_	_	5	cop	_	_
4	thα»§	_	_	_	_	5	compound	_	_
5	Δ‘Γ΄	_	_	_	_	0	root	_	_
6	cα»§a	_	_	_	_	8	case	_	_
7	Việt	_	_	_	_	8	compound	_	_
8	Nam	_	_	_	_	5	nmod	_	_

Input 2

Em gÑi tôi học tiếng Anh

Output 2

# sent_id = 1
# text = Em gÑi tôi học tiếng Anh
1	Em	_	_	_	_	4	nsubj	_	_
2	gΓ‘i	_	_	_	_	1	nmod	_	_
3	tΓ΄i	_	_	_	_	2	det:pmod	_	_
4	học	_	_	_	_	0	root	_	_
5	tiαΊΏng	_	_	_	_	4	obj	_	_
6	Anh	_	_	_	_	5	compound	_	_