Bamboo-1: Vietnamese Dependency Parser

A Vietnamese dependency parser trained on the UDD-1 dataset using the Biaffine architecture.

Overview

Bamboo-1 is a neural dependency parser for Vietnamese that uses:

Architecture: Biaffine Dependency Parser (Dozat & Manning, 2017)
Dataset: UDD-1 (Universal Dependency Dataset for Vietnamese)
Features: Character-level LSTM embeddings

Installation

# Run from the repository root (adjust the path to your own checkout)
cd ~/projects/workspace_underthesea/bamboo-1
uv sync

Usage

Quick start (from Hugging Face)

from src.inference import Parser

parser = Parser("undertheseanlp/bamboo-1")   # downloads the released safetensors model
sent = parser.parse("Tôi yêu Việt Nam")
print(sent.to_conllu())

Training

# Train with default parameters
uv run scripts/train.py

# Train with custom parameters
uv run scripts/train.py --output models/bamboo-1 --max-epochs 200 --feat char

# Train with BERT embeddings
uv run scripts/train.py --feat bert --bert vinai/phobert-base

# Train with Weights & Biases logging
uv run scripts/train.py --wandb

Evaluation

# Evaluate trained model
uv run scripts/evaluate.py --model models/bamboo-1

Prediction

# Interactive prediction
uv run scripts/predict.py --model models/bamboo-1

# Predict from file
uv run scripts/predict.py --model models/bamboo-1 --input input.txt --output output.conllu

Dataset

The UDD-1 dataset is downloaded automatically from Hugging Face:

Source: undertheseanlp/UDD-1
Train: 18,282 sentences
Validation: 859 sentences
Test: 859 sentences
Format: Universal Dependencies (CoNLL-U)
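
To make the CoNLL-U format concrete, here is a minimal sketch of reading one sentence into Python dictionaries. The helper name `parse_conllu_sentence` and the sample annotation are illustrative assumptions: the sentence is hand-written for this example and is not taken from UDD-1, and the repository's own loader (bamboo1/corpus.py) may work differently.

```python
# The ten tab-separated columns defined by the CoNLL-U format.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def parse_conllu_sentence(block):
    """Parse one CoNLL-U sentence block into a list of token dicts (hypothetical helper)."""
    tokens = []
    for line in block.strip().splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        token = dict(zip(CONLLU_FIELDS, cols))
        if not token["id"].isdigit():
            continue  # skip multiword ranges (e.g. 1-2) and empty nodes (e.g. 1.1)
        token["id"] = int(token["id"])
        token["head"] = int(token["head"])  # 0 means the token attaches to ROOT
        tokens.append(token)
    return tokens


# Hand-written illustration; a Vietnamese word may contain a space ("Việt Nam").
rows = [
    ("1", "Tôi", "_", "PRON", "_", "_", "2", "nsubj", "_", "_"),
    ("2", "yêu", "_", "VERB", "_", "_", "0", "root", "_", "_"),
    ("3", "Việt Nam", "_", "PROPN", "_", "_", "2", "obj", "_", "_"),
]
sample = "# text = Tôi yêu Việt Nam\n" + "\n".join("\t".join(r) for r in rows)

tokens = parse_conllu_sentence(sample)
print([(t["form"], t["head"], t["deprel"]) for t in tokens])
# [('Tôi', 2, 'nsubj'), ('yêu', 0, 'root'), ('Việt Nam', 2, 'obj')]
```

Splitting on tabs rather than on arbitrary whitespace matters here, because a single Vietnamese token can span several syllables separated by spaces.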

Model Architecture

Input: Vietnamese sentence
    ↓
Word Embeddings + Character LSTM Embeddings
    ↓
BiLSTM Encoder (3 layers, 400 hidden units)
    ↓
Biaffine Attention (Arc + Relation)
    ↓
Output: Dependency tree (head indices + relation labels)
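
The arc-scoring step of the pipeline above can be sketched in a few lines of NumPy. This follows the biaffine formulation of Dozat & Manning (a bilinear term plus a head-only bias) and is not the Bamboo-1 source code. The function names, the random inputs, and the greedy argmax decoding are assumptions made for illustration; a real parser would feed in MLP projections of the BiLSTM states and may decode with a tree-constrained algorithm instead.

```python
import numpy as np


def biaffine_arc_scores(h_dep, h_head, U, u):
    """Return scores[i, j]: the score of token j being the head of token i.

    h_dep:  (n, d) vectors for each token in its role as a dependent
    h_head: (n, d) vectors for each token in its role as a head
    U:      (d, d) bilinear weight matrix
    u:      (d,)   bias vector applied to the head representation only
    """
    bilinear = h_dep @ U.T @ h_head.T   # [i, j] = h_head[j] @ U @ h_dep[i]
    head_bias = h_head @ u              # prior score of each token as a head
    return bilinear + head_bias[np.newaxis, :]


def greedy_heads(scores):
    """Pick the best-scoring head per token (no guarantee of a well-formed tree)."""
    scores = scores.copy()
    np.fill_diagonal(scores, -np.inf)   # a token cannot be its own head
    return scores[1:].argmax(axis=1)    # skip row 0: ROOT itself has no head


rng = np.random.default_rng(0)
n, d = 5, 8                             # index 0 is an artificial ROOT, then 4 tokens
h_dep = rng.normal(size=(n, d))
h_head = rng.normal(size=(n, d))
U = rng.normal(size=(d, d))
u = rng.normal(size=d)

scores = biaffine_arc_scores(h_dep, h_head, U, u)
heads = greedy_heads(scores)
print(scores.shape)  # (5, 5)
print(heads)         # one head index per non-ROOT token
```

Relation labels are scored in the same spirit, with a separate biaffine classifier that produces one score per label for each (dependent, head) pair.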

Metrics

UAS (Unlabeled Attachment Score): Percentage of tokens whose predicted head is correct
LAS (Labeled Attachment Score): Percentage of tokens whose predicted head and relation label are both correct
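
As a worked example of the two metrics, the sketch below computes UAS and LAS for aligned gold and predicted tokens. The function name `attachment_scores` and the toy data are assumptions for illustration. The repository's scripts/evaluate.py may aggregate differently (for example, over a whole corpus or with special handling of punctuation).

```python
def attachment_scores(gold, pred):
    """Compute (UAS, LAS) in percent from aligned lists of (head, deprel) pairs."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of tokens")
    if not gold:
        raise ValueError("cannot score an empty token list")
    # UAS counts tokens whose head matches; LAS also requires the label to match.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    las_hits = sum(g == p for g, p in zip(gold, pred))
    total = len(gold)
    return 100.0 * uas_hits / total, 100.0 * las_hits / total


# Toy example: all heads are right, but the last relation label is wrong.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod")]

uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f}% LAS={las:.2f}%")  # UAS=100.00% LAS=66.67%
```

Because a correct label only counts when the head is also correct, LAS can never exceed UAS.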

Released model

The released checkpoint, models/bamboo-1.0.0-20260601-xlmr-udd1, is the XLM-RoBERTa + Biaffine (Trankit-style) variant. It was trained on UDD-1 with whitespace-tokenized input.

Split	UAS	LAS
UDD-1 dev	88.70%	82.37%
UDD-1 test	89.25%	82.87%

Trained for 100 epochs (batch size 32, encoder LR 1e-5, head LR 1e-4, AdamW, FP16) on a single RTX 3090.

Project Structure

bamboo-1/
├── README.md
├── requirements.txt
├── scripts/
│   ├── train.py          # Training script
│   ├── evaluate.py       # Evaluation script
│   └── predict.py        # Prediction script
├── bamboo1/
│   └── corpus.py         # UDD-1 corpus loader
├── models/               # Trained models (generated)
└── data/                 # Downloaded dataset (generated)

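References

Dozat, T., & Manning, C. D. (2017). Deep Biaffine Attention for Neural Dependency Parsing. ICLR 2017. arXiv:1611.01734.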

License

MIT License
