Bamboo-1: Vietnamese Dependency Parser

A Vietnamese dependency parser trained on the UDD-1 dataset using the Biaffine architecture.

Overview

Bamboo-1 is a neural dependency parser for Vietnamese that uses:

  • Architecture: Biaffine Dependency Parser (Dozat & Manning, 2017)
  • Dataset: UDD-1 (Universal Dependency Dataset for Vietnamese)
  • Features: Character-level LSTM embeddings

Installation

# From the project root (adjust the path to your local checkout)
cd ~/projects/workspace_underthesea/bamboo-1
uv sync

Usage

Training

# Train with default parameters
uv run scripts/train.py

# Train with custom parameters
uv run scripts/train.py --output models/bamboo-1 --max-epochs 200 --feat char

# Train with BERT embeddings
uv run scripts/train.py --feat bert --bert vinai/phobert-base

# Train with Weights & Biases logging
uv run scripts/train.py --wandb

Evaluation

# Evaluate trained model
uv run scripts/evaluate.py --model models/bamboo-1

Prediction

# Interactive prediction
uv run scripts/predict.py --model models/bamboo-1

# Predict from file
uv run scripts/predict.py --model models/bamboo-1 --input input.txt --output output.conllu

Dataset

The UDD-1 dataset is downloaded automatically from Hugging Face:

  • Source: undertheseanlp/UDD-1
  • Train: 18,282 sentences
  • Validation: 859 sentences
  • Test: 859 sentences
  • Format: Universal Dependencies (CoNLL-U)
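
For readers unfamiliar with CoNLL-U, the sketch below shows one way to read such a file with plain Python. It is an illustration, not the project's loader (bamboo1/corpus.py): the function name parse_conllu and the two-token sample sentence are hypothetical, and only the ten-column, tab-separated layout comes from the Universal Dependencies format.

```python
from typing import Dict, List

# The ten tab-separated columns defined by the CoNLL-U format.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    tokens: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                sentences.append(tokens)
                tokens = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        tokens.append(dict(zip(FIELDS, cols)))
    if tokens:
        sentences.append(tokens)
    return sentences


# Hypothetical sample written by hand for illustration (not taken from UDD-1).
SAMPLE = (
    "# text = Tôi học\n"
    "1\tTôi\t_\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\thọc\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

parsed = parse_conllu(SAMPLE)
for tok in parsed[0]:
    print(tok["id"], tok["form"], tok["head"], tok["deprel"])
```

The head column holds the index of each token's governor, with 0 reserved for the root, and deprel holds the relation label. These are the two fields the parser predicts.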

Model Architecture

Input: Vietnamese sentence
    ↓
Word Embeddings + Character LSTM Embeddings
    ↓
BiLSTM Encoder (3 layers, 400 hidden units)
    ↓
Biaffine Attention (Arc + Relation)
    ↓
Output: Dependency tree (head indices + relation labels)
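
The arc-scoring step can be sketched in plain Python as follows. This is a toy illustration of the biaffine formula from Dozat & Manning (2017), not the code used by Bamboo-1: the function names, the 2-dimensional vectors, and the weights U and u are all made up for the example. In the paper, the dependent and head vectors come from two separate MLPs applied to the BiLSTM outputs. The greedy head selection shown here does not guarantee a well-formed tree, and this README does not state which decoding strategy the project uses.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def biaffine_arc_scores(dep: Matrix, head: Matrix, U: Matrix, u: Vector) -> Matrix:
    """Return scores[i][j], the score of token j being the head of token i.

    scores[i][j] = head[j]^T U dep[i] + head[j] . u
    """
    scores: Matrix = []
    for d in dep:
        # Compute U d once per dependent.
        Ud = [sum(row[k] * d[k] for k in range(len(d))) for row in U]
        out_row = []
        for h in head:
            bilinear = sum(h[k] * Ud[k] for k in range(len(h)))
            bias = sum(h[k] * u[k] for k in range(len(h)))
            out_row.append(bilinear + bias)
        scores.append(out_row)
    return scores


def greedy_heads(scores: Matrix) -> List[int]:
    """Pick the highest-scoring head per token (no tree constraint)."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Toy inputs (assumed values): position 0 is the artificial ROOT,
# positions 1 and 2 are the two words of a sentence.
dep = [[0.0, 0.0], [1.0, -1.0], [1.0, 1.0]]
head = [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
U = [[1.0, 0.0], [0.0, 1.0]]
u = [0.1, 0.0]

scores = biaffine_arc_scores(dep, head, U, u)
heads = greedy_heads(scores)
print(heads[1:])  # predicted head index for each real token
```

The relation classifier follows the same pattern, with one biaffine scorer per label evaluated for the chosen head of each token.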

Metrics

  • UAS (Unlabeled Attachment Score): Percentage of tokens assigned the correct head
  • LAS (Labeled Attachment Score): Percentage of tokens assigned both the correct head and the correct relation label
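
As a worked example, both scores can be computed from gold and predicted (head, relation) pairs. This is a minimal sketch, with attachment_scores as a hypothetical helper and invented toy data. It counts every token; whether scripts/evaluate.py excludes punctuation or applies other filtering is not stated here.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, relation label)


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over all tokens."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of tokens")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    # UAS: head matches. LAS: head and relation label both match.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    las_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100.0 * uas_hits / n, 100.0 * las_hits / n


# Toy four-token sentence (invented for illustration).
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (3, "punct")]

uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.1f}, LAS = {las:.1f}")  # UAS = 75.0, LAS = 50.0
```

Token 3 has the right head but the wrong label, so it counts toward UAS only. Token 4 has the wrong head, so it counts toward neither. LAS can therefore never exceed UAS.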

Project Structure

bamboo-1/
├── README.md
├── requirements.txt
├── scripts/
│   ├── train.py          # Training script
│   ├── evaluate.py       # Evaluation script
│   └── predict.py        # Prediction script
├── bamboo1/
│   └── corpus.py         # UDD-1 corpus loader
├── models/               # Trained models (generated)
└── data/                 # Downloaded dataset (generated)

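References

  • Dozat, T., & Manning, C. D. (2017). Deep Biaffine Attention for Neural Dependency Parsing. ICLR 2017.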

License

MIT License
