
CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Bamboo-1 is a Vietnamese dependency parser using the Biaffine architecture (Dozat & Manning, 2017), trained on the UDD-1 dataset from HuggingFace (undertheseanlp/UDD-1).

Commands

Setup

uv sync                    # Install dependencies
uv sync --extra dev        # Include pytest and wandb
uv sync --extra cloud      # Include runpod for cloud training

Training

uv run src/train.py                                    # Default training
uv run src/train.py --feat bert --bert vinai/phobert-base  # With PhoBERT
uv run src/train.py --wandb --wandb-project bamboo-1   # With W&B logging

Evaluation

uv run src/evaluate.py --model models/bamboo-1         # Evaluate on test set
uv run src/evaluate.py --model models/bamboo-1 --detailed  # Per-relation breakdown
uv run src/evaluate_detailed.py --model models/bamboo-1    # Detailed error analysis (P/R/F1, length, distance)
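The attachment scores reported by these scripts are the standard ones; as a reference point, a minimal sketch of how UAS/LAS are computed (not the project's actual implementation):

```python
def uas_las(gold_heads, gold_rels, pred_heads, pred_rels):
    """UAS: fraction of tokens with the correct head; LAS: correct head AND relation."""
    n = len(gold_heads)
    uas = sum(g == p for g, p in zip(gold_heads, pred_heads)) / n
    las = sum(gh == ph and gr == pr
              for gh, gr, ph, pr in zip(gold_heads, gold_rels, pred_heads, pred_rels)) / n
    return uas, las

# Toy example: 3 tokens, one wrong head on the last token
uas, las = uas_las([2, 0, 2], ["nsubj", "root", "obj"],
                   [2, 0, 1], ["nsubj", "root", "obj"])
```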

Prediction

uv run src/predict.py --model models/bamboo-1              # Interactive mode
uv run src/predict.py --model models/bamboo-1 --text "TΓ΄i yΓͺu Việt Nam"

Architecture

bamboo-1/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ corpus.py          # UDD1Corpus - downloads from HuggingFace, converts to CoNLL-U
β”‚   β”œβ”€β”€ ud_corpus.py       # UD Vietnamese VTB corpus loader
β”‚   β”œβ”€β”€ vndt_corpus.py     # VnDT corpus loader
β”‚   β”œβ”€β”€ inference.py       # Standalone inference module
β”‚   β”œβ”€β”€ train.py           # Training entry point (Click CLI)
β”‚   β”œβ”€β”€ train_phobert.py   # PhoBERT-based training
β”‚   β”œβ”€β”€ evaluate.py        # UAS/LAS evaluation
β”‚   β”œβ”€β”€ evaluate_detailed.py # Per-relation P/R/F1, length & distance analysis
β”‚   β”œβ”€β”€ predict.py         # Inference (interactive, file, or single sentence)
β”‚   └── models/            # Model implementations
β”œβ”€β”€ data/                  # Auto-generated: CoNLL-U files from datasets
└── models/                # Trained model output

Key dependencies:

  • underthesea[deep] provides the Biaffine parser implementation (DependencyParser, DependencyParserTrainer)
  • datasets for loading UDD-1 from HuggingFace
  • click for CLI argument parsing

Model architecture:

  • Word + Character LSTM embeddings (or PhoBERT with --feat bert)
  • 3-layer BiLSTM encoder (400 hidden units)
  • Biaffine attention for arc and relation prediction

Key Implementation Details

  • UDD1Corpus (src/corpus.py): Auto-downloads dataset on first use; converts HuggingFace format to CoNLL-U files
  • VnDTCorpus (src/vndt_corpus.py): Downloads VnDT dataset from GitHub
  • UDVietnameseVTB (src/ud_corpus.py): Downloads UD Vietnamese VTB from Universal Dependencies
  • Scripts use PEP 723 inline dependencies and manual sys.path manipulation to import the src module
  • Training hyperparameters are CLI flags (see --help for each script)
  • Feature types: char (character LSTM), bert (PhoBERT), tag (POS tags)

RunPod

  • API keys: ~/.env (RUNPOD_API_KEY, WANDB_API_KEY, HF_TOKEN)
  • Volume: bamboo-data (hxliy8vnua) - 50GB @ EU-RO-1 (Romania)
  • GPU rαΊ» nhαΊ₯t cΓΉng region (EU-RO-1):
    • RTX A4000 (16G) - $0.17/hr, spot $0.09/hr
    • RTX A4500 (20G) - $0.19/hr, spot $0.10/hr
    • RTX 4000 Ada (20G) - $0.20/hr, spot $0.10/hr
    • RTX 4090 (24G) - $0.34/hr, spot $0.20/hr (High stock)
  • Launch: uv run scripts/runpod_setup.py launch --gpu "NVIDIA RTX A4000" --volume hxliy8vnua