CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Project Overview
Bamboo-1 is a Vietnamese dependency parser using the Biaffine architecture (Dozat & Manning, 2017), trained on the UDD-1 dataset from HuggingFace (undertheseanlp/UDD-1).
Commands
Setup
uv sync # Install dependencies
uv sync --extra dev # Include pytest and wandb
uv sync --extra cloud # Include runpod for cloud training
Training
uv run src/train.py # Default training
uv run src/train.py --feat bert --bert vinai/phobert-base # With PhoBERT
uv run src/train.py --wandb --wandb-project bamboo-1 # With W&B logging
Evaluation
uv run src/evaluate.py --model models/bamboo-1 # Evaluate on test set
uv run src/evaluate.py --model models/bamboo-1 --detailed # Per-relation breakdown
uv run src/evaluate_detailed.py --model models/bamboo-1 # Detailed error analysis (P/R/F1, length, distance)
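The headline metrics for a dependency parser are UAS (share of tokens with the correct head) and LAS (share with the correct head and relation label). The sketch below shows how those two numbers are conventionally computed; it is not the code in src/evaluate.py. The function name `attachment_scores` and the `(head, relation)` tuple layout are assumptions made for illustration, and whether the repository's scripts exclude punctuation is not stated here.

```python
from typing import List, Tuple

# One token's annotation: (index of its head, dependency relation label).
Arc = Tuple[int, str]


def attachment_scores(gold: List[List[Arc]], pred: List[List[Arc]]) -> Tuple[float, float]:
    """Return (UAS, LAS) as fractions computed over all tokens."""
    total = correct_heads = correct_labeled = 0
    for gold_sent, pred_sent in zip(gold, pred):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("gold and predicted sentences must have the same length")
        for (g_head, g_rel), (p_head, p_rel) in zip(gold_sent, pred_sent):
            total += 1
            if g_head == p_head:
                correct_heads += 1  # counts toward UAS
                if g_rel == p_rel:
                    correct_labeled += 1  # counts toward LAS as well
    if total == 0:
        return 0.0, 0.0
    return correct_heads / total, correct_labeled / total


# Hypothetical three-token sentence: all heads are right, one label is wrong.
gold = [[(2, "nsubj"), (0, "root"), (2, "obj")]]
pred = [[(2, "nsubj"), (0, "root"), (2, "nmod")]]
uas, las = attachment_scores(gold, pred)
```

LAS can never exceed UAS, because a token only counts toward LAS when its head is already correct.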
Prediction
uv run src/predict.py --model models/bamboo-1 # Interactive mode
uv run src/predict.py --model models/bamboo-1 --text "Tôi yêu Việt Nam" # Single sentence
Architecture
bamboo-1/
├── src/
│   ├── corpus.py             # UDD1Corpus - downloads from HuggingFace, converts to CoNLL-U
│   ├── ud_corpus.py          # UD Vietnamese VTB corpus loader
│   ├── vndt_corpus.py        # VnDT corpus loader
│   ├── inference.py          # Standalone inference module
│   ├── train.py              # Training entry point (Click CLI)
│   ├── train_phobert.py      # PhoBERT-based training
│   ├── evaluate.py           # UAS/LAS evaluation
│   ├── evaluate_detailed.py  # Per-relation P/R/F1, length & distance analysis
│   ├── predict.py            # Inference (interactive, file, or single sentence)
│   └── models/               # Model implementations
├── data/                     # Auto-generated: CoNLL-U files from datasets
└── models/                   # Trained model output
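The data/ directory holds CoNLL-U files generated from the source datasets. As a rough illustration of that target format, the sketch below renders one sentence as a CoNLL-U block. The input keys (`tokens`, `upos`, `head`, `deprel`) and the function name `to_conllu` are hypothetical; the actual UDD-1 schema and the conversion in src/corpus.py may differ.

```python
from typing import Dict, List


def to_conllu(sentence: Dict[str, List]) -> str:
    """Render one sentence as a CoNLL-U block (10 tab-separated columns per token)."""
    lines = [f"# text = {' '.join(sentence['tokens'])}"]
    rows = zip(sentence["tokens"], sentence["upos"], sentence["head"], sentence["deprel"])
    for idx, (form, upos, head, deprel) in enumerate(rows, start=1):
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        # Unknown fields are written as "_", as the format requires.
        cols = [str(idx), form, "_", upos, "_", "_", str(head), deprel, "_", "_"]
        lines.append("\t".join(cols))
    # A blank line terminates each sentence block.
    return "\n".join(lines) + "\n\n"


# Hypothetical record; the field names are assumptions, not the UDD-1 schema.
sentence = {
    "tokens": ["Tôi", "yêu", "Việt Nam"],
    "upos": ["PRON", "VERB", "PROPN"],
    "head": [2, 0, 2],
    "deprel": ["nsubj", "root", "obj"],
}
block = to_conllu(sentence)
```

Token IDs start at 1, and a HEAD of 0 marks the root of the sentence.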
Key dependencies:
- underthesea[deep] provides the Biaffine parser implementation (DependencyParser, DependencyParserTrainer)
- datasets for loading UDD-1 from HuggingFace
- click for CLI argument parsing
Model architecture:
- Word + Character LSTM embeddings (or PhoBERT with --feat bert)
- 3-layer BiLSTM encoder (400 hidden units)
- Biaffine attention for arc and relation prediction
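To make the last component concrete, here is a minimal pure-Python sketch of biaffine arc scoring in the form given by Dozat & Manning: a bilinear term between each dependent and each candidate head, plus a bias term that depends only on the head. It is not the underthesea implementation; the function names, the tiny 2-dimensional vectors, and the greedy decoding step are assumptions chosen for readability. In the real model the dependent and head vectors would come from MLPs applied to the BiLSTM outputs.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def biaffine_arc_scores(dep: Matrix, head: Matrix, U: Matrix, bias: Vector) -> Matrix:
    """scores[i][j] = dep[i]^T U head[j] + bias . head[j]

    scores[i][j] is the score for token j being the head of token i.
    """
    scores = []
    for d in dep:
        # d^T U is shared across every candidate head of this dependent.
        dU = [sum(d[k] * U[k][m] for k in range(len(d))) for m in range(len(U[0]))]
        row = []
        for h in head:
            bilinear = sum(dU[m] * h[m] for m in range(len(h)))
            head_prior = sum(b * x for b, x in zip(bias, h))
            row.append(bilinear + head_prior)
        scores.append(row)
    return scores


def greedy_heads(scores: Matrix) -> List[int]:
    """Pick the highest-scoring head per token (no tree constraint enforced)."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Toy inputs: two tokens with 2-dimensional representations.
dep = [[1.0, 0.0], [0.0, 1.0]]
head = [[1.0, 0.0], [0.0, 1.0]]
U = [[0.0, 1.0], [1.0, 0.0]]
bias = [0.5, 0.0]

scores = biaffine_arc_scores(dep, head, U, bias)
heads = greedy_heads(scores)
```

Greedy selection is used here only to keep the sketch short. It can produce cycles, so a parser that must output a well-formed tree typically applies a spanning-tree decoder to the same score matrix.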
Key Implementation Details
- UDD1Corpus (src/corpus.py): Auto-downloads dataset on first use; converts HuggingFace format to CoNLL-U files
- VnDTCorpus (src/vndt_corpus.py): Downloads VnDT dataset from GitHub
- UDVietnameseVTB (src/ud_corpus.py): Downloads UD Vietnamese VTB from Universal Dependencies
- Scripts use PEP 723 inline dependencies and manual sys.path manipulation to import the src module
- Training hyperparameters are CLI flags (see --help for each script)
- Feature types: char (character LSTM), bert (PhoBERT), tag (POS tags)
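For readers unfamiliar with the PEP 723 and sys.path pattern mentioned above, the following sketch shows what such a script header can look like. The dependency list, the Python version bound, and the helper name `add_repo_root_to_path` are illustrative assumptions, as is the idea that the repository root (the directory above src/) is what gets added to sys.path; check the actual scripts for the real metadata.

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "click",
# ]
# ///
"""Hypothetical script skeleton: PEP 723 metadata above, path bootstrap below."""
import sys
from pathlib import Path


def add_repo_root_to_path(script_file: str) -> Path:
    """Put the directory above src/ on sys.path so that `src` can be imported."""
    # <repo>/src/train.py -> <repo>
    repo_root = Path(script_file).resolve().parent.parent
    if str(repo_root) not in sys.path:
        # Insert at the front so the local package wins over installed ones.
        sys.path.insert(0, str(repo_root))
    return repo_root
```

A script under src/ would call `add_repo_root_to_path(__file__)` before its first `src` import. When the file is launched with `uv run`, uv reads the commented metadata block and provisions the listed dependencies for that run.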
RunPod
- API keys: ~/.env (RUNPOD_API_KEY, WANDB_API_KEY, HF_TOKEN)
- Volume: bamboo-data (hxliy8vnua) - 50 GB @ EU-RO-1 (Romania)
- Cheapest GPUs in the same region (EU-RO-1):
  - RTX A4000 (16 GB) - $0.17/hr, spot $0.09/hr
  - RTX A4500 (20 GB) - $0.19/hr, spot $0.10/hr
  - RTX 4000 Ada (20 GB) - $0.20/hr, spot $0.10/hr
  - RTX 4090 (24 GB) - $0.34/hr, spot $0.20/hr (High stock)
- Launch: uv run scripts/runpod_setup.py launch --gpu "NVIDIA RTX A4000" --volume hxliy8vnua
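When choosing between the GPUs listed above, a quick cost estimate can help. The sketch below only multiplies the listed hourly rates by a run length; the 10-hour duration is a made-up figure, the helper name `estimate_cost` is hypothetical, and storage charges for the volume are not included.

```python
from typing import Dict, Tuple

# (on-demand, spot) prices in USD per hour, copied from the list above.
PRICES_USD_PER_HOUR: Dict[str, Tuple[float, float]] = {
    "RTX A4000": (0.17, 0.09),
    "RTX A4500": (0.19, 0.10),
    "RTX 4000 Ada": (0.20, 0.10),
    "RTX 4090": (0.34, 0.20),
}


def estimate_cost(gpu: str, hours: float, spot: bool = False) -> float:
    """Return the GPU rental cost in USD for a run of the given length."""
    on_demand, spot_price = PRICES_USD_PER_HOUR[gpu]
    rate = spot_price if spot else on_demand
    return round(rate * hours, 2)


# Hypothetical 10-hour training run on the cheapest card.
on_demand_cost = estimate_cost("RTX A4000", 10)
spot_cost = estimate_cost("RTX A4000", 10, spot=True)
```

Spot instances can be interrupted, so the lower rate is most useful when training checkpoints are saved to the persistent volume.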