
CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Bamboo-1 is a Vietnamese dependency parser using the Biaffine architecture (Dozat & Manning, 2017), trained on the UDD-1 dataset from HuggingFace (undertheseanlp/UDD-1).

Commands

Setup

uv sync                    # Install dependencies
uv sync --extra dev        # Include pytest and wandb
uv sync --extra cloud      # Include runpod for cloud training

Training

uv run src/train.py                                    # Default training
uv run src/train.py --feat bert --bert vinai/phobert-base  # With PhoBERT
uv run src/train.py --wandb --wandb-project bamboo-1   # With W&B logging

Evaluation

uv run src/evaluate.py --model models/bamboo-1         # Evaluate on test set
uv run src/evaluate.py --model models/bamboo-1 --detailed  # Per-relation breakdown
uv run src/evaluate_detailed.py --model models/bamboo-1    # Detailed error analysis (P/R/F1, length, distance)
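The attachment scores reported by these scripts are the standard ones; as a reference point, a minimal sketch of how UAS/LAS are computed (not the project's actual implementation):

```python
def uas_las(gold_heads, gold_rels, pred_heads, pred_rels):
    """UAS: fraction of tokens with the correct head; LAS: correct head AND relation."""
    n = len(gold_heads)
    uas = sum(g == p for g, p in zip(gold_heads, pred_heads)) / n
    las = sum(gh == ph and gr == pr
              for gh, gr, ph, pr in zip(gold_heads, gold_rels, pred_heads, pred_rels)) / n
    return uas, las

# Toy example: 3 tokens, one wrong head on the last token
uas, las = uas_las([2, 0, 2], ["nsubj", "root", "obj"],
                   [2, 0, 1], ["nsubj", "root", "obj"])
```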

Prediction

uv run src/predict.py --model models/bamboo-1              # Interactive mode
uv run src/predict.py --model models/bamboo-1 --text "TΓ΄i yΓͺu Việt Nam"

Architecture

bamboo-1/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ corpus.py          # UDD1Corpus - downloads from HuggingFace, converts to CoNLL-U
β”‚   β”œβ”€β”€ ud_corpus.py       # UD Vietnamese VTB corpus loader
β”‚   β”œβ”€β”€ vndt_corpus.py     # VnDT corpus loader
β”‚   β”œβ”€β”€ inference.py       # Standalone inference module
β”‚   β”œβ”€β”€ train.py           # Training entry point (Click CLI)
β”‚   β”œβ”€β”€ train_phobert.py   # PhoBERT-based training
β”‚   β”œβ”€β”€ evaluate.py        # UAS/LAS evaluation
β”‚   β”œβ”€β”€ evaluate_detailed.py # Per-relation P/R/F1, length & distance analysis
β”‚   β”œβ”€β”€ predict.py         # Inference (interactive, file, or single sentence)
β”‚   └── models/            # Model implementations
β”œβ”€β”€ data/                  # Auto-generated: CoNLL-U files from datasets
└── models/                # Trained model output

Key dependencies:

  • underthesea[deep] provides the Biaffine parser implementation (DependencyParser, DependencyParserTrainer)
  • datasets for loading UDD-1 from HuggingFace
  • click for CLI argument parsing

Model architecture:

  • Word + Character LSTM embeddings (or PhoBERT with --feat bert)
  • 3-layer BiLSTM encoder (400 hidden units)
  • Biaffine attention for arc and relation prediction

Key Implementation Details

  • UDD1Corpus (src/corpus.py): Auto-downloads dataset on first use; converts HuggingFace format to CoNLL-U files
  • VnDTCorpus (src/vndt_corpus.py): Downloads VnDT dataset from GitHub
  • UDVietnameseVTB (src/ud_corpus.py): Downloads UD Vietnamese VTB from Universal Dependencies
  • Scripts use PEP 723 inline dependencies and manual sys.path manipulation to import the src module
  • Training hyperparameters are CLI flags (see --help for each script)
  • Feature types: char (character LSTM), bert (PhoBERT), tag (POS tags)

RunPod

  • API keys: ~/.env (RUNPOD_API_KEY, WANDB_API_KEY, HF_TOKEN)
  • Volume: bamboo-data (hxliy8vnua) - 50GB @ EU-RO-1 (Romania)
  • GPU rαΊ» nhαΊ₯t cΓΉng region (EU-RO-1):
    • RTX A4000 (16G) - $0.17/hr, spot $0.09/hr
    • RTX A4500 (20G) - $0.19/hr, spot $0.10/hr
    • RTX 4000 Ada (20G) - $0.20/hr, spot $0.10/hr
    • RTX 4090 (24G) - $0.34/hr, spot $0.20/hr (High stock)
  • Launch: uv run scripts/runpod_setup.py launch --gpu "NVIDIA RTX A4000" --volume hxliy8vnua