rain1024 committed on
Commit
b85c683
·
0 Parent(s):

Initial commit: Vietnamese dependency parser with Biaffine architecture


- UDD1Corpus for loading UDD-1 dataset from HuggingFace
- Training, evaluation, and prediction scripts
- Docker configuration for containerized training
- Support for character LSTM and PhoBERT features

.dockerignore ADDED
@@ -0,0 +1,39 @@
+# Git
+.git
+.gitignore
+
+# Python
+__pycache__
+*.py[cod]
+*.egg-info
+.eggs
+*.egg
+.venv
+venv
+
+# IDE
+.vscode
+.idea
+*.swp
+
+# Build artifacts
+dist
+build
+*.so
+
+# Models (saved at runtime to network volume)
+models/
+*.pt
+*.bin
+
+# Logs
+wandb/
+*.log
+
+# Environment
+.env
+.env.*
+
+# Docs
+*.md
+!README.md
.gitignore ADDED
@@ -0,0 +1,57 @@
+# Environment
+.env
+.env.*
+
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+
+# Virtual environments
+.venv/
+venv/
+ENV/
+env/
+
+# IDE
+.idea/
+.vscode/
+*.swp
+*.swo
+*~
+
+# Data and models (large files)
+data/
+models/
+tmp/
+*.pt
+*.bin
+*.safetensors
+
+# Logs
+*.log
+wandb/
+
+# Jupyter
+.ipynb_checkpoints/
+
+# OS
+.DS_Store
+Thumbs.db
CLAUDE.md ADDED
@@ -0,0 +1,66 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Project Overview
+
+Bamboo-1 is a Vietnamese dependency parser using the Biaffine architecture (Dozat & Manning, 2017), trained on the UDD-1 dataset from HuggingFace (`undertheseanlp/UDD-1`).
+
+## Commands
+
+### Setup
+```bash
+uv sync                # Install dependencies
+uv sync --extra dev    # Include pytest and wandb
+uv sync --extra cloud  # Include runpod for cloud training
+```
+
+### Training
+```bash
+uv run scripts/train.py                                        # Default training
+uv run scripts/train.py --feat bert --bert vinai/phobert-base  # With PhoBERT
+uv run scripts/train.py --wandb --wandb-project bamboo-1       # With W&B logging
+```
+
+### Evaluation
+```bash
+uv run scripts/evaluate.py --model models/bamboo-1             # Evaluate on test set
+uv run scripts/evaluate.py --model models/bamboo-1 --detailed  # Per-relation breakdown
+```
+
+### Prediction
+```bash
+uv run scripts/predict.py --model models/bamboo-1              # Interactive mode
+uv run scripts/predict.py --model models/bamboo-1 --text "Tôi yêu Việt Nam"
+```
+
+## Architecture
+
+```
+bamboo-1/
+├── bamboo1/
+│   └── corpus.py    # UDD1Corpus - downloads from HuggingFace, converts to CoNLL-U
+├── scripts/
+│   ├── train.py     # Training entry point (Click CLI)
+│   ├── evaluate.py  # UAS/LAS evaluation
+│   └── predict.py   # Inference (interactive, file, or single sentence)
+├── data/            # Auto-generated: CoNLL-U files from UDD-1
+└── models/          # Trained model output
+```
+
+**Key dependencies:**
+- `underthesea[deep]` provides the Biaffine parser implementation (`DependencyParser`, `DependencyParserTrainer`)
+- `datasets` for loading UDD-1 from HuggingFace
+- `click` for CLI argument parsing
+
+**Model architecture:**
+- Word + Character LSTM embeddings (or PhoBERT with `--feat bert`)
+- 3-layer BiLSTM encoder (400 hidden units)
+- Biaffine attention for arc and relation prediction
+
+## Key Implementation Details
+
+- **UDD1Corpus** (`bamboo1/corpus.py`): Auto-downloads the dataset on first use; converts HuggingFace format to CoNLL-U files
+- Scripts use PEP 723 inline dependencies and manual `sys.path` manipulation to import the `bamboo1` module
+- Training hyperparameters are CLI flags (see `--help` for each script)
+- Feature types: `char` (character LSTM), `bert` (PhoBERT), `tag` (POS tags)
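The PEP 723 header plus `sys.path` pattern mentioned in the implementation details above can be sketched as follows. This is a hypothetical minimal example, not the scripts' actual code; the `add_repo_root` helper is illustrative.

```python
# /// script
# requires-python = ">=3.10"
# dependencies = ["click"]
# ///
"""Sketch: make `import bamboo1` work when a file in scripts/ runs directly."""
import sys
from pathlib import Path


def add_repo_root(script_path: str) -> str:
    """Prepend the repo root (two levels above the script) to sys.path."""
    root = str(Path(script_path).resolve().parent.parent)
    if root not in sys.path:
        sys.path.insert(0, root)
    return root
```

A script at `scripts/train.py` would call `add_repo_root(__file__)` before importing `bamboo1`.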
README.md ADDED
@@ -0,0 +1,121 @@
+---
+language:
+- vi
+license: mit
+tags:
+- dependency-parsing
+- vietnamese
+- nlp
+- biaffine
+datasets:
+- undertheseanlp/UDD-1
+library_name: underthesea
+pipeline_tag: token-classification
+---
+
+# Bamboo-1: Vietnamese Dependency Parser
+
+A Vietnamese dependency parser trained on the UDD-1 dataset using the Biaffine architecture.
+
+## Overview
+
+Bamboo-1 is a neural dependency parser for Vietnamese that uses:
+- **Architecture**: Biaffine Dependency Parser (Dozat & Manning, 2017)
+- **Dataset**: UDD-1 (Universal Dependency Dataset for Vietnamese)
+- **Features**: Character-level LSTM embeddings
+
+## Installation
+
+```bash
+git clone https://huggingface.co/undertheseanlp/bamboo-1
+cd bamboo-1
+uv sync
+```
+
+## Usage
+
+### Training
+
+```bash
+# Train with default parameters
+uv run scripts/train.py
+
+# Train with custom parameters
+uv run scripts/train.py --output models/bamboo-1 --max-epochs 200 --feat char
+
+# Train with BERT embeddings
+uv run scripts/train.py --feat bert --bert vinai/phobert-base
+
+# Train with Weights & Biases logging
+uv run scripts/train.py --wandb
+```
+
+### Evaluation
+
+```bash
+# Evaluate trained model
+uv run scripts/evaluate.py --model models/bamboo-1
+```
+
+### Prediction
+
+```bash
+# Interactive prediction
+uv run scripts/predict.py --model models/bamboo-1
+
+# Predict from file
+uv run scripts/predict.py --model models/bamboo-1 --input input.txt --output output.conllu
+```
+
+## Dataset
+
+The UDD-1 dataset is automatically downloaded from HuggingFace:
+- **Source**: `undertheseanlp/UDD-1`
+- **Train**: 18,282 sentences
+- **Validation**: 859 sentences
+- **Test**: 859 sentences
+- **Format**: Universal Dependencies (CoNLL-U)
+
+## Model Architecture
+
+```
+Input: Vietnamese sentence
+        ↓
+Word Embeddings + Character LSTM Embeddings
+        ↓
+BiLSTM Encoder (3 layers, 400 hidden units)
+        ↓
+Biaffine Attention (Arc + Relation)
+        ↓
+Output: Dependency tree (head indices + relation labels)
+```
+
+## Metrics
+
+- **UAS (Unlabeled Attachment Score)**: Percentage of tokens with the correct head
+- **LAS (Labeled Attachment Score)**: Percentage of tokens with the correct head AND relation
+
+## Project Structure
+
+```
+bamboo-1/
+├── README.md
+├── requirements.txt
+├── scripts/
+│   ├── train.py     # Training script
+│   ├── evaluate.py  # Evaluation script
+│   └── predict.py   # Prediction script
+├── bamboo1/
+│   └── corpus.py    # UDD-1 corpus loader
+├── models/          # Trained models (generated)
+└── data/            # Downloaded dataset (generated)
+```
+
+## References
+
+- [UDD-1 Dataset](https://huggingface.co/datasets/undertheseanlp/UDD-1)
+- [Underthesea NLP Toolkit](https://github.com/undertheseanlp/underthesea)
+- [Deep Biaffine Attention for Neural Dependency Parsing](https://arxiv.org/abs/1611.01734)
+
+## License
+
+MIT License
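The two metrics in the README are token-level accuracies over predicted heads and relations. A minimal standalone sketch (a hypothetical helper, not the evaluation script's actual code):

```python
def attachment_scores(gold_heads, gold_rels, pred_heads, pred_rels):
    """Compute UAS and LAS over flat per-token lists of heads and relations."""
    assert len(gold_heads) == len(pred_heads) == len(gold_rels) == len(pred_rels)
    total = len(gold_heads)
    # UAS: head index matches
    correct_head = sum(g == p for g, p in zip(gold_heads, pred_heads))
    # LAS: head index AND relation label both match
    correct_both = sum(
        gh == ph and gr == pr
        for gh, ph, gr, pr in zip(gold_heads, pred_heads, gold_rels, pred_rels)
    )
    return correct_head / total, correct_both / total
```

For example, if the parser gets 2 of 3 heads right and the matching relations right, both scores are 2/3.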
RUNPOD.md ADDED
@@ -0,0 +1,141 @@
+# Training on RunPod
+
+Guide for training the Bamboo-1 Vietnamese Dependency Parser on RunPod.
+
+## Option 1: Manual Setup (Web UI)
+
+### 1. Create a Pod
+
+1. Go to [RunPod Console](https://runpod.io/console/pods)
+2. Click "Deploy"
+3. Select a GPU (recommended: RTX A4000 or RTX 3090)
+4. Choose template: `runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04`
+5. Set disk size: 20GB+
+6. **Expose TCP port 22** (for SSH)
+7. **Add your SSH public key** to the `PUBLIC_KEY` env variable (see Best Practices)
+8. Deploy
+
+### 2. Connect and Train
+
+```bash
+# SSH into the pod or use the Web Terminal
+
+# Install uv
+curl -LsSf https://astral.sh/uv/install.sh | sh
+source $HOME/.local/bin/env
+
+# Clone repo
+git clone https://huggingface.co/undertheseanlp/bamboo-1
+cd bamboo-1
+
+# Install dependencies
+uv sync
+
+# Train with character embeddings
+uv run scripts/train.py --output models/bamboo-1-char --feat char --max-epochs 100
+
+# Or train with BERT (PhoBERT)
+uv run scripts/train.py --output models/bamboo-1-bert --feat bert --max-epochs 50
+```
+
+### 3. Upload Model
+
+```bash
+# Login to HuggingFace
+huggingface-cli login
+
+# Upload trained model
+hf upload undertheseanlp/bamboo-1 models/bamboo-1-char models/bamboo-1-char
+```
+
+## Option 2: RunPod API
+
+### 1. Setup
+
+```bash
+# Install runpod SDK
+uv pip install runpod
+
+# Set API key
+export RUNPOD_API_KEY="your-api-key"
+```
+
+### 2. Launch Training
+
+```bash
+uv run scripts/runpod_setup.py launch --gpu "NVIDIA RTX A4000"
+```
+
+### 3. Monitor
+
+```bash
+# Check status
+uv run scripts/runpod_setup.py status
+
+# Stop when done
+uv run scripts/runpod_setup.py stop <pod-id>
+```
+
+## Option 3: One-liner
+
+SSH into any RunPod instance and run:
+
+```bash
+curl -LsSf https://astral.sh/uv/install.sh | sh && source $HOME/.local/bin/env && git clone https://huggingface.co/undertheseanlp/bamboo-1 && cd bamboo-1 && uv sync && uv run scripts/train.py --output models/bamboo-1-char
+```
+
+## GPU Recommendations
+
+| GPU | VRAM | Batch Size | Est. Time |
+|-----|------|------------|-----------|
+| RTX 3090 | 24GB | 5000 | ~2-3 hours |
+| RTX A4000 | 16GB | 3000 | ~3-4 hours |
+| RTX A5000 | 24GB | 5000 | ~2-3 hours |
+| A100 | 40GB | 8000 | ~1-2 hours |
+
+## Training with Weights & Biases
+
+```bash
+# Login to W&B
+wandb login
+
+# Train with logging
+uv run scripts/train.py --output models/bamboo-1-char --wandb --wandb-project bamboo-1
+```
+
+## Cost Estimate
+
+- RTX A4000: ~$0.20/hour → ~$0.80 for full training
+- RTX 3090: ~$0.30/hour → ~$0.90 for full training
+- A100: ~$1.50/hour → ~$2.25 for full training
+
+## Best Practices
+
+### Always enable SSH when creating a pod
+
+When creating a new pod, SSH **must** be configured so you can watch logs:
+
+1. **Expose port 22 (TCP)** in the "Expose Ports" section
+2. **Add your SSH public key** to the `PUBLIC_KEY` environment variable
+
+```bash
+# Get your public key from the local machine
+cat ~/.ssh/id_rsa.pub
+```
+
+Without SSH:
+- You cannot SSH into the pod to view logs
+- Only the Web Terminal is available (slow and inconvenient)
+- The RunPod API does not support viewing logs directly
+
+### Check GPU utilization
+
+After creating a pod, verify that the GPU is actually being used:
+
+```python
+import runpod
+pods = runpod.get_pods()
+# If gpuUtilPercent stays at 0% for a long time, training has not started or has already finished
+```
+
+This avoids wasting money while the GPU sits idle.
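The idle-GPU check can be wrapped in a small filter. This is a sketch: it assumes each pod is a flat dict carrying a `gpuUtilPercent` key, as in the comment above; the real RunPod SDK response may nest this field differently, so verify against an actual `get_pods()` result.

```python
def idle_pods(pods, threshold=0):
    """Return pods whose reported GPU utilization is at or below threshold.

    Assumes each pod dict exposes 'gpuUtilPercent' (field name taken from
    the note above); pods missing the field are treated as idle.
    """
    return [p for p in pods if p.get("gpuUtilPercent", 0) <= threshold]
```

Running this periodically and stopping any pod it flags is one way to avoid paying for idle GPUs.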
bamboo1/__init__.py ADDED
@@ -0,0 +1,6 @@
+"""Bamboo-1: Vietnamese Dependency Parser trained on UDD-1."""
+
+from bamboo1.corpus import UDD1Corpus
+
+__all__ = ["UDD1Corpus"]
+__version__ = "0.1.0"
bamboo1/corpus.py ADDED
@@ -0,0 +1,158 @@
+"""
+UDD-1 Corpus loader for dependency parsing.
+
+This module provides a corpus class that downloads the UDD-1 dataset from
+HuggingFace and converts it to CoNLL-U format for use with the underthesea
+dependency parser trainer.
+"""
+
+from pathlib import Path
+
+
+class UDD1Corpus:
+    """
+    Corpus class for the UDD-1 (Universal Dependency Dataset) for Vietnamese.
+
+    This class downloads the UDD-1 dataset from HuggingFace and converts it to
+    CoNLL-U format files that can be used with the underthesea ParserTrainer.
+
+    Attributes:
+        train: Path to the training data file (CoNLL-U format)
+        dev: Path to the development/validation data file (CoNLL-U format)
+        test: Path to the test data file (CoNLL-U format)
+
+    Example:
+        >>> from bamboo1.corpus import UDD1Corpus
+        >>> corpus = UDD1Corpus()
+        >>> print(corpus.train)  # Path to train.conllu
+    """
+
+    name = "UDD-1"
+
+    def __init__(self, data_dir: str = None, force_download: bool = False):
+        """
+        Initialize the UDD-1 corpus.
+
+        Args:
+            data_dir: Directory to store the converted CoNLL-U files.
+                Defaults to ./data/UDD-1
+            force_download: If True, re-download and convert even if files exist.
+        """
+        if data_dir is None:
+            data_dir = Path(__file__).parent.parent / "data" / "UDD-1"
+        self.data_dir = Path(data_dir)
+        self.data_dir.mkdir(parents=True, exist_ok=True)
+
+        self._train = self.data_dir / "train.conllu"
+        self._dev = self.data_dir / "dev.conllu"
+        self._test = self.data_dir / "test.conllu"
+
+        if force_download or not self._files_exist():
+            self._download_and_convert()
+
+    def _files_exist(self) -> bool:
+        """Check if all required files exist."""
+        return self._train.exists() and self._dev.exists() and self._test.exists()
+
+    def _download_and_convert(self):
+        """Download UDD-1 from HuggingFace and convert to CoNLL-U format."""
+        # Lazy import - only needed when downloading
+        from datasets import load_dataset
+
+        print("Downloading UDD-1 dataset from HuggingFace...")
+        dataset = load_dataset("undertheseanlp/UDD-1")
+
+        print("Converting to CoNLL-U format...")
+        self._convert_split(dataset["train"], self._train)
+        self._convert_split(dataset["validation"], self._dev)
+        self._convert_split(dataset["test"], self._test)
+
+        print(f"Dataset saved to {self.data_dir}")
+        print(f"  Train: {len(dataset['train'])} sentences")
+        print(f"  Dev:   {len(dataset['validation'])} sentences")
+        print(f"  Test:  {len(dataset['test'])} sentences")
+
+    def _convert_split(self, split, output_path: Path):
+        """Convert a dataset split to CoNLL-U format."""
+        with open(output_path, "w", encoding="utf-8") as f:
+            for item in split:
+                sent_id = item.get("sent_id", "")
+                text = item.get("text", "")
+
+                if sent_id:
+                    f.write(f"# sent_id = {sent_id}\n")
+                if text:
+                    f.write(f"# text = {text}\n")
+
+                tokens = item["tokens"]
+                lemmas = item.get("lemmas", ["_"] * len(tokens))
+                upos = item["upos"]
+                xpos = item.get("xpos", ["_"] * len(tokens))
+                feats = item.get("feats", ["_"] * len(tokens))
+                heads = item["head"]
+                deprels = item["deprel"]
+                deps = item.get("deps", ["_"] * len(tokens))
+                misc = item.get("misc", ["_"] * len(tokens))
+
+                for i in range(len(tokens)):
+                    token_id = i + 1
+                    form = tokens[i]
+                    lemma = lemmas[i] if lemmas[i] else "_"
+                    upos_tag = upos[i] if upos[i] else "_"
+                    xpos_tag = xpos[i] if xpos[i] else "_"
+                    feat = feats[i] if feats[i] else "_"
+                    head = int(heads[i]) if heads[i] else 0
+                    deprel = deprels[i] if deprels[i] else "_"
+                    dep = deps[i] if deps[i] else "_"
+                    misc_val = misc[i] if misc[i] else "_"
+
+                    line = f"{token_id}\t{form}\t{lemma}\t{upos_tag}\t{xpos_tag}\t{feat}\t{head}\t{deprel}\t{dep}\t{misc_val}"
+                    f.write(line + "\n")
+
+                f.write("\n")
+
+    @property
+    def train(self) -> str:
+        """Path to training data file."""
+        return str(self._train)
+
+    @property
+    def dev(self) -> str:
+        """Path to development/validation data file."""
+        return str(self._dev)
+
+    @property
+    def test(self) -> str:
+        """Path to test data file."""
+        return str(self._test)
+
+    def get_statistics(self) -> dict:
+        """Get dataset statistics."""
+        # Lazy import - only needed for statistics
+        from datasets import load_dataset
+
+        dataset = load_dataset("undertheseanlp/UDD-1")
+
+        stats = {
+            "train_sentences": len(dataset["train"]),
+            "dev_sentences": len(dataset["validation"]),
+            "test_sentences": len(dataset["test"]),
+            "train_tokens": sum(len(item["tokens"]) for item in dataset["train"]),
+            "dev_tokens": sum(len(item["tokens"]) for item in dataset["validation"]),
+            "test_tokens": sum(len(item["tokens"]) for item in dataset["test"]),
+        }
+
+        all_upos = set()
+        all_deprels = set()
+        for split in ["train", "validation", "test"]:
+            for item in dataset[split]:
+                all_upos.update(item["upos"])
+                all_deprels.update(item["deprel"])
+
+        stats["num_upos_tags"] = len(all_upos)
+        stats["num_deprels"] = len(all_deprels)
+        stats["upos_tags"] = sorted(all_upos)
+        stats["deprels"] = sorted(all_deprels)
+
+        return stats
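A quick way to sanity-check the converter's output is to parse one emitted token line back into its ten CoNLL-U columns. This standalone sketch mirrors the field order used in `_convert_split` above (the `parse_conllu_line` helper is illustrative, not part of the repository):

```python
def parse_conllu_line(line):
    """Split a CoNLL-U token line into its ten named columns."""
    cols = line.rstrip("\n").split("\t")
    names = ["id", "form", "lemma", "upos", "xpos",
             "feats", "head", "deprel", "deps", "misc"]
    # Every token line the converter writes has exactly ten tab-separated fields
    assert len(cols) == 10, f"expected 10 columns, got {len(cols)}"
    return dict(zip(names, cols))
```

For example, a line written by the converter for the token "Tôi" with head 2 and relation `nsubj` round-trips to `{"form": "Tôi", "head": "2", "deprel": "nsubj", ...}`.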
docker/Dockerfile ADDED
@@ -0,0 +1,57 @@
+# Dockerfile for Bamboo-1 Vietnamese Dependency Parser Training
+# Optimized for RunPod deployment
+#
+# Build:
+#   docker build -t bamboo-1:latest -f docker/Dockerfile .
+#
+# Push to Docker Hub:
+#   docker tag bamboo-1:latest <username>/bamboo-1:latest
+#   docker push <username>/bamboo-1:latest
+#
+# RunPod Usage:
+#   - Set image to: <username>/bamboo-1:latest
+#   - Network volume mount: /runpod-volume
+#   - Models saved to: /runpod-volume/models
+#
+# Training commands:
+#   uv run scripts/train.py
+#   uv run scripts/train.py --wandb --wandb-project bamboo-1
+
+# RunPod optimized base image
+# - PyTorch 2.6.0 + CUDA 12.8.1
+# - Python 3.9-3.13 (default 3.12)
+# - JupyterLab, SSH, NGINX pre-installed
+# - uv package manager included
+FROM runpod/pytorch:1.0.2-cu1281-torch260-ubuntu2204
+
+LABEL maintainer="underthesea"
+LABEL description="Bamboo-1 Vietnamese Dependency Parser - RunPod Training"
+
+# Environment variables
+ENV PYTHONUNBUFFERED=1
+
+# Set working directory
+WORKDIR /workspace/bamboo-1
+
+# Copy dependency files first (for Docker layer cache)
+COPY pyproject.toml uv.lock ./
+COPY docker/requirements.txt ./
+
+# Install dependencies with uv
+# Only click and tqdm are needed - PyTorch is in the base image, data is pre-included
+RUN uv pip install --system -r requirements.txt
+
+# Copy project source code
+COPY bamboo1/ ./bamboo1/
+COPY scripts/ ./scripts/
+
+# Copy pre-processed data (UDD-1 CoNLL-U files, ~22MB)
+# No need for the datasets library at runtime
+COPY data/ ./data/
+
+# Create symlink for models to persist on the RunPod network volume
+RUN mkdir -p /runpod-volume/bamboo-1/models && \
+    ln -sf /runpod-volume/bamboo-1/models models
+
+# Default command - start training
+CMD ["uv", "run", "scripts/train.py"]
docker/requirements.txt ADDED
@@ -0,0 +1,5 @@
+# Docker requirements for training
+# - PyTorch: pre-installed in base image
+# - datasets: not needed, data pre-included in image
+click>=8.0.0
+tqdm>=4.60.0
pyproject.toml ADDED
@@ -0,0 +1,29 @@
+[project]
+name = "bamboo-1"
+version = "0.1.0"
+description = "Vietnamese Dependency Parser trained on UDD-1 dataset"
+readme = "README.md"
+requires-python = ">=3.10"
+dependencies = [
+    "torch>=2.0.0",
+    "datasets>=2.14.0",
+    "click>=8.0.0",
+    "underthesea>=9.2.0",
+    "transformers>=4.30.0",
+]
+
+[project.optional-dependencies]
+dev = [
+    "pytest>=7.0.0",
+    "wandb>=0.15.0",
+]
+cloud = [
+    "runpod>=1.6.0",
+]
+
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+
+[tool.hatch.build.targets.wheel]
+packages = ["bamboo1"]
requirements.txt ADDED
@@ -0,0 +1,5 @@
+underthesea[deep]>=6.8.0
+datasets>=2.14.0
+click>=8.0.0
+torch>=2.0.0
+transformers>=4.30.0
scripts/cost_estimate.py ADDED
@@ -0,0 +1,534 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # /// script
2
+ # requires-python = ">=3.10"
3
+ # dependencies = ["runpod", "python-dotenv", "click"]
4
+ # ///
5
+ """
6
+ Cost estimation utilities for cloud GPU training.
7
+
8
+ Usage:
9
+ from cost_estimate import CostTracker
10
+
11
+ tracker = CostTracker(gpu_type="RTX_A4000")
12
+ tracker.start()
13
+ # ... training loop ...
14
+ tracker.update(epoch=1, total_epochs=100)
15
+ tracker.summary()
16
+ """
17
+
18
+ import time
19
+ from dataclasses import dataclass
20
+ from typing import Optional
21
+
22
+
23
+ # GPU pricing per hour (USD) - RunPod on-demand prices
24
+ GPU_PRICES = {
25
+ "RTX_A4000": 0.20,
26
+ "RTX_A5000": 0.28,
27
+ "RTX_3090": 0.22,
28
+ "RTX_4090": 0.44,
29
+ "A40": 0.39,
30
+ "A100_40GB": 1.09,
31
+ "A100_80GB": 1.59,
32
+ "H100": 2.49,
33
+ "CPU": 0.0, # No GPU cost for CPU-only
34
+ }
35
+
36
+
37
+ def detect_cloud_provider() -> str:
38
+ """Detect cloud provider from environment or metadata."""
39
+ import os
40
+
41
+ # Check environment variables first (most reliable)
42
+ if os.getenv("RUNPOD_POD_ID"):
43
+ return "runpod"
44
+ if os.getenv("LINODE_ID") or os.getenv("LINODE_DATACENTER_ID"):
45
+ return "linode"
46
+ if os.getenv("AWS_EXECUTION_ENV") or os.getenv("AWS_REGION"):
47
+ return "aws"
48
+ if os.getenv("GOOGLE_CLOUD_PROJECT") or os.getenv("GCP_PROJECT"):
49
+ return "gcp"
50
+ if os.getenv("AZURE_CLIENT_ID") or os.getenv("MSI_ENDPOINT"):
51
+ return "azure"
52
+ if os.getenv("LAMBDA_LABS_API_KEY"):
53
+ return "lambda"
54
+ if os.getenv("VAST_CONTAINERLABEL"):
55
+ return "vast"
56
+ if os.getenv("COLAB_GPU"):
57
+ return "colab"
58
+ if os.getenv("KAGGLE_KERNEL_RUN_TYPE"):
59
+ return "kaggle"
60
+
61
+ # Check for cloud-specific metadata endpoints
62
+ try:
63
+ import subprocess
64
+
65
+ # Check Linode metadata (uses same IP but different path)
66
+ result = subprocess.run(
67
+ ["curl", "-s", "-m", "1", "http://169.254.169.254/v1/instance"],
68
+ capture_output=True, timeout=2
69
+ )
70
+ if result.returncode == 0 and b"instance" in result.stdout.lower():
71
+ return "linode"
72
+
73
+ # Check for AWS metadata
74
+ result = subprocess.run(
75
+ ["curl", "-s", "-m", "1", "http://169.254.169.254/latest/meta-data/ami-id"],
76
+ capture_output=True, timeout=2
77
+ )
78
+ if result.returncode == 0 and b"ami-" in result.stdout:
79
+ return "aws"
80
+
81
+ # Check GCP metadata
82
+ result = subprocess.run(
83
+ ["curl", "-s", "-m", "1", "-H", "Metadata-Flavor: Google",
84
+ "http://metadata.google.internal/computeMetadata/v1/"],
85
+ capture_output=True, timeout=2
86
+ )
87
+ if result.returncode == 0 and result.stdout:
88
+ return "gcp"
89
+
90
+ except Exception:
91
+ pass
92
+
93
+ # Check /etc files for cloud hints
94
+ try:
95
+ with open("/etc/hostname", "r") as f:
96
+ hostname = f.read().lower()
97
+ if "linode" in hostname:
98
+ return "linode"
99
+ except Exception:
100
+ pass
101
+
102
+ # Check sys_vendor (most reliable for Linode)
103
+ try:
104
+ with open("/sys/class/dmi/id/sys_vendor", "r") as f:
105
+ vendor = f.read().strip().lower()
106
+ if "linode" in vendor:
107
+ return "linode"
108
+ if "amazon" in vendor:
109
+ return "aws"
110
+ if "google" in vendor:
111
+ return "gcp"
112
+ if "microsoft" in vendor:
113
+ return "azure"
114
+ except Exception:
115
+ pass
116
+
117
+ # Check product_name as fallback
118
+ try:
119
+ import subprocess
120
+ result = subprocess.run(
121
+ ["cat", "/sys/class/dmi/id/product_name"],
122
+ capture_output=True, timeout=2
123
+ )
124
+ if result.returncode == 0:
125
+ product = result.stdout.decode().lower()
126
+ if "linode" in product:
127
+ return "linode"
128
+ if "amazon" in product or "ec2" in product:
129
+ return "aws"
130
+ if "google" in product:
131
+ return "gcp"
132
+ except Exception:
133
+ pass
134
+
135
+ return "local"
136
+
137
+
138
+ @dataclass
139
+ class HardwareInfo:
140
+ """Detected hardware information."""
141
+ device_type: str # "cuda" or "cpu"
142
+ gpu_name: Optional[str] = None
143
+ gpu_memory_gb: Optional[float] = None
144
+ cpu_name: Optional[str] = None
145
+ cpu_cores: Optional[int] = None
146
+ ram_gb: Optional[float] = None
147
+ cloud_provider: str = "local"
148
+
149
+ def get_gpu_type(self) -> str:
150
+ """Map detected GPU to pricing category."""
151
+ if self.device_type == "cpu" or not self.gpu_name:
152
+ return "CPU"
153
+
154
+ name = self.gpu_name.upper()
155
+
156
+ # Match known GPU types
157
+ if "H100" in name:
158
+ return "H100"
159
+ elif "A100" in name:
160
+ if self.gpu_memory_gb and self.gpu_memory_gb > 50:
161
+ return "A100_80GB"
162
+ return "A100_40GB"
163
+ elif "A40" in name:
164
+ return "A40"
165
+ elif "4090" in name:
166
+ return "RTX_4090"
167
+ elif "3090" in name:
168
+ return "RTX_3090"
169
+ elif "A5000" in name:
170
+ return "RTX_A5000"
171
+ elif "A4000" in name:
172
+ return "RTX_A4000"
173
+ else:
174
+ return "RTX_A4000" # Default fallback
175
+
176
+ def to_dict(self) -> dict:
177
+ """Convert to dictionary for logging."""
178
+ return {
179
+ "device_type": self.device_type,
180
+ "gpu_name": self.gpu_name,
181
+ "gpu_memory_gb": self.gpu_memory_gb,
182
+ "cpu_name": self.cpu_name,
183
+ "cpu_cores": self.cpu_cores,
184
+ "ram_gb": self.ram_gb,
185
+ "gpu_type": self.get_gpu_type(),
186
+ "cloud_provider": self.cloud_provider,
187
+ }
188
+
189
+ def __str__(self) -> str:
190
+ provider = f"[{self.cloud_provider}] " if self.cloud_provider != "local" else ""
191
+ if self.device_type == "cuda" and self.gpu_name:
192
+ mem = f" ({self.gpu_memory_gb:.1f}GB)" if self.gpu_memory_gb else ""
193
+ return f"{provider}{self.gpu_name}{mem}"
194
+ else:
195
+ ram = f", {self.ram_gb:.1f}GB RAM" if self.ram_gb else ""
196
+ return f"{provider}CPU: {self.cpu_name or 'Unknown'} ({self.cpu_cores} cores{ram})"
197
+
198
+
199
+ def detect_hardware() -> HardwareInfo:
200
+ """Detect available hardware (GPU/CPU) and cloud provider."""
201
+ import platform
202
+ import os
203
+
204
+ # Detect cloud provider
205
+ cloud_provider = detect_cloud_provider()
206
+
207
+ # Get CPU info
208
+ cpu_name = platform.processor() or "Unknown"
209
+ cpu_cores = os.cpu_count()
210
+
211
+ # Get RAM
212
+ try:
213
+ import subprocess
214
+ if platform.system() == "Linux":
215
+ mem_info = subprocess.check_output(["free", "-b"]).decode()
216
+ ram_bytes = int(mem_info.split("\n")[1].split()[1])
217
+ ram_gb = ram_bytes / (1024**3)
218
+ else:
219
+ ram_gb = None
220
+ except Exception:
221
+ ram_gb = None
222
+
223
+ # Try to detect GPU with torch
224
+ try:
225
+ import torch
226
+ if torch.cuda.is_available():
227
+ gpu_name = torch.cuda.get_device_name(0)
228
+ gpu_memory_gb = torch.cuda.get_device_properties(0).total_memory / (1024**3)
229
+ return HardwareInfo(
230
+ device_type="cuda",
231
+ gpu_name=gpu_name,
232
+ gpu_memory_gb=gpu_memory_gb,
233
+ cpu_name=cpu_name,
234
+ cpu_cores=cpu_cores,
235
+ ram_gb=ram_gb,
236
+ cloud_provider=cloud_provider,
237
+ )
238
+ except Exception:
239
+ pass
240
+
241
+ return HardwareInfo(
242
+ device_type="cpu",
243
+ cpu_name=cpu_name,
244
+ cpu_cores=cpu_cores,
245
+ ram_gb=ram_gb,
246
+ cloud_provider=cloud_provider,
247
+ )
248
+
249
+
250
+ @dataclass
251
+ class CostTracker:
252
+ """Track training time and estimate costs."""
253
+
254
+ gpu_type: str = "RTX_A4000"
255
+
256
+ def __post_init__(self):
257
+ self.start_time: Optional[float] = None
258
+ self.hourly_rate = GPU_PRICES.get(self.gpu_type, 0.20)
259
+ self.last_report_time: Optional[float] = None
260
+ self.report_interval = 300 # Report every 5 minutes
261
+
262
+ def start(self):
263
+ """Start the cost tracker."""
264
+ self.start_time = time.time()
265
+ self.last_report_time = self.start_time
266
+
267
+ def elapsed_seconds(self) -> float:
268
+ """Get elapsed time in seconds."""
269
+ if self.start_time is None:
270
+ return 0
271
+ return time.time() - self.start_time
272
+
273
+ def elapsed_hours(self) -> float:
274
+ """Get elapsed time in hours."""
275
+ return self.elapsed_seconds() / 3600
276
+
277
+ def current_cost(self) -> float:
278
+ """Get current cost in USD."""
279
+ return self.elapsed_hours() * self.hourly_rate
280
+
281
+ def estimate_total_cost(self, progress: float) -> float:
282
+ """
283
+ Estimate total cost based on current progress.
284
+
285
+ Args:
286
+ progress: Training progress (0.0 to 1.0)
287
+ """
288
+ if progress <= 0:
289
+ return 0
290
+ return self.current_cost() / progress
291
+
292
+ def estimate_remaining_cost(self, progress: float) -> float:
293
+ """Estimate remaining cost."""
294
+ return self.estimate_total_cost(progress) - self.current_cost()
295
+
296
+ def estimate_remaining_time(self, progress: float) -> float:
297
+ """Estimate remaining time in seconds."""
298
+ if progress <= 0:
299
+ return 0
300
+ elapsed = self.elapsed_seconds()
301
+ total_time = elapsed / progress
302
+ return total_time - elapsed
303
+
304
+ def format_time(self, seconds: float) -> str:
305
+ """Format seconds to human readable string."""
306
+ if seconds < 60:
307
+ return f"{seconds:.0f}s"
308
+ elif seconds < 3600:
309
+ mins = seconds / 60
310
+ return f"{mins:.1f}m"
311
+ else:
312
+ hours = seconds / 3600
313
+ return f"{hours:.1f}h"
314
+
315
+ def format_cost(self, cost: float) -> str:
316
+ """Format cost to human readable string."""
317
+ if cost < 0.01:
318
+ return f"${cost:.4f}"
319
+ elif cost < 1:
320
+ return f"${cost:.3f}"
321
+ else:
322
+ return f"${cost:.2f}"
323
+
324
+ def should_report(self) -> bool:
325
+ """Check if it's time to report costs."""
326
+ if self.last_report_time is None:
327
+ return True
328
+ return time.time() - self.last_report_time >= self.report_interval
329
+
330
+ def get_status(self, epoch: int, total_epochs: int) -> str:
331
+ """Get formatted status string with cost info."""
332
+ progress = epoch / total_epochs if total_epochs > 0 else 0
333
+
334
+ current = self.current_cost()
335
+ estimated_total = self.estimate_total_cost(progress)
336
+ remaining_time = self.estimate_remaining_time(progress)
337
+
338
+ return (
339
+ f"Cost: {self.format_cost(current)} | "
340
+ f"Est. total: {self.format_cost(estimated_total)} | "
341
+ f"ETA: {self.format_time(remaining_time)}"
342
+ )
343
+
344
+ def update(self, epoch: int, total_epochs: int, force: bool = False) -> Optional[str]:
345
+ """
346
+ Update and optionally return status if report interval passed.
347
+
348
+ Returns status string if it's time to report, None otherwise.
349
+ """
350
+ if force or self.should_report():
351
+ self.last_report_time = time.time()
352
+ return self.get_status(epoch, total_epochs)
353
+ return None
354
+
355
+ def summary(self, epoch: int, total_epochs: int) -> str:
356
+ """Get final summary."""
357
+ progress = epoch / total_epochs if total_epochs > 0 else 1.0
358
+ elapsed = self.elapsed_seconds()
359
+ cost = self.current_cost()
360
+
361
+ lines = [
362
+ "=" * 50,
363
+ "Cost Summary",
364
+ "=" * 50,
365
+ f" GPU: {self.gpu_type} (${self.hourly_rate}/hr)",
366
+ f" Duration: {self.format_time(elapsed)}",
367
+ f" Total cost: {self.format_cost(cost)}",
368
+ ]
369
+
370
+ if progress < 1.0:
371
+ estimated = self.estimate_total_cost(progress)
372
+ lines.append(f" Est. full training: {self.format_cost(estimated)}")
373
+
374
+ lines.append("=" * 50)
375
+ return "\n".join(lines)
376
+
377
+
378
+ def get_runpod_costs(pod_id: str | None = None) -> list[dict]:
379
+ """Get cost info from RunPod API using GraphQL for accurate uptime."""
380
+ import os
381
+ import requests
382
+ from dotenv import load_dotenv
383
+
384
+ load_dotenv()
385
+ api_key = os.getenv("RUNPOD_API_KEY")
386
+
387
+ # Use GraphQL for accurate runtime data
388
+ query = """
389
+ query getMyPods {
390
+ myself {
391
+ pods {
392
+ id
393
+ name
394
+ desiredStatus
395
+ costPerHr
396
+ machine { gpuDisplayName }
397
+ runtime {
398
+ uptimeInSeconds
399
+ gpus { gpuUtilPercent memoryUtilPercent }
400
+ }
401
+ }
402
+ }
403
+ }
404
+ """
405
+
406
+ response = requests.post(
407
+ "https://api.runpod.io/graphql",
408
+ headers={"Authorization": f"Bearer {api_key}"},
409
+ json={"query": query}
410
+ )
411
+ data = response.json()
412
+ pods = data.get("data", {}).get("myself", {}).get("pods", [])
413
+
414
+ if pod_id:
415
+ pods = [p for p in pods if p["id"] == pod_id]
416
+
417
+ results = []
418
+ for pod in pods:
419
+ if pod.get("desiredStatus") != "RUNNING":
420
+ continue
421
+
422
+ cost_per_hr = pod.get("costPerHr", 0)
423
+ runtime = pod.get("runtime") or {}
424
+ uptime_seconds = runtime.get("uptimeInSeconds", 0)
425
+ uptime_hours = uptime_seconds / 3600
426
+ current_cost = cost_per_hr * uptime_hours
427
+
428
+ gpus = runtime.get("gpus") or []
429
+ gpu_util = gpus[0].get("gpuUtilPercent", 0) if gpus else 0
430
+ mem_util = gpus[0].get("memoryUtilPercent", 0) if gpus else 0
431
+
432
+ results.append({
433
+ "id": pod["id"],
434
+ "name": pod.get("name", "N/A"),
435
+ "gpu": (pod.get("machine") or {}).get("gpuDisplayName", "N/A"),
436
+ "cost_per_hr": cost_per_hr,
437
+ "uptime_seconds": uptime_seconds,
438
+ "uptime_hours": uptime_hours,
439
+ "current_cost": current_cost,
440
+ "gpu_util": gpu_util,
441
+ "mem_util": mem_util,
442
+ })
443
+
444
+ return results
445
+
446
+
447
+ def print_runpod_report(pods: list[dict], estimate_hours: float = None):
448
+ """Print RunPod cost report."""
449
+ import click
450
+ from datetime import datetime
451
+
452
+ if not pods:
453
+ click.echo("No running pods found.")
454
+ return
455
+
456
+ click.echo(f"\n{'='*60}")
457
+ click.echo(f" RunPod Cost Report - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
458
+ click.echo(f"{'='*60}\n")
459
+
460
+ total_current = 0
461
+ total_per_hr = 0
462
+
463
+ for pod in pods:
464
+ uptime_str = f"{pod['uptime_seconds']:.0f}s" if pod['uptime_seconds'] < 60 else (f"{pod['uptime_seconds']/60:.1f}m" if pod['uptime_seconds'] < 3600 else f"{pod['uptime_hours']:.1f}h")
465
+ click.echo(f"Pod: {pod['name']} ({pod['id']})")
466
+ click.echo(f" GPU: {pod['gpu']} @ ${pod['cost_per_hr']:.2f}/hr")
467
+ click.echo(f" Uptime: {uptime_str}")
468
+ click.echo(f" Current Cost: ${pod['current_cost']:.4f}")
469
+ click.echo(f" GPU: {pod['gpu_util']:.0f}% | Mem: {pod['mem_util']:.0f}%")
470
+
471
+ if estimate_hours:
472
+ est_total = pod['cost_per_hr'] * estimate_hours
473
+ remaining_hrs = max(0, estimate_hours - pod['uptime_hours'])
474
+ click.echo(f" Est. Total ({estimate_hours}h): ${est_total:.2f} (remaining: ${pod['cost_per_hr'] * remaining_hrs:.2f})")
475
+
476
+ click.echo()
477
+ total_current += pod['current_cost']
478
+ total_per_hr += pod['cost_per_hr']
479
+
480
+ click.echo(f"{'-'*60}")
481
+ click.echo(f"TOTAL: ${total_current:.4f} (${total_per_hr:.2f}/hr)")
482
+ if estimate_hours:
483
+ click.echo(f"Est. Total ({estimate_hours}h): ${total_per_hr * estimate_hours:.2f}")
484
+ click.echo()
485
+
486
+
487
+ def main():
488
+ """Cost estimation CLI."""
489
+ import click
490
+ import os
491
+
492
+ @click.group()
493
+ def cli():
494
+ """Cost estimation for GPU training."""
495
+ pass
496
+
497
+ @cli.command()
498
+ @click.option("--gpu", default="RTX_A4000", type=click.Choice(list(GPU_PRICES.keys())))
499
+ @click.option("--hours", default=1.0, type=float, help="Estimated training hours")
500
+ def estimate(gpu, hours):
501
+ """Estimate training cost for a GPU."""
502
+ rate = GPU_PRICES[gpu]
503
+ cost = rate * hours
504
+
505
+ click.echo(f"GPU: {gpu}")
506
+ click.echo(f"Rate: ${rate}/hour")
507
+ click.echo(f"Duration: {hours} hours")
508
+ click.echo(f"Estimated cost: ${cost:.2f}")
509
+
510
+ @cli.command()
511
+ @click.option("--pod-id", "-p", help="Specific pod ID")
512
+ @click.option("--watch", "-w", is_flag=True, help="Watch mode (refresh every 10s)")
513
+ @click.option("--estimate", "-e", type=float, help="Estimate total for N hours")
514
+ def monitor(pod_id, watch, estimate):
515
+ """Monitor RunPod costs in real-time."""
516
+ if watch:
517
+ try:
518
+ while True:
519
+ os.system("clear" if os.name != "nt" else "cls")
520
+ pods = get_runpod_costs(pod_id)
521
+ print_runpod_report(pods, estimate)
522
+ click.echo("Press Ctrl+C to exit...")
523
+ time.sleep(10)
524
+ except KeyboardInterrupt:
525
+ click.echo("\nExiting...")
526
+ else:
527
+ pods = get_runpod_costs(pod_id)
528
+ print_runpod_report(pods, estimate)
529
+
530
+ cli()
531
+
532
+
533
+ if __name__ == "__main__":
534
+ main()
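The extrapolation used by `estimate_total_cost` above is just cost-so-far divided by fraction complete; a self-contained sketch of that arithmetic (a hypothetical standalone helper mirroring the method, nothing imported from the script):

```python
# Standalone sketch of CostTracker.estimate_total_cost's math (hypothetical
# helper; mirrors the method above but imports nothing from the script).
def estimate_total_cost(elapsed_hours: float, hourly_rate: float, progress: float) -> float:
    """Extrapolate total training cost from spend so far and progress (0..1)."""
    if progress <= 0:
        return 0.0
    return (elapsed_hours * hourly_rate) / progress

# 0.5h at $0.20/hr with 25% of epochs done extrapolates to $0.40 total.
print(round(estimate_total_cost(0.5, 0.20, 0.25), 4))  # 0.4
```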
scripts/evaluate.py ADDED
@@ -0,0 +1,229 @@
1
+ # /// script
2
+ # requires-python = ">=3.10"
3
+ # dependencies = [
4
+ # "underthesea[deep]>=6.8.0",
5
+ # "datasets>=2.14.0",
6
+ # "click>=8.0.0",
7
+ # "torch>=2.0.0",
8
+ # "transformers>=4.30.0",
9
+ # ]
10
+ # ///
11
+ """
12
+ Evaluation script for Bamboo-1 Vietnamese Dependency Parser.
13
+
14
+ Usage:
15
+ uv run scripts/evaluate.py --model models/bamboo-1
16
+ uv run scripts/evaluate.py --model models/bamboo-1 --split test
17
+ uv run scripts/evaluate.py --model models/bamboo-1 --detailed
18
+ """
19
+
20
+ import sys
21
+ from pathlib import Path
22
+ from collections import Counter
23
+
24
+ import click
25
+
26
+ # Add parent directory to path for bamboo1 module
27
+ sys.path.insert(0, str(Path(__file__).parent.parent))
28
+
29
+ from bamboo1.corpus import UDD1Corpus
30
+
31
+
32
+ def read_conll_sentences(filepath: str):
33
+ """Read sentences from a CoNLL-U file."""
34
+ sentences = []
35
+ current_sentence = []
36
+
37
+ with open(filepath, "r", encoding="utf-8") as f:
38
+ for line in f:
39
+ line = line.strip()
40
+ if line.startswith("#"):
41
+ continue
42
+ if not line:
43
+ if current_sentence:
44
+ sentences.append(current_sentence)
45
+ current_sentence = []
46
+ else:
47
+ parts = line.split("\t")
48
+ if len(parts) >= 8 and "-" not in parts[0] and "." not in parts[0]:
49
+ current_sentence.append({
50
+ "id": int(parts[0]),
51
+ "form": parts[1],
52
+ "upos": parts[3],
53
+ "head": int(parts[6]),
54
+ "deprel": parts[7],
55
+ })
56
+
57
+ if current_sentence:
58
+ sentences.append(current_sentence)
59
+
60
+ return sentences
61
+
62
+
63
+ def calculate_attachment_scores(gold_sentences, pred_sentences):
64
+ """Calculate UAS and LAS scores."""
65
+ total_tokens = 0
66
+ correct_heads = 0
67
+ correct_labels = 0
68
+
69
+ deprel_stats = Counter()
70
+ deprel_correct = Counter()
71
+
72
+ for gold_sent, pred_sent in zip(gold_sentences, pred_sentences):
73
+ for gold_tok, pred_tok in zip(gold_sent, pred_sent):
74
+ total_tokens += 1
75
+ deprel = gold_tok["deprel"]
76
+ deprel_stats[deprel] += 1
77
+
78
+ if gold_tok["head"] == pred_tok["head"]:
79
+ correct_heads += 1
80
+ if gold_tok["deprel"] == pred_tok["deprel"]:
81
+ correct_labels += 1
82
+ deprel_correct[deprel] += 1
83
+
84
+ uas = correct_heads / total_tokens if total_tokens > 0 else 0
85
+ las = correct_labels / total_tokens if total_tokens > 0 else 0
86
+
87
+ per_deprel_scores = {}
88
+ for deprel in deprel_stats:
89
+ if deprel_stats[deprel] > 0:
90
+ per_deprel_scores[deprel] = {
91
+ "total": deprel_stats[deprel],
92
+ "correct": deprel_correct[deprel],
93
+ "accuracy": deprel_correct[deprel] / deprel_stats[deprel],
94
+ }
95
+
96
+ return {
97
+ "uas": uas,
98
+ "las": las,
99
+ "total_tokens": total_tokens,
100
+ "correct_heads": correct_heads,
101
+ "correct_labels": correct_labels,
102
+ "per_deprel": per_deprel_scores,
103
+ }
104
+
105
+
106
+ @click.command()
107
+ @click.option(
108
+ "--model", "-m",
109
+ required=True,
110
+ help="Path to trained model directory",
111
+ )
112
+ @click.option(
113
+ "--split",
114
+ type=click.Choice(["dev", "test", "both"]),
115
+ default="test",
116
+ help="Dataset split to evaluate on",
117
+ show_default=True,
118
+ )
119
+ @click.option(
120
+ "--detailed",
121
+ is_flag=True,
122
+ help="Show detailed per-relation scores",
123
+ )
124
+ @click.option(
125
+ "--output", "-o",
126
+ help="Save predictions to file (CoNLL-U format)",
127
+ )
128
+ def evaluate(model, split, detailed, output):
129
+ """Evaluate Bamboo-1 Vietnamese Dependency Parser on UDD-1 dataset."""
130
+ from underthesea.models.dependency_parser import DependencyParser
131
+
132
+ click.echo("=" * 60)
133
+ click.echo("Bamboo-1: Vietnamese Dependency Parser Evaluation")
134
+ click.echo("=" * 60)
135
+
136
+ # Load model
137
+ click.echo(f"\nLoading model from {model}...")
138
+ parser = DependencyParser.load(model)
139
+
140
+ # Load corpus
141
+ click.echo("Loading UDD-1 corpus...")
142
+ corpus = UDD1Corpus()
143
+
144
+ splits_to_eval = []
145
+ if split == "both":
146
+ splits_to_eval = [("dev", corpus.dev), ("test", corpus.test)]
147
+ elif split == "dev":
148
+ splits_to_eval = [("dev", corpus.dev)]
149
+ else:
150
+ splits_to_eval = [("test", corpus.test)]
151
+
152
+ for split_name, split_path in splits_to_eval:
153
+ click.echo(f"\n{'=' * 40}")
154
+ click.echo(f"Evaluating on {split_name} set: {split_path}")
155
+ click.echo("=" * 40)
156
+
157
+ # Read gold data
158
+ gold_sentences = read_conll_sentences(split_path)
159
+ click.echo(f" Sentences: {len(gold_sentences)}")
160
+ click.echo(f" Tokens: {sum(len(s) for s in gold_sentences)}")
161
+
162
+ # Make predictions
163
+ click.echo("\nMaking predictions...")
164
+ pred_sentences = []
165
+
166
+ for gold_sent in gold_sentences:
167
+ # Reconstruct the sentence from gold tokens; assumes the parser keeps this whitespace tokenization so gold and predicted tokens align one-to-one
168
+ tokens = [tok["form"] for tok in gold_sent]
169
+ text = " ".join(tokens)
170
+
171
+ # Parse
172
+ result = parser.predict(text)
173
+
174
+ # Convert result to same format as gold
175
+ pred_sent = []
176
+ for i, (word, head, deprel) in enumerate(result):
177
+ pred_sent.append({
178
+ "id": i + 1,
179
+ "form": word,
180
+ "head": head,
181
+ "deprel": deprel,
182
+ })
183
+ pred_sentences.append(pred_sent)
184
+
185
+ # Calculate scores
186
+ scores = calculate_attachment_scores(gold_sentences, pred_sentences)
187
+
188
+ click.echo("\nResults:")
189
+ click.echo(f" UAS: {scores['uas']:.4f} ({scores['uas']*100:.2f}%)")
190
+ click.echo(f" LAS: {scores['las']:.4f} ({scores['las']*100:.2f}%)")
191
+ click.echo(f" Total tokens: {scores['total_tokens']}")
192
+ click.echo(f" Correct heads: {scores['correct_heads']}")
193
+ click.echo(f" Correct labels: {scores['correct_labels']}")
194
+
195
+ if detailed:
196
+ click.echo("\nPer-relation scores:")
197
+ click.echo("-" * 50)
198
+ click.echo(f"{'Relation':<15} {'Count':>8} {'Correct':>8} {'Accuracy':>10}")
199
+ click.echo("-" * 50)
200
+
201
+ for deprel in sorted(scores["per_deprel"].keys()):
202
+ stats = scores["per_deprel"][deprel]
203
+ click.echo(
204
+ f"{deprel:<15} {stats['total']:>8} {stats['correct']:>8} "
205
+ f"{stats['accuracy']*100:>9.2f}%"
206
+ )
207
+
208
+ # Save predictions if requested
209
+ if output:
210
+ out_path = Path(output)
211
+ if split_name != "test":
212
+ out_path = out_path.with_stem(f"{out_path.stem}_{split_name}")
213
+
214
+ click.echo(f"\nSaving predictions to {out_path}...")
215
+ with open(out_path, "w", encoding="utf-8") as f:
216
+ for i, (gold_sent, pred_sent) in enumerate(zip(gold_sentences, pred_sentences)):
217
+ f.write(f"# sent_id = {i + 1}\n")
218
+ for gold_tok, pred_tok in zip(gold_sent, pred_sent):
219
+ f.write(
220
+ f"{gold_tok['id']}\t{gold_tok['form']}\t_\t{gold_tok['upos']}\t_\t_\t"
221
+ f"{pred_tok['head']}\t{pred_tok['deprel']}\t_\t_\n"
222
+ )
223
+ f.write("\n")
224
+
225
+ click.echo("\nEvaluation complete!")
226
+
227
+
228
+ if __name__ == "__main__":
229
+ evaluate()
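A tiny worked example of the UAS/LAS arithmetic in `calculate_attachment_scores`, inlined here on toy data rather than imported: UAS counts correct heads, LAS counts tokens whose head and label are both correct.

```python
# Toy gold/pred tokens in the same dict shape the script uses.
gold = [{"head": 2, "deprel": "nsubj"}, {"head": 0, "deprel": "root"}, {"head": 2, "deprel": "obj"}]
pred = [{"head": 2, "deprel": "nsubj"}, {"head": 0, "deprel": "root"}, {"head": 2, "deprel": "nmod"}]

total = len(gold)
# UAS: head attachment correct; LAS: head AND dependency label correct.
correct_heads = sum(g["head"] == p["head"] for g, p in zip(gold, pred))
correct_labels = sum(g["head"] == p["head"] and g["deprel"] == p["deprel"] for g, p in zip(gold, pred))

uas, las = correct_heads / total, correct_labels / total
print(uas, round(las, 3))  # 1.0 0.667
```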
scripts/predict.py ADDED
@@ -0,0 +1,173 @@
1
+ # /// script
2
+ # requires-python = ">=3.10"
3
+ # dependencies = [
4
+ # "underthesea[deep]>=6.8.0",
5
+ # "click>=8.0.0",
6
+ # "torch>=2.0.0",
7
+ # "transformers>=4.30.0",
8
+ # ]
9
+ # ///
10
+ """
11
+ Prediction script for Bamboo-1 Vietnamese Dependency Parser.
12
+
13
+ Usage:
14
+ # Interactive mode
15
+ uv run scripts/predict.py --model models/bamboo-1
16
+
17
+ # File input
18
+ uv run scripts/predict.py --model models/bamboo-1 --input input.txt --output output.conllu
19
+
20
+ # Single sentence
21
+ uv run scripts/predict.py --model models/bamboo-1 --text "Tôi yêu Việt Nam"
22
+ """
23
+
24
+ import sys
25
+ from pathlib import Path
26
+
27
+ import click
28
+
29
+
30
+ def format_tree_ascii(tokens, heads, deprels):
31
+ """Format dependency tree as ASCII art."""
32
+ n = len(tokens)
33
+ lines = []
34
+
35
+ # Header
36
+ lines.append(" " + " ".join(f"{i+1:>3}" for i in range(n)))
37
+ lines.append(" " + " ".join(f"{t[:3]:>3}" for t in tokens))
38
+
39
+ # Draw arcs
40
+ for i in range(n):
41
+ head = heads[i]
42
+ if head == 0:
43
+ lines.append(f" {tokens[i]} <- ROOT ({deprels[i]})")
44
+ else:
45
+ arrow = "<-" if head > i + 1 else "->"
46
+ lines.append(f" {tokens[i]} {arrow} {tokens[head-1]} ({deprels[i]})")
47
+
48
+ return "\n".join(lines)
49
+
50
+
51
+ def format_conllu(tokens, heads, deprels, sent_id=None, text=None):
52
+ """Format result as CoNLL-U."""
53
+ lines = []
54
+ if sent_id:
55
+ lines.append(f"# sent_id = {sent_id}")
56
+ if text:
57
+ lines.append(f"# text = {text}")
58
+
59
+ for i, (token, head, deprel) in enumerate(zip(tokens, heads, deprels)):
60
+ lines.append(f"{i+1}\t{token}\t_\t_\t_\t_\t{head}\t{deprel}\t_\t_")
61
+
62
+ lines.append("")
63
+ return "\n".join(lines)
64
+
65
+
66
+ @click.command()
67
+ @click.option(
68
+ "--model", "-m",
69
+ required=True,
70
+ help="Path to trained model directory",
71
+ )
72
+ @click.option(
73
+ "--input", "-i",
74
+ "input_file",
75
+ help="Input file (one sentence per line)",
76
+ )
77
+ @click.option(
78
+ "--output", "-o",
79
+ "output_file",
80
+ help="Output file (CoNLL-U format)",
81
+ )
82
+ @click.option(
83
+ "--text", "-t",
84
+ help="Single sentence to parse",
85
+ )
86
+ @click.option(
87
+ "--format",
88
+ "output_format",
89
+ type=click.Choice(["conllu", "simple", "tree"]),
90
+ default="simple",
91
+ help="Output format",
92
+ show_default=True,
93
+ )
94
+ def predict(model, input_file, output_file, text, output_format):
95
+ """Parse Vietnamese sentences with Bamboo-1 Dependency Parser."""
96
+ from underthesea.models.dependency_parser import DependencyParser
97
+
98
+ click.echo(f"Loading model from {model}...")
99
+ parser = DependencyParser.load(model)
100
+ click.echo("Model loaded.\n")
101
+
102
+ def parse_and_print(sentence, sent_id=None):
103
+ """Parse a sentence and print the result."""
104
+ result = parser.predict(sentence)
105
+ tokens = [r[0] for r in result]
106
+ heads = [r[1] for r in result]
107
+ deprels = [r[2] for r in result]
108
+
109
+ if output_format == "conllu":
110
+ return format_conllu(tokens, heads, deprels, sent_id, sentence)
111
+ elif output_format == "tree":
112
+ output = f"Sentence: {sentence}\n"
113
+ output += format_tree_ascii(tokens, heads, deprels)
114
+ return output
115
+ else: # simple
116
+ output = f"Input: {sentence}\n"
117
+ output += "Output:\n"
118
+ for i, (token, head, deprel) in enumerate(zip(tokens, heads, deprels)):
119
+ head_word = "ROOT" if head == 0 else tokens[head - 1]
120
+ output += f" {i+1}. {token} -> {head_word} ({deprel})\n"
121
+ return output
122
+
123
+ # Single text mode
124
+ if text:
125
+ result = parse_and_print(text, sent_id=1)
126
+ click.echo(result)
127
+ return
128
+
129
+ # File mode
130
+ if input_file:
131
+ click.echo(f"Reading from {input_file}...")
132
+ with open(input_file, "r", encoding="utf-8") as f:
133
+ sentences = [line.strip() for line in f if line.strip()]
134
+
135
+ click.echo(f"Parsing {len(sentences)} sentences...")
136
+ results = []
137
+ for i, sentence in enumerate(sentences, 1):
138
+ result = parse_and_print(sentence, sent_id=i)
139
+ results.append(result)
140
+ if i % 100 == 0:
141
+ click.echo(f" Processed {i}/{len(sentences)}...")
142
+
143
+ if output_file:
144
+ with open(output_file, "w", encoding="utf-8") as f:
145
+ f.write("\n".join(results))
146
+ click.echo(f"Results saved to {output_file}")
147
+ else:
148
+ for result in results:
149
+ click.echo(result)
150
+ click.echo()
151
+ return
152
+
153
+ # Interactive mode
154
+ click.echo("Interactive mode. Enter sentences to parse (Ctrl+C to exit).\n")
155
+ sent_id = 1
156
+ while True:
157
+ try:
158
+ sentence = input(">>> ").strip()
159
+ if not sentence:
160
+ continue
161
+ result = parse_and_print(sentence, sent_id=sent_id)
162
+ click.echo(result)
163
+ click.echo()
164
+ sent_id += 1
165
+ except KeyboardInterrupt:
166
+ click.echo("\nGoodbye!")
167
+ break
168
+ except EOFError:
169
+ break
170
+
171
+
172
+ if __name__ == "__main__":
173
+ predict()
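For reference, the 10-column rows that `format_conllu` above emits follow the standard CoNLL-U layout, with only ID, FORM, HEAD, and DEPREL filled in. A standalone sketch (the sample sentence and its heads are illustrative, not parser output):

```python
# Illustrative parse of "Tôi yêu Việt_Nam"; heads/deprels are made up here.
tokens = ["Tôi", "yêu", "Việt_Nam"]
heads = [2, 0, 2]
deprels = ["nsubj", "root", "obj"]

# 10 tab-separated CoNLL-U columns; unused ones are "_" placeholders.
lines = [f"{i+1}\t{tok}\t_\t_\t_\t_\t{head}\t{rel}\t_\t_"
         for i, (tok, head, rel) in enumerate(zip(tokens, heads, deprels))]
print("\n".join(lines))
```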
scripts/runpod_setup.py ADDED
@@ -0,0 +1,287 @@
1
+ # /// script
2
+ # requires-python = ">=3.10"
3
+ # dependencies = [
4
+ # "runpod>=1.6.0",
5
+ # "requests>=2.28.0",
6
+ # ]
7
+ # ///
8
+ """
9
+ RunPod setup script for Bamboo-1 training.
10
+
11
+ Usage:
12
+ # Set your RunPod API key
13
+ export RUNPOD_API_KEY="your-api-key"
14
+
15
+ # Create a network volume for data
16
+ uv run scripts/runpod_setup.py volume-create --name bamboo-data --size 10
17
+
18
+ # List volumes
19
+ uv run scripts/runpod_setup.py volume-list
20
+
21
+ # Launch training pod with volume
22
+ uv run scripts/runpod_setup.py launch --volume <volume-id>
23
+
24
+ # Check pod status
25
+ uv run scripts/runpod_setup.py status
26
+
27
+ # Stop pod
28
+ uv run scripts/runpod_setup.py stop
29
+ """
30
+
31
+ import os
32
+ import click
33
+ import runpod
34
+ import requests
35
+
36
+
37
+ @click.group()
38
+ def cli():
39
+ """RunPod management for Bamboo-1 training."""
40
+ api_key = os.environ.get("RUNPOD_API_KEY")
41
+ if not api_key:
42
+ raise click.ClickException(
43
+ "RUNPOD_API_KEY environment variable not set.\n"
44
+ "Get your API key from https://runpod.io/console/user/settings"
45
+ )
46
+ runpod.api_key = api_key
47
+
48
+
49
+ def get_ssh_public_key() -> str | None:
50
+ """Get the user's SSH public key."""
51
+ from pathlib import Path
52
+ for key_file in ["~/.ssh/id_rsa.pub", "~/.ssh/id_ed25519.pub"]:
53
+ path = Path(key_file).expanduser()
54
+ if path.exists():
55
+ return path.read_text().strip()
56
+ return None
57
+
58
+
59
+ # Default images
60
+ DEFAULT_IMAGE = "runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04"
61
+ BAMBOO1_IMAGE = "undertheseanlp/bamboo-1:latest" # Pre-built image with dependencies
62
+
63
+
64
+ @cli.command()
65
+ @click.option("--gpu", default="NVIDIA RTX A4000", help="GPU type")
66
+ @click.option("--image", default=DEFAULT_IMAGE, help="Docker image")
67
+ @click.option("--prebuilt", is_flag=True, help="Use pre-built bamboo-1 image (faster startup)")
68
+ @click.option("--disk", default=20, type=int, help="Disk size in GB")
69
+ @click.option("--name", default="bamboo-1-training", help="Pod name")
70
+ @click.option("--volume", default=None, help="Network volume ID to attach")
71
+ @click.option("--wandb-key", envvar="WANDB_API_KEY", help="W&B API key for logging")
72
+ @click.option("--sample", default=0, type=int, help="Sample N sentences (0=all)")
73
+ @click.option("--epochs", default=100, type=int, help="Number of epochs")
74
+ def launch(gpu, image, prebuilt, disk, name, volume, wandb_key, sample, epochs):
75
+ """Launch a RunPod instance for training."""
76
+
77
+ # Use pre-built image if requested
78
+ if prebuilt:
79
+ image = BAMBOO1_IMAGE
80
+
81
+ click.echo("Launching RunPod instance...")
82
+ click.echo(f" GPU: {gpu}")
83
+ click.echo(f" Image: {image}")
84
+ click.echo(f" Disk: {disk}GB")
85
+
86
+ # Build training command
87
+ train_cmd = "uv run scripts/train.py"
88
+ if sample > 0:
89
+ train_cmd += f" --sample {sample}"
90
+ train_cmd += f" --epochs {epochs}"
91
+ if wandb_key:
92
+ train_cmd += " --wandb --wandb-project bamboo-1"
93
+
94
+ # Set environment variables
95
+ env_vars = {}
96
+ if wandb_key:
97
+ env_vars["WANDB_API_KEY"] = wandb_key
98
+
99
+ # Add SSH public key
100
+ ssh_key = get_ssh_public_key()
101
+ if ssh_key:
102
+ env_vars["PUBLIC_KEY"] = ssh_key
103
+ click.echo(" SSH key: configured")
104
+
105
+ if volume:
106
+ click.echo(f" Volume: {volume}")
107
+
108
+ pod = runpod.create_pod(
109
+ name=name,
110
+ image_name=image,
111
+ gpu_type_id=gpu,
112
+ volume_in_gb=disk,
113
+ env=env_vars if env_vars else None,
114
+ ports="22/tcp", # Expose SSH port
115
+ network_volume_id=volume, # Attach network volume
116
+ )
117
+
118
+ click.echo("\nPod created!")
119
+ click.echo(f" ID: {pod['id']}")
120
+ click.echo(f" Status: {pod.get('desiredStatus', 'PENDING')}")
121
+ click.echo("\nMonitor at: https://runpod.io/console/pods")
122
+
123
+ # Generate one-liner training command
124
+ click.echo("\n" + "="*60)
125
+ click.echo("SSH into the pod and run this command:")
126
+ click.echo("="*60)
127
+
128
+ if prebuilt:
129
+ # Pre-built image: dependencies already installed
130
+ one_liner = f"cd /workspace/bamboo-1 && {train_cmd}"
131
+ else:
132
+ # Standard image: need to install everything
133
+ one_liner = f"""curl -LsSf https://astral.sh/uv/install.sh | sh && source $HOME/.local/bin/env && git clone https://huggingface.co/undertheseanlp/bamboo-1 && cd bamboo-1 && uv sync && {train_cmd}"""
134
+
135
+ click.echo(one_liner)
136
+ click.echo("="*60)
137
+
138
+
139
+ @cli.command()
140
+ def status():
141
+ """Check status of all pods."""
142
+ pods = runpod.get_pods()
143
+
144
+ if not pods:
145
+ click.echo("No active pods.")
146
+ return
147
+
148
+ click.echo("Active pods:")
149
+ for pod in pods:
150
+ click.echo(f" - {pod['name']} ({pod['id']}): {pod.get('desiredStatus', 'UNKNOWN')}")
151
+
152
+
153
+ @cli.command()
154
+ @click.argument("pod_id")
155
+ def stop(pod_id):
156
+ """Stop a pod by ID."""
157
+ click.echo(f"Stopping pod {pod_id}...")
158
+ runpod.stop_pod(pod_id)
159
+ click.echo("Pod stopped.")
160
+
161
+
162
+ @cli.command()
163
+ @click.argument("pod_id")
164
+ def terminate(pod_id):
165
+ """Terminate a pod by ID."""
166
+ click.echo(f"Terminating pod {pod_id}...")
167
+ runpod.terminate_pod(pod_id)
168
+ click.echo("Pod terminated.")
169
+
170
+
171
+ # =============================================================================
172
+ # Volume Management
173
+ # =============================================================================
174
+
175
+ DATACENTERS = {
176
+ "EU-RO-1": "Europe (Romania)",
177
+ "EU-CZ-1": "Europe (Czech Republic)",
178
+ "EUR-IS-1": "Europe (Iceland)",
179
+ "US-KS-2": "US (Kansas)",
180
+ "US-CA-2": "US (California)",
181
+ }
182
+
183
+
184
+ def _graphql_request(query: str, variables: dict = None) -> dict:
185
+ """Make a GraphQL request to RunPod API."""
186
+ api_key = os.environ.get("RUNPOD_API_KEY")
187
+ response = requests.post(
188
+ "https://api.runpod.io/graphql",
189
+ headers={"Authorization": f"Bearer {api_key}"},
190
+ json={"query": query, "variables": variables or {}}
191
+ )
192
+ return response.json()
193
+
194
+
195
+ @cli.command("volume-list")
196
+ def volume_list():
197
+ """List all network volumes."""
198
+ query = """
199
+ query {
200
+ myself {
201
+ networkVolumes {
202
+ id
203
+ name
204
+ size
205
+ dataCenterId
206
+ }
207
+ }
208
+ }
209
+ """
210
+ result = _graphql_request(query)
211
+ volumes = result.get("data", {}).get("myself", {}).get("networkVolumes", [])
212
+
213
+ if not volumes:
214
+ click.echo("No network volumes found.")
215
+ click.echo("\nCreate one with: uv run scripts/runpod_setup.py volume-create --name bamboo-data --size 10")
216
+ return
217
+
218
+ click.echo("Network Volumes:")
219
+ for vol in volumes:
220
+ dc = DATACENTERS.get(vol['dataCenterId'], vol['dataCenterId'])
221
+ click.echo(f" - {vol['name']} ({vol['id']}): {vol['size']}GB @ {dc}")
222
+
223
+
224
+ @cli.command("volume-create")
225
+ @click.option("--name", default="bamboo-data", help="Volume name")
226
+ @click.option("--size", default=10, type=int, help="Size in GB")
227
+ @click.option("--datacenter", default="EUR-IS-1", type=click.Choice(list(DATACENTERS.keys())), help="Datacenter")
228
+ def volume_create(name, size, datacenter):
229
+ """Create a network volume for data storage."""
230
+ click.echo("Creating network volume...")
231
+ click.echo(f" Name: {name}")
232
+ click.echo(f" Size: {size}GB")
233
+ click.echo(f" Datacenter: {DATACENTERS[datacenter]}")
234
+
235
+ query = """
236
+ mutation createNetworkVolume($input: CreateNetworkVolumeInput!) {
237
+ createNetworkVolume(input: $input) {
238
+ id
239
+ name
240
+ size
241
+ dataCenterId
242
+ }
243
+ }
244
+ """
245
+ variables = {
246
+ "input": {
247
+ "name": name,
248
+ "size": size,
249
+ "dataCenterId": datacenter
250
+ }
251
+ }
252
+
253
+ result = _graphql_request(query, variables)
254
+
255
+ if "errors" in result:
256
+ click.echo(f"\nError: {result['errors'][0]['message']}")
257
+ return
258
+
259
+ volume = result.get("data", {}).get("createNetworkVolume", {})
260
+ click.echo("\nVolume created!")
261
+ click.echo(f" ID: {volume['id']}")
262
+ click.echo(f"\nUse with: uv run scripts/runpod_setup.py launch --volume {volume['id']}")
263
+
264
+
265
+ @cli.command("volume-delete")
266
+ @click.argument("volume_id")
267
+ @click.confirmation_option(prompt="Are you sure you want to delete this volume?")
268
+ def volume_delete(volume_id):
269
+ """Delete a network volume."""
270
+ query = """
271
+ mutation deleteNetworkVolume($input: DeleteNetworkVolumeInput!) {
272
+ deleteNetworkVolume(input: $input)
273
+ }
274
+ """
275
+ variables = {"input": {"id": volume_id}}
276
+
277
+ result = _graphql_request(query, variables)
278
+
279
+ if "errors" in result:
280
+ click.echo(f"Error: {result['errors'][0]['message']}")
281
+ return
282
+
283
+ click.echo(f"Volume {volume_id} deleted.")
284
+
285
+
286
+ if __name__ == "__main__":
287
+ cli()
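The GraphQL calls above all share one payload shape: a JSON object with `query` and `variables` keys POSTed to the RunPod endpoint. A minimal offline illustration (no request is sent; the query string is a trimmed copy of the volume-list query):

```python
import json

# Payload shape used by _graphql_request above; illustration only, no HTTP call.
query = "query { myself { networkVolumes { id name size dataCenterId } } }"
payload = {"query": query, "variables": {}}

print(json.dumps(payload))
```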
scripts/runpod_simple_test.py ADDED
@@ -0,0 +1,81 @@
1
+ # /// script
2
+ # requires-python = ">=3.10"
3
+ # dependencies = [
4
+ # "runpod>=1.6.0",
5
+ # "click>=8.0.0",
6
+ # ]
7
+ # ///
8
+ """
9
+ Simple RunPod test script to verify API connection and GPU availability.
10
+
11
+ Usage:
12
+ export RUNPOD_API_KEY="your-api-key"
13
+ uv run scripts/runpod_simple_test.py
14
+ uv run scripts/runpod_simple_test.py --list-gpus
15
+ uv run scripts/runpod_simple_test.py --run-test
16
+ """
17
+
18
+ import os
19
+ import click
20
+ import runpod
21
+
22
+
23
+ @click.command()
24
+ @click.option("--list-gpus", is_flag=True, help="List available GPU types")
25
+ @click.option("--run-test", is_flag=True, help="Run a quick test pod")
26
+ def main(list_gpus, run_test):
27
+ """Test RunPod API connection and GPU availability."""
28
+ api_key = os.environ.get("RUNPOD_API_KEY")
29
+ if not api_key:
30
+ raise click.ClickException(
31
+ "Set RUNPOD_API_KEY environment variable.\n"
32
+ "Get your key at: https://runpod.io/console/user/settings"
33
+ )
34
+
35
+ runpod.api_key = api_key
36
+
37
+ # Test API connection
38
+ click.echo("Testing RunPod API connection...")
39
+ try:
40
+ pods = runpod.get_pods()
41
+ click.echo(f" Connected! Active pods: {len(pods)}")
42
+ except Exception as e:
43
+ raise click.ClickException(f"API connection failed: {e}")
44
+
45
+ # List GPUs
46
+ if list_gpus:
47
+ click.echo("\nAvailable GPU types:")
48
+ try:
49
+ gpus = runpod.get_gpus()
50
+ for gpu in gpus:
51
+ name = gpu.get("id", "Unknown")
52
+ mem = gpu.get("memoryInGb", "?")
53
+ click.echo(f" - {name} ({mem}GB)")
54
+ except Exception as e:
55
+ click.echo(f" Could not list GPUs: {e}")
56
+
57
+ # Run test pod
58
+ if run_test:
59
+ click.echo("\nLaunching test pod...")
60
+ test_script = 'nvidia-smi && python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"'
61
+
62
+ pod = runpod.create_pod(
63
+ name="bamboo-1-test",
64
+ image_name="runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04",
65
+ gpu_type_id="NVIDIA RTX A4000",
66
+ volume_in_gb=5,
67
+ docker_args=f"bash -c '{test_script}; sleep 60'",
68
+ )
69
+
70
+ click.echo(f" Pod ID: {pod['id']}")
71
+ click.echo(" Monitor: https://runpod.io/console/pods")
72
+ click.echo(f"\n Terminate after checking:")
73
+ click.echo(f" uv run scripts/runpod_setup.py terminate {pod['id']}")
74
+
75
+ if not list_gpus and not run_test:
76
+ click.echo("\nUse --list-gpus to see available GPUs")
77
+ click.echo("Use --run-test to launch a quick test pod")
78
+
79
+
80
+ if __name__ == "__main__":
81
+ main()
scripts/runpod_train.sh ADDED
@@ -0,0 +1,42 @@
+ #!/bin/bash
+ # RunPod Training Script for Bamboo-1
+ # Usage: bash scripts/runpod_train.sh
+
+ set -e
+
+ echo "=========================================="
+ echo "Bamboo-1: Vietnamese Dependency Parser"
+ echo "RunPod Training Setup"
+ echo "=========================================="
+
+ # Install uv if not present
+ if ! command -v uv &> /dev/null; then
+     echo "Installing uv..."
+     curl -LsSf https://astral.sh/uv/install.sh | sh
+     source "$HOME/.local/bin/env"
+ fi
+
+ # Clone the repo if it does not exist yet
+ if [ ! -d "bamboo-1" ]; then
+     echo "Cloning bamboo-1 from HuggingFace..."
+     git clone https://huggingface.co/undertheseanlp/bamboo-1
+ fi
+
+ cd bamboo-1
+
+ # Install dependencies
+ echo "Installing dependencies..."
+ uv sync
+
+ # Run training (flags match the options scripts/train.py actually defines)
+ echo "Starting training..."
+ uv run scripts/train.py \
+     --output models/bamboo-1-char \
+     --epochs 100 \
+     --batch-size 32 \
+     --lr 2e-3 \
+     --patience 10 \
+     "$@"
+
+ echo "Training complete!"
+ echo "Model saved to: models/bamboo-1-char"
scripts/train.py ADDED
@@ -0,0 +1,673 @@
+ # /// script
+ # requires-python = ">=3.10"
+ # dependencies = [
+ #     "torch>=2.0.0",
+ #     "datasets>=2.14.0",
+ #     "click>=8.0.0",
+ #     "tqdm>=4.60.0",
+ #     "wandb>=0.15.0",
+ # ]
+ # ///
+ """
+ Training script for the Bamboo-1 Vietnamese Dependency Parser.
+ Biaffine parser implemented from scratch (Dozat & Manning, 2017).
+
+ Usage:
+     uv run scripts/train.py
+     uv run scripts/train.py --output models/bamboo-1 --epochs 100
+ """
+
+ import sys
+ from pathlib import Path
+ from collections import Counter
+ from dataclasses import dataclass
+ from typing import List, Tuple, Optional
+
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence, pad_sequence
+ from torch.utils.data import Dataset, DataLoader
+ from torch.optim import Adam
+ from torch.optim.lr_scheduler import ExponentialLR
+ from tqdm import tqdm
+
+ import click
+
+ sys.path.insert(0, str(Path(__file__).parent.parent))
+ from bamboo1.corpus import UDD1Corpus
+ from scripts.cost_estimate import CostTracker, detect_hardware
+
+
+ # ============================================================================
+ # Data Processing
+ # ============================================================================
+
+ @dataclass
+ class Sentence:
+     """A dependency-parsed sentence."""
+     words: List[str]
+     heads: List[int]
+     rels: List[str]
+
+
+ def read_conllu(path: str) -> List[Sentence]:
+     """Read a CoNLL-U file and return a list of sentences."""
+     sentences = []
+     words, heads, rels = [], [], []
+
+     with open(path, 'r', encoding='utf-8') as f:
+         for line in f:
+             line = line.strip()
+             if not line:
+                 if words:
+                     sentences.append(Sentence(words, heads, rels))
+                     words, heads, rels = [], [], []
+             elif line.startswith('#'):
+                 continue
+             else:
+                 parts = line.split('\t')
+                 if '-' in parts[0] or '.' in parts[0]:  # Skip multi-word tokens and empty nodes
+                     continue
+                 words.append(parts[1])       # FORM
+                 heads.append(int(parts[6]))  # HEAD
+                 rels.append(parts[7])        # DEPREL
+
+     if words:
+         sentences.append(Sentence(words, heads, rels))
+
+     return sentences
+
+
+ class Vocabulary:
+     """Vocabulary for words, characters, and relations."""
+     PAD = '<pad>'
+     UNK = '<unk>'
+
+     def __init__(self, min_freq: int = 2):
+         self.min_freq = min_freq
+         self.word2idx = {self.PAD: 0, self.UNK: 1}
+         self.char2idx = {self.PAD: 0, self.UNK: 1}
+         self.rel2idx = {}
+         self.idx2rel = {}
+
+     def build(self, sentences: List[Sentence]):
+         """Build the vocabulary from training sentences."""
+         word_counts = Counter()
+         char_counts = Counter()
+         rel_counts = Counter()
+
+         for sent in sentences:
+             for word in sent.words:
+                 word_counts[word.lower()] += 1
+                 for char in word:
+                     char_counts[char] += 1
+             for rel in sent.rels:
+                 rel_counts[rel] += 1
+
+         # Words (frequency-thresholded)
+         for word, count in word_counts.items():
+             if count >= self.min_freq and word not in self.word2idx:
+                 self.word2idx[word] = len(self.word2idx)
+
+         # Characters
+         for char, count in char_counts.items():
+             if char not in self.char2idx:
+                 self.char2idx[char] = len(self.char2idx)
+
+         # Relations
+         for rel in rel_counts:
+             if rel not in self.rel2idx:
+                 idx = len(self.rel2idx)
+                 self.rel2idx[rel] = idx
+                 self.idx2rel[idx] = rel
+
+     def encode_word(self, word: str) -> int:
+         return self.word2idx.get(word.lower(), self.word2idx[self.UNK])
+
+     def encode_char(self, char: str) -> int:
+         return self.char2idx.get(char, self.char2idx[self.UNK])
+
+     def encode_rel(self, rel: str) -> int:
+         return self.rel2idx.get(rel, 0)
+
+     @property
+     def n_words(self) -> int:
+         return len(self.word2idx)
+
+     @property
+     def n_chars(self) -> int:
+         return len(self.char2idx)
+
+     @property
+     def n_rels(self) -> int:
+         return len(self.rel2idx)
+
+
+ class DependencyDataset(Dataset):
+     """Dataset for dependency parsing."""
+
+     def __init__(self, sentences: List[Sentence], vocab: Vocabulary):
+         self.sentences = sentences
+         self.vocab = vocab
+
+     def __len__(self):
+         return len(self.sentences)
+
+     def __getitem__(self, idx):
+         sent = self.sentences[idx]
+
+         # Encode words
+         word_ids = [self.vocab.encode_word(w) for w in sent.words]
+
+         # Encode characters
+         char_ids = [[self.vocab.encode_char(c) for c in w] for w in sent.words]
+
+         # Heads and relations
+         heads = sent.heads
+         rels = [self.vocab.encode_rel(r) for r in sent.rels]
+
+         return word_ids, char_ids, heads, rels
+
+
+ def collate_fn(batch):
+     """Collate function for DataLoader: pad everything to the batch maximum."""
+     word_ids, char_ids, heads, rels = zip(*batch)
+
+     # Sentence lengths
+     lengths = [len(w) for w in word_ids]
+     max_len = max(lengths)
+
+     # Pad words
+     word_ids_padded = torch.zeros(len(batch), max_len, dtype=torch.long)
+     for i, wids in enumerate(word_ids):
+         word_ids_padded[i, :len(wids)] = torch.tensor(wids)
+
+     # Pad characters
+     max_word_len = max(max(len(c) for c in chars) for chars in char_ids)
+     char_ids_padded = torch.zeros(len(batch), max_len, max_word_len, dtype=torch.long)
+     for i, chars in enumerate(char_ids):
+         for j, c in enumerate(chars):
+             char_ids_padded[i, j, :len(c)] = torch.tensor(c)
+
+     # Pad heads
+     heads_padded = torch.zeros(len(batch), max_len, dtype=torch.long)
+     for i, h in enumerate(heads):
+         heads_padded[i, :len(h)] = torch.tensor(h)
+
+     # Pad rels
+     rels_padded = torch.zeros(len(batch), max_len, dtype=torch.long)
+     for i, r in enumerate(rels):
+         rels_padded[i, :len(r)] = torch.tensor(r)
+
+     # Mask of real (non-padding) tokens
+     mask = torch.zeros(len(batch), max_len, dtype=torch.bool)
+     for i, l in enumerate(lengths):
+         mask[i, :l] = True
+
+     lengths = torch.tensor(lengths)
+
+     return word_ids_padded, char_ids_padded, heads_padded, rels_padded, mask, lengths
+
+
+ # ============================================================================
+ # Model
+ # ============================================================================
+
+ class CharLSTM(nn.Module):
+     """Character-level LSTM embeddings."""
+
+     def __init__(self, n_chars: int, char_dim: int = 50, hidden_dim: int = 100):
+         super().__init__()
+         self.embed = nn.Embedding(n_chars, char_dim, padding_idx=0)
+         self.lstm = nn.LSTM(char_dim, hidden_dim // 2, batch_first=True, bidirectional=True)
+         self.hidden_dim = hidden_dim
+
+     def forward(self, chars):
+         """
+         Args:
+             chars: (batch, seq_len, max_word_len)
+         Returns:
+             (batch, seq_len, hidden_dim)
+         """
+         batch, seq_len, max_word_len = chars.shape
+
+         # Flatten words into one batch of character sequences
+         chars_flat = chars.view(-1, max_word_len)  # (batch * seq_len, max_word_len)
+
+         # Word lengths (clamped so padding-only rows can still be packed)
+         word_lens = (chars_flat != 0).sum(dim=1)
+         word_lens = word_lens.clamp(min=1)
+
+         # Embed
+         char_embeds = self.embed(chars_flat)  # (batch * seq_len, max_word_len, char_dim)
+
+         # Pack and run the LSTM
+         packed = pack_padded_sequence(char_embeds, word_lens.cpu(), batch_first=True, enforce_sorted=False)
+         _, (hidden, _) = self.lstm(packed)
+
+         # Concatenate the final forward and backward hidden states
+         hidden = torch.cat([hidden[0], hidden[1]], dim=-1)  # (batch * seq_len, hidden_dim)
+
+         return hidden.view(batch, seq_len, self.hidden_dim)
+
+
+ class MLP(nn.Module):
+     """MLP projection: a linear layer with LeakyReLU and dropout."""
+
+     def __init__(self, input_dim: int, hidden_dim: int, dropout: float = 0.33):
+         super().__init__()
+         self.linear = nn.Linear(input_dim, hidden_dim)
+         self.activation = nn.LeakyReLU(0.1)
+         self.dropout = nn.Dropout(dropout)
+
+     def forward(self, x):
+         return self.dropout(self.activation(self.linear(x)))
+
+
+ class Biaffine(nn.Module):
+     """Biaffine attention layer."""
+
+     def __init__(self, input_dim: int, output_dim: int = 1, bias_x: bool = True, bias_y: bool = True):
+         super().__init__()
+         self.input_dim = input_dim
+         self.output_dim = output_dim
+         self.bias_x = bias_x
+         self.bias_y = bias_y
+
+         self.weight = nn.Parameter(torch.zeros(output_dim, input_dim + bias_x, input_dim + bias_y))
+         nn.init.xavier_uniform_(self.weight)
+
+     def forward(self, x, y):
+         """
+         Args:
+             x: (batch, seq_len, input_dim) - dependent representations
+             y: (batch, seq_len, input_dim) - head representations
+         Returns:
+             (batch, seq_len, seq_len, output_dim), or (batch, seq_len, seq_len) if output_dim == 1
+         """
+         if self.bias_x:
+             x = torch.cat([x, torch.ones_like(x[..., :1])], dim=-1)
+         if self.bias_y:
+             y = torch.cat([y, torch.ones_like(y[..., :1])], dim=-1)
+
+         # (batch, seq_len, output_dim, input_dim+1)
+         x = torch.einsum('bxi,oij->bxoj', x, self.weight)
+         # (batch, seq_len, seq_len, output_dim)
+         scores = torch.einsum('bxoj,byj->bxyo', x, y)
+
+         if self.output_dim == 1:
+             scores = scores.squeeze(-1)
+
+         return scores
+
+
+ class BiaffineDependencyParser(nn.Module):
+     """Biaffine dependency parser (Dozat & Manning, 2017)."""
+
+     def __init__(
+         self,
+         n_words: int,
+         n_chars: int,
+         n_rels: int,
+         word_dim: int = 100,
+         char_dim: int = 50,
+         char_hidden: int = 100,
+         lstm_hidden: int = 400,
+         lstm_layers: int = 3,
+         arc_hidden: int = 500,
+         rel_hidden: int = 100,
+         dropout: float = 0.33,
+     ):
+         super().__init__()
+
+         self.word_embed = nn.Embedding(n_words, word_dim, padding_idx=0)
+         self.char_lstm = CharLSTM(n_chars, char_dim, char_hidden)
+
+         input_dim = word_dim + char_hidden
+
+         self.lstm = nn.LSTM(
+             input_dim, lstm_hidden // 2,
+             num_layers=lstm_layers,
+             batch_first=True,
+             bidirectional=True,
+             dropout=dropout if lstm_layers > 1 else 0
+         )
+
+         self.mlp_arc_dep = MLP(lstm_hidden, arc_hidden, dropout)
+         self.mlp_arc_head = MLP(lstm_hidden, arc_hidden, dropout)
+         self.mlp_rel_dep = MLP(lstm_hidden, rel_hidden, dropout)
+         self.mlp_rel_head = MLP(lstm_hidden, rel_hidden, dropout)
+
+         self.arc_attn = Biaffine(arc_hidden, 1, bias_x=True, bias_y=False)
+         self.rel_attn = Biaffine(rel_hidden, n_rels, bias_x=True, bias_y=True)
+
+         self.dropout = nn.Dropout(dropout)
+         self.n_rels = n_rels
+
+     def forward(self, words, chars, mask):
+         """
+         Args:
+             words: (batch, seq_len)
+             chars: (batch, seq_len, max_word_len)
+             mask: (batch, seq_len)
+         Returns:
+             arc_scores: (batch, seq_len, seq_len)
+             rel_scores: (batch, seq_len, seq_len, n_rels)
+         """
+         # Embeddings
+         word_embeds = self.word_embed(words)
+         char_embeds = self.char_lstm(chars)
+         embeds = torch.cat([word_embeds, char_embeds], dim=-1)
+         embeds = self.dropout(embeds)
+
+         # BiLSTM encoder
+         lengths = mask.sum(dim=1).cpu()
+         packed = pack_padded_sequence(embeds, lengths, batch_first=True, enforce_sorted=False)
+         lstm_out, _ = self.lstm(packed)
+         lstm_out, _ = pad_packed_sequence(lstm_out, batch_first=True, total_length=mask.size(1))
+         lstm_out = self.dropout(lstm_out)
+
+         # MLP projections
+         arc_dep = self.mlp_arc_dep(lstm_out)
+         arc_head = self.mlp_arc_head(lstm_out)
+         rel_dep = self.mlp_rel_dep(lstm_out)
+         rel_head = self.mlp_rel_head(lstm_out)
+
+         # Biaffine scoring
+         arc_scores = self.arc_attn(arc_dep, arc_head)  # (batch, seq_len, seq_len)
+         rel_scores = self.rel_attn(rel_dep, rel_head)  # (batch, seq_len, seq_len, n_rels)
+
+         return arc_scores, rel_scores
+
+     def loss(self, arc_scores, rel_scores, heads, rels, mask):
+         """Compute the arc + relation cross-entropy loss."""
+         batch_size, seq_len = mask.shape
+
+         # Arc loss
+         arc_scores = arc_scores.masked_fill(~mask.unsqueeze(2), float('-inf'))
+         arc_loss = F.cross_entropy(
+             arc_scores[mask].view(-1, seq_len),
+             heads[mask],
+             reduction='mean'
+         )
+
+         # Relation loss - select scores at the gold heads
+         rel_scores_gold = rel_scores[torch.arange(batch_size).unsqueeze(1), torch.arange(seq_len), heads]
+         rel_loss = F.cross_entropy(
+             rel_scores_gold[mask],
+             rels[mask],
+             reduction='mean'
+         )
+
+         return arc_loss + rel_loss
+
+     def decode(self, arc_scores, rel_scores, mask):
+         """Greedy decoding: best head per token, then best relation for that head."""
+         arc_preds = arc_scores.argmax(dim=-1)
+
+         batch_size, seq_len = mask.shape
+         rel_scores_pred = rel_scores[torch.arange(batch_size).unsqueeze(1), torch.arange(seq_len), arc_preds]
+         rel_preds = rel_scores_pred.argmax(dim=-1)
+
+         return arc_preds, rel_preds
+
+
+ # ============================================================================
+ # Training
+ # ============================================================================
+
+ def evaluate(model, dataloader, device):
+     """Evaluate the model and return UAS/LAS percentages."""
+     model.eval()
+
+     total_arcs = 0
+     correct_arcs = 0
+     correct_rels = 0
+
+     with torch.no_grad():
+         for batch in dataloader:
+             words, chars, heads, rels, mask, lengths = [x.to(device) for x in batch]
+
+             arc_scores, rel_scores = model(words, chars, mask)
+             arc_preds, rel_preds = model.decode(arc_scores, rel_scores, mask)
+
+             # Count correct arcs and correctly labeled arcs
+             arc_correct = (arc_preds == heads) & mask
+             rel_correct = (rel_preds == rels) & mask & arc_correct
+
+             total_arcs += mask.sum().item()
+             correct_arcs += arc_correct.sum().item()
+             correct_rels += rel_correct.sum().item()
+
+     uas = correct_arcs / total_arcs * 100
+     las = correct_rels / total_arcs * 100
+
+     return uas, las
+
+
+ @click.command()
+ @click.option('--output', '-o', default='models/bamboo-1', help='Output directory')
+ @click.option('--epochs', default=100, type=int, help='Number of epochs')
+ @click.option('--batch-size', default=32, type=int, help='Batch size')
+ @click.option('--lr', default=2e-3, type=float, help='Learning rate')
+ @click.option('--lstm-hidden', default=400, type=int, help='LSTM hidden size')
+ @click.option('--lstm-layers', default=3, type=int, help='LSTM layers')
+ @click.option('--patience', default=10, type=int, help='Early stopping patience')
+ @click.option('--force-download', is_flag=True, help='Force re-download dataset')
+ @click.option('--gpu-type', default='RTX_A4000', help='GPU type for cost estimation')
+ @click.option('--cost-interval', default=300, type=int, help='Cost report interval in seconds')
+ @click.option('--wandb', 'use_wandb', is_flag=True, help='Enable W&B logging')
+ @click.option('--wandb-project', default='bamboo-1', help='W&B project name')
+ @click.option('--max-time', default=0, type=int, help='Max training time in minutes (0=unlimited)')
+ @click.option('--sample', default=0, type=int, help='Sample N sentences from each split (0=all)')
+ def train(output, epochs, batch_size, lr, lstm_hidden, lstm_layers, patience, force_download, gpu_type, cost_interval, use_wandb, wandb_project, max_time, sample):
+     """Train the Bamboo-1 Vietnamese Dependency Parser."""
+
+     # Detect hardware
+     hardware = detect_hardware()
+     detected_gpu_type = hardware.get_gpu_type()
+
+     # Use the detected GPU type unless the default was explicitly overridden
+     if gpu_type == "RTX_A4000":  # default value
+         gpu_type = detected_gpu_type
+
+     device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+     click.echo(f"Using device: {device}")
+     click.echo(f"Hardware: {hardware}")
+
+     # Initialize wandb
+     if use_wandb:
+         import wandb
+         wandb.init(
+             project=wandb_project,
+             config={
+                 "epochs": epochs,
+                 "batch_size": batch_size,
+                 "lr": lr,
+                 "lstm_hidden": lstm_hidden,
+                 "lstm_layers": lstm_layers,
+                 "patience": patience,
+                 "gpu_type": gpu_type,
+                 "hardware": hardware.to_dict(),
+             }
+         )
+         click.echo(f"W&B logging enabled: {wandb.run.url}")
+
+     click.echo("=" * 60)
+     click.echo("Bamboo-1: Vietnamese Dependency Parser")
+     click.echo("=" * 60)
+
+     # Load corpus
+     click.echo("\nLoading UDD-1 corpus...")
+     corpus = UDD1Corpus(force_download=force_download)
+
+     train_sents = read_conllu(corpus.train)
+     dev_sents = read_conllu(corpus.dev)
+     test_sents = read_conllu(corpus.test)
+
+     # Sample a subset if requested
+     if sample > 0:
+         train_sents = train_sents[:sample]
+         dev_sents = dev_sents[:min(sample // 2, len(dev_sents))]
+         test_sents = test_sents[:min(sample // 2, len(test_sents))]
+         click.echo(f"  Sampling {sample} sentences...")
+
+     click.echo(f"  Train: {len(train_sents)} sentences")
+     click.echo(f"  Dev: {len(dev_sents)} sentences")
+     click.echo(f"  Test: {len(test_sents)} sentences")
+
+     # Build vocabulary
+     click.echo("\nBuilding vocabulary...")
+     vocab = Vocabulary(min_freq=2)
+     vocab.build(train_sents)
+     click.echo(f"  Words: {vocab.n_words}")
+     click.echo(f"  Chars: {vocab.n_chars}")
+     click.echo(f"  Relations: {vocab.n_rels}")
+
+     # Create datasets
+     train_dataset = DependencyDataset(train_sents, vocab)
+     dev_dataset = DependencyDataset(dev_sents, vocab)
+     test_dataset = DependencyDataset(test_sents, vocab)
+
+     train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
+     dev_loader = DataLoader(dev_dataset, batch_size=batch_size, collate_fn=collate_fn)
+     test_loader = DataLoader(test_dataset, batch_size=batch_size, collate_fn=collate_fn)
+
+     # Create model
+     click.echo("\nInitializing model...")
+     model = BiaffineDependencyParser(
+         n_words=vocab.n_words,
+         n_chars=vocab.n_chars,
+         n_rels=vocab.n_rels,
+         lstm_hidden=lstm_hidden,
+         lstm_layers=lstm_layers,
+     ).to(device)
+
+     n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
+     click.echo(f"  Parameters: {n_params:,}")
+
+     # Optimizer and LR schedule
+     optimizer = Adam(model.parameters(), lr=lr, betas=(0.9, 0.9))
+     scheduler = ExponentialLR(optimizer, gamma=0.75 ** (1 / 5000))
+
+     # Training loop
+     click.echo(f"\nTraining for {epochs} epochs...")
+     if max_time > 0:
+         click.echo(f"Time limit: {max_time} minutes")
+     output_path = Path(output)
+     output_path.mkdir(parents=True, exist_ok=True)
+
+     # Cost tracking
+     cost_tracker = CostTracker(gpu_type=gpu_type)
+     cost_tracker.report_interval = cost_interval
+     cost_tracker.start()
+     click.echo(f"Cost tracking: {gpu_type} @ ${cost_tracker.hourly_rate}/hr")
+
+     best_las = -1
+     no_improve = 0
+     time_limit_seconds = max_time * 60 if max_time > 0 else float('inf')
+
+     for epoch in range(1, epochs + 1):
+         # Check the time limit
+         if cost_tracker.elapsed_seconds() >= time_limit_seconds:
+             click.echo(f"\nTime limit reached ({max_time} minutes)")
+             break
+         model.train()
+         total_loss = 0
+
+         pbar = tqdm(train_loader, desc=f"Epoch {epoch:3d}", leave=False)
+         for batch in pbar:
+             words, chars, heads, rels, mask, lengths = [x.to(device) for x in batch]
+
+             optimizer.zero_grad()
+             arc_scores, rel_scores = model(words, chars, mask)
+             loss = model.loss(arc_scores, rel_scores, heads, rels, mask)
+             loss.backward()
+             nn.utils.clip_grad_norm_(model.parameters(), 5.0)
+             optimizer.step()
+             scheduler.step()
+
+             total_loss += loss.item()
+             pbar.set_postfix({'loss': f'{loss.item():.4f}'})
+
+         # Evaluate
+         dev_uas, dev_las = evaluate(model, dev_loader, device)
+
+         # Cost update
+         progress = epoch / epochs
+         current_cost = cost_tracker.current_cost()
+         estimated_total_cost = cost_tracker.estimate_total_cost(progress)
+         elapsed_minutes = cost_tracker.elapsed_seconds() / 60
+
+         cost_status = cost_tracker.update(epoch, epochs)
+         if cost_status:
+             click.echo(f"  [{cost_status}]")
+
+         avg_loss = total_loss / len(train_loader)
+         click.echo(f"Epoch {epoch:3d} | Loss: {avg_loss:.4f} | "
+                    f"Dev UAS: {dev_uas:.2f}% | Dev LAS: {dev_las:.2f}%")
+
+         # Log to wandb
+         if use_wandb:
+             wandb.log({
+                 "epoch": epoch,
+                 "train/loss": avg_loss,
+                 "dev/uas": dev_uas,
+                 "dev/las": dev_las,
+                 "cost/current_usd": current_cost,
+                 "cost/estimated_total_usd": estimated_total_cost,
+                 "cost/elapsed_minutes": elapsed_minutes,
+             })
+
+         # Save the best model
+         if dev_las >= best_las:
+             best_las = dev_las
+             no_improve = 0
+             torch.save({
+                 'model': model.state_dict(),
+                 'vocab': vocab,
+                 'config': {
+                     'n_words': vocab.n_words,
+                     'n_chars': vocab.n_chars,
+                     'n_rels': vocab.n_rels,
+                     'lstm_hidden': lstm_hidden,
+                     'lstm_layers': lstm_layers,
+                 }
+             }, output_path / 'model.pt')
+             click.echo(f"  -> Saved best model (LAS: {best_las:.2f}%)")
+         else:
+             no_improve += 1
+             if no_improve >= patience:
+                 click.echo(f"\nEarly stopping after {patience} epochs without improvement")
+                 break
+
+     # Final evaluation
+     click.echo("\nLoading best model for final evaluation...")
+     checkpoint = torch.load(output_path / 'model.pt', weights_only=False)
+     model.load_state_dict(checkpoint['model'])
+
+     test_uas, test_las = evaluate(model, test_loader, device)
+     click.echo("\nTest Results:")
+     click.echo(f"  UAS: {test_uas:.2f}%")
+     click.echo(f"  LAS: {test_las:.2f}%")
+
+     click.echo(f"\nModel saved to: {output_path}")
+
+     # Final cost summary
+     final_cost = cost_tracker.current_cost()
+     click.echo(f"\n{cost_tracker.summary(epoch, epochs)}")
+
+     # Log final metrics to wandb
+     if use_wandb:
+         wandb.log({
+             "test/uas": test_uas,
+             "test/las": test_las,
+             "cost/final_usd": final_cost,
+         })
+         wandb.finish()
+
+
+ if __name__ == '__main__':
+     train()
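A note on the CoNLL-U conventions that `read_conllu` above relies on: token IDs containing a range (`1-2`) mark multi-word tokens and decimal IDs (`1.1`) mark empty nodes; both are skipped so the kept FORM/HEAD/DEPREL columns stay aligned with surface tokens. A minimal standalone sketch of the same filtering (the three-token sample sentence is invented for illustration):

```python
# Minimal sketch of the CoNLL-U filtering rules assumed by read_conllu.
# Columns per token line: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC.
sample = "\n".join([
    "# text = Tôi yêu Việt_Nam",
    "1-2\tTôi_yêu\t_\t_\t_\t_\t_\t_\t_\t_",      # multi-word token range: skipped
    "1\tTôi\t_\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\tyêu\t_\tVERB\t_\t_\t0\troot\t_\t_",
    "3\tViệt_Nam\t_\tPROPN\t_\t_\t2\tobj\t_\t_",
])

def parse_block(text):
    # Same rules as read_conllu: drop comments, blank lines,
    # multi-word token ranges ("1-2"), and empty nodes ("1.1").
    words, heads, rels = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        parts = line.split('\t')
        if '-' in parts[0] or '.' in parts[0]:
            continue
        words.append(parts[1])       # FORM
        heads.append(int(parts[6]))  # HEAD (0 = root)
        rels.append(parts[7])        # DEPREL
    return words, heads, rels

words, heads, rels = parse_block(sample)
print(words)  # ['Tôi', 'yêu', 'Việt_Nam']
print(heads)  # [2, 0, 2]
print(rels)   # ['nsubj', 'root', 'obj']
```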
scripts/train_gpu.py ADDED
@@ -0,0 +1,70 @@
+ # /// script
+ # requires-python = ">=3.10"
+ # dependencies = [
+ #     "runpod>=1.6.0",
+ #     "click>=8.0.0",
+ # ]
+ # ///
+ """
+ Simple GPU training script for Bamboo-1 using RunPod.
+
+ Usage:
+     export RUNPOD_API_KEY="your-api-key"
+     uv run scripts/train_gpu.py
+     uv run scripts/train_gpu.py --gpu "NVIDIA RTX 3090"
+     uv run scripts/train_gpu.py --max-epochs 50
+ """
+
+ import os
+ import click
+ import runpod
+
+
+ @click.command()
+ @click.option("--gpu", default="NVIDIA RTX A4000", help="GPU type")
+ @click.option("--max-epochs", default=100, type=int, help="Max training epochs")
+ @click.option("--batch-size", default=32, type=int, help="Sentences per batch")
+ @click.option("--name", default="bamboo-1-train", help="Pod name")
+ def main(gpu, max_epochs, batch_size, name):
+     """Launch Bamboo-1 training on a RunPod GPU."""
+     api_key = os.environ.get("RUNPOD_API_KEY")
+     if not api_key:
+         raise click.ClickException(
+             "Set RUNPOD_API_KEY environment variable.\n"
+             "Get your key at: https://runpod.io/console/user/settings"
+         )
+
+     runpod.api_key = api_key
+
+     # One-liner to avoid string escaping issues; flags match scripts/train.py
+     train_cmd = (
+         f"curl -LsSf https://astral.sh/uv/install.sh | sh && "
+         f"source $HOME/.local/bin/env && "
+         f"git clone https://huggingface.co/undertheseanlp/bamboo-1 && "
+         f"cd bamboo-1 && "
+         f"uv sync && "
+         f"uv run scripts/train.py --output models/bamboo-1 --epochs {max_epochs} --batch-size {batch_size}"
+     )
+
+     click.echo("Launching RunPod training...")
+     click.echo(f"  GPU: {gpu}")
+     click.echo(f"  Epochs: {max_epochs}")
+     click.echo(f"  Batch size: {batch_size}")
+
+     pod = runpod.create_pod(
+         name=name,
+         image_name="runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04",
+         gpu_type_id=gpu,
+         volume_in_gb=20,
+         docker_args=train_cmd,
+     )
+
+     click.echo("\nPod launched!")
+     click.echo(f"  ID: {pod['id']}")
+     click.echo("  Monitor: https://runpod.io/console/pods")
+     click.echo(f"\nTo stop: uv run scripts/runpod_setup.py terminate {pod['id']}")
+
+
+ if __name__ == "__main__":
+     main()
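The bootstrap pattern above — a single `&&`-chained command handed to the pod via `docker_args`, so that installing uv, cloning the repo, syncing dependencies, and starting training happen in one shot — can be sketched independently of the RunPod SDK. The helper name `build_train_cmd` and the flag values are illustrative:

```python
# Sketch of the single-command pod bootstrap: each step is chained with
# "&&" so a failure at any stage aborts the whole run on the pod.
def build_train_cmd(epochs: int, batch_size: int) -> str:
    steps = [
        "curl -LsSf https://astral.sh/uv/install.sh | sh",
        "source $HOME/.local/bin/env",
        "git clone https://huggingface.co/undertheseanlp/bamboo-1",
        "cd bamboo-1",
        "uv sync",
        f"uv run scripts/train.py --epochs {epochs} --batch-size {batch_size}",
    ]
    return " && ".join(steps)

cmd = build_train_cmd(epochs=50, batch_size=32)
print(cmd)
```

Passing this as one string avoids quoting problems that arise when multi-line scripts are embedded in a container's entrypoint arguments.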
scripts/watch_pod.py ADDED
@@ -0,0 +1,113 @@
+ # /// script
+ # requires-python = ">=3.10"
+ # dependencies = [
+ #     "runpod>=1.6.0",
+ #     "click>=8.0.0",
+ # ]
+ # ///
+ """
+ Watch RunPod pod status.
+
+ Usage:
+     export $(cat .env | xargs) && uv run scripts/watch_pod.py
+     export $(cat .env | xargs) && uv run scripts/watch_pod.py --pod-id <id>
+ """
+
+ import os
+ import time
+ import click
+ import runpod
+ from runpod.api.graphql import run_graphql_query
+
+
+ def get_pod_status(pod_id):
+     query = f'''
+     query getPodStatus {{
+         pod(input: {{ podId: "{pod_id}" }}) {{
+             id
+             name
+             desiredStatus
+             runtime {{
+                 uptimeInSeconds
+                 gpus {{
+                     gpuUtilPercent
+                     memoryUtilPercent
+                 }}
+                 container {{
+                     cpuPercent
+                     memoryPercent
+                 }}
+             }}
+         }}
+     }}
+     '''
+     return run_graphql_query(query)
+
+
+ @click.command()
+ @click.option("--pod-id", default=None, help="Pod ID to watch")
+ @click.option("--interval", default=10, type=int, help="Refresh interval in seconds")
+ def main(pod_id, interval):
+     """Watch RunPod pod status in real time."""
+     api_key = os.environ.get("RUNPOD_API_KEY")
+     if not api_key:
+         raise click.ClickException("Set RUNPOD_API_KEY")
+
+     runpod.api_key = api_key
+
+     # Fall back to the first active pod if no ID was given
+     if not pod_id:
+         pods = runpod.get_pods()
+         if not pods:
+             click.echo("No active pods found.")
+             return
+         pod_id = pods[0]["id"]
+         click.echo(f"Watching pod: {pods[0].get('name', pod_id)}")
+
+     click.echo(f"Refreshing every {interval}s. Press Ctrl+C to stop.\n")
+
+     try:
+         while True:
+             result = get_pod_status(pod_id)
+             pod = result.get("data", {}).get("pod")
+
+             if not pod:
+                 click.echo("Pod not found or terminated.")
+                 break
+
+             # Clear the screen and print the current status
+             click.clear()
+             click.echo(f"=== {pod['name']} ({pod['id']}) ===")
+             click.echo(f"Status: {pod['desiredStatus']}")
+
+             runtime = pod.get("runtime") or {}
+             uptime = runtime.get("uptimeInSeconds", 0)
+             mins, secs = divmod(uptime, 60)
+             hours, mins = divmod(mins, 60)
+             click.echo(f"Uptime: {int(hours)}h {int(mins)}m {int(secs)}s")
+
+             gpus = runtime.get("gpus") or []
+             if gpus:
+                 gpu = gpus[0]
+                 click.echo(f"GPU Util: {gpu.get('gpuUtilPercent', 0):.1f}%")
+                 click.echo(f"GPU Mem: {gpu.get('memoryUtilPercent', 0):.1f}%")
+
+             container = runtime.get("container") or {}
+             click.echo(f"CPU: {container.get('cpuPercent', 0):.1f}%")
+             click.echo(f"Memory: {container.get('memoryPercent', 0):.1f}%")
+
+             click.echo(f"\nLast update: {time.strftime('%H:%M:%S')}")
+             click.echo("Press Ctrl+C to stop")
+
+             if pod["desiredStatus"] not in ["RUNNING", "STARTING"]:
+                 click.echo(f"\nPod is {pod['desiredStatus']}. Stopping watch.")
+                 break
+
+             time.sleep(interval)
+
+     except KeyboardInterrupt:
+         click.echo("\nStopped watching.")
+
+
+ if __name__ == "__main__":
+     main()
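The uptime display in the watch loop converts raw seconds into `Xh Ym Zs` with two chained `divmod` calls; isolated as a small helper (the function name is illustrative):

```python
# Standalone version of the uptime formatting used in the watch loop:
# the first divmod splits seconds into minutes+seconds, the second
# splits those minutes into hours+minutes.
def format_uptime(uptime_seconds: int) -> str:
    mins, secs = divmod(uptime_seconds, 60)
    hours, mins = divmod(mins, 60)
    return f"{int(hours)}h {int(mins)}m {int(secs)}s"

print(format_uptime(3725))  # 1h 2m 5s
print(format_uptime(59))    # 0h 0m 59s
```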
uv.lock ADDED
The diff for this file is too large to render.