rain1024 committed
Commit 1489e5c · 1 Parent(s): c56099b

Add word segmentation support and underthesea-core integration

- Update handler.py to support both pycrfsuite and underthesea-core formats
- Add word segmentation training and prediction scripts
- Add training configurations (configs/pos_tagger.yaml, configs/word_segmentation.yaml)
- Update training scripts with multi-trainer support
- Update CLAUDE.md with folder structure and word segmentation docs

.gitignore CHANGED
@@ -30,3 +30,5 @@ per_tag_metrics.png
 # Logs
 *.log
 wandb/
+
+models
CLAUDE.md CHANGED
@@ -4,14 +4,48 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 ## Project Overview
 
-Vietnamese POS Tagger (TRE-1) - a CRF-based Part-of-Speech tagger for Vietnamese, deployed on Hugging Face at [undertheseanlp/tre-1](https://huggingface.co/undertheseanlp/tre-1). Uses python-crfsuite with 27 handcrafted feature templates. Trained on UDD-v0.1 dataset (80/20 train/test split, random_state=42). Achieves 95.57% accuracy.
+Vietnamese NLP Models (TRE-1) - CRF-based models for Vietnamese NLP tasks, deployed on Hugging Face at [undertheseanlp/tre-1](https://huggingface.co/undertheseanlp/tre-1). Includes:
+- **POS Tagger**: 27 handcrafted feature templates, predicts 15 Universal POS tags
+- **Word Segmentation**: BIO tagging at syllable level, 21 feature templates
+
+Trained on UDD-1 dataset from Hugging Face.
+
+## Folder Structure
+
+```
+tre-1/
+├── models/                      # Trained models (versioned by timestamp)
+│   ├── pos_tagger/
+│   │   └── 20260131_154530/     # YYYYMMDD_HHMMSS format
+│   │       ├── model.crfsuite
+│   │       └── metadata.yaml
+│   └── word_segmentation/
+│       └── 20260131_154530/
+│           ├── model.crfsuite
+│           └── metadata.yaml
+├── configs/                     # Training configurations
+│   ├── pos_tagger.yaml
+│   └── word_segmentation.yaml
+├── results/                     # Evaluation outputs (plots, metrics)
+│   ├── pos_tagger/
+│   └── word_segmentation/
+├── scripts/                     # Training, evaluation, inference scripts
+│   ├── train.py
+│   ├── train_word_segmentation.py
+│   ├── evaluate.py
+│   ├── predict.py
+│   └── predict_word_segmentation.py
+├── handler.py                   # Hugging Face Custom Handler
+├── pos_tagger.crfsuite          # Legacy model (for HF deployment)
+└── CLAUDE.md
+```
 
 ## Running the Model
 
 **Local inference:**
 ```python
 from handler import EndpointHandler
-handler = EndpointHandler(path="./")
+handler = EndpointHandler(path="models/pos_tagger/20260131_000000")
 result = handler({"inputs": "Tôi yêu Việt Nam"})
 ```
 
@@ -21,40 +55,67 @@ result = handler({"inputs": "Tôi yêu Việt Nam"})
 
 Scripts use inline script metadata (PEP 723) - no separate requirements file needed.
 
+### POS Tagger
+
 ```bash
-# Train model from scratch
+# Train model (auto-generates timestamp version, e.g., 20260131_154530)
 uv run scripts/train.py
 
-# Train with custom output path and W&B logging
-uv run scripts/train.py --output model.crfsuite --wandb
+# Train with custom version name
+uv run scripts/train.py --version my_experiment
+
+# Train with W&B logging
+uv run scripts/train.py --wandb
+
+# Evaluate latest model
+uv run scripts/evaluate.py
 
-# Evaluate trained model
-uv run scripts/evaluate.py --model pos_tagger.crfsuite
+# Evaluate specific version
+uv run scripts/evaluate.py --version 20260131_000000
 
-# Evaluate with confusion matrix and per-tag plots
-uv run scripts/evaluate.py --model pos_tagger.crfsuite --save-plots
+# Evaluate with plots (saves to results/pos_tagger/)
+uv run scripts/evaluate.py --save-plots
 
-# Inference (formats: inline, json, conll)
+# Inference (uses latest model by default)
 uv run scripts/predict.py "Tôi yêu Việt Nam"
-uv run scripts/predict.py --format json "Hà Nội là thủ đô"
-echo "Học sinh đang học bài" | uv run scripts/predict.py -
+uv run scripts/predict.py --version 20260131_000000 --format json "Hà Nội là thủ đô"
+```
+
+### Word Segmentation
+
+```bash
+# Train model (auto-generates timestamp version)
+uv run scripts/train_word_segmentation.py
+
+# Train with custom version name
+uv run scripts/train_word_segmentation.py --version my_experiment
+
+# Inference
+uv run scripts/predict_word_segmentation.py "Tôi yêu Việt Nam"
 ```
 
 ## Architecture
 
 Single-file implementation (`handler.py`) following Hugging Face Custom Handler pattern:
 
-- **PythonCRFFeaturizer**: Extracts 27 linguistic features per token (word form, case, prefix/suffix, context windows, dictionary lookups)
+- **PythonCRFFeaturizer**: Extracts linguistic features per token (word form, case, prefix/suffix, context windows, dictionary lookups)
 - **EndpointHandler**: Hugging Face API entry point - loads CRF model, handles tokenization and inference
-- **pos_tagger.crfsuite**: Binary CRF model (Git LFS tracked)
 
 **Data flow:** Input text → whitespace tokenization → feature extraction → CRF prediction → `[{"token": "...", "tag": "..."}]`
 
 ## Key Constraints
 
 - Input must be pre-tokenized (whitespace-separated Vietnamese tokens)
-- No word segmentation - expects already segmented Vietnamese text
 - Feature template syntax: `T[index].attribute` (e.g., `T[-1].lower`, `T[0,1].is_in_dict`)
-- Predicts 15 Universal POS tags: ADJ, ADP, ADV, AUX, CCONJ, DET, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, VERB, X
+- POS Tagger predicts 15 Universal POS tags: ADJ, ADP, ADV, AUX, CCONJ, DET, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, VERB, X
+- Word Segmentation uses BIO tagging: B (beginning), I (inside)
 - CRF training params: c1=1.0 (L1), c2=0.001 (L2), max_iterations=100
-- Model trained on legal documents domain (UDD-v0.1) - may underperform on casual/social text
+
+## Model Versioning
+
+Models use timestamp-based versioning (`YYYYMMDD_HHMMSS`):
+- Each version has its own directory under `models/{task}/{timestamp}/`
+- Auto-generated when training without `--version` flag
+- Scripts default to **latest** version (sorted alphabetically)
+- `metadata.yaml` contains training info, hyperparameters, and performance metrics
+- `configs/` stores reusable training configurations
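The versioning notes above rely on the fact that `YYYYMMDD_HHMMSS` strings sort lexicographically in chronological order, so "latest = sorted alphabetically" is safe. A minimal sketch of that convention (version strings here are illustrative, not real checkpoints):

```python
def latest_version(versions):
    """Return the newest timestamp-style version name, or None if empty.

    YYYYMMDD_HHMMSS sorts lexicographically in chronological order,
    so a plain sorted() picks the most recent training run.
    """
    return sorted(versions)[-1] if versions else None


versions = ["20250930_090000", "20260131_154530", "20251201_120000"]
print(latest_version(versions))  # 20260131_154530
```

This is why the scripts can resolve "latest" without parsing dates at all.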
configs/pos_tagger.yaml ADDED
@@ -0,0 +1,54 @@
+# POS Tagger Training Configuration
+# Dataset: UDD-1 from Hugging Face
+
+model:
+  name: pos_tagger
+  type: crf
+  version: v1.0.0
+
+training:
+  c1: 1.0        # L1 regularization coefficient
+  c2: 0.001      # L2 regularization coefficient
+  max_iterations: 100
+  feature_possible_transitions: true
+
+data:
+  dataset: undertheseanlp/UDD-1
+  train_split: train
+  val_split: validation
+  test_split: test
+
+features:
+  num_templates: 27
+  templates:
+    - T[0]
+    - T[0].lower
+    - T[0].istitle
+    - T[0].isupper
+    - T[0].isdigit
+    - T[0].isalpha
+    - T[0].prefix2
+    - T[0].prefix3
+    - T[0].suffix2
+    - T[0].suffix3
+    - T[-1]
+    - T[-1].lower
+    - T[-1].istitle
+    - T[-1].isupper
+    - T[-2]
+    - T[-2].lower
+    - T[1]
+    - T[1].lower
+    - T[1].istitle
+    - T[1].isupper
+    - T[2]
+    - T[2].lower
+    - T[-1,0]
+    - T[0,1]
+    - T[0].is_in_dict
+    - T[-1,0].is_in_dict
+    - T[0,1].is_in_dict
+
+output:
+  model_dir: models/pos_tagger
+  results_dir: results/pos_tagger
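The `T[index].attribute` entries in this config are parsed with the regex that `parse_template` in scripts/evaluate.py uses; a standalone sketch of that split (the tuple-return handling after the regex is illustrative):

```python
import re

def parse_template(template):
    """Split a feature template like 'T[-1].lower' into (index, attribute).

    Uses the same regex as parse_template in scripts/evaluate.py;
    templates without an attribute (e.g. 'T[0]') yield attribute None.
    """
    match = re.match(r"T\[([^\]]+)\](?:\.(\w+))?", template)
    if not match:
        return None, None
    index_str, attribute = match.groups()
    return index_str, attribute


print(parse_template("T[-1].lower"))        # ('-1', 'lower')
print(parse_template("T[0,1].is_in_dict"))  # ('0,1', 'is_in_dict')
print(parse_template("T[0]"))               # ('0', None)
```

Note that comma indices like `0,1` (token n-grams) come back as a single string and are split downstream.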
configs/word_segmentation.yaml ADDED
@@ -0,0 +1,49 @@
+# Word Segmentation Training Configuration
+# Dataset: UDD-1 from Hugging Face
+
+model:
+  name: word_segmentation
+  type: crf
+  version: v1.0.0
+  tagging_scheme: BIO  # B=Beginning, I=Inside
+
+training:
+  c1: 1.0        # L1 regularization coefficient
+  c2: 0.001      # L2 regularization coefficient
+  max_iterations: 100
+  feature_possible_transitions: true
+
+data:
+  dataset: undertheseanlp/UDD-1
+  train_split: train
+  val_split: validation
+  test_split: test
+  preprocessing: underthesea.regex_tokenize  # Syllable splitting
+
+features:
+  num_templates: 21
+  templates:
+    - S[0]
+    - S[0].lower
+    - S[0].istitle
+    - S[0].isupper
+    - S[0].isdigit
+    - S[0].ispunct
+    - S[0].len
+    - S[0].prefix2
+    - S[0].suffix2
+    - S[-1]
+    - S[-1].lower
+    - S[-2]
+    - S[-2].lower
+    - S[1]
+    - S[1].lower
+    - S[2]
+    - S[2].lower
+    - S[-1,0]
+    - S[0,1]
+    - S[-1,0,1]
+
+output:
+  model_dir: models/word_segmentation
+  results_dir: results/word_segmentation
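The BIO scheme above marks each syllable as B (starts a word) or I (continues the previous word). A hedged sketch of decoding tagged syllables back into words; joining a word's syllables with a space is an illustrative choice, not necessarily the repo's output convention:

```python
def decode_bio(syllables, tags):
    """Group syllables into words from B/I tags.

    Each "B" opens a new word; each "I" extends the current one.
    A leading "I" (malformed input) is treated as "B" for robustness.
    """
    words = []
    for syllable, tag in zip(syllables, tags):
        if tag == "B" or not words:
            words.append([syllable])
        else:  # "I"
            words[-1].append(syllable)
    return [" ".join(w) for w in words]


print(decode_bio(["Tôi", "yêu", "Việt", "Nam"], ["B", "B", "B", "I"]))
# ['Tôi', 'yêu', 'Việt Nam']
```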
handler.py CHANGED
@@ -1,11 +1,32 @@
 """
 Custom handler for Vietnamese POS Tagger inference on Hugging Face.
+
+Supports two model formats:
+- CRFsuite format (.crfsuite) - loaded with pycrfsuite
+- underthesea-core format (.crf) - loaded with underthesea_core
 """
 
+import os
 import re
-import pycrfsuite
 from typing import Dict, List, Any
 
+# Try importing both taggers
+try:
+    import pycrfsuite
+    HAS_PYCRFSUITE = True
+except ImportError:
+    HAS_PYCRFSUITE = False
+
+try:
+    from underthesea_core import CRFModel, CRFTagger
+    HAS_UNDERTHESEA_CORE = True
+except ImportError:
+    try:
+        from underthesea_core.underthesea_core import CRFModel, CRFTagger
+        HAS_UNDERTHESEA_CORE = True
+    except ImportError:
+        HAS_UNDERTHESEA_CORE = False
+
 
 class PythonCRFFeaturizer:
     """
@@ -101,10 +122,39 @@ class EndpointHandler:
 
         self.featurizer = PythonCRFFeaturizer(self.feature_templates)
 
-        # Load CRF model
-        model_path = os.path.join(path, "pos_tagger.crfsuite")
-        self.tagger = pycrfsuite.Tagger()
-        self.tagger.open(model_path)
+        # Load CRF model - check multiple possible locations and formats
+        # Priority: .crfsuite (pycrfsuite) > .crf (underthesea-core)
+        model_candidates = [
+            (os.path.join(path, "model.crfsuite"), "pycrfsuite"),
+            (os.path.join(path, "pos_tagger.crfsuite"), "pycrfsuite"),
+            (os.path.join(path, "model.crf"), "underthesea-core"),
+        ]
+
+        model_path = None
+        model_format = None
+        for candidate, fmt in model_candidates:
+            if os.path.exists(candidate):
+                model_path = candidate
+                model_format = fmt
+                break
+
+        if model_path is None:
+            raise FileNotFoundError(
+                f"No model found. Checked: {[c for c, _ in model_candidates]}"
+            )
+
+        # Load model based on format
+        self.model_format = model_format
+        if model_format == "pycrfsuite":
+            if not HAS_PYCRFSUITE:
+                raise ImportError("pycrfsuite not installed. Install with: pip install python-crfsuite")
+            self.tagger = pycrfsuite.Tagger()
+            self.tagger.open(model_path)
+        elif model_format == "underthesea-core":
+            if not HAS_UNDERTHESEA_CORE:
+                raise ImportError("underthesea-core not installed")
+            model = CRFModel.load(model_path)
+            self.tagger = CRFTagger.from_model(model)
 
     def _tokenize(self, text: str) -> List[str]:
         """Simple whitespace tokenization."""
scripts/evaluate.py CHANGED
@@ -6,24 +6,29 @@
 #     "scikit-learn>=1.6.1",
 #     "matplotlib>=3.5.0",
 #     "seaborn>=0.12.0",
+#     "click>=8.0.0",
 # ]
 # ///
 """
 Evaluation script for Vietnamese POS Tagger (TRE-1).
 
-Generates detailed metrics, confusion matrix, and visualizations
-as described in TECHNICAL_REPORT.md.
-
 Usage:
     uv run scripts/evaluate.py
-    uv run scripts/evaluate.py --model pos_tagger.crfsuite
-    uv run scripts/evaluate.py --save-plots  # Save confusion matrix and charts
+    uv run scripts/evaluate.py --version v1.0.0
+    uv run scripts/evaluate.py --model models/pos_tagger/v1.0.0/model.crfsuite
+    uv run scripts/evaluate.py --save-plots
 """
 
-import argparse
+import re
+from collections import Counter
+from pathlib import Path
+
+import click
 import pycrfsuite
 from datasets import load_dataset
-from sklearn.model_selection import train_test_split
+
+# Get project root directory
+PROJECT_ROOT = Path(__file__).parent.parent
 from sklearn.metrics import (
     accuracy_score,
     precision_recall_fscore_support,
@@ -83,7 +88,6 @@ def apply_attribute(value, attribute, dictionary=None):
 
 
 def parse_template(template):
-    import re
     match = re.match(r"T\[([^\]]+)\](?:\.(\w+))?", template)
     if not match:
         return None, None
@@ -122,20 +126,18 @@ def sentence_to_features(tokens):
 
 
 def load_test_data():
-    print("Loading UDD-v0.1 dataset...")
-    dataset = load_dataset("undertheseanlp/UDD-v0.1")
+    click.echo("Loading UDD-1 dataset...")
+    dataset = load_dataset("undertheseanlp/UDD-1")
 
     sentences = []
-    for item in dataset["train"]:
+    for item in dataset["test"]:
         tokens = item["tokens"]
        tags = item["upos"]
         if tokens and tags:
             sentences.append((tokens, tags))
 
-    # Use same split as training
-    _, test_data = train_test_split(sentences, test_size=0.2, random_state=42)
-    print(f"Test set: {len(test_data)} sentences")
-    return test_data
+    click.echo(f"Test set: {len(sentences)} sentences")
+    return sentences
 
 
 def plot_confusion_matrix(y_true, y_pred, labels, output_path):
@@ -159,13 +161,12 @@ def plot_confusion_matrix(y_true, y_pred, labels, output_path):
     plt.tight_layout()
     plt.savefig(output_path, dpi=150)
     plt.close()
-    print(f"Confusion matrix saved to {output_path}")
+    click.echo(f"Confusion matrix saved to {output_path}")
 
 
 def plot_per_tag_metrics(report_dict, output_path):
     import matplotlib.pyplot as plt
 
-    # Filter out aggregate metrics
     tags = [k for k in report_dict.keys() if k not in ("accuracy", "macro avg", "weighted avg")]
 
     precision = [report_dict[t]["precision"] for t in tags]
@@ -192,13 +193,11 @@ def plot_per_tag_metrics(report_dict, output_path):
     plt.tight_layout()
     plt.savefig(output_path, dpi=150)
     plt.close()
-    print(f"Per-tag metrics saved to {output_path}")
+    click.echo(f"Per-tag metrics saved to {output_path}")
 
 
 def analyze_errors(y_true, y_pred, tokens_flat, top_n=10):
     """Analyze common error patterns."""
-    from collections import Counter
-
     errors = Counter()
     error_examples = {}
 
@@ -209,24 +208,69 @@ def analyze_errors(y_true, y_pred, tokens_flat, top_n=10):
         if key not in error_examples:
             error_examples[key] = token
 
-    print(f"\nTop {top_n} Error Patterns:")
-    print("-" * 60)
-    print(f"{'True':<10} {'Predicted':<10} {'Count':<8} {'Example'}")
-    print("-" * 60)
+    click.echo(f"\nTop {top_n} Error Patterns:")
+    click.echo("-" * 60)
+    click.echo(f"{'True':<10} {'Predicted':<10} {'Count':<8} {'Example'}")
+    click.echo("-" * 60)
 
     for (true, pred), count in errors.most_common(top_n):
         example = error_examples.get((true, pred), "")
-        print(f"{true:<10} {pred:<10} {count:<8} {example}")
+        click.echo(f"{true:<10} {pred:<10} {count:<8} {example}")
 
 
-def evaluate(model_path, save_plots=False):
-    print(f"Loading model from {model_path}...")
+def get_latest_version(task="pos_tagger"):
+    """Get the latest model version (sorted by timestamp)."""
+    models_dir = PROJECT_ROOT / "models" / task
+    if not models_dir.exists():
+        return None
+    versions = [d.name for d in models_dir.iterdir() if d.is_dir()]
+    if not versions:
+        return None
+    return sorted(versions)[-1]  # Latest timestamp
+
+
+@click.command()
+@click.option(
+    "--version", "-v",
+    default=None,
+    help="Model version to evaluate (default: latest)",
+)
+@click.option(
+    "--model", "-m",
+    default=None,
+    help="Custom model path (overrides version-based path)",
+)
+@click.option(
+    "--save-plots",
+    is_flag=True,
+    help="Save confusion matrix and per-tag metrics plots",
+)
+def evaluate(version, model, save_plots):
+    """Evaluate Vietnamese POS Tagger on UDD-1 test set."""
+    # Use latest version if not specified
+    if version is None and model is None:
+        version = get_latest_version("pos_tagger")
+        if version is None:
+            raise click.ClickException("No models found in models/pos_tagger/")
+
+    # Determine model path
+    if model:
+        model_path = Path(model)
+    else:
+        model_path = PROJECT_ROOT / "models" / "pos_tagger" / version / "model.crfsuite"
+
+    # Determine output directory for plots
+    if save_plots:
+        results_dir = PROJECT_ROOT / "results" / "pos_tagger"
+        results_dir.mkdir(parents=True, exist_ok=True)
+
+    click.echo(f"Loading model from {model_path}...")
     tagger = pycrfsuite.Tagger()
-    tagger.open(model_path)
+    tagger.open(str(model_path))
 
     test_data = load_test_data()
 
-    print("Extracting features and predicting...")
+    click.echo("Extracting features and predicting...")
     X_test = [sentence_to_features(tokens) for tokens, _ in test_data]
     y_test = [tags for _, tags in test_data]
     tokens_test = [tokens for tokens, _ in test_data]
@@ -246,70 +290,54 @@ def evaluate(model_path, save_plots=False):
     precision_macro, recall_macro, f1_macro, _ = precision_recall_fscore_support(
         y_test_flat, y_pred_flat, average="macro"
     )
-    precision_weighted, recall_weighted, f1_weighted, _ = precision_recall_fscore_support(
+    _, _, f1_weighted, _ = precision_recall_fscore_support(
         y_test_flat, y_pred_flat, average="weighted"
     )
 
-    print("\n" + "=" * 60)
-    print("EVALUATION RESULTS")
-    print("=" * 60)
+    click.echo("\n" + "=" * 60)
+    click.echo("EVALUATION RESULTS")
+    click.echo("=" * 60)
 
-    print("\nOverall Metrics:")
-    print(f"  Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
-    print(f"  Precision (macro): {precision_macro:.4f}")
-    print(f"  Recall (macro): {recall_macro:.4f}")
-    print(f"  F1 (macro): {f1_macro:.4f}")
-    print(f"  F1 (weighted): {f1_weighted:.4f}")
+    click.echo("\nOverall Metrics:")
+    click.echo(f"  Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
+    click.echo(f"  Precision (macro): {precision_macro:.4f}")
+    click.echo(f"  Recall (macro): {recall_macro:.4f}")
+    click.echo(f"  F1 (macro): {f1_macro:.4f}")
+    click.echo(f"  F1 (weighted): {f1_weighted:.4f}")
 
-    print("\nPer-Tag Classification Report:")
+    click.echo("\nPer-Tag Classification Report:")
     report = classification_report(y_test_flat, y_pred_flat, digits=4)
-    print(report)
+    click.echo(report)
 
     # Error analysis
     analyze_errors(y_test_flat, y_pred_flat, tokens_flat)
 
     # Dataset statistics
-    from collections import Counter
     tag_counts = Counter(y_test_flat)
     total_tokens = len(y_test_flat)
 
-    print("\nTest Set Tag Distribution:")
-    print("-" * 40)
+    click.echo("\nTest Set Tag Distribution:")
+    click.echo("-" * 40)
     for tag in labels:
         count = tag_counts[tag]
         pct = count / total_tokens * 100
-        print(f"  {tag:<8} {count:>6} ({pct:>5.2f}%)")
+        click.echo(f"  {tag:<8} {count:>6} ({pct:>5.2f}%)")
 
     if save_plots:
+        cm_path = results_dir / f"confusion_matrix_{version}.png"
         plot_confusion_matrix(
             y_test_flat, y_pred_flat, labels,
-            "confusion_matrix.png"
+            str(cm_path)
         )
 
         report_dict = classification_report(
             y_test_flat, y_pred_flat, output_dict=True
         )
-        plot_per_tag_metrics(report_dict, "per_tag_metrics.png")
+        metrics_path = results_dir / f"per_tag_metrics_{version}.png"
+        plot_per_tag_metrics(report_dict, str(metrics_path))
 
     return accuracy
 
 
-def main():
-    parser = argparse.ArgumentParser(description="Evaluate Vietnamese POS Tagger")
-    parser.add_argument(
-        "--model", "-m",
-        default="pos_tagger.crfsuite",
-        help="Path to trained model"
-    )
-    parser.add_argument(
-        "--save-plots",
-        action="store_true",
-        help="Save confusion matrix and per-tag metrics plots"
-    )
-    args = parser.parse_args()
-
-    evaluate(args.model, save_plots=args.save_plots)
-
-
 if __name__ == "__main__":
-    main()
+    evaluate()
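evaluate.py reports both macro and weighted F1 via scikit-learn. The difference is easy to miss: macro averages each tag's F1 equally, while weighted scales each tag's F1 by its share of the gold labels. A pure-Python sketch of both, on toy POS labels (the repo computes these with `precision_recall_fscore_support`; this reimplementation is only for illustration):

```python
from collections import Counter

def per_class_f1(y_true, y_pred, label):
    """F1 for one label: 2*TP / (2*TP + FP + FN); 0.0 when TP is zero."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def macro_and_weighted_f1(y_true, y_pred):
    """Macro = unweighted mean of per-tag F1; weighted = support-weighted mean."""
    support = Counter(y_true)
    labels = sorted(support)
    f1s = {lab: per_class_f1(y_true, y_pred, lab) for lab in labels}
    macro = sum(f1s.values()) / len(labels)
    weighted = sum(f1s[lab] * support[lab] for lab in labels) / len(y_true)
    return macro, weighted


y_true = ["NOUN", "NOUN", "NOUN", "VERB"]
y_pred = ["NOUN", "NOUN", "VERB", "VERB"]
macro, weighted = macro_and_weighted_f1(y_true, y_pred)
# NOUN F1 = 0.8, VERB F1 = 2/3: macro averages them equally,
# weighted tilts toward NOUN (3 of 4 gold tokens).
```

On a tag set as skewed as POS distributions usually are, weighted F1 tracks the frequent tags (NOUN, PUNCT) while macro F1 exposes rare-tag performance.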
scripts/predict.py CHANGED
@@ -2,6 +2,8 @@
 # requires-python = ">=3.9"
 # dependencies = [
 #     "python-crfsuite>=0.9.11",
+#     "click>=8.0.0",
+#     "underthesea-core @ file:///home/claude-user/projects/workspace_underthesea/underthesea-core-dev/extensions/underthesea_core/target/wheels/underthesea_core-1.0.7-cp312-cp312-manylinux_2_34_x86_64.whl",
 # ]
 # ///
 """
@@ -9,68 +11,98 @@ Inference script for Vietnamese POS Tagger (TRE-1).
 
 Usage:
     uv run scripts/predict.py "Tôi yêu Việt Nam"
-    uv run scripts/predict.py --model pos_tagger.crfsuite "Hà Nội là thủ đô"
+    uv run scripts/predict.py --version v1.0.0 "Hà Nội là thủ đô"
+    uv run scripts/predict.py --model models/pos_tagger/v1.0.0 "Test"
     echo "Học sinh đang học bài" | uv run scripts/predict.py -
 """
 
-import argparse
+import json
 import sys
 import os
+from pathlib import Path
+
+import click
 
 # Add parent directory to import handler
 sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
 
+# Get project root directory
+PROJECT_ROOT = Path(__file__).parent.parent
+
 from handler import EndpointHandler
 
 
-def main():
-    parser = argparse.ArgumentParser(description="Vietnamese POS Tagger Inference")
-    parser.add_argument(
-        "text",
-        nargs="?",
-        default="-",
-        help="Text to tag (use '-' for stdin)"
-    )
-    parser.add_argument(
-        "--model", "-m",
-        default=".",
-        help="Path to model directory (default: current directory)"
-    )
-    parser.add_argument(
-        "--format", "-f",
-        choices=["inline", "json", "conll"],
-        default="inline",
-        help="Output format"
-    )
-    args = parser.parse_args()
+def get_latest_version(task="pos_tagger"):
+    """Get the latest model version (sorted by timestamp)."""
+    models_dir = PROJECT_ROOT / "models" / task
+    if not models_dir.exists():
+        return None
+    versions = [d.name for d in models_dir.iterdir() if d.is_dir()]
+    if not versions:
+        return None
+    return sorted(versions)[-1]  # Latest timestamp
+
+
+@click.command()
+@click.argument("text", default="-")
+@click.option(
+    "--version", "-v",
+    default=None,
+    help="Model version to use (default: latest)",
+)
+@click.option(
+    "--model", "-m",
+    default=None,
+    help="Custom model directory path (overrides version-based path)",
+)
+@click.option(
+    "--format", "-f",
+    "output_format",
+    type=click.Choice(["inline", "json", "conll"]),
+    default="inline",
+    help="Output format",
+    show_default=True,
+)
+def predict(text, version, model, output_format):
+    """Tag Vietnamese text with POS tags.
+
+    TEXT is the input text to tag. Use '-' to read from stdin.
+    """
+    # Use latest version if not specified
+    if version is None and model is None:
+        version = get_latest_version("pos_tagger")
+        if version is None:
+            raise click.ClickException("No models found in models/pos_tagger/")
+
+    # Determine model path
+    if model:
+        model_path = model
+    else:
+        model_path = str(PROJECT_ROOT / "models" / "pos_tagger" / version)
 
     # Read input
-    if args.text == "-":
+    if text == "-":
         text = sys.stdin.read().strip()
-    else:
-        text = args.text
 
     if not text:
-        print("Error: No input text provided", file=sys.stderr)
-        sys.exit(1)
+        raise click.ClickException("No input text provided")
 
     # Load model
-    handler = EndpointHandler(path=args.model)
+    handler = EndpointHandler(path=model_path)
 
     # Predict
     result = handler({"inputs": text})
 
     # Format output
-    if args.format == "json":
-        import json
-        print(json.dumps(result, ensure_ascii=False, indent=2))
-    elif args.format == "conll":
+    if output_format == "json":
+        click.echo(json.dumps(result, ensure_ascii=False, indent=2))
+    elif output_format == "conll":
         for i, item in enumerate(result, 1):
-            print(f"{i}\t{item['token']}\t{item['tag']}")
+            click.echo(f"{i}\t{item['token']}\t{item['tag']}")
     else:  # inline
         tagged = " ".join(f"{item['token']}/{item['tag']}" for item in result)
-        print(tagged)
+        click.echo(tagged)
 
 
 if __name__ == "__main__":
-    main()
+    predict()
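predict.py keeps the usual CLI convention of treating a `-` argument as "read from stdin" (the `echo ... | predict.py -` usage above). The same resolution step in isolation, with stdin injectable so it can be checked without a pipe:

```python
import io
import sys

def resolve_text(arg, stdin=None):
    """Return the text to tag: the argument itself, or stdin when arg is '-'."""
    stdin = stdin or sys.stdin
    text = stdin.read().strip() if arg == "-" else arg
    if not text:
        raise ValueError("No input text provided")
    return text


print(resolve_text("Tôi yêu Việt Nam"))         # Tôi yêu Việt Nam
print(resolve_text("-", io.StringIO("x y\n")))  # x y
```

Empty input (including a blank pipe) is rejected up front, matching the script's `ClickException` path.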
scripts/predict_word_segmentation.py ADDED
@@ -0,0 +1,201 @@
+# /// script
+# requires-python = ">=3.9"
+# dependencies = [
+#     "python-crfsuite>=0.9.11",
+#     "click>=8.0.0",
+#     "underthesea>=6.8.0",
+#     "underthesea-core @ file:///home/claude-user/projects/workspace_underthesea/underthesea-core-dev/extensions/underthesea_core/target/wheels/underthesea_core-1.0.7-cp312-cp312-manylinux_2_34_x86_64.whl",
+# ]
+# ///
+"""
+Prediction script for Vietnamese Word Segmentation.
+
+Uses underthesea regex_tokenize to split text into syllables,
+then applies CRF model at syllable level to decide word boundaries.
+
+Usage:
+    uv run scripts/predict_word_segmentation.py "Trên thế giới, giá vàng đang giao dịch"
+    echo "Text here" | uv run scripts/predict_word_segmentation.py -
+"""
+
+import sys
+
+import click
+import pycrfsuite
+from underthesea.pipeline.word_tokenize.regex_tokenize import tokenize as regex_tokenize
+
+
+def get_syllable_at(syllables, position, offset):
+    """Get syllable at position + offset, with boundary handling."""
+    idx = position + offset
+    if idx < 0:
+        return "__BOS__"
+    elif idx >= len(syllables):
+        return "__EOS__"
+    return syllables[idx]
+
+
+def is_punct(s):
+    """Check if string is punctuation."""
+    return len(s) == 1 and not s.isalnum()
+
+
+def extract_syllable_features(syllables, position):
+    """Extract features for a syllable at given position."""
+    features = {}
+
+    # Current syllable
+    s0 = get_syllable_at(syllables, position, 0)
+    is_boundary = s0 in ("__BOS__", "__EOS__")
+
+    features["S[0]"] = s0
+    features["S[0].lower"] = s0.lower() if not is_boundary else s0
+    features["S[0].istitle"] = str(s0.istitle()) if not is_boundary else "False"
+    features["S[0].isupper"] = str(s0.isupper()) if not is_boundary else "False"
+    features["S[0].isdigit"] = str(s0.isdigit()) if not is_boundary else "False"
+    features["S[0].ispunct"] = str(is_punct(s0)) if not is_boundary else "False"
+    features["S[0].len"] = str(len(s0)) if not is_boundary else "0"
+    features["S[0].prefix2"] = s0[:2] if not is_boundary and len(s0) >= 2 else s0
+    features["S[0].suffix2"] = s0[-2:] if not is_boundary and len(s0) >= 2 else s0
+
+    # Previous syllables
+    s_1 = get_syllable_at(syllables, position, -1)
+    s_2 = get_syllable_at(syllables, position, -2)
+    features["S[-1]"] = s_1
+    features["S[-1].lower"] = s_1.lower() if s_1 not in ("__BOS__", "__EOS__") else s_1
+    features["S[-2]"] = s_2
+    features["S[-2].lower"] = s_2.lower() if s_2 not in ("__BOS__", "__EOS__") else s_2
68
+
69
+ # Next syllables
70
+ s1 = get_syllable_at(syllables, position, 1)
71
+ s2 = get_syllable_at(syllables, position, 2)
72
+ features["S[1]"] = s1
73
+ features["S[1].lower"] = s1.lower() if s1 not in ("__BOS__", "__EOS__") else s1
74
+ features["S[2]"] = s2
75
+ features["S[2].lower"] = s2.lower() if s2 not in ("__BOS__", "__EOS__") else s2
76
+
77
+ # Bigrams
78
+ features["S[-1,0]"] = f"{s_1}|{s0}"
79
+ features["S[0,1]"] = f"{s0}|{s1}"
80
+
81
+ # Trigrams
82
+ features["S[-1,0,1]"] = f"{s_1}|{s0}|{s1}"
83
+
84
+ return features
85
+
86
+
87
+ def sentence_to_syllable_features(syllables):
88
+ """Convert syllable sequence to feature sequences."""
89
+ return [
90
+ [f"{k}={v}" for k, v in extract_syllable_features(syllables, i).items()]
91
+ for i in range(len(syllables))
92
+ ]
93
+
94
+
95
+ def labels_to_words(syllables, labels):
96
+ """Convert syllable sequence and BIO labels back to words."""
97
+ words = []
98
+ current_word = []
99
+
100
+ for syl, label in zip(syllables, labels):
101
+ if label == "B":
102
+ if current_word:
103
+ words.append(" ".join(current_word))
104
+ current_word = [syl]
105
+ else: # I
106
+ current_word.append(syl)
107
+
108
+ if current_word:
109
+ words.append(" ".join(current_word))
110
+
111
+ return words
112
+
113
+
114
+ def segment_text(text, tagger):
115
+ """
116
+ Full pipeline: regex tokenize -> CRF segment -> output words.
117
+ """
118
+ # Step 1: Regex tokenize into syllables
119
+ syllables = regex_tokenize(text)
120
+
121
+ if not syllables:
122
+ return ""
123
+
124
+ # Step 2: Extract syllable features
125
+ X = sentence_to_syllable_features(syllables)
126
+
127
+ # Step 3: Predict BIO labels
128
+ labels = tagger.tag(X)
129
+
130
+    # Step 4: Convert to words (syllables joined with underscore for compound words)
+    words = labels_to_words(syllables, labels)
+
+    return " ".join(w.replace(" ", "_") for w in words)
134
+
135
+
136
+ def segment_text_formatted(text, tagger, use_underscore=True):
137
+ """
138
+ Full pipeline with formatted output.
139
+ """
140
+ syllables = regex_tokenize(text)
141
+
142
+ if not syllables:
143
+ return ""
144
+
145
+ X = sentence_to_syllable_features(syllables)
146
+ labels = tagger.tag(X)
147
+ words = labels_to_words(syllables, labels)
148
+
149
+ if use_underscore:
150
+ # Join compound word syllables with underscore
151
+ return " ".join(w.replace(" ", "_") for w in words)
152
+ else:
153
+ return " ".join(words)
154
+
155
+
156
+ @click.command()
157
+ @click.argument("text", required=False)
158
+ @click.option(
159
+ "--model", "-m",
160
+ default="word_segmenter.crfsuite",
161
+ help="Path to CRF model file",
162
+ show_default=True,
163
+ )
164
+ @click.option(
165
+ "--underscore/--no-underscore",
166
+ default=True,
167
+ help="Use underscore to join compound word syllables",
168
+ )
169
+ def main(text, model, underscore):
170
+ """Segment Vietnamese text into words."""
171
+ # Handle stdin input
172
+ if text == "-" or text is None:
173
+ text = sys.stdin.read().strip()
174
+
175
+ if not text:
176
+ click.echo("No input text provided", err=True)
177
+ return
178
+
179
+ # Load model - support both pycrfsuite and underthesea-core formats
180
+ if model.endswith(".crf"):
181
+ # underthesea-core format
182
+ try:
183
+ from underthesea_core import CRFModel, CRFTagger
184
+ except ImportError:
185
+ from underthesea_core.underthesea_core import CRFModel, CRFTagger
186
+ crf_model = CRFModel.load(model)
187
+ tagger = CRFTagger.from_model(crf_model)
188
+ else:
189
+ # pycrfsuite format
190
+ tagger = pycrfsuite.Tagger()
191
+ tagger.open(model)
192
+
193
+ # Process each line
194
+ for line in text.split("\n"):
195
+ if line.strip():
196
+ result = segment_text_formatted(line, tagger, use_underscore=underscore)
197
+ click.echo(result)
198
+
199
+
200
+ if __name__ == "__main__":
201
+ main()
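The BIO-to-word conversion at the heart of this script can be exercised standalone. A minimal sketch mirroring `labels_to_words` above, with a hypothetical label sequence:

```python
def labels_to_words(syllables, labels):
    """Group syllables into words: 'B' starts a new word, 'I' continues it."""
    words, current = [], []
    for syl, label in zip(syllables, labels):
        if label == "B":
            if current:
                words.append(" ".join(current))
            current = [syl]
        else:  # "I"
            current.append(syl)
    if current:
        words.append(" ".join(current))
    return words

syllables = ["Tôi", "yêu", "Việt", "Nam"]
labels = ["B", "B", "B", "I"]
words = labels_to_words(syllables, labels)
print(" ".join(w.replace(" ", "_") for w in words))  # Tôi yêu Việt_Nam
```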
scripts/train.py CHANGED
@@ -2,25 +2,94 @@
2
  # requires-python = ">=3.9"
3
  # dependencies = [
4
  # "python-crfsuite>=0.9.11",
 
5
  # "datasets>=4.5.0",
6
  # "scikit-learn>=1.6.1",
7
  # ]
8
  # ///
9
  """
10
  Training script for Vietnamese POS Tagger (TRE-1).
11
 
12
- Reproduces the training process from TECHNICAL_REPORT.md.
13
 
14
  Usage:
15
  uv run scripts/train.py
16
- uv run scripts/train.py --output model.crfsuite
17
- uv run scripts/train.py --wandb # Enable W&B logging
18
  """
19
 
20
- import argparse
21
- import pycrfsuite
22
  from datasets import load_dataset
23
- from sklearn.model_selection import train_test_split
24
 
25
 
26
  FEATURE_TEMPLATES = [
@@ -74,7 +143,6 @@ def apply_attribute(value, attribute, dictionary=None):
74
 
75
 
76
  def parse_template(template):
77
- import re
78
  match = re.match(r"T\[([^\]]+)\](?:\.(\w+))?", template)
79
  if not match:
80
  return None, None
@@ -112,115 +180,422 @@ def sentence_to_features(tokens):
112
  ]
113
 
114
 
115
- def load_data():
116
- print("Loading UDD-v0.1 dataset...")
117
- dataset = load_dataset("undertheseanlp/UDD-v0.1")
118
 
119
- sentences = []
120
- for item in dataset["train"]:
121
- tokens = item["tokens"]
122
- tags = item["upos"]
123
- if tokens and tags:
124
- sentences.append((tokens, tags))
125
 
126
- print(f"Loaded {len(sentences)} sentences")
127
- return sentences
128
 
129
 
130
- def train(output_path, use_wandb=False):
131
- sentences = load_data()
132
 
133
- # Split 80/20 as per technical report
134
- train_data, test_data = train_test_split(
135
- sentences, test_size=0.2, random_state=42
136
- )
 
137
 
138
- print(f"Train: {len(train_data)} sentences")
139
- print(f"Test: {len(test_data)} sentences")
140
 
141
  # Prepare training data
142
- print("Extracting features...")
 
143
  X_train = [sentence_to_features(tokens) for tokens, _ in train_data]
144
  y_train = [tags for _, tags in train_data]
 
145
 
146
  # Train CRF
147
- print("Training CRF model...")
148
- trainer = pycrfsuite.Trainer(verbose=True)
149
-
150
- for xseq, yseq in zip(X_train, y_train):
151
- trainer.append(xseq, yseq)
152
-
153
- # Training parameters from technical report
154
- trainer.set_params({
155
- "c1": 1.0, # L1 regularization
156
- "c2": 0.001, # L2 regularization
157
- "max_iterations": 100, # Max iterations
158
- "feature.possible_transitions": True,
159
- })
160
 
 
161
  if use_wandb:
162
  try:
163
- import wandb
164
- wandb.init(project="pos-tagger-vietnamese", name="underthesea-crf")
165
- wandb.config.update({
166
- "c1": 1.0,
167
- "c2": 0.001,
168
- "max_iterations": 100,
 
169
  "num_features": len(FEATURE_TEMPLATES),
170
  "train_sentences": len(train_data),
 
171
  "test_sentences": len(test_data),
 
172
  })
173
  except ImportError:
174
- print("wandb not installed, skipping logging")
175
  use_wandb = False
176
 
177
- trainer.train(output_path)
178
- print(f"Model saved to {output_path}")
179
-
180
- # Quick evaluation
181
- print("\nEvaluating on test set...")
182
- tagger = pycrfsuite.Tagger()
183
- tagger.open(output_path)
184
 
 
 
185
  X_test = [sentence_to_features(tokens) for tokens, _ in test_data]
186
  y_test = [tags for _, tags in test_data]
187
 
188
- y_pred = [tagger.tag(xseq) for xseq in X_test]
189
 
190
  # Flatten for metrics
191
  y_test_flat = [tag for tags in y_test for tag in tags]
192
  y_pred_flat = [tag for tags in y_pred for tag in tags]
193
 
194
- from sklearn.metrics import accuracy_score, classification_report
195
-
196
  accuracy = accuracy_score(y_test_flat, y_pred_flat)
197
- print(f"\nAccuracy: {accuracy:.4f}")
198
- print("\nClassification Report:")
199
- print(classification_report(y_test_flat, y_pred_flat))
200
 
201
- if use_wandb:
202
- wandb.log({"accuracy": accuracy})
203
- wandb.finish()
204
 
205
- return output_path
 
 
206
 
 
 
 
 
207
 
208
- def main():
209
- parser = argparse.ArgumentParser(description="Train Vietnamese POS Tagger")
210
- parser.add_argument(
211
- "--output", "-o",
212
- default="pos_tagger.crfsuite",
213
- help="Output model path"
214
- )
215
- parser.add_argument(
216
- "--wandb",
217
- action="store_true",
218
- help="Enable Weights & Biases logging"
219
- )
220
- args = parser.parse_args()
221
 
222
- train(args.output, use_wandb=args.wandb)
 
 
223
 
224
 
225
  if __name__ == "__main__":
226
- main()
 
2
  # requires-python = ">=3.9"
3
  # dependencies = [
4
  # "python-crfsuite>=0.9.11",
5
+ # "crfsuite>=0.3.0",
6
  # "datasets>=4.5.0",
7
  # "scikit-learn>=1.6.1",
8
+ # "click>=8.0.0",
9
+ # "psutil>=5.9.0",
10
+ # "pyyaml>=6.0.0",
11
+ # "underthesea>=6.8.0",
12
+ # "underthesea-core @ file:///home/claude-user/projects/workspace_underthesea/underthesea-core-dev/extensions/underthesea_core/target/wheels/underthesea_core-1.0.7-cp312-cp312-manylinux_2_34_x86_64.whl",
13
  # ]
14
  # ///
15
  """
16
  Training script for Vietnamese POS Tagger (TRE-1).
17
 
18
+ Supports 3 CRF trainers:
19
+ - python-crfsuite: Original Python bindings to CRFsuite
20
+ - crfsuite-rs: Rust bindings to CRFsuite (pip install crfsuite)
21
+ - underthesea-core: Underthesea's native Rust CRF implementation
22
+
23
+ Models are saved to: models/pos_tagger/{version}/model.crfsuite
24
 
25
  Usage:
26
  uv run scripts/train.py
27
+ uv run scripts/train.py --trainer crfsuite-rs
28
+ uv run scripts/train.py --trainer underthesea-core
29
+ uv run scripts/train.py --version v1.1.0
30
+ uv run scripts/train.py --wandb
31
+ uv run scripts/train.py --c1 0.5 --c2 0.01 --max-iterations 200
32
  """
33
 
34
+ import platform
35
+ import re
36
+ import time
37
+ from abc import ABC, abstractmethod
38
+ from datetime import datetime
39
+ from pathlib import Path
40
+
41
+ import click
42
+ import psutil
43
+ import yaml
44
  from datasets import load_dataset
45
+ from sklearn.metrics import accuracy_score, classification_report
46
+
47
+
48
+ # Get project root directory
49
+ PROJECT_ROOT = Path(__file__).parent.parent
50
+
51
+ # Available trainers
52
+ TRAINERS = ["python-crfsuite", "crfsuite-rs", "underthesea-core"]
53
+
54
+
55
+ def get_hardware_info():
56
+ """Collect hardware and system information."""
57
+ info = {
58
+ "platform": platform.system(),
59
+ "platform_release": platform.release(),
60
+ "architecture": platform.machine(),
61
+ "python_version": platform.python_version(),
62
+ "cpu_physical_cores": psutil.cpu_count(logical=False),
63
+ "cpu_logical_cores": psutil.cpu_count(logical=True),
64
+ "ram_total_gb": round(psutil.virtual_memory().total / (1024**3), 2),
65
+ }
66
+
67
+ try:
68
+ if platform.system() == "Linux":
69
+ with open("/proc/cpuinfo", "r") as f:
70
+ for line in f:
71
+ if "model name" in line:
72
+ info["cpu_model"] = line.split(":")[1].strip()
73
+ break
74
+ except Exception:
75
+ info["cpu_model"] = "Unknown"
76
+
77
+ return info
78
+
79
+
80
+ def format_duration(seconds):
81
+ """Format duration in human-readable format."""
82
+ if seconds < 60:
83
+ return f"{seconds:.2f}s"
84
+ elif seconds < 3600:
85
+ minutes = int(seconds // 60)
86
+ secs = seconds % 60
87
+ return f"{minutes}m {secs:.2f}s"
88
+ else:
89
+ hours = int(seconds // 3600)
90
+ minutes = int((seconds % 3600) // 60)
91
+ secs = seconds % 60
92
+ return f"{hours}h {minutes}m {secs:.2f}s"
93
 
94
 
95
  FEATURE_TEMPLATES = [
 
143
 
144
 
145
  def parse_template(template):
 
146
  match = re.match(r"T\[([^\]]+)\](?:\.(\w+))?", template)
147
  if not match:
148
  return None, None
 
180
  ]
181
 
182
 
183
+ # ============================================================================
184
+ # Trainer Abstraction
185
+ # ============================================================================
186
+
187
+ class CRFTrainerBase(ABC):
188
+ """Abstract base class for CRF trainers."""
189
+
190
+ name: str = "base"
191
+
192
+ @abstractmethod
193
+ def train(self, X_train, y_train, output_path, c1, c2, max_iterations, verbose=True):
194
+ """Train the CRF model and save to output_path."""
195
+ pass
196
+
197
+ @abstractmethod
198
+ def predict(self, model_path, X_test):
199
+ """Load model and predict on test data."""
200
+ pass
201
+
202
+
203
+ class PythonCRFSuiteTrainer(CRFTrainerBase):
204
+ """Trainer using python-crfsuite (original Python bindings)."""
205
+
206
+ name = "python-crfsuite"
207
+
208
+ def train(self, X_train, y_train, output_path, c1, c2, max_iterations, verbose=True):
209
+ import pycrfsuite
210
+
211
+ trainer = pycrfsuite.Trainer(verbose=verbose)
212
+
213
+ for xseq, yseq in zip(X_train, y_train):
214
+ trainer.append(xseq, yseq)
215
+
216
+ trainer.set_params({
217
+ "c1": c1,
218
+ "c2": c2,
219
+ "max_iterations": max_iterations,
220
+ "feature.possible_transitions": True,
221
+ })
222
+
223
+ trainer.train(str(output_path))
224
+
225
+ def predict(self, model_path, X_test):
226
+ import pycrfsuite
227
+
228
+ tagger = pycrfsuite.Tagger()
229
+ tagger.open(str(model_path))
230
+ return [tagger.tag(xseq) for xseq in X_test]
231
+
232
+
233
+ class CRFSuiteRsTrainer(CRFTrainerBase):
234
+ """Trainer using crfsuite-rs (Rust bindings via pip install crfsuite)."""
235
+
236
+ name = "crfsuite-rs"
237
+
238
+ def train(self, X_train, y_train, output_path, c1, c2, max_iterations, verbose=True):
239
+ import crfsuite
240
+
241
+ trainer = crfsuite.Trainer()
242
+
243
+ # Set parameters
244
+ trainer.set_params({
245
+ "c1": c1,
246
+ "c2": c2,
247
+ "max_iterations": max_iterations,
248
+ "feature.possible_transitions": True,
249
+ })
250
+
251
+ # Add training data
252
+ for xseq, yseq in zip(X_train, y_train):
253
+ trainer.append(xseq, yseq)
254
+
255
+ # Train
256
+ trainer.train(str(output_path))
257
+
258
+ def predict(self, model_path, X_test):
259
+ import crfsuite
260
+
261
+ model = crfsuite.Model(str(model_path))
262
+ return [model.tag(xseq) for xseq in X_test]
263
+
264
+
265
+ class UndertheseaCoreTrainer(CRFTrainerBase):
266
+ """Trainer using underthesea-core native Rust CRF with LBFGS optimization.
267
+
268
+ This trainer uses the native underthesea-core Rust CRF implementation
269
+ with L-BFGS optimization, matching CRFsuite performance.
270
 
271
+ Requires building underthesea-core from source:
272
+ cd ~/projects/workspace_underthesea/underthesea-core-dev/extensions/underthesea_core
273
+ uv venv && source .venv/bin/activate
274
+ uv pip install maturin
275
+ maturin develop --release
276
+ """
277
 
278
+ name = "underthesea-core"
 
279
 
280
+ def _check_trainer_import(self):
281
+ """Check if CRFTrainer is available."""
282
+ try:
283
+ from underthesea_core import CRFTrainer
284
+ return CRFTrainer
285
+ except ImportError:
286
+ pass
287
+
288
+ try:
289
+ from underthesea_core.underthesea_core import CRFTrainer
290
+ return CRFTrainer
291
+ except ImportError:
292
+ pass
293
+
294
+ raise ImportError(
295
+ "CRFTrainer not available in underthesea_core.\n"
296
+ "Build from source with LBFGS support:\n"
297
+ " cd ~/projects/workspace_underthesea/underthesea-core-dev/extensions/underthesea_core\n"
298
+ " source .venv/bin/activate && maturin develop --release"
299
+ )
300
+
301
+ def _check_tagger_import(self):
302
+ """Check if CRFModel and CRFTagger are available."""
303
+ try:
304
+ from underthesea_core import CRFModel, CRFTagger
305
+ return CRFModel, CRFTagger
306
+ except ImportError:
307
+ pass
308
+
309
+ try:
310
+ from underthesea_core.underthesea_core import CRFModel, CRFTagger
311
+ return CRFModel, CRFTagger
312
+ except ImportError:
313
+ pass
314
+
315
+ raise ImportError("CRFModel/CRFTagger not available in underthesea_core")
316
+
317
+ def train(self, X_train, y_train, output_path, c1, c2, max_iterations, verbose=True):
318
+ CRFTrainer = self._check_trainer_import()
319
+
320
+ # Use LBFGS (default, fast)
321
+ trainer = CRFTrainer(
322
+ loss_function="lbfgs",
323
+ l1_penalty=c1,
324
+ l2_penalty=c2,
325
+ max_iterations=max_iterations,
326
+ verbose=1 if verbose else 0,
327
+ )
328
 
329
+ # Train
330
+ model = trainer.train(X_train, y_train)
331
 
332
+ # Save model
333
+ output_path_str = str(output_path)
334
+ if output_path_str.endswith('.crfsuite'):
335
+ output_path_str = output_path_str.replace('.crfsuite', '.crf')
336
+ model.save(output_path_str)
337
 
338
+ # Store the actual path for prediction
339
+ self._model_path = output_path_str
340
+
341
+ def predict(self, model_path, X_test):
342
+ CRFModel, CRFTagger = self._check_tagger_import()
343
+
344
+ # Use the actual saved path if available
345
+ model_path_str = str(model_path)
346
+ if hasattr(self, '_model_path'):
347
+ model_path_str = self._model_path
348
+ elif model_path_str.endswith('.crfsuite'):
349
+ model_path_str = model_path_str.replace('.crfsuite', '.crf')
350
+
351
+ model = CRFModel.load(model_path_str)
352
+ tagger = CRFTagger.from_model(model)
353
+ return [tagger.tag(xseq) for xseq in X_test]
354
+
355
+
356
+ def get_trainer(trainer_name: str) -> CRFTrainerBase:
357
+ """Get trainer instance by name."""
358
+ trainers = {
359
+ "python-crfsuite": PythonCRFSuiteTrainer,
360
+ "crfsuite-rs": CRFSuiteRsTrainer,
361
+ "underthesea-core": UndertheseaCoreTrainer,
362
+ }
363
+ if trainer_name not in trainers:
364
+ raise ValueError(f"Unknown trainer: {trainer_name}. Available: {list(trainers.keys())}")
365
+ return trainers[trainer_name]()
366
+
367
+
368
+ # ============================================================================
369
+ # Data Loading
370
+ # ============================================================================
371
+
372
+ def load_data():
373
+ click.echo("Loading UDD-1 dataset...")
374
+ dataset = load_dataset("undertheseanlp/UDD-1")
375
+
376
+ def extract_sentences(split):
377
+ sentences = []
378
+ for item in split:
379
+ tokens = item["tokens"]
380
+ tags = item["upos"]
381
+ if tokens and tags:
382
+ sentences.append((tokens, tags))
383
+ return sentences
384
+
385
+ train_data = extract_sentences(dataset["train"])
386
+ val_data = extract_sentences(dataset["validation"])
387
+ test_data = extract_sentences(dataset["test"])
388
+
389
+ click.echo(f"Loaded {len(train_data)} train, {len(val_data)} val, {len(test_data)} test sentences")
390
+ return train_data, val_data, test_data
391
+
392
+
393
+ def save_metadata(output_dir, version, trainer_name, train_data, val_data, test_data, c1, c2, max_iterations, accuracy, hw_info, training_time):
394
+ """Save model metadata to YAML file."""
395
+ metadata = {
396
+ "model": {
397
+ "name": "Vietnamese POS Tagger",
398
+ "version": version,
399
+ "type": "CRF (Conditional Random Field)",
400
+ "framework": trainer_name,
401
+ },
402
+ "training": {
403
+ "dataset": "undertheseanlp/UDD-1",
404
+ "train_sentences": len(train_data),
405
+ "val_sentences": len(val_data),
406
+ "test_sentences": len(test_data),
407
+ "hyperparameters": {
408
+ "c1": c1,
409
+ "c2": c2,
410
+ "max_iterations": max_iterations,
411
+ },
412
+ "duration_seconds": round(training_time, 2),
413
+ },
414
+ "performance": {
415
+ "test_accuracy": round(accuracy, 4),
416
+ },
417
+ "environment": {
418
+ "platform": hw_info["platform"],
419
+ "cpu_model": hw_info.get("cpu_model", "Unknown"),
420
+ "python_version": hw_info["python_version"],
421
+ },
422
+ "files": {
423
+ "model": "model.crfsuite",
424
+ "config": "../../../configs/pos_tagger.yaml",
425
+ },
426
+ "created_at": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
427
+ "author": "undertheseanlp",
428
+ }
429
+
430
+ metadata_path = output_dir / "metadata.yaml"
431
+ with open(metadata_path, "w") as f:
432
+ yaml.dump(metadata, f, default_flow_style=False, allow_unicode=True, sort_keys=False)
433
+ click.echo(f"Metadata saved to {metadata_path}")
434
+
435
+
436
+ def get_default_version():
437
+ """Generate timestamp-based version."""
438
+ return datetime.now().strftime("%Y%m%d_%H%M%S")
439
+
440
+
441
+ @click.command()
442
+ @click.option(
443
+ "--trainer", "-t",
444
+ type=click.Choice(TRAINERS),
445
+ default="python-crfsuite",
446
+ help="CRF trainer to use",
447
+ show_default=True,
448
+ )
449
+ @click.option(
450
+ "--version", "-v",
451
+ default=None,
452
+ help="Model version (default: timestamp, e.g., 20260131_154530)",
453
+ )
454
+ @click.option(
455
+ "--output", "-o",
456
+ default=None,
457
+ help="Custom output path (overrides version-based path)",
458
+ )
459
+ @click.option(
460
+ "--c1",
461
+ default=1.0,
462
+ type=float,
463
+ help="L1 regularization coefficient",
464
+ show_default=True,
465
+ )
466
+ @click.option(
467
+ "--c2",
468
+ default=0.001,
469
+ type=float,
470
+ help="L2 regularization coefficient",
471
+ show_default=True,
472
+ )
473
+ @click.option(
474
+ "--max-iterations",
475
+ default=100,
476
+ type=int,
477
+ help="Maximum training iterations",
478
+ show_default=True,
479
+ )
480
+ @click.option(
481
+ "--wandb/--no-wandb",
482
+ default=False,
483
+ help="Enable Weights & Biases logging",
484
+ )
485
+ def train(trainer, version, output, c1, c2, max_iterations, wandb):
486
+ """Train Vietnamese POS Tagger using CRF on UDD-1 dataset."""
487
+ total_start_time = time.time()
488
+ start_datetime = datetime.now()
489
+
490
+ # Get trainer
491
+ crf_trainer = get_trainer(trainer)
492
+
493
+ # Use timestamp version if not specified
494
+ if version is None:
495
+ version = get_default_version()
496
+
497
+ # Determine output directory
498
+ if output:
499
+ output_path = Path(output)
500
+ output_dir = output_path.parent
501
+ else:
502
+ output_dir = PROJECT_ROOT / "models" / "pos_tagger" / version
503
+ output_dir.mkdir(parents=True, exist_ok=True)
504
+ output_path = output_dir / "model.crfsuite"
505
+
506
+ # Collect hardware info
507
+ hw_info = get_hardware_info()
508
+
509
+ click.echo("=" * 60)
510
+ click.echo(f"POS Tagger Training - {version}")
511
+ click.echo("=" * 60)
512
+ click.echo(f"Trainer: {trainer}")
513
+ click.echo(f"Platform: {hw_info['platform']}")
514
+ click.echo(f"CPU: {hw_info.get('cpu_model', 'Unknown')}")
515
+ click.echo(f"Output: {output_path}")
516
+ click.echo(f"Started: {start_datetime.strftime('%Y-%m-%d %H:%M:%S')}")
517
+ click.echo("=" * 60)
518
+
519
+ train_data, val_data, test_data = load_data()
520
+
521
+ click.echo(f"\nTrain: {len(train_data)} sentences")
522
+ click.echo(f"Validation: {len(val_data)} sentences")
523
+ click.echo(f"Test: {len(test_data)} sentences")
524
 
525
  # Prepare training data
526
+ click.echo("\nExtracting features...")
527
+ feature_start = time.time()
528
  X_train = [sentence_to_features(tokens) for tokens, _ in train_data]
529
  y_train = [tags for _, tags in train_data]
530
+ click.echo(f"Feature extraction: {format_duration(time.time() - feature_start)}")
531
 
532
  # Train CRF
533
+ click.echo(f"\nTraining CRF model with {trainer}...")
534
 
535
+ use_wandb = wandb
536
  if use_wandb:
537
  try:
538
+ import wandb as wb
539
+ wb.init(project="pos-tagger-vietnamese", name=f"crf-{trainer}-{version}")
540
+ wb.config.update({
541
+ "trainer": trainer,
542
+ "c1": c1,
543
+ "c2": c2,
544
+ "max_iterations": max_iterations,
545
  "num_features": len(FEATURE_TEMPLATES),
546
  "train_sentences": len(train_data),
547
+ "val_sentences": len(val_data),
548
  "test_sentences": len(test_data),
549
+ "version": version,
550
  })
551
  except ImportError:
552
+ click.echo("wandb not installed, skipping logging", err=True)
553
  use_wandb = False
554
 
555
+ crf_start = time.time()
556
+ crf_trainer.train(X_train, y_train, output_path, c1, c2, max_iterations, verbose=True)
557
+ crf_time = time.time() - crf_start
558
+ click.echo(f"\nModel saved to {output_path}")
559
+ click.echo(f"CRF training: {format_duration(crf_time)}")
 
 
560
 
561
+ # Evaluation
562
+ click.echo("\nEvaluating on test set...")
563
  X_test = [sentence_to_features(tokens) for tokens, _ in test_data]
564
  y_test = [tags for _, tags in test_data]
565
 
566
+ y_pred = crf_trainer.predict(output_path, X_test)
567
 
568
  # Flatten for metrics
569
  y_test_flat = [tag for tags in y_test for tag in tags]
570
  y_pred_flat = [tag for tags in y_pred for tag in tags]
571
 
 
 
572
  accuracy = accuracy_score(y_test_flat, y_pred_flat)
 
 
 
573
 
574
+ total_time = time.time() - total_start_time
 
 
575
 
576
+ click.echo(f"\nAccuracy: {accuracy:.4f}")
577
+ click.echo("\nClassification Report:")
578
+ click.echo(classification_report(y_test_flat, y_pred_flat))
579
 
580
+ # Save metadata
581
+ if not output:
582
+ save_metadata(output_dir, version, trainer, train_data, val_data, test_data,
583
+ c1, c2, max_iterations, accuracy, hw_info, total_time)
584
 
585
+ click.echo("\n" + "=" * 60)
586
+ click.echo("Training Summary")
587
+ click.echo("=" * 60)
588
+ click.echo(f"Trainer: {trainer}")
589
+ click.echo(f"Version: {version}")
590
+ click.echo(f"Model: {output_path}")
591
+ click.echo(f"Accuracy: {accuracy:.4f}")
592
+ click.echo(f"Total time: {format_duration(total_time)}")
593
+ click.echo("=" * 60)
 
 
 
 
594
 
595
+ if use_wandb:
596
+ wb.log({"accuracy": accuracy})
597
+ wb.finish()
598
 
599
 
600
  if __name__ == "__main__":
601
+ train()
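The multi-trainer design above is a simple strategy pattern: one abstract interface, one concrete class per CRF backend, and a lookup by name. Stripped to its essentials (the `DummyTrainer` backend here is an illustrative stand-in, not one of the real trainers):

```python
from abc import ABC, abstractmethod

class CRFTrainerBase(ABC):
    """Common interface every CRF backend must implement."""
    name = "base"

    @abstractmethod
    def train(self, X_train, y_train, output_path):
        ...

class DummyTrainer(CRFTrainerBase):
    """Stand-in backend used only to demonstrate the dispatch."""
    name = "dummy"

    def train(self, X_train, y_train, output_path):
        return f"trained {len(X_train)} sequences -> {output_path}"

TRAINERS = {"dummy": DummyTrainer}

def get_trainer(name):
    """Instantiate a trainer by name, failing loudly on unknown backends."""
    if name not in TRAINERS:
        raise ValueError(f"Unknown trainer: {name}. Available: {list(TRAINERS)}")
    return TRAINERS[name]()

print(get_trainer("dummy").train([["f=1"]], [["B"]], "model.crf"))
```

Adding a backend is then just another subclass plus one dictionary entry, which is why the script can offer python-crfsuite, crfsuite-rs, and underthesea-core behind the same `--trainer` flag.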
scripts/train_word_segmentation.py ADDED
@@ -0,0 +1,755 @@
1
+ # /// script
2
+ # requires-python = ">=3.9"
3
+ # dependencies = [
4
+ # "python-crfsuite>=0.9.11",
5
+ # "crfsuite>=0.3.0",
6
+ # "datasets>=4.5.0",
7
+ # "scikit-learn>=1.6.1",
8
+ # "click>=8.0.0",
9
+ # "psutil>=5.9.0",
10
+ # "pyyaml>=6.0.0",
11
+ # "underthesea>=6.8.0",
12
+ # "underthesea-core @ file:///home/claude-user/projects/workspace_underthesea/underthesea-core-dev/extensions/underthesea_core/target/wheels/underthesea_core-1.0.7-cp312-cp312-manylinux_2_34_x86_64.whl",
13
+ # ]
14
+ # ///
15
+ # Note: underthesea-core trainer now uses crfsuite (LBFGS) for fast training
16
+ """
17
+ Training script for Vietnamese Word Segmentation using CRF.
18
+
19
+ Supports 3 CRF trainers:
20
+ - python-crfsuite: Original Python bindings to CRFsuite
21
+ - crfsuite-rs: Rust bindings to CRFsuite (pip install crfsuite)
22
+ - underthesea-core: Underthesea's native Rust CRF implementation
23
+
24
+ Models are saved to: models/word_segmentation/{version}/model.crfsuite
25
+
26
+ Uses BIO tagging at SYLLABLE level:
27
+ - B: Beginning of a word (first syllable)
28
+ - I: Inside a word (continuation syllables)
29
+
30
+ Usage:
31
+ uv run scripts/train_word_segmentation.py
32
+ uv run scripts/train_word_segmentation.py --trainer crfsuite-rs
33
+ uv run scripts/train_word_segmentation.py --trainer underthesea-core
34
+ uv run scripts/train_word_segmentation.py --version v1.1.0
35
+ """
36
+
37
+ import os
38
+ import platform
39
+ import time
40
+ from abc import ABC, abstractmethod
41
+ from datetime import datetime
42
+ from pathlib import Path
43
+
44
+ import click
45
+ import psutil
46
+ import yaml
47
+ from datasets import load_dataset
48
+ from sklearn.metrics import accuracy_score, classification_report, f1_score
49
+ from underthesea.pipeline.word_tokenize.regex_tokenize import tokenize as regex_tokenize
50
+
51
+
52
+ # Get project root directory
53
+ PROJECT_ROOT = Path(__file__).parent.parent
54
+
55
+ # Available trainers
56
+ TRAINERS = ["python-crfsuite", "crfsuite-rs", "underthesea-core"]
57
+
58
+
59
+ def get_hardware_info():
+     """Collect hardware and system information."""
+     info = {
+         "platform": platform.system(),
+         "platform_release": platform.release(),
+         "architecture": platform.machine(),
+         "python_version": platform.python_version(),
+         "cpu_physical_cores": psutil.cpu_count(logical=False),
+         "cpu_logical_cores": psutil.cpu_count(logical=True),
+         "ram_total_gb": round(psutil.virtual_memory().total / (1024**3), 2),
+     }
+
+     try:
+         if platform.system() == "Linux":
+             with open("/proc/cpuinfo", "r") as f:
+                 for line in f:
+                     if "model name" in line:
+                         info["cpu_model"] = line.split(":")[1].strip()
+                         break
+     except Exception:
+         info["cpu_model"] = "Unknown"
+
+     return info
+
+
+ def format_duration(seconds):
+     """Format duration in human-readable format."""
+     if seconds < 60:
+         return f"{seconds:.2f}s"
+     elif seconds < 3600:
+         minutes = int(seconds // 60)
+         secs = seconds % 60
+         return f"{minutes}m {secs:.2f}s"
+     else:
+         hours = int(seconds // 3600)
+         minutes = int((seconds % 3600) // 60)
+         secs = seconds % 60
+         return f"{hours}h {minutes}m {secs:.2f}s"
+
+
+ # Syllable-level feature templates
+ FEATURE_TEMPLATES = [
+     # Current syllable
+     "S[0]",          # Syllable text
+     "S[0].lower",    # Lowercase
+     "S[0].istitle",  # Is title case
+     "S[0].isupper",  # Is all uppercase
+     "S[0].isdigit",  # Is digit
+     "S[0].ispunct",  # Is punctuation
+     "S[0].len",      # Length
+     "S[0].prefix2",  # First 2 chars
+     "S[0].suffix2",  # Last 2 chars
+     # Previous syllables
+     "S[-1]",
+     "S[-1].lower",
+     "S[-2]",
+     "S[-2].lower",
+     # Next syllables
+     "S[1]",
+     "S[1].lower",
+     "S[2]",
+     "S[2].lower",
+     # Bigrams
+     "S[-1,0]",
+     "S[0,1]",
+     # Trigrams
+     "S[-1,0,1]",
+ ]
+
+
+ def get_syllable_at(syllables, position, offset):
+     """Get syllable at position + offset, with boundary handling."""
+     idx = position + offset
+     if idx < 0:
+         return "__BOS__"
+     elif idx >= len(syllables):
+         return "__EOS__"
+     return syllables[idx]
+
+
+ def is_punct(s):
+     """Check if string is punctuation."""
+     return len(s) == 1 and not s.isalnum()
+
+
+ def extract_syllable_features(syllables, position):
+     """Extract features for a syllable at given position."""
+     features = {}
+
+     # Current syllable
+     s0 = get_syllable_at(syllables, position, 0)
+     is_boundary = s0 in ("__BOS__", "__EOS__")
+
+     features["S[0]"] = s0
+     features["S[0].lower"] = s0.lower() if not is_boundary else s0
+     features["S[0].istitle"] = str(s0.istitle()) if not is_boundary else "False"
+     features["S[0].isupper"] = str(s0.isupper()) if not is_boundary else "False"
+     features["S[0].isdigit"] = str(s0.isdigit()) if not is_boundary else "False"
+     features["S[0].ispunct"] = str(is_punct(s0)) if not is_boundary else "False"
+     features["S[0].len"] = str(len(s0)) if not is_boundary else "0"
+     features["S[0].prefix2"] = s0[:2] if not is_boundary and len(s0) >= 2 else s0
+     features["S[0].suffix2"] = s0[-2:] if not is_boundary and len(s0) >= 2 else s0
+
+     # Previous syllables
+     s_1 = get_syllable_at(syllables, position, -1)
+     s_2 = get_syllable_at(syllables, position, -2)
+     features["S[-1]"] = s_1
+     features["S[-1].lower"] = s_1.lower() if s_1 not in ("__BOS__", "__EOS__") else s_1
+     features["S[-2]"] = s_2
+     features["S[-2].lower"] = s_2.lower() if s_2 not in ("__BOS__", "__EOS__") else s_2
+
+     # Next syllables
+     s1 = get_syllable_at(syllables, position, 1)
+     s2 = get_syllable_at(syllables, position, 2)
+     features["S[1]"] = s1
+     features["S[1].lower"] = s1.lower() if s1 not in ("__BOS__", "__EOS__") else s1
+     features["S[2]"] = s2
+     features["S[2].lower"] = s2.lower() if s2 not in ("__BOS__", "__EOS__") else s2
+
+     # Bigrams
+     features["S[-1,0]"] = f"{s_1}|{s0}"
+     features["S[0,1]"] = f"{s0}|{s1}"
+
+     # Trigrams
+     features["S[-1,0,1]"] = f"{s_1}|{s0}|{s1}"
+
+     return features
+
+
+ def sentence_to_syllable_features(syllables):
+     """Convert syllable sequence to feature sequences."""
+     return [
+         [f"{k}={v}" for k, v in extract_syllable_features(syllables, i).items()]
+         for i in range(len(syllables))
+     ]
+
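For context, the extractor above emits flat `key=value` strings per position, which is the input format all three CRF backends accept. A trimmed standalone sketch (only a unigram/bigram subset of the script's templates; `toy_features` is illustrative, not part of the script):

```python
def toy_features(syllables, i):
    """Emit a small subset of the script's feature templates as key=value strings."""
    def get(j):
        # Boundary handling mirrors get_syllable_at
        if j < 0:
            return "__BOS__"
        if j >= len(syllables):
            return "__EOS__"
        return syllables[j]

    s_1, s0, s1 = get(i - 1), get(i), get(i + 1)
    feats = {"S[0]": s0, "S[-1]": s_1, "S[1]": s1, "S[-1,0]": f"{s_1}|{s0}"}
    return [f"{k}={v}" for k, v in feats.items()]

sent = ["Tôi", "yêu", "Việt", "Nam"]
print(toy_features(sent, 0))
# ['S[0]=Tôi', 'S[-1]=__BOS__', 'S[1]=yêu', 'S[-1,0]=__BOS__|Tôi']
```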
+
+ def tokens_to_syllable_labels(tokens):
+     """
+     Convert tokenized compound words to syllable-level BIO labels.
+
+     Each compound word (e.g., "Thời hạn") is split into syllables;
+     the first syllable gets 'B', the rest get 'I'.
+     """
+     syllables = []
+     labels = []
+
+     for token in tokens:
+         # Split compound word into syllables using regex_tokenize
+         token_syllables = regex_tokenize(token)
+
+         for i, syl in enumerate(token_syllables):
+             syllables.append(syl)
+             if i == 0:
+                 labels.append("B")
+             else:
+                 labels.append("I")
+
+     return syllables, labels
+
+
+ def labels_to_words(syllables, labels):
+     """Convert syllable sequence and BIO labels back to words."""
+     words = []
+     current_word = []
+
+     for syl, label in zip(syllables, labels):
+         if label == "B":
+             if current_word:
+                 words.append(" ".join(current_word))
+             current_word = [syl]
+         else:  # I
+             current_word.append(syl)
+
+     if current_word:
+         words.append(" ".join(current_word))
+
+     return words
+
+
+ def compute_word_metrics(y_true, y_pred, syllables_list):
+     """Compute word-level precision, recall, and F1 from boundary matches."""
+     correct = 0
+     total_pred = 0
+     total_true = 0
+
+     for syllables, true_labels, pred_labels in zip(syllables_list, y_true, y_pred):
+         true_words = labels_to_words(syllables, true_labels)
+         pred_words = labels_to_words(syllables, pred_labels)
+
+         total_true += len(true_words)
+         total_pred += len(pred_words)
+
+         # Count exact word matches at the same positions via (start, end) spans
+         true_boundaries = set()
+         pred_boundaries = set()
+
+         pos = 0
+         for word in true_words:
+             n_syls = len(word.split())
+             true_boundaries.add((pos, pos + n_syls))
+             pos += n_syls
+
+         pos = 0
+         for word in pred_words:
+             n_syls = len(word.split())
+             pred_boundaries.add((pos, pos + n_syls))
+             pos += n_syls
+
+         correct += len(true_boundaries & pred_boundaries)
+
+     precision = correct / total_pred if total_pred > 0 else 0
+     recall = correct / total_true if total_true > 0 else 0
+     f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
+
+     return precision, recall, f1
+
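A word counts as correct only if its (start, end) syllable span matches exactly, so one wrong boundary costs both of the words it touches. A toy example of the span-matching arithmetic (spans chosen for illustration):

```python
# Gold segments five syllables as four words; the prediction wrongly splits
# the first two-syllable compound, so its first two spans differ.
true_spans = {(0, 2), (2, 3), (3, 4), (4, 5)}
pred_spans = {(0, 1), (1, 3), (3, 4), (4, 5)}

correct = len(true_spans & pred_spans)  # only (3,4) and (4,5) match exactly
precision = correct / len(pred_spans)   # 2/4
recall = correct / len(true_spans)      # 2/4
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)  # 0.5 0.5 0.5
```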
+
+ def load_data():
+     """Load UDD-1 dataset and convert to syllable-level sequences."""
+     click.echo("Loading UDD-1 dataset...")
+     dataset = load_dataset("undertheseanlp/UDD-1")
+
+     def extract_syllable_sequences(split):
+         sequences = []
+         for item in split:
+             tokens = item["tokens"]
+             if tokens:
+                 syllables, labels = tokens_to_syllable_labels(tokens)
+                 if syllables:
+                     sequences.append((syllables, labels))
+         return sequences
+
+     train_data = extract_syllable_sequences(dataset["train"])
+     val_data = extract_syllable_sequences(dataset["validation"])
+     test_data = extract_syllable_sequences(dataset["test"])
+
+     # Statistics
+     train_syls = sum(len(syls) for syls, _ in train_data)
+     val_syls = sum(len(syls) for syls, _ in val_data)
+     test_syls = sum(len(syls) for syls, _ in test_data)
+
+     click.echo(f"Loaded {len(train_data)} train ({train_syls} syllables), "
+                f"{len(val_data)} val ({val_syls} syllables), "
+                f"{len(test_data)} test ({test_syls} syllables) sentences")
+
+     return train_data, val_data, test_data, {
+         "train_sentences": len(train_data),
+         "train_syllables": train_syls,
+         "val_sentences": len(val_data),
+         "val_syllables": val_syls,
+         "test_sentences": len(test_data),
+         "test_syllables": test_syls,
+     }
+
+
+ # ============================================================================
+ # Trainer Abstraction
+ # ============================================================================
+
+ class CRFTrainerBase(ABC):
+     """Abstract base class for CRF trainers."""
+
+     name: str = "base"
+
+     @abstractmethod
+     def train(self, X_train, y_train, output_path, c1, c2, max_iterations, verbose=True):
+         """Train the CRF model and save to output_path."""
+
+     @abstractmethod
+     def predict(self, model_path, X_test):
+         """Load model and predict on test data."""
+
+
+ class PythonCRFSuiteTrainer(CRFTrainerBase):
+     """Trainer using python-crfsuite (original Python bindings)."""
+
+     name = "python-crfsuite"
+
+     def train(self, X_train, y_train, output_path, c1, c2, max_iterations, verbose=True):
+         import pycrfsuite
+
+         trainer = pycrfsuite.Trainer(verbose=verbose)
+
+         for xseq, yseq in zip(X_train, y_train):
+             trainer.append(xseq, yseq)
+
+         trainer.set_params({
+             "c1": c1,
+             "c2": c2,
+             "max_iterations": max_iterations,
+             "feature.possible_transitions": True,
+         })
+
+         trainer.train(str(output_path))
+
+     def predict(self, model_path, X_test):
+         import pycrfsuite
+
+         tagger = pycrfsuite.Tagger()
+         tagger.open(str(model_path))
+         return [tagger.tag(xseq) for xseq in X_test]
+
+
+ class CRFSuiteRsTrainer(CRFTrainerBase):
+     """Trainer using crfsuite-rs (Rust bindings via pip install crfsuite)."""
+
+     name = "crfsuite-rs"
+
+     def train(self, X_train, y_train, output_path, c1, c2, max_iterations, verbose=True):
+         import crfsuite
+
+         trainer = crfsuite.Trainer()
+
+         # Set parameters
+         trainer.set_params({
+             "c1": c1,
+             "c2": c2,
+             "max_iterations": max_iterations,
+             "feature.possible_transitions": True,
+         })
+
+         # Add training data
+         for xseq, yseq in zip(X_train, y_train):
+             trainer.append(xseq, yseq)
+
+         # Train
+         trainer.train(str(output_path))
+
+     def predict(self, model_path, X_test):
+         import crfsuite
+
+         model = crfsuite.Model(str(model_path))
+         return [model.tag(xseq) for xseq in X_test]
+
+
+ class UndertheseaCoreTrainer(CRFTrainerBase):
+     """Trainer using underthesea-core's native Rust CRF with L-BFGS optimization.
+
+     This trainer uses the native underthesea-core Rust CRF implementation
+     with L-BFGS optimization, matching CRFsuite performance.
+
+     Requires building underthesea-core from source:
+         cd ~/projects/workspace_underthesea/underthesea-core-dev/extensions/underthesea_core
+         uv venv && source .venv/bin/activate
+         uv pip install maturin
+         maturin develop --release
+     """
+
+     name = "underthesea-core"
+
+     def _check_trainer_import(self):
+         """Check if CRFTrainer is available."""
+         try:
+             from underthesea_core import CRFTrainer
+             return CRFTrainer
+         except ImportError:
+             pass
+
+         try:
+             from underthesea_core.underthesea_core import CRFTrainer
+             return CRFTrainer
+         except ImportError:
+             pass
+
+         raise ImportError(
+             "CRFTrainer not available in underthesea_core.\n"
+             "Build from source with LBFGS support:\n"
+             "  cd ~/projects/workspace_underthesea/underthesea-core-dev/extensions/underthesea_core\n"
+             "  source .venv/bin/activate && maturin develop --release"
+         )
+
+     def _check_tagger_import(self):
+         """Check if CRFModel and CRFTagger are available."""
+         try:
+             from underthesea_core import CRFModel, CRFTagger
+             return CRFModel, CRFTagger
+         except ImportError:
+             pass
+
+         try:
+             from underthesea_core.underthesea_core import CRFModel, CRFTagger
+             return CRFModel, CRFTagger
+         except ImportError:
+             pass
+
+         raise ImportError("CRFModel/CRFTagger not available in underthesea_core")
+
+     def train(self, X_train, y_train, output_path, c1, c2, max_iterations, verbose=True):
+         CRFTrainer = self._check_trainer_import()
+
+         # Use LBFGS (default, fast)
+         trainer = CRFTrainer(
+             loss_function="lbfgs",
+             l1_penalty=c1,
+             l2_penalty=c2,
+             max_iterations=max_iterations,
+             verbose=1 if verbose else 0,
+         )
+
+         # Train
+         model = trainer.train(X_train, y_train)
+
+         # Save model (underthesea-core uses a .crf extension)
+         output_path_str = str(output_path)
+         if output_path_str.endswith(".crfsuite"):
+             output_path_str = output_path_str.replace(".crfsuite", ".crf")
+         model.save(output_path_str)
+
+         # Store the actual path for prediction
+         self._model_path = output_path_str
+
+     def predict(self, model_path, X_test):
+         CRFModel, CRFTagger = self._check_tagger_import()
+
+         # Use the actual saved path if available
+         model_path_str = str(model_path)
+         if hasattr(self, "_model_path"):
+             model_path_str = self._model_path
+         elif model_path_str.endswith(".crfsuite"):
+             model_path_str = model_path_str.replace(".crfsuite", ".crf")
+
+         model = CRFModel.load(model_path_str)
+         tagger = CRFTagger.from_model(model)
+         return [tagger.tag(xseq) for xseq in X_test]
+
+
+ def get_trainer(trainer_name: str) -> CRFTrainerBase:
+     """Get trainer instance by name."""
+     trainers = {
+         "python-crfsuite": PythonCRFSuiteTrainer,
+         "crfsuite-rs": CRFSuiteRsTrainer,
+         "underthesea-core": UndertheseaCoreTrainer,
+     }
+     if trainer_name not in trainers:
+         raise ValueError(f"Unknown trainer: {trainer_name}. Available: {list(trainers.keys())}")
+     return trainers[trainer_name]()
+
+
+ # ============================================================================
+ # Metadata and CLI
+ # ============================================================================
+
+ def save_metadata(output_dir, version, trainer_name, data_stats, c1, c2,
+                   max_iterations, metrics, hw_info, training_time):
+     """Save model metadata to YAML file."""
+     metadata = {
+         "model": {
+             "name": "Vietnamese Word Segmentation",
+             "version": version,
+             "type": "CRF (Conditional Random Field)",
+             "framework": trainer_name,
+             "tagging_scheme": "BIO",
+         },
+         "training": {
+             "dataset": "undertheseanlp/UDD-1",
+             "train_sentences": data_stats["train_sentences"],
+             "train_syllables": data_stats["train_syllables"],
+             "val_sentences": data_stats["val_sentences"],
+             "val_syllables": data_stats["val_syllables"],
+             "test_sentences": data_stats["test_sentences"],
+             "test_syllables": data_stats["test_syllables"],
+             "hyperparameters": {
+                 "c1": c1,
+                 "c2": c2,
+                 "max_iterations": max_iterations,
+             },
+             "duration_seconds": round(training_time, 2),
+         },
+         "performance": {
+             "syllable_accuracy": round(metrics["syl_accuracy"], 4),
+             "syllable_f1": round(metrics["syl_f1"], 4),
+             "word_precision": round(metrics["word_precision"], 4),
+             "word_recall": round(metrics["word_recall"], 4),
+             "word_f1": round(metrics["word_f1"], 4),
+         },
+         "environment": {
+             "platform": hw_info["platform"],
+             "cpu_model": hw_info.get("cpu_model", "Unknown"),
+             "python_version": hw_info["python_version"],
+         },
+         "files": {
+             "model": "model.crfsuite",
+             "config": "../../../configs/word_segmentation.yaml",
+         },
+         "created_at": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
+         "author": "undertheseanlp",
+     }
+
+     metadata_path = output_dir / "metadata.yaml"
+     with open(metadata_path, "w") as f:
+         yaml.dump(metadata, f, default_flow_style=False, allow_unicode=True, sort_keys=False)
+     click.echo(f"Metadata saved to {metadata_path}")
+
+
+ def get_default_version():
+     """Generate timestamp-based version."""
+     return datetime.now().strftime("%Y%m%d_%H%M%S")
+
+
+ @click.command()
+ @click.option(
+     "--trainer", "-t",
+     type=click.Choice(TRAINERS),
+     default="python-crfsuite",
+     help="CRF trainer to use",
+     show_default=True,
+ )
+ @click.option(
+     "--version", "-v",
+     default=None,
+     help="Model version (default: timestamp, e.g., 20260131_154530)",
+ )
+ @click.option(
+     "--output", "-o",
+     default=None,
+     help="Custom output path (overrides version-based path)",
+ )
+ @click.option(
+     "--c1",
+     default=1.0,
+     type=float,
+     help="L1 regularization coefficient",
+     show_default=True,
+ )
+ @click.option(
+     "--c2",
+     default=0.001,
+     type=float,
+     help="L2 regularization coefficient",
+     show_default=True,
+ )
+ @click.option(
+     "--max-iterations",
+     default=100,
+     type=int,
+     help="Maximum training iterations",
+     show_default=True,
+ )
+ @click.option(
+     "--wandb/--no-wandb",
+     default=False,
+     help="Enable Weights & Biases logging",
+ )
+ def train(trainer, version, output, c1, c2, max_iterations, wandb):
+     """Train Vietnamese Word Segmenter using CRF on UDD-1 dataset."""
+     total_start_time = time.time()
+     start_datetime = datetime.now()
+
+     # Get trainer
+     crf_trainer = get_trainer(trainer)
+
+     # Use timestamp version if not specified
+     if version is None:
+         version = get_default_version()
+
+     # Determine output directory
+     if output:
+         output_path = Path(output)
+         output_dir = output_path.parent
+     else:
+         output_dir = PROJECT_ROOT / "models" / "word_segmentation" / version
+         output_dir.mkdir(parents=True, exist_ok=True)
+         output_path = output_dir / "model.crfsuite"
+
+     # Collect hardware info
+     hw_info = get_hardware_info()
+
+     click.echo("=" * 60)
+     click.echo(f"Word Segmentation Training - {version}")
+     click.echo("=" * 60)
+     click.echo(f"Trainer: {trainer}")
+     click.echo(f"Platform: {hw_info['platform']}")
+     click.echo(f"CPU: {hw_info.get('cpu_model', 'Unknown')}")
+     click.echo(f"Output: {output_path}")
+     click.echo(f"Started: {start_datetime.strftime('%Y-%m-%d %H:%M:%S')}")
+     click.echo("=" * 60)
+
+     # Load data
+     train_data, val_data, test_data, data_stats = load_data()
+
+     click.echo(f"\nTrain: {len(train_data)} sentences ({data_stats['train_syllables']} syllables)")
+     click.echo(f"Validation: {len(val_data)} sentences ({data_stats['val_syllables']} syllables)")
+     click.echo(f"Test: {len(test_data)} sentences ({data_stats['test_syllables']} syllables)")
+
+     # Prepare training data
+     click.echo("\nExtracting syllable-level features...")
+     feature_start = time.time()
+     X_train = [sentence_to_syllable_features(syls) for syls, _ in train_data]
+     y_train = [labels for _, labels in train_data]
+     click.echo(f"Feature extraction: {format_duration(time.time() - feature_start)}")
+
+     # Train CRF
+     click.echo(f"\nTraining CRF model with {trainer}...")
+
+     use_wandb = wandb
+     if use_wandb:
+         try:
+             import wandb as wb
+             wb.init(project="word-segmentation-vietnamese", name=f"crf-{version}")
+             wb.config.update({
+                 "trainer": trainer,
+                 "c1": c1,
+                 "c2": c2,
+                 "max_iterations": max_iterations,
+                 "num_feature_templates": len(FEATURE_TEMPLATES),
+                 "train_sentences": len(train_data),
+                 "val_sentences": len(val_data),
+                 "test_sentences": len(test_data),
+                 "version": version,
+                 "level": "syllable",
+             })
+         except ImportError:
+             click.echo("wandb not installed, skipping logging", err=True)
+             use_wandb = False
+
+     crf_start = time.time()
+     crf_trainer.train(X_train, y_train, output_path, c1, c2, max_iterations, verbose=True)
+     crf_time = time.time() - crf_start
+     click.echo(f"\nModel saved to {output_path}")
+     click.echo(f"CRF training: {format_duration(crf_time)}")
+
+     # Evaluation
+     click.echo("\nEvaluating on test set...")
+
+     X_test = [sentence_to_syllable_features(syls) for syls, _ in test_data]
+     y_test = [labels for _, labels in test_data]
+     syllables_test = [syls for syls, _ in test_data]
+
+     y_pred = crf_trainer.predict(output_path, X_test)
+
+     # Syllable-level metrics
+     y_test_flat = [label for labels in y_test for label in labels]
+     y_pred_flat = [label for labels in y_pred for label in labels]
+
+     syl_accuracy = accuracy_score(y_test_flat, y_pred_flat)
+     syl_f1 = f1_score(y_test_flat, y_pred_flat, average="weighted")
+
+     click.echo(f"\nSyllable-level Accuracy: {syl_accuracy:.4f}")
+     click.echo(f"Syllable-level F1 (weighted): {syl_f1:.4f}")
+     click.echo("\nSyllable-level Classification Report:")
+     click.echo(classification_report(y_test_flat, y_pred_flat))
+
+     # Word-level metrics
+     precision, recall, word_f1 = compute_word_metrics(y_test, y_pred, syllables_test)
+     click.echo("\nWord-level Metrics:")
+     click.echo(f"  Precision: {precision:.4f}")
+     click.echo(f"  Recall: {recall:.4f}")
+     click.echo(f"  F1: {word_f1:.4f}")
+
+     total_time = time.time() - total_start_time
+
+     # Collect metrics
+     metrics = {
+         "syl_accuracy": syl_accuracy,
+         "syl_f1": syl_f1,
+         "word_precision": precision,
+         "word_recall": recall,
+         "word_f1": word_f1,
+     }
+
+     # Save metadata
+     if not output:
+         save_metadata(output_dir, version, trainer, data_stats, c1, c2, max_iterations,
+                       metrics, hw_info, total_time)
+
+     # Show examples
+     click.echo("\n" + "=" * 60)
+     click.echo("Example predictions:")
+     click.echo("=" * 60)
+     for i in range(min(3, len(test_data))):
+         syllables = syllables_test[i]
+         true_words = labels_to_words(syllables, y_test[i])
+         pred_words = labels_to_words(syllables, y_pred[i])
+         click.echo(f"\nInput: {' '.join(syllables)}")
+         click.echo(f"True:  {' | '.join(true_words)}")
+         click.echo(f"Pred:  {' | '.join(pred_words)}")
+
+     click.echo("\n" + "=" * 60)
+     click.echo("Training Summary")
+     click.echo("=" * 60)
+     click.echo(f"Trainer: {trainer}")
+     click.echo(f"Version: {version}")
+     click.echo(f"Model: {output_path}")
+     click.echo(f"Syllable Accuracy: {syl_accuracy:.4f}")
+     click.echo(f"Word F1: {word_f1:.4f}")
+     click.echo(f"Total time: {format_duration(total_time)}")
+     click.echo("=" * 60)
+
+     if use_wandb:
+         wb.log(metrics)
+         wb.finish()
+
+
+ if __name__ == "__main__":
+     train()