Initial project setup for Radar-1 language detection model

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Files changed (5)

.gitignore +17 -0
CLAUDE.md +44 -0
README.md +94 -0
pyproject.toml +31 -0
src/train.py +20 -0

.gitignore ADDED

	@@ -0,0 +1,17 @@

+__pycache__/
+*.py[cod]
+*.egg-info/
+dist/
+build/
+.eggs/
+*.egg
+.venv/
+venv/
+.env
+*.pt
+*.bin
+!models/**/*.bin
+data/
+.tox/
+.pytest_cache/
+*.log

CLAUDE.md ADDED

	@@ -0,0 +1,44 @@

+# CLAUDE.md
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+## Project Overview
+Radar-1 is a language detection model for the UnderTheSea NLP ecosystem. It identifies the language of input text, supporting Vietnamese and other languages commonly encountered in Vietnamese NLP contexts.
+## Build & Development Commands
+### Running Tests
+```bash
+pytest tests/
+```
+### Development Installation
+```bash
+pip install -e ".[dev]"
+```
+### Training
+```bash
+python src/train.py
+```
+## Architecture
+**Pipeline:**
+```
+Input Text → Feature Extraction (character n-grams / TF-IDF)
+           → Classifier
+           → Language Label + Confidence Score
+```
+**Key Design Decisions:**
+- Operates at the character level, so feature extraction is language-agnostic
+- Lightweight model for fast inference
+- Compatible with underthesea API (`lang_detect`)
+## Key Files
+- `src/train.py` - Training script
+- `TECHNICAL_REPORT.md` - Detailed methodology and benchmark results
+- `RESEARCH_PLAN.md` - Research roadmap
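
The pipeline described in CLAUDE.md (character n-grams / TF-IDF, then a classifier, then a label with a confidence score) can be illustrated with a small standard-library sketch. This is not the project's implementation: the commit leaves the classifier unspecified, so a nearest-centroid classifier with cosine similarity is swapped in here as an assumption, and the names `char_ngrams` and `NgramCentroidClassifier` are hypothetical. The confidence value is simply the winning language's share of the total similarity, which is one of several reasonable choices.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Count character n-grams of length n_min..n_max in lowercased text."""
    text = f" {text.lower().strip()} "  # pad so word boundaries become features
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def _normalize(vec):
    """Scale a sparse vector to unit length (all-zero vectors are left as-is)."""
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {g: v / norm for g, v in vec.items()}


class NgramCentroidClassifier:
    """Hypothetical stand-in classifier: TF-IDF centroids + cosine similarity."""

    def fit(self, texts, labels):
        docs = [char_ngrams(t) for t in texts]
        # Smoothed inverse document frequency for every n-gram seen in training.
        doc_freq = Counter(g for d in docs for g in d)
        n_docs = len(docs)
        self.idf = {
            g: math.log((1 + n_docs) / (1 + c)) + 1.0 for g, c in doc_freq.items()
        }
        # Sum the TF-IDF vectors of each language into one centroid per label.
        sums = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            for g, tf in doc.items():
                sums[label][g] += tf * self.idf[g]
        self.centroids = {label: _normalize(vec) for label, vec in sums.items()}
        return self

    def predict(self, text):
        """Return (label, confidence) for a single input string."""
        # N-grams never seen in training get weight 0 and do not contribute.
        vec = _normalize(
            {g: tf * self.idf.get(g, 0.0) for g, tf in char_ngrams(text).items()}
        )
        sims = {
            label: sum(w * centroid.get(g, 0.0) for g, w in vec.items())
            for label, centroid in self.centroids.items()
        }
        best = max(sims, key=sims.get)
        total = sum(sims.values()) or 1.0
        return best, sims[best] / total


# Tiny illustrative training set (made up for this sketch, not project data).
train_texts = [
    "Xin chào, tôi là người Việt Nam",
    "Hôm nay trời đẹp quá",
    "Tôi thích ăn phở",
    "Hello world, how are you",
    "The weather is nice today",
    "I like to eat noodles",
]
train_labels = ["vi", "vi", "vi", "en", "en", "en"]

clf = NgramCentroidClassifier().fit(train_texts, train_labels)
label, score = clf.predict("Xin chào Việt Nam")
print(label, round(score, 2))
```

A real model would need far more training text per language and a calibrated score; the sketch only shows how the three pipeline stages connect.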

README.md ADDED

	@@ -0,0 +1,94 @@

+---
+license: apache-2.0
+language:
+- vi
+- en
+- zh
+- ja
+- ko
+- fr
+- de
+- es
+- th
+- lo
+- km
+tags:
+- language-detection
+- language-identification
+- vietnamese
+- multilingual
+library_name: underthesea
+pipeline_tag: text-classification
+metrics:
+- accuracy
+- f1
+---
+# Radar-1
+Radar-1 is a language detection model developed by UnderTheSea NLP.
+## Model Description
+- **Model Type:** Language Detection (Text Classification)
+- **Task:** Identify the language of input text
+- **Language:** Multilingual
+- **License:** Apache 2.0
+## Supported Languages
+| Code | Language |
+|------|----------|
+| vi | Vietnamese |
+| en | English |
+| zh | Chinese |
+| ja | Japanese |
+| ko | Korean |
+| fr | French |
+| de | German |
+| es | Spanish |
+| th | Thai |
+| lo | Lao |
+| km | Khmer |
+## Installation
+```bash
+pip install underthesea
+```
+## Usage
+```python
+from underthesea import lang_detect
+text = "Xin chào, tôi là người Việt Nam"
+language = lang_detect(text)
+print(language)  # vi
+```
+## API
+```python
+from radar import RadarLangDetector, detect
+# Quick detection
+lang = detect("Hello world")
+print(lang)  # en
+# With confidence scores
+detector = RadarLangDetector.load("models/radar-1")
+result = detector.predict("Xin chào Việt Nam")
+print(result.lang)   # vi
+print(result.score)  # 0.98
+```
+## Training
+```bash
+python src/train.py
+```
+## Technical Report
+See [TECHNICAL_REPORT.md](TECHNICAL_REPORT.md) for detailed methodology and evaluation.

pyproject.toml ADDED

	@@ -0,0 +1,31 @@

+[project]
+name = "radar"
+version = "1.0.0"
+description = "Language Detection for underthesea"
+readme = "README.md"
+requires-python = ">=3.10"
+license = "Apache-2.0"
+authors = [
+    {name = "UnderTheSea NLP", email = "undertheseanlp@gmail.com"}
+]
+keywords = ["vietnamese", "nlp", "language-detection", "language-identification"]
+dependencies = [
+    "underthesea>=9.2.9",
+    "click>=8.0.0",
+]
+[project.optional-dependencies]
+dev = [
+    "pytest>=7.0.0",
+    "huggingface-hub>=0.20.0",
+    "scikit-learn>=1.0.0",
+    "datasets>=2.0.0",
+]
+[project.urls]
+Homepage = "https://huggingface.co/undertheseanlp/radar-1"
+Repository = "https://github.com/undertheseanlp/radar"
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"

src/train.py ADDED

	@@ -0,0 +1,20 @@

+"""Training script for Radar-1 language detection model."""
+
+import click
+
+
+@click.command()
+@click.option("--data-dir", default="data", help="Path to training data")
+@click.option("--output-dir", default="models/radar-1", help="Output directory for trained model")
+@click.option("--epochs", default=10, help="Number of training epochs")
+def train(data_dir, output_dir, epochs):
+    """Train the Radar-1 language detection model."""
+    # Plain string: no placeholders, so no f-prefix is needed.
+    click.echo("Training Radar-1 language detection model")
+    click.echo(f"  Data: {data_dir}")
+    click.echo(f"  Output: {output_dir}")
+    click.echo(f"  Epochs: {epochs}")
+    # TODO: Implement training pipeline
+
+
+if __name__ == "__main__":
+    train()
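
The `TODO` in `train()` leaves the data-loading step open. One possible starting point is sketched below, under a loudly stated assumption: the commit does not define the layout of `data/`, so the sketch assumes one UTF-8 file per language named `<lang>.txt` with one sample per line. The helper names `load_dataset` and `split_samples` are hypothetical and are not part of the repository.

```python
import random
import tempfile
from pathlib import Path


def load_dataset(data_dir):
    """Read (text, label) pairs from <data_dir>/<lang>.txt, one sample per line.

    The directory layout is an assumption made for this sketch.
    """
    samples = []
    for path in sorted(Path(data_dir).glob("*.txt")):
        label = path.stem  # e.g. "vi" from "vi.txt"
        for line in path.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if line:  # skip blank lines
                samples.append((line, label))
    return samples


def split_samples(samples, test_ratio=0.2, seed=0):
    """Shuffle deterministically and hold out a fraction for evaluation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio)) if shuffled else 0
    return shuffled[n_test:], shuffled[:n_test]


# Usage with a throwaway directory standing in for --data-dir.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "vi.txt").write_text("Xin chào\nCảm ơn bạn\n\n", encoding="utf-8")
    Path(tmp, "en.txt").write_text("Hello\nThank you\nGood morning\n", encoding="utf-8")
    samples = load_dataset(tmp)
    train_set, test_set = split_samples(samples)
    print(len(train_set), len(test_set))
```

Inside `train()`, the returned pairs could then be passed to whichever feature extractor and classifier the project settles on.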