rain1024 and Claude Opus 4.6 committed
Commit 8551e99 · 1 Parent(s): 30fd98c

Initial project setup for Radar-1 language detection model

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Files changed (5)
  1. .gitignore +17 -0
  2. CLAUDE.md +44 -0
  3. README.md +94 -0
  4. pyproject.toml +31 -0
  5. src/train.py +20 -0
.gitignore ADDED
@@ -0,0 +1,17 @@
+ __pycache__/
+ *.py[cod]
+ *.egg-info/
+ dist/
+ build/
+ .eggs/
+ *.egg
+ .venv/
+ venv/
+ .env
+ *.pt
+ *.bin
+ !models/**/*.bin
+ data/
+ .tox/
+ .pytest_cache/
+ *.log
CLAUDE.md ADDED
@@ -0,0 +1,44 @@
+ # CLAUDE.md
+
+ This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+ ## Project Overview
+
+ Radar-1 is a language detection model for the UnderTheSea NLP ecosystem. It identifies the language of input text, supporting Vietnamese and other languages commonly encountered in Vietnamese NLP contexts.
+
+ ## Build & Development Commands
+
+ ### Running Tests
+ ```bash
+ pytest tests/
+ ```
+
+ ### Development Installation
+ ```bash
+ pip install -e ".[dev]"
+ ```
+
+ ### Training
+ ```bash
+ python src/train.py
+ ```
+
+ ## Architecture
+
+ **Pipeline:**
+ ```
+ Input Text → Feature Extraction (character n-grams / TF-IDF)
+ → Classifier
+ → Language Label + Confidence Score
+ ```
+
+ **Key Design Decisions:**
+ - Operates at character-level for language-agnostic feature extraction
+ - Lightweight model for fast inference
+ - Compatible with underthesea API (`lang_detect`)
+
+ ## Key Files
+
+ - `src/train.py` - Training script
+ - `TECHNICAL_REPORT.md` - Detailed methodology and benchmark results
+ - `RESEARCH_PLAN.md` - Research roadmap
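
Note on the Architecture section of CLAUDE.md: the pipeline it describes (character n-grams, then a classifier, then a label with a confidence score) can be illustrated with a small standard-library sketch. This is not the Radar-1 implementation, which this commit does not contain. CLAUDE.md does not name the classifier, so multinomial naive Bayes is swapped in here as an assumption, raw n-gram counts stand in for TF-IDF weighting, and the names `char_ngrams` and `NgramNaiveBayes` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Return every character n-gram of length n_min..n_max from the padded, lowercased text."""
    text = f" {text.lower()} "
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


class NgramNaiveBayes:
    """Toy multinomial naive Bayes over character n-gram counts (hypothetical stand-in)."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)   # label -> n-gram counts
        self.docs = Counter(labels)          # label -> number of training texts
        for text, label in zip(texts, labels):
            self.counts[label].update(char_ngrams(text))
        self.vocab = {g for counter in self.counts.values() for g in counter}
        return self

    def predict(self, text):
        """Return (label, confidence); confidence is the normalized posterior of the best label."""
        grams = [g for g in char_ngrams(text) if g in self.vocab]  # skip unseen n-grams
        total_docs = sum(self.docs.values())
        log_scores = {}
        for label, counter in self.counts.items():
            denom = sum(counter.values()) + len(self.vocab)  # add-one smoothing
            score = math.log(self.docs[label] / total_docs)
            for g in grams:
                score += math.log((counter[g] + 1) / denom)
            log_scores[label] = score
        # Softmax over log scores, shifted by the maximum for numerical stability.
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        best = max(exp, key=exp.get)
        return best, exp[best] / sum(exp.values())


# Tiny illustrative training set; real training data is not part of this commit.
model = NgramNaiveBayes().fit(
    ["Xin chào, tôi là người Việt Nam", "Hôm nay trời đẹp quá",
     "Hello world", "The weather is nice today"],
    ["vi", "vi", "en", "en"],
)
label, score = model.predict("Xin chào Việt Nam")
print(label, round(score, 3))
```

Character-level features need no tokenizer, which is what makes the approach language-agnostic. A handful of sentences is only enough to show the data flow, not to estimate realistic confidence values.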
README.md ADDED
@@ -0,0 +1,94 @@
+ ---
+ license: apache-2.0
+ language:
+ - vi
+ - en
+ - zh
+ - ja
+ - ko
+ - fr
+ - de
+ - es
+ - th
+ - lo
+ - km
+ tags:
+ - language-detection
+ - language-identification
+ - vietnamese
+ - multilingual
+ library_name: underthesea
+ pipeline_tag: text-classification
+ metrics:
+ - accuracy
+ - f1
+ ---
+
+ # Radar-1
+
+ Radar-1 is a language detection model developed by UnderTheSea NLP.
+
+ ## Model Description
+
+ - **Model Type:** Language Detection (Text Classification)
+ - **Task:** Identify the language of input text
+ - **Language:** Multilingual
+ - **License:** Apache 2.0
+
+ ## Supported Languages
+
+ | Code | Language |
+ |------|----------|
+ | vi | Vietnamese |
+ | en | English |
+ | zh | Chinese |
+ | ja | Japanese |
+ | ko | Korean |
+ | fr | French |
+ | de | German |
+ | es | Spanish |
+ | th | Thai |
+ | lo | Lao |
+ | km | Khmer |
+
+ ## Installation
+
+ ```bash
+ pip install underthesea
+ ```
+
+ ## Usage
+
+ ```python
+ from underthesea import lang_detect
+
+ text = "Xin chào, tôi là người Việt Nam"
+ language = lang_detect(text)
+ print(language) # vi
+ ```
+
+ ## API
+
+ ```python
+ from radar import RadarLangDetector, detect
+
+ # Quick detection
+ lang = detect("Hello world")
+ print(lang) # en
+
+ # With confidence scores
+ detector = RadarLangDetector.load("models/radar-1")
+ result = detector.predict("Xin chào Việt Nam")
+ print(result.lang) # vi
+ print(result.score) # 0.98
+ ```
+
+ ## Training
+
+ ```bash
+ python src/train.py
+ ```
+
+ ## Technical Report
+
+ See [TECHNICAL_REPORT.md](TECHNICAL_REPORT.md) for detailed methodology and evaluation.
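
Note on the API section of README.md: the `radar` package it imports is not among the five files in this commit, so the shape of `result` can only be inferred from the `result.lang` and `result.score` accesses. The sketch below shows one way such an interface could be laid out. `LangResult`, `SketchDetector`, and `toy_backend` are hypothetical names, and the Thai-script check is a demonstration stand-in, not the model's method.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class LangResult:
    """Hypothetical result type exposing the two fields the README prints."""
    lang: str
    score: float


class SketchDetector:
    """Hypothetical wrapper; `backend` maps text to a (label, confidence) pair."""

    def __init__(self, backend: Callable[[str], Tuple[str, float]]):
        self._backend = backend

    def predict(self, text: str) -> LangResult:
        """Full prediction: language code plus confidence score."""
        lang, score = self._backend(text)
        return LangResult(lang=lang, score=score)

    def detect(self, text: str) -> str:
        """Quick detection: return only the language code."""
        return self.predict(text).lang


def toy_backend(text: str) -> Tuple[str, float]:
    """Demonstration only: flag text containing Thai-block characters, else fall back to 'en'."""
    has_thai = any("\u0e00" <= ch <= "\u0e7f" for ch in text)
    return ("th", 1.0) if has_thai else ("en", 0.5)


detector = SketchDetector(toy_backend)
result = detector.predict("สวัสดี")
print(result.lang, result.score)
```

Keeping the result as a small immutable object lets a quick `detect`-style helper and a full `predict` call share one code path, which matches the two usage styles the README shows.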
pyproject.toml ADDED
@@ -0,0 +1,31 @@
+ [project]
+ name = "radar"
+ version = "1.0.0"
+ description = "Language Detection for underthesea"
+ readme = "README.md"
+ requires-python = ">=3.10"
+ license = "Apache-2.0"
+ authors = [
+ {name = "UnderTheSea NLP", email = "undertheseanlp@gmail.com"}
+ ]
+ keywords = ["vietnamese", "nlp", "language-detection", "language-identification"]
+ dependencies = [
+ "underthesea>=9.2.9",
+ "click>=8.0.0",
+ ]
+
+ [project.optional-dependencies]
+ dev = [
+ "pytest>=7.0.0",
+ "huggingface-hub>=0.20.0",
+ "scikit-learn>=1.0.0",
+ "datasets>=2.0.0",
+ ]
+
+ [project.urls]
+ Homepage = "https://huggingface.co/undertheseanlp/radar-1"
+ Repository = "https://github.com/undertheseanlp/radar"
+
+ [build-system]
+ requires = ["hatchling"]
+ build-backend = "hatchling.build"
src/train.py ADDED
@@ -0,0 +1,20 @@
+ """Training script for Radar-1 language detection model."""
+
+ import click
+
+
+ @click.command()
+ @click.option("--data-dir", default="data", help="Path to training data")
+ @click.option("--output-dir", default="models/radar-1", help="Output directory for trained model")
+ @click.option("--epochs", default=10, help="Number of training epochs")
+ def train(data_dir, output_dir, epochs):
+     """Train the Radar-1 language detection model."""
+     click.echo(f"Training Radar-1 language detection model")
+     click.echo(f" Data: {data_dir}")
+     click.echo(f" Output: {output_dir}")
+     click.echo(f" Epochs: {epochs}")
+     # TODO: Implement training pipeline
+
+
+ if __name__ == "__main__":
+     train()