Update to use underthesea_core for FastText inference

- Replace fasttext-wheel with underthesea_core>=3.3.0 as core dependency
- Move fasttext-wheel and numpy to dev dependencies (benchmark only)
- Rewrite benchmark to compare all 5 fasttext libraries
- Update TECHNICAL_REPORT.md with full benchmark results

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
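
For anyone still comparing against fasttext-wheel in a dev environment, the two return shapes visible in the diff differ: fasttext returns a `(labels, scores)` tuple with `__label__` prefixes, while the benchmark indexes the `underthesea_core` result as `r[0][0]`, which suggests a list of `(label, score)` pairs. A minimal adapter sketch under that assumption; the helper names are hypothetical and the sample values are invented, not taken from a model run:

```python
from typing import List, Sequence, Tuple

LABEL_PREFIX = "__label__"


def normalize_fasttext_output(
    labels: Sequence[str], scores: Sequence[float]
) -> List[Tuple[str, float]]:
    """Turn fasttext-style parallel sequences into (label, score) pairs."""
    return [
        (label.replace(LABEL_PREFIX, ""), float(score))
        for label, score in zip(labels, scores)
    ]


def top_label(pairs: List[Tuple[str, float]], default: str = "?") -> str:
    """Return the first label, or a placeholder when nothing was predicted."""
    return pairs[0][0] if pairs else default


# Invented sample values, shaped like a fasttext top-2 prediction.
pairs = normalize_fasttext_output(("__label__vi", "__label__en"), (0.98, 0.01))
print(pairs)
print(top_label(pairs))
```

With both libraries reduced to the same pair shape, a top-1 comparison becomes a plain string equality check.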

Files changed (3)

TECHNICAL_REPORT.md +52 -27
benchmark_fasttext.py +184 -138
pyproject.toml +3 -2

TECHNICAL_REPORT.md CHANGED

@@ -10,8 +10,9 @@ Radar-1 is a language detection module for the [underthesea](https://github.com/
 |--------|-------|
 | Accuracy (97 test cases, 25+ languages) | 95.9% |
 | Prediction match vs C++ fasttext | 100% |
-| Inference speedup vs C++ fasttext | 2.15x (avg), 2.29x (batch) |
-| Batch throughput | 122,198 predictions/sec |
 ## 2. Model
@@ -157,36 +158,62 @@ The 4 misclassified cases are inherent model errors (both Rust and Python produc
 ## 6. Performance
-### 6.1 Prediction Latency
-Median latency over 1,000 runs per sentence (top-3 prediction):
-| Input | Rust | C++ Python | Speedup |
-|-------|------|------------|---------|
-| "hello" (5 chars) | 4.0 us | 8.3 us | 2.05x |
-| Vietnamese, medium (27 chars) | 6.4 us | 14.2 us | 2.23x |
-| English, medium (44 chars) | 7.5 us | 16.9 us | 2.26x |
-| French, medium (49 chars) | 8.0 us | 16.6 us | 2.06x |
-| Japanese, medium (22 chars) | 6.1 us | 15.9 us | 2.61x |
-| Vietnamese, long (110 chars) | 18.7 us | 37.4 us | 2.00x |
-| **Average** | **8.5 us** | **18.2 us** | **2.15x** |
-### 6.2 Batch Throughput
-Sustained throughput over 60,000 predictions:
-| Implementation | Throughput |
-|----------------|-----------|
-| Rust (underthesea_core) | **122,198 pred/sec** |
-| C++ (fasttext-wheel) | 53,269 pred/sec |
-| **Speedup** | **2.29x** |
-### 6.3 Model Loading
 | Implementation | Load Time |
 |----------------|-----------|
-| Rust | 51.1 ms |
-| C++ | 25.6 ms |
 Model loading is slower in Rust due to element-by-element float parsing (vs C++ bulk `memcpy`). This is a one-time cost and does not affect prediction performance.
@@ -274,13 +301,11 @@ No additional crates were added for the FastText module.
 | Package | Version | Purpose |
 |---------|---------|---------|
 | underthesea | >= 9.2.9 | NLP ecosystem integration |
-| fasttext-wheel | 0.9.2 | Reference implementation (testing only) |
 ## 10. Future Work
 - Batch prediction API (`predict_batch(texts: Vec<str>)`)
 - SIMD-accelerated dot products for further speedup
-- Support for `.bin` (dense, non-quantized) models
-- Softmax loss function models (currently only HS is tested)
-- Removal of debug methods before production release
 - Bulk `read_exact` for dense matrix loading to improve load time

 |--------|-------|
 | Accuracy (97 test cases, 25+ languages) | 95.9% |
 | Prediction match vs C++ fasttext | 100% |
+| Batch throughput | 110,001 predictions/sec |
+| vs fasttext-predict (C++ stripped) | 2.14x faster |
+| vs fasttext-wheel (C++ full) | 2.42x faster |
 ## 2. Model
 ## 6. Performance
+### 6.1 Library Comparison
+Benchmarked against all major Python FastText libraries using `lid.176.ftz` on 16 multilingual sentences:
+| Library | Type | Load (ms) | Avg Latency (us) | Throughput (pred/s) | vs Rust |
+|---------|------|-----------|-------------------|---------------------|---------|
+| **underthesea_core** | Rust/PyO3 | 51.8 | 8.3 | **110,001** | **1.00x** |
+| fasttext-langdetect | C++ wrapper | 0.0* | 8.9 | 89,038 | 0.81x |
+| fast-langdetect | C++ wrapper | 0.0* | 14.6 | 57,399 | 0.52x |
+| fasttext-predict | C++ stripped | 29.9 | 17.1 | 51,493 | 0.47x |
+| fasttext-wheel | C++ full | 28.4 | 19.2 | 45,547 | 0.41x |
+*\* Wrappers keep model loaded globally, so load = 0 after warmup.*
+**Libraries tested:**
+| Package | Version | Description |
+|---------|---------|-------------|
+| [underthesea_core](https://pypi.org/project/underthesea-core/) | 3.3.0 | Pure Rust FastText inference (this project) |
+| [fasttext-predict](https://github.com/searxng/fasttext-predict) | 0.9.2.4 | C++ predict-only fork, no numpy, <1MB wheel |
+| [fasttext-wheel](https://pypi.org/project/fasttext-wheel/) | 0.9.2 | Full Facebook C++ fasttext with numpy/pybind11 |
+| [fast-langdetect](https://github.com/LlmKira/fast-langdetect) | 1.0.0 | Wrapper around fasttext-predict, bundles lid.176.ftz |
+| [fasttext-langdetect](https://github.com/zafercavdar/fasttext-langdetect) | 1.0.5 | Wrapper around full fasttext |
+### 6.2 Prediction Latency by Input
+Median latency over 500 runs per sentence (top-3 prediction):
+| Input | Rust (us) | C++ fasttext-predict (us) | Speedup |
+|-------|-----------|---------------------------|---------|
+| "hello" (5 chars) | 3.0 | 6.3 | 2.10x |
+| Vietnamese, medium (36 chars) | 6.2 | 13.0 | 2.10x |
+| English, medium (44 chars) | 6.7 | 14.7 | 2.19x |
+| French, medium (49 chars) | 7.2 | 15.2 | 2.11x |
+| Chinese (10 chars) | 4.6 | 10.7 | 2.33x |
+| Japanese (11 chars) | 4.0 | 8.3 | 2.08x |
+| Vietnamese, long (185 chars) | 29.1 | 59.3 | 2.04x |
+| **Average** | **8.3** | **17.1** | **2.06x** |
+### 6.3 Prediction Verification
+All implementations produce identical top-1 predictions on 16 test sentences:
+| | underthesea_core | fasttext-predict | fasttext-wheel |
+|-|-----------------|------------------|----------------|
+| Match rate | - | **16/16 (100%)** | **16/16 (100%)** |
+Note: `fast-langdetect` and `fasttext-langdetect` show 15/16 match because they default to the larger `lid.176.bin` model instead of `.ftz`.
+### 6.4 Model Loading
 | Implementation | Load Time |
 |----------------|-----------|
+| fasttext-predict (C++) | 29.9 ms |
+| fasttext-wheel (C++) | 28.4 ms |
+| underthesea_core (Rust) | 51.8 ms |
 Model loading is slower in Rust due to element-by-element float parsing (vs C++ bulk `memcpy`). This is a one-time cost and does not affect prediction performance.
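
The difference between element-by-element parsing and a bulk read can be illustrated with a toy buffer. This is a Python analogy only, not the Rust loader: the little-endian float32 layout and the function names are assumptions made for the illustration.

```python
import struct

# Toy stand-in for a dense matrix stored as raw little-endian float32 values.
values = [0.5, -1.25, 3.0, 42.0]
raw = struct.pack(f"<{len(values)}f", *values)


def parse_elementwise(buf: bytes) -> list:
    """Decode one 4-byte float per call, as an element-by-element loader does."""
    return [
        struct.unpack_from("<f", buf, offset)[0]
        for offset in range(0, len(buf), 4)
    ]


def parse_bulk(buf: bytes) -> list:
    """Decode the whole buffer in a single call, closer to a bulk copy."""
    count = len(buf) // 4
    return list(struct.unpack(f"<{count}f", buf))


# Both strategies must yield the same numbers; only the call count differs.
print(parse_elementwise(raw) == parse_bulk(raw))
```

Both functions return the same floats; the bulk version simply pays the per-call overhead once instead of once per element, which is the idea behind the `read_exact` item under Future Work.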
 | Package | Version | Purpose |
 |---------|---------|---------|
 | underthesea | >= 9.2.9 | NLP ecosystem integration |
+| underthesea_core | >= 3.3.0 | Rust FastText inference |
 ## 10. Future Work
 - Batch prediction API (`predict_batch(texts: Vec<str>)`)
 - SIMD-accelerated dot products for further speedup
 - Bulk `read_exact` for dense matrix loading to improve load time
+- Softmax loss function models (currently only HS is tested)
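
The "vs Rust" column in section 6.1 and the two "x faster" rows in the summary table follow from the batch throughput figures. A short sketch that recomputes them, using only numbers copied from the table; the function names are illustrative:

```python
# Batch throughput (predictions/sec) copied from the Library Comparison table.
THROUGHPUT = {
    "underthesea_core": 110_001,
    "fasttext-langdetect": 89_038,
    "fast-langdetect": 57_399,
    "fasttext-predict": 51_493,
    "fasttext-wheel": 45_547,
}
BASELINE = "underthesea_core"


def relative_to_baseline(lib: str) -> float:
    """Throughput of `lib` as a fraction of the baseline (the 'vs Rust' column)."""
    return round(THROUGHPUT[lib] / THROUGHPUT[BASELINE], 2)


def speedup_over(lib: str) -> float:
    """How many times faster the baseline is than `lib`."""
    return round(THROUGHPUT[BASELINE] / THROUGHPUT[lib], 2)


for lib in THROUGHPUT:
    print(f"{lib:<20s} vs Rust = {relative_to_baseline(lib):.2f}x   "
          f"Rust speedup = {speedup_over(lib):.2f}x")
```

The two views are reciprocals of each other, so 0.47x for fasttext-predict in section 6.1 and "2.14x faster" in the summary describe the same measurement.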

benchmark_fasttext.py CHANGED

@@ -1,197 +1,243 @@
 """
-Benchmark: underthesea_core.FastTextModel (Rust/PyO3) vs facebook fasttext (C++/pybind11)
-Compares: model loading time, single prediction latency, batch throughput, memory usage.
 """
-import time
-import statistics
-import tracemalloc
 MODEL_PATH = "/tmp/lid.176.ftz"
-# ── Test sentences (varying length & language) ──────────────────────────────
 SENTENCES = [
-    # Vietnamese
     "Xin chào, tôi là sinh viên Việt Nam",
     "Hôm nay thời tiết rất đẹp, tôi muốn đi dạo công viên",
     "Việt Nam là một quốc gia nằm ở phía đông bán đảo Đông Dương thuộc khu vực Đông Nam Á",
-    # English
     "The quick brown fox jumps over the lazy dog",
     "Machine learning is a subset of artificial intelligence that focuses on building systems",
     "Natural language processing enables computers to understand human language",
-    # French
     "Bonjour le monde, comment allez-vous aujourd'hui",
     "La France est un pays dont la métropole se situe en Europe de l'Ouest",
-    # Chinese
     "今天天气很好我想出去走走",
     "机器学习是人工智能的一个重要分支",
-    # Japanese
     "今日はとても良い天気ですね",
     "自然言語処理は人工知能の重要な分野です",
-    # Short texts
     "hello",
     "xin chào",
     "bonjour",
-    # Longer text
     "Việt Nam, tên gọi chính thức là Cộng hòa Xã hội chủ nghĩa Việt Nam, "
     "là một quốc gia nằm ở cực Đông của bán đảo Đông Dương thuộc khu vực "
     "Đông Nam Á, giáp với Lào, Campuchia, Trung Quốc, biển Đông và vịnh Thái Lan.",
 ]
-K = 3  # top-k predictions
 WARMUP = 50
 REPEATS = 500
-def benchmark_load(load_fn, label, n=5):
-    """Measure model loading time."""
-    times = []
-    for _ in range(n):
         t0 = time.perf_counter()
-        model = load_fn()
         t1 = time.perf_counter()
-        times.append(t1 - t0)
-    med = statistics.median(times)
-    print(f"  {label:30s}  load = {med*1000:8.1f} ms  (median of {n})")
-    return model
-def benchmark_predict(model, predict_fn, label):
-    """Measure single-call latency and throughput."""
-    # Warmup
     for _ in range(WARMUP):
         for s in SENTENCES:
-            predict_fn(model, s)
-    # Timed runs
     per_sentence_us = []
     for s in SENTENCES:
         times = []
         for _ in range(REPEATS):
             t0 = time.perf_counter()
-            predict_fn(model, s)
             t1 = time.perf_counter()
             times.append(t1 - t0)
-        med = statistics.median(times)
-        per_sentence_us.append(med * 1e6)
     avg_us = statistics.mean(per_sentence_us)
-    min_us = min(per_sentence_us)
-    max_us = max(per_sentence_us)
-    throughput = 1e6 / avg_us  # predictions/sec
-    print(f"  {label:30s}  avg = {avg_us:7.1f} µs   "
-          f"min = {min_us:7.1f} µs   max = {max_us:7.1f} µs   "
-          f"({throughput:,.0f} pred/s)")
-    return per_sentence_us
-def benchmark_batch(model, predict_fn, label, n_calls=5000):
-    """Measure total throughput over many calls."""
-    # Warmup
-    for s in SENTENCES:
-        predict_fn(model, s)
     t0 = time.perf_counter()
-    for _ in range(n_calls):
         for s in SENTENCES:
-            predict_fn(model, s)
     t1 = time.perf_counter()
-    total = n_calls * len(SENTENCES)
-    elapsed = t1 - t0
-    throughput = total / elapsed
-    print(f"  {label:30s}  {total:,} calls in {elapsed:.2f}s = {throughput:,.0f} pred/s")
-def benchmark_memory(load_fn, predict_fn, label):
-    """Measure peak memory for model loading + prediction."""
-    tracemalloc.start()
-    model = load_fn()
     for s in SENTENCES:
-        predict_fn(model, s)
-    current, peak = tracemalloc.get_traced_memory()
-    tracemalloc.stop()
-    print(f"  {label:30s}  current = {current/1024/1024:.1f} MB   peak = {peak/1024/1024:.1f} MB")
-def verify_results(rust_model, fb_model, rust_predict, fb_predict):
-    """Check that both models produce the same labels."""
-    print("\n── Result Verification ──")
-    matches = 0
-    total = len(SENTENCES)
-    for s in SENTENCES:
-        r_res = rust_predict(rust_model, s)
-        f_res = fb_predict(fb_model, s)
-        r_label = r_res[0][0] if r_res else "?"
-        f_label = f_res[0][0] if f_res else "?"
-        match = "✓" if r_label == f_label else "✗"
-        if r_label == f_label:
-            matches += 1
-        r_score = r_res[0][1] if r_res else 0
-        f_score = f_res[0][1] if f_res else 0
-        text_preview = s[:50] + ("..." if len(s) > 50 else "")
-        print(f"  {match}  {text_preview:55s}  "
-              f"rust={r_label}({r_score:.4f})  fb={f_label}({f_score:.4f})")
-    print(f"\n  Match rate: {matches}/{total}")
 def main():
-    # ── Load libraries ──
-    print("Loading libraries...")
-    import fasttext
-    from underthesea_core import FastTextModel
-    # ── Define load/predict callables ──
-    def rust_load():
-        return FastTextModel.load(MODEL_PATH)
-    def fb_load():
-        return fasttext.load_model(MODEL_PATH)
-    def rust_predict(m, text):
-        return m.predict(text, k=K)
-    def fb_predict(m, text):
-        labels, scores = m.predict(text, k=K)
-        return [(l.replace("__label__", ""), s) for l, s in zip(labels, scores)]
-    # ── Model Loading ──
-    print("\n── Model Loading Time ──")
-    rust_model = benchmark_load(rust_load, "underthesea_core (Rust)")
-    fb_model = benchmark_load(fb_load, "fasttext (Facebook C++)")
-    # ── Verify correctness first ──
-    verify_results(rust_model, fb_model, rust_predict, fb_predict)
-    # ── Single Prediction Latency ──
-    print("\n── Single Prediction Latency ──")
-    rust_times = benchmark_predict(rust_model, rust_predict, "underthesea_core (Rust)")
-    fb_times = benchmark_predict(fb_model, fb_predict, "fasttext (Facebook C++)")
-    # ── Per-sentence comparison ──
-    print("\n── Per-Sentence Speedup (Rust vs Facebook) ──")
     for i, s in enumerate(SENTENCES):
-        ratio = fb_times[i] / rust_times[i] if rust_times[i] > 0 else float('inf')
-        text_preview = s[:50] + ("..." if len(s) > 50 else "")
-        print(f"  {text_preview:55s}  {ratio:.2f}x")
-    avg_speedup = statistics.mean(fb_times) / statistics.mean(rust_times)
-    print(f"\n  Average speedup: {avg_speedup:.2f}x")
-    # ── Batch Throughput ──
-    print("\n── Batch Throughput ──")
-    benchmark_batch(rust_model, rust_predict, "underthesea_core (Rust)")
-    benchmark_batch(fb_model, fb_predict, "fasttext (Facebook C++)")
-    # ── Memory Usage ──
-    print("\n── Memory Usage (Python-side tracemalloc) ──")
-    benchmark_memory(rust_load, rust_predict, "underthesea_core (Rust)")
-    benchmark_memory(fb_load, fb_predict, "fasttext (Facebook C++)")
-    print("\nDone.")
 if __name__ == "__main__":

 """
+Benchmark: underthesea_core FastText (Rust/PyO3) vs all Python fasttext libraries.
+Compares: model loading time, single prediction latency, batch throughput.
+Libraries tested:
+  1. underthesea_core  - Pure Rust (PyO3), predict-only
+  2. fasttext-predict  - C++ stripped predict-only, no numpy (<1MB)
+  3. fasttext-wheel    - Full Facebook C++ fasttext
+  4. fast-langdetect   - Wrapper around fasttext-predict, bundles lid.176.ftz
+  5. fasttext-langdetect - Wrapper around full fasttext
 """
+import subprocess
+import sys
+import json
+import os
 MODEL_PATH = "/tmp/lid.176.ftz"
 SENTENCES = [
     "Xin chào, tôi là sinh viên Việt Nam",
     "Hôm nay thời tiết rất đẹp, tôi muốn đi dạo công viên",
     "Việt Nam là một quốc gia nằm ở phía đông bán đảo Đông Dương thuộc khu vực Đông Nam Á",
     "The quick brown fox jumps over the lazy dog",
     "Machine learning is a subset of artificial intelligence that focuses on building systems",
     "Natural language processing enables computers to understand human language",
     "Bonjour le monde, comment allez-vous aujourd'hui",
     "La France est un pays dont la métropole se situe en Europe de l'Ouest",
     "今天天气很好我想出去走走",
     "机器学习是人工智能的一个重要分支",
     "今日はとても良い天気ですね",
     "自然言語処理は人工知能の重要な分野です",
     "hello",
     "xin chào",
     "bonjour",
     "Việt Nam, tên gọi chính thức là Cộng hòa Xã hội chủ nghĩa Việt Nam, "
     "là một quốc gia nằm ở cực Đông của bán đảo Đông Dương thuộc khu vực "
     "Đông Nam Á, giáp với Lào, Campuchia, Trung Quốc, biển Đông và vịnh Thái Lan.",
 ]
+# Runner script executed in each venv
+RUNNER_SCRIPT = r'''
+import time, statistics, json, sys, os
+MODEL_PATH = sys.argv[1]
+SENTENCES = json.loads(sys.argv[2])
+LIB_NAME = sys.argv[3]
+K = 3
 WARMUP = 50
 REPEATS = 500
+BATCH_CALLS = 5000
+def run():
+    # --- Load ---
+    if LIB_NAME == "underthesea_core":
+        from underthesea_core import FastText
+        def load(): return FastText.load(MODEL_PATH)
+        def predict(m, t): return m.predict(t, k=K)
+        def fmt(r): return r[0][0] if r else "?"
+    elif LIB_NAME == "fasttext-predict":
+        import fasttext
+        def load(): return fasttext.load_model(MODEL_PATH)
+        def predict(m, t): return m.predict(t, k=K)
+        def fmt(r): return r[0][0].replace("__label__","") if r[0] else "?"
+    elif LIB_NAME == "fasttext-wheel":
+        import fasttext
+        def load(): return fasttext.load_model(MODEL_PATH)
+        def predict(m, t): return m.predict(t, k=K)
+        def fmt(r): return r[0][0].replace("__label__","") if r[0] else "?"
+    elif LIB_NAME == "fast-langdetect":
+        from fast_langdetect import detect
+        # preload to avoid download during benchmark
+        detect("warmup")
+        def load(): return None
+        def predict(m, t): return detect(t)
+        def fmt(r): return r[0]["lang"] if isinstance(r, list) else r.get("lang","?")
+    elif LIB_NAME == "fasttext-langdetect":
+        from ftlangdetect import detect
+        detect("warmup")
+        def load(): return None
+        def predict(m, t): return detect(t)
+        def fmt(r): return r.get("lang","?")
+    # --- Benchmark Load ---
+    load_times = []
+    for _ in range(5):
         t0 = time.perf_counter()
+        model = load()
         t1 = time.perf_counter()
+        load_times.append(t1 - t0)
+    load_ms = statistics.median(load_times) * 1000
+    # --- Warmup ---
     for _ in range(WARMUP):
         for s in SENTENCES:
+            predict(model, s)
+    # --- Single prediction latency ---
     per_sentence_us = []
     for s in SENTENCES:
         times = []
         for _ in range(REPEATS):
             t0 = time.perf_counter()
+            predict(model, s)
             t1 = time.perf_counter()
             times.append(t1 - t0)
+        per_sentence_us.append(statistics.median(times) * 1e6)
     avg_us = statistics.mean(per_sentence_us)
+    throughput_single = 1e6 / avg_us
+    # --- Batch throughput ---
     t0 = time.perf_counter()
+    for _ in range(BATCH_CALLS):
         for s in SENTENCES:
+            predict(model, s)
     t1 = time.perf_counter()
+    total = BATCH_CALLS * len(SENTENCES)
+    throughput_batch = total / (t1 - t0)
+    # --- Predictions for verification ---
+    preds = []
     for s in SENTENCES:
+        r = predict(model, s)
+        preds.append(fmt(r))
+    result = {
+        "lib": LIB_NAME,
+        "load_ms": round(load_ms, 1),
+        "avg_us": round(avg_us, 1),
+        "min_us": round(min(per_sentence_us), 1),
+        "max_us": round(max(per_sentence_us), 1),
+        "throughput_single": int(throughput_single),
+        "throughput_batch": int(throughput_batch),
+        "preds": preds,
+    }
+    print(json.dumps(result))
+run()
+'''
+VENVS = {
+    "underthesea_core":    "/tmp/venv_ftpredict/bin/python3",
+    "fasttext-predict":    "/tmp/venv_ftpredict/bin/python3",
+    "fasttext-wheel":      "/tmp/venv_ftwheel/bin/python3",
+    "fast-langdetect":     "/tmp/venv_fastlang/bin/python3",
+    "fasttext-langdetect": "/tmp/venv_ftlangdetect/bin/python3",
+}
+def run_benchmark(lib_name, python_bin):
+    """Run benchmark in a subprocess with the correct venv."""
+    env = os.environ.copy()
+    env.pop("VIRTUAL_ENV", None)
+    result = subprocess.run(
+        [python_bin, "-c", RUNNER_SCRIPT, MODEL_PATH, json.dumps(SENTENCES), lib_name],
+        capture_output=True, text=True, timeout=600, env=env,
+    )
+    # Filter out non-JSON lines (warnings, download progress, etc.)
+    for line in result.stdout.strip().split("\n"):
+        line = line.strip()
+        if line.startswith("{"):
+            return json.loads(line)
+    print(f"  ERROR ({lib_name}): {result.stderr[-500:]}", file=sys.stderr)
+    return None
 def main():
+    print("=" * 80)
+    print("FastText Library Benchmark")
+    print("=" * 80)
+    print(f"Model: {MODEL_PATH}")
+    print(f"Sentences: {len(SENTENCES)}")
+    print()
+    results = []
+    for lib_name, python_bin in VENVS.items():
+        if not os.path.exists(python_bin):
+            print(f"  SKIP {lib_name}: venv not found at {python_bin}")
+            continue
+        print(f"  Benchmarking {lib_name}...", end="", flush=True)
+        r = run_benchmark(lib_name, python_bin)
+        if r:
+            print(f" done ({r['throughput_batch']:,} pred/s)")
+            results.append(r)
+        else:
+            print(" FAILED")
+    if not results:
+        print("No results!")
+        return
+    # --- Results Table ---
+    print()
+    print("=" * 80)
+    print(f"{'Library':<22s} {'Load':>8s} {'Avg':>8s} {'Min':>8s} {'Max':>8s} {'Throughput':>12s}")
+    print(f"{'':<22s} {'(ms)':>8s} {'(µs)':>8s} {'(µs)':>8s} {'(µs)':>8s} {'(pred/s)':>12s}")
+    print("-" * 80)
+    baseline = results[0]["throughput_batch"]
+    for r in results:
+        ratio = r["throughput_batch"] / baseline if baseline else 0
+        mark = "" if r["lib"] == results[0]["lib"] else f"  ({ratio:.2f}x)"
+        print(f"  {r['lib']:<20s} {r['load_ms']:>8.1f} {r['avg_us']:>8.1f} "
+              f"{r['min_us']:>8.1f} {r['max_us']:>8.1f} {r['throughput_batch']:>10,}{mark}")
+    # --- Prediction Verification ---
+    print()
+    print("=" * 80)
+    print("Prediction Verification (top-1 label)")
+    print("-" * 80)
+    ref = results[0]
+    header = f"  {'Text':<50s}"
+    for r in results:
+        header += f" {r['lib'][:10]:>10s}"
+    print(header)
+    print("  " + "-" * (50 + 11 * len(results)))
     for i, s in enumerate(SENTENCES):
+        preview = s[:48] + ".." if len(s) > 48 else s
+        row = f"  {preview:<50s}"
+        for r in results:
+            pred = r["preds"][i]
+            match = "" if pred == ref["preds"][i] else "*"
+            row += f" {pred+match:>10s}"
+        print(row)
+    # --- Match rate ---
+    print()
+    for r in results[1:]:
+        matches = sum(1 for i in range(len(SENTENCES)) if r["preds"][i] == ref["preds"][i])
+        print(f"  {r['lib']} vs {ref['lib']}: {matches}/{len(SENTENCES)} match")
+    print()
+    print("Done.")
 if __name__ == "__main__":
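
The rewritten benchmark runs each library in its own interpreter and reads back a single JSON line, ignoring any other output. A condensed, self-contained sketch of that pattern follows. It reuses the current interpreter (`sys.executable`) instead of per-library venvs, and `predict` is a placeholder rather than a real model call; both are assumptions made so the sketch runs anywhere.

```python
import json
import subprocess
import sys

# Child script: times a stand-in function and prints one JSON line among noise.
CHILD_SCRIPT = r'''
import json, statistics, sys, time

text = sys.argv[1]
repeats = int(sys.argv[2])

def predict(t):
    # Placeholder for a real model call.
    return t.upper()

times = []
for _ in range(repeats):
    t0 = time.perf_counter()
    predict(text)
    times.append(time.perf_counter() - t0)

print("some warning a library might emit")
print(json.dumps({"median_us": statistics.median(times) * 1e6,
                  "pred": predict(text)}))
'''


def first_json_line(stdout: str):
    """Return the first line that parses as a JSON object, skipping noise."""
    for line in stdout.splitlines():
        line = line.strip()
        if line.startswith("{"):
            return json.loads(line)
    return None


def run_isolated(python_bin: str, text: str, repeats: int = 200):
    """Run the child script in a separate interpreter and parse its result."""
    proc = subprocess.run(
        [python_bin, "-c", CHILD_SCRIPT, text, str(repeats)],
        capture_output=True, text=True, timeout=60,
    )
    return first_json_line(proc.stdout)


result = run_isolated(sys.executable, "hello")
print(result)
```

Running each library in a separate process avoids import clashes such as the two packages that both provide a `fasttext` module, at the cost of passing inputs through `argv` and results through stdout.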

pyproject.toml CHANGED

@@ -11,9 +11,8 @@ authors = [
 keywords = ["vietnamese", "nlp", "language-detection", "language-identification"]
 dependencies = [
     "underthesea>=9.2.9",
     "click>=8.0.0",
-    "fasttext-wheel>=0.9.2",
-    "numpy<2",
 ]
 [project.optional-dependencies]
@@ -22,6 +21,8 @@ dev = [
     "huggingface-hub>=0.20.0",
     "scikit-learn>=1.0.0",
     "datasets>=2.0.0",
 ]
 [project.urls]

 keywords = ["vietnamese", "nlp", "language-detection", "language-identification"]
 dependencies = [
     "underthesea>=9.2.9",
+    "underthesea_core>=3.3.0",
     "click>=8.0.0",
 ]
 [project.optional-dependencies]
     "huggingface-hub>=0.20.0",
     "scikit-learn>=1.0.0",
     "datasets>=2.0.0",
+    "fasttext-wheel>=0.9.2",
+    "numpy<2",
 ]
 [project.urls]