rain1024 and Claude Opus 4.6 committed on
Commit 9b203d0 · 1 Parent(s): 98317d0

Update to use underthesea_core for FastText inference

- Replace fasttext-wheel with underthesea_core>=3.3.0 as core dependency
- Move fasttext-wheel and numpy to dev dependencies (benchmark only)
- Rewrite benchmark to compare all 5 fasttext libraries
- Update TECHNICAL_REPORT.md with full benchmark results

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
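The rewritten benchmark measures each library with a warmup pass followed by median-of-repeats timing. The pattern in miniature, using only the standard library (the `median_latency_us` helper and the dummy workload are illustrative, not part of this commit):

```python
import time
import statistics

def median_latency_us(fn, arg, warmup=50, repeats=500):
    """Warm up, then report the median single-call latency in microseconds."""
    for _ in range(warmup):
        fn(arg)
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(arg)
        t1 = time.perf_counter()
        times.append(t1 - t0)
    # Median is robust to scheduler noise and one-off cache misses.
    return statistics.median(times) * 1e6

# Trivial workload standing in for model.predict(text, k=3):
lat = median_latency_us(lambda s: s.lower(), "Xin chào", warmup=5, repeats=51)
print(f"{lat:.2f} µs")
```

Median (rather than mean) per-sentence latency is what both the old and new benchmark scripts report, which keeps outliers from a single slow call out of the headline numbers.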

Files changed (3)
  1. TECHNICAL_REPORT.md +52 -27
  2. benchmark_fasttext.py +184 -138
  3. pyproject.toml +3 -2
TECHNICAL_REPORT.md CHANGED
@@ -10,8 +10,9 @@ Radar-1 is a language detection module for the [underthesea](https://github.com/
 |--------|-------|
 | Accuracy (97 test cases, 25+ languages) | 95.9% |
 | Prediction match vs C++ fasttext | 100% |
-| Inference speedup vs C++ fasttext | 2.15x (avg), 2.29x (batch) |
-| Batch throughput | 122,198 predictions/sec |
+| Batch throughput | 110,001 predictions/sec |
+| vs fasttext-predict (C++ stripped) | 2.14x faster |
+| vs fasttext-wheel (C++ full) | 2.42x faster |
 
 ## 2. Model
 
@@ -157,36 +158,62 @@ The 4 misclassified cases are inherent model errors (both Rust and Python produc
 
 ## 6. Performance
 
-### 6.1 Prediction Latency
+### 6.1 Library Comparison
 
-Median latency over 1,000 runs per sentence (top-3 prediction):
+Benchmarked against all major Python FastText libraries using `lid.176.ftz` on 16 multilingual sentences:
 
-| Input | Rust | C++ Python | Speedup |
-|-------|------|------------|---------|
-| "hello" (5 chars) | 4.0 us | 8.3 us | 2.05x |
-| Vietnamese, medium (27 chars) | 6.4 us | 14.2 us | 2.23x |
-| English, medium (44 chars) | 7.5 us | 16.9 us | 2.26x |
-| French, medium (49 chars) | 8.0 us | 16.6 us | 2.06x |
-| Japanese, medium (22 chars) | 6.1 us | 15.9 us | 2.61x |
-| Vietnamese, long (110 chars) | 18.7 us | 37.4 us | 2.00x |
-| **Average** | **8.5 us** | **18.2 us** | **2.15x** |
+| Library | Type | Load (ms) | Avg Latency (us) | Throughput (pred/s) | vs Rust |
+|---------|------|-----------|-------------------|---------------------|---------|
+| **underthesea_core** | Rust/PyO3 | 51.8 | 8.3 | **110,001** | **1.00x** |
+| fasttext-langdetect | C++ wrapper | 0.0* | 8.9 | 89,038 | 0.81x |
+| fast-langdetect | C++ wrapper | 0.0* | 14.6 | 57,399 | 0.52x |
+| fasttext-predict | C++ stripped | 29.9 | 17.1 | 51,493 | 0.47x |
+| fasttext-wheel | C++ full | 28.4 | 19.2 | 45,547 | 0.41x |
 
-### 6.2 Batch Throughput
+*\* Wrappers keep model loaded globally, so load = 0 after warmup.*
 
-Sustained throughput over 60,000 predictions:
+**Libraries tested:**
 
-| Implementation | Throughput |
-|----------------|-----------|
-| Rust (underthesea_core) | **122,198 pred/sec** |
-| C++ (fasttext-wheel) | 53,269 pred/sec |
-| **Speedup** | **2.29x** |
+| Package | Version | Description |
+|---------|---------|-------------|
+| [underthesea_core](https://pypi.org/project/underthesea-core/) | 3.3.0 | Pure Rust FastText inference (this project) |
+| [fasttext-predict](https://github.com/searxng/fasttext-predict) | 0.9.2.4 | C++ predict-only fork, no numpy, <1MB wheel |
+| [fasttext-wheel](https://pypi.org/project/fasttext-wheel/) | 0.9.2 | Full Facebook C++ fasttext with numpy/pybind11 |
+| [fast-langdetect](https://github.com/LlmKira/fast-langdetect) | 1.0.0 | Wrapper around fasttext-predict, bundles lid.176.ftz |
+| [fasttext-langdetect](https://github.com/zafercavdar/fasttext-langdetect) | 1.0.5 | Wrapper around full fasttext |
+
+### 6.2 Prediction Latency by Input
+
+Median latency over 500 runs per sentence (top-3 prediction):
+
+| Input | Rust (us) | C++ fasttext-predict (us) | Speedup |
+|-------|-----------|---------------------------|---------|
+| "hello" (5 chars) | 3.0 | 6.3 | 2.10x |
+| Vietnamese, medium (36 chars) | 6.2 | 13.0 | 2.10x |
+| English, medium (44 chars) | 6.7 | 14.7 | 2.19x |
+| French, medium (49 chars) | 7.2 | 15.2 | 2.11x |
+| Chinese (10 chars) | 4.6 | 10.7 | 2.33x |
+| Japanese (11 chars) | 4.0 | 8.3 | 2.08x |
+| Vietnamese, long (185 chars) | 29.1 | 59.3 | 2.04x |
+| **Average** | **8.3** | **17.1** | **2.06x** |
+
+### 6.3 Prediction Verification
+
+All implementations produce identical top-1 predictions on 16 test sentences:
+
+| | underthesea_core | fasttext-predict | fasttext-wheel |
+|-|-----------------|------------------|----------------|
+| Match rate | - | **16/16 (100%)** | **16/16 (100%)** |
+
+Note: `fast-langdetect` and `fasttext-langdetect` show 15/16 match because they default to the larger `lid.176.bin` model instead of `.ftz`.
 
-### 6.3 Model Loading
+### 6.4 Model Loading
 
 | Implementation | Load Time |
 |----------------|-----------|
-| Rust | 51.1 ms |
-| C++ | 25.6 ms |
+| fasttext-predict (C++) | 29.9 ms |
+| fasttext-wheel (C++) | 28.4 ms |
+| underthesea_core (Rust) | 51.8 ms |
 
 Model loading is slower in Rust due to element-by-element float parsing (vs C++ bulk `memcpy`). This is a one-time cost and does not affect prediction performance.
 
@@ -274,13 +301,11 @@ No additional crates were added for the FastText module.
 | Package | Version | Purpose |
 |---------|---------|---------|
 | underthesea | >= 9.2.9 | NLP ecosystem integration |
-| fasttext-wheel | 0.9.2 | Reference implementation (testing only) |
+| underthesea_core | >= 3.3.0 | Rust FastText inference |
 
 ## 10. Future Work
 
 - Batch prediction API (`predict_batch(texts: Vec<str>)`)
 - SIMD-accelerated dot products for further speedup
-- Support for `.bin` (dense, non-quantized) models
-- Softmax loss function models (currently only HS is tested)
-- Removal of debug methods before production release
 - Bulk `read_exact` for dense matrix loading to improve load time
+- Softmax loss function models (currently only HS is tested)
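The updated report's "vs Rust" ratios and headline speedups all derive from the batch-throughput column, so they can be cross-checked with plain arithmetic (numbers copied from the comparison table above):

```python
# Batch throughput in predictions/sec, from the library comparison table.
throughput = {
    "underthesea_core": 110_001,
    "fasttext-langdetect": 89_038,
    "fast-langdetect": 57_399,
    "fasttext-predict": 51_493,
    "fasttext-wheel": 45_547,
}

base = throughput["underthesea_core"]

# "vs Rust" column: each library's throughput relative to the Rust baseline.
for lib, t in throughput.items():
    print(f"{lib}: {t / base:.2f}x")

# Summary-table speedups: Rust baseline over the two C++ variants.
print(round(base / throughput["fasttext-predict"], 2))  # 2.14
print(round(base / throughput["fasttext-wheel"], 2))    # 2.42
```

The 2.14x and 2.42x figures in the summary table and the 0.81x/0.52x/0.47x/0.41x ratios in the comparison table are mutually consistent under this arithmetic.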
benchmark_fasttext.py CHANGED
@@ -1,197 +1,243 @@
 """
-Benchmark: underthesea_core.FastTextModel (Rust/PyO3) vs facebook fasttext (C++/pybind11)
+Benchmark: underthesea_core FastText (Rust/PyO3) vs all Python fasttext libraries.
 
-Compares: model loading time, single prediction latency, batch throughput, memory usage.
+Compares: model loading time, single prediction latency, batch throughput.
+
+Libraries tested:
+1. underthesea_core    - Pure Rust (PyO3), predict-only
+2. fasttext-predict    - C++ stripped predict-only, no numpy (<1MB)
+3. fasttext-wheel      - Full Facebook C++ fasttext
+4. fast-langdetect     - Wrapper around fasttext-predict, bundles lid.176.ftz
+5. fasttext-langdetect - Wrapper around full fasttext
 """
 
-import time
-import statistics
-import tracemalloc
+import subprocess
+import sys
+import json
+import os
 
 MODEL_PATH = "/tmp/lid.176.ftz"
 
-# ── Test sentences (varying length & language) ──────────────────────────────
 SENTENCES = [
-    # Vietnamese
     "Xin chào, tôi là sinh viên Việt Nam",
     "Hôm nay thời tiết rất đẹp, tôi muốn đi dạo công viên",
     "Việt Nam là một quốc gia nằm ở phía đông bán đảo Đông Dương thuộc khu vực Đông Nam Á",
-    # English
     "The quick brown fox jumps over the lazy dog",
     "Machine learning is a subset of artificial intelligence that focuses on building systems",
     "Natural language processing enables computers to understand human language",
-    # French
     "Bonjour le monde, comment allez-vous aujourd'hui",
     "La France est un pays dont la métropole se situe en Europe de l'Ouest",
-    # Chinese
     "今天天气很好我想出去走走",
     "机器学习是人工智能的一个重要分支",
-    # Japanese
     "今日はとても良い天気ですね",
     "自然言語処理は人工知能の重要な分野です",
-    # Short texts
     "hello",
     "xin chào",
     "bonjour",
-    # Longer text
     "Việt Nam, tên gọi chính thức là Cộng hòa Xã hội chủ nghĩa Việt Nam, "
     "là một quốc gia nằm ở cực Đông của bán đảo Đông Dương thuộc khu vực "
     "Đông Nam Á, giáp với Lào, Campuchia, Trung Quốc, biển Đông và vịnh Thái Lan.",
 ]
 
-K = 3  # top-k predictions
-WARMUP = 50
-REPEATS = 500
-
-
-def benchmark_load(load_fn, label, n=5):
-    """Measure model loading time."""
-    times = []
-    for _ in range(n):
-        t0 = time.perf_counter()
-        model = load_fn()
-        t1 = time.perf_counter()
-        times.append(t1 - t0)
-    med = statistics.median(times)
-    print(f" {label:30s} load = {med*1000:8.1f} ms (median of {n})")
-    return model
-
-
-def benchmark_predict(model, predict_fn, label):
-    """Measure single-call latency and throughput."""
-    # Warmup
-    for _ in range(WARMUP):
-        for s in SENTENCES:
-            predict_fn(model, s)
-
-    # Timed runs
-    per_sentence_us = []
-    for s in SENTENCES:
-        times = []
-        for _ in range(REPEATS):
-            t0 = time.perf_counter()
-            predict_fn(model, s)
-            t1 = time.perf_counter()
-            times.append(t1 - t0)
-        med = statistics.median(times)
-        per_sentence_us.append(med * 1e6)
-
-    avg_us = statistics.mean(per_sentence_us)
-    min_us = min(per_sentence_us)
-    max_us = max(per_sentence_us)
-    throughput = 1e6 / avg_us  # predictions/sec
-
-    print(f" {label:30s} avg = {avg_us:7.1f} µs "
-          f"min = {min_us:7.1f} µs max = {max_us:7.1f} µs "
-          f"({throughput:,.0f} pred/s)")
-    return per_sentence_us
-
-
-def benchmark_batch(model, predict_fn, label, n_calls=5000):
-    """Measure total throughput over many calls."""
-    # Warmup
-    for s in SENTENCES:
-        predict_fn(model, s)
-
-    t0 = time.perf_counter()
-    for _ in range(n_calls):
-        for s in SENTENCES:
-            predict_fn(model, s)
-    t1 = time.perf_counter()
-
-    total = n_calls * len(SENTENCES)
-    elapsed = t1 - t0
-    throughput = total / elapsed
-    print(f" {label:30s} {total:,} calls in {elapsed:.2f}s = {throughput:,.0f} pred/s")
-
-
-def benchmark_memory(load_fn, predict_fn, label):
-    """Measure peak memory for model loading + prediction."""
-    tracemalloc.start()
-    model = load_fn()
-    for s in SENTENCES:
-        predict_fn(model, s)
-    current, peak = tracemalloc.get_traced_memory()
-    tracemalloc.stop()
-    print(f" {label:30s} current = {current/1024/1024:.1f} MB peak = {peak/1024/1024:.1f} MB")
-
-
-def verify_results(rust_model, fb_model, rust_predict, fb_predict):
-    """Check that both models produce the same labels."""
-    print("\n── Result Verification ──")
-    matches = 0
-    total = len(SENTENCES)
-    for s in SENTENCES:
-        r_res = rust_predict(rust_model, s)
-        f_res = fb_predict(fb_model, s)
-        r_label = r_res[0][0] if r_res else "?"
-        f_label = f_res[0][0] if f_res else "?"
-        match = "✓" if r_label == f_label else "✗"
-        if r_label == f_label:
-            matches += 1
-
-        r_score = r_res[0][1] if r_res else 0
-        f_score = f_res[0][1] if f_res else 0
-        text_preview = s[:50] + ("..." if len(s) > 50 else "")
-        print(f" {match} {text_preview:55s} "
-              f"rust={r_label}({r_score:.4f}) fb={f_label}({f_score:.4f})")
-    print(f"\n Match rate: {matches}/{total}")
+# Runner script executed in each venv
+RUNNER_SCRIPT = r'''
+import time, statistics, json, sys, os
+
+MODEL_PATH = sys.argv[1]
+SENTENCES = json.loads(sys.argv[2])
+LIB_NAME = sys.argv[3]
+K = 3
+WARMUP = 50
+REPEATS = 500
+BATCH_CALLS = 5000
+
+def run():
+    # --- Load ---
+    if LIB_NAME == "underthesea_core":
+        from underthesea_core import FastText
+        def load(): return FastText.load(MODEL_PATH)
+        def predict(m, t): return m.predict(t, k=K)
+        def fmt(r): return r[0][0] if r else "?"
+
+    elif LIB_NAME == "fasttext-predict":
+        import fasttext
+        def load(): return fasttext.load_model(MODEL_PATH)
+        def predict(m, t): return m.predict(t, k=K)
+        def fmt(r): return r[0][0].replace("__label__","") if r[0] else "?"
+
+    elif LIB_NAME == "fasttext-wheel":
+        import fasttext
+        def load(): return fasttext.load_model(MODEL_PATH)
+        def predict(m, t): return m.predict(t, k=K)
+        def fmt(r): return r[0][0].replace("__label__","") if r[0] else "?"
+
+    elif LIB_NAME == "fast-langdetect":
+        from fast_langdetect import detect
+        # preload to avoid download during benchmark
+        detect("warmup")
+        def load(): return None
+        def predict(m, t): return detect(t)
+        def fmt(r): return r[0]["lang"] if isinstance(r, list) else r.get("lang","?")
+
+    elif LIB_NAME == "fasttext-langdetect":
+        from ftlangdetect import detect
+        detect("warmup")
+        def load(): return None
+        def predict(m, t): return detect(t)
+        def fmt(r): return r.get("lang","?")
+
+    # --- Benchmark Load ---
+    load_times = []
+    for _ in range(5):
+        t0 = time.perf_counter()
+        model = load()
+        t1 = time.perf_counter()
+        load_times.append(t1 - t0)
+    load_ms = statistics.median(load_times) * 1000
+
+    # --- Warmup ---
+    for _ in range(WARMUP):
+        for s in SENTENCES:
+            predict(model, s)
+
+    # --- Single prediction latency ---
+    per_sentence_us = []
+    for s in SENTENCES:
+        times = []
+        for _ in range(REPEATS):
+            t0 = time.perf_counter()
+            predict(model, s)
+            t1 = time.perf_counter()
+            times.append(t1 - t0)
+        per_sentence_us.append(statistics.median(times) * 1e6)
+
+    avg_us = statistics.mean(per_sentence_us)
+    throughput_single = 1e6 / avg_us
+
+    # --- Batch throughput ---
+    t0 = time.perf_counter()
+    for _ in range(BATCH_CALLS):
+        for s in SENTENCES:
+            predict(model, s)
+    t1 = time.perf_counter()
+    total = BATCH_CALLS * len(SENTENCES)
+    throughput_batch = total / (t1 - t0)
+
+    # --- Predictions for verification ---
+    preds = []
+    for s in SENTENCES:
+        r = predict(model, s)
+        preds.append(fmt(r))
+
+    result = {
+        "lib": LIB_NAME,
+        "load_ms": round(load_ms, 1),
+        "avg_us": round(avg_us, 1),
+        "min_us": round(min(per_sentence_us), 1),
+        "max_us": round(max(per_sentence_us), 1),
+        "throughput_single": int(throughput_single),
+        "throughput_batch": int(throughput_batch),
+        "preds": preds,
+    }
+    print(json.dumps(result))
+
+run()
+'''
+
+VENVS = {
+    "underthesea_core": "/tmp/venv_ftpredict/bin/python3",
+    "fasttext-predict": "/tmp/venv_ftpredict/bin/python3",
+    "fasttext-wheel": "/tmp/venv_ftwheel/bin/python3",
+    "fast-langdetect": "/tmp/venv_fastlang/bin/python3",
+    "fasttext-langdetect": "/tmp/venv_ftlangdetect/bin/python3",
+}
+
+
+def run_benchmark(lib_name, python_bin):
+    """Run benchmark in a subprocess with the correct venv."""
+    env = os.environ.copy()
+    env.pop("VIRTUAL_ENV", None)
+    result = subprocess.run(
+        [python_bin, "-c", RUNNER_SCRIPT, MODEL_PATH, json.dumps(SENTENCES), lib_name],
+        capture_output=True, text=True, timeout=600, env=env,
+    )
+    # Filter out non-JSON lines (warnings, download progress, etc.)
+    for line in result.stdout.strip().split("\n"):
+        line = line.strip()
+        if line.startswith("{"):
+            return json.loads(line)
+    print(f" ERROR ({lib_name}): {result.stderr[-500:]}", file=sys.stderr)
+    return None
 
 
 def main():
-    # ── Load libraries ──
-    print("Loading libraries...")
-    import fasttext
-    from underthesea_core import FastTextModel
-
-    # ── Define load/predict callables ──
-    def rust_load():
-        return FastTextModel.load(MODEL_PATH)
-
-    def fb_load():
-        return fasttext.load_model(MODEL_PATH)
-
-    def rust_predict(m, text):
-        return m.predict(text, k=K)
-
-    def fb_predict(m, text):
-        labels, scores = m.predict(text, k=K)
-        return [(l.replace("__label__", ""), s) for l, s in zip(labels, scores)]
-
-    # ── Model Loading ──
-    print("\n── Model Loading Time ──")
-    rust_model = benchmark_load(rust_load, "underthesea_core (Rust)")
-    fb_model = benchmark_load(fb_load, "fasttext (Facebook C++)")
-
-    # ── Verify correctness first ──
-    verify_results(rust_model, fb_model, rust_predict, fb_predict)
-
-    # ── Single Prediction Latency ──
-    print("\n── Single Prediction Latency ──")
-    rust_times = benchmark_predict(rust_model, rust_predict, "underthesea_core (Rust)")
-    fb_times = benchmark_predict(fb_model, fb_predict, "fasttext (Facebook C++)")
-
-    # ── Per-sentence comparison ──
-    print("\n── Per-Sentence Speedup (Rust vs Facebook) ──")
-    for i, s in enumerate(SENTENCES):
-        ratio = fb_times[i] / rust_times[i] if rust_times[i] > 0 else float('inf')
-        text_preview = s[:50] + ("..." if len(s) > 50 else "")
-        print(f" {text_preview:55s} {ratio:.2f}x")
-
-    avg_speedup = statistics.mean(fb_times) / statistics.mean(rust_times)
-    print(f"\n Average speedup: {avg_speedup:.2f}x")
-
-    # ── Batch Throughput ──
-    print("\n── Batch Throughput ──")
-    benchmark_batch(rust_model, rust_predict, "underthesea_core (Rust)")
-    benchmark_batch(fb_model, fb_predict, "fasttext (Facebook C++)")
-
-    # ── Memory Usage ──
-    print("\n── Memory Usage (Python-side tracemalloc) ──")
-    benchmark_memory(rust_load, rust_predict, "underthesea_core (Rust)")
-    benchmark_memory(fb_load, fb_predict, "fasttext (Facebook C++)")
-
-    print("\nDone.")
+    print("=" * 80)
+    print("FastText Library Benchmark")
+    print("=" * 80)
+    print(f"Model: {MODEL_PATH}")
+    print(f"Sentences: {len(SENTENCES)}")
+    print()
+
+    results = []
+    for lib_name, python_bin in VENVS.items():
+        if not os.path.exists(python_bin):
+            print(f" SKIP {lib_name}: venv not found at {python_bin}")
+            continue
+        print(f" Benchmarking {lib_name}...", end="", flush=True)
+        r = run_benchmark(lib_name, python_bin)
+        if r:
+            print(f" done ({r['throughput_batch']:,} pred/s)")
+            results.append(r)
+        else:
+            print(" FAILED")
+
+    if not results:
+        print("No results!")
+        return
+
+    # --- Results Table ---
+    print()
+    print("=" * 80)
+    print(f"{'Library':<22s} {'Load':>8s} {'Avg':>8s} {'Min':>8s} {'Max':>8s} {'Throughput':>12s}")
+    print(f"{'':<22s} {'(ms)':>8s} {'(µs)':>8s} {'(µs)':>8s} {'(µs)':>8s} {'(pred/s)':>12s}")
+    print("-" * 80)
+
+    baseline = results[0]["throughput_batch"]
+    for r in results:
+        ratio = r["throughput_batch"] / baseline if baseline else 0
+        mark = "" if r["lib"] == results[0]["lib"] else f" ({ratio:.2f}x)"
+        print(f" {r['lib']:<20s} {r['load_ms']:>8.1f} {r['avg_us']:>8.1f} "
+              f"{r['min_us']:>8.1f} {r['max_us']:>8.1f} {r['throughput_batch']:>10,}{mark}")
+
+    # --- Prediction Verification ---
+    print()
+    print("=" * 80)
+    print("Prediction Verification (top-1 label)")
+    print("-" * 80)
+    ref = results[0]
+    header = f" {'Text':<50s}"
+    for r in results:
+        header += f" {r['lib'][:10]:>10s}"
+    print(header)
+    print(" " + "-" * (50 + 11 * len(results)))
+    for i, s in enumerate(SENTENCES):
+        preview = s[:48] + ".." if len(s) > 48 else s
+        row = f" {preview:<50s}"
+        for r in results:
+            pred = r["preds"][i]
+            match = "" if pred == ref["preds"][i] else "*"
+            row += f" {pred+match:>10s}"
+        print(row)
+
+    # --- Match rate ---
+    print()
+    for r in results[1:]:
+        matches = sum(1 for i in range(len(SENTENCES)) if r["preds"][i] == ref["preds"][i])
+        print(f" {r['lib']} vs {ref['lib']}: {matches}/{len(SENTENCES)} match")
+
+    print()
+    print("Done.")
 
 
 if __name__ == "__main__":
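The key design change in the rewritten benchmark is isolating each library in its own venv: the parent process shells out to a per-venv interpreter and recovers results by parsing a single JSON line from stdout, skipping any warning or download-progress noise. The pattern in miniature, using the current interpreter (`sys.executable`) in place of a venv path:

```python
import json
import subprocess
import sys

# Child script prints a noise line (as a real library might) plus one JSON result line.
child = (
    'import json; '
    'print("warning: some noise"); '
    'print(json.dumps({"lib": "demo", "ok": True}))'
)

proc = subprocess.run(
    [sys.executable, "-c", child],
    capture_output=True, text=True, timeout=60,
)

# Keep only the JSON line, as run_benchmark() does, ignoring everything else.
result = None
for line in proc.stdout.strip().splitlines():
    if line.strip().startswith("{"):
        result = json.loads(line)
        break

print(result)  # {'lib': 'demo', 'ok': True}
```

Running each library in a separate process also keeps import-time side effects (numpy, model downloads, global caches) from contaminating the other libraries' timings.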
pyproject.toml CHANGED
@@ -11,9 +11,8 @@ authors = [
 keywords = ["vietnamese", "nlp", "language-detection", "language-identification"]
 dependencies = [
     "underthesea>=9.2.9",
+    "underthesea_core>=3.3.0",
     "click>=8.0.0",
-    "fasttext-wheel>=0.9.2",
-    "numpy<2",
 ]
 
 [project.optional-dependencies]
@@ -22,6 +21,8 @@ dev = [
     "huggingface-hub>=0.20.0",
     "scikit-learn>=1.0.0",
     "datasets>=2.0.0",
+    "fasttext-wheel>=0.9.2",
+    "numpy<2",
 ]
 
 [project.urls]
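The `underthesea_core>=3.3.0` pin added above is a version floor: any release at or above 3.3.0 satisfies it, compared numerically per dotted component. A simplified sketch of that comparison (the `meets_floor` helper is illustrative only; real installers implement full PEP 440, including pre/post/dev suffixes):

```python
def meets_floor(installed: str, floor: str) -> bool:
    """Simplified '>=' check: compare dotted release numbers as integer tuples.

    Ignores pre/post/dev suffixes, unlike a full PEP 440 comparison.
    """
    to_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return to_tuple(installed) >= to_tuple(floor)

print(meets_floor("3.3.0", "3.3.0"))   # True
print(meets_floor("3.10.1", "3.3.0"))  # True  (numeric, not lexicographic)
print(meets_floor("3.2.9", "3.3.0"))   # False
```

The numeric comparison matters for pins like `numpy<2` in the dev extras as well: "1.26.4" satisfies it even though it sorts after "2" lexicographically.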