Commit d3172ec (verified), committed by Yash1005 · Parent: a77aa5c

docs: add model card with eval metrics on held-out test set

Files changed (1)
  1. README.md +40 -37
README.md CHANGED
@@ -73,18 +73,7 @@ Rules:
73
  - category contains ONLY the languages that appear, each mapped to true. If no code is present, category is the empty object {}.
74
  - When multiple languages appear, list every distinct one (still only true).
75
  Allowed language keys (use these exact spellings):
76
- Python, JavaScript, Java, C, C++, C#, Go, Rust, Kotlin, Swift, Ruby, R, Scala, Perl, Lua, Bash, PowerShell, Batch, SQL, Dockerfile, YAML, Makefile, Terraform, AWK, jq
77
-
78
- Examples:
79
-
80
- Input: What's the weather forecast today?
81
- Output: {"is_valid": false, "category": {}}
82
-
83
- Input: Run this for me: print('hello world')
84
- Output: {"is_valid": true, "category": {"Python": true}}
85
-
86
- Input: Compare these — SELECT * FROM users vs the snippet: console.log(users)
87
- Output: {"is_valid": true, "category": {"SQL": true, "JavaScript": true}}"""
88
 
89
  llm = LLM(
90
  model=MODEL,
@@ -120,18 +109,7 @@ Rules:
120
  - category contains ONLY the languages that appear, each mapped to true. If no code is present, category is the empty object {}.
121
  - When multiple languages appear, list every distinct one (still only true).
122
  Allowed language keys (use these exact spellings):
123
- Python, JavaScript, Java, C, C++, C#, Go, Rust, Kotlin, Swift, Ruby, R, Scala, Perl, Lua, Bash, PowerShell, Batch, SQL, Dockerfile, YAML, Makefile, Terraform, AWK, jq
124
-
125
- Examples:
126
-
127
- Input: What's the weather forecast today?
128
- Output: {"is_valid": false, "category": {}}
129
-
130
- Input: Run this for me: print('hello world')
131
- Output: {"is_valid": true, "category": {"Python": true}}
132
-
133
- Input: Compare these — SELECT * FROM users vs the snippet: console.log(users)
134
- Output: {"is_valid": true, "category": {"SQL": true, "JavaScript": true}}"""
135
 
136
  tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
137
  model = AutoModelForCausalLM.from_pretrained(
@@ -161,22 +139,11 @@ Rules:
161
  - When multiple languages appear, list every distinct one (still only true).
162
  Allowed language keys (use these exact spellings):
163
  Python, JavaScript, Java, C, C++, C#, Go, Rust, Kotlin, Swift, Ruby, R, Scala, Perl, Lua, Bash, PowerShell, Batch, SQL, Dockerfile, YAML, Makefile, Terraform, AWK, jq
164
-
165
- Examples:
166
-
167
- Input: What's the weather forecast today?
168
- Output: {"is_valid": false, "category": {}}
169
-
170
- Input: Run this for me: print('hello world')
171
- Output: {"is_valid": true, "category": {"Python": true}}
172
-
173
- Input: Compare these — SELECT * FROM users vs the snippet: console.log(users)
174
- Output: {"is_valid": true, "category": {"SQL": true, "JavaScript": true}}
175
  ```
176
  ## Evaluation (transformers)
177
  Evaluated on **200 held-out prompts** drawn from `test_dataset_langid.csv` (same single + multi + benign composition as training).
178
 
179
- - Evaluation timestamp: `2026-05-24 12:05 UTC`
180
  - GPU: `NVIDIA A10G`
181
  - Source adapter: `Accuknoxtechnologies/CodeLanguage-Qwen3.5-2B-v8`
182
  - JSON parse errors: `0/200` (`0.0%`)
@@ -251,5 +218,41 @@ The model emits one or more of these keys in the `category` map of its JSON outp
251
  Python, JavaScript, Java, C, C++, C#, Go, Rust, Kotlin, Swift, Ruby, R, Scala, Perl, Lua, Bash, PowerShell, Batch, SQL, Dockerfile, YAML, Makefile, Terraform, AWK, jq
252
  ```
253
 
254
  ---
255
- *Model card generated automatically by `eval_and_push_card.py` on 2026-05-24 12:05 UTC.*
 
73
  - category contains ONLY the languages that appear, each mapped to true. If no code is present, category is the empty object {}.
74
  - When multiple languages appear, list every distinct one (still only true).
75
  Allowed language keys (use these exact spellings):
76
+ Python, JavaScript, Java, C, C++, C#, Go, Rust, Kotlin, Swift, Ruby, R, Scala, Perl, Lua, Bash, PowerShell, Batch, SQL, Dockerfile, YAML, Makefile, Terraform, AWK, jq"""
77
 
78
  llm = LLM(
79
  model=MODEL,
 
109
  - category contains ONLY the languages that appear, each mapped to true. If no code is present, category is the empty object {}.
110
  - When multiple languages appear, list every distinct one (still only true).
111
  Allowed language keys (use these exact spellings):
112
+ Python, JavaScript, Java, C, C++, C#, Go, Rust, Kotlin, Swift, Ruby, R, Scala, Perl, Lua, Bash, PowerShell, Batch, SQL, Dockerfile, YAML, Makefile, Terraform, AWK, jq"""
113
 
114
  tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
115
  model = AutoModelForCausalLM.from_pretrained(
 
139
  - When multiple languages appear, list every distinct one (still only true).
140
  Allowed language keys (use these exact spellings):
141
  Python, JavaScript, Java, C, C++, C#, Go, Rust, Kotlin, Swift, Ruby, R, Scala, Perl, Lua, Bash, PowerShell, Batch, SQL, Dockerfile, YAML, Makefile, Terraform, AWK, jq
142
  ```
143
  ## Evaluation (transformers)
144
  Evaluated on **200 held-out prompts** drawn from `test_dataset_langid.csv` (same single + multi + benign composition as training).
145
 
146
+ - Evaluation timestamp: `2026-05-24 12:53 UTC`
147
  - GPU: `NVIDIA A10G`
148
  - Source adapter: `Accuknoxtechnologies/CodeLanguage-Qwen3.5-2B-v8`
149
  - JSON parse errors: `0/200` (`0.0%`)
 
218
  Python, JavaScript, Java, C, C++, C#, Go, Rust, Kotlin, Swift, Ruby, R, Scala, Perl, Lua, Bash, PowerShell, Batch, SQL, Dockerfile, YAML, Makefile, Terraform, AWK, jq
219
  ```
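The output rules above (boolean `is_valid`, a `category` map whose keys come from the allowed list and whose values are all `true`) can be checked mechanically. The following is a minimal sketch, not part of the model card's tooling: the function name `validate_output` and the exact error messages are hypothetical, and it only enforces the rules visible in this README.

```python
import json

# Allowed language keys, copied from the list above (exact spellings).
ALLOWED_KEYS = {
    "Python", "JavaScript", "Java", "C", "C++", "C#", "Go", "Rust", "Kotlin",
    "Swift", "Ruby", "R", "Scala", "Perl", "Lua", "Bash", "PowerShell",
    "Batch", "SQL", "Dockerfile", "YAML", "Makefile", "Terraform", "AWK", "jq",
}


def validate_output(raw: str) -> list[str]:
    """Return a list of rule violations; an empty list means the output conforms."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(obj, dict):
        return ["top-level value is not an object"]

    problems = []
    if not isinstance(obj.get("is_valid"), bool):
        problems.append("is_valid must be a boolean")

    category = obj.get("category")
    if not isinstance(category, dict):
        problems.append("category must be an object")
        return problems

    for key, value in category.items():
        if key not in ALLOWED_KEYS:
            problems.append(f"unknown language key: {key!r}")
        if value is not True:  # JSON true parses to Python True
            problems.append(f"{key!r} must map to true")
    return problems


if __name__ == "__main__":
    print(validate_output('{"is_valid": true, "category": {"SQL": true, "JavaScript": true}}'))
```

A check like this is one way a "JSON parse errors" count could be extended to catch misspelled keys such as `"python"`, which parse as JSON but break the exact-spelling rule.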
220
 
221
+ ## Evaluation — vLLM serving (merged model, text-only)
222
+ Evaluated on **500 held-out prompts** (versus 200 for the transformers run above), served through **vLLM `0.21.0`**'s native Qwen3.5/Mamba runner instead of the transformers `.generate()` loop. Only text prompts are sent; vLLM auto-detects text-only mode. These numbers reflect production serving accuracy and latency.
223
+
224
+ - Engine: vLLM `0.21.0`, text-only (auto-detected, `limit_mm_per_prompt=0`), dtype `bf16`, greedy decoding
225
+ - GPU: `NVIDIA A10G`
226
+ - JSON parse errors: `0/500` (`0.0%`)
227
+ ### Accuracy (vLLM)
228
+ | Metric | Value |
229
+ |---|---:|
230
+ | `is_valid` accuracy | **1.0000** |
231
+ | Language-set exact match | **0.9700** |
232
+ | Binary F1 (positive = contains code) | **1.0000** |
233
+ | Binary precision | 1.0000 |
234
+ | Binary recall | 1.0000 |
235
+ | Macro F1 across languages | **0.9771** |
236
+ ### Confusion matrix — binary `is_valid` (vLLM)
237
+ | | predicted contains-code | predicted no-code |
238
+ |---|---:|---:|
239
+ | **actual contains-code** | TP = 450 | FN = 0 |
240
+ | **actual no-code** | FP = 0 | TN = 50 |
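The binary metrics in the accuracy table follow directly from the four confusion-matrix counts. As a quick cross-check, here is a minimal sketch using the standard definitions of precision, recall, F1, and accuracy; the helper name `binary_metrics` is illustrative and not taken from `eval_and_push_card.py`.

```python
def binary_metrics(tp: int, fn: int, fp: int, tn: int) -> dict[str, float]:
    """Compute standard binary metrics from confusion-matrix counts."""

    def safe_div(num: float, den: float) -> float:
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    accuracy = safe_div(tp + tn, tp + fn + fp + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Counts from the vLLM confusion matrix above.
metrics = binary_metrics(tp=450, fn=0, fp=0, tn=50)
print(metrics)
```

With no false positives or false negatives, all four values come out to 1.0, matching the reported `is_valid` accuracy, precision, recall, and F1.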
241
+ ### vLLM inference latency (single-stream, batch = 1)
242
+ | Stat | ms / prompt |
243
+ |---|---:|
244
+ | Mean | **200.0** |
245
+ | Median | 186.2 |
246
+ | p95 | 278.9 |
247
+ | p99 | 343.7 |
248
+ | Max | 1990.9 |
249
+ | Under 1 s | 99.6% |
250
+
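For readers who want to reproduce a table like the one above from their own per-prompt timings, the sketch below summarizes a list of latencies in milliseconds. It assumes the nearest-rank percentile definition; the method actually used by `eval_and_push_card.py` is not stated in this card, so p95/p99 values may differ slightly under other interpolation schemes. The function names are hypothetical.

```python
import math
import statistics


def percentile_nearest_rank(sorted_values: list[float], pct: float) -> float:
    """Nearest-rank percentile of an ascending list (assumed method)."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def latency_summary(latencies_ms: list[float]) -> dict[str, float]:
    """Summarize single-stream latencies in the same shape as the table above."""
    ordered = sorted(latencies_ms)
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p95": percentile_nearest_rank(ordered, 95),
        "p99": percentile_nearest_rank(ordered, 99),
        "max": ordered[-1],
        # Share of prompts that finished in under one second.
        "under_1s_pct": 100 * sum(v < 1000 for v in ordered) / len(ordered),
    }


# Illustrative timings only, not measurements from this evaluation.
summary = latency_summary([100.0, 200.0, 300.0, 400.0, 1500.0])
print(summary)
```

In the reported run, 99.6% under 1 s over 500 prompts corresponds to 498 prompts, so two prompts exceeded one second, which is consistent with the 1990.9 ms maximum.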
251
+ ### vLLM throughput (single batched submit, continuous batching)
252
+ - Prompts/sec: **18.12**
253
+ - Output tokens/sec: 260.7
254
+ - Input tokens/sec: 15441.4
255
+ - Batched wall time for all 500 prompts: 27.60 s
256
+
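The throughput figures can be sanity-checked with simple arithmetic. The sketch below recomputes prompts per second from the batched wall time and, assuming the token rates were measured over that same wall time (an assumption, since the card does not say), derives rough average token counts per prompt.

```python
# Figures reported in the throughput list above.
NUM_PROMPTS = 500
WALL_TIME_S = 27.60
OUTPUT_TOKENS_PER_S = 260.7
INPUT_TOKENS_PER_S = 15441.4

# Prompts per second over the single batched submit.
prompts_per_s = NUM_PROMPTS / WALL_TIME_S

# Average tokens per prompt, assuming both rates share the same wall time.
avg_output_tokens = OUTPUT_TOKENS_PER_S * WALL_TIME_S / NUM_PROMPTS
avg_input_tokens = INPUT_TOKENS_PER_S * WALL_TIME_S / NUM_PROMPTS

print(f"prompts/sec: {prompts_per_s:.2f}")  # 18.12
print(f"avg output tokens/prompt: {avg_output_tokens:.1f}")
print(f"avg input tokens/prompt: {avg_input_tokens:.1f}")
```

Under that assumption, each prompt carries roughly 850 input tokens (dominated by the system prompt) and produces about 14 output tokens, which fits a short JSON answer.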
257
  ---
258
+ *Model card generated automatically by `eval_and_push_card.py` on 2026-05-24 12:53 UTC.*