Yash1005 committed · Commit a77aa5c · verified · 1 Parent(s): 5fa661f

docs: add model card with eval metrics on held-out test set

Files changed (1): README.md +255 -0
README.md ADDED
@@ -0,0 +1,255 @@
---
license: apache-2.0
base_model: Qwen/Qwen3.5-2B
library_name: transformers
pipeline_tag: text-generation
language:
- en
tags:
- qwen
- guardrails
- code-detection
- language-identification
- multi-label-classification
- merged
- vllm
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: CodeLanguage-Qwen3.5-2B-v8
  results:
  - task:
      type: text-classification
      name: Multi-label Programming Language Identification
    dataset:
      name: LangID Guard Held-out Test Set
      type: custom
    metrics:
    - type: accuracy
      name: is_valid accuracy
      value: 1.0000
    - type: accuracy
      name: language-set exact match
      value: 0.9650
    - type: f1
      name: binary F1 (positive=contains code)
      value: 1.0000
    - type: f1
      name: macro F1 over languages
      value: 0.9701
    - type: precision
      name: binary precision (positive=contains code)
      value: 1.0000
    - type: recall
      name: binary recall (positive=contains code)
      value: 1.0000
---
# CodeLanguage-Qwen3.5-2B-v8

**Merged full model** (base `Qwen/Qwen3.5-2B` + LoRA adapter, merged via PEFT's `merge_and_unload()`) that identifies which programming languages are embedded in a user prompt, covering **25 languages and configuration formats**. This is a self-contained checkpoint: load it directly (no PEFT step) and serve it on **vLLM** (v0.21.0+). It was trained on a combined dataset of Rosetta Code snippets and curated config-language samples (Dockerfile, YAML, Terraform, Makefile, SQL).

The model is fine-tuned to emit a strict JSON object describing the languages found:

```json
{"is_valid": true, "category": {"Python": true, "Bash": true}}
```

`is_valid` is `true` when at least one code/config snippet is present and `false` for natural-language-only prompts. `category` contains only the detected languages, each mapped to `true`; if no code is present, `category` is `{}`.
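
Because downstream guardrail code usually relies on this schema, it can help to check a parsed verdict before acting on it. The following is a minimal sketch, not part of the released tooling: `validate_verdict` and `ALLOWED_LANGUAGES` are hypothetical names, and the rule that a `false` verdict must carry an empty `category` is inferred from the description above.

```python
# Hypothetical helper: the 25 keys are copied from the model card's language list.
ALLOWED_LANGUAGES = frozenset([
    "Python", "JavaScript", "Java", "C", "C++", "C#", "Go", "Rust", "Kotlin",
    "Swift", "Ruby", "R", "Scala", "Perl", "Lua", "Bash", "PowerShell",
    "Batch", "SQL", "Dockerfile", "YAML", "Makefile", "Terraform", "AWK", "jq",
])


def validate_verdict(verdict) -> list[str]:
    """Return a list of schema problems; an empty list means the verdict looks well-formed."""
    if not isinstance(verdict, dict) or set(verdict) != {"is_valid", "category"}:
        return ["expected an object with exactly the keys 'is_valid' and 'category'"]
    problems = []
    is_valid, category = verdict["is_valid"], verdict["category"]
    if not isinstance(is_valid, bool):
        problems.append("'is_valid' must be a JSON boolean")
    if not isinstance(category, dict):
        problems.append("'category' must be a JSON object")
        return problems
    for lang, flag in category.items():
        if lang not in ALLOWED_LANGUAGES:
            problems.append(f"unknown language key: {lang!r}")
        if flag is not True:
            problems.append(f"{lang!r} must map to true")
    # Assumption drawn from the description: no code means an empty category.
    if is_valid is False and category:
        problems.append("'category' must be {} when 'is_valid' is false")
    return problems


print(validate_verdict({"is_valid": True, "category": {"Python": True, "Bash": True}}))
```

A caller might treat any non-empty result as a reason to fall back to a conservative default rather than trusting the verdict.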

## Quick start

> **Text-only model.** The base `Qwen/Qwen3.5-2B` declares the multimodal `Qwen3_5ForConditionalGeneration` architecture (it carries a vision tower in its weights), but this checkpoint is a **text-in / text-out** language guard: it never consumes images and only emits the JSON verdict. Send only text prompts; vLLM auto-detects text-only mode and prints `All limits of multimodal modalities ... set to 0, running in text-only mode` at startup. (`language_model_only=True` would in theory skip loading the vision-tower weights, but on vLLM v0.21.0 it crashes `Qwen3_5ForCausalLM.__init__` with a `vision_config` attribute error, so leave it off until a later vLLM release fixes that path.)

### vLLM (recommended; needs vLLM >= 0.21.0 for the Qwen3.5/Mamba runner)
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
import json, re

MODEL = "Accuknoxtechnologies/CodeLanguage-Qwen3.5-2B-v8"
# Must match the training-time system prompt verbatim (see "System prompt").
SYSTEM_MSG = """You are a code language identifier. For the given user prompt, decide whether it contains any embedded source code (program source or recognizable code-like configuration). Output exactly one JSON object and nothing else: {"is_valid": <true|false>, "category": {"<Lang>": true, ...}}.
No preamble. No explanation. No <think> tags. No markdown code fences. No trailing prose.
Rules:
- is_valid is TRUE when the prompt contains at least one code/config snippet, FALSE when the prompt is plain natural-language only.
- category contains ONLY the languages that appear, each mapped to true. If no code is present, category is the empty object {}.
- When multiple languages appear, list every distinct one (still only true).
Allowed language keys (use these exact spellings):
Python, JavaScript, Java, C, C++, C#, Go, Rust, Kotlin, Swift, Ruby, R, Scala, Perl, Lua, Bash, PowerShell, Batch, SQL, Dockerfile, YAML, Makefile, Terraform, AWK, jq

Examples:

Input: What's the weather forecast today?
Output: {"is_valid": false, "category": {}}

Input: Run this for me: print('hello world')
Output: {"is_valid": true, "category": {"Python": true}}

Input: Compare these — SELECT * FROM users vs the snippet: console.log(users)
Output: {"is_valid": true, "category": {"SQL": true, "JavaScript": true}}"""

llm = LLM(
    model=MODEL,
    trust_remote_code=True,
    dtype="bfloat16",
    max_model_len=4096,
    # vLLM auto-detects text-only when no multimodal inputs are sent.
    # Do NOT pass language_model_only=True here; see the note above.
)
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
# Greedy decoding; the JSON verdict is short, so 220 tokens is ample.
sampling = SamplingParams(temperature=0.0, max_tokens=220, stop=["\n\n\n"])

def langid(prompt: str) -> dict:
    # Render the chat template to a string with thinking disabled.
    chat = tokenizer.apply_chat_template(
        [{"role": "system", "content": SYSTEM_MSG},
         {"role": "user", "content": prompt}],
        tokenize=False, add_generation_prompt=True, enable_thinking=False)
    out = llm.generate([chat], sampling)
    text = out[0].outputs[0].text
    # Pull the JSON object out of the completion and parse it.
    match = re.search(r'\{.*\}', text, re.DOTALL)
    if match is None:
        raise ValueError(f"no JSON object in model output: {text!r}")
    return json.loads(match.group(0))
```

### Plain transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, json, re

MODEL = "Accuknoxtechnologies/CodeLanguage-Qwen3.5-2B-v8"
# Must match the training-time system prompt verbatim (see "System prompt").
SYSTEM_MSG = """You are a code language identifier. For the given user prompt, decide whether it contains any embedded source code (program source or recognizable code-like configuration). Output exactly one JSON object and nothing else: {"is_valid": <true|false>, "category": {"<Lang>": true, ...}}.
No preamble. No explanation. No <think> tags. No markdown code fences. No trailing prose.
Rules:
- is_valid is TRUE when the prompt contains at least one code/config snippet, FALSE when the prompt is plain natural-language only.
- category contains ONLY the languages that appear, each mapped to true. If no code is present, category is the empty object {}.
- When multiple languages appear, list every distinct one (still only true).
Allowed language keys (use these exact spellings):
Python, JavaScript, Java, C, C++, C#, Go, Rust, Kotlin, Swift, Ruby, R, Scala, Perl, Lua, Bash, PowerShell, Batch, SQL, Dockerfile, YAML, Makefile, Terraform, AWK, jq

Examples:

Input: What's the weather forecast today?
Output: {"is_valid": false, "category": {}}

Input: Run this for me: print('hello world')
Output: {"is_valid": true, "category": {"Python": true}}

Input: Compare these — SELECT * FROM users vs the snippet: console.log(users)
Output: {"is_valid": true, "category": {"SQL": true, "JavaScript": true}}"""

tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True,
).eval()

def langid(prompt: str) -> dict:
    # Render the chat template to a string with thinking disabled.
    chat = tokenizer.apply_chat_template(
        [{"role": "system", "content": SYSTEM_MSG},
         {"role": "user", "content": prompt}],
        tokenize=False, add_generation_prompt=True, enable_thinking=False)
    inputs = tokenizer(chat, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=220, do_sample=False)
    # Decode only the newly generated tokens, not the prompt.
    text = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    match = re.search(r'\{.*\}', text, re.DOTALL)
    if match is None:
        raise ValueError(f"no JSON object in model output: {text!r}")
    return json.loads(match.group(0))
```
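
The `langid` helpers above extract the verdict with a greedy `\{.*\}` pattern, which would over-match if the completion ever contained a stray `}` after the JSON object. As a hedged alternative, the sketch below uses the standard library's `json.JSONDecoder.raw_decode` to parse the first complete object and ignore whatever follows; `parse_verdict` is a hypothetical name, and returning `None` on failure is an illustrative choice rather than the card's behavior.

```python
import json
from typing import Optional


def parse_verdict(text: str) -> Optional[dict]:
    """Return the first JSON object found in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # tolerates trailing characters after it.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = text.find("{", start + 1)
    return None


raw = '{"is_valid": true, "category": {"SQL": true}} trailing } text'
print(parse_verdict(raw))
```

The evaluation below reports zero parse errors on the held-out set, so this is a defensive measure for unseen inputs rather than a fix for an observed failure.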

## System prompt

The model was trained with the exact system prompt below. Pass it verbatim at inference time, because the output schema depends on this prompt.

```text
You are a code language identifier. For the given user prompt, decide whether it contains any embedded source code (program source or recognizable code-like configuration). Output exactly one JSON object and nothing else: {"is_valid": <true|false>, "category": {"<Lang>": true, ...}}.
No preamble. No explanation. No <think> tags. No markdown code fences. No trailing prose.
Rules:
- is_valid is TRUE when the prompt contains at least one code/config snippet, FALSE when the prompt is plain natural-language only.
- category contains ONLY the languages that appear, each mapped to true. If no code is present, category is the empty object {}.
- When multiple languages appear, list every distinct one (still only true).
Allowed language keys (use these exact spellings):
Python, JavaScript, Java, C, C++, C#, Go, Rust, Kotlin, Swift, Ruby, R, Scala, Perl, Lua, Bash, PowerShell, Batch, SQL, Dockerfile, YAML, Makefile, Terraform, AWK, jq

Examples:

Input: What's the weather forecast today?
Output: {"is_valid": false, "category": {}}

Input: Run this for me: print('hello world')
Output: {"is_valid": true, "category": {"Python": true}}

Input: Compare these — SELECT * FROM users vs the snippet: console.log(users)
Output: {"is_valid": true, "category": {"SQL": true, "JavaScript": true}}
```

## Evaluation (transformers)

Evaluated on **200 held-out prompts** drawn from `test_dataset_langid.csv` (the same mix of single-language, multi-language, and benign prompts as the training set).

- Evaluation timestamp: `2026-05-24 12:05 UTC`
- GPU: `NVIDIA A10G`
- Source adapter: `Accuknoxtechnologies/CodeLanguage-Qwen3.5-2B-v8`
- JSON parse errors: `0/200` (`0.0%`)

### Top-level metrics

| Metric | Value |
|---|---:|
| `is_valid` accuracy | **1.0000** |
| Language-set exact match | **0.9650** |
| Binary F1 (positive = contains code) | **1.0000** |
| Binary precision | 1.0000 |
| Binary recall | 1.0000 |
| Macro F1 across languages | **0.9701** |

### Confusion matrix: binary `is_valid` decision

Positive class = the prompt **contains code** (`is_valid=True`).

| | predicted contains-code | predicted no-code |
|---|---:|---:|
| **actual contains-code** | TP = 181 | FN = 0 |
| **actual no-code** | FP = 0 | TN = 19 |
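
The binary figures in the top-level metrics follow directly from these four counts. As a quick sanity check, this sketch recomputes them with the standard definitions; `binary_metrics` is an illustrative helper name, not a function from the evaluation script.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 for the positive (contains-code) class."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Counts taken from the confusion matrix above.
metrics = binary_metrics(tp=181, fp=0, fn=0, tn=19)
print(metrics)  # every value is 1.0
```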

### Per-language metrics

Only languages that appear in either the actual or predicted labels are listed.

| Language | support | precision | recall | F1 |
|---|---:|---:|---:|---:|
| `Python` | 14 | 1.000 | 1.000 | 1.000 |
| `Terraform` | 14 | 1.000 | 1.000 | 1.000 |
| `Java` | 12 | 1.000 | 0.917 | 0.957 |
| `C` | 12 | 0.857 | 1.000 | 0.923 |
| `Rust` | 12 | 1.000 | 1.000 | 1.000 |
| `AWK` | 12 | 1.000 | 0.917 | 0.957 |
| `Ruby` | 11 | 1.000 | 1.000 | 1.000 |
| `R` | 11 | 0.846 | 1.000 | 0.917 |
| `Go` | 10 | 1.000 | 1.000 | 1.000 |
| `Swift` | 10 | 1.000 | 1.000 | 1.000 |
| `Scala` | 10 | 1.000 | 0.800 | 0.889 |
| `SQL` | 10 | 1.000 | 1.000 | 1.000 |
| `jq` | 10 | 0.909 | 1.000 | 0.952 |
| `JavaScript` | 9 | 0.900 | 1.000 | 0.947 |
| `Kotlin` | 9 | 1.000 | 1.000 | 1.000 |
| `Perl` | 9 | 1.000 | 1.000 | 1.000 |
| `PowerShell` | 9 | 1.000 | 1.000 | 1.000 |
| `Batch` | 9 | 1.000 | 1.000 | 1.000 |
| `YAML` | 9 | 1.000 | 0.889 | 0.941 |
| `C++` | 7 | 1.000 | 0.857 | 0.923 |
| `C#` | 7 | 1.000 | 1.000 | 1.000 |
| `Lua` | 7 | 1.000 | 0.857 | 0.923 |
| `Bash` | 7 | 1.000 | 1.000 | 1.000 |
| `Dockerfile` | 6 | 0.857 | 1.000 | 0.923 |
| `Makefile` | 6 | 1.000 | 1.000 | 1.000 |
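
The reported macro F1 of 0.9701 is consistent with an unweighted mean of the 25 per-language F1 scores. The card does not spell out the formula, so treat that reading as an assumption. The sketch below recomputes each F1 from the rounded precision and recall in the table and averages them; the variable names are illustrative.

```python
# (precision, recall) for the languages with at least one error, copied from the table.
PER_LANGUAGE = {
    "Java": (1.000, 0.917), "C": (0.857, 1.000), "AWK": (1.000, 0.917),
    "R": (0.846, 1.000), "Scala": (1.000, 0.800), "jq": (0.909, 1.000),
    "JavaScript": (0.900, 1.000), "YAML": (1.000, 0.889), "C++": (1.000, 0.857),
    "Lua": (1.000, 0.857), "Dockerfile": (0.857, 1.000),
}
# The remaining 14 languages have precision = recall = 1.000.
PERFECT = ["Python", "Terraform", "Rust", "Ruby", "Go", "Swift", "SQL",
           "Kotlin", "Perl", "PowerShell", "Batch", "C#", "Bash", "Makefile"]
PER_LANGUAGE.update({lang: (1.0, 1.0) for lang in PERFECT})


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


# Macro average: every language counts equally, regardless of support.
macro_f1 = sum(f1(p, r) for p, r in PER_LANGUAGE.values()) / len(PER_LANGUAGE)
print(f"{len(PER_LANGUAGE)} languages, macro F1 = {macro_f1:.4f}")
```

Because the inputs are rounded to three decimals, the recomputed value can differ from the reported one in the last digit or so.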

### Inference latency

- Mean: **0.99 s/prompt**
- Median: 0.94 s/prompt
- p95: 1.34 s/prompt
- Max: 1.64 s/prompt

## Training setup

- Base model: `Qwen/Qwen3.5-2B`, loaded unquantized in bf16 / fp16 (no `bitsandbytes` quantization)
- LoRA: r=16, alpha=32, dropout=0.05, target modules = `{q,k,v,o,gate,up,down}_proj`
- Optimizer: adamw_torch, lr=1e-4, cosine schedule, warmup 5%
- Epochs: 3
- Precision: bf16 if available, else fp16
- Effective batch size: 8 (per-device batch 1 × gradient accumulation 8), gradient checkpointing on
- Max sequence length: 3200 tokens
- Training data: 10,000 rows (7,000 single-language + 2,000 multi-language + 1,000 benign)
- Languages: 25 (programming + config formats)
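
To make the batch and schedule figures concrete, here is a rough back-of-the-envelope sketch of the optimizer-step count they imply. It assumes a single GPU and that all 10,000 rows are used for training with no validation split; neither is stated in the card. Rounding the warmup up is also an assumption.

```python
import math

# Values from the training setup list above.
rows = 10_000
per_device_batch = 1
grad_accum = 8
epochs = 3
warmup_ratio = 0.05

# Assumption: one device. With more GPUs the effective batch would scale up.
num_devices = 1

effective_batch = per_device_batch * grad_accum * num_devices
steps_per_epoch = math.ceil(rows / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(warmup_ratio * total_steps)

print(f"effective batch = {effective_batch}")
print(f"optimizer steps = {total_steps} ({steps_per_epoch} per epoch)")
print(f"warmup steps    = {warmup_steps}")
```

Under those assumptions the run is about 3,750 optimizer steps, with the cosine decay starting after roughly the first 188.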

## Supported languages

The model emits one or more of these keys in the `category` map of its JSON output:

```
Python, JavaScript, Java, C, C++, C#, Go, Rust, Kotlin, Swift, Ruby, R, Scala, Perl, Lua, Bash, PowerShell, Batch, SQL, Dockerfile, YAML, Makefile, Terraform, AWK, jq
```

---
*Model card generated automatically by `eval_and_push_card.py` on 2026-05-24 12:05 UTC.*