---
license: apache-2.0
base_model: Qwen/Qwen3.5-2B
library_name: transformers
pipeline_tag: text-generation
language:
- en
tags:
- qwen
- guardrails
- code-detection
- language-identification
- multi-label-classification
- merged
- vllm
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: CodeLanguage-Qwen3.5-2B-v8
  results:
  - task:
      type: text-classification
      name: Multi-label Programming Language Identification
    dataset:
      name: LangID Guard Held-out Test Set
      type: custom
    metrics:
    - type: accuracy
      name: is_valid accuracy
      value: 1.0000
    - type: accuracy
      name: language-set exact match
      value: 0.9650
    - type: f1
      name: binary F1 (positive=contains code)
      value: 1.0000
    - type: f1
      name: macro F1 over languages
      value: 0.9701
    - type: precision
      name: binary precision (positive=contains code)
      value: 1.0000
    - type: recall
      name: binary recall (positive=contains code)
      value: 1.0000
---
# CodeLanguage-Qwen3.5-2B-v8
**Merged full model** (base `Qwen/Qwen3.5-2B` + LoRA adapter, merged via `peft.merge_and_unload()`) that identifies which programming languages are embedded in a user prompt across **25 languages and configuration formats**. This is a self-contained checkpoint — load it directly (no PEFT step) and serve it on **vLLM** (v0.21.0+). Trained on a combined dataset of Rosetta Code snippets and curated config-language samples (Dockerfile, YAML, Terraform, Makefile, SQL).
The model is fine-tuned to emit a strict JSON object describing the languages found:
```json
{"is_valid": true, "category": {"Python": true, "Bash": true}}
```
`is_valid` is `true` when at least one code/config snippet is present and `false` for natural-language-only prompts. `category` contains only the detected languages, each mapped to `true`; if no code is present `category` is `{}`.
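The schema above is simple enough to check mechanically before a verdict is trusted downstream. The following is a minimal sketch of such a check, not part of the released code: `validate_verdict` and `ALLOWED_LANGUAGES` are hypothetical names, and the rule that `is_valid: false` must come with an empty `category` is read directly from the description above. The sketch deliberately does not require a non-empty `category` when `is_valid` is `true`, since the card does not say what happens for code in an unlisted language.

```python
import json

# The 25 allowed keys, copied from the model's system prompt.
ALLOWED_LANGUAGES = frozenset({
    "Python", "JavaScript", "Java", "C", "C++", "C#", "Go", "Rust", "Kotlin",
    "Swift", "Ruby", "R", "Scala", "Perl", "Lua", "Bash", "PowerShell", "Batch",
    "SQL", "Dockerfile", "YAML", "Makefile", "Terraform", "AWK", "jq",
})


def validate_verdict(raw: str) -> dict:
    """Parse a model response and check it against the documented schema.

    Hypothetical helper: raises ValueError on any deviation.
    """
    verdict = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(verdict, dict) or set(verdict) != {"is_valid", "category"}:
        raise ValueError("expected exactly the keys 'is_valid' and 'category'")
    is_valid, category = verdict["is_valid"], verdict["category"]
    if not isinstance(is_valid, bool) or not isinstance(category, dict):
        raise ValueError("'is_valid' must be a bool and 'category' an object")
    unknown = set(category) - ALLOWED_LANGUAGES
    if unknown:
        raise ValueError(f"unknown language keys: {sorted(unknown)}")
    if any(value is not True for value in category.values()):
        raise ValueError("every detected language must map to true")
    if not is_valid and category:
        raise ValueError("'category' must be empty when 'is_valid' is false")
    return verdict


# Usage with the example response shown above.
example = validate_verdict(
    '{"is_valid": true, "category": {"Python": true, "Bash": true}}'
)
print(sorted(example["category"]))
```

Raising on any deviation keeps a guardrail pipeline fail-closed; a caller that prefers to fail open can catch `ValueError` and substitute a default verdict.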
## Quick start
> **Text-only model.** The base `Qwen/Qwen3.5-2B` declares the multimodal `Qwen3_5ForConditionalGeneration` architecture (it carries a vision tower in its weights), but this is a **text-in / text-out** language guard — it never consumes images and only emits the JSON verdict. Send only text prompts; vLLM auto-detects text-only mode and prints `All limits of multimodal modalities ... set to 0, running in text-only mode` at startup. (`language_model_only=True` would in theory skip loading the vision-tower weights, but on vLLM v0.21.0 it crashes `Qwen3_5ForCausalLM.__init__` with a `vision_config` attribute error — leave it off until a later vLLM release fixes that path.)
### vLLM (recommended — needs vLLM >= 0.21.0 for the Qwen3.5/Mamba runner)
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
import json, re
MODEL = "Accuknoxtechnologies/CodeLanguage-Qwen3.5-2B-v8"
SYSTEM_MSG = """You are a code language identifier. For the given user prompt, decide whether it contains any embedded source code (program source or recognizable code-like configuration). Output exactly one JSON object and nothing else: {"is_valid": <true|false>, "category": {"<Lang>": true, ...}}.
No preamble. No explanation. No <think> tags. No markdown code fences. No trailing prose.
Rules:
- is_valid is TRUE when the prompt contains at least one code/config snippet, FALSE when the prompt is plain natural-language only.
- category contains ONLY the languages that appear, each mapped to true. If no code is present, category is the empty object {}.
- When multiple languages appear, list every distinct one (still only true).
Allowed language keys (use these exact spellings):
Python, JavaScript, Java, C, C++, C#, Go, Rust, Kotlin, Swift, Ruby, R, Scala, Perl, Lua, Bash, PowerShell, Batch, SQL, Dockerfile, YAML, Makefile, Terraform, AWK, jq"""
llm = LLM(
    model=MODEL,
    trust_remote_code=True,
    dtype="bfloat16",
    max_model_len=4096,
    # vLLM auto-detects text-only when no multimodal inputs are sent.
    # Do NOT pass language_model_only=True here; see the note above.
)
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
# Greedy decoding (temperature 0) with a hard cap on output length.
sampling = SamplingParams(temperature=0.0, max_tokens=220, stop=["\n\n\n"])

def langid(prompt: str) -> dict:
    # Render the chat template with thinking disabled so only the JSON verdict is generated.
    chat = tokenizer.apply_chat_template(
        [{"role": "system", "content": SYSTEM_MSG},
         {"role": "user", "content": prompt}],
        tokenize=False, add_generation_prompt=True, enable_thinking=False)
    out = llm.generate([chat], sampling)
    text = out[0].outputs[0].text
    # Extract the outermost {...} span and fail loudly if the model produced none.
    match = re.search(r'\{.*\}', text, re.DOTALL)
    if match is None:
        raise ValueError(f"no JSON object in model output: {text!r}")
    return json.loads(match.group(0))
```
### Plain transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, json, re
MODEL = "Accuknoxtechnologies/CodeLanguage-Qwen3.5-2B-v8"
SYSTEM_MSG = """You are a code language identifier. For the given user prompt, decide whether it contains any embedded source code (program source or recognizable code-like configuration). Output exactly one JSON object and nothing else: {"is_valid": <true|false>, "category": {"<Lang>": true, ...}}.
No preamble. No explanation. No <think> tags. No markdown code fences. No trailing prose.
Rules:
- is_valid is TRUE when the prompt contains at least one code/config snippet, FALSE when the prompt is plain natural-language only.
- category contains ONLY the languages that appear, each mapped to true. If no code is present, category is the empty object {}.
- When multiple languages appear, list every distinct one (still only true).
Allowed language keys (use these exact spellings):
Python, JavaScript, Java, C, C++, C#, Go, Rust, Kotlin, Swift, Ruby, R, Scala, Perl, Lua, Bash, PowerShell, Batch, SQL, Dockerfile, YAML, Makefile, Terraform, AWK, jq"""
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True,
).eval()

def langid(prompt: str) -> dict:
    # Render the chat template with thinking disabled so only the JSON verdict is generated.
    chat = tokenizer.apply_chat_template(
        [{"role": "system", "content": SYSTEM_MSG},
         {"role": "user", "content": prompt}],
        tokenize=False, add_generation_prompt=True, enable_thinking=False)
    inputs = tokenizer(chat, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=220, do_sample=False)
    # Decode only the newly generated tokens, skipping the prompt.
    text = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    # Extract the outermost {...} span and fail loudly if the model produced none.
    match = re.search(r'\{.*\}', text, re.DOTALL)
    if match is None:
        raise ValueError(f"no JSON object in model output: {text!r}")
    return json.loads(match.group(0))
```
## System prompt
The model was trained with the exact system prompt below. Pass it verbatim at inference time — the output schema depends on this prompt.
```text
You are a code language identifier. For the given user prompt, decide whether it contains any embedded source code (program source or recognizable code-like configuration). Output exactly one JSON object and nothing else: {"is_valid": <true|false>, "category": {"<Lang>": true, ...}}.
No preamble. No explanation. No <think> tags. No markdown code fences. No trailing prose.
Rules:
- is_valid is TRUE when the prompt contains at least one code/config snippet, FALSE when the prompt is plain natural-language only.
- category contains ONLY the languages that appear, each mapped to true. If no code is present, category is the empty object {}.
- When multiple languages appear, list every distinct one (still only true).
Allowed language keys (use these exact spellings):
Python, JavaScript, Java, C, C++, C#, Go, Rust, Kotlin, Swift, Ruby, R, Scala, Perl, Lua, Bash, PowerShell, Batch, SQL, Dockerfile, YAML, Makefile, Terraform, AWK, jq
```
## Evaluation (transformers)
Evaluated on **200 held-out prompts** drawn from `test_dataset_langid.csv` (same single + multi + benign composition as training).
- Evaluation timestamp: `2026-05-24 12:53 UTC`
- GPU: `NVIDIA A10G`
- Source adapter: `Accuknoxtechnologies/CodeLanguage-Qwen3.5-2B-v8`
- JSON parse errors: `0/200` (`0.0%`)
### Top-level metrics
| Metric | Value |
|---|---:|
| `is_valid` accuracy | **1.0000** |
| Language-set exact match | **0.9650** |
| Binary F1 (positive = contains code) | **1.0000** |
| Binary precision | 1.0000 |
| Binary recall | 1.0000 |
| Macro F1 across languages | **0.9701** |
### Confusion matrix — binary `is_valid` decision
Positive class = the prompt **contains code** (`is_valid=True`).
| | predicted contains-code | predicted no-code |
|---|---:|---:|
| **actual contains-code** | TP = 181 | FN = 0 |
| **actual no-code** | FP = 0 | TN = 19 |
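As a worked check, the binary metrics in the top-level table follow directly from these four counts. The sketch below uses the standard definitions of accuracy, precision, recall and F1; `binary_metrics` is a hypothetical helper name, and returning 0.0 for an undefined ratio is a convention chosen here, not something the evaluation script is known to do.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Counts from the confusion matrix above (181 + 19 = 200 prompts).
metrics = binary_metrics(tp=181, fp=0, fn=0, tn=19)
print(metrics)
```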
### Per-language metrics
Only languages that appear in either the actual or predicted labels are listed.
| Language | support | precision | recall | F1 |
|---|---:|---:|---:|---:|
| `Python` | 14 | 1.000 | 1.000 | 1.000 |
| `Terraform` | 14 | 1.000 | 1.000 | 1.000 |
| `Java` | 12 | 1.000 | 0.917 | 0.957 |
| `C` | 12 | 0.857 | 1.000 | 0.923 |
| `Rust` | 12 | 1.000 | 1.000 | 1.000 |
| `AWK` | 12 | 1.000 | 0.917 | 0.957 |
| `Ruby` | 11 | 1.000 | 1.000 | 1.000 |
| `R` | 11 | 0.846 | 1.000 | 0.917 |
| `Go` | 10 | 1.000 | 1.000 | 1.000 |
| `Swift` | 10 | 1.000 | 1.000 | 1.000 |
| `Scala` | 10 | 1.000 | 0.800 | 0.889 |
| `SQL` | 10 | 1.000 | 1.000 | 1.000 |
| `jq` | 10 | 0.909 | 1.000 | 0.952 |
| `JavaScript` | 9 | 0.900 | 1.000 | 0.947 |
| `Kotlin` | 9 | 1.000 | 1.000 | 1.000 |
| `Perl` | 9 | 1.000 | 1.000 | 1.000 |
| `PowerShell` | 9 | 1.000 | 1.000 | 1.000 |
| `Batch` | 9 | 1.000 | 1.000 | 1.000 |
| `YAML` | 9 | 1.000 | 0.889 | 0.941 |
| `C++` | 7 | 1.000 | 0.857 | 0.923 |
| `C#` | 7 | 1.000 | 1.000 | 1.000 |
| `Lua` | 7 | 1.000 | 0.857 | 0.923 |
| `Bash` | 7 | 1.000 | 1.000 | 1.000 |
| `Dockerfile` | 6 | 0.857 | 1.000 | 0.923 |
| `Makefile` | 6 | 1.000 | 1.000 | 1.000 |
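The per-language rows, the macro F1 and the language-set exact match can all be derived from pairs of gold and predicted label sets. The evaluation script itself is not shown in this card, so the following is only a sketch of one conventional way to compute them: the function names and the toy data are hypothetical, and a language with no predictions is scored with precision 0.0 by assumption. As in the table, only languages that appear in the gold or predicted labels are scored.

```python
from collections import Counter


def per_language_scores(gold: list, pred: list) -> dict:
    """Per-language support, precision, recall and F1 from label sets."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        for lang in g & p:
            tp[lang] += 1  # predicted and actually present
        for lang in p - g:
            fp[lang] += 1  # predicted but not present
        for lang in g - p:
            fn[lang] += 1  # present but missed
    scores = {}
    for lang in sorted(set(tp) | set(fp) | set(fn)):
        precision = tp[lang] / (tp[lang] + fp[lang]) if tp[lang] + fp[lang] else 0.0
        recall = tp[lang] / (tp[lang] + fn[lang]) if tp[lang] + fn[lang] else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[lang] = {"support": tp[lang] + fn[lang], "precision": precision,
                        "recall": recall, "f1": f1}
    return scores


def macro_f1(scores: dict) -> float:
    """Unweighted mean of the per-language F1 values."""
    return sum(s["f1"] for s in scores.values()) / len(scores)


def exact_match(gold: list, pred: list) -> float:
    """Share of prompts whose predicted language set equals the gold set."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy data (not from the test set): one C++ snippet is mislabeled as C.
gold = [{"Python"}, {"C"}, {"C++"}, set()]
pred = [{"Python"}, {"C"}, {"C"}, set()]
scores = per_language_scores(gold, pred)
print(scores["C"], macro_f1(scores), exact_match(gold, pred))
```

Averaging the F1 column of the table above in the same unweighted way gives 0.9701, which matches the reported macro F1.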
### Inference latency
- Mean: **0.99 s/prompt**
- Median: 0.94 s/prompt
- p95: 1.34 s/prompt
- Max: 1.64 s/prompt
## Training setup
- Base model: `Qwen/Qwen3.5-2B`, loaded unquantized in bf16 / fp16 (no `bitsandbytes` quantization)
- LoRA: r=16, alpha=32, dropout=0.05, target modules = {q,k,v,o,gate,up,down}_proj
- Optimizer: adamw_torch, lr=1e-4, cosine schedule, warmup 5%
- Epochs: 3
- Precision: bf16 if available, else fp16
- Effective batch size: 8 (per-device batch 1 × gradient accumulation 8), gradient checkpointing on
- Max sequence length: 3200 tokens
- Training data: 10,000 rows (7,000 single-language + 2,000 multi-language + 1,000 benign)
- Languages: 25 (programming + config formats)
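The hyperparameters above imply a rough optimizer-step budget, worked out below as plain arithmetic. Three things are assumptions rather than facts from this card: a single training device, all 10,000 rows being used for training (no validation split carved out), and rounding partial steps up.

```python
import math

ROWS = 10_000          # training rows listed above
PER_DEVICE_BATCH = 1
GRAD_ACCUM = 8
EPOCHS = 3
WARMUP_RATIO = 0.05    # "warmup 5%"
NUM_DEVICES = 1        # assumption: the card does not state the device count

# Samples consumed per optimizer step.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES
# Optimizer steps per pass over the data, rounding a partial step up.
steps_per_epoch = math.ceil(ROWS / effective_batch)
total_steps = steps_per_epoch * EPOCHS
# Steps spent ramping the learning rate up before cosine decay begins.
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions the run takes 3,750 optimizer steps, of which roughly the first 188 are warmup.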
## Supported languages
The model emits one or more of these keys in the `category` map of its JSON output:
```
Python, JavaScript, Java, C, C++, C#, Go, Rust, Kotlin, Swift, Ruby, R, Scala, Perl, Lua, Bash, PowerShell, Batch, SQL, Dockerfile, YAML, Makefile, Terraform, AWK, jq
```
## Evaluation — vLLM serving (merged model, text-only)
Evaluated on **500 held-out prompts** (a larger sample than the 200 used for the transformers evaluation above), served through **vLLM `0.21.0`**'s native Qwen3.5/Mamba runner instead of the transformers `.generate()` loop. Only text prompts are sent; vLLM auto-detects text-only mode. These numbers reflect accuracy and latency under production-style serving.
- Engine: vLLM `0.21.0`, text-only mode (auto-detected, `limit_mm_per_prompt=0`), dtype bf16, greedy decoding
- GPU: `NVIDIA A10G`
- JSON parse errors: `0/500` (`0.0%`)
### Accuracy (vLLM)
| Metric | Value |
|---|---:|
| `is_valid` accuracy | **1.0000** |
| Language-set exact match | **0.9700** |
| Binary F1 (positive = contains code) | **1.0000** |
| Binary precision | 1.0000 |
| Binary recall | 1.0000 |
| Macro F1 across languages | **0.9771** |
### Confusion matrix — binary `is_valid` (vLLM)
| | predicted contains-code | predicted no-code |
|---|---:|---:|
| **actual contains-code** | TP = 450 | FN = 0 |
| **actual no-code** | FP = 0 | TN = 50 |
### vLLM inference latency (single-stream, batch = 1)
| Stat | ms / prompt |
|---|---:|
| Mean | **200.0** |
| Median | 186.2 |
| p95 | 278.9 |
| p99 | 343.7 |
| Max | 1990.9 |
| Under 1 s | 99.6% |
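A table like this can be produced from a list of per-prompt wall-clock times. The sketch below is illustrative only: the evaluation script's percentile method is not stated, so a nearest-rank percentile is assumed, and `percentile`, `summarize_latencies` and the synthetic sample are hypothetical.

```python
import statistics


def percentile(sorted_values: list, q: int) -> float:
    """Nearest-rank percentile for an integer q in 1..100."""
    # Integer ceiling of q * n / 100, avoiding floating-point rounding.
    rank = max(1, (q * len(sorted_values) + 99) // 100)
    return sorted_values[rank - 1]


def summarize_latencies(latencies_ms: list) -> dict:
    """Summary statistics matching the rows of the latency table."""
    ordered = sorted(latencies_ms)
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
        "max": ordered[-1],
        "under_1s": sum(v < 1000 for v in ordered) / len(ordered),
    }


# Synthetic sample (not the real measurements): 98 fast prompts, 2 slow ones.
sample = [100.0 + i for i in range(98)] + [1200.0, 1990.9]
summary = summarize_latencies(sample)
print(summary)
```

With 500 prompts, the reported 99.6% under 1 s corresponds to 498 prompts, so the mean is pulled up by only a couple of slow outliers while the median stays near 186 ms.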
### vLLM throughput (single batched submit, continuous batching)
- Prompts/sec: **18.12**
- Output tokens/sec: 260.7
- Input tokens/sec: 15441.4
- Batched wall time for all 500 prompts: 27.60 s
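The throughput figures are consistent with one another, which the short calculation below illustrates. The per-prompt token averages are derived estimates, assuming both token rates were measured over the same 27.60 s wall time; they are not reported by the evaluation script.

```python
N_PROMPTS = 500
WALL_SECONDS = 27.60
OUTPUT_TOKENS_PER_S = 260.7
INPUT_TOKENS_PER_S = 15441.4
MEAN_SINGLE_STREAM_MS = 200.0  # from the latency table above

# Batched throughput: prompts completed per second of wall time.
prompts_per_s = N_PROMPTS / WALL_SECONDS
# Derived estimates of the average prompt and response length in tokens.
avg_output_tokens = OUTPUT_TOKENS_PER_S * WALL_SECONDS / N_PROMPTS
avg_input_tokens = INPUT_TOKENS_PER_S * WALL_SECONDS / N_PROMPTS
# Gain of continuous batching over sending one prompt at a time.
single_stream_prompts_per_s = 1000.0 / MEAN_SINGLE_STREAM_MS
batching_speedup = prompts_per_s / single_stream_prompts_per_s

print(round(prompts_per_s, 2), round(avg_output_tokens, 1),
      round(avg_input_tokens, 1), round(batching_speedup, 2))
```

Most of the input length is likely the fixed system prompt, which is sent with every request.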
---
*Model card generated automatically by `eval_and_push_card.py` on 2026-05-24 12:53 UTC.*