Evaluation Results — DLM-NL2JSON-4B vs Baselines
Test Configuration
- Test set: `task_analysis_sft_251128_test.jsonl` (2,041 samples, 10 categories)
- Metric: Field-level exact match accuracy (`summary` field excluded)
- Note: 64 CSM samples with known gold-label noise are excluded from the adjusted metrics (see below)
- Train/Test overlap: 16/2,041 (0.78%) — retained for consistency across models
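
The metric above can be read as a per-sample check: a prediction counts as correct only if every field other than `summary` matches the gold JSON exactly. The sketch below illustrates that reading. The helper names, and the assumption that predictions and gold labels are JSON objects serialized as strings, are illustrative and not taken from the actual evaluation harness.

```python
import json

EXCLUDED_FIELDS = {"summary"}  # free-text field, excluded per the metric definition


def is_exact_match(pred: dict, gold: dict, excluded=EXCLUDED_FIELDS) -> bool:
    """Hypothetical helper: True if every non-excluded field matches exactly."""
    keys = (set(pred) | set(gold)) - set(excluded)
    # A field that is missing on either side counts as a mismatch.
    return all(k in pred and k in gold and pred[k] == gold[k] for k in keys)


def accuracy(pairs) -> float:
    """Fraction of (pred_json, gold_json) string pairs that match.

    Assumption: output that is not a valid JSON object is scored as wrong.
    """
    pairs = list(pairs)
    correct = 0
    for pred_raw, gold_raw in pairs:
        try:
            pred = json.loads(pred_raw)
        except json.JSONDecodeError:
            continue
        if isinstance(pred, dict) and is_exact_match(pred, json.loads(gold_raw)):
            correct += 1
    return correct / len(pairs) if pairs else 0.0
```

Under this reading, a prediction that differs from the gold label only in `summary` still counts as correct, while a single extra, missing, or differing field elsewhere makes the whole sample incorrect.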
Per-Category Accuracy
| Category | N | DLM-NL2JSON-4B | GPT-4o | Qwen3.5-35B-A3B |
|---|---|---|---|---|
| ALP-A (pattern) | 250 | 99.6% | 56.0% | 47.6% |
| ALP-B (flow) | 250 | 98.4% | 50.4% | 46.8% |
| CSM (consumption) | 700 | 90.6% | 90.1% | 86.1% |
| CREDIT-Income | 58 | 94.8% | 53.4% | 34.5% |
| CREDIT-Spending | 77 | 97.4% | 92.2% | 51.9% |
| CREDIT-Loan/Default | 73 | 98.6% | 94.5% | 72.6% |
| CPI (business) | 219 | 86.3% | 87.2% | 54.8% |
| GIS-Inflow | 72 | 97.2% | 79.2% | 93.1% |
| GIS-Outflow | 62 | 98.4% | 77.4% | 98.4% |
| GIS-Consumption | 280 | 98.2% | 99.6% | 97.5% |
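
One way to compare the models category by category is to tally where DLM-NL2JSON-4B is ahead of, level with, or behind the better of the two baselines. The sketch below does this from the values in the table above. The dictionary layout and the `tally` helper are illustrative, not part of the evaluation code.

```python
# Accuracy (%) per category, copied from the table above:
# (DLM-NL2JSON-4B, GPT-4o, Qwen3.5-35B-A3B)
SCORES = {
    "ALP-A": (99.6, 56.0, 47.6),
    "ALP-B": (98.4, 50.4, 46.8),
    "CSM": (90.6, 90.1, 86.1),
    "CREDIT-Income": (94.8, 53.4, 34.5),
    "CREDIT-Spending": (97.4, 92.2, 51.9),
    "CREDIT-Loan/Default": (98.6, 94.5, 72.6),
    "CPI": (86.3, 87.2, 54.8),
    "GIS-Inflow": (97.2, 79.2, 93.1),
    "GIS-Outflow": (98.4, 77.4, 98.4),
    "GIS-Consumption": (98.2, 99.6, 97.5),
}


def tally(scores):
    """Split categories into wins, ties, and losses for the first model."""
    wins, ties, losses = [], [], []
    for category, (dlm, *baselines) in scores.items():
        best_baseline = max(baselines)
        if dlm > best_baseline:
            wins.append(category)
        elif dlm == best_baseline:
            ties.append(category)
        else:
            losses.append(category)
    return wins, ties, losses


if __name__ == "__main__":
    wins, ties, losses = tally(SCORES)
    print("wins:", wins)
    print("ties:", ties)
    print("losses:", losses)
```

Exact float comparison is acceptable here only because the inputs are the rounded table entries. With unrounded accuracies, comparing correct-sample counts would be a safer choice.
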
Overall (Raw)
| Model | Params | Accuracy | Avg Latency |
|---|---|---|---|
| DLM-NL2JSON-4B | 4B | 94.4% (1926/2041) | 2.59s |
| GPT-4o | ~200B+ (unofficial estimate) | 80.5% (1643/2041) | 1.58s |
| Qwen3.5-35B-A3B | 35B (3B active) | 72.2% (1473/2041) | 0.85s |
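
The raw overall accuracy should equal the sample-weighted average of the per-category accuracies. The sketch below runs that consistency check using the per-category table values. Because those inputs are rounded to one decimal place, the result is expected to agree with the reported figures only to within rounding. The variable and function names are illustrative.

```python
# Category sizes, in the order of the per-category table.
N = [250, 250, 700, 58, 77, 73, 219, 72, 62, 280]

# Per-category accuracy (%) for each model, same order as N.
ACC = {
    "DLM-NL2JSON-4B": [99.6, 98.4, 90.6, 94.8, 97.4, 98.6, 86.3, 97.2, 98.4, 98.2],
    "GPT-4o": [56.0, 50.4, 90.1, 53.4, 92.2, 94.5, 87.2, 79.2, 77.4, 99.6],
    "Qwen3.5-35B-A3B": [47.6, 46.8, 86.1, 34.5, 51.9, 72.6, 54.8, 93.1, 98.4, 97.5],
}

# Overall accuracy (%) as reported in the table above.
REPORTED = {"DLM-NL2JSON-4B": 94.4, "GPT-4o": 80.5, "Qwen3.5-35B-A3B": 72.2}


def weighted_accuracy(accs, counts):
    """Sample-weighted mean of per-category accuracies."""
    return sum(a * n for a, n in zip(accs, counts)) / sum(counts)


if __name__ == "__main__":
    for model, accs in ACC.items():
        estimate = weighted_accuracy(accs, N)
        print(f"{model}: weighted {estimate:.2f}% vs reported {REPORTED[model]}%")
```
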
Overall (Adjusted — 64 CSM gold noise samples excluded)
| Model | Accuracy | N |
|---|---|---|
| DLM-NL2JSON-4B | 96.8% (1914/1977) | 1977 |
| GPT-4o | 82.5% (1631/1977) | 1977 |
| Qwen3.5-35B-A3B | 73.9% (1461/1977) | 1977 |
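
The adjusted figures follow from the raw counts by removing the 64 noisy CSM samples from the denominator and subtracting whatever each model had scored as correct among them. The sketch below reproduces that arithmetic from the counts reported in the two tables. The function name and the returned dictionary keys are illustrative.

```python
RAW_N = 2041     # full test set
EXCLUDED = 64    # CSM samples with gold-label noise

# Correct-sample counts taken from the raw and adjusted tables.
RAW_CORRECT = {"DLM-NL2JSON-4B": 1926, "GPT-4o": 1643, "Qwen3.5-35B-A3B": 1473}
ADJ_CORRECT = {"DLM-NL2JSON-4B": 1914, "GPT-4o": 1631, "Qwen3.5-35B-A3B": 1461}


def adjusted(raw_correct, adj_correct, raw_n=RAW_N, excluded=EXCLUDED):
    """Recompute accuracy after dropping the excluded samples."""
    adj_n = raw_n - excluded
    return {
        "n": adj_n,
        "accuracy": 100 * adj_correct / adj_n,
        # Predictions that were scored correct among the excluded samples.
        "correct_in_excluded": raw_correct - adj_correct,
    }


if __name__ == "__main__":
    for model in RAW_CORRECT:
        result = adjusted(RAW_CORRECT[model], ADJ_CORRECT[model])
        print(model, result["n"], f"{result['accuracy']:.1f}%", result["correct_in_excluded"])
```

The reported counts imply that each model had 12 of the 64 excluded samples scored as correct, so the exclusion changes every model's numerator and denominator by the same amounts.
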
Hardware
| Model | Serving | GPU |
|---|---|---|
| DLM-NL2JSON-4B | vLLM (TensorRT-LLM) | NVIDIA L4 24GB |
| GPT-4o | OpenAI API | N/A |
| Qwen3.5-35B-A3B | vLLM | NVIDIA A6000 48GB |
Notes
- CSM gold noise: 64/700 CSM test samples have `age_cd` capped at 60 instead of 70 for "all ages" queries, conflicting with the prompt specification (`age_cd: [10,20,30,40,50,60,70]`). This affects all models equally.
- DLM-NL2JSON-4B has the highest accuracy in 7/10 categories, ties 1 (GIS-Outflow, 98.4%, level with Qwen3.5-35B-A3B), and trails GPT-4o in 2: CPI (86.3% vs 87.2%) and GIS-Consumption (98.2% vs 99.6%).
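
A filter along the following lines could be used to flag candidate noisy gold labels of the kind described in the CSM note. This is a sketch under assumptions: the record layout (`category` and `gold` keys) is hypothetical, and only the `age_cd` field name and the full age list come from the prompt specification quoted above.

```python
FULL_AGE_CD = [10, 20, 30, 40, 50, 60, 70]  # "all ages" per the prompt specification
CAPPED_AGE_CD = FULL_AGE_CD[:-1]            # the truncated list seen in noisy gold labels


def is_suspect_gold(gold: dict) -> bool:
    """Flag gold labels whose age_cd stops at 60 (candidate noise)."""
    return gold.get("age_cd") == CAPPED_AGE_CD


def split_clean_noisy(samples):
    """Partition samples into (clean, suspect).

    Assumes each sample is a dict with hypothetical keys "category" and "gold".
    Only CSM samples are checked, matching the scope of the note.
    """
    clean, suspect = [], []
    for sample in samples:
        if sample.get("category") == "CSM" and is_suspect_gold(sample.get("gold", {})):
            suspect.append(sample)
        else:
            clean.append(sample)
    return clean, suspect
```

A rule like this would also flag queries that genuinely ask for ages 10 through 60, so flagged samples would still need to be checked against the query text before being excluded.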