Evaluation Results — DLM-NL2JSON-4B vs Baselines

Test Configuration

Test set: task_analysis_sft_251128_test.jsonl (2,041 samples, 10 categories)
Metric: Field-level exact match accuracy (summary field excluded)
Note: 64 CSM samples with known gold-label noise are excluded from the adjusted metrics (see Notes below)
Train/Test overlap: 16/2,041 (0.78%) — retained for consistency across models
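
The metric description above leaves some room for interpretation. Because the overall results are reported as sample counts (for example 1926/2041), one plausible reading is that a sample counts as correct only when every field other than summary matches the gold label exactly. The sketch below assumes that reading; the helper names (`strip_excluded`, `is_exact_match`, `accuracy`) and the example field values are hypothetical and are not taken from the actual evaluation script.

```python
import json

# Assumption: only the free-text "summary" field is left out of scoring.
EXCLUDED_FIELDS = {"summary"}


def strip_excluded(record: dict) -> dict:
    """Return a copy of the record without the fields that are not scored."""
    return {k: v for k, v in record.items() if k not in EXCLUDED_FIELDS}


def is_exact_match(pred_json: str, gold_json: str) -> bool:
    """True when every scored field equals the gold label exactly.

    Assumption: output that is not a valid JSON object counts as a miss.
    """
    try:
        pred = json.loads(pred_json)
    except json.JSONDecodeError:
        return False
    if not isinstance(pred, dict):
        return False
    gold = json.loads(gold_json)
    return strip_excluded(pred) == strip_excluded(gold)


def accuracy(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (prediction, gold) pairs that match exactly."""
    if not pairs:
        return 0.0
    return sum(is_exact_match(p, g) for p, g in pairs) / len(pairs)


# Hypothetical example: key order and the summary text do not matter,
# but any difference in a scored field makes the whole sample a miss.
gold = '{"age_cd": [10, 20], "summary": "reference text"}'
pairs = [
    ('{"summary": "other text", "age_cd": [10, 20]}', gold),  # match
    ('{"age_cd": [10], "summary": "reference text"}', gold),  # miss
    ("not json", gold),                                        # miss
]
print(f"{accuracy(pairs):.3f}")
```

Comparing parsed dictionaries rather than raw strings keeps the check insensitive to key order and whitespace, which seems a reasonable choice for JSON output but is again an assumption about how the reported numbers were produced.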

Per-Category Accuracy

Category	N	DLM-NL2JSON-4B	GPT-4o	Qwen3.5-35B-A3B
ALP-A (pattern)	250	99.6%	56.0%	47.6%
ALP-B (flow)	250	98.4%	50.4%	46.8%
CSM (consumption)	700	90.6%	90.1%	86.1%
CREDIT-Income	58	94.8%	53.4%	34.5%
CREDIT-Spending	77	97.4%	92.2%	51.9%
CREDIT-Loan/Default	73	98.6%	94.5%	72.6%
CPI (business)	219	86.3%	87.2%	54.8%
GIS-Inflow	72	97.2%	79.2%	93.1%
GIS-Outflow	62	98.4%	77.4%	98.4%
GIS-Consumption	280	98.2%	99.6%	97.5%
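
A simple way to read the table above is to compare DLM-NL2JSON-4B against the better of the two baselines in each category. The sketch below hard-codes the percentages from the table; the `tally` helper and the shortened category keys are illustrative only.

```python
# Per-category accuracy (%) copied from the table above.
# Tuple order: DLM-NL2JSON-4B, GPT-4o, Qwen3.5-35B-A3B.
ROWS = {
    "ALP-A": (99.6, 56.0, 47.6),
    "ALP-B": (98.4, 50.4, 46.8),
    "CSM": (90.6, 90.1, 86.1),
    "CREDIT-Income": (94.8, 53.4, 34.5),
    "CREDIT-Spending": (97.4, 92.2, 51.9),
    "CREDIT-Loan/Default": (98.6, 94.5, 72.6),
    "CPI": (86.3, 87.2, 54.8),
    "GIS-Inflow": (97.2, 79.2, 93.1),
    "GIS-Outflow": (98.4, 77.4, 98.4),
    "GIS-Consumption": (98.2, 99.6, 97.5),
}


def tally(rows):
    """Split categories into wins, ties and losses for the first column."""
    wins, ties, losses = [], [], []
    for category, scores in rows.items():
        target, best_baseline = scores[0], max(scores[1:])
        if target > best_baseline:
            wins.append(category)
        elif target == best_baseline:
            ties.append(category)
        else:
            losses.append(category)
    return wins, ties, losses


wins, ties, losses = tally(ROWS)
print(len(wins), ties, losses)
# prints: 7 ['GIS-Outflow'] ['CPI', 'GIS-Consumption']
```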

Overall (Raw)

Model	Params	Accuracy	Avg Latency
DLM-NL2JSON-4B	4B	94.4% (1926/2041)	2.59s
GPT-4o	~200B+ (unofficial estimate; size not disclosed by OpenAI)	80.5% (1643/2041)	1.58s
Qwen3.5-35B-A3B	35B (3B active)	72.2% (1473/2041)	0.85s
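
The overall counts can be cross-checked against the per-category table: multiplying each category's N by its accuracy and rounding to the nearest whole sample should recover the number of correct samples. The snippet below is a minimal consistency check under that assumption; `reconstruct_correct` is a hypothetical helper name.

```python
# Rows copied from the per-category table:
# (N, DLM-NL2JSON-4B %, GPT-4o %, Qwen3.5-35B-A3B %)
PER_CATEGORY = [
    (250, 99.6, 56.0, 47.6),  # ALP-A
    (250, 98.4, 50.4, 46.8),  # ALP-B
    (700, 90.6, 90.1, 86.1),  # CSM
    (58, 94.8, 53.4, 34.5),   # CREDIT-Income
    (77, 97.4, 92.2, 51.9),   # CREDIT-Spending
    (73, 98.6, 94.5, 72.6),   # CREDIT-Loan/Default
    (219, 86.3, 87.2, 54.8),  # CPI
    (72, 97.2, 79.2, 93.1),   # GIS-Inflow
    (62, 98.4, 77.4, 98.4),   # GIS-Outflow
    (280, 98.2, 99.6, 97.5),  # GIS-Consumption
]
MODELS = ["DLM-NL2JSON-4B", "GPT-4o", "Qwen3.5-35B-A3B"]
REPORTED_CORRECT = {"DLM-NL2JSON-4B": 1926, "GPT-4o": 1643, "Qwen3.5-35B-A3B": 1473}


def reconstruct_correct(rows, column):
    """Sum per-category correct counts, each rounded to the nearest sample."""
    return sum(round(row[0] * row[column] / 100) for row in rows)


total_n = sum(row[0] for row in PER_CATEGORY)
reconstructed = {
    model: reconstruct_correct(PER_CATEGORY, i + 1) for i, model in enumerate(MODELS)
}
for model, correct in reconstructed.items():
    print(f"{model}: {correct}/{total_n} = {100 * correct / total_n:.1f}%")
# prints:
# DLM-NL2JSON-4B: 1926/2041 = 94.4%
# GPT-4o: 1643/2041 = 80.5%
# Qwen3.5-35B-A3B: 1473/2041 = 72.2%
```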

Overall (Adjusted — 64 CSM gold noise samples excluded)

Model	Accuracy	N
DLM-NL2JSON-4B	96.8% (1914/1977)	1977
GPT-4o	82.5% (1631/1977)	1977
Qwen3.5-35B-A3B	73.9% (1461/1977)	1977
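
The adjusted figures follow from the raw ones by dropping the 64 noisy CSM samples from the denominator. Comparing the two tables also shows each model's correct count falling by the same amount, which suggests (an inference from the reported counts, not something the results state directly) that every model had 12 of the excluded samples scored as correct. A short arithmetic check:

```python
TOTAL = 2041   # full test set
NOISY = 64     # CSM samples excluded in the adjusted metrics

RAW_CORRECT = {"DLM-NL2JSON-4B": 1926, "GPT-4o": 1643, "Qwen3.5-35B-A3B": 1473}
ADJ_CORRECT = {"DLM-NL2JSON-4B": 1914, "GPT-4o": 1631, "Qwen3.5-35B-A3B": 1461}

adjusted_n = TOTAL - NOISY  # 1977

for model, raw in RAW_CORRECT.items():
    adj = ADJ_CORRECT[model]
    raw_acc = 100 * raw / TOTAL
    adj_acc = 100 * adj / adjusted_n
    print(f"{model}: raw {raw_acc:.1f}% -> adjusted {adj_acc:.1f}% "
          f"(correct count drops by {raw - adj})")
# prints:
# DLM-NL2JSON-4B: raw 94.4% -> adjusted 96.8% (correct count drops by 12)
# GPT-4o: raw 80.5% -> adjusted 82.5% (correct count drops by 12)
# Qwen3.5-35B-A3B: raw 72.2% -> adjusted 73.9% (correct count drops by 12)
```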

Hardware

Model	Serving	GPU
DLM-NL2JSON-4B	vLLM (TensorRT-LLM)	NVIDIA L4 24GB
GPT-4o	OpenAI API	N/A
Qwen3.5-35B-A3B	vLLM	NVIDIA A6000 48GB

Notes

CSM gold noise: 64/700 CSM test samples have age_cd capped at 60 instead of 70 for "all ages" queries, conflicting with the prompt specification (age_cd: [10,20,30,40,50,60,70]). This affects all models equally.
DLM-NL2JSON-4B wins 7/10 categories outright, ties 1 (GIS-Outflow, 98.4% alongside Qwen3.5-35B-A3B), and loses 2 to GPT-4o: CPI (86.3% vs 87.2%) and GIS-Consumption (98.2% vs 99.6%).
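
The CSM gold-noise note above describes a pattern that can be screened for mechanically. The sketch below is a heuristic only, and it rests on assumptions not confirmed by the results: that each gold label is a JSON object whose age_cd field is a list of decade codes, and that the noisy samples carry exactly the list 10 through 60. The function names are hypothetical. A list capped at 60 can also be a legitimate label for a query that really asks for ages up to 60, so any flagged sample would still need its query text checked before exclusion.

```python
# Full range from the prompt specification quoted in the note above.
FULL_AGE_RANGE = [10, 20, 30, 40, 50, 60, 70]
# Assumed shape of the noisy gold label: the same list without the final 70.
CAPPED_AGE_RANGE = FULL_AGE_RANGE[:-1]


def has_capped_age_range(gold: dict) -> bool:
    """Flag gold labels whose age_cd stops at 60 (candidate noise only)."""
    return gold.get("age_cd") == CAPPED_AGE_RANGE


def split_candidates(gold_labels: list[dict]) -> tuple[list[dict], list[dict]]:
    """Partition gold labels into (unflagged, flagged-for-review)."""
    unflagged = [g for g in gold_labels if not has_capped_age_range(g)]
    flagged = [g for g in gold_labels if has_capped_age_range(g)]
    return unflagged, flagged


# Hypothetical gold labels for illustration.
examples = [
    {"age_cd": [10, 20, 30, 40, 50, 60, 70]},  # full range: not flagged
    {"age_cd": [10, 20, 30, 40, 50, 60]},      # capped at 60: flagged
    {"age_cd": [20, 30]},                      # explicit subset: not flagged
]
unflagged, flagged = split_candidates(examples)
print(len(unflagged), len(flagged))
# prints: 2 1
```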