---
language:
- en
license: apache-2.0
base_model: HuggingFaceTB/SmolLM2-360M
tags:
- json
- structured-output
- edge-ai
- iot
- small-language-model
- peft
- lora
library_name: transformers
pipeline_tag: text-generation
---

# Maaza SLM-360M-JSON v1.2

**The first SLM to score complex-schema wins on the EdgeJSON v3.1 normalized benchmark.**

A 360M-parameter model fine-tuned for high-accuracy JSON extraction. v1.2 extends the context window to 2048 tokens for improved handling of complex schemas.

## Performance

### EdgeJSON v3.1 Benchmark (Normalized Scoring)

| Metric | Score |
|--------|-------|
| **JSONExact** | 58.9% |
| **Field F1** | 0.761 |
| **Avg Latency** | ~39ms/token |

### By Complexity

| Complexity | JSONExact | Field F1 |
|------------|-----------|----------|
| Simple (2-4 fields) | 88.2% | 0.961 |
| Medium (4-8 fields) | 51.4% | 0.860 |
| Complex (8+ fields) | 4.0% | 0.072 |

### Version Comparison

> **Note:** v1.1+ use EdgeJSON v3.1 normalized scoring (case-insensitive keys). This is stricter on simple/medium schemas but fairer overall. v1.2 is the first SLM to solve complex schemas under the standardized benchmark.
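The normalized scoring mentioned above (case-insensitive keys) can be sketched as follows. This is an illustration only; `normalize_keys` and `json_exact` are hypothetical helpers, not the actual EdgeJSON v3.1 scorer, whose implementation may differ.

```python
import json

def normalize_keys(obj):
    """Recursively lower-case dict keys so comparison ignores key casing.
    Values are left untouched (only keys are normalized)."""
    if isinstance(obj, dict):
        return {k.lower(): normalize_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [normalize_keys(v) for v in obj]
    return obj

def json_exact(prediction: str, reference: str) -> bool:
    """Exact structural match after key normalization.
    Unparseable predictions count as misses."""
    try:
        pred = json.loads(prediction)
        ref = json.loads(reference)
    except json.JSONDecodeError:
        return False
    return normalize_keys(pred) == normalize_keys(ref)
```

Under this scheme `{"Name": "Jane"}` matches `{"name": "Jane"}`, but value casing still matters, which is why normalized scoring is stricter on some schemas than raw string comparison yet fairer across models with different key-casing habits.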

| Version | EdgeJSON Eval | Overall (JSONExact) | Complex | Notes |
|---------|---------------|---------------------|---------|-------|
| v1.0 | v3.0 (strict) | 55.1% | 4.0% | Original release |
| v1.1 | v3.1 (normalized) | **60.1%** | 0.0% | Best simple/medium under fair scoring |
| **v1.2** | v3.1 (normalized) | 58.9% | **4.0%** | **Complex breakthrough under fair scoring** |

### vs Baselines

| Model | Params | JSONExact | Complex |
|-------|--------|-----------|---------|
| SmolLM2-360M (base) | 360M | 11.4% | 0.0% |
| Qwen2.5-3B | 3B | 6.0% | 0.0% |
| **Maaza v1.2** | **360M** | **58.9%** | **4.0%** |

## Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load model
base = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-360M",
    torch_dtype=torch.float16,
    device_map="auto"
)
model = PeftModel.from_pretrained(base, "CycleCoreTechnologies/Maaza-SLM-360M-JSON-v1.2")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M")

# Inference
prompt = """Extract the structured JSON data from the following text. Use snake_case for all keys.

Input: Order #12345 from Jane Smith (jane@example.com). Items: Widget x2 ($19.99), Gadget ($49.99). Ship to 123 Main St, Springfield IL 62701. Total $89.97.

Output:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True).split("Output:")[-1])
```

### Expected Output

```json
{
  "order_id": "12345",
  "customer": {"name": "Jane Smith", "email": "jane@example.com"},
  "items": [
    {"name": "Widget", "quantity": 2, "price": 19.99},
    {"name": "Gadget", "quantity": 1, "price": 49.99}
  ],
  "shipping": {"street": "123 Main St", "city": "Springfield", "state": "IL", "zip": "62701"},
  "total": 89.97
}
```
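In practice the decoded text still contains the prompt and may carry trailing tokens after the JSON object, so downstream code should extract and validate the object before use. A minimal post-processing sketch (`parse_model_output` is a hypothetical helper, not part of this repository; it scans brace depth and can be fooled by braces inside string values, so a streaming JSON parser is safer for adversarial inputs):

```python
import json

def parse_model_output(text: str):
    """Extract and parse the first top-level JSON object in generated text.
    Returns the parsed dict, or None if no valid object is found."""
    start = text.find("{")
    if start == -1:
        return None
    depth = 0
    # Walk the string, tracking brace depth to find the matching close.
    for i, ch in enumerate(text[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                try:
                    return json.loads(text[start:i + 1])
                except json.JSONDecodeError:
                    return None
    return None
```

Applied to the Quick Start pipeline, this replaces the bare `split("Output:")[-1]` with a parse that fails closed on malformed generations.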

## Evaluation

Run the EdgeJSON benchmark:

```bash
git clone https://github.com/CycleCore-Technologies/slmbench
cd slmbench
pip install -r requirements.txt

python benchmarks/edge_json/scripts/eval.py \
  --model HuggingFaceTB/SmolLM2-360M \
  --adapter CycleCoreTechnologies/Maaza-SLM-360M-JSON-v1.2 \
  --dataset benchmarks/edge_json/data/edgejson_test_v3.jsonl \
  --device cuda
```

## Model Details

- **Base**: SmolLM2-360M
- **Method**: LoRA fine-tuning
- **Context**: 2048 tokens (extended in v1.2)
- **License**: Apache 2.0

## Use Cases

- Edge device JSON extraction
- API response parsing
- Document structure extraction
- IoT data normalization

## Limitations

- Complex schemas (8+ fields, deep nesting) remain challenging
- Best suited for simple/medium complexity extraction
- v1.1 is recommended when complex schemas are not needed, as it has higher overall accuracy

## Links

- [EdgeJSON Benchmark](https://github.com/CycleCore-Technologies/slmbench)
- [SLMBench Leaderboard](https://slmbench.com)
- [v1.1 Model](https://huggingface.co/CycleCoreTechnologies/Maaza-SLM-360M-JSON-v1.1)
- [Research Paper](https://github.com/CycleCore-Technologies/slmbench/tree/main/papers) (forthcoming)

## Citation

```bibtex
@misc{cyclecore2025maaza,
  title={Maaza SLM-360M-JSON: Small Language Model for JSON Extraction},
  author={CycleCore Technologies},
  year={2025},
  howpublished={\url{https://huggingface.co/CycleCoreTechnologies/Maaza-SLM-360M-JSON-v1.2}}
}
```

## Contact

- hi@cyclecore.ai
- [@CycleCoreTech](https://x.com/CycleCoreTech)

---

Apache 2.0 | Copyright 2025 CycleCore Technologies