CycleCore-Technologies committed on
Commit
f58348b
·
verified ·
1 Parent(s): 84b1d0d

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,163 @@
+ ---
+ language:
+ - en
+ license: apache-2.0
+ base_model: HuggingFaceTB/SmolLM2-360M
+ tags:
+ - json
+ - structured-output
+ - edge-ai
+ - iot
+ - small-language-model
+ - peft
+ - lora
+ library_name: transformers
+ pipeline_tag: text-generation
+ ---
+
+ # Maaza SLM-360M-JSON v1.2
+
+ **First sub-400M model with consistent complex schema wins on the EdgeJSON benchmark.**
+
+ A 360M-parameter model fine-tuned for high-accuracy JSON extraction. v1.2 introduces extended context (2048 tokens) for improved handling of complex schemas.
+
+ ## Performance
+
+ ### EdgeJSON v3 Benchmark (Normalized Scoring)
+
+ | Metric | Score |
+ |--------|-------|
+ | **JSONExact** | 58.9% |
+ | **Field F1** | 0.761 |
+ | **Complex Schemas (JSONExact)** | 4.0% |
+ | **Avg Latency** | ~39 ms/token |
+
+ ### By Complexity
+
+ | Complexity | JSONExact | Field F1 |
+ |------------|-----------|----------|
+ | Simple (2-4 fields) | 88.2% | 0.961 |
+ | Medium (4-8 fields) | 51.4% | 0.860 |
+ | Complex (8+ fields) | 4.0% | 0.072 |
+
+ ### Version Comparison
+
+ | Version | JSONExact | Complex | Notes |
+ |---------|-----------|---------|-------|
+ | v1.0 | 55.1% | 4.0% | Initial release |
+ | v1.1 | 60.1% | 0.0% | Best simple/medium |
+ | **v1.2** | **58.9%** | **4.0%** | Complex breakthrough |
+
+ ### vs Baselines
+
+ | Model | Params | JSONExact | Complex |
+ |-------|--------|-----------|---------|
+ | SmolLM2-360M (base) | 360M | 11.4% | 0.0% |
+ | Qwen2.5-3B | 3B | 6.0% | 0.0% |
+ | **Maaza v1.2** | **360M** | **58.9%** | **4.0%** |
+
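The README's Performance tables report JSONExact and Field F1 without defining them. As a rough illustration only, the sketch below shows one plausible way to score a single prediction: exact match after parsing, and an F1 over flattened leaf fields. This is an assumption, not the EdgeJSON benchmark's actual normalized scoring, and the names `flatten`, `json_exact`, and `field_f1` are hypothetical.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested JSON into {dotted_path: leaf_value} pairs."""
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            items.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            items.update(flatten(value, f"{prefix}{index}."))
    else:
        items[prefix.rstrip(".")] = obj
    return items


def json_exact(predicted: str, expected: dict) -> bool:
    """True when the prediction parses and equals the expected object."""
    try:
        return json.loads(predicted) == expected
    except json.JSONDecodeError:
        return False


def field_f1(predicted: str, expected: dict) -> float:
    """F1 over leaf fields; unparseable predictions score 0.0."""
    try:
        pred_fields = flatten(json.loads(predicted))
    except json.JSONDecodeError:
        return 0.0
    gold_fields = flatten(expected)
    # A field counts as a match only if both path and value agree.
    matches = sum(
        1
        for path, value in gold_fields.items()
        if path in pred_fields and pred_fields[path] == value
    )
    if matches == 0:
        return 0.0
    precision = matches / len(pred_fields)
    recall = matches / len(gold_fields)
    return 2 * precision * recall / (precision + recall)
```

Under this hypothetical definition, a prediction can miss the exact match while still earning partial field-level credit, which would be one way to read the gap between the two columns in the tables above.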
+ ## Quick Start
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ from peft import PeftModel
+ import torch
+
+ # Load model
+ base = AutoModelForCausalLM.from_pretrained(
+     "HuggingFaceTB/SmolLM2-360M",
+     torch_dtype=torch.float16,
+     device_map="auto"
+ )
+ model = PeftModel.from_pretrained(base, "CycleCoreTechnologies/Maaza-SLM-360M-JSON-v1.2")
+ tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M")
+
+ # Inference
+ prompt = """Extract the structured JSON data from the following text. Use snake_case for all keys.
+
+ Input: Order #12345 from Jane Smith (jane@example.com). Items: Widget x2 ($19.99), Gadget ($49.99). Ship to 123 Main St, Springfield IL 62701. Total $89.97.
+
+ Output:"""
+
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+ outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True).split("Output:")[-1])
+ ```
+
+ ### Expected Output
+
+ ```json
+ {
+   "order_id": "12345",
+   "customer": {"name": "Jane Smith", "email": "jane@example.com"},
+   "items": [
+     {"name": "Widget", "quantity": 2, "price": 19.99},
+     {"name": "Gadget", "quantity": 1, "price": 49.99}
+   ],
+   "shipping": {"street": "123 Main St", "city": "Springfield", "state": "IL", "zip": "62701"},
+   "total": 89.97
+ }
+ ```
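The Quick Start prints everything generated after `Output:`, which may include extra text after the closing brace when decoding runs on toward `max_new_tokens`. A minimal sketch of pulling out just the first JSON object, using the standard library's `json.JSONDecoder.raw_decode`, could look like the following. The helper name `extract_first_json` is hypothetical and not part of this repository.

```python
import json


def extract_first_json(text: str):
    """Return the first JSON object found in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at `start` and
            # ignores whatever follows it in the string.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = text.find("{", start + 1)
    return None
```

Calling `extract_first_json(generated_text)` on the decoded string would yield a Python `dict` when a well-formed object is present, and `None` otherwise, so the caller can decide how to handle malformed output.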
+
+ ## Evaluation
+
+ Run the EdgeJSON benchmark:
+
+ ```bash
+ git clone https://github.com/CycleCore-Technologies/slmbench
+ cd slmbench
+ pip install -r requirements.txt
+
+ python benchmarks/edge_json/scripts/eval.py \
+   --model HuggingFaceTB/SmolLM2-360M \
+   --adapter CycleCoreTechnologies/Maaza-SLM-360M-JSON-v1.2 \
+   --dataset benchmarks/edge_json/data/edgejson_test_v3.jsonl \
+   --device cuda
+ ```
+
+ ## Model Details
+
+ - **Base**: SmolLM2-360M
+ - **Method**: LoRA fine-tuning
+ - **Context**: 2048 tokens (extended in v1.2)
+ - **License**: Apache 2.0
+
+ ## Use Cases
+
+ - Edge device JSON extraction
+ - API response parsing
+ - Document structure extraction
+ - IoT data normalization
+
+ ## Limitations
+
+ - Complex schemas (8+ fields, deep nesting) remain challenging
+ - Best suited for simple/medium complexity extraction
+ - v1.1 is recommended if complex schemas are not needed (higher overall JSONExact: 60.1% vs 58.9%)
+
+ ## Links
+
+ - [EdgeJSON Benchmark](https://github.com/CycleCore-Technologies/slmbench)
+ - [SLMBench Leaderboard](https://slmbench.com)
+ - [v1.1 Model](https://huggingface.co/CycleCoreTechnologies/Maaza-SLM-360M-JSON-v1.1)
+ - [Research Paper](https://github.com/CycleCore-Technologies/slmbench/tree/main/papers) (forthcoming)
+
+ ## Citation
+
+ ```bibtex
+ @misc{cyclecore2025maaza,
+   title={Maaza SLM-360M-JSON: Sub-400M JSON Extraction},
+   author={CycleCore Technologies},
+   year={2025},
+   howpublished={\url{https://huggingface.co/CycleCoreTechnologies/Maaza-SLM-360M-JSON-v1.2}}
+ }
+ ```
+
+ ## Contact
+
+ - hi@cyclecore.ai
+ - [@CycleCoreTech](https://x.com/CycleCoreTech)
+
+ ---
+
+ Apache 2.0 | Copyright 2025 CycleCore Technologies
adapter_config.json ADDED
@@ -0,0 +1,34 @@
+ {
+   "alpha_pattern": {},
+   "auto_mapping": null,
+   "base_model_name_or_path": "HuggingFaceTB/SmolLM2-360M",
+   "bias": "none",
+   "fan_in_fan_out": false,
+   "inference_mode": true,
+   "init_lora_weights": true,
+   "layer_replication": null,
+   "layers_pattern": null,
+   "layers_to_transform": null,
+   "loftq_config": {},
+   "lora_alpha": 64,
+   "lora_dropout": 0.1,
+   "megatron_config": null,
+   "megatron_core": "megatron.core",
+   "modules_to_save": null,
+   "peft_type": "LORA",
+   "r": 32,
+   "rank_pattern": {},
+   "revision": null,
+   "target_modules": [
+     "o_proj",
+     "down_proj",
+     "gate_proj",
+     "up_proj",
+     "q_proj",
+     "v_proj",
+     "k_proj"
+   ],
+   "task_type": "CAUSAL_LM",
+   "use_dora": false,
+   "use_rslora": false
+ }
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9b1e5ceee92c2ec998a1a0f413dd78aa9b73204e90dca4ae6d99f0667fdb7b99
+ size 69527352
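As a sanity check on the adapter files above, the sketch below estimates how many LoRA parameters `r` and `target_modules` in `adapter_config.json` imply, and compares that with the `size` in the LFS pointer. The layer dimensions (hidden size 960, MLP size 2560, 32 layers, key/value projection width 320) are assumptions taken from the published SmolLM2-360M configuration rather than from this commit, and 4-byte weights are likewise an assumption.

```python
# Assumed SmolLM2-360M dimensions (not stated in this commit).
HIDDEN = 960
INTERMEDIATE = 2560
KV_WIDTH = 320  # assumed: 5 key/value heads x 64 dims per head
NUM_LAYERS = 32

# Values taken from adapter_config.json.
RANK = 32
LORA_ALPHA = 64

# (input width, output width) of each targeted linear layer.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_WIDTH),
    "v_proj": (HIDDEN, KV_WIDTH),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(shapes, rank, num_layers):
    """Each adapted layer adds A (rank x fan_in) and B (fan_out x rank)."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers


total = lora_param_count(SHAPES, RANK, NUM_LAYERS)
scaling = LORA_ALPHA / RANK  # standard LoRA scaling, since use_rslora is false

print(f"LoRA parameters: {total:,}")
print(f"Approx. size at 4 bytes each: {total * 4:,} bytes")
print(f"LoRA scaling (alpha / r): {scaling}")
```

Under these assumptions the estimate comes to 17,367,040 parameters, or 69,468,160 bytes, roughly 59 KB below the 69,527,352 bytes recorded in the pointer. A gap of that size would be consistent with file header metadata, though this commit does not confirm the storage dtype.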
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
special_tokens_map.json ADDED
@@ -0,0 +1,43 @@
+ {
+   "additional_special_tokens": [
+     "<|endoftext|>",
+     "<|im_start|>",
+     "<|im_end|>",
+     "<repo_name>",
+     "<reponame>",
+     "<file_sep>",
+     "<filename>",
+     "<gh_stars>",
+     "<issue_start>",
+     "<issue_comment>",
+     "<issue_closed>",
+     "<jupyter_start>",
+     "<jupyter_text>",
+     "<jupyter_code>",
+     "<jupyter_output>",
+     "<jupyter_script>",
+     "<empty_output>"
+   ],
+   "bos_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "<|endoftext|>",
+   "unk_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,168 @@
+ {
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<|endoftext|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<|im_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "<|im_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<repo_name>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "<reponame>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "5": {
+       "content": "<file_sep>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "6": {
+       "content": "<filename>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "7": {
+       "content": "<gh_stars>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "8": {
+       "content": "<issue_start>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "9": {
+       "content": "<issue_comment>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "10": {
+       "content": "<issue_closed>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "11": {
+       "content": "<jupyter_start>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "12": {
+       "content": "<jupyter_text>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "13": {
+       "content": "<jupyter_code>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "14": {
+       "content": "<jupyter_output>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "15": {
+       "content": "<jupyter_script>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "16": {
+       "content": "<empty_output>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [
+     "<|endoftext|>",
+     "<|im_start|>",
+     "<|im_end|>",
+     "<repo_name>",
+     "<reponame>",
+     "<file_sep>",
+     "<filename>",
+     "<gh_stars>",
+     "<issue_start>",
+     "<issue_comment>",
+     "<issue_closed>",
+     "<jupyter_start>",
+     "<jupyter_text>",
+     "<jupyter_code>",
+     "<jupyter_output>",
+     "<jupyter_script>",
+     "<empty_output>"
+   ],
+   "bos_token": "<|endoftext|>",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|endoftext|>",
+   "model_max_length": 8192,
+   "pad_token": "<|endoftext|>",
+   "tokenizer_class": "GPT2Tokenizer",
+   "unk_token": "<|endoftext|>",
+   "vocab_size": 49152
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff