mohdusman001 committed on
Commit
d624059
·
verified ·
1 Parent(s): 7981b17

Add Stage‑2 model card

Files changed (1)
  1. README.md +181 -0
README.md ADDED
@@ -0,0 +1,181 @@
+ ---
+ base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
+ tags:
+ - sft
+ - structured-generation
+ - table-generation
+ - llama-3.1
+ - flash-attention-2
+ license: other
+ language:
+ - en
+ ---
+
+ # mohdusman001/Text-to-Table-Stage1
+
+ Stage‑2 (π₂) **text + schema → table** model, fine‑tuned from **meta-llama/Meta-Llama-3.1-8B-Instruct** with a three‑stage context-length schedule
+ (2k → 4k → **8k** tokens). This repo includes the merged weights, the tokenizer, and sample artifacts.
+
+ ## TL;DR
+ - Context window: **8192** tokens (final stage)
+ - Final eval (loss / ppl): 2.395211 / 10.9705
+ - Sanity checks (json_valid / key_order / type): 0.078 / 0.078 / 0.078
+ - Artifacts: see `metrics/` and `samples/`.
+
+ ## How to prompt (JSONL rows)
+ The model expects a schema and a document snippet. It should emit **one JSON object per line**,
+ with keys **exactly** in schema order (no code fences, no prose).
+
+ ```text
+ [SCHEMA]
+ {"fields":[{"name":"order_id","type":"string"},{"name":"item","type":"string"},{"name":"qty","type":"integer"}]}
+
+ <|document|>
+ Orders today:
+ - O-1003: 2x pencil
+ - O-1004: 1x notebook
+ ```
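
The prompt layout above can be assembled programmatically. The following is a minimal sketch, assuming only the two-section `[SCHEMA]` / `<|document|>` format shown in this section; `build_prompt` is a hypothetical helper that does not ship with this repo, and it does not cover the longer policy/metadata prompt used in Sample 3.

```python
import json


def build_prompt(schema: dict, document: str) -> str:
    """Assemble a [SCHEMA] + <|document|> prompt (hypothetical helper)."""
    # Compact separators reproduce the single-line schema as written in the card.
    schema_json = json.dumps(schema, separators=(",", ":"), ensure_ascii=False)
    return f"[SCHEMA]\n{schema_json}\n\n<|document|>\n{document.strip()}\n"


# Usage: the same order example as in the card.
schema = {
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "item", "type": "string"},
        {"name": "qty", "type": "integer"},
    ]
}
document = "Orders today:\n- O-1003: 2x pencil\n- O-1004: 1x notebook"
prompt = build_prompt(schema, document)
print(prompt)
```

Building the schema line with `json.dumps` rather than by hand keeps the field order in the prompt identical to the order later used to check the emitted rows.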
+
+ ### Python usage (deterministic table emission)
+ ```python
+ import json
+
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "mohdusman001/Text-to-Table-Stage1"
+ tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
+ if tok.pad_token is None:
+     tok.pad_token = tok.eos_token
+
+ # FlashAttention-2 needs a CUDA GPU and the flash-attn package; fall back to SDPA otherwise.
+ use_cuda = torch.cuda.is_available()
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16 if use_cuda else torch.float32,
+     device_map="auto",
+     attn_implementation="flash_attention_2" if use_cuda else "sdpa",
+ )
+
+ # Newlines are written as explicit \n escapes so each string literal stays on one line.
+ prompt = (
+     "[SCHEMA]\n"
+     '{"fields":[{"name":"order_id","type":"string"},{"name":"item","type":"string"},{"name":"qty","type":"integer"}]}\n\n'
+     "<|document|>\n"
+     "Orders today:\n"
+     "- O-1003: 2x pencil\n"
+     "- O-1004: 1x notebook\n"
+ )
+ chat = [{"role": "user", "content": prompt}]
+ txt = tok.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
+ inp = tok(txt, return_tensors="pt").to(model.device)
+
+ with torch.no_grad():
+     out = model.generate(
+         **inp,
+         max_new_tokens=256,
+         do_sample=False,  # greedy decoding; temperature has no effect when sampling is off
+         eos_token_id=tok.eos_token_id,
+         pad_token_id=tok.pad_token_id,
+     )
+ # Decode only the newly generated tokens, not the prompt.
+ generated = tok.decode(out[0][inp["input_ids"].shape[1]:], skip_special_tokens=True)
+
+ # json.loads raises json.JSONDecodeError if the model emits a line that is not valid JSON.
+ rows = [json.loads(l) for l in generated.splitlines() if l.strip()]
+ print(rows)  # list of dicts
+ ```
+
+ ## Example generations
+ **Sample 1**
+
+ **user**: [SCHEMA]
+ {"fields": [{"name": "order_id", "type": "string"}, {"name": "item", "type": "string"}, {"name": "qty", "type": "integer"}]}
+
+ <|document|>
+ Orders today:
+ - O-1003: 2x pencil
+ - O-1004: 1x notebook
+
+ **generation (head)**:
+
+ ```
+ Based on the provided schema, here's a breakdown of the data:
+
+ - **order_id**: a unique identifier for each order (e.g., O-1003, O-1004)
+ - **item**: the product being ordered (e.g., pencil, notebook)
+ - **qty**: the quantity of the item ordered (e.g., 2, 1)
+
+ So, the data can be represented as a list of dictionaries, where each dictionary represents an order:
+
+ ```json
+ [
+ ```
+
+ ---
+
+ **Sample 2**
+
+ **user**: [SCHEMA]
+ {"fields": [{"name": "name", "type": "string"}, {"name": "department", "type": "string"}, {"name": "start_date", "type": "date"}, {"name": "full_time", "type": "boolean"}]}
+
+ <|document|>
+ New hires:
+ - Priya Singh joined Sales on 2024-08-01 full‑time.
+ - Ken Adams joined Engineering on 2024-09-15 part‑time.
+
+ **generation (head)**:
+
+ ```
+ Based on the provided schema and the given information, we can create a JSON object for each new hire. However, since the schema specifies that the "full_time" field should be a boolean, we will assume that "part-time" is equivalent to False and "full-time" is equivalent to True.
+
+ Here's the JSON data:
+
+ ```json
+ [
+ {
+ "name": "Priya Singh",
+ "department": "Sales",
+ "start_date": "2024-08-01",
+ ```
+
+ ---
+
+ **Sample 3**
+
+ **system**: You convert documents into tabular data strictly under a provided JSON schema. Output ONLY JSON Lines (one JSON object per row), with EXACT columns and order as the schema, and no commentary.
+ **user**: <|policy|>
+ [POLICY]
+ - Extract only facts explicitly supported by the document. No guessing, no background knowledge, no synonyms.
+ - Never invent rows or columns. If a value is not present, output an empty string for that cell.
+ - Output exactly the columns listed in [SCHEMA]. The key order in each JSON object MUST match the schema order.
+ - Do not add headers, comments, explanations, or markdown. Emit ONLY raw JSONL (one JSON object per line).
+ - Output must be deterministic for identical input.
+ - Trim leading/trailing whitespace in strings; preserve internal spacing and case from the document.
+ - IDs/codes stay strings (preserve leading zeros). Do not convert units or reformat currencies.
+ - Booleans accept tokens ⊆ {true,false,yes,no,1,0,t,f} (case-insensitive). Keep the surface form unless the schema requires normalization.
+ - Integers: ^[+-]?\d+$ (after trim). Numbers: plain JSON numbers if unambiguous; otherwise keep as strings.
+ - Dates: prefer ISO-like YYYY, YYYY-MM, or YYYY-MM-DD if explicitly present; otherwise keep as strings.
+ - Treat as missing (case-insensitive, after trim): "", -, —, –, N/A, NA, None, Null, Unknown, TBD, and the na_token from [METADATA]. For missing values, emit empty string "".
+ - For pivot-like tables, emit a single row per entity with all columns populated when available.
+ - For key–value (slot/value) tables, emit one row per pair with exactly the two columns from the schema.
+ - No trailing commas. Ensure every line is valid JSON. Do not wrap rows in an array. No code fences.
+
+ <|metadata|>
+ [METADATA]
+ {"language": "en", "script": "auto", "direction": "auto", "source_modality": "plain_text", "na_token": "", "document_char_len": 76, "table_count": "auto", "structure_candidates": ["kv_single", "kv_multi", "flat_single_row", "row_grouped"], "table_hints": {"header_rows_max": 3, "header_cols_max": 2, "row_header_possible": true, "col_header_possible": true, "ragged_rows_possible": false, "multiple_tables_possible": true}, "locale": {"numeric": "auto", "decimal_separators": [".", ","], "thousand_separators": [",", ".", " "], "negative_patterns": ["-x", "(x)"], "percent_symbol": "%", "permille_symbol": "‰"}, "parsing_hints": {"kv_markers": [":", "—", "–", "=", "→"], "list_markers": ["- ", "* ", "• ", "•\t", "▪ "], "section_markers": ["###", "##", "--", "__", "**", ":"]}, "missing_tokens": ["", "-", "—", "–", "N/A", "NA", "None", "Null", "Unknown", "TBD"]}
+
+ <|schema|>
+ [SCHEMA]
+ {"schema_id": "gold.e2137f3600e1", "structure": "kv_multi", "description": "Schema induced from the gold table.", "fields": [{"name": "slot", "type": "string", "required": true}, {"name": "value", "type": "string", "required": true}], "constraints": {"enums": {"slot": ["name", "area", "familyFriendly"], "value": ["Alimentum", "city centre", "no"]}}}
+
+ <|document|>
+ [DOCUMENT]
+ There is a place in the city centre, Alimentum, that is not family-friendly.
+
+ <|output_format|>
+ [OUTPUT_FORMAT]
+ Emit ONLY JSON Lines (JSONL), one JSON object per row, with keys exactly as in [SCHEMA] and in the same order. Do NOT wrap in an array. No comments or markdown.
+
+ <|task|>
+ [TASK]
+ Fill the table under the given [SCHEMA] using facts from [DOCUMENT]. If a value is not explicitly present, output an empty string for that cell. Ensure every line is valid JSON.
+ **assistant**: {"slot": "name", "value": "Alimentum"}
+ {"slot": "area", "value": "city centre"}
+ {"slot": "familyFriendly", "value": "no"}
+
+ **generation (head)**:
+
+ ```
+ I've generated the output based on the provided schema and document.
+ ```
+
+ ## Notes
+ * Trained with FlashAttention‑2 (FA‑2) and FSDP; LoRA adapters were used in the earlier stages and then merged into the final weights.
+ * For higher validity, keep decoding **deterministic** (no sampling) and validate each output line downstream.
+ * Respect the base model license and your data licenses.
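
The downstream validation recommended in the notes could look roughly like the sketch below. It checks each emitted line for the three properties named in the TL;DR (valid JSON, key order, type). The card does not say how those sanity scores were computed, so `check_line`, `validate_jsonl`, and the type rules are assumptions made for illustration, not the author's evaluation code.

```python
import json
import re

# Assumed date shapes (YYYY, YYYY-MM, YYYY-MM-DD), taken from the policy text in Sample 3.
_DATE_RE = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")


def _type_ok(value, type_name):
    """Hypothetical type rules for the type names that appear in the card's schemas."""
    if type_name == "string":
        return isinstance(value, str)
    if type_name == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    if type_name == "boolean":
        return isinstance(value, bool)
    if type_name == "date":
        return isinstance(value, str) and bool(_DATE_RE.match(value))
    return True  # unknown type names are not checked


def check_line(line, schema):
    """Return (json_valid, key_order_ok, types_ok) for one output line."""
    try:
        row = json.loads(line)
    except json.JSONDecodeError:
        return (False, False, False)
    if not isinstance(row, dict):
        # Parses as JSON but is not an object, so it cannot be a table row.
        return (False, False, False)
    fields = schema["fields"]
    # json.loads keeps keys in document order, so a list comparison checks the order too.
    key_order_ok = list(row.keys()) == [f["name"] for f in fields]
    types_ok = key_order_ok and all(_type_ok(row[f["name"]], f["type"]) for f in fields)
    return (True, key_order_ok, types_ok)


def validate_jsonl(generated, schema):
    """Check every non-empty line of a generation."""
    return [check_line(l, schema) for l in generated.splitlines() if l.strip()]


schema = {
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "item", "type": "string"},
        {"name": "qty", "type": "integer"},
    ]
}
good = '{"order_id": "O-1003", "item": "pencil", "qty": 2}'
print(validate_jsonl(good + "\nBased on the provided schema", schema))
```

Rows that fail any check can then be dropped or sent back for regeneration. Note that the policy prompt in Sample 3 asks for an empty string when a value is missing, so a stricter or looser type rule may be appropriate depending on the prompt used.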