Add Stage‑2 model card
README.md
ADDED
@@ -0,0 +1,181 @@
---
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
tags:
- sft
- structured-generation
- table-generation
- llama-3.1
- flash-attention-2
license: other
language:
- en
---

# mohdusman001/Text-to-Table-Stage1

Stage‑2 (π₂) **text + schema → table** model fine‑tuned from **meta-llama/Meta-Llama-3.1-8B-Instruct** with a 3‑stage schedule
(2k → 4k → **8k** context). This repo includes merged weights + tokenizer and sample artifacts.

## TL;DR
- Context window: **8192** tokens (final stage)
- Final eval (loss / ppl): 2.395211 / 10.9705
- Sanity (json_valid / key_order / type): 0.078 / 0.078 / 0.078
- Artifacts: see `metrics/` and `samples/`.
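
The loss and perplexity figures above can be cross-checked against each other. The sketch below assumes the eval loss is a mean token-level cross-entropy in nats, which the card does not state explicitly; under that assumption perplexity is simply `exp(loss)`.

```python
import math

eval_loss = 2.395211  # final eval loss reported in the TL;DR
# Assumption: the loss is mean cross-entropy in nats, so ppl = exp(loss).
ppl = math.exp(eval_loss)
print(f"{ppl:.4f}")  # should land very close to the reported 10.9705
```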

## How to prompt (JSONL rows)
The model expects a schema and a document snippet. It should emit **one JSON object per line**,
with keys **exactly** in schema order (no code fences, no prose).

```text
[SCHEMA]
{"fields":[{"name":"order_id","type":"string"},{"name":"item","type":"string"},{"name":"qty","type":"integer"}]}

<|document|>
Orders today:
- O-1003: 2x pencil
- O-1004: 1x notebook
```
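
The prompt layout above can also be assembled programmatically. The following is a minimal sketch; `build_prompt` is a hypothetical helper that is not shipped with the model, and it only covers the short `[SCHEMA]` / `<|document|>` form shown here.

```python
import json


def build_prompt(fields, document):
    """Assemble the short [SCHEMA] / <|document|> prompt.

    `fields` is a list of (name, type) pairs. This helper is illustrative
    only and is not part of the model repository.
    """
    schema = {"fields": [{"name": name, "type": type_} for name, type_ in fields]}
    # Compact separators reproduce the whitespace-free schema line in the example.
    schema_line = json.dumps(schema, separators=(",", ":"))
    return f"[SCHEMA]\n{schema_line}\n\n<|document|>\n{document.strip()}\n"


prompt = build_prompt(
    [("order_id", "string"), ("item", "string"), ("qty", "integer")],
    "Orders today:\n- O-1003: 2x pencil\n- O-1004: 1x notebook",
)
print(prompt)
```

Building the schema line with `json.dumps` keeps the field order identical to the order of the input list, which matters because the model is expected to emit keys in schema order.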

### Python usage (deterministic table emission)
```python
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mohdusman001/Text-to-Table-Stage1"
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

# FlashAttention-2 needs a CUDA GPU and the flash-attn package; fall back to SDPA on CPU.
use_cuda = torch.cuda.is_available()
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16 if use_cuda else torch.float32,
    device_map="auto",
    attn_implementation="flash_attention_2" if use_cuda else "sdpa",
)

prompt = (
    "[SCHEMA]\n"
    '{"fields":[{"name":"order_id","type":"string"},{"name":"item","type":"string"},{"name":"qty","type":"integer"}]}\n\n'
    "<|document|>\n"
    "Orders today:\n"
    "- O-1003: 2x pencil\n"
    "- O-1004: 1x notebook\n"
)
chat = [{"role": "user", "content": prompt}]
txt = tok.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
# The chat template already inserts the BOS token, so do not add it a second time.
inp = tok(txt, return_tensors="pt", add_special_tokens=False).to(model.device)

with torch.no_grad():
    # Greedy decoding; `temperature` is omitted because it has no effect when do_sample=False.
    out = model.generate(**inp, max_new_tokens=256, do_sample=False,
                         eos_token_id=tok.eos_token_id, pad_token_id=tok.pad_token_id)
# Decode only the newly generated tokens, not the prompt.
generated = tok.decode(out[0][inp["input_ids"].shape[1]:], skip_special_tokens=True)

rows = []
for line in generated.splitlines():
    if not line.strip():
        continue
    try:
        rows.append(json.loads(line))
    except json.JSONDecodeError:
        pass  # skip prose or fence lines instead of crashing
print(rows)  # list of dicts
```

## Example generations
**Sample 1**

**user**: [SCHEMA]
{"fields": [{"name": "order_id", "type": "string"}, {"name": "item", "type": "string"}, {"name": "qty", "type": "integer"}]}

<|document|>
Orders today:
- O-1003: 2x pencil
- O-1004: 1x notebook

**generation (head)**:

```
Based on the provided schema, here's a breakdown of the data:

- **order_id**: a unique identifier for each order (e.g., O-1003, O-1004)
- **item**: the product being ordered (e.g., pencil, notebook)
- **qty**: the quantity of the item ordered (e.g., 2, 1)

So, the data can be represented as a list of dictionaries, where each dictionary represents an order:

```json
[
```

---

**Sample 2**

**user**: [SCHEMA]
{"fields": [{"name": "name", "type": "string"}, {"name": "department", "type": "string"}, {"name": "start_date", "type": "date"}, {"name": "full_time", "type": "boolean"}]}

<|document|>
New hires:
- Priya Singh joined Sales on 2024-08-01 full‑time.
- Ken Adams joined Engineering on 2024-09-15 part‑time.

**generation (head)**:

```
Based on the provided schema and the given information, we can create a JSON object for each new hire. However, since the schema specifies that the "full_time" field should be a boolean, we will assume that "part-time" is equivalent to False and "full-time" is equivalent to True.

Here's the JSON data:

```json
[
  {
    "name": "Priya Singh",
    "department": "Sales",
    "start_date": "2024-08-01",
```
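
In Samples 1 and 2 the generation opens with prose and a fenced `json` array rather than raw JSONL. A tolerant post-processor can sometimes recover rows from complete outputs of that shape. The sketch below is an assumption about how one might do this downstream; `salvage_rows` is a hypothetical helper, and the `chatty` input is made up for illustration (the generations above are truncated heads, from which nothing can be recovered).

```python
import json
import re

FENCE = "`" * 3  # built programmatically to avoid a literal fence inside this block
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*\n(.*?)" + FENCE, re.DOTALL)


def salvage_rows(text):
    """Best-effort recovery of row dicts from a chatty generation."""
    # Prefer the contents of fenced blocks; otherwise fall back to the whole text.
    chunks = FENCE_RE.findall(text) or [text]
    rows = []
    for chunk in chunks:
        try:
            parsed = json.loads(chunk)  # the chunk may be a single JSON value
        except json.JSONDecodeError:
            parsed = None
        if isinstance(parsed, list):
            rows.extend(r for r in parsed if isinstance(r, dict))
            continue
        if isinstance(parsed, dict):
            rows.append(parsed)
            continue
        # Otherwise treat the chunk as JSONL and keep only lines that parse to objects.
        for line in chunk.splitlines():
            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                continue
            if isinstance(obj, dict):
                rows.append(obj)
    return rows


# Made-up example input, not an actual model output.
chatty = (
    "Here is the data:\n" + FENCE + "json\n"
    '[{"order_id": "O-1003", "item": "pencil", "qty": 2}]\n' + FENCE + "\n"
)
print(salvage_rows(chatty))
```

Salvaging is a fallback only: rows recovered this way still need to be checked against the schema, since key order and types are not verified here.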

---

**Sample 3**

**system**: You convert documents into tabular data strictly under a provided JSON schema. Output ONLY JSON Lines (one JSON object per row), with EXACT columns and order as the schema, and no commentary.
**user**: <|policy|>
[POLICY]
- Extract only facts explicitly supported by the document. No guessing, no background knowledge, no synonyms.
- Never invent rows or columns. If a value is not present, output an empty string for that cell.
- Output exactly the columns listed in [SCHEMA]. The key order in each JSON object MUST match the schema order.
- Do not add headers, comments, explanations, or markdown. Emit ONLY raw JSONL (one JSON object per line).
- Output must be deterministic for identical input.
- Trim leading/trailing whitespace in strings; preserve internal spacing and case from the document.
- IDs/codes stay strings (preserve leading zeros). Do not convert units or reformat currencies.
- Booleans accept tokens ⊆ {true,false,yes,no,1,0,t,f} (case-insensitive). Keep the surface form unless the schema requires normalization.
- Integers: ^[+-]?\d+$ (after trim). Numbers: plain JSON numbers if unambiguous; otherwise keep as strings.
- Dates: prefer ISO-like YYYY, YYYY-MM, or YYYY-MM-DD if explicitly present; otherwise keep as strings.
- Treat as missing (case-insensitive, after trim): "", -, —, –, N/A, NA, None, Null, Unknown, TBD, and the na_token from [METADATA]. For missing values, emit empty string "".
- For pivot-like tables, emit a single row per entity with all columns populated when available.
- For key–value (slot/value) tables, emit one row per pair with exactly the two columns from the schema.
- No trailing commas. Ensure every line is valid JSON. Do not wrap rows in an array. No code fences.

<|metadata|>
[METADATA]
{"language": "en", "script": "auto", "direction": "auto", "source_modality": "plain_text", "na_token": "", "document_char_len": 76, "table_count": "auto", "structure_candidates": ["kv_single", "kv_multi", "flat_single_row", "row_grouped"], "table_hints": {"header_rows_max": 3, "header_cols_max": 2, "row_header_possible": true, "col_header_possible": true, "ragged_rows_possible": false, "multiple_tables_possible": true}, "locale": {"numeric": "auto", "decimal_separators": [".", ","], "thousand_separators": [",", ".", " "], "negative_patterns": ["-x", "(x)"], "percent_symbol": "%", "permille_symbol": "‰"}, "parsing_hints": {"kv_markers": [":", "—", "–", "=", "→"], "list_markers": ["- ", "* ", "• ", "•\t", "▪ "], "section_markers": ["###", "##", "--", "__", "**", ":"]}, "missing_tokens": ["", "-", "—", "–", "N/A", "NA", "None", "Null", "Unknown", "TBD"]}

<|schema|>
[SCHEMA]
{"schema_id": "gold.e2137f3600e1", "structure": "kv_multi", "description": "Schema induced from the gold table.", "fields": [{"name": "slot", "type": "string", "required": true}, {"name": "value", "type": "string", "required": true}], "constraints": {"enums": {"slot": ["name", "area", "familyFriendly"], "value": ["Alimentum", "city centre", "no"]}}}

<|document|>
[DOCUMENT]
There is a place in the city centre, Alimentum, that is not family-friendly.

<|output_format|>
[OUTPUT_FORMAT]
Emit ONLY JSON Lines (JSONL), one JSON object per row, with keys exactly as in [SCHEMA] and in the same order. Do NOT wrap in an array. No comments or markdown.

<|task|>
[TASK]
Fill the table under the given [SCHEMA] using facts from [DOCUMENT]. If a value is not explicitly present, output an empty string for that cell. Ensure every line is valid JSON.
**assistant**: {"slot": "name", "value": "Alimentum"}
{"slot": "area", "value": "city centre"}
{"slot": "familyFriendly", "value": "no"}

**generation (head)**:

```
I've generated the output based on the provided schema and document.
```
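
The `[POLICY]` block in Sample 3 states several cell-level rules precisely enough to encode: the missing-token set, the integer pattern, and the accepted boolean tokens. The sketch below shows one way a downstream checker might express them. The function names are hypothetical and this is not the code used to train or evaluate the model.

```python
import re

# Token sets and the integer pattern are copied from the [POLICY] text in Sample 3.
MISSING_TOKENS = {"", "-", "—", "–", "n/a", "na", "none", "null", "unknown", "tbd"}
BOOLEAN_TOKENS = {"true", "false", "yes", "no", "1", "0", "t", "f"}
INTEGER_RE = re.compile(r"^[+-]?\d+$")


def is_missing(cell, na_token=""):
    """True if the cell counts as missing (case-insensitive, after trim)."""
    value = cell.strip().lower()
    return value in MISSING_TOKENS or value == na_token.strip().lower()


def is_integer(cell):
    """True if the trimmed cell matches the policy's integer pattern."""
    return INTEGER_RE.match(cell.strip()) is not None


def is_boolean(cell):
    """True if the trimmed cell is one of the accepted boolean tokens."""
    return cell.strip().lower() in BOOLEAN_TOKENS


def normalize_cell(cell, na_token=""):
    """Trim the cell and map missing markers to the empty string."""
    return "" if is_missing(cell, na_token) else cell.strip()


print(normalize_cell("  N/A "), repr(normalize_cell(" city centre ")))
```

Internal spacing and case are left untouched by `normalize_cell`, in line with the policy's instruction to trim only leading and trailing whitespace.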

## Notes
* Trained with FlashAttention‑2 (FA‑2) and FSDP; LoRA was used in the earlier stages and then merged for the final artifact.
* For higher validity, keep decoding **deterministic** (no sampling) and validate every output line downstream.
* Respect the base model license and your data licenses.
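
Downstream validation can be as simple as checking each output line for valid JSON, schema key order, and value types. The sketch below is one possible shape for such a check. `check_line`, `type_ok`, and the `TYPE_CHECKS` mapping are hypothetical, and the card does not say how its own `json_valid / key_order / type` sanity figures were computed.

```python
import json

# Hypothetical mapping from schema type names to Python checks; the card defines none.
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
}


def type_ok(value, type_name):
    if value == "":  # Sample 3's policy uses "" for missing cells
        return True
    check = TYPE_CHECKS.get(type_name)
    # Type names without a check (for example "date") are accepted as-is.
    return True if check is None else check(value)


def check_line(line, schema):
    """Return (json_valid, key_order_ok, types_ok) for one output line."""
    try:
        row = json.loads(line)
    except json.JSONDecodeError:
        return False, False, False
    if not isinstance(row, dict):
        return True, False, False
    names = [field["name"] for field in schema["fields"]]
    key_order_ok = list(row) == names  # dicts preserve the order keys were parsed in
    types_ok = key_order_ok and all(
        type_ok(row[field["name"]], field["type"]) for field in schema["fields"]
    )
    return True, key_order_ok, types_ok


schema = {"fields": [
    {"name": "order_id", "type": "string"},
    {"name": "item", "type": "string"},
    {"name": "qty", "type": "integer"},
]}
print(check_line('{"order_id": "O-1003", "item": "pencil", "qty": 2}', schema))
print(check_line("Based on the provided schema, here's a breakdown:", schema))
```

Averaging each of the three flags over all output lines would give per-corpus rates, though whether that matches the card's metric definitions is not known.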