---
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
tags:
- sft
- structured-generation
- table-generation
- llama-3.1
- flash-attention-2
license: other
language:
- en
---

# mohdusman001/Text-to-Table-Stage2

Stage‑2 (π₂) **text + schema → table** model fine‑tuned from **meta-llama/Meta-Llama-3.1-8B-Instruct** with a 3‑stage context schedule (2k → 4k → **8k** tokens). This repo includes the merged weights, the tokenizer, and sample artifacts.

## TL;DR
- Context window: **8192** tokens (final stage)
- Final eval (loss / ppl): 2.395211 / 10.9705
- Sanity (json_valid / key_order / type): 0.078 / 0.078 / 0.078
- Artifacts: see `metrics/` and `samples/`.
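
The two final-eval numbers above are consistent with the common convention that perplexity is the exponential of the mean cross-entropy loss in nats. The card does not state how the perplexity was computed, so that convention is an assumption here; a quick check under it:

```python
import math

eval_loss = 2.395211    # final eval loss, copied from the list above
reported_ppl = 10.9705  # final eval perplexity, copied from the list above

# Assumption: perplexity = exp(mean cross-entropy loss in nats).
ppl = math.exp(eval_loss)

# The recomputed value agrees with the reported one to about three decimals.
print(abs(ppl - reported_ppl) < 1e-3)
```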

## How to prompt (JSONL rows)
The model expects a schema and a document snippet. It should emit **one JSON object per line**, with keys **exactly** in schema order (no code fences, no prose).

```text
[SCHEMA]
{"fields":[{"name":"order_id","type":"string"},{"name":"item","type":"string"},{"name":"qty","type":"integer"}]}

<|document|>
Orders today:
- O-1003: 2x pencil
- O-1004: 1x notebook
```
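
A prompt in this layout can be assembled from a field list rather than typed by hand. The sketch below is only an illustration: `build_prompt` is a hypothetical helper that is not part of this repo, and it assumes the compact, whitespace-free schema line shown in the example is the form the model should see.

```python
import json


def build_prompt(fields, document):
    """Assemble the [SCHEMA] / <|document|> prompt layout shown above.

    Hypothetical helper (not shipped with this repo).
    `fields` is a list of (name, type) pairs in the desired column order.
    """
    schema = {"fields": [{"name": name, "type": type_} for name, type_ in fields]}
    # Compact separators reproduce the whitespace-free schema line of the example.
    schema_line = json.dumps(schema, separators=(",", ":"), ensure_ascii=False)
    return f"[SCHEMA]\n{schema_line}\n\n<|document|>\n{document.strip()}\n"


prompt = build_prompt(
    [("order_id", "string"), ("item", "string"), ("qty", "integer")],
    "Orders today:\n- O-1003: 2x pencil\n- O-1004: 1x notebook",
)
print(prompt)
```

Building the schema from an ordered list keeps the column order explicit, which matters because the output keys are expected to follow schema order.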

### Python usage (deterministic table emission)
```python
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mohdusman001/Text-to-Table-Stage2"
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token  # reuse EOS for padding when no pad token is set

# bfloat16 + FlashAttention-2 on GPU; float32 + SDPA on CPU,
# because FlashAttention-2 needs CUDA and a half-precision dtype.
use_cuda = torch.cuda.is_available()
dtype = torch.bfloat16 if use_cuda else torch.float32
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=dtype,
    device_map="auto",
    attn_implementation="flash_attention_2" if use_cuda else "sdpa",
)

# Schema line, a blank line, then the document section.
prompt = (
    "[SCHEMA]\n"
    '{"fields":[{"name":"order_id","type":"string"},{"name":"item","type":"string"},{"name":"qty","type":"integer"}]}\n\n'
    "<|document|>\n"
    "Orders today:\n"
    "- O-1003: 2x pencil\n"
    "- O-1004: 1x notebook\n"
)
chat = [{"role": "user", "content": prompt}]
txt = tok.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inp = tok(txt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(
        **inp,
        max_new_tokens=256,
        do_sample=False,  # greedy decoding; temperature is unused without sampling
        eos_token_id=tok.eos_token_id,
        pad_token_id=tok.pad_token_id,
    )

# Decode only the newly generated tokens, not the prompt.
generated = tok.decode(out[0][inp["input_ids"].shape[1]:], skip_special_tokens=True)

# Raises json.JSONDecodeError if any non-empty line is not valid JSON.
rows = [json.loads(l) for l in generated.splitlines() if l.strip()]
print(rows)  # list of dicts
```
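
The final list comprehension in the snippet above raises `json.JSONDecodeError` as soon as one non-empty line is not valid JSON, which happens whenever the output contains prose or code fences. A more tolerant variant could keep the lines that parse to JSON objects and collect the rest for inspection. This is a minimal sketch: `parse_jsonl_rows` is a hypothetical helper and the `sample` string is made-up input, not real model output.

```python
import json


def parse_jsonl_rows(generated):
    """Return (rows, rejected): parsed JSON objects and the lines that were skipped."""
    rows, rejected = [], []
    for line in generated.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            rejected.append(line)  # prose, fences, or partial JSON
            continue
        if isinstance(obj, dict):
            rows.append(obj)
        else:
            rejected.append(line)  # valid JSON, but not an object (e.g. a number)
    return rows, rejected


# Made-up output that mixes commentary with two valid rows.
sample = (
    "Here is the table:\n"
    '{"order_id": "O-1003", "item": "pencil", "qty": 2}\n'
    "[\n"
    '{"order_id": "O-1004", "item": "notebook", "qty": 1}\n'
)
rows, rejected = parse_jsonl_rows(sample)
print(len(rows), rejected)
```

Returning the rejected lines alongside the rows makes it easy to log how often the output strays from pure JSONL instead of silently dropping text.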

## Example generations
**Sample 1**

**user**: [SCHEMA]
{"fields": [{"name": "order_id", "type": "string"}, {"name": "item", "type": "string"}, {"name": "qty", "type": "integer"}]}

<|document|>
Orders today:
- O-1003: 2x pencil
- O-1004: 1x notebook

**generation (head)**:

```
Based on the provided schema, here's a breakdown of the data:

- **order_id**: a unique identifier for each order (e.g., O-1003, O-1004)
- **item**: the product being ordered (e.g., pencil, notebook)
- **qty**: the quantity of the item ordered (e.g., 2, 1)

So, the data can be represented as a list of dictionaries, where each dictionary represents an order:

```json
[
```

---

**Sample 2**

**user**: [SCHEMA]
{"fields": [{"name": "name", "type": "string"}, {"name": "department", "type": "string"}, {"name": "start_date", "type": "date"}, {"name": "full_time", "type": "boolean"}]}

<|document|>
New hires:
- Priya Singh joined Sales on 2024-08-01 full‑time.
- Ken Adams joined Engineering on 2024-09-15 part‑time.

**generation (head)**:

```
Based on the provided schema and the given information, we can create a JSON object for each new hire. However, since the schema specifies that the "full_time" field should be a boolean, we will assume that "part-time" is equivalent to False and "full-time" is equivalent to True.

Here's the JSON data:

```json
[
  {
    "name": "Priya Singh",
    "department": "Sales",
    "start_date": "2024-08-01",
```

---

**Sample 3**

**system**: You convert documents into tabular data strictly under a provided JSON schema. Output ONLY JSON Lines (one JSON object per row), with EXACT columns and order as the schema, and no commentary.
**user**: <|policy|>
[POLICY]
- Extract only facts explicitly supported by the document. No guessing, no background knowledge, no synonyms.
- Never invent rows or columns. If a value is not present, output an empty string for that cell.
- Output exactly the columns listed in [SCHEMA]. The key order in each JSON object MUST match the schema order.
- Do not add headers, comments, explanations, or markdown. Emit ONLY raw JSONL (one JSON object per line).
- Output must be deterministic for identical input.
- Trim leading/trailing whitespace in strings; preserve internal spacing and case from the document.
- IDs/codes stay strings (preserve leading zeros). Do not convert units or reformat currencies.
- Booleans accept tokens ⊆ {true,false,yes,no,1,0,t,f} (case-insensitive). Keep the surface form unless the schema requires normalization.
- Integers: ^[+-]?\d+$ (after trim). Numbers: plain JSON numbers if unambiguous; otherwise keep as strings.
- Dates: prefer ISO-like YYYY, YYYY-MM, or YYYY-MM-DD if explicitly present; otherwise keep as strings.
- Treat as missing (case-insensitive, after trim): "", -, —, –, N/A, NA, None, Null, Unknown, TBD, and the na_token from [METADATA]. For missing values, emit empty string "".
- For pivot-like tables, emit a single row per entity with all columns populated when available.
- For key–value (slot/value) tables, emit one row per pair with exactly the two columns from the schema.
- No trailing commas. Ensure every line is valid JSON. Do not wrap rows in an array. No code fences.

<|metadata|>
[METADATA]
{"language": "en", "script": "auto", "direction": "auto", "source_modality": "plain_text", "na_token": "", "document_char_len": 76, "table_count": "auto", "structure_candidates": ["kv_single", "kv_multi", "flat_single_row", "row_grouped"], "table_hints": {"header_rows_max": 3, "header_cols_max": 2, "row_header_possible": true, "col_header_possible": true, "ragged_rows_possible": false, "multiple_tables_possible": true}, "locale": {"numeric": "auto", "decimal_separators": [".", ","], "thousand_separators": [",", ".", " "], "negative_patterns": ["-x", "(x)"], "percent_symbol": "%", "permille_symbol": "‰"}, "parsing_hints": {"kv_markers": [":", "—", "–", "=", "→"], "list_markers": ["- ", "* ", "• ", "•\t", "▪ "], "section_markers": ["###", "##", "--", "__", "**", ":"]}, "missing_tokens": ["", "-", "—", "–", "N/A", "NA", "None", "Null", "Unknown", "TBD"]}

<|schema|>
[SCHEMA]
{"schema_id": "gold.e2137f3600e1", "structure": "kv_multi", "description": "Schema induced from the gold table.", "fields": [{"name": "slot", "type": "string", "required": true}, {"name": "value", "type": "string", "required": true}], "constraints": {"enums": {"slot": ["name", "area", "familyFriendly"], "value": ["Alimentum", "city centre", "no"]}}}

<|document|>
[DOCUMENT]
There is a place in the city centre, Alimentum, that is not family-friendly.

<|output_format|>
[OUTPUT_FORMAT]
Emit ONLY JSON Lines (JSONL), one JSON object per row, with keys exactly as in [SCHEMA] and in the same order. Do NOT wrap in an array. No comments or markdown.

<|task|>
[TASK]
Fill the table under the given [SCHEMA] using facts from [DOCUMENT]. If a value is not explicitly present, output an empty string for that cell. Ensure every line is valid JSON.
**assistant**: {"slot": "name", "value": "Alimentum"}
{"slot": "area", "value": "city centre"}
{"slot": "familyFriendly", "value": "no"}

**generation (head)**:

```
I've generated the output based on the provided schema and document.
```

## Notes
* Trained with FlashAttention‑2 (FA‑2) and FSDP; LoRA adapters were used in the earlier stages and then merged into the final artifact.
* For higher validity, keep decoding **deterministic** (no sampling) and validate each output line downstream.
* Respect the base model license and your data licenses.
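
Downstream validation of each output line can be sketched as three checks named after the sanity metrics in the TL;DR: the line parses as JSON, its keys match the schema order, and its values match the declared types. How the card's own metrics were computed is not documented, so this is an assumption-laden illustration: `check_line` and `TYPE_CHECKS` are hypothetical names, only the type names that appear in this card's examples are covered, `date` is treated as a plain string, and an empty string is accepted as a missing value.

```python
import json

# Assumed mapping from schema type names to Python checks (hypothetical).
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
    "date": lambda v: isinstance(v, str),
}


def check_line(line, schema):
    """Return json_valid / key_order / type booleans for one output line."""
    names = [field["name"] for field in schema["fields"]]
    try:
        row = json.loads(line)
    except json.JSONDecodeError:
        return {"json_valid": False, "key_order": False, "type": False}
    if not isinstance(row, dict):
        return {"json_valid": True, "key_order": False, "type": False}
    # json.loads keeps the key order of the line, so a list comparison suffices.
    key_order = list(row) == names
    type_ok = key_order and all(
        row[field["name"]] == ""  # empty string marks a missing value
        or TYPE_CHECKS.get(field["type"], lambda v: True)(row[field["name"]])
        for field in schema["fields"]
    )
    return {"json_valid": True, "key_order": key_order, "type": type_ok}


schema = {"fields": [
    {"name": "order_id", "type": "string"},
    {"name": "item", "type": "string"},
    {"name": "qty", "type": "integer"},
]}
print(check_line('{"order_id": "O-1003", "item": "pencil", "qty": 2}', schema))
print(check_line("Based on the provided schema, here's a breakdown:", schema))
```

Averaging each boolean over all generated lines gives per-check pass rates that can be tracked across checkpoints.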