🤏 smolified-ocr-data-extractor-and-comparator

Intelligence, Distilled.

This is a Domain Specific Language Model (DSLM) generated by the Smolify Foundry.

It has been synthetically distilled from SOTA reasoning engines into a high-efficiency architecture, optimized for deployment on edge hardware (CPU/NPU) or low-VRAM environments.

📦 Asset Details

  • Origin: Smolify Foundry (Job ID: 806ba38c)
  • Architecture: DSLM-Micro (270M Parameter Class)
  • Training Method: Proprietary Neural Distillation
  • Optimization: 4-bit Quantized / FP16 Mixed
  • Dataset: Link to Dataset

🚀 Usage (Inference)

This model is compatible with standard inference backends like vLLM.

# Example: Running your Sovereign Model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "smolify/smolified-ocr-data-extractor-and-comparator"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {'role': 'system', 'content': '''FORGET EVERYTHING YOU KNOW BEFORE THIS  You are an OCR data extraction and comparison engine.  1. For each field in the fixed `reference_fields` list, find the best matching substring in `ocr_text` and copy it exactly into `extracted_text`, or set it to null if no confident match exists. 2. Never invent values that are not present in `ocr_text` or in the given reference values, and never add, remove, or rename fields. 3. Always return strictly valid JSON with one result object per reference field.'''},
    {'role': 'user', 'content': '''{'ocr_text': 'ALLIECO\nChemin des Roses, 12\n78370 Plaisir\nT Bon de pesée n° : 123456\nSIRET: 80898516100010\nCAP n°: P123M456\nClient: PROMETHEE SA\nProvenance: VERSAILLES (78000)\nMatière: DECHETS BOIS (Code: DB789)\nDate: 2024-03-01 10:30\nPoids Brut: 15.5 T\nPoids Tare: 5.0 T\nPoids Net: 10.50 T\nUnité: T', 'reference_fields': [{'name': 'netWeight', 'type': 'number', 'value': 10.5}, {'name': 'unit', 'type': 'string', 'value': 'T'}, {'name': 'siret', 'type': 'string', 'value': '80898516100010'}, {'name': 'date', 'type': 'string', 'value': '2024-03-01'}, {'name': 'startCity', 'type': 'string', 'value': 'VERSAILLES'}, {'name': 'startPostalCode', 'type': 'string', 'value': '78000'}, {'name': 'endCity', 'type': 'string', 'value': 'Plaisir'}, {'name': 'endPostalCode', 'type': 'string', 'value': '78370'}, {'name': 'operationId', 'type': 'string', 'value': 'P123M456'}, {'name': 'flow', 'type': 'string', 'value': 'DECHETS BOIS (Code: DB789)'}, {'name': 'pointBusinessName', 'type': 'string', 'value': 'PROMETHEE SA'}, {'name': 'operatorBusinessName', 'type': 'string', 'value': 'ALLIECO'}]}'''}
]
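# Render the chat template to a plain string. The leading '<bos>' is stripped,
# presumably because the tokenizer call below adds its own BOS token.
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True,
).removeprefix('<bos>')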

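from transformers import TextStreamer

# Tokenize the prompt and move it to wherever the model was placed.
# device_map="auto" may put the model on CPU, so "cuda" is not hard-coded.
inputs = tokenizer(text, return_tensors = "pt").to(model.device)
_ = model.generate(
    **inputs,
    max_new_tokens = 1000,
    temperature = 1, top_p = 0.95, top_k = 64,
    # Print tokens to stdout as they are generated, without echoing the prompt.
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)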
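
The system prompt above promises strictly valid JSON, one result object per reference field, and values copied verbatim from `ocr_text` or set to null. A minimal sketch of a post-generation check for those three rules follows. The helper name `check_reply` is hypothetical, and the reply shape (a JSON array of objects with `name` and `extracted_text` keys) is an assumption, since this card does not document the exact output schema.

```python
import json


def check_reply(reply: str, ocr_text: str, reference_fields: list) -> list:
    """Validate a model reply against the rules stated in the system prompt.

    Assumption (not confirmed by the model card): the reply is a JSON array
    of objects, each with a `name` and an `extracted_text` key.
    """
    results = json.loads(reply)  # raises json.JSONDecodeError if not valid JSON
    if not isinstance(results, list) or not all(isinstance(r, dict) for r in results):
        raise ValueError("expected a JSON array of result objects")

    # Rule 3: exactly one result object per reference field, no extras.
    expected = [field["name"] for field in reference_fields]
    got = [r.get("name") for r in results]
    if len(got) != len(expected) or set(got) != set(expected):
        raise ValueError(f"field mismatch: expected {expected}, got {got}")

    # Rules 1 and 2: extracted_text is null or an exact substring of ocr_text.
    for r in results:
        value = r.get("extracted_text")
        if value is not None and str(value) not in ocr_text:
            raise ValueError(f"{r['name']}: {value!r} does not occur in ocr_text")
    return results
```

The streaming example prints the generation instead of returning it, so `reply` has to be captured separately (for instance by decoding the ids returned by `model.generate`) before it is passed to this check.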

⚖️ License & Ownership

This model's weights are a sovereign asset owned by smolify. Generated via Smolify.ai.

Safetensors · Model size: 0.3B params · Tensor type: BF16