doc-extractor-vl

Document data extraction model based on Qwen2.5-VL-7B-Instruct, configured for structured JSON output from document images (invoices, forms, receipts, etc.).

Key Features

  • Cyrillic-free output: Includes a pre-computed logit bias file that blocks all 4129 Cyrillic tokens, preventing the Cyrillic/Latin script confusion common in multilingual VL models
  • Structured JSON output: System prompt enforces JSON-only responses
  • Multilingual: Intended for Slovenian, English, German, Croatian, and other Latin-script languages

Files

File                        Description
cyrillic_logit_bias.json    4129 token IDs, each with bias -100, to block Cyrillic generation
system_prompt.txt           System prompt template for document extraction
serving_config.yaml         Recommended vLLM serving parameters
generate_cyrillic_bias.py   Script to regenerate the logit bias file
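
The contents of generate_cyrillic_bias.py are not reproduced here, so the following is only a minimal sketch of how such a bias file could be built. It assumes you already have a mapping from token ID to decoded token text (with a Hugging Face tokenizer this could come from calling tokenizer.decode([token_id]) for each ID). The names contains_cyrillic, build_logit_bias, and toy_vocab are hypothetical, and the Unicode ranges cover only the main Cyrillic blocks. Tokens that hold partial UTF-8 byte sequences would not be caught by a check like this, so the resulting count may differ from 4129.

```python
import json


def contains_cyrillic(text: str) -> bool:
    """Return True if any character lies in one of the main Cyrillic Unicode blocks."""
    return any(
        0x0400 <= ord(ch) <= 0x052F      # Cyrillic and Cyrillic Supplement
        or 0x1C80 <= ord(ch) <= 0x1C8F   # Cyrillic Extended-C
        or 0x2DE0 <= ord(ch) <= 0x2DFF   # Cyrillic Extended-A
        or 0xA640 <= ord(ch) <= 0xA69F   # Cyrillic Extended-B
        for ch in text
    )


def build_logit_bias(decoded_vocab: dict[int, str], bias: int = -100) -> dict[str, int]:
    """Assign `bias` to every token ID whose decoded text contains Cyrillic.

    Keys are strings because JSON object keys are always strings.
    """
    return {
        str(token_id): bias
        for token_id, text in sorted(decoded_vocab.items())
        if contains_cyrillic(text)
    }


# Toy vocabulary for illustration only (not real Qwen token IDs).
toy_vocab = {
    0: "a",                          # Latin a
    1: "\u0430",                     # Cyrillic а, visually identical to Latin a
    2: " total",
    3: " \u0434\u0430\u0442\u0430",  # Cyrillic word
    4: "\u010d",                     # Latin č, must stay allowed
}
toy_bias = build_logit_bias(toy_vocab)
print(json.dumps(toy_bias))
```

Latin letters with diacritics such as č, š, and ž fall outside the Cyrillic ranges, so they stay unbiased.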

Usage with vLLM

Serving

vllm serve mikrografija/doc-extractor-vl --max-model-len 4096

Request with Cyrillic blocking

import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Load Cyrillic logit bias
with open("cyrillic_logit_bias.json") as f:
    cyrillic_bias = {int(k): v for k, v in json.load(f).items()}

# Load system prompt
with open("system_prompt.txt") as f:
    system_prompt = f.read()

response = client.chat.completions.create(
    model="mikrografija/doc-extractor-vl",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."} },
            {"type": "text", "text": "Extract data into this JSON schema: {\"issuer\": \"\", \"date\": \"\", \"total\": \"\", \"items\": []}"}
        ]}
    ],
    logit_bias=cyrillic_bias,
    temperature=0.0,
    max_tokens=1024,  # prompt tokens + max_tokens must fit within --max-model-len (4096 above)
)
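
The request above leaves the base64 payload elided and does not show how the reply is consumed. As a hedged sketch, the two helpers below show one way to build the data URL from raw image bytes and to check that the reply parses as JSON with the keys from the example schema. The names to_data_url and parse_extraction are hypothetical, and the sample reply is made up for illustration.

```python
import base64
import json


def to_data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a data URL usable in an image_url content part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def parse_extraction(reply_text: str,
                     required_keys=("issuer", "date", "total", "items")) -> dict:
    """Parse the model reply as JSON and verify the expected top-level keys."""
    data = json.loads(reply_text)  # raises json.JSONDecodeError on non-JSON replies
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


# Illustration with made-up inputs; in real use the bytes come from an image
# file and the reply text from response.choices[0].message.content.
sample_url = to_data_url(b"\x89PNG")
sample = parse_extraction(
    '{"issuer": "ACME d.o.o.", "date": "2024-01-31", "total": "12.50", "items": []}'
)
print(sample_url)
print(sample["issuer"])
```

Validating the keys on the client side is a design choice of this sketch: the system prompt asks for JSON-only output, but nothing in the request itself guarantees that the reply matches the schema.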

Why Cyrillic Blocking?

Qwen2.5-VL models are trained on multilingual data that includes Cyrillic scripts. When processing Latin-script documents (especially Slovenian, Croatian, or other languages with diacritics), the model occasionally substitutes Latin characters with visually similar Cyrillic characters (e.g., Latin "a" → Cyrillic "а"). The logit bias blocks this at the decoding level: a bias of -100 on every listed Cyrillic token ID makes those tokens effectively impossible to generate.
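
To illustrate why a bias of -100 is enough, here is a small sketch of the mechanism, assuming the server adds the bias to the raw logits before sampling. The two-token vocabulary and the logit values are invented for the example; softmax and apply_bias are hypothetical helper names, not part of vLLM.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_bias(logits, bias):
    """Add a per-token bias (token index -> value) to the logits."""
    return [x + bias.get(i, 0.0) for i, x in enumerate(logits)]


# Hypothetical logits: index 0 = Latin "a", index 1 = Cyrillic "а".
# Without a bias the model slightly prefers the Cyrillic look-alike.
logits = [2.0, 2.5]
before = softmax(logits)
after = softmax(apply_bias(logits, {1: -100.0}))

print(f"P(Cyrillic) before bias: {before[1]:.3f}")
print(f"P(Cyrillic) after bias:  {after[1]:.3e}")
```

After the bias, the Cyrillic token's probability is vanishingly small, so greedy decoding at temperature 0 selects the Latin token instead.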

Base Model

This model uses unmodified Qwen2.5-VL-7B-Instruct weights. No fine-tuning was applied. The configuration files provide the Cyrillic blocking and structured output enforcement.
