---
base_model: Qwen/Qwen2.5-VL-7B-Instruct
language:
- en
- sl
- de
- hr
- sr
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- vllm
- document-extraction
- ocr
- invoice
- json
---

# doc-extractor-vl

Document data extraction model based on [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct), configured for structured JSON output from document images (invoices, forms, receipts, etc.).

## Key Features

- **Cyrillic-free output**: Includes a pre-computed logit bias file that blocks all 4129 Cyrillic tokens, preventing the Cyrillic/Latin script confusion common in multilingual VL models
- **Structured JSON output**: The system prompt enforces JSON-only responses
- **Multilingual**: Optimized for Slovenian, English, German, Croatian, and other Latin-script languages

## Files

| File | Description |
|------|-------------|
| `cyrillic_logit_bias.json` | 4129 token IDs, each with a bias of -100, to block Cyrillic generation |
| `system_prompt.txt` | System prompt template for document extraction |
| `serving_config.yaml` | Recommended vLLM serving parameters |
| `generate_cyrillic_bias.py` | Script to regenerate the logit bias file |

## Usage with vLLM

### Serving

```bash
vllm serve mikrografija/doc-extractor-vl --max-model-len 4096
```

### Request with Cyrillic blocking

```python
import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Load Cyrillic logit bias (token ID -> bias value)
with open("cyrillic_logit_bias.json") as f:
    cyrillic_bias = {int(k): v for k, v in json.load(f).items()}

# Load system prompt
with open("system_prompt.txt") as f:
    system_prompt = f.read()

response = client.chat.completions.create(
    model="mikrografija/doc-extractor-vl",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
            {"type": "text", "text": "Extract data into this JSON schema: {\"issuer\": \"\", \"date\": \"\", \"total\": \"\", \"items\": []}"}
        ]}
    ],
    logit_bias=cyrillic_bias,
    temperature=0.0,
    # Prompt and image tokens plus max_tokens must fit within --max-model-len (4096 above),
    # so leave headroom for the input; adjust as needed.
    max_tokens=2048,
)

# The extracted JSON is returned as the message text
print(response.choices[0].message.content)
```

## Why Cyrillic Blocking?

Qwen2.5-VL models are trained on multilingual data that includes Cyrillic scripts. When processing Latin-script documents (especially Slovenian, Croatian, or other languages with diacritics), the model occasionally substitutes Latin characters with visually similar Cyrillic characters (e.g., Latin "a" → Cyrillic "а"). The logit bias blocks this at the decoding level: a bias of -100 on every listed token ID means the model effectively cannot generate those Cyrillic tokens.

## Base Model

This model uses unmodified [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) weights. No fine-tuning was applied. The configuration files provide the Cyrillic blocking and structured output enforcement.
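
The Cyrillic blocking described under "Why Cyrillic Blocking?" depends on knowing which token IDs decode to Cyrillic text. The sketch below shows one way such a bias map could be built. It is not the contents of `generate_cyrillic_bias.py`, which is not reproduced here: the helper names `contains_cyrillic` and `build_logit_bias`, the toy vocabulary, and the choice to detect Cyrillic by Unicode character name are all assumptions made for illustration.

```python
import json
import unicodedata
from typing import Dict


def contains_cyrillic(text: str) -> bool:
    """Return True if any character's Unicode name starts with 'CYRILLIC'."""
    return any(unicodedata.name(ch, "").startswith("CYRILLIC") for ch in text)


def build_logit_bias(vocab: Dict[int, str], bias: float = -100.0) -> Dict[str, float]:
    """Map every token ID whose decoded text contains Cyrillic to `bias`.

    `vocab` maps token ID -> decoded token string. This input is hypothetical:
    with a real tokenizer, each ID would first have to be decoded to text.
    Keys are strings because JSON object keys are always strings.
    """
    return {
        str(token_id): bias
        for token_id, text in sorted(vocab.items())
        if contains_cyrillic(text)
    }


# Toy vocabulary for illustration only; real token IDs will differ.
toy_vocab = {
    0: "a",                          # Latin small letter a: allowed
    1: "\u0430",                     # Cyrillic small letter a (look-alike): blocked
    2: "Ra\u010dun",                 # Latin word with c-caron: allowed
    3: " \u0441\u0447\u0435\u0442",  # Cyrillic word with leading space: blocked
}

logit_bias = build_logit_bias(toy_vocab)
print(json.dumps(logit_bias))
```

A map produced this way has the same shape as the one loaded from `cyrillic_logit_bias.json` in the request example (token ID to bias value). Latin letters with diacritics, such as "č", are left untouched because their Unicode names start with "LATIN", so Slovenian and Croatian text stays unaffected in this sketch.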