| ---
|
| base_model: unsloth/Qwen3-VL-8B-Instruct-unsloth-bnb-4bit
|
| tags:
|
| - vision-language
|
| - document-understanding
|
| - markdown-generation
|
| - transformers
|
| - unsloth
|
| - qwen3_vl
|
| license: apache-2.0
|
| language:
|
| - en
|
| datasets:
|
| - vidore/vidore_v3_computer_science
|
| pipeline_tag: image-text-to-text
|
| ---
|
# Qwen3-VL-8B — Document → Markdown (Fine-Tuned)
|
|
**Developed by:** vanishingradient
**License:** Apache-2.0
**Base model:** unsloth/Qwen3-VL-8B-Instruct-unsloth-bnb-4bit
|
|
This is a fine-tuned **Qwen3-VL-8B** vision-language model optimized for **document understanding and structured Markdown generation from images** such as scanned pages, PDFs, screenshots, and technical documents.
|
|
The model was fine-tuned using **Unsloth** and **Hugging Face TRL**, enabling faster training and reduced VRAM usage while maintaining output fidelity.
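For reference, vision instruction tuning with TRL/Unsloth typically consumes examples in a conversational chat format, pairing each page image with its target Markdown. The sketch below shows one way to shape a `(image, markdown)` pair into that format; the helper name and instruction text are illustrative, not the exact pipeline used to train this model.

```python
# Illustrative: shape one (image_path, markdown) pair into the
# conversational format commonly used for vision SFT with TRL/Unsloth.
# `to_conversation` and the prompt wording are assumptions, not the
# model's actual training script.

def to_conversation(image, markdown_text):
    """Return a single chat-format training example."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": image},
                    {"type": "text", "text": "Convert this image to markdown format."},
                ],
            },
            {
                "role": "assistant",
                "content": [{"type": "text", "text": markdown_text}],
            },
        ]
    }

example = to_conversation("page_001.png", "# Title\n\nBody text.")
```

A dataset of such dicts can then be handed to an SFT trainer that understands multimodal chat messages.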
|
|
| --- |
|
|
## Capabilities


- Image → structured Markdown
- Document layout preservation
- Headings, lists, tables, inline formatting
- Technical and academic documents
- Low-VRAM inference (4-bit quantized)


---
|
|
## Training Details


- Framework: Unsloth + Hugging Face TRL
- Quantization: 4-bit (bitsandbytes)
- Objective: Instruction-tuned image-to-text generation
- Domain focus: Documents and structured layouts


---
|
|
## Inference Example
|
|
```python
from transformers import (
    AutoModelForVision2Seq,
    AutoProcessor,
    BitsAndBytesConfig,
    TextStreamer,
)
import torch
from PIL import Image

model_id = "vanishingradient/qwen-docs-finetuned"

# Load the model in 4-bit (fits in 16 GB of VRAM). Passing `load_in_4bit`
# directly to `from_pretrained` is deprecated; use a BitsAndBytesConfig.
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
)

processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
)

# --------------------------------------------------
# PLACEHOLDER: path to your local image file
# --------------------------------------------------
image = Image.open("/path/to/your/document_image.png")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this image to markdown format."},
        ],
    }
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt",
).to(model.device)  # follow whatever device `device_map` selected

streamer = TextStreamer(
    processor.tokenizer,
    skip_prompt=True,
)

# `temperature` only takes effect when sampling is enabled, so set
# `do_sample=True`; with greedy decoding the value would be ignored.
_ = model.generate(
    **inputs,
    streamer=streamer,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.1,
)
```