---

base_model: unsloth/Qwen3-VL-8B-Instruct-unsloth-bnb-4bit
tags:
- vision-language
- document-understanding
- markdown-generation
- transformers
- unsloth
- qwen3_vl
license: apache-2.0
language:
- en
datasets:
- vidore/vidore_v3_computer_science
pipeline_tag: image-text-to-text
---


# Qwen3-VL-8B — Document → Markdown (Fine-Tuned)

**Developed by:** vanishingradient  
**License:** Apache-2.0  
**Base model:** unsloth/Qwen3-VL-8B-Instruct-unsloth-bnb-4bit  

This is a fine-tuned **Qwen3-VL-8B** vision-language model optimized for **document understanding and structured Markdown generation from images**, such as scanned pages, rendered PDF pages, screenshots, and technical documents.

The model was fine-tuned using **Unsloth** and **Hugging Face TRL**, enabling faster training and reduced VRAM usage while maintaining output fidelity.

---

## Capabilities

- Image → structured Markdown
- Document layout preservation
- Headings, lists, tables, inline formatting
- Technical and academic documents
- Low-VRAM inference (4-bit quantized)

---

## Training Details

- Framework: Unsloth + Hugging Face TRL  
- Quantization: 4-bit (bnb)  
- Objective: Instruction-tuned image-to-text generation  
- Domain focus: Documents and structured layouts  

---

## Inference Example

```python
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig, TextStreamer
import torch
from PIL import Image

model_id = "vanishingradient/qwen-docs-finetuned"

# Load the model in 4-bit (fits on ~16 GB VRAM).
# Passing load_in_4bit=True directly to from_pretrained is deprecated;
# use a BitsAndBytesConfig instead.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True
)

# --------------------------------------------------
# PLACEHOLDER: path to your local image file
# --------------------------------------------------
image = Image.open("/path/to/your/document_image.png")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this image to markdown format."}
        ]
    }
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt"
).to(model.device)  # follow wherever device_map="auto" placed the model

streamer = TextStreamer(
    processor.tokenizer,
    skip_prompt=True
)

_ = model.generate(
    **inputs,
    streamer=streamer,
    max_new_tokens=1024,
    do_sample=True,   # required for temperature to take effect
    temperature=0.1,
)
```
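
Models tuned for Markdown output will sometimes wrap the entire answer in a code fence. If you see that, a small post-processing helper (a hypothetical convenience, not part of the model or its API) can unwrap it:

```python
def strip_markdown_fence(text: str) -> str:
    """Remove a wrapping ``` fence (e.g. ```markdown ... ```), if present."""
    stripped = text.strip()
    if not stripped.startswith("```"):
        return stripped
    lines = stripped.splitlines()
    lines = lines[1:]  # drop the opening fence line (``` or ```markdown)
    if lines and lines[-1].strip() == "```":
        lines = lines[:-1]  # drop the closing fence line
    return "\n".join(lines)
```

Apply it to the decoded generation before saving the result as a `.md` file.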