vanishingradient committed on
Commit 58f05ba · verified · 1 Parent(s): 135a356

Update README.md

Files changed (1)
  1. README.md +104 -15
README.md CHANGED
@@ -1,21 +1,110 @@
  ---
- base_model: unsloth/Qwen3-VL-8B-Instruct-unsloth-bnb-4bit
- tags:
- - text-generation-inference
- - transformers
- - unsloth
- - qwen3_vl
- license: apache-2.0
- language:
- - en
  ---
 
- # Uploaded finetuned model
 
- - **Developed by:** vanishingradient
- - **License:** apache-2.0
- - **Finetuned from model :** unsloth/Qwen3-VL-8B-Instruct-unsloth-bnb-4bit
 
- This qwen3_vl model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.
 
- [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
+ ---
+ base_model: unsloth/Qwen3-VL-8B-Instruct-unsloth-bnb-4bit
+ tags:
+ - vision-language
+ - document-understanding
+ - markdown-generation
+ - transformers
+ - unsloth
+ - qwen3_vl
+ license: apache-2.0
+ language:
+ - en
+ datasets:
+ - vidore/vidore_v3_computer_science
+ pipeline_tag: image-text-to-text
+ ---
+
+ # Qwen3-VL-8B: Document → Markdown (Fine-Tuned)
+
+ - **Developed by:** vanishingradient
+ - **License:** Apache-2.0
+ - **Base model:** unsloth/Qwen3-VL-8B-Instruct-unsloth-bnb-4bit
+
+ This is a fine-tuned **Qwen3-VL-8B vision-language model** optimized for **document understanding and structured Markdown generation from images** such as scanned pages, PDF pages, screenshots, and technical documents.
+
+ The model was fine-tuned with **Unsloth** and **Hugging Face TRL**, which speed up training and reduce VRAM usage while maintaining output fidelity.
+
  ---
+
+ ## Capabilities
+
+ - Image → structured Markdown
+ - Document layout preservation
+ - Headings, lists, tables, inline formatting
+ - Technical and academic documents
+ - Low-VRAM inference (4-bit quantized)
+
  ---
 
+ ## Training Details
+
+ - Framework: Unsloth + Hugging Face TRL
+ - Quantization: 4-bit (bitsandbytes)
+ - Objective: Instruction-tuned image-to-text generation
+ - Domain focus: Documents and structured layouts
+
+ ---
+
+ ## Inference Example
+
+ ```python
+ from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig, TextStreamer
+ import torch
+ from PIL import Image
+
+ model_id = "vanishingradient/qwen-docs-finetuned"
+
+ # Load model (4-bit, fits on 16 GB VRAM)
+ model = AutoModelForImageTextToText.from_pretrained(
+     model_id,
+     torch_dtype=torch.float16,
+     device_map="auto",
+     trust_remote_code=True,
+     quantization_config=BitsAndBytesConfig(load_in_4bit=True),
+ )
+
+ processor = AutoProcessor.from_pretrained(
+     model_id,
+     trust_remote_code=True,
+ )
+
+ # Placeholder: replace with the path to your local image file
+ image = Image.open("/path/to/your/document_image.png").convert("RGB")
+
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image"},
+             {"type": "text", "text": "Convert this image to markdown format."},
+         ],
+     }
+ ]
+
+ text = processor.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True,
+ )
 
+ # Move the inputs to the device the model was placed on
+ inputs = processor(
+     text=[text],
+     images=[image],
+     return_tensors="pt",
+ ).to(model.device)
 
+ streamer = TextStreamer(
+     processor.tokenizer,
+     skip_prompt=True,
+ )
 
+ _ = model.generate(
+     **inputs,
+     streamer=streamer,
+     max_new_tokens=1024,
+     do_sample=True,  # temperature only takes effect when sampling
+     temperature=0.1,
+ )
+ ```
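
As a rough illustration of the "Low-VRAM inference (4-bit quantized)" item and the "fits on 16 GB VRAM" comment in the README above, the sketch below estimates the memory taken by the weights alone. It assumes a nominal 8 billion parameters (read off the "8B" in the model name, not measured) and ignores activations, the KV cache, quantization metadata, and any layers that bitsandbytes leaves unquantized. `weight_memory_gib` is a hypothetical helper, not part of any library.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 2**30


# Assumption: nominal parameter count taken from the "8B" in the model name.
NOMINAL_PARAMS = 8e9

fp16_gib = weight_memory_gib(NOMINAL_PARAMS, 16)
int4_gib = weight_memory_gib(NOMINAL_PARAMS, 4)

print(f"fp16 weights:  ~{fp16_gib:.1f} GiB")   # ~14.9 GiB
print(f"4-bit weights: ~{int4_gib:.1f} GiB")   # ~3.7 GiB
```

Under these assumptions the 4-bit weights take about a quarter of the fp16 footprint, which is consistent with the model fitting on a 16 GB GPU with room left for activations; actual usage will be higher than this estimate and depends on image resolution and output length.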