| ---
|
| base_model: unsloth/Qwen3-VL-8B-Instruct-unsloth-bnb-4bit
|
| tags:
|
| - vision-language
|
| - document-understanding
|
| - markdown-generation
|
| - transformers
|
| - unsloth
|
| - qwen3_vl
|
| license: apache-2.0
|
| language:
|
| - en
|
| datasets:
|
| - vidore/vidore_v3_computer_science
|
| pipeline_tag: image-text-to-text
|
| ---
|
# Qwen3-VL-8B — Document → Markdown (Fine-Tuned)
|
|
**Developed by:** vanishingradient
**License:** Apache-2.0
**Base model:** unsloth/Qwen3-VL-8B-Instruct-unsloth-bnb-4bit
|
|
This is a fine-tuned **Qwen3-VL-8B** vision-language model optimized for **document understanding and structured Markdown generation from images** such as scanned pages, PDFs, screenshots, and technical documents.
|
|
The model was fine-tuned using **Unsloth** and **Hugging Face TRL**, enabling faster training and reduced VRAM usage while maintaining output fidelity.
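For reference, vision instruction tuning with TRL/Unsloth typically consumes examples in a conversational chat format, pairing each page image with its target Markdown. The sketch below shows one way to shape a `(image, markdown)` pair into that format; the helper name and instruction text are illustrative, not the exact pipeline used to train this model.

```python
# Illustrative: shape one (image_path, markdown) pair into the
# conversational format commonly used for vision SFT with TRL/Unsloth.
# `to_conversation` and the prompt wording are assumptions, not the
# model's actual training script.

def to_conversation(image, markdown_text):
    """Return a single chat-format training example."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": image},
                    {"type": "text", "text": "Convert this image to markdown format."},
                ],
            },
            {
                "role": "assistant",
                "content": [{"type": "text", "text": markdown_text}],
            },
        ]
    }

example = to_conversation("page_001.png", "# Title\n\nBody text.")
```

A dataset of such dicts can then be handed to an SFT trainer that understands multimodal chat messages.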
|
|
| --- |
|
|
## Capabilities


- Image → structured Markdown
- Document layout preservation
- Headings, lists, tables, inline formatting
- Technical and academic documents
- Low-VRAM inference (4-bit quantized)


---
|
|
## Training Details


- Framework: Unsloth + Hugging Face TRL
- Quantization: 4-bit (bitsandbytes)
- Objective: Instruction-tuned image-to-text generation
- Domain focus: Documents and structured layouts


---
|
|
## Inference Example
|
|
```python
from transformers import (
    AutoModelForVision2Seq,
    AutoProcessor,
    BitsAndBytesConfig,
    TextStreamer,
)
import torch
from PIL import Image

model_id = "vanishingradient/qwen-docs-finetuned"

# Load the model in 4-bit (fits in 16 GB of VRAM). Passing `load_in_4bit`
# directly to `from_pretrained` is deprecated; use a BitsAndBytesConfig.
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
)

processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
)

# --------------------------------------------------
# PLACEHOLDER: path to your local image file
# --------------------------------------------------
image = Image.open("/path/to/your/document_image.png")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this image to markdown format."},
        ],
    }
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt",
).to(model.device)  # follow whatever device `device_map` selected

streamer = TextStreamer(
    processor.tokenizer,
    skip_prompt=True,
)

# `temperature` only takes effect when sampling is enabled, so set
# `do_sample=True`; with greedy decoding the value would be ignored.
_ = model.generate(
    **inputs,
    streamer=streamer,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.1,
)
```