vanishingradient
/

qwen-docs-finetuned

@@ -1,21 +1,110 @@
 ---
-base_model: unsloth/Qwen3-VL-8B-Instruct-unsloth-bnb-4bit
-tags:
-- text-generation-inference
-- transformers
-- unsloth
-- qwen3_vl
-license: apache-2.0
-language:
-- en
 ---
-# Uploaded finetuned  model
-- **Developed by:** vanishingradient
-- **License:** apache-2.0
-- **Finetuned from model :** unsloth/Qwen3-VL-8B-Instruct-unsloth-bnb-4bit
-This qwen3_vl model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.
-[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)

+---
+base_model: unsloth/Qwen3-VL-8B-Instruct-unsloth-bnb-4bit
+tags:
+- vision-language
+- document-understanding
+- markdown-generation
+- transformers
+- unsloth
+- qwen3_vl
+license: apache-2.0
+language:
+- en
+datasets:
+- vidore/vidore_v3_computer_science
+pipeline_tag: image-text-to-text
+---
+# Qwen3-VL-8B — Document → Markdown (Fine-Tuned)
+**Developed by:** vanishingradient
+**License:** Apache-2.0
+**Base model:** unsloth/Qwen3-VL-8B-Instruct-unsloth-bnb-4bit
+This is a fine-tuned **Qwen3-VL-8B vision-language model** optimized for **document understanding and structured Markdown generation from images** such as scanned pages, PDF pages rendered as images, screenshots, and technical documents.
+The model was fine-tuned using **Unsloth** and **Hugging Face TRL**, enabling faster training and reduced VRAM usage while maintaining output fidelity.
 ---
+## Capabilities
+- Image → structured Markdown
+- Document layout preservation
+- Headings, lists, tables, inline formatting
+- Technical and academic documents
+- Low-VRAM inference (4-bit quantized)
 ---
+## Training Details
+- Framework: Unsloth + Hugging Face TRL
+- Quantization: 4-bit (bitsandbytes)
+- Objective: Instruction-tuned image-to-text generation
+- Domain focus: Documents and structured layouts
+---
+## Inference Example
+```python
+from transformers import (
+    AutoModelForImageTextToText,
+    AutoProcessor,
+    BitsAndBytesConfig,
+    TextStreamer,
+)
+import torch
+from PIL import Image
+model_id = "vanishingradient/qwen-docs-finetuned"
+# Load model (4-bit, fits on 16GB VRAM)
+model = AutoModelForImageTextToText.from_pretrained(
+    model_id,
+    torch_dtype=torch.float16,
+    device_map="auto",
+    trust_remote_code=True,
+    # 4-bit settings go through a config object; the bare load_in_4bit kwarg is deprecated
+    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
+)
+processor = AutoProcessor.from_pretrained(
+    model_id,
+    trust_remote_code=True
+)
+# --------------------------------------------------
+# PLACEHOLDER: path to your local image file
+# --------------------------------------------------
+image = Image.open("/path/to/your/document_image.png")
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image"},
+            {"type": "text", "text": "Convert this image to markdown format."}
+        ]
+    }
+]
+text = processor.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True
+)
+# Move inputs to the device that device_map="auto" selected for the model
+inputs = processor(
+    text=[text],
+    images=[image],
+    return_tensors="pt"
+).to(model.device)
+streamer = TextStreamer(
+    processor.tokenizer,
+    skip_prompt=True
+)
+_ = model.generate(
+    **inputs,
+    streamer=streamer,
+    max_new_tokens=1024,
+    do_sample=True,  # temperature only takes effect when sampling is enabled
+    temperature=0.1,
+)
+```
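
The "4-bit, fits on 16GB VRAM" comment in the loading code can be sanity-checked with back-of-the-envelope arithmetic. The sketch below estimates memory for the weights only. It assumes a nominal 8 billion parameters (taking "8B" at face value; the exact count is not stated in the card) and ignores activations, the KV cache, quantization constants, and any modules that bitsandbytes leaves unquantized, so real usage will be higher. The helper name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for model weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1024**3


# Assumption: "8B" is treated as exactly 8e9 parameters.
NOMINAL_PARAMS = 8e9

for label, bits in [("fp16", 16), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gib(NOMINAL_PARAMS, bits):.1f} GiB")
# prints:
#  fp16: 14.9 GiB
# 4-bit: 3.7 GiB
```

Under these assumptions, fp16 weights alone would nearly fill a 16 GB card, while 4-bit weights leave most of the memory free for image tokens and generation.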