---
base_model: unsloth/qwen2.5-vl-7b-instruct-unsloth-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- qwen2_5_vl
license: apache-2.0
language:
- en
datasets:
- AI4Math/MathVista
- unsloth/LaTeX_OCR
- mychen76/invoices-and-receipts_ocr_v1
- corto-ai/handwritten-text
---
# Cernis-Thinking: Multi-Task Vision Language Model for Document Understanding
**Cernis-Thinking** is a reasoning-capable vision language model fine-tuned with reinforcement learning (GRPO/GSPO) for document understanding tasks. Built on Qwen2.5-VL-7B, it excels at mathematical reasoning, LaTeX OCR, invoice extraction, and handwriting transcription.
## Model Details
- **Base Model**: [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
- **Training Method**: Group Relative Policy Optimization (GRPO) with GSPO extensions
- **Training Data**: ~2,000 samples across 4 document understanding tasks
- **Model Size**: 7B parameters
- **License**: Apache 2.0
## Capabilities
Cernis-Thinking is trained on four distinct document understanding tasks:
1. **Mathematical Reasoning** - Solves math problems from images with step-by-step reasoning
2. **LaTeX OCR** - Converts mathematical notation images to LaTeX code
3. **Invoice Extraction** - Extracts structured information from invoices and receipts
4. **Handwriting Transcription** - Transcribes handwritten text from images
## Training Details
### Datasets
- [AI4Math/MathVista](https://huggingface.co/datasets/AI4Math/MathVista) - Mathematical reasoning (filtered for numeric answers)
- [unsloth/LaTeX_OCR](https://huggingface.co/datasets/unsloth/LaTeX_OCR) - LaTeX formula recognition
- [mychen76/invoices-and-receipts_ocr_v1](https://huggingface.co/datasets/mychen76/invoices-and-receipts_ocr_v1) - Invoice extraction
- [corto-ai/handwritten-text](https://huggingface.co/datasets/corto-ai/handwritten-text) - Handwriting transcription
### Reinforcement Learning Approach
The model was trained using GRPO (Group Relative Policy Optimization) with custom reward functions:
**1. Formatting Reward Function**
- Rewards proper use of `<REASONING>` and `<SOLUTION>` tags
- Penalizes malformed outputs (e.g., excessive "addCriterion" artifacts)
- Encourages structured, parseable responses
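A minimal sketch of what such a formatting reward could look like (the function name, point values, and regex patterns here are illustrative assumptions, not the exact training code):
```python
import re

def formatting_reward(completion: str) -> float:
    """Illustrative reward: favor well-formed <REASONING>/<SOLUTION> structure."""
    score = 0.0
    if re.search(r"<REASONING>.+?</REASONING>", completion, re.DOTALL):
        score += 0.5
    if re.search(r"<SOLUTION>.+?</SOLUTION>", completion, re.DOTALL):
        score += 0.5
    # Penalize degenerate repetition artifacts such as "addCriterion"
    score -= 0.25 * min(completion.count("addCriterion"), 4)
    return score
```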
**2. Task-Specific Correctness Reward**
- **Math**: Exact numeric matching (2.0 points)
- **LaTeX/Handwriting**: String similarity with word overlap scoring (0.75-2.0 points)
- **Invoices**: Partial credit for extracting key information (1.5 points)
**3. ROUGE-like Word Overlap**
- For text-heavy tasks, rewards are based on the word-overlap ratio:
  - over 50% overlap: 1.5 points
  - over 30% overlap: 0.75 points
- Prevents wasting training signal on completely wrong outputs
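A minimal sketch combining the correctness and overlap rules above (the task routing, helper names, and the invoice threshold are assumptions; the point values follow the description):
```python
def word_overlap_reward(prediction: str, reference: str) -> float:
    """ROUGE-like reward: score by the fraction of reference words recovered."""
    ref_words = set(reference.lower().split())
    pred_words = set(prediction.lower().split())
    if not ref_words:
        return 0.0
    overlap = len(ref_words & pred_words) / len(ref_words)
    if overlap > 0.5:
        return 1.5
    if overlap > 0.3:
        return 0.75
    return 0.0

def correctness_reward(task: str, prediction: str, reference: str) -> float:
    """Route to a task-specific score: exact numeric match for math,
    overlap-based scoring for LaTeX, handwriting, and invoices."""
    prediction, reference = prediction.strip(), reference.strip()
    if task == "math":
        return 2.0 if prediction == reference else 0.0
    if task in ("latex", "handwriting"):
        return 2.0 if prediction == reference else word_overlap_reward(prediction, reference)
    if task == "invoice":
        # Partial credit when key fields appear in the output
        return 1.5 if word_overlap_reward(prediction, reference) > 0 else 0.0
    return 0.0
```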
### Training Configuration
```python
from trl import GRPOConfig

training_args = GRPOConfig(
    learning_rate = 5e-6,
    num_train_epochs = 0.5,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 2,
    num_generations = 4,
    max_prompt_length = 1024,
    max_completion_length = 1024,
    # GSPO settings
    importance_sampling_level = "sequence",
    loss_type = "dr_grpo",
)
```
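For orientation, a hedged sketch of how such a config and per-completion reward functions might be wired into trl's `GRPOTrainer`. Here `model`, `processor`, and `train_dataset` are assumed to come from the usual Unsloth setup, `formatting_reward` from the sketch above, and the batched wrapper is a simplification of how trl passes completions to reward functions:
```python
from trl import GRPOTrainer

def batched(reward_fn):
    """Adapt a per-completion scorer to the list-based signature trl expects."""
    def wrapper(completions, **kwargs):
        return [reward_fn(c) for c in completions]
    return wrapper

trainer = GRPOTrainer(
    model=model,                      # LoRA-wrapped Qwen2.5-VL model (assumed defined)
    processing_class=processor,       # VLM processor (assumed defined)
    reward_funcs=[batched(formatting_reward)],
    args=training_args,
    train_dataset=train_dataset,      # prompts paired with reference answers (assumed defined)
)
trainer.train()
```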
## Usage
### With Transformers
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "coolAI/cernis-thinking",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("coolAI/cernis-thinking")

# Prepare image and prompt
image = Image.open("document.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Extract the key information from this invoice. First provide your reasoning between <REASONING> and </REASONING>, then your answer between <SOLUTION> and </SOLUTION>."},
        ],
    }
]

# Prepare inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True).to(model.device)

# Generate and decode only the newly generated tokens
output_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_text[0])
```
### With vLLM (Recommended for Production)
```python
from vllm import LLM, SamplingParams
from PIL import Image

# Initialize vLLM
llm = LLM(
    model="coolAI/cernis-thinking",
    max_model_len=16384,
    gpu_memory_utilization=0.8,
)

# Prompt in the Qwen2.5-VL chat format, with the image placeholder tokens
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
    "What is the LaTeX code shown in this image? "
    "Provide your answer between <SOLUTION> and </SOLUTION><|im_end|>\n"
    "<|im_start|>assistant\n"
)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_k=50,
    max_tokens=1024,
)

# Generate with the local image attached as multimodal data
outputs = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"image": Image.open("formula.png")},
    },
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)
```
## Example Outputs
### Mathematical Reasoning
**Input**: Image of geometry problem
**Output**:
```
<REASONING>
To solve this parallelogram problem, I need to use the properties:
1. Opposite sides are equal in a parallelogram
2. Angle bisectors create specific relationships...
</REASONING>
<SOLUTION>
42
</SOLUTION>
```
### LaTeX OCR
**Input**: Image of mathematical formula
**Output**:
```
<SOLUTION>
\frac{2}{3} < a^{2} \alpha^{2} \leq 1
</SOLUTION>
```
### Invoice Extraction
**Input**: Invoice image
**Output**:
```
<SOLUTION>
Invoice No: 53553822
Date: 07/24/2012
Vendor: Leo Brown
Seller Address: 082 Christopher Club Apt. 771 Thomasberg, OH 42949
Seller Tax ID: 926-74-9803
Total: $247.50
</SOLUTION>
```
## Citation
```bibtex
@misc{cernis-thinking-2025,
title={Cernis-Thinking: Multi-Task Vision Language Model for Document Understanding},
author={Your Name},
year={2025},
publisher={HuggingFace},
howpublished={\url{https://huggingface.co/coolAI/cernis-thinking}}
}
```
## Acknowledgments
- Built with [Unsloth](https://github.com/unslothai/unsloth) for efficient VLM training
- Base model: [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
- Training datasets: AI4Math, Unsloth, mychen76, corto-ai
## License
Apache 2.0 - Free for commercial and research use.