---

license: other
license_name: qwen-research
license_link: LICENSE
datasets:
- stanford-oval/churro-dataset
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
tags:
- historical
---


<p align="center">
	<img src="https://raw.githubusercontent.com/stanford-oval/Churro/refs/heads/main/static/churro.png" width="70px" alt="CHURRO Logo" style="display:block;margin:0 auto;" />
	<p align="center">CHURRO: Making History Readable with an Open-Weight Large Vision-Language Model for High-Accuracy, Low-Cost Historical Text Recognition</p>
	<p align="center">
		<a href="https://huggingface.co/stanford-oval/churro-3B" target="_blank"><img src="https://img.shields.io/badge/Model-CHURRO%203B-8A4FFF" alt="Model" /></a>
        <a href="https://huggingface.co/datasets/stanford-oval/churro-dataset" target="_blank"><img src="https://img.shields.io/badge/Dataset-CHURRO--DS-0A7BBB" alt="Dataset" /></a>
		<a href="https://arxiv.org/abs/2509.19768" target="_blank"><img src="https://img.shields.io/badge/Paper-arXiv%20-B31B1B" alt="Paper" /></a>
		<a href="https://github.com/stanford-oval/churro/stargazers" target="_blank"><img src="https://img.shields.io/github/stars/stanford-oval/churro?style=social" alt="GitHub Stars" /></a>
	</p>
</p>

<p align="center">
	<sub><i>Handwritten and printed text recognition across 22 centuries and 46 language clusters, including historical and dead languages.</i></sub>
</p>

<p align="center">
    <img src="https://raw.githubusercontent.com/stanford-oval/Churro/refs/heads/main/static/performance_cost.png" alt="Cost vs Performance comparison showing CHURRO's accuracy advantage at significantly lower cost" width="75%" />
    <br/>
    <sub><i>Cost vs. accuracy: CHURRO (3B) achieves higher accuracy than much larger commercial and open-weight VLMs while being substantially cheaper.</i></sub>
</p>

**CHURRO** is a 3B-parameter open-weight vision-language model (VLM) for historical document transcription. It is trained on **CHURRO-DS**, a curated dataset of ~100K pages from 155 historical collections spanning 22 centuries and 46 language clusters.
On the CHURRO-DS test set, CHURRO exceeds the accuracy of Gemini 2.5 Pro at **15.5× lower cost**.

For more details and code, see https://github.com/stanford-oval/Churro.
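Below is a minimal usage sketch, assuming the standard `transformers` loading path for Qwen2.5-VL-based checkpoints (the prompt wording, `max_new_tokens` value, and helper names are illustrative, not the authors' official inference code; the model id `stanford-oval/churro-3B` is taken from the badge above):

```python
def build_messages(image_path: str, prompt: str) -> list[dict]:
    """Build the single-turn chat payload that Qwen2.5-VL-style
    processors expect for image-text-to-text inference."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": prompt},
            ],
        }
    ]


def transcribe(image_path: str, prompt: str = "Transcribe the text on this page.") -> str:
    """Run CHURRO on a single page image (sketch; generation settings are
    assumptions). Heavy imports are deferred so build_messages stays usable
    without torch/transformers installed."""
    from PIL import Image
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        "stanford-oval/churro-3B", torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained("stanford-oval/churro-3B")

    messages = build_messages(image_path, prompt)
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image = Image.open(image_path)
    inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=2048)
    # Strip the prompt tokens before decoding the transcription.
    trimmed = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]


# Inspect the request payload without loading the model.
messages = build_messages("page.png", "Transcribe the text on this page.")
```

For repository-scale transcription, the GitHub repository linked above documents the authors' own pipeline.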