---
language:
- en
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- ocr
- vision-language
- qwen2-vl
- vila
- multimodal
license: apache-2.0
---

# Easy DeepOCR - VILA-Qwen2-VL-8B

A vision-language model fine-tuned for OCR tasks, based on the VILA architecture with Qwen2-VL-8B as the language backbone.

## Model Description

This model combines:
- **Language Model**: Qwen2-VL-8B
- **Vision Encoders**: SAM + CLIP
- **Architecture**: VILA
- **Task**: Optical Character Recognition (OCR)

## Model Structure

```
easy_deepocr/
├── config.json          # Model configuration
├── llm/                 # Qwen2-VL-8B language model weights
├── mm_projector/        # Multimodal projection layer
├── sam_clip_ckpt/       # SAM and CLIP vision encoder weights
└── trainer_state.json   # Training state information
```
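
To confirm that a download matches this layout, the `huggingface_hub` client can list the repository contents; a small sketch using this model's repo id:

```python
from huggingface_hub import list_repo_files

# List every file in the model repository and print the top-level entries
files = list_repo_files("pkulium/easy_deepocr")
print(sorted({path.split("/")[0] for path in files}))
```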

## Usage

```python
from transformers import AutoModel, AutoTokenizer

# trust_remote_code is required because the VILA architecture ships
# custom modeling code alongside the checkpoint
model = AutoModel.from_pretrained("pkulium/easy_deepocr", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("pkulium/easy_deepocr")

# Example inference (the entry point is defined by the repository's custom code)
# image = ...
# text = ...
```
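
Continuing from the loading snippet above, here is a hedged sketch of an OCR call: `generate_content` mirrors the helper exposed by the upstream NVlabs VILA codebase and is an assumption here, not a confirmed method of this checkpoint.

```python
from PIL import Image

# ASSUMPTION: `generate_content` follows the upstream VILA codebase's
# helper; consult this repository's custom code for the real entry point.
image = Image.open("document.png").convert("RGB")
prompt = "Extract all text from this image."
response = model.generate_content([image, prompt])
print(response)
```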

## Training Details

- **Base Model**: Qwen2-VL-8B
- **Vision Encoders**: SAM + CLIP
- **Training Framework**: VILA
- **Training Type**: Pretraining for OCR tasks
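
For intuition about the `mm_projector/` weights listed above, the following is a minimal, hypothetical sketch of a VILA-style projector that fuses SAM and CLIP patch features and maps them into the language model's embedding space; the class name, layer layout, and all dimensions are illustrative placeholders, not this checkpoint's actual definition.

```python
import torch
import torch.nn as nn

class MMProjector(nn.Module):
    """Hypothetical VILA-style adapter: concatenate SAM + CLIP patch
    features and project them into the LLM embedding space."""

    def __init__(self, sam_dim: int = 256, clip_dim: int = 1024, llm_dim: int = 3584):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(sam_dim + clip_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, sam_feats: torch.Tensor, clip_feats: torch.Tensor) -> torch.Tensor:
        # Both inputs: (batch, num_patches, dim); output lives in LLM space
        return self.proj(torch.cat([sam_feats, clip_feats], dim=-1))

projector = MMProjector()
visual_tokens = projector(torch.randn(1, 576, 256), torch.randn(1, 576, 1024))
print(visual_tokens.shape)  # torch.Size([1, 576, 3584])
```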

## Intended Use

This model is designed for:
- Document OCR
- Scene text recognition
- Handwriting recognition
- Multi-language text extraction

## Limitations

- [Add any known limitations]
- Model performance may vary with image quality
- Best suited for [specify use cases]

## Citation

If you use this model, please cite:

```bibtex
@misc{easy_deepocr,
  author    = {Ming Liu},
  title     = {Easy DeepOCR - VILA-Qwen2-VL-8B},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/pkulium/easy_deepocr}
}
```

## Acknowledgments

- [VILA](https://github.com/NVlabs/VILA) for the architecture
- [Qwen2-VL](https://github.com/QwenLM/Qwen2-VL) for the language model
- SAM and CLIP for vision encoding capabilities