---
language:
- en
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- ocr
- vision-language
- qwen2-vl
- vila
- multimodal
license: apache-2.0
---

# Easy DeepOCR - VILA-Qwen2-VL-8B

A vision-language model fine-tuned for OCR tasks, built on the VILA architecture with Qwen2-VL-8B as the language backbone.

## Model Description

This model combines:

- **Language Model**: Qwen2-VL-8B
- **Vision Encoders**: SAM + CLIP
- **Architecture**: VILA (NVIDIA's visual language model framework)
- **Task**: Optical Character Recognition (OCR)

## Model Structure

```
easy_deepocr/
├── config.json          # Model configuration
├── llm/                 # Qwen2-VL-8B language model weights
├── mm_projector/        # Multimodal projection layer
├── sam_clip_ckpt/       # SAM and CLIP vision encoder weights
└── trainer_state.json   # Training state information
```

## Usage

Loading goes through the custom modeling code shipped with the checkpoint (`trust_remote_code=True`); the exact inference call is defined by that code.

```python
from transformers import AutoModel, AutoTokenizer

# trust_remote_code is required: the VILA model class is not built into transformers.
model = AutoModel.from_pretrained("pkulium/easy_deepocr", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("pkulium/easy_deepocr")

# Example inference (the exact method depends on the checkpoint's custom code):
# from PIL import Image
# image = Image.open("document.png").convert("RGB")
# text = "Extract all text from this image."
# output = ...  # e.g. a chat/generate method exposed by the custom model class
```

## Training Details

- **Base Model**: Qwen2-VL-8B
- **Vision Encoders**: SAM + CLIP
- **Training Framework**: VILA
- **Training Type**: Pretraining for OCR tasks

## Intended Use

This model is designed for:

- Document OCR
- Scene text recognition
- Handwriting recognition
- Multi-language text extraction

## Limitations

- [Add any known limitations]
- Model performance may vary with image quality
- Best suited for [specify use cases]

## Citation

If you use this model, please cite:

```bibtex
@misc{easy_deepocr,
  author = {Ming Liu},
  title = {Easy DeepOCR - VILA-Qwen2-VL-8B},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/pkulium/easy_deepocr}
}
```

## Acknowledgments

- [VILA](https://github.com/NVlabs/VILA) for the architecture
- [Qwen2-VL](https://github.com/QwenLM/Qwen2-VL) for the language model
- SAM and CLIP for vision encoding capabilities
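Since the Usage section above loads the model via `trust_remote_code`, the inference API is ultimately defined by the checkpoint's custom code. Because the language backbone is Qwen2-VL, prompts will typically follow Qwen2-VL's chat-message layout; the sketch below only builds such an image-plus-instruction request (field names follow upstream Qwen2-VL examples and are an assumption — this checkpoint's custom code may expect a different format):

```python
# Build a Qwen2-VL-style chat message pairing one image with an OCR instruction.
# This constructs the request structure only; it does not call the model.

def build_ocr_message(image_path: str, instruction: str) -> list:
    """Return a single-turn chat message list in Qwen2-VL's usual layout."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": instruction},
            ],
        }
    ]

messages = build_ocr_message("document.png", "Extract all text from this image.")
print(messages[0]["role"])  # → user
```

In the upstream Qwen2-VL workflow, a message list like this is rendered into a prompt with the processor's `apply_chat_template` before generation; whether this checkpoint follows the same path depends on its custom code.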