pkulium committed on
Commit dce86e8 · verified · 1 Parent(s): ca21765

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +88 -0
README.md ADDED
@@ -0,0 +1,88 @@
+ ---
+ language:
+ - en
+ library_name: transformers
+ pipeline_tag: image-text-to-text
+ tags:
+ - ocr
+ - vision-language
+ - qwen2-vl
+ - vila
+ - multimodal
+ license: apache-2.0
+ ---
+
+ # Easy DeepOCR - VILA-Qwen2-VL-8B
+
+ A vision-language model fine-tuned for OCR tasks, based on the VILA architecture with Qwen2-VL-8B as the language backbone.
+
+ ## Model Description
+
+ This model combines:
+ - **Language Model**: Qwen2-VL-8B
+ - **Vision Encoders**: SAM + CLIP
+ - **Architecture**: VILA (the NVlabs visual language model framework)
+ - **Task**: Optical Character Recognition (OCR)
+
+ ## Model Structure
+ ```
+ easy_deepocr/
+ ├── config.json # Model configuration
+ ├── llm/ # Qwen2-VL-8B language model weights
+ ├── mm_projector/ # Multimodal projection layer
+ ├── sam_clip_ckpt/ # SAM and CLIP vision encoder weights
+ └── trainer_state.json # Training state information
+ ```
+
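The directory layout above can be checked after download with a small helper. This is a minimal sketch, not part of the repository: `missing_entries` and `download_and_check` are hypothetical names, and it assumes the five entries in the tree sit at the repository root. It uses `snapshot_download` from `huggingface_hub` to fetch the files.

```python
from pathlib import Path

# Entries taken from the directory tree in this README.
# Assumption: they live at the top level of the repository.
EXPECTED_ENTRIES = [
    "config.json",
    "llm",
    "mm_projector",
    "sam_clip_ckpt",
    "trainer_state.json",
]


def missing_entries(root, expected=EXPECTED_ENTRIES):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [name for name in expected if not (root / name).exists()]


def download_and_check(repo_id="pkulium/easy_deepocr"):
    """Download a snapshot of the repo and report missing entries (hypothetical helper)."""
    # Imported lazily so missing_entries() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(repo_id=repo_id)
    return local_dir, missing_entries(local_dir)
```

Calling `download_and_check()` needs network access and returns the local snapshot path together with any entries from the tree that were not found, which may help catch an incomplete upload before loading the model.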
+ ## Usage
+ ```python
+ # TODO: Add your inference code here
+ from transformers import AutoModel, AutoTokenizer
+
+ model = AutoModel.from_pretrained("pkulium/easy_deepocr", trust_remote_code=True)
+ tokenizer = AutoTokenizer.from_pretrained("pkulium/easy_deepocr")
+
+ # Example inference
+ # image = ...
+ # text = ...
+ ```
+
+ ## Training Details
+
+ - **Base Model**: Qwen2-VL-8B
+ - **Vision Encoders**: SAM + CLIP
+ - **Training Framework**: VILA
+ - **Training Type**: Pretraining for OCR tasks
+
+ ## Intended Use
+
+ This model is designed for:
+ - Document OCR
+ - Scene text recognition
+ - Handwriting recognition
+ - Multi-language text extraction
+
+ ## Limitations
+
+ - [Add any known limitations]
+ - Model performance may vary with image quality
+ - Best suited for [specify use cases]
+
+ ## Citation
+
+ If you use this model, please cite:
+ ```bibtex
+ @misc{easy_deepocr,
+ author = {Ming Liu},
+ title = {Easy DeepOCR - VILA-Qwen2-VL-8B},
+ year = {2025},
+ publisher = {HuggingFace},
+ url = {https://huggingface.co/pkulium/easy_deepocr}
+ }
+ ```
+
+ ## Acknowledgments
+
+ - [VILA](https://github.com/NVlabs/VILA) for the architecture
+ - [Qwen2-VL](https://github.com/QwenLM/Qwen2-VL) for the language model
+ - SAM and CLIP for vision encoding capabilities