---
license: apache-2.0
language:
- zh
---

# Model Card for InternVL3Fangwusha14B

InternVL3Fangwusha14B is a 14B-parameter vision-language model (VLM) fine-tuned from InternVL3-14B, focused on high-performance Chinese multimodal understanding, deep visual reasoning, complex document analysis, table structure parsing, and multi-turn interactive visual dialogue for enterprise and advanced research scenarios.

## Model Details

### Model Description

This model is a large-scale vision-language model built on the InternVL3-14B base architecture. It is fine-tuned to significantly improve cross-modal semantic alignment, fine-grained visual recognition, complex layout understanding, and professional-scene multimodal reasoning in Chinese. It provides powerful generation and reasoning capabilities while maintaining relatively efficient inference.

- **Developed by:** Yougen Yuan
- **Funded by:** Personal Research Project
- **Shared by:** Yougen Yuan
- **Model type:** Vision-Language Model (VLM), Multimodal Large Language Model
- **Language(s) (NLP):** Chinese (Simplified)
- **License:** Apache-2.0
- **Finetuned from model:** InternVL3-14B

### Model Sources

- **Repository:** https://huggingface.co/Yougen/InternVL3Fangwusha14B
- **Paper:** [More Information Needed]
- **Demo:** [More Information Needed]

## Uses

### Direct Use

This model can be directly used for:
- Complex Chinese visual question answering (VQA)
- Fine-grained image understanding and detailed description generation
- Complex document analysis, table extraction, form parsing, and key information mining
- Multi-turn interactive visual dialogue and image-grounded logical reasoning
- High-precision OCR combined with deep semantic understanding for scanned documents and photos

### Downstream Use

The model can be further fine-tuned for:
- Enterprise-level intelligent document processing and review systems
- Professional vertical-domain visual question answering (finance, law, administration)
- Multimodal RAG systems supporting mixed image-text retrieval
- AI assistants with deep visual understanding capabilities
- Automated report generation from charts and images

### Out-of-Scope Use

- Not suitable for safety-critical visual tasks (e.g., medical diagnosis, autonomous driving, industrial safety) without professional certification and human oversight
- Not intended for generating harmful, illegal, pornographic, violent, or privacy-violating multimodal content
- Not optimized for languages other than Chinese
- Not designed for highly specialized scientific imagery (remote sensing, microscopy, astronomy) without domain adaptation

## Bias, Risks, and Limitations

- The model may inherit social, cultural, and visual biases from the pre-training data of InternVL3 and from public multimodal datasets.
- It may produce visual hallucinations, misidentifications, or inconsistent descriptions for blurry, highly reflective, or occluded images.
- Without domain fine-tuning, performance in highly specialized fields may be limited.
- The model cannot independently verify facts and may generate incorrect descriptions or reasoning.

### Recommendations

- All outputs in professional or production scenarios should be reviewed by qualified personnel.
- Content-safety and privacy-protection mechanisms are strongly recommended for any public deployment.
- Dedicated specialist models are preferable for high-precision industrial or medical visual tasks.
- Users (both direct and downstream) should be made aware of the model's risks, biases, and limitations.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoModel, AutoTokenizer

model_name = "Yougen/InternVL3Fangwusha14B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True
).eval()

# Example usage (see the preprocessing and chat sketch below):
# image = load_image("your_image.jpg")
# question = "请详细解析这张图片中的表格数据和内容"
#            (i.e., "Please analyze in detail the table data and content in this image")
# response = model.chat(tokenizer, image, question)
# print(response)
```
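
The commented example above leaves image loading unspecified. Below is a minimal, hedged sketch of single-image preprocessing and a chat call in the style of the InternVL series examples: the 448×448 input size, the ImageNet normalization statistics, and the `chat(tokenizer, pixel_values, question, generation_config)` signature are assumptions based on the upstream InternVL3 remote code, and the dynamic high-resolution tiling used upstream is omitted for brevity.

```python
# Continues from the loading snippet above (model and tokenizer already created).
import torch
from PIL import Image
from torchvision import transforms

# Assumed InternVL-style preprocessing: a single 448x448 tile with ImageNet statistics.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
transform = transforms.Compose([
    transforms.Resize((448, 448), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])

image = Image.open("your_image.jpg").convert("RGB")
pixel_values = transform(image).unsqueeze(0).to(torch.bfloat16).cuda()

# The <image> placeholder and the chat() helper come from the model's
# trust_remote_code implementation (assumed to follow the InternVL3 interface).
question = "<image>\n请详细解析这张图片中的表格数据和内容"
generation_config = dict(max_new_tokens=1024, do_sample=False)
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```

For multi-image inputs or multi-turn dialogue, refer to the upstream InternVL3 documentation; the exact interface exposed by this fine-tune has not been published.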

## Training Details

### Training Data

Training data includes high-quality Chinese image-text pairs, complex documents, tables, charts, professional scene images, and multi-turn instruction-based multimodal dialogues. The data was deduplicated, filtered for noise, and quality-controlled.

### Training Procedure

#### Preprocessing

- Image resizing, normalization, and augmentation
- Text cleaning and standardized instruction formatting (see the record sketch below)
- Multimodal sequence alignment and tokenization
- Filtering of low-quality, duplicated, or sensitive data

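As an illustration of the "standardized instruction formatting" step, the sketch below shows one plausible multimodal instruction record in the ShareGPT-style layout used by several open-source VLM fine-tuning pipelines, including the upstream InternVL fine-tuning scripts. The exact schema, field names, and prompts used to train this model have not been published, so treat this purely as a hypothetical example.

```python
# Hypothetical training record: field names follow the common ShareGPT-style
# convention, not a published specification for this model.
example_record = {
    "image": "documents/invoice_0001.jpg",  # illustrative relative image path
    "conversations": [
        {
            "from": "human",
            # "<image>" marks where visual tokens are inserted; the prompt asks
            # the model to extract the key fields of the invoice table.
            "value": "<image>\n请提取这张发票中的表格字段并以键值对形式输出。",
        },
        {
            "from": "gpt",
            # Illustrative target answer in key-value form.
            "value": "发票号码: 12345678\n开票日期: 2024-05-01\n价税合计: ¥1,280.00",
        },
    ],
}
```
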
#### Training Hyperparameters

- **Training regime:** bf16 mixed precision
- **Learning rate:** 1.5e-5
- **Batch size:** 8
- **Optimizer:** AdamW
- **Weight decay:** 0.01
- **Epochs:** 2

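For readers reproducing a similar fine-tune, the hyperparameters listed above map onto a Hugging Face `TrainingArguments` configuration roughly as follows. This is a sketch under the assumption that a Transformers `Trainer`-style loop is used; the actual training framework, gradient-accumulation setup, and learning-rate schedule for this model are not specified in this card.

```python
from transformers import TrainingArguments

# Assumed mapping of the listed hyperparameters onto a Trainer-style config.
training_args = TrainingArguments(
    output_dir="./internvl3-fangwusha-14b-sft",  # hypothetical output path
    bf16=True,                       # bf16 mixed precision
    learning_rate=1.5e-5,            # learning rate
    per_device_train_batch_size=8,   # batch size (per device; global size unspecified)
    optim="adamw_torch",             # AdamW optimizer
    weight_decay=0.01,               # weight decay
    num_train_epochs=2,              # epochs
)
```
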
#### Speeds, Sizes, Times

- Model size: 14B parameters
- Training hardware: NVIDIA A100 / H100 GPU cluster
- Training duration: several days

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

Internal Chinese multimodal evaluation set covering VQA, document analysis, table extraction, chart understanding, and complex visual reasoning.

#### Factors

Image complexity, layout density, text legibility, degree of domain specialization, and multi-turn dialogue depth.

#### Metrics

- VQA accuracy
- Table and structure extraction accuracy
- OCR accuracy and semantic consistency
- BLEU, CIDEr, and ROUGE for generation (see the sketch below)
- Human evaluation of answer plausibility and fluency

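The evaluation scripts behind this card are not released; as one plausible way to compute the listed text-generation metrics, the sketch below uses the Hugging Face `evaluate` library. The library choice and the `tokenize="zh"` option for SacreBLEU are assumptions, and CIDEr (not shown) is typically computed with `pycocoevalcap` rather than `evaluate`.

```python
import evaluate

# Hypothetical prediction / reference pair for a generated table description.
predictions = ["图中表格共有三列，分别为日期、金额和备注。"]
references = [["表格包含日期、金额、备注三列。"]]

# SacreBLEU with its built-in Chinese tokenizer; whitespace-based BLEU
# tokenization would be meaningless for Chinese text.
sacrebleu = evaluate.load("sacrebleu")
result = sacrebleu.compute(predictions=predictions, references=references, tokenize="zh")
print("BLEU:", result["score"])

# ROUGE is computed analogously but needs a Chinese-aware tokenizer to be
# meaningful; CIDEr is usually taken from pycocoevalcap.
```
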
### Results

[More Information Needed]

#### Summary

The model delivers strong performance on complex Chinese multimodal understanding and reasoning, making it suitable for demanding enterprise and advanced research vision-language tasks.

## Model Examination

[More Information Needed]

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** NVIDIA A100 / H100
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]

## Technical Specifications

### Model Architecture and Objective

A vision-language architecture based on InternVL3-14B, pairing a high-capacity visual encoder with a large language decoder. It is optimized for Chinese cross-modal alignment, fine-grained visual understanding, and complex document reasoning.

### Compute Infrastructure

#### Hardware

NVIDIA high-performance GPU cluster with large VRAM

#### Software

- PyTorch
- Hugging Face Transformers & Accelerate
- TorchVision
- Pillow
- FlashAttention

## Citation

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary

- **VLM:** Vision-Language Model, a model that unifies visual and language understanding.
- **InternVL3:** The third generation of the InternVL vision-language model series developed by OpenGVLab (Shanghai AI Laboratory).
- **Multimodal Reasoning:** The ability to perform logical inference over both images and text.

## More Information

For updates and issues, please visit the model repository on the Hugging Face Hub.

## Model Card Authors

Yougen Yuan

## Model Card Contact

[More Information Needed]