| --- |
| license: apache-2.0 |
| language: |
| - zh |
| --- |
| |
| # Model Card for InternVL3Fangwusha14B |
|
|
| InternVL3Fangwusha14B is a 14B-parameter vision-language model (VLM) fine-tuned from InternVL3-14B. It targets Chinese multimodal understanding, visual reasoning, complex document analysis, table structure parsing, and multi-turn visual dialogue in enterprise and research settings. |
|
|
| ## Model Details |
|
|
| ### Model Description |
|
|
| This model is a large-scale vision-language model built on the InternVL3-14B base architecture. It is fine-tuned to improve Chinese-language cross-modal semantic alignment, fine-grained visual recognition, complex layout understanding, and multimodal reasoning in specialized domains, while keeping inference relatively efficient. |
|
|
| - **Developed by:** Yougen Yuan |
| - **Funded by [optional]:** Personal Research Project |
| - **Shared by [optional]:** Yougen Yuan |
| - **Model type:** Vision-Language Model (VLM), Multimodal Large Language Model |
| - **Language(s) (NLP):** Chinese (Simplified) |
| - **License:** Apache-2.0 |
| - **Finetuned from model [optional]:** InternVL3-14B |
|
|
| ### Model Sources [optional] |
|
|
| - **Repository:** https://huggingface.co/Yougen/InternVL3Fangwusha14B |
| - **Paper [optional]:** [More Information Needed] |
| - **Demo [optional]:** [More Information Needed] |
|
|
| ## Uses |
|
|
| ### Direct Use |
|
|
| This model can be directly used for: |
| - Complex Chinese visual question answering (VQA) |
| - Fine-grained image understanding and detailed description generation |
| - Complex document analysis, table extraction, form parsing and key information mining |
| - Multi-turn interactive visual dialogue and logical reasoning based on images |
| - High-precision OCR combined with semantic understanding of scanned documents and photos |
|
|
| ### Downstream Use [optional] |
|
|
| The model can be further fine-tuned for: |
| - Enterprise-level intelligent document processing and review systems |
| - Professional vertical-domain visual question answering (finance, law, administration) |
| - Multimodal RAG systems supporting image-text hybrid retrieval |
| - AI assistants with deep visual understanding capabilities |
| - Automated report generation based on charts and images |
|
|
| ### Out-of-Scope Use |
|
|
| - Not suitable for high-risk visual tasks (medical diagnosis, autonomous driving, industrial safety) without professional certification |
| - Not intended for generating harmful, illegal, pornographic, violent or privacy-violating multimodal content |
| - Not optimized for non-Chinese languages |
| - Not designed for ultra-specialized scientific images (remote sensing, microscopic, astronomical) without domain adaptation |
|
|
| ## Bias, Risks, and Limitations |
|
|
| - The model may inherit social, cultural and visual biases from the pre-training data of InternVL3 and public multimodal datasets. |
| - It may produce visual hallucinations, misidentification or inconsistent descriptions for blurry, highly reflective or occluded images. |
| - Without domain-specific fine-tuning, performance in highly specialized fields may be limited. |
| - The model cannot independently verify facts and may generate incorrect descriptions or reasoning. |
|
|
| ### Recommendations |
|
|
| - All outputs used in professional or production settings should be reviewed by qualified personnel. |
| - Public deployments should include content-safety and privacy-protection mechanisms. |
| - For high-precision industrial or medical visual tasks, dedicated domain-specific models are preferred. |
| - Users (both direct and downstream) should be made aware of the model's risks, biases, and limitations. |
|
|
| ## How to Get Started with the Model |
|
|
| Use the code below to get started with the model. |
|
|
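| ```python |
| from transformers import AutoModel, AutoTokenizer |
| |
| model_name = "Yougen/InternVL3Fangwusha14B" |
| tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True) |
| model = AutoModel.from_pretrained( |
|     model_name, |
|     device_map="auto", |
|     torch_dtype="auto", |
|     trust_remote_code=True, |
| ).eval() |
| |
| # Example usage. This assumes the checkpoint keeps the chat() interface of the |
| # upstream InternVL3 remote code, which takes a preprocessed pixel_values tensor |
| # and a generation config. `load_image` is not part of transformers: it stands |
| # for an image-preprocessing helper you supply, such as the one shown in the |
| # upstream InternVL3-14B model card. |
| # pixel_values = load_image("your_image.jpg").to(model.dtype).to(model.device) |
| # Prompt: "Please analyze the table data and content in this image in detail." |
| # question = "请详细解析这张图片中的表格数据和内容" |
| # generation_config = dict(max_new_tokens=1024, do_sample=False) |
| # response = model.chat(tokenizer, pixel_values, question, generation_config) |
| # print(response) |
| ``` |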
|
|
| ## Training Details |
|
|
| ### Training Data |
|
|
| The training data includes high-quality Chinese image-text pairs, complex documents, tables, charts, images from specialized domains, and multi-turn instruction-style multimodal dialogues. The data was deduplicated, filtered for noise, and quality-checked. |
|
|
| ### Training Procedure |
|
|
| #### Preprocessing [optional] |
|
|
| - Image resizing, normalization and enhancement |
| - Text cleaning and standardized instruction formatting |
| - Multimodal sequence alignment and tokenization |
| - Filtering low-quality, duplicated or sensitive data |
|
|
| #### Training Hyperparameters |
|
|
| - **Training regime:** bf16 mixed precision |
| - **Learning rate:** 1.5e-5 |
| - **Batch size:** 8 |
| - **Optimizer:** AdamW |
| - **Weight decay:** 0.01 |
| - **Epochs:** 2 |
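
The hyperparameters listed above can be collected into a small configuration object. The sketch below is illustrative only: the `FinetuneConfig` class, its field names, and its helper methods are hypothetical and are not taken from the actual training code. The card does not say whether the batch size of 8 is per device or global, so the helper simply treats it as the number of samples per optimizer step.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneConfig:
    """Hypothetical container for the hyperparameters listed in this card."""

    precision: str = "bf16"        # bf16 mixed precision
    learning_rate: float = 1.5e-5
    batch_size: int = 8            # assumed: samples per optimizer step
    optimizer: str = "AdamW"
    weight_decay: float = 0.01
    epochs: int = 2

    def steps_per_epoch(self, num_samples: int) -> int:
        """Optimizer steps needed to see `num_samples` examples once."""
        return math.ceil(num_samples / self.batch_size)

    def total_steps(self, num_samples: int) -> int:
        """Optimizer steps over all epochs."""
        return self.steps_per_epoch(num_samples) * self.epochs


config = FinetuneConfig()
# The dataset size below is a made-up number used only to exercise the helper.
print(config.total_steps(num_samples=1000))
```

A step count of this kind is typically what a learning-rate scheduler needs; the actual dataset size is not stated in this card.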
|
|
| #### Speeds, Sizes, Times [optional] |
|
|
| - Model size: 14B parameters |
| - Training hardware: NVIDIA A100 / H100 GPU cluster |
| - Training duration: Several days |
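
As a rough guide to hardware requirements, the weight memory implied by the nominal 14B parameter count can be worked out directly. The sketch below is a back-of-the-envelope estimate, not a measurement: it counts weights only and ignores activations, the KV cache, and framework overhead. The function name and the dtype table are hypothetical helpers introduced for illustration.

```python
# Bytes used to store one parameter in each numeric format.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str = "bf16") -> float:
    """Memory needed to hold the weights alone, in GiB (2**30 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in ("fp32", "bf16", "int8"):
    print(f"{dtype}: {weight_memory_gib(14e9, dtype):.1f} GiB")
```

In bf16 this comes to roughly 26 GiB for the weights alone, so actual memory use at inference time will be higher.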
|
|
| ## Evaluation |
|
|
| ### Testing Data, Factors & Metrics |
|
|
| #### Testing Data |
|
|
| Internal Chinese multimodal evaluation set covering VQA, document analysis, table extraction, chart understanding and complex visual reasoning. |
|
|
| #### Factors |
|
|
| Image complexity, layout density, text legibility, domain specialization, and multi-turn dialogue depth. |
|
|
| #### Metrics |
|
|
| - VQA accuracy |
| - Table & structure extraction accuracy |
| - OCR accuracy + semantic consistency |
| - BLEU, CIDEr, ROUGE for generation |
| - Human evaluation of reasonableness and fluency |
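
Of the metrics above, VQA accuracy is the simplest to illustrate. The following is a minimal sketch of exact-match accuracy after light normalization. The normalization rules and the function names are assumptions made for illustration, since this card does not describe the actual evaluation script.

```python
import unicodedata


def normalize_answer(text: str) -> str:
    """Apply NFKC normalization, lowercase, and drop whitespace and punctuation."""
    text = unicodedata.normalize("NFKC", text).lower()
    return "".join(
        ch for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def vqa_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        normalize_answer(p) == normalize_answer(r)
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


# Toy data, invented purely to exercise the function.
preds = ["３ 个", "北京。", "红色"]
refs = ["3个", "北京", "蓝色"]
print(vqa_accuracy(preds, refs))
```

Exact match is strict: a correct answer phrased differently from the reference counts as wrong, which is one reason such a score is usually paired with human evaluation.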
|
|
| ### Results |
|
|
| [More Information Needed] |
|
|
| #### Summary |
|
|
| The model is reported to perform strongly on complex Chinese multimodal understanding and reasoning, and is aimed at demanding enterprise and research vision-language tasks. |
|
|
| ## Model Examination [optional] |
|
|
| [More Information Needed] |
|
|
| ## Environmental Impact |
|
|
| Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). |
|
|
| - **Hardware Type:** NVIDIA A100 / H100 |
| - **Hours used:** [More Information Needed] |
| - **Cloud Provider:** [More Information Needed] |
| - **Compute Region:** [More Information Needed] |
| - **Carbon Emitted:** [More Information Needed] |
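
In essence, an estimate of this kind multiplies the energy drawn by the hardware by the carbon intensity of the local grid. A minimal sketch of that arithmetic follows. Every input value in the usage line is a placeholder chosen only to show the calculation; none of them describes this model's actual training run, and the function name and parameters are hypothetical.

```python
def estimate_co2e_kg(
    gpu_power_watts: float,
    num_gpus: int,
    hours: float,
    grid_kg_co2e_per_kwh: float,
    pue: float = 1.0,
) -> float:
    """Estimate emissions as energy (kWh) times grid carbon intensity.

    `pue` (power usage effectiveness) scales GPU energy up to account for
    data-centre overhead; 1.0 means no overhead is added.
    """
    energy_kwh = gpu_power_watts / 1000 * num_gpus * hours * pue
    return energy_kwh * grid_kg_co2e_per_kwh


# Placeholder inputs only; NOT measurements of this model's training run.
print(estimate_co2e_kg(gpu_power_watts=400, num_gpus=8, hours=10, grid_kg_co2e_per_kwh=0.5))
```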
|
|
| ## Technical Specifications [optional] |
|
|
| ### Model Architecture and Objective |
|
|
| The model uses a vision-language architecture with a high-capacity visual encoder and a large language decoder, based on InternVL3-14B. |
| It is optimized for Chinese cross-modal alignment, fine-grained visual understanding, and complex document reasoning. |
|
|
| ### Compute Infrastructure |
|
|
| #### Hardware |
|
|
| NVIDIA A100 / H100 GPU cluster |
|
|
| #### Software |
|
|
| - PyTorch |
| - Hugging Face Transformers & Accelerate |
| - TorchVision |
| - Pillow |
| - FlashAttention |
|
|
| ## Citation [optional] |
|
|
| **BibTeX:** |
|
|
| [More Information Needed] |
|
|
| **APA:** |
|
|
| [More Information Needed] |
|
|
| ## Glossary [optional] |
|
|
| - **VLM:** Vision-Language Model that unifies visual and language understanding. |
| - **InternVL3:** Vision-language model series developed by OpenGVLab (Shanghai AI Laboratory). |
| - **Multimodal Reasoning:** The ability to perform logical inference based on both image and text. |
|
|
| ## More Information [optional] |
|
|
| For updates and issues, please visit the model repository on Hugging Face Hub. |
|
|
| ## Model Card Authors [optional] |
|
|
| Yougen Yuan |
|
|
| ## Model Card Contact |
|
|
| [More Information Needed] |