| --- |
| license: apache-2.0 |
| language: |
| - zh |
| --- |
| |
| # Model Card for Yougen/mm_singletask |
| |
| <!-- Provide a quick summary of what the model is/does. --> |
| |
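| `Yougen/mm_singletask` is a single-task multimodal model for Chinese image captioning, tuned for natural Chinese phrasing. It uses an encoder-decoder architecture and was trained on large-scale Chinese image-text data to generate accurate, fluent, grammatical Chinese descriptions of natural-scene images. The author describes its performance on Chinese image captioning benchmarks as strong; quantitative scores are not yet listed in this card. |
| |
| ## Model Details |
| |
| ### Model Description |
| |
| <!-- Provide a longer summary of what this model is. --> |
| |
| This is a single-task multimodal model designed for Chinese image captioning: it converts visual information into Chinese natural-language descriptions. A Vision Transformer serves as the image encoder and extracts multi-scale visual features, a Chinese pretrained language model serves as the text decoder and generates the caption, and a cross-modal attention mechanism aligns image and text. According to the author, this single-task focus gives higher captioning accuracy and more fluent output than general-purpose multi-task models. |
| |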
| - **Developed by:** Yougen (袁有根) |
| - **Funded by [optional]:** [More Information Needed] |
| - **Shared by [optional]:** Yougen (袁有根) |
| - **Model type:** Multimodal Single-Task Image Captioning Transformer |
| - **Language(s) (NLP):** Chinese (zh) |
| - **License:** Apache-2.0 |
| - **Finetuned from model [optional]:** [More Information Needed] |
| |
| ### Model Sources [optional] |
| |
| <!-- Provide the basic links for the model. --> |
| |
| - **Repository:** https://huggingface.co/Yougen/mm_singletask |
| - **Paper [optional]:** [More Information Needed] |
| - **Demo [optional]:** [More Information Needed] |
| |
| ## Uses |
| |
| <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. --> |
| |
| ### Direct Use |
| |
| <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. --> |
| |
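| This model can be used directly for Chinese image captioning without additional fine-tuning: |
| - General-scene captioning: generate Chinese descriptions for images of daily life, natural landscapes, human activities, and similar subjects |
| - Content management systems: automatically generate tags and descriptions for image libraries |
| - Accessibility: supply descriptions of image content that a text-to-speech system can read aloud to visually impaired users |
| - Social media: automatically generate captions for posted images |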
| |
| ### Downstream Use [optional] |
| |
| <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app --> |
| |
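| This model can serve as a base model for further fine-tuning toward specific domains and scenarios: |
| - E-commerce: automatic product image descriptions, product attribute extraction |
| - Media: automatic captions for news photos, content summaries of video frames |
| - Education: explanations of textbook illustrations, automatic annotation of teaching resources |
| - Security: descriptions of abnormal events in surveillance footage |
| - Healthcare: preliminary report generation for medical images (requires fine-tuning on specialist medical data) |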
| |
| ### Out-of-Scope Use |
| |
| <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. --> |
| |
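| This model is not suitable for: |
| - Other multimodal tasks such as visual question answering or image-text retrieval (the model is single-task by design) |
| - Domains that demand very high precision and professional qualifications, such as medical diagnosis or legal document drafting |
| - Generating harmful, false, or illegal content, or content that infringes on the rights of others |
| - Image captioning in languages other than Chinese |
| - Input images that are extremely blurry, badly corrupted, heavily occluded, or incomplete |
| - Generating content on sensitive political, religious, or racial topics |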
| |
| ## Bias, Risks, and Limitations |
| |
| <!-- This section is meant to convey both technical and sociotechnical limitations. --> |
| |
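| ### Technical Limitations |
| 1. The model is single-task by design: it supports image captioning only and no other multimodal tasks |
| 2. The training data mainly covers general scenes, so caption accuracy may drop for niche domains, rare objects, or specialist settings |
| 3. Performance is poor on images that are low-resolution, badly lit, heavily occluded, or motion-blurred |
| 4. Generated captions may omit or misstate details, and may contain logical errors in complex scenes |
| 5. Output length is limited, so the model cannot produce very long, detailed descriptions |
| |
| ### Social Bias and Risks |
| 1. The model may inherit social biases present in its training data and may produce inappropriate output on sensitive topics such as gender, race, region, or occupation |
| 2. The model may generate factually incorrect content, so outputs should be reviewed by a human |
| 3. The model may be misused to generate false or misleading content |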
| |
| ### Recommendations |
| |
| <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. --> |
| |
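| Users (both direct users and downstream developers) should understand this model's single-task nature, risks, biases, and limitations. Before deploying the model in production, test and validate it thoroughly for the specific application. We recommend attaching an appropriate disclaimer to model outputs and setting up a human review process. Users must also comply with applicable laws, regulations, and ethical guidelines, and must not use the model for any illegal or unethical purpose. |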
| |
| ## How to Get Started with the Model |
| |
| Use the code below to get started with the model. |
| |
| ```python |
| from transformers import AutoProcessor, AutoModelForCausalLM |
| import torch |
| from PIL import Image |
| |
| # Load the processor and the model |
| processor = AutoProcessor.from_pretrained("Yougen/mm_singletask") |
| model = AutoModelForCausalLM.from_pretrained( |
|     "Yougen/mm_singletask", |
|     torch_dtype=torch.bfloat16, |
|     device_map="auto" |
| ) |
| |
| # Load the image |
| image = Image.open("example.jpg").convert("RGB") |
| |
| # Preprocess the inputs. The prompt stays in Chinese because the model |
| # generates Chinese captions ("Generate a Chinese description of this image:"). |
| inputs = processor( |
|     images=image, |
|     text="生成这张图片的中文描述:", |
|     return_tensors="pt" |
| ).to(model.device, torch.bfloat16)  # cast floating-point inputs to match the bf16 weights |
| |
| # Generate the caption with beam search. |
| # temperature and top_p only take effect when do_sample=True, so they are omitted. |
| with torch.no_grad(): |
|     outputs = model.generate( |
|         **inputs, |
|         max_new_tokens=80, |
|         num_beams=5, |
|         repetition_penalty=1.2 |
|     ) |
| |
| # Decode the output |
| caption = processor.decode(outputs[0], skip_special_tokens=True) |
| print("Caption:", caption) |
| ``` |
| |
| ## Training Details |
| |
| ### Training Data |
| |
| <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. --> |
| |
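| The model was trained on large-scale Chinese image captioning datasets, mainly: |
| - A Chinese-language COCO image captioning dataset |
| - A Chinese-language Flickr30k image captioning dataset |
| - A Chinese general-scene image-text dataset |
| |
| The training data went through a strict cleaning and filtering pipeline that removed low-quality, duplicated, blurry, and harmful content. The captions were also normalized, to support the quality and diversity of the training data. |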
| |
| ### Training Procedure |
| |
| <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. --> |
| |
| #### Preprocessing [optional] |
| |
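| - **Image preprocessing**: images are resized to a fixed size, augmented with operations such as random cropping and horizontal flipping, normalized, and converted to the tensor format the model expects |
| - **Text preprocessing**: captions are tokenized with a Chinese tokenizer, wrapped with special tokens, truncated and padded, and converted to the tensor format the model expects |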
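
The two preprocessing steps above can be sketched roughly as follows. This is an illustrative sketch, not the author's pipeline: the crop size, the ImageNet normalization statistics, the special-token ids, and the function names `augment_image` and `pad_or_truncate` are all assumptions, since the card does not state them. The image is assumed to be already resized to at least the crop size.

```python
import random

import numpy as np

# Assumed values: the card does not state the crop size or normalization statistics.
CROP_SIZE = 224
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)  # ImageNet channel means
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)   # ImageNet channel stds


def augment_image(image, rng, crop_size=CROP_SIZE):
    """Random crop, random horizontal flip, then normalize an HxWx3 uint8 array."""
    height, width, _ = image.shape
    top = rng.randint(0, height - crop_size)   # randint is inclusive on both ends
    left = rng.randint(0, width - crop_size)
    crop = image[top:top + crop_size, left:left + crop_size]
    if rng.random() < 0.5:
        crop = crop[:, ::-1]                   # flip along the width axis
    scaled = crop.astype(np.float32) / 255.0
    normalized = (scaled - MEAN) / STD
    return normalized.transpose(2, 0, 1)       # HWC -> CHW


def pad_or_truncate(token_ids, max_length, bos_id, eos_id, pad_id):
    """Add BOS/EOS, truncate to max_length, right-pad; returns (ids, attention_mask)."""
    body = token_ids[: max_length - 2]         # leave room for the two special tokens
    ids = [bos_id] + body + [eos_id]
    mask = [1] * len(ids)
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding
```

Random cropping and flipping would normally be applied only during training; at inference time a deterministic resize or center crop is the usual choice.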
| |
| #### Training Hyperparameters |
| |
| - **Training regime:** bf16 mixed precision |
| - **Batch size:** [More Information Needed] |
| - **Learning rate:** [More Information Needed] |
| - **Epochs:** [More Information Needed] |
| - **Optimizer:** AdamW |
| - **Weight decay:** [More Information Needed] |
| - **Warmup steps:** [More Information Needed] |
| - **Gradient accumulation steps:** [More Information Needed] |
| |
| #### Speeds, Sizes, Times [optional] |
| |
| <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. --> |
| |
| - **Model size:** [More Information Needed] parameters |
| - **Training time:** [More Information Needed] hours |
| - **Checkpoint size:** [More Information Needed] GB |
| - **Inference speed:** [More Information Needed] samples/sec (on NVIDIA A100 80GB) |
| |
| ## Evaluation |
| |
| <!-- This section describes the evaluation protocols and provides the results. --> |
| |
| ### Testing Data, Factors & Metrics |
| |
| #### Testing Data |
| |
| The model was evaluated on the following Chinese image captioning benchmarks: |
| - COCO Chinese validation set |
| - Flickr30k Chinese test set |
| |
| #### Factors |
| |
| <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. --> |
| |
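| Evaluation is broken down along the following dimensions: |
| - Image type: natural landscapes, human activities, objects, buildings, animals, and so on |
| - Caption length: short (<10 characters), medium (10-30 characters), long (>30 characters) |
| - Scene complexity: simple, moderately complex, and complex scenes |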
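
The caption-length buckets above can be expressed as a small helper. This is a sketch: the function names are hypothetical, and counting every Unicode character (punctuation included) with `len` is an assumption about how the lengths are measured.

```python
from collections import defaultdict


def length_bucket(caption):
    """Map a caption to the length bucket used in the evaluation breakdown."""
    n = len(caption)  # assumption: every character counts, punctuation included
    if n < 10:
        return "short"
    if n <= 30:
        return "medium"
    return "long"


def group_by_length(captions):
    """Group captions by bucket so metrics can be computed per bucket."""
    buckets = defaultdict(list)
    for caption in captions:
        buckets[length_bucket(caption)].append(caption)
    return dict(buckets)
```

Grouping the reference captions this way lets each metric be reported per bucket as well as in aggregate.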
| |
| #### Metrics |
| |
| <!-- These are the evaluation metrics being used, ideally with a description of why. --> |
| |
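| The standard image captioning metrics are used: |
| - **BLEU-1/2/3/4**: measures n-gram overlap between the generated text and the reference texts |
| - **CIDEr**: a consensus-based metric designed specifically for image captioning |
| - **ROUGE-L**: a metric based on the longest common subsequence |
| - **SPICE**: a metric based on semantic scene-graph matching that emphasizes semantic accuracy |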
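
To make the two simplest metrics concrete, here is a minimal sketch of sentence-level BLEU and ROUGE-L. It is not the evaluation code used for this model. It assumes character-level tokens (the card does not say how Chinese captions are tokenized for scoring), applies no smoothing, and uses `beta=1.2` for ROUGE-L as an assumed default. Reported numbers should come from an established evaluation toolkit.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU: clipped n-gram precision with brevity penalty, no smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives a zero score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty against the reference length closest to the candidate length.
    ref_len = min((len(r) for r in references),
                  key=lambda length: (abs(length - len(candidate)), length))
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.2):
    """ROUGE-L F-score from LCS-based precision and recall."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```

For example, `bleu(list("猫在沙发上"), [list("一只猫坐在沙发上")], max_n=1)` scores a candidate whose characters all appear in the reference but which is shorter, so only the brevity penalty lowers the score.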
| |
| ### Results |
| |
| [More Information Needed] |
| |
| #### Summary |
| |
| [More Information Needed] |
| |
| ## Model Examination [optional] |
| |
| <!-- Relevant interpretability work for the model goes here --> |
| |
| [More Information Needed] |
| |
| ## Environmental Impact |
| |
| <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly --> |
| |
| Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). |
| |
| - **Hardware Type:** [More Information Needed] |
| - **Hours used:** [More Information Needed] |
| - **Cloud Provider:** [More Information Needed] |
| - **Compute Region:** [More Information Needed] |
| - **Carbon Emitted:** [More Information Needed] |
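
Once the fields above are filled in, a back-of-the-envelope estimate can be computed as energy used times the carbon intensity of the grid. The helper below is a rough sketch of that arithmetic, not the calculator's exact method. The function name, its parameters, and the optional `pue` (data-center overhead) factor are assumptions, and any numbers passed to it are placeholders rather than figures for this model.

```python
def estimate_co2_kg(gpu_power_watts, num_gpus, hours,
                    carbon_intensity_kg_per_kwh, pue=1.0):
    """Rough CO2eq estimate in kilograms.

    energy (kWh) = power (W) * GPUs * hours / 1000 * PUE
    emissions (kg CO2eq) = energy (kWh) * grid carbon intensity (kg CO2eq / kWh)
    """
    energy_kwh = gpu_power_watts * num_gpus * hours / 1000.0 * pue
    return energy_kwh * carbon_intensity_kg_per_kwh
```

The estimate ignores CPU, memory, and networking power, so it is a lower bound unless `pue` is set to reflect the facility's overhead.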
| |
| ## Technical Specifications [optional] |
| |
| ### Model Architecture and Objective |
| |
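| The model uses an encoder-decoder architecture: |
| - **Image encoder**: based on the Vision Transformer (ViT) architecture; extracts multi-scale visual features from the image |
| - **Text decoder**: based on a Chinese pretrained language model; generates the caption autoregressively |
| - **Cross-modal attention layers**: provide bidirectional interaction and alignment between image features and text features |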
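
As a generic illustration of what a cross-modal attention layer computes, the sketch below implements single-head scaled dot-product cross-attention in which text positions query image patch features. It is not the model's actual layer. The dimensions, the weight matrices, and the function names are assumptions, and only the text-to-image direction is shown.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(text_states, image_states, w_q, w_k, w_v):
    """Text tokens (queries) attend over image patches (keys and values).

    text_states:  (num_tokens, text_dim)
    image_states: (num_patches, image_dim)
    Returns the attended features (num_tokens, head_dim) and the
    attention weights (num_tokens, num_patches).
    """
    queries = text_states @ w_q
    keys = image_states @ w_k
    values = image_states @ w_v
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    weights = softmax(scores, axis=-1)  # each text token's weights sum to 1
    return weights @ values, weights
```

Each row of the returned weights shows how strongly one text position draws on each image patch, which is one way to inspect image-text alignment.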
| |
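| The training objective is the autoregressive language modeling loss: model parameters are optimized by maximizing the probability of the correct caption text. |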
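
Written out, this objective is the standard negative log-likelihood of the caption given the image. The notation here is an assumption introduced for illustration: $I$ is the input image, $y_1, \dots, y_T$ are the caption tokens, and $\theta$ are the model parameters.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t},\, I\right)
```

Minimizing $\mathcal{L}(\theta)$ is equivalent to maximizing the probability the model assigns to the reference caption, one token at a time.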
| |
| ### Compute Infrastructure |
| |
| [More Information Needed] |
| |
| #### Hardware |
| |
| - Training hardware: NVIDIA A100 80GB GPU |
| - Inference hardware: CUDA-capable NVIDIA GPUs (A100, L40, L20, RTX 3090/4090, and similar are recommended) |
| |
| #### Software |
| |
| - Deep learning framework: PyTorch 2.0+ |
| - Model library: Transformers 4.35+ |
| - Data processing libraries: Datasets 2.14+, Pillow 10.0+ |
| - Other dependencies: torchvision, numpy, tqdm, scikit-learn, and others |
| |
| ## Citation [optional] |
| |
| <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. --> |
| |
| **BibTeX:** |
| ```bibtex |
| @misc{yougen2026mmsingletask, |
| author = {Yougen Yuan}, |
| title = {mm_singletask: A Chinese Single-Task Image Captioning Model}, |
| year = {2026}, |
| publisher = {Hugging Face}, |
| howpublished = {\url{https://huggingface.co/Yougen/mm_singletask}} |
| } |
| ``` |
| |
| **APA:** |
| Yuan, Y. (2026). *mm_singletask: A Chinese Single-Task Image Captioning Model*. Hugging Face. https://huggingface.co/Yougen/mm_singletask |
| |
| ## Glossary [optional] |
| |
| <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. --> |
| |
| [More Information Needed] |
| |
| ## More Information [optional] |
| |
| [More Information Needed] |
| |
| ## Model Card Authors [optional] |
| |
| Yougen (袁有根) |
| |
| ## Model Card Contact |
| |
| - Hugging Face: https://huggingface.co/Yougen |
| - GitHub: [More Information Needed] |
| - Email: [More Information Needed] |