Improve model card metadata and README structure
README.md
CHANGED
@@ -2,9 +2,22 @@
 license: apache-2.0
 library_name: transformers
 pipeline_tag: zero-shot-image-classification
+language:
+- zh
+- en
+datasets:
+- BM-6B
+- ImageNet
+- ImageNet-CN
+- Flickr30K
+- Flickr30K-CN
+- COCO-CN
 tags:
+- onnx
+- feature-extraction
 - multimodal
 - image-text-retrieval
+- zero-shot-image-classification
 - bilingual
 - chinese
 - english
@@ -12,30 +25,47 @@ tags:
 - custom-code
 ---
 
-# M2-Encoder-0.4B
-
-- `AutoProcessor.from_pretrained(..., trust_remote_code=True)`
-- `AutoModel.from_pretrained(..., trust_remote_code=True)`
-- Zero-shot image-text retrieval and zero-shot image classification
-
-##
-
-##
+# M2-Encoder-0.4B
+
+`M2-Encoder-0.4B` is a Hugging Face export of the bilingual vision-language foundation model from the paper [M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining](https://arxiv.org/abs/2401.15896).
+
+It supports Chinese-English image-text retrieval, zero-shot image classification, `transformers` remote-code loading, ONNXRuntime inference, and Hugging Face Inference Endpoints via the bundled `handler.py`.
+
+This is the smallest published M2-Encoder variant and is the best starting point for CPU demos, Spaces, and lightweight retrieval services.
+
+## Links
+
+- Paper: https://arxiv.org/abs/2401.15896
+- Official code: https://github.com/alipay/Ant-Multi-Modal-Framework/tree/main/prj/M2_Encoder
+- ModelScope source model: `M2Cognition/M2-Encoder`
+- Hugging Face repo: `malusama/M2-Encoder-0.4B`
+- Hugging Face Space demo: https://huggingface.co/spaces/malusama/M2-Encoder-0.4B-Space
+
+## At A Glance
+
+| Item | Value |
+| --- | --- |
+| Variant | `M2-Encoder-0.4B` |
+| Languages | Chinese, English |
+| Embedding dimension | `768` |
+| Image size | `224` |
+| Main tasks | Image-text retrieval, zero-shot image classification, bilingual feature extraction |
+| Weight format | `safetensors` |
+| ONNX export | `onnx/text_encoder.onnx`, `onnx/image_encoder.onnx` |
+
+## Files In This Repo
+
+- `m2_encoder_0.4B.safetensors`: main `transformers` weight file
+- `onnx/text_encoder.onnx`: text embedding encoder
+- `onnx/image_encoder.onnx`: image embedding encoder
+- `examples/run_onnx_inference.py`: runnable ONNX example
+- `handler.py`: custom handler for Hugging Face Inference Endpoints
+
+## Transformers Usage
 
 ### ModelScope-equivalent scoring
 
-The original ModelScope sample computes probabilities from
+The original ModelScope sample computes probabilities from raw normalized embedding dot products:
 
 ```python
 from transformers import AutoModel, AutoProcessor
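The demo body between this `import` and the `print(probs)` context line in the next hunk is unchanged, so the diff elides it. Below is a minimal sketch of the raw dot-product scoring that the rewritten line describes, assuming the repo id from the Links section, a CLIP-style processor call, and output attributes named `image_embeds` / `text_embeds` (assumptions this diff does not confirm):

```python
# Sketch only: attribute names and processor signature are assumed, not confirmed.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

repo = "malusama/M2-Encoder-0.4B"
model = AutoModel.from_pretrained(repo, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)

image = Image.open("pokemon.jpeg")
texts = ["杰尼龟", "妙蛙种子", "小火龙", "皮卡丘"]  # Squirtle, Bulbasaur, Charmander, Pikachu

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# ModelScope-equivalent scoring: L2-normalize both embeddings, take raw dot
# products with no learned temperature, then softmax over the candidate texts.
image_embeds = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
text_embeds = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
probs = (image_embeds @ text_embeds.T).softmax(dim=-1)
print(probs)
```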
@@ -61,16 +91,16 @@ print(probs)
 ### CLIP-style logits
 
 `model(**inputs)` also returns `logits_per_image` and `logits_per_text`, which use the model's learned `logit_scale`.
-Those logits are useful, but they are not the same computation as the raw dot product in the original ModelScope demo.
+Those logits are useful, but they are not the same computation as the raw dot product used in the original ModelScope demo.
 
-##
+## ONNXRuntime Usage
 
 This repo also includes two ONNX exports:
 
 - `onnx/text_encoder.onnx`
 - `onnx/image_encoder.onnx`
 
+Minimal example:
 
 ```python
 import importlib
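For the `### CLIP-style logits` paragraph above, the standard CLIP relationship between the two scoring paths is sketched below; whether this remote-code export follows the convention exactly is an assumption:

```python
import torch

def clip_style_logits(image_embeds: torch.Tensor,
                      text_embeds: torch.Tensor,
                      logit_scale: torch.Tensor) -> torch.Tensor:
    # Standard CLIP convention (assumed here): logit_scale stores a learned
    # log-temperature, and logits are temperature-scaled cosine similarities.
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
    return logit_scale.exp() * image_embeds @ text_embeds.T
```

Because `logit_scale.exp()` is a positive scalar, the text ranking matches the raw dot products; only the softmax sharpness differs, which is why the two paths yield different probabilities.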
@@ -132,48 +162,17 @@ image_embeds = image_session.run(
 )[0]
 ```
 
-`examples/run_onnx_inference.py`
-
-```bash
-python examples/run_onnx_inference.py \
-  --image pokemon.jpeg \
-  --text 杰尼龟 妙蛙种子 小火龙 皮卡丘
-```
-
-You can also download from the Hub first:
+Runnable script:
 
 ```bash
 python examples/run_onnx_inference.py \
-  --repo-id malusama/M2-Encoder-0.4B \
   --image pokemon.jpeg \
   --text 杰尼龟 妙蛙种子 小火龙 皮卡丘
 ```
 
-## Upload
-
-Option 1:
-
-```bash
-python upload_to_hub.py --repo-id malusama/M2-Encoder-0.4B
-```
-
-Option 2:
-
-```bash
-huggingface-cli login
-git init
-git lfs install
-git remote add origin https://huggingface.co/malusama/M2-Encoder-0.4B
-git add .
-git commit -m "Upload M2-Encoder HF export"
-git push origin main
-```
-
 ## Inference Endpoints
 
-This repo
+This repo includes a `handler.py` for Hugging Face Inference Endpoints custom deployments.
 
 Example request body:
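The hunk above ends inside the ONNX example, just after the encoder `run(...)` calls. Here is a hedged sketch of post-processing that would turn the two ONNX outputs into the same probabilities as the `transformers` path, assuming shapes of `(1, 768)` and `(num_texts, 768)` (the `768` comes from the At A Glance table) and no normalization baked into the exported graphs:

```python
import numpy as np

def probs_from_onnx_embeds(image_embeds: np.ndarray,
                           text_embeds: np.ndarray) -> np.ndarray:
    # L2-normalize both sides, then softmax over raw cosine similarities,
    # mirroring the ModelScope-equivalent scoring shown earlier.
    image_embeds = image_embeds / np.linalg.norm(image_embeds, axis=-1, keepdims=True)
    text_embeds = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    sims = image_embeds @ text_embeds.T              # raw similarities
    sims = sims - sims.max(axis=-1, keepdims=True)   # numerically stable softmax
    exp = np.exp(sims)
    return exp / exp.sum(axis=-1, keepdims=True)
```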
@@ -198,9 +197,34 @@ Example response fields:
 - `probs`
 - `logits_per_image` when `return_logits=true`
 
+## Evaluation Summary
+
+According to the official project README and paper, the M2-Encoder series is trained on the bilingual BM-6B corpus and evaluated on:
+
+- ImageNet
+- ImageNet-CN
+- Flickr30K
+- Flickr30K-CN
+- COCO-CN
+
+The official project reports that the M2-Encoder family sets strong bilingual retrieval and zero-shot classification results, and that the 10B variant reaches 88.5 top-1 on ImageNet and 80.7 top-1 on ImageNet-CN in the zero-shot setting. See the paper for exact cross-variant comparisons.
+
+
+
 ## Notes
 
-- This is a Hugging Face remote-code adapter, not a native `transformers` implementation.
-- The
+- This is a Hugging Face remote-code adapter, not a native `transformers` implementation merged upstream.
+- `trust_remote_code=True` is required for `AutoModel` and `AutoProcessor`.
+- This repo is intended for retrieval, classification, and embedding use cases, not text generation.
+- The Hub export has been numerically checked against the official implementation for the published demo workflow.
+
+## Citation
+
+```bibtex
+@misc{guo2024m2encoder,
+  title={M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining},
+  author={Qingpei Guo and Furong Xu and Hanxiao Zhang and Wang Ren and Ziping Ma and Lin Ju and Jian Wang and Jingdong Chen and Ming Yang},
+  year={2024},
+  url={https://arxiv.org/abs/2401.15896}
+}
+```