Add model-index evaluation metadata

README.md +66 -3

@@ -23,6 +23,69 @@ tags:
 - english
 - vision-language
 - custom-code
 ---
 # M2-Encoder-0.4B
@@ -165,9 +228,7 @@ image_embeds = image_session.run(
 Runnable script:
 ```bash
-python examples/run_onnx_inference.py \
-  --image pokemon.jpeg \
-  --text 杰尼龟 妙蛙种子 小火龙 皮卡丘
 ```
 ## Inference Endpoints
@@ -209,6 +270,8 @@ According to the official project README and paper, the M2-Encoder series is tra
 The official project reports that the M2-Encoder family sets strong bilingual retrieval and zero-shot classification results, and that the 10B variant reaches 88.5 top-1 on ImageNet and 80.7 top-1 on ImageNet-CN in the zero-shot setting. See the paper for exact cross-variant comparisons.
 ![Benchmark overview](https://raw.githubusercontent.com/alipay/Ant-Multi-Modal-Framework/main/prj/M2_Encoder/pics/effect.png)
 ## Notes

 - english
 - vision-language
 - custom-code
+model-index:
+- name: M2-Encoder-0.4B
+  results:
+  - task:
+      type: zero-shot-image-classification
+      name: Zero-Shot Image Classification
+    dataset:
+      name: ImageNet
+      type: ImageNet
+    metrics:
+    - type: accuracy
+      value: 78.5
+      name: Top-1 Accuracy
+  - task:
+      type: zero-shot-image-classification
+      name: Zero-Shot Image Classification
+    dataset:
+      name: ImageNet-CN
+      type: ImageNet-CN
+    metrics:
+    - type: accuracy
+      value: 69.1
+      name: Top-1 Accuracy
+  - task:
+      type: image-text-retrieval
+      name: Zero-Shot Image-Text Retrieval
+    dataset:
+      name: Flickr30K
+      type: Flickr30K
+    metrics:
+    - type: mean_recall
+      value: 94.5
+      name: MR
+  - task:
+      type: image-text-retrieval
+      name: Zero-Shot Image-Text Retrieval
+    dataset:
+      name: COCO
+      type: COCO
+    metrics:
+    - type: mean_recall
+      value: 75.2
+      name: MR
+  - task:
+      type: image-text-retrieval
+      name: Zero-Shot Image-Text Retrieval
+    dataset:
+      name: Flickr30K-CN
+      type: Flickr30K-CN
+    metrics:
+    - type: mean_recall
+      value: 91.2
+      name: MR
+  - task:
+      type: image-text-retrieval
+      name: Zero-Shot Image-Text Retrieval
+    dataset:
+      name: COCO-CN
+      type: COCO-CN
+    metrics:
+    - type: mean_recall
+      value: 87.8
+      name: MR
 ---
 # M2-Encoder-0.4B
 Runnable script:
 ```bash
+python examples/run_onnx_inference.py   --image pokemon.jpeg   --text 杰尼龟 妙蛙种子 小火龙 皮卡丘
 ```
 ## Inference Endpoints
 The official project reports that the M2-Encoder family sets strong bilingual retrieval and zero-shot classification results, and that the 10B variant reaches 88.5 top-1 on ImageNet and 80.7 top-1 on ImageNet-CN in the zero-shot setting. See the paper for exact cross-variant comparisons.
+The structured `model-index` metadata in this card is taken from the official paper tables for this released variant. On the Hugging Face page, those results should surface in the evaluation panel once the metadata is parsed.
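
Reviewer note: the six `model-index` entries added above all share one shape (task, dataset, a single metric), so they can be cross-checked mechanically. The following is a minimal sketch, not part of this change, that rebuilds the same structure from a flat table using only the Python standard library. The names `ROWS` and `build_model_index` are hypothetical helpers introduced for illustration; the values are copied from the entries in this diff.

```python
import json

# Values copied from the model-index entries in this diff (M2-Encoder-0.4B).
# Tuple layout: (task type, task name, dataset, metric type, metric name, value)
ROWS = [
    ("zero-shot-image-classification", "Zero-Shot Image Classification",
     "ImageNet", "accuracy", "Top-1 Accuracy", 78.5),
    ("zero-shot-image-classification", "Zero-Shot Image Classification",
     "ImageNet-CN", "accuracy", "Top-1 Accuracy", 69.1),
    ("image-text-retrieval", "Zero-Shot Image-Text Retrieval",
     "Flickr30K", "mean_recall", "MR", 94.5),
    ("image-text-retrieval", "Zero-Shot Image-Text Retrieval",
     "COCO", "mean_recall", "MR", 75.2),
    ("image-text-retrieval", "Zero-Shot Image-Text Retrieval",
     "Flickr30K-CN", "mean_recall", "MR", 91.2),
    ("image-text-retrieval", "Zero-Shot Image-Text Retrieval",
     "COCO-CN", "mean_recall", "MR", 87.8),
]


def build_model_index(model_name, rows):
    """Return a model-index structure as plain Python lists and dicts."""
    results = []
    for task_type, task_name, dataset, metric_type, metric_name, value in rows:
        results.append({
            "task": {"type": task_type, "name": task_name},
            # This diff uses the same string for the dataset name and type.
            "dataset": {"name": dataset, "type": dataset},
            "metrics": [
                {"type": metric_type, "value": value, "name": metric_name},
            ],
        })
    return [{"name": model_name, "results": results}]


model_index = build_model_index("M2-Encoder-0.4B", ROWS)
print(json.dumps(model_index, indent=2))
```

Printing the structure as JSON makes it easy to compare entry by entry against the YAML front matter before merging.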
 ![Benchmark overview](https://raw.githubusercontent.com/alipay/Ant-Multi-Modal-Framework/main/prj/M2_Encoder/pics/effect.png)
 ## Notes