---
library_name: transformers
license: mit
base_model: jhu-clsp/mmBERT-small
tags:
- sentiment
- text-classification
- multilingual
- modernbert
- sentiment-analysis
- product-reviews
- place-reviews
- mmbert
metrics:
- f1
- precision
- recall
model-index:
- name: mmBERT-small-multilingual-sentiment
results: []
datasets:
- clapAI/MultiLingualSentiment
language:
- en
- zh
- vi
- ko
- ja
- ar
- de
- es
- fr
- hi
- id
- it
- ms
- pt
- ru
- tr
pipeline_tag: text-classification
---
# clapAI/mmBERT-small-multilingual-sentiment
## Introduction
**mmBERT-small-multilingual-sentiment** is a multilingual sentiment classification model, part of
the [Multilingual-Sentiment](https://huggingface.co/collections/clapAI/multilingual-sentiment-677416a6b23e03f52cb6cc3f)
collection.
The model is fine-tuned from [jhu-clsp/mmBERT-small](https://huggingface.co/jhu-clsp/mmBERT-small) using the
multilingual sentiment
dataset [clapAI/MultiLingualSentiment](https://huggingface.co/datasets/clapAI/MultiLingualSentiment).
The model supports multilingual sentiment classification across 16+ languages, including English, Vietnamese, Chinese,
French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Arabic, and more.
## Key Highlights
> 📈 **Improved accuracy**: Achieves **F1 = 82.2** on the MultiLingualSentiment test split, ahead of the larger XLM-RoBERTa-base fine-tune (81.8).
>
> 📜 **Long context support**: Handles sequences up to **8192 tokens** (sketched below).
>
> 🪶 **Efficient size**: Only **140M parameters**, roughly half the size of XLM-RoBERTa-base (278M), with better performance.
>
> ⚡ **FlashAttention-2 support**: Enables much faster inference on modern GPUs.
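As a concrete illustration of the long-context highlight, the tokenizer can keep inputs up to 8192 tokens rather than the 512-token limit of classic BERT/RoBERTa encoders. A minimal sketch using only the standard `transformers` tokenizer API (the repeated sample text is just a placeholder):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("clapAI/mmBERT-small-multilingual-sentiment")

# Long documents are kept up to the model's 8192-token window instead of the
# 512-token limit typical of older encoders.
long_review = "This is a placeholder for a very long product review. " * 1000
inputs = tokenizer(long_review, truncation=True, max_length=8192, return_tensors="pt")
print(inputs["input_ids"].shape)  # at most torch.Size([1, 8192])
```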
## Evaluation & Performance
Results on the test split of [clapAI/MultiLingualSentiment](https://huggingface.co/datasets/clapAI/MultiLingualSentiment):
| Model | Pretrained Model | Parameters | Context-length | F1-score |
|:---------------------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------:|:----------:|----------------|:--------:|
| [clapAI/mmBERT-small-multilingual-sentiment](https://huggingface.co/clapAI/mmBERT-small-multilingual-sentiment) | [jhu-clsp/mmBERT-small](https://huggingface.co/jhu-clsp/mmBERT-small) | 140M | 8192 | **82.2** |
| [modernBERT-base-multilingual-sentiment](https://huggingface.co/clapAI/modernBERT-base-multilingual-sentiment) | [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) | 150M | 8192 | 80.16 |
| [roberta-base-multilingual-sentiment](https://huggingface.co/clapAI/roberta-base-multilingual-sentiment) | [XLM-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) | 278M | 512 | 81.8 |
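For reference, the reported F1 can be re-checked along the following lines. The split name, the `text`/`label` column names, the assumption that dataset labels match the strings in `model.config.id2label`, and the use of a weighted F1 are all assumptions about the evaluation setup rather than confirmed details:

```python
import torch
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "clapAI/mmBERT-small-multilingual-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()

# Assumed dataset layout: a "test" split with "text" and "label" columns,
# where labels use the same strings as model.config.id2label.
test = load_dataset("clapAI/MultiLingualSentiment", split="test")

preds, refs = [], []
for row in test.select(range(1000)):  # small sample; drop .select() for the full split
    inputs = tokenizer(row["text"], truncation=True, max_length=8192, return_tensors="pt")
    with torch.inference_mode():
        logits = model(**inputs).logits
    preds.append(model.config.id2label[logits.argmax(dim=-1).item()])
    refs.append(row["label"])

print(f1_score(refs, preds, average="weighted"))
```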
## How to use
### Installation
```bash
pip install torch==2.8
pip install transformers==4.55.0
```
Optional: accelerate inference with FlashAttention-2 (if supported by your GPU):
```bash
pip install packaging==25.0 ninja==1.13.0
MAX_JOBS=4 pip install flash-attn==2.8.3 --no-build-isolation
```
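Before passing `attn_implementation="flash_attention_2"` below, a quick sanity check that the optional FlashAttention-2 build is actually usable can save a confusing error. This snippet is purely illustrative and nothing in it is required for the model to work:

```python
import torch

try:
    import flash_attn  # noqa: F401
    # FA2 kernels need a CUDA GPU (Ampere, compute capability 8.0, or newer)
    # and a half-precision dtype such as bfloat16/float16.
    fa2_ok = torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8
except ImportError:
    fa2_ok = False

print("flash_attention_2 usable:", fa2_ok)
```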
### Example Usage
Try it on [Google Colab](https://colab.research.google.com/drive/1nsh22sEz0znV3OedE8RqA0dNoPOJjghJ?usp=sharing)
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "clapAI/mmBERT-small-multilingual-sentiment"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    torch_dtype=dtype,
    # Uncomment if your device supports FlashAttention-2
    # attn_implementation="flash_attention_2"
)
model.to(device)
model.eval()

# Retrieve labels from the model's configuration
id2label = model.config.id2label

texts = [
    "I absolutely love the new design of this app!",  # English
    "الخدمة كانت سيئة للغاية.",  # Arabic
    "Ich bin sehr zufrieden mit dem Kauf.",  # German
    "El producto llegó roto y no funciona.",  # Spanish
    "J'adore ce restaurant, la nourriture est délicieuse!",  # French
    "Makanannya benar-benar tidak enak.",  # Indonesian
    "この製品は本当に素晴らしいです!",  # Japanese
    "고객 서비스가 정말 실망스러웠어요.",  # Korean
    "Этот фильм просто потрясающий!",  # Russian
    "Tôi thực sự yêu thích sản phẩm này!",  # Vietnamese
    "质量真的很差。",  # Chinese
]

for text in texts:
    inputs = tokenizer(text, return_tensors="pt").to(device)
    with torch.inference_mode():
        outputs = model(**inputs)
    prediction = id2label[outputs.logits.argmax(dim=-1).item()]
    print(f"Text: {text} | Prediction: {prediction}")
```
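For short texts, the same checkpoint can also be used through the high-level `pipeline` API, which takes care of tokenization, batching, and label mapping. This is an alternative sketch rather than the card's reference usage; the printed label names depend on the model's `id2label` mapping:

```python
import torch
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="clapAI/mmBERT-small-multilingual-sentiment",
    device=0 if torch.cuda.is_available() else -1,
)

results = classifier(
    ["I absolutely love the new design of this app!", "质量真的很差。"],
    batch_size=2,
)
print(results)  # e.g. [{'label': 'positive', 'score': ...}, {'label': 'negative', 'score': ...}]
```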
## Citation
If you use this model, please consider citing:
```bibtex
@misc{clapAI_mmbert_small_multilingual_sentiment,
title={mmBERT-small-multilingual-sentiment: A Multilingual Sentiment Classification Model},
author={clapAI},
howpublished={\url{https://huggingface.co/clapAI/mmBERT-small-multilingual-sentiment}},
year={2025},
}
```