library_name: transformers
license: mit
base_model: jhu-clsp/mmBERT-small
tags:
  - sentiment
  - text-classification
  - multilingual
  - modernbert
  - sentiment-analysis
  - product-reviews
  - place-reviews
  - mmbert
metrics:
  - f1
  - precision
  - recall
model-index:
  - name: mmBERT-small-multilingual-sentiment
    results: []
datasets:
  - clapAI/MultiLingualSentiment
language:
  - en
  - zh
  - vi
  - ko
  - ja
  - ar
  - de
  - es
  - fr
  - hi
  - id
  - it
  - ms
  - pt
  - ru
  - tr
pipeline_tag: text-classification

clapAI/mmBERT-small-multilingual-sentiment

Introduction

mmBERT-small-multilingual-sentiment is a multilingual sentiment classification model, part of the Multilingual-Sentiment collection.

The model is fine-tuned from jhu-clsp/mmBERT-small using the multilingual sentiment dataset clapAI/MultiLingualSentiment.

The model supports sentiment classification in 16+ languages, including English, Vietnamese, Chinese, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Arabic, and more.

Key Highlights

📈 Improved accuracy: Achieves F1 = 82.2.
📜 Long context support: Handles sequences up to 8192 tokens.
🪶 Efficient size: Only 140M parameters, about half the size of XLM-RoBERTa-base (278M), with better performance.
⚡ FlashAttention-2 support: Enables much faster inference on modern GPUs.
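
Inputs longer than the 8192-token limit noted above have to be truncated or split before classification. Here is a minimal sketch of one option, splitting a list of token IDs into overlapping windows. The helper name `split_into_windows` and the `overlap` default are illustrative assumptions, not part of the model or of the transformers API, and in practice the window size should leave room for any special tokens the tokenizer adds.

```python
from typing import List

MAX_TOKENS = 8192  # context length stated in the highlights above


def split_into_windows(token_ids: List[int], max_len: int = MAX_TOKENS,
                       overlap: int = 128) -> List[List[int]]:
    """Split token_ids into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens, so text cut at a window
    boundary also appears at the start of the next window.
    Hypothetical helper for illustration only.
    """
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("need max_len > 0 and 0 <= overlap < max_len")
    if len(token_ids) <= max_len:
        return [list(token_ids)]
    step = max_len - overlap
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(list(token_ids[start:start + max_len]))
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the input
    return windows
```

Each window would then be classified separately and the per-window predictions combined, for example by averaging class probabilities. The model card does not prescribe an aggregation strategy, so that choice is left to the application.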

Evaluation & Performance

Results on the test split of clapAI/MultiLingualSentiment
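
The model card lists F1, precision, and recall as its metrics. As a reference for how such scores are derived from predictions, here is a minimal pure-Python sketch of per-class precision, recall, and F1 with macro averaging. The averaging scheme behind the reported F1 is not stated in this card, so macro averaging is an assumption here, and the function name `macro_prf` is hypothetical.

```python
from typing import List, Tuple


def macro_prf(y_true: List[str], y_pred: List[str]) -> Tuple[float, float, float]:
    """Return macro-averaged (precision, recall, F1) over all observed labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        raise ValueError("no labels to score")

    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

Macro averaging weights every class equally regardless of how many examples it has. A weighted average would instead scale each class score by its support, which can give a noticeably different number on imbalanced data.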

How to use

Installation

pip install torch==2.8
pip install transformers==4.55.0

Optional: accelerate inference with FlashAttention-2 (if supported by your GPU):

pip install packaging==25.0 ninja==1.13.0
MAX_JOBS=4 pip install flash-attn==2.8.3 --no-build-isolation
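
Because FlashAttention-2 is optional, a script can decide at runtime whether to request it. Below is a minimal sketch that uses only the standard library to test whether the `flash_attn` package is importable. The helper name `pick_attn_implementation` is hypothetical, and falling back to `"sdpa"` (PyTorch's scaled dot-product attention, another value accepted by the `attn_implementation` argument in transformers) is an assumption about the preferred default.

```python
import importlib.util


def pick_attn_implementation(cuda_available: bool) -> str:
    """Return a value for the attn_implementation argument of from_pretrained.

    Requests FlashAttention-2 only when a CUDA device is present and the
    flash_attn package is installed; otherwise falls back to "sdpa".
    """
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    if cuda_available and has_flash_attn:
        return "flash_attention_2"
    return "sdpa"
```

In a real script the argument would typically be `torch.cuda.is_available()`. Note that an importable package does not guarantee the GPU architecture is supported by FlashAttention-2, so loading can still fail on older cards.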

Example Usage


import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_id = "clapAI/mmBERT-small-multilingual-sentiment"
# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
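# Use half precision on GPU (bfloat16 where supported); fall back to float32 on CPU
if torch.cuda.is_available():
    dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
else:
    dtype = torch.float32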
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    torch_dtype=dtype,
    # Uncomment if device supports FA2
    # attn_implementation="flash_attention_2" 
)

model.to(device)
model.eval()

# Retrieve labels from the model's configuration
id2label = model.config.id2label

texts = [
    "I absolutely love the new design of this app!",  # English
    "الخدمة كانت سيئة للغاية.",  # Arabic
    "Ich bin sehr zufrieden mit dem Kauf.",  # German
    "El producto llegó roto y no funciona.",  # Spanish
    "J'adore ce restaurant, la nourriture est délicieuse!",  # French
    "Makanannya benar-benar tidak enak.",  # Indonesian
    "この製品は本当に素晴らしいです!",  # Japanese
    "고객 서비스가 정말 실망스러웠어요.",  # Korean
    "Этот фильм просто потрясающий!",  # Russian
    "Tôi thực sự yêu thích sản phẩm này!",  # Vietnamese
    "质量真的很差。"  # Chinese
]

for text in texts:
    inputs = tokenizer(text, return_tensors="pt").to(device)
    with torch.inference_mode():
        outputs = model(**inputs)
        prediction = id2label[outputs.logits.argmax(dim=-1).item()]
    print(f"Text: {text} | Prediction: {prediction}")
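
The loop above reports only the top label. To also report a confidence score, the logits can be passed through a softmax. Here is a minimal pure-Python sketch of that step, operating on a plain list of logits (for example `outputs.logits[0].tolist()`). The helper name and the label mapping shown are placeholder assumptions; the real mapping should be read from `model.config.id2label`.

```python
import math
from typing import Dict, List, Tuple


def top_label_with_confidence(logits: List[float],
                              id2label: Dict[int, str]) -> Tuple[str, float]:
    """Apply softmax to the logits and return (label, probability) of the best class."""
    if not logits:
        raise ValueError("logits must not be empty")
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Placeholder mapping for illustration only; use model.config.id2label in practice.
example_id2label = {0: "negative", 1: "neutral", 2: "positive"}
label, confidence = top_label_with_confidence([-1.0, 0.0, 2.0], example_id2label)
print(f"{label} ({confidence:.3f})")
```

A softmax probability is not a calibrated measure of correctness, so any threshold applied to it should be checked against held-out data.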

Citation

If you use this model, please consider citing:

@misc{clapAI_mmbert_small_multilingual_sentiment,
      title={mmBERT-small-multilingual-sentiment: A Multilingual Sentiment Classification Model},
      author={clapAI},
      howpublished={\url{https://huggingface.co/clapAI/mmBERT-small-multilingual-sentiment}},
      year={2025},
}