language:
  - vi
license: mit
library_name: transformers
pipeline_tag: text-classification
base_model: intfloat/multilingual-e5-small
tags:
  - vietnamese
  - custom-code
  - transformers
  - multilingual-e5
  - uni_vsfc
  - uit-vsfc
  - education
  - multitask
  - text-classification

m-e5-small-uit-vsfc-uni

Overview

Vietnamese multi-task text classification model for student feedback. The model jointly predicts sentiment and topic labels from a single sentence.

Model Details

Base model: intfloat/multilingual-e5-small
Architecture: uni_vsfc
Checkpoint source: uit-vsfc-uni-e5-small-best.pt
Maximum sequence length (training and inference): 256 tokens
Tasks: sentiment, topic

Label Schema

sentiment: 0 = negative, 1 = neutral, 2 = positive
topic: 0 = lecturer, 1 = training_program, 2 = facility, 3 = others
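
The schema above can be written as plain lookup tables. The following is a minimal sketch, assuming the per-task logits are available as plain lists of floats; `ID2LABEL` and `decode` are illustrative names and are not part of the repository's code.

```python
# Label ids as listed in the schema above.
ID2LABEL = {
    "sentiment": {0: "negative", 1: "neutral", 2: "positive"},
    "topic": {0: "lecturer", 1: "training_program", 2: "facility", 3: "others"},
}


def decode(task, logits):
    """Return the label name of the highest-scoring class for one task.

    Hypothetical helper for illustration only; the repository ships its
    own decode_predictions method.
    """
    names = ID2LABEL[task]
    if len(logits) != len(names):
        raise ValueError(f"{task} expects {len(names)} logits, got {len(logits)}")
    best = max(range(len(logits)), key=lambda i: logits[i])
    return names[best]


# Made-up logits, not model output.
print(decode("sentiment", [-1.2, 0.3, 2.5]))    # positive
print(decode("topic", [0.1, 1.9, 0.4, -0.7]))   # training_program
```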

Task Heads

sentiment: 3 classes
topic: 4 classes

Dataset

Dataset: Vietnamese Students' Feedback Corpus (UIT-VSFC)
UIT-VSFC contains more than 16,000 human-annotated student feedback sentences with sentiment and topic labels.

Data Format

sentence is the input text column.
sentiment is a 3-class label and topic is a 4-class label.
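
A small sketch of reading rows in this format and range-checking the labels. It assumes the data is stored as CSV with the three column names above, which the card does not confirm. The sample row reuses the sentence from the usage example with made-up labels, and `parse_row` is a hypothetical helper.

```python
import csv
import io

# Number of classes per label column, as stated above.
NUM_CLASSES = {"sentiment": 3, "topic": 4}

# Illustrative row only: the labels are made up, not taken from UIT-VSFC.
SAMPLE_CSV = "sentence,sentiment,topic\nslide giáo trình đầy đủ .,2,1\n"


def parse_row(row):
    """Convert one CSV row to (sentence, sentiment, topic), range-checking labels."""
    labels = {}
    for column, size in NUM_CLASSES.items():
        value = int(row[column])
        if not 0 <= value < size:
            raise ValueError(f"{column}={value} is outside 0..{size - 1}")
        labels[column] = value
    return row["sentence"], labels["sentiment"], labels["topic"]


rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
print(rows)
```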

Splits

Train: 11426 samples
Validation: 1583 samples
Test: 3166 samples
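
As a quick consistency check, the three splits sum to 16,175 sentences, which agrees with the "more than 16,000" figure quoted for the corpus. The arithmetic, with the resulting split proportions:

```python
# Split sizes as listed above.
splits = {"train": 11426, "validation": 1583, "test": 3166}

total = sum(splits.values())
print(total)  # 16175

# Share of each split, rounded to one decimal place.
shares = {name: f"{count / total:.1%}" for name, count in splits.items()}
print(shares)  # {'train': '70.6%', 'validation': '9.8%', 'test': '19.6%'}
```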

Checkpoint Metrics

loss: 0.2894
accuracy: 0.9005

Usage

Load the model with trust_remote_code=True because this repository contains custom modeling code.

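import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

repo_id = "NeoCyber/m-e5-small-uit-vsfc-uni"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
# trust_remote_code=True lets transformers load the repository's custom model classes.
model = AutoModelForSequenceClassification.from_pretrained(
    repo_id,
    trust_remote_code=True,
)

texts = ["slide giáo trình đầy đủ ."]
# max_length=256 matches the sequence length listed under Model Details.
inputs = tokenizer(texts, return_tensors="pt", truncation=True, padding=True, max_length=256)
with torch.no_grad():  # gradients are not needed for prediction
    outputs = model(**inputs)
# decode_predictions is provided by the repository's custom modeling code.
predictions = model.decode_predictions(outputs.logits_by_task)
print(predictions)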

Notes

The repository includes custom configuration_*.py and modeling_*.py files required by transformers AutoClasses.
outputs.logits_by_task contains one tensor per task, and outputs.logits is the concatenated tensor.
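
If only the concatenated tensor is available, the per-task logits can in principle be recovered by slicing. The sketch below uses plain Python lists and assumes the tasks are concatenated along the class axis in the order sentiment, topic (3 + 4 = 7 values per sentence). The card does not state this layout, so check it against the repository's modeling code. `split_logits` is a hypothetical helper.

```python
# Class counts per task head, from the model card.
TASK_SIZES = {"sentiment": 3, "topic": 4}


def split_logits(row, task_sizes=TASK_SIZES):
    """Slice one concatenated logit row into per-task lists.

    ASSUMPTION: tasks are concatenated along the class axis in the order
    of task_sizes (sentiment first, then topic). Verify this against the
    repository's modeling code before relying on it.
    """
    expected = sum(task_sizes.values())
    if len(row) != expected:
        raise ValueError(f"expected {expected} logits, got {len(row)}")
    parts, start = {}, 0
    for task, size in task_sizes.items():
        parts[task] = row[start:start + size]
        start += size
    return parts


row = [0.1, 0.2, 3.0, 1.5, -0.3, 0.0, 0.4]  # made-up values, not model output
parts = split_logits(row)
print(parts)  # {'sentiment': [0.1, 0.2, 3.0], 'topic': [1.5, -0.3, 0.0, 0.4]}
```

When `outputs.logits_by_task` is available, prefer it, since it does not depend on any assumption about the concatenation order.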