	---
	language:
	- vi
	license: mit
	library_name: transformers
	pipeline_tag: text-classification
	base_model: intfloat/multilingual-e5-small
	tags:
	- vietnamese
	- custom-code
	- transformers
	- multilingual-e5
	- uni_vsfc
	- uit-vsfc
	- education
	- multitask
	- text-classification
	---

	# m-e5-small-uit-vsfc-uni

	## Overview

	Vietnamese multi-task text classification model for student feedback. The model jointly predicts sentiment and topic labels from a single sentence.

	## Model Details

	- Base model: `intfloat/multilingual-e5-small`
	- Architecture: `uni_vsfc`
	- Checkpoint source: `uit-vsfc-uni-e5-small-best.pt`
	- Maximum sequence length used in the training and inference pipeline: `256`
	- Tasks: `sentiment, topic`

	## Label Schema

	- `sentiment`: `0 = negative`, `1 = neutral`, `2 = positive`
	- `topic`: `0 = lecturer`, `1 = training_program`, `2 = facility`, `3 = others`
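
As a minimal sketch, the schema above can be written as plain lookup tables for turning predicted class ids into label names. The names `SENTIMENT_LABELS`, `TOPIC_LABELS`, and `ids_to_labels` are illustrative and are not part of this repository's code.

```python
# Illustrative lookup tables copied from the label schema above.
# These names are hypothetical; the repository may expose its own mapping.
SENTIMENT_LABELS = {0: "negative", 1: "neutral", 2: "positive"}
TOPIC_LABELS = {0: "lecturer", 1: "training_program", 2: "facility", 3: "others"}


def ids_to_labels(sentiment_id: int, topic_id: int) -> dict:
    """Map a pair of predicted class ids to their label names."""
    return {
        "sentiment": SENTIMENT_LABELS[sentiment_id],
        "topic": TOPIC_LABELS[topic_id],
    }


print(ids_to_labels(2, 2))  # {'sentiment': 'positive', 'topic': 'facility'}
```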

	## Task Heads

	- `sentiment`: `3` classes
	- `topic`: `4` classes

	## Dataset

	- Dataset: `Vietnamese Students' Feedback Corpus (UIT-VSFC)`
	- Size: more than 16,000 human-annotated student feedback sentences, each with a sentiment label and a topic label.

	### Data Format

	- `sentence` is the input text column.
	- `sentiment` is a 3-class label and `topic` is a 4-class label.

	### Splits

	- Train: `11426` samples
	- Validation: `1583` samples
	- Test: `3166` samples
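
The split sizes listed above sum to 16,175 sentences. The short calculation below is only an arithmetic check of the total and the share of each split; the variable names are illustrative.

```python
# Split sizes as listed above.
splits = {"train": 11426, "validation": 1583, "test": 3166}

total = sum(splits.values())
fractions = {name: count / total for name, count in splits.items()}

print(total)  # 16175
for name, frac in fractions.items():
    # Prints train: 70.6%, validation: 9.8%, test: 19.6%
    print(f"{name}: {frac:.1%}")
```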

	## Checkpoint Metrics

	- `loss`: `0.2894`
	- `accuracy`: `0.9005`

	## Usage

	Load the model with `trust_remote_code=True` because this repository contains custom modeling code.

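	```python
	import torch
	from transformers import AutoModelForSequenceClassification, AutoTokenizer

	repo_id = "NeoCyber/m-e5-small-uit-vsfc-uni"
	tokenizer = AutoTokenizer.from_pretrained(repo_id)
	# trust_remote_code=True is required to load the custom uni_vsfc modeling code.
	model = AutoModelForSequenceClassification.from_pretrained(
	    repo_id,
	    trust_remote_code=True,
	)
	model.eval()  # disable dropout for inference

	# Example feedback sentence ("The course slides are complete.")
	texts = ["slide giáo trình đầy đủ ."]
	# max_length=256 matches the sequence length listed under Model Details.
	inputs = tokenizer(
	    texts, return_tensors="pt", truncation=True, padding=True, max_length=256
	)
	with torch.no_grad():  # no gradients are needed for prediction
	    outputs = model(**inputs)
	# logits_by_task holds one logits tensor per task (sentiment, topic).
	predictions = model.decode_predictions(outputs.logits_by_task)
	print(predictions)
	```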

	## Notes

	- The repository includes custom `configuration_.py` and `modeling_.py` files required by `transformers` AutoClasses.
	- `outputs.logits_by_task` contains one tensor per task, and `outputs.logits` is the concatenated tensor.
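
To illustrate how a concatenated logits row relates to the per-task heads, here is a minimal sketch using plain Python lists instead of tensors. It assumes the heads are concatenated in the order sentiment, then topic, which this model card does not state; check the custom modeling code before relying on that order. `HEAD_SIZES`, `split_logits`, and `argmax` are hypothetical helpers, and the logits are made up.

```python
# Head sizes from the "Task Heads" section.
# ASSUMPTION: the concatenated logits are ordered sentiment first, then topic.
# The model card does not confirm this order.
HEAD_SIZES = {"sentiment": 3, "topic": 4}


def split_logits(row: list) -> dict:
    """Slice one concatenated logits row into per-task lists."""
    expected = sum(HEAD_SIZES.values())
    if len(row) != expected:
        raise ValueError(f"expected {expected} logits, got {len(row)}")
    per_task, start = {}, 0
    for task, size in HEAD_SIZES.items():
        per_task[task] = row[start:start + size]
        start += size
    return per_task


def argmax(values: list) -> int:
    """Return the index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up logits for one sentence, for illustration only.
row = [-1.2, 0.3, 2.5, 0.1, -0.7, 3.0, -2.0]
per_task = split_logits(row)
print({task: argmax(vals) for task, vals in per_task.items()})
# {'sentiment': 2, 'topic': 2}
```

In practice `outputs.logits_by_task` already provides the per-task tensors, so manual slicing is only needed when working from `outputs.logits` alone.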