---
tags:
- transformers
- text-classification
- reranking
- cross-encoder
- vietnamese
- phobert
- rag
- generated_from_trainer
base_model: vinai/phobert-base-v2
pipeline_tag: text-classification
library_name: transformers
metrics:
- accuracy
- f1
model-index:
- name: PhoBERT Cross-Encoder for Reranking
  results:
  - task:
      type: text-classification
      name: Relevance Classification
    dataset:
      name: cross_eval
      type: cross_eval
    metrics:
    - type: accuracy
      value: 0.995473
      name: Accuracy
    - type: f1
      value: 0.990951
      name: F1 Score
---
# PhoBERT Cross-Encoder for Vietnamese Reranking
This model is a cross-encoder fine-tuned from vinai/phobert-base-v2 for binary relevance classification between a query and a document.
Unlike bi-encoders, this model jointly encodes (query, context) pairs, enabling high-accuracy reranking in retrieval systems.
## Model Overview

- Architecture: Cross-Encoder (Sequence Classification)
- Base Model: `vinai/phobert-base-v2`
- Task: Binary classification (relevant / not relevant)
- Input Format: `[CLS] query [SEP] context [SEP]` (see the encoding sketch below)
- Max Sequence Length: 256 tokens
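For illustration only, here is a minimal sketch of how a (query, context) pair is encoded in this paired format. Note that PhoBERT's tokenizer uses `<s>` / `</s>` as its CLS/SEP tokens, so the decoded string shows those rather than the generic `[CLS]` / `[SEP]` notation above; the placeholder strings are assumptions.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HiImHa/phobert-cross-encoder")

query = "câu hỏi ví dụ"     # placeholder query ("example question")
context = "đoạn văn ví dụ"  # placeholder context ("example passage")

# Passing two texts builds the paired input; only the context side is truncated
encoded = tokenizer(query, context, truncation="only_second", max_length=256)
print(tokenizer.decode(encoded["input_ids"]))
```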
## Intended Use
This model is designed for:
- Reranking top-k results from a bi-encoder
- Improving semantic search precision
- Vietnamese legal QA systems
- Second-stage ranking in RAG pipelines
## Training Details

### Dataset

Each training example has the following fields (an illustrative record is sketched after this list):
- query
- context
- label (0 = irrelevant, 1 = relevant)
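For illustration, records in this schema might look like the following; only the field names come from the list above, while the concrete texts and the in-memory list layout are assumptions (the positive pair reuses the example from the inference section below, and the negative context is made up):

```python
# Hypothetical training records illustrating the query / context / label schema
train_examples = [
    {
        "query": "Tôi lái xe không giữ khoảng cách an toàn thì bị phạt như thế nào?",
        "context": "Phạt tiền từ 2.000.000 đến 3.000.000 đồng nếu không giữ khoảng cách an toàn.",
        "label": 1,  # relevant
    },
    {
        "query": "Tôi lái xe không giữ khoảng cách an toàn thì bị phạt như thế nào?",
        "context": "Đoạn văn không liên quan đến câu hỏi trên.",  # an unrelated passage
        "label": 0,  # irrelevant
    },
]
```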
### Training Configuration

The main hyperparameters (a sketch of the corresponding `TrainingArguments` follows below):

- Epochs: 5
- Learning rate: 2e-5
- Batch size:
  - Train: 16
  - Eval: 32
- Warmup ratio: 0.1
- Weight decay: 0.01
- Mixed precision: FP16 (if a GPU is available)
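As a reference point, here is a sketch of how these hyperparameters map onto `transformers.TrainingArguments`. The output directory, evaluation/save strategy, and best-model settings are assumptions, and the keyword is spelled `evaluation_strategy` in older Transformers releases.

```python
import torch
from transformers import TrainingArguments

# Sketch only: values mirror the list above; everything else is an assumption.
training_args = TrainingArguments(
    output_dir="phobert-cross-encoder",
    num_train_epochs=5,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    warmup_ratio=0.1,
    weight_decay=0.01,
    fp16=torch.cuda.is_available(),  # mixed precision only when a GPU is present
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)
```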
## Evaluation Results
| Epoch | Validation Loss | Accuracy | F1 Score |
|---|---|---|---|
| 1 | 0.0820 | 0.9934 | 0.9869 |
| 2 | 0.0675 | 0.9936 | 0.9871 |
| 3 | 0.0793 | 0.9934 | 0.9869 |
| 4 | 0.0572 | 0.9955 | 0.9910 |
| 5 | 0.0711 | 0.9955 | 0.9910 |
The best checkpoint was selected based on F1 score (0.9910).
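Accuracy and F1 here are standard classification metrics; below is a sketch of a `compute_metrics` callback that would produce them during evaluation (the use of scikit-learn is an assumption):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    """Compute accuracy and binary F1 from Trainer predictions."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds),  # F1 on the positive ("relevant") class
    }
```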
## Model Architecture
PhoBERT (RoBERTa-based encoder) -> Classification Head (dense + output layer)
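Since PhoBERT is RoBERTa-based, loading the checkpoint with `AutoModelForSequenceClassification` should attach the standard RoBERTa classification head (a dense layer followed by an output projection over 2 labels); one quick way to inspect it:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("HiImHa/phobert-cross-encoder")

print(model.config.num_labels)  # 2 -> not relevant / relevant
print(model.classifier)         # dense -> dropout -> out_proj (RoBERTa classification head)
```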
## Usage

### Load model

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "HiImHa/phobert-cross-encoder"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
```
### Inference Example

```python
import torch

# "If I drive without keeping a safe following distance, what is the penalty?"
query = "Tôi lái xe không giữ khoảng cách an toàn thì bị phạt như thế nào?"
# "A fine of 2,000,000 to 3,000,000 VND for failing to keep a safe distance."
context = "Phạt tiền từ 2.000.000 đến 3.000.000 đồng nếu không giữ khoảng cách an toàn."

# Encode the (query, context) pair; only the context is truncated if the pair is too long
inputs = tokenizer(
    query,
    context,
    return_tensors="pt",
    truncation="only_second",
    max_length=256,
)

with torch.no_grad():
    outputs = model(**inputs)

# Probability of the "relevant" class (label 1)
score = outputs.logits.softmax(dim=-1)[0][1].item()
print(score)  # relevance score
```
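Continuing from the snippets above (reusing `tokenizer`, `model`, and `query`), several candidate contexts can be scored in one padded batch and sorted by relevance; the second candidate string here is a made-up irrelevant passage:

```python
import torch

candidates = [
    "Phạt tiền từ 2.000.000 đến 3.000.000 đồng nếu không giữ khoảng cách an toàn.",
    "Đoạn văn không liên quan đến câu hỏi trên.",  # made-up irrelevant passage
]

# Tokenize all (query, candidate) pairs in a single batch
batch = tokenizer(
    [query] * len(candidates),
    candidates,
    return_tensors="pt",
    padding=True,
    truncation="only_second",
    max_length=256,
)

with torch.no_grad():
    logits = model(**batch).logits

# Probability of the "relevant" class for each candidate, highest first
scores = logits.softmax(dim=-1)[:, 1].tolist()
for text, score in sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.4f}  {text}")
```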
## How to Use in RAG

Typical pipeline (a sketch follows below):

1. Use a bi-encoder to retrieve the top-k documents
2. Use this cross-encoder to rerank the candidates
3. Select the top results for downstream tasks
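A minimal sketch of this two-stage setup, assuming `tokenizer` and `model` are loaded as in the Usage section; `retrieve_top_k` is a hypothetical stand-in for whatever bi-encoder or vector-store retriever you use:

```python
import torch

def rerank(query, candidates, tokenizer, model, top_n=3):
    """Score (query, candidate) pairs with the cross-encoder and keep the top_n best."""
    batch = tokenizer(
        [query] * len(candidates),
        candidates,
        return_tensors="pt",
        padding=True,
        truncation="only_second",
        max_length=256,
    )
    with torch.no_grad():
        scores = model(**batch).logits.softmax(dim=-1)[:, 1]
    best = scores.argsort(descending=True)[:top_n].tolist()
    return [(candidates[i], scores[i].item()) for i in best]

# Stage 1: recall-oriented retrieval (retrieve_top_k is hypothetical)
candidates = retrieve_top_k(query, k=20)
# Stage 2: precision-oriented reranking with this model
top_passages = rerank(query, candidates, tokenizer, model)
# Stage 3: pass top_passages to the generator / downstream task
```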
## Notes on Initialization

- The classification head was randomly initialized and trained during fine-tuning
- Some PhoBERT pretraining weights (e.g., `lm_head`) are unused -> expected behavior
- LayerNorm naming differences (beta/gamma vs weight/bias) are handled automatically
## Limitations
- Slower than bi-encoder (pairwise inference)
- Limited to 256 tokens -> long contexts are truncated
- Binary classification may not capture nuanced ranking differences
## Future Improvements
- Pairwise / listwise ranking loss
- Hard negative mining
- Knowledge distillation from cross -> bi encoder
- Larger and more diverse dataset
## Training Configuration (Summary)
- Epochs: 5
- Learning rate: 2e-5
- Loss: Cross-entropy
- Metric: F1 (primary)
## Acknowledgements
- PhoBERT by VinAI
- Hugging Face Transformers
## Citation

```bibtex
@inproceedings{reimers-2019-sentence-bert,
  title  = {Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
  author = {Reimers, Nils and Gurevych, Iryna},
  year   = {2019}
}
```