---
tags:
- transformers
- text-classification
- reranking
- cross-encoder
- vietnamese
- phobert
- rag
- generated_from_trainer
base_model: vinai/phobert-base-v2
pipeline_tag: text-classification
library_name: transformers
metrics:
- accuracy
- f1
model-index:
- name: PhoBERT Cross-Encoder for Reranking
  results:
  - task:
      type: text-classification
      name: Relevance Classification
    dataset:
      name: cross_eval
      type: cross_eval
    metrics:
    - type: accuracy
      value: 0.995473
      name: Accuracy
    - type: f1
      value: 0.990951
      name: F1 Score
---
# PhoBERT Cross-Encoder for Vietnamese Reranking
This model is a cross-encoder fine-tuned from `vinai/phobert-base-v2` for binary relevance classification between a query and a document.
Unlike bi-encoders, this model jointly encodes (query, context) pairs, enabling high-accuracy reranking in retrieval systems.
## Model Overview
* **Architecture:** Cross-Encoder (Sequence Classification)
* **Base Model:** `vinai/phobert-base-v2`
* **Task:** Binary classification (relevant / not relevant)
* **Input Format:** `<s> query </s></s> context </s>` (PhoBERT uses RoBERTa-style special tokens in place of `[CLS]` and `[SEP]`)
* **Max Sequence Length:** 256 tokens
## Intended Use
This model is designed for:
* Reranking top-k results from a bi-encoder
* Improving semantic search precision
* Vietnamese legal QA systems
* Second-stage ranking in RAG pipelines
## Training Details
### Dataset
**Format:**
* query
* context
* label (0 = irrelevant, 1 = relevant)
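As an illustration of this format, the sketch below represents examples as Python dictionaries and checks them. The positive row reuses the query and context from the Usage section of this card; the negative context and the `validate_example` helper are hypothetical and are not taken from the training data.

```python
QUERY = "Tôi lái xe không giữ khoảng cách an toàn thì bị phạt như thế nào?"

# Hypothetical rows illustrating the (query, context, label) format.
examples = [
    {
        "query": QUERY,
        "context": "Phạt tiền từ 2.000.000 đến 3.000.000 đồng nếu không giữ khoảng cách an toàn.",
        "label": 1,  # relevant
    },
    {
        "query": QUERY,
        # Made-up unrelated passage, used only as a negative illustration.
        "context": "Người lao động được nghỉ hằng năm 12 ngày làm việc.",
        "label": 0,  # irrelevant
    },
]


def validate_example(row: dict) -> bool:
    """Return True if the row has non-empty string query/context and a 0/1 label."""
    return (
        isinstance(row.get("query"), str) and row["query"].strip() != ""
        and isinstance(row.get("context"), str) and row["context"].strip() != ""
        and row.get("label") in (0, 1)
    )
```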
### Training Configuration
* **Epochs:** 5
* **Learning rate:** 2e-5
* **Batch size:**
  * Train: 16
  * Eval: 32
* **Warmup:** 0.1
* **Weight decay:** 0.01
* **Mixed precision:** FP16 (if GPU available)
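These settings map naturally onto keyword arguments of `transformers.TrainingArguments`. The sketch below collects them in a plain dictionary. Reading the warmup value 0.1 as `warmup_ratio`, treating the batch sizes as per-device values, and the `build_training_kwargs` helper with its `has_gpu` flag are all assumptions; settings this card does not state (evaluation and checkpoint strategy, output directory) are left out.

```python
def build_training_kwargs(has_gpu: bool) -> dict:
    """Collect the hyperparameters listed in this card as TrainingArguments-style kwargs."""
    return {
        "num_train_epochs": 5,
        "learning_rate": 2e-5,
        "per_device_train_batch_size": 16,  # assumption: batch size is per device
        "per_device_eval_batch_size": 32,   # assumption: batch size is per device
        "warmup_ratio": 0.1,                # assumption: 0.1 is a fraction of total steps
        "weight_decay": 0.01,
        "fp16": has_gpu,                    # FP16 only when a GPU is available
    }


kwargs = build_training_kwargs(has_gpu=False)
```

With Transformers installed, the dictionary could be passed as `TrainingArguments(output_dir="out", **kwargs)`, where `"out"` is a placeholder path.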
## Evaluation Results
| Epoch | Validation Loss | Accuracy | F1 Score |
| :---: | :---: | :---: | :---: |
| 1 | 0.0820 | 0.9934 | 0.9869 |
| 2 | 0.0675 | 0.9936 | 0.9871 |
| 3 | 0.0793 | 0.9934 | 0.9869 |
| 4 | 0.0572 | 0.9955 | 0.9910 |
| 5 | 0.0711 | 0.9955 | 0.9910 |
*Best model selected based on F1 score (0.9910).*
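For reference, accuracy and F1 can be computed from predicted and true labels as in the dependency-free sketch below. It assumes F1 is measured on the positive class (label 1); the card does not state which averaging was used, and the function name and sample labels are illustrative.

```python
def accuracy_and_f1(labels, preds):
    """Return (accuracy, F1 for the positive class) for binary 0/1 labels."""
    if len(labels) != len(preds) or not labels:
        raise ValueError("labels and preds must be non-empty and of equal length")

    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    correct = sum(1 for y, p in zip(labels, preds) if y == p)

    accuracy = correct / len(labels)
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when there are no positives at all.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Made-up labels and predictions, for illustration only.
labels = [1, 0, 1, 1, 0, 0]
preds = [1, 0, 0, 1, 0, 1]
acc, f1 = accuracy_and_f1(labels, preds)
```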
## Model Architecture
PhoBERT (RoBERTa-based encoder) -> Classification Head (dense + output layer)
## Usage
### Load model
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_name = "HiImHa/phobert-cross-encoder"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
```
### Inference Example
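```python
import torch

query = "Tôi lái xe không giữ khoảng cách an toàn thì bị phạt như thế nào?"
context = "Phạt tiền từ 2.000.000 đến 3.000.000 đồng nếu không giữ khoảng cách an toàn."

# Encode the (query, context) pair; if it is too long, only the context is truncated
inputs = tokenizer(
    query,
    context,
    return_tensors="pt",
    truncation="only_second",
    max_length=256,
)

# Gradients are not needed at inference time
with torch.no_grad():
    outputs = model(**inputs)

# Probability of class 1 (relevant)
score = outputs.logits.softmax(dim=-1)[0][1].item()
print(score)  # relevance score in [0, 1]
```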
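The score in the example is the softmax probability of class 1. For two classes this equals a sigmoid applied to the difference between the two logits, which the following dependency-free sketch checks on made-up logit values. The helper names are illustrative and are not part of the model's API.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def relevance_score(logits):
    """P(relevant): the softmax probability of class 1."""
    return softmax(logits)[1]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Made-up logits in the order [irrelevant, relevant].
logits = [-1.5, 2.0]
score = relevance_score(logits)

# For two classes, softmax(logits)[1] == sigmoid(logits[1] - logits[0]).
equivalent = sigmoid(logits[1] - logits[0])
```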
## How to Use in RAG
Typical pipeline:
1. Retrieve the top-k candidate documents with a bi-encoder.
2. Rerank the candidates with this cross-encoder.
3. Pass the highest-scoring results to downstream tasks.
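The reranking step of this pipeline can be sketched as follows. The `rerank` function, its `score_fn` parameter, and the word-overlap `toy_score` are hypothetical stand-ins so that the example runs without downloading the model; in practice `score_fn` would wrap the tokenizer and model call from the Inference Example.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Score each (query, candidate) pair and return the top_n by descending score."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def toy_score(query: str, doc: str) -> float:
    """Stand-in scorer: fraction of query words that also appear in the document."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


# Candidates as they might come back from a first-stage bi-encoder (made up).
candidates = [
    "speeding fine for drivers",
    "annual leave rules",
    "fine for keeping an unsafe distance",
]
top = rerank("fine for unsafe distance", candidates, toy_score, top_n=2)
```

Because the cross-encoder scores each pair separately, the cost of this step grows linearly with the number of candidates, which is why it is applied only to the top-k results of the first stage.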
## Notes on Initialization
* The classification head was randomly initialized and trained during fine-tuning.
* Some PhoBERT pretraining weights (e.g., `lm_head`) are not used by the classification model; this is expected behavior.
* LayerNorm naming differences (beta/gamma vs. weight/bias) are handled automatically.
## Limitations
* Slower than a bi-encoder, since every (query, context) pair requires its own forward pass
* Limited to 256 tokens; longer contexts are truncated
* Binary classification may not capture nuanced ranking differences
## Future Improvements
* Pairwise / listwise ranking loss
* Hard negative mining
* Knowledge distillation from the cross-encoder into a bi-encoder
* Larger and more diverse dataset
## Training Configuration (Summary)
* **Epochs:** 5
* **Learning rate:** 2e-5
* **Loss:** Cross-entropy
* **Metric:** F1 (primary)
## Acknowledgements
* PhoBERT by VinAI
* Hugging Face Transformers
## Citation
```bibtex
@inproceedings{reimers-2019-sentence-bert,
title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
author={Reimers, Nils and Gurevych, Iryna},
year={2019}
}
```