---
tags:
- transformers
- text-classification
- reranking
- cross-encoder
- vietnamese
- phobert
- rag
- generated_from_trainer
base_model: vinai/phobert-base-v2
pipeline_tag: text-classification
library_name: transformers
metrics:
- accuracy
- f1
model-index:
- name: PhoBERT Cross-Encoder for Reranking
results:
- task:
type: text-classification
name: Relevance Classification
dataset:
name: cross_eval
type: cross_eval
metrics:
- type: accuracy
value: 0.995473
name: Accuracy
- type: f1
value: 0.990951
name: F1 Score
---
# PhoBERT Cross-Encoder for Vietnamese Reranking
This model is a cross-encoder fine-tuned from `vinai/phobert-base-v2` for binary relevance classification between a query and a document.
Unlike bi-encoders, this model jointly encodes (query, context) pairs, enabling high-accuracy reranking in retrieval systems.
## Model Overview
* **Architecture:** Cross-Encoder (Sequence Classification)
* **Base Model:** `vinai/phobert-base-v2`
* **Task:** Binary classification (relevant / not relevant)
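* **Input Format:** `<s> query </s></s> context </s>` (PhoBERT's RoBERTa-style special tokens; equivalent to `[CLS] query [SEP] context [SEP]` in BERT notation)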
* **Max Sequence Length:** 256 tokens
## Intended Use
This model is designed for:
* Reranking top-k results from a bi-encoder
* Improving semantic search precision
* Vietnamese legal QA systems
* Second-stage ranking in RAG pipelines
## Training Details
### Dataset
**Format:**
* query
* context
* label (0 = irrelevant, 1 = relevant)
### Training Configuration
* **Epochs:** 5
* **Learning rate:** 2e-5
* **Batch size:**
  * Train: 16
  * Eval: 32
* **Warmup:** 0.1
* **Weight decay:** 0.01
* **Mixed precision:** FP16 (if GPU available)
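To make the warmup setting concrete, the sketch below assumes that `0.1` is a warmup ratio, i.e. the fraction of total optimizer steps spent warming up. The card does not state this, nor does it report the training-set size, so `10_000` is a hypothetical value used only for illustration. The helper `schedule_steps` is likewise illustrative and assumes a single device without gradient accumulation.

```python
import math

def schedule_steps(num_examples, batch_size=16, epochs=5, warmup_ratio=0.1):
    """Return (total_steps, warmup_steps) for a single-device run
    without gradient accumulation (both are assumptions)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return total_steps, warmup_steps

# 10_000 is a hypothetical dataset size, not a figure from this card.
total, warmup = schedule_steps(10_000)
print(total, warmup)
```

Under these assumptions the learning rate would ramp up to 2e-5 over roughly the first tenth of training before decaying.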
## Evaluation Results
| Epoch | Validation Loss | Accuracy | F1 Score |
| :---: | :---: | :---: | :---: |
| 1 | 0.0820 | 0.9934 | 0.9869 |
| 2 | 0.0675 | 0.9936 | 0.9871 |
| 3 | 0.0793 | 0.9934 | 0.9869 |
| 4 | 0.0572 | 0.9955 | 0.9910 |
| 5 | 0.0711 | 0.9955 | 0.9910 |
*Best model selected based on F1 score = 0.9910 (0.990951 before rounding)*
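For reference, the two reported metrics can be computed from true and predicted labels as sketched below. This assumes the F1 score is the binary F1 on the relevant class (label 1), which the card does not state explicitly. The labels in the example are made up and the helper name `accuracy_and_f1` is illustrative.

```python
def accuracy_and_f1(y_true, y_pred):
    """Accuracy and binary F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1

# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(accuracy_and_f1(y_true, y_pred))
```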
## Model Architecture
PhoBERT (RoBERTa-based encoder) -> Classification Head (dense + output layer)
## Usage
### Load model
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_name = "HiImHa/phobert-cross-encoder"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
```
### Inference Example
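```python
import torch

query = "Tôi lái xe không giữ khoảng cách an toàn thì bị phạt như thế nào?"
context = "Phạt tiền từ 2.000.000 đến 3.000.000 đồng nếu không giữ khoảng cách an toàn."

# Encode the (query, context) pair; only the context is truncated
# if the pair exceeds 256 tokens.
inputs = tokenizer(
    query,
    context,
    return_tensors="pt",
    truncation="only_second",
    max_length=256,
)

model.eval()  # disable dropout
with torch.no_grad():  # gradients are not needed for inference
    outputs = model(**inputs)

# Softmax probability of label 1 (relevant)
score = outputs.logits.softmax(dim=-1)[0][1].item()
print(score)  # relevance score in [0, 1]
```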
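The relevance score in the example above is the softmax probability of label 1. With only two logits, this is the same as a sigmoid of the difference between the two logits, which the standalone sketch below illustrates. The logit values are made up and the function names are illustrative.

```python
import math

def relevance_from_logits(logit_irrelevant, logit_relevant):
    """Softmax probability of the 'relevant' class from two logits."""
    m = max(logit_irrelevant, logit_relevant)  # subtract max for numerical stability
    e0 = math.exp(logit_irrelevant - m)
    e1 = math.exp(logit_relevant - m)
    return e1 / (e0 + e1)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Made-up logits for illustration only.
score = relevance_from_logits(-2.0, 3.0)
print(score, sigmoid(3.0 - (-2.0)))  # the two values agree up to rounding
```

One consequence is that candidates can be ranked equally well by the logit difference, since the sigmoid is monotonic.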
## How to Use in RAG
Typical pipeline:
1. Use bi-encoder -> retrieve top-k documents
2. Use this cross-encoder -> rerank candidates
3. Select top results for downstream tasks
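The three steps above can be sketched as a small reranking function. This is a minimal illustration, not the author's pipeline: `rerank` and `overlap_score` are hypothetical names, and the word-overlap scorer is only a stand-in so the example runs without downloading a model. In practice `score_fn` would wrap the cross-encoder and return its relevance score for each (query, document) pair.

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Score each candidate against the query and keep the top_n best."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

def overlap_score(query, doc):
    """Stand-in scorer: fraction of query words that appear in the document."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0

# Candidates as they might come back from a first-stage bi-encoder.
query = "khoảng cách an toàn"
candidates = [
    "Quy định về nồng độ cồn khi lái xe",
    "Phạt tiền nếu không giữ khoảng cách an toàn",
    "Tốc độ tối đa và khoảng cách giữa các xe",
]

top = rerank(query, candidates, overlap_score, top_n=2)
for doc, score in top:
    print(f"{score:.2f}  {doc}")
```

Keeping the scorer as a parameter makes it straightforward to swap the stand-in for the real model without changing the ranking logic.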
## Notes on Initialization
* The classification head was randomly initialized and then trained during fine-tuning.
* Some PhoBERT pretraining weights (e.g., `lm_head`) are not used by the sequence-classification model; this is expected behavior.
* LayerNorm naming differences (`beta`/`gamma` vs. `weight`/`bias`) are handled automatically when loading.
## Limitations
* Slower than bi-encoder (pairwise inference)
* Limited to 256 tokens -> long contexts are truncated
* Binary classification may not capture nuanced ranking differences
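One common workaround for the 256-token limit, which is not part of this model card's method, is to split a long context into overlapping windows, score each window against the query, and keep the maximum score. The sketch below shows only the windowing step on a pre-tokenized list. The function name `sliding_windows` and the `window` and `stride` values are assumptions, chosen so that each window leaves room for the query and special tokens.

```python
def sliding_windows(tokens, window=200, stride=100):
    """Split a token list into overlapping windows of at most `window` tokens."""
    if len(tokens) <= window:
        return [tokens]
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):  # this window reached the end
            break
        start += stride
    return windows

# A hypothetical 450-token context, represented by integer placeholders.
context_tokens = list(range(450))
chunks = sliding_windows(context_tokens)
print(len(chunks), [len(c) for c in chunks])
```

Each chunk would then be scored as its own (query, chunk) pair. Since the model was trained on single passages, scores on arbitrary windows may be less reliable.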
## Future Improvements
* Pairwise / listwise ranking loss
* Hard negative mining
* Knowledge distillation from cross -> bi encoder
* Larger and more diverse dataset
## Training Configuration (Summary)
* **Epochs:** 5
* **Learning rate:** 2e-5
* **Loss:** Cross-entropy
* **Metric:** F1 (primary)
## Acknowledgements
* PhoBERT by VinAI
* Hugging Face Transformers
## Citation
```bibtex
@inproceedings{reimers-2019-sentence-bert,
title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
author={Reimers, Nils and Gurevych, Iryna},
year={2019}
}
```