---
tags:
- transformers
- text-classification
- reranking
- cross-encoder
- vietnamese
- phobert
- rag
- generated_from_trainer
base_model: vinai/phobert-base-v2
pipeline_tag: text-classification
library_name: transformers
metrics:
- accuracy
- f1
model-index:
- name: PhoBERT Cross-Encoder for Reranking
  results:
  - task:
      type: text-classification
      name: Relevance Classification
    dataset:
      name: cross_eval
      type: cross_eval
    metrics:
    - type: accuracy
      value: 0.995473
      name: Accuracy
    - type: f1
      value: 0.990951
      name: F1 Score
---

# PhoBERT Cross-Encoder for Vietnamese Reranking

This model is a cross-encoder fine-tuned from `vinai/phobert-base-v2` for binary relevance classification between a query and a document.
Unlike bi-encoders, which embed the query and the document separately, this model encodes each (query, context) pair jointly, so the two texts can attend to each other. This makes it well suited to high-accuracy reranking in retrieval systems.

## Model Overview

*   **Architecture:** Cross-Encoder (Sequence Classification)
*   **Base Model:** `vinai/phobert-base-v2`
*   **Task:** Binary classification (relevant / not relevant)
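*   **Input Format:** `<s> query </s></s> context </s>` (PhoBERT uses RoBERTa-style special tokens rather than `[CLS]` / `[SEP]`; the tokenizer adds them automatically)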
*   **Max Sequence Length:** 256 tokens

## Intended Use

This model is designed for:
*   Reranking top-k results from a bi-encoder
*   Improving semantic search precision
*   Vietnamese legal QA systems
*   Second-stage ranking in RAG pipelines

## Training Details

### Dataset
**Format:**
*   query
*   context
*   label (0 = irrelevant, 1 = relevant)

### Training Configuration
*   **Epochs:** 5
*   **Learning rate:** 2e-5
*   **Batch size:** 
    *   Train: 16
    *   Eval: 32
*   **Warmup ratio:** 0.1
*   **Weight decay:** 0.01
*   **Mixed precision:** FP16 (if GPU available)
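
To make the schedule concrete, the sketch below works out how many optimizer steps and warmup steps the configuration above implies. It assumes the warmup value of 0.1 is a fraction of total training steps, a single device, and no gradient accumulation. The dataset size is hypothetical, since the card does not state it, and `schedule_lengths` is an illustrative helper, not part of any library.

```python
import math

# Values taken from the training configuration listed above.
EPOCHS = 5
TRAIN_BATCH_SIZE = 16
WARMUP_RATIO = 0.1  # assumed to be a fraction of total training steps


def schedule_lengths(num_train_examples: int):
    """Return (total optimizer steps, warmup steps).

    Assumes one device and no gradient accumulation.
    """
    steps_per_epoch = math.ceil(num_train_examples / TRAIN_BATCH_SIZE)
    total_steps = steps_per_epoch * EPOCHS
    warmup_steps = math.ceil(total_steps * WARMUP_RATIO)
    return total_steps, warmup_steps


# Hypothetical dataset size, used only to make the arithmetic concrete.
total, warmup = schedule_lengths(10_000)
print(total, warmup)  # 3125 313
```

With 10,000 hypothetical training pairs, the learning rate would ramp up over roughly the first 313 of 3,125 steps before decaying.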

## Evaluation Results

| Epoch | Validation Loss | Accuracy | F1 Score |
| :---: | :---: | :---: | :---: |
| 1 | 0.0820 | 0.9934 | 0.9869 |
| 2 | 0.0675 | 0.9936 | 0.9871 |
| 3 | 0.0793 | 0.9934 | 0.9869 |
| 4 | 0.0572 | 0.9955 | 0.9910 |
| 5 | 0.0711 | 0.9955 | 0.9910 |

*Best model selected by F1 score (0.9910, i.e. 0.990951 before rounding).*
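
The card does not say how accuracy and F1 were computed. The sketch below shows one plausible reading: predictions are the argmax over the two logits, and F1 is the binary F1 of the positive class (label 1). The function name `binary_metrics` and the toy logits are illustrative assumptions, not the author's evaluation code.

```python
def binary_metrics(logits, labels):
    """Accuracy and positive-class F1 from two-class logits.

    logits: list of [logit_irrelevant, logit_relevant] pairs
    labels: list of 0/1 gold labels
    """
    # Argmax over the two classes; ties go to class 0.
    preds = [1 if row[1] > row[0] else 0 for row in logits]

    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    correct = sum(1 for p, y in zip(preds, labels) if p == y)

    accuracy = correct / len(labels)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1}


# Toy logits (made up for illustration): one relevant pair is missed.
toy_logits = [[2.0, -1.0], [0.1, 1.5], [1.2, 0.3], [-0.5, 0.9]]
toy_labels = [0, 1, 1, 1]
print(binary_metrics(toy_logits, toy_labels))
```

A function of this shape could be adapted into a `compute_metrics` callback for the Hugging Face `Trainer`, which passes predictions and labels together in a single evaluation object.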

## Model Architecture

PhoBERT (RoBERTa-based encoder) -> Classification Head (dense + output layer)

## Usage

### Load model
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "HiImHa/phobert-cross-encoder"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
```

### Inference Example
```python
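import torch

query = "Tôi lái xe không giữ khoảng cách an toàn thì bị phạt như thế nào?"
context = "Phạt tiền từ 2.000.000 đến 3.000.000 đồng nếu không giữ khoảng cách an toàn."

# Encode the pair together; if it is too long, truncate only the context.
inputs = tokenizer(
    query,
    context,
    return_tensors="pt",
    truncation="only_second",
    max_length=256,
)

model.eval()  # make sure dropout is disabled
with torch.no_grad():  # gradients are not needed for inference
    outputs = model(**inputs)

# Probability of class 1 (relevant) for the single pair in the batch
score = outputs.logits.softmax(dim=-1)[0, 1].item()

print(score)  # relevance score in [0, 1]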
```
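
The relevance score above is the softmax probability of class 1. With only two classes, this is the same as a sigmoid applied to the margin between the two logits, so ranking by the score is equivalent to ranking by that margin. The sketch below checks this numerically; the logit values are hypothetical.

```python
import math


def softmax_relevance(logit_irrelevant: float, logit_relevant: float) -> float:
    """Softmax probability of the 'relevant' class for a two-class head."""
    m = max(logit_irrelevant, logit_relevant)  # subtract max for stability
    e0 = math.exp(logit_irrelevant - m)
    e1 = math.exp(logit_relevant - m)
    return e1 / (e0 + e1)


def sigmoid_of_margin(logit_irrelevant: float, logit_relevant: float) -> float:
    """Sigmoid of the logit margin (relevant minus irrelevant)."""
    return 1.0 / (1.0 + math.exp(-(logit_relevant - logit_irrelevant)))


# Hypothetical logits for one (query, context) pair.
z0, z1 = -2.0, 3.0
print(softmax_relevance(z0, z1), sigmoid_of_margin(z0, z1))
```

Both functions return the same value up to floating-point error, and equal logits give a score of 0.5.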

## How to Use in RAG

Typical pipeline:
1.  Retrieve the top-k documents with a bi-encoder
2.  Rerank those candidates with this cross-encoder
3.  Pass the top-ranked results to downstream tasks
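
The reranking step in this pipeline can be sketched as a small function that scores every candidate against the query and keeps the best ones. To stay runnable without downloading the model, the sketch takes the scorer as a parameter. `rerank` and `toy_overlap_score` are hypothetical names, and the word-overlap scorer is only a stand-in for a function that would wrap the cross-encoder inference shown in the usage section.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> List[Tuple[float, str]]:
    """Score each candidate against the query and return the top_n as
    (score, document) pairs, highest score first."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    # Stable sort: candidates with equal scores keep their retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


def toy_overlap_score(query: str, doc: str) -> float:
    """Stand-in scorer: fraction of query words that appear in the document.
    In practice this would call the cross-encoder instead."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


# Candidates as they might come back from a first-stage retriever.
retrieved = [
    "parking rules in the city center",
    "fine for not keeping a safe following distance",
    "speed limit on the highway",
]
ranked = rerank("safe following distance fine", retrieved, toy_overlap_score, top_n=2)
for score, doc in ranked:
    print(f"{score:.2f}  {doc}")
```

Because every candidate needs its own forward pass, k is usually kept small and candidates are scored in batches rather than one at a time.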

## Notes on Initialization

*   Classification head was randomly initialized and trained during fine-tuning
*   Some PhoBERT pretraining weights (e.g., `lm_head`) are unused -> expected behavior
*   LayerNorm naming differences (beta/gamma vs weight/bias) are automatically handled

## Limitations

*   Slower than bi-encoder (pairwise inference)
*   Limited to 256 tokens -> long contexts are truncated
*   Binary classification may not capture nuanced ranking differences
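
One common workaround for the truncation limit, not something this model card implements, is to split a long context into overlapping windows, score each window, and keep the maximum. The sketch below illustrates the idea on whitespace-separated words. The window and stride sizes are arbitrary assumptions, word counts only approximate tokenizer token counts, and `split_into_windows` and `score_long_context` are hypothetical helpers.

```python
from typing import Callable, List, Sequence


def split_into_windows(words: Sequence[str], window: int, stride: int) -> List[Sequence[str]]:
    """Split a word sequence into overlapping windows that cover all words."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(words) <= window:
        return [words]
    windows = []
    for start in range(0, len(words), stride):
        windows.append(words[start:start + window])
        if start + window >= len(words):  # the last word is now covered
            break
    return windows


def score_long_context(
    query: str,
    context: str,
    score_fn: Callable[[str, str], float],
    window: int = 200,
    stride: int = 100,
) -> float:
    """Score each window with score_fn and return the best window's score."""
    chunks = split_into_windows(context.split(), window, stride)
    return max(score_fn(query, " ".join(chunk)) for chunk in chunks)


# Stand-in scorer for illustration; in practice this would be the cross-encoder.
def contains_needle(query: str, text: str) -> float:
    return 1.0 if "needle" in text.split() else 0.0


long_context = "filler " * 300 + "needle"
print(score_long_context("any query", long_context, contains_needle))
```

Taking the maximum treats a document as relevant if any window is relevant; averaging or using the first window are alternative choices with different trade-offs.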

## Future Improvements

*   Pairwise / listwise ranking loss
*   Hard negative mining
*   Knowledge distillation from this cross-encoder into a bi-encoder
*   Larger and more diverse dataset

## Training Configuration (Summary)

*   **Epochs:** 5
*   **Learning rate:** 2e-5
*   **Loss:** Cross-entropy
*   **Metric:** F1 (primary)

## Acknowledgements

*   PhoBERT by VinAI
*   Hugging Face Transformers


## Citation
```bibtex
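@inproceedings{reimers-2019-sentence-bert,
  title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
  author={Reimers, Nils and Gurevych, Iryna},
  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing},
  publisher={Association for Computational Linguistics},
  year={2019}
}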
```