---
tags:
  - transformers
  - text-classification
  - reranking
  - cross-encoder
  - vietnamese
  - phobert
  - rag
  - generated_from_trainer
base_model: vinai/phobert-base-v2
pipeline_tag: text-classification
library_name: transformers
metrics:
  - accuracy
  - f1
model-index:
  - name: PhoBERT Cross-Encoder for Reranking
    results:
      - task:
          type: text-classification
          name: Relevance Classification
        dataset:
          name: cross_eval
          type: cross_eval
        metrics:
          - type: accuracy
            value: 0.995473
            name: Accuracy
          - type: f1
            value: 0.990951
            name: F1 Score
---

PhoBERT Cross-Encoder for Vietnamese Reranking

This model is a cross-encoder fine-tuned from vinai/phobert-base-v2 for binary relevance classification between a query and a document. Unlike bi-encoders, this model jointly encodes (query, context) pairs, enabling high-accuracy reranking in retrieval systems.

Model Overview

  • Architecture: Cross-Encoder (Sequence Classification)
  • Base Model: vinai/phobert-base-v2
  • Task: Binary classification (relevant / not relevant)
  • Input Format: [CLS] query [SEP] context [SEP] (with PhoBERT's RoBERTa tokenizer, these correspond to the <s> and </s> special tokens)
  • Max Sequence Length: 256 tokens

Intended Use

This model is designed for:

  • Reranking top-k results from a bi-encoder
  • Improving semantic search precision
  • Vietnamese legal QA systems
  • Second-stage ranking in RAG pipelines

Training Details

Dataset

Format (an illustrative record is shown below the list):

  • query
  • context
  • label (0 = irrelevant, 1 = relevant)
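
Reusing the query/context pair from the usage example further below, a single positive record might look like this. The Python-dict layout is only an illustration; the card specifies just the three fields.

example = {
    "query": "Tôi lái xe không giữ khoảng cách an toàn thì bị phạt như thế nào?",
    "context": "Phạt tiền từ 2.000.000 đến 3.000.000 đồng nếu không giữ khoảng cách an toàn.",
    "label": 1,  # 1 = relevant, 0 = irrelevant
}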

Training Configuration

  • Epochs: 5
  • Learning rate: 2e-5
  • Batch size:
    • Train: 16
    • Eval: 32
  • Warmup ratio: 0.1
  • Weight decay: 0.01
  • Mixed precision: FP16 (if GPU available)
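
These hyperparameters map roughly onto transformers TrainingArguments as shown below. This is a sketch rather than the original training script; output_dir, the per-epoch evaluation/save strategies, and the best-model selection flags are assumptions based on the results reported in the next section.

import torch
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="phobert-cross-encoder",   # assumed output directory
    num_train_epochs=5,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    warmup_ratio=0.1,
    weight_decay=0.01,
    fp16=torch.cuda.is_available(),       # FP16 only when a GPU is available
    eval_strategy="epoch",                # "evaluation_strategy" in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",           # best checkpoint chosen by F1
)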

Evaluation Results

Epoch   Validation Loss   Accuracy   F1 Score
1       0.0820            0.9934     0.9869
2       0.0675            0.9936     0.9871
3       0.0793            0.9934     0.9869
4       0.0572            0.9955     0.9910
5       0.0711            0.9955     0.9910

The best checkpoint was selected by F1 score (0.9910).
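
For reference, accuracy and F1 values like those above are typically produced by a compute_metrics callback passed to the Trainer. The sketch below, using scikit-learn with binary F1, is an assumption about how these numbers were obtained, not the exact evaluation code.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    # The Trainer passes (logits, labels) for the evaluation set.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds),  # binary F1 on the "relevant" (label 1) class
    }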

Model Architecture

PhoBERT (RoBERTa-based encoder) -> Classification Head (dense + output layer)

Usage

Load model

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "HiImHa/phobert-cross-encoder"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

Inference Example

query = "Tôi lái xe không giữ khoảng cách an toàn thì bị phạt như thế nào?"
context = "Phạt tiền từ 2.000.000 đến 3.000.000 đồng nếu không giữ khoảng cách an toàn."

inputs = tokenizer(
    query,
    context,
    return_tensors="pt",
    truncation="only_second",
    max_length=256
)

model.eval()
with torch.no_grad():
    outputs = model(**inputs)

score = outputs.logits.softmax(dim=-1)[0, 1].item()
print(score)  # probability that the context is relevant to the query

How to Use in RAG

Typical pipeline:

  1. Use bi-encoder -> retrieve top-k documents
  2. Use this cross-encoder -> rerank candidates (see the sketch below)
  3. Select top results for downstream tasks
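
A minimal sketch of step 2, reusing the tokenizer and model loaded in the Usage section. The rerank helper, the candidates argument, and the top_n default are illustrative names only; in practice the candidate texts would come from the bi-encoder retrieval step.

import torch

def rerank(query, candidates, top_n=3):
    # Score every (query, candidate) pair with the cross-encoder and sort by relevance.
    inputs = tokenizer(
        [query] * len(candidates),
        candidates,
        return_tensors="pt",
        padding=True,
        truncation="only_second",
        max_length=256,
    )
    model.eval()
    with torch.no_grad():
        logits = model(**inputs).logits
    scores = logits.softmax(dim=-1)[:, 1].tolist()  # probability of the "relevant" class
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]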

Notes on Initialization

  • The classification head was randomly initialized and trained during fine-tuning
  • Some PhoBERT pretraining weights (e.g., lm_head) are not used by the sequence-classification architecture; this is expected
  • LayerNorm naming differences (beta/gamma vs weight/bias) are handled automatically when loading

Limitations

  • Slower than a bi-encoder: every (query, document) pair requires a full forward pass
  • Limited to 256 tokens -> long contexts are truncated
  • Binary classification may not capture nuanced ranking differences

Future Improvements

  • Pairwise / listwise ranking loss
  • Hard negative mining
  • Knowledge distillation from cross -> bi encoder
  • Larger and more diverse dataset

Training Configuration (Summary)

  • Epochs: 5
  • Learning rate: 2e-5
  • Loss: Cross-entropy
  • Metric: F1 (primary)

Acknowledgements

  • PhoBERT by VinAI
  • Hugging Face Transformers

Citation

@inproceedings{reimers-2019-sentence-bert,
  title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
  author={Reimers, Nils and Gurevych, Iryna},
  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
  year={2019}
}