---
tags:
  - transformers
  - text-classification
  - reranking
  - cross-encoder
  - vietnamese
  - phobert
  - rag
  - generated_from_trainer
base_model: vinai/phobert-base-v2
pipeline_tag: text-classification
library_name: transformers
metrics:
  - accuracy
  - f1
model-index:
  - name: PhoBERT Cross-Encoder for Reranking
    results:
      - task:
          type: text-classification
          name: Relevance Classification
        dataset:
          name: cross_eval
          type: cross_eval
        metrics:
          - type: accuracy
            value: 0.995473
            name: Accuracy
          - type: f1
            value: 0.990951
            name: F1 Score
---

PhoBERT Cross-Encoder for Vietnamese Reranking

This model is a cross-encoder fine-tuned from vinai/phobert-base-v2 for binary relevance classification between a query and a document. Unlike bi-encoders, this model jointly encodes (query, context) pairs, enabling high-accuracy reranking in retrieval systems.

Model Overview

  • Architecture: Cross-Encoder (Sequence Classification)
  • Base Model: vinai/phobert-base-v2
  • Task: Binary classification (relevant / not relevant)
  • Input Format: [CLS] query [SEP] context [SEP] (with PhoBERT's RoBERTa tokenizer, these correspond to the <s> and </s> special tokens)
  • Max Sequence Length: 256 tokens

Intended Use

This model is designed for:

  • Reranking top-k results from a bi-encoder
  • Improving semantic search precision
  • Vietnamese legal QA systems
  • Second-stage ranking in RAG pipelines

Training Details

Dataset

Format (an illustrative record is shown below the list):

  • query
  • context
  • label (0 = irrelevant, 1 = relevant)
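
Reusing the query/context pair from the usage example further below, a single positive record might look like this. The Python-dict layout is only an illustration; the card specifies just the three fields.

example = {
    "query": "Tôi lái xe không giữ khoảng cách an toàn thì bị phạt như thế nào?",
    "context": "Phạt tiền từ 2.000.000 đến 3.000.000 đồng nếu không giữ khoảng cách an toàn.",
    "label": 1,  # 1 = relevant, 0 = irrelevant
}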

Training Configuration

  • Epochs: 5
  • Learning rate: 2e-5
  • Batch size:
    • Train: 16
    • Eval: 32
  • Warmup ratio: 0.1
  • Weight decay: 0.01
  • Mixed precision: FP16 (if GPU available)
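
These hyperparameters map roughly onto transformers TrainingArguments as shown below. This is a sketch rather than the original training script; output_dir, the per-epoch evaluation/save strategies, and the best-model selection flags are assumptions based on the results reported in the next section.

import torch
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="phobert-cross-encoder",   # assumed output directory
    num_train_epochs=5,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    warmup_ratio=0.1,
    weight_decay=0.01,
    fp16=torch.cuda.is_available(),       # FP16 only when a GPU is available
    eval_strategy="epoch",                # "evaluation_strategy" in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",           # best checkpoint chosen by F1
)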

Evaluation Results

Epoch   Validation Loss   Accuracy   F1 Score
1       0.0820            0.9934     0.9869
2       0.0675            0.9936     0.9871
3       0.0793            0.9934     0.9869
4       0.0572            0.9955     0.9910
5       0.0711            0.9955     0.9910

The best checkpoint was selected by F1 score (0.9910).
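
For reference, accuracy and F1 values like those above are typically produced by a compute_metrics callback passed to the Trainer. The sketch below, using scikit-learn with binary F1, is an assumption about how these numbers were obtained, not the exact evaluation code.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    # The Trainer passes (logits, labels) for the evaluation set.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds),  # binary F1 on the "relevant" (label 1) class
    }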

Model Architecture

PhoBERT (RoBERTa-based encoder) -> Classification Head (dense + output layer)

Usage

Load model

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "HiImHa/phobert-cross-encoder"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

Inference Example

query = "Tôi lái xe không giữ khoảng cách an toàn thì bị phạt như thế nào?"
context = "Phạt tiền từ 2.000.000 đến 3.000.000 đồng nếu không giữ khoảng cách an toàn."

inputs = tokenizer(
    query,
    context,
    return_tensors="pt",
    truncation="only_second",
    max_length=256
)

model.eval()
with torch.no_grad():
    outputs = model(**inputs)

score = outputs.logits.softmax(dim=-1)[0, 1].item()
print(score)  # probability that the context is relevant to the query

How to Use in RAG

Typical pipeline:

  1. Use bi-encoder -> retrieve top-k documents
  2. Use this cross-encoder -> rerank candidates (see the sketch below)
  3. Select top results for downstream tasks
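
A minimal sketch of step 2, reusing the tokenizer and model loaded in the Usage section. The rerank helper, the candidates argument, and the top_n default are illustrative names only; in practice the candidate texts would come from the bi-encoder retrieval step.

import torch

def rerank(query, candidates, top_n=3):
    # Score every (query, candidate) pair with the cross-encoder and sort by relevance.
    inputs = tokenizer(
        [query] * len(candidates),
        candidates,
        return_tensors="pt",
        padding=True,
        truncation="only_second",
        max_length=256,
    )
    model.eval()
    with torch.no_grad():
        logits = model(**inputs).logits
    scores = logits.softmax(dim=-1)[:, 1].tolist()  # probability of the "relevant" class
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]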

Notes on Initialization

  • The classification head was randomly initialized and trained during fine-tuning
  • Some PhoBERT pretraining weights (e.g., lm_head) are not used by the sequence-classification architecture; this is expected
  • LayerNorm naming differences (beta/gamma vs weight/bias) are handled automatically when loading

Limitations

  • Slower than a bi-encoder: every (query, document) pair requires a full forward pass
  • Limited to 256 tokens -> long contexts are truncated
  • Binary classification may not capture nuanced ranking differences

Future Improvements

  • Pairwise / listwise ranking loss
  • Hard negative mining
  • Knowledge distillation from cross -> bi encoder
  • Larger and more diverse dataset

Training Configuration (Summary)

  • Epochs: 5
  • Learning rate: 2e-5
  • Loss: Cross-entropy
  • Metric: F1 (primary)

Acknowledgements

  • PhoBERT by VinAI
  • Hugging Face Transformers

Citation

@inproceedings{reimers-2019-sentence-bert,
  title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
  author={Reimers, Nils and Gurevych, Iryna},
  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
  year={2019}
}