---
tags:
- transformers
- text-classification
- reranking
- cross-encoder
- vietnamese
- phobert
- rag
- generated_from_trainer
base_model: vinai/phobert-base-v2
pipeline_tag: text-classification
library_name: transformers
metrics:
- accuracy
- f1
model-index:
- name: PhoBERT Cross-Encoder for Reranking
  results:
  - task:
      type: text-classification
      name: Relevance Classification
    dataset:
      name: cross_eval
      type: cross_eval
    metrics:
    - type: accuracy
      value: 0.995473
      name: Accuracy
    - type: f1
      value: 0.990951
      name: F1 Score
---

# PhoBERT Cross-Encoder for Vietnamese Reranking

This model is a cross-encoder fine-tuned from `vinai/phobert-base-v2` for binary relevance classification between a query and a document. Unlike bi-encoders, this model jointly encodes (query, context) pairs, enabling high-accuracy reranking in retrieval systems.

## Model Overview

* **Architecture:** Cross-Encoder (Sequence Classification)
* **Base Model:** `vinai/phobert-base-v2`
* **Task:** Binary classification (relevant / not relevant)
* **Input Format:** `[CLS] query [SEP] context [SEP]`
* **Max Sequence Length:** 256 tokens

## Intended Use

This model is designed for:

* Reranking top-k results from a bi-encoder
* Improving semantic search precision
* Vietnamese legal QA systems
* Second-stage ranking in RAG pipelines

## Training Details

### Dataset

**Format:** each example consists of

* `query`
* `context`
* `label` (0 = irrelevant, 1 = relevant)

### Training Configuration

* **Epochs:** 5
* **Learning rate:** 2e-5
* **Batch size:**
  * Train: 16
  * Eval: 32
* **Warmup ratio:** 0.1
* **Weight decay:** 0.01
* **Mixed precision:** FP16 (when a GPU is available)

## Evaluation Results

| Epoch | Validation Loss | Accuracy | F1 Score |
| :---: | :---: | :---: | :---: |
| 1 | 0.0820 | 0.9934 | 0.9869 |
| 2 | 0.0675 | 0.9936 | 0.9871 |
| 3 | 0.0793 | 0.9934 | 0.9869 |
| 4 | 0.0572 | 0.9955 | 0.9910 |
| 5 | 0.0711 | 0.9955 | 0.9910 |

*Best checkpoint selected by F1 score (0.990951).*

## Model Architecture

PhoBERT (RoBERTa-based encoder) -> Classification Head (dense + output layer)

## Usage

### Load model

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "HiImHa/phobert-cross-encoder"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
```

### Inference Example

```python
import torch

# "What is the penalty for driving without keeping a safe distance?"
query = "Tôi lái xe không giữ khoảng cách an toàn thì bị phạt như thế nào?"
# "A fine of 2,000,000 to 3,000,000 VND for not keeping a safe distance."
context = "Phạt tiền từ 2.000.000 đến 3.000.000 đồng nếu không giữ khoảng cách an toàn."

inputs = tokenizer(
    query,
    context,
    return_tensors="pt",
    truncation="only_second",  # truncate the context, never the query
    max_length=256,
)

with torch.no_grad():
    outputs = model(**inputs)

# Probability of the "relevant" class (label 1)
score = outputs.logits.softmax(dim=-1)[0][1].item()
print(score)  # relevance score
```

## How to Use in RAG

Typical pipeline:

1. Use a bi-encoder to retrieve the top-k candidate documents
2. Use this cross-encoder to rerank the candidates (see the sketch below)
3. Select top results for downstream tasks
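A minimal sketch of steps 2 and 3, assuming the tokenizer and model are already loaded as in the Usage section. The `rerank` helper, its parameters, and the candidate list are illustrative only and not part of this repository; step 1 (retrieval) is assumed to have already produced `candidates`.

```python
import torch

def rerank(query, candidates, top_n=3, batch_size=32, max_length=256):
    """Score each (query, candidate) pair with the cross-encoder, sort by relevance."""
    scores = []
    for i in range(0, len(candidates), batch_size):
        batch = candidates[i:i + batch_size]
        inputs = tokenizer(
            [query] * len(batch),  # repeat the query for every candidate in the batch
            batch,
            return_tensors="pt",
            padding=True,
            truncation="only_second",
            max_length=max_length,
        )
        with torch.no_grad():
            logits = model(**inputs).logits
        # Probability of the "relevant" class (label 1) for each pair
        scores.extend(logits.softmax(dim=-1)[:, 1].tolist())
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]

# Hypothetical candidates, e.g. the top-k output of a bi-encoder
candidates = [
    "Phạt tiền từ 2.000.000 đến 3.000.000 đồng nếu không giữ khoảng cách an toàn.",
    "Quy định về tốc độ tối đa trên đường cao tốc.",  # "Regulations on maximum highway speed."
]
query = "Tôi lái xe không giữ khoảng cách an toàn thì bị phạt như thế nào?"
for doc, score in rerank(query, candidates, top_n=2):
    print(f"{score:.4f}  {doc}")
```

Scoring pairs in batches keeps the pairwise forward passes manageable; only the selected top-n documents are passed to the generator.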
## Notes on Initialization

* The classification head was randomly initialized and trained during fine-tuning
* Some PhoBERT pretraining weights (e.g., `lm_head`) are unused; this is expected behavior
* LayerNorm naming differences (beta/gamma vs. weight/bias) are handled automatically

## Limitations

* Slower than a bi-encoder, since every (query, document) pair requires its own forward pass
* Limited to 256 tokens, so long contexts are truncated
* Binary classification may not capture nuanced ranking differences

## Future Improvements

* Pairwise / listwise ranking loss
* Hard negative mining
* Knowledge distillation from the cross-encoder to a bi-encoder
* Larger and more diverse dataset

## Training Configuration (Summary)

* **Epochs:** 5
* **Learning rate:** 2e-5
* **Loss:** Cross-entropy
* **Metric:** F1 (primary)

## Acknowledgements

* PhoBERT by VinAI
* Hugging Face Transformers

## Citation

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
    year = "2019",
    publisher = "Association for Computational Linguistics",
}
```