| tags: | |
| - transformers | |
| - text-classification | |
| - reranking | |
| - cross-encoder | |
| - vietnamese | |
| - phobert | |
| - rag | |
| - generated_from_trainer | |
| base_model: vinai/phobert-base-v2 | |
| pipeline_tag: text-classification | |
| library_name: transformers | |
| metrics: | |
| - accuracy | |
| - f1 | |
| model-index: | |
| - name: PhoBERT Cross-Encoder for Reranking | |
| results: | |
| - task: | |
| type: text-classification | |
| name: Relevance Classification | |
| dataset: | |
| name: cross_eval | |
| type: cross_eval | |
| metrics: | |
| - type: accuracy | |
| value: 0.995473 | |
| name: Accuracy | |
| - type: f1 | |
| value: 0.990951 | |
| name: F1 Score | |
| # PhoBERT Cross-Encoder for Vietnamese Reranking | |
| This model is a cross-encoder fine-tuned from `vinai/phobert-base-v2` for binary relevance classification between a query and a document. | |
| Unlike bi-encoders, this model jointly encodes (query, context) pairs, enabling high-accuracy reranking in retrieval systems. | |
| ## Model Overview | |
| * **Architecture:** Cross-Encoder (Sequence Classification) | |
| * **Base Model:** `vinai/phobert-base-v2` | |
| * **Task:** Binary classification (relevant / not relevant) | |
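| * **Input Format:** `<s> query </s></s> context </s>` (PhoBERT uses RoBERTa-style special tokens; this is the counterpart of BERT's `[CLS] query [SEP] context [SEP]`) | |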
| * **Max Sequence Length:** 256 tokens | |
| ## Intended Use | |
| This model is designed for: | |
| * Reranking top-k results from a bi-encoder | |
| * Improving semantic search precision | |
| * Vietnamese legal QA systems | |
| * Second-stage ranking in RAG pipelines | |
| ## Training Details | |
| ### Dataset | |
| **Format:** | |
| * query | |
| * context | |
| * label (0 = irrelevant, 1 = relevant) | |
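As an illustration of the three fields above, one record could be represented and validated as in the sketch below. The `RelevancePair` class, the `parse_jsonl` helper, and the JSON Lines layout are assumptions made for the example; the card does not say how the dataset is stored.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RelevancePair:
    """One example: a query, a candidate context, and a binary label."""
    query: str
    context: str
    label: int  # 0 = irrelevant, 1 = relevant

    def __post_init__(self):
        # Reject anything that is not a valid binary label.
        if self.label not in (0, 1):
            raise ValueError(f"label must be 0 or 1, got {self.label!r}")


def parse_jsonl(lines):
    """Parse JSON strings into RelevancePair records (hypothetical storage layout)."""
    return [RelevancePair(**json.loads(line)) for line in lines if line.strip()]


# Made-up records, for illustration only.
sample = [
    '{"query": "q1", "context": "c1", "label": 1}',
    '{"query": "q1", "context": "c2", "label": 0}',
]
pairs = parse_jsonl(sample)
print(pairs[0].label, pairs[1].label)
```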
| ### Training Configuration | |
| * **Epochs:** 5 | |
| * **Learning rate:** 2e-5 | |
| * **Batch size:** | |
| * Train: 16 | |
| * Eval: 32 | |
| * **Warmup:** 0.1 (presumably a ratio of total training steps) | |
| * **Weight decay:** 0.01 | |
| * **Mixed precision:** FP16 (if GPU available) | |
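To make the warmup setting concrete, the sketch below computes the learning rate at a few steps. It assumes that 0.1 is a warmup ratio and that the schedule is linear warmup followed by linear decay (the default scheduler type in the Hugging Face `Trainer`, which the card does not explicitly confirm). The dataset size is a made-up number.

```python
import math

PEAK_LR = 2e-5
EPOCHS = 5
TRAIN_BATCH_SIZE = 16
WARMUP_RATIO = 0.1      # assumption: the card's warmup value is a ratio
NUM_EXAMPLES = 10_000   # hypothetical dataset size, for illustration only


def linear_schedule(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return peak_lr * remaining


steps_per_epoch = math.ceil(NUM_EXAMPLES / TRAIN_BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)

# Learning rate at the start, mid-warmup, the peak, and the final step.
for step in (0, warmup_steps // 2, warmup_steps, total_steps):
    print(step, linear_schedule(step, total_steps, warmup_steps, PEAK_LR))
```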
| ## Evaluation Results | |
| | Epoch | Validation Loss | Accuracy | F1 Score | | |
| | :---: | :---: | :---: | :---: | | |
| | 1 | 0.0820 | 0.9934 | 0.9869 | | |
| | 2 | 0.0675 | 0.9936 | 0.9871 | | |
| | 3 | 0.0793 | 0.9934 | 0.9869 | | |
| | 4 | 0.0572 | 0.9955 | 0.9910 | | |
| | 5 | 0.0711 | 0.9955 | 0.9910 | | |
| *Best model selected by F1 score (0.9910; 0.990951 unrounded).* | |
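For reference, the two metrics in the table can be computed from binary predictions as sketched below. This assumes F1 is measured on the positive class (label 1); the card does not state the averaging mode, and the labels in the example are made up.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1 of the positive class) for binary labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # relevant, predicted relevant
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # irrelevant, predicted relevant
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # relevant, predicted irrelevant
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Made-up labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.4f} f1={f1:.4f}")
```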
| ## Model Architecture | |
| PhoBERT (RoBERTa-based encoder) -> Classification Head (dense + output layer) | |
| ## Usage | |
| ### Load model | |
| ```python | |
| from transformers import AutoTokenizer, AutoModelForSequenceClassification | |
| model_name = "HiImHa/phobert-cross-encoder" | |
| tokenizer = AutoTokenizer.from_pretrained(model_name) | |
| model = AutoModelForSequenceClassification.from_pretrained(model_name) | |
| ``` | |
| ### Inference Example | |
| ```python | |
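| import torch | |
| query = "Tôi lái xe không giữ khoảng cách an toàn thì bị phạt như thế nào?" | |
| context = "Phạt tiền từ 2.000.000 đến 3.000.000 đồng nếu không giữ khoảng cách an toàn." | |
| # Encode query and context as a single pair; if the pair is too long, truncate only the context | |
| inputs = tokenizer( | |
|     query, | |
|     context, | |
|     return_tensors="pt", | |
|     truncation="only_second", | |
|     max_length=256 | |
| ) | |
| # Gradients are not needed for inference | |
| with torch.no_grad(): | |
|     outputs = model(**inputs) | |
| # Probability of class 1 (relevant) for the first and only pair in the batch | |
| score = outputs.logits.softmax(dim=-1)[0][1].item() | |
| print(score)  # relevance score between 0 and 1 | |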
| ``` | |
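The expression `softmax(dim=-1)[0][1]` in the example turns the two output logits into the probability of the "relevant" class. The dependency-free sketch below shows the same computation on made-up logits, and that for two classes it equals a sigmoid of the logit difference. The function names are illustrative, not part of any library.

```python
import math


def relevance_probability(logit_irrelevant, logit_relevant):
    """Softmax over two logits, returning the probability of class 1 (relevant)."""
    m = max(logit_irrelevant, logit_relevant)  # subtract the max for numerical stability
    e0 = math.exp(logit_irrelevant - m)
    e1 = math.exp(logit_relevant - m)
    return e1 / (e0 + e1)


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


logits = (-2.0, 3.0)  # made-up logits: (irrelevant, relevant)
p = relevance_probability(*logits)

# With two classes, softmax reduces to a sigmoid of the logit difference.
print(p, sigmoid(logits[1] - logits[0]))
```

Because the score depends only on the difference between the two logits, sorting candidates by this probability gives the same order as sorting by that difference.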
| ## How to Use in RAG | |
| Typical pipeline: | |
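| 1. Retrieve the top-k candidate documents with a bi-encoder. | |
| 2. Score each (query, candidate) pair with this cross-encoder and rerank the candidates by score. | |
| 3. Pass the highest-scoring results to the downstream task (for example, the generator in a RAG system). | |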
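The reranking stage of this pipeline can be sketched as follows. The `rerank` helper and the `toy_score` stand-in are hypothetical names introduced for illustration; in a real system `score_fn` would wrap the cross-encoder call from the inference example, and the candidates would come from the bi-encoder.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair and return the top_n by descending score."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


def toy_score(query: str, doc: str) -> float:
    """Stand-in scorer: fraction of query words that also appear in the document."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


# Pretend these came back from a bi-encoder retriever.
query = "safe following distance fine"
candidates = [
    "parking rules in the city center",
    "fine for not keeping a safe following distance",
    "speed limit on highways",
    "safe distance between vehicles",
]

top = rerank(query, candidates, toy_score, top_n=2)
for doc, score in top:
    print(f"{score:.2f}  {doc}")
```

Passing the scorer as a function keeps the ranking logic independent of the model, so the same helper can be exercised with a cheap stand-in before the cross-encoder is plugged in.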
| ## Notes on Initialization | |
| * The classification head was randomly initialized and then trained during fine-tuning. | |
| * Some PhoBERT pretraining weights (e.g., `lm_head`) are unused by the classification model; this is expected behavior. | |
| * LayerNorm naming differences (beta/gamma vs. weight/bias) are handled automatically. | |
| ## Limitations | |
| * Slower than a bi-encoder, since every (query, context) pair needs its own forward pass. | |
| * Inputs are limited to 256 tokens, so long contexts are truncated. | |
| * Binary classification may not capture fine-grained ranking differences between relevant documents. | |
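One common workaround for the truncation limit, which this model card does not describe or evaluate, is to split a long context into overlapping windows, score each window, and keep the maximum. The sketch below assumes whitespace-separated words as a rough proxy for model tokens; the helper names, window size, and stride are hypothetical, and a real budget would have to count subword tokens and leave room for the query.

```python
def split_into_windows(tokens, window_size, stride):
    """Split a token list into overlapping windows of at most window_size tokens."""
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break  # this window reached the end of the context
        start += stride
    return windows


def score_long_context(query, context, score_fn, window_size=200, stride=100):
    """Score each window separately and return the best (max-pooled) score."""
    tokens = context.split()  # rough proxy for model tokens
    windows = split_into_windows(tokens, window_size, stride)
    return max(score_fn(query, " ".join(window)) for window in windows)


def contains_target(query, doc):
    """Stand-in scorer: 1.0 if the literal word "target" is in the document."""
    return 1.0 if "target" in doc.split() else 0.0


# A 500-word context whose only relevant word sits near the end.
words = ["filler"] * 500
words[450] = "target"
long_context = " ".join(words)

truncated_score = contains_target("q", " ".join(words[:200]))
windowed_score = score_long_context("q", long_context, contains_target)
print(truncated_score, windowed_score)
```

The cost is one forward pass per window, so this trades latency for coverage of the full context.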
| ## Future Improvements | |
| * Pairwise / listwise ranking loss | |
| * Hard negative mining | |
| * Knowledge distillation from the cross-encoder into a bi-encoder | |
| * Larger and more diverse dataset | |
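To illustrate the first item, a pairwise logistic loss compares the score of a relevant document against the score of an irrelevant one for the same query, instead of classifying each pair independently. This is a generic sketch of one such loss, not something the current model was trained with; the function name and the scores are made up.

```python
import math


def pairwise_logistic_loss(pos_score, neg_score):
    """log(1 + exp(-(pos - neg))): small when the relevant document outscores the irrelevant one."""
    margin = pos_score - neg_score
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Made-up scores: a correctly ordered pair and an incorrectly ordered pair.
good_order = pairwise_logistic_loss(pos_score=3.0, neg_score=-1.0)
bad_order = pairwise_logistic_loss(pos_score=-1.0, neg_score=3.0)
print(good_order, bad_order)
```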
| ## Training Configuration (Summary) | |
| * **Epochs:** 5 | |
| * **Learning rate:** 2e-5 | |
| * **Loss:** Cross-entropy | |
| * **Metric:** F1 (primary) | |
| ## Acknowledgements | |
| * PhoBERT by VinAI | |
| * Hugging Face Transformers | |
| ## Citation | |
| ```bibtex | |
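| @inproceedings{reimers-2019-sentence-bert, | |
| title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks}, | |
| author={Reimers, Nils and Gurevych, Iryna}, | |
| booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing}, | |
| year={2019} | |
| } | |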
| ``` |