	---
	tags:
	- transformers
	- text-classification
	- reranking
	- cross-encoder
	- vietnamese
	- phobert
	- rag
	- generated_from_trainer
	base_model: vinai/phobert-base-v2
	pipeline_tag: text-classification
	library_name: transformers
	metrics:
	- accuracy
	- f1
	model-index:
	- name: PhoBERT Cross-Encoder for Reranking
	  results:
	  - task:
	      type: text-classification
	      name: Relevance Classification
	    dataset:
	      name: cross_eval
	      type: cross_eval
	    metrics:
	    - type: accuracy
	      value: 0.995473
	      name: Accuracy
	    - type: f1
	      value: 0.990951
	      name: F1 Score
	---

	# PhoBERT Cross-Encoder for Vietnamese Reranking

	This model is a cross-encoder fine-tuned from `vinai/phobert-base-v2` for binary relevance classification between a query and a document.
	Unlike bi-encoders, this model jointly encodes (query, context) pairs, enabling high-accuracy reranking in retrieval systems.

	## Model Overview

	* Architecture: Cross-Encoder (Sequence Classification)
	* Base Model: `vinai/phobert-base-v2`
	* Task: Binary classification (relevant / not relevant)
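	* Input Format: `<s> query </s></s> context </s>` (PhoBERT uses RoBERTa-style special tokens, so `<s>` and `</s>` play the roles of `[CLS]` and `[SEP]`)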
	* Max Sequence Length: 256 tokens
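
As a rough illustration of the input layout and the 256-token limit, the sketch below assembles a pair from already-tokenized pieces and trims only the context to fit. It assumes PhoBERT's RoBERTa-style layout, where `<s>` and `</s>` fill the roles of `[CLS]` and `[SEP]`. `build_pair` is a hypothetical helper for illustration; in practice the tokenizer does all of this.

```python
def build_pair(query_tokens, context_tokens, max_length=256):
    """Assemble a RoBERTa-style pair, trimming only the context to fit."""
    # Layout: <s> query </s> </s> context </s>  -> 4 special tokens in total
    num_special = 4
    budget = max_length - num_special - len(query_tokens)
    if budget < 0:
        raise ValueError("query alone exceeds max_length")
    kept_context = context_tokens[:budget]
    return ["<s>", *query_tokens, "</s>", "</s>", *kept_context, "</s>"]


# Toy tokens: a 2-token query and a 300-token context, with a small limit
pair = build_pair(["q1", "q2"], [f"c{i}" for i in range(300)], max_length=16)
print(len(pair))
print(pair[:6])
```

The query is kept whole and the context absorbs all truncation, which matches the `truncation="only_second"` setting used in the inference example.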

	## Intended Use

	This model is designed for:
	* Reranking top-k results from a bi-encoder
	* Improving semantic search precision
	* Vietnamese legal QA systems
	* Second-stage ranking in RAG pipelines

	## Training Details

	### Dataset
	Format:
	* query
	* context
	* label (0 = irrelevant, 1 = relevant)

	### Training Configuration
	* Epochs: 5
	* Learning rate: 2e-5
	* Batch size:
	  * Train: 16
	  * Eval: 32
	* Warmup: 0.1
	* Weight decay: 0.01
	* Mixed precision: FP16 (if GPU available)
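
To make the warmup setting concrete, here is a minimal sketch of how the learning rate could evolve over training. It assumes `Warmup: 0.1` is a warmup ratio (10% of all optimizer steps) and that the rate then decays linearly to zero, which is the Hugging Face `Trainer` default. The card states neither of these. The training-set size of 8,000 examples is made up for illustration.

```python
import math


def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then linear decay to zero (assumed schedule)."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0, total_steps - step) / decay_steps


num_examples = 8000                             # hypothetical training-set size
steps_per_epoch = math.ceil(num_examples / 16)  # train batch size 16
total_steps = steps_per_epoch * 5               # 5 epochs

for step in (0, 125, 250, 1375, 2500):
    print(step, lr_at_step(step, total_steps))
```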

	## Evaluation Results

	| Epoch | Validation Loss | Accuracy | F1 Score |
	| :---: | :---: | :---: | :---: |
	| 1 | 0.0820 | 0.9934 | 0.9869 |
	| 2 | 0.0675 | 0.9936 | 0.9871 |
	| 3 | 0.0793 | 0.9934 | 0.9869 |
	| 4 | 0.0572 | 0.9955 | 0.9910 |
	| 5 | 0.0711 | 0.9955 | 0.9910 |

	The best checkpoint was selected by F1 score (0.9910).
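
For reference, accuracy and F1 for a binary relevance task can be computed as in the sketch below. It assumes F1 is measured on the positive (relevant) class, which the card does not state. The labels and predictions are toy values, not data from `cross_eval`.

```python
def accuracy_and_f1(labels, preds):
    """Accuracy and positive-class F1 for binary labels (1 = relevant)."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    correct = sum(1 for y, p in pairs if y == p)
    accuracy = correct / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Toy example: 8 pairs, one false negative and one false positive
labels = [1, 0, 0, 1, 0, 1, 0, 0]
preds = [1, 0, 0, 0, 0, 1, 1, 0]
accuracy, f1 = accuracy_and_f1(labels, preds)
print(accuracy, f1)
```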

	## Model Architecture

	PhoBERT (RoBERTa-based encoder) -> Classification Head (dense + output layer)

	## Usage

	### Load model
	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification

	model_name = "HiImHa/phobert-cross-encoder"

	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSequenceClassification.from_pretrained(model_name)
	```

	### Inference Example
	```python
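	import torch

	query = "Tôi lái xe không giữ khoảng cách an toàn thì bị phạt như thế nào?"
	context = "Phạt tiền từ 2.000.000 đến 3.000.000 đồng nếu không giữ khoảng cách an toàn."

	# Encode the (query, context) pair; only the context is truncated if the pair is too long
	inputs = tokenizer(
	    query,
	    context,
	    return_tensors="pt",
	    truncation="only_second",
	    max_length=256,
	)

	# Gradients are not needed at inference time
	with torch.no_grad():
	    outputs = model(**inputs)

	# Softmax probability of label 1 (relevant)
	score = outputs.logits.softmax(dim=-1)[0][1].item()

	print(score)  # relevance score in [0, 1]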
	```
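
The score in the example above is the softmax probability of class 1. With only two classes, this equals a sigmoid of the difference between the two logits, as the dependency-free sketch below shows. The logit values are made up for illustration.

```python
import math


def relevance_score(logits):
    """Softmax probability of index 1 for logits [irrelevant, relevant]."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


logits = [-2.0, 3.0]  # hypothetical model output for one pair
score = relevance_score(logits)
print(round(score, 4))               # 0.9933
print(round(sigmoid(3.0 - -2.0), 4))  # same value
```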

	## How to Use in RAG

	Typical pipeline:
	1. Retrieve the top-k candidate documents with a bi-encoder.
	2. Rerank the candidates with this cross-encoder.
	3. Pass the top-ranked results to downstream tasks.
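
Steps 2 and 3 can be sketched as a small function. `rerank` and `toy_score` are hypothetical names. `toy_score` is a word-overlap stand-in so the example runs without downloading the model; in a real pipeline it would be replaced by a function returning the cross-encoder's relevance score for a (query, context) pair.

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Score every candidate against the query and keep the top_n best."""
    scored = [(score_fn(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


def toy_score(query, context):
    """Stand-in scorer: fraction of query words that appear in the context."""
    q = set(query.lower().split())
    c = set(context.lower().split())
    return len(q & c) / max(1, len(q))


# Candidates as they might come back from a first-stage retriever
candidates = [
    "phạt tiền khi vượt đèn đỏ",
    "không giữ khoảng cách an toàn bị phạt tiền",
    "quy định về đăng ký xe",
]
top = rerank("không giữ khoảng cách an toàn", candidates, toy_score, top_n=2)
for score, text in top:
    print(f"{score:.2f}  {text}")
```

Passing the scorer in as a function keeps the ranking logic independent of how scores are produced, so batching or caching can be added to the scorer later.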

	## Notes on Initialization

	* Classification head was randomly initialized and trained during fine-tuning
	* Some PhoBERT pretraining weights (e.g., `lm_head`) are unused -> expected behavior
	* LayerNorm naming differences (beta/gamma vs weight/bias) are automatically handled

	## Limitations

	* Slower than bi-encoder (pairwise inference)
	* Limited to 256 tokens -> long contexts are truncated
	* Binary classification may not capture nuanced ranking differences
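
One common workaround for the 256-token limit, which the card does not say this model uses, is to split a long context into overlapping windows, score each window, and keep the maximum. In the sketch below, `split_windows` and `score_long_context` are hypothetical helpers, tokens are plain list items, and the scorer is a toy.

```python
def split_windows(tokens, window=200, stride=100):
    """Split a token list into overlapping windows of at most `window` tokens."""
    if len(tokens) <= window:
        return [tokens]
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += stride
    return windows


def score_long_context(query, tokens, score_fn, window=200, stride=100):
    """Score each window separately and keep the best score."""
    return max(score_fn(query, w) for w in split_windows(tokens, window, stride))


tokens = [f"t{i}" for i in range(450)]  # toy 450-token context
windows = split_windows(tokens)
print([len(w) for w in windows])

# Toy scorer: a window is "relevant" only if it contains token t420
best = score_long_context("q", tokens, lambda q, w: 1.0 if "t420" in w else 0.0)
print(best)
```

The window size is kept below 256 here to leave room for the query and special tokens; suitable values depend on typical query length.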

	## Future Improvements

	* Pairwise / listwise ranking loss
	* Hard negative mining
	* Knowledge distillation from cross -> bi encoder
	* Larger and more diverse dataset
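
As a pointer for the first item, a pairwise objective trains the model to score a relevant context above an irrelevant one for the same query. The sketch below shows a pairwise logistic loss on scalar scores. It is a generic formulation with made-up scores, not a plan stated by the authors.

```python
import math


def pairwise_logistic_loss(pos_score, neg_score):
    """log(1 + exp(-(pos - neg))): small when the positive outranks the negative."""
    diff = pos_score - neg_score
    # Numerically stable softplus(-diff)
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


good_order = pairwise_logistic_loss(3.0, -2.0)  # relevant scored higher
bad_order = pairwise_logistic_loss(-2.0, 3.0)   # relevant scored lower
print(good_order, bad_order)
```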

	## Training Configuration (Summary)

	* Epochs: 5
	* Learning rate: 2e-5
	* Loss: Cross-entropy
	* Metric: F1 (primary)

	## Acknowledgements

	* PhoBERT by VinAI
	* Hugging Face Transformers


	## Citation
	```bibtex
	@inproceedings{reimers-2019-sentence-bert,
	title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
	author={Reimers, Nils and Gurevych, Iryna},
	year={2019}
	}
	```