tiam4tt/BGE-Reranker-VietFinance

	---
	license: mit
	base_model:
	- BAAI/bge-reranker-v2-m3
	pipeline_tag: sentence-similarity
	---

	# Model Card — BGE-Reranker-VietFinance

	## Overview

	BGE-Reranker-VietFinance is a cross-encoder reranker fine-tuned from `BAAI/bge-reranker-v2-m3`. It scores (query, passage) pairs so that Vietnamese financial/news search systems can rerank retrieved passages.

	## Intended Use

	- Primary: improve Hit@K by reranking the candidate passages returned by an upstream hybrid retriever (BM25 plus dense embeddings).
	- Not for: standalone generation, text outside the Vietnamese financial/news domain, or high-stakes automated decisions without human review.

	## Essential Statistics

	- Base model: `BAAI/bge-reranker-v2-m3`
	- Embedding model (used for retrieval/hard-negative mining): `BAAI/bge-m3`
	- Max sequence length for reranking: 1536 tokens (inputs longer than this are truncated)
	- Retrieval strategy: temporal-aware hybrid (BM25 + dense embeddings with temporal boosting)
	- Saved artifacts in this repository: `model.safetensors`, `tokenizer.json`, `tokenizer_config.json`, `config.json`.
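The card names the retrieval strategy but does not give its formula. The following is only a minimal sketch of what a temporal-aware hybrid score could look like, assuming min-max normalized BM25 and dense scores combined by a weighted sum, plus a recency boost that halves every `half_life_days`. The function names and the parameters `alpha`, `boost`, and `half_life_days` are hypothetical and are not taken from this model's training pipeline.

```python
from datetime import date
from typing import List, Sequence


def min_max(values: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_scores(
    bm25: Sequence[float],
    dense: Sequence[float],
    passage_dates: Sequence[date],
    query_date: date,
    alpha: float = 0.5,            # hypothetical weight on the BM25 score
    boost: float = 0.2,            # hypothetical maximum temporal boost
    half_life_days: float = 30.0,  # hypothetical decay half-life
) -> List[float]:
    """Combine lexical and dense scores, then add a decaying temporal boost."""
    bm25_n, dense_n = min_max(bm25), min_max(dense)
    combined = []
    for b, d, p_date in zip(bm25_n, dense_n, passage_dates):
        age_days = abs((query_date - p_date).days)
        recency = 0.5 ** (age_days / half_life_days)  # 1.0 when dates match
        combined.append(alpha * b + (1.0 - alpha) * d + boost * recency)
    return combined
```

Under these assumptions, a passage dated close to the query's reference date gets up to `boost` added to its fused score, so the candidate list handed to the reranker leans toward temporally relevant passages.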

	## Evaluation (concise)

	- Procedure: retrieve candidate passages (temporal-aware hybrid) → rerank with cross-encoder → compute Hit@K for K ∈ {1,3,5,10,20}.
	- Results: numeric Hit@K values are not reported in this card; they are saved in the run outputs (summary CSVs / JSONL).
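Hit@K in the procedure above is 1 for a query when at least one gold passage appears in the top K reranked results, and 0 otherwise; the reported value is the mean over queries. A minimal sketch follows, assuming each run is a pair of (ranked passage ids, gold passage ids). The function names and the id format are illustrative, not the project's actual evaluation script.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def hit_at_k(ranked_ids: Sequence[str], gold_ids: Iterable[str], k: int) -> float:
    """Return 1.0 if any gold passage is among the top-k ranked ids, else 0.0."""
    gold = set(gold_ids)
    return 1.0 if any(pid in gold for pid in ranked_ids[:k]) else 0.0


def mean_hit_at_k(
    runs: List[Tuple[Sequence[str], Iterable[str]]],
    ks: Sequence[int] = (1, 3, 5, 10, 20),
) -> Dict[int, float]:
    """Average Hit@K over all queries, for each K."""
    if not runs:
        raise ValueError("runs must contain at least one query")
    runs = [(ranked, set(gold)) for ranked, gold in runs]
    return {
        k: sum(hit_at_k(ranked, gold, k) for ranked, gold in runs) / len(runs)
        for k in ks
    }
```

For example, `mean_hit_at_k([(["p3", "p1", "p7"], {"p1"}), (["p2", "p9"], {"p5"})], ks=(1, 3))` averages over two queries, only one of which has its gold passage in the top 3.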

	## Limitations & Risks

	- Domain-specific: optimized for Vietnamese financial/news passages; generalization outside this domain/language is uncertain.
	- Retrieval dependency: reranker cannot recover gold passages not present among retrieval candidates.
	- Truncation risk: 1536-token truncation may drop important context for long passages.
	- Data & license: dataset provenance and license are not specified here — verify before public distribution.
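To gauge the truncation risk before reranking, one option is to count the tokens of each (query, passage) pair without truncation and flag pairs over the 1536-token limit. The sketch below assumes a Hugging Face style tokenizer that accepts `text`, `text_pair`, and `truncation` arguments and returns `input_ids`; the helper names are hypothetical.

```python
MAX_LENGTH = 1536  # reranking limit stated in this card


def pair_token_count(tokenizer, query: str, passage: str) -> int:
    """Token count of the encoded pair, special tokens included, no truncation."""
    encoded = tokenizer(query, text_pair=passage, truncation=False)
    return len(encoded["input_ids"])


def would_truncate(tokenizer, query: str, passage: str,
                   max_length: int = MAX_LENGTH) -> bool:
    """True if the pair exceeds max_length and would lose tokens when scored."""
    return pair_token_count(tokenizer, query, passage) > max_length
```

Passages flagged this way could be split into shorter chunks upstream, so the reranker scores each chunk in full rather than a truncated prefix.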

	## Bias & Safety

	- Model reflects biases in the source news corpus (topic/regional biases).
	- Temporal heuristics can misinterpret ambiguous locale-specific dates and cause incorrect boosts.
	- Do not rely on reranker outputs alone for automated financial, legal, or medical decisions.
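The date ambiguity mentioned above is easy to reproduce: a string such as `03/04/2024` is 3 April under the day-first convention used in Vietnam, but 4 March under a month-first reading. The sketch below flags such strings so that a temporal boost can be skipped or reviewed. It assumes slash-separated numeric dates, and the helper names are illustrative.

```python
from datetime import date, datetime
from typing import Optional


def try_parse(text: str, fmt: str) -> Optional[date]:
    """Parse text with the given format, returning None if it does not match."""
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None


def is_ambiguous(text: str) -> bool:
    """True if day-first and month-first readings are both valid but differ."""
    day_first = try_parse(text, "%d/%m/%Y")
    month_first = try_parse(text, "%m/%d/%Y")
    return (
        day_first is not None
        and month_first is not None
        and day_first != month_first
    )
```

A date like `25/04/2024` has only one valid reading and is not flagged, while `03/04/2024` is.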

	## Quick usage (inference)

	Load this checkpoint with `AutoTokenizer` / `AutoModelForSequenceClassification`, tokenize (query, passage) pairs, score in eval mode, and sort candidates by descending score (higher = more relevant).
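The steps above can be sketched as follows. This is a minimal example rather than the author's exact script: it follows the usual `transformers` pattern for BGE rerankers, and it assumes the checkpoint is published on the Hub as `tiam4tt/BGE-Reranker-VietFinance` (substitute a local path if not). The helper names `score_pairs` and `rank_passages` are hypothetical. The `torch` and `transformers` imports are deferred so that the sorting helper can be used without loading the model.

```python
from typing import List, Tuple

MODEL_ID = "tiam4tt/BGE-Reranker-VietFinance"  # assumed Hub id or local path
MAX_LENGTH = 1536  # reranking limit stated in this card


def score_pairs(query: str, passages: List[str],
                model_id: str = MODEL_ID) -> List[float]:
    """Score each (query, passage) pair; a higher logit means more relevant."""
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    pairs = [[query, passage] for passage in passages]
    with torch.no_grad():
        inputs = tokenizer(pairs, padding=True, truncation=True,
                           max_length=MAX_LENGTH, return_tensors="pt")
        logits = model(**inputs).logits.view(-1).float()
    return logits.tolist()


def rank_passages(passages: List[str],
                  scores: List[float]) -> List[Tuple[str, float]]:
    """Sort passages by descending score."""
    order = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return [(passages[i], scores[i]) for i in order]
```

Typical use would be `rank_passages(candidates, score_pairs(query, candidates))`, where `candidates` is the list of passages returned by the upstream retriever. For many queries, load the tokenizer and model once outside the function instead of on every call.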

	## License & Citation

	- License: the card metadata declares `mit`, but the checkpoint itself does not specify a license; confirm before redistribution.
	- Cite the base models `BAAI/bge-reranker-v2-m3` and `BAAI/bge-m3` when reporting results.