---
library_name: transformers
pipeline_tag: text-ranking
license: mit
language:
- en
- zh
base_model:
- jhu-clsp/mmBERT-base
tags:
- reranker
- modernbert
- English
- zh-tw
- zh-cn
---
# AuroraX: A Fast Cross-Lingual Reranker Bridging English and Chinese
AuroraX is a lightweight yet powerful cross-lingual reranker built upon the [mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base) architecture.
It is designed to bridge **Traditional Chinese**, **Simplified Chinese**, and **English**, enabling high-quality semantic ranking across languages with remarkable efficiency.
Despite having only 110M non-embedding parameters, AuroraX achieves comparable performance to state-of-the-art rerankers that are twice as large.
Its design emphasizes both speed and language adaptability, making it ideal for real-world multilingual retrieval and re-ranking applications.
✨ Key Features:
- 🌏 **Cross-Lingual Understanding**: Trained to handle English, Traditional Chinese, and Simplified Chinese seamlessly.
- ⚡ **Lightweight & Fast**: Only 110M parameters (non-embedding), optimized for latency-sensitive pipelines.
- 🎯 **SOTA-Level Accuracy**: Comparable or superior to larger rerankers on Chinese and English benchmarks.
---
## Evaluation
### Monolingual Benchmarks
| Model | Metric | CMedQAv2-reranking (ZH) | T2Reranking (ZH) | **ZH AVG** | AskUbuntuDupQuestions (EN) | HUMENews21InstructionReranking (EN) | HUMEWikipediaRerankingMultilingual (EN) | SciDocsRR (EN) | **EN AVG** | **Total AVG** |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **AuroraX-Reranker-Base-v1.0**<br>*(Ours, 300M with 100M non-embed params)* | **mrr@10** | **0.8201** | **0.8554** | **0.8378** | **0.7936** | **1.0000** | **0.9778** | **0.9305** | **0.9255** | **0.8962** |
| | **mrr@5** | 0.8145 | 0.8514 | 0.8329 | 0.7841 | 1.0000 | 0.9778 | 0.9289 | 0.9227 | 0.8928 |
| **bge-reranker-v2-m3**<br>*(600M params)* | **mrr@10** | 0.8598 | 0.8004 | 0.8301 | 0.7635 | 0.9839 | 0.8750 | 0.9211 | 0.8859 | 0.8673 |
| | **mrr@5** | 0.8569 | 0.7954 | 0.8262 | 0.7532 | 0.9839 | 0.8750 | 0.9191 | 0.8828 | 0.8639 |
| **jina-reranker-v2-base-multilingual**<br>*(300M params)* | **mrr@10** | 0.2828 | 0.7577 | 0.5203 | 0.7420 | 1.0000 | 0.8761 | 0.9478 | 0.8915 | 0.7677 |
| | **mrr@5** | 0.2759 | 0.7512 | 0.5136 | 0.7299 | 1.0000 | 0.8761 | 0.9467 | 0.8882 | 0.7633 |
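The table above reports MRR@k (mean reciprocal rank truncated at the top k results): for each query, the score is the reciprocal of the rank of the first relevant document within the top k, or 0 if none appears, averaged over queries. A minimal sketch of the computation (the relevance labels below are illustrative, not taken from the benchmarks):

```python
def mrr_at_k(ranked_relevance: list[list[int]], k: int) -> float:
    """Mean Reciprocal Rank truncated at k.

    ranked_relevance: one list per query, holding binary relevance
    labels in the order the reranker returned the documents.
    """
    total = 0.0
    for labels in ranked_relevance:
        reciprocal_rank = 0.0
        for rank, relevant in enumerate(labels[:k], start=1):
            if relevant:
                reciprocal_rank = 1.0 / rank
                break
        total += reciprocal_rank
    return total / len(ranked_relevance)

# first query: relevant doc at rank 1; second query: at rank 2
# -> (1.0 + 0.5) / 2 = 0.75
print(mrr_at_k([[1, 0], [0, 1]], k=10))
```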
---
### Cross-Lingual (ZH ↔ EN) Results
| Model | inhouse-en2zh (HitRate@5) | inhouse-zh2en (HitRate@5) |
| --- | --- | --- |
| **AuroraX-Reranker-Base-v1.0 (Ours, 300M with 100M non-embed params)** | **0.8459** | **0.9427** |
| **bge-reranker-v2-m3 (600M params)** | 0.8179 | 0.9160 |
| **jina-reranker-v2-base-multilingual (300M params)** | 0.7815 | 0.8855 |
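HitRate@5 is the fraction of queries for which at least one relevant document lands in the top 5 after reranking. A minimal sketch (the labels are illustrative, not the in-house data):

```python
def hit_rate_at_k(ranked_relevance: list[list[int]], k: int = 5) -> float:
    """Fraction of queries with at least one relevant document in the top k.

    ranked_relevance: one list per query, holding binary relevance
    labels in the order the reranker returned the documents.
    """
    hits = sum(1 for labels in ranked_relevance if any(labels[:k]))
    return hits / len(ranked_relevance)

# first query has a hit at rank 5, second query's only hit is at rank 6
# -> 1 of 2 queries counts, so 0.5
print(hit_rate_at_k([[0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 1]]))
```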
---
## Usage
### Sentence-Transformers
```py
from sentence_transformers import CrossEncoder
model = CrossEncoder("aqweteddy/AuroraX-Reranker-Base-v1.0")
score = model.predict([("What is Deep Learning?", "Deep learning is a subfield of ML...")])
print(score)
```
### Text-Embedding-Inference (API)
1. Install [text-embeddings-inference](https://github.com/huggingface/text-embeddings-inference) and launch the router:
```bash
text-embeddings-router --model-id aqweteddy/AuroraX-Reranker-Base-v1.0
```
2. Run via REST API:
```bash
curl 127.0.0.1:8080/rerank \
-X POST \
-d '{"query": "What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."]}' \
-H 'Content-Type: application/json'
```
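The same endpoint can also be called from Python. A minimal stdlib-only client sketch, assuming the router is running locally on port 8080 as launched above (the helper names `build_rerank_payload` and `rerank` are ours, not part of TEI):

```python
import json
import urllib.request


def build_rerank_payload(query: str, texts: list[str]) -> bytes:
    # JSON body matching the /rerank request shape in the curl example
    return json.dumps({"query": query, "texts": texts}).encode("utf-8")


def rerank(query: str, texts: list[str],
           url: str = "http://127.0.0.1:8080/rerank"):
    req = urllib.request.Request(
        url,
        data=build_rerank_payload(query, texts),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# usage (requires the router to be running):
# rerank("What is Deep Learning?",
#        ["Deep Learning is not...", "Deep learning is..."])
```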
---
## Citation
```bibtex
@misc{aurorax2025,
title = {AuroraX: A Fast Cross-Lingual Reranker Bridging English and Chinese},
author = {aqweteddy},
year = {2025},
howpublished = {\url{https://huggingface.co/aqweteddy/AuroraX-Reranker-Base-v1.0}},
note = {Lightweight and powerful reranker for English, Traditional Chinese, and Simplified Chinese}
}
```