e5-finetuned-georgian

This repository contains a fine-tuned version of the intfloat/multilingual-e5-small model, specifically adapted for generating text embeddings for the Georgian language.

Model Description

This model was developed by fine-tuning the intfloat/multilingual-e5-small base model on a large-scale Georgian text pair dataset. The goal was to enhance its ability to understand the nuances of the Georgian language and produce more accurate and semantically rich vector representations of Georgian text.

The model is ideal for tasks such as:

  • Semantic search
  • Text similarity and clustering
  • Retrieval-Augmented Generation (RAG)
  • Zero-shot classification
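For instance, zero-shot classification with an embedding model reduces to nearest-label retrieval: embed the text and each label description, then pick the closest label. A minimal sketch below; the `classify` helper and the toy vectors are illustrative, and in practice the vectors would come from `model.encode`:

```python
import numpy as np

def classify(text_emb, label_embs, labels):
    # Cosine similarity between the text and each label description.
    sims = label_embs @ text_emb / (
        np.linalg.norm(label_embs, axis=1) * np.linalg.norm(text_emb)
    )
    return labels[int(np.argmax(sims))]

labels = ["sports", "politics"]
label_embs = np.array([[1.0, 0.1], [0.1, 1.0]])  # toy label embeddings
text_emb = np.array([0.95, 0.2])                 # toy embedding of a sports article

print(classify(text_emb, label_embs, labels))    # → sports
```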

Training Data

The model was fine-tuned using the sithet/georgian-text-pairs dataset from the Hugging Face Hub.

Benchmark Results

BelebeleRetrieval (zero-shot)

Task                 NDCG@1   NDCG@10   NDCG@1000
Georgian → Georgian  0.613    0.7178    0.7492
Georgian → English   0.513    0.6561    0.6938
English → Georgian   0.530    0.6608    0.7004

GeorgianFAQRetrieval (fine-tuned domain)

Metric      Value
NDCG@10     0.4702
MAP@10      0.4209
MRR@10      0.4210
Recall@10   0.6259

Tatoeba (Georgian ↔ English)

Metric      Score
Accuracy    0.8378
Precision   0.7741
Recall      0.8378
F1          0.7943
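NDCG@k, the headline metric in the retrieval tables above, rewards rankings that place relevant documents near the top, with a logarithmic discount by rank. A minimal sketch of the computation (the relevance judgments here are toy values):

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain: relevance discounted by log2(rank + 1),
    # where ranks are 1-based.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # Normalize by the DCG of the ideal (best possible) ordering.
    ideal = sorted(relevances, reverse=True)
    denom = dcg_at_k(ideal, k)
    return dcg_at_k(relevances, k) / denom if denom > 0 else 0.0

# Toy run: the single relevant document was retrieved at rank 2.
print(round(ndcg_at_k([0, 1, 0], 10), 4))  # → 0.6309
```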

How to Use

You can use this model directly with the sentence-transformers library.

First, install the library:

pip install -U sentence-transformers

Then load the model and encode text:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sithet/e5-finetuned-georgian")

query = "მთვარე"  # "moon"
embedding = model.encode(query)
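Note that E5-family models are trained with task prefixes: "query: " for search queries and "passage: " for documents. Assuming this fine-tune preserves that convention, a retrieval call would prefix the inputs before encoding and rank passages by cosine similarity. The sketch below uses a small helper and hand-made unit vectors to illustrate the ranking step; the `with_prefix` helper and the toy embeddings are not part of the model's API:

```python
import numpy as np

def with_prefix(texts, prefix):
    # E5 convention: prepend "query: " or "passage: " to each input.
    return [f"{prefix}: {t}" for t in texts]

# In real use:
#   q_emb = model.encode(with_prefix([query], "query"), normalize_embeddings=True)
#   p_emb = model.encode(with_prefix(passages, "passage"), normalize_embeddings=True)
# Here, tiny hand-made unit vectors stand in for the model's output.
q_emb = np.array([[1.0, 0.0]])
p_emb = np.array([[0.9, 0.435889894354],   # close to the query
                  [0.0, 1.0]])             # orthogonal to the query

scores = q_emb @ p_emb.T          # cosine similarity (vectors are unit-norm)
ranking = np.argsort(-scores[0])  # best match first
print(ranking.tolist())           # → [0, 1]
```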
Model Details

Weights: Safetensors, 0.1B parameters, F32 tensors