	---
	license: mit
	language:
	- ru
	- en
	tags:
	- mteb
	- transformers
	- sentence-transformers
	base_model:
	- ai-forever/FRIDA
	pipeline_tag: feature-extraction
	---

	# Model Card for FRIDA GGUF

	<figure>
	<img src="img.jpg">
	</figure>

	+ https://huggingface.co/evilfreelancer/FRIDA-GGUF
	+ https://ollama.com/evilfreelancer/FRIDA

	FRIDA is a full-scale fine-tuned general-purpose text embedding model inspired by the T5-based denoising architecture. It is built on the encoder of the [FRED-T5](https://arxiv.org/abs/2309.10931) model and continues our research on text embedding models ([ruMTEB](https://arxiv.org/abs/2408.12503), [ru-en-RoSBERTa](https://huggingface.co/ai-forever/ru-en-RoSBERTa)). It was pre-trained on a Russian-English dataset and then fine-tuned for improved performance on the target task.

	For more model details please refer to our technical report [TODO].

	## Usage

	The model can be used as is, provided that each input text starts with a task prefix. CLS pooling is recommended, although the best choice of prefix and pooling depends on the task.
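To make the pooling recommendation concrete, here is a minimal sketch of CLS pooling next to mean pooling. It assumes the encoder yields one vector per token and that the CLS token sits at position 0; the function names `cls_pool` and `mean_pool` and the toy batch are illustrative and not part of this model's API. Plain lists are used so the sketch runs without extra dependencies. When the model is served through Ollama, pooling is presumably applied by the runtime, so this is for intuition only.

```python
def cls_pool(hidden_states):
    """Return the vector at position 0 of each sequence (assumed CLS token)."""
    return [sequence[0] for sequence in hidden_states]


def mean_pool(hidden_states, attention_mask):
    """Average the token vectors of each sequence, skipping padded positions."""
    pooled = []
    for sequence, mask in zip(hidden_states, attention_mask):
        kept = [vector for vector, keep in zip(sequence, mask) if keep]
        dim = len(sequence[0])
        count = max(len(kept), 1)  # avoid division by zero for empty masks
        pooled.append([sum(v[d] for v in kept) / count for d in range(dim)])
    return pooled


# Toy batch: 2 sequences, 3 token positions, 2 hidden dimensions.
hidden = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[7.0, 8.0], [9.0, 10.0], [0.0, 0.0]],  # last position is padding
]
mask = [[1, 1, 1], [1, 1, 0]]

cls_vectors = cls_pool(hidden)
mean_vectors = mean_pool(hidden, mask)
print(cls_vectors)
print(mean_vectors)
```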

	We use the following basic rules to choose a prefix:
	- `"search_query: "` and `"search_document: "` prefixes are for retrieving an answer or a relevant paragraph
	- `"paraphrase: "` prefix is for symmetric paraphrase-related tasks (STS, paraphrase mining, deduplication)
	- `"categorize: "` prefix is for asymmetric matching of a document title and body (e.g. news, scientific papers, social posts)
	- `"categorize_sentiment: "` prefix is for any task that relies on sentiment features (e.g. hate speech, toxicity, emotion)
	- `"categorize_topic: "` prefix is for tasks where texts need to be grouped by topic
	- `"categorize_entailment: "` prefix is for textual entailment tasks (NLI)
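The prefix rules above can be wrapped in a small helper so that a prefix is never mistyped or forgotten. This is only a sketch: the prefix strings come from the list above, while the task keys (`"query"`, `"document"`, and so on) and the function name `add_prefix` are hypothetical names chosen for illustration.

```python
# Prefix strings are taken from the list above; the dictionary keys are
# hypothetical task labels introduced for this sketch.
TASK_PREFIXES = {
    "query": "search_query: ",
    "document": "search_document: ",
    "paraphrase": "paraphrase: ",
    "categorize": "categorize: ",
    "sentiment": "categorize_sentiment: ",
    "topic": "categorize_topic: ",
    "entailment": "categorize_entailment: ",
}


def add_prefix(task, text):
    """Prepend the prefix registered for `task` to `text`."""
    if task not in TASK_PREFIXES:
        known = ", ".join(sorted(TASK_PREFIXES))
        raise ValueError(f"unknown task {task!r}; expected one of: {known}")
    return TASK_PREFIXES[task] + text


print(add_prefix("query", "How many programmers does it take to change a light bulb?"))
print(add_prefix("document", "It takes three programmers to change a light bulb."))
```

Note that retrieval is asymmetric: the question and the candidate passage get different prefixes, whereas both texts of a paraphrase pair share the same one.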

	To better tailor the model to your needs, you can fine-tune it with relevant high-quality Russian and English datasets.

	Below is an example of encoding texts with the GGUF version of the model served through Ollama.

	### Ollama

	```shell
	ollama pull evilfreelancer/FRIDA:f16
	```

	```python
	import numpy as np
	import requests

	OLLAMA_HOST = "http://localhost:11434"
	MODEL_NAME = "evilfreelancer/FRIDA:f16"


	def get_embedding(text):
	    """Request the embedding of one prefixed text from the Ollama server."""
	    payload = {"model": MODEL_NAME, "input": text}
	    # json= serializes the payload and sets Content-Type: application/json
	    response = requests.post(f"{OLLAMA_HOST}/api/embed", json=payload)
	    response.raise_for_status()
	    return np.array(response.json()["embeddings"][0])


	def normalize(vectors):
	    """Scale every row to unit length; all-zero rows are left unchanged."""
	    vectors = np.atleast_2d(vectors)
	    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
	    norms[norms == 0] = 1.0
	    return vectors / norms


	def cosine_diag_similarity(a, b):
	    """Row-wise cosine similarity of two matrices with unit-length rows."""
	    return np.sum(a * b, axis=1)


	inputs = [
	    # First text of each pair
	    "paraphrase: В Ярославской области разрешили работу бань, но без посетителей",
	    "categorize_entailment: Женщину доставили в больницу за ее жизнь сейчас борются врачи.",
	    "search_query: Сколько программистов нужно, чтобы вкрутить лампочку?",
	    # Second text of each pair, in the same order
	    "paraphrase: Ярославским баням разрешили работать без посетителей",
	    "categorize_entailment: Женщину спасают врачи.",
	    "search_document: Чтобы вкрутить лампочку нужно три программиста.",
	]
	size = len(inputs) // 2

	embeddings = normalize(np.array([get_embedding(text) for text in inputs]))
	# Compare text i from the first half with text i from the second half
	sim_scores = cosine_diag_similarity(embeddings[:size], embeddings[size:])
	print(sim_scores.tolist())
	```
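The example above scores fixed pairs. For retrieval, a query is usually compared against many documents and the documents are sorted by similarity. The following is a minimal sketch of that ranking step, assuming the vectors were already obtained (for instance with `get_embedding` from the example above, using the `search_query: ` and `search_document: ` prefixes). The names `cosine` and `rank_documents` and the toy vectors are illustrative; the sketch uses only the standard library so it runs without a server.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vector, document_vectors):
    """Return (document index, score) pairs sorted from most to least similar."""
    scores = [cosine(query_vector, vector) for vector in document_vectors]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]


# Toy 2-dimensional vectors standing in for real embeddings.
query = [1.0, 0.0]
documents = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

for index, score in rank_documents(query, documents):
    print(index, round(score, 3))
```

For larger collections it is usually preferable to normalize all document vectors once and compute the scores with a single matrix product, as `normalize` does in the example above.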

	## Authors
	+ [SaluteDevices](https://sberdevices.ru/) AI for B2C RnD Team
	+ Artem Snegirev: [HF profile](https://huggingface.co/artemsnegirev), [GitHub](https://github.com/artemsnegirev)
	+ Anna Maksimova: [HF profile](https://huggingface.co/anpalmak)
	+ Aleksandr Abramov: [HF profile](https://huggingface.co/Andrilko), [GitHub](https://github.com/Ab1992ao), [Kaggle Competitions Master](https://www.kaggle.com/andrilko)
	+ Pavel Rykov: [HF profile](https://huggingface.co/evilfreelancer), [GitHub](https://github.com/evilfreelancer) (creator of the GGUF version)

	## Citation

	```
	@misc{TODO
	}
	```

	## Limitations

	The model is designed to process texts in Russian; its quality on English texts is unknown. The maximum input length is 512 tokens.