tatar2vec-demo

ArabovMK

Update README.md

09e1de9 verified 4 months ago

	---
	title: Tatar2Vec Explorer
	emoji: 🏆
	colorFrom: indigo
	colorTo: purple
	sdk: docker
	pinned: true
	app_file: app.py
	---

	# 🏆 Tatar2Vec Explorer

	<div align="center">

	Discover the Power of Tatar Language AI

	High-quality word embeddings for the Tatar language

	[![Hugging Face](https://img.shields.io/badge/🤗-Hugging%20Face%20Space-blue)](https://huggingface.co/arabovs-ai-lab/Tatar2Vec)
	[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
	[![Streamlit](https://img.shields.io/badge/Interface-Streamlit-FF4B4B)](https://streamlit.io)

	</div>

	## 🌟 Overview

	Tatar2Vec provides word embeddings for the Tatar language, trained as FastText and Word2Vec models. The best FastText model reaches a composite evaluation score of 0.7019, roughly 3.5× the 0.2000 scored by Meta's cc.tt.300 vectors. This interactive demo lets you explore word similarity, analogies, and out-of-vocabulary handling with these models.

	## 🚀 Features

	### 🔍 Semantic Search
	- Word Similarity: Find semantically similar words
	- Vector Operations: Perform complex word analogies
	- Interactive Visualizations: Explore results with beautiful charts and word clouds

	### 🧠 Advanced Analytics
	- Model Comparison: Compare FastText vs Word2Vec performance
	- OOV Handling: Test out-of-vocabulary word capabilities
	- Performance Metrics: Detailed model evaluation scores

	### 🎯 Model Variants
	- 🥇 Best FastText: `ft_dim100_win5_min5_ngram3-6_sg.epoch1` (Composite: 0.7019)
	- 🥈 Alternative FastText: `ft_dim100_win5_min5_ngram3-6_sg.epoch3`
	- 🥇 Best Word2Vec: `w2v_dim200_win5_min5_sg.epoch4`
	- 🥈 Compact Word2Vec: `w2v_dim100_win5_min5_sg`
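
The model names above appear to encode their hyperparameters. The sketch below decodes a name into gensim-style parameter names (`vector_size`, `window`, `min_count`, `sg`, `min_n`, `max_n`). The naming pattern is an assumption inferred from the four listed names, and `parse_model_name` and `checkpoint_epoch` are hypothetical names used only for illustration; reading `.epochN` as a training checkpoint is likewise a guess.

```python
import re

# Assumed naming pattern, inferred from the four model names listed above.
_NAME_RE = re.compile(
    r"^(?P<family>ft|w2v)"
    r"_dim(?P<dim>\d+)_win(?P<win>\d+)_min(?P<min>\d+)"
    r"(?:_ngram(?P<min_n>\d+)-(?P<max_n>\d+))?"
    r"_sg"
    r"(?:\.epoch(?P<epoch>\d+))?$"
)


def parse_model_name(name):
    """Map a model name to gensim-style hyperparameter names (hypothetical helper)."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized model name: {name!r}")
    params = {
        "family": "fasttext" if match["family"] == "ft" else "word2vec",
        "vector_size": int(match["dim"]),
        "window": int(match["win"]),
        "min_count": int(match["min"]),
        "sg": 1,  # every listed name ends in `_sg` (skip-gram)
    }
    # Only the FastText names carry a character n-gram range.
    if match["min_n"] is not None:
        params["min_n"] = int(match["min_n"])
        params["max_n"] = int(match["max_n"])
    # Assumption: `.epochN` marks the training epoch of the saved checkpoint.
    if match["epoch"] is not None:
        params["checkpoint_epoch"] = int(match["epoch"])
    return params


print(parse_model_name("ft_dim100_win5_min5_ngram3-6_sg.epoch1"))
print(parse_model_name("w2v_dim100_win5_min5_sg"))
```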

	## 📊 Performance Highlights

	| Model | Composite Score | Semantic Similarity | OOV Handling |
	|-------|-----------------|---------------------|--------------|
	| Best FastText | 0.7019 | 0.7368 | 1.0000 |
	| Meta cc.tt.300 | 0.2000 | - | - |
	| Improvement | 3.5× | Significant | Perfect |
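
The 3.5× figure can be checked directly from the two composite scores in the table. The variable names below are illustrative only.

```python
# Composite scores copied from the table above.
best_fasttext = 0.7019
meta_cc_tt_300 = 0.2000

# Ratio of the two composite scores.
improvement = best_fasttext / meta_cc_tt_300
print(f"{improvement:.2f}x")
```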

	## 🎮 Quick Start

	### Try These Examples:

	#### Word Similarity
	```python
	# Find words similar to "мәктәп" (school).
	# `model` is a loaded gensim model; its word vectors live under `model.wv`.
	similar_words = model.wv.most_similar('мәктәп', topn=10)
	```

	#### Word Analogies
	```python
	# doctor - man + woman = ?
	# `model` is a loaded gensim model; its word vectors live under `model.wv`.
	analogy = model.wv.most_similar(
	    positive=['табиб', 'хатын'],  # doctor, woman
	    negative=['ир'],              # man
	)
	```
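
To show what an analogy query computes, here is a minimal pure-Python sketch of the usual additive scheme: add the unit vectors of the positive words, subtract those of the negative words, and rank the remaining words by cosine similarity to the result. The 3-dimensional vectors and English labels are made-up toy data, not Tatar2Vec output, and the `analogy` helper is hypothetical; gensim's `most_similar` follows the same general idea but differs in implementation details.

```python
import math


def unit(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))


def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    # Query words themselves are excluded from the candidates.
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors (made up): axes are roughly [medical, male, female].
toy_vectors = {
    "doctor": [1.0, 1.0, 0.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "female_doctor": [1.0, 0.0, 1.0],
    "school": [0.2, 0.5, 0.5],
}

result = analogy(toy_vectors, positive=["doctor", "woman"], negative=["man"])
print(result[0][0])
```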

	#### OOV Testing (FastText Only)
	```python
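	# FastText can build a vector for a word it never saw in training
	# by combining the vectors of the word's character n-grams.
	vector = model.wv['технологияләштерү']  # technology-related word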
	```
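
FastText can handle unseen words because it represents each word through its character n-grams. The sketch below lists the n-grams for a word, assuming the `ngram3-6` part of the model names means n-gram lengths 3 through 6. The `char_ngrams` helper is hypothetical and only illustrates the idea; the library additionally hashes these n-grams into a fixed number of buckets.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    # FastText marks word boundaries with "<" and ">".
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


print(char_ngrams("ир"))           # short word: only three n-grams fit
print(len(char_ngrams("мәктәп")))  # number of 3- to 6-grams for "мәктәп"
```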

	## 🏗️ Technical Details

	### Training Corpus
	- Total Tokens: 203.2 million
	- Vocabulary Size: 637.7K words
	- Unique Words: 1.8 million
	- Domains: Wikipedia, news, books, social media

	### Model Architecture
	- FastText: Subword information support
	- Word2Vec: Classical word embeddings
	- Training setup: Skip-gram architecture; 100 dimensions for the FastText and compact Word2Vec models, 200 for the best Word2Vec model

	## 📚 Use Cases

	### 🎓 Education
	- Language learning applications
	- Educational content analysis
	- Academic research

	### 💼 Business
	- Content recommendation systems
	- Search engine enhancement
	- Customer feedback analysis

	### 🔬 Research
	- Linguistic studies
	- Cross-lingual comparisons
	- AI model development

	## 🛠️ Installation

	### Local Development
	```bash
	git clone https://huggingface.co/spaces/arabovs-ai-lab/tatar2vec-demo
	cd tatar2vec-demo
	pip install -r requirements.txt
	streamlit run app.py
	```

	### Docker Deployment
	```bash
	docker build -t tatar2vec-demo .
	docker run -p 7860:7860 tatar2vec-demo
	```

	## 🌐 API Access

	```python
	from huggingface_hub import snapshot_download
	from gensim.models import FastText

	# Download and load the best model
	model_dir = snapshot_download(repo_id="arabovs-ai-lab/Tatar2Vec")
	model = FastText.load(f"{model_dir}/fasttext/ft_dim100_win5_min5_ngram3-6_sg.epoch1/ft_dim100_win5_min5_ngram3-6_sg.epoch1.model")

	# Use the model
	similar_words = model.wv.most_similar('мәктәп')
	```

	## 📊 Evaluation Metrics

	Our models were evaluated on multiple dimensions:
	- Semantic Similarity: Human-judged word pairs
	- Analogy Accuracy: Word relationship tasks
	- OOV Handling: Unknown word processing
	- Neighbor Coherence: Semantic consistency
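
This README does not state how each metric is computed. For human-judged word pairs, a common convention is to report the Spearman rank correlation between human ratings and model cosine similarities; the sketch below assumes that convention and assumes no tied values. The `spearman` helper and the four example scores are made up for illustration.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for two equal-length lists without ties."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Made-up example: human ratings vs. model cosine similarities for four pairs.
human_scores = [9.0, 7.5, 3.0, 1.0]
model_scores = [0.8, 0.9, 0.3, 0.1]

rho = spearman(human_scores, model_scores)
print(round(rho, 2))
```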

	## 🤝 Contributing

	We welcome contributions from the community! Areas of interest:
	- Additional evaluation benchmarks
	- New model architectures
	- Expanded training data
	- Multilingual applications

	## 📜 Citation

	If you use Tatar2Vec in your research, please cite:

	```bibtex
	@misc{tatar2vec2025,
	title = {Tatar2Vec: High-Quality Tatar Word Embeddings},
	author = {Arabovs AI Lab},
	year = {2025},
	publisher = {Hugging Face},
	url = {https://huggingface.co/arabovs-ai-lab/Tatar2Vec},
	note = {Version 1.0}
	}
	```

	## 📄 License

	This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

	## 🙏 Acknowledgments

	- Tatar language speakers and contributors
	- Hugging Face for platform support
	- Open-source community for tools and libraries

	---

	<div align="center">

	Empowering Tatar Language Technology

	Brought to you by [Arabovs AI Lab](https://huggingface.co/arabovs-ai-lab)

	[Report Issues](https://github.com/arabovs-ai-lab/tatar2vec/issues) •
	[Request Features](https://github.com/arabovs-ai-lab/tatar2vec/discussions) •
	[Contact Team](mailto:contact@arabovs-ai-lab.com)

	</div>