---
title: Tatar2Vec Explorer
emoji: 🏆
colorFrom: indigo
colorTo: purple
sdk: docker
pinned: true
app_file: app.py
---

# 🏆 Tatar2Vec Explorer

**Discover the Power of Tatar Language AI**

*High-quality word embeddings for the Tatar language*

[![Hugging Face](https://img.shields.io/badge/🤗-Hugging%20Face%20Space-blue)](https://huggingface.co/arabovs-ai-lab/Tatar2Vec)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![Streamlit](https://img.shields.io/badge/Interface-Streamlit-FF4B4B)](https://streamlit.io)

## 🌟 Overview

Tatar2Vec provides FastText and Word2Vec word embeddings trained for the Tatar language. On the project's composite evaluation score, the best FastText model reaches 0.7019, compared with 0.2000 for Meta's `cc.tt.300` vectors, roughly a 3.5× improvement (see Performance Highlights). This interactive demo lets you explore the models through similarity search, word analogies, and visualizations.

## 🚀 Features

### 🔍 Semantic Search

- **Word Similarity**: Find semantically similar words
- **Vector Operations**: Perform word analogies with vector arithmetic
- **Interactive Visualizations**: Explore results with charts and word clouds

### 🧠 Advanced Analytics

- **Model Comparison**: Compare FastText vs Word2Vec performance
- **OOV Handling**: Test out-of-vocabulary (OOV) word capabilities
- **Performance Metrics**: Detailed model evaluation scores

### 🎯 Model Variants

- **🥇 Best FastText**: `ft_dim100_win5_min5_ngram3-6_sg.epoch1` (Composite: 0.7019)
- **🥈 Alternative FastText**: `ft_dim100_win5_min5_ngram3-6_sg.epoch3`
- **🥇 Best Word2Vec**: `w2v_dim200_win5_min5_sg.epoch4`
- **🥈 Compact Word2Vec**: `w2v_dim100_win5_min5_sg`

## 📊 Performance Highlights

| Model | Composite Score | Semantic Similarity | OOV Handling |
|-------|-----------------|---------------------|--------------|
| **Best FastText** | 0.7019 | 0.7368 | 1.0000 |
| Meta cc.tt.300 | 0.2000 | - | - |
| **Improvement** | **3.5×** | **Significant** | **Perfect** |

## 🎮 Quick Start

### Try These Examples:

The snippets below assume `model` has been loaded with gensim as shown in the API Access section; queries go through the model's `wv` attribute.

#### Word Similarity

```python
# Find words similar to "мәктәп" (school)
similar_words = model.wv.most_similar('мәктәп', topn=10)
```

#### Word Analogies

```python
# Doctor - man + woman = ?
analogy = model.wv.most_similar(
    positive=['табиб', 'хатын'],  # doctor, woman
    negative=['ир']               # man
)
```

#### OOV Testing (FastText Only)

```python
# Handle unknown words: FastText builds a vector from subword n-grams
vector = model.wv['технологияләштерү']  # technology-related word
```

## 🏗️ Technical Details

### Training Corpus

- **Total Tokens**: 203.2 million
- **Vocabulary Size**: 637.7K words
- **Unique Words**: 1.8 million
- **Domains**: Wikipedia, news, books, social media

### Model Architecture

- **FastText**: Subword information support
- **Word2Vec**: Classical word embeddings
- **Optimized**: Skip-gram architecture, 100 dimensions

## 📚 Use Cases

### 🎓 Education

- Language learning applications
- Educational content analysis
- Academic research

### 💼 Business

- Content recommendation systems
- Search engine enhancement
- Customer feedback analysis

### 🔬 Research

- Linguistic studies
- Cross-lingual comparisons
- AI model development

## 🛠️ Installation

### Local Development

```bash
git clone https://huggingface.co/spaces/arabovs-ai-lab/tatar2vec-demo
cd tatar2vec-demo
pip install -r requirements.txt
streamlit run app.py
```

### Docker Deployment

```bash
docker build -t tatar2vec-demo .
docker run -p 7860:7860 tatar2vec-demo
```

## 🌐 API Access

```python
from huggingface_hub import snapshot_download
from gensim.models import FastText

# Download the model repository and load the best FastText model
model_dir = snapshot_download(repo_id="arabovs-ai-lab/Tatar2Vec")
name = "ft_dim100_win5_min5_ngram3-6_sg.epoch1"
model = FastText.load(f"{model_dir}/fasttext/{name}/{name}.model")

# Query the word vectors through the `wv` attribute
similar_words = model.wv.most_similar('мәктәп')
```

## 📊 Evaluation Metrics

The models were evaluated on several dimensions:

- **Semantic Similarity**: Human-judged word pairs
- **Analogy Accuracy**: Word relationship tasks
- **OOV Handling**: Unknown word processing
- **Neighbor Coherence**: Semantic consistency

## 🤝 Contributing

We welcome contributions from the community! Areas of interest:

- Additional evaluation benchmarks
- New model architectures
- Expanded training data
- Multilingual applications

## 📜 Citation

If you use Tatar2Vec in your research, please cite:

```bibtex
@misc{tatar2vec2025,
  title     = {Tatar2Vec: High-Quality Tatar Word Embeddings},
  author    = {Arabovs AI Lab},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/arabovs-ai-lab/Tatar2Vec},
  note      = {Version 1.0}
}
```

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- Tatar language speakers and contributors
- Hugging Face for platform support
- Open-source community for tools and libraries

---

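A note on how OOV handling can work: the `ngram3-6` part of the FastText model names suggests that words are broken into character n-grams of length 3 to 6, so an unseen word can still be represented through the pieces it shares with known words. The sketch below illustrates only that decomposition and is not the library's implementation. `char_ngrams` and `shared_ngrams` are hypothetical helper names, real FastText additionally hashes n-grams into buckets and combines their learned vectors, and the plural form `мәктәпләр` is assumed to be out of vocabulary purely for illustration.

```python
def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list[str]:
    """Return FastText-style character n-grams for a word.

    The word is wrapped in '<' and '>' boundary markers before slicing,
    so prefixes and suffixes yield distinct n-grams.
    """
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the wrapped word
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


def shared_ngrams(a: str, b: str) -> set[str]:
    """Return the n-grams that two words have in common."""
    return set(char_ngrams(a)) & set(char_ngrams(b))


known = "мәктәп"      # "school", taken from the examples above
unseen = "мәктәпләр"  # plural form, assumed OOV for illustration only

print(len(char_ngrams(known)))             # number of n-grams for the known word
print(sorted(shared_ngrams(known, unseen)))  # pieces the unseen word can reuse
```

Because the unseen word reuses most of the known word's n-grams, a vector composed from those pieces would be expected to land near related known words, which is the behavior the OOV test in the demo exercises.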
**Empowering Tatar Language Technology**

*Brought to you by [Arabovs AI Lab](https://huggingface.co/arabovs-ai-lab)*

[Report Issues](https://github.com/arabovs-ai-lab/tatar2vec/issues) • [Request Features](https://github.com/arabovs-ai-lab/tatar2vec/discussions) • [Contact Team](mailto:contact@arabovs-ai-lab.com)