| title: Tatar2Vec Explorer | |
| emoji: 🏆 | |
| colorFrom: indigo | |
| colorTo: purple | |
| sdk: docker | |
| pinned: true | |
| app_file: app.py | |
| # 🏆 Tatar2Vec Explorer | |
| <div align="center"> | |
| **Discover the Power of Tatar Language AI** | |
| *High-quality word embeddings for the Tatar language* | |
| [Tatar2Vec on Hugging Face](https://huggingface.co/arabovs-ai-lab/Tatar2Vec) | |
| [License](LICENSE) | |
| [Streamlit](https://streamlit.io) | |
| </div> | |
| ## 🌟 Overview | |
| Tatar2Vec provides FastText and Word2Vec word embeddings for the Tatar language. On the project's composite evaluation score, the best FastText model reaches 0.7019, compared with 0.2000 for Meta's cc.tt.300 baseline. This interactive demo lets you explore Tatar semantics through word similarity search, word analogies, and out-of-vocabulary (OOV) tests. | |
| ## 🚀 Features | |
| ### 🔍 Semantic Search | |
| - **Word Similarity**: Find semantically similar words | |
| - **Vector Operations**: Perform complex word analogies | |
| - **Interactive Visualizations**: Explore results with beautiful charts and word clouds | |
| ### 🧠 Advanced Analytics | |
| - **Model Comparison**: Compare FastText vs Word2Vec performance | |
| - **OOV Handling**: Test out-of-vocabulary word capabilities | |
| - **Performance Metrics**: Detailed model evaluation scores | |
| ### 🎯 Model Variants | |
| - **🥇 Best FastText**: `ft_dim100_win5_min5_ngram3-6_sg.epoch1` (Composite: 0.7019) | |
| - **🥈 Alternative FastText**: `ft_dim100_win5_min5_ngram3-6_sg.epoch3` | |
| - **🥇 Best Word2Vec**: `w2v_dim200_win5_min5_sg.epoch4` | |
| - **🥈 Compact Word2Vec**: `w2v_dim100_win5_min5_sg` | |
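The model names above appear to encode the training hyperparameters. The sketch below parses them under an assumed reading of the scheme (`dim` = vector size, `win` = context window, `min` = minimum word count, `ngram` = character n-gram range, `sg` = skip-gram, `epoch` = checkpoint number). The field meanings are inferred from the names and are not documented in this README, and `parse_model_name` is a hypothetical helper.

```python
import re

# Assumed naming scheme, inferred from the model names listed above.
NAME_PATTERN = re.compile(
    r"^(?P<family>ft|w2v)"                       # ft = FastText, w2v = Word2Vec
    r"_dim(?P<dim>\d+)"                          # vector size
    r"_win(?P<window>\d+)"                       # context window
    r"_min(?P<min_count>\d+)"                    # minimum word count
    r"(?:_ngram(?P<min_n>\d+)-(?P<max_n>\d+))?"  # character n-gram range (FastText only)
    r"_(?P<algo>[a-z]+)"                         # training algorithm, e.g. sg
    r"(?:\.epoch(?P<epoch>\d+))?$"               # optional checkpoint epoch
)

NUMERIC_FIELDS = ("dim", "window", "min_count", "min_n", "max_n", "epoch")


def parse_model_name(name: str) -> dict:
    """Split a model name into its (assumed) hyperparameter fields."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized model name: {name!r}")
    fields = match.groupdict()
    for key in NUMERIC_FIELDS:
        if fields[key] is not None:
            fields[key] = int(fields[key])
    return fields


if __name__ == "__main__":
    for model_name in (
        "ft_dim100_win5_min5_ngram3-6_sg.epoch1",
        "w2v_dim200_win5_min5_sg.epoch4",
        "w2v_dim100_win5_min5_sg",
    ):
        print(model_name, "->", parse_model_name(model_name))
```

Fields that a name does not carry (for example the n-gram range of a Word2Vec model) come back as `None`.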
| ## 📊 Performance Highlights | |
| | Model | Composite Score | Semantic Similarity | OOV Handling | | |
| |-------|----------------|-------------------|-------------| | |
| | **Best FastText** | 0.7019 | 0.7368 | 1.0000 | | |
| | Meta cc.tt.300 | 0.2000 | - | - | | |
| **Improvement** | **3.5×** | not comparable (no baseline score) | not comparable (no baseline score) | |
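The 3.5× figure follows directly from the two composite scores in the table. A quick check, using only the numbers reported above:

```python
# Composite scores as reported in the table above.
best_fasttext = 0.7019
meta_cc_tt_300 = 0.2000


def improvement_ratio(score: float, baseline: float) -> float:
    """Return how many times larger `score` is than `baseline`."""
    return score / baseline


if __name__ == "__main__":
    ratio = improvement_ratio(best_fasttext, meta_cc_tt_300)
    print(f"{ratio:.2f}x")
```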
| ## 🎮 Quick Start | |
| ### Try These Examples: | |
| #### Word Similarity | |
| ```python | |
| # Find words similar to "мәктәп" (school) | |
| similar_words = model.wv.most_similar('мәктәп', topn=10) | |
| ``` | |
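Similarity queries of this kind are typically ranked by cosine similarity between word vectors. The sketch below shows the idea in pure Python, without loading a model. The three-dimensional vectors are invented for illustration and are not taken from Tatar2Vec, and `cosine_similarity` and `most_similar` are hypothetical helpers, not the gensim implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors, invented for illustration only.
toy_vectors = {
    "мәктәп": [0.9, 0.1, 0.0],   # school
    "укытучы": [0.8, 0.3, 0.1],  # teacher
    "китап": [0.6, 0.5, 0.2],    # book
    "алма": [0.0, 0.2, 0.9],     # apple
}

if __name__ == "__main__":
    for neighbor, score in most_similar("мәктәп", toy_vectors):
        print(f"{neighbor}\t{score:.3f}")
```

With these toy values, "укытучы" ranks closest to "мәктәп" and "алма" ranks last. Real models use 100 or 200 dimensions.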
| #### Word Analogies | |
| ```python | |
| # Doctor - man + woman = ? | |
| analogy = model.wv.most_similar( | |
|     positive=['табиб', 'хатын'],  # doctor, woman | |
|     negative=['ир']               # man | |
| ) | |
| ``` | |
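An analogy query adds the positive word vectors, subtracts the negative ones, and looks for the nearest remaining word. The sketch below illustrates that arithmetic on hand-made vectors. The vectors, the word set (әти "father", әни "mother"), and the `analogy` helper are assumptions for illustration only, not output from Tatar2Vec or the exact gensim algorithm.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def analogy(positive, negative, vectors, topn=1):
    """Add unit vectors of `positive`, subtract `negative`, rank the rest by cosine."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, normalize(vectors[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, normalize(vectors[word]))]
    target = normalize(target)
    excluded = set(positive) | set(negative)  # query words are never returned
    scores = [
        (word, sum(t * x for t, x in zip(target, normalize(vec))))
        for word, vec in vectors.items()
        if word not in excluded
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy axes (invented): male, female, parent, fruit.
toy_vectors = {
    "ир": [1.0, 0.0, 0.1, 0.0],     # man
    "хатын": [0.0, 1.0, 0.1, 0.0],  # woman
    "әти": [1.0, 0.0, 1.0, 0.0],    # father
    "әни": [0.0, 1.0, 1.0, 0.0],    # mother
    "алма": [0.0, 0.0, 0.0, 1.0],   # apple
}

if __name__ == "__main__":
    # father - man + woman = ?
    print(analogy(positive=["әти", "хатын"], negative=["ир"], vectors=toy_vectors))
```

On these toy vectors the top result is "әни". Results on a trained model depend on the corpus and are not guaranteed to be this clean.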
| #### OOV Testing (FastText Only) | |
| ```python | |
| # Handle unknown words | |
| vector = model.wv['технологияләштерү'] # technology-related word | |
| ``` | |
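FastText can return a vector for an unseen word because it represents words through character n-grams and composes the vectors of those n-grams. The sketch below lists the n-grams a word is split into, assuming the `ngram3-6` part of the model name means a minimum length of 3 and a maximum of 6. `char_ngrams` is a hypothetical helper that mirrors the general FastText scheme and is not the library's own code.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, with FastText-style < > boundary marks."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


if __name__ == "__main__":
    grams = char_ngrams("мәктәп")
    print(len(grams), "n-grams")
    print(grams[:6])
```

An unseen word such as "технологияләштерү" shares many of its n-grams with words seen in training, which is why a usable vector can still be composed for it. Word2Vec models have no subword units, so they cannot do this.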
| ## 🏗️ Technical Details | |
| ### Training Corpus | |
| - **Total Tokens**: 203.2 million | |
| - **Vocabulary Size**: 637.7K words | |
| - **Unique Words**: 1.8 million | |
| - **Domains**: Wikipedia, news, books, social media | |
| ### Model Architecture | |
| - **FastText**: Subword information support | |
| - **Word2Vec**: Classical word embeddings | |
| - **Best configurations**: Skip-gram training; 100 dimensions for the best FastText model and 200 dimensions for the best Word2Vec model | |
| ## 📚 Use Cases | |
| ### 🎓 Education | |
| - Language learning applications | |
| - Educational content analysis | |
| - Academic research | |
| ### 💼 Business | |
| - Content recommendation systems | |
| - Search engine enhancement | |
| - Customer feedback analysis | |
| ### 🔬 Research | |
| - Linguistic studies | |
| - Cross-lingual comparisons | |
| - AI model development | |
| ## 🛠️ Installation | |
| ### Local Development | |
| ```bash | |
| git clone https://huggingface.co/spaces/arabovs-ai-lab/tatar2vec-demo | |
| cd tatar2vec-demo | |
| pip install -r requirements.txt | |
| streamlit run app.py | |
| ``` | |
| ### Docker Deployment | |
| ```bash | |
| docker build -t tatar2vec-demo . | |
| docker run -p 7860:7860 tatar2vec-demo | |
| ``` | |
| ## 🌐 API Access | |
| ```python | |
| from huggingface_hub import snapshot_download | |
| from gensim.models import FastText | |
| # Download and load the best model | |
| model_dir = snapshot_download(repo_id="arabovs-ai-lab/Tatar2Vec") | |
| model = FastText.load(f"{model_dir}/fasttext/ft_dim100_win5_min5_ngram3-6_sg.epoch1/ft_dim100_win5_min5_ngram3-6_sg.epoch1.model") | |
| # Use the model | |
| similar_words = model.wv.most_similar('мәктәп') | |
| ``` | |
| ## 📊 Evaluation Metrics | |
| Our models were evaluated on multiple dimensions: | |
| - **Semantic Similarity**: Human-judged word pairs | |
| - **Analogy Accuracy**: Word relationship tasks | |
| - **OOV Handling**: Unknown word processing | |
| - **Neighbor Coherence**: Semantic consistency | |
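This README does not state which statistic is used for the semantic similarity score. One common choice for word-pair benchmarks is the Spearman rank correlation between human ratings and model cosine similarities. The sketch below shows that calculation under this assumption; the ratings are invented, and the simplified formula requires that there are no tied values.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for two equal-length, tie-free sequences."""
    n = len(xs)
    rank_x, rank_y = ranks(xs), ranks(ys)
    squared_diffs = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * squared_diffs / (n * (n * n - 1))


if __name__ == "__main__":
    # Invented example: human ratings vs. model cosine similarities for 4 word pairs.
    human_ratings = [9.0, 7.5, 3.0, 1.0]
    model_scores = [0.8, 0.9, 0.3, 0.1]
    print(f"{spearman(human_ratings, model_scores):.2f}")
```

A value near 1 means the model orders the word pairs the same way human annotators do. For data with ties, a library routine such as `scipy.stats.spearmanr` is a safer choice.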
| ## 🤝 Contributing | |
| We welcome contributions from the community! Areas of interest: | |
| - Additional evaluation benchmarks | |
| - New model architectures | |
| - Expanded training data | |
| - Multilingual applications | |
| ## 📜 Citation | |
| If you use Tatar2Vec in your research, please cite: | |
| ```bibtex | |
| @misc{tatar2vec2025, | |
| title = {Tatar2Vec: High-Quality Tatar Word Embeddings}, | |
| author = {Arabovs AI Lab}, | |
| year = {2025}, | |
| publisher = {Hugging Face}, | |
| url = {https://huggingface.co/arabovs-ai-lab/Tatar2Vec}, | |
| note = {Version 1.0} | |
| } | |
| ``` | |
| ## 📄 License | |
| This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. | |
| ## 🙏 Acknowledgments | |
| - Tatar language speakers and contributors | |
| - Hugging Face for platform support | |
| - Open-source community for tools and libraries | |
| --- | |
| <div align="center"> | |
| **Empowering Tatar Language Technology** | |
| *Brought to you by [Arabovs AI Lab](https://huggingface.co/arabovs-ai-lab)* | |
| [Report Issues](https://github.com/arabovs-ai-lab/tatar2vec/issues) • | |
| [Request Features](https://github.com/arabovs-ai-lab/tatar2vec/discussions) • | |
| [Contact Team](mailto:contact@arabovs-ai-lab.com) | |
| </div> | |