tatar2vec-demo


ArabovMK committed on Nov 9, 2025

Commit 09e1de9 (verified) · 1 Parent(s): a0e97e3

Update README.md


Files changed (1)

README.md +193 -192


@@ -1,192 +1,193 @@
-title: Tatar2Vec Explorer
-emoji: 🏆
-colorFrom: indigo
-colorTo: purple
-sdk: docker
-pinned: true
-app_file: app.py
----
-# 🏆 Tatar2Vec Explorer
-<div align="center">
-**Discover the Power of Tatar Language AI**
-*High-quality word embeddings for the Tatar language*
-[![Hugging Face](https://img.shields.io/badge/🤗-Hugging%20Face%20Space-blue)](https://huggingface.co/arabovs-ai-lab/Tatar2Vec)
-[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
-[![Streamlit](https://img.shields.io/badge/Interface-Streamlit-FF4B4B)](https://streamlit.io)
-</div>
-## 🌟 Overview
-Tatar2Vec represents a breakthrough in natural language processing for the Tatar language, offering state-of-the-art word embeddings that significantly outperform existing solutions. This interactive demo allows you to explore the semantic richness of Tatar through cutting-edge AI models.
-## 🚀 Features
-### 🔍 Semantic Search
-- **Word Similarity**: Find semantically similar words
-- **Vector Operations**: Perform complex word analogies
-- **Interactive Visualizations**: Explore results with beautiful charts and word clouds
-### 🧠 Advanced Analytics
-- **Model Comparison**: Compare FastText vs Word2Vec performance
-- **OOV Handling**: Test out-of-vocabulary word capabilities
-- **Performance Metrics**: Detailed model evaluation scores
-### 🎯 Model Variants
-- **🥇 Best FastText**: `ft_dim100_win5_min5_ngram3-6_sg.epoch1` (Composite: 0.7019)
-- **🥈 Alternative FastText**: `ft_dim100_win5_min5_ngram3-6_sg.epoch3`
-- **🥇 Best Word2Vec**: `w2v_dim200_win5_min5_sg.epoch4`
-- **🥈 Compact Word2Vec**: `w2v_dim100_win5_min5_sg`
-## 📊 Performance Highlights
-| Model | Composite Score | Semantic Similarity | OOV Handling |
-|-------|----------------|-------------------|-------------|
-| **Best FastText** | 0.7019 | 0.7368 | 1.0000 |
-| Meta cc.tt.300 | 0.2000 | - | - |
-| **Improvement** | **3.5×** | **Significant** | **Perfect** |
-## 🎮 Quick Start
-### Try These Examples:
-#### Word Similarity
-```python
-# Find words similar to "мәктәп" (school)
-similar_words = model.most_similar('мәктәп', topn=10)
-```
-#### Word Analogies
-```python
-# Doctor - man + woman = ?
-analogy = model.most_similar(
-    positive=['табиб', 'хатын'],  # doctor, woman
-    negative=['ир']               # man
-)
-```
-#### OOV Testing (FastText Only)
-```python
-# Handle unknown words
-vector = model['технологияләштерү']  # technology-related word
-```
-## 🏗️ Technical Details
-### Training Corpus
-- **Total Tokens**: 203.2 million
-- **Vocabulary Size**: 637.7K words
-- **Unique Words**: 1.8 million
-- **Domains**: Wikipedia, news, books, social media
-### Model Architecture
-- **FastText**: Subword information support
-- **Word2Vec**: Classical word embeddings
-- **Optimized**: Skip-gram architecture, 100 dimensions
-## 📚 Use Cases
-### 🎓 Education
-- Language learning applications
-- Educational content analysis
-- Academic research
-### 💼 Business
-- Content recommendation systems
-- Search engine enhancement
-- Customer feedback analysis
-### 🔬 Research
-- Linguistic studies
-- Cross-lingual comparisons
-- AI model development
-## 🛠️ Installation
-### Local Development
-```bash
-git clone https://huggingface.co/spaces/arabovs-ai-lab/tatar2vec-demo
-cd tatar2vec-demo
-pip install -r requirements.txt
-streamlit run app.py
-```
-### Docker Deployment
-```bash
-docker build -t tatar2vec-demo .
-docker run -p 7860:7860 tatar2vec-demo
-```
-## 🌐 API Access
-```python
-from huggingface_hub import snapshot_download
-from gensim.models import FastText
-# Download and load the best model
-model_dir = snapshot_download(repo_id="arabovs-ai-lab/Tatar2Vec")
-model = FastText.load(f"{model_dir}/fasttext/ft_dim100_win5_min5_ngram3-6_sg.epoch1/ft_dim100_win5_min5_ngram3-6_sg.epoch1.model")
-# Use the model
-similar_words = model.wv.most_similar('мәктәп')
-```
-## 📊 Evaluation Metrics
-Our models were evaluated on multiple dimensions:
-- **Semantic Similarity**: Human-judged word pairs
-- **Analogy Accuracy**: Word relationship tasks
-- **OOV Handling**: Unknown word processing
-- **Neighbor Coherence**: Semantic consistency
-## 🤝 Contributing
-We welcome contributions from the community! Areas of interest:
-- Additional evaluation benchmarks
-- New model architectures
-- Expanded training data
-- Multilingual applications
-## 📜 Citation
-If you use Tatar2Vec in your research, please cite:
-```bibtex
-@misc{tatar2vec2025,
-  title = {Tatar2Vec: High-Quality Tatar Word Embeddings},
-  author = {Arabovs AI Lab},
-  year = {2025},
-  publisher = {Hugging Face},
-  url = {https://huggingface.co/arabovs-ai-lab/Tatar2Vec},
-  note = {Version 1.0}
-}
-```
-## 📄 License
-This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
-## 🙏 Acknowledgments
-- Tatar language speakers and contributors
-- Hugging Face for platform support
-- Open-source community for tools and libraries
----
-<div align="center">
-**Empowering Tatar Language Technology**
-*Brought to you by [Arabovs AI Lab](https://huggingface.co/arabovs-ai-lab)*
-[Report Issues](https://github.com/arabovs-ai-lab/tatar2vec/issues) •
-[Request Features](https://github.com/arabovs-ai-lab/tatar2vec/discussions) •
-[Contact Team](mailto:contact@arabovs-ai-lab.com)
-</div>

+---
+title: Tatar2Vec Explorer
+emoji: 🏆
+colorFrom: indigo
+colorTo: purple
+sdk: docker
+pinned: true
+app_file: app.py
+---
+# 🏆 Tatar2Vec Explorer
+<div align="center">
+**Discover the Power of Tatar Language AI**
+*High-quality word embeddings for the Tatar language*
+[![Hugging Face](https://img.shields.io/badge/🤗-Hugging%20Face%20Space-blue)](https://huggingface.co/arabovs-ai-lab/Tatar2Vec)
+[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
+[![Streamlit](https://img.shields.io/badge/Interface-Streamlit-FF4B4B)](https://streamlit.io)
+</div>
+## 🌟 Overview
+Tatar2Vec provides FastText and Word2Vec word embeddings for the Tatar language. On the project's composite evaluation score, the best FastText model reaches 0.7019, compared with 0.2000 for Meta's cc.tt.300 vectors. This interactive demo lets you explore the models through similarity search, word analogies, and out-of-vocabulary tests.
+## 🚀 Features
+### 🔍 Semantic Search
+- **Word Similarity**: Find semantically similar words
+- **Vector Operations**: Perform complex word analogies
+- **Interactive Visualizations**: Explore results with beautiful charts and word clouds
+### 🧠 Advanced Analytics
+- **Model Comparison**: Compare FastText vs Word2Vec performance
+- **OOV Handling**: Test out-of-vocabulary word capabilities
+- **Performance Metrics**: Detailed model evaluation scores
+### 🎯 Model Variants
+- **🥇 Best FastText**: `ft_dim100_win5_min5_ngram3-6_sg.epoch1` (Composite: 0.7019)
+- **🥈 Alternative FastText**: `ft_dim100_win5_min5_ngram3-6_sg.epoch3`
+- **🥇 Best Word2Vec**: `w2v_dim200_win5_min5_sg.epoch4`
+- **🥈 Compact Word2Vec**: `w2v_dim100_win5_min5_sg`
+## 📊 Performance Highlights
+| Model | Composite Score | Semantic Similarity | OOV Handling |
+|-------|----------------|-------------------|-------------|
+| **Best FastText** | 0.7019 | 0.7368 | 1.0000 |
+| Meta cc.tt.300 | 0.2000 | - | - |
+| **Improvement** | **3.5×** | **Significant** | **Perfect** |
+## 🎮 Quick Start
+### Try These Examples:
+#### Word Similarity
+```python
+# Find words similar to "мәктәп" (school)
+similar_words = model.wv.most_similar('мәктәп', topn=10)
+```
+#### Word Analogies
+```python
+# Doctor - man + woman = ?
+analogy = model.wv.most_similar(
+    positive=['табиб', 'хатын'],  # doctor, woman
+    negative=['ир']               # man
+)
+```
+#### OOV Testing (FastText Only)
+```python
+# Handle unknown words
+vector = model.wv['технологияләштерү']  # technology-related word
+```
+## 🏗️ Technical Details
+### Training Corpus
+- **Total Tokens**: 203.2 million
+- **Vocabulary Size**: 637.7K words
+- **Unique Words**: 1.8 million
+- **Domains**: Wikipedia, news, books, social media
+### Model Architecture
+- **FastText**: Subword information support
+- **Word2Vec**: Classical word embeddings
+- **Optimized**: Skip-gram architecture; 100 dimensions (200 for the best Word2Vec model)
+## 📚 Use Cases
+### 🎓 Education
+- Language learning applications
+- Educational content analysis
+- Academic research
+### 💼 Business
+- Content recommendation systems
+- Search engine enhancement
+- Customer feedback analysis
+### 🔬 Research
+- Linguistic studies
+- Cross-lingual comparisons
+- AI model development
+## 🛠️ Installation
+### Local Development
+```bash
+git clone https://huggingface.co/spaces/arabovs-ai-lab/tatar2vec-demo
+cd tatar2vec-demo
+pip install -r requirements.txt
+streamlit run app.py
+```
+### Docker Deployment
+```bash
+docker build -t tatar2vec-demo .
+docker run -p 7860:7860 tatar2vec-demo
+```
+## 🌐 API Access
+```python
+from huggingface_hub import snapshot_download
+from gensim.models import FastText
+# Download and load the best model
+model_dir = snapshot_download(repo_id="arabovs-ai-lab/Tatar2Vec")
+model = FastText.load(f"{model_dir}/fasttext/ft_dim100_win5_min5_ngram3-6_sg.epoch1/ft_dim100_win5_min5_ngram3-6_sg.epoch1.model")
+# Use the model
+similar_words = model.wv.most_similar('мәктәп')
+```
+## 📊 Evaluation Metrics
+Our models were evaluated on multiple dimensions:
+- **Semantic Similarity**: Human-judged word pairs
+- **Analogy Accuracy**: Word relationship tasks
+- **OOV Handling**: Unknown word processing
+- **Neighbor Coherence**: Semantic consistency
+## 🤝 Contributing
+We welcome contributions from the community! Areas of interest:
+- Additional evaluation benchmarks
+- New model architectures
+- Expanded training data
+- Multilingual applications
+## 📜 Citation
+If you use Tatar2Vec in your research, please cite:
+```bibtex
+@misc{tatar2vec2025,
+  title = {Tatar2Vec: High-Quality Tatar Word Embeddings},
+  author = {Arabovs AI Lab},
+  year = {2025},
+  publisher = {Hugging Face},
+  url = {https://huggingface.co/arabovs-ai-lab/Tatar2Vec},
+  note = {Version 1.0}
+}
+```
+## 📄 License
+This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+## 🙏 Acknowledgments
+- Tatar language speakers and contributors
+- Hugging Face for platform support
+- Open-source community for tools and libraries
+---
+<div align="center">
+**Empowering Tatar Language Technology**
+*Brought to you by [Arabovs AI Lab](https://huggingface.co/arabovs-ai-lab)*
+[Report Issues](https://github.com/arabovs-ai-lab/tatar2vec/issues) •
+[Request Features](https://github.com/arabovs-ai-lab/tatar2vec/discussions) •
+[Contact Team](mailto:contact@arabovs-ai-lab.com)
+</div>
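
The README's OOV Handling feature relies on FastText's subword information, and the `ngram3-6` part of the model names suggests character n-grams of length 3 to 6. As a rough sketch of that idea (not the project's or gensim's actual code; `char_ngrams` is a hypothetical helper, and `мәктәпләр` is only an illustrative inflected form), the following shows how a word is broken into boundary-marked n-grams and why an unseen form still overlaps with a known word:

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, FastText style.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    get n-grams that differ from word-internal ones.
    """
    wrapped = f"<{word}>"
    ngrams = []
    # Short words cannot produce n-grams longer than the wrapped word.
    for n in range(min_n, min(max_n, len(wrapped)) + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


known = set(char_ngrams("мәктәп"))      # "school", from the README example
unseen = set(char_ngrams("мәктәпләр"))  # illustrative inflected form (assumption)
shared = known & unseen                 # n-grams both words have in common

print(sorted(shared))
```

A FastText model builds the vector for an out-of-vocabulary word from the vectors of n-grams like these, which is why the README marks the OOV example as FastText only.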
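
The Word Analogies example in the README (doctor - man + woman) can be read as vector arithmetic followed by a cosine-similarity ranking. Below is a minimal pure-Python sketch of that ranking. It assumes made-up 2-D vectors keyed by English glosses rather than real Tatar2Vec embeddings, and `unit`, `cosine`, and `analogy` are hypothetical helpers, not gensim functions:

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        for i, x in enumerate(unit(vectors[word])):
            query[i] += x
    for word in negative:
        for i, x in enumerate(unit(vectors[word])):
            query[i] -= x
    # The query words themselves are excluded from the candidates.
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy vectors (assumption): axis 0 is "medical", axis 1 is "female".
toy = {
    "doctor": [1.0, 0.0],
    "man": [0.0, -1.0],
    "woman": [0.0, 1.0],
    "female_doctor": [1.0, 1.0],
    "school": [-1.0, 0.0],
}

print(analogy(toy, positive=["doctor", "woman"], negative=["man"]))
```

With real embeddings the same ranking runs over the whole vocabulary, so the quality of the result depends on the trained vectors rather than on the arithmetic itself.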