
title: Tatar2Vec Explorer
emoji: 🏆
colorFrom: indigo
colorTo: purple
sdk: docker
pinned: true
app_file: app.py

🏆 Tatar2Vec Explorer

Discover the Power of Tatar Language AI

High-quality word embeddings for the Tatar language

🌟 Overview

Tatar2Vec provides word embeddings for the Tatar language, trained with FastText and Word2Vec. On the project's composite evaluation score, the best FastText model reaches 0.7019, compared with 0.2000 for Meta's cc.tt.300 vectors. This interactive demo lets you explore word similarity, word analogies, and out-of-vocabulary handling with these models.

🚀 Features

🔍 Semantic Search

Word Similarity: Find semantically similar words
Vector Operations: Perform complex word analogies
Interactive Visualizations: Explore results with beautiful charts and word clouds

🧠 Advanced Analytics

Model Comparison: Compare FastText vs Word2Vec performance
OOV Handling: Test out-of-vocabulary word capabilities
Performance Metrics: Detailed model evaluation scores

🎯 Model Variants

🥇 Best FastText: ft_dim100_win5_min5_ngram3-6_sg.epoch1 (Composite: 0.7019)
🥈 Alternative FastText: ft_dim100_win5_min5_ngram3-6_sg.epoch3
🥇 Best Word2Vec: w2v_dim200_win5_min5_sg.epoch4
🥈 Compact Word2Vec: w2v_dim100_win5_min5_sg
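
The model names above appear to encode training hyperparameters. The sketch below decodes them under an assumed reading: `dim` as vector size, `win` as context window, `min` as minimum word count, `ngram3-6` as the character n-gram range, `sg` as skip-gram, and `epoch` as the saved checkpoint. This mapping is inferred from the names and is not confirmed by the project; `parse_model_name` is a hypothetical helper, and the dictionary keys merely mirror gensim parameter names for readability.

```python
import re

# Assumed naming scheme, inferred from the model names listed above.
_NAME_RE = re.compile(
    r"^(?P<arch>ft|w2v)"
    r"_dim(?P<dim>\d+)"
    r"_win(?P<win>\d+)"
    r"_min(?P<min>\d+)"
    r"(?:_ngram(?P<min_n>\d+)-(?P<max_n>\d+))?"  # present only for FastText names
    r"_sg"
    r"(?:\.epoch(?P<epoch>\d+))?$"               # checkpoint suffix is optional
)


def parse_model_name(name):
    """Decode a Tatar2Vec model name into a dict of (assumed) hyperparameters."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized model name: {name!r}")
    g = match.groupdict()
    params = {
        "arch": "fasttext" if g["arch"] == "ft" else "word2vec",
        "vector_size": int(g["dim"]),
        "window": int(g["win"]),
        "min_count": int(g["min"]),
        "sg": 1,  # every listed model name carries the "sg" tag
    }
    if g["min_n"] is not None:
        params["min_n"] = int(g["min_n"])
        params["max_n"] = int(g["max_n"])
    if g["epoch"] is not None:
        params["epoch"] = int(g["epoch"])
    return params


best_ft = parse_model_name("ft_dim100_win5_min5_ngram3-6_sg.epoch1")
compact_w2v = parse_model_name("w2v_dim100_win5_min5_sg")
```

Under this reading, `best_ft["max_n"]` is 6 and `compact_w2v` has no `epoch` key, because its name has no checkpoint suffix.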

📊 Performance Highlights

Model	Composite Score	Semantic Similarity	OOV Handling
Best FastText	0.7019	0.7368	1.0000
Meta cc.tt.300	0.2000	-	-
Improvement	3.5×	Significant	Perfect

🎮 Quick Start

Try These Examples:

Word Similarity

# `model` is a gensim model loaded as shown under "API Access"
# Find the 10 words most similar to "мәктәп" (school)
similar_words = model.wv.most_similar('мәктәп', topn=10)

Word Analogies

# табиб (doctor) - ир (man) + хатын (woman) = ?
analogy = model.wv.most_similar(
    positive=['табиб', 'хатын'],  # doctor, woman
    negative=['ир']               # man
)
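
To make the arithmetic behind an analogy query concrete, here is a minimal pure-Python sketch of roughly what such a query computes: add the unit vectors of the positive words, subtract those of the negative words, and rank the remaining words by cosine similarity to the result. The vectors and English words are toy values invented for illustration, not taken from Tatar2Vec, and `analogy` is a hypothetical helper rather than the library's implementation.

```python
import math


def unit(vec):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        for i, x in enumerate(unit(vectors[word])):
            query[i] += x
    for word in negative:
        for i, x in enumerate(unit(vectors[word])):
            query[i] -= x
    query = unit(query)

    exclude = set(positive) | set(negative)  # query words are not candidates
    scores = [
        (word, sum(a * b for a, b in zip(unit(vec), query)))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy 3-D vectors with invented axes: [royal, male, female]
toy_vectors = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.1, 0.1, 0.1],
}

result = analogy(toy_vectors, positive=["king", "woman"], negative=["man"])
```

With these toy values, `result[0][0]` is `"queen"`. Real embeddings are far noisier, so the top neighbor of an analogy query is not guaranteed to be the expected word.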

OOV Testing (FastText Only)

# FastText builds a vector for an unseen word from its character n-grams
vector = model.wv['технологияләштерү']  # technology-related word
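
FastText can return a vector for an unseen word because it represents each word through its character n-grams. The sketch below lists those n-grams for a word, assuming the `ngram3-6` tag in the model names means n-gram lengths 3 to 6 and using the `<` and `>` word-boundary markers described for FastText. `char_ngrams` is a hypothetical helper for illustration; the real library additionally hashes n-grams into buckets, which is omitted here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


# Trigrams only, to keep the example short
english_trigrams = char_ngrams("where", min_n=3, max_n=3)
# ['<wh', 'whe', 'her', 'ere', 're>']

tatar_trigrams = char_ngrams("мәктәп", min_n=3, max_n=3)
# ['<мә', 'мәк', 'әкт', 'ктә', 'тәп', 'әп>']
```

An out-of-vocabulary word shares many of these n-grams with words seen in training, which is what allows a vector to be composed for it.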

🏗️ Technical Details

Training Corpus

Total Tokens: 203.2 million
Vocabulary Size: 637.7K words
Unique Words: 1.8 million
Domains: Wikipedia, news, books, social media

Model Architecture

FastText: Subword information support
Word2Vec: Classical word embeddings
Optimized: Skip-gram architecture; 100 dimensions for the FastText and compact Word2Vec variants, 200 for the best Word2Vec variant

📚 Use Cases

🎓 Education

Language learning applications
Educational content analysis
Academic research

💼 Business

Content recommendation systems
Search engine enhancement
Customer feedback analysis

🔬 Research

Linguistic studies
Cross-lingual comparisons
AI model development

🛠️ Installation

Local Development

git clone https://huggingface.co/spaces/arabovs-ai-lab/tatar2vec-demo
cd tatar2vec-demo
pip install -r requirements.txt
streamlit run app.py

Docker Deployment

docker build -t tatar2vec-demo .
docker run -p 7860:7860 tatar2vec-demo

🌐 API Access

from pathlib import Path

from huggingface_hub import snapshot_download
from gensim.models import FastText

# Download the model repository and load the best FastText model
model_name = "ft_dim100_win5_min5_ngram3-6_sg.epoch1"
model_dir = Path(snapshot_download(repo_id="arabovs-ai-lab/Tatar2Vec"))
model = FastText.load(str(model_dir / "fasttext" / model_name / f"{model_name}.model"))

# Query the word vectors (model.wv) for the nearest neighbors of "мәктәп" (school)
similar_words = model.wv.most_similar('мәктәп')

📊 Evaluation Metrics

Our models were evaluated on multiple dimensions:

Semantic Similarity: Human-judged word pairs
Analogy Accuracy: Word relationship tasks
OOV Handling: Unknown word processing
Neighbor Coherence: Semantic consistency
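
This README does not state how the semantic similarity score is computed. A common choice for human-judged word pairs is the Spearman rank correlation between human ratings and model cosine similarities; the sketch below shows that approach as an assumption, not as the project's actual procedure. The ratings and similarities are toy numbers invented for illustration.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: one human rating and one model cosine similarity per word pair
human_ratings = [9.0, 7.5, 3.0, 1.0]
model_similarities = [0.82, 0.64, 0.41, 0.05]

score = spearman(human_ratings, model_similarities)
```

Because both toy lists order the four pairs identically, `score` is 1.0 up to floating-point rounding; disagreement in ordering would lower it.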

🤝 Contributing

We welcome contributions from the community! Areas of interest:

Additional evaluation benchmarks
New model architectures
Expanded training data
Multilingual applications

📜 Citation

If you use Tatar2Vec in your research, please cite:

@misc{tatar2vec2025,
  title = {Tatar2Vec: High-Quality Tatar Word Embeddings},
  author = {Arabovs AI Lab},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/arabovs-ai-lab/Tatar2Vec},
  note = {Version 1.0}
}

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Tatar language speakers and contributors
Hugging Face for platform support
Open-source community for tools and libraries

Empowering Tatar Language Technology

Brought to you by Arabovs AI Lab

Report Issues • Request Features • Contact Team