title: Tatar2Vec Explorer
emoji: 🏆
colorFrom: indigo
colorTo: purple
sdk: docker
pinned: true
app_file: app.py

🏆 Tatar2Vec Explorer

Discover the Power of Tatar Language AI

High-quality word embeddings for the Tatar language


🌟 Overview

Tatar2Vec provides word embeddings for the Tatar language, trained with FastText and Word2Vec. On the project's composite evaluation score, the best FastText model reaches 0.7019, compared with 0.2000 for Meta's cc.tt.300 vectors (roughly a 3.5× difference). This interactive demo lets you explore semantic relationships between Tatar words using these models.

🚀 Features

🔍 Semantic Search

  • Word Similarity: Find semantically similar words
  • Vector Operations: Perform complex word analogies
  • Interactive Visualizations: Explore results with beautiful charts and word clouds

🧠 Advanced Analytics

  • Model Comparison: Compare FastText vs Word2Vec performance
  • OOV Handling: Test out-of-vocabulary word capabilities
  • Performance Metrics: Detailed model evaluation scores

🎯 Model Variants

  • 🥇 Best FastText: ft_dim100_win5_min5_ngram3-6_sg.epoch1 (Composite: 0.7019)
  • 🥈 Alternative FastText: ft_dim100_win5_min5_ngram3-6_sg.epoch3
  • 🥇 Best Word2Vec: w2v_dim200_win5_min5_sg.epoch4
  • 🥈 Compact Word2Vec: w2v_dim100_win5_min5_sg
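The variant names appear to encode training hyperparameters. The sketch below decodes them under that assumption: `dim` as vector dimensionality, `win` as context window size, `min` as minimum word count, `ngram3-6` as the character n-gram range, `sg` as skip-gram, and `.epochN` as a checkpoint epoch. The helper `parse_variant_name` is hypothetical and not part of the Tatar2Vec code.

```python
import re

# Hypothetical helper. Assumes the naming scheme
# <family>_dim<D>_win<W>_min<M>[_ngram<a>-<b>]_sg[.epoch<E>],
# inferred from the four variant names listed above.
PATTERN = re.compile(
    r"^(?P<family>ft|w2v)"
    r"_dim(?P<dim>\d+)"
    r"_win(?P<window>\d+)"
    r"_min(?P<min_count>\d+)"
    r"(?:_ngram(?P<min_n>\d+)-(?P<max_n>\d+))?"
    r"_(?P<algo>sg)"
    r"(?:\.epoch(?P<epoch>\d+))?$"
)


def parse_variant_name(name):
    """Split a variant name into a dict of (assumed) hyperparameters."""
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized variant name: {name!r}")
    fields = match.groupdict()
    # Numeric fields become ints; optional fields stay None when absent.
    for key in ("dim", "window", "min_count", "min_n", "max_n", "epoch"):
        if fields[key] is not None:
            fields[key] = int(fields[key])
    return fields


for variant in (
    "ft_dim100_win5_min5_ngram3-6_sg.epoch1",
    "w2v_dim200_win5_min5_sg.epoch4",
    "w2v_dim100_win5_min5_sg",
):
    print(variant, parse_variant_name(variant))
```

Word2Vec names carry no `ngram` field, so `min_n` and `max_n` come back as `None` for them.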

📊 Performance Highlights

| Model | Composite Score | Semantic Similarity | OOV Handling |
|---|---|---|---|
| Best FastText | 0.7019 | 0.7368 | 1.0000 |
| Meta cc.tt.300 | 0.2000 | - | - |
| Improvement | 3.5× | Significant | Perfect |
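The 3.5× figure can be checked directly from the two composite scores in the table. A minimal sketch of that arithmetic (the variable names are illustrative):

```python
# Composite scores copied from the table above.
best_fasttext = 0.7019
meta_cc_tt_300 = 0.2000

# Ratio of the two composite scores; rounds to 3.5 at one decimal place.
ratio = best_fasttext / meta_cc_tt_300
print(f"{ratio:.2f}x")
```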

🎮 Quick Start

Try These Examples:

Word Similarity

# `model` is the model loaded as shown in the API Access section
# Find words similar to "мәктәп" (school)
similar_words = model.wv.most_similar('мәктәп', topn=10)
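A similarity query of this kind ranks vocabulary words by cosine similarity to the query word's vector. The sketch below shows that ranking in plain Python. The 3-dimensional vectors are made up for illustration and are not real Tatar2Vec embeddings, and the `most_similar` function here is a hypothetical stand-in, not the gensim method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors invented for this example only.
toy_vectors = {
    "мәктәп": [0.9, 0.1, 0.0],   # school
    "укытучы": [0.8, 0.3, 0.1],  # teacher
    "китап": [0.6, 0.5, 0.2],    # book
    "елга": [0.0, 0.2, 0.9],     # river
}


def most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scores = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


for neighbor, score in most_similar("мәктәп", toy_vectors):
    print(f"{neighbor}\t{score:.3f}")
```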

Word Analogies

# Doctor - man + woman = ?
analogy = model.wv.most_similar(
    positive=['табиб', 'хатын'],  # doctor, woman
    negative=['ир']               # man
)
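An analogy query like this is commonly computed by adding the unit-normalized "positive" vectors, subtracting the "negative" ones, and ranking the remaining words by cosine similarity to the result. The sketch below illustrates that idea with made-up 3-dimensional vectors. English glosses stand in for Tatar words, the `analogy` helper is hypothetical, and the output says nothing about what the real models return.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity, computed as the dot product of unit vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


# Toy vectors invented for this example; axes are roughly
# (male, female, medical).
toy = {
    "man": [1.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0],
    "doctor": [1.0, 0.0, 1.0],
    "female_doctor": [0.0, 1.0, 1.0],
    "teacher": [0.5, 0.5, 0.0],
}


def analogy(positive, negative, vectors, topn=1):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, normalize(vectors[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, normalize(vectors[word]))]
    # Query words are excluded from the candidates.
    inputs = set(positive) | set(negative)
    scored = [
        (word, cosine(target, vec))
        for word, vec in vectors.items()
        if word not in inputs
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


print(analogy(["doctor", "woman"], ["man"], toy))
```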

OOV Testing (FastText Only)

# Handle unknown words
vector = model.wv['технологияләштерү']  # technology-related word
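FastText can produce a vector for an unseen word because it represents words through character n-grams. The sketch below lists the n-grams of a word wrapped in `<` and `>` boundary markers, assuming the 3 to 6 range suggested by `ngram3-6` in the model names. The `char_ngrams` helper is hypothetical, and it covers only the n-gram extraction step, not the hashing or vector lookup that a real FastText model performs.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return all character n-grams of <word> with lengths min_n..max_n."""
    wrapped = f"<{word}>"  # boundary markers distinguish prefixes and suffixes
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


grams = char_ngrams("мәктәп")
print(len(grams), grams[:6])
```

An out-of-vocabulary word shares many of these n-grams with words seen in training, which is what lets the model compose a vector for it.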

🏗️ Technical Details

Training Corpus

  • Total Tokens: 203.2 million
  • Vocabulary Size: 637.7K words (likely the words kept after the minimum-count cutoff; the model names include min5)
  • Unique Words: 1.8 million
  • Domains: Wikipedia, news, books, social media

Model Architecture

  • FastText: Subword information support
  • Word2Vec: Classical word embeddings
  • Optimized: Skip-gram architecture; 100 dimensions for the best FastText model (the best Word2Vec variant is named with dim200)

📚 Use Cases

🎓 Education

  • Language learning applications
  • Educational content analysis
  • Academic research

💼 Business

  • Content recommendation systems
  • Search engine enhancement
  • Customer feedback analysis

🔬 Research

  • Linguistic studies
  • Cross-lingual comparisons
  • AI model development

🛠️ Installation

Local Development

git clone https://huggingface.co/spaces/arabovs-ai-lab/tatar2vec-demo
cd tatar2vec-demo
pip install -r requirements.txt
streamlit run app.py

Docker Deployment

docker build -t tatar2vec-demo .
docker run -p 7860:7860 tatar2vec-demo

🌐 API Access

from huggingface_hub import snapshot_download
from gensim.models import FastText

# Download and load the best model
model_dir = snapshot_download(repo_id="arabovs-ai-lab/Tatar2Vec")
model = FastText.load(f"{model_dir}/fasttext/ft_dim100_win5_min5_ngram3-6_sg.epoch1/ft_dim100_win5_min5_ngram3-6_sg.epoch1.model")

# Use the model
similar_words = model.wv.most_similar('мәктәп')

📊 Evaluation Metrics

Our models were evaluated on multiple dimensions:

  • Semantic Similarity: Human-judged word pairs
  • Analogy Accuracy: Word relationship tasks
  • OOV Handling: Unknown word processing
  • Neighbor Coherence: Semantic consistency
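Semantic similarity on human-judged word pairs is often scored with Spearman rank correlation between the human ratings and the model's cosine similarities. The source does not say which statistic Tatar2Vec uses, so the sketch below is an assumption, and every number in it is made up for illustration.

```python
import math


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores for four word pairs.
human_ratings = [9.0, 7.5, 3.0, 1.0]
model_cosines = [0.81, 0.85, 0.30, 0.05]

rho = spearman(human_ratings, model_cosines)
print(f"Spearman rho = {rho:.2f}")
```

Rank correlation only compares orderings, so it does not matter that human ratings and cosine similarities live on different scales.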

🤝 Contributing

We welcome contributions from the community! Areas of interest:

  • Additional evaluation benchmarks
  • New model architectures
  • Expanded training data
  • Multilingual applications

📜 Citation

If you use Tatar2Vec in your research, please cite:

@misc{tatar2vec2025,
  title = {Tatar2Vec: High-Quality Tatar Word Embeddings},
  author = {Arabovs AI Lab},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/arabovs-ai-lab/Tatar2Vec},
  note = {Version 1.0}
}

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Tatar language speakers and contributors
  • Hugging Face for platform support
  • Open-source community for tools and libraries

Empowering Tatar Language Technology

Brought to you by Arabovs AI Lab

Report Issues • Request Features • Contact Team