ArabovMK committed on
Commit 09e1de9 · verified · 1 Parent(s): a0e97e3

Update README.md

Files changed (1)
  1. README.md +193 -192
README.md CHANGED
@@ -1,192 +1,193 @@
1
- title: Tatar2Vec Explorer
2
- emoji: 🏆
3
- colorFrom: indigo
4
- colorTo: purple
5
- sdk: docker
6
- pinned: true
7
- app_file: app.py
8
- ---
9
-
10
- # 🏆 Tatar2Vec Explorer
11
-
12
- <div align="center">
13
-
14
- **Discover the Power of Tatar Language AI**
15
-
16
- *High-quality word embeddings for the Tatar language*
17
-
18
- [![Hugging Face](https://img.shields.io/badge/🤗-Hugging%20Face%20Space-blue)](https://huggingface.co/arabovs-ai-lab/Tatar2Vec)
19
- [![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
20
- [![Streamlit](https://img.shields.io/badge/Interface-Streamlit-FF4B4B)](https://streamlit.io)
21
-
22
- </div>
23
-
24
- ## 🌟 Overview
25
-
26
- Tatar2Vec represents a breakthrough in natural language processing for the Tatar language, offering state-of-the-art word embeddings that significantly outperform existing solutions. This interactive demo allows you to explore the semantic richness of Tatar through cutting-edge AI models.
27
-
28
- ## 🚀 Features
29
-
30
- ### 🔍 Semantic Search
31
- - **Word Similarity**: Find semantically similar words
32
- - **Vector Operations**: Perform complex word analogies
33
- - **Interactive Visualizations**: Explore results with beautiful charts and word clouds
34
-
35
- ### 🧠 Advanced Analytics
36
- - **Model Comparison**: Compare FastText vs Word2Vec performance
37
- - **OOV Handling**: Test out-of-vocabulary word capabilities
38
- - **Performance Metrics**: Detailed model evaluation scores
39
-
40
- ### 🎯 Model Variants
41
- - **🥇 Best FastText**: `ft_dim100_win5_min5_ngram3-6_sg.epoch1` (Composite: 0.7019)
42
- - **🥈 Alternative FastText**: `ft_dim100_win5_min5_ngram3-6_sg.epoch3`
43
- - **🥇 Best Word2Vec**: `w2v_dim200_win5_min5_sg.epoch4`
44
- - **🥈 Compact Word2Vec**: `w2v_dim100_win5_min5_sg`
45
-
46
- ## 📊 Performance Highlights
47
-
48
- | Model | Composite Score | Semantic Similarity | OOV Handling |
49
- |-------|----------------|-------------------|-------------|
50
- | **Best FastText** | 0.7019 | 0.7368 | 1.0000 |
51
- | Meta cc.tt.300 | 0.2000 | - | - |
52
- | **Improvement** | **3.5×** | **Significant** | **Perfect** |
53
-
54
- ## 🎮 Quick Start
55
-
56
- ### Try These Examples:
57
-
58
- #### Word Similarity
59
- ```python
60
- # Find words similar to "мәктәп" (school)
61
- similar_words = model.most_similar('мәктәп', topn=10)
62
- ```
63
-
64
- #### Word Analogies
65
- ```python
66
- # Doctor - man + woman = ?
67
- analogy = model.most_similar(
68
-     positive=['табиб', 'хатын'],  # doctor, woman
69
-     negative=['ир']  # man
70
- )
71
- ```
72
-
73
- #### OOV Testing (FastText Only)
74
- ```python
75
- # Handle unknown words
76
- vector = model['технологияләштерү']  # technology-related word
77
- ```
78
-
79
- ## 🏗️ Technical Details
80
-
81
- ### Training Corpus
82
- - **Total Tokens**: 203.2 million
83
- - **Vocabulary Size**: 637.7K words
84
- - **Unique Words**: 1.8 million
85
- - **Domains**: Wikipedia, news, books, social media
86
-
87
- ### Model Architecture
88
- - **FastText**: Subword information support
89
- - **Word2Vec**: Classical word embeddings
90
- - **Optimized**: Skip-gram architecture, 100 dimensions
91
-
92
- ## 📚 Use Cases
93
-
94
- ### 🎓 Education
95
- - Language learning applications
96
- - Educational content analysis
97
- - Academic research
98
-
99
- ### 💼 Business
100
- - Content recommendation systems
101
- - Search engine enhancement
102
- - Customer feedback analysis
103
-
104
- ### 🔬 Research
105
- - Linguistic studies
106
- - Cross-lingual comparisons
107
- - AI model development
108
-
109
- ## 🛠️ Installation
110
-
111
- ### Local Development
112
- ```bash
113
- git clone https://huggingface.co/spaces/arabovs-ai-lab/tatar2vec-demo
114
- cd tatar2vec-demo
115
- pip install -r requirements.txt
116
- streamlit run app.py
117
- ```
118
-
119
- ### Docker Deployment
120
- ```bash
121
- docker build -t tatar2vec-demo .
122
- docker run -p 7860:7860 tatar2vec-demo
123
- ```
124
-
125
- ## 🌐 API Access
126
-
127
- ```python
128
- from huggingface_hub import snapshot_download
129
- from gensim.models import FastText
130
-
131
- # Download and load the best model
132
- model_dir = snapshot_download(repo_id="arabovs-ai-lab/Tatar2Vec")
133
- model = FastText.load(f"{model_dir}/fasttext/ft_dim100_win5_min5_ngram3-6_sg.epoch1/ft_dim100_win5_min5_ngram3-6_sg.epoch1.model")
134
-
135
- # Use the model
136
- similar_words = model.wv.most_similar('мәктәп')
137
- ```
138
-
139
- ## 📊 Evaluation Metrics
140
-
141
- Our models were evaluated on multiple dimensions:
142
- - **Semantic Similarity**: Human-judged word pairs
143
- - **Analogy Accuracy**: Word relationship tasks
144
- - **OOV Handling**: Unknown word processing
145
- - **Neighbor Coherence**: Semantic consistency
146
-
147
- ## 🤝 Contributing
148
-
149
- We welcome contributions from the community! Areas of interest:
150
- - Additional evaluation benchmarks
151
- - New model architectures
152
- - Expanded training data
153
- - Multilingual applications
154
-
155
- ## 📜 Citation
156
-
157
- If you use Tatar2Vec in your research, please cite:
158
-
159
- ```bibtex
160
- @misc{tatar2vec2025,
161
- title = {Tatar2Vec: High-Quality Tatar Word Embeddings},
162
- author = {Arabovs AI Lab},
163
- year = {2025},
164
- publisher = {Hugging Face},
165
- url = {https://huggingface.co/arabovs-ai-lab/Tatar2Vec},
166
- note = {Version 1.0}
167
- }
168
- ```
169
-
170
- ## 📄 License
171
-
172
- This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
173
-
174
- ## 🙏 Acknowledgments
175
-
176
- - Tatar language speakers and contributors
177
- - Hugging Face for platform support
178
- - Open-source community for tools and libraries
179
-
180
- ---
181
-
182
- <div align="center">
183
-
184
- **Empowering Tatar Language Technology**
185
-
186
- *Brought to you by [Arabovs AI Lab](https://huggingface.co/arabovs-ai-lab)*
187
-
188
- [Report Issues](https://github.com/arabovs-ai-lab/tatar2vec/issues) •
189
- [Request Features](https://github.com/arabovs-ai-lab/tatar2vec/discussions) •
190
- [Contact Team](mailto:contact@arabovs-ai-lab.com)
191
-
192
- </div>
 
 
1
+ ---
2
+ title: Tatar2Vec Explorer
3
+ emoji: 🏆
4
+ colorFrom: indigo
5
+ colorTo: purple
6
+ sdk: docker
7
+ pinned: true
8
+ app_file: app.py
9
+ ---
10
+
11
+ # 🏆 Tatar2Vec Explorer
12
+
13
+ <div align="center">
14
+
15
+ **Discover the Power of Tatar Language AI**
16
+
17
+ *High-quality word embeddings for the Tatar language*
18
+
19
+ [![Hugging Face](https://img.shields.io/badge/🤗-Hugging%20Face%20Space-blue)](https://huggingface.co/arabovs-ai-lab/Tatar2Vec)
20
+ [![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
21
+ [![Streamlit](https://img.shields.io/badge/Interface-Streamlit-FF4B4B)](https://streamlit.io)
22
+
23
+ </div>
24
+
25
+ ## 🌟 Overview
26
+
27
+ Tatar2Vec represents a breakthrough in natural language processing for the Tatar language, offering state-of-the-art word embeddings that significantly outperform existing solutions. This interactive demo allows you to explore the semantic richness of Tatar through cutting-edge AI models.
28
+
29
+ ## 🚀 Features
30
+
31
+ ### 🔍 Semantic Search
32
+ - **Word Similarity**: Find semantically similar words
33
+ - **Vector Operations**: Perform complex word analogies
34
+ - **Interactive Visualizations**: Explore results with beautiful charts and word clouds
35
+
36
+ ### 🧠 Advanced Analytics
37
+ - **Model Comparison**: Compare FastText vs Word2Vec performance
38
+ - **OOV Handling**: Test out-of-vocabulary word capabilities
39
+ - **Performance Metrics**: Detailed model evaluation scores
40
+
41
+ ### 🎯 Model Variants
42
+ - **🥇 Best FastText**: `ft_dim100_win5_min5_ngram3-6_sg.epoch1` (Composite: 0.7019)
43
+ - **🥈 Alternative FastText**: `ft_dim100_win5_min5_ngram3-6_sg.epoch3`
44
+ - **🥇 Best Word2Vec**: `w2v_dim200_win5_min5_sg.epoch4`
45
+ - **🥈 Compact Word2Vec**: `w2v_dim100_win5_min5_sg`
46
+
47
+ ## 📊 Performance Highlights
48
+
49
+ | Model | Composite Score | Semantic Similarity | OOV Handling |
50
+ |-------|----------------|-------------------|-------------|
51
+ | **Best FastText** | 0.7019 | 0.7368 | 1.0000 |
52
+ | Meta cc.tt.300 | 0.2000 | - | - |
53
+ | **Improvement** | **3.5×** | **Significant** | **Perfect** |
54
+
55
+ ## 🎮 Quick Start
56
+
57
+ ### Try These Examples:
58
+
59
+ #### Word Similarity
60
+ ```python
61
+ # Find words similar to "мәктәп" (school)
62
+ similar_words = model.most_similar('мәктәп', topn=10)
63
+ ```
64
+
65
+ #### Word Analogies
66
+ ```python
67
+ # Doctor - man + woman = ?
68
+ analogy = model.most_similar(
69
+     positive=['табиб', 'хатын'],  # doctor, woman
70
+     negative=['ир']  # man
71
+ )
72
+ ```
73
+
74
+ #### OOV Testing (FastText Only)
75
+ ```python
76
+ # Handle unknown words
77
+ vector = model['технологияләштерү']  # technology-related word
78
+ ```
79
+
80
+ ## 🏗️ Technical Details
81
+
82
+ ### Training Corpus
83
+ - **Total Tokens**: 203.2 million
84
+ - **Vocabulary Size**: 637.7K words
85
+ - **Unique Words**: 1.8 million
86
+ - **Domains**: Wikipedia, news, books, social media
87
+
88
+ ### Model Architecture
89
+ - **FastText**: Subword information support
90
+ - **Word2Vec**: Classical word embeddings
91
+ - **Optimized**: Skip-gram architecture, 100 dimensions
92
+
93
+ ## 📚 Use Cases
94
+
95
+ ### 🎓 Education
96
+ - Language learning applications
97
+ - Educational content analysis
98
+ - Academic research
99
+
100
+ ### 💼 Business
101
+ - Content recommendation systems
102
+ - Search engine enhancement
103
+ - Customer feedback analysis
104
+
105
+ ### 🔬 Research
106
+ - Linguistic studies
107
+ - Cross-lingual comparisons
108
+ - AI model development
109
+
110
+ ## 🛠️ Installation
111
+
112
+ ### Local Development
113
+ ```bash
114
+ git clone https://huggingface.co/spaces/arabovs-ai-lab/tatar2vec-demo
115
+ cd tatar2vec-demo
116
+ pip install -r requirements.txt
117
+ streamlit run app.py
118
+ ```
119
+
120
+ ### Docker Deployment
121
+ ```bash
122
+ docker build -t tatar2vec-demo .
123
+ docker run -p 7860:7860 tatar2vec-demo
124
+ ```
125
+
126
+ ## 🌐 API Access
127
+
128
+ ```python
129
+ from huggingface_hub import snapshot_download
130
+ from gensim.models import FastText
131
+
132
+ # Download and load the best model
133
+ model_dir = snapshot_download(repo_id="arabovs-ai-lab/Tatar2Vec")
134
+ model = FastText.load(f"{model_dir}/fasttext/ft_dim100_win5_min5_ngram3-6_sg.epoch1/ft_dim100_win5_min5_ngram3-6_sg.epoch1.model")
135
+
136
+ # Use the model
137
+ similar_words = model.wv.most_similar('мәктәп')
138
+ ```
139
+
140
+ ## 📊 Evaluation Metrics
141
+
142
+ Our models were evaluated on multiple dimensions:
143
+ - **Semantic Similarity**: Human-judged word pairs
144
+ - **Analogy Accuracy**: Word relationship tasks
145
+ - **OOV Handling**: Unknown word processing
146
+ - **Neighbor Coherence**: Semantic consistency
147
+
148
+ ## 🤝 Contributing
149
+
150
+ We welcome contributions from the community! Areas of interest:
151
+ - Additional evaluation benchmarks
152
+ - New model architectures
153
+ - Expanded training data
154
+ - Multilingual applications
155
+
156
+ ## 📜 Citation
157
+
158
+ If you use Tatar2Vec in your research, please cite:
159
+
160
+ ```bibtex
161
+ @misc{tatar2vec2025,
162
+ title = {Tatar2Vec: High-Quality Tatar Word Embeddings},
163
+ author = {Arabovs AI Lab},
164
+ year = {2025},
165
+ publisher = {Hugging Face},
166
+ url = {https://huggingface.co/arabovs-ai-lab/Tatar2Vec},
167
+ note = {Version 1.0}
168
+ }
169
+ ```
170
+
171
+ ## 📄 License
172
+
173
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
174
+
175
+ ## 🙏 Acknowledgments
176
+
177
+ - Tatar language speakers and contributors
178
+ - Hugging Face for platform support
179
+ - Open-source community for tools and libraries
180
+
181
+ ---
182
+
183
+ <div align="center">
184
+
185
+ **Empowering Tatar Language Technology**
186
+
187
+ *Brought to you by [Arabovs AI Lab](https://huggingface.co/arabovs-ai-lab)*
188
+
189
+ [Report Issues](https://github.com/arabovs-ai-lab/tatar2vec/issues) •
190
+ [Request Features](https://github.com/arabovs-ai-lab/tatar2vec/discussions) •
191
+ [Contact Team](mailto:contact@arabovs-ai-lab.com)
192
+
193
+ </div>
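
The FastText model names in the README above contain `ngram3-6`, and the README lists OOV handling as a FastText-only capability. As a minimal sketch of why a subword model can return a vector for a word it never saw, the snippet below lists the character n-grams of a word in the conventional fastText style: the word is wrapped in `<` and `>` boundary markers and every substring of length 3 to 6 is collected. Reading `ngram3-6` as the minimum and maximum n-gram length is an assumption, and `char_ngrams` is a hypothetical helper written for illustration, not a function from gensim or from this repository. Real implementations also hash the n-grams into a fixed number of buckets, which is left out here.

```python
from typing import List


def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> List[str]:
    """Return the character n-grams of `word`, including boundary markers.

    Hypothetical helper for illustration only; the defaults 3 and 6 are
    an assumed reading of `ngram3-6` in the model names.
    """
    wrapped = f"<{word}>"  # '<' and '>' mark the start and end of the word
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the wrapped word.
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


if __name__ == "__main__":
    grams = char_ngrams("мәктәп")  # "school", the example word from the README
    print(len(grams))
    print(grams[:6])
```

A subword model composes the vector of an unseen word from the vectors of its n-grams, so a new word that shares many n-grams with known words ends up close to them in the embedding space.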