Update README with proper attribution
README.md
CHANGED
|
@@ -6,7 +6,32 @@ tags:
|
|
| 6 |
- language-identification
|
| 7 |
---
|
| 8 |
|
| 9 |
-
#
|
| 10 |
|
| 11 |
fastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. It works on standard, generic hardware. Models can later be reduced in size to even fit on mobile devices. It was introduced in [this paper](https://arxiv.org/abs/1607.04606). The official website can be found [here](https://fasttext.cc/).
|
| 12 |
|
|
@@ -30,6 +55,10 @@ Here is how to use this model to detect the language of a given text:
|
|
| 30 |
>>> import fasttext
|
| 31 |
>>> from huggingface_hub import hf_hub_download
|
| 32 |
|
| 33 |
>>> model_path = hf_hub_download(repo_id="facebook/fasttext-language-identification", filename="model.bin")
|
| 34 |
>>> model = fasttext.load_model(model_path)
|
| 35 |
>>> model.predict("Hello, world!")
|
|
@@ -38,13 +67,13 @@ Here is how to use this model to detect the language of a given text:
|
|
| 38 |
|
| 39 |
>>> model.predict("Hello, world!", k=5)
|
| 40 |
|
| 41 |
-
(('__label__eng_Latn', '__label__vie_Latn', '__label__nld_Latn', '__label__pol_Latn', '__label__deu_Latn'),
|
| 42 |
array([0.61224753, 0.21323682, 0.09696738, 0.01359863, 0.01319415]))
|
| 43 |
```
|
| 44 |
|
| 45 |
### Limitations and bias
|
| 46 |
|
| 47 |
-
Even if the training data used for this model could be characterized as fairly neutral, this model can have biased predictions.
|
| 48 |
|
| 49 |
Cosine similarity can be used to measure the similarity between two different word vectors. If two vectors are identical, the cosine similarity will be 1. For two completely unrelated vectors, the value will be 0. If two vectors have an opposite relationship, the value will be -1.
|
| 50 |
|
|
@@ -81,7 +110,7 @@ More information about the training of these models can be found in the article
|
|
| 81 |
|
| 82 |
### License
|
| 83 |
|
| 84 |
-
The language identification model is distributed under the [
|
| 85 |
|
| 86 |
### Evaluation datasets
|
| 87 |
|
|
@@ -91,7 +120,7 @@ The analogy evaluation datasets described in the paper are available here: [Fren
|
|
| 91 |
|
| 92 |
Please cite [1] if using this code for learning word representations or [2] if using for text classification.
|
| 93 |
|
| 94 |
-
[1] P. Bojanowski\*, E. Grave\*, A. Joulin, T. Mikolov, [
|
| 95 |
|
| 96 |
```markup
|
| 97 |
@article{bojanowski2016enriching,
|
|
@@ -102,7 +131,7 @@ Please cite [1] if using this code for learning word representations or [2] if u
|
|
| 102 |
}
|
| 103 |
```
|
| 104 |
|
| 105 |
-
[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, [
|
| 106 |
|
| 107 |
```markup
|
| 108 |
@article{joulin2016bag,
|
|
@@ -113,7 +142,7 @@ Please cite [1] if using this code for learning word representations or [2] if u
|
|
| 113 |
}
|
| 114 |
```
|
| 115 |
|
| 116 |
-
[3] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, [
|
| 117 |
|
| 118 |
```markup
|
| 119 |
@article{joulin2016fasttext,
|
|
@@ -126,7 +155,7 @@ Please cite [1] if using this code for learning word representations or [2] if u
|
|
| 126 |
|
| 127 |
If you use these word vectors, please cite the following paper:
|
| 128 |
|
| 129 |
-
[4] E. Grave\*, P. Bojanowski\*, P. Gupta, A. Joulin, T. Mikolov, [
|
| 130 |
|
| 131 |
```markup
|
| 132 |
@inproceedings{grave2018learning,
|
|
@@ -139,3 +168,10 @@ If you use these word vectors, please cite the following paper:
|
|
| 139 |
|
| 140 |
(\* These authors contributed equally.)
|
| 141 |
|
| 6 |
- language-identification
|
| 7 |
---
|
| 8 |
|
| 9 |
+
# 🔗 FastText Language Identification - Mirror Repository
|
| 10 |
+
|
| 11 |
+
> **⚠️ IMPORTANT NOTICE**: This is a **mirror/fork** of the original Facebook FastText Language Identification model.
|
| 12 |
+
>
|
| 13 |
+
> **Original Repository**: [facebook/fasttext-language-identification](https://huggingface.co/facebook/fasttext-language-identification)
|
| 14 |
+
>
|
| 15 |
+
> **Original Authors**: Facebook Research Team
|
| 16 |
+
>
|
| 17 |
+
> **Purpose of this mirror**: Providing an alternative access point for the model / Personal backup / Testing purposes
|
| 18 |
+
|
| 19 |
+
---
|
| 20 |
+
|
| 21 |
+
## 📌 Attribution & Credits
|
| 22 |
+
|
| 23 |
+
**ALL CREDITS GO TO THE ORIGINAL AUTHORS AT FACEBOOK RESEARCH**
|
| 24 |
+
|
| 25 |
+
This model was developed by Facebook Research as part of the NLLB (No Language Left Behind) project. I do not claim any ownership or authorship of this model. This repository serves only as a mirror/backup.
|
| 26 |
+
|
| 27 |
+
- **Original Model Card**: [facebook/fasttext-language-identification](https://huggingface.co/facebook/fasttext-language-identification)
|
| 28 |
+
- **Paper**: [Enriching Word Vectors with Subword Information](https://arxiv.org/abs/1607.04606)
|
| 29 |
+
- **Official Website**: [fasttext.cc](https://fasttext.cc/)
|
| 30 |
+
- **License**: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
|
| 31 |
+
|
| 32 |
+
---
|
| 33 |
+
|
| 34 |
+
# Original Model Description (from Facebook)
|
| 35 |
|
| 36 |
fastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. It works on standard, generic hardware. Models can later be reduced in size to even fit on mobile devices. It was introduced in [this paper](https://arxiv.org/abs/1607.04606). The official website can be found [here](https://fasttext.cc/).
|
| 37 |
|
|
|
|
| 55 |
>>> import fasttext
|
| 56 |
>>> from huggingface_hub import hf_hub_download
|
| 57 |
|
| 58 |
+
>>> # You can use either the original repo or this mirror
|
| 59 |
+
>>> # Original: repo_id="facebook/fasttext-language-identification"
|
| 60 |
+
>>> # Mirror: repo_id="nahiar/language-detection"
|
| 61 |
+
>>>
|
| 62 |
>>> model_path = hf_hub_download(repo_id="facebook/fasttext-language-identification", filename="model.bin")
|
| 63 |
>>> model = fasttext.load_model(model_path)
|
| 64 |
>>> model.predict("Hello, world!")
|
|
|
|
| 67 |
|
| 68 |
>>> model.predict("Hello, world!", k=5)
|
| 69 |
|
| 70 |
+
(('__label__eng_Latn', '__label__vie_Latn', '__label__nld_Latn', '__label__pol_Latn', '__label__deu_Latn'),
|
| 71 |
array([0.61224753, 0.21323682, 0.09696738, 0.01359863, 0.01319415]))
|
| 72 |
```
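The labels returned by `model.predict` carry a `__label__` prefix. A minimal sketch of turning the output above into readable pairs is shown below. It assumes that each label has the form `__label__<language>_<script>` (as in `eng_Latn`), and the helper name `parse_label` is hypothetical, not part of the fastText API. The sample values are copied from the output above so the snippet runs without downloading the model.

```python
PREFIX = "__label__"  # prefix seen on every label in the output above


def parse_label(label):
    """Split a label such as '__label__eng_Latn' into (language, script).

    Assumes the '<language>_<script>' layout; this helper is illustrative only.
    """
    code = label[len(PREFIX):] if label.startswith(PREFIX) else label
    language, _, script = code.partition("_")
    return language, script


# Sample values copied from the k=5 prediction shown above.
labels = (
    "__label__eng_Latn",
    "__label__vie_Latn",
    "__label__nld_Latn",
    "__label__pol_Latn",
    "__label__deu_Latn",
)
scores = [0.61224753, 0.21323682, 0.09696738, 0.01359863, 0.01319415]

# Pair each parsed label with its score, keeping the model's ranking order.
ranked = [(*parse_label(label), score) for label, score in zip(labels, scores)]

for language, script, score in ranked:
    print(f"{language} ({script}): {score:.4f}")
```

With a loaded model, `labels` and `scores` would instead come from unpacking the tuple returned by `model.predict(text, k=5)`.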
|
| 73 |
|
| 74 |
### Limitations and bias
|
| 75 |
|
| 76 |
+
Even if the training data used for this model could be characterized as fairly neutral, this model can have biased predictions.
|
| 77 |
|
| 78 |
Cosine similarity can be used to measure the similarity between two different word vectors. If two vectors are identical, the cosine similarity will be 1. For two completely unrelated vectors, the value will be 0. If two vectors have an opposite relationship, the value will be -1.
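The three cases described above can be checked with a small, dependency-free sketch. The function name `cosine_similarity` and the toy two-dimensional vectors are illustrative assumptions; real word vectors from a fastText model would have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors chosen to illustrate the three cases from the text.
same = cosine_similarity([1.0, 2.0], [1.0, 2.0])        # identical vectors
unrelated = cosine_similarity([1.0, 0.0], [0.0, 1.0])   # orthogonal vectors
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])  # opposite directions

print(same, unrelated, opposite)
```

Because of floating-point rounding, the identical and opposite cases may come out very slightly off from exactly 1 and -1, so comparisons are best done with a tolerance such as `math.isclose`.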
|
| 79 |
|
|
|
|
| 110 |
|
| 111 |
### License
|
| 112 |
|
| 113 |
+
The language identification model is distributed under the [_Creative Commons Attribution-NonCommercial 4.0 International Public License_](https://creativecommons.org/licenses/by-nc/4.0/).
|
| 114 |
|
| 115 |
### Evaluation datasets
|
| 116 |
|
|
|
|
| 120 |
|
| 121 |
Please cite [1] if using this code for learning word representations or [2] if using for text classification.
|
| 122 |
|
| 123 |
+
[1] P. Bojanowski\*, E. Grave\*, A. Joulin, T. Mikolov, [_Enriching Word Vectors with Subword Information_](https://arxiv.org/abs/1607.04606)
|
| 124 |
|
| 125 |
```markup
|
| 126 |
@article{bojanowski2016enriching,
|
|
|
|
| 131 |
}
|
| 132 |
```
|
| 133 |
|
| 134 |
+
[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, [_Bag of Tricks for Efficient Text Classification_](https://arxiv.org/abs/1607.01759)
|
| 135 |
|
| 136 |
```markup
|
| 137 |
@article{joulin2016bag,
|
|
|
|
| 142 |
}
|
| 143 |
```
|
| 144 |
|
| 145 |
+
[3] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, [_FastText.zip: Compressing text classification models_](https://arxiv.org/abs/1612.03651)
|
| 146 |
|
| 147 |
```markup
|
| 148 |
@article{joulin2016fasttext,
|
|
|
|
| 155 |
|
| 156 |
If you use these word vectors, please cite the following paper:
|
| 157 |
|
| 158 |
+
[4] E. Grave\*, P. Bojanowski\*, P. Gupta, A. Joulin, T. Mikolov, [_Learning Word Vectors for 157 Languages_](https://arxiv.org/abs/1802.06893)
|
| 159 |
|
| 160 |
```markup
|
| 161 |
@inproceedings{grave2018learning,
|
|
|
|
| 168 |
|
| 169 |
(\* These authors contributed equally.)
|
| 170 |
|
| 171 |
+
---
|
| 172 |
+
|
| 173 |
+
## 📝 Repository Maintainer Note
|
| 174 |
+
|
| 175 |
+
This repository is maintained by [@nahiar](https://huggingface.co/nahiar) for easier access and backup purposes only. For any issues with the model itself, please refer to the original repository or the Facebook Research team.
|
| 176 |
+
|
| 177 |
+
**If you are the original author and have any concerns about this mirror, please contact me and I will immediately take appropriate action.**
|