## Language Identification
```text
langdetect
https://pypi.org/project/langdetect/
```
### langid
langid identifies 97 languages.
https://github.com/saffsd/langid.py
How it works:
```text
https://github.com/saffsd/langid.py/tree/master/langid/train
1. Tokenize.
2. Compute `character n-gram` or `word n-gram` features.
3. Compute the document frequency of each item.
4. Compute IG (information gain) weights and keep the most important features.
5. Train an NB (Naive Bayes) probability model, i.e. each item's probability contribution to each class.
```
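The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not langid's actual training code: the function names, the toy corpus, the `top_k` cutoff, and the add-one smoothing are all made up for the example, and it uses character 1-grams and 2-grams only.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_max=2):
    """Steps 1-2: overlapping character n-grams for n = 1..n_max."""
    text = text.lower()
    return [text[i:i + n] for n in range(1, n_max + 1)
            for i in range(len(text) - n + 1)]


def entropy(counts):
    """Shannon entropy in bits of a list of non-negative counts."""
    total = sum(counts)
    if not total:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def select_features(docs, labels, top_k=300):
    """Steps 3-4: per-class document frequency, then rank by information gain."""
    classes = sorted(set(labels))
    class_docs = Counter(labels)
    df = defaultdict(Counter)  # feature -> class -> number of docs containing it
    for doc, label in zip(docs, labels):
        for feat in set(char_ngrams(doc)):
            df[feat][label] += 1
    n = len(docs)
    base = entropy([class_docs[c] for c in classes])

    def info_gain(feat):
        present = [df[feat][c] for c in classes]
        absent = [class_docs[c] - df[feat][c] for c in classes]
        p = sum(present) / n
        return base - p * entropy(present) - (1 - p) * entropy(absent)

    # Highest gain first; ties broken alphabetically to stay deterministic.
    return sorted(df, key=lambda f: (-info_gain(f), f))[:top_k]


def train_nb(docs, labels, features):
    """Step 5: multinomial Naive Bayes with add-one smoothing (an assumption)."""
    vocab = set(features)
    classes = sorted(set(labels))
    counts = {c: Counter() for c in classes}
    for doc, label in zip(docs, labels):
        counts[label].update(g for g in char_ngrams(doc) if g in vocab)
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    loglik = {}
    for c in classes:
        total = sum(counts[c].values()) + len(vocab)
        loglik[c] = {f: math.log((counts[c][f] + 1) / total) for f in vocab}
    return prior, loglik


def classify(text, model):
    """Pick the class with the highest summed log-probability."""
    prior, loglik = model
    grams = char_ngrams(text)
    scores = {c: prior[c] + sum(loglik[c][g] for g in grams if g in loglik[c])
              for c in prior}
    return max(scores, key=scores.get)


# Toy corpus, invented for illustration only.
docs = ["the quick brown fox", "this is the house", "where is the cat",
        "der schnelle braune fuchs", "das ist das haus", "wo ist die katze"]
labels = ["en", "en", "en", "de", "de", "de"]

model = train_nb(docs, labels, select_features(docs, labels))
print(classify("the house is brown", model))
print(classify("das haus ist braun", model))
```

On a corpus this small the default `top_k` keeps every feature; on real data the information-gain cut is what keeps the model compact.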
### fasttext
fastText's language identification model recognizes 176 languages.
https://fasttext.cc/docs/en/language-identification.html
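A minimal usage sketch, assuming the `fasttext` Python package is installed and the pretrained `lid.176.bin` model from the page above has been downloaded to the working directory. The helper names `load_lid_model` and `detect_language` are hypothetical, introduced only for this example.

```python
def load_lid_model(path="lid.176.bin"):
    """Load the downloaded language-ID model; the default path is an assumption."""
    import fasttext  # imported lazily so detect_language works with any model-like object
    return fasttext.load_model(path)


def detect_language(model, text, k=1):
    """Return [(language_code, probability), ...] for the k best guesses."""
    # fastText predicts one line at a time and rejects embedded newlines.
    labels, probs = model.predict(text.replace("\n", " "), k=k)
    # Labels come back prefixed with "__label__", e.g. "__label__en".
    return [(label.replace("__label__", "", 1), float(p))
            for label, p in zip(labels, probs)]
```

Typical use would be `model = load_lid_model()` followed by `detect_language(model, "Bonjour le monde", k=2)`.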
### References
```text
https://zhuanlan.zhihu.com/p/600245782
```