---
library_name: transformers
license: apache-2.0
language:
- zh
- yue
tags:
- text-classification
- zhlid
- modernbert
pipeline_tag: text-classification
---

# ZHLID model card

**Authors**: Lung-Chuan Chen

**GitHub page**: https://github.com/Musubi-ai/ZHLID

## Model information
ZHLID is a classification model specialized in identifying fine-grained Chinese varieties. It adopts the [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base) architecture and is trained on an in-house dataset composed of Traditional Chinese and Simplified Chinese data.

Unlike general-purpose language identification (LID) tools, ZHLID focuses on distinguishing between closely related Chinese varieties, including:

**Traditional Chinese (繁體中文)** – written in the traditional character set, used in formal and classical texts.  
**Simplified Chinese (簡體中文)** – written in the simplified character set, designed for easier reading and writing.  
**Cantonese (粵語)** – written form reflecting spoken Cantonese with unique vocabulary and grammar.  
**Classical Chinese (Traditional) (繁體文言文)** – literary Chinese in traditional characters with concise, classical syntax.  
**Classical Chinese (Simplified) (簡體文言文)** – literary Chinese in simplified characters, used in modern reprints and education.

This makes ZHLID useful for linguistic research, corpus analysis, preprocessing for NLP tasks, or any application requiring accurate recognition of Chinese textual forms.

The following table compares ZHLID with other popular LID tools supporting Chinese detection:

| Identification | General Chinese | Traditional Chinese | Simplified Chinese | Classical Chinese | Cantonese |
|------|:----:|:----:|:----:|:----:|:----:|
| ZHLID (ours) | ✅ | ✅ | ✅ | ✅ | ✅ |
| [langdetect](https://github.com/Mimino666/langdetect) | ✅ | ✅ | ✅ | ❌ | ❌ |
| [GlotLID](https://github.com/cisnlp/GlotLID/tree/main) | ✅ | ❌ | ❌ | ❌ | ✅ |
| [langid.py](https://github.com/saffsd/langid.py) | ✅ | ❌ | ❌ | ❌ | ❌ |
| [CLD3](https://github.com/google/cld3?tab=readme-ov-file#supported-languages) | ✅ | ❌ | ❌ | ❌ | ❌ |
| [Lingua](https://github.com/pemistahl/lingua-py) | ✅ | ❌ | ❌ | ❌ | ❌ |

## Installation
To use the ZHLID model, install `transformers` version 4.48.0 or later:
```bash
pip install -U "transformers>=4.48.0"
```
Optionally, you can install [flash-attention](https://github.com/Dao-AILab/flash-attention) to improve inference efficiency:
```bash
pip install flash-attn --no-build-isolation
```
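
To confirm that the installed `transformers` meets the 4.48.0 minimum stated above, a small check like the following may help. This is a minimal sketch: the helper names `parse_release` and `meets_minimum` are hypothetical, and the comparison deliberately ignores pre-release suffixes such as `rc1` or `.dev0`.

```python
from importlib.metadata import PackageNotFoundError, version

MIN_VERSION = (4, 48, 0)  # minimum transformers version stated in this card


def parse_release(version_string):
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component (e.g. "dev0")
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(version_string, minimum=MIN_VERSION):
    """Check a version string against the minimum (pre-release tags ignored)."""
    release = parse_release(version_string)
    # Pad short versions such as "4.48" with zeros before comparing
    padded = release + (0,) * (len(minimum) - len(release))
    return padded >= minimum


try:
    installed = version("transformers")
    status = "OK" if meets_minimum(installed) else "too old"
    print(f"transformers {installed}: {status}")
except PackageNotFoundError:
    print("transformers is not installed")
```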

## Usage
With the `pipeline` function from `transformers`:
```python
from transformers import pipeline

pipe = pipeline("text-classification", model="MusubiAI/ZHLID")
text = "孔子\n大成至圣先师孔丘,字仲尼,子姓,孔氏,敬称孔子、孔夫子,生于鲁昌平乡陬邑。"

res = pipe(text)
print(res)
# [{'label': 'zhcn_classical', 'score': 0.9998414516448975}]
```

With `AutoModelForSequenceClassification`:
```python
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "MusubiAI/ZHLID"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
id2label = model.config.id2label

text = "孔子\n大成至圣先师孔丘,字仲尼,子姓,孔氏,敬称孔子、孔夫子,生于鲁昌平乡陬邑。"
inputs = tokenizer(text, return_tensors="pt")
# Run a single forward pass without tracking gradients
with torch.no_grad():
    logits = model(**inputs)["logits"]

scores = nn.functional.softmax(logits, dim=-1)
pred_score, pred_index = torch.max(scores, dim=-1)
pred_score = pred_score.item()
pred_index = pred_index.item()
label = id2label[pred_index]
prediction = {"label": label, "confidence_score": pred_score}
print(prediction)
# {'label': 'zhcn_classical', 'confidence_score': 0.99983811378479}
```
Inference with `vllm` is also supported:
```python
from vllm import LLM
import torch

# Load the model in classification mode
llm = LLM(model="MusubiAI/ZHLID", task="classify")

text = "孔子\n大成至圣先师孔丘,字仲尼,子姓,孔氏,敬称孔子、孔夫子,生于鲁昌平乡陬邑。"

# classify() returns one result per input; take the first
output = llm.classify(text)[0]
probabilities = torch.tensor(output.outputs.probs)

# Get the top predicted class
top_idx = torch.argmax(probabilities).item()
top_prob = probabilities[top_idx].item()

print(f"Confidence: {top_prob:.4f}")

id2label = {
    "0": "yue",
    "1": "zhcn_classical",
    "2": "zhtw_classical",
    "3": "zhcn",
    "4": "zhtw"
}

label = id2label[str(top_idx)]
print(label)
```
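
When only the top class is printed, the remaining probabilities are discarded. The sketch below ranks all five labels from a probability vector and optionally rejects low-confidence predictions. It reuses the index-to-label mapping from the vLLM example. The function `rank_labels`, the `min_confidence` threshold, and the sample probabilities are illustrative assumptions, not part of ZHLID or actual model output.

```python
# Index-to-label mapping copied from the vLLM example in this card
ID2LABEL = {
    0: "yue",
    1: "zhcn_classical",
    2: "zhtw_classical",
    3: "zhcn",
    4: "zhtw",
}


def rank_labels(probabilities, min_confidence=0.5):
    """Rank all labels by probability; hypothetical helper, not a ZHLID API.

    Returns the top label (or None if it falls below min_confidence),
    its probability, and the full ranking as (label, probability) pairs.
    """
    if len(probabilities) != len(ID2LABEL):
        raise ValueError("expected one probability per label")
    ranked = sorted(
        ((ID2LABEL[i], float(p)) for i, p in enumerate(probabilities)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    top_label, top_prob = ranked[0]
    return {
        "label": top_label if top_prob >= min_confidence else None,
        "confidence": top_prob,
        "ranking": ranked,
    }


# Made-up probabilities for illustration only
example = rank_labels([0.01, 0.90, 0.05, 0.03, 0.01])
print(example["label"], example["ranking"][:2])
```

A threshold of this kind can be useful for routing ambiguous or mixed-variety text to manual review; a suitable value depends on the data and would need to be tuned.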

## Evaluation
We compare the top-1 accuracy of ZHLID with that of [GlotLID](https://github.com/cisnlp/GlotLID/tree/main) and [langdetect](https://github.com/Mimino666/langdetect). Because GlotLID only provides a general "cmn_Hani" label for Chinese, its performance on Traditional and Simplified Chinese is measured by whether it outputs this label for samples in either category.

| Top-1 accuracy | Traditional Chinese | Simplified Chinese | Classical Chinese (Traditional) | Classical Chinese (Simplified) | Cantonese |
|------|:----:|:----:|:----:|:----:|:----:|
| ZHLID (ours) | 1.0 | 1.0 | 0.9 | 1.0 | 0.96 |
| [GlotLID](https://github.com/cisnlp/GlotLID/tree/main) | 0.98 | 0.98 | - | - | 0.9 |
| [langdetect](https://github.com/Mimino666/langdetect) | 0.3 | 0.9 | - | - | - |
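
As a rough illustration of how per-class top-1 accuracy can be computed, including the rule that a generic "cmn_Hani" prediction counts as correct for both Traditional and Simplified Chinese, consider the sketch below. The function name, the `EQUIVALENT` mapping, and the toy labels are assumptions for illustration; they are not the evaluation script or data behind the table above.

```python
from collections import defaultdict

# Assumption: a generic "cmn_Hani" prediction is accepted for both scripts
EQUIVALENT = {"cmn_Hani": {"zhtw", "zhcn"}}


def per_class_top1_accuracy(gold, predicted, equivalent=None):
    """Return {gold_label: accuracy}, crediting equivalent coarse labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    equivalent = equivalent or {}
    correct = defaultdict(int)
    total = defaultdict(int)
    for gold_label, pred_label in zip(gold, predicted):
        total[gold_label] += 1
        # Exact match, or a coarse label that covers the gold class
        if pred_label == gold_label or gold_label in equivalent.get(pred_label, set()):
            correct[gold_label] += 1
    return {label: correct[label] / total[label] for label in total}


# Toy data, invented for illustration only
gold = ["zhtw", "zhtw", "zhcn", "yue"]
predicted = ["cmn_Hani", "other", "cmn_Hani", "cmn_Hani"]
print(per_class_top1_accuracy(gold, predicted, EQUIVALENT))
```

Without the `equivalent` argument the same function performs strict label matching, which is the natural setting for a model that emits the fine-grained labels directly.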

## License
The ZHLID model is released under the Apache 2.0 license.

## Citation
If you use ZHLID in your research, please cite this repository:
```bibtex
@misc{zhlid2025,
  title  = {ZHLID: Fine-grained Chinese Language Identification Package},
  author = {Lung-Chuan Chen},
  year   = {2025},
  howpublished = {\url{https://github.com/Musubi-ai/ZHLID}}
}
```