SolarisCipher committed
Commit 9bec6eb · verified · 1 Parent(s): be34fa6

Update README.md

Files changed (1):
  1. README.md +120 -3
README.md CHANGED
@@ -1,3 +1,120 @@
- ---
- license: cc-by-4.0
- ---
---
language:
- yue
- zh
language_details: "yue-Hant-HK; zh-Hant-HK"
license: cc-by-4.0
datasets:
- SolarisCipher/hk_content_corpus
metrics:
- accuracy
- exact_match
tags:
- ELECTRA
- pretrained
- masked-language-model
- replaced-token-detection
- feature-extraction
library_name: transformers
---

# HKELECTRA - ELECTRA Pretrained Models for Hong Kong Content

This repository contains **pretrained ELECTRA models** trained from scratch on Hong Kong Cantonese and Traditional Chinese content, built to study how diglossia affects NLP modeling.

The repo includes:

- `generator/` : the **generator** model in Hugging Face Transformers format, for masked token prediction.
- `discriminator/` : the **discriminator** model in Hugging Face Transformers format, for replaced token detection.
- `tf_checkpoint/` : the original **TensorFlow checkpoint** from pretraining (requires TensorFlow to load; see the sketch after this list).
- `runs/` : **TensorBoard logs** of pretraining.

**Note:** Because this repo contains multiple models with different purposes, there is **no `pipeline_tag`**. Select the model and pipeline appropriate for your use case; the examples below cover the common ones.
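For the raw checkpoint, a minimal sketch of inspecting its variables with TensorFlow (the checkpoint prefix `tf_checkpoint/model.ckpt` is an assumption; check the actual file names in the folder):

```python
import tensorflow as tf

# Hypothetical checkpoint prefix; the real name may differ inside tf_checkpoint/
ckpt_prefix = "tf_checkpoint/model.ckpt"

# List every variable stored in the pretraining checkpoint with its shape
for name, shape in tf.train.list_variables(ckpt_prefix):
    print(name, shape)
```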

## Model Details

### Model Description

- **Architecture:** ELECTRA (small/base/large)
- **Pretraining:** from scratch (no base model)
- **Languages:** Hong Kong Cantonese, Traditional Chinese
- **Intended Use:** research, feature extraction, masked token prediction
- **License:** cc-by-4.0

## Usage Examples

### Load Generator (Masked LM)

```python
from transformers import ElectraTokenizer, ElectraForMaskedLM, pipeline

# The models live in subfolders of the repo, so pass `subfolder` rather than
# appending the path to the repo id (a repo id with extra slashes is invalid).
tokenizer = ElectraTokenizer.from_pretrained("SolarisCipher/HKELECTRA", subfolder="generator/small")
model = ElectraForMaskedLM.from_pretrained("SolarisCipher/HKELECTRA", subfolder="generator/small")

unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)
unmasker("從中環[MASK]到尖沙咀。")  # "From Central, [MASK] to Tsim Sha Tsui."
```
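The pipeline returns a ranked list of candidate fillings; a quick way to inspect them, following the standard `fill-mask` pipeline output format:

```python
# Each candidate is a dict with the completed sequence, the predicted
# token string, and a model score.
for pred in unmasker("從中環[MASK]到尖沙咀。", top_k=5):
    print(pred["token_str"], f"{pred['score']:.3f}")
```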

### Load Discriminator (Feature Extraction / Replaced Token Detection)

```python
from transformers import ElectraTokenizer, ElectraForPreTraining

tokenizer = ElectraTokenizer.from_pretrained("SolarisCipher/HKELECTRA", subfolder="discriminator/small")
model = ElectraForPreTraining.from_pretrained("SolarisCipher/HKELECTRA", subfolder="discriminator/small")

# The discriminator scores a plain sentence; it does not use a [MASK] token.
inputs = tokenizer("從中環坐車到尖沙咀。", return_tensors="pt")  # "Take a ride from Central to Tsim Sha Tsui."
outputs = model(**inputs)  # one replaced-token logit per input token
```
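`ElectraForPreTraining` emits one logit per token, with higher values meaning the token looks replaced; a minimal sketch of turning the logits into per-token scores:

```python
import torch

# Positive logits mean the discriminator suspects the token was replaced
with torch.no_grad():
    scores = torch.sigmoid(outputs.logits[0])
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, score in zip(tokens, scores.tolist()):
    print(f"{token}\t{score:.3f}")
```

For plain feature extraction, one option (a sketch using the generic `ElectraModel` encoder on the discriminator weights, not something this README prescribes) is to take the final hidden states:

```python
from transformers import ElectraModel

encoder = ElectraModel.from_pretrained("SolarisCipher/HKELECTRA", subfolder="discriminator/small")
with torch.no_grad():
    features = encoder(**inputs).last_hidden_state  # (batch, seq_len, hidden_size)
```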

## Citation

If you use this model in your work, please cite our dataset and the original research:

**Dataset (Upstream SQL Dump)**

```bibtex
@dataset{yung_2025_16875235,
  author    = {Yung, Yiu Cheong},
  title     = {HK Web Text Corpus (MySQL Dump, raw version)},
  month     = aug,
  year      = 2025,
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.16875235},
  url       = {https://doi.org/10.5281/zenodo.16875235},
}
```

**Dataset (Cleaned Corpus)**

```bibtex
@dataset{yung_2025_16882351,
  author    = {Yung, Yiu Cheong},
  title     = {HK Content Corpus (Cantonese \& Traditional Chinese)},
  month     = aug,
  year      = 2025,
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.16882351},
  url       = {https://doi.org/10.5281/zenodo.16882351},
}
```

**Research Paper**

```bibtex
@article{10.1145/3744341,
  author     = {Yung, Yiu Cheong and Lin, Ying-Jia and Kao, Hung-Yu},
  title      = {Exploring the Effectiveness of Pre-training Language Models with Incorporation of Diglossia for Hong Kong Content},
  year       = {2025},
  issue_date = {July 2025},
  publisher  = {Association for Computing Machinery},
  address    = {New York, NY, USA},
  volume     = {24},
  number     = {7},
  issn       = {2375-4699},
  url        = {https://doi.org/10.1145/3744341},
  doi        = {10.1145/3744341},
  journal    = {ACM Trans. Asian Low-Resour. Lang. Inf. Process.},
  month      = jul,
  articleno  = {71},
  numpages   = {16},
  keywords   = {Hong Kong, diglossia, ELECTRA, language modeling}
}
```