---
language:
- yue
- zh
language_details: "yue-Hant-HK; zh-Hant-HK"
license: cc-by-4.0
datasets:
- SolarisCipher/hk_content_corpus
metrics:
- accuracy
- exact_match
tags:
- ELECTRA
- pretrained
- masked-language-model
- replaced-token-detection
- feature-extraction
library_name: transformers
---

# HKELECTRA - ELECTRA Pretrained Models for Hong Kong Content

This repository contains **pretrained ELECTRA models** trained on Hong Kong Cantonese and Traditional Chinese content, with a focus on studying the effects of diglossia in NLP modeling.

The repo includes:

- `generator/` : **generator** model in Hugging Face Transformers format, for masked token prediction.
- `discriminator/` : **discriminator** model in Hugging Face Transformers format, for replaced token detection.
- `tf_checkpoint/` : original **TensorFlow checkpoint** from pretraining (requires TensorFlow to load).
- `runs/` : **TensorBoard logs** of pretraining.

**Note:** Because this repo contains multiple models with different purposes, there is **no `pipeline_tag`**. Users should select the appropriate model and pipeline for their use case.
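
To see exactly which subfolders and checkpoints are available, one option is to list the repo's files with the `huggingface_hub` client (a minimal sketch; the repo id is taken from this model card):

```python
from huggingface_hub import list_repo_files

# Print every file in the model repo to reveal the directory layout
# (generator/, discriminator/, tf_checkpoint/, runs/).
for path in sorted(list_repo_files("SolarisCipher/HKELECTRA")):
    print(path)
```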

## Model Details

### Model Description

- **Architecture:** ELECTRA (small/base/large)
- **Pretraining:** from scratch (no base model)
- **Languages:** Hong Kong Cantonese, Traditional Chinese
- **Intended Use:** research, feature extraction, masked token prediction
- **License:** cc-by-4.0

## Usage Examples

### Load Generator (Masked LM)

```python
from transformers import ElectraTokenizer, ElectraForMaskedLM, pipeline

# Load the small generator from its subfolder within the repo.
tokenizer = ElectraTokenizer.from_pretrained("SolarisCipher/HKELECTRA", subfolder="generator/small")
model = ElectraForMaskedLM.from_pretrained("SolarisCipher/HKELECTRA", subfolder="generator/small")

unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)
unmasker("從中環[MASK]到尖沙咀。")  # "From Central, [MASK] to Tsim Sha Tsui."
```
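
The `fill-mask` pipeline returns the top candidate fills for the `[MASK]` position, each as a dict with the predicted `token_str`, its `score`, and the completed `sequence`.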

### Load Discriminator (Feature Extraction / Replaced Token Detection)

```python
from transformers import ElectraTokenizer, ElectraForPreTraining

tokenizer = ElectraTokenizer.from_pretrained("SolarisCipher/HKELECTRA", subfolder="discriminator/small")
model = ElectraForPreTraining.from_pretrained("SolarisCipher/HKELECTRA", subfolder="discriminator/small")

# The discriminator scores each token as original vs. replaced, so it takes a
# plain sentence as input rather than one containing a [MASK] token.
inputs = tokenizer("從中環坐車到尖沙咀。", return_tensors="pt")  # "Take transport from Central to Tsim Sha Tsui."
outputs = model(**inputs)  # outputs.logits: one replaced-token-detection logit per token
```
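
A minimal self-contained sketch of both uses named in this section, assuming the same subfolder layout as above (`torch.sigmoid(logit) > 0.5` is simply a readable way to threshold the per-token replaced/original decision):

```python
import torch
from transformers import ElectraTokenizer, ElectraForPreTraining

tokenizer = ElectraTokenizer.from_pretrained("SolarisCipher/HKELECTRA", subfolder="discriminator/small")
model = ElectraForPreTraining.from_pretrained("SolarisCipher/HKELECTRA", subfolder="discriminator/small")

inputs = tokenizer("從中環坐車到尖沙咀。", return_tensors="pt")
# Request hidden states so one forward pass serves both tasks.
outputs = model(**inputs, output_hidden_states=True)

# Replaced token detection: one logit per token; a positive logit
# (sigmoid > 0.5) means the token looks replaced.
flags = torch.sigmoid(outputs.logits[0]) > 0.5
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, flagged in zip(tokens, flags.tolist()):
    print(token, "replaced" if flagged else "original")

# Feature extraction: per-token embeddings from the last hidden layer.
embeddings = outputs.hidden_states[-1]  # shape: (batch, seq_len, hidden_size)
```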

## Citation

If you use this model in your work, please cite our datasets and the original research:

**Dataset (Upstream SQL Dump)**
```bibtex
@dataset{yung_2025_16875235,
  author    = {Yung, Yiu Cheong},
  title     = {HK Web Text Corpus (MySQL Dump, raw version)},
  month     = aug,
  year      = 2025,
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.16875235},
  url       = {https://doi.org/10.5281/zenodo.16875235}
}
```

**Dataset (Cleaned Corpus)**
```bibtex
@dataset{yung_2025_16882351,
  author    = {Yung, Yiu Cheong},
  title     = {HK Content Corpus (Cantonese \& Traditional Chinese)},
  month     = aug,
  year      = 2025,
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.16882351},
  url       = {https://doi.org/10.5281/zenodo.16882351}
}
```

**Research Paper**
```bibtex
@article{10.1145/3744341,
  author     = {Yung, Yiu Cheong and Lin, Ying-Jia and Kao, Hung-Yu},
  title      = {Exploring the Effectiveness of Pre-training Language Models with Incorporation of Diglossia for Hong Kong Content},
  year       = {2025},
  issue_date = {July 2025},
  publisher  = {Association for Computing Machinery},
  address    = {New York, NY, USA},
  volume     = {24},
  number     = {7},
  issn       = {2375-4699},
  url        = {https://doi.org/10.1145/3744341},
  doi        = {10.1145/3744341},
  journal    = {ACM Trans. Asian Low-Resour. Lang. Inf. Process.},
  month      = jul,
  articleno  = {71},
  numpages   = {16},
  keywords   = {Hong Kong, diglossia, ELECTRA, language modeling}
}
```