---
language:
- yue
- zh
language_details: "yue-Hant-HK; zh-Hant-HK"
license: cc-by-4.0
datasets:
- IKMLab-team/hk_content_corpus
metrics:
- accuracy
- exact_match
tags:
- ELECTRA
- pretrained
- masked-language-model
- replaced-token-detection
- feature-extraction
library_name: transformers
---

# HKELECTRA - ELECTRA Pretrained Models for Hong Kong Content

This repository contains **pretrained ELECTRA models** trained on Hong Kong Cantonese and Traditional Chinese content, built to study the effects of diglossia on NLP modeling.

The repo includes:

- `generator/`: **generator** model in Hugging Face Transformers format, for masked token prediction.
- `discriminator/`: **discriminator** model in Hugging Face Transformers format, for replaced token detection.
- `tf_checkpoint/`: original **TensorFlow checkpoint** from pretraining (requires TensorFlow to load).
- `runs/`: **TensorBoard logs** of pretraining (viewable with `tensorboard --logdir runs/`).

**Note:** Because this repo contains multiple models with different purposes, there is **no `pipeline_tag`**; select the model and pipeline appropriate to your use case. The raw TensorFlow checkpoint must be loaded manually and requires TensorFlow >= 2.x.

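As a starting point, the pretraining checkpoint can be inspected with TensorFlow's checkpoint reader; a minimal sketch, assuming the checkpoint files sit directly under `tf_checkpoint/` (the exact checkpoint prefix is worth verifying against the repo contents):

```python
import tensorflow as tf

# Read the latest checkpoint under tf_checkpoint/ and list its variables.
# The directory path is an assumption; adjust to the actual checkpoint prefix.
reader = tf.train.load_checkpoint("tf_checkpoint/")
for name, shape in sorted(reader.get_variable_to_shape_map().items()):
    print(name, shape)
```
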
This model is also available at Zenodo: https://doi.org/10.5281/zenodo.16889492

## Model Details

### Model Description

- **Architecture:** ELECTRA (small/base/large)
- **Pretraining:** from scratch (no base model)
- **Languages:** Hong Kong Cantonese, Traditional Chinese
- **Intended Use:** research, feature extraction, masked token prediction
- **License:** cc-by-4.0

## Usage Examples

### Load Generator (Masked LM)

```python
from transformers import ElectraTokenizer, ElectraForMaskedLM, pipeline

# Load the generator weights from the generator/small subfolder of the repo.
tokenizer = ElectraTokenizer.from_pretrained("IKMLab-team/HKELECTRA", subfolder="generator/small")
model = ElectraForMaskedLM.from_pretrained("IKMLab-team/HKELECTRA", subfolder="generator/small")

unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)
unmasker("從中環[MASK]到尖沙咀。")  # "From Central [MASK] to Tsim Sha Tsui."
```
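
The same prediction can also be made without the `pipeline` helper; a minimal sketch that ranks the top-5 candidate tokens for the masked position, reusing the `generator/small` layout assumed above:

```python
import torch
from transformers import ElectraTokenizer, ElectraForMaskedLM

tokenizer = ElectraTokenizer.from_pretrained("IKMLab-team/HKELECTRA", subfolder="generator/small")
model = ElectraForMaskedLM.from_pretrained("IKMLab-team/HKELECTRA", subfolder="generator/small")

inputs = tokenizer("從中環[MASK]到尖沙咀。", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and list its top-5 candidate tokens.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top5 = torch.topk(logits[0, mask_pos], k=5).indices[0]
print(tokenizer.convert_ids_to_tokens(top5.tolist()))
```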

### Load Discriminator (Feature Extraction / Replaced Token Detection)

```python
from transformers import ElectraTokenizer, ElectraForPreTraining

tokenizer = ElectraTokenizer.from_pretrained("IKMLab-team/HKELECTRA", subfolder="discriminator/small")
model = ElectraForPreTraining.from_pretrained("IKMLab-team/HKELECTRA", subfolder="discriminator/small")

# The discriminator scores every token of a complete sentence; no [MASK] is used.
inputs = tokenizer("從中環坐車到尖沙咀。", return_tensors="pt")  # "Take a ride from Central to Tsim Sha Tsui."
outputs = model(**inputs)
replaced = (outputs.logits > 0).long()  # positive logits flag tokens judged to be replaced
```
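
### Extract Features (Discriminator Encoder)

For plain feature extraction (one of the intended uses above), the discriminator's encoder can be loaded as an `ElectraModel`; a minimal sketch, assuming the same `discriminator/small` layout (the example sentence is illustrative only):

```python
import torch
from transformers import ElectraTokenizer, ElectraModel

tokenizer = ElectraTokenizer.from_pretrained("IKMLab-team/HKELECTRA", subfolder="discriminator/small")
model = ElectraModel.from_pretrained("IKMLab-team/HKELECTRA", subfolder="discriminator/small")

inputs = tokenizer("香港係一個國際城市。", return_tensors="pt")  # "Hong Kong is an international city."
with torch.no_grad():
    features = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size) contextual embeddings
```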

## Citation

If you use this model in your work, please cite our dataset and the original research:

**Dataset (Upstream SQL Dump):**
```bibtex
@dataset{yung_2025_16875235,
  author    = {Yung, Yiu Cheong},
  title     = {HK Web Text Corpus (MySQL Dump, raw version)},
  month     = aug,
  year      = 2025,
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.16875235},
  url       = {https://doi.org/10.5281/zenodo.16875235},
}
```

**Dataset (Cleaned Corpus):**
```bibtex
@dataset{yung_2025_16882351,
  author    = {Yung, Yiu Cheong},
  title     = {HK Content Corpus (Cantonese \& Traditional Chinese)},
  month     = aug,
  year      = 2025,
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.16882351},
  url       = {https://doi.org/10.5281/zenodo.16882351},
}
```

**Research Paper:**
```bibtex
@article{10.1145/3744341,
  author     = {Yung, Yiu Cheong and Lin, Ying-Jia and Kao, Hung-Yu},
  title      = {Exploring the Effectiveness of Pre-training Language Models with Incorporation of Diglossia for Hong Kong Content},
  year       = {2025},
  issue_date = {July 2025},
  publisher  = {Association for Computing Machinery},
  address    = {New York, NY, USA},
  volume     = {24},
  number     = {7},
  issn       = {2375-4699},
  url        = {https://doi.org/10.1145/3744341},
  doi        = {10.1145/3744341},
  journal    = {ACM Trans. Asian Low-Resour. Lang. Inf. Process.},
  month      = jul,
  articleno  = {71},
  numpages   = {16},
  keywords   = {Hong Kong, diglossia, ELECTRA, language modeling}
}
```