---
language:
- yue
- zh
language_details: "yue-Hant-HK; zh-Hant-HK"
license: cc-by-4.0
datasets:
- IKMLab-team/hk_content_corpus
metrics:
- accuracy
- exact_match
tags:
- ELECTRA
- pretrained
- masked-language-model
- replaced-token-detection
- feature-extraction
library_name: transformers
---

# HKELECTRA - ELECTRA Pretrained Models for Hong Kong Content

This repository contains **pretrained ELECTRA models** trained on Hong Kong Cantonese and Traditional Chinese content, built to study the effects of diglossia on NLP modeling.

The repo includes:

- `generator/`: **generator** model in Hugging Face Transformers format, for masked token prediction.
- `discriminator/`: **discriminator** model in Hugging Face Transformers format, for replaced token detection.
- `tf_checkpoint/`: original **TensorFlow checkpoint** from pretraining (requires TensorFlow to load).
- `runs/`: **TensorBoard logs** of pretraining (viewable with `tensorboard --logdir runs/`).

**Note:** Because this repo contains multiple models with different purposes, there is **no `pipeline_tag`**; select the model and pipeline appropriate to your use case. The raw TensorFlow checkpoint must be loaded manually and requires TensorFlow >= 2.x.

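As a starting point, the pretraining checkpoint can be inspected with TensorFlow's checkpoint reader; a minimal sketch, assuming the checkpoint files sit directly under `tf_checkpoint/` (the exact checkpoint prefix is worth verifying against the repo contents):

```python
import tensorflow as tf

# Read the latest checkpoint under tf_checkpoint/ and list its variables.
# The directory path is an assumption; adjust to the actual checkpoint prefix.
reader = tf.train.load_checkpoint("tf_checkpoint/")
for name, shape in sorted(reader.get_variable_to_shape_map().items()):
    print(name, shape)
```
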
This model is also available at Zenodo: https://doi.org/10.5281/zenodo.16889492

## Model Details

### Model Description

- **Architecture:** ELECTRA (small/base/large)
- **Pretraining:** from scratch (no base model)
- **Languages:** Hong Kong Cantonese, Traditional Chinese
- **Intended Use:** research, feature extraction, masked token prediction
- **License:** cc-by-4.0

## Usage Examples

### Load Generator (Masked LM)

```python
from transformers import ElectraTokenizer, ElectraForMaskedLM, pipeline

# Load the generator weights from the generator/small subfolder of the repo.
tokenizer = ElectraTokenizer.from_pretrained("IKMLab-team/HKELECTRA", subfolder="generator/small")
model = ElectraForMaskedLM.from_pretrained("IKMLab-team/HKELECTRA", subfolder="generator/small")

unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)
unmasker("從中環[MASK]到尖沙咀。")  # "From Central [MASK] to Tsim Sha Tsui."
```
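
The same prediction can also be made without the `pipeline` helper; a minimal sketch that ranks the top-5 candidate tokens for the masked position, reusing the `generator/small` layout assumed above:

```python
import torch
from transformers import ElectraTokenizer, ElectraForMaskedLM

tokenizer = ElectraTokenizer.from_pretrained("IKMLab-team/HKELECTRA", subfolder="generator/small")
model = ElectraForMaskedLM.from_pretrained("IKMLab-team/HKELECTRA", subfolder="generator/small")

inputs = tokenizer("從中環[MASK]到尖沙咀。", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and list its top-5 candidate tokens.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top5 = torch.topk(logits[0, mask_pos], k=5).indices[0]
print(tokenizer.convert_ids_to_tokens(top5.tolist()))
```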

### Load Discriminator (Feature Extraction / Replaced Token Detection)

```python
from transformers import ElectraTokenizer, ElectraForPreTraining

tokenizer = ElectraTokenizer.from_pretrained("IKMLab-team/HKELECTRA", subfolder="discriminator/small")
model = ElectraForPreTraining.from_pretrained("IKMLab-team/HKELECTRA", subfolder="discriminator/small")

# The discriminator scores every token of a complete sentence; no [MASK] is used.
inputs = tokenizer("從中環坐車到尖沙咀。", return_tensors="pt")  # "Take a ride from Central to Tsim Sha Tsui."
outputs = model(**inputs)
replaced = (outputs.logits > 0).long()  # positive logits flag tokens judged to be replaced
```
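
### Extract Features (Discriminator Encoder)

For plain feature extraction (one of the intended uses above), the discriminator's encoder can be loaded as an `ElectraModel`; a minimal sketch, assuming the same `discriminator/small` layout (the example sentence is illustrative only):

```python
import torch
from transformers import ElectraTokenizer, ElectraModel

tokenizer = ElectraTokenizer.from_pretrained("IKMLab-team/HKELECTRA", subfolder="discriminator/small")
model = ElectraModel.from_pretrained("IKMLab-team/HKELECTRA", subfolder="discriminator/small")

inputs = tokenizer("香港係一個國際城市。", return_tensors="pt")  # "Hong Kong is an international city."
with torch.no_grad():
    features = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size) contextual embeddings
```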

## Citation

If you use this model in your work, please cite our dataset and the original research:

**Dataset (Upstream SQL Dump):**
```bibtex
@dataset{yung_2025_16875235,
  author    = {Yung, Yiu Cheong},
  title     = {HK Web Text Corpus (MySQL Dump, raw version)},
  month     = aug,
  year      = 2025,
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.16875235},
  url       = {https://doi.org/10.5281/zenodo.16875235},
}
```

**Dataset (Cleaned Corpus):**
```bibtex
@dataset{yung_2025_16882351,
  author    = {Yung, Yiu Cheong},
  title     = {HK Content Corpus (Cantonese \& Traditional Chinese)},
  month     = aug,
  year      = 2025,
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.16882351},
  url       = {https://doi.org/10.5281/zenodo.16882351},
}
```

**Research Paper:**
```bibtex
@article{10.1145/3744341,
  author     = {Yung, Yiu Cheong and Lin, Ying-Jia and Kao, Hung-Yu},
  title      = {Exploring the Effectiveness of Pre-training Language Models with Incorporation of Diglossia for Hong Kong Content},
  year       = {2025},
  issue_date = {July 2025},
  publisher  = {Association for Computing Machinery},
  address    = {New York, NY, USA},
  volume     = {24},
  number     = {7},
  issn       = {2375-4699},
  url        = {https://doi.org/10.1145/3744341},
  doi        = {10.1145/3744341},
  journal    = {ACM Trans. Asian Low-Resour. Lang. Inf. Process.},
  month      = jul,
  articleno  = {71},
  numpages   = {16},
  keywords   = {Hong Kong, diglossia, ELECTRA, language modeling}
}
```