mohamedah committed
Commit 4c22db0 · verified · 1 Parent(s): c5ddbe9

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -21,7 +21,7 @@ language:
 
 ## Overview
 
- CensorshipDetector is a Chinese-language text-classification model finetuned to classify a given piece of text as more or less similar to known sanitized content (i.e., those pieces of content which remain after being subjected to state censorship including alterations, deletions, and self-imposed censorship). To fine-tune CensorshipDetector we used two corpora of Simplified Chinese text, one which has been subjected to the CCP's online information controls and one which was not. For the non-censored dataset, we used the [November 2023 Wikipedia dump](https://huggingface.co/datasets/wikimedia/wikipedia). For the censored dataset we used we [scraped 587,819 articles from Baidu Baike](https://huggingface.co/datasets/mohamedah/baidu_baike), an online encyclopedia which is the largest mainland Chinese alternative to Wikipedia. These articles were scraped from the Internet Archive's snapshots of the encyclopedia. Once we trained CensorshipDetector, we validated it using 5,039 [Chinese-language news articles](https://huggingface.co/datasets/mohamedah/zh-news-articles), 3,007 of which were from Chinese state media and the remaining 2,032 were from the Chinese language version of the New York Times. We sourced the state media articles from the [news2016zh](https://github.com/brightmart/nlp_chinese_corpus?tab=readme-ov-file#2%E6%96%B0%E9%97%BB%E8%AF%AD%E6%96%99json%E7%89%88news2016zh) corpus and we automatically scraped the New York Times articles.
+ CensorshipDetector is a Chinese-language text-classification model finetuned to classify a given piece of text as more or less similar to known sanitized content (i.e., those pieces of content which remain after being subjected to state censorship including alterations, deletions, and self-imposed censorship). To fine-tune CensorshipDetector we used two corpora of Simplified Chinese text, one which has been subjected to the CCP's online information controls and one which was not. For the non-censored dataset, we used the [November 2023 Wikipedia dump](https://huggingface.co/datasets/wikimedia/wikipedia). For the censored dataset we [scraped 587,819 articles from Baidu Baike](https://huggingface.co/datasets/mohamedah/baidu_baike), an online encyclopedia which is the largest mainland Chinese alternative to Wikipedia. These articles were scraped from the Internet Archive's snapshots of the encyclopedia. Once we trained CensorshipDetector, we validated it using 5,039 [Chinese-language news articles](https://huggingface.co/datasets/mohamedah/zh-news-articles), 3,007 of which were from Chinese state media and the remaining 2,032 were from the Chinese language version of the New York Times. We sourced the state media articles from the [news2016zh](https://github.com/brightmart/nlp_chinese_corpus?tab=readme-ov-file#2%E6%96%B0%E9%97%BB%E8%AF%AD%E6%96%99json%E7%89%88news2016zh) corpus and we automatically scraped the New York Times articles.
 
 ## Evaluation and Validation
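As a quick sanity check on the validation split described in the README text above, the stated article counts can be verified to be internally consistent (the figures below are taken directly from the paragraph; nothing here is new data):

```python
# Validation-set composition as stated in the README paragraph above.
state_media_articles = 3007  # sourced from the news2016zh corpus
nyt_articles = 2032          # scraped from the Chinese-language New York Times

total_validation_articles = state_media_articles + nyt_articles
print(total_validation_articles)  # 5039, matching the stated total
```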