mohamedah committed
Commit 4c22db0 · verified · 1 Parent(s): c5ddbe9

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -21,7 +21,7 @@ language:
 
 ## Overview
 
- CensorshipDetector is a Chinese-language text-classification model finetuned to classify a given piece of text as more or less similar to known sanitized content (i.e., those pieces of content which remain after being subjected to state censorship including alterations, deletions, and self-imposed censorship). To fine-tune CensorshipDetector we used two corpora of Simplified Chinese text, one which has been subjected to the CCP's online information controls and one which was not. For the non-censored dataset, we used the [November 2023 Wikipedia dump](https://huggingface.co/datasets/wikimedia/wikipedia). For the censored dataset we used we [scraped 587,819 articles from Baidu Baike](https://huggingface.co/datasets/mohamedah/baidu_baike), an online encyclopedia which is the largest mainland Chinese alternative to Wikipedia. These articles were scraped from the Internet Archive's snapshots of the encyclopedia. Once we trained CensorshipDetector, we validated it using 5,039 [Chinese-language news articles](https://huggingface.co/datasets/mohamedah/zh-news-articles), 3,007 of which were from Chinese state media and the remaining 2,032 were from the Chinese language version of the New York Times. We sourced the state media articles from the [news2016zh](https://github.com/brightmart/nlp_chinese_corpus?tab=readme-ov-file#2%E6%96%B0%E9%97%BB%E8%AF%AD%E6%96%99json%E7%89%88news2016zh) corpus and we automatically scraped the New York Times articles.
+ CensorshipDetector is a Chinese-language text-classification model finetuned to classify a given piece of text as more or less similar to known sanitized content (i.e., those pieces of content which remain after being subjected to state censorship including alterations, deletions, and self-imposed censorship). To fine-tune CensorshipDetector we used two corpora of Simplified Chinese text, one which has been subjected to the CCP's online information controls and one which was not. For the non-censored dataset, we used the [November 2023 Wikipedia dump](https://huggingface.co/datasets/wikimedia/wikipedia). For the censored dataset we [scraped 587,819 articles from Baidu Baike](https://huggingface.co/datasets/mohamedah/baidu_baike), an online encyclopedia which is the largest mainland Chinese alternative to Wikipedia. These articles were scraped from the Internet Archive's snapshots of the encyclopedia. Once we trained CensorshipDetector, we validated it using 5,039 [Chinese-language news articles](https://huggingface.co/datasets/mohamedah/zh-news-articles), 3,007 of which were from Chinese state media and the remaining 2,032 were from the Chinese language version of the New York Times. We sourced the state media articles from the [news2016zh](https://github.com/brightmart/nlp_chinese_corpus?tab=readme-ov-file#2%E6%96%B0%E9%97%BB%E8%AF%AD%E6%96%99json%E7%89%88news2016zh) corpus and we automatically scraped the New York Times articles.
 
 ## Evaluation and Validation
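As a quick sanity check on the validation split described in the README text above, the stated article counts can be verified to be internally consistent (the figures below are taken directly from the paragraph; nothing here is new data):

```python
# Validation-set composition as stated in the README paragraph above.
state_media_articles = 3007  # sourced from the news2016zh corpus
nyt_articles = 2032          # scraped from the Chinese-language New York Times

total_validation_articles = state_media_articles + nyt_articles
print(total_validation_articles)  # 5039, matching the stated total
```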