AndyReas committed · Commit dd43248 · 1 Parent(s): d4346e7

Upload README.md with huggingface_hub

  1. README.md +53 -0
README.md ADDED
@@ -0,0 +1,53 @@
+ ---
+ language: en
+
+ license: mit
+
+ widget:
+ - text: "COVID-19 is"
+   example_title: "COVID"
+ - text: "The NBA will"
+   example_title: "NBA"
+ - text: "Breaking news"
+   example_title: "Breaking"
+ ---
+
+ # NewsGPT
+
+ ## Model Description
+ The model is [gpt2](https://huggingface.co/gpt2) fine-tuned to generate news.
+
+ ## Training Data
+ The training data consists of ~13,000,000 English articles from ~90 outlets, each consisting of a headline (title) and a subheading (description). The articles were collected from the [Sciride News Mine](http://sciride.org/news.html) and then cleaned further, for example by removing duplicate articles and by stripping repeated "outlet tags" that appear before or after headlines, such as "| Daily Mail Online".
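The cleaning steps described above could look roughly like the following. This is a minimal sketch, not the script used for this dataset: the regular expression, the function names, and the `(title, description)` tuple layout are assumptions for illustration, and only trailing tags are handled.

```python
import re

# Hypothetical pattern: a trailing "| Outlet Name" tag at the end of a headline.
TRAILING_TAG = re.compile(r"\s*\|\s*[^|]+$")


def strip_outlet_tag(headline: str) -> str:
    """Remove one trailing '| Outlet' tag, if present."""
    return TRAILING_TAG.sub("", headline).strip()


def clean_articles(articles):
    """Strip outlet tags and drop exact duplicate (title, description) pairs."""
    seen = set()
    cleaned = []
    for title, description in articles:
        key = (strip_outlet_tag(title), description.strip())
        if key not in seen:
            seen.add(key)
            cleaned.append(key)
    return cleaned
```

Tags that appear before a headline would need a second, mirrored pattern.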
+
+ The cleaned dataset is available on Hugging Face [here](https://huggingface.co/datasets/AndyReas/frontpage-news).
+ The data was repacked before training to avoid abrupt truncation. Repacking changed the order of the data somewhat, but the sentences themselves are unchanged.
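One way to repack texts so that none is cut mid-sentence is first-fit packing, sketched below. The README does not state which procedure was actually used, so the function name, the `max_tokens` parameter, and the whitespace-based token count are all assumptions for illustration.

```python
def pack_texts(texts, max_tokens):
    """Place each whole text into the first block that still has room.

    Token counts use whitespace splitting as a stand-in for a real tokenizer.
    A text longer than max_tokens gets a block of its own and is not split.
    """
    blocks = []  # each entry is [token_count, list_of_texts]
    for text in texts:
        n = len(text.split())
        for block in blocks:
            if block[0] + n <= max_tokens:
                block[0] += n
                block[1].append(text)
                break
        else:
            # No existing block had room, so start a new one.
            blocks.append([n, [text]])
    return [" ".join(parts) for _, parts in blocks]
```

Because a later short text can fill a gap in an earlier block, the output order differs slightly from the input order while every text stays intact.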
+
+ ## How to use
+ The model can be used with the Hugging Face pipeline like so:
+ ```python
+ >>> from transformers import pipeline
+ >>> generator = pipeline('text-generation', model='andyreas/gennewsgpt')
+ >>> generator("COVID-19 is", max_length=50, num_return_sequences=2)
+
+ [{'generated_text': "COVID-19 is killing more people than the coronavirus. The number of people who have been infected has more than doubled in the past decade, according to a new analysis.The study of 2,000 people by the University of California.The study by"},
+  {'generated_text': "COVID-19 is the worst thing to happen in Canada: A new study. A new study suggests that the COVID-19 pandemic has become the \"best thing to happen in Canada.\". But the pandemic has also been a long-term challenge for"}]
+ ```
+ The model's config.json file includes default text-generation parameters that enable sampling, so the same prompt produces different outputs on each call.
+ These defaults can be overridden. To get consistent outputs, set `do_sample=False`, like so:
+
+ ```python
+ >>> generator("COVID-19 is", do_sample=False)
+ ```
+
+ or increase variance by increasing the number of candidate tokens considered during sampling, like so:
+
+ ```python
+ >>> generator("COVID-19 is", do_sample=True, top_k=50)
+ ```
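To make the effect of `top_k` concrete, here is a toy sketch of top-k sampling over a small, made-up next-token distribution. It is not the `transformers` implementation, which filters logits over the full vocabulary; the function name and the example probabilities are hypothetical.

```python
import random


def sample_top_k(probs, k, rng):
    """Sample a token from the k most probable entries of probs.

    probs maps token -> probability. Only the k best tokens are kept;
    random.choices treats their probabilities as relative weights, which
    renormalizes them implicitly.
    """
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    tokens = [token for token, _ in top]
    weights = [p for _, p in top]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical distribution for the token after "COVID-19 is".
toy_probs = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
```

With `k=1` the most probable token is always chosen, which is comparable to greedy decoding; a larger `k` lets less likely tokens through and so increases variety.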
+
+ ## Training
+ Training ran for 1 epoch with a learning rate of 2e-6 and 50K warm-up steps out of ~800K total steps.
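The warm-up described above can be sketched as a simple schedule function. The linear warm-up shape and the linear decay to zero afterwards are assumptions, since the README only gives the peak learning rate and the step counts; the total of 800,000 steps is the approximate figure rounded.

```python
def learning_rate(step, peak_lr=2e-6, warmup_steps=50_000, total_steps=800_000):
    """Linear warm-up to peak_lr, then linear decay to zero (decay is assumed)."""
    if step < warmup_steps:
        # Ramp up from 0 to peak_lr over the warm-up phase.
        return peak_lr * step / warmup_steps
    # Decay linearly over the remaining steps, never going below zero.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)
```

Under these assumptions the rate is half the peak at step 25,000, reaches 2e-6 at step 50,000, and falls to zero at the final step.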
+
+ ## Bias
+ Like any other model, NewsGPT is subject to the biases present in the data it was trained on.