---
language: en
license: mit
widget:
- text: "COVID-19 is"
  example_title: "COVID"
- text: "The NBA will"
  example_title: "NBA"
- text: "Breaking news"
  example_title: "Breaking"
---

# NewsGPT

## Model Description
NewsGPT is similar to [gpt2](https://huggingface.co/gpt2): it shares its size, architecture, tokenizer algorithm, and causal language modeling objective.
The parameters of a [GPT2LMHeadModel](https://huggingface.co/docs/transformers/v4.26.1/en/model_doc/gpt2#transformers.GPT2LMHeadModel) were randomly initialized, and the model was pre-trained from scratch on a dataset consisting exclusively of news.

## Training Data
The training data consists of roughly 13,000,000 English articles from about 90 outlets, each consisting of a headline (title) and a subheading (description). The articles were collected from the [Sciride News Mine](http://sciride.org/news.html), after which some additional cleaning was performed, such as removing duplicate articles and stripping repeated "outlet tags" that appear before or after headlines (e.g. "| Daily Mail Online").

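The deduplication and outlet-tag stripping mentioned above can be sketched as follows; `clean_headline` and `deduplicate` are hypothetical helpers for illustration, not the actual preprocessing code used for NewsGPT.

```python
def clean_headline(headline, outlet_tags):
    # Strip a known outlet tag when it appears before or after the headline.
    for tag in outlet_tags:
        if headline.endswith(tag):
            headline = headline[: -len(tag)]
        if headline.startswith(tag):
            headline = headline[len(tag):]
    return headline.strip()

def deduplicate(articles):
    # Keep only the first occurrence of each (title, description) pair.
    seen = set()
    unique = []
    for article in articles:
        key = (article["title"], article["description"])
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique
```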
## How to use
The model can be used with the Hugging Face pipeline like so:

```python
>>> from transformers import pipeline
>>> generator = pipeline('text-generation', model='andyreas/newsgpt')
>>> generator("COVID-19 is", max_length=50, num_return_sequences=2)

[{'generated_text': "COVID-19 is killing more people than the coronavirus. The number of people who have been infected has more than doubled in the past decade, according to a new analysis.The study of 2,000 people by the University of California.The study by"},
{'generated_text': "COVID-19 is the worst thing to happen in Canada: A new study. A new study suggests that the COVID-19 pandemic has become the \"best thing to happen in Canada.\". But the pandemic has also been a long-term challenge for"}]
```
The model's config.json includes default parameters for text generation that enable sampling, so the same prompt produces different outputs on each call.
These can be overridden to generate deterministic outputs by setting `do_sample=False`, like so:

```python
>>> generator("COVID-19 is", do_sample=False)
```
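Why does `do_sample=False` make outputs repeatable? Generation then reduces to greedy decoding: at every step the single most probable token is taken, leaving no randomness. A toy illustration of that idea (not the transformers internals):

```python
def greedy_decode(step_logits):
    # With sampling disabled, each step simply takes the argmax token,
    # so the same prompt always yields the same continuation.
    return [max(range(len(logits)), key=logits.__getitem__) for logits in step_logits]
```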

Alternatively, variance can be increased by raising the number of tokens considered during sampling, like so:

```python
>>> generator("COVID-19 is", do_sample=True, top_k=50)
```
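For intuition, `top_k` restricts sampling to the k most probable tokens and renormalizes over them before drawing, so a larger k admits rarer tokens and increases variance. A minimal sketch of the idea (illustrative only, not the transformers implementation):

```python
import math
import random

def top_k_sample(logits, k, rng=random.Random(0)):
    # Keep the k highest-scoring token ids, then draw one with
    # probability proportional to exp(logit) (softmax over the top k).
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = [math.exp(logits[i]) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]
```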

## Bias
Like any other language model, NewsGPT is subject to the biases of the data it was trained on; its outputs reflect the style and perspectives of the news outlets in its training set.