Upload README.md with huggingface_hub
README.md
ADDED
@@ -0,0 +1,53 @@
---
license: apache-2.0
language:
- en
tags:
- usenet
- historical-text
- big8
- nlp
- pretraining
size_categories:
- 100K<n<1M
task_categories:
- text-generation
- text-classification
---

# Usenet Big8 Complete – 100K Sample

## Dataset Summary

This dataset contains a random sample of 100,000 posts drawn from a complete archive of the Usenet Big 8 hierarchies (comp, humanities, misc, news, rec, sci, soc, talk) and of alt, which is included here although it is not formally one of the Big 8.

The sample is released under Apache-2.0 and is intended for **research, evaluation, and prototyping**.

### Full Archive Statistics (Big8 Complete)

| Hierarchy     | Posts       | Tokens          | Oldest | Newest   |
|---------------|-------------|-----------------|--------|----------|
| GLOBAL (Big8) | 155,215,668 | 375,839,972,798 | 1/1/80 | 12/11/13 |

The full archive represents **~376 billion tokens** of authentic, unfiltered pre-social-media discussions spanning 1980–2013.

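To put the archive figures in proportion to this sample, the arithmetic can be sketched as follows. The post and token counts are copied from the table above; the function names are made up for illustration. The estimate of how many tokens the sample holds assumes that the sampled posts have typical length and that tokens are counted the same way as in the table, neither of which this card states. The mean comes out to about 2,400 tokens per post, and the sample covers about 0.064% of the archive's posts.

```python
# Figures copied from the Full Archive Statistics table.
ARCHIVE_POSTS = 155_215_668
ARCHIVE_TOKENS = 375_839_972_798
# Sample size as stated in the dataset summary.
SAMPLE_POSTS = 100_000


def mean_tokens_per_post(tokens: int, posts: int) -> float:
    """Average number of tokens per post across the archive."""
    return tokens / posts


def sample_fraction(sample_posts: int, archive_posts: int) -> float:
    """Share of the archive's posts that the sample covers."""
    return sample_posts / archive_posts


if __name__ == "__main__":
    mean = mean_tokens_per_post(ARCHIVE_TOKENS, ARCHIVE_POSTS)
    frac = sample_fraction(SAMPLE_POSTS, ARCHIVE_POSTS)
    print(f"mean tokens per post: {mean:,.0f}")
    print(f"sample share of archive posts: {frac:.4%}")
    # Rough estimate only: assumes sampled posts have typical length.
    print(f"estimated tokens in sample: {mean * SAMPLE_POSTS:,.0f}")
```
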
### Data Fields

- `id` (string): Unique post identifier
- `group` (string): Newsgroup name, e.g. `comp.ai.neural-nets`
- `date` (string): Post date in YYYY-MM-DD format
- `author` (string): Author email address or handle
- `subject` (string): Subject line
- `text` (string): Full cleaned post body

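A record with these fields can be checked before use. The following is a minimal sketch using only the standard library. It assumes each record arrives as a plain `dict`. `validate_record` and `top_level_hierarchy` are hypothetical helper names, not part of any published loader for this dataset.

```python
import re
from datetime import datetime

# Field names taken from the Data Fields list; all are documented as strings.
EXPECTED_FIELDS = ("id", "group", "date", "author", "subject", "text")

# Strict YYYY-MM-DD shape (zero-padded month and day).
_DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")


def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record looks well-formed."""
    problems = []
    for name in EXPECTED_FIELDS:
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], str):
            problems.append(f"field {name} is not a string")

    date = record.get("date")
    if isinstance(date, str):
        ok = _DATE_SHAPE.fullmatch(date) is not None
        if ok:
            try:
                # Rejects impossible calendar dates such as 1997-02-30.
                datetime.strptime(date, "%Y-%m-%d")
            except ValueError:
                ok = False
        if not ok:
            problems.append(f"date is not YYYY-MM-DD: {date!r}")
    return problems


def top_level_hierarchy(group: str) -> str:
    """Return the first dot-separated component of a newsgroup name."""
    return group.split(".", 1)[0]
```

Returning a list of problems rather than raising on the first one makes it easier to tally which fields are malformed across a whole file. `top_level_hierarchy` can be used to bucket posts by hierarchy, for example when checking how evenly the sample covers them.
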
### Example Instance

```json
{
  "id": "comp.ai.1997-03-15-001",
  "group": "comp.ai.neural-nets",
  "date": "1997-03-15",
  "author": "ai-pioneer@usenet.org",
  "subject": "Backprop Limits in Early NNs",
  "text": "Fellow netters: Backpropagation hits vanishing gradients at depth 3+. Has anyone tested sigmoid alternatives like tanh in real hardware? My DEC Alpha sims show 15% convergence boost, but scaling to 100 neurons crashes. Thoughts? -J"
}
```