Usenet Big8 Complete: 100K Sample
Dataset Summary
This dataset contains a random 100,000-post sample from the complete Usenet Big8 hierarchies (alt, comp, humanities, misc, news, rec, sci, soc, talk).
The sample is released under Apache-2.0 and is intended for research, evaluation, and prototyping.
Full Archive Statistics (Big8 Complete)
| Hierarchy | Posts | Tokens | Oldest | Newest |
|---|---|---|---|---|
| GLOBAL (Big8) | 155,215,668 | 375,839,972,798 | 1/1/80 | 12/11/13 |
The full archive represents ~376 billion tokens of authentic, unfiltered pre-social-media discussions spanning 1980–2013.
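A quick back-of-envelope check on the archive statistics above gives the average post length implied by the table:

```python
# Figures taken from the Full Archive Statistics table above.
total_posts = 155_215_668
total_tokens = 375_839_972_798

avg_tokens_per_post = total_tokens / total_posts
print(round(avg_tokens_per_post))  # roughly 2,400 tokens per post on average
```

Note this is a simple global average; per-hierarchy figures will vary.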
Data Fields
- `id` (string): Unique post identifier
- `group` (string): Newsgroup
- `date` (string): YYYY-MM-DD
- `author` (string): Author email/handle
- `subject` (string): Subject line
- `text` (string): Full cleaned post body
Example Instance
```json
{
  "id": "comp.ai.1997-03-15-001",
  "group": "comp.ai.neural-nets",
  "date": "1997-03-15",
  "author": "ai-pioneer@usenet.org",
  "subject": "Backprop Limits in Early NNs",
  "text": "Fellow netters: Backpropagation hits vanishing gradients at depth 3+. Has anyone tested sigmoid alternatives like tanh in real hardware? My DEC Alpha sims show 15% convergence boost, but scaling to 100 neurons crashes. Thoughts? -J"
}
```
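A minimal sketch of how a consumer might validate a record against the Data Fields schema above. The `validate` helper is an assumption for illustration, not part of the dataset's tooling; it relies only on the documented field names and the YYYY-MM-DD date format.

```python
from datetime import date

# The example instance from this card (text shortened for brevity).
post = {
    "id": "comp.ai.1997-03-15-001",
    "group": "comp.ai.neural-nets",
    "date": "1997-03-15",
    "author": "ai-pioneer@usenet.org",
    "subject": "Backprop Limits in Early NNs",
    "text": "Fellow netters: Backpropagation hits vanishing gradients at depth 3+.",
}

EXPECTED_FIELDS = {"id", "group", "date", "author", "subject", "text"}

def validate(record: dict) -> date:
    """Check a record has the documented fields and a parseable date."""
    missing = EXPECTED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # The date field is documented as YYYY-MM-DD, i.e. ISO format.
    return date.fromisoformat(record["date"])

posted = validate(post)
hierarchy = post["group"].split(".")[0]  # top-level Big8 hierarchy, e.g. "comp"
print(posted.year, hierarchy)
```

Splitting `group` on the first dot recovers the Big8 hierarchy, which is handy for stratifying the sample.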