Usenet Big8 Complete – 100K Sample

Dataset Summary

This dataset contains a random 100,000-post sample from the complete Usenet Big8 hierarchies (alt, comp, humanities, misc, news, rec, sci, soc, talk).

This sample is released under the Apache-2.0 license and is intended for research, evaluation, and prototyping.

Full Archive Statistics (Big8 Complete)

Hierarchy      Posts        Tokens           Oldest   Newest
GLOBAL (Big8)  155,215,668  375,839,972,798  1/1/80   12/11/13

The full archive represents ~376 billion tokens of authentic, unfiltered pre-social-media discussions spanning 1980–2013.

Data Fields

  • id (string): Unique post identifier
  • group (string): Newsgroup
  • date (string): Post date in YYYY-MM-DD format
  • author (string): Author email/handle
  • subject (string): Subject line
  • text (string): Full cleaned post body

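The field list above maps directly onto a simple record type. A minimal sketch in Python (the class name and the sample values are illustrative, not drawn from the dataset):

```python
from dataclasses import dataclass

@dataclass
class Post:
    """One dataset record, mirroring the field list above."""
    id: str       # unique post identifier
    group: str    # newsgroup, e.g. "comp.ai.neural-nets"
    date: str     # post date, YYYY-MM-DD
    author: str   # author email/handle
    subject: str  # subject line
    text: str     # full cleaned post body

# Hypothetical record for illustration only.
p = Post(
    id="rec.arts.1995-06-01-042",
    group="rec.arts.sf.written",
    date="1995-06-01",
    author="reader@example.net",
    subject="Re: Hugo nominees",
    text="Has anyone read the full shortlist yet?",
)
```

The top-level Big8 hierarchy can be recovered from the first dotted component of `group` (here `rec`).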
Example Instance

{
  "id": "comp.ai.1997-03-15-001",
  "group": "comp.ai.neural-nets",
  "date": "1997-03-15",
  "author": "ai-pioneer@usenet.org",
  "subject": "Backprop Limits in Early NNs",
  "text": "Fellow netters: Backpropagation hits vanishing gradients at depth 3+. Has anyone tested sigmoid alternatives like tanh in real hardware? My DEC Alpha sims show 15% convergence boost, but scaling to 100 neurons crashes. Thoughts? -J"
}
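A record like the one above is easy to parse and bucket by hierarchy or year. The card does not state the on-disk format, so this sketch assumes each post is available as a JSON string:

```python
import json

# The example instance above as a JSON string (text abbreviated).
raw = ('{"id": "comp.ai.1997-03-15-001", '
       '"group": "comp.ai.neural-nets", '
       '"date": "1997-03-15", '
       '"author": "ai-pioneer@usenet.org", '
       '"subject": "Backprop Limits in Early NNs", '
       '"text": "Fellow netters: ..."}')

post = json.loads(raw)
hierarchy = post["group"].split(".", 1)[0]  # top-level Big8 hierarchy
year = int(post["date"][:4])                # dates are YYYY-MM-DD strings
print(hierarchy, year)                      # comp 1997
```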