Usenet Big8 Complete – 100K Sample

Dataset Summary

This dataset contains a random 100,000-post sample from the complete Usenet Big8 hierarchies (alt, comp, humanities, misc, news, rec, sci, soc, talk).

This sample is released under the Apache-2.0 license and is intended for research, evaluation, and prototyping.

Full Archive Statistics (Big8 Complete)

Hierarchy      Posts        Tokens           Oldest   Newest
GLOBAL (Big8)  155,215,668  375,839,972,798  1/1/80   12/11/13

The full archive represents ~376 billion tokens of authentic, unfiltered pre-social-media discussions spanning 1980–2013.

Data Fields

  • id (string): Unique post identifier
  • group (string): Newsgroup
  • date (string): Post date in YYYY-MM-DD format
  • author (string): Author email/handle
  • subject (string): Subject line
  • text (string): Full cleaned post body

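The field list above maps directly onto a simple record type. A minimal sketch in Python (the class name and the sample values are illustrative, not drawn from the dataset):

```python
from dataclasses import dataclass

@dataclass
class Post:
    """One dataset record, mirroring the field list above."""
    id: str       # unique post identifier
    group: str    # newsgroup, e.g. "comp.ai.neural-nets"
    date: str     # post date, YYYY-MM-DD
    author: str   # author email/handle
    subject: str  # subject line
    text: str     # full cleaned post body

# Hypothetical record for illustration only.
p = Post(
    id="rec.arts.1995-06-01-042",
    group="rec.arts.sf.written",
    date="1995-06-01",
    author="reader@example.net",
    subject="Re: Hugo nominees",
    text="Has anyone read the full shortlist yet?",
)
```

The top-level Big8 hierarchy can be recovered from the first dotted component of `group` (here `rec`).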
Example Instance

{
  "id": "comp.ai.1997-03-15-001",
  "group": "comp.ai.neural-nets",
  "date": "1997-03-15",
  "author": "ai-pioneer@usenet.org",
  "subject": "Backprop Limits in Early NNs",
  "text": "Fellow netters: Backpropagation hits vanishing gradients at depth 3+. Has anyone tested sigmoid alternatives like tanh in real hardware? My DEC Alpha sims show 15% convergence boost, but scaling to 100 neurons crashes. Thoughts? -J"
}
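A record like the one above is easy to parse and bucket by hierarchy or year. The card does not state the on-disk format, so this sketch assumes each post is available as a JSON string:

```python
import json

# The example instance above as a JSON string (text abbreviated).
raw = ('{"id": "comp.ai.1997-03-15-001", '
       '"group": "comp.ai.neural-nets", '
       '"date": "1997-03-15", '
       '"author": "ai-pioneer@usenet.org", '
       '"subject": "Backprop Limits in Early NNs", '
       '"text": "Fellow netters: ..."}')

post = json.loads(raw)
hierarchy = post["group"].split(".", 1)[0]  # top-level Big8 hierarchy
year = int(post["date"][:4])                # dates are YYYY-MM-DD strings
print(hierarchy, year)                      # comp 1997
```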