OwnedByDanes
/

README.md

+---
+license: apache-2.0
+language:
+  - en
+tags:
+  - usenet
+  - historical-text
+  - big8
+  - nlp
+  - pretraining
+size_categories:
+  - 100K<n<1M
+task_categories:
+  - text-generation
+  - text-classification
+---
+# Usenet Big8 Complete – 100K Sample
+## Dataset Summary
+This dataset contains a random sample of 100,000 posts drawn from the complete Big8 archive, which covers nine Usenet hierarchies: the Big 8 proper (comp, humanities, misc, news, rec, sci, soc, talk) plus alt.
+The sample is released under Apache-2.0 and is intended for **research, evaluation, and prototyping**.
+### Full Archive Statistics (Big8 Complete)
+| Hierarchy       | Posts       | Tokens            | Oldest     | Newest     |
+|-----------------|-------------|-------------------|------------|------------|
+| GLOBAL (Big8)   | 155,215,668 | 375,839,972,798   | 1/1/80     | 12/11/13   |
+The full archive holds **~376 billion tokens** of unfiltered Usenet discussion from 1980 to 2013, much of it predating social media.
+### Data Fields
+- `id` (string): Unique post identifier
+- `group` (string): Newsgroup
+- `date` (string): Post date in YYYY-MM-DD format
+- `author` (string): Author email/handle
+- `subject` (string): Subject line
+- `text` (string): Full cleaned post body
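
The field list above can be checked in code before a record is used downstream. The following is a minimal sketch, not part of the dataset tooling: the names `UsenetPost`, `parse_post`, and `hierarchy` are hypothetical, and it assumes each record arrives as a plain Python dict with exactly the six string fields listed.

```python
from dataclasses import dataclass, fields
from datetime import datetime


@dataclass(frozen=True)
class UsenetPost:
    """One record, mirroring the six documented string fields."""
    id: str
    group: str
    date: str
    author: str
    subject: str
    text: str


def parse_post(record: dict) -> UsenetPost:
    """Validate a raw record (hypothetical helper) and return a typed post."""
    expected = [f.name for f in fields(UsenetPost)]
    missing = [name for name in expected if name not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    for name in expected:
        if not isinstance(record[name], str):
            raise TypeError(f"field {name!r} must be a string")
    # Raises ValueError if the string does not parse as year-month-day.
    datetime.strptime(record["date"], "%Y-%m-%d")
    return UsenetPost(**{name: record[name] for name in expected})


def hierarchy(post: UsenetPost) -> str:
    """Return the top-level hierarchy, i.e. the part before the first dot."""
    return post.group.split(".", 1)[0]
```

Calling `parse_post` on a dict shaped like the documented fields returns a `UsenetPost`, and `hierarchy` then gives the top-level hierarchy of its `group` value, which can be handy for grouping posts by hierarchy.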
+### Example Instance
+```json
+{
+  "id": "comp.ai.1997-03-15-001",
+  "group": "comp.ai.neural-nets",
+  "date": "1997-03-15",
+  "author": "ai-pioneer@usenet.org",
+  "subject": "Backprop Limits in Early NNs",
+  "text": "Fellow netters: Backpropagation hits vanishing gradients at depth 3+. Has anyone tested sigmoid alternatives like tanh in real hardware? My DEC Alpha sims show 15% convergence boost, but scaling to 100 neurons crashes. Thoughts? -J"
+}