OwnedByDanes committed on
Commit bba505f · verified · 1 Parent(s): f3c5b10

Upload README.md with huggingface_hub

---
license: apache-2.0
language:
- en
tags:
- usenet
- historical-text
- big8
- nlp
- pretraining
size_categories:
- 100K<n<1M
task_categories:
- text-generation
- text-classification
---

# Usenet Big8 Complete – 100K Sample

## Dataset Summary

This dataset contains a random 100,000-post sample drawn from the complete Usenet Big8 hierarchies (alt, comp, humanities, misc, news, rec, sci, soc, talk).

The sample is released under Apache-2.0 and intended for **research, evaluation, and prototyping purposes**.

### Full Archive Statistics (Big8 Complete)

| Hierarchy     | Posts       | Tokens          | Oldest | Newest   |
|---------------|-------------|-----------------|--------|----------|
| GLOBAL (Big8) | 155,215,668 | 375,839,972,798 | 1/1/80 | 12/11/13 |

The full archive represents **~376 billion tokens** of authentic, unfiltered pre-social-media discussions spanning 1980–2013.

### Data Fields

- `id` (string): Unique post identifier
- `group` (string): Newsgroup name
- `date` (string): Post date in `YYYY-MM-DD` format
- `author` (string): Author email/handle
- `subject` (string): Subject line
- `text` (string): Full cleaned post body

### Example Instance

```json
{
  "id": "comp.ai.1997-03-15-001",
  "group": "comp.ai.neural-nets",
  "date": "1997-03-15",
  "author": "ai-pioneer@usenet.org",
  "subject": "Backprop Limits in Early NNs",
  "text": "Fellow netters: Backpropagation hits vanishing gradients at depth 3+. Has anyone tested sigmoid alternatives like tanh in real hardware? My DEC Alpha sims show 15% convergence boost, but scaling to 100 neurons crashes. Thoughts? -J"
}
```