gate369 committed on
Commit 989f378 · verified · 1 Parent(s): 714f3d9

Upload Tiny Epstein 100M model

Files changed (5)
  1. README.md +109 -0
  2. config.json +15 -0
  3. pytorch_model.bin +3 -0
  4. tokenizer.json +0 -0
  5. tokenizer_config.json +12 -0
README.md ADDED
@@ -0,0 +1,109 @@
+ ---
+ language: en
+ license: mit
+ tags:
+ - tiny-epstein
+ - epstein-files
+ - transformers
+ ---
+
+ # tiny-epstein-100m
+
+ A small transformer model (~100M parameters) trained on the [teyler/epstein-files-20k](https://huggingface.co/datasets/teyler/epstein-files-20k) dataset.
+ The architecture is inspired by **Tiny Aya** modifications and is designed for efficient on-device inference.
+
+ ## Model Details
+
+ - **Architecture**: Decoder-only transformer with parallel blocks, Grouped Query Attention (GQA), SwiGLU activation, and bias‑free LayerNorm.
+ - **Sliding Window Attention**: 3:1 local:global ratio (the first 75% of layers use sliding-window attention with RoPE; the remaining layers use full attention with NoPE).
+ - **Parameters**: ~100 million
+ - **Context Length**: 1024 tokens (configurable)
+ - **Tokenizer**: GPT‑2 (same as used during training)
+ - **Training Data**: [teyler/epstein-files-20k](https://huggingface.co/datasets/teyler/epstein-files-20k) – 20,000 documents related to the Epstein files.
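The local/global layer split described above follows directly from `num_layers` and `sliding_window_ratio`. A minimal sketch (the function name is illustrative, not from the training script; the parameter names follow `config.json`):

```python
def layer_attention_types(num_layers: int, sliding_window_ratio: float) -> list:
    """Assign each layer 'local' (sliding window + RoPE) or 'global' (full attention + NoPE)."""
    n_local = int(num_layers * sliding_window_ratio)  # first 75% of layers are local
    return ["local"] * n_local + ["global"] * (num_layers - n_local)

types = layer_attention_types(12, 0.75)
print(types)  # 9 local layers followed by 3 global layers: the 3:1 ratio
```

With the shipped config (12 layers, ratio 0.75), this yields 9 sliding-window layers and 3 full-attention layers.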
+
+ ## Intended Use
+
+ This model is primarily for research and experimentation. It can generate continuations of text given a prompt, especially on topics related to the Epstein files.
+
+ ## How to Use
+
+ ### Installation
+
+ To run inference, make sure `torch` and `transformers` are installed:
+
+ ```bash
+ pip install torch transformers
+ ```
+
+ ### Loading the Model and Tokenizer
+
+ ```python
+ import os
+
+ import torch
+ from transformers import AutoTokenizer
+ from huggingface_hub import snapshot_download
+
+ # Download the model from the Hugging Face Hub
+ model_path = snapshot_download(repo_id="liminerity/tiny-epstein-100m")
+
+ # Load the tokenizer
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+
+ # Loading the model requires the custom model class definition; it is
+ # included in the training script. Copy it into this file if needed.
+
+ # Define the model config (must match the saved config.json)
+ class ModelConfig:
+     vocab_size = 50257
+     emb_dim = 768
+     hidden_dim = 2048
+     num_layers = 12
+     num_heads = 12
+     num_kv_heads = 4
+     max_seq_len = 1024
+     window_size = 1024
+     sliding_window_ratio = 0.75
+     rope_theta = 10000.0
+     dtype = torch.float16
+     bias = False
+     dropout = 0.0
+
+ # Instantiate the model (assumes the TinyAya class from the training script is in scope)
+ model = TinyAya(ModelConfig())
+ state_dict = torch.load(os.path.join(model_path, "pytorch_model.bin"), map_location="cpu")
+ model.load_state_dict(state_dict)
+ model.eval()
+ ```
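Rather than hard-coding the values, the same settings can be read from the repository's `config.json`. A sketch (here the JSON is inlined from this repo's `config.json` so the snippet is self-contained; `SimpleNamespace` is just a convenient stand-in for whatever config object the training script expects):

```python
import json
from types import SimpleNamespace

# In practice: with open(os.path.join(model_path, "config.json")) as f: raw = json.load(f)
raw = json.loads("""{
  "vocab_size": 50257, "emb_dim": 768, "hidden_dim": 2048,
  "num_layers": 12, "num_heads": 12, "num_kv_heads": 4,
  "max_seq_len": 1024, "window_size": 1024, "sliding_window_ratio": 0.75,
  "rope_theta": 10000.0, "dtype": "torch.float16", "bias": false, "dropout": 0.0
}""")
cfg = SimpleNamespace(**raw)
# Note: "dtype" is stored as a string in config.json and must be mapped
# to a real torch dtype (e.g. torch.float16) by hand before use.
print(cfg.num_layers, cfg.num_kv_heads)  # 12 4
```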
+
+ ### Text Generation Example
+
+ ```python
+ prompt = "The Epstein files reveal"
+ inputs = tokenizer(prompt, return_tensors="pt")
+ with torch.no_grad():
+     outputs = model.generate(
+         inputs.input_ids,
+         max_new_tokens=50,
+         temperature=0.8,
+         do_sample=True,
+     )
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+
+ ## Training Details
+
+ The model was trained for one epoch on the full dataset using an L4 GPU in Google Colab.
+ Optimizer: AdamW (lr=1e-4) with gradient clipping (max norm = 1.0). Mixed precision (float16) was used.
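The optimizer setup above can be sketched as a single PyTorch training step. This is illustrative only: the `nn.Linear` model and random batch are stand-ins, not the actual TinyAya training loop, and on GPU the forward/backward pass would additionally be wrapped in `torch.autocast` with a `GradScaler` for float16.

```python
import torch
import torch.nn as nn

# Stand-in model and batch; the real loop trains TinyAya on the dataset.
model = nn.Linear(16, 50257)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 16)
targets = torch.randint(0, 50257, (8,))

optimizer.zero_grad()
logits = model(x)
loss = nn.functional.cross_entropy(logits, targets)
loss.backward()

# Gradient clipping as described: max norm 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```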
+
+ ## Limitations
+
+ - The model is small and was trained on a limited dataset; it may produce repetitive or nonsensical outputs.
+ - It has not undergone any safety fine‑tuning; use with caution.
+
+ ## License
+
+ MIT
config.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "vocab_size": 50257,
+   "emb_dim": 768,
+   "hidden_dim": 2048,
+   "num_layers": 12,
+   "num_heads": 12,
+   "num_kv_heads": 4,
+   "max_seq_len": 1024,
+   "window_size": 1024,
+   "sliding_window_ratio": 0.75,
+   "rope_theta": 10000.0,
+   "dtype": "torch.float16",
+   "bias": false,
+   "dropout": 0.0
+ }
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9d45ba5b7dafe97cd332405e514ead294500c539dd188ab7d87eb9f9e0820f56
+ size 456456877
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "add_prefix_space": false,
+   "backend": "tokenizers",
+   "bos_token": "<|endoftext|>",
+   "eos_token": "<|endoftext|>",
+   "errors": "replace",
+   "is_local": false,
+   "model_max_length": 1024,
+   "pad_token": "<|endoftext|>",
+   "tokenizer_class": "GPT2Tokenizer",
+   "unk_token": "<|endoftext|>"
+ }