---
license: apache-2.0
tags:
- base-model
- causal-lm
- qwen3
- transformer
language:
- en
pipeline_tag: text-generation
---

# QVAC Genesis I Pretrained Model

## Key Highlights
- **Pretrained on the Largest Synthetic Educational Dataset**  
  This model has been **pretrained on Tether's QVAC Genesis I**, the largest synthetic dataset released for educational LLM pre-training.

  The model was trained **from scratch** on approximately **40B tokens** of multi-domain educational text, using **BF16 mixed precision** and a **4,096-token context window**. Training used a **Qwen3-family 1.7B-parameter decoder-only transformer** architecture.

  Checkpoints are provided in standard Hugging Face format for easy inference, continual pre-training, and fine-tuning.

- **Multi-Domain Educational Coverage**  
  Because the model is trained on QVAC Genesis I, it inherits curriculum-aligned coverage across:  
  - Mathematics  
  - Physics  
  - Biology  
  - Medicine  

- **Superior Benchmark Performance**  
  Leveraging QVAC Genesis I as its training foundation, the model consistently outperforms baselines in:  
  - Reasoning tasks  
  - Knowledge assessments  
  - Subject-specific QA  

- **First Publicly Released Education-Specific Pretrained Model**  
  This is the first open-source pretrained model built directly on a rigorously validated synthetic dataset for education, offering deep and comprehensive STEM coverage.

## Intended Uses
- Continual pre-training or fine-tuning for educational applications (STEM-focused tutoring, QA systems, curriculum support)  
- Benchmarking reasoning and subject-specific QA performance  
- Research into synthetic dataset–driven LLM training  

---

## Model Details

### Model Description

- **Developed by:** QVAC by Tether
- **Model type:** Decoder-only Transformer (causal LM)
- **Language(s) (NLP):** Primarily English 
- **License:** Apache-2.0
- **Finetuned from model:** **None (trained from scratch)**
- **Intended stage:** **Base pre-trained model** (no SFT / RLHF alignment)

### Model Sources

- **Repository:** https://huggingface.co/qvac/genesisI-model
- **Paper / Blog:** https://huggingface.co/blog/qvac/genesis-i

---

## Uses

### Direct Use

- General language modeling: next-token prediction, continuation, summarization, drafting.
- Research baseline for scaling, data ablations, or tokenizer studies.

### Downstream Use (recommended)
- **CPT:** continued pre-training on more tokens.
- **SFT:** supervised fine-tuning for assistants, domain experts, or task-specific models.
- **Preference optimization / RLHF:** safer, more helpful behavior.
- **Adapters / LoRA:** efficient domain specialization (see the sketch below).
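
A minimal LoRA sketch with the PEFT library follows. The `target_modules` names are typical for Qwen-style attention blocks but are assumptions here; check them against the checkpoint's actual module names.

```python
# Hedged sketch: LoRA adapter setup via PEFT. Rank, alpha, and the
# target_modules list are illustrative, not an official recipe.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("qvac/genesisI-model")
lora_cfg = LoraConfig(
    r=16,                     # adapter rank
    lora_alpha=32,            # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only adapter weights train
```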

### Out-of-Scope Use

- High-stakes decision-making (medical/financial/legal).
- Safety-critical or autonomous control systems.
- Unfiltered end-user chat deployment without alignment / safety layers.
- Any use that violates applicable laws or platform policies.

---

## Bias, Risks, and Limitations

- **Bias & toxicity:** May reflect or amplify biases present in web text.
- **Hallucinations:** Can produce confident but incorrect statements or citations.
- **Security / privacy:** May emit continuous random strings.
- **Context limit:** 4,096 tokens; longer inputs require chunking.

### Recommendations

- Disclose limitations to downstream users.
- Research model: not intended for production use cases.

---

## How to Get Started

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "qvac/genesisI-model"

tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # trained with BF16 mixed precision
    device_map="auto"
)

prompt = "Explain precision vs. recall in one paragraph."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    top_p=0.9,
    temperature=0.7
)
print(tok.decode(out[0], skip_special_tokens=True))
```

*Tip: On consumer GPUs, consider loading in `float16` or using 4/8-bit quantization (e.g., bitsandbytes/AutoGPTQ).*
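
For instance, 4-bit NF4 loading with bitsandbytes might look like the following sketch (illustrative settings, not an official recipe):

```python
# Hedged sketch: 4-bit NF4 quantized loading (requires the bitsandbytes
# package and a CUDA GPU).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "qvac/genesisI-model"
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 data type for weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in BF16
)
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_cfg,
    device_map="auto",
)
```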

---

## Training Details

### Training Data

* **Size:** ~**40B tokens**, single epoch.
* **Domains:** Mixed general + STEM/technical sources (expository text, problem sets, references).
* **Format:** Hugging Face Datasets (Arrow).
* **Tokenizer:** **Qwen3** tokenizer.
* **Processing:** Normalization, filtering of extremes, document chunking to fit the **4096**-token context, and sequence packing where applicable (see the sketch below).
* **Dataset Card:** *Coming Soon*
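
As a rough illustration of the chunk-and-pack step, the sketch below concatenates EOS-separated token streams and slices them into fixed-length sequences; the actual pipeline was not released, so treat this as an assumption-laden outline.

```python
# Hedged sketch: EOS-separated document packing into fixed-length chunks.
from itertools import chain

def pack_documents(tokenized_docs, eos_id, context_len=4096):
    """Concatenate documents (EOS-separated) and slice the stream into
    fixed-length training sequences, dropping the final partial chunk."""
    stream = list(chain.from_iterable(ids + [eos_id] for ids in tokenized_docs))
    return [
        stream[i : i + context_len]
        for i in range(0, len(stream) - context_len + 1, context_len)
    ]

# Toy demo with a tiny context window: two docs -> one packed sequence.
print(pack_documents([[1, 2, 3], [4, 5]], eos_id=0, context_len=4))
# -> [[1, 2, 3, 0]]
```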

### Training Procedure

#### Preprocessing

* Unicode normalization, whitespace cleanup, control-char stripping.
* Length filtering; chunking to 4096; optional packing to improve throughput.
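
A minimal sketch of such a cleaning pass, assuming NFC normalization and a simple control-character class (the exact filters used in training were not published):

```python
# Hedged sketch: Unicode normalization + control-char stripping + whitespace
# cleanup, per the bullets above.
import re
import unicodedata

# Control characters except \t, \n, \r.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

def clean_text(text: str) -> str:
    text = unicodedata.normalize("NFC", text)  # canonical Unicode form
    text = CONTROL_CHARS.sub("", text)         # strip control characters
    text = re.sub(r"[ \t]+", " ", text)        # collapse runs of spaces/tabs
    return text.strip()

print(clean_text("Hello\x07   world\t!"))  # -> "Hello world !"
```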

#### Training Hyperparameters

* **Optimizer:** AdamW (β₁=0.9, β₂=0.95), **weight decay 0.01**
* **Learning rate:** **2e-4** (linear warmup)
* **Warmup:** **600** steps (~10% of max steps)
* **Precision:** **BF16 mixed precision**
* **Gradient clipping:** **1.0**
* **Seed:** **42**
* **Logging:** Every **50** steps
* **Eval:** Every **500** steps (20 iters)
* **Checkpointing:** Every **1000** steps (sharded; full optimizer/state resume)
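
Wired up in plain PyTorch, these settings might look like the sketch below (the original training script is not public, so this is an approximation):

```python
# Hedged sketch: AdamW + linear warmup + gradient clipping with the stated values.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(8, 8)  # stand-in for the 1.7B model

optimizer = AdamW(model.parameters(), lr=2e-4,
                  betas=(0.9, 0.95), weight_decay=0.01)

WARMUP_STEPS = 600
scheduler = LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / WARMUP_STEPS),  # linear warmup
)

# Per training step, after loss.backward():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
scheduler.step()
```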

#### Speeds, Sizes, Times

* **Per-GPU micro-batch:** 4
* **Grad accumulation:** 8
* **World size:** 480 GPUs
* **Effective global batch:** `4 × 8 × 480 = 15,360` samples/step
* **Step time (indicative):** ~**1.5 s/step** (cluster- and I/O-dependent)

#### Stability & Performance

* Activation checkpointing.
* Fused kernels where available (fused attention/optimizer).
* **FlashAttention-2** on H100.
* `torch.compile` (safe mode) after warmup stability.
* Dynamic loss scaling to mitigate BF16 overflow.
* Fragmentation mitigations (e.g., `max_split_size_mb=512`, expandable segments, GC threshold ~0.8).
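
The fragmentation mitigations map onto PyTorch's CUDA caching-allocator options; a hedged sketch, assuming they were passed through `PYTORCH_CUDA_ALLOC_CONF`:

```python
# Hedged sketch: allocator settings matching the values above. This must run
# before the first CUDA allocation (i.e., before torch initializes CUDA).
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
    "max_split_size_mb:512,"            # cap block splitting
    "expandable_segments:True,"         # let segments grow in place
    "garbage_collection_threshold:0.8"  # reclaim cache past ~80% usage
)

import torch  # noqa: E402  (imported after configuring the allocator)
```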

---

## Multi-Node GPU Setup

* **Cluster:** ~**60 nodes**, each **8× NVIDIA H100 80GB** (total **480 GPUs**), ~800 GB RAM/node.
* **Scheduler:** Slurm (priority partition, exclusive allocation, 72-hour limit).
* **Launch:** `srun` + PyTorch DDP (world size 480; ranks bound via Slurm env; see the sketch below).
* **Storage:** Sharded checkpoints; periodic saves for robust resume.
* **Networking:** NCCL over InfiniBand with UCX.

  * `NCCL_IB_DISABLE=0`, `NCCL_IB_HCA="mlx5*"`, `NCCL_SOCKET_IFNAME=<ib0/enoX>`, `NCCL_BLOCKING_WAIT=1`
  * Watchdog ~**720s** for fail-fast on fabric issues
* **I/O:** Async dataset prefetching; pinned FS threads.
* **Observability:** W&B + structured logs (throughput, TFLOPs/GPU, mem, step time).
* **Reproducibility:** Fixed seeds; exact launch scripts/env logged; effective tokens/step reported.
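
A minimal sketch of the Slurm-to-DDP rank binding referenced above (env-var names are standard Slurm/PyTorch conventions; the project's actual launcher is only outlined in the Reproducibility section):

```python
# Hedged sketch: map Slurm-provided ranks onto torch.distributed.
import os

import torch
import torch.distributed as dist

rank = int(os.environ["SLURM_PROCID"])         # global rank, 0..479
world_size = int(os.environ["SLURM_NTASKS"])   # 480 for the full cluster
local_rank = int(os.environ["SLURM_LOCALID"])  # GPU index within the node

torch.cuda.set_device(local_rank)
dist.init_process_group(
    backend="nccl",          # NCCL over InfiniBand, per the setup above
    rank=rank,
    world_size=world_size,   # MASTER_ADDR / MASTER_PORT must also be set
)
```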

> Final checkpoint converted to **Hugging Face format** for plug-and-play inference.

---

## Evaluation

### Testing Data, Factors & Metrics

* **Testing data:** Standard academic suites (e.g., EleutherAI LM Evaluation Harness).
* **Factors:** Domain/topic (STEM vs. general), task type (multi-choice vs. open-ended).
* **Metrics:** Accuracy (MCQ), EM/F1 (QA), plus task-native metrics.

**Suggested suite (edit as applicable):**

* General knowledge & reasoning: **MMLU (STEM subsets)**, **ARC-E/ARC-C**, **HellaSwag**, **PIQA**, **Winogrande**
* Math/coding (optional): **GSM8K**, **HumanEval**
* Reading comprehension (optional): **BoolQ**, **RACE**
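
As an illustration, part of this suite could be scored with the harness's Python API (lm-eval ≥ 0.4); the task names and settings below are examples, not the pinned evaluation config:

```python
# Hedged sketch: zero-shot evaluation via the EleutherAI LM Evaluation Harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=qvac/genesisI-model,dtype=bfloat16",
    tasks=["arc_easy", "arc_challenge", "hellaswag", "piqa", "winogrande"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])  # per-task metric dict
```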

### Results

* *To be released with an evaluated checkpoint and a pinned harness version.*
  Results tables will include exact versions, seeds, and commit hashes.

#### Summary

* Base LM targets broad generalization at ~40B tokens.
* Expect material gains after SFT + preference optimization for target tasks.

---

## Technical Specifications

### Model Architecture and Objective

* **Architecture:** Qwen3-style decoder-only Transformer
* **Parameters:** ~**1.7B**
* **Context length:** **4,096** tokens
* **Positional encoding:** Rotary (RoPE), per the Qwen3 architecture
* **Attention:** Multi-head scaled dot-product; FlashAttention-2 enabled on H100
* **Activation:** SiLU (SwiGLU MLP), per the Qwen3 architecture
* **Norms:** RMSNorm, per the Qwen3 architecture
* **Objective:** **Causal LM** (next-token prediction)
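
These details can be cross-checked against the released checkpoint itself; a quick inspection sketch (field names follow the standard Hugging Face config conventions):

```python
# Hedged sketch: read architectural fields straight from the model config.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("qvac/genesisI-model")
print(cfg.model_type)               # expected: a Qwen3-style causal LM
print(cfg.max_position_embeddings)  # expected: 4096
print(cfg.hidden_size, cfg.num_hidden_layers, cfg.num_attention_heads)
```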

### Compute Infrastructure

**Hardware**

* 60 nodes × 8× H100 80GB, ~800 GB RAM/node, InfiniBand fabric.

**Software**

* PyTorch ≥ 2.1 (CUDA 12.x), FlashAttention-2, UCX/NCCL
* Slurm for orchestration; W&B for logging
* (Optional) DeepSpeed ZeRO-3 for training; HF conversion post-training

---

## Reproducibility (Launch Sketch)

```bash
# Slurm (illustrative)
srun -N 60 -n 480 --ntasks-per-node=8 --gpus-per-task=1 \
  --cpus-per-task=8 --mem=0 \
  bash -lc '
  export NCCL_IB_DISABLE=0
  export NCCL_IB_HCA="mlx5*"
  export NCCL_SOCKET_IFNAME=ib0
  export NCCL_BLOCKING_WAIT=1
  export TORCH_DISTRIBUTED_DEBUG=DETAIL

  python train.py \
    --model qwen3_1p7b_from_scratch \
    --tokenizer qwen3 \
    --data_path /path/to/arrow \
    --context_length 4096 \
    --optimizer adamw --weight_decay 0.01 \
    --lr 2e-4 --warmup_steps 600 \
    --precision bf16-mixed \
    --micro_batch_size 4 \
    --grad_accum_steps 8 \
    --eval_every 500 --log_every 50 \
    --ckpt_every 1000 \
    --activation_checkpointing \
    --flash_attn 2 \
    --compile safe \
    --seed 42
'
```

---

## Conversion & Inference

* Checkpoints are **HF-compatible**: load with `AutoModelForCausalLM`.
* For memory-limited environments, prefer half-precision or 4/8-bit loading.
* Distribute as `safetensors` for integrity.
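
For example, a loaded checkpoint can be re-saved as `safetensors` shards (a sketch; `safe_serialization=True` is the default in recent `transformers` releases):

```python
# Hedged sketch: round-trip the checkpoint to local safetensors shards.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("qvac/genesisI-model")
tok = AutoTokenizer.from_pretrained("qvac/genesisI-model")

model.save_pretrained("genesisI-local", safe_serialization=True)
tok.save_pretrained("genesisI-local")
```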

---


## Changelog

* **v0.1 (2025-11-17):** Initial public release — 40B-token 1-epoch pretrain; HF conversion.