|
|
--- |
|
|
license: apache-2.0 |
|
|
tags: |
|
|
- base-model |
|
|
- causal-lm |
|
|
- qwen3 |
|
|
- transformer |
|
|
language: |
|
|
- en |
|
|
pipeline_tag: text-generation |
|
|
--- |
|
|
|
|
|
# QVAC Genesis I Pretrained Model |
|
|
|
|
|
## Key Highlights |
|
|
- **Pretrained on the Largest Synthetic Educational Dataset** |
|
|
This model has been **pretrained on Tether's QVAC Genesis I**, the largest synthetic dataset released for educational LLM pre-training. |
|
|
|
|
|
The model was trained **from scratch** on approximately **40B tokens** of multi-domain educational text, using **BF16 mixed precision** and a **4,096-token context window**. Training used a **Qwen3-family 1.7B-parameter decoder-only transformer** architecture.
|
|
|
|
|
Checkpoints are provided in standard Hugging Face format for easy inference, continual pre-training, and fine-tuning. |
|
|
|
|
|
- **Multi-Domain Educational Coverage** |
|
|
Because the model is trained on QVAC Genesis I, it inherits curriculum-aligned coverage across: |
|
|
- Mathematics |
|
|
- Physics |
|
|
- Biology |
|
|
- Medicine |
|
|
|
|
|
- **Superior Benchmark Performance** |
|
|
Leveraging QVAC Genesis I as its training foundation, the model consistently outperforms baselines in: |
|
|
- Reasoning tasks |
|
|
- Knowledge assessments |
|
|
- Subject-specific QA |
|
|
|
|
|
- **First Publicly Released Education-Specific Pretrained Model** |
|
|
This is the first open-source pretrained model built directly on a rigorously validated synthetic dataset for education, offering deep and comprehensive STEM coverage. |
|
|
|
|
|
|
|
## Intended Uses |
|
|
- Continual pre-training or fine-tuning for educational applications (STEM-focused tutoring, QA systems, curriculum support) |
|
|
- Benchmarking reasoning and subject-specific QA performance |
|
|
- Research into synthetic dataset–driven LLM training |
|
|
|
|
|
--- |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
- **Developed by:** QVAC by Tether
|
|
- **Model type:** Decoder-only Transformer (causal LM) |
|
|
- **Language(s) (NLP):** Primarily English |
|
|
- **License:** Apache-2.0 |
|
|
- **Finetuned from model:** **None (trained from scratch)** |
|
|
- **Intended stage:** **Base pre-trained model** (no SFT / RLHF alignment) |
|
|
|
|
|
### Model Sources
|
|
|
|
|
- **Repository:** https://huggingface.co/qvac/genesisI-model |
|
|
- **Paper / Blog:** https://huggingface.co/blog/qvac/genesis-i
|
|
|
|
|
--- |
|
|
|
|
|
## Uses |
|
|
|
|
|
### Direct Use |
|
|
|
|
|
- General language modeling: next-token prediction, continuation, summarization, drafting. |
|
|
- Research baseline for scaling, data ablations, or tokenizer studies. |
|
|
|
|
|
### Downstream Use (recommended) |
|
|
- **CPT** (continued pre-training) on more tokens.
|
|
- **SFT** for assistants, domain experts, or task-specific models. |
|
|
- **Preference optimization / RLHF** for safer, more helpful behavior. |
|
|
- **Adapters/LoRA** for efficient domain specialization. |
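For the adapter route, a minimal LoRA sketch with `peft` follows; the `target_modules` names mirror common Qwen-style projection naming and are an assumption to verify against the released weights.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "qvac/genesisI-model", torch_dtype=torch.bfloat16
)

# Hypothetical adapter config; rank/targets are illustrative defaults.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically <1% of the ~1.7B parameters
```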
|
|
|
|
|
### Out-of-Scope Use |
|
|
|
|
|
- High-stakes decision-making (medical/financial/legal). |
|
|
- Safety-critical or autonomous control systems. |
|
|
- Unfiltered end-user chat deployment without alignment / safety layers. |
|
|
- Any use that violates applicable laws or platform policies. |
|
|
|
|
|
--- |
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
|
|
|
- **Bias & toxicity:** May reflect or amplify biases present in web text. |
|
|
- **Hallucinations:** Can produce confident but incorrect statements or citations. |
|
|
- **Security / privacy:** May emit continuous random strings.
|
|
- **Context limit:** 4,096 tokens; longer inputs require chunking (a minimal sketch follows).
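Given the 4,096-token limit, longer documents must be split before inference. A minimal sliding-window sketch (the overlap size is an arbitrary illustrative choice, not a released setting):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("qvac/genesisI-model", trust_remote_code=True)

def chunk_ids(text, max_len=4096, stride=512):
    """Split a long text into token windows that fit the context,
    overlapping by `stride` tokens to preserve local continuity."""
    ids = tok(text, add_special_tokens=False)["input_ids"]
    step = max_len - stride
    return [ids[i : i + max_len] for i in range(0, len(ids), step)]
```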
|
|
|
|
|
### Recommendations |
|
|
|
|
|
- Disclose limitations to downstream users. |
|
|
- **Research model:** not intended for production use.
|
|
|
|
|
--- |
|
|
|
|
|
## How to Get Started |
|
|
|
|
|
```python |
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
|
import torch |
|
|
|
|
|
model_id = "qvac/genesisI-model" |
|
|
|
|
|
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True) |
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
model_id, |
|
|
torch_dtype=torch.bfloat16, # trained with BF16 mixed precision |
|
|
device_map="auto" |
|
|
) |
|
|
|
|
|
prompt = "Explain precision vs. recall in one paragraph." |
|
|
inputs = tok(prompt, return_tensors="pt").to(model.device) |
|
|
out = model.generate( |
|
|
**inputs, |
|
|
max_new_tokens=256, |
|
|
do_sample=True, |
|
|
top_p=0.9, |
|
|
temperature=0.7 |
|
|
) |
|
|
print(tok.decode(out[0], skip_special_tokens=True)) |
|
|
```
|
|
|
|
|
*Tip: On consumer GPUs, consider loading in `float16` or using 4/8-bit quantization (e.g., bitsandbytes/AutoGPTQ).* |
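A hedged 4-bit loading sketch with `bitsandbytes` (requires a CUDA GPU and the `bitsandbytes` package; the NF4 settings are common defaults, not a tested configuration for this model):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in BF16
)
model = AutoModelForCausalLM.from_pretrained(
    "qvac/genesisI-model",
    quantization_config=bnb_cfg,
    device_map="auto",
)
```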
|
|
|
|
|
--- |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
* **Size:** ~**40B tokens**, single epoch. |
|
|
* **Domains:** Mixed general + STEM/technical sources (expository text, problem sets, references). |
|
|
* **Format:** Hugging Face Datasets (Arrow). |
|
|
* **Tokenizer:** **Qwen3** tokenizer. |
|
|
* **Processing:** Normalization, filtering of extremes, document chunking to fit the **4,096-token** context, sequence packing where applicable.
|
|
* **Dataset Card:** *Coming Soon* |
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
#### Preprocessing |
|
|
|
|
|
* Unicode normalization, whitespace cleanup, control-char stripping. |
|
|
* Length filtering; chunking to 4096; optional packing to improve throughput. |
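As an illustration of the chunking/packing step (the exact pipeline is not released; EOS-joined concatenation into fixed blocks is a common scheme and is assumed here):

```python
def pack_sequences(tokenized_docs, eos_id, block_size=4096):
    """Concatenate tokenized documents with EOS separators, then slice
    into fixed-length blocks so no padding is needed during training."""
    buffer = []
    for ids in tokenized_docs:
        buffer.extend(ids + [eos_id])
    n_blocks = len(buffer) // block_size  # drop the ragged tail
    return [buffer[i * block_size : (i + 1) * block_size] for i in range(n_blocks)]
```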
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
|
|
* **Optimizer:** AdamW (β₁=0.9, β₂=0.95), **weight decay 0.01** |
|
|
* **Learning rate:** **2e-4** (linear warmup) |
|
|
* **Warmup:** **600** steps (~10% of max steps) |
|
|
* **Precision:** **BF16 mixed precision** |
|
|
* **Gradient clipping:** **1.0** |
|
|
* **Seed:** **42** |
|
|
* **Logging:** Every **50** steps |
|
|
* **Eval:** Every **500** steps (20 iterations per evaluation)
|
|
* **Checkpointing:** Every **1000** steps (sharded; full optimizer/state resume) |
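Read together, these settings correspond to a training step roughly like the sketch below; `dataloader` is a placeholder, and `max_steps` is inferred from the "warmup ≈ 10% of max steps" note rather than published.

```python
import torch
from transformers import AutoModelForCausalLM, get_linear_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained(
    "qvac/genesisI-model", torch_dtype=torch.bfloat16
)
max_steps = 6_000  # placeholder: 600 warmup steps / 10%

optimizer = torch.optim.AdamW(
    model.parameters(), lr=2e-4, betas=(0.9, 0.95), weight_decay=0.01
)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=600, num_training_steps=max_steps
)

for batch in dataloader:  # placeholder: yields dicts with input_ids/labels
    loss = model(**batch).loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip at 1.0
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```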
|
|
|
|
|
#### Speeds, Sizes, Times |
|
|
|
|
|
* **Per-GPU micro-batch:** 4 |
|
|
* **Grad accumulation:** 8 |
|
|
* **World size:** 480 GPUs |
|
|
* **Effective global batch:** `4 × 8 × 480 = 15,360` samples/step |
|
|
* **Step time (indicative):** ~**1.5 s/step** (cluster- and I/O-dependent)
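These figures imply the per-step token budget directly:

```python
micro_batch, grad_accum, world_size, ctx_len = 4, 8, 480, 4096
samples_per_step = micro_batch * grad_accum * world_size  # 15,360 samples/step
tokens_per_step = samples_per_step * ctx_len              # 62,914,560 (~63M) tokens/step
print(f"{samples_per_step=:,} {tokens_per_step=:,}")
```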
|
|
|
|
|
#### Stability & Performance |
|
|
|
|
|
* Activation checkpointing. |
|
|
* Fused kernels where available (fused attention/optimizer). |
|
|
* **FlashAttention-2** on H100. |
|
|
* `torch.compile` (safe mode) after warmup stability. |
|
|
* Dynamic loss scaling to mitigate BF16 overflow. |
|
|
* Fragmentation mitigations (e.g., `max_split_size_mb=512`, expandable segments, GC threshold ~0.8). |
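A sketch of how those allocator mitigations are typically expressed via PyTorch's CUDA allocator config (the env var must be set before the first CUDA allocation, e.g., in the launch environment):

```python
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
    "max_split_size_mb:512,"             # cap block splitting to curb fragmentation
    "expandable_segments:True,"          # let segments grow instead of re-allocating
    "garbage_collection_threshold:0.8"   # reclaim cached blocks above ~80% usage
)
import torch  # import after setting the env var so the allocator picks it up
```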
|
|
|
|
|
--- |
|
|
|
|
|
## Multi-Node GPU Setup |
|
|
|
|
|
* **Cluster:** ~**60 nodes**, each **8× NVIDIA H100 80GB** (total **480 GPUs**), ~800 GB RAM/node. |
|
|
* **Scheduler:** Slurm (priority partition, exclusive allocation, 72-hour limit). |
|
|
* **Launch:** `srun` + PyTorch DDP (world size 480; ranks bound via Slurm env). |
|
|
* **Storage:** Sharded checkpoints; periodic saves for robust resume. |
|
|
* **Networking:** NCCL over InfiniBand with UCX |
|
|
|
|
|
* `NCCL_IB_DISABLE=0`, `NCCL_IB_HCA="mlx5*"`, `NCCL_SOCKET_IFNAME=<ib0/enoX>`, `NCCL_BLOCKING_WAIT=1` |
|
|
* Watchdog ~**720s** for fail-fast on fabric issues |
|
|
* **I/O:** Async dataset prefetching; pinned FS threads. |
|
|
* **Observability:** W&B + structured logs (throughput, TFLOPs/GPU, mem, step time). |
|
|
* **Reproducibility:** Fixed seeds; exact launch scripts/env logged; effective tokens/step reported. |
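A minimal sketch of the corresponding DDP bring-up, assuming Slurm exports the usual rank variables and the launch script sets `MASTER_ADDR`/`MASTER_PORT`; the 720 s timeout mirrors the watchdog above:

```python
import os
from datetime import timedelta

import torch
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    rank=int(os.environ["SLURM_PROCID"]),        # global rank (0..479)
    world_size=int(os.environ["SLURM_NTASKS"]),  # 480
    timeout=timedelta(seconds=720),              # fail fast on fabric issues
)
torch.cuda.set_device(int(os.environ["SLURM_LOCALID"]))  # one GPU per task
```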
|
|
|
|
|
> Final checkpoint converted to **Hugging Face format** for plug-and-play inference. |
|
|
|
|
|
--- |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
|
|
* **Testing data:** Standard academic suites (e.g., EleutherAI LM Evaluation Harness). |
|
|
* **Factors:** Domain/topic (STEM vs. general), task type (multi-choice vs. open-ended). |
|
|
* **Metrics:** Accuracy (MCQ), EM/F1 (QA), plus task-native metrics. |
|
|
|
|
|
**Suggested evaluation suite:**
|
|
|
|
|
* General knowledge & reasoning: **MMLU (STEM subsets)**, **ARC-E/ARC-C**, **HellaSwag**, **PIQA**, **Winogrande** |
|
|
* Math/coding (optional): **GSM8K**, **HumanEval** |
|
|
* Reading comprehension (optional): **BoolQ**, **RACE** |
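A hedged sketch of running that suite programmatically with the EleutherAI harness (API per lm-eval ~0.4.x; pin the exact version when reporting numbers):

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=qvac/genesisI-model,dtype=bfloat16",
    tasks=["arc_easy", "arc_challenge", "hellaswag", "piqa", "winogrande"],
    batch_size=8,
)
print(results["results"])  # per-task metrics and stderr
```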
|
|
|
|
|
### Results |
|
|
|
|
|
* *To be released with an evaluated checkpoint and a pinned harness version; result tables will include exact versions, seeds, and commit hashes.*
|
|
|
|
|
#### Summary |
|
|
|
|
|
* Base LM targets broad generalization at ~40B tokens.
|
|
* Expect material gains after SFT + preference optimization for target tasks. |
|
|
|
|
|
--- |
|
|
|
|
|
## Technical Specifications |
|
|
|
|
|
### Model Architecture and Objective |
|
|
|
|
|
* **Architecture:** Qwen3-style decoder-only Transformer |
|
|
* **Parameters:** ~**1.7B** |
|
|
* **Context length:** **4,096** tokens |
|
|
* **Positional encoding:** Rotary (RoPE), following the Qwen3 architecture
|
|
* **Attention:** Multi-head scaled dot-product; FlashAttention-2 enabled on H100 |
|
|
* **Activation:** SiLU (SwiGLU), following the Qwen3 architecture
|
|
* **Norms:** RMSNorm, following the Qwen3 architecture
|
|
* **Objective:** **Causal LM** (next-token prediction) |
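Concretely, the objective is the standard shifted cross-entropy that `transformers` computes when labels are supplied:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("qvac/genesisI-model", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("qvac/genesisI-model")

batch = tok("Photosynthesis converts light energy into", return_tensors="pt")
# Labels are shifted internally: token t is predicted from tokens < t.
out = model(**batch, labels=batch["input_ids"])
print(out.loss, torch.exp(out.loss))  # mean cross-entropy and perplexity
```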
|
|
|
|
|
### Compute Infrastructure |
|
|
|
|
|
**Hardware** |
|
|
|
|
|
* 60 nodes × 8× H100 80GB, ~800 GB RAM/node, InfiniBand fabric. |
|
|
|
|
|
**Software** |
|
|
|
|
|
* PyTorch ≥ 2.1 (CUDA 12.x), FlashAttention-2, UCX/NCCL |
|
|
* Slurm for orchestration; W&B for logging |
|
|
* (Optional) DeepSpeed ZeRO-3 for training; HF conversion post-training
|
|
|
|
|
--- |
|
|
|
|
|
## Reproducibility (Launch Sketch) |
|
|
|
|
|
```bash |
|
|
# Slurm (illustrative) |
|
|
srun -N 60 -n 480 --ntasks-per-node=8 --gpus-per-task=1 \ |
|
|
--cpus-per-task=8 --mem=0 \ |
|
|
bash -lc ' |
|
|
export NCCL_IB_DISABLE=0 |
|
|
export NCCL_IB_HCA="mlx5*" |
|
|
export NCCL_SOCKET_IFNAME=ib0 |
|
|
export NCCL_BLOCKING_WAIT=1 |
|
|
export TORCH_DISTRIBUTED_DEBUG=DETAIL |
|
|
|
|
|
python train.py \ |
|
|
--model qwen3_1p7b_from_scratch \ |
|
|
--tokenizer qwen3 \ |
|
|
--data_path /path/to/arrow \ |
|
|
--context_length 4096 \ |
|
|
--optimizer adamw --weight_decay 0.01 \ |
|
|
--lr 2e-4 --warmup_steps 600 \ |
|
|
--precision bf16-mixed \ |
|
|
--micro_batch_size 4 \ |
|
|
--grad_accum_steps 8 \ |
|
|
--eval_every 500 --log_every 50 \ |
|
|
--ckpt_every 1000 \ |
|
|
--activation_checkpointing \ |
|
|
--flash_attn 2 \ |
|
|
--compile safe \ |
|
|
--seed 42 |
|
|
' |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## Conversion & Inference |
|
|
|
|
|
* Checkpoints are **HF-compatible**: load with `AutoModelForCausalLM`. |
|
|
* For memory-limited environments, prefer half-precision or 4/8-bit loading. |
|
|
* Distribute as `safetensors` for integrity. |
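A minimal re-serialization sketch for distributing `safetensors` weights:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("qvac/genesisI-model")
model.save_pretrained("genesisI-model-st", safe_serialization=True)  # writes *.safetensors
```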
|
|
|
|
|
--- |
|
|
|
|
|
|
|
|
## Changelog |
|
|
|
|
|
* **v0.1 (2025-11-17):** Initial public release — 40B-token 1-epoch pretrain; HF conversion. |