A PyTorch implementation of a DeepSeek V3-inspired transformer model with Mixture of Experts (MoE), Latent Attention, and other advanced features.
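For orientation, the core MoE idea is that each token's hidden state is routed to only a few expert FFNs, optionally alongside an always-on shared expert. The sketch below is a generic top-k router in PyTorch, not the repo's actual `model.py` implementation; all names are illustrative, and DeepSeek V3's real routing differs in details (e.g. its scoring function and bias-based load balancing).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Generic top-k MoE layer with a shared expert (illustrative sketch only)."""
    def __init__(self, dim: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, n_experts, bias=False)

        def make_ffn():
            return nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

        self.experts = nn.ModuleList(make_ffn() for _ in range(n_experts))
        self.shared_expert = make_ffn()  # always-on expert, cf. --use_shared_expert

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); each token picks its top_k experts
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # renormalize over the chosen experts
        out = self.shared_expert(x)           # shared expert sees every token
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = idx[..., k] == e       # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] = out[mask] + weights[..., k][mask, None] * expert(x[mask])
        return out

# Tiny smoke test
moe = TopKMoE(dim=384, n_experts=8, top_k=2)
print(moe(torch.randn(2, 16, 384)).shape)  # torch.Size([2, 16, 384])
```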
Training is configured entirely through command-line flags:

- `--block_size`: Maximum sequence length (default: 128)
- `--batch_size`: Training batch size (default: 256)
- `--embeddings_dims`: Model embedding dimensions (default: 384)
- `--no_of_heads`: Number of attention heads (default: 8)
- `--no_of_decoder_layers`: Number of decoder layers (default: 6)
- `--latent_dim`: Latent dimension for attention (default: 64)
- `--experts`: Number of MoE experts (default: 8)
- `--top_experts`: Number of experts to route to (default: 2)
- `--use_shared_expert`: Enable shared expert in MoE (default: True)
- `--noisy_topk`: Use noisy top-k routing (default: False)
- `--useauxFreeLoadBalancingLoss`: Use auxiliary-free load balancing loss (default: True)
- `--aux_free_bias_update_rate`: Bias update rate for load balancing (default: 0.001)
- `--loss_scale`: Loss scaling factor (default: 0.3)
- `--epochs`: Number of training epochs (default: 1)
- `--max_lr`: Maximum learning rate (default: 6e-4)
- `--weight_decay_optim`: Weight decay for optimizer (default: 0.1)
- `--beta_1`: Beta1 for optimizer (default: 0.9)
- `--beta_2`: Beta2 for optimizer (default: 0.95)
- `--eps`: Epsilon for optimizer (default: 1e-8)
- `--clip`: Gradient clipping value (default: 1.0)
- `--dropout`: Dropout rate (default: 0.1)
- `--attn_dropout`: Attention dropout rate (default: 0.1)
- `--device`: Device to use (default: 'cuda')
- `--use_checkpointing`: Use gradient checkpointing (default: False)
- `--use_liger`: Use Liger kernels for optimization (default: True)
- `--ignore_pad_token_in_loss`: Ignore padding tokens in loss calculation (default: True)
- `--vocab_size`: Vocabulary size (default: 32000, updated based on tokenizer)
- `--base_freq`: Base frequency for positional encoding (default: 100000)
- `--hf_token`: Hugging Face token for accessing gated models like Llama-2 (default: None)
- `--dataset`: Dataset to use ('tinystories', 'fineweb', 'tinyshakespeare') (default: 'tinystories')
- `--generation_max_length`: Maximum length for text generation (default: 50)
- `--generation_top_k`: Top-k value for sampling (default: 50)
- `--generation_temperature`: Temperature for sampling (default: 1.0)
- `--log_interval`: Steps between logging (default: 100)
- `--save_interval`: Steps between saving checkpoints (default: 2000)
- `--eval_interval`: Steps between evaluation (default: 400)
- `--eval_iters`: Number of iterations for evaluation (default: 400)
- `--warmup_iters`: Number of warmup iterations (default: 400)
- `--total_iters`: Total training iterations (default: 10000)
- `--lr_decay_iters`: Learning rate decay iterations (default: 10000)
- `--wandb_project`: Wandb project name (default: 'smolkimi')
- `--wandb_run_name`: Wandb run name (default: None)
- `--total_batch_size`: Total batch size for gradient accumulation (default: 524288)
- `--micro_batch_size`: Micro batch size (default: batch_size)
- `--use_ddp`: Use distributed data parallel (default: False)

To set up the environment, run the install script:

```bash
chmod +x install.sh
./install.sh
```
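The `--max_lr`, `--warmup_iters`, and `--lr_decay_iters` flags suggest the familiar linear-warmup-plus-cosine-decay schedule. The sketch below is an assumption about what `trainer.py` computes, not a confirmed excerpt; in particular the `min_lr` floor is a guess (max_lr / 10 is a common choice).

```python
import math

# Assumed schedule inferred from --max_lr / --warmup_iters / --lr_decay_iters;
# min_lr is a guess, and trainer.py may differ.
def get_lr(it, max_lr=6e-4, warmup_iters=400, lr_decay_iters=10000, min_lr=6e-5):
    if it < warmup_iters:                                  # linear warmup
        return max_lr * (it + 1) / warmup_iters
    if it > lr_decay_iters:                                # past the decay horizon
        return min_lr
    ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))        # cosine from 1 down to 0
    return min_lr + coeff * (max_lr - min_lr)

print(get_lr(0), get_lr(400), get_lr(10000))               # warmup start, peak, floor
```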
Since this model uses the Llama-2 tokenizer, you'll need a Hugging Face token to access the gated model.

1. Get a Hugging Face token (you can create one at https://huggingface.co/settings/tokens).
2. Set your token in one of these ways:
```bash
# Option 1: Environment variable (recommended)
export HF_TOKEN="your_token_here"

# Option 2: Pass as command line argument
python trainer.py --hf_token "your_token_here"
```
Then launch training:

```bash
# With environment variable
export HF_TOKEN="your_token_here"
python trainer.py

# With command line argument
python trainer.py --hf_token "your_token_here"
```
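For reference, fetching the gated Llama-2 tokenizer with a token typically looks like the snippet below. This is a sketch using `transformers.AutoTokenizer`; the repo's `tokenizer.py` may wire this up differently.

```python
import os
from transformers import AutoTokenizer

# Sketch only; tokenizer.py may differ in details.
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-2-7b-hf",        # gated repo: requires an accepted license
    token=os.environ.get("HF_TOKEN"),  # or pass --hf_token on the command line
)
print(tokenizer.vocab_size)            # 32000, matching the default --vocab_size
```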
```bash
# Train with larger model
python trainer.py --hf_token "your_token_here" --embeddings_dims 512 --no_of_heads 16 --no_of_decoder_layers 8

# Train with different dataset
python trainer.py --hf_token "your_token_here" --dataset fineweb --epochs 3

# Train with custom learning rate and batch size
python trainer.py --hf_token "your_token_here" --max_lr 1e-3 --batch_size 128 --block_size 256

# Train with more experts
python trainer.py --hf_token "your_token_here" --experts 16 --top_experts 4

# Train without shared expert
python trainer.py --hf_token "your_token_here" --use_shared_expert False

# Train with noisy top-k routing
python trainer.py --hf_token "your_token_here" --noisy_topk True
```
```bash
# Set token as environment variable for distributed training
export HF_TOKEN="your_token_here"

# 2 GPUs
torchrun --nproc_per_node=2 trainer.py

# 4 GPUs with custom parameters
torchrun --nproc_per_node=4 trainer.py --batch_size 128 --embeddings_dims 512

# 8 GPUs with large model configuration
torchrun --nproc_per_node=8 trainer.py \
    --embeddings_dims 768 \
    --no_of_heads 12 \
    --no_of_decoder_layers 12 \
    --experts 16 \
    --top_experts 4 \
    --batch_size 64 \
    --block_size 512
```
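`torchrun` spawns one process per GPU and sets `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` in each process. A minimal sketch of the DDP wiring this implies is below; `trainer.py`'s actual setup may differ, and the `Linear` module is just a stand-in for the real model.

```python
# Minimal DDP sketch for a torchrun launch; run with:
#   torchrun --nproc_per_node=2 this_script.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")            # torchrun provides RANK/WORLD_SIZE/LOCAL_RANK
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(384, 384).to(local_rank)   # stand-in for the real model
model = DDP(model, device_ids=[local_rank])
# ... training loop: DDP all-reduces gradients across ranks on backward()

dist.destroy_process_group()
```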
A larger single-GPU configuration:

```bash
export HF_TOKEN="your_token_here"
python trainer.py \
    --embeddings_dims 768 \
    --no_of_heads 12 \
    --no_of_decoder_layers 12 \
    --experts 16 \
    --top_experts 4 \
    --batch_size 32 \
    --block_size 512 \
    --max_lr 3e-4 \
    --epochs 5 \
    --use_liger True \
    --wandb_project "smolkimi-large"
```
An experimental routing configuration:

```bash
export HF_TOKEN="your_token_here"
python trainer.py \
    --noisy_topk True \
    --use_shared_expert False \
    --aux_free_bias_update_rate 0.01 \
    --loss_scale 0.5 \
    --dropout 0.2 \
    --attn_dropout 0.15 \
    --wandb_project "smolkimi-experimental"
```
A memory-efficient configuration with gradient checkpointing and accumulation:

```bash
export HF_TOKEN="your_token_here"
python trainer.py \
    --use_checkpointing True \
    --batch_size 64 \
    --micro_batch_size 16 \
    --total_batch_size 262144 \
    --block_size 128
```
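Assuming `--total_batch_size` counts tokens per optimizer step (the 524288 default is a common tokens-per-step figure, so this is an inference, not confirmed from the repo), the number of gradient-accumulation steps falls out of the other flags:

```python
# Assumption: total_batch_size is measured in tokens per optimizer step.
total_batch_size = 262144   # from the example above
micro_batch_size = 16       # sequences per forward/backward pass
block_size = 128            # tokens per sequence
world_size = 1              # single GPU here

grad_accum_steps = total_batch_size // (micro_batch_size * block_size * world_size)
print(grad_accum_steps)     # 128 micro-steps per optimizer update
```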
```bash
# Set your HF token
export HF_TOKEN="your_token_here"

# Run the Gradio app
cd gradio
python app.py --hf_token "your_token_here"

# Or with environment variable
cd gradio
python app.py

# With custom port and public sharing
cd gradio
python app.py --hf_token "your_token_here" --port 8080 --share
```
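Under the hood, `app.py` presumably wraps the sampler in a small Gradio interface. The sketch below is hypothetical, including the assumption that `--port` and `--share` map onto `launch(server_port=..., share=...)`; the shipped `gradio/app.py` will differ in details.

```python
import gradio as gr

def generate(prompt: str) -> str:
    # Stand-in for topk_sampling(model, prompt, device='cuda')
    return prompt + " ..."

# Hypothetical wrapper; not the repo's actual app.py.
demo = gr.Interface(fn=generate, inputs="text", outputs="text")
demo.launch(server_port=8080, share=False)
```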
```bash
# View all available parameters
python trainer.py --help

# View Gradio app parameters
cd gradio
python app.py --help
```
You can set the following environment variables instead of passing them as arguments:
```bash
# Hugging Face token (recommended approach)
export HF_TOKEN="your_token_here"

# Wandb API key (optional, for experiment tracking)
export WANDB_API_KEY="your_wandb_key_here"
```
```
SmolKimi/
├── config.py            # Model configuration and hyperparameters with argparse
├── model.py             # Model architecture (DeepSeekV3, MoE, Attention, etc.)
├── tokenizer.py         # Tokenizer setup
├── data.py              # Data loading and preparation
├── inference.py         # Inference functions and text generation
├── trainer.py           # Main training loop with DDP support
├── install.sh           # Setup script
├── requirements.txt     # Python dependencies
├── gradio/
│   ├── app.py           # Gradio web interface
│   └── requirements.txt
└── generated_data/      # Generated text outputs
```
All parameters can be set via command line arguments. For complex configurations, consider creating shell scripts:
```bash
#!/bin/bash
# large_model_config.sh
python trainer.py \
    --embeddings_dims 1024 \
    --no_of_heads 16 \
    --no_of_decoder_layers 24 \
    --experts 32 \
    --top_experts 8 \
    --batch_size 16 \
    --block_size 1024 \
    --max_lr 1e-4 \
    --epochs 10 \
    --use_liger True \
    --use_checkpointing True \
    --wandb_project "smolkimi-large-scale"
```
```bash
# TinyStories (default)
python trainer.py --dataset tinystories

# FineWeb (large scale)
python trainer.py --dataset fineweb --epochs 3 --batch_size 64

# TinyShakespeare (character level)
python trainer.py --dataset tinyshakespeare --block_size 256
```
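`data.py` presumably pulls these corpora from the Hugging Face Hub. For the default dataset, loading typically looks like the snippet below; the hub id and the `"text"` column name are assumptions based on the public TinyStories dataset card, not confirmed from the repo.

```python
from datasets import load_dataset

# Assumed hub id for the default corpus; data.py may use a different source.
ds = load_dataset("roneneldan/TinyStories", split="train")
print(ds[0]["text"][:80])  # assumes a "text" column, as on the hub card
```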
```bash
# Custom wandb configuration
python trainer.py \
    --wandb_project "my-experiment" \
    --wandb_run_name "test-run-1" \
    --log_interval 50 \
    --eval_interval 200 \
    --save_interval 1000
```
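These flags presumably feed straight into `wandb.init`; a hypothetical mapping:

```python
import wandb

# Hypothetical mapping of the CLI flags onto wandb; trainer.py may differ.
run = wandb.init(project="my-experiment", name="test-run-1")
run.log({"train/loss": 2.31}, step=100)  # emitted every --log_interval steps
run.finish()
```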
```bash
# High-throughput configuration (large batch, long context)
python trainer.py \
    --batch_size 512 \
    --block_size 2048 \
    --embeddings_dims 1024 \
    --total_batch_size 1048576

# Memory-constrained configuration
python trainer.py \
    --batch_size 32 \
    --micro_batch_size 8 \
    --block_size 128 \
    --use_checkpointing True \
    --embeddings_dims 256
```
Training can also be started from Python:

```python
from trainer import train

train()
```
To generate text with a custom configuration:

```python
from inference import topk_sampling
from model import DeepSeekV3
from config import ModelArgs, get_args

# Load with custom config
args = get_args()
model_args = ModelArgs(args)
model = DeepSeekV3(device='cuda')

text = topk_sampling(model, "Once upon a time", device='cuda')
```
To load a saved checkpoint:

```python
import torch
from model import DeepSeekV3
from config import ModelArgs, get_args

# Load saved model
args = get_args()
model_args = ModelArgs(args)
model = DeepSeekV3(device='cuda')
model.load_state_dict(torch.load('path/to/checkpoint.pt', map_location='cuda'))
model.eval()
```
Performance tips:

- `--use_checkpointing True` for memory-constrained setups
- `--use_liger True` for optimized operations

If tokenizer downloads fail:

```bash
# Make sure you have accepted the Llama-2 license and have a valid token
# Visit: https://huggingface.co/meta-llama/Llama-2-7b-hf
# Then set your token:
export HF_TOKEN="your_token_here"
```

If you run out of GPU memory:

```bash
# Reduce batch size and enable checkpointing
python trainer.py --hf_token "your_token_here" --batch_size 16 --use_checkpointing True

# Use gradient accumulation
python trainer.py --hf_token "your_token_here" --batch_size 32 --micro_batch_size 8
```

If training is slow:

```bash
# Enable Liger kernels and increase batch size
python trainer.py --hf_token "your_token_here" --use_liger True --batch_size 256

# Use multiple GPUs
export HF_TOKEN="your_token_here"
torchrun --nproc_per_node=4 trainer.py
```

If the loss is unstable:

```bash
# Reduce learning rate and enable gradient clipping
python trainer.py --hf_token "your_token_here" --max_lr 1e-4 --clip 0.5
```
Feel free to contribute improvements, bug fixes, or new features!
This project is licensed under the MIT License.