HumanV-0.2B-Base
HumanV-0.2B-Base is a 199-million-parameter causal language model built on the custom HumanV architecture. It serves as a technical proof of concept for integrating hybrid sparse attention into a modern transformer pipeline, and was used to check the training stability, gradient flow, and memory footprint of the HumanV architecture under a production-style pre-training configuration.
Architecture Specification
The HumanV architecture is designed for resource-efficient text generation. It combines grouped-query attention with block sparse attention in most layers to reduce the Key-Value (KV) cache footprint during autoregressive inference.
- Total Parameters: 199,056,384 (~200M)
- Hidden Dimension (d_model): 768
- Intermediate Feed-Forward Dimension (d_ff): 2304
- Decoder Layers: 12
- Attention Mechanism: Hybrid Configuration
  - Grouped-Query Attention (GQA): 12 Query Heads, 4 Key-Value Heads (3 query heads per key-value head).
  - First 3 Layers: Full Dense Self-Attention.
  - Remaining 9 Layers: Local-Global Block Sparse Attention (Block size: 16, Window: 128, 4 Local Blocks, 1 Global Block).
- Positional Encoding: Rotary Position Embeddings (RoPE).
- Activation Function: SwiGLU.
- Normalization: RMSNorm (epsilon = 1e-5).
- Word Embeddings: Tied (the output head shares its weights with the input embedding).
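The parameter total can be reproduced from the dimensions listed above. The following is a minimal sketch, not the authors' accounting: it assumes a vocabulary of 151,643 entries, no bias terms, two RMSNorm weights per layer plus one final norm, and a three-matrix SwiGLU feed-forward block. None of these are stated in this card; the vocabulary size in particular was chosen because it makes the arithmetic land on the reported total.

```python
# Sizes taken from the specification above.
d_model = 768
d_ff = 2304
n_layers = 12
n_q_heads = 12
n_kv_heads = 4

# ASSUMPTION (not stated in the card): vocabulary size inferred from the total.
vocab_size = 151_643

head_dim = d_model // n_q_heads   # 64
kv_dim = n_kv_heads * head_dim    # 256, shared by groups of 3 query heads

# ASSUMPTION: no bias terms anywhere.
attn = 2 * d_model * d_model + 2 * d_model * kv_dim  # Q and O, then K and V
mlp = 3 * d_model * d_ff                             # gate, up, and down projections
norms = 2 * d_model                                  # two RMSNorm weights per layer
per_layer = attn + mlp + norms

embedding = vocab_size * d_model  # counted once because the output head is tied
final_norm = d_model

total = n_layers * per_layer + embedding + final_norm
print(f"{total:,}")  # 199,056,384

# Cached key/value entries per token, counting only the effect of GQA.
kv_values_per_token_gqa = 2 * n_layers * kv_dim
kv_values_per_token_mha = 2 * n_layers * d_model
print(kv_values_per_token_mha / kv_values_per_token_gqa)  # 3.0
```

Under these assumptions the embedding table accounts for more than half of all parameters, and GQA alone cuts the per-token KV cache to one third of the full multi-head equivalent. Any further saving from the sparse layers is not modeled here.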
Training & Optimization
The model was pre-trained with a padding-free pipeline: documents are packed into fixed-length blocks, so no padding tokens enter a batch and every position contributes to the loss.
Training Dataset
- Source: Subset of roneneldan/TinyStories (40,000 training stories, 4,000 validation stories).
- Data Processing: Token Packing (concatenated documents chunked into contiguous blocks of 256 tokens).
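Token packing as described above can be sketched as follows. This is an illustration, not the training script: the function name `pack_tokens`, the use of a separator token between documents, and dropping the incomplete final block are all assumptions.

```python
from typing import Iterable, List


def pack_tokens(
    docs: Iterable[List[int]], block_size: int = 256, eos_id: int = 0
) -> List[List[int]]:
    """Concatenate tokenized documents and cut the stream into full blocks."""
    stream: List[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)  # ASSUMPTION: a separator marks each document end
    n_blocks = len(stream) // block_size  # ASSUMPTION: the short tail is dropped
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


# Three toy "documents" of 300, 200, and 100 token ids.
docs = [list(range(1, 301)), list(range(1, 201)), list(range(1, 101))]
blocks = pack_tokens(docs, block_size=256, eos_id=0)
print(len(blocks), [len(b) for b in blocks])  # 2 [256, 256]
```

Every block has exactly 256 tokens, so no padding is needed. A block may span a document boundary, which is the usual trade-off of this approach.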
Training Hyperparameters (Production Grade)
- Optimizer: AdamW (beta1 = 0.9, beta2 = 0.95)
- Weight Decay: 0.1 (applied exclusively to weight matrices of linear layers; bias and norm weights excluded).
- Dropout Strategy: Active regularization (Attention Dropout: 0.1, Residual Dropout: 0.1, MLP Hidden Dropout: 0.1).
- Batch Configuration: Per-device batch size: 4, Gradient accumulation steps: 4 (Effective Batch Size: 16).
- Precision: Mixed Precision (FP16) optimized for NVIDIA Tensor Cores.
- Gradient Controls: Gradient Checkpointing enabled, Gradient Clipping threshold set to 1.0.
- Learning Rate Schedule: Cosine Annealing with a peak rate of 2.5e-4 and 80 Warmup steps over 1000 total steps.
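The learning rate schedule and token budget implied by these settings can be sketched in a few lines. The sketch assumes linear warmup from zero, cosine decay to zero at the final step, and a single training device; the card states none of these, and the function name `learning_rate` is hypothetical.

```python
import math

PEAK_LR = 2.5e-4
WARMUP_STEPS = 80
TOTAL_STEPS = 1000


def learning_rate(step: int) -> float:
    """Linear warmup followed by cosine decay (assumed to end at zero)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 40, 80, 540, 1000):
    print(step, f"{learning_rate(step):.2e}")

# Token budget, ASSUMING one device:
# per-device batch 4 x accumulation 4 x 256 tokens per packed block.
tokens_per_step = 4 * 4 * 256
total_tokens = tokens_per_step * TOTAL_STEPS
print(f"{total_tokens:,}")  # 4,096,000
```

Under the single-device assumption the run sees roughly 4.1 million tokens in total.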
Metric Progression
| Training Step | Training Loss* | Validation Loss | Evaluation Perplexity (PPL) |
|---|---|---|---|
| 100 | 22.980 | 5.575 | 263.879 |
| 300 | 14.670 | 3.727 | 41.569 |
| 500 | 12.876 | 3.277 | 26.505 |
| 700 | 11.938 | 3.055 | 21.223 |
| 900 | 11.812 | 2.970 | 19.495 |
| 1000 (Final) | 11.641 | 2.964 | 19.384 |
*Note on Training Loss: Because of how the Hugging Face training loop logged the loss under gradient accumulation in this run, the logged training loss is scaled by the accumulation factor of 4. The rescaled training loss at step 1000 is 11.641 / 4 ≈ 2.91, close to the validation loss of 2.96.
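The relationship between the columns can be checked directly: perplexity is the exponential of the validation loss, and the rescaled training loss is the logged value divided by the accumulation factor. The sketch below uses the numbers from the table; the helper name `rescale_and_perplexity` is hypothetical.

```python
import math

ACCUMULATION_STEPS = 4

# (step, logged training loss, validation loss, reported perplexity)
TABLE = [
    (100, 22.980, 5.575, 263.879),
    (300, 14.670, 3.727, 41.569),
    (500, 12.876, 3.277, 26.505),
    (700, 11.938, 3.055, 21.223),
    (900, 11.812, 2.970, 19.495),
    (1000, 11.641, 2.964, 19.384),
]


def rescale_and_perplexity(logged_train: float, val_loss: float):
    """Return (training loss per micro-batch, perplexity from validation loss)."""
    return logged_train / ACCUMULATION_STEPS, math.exp(val_loss)


for step, logged_train, val_loss, reported_ppl in TABLE:
    train_loss, ppl = rescale_and_perplexity(logged_train, val_loss)
    print(f"{step:>5}  train={train_loss:.3f}  val={val_loss:.3f}  "
          f"ppl={ppl:.3f}  reported={reported_ppl:.3f}")
```

The recomputed perplexities differ from the reported ones only in the later digits, which is consistent with the validation loss having been rounded to three decimals before it was tabulated.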
Production Deployment & Safe Usage
To deploy or run inference with this model, you must use the custom transformers fork containing the HumanV module registration.
Installation of Custom Backend
Install the validated company-specific fork of transformers containing the HumanV code:
pip install git+https://github.com/humanprojectceo/transformers.git@main -U
Safe Inference Implementation
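Running generation in a secure production environment calls for sandboxing and bounded, reproducible behavior. The example below caps output length with max_new_tokens and uses sampling; for reproducible output, set do_sample=False or fix the random seed. Use the following implementation structure: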
import torch
from transformers import AutoTokenizer
# Classes are exposed dynamically via your registered Hugging Face modeling fork
from transformers import HumanVConfig, HumanVForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Initialize tokenizer (Uses Qwen2.5 multilingual tokenizer)
tokenizer_id = "Qwen/Qwen2.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# 2. Load pre-trained HumanV model weights
model_id = "humanvcompany/HumanV-0.2B-Base"
model = HumanVForCausalLM.from_pretrained(model_id).to(device)
model.eval()

# 3. Setup input prompt (Position IDs are aligned from index 0)
prompt = "Once upon a time, a small boy named Jack"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# 4. Secure generation sequence
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,    # Creative sampling enabled
        temperature=0.8,   # Controlled randomness
        top_k=50,          # Top-K filtering
        top_p=0.9,         # Nucleus sampling
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("\nGenerated Story:\n", generated_text)
Security & Operational Guidelines
- Environment Isolation: Always execute untrusted generation inputs in containerized environments (e.g., Docker, sandboxed Kubernetes nodes) with resource limitations to prevent CPU/GPU exhaustion attacks.
- Tied Embeddings Warning: During model load, you may observe a warning regarding missing keys: ['lm_head.weight']. This is expected behavior under the tie_word_embeddings=True configuration, which maps the output weights to the input embeddings to save VRAM and disk space. It does not affect inference.