HumanV-0.2B-Base

HumanV-0.2B-Base is a 199-million-parameter causal language model built on the custom HumanV architecture. It serves as a technical proof of concept for integrating hybrid sparse attention into a modern transformer pipeline, and was used to verify the stability, gradient flow, and memory footprint of the HumanV architecture under a production-style pre-training configuration.

Architecture Specification

The HumanV architecture is designed for resource-efficient text generation. It combines grouped-query attention with block sparse attention in most layers to reduce the Key-Value (KV) cache footprint during autoregressive inference.
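As a rough illustration of the KV cache saving from grouped-query attention alone, the sketch below compares the per-sequence cache size for the 4 key-value heads listed in the specification against a hypothetical baseline with one key-value head per query head. The head dimension of 64 (768 / 12), the FP16 cache, and the helper name kv_cache_bytes are assumptions made for illustration. Any additional saving from the sparse-attention layers is not modeled.

```python
def kv_cache_bytes(seq_len, n_layers=12, n_kv_heads=4, head_dim=64, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence (dense cache, no sparsity)."""
    # The factor 2 accounts for storing both a key and a value tensor in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

gqa = kv_cache_bytes(seq_len=256)                  # 4 KV heads, as in the specification
mha = kv_cache_bytes(seq_len=256, n_kv_heads=12)   # hypothetical baseline: 12 KV heads

print(f"GQA cache: {gqa / 1024:.0f} KiB")
print(f"MHA cache: {mha / 1024:.0f} KiB")
print(f"Reduction: {mha / gqa:.1f}x")
```

Under these assumptions the cache shrinks in proportion to the query-to-group ratio, independent of sequence length.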

Total Parameters: 199,056,384 (~200M)
Hidden Dimension (d_model): 768
Intermediate Feed-Forward Dimension (d_ff): 2304
Decoder Layers: 12
Attention Mechanism: Hybrid Configuration
- Grouped-Query Attention (GQA): 12 Query Heads, 4 Key-Value Heads (3:1 query-to-group ratio).
- First 3 Layers: Full Dense Self-Attention.
- Remaining 9 Layers: Local-Global Block Sparse Attention (Block size: 16, Window: 128, 4 Local Blocks, 1 Global Block).
Positional Encoding: Rotary Position Embeddings (RoPE).
Activation Function: SwiGLU.
Normalization: RMSNorm (epsilon = 1e-5).
Word Embeddings: Tied (the output projection shares its weights with the input embeddings).
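The parameter total can be cross-checked against the dimensions listed above. The sketch below is a back-of-the-envelope count, not the authors' own accounting. It assumes bias-free linear layers, a head dimension of 64, three SwiGLU projection matrices per layer, two RMSNorm weight vectors per layer plus a final one, and an embedding table of 151,643 rows. That last figure is an assumption, since the card does not state the embedding size; it was chosen to match the base vocabulary of the Qwen2.5 tokenizer used in the inference example.

```python
d_model, d_ff, n_layers = 768, 2304, 12
n_q_heads, n_kv_heads = 12, 4
head_dim = d_model // n_q_heads          # 64, assumed
vocab_size = 151_643                     # assumed; not stated in the card

# Attention: Q and O are d_model x d_model, while K and V are narrowed by GQA.
attn = 2 * d_model * d_model + 2 * d_model * (n_kv_heads * head_dim)
# SwiGLU feed-forward: gate, up, and down projections.
mlp = 3 * d_model * d_ff
# Two RMSNorm weight vectors per decoder layer.
norms = 2 * d_model

per_layer = attn + mlp + norms
embeddings = vocab_size * d_model        # counted once because the output head is tied
total = embeddings + n_layers * per_layer + d_model  # + final RMSNorm

print(f"{total:,}")  # 199,056,384 under these assumptions
```

With these assumptions the count matches the stated total, and it shows that the embedding table accounts for more than half of the parameters, which is why tying the output head matters at this scale.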

Training & Optimization

The model was pre-trained with a structured, padding-free pipeline: documents are packed into fixed-length blocks, so batches contain no padding tokens and the model cannot overfit to them.

Training Dataset

Source: Subset of roneneldan/TinyStories (40,000 training stories, 4,000 validation stories).
Data Processing: Token Packing (Concatenated documents chunked into contiguous blocks of 256 tokens).
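Token packing as described above can be sketched in a few lines. This is a minimal illustration, not the training pipeline itself: the function name pack_tokens, the end-of-sequence separator between documents, and the choice to drop the trailing partial block are all assumptions.

```python
from itertools import chain

def pack_tokens(tokenized_docs, block_size=256, eos_id=None):
    """Concatenate tokenized documents and split them into full blocks of block_size."""
    if eos_id is not None:
        # Assumed convention: mark each document boundary with an EOS token.
        tokenized_docs = [doc + [eos_id] for doc in tokenized_docs]
    stream = list(chain.from_iterable(tokenized_docs))
    n_full = len(stream) // block_size           # the trailing partial block is dropped
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_full)]

# Toy usage with integer token IDs and a small block size.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = pack_tokens(docs, block_size=4, eos_id=0)
print(blocks)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

Every block has exactly block_size tokens, so no padding is needed when batching.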

Training Hyperparameters (Production Grade)

Optimizer: AdamW (beta1 = 0.9, beta2 = 0.95)
Weight Decay: 0.1 (applied exclusively to weight matrices of linear layers; bias and norm weights excluded).
Dropout Strategy: Active regularization (Attention Dropout: 0.1, Residual Dropout: 0.1, MLP Hidden Dropout: 0.1).
Batch Configuration: Per-device batch size: 4, Gradient accumulation steps: 4 (Effective Batch Size: 16).
Precision: Mixed Precision (FP16) optimized for NVIDIA Tensor Cores.
Gradient Controls: Gradient Checkpointing enabled, Gradient Clipping threshold set to 1.0.
Learning Rate Schedule: Cosine Annealing with a peak rate of 2.5e-4 and 80 Warmup steps over 1000 total steps.
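The learning rate schedule above can be written out explicitly. The sketch assumes linear warmup from zero and cosine decay to zero at the final step, a common convention that the card does not confirm. The helper name lr_at is hypothetical.

```python
import math

def lr_at(step, peak_lr=2.5e-4, warmup_steps=80, total_steps=1000):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for s in (0, 40, 80, 540, 1000):
    print(s, f"{lr_at(s):.2e}")
```

Under these assumptions the rate reaches its peak at step 80, falls to half the peak midway through the decay phase (step 540), and ends at zero at step 1000.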

Metric Progression

Training Step	Training Loss*	Validation Loss	Evaluation Perplexity (PPL)
100	22.980	5.575	263.879
300	14.670	3.727	41.569
500	12.876	3.277	26.505
700	11.938	3.055	21.223
900	11.812	2.970	19.495
1000 (Final)	11.641	2.964	19.384

*Note on Training Loss: Because of how Hugging Face logged the loss under gradient accumulation in this run, the logged training loss is scaled by the accumulation factor of 4. The actual training loss at step 1000 is therefore 11.641 / 4 ≈ 2.91, which is close to the validation loss of 2.964.
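The relationship between the columns can be checked directly: perplexity is the exponential of the mean cross-entropy loss, and the note above divides the logged training loss by the accumulation factor. The snippet below recomputes both from the rounded table values, so small differences from the reported perplexities are expected.

```python
import math

# (step, logged training loss, validation loss, reported perplexity) from the table above
rows = [
    (100, 22.980, 5.575, 263.879),
    (300, 14.670, 3.727, 41.569),
    (500, 12.876, 3.277, 26.505),
    (700, 11.938, 3.055, 21.223),
    (900, 11.812, 2.970, 19.495),
    (1000, 11.641, 2.964, 19.384),
]
accumulation_steps = 4

for step, logged_train, val_loss, reported_ppl in rows:
    train_loss = logged_train / accumulation_steps   # rescaling described in the note
    ppl = math.exp(val_loss)                         # perplexity = exp(cross-entropy)
    print(f"{step:>5}  train={train_loss:.3f}  val={val_loss:.3f}  "
          f"ppl={ppl:.2f} (reported {reported_ppl})")
```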

Production Deployment & Safe Usage

To deploy or run inference with this model, you must use the custom transformers fork containing the HumanV module registration.

Installation of Custom Backend

Install the validated company-specific fork of transformers containing the HumanV code:

pip install git+https://github.com/humanprojectceo/transformers.git@main -U

Safe Inference Implementation

Running generation in secure production environments calls for sandboxing and bounded generation settings. The following example caps the output length and sets explicit sampling parameters:

import torch
from transformers import AutoTokenizer
# HumanVForCausalLM is provided by the custom transformers fork installed above
from transformers import HumanVForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Initialize tokenizer (Uses Qwen2.5 multilingual tokenizer)
tokenizer_id = "Qwen/Qwen2.5-0.5B"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# 2. Load pre-trained HumanV model weights
model_id = "humanvcompany/HumanV-0.2B-Base"
model = HumanVForCausalLM.from_pretrained(model_id).to(device)
model.eval()

# 3. Setup input prompt (Position IDs are aligned from index 0)
prompt = "Once upon a time, a small boy named Jack"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# 4. Generate with a bounded output length and explicit sampling settings
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,             # Creative sampling enabled
        temperature=0.8,            # Controlled randomness
        top_k=50,                   # Top-K filtering
        top_p=0.9,                  # Nucleus sampling
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("\nGenerated Story:\n", generated_text)

Security & Operational Guidelines

Environment Isolation: Always process untrusted generation inputs in containerized environments (e.g., Docker, sandboxed Kubernetes nodes) with resource limits, to prevent CPU/GPU exhaustion attacks.
Tied Embeddings Warning: During model load, you may see a warning about missing keys: ['lm_head.weight']. This is expected with tie_word_embeddings=True, which shares the output projection weights with the input embeddings to save VRAM and disk space. It does not affect inference.
