
HRM-Text1: Hierarchical Reasoning Model for Text Generation

Open In Colab

A transformer language model built on the Hierarchical Reasoning Module (HRM) architecture and trained on multiple high-quality text datasets. The model uses adaptive computation with a pondering mechanism to improve text generation quality.

Model Architecture

HRM-Text1 implements a novel hierarchical reasoning architecture with the following key components:

  • Model Size: 99M parameters (Large variant)
  • Architecture: Hierarchical Reasoning Module with dual-stream processing
  • Embeddings: 1024 dimensions
  • Attention Heads: 16 heads
  • Feed-Forward: 4096 dimensions
  • Context Length: 512 tokens
  • Vocabulary: 32,128 tokens (T5 tokenizer)

Key Features

  • Adaptive Computation: Pondering mechanism with halt probabilities
  • Dual-Stream Processing: High-level (H) and Low-level (L) reasoning modules
  • SwiGLU Activation: Enhanced non-linear transformations
  • RMSNorm: Improved normalization for stable training
  • Mixed Precision: BF16 training support for NVIDIA Ampere+ GPUs
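The RMSNorm and SwiGLU components listed above follow their standard formulations. As a minimal sketch of the underlying math (plain Python for clarity; the function names here are illustrative, not the repo's actual classes, and the real layers operate on tensors with learned weights):

```python
import math

def rmsnorm(x, g, eps=1e-6):
    # RMSNorm: rescale by the inverse root-mean-square (no mean subtraction,
    # unlike LayerNorm), then apply the learned gain g.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [gi * v / rms for gi, v in zip(g, x)]

def swish(v):
    # SiLU / swish activation: v * sigmoid(v)
    return v / (1.0 + math.exp(-v))

def swiglu(gate, up):
    # SwiGLU gating: elementwise swish(gate) * up; in the real layer, gate and
    # up are two separate linear projections of the same input.
    return [swish(a) * b for a, b in zip(gate, up)]
```

RMSNorm drops LayerNorm's mean-centering and bias, which is cheaper and tends to train more stably at scale; SwiGLU replaces the usual single-activation MLP with a gated product of two projections.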

Training Configuration

Datasets

The model supports training on multiple high-quality datasets:

  • C4 Multilingual: Common Crawl web text (multilingual)
  • OpenWebText: English web content dataset
  • The Pile: Diverse text from EleutherAI
  • SlimPajama: 627B token dataset (filtered variants available)
  • FineWeb: High-quality web content
  • Spanish: Spanish language subset from C4

Mixed Dataset Training

The training script supports custom dataset mixing ratios:

CUSTOM_MIX_RATIOS = {
    "high_quality": {
        "slimpajama_en": 0.5,  # 50% SlimPajama English
        "pile": 0.3,           # 30% The Pile
        "openwebtext": 0.2     # 20% OpenWebText
    }
}
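Conceptually, mixing ratios like these drive a weighted choice of which dataset the next training example is drawn from. A minimal sketch (the actual script may instead use something like `datasets.interleave_datasets` with `probabilities`; `sample_source` is an illustrative helper, not part of the repo):

```python
import random

CUSTOM_MIX_RATIOS = {
    "high_quality": {
        "slimpajama_en": 0.5,  # 50% SlimPajama English
        "pile": 0.3,           # 30% The Pile
        "openwebtext": 0.2,    # 20% OpenWebText
    }
}

def sample_source(mix, rng=random):
    # Weighted choice of the dataset to draw the next example from.
    names, weights = zip(*mix.items())
    return rng.choices(names, weights=weights, k=1)[0]
```

Over many draws, the empirical mix converges to the configured ratios, so each optimizer step sees the intended blend in expectation.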

Training Hyperparameters

  • Learning Rate: 3e-4 (max) → 1e-5 (min) with cosine annealing
  • Batch Size: 40 (with gradient accumulation steps: 2)
  • Weight Decay: 0.05
  • Optimizer: AdamW with β₁=0.9, β₂=0.95
  • Epochs: 2
  • Mixed Precision: Enabled for compatible hardware
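The cosine-annealed learning-rate schedule above can be written in closed form; this sketch assumes a plain cosine decay from 3e-4 to 1e-5 with no warmup (the training script's exact schedule may differ). With gradient accumulation of 2, the effective batch per optimizer step is 80 if 40 is the per-step micro-batch.

```python
import math

def cosine_lr(step, total_steps, lr_max=3e-4, lr_min=1e-5):
    # Cosine annealing: starts at lr_max (cos(0) = 1) and decays
    # smoothly to lr_min at total_steps (cos(pi) = -1).
    progress = min(step / total_steps, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```

This matches the behavior of PyTorch's `CosineAnnealingLR` with `eta_min=1e-5`.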

Model Components

HRMBlock Architecture

import torch.nn as nn

class HRMBlock(nn.Module):
    def __init__(self, n_embd, n_head, d_ff, dropout=0.1):
        super().__init__()
        self.norm1 = RMSNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, dropout=dropout, batch_first=True)
        self.norm2 = RMSNorm(n_embd)
        self.mlp = SwiGLUMuchPelu(n_embd, d_ff, dropout)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        # Pre-norm residual: attention, then the SwiGLU MLP
        h = self.norm1(x)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + self.dropout(a)
        return x + self.dropout(self.mlp(self.norm2(x)))

Pondering Mechanism

The model implements adaptive computation through a halt probability mechanism:

  • Max Steps: 8 reasoning steps
  • Halt Bias: -2.2 (initial)
  • Ponder Loss Weight: 1e-2
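These numbers fit the usual Adaptive Computation Time (ACT) pattern: each reasoning step emits a halt probability, and computation stops once the cumulative probability crosses a threshold or the step budget runs out. Note that sigmoid(-2.2) ≈ 0.1, so the initial halt bias makes the untrained model ponder for many steps. A minimal sketch of the control flow (illustrative only; the model's actual implementation operates on batched tensors):

```python
import math

def halt_prob(logit, bias=-2.2):
    # Per-step halt probability from a logit plus the initial halt bias.
    return 1.0 / (1.0 + math.exp(-(logit + bias)))

def ponder(step_halt_probs, max_steps=8, eps=0.01):
    # ACT-style halting: accumulate halt probability until it reaches 1 - eps
    # or max_steps is hit. Returns (steps_taken, remainder), where remainder
    # is 1 minus the probability mass spent before the final step; the ponder
    # loss penalizes steps_taken + remainder, weighted by ponder_loss_weight.
    cumulative = 0.0
    for n, p in enumerate(step_halt_probs[:max_steps], start=1):
        cumulative += p
        if cumulative >= 1.0 - eps or n == max_steps:
            return n, max(0.0, 1.0 - (cumulative - p))
    return max_steps, max(0.0, 1.0 - cumulative)
```

The ponder loss (weight 1e-2) trades generation quality against compute: a larger weight pushes the model to halt earlier.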

Usage

Quick Start

from transformers import T5Tokenizer
from modeling_hrm_text1 import HRMText1

# Load model and tokenizer
model = HRMText1.from_pretrained("dreamwar/HRM-Text1-{DATASET}-large")
tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Generate text
prompt = "The future of artificial intelligence"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)

Training from Scratch

Option 1: Google Colab (Recommended)

# Open the Colab notebook
https://colab.research.google.com/drive/1c4exU-zMt4SuT1kRlwQQXlLPaiazEDCf?usp=sharing

Option 2: Local Training

# Set environment variables
export HRM_OUTPUT_BASE="/path/to/output"
export HF_TOKEN="your_huggingface_token"

# Run training
python hrm_llm_training_c4_b.py

Configuration Options

The training script supports extensive configuration:

# Dataset selection
ACTIVE_DATASET = "mixed"  # Options: "c4", "openwebtext", "pile", "spanish", "mixed"

# Dataset subset percentage
DATASET_SUBSET_PERCENT = 5  # 1-100%

# Custom output path
CUSTOM_BASE_PATH = "/your/custom/path"

# Model parameters (large variant)
MODEL_PARAMS = {
    "n_embd": 1024,
    "n_head": 16,
    "d_ff": 4096,
    "dropout": 0.1,
    "halt_max_steps": 8,
    "ponder_loss_weight": 1e-2,
    "halt_bias_init": -2.2
}
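A rough parameter budget can be derived from these dimensions. The sketch below assumes standard transformer blocks with a SwiGLU MLP (three weight matrices) and ignores norms, biases, the halt head, and the dual-stream layout; the layer count is not listed in this snippet, so `n_layer=4` here is a hypothetical value that happens to land near the stated 99M:

```python
def estimate_params(n_embd=1024, n_head=16, d_ff=4096, vocab=32128, n_layer=4):
    # Rough transformer parameter count: token embeddings plus, per block,
    # the four attention projections and the three SwiGLU MLP matrices.
    emb = vocab * n_embd
    attn = 4 * n_embd * n_embd   # Q, K, V, and output projections
    mlp = 3 * n_embd * d_ff      # SwiGLU uses gate, up, and down matrices
    return emb + n_layer * (attn + mlp)
```

With these assumptions the estimate is ~100M, consistent with the 99M figure for the Large variant.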

Features

Multi-Dataset Support

  • Individual Datasets: Train on single datasets (C4, OpenWebText, Pile, etc.)
  • Mixed Training: Combine multiple datasets with custom ratios
  • Language Filtering: Optional language detection and filtering
  • Streaming: Memory-efficient streaming for large datasets

Training Optimizations

  • Checkpointing: Automatic checkpoint saving and resuming
  • Early Stopping: Validation-based early stopping (patience: 2)
  • Gradient Clipping: Norm clipping at 1.0
  • Mixed Precision: BF16 for memory efficiency
  • Model Compilation: PyTorch 2.0 compilation support
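The early-stopping rule (patience: 2) amounts to tracking the best validation loss and aborting after two consecutive evaluations without improvement. A minimal sketch (`EarlyStopper` is an illustrative helper, not the script's actual class):

```python
class EarlyStopper:
    # Stop training when validation loss fails to improve `patience`
    # evaluations in a row.
    def __init__(self, patience=2):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        # Returns True when training should stop.
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Combined with checkpointing, this keeps the best-validation weights (`best_model.bin`) while avoiding wasted epochs once the model plateaus.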

Hardware Support

  • CUDA: GPU acceleration with TF32 precision on Ampere+
  • Multi-Platform: Linux, macOS, Windows support
  • Google Colab: Full compatibility with free and pro tiers
  • Memory Management: Automatic DataLoader worker detection

Output Structure

HRM_Models/
├── hrm_text1_{dataset}_output-large/
│   ├── config.json
│   ├── pytorch_model.bin
│   ├── tokenizer.json
│   ├── best_model.bin
│   └── checkpoint.pth

Environment Setup

Quick Start with Google Colab

Click the Colab badge above to get started immediately with a pre-configured environment including all dependencies.

Local Installation

pip install torch transformers datasets tqdm huggingface_hub
pip install langdetect  # Optional: for language filtering

Environment Variables

# Required for model upload
export HF_TOKEN="your_huggingface_token"

# Optional: custom output path
export HRM_OUTPUT_BASE="/your/custom/path"

Model Variants

The training script produces several model variants:

  • HRM-Text1-C4-large: Trained on C4 multilingual
  • HRM-Text1-Mixed-large: Trained on balanced dataset mixture
  • HRM-Text1-Spanish-large: Spanish language variant
  • HRM-Text1-Custom-{name}-large: Custom mixture variants

Performance

Model Specifications

  • Parameters: ~99M trainable parameters (Large variant)
  • Memory Usage: ~4-6GB VRAM for inference
  • Training Time: Varies by dataset size and hardware
  • Context Length: 512 tokens

Generation Quality

The model implements sophisticated reasoning through:

  • Hierarchical processing of information
  • Adaptive computation based on input complexity
  • Pondering mechanism for quality-vs-speed trade-offs

License

This model and training code are released under the Apache 2.0 License.

Citation

@misc{hrm-text1-2024,
  title={HRM-Text1: Hierarchical Reasoning Model for Text Generation},
  author={DreamWar},
  year={2024},
  url={https://huggingface.co/dreamwar/HRM-Text1}
}

Troubleshooting

Common Issues

  1. Memory Errors: Reduce batch size or enable gradient checkpointing
  2. Dataset Loading: Ensure stable internet connection for streaming
  3. CUDA Errors: Update PyTorch and CUDA drivers
  4. Language Detection: Install langdetect for language filtering

Support

For issues and questions:

  • Check the training script comments for detailed configuration
  • Review error messages for specific guidance
  • Ensure proper environment setup and dependencies

This model was trained using the HRM (Hierarchical Reasoning Module) architecture with adaptive computation for improved text generation capabilities.