# BitTransformerLM Claude Code Integration Guide

## Overview

BitTransformerLM is designed to work well with [Claude Code](https://claude.ai/code), which can assist with setup, development, and research workflows. This document provides guidelines for working with BitTransformerLM both in Claude Code and in standalone development.

## Why Claude Code?

BitTransformerLM's bit-native architecture introduces several complexities that Claude Code can help navigate:

- **Complex Architecture**: Understanding bit-level processing, reversible layers, and safety telemetry
- **Parameter Tuning**: Optimizing model configurations for different use cases
- **Safety Monitoring**: Interpreting K/C/S metrics and configuring safety gates
- **Distributed Training**: Setting up FSDP and pipeline parallelism correctly
- **Debugging**: Identifying issues specific to bit-native processing

Claude Code can read the repository alongside you and provide real-time assistance in each of these areas.

---
## Repository Scope and Architecture

### Core Capabilities

BitTransformerLM implements bit-native language modeling with:

- **Bit-Native Processing**: Direct binary sequence modeling with parity protection
- **Reversible Layers**: Memory-efficient transformer blocks that save ~50% memory
- **Safety Telemetry**: Real-time K/C/S (Negentropy/Complexity/Symbiosis) monitoring
- **Diffusion Mode**: Bidirectional denoising with multiple noise schedules
- **Progressive Scaling**: Automatic model expansion based on validation performance
- **Distributed Training**: FSDP and pipeline parallelism for large-scale training
- **Interactive Dashboard**: Real-time training control and visualization
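
To make "binary sequence modeling with parity protection" concrete, the following is a minimal sketch that assumes one even-parity bit is appended to every 8-bit byte, giving 9 bits per byte. The function names and the exact layout are illustrative assumptions; the project's `bit_io.py` may encode, validate, or recover from errors differently.

```python
from typing import List


def text_to_bits(text: str) -> List[int]:
    """Encode UTF-8 text as bits, appending one even-parity bit per byte (assumed layout)."""
    bits: List[int] = []
    for byte in text.encode("utf-8"):
        data = [(byte >> shift) & 1 for shift in range(7, -1, -1)]  # MSB first
        bits.extend(data)
        bits.append(sum(data) % 2)  # parity bit makes each 9-bit group sum to an even number
    return bits


def bits_to_text(bits: List[int]) -> str:
    """Decode 9-bit groups back to text, raising if any parity check fails."""
    if len(bits) % 9 != 0:
        raise ValueError("bit stream length must be a multiple of 9")
    out = bytearray()
    for start in range(0, len(bits), 9):
        *data, parity = bits[start:start + 9]
        if sum(data) % 2 != parity:
            raise ValueError(f"parity mismatch in byte {start // 9}")
        out.append(int("".join(map(str, data)), 2))
    return out.decode("utf-8")
```

A single flipped bit inside any 9-bit group changes that group's parity, so `bits_to_text` rejects the stream instead of silently decoding a corrupted byte.
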
### Experimental Status

**Important**: BitTransformerLM is experimental research software requiring:

- Rigorous baseline comparisons against standard transformers
- Validation on established language modeling benchmarks
- Statistical significance testing across multiple runs
- Careful interpretation of safety metrics and claims

---

## Environment Setup

### Requirements

- **Python 3.10+** (required for modern PyTorch features)
- **PyTorch 2.7.1+** with appropriate CUDA support if using GPUs
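
Before installing anything else, it can help to confirm that the interpreter and any existing PyTorch build meet these minimums. The sketch below is a hypothetical helper, not part of the repository; it only relies on `sys.version_info`, `torch.__version__`, and `torch.cuda.is_available()`.

```python
import sys
from typing import Dict, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a string such as '2.7.1+cu118' into (2, 7, 1), ignoring the build suffix."""
    release = version.split("+", 1)[0]
    numbers = []
    for piece in release.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        numbers.append(int(digits))
    return tuple(numbers)


def meets_minimum(version: str, minimum: Tuple[int, ...]) -> bool:
    """Compare numerically so that '2.10.0' correctly ranks above '2.7.1'."""
    return parse_version(version) >= minimum


def check_environment() -> Dict[str, bool]:
    """Report whether Python and (if installed) PyTorch satisfy the stated requirements."""
    report = {"python_ok": sys.version_info[:2] >= (3, 10)}
    try:
        import torch  # imported lazily so the check also runs before PyTorch is installed
    except ImportError:
        report["torch_ok"] = False
        return report
    report["torch_ok"] = meets_minimum(torch.__version__, (2, 7, 1))
    report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    print(check_environment())
```
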
### Installation Options

#### CPU-Only Installation

```bash
pip install --extra-index-url https://download.pytorch.org/whl/cpu -r requirements.txt
```

#### GPU Installation

```bash
pip install --extra-index-url https://download.pytorch.org/whl/cu118 torch==2.7.1+cu118
pip install -r requirements.txt
```

#### Claude Code Assisted Setup

When using Claude Code, you can ask for setup help with prompts such as:

- "Help me set up BitTransformerLM for my system"
- "Configure BitTransformerLM for GPU training"
- "Set up a development environment for bit-native language modeling"

Claude Code will guide you through hardware detection, dependency installation, and initial configuration.

---
## Repository Structure

```
BitTransformerLM/
├── bit_transformer/          # Core package
│   ├── model.py              # BitTransformerLM architecture
│   ├── telemetry.py          # K/C/S safety metrics
│   ├── safety.py             # Safety gates and monitoring
│   ├── bit_io.py             # Text ↔ bits conversion
│   ├── compression.py        # Run-length encoding
│   ├── training.py           # Training utilities
│   ├── distributed.py        # FSDP and pipeline parallelism
│   ├── dashboard_app.py      # Interactive web dashboard
│   ├── quantization.py       # INT8/4-bit quantization
│   └── [other modules...]    # Additional functionality
├── tests/                    # Test suite and results
├── example.py                # Basic usage example
├── unified_workflow.py       # Main training script
├── mcp_server.py             # Management Control Protocol server
├── USER_GUIDE.md             # Comprehensive user documentation
└── [other scripts...]        # Utilities and examples
```
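
The tree lists `compression.py` as providing run-length encoding. As a rough illustration of the idea, the sketch below stores a bit sequence as `(bit, run_length)` pairs and reverses the process. The function names and the pair representation are assumptions made for this example; the module's actual interface and storage format are not shown in this guide. Run-length encoding only pays off when the stream contains long runs of identical bits; an alternating stream grows rather than shrinks.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def rle_encode(bits: Sequence[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive identical bits into (bit, run_length) pairs."""
    return [(bit, len(list(run))) for bit, run in groupby(bits)]


def rle_decode(pairs: Sequence[Tuple[int, int]]) -> List[int]:
    """Expand (bit, run_length) pairs back into the original bit list."""
    bits: List[int] = []
    for bit, length in pairs:
        bits.extend([bit] * length)
    return bits
```
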

---

## Development Workflow with Claude Code

### Getting Started

1. **Initial Setup**

   ```
   "Help me understand BitTransformerLM's architecture"
   "Create a simple training script for bit-native language modeling"
   "Explain the difference between causal and diffusion modes"
   ```

2. **Model Configuration**

   ```
   "Configure a BitTransformerLM for [my specific use case]"
   "What are optimal hyperparameters for a [size] model?"
   "Help me enable reversible layers and gradient checkpointing"
   ```

3. **Training and Monitoring**

   ```
   "Set up distributed training with FSDP"
   "Interpret these K/C/S telemetry values: K=0.3, C=0.6, S=0.4"
   "Debug this memory error during training"
   ```

### Claude Code Advantages

**Real-time Assistance**: Get immediate help with:

- Parameter configuration and tuning
- Error diagnosis and resolution
- Architecture modification and experimentation
- Safety metric interpretation
- Performance optimization

**Context-Aware Suggestions**: Claude Code understands:

- BitTransformerLM's unique bit-native processing
- The relationship between safety metrics
- Memory optimization strategies
- Distributed training complexities

---
## Key Commands and Workflows

### Basic Usage

```bash
# Run simple example
python example.py

# Launch interactive dashboard
python unified_workflow.py --dashboard

# Train with diffusion mode
python unified_workflow.py --diffusion --diffusion-steps 8 --dataset-size 32
```

### Advanced Training

```bash
# Distributed training with FSDP
python unified_workflow.py --distributed --batch-size 2 --epochs 10

# Mixed precision with quantization
python unified_workflow.py --amp --qat

# Progressive scaling with curriculum learning
python unified_workflow.py --progressive --diffusion-curriculum
```

### Dashboard and Monitoring

```bash
# Start MCP server and dashboard
python mcp_server.py &
MCP_SERVER_ADDR=http://127.0.0.1:7000 python -m bit_transformer.dashboard_app
```

**Dashboard Features:**

- Real-time telemetry visualization
- Interactive model configuration
- HuggingFace checkpoint management
- Safe inference testing interface

---
## Safety and Telemetry

### Core Metrics

| Metric | Full Name | Range | Interpretation |
|--------|-----------|-------|----------------|
| **K** | Negentropy | 0-1 | Information content (0=noise, 1=ordered) |
| **C** | LZ Complexity | 0-1 | Pattern complexity (higher=more complex) |
| **S** | Symbiosis | 0-1 | Alignment with reference (higher=better) |
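
For intuition about the ranges in the table, the sketch below computes simple stand-ins for K and C on a raw bit sequence. It is not the project's `telemetry.py` implementation. It assumes K can be approximated as one minus the binary entropy of the fraction of ones, and C as a clipped zlib compression ratio (zlib is LZ77-based, so this is only a proxy for LZ complexity and is only meaningful for long sequences, because short inputs are dominated by header overhead). S is omitted because it depends on a reference distribution that this guide does not define.

```python
import math
import zlib
from typing import Sequence


def negentropy(bits: Sequence[int]) -> float:
    """Proxy for K: 1 - H(p), where H is the binary entropy of the fraction of ones."""
    if not bits:
        raise ValueError("need at least one bit")
    p = sum(bits) / len(bits)
    if p in (0.0, 1.0):
        return 1.0  # a constant stream has zero entropy
    entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return 1.0 - entropy


def pack_bits(bits: Sequence[int]) -> bytes:
    """Pack bits into bytes (MSB first), zero-padding the final byte."""
    padded = list(bits) + [0] * (-len(bits) % 8)
    return bytes(
        int("".join(map(str, padded[i:i + 8])), 2) for i in range(0, len(padded), 8)
    )


def compression_complexity(bits: Sequence[int]) -> float:
    """Proxy for C: compressed size over packed size, clipped to at most 1.0."""
    if not bits:
        raise ValueError("need at least one bit")
    packed = pack_bits(bits)
    return min(1.0, len(zlib.compress(packed, 9)) / len(packed))
```

Note one limitation of the K proxy: it only looks at the bit frequency, so a perfectly regular alternating stream (`0101...`) scores 0 even though it is highly ordered. Frequency-based and pattern-based measures capture different things, which is one reason to read K and C together.
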
### Using with Claude Code

```
"Explain what K=0.2, C=0.8, S=0.3 means for my model"
"Configure safety gates for production use"
"My model is generating repetitive output, what safety metrics should I check?"
"Set up drift detection for telemetry monitoring"
```

Claude Code can help interpret these metrics in context and suggest appropriate safety thresholds.

### Safety Gate Configuration

```python
from bit_transformer.safety import SafetyGate

# Production-ready safety gate
gate = SafetyGate(
    c_floor=0.3,   # Minimum complexity
    s_floor=0.5,   # Minimum symbiosis
    decay=0.9,     # EMA decay factor
    burn_in=10     # Steps before gating starts
)
```
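
The parameter names above suggest a gate that smooths C and S with an exponential moving average (EMA), ignores the first `burn_in` steps, and then trips when either smoothed value falls below its floor. The sketch below implements that reading with a hypothetical `EmaFloorGate` class. It is an assumption about the behaviour, not the actual `SafetyGate` code, and the `update` method name is invented for the example. With `decay=0.9`, one bad reading moves the average by only 10% of the gap, so a single dip does not trip the gate while a sustained drop does.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EmaFloorGate:
    """Hypothetical stand-in that illustrates the parameters; not the project's SafetyGate."""

    c_floor: float = 0.3   # minimum smoothed complexity
    s_floor: float = 0.5   # minimum smoothed symbiosis
    decay: float = 0.9     # weight kept from the previous EMA value
    burn_in: int = 10      # number of initial steps that never trigger
    step: int = 0
    c_ema: Optional[float] = None
    s_ema: Optional[float] = None

    def update(self, c: float, s: float) -> bool:
        """Fold in one telemetry reading; return True if the gate should trip."""
        self.step += 1
        if self.c_ema is None or self.s_ema is None:
            self.c_ema, self.s_ema = c, s  # seed the averages with the first reading
        else:
            self.c_ema = self.decay * self.c_ema + (1 - self.decay) * c
            self.s_ema = self.decay * self.s_ema + (1 - self.decay) * s
        if self.step <= self.burn_in:
            return False
        return self.c_ema < self.c_floor or self.s_ema < self.s_floor
```
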

---

## Best Practices for Claude Code Development

### 1. **Always Validate Research Claims**

Ask Claude Code to help you:

- Set up proper baseline comparisons
- Design statistical significance tests
- Implement evaluation on standard benchmarks
- Document limitations and assumptions

### 2. **Use Progressive Development**

```
"Start me with a minimal BitTransformerLM example"
"Now add safety monitoring"
"Scale up to distributed training"
"Add diffusion mode capabilities"
```

### 3. **Leverage Claude Code for Architecture Understanding**

```
"Explain how reversible layers save memory"
"Walk me through the bit encoding process"
"How does the safety telemetry system work?"
"Compare BitTransformerLM to standard transformers"
```
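
As a preview of the first question, the sketch below shows the additive coupling used in RevNet-style reversible blocks, reduced to scalars so it runs without PyTorch. Whether BitTransformerLM's reversible layers use exactly this coupling is an assumption; the point is only that the block's inputs can be recomputed from its outputs, so intermediate activations need not be stored for the backward pass.

```python
from typing import Callable, Tuple

Fn = Callable[[float], float]


def reversible_forward(x1: float, x2: float, f: Fn, g: Fn) -> Tuple[float, float]:
    """Additive coupling: each half is updated using a function of the other half."""
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2


def reversible_inverse(y1: float, y2: float, f: Fn, g: Fn) -> Tuple[float, float]:
    """Recover the inputs from the outputs by undoing the two updates in reverse order."""
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2
```

In a real layer, `f` and `g` would be the attention and feed-forward sub-blocks and the values would be tensors. The trade-off is extra compute during the backward pass, since the inputs are reconstructed instead of being read from memory.
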

### 4. **Get Help with Complex Configurations**

```python
# Import path follows the repository structure (bit_transformer/model.py)
from bit_transformer.model import BitTransformerLM

# Ask Claude Code to help configure models like:
model = BitTransformerLM(
    d_model=1024,          # Claude Code can suggest optimal values
    nhead=16,              # Based on your hardware and use case
    num_layers=20,
    dim_feedforward=4096,
    max_seq_len=2048,
    reversible=True,       # Memory optimization
    use_checkpoint=True,   # Gradient checkpointing
    chunk_size=256,        # Attention chunking
    lambda_K=0.1,          # Regularization weights
    lambda_C=0.1,
    lambda_S=0.1
)
```

---
## Development Guidelines

### Code Style

- **Functions**: `snake_case` (e.g., `train_loop`, `safe_inference`)
- **Classes**: `CamelCase` (e.g., `BitTransformerLM`, `SafetyGate`)
- **Constants**: `UPPER_SNAKE_CASE` (e.g., `MAX_SEQ_LEN`)
- **Keep functions under 300 lines** and minimize deep nesting

### Security and Safety

- **Never reintroduce the deprecated `/exec` endpoint**
- **Always use safety gates in production**
- **Validate all user inputs** in dashboard and API endpoints
- **Monitor telemetry metrics** for anomalous behavior
- **Use the `cpu_autocast()` helper** instead of calling `torch.amp.autocast` directly

### Memory Management

```python
# Import paths follow the repository structure shown earlier
from bit_transformer.model import BitTransformerLM
from bit_transformer.training import train_loop

# Good: Memory-efficient configuration
model = BitTransformerLM(
    reversible=True,          # Enable reversible layers
    use_checkpoint=True,      # Gradient checkpointing
    chunk_size=128,           # Chunked attention
    full_attn_logging=False   # Skip full attention reconstruction
)

# Training with memory optimizations
# `data` is your own training set of bit sequences (not defined here)
train_loop(
    model, data,
    amp=True,            # Mixed precision
    accum_steps=4,       # Gradient accumulation
    compile_model=True   # torch.compile optimization
)
```

### Testing and Validation

```bash
# Run tests after changes
pytest -q
```

```python
# Model evaluation modes
model.train()             # For training
model.eval()              # For inference/evaluation
set_dropout(model, 0.0)   # Project helper: disable dropout for reproducible results
```

---
## Getting Help from Claude Code

### Specific Areas Where Claude Code Excels

1. **Architecture Design**
   - "Design a BitTransformerLM architecture for [specific task]"
   - "Optimize this model configuration for memory efficiency"
   - "Explain the trade-offs between reversible and standard layers"

2. **Training Optimization**
   - "My training is running out of memory, help optimize"
   - "Configure distributed training for 4 GPUs"
   - "Set up a training curriculum for bit-native language modeling"

3. **Safety and Monitoring**
   - "Interpret these telemetry readings and suggest adjustments"
   - "Set up production-ready safety monitoring"
   - "Debug why my safety gate is triggering constantly"

4. **Research and Evaluation**
   - "Design a rigorous evaluation comparing BitTransformerLM to GPT-2"
   - "Set up statistical significance testing for my experiments"
   - "Help me write up my research findings honestly"

### Sample Claude Code Interactions

```
User: "I'm getting OOM errors training a 500M parameter BitTransformerLM"

Claude Code: "Let me help optimize your memory usage. Here's a configuration
that should work better for your model size:

model = BitTransformerLM(
    d_model=768,
    nhead=12,
    num_layers=12,
    reversible=True,          # Critical for large models
    use_checkpoint=True,      # Trade compute for memory
    chunk_size=64,            # Reduce attention memory
    full_attn_logging=False   # Skip expensive logging
)

train_loop(
    model, data,
    batch_size=1,     # Small batch size
    accum_steps=16,   # Maintain effective batch size
    amp=True          # Mixed precision training
)

This may reduce memory usage by roughly 60% compared to the standard
configuration; measure peak memory before and after to confirm."
```
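
The "maintain effective batch size" comment in the sample relies on simple arithmetic: the number of samples contributing to each optimizer update is the per-device batch size times the accumulation steps times the number of devices. The helper names below are hypothetical and exist only to make that relationship explicit. For example, dropping from a batch size of 4 to 1 on a single device requires raising `accum_steps` from 4 to 16 to keep 16 samples per update.

```python
def effective_batch_size(batch_size: int, accum_steps: int, world_size: int = 1) -> int:
    """Samples contributing to one optimizer update across all devices."""
    return batch_size * accum_steps * world_size


def accum_steps_for(target: int, batch_size: int, world_size: int = 1) -> int:
    """Smallest accumulation count that reaches at least `target` samples per update."""
    per_step = batch_size * world_size
    return -(-target // per_step)  # ceiling division without floats
```
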

---

## Licensing and Distribution

BitTransformerLM is available under dual licensing:

- **Open Source**: AGPLv3 for research and open source use
- **Commercial**: Contact **contact@wcnegentropy.com** for commercial licensing

When working with Claude Code, ensure compliance with the AGPLv3 license for any derivatives or modifications you create.

---

## Research Integrity

**Important Reminder**: BitTransformerLM is experimental research software. When using Claude Code:

1. **Always validate claims** through proper baseline comparisons
2. **Document limitations** honestly in any publications or reports
3. **Use statistical significance testing** for any performance claims
4. **Follow established ML research best practices**
5. **Share negative results** as well as positive ones

Claude Code can help you design rigorous experiments and avoid common pitfalls in ML research.
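
One way to apply the significance-testing point when comparing two models trained with the same seeds is an exact paired sign-flip permutation test on the per-seed score differences. The sketch below is a generic illustration using only the standard library, not a procedure prescribed by the project, and the function name is hypothetical. It enumerates all 2^n sign assignments, so it is practical for roughly 20 seeds or fewer. Note that with 5 seeds the smallest achievable two-sided p-value is 2/32 = 0.0625, so at least 6 paired runs are needed before a result can reach p < 0.05.

```python
from itertools import product
from statistics import mean
from typing import Sequence


def paired_permutation_pvalue(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sided exact sign-flip test on the paired differences a[i] - b[i]."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty score lists")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    hits = 0
    total = 0
    # Under the null hypothesis each difference is equally likely to have either sign.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(mean([s * d for s, d in zip(signs, diffs)]))
        if flipped >= observed - 1e-12:  # tolerance guards against float round-off
            hits += 1
    return hits / total
```
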

---

## Support and Community

### Getting Help

- **Claude Code**: Real-time AI assistance with BitTransformerLM
- **GitHub Issues**: Bug reports and feature requests
- **Discussions**: Community questions and sharing
- **User Guide**: Comprehensive documentation (`USER_GUIDE.md`)
- **Project Overview**: Complete project information (`ABOUTME.md`)

### Contributing

When contributing to BitTransformerLM:

1. Use Claude Code to ensure code quality and consistency
2. Follow the development guidelines in this document
3. Add tests for new functionality
4. Update documentation as needed
5. Ensure all safety and security practices are followed

---

**BitTransformerLM + Claude Code provides a powerful combination for exploring bit-native language modeling with AI assistance. Start experimenting responsibly and share your findings with the research community!** 🤖✨