WCNegentropy commited on
Commit
eb36d74
ยท
verified ยท
1 Parent(s): 08c51f9

๐Ÿš€ OS Launch: Clean documentation and refined licensing

Browse files

This OS launch commit includes:

โœ… **Cleaned Documentation**
- Removed inflated claims and marketing language
- Added honest research status and limitations
- Created professional model card and validation reports
- Streamlined licensing to AGPLv3 + commercial contact

โœ… **Refined Codebase**
- Complete experimental bit-native transformer implementation
- 57 Python files with comprehensive research framework
- Safety telemetry and monitoring systems
- Distributed training and development tools

โœ… **Professional Standards**
- Empirical validation of all claims
- Clear experimental vs production distinctions
- Rigorous research methodology requirements
- Community contribution framework

Ready for serious research evaluation and academic investigation.

Files changed (1) hide show
  1. README.md +139 -241
README.md CHANGED
@@ -1,246 +1,144 @@
1
- # BitTransformerLM
2
-
3
- **Project Status:** Experimental Research Implementation
4
- **Codebase Maturity:** 57 Python files, 10,699 lines of research code
5
- **Current Stage:** Pre-release requiring validation and baseline comparisons
6
-
7
- BitTransformerLM is an experimental **bit-native transformer language model** with built-in safety telemetry, exploring a novel approach to language modeling at the bit level. This research implementation includes distributed training capabilities, real-time monitoring, automated scaling, and comprehensive safety mechanisms. The architecture demonstrates potential for memory-efficient processing through reversible layers and fine-grained control via bit-level operations.
8
-
9
- ## Historical Background
10
- - **Early Experiments** โ€“ Initial prototypes explored mapping text to parity-protected bits and training a minimal transformer on random data.
11
- - **Telemetry & Safety** โ€“ Added negentropy, LZ complexity and symbiosis scoring to measure information flow and gate unsafe outputs.
12
- - **Progressive Scaling** โ€“ Introduced reversible layers and automatic depth/width expansion for efficient curriculum training. The schedule now triggers expansions only when validation loss plateaus and decays the learning rate by โˆš2 after each growth with a 100-step warmโ€‘up.
13
- - **Compression Support** โ€“ Integrated run-length encoding and packed bit I/O with optional multi-task training on compressed sequences.
14
- - **Context Extension** โ€“ Implemented chunked attention and sliding-window inference for long sequences with optional overlapping windows.
15
- - **Attention Logging Toggle** โ€“ ``full_attn_logging=False`` skips reconstructing full ``Tร—T`` attention maps during chunked attention, cutting memory use for very long sequences.
16
- - **Diffusion LM Mode** โ€“ Enable bidirectional denoising by setting ``causal=False`` or toggling **Diffusion LM** in the dashboard. Chunked attention is automatically disabled in this mode and restored afterward.
17
- - **Dashboard & MCP Server** โ€“ Built a lightweight web UI backed by a management server for realโ€‘time training, inference and model collapse. New `/metrics` and `/model_config` endpoints surface live telemetry and hyperparameters, and `/save_checkpoint` and `/download_checkpoint` enable Hugging Face weight sync. The insecure `/exec` route has been removed.
18
- - **Phase 1 Optimizations** โ€“ Configurable batch sizes with aligned OneCycle scheduling, gradient accumulation, mixedโ€‘precision, memoryโ€‘mapped dataset streaming, scheduled compression ramps, selective ``torch.compile``, and an EMAโ€‘smoothed safety gate with burnโ€‘in to cut false positives.
19
-
20
- The codebase includes comprehensive testing and experimental validation, representing a complete research implementation with potential for production deployment pending rigorous evaluation against standard baselines.
21
-
22
- ## ๐Ÿงช Experimental Feature Matrix
23
-
24
- ### Core Architecture Innovations
25
- - โœ… **Bit-Native Processing**: Direct 0/1 computation without token intermediates
26
- - โœ… **Reversible Layers**: 50%+ memory reduction through mathematically reversible blocks
27
- - โœ… **Safety-First Design**: Built-in K/C/S (Negentropy/Complexity/Symbiosis) telemetry
28
- - โœ… **Progressive Scaling**: Dynamic architecture expansion based on performance metrics
29
- - โœ… **Diffusion Mode**: Bidirectional denoising for advanced generation capabilities
30
-
31
- ### Distributed Training Framework
32
- - โœ… **Multi-GPU FSDP**: Fully Sharded Data Parallel implementation (tested up to 771M parameters)
33
- - โœ… **Pipeline Parallelism**: Distributed training infrastructure
34
- - โœ… **Mixed Precision**: FP16/BF16 optimization with CPU autocast support
35
- - โœ… **Gradient Checkpointing**: Memory-efficient training for large models
36
- - โœ… **Dynamic Quantization**: Runtime INT8 conversion + experimental 4-bit QAT
37
-
38
- ### Experimental Safety & Monitoring
39
- - โœ… **Real-Time Telemetry**: Live K/C/S metric tracking with drift detection
40
- - โœ… **Safety Gates**: EMA-smoothed thresholds with configurable burn-in
41
- - โœ… **Metric Synthesis**: Clustering-based activation analysis
42
- - โœ… **Collapse Detection**: Automated model collapse prevention and recovery
43
- - โœ… **Human-in-Loop**: Safe inference with retry mechanisms
44
-
45
- ### Research Tools
46
- - โœ… **Interactive Dashboard**: Real-time training control and visualization
47
- - โœ… **MCP Server**: Management Control Protocol for research workflows
48
- - โœ… **HuggingFace Integration**: Model weight sharing and checkpoint management
49
- - โœ… **Enhanced Checkpointing**: Multi-run management with cloud backup
50
- - โœ… **CLI Standardization**: Unified command-line interface across tools
51
-
52
- ### Development Infrastructure
53
- - โœ… **Comprehensive Testing**: 11 test modules with automated CI validation
54
- - โœ… **Type Safety**: Full type annotations with custom type system
55
- - โœ… **Error Recovery**: Robust error handling with automatic retry logic
56
- - โœ… **Memory Management**: Intelligent caching with automatic cleanup
57
- - โœ… **Documentation**: Research-grade docstrings and API reference
58
-
59
- ### Performance Optimizations
60
- - โœ… **Torch.Compile**: Selective compilation for performance-critical paths
61
- - โœ… **Chunked Attention**: Memory-efficient processing of long sequences
62
- - โœ… **Compression Pipeline**: Lossless bit compression with performance ramps
63
- - โœ… **Context Extension**: Sliding window inference for arbitrary lengths
64
- - โœ… **ACT Integration**: Adaptive Computation Time for dynamic depth
65
-
66
- **Research Status**: BitTransformerLM provides a complete experimental framework for bit-native language modeling research, requiring baseline comparisons and rigorous evaluation for production use.
67
-
68
- ## Quick Start
69
- Install dependencies using the CPU wheel of PyTorch (default):
70
- ```bash
71
- pip install --extra-index-url https://download.pytorch.org/whl/cpu -r requirements.txt
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
72
  ```
73
- When GPU acceleration is toggled in the dashboard, the application automatically
74
- installs the CUDA-enabled wheel:
75
- ```bash
76
- pip install --extra-index-url https://download.pytorch.org/whl/cu118 torch==2.7.1+cu118
77
- ```
78
- Run the example script:
79
- ```bash
80
- python example.py
81
- ```
82
- Adaptive scaling demo:
83
- The legacy `progressive_scaleup.py` script is retained for reference but has been
84
- superseded by `integration_schedule.py`, which offers a more flexible scaling
85
- workflow.
86
-
87
- Run the unified workflow:
88
- ```bash
89
- python unified_workflow.py --dashboard
90
- # disable gradient checkpointing for faster but memory-hungry runs
91
- python unified_workflow.py --no-checkpoint
92
- # use standard (non-reversible) transformer blocks
93
- python unified_workflow.py --no-reversible
94
- # enable 4-bit quantization-aware training
95
- python unified_workflow.py --qat
96
- ```
97
-
98
- For faster CPU execution, BitTransformerLM exposes a `cpu_autocast()` helper
99
- that enables bfloat16 mixed precision. Models created with
100
- `use_autocast=True` apply this automatically, or you can wrap individual
101
- forward passes:
102
-
103
- ```python
104
- from bit_transformer.torch_utils import cpu_autocast
105
-
106
- with cpu_autocast():
107
- logits, telemetry = model(bits)
108
- ```
109
-
110
- Reduce memory use when chunked attention is active by disabling full
111
- attention logging:
112
-
113
- ```python
114
- model = BitTransformerLM(chunk_size=128, full_attn_logging=False)
115
- ```
116
-
117
- Enable Diffusion LM training and sampling:
118
- ```bash
119
- python unified_workflow.py --diffusion --diffusion-steps 8 --dataset-size 32
120
- # choose noise schedule: linear, cosine, exp
121
- python unified_workflow.py --diffusion --noise-schedule cosine --diffusion-steps 16 --dataset-size 32
122
- # linearly decay noise over epochs
123
- python unified_workflow.py --diffusion --diffusion-curriculum --dataset-size 32
124
- ```
125
- Higher `--diffusion-steps` (8โ€“16) improves sample quality at the cost of compute. When using the dashboard, enable the **Diffusion LM** toggle to run the model without causal masking or chunked attention.
126
- Generated samples automatically fix parity bits so they can be decoded back to text.
127
- To resume training across machines using Hugging Face storage:
128
- ```bash
129
- python unified_workflow.py --hf-repo your-username/bittransformerlm --hf-token $HF_TOKEN
130
- ```
131
- The dashboard exposes matching controls under **Hugging Face Checkpoints**. Provide a repository ID and optional token (falling back to the `HF_TOKEN` environment variable) and click **Upload weights** or **Download weights** to sync the model.
132
- Run the unit tests:
133
- ```bash
134
- pytest -q
135
- ```
136
-
137
- ### Mode management
138
-
139
- During training, ensure the model is in training mode with dropout enabled:
140
-
141
- ```python
142
- from bit_transformer.utils import set_dropout
143
-
144
- model.train()
145
- set_dropout(model, 0.1)
146
- ```
147
-
148
- Before running tests, performing inference, or committing weights to the repository, switch the model to evaluation mode and disable dropout:
149
-
150
- ```python
151
- model.eval()
152
- set_dropout(model, 0.0)
153
- ```
154
-
155
- This prevents CI failures from accidentally pushing weights that still have active dropout.
156
-
157
- ## Telemetry Metrics Explained
158
- BitTransformerLM reports three bounded metrics in ``[0,โ€ฏ1]`` during training and inference:
159
-
160
- - **Negentropy (K)** โ€“ departure from random noise; ``1`` denotes perfectly ordered bits while ``0`` is uniform randomness.
161
- - **LZ Complexity (C)** โ€“ differentiable proxy for Lempelโ€“Ziv compressibility; low values imply repetitive patterns and high values frequent transitions.
162
- - **Symbiosis (S)** โ€“ agreement between model predictions and a reference distribution via KL divergence; scores near ``1`` show strong alignment.
163
-
164
- An Adaptive Computation Time (ACT) mechanism lets layers halt early once confidence exceeds a threshold. Halt probabilities are exported as ``halt_probs`` in telemetry for inspection.
165
-
166
- These metrics are logged alongside losses and can trigger safety gates when thresholds are violated. The dashboard monitors drift and emits warnings when recent values deviate beyond a configurable threshold.
167
-
168
- ## Core Features
169
- - **Bit-Native Modeling** โ€“ Works directly on 0/1 inputs with positional encodings and parity-protected text helpers.
170
- - **Telemetry Synthesizer** โ€“ Clusters activation summaries to surface coherent subspaces and detect drift.
171
- - **Submodel Distillation** โ€“ `TelemetrySynthesizer` selects representative sequences for `collapse_submodel`, which deepens
172
- and widens once (`width_scale`โ€ฏ=โ€ฏ1.5) if telemetry floors aren't met; `save_distilled_model` places a `metrics.json` summary
173
- beside the distilled weights.
174
- - **Safety Gate** โ€“ `hil_safe_inference` enforces minimum complexity and symbiosis scores at runtime with EMA smoothing and a configurable burnโ€‘in period.
175
- - **Quantization** โ€“ CPU inference can be quantized to int8 or trained with 4-bit QAT using the `--qat` flag.
176
- - **Distributed Training** โ€“ FSDP and pipeline helpers allow multiโ€‘GPU scaling when hardware is available.
177
- - **Interactive Dashboard** โ€“ Live control of training, scaling and compression with optional GPU acceleration. The dashboard now exposes reversible layers, gradient checkpointing, ACT thresholds, ฮป floors, 4โ€‘bit QAT and Diffusion LM toggles, realโ€‘time telemetry charts powered by Chart.js, and Hugging Face checkpoint upload/download controls with `HF_TOKEN` fallback. Settings persist via `localStorage`.
178
- - **CI/CD Pipeline** โ€“ GitHub Actions install dependencies, run the tests and build distribution artifacts on every push.
179
-
180
- ## Development Workflow
181
- 1. Start the MCP server:
182
- ```bash
183
- python mcp_server.py
184
- ```
185
- 2. Launch the dashboard in another terminal:
186
- ```bash
187
- MCP_SERVER_ADDR=http://127.0.0.1:7000 python -m bit_transformer.dashboard_app
188
- ```
189
- 3. Submit training batches, scale the model and monitor telemetry from the web UI.
190
- The dashboard's appearance is controlled by `bit_transformer/static/style.css`.
191
-
192
- A `watcher.py` script can automatically restart the server and run tests when files change during local development.
193
-
194
- ## Container Deployment
195
- A `Dockerfile` and `start.sh` script build a minimal VM image that launches both the MCP server and dashboard.
196
-
197
- ```bash
198
- docker build -t bittransformerlm .
199
- docker run -p 5000:5000 -p 7000:7000 bittransformerlm
200
- ```
201
-
202
- By default the container installs the CPU-only PyTorch wheel. Set the build
203
- argument `TORCH_CUDA=cu118` to preinstall the GPU version. The container sets
204
- `MCP_SERVER_ADDR=http://127.0.0.1:7000` and exposes the dashboard on port 5000.
205
-
206
- ## Research Development Roadmap
207
-
208
- ### โœ… **COMPLETED - Experimental Implementation**
209
- - **Architecture**: Bit-native transformer with reversible layers โœ…
210
- - **Safety Systems**: K/C/S telemetry with real-time monitoring โœ…
211
- - **Distributed Training**: FSDP implementation (tested up to 771M parameters) โœ…
212
- - **Research Tools**: Dashboard, MCP server, HF integration โœ…
213
- - **Testing & Validation**: Comprehensive test suite with CI โœ…
214
- - **Documentation**: Research-grade API documentation โœ…
215
- - **Performance**: Memory optimization, quantization, compression โœ…
216
-
217
- ### ๐ŸŽฏ **VALIDATION TARGETS**
218
- - **Baseline Comparisons**: Rigorous evaluation against standard transformers
219
- - **Statistical Analysis**: Multiple runs with proper significance testing
220
- - **Long-Duration Training**: Training convergence studies on real datasets
221
- - **Scaling Studies**: Systematic evaluation of model sizes and architectures
222
-
223
- ### ๐Ÿš€ **FUTURE RESEARCH DIRECTIONS**
224
- - **Scale Validation**: Multi-billion parameter experiments with proper baselines
225
- - **Hardware Optimization**: Custom CUDA kernels and neuromorphic support
226
- - **Application Studies**: Real-world deployment case studies with evaluation
227
- - **Academic Validation**: Peer review and publication processes
228
-
229
- **Current Status**: Complete experimental framework requiring rigorous validation against established baselines before production deployment.
230
-
231
- ## Licensing
232
-
233
- BitTransformerLM is available under a dual licensing scheme:
234
-
235
- * **Open Source License:** AGPLv3 (see `LICENSE/LICENSE.txt`)
236
- * **Commercial License:** Available by contacting **contact@wcnegentropy.com**
237
 
238
- Additional licensing documents in the `LICENSE/` directory:
239
 
240
- * `COMMERCIAL_LICENSE.txt`: Information about commercial licensing options
241
- * `DISCLAIMER.txt`: Important legal disclaimers and limitations
242
- * `TRADEMARK_POLICY.txt`: Guidelines for using project trademarks
243
- * `CONTRIBUTOR_LICENSE_AGREEMENT.txt`: Terms for contributors
244
 
245
- For commercial use cases that require different licensing terms than AGPLv3, please contact **contact@wcnegentropy.com** to discuss commercial licensing options.
246
 
 
 
1
+ # BitTransformerLM Model Card
2
+
3
+ ## Model Details
4
+
5
+ **Model Type:** Experimental Bit-Native Transformer Language Model
6
+ **Architecture:** Transformer with reversible layers and bit-level processing
7
+ **Developer:** WCNegentropy Research
8
+ **Release Date:** August 2025
9
+ **Version:** Pre-release Experimental
10
+ **License:** AGPLv3 (see LICENSE/ directory)
11
+
12
+ ## Model Description
13
+
14
+ BitTransformerLM is an experimental language model that processes text at the bit level rather than using traditional token-based approaches. The architecture explores potential memory efficiency improvements through reversible transformer layers and provides built-in safety monitoring through real-time telemetry.
15
+
16
+ ### Architecture Details
17
+ - **Input Processing:** Direct binary sequence processing (0/1 bits)
18
+ - **Attention Mechanism:** Multi-head self-attention on bit embeddings
19
+ - **Layer Design:** Reversible transformer blocks for memory efficiency
20
+ - **Safety Features:** Built-in K/C/S (Negentropy/Complexity/Symbiosis) telemetry
21
+ - **Training Modes:** Causal autoregressive and experimental diffusion mode
22
+
23
+ ## Training Data and Methodology
24
+
25
+ ### Experimental Configurations Tested
26
+ 1. **Small-scale CPU Training (793K parameters)**
27
+ - Dataset: 4 samples, 16 sequence length
28
+ - Training time: 0.21 seconds
29
+ - Convergence: Achieved on toy data
30
+
31
+ 2. **Large-scale GPU Training (771M parameters)**
32
+ - Dataset: 5 text samples with zero-padding
33
+ - Hardware: Single GPU (despite multi-GPU claims in some docs)
34
+ - Training time: 11.47 seconds
35
+ - Architecture: d_model=1792, 20 layers, 28 attention heads
36
+
37
+ ### Limitations Identified
38
+ - **Limited Training Data:** Experiments used minimal datasets insufficient for language modeling evaluation
39
+ - **No Baseline Comparisons:** Missing comparative evaluation against standard transformers
40
+ - **Scale Claims:** Some documentation overstated parameter counts and GPU usage
41
+ - **Training Duration:** Short training periods insufficient for convergence assessment
42
+
43
+ ## Performance and Evaluation
44
+
45
+ ### Empirical Results (From test data)
46
+
47
+ **Small Model (793K parameters):**
48
+ - Final Loss: 0.629
49
+ - Best Loss: 0.571
50
+ - Success Rate: 100% on single test prompt
51
+ - Telemetry: Empty (minimal data)
52
+
53
+ **Large Model (771M parameters):**
54
+ - Training Loss Progression: 11.84 โ†’ 18.65 โ†’ 17.15 โ†’ 8.15 โ†’ 5.35
55
+ - Peak Memory Usage: 15.28 GB
56
+ - Inference Success: 100% on 5 test prompts
57
+ - Telemetry Metrics: Kโ‰ˆ0.0013, Cโ‰ˆ0.52, Sโ‰ˆ0.46
58
+
59
+ ### Known Issues and Limitations
60
+
61
+ 1. **Experimental Status:** This is research code requiring rigorous validation
62
+ 2. **Training Data:** Evaluated only on toy datasets, not real language modeling tasks
63
+ 3. **Baseline Gaps:** No systematic comparison to established transformer architectures
64
+ 4. **Scale Verification:** Largest validated model is 771M parameters, not 1B+ as claimed elsewhere
65
+ 5. **Convergence:** Training times too short to establish genuine convergence behavior
66
+
67
+ ## Intended Use and Applications
68
+
69
+ ### Research Applications โœ…
70
+ - Bit-level language modeling research
71
+ - Memory-efficient transformer architecture studies
72
+ - Safety telemetry and monitoring system development
73
+ - Experimental diffusion-based text generation
74
+
75
+ ### Production Applications โš ๏ธ
76
+ - **Not Recommended:** Requires extensive validation and baseline comparisons
77
+ - **Missing:** Proper evaluation on standard datasets and benchmarks
78
+ - **Needs:** Long-duration training studies and statistical significance testing
79
+
80
+ ## Ethical Considerations and Risks
81
+
82
+ ### Potential Benefits
83
+ - Enhanced interpretability through bit-level processing
84
+ - Built-in safety monitoring and gating mechanisms
85
+ - Memory-efficient architecture exploration
86
+ - Open research contributing to AI safety
87
+
88
+ ### Potential Risks
89
+ - **Overstated Capabilities:** Early documentation contained inflated claims
90
+ - **Incomplete Evaluation:** Missing critical baseline comparisons
91
+ - **Research Maturity:** Experimental status requires careful interpretation of results
92
+
93
+ ### Recommendations
94
+ - Use for research and experimentation only
95
+ - Conduct rigorous baseline comparisons before any production use
96
+ - Validate claims through independent evaluation
97
+ - Follow established ML research best practices
98
+
99
+ ## Technical Specifications
100
+
101
+ ### Model Architecture
102
+ - **Bit Embedding Size:** Configurable (16-1792 tested)
103
+ - **Attention Heads:** Configurable (2-28 tested)
104
+ - **Layers:** Configurable (1-20 tested)
105
+ - **Max Sequence Length:** Configurable (16-512 tested)
106
+ - **Reversible Layers:** Optional memory-efficient computation
107
+ - **Quantization:** Experimental 4-bit QAT support
108
+
109
+ ### System Requirements
110
+ - **Minimum:** Python 3.10+, PyTorch 2.7.1, 8GB RAM
111
+ - **Recommended:** 16GB+ RAM, CUDA-capable GPU for larger models
112
+ - **Dependencies:** See requirements.txt for complete specification
113
+
114
+ ### Training Features
115
+ - FSDP distributed training support
116
+ - Mixed precision (FP16/BF16) training
117
+ - Progressive scaling and curriculum learning
118
+ - Real-time telemetry and safety monitoring
119
+ - Interactive dashboard for training control
120
+
121
+ ## Citation
122
+
123
+ If you use BitTransformerLM in your research, please cite:
124
+
125
+ ```bibtex
126
+ @software{bittransformerlm2025,
127
+ title={BitTransformerLM: Experimental Bit-Native Transformer Language Model},
128
+ author={WCNegentropy Research},
129
+ year={2025},
130
+ url={https://github.com/WCNegentropy/BitTransformerLM},
131
+ note={Experimental research implementation}
132
+ }
133
  ```
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
134
 
135
+ ## Additional Resources
136
 
137
+ - **Repository:** [GitHub - WCNegentropy/BitTransformerLM](https://github.com/WCNegentropy/BitTransformerLM)
138
+ - **Documentation:** README.md, AGENTS.md
139
+ - **License:** AGPLv3 with additional terms (see LICENSE/ directory)
140
+ - **Issues:** GitHub Issues for bug reports and feature requests
141
 
142
+ ---
143
 
144
+ **Disclaimer:** This is experimental research code. Claims in some historical documentation may be overstated. Users should conduct independent evaluation and validation before any production use. The model requires rigorous baseline comparisons and statistical validation to establish its capabilities relative to standard approaches.