# AGENTS Guidelines for BitTransformerLM
## Repository Scope and Purpose
- **BitTransformerLM** models raw binary streams using reversible transformer blocks and safety telemetry. The project is the canonical implementation under WCNegentropy.
- Core capabilities include bit-native modeling, telemetry metrics (negentropy, LZ complexity, symbiosis), progressive scaling, compression, context extension, diffusion mode (linear/cosine/exp noise schedules with parity correction), dashboard control, distributed training, and quantization.
- Phase 1 optimizations provide configurable batch sizing, gradient accumulation, mixed precision, memory-mapped dataset streaming, scheduled compression ramps, selective `torch.compile`, and an EMA-smoothed safety gate with burn-in.
## Environment Setup
- Requires **Python 3.10+**.
- Install dependencies:
  - CPU: `pip install --extra-index-url https://download.pytorch.org/whl/cpu -r requirements.txt`
  - Optional GPU: `pip install --extra-index-url https://download.pytorch.org/whl/cu118 torch==2.7.1+cu118`
- The package name is `bit-transformer`; project metadata lives in `pyproject.toml`.
## Repository Layout
- `bit_transformer/` – core package (`model`, `compression`, `telemetry`, `safety`, `dashboard_app`, `quantization`, etc.).
- `tests/` – pytest suite and historical `TEST_RESULTS.md`.
- Scripts: `example.py`, `unified_workflow.py`, `full_bits_train.py`, `build_full_bits.py`, `mcp_server.py`, `wikitext_*` utilities. The legacy `progressive_scaleup.py` is retained for reference but superseded by `integration_schedule.py`.
- Docs and specs: `README.md`, `state_of_the_repo_audit.md`, licensing files in `LICENSE/`.
## Development Practices
- Follow snake_case for functions and CamelCase for classes.
- Keep functions under ~300 lines and minimize deeply nested control flow.
- Avoid reintroducing the deprecated dashboard `/exec` endpoint or other insecure code paths.
- Use the `/status` endpoint for model introspection; all routes return JSON and surface errors with stack traces.
- Keep compression, decompression, and halting logic consistent with the current implementation.
- Use the `cpu_autocast()` helper for BF16 mixed precision on CPU instead of calling `torch.amp.autocast` directly (see the first sketch after this list).
- Adaptive training now expands depth, width, or context only when validation loss plateaus and automatically decays the base learning rate by √2 after each expansion with a 100-step warm-up (see the second sketch after this list).
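
A minimal usage sketch for the mixed-precision helper, assuming `cpu_autocast` is exported from the package root and that `model` is an already-constructed BitTransformerLM; adjust the import path if the helper lives in a submodule:

```python
import torch

from bit_transformer import cpu_autocast  # import path assumed; may live in a submodule

bits = torch.randint(0, 2, (1, 64))  # toy bit sequence
with cpu_autocast():                 # BF16 autocast on CPU, per the guideline above
    out = model(bits)                # `model` is assumed to be constructed elsewhere
```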
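
The post-expansion decay amounts to dividing the base learning rate by √2 per expansion, ramped back up over the 100-step warm-up. A hedged sketch of that arithmetic; the function name and signature are illustrative, not the actual scheduler API:

```python
import math

def lr_after_expansion(base_lr: float, num_expansions: int,
                       step_since_expansion: int, warmup_steps: int = 100) -> float:
    """Illustrative only: base LR decayed by sqrt(2) per expansion, ramped linearly during warm-up."""
    decayed = base_lr / (math.sqrt(2) ** num_expansions)
    warmup = min(1.0, step_since_expansion / warmup_steps)
    return decayed * warmup

# e.g. a base LR of 1e-3 becomes ~7.07e-4 after one expansion and 5e-4 after two.
```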
## Workflow & Commands
- Run the example: `python example.py`.
- Adaptive scaling now lives in `integration_schedule.py`; `progressive_scaleup.py` is deprecated.
- Unified workflow (optionally with dashboard or diffusion): `python unified_workflow.py --dashboard` or `python unified_workflow.py --diffusion --diffusion-steps 8 --dataset-size 32`.
- Increase `--diffusion-steps` for higher fidelity (8–16) and add `--diffusion-curriculum` to linearly decay noise over epochs.
- Disable checkpointing or reversible blocks when speed is prioritized over memory: `python unified_workflow.py --no-checkpoint --no-reversible`.
- Enable 4-bit quantization-aware training: `python unified_workflow.py --qat`.
- Skip full attention logging during chunked attention for memory savings by constructing the model with `full_attn_logging=False` (see the first sketch after this list).
- Start the MCP server: `python mcp_server.py`, then launch the dashboard: `MCP_SERVER_ADDR=http://127.0.0.1:7000 python -m bit_transformer.dashboard_app`.
- `/metrics` and `/model_config` endpoints expose telemetry streams and hyperparameters (see the second sketch after this list).
- `/save_checkpoint` and `/download_checkpoint` sync weights with Hugging Face (token defaults to `HF_TOKEN`).
- Container build: `docker build -t bittransformerlm .` and run with ports `5000` (dashboard) and `7000` (MCP) exposed.
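
A hedged construction sketch for disabling full attention logging; the hyperparameter names shown here are illustrative assumptions rather than the exact `BitTransformerLM` signature, and only `full_attn_logging=False` is the point:

```python
from bit_transformer.model import BitTransformerLM  # import path assumed

# Hyperparameter names below are illustrative; check the actual constructor.
model = BitTransformerLM(
    d_model=128,
    nhead=8,
    num_layers=4,
    full_attn_logging=False,  # skip logging full attention maps during chunked attention
)
```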
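
A small sketch for reading the JSON endpoints from a script, assuming the dashboard is reachable on its default port 5000 (adjust host and port to your deployment):

```python
import requests  # third-party dependency; `pip install requests` if missing

BASE = "http://127.0.0.1:5000"  # dashboard address assumed

status = requests.get(f"{BASE}/status").json()        # model introspection
metrics = requests.get(f"{BASE}/metrics").json()      # telemetry streams (K, C, S, ...)
config = requests.get(f"{BASE}/model_config").json()  # current hyperparameters
print(status, metrics, config)
```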
## Telemetry Metrics
| Metric | Meaning | Range |
|--------|---------|-------|
| **K** | Negentropy – deviation from random noise | 0–1 (1 = ordered) |
| **C** | LZ Complexity – compressibility proxy | 0–1 (higher = more changes) |
| **S** | Symbiosis – agreement with reference distribution | 0–1 (1 = aligned) |
ACT halting exports `halt_probs` in telemetry, showing how many layers actually executed. For robust sampling under safety constraints, call `safe_sample_with_retry(model, bits)`, which retries with diffusion mode and exponential backoff.
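
A minimal usage sketch, assuming `safe_sample_with_retry` is importable from the top-level package (adjust the import if it lives under `bit_transformer.safety`) and that `model` is already constructed:

```python
import torch

from bit_transformer import safe_sample_with_retry  # import path assumed

bits = torch.randint(0, 2, (1, 128))          # example bit-stream prompt
output = safe_sample_with_retry(model, bits)  # retries with diffusion mode + exponential backoff
```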
`TelemetrySynthesizer.cluster_sequences` can be used to select representative training samples before invoking `collapse_submodel`. The distillation helper deepens the model and widens it once (`width_scale = 1.5`) if telemetry floors are missed, and `save_distilled_model` emits a `metrics.json` summary beside the weights.
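
A hedged sketch of that distillation flow; the constructor arguments, keyword names, and return values here are illustrative assumptions, so check the actual signatures before use:

```python
# Illustrative pipeline only; real signatures may differ.
synth = TelemetrySynthesizer()                         # constructor arguments assumed
clusters = synth.cluster_sequences(train_bits)         # pick representative training sequences
student = collapse_submodel(clusters, teacher=model)   # distil into a smaller model (kwargs assumed)
save_distilled_model(student, "distilled/")            # also writes metrics.json beside the weights
```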
## Testing
- Run unit tests after any change: `pytest -q`.
- Use `watcher.py` for auto-reload and test runs during local development if desired.
- During training, call `model.train()` and keep dropout probabilities around `0.1–0.2`.
- Before running tests, inference, or pushing weights, switch to `model.eval()` and set all dropout probabilities to `0` to avoid flaky results (see the first sketch after this list).
- The dashboard will warn if telemetry metrics drift by more than 0.2 over the last 10 steps; adjust via `ModelManager(drift_window, drift_threshold)` as needed (see the second sketch after this list).
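
A small sketch of the eval-mode switch described above, using plain PyTorch to zero out dropout; the loop over modules is a generic approach, not a project-specific helper:

```python
import torch

model.eval()  # deterministic behaviour for tests, inference, and weight pushes
for module in model.modules():
    if isinstance(module, torch.nn.Dropout):
        module.p = 0.0  # force dropout off, per the guideline above
```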
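
And a short sketch of tuning the drift warning, assuming `ModelManager` accepts the keyword arguments named in the note above and that the import path is as shown:

```python
from bit_transformer.dashboard_app import ModelManager  # import path assumed

manager = ModelManager(drift_window=10, drift_threshold=0.2)  # defaults implied by the guideline above
```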
## Licensing
- Project governed by documents in `LICENSE/` (AGPLv3, commercial terms, disclaimers, etc.). Ensure compliance before contributing or distributing.

These guidelines keep the repository consistent with the project roadmap and previous audits. Maintain security, style, and testing discipline to keep BitTransformerLM production-ready.