	---
	base_model: t5-small
	license: apache-2.0
	datasets:
	- HuggingFaceFW/fineweb-edu
	tags:
	- text-generation
	- causal-lm
	- mamba
	- hrm
	- pytorch
	language:
	- en
	pipeline_tag: text-generation
	---

	# CMBA-768M-FineWeb

	A 768M-parameter Hierarchical Recurrent Memory (HRM) language model trained on high-quality educational web text from FineWeb-Edu. The model uses Mamba2 state-space layers in place of attention, so compute grows linearly with sequence length instead of quadratically.
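
To illustrate why a state-space layer scales linearly with sequence length, here is a toy recurrence. It is a minimal sketch, not the Mamba2 kernel: the name `ssm_scan` and the scalar parameters `a`, `b`, `c` are hypothetical. Mamba2 itself makes these parameters input-dependent, keeps a vector state per channel (`d_state`), and evaluates the recurrence with a chunked parallel algorithm.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Toy scalar state-space recurrence (illustrative only, not Mamba2).

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t
    """
    h = 0.0          # fixed-size state carried across the sequence
    ys = []
    for x in xs:     # one constant-cost update per token -> O(n) total
        h = a * h + b * x
        ys.append(c * h)
    return ys


if __name__ == "__main__":
    # An impulse at t=0 decays geometrically through the state.
    print(ssm_scan([1.0, 0.0, 0.0], a=0.5))
```

Each token touches the state once, so doubling the sequence length doubles the work. Self-attention instead compares every token with every earlier token, which is where the quadratic cost comes from.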

	## Model Architecture

	CMBA (Causal Mamba-based Architecture) implements a hierarchical processing structure:

	- Hierarchical Design: Dual-level processing with H-layers (high-level abstraction) and L-layers (low-level specialists)
	- Mamba2 Mixers: State-space layers replace attention, so cost grows as O(n) in the sequence length n rather than the O(n²) of self-attention
	- Adaptive Computation: Halting mechanism allows variable compute per token (ACT-style pondering)
	- Parameters: ~768M total
	- Context Length: 1024 tokens
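
The adaptive-computation bullet above can be sketched as follows. This card does not specify the exact halting rule, so the code assumes the classic ACT formulation: accumulate per-step halting probabilities until they reach `1 - eps`, then give the remainder to the final step. The function name `act_weights` and the `eps` threshold are hypothetical; `max_steps=8` mirrors the "Max halt steps" setting in the configuration.

```python
def act_weights(halt_probs, max_steps=8, eps=0.01):
    """Per-step mixing weights for one token under ACT-style halting.

    halt_probs: halting probabilities in [0, 1], one per pondering step.
    Returns weights that sum to 1; their count is the number of steps used.
    """
    probs = list(halt_probs)[:max_steps]   # never ponder beyond max_steps
    weights = []
    cumulative = 0.0
    for step, p in enumerate(probs, start=1):
        is_last = step == len(probs)
        if is_last or cumulative + p >= 1.0 - eps:
            # Halt: the final step receives the remaining probability mass.
            weights.append(1.0 - cumulative)
            break
        weights.append(p)
        cumulative += p
    return weights


if __name__ == "__main__":
    easy = act_weights([0.95, 0.9, 0.9])   # confident token: halts early
    hard = act_weights([0.01] * 20)        # uncertain token: uses all 8 steps
    print(len(easy), len(hard))
```

The per-token output would then be the weighted sum of the intermediate states, so tokens that halt early cost fewer layer evaluations than tokens that ponder up to the cap.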

	### Configuration
	```text
	Model Dimensions:
	- d_model: 768
	- n_heads: 12 (for compatibility, not used in Mamba)
	- d_ff: 3072
	- H_layers: 12 (high-level hierarchy)
	- L_layers: 12 (low-level processing)

	Mamba2 Settings:
	- d_state: 128
	- expand: 2
	- headdim: 64
	- d_conv: 4
	- ngroups: 1

	Training:
	- Max halt steps: 8
	- Block size: 1024
	- Batch size: 32 (effective)
	- Learning rate: 0.0002 → 1e-06
	- Weight decay: 0.1
	```
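
A few quantities follow directly from the settings above. The sketch below derives them under two assumptions that this card does not confirm: that the mixers follow the conventions of the reference Mamba2 implementation (`d_inner = expand * d_model`, one SSM head per `headdim` channels), and that the effective batch size counts sequences of `Block size` tokens. Variable names such as `n_ssm_heads` and `tokens_per_step` are illustrative.

```python
# Values copied from the configuration block above.
d_model, expand, headdim = 768, 2, 64
d_state, ngroups = 128, 1
block_size, batch_size = 1024, 32

# Assumed to follow the reference Mamba2 conventions.
d_inner = expand * d_model                   # width inside each mixer
n_ssm_heads = d_inner // headdim             # SSM heads (distinct from n_heads)
conv_dim = d_inner + 2 * ngroups * d_state   # channels through the depthwise conv

# Assumes "Batch size: 32 (effective)" means 32 sequences per optimizer step.
tokens_per_step = block_size * batch_size

print(f"d_inner={d_inner}, n_ssm_heads={n_ssm_heads}, conv_dim={conv_dim}")
print(f"tokens per optimizer step: {tokens_per_step}")
```

Under these assumptions the mixer works at twice the model width, which is why the unused `n_heads: 12` setting and the SSM head count differ.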

	## Training Data

	- Dataset: [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) (`sample-10BT` subset)
	- Tokenizer: `t5-small` (T5 SentencePiece)
	- Vocab Size: 32100
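
The vocabulary size can be reconstructed from the tokenizer layout. The sketch below assumes the standard T5 vocabulary (32,000 SentencePiece pieces plus 100 sentinel tokens) and assumes the token embedding has width `d_model`; whether the output head shares those weights is not stated in this card.

```python
# Assumed standard T5 tokenizer layout.
sentencepiece_pieces = 32_000
sentinel_tokens = 100                    # <extra_id_0> ... <extra_id_99>
vocab_size = sentencepiece_pieces + sentinel_tokens

d_model = 768                            # from the configuration section
embedding_params = vocab_size * d_model  # size of one embedding table

print(f"vocab_size={vocab_size}")
print(f"embedding parameters: {embedding_params:,}")
print(f"share of ~768M total: {embedding_params / 768e6:.1%}")
```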

	## Latest Performance (Epoch 2)

	- Validation Loss: `8.1216`
	- Validation Perplexity: `3366.37`
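
The two numbers above are related by perplexity = exp(loss). The check below assumes the reported loss is the mean per-token cross-entropy in nats. For context it also computes the loss of a uniform guess over the vocabulary, which is the value an untrained model would score.

```python
import math

val_loss = 8.1216       # reported validation loss (assumed nats per token)
vocab_size = 32100      # from the Training Data section

perplexity = math.exp(val_loss)
uniform_loss = math.log(vocab_size)   # loss of a uniform guess over the vocabulary

print(f"perplexity from loss: {perplexity:.2f}")
print(f"uniform-guess loss:   {uniform_loss:.4f}")
```

The recomputed perplexity agrees with the reported figure up to rounding of the loss.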

	## Usage

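	```python
	import torch
	from transformers import T5Tokenizer

	# HRMText1 lives in hrm_text1_modeling.py, a custom module that is not part
	# of transformers; the file must be importable (e.g. in the working directory).
	from hrm_text1_modeling import HRMText1

	tokenizer = T5Tokenizer.from_pretrained("t5-small")
	model = HRMText1.from_pretrained("Viharikvs/CMBA-768M-FineWeb")
	model.eval()  # inference mode: disables dropout

	# Generate text
	input_ids = tokenizer("Once upon a time", return_tensors="pt").input_ids
	with torch.no_grad():  # no gradients are needed for generation
	    outputs = model.generate(input_ids, max_length=100)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```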

	## Citation

	If you use this model, please cite:

	```bibtex
	@misc{cmba-768m-fineweb,
	author = {Vihari},
	title = {CMBA-768M-FineWeb: Hierarchical Mamba-based Language Model},
	year = {2025},
	publisher = {HuggingFace},
	url = {https://huggingface.co/Viharikvs/CMBA-768M-FineWeb}
	}
	```

	## License

	Apache 2.0