	---
	language:
	- en
	license: apache-2.0
	library_name: transformers
	tags:
	- pruned_flex_olmo
	- custom_code
	- pruned
	- distilled
	- mixture-of-experts
	base_model: allenai/Flex-math-2x7B-1T
	pipeline_tag: text-generation
	---

	# flex-general-2048

	A pruned and partially distilled variant of [allenai/Flex-math-2x7B-1T](https://huggingface.co/allenai/Flex-math-2x7B-1T) with a variable-width expert MLP. Expert 1 has been pruned from the full 11,008 intermediate size down to 2048 (19% of original width), then partially recovered via knowledge distillation.

	Unlike the [math-calibrated variant](https://huggingface.co/hbfreed/flex-math-2048), this model's pruning was calibrated on general-purpose data: neuron importance scores were computed on a broad data mix rather than on math-specific data. The choice of calibration data matters, since 58% of the top-2048 most important neurons differ between the two calibration approaches.
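
To make the 58% figure concrete, here is a minimal sketch of how such a comparison could be computed from two per-neuron importance vectors. The function names and the synthetic scores are illustrative assumptions, not the code or data used for this model, so the printed number will not match the one reported above.

```python
import random


def top_k_indices(scores, k):
    """Return the set of indices of the k largest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def fraction_differing(scores_a, scores_b, k):
    """Fraction of the top-k set under scores_a that is missing from the top-k set under scores_b."""
    top_a = top_k_indices(scores_a, k)
    top_b = top_k_indices(scores_b, k)
    return len(top_a - top_b) / k


# Synthetic stand-ins for real importance scores (assumption: one score per
# intermediate neuron, 11,008 neurons, keep the top 2048).
rng = random.Random(0)
n_neurons, k = 11008, 2048
math_scores = [rng.random() for _ in range(n_neurons)]
general_scores = [0.5 * m + 0.5 * rng.random() for m in math_scores]

diff = fraction_differing(math_scores, general_scores, k)
print(f"{diff:.0%} of the top-{k} neurons differ between the two score sets")
```

Because both top-k sets have the same size, the fraction is symmetric: swapping the two arguments gives the same result.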

	| Property | Value |
	|---|---|
	| Total Parameters | 8.1B |
	| Expert 1 Parameters | 0.8B |
	| Expert 1 Width | 2048 (19%) |
	| Base Model | allenai/Flex-math-2x7B-1T (11.6B params) |
	| Distillation | Partial (~20k steps, stopped early) |
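
The parameter counts in the table can be sanity-checked with some quick arithmetic. The sketch below assumes a hidden size of 4096 and 32 transformer layers for the base model, and that each expert MLP consists of bias-free gate, up, and down projections. The hidden size and layer count are not stated in this card, so treat them as assumptions; the widths and the 11.6B total come from the table.

```python
# Assumptions (not stated in this card): hidden size and layer count.
HIDDEN_SIZE = 4096
NUM_LAYERS = 32

# Values taken from this card.
FULL_WIDTH = 11_008
PRUNED_WIDTH = 2_048
BASE_TOTAL = 11.6e9


def expert_mlp_params(width, hidden=HIDDEN_SIZE, layers=NUM_LAYERS):
    """Weights in one expert's gate, up, and down projections, summed over all layers."""
    return 3 * hidden * width * layers


full_expert = expert_mlp_params(FULL_WIDTH)
pruned_expert = expert_mlp_params(PRUNED_WIDTH)
pruned_total = BASE_TOTAL - full_expert + pruned_expert

print(f"width ratio:     {PRUNED_WIDTH / FULL_WIDTH:.1%}")   # 18.6%
print(f"Expert 1 params: {pruned_expert / 1e9:.2f}B")        # 0.81B
print(f"total params:    {pruned_total / 1e9:.2f}B")         # 8.08B
```

Under these assumptions the results round to the 19%, 0.8B, and 8.1B shown in the table.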

	For full details, see the [blog post](https://hbfreed.com/2026/01/28/variable-flexolmo.html).

	## How to Use

	This repo includes a `modeling_pruned_flex_olmo.py` file that handles the variable-width expert architecture. Load the model with `trust_remote_code=True` and it works like any other Hugging Face model:

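	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	# trust_remote_code=True lets transformers run this repo's modeling_pruned_flex_olmo.py
	model = AutoModelForCausalLM.from_pretrained("hbfreed/flex-general-2048", trust_remote_code=True)
	# The tokenizer is unchanged from the base model, so load it from there
	tokenizer = AutoTokenizer.from_pretrained("allenai/Flex-math-2x7B-1T")

	# Tokenize a prompt, generate up to 256 new tokens, and decode the result
	inputs = tokenizer("Hello, world!", return_tensors="pt")
	outputs = model.generate(**inputs, max_new_tokens=256)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```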

	The tokenizer is the same as the base model's.

	## How It Was Made

	1. Structured pruning: Neuron importance scores were computed on general-purpose data. The least important neurons in Expert 1's gate/up/down projections were removed, reducing intermediate size from 11,008 to 2048.
	2. Partial knowledge distillation: The pruned model was partially retrained (~20k steps) using logprobs from the full-sized teacher model. Training was stopped early: the general-calibrated model converged more slowly and to a higher loss than the math-calibrated variant.
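
Step 1 above can be illustrated on a toy gated MLP. This is a minimal NumPy sketch, not the code used to build this model: the SiLU-gated MLP form, the `(out, in)` weight layout, and especially the importance score (mean absolute intermediate activation over calibration inputs) are assumptions chosen for illustration. All function names are hypothetical, and the distillation step is not covered.

```python
import numpy as np


def silu(x):
    return x / (1.0 + np.exp(-x))


def gated_mlp(x, w_gate, w_up, w_down):
    """Gated MLP: down(silu(gate(x)) * up(x)). Weights use (out, in) layout."""
    return (silu(x @ w_gate.T) * (x @ w_up.T)) @ w_down.T


def neuron_importance(x_calib, w_gate, w_up):
    """Hypothetical score: mean absolute intermediate activation per neuron."""
    act = silu(x_calib @ w_gate.T) * (x_calib @ w_up.T)
    return np.abs(act).mean(axis=0)


def prune_expert(w_gate, w_up, w_down, scores, keep):
    """Keep the `keep` highest-scoring intermediate neurons in all three projections."""
    idx = np.sort(np.argsort(scores)[-keep:])  # sorted to preserve neuron order
    # Gate and up lose rows; down loses the matching columns.
    return w_gate[idx], w_up[idx], w_down[:, idx], idx


# Toy sizes standing in for the real 11,008 -> 2048 reduction.
rng = np.random.default_rng(0)
hidden, inter, keep = 16, 64, 12
w_gate = rng.normal(size=(inter, hidden))
w_up = rng.normal(size=(inter, hidden))
w_down = rng.normal(size=(hidden, inter))
x_calib = rng.normal(size=(256, hidden))  # stand-in for calibration activations

scores = neuron_importance(x_calib, w_gate, w_up)
p_gate, p_up, p_down, kept = prune_expert(w_gate, w_up, w_down, scores, keep)
y = gated_mlp(x_calib[:4], p_gate, p_up, p_down)
print(p_gate.shape, p_up.shape, p_down.shape, y.shape)
```

Slicing the same indices out of all three matrices is what keeps the pruning structured: the smaller MLP produces the same output as the original with the dropped neurons zeroed, and it needs no masking at inference time.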

	## Related Models

	| Model | Calibration | Expert Width | Distillation |
	|---|---|---|---|
	| [flex-math-8192](https://huggingface.co/hbfreed/flex-math-8192) | Math | 8192 (74%) | Full |
	| [flex-math-5504](https://huggingface.co/hbfreed/flex-math-5504) | Math | 5504 (50%) | Full |
	| [flex-math-2048](https://huggingface.co/hbfreed/flex-math-2048) | Math | 2048 (19%) | Full |
	| flex-general-2048 | General | 2048 (19%) | Partial |

	## License

	Apache 2.0 (same as base model)