	---
	language:
	- en
	license: apache-2.0
	library_name: transformers
	tags:
	- pruned_flex_olmo
	- custom_code
	- math
	- pruned
	- distilled
	- mixture-of-experts
	base_model: allenai/Flex-math-2x7B-1T
	pipeline_tag: text-generation
	---

	# flex-math-8192

	A pruned and distilled variant of [allenai/Flex-math-2x7B-1T](https://huggingface.co/allenai/Flex-math-2x7B-1T) with a variable-width expert MLP. Expert 1 was pruned from the full intermediate size of 11,008 down to 8192 (74% of the original width), then retrained with knowledge distillation to recover quality.

	| Property | Value |
	|---|---|
	| Total Parameters | 10.5B |
	| Expert 1 Parameters | 3.2B |
	| Expert 1 Width | 8192 (74%) |
	| Base Model | allenai/Flex-math-2x7B-1T (11.6B params) |
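
As a rough sanity check on the figures in the table, the following sketch recomputes the width ratio and the expert parameter count. The hidden size (4096) and layer count (32) are assumptions not stated in this card, as is the simplification that the expert consists only of bias-free gate, up, and down projections; `expert_params` is a hypothetical helper.

```python
# Values taken from this card.
FULL_WIDTH = 11_008
PRUNED_WIDTH = 8_192

# Assumed architecture values (not stated in this card).
HIDDEN_SIZE = 4096
NUM_LAYERS = 32


def expert_params(width, hidden=HIDDEN_SIZE, layers=NUM_LAYERS):
    """Parameters in one expert's gate, up, and down projections over all layers."""
    return 3 * hidden * width * layers


width_ratio = PRUNED_WIDTH / FULL_WIDTH
pruned = expert_params(PRUNED_WIDTH)
full = expert_params(FULL_WIDTH)

print(f"width ratio:     {width_ratio:.0%}")
print(f"pruned expert:   {pruned / 1e9:.1f}B parameters")
print(f"full expert:     {full / 1e9:.1f}B parameters")
print(f"removed by prune: {(full - pruned) / 1e9:.1f}B parameters")
```

Under these assumptions the pruned expert comes to about 3.2B parameters and the prune removes about 1.1B, which is consistent with the drop from 11.6B to 10.5B total.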

	For full details, see the [blog post](https://hbfreed.com/2026/01/28/variable-flexolmo.html).

	## How to Use

	This repo includes a `modeling_pruned_flex_olmo.py` file that handles the variable-width expert architecture. Load the model with `trust_remote_code=True` so that this custom code is used; after that it behaves like any other Hugging Face model:

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	# trust_remote_code=True lets transformers run modeling_pruned_flex_olmo.py from this repo.
	model = AutoModelForCausalLM.from_pretrained("hbfreed/flex-math-8192", trust_remote_code=True)
	# The tokenizer is loaded from the base model.
	tokenizer = AutoTokenizer.from_pretrained("allenai/Flex-math-2x7B-1T")

	input_text = "Solve: What is 15% of 200?"
	inputs = tokenizer(input_text, return_tensors="pt")
	# Generate up to 256 new tokens and decode the full sequence back to text.
	outputs = model.generate(**inputs, max_new_tokens=256)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```

	The tokenizer is the same as the base model's.

	## How It Was Made

	1. Structured pruning: Neuron importance scores were computed on math-specific data (GSM8K, Metamath, TuluMath subsets). The least important neurons in Expert 1's gate/up/down projections were removed, reducing the intermediate size from 11,008 to 8192.
	2. Knowledge distillation: The pruned model was retrained for ~228M tokens using the top-128 logprobs from the full-sized teacher model. Distillation data: [hbfreed/flexolmo-math-logprobs](https://huggingface.co/datasets/hbfreed/flexolmo-math-logprobs).
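
The distillation step can be sketched as follows. This card does not state the exact loss, so the sketch assumes a KL-style objective: the teacher's top-k log-probabilities are renormalized over the k stored tokens and compared with the student's full-vocabulary log-probabilities at the same token ids. The function names, the toy shapes, and the use of NumPy are illustrative assumptions, not the training code behind this model.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def topk_distill_loss(student_logits, teacher_ids, teacher_logprobs):
    """KL-style loss on the teacher's top-k support (assumed objective).

    student_logits:   [tokens, vocab]
    teacher_ids:      [tokens, k] token ids of the teacher's top-k entries
    teacher_logprobs: [tokens, k] teacher log-probs at those ids
    """
    # Renormalize the teacher's log-probs so they sum to 1 over the k tokens.
    t_logp = log_softmax(teacher_logprobs)
    t_p = np.exp(t_logp)
    # Student log-probs over the full vocabulary, gathered at the teacher's ids.
    s_logp = np.take_along_axis(log_softmax(student_logits), teacher_ids, axis=-1)
    # Sum over the k tokens, average over positions.
    return float((t_p * (t_logp - s_logp)).sum(axis=-1).mean())


# Toy example; the card uses k = 128 on real teacher outputs.
rng = np.random.default_rng(0)
tokens, vocab, k = 4, 50, 8
teacher_logits = rng.normal(size=(tokens, vocab))
teacher_ids = np.argsort(teacher_logits, axis=-1)[:, -k:]
teacher_logprobs = np.take_along_axis(log_softmax(teacher_logits), teacher_ids, axis=-1)
student_logits = rng.normal(size=(tokens, vocab))

loss = topk_distill_loss(student_logits, teacher_ids, teacher_logprobs)
print(f"distillation loss: {loss:.4f}")
```

Storing only the top-k log-probabilities keeps the distillation dataset far smaller than full-vocabulary logits, at the cost of ignoring the teacher's probability mass outside those k tokens.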

	Importance scores were calibrated on math data rather than general data, and the choice of calibration data changes which neurons are kept: 58% of the top-2048 neurons differ between the math-calibrated and general-calibrated rankings.
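
The pruning step and the ranking comparison can be sketched on a toy SwiGLU-style MLP. The importance metric used here (mean absolute activation of each intermediate neuron over the calibration tokens) is an assumption, since this card does not name the metric; the random "math" and "general" calibration sets, the tiny shapes, and all function names are hypothetical.

```python
import numpy as np


def silu(x):
    return x / (1.0 + np.exp(-x))


def mlp_forward(x, gate, up, down):
    """Gated MLP: down(silu(gate(x)) * up(x)), weights stored as [out, in]."""
    return (silu(x @ gate.T) * (x @ up.T)) @ down.T


def neuron_importance(x, gate, up):
    """Assumed metric: mean |activation| per intermediate neuron."""
    act = silu(x @ gate.T) * (x @ up.T)
    return np.abs(act).mean(axis=0)


def prune_expert(gate, up, down, scores, keep):
    """Keep the `keep` highest-scoring neurons, preserving their order."""
    kept = np.sort(np.argsort(scores)[-keep:])
    # A neuron is a row of gate/up and the matching column of down.
    return gate[kept], up[kept], down[:, kept], kept


def topk_difference(scores_a, scores_b, k):
    """Fraction of the top-k neurons that differ between two rankings."""
    top_a = set(np.argsort(scores_a)[-k:].tolist())
    top_b = set(np.argsort(scores_b)[-k:].tolist())
    return 1.0 - len(top_a & top_b) / k


rng = np.random.default_rng(0)
hidden, width, keep = 8, 16, 12
gate = rng.normal(size=(width, hidden))
up = rng.normal(size=(width, hidden))
down = rng.normal(size=(hidden, width))

# Stand-ins for math and general calibration activations.
calib = rng.normal(size=(64, hidden))
general_calib = rng.normal(size=(64, hidden)) + 1.0

scores = neuron_importance(calib, gate, up)
general_scores = neuron_importance(general_calib, gate, up)

p_gate, p_up, p_down, kept = prune_expert(gate, up, down, scores, keep)
print("pruned shapes:", p_gate.shape, p_up.shape, p_down.shape)
print("top-k difference:", topk_difference(scores, general_scores, k=keep))
```

Because whole neurons are removed, the pruned expert is an ordinary dense MLP with a smaller intermediate size, which is why it needs custom modeling code only for the per-expert width and not for sparse kernels.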

	## Benchmark Results

	| Model | GSM8K | MATH | Math2 |
	|---|---|---|---|
	| No-expert baseline (7.3B) | n/a | n/a | 8.1 |
	| flex-math-8192 | 70.1 | 31.3 | 50.7 |
	| Full teacher (11.6B) | 69.7 | 35.4 | 52.5 |

	### All Variants

	| Model | Total Params | Expert Width | GSM8K | MATH | Math2 |
	|---|---|---|---|---|---|
	| [flex-math-8192](https://huggingface.co/hbfreed/flex-math-8192) | 10.5B | 8192 (74%) | 70.1 | 31.3 | 50.7 |
	| [flex-math-5504](https://huggingface.co/hbfreed/flex-math-5504) | 9.5B | 5504 (50%) | 66.6 | 26.8 | 46.7 |
	| [flex-math-2048](https://huggingface.co/hbfreed/flex-math-2048) | 8.1B | 2048 (19%) | 44.3 | 13.9 | 29.1 |
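
To make the width/quality trade-off easier to read, the sketch below expresses each variant's score as a fraction of the full teacher's score. All numbers are copied from the tables in this card; the `retention` name and the ratio itself are just one illustrative way to summarize them.

```python
# Scores copied from the benchmark tables in this card.
teacher = {"GSM8K": 69.7, "MATH": 35.4, "Math2": 52.5}
variants = {
    "flex-math-8192": {"GSM8K": 70.1, "MATH": 31.3, "Math2": 50.7},
    "flex-math-5504": {"GSM8K": 66.6, "MATH": 26.8, "Math2": 46.7},
    "flex-math-2048": {"GSM8K": 44.3, "MATH": 13.9, "Math2": 29.1},
}

# Fraction of the teacher's score that each variant retains, per benchmark.
retention = {
    name: {bench: score / teacher[bench] for bench, score in scores.items()}
    for name, scores in variants.items()
}

for name, ratios in retention.items():
    summary = ", ".join(f"{bench}: {ratio:.1%}" for bench, ratio in ratios.items())
    print(f"{name}: {summary}")
```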

	## License

	Apache 2.0 (same as base model)