---
language:
- en
license: apache-2.0
library_name: mlx
tags:
- mlx
- apple-silicon
- moe
- mixture-of-experts
- 4-bit
- quantized
- gpt-oss
- context-retrieval
base_model: chromadb/context-1
pipeline_tag: text-generation
model-index:
- name: context-1-MLX-MXFP4
results:
- task:
type: text-generation
metrics:
- name: Tokens per second (M1 Max)
type: throughput
value: 69
- name: Peak Memory (GB)
type: memory
value: 12
---
# chromadb/context-1 MLX MXFP4
This model was converted from [chromadb/context-1](https://huggingface.co/chromadb/context-1) to MLX format with MXFP4 (4-bit) quantization for efficient inference on Apple Silicon.
## Model Description
- **Base Model**: [chromadb/context-1](https://huggingface.co/chromadb/context-1) (fine-tuned from [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b))
- **Architecture**: 20B parameter Mixture of Experts (MoE) with 32 experts, 4 active per token
- **Format**: MLX with MXFP4 quantization
- **Quantization**: 4.504 bits per weight
## Performance (Apple M1 Max, 64GB)
| Metric | Value |
|--------|-------|
| Model Size | 11 GB |
| Peak Memory | 12 GB |
| Generation Speed | ~69 tokens/sec |
| Prompt Processing | ~70 tokens/sec |
| Latency | ~14.5 ms/token |
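The latency row follows directly from the generation speed: at ~69 tokens/sec, each token takes roughly 1000 / 69 ≈ 14.5 ms. A quick sanity check:

```python
# Derive per-token latency from the measured throughput in the table above.
tokens_per_sec = 69
ms_per_token = 1000 / tokens_per_sec
print(f"{ms_per_token:.1f} ms/token")  # prints "14.5 ms/token"
```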
## Usage
```python
from mlx_lm import load, generate

# Download (if needed) and load the quantized model and its tokenizer.
model, tokenizer = load("foadmk/context-1-MLX-MXFP4")

response = generate(
    model,
    tokenizer,
    prompt="What is the capital of France?",
    max_tokens=100,
    verbose=True,
)
```
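The same model can be run from the command line. This assumes the `mlx-lm` package is installed (`pip install mlx-lm`) and uses its standard generate CLI; flag names reflect current mlx-lm releases and may differ in older versions:

```shell
# One-off generation via the mlx-lm CLI (assumes `pip install mlx-lm`).
python -m mlx_lm.generate \
  --model foadmk/context-1-MLX-MXFP4 \
  --prompt "What is the capital of France?" \
  --max-tokens 100
```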
## Conversion Notes
The chromadb/context-1 model uses a different weight format than the original openai/gpt-oss-20b, which required custom conversion logic:
### Key Differences from Original Format
- **Dense BF16 tensors** (not quantized blocks with `_blocks` suffix)
- **gate_up_proj shape**: `(experts, hidden, intermediate*2)` with interleaved gate/up weights
### Weight Transformations Applied
1. **gate_up_proj** `(32, 2880, 5760)`:
- Transpose to `(32, 5760, 2880)`
- Interleaved split: `[:, ::2, :]` for gate, `[:, 1::2, :]` for up
- Result: `gate_proj.weight` and `up_proj.weight` each `(32, 2880, 2880)`
2. **down_proj** `(32, 2880, 2880)`:
- Transpose to the layout MLX expects
3. **Bypass mlx_lm sanitize**: weights are pre-named with the `.weight` suffix so mlx_lm's sanitize step skips its splitting logic, which mis-handles this layout
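The interleaved split in step 1 can be illustrated on toy data. The sketch below is a pure-Python stand-in, not the actual conversion code: plain nested lists replace BF16 tensors, tiny dimensions replace `(32, 2880, 5760)`, and names like `gate_up` are illustrative. Each element is tagged with its origin so the de-interleave can be verified:

```python
# Toy demonstration of the gate_up_proj de-interleave (step 1 above).
E, H, I = 2, 3, 2  # experts, hidden, intermediate (toy sizes)

# gate_up_proj arrives as (E, H, 2*I) with gate/up columns interleaved:
# even columns belong to gate_proj, odd columns to up_proj.
gate_up = [[[("g" if c % 2 == 0 else "u", e, h, c // 2)
             for c in range(2 * I)]
            for h in range(H)]
           for e in range(E)]

# Transpose (E, H, 2I) -> (E, 2I, H), as in the conversion notes.
transposed = [[[gate_up[e][h][c] for h in range(H)]
               for c in range(2 * I)]
              for e in range(E)]

# Interleaved split: [:, ::2, :] -> gate, [:, 1::2, :] -> up.
gate = [expert[0::2] for expert in transposed]
up = [expert[1::2] for expert in transposed]

# Each result is (E, I, H), and every element kept its original tag.
assert all(cell[0] == "g" for ex in gate for row in ex for cell in row)
assert all(cell[0] == "u" for ex in up for row in ex for cell in row)
print(len(gate), len(gate[0]), len(gate[0][0]))  # prints "2 2 3"
```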
### Conversion Script
A conversion script is included in this repo: `convert_context1_to_mlx.py`
```bash
python convert_context1_to_mlx.py --output ./context1-mlx-mxfp4
```
## Intended Use
This model is optimized for:
- Context-aware retrieval and search tasks
- Running locally on Apple Silicon Macs
- Low-latency inference without GPU requirements
## Limitations
- Requires Apple Silicon Mac with MLX support
- Best performance on M1 Pro/Max/Ultra or newer with 32GB+ RAM
- Model outputs structured JSON-like responses (inherited from base model training)
## Citation
If you use this model, please cite the original:
```bibtex
@misc{chromadb-context-1,
  author    = {Chroma},
  title     = {Context-1: A Fine-tuned GPT-OSS Model for Retrieval},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/chromadb/context-1}
}
```
## Acknowledgments
- [chromadb](https://github.com/chroma-core/chroma) for the original context-1 model
- [OpenAI](https://openai.com) for the gpt-oss-20b base model
- [Apple MLX team](https://github.com/ml-explore/mlx) for the MLX framework
- [mlx-community](https://huggingface.co/mlx-community) for MLX model conversion tools