---
language:
- en
license: apache-2.0
library_name: mlx
tags:
- mlx
- apple-silicon
- moe
- mixture-of-experts
- 4-bit
- quantized
- gpt-oss
- context-retrieval
base_model: chromadb/context-1
pipeline_tag: text-generation
model-index:
- name: context-1-MLX-MXFP4
results:
- task:
type: text-generation
metrics:
- name: Tokens per second (M1 Max)
type: throughput
value: 69
- name: Peak Memory (GB)
type: memory
value: 12
---
# chromadb/context-1 MLX MXFP4
This model was converted from [chromadb/context-1](https://huggingface.co/chromadb/context-1) to MLX format with MXFP4 (4-bit) quantization for efficient inference on Apple Silicon.
## Model Description
- **Base Model**: [chromadb/context-1](https://huggingface.co/chromadb/context-1) (fine-tuned from [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b))
- **Architecture**: 20B parameter Mixture of Experts (MoE) with 32 experts, 4 active per token
- **Format**: MLX with MXFP4 quantization
- **Quantization**: 4.504 bits per weight
## Performance (Apple M1 Max, 64GB)
| Metric | Value |
|--------|-------|
| Model Size | 11 GB |
| Peak Memory | 12 GB |
| Generation Speed | ~69 tokens/sec |
| Prompt Processing | ~70 tokens/sec |
| Latency | ~14.5 ms/token |
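The latency row follows directly from the generation speed: at ~69 tokens/sec, each token takes roughly 1000 / 69 ≈ 14.5 ms. A quick sanity check:

```python
# Derive per-token latency from the measured throughput in the table above.
tokens_per_sec = 69
ms_per_token = 1000 / tokens_per_sec
print(f"{ms_per_token:.1f} ms/token")  # prints "14.5 ms/token"
```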
## Usage
```python
from mlx_lm import load, generate

# Download (if needed) and load the quantized model and its tokenizer.
model, tokenizer = load("foadmk/context-1-MLX-MXFP4")

response = generate(
    model,
    tokenizer,
    prompt="What is the capital of France?",
    max_tokens=100,
    verbose=True,
)
```
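The same model can be run from the command line. This assumes the `mlx-lm` package is installed (`pip install mlx-lm`) and uses its standard generate CLI; flag names reflect current mlx-lm releases and may differ in older versions:

```shell
# One-off generation via the mlx-lm CLI (assumes `pip install mlx-lm`).
python -m mlx_lm.generate \
  --model foadmk/context-1-MLX-MXFP4 \
  --prompt "What is the capital of France?" \
  --max-tokens 100
```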
## Conversion Notes
The chromadb/context-1 model uses a different weight format than the original openai/gpt-oss-20b, which required custom conversion logic:
### Key Differences from Original Format
- **Dense BF16 tensors** (not quantized blocks with `_blocks` suffix)
- **gate_up_proj shape**: `(experts, hidden, intermediate*2)` with interleaved gate/up weights
### Weight Transformations Applied
1. **gate_up_proj** `(32, 2880, 5760)`:
- Transpose to `(32, 5760, 2880)`
- Interleaved split: `[:, ::2, :]` for gate, `[:, 1::2, :]` for up
- Result: `gate_proj.weight` and `up_proj.weight` each `(32, 2880, 2880)`
2. **down_proj** `(32, 2880, 2880)`:
- Transpose to the layout MLX expects
3. **Bypass mlx_lm sanitize**: weights are pre-named with the `.weight` suffix so mlx_lm's sanitize step skips its splitting logic, which mis-handles this layout
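The interleaved split in step 1 can be illustrated on toy data. The sketch below is a pure-Python stand-in, not the actual conversion code: plain nested lists replace BF16 tensors, tiny dimensions replace `(32, 2880, 5760)`, and names like `gate_up` are illustrative. Each element is tagged with its origin so the de-interleave can be verified:

```python
# Toy demonstration of the gate_up_proj de-interleave (step 1 above).
E, H, I = 2, 3, 2  # experts, hidden, intermediate (toy sizes)

# gate_up_proj arrives as (E, H, 2*I) with gate/up columns interleaved:
# even columns belong to gate_proj, odd columns to up_proj.
gate_up = [[[("g" if c % 2 == 0 else "u", e, h, c // 2)
             for c in range(2 * I)]
            for h in range(H)]
           for e in range(E)]

# Transpose (E, H, 2I) -> (E, 2I, H), as in the conversion notes.
transposed = [[[gate_up[e][h][c] for h in range(H)]
               for c in range(2 * I)]
              for e in range(E)]

# Interleaved split: [:, ::2, :] -> gate, [:, 1::2, :] -> up.
gate = [expert[0::2] for expert in transposed]
up = [expert[1::2] for expert in transposed]

# Each result is (E, I, H), and every element kept its original tag.
assert all(cell[0] == "g" for ex in gate for row in ex for cell in row)
assert all(cell[0] == "u" for ex in up for row in ex for cell in row)
print(len(gate), len(gate[0]), len(gate[0][0]))  # prints "2 2 3"
```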
### Conversion Script
A conversion script is included in this repo: `convert_context1_to_mlx.py`
```bash
python convert_context1_to_mlx.py --output ./context1-mlx-mxfp4
```
## Intended Use
This model is optimized for:
- Context-aware retrieval and search tasks
- Running locally on Apple Silicon Macs
- Low-latency inference without GPU requirements
## Limitations
- Requires Apple Silicon Mac with MLX support
- Best performance on M1 Pro/Max/Ultra or newer with 32GB+ RAM
- Model outputs structured JSON-like responses (inherited from base model training)
## Citation
If you use this model, please cite the original:
```bibtex
@misc{chromadb-context-1,
  author    = {Chroma},
  title     = {Context-1: A Fine-tuned GPT-OSS Model for Retrieval},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/chromadb/context-1}
}
```
## Acknowledgments
- [chromadb](https://github.com/chroma-core/chroma) for the original context-1 model
- [OpenAI](https://openai.com) for the gpt-oss-20b base model
- [Apple MLX team](https://github.com/ml-explore/mlx) for the MLX framework
- [mlx-community](https://huggingface.co/mlx-community) for MLX model conversion tools