---
language:
- en
license: apache-2.0
library_name: mlx
tags:
- mlx
- apple-silicon
- moe
- mixture-of-experts
- 4-bit
- quantized
- gpt-oss
- context-retrieval
base_model: chromadb/context-1
pipeline_tag: text-generation
model-index:
- name: context-1-MLX-MXFP4
  results:
  - task:
      type: text-generation
    metrics:
    - name: Tokens per second (M1 Max)
      type: throughput
      value: 69
    - name: Peak Memory (GB)
      type: memory
      value: 12
---

| # chromadb/context-1 MLX MXFP4 |
|
|
| This model was converted from [chromadb/context-1](https://huggingface.co/chromadb/context-1) to MLX format with MXFP4 (4-bit) quantization for efficient inference on Apple Silicon. |
|
|
| ## Model Description |
|
|
| - **Base Model**: [chromadb/context-1](https://huggingface.co/chromadb/context-1) (fine-tuned from [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b)) |
| - **Architecture**: 20B parameter Mixture of Experts (MoE) with 32 experts, 4 active per token |
| - **Format**: MLX with MXFP4 quantization |
| - **Quantization**: 4.504 bits per weight |
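For context on the bits-per-weight figure: MXFP4 stores blocks of 32 four-bit values that share one 8-bit exponent scale, which works out to 4.25 bits per quantized weight. The slightly higher average reported here (4.504) presumably reflects tensors left at higher precision; this sketch just shows the arithmetic, not the converter's exact accounting:

```python
# MXFP4 packs blocks of 32 four-bit elements plus one shared 8-bit scale,
# so each quantized weight effectively costs 4 + 8/32 bits. Unquantized
# layers raise the model-wide average above this floor.
block_size = 32
element_bits = 4
scale_bits = 8
effective_bits = element_bits + scale_bits / block_size
print(effective_bits)  # 4.25
```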
|
|
| ## Performance (Apple M1 Max, 64GB) |
|
|
| | Metric | Value | |
| |--------|-------| |
| | Model Size | 11 GB | |
| | Peak Memory | 12 GB | |
| | Generation Speed | ~69 tokens/sec | |
| | Prompt Processing | ~70 tokens/sec | |
| | Latency | ~14.5 ms/token | |
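The latency row follows directly from the generation speed:

```python
# ~69 generated tokens/sec implies ~14.5 ms per token
tokens_per_sec = 69
latency_ms_per_token = 1000 / tokens_per_sec
print(round(latency_ms_per_token, 1))  # 14.5
```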
|
|
| ## Usage |
|
|
| ```python |
from mlx_lm import load, generate

model, tokenizer = load("foadmk/context-1-MLX-MXFP4")
| response = generate(model, tokenizer, prompt="What is the capital of France?", max_tokens=100, verbose=True) |
| ``` |
|
|
| ## Conversion Notes |
|
|
The chromadb/context-1 checkpoint stores its weights in a different format than the original openai/gpt-oss-20b, so converting it to MLX required custom logic:
|
|
| ### Key Differences from Original Format |
| - **Dense BF16 tensors** (not quantized blocks with `_blocks` suffix) |
| - **gate_up_proj shape**: `(experts, hidden, intermediate*2)` with interleaved gate/up weights |
|
|
| ### Weight Transformations Applied |
|
|
| 1. **gate_up_proj** `(32, 2880, 5760)`: |
| - Transpose to `(32, 5760, 2880)` |
| - Interleaved split: `[:, ::2, :]` for gate, `[:, 1::2, :]` for up |
| - Result: `gate_proj.weight` and `up_proj.weight` each `(32, 2880, 2880)` |
|
|
2. **down_proj** `(32, 2880, 2880)`:
   - Transpose to match the layout MLX expects

3. **Bypass mlx_lm sanitize**: weights are pre-named with a `.weight` suffix so that mlx_lm's sanitize step skips its splitting logic, which does not apply to this layout
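Steps 1 and 2 above can be sketched with NumPy (shapes taken from the card; variable names are illustrative, and the actual script operates on the checkpoint's BF16 tensors):

```python
import numpy as np

# Illustrative sketch of the gate_up_proj split described above.
experts, hidden, intermediate = 32, 2880, 2880

# gate_up_proj arrives as (experts, hidden, intermediate*2)
gate_up = np.zeros((experts, hidden, intermediate * 2), dtype=np.float32)

# Step 1: transpose to (experts, intermediate*2, hidden)
gate_up = gate_up.transpose(0, 2, 1)

# Interleaved split along axis 1: even rows -> gate, odd rows -> up
gate_proj = gate_up[:, ::2, :]   # (32, 2880, 2880)
up_proj = gate_up[:, 1::2, :]    # (32, 2880, 2880)

# Step 2: down_proj is likewise transposed on its last two axes
down_proj = np.zeros((experts, intermediate, hidden), dtype=np.float32)
down_proj = down_proj.transpose(0, 2, 1)
```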
|
|
| ### Conversion Script |
|
|
| A conversion script is included in this repo: `convert_context1_to_mlx.py` |
|
|
| ```bash |
| python convert_context1_to_mlx.py --output ./context1-mlx-mxfp4 |
| ``` |
|
|
| ## Intended Use |
|
|
| This model is optimized for: |
| - Context-aware retrieval and search tasks |
| - Running locally on Apple Silicon Macs |
| - Low-latency inference without GPU requirements |
|
|
| ## Limitations |
|
|
- Requires an Apple Silicon Mac with MLX installed
- Best performance on an M1 Pro/Max/Ultra or newer with 32 GB+ RAM
| - Model outputs structured JSON-like responses (inherited from base model training) |
|
|
| ## Citation |
|
|
| If you use this model, please cite the original: |
|
|
| ```bibtex |
| @misc{chromadb-context-1, |
| author = {Chroma}, |
| title = {Context-1: A Fine-tuned GPT-OSS Model for Retrieval}, |
| year = {2025}, |
| publisher = {HuggingFace}, |
| url = {https://huggingface.co/chromadb/context-1} |
| } |
| ``` |
|
|
| ## Acknowledgments |
|
|
| - [chromadb](https://github.com/chroma-core/chroma) for the original context-1 model |
| - [OpenAI](https://openai.com) for the gpt-oss-20b base model |
| - [Apple MLX team](https://github.com/ml-explore/mlx) for the MLX framework |
| - [mlx-community](https://huggingface.co/mlx-community) for MLX model conversion tools |
|
|