---
language:
- en
license: apache-2.0
library_name: mlx
tags:
- mlx
- apple-silicon
- moe
- mixture-of-experts
- 4-bit
- quantized
- gpt-oss
- context-retrieval
base_model: chromadb/context-1
pipeline_tag: text-generation
model-index:
- name: context-1-MLX-MXFP4
  results:
  - task:
      type: text-generation
    metrics:
    - name: Tokens per second (M1 Max)
      type: throughput
      value: 69
    - name: Peak Memory (GB)
      type: memory
      value: 12
---

# chromadb/context-1 MLX MXFP4

This model was converted from [chromadb/context-1](https://huggingface.co/chromadb/context-1) to MLX format with MXFP4 (4-bit) quantization for efficient inference on Apple Silicon.

## Model Description

- **Base Model**: [chromadb/context-1](https://huggingface.co/chromadb/context-1) (fine-tuned from [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b))
- **Architecture**: 20B-parameter Mixture of Experts (MoE) with 32 experts, 4 active per token
- **Format**: MLX with MXFP4 quantization
- **Quantization**: 4.504 bits per weight

## Performance (Apple M1 Max, 64 GB)

| Metric | Value |
|--------|-------|
| Model Size | 11 GB |
| Peak Memory | 12 GB |
| Generation Speed | ~69 tokens/sec |
| Prompt Processing | ~70 tokens/sec |
| Latency | ~14.5 ms/token |

## Usage

```python
from mlx_lm import load, generate

model, tokenizer = load("foadmk/context-1-MLX-MXFP4")
response = generate(
    model,
    tokenizer,
    prompt="What is the capital of France?",
    max_tokens=100,
    verbose=True,
)
```

## Conversion Notes

The chromadb/context-1 model uses a different weight format than the original openai/gpt-oss-20b, which required custom conversion logic.

### Key Differences from the Original Format

- **Dense BF16 tensors** (not quantized blocks with a `_blocks` suffix)
- **gate_up_proj shape**: `(experts, hidden, intermediate*2)` with interleaved gate/up weights

### Weight Transformations Applied
1. **gate_up_proj** `(32, 2880, 5760)`:
   - Transpose to `(32, 5760, 2880)`
   - Interleaved split: `[:, ::2, :]` for gate, `[:, 1::2, :]` for up
   - Result: `gate_proj.weight` and `up_proj.weight`, each `(32, 2880, 2880)`
2. **down_proj** `(32, 2880, 2880)`:
   - Transpose to match the format MLX expects
3. **Bypass mlx_lm sanitize**: weights are pre-named with a `.weight` suffix so that mlx_lm's sanitize step skips its splitting logic, which is incorrect for this weight layout

### Conversion Script

A conversion script is included in this repo: `convert_context1_to_mlx.py`

```bash
python convert_context1_to_mlx.py --output ./context1-mlx-mxfp4
```

## Intended Use

This model is optimized for:

- Context-aware retrieval and search tasks
- Running locally on Apple Silicon Macs
- Low-latency inference without GPU requirements

## Limitations

- Requires an Apple Silicon Mac with MLX support
- Performs best on M1 Pro/Max/Ultra or newer with 32 GB+ RAM
- Model outputs structured JSON-like responses (inherited from the base model's training)

## Citation

If you use this model, please cite the original:

```bibtex
@misc{chromadb-context-1,
  author = {Chroma},
  title = {Context-1: A Fine-tuned GPT-OSS Model for Retrieval},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/chromadb/context-1}
}
```

## Acknowledgments

- [chromadb](https://github.com/chroma-core/chroma) for the original context-1 model
- [OpenAI](https://openai.com) for the gpt-oss-20b base model
- [Apple MLX team](https://github.com/ml-explore/mlx) for the MLX framework
- [mlx-community](https://huggingface.co/mlx-community) for MLX model conversion tools
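## Appendix: Sketch of the gate_up_proj Split

As a rough sketch of the interleaved gate/up split described in the Weight Transformations section, here is the same transpose-and-slice logic in plain NumPy. This is illustrative only, not the conversion script itself: it uses NumPy in place of MLX arrays, the variable names are made up, and the shapes are scaled down from the real model's `(32, 2880, 5760)` so it runs instantly.

```python
import numpy as np

# Toy shapes for illustration; the real model uses
# experts=32, hidden=2880, intermediate=2880.
experts, hidden, intermediate = 4, 6, 6

# Fused tensor as stored by chromadb/context-1:
# (experts, hidden, intermediate*2), gate and up rows interleaved.
gate_up_proj = np.arange(
    experts * hidden * intermediate * 2, dtype=np.float32
).reshape(experts, hidden, intermediate * 2)

# Step 1: transpose (experts, hidden, intermediate*2)
#         -> (experts, intermediate*2, hidden)
w = gate_up_proj.transpose(0, 2, 1)

# Step 2: interleaved split — even rows are gate, odd rows are up.
gate_proj = w[:, ::2, :]   # (experts, intermediate, hidden)
up_proj = w[:, 1::2, :]    # (experts, intermediate, hidden)

assert gate_proj.shape == (experts, intermediate, hidden)
assert up_proj.shape == (experts, intermediate, hidden)
```

With the real shapes, each resulting tensor is `(32, 2880, 2880)`, matching `gate_proj.weight` and `up_proj.weight` above.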