language:
  - en
license: apache-2.0
library_name: mlx
tags:
  - mlx
  - apple-silicon
  - moe
  - mixture-of-experts
  - 4-bit
  - quantized
  - gpt-oss
  - context-retrieval
base_model: chromadb/context-1
pipeline_tag: text-generation
model-index:
  - name: context-1-MLX-MXFP4
    results:
      - task:
          type: text-generation
        metrics:
          - name: Tokens per second (M1 Max)
            type: throughput
            value: 69
          - name: Peak Memory (GB)
            type: memory
            value: 12

chromadb/context-1 MLX MXFP4

This model was converted from chromadb/context-1 to MLX format with MXFP4 (4-bit) quantization for efficient inference on Apple Silicon.

Model Description

  • Base Model: chromadb/context-1 (fine-tuned from openai/gpt-oss-20b)
  • Architecture: 20B parameter Mixture of Experts (MoE) with 32 experts, 4 active per token
  • Format: MLX with MXFP4 quantization
  • Quantization: 4.504 bits per weight
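
The routing described above (32 experts, 4 active per token) can be illustrated with a toy softmax-gated top-k router in NumPy. This is a sketch of the general MoE technique, not the model's actual router implementation; all names and sizes here are illustrative.

```python
import numpy as np

def top_k_route(tokens, router_w, k=4):
    """Toy MoE router: pick k experts per token, softmax-normalize their gates."""
    logits = tokens @ router_w                  # (n_tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -k:]   # indices of the k largest logits
    gates = np.take_along_axis(logits, top, axis=-1)
    gates = np.exp(gates - gates.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)  # softmax over the selected k only
    return top, gates

rng = np.random.default_rng(0)
tokens = rng.standard_normal((3, 2880))         # 3 tokens, hidden size 2880
router_w = rng.standard_normal((2880, 32))      # 32 experts
experts, gates = top_k_route(tokens, router_w)
print(experts.shape, gates.shape)               # (3, 4) (3, 4)
```

Each token's output is then a gate-weighted sum of its 4 selected experts' outputs, which is what keeps per-token compute far below the full 20B parameter count.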

Performance (Apple M1 Max, 64GB)

Metric            | Value
------------------|----------------
Model Size        | 11 GB
Peak Memory       | 12 GB
Generation Speed  | ~69 tokens/sec
Prompt Processing | ~70 tokens/sec
Latency           | ~14.5 ms/token
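
As a sanity check, the latency figure follows directly from the generation speed:

```python
tokens_per_sec = 69
ms_per_token = 1000 / tokens_per_sec
print(f"{ms_per_token:.1f} ms/token")  # → 14.5 ms/token
```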

Usage

from mlx_lm import load, generate

model, tokenizer = load("foadmk/context-1-MLX-MXFP4")
response = generate(model, tokenizer, prompt="What is the capital of France?", max_tokens=100, verbose=True)

Conversion Notes

The chromadb/context-1 model uses a different weight format than the original openai/gpt-oss-20b, which required custom conversion logic:

Key Differences from Original Format

  • Weights are stored as dense BF16 tensors (not the original's pre-quantized blocks with a _blocks suffix)
  • gate_up_proj has shape (experts, hidden, intermediate*2), with gate and up weights interleaved along the last axis

Weight Transformations Applied

  1. gate_up_proj (32, 2880, 5760):

    • Transpose to (32, 5760, 2880)
    • Interleaved split: [:, ::2, :] for gate, [:, 1::2, :] for up
    • Result: gate_proj.weight and up_proj.weight, each (32, 2880, 2880)
  2. down_proj (32, 2880, 2880):

    • Transpose to the layout MLX expects
  3. Bypass mlx_lm sanitize: weights are pre-named with the .weight suffix so the converter skips its automatic splitting, which does not match this layout
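
The transpose-and-split steps above can be sketched in NumPy on dummy tensors. The shapes here are scaled down so the sketch runs instantly (the real model uses experts=32, hidden=2880, intermediate=2880); the "up" columns are marked with 1.0 so the interleaved split is easy to verify.

```python
import numpy as np

# Toy stand-ins for the real dimensions (32, 2880, 2880)
experts, hidden, inter = 4, 6, 6

# Source layout: (experts, hidden, intermediate*2), gate/up interleaved on last axis.
# Even columns (gate) stay 0.0; odd columns (up) are set to 1.0.
gate_up = np.zeros((experts, hidden, inter * 2), dtype=np.float32)
gate_up[..., 1::2] = 1.0

# Step 1: transpose to (experts, intermediate*2, hidden)
gate_up_t = gate_up.transpose(0, 2, 1)

# Step 2: interleaved split — even rows are gate, odd rows are up
gate_proj = gate_up_t[:, ::2, :]   # → (experts, inter, hidden), all zeros here
up_proj = gate_up_t[:, 1::2, :]    # → (experts, inter, hidden), all ones here

# Step 3: down_proj is simply transposed to the layout MLX expects
down_proj = np.zeros((experts, inter, hidden), dtype=np.float32).transpose(0, 2, 1)

print(gate_proj.shape, up_proj.shape, down_proj.shape)
```

With the real shapes, the same slicing turns the (32, 5760, 2880) tensor into two (32, 2880, 2880) tensors, matching the result listed above.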

Conversion Script

A conversion script is included in this repo: convert_context1_to_mlx.py

python convert_context1_to_mlx.py --output ./context1-mlx-mxfp4

Intended Use

This model is optimized for:

  • Context-aware retrieval and search tasks
  • Running locally on Apple Silicon Macs
  • Low-latency inference without GPU requirements

Limitations

  • Requires Apple Silicon Mac with MLX support
  • Best performance on M1 Pro/Max/Ultra or newer with 32GB+ RAM
  • Model outputs structured JSON-like responses (inherited from base model training)
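
Since outputs tend toward JSON-like structure, callers may want a parse-with-fallback step. This helper is purely illustrative and not part of this repo:

```python
import json

def parse_response(text):
    """Try to parse a model response as JSON; fall back to wrapping the raw text."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return {"raw": text}

print(parse_response('{"answer": "Paris"}'))  # {'answer': 'Paris'}
print(parse_response("plain text"))           # {'raw': 'plain text'}
```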

Citation

If you use this model, please cite the original:

@misc{chromadb-context-1,
  author = {Chroma},
  title = {Context-1: A Fine-tuned GPT-OSS Model for Retrieval},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/chromadb/context-1}
}

Acknowledgments