foadmk/context-1-MLX-MXFP4

@@ -1,34 +1,121 @@
 # chromadb/context-1 MLX MXFP4
-This model was converted from [chromadb/context-1](https://huggingface.co/chromadb/context-1) to MLX format with MXFP4 quantization.
-## Model Info
-- **Base model**: chromadb/context-1 (fine-tune of openai/gpt-oss-20b)
-- **Format**: MLX MXFP4 (4-bit quantization)
-- **Size**: ~11 GB
-- **Peak memory**: ~12 GB
-## Performance (Apple M1 Max)
-- **Generation speed**: ~69 tokens/sec
-- **Prompt processing**: ~70 tokens/sec
-- **Latency**: ~14.5ms per token
 ## Usage
 ```python
 from mlx_lm import load, generate
-model, tokenizer = load("~/Models/context1-mlx-mxfp4")
-response = generate(model, tokenizer, prompt="What is the capital of France?", max_tokens=100)
 ```
 ## Conversion Notes
-The chromadb/context-1 model uses a different weight format than the original openai/gpt-oss-20b:
-- Weights are stored as dense BF16 tensors (not quantized blocks)
-- gate_up_proj shape: (experts, hidden, intermediate*2) - interleaved
-Conversion required:
-1. Transpose expert weights from (experts, hidden, intermediate) to (experts, intermediate, hidden)
-2. Interleaved split of gate_up_proj into separate gate_proj and up_proj
-3. Pre-naming weights with `.weight` suffix to bypass mlx_lm's sanitize function

+---
+language:
+  - en
+license: apache-2.0
+library_name: mlx
+tags:
+  - mlx
+  - apple-silicon
+  - moe
+  - mixture-of-experts
+  - 4-bit
+  - quantized
+  - gpt-oss
+  - context-retrieval
+base_model: chromadb/context-1
+pipeline_tag: text-generation
+model-index:
+  - name: context-1-MLX-MXFP4
+    results:
+      - task:
+          type: text-generation
+        metrics:
+          - name: Tokens per second (M1 Max)
+            type: throughput
+            value: 69
+          - name: Peak Memory (GB)
+            type: memory
+            value: 12
+---
 # chromadb/context-1 MLX MXFP4
+This model was converted from [chromadb/context-1](https://huggingface.co/chromadb/context-1) to MLX format with MXFP4 (4-bit) quantization for efficient inference on Apple Silicon.
+## Model Description
+- **Base Model**: [chromadb/context-1](https://huggingface.co/chromadb/context-1) (fine-tuned from [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b))
+- **Architecture**: 20B-parameter Mixture of Experts (MoE) with 32 experts, 4 active per token
+- **Format**: MLX with MXFP4 quantization
+- **Quantization**: 4.504 bits per weight (average across the model)
+## Performance (Apple M1 Max, 64 GB)
+| Metric | Value |
+|--------|-------|
+| Model Size | ~11 GB |
+| Peak Memory | ~12 GB |
+| Generation Speed | ~69 tokens/sec |
+| Prompt Processing | ~70 tokens/sec |
+| Latency | ~14.5 ms/token |
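As a rough consistency check on these figures, the per-token latency is the reciprocal of the generation speed, and the on-disk size can be estimated from the parameter count and the average bits per weight. The sketch below uses only numbers quoted in this card; treating "20B" as exactly 20e9 parameters is an assumption, so the size estimate is approximate.

```python
# Figures quoted in this model card; everything else is derived from them.
tokens_per_second = 69     # generation speed on the M1 Max
bits_per_weight = 4.504    # average quantization width
params = 20e9              # assumption: nominal "20B" taken as exactly 20e9

# Per-token latency is the reciprocal of throughput, converted to milliseconds.
latency_ms = 1000 / tokens_per_second

# Weight storage: parameters x bits per weight, converted from bits to gigabytes.
size_gb = params * bits_per_weight / 8 / 1e9

print(f"latency ~ {latency_ms:.1f} ms/token")
print(f"weights ~ {size_gb:.1f} GB")
```

Both results land close to the table values (about 14.5 ms/token and roughly 11 GB), which suggests the reported numbers are mutually consistent.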
 ## Usage
 ```python
 from mlx_lm import load, generate
+model, tokenizer = load("foadmk/context-1-MLX-MXFP4")
+response = generate(model, tokenizer, prompt="What is the capital of France?", max_tokens=100, verbose=True)
 ```
 ## Conversion Notes
+The chromadb/context-1 model uses a different weight format than the original openai/gpt-oss-20b, which required custom conversion logic:
+### Key Differences from Original Format
+- **Dense BF16 tensors** (not quantized blocks with `_blocks` suffix)
+- **gate_up_proj shape**: `(experts, hidden, intermediate*2)` with interleaved gate/up weights
+### Weight Transformations Applied
+1. **gate_up_proj** `(32, 2880, 5760)`:
+   - Transpose to `(32, 5760, 2880)`
+   - Interleaved split: `[:, ::2, :]` for gate, `[:, 1::2, :]` for up
+   - Result: `gate_proj.weight` and `up_proj.weight`, each of shape `(32, 2880, 2880)`
+2. **down_proj** `(32, 2880, 2880)`:
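+   - Swap the last two axes to match the layout MLX expects (the shape stays `(32, 2880, 2880)` because the hidden and intermediate sizes are both 2880)
+3. **Bypass mlx_lm sanitize**: Weights are saved with the `.weight` suffix already applied, so mlx_lm's sanitize function does not attempt its own split, which would be incorrect for this layout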
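The transformations listed above can be illustrated with a small NumPy sketch. This is not the code in `convert_context1_to_mlx.py`; it uses toy dimensions in place of `(32, 2880, 5760)`, and the dictionary key names are hypothetical placeholders chosen only to show the `.weight` suffix.

```python
import numpy as np

# Toy sizes standing in for experts=32, hidden=2880, intermediate=2880.
E, H, I = 2, 3, 4

# Build a source tensor in the (experts, hidden, intermediate*2) layout,
# with gate and up values interleaved along the last axis.
gate_src = np.arange(E * H * I, dtype=np.float32).reshape(E, H, I)
up_src = -gate_src - 1  # distinct values so the two halves are easy to tell apart
gate_up = np.empty((E, H, 2 * I), dtype=np.float32)
gate_up[:, :, 0::2] = gate_src
gate_up[:, :, 1::2] = up_src

# Step 1: swap the last two axes -> (experts, intermediate*2, hidden).
gate_up_t = np.swapaxes(gate_up, 1, 2)

# Step 2: de-interleave along the middle axis.
gate_proj = gate_up_t[:, 0::2, :]  # (experts, intermediate, hidden)
up_proj = gate_up_t[:, 1::2, :]    # (experts, intermediate, hidden)

# down_proj: swap the last two axes as well (source layout assumed here).
down_src = np.zeros((E, I, H), dtype=np.float32)
down_proj = np.swapaxes(down_src, 1, 2)

# Hypothetical key names: the point is only that the ".weight" suffix is
# already present when the tensors are saved.
weights = {
    "experts.gate_proj.weight": gate_proj,
    "experts.up_proj.weight": up_proj,
    "experts.down_proj.weight": down_proj,
}
print({name: w.shape for name, w in weights.items()})
```

With the real dimensions, the same slicing turns a `(32, 5760, 2880)` tensor into two `(32, 2880, 2880)` tensors.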
+### Conversion Script
+A conversion script is included in this repo: `convert_context1_to_mlx.py`
+```bash
+python convert_context1_to_mlx.py --output ./context1-mlx-mxfp4
+```
+## Intended Use
+This model is optimized for:
+- Context-aware retrieval and search tasks
+- Running locally on Apple Silicon Macs
+- Low-latency inference without a discrete GPU (MLX runs on the GPU built into Apple Silicon)
+## Limitations
+- Requires an Apple Silicon Mac with MLX support
+- Performs best on M1 Pro/Max/Ultra or newer with 32 GB or more of RAM
+- The model outputs structured, JSON-like responses (behavior inherited from the base model's training)
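Because the responses are described only as JSON-like, downstream code may want to parse them defensively. The following is a minimal sketch under that assumption: `parse_response` is a hypothetical helper, not part of mlx_lm or this repo, and the sample strings are invented for illustration rather than real model output.

```python
import json


def parse_response(text: str):
    """Return the parsed JSON object embedded in text, or the raw text if none parses."""
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        try:
            return json.loads(text[start:end + 1])
        except json.JSONDecodeError:
            pass  # looks like JSON but is not valid; fall through to the raw text
    return text


# Invented sample strings, used only to exercise both code paths.
print(parse_response('Result: {"query": "capital of France"}'))
print(parse_response("Paris is the capital of France."))
```

Falling back to the raw string keeps callers working when a response is only approximately JSON.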
+## Citation
+If you use this model, please cite the original model:
+```bibtex
+@misc{chromadb-context-1,
+  author = {Chroma},
+  title = {Context-1: A Fine-tuned GPT-OSS Model for Retrieval},
+  year = {2025},
+  publisher = {Hugging Face},
+  url = {https://huggingface.co/chromadb/context-1}
+}
+```
+## Acknowledgments
+- [chromadb](https://github.com/chroma-core/chroma) for the original context-1 model
+- [OpenAI](https://openai.com) for the gpt-oss-20b base model
+- [Apple MLX team](https://github.com/ml-explore/mlx) for the MLX framework
+- [mlx-community](https://huggingface.co/mlx-community) for MLX model conversion tools