foadmk commited on
Commit
5a0ee67
·
verified ·
1 Parent(s): bed8f90

Add proper model card with metadata

Browse files
Files changed (1) hide show
  1. README.md +106 -19
README.md CHANGED
@@ -1,34 +1,121 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  # chromadb/context-1 MLX MXFP4
2
 
3
- This model was converted from [chromadb/context-1](https://huggingface.co/chromadb/context-1) to MLX format with MXFP4 quantization.
 
 
 
 
 
 
 
4
 
5
- ## Model Info
6
- - **Base model**: chromadb/context-1 (fine-tune of openai/gpt-oss-20b)
7
- - **Format**: MLX MXFP4 (4-bit quantization)
8
- - **Size**: ~11 GB
9
- - **Peak memory**: ~12 GB
10
 
11
- ## Performance (Apple M1 Max)
12
- - **Generation speed**: ~69 tokens/sec
13
- - **Prompt processing**: ~70 tokens/sec
14
- - **Latency**: ~14.5ms per token
 
 
 
15
 
16
  ## Usage
17
 
18
  ```python
19
  from mlx_lm import load, generate
20
 
21
- model, tokenizer = load("~/Models/context1-mlx-mxfp4")
22
- response = generate(model, tokenizer, prompt="What is the capital of France?", max_tokens=100)
23
  ```
24
 
25
  ## Conversion Notes
26
 
27
- The chromadb/context-1 model uses a different weight format than the original openai/gpt-oss-20b:
28
- - Weights are stored as dense BF16 tensors (not quantized blocks)
29
- - gate_up_proj shape: (experts, hidden, intermediate*2) - interleaved
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
30
 
31
- Conversion required:
32
- 1. Transpose expert weights from (experts, hidden, intermediate) to (experts, intermediate, hidden)
33
- 2. Interleaved split of gate_up_proj into separate gate_proj and up_proj
34
- 3. Pre-naming weights with `.weight` suffix to bypass mlx_lm's sanitize function
 
1
+ ---
2
+ language:
3
+ - en
4
+ license: apache-2.0
5
+ library_name: mlx
6
+ tags:
7
+ - mlx
8
+ - apple-silicon
9
+ - moe
10
+ - mixture-of-experts
11
+ - 4-bit
12
+ - quantized
13
+ - gpt-oss
14
+ - context-retrieval
15
+ base_model: chromadb/context-1
16
+ pipeline_tag: text-generation
17
+ model-index:
18
+ - name: context-1-MLX-MXFP4
19
+ results:
20
+ - task:
21
+ type: text-generation
22
+ metrics:
23
+ - name: Tokens per second (M1 Max)
24
+ type: throughput
25
+ value: 69
26
+ - name: Peak Memory (GB)
27
+ type: memory
28
+ value: 12
29
+ ---
30
+
31
  # chromadb/context-1 MLX MXFP4
32
 
33
+ This model was converted from [chromadb/context-1](https://huggingface.co/chromadb/context-1) to MLX format with MXFP4 (4-bit) quantization for efficient inference on Apple Silicon.
34
+
35
+ ## Model Description
36
+
37
+ - **Base Model**: [chromadb/context-1](https://huggingface.co/chromadb/context-1) (fine-tuned from [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b))
38
+ - **Architecture**: 20B parameter Mixture of Experts (MoE) with 32 experts, 4 active per token
39
+ - **Format**: MLX with MXFP4 quantization
40
+ - **Quantization**: 4.504 bits per weight
41
 
42
+ ## Performance (Apple M1 Max, 64GB)
 
 
 
 
43
 
44
+ | Metric | Value |
45
+ |--------|-------|
46
+ | Model Size | 11 GB |
47
+ | Peak Memory | 12 GB |
48
+ | Generation Speed | ~69 tokens/sec |
49
+ | Prompt Processing | ~70 tokens/sec |
50
+ | Latency | ~14.5 ms/token |
51
 
52
  ## Usage
53
 
54
  ```python
55
  from mlx_lm import load, generate
56
 
57
+ model, tokenizer = load("foadmk/context-1-MLX-MXFP4")
58
+ response = generate(model, tokenizer, prompt="What is the capital of France?", max_tokens=100, verbose=True)
59
  ```
60
 
61
  ## Conversion Notes
62
 
63
+ The chromadb/context-1 model uses a different weight format than the original openai/gpt-oss-20b, which required custom conversion logic:
64
+
65
+ ### Key Differences from Original Format
66
+ - **Dense BF16 tensors** (not quantized blocks with `_blocks` suffix)
67
+ - **gate_up_proj shape**: `(experts, hidden, intermediate*2)` with interleaved gate/up weights
68
+
69
+ ### Weight Transformations Applied
70
+
71
+ 1. **gate_up_proj** `(32, 2880, 5760)`:
72
+ - Transpose to `(32, 5760, 2880)`
73
+ - Interleaved split: `[:, ::2, :]` for gate, `[:, 1::2, :]` for up
74
+ - Result: `gate_proj.weight` and `up_proj.weight` each `(32, 2880, 2880)`
75
+
76
+ 2. **down_proj** `(32, 2880, 2880)`:
77
+ - Transpose to match MLX expected format
78
+
79
+ 3. **Bypass mlx_lm sanitize**: Pre-naming weights with `.weight` suffix to skip incorrect splitting
80
+
81
+ ### Conversion Script
82
+
83
+ A conversion script is included in this repo: `convert_context1_to_mlx.py`
84
+
85
+ ```bash
86
+ python convert_context1_to_mlx.py --output ./context1-mlx-mxfp4
87
+ ```
88
+
89
+ ## Intended Use
90
+
91
+ This model is optimized for:
92
+ - Context-aware retrieval and search tasks
93
+ - Running locally on Apple Silicon Macs
94
+ - Low-latency inference without GPU requirements
95
+
96
+ ## Limitations
97
+
98
+ - Requires Apple Silicon Mac with MLX support
99
+ - Best performance on M1 Pro/Max/Ultra or newer with 32GB+ RAM
100
+ - Model outputs structured JSON-like responses (inherited from base model training)
101
+
102
+ ## Citation
103
+
104
+ If you use this model, please cite the original:
105
+
106
+ ```bibtex
107
+ @misc{chromadb-context-1,
108
+ author = {Chroma},
109
+ title = {Context-1: A Fine-tuned GPT-OSS Model for Retrieval},
110
+ year = {2025},
111
+ publisher = {HuggingFace},
112
+ url = {https://huggingface.co/chromadb/context-1}
113
+ }
114
+ ```
115
+
116
+ ## Acknowledgments
117
 
118
+ - [chromadb](https://github.com/chroma-core/chroma) for the original context-1 model
119
+ - [OpenAI](https://openai.com) for the gpt-oss-20b base model
120
+ - [Apple MLX team](https://github.com/ml-explore/mlx) for the MLX framework
121
+ - [mlx-community](https://huggingface.co/mlx-community) for MLX model conversion tools