nicolasembleton/context-1-GGUF

Text Generation

Mixture of Experts


nicolasembleton committed 26 days ago

Commit f485ca9 · verified · 1 Parent(s): fe43437

Upload README.md with huggingface_hub

Files changed (1)

README.md +6 -18


@@ -30,23 +30,17 @@ GGUF quantized versions of [chromadb/context-1](https://huggingface.co/chromadb/
 | **Hidden Size** | 2880 |
 | **License** | Apache-2.0 |
-## Quantization Details
-Quantized from the F16 safetensors in the original repository using [llama.cpp](https://github.com/ggml-org/llama.cpp)'s `llama-quantize` tool, running on NVIDIA H100 via [Modal](https://modal.com/).
-| File | Format | Size |
-|------|--------|------|
-| `context-1-Q4_K_M.gguf` | Q4_K_M | 15.81 GB |
-More quantization levels (Q3, Q5, Q8, etc.) may be added in the future.
 ## Usage
 ### llama.cpp
 ```bash
-# Download
-huggingface-cli download nicoism/context-1-GGUF context-1-Q4_K_M.gguf --local-dir .
 # Run
 ./llama-cli -m context-1-Q4_K_M.gguf -p "Your prompt here" -ngl 99
@@ -54,7 +48,7 @@ huggingface-cli download nicoism/context-1-GGUF context-1-Q4_K_M.gguf --local-di
 ### LM Studio
-Search for `nicoism/context-1-GGUF` in LM Studio's model browser and download the desired quantization.
 ### Python (llama-cpp-python)
@@ -62,7 +56,7 @@ Search for `nicoism/context-1-GGUF` in LM Studio's model browser and download th
 from llama_cpp import Llama
 llm = Llama.from_pretrained(
-    repo_id="nicoism/context-1-GGUF",
     filename="context-1-Q4_K_M.gguf",
     n_gpu_layers=-1,
 )
@@ -88,12 +82,6 @@ Format overview:
 For the full template, see [`chat_template.jinja`](https://huggingface.co/chromadb/context-1/blob/main/chat_template.jinja) in the original repository.
-## Limitations
-- GGUF quantization introduces minor quality degradation compared to the original F16 weights. Q4_K_M provides a good balance of quality and size.
-- This model inherits any biases and limitations from the base model.
-- Requires a GPU or sufficient RAM for inference (16 GB+ for Q4_K_M).
 ## License
 Apache-2.0 — same as the [original model](https://huggingface.co/chromadb/context-1).

 | **Hidden Size** | 2880 |
 | **License** | Apache-2.0 |
+## Quantization
+Quantized from F16 weights using [llama.cpp](https://github.com/ggml-org/llama.cpp) with importance matrix (imatrix) calibration, running on NVIDIA H100 GPUs via [Modal](https://modal.com/). All standard K-quant and I-quant variants are provided.
 ## Usage
 ### llama.cpp
 ```bash
+# Download your preferred quant
+huggingface-cli download nicolasembleton/context-1-GGUF context-1-Q4_K_M.gguf --local-dir .
 # Run
 ./llama-cli -m context-1-Q4_K_M.gguf -p "Your prompt here" -ngl 99
 ### LM Studio
+Search for `nicolasembleton/context-1-GGUF` in LM Studio's model browser and download the desired quantization.
 ### Python (llama-cpp-python)
 from llama_cpp import Llama
 llm = Llama.from_pretrained(
+    repo_id="nicolasembleton/context-1-GGUF",
     filename="context-1-Q4_K_M.gguf",
     n_gpu_layers=-1,
 )
 For the full template, see [`chat_template.jinja`](https://huggingface.co/chromadb/context-1/blob/main/chat_template.jinja) in the original repository.
 ## License
 Apache-2.0 — same as the [original model](https://huggingface.co/chromadb/context-1).
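
For scripted downloads, the updated `huggingface-cli download` line has a rough Python equivalent. The sketch below is not part of the commit. It assumes every quant file follows the `context-1-<QUANT>.gguf` naming seen for Q4_K_M (only that file name is confirmed by the diff), and the helper names `quant_filename` and `download_quant` are hypothetical.

```python
# Sketch: fetch one quant file from the renamed repo with huggingface_hub.
# Assumption: files follow the `context-1-<QUANT>.gguf` pattern seen for Q4_K_M;
# only the Q4_K_M file name is confirmed by the diff.

REPO_ID = "nicolasembleton/context-1-GGUF"


def quant_filename(quant: str = "Q4_K_M") -> str:
    """Build the expected GGUF file name for a quant label (naming is assumed)."""
    return f"context-1-{quant}.gguf"


def download_quant(quant: str = "Q4_K_M", local_dir: str = ".") -> str:
    """Download one quant file and return its local path (needs network access)."""
    # Imported lazily so quant_filename works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=quant_filename(quant),
        local_dir=local_dir,
    )
```

Calling `download_quant("Q4_K_M")` should mirror the CLI download shown in the diff; the returned path can then be passed to `llama-cli -m` or to `Llama(model_path=...)`.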