---
license: llama3
language:
- en
library_name: transformers
pipeline_tag: feature-extraction
base_model: McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised
tags:
- llm2vec
- embedding
- sentence-similarity
- text-encoder
- llama3
- kimodo
- quantized
- bitsandbytes
- nf4
- 4-bit
- peft
- lora
inference: false
---

# matbee/kimodo-llm2vec-nf4

A 4-bit (NF4) quantized version of [`McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised`](https://huggingface.co/McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised), produced for use as the text encoder in [NVIDIA's Kimodo](https://github.com/nv-tlabs/kimodo) motion-diffusion model.

This repository ships the bnb-quantized base weights and the supervised LoRA adapter side by side, so loaders can skip the bf16 staging step entirely.

> **Built with Meta Llama 3.** This work is a derivative of Meta's Llama 3 8B Instruct (via McGill-NLP's LLM2Vec MNTP + supervised pipeline). Use is governed by the [Meta Llama 3 Community License](https://llama.meta.com/llama3/license/).

## Why?

Stock kimodo's text encoder consumes **~16 GB of VRAM**, dominating its total memory budget. Kimodo's own README says: *"Kimodo requires ~17GB of VRAM to generate locally entirely on GPU, primarily due to the text embedding model."* A 4-bit-quantized encoder cuts that to **~5 GB** with no measurable degradation in motion quality, freeing the budget for larger diffusion batches, longer sequences, or co-resident models.

## Memory savings

| Mode | Disk download | First-load wall time | CPU peak RSS during load | GPU peak during load | GPU steady after load |
|------|--------------:|---------------------:|-------------------------:|---------------------:|----------------------:|
| **bf16** (Hub original) | ~16 GB | ~80 s | ~15 GB | 15.18 GB | 15.18 GB |
| **nf4 cold-convert** (Hub bf16 + bnb quantize at load) | ~16 GB | 24.07 s | 15.22 GB | 14.73 GB | 4.83 GB |
| **nf4 pre-quantized (this repo)** | **4.84 GB** | **4.39 s** (5.5×) | **5.34 GB** (2.9×) | **4.99 GB** (2.95×) | 4.82 GB |

Ratios in parentheses are relative to the nf4 cold-convert row.

The "first-load wall time" column is the big practical win of using the **pre-quantized** export instead of cold-converting on every process start: no more loading 16 GB of bf16 weights only to immediately quantize them.
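
The parenthesized ratios can be re-derived from the table. The following is a small arithmetic check using the figures from the cold-convert and pre-quantized rows; the dictionary keys are names chosen here for illustration only.

```python
# Figures copied from the memory table (cold-convert vs pre-quantized rows).
cold_convert = {"load_s": 24.07, "cpu_peak_gb": 15.22, "gpu_peak_gb": 14.73}
pre_quantized = {"load_s": 4.39, "cpu_peak_gb": 5.34, "gpu_peak_gb": 4.99}

# Ratio of cold-convert to pre-quantized for each measurement.
ratios = {key: cold_convert[key] / pre_quantized[key] for key in cold_convert}

for key, value in ratios.items():
    print(f"{key}: {value:.2f}x")
```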

## Quality (vs the bf16 source model)

| Metric | bf16 (reference) | nf4 cold-convert (Hub flow) | **nf4 pre-quantized (this repo)** |
|--------|:----------------:|:---------------------------:|:---------------------------------:|
| Embedding cosine vs bf16 (mean) | 1.000 | 0.793 | **0.850** |
| Embedding cosine vs bf16 (min over 8 prompts) | 1.000 | 0.762 | **0.827** |
| L2 distance from bf16 (mean) | 0 | 87.98 | **72.94** |
| Encode latency p50 (RTX 4090, single prompt) | 83 ms | 136 ms | 136 ms |
| Kimodo end-to-end TMR R@3 (smoke, bf16 = GT) | 100% | **100%** | (not yet evaluated; same model class as cold-convert) |
| Kimodo end-to-end FID (vs bf16) | 0 | 0.110 | (expected ≤ 0.110 since pre-merge has fewer quantization passes) |

**Why pre-quantized is *more* faithful than cold-convert:** the standard cold-convert flow loads the McGill MNTP base in 4-bit, then auto-loads the LoRA adapter, then calls `merge_and_unload`, which dequantizes back to bf16, applies the LoRA, and re-quantizes: two rounding passes through nf4. This export merges MNTP into the base in **bf16** (lossless), then quantizes to nf4 **once**.
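
To see why an extra rounding pass can add error, here is a toy illustration. It uses a simple symmetric absmax quantizer with 15 uniform levels, which is an assumption made for readability and is not the NF4 codebook that bitsandbytes implements. The weights and the LoRA delta are hand-picked values, not taken from the model.

```python
def quantize_dequantize(values, max_code=7):
    """Round to integer codes in [-max_code, max_code] with absmax scaling.

    Toy uniform quantizer for illustration; NOT the bitsandbytes NF4 codebook.
    """
    absmax = max(abs(v) for v in values)
    if absmax == 0:
        return list(values)
    step = absmax / max_code
    out = []
    for v in values:
        code = max(-max_code, min(max_code, round(v / step)))
        out.append(code * step)
    return out


def mean_abs_error(reference, approx):
    return sum(abs(r - a) for r, a in zip(reference, approx)) / len(reference)


# Hand-picked toy weights and LoRA delta (illustrative values only).
base = [0.14, -0.36, 0.64, 0.70]
delta = [0.03, 0.03, -0.03, 0.00]
merged = [b + d for b, d in zip(base, delta)]  # exact merge in full precision

# One pass: merge in full precision, then quantize once.
single_pass = quantize_dequantize(merged)

# Two passes: quantize the base, merge onto the dequantized values, re-quantize.
base_q = quantize_dequantize(base)
two_pass = quantize_dequantize([b + d for b, d in zip(base_q, delta)])

err_single = mean_abs_error(merged, single_pass)
err_two = mean_abs_error(merged, two_pass)
print(f"single pass: {err_single:.4f}, two passes: {err_two:.4f}")
```

In this example the two-pass error comes out larger than the single-pass error. That ordering is not guaranteed for every input; the sketch only shows the mechanism by which a small delta can be lost when it is added to already-rounded weights.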

The TMR retrieval R@3 of 100% on the smoke test means kimodo's diffusion still produces motions that are correctly retrievable from their text prompt, i.e. semantic alignment with the prompt is preserved. FID of 0.110 is well below sample noise. **No measurable degradation on kimodo's text-following metrics.** Both end-to-end numbers were measured with the cold-convert flow; the pre-quantized export has not yet been evaluated end to end.

The per-prompt embedding drift reported in the table above (mean cosine vs bf16 of 0.793 for cold-convert and 0.850 for this export) propagates into ~20 cm mean per-joint position drift in the generated motion: visually different from bf16 but functionally equivalent on the TMR/foot-physics metrics.
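
The card does not say how the embedding metrics were computed. Assuming the usual definitions of cosine similarity and Euclidean (L2) distance, aggregated per prompt, a minimal sketch could look like this. The function names and the tiny vectors are made up for illustration; real inputs would be the paired bf16 and nf4 embeddings for the same prompts.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def drift_summary(reference, candidate):
    """Aggregate per-prompt metrics over two paired lists of embeddings."""
    cosines = [cosine_similarity(r, c) for r, c in zip(reference, candidate)]
    distances = [l2_distance(r, c) for r, c in zip(reference, candidate)]
    return {
        "cosine_mean": sum(cosines) / len(cosines),
        "cosine_min": min(cosines),
        "l2_mean": sum(distances) / len(distances),
    }


# Tiny made-up vectors standing in for real embeddings.
reference = [[1.0, 0.0], [0.0, 2.0]]
candidate = [[1.0, 0.0], [2.0, 0.0]]
summary = drift_summary(reference, candidate)
print(summary)
```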

## How this was made

1. Downloaded base weights from `meta-llama/Meta-Llama-3-8B-Instruct` (the actual Llama 3 8B base, since McGill's MNTP repo is adapter-only).
2. Loaded base in bf16 via `LlamaBiModel` (the bidirectional variant kimodo's `LLM2Vec.from_pretrained` resolves to).
3. Applied the MNTP LoRA from `McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp` and called `merge_and_unload`. This happens **in bf16**, so no rounding loss.
4. Saved the merged bf16 base, then re-loaded it with `BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.bfloat16)`. **One quantization pass, no merge-on-quantized.**
5. `save_pretrained` writes `config.json` with the bnb `quantization_config`, plus `model.safetensors` containing the actual 4-bit weights.
6. Supervised adapter from `McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised` is shipped separately (un-merged); PEFT applies it on top of the quantized base at inference.
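
Steps 1 to 5 can be sketched roughly as follows. This is not the author's actual export script: the function name, its arguments, the intermediate `merged_dir`, the `device_map`, and the `llm2vec.models` import path are assumptions made here. Running it needs a CUDA GPU plus `torch`, `transformers`, `peft`, `bitsandbytes`, and `llm2vec`, so the heavy imports are deferred into the function body.

```python
def export_prequantized(base_id, mntp_adapter_id, merged_dir, out_dir):
    """Hypothetical sketch: merge the MNTP LoRA in bf16, then quantize once."""
    # Deferred imports: these packages are only needed when the export runs.
    import torch
    from llm2vec.models import LlamaBiModel  # assumed import path
    from peft import PeftModel
    from transformers import BitsAndBytesConfig

    # Steps 1-2: load the Llama 3 base in bf16 with the bidirectional class.
    model = LlamaBiModel.from_pretrained(base_id, torch_dtype=torch.bfloat16)

    # Step 3: apply the MNTP LoRA and merge it while everything is still bf16.
    model = PeftModel.from_pretrained(model, mntp_adapter_id)
    model = model.merge_and_unload()
    model.save_pretrained(merged_dir)

    # Step 4: reload the merged checkpoint with a single NF4 quantization pass.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    quantized = LlamaBiModel.from_pretrained(
        merged_dir,
        quantization_config=bnb_config,
        device_map={"": 0},  # assumption: place everything on the first GPU
    )

    # Step 5: write config.json (with quantization_config) and 4-bit weights.
    quantized.save_pretrained(out_dir)
    return out_dir
```

A call might pass `meta-llama/Meta-Llama-3-8B-Instruct` as `base_id` and `McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp` as `mntp_adapter_id`, matching the repositories named in the steps above.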

## Repo layout

```
.
├── config.json                  # contains quantization_config (auto-applied on load)
├── model.safetensors            # bnb 4-bit weights (Llama-3-8B + merged MNTP)
├── tokenizer.json
├── tokenizer_config.json
├── chat_template.jinja
├── supervised_adapter/          # second LoRA stack (kept un-merged)
│   ├── adapter_config.json
│   └── adapter_model.safetensors
└── README.md
```

## Usage with kimodo

This repo plugs into kimodo's text encoder via environment variables. It requires the `bitsandbytes`-aware `llm2vec_wrapper.py` patch from [`remotemedia-sdk`](https://github.com/); without that patch, kimodo cannot load nf4-quantized encoders directly:

```bash
# 1) download the export (one-time)
hf download matbee/kimodo-llm2vec-nf4 --local-dir ~/llm2vec-nf4
# 2) point kimodo at it
CUDA_VISIBLE_DEVICES=0 \
TEXT_ENCODER_DEVICE=cuda:0 \
TEXT_ENCODER_MODE=local \
LLM2VEC_QUANTIZE=nf4 \
LLM2VEC_LOCAL_BASE=$HOME/llm2vec-nf4 \
LLM2VEC_LOCAL_PEFT=$HOME/llm2vec-nf4/supervised_adapter \
python kimodo_daemon.py
```

`LLM2VEC_QUANTIZE=nf4` tells the wrapper to honor the bnb config in `config.json`. The two `LLM2VEC_LOCAL_*` vars short-circuit the Hub download.
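
The wrapper patch itself is not reproduced in this card. As a purely hypothetical sketch of how such env-var resolution might work, the function below reads the variables from the command above and falls back to Hub repositories when the local paths are unset. The function name, the fallback repo IDs, and the default device are assumptions, not the patch's actual behavior.

```python
import os

# Assumed Hub fallbacks: the two McGill repos named in this card.
DEFAULT_BASE = "McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp"
DEFAULT_PEFT = "McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised"


def resolve_encoder_settings(env=None):
    """Hypothetical resolver; the real wrapper patch may differ."""
    env = os.environ if env is None else env
    return {
        # Local paths, when set, take priority over the Hub repos.
        "base": env.get("LLM2VEC_LOCAL_BASE") or DEFAULT_BASE,
        "peft": env.get("LLM2VEC_LOCAL_PEFT") or DEFAULT_PEFT,
        # None means "load unquantized"; "nf4" means "honor the bnb config".
        "quantize": env.get("LLM2VEC_QUANTIZE", "").lower() or None,
        "device": env.get("TEXT_ENCODER_DEVICE", "cuda:0"),  # assumed default
    }


settings = resolve_encoder_settings({
    "LLM2VEC_QUANTIZE": "nf4",
    "LLM2VEC_LOCAL_BASE": "/home/user/llm2vec-nf4",
    "LLM2VEC_LOCAL_PEFT": "/home/user/llm2vec-nf4/supervised_adapter",
})
print(settings)
```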

## Standalone use (without kimodo)

The model is a drop-in replacement for the McGill base via the standard `LLM2Vec` Python API. You'll need `bitsandbytes>=0.43`, `transformers>=4.46` (and `<5.0`, see Caveats), and `peft>=0.11`:

```python
import torch
from llm2vec import LLM2Vec  # from McGill's llm2vec package

# Paths to a local download of this repo (placeholders; fill in your own).
base_dir = "<local clone>"
adapter_dir = "<local clone>/supervised_adapter"

# Load the bnb-quantized base; quantization_config is in config.json so
# transformers re-applies bnb automatically. The supervised LoRA adapter
# is applied on top of the quantized base.
model = LLM2Vec.from_pretrained(
    base_model_name_or_path=base_dir,
    peft_model_name_or_path=adapter_dir,
    torch_dtype=torch.bfloat16,
)

# Encode a list of prompts into embeddings.
embeddings = model.encode(["A person waves with their right hand."])
```

## Caveats

- **Meta Llama 3 license applies.** This is a derivative of Llama 3; your use must comply with the [Llama 3 Community License](https://llama.meta.com/llama3/license/).
- **Kimodo wrapper patch required for env-var-driven loading.** Stock `kimodo.model.LLM2VecEncoder` doesn't honor `LLM2VEC_QUANTIZE` / `LLM2VEC_LOCAL_*`; that wiring lives in the matching wrapper patch. Without the patch, you can still use this repo via the standalone Python API above.
- **GPU-only.** bnb 4-bit weights cannot be moved to CPU after load. Pin to a CUDA device.
- **transformers 5.x quirk:** `caching_allocator_warmup` walks bnb-internal buffer paths and crashes. The kimodo wrapper patch ships a no-op shim. Pin transformers `<5.0` if you load this repo from a different host.
- **Encode latency is ~53 ms/prompt slower than bf16** on a 4090 (single-prompt p50: 136 ms vs 83 ms). Negligible against kimodo's 1–3 s diffusion step per intent.

## Attribution

- **Base model:** [`meta-llama/Meta-Llama-3-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
- **MNTP + supervised LoRA (LLM2Vec):** [`McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp`](https://huggingface.co/McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp), [`McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised`](https://huggingface.co/McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised)
- **Quantization:** `bitsandbytes` (NF4)
- **Use case:** [NVIDIA's Kimodo](https://github.com/nv-tlabs/kimodo) text-conditioned motion diffusion