matbee/kimodo-llm2vec-nf4

A 4-bit (NF4) quantized version of McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised, produced for use as the text encoder in NVIDIA's Kimodo motion-diffusion model.

This repository ships the bnb-quantized base weights and the supervised LoRA adapter side-by-side, so loaders can skip the bf16 staging step entirely.

Built with Meta Llama 3. This work is a derivative of Meta's Llama 3 8B Instruct (via McGill-NLP's LLM2Vec MNTP + supervised pipeline). Use is governed by the Meta Llama 3 Community License.

Why?

Stock kimodo's text encoder consumes ~16 GB of VRAM, dominating its total memory budget — kimodo's own README says: "Kimodo requires ~17GB of VRAM to generate locally entirely on GPU, primarily due to the text embedding model." A 4-bit-quantized encoder cuts that to ~5 GB with no measurable degradation in motion quality, freeing the budget for larger diffusion batches, longer sequences, or co-resident models.

Memory savings

Mode	Disk download	First-load wall time	CPU peak RSS during load	GPU peak during load	GPU steady after load
bf16 (Hub original)	~16 GB	~80 s	~15 GB	15.18 GB	15.18 GB
nf4 cold-convert (Hub bf16 + bnb quantize at load)	~16 GB	24.07 s	15.22 GB	14.73 GB	4.83 GB
nf4 pre-quantized (this repo)	4.84 GB	4.39 s (5.5×)	5.34 GB (2.9×)	4.99 GB (2.95×)	4.82 GB

The "first-load wall time" column is the big practical win of the pre-quantized export over cold-converting on every process start: 4.39 s instead of 24.07 s, because there is no need to load 16 GB of bf16 weights only to quantize them immediately. The multipliers in the last row are relative to the cold-convert row.
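As a rough cross-check of the table, the sizes can be estimated from the architecture alone. The sketch below assumes the public Llama 3 8B dimensions (hidden size 4096, MLP size 14336, 32 layers, 1024-wide key/value projections, 128256-token vocabulary). It also assumes bitsandbytes quantizes only the linear layers and leaves the embedding table in bf16, and that NF4 with double quantization stores one 8-bit scale per 64 weights plus one fp32 scale per 256 of those scales. None of these inputs come from this repo's measurements.

```python
# Back-of-envelope size estimate for the encoder (no lm_head, as in a base-model class).
# All dimensions are assumptions taken from the public Llama 3 8B config.
HIDDEN, MLP, LAYERS, KV_DIM, VOCAB = 4096, 14336, 32, 1024, 128256

# Linear weights per decoder layer: q and o (HIDDEN x HIDDEN), k and v (HIDDEN x KV_DIM),
# and gate, up, down (HIDDEN x MLP each).
per_layer = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM + 3 * HIDDEN * MLP
linear_params = LAYERS * per_layer
embed_params = VOCAB * HIDDEN  # assumed to stay in bf16 after quantization

bf16_bytes = 2 * (linear_params + embed_params)

# NF4: 4 bits per weight, one 8-bit scale per 64-weight block,
# and one fp32 second-level scale per 256 blocks (double quantization).
blocks = linear_params // 64
nf4_linear_bytes = linear_params // 2 + blocks + 4 * (blocks // 256)
nf4_bytes = nf4_linear_bytes + 2 * embed_params

print(f"bf16 encoder: {bf16_bytes / 1e9:.2f} GB")
print(f"nf4 encoder:  {nf4_bytes / 1e9:.2f} GB")
```

Under these assumptions the estimate comes to about 15.0 GB for bf16 and about 4.65 GB for NF4, in the same range as the measured 15.18 GB and 4.82 GB steady-state figures. The remainder is plausibly norm weights, quantization metadata, and allocator overhead.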

Quality (vs the bf16 source model)

Metric	bf16 (reference)	nf4 cold-convert (Hub flow)	nf4 pre-quantized (this repo)
Embedding cosine vs bf16 (mean)	1.000	0.793	0.850
Embedding cosine vs bf16 (min over 8 prompts)	1.000	0.762	0.827
L2 distance from bf16 (mean)	0	87.98	72.94
Encode latency p50 (RTX 4090, single prompt)	83 ms	136 ms	136 ms
Kimodo end-to-end TMR R@3 (smoke, bf16 = GT)	100%	100%	(not yet evaluated; same model class as cold-convert)
Kimodo end-to-end FID (vs bf16)	0	0.110	(expected ≤ 0.110 since pre-merge has fewer quantization passes)

Why pre-quantized is more faithful than cold-convert: the standard cold-convert flow loads the McGill MNTP base in 4-bit, then auto-loads the LoRA adapter, then calls merge_and_unload which dequantizes back to bf16, applies the LoRA, and re-quantizes — two rounding passes through nf4. This export merges MNTP into the base in bf16 (lossless), then quantizes to nf4 once.
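The effect of the extra rounding pass can be illustrated with a toy quantizer. The sketch below is not NF4: it assumes a fixed, uniform 16-level grid and made-up weights and updates, whereas bitsandbytes uses a non-uniform codebook with per-block scales. It only shows why quantizing once after the merge cannot do worse, on such a fixed grid, than quantizing, merging, and quantizing again.

```python
import random

# Toy stand-in for a 4-bit format: a fixed, uniform 16-level grid on [-1, 1].
# Real NF4 uses a non-uniform codebook and per-block scales.
LEVELS = [-1 + 2 * i / 15 for i in range(16)]

def quantize(x: float) -> float:
    """Round x to the nearest grid level (values outside [-1, 1] clamp to the ends)."""
    return min(LEVELS, key=lambda level: abs(level - x))

rng = random.Random(0)
weights = [rng.uniform(-0.8, 0.8) for _ in range(10_000)]    # made-up base weights
lora_delta = [rng.gauss(0.0, 0.02) for _ in range(10_000)]   # made-up small update

# Exact merged weights, as a high-precision merge would produce.
target = [w + d for w, d in zip(weights, lora_delta)]

# One pass: merge first, then quantize once.
one_pass = [quantize(t) for t in target]

# Two passes: quantize, add the update to the dequantized value, quantize again.
two_pass = [quantize(quantize(w) + d) for w, d in zip(weights, lora_delta)]

def mean_abs_error(approx, exact):
    return sum(abs(a - e) for a, e in zip(approx, exact)) / len(exact)

err_one = mean_abs_error(one_pass, target)
err_two = mean_abs_error(two_pass, target)
print(f"one pass: {err_one:.4f}  two passes: {err_two:.4f}")
```

On a fixed grid the one-pass result is by construction the nearest level to the exact merged weight, so its error is never larger than the two-pass error. In the two-pass path a small update is often swallowed entirely, because the dequantized weight plus the update rounds back to the same level.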

The TMR retrieval R@3 of 100% on the smoke test means kimodo's diffusion still produces motions that are correctly retrievable from their text prompt, i.e. semantic alignment with the prompt is preserved. An FID of 0.110 is well below sample noise, so there is no measurable degradation on kimodo's text-following metrics. Both figures were measured with the cold-convert flow; the pre-quantized export has not yet been evaluated end to end.

The per-prompt embedding cosine drift (0.850 mean vs bf16 for this export, 0.793 for cold-convert) propagates into ~20 cm mean per-joint position drift in the generated motion: visually different from bf16 but functionally equivalent on the TMR/foot-physics metrics.
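For readers who want to reproduce the embedding rows of the table on their own prompts, the sketch below shows one plausible way to compute the mean cosine, minimum cosine, and mean L2 distance between two sets of embeddings. It is written in plain Python so it runs anywhere; the function names and the tiny example vectors are hypothetical, and in practice the inputs would be the per-prompt rows returned by the bf16 and nf4 encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def drift_report(reference, candidate):
    """Compare per-prompt embeddings row by row against a reference set."""
    cosines = [cosine(r, c) for r, c in zip(reference, candidate)]
    distances = [l2_distance(r, c) for r, c in zip(reference, candidate)]
    return {
        "cosine_mean": sum(cosines) / len(cosines),
        "cosine_min": min(cosines),
        "l2_mean": sum(distances) / len(distances),
    }

# Made-up 2-D "embeddings" for two prompts, only to show the call shape.
reference = [[3.0, 4.0], [1.0, 0.0]]
candidate = [[3.0, 4.0], [0.0, 1.0]]
report = drift_report(reference, candidate)
print(report)
```

The minimum is reported alongside the mean because a single badly drifted prompt can hide behind a healthy average.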

How this was made

Downloaded base weights from meta-llama/Meta-Llama-3-8B-Instruct (the actual Llama 3 8B base, since McGill's MNTP repo is adapter-only).
Loaded base in bf16 via LlamaBiModel (the bidirectional variant kimodo's LLM2Vec.from_pretrained resolves to).
Applied the MNTP LoRA from McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp and called merge_and_unload — this happens in bf16, so no rounding loss.
Saved the merged bf16 base, then re-loaded it with BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.bfloat16). One quantization pass, no merge-on-quantized.
save_pretrained writes config.json with the bnb quantization_config + model.safetensors containing the actual 4-bit weights.
Supervised adapter from McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised is shipped separately (un-merged); PEFT applies it on top of the quantized base at inference.
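The steps above can be sketched roughly as follows. This is a reconstruction under assumptions, not the script that produced this repo: the import path llm2vec.models.LlamaBiModel, the function name export_nf4, the directory arguments, and the device_map value are all guesses for illustration. It needs a CUDA GPU, access to the gated Llama 3 weights, and the torch, transformers, peft, bitsandbytes, and llm2vec packages.

```python
BASE_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
MNTP_ADAPTER_ID = "McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp"

def export_nf4(base_id, mntp_adapter_id, merged_dir, out_dir):
    """Merge the MNTP LoRA in bf16, then quantize to NF4 once and save (hypothetical helper)."""
    # Heavy imports are deferred so this file can be imported without a GPU stack.
    import torch
    from peft import PeftModel
    from transformers import BitsAndBytesConfig
    from llm2vec.models import LlamaBiModel  # assumed import path

    # Load the Llama 3 base in bf16 as the bidirectional variant.
    base = LlamaBiModel.from_pretrained(base_id, torch_dtype=torch.bfloat16)

    # Apply the MNTP LoRA and fold it into the base weights while still in bf16.
    merged = PeftModel.from_pretrained(base, mntp_adapter_id).merge_and_unload()
    merged.save_pretrained(merged_dir)

    # Reload the merged checkpoint with a single NF4 quantization pass.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    quantized = LlamaBiModel.from_pretrained(
        merged_dir, quantization_config=bnb_config, device_map={"": 0}
    )

    # Write config.json (including quantization_config) and the 4-bit safetensors.
    quantized.save_pretrained(out_dir)
```

A call would look like export_nf4(BASE_ID, MNTP_ADAPTER_ID, "merged-bf16", "llm2vec-nf4"), where both directory names are placeholders. The supervised adapter is deliberately not touched here, matching the description that it ships un-merged.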

Repo layout

.
├── config.json                 # contains quantization_config (auto-applied on load)
├── model.safetensors           # bnb 4-bit weights (Llama-3-8B + merged MNTP)
├── tokenizer.json
├── tokenizer_config.json
├── chat_template.jinja
├── supervised_adapter/         # second LoRA stack (kept un-merged)
│   ├── adapter_config.json
│   └── adapter_model.safetensors
└── README.md

Usage with kimodo

This repo plugs into kimodo's text encoder via the LLM2VEC_* env vars shown below. It requires the bitsandbytes-aware llm2vec_wrapper.py patch from remotemedia-sdk; without that patch, kimodo cannot load nf4-quantized encoders directly:

# 1) download the export (one-time)
hf download matbee/kimodo-llm2vec-nf4 --local-dir ~/llm2vec-nf4

# 2) point kimodo at it
CUDA_VISIBLE_DEVICES=0 \
TEXT_ENCODER_DEVICE=cuda:0 \
TEXT_ENCODER_MODE=local \
LLM2VEC_QUANTIZE=nf4 \
LLM2VEC_LOCAL_BASE=$HOME/llm2vec-nf4 \
LLM2VEC_LOCAL_PEFT=$HOME/llm2vec-nf4/supervised_adapter \
python kimodo_daemon.py

LLM2VEC_QUANTIZE=nf4 tells the wrapper to honor the bnb config in config.json. The two LLM2VEC_LOCAL_* vars short-circuit the Hub download.
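To make the role of each variable concrete, here is a minimal sketch of how a wrapper could map them to loader settings. It is not the code from the wrapper patch: the helper name resolve_encoder_settings, the fallback Hub ids, and the default device are assumptions for illustration only.

```python
import os

# Assumed Hub fallbacks when no local paths are given (the McGill MNTP base and
# supervised adapter); the real wrapper may use different defaults.
DEFAULT_BASE = "McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp"
DEFAULT_PEFT = "McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised"

def resolve_encoder_settings(env=None):
    """Hypothetical helper: translate the documented env vars into loader settings."""
    env = os.environ if env is None else env

    quantize = env.get("LLM2VEC_QUANTIZE", "").lower()
    if quantize not in ("", "nf4"):
        raise ValueError(f"unsupported LLM2VEC_QUANTIZE value: {quantize!r}")

    return {
        # Local paths, when set, take priority over the Hub ids.
        "base": env.get("LLM2VEC_LOCAL_BASE") or DEFAULT_BASE,
        "peft": env.get("LLM2VEC_LOCAL_PEFT") or DEFAULT_PEFT,
        "quantize": quantize or None,
        "device": env.get("TEXT_ENCODER_DEVICE", "cuda:0"),  # assumed default
    }

settings = resolve_encoder_settings({
    "LLM2VEC_QUANTIZE": "nf4",
    "LLM2VEC_LOCAL_BASE": "/models/llm2vec-nf4",
    "LLM2VEC_LOCAL_PEFT": "/models/llm2vec-nf4/supervised_adapter",
})
print(settings)
```

Passing the environment as a plain dict keeps the resolution logic easy to exercise without touching the real process environment.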

Standalone use (without kimodo)

The model is a drop-in replacement for the McGill base via the standard LLM2Vec Python API. You'll need bitsandbytes>=0.43, transformers>=4.46, and peft>=0.11:

import torch
from llm2vec import LLM2Vec  # from McGill's llm2vec package

# Load the bnb-quantized base; quantization_config is in config.json so
# transformers re-applies bnb automatically.
base_dir = "<local clone>"
adapter_dir = "<local clone>/supervised_adapter"

model = LLM2Vec.from_pretrained(
    base_model_name_or_path=base_dir,
    peft_model_name_or_path=adapter_dir,
    torch_dtype=torch.bfloat16,
)
embeddings = model.encode(["A person waves with their right hand."])

Caveats

Meta Llama 3 license applies. This is a derivative of Llama 3; your use must comply with the Llama 3 Community License.
Kimodo wrapper patch required for env-var-driven loading. Stock kimodo.model.LLM2VecEncoder doesn't honor LLM2VEC_QUANTIZE / LLM2VEC_LOCAL_* — that wiring lives in the matching wrapper patch. Without the patch, you can still use this repo via the standalone Python API above.
GPU-only. bnb 4-bit weights cannot be moved to CPU after load. Pin to a CUDA device.
transformers 5.x quirk: caching_allocator_warmup walks bnb-internal buffer paths and crashes. The kimodo wrapper patch ships a no-op shim. Pin transformers <5.0 if you load this repo from a different host.
Encode latency is ~53 ms/prompt slower than bf16 on a 4090 (single-prompt p50: 136 ms vs 83 ms). Negligible against kimodo's 1–3 s diffusion step per intent.
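The GPU and version caveats above can be checked before attempting a load. The helper below is a hypothetical sketch, not part of kimodo or the wrapper patch; it takes the version string and CUDA flag as arguments so the logic itself has no heavy dependencies.

```python
def preflight(transformers_version: str, cuda_available: bool) -> list:
    """Return human-readable problems; an empty list means the checks passed."""
    major, minor = (int(part) for part in transformers_version.split(".")[:2])
    problems = []
    if not cuda_available:
        problems.append("no CUDA device: bnb 4-bit weights need a GPU")
    if (major, minor) < (4, 46):
        problems.append("transformers is older than 4.46")
    if major >= 5:
        problems.append("transformers 5.x: caching_allocator_warmup may crash on bnb buffers")
    return problems

# Example with literal values; see the note below for wiring in real ones.
print(preflight("4.46.3", True))
```

With torch and transformers installed, the real inputs would be transformers.__version__ and torch.cuda.is_available().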

Attribution

Base model: meta-llama/Meta-Llama-3-8B-Instruct
MNTP + supervised LoRA (LLM2Vec): McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp, McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised
Quantization: bitsandbytes (NF4)
Use case: NVIDIA's Kimodo text-conditioned motion diffusion
