matbee/kimodo-llm2vec-nf4

A 4-bit (NF4) quantized version of McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised, produced for use as the text encoder in NVIDIA's Kimodo motion-diffusion model.

This repository ships the bnb-quantized base weights and the supervised LoRA adapter side-by-side, so loaders can skip the bf16 staging step entirely.

Built with Meta Llama 3. This work is a derivative of Meta's Llama 3 8B Instruct (via McGill-NLP's LLM2Vec MNTP + supervised pipeline). Use is governed by the Meta Llama 3 Community License.

Why?

Stock kimodo's text encoder consumes ~16 GB of VRAM, dominating its total memory budget. Kimodo's own README says: "Kimodo requires ~17GB of VRAM to generate locally entirely on GPU, primarily due to the text embedding model." A 4-bit-quantized encoder cuts that to ~5 GB with no measurable degradation in motion quality, freeing the budget for larger diffusion batches, longer sequences, or co-resident models.

Memory savings

| Mode | Disk download | First-load wall time | CPU peak RSS during load | GPU peak during load | GPU steady after load |
|---|---|---|---|---|---|
| bf16 (Hub original) | ~16 GB | ~80 s | ~15 GB | 15.18 GB | 15.18 GB |
| nf4 cold-convert (Hub bf16 + bnb quantize at load) | ~16 GB | 24.07 s | 15.22 GB | 14.73 GB | 4.83 GB |
| nf4 pre-quantized (this repo) | 4.84 GB | 4.39 s (5.5×) | 5.34 GB (2.9×) | 4.99 GB (2.95×) | 4.82 GB |

The "first-load wall time" column is the big practical win of using the pre-quantized export instead of cold-converting on every process start: no more loading 16 GB of bf16 weights only to immediately quantize them.
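The ratios quoted in the last row can be rechecked from the table values. A minimal sketch, with the numbers copied from the nf4 cold-convert and nf4 pre-quantized rows (the dictionary names are illustrative only):

```python
# Values copied from the memory table: nf4 cold-convert vs nf4 pre-quantized.
cold = {"load_s": 24.07, "cpu_peak_gb": 15.22, "gpu_peak_gb": 14.73}
prequant = {"load_s": 4.39, "cpu_peak_gb": 5.34, "gpu_peak_gb": 4.99}

# Ratio of cold-convert cost to pre-quantized cost for each column.
ratios = {key: cold[key] / prequant[key] for key in cold}

for key, value in ratios.items():
    print(f"{key}: {value:.2f}x")
```

The ratios are taken against the cold-convert row, not the bf16 row, because cold-convert is the flow the pre-quantized export replaces.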

Quality (vs the bf16 source model)

| Metric | bf16 (reference) | nf4 cold-convert (Hub flow) | nf4 pre-quantized (this repo) |
|---|---|---|---|
| Embedding cosine vs bf16 (mean) | 1.000 | 0.793 | 0.850 |
| Embedding cosine vs bf16 (min over 8 prompts) | 1.000 | 0.762 | 0.827 |
| L2 distance from bf16 (mean) | 0 | 87.98 | 72.94 |
| Encode latency p50 (RTX 4090, single prompt) | 83 ms | 136 ms | 136 ms |
| Kimodo end-to-end TMR R@3 (smoke, bf16 = GT) | 100% | 100% | (not yet evaluated; same model class as cold-convert) |
| Kimodo end-to-end FID (vs bf16) | 0 | 0.110 | (expected ≤ 0.110 since pre-merge has fewer quantization passes) |

Why pre-quantized is more faithful than cold-convert: the standard cold-convert flow loads the McGill MNTP base in 4-bit, then auto-loads the LoRA adapter, then calls merge_and_unload, which dequantizes back to bf16, applies the LoRA, and re-quantizes. That is two rounding passes through nf4. This export merges MNTP into the base in bf16 (lossless with respect to quantization), then quantizes to nf4 once.

The TMR retrieval R@3 of 100% on the smoke test means kimodo's diffusion still produces motions that are correctly retrievable from their text prompt, i.e. semantic alignment with the prompt is preserved. FID of 0.110 is well below sample noise. There is no measurable degradation on kimodo's text-following metrics.

The ~5% per-prompt embedding cosine drift (0.953 mean vs bf16) propagates into ~20 cm mean per-joint position drift in the generated motion: visually different from bf16 but functionally equivalent on the TMR/foot-physics metrics.
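The model card does not show how the embedding metrics were computed. The sketch below uses the standard definitions of cosine similarity and Euclidean (L2) distance on plain Python lists, and assumes those are what the table rows report. The toy vectors are hypothetical stand-ins for real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical 3-d vectors; real embeddings would come from model.encode(...).
reference = [3.0, 4.0, 0.0]   # stands in for a bf16 embedding
quantized = [4.0, 3.0, 0.0]   # stands in for an nf4 embedding

cos = cosine_similarity(reference, quantized)
dist = l2_distance(reference, quantized)
print(cos, dist)
```

Cosine similarity ignores vector length while L2 distance does not, which is why the table reports both.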

How this was made

  1. Downloaded base weights from meta-llama/Meta-Llama-3-8B-Instruct (the actual Llama 3 8B base, since McGill's MNTP repo is adapter-only).
  2. Loaded base in bf16 via LlamaBiModel (the bidirectional variant kimodo's LLM2Vec.from_pretrained resolves to).
  3. Applied the MNTP LoRA from McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp and called merge_and_unload. This happens in bf16, so there is no quantization rounding loss.
  4. Saved the merged bf16 base, then re-loaded it with BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.bfloat16). One quantization pass, no merge-on-quantized.
  5. save_pretrained writes config.json with the bnb quantization_config + model.safetensors containing the actual 4-bit weights.
  6. Supervised adapter from McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised is shipped separately (un-merged); PEFT applies it on top of the quantized base at inference.
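Steps 4 and 5 can be sketched as follows. The BitsAndBytesConfig arguments are copied from the list above. The function name, the directory arguments, and the device_map are illustrative assumptions, and the import path llm2vec.models.LlamaBiModel is assumed from McGill's llm2vec package rather than confirmed by this README. The heavy imports sit inside the function so the settings can be inspected without a GPU.

```python
# Quantization settings copied from step 4. The compute dtype is stored as a
# string here and resolved to torch.bfloat16 inside the function.
QUANT_SETTINGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_compute_dtype": "bfloat16",
}

def quantize_merged_base(merged_dir, out_dir):
    """Hypothetical helper: reload a merged bf16 checkpoint in nf4 and save it."""
    import torch
    from transformers import BitsAndBytesConfig
    from llm2vec.models import LlamaBiModel  # assumed import path

    settings = dict(QUANT_SETTINGS, bnb_4bit_compute_dtype=torch.bfloat16)
    bnb_config = BitsAndBytesConfig(**settings)

    # Single quantization pass: weights are quantized while loading onto GPU 0.
    model = LlamaBiModel.from_pretrained(
        merged_dir,
        quantization_config=bnb_config,
        device_map={"": 0},
    )
    # Writes config.json (with quantization_config) plus the 4-bit weights.
    model.save_pretrained(out_dir)
    return out_dir
```

Calling quantize_merged_base requires a CUDA device with bitsandbytes installed, since the 4-bit weights are produced on the GPU.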

Repo layout

.
├── config.json                 # contains quantization_config (auto-applied on load)
├── model.safetensors           # bnb 4-bit weights (Llama-3-8B + merged MNTP)
├── tokenizer.json
├── tokenizer_config.json
├── chat_template.jinja
├── supervised_adapter/         # second LoRA stack (kept un-merged)
│   ├── adapter_config.json
│   └── adapter_model.safetensors
└── README.md
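
A local clone can be checked against this layout before pointing a loader at it. A minimal sketch using only the standard library; the file list is taken from the tree above, and the helper name is hypothetical:

```python
from pathlib import Path

# Relative paths taken from the repo layout (README.md omitted as non-essential).
EXPECTED_FILES = [
    "config.json",
    "model.safetensors",
    "tokenizer.json",
    "tokenizer_config.json",
    "chat_template.jinja",
    "supervised_adapter/adapter_config.json",
    "supervised_adapter/adapter_model.safetensors",
]

def missing_files(clone_dir):
    """Return the expected files that are absent from a local clone."""
    root = Path(clone_dir).expanduser()
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

An empty list from missing_files("~/llm2vec-nf4") means both the quantized base and the supervised adapter are present.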

Usage with kimodo

This repo plugs into kimodo's text encoder via environment variables. This requires the bitsandbytes-aware llm2vec_wrapper.py patch from remotemedia-sdk; without that patch, kimodo cannot load nf4-quantized encoders directly:

# 1) download the export (one-time)
hf download matbee/kimodo-llm2vec-nf4 --local-dir ~/llm2vec-nf4

# 2) point kimodo at it
CUDA_VISIBLE_DEVICES=0 \
TEXT_ENCODER_DEVICE=cuda:0 \
TEXT_ENCODER_MODE=local \
LLM2VEC_QUANTIZE=nf4 \
LLM2VEC_LOCAL_BASE=$HOME/llm2vec-nf4 \
LLM2VEC_LOCAL_PEFT=$HOME/llm2vec-nf4/supervised_adapter \
python kimodo_daemon.py

LLM2VEC_QUANTIZE=nf4 tells the wrapper to honor the bnb config in config.json. The two LLM2VEC_LOCAL_* vars short-circuit the Hub download.
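
The same environment can be built from Python when kimodo is launched by another process. This is a sketch only: the variable names and values are the ones shown above, while the helper name and its arguments are hypothetical.

```python
import os
from pathlib import Path

def kimodo_env(export_dir, gpu_index="0"):
    """Return a copy of the current environment with the kimodo variables set."""
    export = Path(export_dir).expanduser()
    env = dict(os.environ)
    env.update({
        "CUDA_VISIBLE_DEVICES": gpu_index,
        # With one visible GPU, that GPU is cuda:0 inside the child process.
        "TEXT_ENCODER_DEVICE": "cuda:0",
        "TEXT_ENCODER_MODE": "local",
        "LLM2VEC_QUANTIZE": "nf4",
        "LLM2VEC_LOCAL_BASE": str(export),
        "LLM2VEC_LOCAL_PEFT": str(export / "supervised_adapter"),
    })
    return env

# Usage (not executed here):
#   import subprocess
#   subprocess.run(["python", "kimodo_daemon.py"],
#                  env=kimodo_env("~/llm2vec-nf4"), check=True)
```

Returning a copy keeps the parent process environment untouched, so other models in the same launcher are not affected.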

Standalone use (without kimodo)

The model is a drop-in replacement for the McGill base via the standard LLM2Vec Python API. You'll need bitsandbytes>=0.43, transformers>=4.46, and peft>=0.11:

import torch
from llm2vec import LLM2Vec  # from McGill's llm2vec package

# Load the bnb-quantized base; quantization_config is in config.json so
# transformers re-applies bnb automatically.
base_dir = "<local clone>"
adapter_dir = "<local clone>/supervised_adapter"

model = LLM2Vec.from_pretrained(
    base_model_name_or_path=base_dir,
    peft_model_name_or_path=adapter_dir,
    torch_dtype=torch.bfloat16,
)
embeddings = model.encode(["A person waves with their right hand."])

Caveats

  • Meta Llama 3 license applies. This is a derivative of Llama 3; your use must comply with the Llama 3 Community License.
  • Kimodo wrapper patch required for env-var-driven loading. Stock kimodo.model.LLM2VecEncoder doesn't honor LLM2VEC_QUANTIZE / LLM2VEC_LOCAL_*; that wiring lives in the matching wrapper patch. Without the patch, you can still use this repo via the standalone Python API above.
  • GPU-only. bnb 4-bit weights cannot be moved to CPU after load. Pin to a CUDA device.
  • transformers 5.x quirk: caching_allocator_warmup walks bnb-internal buffer paths and crashes. The kimodo wrapper patch ships a no-op shim. Pin transformers <5.0 if you load this repo from a different host.
  • Encode latency is ~53 ms/prompt slower than bf16 on a 4090 (single-prompt p50: 136 ms vs 83 ms). Negligible against kimodo's 1–3 s diffusion step per intent.
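
The version constraints mentioned in this README (bitsandbytes>=0.43, transformers>=4.46, peft>=0.11, and transformers <5.0 unless the wrapper shim is in use) can be checked before loading. A rough sketch using only the standard library; the helper names are hypothetical, and the version parsing is deliberately simple (it keeps only the leading integer of each dotted component).

```python
import re
from importlib import metadata

# Minimum versions from "Standalone use"; the cap comes from the 5.x caveat.
MINIMUMS = {"bitsandbytes": (0, 43), "transformers": (4, 46), "peft": (0, 11)}
TRANSFORMERS_CAP = (5, 0)

def parse_version(text):
    """Turn '4.46.1' or '0.43.0.dev0' into a tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)

def check_versions(installed):
    """Return a list of problems for a {package: version_string} mapping."""
    problems = []
    for name, minimum in MINIMUMS.items():
        if name not in installed:
            problems.append(f"{name} is not installed")
        elif parse_version(installed[name]) < minimum:
            problems.append(f"{name} {installed[name]} is older than required")
    version = installed.get("transformers")
    if version is not None and parse_version(version) >= TRANSFORMERS_CAP:
        problems.append("transformers 5.x needs the wrapper shim or a pin below 5.0")
    return problems

def installed_versions():
    """Look up the installed version of each package, skipping missing ones."""
    found = {}
    for name in MINIMUMS:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass
    return found
```

Running check_versions(installed_versions()) returns an empty list when the environment satisfies all of the constraints.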

Attribution
