	---
	license: llama3
	language:
	- en
	library_name: transformers
	pipeline_tag: feature-extraction
	base_model: McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised
	tags:
	- llm2vec
	- embedding
	- sentence-similarity
	- text-encoder
	- llama3
	- kimodo
	- quantized
	- bitsandbytes
	- nf4
	- 4-bit
	- peft
	- lora
	inference: false
	---

	# matbee/kimodo-llm2vec-nf4

	A 4-bit (NF4) quantized version of [`McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised`](https://huggingface.co/McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised), produced for use as the text encoder in [NVIDIA's Kimodo](https://github.com/nv-tlabs/kimodo) motion-diffusion model.

	This repository ships the bnb-quantized base weights and the supervised LoRA adapter side-by-side, so loaders can skip the bf16 staging step entirely.

	> Built with Meta Llama 3. This work is a derivative of Meta's Llama 3 8B Instruct (via McGill-NLP's LLM2Vec MNTP + supervised pipeline). Use is governed by the [Meta Llama 3 Community License](https://llama.meta.com/llama3/license/).

	## Why?

	Stock kimodo's text encoder consumes ~16 GB of VRAM, dominating its total memory budget — kimodo's own README says: "Kimodo requires ~17GB of VRAM to generate locally entirely on GPU, primarily due to the text embedding model." A 4-bit-quantized encoder cuts that to ~5 GB with no measurable degradation in motion quality, freeing the budget for larger diffusion batches, longer sequences, or co-resident models.

	## Memory savings

	| Mode | Disk download | First-load wall time | CPU peak RSS during load | GPU peak during load | GPU steady after load |
	|------|--------------:|---------------------:|-------------------------:|---------------------:|----------------------:|
	| bf16 (Hub original) | ~16 GB | ~80 s | ~15 GB | 15.18 GB | 15.18 GB |
	| nf4 cold-convert (Hub bf16 + bnb quantize at load) | ~16 GB | 24.07 s | 15.22 GB | 14.73 GB | 4.83 GB |
	| nf4 pre-quantized (this repo) | 4.84 GB | 4.39 s (5.5×) | 5.34 GB (2.9×) | 4.99 GB (2.95×) | 4.82 GB |


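	The "First-load wall time" column is the big practical win of using the pre-quantized export instead of cold-converting on every process start: 4.39 s instead of 24.07 s, because the loader no longer reads 16 GB of bf16 weights only to quantize them immediately. The multipliers in parentheses in the last row are relative to the cold-convert row.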
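
The multipliers in the last table row can be reproduced with simple division. The sketch below copies the cold-convert and pre-quantized figures from the table; the dictionary and function names are illustrative only.

```python
# Figures copied from the table above (cold-convert row vs. this repo's row).
cold_convert = {"load_s": 24.07, "cpu_peak_gb": 15.22, "gpu_peak_gb": 14.73}
pre_quantized = {"load_s": 4.39, "cpu_peak_gb": 5.34, "gpu_peak_gb": 4.99}


def improvement(metric: str) -> float:
    """How many times smaller the pre-quantized figure is than cold-convert."""
    return cold_convert[metric] / pre_quantized[metric]


for metric in cold_convert:
    print(f"{metric}: {improvement(metric):.2f}x")
```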

	## Quality (vs the bf16 source model)

	| Metric | bf16 (reference) | nf4 cold-convert (Hub flow) | nf4 pre-quantized (this repo) |
	|--------|:----------------:|:---------------------------:|:---------------------------------:|
	| Embedding cosine vs bf16 (mean) | 1.000 | 0.793 | 0.850 |
	| Embedding cosine vs bf16 (min over 8 prompts) | 1.000 | 0.762 | 0.827 |
	| L2 distance from bf16 (mean) | 0 | 87.98 | 72.94 |
	| Encode latency p50 (RTX 4090, single prompt) | 83 ms | 136 ms | 136 ms |
	| Kimodo end-to-end TMR R@3 (smoke, bf16 = GT) | 100% | 100% | (not yet evaluated; same model class as cold-convert) |
	| Kimodo end-to-end FID (vs bf16) | 0 | 0.110 | (expected ≤ 0.110 since pre-merge has fewer quantization passes) |

	**Why pre-quantized is more faithful than cold-convert:** the standard cold-convert flow loads the McGill MNTP base in 4-bit, auto-loads the LoRA adapter, and then calls `merge_and_unload`, which dequantizes back to bf16, applies the LoRA, and re-quantizes. That is two rounding passes through nf4. This export merges MNTP into the base in bf16 (no nf4 rounding), then quantizes to nf4 **once**.
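
To illustrate why a second rounding pass costs accuracy, here is a toy simulation. It assumes a plain uniform rounding grid as a stand-in for 4-bit quantization; it is not the NF4 codebook and does not call `bitsandbytes`, so only the direction of the effect carries over, not the magnitudes.

```python
import math
import random

STEP = 0.1  # grid spacing of the toy quantizer (NOT the real NF4 codebook)


def quantize(x: float) -> float:
    """Round x to the nearest multiple of STEP (stand-in for 4-bit rounding)."""
    return math.floor(x / STEP + 0.5) * STEP


def mean_abs_error(n: int = 10_000, seed: int = 0) -> tuple[float, float]:
    """Mean absolute error of one-pass vs. two-pass quantization."""
    rng = random.Random(seed)
    one_pass = two_pass = 0.0
    for _ in range(n):
        w = rng.uniform(-1.0, 1.0)        # base weight
        delta = rng.uniform(-0.05, 0.05)  # adapter update to merge in
        target = w + delta                # exact merged weight
        # Merge at full precision, then quantize once.
        one_pass += abs(quantize(target) - target)
        # Quantize, dequantize, merge, then quantize again.
        two_pass += abs(quantize(quantize(w) + delta) - target)
    return one_pass / n, two_pass / n


one_pass_err, two_pass_err = mean_abs_error()
print(f"one pass: {one_pass_err:.4f}  two passes: {two_pass_err:.4f}")
```

In this toy setup the adapter update is smaller than half a grid step, so the second rounding snaps the merged weight back onto the already-quantized base value and the update is lost entirely.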


	The TMR retrieval R@3 of 100% on the smoke test means kimodo's diffusion still produces motions that are correctly retrievable from their text prompt, i.e. semantic alignment with the prompt is preserved. An FID of 0.110 is well below sample noise. Both figures were measured with the cold-convert flow; the pre-quantized export has not yet been evaluated end to end. On these metrics there is no measurable degradation in kimodo's text-following.

	The ~5% per-prompt embedding cosine drift (0.953 mean vs bf16) propagates into ~20 cm mean per-joint position drift in the generated motion — visually different from bf16 but functionally equivalent on the TMR/foot-physics metrics.
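
For reference, the cosine and L2 rows in the table above are per-prompt comparisons between a bf16 embedding and its quantized counterpart, aggregated over the prompt set. A minimal pure-Python sketch of those formulas follows; the function names are illustrative, and this is not the evaluation script used for the table.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def l2_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def summarize(reference: Sequence[Vector], candidate: Sequence[Vector]) -> dict:
    """Aggregate per-prompt metrics: reference[i] and candidate[i] share a prompt."""
    cosines = [cosine_similarity(r, c) for r, c in zip(reference, candidate)]
    distances = [l2_distance(r, c) for r, c in zip(reference, candidate)]
    return {
        "cosine_mean": sum(cosines) / len(cosines),
        "cosine_min": min(cosines),
        "l2_mean": sum(distances) / len(distances),
    }
```

Calling `summarize(bf16_embeddings, nf4_embeddings)` on two equally ordered lists of vectors returns the three aggregate values in one dictionary.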

	## How this was made

	1. Downloaded base weights from `meta-llama/Meta-Llama-3-8B-Instruct` (the actual Llama 3 8B base, since McGill's MNTP repo is adapter-only).
	2. Loaded base in bf16 via `LlamaBiModel` (the bidirectional variant kimodo's `LLM2Vec.from_pretrained` resolves to).
	3. Applied the MNTP LoRA from `McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp` and called `merge_and_unload` — this happens in bf16, so no rounding loss.
	4. Saved the merged bf16 base, then re-loaded it with `BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.bfloat16)`. One quantization pass, no merge-on-quantized.
	5. `save_pretrained` writes `config.json` with the bnb `quantization_config` + `model.safetensors` containing the actual 4-bit weights.
	6. Supervised adapter from `McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised` is shipped separately (un-merged); PEFT applies it on top of the quantized base at inference.
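
Steps 2 to 5 could be sketched roughly as follows. This is a reconstruction from the list above, not the script actually used: the function name, its parameters, the intermediate `merged_dir`, and the `device_map` choice are assumptions, and `LlamaBiModel` is assumed to be importable from `llm2vec.models`. The Llama 3 base repo is gated, so the download needs an authorized Hugging Face token.

```python
# Keyword arguments for transformers.BitsAndBytesConfig, as listed in step 4.
# The compute dtype is given by name; BitsAndBytesConfig resolves the string
# to the matching torch dtype.
NF4_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_compute_dtype": "bfloat16",
}


def merge_then_quantize(base_id: str, mntp_adapter_id: str,
                        merged_dir: str, out_dir: str) -> None:
    """Hypothetical helper: merge the MNTP LoRA in bf16, then quantize once."""
    # Heavy imports stay local so this module imports without a GPU stack.
    import torch
    from llm2vec.models import LlamaBiModel  # assumed import path
    from peft import PeftModel
    from transformers import BitsAndBytesConfig

    # Step 2: load the base in bf16 with the bidirectional model class.
    base = LlamaBiModel.from_pretrained(base_id, torch_dtype=torch.bfloat16)

    # Step 3: apply the MNTP LoRA and fold it into the bf16 weights.
    merged = PeftModel.from_pretrained(base, mntp_adapter_id).merge_and_unload()
    merged.save_pretrained(merged_dir)
    del base, merged

    # Step 4: reload the merged checkpoint with a single NF4 quantization pass.
    quantized = LlamaBiModel.from_pretrained(
        merged_dir,
        quantization_config=BitsAndBytesConfig(**NF4_KWARGS),
        device_map={"": 0},  # assumption: place everything on the first GPU
    )

    # Step 5: write config.json (with quantization_config) and the 4-bit weights.
    quantized.save_pretrained(out_dir)
```

The supervised adapter from step 6 is deliberately left out of this function, since it is shipped un-merged.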

	## Repo layout

	```
	.
	├── config.json                 # contains quantization_config (auto-applied on load)
	├── model.safetensors           # bnb 4-bit weights (Llama-3-8B + merged MNTP)
	├── tokenizer.json
	├── tokenizer_config.json
	├── chat_template.jinja
	├── supervised_adapter/         # second LoRA stack (kept un-merged)
	│   ├── adapter_config.json
	│   └── adapter_model.safetensors
	└── README.md
	```
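
After downloading, a quick check that the local copy matches the layout above can catch an interrupted download before any GPU time is spent. A small sketch, using only the file names from the tree; the helper name is illustrative:

```python
from pathlib import Path

# Relative paths taken from the repo layout above (README.md is optional).
EXPECTED_FILES = [
    "config.json",
    "model.safetensors",
    "tokenizer.json",
    "tokenizer_config.json",
    "chat_template.jinja",
    "supervised_adapter/adapter_config.json",
    "supervised_adapter/adapter_model.safetensors",
]


def missing_files(root) -> list:
    """Return the expected files that are absent under the given directory."""
    root = Path(root).expanduser()
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # Example path only; point this at wherever the repo was downloaded.
    print(missing_files("~/llm2vec-nf4"))
```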

	## Usage with kimodo

	This repo plugs into kimodo's text encoder through environment variables. It requires the `bitsandbytes`-aware `llm2vec_wrapper.py` patch from [`remotemedia-sdk`](https://github.com/); without that patch, kimodo cannot load nf4-quantized encoders directly:

	```bash
	# 1) download the export (one-time)
	hf download matbee/kimodo-llm2vec-nf4 --local-dir ~/llm2vec-nf4

	# 2) point kimodo at it
	CUDA_VISIBLE_DEVICES=0 \
	TEXT_ENCODER_DEVICE=cuda:0 \
	TEXT_ENCODER_MODE=local \
	LLM2VEC_QUANTIZE=nf4 \
	LLM2VEC_LOCAL_BASE=$HOME/llm2vec-nf4 \
	LLM2VEC_LOCAL_PEFT=$HOME/llm2vec-nf4/supervised_adapter \
	python kimodo_daemon.py
	```

	`LLM2VEC_QUANTIZE=nf4` tells the wrapper to honor the bnb config in `config.json`. The two `LLM2VEC_LOCAL_*` vars short-circuit the Hub download.
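
The wrapper patch itself is not reproduced here. As a purely hypothetical illustration of how a loader might interpret the three `LLM2VEC_*` variables, consider the sketch below; `EncoderSettings` and `read_encoder_settings` are invented names and are not part of kimodo or the patch.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class EncoderSettings:
    """Hypothetical container for the LLM2VEC_* environment variables."""
    quantize: Optional[str]    # e.g. "nf4"
    local_base: Optional[str]  # directory holding config.json + model.safetensors
    local_peft: Optional[str]  # directory holding the supervised adapter

    @property
    def use_local_files(self) -> bool:
        # Skip the Hub only when both local paths are provided.
        return self.local_base is not None and self.local_peft is not None


def read_encoder_settings(env: Mapping[str, str] = os.environ) -> EncoderSettings:
    """Read the variables, treating unset or blank values as None."""
    def get(name: str) -> Optional[str]:
        value = env.get(name, "").strip()
        return value or None

    return EncoderSettings(
        quantize=get("LLM2VEC_QUANTIZE"),
        local_base=get("LLM2VEC_LOCAL_BASE"),
        local_peft=get("LLM2VEC_LOCAL_PEFT"),
    )
```

Passing the environment as a mapping keeps the function easy to exercise with a plain dictionary instead of the real process environment.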

	## Standalone use (without kimodo)

	The model is a drop-in replacement for the McGill base via the standard `LLM2Vec` Python API. You'll need `bitsandbytes>=0.43`, `transformers>=4.46`, and `peft>=0.11`:

	```python
	import torch
	from llm2vec import LLM2Vec  # from McGill's llm2vec package

	# Load the bnb-quantized base; quantization_config is in config.json so
	# transformers re-applies bnb automatically.
	base_dir = "<local clone>"
	adapter_dir = "<local clone>/supervised_adapter"

	model = LLM2Vec.from_pretrained(
	    base_model_name_or_path=base_dir,
	    peft_model_name_or_path=adapter_dir,
	    torch_dtype=torch.bfloat16,
	)
	embeddings = model.encode(["A person waves with their right hand."])
	```
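
Before loading, it can help to confirm that the installed packages meet the version floors quoted above. The following is a minimal sketch using only the standard library; the helper names are illustrative, and the parser handles only the leading numeric components of a version string.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions quoted in this section.
MINIMUMS = {"bitsandbytes": (0, 43), "transformers": (4, 46), "peft": (0, 11)}


def parse_version(text: str) -> tuple:
    """Leading numeric components of a version string, e.g. "4.46.1" -> (4, 46, 1)."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as "dev0" or "0rc1"
        parts.append(int(piece))
    return tuple(parts)


def check_requirements() -> list:
    """Return human-readable problems; an empty list means all floors are met."""
    problems = []
    for package, minimum in MINIMUMS.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            problems.append(f"{package} is not installed")
            continue
        if parse_version(installed) < minimum:
            wanted = ".".join(str(n) for n in minimum)
            problems.append(f"{package} {installed} is older than {wanted}")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print(problem)
```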

	## Caveats

	- Meta Llama 3 license applies. This is a derivative of Llama 3; your use must comply with the [Llama 3 Community License](https://llama.meta.com/llama3/license/).
	- Kimodo wrapper patch required for env-var-driven loading. Stock `kimodo.model.LLM2VecEncoder` doesn't honor `LLM2VEC_QUANTIZE` / `LLM2VEC_LOCAL_*` — that wiring lives in the matching wrapper patch. Without the patch, you can still use this repo via the standalone Python API above.
	- GPU-only. bnb 4-bit weights cannot be moved to CPU after load. Pin to a CUDA device.
	- transformers 5.x quirk: `caching_allocator_warmup` walks bnb-internal buffer paths and crashes. The kimodo wrapper patch ships a no-op shim. Pin `transformers<5.0` if you load this repo from any other host application.
	- Encode latency is ~53 ms/prompt slower than bf16 on a 4090 (single-prompt p50: 136 ms vs 83 ms). Negligible against kimodo's 1–3 s diffusion step per intent.

	## Attribution

	- Base model: [`meta-llama/Meta-Llama-3-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
	- MNTP + supervised LoRA (LLM2Vec): [`McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp`](https://huggingface.co/McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp), [`McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised`](https://huggingface.co/McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-supervised)
	- Quantization: `bitsandbytes` (NF4)
	- Use case: [NVIDIA's Kimodo](https://github.com/nv-tlabs/kimodo) text-conditioned motion diffusion