---
library_name: transformers
tags:
- quantization
- bitsandbytes
- 4-bit
- nf4
- double-quant
- mcqa
---

# Model Card for `Kikinoking/MNLP_M3_quantized_model`

A 4-bit, double-quantized (NF4 + nested quantization) version of `aidasvenc/MNLP_M3_mcqa_model`, compressed with bitsandbytes. It answers multiple-choice questions (MCQA) with a much smaller GPU memory footprint than the full-precision base model.

## Model Details

- **Model ID:** `Kikinoking/MNLP_M3_quantized_model`
- **Quantization:** 4-bit NF4 + nested quantization (`bnb_4bit_use_double_quant=True`)
- **Base model:** `aidasvenc/MNLP_M3_mcqa_model`
- **Library:** Transformers + bitsandbytes
- **Task:** Multiple-choice question answering (MCQA)
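
The quantization settings above are stored with the checkpoint. A minimal sketch for inspecting them, assuming the settings were serialized into `config.json` when the model was pushed:

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Kikinoking/MNLP_M3_quantized_model")
# Expected to include load_in_4bit=True, bnb_4bit_quant_type="nf4",
# and bnb_4bit_use_double_quant=True
print(cfg.quantization_config)
```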

## Usage

Load and run inference in just a few lines. Besides `transformers`, loading the 4-bit checkpoint requires `bitsandbytes` and a CUDA-capable GPU:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Kikinoking/MNLP_M3_quantized_model"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
    low_cpu_mem_usage=True,
).eval()

# A simple MCQA prompt; the model is asked to emit the answer letter
prompt = "What is the capital of France?\nA) Lyon B) Marseille C) Paris D) Toulouse\nAnswer: "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=1)

print("Answer:", tokenizer.decode(output[0], skip_special_tokens=True))
```
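
Generating a single token works, but for MCQA it is often more robust to score each option by its log-likelihood and pick the best one. A minimal sketch (the question and options below are illustrative, not tied to any particular evaluation format):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Kikinoking/MNLP_M3_quantized_model"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto").eval()

question = "What is the capital of France?\nA) Lyon B) Marseille C) Paris D) Toulouse\nAnswer: "
options = ["A", "B", "C", "D"]

scores = []
with torch.inference_mode():
    for opt in options:
        # Score question + candidate letter; the question part is identical across
        # options, so the ranking is driven by how likely each appended letter is.
        enc = tokenizer(question + opt, return_tensors="pt").to(model.device)
        loss = model(**enc, labels=enc["input_ids"]).loss
        scores.append(-loss.item())  # higher = more likely

print("Predicted option:", options[scores.index(max(scores))])
```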

## How It Was Built

The quantized weights were produced from the base model and pushed to the Hub with the following script:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

base_id = "aidasvenc/MNLP_M3_mcqa_model"

# 4-bit NF4 quantization with nested (double) quantization
qcfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=qcfg,
    device_map="auto",
    torch_dtype="auto",
)

# Push the quantized model and tokenizer to the Hugging Face Hub
model.push_to_hub("Kikinoking/MNLP_M3_quantized_model", private=True)
tokenizer.push_to_hub("Kikinoking/MNLP_M3_quantized_model")

print("VRAM used (MiB):", torch.cuda.memory_reserved() / 1024**2)
```
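
The `memory_reserved()` print above reports VRAM reserved by the allocator during the build session. To check the model's own weight footprint, a minimal sketch using `get_memory_footprint()` (which counts parameter and buffer memory, not activations):

```python
from transformers import AutoModelForCausalLM

# Footprint of the 4-bit quantized checkpoint, in MiB
quant = AutoModelForCausalLM.from_pretrained("Kikinoking/MNLP_M3_quantized_model", device_map="auto")
print("4-bit footprint (MiB):", quant.get_memory_footprint() / 1024**2)

# For comparison, the full-precision base model (needs enough memory to load it):
# base = AutoModelForCausalLM.from_pretrained("aidasvenc/MNLP_M3_mcqa_model", device_map="auto")
# print("Base footprint (MiB):", base.get_memory_footprint() / 1024**2)
```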