---
library_name: transformers
tags:
  - quantization
  - bitsandbytes
  - 4-bit
  - nf4
  - double-quant
  - mcqa
---
# Model Card for `Kikinoking/MNLP_M3_quantized_model`

A 4-bit NF4 version of `aidasvenc/MNLP_M3_mcqa_model` with double (nested) quantization, produced with bitsandbytes. The model answers multiple-choice questions (MCQA) with a reduced GPU memory footprint compared with the unquantized base model.

## Model Details

- **Model ID:** `Kikinoking/MNLP_M3_quantized_model`  
- **Quantization:** 4-bit NF4 + nested quantization (`bnb_4bit_use_double_quant=True`)  
- **Base model:** `aidasvenc/MNLP_M3_mcqa_model` 
- **Library:** Transformers + bitsandbytes  
- **Task:** Multiple-choice question answering (MCQA)  
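
For the MCQA task, prompts in this card follow a simple layout: the question, one line of lettered options, and a trailing `Answer: `. The sketch below builds that layout and pulls a letter out of generated text. The function names `format_mcqa_prompt` and `extract_choice` are hypothetical helpers, and the layout is inferred from the single example prompt in the Usage section, so it is an assumption rather than a documented training format.

```python
import re
from typing import Optional, Sequence

LETTERS = "ABCD"  # assumption: at most four options, labelled A-D


def format_mcqa_prompt(question: str, choices: Sequence[str]) -> str:
    """Render a question and its options in the layout used by the Usage example."""
    if not 1 <= len(choices) <= len(LETTERS):
        raise ValueError("expected between 1 and 4 choices")
    # Options are joined with two spaces, matching the example prompt.
    options = "  ".join(f"{letter}) {text}" for letter, text in zip(LETTERS, choices))
    return f"{question}\n{options}\nAnswer: "


def extract_choice(generated: str) -> Optional[str]:
    """Return the first standalone letter A-D in the generated text, or None."""
    match = re.search(r"\b([A-D])\b", generated)
    return match.group(1) if match else None


prompt = format_mcqa_prompt(
    "What is the capital of France?",
    ["Lyon", "Marseille", "Paris", "Toulouse"],
)
print(prompt)
```

The string returned by `format_mcqa_prompt` can be passed to the tokenizer as-is, and `extract_choice` can be applied to the decoded continuation to guard against stray whitespace or punctuation around the letter.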

## Usage

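The checkpoint is stored already quantized, so no `BitsAndBytesConfig` is needed at load time, but `bitsandbytes` and `accelerate` must be installed alongside `transformers`. Load and run inference in a few lines: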

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Kikinoking/MNLP_M3_quantized_model"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
    low_cpu_mem_usage=True
).eval()

prompt = "What is the capital of France?\nA) Lyon  B) Marseille  C) Paris  D) Toulouse\nAnswer: "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=1)

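# Decode only the newly generated token, not the echoed prompt
new_tokens = output[0, inputs["input_ids"].shape[1]:]
print("Answer:", tokenizer.decode(new_tokens, skip_special_tokens=True))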
```

## How It Was Built

The checkpoint was produced by loading the base model with a 4-bit NF4 `BitsAndBytesConfig` and pushing the result to the Hub:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

base_id = "aidasvenc/MNLP_M3_mcqa_model"
repo_id = "Kikinoking/MNLP_M3_quantized_model"

# 4-bit NF4 weights, bf16 compute, and a second quantization pass
# over the quantization constants (double/nested quantization)
qcfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=qcfg,
    device_map="auto",
    torch_dtype="auto",
)

# Push to Hugging Face Hub
model.push_to_hub(repo_id, private=True)
tokenizer.push_to_hub(repo_id)

# GPU memory currently reserved by PyTorch's caching allocator
print("VRAM used (MiB):", torch.cuda.memory_reserved() / 1024**2)
```
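
The VRAM figure printed by the build script can be sanity-checked with back-of-envelope arithmetic. The sketch below estimates storage per quantized weight, assuming the commonly cited defaults of one scale per 64-weight block and, with double quantization, 8-bit scales grouped in blocks of 256 with one fp32 constant per group. These block sizes, the function names, and the parameter count in the example are assumptions for illustration, not values taken from this model. Layers left unquantized (typically embeddings, norms, and the output head) and allocator overhead are not counted, so the measured figure will be higher.

```python
def nf4_bits_per_param(
    block_size: int = 64,
    double_quant: bool = True,
    nested_block_size: int = 256,
) -> float:
    """Approximate bits stored per weight under blockwise 4-bit quantization."""
    weight_bits = 4.0
    if double_quant:
        # One 8-bit scale per block, plus one fp32 constant per group of scales.
        scale_bits = 8.0 / block_size + 32.0 / (block_size * nested_block_size)
    else:
        # One fp32 scale per block.
        scale_bits = 32.0 / block_size
    return weight_bits + scale_bits


def estimate_weight_mib(n_params: int, **kwargs) -> float:
    """Estimate MiB needed for n_params quantized weights (weights only)."""
    return n_params * nf4_bits_per_param(**kwargs) / 8 / 1024**2


# Hypothetical parameter count, chosen only to illustrate the calculation.
n_params = 500_000_000
print("single quant (MiB):", round(estimate_weight_mib(n_params, double_quant=False), 1))
print("double quant (MiB):", round(estimate_weight_mib(n_params), 1))
```

Under these assumptions, double quantization lowers the per-weight cost from 4.5 bits to roughly 4.127 bits, which is the saving that `bnb_4bit_use_double_quant=True` targets.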