# Quantized Model Card (INT4 / W4A16)
This repository provides an INT4 weight-quantized (W4A16) variant of
`IQuestLab/IQuest-Coder-V1-40B-Instruct`, quantized with `llm-compressor` and
saved in the `compressed-tensors` format. It can be loaded for inference
directly with `transformers`.
## Quantization Config (Key Settings)

- Method: GPTQ (`llm-compressor`)
- Scheme: W4A16 (INT4 weights; activations remain fp16/bf16)
- Group / block size: 128
- Symmetric: true
- Actorder: static
- Ignored modules: `lm_head`
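As a rough sanity check on the footprint this scheme implies, the arithmetic below estimates checkpoint size. The parameter count (~40B, inferred from the model name) and the assumption of one fp16 scale per 128-weight group with no stored zero-points (symmetric) are assumptions, not values read from this checkpoint:

```python
# Back-of-envelope checkpoint size for W4A16, group size 128.
# Assumptions: ~40e9 weights, one fp16 scale per 128-weight group,
# symmetric quantization (no zero-points stored).
PARAMS = 40e9
BITS_PER_WEIGHT = 4 + 16 / 128            # int4 value + amortized fp16 scale
int4_gb = PARAMS * BITS_PER_WEIGHT / 8 / 1e9
bf16_gb = PARAMS * 2 / 1e9                # 2 bytes per weight in bf16

print(f"approx. INT4 checkpoint: {int4_gb:.1f} GB")   # ~20.6 GB
print(f"approx. BF16 checkpoint: {bf16_gb:.1f} GB")   # ~80.0 GB
```

The group scales add only ~0.125 bits per weight, so W4A16 stays close to a 4x reduction over bf16.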
## Transformers Inference (INT4)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# You can also replace this with your local path
MODEL_ID = "xhxlb-12138/IQuest-Coder-V1-40B-Instruct-int4"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",
    device_map="cuda:0",
    trust_remote_code=True,
)

prompt = "Write a Python function to calculate the Fibonacci sequence using dynamic programming."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer([text], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
## Transformers Basic Test

Example console output from a quick smoke test (the warnings below are
expected and harmless):

```text
You are using the default legacy behaviour of the IQuestCoderTokenizer. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Loading checkpoint shards: 100%|██████████| 5/5 [00:13<00:00, 2.73s/it]
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
```

The generated response follows:
Building a web application with Django follows a structured workflow. Django is a high-level Python web framework that encourages rapid development and clean, pragmatic design.
Here are the step-by-step instructions to build a web application using Django.
---
### Step 1: Install Python and Django
Ensure you have Python installed on your computer. Then, install Django using the `pip` package manager.
```bash
# Install pip (if not already installed)
python -m ensurepip --upgrade
# Install Django
pip install django
```
### Step 2: Create a Project
Navigate to your desired directory and create a new Django project.
```bash
# Create a project named 'myproject'
django-admin startproject myproject
# Navigate into the project directory
cd myproject
```
### Step 3: Create an App
Django projects are composed of apps. Create a new app for your specific functionality (e.g., a blog or a store).
```bash
# Create an app named 'myapp'
python manage.py startapp myapp
```
## vLLM Serve

```bash
vllm serve xhxlb-12138/IQuest-Coder-V1-40B-Instruct-int4 \
  --served-model-name iquestcoder-instruct-int4 \
  --port 8000 \
  -tp 2 \
  -dp 1 \
  --trust_remote_code \
  --reasoning-parser qwen3
```
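Once the server is up, it speaks the OpenAI-compatible chat-completions API. The sketch below builds a request body for it; the host/port (`localhost:8000`) follow the serve command above, and the prompt is illustrative:

```python
import json

# Request body for vLLM's OpenAI-compatible endpoint, assumed here to be
# http://localhost:8000/v1/chat/completions (adjust to your deployment).
payload = {
    "model": "iquestcoder-instruct-int4",   # matches --served-model-name
    "messages": [
        {"role": "user", "content": "Write a binary search in Python."}
    ],
    "max_tokens": 256,
}
body = json.dumps(payload)
print(body)

# Equivalent shell call, with the JSON above as the data:
#   curl http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d '{"model": "iquestcoder-instruct-int4", ...}'
```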
## Reproduce: Generate W4A16 (GPTQ) with llm-compressor

The following workflow matches how this checkpoint was produced: run GPTQ on a
small calibration set, then save with `save_compressed=True` to emit a
`compressed-tensors` model. See the upstream example at
https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_w4a16
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

BASE_MODEL = "IQuestLab/IQuest-Coder-V1-40B-Instruct"

model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)

NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Calibration data: a small slice of chat transcripts, shuffled.
ds = load_dataset(
    "HuggingFaceH4/ultrachat_200k",
    split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]",
).shuffle(seed=42)

def preprocess(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(tokenize, remove_columns=ds.column_names)

# Quantize all Linear layers to W4A16, leaving lm_head in full precision.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    tokenizer=tokenizer,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

SAVE_DIR = "IQuest-Coder-V1-40B-Instruct-W4A16-G128"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```
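To make the stored format concrete: W4A16 keeps int4 weight values plus one scale per 128-weight group. The pure-Python sketch below round-trips one group through symmetric quantization. It illustrates the numeric format only, not GPTQ's error-compensating weight updates, and the group values are made up for the example:

```python
def quantize_group(weights, num_bits=4):
    """Symmetric quantization of one group: int values plus one shared scale."""
    qmax = 2 ** (num_bits - 1) - 1                  # 7 for int4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # map max |w| to qmax
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize_group(q, scale):
    return [v * scale for v in q]

# Round-trip a toy group of 128 weights in [-1, 1).
group = [(i - 64) / 64.0 for i in range(128)]
q, scale = quantize_group(group)
restored = dequantize_group(q, scale)

# Rounding error per weight is at most half a quantization step.
max_err = max(abs(a - b) for a, b in zip(group, restored))
assert max_err <= scale / 2 + 1e-12
```

With group size 128, each group pays for a single extra scale value, which is the `Group / block size: 128` setting in the config above.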
## Disclaimer

- Please follow the base model's license and terms of use. This repository only
  provides the quantized weight format and usage examples.