
Quantized Model Card (INT4 / W4A16)

This repository provides an INT4 weight-quantized (W4A16) variant of IQuestLab/IQuest-Coder-V1-40B-Instruct, quantized with llm-compressor and saved in the compressed-tensors format. It can be loaded for inference directly with transformers.

Quantization Config (Key Settings)

  • Method: GPTQ (llm-compressor)
  • Scheme: W4A16 (int4 weights; activations remain fp16/bf16)
  • Group / block size: 128
  • Symmetric: true
  • Actorder: static
  • Ignored modules: lm_head
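These settings are recorded in the checkpoint's `config.json` under `quantization_config`, as compressed-tensors checkpoints typically are. A quick way to verify them (a minimal sketch using `AutoConfig`):

```python
from transformers import AutoConfig

# Inspect the quantization settings stored in the checkpoint's config.json
cfg = AutoConfig.from_pretrained("xhxlb-12138/IQuest-Coder-V1-40B-Instruct-int4", trust_remote_code=True)
print(cfg.quantization_config)  # scheme, group size, ignored modules, etc.
```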

Transformers Inference (INT4)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# You can also replace this with your local path
MODEL_ID = "xhxlb-12138/IQuest-Coder-V1-40B-Instruct-int4"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",
    device_map="cuda:0",
    trust_remote_code=True,
)

prompt = "Write a Python function to calculate the Fibonacci sequence using dynamic programming."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, skipping the echoed prompt
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
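For interactive use you can stream tokens to stdout as they are generated; a minimal sketch reusing the model and tokenizer above with transformers' `TextStreamer`:

```python
from transformers import TextStreamer

# Print tokens as they are generated, skipping the echoed prompt
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**inputs, max_new_tokens=256, streamer=streamer)
```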

Transformers Basic Test
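The logs and sample completion below came from a quick smoke test of the quantized checkpoint. The exact script is not part of this card; a minimal sketch that would reproduce it, reusing the model and tokenizer loaded above (the prompt is an assumption inferred from the output):

```python
# Hypothetical smoke test; the prompt is inferred from the completion below.
messages = [{"role": "user", "content": "How do I build a web application with Django?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```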

```
You are using the default legacy behaviour of the IQuestCoderTokenizer. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Loading checkpoint shards: 100%|██████████████████████████████| 5/5 [00:13<00:00,  2.73s/it]
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
```
Building a web application with Django follows a structured workflow. Django is a high-level Python web framework that encourages rapid development and clean, pragmatic design.

Here are the step-by-step instructions to build a web application using Django.

---

### Step 1: Install Python and Django
Ensure you have Python installed on your computer. Then, install Django using the `pip` package manager.

```bash
# Install pip (if not already installed)
python -m ensurepip --upgrade

# Install Django
pip install django
```

### Step 2: Create a Project
Navigate to your desired directory and create a new Django project.

```bash
# Create a project named 'myproject'
django-admin startproject myproject

# Navigate into the project directory
cd myproject
```

### Step 3: Create an App
Django projects are composed of apps. Create a new app for your specific functionality (e.g., a blog or a store).

```bash
# Create an app named 'myapp'
python manage.py startapp myapp
```

*(generation truncated by the token limit)*

vLLM Serve

```bash
vllm serve xhxlb-12138/IQuest-Coder-V1-40B-Instruct-int4 \
  --served-model-name iquestcoder-instruct-int4 \
  --port 8000 \
  -tp 2 \
  -dp 1 \
  --trust-remote-code \
  --reasoning-parser qwen3
```
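The server exposes an OpenAI-compatible API. A minimal client sketch (assumes the `openai` Python package and the `--served-model-name` above):

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server started above
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="iquestcoder-instruct-int4",  # matches --served-model-name
    messages=[{"role": "user", "content": "Write a quicksort in Python."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```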

Reproduce: Generate W4A16 (GPTQ) with llm-compressor

The following workflow matches how this checkpoint was produced: run GPTQ on a small calibration set, then save with `save_compressed=True` to emit a compressed-tensors model.

Refer to https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_w4a16

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Load the full-precision base model and tokenizer
BASE_MODEL = "IQuestLab/IQuest-Coder-V1-40B-Instruct"
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)

# Small calibration set drawn from ultrachat_200k
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]").shuffle(seed=42)

# Render each conversation with the chat template, then tokenize
def preprocess(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(tokenize, remove_columns=ds.column_names)

# GPTQ W4A16 on all Linear layers, keeping lm_head in full precision
# NOTE: the published checkpoint's quantization_config also reports actorder: static
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
oneshot(
    model=model,
    dataset=ds,
    tokenizer=tokenizer,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save in compressed-tensors format
SAVE_DIR = "IQuest-Coder-V1-40B-Instruct-W4A16-G128"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```
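As a sanity check, the compressed output directory can be reloaded the same way as the published checkpoint (a minimal sketch; requires enough GPU memory for the quantized weights):

```python
from transformers import AutoModelForCausalLM

# Reload the freshly saved compressed-tensors checkpoint for a quick test
reloaded = AutoModelForCausalLM.from_pretrained(
    SAVE_DIR,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
```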

Disclaimer

  • Please follow the base model's license and terms of use. This repository only provides the quantized weight format and usage examples.