Inferra-Qwen-LoRA

Inferra-Qwen-LoRA is a lightweight conversational LoRA adapter for Qwen2.5-3B-Instruct, trained with QLoRA fine-tuning.


Features

  • QLoRA Fine-Tuning
  • 4-bit Quantization
  • Fast Inference
  • Low VRAM Usage
  • Hugging Face Compatible
  • LoRA Adapter Based

Model Details

Property            Value
Base Model          Qwen2.5-3B-Instruct
Fine-Tuning         QLoRA
Precision           4-bit
Trainable Params    ~15M
Total Params        ~3.1B
GPU Used            NVIDIA T4
Platform            Kaggle
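
As a rough sanity check on the "Trainable Params ~15M" figure, the sketch below counts the LoRA parameters under a set of assumptions that this card does not state: rank 8, adapters on all seven attention and MLP projections, and the layer shapes believed to match Qwen2.5-3B (hidden size 2048, intermediate size 11008, 36 layers, key/value width 256). Treat every constant as hypothetical and check it against the base model's config.json and the actual training notebook.

```python
# Assumed Qwen2.5-3B shapes (not stated in this card; verify in config.json).
HIDDEN = 2048
INTERMEDIATE = 11008
NUM_LAYERS = 36
KV_DIM = 256  # assumed: 2 key/value heads x 128 head dim

# Assumed LoRA target modules as (in_features, out_features).
TARGET_MODULES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank: int) -> int:
    """Count LoRA parameters: each adapted layer adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in TARGET_MODULES.values())
    return per_layer * NUM_LAYERS


RANK = 8  # hypothetical; the card does not state the rank
trainable = lora_param_count(RANK)

print(f"{trainable:,} trainable parameters")
print(f"{trainable / 3.1e9:.2%} of ~3.1B total parameters")
```

Under these assumptions the count is 14,966,784, roughly 0.5% of the base model, which is consistent with the ~15M in the table. A different rank or a different set of target modules would change the result proportionally.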

Installation

pip install -U transformers accelerate peft bitsandbytes unsloth

Hugging Face Model

mohdmusheer/inferra

Load Model

from unsloth import FastLanguageModel
from peft import PeftModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,
)

model = PeftModel.from_pretrained(
    model,
    "mohdmusheer/inferra",
)

FastLanguageModel.for_inference(model)
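
To illustrate why `load_in_4bit=True` keeps VRAM usage low, here is a back-of-the-envelope estimate of the memory needed for the weights alone, using the ~3.1B total parameter count from Model Details. It is a simplification: it ignores quantization constants, any layers kept in higher precision, activations, and the KV cache, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


total_params = 3.1e9  # "Total Params ~3.1B" from Model Details

fp16_gib = weight_memory_gib(total_params, 16)
int4_gib = weight_memory_gib(total_params, 4)

print(f"16-bit weights: {fp16_gib:.2f} GiB")
print(f"4-bit weights:  {int4_gib:.2f} GiB")
```

The 4-bit weights take about a quarter of the 16-bit footprint, which is what leaves room for inference on a single NVIDIA T4.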

Inference Example

messages = [
    {
        "role": "user",
        "content": "Explain machine learning in simple words."
    }
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(
    text,
    return_tensors="pt",
).to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=200,
)

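# Decode only the newly generated tokens; outputs[0] also contains the prompt.
prompt_length = inputs["input_ids"].shape[-1]
response = tokenizer.decode(
    outputs[0][prompt_length:],
    skip_special_tokens=True,
)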

print(response)

Training Config

max_seq_length = 1024
batch_size = 1
gradient_accumulation_steps = 8
max_steps = 500
learning_rate = 2e-4
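
The values above imply an effective batch size and an upper bound on the amount of data seen during training. The sketch below works through that arithmetic; `num_gpus = 1` is an assumption (a single T4), and the token count is a ceiling because sequences shorter than `max_seq_length` contribute fewer tokens.

```python
max_seq_length = 1024
batch_size = 1
gradient_accumulation_steps = 8
max_steps = 500
num_gpus = 1  # assumption: a single NVIDIA T4

# Gradients from several micro-batches are accumulated before each optimizer step.
effective_batch_size = batch_size * gradient_accumulation_steps * num_gpus

# Each optimizer step consumes one effective batch.
sequences_seen = effective_batch_size * max_steps

# Upper bound: assumes every sequence fills the full context window.
max_tokens_seen = sequences_seen * max_seq_length

print(f"effective batch size: {effective_batch_size}")
print(f"sequences seen:       {sequences_seen:,}")
print(f"tokens seen (max):    {max_tokens_seen:,}")
```

With these settings the run performs 500 optimizer updates over at most 4,000 training sequences.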

Docker Build

docker build -t inferra-qwen .

Docker Run

docker run --gpus all -p 8000:8000 inferra-qwen

Dockerfile

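FROM pytorch/pytorch:2.4.0-cuda12.1-cudnn9-runtime

WORKDIR /app

# Install dependencies before copying the source so this layer stays cached
# when only the application code changes.
RUN pip install --no-cache-dir -U \
    transformers \
    accelerate \
    peft \
    bitsandbytes \
    unsloth

COPY . .

# Matches the -p 8000:8000 mapping used in Docker Run.
EXPOSE 8000

CMD ["python", "app.py"]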

Project Structure

├── app.py
├── Dockerfile
├── inference.py
├── training.ipynb
└── README.md

Limitations

  • LoRA adapter only; the base model weights are not fully fine-tuned
  • Not a frontier reasoning model
  • Tuned for general conversation, not for specialized tasks

Future Improvements

  • Reasoning datasets
  • Coding specialization
  • RAG integration
  • GGUF export
  • Ollama support

License

Please follow the license terms of:

  • Base model
  • Datasets
  • Hugging Face ecosystem