Instructions to use ShahriarFerdoush/llama-3.2-1b-code-instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use ShahriarFerdoush/llama-3.2-1b-code-instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="ShahriarFerdoush/llama-3.2-1b-code-instruct")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ShahriarFerdoush/llama-3.2-1b-code-instruct")
model = AutoModelForCausalLM.from_pretrained("ShahriarFerdoush/llama-3.2-1b-code-instruct")


vLLM

How to use ShahriarFerdoush/llama-3.2-1b-code-instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ShahriarFerdoush/llama-3.2-1b-code-instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ShahriarFerdoush/llama-3.2-1b-code-instruct",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
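
The curl call above can also be made from Python. The following is a minimal sketch using only the standard library; it assumes the vLLM server started above is listening on `localhost:8000`, and the helper names `build_completion_request` and `complete` are hypothetical, introduced here for illustration.

```python
import json
import urllib.request

MODEL_ID = "ShahriarFerdoush/llama-3.2-1b-code-instruct"


def build_completion_request(prompt, base_url="http://localhost:8000",
                             max_tokens=512, temperature=0.5):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the text of the first completion."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

Calling `complete("Once upon a time,")` should return the generated continuation once the server is up. Passing `base_url="http://localhost:30000"` would target the SGLang server described further down instead.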

Use Docker

docker model run hf.co/ShahriarFerdoush/llama-3.2-1b-code-instruct

SGLang

How to use ShahriarFerdoush/llama-3.2-1b-code-instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ShahriarFerdoush/llama-3.2-1b-code-instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ShahriarFerdoush/llama-3.2-1b-code-instruct",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ShahriarFerdoush/llama-3.2-1b-code-instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ShahriarFerdoush/llama-3.2-1b-code-instruct",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use ShahriarFerdoush/llama-3.2-1b-code-instruct with Docker Model Runner:
```
docker model run hf.co/ShahriarFerdoush/llama-3.2-1b-code-instruct
```

llama-3.2-1b-code-instruct / README.md

ShahriarFerdoush

Update README.md

fdd74ef verified 5 months ago


	---
	library_name: transformers
	license: apache-2.0
	datasets:
	- sahil2801/CodeAlpaca-20k
	base_model:
	- meta-llama/Llama-3.2-1B
	---


	# 🧠 Llama-3.2-1B Code Solver (QLoRA Fine-Tuned)

	A lightweight, code-focused language model fine-tuned from Meta Llama-3.2-1B using QLoRA (4-bit) on a 10,000-sample subset of the CodeAlpaca-20K dataset.
	Designed for efficient code generation, reasoning, and problem-solving on limited GPU resources.

	> 🚀 Trained on a single Tesla P100 GPU
	> ⚡ Optimized for Kaggle, Colab, and low-VRAM environments
	> 🧩 Ideal for research, education, and rapid prototyping

	---

	## 🔍 Model Overview

	| Attribute | Value |
	|-----------|-------|
	| Base Model | `meta-llama/Llama-3.2-1B` |
	| Model Type | Decoder-only causal language model |
	| Fine-Tuning Method | QLoRA (4-bit quantization + LoRA) |
	| LoRA Rank | 16 |
	| Task Domain | Code generation & code reasoning |
	| Training Samples | 10,000 |
	| Training Time | ~5 hours |
	| Hardware | NVIDIA Tesla P100 |
	| Precision | 4-bit (NF4) |
	| Frameworks | Hugging Face Transformers, PEFT, BitsAndBytes |

	---

	## 🎯 What This Model Is Good At

	- 🧑‍💻 Code generation (Python-focused, but generalizable)
	- 🧠 Step-by-step coding reasoning
	- 🧪 Algorithmic problem solving
	- 📘 Educational coding assistance
	- ⚙️ Running efficiently on low-VRAM GPUs

	---

	## 📚 Training Dataset

	### CodeAlpaca-20K

	A high-quality instruction-tuning dataset derived from the Alpaca format and specialized for coding tasks.

	- Total dataset size: 20,000 samples
	- Used for training: 10,000 samples (50%)
	- Data format:
	```json
	{
	  "instruction": "Describe the coding task",
	  "input": "Optional context or input code",
	  "output": "Expected code solution"
	}
	```

	- Task types:
	  - Algorithm implementation
	  - Code completion
	  - Debugging
	  - Function writing
	  - Problem solving
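
To make the record format above concrete, here is a minimal sketch of how one record could be flattened into a single training prompt. The `### Instruction / ### Input / ### Response` headers follow the common Alpaca layout; the exact template used to train this model is not stated in this card, so the template and the helper name `format_example` are assumptions.

```python
def format_example(record, include_output=True):
    """Render one CodeAlpaca-style record as a single prompt string.

    Assumption: Alpaca-style section headers. The template actually used
    for training is not documented in the model card.
    """
    parts = [f"### Instruction:\n{record['instruction'].strip()}"]
    # The "input" field is optional, so skip the section when it is empty.
    if record.get("input", "").strip():
        parts.append(f"### Input:\n{record['input'].strip()}")
    # At inference time the response is left empty for the model to fill in.
    response = record["output"].strip() if include_output else ""
    parts.append(f"### Response:\n{response}")
    return "\n\n".join(parts)


example = {
    "instruction": "Write a function that adds two numbers.",
    "input": "",
    "output": "def add(a, b):\n    return a + b",
}
print(format_example(example))
```

Calling `format_example(example, include_output=False)` produces the same prompt with an empty response section, which is the form one would feed to the model at inference time under this assumed template.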

	---

	## 🏗️ Training Methodology

	This model was fine-tuned using QLoRA, enabling efficient adaptation of large language models on limited hardware.

	### Key Techniques Used

	* 4-bit Quantization (NF4) via BitsAndBytes
	* LoRA adapters applied to attention layers
	* Frozen base model weights
	* Low-rank updates only
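
The "frozen base weights, low-rank updates only" idea can be illustrated with a small NumPy sketch. This is a toy illustration of the LoRA update rule, not the training code for this model: the matrix sizes are made up, the base weight is kept in floating point instead of 4-bit, and the function name `lora_forward` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r, alpha = 8, 8, 2, 4  # toy sizes, not the model's real dimensions

W = rng.normal(size=(d_out, d_in))     # frozen base weight (never updated)
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # trainable up-projection, zero-initialised


def lora_forward(x, W, A, B, alpha, r):
    """Compute x @ (W + (alpha / r) * B @ A).T without forming the full update."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T


x = rng.normal(size=(3, d_in))
# With B = 0 the adapter contributes nothing, so the output equals the base layer.
assert np.allclose(lora_forward(x, W, A, B, alpha, r), x @ W.T)

trainable = A.size + B.size
print(f"trainable: {trainable}, frozen: {W.size}")  # trainable: 32, frozen: 64
```

Only `A` and `B` would receive gradients during fine-tuning; `W` stays fixed, which is what allows it to be stored in a compressed 4-bit form.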

	### Why QLoRA?

	* 🔻 Drastically reduces GPU memory usage
	* ⚡ Enables training on consumer-grade GPUs
	* 📈 Maintains strong downstream performance
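
A rough back-of-the-envelope calculation shows where the memory saving comes from. The sketch below counts weight storage only and uses a round one billion parameters; it ignores quantization constants, activations, optimizer state, and any layers left unquantized, so real usage will be higher. The 2048 x 2048 layer shape is a hypothetical example, not a figure taken from this card.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate storage for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def lora_params(d_in, d_out, rank):
    """Trainable parameters added by one LoRA adapter on a d_in x d_out layer."""
    return rank * (d_in + d_out)


NUM_PARAMS = 1_000_000_000  # round figure for a "1B" model, not the exact count

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)
print(f"16-bit weights: {fp16_gb:.1f} GB")  # 16-bit weights: 2.0 GB
print(f" 4-bit weights: {nf4_gb:.1f} GB")   #  4-bit weights: 0.5 GB

# Hypothetical 2048 x 2048 projection with rank 16:
full = 2048 * 2048
adapter = lora_params(2048, 2048, 16)
print(f"adapter / full = {adapter / full:.2%}")  # adapter / full = 1.56%
```

Under these assumptions the frozen weights shrink to roughly a quarter of their 16-bit size, and each adapted layer adds under 2% of that layer's parameters as trainable weights.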

	---

	## ⚙️ Training Configuration

	| Parameter | Value |
	| --------------------- | ----------------------- |
	| Max Sequence Length | 1024 |
	| LoRA Rank (r) | 16 |
	| LoRA Alpha | 32 |
	| LoRA Dropout | 0.05 |
	| Optimizer | AdamW |
	| Learning Rate | 2e-4 |
	| Batch Size | Small (GPU-constrained) |
	| Gradient Accumulation | Enabled |
	| Quantization | 4-bit |

	---

	## 🚀 Usage

	### Load the Model

	```python
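	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

	model_id = "ShahriarFerdoush/llama-3.2-1b-code-instruct"

	# 4-bit NF4 quantization via bitsandbytes (requires a CUDA GPU)
	bnb_config = BitsAndBytesConfig(
	    load_in_4bit=True,
	    bnb_4bit_quant_type="nf4",
	    bnb_4bit_compute_dtype=torch.float16,  # assumption: fp16 compute
	)

	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(
	    model_id,
	    device_map="auto",
	    quantization_config=bnb_config,
	)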
	```

	### Example Inference

	```python
	prompt = "Write a Python function to check if a number is prime."

	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
	outputs = model.generate(**inputs, max_new_tokens=200)

	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```


	## 🧪 Evaluation Notes

	* This model is instruction-tuned, not benchmark-optimized
	* No formal benchmarks (HumanEval / MBPP) were run
	* Best evaluated through qualitative code generation

	## ⚠️ Limitations

	* 1B parameters → limited long-context reasoning
	* Not optimized for natural language chat
	* May hallucinate on complex or ambiguous prompts
	* English-centric training data


	## 🧭 Intended Use

	✅ Allowed

	* Research and experimentation
	* Coding assistants
	* Educational tools
	* Prototyping LLM systems


	## 🙏 Acknowledgements

	* Meta AI for Llama 3.2
	* CodeAlpaca dataset creators
	* Hugging Face ecosystem
	* QLoRA & PEFT authors