Training dataset: iamtarun/python_code_instructions_18k_alpaca
How to use iamtarun/codegen-350M-mono-4bit-qlora with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="iamtarun/codegen-350M-mono-4bit-qlora")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("iamtarun/codegen-350M-mono-4bit-qlora")
model = AutoModelForCausalLM.from_pretrained("iamtarun/codegen-350M-mono-4bit-qlora")
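As a quick check, the pipeline loaded above can generate a completion directly. A minimal sketch (the prompt string and max_new_tokens value are only illustrative):
# Generate a short completion with the pipeline loaded above
result = pipe("def square(number):", max_new_tokens=64)
print(result[0]["generated_text"])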

How to use iamtarun/codegen-350M-mono-4bit-qlora with vLLM:
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "iamtarun/codegen-350M-mono-4bit-qlora"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "iamtarun/codegen-350M-mono-4bit-qlora",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
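Because the endpoint is OpenAI-compatible, the server can also be called from the official openai Python client. A minimal sketch, assuming the server above is running on port 8000 and the openai package is installed (vLLM does not check the API key by default, so any placeholder works):
from openai import OpenAI

# Point the client at the local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.completions.create(
    model="iamtarun/codegen-350M-mono-4bit-qlora",
    prompt="Once upon a time,",
    max_tokens=512,
    temperature=0.5,
)
print(response.choices[0].text)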
How to use iamtarun/codegen-350M-mono-4bit-qlora with SGLang:
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
--model-path "iamtarun/codegen-350M-mono-4bit-qlora" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "iamtarun/codegen-350M-mono-4bit-qlora",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'

# Alternatively, run the SGLang server in Docker:
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path "iamtarun/codegen-350M-mono-4bit-qlora" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "iamtarun/codegen-350M-mono-4bit-qlora",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
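The same request can also be issued from Python. A minimal sketch using the requests library, assuming one of the SGLang servers above is listening on port 30000:
import requests

# Same OpenAI-compatible payload as the curl call above
payload = {
    "model": "iamtarun/codegen-350M-mono-4bit-qlora",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5,
}
response = requests.post("http://localhost:30000/v1/completions", json=payload)
print(response.json()["choices"][0]["text"])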

How to use iamtarun/codegen-350M-mono-4bit-qlora with Docker Model Runner:
docker model run hf.co/iamtarun/codegen-350M-mono-4bit-qlora

This model is a fine-tuned version of codegen-350M-mono on the iamtarun/python_code_instructions_18k_alpaca dataset of Python code, which uses Alpaca-style prompts during training.
def generate_prompt(instruction, inputs=""):
    """
    Generate a prompt from the problem description and program input.
    @param instruction: str - text problem description
    @param inputs: str - input to the program
    """
    text = ("Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n"
            "### Instruction:\n"
            f"{instruction}\n\n"
            "### Input:\n"
            f"{inputs}\n\n"
            "### Output:\n")
    return text
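For a concrete illustration, the call below uses the same example values as the inference snippet at the end of this card and prints a prompt containing the three Alpaca-style sections:
# Build and inspect an example prompt
print(generate_prompt("Write a function to calculate square of a number in python",
                      "number = 5"))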

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("iamtarun/codegen-350M-mono-4bit-qlora", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("iamtarun/codegen-350M-mono-4bit-qlora")

# put the model in inference mode
model.eval()
# inference function
def pipe(prompt):
    """
    Generate a response for a text prompt built with the generate_prompt function.
    @param prompt: str - text prompt generated using the generate_prompt function
    """
    device = "cuda"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output = model.generate(**inputs,
                                max_length=512,
                                do_sample=True,
                                temperature=0.5,
                                top_p=0.95,
                                repetition_penalty=1.15)
    return tokenizer.decode(output[0],
                            skip_special_tokens=True,
                            clean_up_tokenization_spaces=False)
# generating code for a problem description
instruction = "Write a function to calculate square of a number in python"
inputs = "number = 5"
prompt = generate_prompt(instruction, inputs)
print(pipe(prompt))
print("\n", "="*100)