Model Information

A lightweight Llama-based model with a conversational tone of voice. It is based on Llama-3.2-1B-Instruct and was fine-tuned by the team at Restack to respond in a natural, conversational tone suitable for voice interactions.

The model is compatible with the Ultravox speech-to-text model: you can replace the Llama-3.2-1B-Instruct backbone of ultravox-v0_5-llama-3_2-1b with this model to obtain a speech-to-text model that responds in a conversational tone.

How to use

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# --------------------------------------------------------------------------------------
# *** Settings

model_id = "restack/conversational-v1.1-Llama-3.2-1B-Instruct"

system_prompt = "You are a helpful assistant. You answer questions in a natural, conversational tone, like in a spoken conversation."

user_prompt = "In the context of machine learning, what is regularization?"

eot_token = "<|eot_id|>"
pad_token = "<|finetune_right_pad_id|>"
assistant_header = "<|start_header_id|>assistant<|end_header_id|>\n\n"

prompt_template = {
    "system": f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nPLACEHOLDER_SYSTEM_PROMPT{eot_token}",
    "user": f"<|start_header_id|>user<|end_header_id|>\n\nPLACEHOLDER_QUESTION{eot_token}",
    "assistant": f"{assistant_header}PLACEHOLDER_ANSWER{eot_token}",
}

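torch_dtype = torch.bfloat16
max_length = 1024  # maximum total length in tokens, prompt included
temperature = 0.5  # sampling temperature; lower values give more predictable answers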

# --------------------------------------------------------------------------------------
# *** Load model

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# --------------------------------------------------------------------------------------
# *** Inference

pad_token_id = tokenizer.encode(pad_token, add_special_tokens=False)[0]

# Combine system prompt, user prompt, and assistant header.
messages = (
    prompt_template["system"].replace("PLACEHOLDER_SYSTEM_PROMPT", system_prompt)
    + prompt_template["user"].replace("PLACEHOLDER_QUESTION", user_prompt)
    + assistant_header
)

tokens = tokenizer(
    messages,
    add_special_tokens=False,
    return_tensors="pt",
)

input_ids = tokens["input_ids"].to(model.device)
attention_mask = tokens["attention_mask"].to(model.device)

model.eval()

generated_ids = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_length=max_length,
    pad_token_id=pad_token_id,
    do_sample=True,  # sampling must be enabled for `temperature` to take effect
    temperature=temperature,
)

generated_text = tokenizer.batch_decode(
    generated_ids,
    skip_special_tokens=False,
)[0]

# Remove the prompt and special tokens, leaving only the assistant answer.
assistant_response = (
    generated_text.split(assistant_header)[-1].replace(eot_token, "").strip()
)

print(assistant_response)
# Model prediction:
# So, regularization is basically a way to prevent overfitting in machine learning.
# Think of it like this: if you're trying to fit a model to a bunch of data, it's easy
# to get it to fit the noise in the data instead of the actual pattern. That's called
# overfitting. Regularization helps by adding a penalty term to the loss function. It
# makes the model more simple and less likely to fit the noise, so it doesn't overfit.
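
The `prompt_template` dictionary above also defines an `"assistant"` entry that the single-question example never uses. The sketch below shows one way it could be used to pass earlier turns of a conversation, assuming previous answers are formatted exactly like the final turn (the card does not say whether the model was fine-tuned on multi-turn data). `build_prompt` and `extract_answer` are hypothetical helper names; the code only builds and parses strings, so it runs without downloading the model.

```python
# Special tokens and templates copied from the example above.
eot_token = "<|eot_id|>"
assistant_header = "<|start_header_id|>assistant<|end_header_id|>\n\n"

prompt_template = {
    "system": f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nPLACEHOLDER_SYSTEM_PROMPT{eot_token}",
    "user": f"<|start_header_id|>user<|end_header_id|>\n\nPLACEHOLDER_QUESTION{eot_token}",
    "assistant": f"{assistant_header}PLACEHOLDER_ANSWER{eot_token}",
}


def build_prompt(system_prompt, history, question):
    """Hypothetical helper: format a conversation so the model answers `question`.

    `history` is a list of (question, answer) pairs from earlier turns.
    """
    prompt = prompt_template["system"].replace("PLACEHOLDER_SYSTEM_PROMPT", system_prompt)
    for past_question, past_answer in history:
        # Completed turns: user message followed by the full assistant answer.
        prompt += prompt_template["user"].replace("PLACEHOLDER_QUESTION", past_question)
        prompt += prompt_template["assistant"].replace("PLACEHOLDER_ANSWER", past_answer)
    # Final turn: the question plus an open assistant header for the model to complete.
    prompt += prompt_template["user"].replace("PLACEHOLDER_QUESTION", question)
    prompt += assistant_header
    return prompt


def extract_answer(generated_text):
    """Hypothetical helper: keep only the text after the last assistant header."""
    return generated_text.split(assistant_header)[-1].replace(eot_token, "").strip()


prompt = build_prompt(
    "You are a helpful assistant.",
    [("What is overfitting?", "It's when a model learns the noise instead of the pattern.")],
    "So how does regularization help with that?",
)
print(prompt)
```

The returned string plays the same role as `messages` in the example above: tokenize it with `add_special_tokens=False` so that `<|begin_of_text|>` is not added a second time.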

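In `generate()`, `temperature` rescales the next-token logits before sampling and has no effect under greedy decoding (`do_sample=False`). The plain-Python illustration below applies the standard temperature-scaled softmax to three hypothetical logits; it leaves out the top-k and top-p filtering that `generate()` may also apply, and it does not call the model.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the max before exponentiating for numerical stability.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate next tokens.
logits = [2.0, 1.0, 0.0]

for t in (1.0, 0.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

With these logits the most likely token's probability rises from about 0.67 at a temperature of 1.0 to about 0.87 at 0.5, so a setting of 0.5 makes answers more predictable at the cost of some variety.
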
Details

License: MIT
Research report: ...
Fine-tuning dataset: https://huggingface.co/datasets/restack/conversational-question-answer-wikipedia-v1.0

Fine-tuning configuration:
{
    "base_model": "fixie-ai/ultravox-v0_5-llama-3_2-1b",
    "batch_config": {
        "accumulation_steps": 2,
        "batch_size": 16,
        "batch_size_val": 32
    },
    "learning_rate_params": {
        "div_factor": 25.0,
        "final_div_factor": 1000.0,
        "learning_rate": 0.0002,
        "lr_scheduler_name": "OneCycleLR",
        "pct_start": 0.3
    },
    "lora_alpha": 64,
    "lora_dropout": 0.2,
    "lora_r": 64,
    "n_epochs": 5,
    "padding_side": "left",
    "quantization_config": {
        "llm_int8_skip_modules": [
            "lm_head"
        ],
        "llm_int8_threshold": 6.0,
        "load_in_4bit": false,
        "load_in_8bit": true
    },
    "target_modules": [
        "k_proj",
        "q_proj",
        "v_proj"
    ],
    "torch_dtype": "torch.bfloat16",
    "train_bias": "lora_only"
}
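
The `batch_config` and `learning_rate_params` fields can be read as follows, assuming they map onto gradient accumulation and PyTorch's `torch.optim.lr_scheduler.OneCycleLR` with its default cosine annealing, where `learning_rate` is the peak rate (the card does not show the training code, so this mapping is an assumption). The sketch computes the effective batch size and re-implements the two-phase OneCycle formula in plain Python so the learning-rate endpoints can be checked without PyTorch; `total_steps = 1000` is a hypothetical value.

```python
import math

# Values copied from the configuration above.
batch_size = 16
accumulation_steps = 2
max_lr = 0.0002  # "learning_rate", assumed to be OneCycleLR's peak (max_lr)
div_factor = 25.0
final_div_factor = 1000.0
pct_start = 0.3

# Gradients of `accumulation_steps` micro-batches are accumulated before each
# optimizer step, so every update is based on this many examples.
effective_batch_size = batch_size * accumulation_steps

initial_lr = max_lr / div_factor        # learning rate at the first step
min_lr = initial_lr / final_div_factor  # learning rate at the last step


def cos_anneal(start, end, pct):
    """Cosine interpolation from `start` (pct=0) to `end` (pct=1)."""
    return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)


def one_cycle_lr(step, total_steps):
    """Two-phase OneCycle schedule: cosine warm-up to max_lr, then cosine decay to min_lr."""
    warmup_end = pct_start * total_steps - 1
    if step <= warmup_end:
        return cos_anneal(initial_lr, max_lr, step / warmup_end)
    return cos_anneal(max_lr, min_lr, (step - warmup_end) / (total_steps - 1 - warmup_end))


total_steps = 1000  # hypothetical; the real value depends on dataset size and n_epochs
print("effective batch size:", effective_batch_size)
for step in (0, 150, 299, 600, 999):
    print(step, f"{one_cycle_lr(step, total_steps):.2e}")
```

Under these assumptions the effective batch size is 32, and the learning rate starts at 8e-6, peaks at 2e-4 after 30% of the steps, then decays to 8e-9 by the end of training.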

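The LoRA fields determine how many weights are actually trained. The back-of-the-envelope sketch below assumes the adapters sit on the attention projections of the Llama-3.2-1B language model, uses its published dimensions (hidden size 2048, 16 layers, 32 query heads and 8 key/value heads of size 64, none of which are stated in this card), and follows the standard LoRA parameterization in which each adapted `d_in x d_out` projection gains two low-rank factors scaled by `lora_alpha / lora_r`. Biases are ignored, assuming the projections have none, as in Llama.

```python
# LoRA settings copied from the configuration above.
lora_r = 64
lora_alpha = 64
target_modules = ["k_proj", "q_proj", "v_proj"]

# Assumed Llama-3.2-1B dimensions (not stated in this card).
hidden_size = 2048
num_layers = 16
num_heads = 32
num_kv_heads = 8
head_dim = hidden_size // num_heads  # 64

# (in_features, out_features) of each attention projection under grouped-query attention.
proj_shapes = {
    "q_proj": (hidden_size, num_heads * head_dim),     # 2048 -> 2048
    "k_proj": (hidden_size, num_kv_heads * head_dim),  # 2048 -> 512
    "v_proj": (hidden_size, num_kv_heads * head_dim),  # 2048 -> 512
}


def lora_params(d_in, d_out, r):
    """Trainable weights in one LoRA adapter: one d_in x r and one r x d_out factor."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(*proj_shapes[name], lora_r) for name in target_modules)
total = per_layer * num_layers

# The low-rank update is multiplied by this factor before being added to the frozen weight.
scaling = lora_alpha / lora_r

print(per_layer, total, scaling)
```

Under these assumptions the adapters add about 9.4 million trainable weights (589,824 per layer), roughly 1% of the model's 1B parameters, and since `lora_alpha` equals `lora_r` the update is applied with a scaling factor of 1.0.
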

Model size: 1B params (Safetensors)
Tensor types: F32, BF16
