🌌 Gemma-2b-TARS-SFT: Technical Model Card
Fine-tuned Gemma-2-2B-it optimized for Creative Writing, Technical Assistance, and Distinctive Persona-Driven Chat.
Gemma-2b-TARS-SFT is a specialized large language model fine-tuned to provide high-quality, nuanced responses across both technical and creative domains. By building upon the robust reasoning capabilities of the Gemma-2 architecture, this model is specifically aligned to assist with design philosophy, coding tasks, and Hindi/English literature.
🎠 Model Persona & Roleplay
Unlike the standard, sterile base model, TARS has been fine-tuned with a distinct, slightly sarcastic, and theatrical personality (heavily inspired by science-fiction tropes).
- Emotes: The model may spontaneously use action tags (e.g., *Adjusts welding goggles* or *Leans in conspiratorially*).
- Persona Control: If you require strict, professional outputs without theatrical flair, append the following to your system prompt:
  "Do not use asterisks or theatrical actions. Provide only the direct, professional answer."
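The persona toggle above can be wired into prompt construction directly. The helper below is a sketch (the function name and wrapping are illustrative, not part of the model's API), showing one way to append the persona-control instruction only when a professional tone is needed:

```python
# Sketch: a hypothetical helper that toggles the persona-control
# instruction on or off when building the chat messages.
PERSONA_OFF = (
    "Do not use asterisks or theatrical actions. "
    "Provide only the direct, professional answer."
)

def build_messages(user_prompt, professional=False):
    # Base system prompt, matching the usage example later in this card.
    system = ("You are TARS, an AI assistant specialized in creative "
              "technology and literature. You were created by Prashant.")
    if professional:
        system += " " + PERSONA_OFF
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]

msgs = build_messages("Summarize QLoRA in two sentences.", professional=True)
```

The resulting list can be passed straight to `tokenizer.apply_chat_template` as shown in the usage section.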
🛠 Model Details
- Base Model: `google/gemma-2-2b-it`
- Architecture: 2.6 billion parameters
- Fine-Tuning Method: 4-bit QLoRA (Quantized Low-Rank Adaptation)
- Quantization: 4-bit via `bitsandbytes` (weights stored at reduced precision to save VRAM), enabling efficient inference on consumer GPUs
- Creator: prash616 (Prashant)
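To see why 4-bit quantization matters on consumer GPUs, here is a back-of-the-envelope estimate of weight memory for the 2.6B parameters listed above. It covers the weights only (it ignores activations, KV cache, and quantization overhead), so treat the numbers as rough lower bounds:

```python
# Rough weight-memory estimate for a 2.6B-parameter model.
# Weights only: activations, KV cache, and quantization constants
# are not included.
PARAMS = 2.6e9

def weight_vram_gib(bits_per_param):
    """Memory needed for the weights alone, in GiB."""
    return PARAMS * bits_per_param / 8 / 1024**3

fp16_gib = weight_vram_gib(16)  # roughly 4.8 GiB
int4_gib = weight_vram_gib(4)   # roughly 1.2 GiB
```

The 4x reduction is what makes the model fit comfortably on free-tier GPUs such as a Colab T4.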
📊 Training Procedure & Data
The model was developed using a Supervised Fine-Tuning (SFT) strategy. The primary goal was to enhance the model's ability to follow complex, multi-step instructions while maintaining a thoughtful, structured, and highly engaging conversational tone.
1. Datasets
- Databricks Dolly-15k: Utilized to build a strong foundation in general instruction-following, brainstorming, classification, and open QA tasks.
- Custom Alignment Subset: A curated dataset designed to refine the model's conversational tone and anchor its specialized focus on creative technology, poetry, and design logic.
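Dolly-15k records carry `instruction`, optional `context`, and `response` fields. A sketch of how such a record might be mapped into chat-style SFT pairs is shown below; the helper name and exact mapping are illustrative assumptions, not taken from the actual training code:

```python
# Hypothetical mapping of a Dolly-15k record into chat-format SFT data.
def dolly_to_chat(record):
    user = record["instruction"]
    if record.get("context"):
        # Fold the optional context into the user turn.
        user += "\n\n" + record["context"]
    return [
        {"role": "user", "content": user},
        {"role": "assistant", "content": record["response"]},
    ]

example = dolly_to_chat({
    "instruction": "Classify the sentiment.",
    "context": "The welding goggles were a delight.",
    "response": "Positive",
})
```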
2. Training Hyperparameters
| Parameter | Value |
|---|---|
| Learning Rate | 1e-4 |
| LoRA Rank (r) | 16 |
| LoRA Alpha | 32 |
| Max Steps | 300 |
| Optimizer | AdamW (8-bit) |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
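The hyperparameters above explain why QLoRA training is so cheap: each targeted weight matrix W (shape d_out x d_in) stays frozen, and only two small matrices A (r x d_in) and B (d_out x r) are trained, i.e. r * (d_in + d_out) parameters per matrix. The dimensions below are illustrative, not Gemma-2's actual projection sizes:

```python
# LoRA trainable-parameter count per adapted matrix.
# The 2048x2048 dimensions are illustrative only.
def lora_trainable_params(d_in, d_out, r=16):
    return r * (d_in + d_out)

dense = 2048 * 2048                          # 4,194,304 frozen weights
adapter = lora_trainable_params(2048, 2048)  # 65,536 trainable weights
```

With the table's alpha of 32 and rank of 16, the adapter output is scaled by alpha / r = 2 before being added to the frozen projection.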
🚀 Usage & Implementation (Google Colab / Python)
Prerequisites: Because this model is based on Gemma-2, you need a Hugging Face access token and must accept Google's Gemma license terms on the Hugging Face Hub.
1. Install Optimized Libraries:
```bash
pip install --no-deps unsloth unsloth_zoo "xformers<0.0.29" "trl<0.9.0" peft accelerate bitsandbytes
```
2. Load the Model and Generate:
```python
import torch
import getpass

from unsloth import FastLanguageModel

# Secure token input (avoids hard-coding the token in the notebook)
hf_token = getpass.getpass("Enter your Hugging Face Token: ")

# Load the model in 4-bit mode (critical for free-tier GPUs)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "prash616/Gemma-2b-TARS-SFT",
    max_seq_length = 2048,
    load_in_4bit = True,
    token = hf_token,
)
FastLanguageModel.for_inference(model)  # Enables 2x faster generation

# Format the prompt
messages = [
    {"role": "system", "content": "You are TARS, an AI assistant specialized in creative technology and literature. You were created by Prashant."},
    {"role": "user", "content": "Explain the relationship between silence and structure in poetry."},
]

# apply_chat_template formats the input into Gemma's <start_of_turn> structure
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,  # Returns both input_ids and attention_mask
).to("cuda")

# Generate the response
outputs = model.generate(
    **inputs,  # Unpacks the dict into input_ids and attention_mask
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
)

print("\n--- TARS RESPONDS ---\n")
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```
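The decoded transcript above includes the prompt, because `generate` echoes the input tokens before the new ones. To display only the newly generated text, slice off the prompt length before decoding. A sketch with plain lists standing in for the real tensors (with tensors, the slice would be `outputs[0][inputs["input_ids"].shape[-1]:]`):

```python
# Stand-in values; in practice these are token IDs from the tokenizer
# and from model.generate respectively.
prompt_ids = [2, 106, 1645, 108]
output_ids = prompt_ids + [87, 21, 4, 1]  # generate() echoes the prompt
new_ids = output_ids[len(prompt_ids):]    # only the generated tokens
```

Decoding `new_ids` with `tokenizer.decode(..., skip_special_tokens=True)` then yields just the model's reply.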