|
|
---
license: mit
datasets:
- teknium/OpenHermes-2.5
language:
- en
base_model:
- HuggingFaceTB/SmolLM3-3B-Base
---
|
|
|
|
|
# Model Card: SmolLM3-Chat-v1 |
|
|
|
|
|
**SmolLM3-Chat-v1** is a finetune of the [SmolLM3-3B-Base](https://huggingface.co/HuggingFaceTB/SmolLM3-3B-Base) model, designed to be casual, witty, and human-like. Unlike standard assistants that sound robotic and overly formal, this model captures a distinct "internet-native" vibe. |
|
|
|
|
|
It was trained on a curated mix of high-quality instruction data and custom conversation logs to balance intelligence with personality. |
|
|
|
|
|
> **Note:** This is the **full merged version**. If you are looking for the LoRA adapter, please check `SmolLM3-Chat-v1-adapter`. |
|
|
|
|
|
## ⚠️ Important: System Instructions |
|
|
**Less is more.** |
|
|
|
|
|
This model relies on a specific "vibe" learned during training. Over-prompting it with complex system instructions (e.g., *"You are a helpful assistant who is polite, follows rules X, Y, Z..."*) will actually degrade the quality of the output. |
|
|
|
|
|
For the system instruction, simply leave it empty for the rawest, most casual experience.
|
|
|
|
|
**Ironically, less instruction = more human.** |
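As a concrete illustration, here is the recommended minimal message list next to the kind of over-specified setup that tends to flatten the personality (the prompt strings are just examples):

```python
# Recommended: no system message at all.
# The model falls back on the "vibe" it learned during training.
messages = [
    {"role": "user", "content": "hellooooo"},
]

# Not recommended: a long, rule-laden system prompt degrades output quality.
over_prompted = [
    {"role": "system", "content": "You are a helpful assistant who is polite, follows rules X, Y, Z..."},
    {"role": "user", "content": "hellooooo"},
]
```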
|
|
|
|
|
|
|
|
## ⚠️ Quantization Warning |
|
|
|
|
|
**Avoid re-quantizing this merged model.** |
|
|
|
|
|
This model was trained using **QLoRA** (on a 4-bit base model) and then merged back to **Float16**. Compressing this merged model *again* (e.g., converting it to 4-bit GGUF, AWQ, or GPTQ) causes "double quantization" noise. |
|
|
|
|
|
This often breaks the specific "vibe" of the model, leading to: |
|
|
* Broken grammar or incoherent responses. |
|
|
* Loss of the casual/witty personality. |
|
|
* Looping issues. |
|
|
|
|
|
**If you need a low-VRAM (4-bit) version:** |
|
|
* ❌ **Do not** quantize this merged model. |
|
|
* ✅ **Use the Adapter instead:** [SmolLM3-Chat-v1-adapter](https://huggingface.co/igidn/SmolLM3-Chat-v1-adapter). |
|
|
* Load the base `SmolLM3-3B-Base` in 4-bit and attach the adapter. This preserves the original training quality.
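A minimal sketch of the adapter route, assuming `peft` and `bitsandbytes` are installed (the NF4 / fp16-compute settings mirror a typical QLoRA setup, not a published config):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

BASE_ID = "HuggingFaceTB/SmolLM3-3B-Base"
ADAPTER_ID = "igidn/SmolLM3-Chat-v1-adapter"

# Quantize only the *base* weights to 4-bit, matching the QLoRA training setup
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(
    BASE_ID,
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach the LoRA adapter on top of the quantized base.
# The adapter weights stay in full precision, so there is no double quantization.
model = PeftModel.from_pretrained(base, ADAPTER_ID)
```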
|
|
|
|
|
## 💻 Usage |
|
|
|
|
|
To get the best performance (and prevent repetition loops), you **must** use the specific generation configuration below. |
|
|
|
|
|
```python
import torch
from threading import Thread
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

MODEL_ID = "igidn/SmolLM3-Chat-v1"

# 1. Load Model & Tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto"
)

# 2. Define Conversation
messages = [
    {"role": "user", "content": "hellooooo"}
]

# 3. Apply Chat Template
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

# 4. Streamer Setup
streamer = TextIteratorStreamer(
    tokenizer,
    timeout=10.0,
    skip_prompt=True,
    skip_special_tokens=True
)

# 5. Generation Configuration (CRITICAL)
generate_kwargs = dict(
    **inputs,
    streamer=streamer,
    max_new_tokens=512,
    do_sample=True,

    # Core Parameters for "Vibe"
    temperature=0.8,          # High creativity
    top_p=0.85,               # Nuanced sampling

    # Stability Parameters
    repetition_penalty=1.15,  # Prevents "I'll be gone in 5 mins" loops
    no_repeat_ngram_size=3,   # Hard block on repetitive phrases

    pad_token_id=tokenizer.eos_token_id
)

# 6. Run Inference
thread = Thread(target=model.generate, kwargs=generate_kwargs)
thread.start()

print("Assistant: ", end="")
for new_text in streamer:
    print(new_text, end="", flush=True)
```
|
|
|
|
|
## 📊 Training Details |
|
|
|
|
|
The model was trained for 2 epochs using `SFTTrainer` with a cosine learning rate scheduler. |
|
|
|
|
|
### Dataset Composition |
|
|
* **OpenHermes-2.5 (5k subset):** Provides logic, reasoning, and general helpfulness. |
|
|
* **Custom Dataset (15k):** Focused on casual chat, roleplay, and human-like interaction patterns. |
|
|
* **Total:** 20,000 examples. |
|
|
|
|
|
### Training metrics |
|
|
The model showed steady convergence without catastrophic overfitting. The final loss indicates a strong grasp of the training data without losing generalization capabilities. |
|
|
|
|
|
| Metric | Start | End |
| :--- | :--- | :--- |
| **Loss** | 2.47 | 1.41 |
| **Token Accuracy** | 53.3% | 65.9% |
| **Epochs** | 0 | 2.0 |
|
|
|
|
|
**Loss Curve:** |
|
|
* **Epoch 0.2:** Loss 1.59 (Rapid initial learning) |
|
|
* **Epoch 1.0:** Loss 1.65 (Transition point) |
|
|
* **Epoch 2.0:** Loss 1.41 (Final convergence) |
|
|
|
|
|
## 🛠️ Hyperparameters |
|
|
* **Base Model:** HuggingFaceTB/SmolLM3-3B-Base |
|
|
* **Precision:** QLoRA (4-bit base) during training / Float16 for the merged model
|
|
* **LoRA Config:** r=32, alpha=64 |
|
|
* **Learning Rate:** 3e-5 (Cosine Schedule) |
|
|
* **Optimizer:** paged_adamw_32bit |
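For reference, the hyperparameters above roughly correspond to a PEFT/Transformers setup like the sketch below. The exact trainer arguments were not published, so names such as `output_dir` are illustrative:

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA configuration matching the card: r=32, alpha=64
peft_config = LoraConfig(
    r=32,
    lora_alpha=64,
    task_type="CAUSAL_LM",
)

# Trainer arguments matching the card: 2 epochs, cosine schedule,
# learning rate 3e-5, paged 32-bit AdamW, fp16 training
training_args = TrainingArguments(
    output_dir="smollm3-chat-v1",  # illustrative
    num_train_epochs=2,
    learning_rate=3e-5,
    lr_scheduler_type="cosine",
    optim="paged_adamw_32bit",
    fp16=True,
)
```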
|
|
|
|
|
--- |
|
|
*Created with <3 by me* |