Synapse SLM: Personalized Offline AI Assistant

Built by Abhinav Tyagi
📄 GitHub • 🔗 LinkedIn • 🧠 Live Demo


What is Synapse SLM?

Synapse SLM is a QLoRA fine-tuned Llama-3.2-3B model optimized for:

  • Hinglish (Hindi-English code-switching) conversations
  • Offline, CPU-only inference via 4-bit GGUF quantization
  • Context-aware responses via an offline RAG pipeline
  • Persona-consistent, instruction-tuned behavior

This is not a wrapper around an API: it runs fully locally, within ~4GB of RAM, at ~45 tokens/sec on CPU.


Model Details

Base Model         : Llama-3.2-3B-Instruct
Fine-tuning Method : QLoRA (rank=16, alpha=32)
Quantization       : 4-bit GGUF via llama.cpp
Inference Speed    : ~45 tokens/sec (CPU-only)
RAM Footprint      : ~4GB
Training Data      : 3,500+ Hinglish instruction samples
Languages          : English, Hindi, Hinglish
Deployment         : Docker containerized

Key Innovations

1. QLoRA Fine-Tuning for Behavioral Shaping

Fine-tuned with rank=16, alpha=32 on 3,500+ Hinglish instruction samples. The training objective wasn't just language; it was behavioral engineering: teaching the model when to explain, when to commit, and how to handle Hindi-English code-switching naturally.

Fine-tuning doesn't just improve answers; it rewires behavior. The training signal determines what feels "safe" to the model: explain vs. hedge, commit vs. qualify, answer vs. avoid.
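
For reference, here is a minimal sketch of a QLoRA setup with the rank/alpha reported above, using Hugging Face transformers, peft, and bitsandbytes. The dataset handling is omitted, and target_modules is an assumption (common attention projections for Llama), not the project's exact training configuration:

# Minimal QLoRA setup sketch (illustrative, not the exact training script)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = "meta-llama/Llama-3.2-3B-Instruct"

# Load the base model in 4-bit NF4 so fine-tuning fits in consumer-GPU memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    base, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base)

# LoRA adapters with the rank/alpha reported above (r=16, alpha=32);
# target_modules is an assumption: the usual Llama attention projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trained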

2. 4-bit GGUF Quantization + CPU Inference

Converted to GGUF format using llama.cpp. Achieves ~45 tokens/sec on CPU-only hardware within a ~4GB RAM footprint, making it deployable on a typical laptop without a GPU.
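
For context, a sketch of a typical llama.cpp conversion flow. The input directory name is illustrative, and Q4_K_M is an assumption about which 4-bit variant was used; tool names follow current llama.cpp and have changed across versions:

# Convert the merged fine-tuned model to f16 GGUF, then quantize to 4-bit
# (older llama.cpp releases used convert.py / quantize instead)
python convert_hf_to_gguf.py ./synapse-slm-merged --outfile synapse-slm-f16.gguf
./llama-quantize synapse-slm-f16.gguf synapse-slm-q4.gguf Q4_K_M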

3. Offline RAG Pipeline

Implements embedding-based retrieval for local PDF/TXT ingestion. Supports context-aware responses without any cloud API dependency; the pipeline runs fully air-gapped.

4. Hinglish Code-Switching

Trained specifically on Hindi-English mixed language patterns. Handles natural Hinglish input without requiring language detection or preprocessing.


Behavioral Study: How Fine-Tuning Changes Model Behavior

During development, Abhinav Tyagi trained two variants from the same base model to study behavioral drift:

  • Model A (Synapse) - optimized for clarity, explanation, and usefulness
  • Model B (Reflection-Heavy) - trained to emphasize uncertainty, limits, and caution

Key finding: Same architecture. Same tokenizer. Same base weights. Only the training signal differed, yet the behavioral output was completely different.

Model B wasn't hallucinating. It was over-aligned. And still useless.

Alignment without usability collapses into abstraction. Reasoning without explanation helps no one.

This study shaped Synapse's training philosophy: reasoning must serve explanation, not replace it.


Usage

With llama.cpp (Recommended for CPU)

# Install the llama.cpp Python bindings
pip install llama-cpp-python

# Run inference
from llama_cpp import Llama

llm = Llama(
    model_path="synapse-slm-q4.gguf",
    n_ctx=2048,    # context window (matches the 2048-token window listed under Performance)
    n_threads=8    # tune to your CPU's physical core count
)

response = llm(
    "Bhai, explain karo gradient descent kya hota hai",
    max_tokens=512,
    temperature=0.7
)
print(response['choices'][0]['text'])
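
Since the base model is instruction-tuned, chat-formatted prompts generally behave better than raw completions. llama-cpp-python can apply the chat template embedded in the GGUF via its chat API; a minimal sketch using the llm instance created above:

# Chat-style inference; the chat template stored in the GGUF is applied automatically
response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Bhai, explain karo gradient descent kya hota hai"}
    ],
    max_tokens=512,
    temperature=0.7
)
print(response['choices'][0]['message']['content'])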

With Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("Abhinav-Tyagi/synapse-slm")
model = AutoModelForCausalLM.from_pretrained(
    "Abhinav-Tyagi/synapse-slm",
    torch_dtype=torch.float16,
    device_map="auto"  # requires the accelerate package
)

inputs = tokenizer("Explain neural networks simply", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Docker (Full Offline Setup)

git clone https://github.com/abhinavtyagi466/synapse-slm
cd synapse-slm
docker compose up
# Access at http://localhost:7860

Training Details

Base Model    : meta-llama/Llama-3.2-3B-Instruct
Method        : QLoRA
LoRA Rank     : 16
LoRA Alpha    : 32
Dataset Size  : 3,500+ instruction pairs
Languages     : English + Hindi + Hinglish
Quantization  : 4-bit GGUF (llama.cpp)
Inference     : ~45 tokens/sec on CPU
RAM           : ~4GB footprint
Deployment    : Docker containerized
RAG           : Offline PDF/TXT ingestion via dense embeddings

Offline RAG Pipeline

Synapse includes a fully offline RAG system:

  1. Ingestion - Drop any PDF or TXT file into the /docs folder
  2. Embedding - Documents are chunked and embedded locally (no API calls)
  3. Retrieval - At query time, the top-k relevant chunks are retrieved via FAISS
  4. Generation - Retrieved context is injected into the prompt before generation (a sketch follows below)

No internet required. No API keys. Fully private.
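
As a reference, here is a minimal sketch of this kind of pipeline using sentence-transformers and FAISS. The embedding model, chunk size, file name, and prompt format are illustrative assumptions, not necessarily what Synapse ships with:

# Offline RAG sketch: chunk -> embed locally -> retrieve via FAISS -> inject into prompt
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small embedding model, runs locally

def chunk(text, size=500):
    # Naive fixed-size chunking; real pipelines often split on sentences or paragraphs
    return [text[i:i + size] for i in range(0, len(text), size)]

# 1-2. Ingestion + embedding (no API calls)
docs = chunk(open("docs/manual.txt", encoding="utf-8").read())
vectors = embedder.encode(docs, normalize_embeddings=True)
index = faiss.IndexFlatIP(vectors.shape[1])  # inner product = cosine on normalized vectors
index.add(np.asarray(vectors, dtype=np.float32))

# 3. Retrieval of top-k chunks at query time
query = "What does the warranty cover?"
q_vec = np.asarray(embedder.encode([query], normalize_embeddings=True), dtype=np.float32)
_, ids = index.search(q_vec, 3)
context = "\n".join(docs[i] for i in ids[0])

# 4. Generation with the retrieved context injected into the prompt
# (llm is the Llama instance from the Usage section above)
prompt = f"Use the following context to answer.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"
answer = llm(prompt, max_tokens=256)
print(answer['choices'][0]['text'])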


Performance

Inference Speed (CPU) : ~45 tokens/sec
RAM Usage             : ~4GB
Quantization          : 4-bit GGUF
Hinglish Fluency      : Improved via targeted instruction tuning
Context Window        : 2048 tokens

About the Author

Abhinav Tyagi is an LLM Engineer specializing in fine-tuning, quantization, and deployment of production-ready AI systems. He also built:

  • Synapse-124M - A 124M-parameter transformer built from scratch with GQA, MoE, Sliding Window Attention, NTK-RoPE, SwiGLU, and a custom BPE tokenizer
  • Synapse Wingman - A full agentic AI desktop assistant controlled via Telegram, with vision, WhatsApp automation, and multi-step task execution
  • Smart Contextual RAG Chatbot - Hybrid RAG with CoVe (Chain of Verification), multi-query generation, and FAISS, reducing cloud API costs by ~40%
  • Psywarp - Published research on a multimodal cognitive AI framework for emotion and behavior modeling (DOI: 10.5281/zenodo.18182199)

📧 abhinavtyagi5418@gmail.com
🐙 GitHub
💼 LinkedIn


License

MIT License: free to use, modify, and distribute with attribution.


"Building AI that works offline, works fast, and actually works."
– Abhinav Tyagi
