---
license: mit
language:
- en
tags:
- vlsi
- verilog
- systemverilog
- code-generation
- hardware-design
- eda
- rtl
- fine-tuned
- codellama
- lora
- edge-ai
- jetson-orin
base_model: codellama/CodeLlama-7b-Instruct-hf
pipeline_tag: text-generation
library_name: transformers
model_type: llama
---
# VLSI-SLM V1: CodeLlama Full Model
> **The first open-source, edge-trained, laptop-deployable Small Language Model specialized for VLSI design.**
A 7B-parameter CodeLlama model fine-tuned on 30,354 curated VLSI examples, trained entirely on an NVIDIA Jetson Orin edge device with no cloud compute. It generates Verilog, explains VLSI concepts, and runs offline on a laptop after quantization to a ~4GB GGUF file.
---
## Model Details
| Property | Value |
|----------|-------|
| **Base Model** | CodeLlama-7B-Instruct |
| **Fine-tuning Method** | LoRA (r=32, α=64) |
| **Trainable Parameters** | 82,265,088 (1.21% of 6.82B) |
| **Training Hardware** | NVIDIA Jetson Orin 64GB (edge device) |
| **Training Time** | ~84 hours wall time |
| **Dataset Size** | 30,354 examples (train) / 1,681 (val) |
| **Training Epochs** | 3 |
| **Final Train Loss** | 0.0122 |
| **Best Val Loss** | 0.3892 (step 4000) |
| **Precision** | bfloat16 (no quantization during training) |
| **License** | MIT |
### LoRA Configuration
```python
LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
        "gate_proj", "up_proj", "down_proj",     # MLP/FFN
        "embed_tokens", "lm_head",               # Embeddings
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
```
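The trainable-parameter count in the table can be reproduced from this configuration. The sketch below assumes the standard CodeLlama-7B dimensions (hidden size 4096, MLP size 11008, 32 layers, vocabulary 32016), which are not stated in this card; the helper name `lora_params` is illustrative only.

```python
# Assumed CodeLlama-7B dimensions (not stated in this card).
HIDDEN = 4096
INTERMEDIATE = 11008
LAYERS = 32
VOCAB = 32016
RANK = 32  # r from the LoraConfig above


def lora_params(in_features: int, out_features: int, rank: int = RANK) -> int:
    """LoRA adds A (rank x in) and B (out x rank) to each wrapped weight."""
    return rank * (in_features + out_features)


attention = 4 * lora_params(HIDDEN, HIDDEN)  # q_proj, k_proj, v_proj, o_proj
mlp = (
    2 * lora_params(HIDDEN, INTERMEDIATE)    # gate_proj, up_proj
    + lora_params(INTERMEDIATE, HIDDEN)      # down_proj
)
embeddings = (
    lora_params(VOCAB, HIDDEN)               # embed_tokens
    + lora_params(HIDDEN, VOCAB)             # lm_head
)

total = LAYERS * (attention + mlp) + embeddings
print(f"{total:,}")  # 82,265,088
```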
---
## Repository Contents
```
VLSI-SLM-V1-CodeLlama-Full/
├── final_model/          ← Merged full model (~14GB, bf16 safetensors)
├── final_adapter/        ← LoRA adapter only (~200MB)
├── checkpoint-5000/      ← Training checkpoint
├── checkpoint-5250/      ← Training checkpoint
├── checkpoint-5500/      ← Training checkpoint
├── checkpoint-5691/      ← Final training checkpoint
├── evaluation/           ← Benchmark results and logs
├── logs/                 ← Full training logs
├── baseline_pre_ft.json  ← Base model responses (pre fine-tuning)
├── best_checkpoint.txt   ← Best validation checkpoint info
├── heartbeat.json        ← Last training state
└── m4_config_v31.json    ← Exact training hyperparameters
```
---
## Evaluation Results
Evaluated using a semantic scoring system (not rigid keyword matching) with `max_new_tokens=1024`.
### Standard 50-Question VLSI Benchmark
| Metric | Score | Target | Status |
|--------|-------|--------|--------|
| Code Syntax Pass (iverilog) | **60.0%** | 40–60% | ✅ PASS |
| Concept Accuracy | **65.0%** | 85–90% | 🟡 CLOSE |
| Hallucination Rate | **0.0%** | <5% | ✅ PERFECT |
| Code Block Formatting | **95.0%** | n/a | ✅ |
| Debug Accuracy | **60.0%** | n/a | 🟡 |
| Overall | **72.0%** | n/a | ✅ |
### Coding Stress Test (50 Progressive Questions)
| Difficulty | Questions | Pass Rate | Examples |
|-----------|-----------|-----------|---------|
| Easy | 10 | **100%** | AND gate, DFF, counter, decoder |
| Medium | 15 | **87%** | FIFO, ALU, FSM, synchronizer |
| Hard | 13 | **62%** | Async FIFO, AXI-Lite, SPI master |
| Expert | 12 | **42%** | FP adder, MBIST, JTAG TAP controller |
**The model handles standard VLSI building blocks cleanly (100% pass on Easy, 87% on Medium). Expert-level modules (1000+ tokens) show truncation artifacts, a known training data issue being addressed in V2.**
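
For a single aggregate figure, the per-tier rates can be converted back to question counts. This is a rough sketch that assumes each published percentage is rounded from a whole number of passed questions; the variable names are illustrative.

```python
# (questions, published pass rate) per difficulty tier, from the table above.
tiers = {
    "Easy": (10, 1.00),
    "Medium": (15, 0.87),
    "Hard": (13, 0.62),
    "Expert": (12, 0.42),
}

# Assumption: each rate is a rounded fraction of whole questions.
passed = {name: round(n * rate) for name, (n, rate) in tiers.items()}
total_questions = sum(n for n, _ in tiers.values())
total_passed = sum(passed.values())

print(passed)
print(f"{total_passed} of {total_questions} ({total_passed / total_questions:.0%})")
```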
---
## Quick Start
### Load and Run Inference
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Rajasrl/VLSI-SLM-V1-CodeLlama-Full"

# The merged weights live in the final_model/ folder of the repo, so pass it
# via `subfolder` instead of appending it to the repo id.
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="final_model")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    subfolder="final_model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()


def ask_vlsi(question: str, code_mode: bool = False) -> str:
    if code_mode:
        system = """You are a Senior VLSI RTL Engineer.
Rules:
1. Always wrap code in ```verilog blocks
2. Use non-blocking assignments (<=) in sequential always blocks
3. Use blocking assignments (=) in combinational always blocks
4. Always include complete module with endmodule
5. Never use reserved keywords as signal names"""
    else:
        system = "You are an expert VLSI engineer. Give accurate, technical answers."

    prompt = f"### System:\n{system}\n\n### Instruction:\n{question}\n\n### Response:\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Greedy decoding for code (deterministic), light sampling for concepts.
    # temperature is only passed when sampling; greedy decoding ignores it.
    if code_mode:
        sampling = {"do_sample": False}
    else:
        sampling = {"do_sample": True, "temperature": 0.1}

    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=1024,  # Important: use 1024+ for complete modules
            repetition_penalty=1.1,
            pad_token_id=tokenizer.eos_token_id,
            **sampling,
        )

    # Decode only the newly generated tokens, not the prompt.
    response = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True,
    )
    return response.strip()


# Code generation (deterministic)
print(ask_vlsi(
    "Write a parameterizable 8-bit synchronous counter with reset.",
    code_mode=True,
))

# Concept explanation
print(ask_vlsi(
    "Explain clock domain crossing and how to handle it safely.",
    code_mode=False,
))
```
### Run with Ollama (Recommended for Laptop Deployment)
First quantize to GGUF:
```bash
# Install llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make -j4
# Convert and quantize
python convert_hf_to_gguf.py ./final_model --outtype f16 \
--outfile vlsi_slm_v1_f16.gguf
./llama-quantize vlsi_slm_v1_f16.gguf vlsi_slm_v1_Q4_K_M.gguf Q4_K_M
# Output: ~4GB file, runs on any laptop
```
Create `Modelfile`:
```
FROM ./vlsi_slm_v1_Q4_K_M.gguf
SYSTEM """You are an expert VLSI and Verilog engineer.
For code: output only syntactically correct, synthesizable Verilog.
Use non-blocking assignments (<=) in sequential always blocks.
Always wrap code in ```verilog blocks.
Always include endmodule.
For concepts: give accurate, technical explanations."""
PARAMETER temperature 0.1
PARAMETER num_ctx 2048
```
```bash
ollama create vlsi-slm-v1 -f Modelfile
ollama run vlsi-slm-v1
```
---
## What This Model Can Do ✅
### Strong Capabilities (Easy–Medium complexity)
**Verilog Code Generation:**
- Flip-flops (D, T, JK) with synchronous/asynchronous reset
- Counters (binary, Gray code, Johnson, LFSR)
- Multiplexers, encoders, decoders
- Shift registers (parameterizable width/depth)
- State machines (Moore and Mealy FSM)
- Synchronous SRAM and FIFO
- Clock dividers and pulse generators
- Debounce circuits
- Two-flop CDC synchronizers
- Basic AXI-Lite and handshake protocols
- Simple UART, SPI, I2C controllers
- Testbench templates
**VLSI Concept Explanations:**
- Clock Domain Crossing (CDC) and metastability
- Setup time and hold time analysis
- Power reduction: clock gating and power gating
- Static Timing Analysis (STA) concepts
- Scan chains and Design for Testability (DFT)
- SRAM vs DRAM differences
- Electromigration and IR drop
- AXI, APB, AHB protocol rules
- Blocking vs non-blocking assignments
- Latch inference and how to avoid it
### Partial Capabilities (Hard complexity)
- Asynchronous FIFO with Gray code pointers (architecture correct, may miss endmodule)
- Round-robin arbiters
- Pipeline structures
- SPI master/slave controllers
- Branch predictors
- Memory BIST controllers
---
## Known Limitations ⚠️
### 1. Truncation Artifact (Primary Known Issue)
Complex modules exceeding ~800 tokens of output may be cut off before `endmodule`. This is a **training data artifact**: the dataset was generated using free APIs with 1800-token output limits, and truncated examples leaked through. The model learned this truncation pattern as a behavior.
**Workaround:** Always set `max_new_tokens=1024` or higher. If output is still truncated, append `\nendmodule` manually; the logic inside is typically correct.
**Fix in progress:** V2 training uses strict `endmodule` validation gates in the data pipeline.
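
The manual workaround can be scripted. The following is a minimal sketch and not part of the released tooling; `ensure_endmodule` is a hypothetical helper name. It counts `module` headers against `endmodule` keywords and appends whatever is missing.

```python
import re


def ensure_endmodule(verilog: str) -> str:
    """Append one `endmodule` for every `module` header left unclosed."""
    opened = len(re.findall(r"^\s*module\b", verilog, flags=re.MULTILINE))
    closed = len(re.findall(r"\bendmodule\b", verilog))
    missing = max(opened - closed, 0)
    return verilog.rstrip() + "\nendmodule" * missing + "\n"


truncated = (
    "module counter(input clk, output reg [7:0] q);\n"
    "  always @(posedge clk) q <= q + 1;"
)
print(ensure_endmodule(truncated))
```

Output that was cut off mid-statement will still fail to compile after patching, so it is worth re-running `iverilog` on the result.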
### 2. Concept Accuracy Gap
Concept accuracy is 65% vs the 85-90% target. Root cause: PDF textbooks were extracted page-by-page (not paragraph-by-paragraph), causing "semantic blur" where opposing concepts (e.g., Setup vs Hold timing) were mixed in the same training example.
### 3. Submodule Hallucination
Occasionally instantiates undefined submodules (`fa fa0(...)` style) when asked for gate-level designs. Best avoided by explicitly requesting "behavioral RTL" in your prompt.
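
Generated code can be screened for this before it reaches a simulator. The sketch below is a line-based regex heuristic, not a Verilog parser, and every name in it (`undefined_submodules`, `NOT_MODULES`) is hypothetical. It flags instantiated module types that are never defined in the same text.

```python
import re

# Words that can precede "name (" without being a module instantiation.
NOT_MODULES = {
    "module", "macromodule", "task", "function", "if", "else", "for",
    "while", "case", "casex", "casez", "repeat", "generate", "begin", "end",
    "and", "or", "not", "nand", "nor", "xor", "xnor", "buf",  # gate primitives
}

DEFINITION = re.compile(r"^[ \t]*module[ \t]+([A-Za-z_]\w*)", re.MULTILINE)
# <type> [#(params)] <instance> (   at the start of a line
INSTANCE = re.compile(
    r"^[ \t]*([A-Za-z_]\w*)[ \t]+(?:#\s*\(.*?\)\s*)?([A-Za-z_]\w*)\s*\(",
    re.MULTILINE,
)


def undefined_submodules(verilog: str) -> set[str]:
    """Return instantiated module types with no definition in `verilog`."""
    defined = set(DEFINITION.findall(verilog))
    used = {
        mod
        for mod, inst in INSTANCE.findall(verilog)
        if mod not in NOT_MODULES and inst not in NOT_MODULES
    }
    return used - defined


rtl = """
module adder4 (input [3:0] a, b, output [4:0] sum);
  wire c0;
  fa fa0 (a[0], b[0], 1'b0, sum[0], c0);
  and g1 (x, a[1], b[1]);
endmodule
"""
print(undefined_submodules(rtl))  # {'fa'}
```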
### 4. Not Trained for SoC-Level Design
This model is optimized for **block-level RTL** (FIFOs, arbiters, FSMs, protocol controllers). It is not intended for full SoC or chip-level architecture. Expert-level questions (5-stage RISC pipeline, NoC routers, IEEE 754 FP units) are attempted but may be incomplete.
### 5. Hardware Constraints
Trained on a 64GB Jetson Orin. The merged model requires ~15GB RAM. Use the GGUF Q4_K_M quantized version (~4GB) for laptop deployment.
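
These sizes follow from the parameter count. The estimate below is a back-of-the-envelope sketch: the parameter figures come from the Model Details table, while the ~4.8 bits per weight for Q4_K_M is an assumed average, not a measured value for this model.

```python
total_params = 6.82e9       # base + LoRA, from the Model Details table
lora_params = 82_265_088    # trainable LoRA parameters, same table
merged_params = total_params - lora_params  # LoRA is folded in on merge

GB = 1e9
BF16_BYTES = 2
Q4_K_M_BITS = 4.8           # assumption: average bits per weight

bf16_gb = merged_params * BF16_BYTES / GB
q4_gb = merged_params * Q4_K_M_BITS / 8 / GB

# Weights only; runtime buffers and the KV cache add to this.
print(f"bf16 weights: ~{bf16_gb:.1f} GB, Q4_K_M weights: ~{q4_gb:.1f} GB")
```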
---
## Training Details
### Hardware
This model was trained entirely on an **NVIDIA Jetson Orin 64GB**, an edge computing device, with no cloud GPUs used.
```
Device : NVIDIA Jetson Orin (64GB unified RAM)
CUDA : 12.6 (ARM64)
OS : Ubuntu 22.04
PyTorch : 2.5.0a0 nv24.8
Transformers: 4.44.0
PEFT : 0.18.1
TRL : 0.8.6
```
**Important hardware note:** bitsandbytes is **not compatible** with CUDA 12.6 on Jetson Orin ARM64. Training used pure bfloat16 with `adamw_torch` optimizer. If you attempt to run this model on similar ARM64 Jetson hardware, do not use bitsandbytes or NEFTune.
### Training Configuration
```python
TrainingArguments(
num_train_epochs=3,
per_device_train_batch_size=1,
gradient_accumulation_steps=16, # Effective batch = 16
learning_rate=2e-5,
lr_scheduler_type="cosine",
warmup_ratio=0.03,
bf16=True,
fp16=False,
gradient_checkpointing=True,
optim="adamw_torch",
max_grad_norm=1.0,
save_steps=500,
eval_steps=500,
save_total_limit=4,
group_by_length=True,
)
```
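
The final checkpoint number (`checkpoint-5691`) is consistent with these settings. The arithmetic below is a sketch that assumes a single device and that a trailing partial accumulation window does not count as an optimizer step.

```python
train_examples = 30_354
per_device_batch = 1
grad_accum = 16
epochs = 3

micro_batches_per_epoch = train_examples // per_device_batch
# Assumption: an incomplete final accumulation window is not counted.
updates_per_epoch = micro_batches_per_epoch // grad_accum
total_updates = updates_per_epoch * epochs

print(updates_per_epoch, total_updates)  # 1897 5691

# The best validation loss (step 4000) falls early in the third epoch.
print(f"step 4000 = epoch {4000 / updates_per_epoch:.2f}")
```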
### Thermal Management Innovation
A custom thermal batching system was implemented:
- Every 250 training steps: save checkpoint → 5-minute cooldown → resume
- Table fan added for additional airflow
- Result: GPU temperature maintained at 44–61°C throughout the 84-hour run
- 6 power outages during training, all recovered via atomic heartbeat checkpointing
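
The actual training script is not reproduced in this card. As a minimal sketch of the idea, an atomic heartbeat write can be done by writing to a temporary file and renaming it over the old one, so a power cut never leaves a half-written `heartbeat.json`. The function names and the `global_step` field are hypothetical.

```python
import json
import os
import tempfile
import time


def write_heartbeat(path: str, state: dict) -> None:
    """Write `state` as JSON so readers see the old or the new file, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # force the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def maybe_cool_down(step: int, every: int = 250, seconds: int = 300) -> bool:
    """Pause after every `every` steps; returns True if a pause happened."""
    if step > 0 and step % every == 0:
        time.sleep(seconds)
        return True
    return False
```

A training loop would call `write_heartbeat("heartbeat.json", {"global_step": step})` right after saving a checkpoint and then `maybe_cool_down(step)`.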
### Dataset
```
Source : Curated VLSI examples (code + concept + QA)
Format : Alpaca instruction tuning
Train : 30,354 examples
Validation : 1,681 examples
Test : 1,681 examples
Categories : 75.8% code_generation, 23.0% concept, 1.2% QA
Max seq length : 2048 tokens
Decontamination : ✅ Zero benchmark leaks verified
```
---
## Comparison: Base vs Fine-tuned
| Metric | Base CodeLlama-7B | VLSI-SLM V1 |
|--------|------------------|-------------|
| Verilog syntax knowledge | General | VLSI-specialized |
| VLSI concept depth | Surface-level | Detailed and accurate |
| Hallucination rate | ~10% | **0.0%** |
| Code syntax pass (iverilog) | ~0% | **60%** |
| Runs offline | ✅ | ✅ |
| Deployable on laptop | ✅ (4GB Q4) | ✅ (4GB Q4) |
| Cost | Free | Free |
---
## Roadmap: What V2 Will Fix
**VLSI-SLM V2** is currently in development with the following improvements:
| Issue | V1 Status | V2 Fix |
|-------|-----------|--------|
| Truncated endmodule | Present in complex modules | Strict validation gate in data pipeline |
| Concept accuracy 65% | Below target | Layout-aware PDF chunking (paragraph-level) |
| Submodule hallucination | Occasional | Anti-submodule prompt in data generation |
| Dataset quality | Quantity-focused (30K) | Quality-focused (12K clean) |
| JSON data corruption | Silent patching | Strict drop-on-failure |
| EOS alignment | Not enforced | EOS token after endmodule |
| Concept/code ratio | 23%/75% | 50%/50% balanced |
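
The V2 pipeline itself is not published here. The sketch below only illustrates how a strict drop-on-failure gate with an `endmodule` check could look for Alpaca-format JSON lines; `gate` and `CLOSERS` are hypothetical names.

```python
import json
import re
from typing import Optional

CLOSERS = ("endmodule", "endinterface", "endpackage")
MODULE_HEADER = re.compile(r"^\s*module\b", re.MULTILINE)


def gate(raw_line: str) -> Optional[dict]:
    """Return the parsed example, or None if it must be dropped."""
    try:
        example = json.loads(raw_line)
    except json.JSONDecodeError:
        return None  # drop-on-failure: malformed JSON is never patched
    if not isinstance(example, dict):
        return None
    instruction, output = example.get("instruction"), example.get("output")
    if not isinstance(instruction, str) or not isinstance(output, str):
        return None
    code = output.strip()
    if code.endswith("```"):  # tolerate a closing Markdown fence
        code = code[:-3].rstrip()
    # Code examples must end with a closing keyword; concept text passes through.
    if MODULE_HEADER.search(output) and not code.endswith(CLOSERS):
        return None
    return example


good = json.dumps({
    "instruction": "Write a 2-input AND gate.",
    "input": "",
    "output": "module and2(input a, b, output y);\n  assign y = a & b;\nendmodule",
})
print(gate(good) is not None, gate("{not json") is None)  # True True
```
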
**Target V2 metrics:**
- Code Syntax Pass: 65–75%
- Concept Accuracy: 85–90%
- Hallucination Rate: <2%
---
## How to Contribute / Develop Further
### 1. Improve the Dataset
The biggest gains come from data quality, not model size.
```python
# The most impactful contribution: add validated Verilog examples
# Requirements:
# - Must compile with iverilog
# - Must end with endmodule/endinterface/endpackage
# - Must be self-contained (no undefined submodules)
# - Alpaca format: {"instruction": ..., "input": "", "output": ...}

# Validate before contributing:
import subprocess
from pathlib import Path

source = Path("your_file.v")
# -tnull runs the parser and elaborator without generating output.
result = subprocess.run(["iverilog", "-tnull", str(source)],
                        capture_output=True, text=True)
assert result.returncode == 0, f"Syntax error: {result.stderr}"
assert "endmodule" in source.read_text()
```
### 2. Fine-tune Further on Your Domain
Use LoRA to specialize for your specific VLSI area:
```python
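import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load V1 as base for V2 fine-tuning
# (the merged weights are in the final_model/ subfolder of the repo)
model = AutoModelForCausalLM.from_pretrained(
    "Rajasrl/VLSI-SLM-V1-CodeLlama-Full",
    subfolder="final_model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Add new LoRA adapters for your domain
# (FPGA-specific, ASIC timing, formal verification, etc.)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    # ... remaining fields (target_modules, lora_dropout, etc.) as needed
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)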
```
### 3. Extend to SystemVerilog / UVM
The model has basic SV knowledge but was primarily trained on Verilog-2001.
Adding UVM testbench examples and SystemVerilog assertions (SVA) would
significantly improve verification use cases.
### 4. Add Image Recognition
A compelling future direction: multi-modal VLSI assistant that can:
- Read handwritten schematic photos → generate Verilog
- Analyze timing diagrams → identify violations
- Recognize circuit board components → explain connections
### 5. Build a Retrieval-Augmented Generation (RAG) Layer
Connect the model to a vector database of VLSI standards (IEEE 1800,
AMBA AXI spec, IEEE 1149.1 JTAG) for factually grounded answers.
### 6. Evaluation Contributions
Add more benchmark questions to the `evaluation/` folder, especially:
- Formal verification questions (SVA, PSL)
- Physical design (placement, routing, DRC)
- Analog/mixed-signal interfaces
- RISC-V specific RTL patterns
---
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{vlsi-slm-v1-2026,
title = {VLSI-SLM V1: An Edge-Trained Small Language Model for VLSI Design},
author = {Rajasrl},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/Rajasrl/VLSI-SLM-V1-CodeLlama-Full}},
note = {Fine-tuned CodeLlama-7B on NVIDIA Jetson Orin edge hardware.
30,354 curated VLSI examples. Zero cloud compute.}
}
```
---
## The Story
This model was trained by a final-year engineering student on borrowed edge
hardware, with no cloud budget, no research lab, and no team. The training
ran through 6 power outages, lightning storms, and thermal shutdowns, all
recovered automatically.
The goal was simple: build a VLSI assistant that works offline, costs
nothing to run, and belongs to the community, not behind an API paywall.
**"I built an AI to teach me VLSI."**
---
## License
MIT License: free to use, modify, and distribute. See LICENSE for details.
---
*Model trained: March 29 – April 3, 2026*
*Uploaded to Hugging Face: May 2026*
*Hardware: NVIDIA Jetson Orin 64GB (edge device, no cloud)*