---
license: mit
language:
- en
tags:
- vlsi
- verilog
- systemverilog
- code-generation
- hardware-design
- eda
- rtl
- fine-tuned
- codellama
- lora
- edge-ai
- jetson-orin
base_model: codellama/CodeLlama-7b-Instruct-hf
pipeline_tag: text-generation
library_name: transformers
model_type: llama
---

# VLSI-SLM V1 — CodeLlama Full Model

> **The first open-source, edge-trained, laptop-deployable Small Language Model specialized for VLSI design.**

A 7B-parameter CodeLlama model fine-tuned on 30,354 curated VLSI examples and trained entirely on an NVIDIA Jetson Orin edge device with no cloud compute. It generates Verilog that passes an iverilog syntax check in 60% of benchmark cases, explains VLSI concepts (65% benchmark accuracy), and runs offline on a laptop as a ~4GB GGUF file after Q4_K_M quantization.

---

## Model Details

| Property | Value |
|----------|-------|
| **Base Model** | CodeLlama-7B-Instruct |
| **Fine-tuning Method** | LoRA (r=32, α=64) |
| **Trainable Parameters** | 82,265,088 (1.21% of 6.82B) |
| **Training Hardware** | NVIDIA Jetson Orin 64GB (edge device) |
| **Training Time** | ~84 hours wall time |
| **Dataset Size** | 30,354 examples (train) / 1,681 (val) |
| **Training Epochs** | 3 |
| **Final Train Loss** | 0.0122 |
| **Best Val Loss** | 0.3892 (step 4000) |
| **Precision** | bfloat16 (no quantization during training) |
| **License** | MIT |

### LoRA Configuration
```python
LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # Attention
        "gate_proj", "up_proj", "down_proj",       # MLP/FFN
        "embed_tokens", "lm_head",                 # Embeddings
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
```
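
The trainable-parameter count in the table above can be reproduced from this configuration. The sketch below assumes the base model's published dimensions (hidden size 4096, MLP size 11008, 32 layers, vocabulary 32016), which are not stated in this card, and that each LoRA-wrapped layer adds `r × (in_features + out_features)` parameters.

```python
# Reproduce the LoRA trainable-parameter count from the config above.
# ASSUMPTION: the dimensions below come from the CodeLlama-7B base config,
# not from this model card.
R = 32
HIDDEN = 4096
INTERMEDIATE = 11008
LAYERS = 32
VOCAB = 32016


def lora_params(in_features: int, out_features: int, r: int = R) -> int:
    """Parameters in the two low-rank factors A (r x in) and B (out x r)."""
    return r * (in_features + out_features)


# q_proj, k_proj, v_proj, o_proj are all HIDDEN x HIDDEN
attention = 4 * lora_params(HIDDEN, HIDDEN)
# gate_proj and up_proj map HIDDEN -> INTERMEDIATE, down_proj maps back
mlp = 2 * lora_params(HIDDEN, INTERMEDIATE) + lora_params(INTERMEDIATE, HIDDEN)
per_layer = attention + mlp
# embed_tokens and lm_head each connect VOCAB and HIDDEN
embeddings = lora_params(VOCAB, HIDDEN) + lora_params(HIDDEN, VOCAB)

total = LAYERS * per_layer + embeddings
print(f"{total:,}")  # 82,265,088
```

Under these assumptions the result matches the 82,265,088 trainable parameters reported in the Model Details table.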

---

## Repository Contents

```
VLSI-SLM-V1-CodeLlama-Full/
├── final_model/          ← Merged full model (~14GB, bf16 safetensors)
├── final_adapter/        ← LoRA adapter only (~200MB)
├── checkpoint-5000/      ← Training checkpoint
├── checkpoint-5250/      ← Training checkpoint
├── checkpoint-5500/      ← Training checkpoint
├── checkpoint-5691/      ← Final training checkpoint
├── evaluation/           ← Benchmark results and logs
├── logs/                 ← Full training logs
├── baseline_pre_ft.json  ← Base model responses (pre fine-tuning)
├── best_checkpoint.txt   ← Best validation checkpoint info
├── heartbeat.json        ← Last training state
└── m4_config_v31.json    ← Exact training hyperparameters
```

---

## Evaluation Results

Evaluated using a semantic scoring system (not rigid keyword matching) with `max_new_tokens=1024`.

### Standard 50-Question VLSI Benchmark

| Metric | Score | Target | Status |
|--------|-------|--------|--------|
| Code Syntax Pass (iverilog) | **60.0%** | 40–60% | ✅ PASS |
| Concept Accuracy | **65.0%** | 85–90% | 🟡 BELOW TARGET |
| Hallucination Rate | **0.0%** | <5% | ✅ PERFECT |
| Code Block Formatting | **95.0%** | — | ✅ |
| Debug Accuracy | **60.0%** | — | 🟡 |
| Overall | **72.0%** | — | ✅ |

### Coding Stress Test (50 Progressive Questions)

| Difficulty | Questions | Pass Rate | Examples |
|-----------|-----------|-----------|---------|
| Easy | 10 | **100%** | AND gate, DFF, counter, decoder |
| Medium | 15 | **87%** | FIFO, ALU, FSM, synchronizer |
| Hard | 13 | **62%** | Async FIFO, AXI-Lite, SPI master |
| Expert | 12 | **42%** | FP adder, MBIST, JTAG TAP controller |

**The model handles easy building blocks reliably (100%) and most medium-complexity blocks (87%). Expert-level modules (1000+ tokens) show truncation artifacts, a known training data issue being addressed in V2.**
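
As a rough cross-check, the per-tier pass rates in the stress-test table can be turned back into question counts. This is a sketch that assumes each percentage is a rounded fraction of that tier's question count; the per-question results themselves are in `evaluation/`, not reproduced here.

```python
# Convert the stress-test pass rates into implied pass counts.
# ASSUMPTION: each reported rate is (questions passed / questions asked),
# rounded to the nearest percent.
tiers = {
    "Easy": (10, 1.00),
    "Medium": (15, 0.87),
    "Hard": (13, 0.62),
    "Expert": (12, 0.42),
}

passed = {name: round(count * rate) for name, (count, rate) in tiers.items()}
total_questions = sum(count for count, _ in tiers.values())
total_passed = sum(passed.values())

for name, count in passed.items():
    print(f"{name:7s} {count}/{tiers[name][0]}")
print(f"Total   {total_passed}/{total_questions}")
```

Under that assumption the tiers correspond to 10/10, 13/15, 8/13, and 5/12 questions passed.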

---

## Quick Start

### Load and Run Inference

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Rajasrl/VLSI-SLM-V1-CodeLlama-Full"

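# The merged weights live in the repo's final_model/ subfolder. Pass it via
# `subfolder`: a Hub repo id cannot carry a trailing path component.
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="final_model")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    subfolder="final_model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)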
model.eval()

def ask_vlsi(question: str, code_mode: bool = False) -> str:
    if code_mode:
        system = """You are a Senior VLSI RTL Engineer.
Rules:
1. Always wrap code in ```verilog blocks
2. Use non-blocking assignments (<=) in sequential always blocks
3. Use blocking assignments (=) in combinational always blocks
4. Always include complete module with endmodule
5. Never use reserved keywords as signal names"""
    else:
        system = "You are an expert VLSI engineer. Give accurate, technical answers."

    prompt = f"### System:\n{system}\n\n### Instruction:\n{question}\n\n### Response:\n"
    
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
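    # Greedy decoding for code, light sampling for explanations.
    # temperature is only passed when sampling: it must be > 0 there and
    # has no effect under greedy decoding.
    gen_kwargs = dict(
        max_new_tokens=1024,       # Important: use 1024+ for complete modules
        do_sample=not code_mode,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
    )
    if not code_mode:
        gen_kwargs["temperature"] = 0.1

    with torch.no_grad():
        output = model.generate(**inputs, **gen_kwargs)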
    
    response = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True
    )
    return response.strip()

# Code generation (deterministic)
print(ask_vlsi(
    "Write a parameterizable 8-bit synchronous counter with reset.",
    code_mode=True
))

# Concept explanation
print(ask_vlsi(
    "Explain clock domain crossing and how to handle it safely.",
    code_mode=False
))
```
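
Because the code-mode system prompt asks the model to wrap code in fenced `verilog` blocks, the response usually needs a small post-processing step before it can be saved to a `.v` file. The helper below is a minimal sketch; `extract_verilog` is a hypothetical name and is not part of this repository.

```python
import re

# Build the fence marker programmatically so this example nests cleanly
# inside a Markdown code block.
FENCE = "`" * 3

_BLOCK_RE = re.compile(
    re.escape(FENCE) + r"(?:verilog|systemverilog)?[ \t]*\n(.*?)" + re.escape(FENCE),
    re.DOTALL,
)


def extract_verilog(response: str) -> list[str]:
    """Return the contents of every fenced code block in a model response."""
    return [block.strip() for block in _BLOCK_RE.findall(response)]


sample = (
    "Here is the counter:\n"
    + FENCE + "verilog\n"
    + "module counter(input clk, output reg [7:0] q);\n"
    + "  always @(posedge clk) q <= q + 1;\n"
    + "endmodule\n"
    + FENCE + "\nLet me know if you need a reset."
)
blocks = extract_verilog(sample)
print(blocks[0])
```

A response that was truncated before its closing fence yields no blocks, which is a convenient signal to retry with a larger `max_new_tokens`.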

### Run with Ollama (Recommended for Laptop Deployment)

First quantize to GGUF:
```bash
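# Build llama.cpp (recent versions build with CMake; the Makefile build is deprecated)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt   # dependencies for the conversion script
cmake -B build && cmake --build build --config Release -j4

# Convert and quantize (adjust ./final_model to wherever you downloaded it)
python convert_hf_to_gguf.py ./final_model --outtype f16 \
    --outfile vlsi_slm_v1_f16.gguf

./build/bin/llama-quantize vlsi_slm_v1_f16.gguf vlsi_slm_v1_Q4_K_M.gguf Q4_K_M
# Output: ~4GB GGUF file for laptop deployment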
```

Create `Modelfile`:
```
FROM ./vlsi_slm_v1_Q4_K_M.gguf

SYSTEM """You are an expert VLSI and Verilog engineer.
For code: output only syntactically correct, synthesizable Verilog.
Use non-blocking assignments (<=) in sequential always blocks.
Always wrap code in ```verilog blocks.
Always include endmodule.
For concepts: give accurate, technical explanations."""

PARAMETER temperature 0.1
PARAMETER num_ctx 2048
```

```bash
ollama create vlsi-slm-v1 -f Modelfile
ollama run vlsi-slm-v1
```
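
Once the model is registered, it can also be queried from a script through Ollama's local HTTP API rather than the interactive prompt. The sketch below assumes the default server address `http://localhost:11434` and the model name `vlsi-slm-v1` created above; `build_generate_request` and `ask_ollama` are hypothetical helper names.

```python
import json
import urllib.request


def build_generate_request(prompt: str,
                           model: str = "vlsi-slm-v1",
                           host: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a non-streaming request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def ask_ollama(prompt: str, timeout: float = 300.0) -> str:
    """Send the prompt to a running Ollama server and return the generated text."""
    request = build_generate_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read())["response"]
```

Calling `ask_ollama("Write a 4-bit Gray code counter.")` requires the Ollama server to be running locally. The `SYSTEM` prompt and parameters from the `Modelfile` are applied by Ollama, so the script only sends the question.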

---

## What This Model Can Do ✅

### Strong Capabilities (Easy–Medium complexity)

**Verilog Code Generation:**
- Flip-flops (D, T, JK) with synchronous/asynchronous reset
- Counters (binary, Gray code, Johnson, LFSR)
- Multiplexers, encoders, decoders
- Shift registers (parameterizable width/depth)
- State machines (Moore and Mealy FSM)
- Synchronous SRAM and FIFO
- Clock dividers and pulse generators
- Debounce circuits
- Two-flop CDC synchronizers
- Basic AXI-Lite and handshake protocols
- Simple UART, SPI, I2C controllers
- Testbench templates

**VLSI Concept Explanations:**
- Clock Domain Crossing (CDC) and metastability
- Setup time and hold time analysis
- Power reduction: clock gating and power gating
- Static Timing Analysis (STA) concepts
- Scan chains and Design for Testability (DFT)
- SRAM vs DRAM differences
- Electromigration and IR drop
- AXI, APB, AHB protocol rules
- Blocking vs non-blocking assignments
- Latch inference and how to avoid it

### Partial Capabilities (Hard complexity)

- Asynchronous FIFO with Gray code pointers (architecture correct, may miss endmodule)
- Round-robin arbiters
- Pipeline structures
- SPI master/slave controllers
- Branch predictors
- Memory BIST controllers

---

## Known Limitations ⚠️

### 1. Truncation Artifact (Primary Known Issue)
Complex modules exceeding ~800 tokens of output may be cut off before `endmodule`. This is a **training data artifact** — the dataset was generated using free APIs with 1800-token output limits, and truncated examples leaked through. The model learned this truncation pattern as a behavior.

**Workaround:** Always set `max_new_tokens=1024` or higher. If output is still truncated, append `\nendmodule` manually — the logic inside is typically correct.

**Fix in progress:** V2 training uses strict `endmodule` validation gates in the data pipeline.
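
The manual workaround can be automated with a small post-processing step. The function below is a sketch under the assumption stated above, namely that the only damage is a missing `endmodule`. It does not repair a module that was cut off mid-statement or inside an unfinished `begin`/`end` block, so the patched output should still go through iverilog. `ensure_endmodule` is a hypothetical name.

```python
import re

_COMMENT_RE = re.compile(r"//.*?$|/\*.*?\*/", re.DOTALL | re.MULTILINE)


def ensure_endmodule(code: str) -> str:
    """Append `endmodule` for every `module` that was opened but never closed."""
    # Ignore keywords that only appear inside comments.
    stripped = _COMMENT_RE.sub("", code)
    opened = len(re.findall(r"\bmodule\b", stripped))
    closed = len(re.findall(r"\bendmodule\b", stripped))
    missing = max(opened - closed, 0)
    return code.rstrip() + "\nendmodule" * missing + "\n"


truncated = "module and2(input a, input b, output y);\n  assign y = a & b;"
print(ensure_endmodule(truncated))
```

Counting `module` against `endmodule` keeps the helper from adding a second terminator to output that is already complete.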

### 2. Concept Accuracy Gap
Concept accuracy is 65% vs the 85-90% target. Root cause: PDF textbooks were extracted page-by-page (not paragraph-by-paragraph), causing "semantic blur" where opposing concepts (e.g., Setup vs Hold timing) were mixed in the same training example.

### 3. Submodule Hallucination
The model occasionally instantiates undefined submodules (`fa fa0(...)` style) when asked for gate-level designs. To avoid this, explicitly request "behavioral RTL" in your prompt.

### 4. Not Trained for SoC-Level Design
This model is optimized for **block-level RTL** (FIFOs, arbiters, FSMs, protocol controllers). It is not intended for full SoC or chip-level architecture. Expert-level questions (5-stage RISC pipeline, NoC routers, IEEE 754 FP units) are attempted but may be incomplete.

### 5. Hardware Constraints
Trained on a 64GB Jetson Orin. The merged model requires ~15GB RAM. Use the GGUF Q4_K_M quantized version (~4GB) for laptop deployment.

---

## Training Details

### Hardware
This model was trained entirely on an **NVIDIA Jetson Orin 64GB**, an edge computing device, with no cloud GPUs used.

```
Device      : NVIDIA Jetson Orin (64GB unified RAM)
CUDA        : 12.6 (ARM64)
OS          : Ubuntu 22.04
PyTorch     : 2.5.0a0 nv24.8
Transformers: 4.44.0
PEFT        : 0.18.1
TRL         : 0.8.6
```

**Important hardware note:** bitsandbytes is **not compatible** with CUDA 12.6 on Jetson Orin ARM64. Training used pure bfloat16 with `adamw_torch` optimizer. If you attempt to run this model on similar ARM64 Jetson hardware, do not use bitsandbytes or NEFTune.

### Training Configuration
```python
TrainingArguments(
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,     # Effective batch = 16
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,
    fp16=False,
    gradient_checkpointing=True,
    optim="adamw_torch",
    max_grad_norm=1.0,
    save_steps=500,
    eval_steps=500,
    save_total_limit=4,
    group_by_length=True,
)
```
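
The step numbers used elsewhere in this card follow from these settings. The arithmetic below is a sketch that assumes a single device and that a trailing partial gradient-accumulation window at the end of an epoch does not count as an optimizer step.

```python
# Derive the optimizer-step schedule from the training configuration.
# ASSUMPTIONS: one device; partial accumulation windows at the end of an
# epoch are not counted as optimizer steps.
train_examples = 30_354
per_device_batch = 1
grad_accum = 16
epochs = 3

effective_batch = per_device_batch * grad_accum
micro_batches_per_epoch = -(-train_examples // per_device_batch)  # ceiling division
steps_per_epoch = micro_batches_per_epoch // grad_accum
total_steps = steps_per_epoch * epochs

best_val_step = 4000  # from the Model Details table
best_val_epoch = best_val_step / steps_per_epoch

print(effective_batch, steps_per_epoch, total_steps)  # 16 1897 5691
print(f"best validation loss at epoch {best_val_epoch:.2f}")
```

Under these assumptions the total of 5,691 steps matches the final `checkpoint-5691`, and the best validation loss at step 4000 falls early in the third epoch.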

### Thermal Management Innovation
A custom thermal batching system was implemented:
- Every 250 training steps: save checkpoint → 5-minute cooldown → resume
- Table fan added for additional airflow
- Result: GPU temperature maintained at 44–61°C throughout 84-hour run
- 6 power outages during training — all recovered via atomic heartbeat checkpointing
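
The exact implementation is not reproduced in this card, but the two ideas above can be sketched in a few lines: a cooldown check on the step counter, and a heartbeat file that is replaced atomically so a power cut never leaves a half-written state. Function names and the fields stored in the heartbeat are hypothetical.

```python
import json
import os
import tempfile
from typing import Optional

COOLDOWN_EVERY = 250          # steps between cooldowns, from the list above
COOLDOWN_SECONDS = 5 * 60     # 5-minute pause


def needs_cooldown(step: int, every: int = COOLDOWN_EVERY) -> bool:
    """True when training should checkpoint and pause after this step."""
    return step > 0 and step % every == 0


def write_heartbeat(path: str, state: dict) -> None:
    """Write state to a temp file, flush it to disk, then atomically swap it in."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)  # atomic when source and target share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def read_heartbeat(path: str) -> Optional[dict]:
    """Return the last saved state, or None if no usable heartbeat exists."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except (FileNotFoundError, json.JSONDecodeError):
        return None
```

In a training loop, `needs_cooldown(step)` would trigger a checkpoint save, a `write_heartbeat(...)` call, and a `time.sleep(COOLDOWN_SECONDS)`. On restart, `read_heartbeat(...)` indicates which checkpoint to resume from.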

### Dataset
```
Source          : Curated VLSI examples (code + concept + QA)
Format          : Alpaca instruction tuning
Train           : 30,354 examples
Validation      : 1,681 examples  
Test            : 1,681 examples
Categories      : 75.8% code_generation, 23.0% concept, 1.2% QA
Max seq length  : 2048 tokens
Decontamination : ✅ Zero benchmark leaks verified
```

---

## Comparison: Base vs Fine-tuned

| Metric | Base CodeLlama-7B | VLSI-SLM V1 |
|--------|------------------|-------------|
| Verilog syntax knowledge | General | VLSI-specialized |
| VLSI concept depth | Surface-level | Detailed and accurate |
| Hallucination rate | ~10% | **0.0%** |
| Code syntax pass (iverilog) | ~0% | **60%** |
| Runs offline | ✅ | ✅ |
| Deployable on laptop | ✅ (4GB Q4) | ✅ (4GB Q4) |
| Cost | Free | Free |

---

## Roadmap: What V2 Will Fix

**VLSI-SLM V2** is currently in development with the following improvements:

| Issue | V1 Status | V2 Fix |
|-------|-----------|--------|
| Truncated endmodule | Present in complex modules | Strict validation gate in data pipeline |
| Concept accuracy 65% | Below target | Layout-aware PDF chunking (paragraph-level) |
| Submodule hallucination | Occasional | Anti-submodule prompt in data generation |
| Dataset quality | Quantity-focused (30K) | Quality-focused (12K clean) |
| JSON data corruption | Silent patching | Strict drop-on-failure |
| EOS alignment | Not enforced | EOS token after endmodule |
| Concept/code ratio | 23%/75% | 50%/50% balanced |

**Target V2 metrics:**
- Code Syntax Pass: 65–75%
- Concept Accuracy: 85–90%
- Hallucination Rate: <2%

---

## How to Contribute / Develop Further

### 1. Improve the Dataset
The biggest gains come from data quality, not model size.

```python
# The most impactful contribution: add validated Verilog examples
# Requirements:
# - Must compile with iverilog
# - Must end with endmodule/endinterface/endpackage
# - Must be self-contained (no undefined submodules)
# - Alpaca format: {"instruction": ..., "input": "", "output": ...}

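# Validate before contributing:
import subprocess
from pathlib import Path

src = Path("your_file.v")
result = subprocess.run(["iverilog", "-tnull", str(src)],
                        capture_output=True, text=True)
assert result.returncode == 0, f"Syntax error: {result.stderr}"
# The file must end with a closing keyword, not merely contain one
assert src.read_text().rstrip().endswith(("endmodule", "endinterface", "endpackage"))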
```
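
Once a file passes those checks, it can be packaged as an Alpaca-format record. This is a minimal sketch: `make_record` and `to_jsonl_line` are hypothetical helpers, and storing the raw code (rather than a fenced block) in `output` is an assumption, so check the existing dataset files for the exact convention.

```python
import json

TERMINATORS = ("endmodule", "endinterface", "endpackage")


def make_record(instruction: str, verilog: str) -> dict:
    """Build one Alpaca-style example, rejecting code that looks truncated."""
    code = verilog.strip()
    if not code.endswith(TERMINATORS):
        raise ValueError("code must end with endmodule/endinterface/endpackage")
    return {"instruction": instruction.strip(), "input": "", "output": code}


def to_jsonl_line(record: dict) -> str:
    """Serialize a record as a single JSON Lines entry."""
    return json.dumps(record, ensure_ascii=False)


record = make_record(
    "Write a 2-input AND gate in Verilog.",
    "module and2(input a, input b, output y);\n  assign y = a & b;\nendmodule\n",
)
print(to_jsonl_line(record))
```

Rejecting truncated code at record-creation time mirrors the strict drop-on-failure policy planned for the V2 data pipeline.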

### 2. Fine-tune Further on Your Domain
Use LoRA to specialize for your specific VLSI area:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load V1 as the base for further fine-tuning
model = AutoModelForCausalLM.from_pretrained(
    "Rajasrl/VLSI-SLM-V1-CodeLlama-Full",
    subfolder="final_model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Add new LoRA adapters for your domain
# (FPGA-specific, ASIC timing, formal verification, etc.)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    task_type="CAUSAL_LM",
    # ... add target_modules, lora_dropout, etc. for your setup
)
model = get_peft_model(model, lora_config)
```

### 3. Extend to SystemVerilog / UVM
The model has basic SV knowledge but was primarily trained on Verilog-2001.
Adding UVM testbench examples and SystemVerilog assertions (SVA) would
significantly improve verification use cases.

### 4. Add Image Recognition
A compelling future direction: multi-modal VLSI assistant that can:
- Read handwritten schematic photos → generate Verilog
- Analyze timing diagrams → identify violations
- Recognize circuit board components → explain connections

### 5. Build a Retrieval-Augmented Generation (RAG) Layer
Connect the model to a vector database of VLSI standards (IEEE 1800,
AMBA AXI spec, IEEE 1149.1 JTAG) for factually grounded answers.

### 6. Evaluation Contributions
Add more benchmark questions to `evaluation/` folder — especially:
- Formal verification questions (SVA, PSL)
- Physical design (placement, routing, DRC)
- Analog/mixed-signal interfaces
- RISC-V specific RTL patterns

---

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{vlsi-slm-v1-2026,
  title        = {VLSI-SLM V1: An Edge-Trained Small Language Model for VLSI Design},
  author       = {Rajasrl},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Rajasrl/VLSI-SLM-V1-CodeLlama-Full}},
  note         = {Fine-tuned CodeLlama-7B on NVIDIA Jetson Orin edge hardware.
                  30,354 curated VLSI examples. Zero cloud compute.}
}
```

---

## The Story

This model was trained by a final-year engineering student on borrowed edge
hardware, with no cloud budget, no research lab, and no team. The training
ran through 6 power outages, lightning storms, and thermal shutdowns — all
recovered automatically.

The goal was simple: build a VLSI assistant that works offline, costs
nothing to run, and belongs to the community — not behind an API paywall.

**"I built an AI to teach me VLSI."**

---

## License

MIT License — free to use, modify, and distribute. See LICENSE for details.

---

*Model trained: March 29 – April 3, 2026*
*Uploaded to Hugging Face: May 2026*
*Hardware: NVIDIA Jetson Orin 64GB (edge device, no cloud)*