---
license: mit
language:
- en
tags:
- vlsi
- verilog
- systemverilog
- code-generation
- hardware-design
- eda
- rtl
- fine-tuned
- codellama
- lora
- edge-ai
- jetson-orin
base_model: codellama/CodeLlama-7b-Instruct-hf
pipeline_tag: text-generation
library_name: transformers
model_type: llama
---

# VLSI-SLM V1: CodeLlama Full Model

> **The first open-source, edge-trained, laptop-deployable Small Language Model specialized for VLSI design.**

A 7B parameter CodeLlama model fine-tuned on 30,354 curated VLSI examples, trained entirely on an NVIDIA Jetson Orin edge device with no cloud compute. It generates Verilog that passes an iverilog syntax check on 60% of benchmark prompts, explains VLSI concepts, and runs offline on a laptop after quantization to a ~4GB GGUF file.

---

## Model Details

| Property | Value |
|----------|-------|
| **Base Model** | CodeLlama-7B-Instruct |
| **Fine-tuning Method** | LoRA (r=32, α=64) |
| **Trainable Parameters** | 82,265,088 (1.21% of 6.82B) |
| **Training Hardware** | NVIDIA Jetson Orin 64GB (edge device) |
| **Training Time** | ~84 hours wall time |
| **Dataset Size** | 30,354 examples (train) / 1,681 (val) |
| **Training Epochs** | 3 |
| **Final Train Loss** | 0.0122 |
| **Best Val Loss** | 0.3892 (step 4000) |
| **Precision** | bfloat16 (no quantization during training) |
| **License** | MIT |

### LoRA Configuration

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # Attention
        "gate_proj", "up_proj", "down_proj",      # MLP/FFN
        "embed_tokens", "lm_head",                # Embeddings and output head
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
```

---

## Repository Contents

```
VLSI-SLM-V1-CodeLlama-Full/
├── final_model/          ← Merged full model (~14GB, bf16 safetensors)
├── final_adapter/        ← LoRA adapter only (~200MB)
├── checkpoint-5000/      ← Training checkpoint
├── checkpoint-5250/      ← Training checkpoint
├── checkpoint-5500/      ← Training checkpoint
├── checkpoint-5691/      ← Final training checkpoint
├── evaluation/           ← Benchmark results and logs
├── logs/                 ← Full training logs
├── baseline_pre_ft.json  ← Base model responses (pre fine-tuning)
├── best_checkpoint.txt   ← Best validation checkpoint info
├── heartbeat.json        ← Last training state
└── m4_config_v31.json    ← Exact training hyperparameters
```

---

## Evaluation Results

Evaluated using a semantic scoring system (not rigid keyword matching) with `max_new_tokens=1024`.

### Standard 50-Question VLSI Benchmark

| Metric | Score | Target | Status |
|--------|-------|--------|--------|
| Code Syntax Pass (iverilog) | **60.0%** | 40–60% | ✅ PASS |
| Concept Accuracy | **65.0%** | 85–90% | 🟡 BELOW TARGET |
| Hallucination Rate | **0.0%** | <5% | ✅ PERFECT |
| Code Block Formatting | **95.0%** | n/a | ✅ |
| Debug Accuracy | **60.0%** | n/a | 🟡 |
| Overall | **72.0%** | n/a | ✅ |

### Coding Stress Test (50 Progressive Questions)

| Difficulty | Questions | Pass Rate | Examples |
|-----------|-----------|-----------|---------|
| Easy | 10 | **100%** | AND gate, DFF, counter, decoder |
| Medium | 15 | **87%** | FIFO, ALU, FSM, synchronizer |
| Hard | 13 | **62%** | Async FIFO, AXI-Lite, SPI master |
| Expert | 12 | **42%** | FP adder, MBIST, JTAG TAP controller |

**The model passes every Easy question and most Medium ones. Expert-level complex modules (1000+ tokens) show truncation artifacts, a known training data issue being addressed in V2.**

---

## Quick Start

### Load and Run Inference

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Rajasrl/VLSI-SLM-V1-CodeLlama-Full"

# The merged model lives in the `final_model/` subfolder of the repository,
# so it is selected with `subfolder=` rather than appended to the repo id.
tokenizer = AutoTokenizer.from_pretrained(model_id, subfolder="final_model")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    subfolder="final_model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()


def ask_vlsi(question: str, code_mode: bool = False) -> str:
    if code_mode:
        system = """You are a Senior VLSI RTL Engineer. Rules:
1. Always wrap code in ```verilog blocks
2. Use non-blocking assignments (<=) in sequential always blocks
3. Use blocking assignments (=) in combinational always blocks
4. Always include complete module with endmodule
5. Never use reserved keywords as signal names"""
    else:
        system = "You are an expert VLSI engineer. Give accurate, technical answers."

    prompt = f"### System:\n{system}\n\n### Instruction:\n{question}\n\n### Response:\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    gen_kwargs = dict(
        max_new_tokens=1024,  # Important: use 1024+ for complete modules
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
    )
    if code_mode:
        # Greedy decoding for deterministic code; temperature is not used here.
        gen_kwargs["do_sample"] = False
    else:
        # Light sampling for concept explanations.
        gen_kwargs.update(do_sample=True, temperature=0.1)

    with torch.no_grad():
        output = model.generate(**inputs, **gen_kwargs)

    # Decode only the newly generated tokens, not the prompt.
    response = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True,
    )
    return response.strip()


# Code generation (deterministic)
print(ask_vlsi(
    "Write a parameterizable 8-bit synchronous counter with reset.",
    code_mode=True
))

# Concept explanation
print(ask_vlsi(
    "Explain clock domain crossing and how to handle it safely.",
    code_mode=False
))
```

### Run with Ollama (Recommended for Laptop Deployment)

First quantize to GGUF:

```bash
# Build llama.cpp (recent versions build with CMake; older checkouts used `make`)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt   # dependencies for the conversion script
cmake -B build
cmake --build build --config Release -j 4

# Convert and quantize (adjust ./final_model to where the merged model was downloaded)
python convert_hf_to_gguf.py ./final_model --outtype f16 \
    --outfile vlsi_slm_v1_f16.gguf
./build/bin/llama-quantize vlsi_slm_v1_f16.gguf vlsi_slm_v1_Q4_K_M.gguf Q4_K_M
# Output: ~4GB file
```

Create `Modelfile`:

```
FROM ./vlsi_slm_v1_Q4_K_M.gguf
SYSTEM """You are an expert VLSI and Verilog engineer.
For code: output only syntactically correct, synthesizable Verilog.
Use non-blocking assignments (<=) in sequential always blocks.
Always wrap code in ```verilog blocks. Always include endmodule.
For concepts: give accurate, technical explanations."""
PARAMETER temperature 0.1
PARAMETER num_ctx 2048
```

```bash
ollama create vlsi-slm-v1 -f Modelfile
ollama run vlsi-slm-v1
```

---

## What This Model Can Do ✅

### Strong Capabilities (Easy–Medium complexity)

**Verilog Code Generation:**
- Flip-flops (D, T, JK) with synchronous/asynchronous reset
- Counters (binary, Gray code, Johnson, LFSR)
- Multiplexers, encoders, decoders
- Shift registers (parameterizable width/depth)
- State machines (Moore and Mealy FSM)
- Synchronous SRAM and FIFO
- Clock dividers and pulse generators
- Debounce circuits
- Two-flop CDC synchronizers
- Basic AXI-Lite and handshake protocols
- Simple UART, SPI, I2C controllers
- Testbench templates

**VLSI Concept Explanations:**
- Clock Domain Crossing (CDC) and metastability
- Setup time and hold time analysis
- Power reduction: clock gating and power gating
- Static Timing Analysis (STA) concepts
- Scan chains and Design for Testability (DFT)
- SRAM vs DRAM differences
- Electromigration and IR drop
- AXI, APB, AHB protocol rules
- Blocking vs non-blocking assignments
- Latch inference and how to avoid it

### Partial Capabilities (Hard complexity)

- Asynchronous FIFO with Gray code pointers (architecture correct, may miss `endmodule`)
- Round-robin arbiters
- Pipeline structures
- SPI master/slave controllers
- Branch predictors
- Memory BIST controllers

---

## Known Limitations ⚠️

### 1. Truncation Artifact (Primary Known Issue)

Complex modules exceeding ~800 tokens of output may be cut off before `endmodule`. This is a **training data artifact**: the dataset was generated using free APIs with 1800-token output limits, and truncated examples leaked through. The model learned this truncation pattern as a behavior.

**Workaround:** Always set `max_new_tokens=1024` or higher. If output is still truncated, append `\nendmodule` manually; the logic inside is typically correct.
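The manual workaround can also be scripted. The following is a minimal sketch, not part of the released tooling: `close_truncated_module` is a hypothetical helper name, and it assumes the only damage is a missing `endmodule` at the end of the output. It does not repair a statement or `begin`/`end` block that was cut off mid-way.

```python
import re


def close_truncated_module(verilog: str) -> str:
    """Append one `endmodule` for every `module` that was never closed.

    Hypothetical helper for illustration only.
    """
    # Remove // and /* */ comments so keywords inside them are not counted.
    stripped = re.sub(r"//[^\n]*|/\*.*?\*/", "", verilog, flags=re.DOTALL)

    # \bmodule\b does not match inside "endmodule" (no word boundary there).
    opened = len(re.findall(r"\bmodule\b", stripped))
    closed = len(re.findall(r"\bendmodule\b", stripped))
    missing = max(opened - closed, 0)

    return verilog.rstrip() + "\nendmodule" * missing + "\n"


if __name__ == "__main__":
    truncated = (
        "module counter(input clk, output reg [7:0] q);\n"
        "  always @(posedge clk) q <= q + 1;"
    )
    print(close_truncated_module(truncated))
```

Running the patched text through `iverilog -tnull` afterwards is still advisable, since a closing keyword alone does not guarantee that the module compiles.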
**Fix in progress:** V2 training uses strict `endmodule` validation gates in the data pipeline.

### 2. Concept Accuracy Gap

Concept accuracy is 65% vs the 85–90% target. Root cause: PDF textbooks were extracted page-by-page (not paragraph-by-paragraph), causing "semantic blur" where opposing concepts (e.g., Setup vs Hold timing) were mixed in the same training example.

### 3. Submodule Hallucination

The model occasionally instantiates undefined submodules (`fa fa0(...)` style) when asked for gate-level designs. This is best avoided by explicitly requesting "behavioral RTL" in your prompt.

### 4. Not Trained for SoC-Level Design

This model is optimized for **block-level RTL** (FIFOs, arbiters, FSMs, protocol controllers). It is not intended for full SoC or chip-level architecture. Expert-level questions (5-stage RISC pipeline, NoC routers, IEEE 754 FP units) are attempted but may be incomplete.

### 5. Hardware Requirements

The model was trained on a 64GB Jetson Orin. The merged model requires ~15GB RAM. Use the GGUF Q4_K_M quantized version (~4GB) for laptop deployment.

---

## Training Details

### Hardware

This model was trained entirely on an **NVIDIA Jetson Orin 64GB**, an edge computing device, with no cloud GPUs used.

```
Device      : NVIDIA Jetson Orin (64GB unified RAM)
CUDA        : 12.6 (ARM64)
OS          : Ubuntu 22.04
PyTorch     : 2.5.0a0 nv24.8
Transformers: 4.44.0
PEFT        : 0.18.1
TRL         : 0.8.6
```

**Important hardware note:** bitsandbytes is **not compatible** with CUDA 12.6 on Jetson Orin ARM64. Training used pure bfloat16 with the `adamw_torch` optimizer. If you attempt to run this model on similar ARM64 Jetson hardware, do not use bitsandbytes or NEFTune.
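As a rough cross-check of why pure bfloat16 fits on this device without quantization, the arithmetic below estimates the weight footprint and the trainable share from the parameter counts in Model Details. It is a back-of-the-envelope sketch: it assumes 2 bytes per bfloat16 parameter and ignores activations, gradients, optimizer state, and runtime overhead.

```python
# Figures taken from the Model Details table in this card.
TOTAL_PARAMS = 6.82e9          # rounded total parameter count
TRAINABLE_PARAMS = 82_265_088  # LoRA-trainable parameters
BYTES_PER_BF16 = 2             # bfloat16 stores each value in 2 bytes

# Weight memory only; activations, optimizer state and overhead are excluded.
weights_gb = TOTAL_PARAMS * BYTES_PER_BF16 / 1e9
trainable_pct = 100 * TRAINABLE_PARAMS / TOTAL_PARAMS

print(f"bf16 weights: ~{weights_gb:.2f} GB")
print(f"trainable share: {trainable_pct:.2f}%")
```

The weight estimate of about 13.6 GB is in line with the ~14GB merged checkpoint and the ~15GB RAM requirement quoted above, and the trainable share matches the 1.21% listed in Model Details.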
### Training Configuration

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",            # placeholder path (not from the original config)
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,   # Effective batch = 16
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,
    fp16=False,
    gradient_checkpointing=True,
    optim="adamw_torch",
    max_grad_norm=1.0,
    save_steps=500,
    eval_steps=500,
    save_total_limit=4,
    group_by_length=True,
)
```

### Thermal Management Innovation

A custom thermal batching system was implemented:
- Every 250 training steps: save checkpoint → 5-minute cooldown → resume
- Table fan added for additional airflow
- Result: GPU temperature maintained at 44–61°C throughout the 84-hour run
- 6 power outages during training, all recovered via atomic heartbeat checkpointing

### Dataset

```
Source          : Curated VLSI examples (code + concept + QA)
Format          : Alpaca instruction tuning
Train           : 30,354 examples
Validation      : 1,681 examples
Test            : 1,681 examples
Categories      : 75.8% code_generation, 23.0% concept, 1.2% QA
Max seq length  : 2048 tokens
Decontamination : ✅ Zero benchmark leaks verified
```

---

## Comparison: Base vs Fine-tuned

| Metric | Base CodeLlama-7B | VLSI-SLM V1 |
|--------|------------------|-------------|
| Verilog syntax knowledge | General | VLSI-specialized |
| VLSI concept depth | Surface-level | Detailed and accurate |
| Hallucination rate | ~10% | **0.0%** |
| Code syntax pass (iverilog) | ~0% | **60%** |
| Runs offline | ✅ | ✅ |
| Deployable on laptop | ✅ (4GB Q4) | ✅ (4GB Q4) |
| Cost | Free | Free |

---

## Roadmap: What V2 Will Fix

**VLSI-SLM V2** is currently in development with the following improvements:

| Issue | V1 Status | V2 Fix |
|-------|-----------|--------|
| Truncated endmodule | Present in complex modules | Strict validation gate in data pipeline |
| Concept accuracy 65% | Below target | Layout-aware PDF chunking (paragraph-level) |
| Submodule hallucination | Occasional | Anti-submodule prompt in data generation |
| Dataset quality | Quantity-focused (30K) | Quality-focused (12K clean) |
| JSON data corruption | Silent patching | Strict drop-on-failure |
| EOS alignment | Not enforced | EOS token after endmodule |
| Concept/code ratio | 23%/75% | 50%/50% balanced |

**Target V2 metrics:**
- Code Syntax Pass: 65–75%
- Concept Accuracy: 85–90%
- Hallucination Rate: <2%

---

## How to Contribute / Develop Further

### 1. Improve the Dataset

The biggest gains come from data quality, not model size.

```python
# The most impactful contribution: add validated Verilog examples
# Requirements:
# - Must compile with iverilog
# - Must end with endmodule/endinterface/endpackage
# - Must be self-contained (no undefined submodules)
# - Alpaca format: {"instruction": ..., "input": "", "output": ...}

# Validate before contributing:
import subprocess

path = "your_file.v"

# -tnull runs the iverilog front end as a syntax check without generating output.
result = subprocess.run(["iverilog", "-tnull", path],
                        capture_output=True, text=True)
assert result.returncode == 0, f"Syntax error: {result.stderr}"

with open(path) as f:
    assert "endmodule" in f.read()
```

### 2. Fine-tune Further on Your Domain

Use LoRA to specialize for your specific VLSI area:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load V1 as base for V2 fine-tuning
# (the merged model lives in the `final_model/` subfolder of the repository)
model = AutoModelForCausalLM.from_pretrained(
    "Rajasrl/VLSI-SLM-V1-CodeLlama-Full",
    subfolder="final_model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Add new LoRA adapters for your domain
# (FPGA-specific, ASIC timing, formal verification, etc.)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    task_type="CAUSAL_LM",
    # ... add target_modules, lora_dropout, etc. for your setup
)
model = get_peft_model(model, lora_config)
```

### 3. Extend to SystemVerilog / UVM

The model has basic SV knowledge but was primarily trained on Verilog-2001. Adding UVM testbench examples and SystemVerilog assertions (SVA) would significantly improve verification use cases.

### 4. Add Image Recognition

A compelling future direction is a multi-modal VLSI assistant that can:
- Read handwritten schematic photos → generate Verilog
- Analyze timing diagrams → identify violations
- Recognize circuit board components → explain connections

### 5. Build a Retrieval-Augmented Generation (RAG) Layer

Connect the model to a vector database of VLSI standards (IEEE 1800, AMBA AXI spec, IEEE 1149.1 JTAG) for factually grounded answers.

### 6. Evaluation Contributions

Add more benchmark questions to the `evaluation/` folder, especially:
- Formal verification questions (SVA, PSL)
- Physical design (placement, routing, DRC)
- Analog/mixed-signal interfaces
- RISC-V specific RTL patterns

---

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{vlsi-slm-v1-2026,
  title        = {VLSI-SLM V1: An Edge-Trained Small Language Model for VLSI Design},
  author       = {Rajasrl},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Rajasrl/VLSI-SLM-V1-CodeLlama-Full}},
  note         = {Fine-tuned CodeLlama-7B on NVIDIA Jetson Orin edge hardware. 30,354 curated VLSI examples. Zero cloud compute.}
}
```

---

## The Story

This model was trained by a final-year engineering student on borrowed edge hardware, with no cloud budget, no research lab, and no team. The training ran through 6 power outages, lightning storms, and thermal shutdowns, all recovered automatically.

The goal was simple: build a VLSI assistant that works offline, costs nothing to run, and belongs to the community rather than sitting behind an API paywall.

**"I built an AI to teach me VLSI."**

---

## License

MIT License: free to use, modify, and distribute. See LICENSE for details.

---

*Model trained: March 29 – April 3, 2026*

*Uploaded to Hugging Face: May 2026*

*Hardware: NVIDIA Jetson Orin 64GB (edge device, no cloud)*