# RTL-PPA: RTL Power, Performance, Area Prediction via Chain-of-Thought Fine-Tuning
## Model Overview
This model predicts the Power, Performance (delay), and Area (PPA) of RTL designs synthesized for the Skywater 130nm technology node. Given a Verilog module, it performs step-by-step chain-of-thought reasoning about gate-level synthesis and outputs structured PPA estimates in [area], [delay], and [static_power] tags that are directly usable as reinforcement learning reward signals.
## Model Details
- Developed by: Zhu Wenlong
- Model type: Qwen3-8B fine-tuned with LoRA (rank=256, alpha=512), followed by reinforcement learning (GRPO)
- License: MIT
- Finetuned from: Qwen/Qwen3-8B
## Uses
### Direct Use
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/merged/model"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

SYSTEM_PROMPT = (
    "Your task is to estimate area, delay, and static power for RTL designs in Skywater 130nm technology node.\n"
    "For the given RTL design, reason about the number and type of gates that would be present after synthesis, "
    "then output all four tags:\n"
    "<synth> ... </synth>\n"
    "<area> ... [area]value[/area] </area>\n"
    "<delay> ... [delay]value[/delay] </delay>\n"
    "<static_power> ... [static_power]value[/static_power] </static_power>"
)

rtl_code = "module top_module (...); ..."
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": rtl_code},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=16384, do_sample=False)
result = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
# Extract: [area]VALUE[/area], [delay]VALUE[/delay], [static_power]VALUE[/static_power]
```
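The tagged values can then be pulled out of `result` with a small regex helper; the sketch below is one way to do it (the function name `extract_ppa` is illustrative, not part of the model's tooling):

```python
import re

def extract_ppa(text: str) -> dict:
    """Parse [area]/[delay]/[static_power] values from model output.

    Returns a dict mapping metric name to float; tags that are missing
    or contain non-numeric content are simply omitted.
    """
    metrics = {}
    for tag in ("area", "delay", "static_power"):
        m = re.search(rf"\[{tag}\](.*?)\[/{tag}\]", text, re.DOTALL)
        if m:
            try:
                metrics[tag] = float(m.group(1).strip())
            except ValueError:
                pass  # tag present but value not parseable as a number
    return metrics
```

Returning a partial dict (rather than raising) makes it easy to treat missing tags as a zero reward component downstream.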
### Out-of-Scope Use
- Not suitable for non-Verilog hardware description languages
- Not suitable for RTL targeting technology nodes other than Skywater 130nm
## Training Details
### Training Data
- Base dataset: scale-lab/MetRex (MetRex benchmark)
- Augmentation: Semantic-preserving RTL transformations (signal renaming, constant base conversion, declaration shuffling, whitespace randomization, begin/end insertion, module renaming)
- Format: Alpaca format with verbose chain-of-thought output tags
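As an illustration of the augmentation style, semantic-preserving signal renaming can be sketched as a whole-word token rewrite. This toy helper (`rename_signals` is hypothetical, not the pipeline's actual implementation, and a real version would also have to skip strings and comments) shows the idea:

```python
import re

def rename_signals(verilog: str, mapping: dict) -> str:
    """Rename identifiers in Verilog source via whole-word substitution.

    `mapping` maps old identifiers to new ones (must be non-empty).
    Word boundaries (\b) keep substrings of longer identifiers intact,
    so renaming `a` does not touch `a_b`.
    """
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], verilog)
```

Because the transformation only renames identifiers, the synthesized netlist (and hence the PPA label) is unchanged, which is what makes it safe training augmentation.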
### Training Procedure
#### Stage 1: Base SFT
Fine-tuned on 23,816 samples (small circuits, <1000 gates) to establish fundamental CoT reasoning capability.
SFT Training Configuration:
| Parameter | Value |
|---|---|
| Base Model | Qwen3-8B |
| LoRA Rank | 256 |
| LoRA Alpha | 512 |
| LoRA Target | all |
| Learning Rate | 5e-5 |
| Batch Size | 64 (8 GPUs × per-device batch 1 × grad accum 8) |
| Cutoff Length | 8192 |
| Epochs | 3 |
| Precision | BF16 + DeepSpeed ZeRO-3 |
#### Stage 2: Reinforcement Learning (GRPO)
Refined via Group Relative Policy Optimization (GRPO) using the verl framework. The reward signal is computed from the MAPE of the [area], [delay], and [static_power] values extracted from model outputs, enabling iterative improvement on hard circuits.
RL Training Configuration:
| Parameter | Value |
|---|---|
| Algorithm | GRPO (Group Relative Policy Optimization) |
| Framework | verl |
| Train Batch Size | 256 |
| Max Prompt Length | 3072 |
| Max Response Length | 4096 |
| Actor Learning Rate | 1e-6 |
| PPO Mini Batch Size | 32 |
| PPO Micro Batch Size per GPU | 2 |
| Rollout Samples (n) | 4 |
| KL Loss Coefficient | 0.0 (disabled) |
| Entropy Coefficient | 0 |
| Gradient Checkpointing | Enabled |
| Precision | BF16 |
| Rollout Engine | vLLM |
| Total Epochs | 5 |
RL Launch Command:
```bash
HF_HUB_OFFLINE=1 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=/path/to/train.parquet \
    data.val_files=/path/to/val.parquet \
    data.train_batch_size=256 \
    data.max_prompt_length=3072 \
    data.max_response_length=4096 \
    actor_rollout_ref.model.path=/path/to/metrex_merged_full \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=32 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.actor.use_kl_loss=False \
    actor_rollout_ref.actor.kl_loss_coef=0.0 \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.rollout.n=4 \
    actor_rollout_ref.rollout.temperature=1.0 \
    actor_rollout_ref.rollout.top_p=1.0 \
    actor_rollout_ref.rollout.val_kwargs.temperature=1.0 \
    actor_rollout_ref.rollout.val_kwargs.top_p=0.7 \
    actor_rollout_ref.rollout.val_kwargs.do_sample=True \
    actor_rollout_ref.rollout.val_kwargs.n=1 \
    reward_model.reward_manager=metrex \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.project_name=MetRex-RL \
    trainer.experiment_name=GRPO-Qwen3-8B-MetRex \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.val_before_train=True \
    trainer.test_freq=5 \
    trainer.save_freq=10 \
    trainer.total_epochs=5
```
## Technical Specifications
### Model Architecture
- Architecture: Qwen3ForCausalLM with LoRA adapters (rank=256, target=all modules)
- RL model: Qwen3-8B refined with GRPO; reward = format_score × result_score
- Reward calculation:
  - Format reward (0-1): 0.25 per correctly formatted tag out of 4 tags (`<synth>`, `<area>`, `<delay>`, `<static_power>`)
  - Result reward (0-3): a MAPE-based score for each of the 3 metrics (area, delay, static_power), each capped at 1.0
  - Total reward = format_score × result_score, range [0, 3]
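A minimal sketch of this reward computation follows. The mapping from error to per-metric score is an assumption here (a common choice is 1 − relative error, floored at 0 so each metric contributes at most 1.0); the function name `compute_reward` and the exact formula are illustrative, not the confirmed implementation:

```python
import re

def compute_reward(output: str, truth: dict) -> float:
    """format_score (0-1) x result_score (0-3), as described above.

    ASSUMPTION: per-metric result score = max(0, 1 - relative error);
    the actual MAPE-to-score mapping may differ.
    """
    # Format reward: 0.25 for each well-formed outer tag pair.
    tags = ("synth", "area", "delay", "static_power")
    format_score = sum(
        0.25 for t in tags if re.search(rf"<{t}>.*?</{t}>", output, re.DOTALL)
    )
    # Result reward: accuracy of each extracted numeric value vs. ground truth.
    result_score = 0.0
    for metric in ("area", "delay", "static_power"):
        m = re.search(rf"\[{metric}\](.*?)\[/{metric}\]", output, re.DOTALL)
        if m is None:
            continue  # missing value contributes 0
        try:
            pred = float(m.group(1).strip())
        except ValueError:
            continue  # unparseable value contributes 0
        ape = abs(pred - truth[metric]) / abs(truth[metric])
        result_score += max(0.0, 1.0 - ape)  # capped at 1.0 per metric
    return format_score * result_score
```

The multiplicative combination means a perfectly accurate but badly formatted answer still scores low, which pushes the policy to satisfy the output schema first.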
### Compute Infrastructure
- SFT Training: 8× NVIDIA GPU, DeepSpeed ZeRO-3, BF16 precision
- RL Training: 8× NVIDIA GPU, FSDP with gradient/optimizer offloading, BF16
- Inference: Single GPU sufficient (16GB VRAM for merged model)
## References
- [MetRex: A Benchmark for RTL Code Generation with LLMs](https://github.com/scale-lab/MetRex/tree/main) — Chain-of-thought PPA prediction baseline
- [ChipGPT: How Far Are We From Natural Language Hardware Design](https://arxiv.org/abs/2305.14019) — LLM-assisted hardware design framework
- [Data is All You Need: Finetuning LLMs for Chip Design via Automated Design-Data Augmentation](https://arxiv.org/abs/2403.11202) — Automated RTL data augmentation framework
- [verl: Versatile RL Framework](https://github.com/verl-project/verl) — GRPO training framework