# RTL-PPA: RTL Power, Performance, Area Prediction via Chain-of-Thought Fine-Tuning
## Model Overview
This model predicts the Power, Performance (delay), and Area (PPA) of RTL designs synthesized for the Skywater 130nm technology node. Given a Verilog module, it performs step-by-step chain-of-thought reasoning about gate-level synthesis and outputs structured PPA estimates in [area], [delay], and [static_power] tags that are directly usable as reinforcement learning reward signals.
## Model Details
- Developed by: Zhu Wenlong
- Model type: Qwen3-8B fine-tuned with LoRA (rank=256, alpha=512), followed by reinforcement learning (GRPO)
- License: MIT
- Finetuned from: Qwen/Qwen3-8B
## Uses
### Direct Use
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/merged/model"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

SYSTEM_PROMPT = (
    "Your task is to estimate area, delay, and static power for RTL designs in Skywater 130nm technology node.\n"
    "For the given RTL design, reason about the number and type of gates that would be present after synthesis, "
    "then output all four tags:\n"
    "<synth> ... </synth>\n"
    "<area> ... [area]value[/area] </area>\n"
    "<delay> ... [delay]value[/delay] </delay>\n"
    "<static_power> ... [static_power]value[/static_power] </static_power>"
)

rtl_code = "module top_module (...); ..."
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": rtl_code},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=16384, do_sample=False)
result = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
# Extract: [area]VALUE[/area], [delay]VALUE[/delay], [static_power]VALUE[/static_power]
```
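The tagged values can then be pulled out of `result` with a small regex helper; the sketch below is one way to do it (the function name `extract_ppa` is illustrative, not part of the model's tooling):

```python
import re

def extract_ppa(text: str) -> dict:
    """Parse [area]/[delay]/[static_power] values from model output.

    Returns a dict mapping metric name to float; tags that are missing
    or contain non-numeric content are simply omitted.
    """
    metrics = {}
    for tag in ("area", "delay", "static_power"):
        m = re.search(rf"\[{tag}\](.*?)\[/{tag}\]", text, re.DOTALL)
        if m:
            try:
                metrics[tag] = float(m.group(1).strip())
            except ValueError:
                pass  # tag present but value not parseable as a number
    return metrics
```

Returning a partial dict (rather than raising) makes it easy to treat missing tags as a zero reward component downstream.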
### Out-of-Scope Use
- Not suitable for non-Verilog hardware description languages
- Not suitable for RTL targeting technology nodes other than Skywater 130nm
## Training Details
### Training Data
- Base dataset: scale-lab/MetRex (MetRex benchmark)
- Augmentation: Semantic-preserving RTL transformations (signal renaming, constant base conversion, declaration shuffling, whitespace randomization, begin/end insertion, module renaming)
- Format: Alpaca format with verbose chain-of-thought output tags
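As an illustration of the augmentation style, semantic-preserving signal renaming can be sketched as a whole-word token rewrite. This toy helper (`rename_signals` is hypothetical, not the pipeline's actual implementation, and a real version would also have to skip strings and comments) shows the idea:

```python
import re

def rename_signals(verilog: str, mapping: dict) -> str:
    """Rename identifiers in Verilog source via whole-word substitution.

    `mapping` maps old identifiers to new ones (must be non-empty).
    Word boundaries (\b) keep substrings of longer identifiers intact,
    so renaming `a` does not touch `a_b`.
    """
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], verilog)
```

Because the transformation only renames identifiers, the synthesized netlist (and hence the PPA label) is unchanged, which is what makes it safe training augmentation.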
### Training Procedure
#### Stage 1: Base SFT
Fine-tuned on 23,816 samples (small circuits, <1000 gates) to establish fundamental CoT reasoning capability.
SFT Training Configuration:
| Parameter | Value |
|---|---|
| Base Model | Qwen3-8B |
| LoRA Rank | 256 |
| LoRA Alpha | 512 |
| LoRA Target | all |
| Learning Rate | 5e-5 |
| Batch Size | 64 (8 GPUs × per-device batch 1 × grad accum 8) |
| Cutoff Length | 8192 |
| Epochs | 3 |
| Precision | BF16 + DeepSpeed ZeRO-3 |
#### Stage 2: Reinforcement Learning (GRPO)
Refined via Group Relative Policy Optimization (GRPO) using the verl framework. The reward signal is computed from the MAPE of the [area], [delay], and [static_power] values extracted from model outputs, enabling iterative improvement on hard circuits.
RL Training Configuration:
| Parameter | Value |
|---|---|
| Algorithm | GRPO (Group Relative Policy Optimization) |
| Framework | verl |
| Train Batch Size | 256 |
| Max Prompt Length | 3072 |
| Max Response Length | 4096 |
| Actor Learning Rate | 1e-6 |
| PPO Mini Batch Size | 32 |
| PPO Micro Batch Size per GPU | 2 |
| Rollout Samples (n) | 4 |
| KL Loss Coefficient | 0.0 (disabled) |
| Entropy Coefficient | 0 |
| Gradient Checkpointing | Enabled |
| Precision | BF16 |
| Rollout Engine | vLLM |
| Total Epochs | 5 |
RL Launch Command:
```bash
HF_HUB_OFFLINE=1 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=/path/to/train.parquet \
    data.val_files=/path/to/val.parquet \
    data.train_batch_size=256 \
    data.max_prompt_length=3072 \
    data.max_response_length=4096 \
    actor_rollout_ref.model.path=/path/to/metrex_merged_full \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=32 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.actor.use_kl_loss=False \
    actor_rollout_ref.actor.kl_loss_coef=0.0 \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.rollout.n=4 \
    actor_rollout_ref.rollout.temperature=1.0 \
    actor_rollout_ref.rollout.top_p=1.0 \
    actor_rollout_ref.rollout.val_kwargs.temperature=1.0 \
    actor_rollout_ref.rollout.val_kwargs.top_p=0.7 \
    actor_rollout_ref.rollout.val_kwargs.do_sample=True \
    actor_rollout_ref.rollout.val_kwargs.n=1 \
    reward_model.reward_manager=metrex \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.project_name=MetRex-RL \
    trainer.experiment_name=GRPO-Qwen3-8B-MetRex \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.val_before_train=True \
    trainer.test_freq=5 \
    trainer.save_freq=10 \
    trainer.total_epochs=5
```
## Technical Specifications
### Model Architecture
- Architecture: Qwen3ForCausalLM with LoRA adapters (rank=256, target=all modules)
- RL model: Qwen3-8B refined with GRPO; reward = format_score × result_score
- Reward calculation:
  - Format reward (0-1): 0.25 per correctly formatted tag out of 4 tags (`<synth>`, `<area>`, `<delay>`, `<static_power>`)
  - Result reward (0-3): a MAPE-based score for each of the 3 metrics (area, delay, static_power), each capped at 1.0
  - Total reward = format_score × result_score, range [0, 3]
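A minimal sketch of this reward computation follows. The mapping from error to per-metric score is an assumption here (a common choice is 1 − relative error, floored at 0 so each metric contributes at most 1.0); the function name `compute_reward` and the exact formula are illustrative, not the confirmed implementation:

```python
import re

def compute_reward(output: str, truth: dict) -> float:
    """format_score (0-1) x result_score (0-3), as described above.

    ASSUMPTION: per-metric result score = max(0, 1 - relative error);
    the actual MAPE-to-score mapping may differ.
    """
    # Format reward: 0.25 for each well-formed outer tag pair.
    tags = ("synth", "area", "delay", "static_power")
    format_score = sum(
        0.25 for t in tags if re.search(rf"<{t}>.*?</{t}>", output, re.DOTALL)
    )
    # Result reward: accuracy of each extracted numeric value vs. ground truth.
    result_score = 0.0
    for metric in ("area", "delay", "static_power"):
        m = re.search(rf"\[{metric}\](.*?)\[/{metric}\]", output, re.DOTALL)
        if m is None:
            continue  # missing value contributes 0
        try:
            pred = float(m.group(1).strip())
        except ValueError:
            continue  # unparseable value contributes 0
        ape = abs(pred - truth[metric]) / abs(truth[metric])
        result_score += max(0.0, 1.0 - ape)  # capped at 1.0 per metric
    return format_score * result_score
```

The multiplicative combination means a perfectly accurate but badly formatted answer still scores low, which pushes the policy to satisfy the output schema first.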
### Compute Infrastructure
- SFT Training: 8× NVIDIA GPU, DeepSpeed ZeRO-3, BF16 precision
- RL Training: 8× NVIDIA GPU, FSDP with gradient/optimizer offloading, BF16
- Inference: Single GPU sufficient (16GB VRAM for merged model)
## References
- [MetRex: A Benchmark for RTL Code Generation with LLMs](https://github.com/scale-lab/MetRex/tree/main) — Chain-of-thought PPA prediction baseline
- [ChipGPT: How Far Are We From Natural Language Hardware Design](https://arxiv.org/abs/2305.14019) — LLM-assisted hardware design framework
- [Data is All You Need: Finetuning LLMs for Chip Design via Automated Design-Data Augmentation](https://arxiv.org/abs/2403.11202) — Automated RTL data augmentation framework
- [verl: Versatile RL Framework](https://github.com/verl-project/verl) — GRPO training framework