---
language:
- en
library_name: hls4ml
datasets:
- lithobench
tags:
- pytorch
- hls4ml
- fpga
- neural-network
- quantization
- xilinx
- mask-optimization
- lithography
- inverse-lithography
license: gpl
---

# Penumbra UNet: FPGA-Accelerated Mask Optimization

A compressed U-Net neural network for on-chip FPGA acceleration of Inverse Lithography Technology (ILT) mask optimization, targeting the Xilinx VU47P (AWS F2).

## Overview

Penumbra UNet compresses a full-size teacher network by 64× (7.8M → 122K parameters) so that it fits entirely in on-chip BRAM, enabling a fully on-chip dataflow that eliminates external DRAM access.

## Architecture

### Network Structure

U-Net encoder-decoder with extreme parameter compression:

**Encoder:**
- Conv 1→8 channels, 64×64 + MaxPool → 32×32
- Conv 8→16 channels, 32×32 + MaxPool → 16×16
- Conv 16→32 channels, 16×16 + MaxPool → 8×8

**Bottleneck:**
- Conv 32→64 channels, 8×8

**Decoder:**
- Upsample + skip concatenation + Conv 96→32 channels, 16×16
- Upsample + skip concatenation + Conv 48→16 channels, 32×32
- Upsample + skip concatenation + Conv 24→8 channels, 64×64

**Output:**
- Conv 1×1 + Sigmoid → 64×64

**Compression summary:**

| Metric | Full model | Penumbra UNet |
|--------|-----------|----------|
| Parameters | 7.8M | 122K |
| Input tile | 512×512 | 64×64 |
| Max channels | 512 | 64 |

### Tiling & Reassembly

Input 512×512 masks are decomposed into a 16×16 grid of 64×64 tiles (256 total):

- **Overlap**: 16-pixel reflection padding for boundary handling
- **Usable core**: 32×32 center pixels per tile
- **Batch processing**: 256 tiles → 4 sequential batches of 64

Reassembly uses only differentiable operations (slice, reshape, permute) to enable end-to-end gradient flow:

```
(256, 1, 64, 64)   [all tiles]
    ↓ center-crop
(256, 1, 32, 32)   [usable cores]
    ↓ reshape + permute
(1, 1, 512, 512)   [full mask]
```

## Training

### Phase 1: Knowledge Distillation

- **Epochs**: 16
- **Input**: 64×64 crops
- **Loss**: α-blended (α decays 0.7→0)
  ```
  L = α·MSE(student, teacher) + (1-α)·MSE(student, ground_truth)
  ```
- **Optimizer**: Adam (lr=1e-3), cosine-annealing schedule
- **Teacher**: Frozen full-size NeuralILT model

### Phase 2: Physics-Informed Fine-Tuning

- **Epochs**: 4
- **Pipeline**: Full tiled forward pass through differentiable lithography simulator
- **Loss**: Print fidelity + process variation
  ```
  L = MSE(P_nom, target) + MSE(P_max, P_min)
  ```
- **Optimizer**: Adam (lr=1e-4), StepLR (γ=0.1 at epoch 2)
- **Gradients**: Flow through tiled reassembly to all network weights

## Code Organization

```
hls4ml_penumbra/
├── firmware/              # Generated HLS C++ project
│   ├── myproject.cpp      # Top-level module
│   ├── myproject.h        # Interface & config
│   ├── weights/           # Quantized weights
│   ├── ap_types/          # Xilinx AP types (ap_fixed, ap_int)
│   └── utils/             # HLS utilities
├── myproject_prj/         # Vivado HLS project
│   └── solution1/
│       └── impl/          # Implementation artifacts
├── logs/                  # Build logs
└── [HLS build outputs]
```

---

**Author**: Roberto Treviño Cervantes
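
The 122K parameter figure in the compression summary can be cross-checked against the layer list under Network Structure. The sketch below is an estimate under stated assumptions, not the author's code: it assumes every encoder, bottleneck, and decoder stage is a pair of 3×3 convolutions with biases and no normalization layers, none of which the model card states. `conv_params`, `STAGES`, and `estimate_total` are hypothetical names.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights plus biases of a single k×k convolution."""
    return c_in * c_out * k * k + c_out


# (in_channels, out_channels) per stage, copied from the architecture list.
STAGES = [
    (1, 8), (8, 16), (16, 32),    # encoder
    (32, 64),                     # bottleneck
    (96, 32), (48, 16), (24, 8),  # decoder, after skip concatenation
]


def estimate_total(convs_per_stage: int = 2, k: int = 3) -> int:
    """Estimate the parameter count of the whole network."""
    total = 0
    for c_in, c_out in STAGES:
        # First convolution of the stage changes the channel count.
        total += conv_params(c_in, c_out, k)
        # ASSUMPTION: any further convolutions in the stage keep c_out channels.
        total += (convs_per_stage - 1) * conv_params(c_out, c_out, k)
    # 1×1 output convolution from 8 channels to the single mask channel.
    total += conv_params(8, 1, 1)
    return total


if __name__ == "__main__":
    total = estimate_total()
    print("estimated parameters:", total)
    print("compression vs 7.8M teacher:", round(7.8e6 / total))
```

Under these assumptions the estimate is 121,969 parameters, which rounds to the stated 122K and gives a ratio of about 64× against the 7.8M teacher. With a single convolution per stage the same function returns 60,737, so the double-convolution reading is the one that matches the model card.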
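
The tile geometry from Tiling & Reassembly can be checked with plain index arithmetic. This is a minimal sketch, assuming tiles are numbered in row-major order and cut from a mask that has been reflection-padded by 16 pixels on every side; the ordering and the helper names (`tile_origin`, `core_to_mask`, `batches`) are illustrative assumptions, and the actual pipeline presumably uses tensor slice, reshape, and permute operations instead.

```python
from __future__ import annotations

IMAGE = 512             # full mask side length
TILE = 64               # network input side length
PAD = 16                # reflection padding on every side
CORE = TILE - 2 * PAD   # 32 usable center pixels per tile
GRID = IMAGE // CORE    # 16 tiles per side, 256 tiles in total


def tile_origin(index: int) -> tuple[int, int]:
    """Top-left corner of a tile inside the padded (544×544) mask.

    ASSUMPTION: tiles are numbered in row-major order.
    """
    row, col = divmod(index, GRID)
    return row * CORE, col * CORE


def core_to_mask(index: int, y: int, x: int) -> tuple[int, int]:
    """Map pixel (y, x) of a tile's 32×32 core to full-mask coordinates."""
    row, col = divmod(index, GRID)
    return row * CORE + y, col * CORE + x


def batches(n_tiles: int = GRID * GRID, batch_size: int = 64) -> list[range]:
    """Split tile indices into sequential batches."""
    return [range(start, min(start + batch_size, n_tiles))
            for start in range(0, n_tiles, batch_size)]


if __name__ == "__main__":
    print("core size:", CORE, "grid:", GRID, "tiles:", GRID * GRID)
    print("last tile origin in padded mask:", tile_origin(GRID * GRID - 1))
    print("batch sizes:", [len(b) for b in batches()])
```

Because the cores are 32 pixels wide and the grid stride is also 32, every pixel of the 512×512 mask is produced by exactly one tile, so reassembly needs no blending between neighbours.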
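
The two training losses can be written out on plain Python lists to make the blending explicit. This is a sketch only: the model card gives the endpoints of the α schedule (0.7→0) but not its shape, so the linear decay in `alpha_at` is an assumption, and all function names here are hypothetical. `p_nom`, `p_max`, and `p_min` stand for the printed images the simulator returns for the `P_nom`, `P_max`, and `P_min` terms of the Phase 2 formula.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def alpha_at(epoch, n_epochs=16, alpha_start=0.7):
    """Blend weight for a given epoch.

    ASSUMPTION: linear decay from alpha_start at epoch 0 to 0 at the last epoch.
    """
    return alpha_start * (1 - epoch / (n_epochs - 1))


def distillation_loss(student, teacher, ground_truth, alpha):
    """Phase 1: L = α·MSE(student, teacher) + (1-α)·MSE(student, ground_truth)."""
    return alpha * mse(student, teacher) + (1 - alpha) * mse(student, ground_truth)


def physics_loss(p_nom, p_max, p_min, target):
    """Phase 2: L = MSE(P_nom, target) + MSE(P_max, P_min)."""
    return mse(p_nom, target) + mse(p_max, p_min)


if __name__ == "__main__":
    student, teacher, truth = [0.5, 0.5], [0.5, 0.5], [0.0, 1.0]
    for epoch in (0, 8, 15):
        a = alpha_at(epoch)
        print(epoch, round(a, 3), round(distillation_loss(student, teacher, truth, a), 4))
```

As α shrinks, the student is pulled less toward the teacher output and more toward the ground-truth mask, so by the final Phase 1 epoch the loss reduces to a plain MSE against ground truth.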