---
language:
- en
library_name: hls4ml
datasets:
- lithobench
tags:
- pytorch
- hls4ml
- fpga
- neural-network
- quantization
- xilinx
- mask-optimization
- lithography
- inverse-lithography
license: gpl
---
# Penumbra UNet: FPGA-Accelerated Mask Optimization
A compressed U-Net neural network for on-chip FPGA acceleration of Inverse Lithography Technology (ILT) mask optimization, targeting the Xilinx VU47P (AWS F2).
## Overview
Penumbra UNet compresses a full-size teacher network by 64× (7.8M → 122K parameters) to fit entirely in on-chip BRAM, enabling a fully on-chip dataflow that eliminates external DRAM access.
## Architecture
### Network Structure
U-Net encoder-decoder with extreme parameter compression:
**Encoder:**
- Conv 1→8 channels, 64×64 + MaxPool → 32×32
- Conv 8→16 channels, 32×32 + MaxPool → 16×16
- Conv 16→32 channels, 16×16 + MaxPool → 8×8
**Bottleneck:**
- Conv 32→64 channels, 8×8
**Decoder:**
- Upsample + skip concatenation + Conv 96→32 channels, 16×16
- Upsample + skip concatenation + Conv 48→16 channels, 32×32
- Upsample + skip concatenation + Conv 24→8 channels, 64×64
**Output:**
- Conv 1×1 (8→1 channels) + Sigmoid → 64×64
**Compression summary:**
| Metric | Full model | Penumbra UNet |
|--------|-----------|----------|
| Parameters | 7.8M | 122K |
| Input tile | 512×512 | 64×64 |
| Max channels | 512 | 64 |
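The 122K figure can be sanity-checked from the channel widths listed above. The sketch below assumes that every stage is a pair of 3×3 convolutions with biases (the usual U-Net double-conv block) and that there are no normalization parameters. Neither the kernel size nor the number of convolutions per stage is stated in this card, so `STAGES`, `conv_params`, and `double_conv_params` are illustrative names, not the actual layer list.

```python
def conv_params(c_in, c_out, k=3):
    """Weights plus biases of a single k×k convolution."""
    return c_in * c_out * k * k + c_out


def double_conv_params(c_in, c_out, k=3):
    """Assumed stage layout: conv(c_in→c_out) followed by conv(c_out→c_out)."""
    return conv_params(c_in, c_out, k) + conv_params(c_out, c_out, k)


# (in_channels, out_channels) per stage, taken from the lists above.
# Decoder inputs are upsampled features concatenated with the skip connection.
STAGES = [
    (1, 8), (8, 16), (16, 32),     # encoder
    (32, 64),                      # bottleneck
    (96, 32), (48, 16), (24, 8),   # decoder
]

total = sum(double_conv_params(c_in, c_out) for c_in, c_out in STAGES)
total += conv_params(8, 1, k=1)    # 1×1 output convolution

print(total)  # 121969
```

Under these assumptions the count lands within a few dozen parameters of the stated 122K, which suggests (but does not prove) that the stages use double 3×3 convolutions.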
### Tiling & Reassembly
Each 512×512 input mask is decomposed into a 16×16 grid of overlapping 64×64 tiles (256 total):
- **Overlap**: 16-pixel margin on every side of each tile, with reflection padding supplying the margin at the mask boundary
- **Usable core**: 32×32 center pixels per tile
- **Batch processing**: 256 tiles → 4 sequential batches of 64
Reassembly uses only differentiable operations (slice, reshape, permute) to enable end-to-end gradient flow:
```
(256, 1, 64, 64) [all tiles]
↓ center-crop
(256, 1, 32, 32) [usable cores]
↓ reshape + permute
(1, 1, 512, 512) [full mask]
```
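The shape flow above can be reproduced with a few array operations. This is a minimal sketch that uses NumPy as a stand-in for the PyTorch tensors; in PyTorch the same steps would be slicing, `reshape`, and `permute`. It assumes the whole mask is reflection-padded by 16 pixels, tiles are cut with a stride of 32, and tiles are stored in row-major order. The function names `tile_mask` and `reassemble` are hypothetical.

```python
import numpy as np

TILE, CORE, MARGIN, GRID = 64, 32, 16, 16  # sizes from this section


def tile_mask(mask):
    """Split a (512, 512) mask into (256, 1, 64, 64) overlapping tiles."""
    padded = np.pad(mask, MARGIN, mode="reflect")  # (544, 544)
    tiles = [
        padded[r * CORE:r * CORE + TILE, c * CORE:c * CORE + TILE]
        for r in range(GRID)
        for c in range(GRID)
    ]
    return np.stack(tiles)[:, None]  # row-major tile order, channel axis added


def reassemble(tiles):
    """Center-crop every tile and stitch the cores into (1, 1, 512, 512)."""
    cores = tiles[:, 0, MARGIN:MARGIN + CORE, MARGIN:MARGIN + CORE]  # (256, 32, 32)
    grid = cores.reshape(GRID, GRID, CORE, CORE)   # (tile_row, tile_col, y, x)
    rows = grid.transpose(0, 2, 1, 3)              # (tile_row, y, tile_col, x)
    return rows.reshape(1, 1, GRID * CORE, GRID * CORE)


mask = np.random.default_rng(0).random((512, 512))
tiles = tile_mask(mask)
batches = np.split(tiles, 4)                 # 4 sequential batches of 64 tiles
restored = reassemble(np.concatenate(batches))
```

With no network in between, `restored[0, 0]` equals `mask` exactly, which is a convenient check that the crop and the permutation order agree with the tiling order.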
## Training
### Phase 1: Knowledge Distillation
- **Epochs**: 16
- **Input**: 64×64 crops
- **Loss**: α-blended (α decays 0.7→0)
```
L = α·MSE(student, teacher) + (1-α)·MSE(student, ground_truth)
```
- **Optimizer**: Adam (lr=1e-3), cosine-annealing schedule
- **Teacher**: Frozen full-size NeuralILT model
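The Phase 1 schedule can be sketched in plain Python. The card gives only the endpoints of the α decay and the name of the learning-rate schedule, so the linear shape of `alpha_at` and the `t_max` and `eta_min` values in `cosine_lr` are assumptions. The losses operate on flat lists of pixel values purely for illustration.

```python
import math

EPOCHS = 16
ALPHA_START = 0.7
BASE_LR = 1e-3


def alpha_at(epoch):
    """Assumed linear decay: 0.7 at epoch 0, 0 at the final epoch."""
    return ALPHA_START * (1 - epoch / (EPOCHS - 1))


def cosine_lr(epoch, t_max=EPOCHS, eta_min=0.0):
    """Closed-form cosine annealing from BASE_LR towards eta_min."""
    return eta_min + 0.5 * (BASE_LR - eta_min) * (1 + math.cos(math.pi * epoch / t_max))


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distill_loss(student, teacher, ground_truth, alpha):
    """L = α·MSE(student, teacher) + (1-α)·MSE(student, ground_truth)."""
    return alpha * mse(student, teacher) + (1 - alpha) * mse(student, ground_truth)


student, teacher, truth = [0.2, 0.8], [0.1, 0.9], [0.0, 1.0]
for epoch in (0, EPOCHS // 2, EPOCHS - 1):
    a = alpha_at(epoch)
    print(epoch, round(a, 3), cosine_lr(epoch), distill_loss(student, teacher, truth, a))
```

Early epochs weight the teacher's output most heavily; by the last epoch the student is trained against the ground truth alone.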
### Phase 2: Physics-Informed Fine-Tuning
- **Epochs**: 4
- **Pipeline**: Full tiled forward pass through differentiable lithography simulator
- **Loss**: Print fidelity + process variation
```
L = MSE(P_nom, target) + MSE(P_max, P_min)
```
- **Optimizer**: Adam (lr=1e-4), StepLR (γ=0.1 at epoch 2)
- **Gradients**: Flow through tiled reassembly to all network weights
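The Phase 2 objective and learning-rate step can be sketched the same way. The card does not define `P_nom`, `P_max`, and `P_min`; the sketch reads them as the simulator's printed images at the nominal and the two extreme process conditions, which is an assumption. The lithography simulator itself is out of scope here, so the inputs are hand-written flat lists.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def phase2_loss(p_nom, p_max, p_min, target):
    """L = MSE(P_nom, target) + MSE(P_max, P_min)."""
    fidelity = mse(p_nom, target)          # print fidelity at nominal conditions
    process_variation = mse(p_max, p_min)  # spread between process extremes
    return fidelity + process_variation


def step_lr(epoch, base_lr=1e-4, gamma=0.1, step_epoch=2):
    """Single learning-rate drop by gamma starting at step_epoch."""
    return base_lr * (gamma if epoch >= step_epoch else 1.0)


target = [1.0, 0.0, 1.0, 0.0]
p_nom = [0.9, 0.1, 0.8, 0.0]
p_max = [1.0, 0.2, 0.9, 0.1]
p_min = [0.8, 0.0, 0.7, 0.0]

for epoch in range(4):
    print(epoch, step_lr(epoch), phase2_loss(p_nom, p_max, p_min, target))
```

The second term is zero only when the printed image is identical at both process extremes, so minimizing it pushes the mask towards robustness against process variation.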
## Code Organization
```
hls4ml_penumbra/
├── firmware/ # Generated HLS C++ project
│ ├── myproject.cpp # Top-level module
│ ├── myproject.h # Interface & config
│ ├── weights/ # Quantized weights
│ ├── ap_types/ # Xilinx AP types (ap_fixed, ap_int)
│ └── utils/ # HLS utilities
├── myproject_prj/ # Vivado HLS project
│ └── solution1/
│ └── impl/ # Implementation artifacts
├── logs/ # Build logs
└── [HLS build outputs]
```
---
**Author**: Roberto Treviño Cervantes