| --- |
| language: |
| - en |
| library_name: hls4ml |
| datasets: |
| - lithobench |
| tags: |
| - pytorch |
| - hls4ml |
| - fpga |
| - neural-network |
| - quantization |
| - xilinx |
| - mask-optimization |
| - lithography |
| - inverse-lithography |
| license: gpl |
| --- |
| |
| # Penumbra UNet: FPGA-Accelerated Mask Optimization |
|
|
| A compressed U-Net neural network for on-chip FPGA acceleration of Inverse Lithography Technology (ILT) mask optimization, targeting the Xilinx VU47P (AWS F2). |
|
|
| ## Overview |
|
|
| Penumbra UNet compresses a full-size teacher network by roughly 64× (7.8M → 122K parameters) so that all weights fit in on-chip BRAM. This allows a fully on-chip dataflow with no external DRAM access. |
|
|
| ## Architecture |
|
|
| ### Network Structure |
|
|
| A U-Net encoder-decoder with three downsampling stages and channel widths capped at 64: |
|
|
| **Encoder:** |
| - Conv 1→8 channels, 64×64 + MaxPool → 32×32 |
| - Conv 8→16 channels, 32×32 + MaxPool → 16×16 |
| - Conv 16→32 channels, 16×16 + MaxPool → 8×8 |
|
|
| **Bottleneck:** |
| - Conv 32→64 channels, 8×8 |
|
|
| **Decoder:** |
| - Upsample + skip concatenation + Conv 96→32 channels, 16×16 |
| - Upsample + skip concatenation + Conv 48→16 channels, 32×32 |
| - Upsample + skip concatenation + Conv 24→8 channels, 64×64 |
|
|
| **Output:** |
| - Conv 1×1 + Sigmoid → 64×64 |
|
|
| **Compression summary:** |
| | Metric | Full model | Penumbra UNet | |
| |--------|-----------|----------| |
| | Parameters | 7.8M | 122K | |
| | Input tile | 512×512 | 64×64 | |
| | Max channels | 512 | 64 | |
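
The 122K figure can be sanity-checked from the channel widths listed above. The sketch below is only an estimate under assumptions the card does not state: each stage is taken to be two 3×3 convolutions with bias, upsampling is taken to be parameter-free, and no normalization parameters are counted. The helper names are illustrative.

```python
def conv_params(c_in, c_out, k=3, bias=True):
    """Number of parameters in one k x k convolution layer."""
    return c_in * c_out * k * k + (c_out if bias else 0)


def double_conv_params(c_in, c_out):
    """Assumed block: two 3x3 convs (c_in -> c_out, then c_out -> c_out)."""
    return conv_params(c_in, c_out) + conv_params(c_out, c_out)


# (in_channels, out_channels) per stage, taken from the lists above:
# three encoder stages, the bottleneck, then three decoder stages whose
# inputs are the upsampled features concatenated with the skip connection.
STAGES = [(1, 8), (8, 16), (16, 32), (32, 64), (96, 32), (48, 16), (24, 8)]


def total_params():
    blocks = sum(double_conv_params(c_in, c_out) for c_in, c_out in STAGES)
    head = conv_params(8, 1, k=1)  # 1x1 output convolution
    return blocks + head


print(total_params())  # 121969
```

Under these assumptions the total is 121,969, which matches the 122K in the table. A single convolution per stage would give about 61K instead.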
|
|
| ### Tiling & Reassembly |
|
|
| Input 512×512 masks are decomposed into a 16×16 grid of overlapping 64×64 tiles (256 total): |
| - **Overlap**: 16-pixel reflection padding for boundary handling |
| - **Usable core**: 32×32 center pixels per tile |
| - **Batch processing**: 256 tiles → 4 sequential batches of 64 |
|
|
| Reassembly uses only differentiable operations (slice, reshape, permute) to enable end-to-end gradient flow: |
| ``` |
| (256, 1, 64, 64) [all tiles] |
| ↓ center-crop |
| (256, 1, 32, 32) [usable cores] |
| ↓ reshape + permute |
| (1, 1, 512, 512) [full mask] |
| ``` |
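
The shape flow above can be reproduced with a small NumPy sketch. It assumes the mask is reflection-padded by 16 pixels on every side and that tiles are taken at a 32-pixel stride in row-major order; the function names are hypothetical, and a training pipeline would use the equivalent tensor operations so that gradients can pass through.

```python
import numpy as np

TILE, CORE, PAD, GRID = 64, 32, 16, 16


def split_into_tiles(mask):
    """(512, 512) -> (256, 1, 64, 64) overlapping tiles in row-major order."""
    padded = np.pad(mask, PAD, mode="reflect")  # (544, 544)
    tiles = [
        padded[r * CORE : r * CORE + TILE, c * CORE : c * CORE + TILE]
        for r in range(GRID)
        for c in range(GRID)
    ]
    return np.stack(tiles)[:, None, :, :]


def reassemble(tiles):
    """(256, 1, 64, 64) -> (1, 1, 512, 512) using slice, reshape, transpose."""
    # Center-crop: keep the 32x32 core of each tile.
    cores = tiles[:, :, PAD : PAD + CORE, PAD : PAD + CORE]  # (256, 1, 32, 32)
    grid = cores.reshape(GRID, GRID, CORE, CORE)  # (row, col, y, x)
    full = grid.transpose(0, 2, 1, 3)  # (row, y, col, x)
    return full.reshape(1, 1, GRID * CORE, GRID * CORE)


mask = np.random.default_rng(0).random((512, 512))
tiles = split_into_tiles(mask)
restored = reassemble(tiles)
print(tiles.shape, restored.shape)
```

With an identity network in between, `reassemble(split_into_tiles(mask))` returns the original mask, which is a quick check that the tile ordering and the crop offsets agree.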
|
|
| ## Training |
|
|
| ### Phase 1: Knowledge Distillation |
| - **Epochs**: 16 |
| - **Input**: 64×64 crops |
| - **Loss**: α-blended mix of teacher and ground-truth MSE (α decays from 0.7 to 0) |
| ``` |
| L = α·MSE(student, teacher) + (1-α)·MSE(student, ground_truth) |
| ``` |
| - **Optimizer**: Adam (lr=1e-3), cosine-annealing schedule |
| - **Teacher**: Frozen full-size NeuralILT model |
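
As an illustration of the blended loss, here is a minimal NumPy sketch. The card gives only the endpoints of the α schedule, so the linear decay below is an assumption, and `alpha_at`, `mse`, and `distillation_loss` are hypothetical names.

```python
import numpy as np


def alpha_at(epoch, num_epochs=16, alpha_start=0.7):
    """Assumed linear decay: alpha_start at epoch 0, 0 at the last epoch."""
    return alpha_start * (1.0 - epoch / (num_epochs - 1))


def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.mean((a - b) ** 2))


def distillation_loss(student, teacher, ground_truth, alpha):
    """L = alpha * MSE(student, teacher) + (1 - alpha) * MSE(student, ground_truth)."""
    return alpha * mse(student, teacher) + (1.0 - alpha) * mse(student, ground_truth)


student = np.zeros((64, 64))
teacher = np.ones((64, 64))
ground_truth = np.full((64, 64), 2.0)
print(distillation_loss(student, teacher, ground_truth, alpha_at(0)))
```

Early epochs weight the teacher's output most heavily; by the final epoch the student is trained against the ground truth alone.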
|
|
| ### Phase 2: Physics-Informed Fine-Tuning |
| - **Epochs**: 4 |
| - **Pipeline**: Full tiled forward pass through differentiable lithography simulator |
| - **Loss**: Print fidelity plus a process-variation penalty |
| ``` |
| L = MSE(P_nom, target) + MSE(P_max, P_min) |
| ``` |
| - **Optimizer**: Adam (lr=1e-4), StepLR (γ=0.1 at epoch 2) |
| - **Gradients**: Flow through tiled reassembly to all network weights |
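
The Phase 2 objective and learning-rate schedule can be sketched as follows. This is illustrative only: `P_nom`, `P_max`, and `P_min` are read here as the simulated prints at the nominal and the two extreme process conditions, the step size of 2 epochs is inferred from "γ=0.1 at epoch 2", and the function names are hypothetical. The lithography simulator itself is not shown.

```python
import numpy as np


def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.mean((a - b) ** 2))


def physics_loss(p_nom, p_max, p_min, target):
    """L = MSE(P_nom, target) + MSE(P_max, P_min)."""
    fidelity = mse(p_nom, target)  # nominal print vs. target pattern
    variation = mse(p_max, p_min)  # spread between the two process extremes
    return fidelity + variation


def step_lr(epoch, base_lr=1e-4, gamma=0.1, step_size=2):
    """Assumed step schedule: multiply the rate by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


for epoch in range(4):
    print(epoch, step_lr(epoch))
```

Under these assumptions, epochs 0 and 1 run at 1e-4 and epochs 2 and 3 at roughly 1e-5.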
|
|
| ## Code Organization |
|
|
| ``` |
| hls4ml_penumbra/ |
| ├── firmware/ # Generated HLS C++ project |
| │ ├── myproject.cpp # Top-level module |
| │ ├── myproject.h # Interface & config |
| │ ├── weights/ # Quantized weights |
| │ ├── ap_types/ # Xilinx AP types (ap_fixed, ap_int) |
| │ └── utils/ # HLS utilities |
| ├── myproject_prj/ # Vivado HLS project |
| │ └── solution1/ |
| │ └── impl/ # Implementation artifacts |
| ├── logs/ # Build logs |
| └── [HLS build outputs] |
| ``` |
|
|
| --- |
|
|
| **Author**: Roberto Treviño Cervantes |
|
|