---
language:
  - en
library_name: hls4ml
datasets:
  - lithobench
tags:
  - pytorch
  - hls4ml
  - fpga
  - neural-network
  - quantization
  - xilinx
  - mask-optimization
  - lithography
  - inverse-lithography
license: gpl
---

# Penumbra UNet: FPGA-Accelerated Mask Optimization

A compressed U-Net neural network for on-chip FPGA acceleration of Inverse Lithography Technology (ILT) mask optimization, targeting the Xilinx VU47P (AWS F2).

## Overview

Penumbra UNet compresses a full-size teacher network by roughly 64× (7.8M → 122K parameters) so that the model fits entirely in on-chip BRAM. This enables a fully on-chip dataflow with no external DRAM access.

## Architecture

### Network Structure

U-Net encoder-decoder with extreme parameter compression:

**Encoder:**
- Conv 1→8 channels, 64×64 + MaxPool → 32×32
- Conv 8→16 channels, 32×32 + MaxPool → 16×16
- Conv 16→32 channels, 16×16 + MaxPool → 8×8

**Bottleneck:**
- Conv 32→64 channels, 8×8

**Decoder:**
- Upsample + skip concatenation + Conv 96→32 channels, 16×16
- Upsample + skip concatenation + Conv 48→16 channels, 32×32
- Upsample + skip concatenation + Conv 24→8 channels, 64×64

**Output:**
- Conv 1×1 + Sigmoid → 64×64

**Compression summary:**

| Metric | Full model | Penumbra UNet |
|--------|-----------|----------|
| Parameters | 7.8M | 122K |
| Input tile | 512×512 | 64×64 |
| Max channels | 512 | 64 |
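
As a rough consistency check on the 122K figure, the sketch below counts convolution parameters for the channel widths listed above. The layer layout is an assumption: the card lists one Conv per stage, and the count only lands near 122K if each stage is taken as two 3×3 convolutions with biases and no normalization parameters. `STAGES`, `conv_params`, and `count_parameters` are illustrative names, not part of the released code.

```python
# Assumptions (not stated in the model card): each stage is two 3x3
# convolutions with biases, and there are no normalization parameters.
STAGES = [
    ("enc1", 1, 8), ("enc2", 8, 16), ("enc3", 16, 32),
    ("bottleneck", 32, 64),
    ("dec1", 96, 32), ("dec2", 48, 16), ("dec3", 24, 8),
]

def conv_params(c_in, c_out, kernel=3):
    """Weights plus biases of one 2D convolution."""
    return kernel * kernel * c_in * c_out + c_out

def count_parameters():
    total = 0
    for _name, c_in, c_out in STAGES:
        total += conv_params(c_in, c_out)   # first conv changes the channel count
        total += conv_params(c_out, c_out)  # assumed second conv keeps it
    total += conv_params(8, 1, kernel=1)    # 1x1 output conv
    return total

print(count_parameters())  # 121969
```

Under those assumptions the total is 121,969, which rounds to the stated 122K, and 7.8M / 122K ≈ 64. With a single 3×3 convolution per stage the same count gives 60,737, so the two-convolution reading is the one consistent with the table.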

### Tiling & Reassembly

Each 512×512 input mask is decomposed into a 16×16 grid of overlapping 64×64 tiles (256 total):
- **Overlap**: 16 pixels on each side of the tile core; reflection padding handles the mask boundary
- **Usable core**: 32×32 center pixels per tile (16 × 32 = 512 pixels per side)
- **Batch processing**: 256 tiles → 4 sequential batches of 64

Reassembly uses only differentiable operations (slice, reshape, permute) to enable end-to-end gradient flow:
```
(256, 1, 64, 64)  [all tiles]
    ↓ center-crop
(256, 1, 32, 32)  [usable cores]
    ↓ reshape + permute
(1, 1, 512, 512)  [full mask]
```
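
The shape flow above can be sketched as follows. This is a minimal illustration that assumes a stride of 32 between tiles and uses NumPy as a stand-in for the tensor library, with `transpose` playing the role of `permute`; `split_into_tiles` and `reassemble` are hypothetical names, not functions from this repository.

```python
import numpy as np

TILE, CORE, PAD, GRID = 64, 32, 16, 16  # sizes taken from the text above

def split_into_tiles(mask):
    """Reflect-pad a (512, 512) mask and cut overlapping 64x64 tiles at stride 32."""
    padded = np.pad(mask, PAD, mode="reflect")  # (544, 544)
    tiles = [
        padded[r * CORE : r * CORE + TILE, c * CORE : c * CORE + TILE]
        for r in range(GRID)
        for c in range(GRID)
    ]
    return np.stack(tiles)[:, None]  # (256, 1, 64, 64)

def reassemble(tiles):
    """Center-crop each tile, then reshape + transpose back into one mask."""
    cores = tiles[:, :, PAD : PAD + CORE, PAD : PAD + CORE]  # (256, 1, 32, 32)
    grid = cores.reshape(GRID, GRID, CORE, CORE)             # (row, col, y, x)
    full = grid.transpose(0, 2, 1, 3).reshape(GRID * CORE, GRID * CORE)
    return full[None, None]                                  # (1, 1, 512, 512)

mask = np.random.rand(512, 512)
tiles = split_into_tiles(mask)
batches = np.array_split(tiles, 4)  # four sequential batches of 64 tiles
restored = reassemble(tiles)
```

With an identity network in between, `restored[0, 0]` equals `mask` exactly, which is a convenient sanity check for the tile ordering before a real model is inserted.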

## Training

### Phase 1: Knowledge Distillation
- **Epochs**: 16
- **Input**: 64×64 crops
- **Loss**: α-blended (α decays 0.7→0)
  ```
  L = α·MSE(student, teacher) + (1-α)·MSE(student, ground_truth)
  ```
- **Optimizer**: Adam (lr=1e-3), cosine-annealing schedule
- **Teacher**: Frozen full-size NeuralILT model
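
A minimal sketch of the α-blended loss follows. The card gives only the endpoints of the decay (0.7 → 0), so the linear shape used in `alpha_at` is an assumption, and all function names are illustrative. Predictions are flat Python sequences here to keep the example dependency-free.

```python
def alpha_at(epoch, num_epochs=16, alpha_start=0.7):
    """Blend weight for an epoch; linear decay to 0 is an assumed shape."""
    return alpha_start * (1 - epoch / (num_epochs - 1))

def mse(a, b):
    """Mean squared error over two equal-length flat sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distillation_loss(student, teacher, ground_truth, alpha):
    """L = alpha * MSE(student, teacher) + (1 - alpha) * MSE(student, ground_truth)."""
    return alpha * mse(student, teacher) + (1 - alpha) * mse(student, ground_truth)

student = [0.2, 0.8, 0.5]
teacher = [0.1, 0.9, 0.6]
ground_truth = [0.0, 1.0, 1.0]

for epoch in (0, 8, 15):
    a = alpha_at(epoch)
    print(epoch, round(a, 3), distillation_loss(student, teacher, ground_truth, a))
```

Early epochs weight the teacher's output more heavily, and by the final epoch the loss reduces to plain MSE against the ground truth.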

### Phase 2: Physics-Informed Fine-Tuning
- **Epochs**: 4
- **Pipeline**: Full tiled forward pass through differentiable lithography simulator
- **Loss**: Print fidelity (nominal print vs. target) + process variation (spread between the max and min prints)
  ```
  L = MSE(P_nom, target) + MSE(P_max, P_min)
  ```
- **Optimizer**: Adam (lr=1e-4), StepLR (γ=0.1 at epoch 2)
- **Gradients**: Flow through tiled reassembly to all network weights
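
The Phase 2 loss and learning-rate schedule can be sketched in a few lines. This assumes that `P_nom`, `P_max`, and `P_min` are the simulated printed images at nominal and extreme process conditions, and that "γ=0.1 at epoch 2" corresponds to a step size of 2; both are readings of the card rather than confirmed details. The lithography simulator itself is not modeled, and the function names are illustrative.

```python
def mse(a, b):
    """Mean squared error over two equal-length flat sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def physics_loss(p_nom, p_max, p_min, target):
    """L = MSE(P_nom, target) + MSE(P_max, P_min)."""
    fidelity = mse(p_nom, target)   # how well the nominal print matches the target
    variation = mse(p_max, p_min)   # how much the print changes across conditions
    return fidelity + variation

def step_lr(epoch, base_lr=1e-4, gamma=0.1, step_size=2):
    """StepLR-style schedule: multiply by gamma every step_size epochs (assumed 2)."""
    return base_lr * gamma ** (epoch // step_size)

# Toy 4-pixel "images" standing in for simulator outputs.
target = [0.0, 1.0, 1.0, 0.0]
p_nom = [0.0, 1.0, 0.5, 0.0]
p_max = [0.0, 1.0, 1.0, 0.5]
p_min = [0.0, 0.5, 0.5, 0.0]

loss = physics_loss(p_nom, p_max, p_min, target)
schedule = [step_lr(epoch) for epoch in range(4)]
```

For the toy values, the fidelity term is 0.0625 and the variation term is 0.1875, and the schedule holds 1e-4 for epochs 0 and 1 before dropping to about 1e-5 for epochs 2 and 3.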

## Code Organization

```
hls4ml_penumbra/
├── firmware/           # Generated HLS C++ project
│   ├── myproject.cpp   # Top-level module
│   ├── myproject.h     # Interface & config
│   ├── weights/        # Quantized weights
│   ├── ap_types/       # Xilinx AP types (ap_fixed, ap_int)
│   └── utils/          # HLS utilities
├── myproject_prj/      # Vivado HLS project
│   └── solution1/
│       └── impl/       # Implementation artifacts
├── logs/               # Build logs
└── [HLS build outputs]
```

---

**Author**: Roberto Treviño Cervantes