---
license: mit
library_name: pytorch
tags:
- image-restoration
- multi-frame
- deformable-convolution
- temporal-fusion
- feature-level-alignment
- super-resolution
- astronomy
- satellite-images-refactoring
- photography
- denoise
- deblur
- novel-architecture
- from-scratch
- efficient
- research
language:
- en
pipeline_tag: image-to-image
---
# MFIR - Multi-Frame Image Restoration
A PyTorch model for multi-frame image restoration through temporal fusion and feature-level alignment. MFIR aligns and fuses features from multiple degraded frames to produce a high-quality restored image.
## Model Description
MFIR takes 2-16 degraded frames of the same scene and combines them into a single high-quality output. Single-image restoration methods struggle with heavily degraded inputs; MFIR instead exploits complementary information across frames. Each frame captures slightly different details, and the model learns to extract and merge the best parts of each.
### Architecture
```
Input Frames (B, N, 3, H, W)
           │
           ▼
┌─────────────────────┐
│   Shared Encoder    │  ResNet-style feature extraction
└─────────────────────┘
           │
           ▼
┌─────────────────────┐
│     Deformable      │  Align frames using learned offsets
│      Alignment      │  (3-layer cascade)
└─────────────────────┘
           │
           ▼
┌─────────────────────┐
│ Temporal Attention  │  Multi-head attention fusion
│       Fusion        │  (4 heads)
└─────────────────────┘
           │
           ▼
┌─────────────────────┐
│       Decoder       │  PixelShuffle upsampling
└─────────────────────┘
           │
           ▼
Output (B, 3, H, W)
```
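To make the diagram concrete, the following sketch traces tensor shapes through each stage, using the 4x downsampling and 4x upsampling factors listed under Key Components. The function name `trace_shapes`, the feature width of 256 (taken as the last entry of the encoder channel list), and the rejection of sizes that are not multiples of 4 are assumptions for illustration; the actual model may handle these details differently.

```python
def trace_shapes(batch, num_frames, height, width, feat_channels=256, scale=4):
    """Return the tensor shape after each stage of the pipeline.

    feat_channels and scale are illustrative assumptions: 256 is the last
    entry of the encoder channel list, and 4 is the encoder downsampling /
    decoder upsampling factor named in this README.
    """
    if height % scale or width % scale:
        # Assumption: this sketch rejects sizes the 4x stages cannot invert.
        raise ValueError(f"height and width should be multiples of {scale}")
    h, w = height // scale, width // scale
    return {
        "input": (batch, num_frames, 3, height, width),
        "encoded": (batch, num_frames, feat_channels, h, w),
        "aligned": (batch, num_frames, feat_channels, h, w),
        "fused": (batch, feat_channels, h, w),  # frame axis collapsed
        "output": (batch, 3, height, width),
    }


shapes = trace_shapes(batch=1, num_frames=5, height=256, width=256)
for stage, shape in shapes.items():
    print(f"{stage:>8}: {shape}")
```

The frame axis `N` survives encoding and alignment and disappears only at the fusion stage, which is why the output has the same shape as a single input frame.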
### Key Components
| Component | Description |
|-----------|-------------|
| **Shared Encoder** | Multi-scale feature extraction with residual blocks. 4x spatial downsampling. |
| **Deformable Alignment** | Cascaded deformable convolutions (3 layers) that align each frame's features to the reference frame. More robust than optical flow for degraded inputs. |
| **Temporal Attention Fusion** | Multi-head attention (4 heads) in which the reference frame is the query and all frames are the keys/values. Learns per-pixel frame contributions. |
| **Decoder** | Progressive upsampling with PixelShuffle (2 stages, 4x total). |
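The temporal attention fusion row can be illustrated with a minimal sketch: the reference frame supplies the query, every frame supplies keys and values, and a softmax across the frame axis yields per-pixel, per-head weights. This is not the repository's implementation; the class name `TemporalAttentionFusion`, the 1x1 projection layers, and the channel count used in the example are assumptions.

```python
import torch
import torch.nn as nn


class TemporalAttentionFusion(nn.Module):
    """Fuse N aligned feature maps per pixel; the reference frame is the query."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        if channels % num_heads:
            raise ValueError("channels must be divisible by num_heads")
        self.num_heads = num_heads
        self.head_dim = channels // num_heads
        # 1x1 convolutions act as per-pixel linear projections.
        self.to_q = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_k = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_v = nn.Conv2d(channels, channels, kernel_size=1)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feats: torch.Tensor, ref_idx: int = 0):
        b, n, c, h, w = feats.shape
        heads, d = self.num_heads, self.head_dim
        q = self.to_q(feats[:, ref_idx]).view(b, 1, heads, d, h, w)
        flat = feats.reshape(b * n, c, h, w)
        k = self.to_k(flat).view(b, n, heads, d, h, w)
        v = self.to_v(flat).view(b, n, heads, d, h, w)
        # Scaled dot product over the head dimension gives one score per
        # frame, head, and pixel: (b, n, heads, h, w).
        scores = (q * k).sum(dim=3) / d ** 0.5
        weights = scores.softmax(dim=1)  # normalise across frames
        fused = (weights.unsqueeze(3) * v).sum(dim=1)  # (b, heads, d, h, w)
        return self.proj(fused.reshape(b, c, h, w)), weights


fusion = TemporalAttentionFusion(channels=64, num_heads=4)
feats = torch.rand(2, 5, 64, 16, 16)  # (B, N, C, H/4, W/4)
fused, weights = fusion(feats, ref_idx=0)
print(fused.shape, weights.shape)
```

Returning `weights` alongside the fused features makes it possible to inspect which frame dominates at each pixel, which is one way to read "per-pixel frame contributions".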
## Usage
### Installation
```bash
pip install torch torchvision huggingface_hub
```
### Inference
```python
import torch
from huggingface_hub import hf_hub_download

# Model architecture code available at:
# https://github.com/marduk-ra/MFIR
from model import FeatureFusionModel, FeatureFusionConfig

# Fall back to CPU when no GPU is present
device = "cuda" if torch.cuda.is_available() else "cpu"

# Download checkpoint
checkpoint_path = hf_hub_download(
    repo_id="marduk-ra/MFIR",
    filename="temporal_fusion_model.pth",
)

# Load model. weights_only=False unpickles arbitrary Python objects,
# so only load checkpoints from a source you trust.
ckpt = torch.load(checkpoint_path, map_location=device, weights_only=False)
config = FeatureFusionConfig.from_dict(ckpt["config"])
model = FeatureFusionModel(config)
model.load_state_dict(ckpt["state_dict"])
model.to(device).eval()

# Inference
# frames: (batch, num_frames, 3, height, width) tensor in [0, 1],
# supplied by the caller
with torch.no_grad():
    result = model(frames.to(device), ref_idx=0)
    output = result["output"]  # (batch, 3, height, width)
```
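The inference snippet expects a `frames` tensor of shape `(batch, num_frames, 3, height, width)` with values in `[0, 1]`. A minimal sketch of building one from a list of per-frame tensors is shown below; the helper name `stack_frames` is hypothetical, and random tensors stand in for decoded images.

```python
import torch


def stack_frames(frames, min_frames=2, max_frames=16):
    """Stack a list of (3, H, W) tensors into a (1, N, 3, H, W) batch in [0, 1]."""
    if not min_frames <= len(frames) <= max_frames:
        raise ValueError(
            f"expected {min_frames}-{max_frames} frames, got {len(frames)}"
        )
    if any(f.shape != frames[0].shape for f in frames):
        raise ValueError("all frames must share the same (3, H, W) shape")
    stacked = torch.stack(frames, dim=0).unsqueeze(0)  # add the batch axis
    if stacked.dtype == torch.uint8:
        # 8-bit images are in [0, 255]; rescale to [0, 1].
        stacked = stacked.float() / 255.0
    return stacked.float().clamp(0.0, 1.0)


# Random stand-ins for five decoded RGB frames of the same scene.
dummy = [torch.randint(0, 256, (3, 64, 64), dtype=torch.uint8) for _ in range(5)]
frames = stack_frames(dummy)
print(frames.shape)  # torch.Size([1, 5, 3, 64, 64])
```

The frame-count bounds mirror the Min Frames and Max Frames values in the Model Details table.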
### Web Demo
Try the model directly in your browser:
🚀 **[Hugging Face Space Demo](https://huggingface.co/spaces/marduk-ra/MFIR)**
## Model Details
| Parameter | Value |
|-----------|-------|
| Input Channels | 3 (RGB) |
| Output Channels | 3 (RGB) |
| Max Frames | 16 |
| Min Frames | 2 |
| Encoder Channels | [64, 128, 256] |
| Deformable Groups | 8 |
| Deformable Layers | 3 |
| Attention Heads | 4 |
| Fusion Type | Attention |
| Parameters | ~10M |
| Checkpoint Size | 42 MB |
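Two quick sanity checks on the table, as a rough sketch. The float32 storage format and the 3x3 deformable kernel size are assumptions; neither is stated in this README.

```python
# Back-of-the-envelope checks on the Model Details table.
params = 10_000_000          # "~10M" from the table
bytes_per_param = 4          # assumption: weights stored as float32
weights_mb = params * bytes_per_param / 1e6
print(f"float32 weights: ~{weights_mb:.0f} MB")

# Assumption: 3x3 deformable kernels. Each deformable group predicts an
# (x, y) offset for every kernel tap at every output pixel.
deform_groups = 8
kernel_h = kernel_w = 3
offset_channels = 2 * deform_groups * kernel_h * kernel_w
print(f"offset channels per deformable layer: {offset_channels}")
```

Under these assumptions the weights alone account for about 40 MB, which is in line with the 42 MB checkpoint; the remainder would plausibly be the stored config and any extra tensors.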
## Example
**Input Frames (5 degraded images):**
<p>
<img src="photos/inputs/input1.png" width="150" />
<img src="photos/inputs/input2.png" width="150" />
<img src="photos/inputs/input3.png" width="150" />
<img src="photos/inputs/input4.png" width="150" />
<img src="photos/inputs/input5.png" width="150" />
</p>
**Output (restored):**
<img src="photos/output.png" width="400" />
*5 degraded input frames are fused into a single high-quality output.*
The model works best when:
- Frames have slight variations (different noise patterns, blur, etc.)
- Frames are roughly aligned (same scene)
- Input resolution matches one of the training resolutions (128, 256, 512, or 1024 pixels)
## Training
The model was trained on a custom dataset with the following specifications:
**Dataset:**
- 16,000 high-resolution source images
- Each image was used to generate 8 degraded input frames
- Multi-scale training: 128, 256, 512, and 1024 pixel resolutions
**Degradation Pipeline:**
- Random spatial shifts (simulating camera shake)
- Motion blur with varying kernel sizes and directions
- Gaussian and Poisson noise with random intensity
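The three degradation steps above can be sketched as follows. This is an illustrative approximation, not the training code: the function name `degrade`, all parameter values, the wrap-around shift, and the restriction to horizontal or vertical blur are assumptions.

```python
import torch
import torch.nn.functional as F


def degrade(clean, max_shift=4, blur_len=7, gauss_sigma=0.03, photons=200.0, gen=None):
    """Create one degraded frame from a clean (3, H, W) image in [0, 1]."""
    # 1. Random spatial shift. torch.roll wraps around at the borders, which
    #    is a simplification of real camera shake.
    dy, dx = torch.randint(-max_shift, max_shift + 1, (2,), generator=gen).tolist()
    img = torch.roll(clean, shifts=(dy, dx), dims=(1, 2))

    # 2. Motion blur: a normalised line kernel, horizontal or vertical.
    #    blur_len is assumed odd so the padding preserves the image size.
    kernel = torch.zeros(blur_len, blur_len)
    if torch.rand(1, generator=gen).item() < 0.5:
        kernel[blur_len // 2, :] = 1.0 / blur_len
    else:
        kernel[:, blur_len // 2] = 1.0 / blur_len
    weight = kernel.expand(3, 1, blur_len, blur_len).contiguous()
    img = F.conv2d(img.unsqueeze(0), weight, padding=blur_len // 2, groups=3)[0]

    # 3. Poisson (shot) noise followed by additive Gaussian noise.
    img = torch.poisson(img.clamp(min=0) * photons, generator=gen) / photons
    img = img + gauss_sigma * torch.randn(img.shape, generator=gen)
    return img.clamp(0.0, 1.0)


gen = torch.Generator().manual_seed(0)
clean = torch.rand(3, 64, 64, generator=gen)
# Eight degraded frames per source image, matching the dataset description.
frames = torch.stack([degrade(clean, gen=gen) for _ in range(8)])
print(frames.shape)
```

Because each call draws its own shift, blur direction, and noise, the eight frames carry different corruptions of the same content, which is the property the fusion stage relies on.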
**Training Configuration:**
- Total epochs: 150 (progressive training)
- Optimizer: AdamW
- Loss: L1 + Perceptual (VGG) + SSIM + Color Correction
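As a rough sketch of how such a composite loss is typically assembled, the example below combines the L1 and SSIM terms only. The perceptual (VGG) and color correction terms are omitted because their definitions are not given here. The uniform-window SSIM, the function names, and the weights `w_l1` and `w_ssim` are assumptions, not the author's settings.

```python
import torch
import torch.nn.functional as F


def ssim(x, y, window=7, c1=0.01 ** 2, c2=0.03 ** 2):
    """Mean SSIM with a uniform window, for (B, C, H, W) images in [0, 1]."""
    pad = window // 2
    mu_x = F.avg_pool2d(x, window, stride=1, padding=pad)
    mu_y = F.avg_pool2d(y, window, stride=1, padding=pad)
    var_x = F.avg_pool2d(x * x, window, stride=1, padding=pad) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, window, stride=1, padding=pad) - mu_y ** 2
    cov = F.avg_pool2d(x * y, window, stride=1, padding=pad) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return (num / den).mean()


def restoration_loss(pred, target, w_l1=1.0, w_ssim=0.2):
    """Weighted L1 + (1 - SSIM). The weights are illustrative placeholders."""
    l1 = F.l1_loss(pred, target)
    return w_l1 * l1 + w_ssim * (1.0 - ssim(pred, target))


torch.manual_seed(0)
target = torch.rand(2, 3, 32, 32)
pred = (target + 0.1 * torch.randn_like(target)).clamp(0.0, 1.0)
loss_same = restoration_loss(target, target)
loss_noisy = restoration_loss(pred, target)
print(float(loss_same), float(loss_noisy))
```

The SSIM term is written as `1 - SSIM` so that, like L1, it is zero for a perfect reconstruction and grows as structure is lost.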
## Limitations
- Requires multiple frames of the same scene
- Performance depends on frame quality variation
- GPU recommended for real-time processing
## Citation
```bibtex
@software{karaarslan2026mfir,
author = {Karaarslan, Veli},
title = {MFIR: Multi-Frame Image Restoration},
year = {2026},
url = {https://github.com/allcodernet/MFIR}
}
```
## License
MIT License - see [LICENSE](https://github.com/allcodernet/MFIR/blob/main/LICENSE)
## Author
**Veli Karaarslan** - 2026
## Links
- [GitHub Repository](https://github.com/allcodernet/MFIR)
- [Hugging Face Space](https://huggingface.co/spaces/marduk-ra/MFIR)