---
license: mit
library_name: pytorch
tags:
- image-restoration
- multi-frame
- deformable-convolution
- temporal-fusion
- feature-level-alignment
- super-resolution
- astronomy
- satellite-images-refactoring
- photography
- denoise
- deblur
- novel-architecture
- from-scratch
- efficient
- research
language:
- en
pipeline_tag: image-to-image
---

# MFIR - Multi-Frame Image Restoration

A PyTorch model for multi-frame image restoration through temporal fusion and feature-level alignment. MFIR aligns and fuses features from multiple degraded frames to produce a single high-quality restored image.

## Model Description

MFIR takes 2-16 degraded frames of the same scene and combines them into a single high-quality output. Unlike single-image restoration methods, which struggle with heavily degraded inputs, MFIR leverages complementary information across multiple frames: each frame captures slightly different details, and the model learns to extract and merge the best parts of each.

### Architecture

```
Input Frames (B, N, 3, H, W)
          │
          ▼
┌─────────────────────┐
│   Shared Encoder    │  ResNet-style feature extraction
└─────────────────────┘
          │
          ▼
┌─────────────────────┐
│     Deformable      │  Align frames using learned offsets
│     Alignment       │  (3-layer cascade)
└─────────────────────┘
          │
          ▼
┌─────────────────────┐
│ Temporal Attention  │  Multi-head attention fusion
│       Fusion        │  (4 heads)
└─────────────────────┘
          │
          ▼
┌─────────────────────┐
│       Decoder       │  PixelShuffle upsampling
└─────────────────────┘
          │
          ▼
Output (B, 3, H, W)
```

### Key Components

| Component | Description |
|-----------|-------------|
| **Shared Encoder** | Multi-scale feature extraction with residual blocks. 4x spatial downsampling. |
| **Deformable Alignment** | Cascaded deformable convolutions (3 layers) that align each frame to the reference. More robust than optical flow for degraded inputs. |
| **Temporal Attention Fusion** | Multi-head attention (4 heads) where the reference frame is the query and all frames are keys/values. Learns per-pixel frame contributions. |
| **Decoder** | Progressive upsampling with PixelShuffle (2 stages, 4x total). |

## Usage

### Installation

```bash
pip install torch torchvision huggingface_hub
```

### Inference

```python
import torch
from huggingface_hub import hf_hub_download

# Download checkpoint
checkpoint_path = hf_hub_download(
    repo_id="marduk-ra/MFIR",
    filename="temporal_fusion_model.pth",
)

# Load model
ckpt = torch.load(checkpoint_path, map_location="cuda", weights_only=False)

# Model architecture code available at:
# https://github.com/marduk-ra/MFIR
from model import FeatureFusionModel, FeatureFusionConfig

config = FeatureFusionConfig.from_dict(ckpt["config"])
model = FeatureFusionModel(config)
model.load_state_dict(ckpt["state_dict"])
model.eval()

# Inference
# frames: (batch, num_frames, 3, height, width) tensor in [0, 1]
with torch.no_grad():
    result = model(frames, ref_idx=0)
output = result["output"]  # (batch, 3, height, width)
```

### Web Demo

Try the model directly in your browser:

🚀 **[Hugging Face Space Demo](https://huggingface.co/spaces/marduk-ra/MFIR)**

## Model Details

| Parameter | Value |
|-----------|-------|
| Input Channels | 3 (RGB) |
| Output Channels | 3 (RGB) |
| Max Frames | 16 |
| Min Frames | 2 |
| Encoder Channels | [64, 128, 256] |
| Deformable Groups | 8 |
| Deformable Layers | 3 |
| Attention Heads | 4 |
| Fusion Type | Attention |
| Parameters | ~10M |
| Checkpoint Size | 42 MB |

## Example

**Input Frames (5 degraded images):**
*5 degraded input frames are fused into a single high-quality output.*
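To assemble an input like this, the degraded captures just need to be stacked into the `(batch, num_frames, 3, height, width)` tensor the model expects. A minimal sketch, using synthetic frames in place of real image files (`stack_frames` is an illustrative helper, not part of the released code):

```python
import torch

def stack_frames(frames):
    """Stack per-frame tensors into the (B, N, 3, H, W) batch MFIR expects.

    `frames` is a list of (3, H, W) float tensors already scaled to [0, 1],
    e.g. produced by torchvision.transforms.functional.to_tensor per image.
    """
    if not 2 <= len(frames) <= 16:
        raise ValueError("MFIR expects between 2 and 16 frames")
    # (N, 3, H, W) -> add a batch dimension -> (1, N, 3, H, W)
    return torch.stack(frames, dim=0).unsqueeze(0)

# Five synthetic 256x256 "captures" of the same scene
frames = [torch.rand(3, 256, 256) for _ in range(5)]
batch = stack_frames(frames)
print(batch.shape)  # torch.Size([1, 5, 3, 256, 256])
```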
The model works best when:
- Frames have slight variations (different noise patterns, blur, etc.)
- Frames are roughly aligned (same scene)
- Input resolution matches training resolution
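The temporal attention fusion stage described in the Key Components table can be sketched as follows. This is an illustrative re-implementation of the idea (the reference frame's feature vector queries all frames' feature vectors at each spatial location), not the released module; the class name, channel count, and shapes are assumptions:

```python
import torch
import torch.nn as nn

class TemporalAttentionFusion(nn.Module):
    """Per pixel, the reference frame is the query and all N frames are
    keys/values, so each frame's contribution is weighted per location."""

    def __init__(self, channels=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, feats, ref_idx=0):
        # feats: (B, N, C, h, w) aligned frame features
        B, N, C, h, w = feats.shape
        # Move spatial locations into the batch dimension: (B*h*w, N, C)
        kv = feats.permute(0, 3, 4, 1, 2).reshape(B * h * w, N, C)
        q = kv[:, ref_idx:ref_idx + 1, :]        # (B*h*w, 1, C)
        fused, _ = self.attn(q, kv, kv)          # (B*h*w, 1, C)
        return fused.reshape(B, h, w, C).permute(0, 3, 1, 2)  # (B, C, h, w)

fusion = TemporalAttentionFusion(channels=256, heads=4)
feats = torch.rand(1, 5, 256, 8, 8)  # 5 frames of encoded features
fused = fusion(feats, ref_idx=0)
print(fused.shape)  # torch.Size([1, 256, 8, 8])
```

Attending over the frame axis rather than the full spatial grid keeps the attention cost linear in image size, which is what makes per-pixel frame weighting practical at high resolution.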
## Training
The model was trained on a custom dataset with the following specifications:
**Dataset:**
- 16,000 high-resolution source images
- Each image was used to generate 8 degraded input frames
- Multi-scale training: 128, 256, 512, and 1024 pixel resolutions
**Degradation Pipeline:**
- Random spatial shifts (simulating camera shake)
- Motion blur with varying kernel sizes and directions
- Gaussian and Poisson noise with random intensity
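A minimal sketch of a degradation pipeline of this shape; the parameter values (shift range, kernel size, noise level) are illustrative, not the actual training settings:

```python
import torch
import torch.nn.functional as F

def degrade(frame, max_shift=4, kernel_size=7, sigma=0.02):
    """Random spatial shift -> motion blur -> Gaussian + Poisson noise,
    applied to a (3, H, W) tensor in [0, 1]."""
    # Random spatial shift (simulated camera shake)
    dx, dy = torch.randint(-max_shift, max_shift + 1, (2,)).tolist()
    frame = torch.roll(frame, shifts=(dy, dx), dims=(-2, -1))

    # Motion blur: a normalized line kernel, applied per channel
    kernel = torch.zeros(kernel_size, kernel_size)
    if torch.rand(1).item() < 0.5:
        kernel[kernel_size // 2, :] = 1.0   # horizontal streak
    else:
        kernel[:, kernel_size // 2] = 1.0   # vertical streak
    kernel /= kernel.sum()
    weight = kernel.repeat(3, 1, 1, 1)      # (3, 1, k, k), one per channel
    frame = F.conv2d(frame.unsqueeze(0), weight,
                     padding=kernel_size // 2, groups=3).squeeze(0)

    # Gaussian (read) noise plus Poisson (shot) noise
    frame = frame + sigma * torch.randn_like(frame)
    frame = torch.poisson(frame.clamp(0, 1) * 255.0) / 255.0
    return frame.clamp(0.0, 1.0)

clean = torch.rand(3, 128, 128)
noisy_frames = torch.stack([degrade(clean) for _ in range(8)])
print(noisy_frames.shape)  # torch.Size([8, 3, 128, 128])
```

Because the shift, blur direction, and noise are drawn independently per call, the 8 frames generated from one source image carry complementary information, which is exactly what the fusion stage is trained to exploit.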
**Training Configuration:**
- Total epochs: 150 (progressive training)
- Optimizer: AdamW
- Loss: L1 + Perceptual (VGG) + SSIM + Color Correction
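A sketch of a combined loss of this shape, with the VGG perceptual term omitted for brevity, a uniform-window SSIM in place of the usual Gaussian-window formulation, and a per-channel mean match standing in for the color-correction term; the weights are illustrative, not the training values:

```python
import torch
import torch.nn.functional as F

def ssim(x, y, window=11, C1=0.01 ** 2, C2=0.03 ** 2):
    """Simplified SSIM over (B, 3, H, W) tensors in [0, 1]."""
    mu_x = F.avg_pool2d(x, window, stride=1)
    mu_y = F.avg_pool2d(y, window, stride=1)
    var_x = F.avg_pool2d(x * x, window, stride=1) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, window, stride=1) - mu_y ** 2
    cov = F.avg_pool2d(x * y, window, stride=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * cov + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
    return (num / den).mean()

def restoration_loss(pred, target, w_l1=1.0, w_ssim=0.5, w_color=0.1):
    l1 = F.l1_loss(pred, target)
    ssim_term = 1.0 - ssim(pred, target)               # SSIM as a loss
    # Color term: match per-channel means of prediction and target
    color = F.l1_loss(pred.mean(dim=(-2, -1)), target.mean(dim=(-2, -1)))
    return w_l1 * l1 + w_ssim * ssim_term + w_color * color

pred = torch.rand(1, 3, 64, 64)
target = torch.rand(1, 3, 64, 64)
loss = restoration_loss(pred, target)  # scalar; 0 when pred == target
```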
## Limitations
- Requires multiple frames of the same scene
- Performance depends on frame quality variation
- GPU recommended for real-time processing
## Citation
```bibtex
@software{karaarslan2026mfir,
  author = {Karaarslan, Veli},
  title  = {MFIR: Multi-Frame Image Restoration},
  year   = {2026},
  url    = {https://github.com/allcodernet/MFIR}
}
```
## License
MIT License - see [LICENSE](https://github.com/allcodernet/MFIR/blob/main/LICENSE)
## Author
**Veli Karaarslan** - 2026
## Links
- [GitHub Repository](https://github.com/allcodernet/MFIR)
- [Hugging Face Space](https://huggingface.co/spaces/marduk-ra/MFIR)