	---
	license: mit
	library_name: pytorch
	tags:
	- image-restoration
	- multi-frame
	- deformable-convolution
	- temporal-fusion
	- feature-level-alignment
	- super-resolution
	- astronomy
	- satellite-images-refactoring
	- photography
	- denoise
	- deblur
	- novel-architecture
	- from-scratch
	- efficient
	- research
	language:
	- en
	pipeline_tag: image-to-image
	---

	# MFIR - Multi-Frame Image Restoration

	A PyTorch model for multi-frame image restoration through temporal fusion and feature-level alignment. MFIR aligns and fuses features from multiple degraded frames to produce a high-quality restored image.

	## Model Description

	MFIR takes 2-16 degraded frames of the same scene and combines them into a single high-quality output. Unlike single-image restoration methods that struggle with heavily degraded inputs, MFIR leverages complementary information across multiple frames - each frame captures slightly different details, and the model learns to extract and merge the best parts from each.

	### Architecture

	```
	Input Frames (B, N, 3, H, W)
	          │
	          ▼
	┌─────────────────────┐
	│   Shared Encoder    │  ResNet-style feature extraction
	└─────────────────────┘
	          │
	          ▼
	┌─────────────────────┐
	│     Deformable      │  Align frames using learned offsets
	│      Alignment      │  (3-layer cascade)
	└─────────────────────┘
	          │
	          ▼
	┌─────────────────────┐
	│ Temporal Attention  │  Multi-head attention fusion
	│       Fusion        │  (4 heads)
	└─────────────────────┘
	          │
	          ▼
	┌─────────────────────┐
	│       Decoder       │  PixelShuffle upsampling
	└─────────────────────┘
	          │
	          ▼
	Output (B, 3, H, W)
	```
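
As a rough sanity check on the diagram, the tensor shapes at each stage can be traced by hand. The sketch below assumes the 4x encoder downsampling and the two 2x PixelShuffle stages described in this README. The function name `trace_shapes`, the 256-channel feature width (taken from the last entry of the encoder channel list), and the requirement that height and width be divisible by 4 are illustrative assumptions, not the repository's actual code.

```python
def trace_shapes(batch, num_frames, height, width, feat_channels=256, scale=4):
    """Return the assumed tensor shape after each MFIR stage (hypothetical helper)."""
    if not 2 <= num_frames <= 16:
        raise ValueError("MFIR expects between 2 and 16 frames")
    if height % scale or width % scale:
        raise ValueError(f"height and width should be divisible by {scale}")
    h, w = height // scale, width // scale
    return {
        "input": (batch, num_frames, 3, height, width),
        # Shared encoder: every frame is encoded with the same weights.
        "encoded": (batch, num_frames, feat_channels, h, w),
        # Deformable alignment keeps the shape; only the content is warped.
        "aligned": (batch, num_frames, feat_channels, h, w),
        # Temporal attention collapses the frame axis.
        "fused": (batch, feat_channels, h, w),
        # Decoder: two 2x PixelShuffle stages restore the input resolution.
        "output": (batch, 3, height, width),
    }


shapes = trace_shapes(batch=1, num_frames=5, height=256, width=256)
for stage, shape in shapes.items():
    print(f"{stage:>8}: {shape}")
```

The point of the trace is that alignment and fusion both run at 1/4 resolution, so their cost grows with the number of frames while the decoder cost does not.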

	### Key Components

	| Component | Description |
	|-----------|-------------|
	| Shared Encoder | Multi-scale feature extraction with residual blocks. 4x spatial downsampling. |
	| Deformable Alignment | Cascaded deformable convolutions (3 layers) that align each frame to the reference frame. More robust than optical flow for degraded inputs. |
	| Temporal Attention Fusion | Multi-head attention (4 heads) in which the reference frame is the query and all frames are keys/values. Learns per-pixel frame contributions. |
	| Decoder | Progressive upsampling with PixelShuffle (2 stages, 4x total). |
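
To make the "reference frame is query, all frames are key/value" idea concrete, here is a minimal pure-Python sketch of multi-head attention at a single pixel. It is a simplification under stated assumptions: the learned query/key/value projections are replaced by the identity, and the names `softmax` and `fuse_pixel` are hypothetical rather than taken from the MFIR code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_pixel(frame_feats, ref_idx=0, num_heads=4):
    """Fuse per-frame feature vectors at one pixel location.

    frame_feats: list of N aligned feature vectors, each of length C.
    Returns the fused C-dim vector and the per-head frame weights.
    """
    channels = len(frame_feats[0])
    if channels % num_heads:
        raise ValueError("channels must be divisible by num_heads")
    head_dim = channels // num_heads
    query = frame_feats[ref_idx]
    fused, weights = [], []
    for head in range(num_heads):
        lo, hi = head * head_dim, (head + 1) * head_dim
        q = query[lo:hi]
        # Scaled dot product between the reference query and each frame's key.
        scores = [
            sum(a * b for a, b in zip(q, feat[lo:hi])) / math.sqrt(head_dim)
            for feat in frame_feats
        ]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the frames' values for this head's channels.
        for c in range(lo, hi):
            fused.append(sum(wi * feat[c] for wi, feat in zip(w, frame_feats)))
    return fused, weights


frames = [
    [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0],  # reference frame
    [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0],  # agrees with the reference
    [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0],  # disagrees (e.g. misaligned)
]
fused, weights = fuse_pixel(frames)
print([round(w, 3) for w in weights[0]])
```

In this toy input the frame that disagrees with the reference receives the smallest weight in every head, which is the behaviour the "per-pixel frame contributions" description points at.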

	## Usage

	### Installation

	```bash
	pip install torch torchvision huggingface_hub
	```

	### Inference

	```python
	import torch
	from huggingface_hub import hf_hub_download

	# Model architecture code is available at:
	# https://github.com/marduk-ra/MFIR
	from model import FeatureFusionModel, FeatureFusionConfig

	device = "cuda" if torch.cuda.is_available() else "cpu"

	# Download the checkpoint from the Hub
	checkpoint_path = hf_hub_download(
	    repo_id="marduk-ra/MFIR",
	    filename="temporal_fusion_model.pth",
	)

	# weights_only=False unpickles arbitrary objects; only load checkpoints you trust
	ckpt = torch.load(checkpoint_path, map_location=device, weights_only=False)

	# Rebuild the model from the config stored in the checkpoint
	config = FeatureFusionConfig.from_dict(ckpt["config"])
	model = FeatureFusionModel(config).to(device)
	model.load_state_dict(ckpt["state_dict"])
	model.eval()

	# Placeholder input; replace with your own frames.
	# frames: (batch, num_frames, 3, height, width) float tensor in [0, 1]
	frames = torch.rand(1, 5, 3, 256, 256)

	# Inference
	with torch.no_grad():
	    result = model(frames.to(device), ref_idx=0)
	output = result["output"]  # (batch, 3, height, width)
	```
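
Before calling the model it can help to validate the input shape. The sketch below works on a plain shape tuple (for example `tuple(frames.shape)`) and checks the 2-16 frame limit and the RGB channel count stated in this README. The helper names `check_frames_shape` and `padding_for` are hypothetical, and the idea that height and width should be padded to a multiple of 4 is an assumption inferred from the 4x encoder downsampling, not a documented requirement.

```python
def check_frames_shape(shape, ref_idx=0, min_frames=2, max_frames=16):
    """Validate a (batch, num_frames, 3, height, width) shape tuple."""
    if len(shape) != 5:
        raise ValueError(f"expected 5 dimensions, got {len(shape)}")
    batch, num_frames, channels, height, width = shape
    if not min_frames <= num_frames <= max_frames:
        raise ValueError(
            f"num_frames must be in [{min_frames}, {max_frames}], got {num_frames}"
        )
    if channels != 3:
        raise ValueError(f"expected 3 RGB channels, got {channels}")
    if not 0 <= ref_idx < num_frames:
        raise ValueError(f"ref_idx {ref_idx} is out of range for {num_frames} frames")
    return height, width


def padding_for(height, width, multiple=4):
    """Bottom/right padding that makes both sides divisible by `multiple`."""
    pad_h = (-height) % multiple
    pad_w = (-width) % multiple
    return pad_h, pad_w


height, width = check_frames_shape((1, 5, 3, 270, 481), ref_idx=0)
print(padding_for(height, width))  # (2, 3)
```

If padding turns out to be needed, the restored output would be cropped back to the original height and width afterwards.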

	### Web Demo

	Try the model directly in your browser:

	🚀 [Hugging Face Space Demo](https://huggingface.co/spaces/marduk-ra/MFIR)

	## Model Details

	| Parameter | Value |
	|-----------|-------|
	| Input Channels | 3 (RGB) |
	| Output Channels | 3 (RGB) |
	| Max Frames | 16 |
	| Min Frames | 2 |
	| Encoder Channels | [64, 128, 256] |
	| Deformable Groups | 8 |
	| Deformable Layers | 3 |
	| Attention Heads | 4 |
	| Fusion Type | Attention |
	| Parameters | ~10M |
	| Checkpoint Size | 42 MB |
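
The parameter count and checkpoint size above are consistent with float32 weights, as a quick back-of-the-envelope calculation shows. The second helper estimates the memory taken by the encoded features alone; it assumes 256 channels at 1/4 resolution for every frame, which is an illustrative guess based on the table rather than a measured figure. Both function names are hypothetical.

```python
def checkpoint_megabytes(num_params, bytes_per_param=4):
    """Approximate size in MB (10**6 bytes) of weights stored as float32."""
    return num_params * bytes_per_param / 1e6


def encoded_feature_megabytes(
    num_frames, height, width, channels=256, scale=4, bytes_per_value=4
):
    """Rough float32 memory for one feature map per frame at 1/scale resolution."""
    values = num_frames * channels * (height // scale) * (width // scale)
    return values * bytes_per_value / 1e6


print(checkpoint_megabytes(10_000_000))  # 40.0
print(round(encoded_feature_megabytes(16, 1024, 1024)))  # 1074
```

The gap between 40 MB and the listed 42 MB fits with "~10M" being a rounded figure and with the checkpoint also storing the model config.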

	## Example

	Input Frames (5 degraded images):

	<p>
	<img src="photos/inputs/input1.png" width="150" />
	<img src="photos/inputs/input2.png" width="150" />
	<img src="photos/inputs/input3.png" width="150" />
	<img src="photos/inputs/input4.png" width="150" />
	<img src="photos/inputs/input5.png" width="150" />
	</p>

	Output (restored):

	<img src="photos/output.png" width="400" />

	Five degraded input frames are fused into a single high-quality output.

	The model works best when:
	- Frames have slight variations (different noise patterns, blur, etc.)
	- Frames are roughly aligned (same scene)
	- Input resolution is close to one of the training resolutions (128 to 1024 pixels)

	## Training

	The model was trained on a custom dataset with the following specifications:

	Dataset:
	- 16,000 high-resolution source images
	- Each image was used to generate 8 degraded input frames
	- Multi-scale training: 128, 256, 512, and 1024 pixel resolutions

	Degradation Pipeline:
	- Random spatial shifts (simulating camera shake)
	- Motion blur with varying kernel sizes and directions
	- Gaussian and Poisson noise with random intensity
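
The degradation steps listed above can be sketched in pure Python for a small grayscale image. This is an illustration only, not the actual training pipeline: the parameter ranges (`max_shift`, `max_blur`, `max_sigma`, `photons`), the nearest-neighbour sampling in the blur, and the function names are all assumptions, and a real pipeline would operate on RGB tensors.

```python
import math
import random


def clamp(value, lo, hi):
    return max(lo, min(hi, value))


def poisson(lam, rng):
    """Knuth's Poisson sampler; adequate for the small rates used here."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def degrade(image, rng, max_shift=2, max_blur=5, max_sigma=0.05, photons=50):
    """Return one degraded copy of a grayscale image (rows of floats in [0, 1])."""
    height, width = len(image), len(image[0])

    # 1. Random spatial shift (camera shake), replicating edge pixels.
    dy = rng.randint(-max_shift, max_shift)
    dx = rng.randint(-max_shift, max_shift)
    shifted = [
        [image[clamp(y + dy, 0, height - 1)][clamp(x + dx, 0, width - 1)]
         for x in range(width)]
        for y in range(height)
    ]

    # 2. Linear motion blur: average `length` samples along a random direction.
    length = rng.randint(1, max_blur)
    angle = rng.uniform(0.0, math.pi)
    step_y, step_x = math.sin(angle), math.cos(angle)
    blurred = []
    for y in range(height):
        row = []
        for x in range(width):
            total = 0.0
            for i in range(length):
                t = i - (length - 1) / 2
                sy = clamp(round(y + t * step_y), 0, height - 1)
                sx = clamp(round(x + t * step_x), 0, width - 1)
                total += shifted[sy][sx]
            row.append(total / length)
        blurred.append(row)

    # 3. Poisson (shot) noise followed by Gaussian (read) noise, clipped to [0, 1].
    sigma = rng.uniform(0.0, max_sigma)
    return [
        [clamp(poisson(v * photons, rng) / photons + rng.gauss(0.0, sigma), 0.0, 1.0)
         for v in row]
        for row in blurred
    ]


rng = random.Random(0)
clean = [[(x + y) / 30 for x in range(16)] for y in range(16)]
frames = [degrade(clean, rng) for _ in range(8)]  # 8 degraded frames per source image
```

Because every frame draws its own shift, blur, and noise, the eight frames carry different errors around the same underlying image, which is what the fusion stage exploits.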

	Training Configuration:
	- Total epochs: 150 (progressive training)
	- Optimizer: AdamW
	- Loss: L1 + Perceptual (VGG) + SSIM + Color Correction
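
The README does not state how the four loss terms are weighted. One common way to write such a composite objective is shown below, with the weights $\lambda$ left as unspecified hyperparameters and the SSIM term written as $1 - \mathrm{SSIM}$ so that lower is better. Both choices are assumptions, not details taken from the training code.

```latex
\mathcal{L}(\hat{y}, y) =
    \lambda_{1}\,\lVert \hat{y} - y \rVert_{1}
  + \lambda_{\text{perc}}\,\mathcal{L}_{\text{perc}}(\hat{y}, y)
  + \lambda_{\text{ssim}}\,\bigl(1 - \mathrm{SSIM}(\hat{y}, y)\bigr)
  + \lambda_{\text{color}}\,\mathcal{L}_{\text{color}}(\hat{y}, y)
```

Here $\hat{y}$ is the restored output, $y$ is the clean target, $\mathcal{L}_{\text{perc}}$ is a distance between VGG feature maps of the two images, and $\mathcal{L}_{\text{color}}$ stands for the color correction term.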

	## Limitations

	- Requires multiple frames of the same scene
	- Output quality depends on how much complementary detail the frames provide; near-identical frames add little
	- GPU recommended for real-time processing

	## Citation

	```bibtex
	@software{karaarslan2026mfir,
	  author = {Karaarslan, Veli},
	  title  = {MFIR: Multi-Frame Image Restoration},
	  year   = {2026},
	  url    = {https://github.com/allcodernet/MFIR}
	}
	```

	## License

	MIT License - see [LICENSE](https://github.com/allcodernet/MFIR/blob/main/LICENSE)

	## Author

	Veli Karaarslan - 2026

	## Links

	- [GitHub Repository](https://github.com/allcodernet/MFIR)
	- [Hugging Face Space](https://huggingface.co/spaces/marduk-ra/MFIR)