let's go (#2)
Commit 28a202cb896ecb7388dcd60a130a61a763b87287
Co-authored-by: Seif benayed <seifbenayed@users.noreply.huggingface.co>
- .gitattributes +3 -0
- DEPLOYMENT.md +194 -0
- README.md +105 -7
- app.py +152 -0
- checkpoints/dtd_doctamper.pth +3 -0
- checkpoints/qt_table.pk +0 -0
- checkpoints/swin_imagenet.pt +3 -0
- checkpoints/vph_imagenet.pt +3 -0
- examples/Paystub.jpg +3 -0
- examples/TamperedPaystub.jpg +3 -0
- examples/TamperedPaystubv1.jpg +3 -0
- examples/carte.jpeg +0 -0
- inference.py +187 -0
- models/__init__.py +1 -0
- models/dtd.py +356 -0
- models/fix_imports.py +31 -0
- models/fph.py +132 -0
- models/patch_droppath.py +17 -0
- models/patch_gelu.py +34 -0
- models/swins.py +454 -0
- requirements.txt +11 -0
.gitattributes
CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+examples/Paystub.jpg filter=lfs diff=lfs merge=lfs -text
+examples/TamperedPaystub.jpg filter=lfs diff=lfs merge=lfs -text
+examples/TamperedPaystubv1.jpg filter=lfs diff=lfs merge=lfs -text
DEPLOYMENT.md
ADDED
@@ -0,0 +1,194 @@
# 🚀 Deployment Guide - DTD Document Tampering Detection

## Hugging Face Spaces Deployment

### Prerequisites

- Hugging Face account
- Git installed locally
- Git LFS installed (for large model files)

### Step 1: Install Git LFS

```bash
# Mac
brew install git-lfs

# Linux
sudo apt-get install git-lfs

# Initialize Git LFS
git lfs install
```

### Step 2: Create Hugging Face Space

1. Go to https://huggingface.co/new-space
2. Choose a name (e.g., `dtd-doctamper-detection`)
3. Select **Gradio** as SDK
4. Choose license: **MIT**
5. Click **Create Space**
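
If you prefer to script this step, the Space can also be created with the `huggingface_hub` client. A minimal sketch, assuming `huggingface_hub` is installed and you are already authenticated (e.g., via `huggingface-cli login`); substitute your own username:

```python
# Sketch: create the Space programmatically instead of via the web UI.
from huggingface_hub import create_repo

create_repo(
    "YOUR_USERNAME/dtd-doctamper-detection",  # same name as in the web flow
    repo_type="space",                        # create a Space, not a model repo
    space_sdk="gradio",                       # matches the SDK chosen above
)
```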

### Step 3: Clone and Setup

```bash
# Clone your new space
git clone https://huggingface.co/spaces/YOUR_USERNAME/dtd-doctamper-detection
cd dtd-doctamper-detection

# Copy app files
cp -r /path/to/gradio_dtd_app/* .

# Track large files with Git LFS
git lfs track "*.pth"
git lfs track "*.pt"
git add .gitattributes

# Add all files
git add .

# Commit
git commit -m "Initial commit: DTD document tampering detection app"

# Push to Hugging Face
git push
```

### Step 4: Configure Space Settings

After pushing, Hugging Face will automatically:
- Install dependencies from `requirements.txt`
- Build the Docker container
- Start the Gradio app
- Assign a public URL

### Step 5: Test Your Space

Visit: `https://huggingface.co/spaces/YOUR_USERNAME/dtd-doctamper-detection`

## Local Testing

Before deploying, test locally:

```bash
cd gradio_dtd_app

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run app
python app.py
```

Open a browser at `http://localhost:7860`.

## Troubleshooting

### Issue: Git LFS bandwidth limit

**Solution**: Use Hugging Face's built-in LFS storage:

```bash
# Track checkpoint files
git lfs track "checkpoints/*.pth"
git lfs track "checkpoints/*.pt"
git add .gitattributes checkpoints/
git commit -m "Add model checkpoints"
git push
```

### Issue: Build timeout

**Solution**: Reduce pinned requirement versions, or pin the SDK and Python versions in the Space configuration. Note these settings live in the `README.md` front matter, not in a CI workflow file:

```yaml
# In the README.md front matter
sdk: gradio
sdk_version: 4.44.0
python_version: "3.10"
```

### Issue: Out of memory

**Solution**: Enable GPU hardware in the Space settings:
1. Go to Space settings
2. Select **Hardware**: GPU (e.g., T4)
3. Save changes

## File Size Optimization

Current app size: **~450MB**

To reduce size:

1. **Quantize models** (reduce precision):
```python
# In inference.py; note that quantize_dynamic returns the quantized model
model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
```

2. **Use model compression** (see the export sketch after this list):
```bash
pip install onnx onnxruntime
# Convert the model to ONNX format (smaller, faster CPU inference)
```

3. **Lazy loading**:
```python
# Load the model on first request instead of at startup
from functools import lru_cache

@lru_cache()
def get_model():
    return DTDPredictor()
```
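
For item 2 above, here is a minimal, untested export sketch rather than a verified conversion: the DTD forward pass takes three inputs (image, DCT coefficients, quantization table), and the dummy shapes below mirror how `inference.py` prepares them, with an assumed 512x512 input:

```python
# Sketch: export the loaded DTD model to ONNX. Shapes and the opset are
# illustrative assumptions; whether every op exports cleanly is untested.
import torch

model.eval()
dummy_img = torch.randn(1, 3, 512, 512)           # normalized RGB image
dummy_dct = torch.zeros(1, 1, 512, 512)           # clipped |DCT| coefficients
dummy_qt = torch.zeros(1, 64, dtype=torch.long)   # flattened quantization table

torch.onnx.export(
    model,
    (dummy_img, dummy_dct, dummy_qt),
    "dtd_doctamper.onnx",
    input_names=["image", "dct", "qt"],
    output_names=["mask_logits"],
    opset_version=17,
)
```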

## Environment Variables

Add to Space secrets:

```bash
# Optional: Analytics tracking
ANALYTICS_TOKEN=your_token

# Optional: Rate limiting
MAX_REQUESTS_PER_HOUR=100
```
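
Inside `app.py`, such secrets would be read with `os.getenv`. A minimal sketch; note the current app does not read either variable, these names are just the optional ones suggested above:

```python
# Sketch: reading optional Space secrets at startup.
import os

analytics_token = os.getenv("ANALYTICS_TOKEN")                 # None if unset
max_requests = int(os.getenv("MAX_REQUESTS_PER_HOUR", "100"))  # default 100
```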

## Monitoring

Check the Space logs:
1. Go to the Space page
2. Click the **Logs** tab
3. Monitor real-time inference

## Custom Domain (Optional)

1. Go to Space settings
2. Add a custom domain
3. Configure DNS records

## Cost Optimization

**Free Tier Limits:**
- CPU: Free (slower inference)
- GPU T4: available as paid hardware or via a community grant
- Storage: 50GB LFS
- Bandwidth: Limited

**Upgrade Options:**
- GPU A10G: Faster inference
- Persistent storage
- Higher bandwidth

## Support

- [Hugging Face Docs](https://huggingface.co/docs/hub/spaces)
- [Gradio Docs](https://gradio.app/docs/)
- [Git LFS](https://git-lfs.github.com/)

## License

MIT License - See LICENSE file
README.md
CHANGED
@@ -1,13 +1,111 @@
---
title: DTD Document Tampering Detection
emoji: 🔍
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
license: mit
---

# 🔍 DTD: Document Tampering Detection

Detect forged or tampered regions in document images using the **DTD (Document Tampering Detector)** model.

## 📝 Description

This application uses state-of-the-art deep learning to identify manipulated text in document images by analyzing JPEG compression artifacts (DCT coefficients).

### ✨ Features

- **DCT Analysis**: Examines JPEG compression patterns to detect inconsistencies
- **Real-time Detection**: Fast inference on CPU or GPU
- **Visual Heatmaps**: Clear visualization of tampered regions
- **High Accuracy**: Trained on the DocTamper dataset of 120K+ document images
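
The DCT analysis operates on the JPEG bitstream itself. As a quick illustration of the signal involved, the quantized coefficients and quantization table can be inspected with `jpegio`, the same library the inference code uses internally:

```python
# Sketch: peek at the JPEG data the detector consumes (requires a JPEG input).
import jpegio

jpg = jpegio.read("examples/Paystub.jpg")
dct = jpg.coef_arrays[0]       # quantized DCT coefficients (luma channel)
qt = jpg.quant_tables[0]       # 8x8 quantization table
print(dct.shape, qt.flatten()[:8])
```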

### 🎯 Use Cases

- Verify document authenticity
- Detect forged receipts, invoices, and forms
- Identify copy-paste text manipulation
- Detect splicing and content insertion

## 🚀 How It Works

1. **Upload** a document image (JPEG format works best)
2. **Adjust** the JPEG quality setting for DCT analysis (default: 90)
3. **View** tampering detection results:
   - **Heatmap**: Red overlay shows tampered regions
   - **Binary Mask**: Clear segmentation of authentic vs tampered regions
   - **Original**: Compare with the input

## 🏗️ Model Architecture

- **Backbone**: VPH (Vision Pyramid Hybrid) + Swin Transformer
- **Decoder**: Multi-scale Iterative Decoder (MID)
- **Inputs**: RGB image + DCT coefficients + Quantization tables
- **Output**: Binary segmentation mask (0=authentic, 1=tampered)
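
In code, the forward pass consumes the three inputs as tensors shaped the way `inference.py` prepares them (batch size 1; H, W are the image size):

```python
# image: float32 (1, 3, H, W) - normalized RGB
# dct:   float32 (1, 1, H, W) - |DCT| coefficients clipped to [0, 20]
# qt:    int64   (1, 64)      - flattened quantization table
logits = model(image, dct, qt)   # (1, 2, H, W) class scores
mask = logits.argmax(1)          # (1, H, W): 0 = authentic, 1 = tampered
```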

## 📚 Citation

```bibtex
@inproceedings{qu2023towards,
  title={Towards Robust Tampered Text Detection in Document Image: New Dataset and New Solution},
  author={Qu, Chenfan and Liu, Chongyu and Liu, Yuliang and Chen, Xinhong and Peng, Dezhi and Guo, Fengjun and Jin, Lianwen},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={5937--5946},
  year={2023}
}
```

## 📖 Paper

[Towards Robust Tampered Text Detection in Document Image: New Dataset and New Solution](https://openaccess.thecvf.com/content/CVPR2023/papers/Qu_Towards_Robust_Tampered_Text_Detection_in_Document_Image_New_Dataset_CVPR_2023_paper.pdf) (CVPR 2023)

## ⚠️ Limitations

- **JPEG Dependency**: Requires JPEG format for DCT analysis
- **Quality Sensitivity**: Detection accuracy varies with compression quality
- **False Positives**: May occur on low-quality scans or heavily compressed images
- **Preprocessing**: Images must contain text/document content

## 🛠️ Technical Details

### Model Weights

- **Main Model**: `dtd_doctamper.pth` (257MB)
- **VPH Backbone**: `vph_imagenet.pt` (4.8MB)
- **Swin Transformer**: `swin_imagenet.pt` (187MB)
- **Total Size**: ~449MB

### Performance

- **Input Size**: Variable (auto-resized)
- **Inference Time**: ~2-5 seconds on CPU
- **GPU Acceleration**: Supported (CUDA)

## 📦 Local Installation

```bash
# Clone repository
git clone https://huggingface.co/spaces/YOUR_USERNAME/dtd-doctamper-detection
cd dtd-doctamper-detection

# Install dependencies
pip install -r requirements.txt

# Run application
python app.py
```
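
Beyond the Gradio UI, the predictor defined in `inference.py` can be used programmatically:

```python
# Use the predictor shipped in this repo directly.
from inference import DTDPredictor

predictor = DTDPredictor(checkpoint_path="checkpoints/dtd_doctamper.pth", device="auto")
result = predictor.predict("examples/TamperedPaystub.jpg", quality=90)
# result contains 'original', 'mask' (uint8, 0/255), and 'heatmap' numpy arrays
```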

## 📄 License

MIT License - See LICENSE file for details

## 🤝 Acknowledgments

- Original DTD model by Qu et al. (CVPR 2023)
- DocTamper dataset
- Hugging Face Spaces for hosting
app.py
ADDED
@@ -0,0 +1,152 @@
"""
DTD DocTamper - Gradio Application
Document Tampering Detection using the DTD model
"""
import os
import tempfile

import gradio as gr
import numpy as np
from PIL import Image

from inference import DTDPredictor

# Initialize the predictor once at startup
print("Loading DTD model...")
predictor = DTDPredictor(
    checkpoint_path='checkpoints/dtd_doctamper.pth',
    device='auto'
)
print("Model loaded!")

def predict_tampering(image, quality=90):
    """
    Predict document tampering.

    Args:
        image: Input image (PIL Image or numpy array)
        quality: JPEG compression quality for DCT analysis

    Returns:
        Tuple of (original, mask, heatmap)
    """
    # Save the uploaded image to a temporary JPEG so DCT coefficients can be extracted
    with tempfile.NamedTemporaryFile(delete=False, suffix='.jpg') as tmp:
        if hasattr(image, 'save'):
            image.save(tmp, 'JPEG', quality=95)
        else:
            Image.fromarray(image).save(tmp, 'JPEG', quality=95)
        tmp_path = tmp.name

    try:
        # Run prediction
        result = predictor.predict(tmp_path, quality=quality)
        return (
            result['original'],
            result['mask'],
            result['heatmap']
        )
    finally:
        os.unlink(tmp_path)

# Create Gradio interface
with gr.Blocks(title="DTD Document Tampering Detection") as demo:
    gr.Markdown("""
    # 🔍 DTD: Document Tampering Detection

    Upload a document image to detect forged or tampered regions using the DTD (Document Tampering Detector) model.

    **How it works:**
    - The model analyzes JPEG compression artifacts (DCT coefficients)
    - Red regions indicate potential tampering
    - Works best on JPEG images of documents

    **Paper:** [Towards Robust Tampered Text Detection in Document Image](https://openaccess.thecvf.com/content/CVPR2023/papers/Qu_Towards_Robust_Tampered_Text_Detection_in_Document_Image_New_Dataset_CVPR_2023_paper.pdf)
    """)

    with gr.Row():
        with gr.Column():
            input_image = gr.Image(
                label="Upload Document Image",
                type="pil"
            )

            quality_slider = gr.Slider(
                minimum=75,
                maximum=95,
                value=90,
                step=5,
                label="JPEG Quality for DCT Analysis",
                info="Higher quality = more sensitive detection"
            )

            submit_btn = gr.Button("Detect Tampering", variant="primary")

        with gr.Column():
            with gr.Tab("Heatmap Overlay"):
                output_heatmap = gr.Image(label="Tampering Heatmap")

            with gr.Tab("Binary Mask"):
                output_mask = gr.Image(label="Tampering Mask")

            with gr.Tab("Original"):
                output_original = gr.Image(label="Original Image")

    # Examples
    gr.Examples(
        examples=[
            ["examples/carte.jpeg", 90],
            ["examples/TamperedPaystub.jpg", 90],
            ["examples/Paystub.jpg", 90],
        ],
        inputs=[input_image, quality_slider],
        outputs=[output_original, output_mask, output_heatmap],
        fn=predict_tampering,
        cache_examples=False,
    )

    # Event handlers
    submit_btn.click(
        fn=predict_tampering,
        inputs=[input_image, quality_slider],
        outputs=[output_original, output_mask, output_heatmap]
    )

    gr.Markdown("""
    ---
    ### ℹ️ About

    **DTD (Document Tampering Detector)** is a deep learning model designed to detect forged text in document images.

    **Features:**
    - Analyzes JPEG compression artifacts using DCT (Discrete Cosine Transform)
    - Detects copy-paste, splicing, and text manipulation
    - Works on scanned documents, photos of documents, and digital documents

    **Citation:**
    ```bibtex
    @inproceedings{qu2023towards,
      title={Towards Robust Tampered Text Detection in Document Image: New Dataset and New Solution},
      author={Qu, Chenfan and Liu, Chongyu and Liu, Yuliang and Chen, Xinhong and Peng, Dezhi and Guo, Fengjun and Jin, Lianwen},
      booktitle={CVPR},
      year={2023}
    }
    ```

    **Model Architecture:**
    - Backbone: VPH (Vision Pyramid Hybrid) + Swin Transformer
    - Decoder: Multi-scale Iterative Decoder (MID)
    - Input: RGB image + DCT coefficients + Quantization tables
    - Output: Binary segmentation mask

    **Limitations:**
    - Requires JPEG images for DCT analysis
    - May produce false positives on low-quality scans
    - Performance varies with JPEG compression quality
    """)

if __name__ == "__main__":
    demo.launch(
        server_name="0.0.0.0",
        server_port=7860,
        share=False
    )
checkpoints/dtd_doctamper.pth
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:81291d19e0c92fd56a8e76f422114a9c6bc6f67f4ac03a0facc18a045894a8c1
size 269695109
checkpoints/qt_table.pk
ADDED
Binary file (7.74 kB)
checkpoints/swin_imagenet.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1436a2d793dbbb74c9578a68e52d0e7deaa3f305560a34d287a8e4edc866b245
size 196402845
checkpoints/vph_imagenet.pt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f4e99cbcbd5f17a0004278a57e3ff199a0de7189345d7e6924e104710d602898
size 5000275
examples/Paystub.jpg
ADDED
Git LFS file
examples/TamperedPaystub.jpg
ADDED
Git LFS file
examples/TamperedPaystubv1.jpg
ADDED
Git LFS file
examples/carte.jpeg
ADDED
inference.py
ADDED
@@ -0,0 +1,187 @@
"""
DTD DocTamper Inference Module
Simplified for Gradio deployment on Hugging Face
"""
import os
import pickle
import tempfile

import cv2
import torch
import jpegio
import numpy as np
from PIL import Image
import torchvision.transforms as transforms

from models import fix_imports    # Apply timm compatibility fixes (import side effects)
from models import patch_gelu
from models import patch_droppath
from models.dtd import seg_dtd


class DTDPredictor:
    def __init__(self, checkpoint_path='checkpoints/dtd_doctamper.pth', device='cpu'):
        """
        Initialize the DTD model for inference.

        Args:
            checkpoint_path: Path to the model checkpoint
            device: Device to use ('cpu', 'cuda', or 'auto')
        """
        # Auto-detect device
        if device == 'auto':
            self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        else:
            self.device = device

        print(f'Using device: {self.device}')

        # Load the quantization-table lookup used by the FPH branch
        with open('checkpoints/qt_table.pk', 'rb') as fpk:
            pks = pickle.load(fpk)
        self.pks = {k: torch.LongTensor(v) for k, v in pks.items()}

        # Image transforms (standard ImageNet statistics)
        self.transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize(mean=(0.485, 0.456, 0.406),
                                 std=(0.229, 0.224, 0.225))
        ])

        # Build the model (2 classes: authentic / tampered)
        self.model = seg_dtd('', 2, device=self.device).to(self.device)

        # Load checkpoint
        ckpt = torch.load(checkpoint_path, map_location='cpu')
        state_dict = ckpt['state_dict']

        # Remove the 'module.' prefix left over from DataParallel training
        new_state_dict = {}
        for k, v in state_dict.items():
            new_state_dict[k[7:] if k.startswith('module.') else k] = v

        self.model.load_state_dict(new_state_dict)
        self.model.eval()

        print('Model loaded successfully!')

    def extract_dct(self, image_path, quality=90):
        """
        Extract DCT coefficients from a JPEG image.

        Args:
            image_path: Path to the JPEG image
            quality: JPEG quality for re-compression

        Returns:
            DCT coefficients and the flattened quantization table
        """
        # Load image
        im = Image.open(image_path).convert('RGB')

        # Re-compress to a grayscale JPEG with the specified quality
        with tempfile.NamedTemporaryFile(delete=False, suffix='.jpg') as tmp:
            im.convert("L").save(tmp, "JPEG", quality=quality)
            tmp_path = tmp.name

        try:
            # Read the JPEG with jpegio
            jpg = jpegio.read(tmp_path)
            dct = jpg.coef_arrays[0].copy()

            # Quantization table (first 64 values, flattened)
            qt = jpg.quant_tables[0]
            qt_flat = qt.flatten()[:64]

            return dct, qt_flat
        finally:
            # Clean up temp file
            os.unlink(tmp_path)

    @torch.no_grad()
    def predict(self, image_path, quality=90):
        """
        Predict the tampering mask for an input image.

        Args:
            image_path: Path to the input JPEG image
            quality: JPEG quality for DCT extraction

        Returns:
            Dictionary containing:
            - original: Original image (numpy array)
            - mask: Binary tampering mask (numpy array)
            - heatmap: Colorized heatmap overlay
        """
        # Load image
        im = Image.open(image_path).convert('RGB')
        im_np = np.array(im)

        # Extract DCT coefficients
        dct, qt = self.extract_dct(image_path, quality)

        # Image input
        im_tensor = self.transform(im).unsqueeze(0).to(self.device)

        # DCT coefficients (absolute values clipped to [0, 20])
        dct_tensor = torch.from_numpy(np.clip(np.abs(dct), 0, 20)).unsqueeze(0).unsqueeze(0).float().to(self.device)

        # Map each quantization value to the closest value known to the model
        qt_indices = []
        for val in qt:
            if val in self.pks:
                qt_indices.append(val)
            else:
                closest = min(self.pks.keys(), key=lambda x: abs(x - val))
                qt_indices.append(closest)

        qt_tensor = torch.LongTensor(qt_indices[:64]).unsqueeze(0).to(self.device)

        # Forward pass
        output = self.model(im_tensor, dct_tensor, qt_tensor)

        # Prediction mask: argmax over the two classes
        pred_mask = output.argmax(1).squeeze().cpu().numpy()

        # Create heatmap overlay
        heatmap = self.create_heatmap(im_np, pred_mask)

        return {
            'original': im_np,
            'mask': (pred_mask * 255).astype(np.uint8),
            'heatmap': heatmap
        }

    def create_heatmap(self, image, mask):
        """
        Create a colorized heatmap overlay.

        Args:
            image: Original image (numpy array)
            mask: Binary mask (numpy array)

        Returns:
            Heatmap overlay (numpy array)
        """
        # Red overlay on tampered regions
        colored_mask = np.zeros_like(image)
        colored_mask[mask == 1] = [255, 0, 0]

        # Blend with the original image
        alpha = 0.5
        heatmap = cv2.addWeighted(image, 1 - alpha, colored_mask, alpha, 0)

        return heatmap
models/__init__.py
ADDED
@@ -0,0 +1 @@
# DTD Models Module
models/dtd.py
ADDED
@@ -0,0 +1,356 @@
import os
from . import fix_imports  # Apply import fixes (was `import fix_imports`, which fails inside the package)
import cv2
import lmdb
import torch
import jpegio
import numpy as np
import torch.nn as nn
import gc
import math
import time
import copy
import logging
import torch.optim as optim
import torch.distributed as dist
import random
import pickle
import six
from glob import glob
from PIL import Image
from tqdm import tqdm
from torch.autograd import Variable
import segmentation_models_pytorch as smp
from torch.utils.data import Dataset, DataLoader
from torch.cuda.amp import autocast, GradScaler  # needs pytorch>1.6
# NOTE: the original `from losses import DiceLoss, FocalLoss, SoftCrossEntropyLoss, LovaszLoss`
# referenced a training-only module that is not part of this repo; removed for inference.
from .fph import FPH
import albumentations as A
from .swins import *
from albumentations.pytorch import ToTensorV2
import torchvision
import torch.nn.functional as F
try:
    from timm.models.layers import trunc_normal_, DropPath
except ImportError:
    from timm.layers import trunc_normal_, DropPath
from functools import partial
from segmentation_models_pytorch.base import modules as md
from typing import Optional, Union, List
from segmentation_models_pytorch.base import SegmentationModel

# Custom GELU for compatibility
class GELU(nn.Module):
    def forward(self, x):
        return F.gelu(x)

class LayerNorm(nn.Module):
    def __init__(self, normalized_shape, eps=1e-6, data_format="channels_last"):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(normalized_shape))
        self.bias = nn.Parameter(torch.zeros(normalized_shape))
        self.eps = eps
        self.data_format = data_format
        if self.data_format not in ["channels_last", "channels_first"]:
            raise NotImplementedError
        self.normalized_shape = (normalized_shape, )

    def forward(self, x):
        if self.data_format == "channels_last":
            return F.layer_norm(x, self.normalized_shape, self.weight, self.bias, self.eps)
        elif self.data_format == "channels_first":
            u = x.mean(1, keepdim=True)
            s = (x - u).pow(2).mean(1, keepdim=True)
            x = (x - u) / torch.sqrt(s + self.eps)
            x = self.weight[:, None, None] * x + self.bias[:, None, None]
            return x

class SCSEModule(nn.Module):
    def __init__(self, in_channels, reduction=16):
        super().__init__()
        self.cSE = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_channels, in_channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels // reduction, in_channels, 1),
            nn.Sigmoid(),
        )
        self.sSE = nn.Sequential(nn.Conv2d(in_channels, 1, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.cSE(x) + x * self.sSE(x)

class ConvBlock(nn.Module):
    def __init__(self, dim, drop_path=0., layer_scale_init_value=1e-6):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = LayerNorm(dim, eps=1e-6)
        self.pwconv1 = nn.Linear(dim, 4 * dim)
        self.act = GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)
        self.gamma = nn.Parameter(layer_scale_init_value * torch.ones((dim)), requires_grad=True) if layer_scale_init_value > 0 else None
        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()

    def forward(self, x):
        ipt = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)
        x = self.norm(x)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.pwconv2(x)
        if self.gamma is not None:
            x = self.gamma * x
        x = x.permute(0, 3, 1, 2)
        x = ipt + self.drop_path(x)
        return x

class AddCoords(nn.Module):
    def __init__(self, with_r=True):
        super().__init__()
        self.with_r = with_r

    def forward(self, input_tensor):
        batch_size, _, x_dim, y_dim = input_tensor.size()
        xx_c, yy_c = torch.meshgrid(torch.arange(x_dim, dtype=input_tensor.dtype), torch.arange(y_dim, dtype=input_tensor.dtype))
        xx_c = xx_c.to(input_tensor.device) / (x_dim - 1) * 2 - 1
        yy_c = yy_c.to(input_tensor.device) / (y_dim - 1) * 2 - 1
        xx_c = xx_c.expand(batch_size, 1, x_dim, y_dim)
        yy_c = yy_c.expand(batch_size, 1, x_dim, y_dim)
        ret = torch.cat((input_tensor, xx_c, yy_c), dim=1)
        if self.with_r:
            rr = torch.sqrt(torch.pow(xx_c - 0.5, 2) + torch.pow(yy_c - 0.5, 2))
            ret = torch.cat([ret, rr], dim=1)
        return ret

class VPH(nn.Module):
    # NOTE: this class mainly has to be importable so the pretrained module in
    # vph_imagenet.pt can be unpickled; torch.load restores the real weights and
    # configuration. The defaults below are a best-effort fix: `depths` was
    # undefined in the original, and `dims` had fewer entries than are indexed.
    def __init__(self, dims=(96, 192, 384, 768), depths=(3, 3), drop_path_rate=0.4, layer_scale_init_value=1e-6):
        super().__init__()
        self.dims = dims
        dp_rates = [x.item() for x in torch.linspace(0, drop_path_rate, sum(depths))]
        self.downsample_layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(6, dims[0], kernel_size=4, stride=4),
                LayerNorm(dims[0], eps=1e-6, data_format="channels_first")),
            nn.Sequential(
                LayerNorm(dims[1], eps=1e-6, data_format="channels_first"),
                nn.Conv2d(dims[1], dims[2], kernel_size=2, stride=2))])
        self.stages = nn.ModuleList([
            nn.Sequential(*[ConvBlock(dim=dims[0], drop_path=dp_rates[j],
                                      layer_scale_init_value=layer_scale_init_value) for j in range(3)]),
            nn.Sequential(*[ConvBlock(dim=dims[1], drop_path=dp_rates[3 + j],
                                      layer_scale_init_value=layer_scale_init_value) for j in range(3)])])
        self.apply(self._init_weights)

    def initnorm(self):
        norm_layer = partial(LayerNorm, eps=1e-6, data_format="channels_first")
        for i_layer in range(4):
            layer = norm_layer(self.dims[i_layer])
            layer_name = f'norm{i_layer}'
            self.add_module(layer_name, layer)

    def _init_weights(self, m):
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            trunc_normal_(m.weight, std=.02)
            nn.init.constant_(m.bias, 0)

    def init_weights(self, pretrained=None):
        def _init_weights(m):
            if isinstance(m, nn.Linear):
                trunc_normal_(m.weight, std=.02)
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.LayerNorm):
                nn.init.constant_(m.bias, 0)
                nn.init.constant_(m.weight, 1.0)
        self.apply(_init_weights)

    def forward(self, x):
        x = self.stages[0](self.downsample_layers[0](x))
        outs = [self.norm0(x)]
        x = self.stages[1](self.downsample_layers[1](x))
        outs.append(self.norm1(x))
        return outs

class SegmentationHead(nn.Sequential):
    def __init__(self, in_channels, out_channels, kernel_size=3, activation=None, upsampling=1):
        upsampling = nn.UpsamplingBilinear2d(scale_factor=upsampling) if upsampling > 1 else nn.Identity()
        conv2d = nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size, padding=kernel_size // 2)
        activation = md.Activation(activation)
        super().__init__(conv2d, upsampling, activation)

class DecoderBlock(nn.Module):
    def __init__(self, cin, cadd, cout):
        super().__init__()
        self.cin = (cin + cadd)
        self.cout = cout
        self.conv1 = nn.Sequential(
            nn.Conv2d(self.cin, self.cout, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(self.cout),
            nn.ReLU(inplace=True)
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(self.cout, self.cout, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(self.cout),
            nn.ReLU(inplace=True)
        )

    def forward(self, x1, x2=None):
        x1 = F.interpolate(x1, scale_factor=2.0, mode="nearest")
        if x2 is not None:
            x1 = torch.cat([x1, x2], dim=1)
        x1 = self.conv1(x1[:, :self.cin])
        x1 = self.conv2(x1)
        return x1

class ConvBNReLU(nn.Module):
    def __init__(self, in_c, out_c, ks, stride=1, norm=True, res=False):
        super(ConvBNReLU, self).__init__()
        if norm:
            self.conv = nn.Sequential(
                nn.Conv2d(in_c, out_c, kernel_size=ks, padding=ks // 2, stride=stride, bias=False),
                nn.BatchNorm2d(out_c),
                nn.ReLU(True))
        else:
            self.conv = nn.Conv2d(in_c, out_c, kernel_size=ks, padding=ks // 2, stride=stride, bias=False)
        self.res = res

    def forward(self, x):
        if self.res:
            return (x + self.conv(x))
        else:
            return self.conv(x)

class FUSE1(nn.Module):
    def __init__(self, in_channels_list=(96, 192, 384, 768)):
        super(FUSE1, self).__init__()
        self.c31 = ConvBNReLU(in_channels_list[2], in_channels_list[2], 1)
        self.c32 = ConvBNReLU(in_channels_list[3], in_channels_list[2], 1)
        self.c33 = ConvBNReLU(in_channels_list[2], in_channels_list[2], 3)

        self.c21 = ConvBNReLU(in_channels_list[1], in_channels_list[1], 1)
        self.c22 = ConvBNReLU(in_channels_list[2], in_channels_list[1], 1)
        self.c23 = ConvBNReLU(in_channels_list[1], in_channels_list[1], 3)

        self.c11 = ConvBNReLU(in_channels_list[0], in_channels_list[0], 1)
        self.c12 = ConvBNReLU(in_channels_list[1], in_channels_list[0], 1)
        self.c13 = ConvBNReLU(in_channels_list[0], in_channels_list[0], 3)

    def forward(self, x):
        x, x1, x2, x3 = x
        h, w = x2.shape[-2:]
        x2 = self.c33(F.interpolate(self.c32(x3), size=(h, w)) + self.c31(x2))
        h, w = x1.shape[-2:]
        x1 = self.c23(F.interpolate(self.c22(x2), size=(h, w)) + self.c21(x1))
        h, w = x.shape[-2:]
        x = self.c13(F.interpolate(self.c12(x1), size=(h, w)) + self.c11(x))
        return x, x1, x2, x3

class FUSE2(nn.Module):
    def __init__(self, in_channels_list=(96, 192, 384)):
        super(FUSE2, self).__init__()

        self.c21 = ConvBNReLU(in_channels_list[1], in_channels_list[1], 1)
        self.c22 = ConvBNReLU(in_channels_list[2], in_channels_list[1], 1)
        self.c23 = ConvBNReLU(in_channels_list[1], in_channels_list[1], 3)

        self.c11 = ConvBNReLU(in_channels_list[0], in_channels_list[0], 1)
        self.c12 = ConvBNReLU(in_channels_list[1], in_channels_list[0], 1)
        self.c13 = ConvBNReLU(in_channels_list[0], in_channels_list[0], 3)

    def forward(self, x):
        x, x1, x2 = x
        h, w = x1.shape[-2:]
        x1 = self.c23(F.interpolate(self.c22(x2), size=(h, w), mode='bilinear', align_corners=True) + self.c21(x1))
        h, w = x.shape[-2:]
        x = self.c13(F.interpolate(self.c12(x1), size=(h, w), mode='bilinear', align_corners=True) + self.c11(x))
        return x, x1, x2

class FUSE3(nn.Module):
    def __init__(self, in_channels_list=(96, 192)):
        super(FUSE3, self).__init__()

        self.c11 = ConvBNReLU(in_channels_list[0], in_channels_list[0], 1)
        self.c12 = ConvBNReLU(in_channels_list[1], in_channels_list[0], 1)
        self.c13 = ConvBNReLU(in_channels_list[0], in_channels_list[0], 3)

    def forward(self, x):
        x, x1 = x
        h, w = x.shape[-2:]
        x = self.c13(F.interpolate(self.c12(x1), size=(h, w), mode='bilinear', align_corners=True) + self.c11(x))
        return x, x1

class MID(nn.Module):
    def __init__(self, encoder_channels, decoder_channels):
        super().__init__()
        encoder_channels = encoder_channels[1:][::-1]
        self.in_channels = [encoder_channels[0]] + list(decoder_channels[:-1])
        self.add_channels = list(encoder_channels[1:]) + [96]
        self.out_channels = decoder_channels
        self.fuse1 = FUSE1()
        self.fuse2 = FUSE2()
        self.fuse3 = FUSE3()
        decoder_convs = {}
        for layer_idx in range(len(self.in_channels) - 1):
            for depth_idx in range(layer_idx + 1):
                if depth_idx == 0:
                    in_ch = self.in_channels[layer_idx]
                    skip_ch = self.add_channels[layer_idx] * (layer_idx + 1)
                    out_ch = self.out_channels[layer_idx]
                else:
                    out_ch = self.add_channels[layer_idx]
                    skip_ch = self.add_channels[layer_idx] * (layer_idx + 1 - depth_idx)
                    in_ch = self.add_channels[layer_idx - 1]
                decoder_convs[f"x_{depth_idx}_{layer_idx}"] = DecoderBlock(in_ch, skip_ch, out_ch)
        decoder_convs[f"x_{0}_{len(self.in_channels)-1}"] = DecoderBlock(self.in_channels[-1], 0, self.out_channels[-1])
        self.decoder_convs = nn.ModuleDict(decoder_convs)

    def forward(self, *features):
        decoder_features = {}
        features = self.fuse1(features)[::-1]
        decoder_features["x_0_0"] = self.decoder_convs["x_0_0"](features[0], features[1])
        decoder_features["x_1_1"] = self.decoder_convs["x_1_1"](features[1], features[2])
        decoder_features["x_2_2"] = self.decoder_convs["x_2_2"](features[2], features[3])
        decoder_features["x_2_2"], decoder_features["x_1_1"], decoder_features["x_0_0"] = self.fuse2(
            (decoder_features["x_2_2"], decoder_features["x_1_1"], decoder_features["x_0_0"]))
        decoder_features["x_0_1"] = self.decoder_convs["x_0_1"](
            decoder_features["x_0_0"], torch.cat((decoder_features["x_1_1"], features[2]), 1))
        decoder_features["x_1_2"] = self.decoder_convs["x_1_2"](
            decoder_features["x_1_1"], torch.cat((decoder_features["x_2_2"], features[3]), 1))
        decoder_features["x_1_2"], decoder_features["x_0_1"] = self.fuse3(
            (decoder_features["x_1_2"], decoder_features["x_0_1"]))
        decoder_features["x_0_2"] = self.decoder_convs["x_0_2"](
            decoder_features["x_0_1"], torch.cat((decoder_features["x_1_2"], decoder_features["x_2_2"], features[3]), 1))
        return self.decoder_convs["x_0_3"](
            torch.cat((decoder_features["x_0_2"], decoder_features["x_1_2"], decoder_features["x_2_2"]), 1))


class DTD(SegmentationModel):
    def __init__(self, encoder_name="resnet18", decoder_channels=(384, 192, 96, 64), classes=1, device='cpu'):
        super().__init__()
        # Load the pretrained backbones with proper device mapping. The repo
        # stores them under checkpoints/ (the original pointed at a
        # non-existent pths/ directory).
        model_dir = os.path.dirname(os.path.abspath(__file__))
        vph_path = os.path.join(model_dir, '..', 'checkpoints', 'vph_imagenet.pt')
        swin_path = os.path.join(model_dir, '..', 'checkpoints', 'swin_imagenet.pt')

        if device == 'mps':
            # MPS cannot be used as a map_location target here; load on CPU
            self.vph = torch.load(vph_path, map_location=torch.device('cpu'))
            self.swin = torch.load(swin_path, map_location=torch.device('cpu'))
        else:
            self.vph = torch.load(vph_path, map_location=device)
            self.swin = torch.load(swin_path, map_location=device)
        self.fph = FPH()
        self.decoder = MID(encoder_channels=(96, 192, 384, 768), decoder_channels=decoder_channels)
        self.segmentation_head = SegmentationHead(in_channels=decoder_channels[-1], out_channels=classes, upsampling=2.0)
        self.addcoords = AddCoords()
        self.FU = nn.Sequential(SCSEModule(448), nn.Conv2d(448, 192, 3, 1, 1), nn.BatchNorm2d(192), nn.ReLU(True))
        self.classification_head = None
        self.initialize()

    def forward(self, x, dct, qt):
        features = self.vph(self.addcoords(x))
        features[1] = self.FU(torch.cat((features[1], self.fph(dct, qt)), 1))
        rst = self.swin[0](features[1].flatten(2).transpose(1, 2).contiguous())
        N, L, C = rst.shape
        H = W = int(L ** 0.5)
        features.append(self.vph.norm2(rst.transpose(1, 2).contiguous().view(N, C, H, W)))
        features.append(self.vph.norm3(self.swin[2](self.swin[1](rst)).transpose(1, 2).contiguous().view(N, C * 2, H // 2, W // 2)))
        decoder_output = self.decoder(*features)
        return self.segmentation_head(decoder_output)

class seg_dtd(nn.Module):
    def __init__(self, model_name='resnet18', n_class=1, device='cpu'):
        super().__init__()
        self.model = DTD(encoder_name=model_name, classes=n_class, device=device)
        self.device = device

    def forward(self, x, dct, qt):
        # Use mixed-precision autocast only on CUDA, not on CPU or MPS
        if self.device == 'cuda':
            with autocast():
                x = self.model(x, dct, qt)
        else:
            x = self.model(x, dct, qt)
        return x
models/fix_imports.py
ADDED
@@ -0,0 +1,31 @@
"""
Simple import compatibility fix for timm
"""
import sys
import torch.nn as nn

try:
    import timm.layers as new_layers

    # Alias the new module paths under the old names for backward compatibility
    sys.modules['timm.models.layers.drop'] = new_layers.drop
    sys.modules['timm.models.layers'] = new_layers

    # Also ensure the imports work
    from timm.layers import DropPath, trunc_normal_

    # Patch DropPath.__init__ so the newer attributes are always set
    def patched_droppath_init(self, drop_prob=0., scale_by_keep=True):
        super(DropPath, self).__init__()
        self.drop_prob = drop_prob
        self.scale_by_keep = scale_by_keep

    # Save the original in case it is needed
    _original_droppath_init = DropPath.__init__

    # Apply patch
    DropPath.__init__ = patched_droppath_init

    print("Import compatibility fixes applied")
except ImportError:
    # Older timm still exposes timm.models.layers; nothing to patch
    pass
models/fph.py
ADDED
@@ -0,0 +1,132 @@
| 1 |
+
from efficientnet_pytorch.utils import *
|
| 2 |
+
import os
|
| 3 |
+
import logging
|
| 4 |
+
import functools
|
| 5 |
+
import numpy as np
|
| 6 |
+
import torch
|
| 7 |
+
import torch.nn as nn
|
| 8 |
+
import torch._utils
|
| 9 |
+
import torch.nn.functional as F
|
| 10 |
+
from functools import partial
|
| 11 |
+
try:
|
| 12 |
+
from timm.models.layers import trunc_normal_, DropPath
|
| 13 |
+
except ImportError:
|
| 14 |
+
from timm.layers import trunc_normal_, DropPath
|
| 15 |
+
import collections
|
| 16 |
+
|
| 17 |
+
BlockArgs = collections.namedtuple('BlockArgs', ['num_repeat', 'kernel_size', 'stride', 'expand_ratio','input_filters', 'output_filters', 'se_ratio', 'id_skip'])
|
| 18 |
+
GlobalParams = collections.namedtuple('GlobalParams', ['width_coefficient', 'depth_coefficient', 'image_size', 'dropout_rate','num_classes', 'batch_norm_momentum', 'batch_norm_epsilon','drop_connect_rate', 'depth_divisor', 'min_depth', 'include_top'])
|
| 19 |
+
global_params = GlobalParams(width_coefficient=1.8, depth_coefficient=2.6, image_size=528, dropout_rate=0.0, num_classes=1000, batch_norm_momentum=0.99, batch_norm_epsilon=0.001, drop_connect_rate=0.0, depth_divisor=8, min_depth=None, include_top=True)
|
| 20 |
+
|
| 21 |
+
def get_width_and_height_from_size(x):
|
| 22 |
+
if isinstance(x, int):
|
| 23 |
+
return x, x
|
| 24 |
+
if isinstance(x, list) or isinstance(x, tuple):
|
| 25 |
+
return x
|
| 26 |
+
else:
|
| 27 |
+
raise TypeError()
|
| 28 |
+
|
| 29 |
+
def calculate_output_image_size(input_image_size, stride):
|
| 30 |
+
if input_image_size is None:
|
| 31 |
+
return None
|
| 32 |
+
image_height, image_width = get_width_and_height_from_size(input_image_size)
|
| 33 |
+
stride = stride if isinstance(stride, int) else stride[0]
|
| 34 |
+
image_height = int(math.ceil(image_height / stride))
|
| 35 |
+
image_width = int(math.ceil(image_width / stride))
|
| 36 |
+
return [image_height, image_width]
|
| 37 |
+
|
| 38 |
+
class MBConvBlock(nn.Module):
|
| 39 |
+
def __init__(self, block_args, global_params, image_size=25):
|
| 40 |
+
super().__init__()
|
| 41 |
+
self._block_args = block_args
|
| 42 |
+
self._bn_mom = 1 - global_params.batch_norm_momentum # pytorch's difference from tensorflow
|
| 43 |
+
self._bn_eps = global_params.batch_norm_epsilon
|
| 44 |
+
        self.has_se = (self._block_args.se_ratio is not None) and (0 < self._block_args.se_ratio <= 1)
        self.id_skip = block_args.id_skip  # whether to use skip connection and drop connect

        # Expansion phase (inverted bottleneck)
        inp = self._block_args.input_filters  # number of input channels
        oup = self._block_args.input_filters * self._block_args.expand_ratio  # number of output channels
        if self._block_args.expand_ratio != 1:
            Conv2d = get_same_padding_conv2d(image_size=image_size)
            self._expand_conv = Conv2d(in_channels=inp, out_channels=oup, kernel_size=1, bias=False)
            self._bn0 = nn.BatchNorm2d(num_features=oup, momentum=self._bn_mom, eps=self._bn_eps)

        # Depthwise convolution phase
        k = self._block_args.kernel_size
        s = self._block_args.stride
        Conv2d = get_same_padding_conv2d(image_size=image_size)
        self._depthwise_conv = Conv2d(
            in_channels=oup, out_channels=oup, groups=oup,  # groups makes it depthwise
            kernel_size=k, stride=s, bias=False)
        self._bn1 = nn.BatchNorm2d(num_features=oup, momentum=self._bn_mom, eps=self._bn_eps)
        image_size = calculate_output_image_size(image_size, s)

        # Squeeze-and-excitation layers, if enabled
        if self.has_se:
            Conv2d = get_same_padding_conv2d(image_size=(1, 1))
            num_squeezed_channels = max(1, int(self._block_args.input_filters * self._block_args.se_ratio))
            self._se_reduce = Conv2d(in_channels=oup, out_channels=num_squeezed_channels, kernel_size=1)
            self._se_expand = Conv2d(in_channels=num_squeezed_channels, out_channels=oup, kernel_size=1)

        # Pointwise (projection) convolution phase
        final_oup = self._block_args.output_filters
        Conv2d = get_same_padding_conv2d(image_size=image_size)
        self._project_conv = Conv2d(in_channels=oup, out_channels=final_oup, kernel_size=1, bias=False)
        self._bn2 = nn.BatchNorm2d(num_features=final_oup, momentum=self._bn_mom, eps=self._bn_eps)
        self._swish = MemoryEfficientSwish()

    def forward(self, inputs, drop_connect_rate=None):
        x = inputs
        if self._block_args.expand_ratio != 1:
            x = self._expand_conv(inputs)
            x = self._bn0(x)
            x = self._swish(x)

        x = self._depthwise_conv(x)
        x = self._bn1(x)
        x = self._swish(x)

        if self.has_se:
            x_squeezed = F.adaptive_avg_pool2d(x, 1)
            x_squeezed = self._se_reduce(x_squeezed)
            x_squeezed = self._swish(x_squeezed)
            x_squeezed = self._se_expand(x_squeezed)
            x = torch.sigmoid(x_squeezed) * x

        x = self._project_conv(x)
        x = self._bn2(x)

        # Skip connection and drop connect
        input_filters, output_filters = self._block_args.input_filters, self._block_args.output_filters
        if self.id_skip and self._block_args.stride == 1 and input_filters == output_filters:
            if drop_connect_rate:
                x = drop_connect(x, p=drop_connect_rate, training=self.training)
            x = x + inputs  # skip connection
        return x

    def set_swish(self, memory_efficient=True):
        self._swish = MemoryEfficientSwish() if memory_efficient else Swish()


class AddCoords(nn.Module):
    """Append normalized x/y coordinate channels (and optionally a radial channel), CoordConv-style."""

    def __init__(self, with_r=True):
        super().__init__()
        self.with_r = with_r

    def forward(self, input_tensor):
        batch_size, _, x_dim, y_dim = input_tensor.size()
        xx_c, yy_c = torch.meshgrid(torch.arange(x_dim, dtype=input_tensor.dtype),
                                    torch.arange(y_dim, dtype=input_tensor.dtype), indexing='ij')
        xx_c = xx_c.to(input_tensor.device) / (x_dim - 1) * 2 - 1
        yy_c = yy_c.to(input_tensor.device) / (y_dim - 1) * 2 - 1
        xx_c = xx_c.expand(batch_size, 1, x_dim, y_dim)
        yy_c = yy_c.expand(batch_size, 1, x_dim, y_dim)
        ret = torch.cat((input_tensor, xx_c, yy_c), dim=1)
        if self.with_r:
            rr = torch.sqrt(torch.pow(xx_c - 0.5, 2) + torch.pow(yy_c - 0.5, 2))
            ret = torch.cat([ret, rr], dim=1)
        return ret


class FPH(nn.Module):
    """Frequency head: fuses JPEG DCT-coefficient features with their quantization table."""

    def __init__(self):
        super(FPH, self).__init__()
        self.obembed = nn.Embedding(21, 21).from_pretrained(torch.eye(21))  # frozen one-hot embedding
        self.qtembed = nn.Embedding(64, 16)
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_channels=21, out_channels=64, kernel_size=3, stride=1, dilation=8, padding=8),
            nn.BatchNorm2d(64, momentum=0.01),
            nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(
            nn.Conv2d(in_channels=64, out_channels=16, kernel_size=1, stride=1, padding=0, bias=False),
            nn.BatchNorm2d(16, momentum=0.01),
            nn.ReLU(inplace=True))
        self.addcoords = AddCoords()
        repeats = (1, 1, 1)
        in_channels = (256, 256, 256)
        out_channels = (256, 256, 512)  # kept for reference; conv0 stays at 256 channels
        # 35 input channels = 16 modulated + 16 raw feature channels + 2 coordinate + 1 radial
        self.conv0 = nn.Sequential(
            nn.Conv2d(in_channels=35, out_channels=256, kernel_size=8, stride=8, padding=0, bias=False),
            nn.BatchNorm2d(256, momentum=0.01),
            nn.ReLU(inplace=True),
            MBConvBlock(BlockArgs(num_repeat=repeats[0], kernel_size=3, stride=[1], expand_ratio=6,
                                  input_filters=in_channels[0], output_filters=in_channels[1],
                                  se_ratio=0.25, id_skip=True), global_params),
            MBConvBlock(BlockArgs(num_repeat=repeats[0], kernel_size=3, stride=[1], expand_ratio=6,
                                  input_filters=in_channels[1], output_filters=in_channels[1],
                                  se_ratio=0.25, id_skip=True), global_params),
            MBConvBlock(BlockArgs(num_repeat=repeats[0], kernel_size=3, stride=[1], expand_ratio=6,
                                  input_filters=in_channels[1], output_filters=in_channels[1],
                                  se_ratio=0.25, id_skip=True), global_params),
        )

    def forward(self, x, qtable):
        # One-hot embed the coefficient indices, then two conv stages -> B x 16 x H x W
        x = self.conv2(self.conv1(self.obembed(x).permute(0, 3, 1, 2).contiguous()))
        B, C, H, W = x.shape
        # Scale each 8x8 JPEG block of the features by its embedded quantization-table
        # entry, concatenate with the unscaled features, append coordinate channels,
        # and project down to the 256-channel frequency feature map.
        qt = self.qtembed(qtable.unsqueeze(-1).unsqueeze(-1).long()).transpose(1, 6).squeeze(6).contiguous()
        blocks = x.reshape(B, C, H // 8, 8, W // 8, 8).permute(0, 1, 3, 5, 2, 4)
        modulated = (blocks * qt).permute(0, 1, 4, 2, 5, 3).reshape(B, C, H, W)
        return self.conv0(self.addcoords(torch.cat((modulated, x), dim=1)))
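
A quick shape check for the head above (a sketch, not part of the commit; it assumes `models/fph.py` defines `global_params` at module level, outside the excerpt shown). `dct` holds per-pixel coefficient indices in `[0, 21)`, `qtable` one 8×8 quantization table with entries in `[0, 64)`, and the spatial dimensions must be multiples of 8:

```python
import torch
from models.fph import FPH

fph = FPH().eval()
dct = torch.randint(0, 21, (1, 512, 512))     # per-pixel DCT-coefficient indices
qtable = torch.randint(0, 64, (1, 1, 8, 8))   # one JPEG quantization table
with torch.no_grad():
    feat = fph(dct, qtable)
print(feat.shape)  # torch.Size([1, 256, 64, 64]) -- H/8 x W/8 frequency features
```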
models/patch_droppath.py
ADDED
@@ -0,0 +1,17 @@
"""
Patch DropPath for compatibility
"""
try:
    from timm.layers import DropPath

    # Give DropPath instances a default `scale_by_keep` attribute so modules
    # created under older timm versions keep working.  Overriding __getattr__
    # is safe here only because DropPath registers no parameters, buffers, or
    # submodules that rely on nn.Module's attribute lookup.
    def patched_droppath_getattr(self, name):
        if name == 'scale_by_keep':
            return True
        return object.__getattribute__(self, name)

    DropPath.__getattr__ = patched_droppath_getattr

    print("DropPath patched for compatibility")
except ImportError:
    pass
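
A quick check of the patch above (a sketch; importing the module is what applies it, and the `pop` simulates an instance created under an older timm that never set the attribute):

```python
import models.patch_droppath            # applying the patch is an import side effect
from timm.layers import DropPath

dp = DropPath(0.1)
dp.__dict__.pop('scale_by_keep', None)  # pretend the attribute was never set
print(dp.scale_by_keep)                 # True, supplied by the patched __getattr__
```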
models/patch_gelu.py
ADDED
@@ -0,0 +1,34 @@
"""
Monkey patch GELU to fix compatibility issues
"""
import torch
import torch.nn as nn
import torch.nn.functional as F

# Patch the forward method of existing GELU instances so modules created
# before the `approximate` argument existed still run.
def patched_gelu_forward(self, input):
    return F.gelu(input)

# Save the original so it can be restored if needed
_original_gelu_forward = nn.GELU.forward

# Apply the patch
nn.GELU.forward = patched_gelu_forward

# Also provide a drop-in class for newly constructed GELUs
class PatchedGELU(nn.Module):
    def __init__(self, approximate='none'):
        super().__init__()

    def forward(self, input):
        return F.gelu(input)

    def __getattr__(self, name):
        # Only reached when normal lookup fails; PatchedGELU registers no
        # parameters or submodules, so this shortcut is safe.
        if name == 'approximate':
            return 'none'
        raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")

# Replace the class too
nn.GELU = PatchedGELU

print("GELU patched for compatibility")
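
A quick check of the patch above (a sketch): after the import, `nn.GELU` is `PatchedGELU`, which always applies the exact (non-approximate) GELU and reports `approximate='none'`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import models.patch_gelu   # replaces nn.GELU on import

act = nn.GELU()            # actually a PatchedGELU
x = torch.randn(4)
print(torch.allclose(act(x), F.gelu(x)))  # True
print(act.approximate)                    # 'none'
```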
models/swins.py
ADDED
@@ -0,0 +1,454 @@
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.checkpoint as checkpoint
try:
    from timm.models.layers import DropPath, to_2tuple, trunc_normal_
except ImportError:
    from timm.layers import DropPath, to_2tuple, trunc_normal_
import numpy as np


class Mlp(nn.Module):
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = nn.GELU()  # act_layer() bypassed; forward calls F.gelu directly
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        x = self.fc1(x)
        x = F.gelu(x)
        x = self.drop(x)
        x = self.fc2(x)
        x = self.drop(x)
        return x


def window_partition(x, window_size):
    # (B, H, W, C) -> (num_windows*B, window_size, window_size, C)
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)
    return windows


def window_reverse(windows, window_size, H, W):
    # (num_windows*B, window_size, window_size, C) -> (B, H, W, C)
    B = int(windows.shape[0] / (H * W / window_size / window_size))
    x = windows.view(B, H // window_size, W // window_size, window_size, window_size, -1)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)
    return x

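# Sanity sketch: window_partition and window_reverse are exact inverses for any
# H, W divisible by window_size, e.g.
#
#     x = torch.randn(2, 16, 16, 32)                       # B, H, W, C
#     w = window_partition(x, 8)                           # (8, 8, 8, 32)
#     assert torch.equal(window_reverse(w, 8, 16, 16), x)
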
class WindowAttention(nn.Module):
    def __init__(self, dim, window_size, num_heads, qkv_bias=True, attn_drop=0., proj_drop=0.,
                 pretrained_window_size=[0, 0]):
        super().__init__()
        self.dim = dim
        self.window_size = window_size  # Wh, Ww
        self.pretrained_window_size = pretrained_window_size
        self.num_heads = num_heads

        self.logit_scale = nn.Parameter(torch.log(10 * torch.ones((num_heads, 1, 1))), requires_grad=True)

        # mlp to generate continuous relative position bias
        self.cpb_mlp = nn.Sequential(nn.Linear(2, 512, bias=True),
                                     nn.ReLU(inplace=True),
                                     nn.Linear(512, num_heads, bias=False))

        # get relative_coords_table
        relative_coords_h = torch.arange(-(self.window_size[0] - 1), self.window_size[0], dtype=torch.float32)
        relative_coords_w = torch.arange(-(self.window_size[1] - 1), self.window_size[1], dtype=torch.float32)
        relative_coords_table = torch.stack(
            torch.meshgrid([relative_coords_h,
                            relative_coords_w], indexing='ij')).permute(1, 2, 0).contiguous().unsqueeze(0)  # 1, 2*Wh-1, 2*Ww-1, 2
        if pretrained_window_size[0] > 0:
            relative_coords_table[:, :, :, 0] /= (pretrained_window_size[0] - 1)
            relative_coords_table[:, :, :, 1] /= (pretrained_window_size[1] - 1)
        else:
            relative_coords_table[:, :, :, 0] /= (self.window_size[0] - 1)
            relative_coords_table[:, :, :, 1] /= (self.window_size[1] - 1)
        relative_coords_table *= 8  # normalize to -8, 8
        relative_coords_table = torch.sign(relative_coords_table) * torch.log2(
            torch.abs(relative_coords_table) + 1.0) / np.log2(8)

        self.register_buffer("relative_coords_table", relative_coords_table)

        # get pair-wise relative position index for each token inside the window
        coords_h = torch.arange(self.window_size[0])
        coords_w = torch.arange(self.window_size[1])
        coords = torch.stack(torch.meshgrid([coords_h, coords_w], indexing='ij'))  # 2, Wh, Ww
        coords_flatten = torch.flatten(coords, 1)  # 2, Wh*Ww
        relative_coords = coords_flatten[:, :, None] - coords_flatten[:, None, :]  # 2, Wh*Ww, Wh*Ww
        relative_coords = relative_coords.permute(1, 2, 0).contiguous()  # Wh*Ww, Wh*Ww, 2
        relative_coords[:, :, 0] += self.window_size[0] - 1  # shift to start from 0
        relative_coords[:, :, 1] += self.window_size[1] - 1
        relative_coords[:, :, 0] *= 2 * self.window_size[1] - 1
        relative_position_index = relative_coords.sum(-1)  # Wh*Ww, Wh*Ww
        self.register_buffer("relative_position_index", relative_position_index)

        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        if qkv_bias:
            self.q_bias = nn.Parameter(torch.zeros(dim))
            self.v_bias = nn.Parameter(torch.zeros(dim))
        else:
            self.q_bias = None
            self.v_bias = None
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x, mask=None):
        B_, N, C = x.shape
        qkv_bias = None
        if self.q_bias is not None:
            # k gets no bias: concatenate q_bias, zeros, v_bias
            qkv_bias = torch.cat((self.q_bias, torch.zeros_like(self.v_bias, requires_grad=False), self.v_bias))
        qkv = F.linear(input=x, weight=self.qkv.weight, bias=qkv_bias)
        qkv = qkv.reshape(B_, N, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]  # make torchscript happy (cannot use tensor as tuple)

        # cosine attention
        attn = (F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1))
        logit_scale = torch.clamp(self.logit_scale, max=torch.log(torch.tensor(1. / 0.01, device=attn.device))).exp()
        attn = attn * logit_scale

        relative_position_bias_table = self.cpb_mlp(self.relative_coords_table).view(-1, self.num_heads)
        relative_position_bias = relative_position_bias_table[self.relative_position_index.view(-1)].view(
            self.window_size[0] * self.window_size[1], self.window_size[0] * self.window_size[1], -1)  # Wh*Ww,Wh*Ww,nH
        relative_position_bias = relative_position_bias.permute(2, 0, 1).contiguous()  # nH, Wh*Ww, Wh*Ww
        relative_position_bias = 16 * torch.sigmoid(relative_position_bias)
        attn = attn + relative_position_bias.unsqueeze(0)

        if mask is not None:
            nW = mask.shape[0]
            attn = attn.view(B_ // nW, nW, self.num_heads, N, N) + mask.unsqueeze(1).unsqueeze(0)
            attn = attn.view(-1, self.num_heads, N, N)
        attn = self.softmax(attn)

        attn = self.attn_drop(attn)

        x = (attn @ v).transpose(1, 2).reshape(B_, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)
        return x

    def extra_repr(self) -> str:
        return f'dim={self.dim}, window_size={self.window_size}, ' \
               f'pretrained_window_size={self.pretrained_window_size}, num_heads={self.num_heads}'


class SwinTransformerBlock(nn.Module):
    def __init__(self, dim, input_resolution, num_heads, window_size=7, shift_size=0,
                 mlp_ratio=4., qkv_bias=True, drop=0., attn_drop=0., drop_path=0.,
                 act_layer=nn.GELU, norm_layer=nn.LayerNorm, pretrained_window_size=0):
        super().__init__()
        self.dim = dim
        self.input_resolution = input_resolution
        self.num_heads = num_heads
        self.window_size = window_size
        self.shift_size = shift_size
        self.mlp_ratio = mlp_ratio
        if min(self.input_resolution) <= self.window_size:
            # if window size is larger than input resolution, we don't partition windows
            self.shift_size = 0
            self.window_size = min(self.input_resolution)
        assert 0 <= self.shift_size < self.window_size, "shift_size must be in [0, window_size)"

        self.norm1 = norm_layer(dim)
        self.attn = WindowAttention(
            dim, window_size=to_2tuple(self.window_size), num_heads=num_heads,
            qkv_bias=qkv_bias, attn_drop=attn_drop, proj_drop=drop,
            pretrained_window_size=to_2tuple(pretrained_window_size))

        self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
        self.norm2 = norm_layer(dim)
        mlp_hidden_dim = int(dim * mlp_ratio)
        self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)

        if self.shift_size > 0:
            # calculate attention mask for SW-MSA
            H, W = self.input_resolution
            img_mask = torch.zeros((1, H, W, 1))  # 1 H W 1
            h_slices = (slice(0, -self.window_size),
                        slice(-self.window_size, -self.shift_size),
                        slice(-self.shift_size, None))
            w_slices = (slice(0, -self.window_size),
                        slice(-self.window_size, -self.shift_size),
                        slice(-self.shift_size, None))
            cnt = 0
            for h in h_slices:
                for w in w_slices:
                    img_mask[:, h, w, :] = cnt
                    cnt += 1

            mask_windows = window_partition(img_mask, self.window_size)  # nW, window_size, window_size, 1
            mask_windows = mask_windows.view(-1, self.window_size * self.window_size)
            attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)
            attn_mask = attn_mask.masked_fill(attn_mask != 0, float(-100.0)).masked_fill(attn_mask == 0, float(0.0))
        else:
            attn_mask = None

        self.register_buffer("attn_mask", attn_mask)

    def forward(self, x):
        # H, W = self.input_resolution  # resolution is inferred from L instead
        B, L, C = x.shape
        H = W = int(L ** 0.5)
        assert L == H * W, "input feature has wrong size"

        shortcut = x
        x = x.view(B, H, W, C)

        # cyclic shift
        if self.shift_size > 0:
            shifted_x = torch.roll(x, shifts=(-self.shift_size, -self.shift_size), dims=(1, 2))
        else:
            shifted_x = x

        # partition windows
        x_windows = window_partition(shifted_x, self.window_size)  # nW*B, window_size, window_size, C
        x_windows = x_windows.view(-1, self.window_size * self.window_size, C)  # nW*B, window_size*window_size, C

        # W-MSA/SW-MSA
        attn_windows = self.attn(x_windows, mask=self.attn_mask)  # nW*B, window_size*window_size, C

        # merge windows
        attn_windows = attn_windows.view(-1, self.window_size, self.window_size, C)
        shifted_x = window_reverse(attn_windows, self.window_size, H, W)  # B H' W' C

        # reverse cyclic shift
        if self.shift_size > 0:
            x = torch.roll(shifted_x, shifts=(self.shift_size, self.shift_size), dims=(1, 2))
        else:
            x = shifted_x
        x = x.view(B, H * W, C)
        # residual post-norm (Swin V2); drop_path disabled: self.drop_path(self.norm1(x))
        x = shortcut + self.norm1(x)

        # FFN, also residual post-norm; drop_path disabled: self.drop_path(self.norm2(self.mlp(x)))
        x = x + self.norm2(self.mlp(x))

        return x

    def extra_repr(self) -> str:
        return f"dim={self.dim}, input_resolution={self.input_resolution}, num_heads={self.num_heads}, " \
               f"window_size={self.window_size}, shift_size={self.shift_size}, mlp_ratio={self.mlp_ratio}"

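# Shape sketch: a block maps (B, L, C) -> (B, L, C) for square L = H*W, e.g.
#
#     blk = SwinTransformerBlock(dim=96, input_resolution=(16, 16),
#                                num_heads=4, window_size=8)
#     tokens = torch.randn(2, 256, 96)
#     print(blk(tokens).shape)  # torch.Size([2, 256, 96])
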
class PatchMerging(nn.Module):

    def __init__(self, input_resolution, dim, norm_layer=nn.LayerNorm):
        super().__init__()
        self.input_resolution = input_resolution
        self.dim = dim
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)
        self.norm = norm_layer(2 * dim)

    def forward(self, x):
        """
        x: B, H*W, C
        """
        # H, W = self.input_resolution  # resolution is inferred from L instead
        B, L, C = x.shape
        H = W = int(L ** 0.5)
        assert L == H * W, "input feature has wrong size"
        assert H % 2 == 0 and W % 2 == 0, f"x size ({H}*{W}) is not even."

        x = x.view(B, H, W, C)

        x0 = x[:, 0::2, 0::2, :]  # B H/2 W/2 C
        x1 = x[:, 1::2, 0::2, :]  # B H/2 W/2 C
        x2 = x[:, 0::2, 1::2, :]  # B H/2 W/2 C
        x3 = x[:, 1::2, 1::2, :]  # B H/2 W/2 C
        x = torch.cat([x0, x1, x2, x3], -1)  # B H/2 W/2 4*C
        x = x.view(B, -1, 4 * C)  # B H/2*W/2 4*C

        x = self.reduction(x)
        x = self.norm(x)

        return x

    def extra_repr(self) -> str:
        return f"input_resolution={self.input_resolution}, dim={self.dim}"

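# Shape sketch: PatchMerging quarters the token count and doubles the width:
#
#     pm = PatchMerging(input_resolution=(16, 16), dim=96)
#     tokens = torch.randn(2, 256, 96)
#     print(pm(tokens).shape)  # torch.Size([2, 64, 192])
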
class BasicLayer(nn.Module):
    def __init__(self, dim, input_resolution, depth, num_heads, window_size,
                 mlp_ratio=4., qkv_bias=True, drop=0., attn_drop=0.,
                 drop_path=0., norm_layer=nn.LayerNorm, downsample=None, use_checkpoint=False,
                 pretrained_window_size=0):
        super().__init__()
        self.dim = dim
        self.input_resolution = input_resolution
        self.depth = depth
        self.use_checkpoint = use_checkpoint

        # build blocks: even blocks use plain W-MSA, odd blocks use shifted windows
        self.blocks = nn.ModuleList([
            SwinTransformerBlock(dim=dim, input_resolution=input_resolution,
                                 num_heads=num_heads, window_size=window_size,
                                 shift_size=0 if (i % 2 == 0) else window_size // 2,
                                 mlp_ratio=mlp_ratio,
                                 qkv_bias=qkv_bias,
                                 drop=drop, attn_drop=attn_drop,
                                 drop_path=drop_path[i] if isinstance(drop_path, list) else drop_path,
                                 norm_layer=norm_layer,
                                 pretrained_window_size=pretrained_window_size)
            for i in range(depth)])

        # patch merging layer
        if downsample is not None:
            self.downsample = downsample(input_resolution, dim=dim, norm_layer=norm_layer)
        else:
            self.downsample = None

    def forward(self, x):
        for blk in self.blocks:
            if self.use_checkpoint:
                x = checkpoint.checkpoint(blk, x)
            else:
                x = blk(x)
        if self.downsample is not None:
            x = self.downsample(x)
        return x

    def extra_repr(self) -> str:
        return f"dim={self.dim}, input_resolution={self.input_resolution}, depth={self.depth}"

    def _init_respostnorm(self):
        # zero-init the residual post-norms so each block starts close to identity
        for blk in self.blocks:
            nn.init.constant_(blk.norm1.bias, 0)
            nn.init.constant_(blk.norm1.weight, 0)
            nn.init.constant_(blk.norm2.bias, 0)
            nn.init.constant_(blk.norm2.weight, 0)

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=4, in_chans=3, embed_dim=96, norm_layer=None):
        super().__init__()
        img_size = to_2tuple(img_size)
        patch_size = to_2tuple(patch_size)
        patches_resolution = [img_size[0] // patch_size[0], img_size[1] // patch_size[1]]
        self.img_size = img_size
        self.patch_size = patch_size
        self.patches_resolution = patches_resolution
        self.num_patches = patches_resolution[0] * patches_resolution[1]

        self.in_chans = in_chans
        self.embed_dim = embed_dim

        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        if norm_layer is not None:
            self.norm = norm_layer(embed_dim)
        else:
            self.norm = None

    def forward(self, x):
        B, C, H, W = x.shape
        # FIXME look at relaxing size constraints
        assert H == self.img_size[0] and W == self.img_size[1], \
            f"Input image size ({H}*{W}) doesn't match model ({self.img_size[0]}*{self.img_size[1]})."
        x = self.proj(x).flatten(2).transpose(1, 2)  # B Ph*Pw C
        if self.norm is not None:
            x = self.norm(x)
        return x

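# Shape sketch: a 256x256 RGB image becomes 64x64 = 4096 patch tokens:
#
#     pe = PatchEmbed(img_size=256, patch_size=4, in_chans=3, embed_dim=128)
#     img = torch.randn(1, 3, 256, 256)
#     print(pe(img).shape)  # torch.Size([1, 4096, 128])
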
class SwinTransformerV2(nn.Module):
    def __init__(self, img_size=256, patch_size=4, in_chans=3, num_classes=1000,
                 embed_dim=128, depths=[2, 2, 18, 2], num_heads=[4, 8, 16, 32],
                 window_size=8, mlp_ratio=4., qkv_bias=True,
                 drop_rate=0., attn_drop_rate=0., drop_path_rate=0.0,
                 norm_layer=nn.LayerNorm, ape=False, patch_norm=True,
                 use_checkpoint=False, pretrained_window_sizes=[8, 8, 8, 6], **kwargs):
        super().__init__()

        self.num_classes = num_classes
        self.num_layers = len(depths)
        self.embed_dim = embed_dim
        self.ape = ape
        self.patch_norm = patch_norm
        self.num_features = int(embed_dim * 2 ** (self.num_layers - 1))
        self.mlp_ratio = mlp_ratio

        # split image into non-overlapping patches
        self.patch_embed = PatchEmbed(
            img_size=img_size, patch_size=patch_size, in_chans=in_chans, embed_dim=embed_dim,
            norm_layer=norm_layer if self.patch_norm else None)
        num_patches = self.patch_embed.num_patches
        patches_resolution = self.patch_embed.patches_resolution
        self.patches_resolution = patches_resolution

        # absolute position embedding
        if self.ape:
            self.absolute_pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
            trunc_normal_(self.absolute_pos_embed, std=.02)

        self.pos_drop = nn.Dropout(p=drop_rate)

        # stochastic depth
        dpr = [x.item() for x in torch.linspace(0, drop_path_rate, sum(depths))]  # stochastic depth decay rule

        # build layers
        self.layers = nn.ModuleList()
        for i_layer in range(self.num_layers):
            layer = BasicLayer(dim=int(embed_dim * 2 ** i_layer),
                               input_resolution=(patches_resolution[0] // (2 ** i_layer),
                                                 patches_resolution[1] // (2 ** i_layer)),
                               depth=depths[i_layer],
                               num_heads=num_heads[i_layer],
                               window_size=window_size,
                               mlp_ratio=self.mlp_ratio,
                               qkv_bias=qkv_bias,
                               drop=drop_rate, attn_drop=attn_drop_rate,
                               drop_path=dpr[sum(depths[:i_layer]):sum(depths[:i_layer + 1])],
                               norm_layer=norm_layer,
                               downsample=PatchMerging if (i_layer < self.num_layers - 1) else None,
                               use_checkpoint=use_checkpoint,
                               pretrained_window_size=pretrained_window_sizes[i_layer])
            self.layers.append(layer)

        self.norm = norm_layer(self.num_features)
        self.avgpool = nn.AdaptiveAvgPool1d(1)
        self.head = nn.Linear(self.num_features, num_classes) if num_classes > 0 else nn.Identity()

        self.apply(self._init_weights)
        for bly in self.layers:
            bly._init_respostnorm()

    def _init_weights(self, m):
        if isinstance(m, nn.Linear):
            trunc_normal_(m.weight, std=.02)
            if isinstance(m, nn.Linear) and m.bias is not None:
                nn.init.constant_(m.bias, 0)
        elif isinstance(m, nn.LayerNorm):
            nn.init.constant_(m.bias, 0)
            nn.init.constant_(m.weight, 1.0)

    @torch.jit.ignore
    def no_weight_decay(self):
        return {'absolute_pos_embed'}

    @torch.jit.ignore
    def no_weight_decay_keywords(self):
        return {"cpb_mlp", "logit_scale", 'relative_position_bias_table'}

    def forward(self, x):
        x = self.patch_embed(x)
        if self.ape:
            x = x + self.absolute_pos_embed
        x = self.pos_drop(x)

        for layer in self.layers:
            x = layer(x)

        x = self.norm(x)
        return x
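
Note that `forward` returns normalized token features; `self.avgpool` and `self.head` are built but never applied, so the class acts as a feature backbone here. A minimal smoke test for the default configuration (a sketch):

```python
import torch
from models.swins import SwinTransformerV2

model = SwinTransformerV2()  # defaults: img_size=256, patch_size=4, embed_dim=128
model.eval()
with torch.no_grad():
    out = model(torch.randn(1, 3, 256, 256))
print(out.shape)  # torch.Size([1, 64, 1024]) -- 8x8 tokens at the final stage
```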
requirements.txt
ADDED
@@ -0,0 +1,11 @@
gradio==4.44.0
torch==2.0.1
torchvision==0.15.2
opencv-python-headless==4.8.1.78
numpy==1.24.3
pillow==10.0.0
jpegio==0.2.3
segmentation-models-pytorch==0.3.3
timm==0.9.12
efficientnet-pytorch==0.7.1
albumentations==1.3.1