feat: make it possible to use different ViT formats and architectures
- .github/copilot-instructions.md +32 -19
- app.py +9 -9
- utils/model_loader.py +218 -53
.github/copilot-instructions.md
CHANGED

@@ -3,7 +3,7 @@
 ## Project Overview
 
 **ViTViz** is a Gradio-based web app for visualizing Vision Transformer (ViT) attention mechanisms and adversarial attacks on image classification. The app supports:
-- Custom ViT model upload (.pth files) or Hugging Face Hub models
+- Custom ViT model upload (.pth, .pt, .safetensors files) or Hugging Face Hub models
 - Multiple adversarial attack methods (FGSM, PGD, MIM, TGR, SAGA)
 - Attention visualization via Attention Rollout and per-layer/per-head views
 - Interactive iteration-by-iteration comparison of adversarial examples

@@ -24,40 +24,53 @@
 ### Key Design Patterns
 
 #### Dynamic Architecture Support (ViTConfig)
-The codebase …
+The codebase supports **any ViT architecture** with a timm-compatible structure (`model.blocks[i].attn.qkv`), not limited to predefined model names. Architecture parameters are inferred automatically from the state_dict:
 
 ```python
-from utils.model_loader import ViTConfig, …
+from utils.model_loader import ViTConfig, create_vit_from_config, infer_config_from_state_dict
 
 # ViTConfig contains all architecture parameters
 config = ViTConfig(
-    embed_dim=768,    # 384…
-    num_heads=12,     # …
-    num_layers=12,    # …
-    patch_size=16,    # …
-    img_size=224,     # 224, 384, etc.
-    num_classes=1000
+    embed_dim=768,    # Any value (192, 384, 512, 768, 1024, etc.)
+    num_heads=12,     # Any valid divisor of embed_dim
+    num_layers=12,    # Any depth
+    patch_size=16,    # 8, 14, 16, 32, etc.
+    img_size=224,     # 224, 384, 448, etc.
+    num_classes=1000,
+    mlp_ratio=4.0,    # MLP hidden dim = embed_dim * mlp_ratio
+    qkv_bias=True     # Whether QKV projection has bias
 )
 
+# Create model directly from config (no predefined names needed)
+model = create_vit_from_config(config, device=DEVICE)
+
 # Properties computed automatically
 config.grid_size        # img_size // patch_size (e.g., 14 for 224/16)
 config.num_patches      # grid_size ** 2
-config.timm_model_name  # …
+config.timm_model_name  # Informational: "vit_base_patch16_224" or "vit_custom_patch16_224"
 ```
 
+**Inference from state_dict**: When loading a checkpoint, all parameters are inferred automatically:
+- `embed_dim`: from `blocks.0.attn.qkv.weight.shape[1]`
+- `num_heads`: heuristic based on common head_dim values (64, 32, 96)
+- `num_layers`: count of `blocks.X.attn.*` keys
+- `patch_size`: from `patch_embed.proj.weight.shape[2]`
+- `img_size`: from `pos_embed.shape[1]` (num_patches + 1)
+- `mlp_ratio`: from `blocks.0.mlp.fc1.weight.shape[0] / embed_dim`
+- `qkv_bias`: presence of `blocks.0.attn.qkv.bias` key
+
+**Validation**: Use `validate_vit_structure(model)` to check whether a model has the required structure before attempting attention extraction.
 
 #### Model Loading Strategy
 The codebase supports multiple model sources:
-1. **Local …
+1. **Local files**: `.pth`, `.pt`, `.safetensors` - can contain a full model, a `state_dict`, a `model_state_dict`, or a checkpoint dict with `class_names`
 2. **Hugging Face Hub**: Use the `hf-model://username/repo-name` format; automatically converts the HF ViT to a timm-compatible format
 3. **Special `hf://` URIs**: For CNN backbones in SAGA attacks (e.g., `hf://lucasddmc/resnet101-stanford40-actions/resnet.pth`)
 
+**Supported file formats**:
+- `.pth` / `.pt`: Standard PyTorch checkpoint (torch.load)
+- `.safetensors`: Modern HuggingFace format (faster, more secure)
+
 The main loader returns 4 values:
 ```python
 model, class_names, label_source, vit_config = load_model_and_labels(model_path, None, device=DEVICE)

@@ -143,7 +156,7 @@ The app injects Bootstrap Icons via CDN and custom CSS for panels/tables. Icon c…
 
 ## External Dependencies
 
-- **timm**: ViT model architecture (…
+- **timm**: ViT model architecture (VisionTransformer class for flexible model creation)
 - **torchattacks**: Base classes for adversarial attacks
 - **transformers**: Optional, for loading HF Hub models
 - **gradio**: Version 5.49.1 (specified in requirements)

@@ -158,7 +171,7 @@ Currently no automated tests. Manual testing workflow:
 
 ## Known Limitations
 
-- Supports timm ViT …
+- Supports any timm-compatible ViT (must have `model.blocks[i].attn.qkv` structure)
 - No support for non-standard ViT variants (DeiT distillation token, Swin hierarchical, BEiT) without additional conversion
 - Custom CSS may break with Gradio version updates
 - No batch processing support (processes one image at a time)
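
Taken together, the loading flow documented above runs: infer a config from a checkpoint's state_dict, build the model from the config, then validate. A minimal sketch against the new helpers (an editor's illustration, not part of the commit; the checkpoint filename is hypothetical):

```python
import torch
from utils.model_loader import (
    create_vit_from_config,
    infer_config_from_state_dict,
    validate_vit_structure,
)

# Hypothetical local checkpoint saved as a plain state_dict
state_dict = torch.load("vit_checkpoint.pth", map_location="cpu", weights_only=False)

config = infer_config_from_state_dict(state_dict)   # embed_dim, num_heads, ... inferred
model = create_vit_from_config(config, device=torch.device("cpu"))
model.load_state_dict(state_dict, strict=False)     # strict=False tolerates extra keys

is_valid, error = validate_vit_structure(model)     # confirms blocks[i].attn.qkv exists
assert is_valid, error
```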

app.py
CHANGED

@@ -83,7 +83,7 @@ def classify_image(model_file, use_hf_vit: bool, image):
     """
     try:
         if not use_hf_vit and model_file is None:
-            return "Please upload a model file (.pth) or enable 'Use vit-b16-stanford40-actions'"
+            return "Please upload a model file (.pth/.pt/.safetensors/.ckpt) or enable 'Use vit-b16-stanford40-actions'"
         # Extract the paths from the Gradio file components
         model_path = HF_VIT_MODEL_SPEC if use_hf_vit else _to_path(model_file)
 
@@ -153,7 +153,7 @@ def visualize_attention(
     """
     try:
         if not use_hf_vit and model_file is None:
-            return None, "Please upload a model file (.pth) or enable 'Use vit-b16-stanford40-actions'"
+            return None, "Please upload a model file (.pth/.pt/.safetensors/.ckpt) or enable 'Use vit-b16-stanford40-actions'"
         if image is None:
             return None, "Please upload an image"
 
@@ -244,7 +244,7 @@ def run_attack(
     """
     try:
         if not use_hf_vit and model_file is None:
-            return [], "Please upload a model file (.pth) or enable 'Use vit-b16-stanford40-actions'", []
+            return [], "Please upload a model file (.pth/.pt/.safetensors/.ckpt) or enable 'Use vit-b16-stanford40-actions'", []
         if image is None:
             return [], "Please upload an image", []
 
@@ -601,8 +601,8 @@ def create_app():
                     label="Use vit-b16-stanford40-actions"
                 )
                 model_upload_classify = gr.File(
-                    label="Upload Model (.pth/.pt)",
-                    file_types=[".pth", ".pt"],
+                    label="Upload Model (.pth/.pt/.safetensors/.ckpt)",
+                    file_types=[".pth", ".pt", ".safetensors", ".ckpt"],
                     interactive=False
                 )
             with gr.Column(scale=2):
 
@@ -641,8 +641,8 @@ def create_app():
                    label="Use vit-b16-stanford40-actions"
                )
                model_upload_attention = gr.File(
-                    label="Upload Model (.pth/.pt)",
-                    file_types=[".pth", ".pt"],
+                    label="Upload Model (.pth/.pt/.safetensors/.ckpt)",
+                    file_types=[".pth", ".pt", ".safetensors", ".ckpt"],
                    interactive=False
                )
            with gr.Column(scale=2):
 
@@ -714,8 +714,8 @@ def create_app():
                    label="Use vit-b16-stanford40-actions"
                )
                model_upload_attack = gr.File(
-                    label="Upload Model (.pth/.pt)",
-                    file_types=[".pth", ".pt"],
+                    label="Upload Model (.pth/.pt/.safetensors/.ckpt)",
+                    file_types=[".pth", ".pt", ".safetensors", ".ckpt"],
                    interactive=False
                )
            with gr.Column(scale=3):
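
All three tabs now advertise the same accepted extensions. A minimal sketch of the shared upload pattern (component names and wiring are illustrative, not copied from app.py), where the checkbox toggles whether an upload is accepted:

```python
import gradio as gr

CKPT_TYPES = [".pth", ".pt", ".safetensors", ".ckpt"]

with gr.Blocks() as demo:
    use_hf_vit = gr.Checkbox(label="Use vit-b16-stanford40-actions")
    model_upload = gr.File(
        label=f"Upload Model ({'/'.join(CKPT_TYPES)})",
        file_types=CKPT_TYPES,
        interactive=False,
    )
    # Enable the upload field only while the bundled HF model is deselected.
    use_hf_vit.change(
        lambda use_hf: gr.update(interactive=not use_hf),
        inputs=use_hf_vit,
        outputs=model_upload,
    )

demo.launch()
```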

utils/model_loader.py
CHANGED

@@ -4,11 +4,23 @@ import timm
 from dataclasses import dataclass
 from typing import Optional, Tuple, Dict, Any
 
+# Import VisionTransformer directly so models with custom architectures can be built
+try:
+    from timm.models.vision_transformer import VisionTransformer
+except ImportError:
+    VisionTransformer = None
+
 try:
     from transformers import AutoModelForImageClassification
 except Exception:  # pragma: no cover
     AutoModelForImageClassification = None
 
+# safetensors support (the modern HuggingFace format)
+try:
+    from safetensors.torch import load_file as load_safetensors
+except ImportError:
+    load_safetensors = None
+
 DEVICE_DEFAULT = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 
@@ -21,6 +33,8 @@ class ViTConfig:
     patch_size: int = 16
     img_size: int = 224
     num_classes: int = 1000
+    mlp_ratio: float = 4.0
+    qkv_bias: bool = True
 
     @property
     def grid_size(self) -> int:
 
@@ -34,7 +48,7 @@ class ViTConfig:
 
     @property
     def timm_model_name(self) -> str:
-        """Returns the corresponding timm model name…
+        """Returns the corresponding timm model name (for informational purposes)."""
         # Mapping based on embed_dim and num_heads
         size_map = {
             (192, 3): 'tiny',
 
@@ -43,10 +57,107 @@ class ViTConfig:
             (1024, 16): 'large',
             (1280, 16): 'huge',
         }
-        size = size_map.get((self.embed_dim, self.num_heads), '…
+        size = size_map.get((self.embed_dim, self.num_heads), 'custom')
         return f"vit_{size}_patch{self.patch_size}_{self.img_size}"
 
 
+def create_vit_from_config(config: ViTConfig, device: Optional[torch.device] = None) -> torch.nn.Module:
+    """Creates a ViT model directly from the inferred configuration.
+
+    This allows building models with arbitrary architectures, not limited
+    to timm's predefined names (vit_base_patch16_224, etc.).
+    """
+    device = device or DEVICE_DEFAULT
+
+    if VisionTransformer is None:
+        raise RuntimeError("VisionTransformer is not available. Check the timm installation.")
+
+    model = VisionTransformer(
+        img_size=config.img_size,
+        patch_size=config.patch_size,
+        in_chans=3,
+        num_classes=config.num_classes,
+        embed_dim=config.embed_dim,
+        depth=config.num_layers,
+        num_heads=config.num_heads,
+        mlp_ratio=config.mlp_ratio,
+        qkv_bias=config.qkv_bias,
+        class_token=True,
+        global_pool='token',
+    )
+
+    return model.to(device)
+
+
+def _strip_state_dict_prefix(state_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
+    """Removes common framework prefixes (Lightning, DDP, etc.) from state_dict keys.
+
+    Handled prefixes:
+    - 'model.' (PyTorch Lightning)
+    - 'module.' (DataParallel/DistributedDataParallel)
+    - 'encoder.' (some self-supervised learning frameworks)
+    - 'backbone.' (some detection frameworks)
+
+    Returns:
+        state_dict with the prefix stripped from its keys
+    """
+    prefixes = ['model.', 'module.', 'encoder.', 'backbone.']
+
+    # Check whether any key carries a prefix
+    has_prefix = False
+    detected_prefix = None
+    for key in state_dict.keys():
+        for prefix in prefixes:
+            if key.startswith(prefix):
+                has_prefix = True
+                detected_prefix = prefix
+                break
+        if has_prefix:
+            break
+
+    if not has_prefix:
+        return state_dict
+
+    print(f"[ViTViz] Detected prefix '{detected_prefix}' in the state_dict keys (Lightning/DDP). Stripping it...")
+
+    new_sd: Dict[str, torch.Tensor] = {}
+    for key, value in state_dict.items():
+        new_key = key
+        for prefix in prefixes:
+            if key.startswith(prefix):
+                new_key = key[len(prefix):]
+                break
+        new_sd[new_key] = value
+
+    return new_sd
+
+
+def validate_vit_structure(model: torch.nn.Module) -> Tuple[bool, str]:
+    """Validates that the model has the structure expected of a timm-compatible ViT.
+
+    Returns:
+        (is_valid, error_message) - if invalid, error_message describes the problem
+    """
+    if not hasattr(model, 'blocks'):
+        return False, "Model has no 'blocks' attribute. Not a compatible ViT."
+
+    if len(model.blocks) == 0:
+        return False, "Model has an empty 'blocks'."
+
+    block = model.blocks[0]
+    if not hasattr(block, 'attn'):
+        return False, "Block has no 'attn' attribute. Incompatible structure."
+
+    attn = block.attn
+    if not hasattr(attn, 'qkv'):
+        return False, "Attention module has no 'qkv'. Incompatible structure."
+
+    if not hasattr(attn, 'num_heads'):
+        return False, "Attention module has no 'num_heads'. Incompatible structure."
+
+    return True, ""
+
+
 def infer_config_from_model(model: torch.nn.Module) -> ViTConfig:
     """Infers the ViT configuration from a loaded timm model."""
     config = ViTConfig()
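
A quick composition check for the two new helpers, with illustrative values (an editor's sketch, not repo code): a 512-dim, 8-head, 6-layer ViT that no timm preset name covers.

```python
import torch
from utils.model_loader import ViTConfig, create_vit_from_config, validate_vit_structure

cfg = ViTConfig(embed_dim=512, num_heads=8, num_layers=6,
                patch_size=16, img_size=224, num_classes=40)
model = create_vit_from_config(cfg, device=torch.device("cpu"))

ok, why = validate_vit_structure(model)
print(ok, why)              # True (empty error message)
print(cfg.timm_model_name)  # 'vit_custom_patch16_224' since (512, 8) is not in size_map
```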

@@ -95,20 +206,47 @@ def infer_config_from_state_dict(state_dict: Dict[str, torch.Tensor]) -> ViTConfig:
     if layer_indices:
         config.num_layers = max(layer_indices) + 1
 
-    # Infer embed_dim
+    # Infer embed_dim from the first block
     qkv_key = 'blocks.0.attn.qkv.weight'
     if qkv_key in state_dict:
         qkv_weight = state_dict[qkv_key]
         # qkv.weight shape: [3*embed_dim, embed_dim]
         config.embed_dim = qkv_weight.shape[1]
+        # Inferring num_heads directly is not possible here: qkv has shape
+        # [3*embed_dim, embed_dim], so the output is 3*embed_dim = 3*num_heads*head_dim.
+        # We could compute num_heads = (qkv_out // 3) // head_dim,
+        # but head_dim varies, so we infer it another way below.
 
-    # Infer num_heads
+    # Infer num_heads: try multiple methods
     proj_key = 'blocks.0.attn.proj.weight'
-    if proj_key in state_dict:
-        # proj.weight shape: [embed_dim, embed_dim]
+    if proj_key in state_dict and qkv_key in state_dict:
         embed_dim = state_dict[proj_key].shape[0]
-
-
+        qkv_out = state_dict[qkv_key].shape[0]  # 3*embed_dim
+
+        # Method 1: if qkv_out == 3*embed_dim, try the common head_dims (64, 32, 96)
+        if qkv_out == 3 * embed_dim:
+            # Test common head_dims in order of preference
+            for head_dim in [64, 32, 96, 48, 128]:
+                if embed_dim % head_dim == 0:
+                    config.num_heads = embed_dim // head_dim
+                    break
+        else:
+            # Fallback: assume num_heads divides embed_dim evenly
+            # and try common num_heads values
+            for nh in [12, 16, 8, 6, 24, 4, 3]:
+                if embed_dim % nh == 0:
+                    config.num_heads = nh
+                    break
+
+    # Infer qkv_bias
+    qkv_bias_key = 'blocks.0.attn.qkv.bias'
+    config.qkv_bias = qkv_bias_key in state_dict
+
+    # Infer mlp_ratio from the MLP
+    mlp_fc1_key = 'blocks.0.mlp.fc1.weight'
+    if mlp_fc1_key in state_dict and config.embed_dim > 0:
+        mlp_hidden = state_dict[mlp_fc1_key].shape[0]
+        config.mlp_ratio = mlp_hidden / config.embed_dim
 
     # Infer num_classes from the head
     head_key = 'head.weight'
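
The head-count heuristic above is worth a concrete trace. A standalone mirror of the preference loop (illustrative, not repo code), checked against common ViT widths:

```python
def guess_num_heads(embed_dim: int) -> int:
    # Same preference order as the inference code above
    for head_dim in [64, 32, 96, 48, 128]:
        if embed_dim % head_dim == 0:
            return embed_dim // head_dim
    return 1  # illustrative fallback only

print(guess_num_heads(192))   # 3   (ViT-Tiny:  192 / 64)
print(guess_num_heads(384))   # 6   (ViT-Small: 384 / 64)
print(guess_num_heads(768))   # 12  (ViT-Base:  768 / 64)
print(guess_num_heads(1024))  # 16  (ViT-Large: 1024 / 64)
```

Because head_dim 64 is preferred, a checkpoint that actually uses an unusual head width with a 64-divisible embed_dim will be mis-read; constructing a `ViTConfig` by hand remains the escape hatch.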

@@ -219,6 +357,8 @@ def load_vit_from_huggingface(model_id: str, device: Optional[torch.device] = None…
     num_heads = int(getattr(cfg, "num_attention_heads", 12)) if cfg is not None else 12
     patch_size = int(getattr(cfg, "patch_size", 16)) if cfg is not None else 16
     img_size = int(getattr(cfg, "image_size", 224)) if cfg is not None else 224
+    intermediate_size = int(getattr(cfg, "intermediate_size", hidden_size * 4)) if cfg is not None else hidden_size * 4
+    qkv_bias = bool(getattr(cfg, "qkv_bias", True)) if cfg is not None else True
     class_names = _hf_id2label_to_class_names(getattr(cfg, "id2label", None)) if cfg is not None else None
 
     # Build the config dynamically
 
@@ -228,26 +368,23 @@ def load_vit_from_huggingface(model_id: str, device: Optional[torch.device] = None…
         num_layers=num_layers,
         patch_size=patch_size,
         img_size=img_size,
-        num_classes=num_labels
+        num_classes=num_labels,
+        mlp_ratio=intermediate_size / hidden_size,
+        qkv_bias=qkv_bias
     )
 
-    …
-    print(f"[ViTViz] timm model '{timm_name}' not found, using vit_base_patch16_224")
-    timm_model = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=num_labels)
+    print(f"[ViTViz] Loading from HuggingFace: {vit_config.timm_model_name} "
+          f"(embed_dim={vit_config.embed_dim}, heads={vit_config.num_heads}, "
+          f"layers={vit_config.num_layers})")
+
+    # Create the model with the custom architecture directly
+    timm_model = create_vit_from_config(vit_config, device=device)
 
+    # Convert and load the state_dict
     timm_sd = _convert_hf_vit_to_timm_state_dict(hf_model.state_dict(), num_layers=num_layers)
     timm_model.load_state_dict(timm_sd, strict=False)
-    timm_model = timm_model.to(device)
     timm_model.eval()
 
-    # Update the config with the real values from the loaded model
-    vit_config = infer_config_from_model(timm_model)
-
     return timm_model, class_names, vit_config
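
A usage sketch for this HF path (the repo id is hypothetical; assumes `transformers` is installed):

```python
import torch
from utils.model_loader import load_vit_from_huggingface

model, class_names, vit_config = load_vit_from_huggingface(
    "username/vit-custom-model",  # hypothetical repo id
    device=torch.device("cpu"),
)
print(vit_config.timm_model_name)
```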

@@ -263,11 +400,27 @@ class CustomUnpickler(pickle.Unpickler):
 
 
 def load_checkpoint(model_path: str, device: Optional[torch.device] = None) -> Any:
-    """Loads a checkpoint/model from the given path…
+    """Loads a checkpoint/model from the given path.
+
+    Supported formats:
+    - .pth / .pt: PyTorch checkpoint (torch.load)
+    - .safetensors: the modern HuggingFace format (safer and faster)
 
     Returns the loaded object (a full model, a state_dict, or a checkpoint dict).
     """
     device = device or DEVICE_DEFAULT
+
+    # Detect the safetensors format
+    if model_path.endswith('.safetensors'):
+        if load_safetensors is None:
+            raise ImportError(
+                "safetensors is not installed. Install it with: pip install safetensors"
+            )
+        # safetensors always returns a state_dict (a full model is not supported)
+        state_dict = load_safetensors(model_path, device=str(device))
+        return state_dict
+
+    # Standard PyTorch format (.pth, .pt, .ckpt, etc.)
     try:
         return torch.load(model_path, map_location=device, weights_only=False)
     except (AttributeError, ModuleNotFoundError, RuntimeError):
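
Since safetensors files store only tensors, this branch always hands back a plain state_dict and the module has to be built afterwards. A sketch of that combination (the filename is hypothetical):

```python
import torch
from utils.model_loader import (
    load_checkpoint,
    infer_config_from_state_dict,
    create_vit_from_config,
)

sd = load_checkpoint("vit_custom.safetensors", device=torch.device("cpu"))
config = infer_config_from_state_dict(sd)  # sd is a plain dict of tensors
model = create_vit_from_config(config)
model.load_state_dict(sd, strict=False)
```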

@@ -349,59 +502,71 @@ def load_class_names_from_file(labels_file: Optional[str]) -> Optional[Dict[int,…
 def build_model_from_checkpoint(checkpoint: Any, device: Optional[torch.device] = None) -> Tuple[torch.nn.Module, ViTConfig]:
     """Builds a model from a checkpoint, which may be a dict, a state_dict, or the model itself.
 
+    Supports arbitrary ViT architectures, not limited to timm's predefined names.
+
     Returns:
         (model, config) - the loaded model and the inferred configuration
     """
     device = device or DEVICE_DEFAULT
     config: Optional[ViTConfig] = None
 
+    # Detect and log PyTorch Lightning checkpoints
+    if isinstance(checkpoint, dict) and 'pytorch-lightning_version' in checkpoint:
+        print(f"[ViTViz] Detected a PyTorch Lightning checkpoint (v{checkpoint.get('pytorch-lightning_version', '?')})")
+
     if isinstance(checkpoint, dict):
         if 'model' in checkpoint:
+            # A full model stored inside the dict
             model = checkpoint['model']
             config = infer_config_from_model(model)
+            # Validate the structure
+            is_valid, error_msg = validate_vit_structure(model)
+            if not is_valid:
+                raise ValueError(f"Invalid model: {error_msg}")
         elif 'state_dict' in checkpoint:
             state_dict = checkpoint['state_dict']
+            # Strip framework prefixes (Lightning, DDP, etc.)
+            state_dict = _strip_state_dict_prefix(state_dict)
             config = infer_config_from_state_dict(state_dict)
-            …
-            print(f"[ViTViz] timm model '{timm_name}' not found, using vit_base_patch16_224")
-            model = timm.create_model('vit_base_patch16_224', pretrained=False, num_classes=num_classes)
-            model.load_state_dict(state_dict)
+            print(f"[ViTViz] Inferred architecture: {config.timm_model_name} "
+                  f"(embed_dim={config.embed_dim}, heads={config.num_heads}, "
+                  f"layers={config.num_layers}, patch={config.patch_size}, img={config.img_size})")
+            # Create the model with the custom architecture
+            model = create_vit_from_config(config, device=device)
+            # strict=False to support variants such as CLIP (norm_pre, etc.)
+            model.load_state_dict(state_dict, strict=False)
         elif 'model_state_dict' in checkpoint:
             # Newer format with embedded class_names
             state_dict = checkpoint['model_state_dict']
+            # Strip framework prefixes (Lightning, DDP, etc.)
+            state_dict = _strip_state_dict_prefix(state_dict)
             config = infer_config_from_state_dict(state_dict)
-            …
-            print(f"[ViTViz] timm model '{timm_name}' not found, using vit_base_patch16_224")
-            model = timm.create_model('vit_base_patch16_224', pretrained=False, num_classes=num_classes)
-            model.load_state_dict(state_dict)
+            print(f"[ViTViz] Inferred architecture: {config.timm_model_name} "
+                  f"(embed_dim={config.embed_dim}, heads={config.num_heads}, "
+                  f"layers={config.num_layers}, patch={config.patch_size}, img={config.img_size})")
+            # Create the model with the custom architecture
+            model = create_vit_from_config(config, device=device)
+            # strict=False to support variants such as CLIP (norm_pre, etc.)
+            model.load_state_dict(state_dict, strict=False)
         else:
-            # assume the dict is a state_dict
+            # assume the dict is a plain state_dict
+            # Strip framework prefixes (Lightning, DDP, etc.)
+            checkpoint = _strip_state_dict_prefix(checkpoint)
             config = infer_config_from_state_dict(checkpoint)
-            …
-            print(f"[ViTViz] timm model '{timm_name}' not found, using vit_base_patch16_224")
-            model = timm.create_model('vit_base_patch16_224', pretrained=False, num_classes=num_classes)
-            model.load_state_dict(checkpoint)
+            print(f"[ViTViz] Inferred architecture: {config.timm_model_name} "
+                  f"(embed_dim={config.embed_dim}, heads={config.num_heads}, "
+                  f"layers={config.num_layers}, patch={config.patch_size}, img={config.img_size})")
+            # Create the model with the custom architecture
+            model = create_vit_from_config(config, device=device)
+            # strict=False to support variants such as CLIP (norm_pre, etc.)
+            model.load_state_dict(checkpoint, strict=False)
     else:
         # A full model saved via torch.save(model, ...)
         model = checkpoint
+        # Validate the structure
+        is_valid, error_msg = validate_vit_structure(model)
+        if not is_valid:
+            raise ValueError(f"Invalid model: {error_msg}")
        config = infer_config_from_model(model)
 
     model = model.to(device)
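
End to end, the loader now accepts any of the checkpoint shapes handled above. A closing sketch (the path is hypothetical; assumes the helpers are importable from the module):

```python
from utils.model_loader import build_model_from_checkpoint, load_checkpoint

checkpoint = load_checkpoint("checkpoints/vit_any.pth")  # hypothetical file
model, config = build_model_from_checkpoint(checkpoint)
print(config.timm_model_name, config.embed_dim, config.num_heads)
model.eval()
```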