Clarification on model files, backbones, and usage with Hugging Face / PyTorch

#1
by abelBEDOYA - opened

Hello, and congratulations on the excellent work on this model.
I’m very interested in testing it on some images I’ve collected, but I have a few questions regarding the model structure and how to use the provided files.

  1. Identification of model components (.pth files)
    From the file names alone, I find it difficult to understand which part of the model each checkpoint corresponds to. As mentioned in the README, you trained different heads on histological images using DINOv1 and DINOv2 as backbones.

Are all the .pth files classification heads only?

Which checkpoints correspond to DINOv1 and which to DINOv2?

Are any of the provided files full models (backbone + head), or do they all assume externally loaded DINO backbones?

  2. Input consistency across heads
    Do all heads (MLP-based and CNN-based) expect the same input format and preprocessing, given that they all originate from the same DINO backbone? More specifically, could you clarify how the DINO output is extracted and fed into each type of head? For example, do MLP-based heads operate on the global representation (e.g., the CLS token or a pooled embedding), while CNN-based heads consume spatial patch-token feature maps reshaped into a 2D grid?
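For concreteness, here is a toy sketch of the two consumption patterns I have in mind. The dimensions (ViT-S-like: 384-dim embeddings, a 16×16 patch grid) and the head layouts are just examples of mine, not taken from your repository:

```python
import torch
import torch.nn as nn

# Toy DINO ViT output: 1 CLS token followed by a 16x16 grid of patch
# tokens, embedding dim 384 (ViT-S-like). Purely illustrative shapes.
B, H, W, D = 2, 16, 16, 384
tokens = torch.randn(B, 1 + H * W, D)       # [CLS | patch tokens]

cls_token = tokens[:, 0]                    # global representation, [B, D]
patch_tokens = tokens[:, 1:]                # spatial tokens, [B, H*W, D]

# Pattern 1: MLP head on the CLS token (or a pooled embedding)
mlp_head = nn.Sequential(nn.Linear(D, 256), nn.ReLU(), nn.Linear(256, 2))
mlp_logits = mlp_head(cls_token)            # [B, 2]

# Pattern 2: CNN head on patch tokens reshaped into a 2D feature map
fmap = patch_tokens.transpose(1, 2).reshape(B, D, H, W)   # [B, D, H, W]
cnn_head = nn.Sequential(
    nn.Conv2d(D, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 2),
)
cnn_logits = cnn_head(fmap)                 # [B, 2]

print(mlp_logits.shape, cnn_logits.shape)   # torch.Size([2, 2]) torch.Size([2, 2])
```

Knowing which of these (or something else entirely) each head expects would make it much easier to wire things up correctly.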

  3. Usage with Hugging Face Transformers
    I’ve tried to load the model using the Hugging Face transformers library, but I’m running into issues because the standard HF configuration objects seem to be missing required information (e.g., architecture definition).
    Is there an intended or recommended way to use this model with transformers, or is it expected to be used outside the HF model API?

  4. Using the checkpoints directly in PyTorch
    I also attempted to use the model directly in PyTorch. However, it seems that the .pth files contain only checkpoint weights and not the full model architecture.

The DINO backbones themselves are available via timm / PyTorch model hubs, but the trained heads are not.

How can I access or reconstruct the architecture of the trained heads?

Is there example code showing how to attach these heads to the corresponding DINOv1 / DINOv2 backbones?
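For reference, this is the kind of reconstruction I would attempt. Everything below is guesswork on my part: the Linear/ReLU layout is assumed, and a stand-in module replaces the backbone so the sketch runs offline (in practice I would load the real one, e.g. `torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")`):

```python
import torch
import torch.nn as nn

# Hedged sketch: rebuild an MLP head of unknown architecture by reading
# the Linear layer sizes out of the checkpoint itself, then attach it to
# a backbone. Assumes nn.Sequential-style keys like '0.weight', '2.weight'
# (ReLU layers carry no parameters, so their indices are simply skipped).

def mlp_from_state_dict(state):
    idx = sorted({int(k.split(".")[0]) for k in state})
    layers, prev = [], -1
    for i in idx:
        if prev >= 0:
            # fill the parameter-free gaps between Linear layers with ReLUs
            layers.extend(nn.ReLU() for _ in range(i - prev - 1))
        out_dim, in_dim = state[f"{i}.weight"].shape
        layers.append(nn.Linear(in_dim, out_dim))
        prev = i
    return nn.Sequential(*layers)

# Round-trip a known toy head through the reconstruction to sanity-check it.
original = nn.Sequential(nn.Linear(384, 256), nn.ReLU(), nn.Linear(256, 2))
rebuilt = mlp_from_state_dict(original.state_dict())
rebuilt.load_state_dict(original.state_dict())

# Stand-in "backbone" that maps inputs to 384-dim embeddings; in practice
# this would be the real DINO model loaded from torch.hub / timm.
backbone = nn.Linear(10, 384)
model = nn.Sequential(backbone, rebuilt)
print(model(torch.randn(1, 10)).shape)      # torch.Size([1, 2])
```

Of course this only works if the checkpoint actually contains head weights and the guessed layout matches, which is exactly what I can't verify from the files alone.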

If I’ve missed something obvious in the repository or documentation, please feel free to point it out.

Thanks in advance for your time and for sharing this work!

Thanks also from my side. I didn't get this model to run either. It looks like the repository was not uploaded in the standard Hugging Face format: it has a preprocessor_config.json (which is why the processor loads fine), but the weights were saved as raw PyTorch checkpoints instead of the pytorch_model.bin / model.safetensors files that from_pretrained() expects.

I tried downloading the DINO-mlp checkpoint locally, but it only contains DINO SSL weights; I can't find any Gleason MLP head on top of the image embeddings.
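For what it's worth, this is roughly how I inspected the checkpoint. A toy in-memory checkpoint stands in for the actual .pth here so the snippet runs as-is; with the real file you would pass its path to torch.load instead:

```python
import io
import torch

# Summarize a checkpoint's parameter names and shapes. A DINO SSL
# checkpoint shows only backbone keys; a trained Gleason head would
# appear as extra classifier/MLP parameters on top.
def summarize(path_or_buf):
    ckpt = torch.load(path_or_buf, map_location="cpu")
    state = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
    return {k: tuple(v.shape) for k, v in state.items() if hasattr(v, "shape")}

# Toy stand-in checkpoint (the key name mimics a ViT attention weight).
buf = io.BytesIO()
torch.save({"state_dict": {"backbone.blocks.0.attn.qkv.weight": torch.zeros(1152, 384)}}, buf)
buf.seek(0)

summary = summarize(buf)
for name, shape in summary.items():
    print(name, shape)      # backbone.blocks.0.attn.qkv.weight (1152, 384)
```

Running this kind of scan over the repo's .pth files only ever turned up backbone-style keys for me, never head weights.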

Furthermore, there are no details on what input resolution, in terms of microns per pixel, the models expect.

Any help would be greatly appreciated.
