test

by fuxer - opened Mar 31

base: refs/heads/main ← from: refs/pr/1

+2892

-525707

Files changed (11)

.gitattributes +0 -1
README.md +34 -56
eval_viz.png +0 -3
inference_tagger_standalone.py +143 -402
requirements.txt +0 -11
tagger_proto.safetensors +2 -2
tagger_ui/templates/index.html +2 -88
tagger_ui_server.py +1 -41
tagger_vocab_with_categories.json +0 -0
tagger_vocab_with_categories_and_alias.json +0 -0
tagger_vocab_with_categories_and_alias_updated.json +0 -0

.gitattributes CHANGED

@@ -33,4 +33,3 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
-eval_viz.png filter=lfs diff=lfs merge=lfs -text

 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -1,5 +1,6 @@
 ---
 license: apache-2.0
 tags:
   - image-classification
   - multi-label-classification
@@ -44,87 +45,63 @@ backbone fine-tuned end-to-end with a single linear projection head.
 | Precision | bfloat16 (backbone) / float32 (projection + loss) |
 | Hardware | 2× GPU, ThreadPoolExecutor + NCCL all-reduce |
-![eval_viz](./eval_viz.png)
 ## Usage
-### 1. Install dependencies
-```bash
-pip install -r requirements.txt
-```
-Or manually:
-```bash
-pip install torch torchvision safetensors Pillow requests \
-            python-multipart fastapi uvicorn jinja2 aiofiles
-```
-### 2. Download model files
-```bash
-huggingface-cli download lodestones/taggerine \
-    tagger_proto.safetensors \
-    tagger_vocab_with_categories_and_alias_updated.json \
-    tagger_ui_server.py \
-    inference_tagger_standalone.py \
-    --local-dir .
 ```
-> **Note:** `tagger_proto.safetensors` is ~5.3 GB. Make sure you have enough disk space.
-### 3. Download the `tagger_ui/` templates folder
-The server requires the `tagger_ui/templates/` directory to be present alongside `tagger_ui_server.py`:
 ```bash
-huggingface-cli download lodestones/taggerine \
-    --include "tagger_ui/**" \
-    --local-dir .
-```
-### 4. Run the Web UI
-```bash
-python tagger_ui_server.py \
-    --checkpoint tagger_proto.safetensors \
-    --vocab tagger_vocab_with_categories_and_alias_updated.json \
-    --port 7860
-# → open http://localhost:7860
 ```
-**CPU-only machine?** Add `--device cpu` (inference will be slower):
 ```bash
 python tagger_ui_server.py \
     --checkpoint tagger_proto.safetensors \
-    --vocab tagger_vocab_with_categories_and_alias_updated.json \
-    --device cpu \
     --port 7860
-```
-### Standalone CLI inference (no server)
-```bash
-python inference_tagger_standalone.py \
-    --checkpoint tagger_proto.safetensors \
-    --vocab tagger_vocab_with_categories_and_alias_updated.json \
-    --images photo.jpg \
-    --topk 30
 ```
 ## Files
 | File | Description |
 |---|---|
-| `tagger_proto.safetensors` | Model weights (bfloat16) |
-| `tagger_vocab_with_categories_and_alias_updated.json` | `{"idx2tag": [...], "tag2category": {...}}` — 74 625 tags with category metadata |
-| `tagger_vocab_with_categories.json` | Same without alias data |
-| `tagger_vocab.json` | Minimal vocab — `{"idx2tag": [...]}` only |
-| `inference_tagger_standalone.py` | Self-contained CLI inference script (no `transformers` dep) |
 | `tagger_ui_server.py` | FastAPI + Jinja2 web UI server |
-| `requirements.txt` | Python dependencies |
 ## Tag Vocabulary
@@ -141,7 +118,8 @@ Minimum tag frequency threshold: **50** occurrences across the combined dataset.
 ## Limitations
-- Evaluated on booru-style illustrations and furry art; performance on photographic images or other art works to some extend.
 - The vocabulary reflects the biases of e621 and Danbooru annotation practices.
 ## License

 ---
 license: apache-2.0
 tags:
   - image-classification
   - multi-label-classification
 | Precision | bfloat16 (backbone) / float32 (projection + loss) |
 | Hardware | 2× GPU, ThreadPoolExecutor + NCCL all-reduce |
 ## Usage
+### Standalone (no `transformers` dependency)
+```python
+from inference_tagger_standalone import Tagger
+tagger = Tagger(
+    checkpoint_path="tagger_proto.safetensors",
+    vocab_path="tagger_vocab_with_categories.json",
+    device="cuda",
+)
+tags = tagger.predict("photo.jpg", topk=40)
+# → [("solo", 0.98), ("anthro", 0.95), ...]
+# or threshold-based
+tags = tagger.predict("https://example.com/image.jpg", threshold=0.35)
 ```
+### CLI
 ```bash
+# top-30 tags, pretty output
+python inference_tagger_standalone.py \
+    --checkpoint tagger_proto.safetensors \
+    --vocab tagger_vocab_with_categories.json \
+    --images photo.jpg https://example.com/image.jpg \
+    --topk 30
+# comma-separated string (pipe into diffusion trainer)
+python inference_tagger_standalone.py ... --format tags
+# JSON
+python inference_tagger_standalone.py ... --format json
 ```
+### Web UI
 ```bash
+pip install fastapi uvicorn jinja2 aiofiles
 python tagger_ui_server.py \
     --checkpoint tagger_proto.safetensors \
+    --vocab tagger_vocab_with_categories.json \
     --port 7860
+# → open http://localhost:7860
 ```
 ## Files
 | File | Description |
 |---|---|
+| `*.safetensors` | Model weights (bfloat16) |
+| `tagger_vocab_with_categories.json` | `{"idx2tag": [...]}` — 74 625 tag strings ordered by training frequency |
+| `inference_tagger_standalone.py` | Self-contained inference script (no `transformers` dep) |
 | `tagger_ui_server.py` | FastAPI + Jinja2 web UI server |
 ## Tag Vocabulary
 ## Limitations
+- Evaluated on booru-style illustrations and furry art; performance on photographic
+  images or other art styles is untested.
 - The vocabulary reflects the biases of e621 and Danbooru annotation practices.
 ## License

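Note on the README usage above: the `topk` and `threshold` modes differ only in how tags are picked from the per-tag sigmoid scores. A minimal sketch of that selection step in plain Python, assuming a hypothetical `select_tags` helper and made-up logits (the script's `Tagger.predict` works on torch tensors, and parts of its body are not shown in this diff):

```python
import math


def select_tags(logits, idx2tag, topk=None, threshold=None):
    """Hypothetical helper: pick (tag, score) pairs by top-k or by threshold."""
    if topk is None and threshold is None:
        topk = 30  # same fallback the script uses when neither is given
    # Independent sigmoid per tag (multi-label, so scores do not sum to 1).
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    # Indices sorted by descending score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if topk is not None:
        chosen = ranked[:topk]  # top-k wins when both arguments are set
    else:
        chosen = [i for i in ranked if scores[i] >= threshold]
    return [(idx2tag[i], scores[i]) for i in chosen]


# Illustrative values only, not model output.
vocab = ["solo", "anthro", "outside", "text"]
logits = [3.0, 1.0, -0.5, -4.0]

top2 = select_tags(logits, vocab, topk=2)
above = select_tags(logits, vocab, threshold=0.35)
print([tag for tag, _ in top2])
print([tag for tag, _ in above])
```

Top-k gives a fixed-length result regardless of confidence, while a threshold returns a variable number of tags, which may suit captioning pipelines better.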
eval_viz.png DELETED

Git LFS Details

SHA256: 8cc88c49e85e69897c4ab5b7abb3e9a5400036fcf26060e3b5ff75e62931b177
Pointer size: 131 Bytes
Size of remote file: 270 kB

inference_tagger_standalone.py CHANGED

@@ -64,19 +64,17 @@ from safetensors.torch import load_file
 # All hyperparameters match facebook/dinov3-vith16plus-pretrain-lvd1689m
 # =============================================================================
-D_MODEL = 1280
-N_HEADS = 20
-HEAD_DIM = D_MODEL // N_HEADS  # 64
-N_LAYERS = 32
-D_FFN = 5120
 N_REGISTERS = 4
-PATCH_SIZE = 16
-ROPE_THETA = 100.0
-ROPE_RESCALE = 2.0
-LN_EPS = 1e-5
-LAYERSCALE = 1.0
-FEATURE_DIM = (1 + N_REGISTERS) * D_MODEL  # 6400
 # ---------------------------------------------------------------------------
@@ -85,23 +83,25 @@ FEATURE_DIM = (1 + N_REGISTERS) * D_MODEL  # 6400
 @lru_cache(maxsize=32)
 def _patch_coords_cached(h: int, w: int, device_str: str) -> torch.Tensor:
     device = torch.device(device_str)
     cy = torch.arange(0.5, h, dtype=torch.float32, device=device) / h
     cx = torch.arange(0.5, w, dtype=torch.float32, device=device) / w
     coords = torch.stack(torch.meshgrid(cy, cx, indexing="ij"), dim=-1).flatten(0, 1)
-    coords = 2.0 * coords - 1.0
     coords = coords * ROPE_RESCALE
     return coords  # [h*w, 2]
 def _build_rope(h_patches: int, w_patches: int,
                 dtype: torch.dtype, device: torch.device):
-    coords = _patch_coords_cached(h_patches, w_patches, str(device))
     inv_freq = 1.0 / (ROPE_THETA ** torch.arange(
-        0, 1, 4 / HEAD_DIM, dtype=torch.float32, device=device))
-    angles = 2 * math.pi * coords[:, :, None] * inv_freq[None, None, :]
-    angles = angles.flatten(1, 2).tile(2)
-    cos = torch.cos(angles).to(dtype).unsqueeze(0).unsqueeze(0)
     sin = torch.sin(angles).to(dtype).unsqueeze(0).unsqueeze(0)
     return cos, sin
@@ -113,6 +113,7 @@ def _rotate_half(x: torch.Tensor) -> torch.Tensor:
 def _apply_rope(q: torch.Tensor, k: torch.Tensor,
                 cos: torch.Tensor, sin: torch.Tensor):
     n_pre = 1 + N_REGISTERS
     q_pre, q_pat = q[..., :n_pre, :], q[..., n_pre:, :]
     k_pre, k_pat = k[..., :n_pre, :], k[..., n_pre:, :]
@@ -122,7 +123,7 @@ def _apply_rope(q: torch.Tensor, k: torch.Tensor,
 # ---------------------------------------------------------------------------
-# Transformer blocks
 # ---------------------------------------------------------------------------
 class _Attention(nn.Module):
@@ -133,7 +134,7 @@ class _Attention(nn.Module):
         self.v_proj = nn.Linear(D_MODEL, D_MODEL, bias=True)
         self.o_proj = nn.Linear(D_MODEL, D_MODEL, bias=True)
-    def forward(self, x, cos, sin):
         B, S, _ = x.shape
         q = self.q_proj(x).view(B, S, N_HEADS, HEAD_DIM).transpose(1, 2)
         k = self.k_proj(x).view(B, S, N_HEADS, HEAD_DIM).transpose(1, 2)
@@ -147,273 +148,125 @@ class _GatedMLP(nn.Module):
     def __init__(self):
         super().__init__()
         self.gate_proj = nn.Linear(D_MODEL, D_FFN, bias=True)
-        self.up_proj = nn.Linear(D_MODEL, D_FFN, bias=True)
-        self.down_proj = nn.Linear(D_FFN, D_MODEL, bias=True)
-    def forward(self, x):
         return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
 class _Block(nn.Module):
     def __init__(self):
         super().__init__()
-        self.norm1 = nn.LayerNorm(D_MODEL, eps=LN_EPS)
-        self.attention = _Attention()
         self.layer_scale1 = nn.Parameter(torch.full((D_MODEL,), LAYERSCALE))
-        self.norm2 = nn.LayerNorm(D_MODEL, eps=LN_EPS)
-        self.mlp = _GatedMLP()
         self.layer_scale2 = nn.Parameter(torch.full((D_MODEL,), LAYERSCALE))
-    def forward(self, x, cos, sin):
         x = x + self.attention(self.norm1(x), cos, sin) * self.layer_scale1
         x = x + self.mlp(self.norm2(x)) * self.layer_scale2
         return x
-class _Embeddings(nn.Module):
-    def __init__(self):
-        super().__init__()
-        # zeros() rather than empty() so a forgotten checkpoint key fails
-        # predictably instead of producing undefined outputs.
-        self.cls_token = nn.Parameter(torch.zeros(1, 1, D_MODEL))
-        self.mask_token = nn.Parameter(torch.zeros(1, 1, D_MODEL))
-        self.register_tokens = nn.Parameter(torch.zeros(1, N_REGISTERS, D_MODEL))
-        self.patch_embeddings = nn.Conv2d(
-            3, D_MODEL, kernel_size=PATCH_SIZE, stride=PATCH_SIZE)
-    def forward(self, pixel_values):
-        B = pixel_values.shape[0]
-        dtype = self.patch_embeddings.weight.dtype
-        patches = self.patch_embeddings(
-            pixel_values.to(dtype)).flatten(2).transpose(1, 2)
-        cls = self.cls_token.expand(B, -1, -1)
-        regs = self.register_tokens.expand(B, -1, -1)
-        return torch.cat([cls, regs, patches], dim=1)
 class DINOv3ViTH(nn.Module):
     """DINOv3 ViT-H/16+ backbone.
-    Token layout: [CLS, reg_0..reg_3, patch_0..patch_N].
     Returns last_hidden_state [B, 1+R+P, D_MODEL].
     """
     def __init__(self):
         super().__init__()
         self.embeddings = _Embeddings()
         self.layer = nn.ModuleList([_Block() for _ in range(N_LAYERS)])
-        self.norm = nn.LayerNorm(D_MODEL, eps=LN_EPS)
-    def forward(self, pixel_values):
-        _, _, H, W = pixel_values.shape
-        x = self.embeddings(pixel_values)
         h_p, w_p = H // PATCH_SIZE, W // PATCH_SIZE
         cos, sin = _build_rope(h_p, w_p, x.dtype, pixel_values.device)
-        for block in self.layer:
-            x = block(x, cos, sin)
-        return self.norm(x)
-    def get_image_tokens(self, pixel_values):
-        """Return patch tokens only (no CLS/registers) as [B, h_p*w_p, D_MODEL]
-        and the spatial grid dimensions (h_p, w_p)."""
-        _, _, H, W = pixel_values.shape
-        h_p, w_p = H // PATCH_SIZE, W // PATCH_SIZE
-        x = self.embeddings(pixel_values)
-        cos, sin = _build_rope(h_p, w_p, x.dtype, pixel_values.device)
         for block in self.layer:
             x = block(x, cos, sin)
-        x = self.norm(x)
-        # token layout: [CLS, reg_0..reg_R-1, patch_0..patch_N]
-        patch_tokens = x[:, 1 + N_REGISTERS:, :]  # [B, h_p*w_p, D_MODEL]
-        return patch_tokens, h_p, w_p
-# =============================================================================
-# Head — auto-detected from the checkpoint
-# =============================================================================
-class _LowRankHead(nn.Module):
-    """Two-matrix low-rank projection head.
-        features (in_dim)
-          → Linear(in_dim, rank, bias=?)
-          → Linear(rank, num_tags, bias=?)
     """
-    def __init__(self, in_dim: int, rank: int, num_tags: int,
-                 down_bias: bool, up_bias: bool):
         super().__init__()
-        self.proj_down = nn.Linear(in_dim, rank, bias=down_bias)
-        self.proj_up = nn.Linear(rank, num_tags, bias=up_bias)
-    def forward(self, x):
-        return self.proj_up(self.proj_down(x))
-def _build_head_from_checkpoint(
-    head_sd: dict,
-    in_dim: int,
-    num_tags: int,
-) -> tuple[nn.Module, dict]:
-    """Inspect head_sd and build a matching Module.
-    Supports two layouts, in order of preference:
-      1. Single linear          — any ``*.weight`` with shape [num_tags, in_dim]
-      2. Low-rank pair (2 mats) — one ``*.weight`` [rank, in_dim] plus
-                                   one ``*.weight`` [num_tags, rank]
-    Returns (module, remapped_state_dict) where the remapped state dict
-    matches the module's own key names so strict loading works.
-    """
-    weights_2d = [(k, v) for k, v in head_sd.items()
-                  if k.endswith(".weight") and v.ndim == 2]
-    # --- Case 1: single dense linear ---------------------------------------
-    singles = [(k, v) for k, v in weights_2d
-               if tuple(v.shape) == (num_tags, in_dim)]
-    if len(weights_2d) <= 2 and len(singles) == 1:
-        wkey, wval = singles[0]
-        base = wkey[:-len(".weight")]
-        bias_key = base + ".bias"
-        has_bias = bias_key in head_sd
-        module = nn.Linear(in_dim, num_tags, bias=has_bias)
-        remapped = {"weight": wval}
-        if has_bias:
-            remapped["bias"] = head_sd[bias_key]
-        # Sanity check: no extra keys we don't understand
-        expected_src = {wkey} | ({bias_key} if has_bias else set())
-        extra = set(head_sd) - expected_src
-        if extra:
-            raise RuntimeError(
-                f"Head has single-linear shape but extra unknown keys: {sorted(extra)}")
-        return module, remapped
-    # --- Case 2: low-rank pair ---------------------------------------------
-    down = None  # (key, tensor) with shape [rank, in_dim]
-    up = None    # (key, tensor) with shape [num_tags, rank]
-    for k, v in weights_2d:
-        if v.shape[1] == in_dim and v.shape[0] != num_tags:
-            down = (k, v)
-        elif v.shape[0] == num_tags and v.shape[1] != in_dim:
-            up = (k, v)
-    if down is not None and up is not None:
-        rank_down = down[1].shape[0]
-        rank_up = up[1].shape[1]
-        if rank_down != rank_up:
-            raise RuntimeError(
-                f"Low-rank head: inner dims disagree "
-                f"(down out={rank_down}, up in={rank_up})")
-        down_key, down_w = down
-        up_key, up_w = up
-        down_base = down_key[:-len(".weight")]
-        up_base = up_key[:-len(".weight")]
-        down_bias_key = down_base + ".bias"
-        up_bias_key = up_base + ".bias"
-        has_down_bias = down_bias_key in head_sd
-        has_up_bias = up_bias_key in head_sd
-        module = _LowRankHead(
-            in_dim=in_dim,
-            rank=rank_down,
-            num_tags=num_tags,
-            down_bias=has_down_bias,
-            up_bias=has_up_bias,
-        )
-        remapped = {
-            "proj_down.weight": down_w,
-            "proj_up.weight": up_w,
-        }
-        if has_down_bias:
-            remapped["proj_down.bias"] = head_sd[down_bias_key]
-        if has_up_bias:
-            remapped["proj_up.bias"] = head_sd[up_bias_key]
-        # Sanity check
-        expected_src = {down_key, up_key}
-        if has_down_bias:
-            expected_src.add(down_bias_key)
-        if has_up_bias:
-            expected_src.add(up_bias_key)
-        extra = set(head_sd) - expected_src
-        if extra:
-            raise RuntimeError(
-                f"Low-rank head detected but checkpoint has extra unknown "
-                f"head keys: {sorted(extra)}")
-        print(f"[Tagger] Detected low-rank head: "
-              f"in_dim={in_dim}, rank={rank_down}, num_tags={num_tags} "
-              f"(down_bias={has_down_bias}, up_bias={has_up_bias})")
-        return module, remapped
-    raise RuntimeError(
-        "Could not infer head architecture from checkpoint. "
-        f"Non-backbone keys found: {sorted(head_sd.keys())}"
-    )
 # =============================================================================
-# Tagger wrapper module
 # =============================================================================
 class DINOv3Tagger(nn.Module):
-    """Backbone + head. The head is attached after the checkpoint is
-    inspected (so we can build the right shape)."""
-    def __init__(self):
-        super().__init__()
-        self.backbone = DINOv3ViTH()
-        self.head: nn.Module | None = None  # attached by Tagger
-    def forward(self, pixel_values):
-        hidden = self.backbone(pixel_values)
-        cls = hidden[:, 0, :]
-        regs = hidden[:, 1: 1 + N_REGISTERS, :].flatten(1)
-        features = torch.cat([cls, regs], dim=-1).float()  # fp32 for head
-        return self.head(features)
-# =============================================================================
-# Checkpoint loading helpers
-# =============================================================================
-def _split_and_clean_state_dict(sd: dict) -> tuple[dict, dict]:
-    """Split full state dict into (backbone_sd, head_sd), stripping the
-    ``backbone.`` prefix and applying the remaps needed to match
-    ``DINOv3ViTH``'s parameter layout:
-      1. ``backbone.model.layer.N.*`` → ``layer.N.*``
-         (the checkpoint has an HF-style intermediate ``model`` wrapper
-         that our flat backbone class does not)
-      2. ``...layer_scale{1,2}.lambda1`` → ``...layer_scale{1,2}``
-         (HF stores layer_scale as a sub-module with a ``lambda1``
-         parameter; we use a plain ``nn.Parameter``)
-      3. Drop any ``rope_embeddings`` buffers (recomputed on the fly)
     """
-    backbone_sd: dict = {}
-    head_sd: dict = {}
-    for k, v in sd.items():
-        if k.startswith("backbone."):
-            nk = k[len("backbone."):]
-            # Remap (1): strip intermediate "model." before "layer."
-            if nk.startswith("model.layer."):
-                nk = nk[len("model."):]
-            backbone_sd[nk] = v
-        else:
-            head_sd[k] = v
-    # Remap (2): layer.N.layer_scale{1,2}.lambda1 → layer.N.layer_scale{1,2}
-    for k in list(backbone_sd.keys()):
-        if ".layer_scale" in k and k.endswith(".lambda1"):
-            backbone_sd[k[:-len(".lambda1")]] = backbone_sd.pop(k)
-    # Remap (3): drop rope buffers (recomputed on the fly)
-    for k in list(backbone_sd.keys()):
-        if "rope_embeddings" in k:
-            backbone_sd.pop(k)
-    return backbone_sd, head_sd
 # =============================================================================
@@ -421,7 +274,7 @@ def _split_and_clean_state_dict(sd: dict) -> tuple[dict, dict]:
 # =============================================================================
 _IMAGENET_MEAN = [0.485, 0.456, 0.406]
-_IMAGENET_STD = [0.229, 0.224, 0.225]
 def _snap(x: int, m: int) -> int:
@@ -438,22 +291,12 @@ def _open_image(source) -> Image.Image:
 def preprocess_image(source, max_size: int = 1024) -> torch.Tensor:
-    """Load and preprocess an image → [1, 3, H, W] float32, ImageNet-normalised.
-    Aspect ratio is preserved: a single scale factor is chosen so that the
-    long edge fits inside max_size after snapping to a PATCH_SIZE multiple.
-    """
     img = _open_image(source)
     w, h = img.size
-    # Target long-edge (snapped to patch multiple).
-    long_edge = max(w, h)
-    target_long = _snap(min(long_edge, max_size), PATCH_SIZE)
-    scale = target_long / long_edge
-    new_w = _snap(max(PATCH_SIZE, round(w * scale)), PATCH_SIZE)
-    new_h = _snap(max(PATCH_SIZE, round(h * scale)), PATCH_SIZE)
     return v2.Compose([
         v2.Resize((new_h, new_w), interpolation=v2.InterpolationMode.LANCZOS),
         v2.ToImage(),
@@ -472,15 +315,13 @@ class Tagger:
     Parameters
     ----------
     checkpoint_path : str
-        Path to a .safetensors or .pt/.pth checkpoint.
     vocab_path : str
-        Path to tagger_vocab.json or tagger_vocab_with_categories.json
-        (either must contain an ``idx2tag`` list).
     device : str
-        "cuda", "cuda:0", "cpu", ...
     dtype : torch.dtype
-        Backbone precision. bfloat16 recommended on Ampere+, float16 for
-        older GPUs, float32 for CPU. The head always runs in fp32.
     max_size : int
         Long-edge cap in pixels before feeding to the model.
     """
@@ -493,13 +334,8 @@ class Tagger:
         dtype: torch.dtype = torch.bfloat16,
         max_size: int = 1024,
     ):
-        want_cuda = device.startswith("cuda")
-        if want_cuda and not torch.cuda.is_available():
-            print("[Tagger] CUDA not available, falling back to CPU")
-            device = "cpu"
-            dtype = torch.float32
-        self.device = torch.device(device)
-        self.dtype = dtype
         self.max_size = max_size
         with open(vocab_path) as f:
@@ -508,112 +344,36 @@ class Tagger:
         self.num_tags = len(self.idx2tag)
         print(f"[Tagger] Vocabulary: {self.num_tags:,} tags")
-        # --- Load checkpoint to CPU first so we can inspect shapes ---------
         print(f"[Tagger] Loading checkpoint: {checkpoint_path}")
         if checkpoint_path.endswith((".safetensors", ".sft")):
-            sd = load_file(checkpoint_path, device="cpu")
         else:
-            sd = torch.load(checkpoint_path, map_location="cpu")
-        backbone_sd, head_sd = _split_and_clean_state_dict(sd)
-        if not head_sd:
-            raise RuntimeError(
-                "Checkpoint contains no non-backbone keys — cannot build head.")
-        # --- Build model, inferring head shape from the checkpoint --------
-        self.model = DINOv3Tagger()
-        head_module, head_sd_remapped = _build_head_from_checkpoint(
-            head_sd, in_dim=FEATURE_DIM, num_tags=self.num_tags,
-        )
-        self.model.head = head_module
-        # --- Strict load — mismatches raise instead of silently passing ----
-        self.model.backbone.load_state_dict(backbone_sd, strict=True)
-        self.model.head.load_state_dict(head_sd_remapped, strict=True)
-        # --- Move to device. Backbone → bf16/fp16; head stays fp32. --------
-        self.model.backbone = self.model.backbone.to(
-            device=self.device, dtype=dtype)
-        self.model.head = self.model.head.to(
-            device=self.device, dtype=torch.float32)
-        self.model.eval()
-        print(f"[Tagger] Ready on {self.device} (backbone={dtype}, head=fp32)")
-    @torch.no_grad()
-    def embed_pca(
-        self,
-        image,
-        n_components: int = 3,
-        max_size: int | None = None,
-    ) -> "Image.Image":
-        """Run PCA on the patch-token features of *image* and return a
-        false-colour RGB PIL image where R/G/B channels correspond to the
-        first three principal components, each normalised to [0, 255].
-        Parameters
-        ----------
-        image :
-            Local path, URL, or PIL.Image.Image.
-        n_components :
-            Number of PCA components (must be 3 for RGB output).
-        max_size :
-            Long-edge cap in pixels (defaults to ``self.max_size``).
-        """
-        if n_components != 3:
-            raise ValueError("n_components must be 3 for false-colour RGB output")
-        if max_size is None:
-            max_size = self.max_size
-        if isinstance(image, Image.Image):
-            img = image.convert("RGB")
-            w, h = img.size
-            scale = min(1.0, max_size / max(w, h))
-            new_w = _snap(round(w * scale), PATCH_SIZE)
-            new_h = _snap(round(h * scale), PATCH_SIZE)
-            pv = v2.Compose([
-                v2.Resize((new_h, new_w), interpolation=v2.InterpolationMode.LANCZOS),
-                v2.ToImage(),
-                v2.ToDtype(torch.float32, scale=True),
-                v2.Normalize(mean=_IMAGENET_MEAN, std=_IMAGENET_STD),
-            ])(img).unsqueeze(0).to(self.device)
-        else:
-            pv = preprocess_image(image, max_size=max_size).to(self.device)
-        with torch.autocast(device_type=self.device.type, dtype=self.dtype):
-            patch_tokens, h_p, w_p = self.model.backbone.get_image_tokens(pv)
-        # patch_tokens: [1, h_p*w_p, D_MODEL] → [N, D]
-        tokens = patch_tokens[0].float()  # fp32 for PCA
-        # Centre
-        mean = tokens.mean(dim=0, keepdim=True)
-        tokens_c = tokens - mean
-        # PCA via SVD (economy)
-        _, _, Vt = torch.linalg.svd(tokens_c, full_matrices=False)
-        components = Vt[:n_components]  # [3, D]
-        projected = tokens_c @ components.T  # [N, 3]
-        # Normalise each component to [0, 1]
-        lo = projected.min(dim=0).values
-        hi = projected.max(dim=0).values
-        projected = (projected - lo) / (hi - lo + 1e-8)
-        # Reshape to spatial grid and convert to uint8 PIL image
-        rgb = projected.reshape(h_p, w_p, 3).cpu().numpy()
-        rgb_uint8 = (rgb * 255).clip(0, 255).astype("uint8")
-        return Image.fromarray(rgb_uint8, mode="RGB")
     @torch.no_grad()
     def predict(self, image, topk: int | None = 30,
                 threshold: float | None = None) -> list[tuple[str, float]]:
-        """Tag a single image (local path or URL)."""
         if topk is None and threshold is None:
             topk = 30
         pv = preprocess_image(image, max_size=self.max_size).to(self.device)
-        logits = self.model(pv)[0]
         scores = torch.sigmoid(logits.float())
         if topk is not None:
@@ -621,18 +381,17 @@ class Tagger:
         else:
             assert threshold is not None
             indices = (scores >= threshold).nonzero(as_tuple=True)[0]
-            values = scores[indices]
-            order = values.argsort(descending=True)
             indices, values = indices[order], values[order]
-        return [(self.idx2tag[i], float(v))
-                for i, v in zip(indices.tolist(), values.tolist())]
     @torch.no_grad()
     def predict_batch(self, images, topk: int | None = 30,
-                      threshold: float | None = None):
-        return [self.predict(img, topk=topk, threshold=threshold)
-                for img in images]
 # =============================================================================
@@ -640,20 +399,17 @@ class Tagger:
 # =============================================================================
 def _fmt_pretty(path: str, results) -> str:
-    lines = [f"\n{'─' * 60}", f" {path}", f"{'─' * 60}"]
     for rank, (tag, score) in enumerate(results, 1):
         bar = "█" * int(score * 20)
-        lines.append(f" {rank:>3}. {score:.3f} {bar:<20} {tag}")
     return "\n".join(lines)
 def _fmt_tags(results) -> str:
     return ", ".join(tag for tag, _ in results)
 def _fmt_json(path: str, results) -> dict:
-    return {"file": path,
-            "tags": [{"tag": t, "score": round(s, 4)} for t, s in results]}
 # =============================================================================
@@ -662,40 +418,28 @@ def _fmt_json(path: str, results) -> dict:
 def main():
     parser = argparse.ArgumentParser(
-        description="DINOv3 ViT-H/16+ tagger inference (standalone)",
         formatter_class=argparse.RawDescriptionHelpFormatter,
     )
-    parser.add_argument("--checkpoint", required=True,
-                        help="Path to .safetensors or .pt checkpoint")
-    parser.add_argument("--vocab", required=True,
-                        help="Path to tagger_vocab*.json")
-    parser.add_argument("--images", nargs="+", required=True,
-                        help="Image paths and/or http(s) URLs")
-    parser.add_argument("--device", default="cuda",
-                        help="Device: cuda, cuda:0, cpu (default: cuda)")
     parser.add_argument("--max-size", type=int, default=1024,
-                        help="Long-edge cap in pixels (default: 1024)")
     mode = parser.add_mutually_exclusive_group()
-    mode.add_argument("--topk", type=int, default=30,
-                      help="Return top-k tags (default: 30)")
-    mode.add_argument("--threshold", type=float,
-                      help="Return all tags with score >= threshold")
     parser.add_argument("--format", choices=["pretty", "tags", "json"],
                         default="pretty", help="Output format (default: pretty)")
     args = parser.parse_args()
-    tagger = Tagger(
-        checkpoint_path=args.checkpoint,
-        vocab_path=args.vocab,
-        device=args.device,
-        max_size=args.max_size,
-    )
-    topk, threshold = (
-        (None, args.threshold) if args.threshold else (args.topk, None)
-    )
     json_out = []
     for src in args.images:
@@ -704,16 +448,13 @@ def main():
             print(f"[warning] File not found: {src}", file=sys.stderr)
             continue
         results = tagger.predict(src, topk=topk, threshold=threshold)
-        if args.format == "pretty":
-            print(_fmt_pretty(src, results))
-        elif args.format == "tags":
-            print(_fmt_tags(results))
-        elif args.format == "json":
-            json_out.append(_fmt_json(src, results))
     if args.format == "json":
         print(json.dumps(json_out, indent=2, ensure_ascii=False))
 if __name__ == "__main__":
-    main()

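Note on the removed `_split_and_clean_state_dict` helper above: it renames checkpoint keys so that a strict `load_state_dict` can succeed. A small sketch of the same three remaps applied to key names only, assuming a hypothetical `remap_key` function and toy keys (the head key `proj.weight` and the rope buffer name are illustrative, not taken from the checkpoint):

```python
def remap_key(key):
    """Hypothetical helper: return ("backbone" | "head", new_key), or None to drop."""
    if not key.startswith("backbone."):
        return ("head", key)  # everything else is treated as a head parameter
    name = key[len("backbone."):]
    # Remap 1: strip the intermediate "model." wrapper before "layer.".
    if name.startswith("model.layer."):
        name = name[len("model."):]
    # Remap 2: layer_scale{1,2}.lambda1 becomes a plain parameter name.
    if ".layer_scale" in name and name.endswith(".lambda1"):
        name = name[: -len(".lambda1")]
    # Remap 3: rope buffers are recomputed on the fly, so drop them.
    if "rope_embeddings" in name:
        return None
    return ("backbone", name)


# Toy key names for illustration.
examples = [
    "backbone.model.layer.0.layer_scale1.lambda1",
    "backbone.embeddings.cls_token",
    "backbone.rope_embeddings.inv_freq",
    "proj.weight",
]
for k in examples:
    print(k, "->", remap_key(k))
```

Working on names first makes it easy to check the mapping against `model.state_dict().keys()` before any tensors are loaded.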
 # All hyperparameters match facebook/dinov3-vith16plus-pretrain-lvd1689m
 # =============================================================================
+D_MODEL     = 1280
+N_HEADS     = 20
+HEAD_DIM    = D_MODEL // N_HEADS   # 64
+N_LAYERS    = 32
+D_FFN       = 5120
 N_REGISTERS = 4
+PATCH_SIZE  = 16
+ROPE_THETA  = 100.0
+ROPE_RESCALE = 2.0   # pos_embed_rescale applied at inference
+LN_EPS      = 1e-5
+LAYERSCALE  = 1.0
 # ---------------------------------------------------------------------------
 @lru_cache(maxsize=32)
 def _patch_coords_cached(h: int, w: int, device_str: str) -> torch.Tensor:
+    """Normalised [-1,+1] patch-centre coordinates (float32, cached)."""
     device = torch.device(device_str)
     cy = torch.arange(0.5, h, dtype=torch.float32, device=device) / h
     cx = torch.arange(0.5, w, dtype=torch.float32, device=device) / w
     coords = torch.stack(torch.meshgrid(cy, cx, indexing="ij"), dim=-1).flatten(0, 1)
+    coords = 2.0 * coords - 1.0   # [0,1] → [-1,+1]
     coords = coords * ROPE_RESCALE
     return coords  # [h*w, 2]
 def _build_rope(h_patches: int, w_patches: int,
                 dtype: torch.dtype, device: torch.device):
+    """Return (cos, sin) of shape [1, 1, h*w, HEAD_DIM] for broadcasting."""
+    coords = _patch_coords_cached(h_patches, w_patches, str(device))  # [P, 2]
     inv_freq = 1.0 / (ROPE_THETA ** torch.arange(
+        0, 1, 4 / HEAD_DIM, dtype=torch.float32, device=device))      # [D/4]
+    angles = 2 * math.pi * coords[:, :, None] * inv_freq[None, None, :]  # [P, 2, D/4]
+    angles = angles.flatten(1, 2).tile(2)                                 # [P, D]
+    cos = torch.cos(angles).to(dtype).unsqueeze(0).unsqueeze(0)  # [1,1,P,D]
     sin = torch.sin(angles).to(dtype).unsqueeze(0).unsqueeze(0)
     return cos, sin
 def _apply_rope(q: torch.Tensor, k: torch.Tensor,
                 cos: torch.Tensor, sin: torch.Tensor):
+    """Apply RoPE only to patch tokens (skip CLS + register prefix)."""
     n_pre = 1 + N_REGISTERS
     q_pre, q_pat = q[..., :n_pre, :], q[..., n_pre:, :]
     k_pre, k_pat = k[..., :n_pre, :], k[..., n_pre:, :]
 # ---------------------------------------------------------------------------
+# Building blocks
 # ---------------------------------------------------------------------------
 class _Attention(nn.Module):
         self.v_proj = nn.Linear(D_MODEL, D_MODEL, bias=True)
         self.o_proj = nn.Linear(D_MODEL, D_MODEL, bias=True)
+    def forward(self, x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
         B, S, _ = x.shape
         q = self.q_proj(x).view(B, S, N_HEADS, HEAD_DIM).transpose(1, 2)
         k = self.k_proj(x).view(B, S, N_HEADS, HEAD_DIM).transpose(1, 2)
     def __init__(self):
         super().__init__()
         self.gate_proj = nn.Linear(D_MODEL, D_FFN, bias=True)
+        self.up_proj   = nn.Linear(D_MODEL, D_FFN, bias=True)
+        self.down_proj = nn.Linear(D_FFN,   D_MODEL, bias=True)
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
         return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
 class _Block(nn.Module):
     def __init__(self):
         super().__init__()
+        self.norm1        = nn.LayerNorm(D_MODEL, eps=LN_EPS)
+        self.attention    = _Attention()
         self.layer_scale1 = nn.Parameter(torch.full((D_MODEL,), LAYERSCALE))
+        self.norm2        = nn.LayerNorm(D_MODEL, eps=LN_EPS)
+        self.mlp          = _GatedMLP()
         self.layer_scale2 = nn.Parameter(torch.full((D_MODEL,), LAYERSCALE))
+    def forward(self, x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
         x = x + self.attention(self.norm1(x), cos, sin) * self.layer_scale1
         x = x + self.mlp(self.norm2(x)) * self.layer_scale2
         return x
+# ---------------------------------------------------------------------------
+# Full backbone
+# ---------------------------------------------------------------------------
 class DINOv3ViTH(nn.Module):
     """DINOv3 ViT-H/16+ backbone.
+    Accepts any H, W that are multiples of 16.
     Returns last_hidden_state [B, 1+R+P, D_MODEL].
+    Token layout: [CLS, reg_0..reg_3, patch_0..patch_N].
+    State-dict keys are intentionally identical to the HuggingFace
+    transformers layout so .safetensors checkpoints load without remapping.
     """
     def __init__(self):
         super().__init__()
+        # These names must match HF exactly
         self.embeddings = _Embeddings()
         self.layer = nn.ModuleList([_Block() for _ in range(N_LAYERS)])
+        self.norm  = nn.LayerNorm(D_MODEL, eps=LN_EPS)
+    def _load_from_state_dict(self, state_dict, prefix, local_metadata,
+                               strict, missing_keys, unexpected_keys, error_msgs):
+        # HF stores layer_scale as a sub-module with a "lambda1" parameter;
+        # we store it as a plain Parameter directly on _Block.
+        # Remap "layer.i.layer_scale{1,2}.lambda1" → "layer.i.layer_scale{1,2}"
+        for k in list(state_dict.keys()):
+            if k.startswith(prefix) and ".layer_scale" in k and k.endswith(".lambda1"):
+                new_k = k[:-len(".lambda1")]
+                state_dict[new_k] = state_dict.pop(k)
+        # Drop rope_embeddings buffer (computed on-the-fly)
+        for k in list(state_dict.keys()):
+            if k.startswith(prefix) and "rope_embeddings" in k:
+                state_dict.pop(k)
+        super()._load_from_state_dict(
+            state_dict, prefix, local_metadata, strict,
+            missing_keys, unexpected_keys, error_msgs)
+    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
+        B, _, H, W = pixel_values.shape
+        x = self.embeddings(pixel_values)  # [B, 1+R+P, D]
         h_p, w_p = H // PATCH_SIZE, W // PATCH_SIZE
         cos, sin = _build_rope(h_p, w_p, x.dtype, pixel_values.device)
         for block in self.layer:
             x = block(x, cos, sin)
+        return self.norm(x)
+class _Embeddings(nn.Module):
+    """Patch + CLS + register token embeddings.
+    Key names match HF: embeddings.cls_token, embeddings.register_tokens,
+    embeddings.patch_embeddings.{weight,bias}.
     """
+    def __init__(self):
         super().__init__()
+        self.cls_token       = nn.Parameter(torch.empty(1, 1, D_MODEL))
+        self.mask_token      = nn.Parameter(torch.zeros(1, 1, D_MODEL))  # unused at inference
+        self.register_tokens = nn.Parameter(torch.empty(1, N_REGISTERS, D_MODEL))
+        self.patch_embeddings = nn.Conv2d(3, D_MODEL, kernel_size=PATCH_SIZE, stride=PATCH_SIZE)
+    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
+        B = pixel_values.shape[0]
+        dtype = self.patch_embeddings.weight.dtype
+        patches = self.patch_embeddings(pixel_values.to(dtype)).flatten(2).transpose(1, 2)
+        cls  = self.cls_token.expand(B, -1, -1)
+        regs = self.register_tokens.expand(B, -1, -1)
+        return torch.cat([cls, regs, patches], dim=1)
 # =============================================================================
+# Tagger head
 # =============================================================================
 class DINOv3Tagger(nn.Module):
+    """DINOv3 ViT-H/16+ backbone + linear projection head.
+    features = concat(CLS, reg_0..reg_3) → [B, (1+R)*D]
+    projection: Linear → [B, num_tags]
     """
+    def __init__(self, num_tags: int, projection_bias: bool = False):
+        super().__init__()
+        self.backbone   = DINOv3ViTH()
+        self.projection = nn.Linear((1 + N_REGISTERS) * D_MODEL, num_tags, bias=projection_bias)
+    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
+        hidden   = self.backbone(pixel_values)                      # [B, S, D]
+        cls      = hidden[:, 0, :]                                  # [B, D]
+        regs     = hidden[:, 1: 1 + N_REGISTERS, :].flatten(1)     # [B, R*D]
+        features = torch.cat([cls, regs], dim=-1)                   # [B, (1+R)*D]
+        return self.projection(features.float())                     # fp32 for stability
 # =============================================================================
 # =============================================================================
 _IMAGENET_MEAN = [0.485, 0.456, 0.406]
+_IMAGENET_STD  = [0.229, 0.224, 0.225]
 def _snap(x: int, m: int) -> int:
 def preprocess_image(source, max_size: int = 1024) -> torch.Tensor:
+    """Load and preprocess an image → [1, 3, H, W] float32, ImageNet-normalised."""
     img = _open_image(source)
     w, h = img.size
+    scale = min(1.0, max_size / max(w, h))
+    new_w = _snap(round(w * scale), PATCH_SIZE)
+    new_h = _snap(round(h * scale), PATCH_SIZE)
     return v2.Compose([
         v2.Resize((new_h, new_w), interpolation=v2.InterpolationMode.LANCZOS),
         v2.ToImage(),
     Parameters
     ----------
     checkpoint_path : str
+        Path to a .safetensors or .pth checkpoint saved by TaggerTrainer.
     vocab_path : str
+        Path to tagger_vocab.json  ({"idx2tag": [...]}).
     device : str
+        "cuda", "cuda:0", "cpu", etc.
     dtype : torch.dtype
+        bfloat16 recommended on Ampere+; float16 for older GPUs; float32 for CPU.
     max_size : int
         Long-edge cap in pixels before feeding to the model.
     """
         dtype: torch.dtype = torch.bfloat16,
         max_size: int = 1024,
     ):
+        self.device   = torch.device(device if torch.cuda.is_available() or device == "cpu" else "cpu")
+        self.dtype    = dtype
         self.max_size = max_size
         with open(vocab_path) as f:
         self.num_tags = len(self.idx2tag)
         print(f"[Tagger] Vocabulary: {self.num_tags:,} tags")
+        self.model = DINOv3Tagger(num_tags=self.num_tags)
         print(f"[Tagger] Loading checkpoint: {checkpoint_path}")
         if checkpoint_path.endswith((".safetensors", ".sft")):
+            sd = load_file(checkpoint_path, device=str(self.device))
         else:
+            sd = torch.load(checkpoint_path, map_location=str(self.device))
+        missing, unexpected = self.model.load_state_dict(sd, strict=False, assign=True)
+        if missing:
+            print(f"[Tagger] Missing keys ({len(missing)}): {missing[:5]}{'...' if len(missing) > 5 else ''}")
+        if unexpected:
+            print(f"[Tagger] Unexpected keys ({len(unexpected)}): {unexpected[:5]}{'...' if len(unexpected) > 5 else ''}")
+        self.model.backbone = self.model.backbone.to(dtype=dtype)
+        self.model = self.model.to(self.device)
+        self.model.eval()
+        print(f"[Tagger] Ready on {self.device} ({dtype})")
     @torch.no_grad()
     def predict(self, image, topk: int | None = 30,
                 threshold: float | None = None) -> list[tuple[str, float]]:
+        """Tag a single image (local path or URL).
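+        Specify either topk or threshold; topk takes precedence when it is not None,
+        so pass topk=None to select by threshold.
+        Returns [(tag, score), ...] sorted by descending score."""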
         if topk is None and threshold is None:
             topk = 30
         pv = preprocess_image(image, max_size=self.max_size).to(self.device)
+        with torch.autocast(device_type=self.device.type, dtype=self.dtype):
+            logits = self.model(pv)[0]
         scores = torch.sigmoid(logits.float())
         if topk is not None:
         else:
             assert threshold is not None
             indices = (scores >= threshold).nonzero(as_tuple=True)[0]
+            values  = scores[indices]
+            order   = values.argsort(descending=True)
             indices, values = indices[order], values[order]
+        return [(self.idx2tag[i], float(v)) for i, v in zip(indices.tolist(), values.tolist())]
     @torch.no_grad()
     def predict_batch(self, images, topk: int | None = 30,
+                      threshold: float | None = None) -> list[list[tuple[str, float]]]:
+        """Tag multiple images (processed individually for mixed resolutions)."""
+        return [self.predict(img, topk=topk, threshold=threshold) for img in images]
 # =============================================================================
 # =============================================================================
 def _fmt_pretty(path: str, results) -> str:
+    lines = [f"\n{'─' * 60}", f"  {path}", f"{'─' * 60}"]
     for rank, (tag, score) in enumerate(results, 1):
         bar = "█" * int(score * 20)
+        lines.append(f"  {rank:>3}.  {score:.3f}  {bar:<20}  {tag}")
     return "\n".join(lines)
 def _fmt_tags(results) -> str:
     return ", ".join(tag for tag, _ in results)
 def _fmt_json(path: str, results) -> dict:
+    return {"file": path, "tags": [{"tag": t, "score": round(s, 4)} for t, s in results]}
 # =============================================================================
 def main():
     parser = argparse.ArgumentParser(
+        description="DINOv3 ViT-H/16+ tagger inference (standalone, no transformers dep)",
         formatter_class=argparse.RawDescriptionHelpFormatter,
     )
+    parser.add_argument("--checkpoint", required=True, help="Path to .safetensors or .pth checkpoint")
+    parser.add_argument("--vocab",      required=True, help="Path to tagger_vocab.json")
+    parser.add_argument("--images", nargs="+", required=True, help="Image paths and/or http(s) URLs")
+    parser.add_argument("--device",   default="cuda",  help="Device: cuda, cuda:0, cpu, … (default: cuda)")
     parser.add_argument("--max-size", type=int, default=1024,
+                        help="Long-edge cap in pixels, multiple of 16 (default: 1024)")
     mode = parser.add_mutually_exclusive_group()
+    mode.add_argument("--topk",      type=int,   default=30, help="Return top-k tags (default: 30)")
+    mode.add_argument("--threshold", type=float,             help="Return all tags with score >= threshold")
     parser.add_argument("--format", choices=["pretty", "tags", "json"],
                         default="pretty", help="Output format (default: pretty)")
     args = parser.parse_args()
+    tagger = Tagger(checkpoint_path=args.checkpoint, vocab_path=args.vocab,
+                    device=args.device, max_size=args.max_size)
+    topk, threshold = (None, args.threshold) if args.threshold is not None else (args.topk, None)
     json_out = []
     for src in args.images:
             print(f"[warning] File not found: {src}", file=sys.stderr)
             continue
         results = tagger.predict(src, topk=topk, threshold=threshold)
+        if   args.format == "pretty": print(_fmt_pretty(src, results))
+        elif args.format == "tags":   print(_fmt_tags(results))
+        elif args.format == "json":   json_out.append(_fmt_json(src, results))
     if args.format == "json":
         print(json.dumps(json_out, indent=2, ensure_ascii=False))
 if __name__ == "__main__":
+    main()
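
Note on the `_patch_coords_cached` / `_build_rope` hunks above: the coordinate and frequency arithmetic can be checked by hand without torch. The following is a minimal sketch under that assumption. The constants are copied from the diff; the helper names `patch_centres`, `inv_frequencies` and `angles_for_patch` are hypothetical and not part of the PR.

```python
import math

# Constants copied from the diff above.
HEAD_DIM = 64
ROPE_THETA = 100.0
ROPE_RESCALE = 2.0


def patch_centres(n: int, rescale: float = ROPE_RESCALE) -> list[float]:
    """Centres of n patches along one axis: [0,1] -> [-1,+1], then rescaled."""
    return [(2.0 * ((i + 0.5) / n) - 1.0) * rescale for i in range(n)]


def inv_frequencies(head_dim: int = HEAD_DIM, theta: float = ROPE_THETA) -> list[float]:
    """1 / theta**e for exponents e = 0, 4/head_dim, ... (head_dim // 4 values)."""
    return [1.0 / (theta ** (4.0 * k / head_dim)) for k in range(head_dim // 4)]


def angles_for_patch(y: float, x: float) -> list[float]:
    """Angle vector for one patch: y-block then x-block, repeated twice."""
    freqs = inv_frequencies()
    half = [2 * math.pi * c * f for c in (y, x) for f in freqs]  # 2 * D/4 values
    return half + half  # mirrors .tile(2): HEAD_DIM values in total


if __name__ == "__main__":
    print(patch_centres(2))                    # two patch centres on one axis
    print(len(inv_frequencies()))              # D/4 frequencies
    print(len(angles_for_patch(-1.0, 1.0)))    # one angle per head dimension
```

With `ROPE_RESCALE = 2.0`, a two-patch axis yields centres at -1.0 and +1.0, and each patch gets `HEAD_DIM` angles, which matches the `[1, 1, h*w, HEAD_DIM]` shape stated in the `_build_rope` docstring.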

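Note on `Tagger.predict` above: the two selection modes can be sketched on plain Python lists. This is an illustration only. The body of the top-k branch is not shown in the diff, so the slicing used here is an assumption, and `select_tags` is a hypothetical name.

```python
def select_tags(scores, idx2tag, topk=30, threshold=None):
    """Return [(tag, score), ...] in descending score order.

    Mirrors the branch order visible in the diff: topk is used whenever it is
    not None, and threshold only applies when topk is None.
    """
    if topk is None and threshold is None:
        topk = 30
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    if topk is not None:
        chosen = ranked[:topk]  # assumed behaviour of the elided top-k branch
    else:
        chosen = [(i, s) for i, s in ranked if s >= threshold]
    return [(idx2tag[i], float(s)) for i, s in chosen]


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.5]
    tags = ["a", "b", "c"]
    print(select_tags(scores, tags, topk=2))
    print(select_tags(scores, tags, topk=None, threshold=0.5))
    # threshold is ignored here because topk keeps its default of 30
    print(select_tags(scores, tags, threshold=0.95))
```

Because `topk` defaults to 30, a caller who passes only `threshold` still gets the top-k branch; passing `topk=None` explicitly is what switches modes. A threshold of `0.0` is a valid value, so any check on it should compare against `None` rather than rely on truthiness.
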
requirements.txt DELETED

@@ -1,11 +0,0 @@
-packaging
-safetensors
-requests
-Pillow
-torch
-torchvision
-python-multipart
-fastapi>=0.121.0
-uvicorn
-jinja2
-aiofiles

tagger_proto.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:6ef471936e144eb75a21b37e664f7499e0139a6643170a59473cdde8d1d4c238
-size 5272842048

 version https://git-lfs.github.com/spec/v1
+oid sha256:20fd8c1cad2fa5653c632d847397f0491faa662d33abde1e9239041bc95b8b6c
+size 5272838400
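
Note on the pointer change above: a Git LFS pointer records the SHA-256 and byte size of the real file, so a downloaded `tagger_proto.safetensors` can be checked against it. The following is a minimal sketch using only the standard library; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, not part of the PR.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict[str, str]:
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if line:
            key, _, value = line.partition(" ")
            fields[key] = value
    return fields


def matches_pointer(path: str, pointer_text: str, chunk_size: int = 1 << 20) -> bool:
    """True if the file at `path` has the size and SHA-256 named in the pointer."""
    fields = parse_lfs_pointer(pointer_text)
    algo, _, expected = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algo}")
    file_path = Path(path)
    if file_path.stat().st_size != int(fields["size"]):
        return False  # cheap check first, avoids hashing a multi-GB file
    digest = hashlib.sha256()
    with file_path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected
```

The size comparison runs before hashing because the checkpoint is several gigabytes; a truncated download is rejected without reading the file.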

tagger_ui/templates/index.html CHANGED

@@ -202,36 +202,6 @@
     .tag-pill:hover { opacity: .8; }
     .tag-pill .score { font-size: .66rem; opacity: .7; }
     .tag-pill.hidden { display: none; }
-    /* ---- PCA panel ---- */
-    .preview-wrap { flex-wrap: wrap; }
-    .preview-col { flex: 1 1 0; min-width: 0; }
-    .pca-col {
-      flex: 1 1 0; min-width: 0;
-      display: flex; flex-direction: column; gap: .5rem;
-    }
-    .pca-label {
-      font-size: .72rem; color: var(--muted); text-align: center;
-      letter-spacing: .04em; text-transform: uppercase;
-    }
-    #pca-img {
-      border-radius: var(--radius); width: 100%; max-height: 420px;
-      object-fit: contain; border: 1px solid var(--border);
-      display: block; image-rendering: pixelated;
-    }
-    #pca-spinner {
-      display: none; width: 18px; height: 18px; margin: auto;
-      border: 3px solid var(--border); border-top-color: var(--accent);
-      border-radius: 50%; animation: spin .7s linear infinite;
-    }
-    .pca-toggle {
-      background: var(--bg); border: 1px solid var(--border);
-      border-radius: 6px; color: var(--muted); cursor: pointer;
-      font-size: .75rem; padding: .3rem .7rem; align-self: center;
-      transition: border-color .15s, color .15s;
-    }
-    .pca-toggle:hover { border-color: var(--accent); color: var(--text); }
-    .pca-toggle.active { border-color: var(--accent); color: #a78bfa; }
   </style>
 </head>
 <body>
@@ -269,22 +239,12 @@
   <div id="results-area">
-    <!-- image + PCA side by side -->
     <div class="preview-wrap">
-      <div class="preview-col">
         <img id="preview-img" src="" alt="preview" />
         <div class="img-meta" id="img-meta"></div>
       </div>
-      <div class="pca-col" id="pca-col" style="display:none">
-        <div class="pca-label">PCA · patch features (R=PC1, G=PC2, B=PC3)</div>
-        <div id="pca-spinner"></div>
-        <img id="pca-img" src="" alt="PCA" style="display:none" />
-      </div>
-    </div>
-    <!-- PCA toggle -->
-    <div style="display:flex;justify-content:flex-end;margin-bottom:.6rem">
-      <button class="pca-toggle" id="pca-toggle" onclick="togglePca()">Show PCA</button>
     </div>
     <!-- global copy bar -->
@@ -338,48 +298,6 @@
     if (el) el.value = Math.max(1, Math.min(99, parseInt(pct) || 1));
   }
-  // ---- PCA state ----
-  let _pcaEnabled = false;
-  let _lastPcaRequest = null;  // { type: 'url'|'file', url?: string, file?: File }
-  function togglePca() {
-    _pcaEnabled = !_pcaEnabled;
-    const btn = document.getElementById('pca-toggle');
-    btn.textContent = _pcaEnabled ? 'Hide PCA' : 'Show PCA';
-    btn.classList.toggle('active', _pcaEnabled);
-    document.getElementById('pca-col').style.display = _pcaEnabled ? 'flex' : 'none';
-    if (_pcaEnabled && _lastPcaRequest) runPca(_lastPcaRequest);
-  }
-  function runPca(req) {
-    const spinner = document.getElementById('pca-spinner');
-    const img     = document.getElementById('pca-img');
-    spinner.style.display = 'block';
-    img.style.display = 'none';
-    const maxSize = document.getElementById('maxsize-input').value;
-    let fetchPromise;
-    if (req.type === 'url') {
-      fetchPromise = fetch(
-        `/pca/url?max_size=${maxSize}&url=${encodeURIComponent(req.url)}`,
-        { method: 'POST' }
-      );
-    } else {
-      const fd = new FormData();
-      fd.append('file', req.file);
-      fetchPromise = fetch(`/pca/upload?max_size=${maxSize}`, { method: 'POST', body: fd });
-    }
-    fetchPromise
-      .then(r => r.ok ? r.blob() : Promise.reject('PCA failed'))
-      .then(blob => {
-        img.src = URL.createObjectURL(blob);
-        img.style.display = 'block';
-      })
-      .catch(() => { img.style.display = 'none'; })
-      .finally(() => { spinner.style.display = 'none'; });
-  }
   // ---- drag & drop ----
   const dz = document.getElementById('drop-zone');
   dz.addEventListener('dragover',  e => { e.preventDefault(); dz.classList.add('drag-over'); });
@@ -396,8 +314,6 @@
     const url = document.getElementById('url-input').value.trim();
     if (!url) return;
     setPreview(url, url);
-    _lastPcaRequest = { type: 'url', url };
-    if (_pcaEnabled) runPca(_lastPcaRequest);
     submitFetch(`/tag/url?max_size=${document.getElementById('maxsize-input').value}&url=${encodeURIComponent(url)}`,
                 { method: 'POST' });
   }
@@ -409,8 +325,6 @@
     const reader = new FileReader();
     reader.onload = e => setPreview(e.target.result, file.name);
     reader.readAsDataURL(file);
-    _lastPcaRequest = { type: 'file', file };
-    if (_pcaEnabled) runPca(_lastPcaRequest);
     submitFetch(`/tag/upload?max_size=${maxSize}`, { method: 'POST', body: fd });
   }

     .tag-pill:hover { opacity: .8; }
     .tag-pill .score { font-size: .66rem; opacity: .7; }
     .tag-pill.hidden { display: none; }
   </style>
 </head>
 <body>
   <div id="results-area">
+    <!-- image full-width on top -->
     <div class="preview-wrap">
+      <div style="width:100%">
         <img id="preview-img" src="" alt="preview" />
         <div class="img-meta" id="img-meta"></div>
       </div>
     </div>
     <!-- global copy bar -->
     if (el) el.value = Math.max(1, Math.min(99, parseInt(pct) || 1));
   }
   // ---- drag & drop ----
   const dz = document.getElementById('drop-zone');
   dz.addEventListener('dragover',  e => { e.preventDefault(); dz.classList.add('drag-over'); });
     const url = document.getElementById('url-input').value.trim();
     if (!url) return;
     setPreview(url, url);
     submitFetch(`/tag/url?max_size=${document.getElementById('maxsize-input').value}&url=${encodeURIComponent(url)}`,
                 { method: 'POST' });
   }
     const reader = new FileReader();
     reader.onload = e => setPreview(e.target.result, file.name);
     reader.readAsDataURL(file);
     submitFetch(`/tag/upload?max_size=${maxSize}`, { method: 'POST', body: fd });
   }

tagger_ui_server.py CHANGED

@@ -48,7 +48,7 @@ CATEGORY_META: dict[int, dict] = {
     3:  {"name": "contributor",     "color": "#a78bfa"},   # raw  2
     4:  {"name": "copyright",       "color": "#fb923c"},   # raw  3
     5:  {"name": "character",       "color": "#60a5fa"},   # raw  4
-    6:  {"name": "species",         "color": "#facc15"},   # raw  5
     7:  {"name": "disambiguation",  "color": "#94a3b8"},   # raw  6
     8:  {"name": "meta",            "color": "#e2e8f0"},   # raw  7
     9:  {"name": "lore",            "color": "#f87171"},   # raw  8
@@ -113,46 +113,6 @@ async def tag_upload(
     return _run_tagger(img, max_size, floor)
-# ---------------------------------------------------------------------------
-# PCA endpoints
-# ---------------------------------------------------------------------------
-@app.post("/pca/url")
-async def pca_url(
-    url:      str = Query(...),
-    max_size: int = Query(default=1024),
-):
-    from fastapi.responses import Response
-    assert _tagger is not None
-    try:
-        from inference_tagger_standalone import _open_image
-        img = _open_image(url)
-    except Exception as e:
-        raise HTTPException(status_code=400, detail=f"Could not fetch image: {e}")
-    pca_img = _tagger.embed_pca(img, max_size=max_size)
-    buf = io.BytesIO()
-    pca_img.save(buf, format="PNG")
-    return Response(content=buf.getvalue(), media_type="image/png")
-@app.post("/pca/upload")
-async def pca_upload(
-    file:     UploadFile = File(...),
-    max_size: int        = Query(default=1024),
-):
-    from fastapi.responses import Response
-    assert _tagger is not None
-    try:
-        data = await file.read()
-        img = Image.open(io.BytesIO(data)).convert("RGB")
-    except Exception as e:
-        raise HTTPException(status_code=400, detail=f"Could not read image: {e}")
-    pca_img = _tagger.embed_pca(img, max_size=max_size)
-    buf = io.BytesIO()
-    pca_img.save(buf, format="PNG")
-    return Response(content=buf.getvalue(), media_type="image/png")
 # ---------------------------------------------------------------------------
 # Inference helper
 # ---------------------------------------------------------------------------

     3:  {"name": "contributor",     "color": "#a78bfa"},   # raw  2
     4:  {"name": "copyright",       "color": "#fb923c"},   # raw  3
     5:  {"name": "character",       "color": "#60a5fa"},   # raw  4
+    6:  {"name": "species/meta",    "color": "#facc15"},   # raw  5
     7:  {"name": "disambiguation",  "color": "#94a3b8"},   # raw  6
     8:  {"name": "meta",            "color": "#e2e8f0"},   # raw  7
     9:  {"name": "lore",            "color": "#f87171"},   # raw  8
     return _run_tagger(img, max_size, floor)
 # ---------------------------------------------------------------------------
 # Inference helper
 # ---------------------------------------------------------------------------

tagger_vocab_with_categories.json CHANGED

The diff for this file is too large to render. See raw diff

tagger_vocab_with_categories_and_alias.json DELETED

The diff for this file is too large to render. See raw diff

tagger_vocab_with_categories_and_alias_updated.json DELETED

The diff for this file is too large to render. See raw diff