Add files using upload-large-folder tool
Browse files- .gitattributes +6 -0
- LICENSE +21 -0
- PET_Finetuned.safetensors +3 -0
- README.md +151 -3
- TechnicalReport.pdf +3 -0
- images/pexels-558331748-30295833.jpg +3 -0
- images/pexels-ilyasajpg-7038431.jpg +3 -0
- images/pexels-peter-almario-388108-19472286.jpg +3 -0
- images/pexels-rafeeque-kodungookaran-374579689-18755903.jpg +3 -0
- images/pexels-wendywei-4945353.jpg +3 -0
- test.py +406 -0
.gitattributes
CHANGED
|
@@ -33,3 +33,9 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
|
|
| 33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 33 |
*.zip filter=lfs diff=lfs merge=lfs -text
|
| 34 |
*.zst filter=lfs diff=lfs merge=lfs -text
|
| 35 |
*tfevents* filter=lfs diff=lfs merge=lfs -text
|
| 36 |
+
TechnicalReport.pdf filter=lfs diff=lfs merge=lfs -text
|
| 37 |
+
images/pexels-558331748-30295833.jpg filter=lfs diff=lfs merge=lfs -text
|
| 38 |
+
images/pexels-ilyasajpg-7038431.jpg filter=lfs diff=lfs merge=lfs -text
|
| 39 |
+
images/pexels-peter-almario-388108-19472286.jpg filter=lfs diff=lfs merge=lfs -text
|
| 40 |
+
images/pexels-rafeeque-kodungookaran-374579689-18755903.jpg filter=lfs diff=lfs merge=lfs -text
|
| 41 |
+
images/pexels-wendywei-4945353.jpg filter=lfs diff=lfs merge=lfs -text
|
LICENSE
ADDED
|
@@ -0,0 +1,21 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
MIT License
|
| 2 |
+
|
| 3 |
+
Copyright (c) 2026 Awiros
|
| 4 |
+
|
| 5 |
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
| 6 |
+
of this software and associated documentation files (the "Software"), to deal
|
| 7 |
+
in the Software without restriction, including without limitation the rights
|
| 8 |
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
| 9 |
+
copies of the Software, and to permit persons to whom the Software is
|
| 10 |
+
furnished to do so, subject to the following conditions:
|
| 11 |
+
|
| 12 |
+
The above copyright notice and this permission notice shall be included in all
|
| 13 |
+
copies or substantial portions of the Software.
|
| 14 |
+
|
| 15 |
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
| 16 |
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
| 17 |
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
| 18 |
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
| 19 |
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
| 20 |
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
| 21 |
+
SOFTWARE.
|
PET_Finetuned.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:ab940304e869fff4afe92dd8c2ebff798603ce18f4548e0435aa923bf4f15f39
|
| 3 |
+
size 224692940
|
README.md
CHANGED
|
@@ -1,3 +1,151 @@
|
|
| 1 |
-
---
|
| 2 |
-
license: mit
|
| 3 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
---
|
| 2 |
+
license: mit
|
| 3 |
+
language:
|
| 4 |
+
- en
|
| 5 |
+
library_name: pytorch
|
| 6 |
+
tags: [crowd-counting, localization, PET]
|
| 7 |
+
---
|
| 8 |
+
|
| 9 |
+
# Hierarchical Training on Partial Annotations Enables Density-Robust Crowd Counting and Localization
|
| 10 |
+
|
| 11 |
+
## Abstract
|
| 12 |
+
|
| 13 |
+
Reliable crowd analysis requires both accurate counting and precise head-point
|
| 14 |
+
localization under severe density and scale variation. In practice, dense
|
| 15 |
+
scenes exhibit heavy occlusion and perspective distortion, while the same
|
| 16 |
+
camera can undergo abrupt distribution shifts over time due to zoom and
|
| 17 |
+
viewpoint changes or event dynamics. We present the model, obtained by fine-tuning Point Query Transformer (PET) on a
|
| 18 |
+
curated, multi-source dataset with partial and heterogeneous annotations. Our
|
| 19 |
+
training recipe combines (i) a hierarchical iterative loop that aligns count
|
| 20 |
+
distributions across partial ground truth, fine-tuned predictions, and the
|
| 21 |
+
pre-trained baseline to guide outlier-driven data refinement, (ii)
|
| 22 |
+
multi-patch resolution training (128x128, 256x256, and 512x512) to reduce
|
| 23 |
+
scale sensitivity, (iii) count-aware patch sampling to mitigate long-tailed
|
| 24 |
+
density skew, and (iv) adaptive background-query loss weighting to prevent
|
| 25 |
+
resolution-dependent background dominance. This approach improves F1 scores
|
| 26 |
+
F1@4px and F1@8px on ShanghaiTech Part A (SHHA), ShanghaiTech Part B (SHHB),
|
| 27 |
+
JHU-Crowd++, and UCF-QNRF, and exhibits more stable behavior during
|
| 28 |
+
sparse-to-dense density transitions.
|
| 29 |
+
|
| 30 |
+
For detailed data curation and training recipe, refer to our technical
|
| 31 |
+
report: [Technical Report](TechnicalReport.pdf).
|
| 32 |
+
|
| 33 |
+
## Evaluation and Results
|
| 34 |
+
|
| 35 |
+
Across four benchmarks, PET-Finetuned shows the strongest overall transfer,
|
| 36 |
+
with consistent gains in both counting and localization on SHHB, UCF-QNRF, and
|
| 37 |
+
JHU-Crowd++. On SHHB, it reduces MAE/MSE to 13.794/22.163 from 19.472/29.651
|
| 38 |
+
(PET-SHHA) and 19.579/28.398 (APGCC-SHHA), while increasing F1@8 to 0.820.
|
| 39 |
+
The same pattern holds on UCF-QNRF (MAE 105.772, MSE 199.544, F1@8 0.738) and
|
| 40 |
+
JHU-Crowd++ (MAE 74.778, MSE 271.886, F1@8 0.698), where PET-Finetuned
|
| 41 |
+
outperforms both references by clear margins. On SHHA, counting error is higher
|
| 42 |
+
than PET-SHHA and APGCC-SHHA (MAE 62.742 vs 48.879/48.725), but localization is
|
| 43 |
+
best in the table (F1@4 0.614, F1@8 0.794), indicating a stronger precision-recall
|
| 44 |
+
balance for head-point prediction at both matching thresholds.
|
| 45 |
+
|
| 46 |
+
> **Note (evaluation protocol):** PET-SHHA and APGCC-SHHA numbers in this
|
| 47 |
+
> section can differ from values reported in the original papers. The original
|
| 48 |
+
> works typically train one model per target dataset and evaluate in-domain. In
|
| 49 |
+
> contrast, `PET-Finetuned(Ours)` is initialized from PET-SHHA weights and
|
| 50 |
+
> fine-tuned in our framework. For cross-dataset baseline comparison, we use
|
| 51 |
+
> the best public SHHA Part A checkpoints released by the authors for PET-SHHA
|
| 52 |
+
> and APGCC-SHHA (APGCC publicly provides only the SHHA-best checkpoint).
|
| 53 |
+
> Therefore, the PET-SHHA and APGCC-SHHA rows above reflect transfer from SHHA
|
| 54 |
+
> initialization rather than per-dataset retraining. All metrics in this
|
| 55 |
+
> section are evaluated at `threshold = 0.5`.
|
| 56 |
+
|
| 57 |
+
### ShanghaiTech Part A (SHHA)
|
| 58 |
+
|
| 59 |
+
| Model | MAE | MSE | AP@4px | AR@4px | F1@4px | AP@8px | AR@8px | F1@8px |
|
| 60 |
+
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
|
| 61 |
+
| PET-Finetuned(Ours) | 62.742 | 102.996 | **0.615** | **0.613** | **0.614** | **0.796** | **0.793** | **0.794** |
|
| 62 |
+
| PET-SHHA | 48.879 | **76.520** | 0.596 | 0.604 | 0.600 | 0.781 | 0.792 | 0.786 |
|
| 63 |
+
| APGCC-SHHA | **48.725** | 76.721 | 0.439 | 0.428 | 0.433 | 0.773 | 0.754 | 0.764 |
|
| 64 |
+
|
| 65 |
+
### ShanghaiTech Part B (SHHB)
|
| 66 |
+
|
| 67 |
+
| Model | MAE | MSE | AP@4px | AR@4px | F1@4px | AP@8px | AR@8px | F1@8px |
|
| 68 |
+
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
|
| 69 |
+
| PET-Finetuned(Ours) | **13.794** | **22.163** | **0.666** | **0.596** | **0.629** | **0.869** | **0.777** | **0.820** |
|
| 70 |
+
| PET-SHHA | 19.472 | 29.651 | 0.640 | 0.547 | 0.590 | 0.847 | 0.724 | 0.781 |
|
| 71 |
+
| APGCC-SHHA | 19.579 | 28.398 | 0.517 | 0.441 | 0.476 | 0.837 | 0.714 | 0.771 |
|
| 72 |
+
|
| 73 |
+
### UCF-QNRF
|
| 74 |
+
|
| 75 |
+
| Model | MAE | MSE | AP@4px | AR@4px | F1@4px | AP@8px | AR@8px | F1@8px |
|
| 76 |
+
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
|
| 77 |
+
| PET-Finetuned(Ours) | **105.772** | **199.544** | **0.533** | **0.505** | **0.519** | **0.759** | **0.719** | **0.738** |
|
| 78 |
+
| PET-SHHA | 123.135 | 240.943 | 0.495 | 0.487 | 0.491 | 0.708 | 0.696 | 0.702 |
|
| 79 |
+
| APGCC-SHHA | 126.763 | 228.998 | 0.311 | 0.284 | 0.297 | 0.638 | 0.583 | 0.609 |
|
| 80 |
+
|
| 81 |
+
### JHU-Crowd++
|
| 82 |
+
|
| 83 |
+
| Model | MAE | MSE | AP@4px | AR@4px | F1@4px | AP@8px | AR@8px | F1@8px |
|
| 84 |
+
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
|
| 85 |
+
| PET-Finetuned(Ours) | **74.778** | **271.886** | **0.467** | **0.491** | **0.479** | **0.681** | **0.715** | **0.698** |
|
| 86 |
+
| PET-SHHA | 115.861 | 393.281 | 0.379 | 0.449 | 0.411 | 0.582 | 0.690 | 0.632 |
|
| 87 |
+
| APGCC-SHHA | 102.461 | 331.883 | 0.303 | 0.330 | 0.316 | 0.578 | 0.630 | 0.603 |
|
| 88 |
+
|
| 89 |
+
## Qualitative Analysis
|
| 90 |
+
|
| 91 |
+
Full-resolution qualitative comparisons in the report use horizontal stacked
|
| 92 |
+
panels ordered as `PET-Finetuned(Ours)`, `PET-SHHA`, and `APGCC-SHHA`, with
|
| 93 |
+
point colors green, yellow, and red. Inference for these comparisons uses
|
| 94 |
+
`threshold = 0.5` and `upper_bound = -1`. Qualitatively,
|
| 95 |
+
`PET-Finetuned(Ours)` shows fewer sparse-scene false positives, stronger
|
| 96 |
+
dense-scene recall under occlusion, and more stable localization under
|
| 97 |
+
perspective and scale variation.
|
| 98 |
+
|
| 99 |
+
[](images/pexels-558331748-30295833.jpg)
|
| 100 |
+
|
| 101 |
+
[](images/pexels-ilyasajpg-7038431.jpg)
|
| 102 |
+
|
| 103 |
+
[](images/pexels-peter-almario-388108-19472286.jpg)
|
| 104 |
+
|
| 105 |
+
[](images/pexels-rafeeque-kodungookaran-374579689-18755903.jpg)
|
| 106 |
+
|
| 107 |
+
[](images/pexels-wendywei-4945353.jpg)
|
| 108 |
+
|
| 109 |
+
## Model Inference
|
| 110 |
+
|
| 111 |
+
Use the official PET repository to run single-image inference with this
|
| 112 |
+
release model.
|
| 113 |
+
|
| 114 |
+
1. Clone PET and move into the repository root.
|
| 115 |
+
```bash
|
| 116 |
+
git clone https://github.com/cxliu0/PET.git
|
| 117 |
+
cd PET
|
| 118 |
+
```
|
| 119 |
+
2. Install dependencies.
|
| 120 |
+
```bash
|
| 121 |
+
pip install -r requirements.txt
|
| 122 |
+
pip install safetensors pillow
|
| 123 |
+
```
|
| 124 |
+
3. Copy `test.py` from this release folder into the PET repository root.
|
| 125 |
+
4. Place `PET_Finetuned.safetensors` in the PET repository root.
|
| 126 |
+
5. Run inference (dummy example).
|
| 127 |
+
```bash
|
| 128 |
+
python test.py \
|
| 129 |
+
--image_path path/to/image.jpg \
|
| 130 |
+
--resume PET_Finetuned.safetensors \
|
| 131 |
+
--device cpu \
|
| 132 |
+
--output_json outputs/prediction.json \
|
| 133 |
+
--output_image outputs/prediction.jpg
|
| 134 |
+
```
|
| 135 |
+
|
| 136 |
+
## Summary
|
| 137 |
+
|
| 138 |
+
We present a practical adaptation of PET for density-robust
|
| 139 |
+
crowd counting and head-point localization under partial and heterogeneous
|
| 140 |
+
annotations. The training framework combines a hierarchical iterative
|
| 141 |
+
fine-tuning loop with outlier-driven data refinement, mixed patch-resolution
|
| 142 |
+
optimization (128x128/256x256/512x512), count-aware sampling for dense-scene
|
| 143 |
+
emphasis, and adaptive background-query loss weighting to stabilize supervision
|
| 144 |
+
across scales.
|
| 145 |
+
|
| 146 |
+
Under the reported cross-dataset transfer protocol from SHHA initialization,
|
| 147 |
+
the model achieves the strongest overall transfer on SHHB, UCF-QNRF, and
|
| 148 |
+
JHU-Crowd++, while maintaining the best localization balance on SHHA at both
|
| 149 |
+
matching thresholds. Qualitative evidence is consistent with these trends,
|
| 150 |
+
showing fewer sparse-scene false positives and stronger dense-scene recall
|
| 151 |
+
under occlusion and perspective variation.
|
TechnicalReport.pdf
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:20927c9dc1d5c4b1a32d5d11c7b804b24273bd345e22e3635ac615ed9b4dc4a2
|
| 3 |
+
size 9209917
|
images/pexels-558331748-30295833.jpg
ADDED
|
Git LFS Details
|
images/pexels-ilyasajpg-7038431.jpg
ADDED
|
Git LFS Details
|
images/pexels-peter-almario-388108-19472286.jpg
ADDED
|
Git LFS Details
|
images/pexels-rafeeque-kodungookaran-374579689-18755903.jpg
ADDED
|
Git LFS Details
|
images/pexels-wendywei-4945353.jpg
ADDED
|
Git LFS Details
|
test.py
ADDED
|
@@ -0,0 +1,406 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import argparse
|
| 2 |
+
import json
|
| 3 |
+
import os
|
| 4 |
+
from pathlib import Path
|
| 5 |
+
|
| 6 |
+
import cv2
|
| 7 |
+
import numpy as np
|
| 8 |
+
from PIL import Image, ImageDraw, ImageFont
|
| 9 |
+
import torch
|
| 10 |
+
import torchvision.transforms as standard_transforms
|
| 11 |
+
|
| 12 |
+
import util.misc as utils
|
| 13 |
+
from models import build_model
|
| 14 |
+
|
| 15 |
+
# Preprocessing applied to every input image before the model sees it:
# PIL image -> float tensor in [0, 1], then channel-wise normalization
# with the ImageNet mean/std (the statistics the VGG backbone expects).
PET_TRANSFORM = standard_transforms.Compose([
    standard_transforms.ToTensor(),
    standard_transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
|
| 19 |
+
|
| 20 |
+
|
| 21 |
+
def get_args_parser() -> argparse.ArgumentParser:
    """Build the CLI argument parser for single-image PET inference.

    Returns a parser with ``add_help=False`` so it can be used as a parent
    parser in the ``__main__`` guard. The model-architecture and loss
    arguments mirror the names ``build_model`` reads from ``args``; their
    defaults define the released checkpoint's configuration, so changing
    them changes which model is constructed.
    """
    parser = argparse.ArgumentParser('PET single-image inference (HF release)', add_help=False)

    # Core inference inputs.
    parser.add_argument('--image_path', required=True, type=str,
                        help='Path to a single input image.')
    parser.add_argument('--resume', default='PET_Finetuned.safetensors', type=str,
                        help='Path to model weights (.safetensors or .pth).')
    parser.add_argument('--device', default='cuda', type=str,
                        help='Device for inference, e.g. cuda or cpu.')

    # Model architecture (consumed by build_model; must match the checkpoint).
    parser.add_argument('--backbone', default='vgg16_bn', type=str)
    parser.add_argument('--position_embedding', default='sine', type=str, choices=('sine', 'learned', 'fourier'))
    parser.add_argument('--dec_layers', default=2, type=int)
    parser.add_argument('--dim_feedforward', default=512, type=int)
    parser.add_argument('--hidden_dim', default=256, type=int)
    parser.add_argument('--dropout', default=0.0, type=float)
    parser.add_argument('--nheads', default=8, type=int)

    # Loss/matcher coefficients — unused at inference time but required by
    # build_model's criterion construction.
    parser.add_argument('--set_cost_class', default=1, type=float)
    parser.add_argument('--set_cost_point', default=0.05, type=float)
    parser.add_argument('--ce_loss_coef', default=1.0, type=float)
    parser.add_argument('--point_loss_coef', default=5.0, type=float)
    parser.add_argument('--eos_coef', default=0.5, type=float)

    # Dataset fields expected by the PET codebase (not read by this script's
    # own logic, only passed through to build_model).
    parser.add_argument('--dataset_file', default='SHA')
    parser.add_argument('--data_path', default='./data/ShanghaiTech/PartA', type=str)

    # Resizing and visualization options.
    parser.add_argument('--upper_bound', default=-1, type=int,
                        help='Max image side for inference; -1 means only cap at 2560 (same as compare_models).')
    parser.add_argument('--output_image', default='', type=str,
                        help='Optional path to save annotated image panel.')
    parser.add_argument('--title_text', default='PET-Finetuned', type=str,
                        help='Title prefix used in top panel text.')
    parser.add_argument('--radius', default=3, type=int)
    parser.add_argument('--point_color', default='0,255,0', type=str,
                        help='BGR color for points, e.g., 0,255,0')
    parser.add_argument('--panel_long_side', default=1600, type=int,
                        help='Resize annotated panel long side to this value.')
    parser.add_argument('--panel_pad', default=24, type=int,
                        help='Panel padding around the image and title area.')
    parser.add_argument('--panel_font_size', default=48, type=int,
                        help='Font size for panel title text.')

    # Output options.
    parser.add_argument('--output_json', default='', type=str,
                        help='Optional output JSON path for prediction details.')
    parser.add_argument('--seed', default=42, type=int)

    return parser
|
| 69 |
+
|
| 70 |
+
|
| 71 |
+
def parse_color(color_str: str):
    """Parse a ``'B,G,R'`` string into a 3-tuple of ints.

    Whitespace around each component is tolerated (``' 0 , 255 , 0 '``).

    Args:
        color_str: Comma-separated blue, green, red channel values.

    Returns:
        A ``(b, g, r)`` tuple of ints, each in [0, 255].

    Raises:
        ValueError: if the string does not contain exactly three
            comma-separated integers, or a channel is outside [0, 255].
    """
    parts = color_str.split(',')
    if len(parts) != 3:
        raise ValueError('color must be B,G,R like 0,255,0')
    channels = tuple(int(p.strip()) for p in parts)
    # Reject out-of-range channels early; OpenCV/PIL would otherwise
    # silently wrap or clamp them downstream.
    if any(c < 0 or c > 255 for c in channels):
        raise ValueError('color channels must be in [0, 255]')
    return channels
|
| 76 |
+
|
| 77 |
+
|
| 78 |
+
def resolve_device(device_str: str) -> torch.device:
    """Turn a device string into a ``torch.device``.

    Falls back to CPU (with a console notice) when a CUDA device is
    requested but CUDA is unavailable. When an explicit CUDA index is
    given (e.g. ``cuda:1``), it is made the current device.
    """
    wants_cuda = device_str.startswith('cuda')
    if wants_cuda and not torch.cuda.is_available():
        print('CUDA not available. Falling back to CPU.')
        return torch.device('cpu')

    resolved = torch.device(device_str)
    if resolved.type == 'cuda' and resolved.index is not None:
        torch.cuda.set_device(resolved.index)
    return resolved
|
| 86 |
+
|
| 87 |
+
|
| 88 |
+
def resize_for_eval(frame_rgb, upper_bound):
    """Downscale an RGB frame so its long side fits the inference bound.

    ``upper_bound == -1`` means "no explicit bound", in which case the long
    side is still capped at 2560 px. Returns the (possibly resized) frame
    and the scale factor that was applied (1.0 when the frame is returned
    untouched).
    """
    height, width = frame_rgb.shape[:2]
    long_side = max(height, width)

    if upper_bound != -1 and long_side > upper_bound:
        factor = float(upper_bound) / float(long_side)
    elif long_side > 2560:
        factor = 2560.0 / float(long_side)
    else:
        factor = 1.0

    if factor == 1.0:
        return frame_rgb, factor

    target_size = (max(1, int(round(width * factor))), max(1, int(round(height * factor))))
    shrunk = cv2.resize(frame_rgb, target_size, interpolation=cv2.INTER_LINEAR)
    return shrunk, factor
|
| 103 |
+
|
| 104 |
+
|
| 105 |
+
def load_font(font_size=40, bold=False, font_paths=None):
    """Load a TrueType font for panel text, trying common Linux paths.

    When ``font_paths`` is not given, a bold or regular candidate list of
    DejaVu/Liberation/FreeSans paths is used. Falls back to resolving the
    font by name, and finally to Pillow's built-in bitmap font when no
    TrueType font can be opened.
    """
    if font_paths is None:
        font_paths = (
            [
                '/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf',
                '/usr/share/fonts/truetype/liberation/LiberationSans-Bold.ttf',
                '/usr/share/fonts/truetype/freefont/FreeSansBold.ttf',
            ]
            if bold
            else [
                '/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf',
                '/usr/share/fonts/truetype/liberation/LiberationSans-Regular.ttf',
                '/usr/share/fonts/truetype/freefont/FreeSans.ttf',
            ]
        )

    for candidate in font_paths:
        if not os.path.exists(candidate):
            continue
        try:
            return ImageFont.truetype(candidate, font_size)
        except OSError:
            continue

    # Last resorts: let Pillow resolve the font by name, then the
    # built-in default bitmap font.
    try:
        by_name = 'DejaVuSans-Bold.ttf' if bold else 'DejaVuSans.ttf'
        return ImageFont.truetype(by_name, font_size)
    except OSError:
        return ImageFont.load_default()
|
| 130 |
+
|
| 131 |
+
|
| 132 |
+
def draw_text(draw, xy, text, font, fill, bold=False, stroke_width=0):
    """Draw text on a PIL draw surface, emulating bold via stroke.

    On Pillow versions whose ``ImageDraw.text`` does not accept
    ``stroke_width`` (raising TypeError), bold is approximated by
    overdrawing the text at four one-pixel offsets.
    """
    if bold and stroke_width <= 0:
        stroke_width = 2
    try:
        if bold:
            draw.text(
                xy,
                text,
                fill=fill,
                font=font,
                stroke_width=stroke_width,
                stroke_fill=fill,
            )
        else:
            draw.text(xy, text, fill=fill, font=font)
    except TypeError:
        if not bold:
            draw.text(xy, text, fill=fill, font=font)
            return
        base_x, base_y = xy
        for dx, dy in ((0, 0), (1, 0), (0, 1), (1, 1)):
            draw.text((base_x + dx, base_y + dy), text, fill=fill, font=font)
|
| 154 |
+
|
| 155 |
+
|
| 156 |
+
def _get_text_size(draw, text, font, bold=False, stroke_width=0):
    """Measure rendered text dimensions, portable across Pillow versions.

    Prefers ``textbbox`` (Pillow >= 8); falls back to the deprecated
    ``textsize``, widening the result by the stroke when bold is on.
    Returns a ``(width, height)`` pair in pixels.
    """
    if hasattr(draw, 'textbbox'):
        effective_stroke = stroke_width if bold else 0
        try:
            left, top, right, bottom = draw.textbbox(
                (0, 0),
                text,
                font=font,
                stroke_width=effective_stroke,
            )
        except TypeError:
            # Older textbbox signatures lack stroke_width.
            left, top, right, bottom = draw.textbbox((0, 0), text, font=font)
        return right - left, bottom - top

    width, height = draw.textsize(text, font=font)
    if bold:
        width += 2 * stroke_width
        height += 2 * stroke_width
    return width, height
|
| 173 |
+
|
| 174 |
+
|
| 175 |
+
def fit_text_to_width(draw, text, font, max_w, bold=False, stroke_width=0):
    """Truncate ``text`` with a trailing ``'...'`` so it fits in ``max_w`` px.

    Returns the text unchanged when it already fits, an empty string when
    even the ellipsis cannot fit, and otherwise the longest prefix that,
    with the ellipsis appended, measures at most ``max_w``.
    """
    text = text or ''
    if max_w <= 0:
        return ''

    full_w, _ = _get_text_size(draw, text, font, bold=bold, stroke_width=stroke_width)
    if full_w <= max_w:
        return text

    ellipsis = '...'
    ell_w, _ = _get_text_size(draw, ellipsis, font, bold=bold, stroke_width=stroke_width)
    if ell_w > max_w:
        return ''

    # Shrink the prefix one character at a time until the result fits.
    for cut in range(len(text) - 1, -1, -1):
        candidate = text[:cut] + ellipsis
        cand_w, _ = _get_text_size(draw, candidate, font, bold=bold, stroke_width=stroke_width)
        if cand_w <= max_w:
            return candidate
    return ellipsis
|
| 197 |
+
|
| 198 |
+
|
| 199 |
+
def bgr_to_rgb(color):
    """Reorder a BGR channel triple into RGB."""
    blue, green, red = color[0], color[1], color[2]
    return (red, green, blue)
|
| 201 |
+
|
| 202 |
+
|
| 203 |
+
def resize_with_points(img, pts, target_long_side):
    """Resize a PIL image so its long side equals ``target_long_side``,
    scaling the point array by the same factor.

    Returns ``(img, pts)`` unchanged when the target is None/non-positive
    or the image already has that long side.
    """
    if target_long_side is None or target_long_side <= 0:
        return img, pts

    w, h = img.size
    longest = max(w, h)
    if longest <= 0 or longest == target_long_side:
        return img, pts

    factor = float(target_long_side) / float(longest)
    out_size = (max(1, int(round(w * factor))), max(1, int(round(h * factor))))
    img = img.resize(out_size, Image.BILINEAR)

    if pts is not None and pts.size > 0:
        pts = pts * factor
    return img, pts
|
| 217 |
+
|
| 218 |
+
|
| 219 |
+
def add_padding_with_text(img, text, pad, font, text_color, bg_color, bold, stroke_width):
    """Place ``img`` on a padded canvas and draw a title in the top margin.

    The padding is enlarged when needed so the title plus a 24 px gap above
    and below it always fits. Returns ``img`` unchanged when ``pad`` is
    None or non-positive.
    """
    if pad is None or pad <= 0:
        return img

    measurer = ImageDraw.Draw(img)
    text = text or ''
    _, title_h = _get_text_size(measurer, text, font, bold=bold, stroke_width=stroke_width)

    gap = 24  # minimum vertical breathing room around the title
    pad = max(pad, title_h + 2 * gap)

    canvas = Image.new('RGB', (img.width + 2 * pad, img.height + 2 * pad), color=bg_color)
    canvas.paste(img, (pad, pad))

    painter = ImageDraw.Draw(canvas)
    usable_w = max(0, canvas.width - 2 * pad)
    text = fit_text_to_width(painter, text, font, usable_w, bold=bold, stroke_width=stroke_width)
    _, title_h = _get_text_size(painter, text, font, bold=bold, stroke_width=stroke_width)

    # Vertically center the title within the top margin, clamped so the
    # minimum gap is preserved on both sides.
    title_y = max(gap, (pad - title_h) // 2)
    title_y = min(title_y, max(0, pad - title_h - gap))
    draw_text(painter, (pad, title_y), text, font, text_color, bold=bold, stroke_width=stroke_width)
    return canvas
|
| 242 |
+
|
| 243 |
+
|
| 244 |
+
def annotate_panel(
    img_bgr,
    pts,
    title_text,
    point_color_bgr,
    radius,
    font,
    text_color,
    title_bg,
    target_long_side,
    pad,
):
    """Render predicted points onto the image and wrap it in a titled panel.

    The BGR frame is converted to RGB, resized (with the points) so its
    long side equals ``target_long_side``, dotted with filled circles at
    each ``(x, y)`` point, and finally padded with a title strip via
    ``add_padding_with_text``. Returns the resulting PIL image.
    """
    rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
    img = Image.fromarray(rgb)
    img, pts = resize_with_points(img, pts, target_long_side)
    draw = ImageDraw.Draw(img)

    # Scale the dot radius with the image so points stay visible; the
    # caller's radius is only honored when it is at least this large.
    max_dim = max(img.width, img.height)
    auto_radius = max(3, int(round(max_dim * 0.004)))
    if radius is None or radius < auto_radius:
        radius = auto_radius

    if pts is not None and pts.size > 0:
        # Points are (x, y) in BGR color convention; PIL wants RGB.
        color = bgr_to_rgb(point_color_bgr)
        for x, y in pts:
            x0 = x - radius
            y0 = y - radius
            x1 = x + radius
            y1 = y + radius
            draw.ellipse((x0, y0, x1, y1), fill=color, outline=color)

    return add_padding_with_text(
        img,
        title_text or '',
        pad,
        font,
        text_color,
        title_bg,
        bold=False,
        stroke_width=0,
    )
|
| 285 |
+
|
| 286 |
+
|
| 287 |
+
def _load_state_dict(weight_path: Path):
|
| 288 |
+
if not weight_path.exists():
|
| 289 |
+
raise FileNotFoundError(f'Weights file not found: {weight_path}')
|
| 290 |
+
|
| 291 |
+
if weight_path.suffix == '.safetensors':
|
| 292 |
+
try:
|
| 293 |
+
from safetensors.torch import load_file as load_safetensors
|
| 294 |
+
except ImportError as exc:
|
| 295 |
+
raise ImportError(
|
| 296 |
+
'safetensors is required to load .safetensors weights. Install with: pip install safetensors'
|
| 297 |
+
) from exc
|
| 298 |
+
return load_safetensors(str(weight_path), device='cpu')
|
| 299 |
+
|
| 300 |
+
checkpoint = torch.load(str(weight_path), map_location='cpu')
|
| 301 |
+
if isinstance(checkpoint, dict) and 'model' in checkpoint and isinstance(checkpoint['model'], dict):
|
| 302 |
+
return checkpoint['model']
|
| 303 |
+
if isinstance(checkpoint, dict) and checkpoint and all(torch.is_tensor(v) for v in checkpoint.values()):
|
| 304 |
+
return checkpoint
|
| 305 |
+
raise ValueError(
|
| 306 |
+
'Unsupported checkpoint format. Expected .safetensors or .pth containing a model state_dict.'
|
| 307 |
+
)
|
| 308 |
+
|
| 309 |
+
|
| 310 |
+
@torch.no_grad()
def infer_pet_points(model, frame_bgr, device, upper_bound):
    """Run PET on one BGR frame and return head points in original coords.

    The frame is converted to RGB, optionally downscaled via
    ``resize_for_eval``, normalized, and batched into a NestedTensor.
    The model's ``pred_points`` are treated as normalized (row, col)
    pairs: column 0 is scaled by the padded tensor height and clipped to
    the resized frame height, column 1 likewise for width. Points are
    then mapped back to the original resolution by dividing out the
    resize scale and re-clipped to the original frame bounds.

    Returns:
        A ``(N, 2)`` float array of ``(x, y)`` points in original-image
        pixel coordinates (note the swap from the model's (row, col)
        order), and the resize scale that was applied.
    """
    frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    resized_rgb, scale = resize_for_eval(frame_rgb, upper_bound)
    resized_h, resized_w = resized_rgb.shape[:2]

    img = Image.fromarray(resized_rgb)
    img = PET_TRANSFORM(img)
    samples = utils.nested_tensor_from_tensor_list([img]).to(device)
    # img_h/img_w are the padded batch-tensor dims, which may exceed the
    # resized frame dims; that is why a clip to resized_h/resized_w follows.
    img_h, img_w = samples.tensors.shape[-2:]

    # NOTE(review): assumes model(samples, test=True) already applies the
    # score threshold and returns only kept points — confirm against the
    # PET repository's test-mode forward.
    outputs = model(samples, test=True)
    outputs_points = outputs['pred_points']
    if outputs_points.dim() == 3:
        outputs_points = outputs_points[0]
    pred_points = outputs_points.detach().cpu().numpy()

    if pred_points.size == 0:
        return np.zeros((0, 2), dtype=np.float32), scale

    # De-normalize: column 0 -> pixels along height (y), column 1 -> width (x).
    pred_points[:, 0] *= float(img_h)
    pred_points[:, 1] *= float(img_w)

    # Clip to the actual resized frame (padding may extend past it).
    pred_points[:, 0] = np.clip(pred_points[:, 0], 0.0, float(resized_h - 1))
    pred_points[:, 1] = np.clip(pred_points[:, 1], 0.0, float(resized_w - 1))

    # Undo the evaluation resize to get original-resolution coordinates.
    if scale != 1.0:
        pred_points = pred_points / float(scale)

    orig_h, orig_w = frame_bgr.shape[:2]
    pred_points[:, 0] = np.clip(pred_points[:, 0], 0.0, float(orig_h - 1))
    pred_points[:, 1] = np.clip(pred_points[:, 1], 0.0, float(orig_w - 1))

    # Swap (y, x) -> (x, y) for downstream drawing/JSON output.
    points_xy = np.stack([pred_points[:, 1], pred_points[:, 0]], axis=1)
    return points_xy, scale
|
| 345 |
+
|
| 346 |
+
|
| 347 |
+
def main(args) -> None:
    """Run single-image PET inference end to end.

    Builds the model from ``args``, loads the released weights, predicts
    head points for ``args.image_path``, prints the count, and optionally
    writes a JSON result (``--output_json``) and an annotated panel image
    (``--output_image``).

    Raises:
        ValueError: when the input image cannot be read.
    """
    device = resolve_device(args.device)

    model, _ = build_model(args)
    model.to(device)
    model.eval()

    # strict=True so any architecture/weights mismatch fails loudly.
    state_dict = _load_state_dict(Path(args.resume))
    model.load_state_dict(state_dict, strict=True)

    image_path = Path(args.image_path)
    frame_bgr = cv2.imread(str(image_path))
    if frame_bgr is None:
        raise ValueError(f'Failed to read image: {image_path}')

    points_xy, scale = infer_pet_points(model, frame_bgr, device, args.upper_bound)
    count = int(points_xy.shape[0]) if points_xy.size > 0 else 0

    result = {
        'image': str(image_path),
        'count': count,
        'points_xy': points_xy.tolist(),
        'scale': scale,
    }

    print(f'image: {result["image"]}')
    print(f'predicted_count: {result["count"]}')

    if args.output_json:
        output_json = Path(args.output_json)
        output_json.parent.mkdir(parents=True, exist_ok=True)
        output_json.write_text(json.dumps(result, indent=2))
        print(f'json_saved_to: {output_json}')

    if args.output_image:
        output_image = Path(args.output_image)
        output_image.parent.mkdir(parents=True, exist_ok=True)

        panel = annotate_panel(
            frame_bgr,
            points_xy,
            f'{args.title_text} Count : {count}',
            parse_color(args.point_color),
            args.radius,
            load_font(font_size=args.panel_font_size, bold=False),
            text_color=(0, 0, 0),
            title_bg=(255, 255, 255),
            target_long_side=args.panel_long_side,
            pad=args.panel_pad,
        )
        panel.save(str(output_image))
        print(f'annotated_image_saved_to: {output_image}')
|
| 399 |
+
|
| 400 |
+
|
| 401 |
+
if __name__ == '__main__':
    # Compose the CLI from the shared argument parser and run inference.
    cli_parser = argparse.ArgumentParser(
        'PET single-image inference',
        parents=[get_args_parser()],
    )
    main(cli_parser.parse_args())
|