Auto-Mode Multi-GPU Parallel Processing Implementation Plan

Date written: 2026-04-25. Target: on HuggingFace Spaces dedicated multi-GPU hardware (e.g. 4× A100-80GB), restructure Auto-Mode so that the N videos uploaded to its queue are segmented concurrently, one video per GPU.


0. Background / current architecture at a glance

| Item | Current (sequential) |
| --- | --- |
| Auto-Mode entry point | app.py:5033 _auto_mode_process(file_list, text_prompt) |
| Single-video segmentation | app.py:4896 @spaces.GPU(duration=119) def segment_video(...) |
| Worker process | mp.get_context("spawn").Process(target=_segment_video_worker_entry, ...) (app.py:4854) |
| Core logic | app.py:4239 _segment_video_core(...) (chunk-wise SAM3 inference) |
| Model instantiation | app.py:4453 predictor_cls = _get_sam3_predictor_cls(); predictor = predictor_cls(...) |
| Result directory | build/downloads/ (written via _persist_for_download) |
| Progress communication | progress / status / result / error messages streamed over an mp.Queue |
| Processing across videos | Serial loop (for path in paths:); GPU cleanup (_cleanup_cuda_cache()) runs before the next video starts |

HuggingFace Spaces hardware / spaces.GPU behavior

  • The @spaces.GPU decorator attaches the ZeroGPU slice-allocation logic only when Config.zero_gpu (= SPACES_ZERO_GPU=true) is set (spaces/zero/decorator.py:83). On a dedicated GPU Space (4×A100, etc.) the decorator is a no-op, and an ordinary Python process sees all 4 CUDA devices directly (torch.cuda.device_count() == 4).
  • ZeroGPU (MIG slice) mode allocates only one GPU slice at a time, so this plan assumes dedicated multi-GPU hardware. In a ZeroGPU environment the code automatically falls back to the existing serial path.

Isolation requirements

| Resource | Possible conflict | Resolution |
| --- | --- | --- |
| GPU memory / context | 4 videos load the model on the same device at once → OOM, context interference | Pin each worker to a single visible GPU with the CUDA_VISIBLE_DEVICES=N environment variable |
| sam3.* module in-process caches (_SAM3_PREDICTOR_CLS, _LAST_SEG_CACHE, cached_frame_outputs, etc.) | State gets entangled under 4-way concurrent calls inside one interpreter | Separate spawn-based processes → module state itself is isolated |
| tempfile.mkdtemp() (chunk input dir, trimmed mp4) | mkdtemp generates collision-free names automatically → safe | No extra action needed |
| build/downloads/ artifact filenames | Videos started at the same moment share a timestamp → auto_mode_results_YYYYMMDD_HHMMSS.zip / *_overlay.mp4 collide | Add a short uuid (uuid.uuid4().hex[:8]) plus the video index to the filename |
| Model checkpoint download / BPE vocab | 4 workers download/write the same file concurrently → race | The parent (main process) ensures the files exist once up front; workers only read them |
| Caches such as .zerogpu/tensors | The ZeroGPU cache is not used in dedicated mode | No impact |
| The sam3/ directory itself | Python imports are independent per process → no directory copy needed | No copy is created |

Conclusion

  • Duplicating the sam3 folder is unnecessary. Isolation at the process level alone is sufficient.
  • Each video runs in one spawned child process, which narrows CUDA_VISIBLE_DEVICES to a single GPU immediately on entry (before torch is imported) → from the child's point of view only cuda:0 ever exists → any hard-coded cuda / cuda:0 in the model or SAM3 code is safe.
  • The parent process does not use the GPU. It only acts as a dispatcher that runs a 4-slot pool. All heavy imports happen inside the workers.
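
The claim that a pinned child "only ever sees cuda:0" follows from how CUDA renumbers the devices listed in CUDA_VISIBLE_DEVICES. A minimal sketch of that renumbering, assuming the variable holds comma-separated integer indices (physical_gpu_for is a hypothetical helper, not part of app.py):

```python
def physical_gpu_for(cuda_visible_devices, logical_index):
    """Map a logical CUDA index inside a process to the physical GPU index.

    Assumption: the variable is a comma-separated list of integer indices.
    GPU UUIDs, which CUDA also accepts, are not handled in this sketch.
    """
    visible = [int(tok) for tok in cuda_visible_devices.split(",") if tok.strip()]
    return visible[logical_index]


# A worker pinned to physical GPU 2: its only device, cuda:0, is GPU 2.
pinned = physical_gpu_for("2", 0)

# An unpinned process on a 4-GPU host: logical and physical indices coincide.
unpinned = physical_gpu_for("0,1,2,3", 3)
```

Because the renumbering always starts at 0, code that hard-codes cuda:0 lands on whichever physical GPU the dispatcher chose for that worker.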

1. Design overview

1.1 Worker pool structure

┌─────────────────────────────────────────────────────────────┐
│ Gradio main process (no torch CUDA usage)                  │
│   ├─ _auto_mode_process() generator                         │
│   ├─ ParallelSegmentDispatcher                              │
│   │    ├─ pool of N workers (N = min(num_gpus, num_videos)) │
│   │    ├─ submit queue  (video_path → free worker)          │
│   │    ├─ event queue   (progress / status / result / err)  │
│   │    └─ per-video state: gpu_idx, started_at, last_pct…   │
│   └─ yields UI updates (status / per-video progress / files)│
└──────┬──────────────────────────────────────────────────────┘
       │ spawn child × N
       ▼
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│ Worker GPU 0    │  │ Worker GPU 1    │  │ Worker GPU N-1  │
│ CUDA_VISIBLE=0  │  │ CUDA_VISIBLE=1  │  │ CUDA_VISIBLE=N-1│
│ runs            │  │ runs            │  │ runs            │
│ _segment_video  │  │ _segment_video  │  │ _segment_video  │
│   _core(...)    │  │   _core(...)    │  │   _core(...)    │
└─────────────────┘  └─────────────────┘  └─────────────────┘

  • Ideally each worker is an always-alive "persistent worker", so the cost of loading model weights is paid only once, on the first video (optional optimization: see §7). For simplicity, the first implementation starts with a "fresh-per-video" structure that spawns a new worker for every video → switch to the reusable form once it is stable.
  • If there are more videos than the 4 GPUs, the queue + pool structure serializes the remainder automatically (when a GPU finishes one video it takes the next).
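
To make the queue + pool behavior concrete, here is a small scheduling simulation. It is only a sketch under stated assumptions: per-video durations are known in advance, and simulate_pool is a hypothetical name that does not exist in app.py.

```python
import heapq


def simulate_pool(durations, num_gpus):
    """Assign each video to whichever GPU becomes free first.

    Returns (schedule, makespan), where schedule holds
    (video_index, gpu_index, start, end) tuples.
    """
    if num_gpus < 1:
        raise ValueError("need at least one GPU slot")
    slots = min(num_gpus, len(durations))
    free_at = [(0.0, gpu) for gpu in range(slots)]  # (time the GPU frees up, gpu)
    heapq.heapify(free_at)
    schedule = []
    for video_index, duration in enumerate(durations):
        start, gpu = heapq.heappop(free_at)         # earliest-free GPU
        end = start + duration
        schedule.append((video_index, gpu, start, end))
        heapq.heappush(free_at, (end, gpu))
    makespan = max((end for *_, end in schedule), default=0.0)
    return schedule, makespan


# Six equally long videos on 4 GPUs: two "waves" instead of six serial runs.
schedule, makespan = simulate_pool([10.0] * 6, num_gpus=4)
gpu_order = [gpu for _, gpu, _, _ in schedule]
```

With six 10-unit videos the batch finishes after 20 units rather than 60, and the fifth and sixth videos reuse the first GPUs that free up.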

1.2 Separate worker entry module: parallel_segment_worker.py

Why a separate file is needed:

  • The current worker target (_segment_video_worker_entry) is a function defined inside app.py.
  • To unpickle this target, the spawned child process has to import app.py, and import torch at app.py:4 runs immediately.
  • At that point the child has not yet narrowed os.environ["CUDA_VISIBLE_DEVICES"]. PyTorch initializes CUDA lazily, so the import alone does not bind devices, but any import-time CUDA call in app.py would initialize the CUDA runtime with all 4 devices visible → even if we only intend to use cuda:0, contexts for the other devices may come along.
  • Fix: move the worker entry into a new file that does not import torch at top level. The child imports only that file, sets os.environ["CUDA_VISIBLE_DEVICES"] on the first line of the function body, and only then imports torch / app.

# parallel_segment_worker.py  (intentionally minimal top-level imports)
import os
import sys
import traceback

def worker_main(gpu_index, args, progress_queue):
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    os.environ["SAM3_WORKER_MODE"] = "1"   # skip Gradio launch in app.py
    os.environ.setdefault("SAM3_CACHE_FRAME_OUTPUTS", "0")
    os.environ.setdefault("SAM3_OFFLOAD_TRACKER_STATE_TO_CPU", "1")

    sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

    def _progress_cb(val, desc):
        progress_queue.put({"type": "progress", "value": val, "desc": desc,
                            "gpu_index": gpu_index})

    def _status_cb(msg):
        progress_queue.put({"type": "status", "message": msg,
                            "gpu_index": gpu_index})

    try:
        # The environment is narrowed, so it is NOW safe to import torch / app.
        # The imports sit inside the try so that an import failure reaches the
        # dispatcher as an "error" event instead of a silent child exit.
        import torch
        if torch.cuda.is_available():
            torch.cuda.set_device(0)        # only one device visible: cuda:0

        from app import _segment_video_core  # imports torch but env is already set

        (video_path, text_prompt, duration_limit, id_corrections_text,
         id_drop_text, id_override_start_sec, show_trails, view_mode) = args

        progress_queue.put({"type": "progress", "value": 0.0,
                            "desc": f"GPU {gpu_index}: starting...",
                            "gpu_index": gpu_index})
        out_path, status, loc_path = _segment_video_core(
            video_path, text_prompt, duration_limit,
            id_corrections_text=id_corrections_text,
            id_drop_text=id_drop_text,
            id_override_start_sec=id_override_start_sec,
            show_trails=show_trails,
            view_mode=view_mode,
            progress_callback=_progress_cb,
            status_callback=_status_cb,
        )
        progress_queue.put({"type": "result",
                            "data": (out_path, status, loc_path),
                            "gpu_index": gpu_index})
    except Exception as exc:  # noqa: BLE001
        progress_queue.put({"type": "error",
                            "message": str(exc),
                            "traceback": traceback.format_exc(),
                            "gpu_index": gpu_index})
    finally:
        try:
            import torch
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
                torch.cuda.ipc_collect()
        except Exception:
            pass
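
worker_main sets the variable inside the child. Under spawn, the child may also re-import the parent's main module before the target function runs, so a complementary option is to narrow the environment in the parent for the instant the child is started: a spawned process inherits the parent's environment as it is at start time. A minimal sketch, assuming submissions are not made from several threads at once (pinned_gpu_env is a hypothetical helper):

```python
import os
from contextlib import contextmanager


@contextmanager
def pinned_gpu_env(gpu_index):
    """Temporarily set CUDA_VISIBLE_DEVICES in the current process."""
    key = "CUDA_VISIBLE_DEVICES"
    previous = os.environ.get(key)
    os.environ[key] = str(gpu_index)
    try:
        yield
    finally:
        # Restore the parent's original view of the GPUs.
        if previous is None:
            os.environ.pop(key, None)
        else:
            os.environ[key] = previous


# Hypothetical use inside the dispatcher's submit():
#     with pinned_gpu_env(gpu_index):
#         p.start()   # the spawned child inherits the narrowed environment

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
with pinned_gpu_env(2):
    inside = os.environ["CUDA_VISIBLE_DEVICES"]
after = os.environ["CUDA_VISIBLE_DEVICES"]
```

Keeping the assignment in worker_main as well is harmless: both paths write the same value, and the in-function assignment still covers the case where the helper is not used.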

1.3 Dispatcher class: added inside app.py

class ParallelSegmentDispatcher:
    """Distribute one video per GPU concurrently and stream events back."""

    def __init__(self, num_gpus: int):
        self.num_gpus = num_gpus
        self.ctx = mp.get_context("spawn")
        self.event_queue = self.ctx.Queue()
        self.workers: dict[int, mp.Process] = {}    # gpu_index -> Process
        self.gpu_assignments: dict[int, dict] = {}  # gpu_index -> task meta

    def submit(self, gpu_index, video_meta, args):
        from parallel_segment_worker import worker_main
        p = self.ctx.Process(
            target=worker_main,
            args=(gpu_index, args, self.event_queue),
            daemon=False,
        )
        p.start()
        self.workers[gpu_index] = p
        self.gpu_assignments[gpu_index] = video_meta

    def free_gpu(self, gpu_index):
        proc = self.workers.pop(gpu_index, None)
        meta = self.gpu_assignments.pop(gpu_index, None)
        if proc is not None:
            proc.join(timeout=5)
            if proc.is_alive():
                proc.terminate()
                proc.join(timeout=5)
        return meta

    def shutdown(self):
        for gi in list(self.workers.keys()):
            self.free_gpu(gi)

1.4 Parallel variant of _auto_mode_process: _auto_mode_process_parallel

Rough algorithm:

G = torch.cuda.device_count()   # number of available GPUs
N = len(paths)                  # number of videos
slot_count = min(G, N)

dispatcher = ParallelSegmentDispatcher(slot_count)

# 1) Assign the first slot_count of the N videos, one per GPU
free_gpus = list(range(slot_count))
queue_index = 0
in_flight = 0
while queue_index < N and free_gpus:
    gi = free_gpus.pop(0)
    dispatcher.submit(gi, meta_for(queue_index), args_for(queue_index))
    queue_index += 1
    in_flight += 1
    yield UI status

# 2) Event loop
while in_flight > 0:
    msg = dispatcher.event_queue.get(timeout=...)
    gi = msg["gpu_index"]
    if msg["type"] == "progress":
        update per-GPU progress bar text; aggregate overall progress
        yield UI status
    elif msg["type"] == "status":
        append status for that GPU
        yield UI status
    elif msg["type"] == "result":
        out_path, status, loc_path = msg["data"]
        finalize: rename/persist with disambiguating suffix
        append (mp4, csv) to all_results
        yield UI status (with newly visible result)
        dispatcher.free_gpu(gi)
        in_flight -= 1
        if queue_index < N:
            dispatcher.submit(gi, meta_for(queue_index), args_for(queue_index))
            queue_index += 1
            in_flight += 1
            yield UI status
    elif msg["type"] == "error":
        record failure for that video
        yield UI status
        dispatcher.free_gpu(gi)
        in_flight -= 1
        # same re-fill logic as result

# 3) Shutdown and cleanup
dispatcher.shutdown()
yield final summary
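
The refill logic in the pseudocode above can be exercised without any GPU. The following is a runnable sketch only: threads and queue.Queue stand in for spawned processes and mp.Queue, and fake_worker / run_batch are hypothetical names used for illustration.

```python
import queue
import threading


def fake_worker(gpu_index, video, events):
    """Stand-in for worker_main: one progress event, then result or error."""
    events.put({"type": "progress", "gpu_index": gpu_index, "value": 0.5})
    if video.startswith("bad"):
        events.put({"type": "error", "gpu_index": gpu_index,
                    "message": f"failed: {video}"})
    else:
        events.put({"type": "result", "gpu_index": gpu_index,
                    "data": f"{video}.out"})


def run_batch(videos, num_gpus):
    events = queue.Queue()
    pending = list(videos)   # videos not yet submitted
    assigned = {}            # gpu_index -> video currently running there
    results, failures = [], []

    def submit(gpu_index):
        video = pending.pop(0)
        assigned[gpu_index] = video
        threading.Thread(target=fake_worker,
                         args=(gpu_index, video, events), daemon=True).start()

    # 1) Fill every slot once.
    for gpu_index in range(min(num_gpus, len(pending))):
        submit(gpu_index)

    # 2) Event loop: a terminal event frees the slot and triggers a refill.
    while assigned:
        msg = events.get(timeout=5)
        if msg["type"] not in ("result", "error"):
            continue         # progress/status would update the UI here
        gpu_index = msg["gpu_index"]
        video = assigned.pop(gpu_index)
        if msg["type"] == "result":
            results.append((video, msg["data"]))
        else:
            failures.append((video, msg["message"]))
        if pending:
            submit(gpu_index)
    return results, failures


results, failures = run_batch(["a", "b", "bad_c", "d", "e"], num_gpus=2)
```

Sharing one refill branch between "result" and "error" keeps a failed video from stalling the queue, which is the behavior the pseudocode's "same re-fill logic as result" comment asks for.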

2. UI changes

2.1 Additional components: inside the Auto-Mode (Batch Queue) accordion

| Component | Purpose |
| --- | --- |
| auto_mode_parallel_status (Markdown) | Number of GPUs / number of active workers / number of videos left in the queue / progress broken down per GPU (e.g. GPU0: video_a.mp4 73%, GPU1: video_b.mp4 41% …) |
| Per-video result accumulation reuses the existing auto_results_files_state / auto_results_list | No change |

2.2 Single-video preview / overlay components

In parallel mode there is no single "currently processing" video, so:

  • Single-slot widgets such as video_input / video_output are updated with the result of the most recently completed video (friendlier UX).
  • The primary progress display is the multiline auto_mode_parallel_status.

2.3 fallback

  • torch.cuda.device_count() <= 1 → keep _auto_mode_process (the current serial behavior) as is.
  • >1 → branch to _auto_mode_process_parallel.
  • Toggle: controlled by the environment variable SAM3_PARALLEL_AUTO_MODE (default auto, 0 to disable, 1 to force).
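
The fallback rules above reduce to a small pure function, which makes them easy to unit-test without a GPU. This is a sketch: should_use_parallel is a hypothetical name, and refusing parallel mode when no GPU is visible is an added assumption, not something the plan states.

```python
def should_use_parallel(parallel_env, num_gpus, zero_gpu):
    """Decide between the serial and the parallel Auto-Mode path.

    parallel_env: value of SAM3_PARALLEL_AUTO_MODE ("auto", "0" or "1").
    zero_gpu:     True when running on a ZeroGPU Space.
    """
    mode = (parallel_env or "auto").strip().lower()
    if zero_gpu or num_gpus < 1:   # assumption: never dispatch with no GPU
        return False
    if mode == "1":                # forced on, even with a single GPU
        return True
    if mode == "auto":
        return num_gpus > 1
    return False                   # "0" or any unrecognized value


decisions = {
    "auto, 4 GPUs": should_use_parallel("auto", 4, zero_gpu=False),
    "auto, 1 GPU": should_use_parallel("auto", 1, zero_gpu=False),
    "forced, 1 GPU": should_use_parallel("1", 1, zero_gpu=False),
    "disabled, 4 GPUs": should_use_parallel("0", 4, zero_gpu=False),
    "auto, ZeroGPU": should_use_parallel("auto", 4, zero_gpu=True),
}
```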

3. Filename / output collision prevention

When 4 videos drop their results into build/downloads/ at almost the same moment:

| Function | Change |
| --- | --- |
| _rename_with_rule | Extended to insert a short per-video instance ID into the result filename. Example: {stem}_{video_id8}_seg_{dur}_{elapsed}s.mp4 |
| _persist_for_download | Appends a _{n} suffix when the basename already exists (partly handled today, but made race-safe by re-checking after os.rename) |
| _build_zip_from_paths | Already has basename de-duplication logic (seen_names) → used as is |

video_id8 is generated once, as uuid.uuid4().hex[:8], when the dispatcher enqueues the video, and is stored in meta.
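
As an illustration of the race-safe part, one common technique is to claim the target name atomically with an exclusive create, so two processes can never both believe the same name is free. This is a sketch of that general technique, not the existing _persist_for_download logic; reserve_unique_path and new_video_id8 are hypothetical helpers.

```python
import os
import tempfile
import uuid


def new_video_id8():
    """Short per-video instance ID, generated once per enqueued video."""
    return uuid.uuid4().hex[:8]


def reserve_unique_path(directory, basename):
    """Atomically claim basename, or basename with a _{n} suffix, in directory.

    O_CREAT | O_EXCL makes creation fail if the file already exists, so the
    check and the claim happen in one step.
    """
    stem, ext = os.path.splitext(basename)
    n = 0
    while True:
        name = basename if n == 0 else f"{stem}_{n}{ext}"
        path = os.path.join(directory, name)
        try:
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            n += 1
            continue
        os.close(fd)   # leaves an empty placeholder the caller then overwrites
        return path


with tempfile.TemporaryDirectory() as downloads:
    first = os.path.basename(reserve_unique_path(downloads, "clip_seg.mp4"))
    second = os.path.basename(reserve_unique_path(downloads, "clip_seg.mp4"))
```

With the video_id8 tag in the stem, the suffix loop should rarely iterate; it remains as a second line of defense.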


4. Safeguards / edge cases

  1. GPU memory pre-check: right after worker entry (before the model is first loaded), verify that _check_gpu_memory_safe() returns true. If it returns false, report an error message to the dispatcher and exit.
  2. Abnormal worker exit: the dispatcher polls with event_queue.get(timeout=heartbeat); if no message arrives within the heartbeat interval and the worker reports is_alive() == False, treat it as an error and call free_gpu.
  3. Parent-process daemon check: in the same spirit as the existing branch where segment_video falls back to in-process execution when mp.current_process().daemon is set (app.py:4918), the dispatcher is disabled under a daemonic parent → sequential fallback.
  4. Cancel (stop button): not part of the first implementation (the current serial mode has no stop either). Follow-up work.
  5. Log prefix: prepend [GPU{n}] to the progress/status messages sent by workers so they can be told apart in the UI and on stdout.
  6. Deterministic device distribution: the dispatcher round-robins so that videos do not all land on the same GPU (in practice, "the next video goes to the GPU that finishes first").
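
Item 2 can be sketched as a polling helper. Note that Queue.get(timeout=...) raises queue.Empty when nothing arrives, so the event loop has to catch it. The helper name and the synthesized error payload are assumptions; the stub class only mimics Process.is_alive().

```python
import queue


def poll_event(event_queue, workers, heartbeat):
    """Return the next event, a synthesized error for a dead worker, or None."""
    try:
        return event_queue.get(timeout=heartbeat)
    except queue.Empty:
        pass
    for gpu_index, proc in workers.items():
        if proc.is_alive():
            continue
        try:
            # The worker may have queued its last message just before exiting.
            return event_queue.get_nowait()
        except queue.Empty:
            return {"type": "error", "gpu_index": gpu_index,
                    "message": "worker exited without reporting a result",
                    "traceback": ""}
    return None  # every worker is still alive; the caller polls again


class FakeProcess:
    """Stub exposing only the is_alive() method the helper relies on."""

    def __init__(self, alive):
        self._alive = alive

    def is_alive(self):
        return self._alive


events = queue.Queue()
quiet = poll_event(events, {0: FakeProcess(True)}, heartbeat=0.01)
crashed = poll_event(events, {0: FakeProcess(True), 1: FakeProcess(False)},
                     heartbeat=0.01)
```

Returning a regular "error" event lets the main loop reuse its existing error branch, including free_gpu and the refill.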

5. Testing / verification

5.1 Local (single GPU)

  • Verify that _parallel_dispatcher automatically falls back to the serial path when device_count == 1.
  • Environment variables SAM3_PARALLEL_AUTO_MODE=1 + CUDA_VISIBLE_DEVICES=0 → the dispatcher runs in 1-slot mode (one worker); the results must match the existing _auto_mode_process.

5.2 Local (fake multi-GPU simulation)

  • With SAM3_PARALLEL_AUTO_MODE=1 + SAM3_FAKE_GPU_COUNT=4 the dispatcher code builds a 4-slot pool, but all slots actually share the same device 0 (test only; verifies the dispatcher logic alone).
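
In fake mode the slot index and the physical device index differ: worker_main sets CUDA_VISIBLE_DEVICES to the index it is given, so slots 1 to 3 must be handed device 0 explicitly or they would see no GPU at all. A sketch of such a mapping, where plan_slot_devices is a hypothetical helper and SAM3_FAKE_GPU_COUNT is assumed to be parsed by the caller:

```python
def plan_slot_devices(real_gpu_count, fake_gpu_count=None):
    """Return the physical device index to use for each pool slot.

    fake_gpu_count: parsed value of SAM3_FAKE_GPU_COUNT, or None when unset.
    """
    if fake_gpu_count:
        return [0] * fake_gpu_count        # every slot shares physical device 0
    return list(range(real_gpu_count))     # one slot per real device


real_plan = plan_slot_devices(real_gpu_count=4)
fake_plan = plan_slot_devices(real_gpu_count=1, fake_gpu_count=4)
```

The dispatcher would then pass fake_plan[slot] rather than the slot number as gpu_index, while still using the slot number as the key for its bookkeeping.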

5.3 HF Space (4×A100)

  • Upload 4 videos → verify that the total batch wall-clock time is roughly 3.5~4× shorter than on a single GPU (each video's own processing time is not expected to shrink, since it still runs on one GPU).
  • Use nvidia-smi to confirm that utilization rises on all 4 GPUs (the debug log prints GPU memory util:).

5.4 Regression

  • The single-video "Run Segmentation" button is unchanged → low regression risk.
  • Segmentation accuracy of the result mp4 / csv is expected to be bit-identical between single and parallel mode (given the same seed); sanity-check it with a frame-by-frame mask IoU between the single-mode and parallel-mode result mp4 files.
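
For the IoU sanity check, the per-frame metric itself is simple. A dependency-free sketch, assuming the masks have already been decoded into 2-D arrays of 0/1 values (decoding frames from the mp4 is outside this sketch, and mask_iou is a hypothetical name):

```python
def mask_iou(mask_a, mask_b):
    """Intersection over union of two equally sized binary masks."""
    intersection = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += bool(a) and bool(b)
            union += bool(a) or bool(b)
    # Two empty masks agree perfectly, so define their IoU as 1.0.
    return 1.0 if union == 0 else intersection / union


serial_mask = [[1, 1], [0, 0]]
parallel_mask = [[1, 0], [0, 0]]
identical = mask_iou(serial_mask, serial_mask)
partial = mask_iou(serial_mask, parallel_mask)
```

A bit-identical run would give 1.0 on every frame; any lower value points at the first frame where the two modes diverge.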

6. Step-by-step implementation checklist (execution order)

Modify the code in the order written in this document.

Step 1: create the new file parallel_segment_worker.py

  • Top-level imports: os, sys, traceback only.
  • Write worker_main(gpu_index, args, progress_queue) as in the §1.2 code.

Step 2: add the dispatcher class to app.py

  • Define class ParallelSegmentDispatcher: (§1.3).
  • Check whether import uuid already exists (app.py:30) → ✅ present.

Step 3: add the _auto_mode_process_parallel(...) generator to app.py

  • The algorithm is in §1.4. The output tuple keeps the same 19-tuple shape as _pkg(...) in the existing _auto_mode_process (UI wiring is not changed).
  • Build the auto_mode_status message as multiline text to expose per-GPU progress.
  • Result filename disambiguation: store vid8 = uuid.uuid4().hex[:8] in the video meta, and insert _{vid8} into the stem after _rename_with_rule is called and before _persist_for_download.

Step 4: add a router at the entry of _auto_mode_process in app.py

  • At the top of the function:
    num_gpus = torch.cuda.device_count() if torch.cuda.is_available() else 0
    parallel_env = os.getenv("SAM3_PARALLEL_AUTO_MODE", "auto").lower()
    use_parallel = (
        (parallel_env == "1") or
        (parallel_env == "auto" and num_gpus > 1)
    ) and not bool(os.getenv("SPACES_ZERO_GPU"))
    if use_parallel:
        yield from _auto_mode_process_parallel(file_list, text_prompt, num_gpus, progress)
        return
    
  • Disabled in ZeroGPU mode (each call is only granted a GPU slice → concurrency is pointless).

Step 5: clean up code duplicated with _segment_video_worker_entry

  • The existing single-video path (segment_video → _segment_video_worker_entry) can gradually be unified onto parallel_segment_worker.worker_main so it is maintained in one place (optional). The first implementation leaves it untouched (to minimize regression risk).

Step 6: output filename disambiguation patch

  • Add an optional extra_tag: str = "" parameter to the _rename_with_rule signature (the empty-string default keeps it backward compatible).
  • Pass extra_tag=vid8 only on the parallel path.

Step 7: UI text enhancements

  • Multi-line output in the auto_mode_status Markdown (one line per GPU). If it gets too long, put it in a collapsible code block.
  • gr.Progress is a single bar, so send only the "overall average progress" of parallel mode there and show per-GPU detail as text.
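
One plausible definition of the "overall average progress" counts every finished video as 1.0 and every in-flight video by its current fraction, divided by the batch size. The formula and the function name are assumptions; the plan does not spell them out.

```python
def overall_progress(total_videos, finished, in_flight_fractions):
    """Fraction of the whole batch that is done, in [0, 1].

    finished:            number of videos that ended (success or failure).
    in_flight_fractions: per-GPU progress values of videos still running.
    """
    if total_videos <= 0:
        return 0.0
    done = finished + sum(in_flight_fractions)
    return min(1.0, done / total_videos)


# 4 videos: one finished, two halfway through, one still queued.
halfway = overall_progress(4, finished=1, in_flight_fractions=[0.5, 0.5])
```

Queued videos contribute 0, so the single bar advances smoothly across refills instead of jumping back when a new video starts.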

Step 8: smoke test

  • Launch locally with python app.py, then:
    • Upload 2 videos → runs in serial mode in a single-GPU environment (only 1 GPU visible).
    • SAM3_PARALLEL_AUTO_MODE=1 CUDA_VISIBLE_DEVICES=0 python app.py → runs as a 1-slot pool.
  • Confirm that the result mp4 / csv files are generated correctly and that the status UI updates.

Step 9: push

  • No change to requirements.txt (multiprocessing / uuid are in the standard library).
  • On the HF Space, upgrade to 4xA100-large (or equivalent) in the hardware tab and push the same code; it then enters parallel mode automatically.

7. Future extensions (out of scope for this PR)

  • Worker reuse (persistent): instead of spawning per video, send commands to the worker over Connection/Pipe-based RPC so the model is loaded only once. If loading the SAM3 weights costs 1~3 minutes per video, the speed gain is large.
  • Cancel / pause: stop button → the dispatcher sends SIGTERM to all workers and returns partial results.
  • GPUs with different memory sizes: a priority queue that routes large videos to 80 GB GPUs and small videos to smaller GPUs.
  • Distributed (multi-node): abstract the worker so the same interface can launch workers on nodes reached over SSH.
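
The persistent-worker item can be illustrated with a tiny command loop over multiprocessing.Pipe. This is a sketch under loud assumptions: a thread stands in for the spawned worker process so the example runs anywhere, the None sentinel and message shapes are invented for illustration, and fake_load_model replaces the real SAM3 weight loading.

```python
import threading
from multiprocessing import Pipe


def persistent_worker(conn, load_model):
    """Load the model once, then serve jobs until a None sentinel arrives."""
    model = load_model()                 # paid once per worker, not per video
    while True:
        job = conn.recv()
        if job is None:                  # sentinel: shut down cleanly
            break
        conn.send({"video": job, "output": model(job)})
    conn.close()


load_calls = []


def fake_load_model():
    """Hypothetical stand-in for the expensive weight loading."""
    load_calls.append(1)
    return lambda video: f"{video}.segmented"


parent_conn, child_conn = Pipe()
worker = threading.Thread(target=persistent_worker,
                          args=(child_conn, fake_load_model))
worker.start()

outputs = []
for video in ["a.mp4", "b.mp4", "c.mp4"]:
    parent_conn.send(video)              # one request ...
    outputs.append(parent_conn.recv()["output"])   # ... one reply
parent_conn.send(None)
worker.join(timeout=5)
```

Three videos are served while load_model runs a single time, which is the saving the bullet describes; a real version would also need the progress events and crash handling from §4.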