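Auto-Mode Multi-GPU Parallel Processing Implementation Plan

Written: 2026-04-25
Target: on dedicated multi-GPU HuggingFace Spaces hardware (e.g. 4× A100-80GB), restructure Auto-Mode so that the N videos uploaded to its queue are segmented concurrently, one video per GPU.

0. Background / Current Architecture at a Glance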

| Item | Current (sequential) |
| --- | --- |
| Auto-Mode entry point | `app.py:5033 _auto_mode_process(file_list, text_prompt)` |
| Single-video segmentation | `app.py:4896 @spaces.GPU(duration=119) def segment_video(...)` |
| Worker process | `mp.get_context("spawn").Process(target=_segment_video_worker_entry, ...)` (`app.py:4854`) |
| Core logic | `app.py:4239 _segment_video_core(...)` (chunk-wise SAM3 inference) |
| Model instantiation | `app.py:4453 predictor_cls = _get_sam3_predictor_cls(); predictor = predictor_cls(...)` |
| Output directory | `build/downloads/` (via `_persist_for_download`) |
| Progress messaging | `progress / status / result / error` messages streamed over an `mp.Queue` |
| Between videos | Serial `for path in paths:` loop; GPU cleanup (`_cleanup_cuda_cache()`) runs before the next video |

HuggingFace Spaces hardware / `spaces.GPU` behavior

The @spaces.GPU decorator attaches the ZeroGPU slice-allocation logic only when Config.zero_gpu (= SPACES_ZERO_GPU=true) is set (spaces/zero/decorator.py:83). On a dedicated GPU Space (4×A100, etc.) the decorator is a no-op, and an ordinary Python process sees all four CUDA devices directly (torch.cuda.device_count() == 4).
ZeroGPU (MIG slice) mode allocates only one GPU slice at a time, so this plan assumes dedicated multi-GPU hardware. In a ZeroGPU environment the code falls back automatically to the existing sequential path.

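Isolation requirements

| Resource | Potential conflict | Mitigation |
| --- | --- | --- |
| GPU memory / context | Four videos load the model concurrently on the same device → OOM, context interference | Pin each worker to a single device with the `CUDA_VISIBLE_DEVICES=N` environment variable |
| `sam3.*` module in-process caches (`_SAM3_PREDICTOR_CLS`, `_LAST_SEG_CACHE`, `cached_frame_outputs`, etc.) | State gets entangled under 4-way concurrent calls inside one interpreter | Separate spawn-based processes → module state itself is separated |
| `tempfile.mkdtemp()` (chunk input dir, trimmed mp4) | `mkdtemp` generates collision-free names automatically → safe | No further action needed |
| `build/downloads/` output filenames | Videos that start at the same time get the same timestamp → `auto_mode_results_YYYYMMDD_HHMMSS.zip` / `*_overlay.mp4` collide | Add a short uuid (`uuid.uuid4().hex[:8]`) plus the video index to the filename |
| Model checkpoint download / BPE vocab | Four workers download/write the same file at once → race | The parent (main process) ensures the files once in advance; workers are read-only |
| `.zerogpu/tensors` and similar caches | The ZeroGPU cache is not used in dedicated mode | No impact |
| `sam3/` directory itself | Python imports are independent per process → no directory copy needed | Do not create copies |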

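Conclusion

There is no need to duplicate the sam3 folder. A single level of isolation, the process, is sufficient.
Each video runs in one spawn child process, which narrows CUDA_VISIBLE_DEVICES to a single device right after entry (before torch is imported) → from the child's point of view only cuda:0 ever exists → any hardcoded cuda / cuda:0 in the model or SAM3 code is safe.
The parent process does not use the GPU. It only acts as a dispatcher that runs a 4-slot pool. All heavy imports happen inside the workers.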

1. Design Overview

1.1 Worker pool structure

┌─────────────────────────────────────────────────────────────┐
│ Gradio main process (no torch CUDA usage)                  │
│   ├─ _auto_mode_process() generator                         │
│   ├─ ParallelSegmentDispatcher                              │
│   │    ├─ pool of N workers (N = min(num_gpus, num_videos)) │
│   │    ├─ submit queue  (video_path → free worker)          │
│   │    ├─ event queue   (progress / status / result / err)  │
│   │    └─ per-video state: gpu_idx, started_at, last_pct…   │
│   └─ yields UI updates (status / per-video progress / files)│
└──────┬──────────────────────────────────────────────────────┘
       │ spawn child × N
       ▼
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│ Worker GPU 0    │  │ Worker GPU 1    │  │ Worker GPU N-1  │
│ CUDA_VISIBLE=0  │  │ CUDA_VISIBLE=1  │  │ CUDA_VISIBLE=N-1│
│ runs            │  │ runs            │  │ runs            │
│ _segment_video  │  │ _segment_video  │  │ _segment_video  │
│   _core(...)    │  │   _core(...)    │  │   _core(...)    │
└─────────────────┘  └─────────────────┘  └─────────────────┘

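Workers can be run as long-lived "persistent workers" so that the cost of loading model weights is paid only once, on the first video (optional optimization: §7). For simplicity, the first implementation starts with a "fresh-per-video" structure that spawns a new worker for every video, and switches to reusable workers once it is stable.
If there are more videos than GPUs (N > 4), the queue + pool structure serializes the excess automatically (when a GPU finishes one video, it takes the next).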
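
To see how the pool absorbs the overflow when there are more videos than GPUs, here is a minimal simulation. It is not the project's dispatcher: `simulate_pool` is a hypothetical helper, the durations are invented, and it assumes per-video durations are known up front. It models only the "first free GPU takes the next video" rule.

```python
import heapq

def simulate_pool(durations, num_gpus):
    """Greedy 'first free GPU takes the next video' schedule.

    Returns (assignments, makespan): assignments[i] is the GPU index that
    ran video i, makespan is the wall-clock time until the last one ends.
    """
    slots = min(num_gpus, len(durations))
    # Heap of (time_when_free, gpu_index); ties go to the lowest index.
    free_at = [(0.0, gpu) for gpu in range(slots)]
    heapq.heapify(free_at)
    assignments = []
    makespan = 0.0
    for duration in durations:
        start, gpu = heapq.heappop(free_at)
        end = start + duration
        assignments.append(gpu)
        makespan = max(makespan, end)
        heapq.heappush(free_at, (end, gpu))
    return assignments, makespan


# Six videos (durations in minutes, made up) on four GPUs.
assignments, makespan = simulate_pool([10, 4, 6, 8, 5, 3], num_gpus=4)
print(assignments, makespan)
```

With these made-up durations the batch finishes after 10 minutes instead of 36 sequentially. The real speedup depends on how evenly the video lengths are distributed.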

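1.2 Separate worker entry module: `parallel_segment_worker.py`

Why a separate file is needed:

The current worker target (_segment_video_worker_entry) is a function inside app.py.
For a spawn child process to unpickle this target, it has to import app.py, and the import torch at app.py:4 runs immediately.
At that point the child has not yet narrowed os.environ["CUDA_VISIBLE_DEVICES"]. PyTorch initializes CUDA lazily, so the import alone is not the problem, but any top-level code in app.py that touches CUDA would initialize the runtime with all four devices visible → even if we only mean to use cuda:0, contexts for the other devices come along.
Solution: move the worker entry into a new file that does not import torch at top level. The child imports only that file, sets os.environ["CUDA_VISIBLE_DEVICES"] on the first line of the function body, and only then imports torch / app.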

# parallel_segment_worker.py  (intentionally minimal top-level imports)
import os
import sys
import traceback

def worker_main(gpu_index, args, progress_queue):
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    os.environ["SAM3_WORKER_MODE"] = "1"   # skip Gradio launch in app.py
    os.environ.setdefault("SAM3_CACHE_FRAME_OUTPUTS", "0")
    os.environ.setdefault("SAM3_OFFLOAD_TRACKER_STATE_TO_CPU", "1")

    sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

    # NOW it is safe to import torch / app
    import torch
    if torch.cuda.is_available():
        torch.cuda.set_device(0)        # only one device visible: cuda:0

    from app import _segment_video_core  # imports torch but env is already set

    (video_path, text_prompt, duration_limit, id_corrections_text,
     id_drop_text, id_override_start_sec, show_trails, view_mode) = args

    def _progress_cb(val, desc):
        progress_queue.put({"type": "progress", "value": val, "desc": desc,
                            "gpu_index": gpu_index})

    def _status_cb(msg):
        progress_queue.put({"type": "status", "message": msg,
                            "gpu_index": gpu_index})

    try:
        progress_queue.put({"type": "progress", "value": 0.0,
                            "desc": f"GPU {gpu_index}: starting...",
                            "gpu_index": gpu_index})
        out_path, status, loc_path = _segment_video_core(
            video_path, text_prompt, duration_limit,
            id_corrections_text=id_corrections_text,
            id_drop_text=id_drop_text,
            id_override_start_sec=id_override_start_sec,
            show_trails=show_trails,
            view_mode=view_mode,
            progress_callback=_progress_cb,
            status_callback=_status_cb,
        )
        progress_queue.put({"type": "result",
                            "data": (out_path, status, loc_path),
                            "gpu_index": gpu_index})
    except Exception as exc:  # noqa: BLE001
        progress_queue.put({"type": "error",
                            "message": str(exc),
                            "traceback": traceback.format_exc(),
                            "gpu_index": gpu_index})
    finally:
        try:
            import torch
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
                torch.cuda.ipc_collect()
        except Exception:
            pass
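
The worker above depends on CUDA_VISIBLE_DEVICES being set before CUDA is initialized in the child. Under the spawn start method Python also re-imports the parent's main module in the child before the target runs, so if the app is launched as `python app.py`, its top-level code may execute first. A defensive check such as the sketch below could fail fast in that situation. `pin_single_gpu` is a hypothetical helper, not part of the plan. The only torch call it uses is `torch.cuda.is_initialized()`, and it takes the module table as a parameter so it can be exercised without torch installed.

```python
import os
import sys
import types

def pin_single_gpu(gpu_index, modules=None):
    """Restrict this process to one GPU, refusing if CUDA is already live.

    `modules` defaults to sys.modules; it is a parameter only so the check
    can be tested with a stand-in for torch.
    """
    modules = sys.modules if modules is None else modules
    torch_mod = modules.get("torch")
    # torch being imported is fine; CUDA being initialized is not, because
    # the visible-device list is fixed at initialization time.
    if torch_mod is not None and torch_mod.cuda.is_initialized():
        raise RuntimeError(
            "CUDA was initialized before CUDA_VISIBLE_DEVICES could be set"
        )
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_index)


# Stand-ins for torch in the two states (no real torch needed here).
fresh = types.SimpleNamespace(
    cuda=types.SimpleNamespace(is_initialized=lambda: False))
live = types.SimpleNamespace(
    cuda=types.SimpleNamespace(is_initialized=lambda: True))

pin_single_gpu(2, modules={"torch": fresh})
print(os.environ["CUDA_VISIBLE_DEVICES"])  # 2
```

Calling it with the `live` stand-in raises RuntimeError. In a worker that would surface as an explicit error instead of a process that silently sees all four devices.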

1.3 Dispatcher class: added inside `app.py`

class ParallelSegmentDispatcher:
    """Distribute one video per GPU concurrently and stream events back."""

    def __init__(self, num_gpus: int):
        self.num_gpus = num_gpus
        self.ctx = mp.get_context("spawn")
        self.event_queue = self.ctx.Queue()
        self.workers: dict[int, mp.Process] = {}    # gpu_index -> Process
        self.gpu_assignments: dict[int, dict] = {}  # gpu_index -> task meta

    def submit(self, gpu_index, video_meta, args):
        from parallel_segment_worker import worker_main
        p = self.ctx.Process(
            target=worker_main,
            args=(gpu_index, args, self.event_queue),
            daemon=False,
        )
        p.start()
        self.workers[gpu_index] = p
        self.gpu_assignments[gpu_index] = video_meta

    def free_gpu(self, gpu_index):
        proc = self.workers.pop(gpu_index, None)
        meta = self.gpu_assignments.pop(gpu_index, None)
        if proc is not None:
            proc.join(timeout=5)
            if proc.is_alive():
                proc.terminate()
                proc.join(timeout=5)
        return meta

    def shutdown(self):
        for gi in list(self.workers.keys()):
            self.free_gpu(gi)

1.4 Parallel variant of `_auto_mode_process`: `_auto_mode_process_parallel`

Rough algorithm:

G = torch.cuda.device_count()   # number of available GPUs
N = len(paths)                  # number of videos
slot_count = min(G, N)

dispatcher = ParallelSegmentDispatcher(slot_count)

# 1) Assign the first slot_count of the N videos, one per GPU
free_gpus = list(range(slot_count))
queue_index = 0
in_flight = 0
while queue_index < N and free_gpus:
    gi = free_gpus.pop(0)
    dispatcher.submit(gi, meta_for(queue_index), args_for(queue_index))
    queue_index += 1
    in_flight += 1
    yield UI status

# 2) Event loop
while in_flight > 0:
    msg = dispatcher.event_queue.get(timeout=...)
    gi = msg["gpu_index"]
    if msg["type"] == "progress":
        update per-GPU progress bar text; aggregate overall progress
        yield UI status
    elif msg["type"] == "status":
        append status for that GPU
        yield UI status
    elif msg["type"] == "result":
        out_path, status, loc_path = msg["data"]
        finalize: rename/persist with disambiguating suffix
        append (mp4, csv) to all_results
        yield UI status (with newly visible result)
        dispatcher.free_gpu(gi)
        in_flight -= 1
        if queue_index < N:
            dispatcher.submit(gi, meta_for(queue_index), args_for(queue_index))
            queue_index += 1
            in_flight += 1
            yield UI status
    elif msg["type"] == "error":
        record failure for that video
        yield UI status
        dispatcher.free_gpu(gi)
        in_flight -= 1
        # same re-fill logic as result

# 3) Shutdown and cleanup
dispatcher.shutdown()
yield final summary
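
The control flow above can be exercised without GPUs. The sketch below is a stand-in, not the planned `_auto_mode_process_parallel`: it replaces spawned processes with threads and `mp.Queue` with `queue.Queue`, drops the UI yields, and uses hypothetical names (`run_pool`, `start_worker`, `fake_worker`).

```python
import queue
import threading

def run_pool(tasks, num_slots, start_worker, timeout=30.0):
    """Run tasks over num_slots slots, refilling a slot as soon as it frees.

    start_worker(slot, task, events) must eventually put one message with
    "type" in {"result", "error"} and "gpu_index" == slot onto events.
    Returns {task_index: final_message}.
    """
    events = queue.Queue()
    in_flight = {}   # slot -> index of the task it is running
    finished = {}
    next_index = 0

    def fill(slot):
        nonlocal next_index
        if next_index < len(tasks):
            in_flight[slot] = next_index
            start_worker(slot, tasks[next_index], events)
            next_index += 1

    # 1) Initial assignment: one task per slot.
    for slot in range(min(num_slots, len(tasks))):
        fill(slot)

    # 2) Event loop: progress/status would update the UI; result/error
    #    frees the slot and hands it the next queued task.
    while in_flight:
        msg = events.get(timeout=timeout)
        if msg["type"] in ("result", "error"):
            slot = msg["gpu_index"]
            finished[in_flight.pop(slot)] = msg
            fill(slot)
    return finished


def fake_worker(slot, task, events):
    """Thread standing in for a spawned GPU worker."""
    def body():
        events.put({"type": "progress", "value": 0.5, "gpu_index": slot})
        if task == "bad.mp4":
            events.put({"type": "error", "message": "boom", "gpu_index": slot})
        else:
            events.put({"type": "result", "data": task + ".out",
                        "gpu_index": slot})
    threading.Thread(target=body, daemon=True).start()


outcome = run_pool(["a.mp4", "bad.mp4", "c.mp4"], num_slots=2,
                   start_worker=fake_worker)
print({i: m["type"] for i, m in sorted(outcome.items())})
```

Results and errors go through the same refill path, so a failed video frees its slot for the next one just as a successful one does.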

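2. UI Changes

2.1 Additional component, inside the `Auto-Mode (Batch Queue)` accordion

| Component | Purpose |
| --- | --- |
| `auto_mode_parallel_status` (Markdown) | GPU count / active worker count / videos remaining in the queue / per-GPU progress (e.g. `GPU0: video_a.mp4 73%`, `GPU1: video_b.mp4 41%` …) |
| Per-video result accumulation reuses the existing `auto_results_files_state` / `auto_results_list` | No change |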

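2.2 Single-video preview / overlay components

In parallel mode there is no single "currently processing" video, so:

Single-slot widgets such as video_input / video_output are updated with the result of the most recently completed video (a UX courtesy).
The main progress display is handled by the multiline auto_mode_parallel_status.

2.3 Fallback

torch.cuda.device_count() <= 1 → keep _auto_mode_process (the current sequential behavior) as is.
torch.cuda.device_count() > 1 → branch to _auto_mode_process_parallel.
Toggle: exposed through the environment variable SAM3_PARALLEL_AUTO_MODE (default auto; 0 disables; 1 forces).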
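
These rules reduce to a small decision function. The sketch below is one possible factoring that is easy to unit test. `resolve_auto_mode_path` is a hypothetical name. Two behaviors are assumptions not stated in the plan: any unrecognized toggle value counts as disabled, and a forced toggle is ignored when no GPU is visible.

```python
def resolve_auto_mode_path(setting, num_gpus, zero_gpu=False):
    """Return "parallel" or "sequential" for the Auto-Mode batch.

    setting:  value of SAM3_PARALLEL_AUTO_MODE ("auto", "0" or "1").
    num_gpus: number of CUDA devices visible to the parent process.
    zero_gpu: True when running on ZeroGPU, which always stays sequential.
    """
    setting = (setting or "auto").strip().lower()
    if zero_gpu or num_gpus < 1:
        return "sequential"          # assumption: forcing needs >= 1 GPU
    if setting == "1":
        return "parallel"            # forced, even with a single GPU
    if setting == "auto" and num_gpus > 1:
        return "parallel"
    return "sequential"              # "0" or any unrecognized value


for setting, gpus in [("auto", 1), ("auto", 4), ("0", 4), ("1", 1)]:
    print(setting, gpus, resolve_auto_mode_path(setting, gpus))
```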

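3. Preventing Filename / Output Collisions

When four videos drop their results into build/downloads/ at almost the same moment:

| Function | Change |
| --- | --- |
| `_rename_with_rule` | Extend it to insert a short per-video instance ID into the result filename, e.g. `{stem}_{video_id8}_seg_{dur}_{elapsed}s.mp4` |
| `_persist_for_download` | For a duplicate basename, append a `_{n}` suffix (already handled to some extent, but make it race-safe by re-checking after `os.rename`) |
| `_build_zip_from_paths` | Already has basename-duplicate disambiguation logic (`seen_names`) → use as is |

video_id8 is generated once with uuid.uuid4().hex[:8] when the dispatcher enqueues the video, and is stored in meta.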
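
One way to make the `_{n}` suffix race-safe is to claim the destination name atomically before copying. The sketch below illustrates that idea with a hypothetical `persist_unique` helper. It is not the project's `_persist_for_download`, and it swaps the rename-then-recheck approach mentioned in the table for an exclusive create (`O_CREAT | O_EXCL`).

```python
import os
import shutil
import tempfile

def persist_unique(src_path, dest_dir):
    """Copy src_path into dest_dir under a name no other caller holds."""
    os.makedirs(dest_dir, exist_ok=True)
    stem, ext = os.path.splitext(os.path.basename(src_path))
    n = 0
    while True:
        name = f"{stem}{ext}" if n == 0 else f"{stem}_{n}{ext}"
        dest = os.path.join(dest_dir, name)
        try:
            # O_EXCL makes creation fail if the name already exists, so
            # two concurrent callers can never claim the same file.
            fd = os.open(dest, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            n += 1
            continue
        os.close(fd)
        shutil.copyfile(src_path, dest)   # fill the reserved placeholder
        return dest


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "clip_seg.mp4")
    with open(src, "wb") as fh:
        fh.write(b"demo")
    out_dir = os.path.join(tmp, "downloads")
    first = persist_unique(src, out_dir)
    second = persist_unique(src, out_dir)
    with open(second, "rb") as fh:
        copied = fh.read()
    names = [os.path.basename(first), os.path.basename(second)]
    print(names)
```

Combined with the per-video ID in the filename, the suffix loop should rarely run more than once. It serves as a backstop if two results still end up with the same name.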

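4. Safeguards / Edge Cases

GPU memory pre-check: right after worker entry (before the model is first loaded), verify that _check_gpu_memory_safe() returns true. If it returns false, report an error message to the dispatcher and exit.
Abnormal worker exit: the dispatcher polls with event_queue.get(timeout=heartbeat); if no message arrives within the heartbeat interval and the worker's is_alive() == False, treat it as an error and call free_gpu.
Daemon check in the parent process: in the same spirit as the existing branch (app.py:4918) where segment_video falls back to in-process execution when mp.current_process().daemon is true, the dispatcher is disabled under a daemon parent → sequential fallback.
Cancellation (stop button): not included in the first implementation (the current sequential mode has no stop either). Follow-up work.
Log prefix: prepend [GPU{n}] to the progress/status messages sent by workers so they can be told apart in the UI and in stdout.
Deterministic device distribution: so that videos do not all land on the same GPU, the dispatcher assigns them round-robin (in effect, "the next video goes to whichever GPU finishes first").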
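
The heartbeat rule for abnormal worker exit could look roughly like the following. `poll_event` and `FakeProcess` are hypothetical names for this sketch. The shape of the synthesized error message mirrors the worker's own `error` messages, which is an assumption about what the dispatcher would want.

```python
import queue

def poll_event(event_queue, workers, heartbeat_sec=5.0):
    """Return the next worker message, or a synthesized error for a dead worker.

    workers maps gpu_index -> an object with is_alive() and exitcode
    (e.g. multiprocessing.Process). Returns None if nothing happened.
    """
    try:
        return event_queue.get(timeout=heartbeat_sec)
    except queue.Empty:
        pass
    for gpu_index, proc in workers.items():
        if not proc.is_alive():
            return {
                "type": "error",
                "gpu_index": gpu_index,
                "message": "worker exited without reporting "
                           f"(exitcode={proc.exitcode})",
                "traceback": "",
            }
    return None


class FakeProcess:
    """Minimal stand-in for multiprocessing.Process in this demo."""
    def __init__(self, alive, exitcode=None):
        self._alive, self.exitcode = alive, exitcode

    def is_alive(self):
        return self._alive


events = queue.Queue()
workers = {0: FakeProcess(alive=True),
           1: FakeProcess(alive=False, exitcode=-9)}
msg = poll_event(events, workers, heartbeat_sec=0.05)
print(msg["gpu_index"], msg["message"])
```

The caller is expected to call free_gpu for the reported index right away, so the same dead worker is not reported again on the next heartbeat.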

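5. Testing / Verification

5.1 Local (single GPU)

Confirm that _parallel_dispatcher falls back automatically to the sequential path when device_count == 1.
With the environment variables SAM3_PARALLEL_AUTO_MODE=1 + CUDA_VISIBLE_DEVICES=0 → the dispatcher runs in 1-slot mode (one worker); the results must match those of the existing _auto_mode_process.

5.2 Local (simulated multi-GPU)

With SAM3_PARALLEL_AUTO_MODE=1 + SAM3_FAKE_GPU_COUNT=4 the dispatcher code builds a 4-slot pool, but in reality all slots share the same device 0 (for testing only; it verifies the dispatcher logic alone).

5.3 HF Space (4×A100)

Upload 4 videos → confirm that the total batch processing time is 3.5~4× shorter than on a single GPU (per-video time itself is not expected to change).
Use nvidia-smi to confirm that utilization rises on all four GPUs (the debug log prints GPU memory util:).

5.4 Regression

The single-video "Run Segmentation" button is unchanged → low regression risk.
The segmentation accuracy of the result mp4 / csv should be bit-identical between single and parallel modes (given the same seed); sanity-check with the frame-by-frame mask IoU between the single-mode and parallel-mode result mp4s.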
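
A frame-by-frame IoU check might be sketched as below. How the masks are recovered from the two result files is outside this sketch and is assumed to have happened already. Masks are plain nested lists here to keep the example dependency-free; a real check would more likely use NumPy arrays. `mask_iou` and `min_frame_iou` are hypothetical helper names.

```python
def mask_iou(mask_a, mask_b):
    """IoU of two binary masks given as equally sized 2D lists of 0/1."""
    if len(mask_a) != len(mask_b) or any(
        len(ra) != len(rb) for ra, rb in zip(mask_a, mask_b)
    ):
        raise ValueError("masks must have the same shape")
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    # Two empty masks agree perfectly.
    return 1.0 if union == 0 else inter / union


def min_frame_iou(frames_a, frames_b):
    """Worst per-frame IoU across two equally long mask sequences."""
    if len(frames_a) != len(frames_b):
        raise ValueError("sequences must have the same number of frames")
    return min(mask_iou(a, b) for a, b in zip(frames_a, frames_b))


# Two 2x2 frames per run; the second frame differs by one pixel.
single = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
parallel = [[[1, 1], [0, 0]], [[0, 1], [1, 1]]]
print(min_frame_iou(single, parallel))
```

If the two modes really are bit-identical, the minimum is exactly 1.0. Anything lower identifies the frame where they start to diverge.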

6. Step-by-Step Implementation Checklist (execution order)

Modify the code in the order written in this document.

Step 1: Create the new file `parallel_segment_worker.py`

Top-level imports: os, sys, traceback only.
Write the worker_main(gpu_index, args, progress_queue) function following the code in §1.2.

Step 2: Add the dispatcher class to `app.py`

Define class ParallelSegmentDispatcher: (§1.3).
Check whether import uuid is already present (app.py:30) → ✅ it is.

Step 3: Add the `_auto_mode_process_parallel(...)` generator to `app.py`

The algorithm is in §1.4. Keep the output tuple shape as a 19-tuple, identical to _pkg(...) in the existing _auto_mode_process (no UI wiring changes).
Compose the auto_mode_status message as multiline text to expose per-GPU progress.
Result filename disambiguation: store vid8 = uuid.uuid4().hex[:8] in the video meta; after calling _rename_with_rule and before _persist_for_download, insert _{vid8} into the stem.

Step 4: Add a router at the entry of `_auto_mode_process` in `app.py`

At the start of the function:

num_gpus = torch.cuda.device_count() if torch.cuda.is_available() else 0
parallel_env = os.getenv("SAM3_PARALLEL_AUTO_MODE", "auto").lower()
use_parallel = (
    (parallel_env == "1") or
    (parallel_env == "auto" and num_gpus > 1)
) and not bool(os.getenv("SPACES_ZERO_GPU"))
if use_parallel:
    yield from _auto_mode_process_parallel(file_list, text_prompt, num_gpus, progress)
    return

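Disabled in ZeroGPU mode (each call is allocated a GPU only in slice units → concurrency is pointless).

Step 5: Clean up code duplication with `_segment_video_worker_entry`

Optionally, the existing single-video path (segment_video → _segment_video_worker_entry) can gradually be unified to use parallel_segment_worker.worker_main, so that it is maintained in one place. The first implementation leaves it untouched (to minimize regression risk).

Step 6: Patch output filename disambiguation

Add an extra_tag: str = "" option to the _rename_with_rule signature (the empty-string default keeps it backward compatible).
Pass extra_tag=vid8 only on the parallel path.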
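
The backward-compatible `extra_tag` option could behave as in the sketch below. `build_result_name` is a hypothetical stand-in for `_rename_with_rule`, whose real signature is not shown in this plan. The untagged format `{stem}_seg_{dur}_{elapsed}s.mp4` is inferred from the tagged example in §3 and is an assumption.

```python
import uuid

def build_result_name(stem, dur, elapsed, extra_tag=""):
    """Compose '{stem}[_{tag}]_seg_{dur}_{elapsed}s.mp4'.

    An empty extra_tag reproduces the untagged name, so existing
    sequential callers are unaffected.
    """
    tag_part = f"_{extra_tag}" if extra_tag else ""
    return f"{stem}{tag_part}_seg_{dur}_{elapsed}s.mp4"


vid8 = uuid.uuid4().hex[:8]          # generated once per queued video
print(build_result_name("video_a", 30, 12))
print(build_result_name("video_a", 30, 12, extra_tag=vid8))
```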

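Step 7: Improve UI text

Emit multi-line output in the auto_mode_status Markdown (one line per GPU). If it gets too long, put it in a collapsible code block.
gr.Progress is a single bar, so send it only the "overall average progress" of parallel mode and show the per-GPU detail as text.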
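
A possible shape for the status text and the single overall progress value is sketched below. `format_parallel_status` is a hypothetical helper and the header line format is invented. "Overall average" is interpreted here as the mean over all videos in the batch, with finished videos counting as 1.0 and queued ones as 0.0, which is one reading of the plan.

```python
def format_parallel_status(active, queued, total):
    """Build the multi-line status text and the overall progress fraction.

    active: {gpu_index: (video_name, fraction_done)} for running workers.
    queued: number of videos still waiting for a GPU.
    total:  total number of videos in the batch.
    """
    done = total - queued - len(active)
    lines = [
        f"GPUs active: {len(active)} | queued: {queued} | done: {done}/{total}"
    ]
    for gpu_index in sorted(active):
        name, frac = active[gpu_index]
        lines.append(f"GPU{gpu_index}: {name} {round(frac * 100)}%")
    # Finished videos count as 1.0, queued ones as 0.0.
    in_progress = sum(frac for _, frac in active.values())
    overall = (done + in_progress) / total if total else 0.0
    return "\n".join(lines), overall


text, overall = format_parallel_status(
    {0: ("video_a.mp4", 0.73), 1: ("video_b.mp4", 0.41)}, queued=1, total=4
)
print(text)
print(round(overall, 3))
```

The text would go to the auto_mode_status Markdown and `overall` to gr.Progress.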

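Step 8: Smoke test

Launch locally with python app.py and:
- Upload 2 videos → runs in sequential mode in a single-GPU environment (only one GPU visible).
- SAM3_PARALLEL_AUTO_MODE=1 CUDA_VISIBLE_DEVICES=0 python app.py → runs with a 1-slot pool.
Confirm that the result mp4 / csv are generated correctly and that the status UI updates.

Step 9: Push

No changes to requirements.txt (multiprocessing / uuid are in the standard library).
On the HF Space, upgrade to 4xA100-large (or equivalent) in the hardware tab, then push the same code; it enters parallel mode automatically.

7. Future Extensions (out of scope for this PR)

Worker reuse (persistent): instead of spawning for every video, send commands to workers over Connection/Pipe-based RPC so the model is loaded only once. If loading the SAM3 weights costs 1~3 minutes per video, the speed gain is large.
Cancel / pause: stop button → the dispatcher sends SIGTERM to all workers and returns partial results.
GPUs with different memory sizes: a priority queue that routes large videos to 80 GB GPUs and small videos to smaller GPUs.
Distributed (multi-node): abstract the worker launch so that workers can be started on nodes over SSH through the same interface.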
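
For the persistent-worker item, the Pipe-based command loop might look like the following sketch. It is an illustration only, with hypothetical names (`persistent_worker`, `fake_load_model`). The worker runs in a thread here to keep the demo self-contained, whereas the plan would run it in a spawned process, where `load_model` would have to be an importable function rather than a closure.

```python
import multiprocessing as mp
import threading

def persistent_worker(conn, load_model):
    """Serve jobs from a Connection until a None sentinel arrives.

    load_model() is called exactly once, before the first job.
    """
    model = load_model()
    while True:
        job = conn.recv()
        if job is None:          # shutdown sentinel
            break
        try:
            conn.send({"type": "result", "data": model(job)})
        except Exception as exc:
            conn.send({"type": "error", "message": str(exc)})
    conn.close()


load_calls = []

def fake_load_model():
    """Stand-in for loading SAM3 weights; records each call."""
    load_calls.append(1)
    return lambda video: f"segmented:{video}"


parent_conn, child_conn = mp.Pipe()
worker = threading.Thread(target=persistent_worker,
                          args=(child_conn, fake_load_model))
worker.start()

replies = []
for video in ["a.mp4", "b.mp4", "c.mp4"]:
    parent_conn.send(video)
    replies.append(parent_conn.recv()["data"])
parent_conn.send(None)           # ask the worker to exit
worker.join(timeout=5)
print(replies, "model loads:", len(load_calls))
```

Three jobs are served with a single model load, which is the saving the persistent design is after.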