# Auto-Mode Multi-GPU Parallel Processing Implementation Plan

Date: 2026-04-25
Scope: On dedicated multi-GPU HuggingFace Spaces hardware (e.g. 4× A100-80GB), restructure
Auto-Mode so that the N videos uploaded to the queue are segmented concurrently,
**one video per GPU**.

---

## 0. Background / current structure at a glance

| Item | Current (sequential) |
|------|----------------------|
| Auto-Mode entry point | `app.py:5033 _auto_mode_process(file_list, text_prompt)` |
| Single-video segmentation | `app.py:4896 @spaces.GPU(duration=119) def segment_video(...)` |
| Worker process | `mp.get_context("spawn").Process(target=_segment_video_worker_entry, ...)` (`app.py:4854`) |
| Core logic | `app.py:4239 _segment_video_core(...)` (chunk-wise SAM3 inference) |
| Model instantiation | `app.py:4453 predictor_cls = _get_sam3_predictor_cls(); predictor = predictor_cls(...)` |
| Output directory | `build/downloads/` (written via `_persist_for_download`) |
| Progress reporting | `progress / status / result / error` messages streamed over an `mp.Queue` |
| Processing across videos | Serial `for path in paths:` loop; GPU cleanup (`_cleanup_cuda_cache()`) runs before the next video starts |

### HuggingFace Spaces hardware / `spaces.GPU` behavior

- The `@spaces.GPU` decorator attaches the ZeroGPU slice-allocation logic only when `Config.zero_gpu` (= `SPACES_ZERO_GPU=true`) is set (`spaces/zero/decorator.py:83`). On a dedicated GPU Space (4×A100 etc.) the decorator is a **no-op**, and an ordinary Python process sees all four CUDA devices directly (`torch.cuda.device_count() == 4`).
- ZeroGPU (MIG slice) mode allocates only one GPU slice at a time, so **this plan assumes dedicated multi-GPU hardware**. Under ZeroGPU the code automatically falls back to the existing serial path.

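### Isolation requirements

| Resource | Conflict risk | Mitigation |
|----------|---------------|------------|
| GPU memory / context | Four videos loading the model onto the same device at once → OOM, context interference | Pin each worker to a single device with the `CUDA_VISIBLE_DEVICES=N` environment variable |
| `sam3.*` in-process module caches (`_SAM3_PREDICTOR_CLS`, `_LAST_SEG_CACHE`, `cached_frame_outputs`, etc.) | State gets entangled under 4-way concurrent calls inside one interpreter | Separate spawn-based processes, so module state is isolated by construction |
| `tempfile.mkdtemp()` (chunk input dirs, trimmed mp4s) | None: `mkdtemp` generates collision-free names | No action needed |
| Output filenames in `build/downloads/` | Videos that start at the same moment share a timestamp → `auto_mode_results_YYYYMMDD_HHMMSS.zip` / `*_overlay.mp4` collide | Add a short uuid (`uuid.uuid4().hex[:8]`) plus the video index to the filename |
| Model checkpoint download / BPE vocab | Four workers downloading/writing the same file simultaneously → race | The parent (main process) ensures the files exist once up front; workers treat them as read-only |
| Caches such as `.zerogpu/tensors` | The ZeroGPU cache is not used in dedicated mode | No impact |
| The `sam3/` directory itself | Python imports are independent per process → **no directory copy needed** | Do not create copies |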

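### Conclusion

- **There is no need to duplicate the `sam3` folder.** Isolating at the process level is sufficient.
- **Each video runs in its own spawned child process**, which narrows `CUDA_VISIBLE_DEVICES` to a single device right after entry (before torch is imported) → from the child's point of view only `cuda:0` exists → any hardcoded `cuda` / `cuda:0` in the model or SAM3 code is safe.
- **The parent process does not use the GPU.** It only acts as a dispatcher that operates a 4-slot pool. All heavy imports happen inside the workers.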

---

## 1. Design overview

### 1.1 Worker pool structure

```
┌─────────────────────────────────────────────────────────────┐
│ Gradio main process (no torch CUDA usage)                  │
│   ├─ _auto_mode_process() generator                         │
│   ├─ ParallelSegmentDispatcher                              │
│   │    ├─ pool of N workers (N = min(num_gpus, num_videos)) │
│   │    ├─ submit queue  (video_path → free worker)          │
│   │    ├─ event queue   (progress / status / result / err)  │
│   │    └─ per-video state: gpu_idx, started_at, last_pct…   │
│   └─ yields UI updates (status / per-video progress / files)│
└──────┬──────────────────────────────────────────────────────┘
       │ spawn child × N
       ▼
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│ Worker GPU 0    │  │ Worker GPU 1    │  │ Worker GPU N-1  │
│ CUDA_VISIBLE=0  │  │ CUDA_VISIBLE=1  │  │ CUDA_VISIBLE=N-1│
│ runs            │  │ runs            │  │ runs            │
│ _segment_video  │  │ _segment_video  │  │ _segment_video  │
│   _core(...)    │  │   _core(...)    │  │   _core(...)    │
└─────────────────┘  └─────────────────┘  └─────────────────┘
```

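- The long-term goal is to keep each worker alive as a “persistent worker” so that the model-weight loading cost is paid only once, on the first video (optional optimization: §7). For simplicity, the first implementation uses a “fresh-per-video” structure that spawns a new worker for every video, and switches to the reusable form once that is stable.
- If there are more videos than GPUs (N > 4 on a 4-GPU Space), the queue + pool structure makes the extra videos wait automatically: as soon as a GPU finishes one video it receives the next.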

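### 1.2 Separating the worker entry module: `parallel_segment_worker.py`

**Why a separate file is needed:**
- The current worker target (`_segment_video_worker_entry`) is a function defined inside `app.py`.
- For a spawned child process to unpickle this target it must import `app.py`, and the `import torch` at `app.py:4` runs immediately.
- At that point the child has not yet narrowed `os.environ["CUDA_VISIBLE_DEVICES"]`. `import torch` on its own initializes CUDA lazily, but any module-level code that touches `torch.cuda` would initialize the runtime with all four devices visible, after which changing the environment variable has no effect → even if we only intend to use `cuda:0`, contexts for the other devices come along.
- Fix: move the worker entry into a new file that **does not import torch at top level**. The child imports only that file, sets `os.environ["CUDA_VISIBLE_DEVICES"]` on the first line of the function body, and imports torch / app only *after* that.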

```python
# parallel_segment_worker.py  (intentionally minimal top-level imports)
import os
import sys
import traceback

def worker_main(gpu_index, args, progress_queue):
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    os.environ["SAM3_WORKER_MODE"] = "1"   # skip Gradio launch in app.py
    os.environ.setdefault("SAM3_CACHE_FRAME_OUTPUTS", "0")
    os.environ.setdefault("SAM3_OFFLOAD_TRACKER_STATE_TO_CPU", "1")

    sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

    # NOW it is safe to import torch / app
    import torch
    if torch.cuda.is_available():
        torch.cuda.set_device(0)        # only one device visible: cuda:0

    from app import _segment_video_core  # imports torch but env is already set

    (video_path, text_prompt, duration_limit, id_corrections_text,
     id_drop_text, id_override_start_sec, show_trails, view_mode) = args

    def _progress_cb(val, desc):
        progress_queue.put({"type": "progress", "value": val, "desc": desc,
                            "gpu_index": gpu_index})

    def _status_cb(msg):
        progress_queue.put({"type": "status", "message": msg,
                            "gpu_index": gpu_index})

    try:
        progress_queue.put({"type": "progress", "value": 0.0,
                            "desc": f"GPU {gpu_index}: starting...",
                            "gpu_index": gpu_index})
        out_path, status, loc_path = _segment_video_core(
            video_path, text_prompt, duration_limit,
            id_corrections_text=id_corrections_text,
            id_drop_text=id_drop_text,
            id_override_start_sec=id_override_start_sec,
            show_trails=show_trails,
            view_mode=view_mode,
            progress_callback=_progress_cb,
            status_callback=_status_cb,
        )
        progress_queue.put({"type": "result",
                            "data": (out_path, status, loc_path),
                            "gpu_index": gpu_index})
    except Exception as exc:  # noqa: BLE001
        progress_queue.put({"type": "error",
                            "message": str(exc),
                            "traceback": traceback.format_exc(),
                            "gpu_index": gpu_index})
    finally:
        try:
            import torch
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
                torch.cuda.ipc_collect()
        except Exception:
            pass
```
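One caveat worth checking: with the `spawn` start method, a child re-imports the parent's main module (under the name `__mp_main__`) before it runs the target, so if the Space is launched as `python app.py`, the top-level `import torch` in `app.py` may still execute before `worker_main` sets the variable. A possible way to stop depending on import order is to set `CUDA_VISIBLE_DEVICES` in the parent's environment just before `Process.start()`, because the child inherits the environment at that moment. The following is a minimal sketch under that assumption; `pinned_gpu` and `start_pinned_worker` are hypothetical helper names, not part of the existing code.

```python
import contextlib
import os


@contextlib.contextmanager
def pinned_gpu(gpu_index):
    """Temporarily expose a single GPU to processes started inside the block."""
    key = "CUDA_VISIBLE_DEVICES"
    previous = os.environ.get(key)
    os.environ[key] = str(gpu_index)
    try:
        yield
    finally:
        # Restore the parent's own view so later spawns can pick other GPUs.
        if previous is None:
            os.environ.pop(key, None)
        else:
            os.environ[key] = previous


def start_pinned_worker(ctx, target, args, gpu_index):
    """Start a process whose inherited environment shows only one GPU.

    `ctx` is a multiprocessing context, e.g. mp.get_context("spawn").
    """
    proc = ctx.Process(target=target, args=args, daemon=False)
    with pinned_gpu(gpu_index):
        proc.start()  # the child inherits os.environ as it is right now
    return proc
```

If this route is taken, the assignment inside `worker_main` can stay as a harmless second line of defense. Note that the sketch assumes the parent itself sees all physical GPUs; if the parent already runs with a restricted `CUDA_VISIBLE_DEVICES`, the index would need to be mapped to a physical device ID first.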

### 1.3 Dispatcher class: added inside `app.py`

```python
class ParallelSegmentDispatcher:
    """Distribute one video per GPU concurrently and stream events back."""

    def __init__(self, num_gpus: int):
        self.num_gpus = num_gpus
        self.ctx = mp.get_context("spawn")
        self.event_queue = self.ctx.Queue()
        self.workers: dict[int, mp.Process] = {}    # gpu_index -> Process
        self.gpu_assignments: dict[int, dict] = {}  # gpu_index -> task meta

    def submit(self, gpu_index, video_meta, args):
        from parallel_segment_worker import worker_main
        p = self.ctx.Process(
            target=worker_main,
            args=(gpu_index, args, self.event_queue),
            daemon=False,
        )
        p.start()
        self.workers[gpu_index] = p
        self.gpu_assignments[gpu_index] = video_meta

    def free_gpu(self, gpu_index):
        proc = self.workers.pop(gpu_index, None)
        meta = self.gpu_assignments.pop(gpu_index, None)
        if proc is not None:
            proc.join(timeout=5)
            if proc.is_alive():
                proc.terminate()
                proc.join(timeout=5)
        return meta

    def shutdown(self):
        for gi in list(self.workers.keys()):
            self.free_gpu(gi)
```

### 1.4 Parallel variant of `_auto_mode_process`: `_auto_mode_process_parallel`

Rough algorithm:

```text
G = torch.cuda.device_count()   # number of available GPUs
N = len(paths)                  # number of videos
slot_count = min(G, N)

dispatcher = ParallelSegmentDispatcher(slot_count)

# 1) Assign the first slot_count of the N videos, one per GPU
free_gpus = list(range(slot_count))
queue_index = 0
in_flight = 0
while queue_index < N and free_gpus:
    gi = free_gpus.pop(0)
    dispatcher.submit(gi, meta_for(queue_index), args_for(queue_index))
    queue_index += 1
    in_flight += 1
    yield UI status

# 2) Event loop
while in_flight > 0:
    msg = dispatcher.event_queue.get(timeout=...)
    gi = msg["gpu_index"]
    if msg["type"] == "progress":
        update per-GPU progress bar text; aggregate overall progress
        yield UI status
    elif msg["type"] == "status":
        append status for that GPU
        yield UI status
    elif msg["type"] == "result":
        out_path, status, loc_path = msg["data"]
        finalize: rename/persist with disambiguating suffix
        append (mp4, csv) to all_results
        yield UI status (with newly visible result)
        dispatcher.free_gpu(gi)
        in_flight -= 1
        if queue_index < N:
            dispatcher.submit(gi, meta_for(queue_index), args_for(queue_index))
            queue_index += 1
            in_flight += 1
            yield UI status
    elif msg["type"] == "error":
        record failure for that video
        yield UI status
        dispatcher.free_gpu(gi)
        in_flight -= 1
        # same re-fill logic as result

# 3) Shutdown and cleanup
dispatcher.shutdown()
yield final summary
```
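The slot bookkeeping in the pseudocode above can be exercised without any GPU. The sketch below is one possible shape for it, not the final generator: `run_dispatch_loop` is a hypothetical name, `submit` stands in for `dispatcher.submit`, `events` stands in for `dispatcher.event_queue`, and the UI `yield`s are left out.

```python
import queue


def run_dispatch_loop(tasks, num_gpus, submit, events, timeout=1.0):
    """Run every task with at most one in-flight task per GPU slot.

    submit(gpu_index, task) starts the work; `events` delivers dicts with
    "type" and "gpu_index" keys, mirroring the worker messages above.
    Returns (task, gpu_index, message_type) tuples in completion order.
    """
    free_gpus = list(range(min(num_gpus, len(tasks))))
    pending = list(tasks)
    running = {}    # gpu_index -> task currently assigned to that GPU
    finished = []

    def fill_slots():
        while pending and free_gpus:
            gi = free_gpus.pop(0)
            running[gi] = pending.pop(0)
            submit(gi, running[gi])

    fill_slots()
    while running:
        try:
            msg = events.get(timeout=timeout)
        except queue.Empty:
            continue  # a real loop would check worker liveness here
        gi = msg["gpu_index"]
        if msg["type"] in ("result", "error") and gi in running:
            finished.append((running.pop(gi), gi, msg["type"]))
            free_gpus.append(gi)
            fill_slots()  # same refill logic for success and failure
    return finished
```

Treating `result` and `error` identically for slot release keeps one failed video from stalling the rest of the queue, which is the behavior the pseudocode asks for.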

---

## 2. UI changes

### 2.1 Added component: inside the `Auto-Mode (Batch Queue)` accordion

| Component | Purpose |
|---------|------|
| `auto_mode_parallel_status` (Markdown) | GPU count / active worker count / videos left in the queue / per-GPU progress (e.g. `GPU0: video_a.mp4 73%`, `GPU1: video_b.mp4 41%` …) |
| `auto_results_files_state` / `auto_results_list` (existing, reused to accumulate per-video results) | No change |

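### 2.2 Single-video preview / overlay components

Parallel mode has no single “currently processing” video, so:
- Single-slot widgets such as `video_input` / `video_output` are updated with the result of the **most recently completed** video (a UX courtesy).
- The main progress display is handled by the multiline `auto_mode_parallel_status`.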

### 2.3 Fallback

- `torch.cuda.device_count() <= 1` → keep `_auto_mode_process` (the current serial behavior) as is.
- `>1` → branch to `_auto_mode_process_parallel`.
- Toggle: exposed through the environment variable `SAM3_PARALLEL_AUTO_MODE` (default `auto`; `0` disables, `1` forces).
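Factoring the fallback rules into a pure function makes them unit-testable without a GPU. This is a sketch under assumptions: `should_use_parallel` is a hypothetical name, the set of truthy strings for `SPACES_ZERO_GPU` is a guess, and treating forced mode with zero GPUs as "stay serial" is a choice the plan does not spell out.

```python
def should_use_parallel(env, num_gpus):
    """Decide between serial and parallel Auto-Mode.

    `env` is any mapping of environment variables (e.g. os.environ).
    """
    mode = env.get("SAM3_PARALLEL_AUTO_MODE", "auto").strip().lower()
    # Assumption: ZeroGPU is signalled by a truthy SPACES_ZERO_GPU string.
    zero_gpu = env.get("SPACES_ZERO_GPU", "").strip().lower() in {"1", "true", "yes"}
    if zero_gpu or mode == "0":
        return False
    if mode == "1":
        return num_gpus >= 1  # forced on, even with a single GPU
    return num_gpus > 1       # "auto" (and any unrecognised value)
```

For example, `should_use_parallel(os.environ, torch.cuda.device_count())` could replace the inline expression in the router.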

---

## 3. Preventing filename / output collisions

When four videos drop their results into `build/downloads/` at nearly the same time:

| Function | Change |
|------|------|
| `_rename_with_rule` | Extend it to insert a short per-video instance ID into the result filename, e.g. `{stem}_{video_id8}_seg_{dur}_{elapsed}s.mp4` |
| `_persist_for_download` | Append a `_{n}` suffix when the basename already exists (partly handled today, but make it race-safe by re-checking after `os.rename`) |
| `_build_zip_from_paths` | Already disambiguates duplicate basenames (`seen_names`) → use as is |

`video_id8` is generated once with `uuid.uuid4().hex[:8]` when the dispatcher enqueues the video, and is stored in `meta`.
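For the race-safe part of `_persist_for_download`, one alternative to "rename, then re-check" is to reserve the destination name atomically with `O_EXCL` before copying. The sketch below illustrates that idea only; `persist_unique` is a hypothetical helper and the real function's signature and behavior may differ.

```python
import os
import shutil


def persist_unique(src_path, dest_dir):
    """Copy src_path into dest_dir under a name no other process has claimed."""
    os.makedirs(dest_dir, exist_ok=True)
    stem, ext = os.path.splitext(os.path.basename(src_path))
    n = 0
    while True:
        name = f"{stem}{ext}" if n == 0 else f"{stem}_{n}{ext}"
        dest = os.path.join(dest_dir, name)
        try:
            # O_EXCL makes creation atomic: exactly one caller wins each name.
            fd = os.open(dest, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            n += 1
            continue
        os.close(fd)
        shutil.copyfile(src_path, dest)  # fill the reserved placeholder
        return dest
```

Combined with the `video_id8` tag, the `_{n}` suffix should rarely be needed; it remains as a backstop.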

---

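## 4. Safeguards / edge cases

1. **GPU memory pre-check**: right after worker entry, verify that `_check_gpu_memory_safe()` returns true (before the model is first loaded). If it returns false, report an `error` message to the dispatcher and exit.
2. **Abnormal worker exit**: the dispatcher polls with `event_queue.get(timeout=heartbeat)`; if no message arrives within the heartbeat interval and the worker reports `is_alive() == False`, treat it as an `error` and call `free_gpu`.
3. **Daemon check on the parent process**: in the same spirit as the existing branch where `segment_video` falls back to in-process execution when `mp.current_process().daemon` is true (`app.py:4918`), the dispatcher is disabled under a daemonic parent (daemonic processes are not allowed to create children) → sequential fallback.
4. **Cancellation (stop button)**: not included in the first implementation (the current serial mode has no stop either). Follow-up work.
5. **Log prefix**: prepend `[GPU{n}]` to the progress/status messages sent by workers so they can be told apart in the UI and on stdout.
6. **Deterministic device distribution**: so that videos do not all land on the same GPU, the dispatcher assigns them round-robin (in practice, “the next video goes to whichever GPU finishes first”).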
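Item 2 can be sketched as a small polling helper. This is an illustration under assumptions: `poll_event` is a hypothetical name, `workers` is the dispatcher's `gpu_index -> Process` mapping, and the synthetic message reuses the field names that `worker_main` sends.

```python
import queue


def poll_event(event_queue, workers, heartbeat=5.0):
    """Return the next worker message, or a synthetic error for a dead worker.

    `workers` maps gpu_index -> process-like object exposing is_alive().
    Returns None when nothing arrived and every worker is still alive.
    """
    try:
        return event_queue.get(timeout=heartbeat)
    except queue.Empty:
        pass
    for gpu_index, proc in workers.items():
        if not proc.is_alive():
            # The worker exited without sending "result" or "error".
            return {"type": "error", "gpu_index": gpu_index,
                    "message": "worker exited unexpectedly", "traceback": ""}
    return None
```

Because the synthetic message has the same shape as a real `error`, the event loop needs no extra branch for crashed workers, as long as finished workers are removed from `workers` by `free_gpu` before the next poll.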

---

## 5. Testing / verification

### 5.1 Local (single GPU)
- Confirm that `_parallel_dispatcher` automatically falls back to the serial path when `device_count == 1`.
- With `SAM3_PARALLEL_AUTO_MODE=1` + `CUDA_VISIBLE_DEVICES=0` the dispatcher runs in 1-slot mode (one worker); the results must match the existing `_auto_mode_process`.

### 5.2 Local (fake multi-GPU simulation)
- With `SAM3_PARALLEL_AUTO_MODE=1` + `SAM3_FAKE_GPU_COUNT=4` the dispatcher code builds a 4-slot pool while all slots actually share device 0 (test only; it validates the dispatcher logic alone).
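One way the fake-GPU switch could be wired is to separate the logical slot index from the physical device index. The sketch assumes that design; `resolve_slots` is a hypothetical helper, and the plan does not say where `SAM3_FAKE_GPU_COUNT` is actually read.

```python
def resolve_slots(env, real_gpu_count):
    """Return (slot_count, {slot index: physical device index})."""
    fake = int(env.get("SAM3_FAKE_GPU_COUNT", "0") or 0)
    if fake > 0:
        # Test mode: every slot shares physical device 0.
        return fake, {slot: 0 for slot in range(fake)}
    return real_gpu_count, {slot: slot for slot in range(real_gpu_count)}
```

The dispatcher would then pass the mapped device index, not the slot index, as `CUDA_VISIBLE_DEVICES`. Running four real models on one device may run out of memory, so this mode is probably best paired with very short clips or a stubbed worker.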

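### 5.3 HF Space (4×A100)
- Upload four videos → confirm that the total wall-clock time for the batch is 3.5~4× shorter than on a single GPU (per-video processing time should stay roughly the same).
- Use `nvidia-smi` to confirm that utilization rises on all four GPUs (the debug log prints `GPU memory util:`).

### 5.4 Regression
- The single-video “Run Segmentation” button is unchanged → low regression risk.
- Segmentation results in the output mp4 / csv are expected to match between single and parallel modes given the same seed. GPU kernels are not guaranteed to be bit-identical across runs, so sanity-check with frame-by-frame mask IoU between the single-mode and parallel-mode result mp4s.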
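The IoU sanity check could look like the following dependency-free sketch. It assumes masks have already been extracted per frame as flat sequences of 0/1 values; in practice the same arithmetic would more likely run on NumPy arrays. Both function names are hypothetical.

```python
def mask_iou(mask_a, mask_b):
    """Intersection-over-union of two equally sized binary masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of pixels")
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    # Two empty masks agree perfectly, so define their IoU as 1.0.
    return 1.0 if union == 0 else inter / union


def min_frame_iou(frames_a, frames_b):
    """Worst-case IoU over paired frames from the serial and parallel runs."""
    return min(mask_iou(a, b) for a, b in zip(frames_a, frames_b))
```

A regression test could then assert that `min_frame_iou(...)` stays above a chosen threshold; the threshold itself is a judgment call that this plan does not fix.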

---

## 6. Step-by-step implementation checklist (execution order)

Modify the code in the order written in this document.

### Step 1: create the new file `parallel_segment_worker.py`
- Top-level imports: only `os, sys, traceback`.
- Write `worker_main(gpu_index, args, progress_queue)` as in the §1.2 code.

### Step 2: add the dispatcher class to `app.py`
- Define `class ParallelSegmentDispatcher:` (§1.3).
- Check whether `import uuid` is already present (`app.py:30`) → ✅ present.

### Step 3: add the `_auto_mode_process_parallel(...)` generator to `app.py`
- The algorithm is in §1.4. Keep the output tuple shape identical to the existing `_pkg(...)` in `_auto_mode_process`, i.e. a 19-tuple (no change to the UI wiring).
- Build the `auto_mode_status` message as multiline text to expose per-GPU progress.
- Result filename disambiguation: store `vid8 = uuid.uuid4().hex[:8]` in the video meta, and insert `_{vid8}` into the stem after calling `_rename_with_rule` and before `_persist_for_download`.

### Step 4: add a router at the top of `_auto_mode_process` in `app.py`
- At the start of the function:
  ```python
  num_gpus = torch.cuda.device_count() if torch.cuda.is_available() else 0
  parallel_env = os.getenv("SAM3_PARALLEL_AUTO_MODE", "auto").lower()
  # Compare the string explicitly: bool("false") and bool("0") are both True.
  zero_gpu = os.getenv("SPACES_ZERO_GPU", "").lower() in ("1", "true")
  use_parallel = (
      (parallel_env == "1") or
      (parallel_env == "auto" and num_gpus > 1)
  ) and not zero_gpu
  if use_parallel:
      yield from _auto_mode_process_parallel(file_list, text_prompt, num_gpus, progress)
      return
  ```
- Disabled in ZeroGPU mode (each call is allocated a GPU only one slice at a time → concurrency is pointless).

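### Step 5: reconcile code duplicated with `_segment_video_worker_entry`
- The existing single-video path (`segment_video` → `_segment_video_worker_entry`) can gradually be consolidated to use `parallel_segment_worker.worker_main` as well, so that it is maintained in one place (optional). The first implementation **does not touch it** (to minimize regression risk).

### Step 6: output filename disambiguation patch
- Add an optional `extra_tag: str = ""` parameter to the `_rename_with_rule` signature (empty default for backward compatibility).
- Pass `extra_tag=vid8` only on the parallel path.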
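The effect of `extra_tag` on the filename pattern from §3 can be illustrated as follows. `build_result_name` is a hypothetical stand-in: the real `_rename_with_rule` signature and the exact format of the duration and elapsed fields are not shown in this plan, so they are passed in as opaque values here.

```python
import os
import uuid


def build_result_name(src_path, duration_tag, elapsed_sec, extra_tag=""):
    """Return '{stem}[_{extra_tag}]_seg_{dur}_{elapsed}s{ext}' for a result file."""
    stem, ext = os.path.splitext(os.path.basename(src_path))
    tag = f"_{extra_tag}" if extra_tag else ""  # empty tag keeps the old name
    return f"{stem}{tag}_seg_{duration_tag}_{int(elapsed_sec)}s{ext}"


# Parallel path only: one short ID per queued video, stored in its meta.
vid8 = uuid.uuid4().hex[:8]
parallel_name = build_result_name("video_a.mp4", "30s", 42.7, extra_tag=vid8)
```

With the default empty tag the serial path produces exactly the name it produced before, which is what makes the change backward compatible.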

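### Step 7: UI text improvements
- Multi-line output in the `auto_mode_status` Markdown (one line per GPU). If it gets too long, put it in a collapsible code block.
- `gr.Progress` is a single bar, so send it only the “overall average progress” of parallel mode and show the per-GPU detail as text.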
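Both pieces of Step 7 reduce to small pure functions. The sketch below assumes the dispatcher tracks, per GPU, the current video name and a progress fraction in `[0, 1]`; the function names and the header line format are illustrative only.

```python
def render_parallel_status(slots, queued, total):
    """Build the per-GPU status text.

    slots: gpu_index -> (video_name, fraction in [0, 1]) for active workers.
    """
    lines = [f"GPUs active: {len(slots)} | queued: {queued} | total: {total}"]
    for gi in sorted(slots):
        name, frac = slots[gi]
        lines.append(f"GPU{gi}: {name} {round(frac * 100)}%")
    return "\n".join(lines)


def overall_progress(done_count, active_fractions, total):
    """Average progress over all videos: finished = 1.0, queued = 0.0."""
    if total <= 0:
        return 0.0
    return (done_count + sum(active_fractions)) / total
```

The returned text uses plain newlines; wrapping it in a fenced block before passing it to the Markdown component is one way to keep the one-line-per-GPU layout. The value from `overall_progress` is what would be sent to `gr.Progress`.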

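### Step 8: smoke test
- Launch locally with `python app.py` and:
  - Upload two videos → runs in serial mode in a single-GPU environment (only one GPU visible).
  - `SAM3_PARALLEL_AUTO_MODE=1 CUDA_VISIBLE_DEVICES=0 python app.py` → runs with a 1-slot pool.
- Confirm that the result mp4 / csv files are generated correctly and that the status UI updates.

### Step 9: push
- No change to `requirements.txt` (multiprocessing / uuid are in the standard library).
- On the HF Space, upgrade to `4xA100-large` (or equivalent) in the hardware tab and then push the same code; it enters parallel mode automatically.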

---

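## 7. Future extensions (out of scope for this PR)

- **Worker reuse (persistent)**: instead of spawning per video, send commands to workers over `Connection`/`Pipe`-based RPC so that the model is loaded only once. If SAM3 weight loading costs 1~3 minutes per video, the speedup is large.
- **Cancel / pause**: stop button → the dispatcher sends SIGTERM to all workers and returns partial results.
- **GPUs with different memory sizes**: a priority queue that routes large videos to 80 GB GPUs and small videos to smaller GPUs.
- **Distributed (multi-node)**: abstract the interface so that workers can be launched on nodes across SSH behind the same interface.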