# Auto-Mode Multi-GPU Parallel Processing Implementation Plan

Date: 2026-04-25
Target: On dedicated multi-GPU hardware in HuggingFace Spaces (e.g. 4× A100-80GB), restructure Auto-Mode so that the N videos uploaded to its queue are segmented concurrently, one video per GPU.

---

## 0. Background / Current Structure at a Glance

| Item | Current (sequential) |
|------|----------------------|
| Auto-Mode entry point | `app.py:5033 _auto_mode_process(file_list, text_prompt)` |
| Single-video segmentation | `app.py:4896 @spaces.GPU(duration=119) def segment_video(...)` |
| Worker process | `mp.get_context("spawn").Process(target=_segment_video_worker_entry, ...)` (`app.py:4854`) |
| Core logic | `app.py:4239 _segment_video_core(...)` (chunk-wise SAM3 inference) |
| Model instantiation | `app.py:4453 predictor_cls = _get_sam3_predictor_cls(); predictor = predictor_cls(...)` |
| Output directory | `build/downloads/` (via `_persist_for_download`) |
| Progress reporting | `progress / status / result / error` messages streamed over an `mp.Queue` |
| Handling across videos | Serial `for path in paths:` loop; GPU cleanup (`_cleanup_cuda_cache()`) runs before the next video starts |

### HuggingFace Spaces hardware / `spaces.GPU` behavior

- The `@spaces.GPU` decorator attaches ZeroGPU slice-allocation logic only when `Config.zero_gpu` (= `SPACES_ZERO_GPU=true`) is set (`spaces/zero/decorator.py:83`). On a dedicated GPU Space (4×A100, etc.) the decorator is a no-op, and an ordinary Python process sees all four CUDA devices directly (`torch.cuda.device_count() == 4`).
- ZeroGPU (MIG slice) mode allocates only one GPU slice at a time, so this plan assumes dedicated multi-GPU hardware. In a ZeroGPU environment the code automatically falls back to the existing sequential path.

### Isolation requirements

| Resource | Potential conflict | Resolution |
|----------|--------------------|------------|
| GPU memory / context | Four videos loading the model onto the same device at once → OOM, context interference | Pin each worker to a single device via the `CUDA_VISIBLE_DEVICES=N` environment variable |
| `sam3.*` module in-process caches (`_SAM3_PREDICTOR_CLS`, `_LAST_SEG_CACHE`, `cached_frame_outputs`, etc.) | State gets entangled under 4-way concurrent calls within one interpreter | Separate spawn-started processes → module state is isolated per process |
| `tempfile.mkdtemp()` (chunk input dir, trimmed mp4) | None: `mkdtemp` generates collision-free names | No further action needed |
| `build/downloads/` output file names | Videos started at the same moment share a timestamp → `auto_mode_results_YYYYMMDD_HHMMSS.zip` / `*_overlay.mp4` collide | Add a short uuid (`uuid.uuid4().hex[:8]`) plus the video index to the file name |
| Model checkpoint download / BPE vocab | Four workers downloading/writing the same file at once → race | The parent (main process) ensures the files exist once up front; workers are read-only |
| Caches such as `.zerogpu/tensors` | The ZeroGPU cache is not used in dedicated mode | No impact |
| The `sam3/` directory itself | Python imports are independent per process → no directory copy needed | No copy is made |

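### Conclusion

- There is no need to duplicate the `sam3` folder. Process-level isolation alone is sufficient.
- Each video runs in its own spawn child process, and immediately on entry (before importing torch) the child narrows `CUDA_VISIBLE_DEVICES` to a single device → from the child's point of view only `cuda:0` exists → any hard-coded `cuda` / `cuda:0` in the model or SAM3 code is safe.
- The parent process does not use the GPU. It acts purely as a dispatcher running a 4-slot pool. All heavy imports happen inside the workers.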
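
The ordering constraint in the second bullet can be made checkable. The sketch below uses a hypothetical helper, `pin_visible_gpu` (not part of the codebase), that narrows `CUDA_VISIBLE_DEVICES` and reports whether `torch` had already been imported in the process. It assumes PyTorch's usual lazy CUDA initialization, so an early import is a warning sign rather than a guaranteed failure. One point worth verifying on the real Space: under the `spawn` start method the child re-imports the parent's main module, so if `app.py` is launched as a script, its top-level `import torch` can still run before the worker function does.

```python
import os
import sys


def pin_visible_gpu(gpu_index, environ=None, loaded_modules=None):
    """Restrict this process to one GPU.

    Returns True if torch was already imported when the variable was set,
    which means any CUDA call made at import time saw every device.
    `environ` and `loaded_modules` are injectable for testing only.
    """
    environ = os.environ if environ is None else environ
    loaded_modules = sys.modules if loaded_modules is None else loaded_modules
    torch_already_imported = "torch" in loaded_modules
    environ["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return torch_already_imported
```

A worker could call this first and send a `status` message when it returns `True`, so the condition shows up in the UI log instead of silently degrading isolation.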

---

## 1. Design Overview

### 1.1 Worker pool structure

```
┌────────────────────────────────────────────────────────────┐
│ Gradio main process (no torch CUDA usage)                  │
│ ├─ _auto_mode_process() generator                          │
│ ├─ ParallelSegmentDispatcher                               │
│ │  ├─ pool of N workers (N = min(num_gpus, num_videos))    │
│ │  ├─ submit queue (video_path → free worker)              │
│ │  ├─ event queue (progress / status / result / err)       │
│ │  └─ per-video state: gpu_idx, started_at, last_pct…      │
│ └─ yields UI updates (status / per-video progress / files) │
└──────┬─────────────────────────────────────────────────────┘
       │ spawn child × N
       ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Worker GPU 0    │ │ Worker GPU 1    │ │ Worker GPU N-1  │
│ CUDA_VISIBLE=0  │ │ CUDA_VISIBLE=1  │ │ CUDA_VISIBLE=N-1│
│ runs            │ │ runs            │ │ runs            │
│ _segment_video  │ │ _segment_video  │ │ _segment_video  │
│ _core(...)      │ │ _core(...)      │ │ _core(...)      │
└─────────────────┘ └─────────────────┘ └─────────────────┘
```

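- The eventual goal is to keep each worker alive as a "persistent worker" so the model-weight loading cost is paid only once, on the first video (optional optimization: see §7). For simplicity, the first implementation uses a "fresh-per-video" structure that spawns a new worker for every video, and switches to reusable workers once it is stable.
- If there are more videos than GPUs, the queue + pool structure serializes the excess automatically (a GPU takes the next video as soon as it finishes one).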
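
To see what "a GPU takes the next video as soon as it finishes one" implies for total run time, the following is a small simulation sketch. `simulate_schedule` and the per-video durations are illustrative assumptions, not measurements from the Space.

```python
import heapq


def simulate_schedule(durations, num_gpus):
    """Give each video, in queue order, to whichever GPU frees up first.

    Returns (assignments, makespan), where each assignment is
    (video_index, gpu_index, start_time, end_time).
    """
    if num_gpus < 1:
        raise ValueError("need at least one GPU")
    slots = min(num_gpus, len(durations))
    # Min-heap of (time_when_free, gpu_index); every GPU is free at t = 0.
    free_at = [(0.0, gi) for gi in range(slots)]
    heapq.heapify(free_at)
    assignments = []
    for vi, dur in enumerate(durations):
        start, gi = heapq.heappop(free_at)
        end = start + dur
        assignments.append((vi, gi, start, end))
        heapq.heappush(free_at, (end, gi))
    makespan = max((a[3] for a in assignments), default=0.0)
    return assignments, makespan
```

With five videos of 10, 4, 6, 3 and 5 minutes on two GPUs, the simulated batch finishes after 15 minutes instead of the 28 minutes a serial loop would need; the speedup depends on how evenly the durations pack.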

### 1.2 Splitting out the worker entry module: `parallel_segment_worker.py`

Why a separate file is needed:
- The current worker target (`_segment_video_worker_entry`) is a function inside `app.py`.
- For a spawn child to unpickle this target it must import `app.py`, and the `import torch` at `app.py:4` runs immediately.
- At that point the child has not yet narrowed `os.environ["CUDA_VISIBLE_DEVICES"]`, so torch can end up initializing the CUDA runtime with all four devices visible → even if we only intend to use `cuda:0`, contexts on the other devices may come along.
- Fix: move the worker entry into a new file that does not import torch at top level. The child imports only that file, sets `os.environ["CUDA_VISIBLE_DEVICES"]` on the first line of the function body, and only then imports torch / app.

```python
# parallel_segment_worker.py (intentionally minimal top-level imports)
import os
import sys
import traceback


def worker_main(gpu_index, args, progress_queue):
    # Must happen before torch is imported so CUDA only ever sees one device.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    os.environ["SAM3_WORKER_MODE"] = "1"  # skip Gradio launch in app.py
    os.environ.setdefault("SAM3_CACHE_FRAME_OUTPUTS", "0")
    os.environ.setdefault("SAM3_OFFLOAD_TRACKER_STATE_TO_CPU", "1")

    sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

    # NOW it is safe to import torch / app
    import torch
    if torch.cuda.is_available():
        torch.cuda.set_device(0)  # only one device visible: cuda:0

    from app import _segment_video_core  # imports torch but env is already set

    (video_path, text_prompt, duration_limit, id_corrections_text,
     id_drop_text, id_override_start_sec, show_trails, view_mode) = args

    def _progress_cb(val, desc):
        progress_queue.put({"type": "progress", "value": val, "desc": desc,
                            "gpu_index": gpu_index})

    def _status_cb(msg):
        progress_queue.put({"type": "status", "message": msg,
                            "gpu_index": gpu_index})

    try:
        progress_queue.put({"type": "progress", "value": 0.0,
                            "desc": f"GPU {gpu_index}: starting...",
                            "gpu_index": gpu_index})
        out_path, status, loc_path = _segment_video_core(
            video_path, text_prompt, duration_limit,
            id_corrections_text=id_corrections_text,
            id_drop_text=id_drop_text,
            id_override_start_sec=id_override_start_sec,
            show_trails=show_trails,
            view_mode=view_mode,
            progress_callback=_progress_cb,
            status_callback=_status_cb,
        )
        progress_queue.put({"type": "result",
                            "data": (out_path, status, loc_path),
                            "gpu_index": gpu_index})
    except Exception as exc:  # noqa: BLE001
        progress_queue.put({"type": "error",
                            "message": str(exc),
                            "traceback": traceback.format_exc(),
                            "gpu_index": gpu_index})
    finally:
        # Best-effort cleanup; never let it mask the real result or error.
        try:
            import torch
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
                torch.cuda.ipc_collect()
        except Exception:
            pass
```

### 1.3 Dispatcher class: added inside `app.py`

```python
# Lives in app.py, which already uses `mp` (multiprocessing) for the worker process.
class ParallelSegmentDispatcher:
    """Distribute one video per GPU concurrently and stream events back."""

    def __init__(self, num_gpus: int):
        self.num_gpus = num_gpus
        self.ctx = mp.get_context("spawn")
        self.event_queue = self.ctx.Queue()
        self.workers: dict[int, mp.Process] = {}    # gpu_index -> Process
        self.gpu_assignments: dict[int, dict] = {}  # gpu_index -> task meta

    def submit(self, gpu_index, video_meta, args):
        from parallel_segment_worker import worker_main
        p = self.ctx.Process(
            target=worker_main,
            args=(gpu_index, args, self.event_queue),
            daemon=False,
        )
        p.start()
        self.workers[gpu_index] = p
        self.gpu_assignments[gpu_index] = video_meta

    def free_gpu(self, gpu_index):
        proc = self.workers.pop(gpu_index, None)
        meta = self.gpu_assignments.pop(gpu_index, None)
        if proc is not None:
            proc.join(timeout=5)
            if proc.is_alive():
                proc.terminate()
                proc.join(timeout=5)
        return meta

    def shutdown(self):
        for gi in list(self.workers.keys()):
            self.free_gpu(gi)
```

### 1.4 Parallel variant of `_auto_mode_process`: `_auto_mode_process_parallel`

Rough algorithm:

```text
number of available GPUs  G = torch.cuda.device_count()
number of videos          N = len(paths)
slot_count = min(G, N)

dispatcher = ParallelSegmentDispatcher(slot_count)

# 1) Assign the first slot_count of the N videos, one per GPU
free_gpus = list(range(slot_count))
queue_index = 0
in_flight = 0
while queue_index < N and free_gpus:
    gi = free_gpus.pop(0)
    dispatcher.submit(gi, meta_for(queue_index), args_for(queue_index))
    queue_index += 1
    in_flight += 1
    yield UI status

# 2) Event loop
while in_flight > 0:
    msg = dispatcher.event_queue.get(timeout=...)
    gi = msg["gpu_index"]
    if msg["type"] == "progress":
        update per-GPU progress bar text; aggregate overall progress
        yield UI status
    elif msg["type"] == "status":
        append status for that GPU
        yield UI status
    elif msg["type"] == "result":
        out_path, status, loc_path = msg["data"]
        finalize: rename/persist with disambiguating suffix
        append (mp4, csv) to all_results
        yield UI status (with newly visible result)
        dispatcher.free_gpu(gi)
        in_flight -= 1
        if queue_index < N:
            dispatcher.submit(gi, meta_for(queue_index), args_for(queue_index))
            queue_index += 1
            in_flight += 1
            yield UI status
    elif msg["type"] == "error":
        record failure for that video
        yield UI status
        dispatcher.free_gpu(gi)
        in_flight -= 1
        # same re-fill logic as result

# 3) Shutdown and cleanup
dispatcher.shutdown()
yield final summary
```
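
The pseudocode above can be exercised without any GPU. The sketch below is one possible runnable shape for the loop; `FakeDispatcher` and `run_parallel` are hypothetical names, and the fake queues canned events at submit time instead of spawning a process, so only the slot bookkeeping is being demonstrated. The yielded tuples are placeholders for the real 19-tuple UI updates.

```python
import queue


class FakeDispatcher:
    """Hypothetical stand-in: 'runs' a video by queueing canned events on submit."""

    def __init__(self):
        self.event_queue = queue.Queue()
        self.busy = {}  # gpu_index -> video meta

    def submit(self, gpu_index, meta, args):
        self.busy[gpu_index] = meta
        self.event_queue.put({"type": "progress", "value": 1.0,
                              "desc": "done", "gpu_index": gpu_index})
        if meta.get("fail"):
            self.event_queue.put({"type": "error", "message": "boom",
                                  "gpu_index": gpu_index})
        else:
            data = (meta["name"] + ".mp4", "ok", meta["name"] + ".csv")
            self.event_queue.put({"type": "result", "data": data,
                                  "gpu_index": gpu_index})

    def free_gpu(self, gpu_index):
        return self.busy.pop(gpu_index, None)

    def shutdown(self):
        self.busy.clear()


def run_parallel(metas, num_gpus, dispatcher, timeout=5.0):
    """Yield UI-style events while keeping up to `num_gpus` videos in flight."""
    total = len(metas)
    next_index = 0
    in_flight = 0
    results, failures = [], []

    # 1) Fill the initial slots.
    for gi in range(min(num_gpus, total)):
        dispatcher.submit(gi, metas[next_index], None)
        next_index += 1
        in_flight += 1

    # 2) Event loop: a finished (or failed) slot is refilled immediately.
    try:
        while in_flight > 0:
            msg = dispatcher.event_queue.get(timeout=timeout)
            gi = msg["gpu_index"]
            if msg["type"] in ("progress", "status"):
                yield ("update", gi, msg)
                continue
            meta = dispatcher.free_gpu(gi)
            in_flight -= 1
            if msg["type"] == "result":
                results.append((meta["name"], msg["data"]))
            else:
                failures.append((meta["name"], msg["message"]))
            yield ("finished", gi, msg)
            if next_index < total:
                dispatcher.submit(gi, metas[next_index], None)
                next_index += 1
                in_flight += 1
    finally:
        # 3) Always release workers, even if the consumer stops iterating.
        dispatcher.shutdown()

    yield ("summary", results, failures)
```

Handling `result` and `error` through the same refill path keeps one failed video from stalling the rest of the queue, which is the behavior the "same re-fill logic as result" comment asks for.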

---

## 2. UI Changes

### 2.1 Added component, inside the `Auto-Mode (Batch Queue)` accordion

| Component | Purpose |
|-----------|---------|
| `auto_mode_parallel_status` (Markdown) | GPU count / active workers / videos left in the queue / per-GPU progress (e.g. `GPU0: video_a.mp4 73%`, `GPU1: video_b.mp4 41%` …) |
| Per-video result accumulation reuses the existing `auto_results_files_state` / `auto_results_list` | No change |

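### 2.2 Single-video preview / overlay components

In parallel mode there is no single "currently processing" video, so:
- Single-slot widgets such as `video_input` / `video_output` are updated with the result of the most recently completed video (a UX courtesy).
- The multiline `auto_mode_parallel_status` carries the main progress display.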

### 2.3 Fallback

- `torch.cuda.device_count() <= 1` → keep `_auto_mode_process` (the current sequential behavior) as is.
- `>1` → branch to `_auto_mode_process_parallel`.
- Toggle: exposed as the environment variable `SAM3_PARALLEL_AUTO_MODE` (default `auto`; `0` disables, `1` forces).

---

## 3. Preventing File-Name / Output Collisions

When four videos drop their results into `build/downloads/` at almost the same time:

| Function | Change |
|----------|--------|
| `_rename_with_rule` | Extended to insert a short per-video instance ID into the result file name, e.g. `{stem}_{video_id8}_seg_{dur}_{elapsed}s.mp4` |
| `_persist_for_download` | Append a `_{n}` suffix when the basename is a duplicate (partly handled already, but made race-safe by re-checking after `os.rename`) |
| `_build_zip_from_paths` | Already has basename-duplicate disambiguation (`seen_names`) → used as is |

`video_id8` is generated once with `uuid.uuid4().hex[:8]` when the dispatcher enqueues the video, and is stored in `meta`.
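
As an illustration of the two measures in the table, here is a minimal sketch. The helper names (`new_video_id`, `tagged_name`, `reserve_unique_path`) are hypothetical and do not reflect the real signatures of `_rename_with_rule` or `_persist_for_download`. For the race-safe part it swaps in exclusive file creation (`O_CREAT | O_EXCL`) rather than the rename-then-recheck approach named above, since exclusive creation lets the operating system arbitrate when two processes want the same name.

```python
import os
import tempfile
import uuid


def new_video_id():
    """Short per-video tag, generated once when the video is enqueued."""
    return uuid.uuid4().hex[:8]


def tagged_name(filename, tag):
    """Insert `_{tag}` between the stem and the extension (no-op for an empty tag)."""
    stem, ext = os.path.splitext(filename)
    return f"{stem}_{tag}{ext}" if tag else filename


def reserve_unique_path(directory, filename):
    """Atomically claim a free path, appending _1, _2, ... on collision."""
    stem, ext = os.path.splitext(filename)
    n = 0
    while True:
        name = filename if n == 0 else f"{stem}_{n}{ext}"
        candidate = os.path.join(directory, name)
        try:
            # O_EXCL makes creation fail if the file exists, so two
            # processes can never both claim the same name.
            fd = os.open(candidate, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            n += 1
            continue
        os.close(fd)
        return candidate


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as out_dir:
        name = tagged_name("video_a_seg.mp4", new_video_id())
        print(reserve_unique_path(out_dir, name))
```

The reserved path is an empty placeholder; the finished result would then be moved over it with `os.replace`.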

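---

## 4. Safeguards / Edge Cases

1. GPU memory pre-check: right after worker entry (before the model is first loaded), confirm that `_check_gpu_memory_safe()` returns true. If it returns false, report an `error` message to the dispatcher and exit.
2. Abnormal worker exit: the dispatcher polls with `event_queue.get(timeout=heartbeat)`; if no message arrives within the heartbeat and the worker reports `is_alive() == False`, treat it as an `error` and call `free_gpu`.
3. Daemon check on the parent process: in the same spirit as the existing branch where `segment_video` falls back to in-process execution when `mp.current_process().daemon` is true (`app.py:4918`), the dispatcher is disabled under a daemon parent, because daemonic processes are not allowed to create children → sequential fallback.
4. Cancel (stop button): not included in the first implementation (the current sequential mode has no stop either). Follow-up work.
5. Log prefix: prepend `[GPU{n}]` to the progress/status messages that workers send, so they can be told apart in the UI and on stdout.
6. Deterministic device distribution: so that videos do not all land on the same GPU, the dispatcher assigns the first batch round-robin and thereafter gives the next video to whichever GPU finishes first.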
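
Item 2 can be sketched as a single polling helper. `next_event` and `FakeProc` are hypothetical names; `workers` is assumed to map `gpu_index` to an object exposing `is_alive()` and `exitcode`, as `multiprocessing.Process` does. The synthetic error carries the `[GPU{n}]` prefix from item 5.

```python
import queue


class FakeProc:
    """Test double with the two Process attributes the helper relies on."""

    def __init__(self, alive, exitcode=None):
        self._alive = alive
        self.exitcode = exitcode

    def is_alive(self):
        return self._alive


def next_event(event_queue, workers, heartbeat=1.0):
    """Return the next worker message, a synthetic error for a dead worker,
    or None when every worker is alive but quiet."""
    try:
        return event_queue.get(timeout=heartbeat)
    except queue.Empty:
        pass
    for gpu_index, proc in workers.items():
        if proc.is_alive():
            continue
        try:
            # A final message may have landed between the timeout and this check.
            return event_queue.get_nowait()
        except queue.Empty:
            return {
                "type": "error",
                "gpu_index": gpu_index,
                "message": f"[GPU{gpu_index}] worker exited without a result "
                           f"(exitcode={proc.exitcode})",
                "traceback": "",
            }
    return None
```

The event loop would call this in place of a bare `event_queue.get(...)` and simply continue when it returns `None`.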

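---

## 5. Testing / Verification

### 5.1 Local (single GPU)
- Confirm that `_parallel_dispatcher` automatically falls back to the sequential path when `device_count == 1`.
- With the environment variables `SAM3_PARALLEL_AUTO_MODE=1` + `CUDA_VISIBLE_DEVICES=0`, the dispatcher runs in 1-slot mode (one worker); the results must match those of the existing `_auto_mode_process`.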

### 5.2 Local (simulated multi-GPU)
- With `SAM3_PARALLEL_AUTO_MODE=1` + `SAM3_FAKE_GPU_COUNT=4`, the dispatcher code builds a 4-slot pool while every slot actually shares the same device 0 (test-only; verifies the dispatcher logic alone).
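
One way the fake-GPU switch could be wired in is to separate the logical slot index from the physical device index. The helper below is a hypothetical sketch (`resolve_gpu_slots` does not exist in the codebase); only the variable name `SAM3_FAKE_GPU_COUNT` comes from this plan.

```python
def resolve_gpu_slots(real_gpu_count, env):
    """Return a list mapping slot index -> physical CUDA device index.

    `env` is a mapping such as os.environ; it is passed in so the function
    can be tested without touching the real environment.
    """
    fake = env.get("SAM3_FAKE_GPU_COUNT", "").strip()
    if fake.isdigit() and int(fake) > 0:
        # Test mode: several logical slots, all pinned to physical device 0.
        return [0] * int(fake)
    return list(range(real_gpu_count))
```

In this arrangement the dispatcher would key its bookkeeping by slot index and pass the mapped physical index to the worker for `CUDA_VISIBLE_DEVICES`. Four real model loads on one device may not fit in memory, so this mode is best paired with very short clips or a stubbed core function.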

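### 5.3 HF Space (4×A100)
- Upload four videos → confirm that the total wall-clock time for the batch is 3.5 to 4× shorter than on a single GPU (the time per individual video is not expected to change).
- Use `nvidia-smi` to confirm that utilization rises on all four cards (the debug log prints `GPU memory util:`).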

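### 5.4 Regression
- The single-video "Run Segmentation" button is unchanged → low regression risk.
- The segmentation results in the mp4 / csv are expected to be bit-identical between single and parallel mode (given the same seed and deterministic GPU kernels); as a sanity check, compare the frame-by-frame mask IoU of the single-mode and parallel-mode result mp4 files.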
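
The IoU sanity check reduces to a few lines once masks are available. The sketch below assumes the masks have already been decoded into per-frame 2-D grids of 0/1 values; decoding them from the mp4 files is out of scope here. It uses plain lists to stay dependency-free, whereas real code would more likely use NumPy arrays. The function names are illustrative.

```python
def mask_iou(mask_a, mask_b):
    """Intersection-over-union of two equally sized binary masks (2-D lists)."""
    inter = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    # Two empty masks agree perfectly by convention.
    return 1.0 if union == 0 else inter / union


def min_frame_iou(frames_a, frames_b):
    """Worst per-frame IoU between two runs; 1.0 means every frame matches."""
    if len(frames_a) != len(frames_b):
        raise ValueError("runs have different frame counts")
    return min((mask_iou(a, b) for a, b in zip(frames_a, frames_b)), default=1.0)
```

A regression test could then assert that `min_frame_iou(serial, parallel)` stays above a chosen threshold, with `1.0` corresponding to identical masks.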

---

## 6. Step-by-Step Implementation Checklist (in execution order)

Modify the code in the order given in this document.

### Step 1: Create the new file `parallel_segment_worker.py`
- Top-level imports: `os, sys, traceback` only.
- Write the `worker_main(gpu_index, args, progress_queue)` function as in the §1.2 code.

### Step 2: Add the dispatcher class to `app.py`
- Define `class ParallelSegmentDispatcher:` (§1.3).
- Check whether `import uuid` is already present (`app.py:30`) → ✅ it is.

### Step 3: Add the `_auto_mode_process_parallel(...)` generator to `app.py`
- The algorithm is in §1.4. Keep the output tuple the same 19-tuple as `_pkg(...)` in the existing `_auto_mode_process` (no UI wiring changes).
- Build the `auto_mode_status` message as multiline text to expose per-GPU progress.
- Result file-name disambiguation: store `vid8 = uuid.uuid4().hex[:8]` in the video meta, and insert `_{vid8}` into the stem after calling `_rename_with_rule` and before `_persist_for_download`.

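### Step 4: Add a router at the entry of `_auto_mode_process` in `app.py`
- At the start of the function:
  ```python
  num_gpus = torch.cuda.device_count() if torch.cuda.is_available() else 0
  parallel_env = os.getenv("SAM3_PARALLEL_AUTO_MODE", "auto").lower()
  # Assumption: only an explicit "true"/"1" means ZeroGPU. A plain
  # bool(os.getenv(...)) would also treat "false" or "0" as enabled.
  zero_gpu = os.getenv("SPACES_ZERO_GPU", "").strip().lower() in ("1", "true")
  use_parallel = (
      (parallel_env == "1" and num_gpus >= 1) or  # forced, but needs a GPU
      (parallel_env == "auto" and num_gpus > 1)
  ) and not zero_gpu
  if use_parallel:
      yield from _auto_mode_process_parallel(file_list, text_prompt, num_gpus, progress)
      return
  ```
- Disabled in ZeroGPU mode (each call is allocated a GPU only in slice units → concurrency is pointless).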

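### Step 5: Clean up code duplicated with `_segment_video_worker_entry`
- Optionally, the existing single-video path (`segment_video` → `_segment_video_worker_entry`) can also be migrated gradually to `parallel_segment_worker.worker_main` so the logic is maintained in one place. The first implementation leaves it untouched (to minimize regression risk).

### Step 6: Patch output file-name disambiguation
- Add an `extra_tag: str = ""` option to the `_rename_with_rule` signature (the empty-string default keeps it backward compatible).
- Pass `extra_tag=vid8` only on the parallel path.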

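### Step 7: Improve the UI text
- Emit multi-line output in the `auto_mode_status` Markdown (one line per GPU). If it gets too long, put it in a collapsible code block.
- `gr.Progress` is a single bar, so send only the "overall average progress" of parallel mode there and show the per-GPU detail as text.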
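
The two outputs described in this step could be produced by small pure functions like the ones below. This is a sketch under assumptions: `render_status` and `overall_progress` are hypothetical names, and the exact header wording is illustrative. The per-GPU line format follows the `GPU0: video_a.mp4 73%` example from §2.1.

```python
def render_status(slots, queued):
    """Build the multi-line status text.

    `slots` maps gpu_index -> (video_name, fraction_done in [0, 1]);
    `queued` is the number of videos still waiting for a GPU.
    """
    lines = [f"GPUs: {len(slots)} active, {queued} video(s) queued"]
    for gi in sorted(slots):
        name, frac = slots[gi]
        lines.append(f"GPU{gi}: {name} {round(frac * 100)}%")
    return "\n".join(lines)


def overall_progress(done, in_flight_fractions, total):
    """Average progress over the whole batch, for the single gr.Progress bar."""
    if total <= 0:
        return 0.0
    return min(1.0, (done + sum(in_flight_fractions)) / total)
```

For example, one finished video plus two videos at 50% out of a batch of four gives an overall value of 0.5.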

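### Step 8: Smoke test
- Launch locally with `python app.py`, then:
  - Upload two videos → runs in sequential mode in a single-GPU environment (only one GPU visible).
  - `SAM3_PARALLEL_AUTO_MODE=1 CUDA_VISIBLE_DEVICES=0 python app.py` → runs as a 1-slot pool.
  - Confirm that the result mp4 / csv are generated correctly and that the status UI updates.

### Step 9: Push
- No change to `requirements.txt` (multiprocessing / uuid are in the standard library).
- On the HF Space, upgrade to `4xA100-large` (or equivalent) in the hardware tab and then push the same code; it enters parallel mode automatically.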

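---

## 7. Future Extensions (out of scope for this PR)

- Worker reuse (persistent): instead of spawning per video, send commands to the worker over `Connection`/`Pipe`-based RPC so the model is loaded only once. If SAM3 weight loading costs 1 to 3 minutes per video, the speed gain is large.
- Cancel / pause: stop button → the dispatcher sends SIGTERM to all workers and returns the partial results.
- GPUs with differing memory: a priority queue that routes large videos to 80 GB GPUs and small videos to smaller GPUs.
- Distributed (multi-node): abstract the interface so that workers can be launched on nodes over SSH.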
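
For the persistent-worker idea, the core of the worker side is a receive loop that loads the model once and then serves tasks. The following is a minimal sketch, not a design commitment: `persistent_worker`, `load_model` and `run_one` are hypothetical names, and the demo drives the loop from a thread purely so it can run anywhere. A real worker would be a spawned process pinned to one GPU.

```python
import multiprocessing as mp
import threading


def persistent_worker(conn, load_model, run_one):
    """Load the model once, then serve tasks from `conn` until a None sentinel."""
    model = load_model()  # paid once per worker, not once per video
    while True:
        task = conn.recv()
        if task is None:
            break
        try:
            conn.send(("result", run_one(model, task)))
        except Exception as exc:  # report the failure and keep serving
            conn.send(("error", str(exc)))
    conn.close()


def demo():
    """Send two tasks through one worker and count how often the model loads."""
    loads = []

    def load_model():
        loads.append("loaded")
        return "model"

    parent_conn, child_conn = mp.Pipe()
    worker = threading.Thread(
        target=persistent_worker,
        args=(child_conn, load_model, lambda model, path: f"{model}:{path}"),
    )
    worker.start()
    replies = []
    for path in ("a.mp4", "b.mp4"):
        parent_conn.send(path)
        replies.append(parent_conn.recv())
    parent_conn.send(None)  # ask the worker to exit
    worker.join(timeout=5)
    return replies, len(loads)
```

Progress and status streaming would still need a separate channel (or tagged messages on the same pipe), which this sketch leaves out.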