
# Technical Debt and Known Issues

**Last Audit:** December 2025 (Revision 2)
**Auditor:** Claude Code + External Senior Review
**Status:** Production-ready with known limitations

## Summary

Full architectural review completed with external validation. The codebase is production-ready with documented limitations.

| Severity | Count | Description |
|---|---|---|
| P0 (Critical) | 0 | None |
| P1 (High) | 0 | None |
| P2 (Medium) | 3 | Temp dir leak, silent empty dataset, brittle git dep |
| P3 (Low) | 6 | Type ignores, SSRF vector, base64 overhead, float64 |

## P2: Silent Empty Dataset on Missing Data Directory

**Location:** `src/stroke_deepisles_demo/data/adapter.py:70`

**Issue:** When `build_local_dataset()` is called with a non-existent directory, `Path.glob()` returns an empty iterator instead of raising an error. The result is an empty `LocalDataset` with 0 cases.

**Example:**

```python
dataset = build_local_dataset(Path("/wrong/path"))
len(dataset)  # Returns 0, no error
```

**Mitigation:**

- UI path (`components.py:35-36`) explicitly checks `if not case_ids` and raises `RuntimeError`
- Pipeline path fails with `IndexError` when accessing a case by index
- CLI shows "Found 0 cases:", which is visible but potentially confusing

**Impact:** User confusion if the data path is misconfigured.

**Recommended Fix (optional):**

```python
def build_local_dataset(data_dir: Path) -> LocalDataset:
    dwi_dir = data_dir / "Images-DWI"
    if not dwi_dir.exists():
        raise FileNotFoundError(f"Data directory not found: {dwi_dir}")
    # ... rest of function
```
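
If this fix lands, a minimal pytest sketch can lock in the fail-loud behavior. This assumes the `FileNotFoundError` change above is applied and that `build_local_dataset` is importable from the adapter module referenced in the Location line:

```python
# Minimal sketch, assuming the FileNotFoundError fix above is applied.
import pytest

from stroke_deepisles_demo.data.adapter import build_local_dataset


def test_missing_data_dir_raises(tmp_path):
    # A path that does not exist should fail loudly,
    # not silently yield an empty dataset.
    with pytest.raises(FileNotFoundError):
        build_local_dataset(tmp_path / "does-not-exist")
```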

**Status:** Acceptable - downstream checks prevent silent failures in all user-facing paths.


## P2: Unbounded Temporary Directory Accumulation (UI Path)

**Location:** `src/stroke_deepisles_demo/pipeline.py:61,106-113` and `src/stroke_deepisles_demo/ui/app.py:51`

**Issue:** The `run_pipeline_on_case()` function creates temporary directories using `tempfile.mkdtemp()` and defaults to `cleanup_staging=False`. The CLI correctly overrides this to `True`, but the Gradio UI does not pass the parameter.

**Evidence:**

```python
# pipeline.py:61 - default is False
cleanup_staging: bool = False,

# cli.py:76 - CLI correctly overrides ✅
cleanup_staging=True,

# app.py:51 - UI does NOT pass it ❌
result = run_pipeline_on_case(case_id, fast=fast_mode, compute_dice=True)
```

**Risk:** In a long-running Gradio app on HF Spaces, each inference request creates a new temporary directory (`deepisles_pipeline_*`) that is never deleted, eventually consuming all available disk space and causing denial of service.

**Mitigation:** HF Spaces containers are ephemeral and restart periodically, which limits accumulation. Heavy usage could still trigger disk pressure.

**Recommended Fix:**

```python
# Option A: Fix the UI to pass cleanup_staging=True
result = run_pipeline_on_case(case_id, fast=fast_mode, compute_dice=True, cleanup_staging=True)

# Option B: Change the default to True (breaking change for programmatic
# users who want to keep staging)
cleanup_staging: bool = True,
```
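
A third option, sketched below under the assumption that staging is only needed for the duration of one run: wrap it in `tempfile.TemporaryDirectory`, which guarantees cleanup even when inference raises. The `_run_pipeline` helper is hypothetical, standing in for the real pipeline internals:

```python
# Option C (sketch): context-managed staging. Cleanup runs on success and on
# exceptions alike, so a long-running UI process cannot leak directories.
# Any outputs the caller needs must be copied out before the context exits.
import tempfile
from pathlib import Path


def run_with_ephemeral_staging(case_id: str, *, fast: bool) -> "PipelineResult":
    with tempfile.TemporaryDirectory(prefix="deepisles_pipeline_") as staging:
        # _run_pipeline is a hypothetical internal that accepts a staging dir
        # and returns results that do not point back into the staging tree.
        return _run_pipeline(case_id, staging_dir=Path(staging), fast=fast)
```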

**Status:** Should fix before production deployment.


## P2: Brittle Git Branch Dependency

**Location:** `pyproject.toml:23`

**Issue:** The project depends on a specific branch of a third-party fork of the `datasets` library.

**Evidence:**

```toml
"datasets @ git+https://github.com/CloseChoice/datasets.git@feat/bids-loader-streaming-upload-fix",
```

**Risk:** If the repository owner deletes the `feat/bids-loader-streaming-upload-fix` branch or force-pushes over it, all builds fail immediately. This endangers reproducibility and CI/CD reliability.

Recommended Fix (in priority order):

  1. Get the fix merged upstream to huggingface/datasets and pin to a released version
  2. Fork the repository to the project's organization for permanence
  3. Vendor the required changes directly

**Status:** Monitor upstream; consider forking if not merged within 30 days.


## P3: Type Ignore Comments (Expected)

These `# type: ignore` comments are correct and expected due to library typing limitations:

### 1. nibabel typing (3 occurrences)

**Location:** `src/stroke_deepisles_demo/metrics.py:26-28`

```python
img = nib.load(path)  # type: ignore[attr-defined]
data = img.get_fdata()  # type: ignore[attr-defined]
zooms = img.header.get_zooms()  # type: ignore[attr-defined]
```

**Reason:** nibabel has incomplete type stubs.

### 2. numpy.ma typing (5 occurrences)

**Location:** `src/stroke_deepisles_demo/ui/viewer.py`

```python
np.ma.masked_where(m_slice == 0, m_slice)  # type: ignore[no-untyped-call]
```

**Reason:** numpy masked-array functions lack complete type annotations.

### 3. pydantic computed_field (2 occurrences)

**Location:** `src/stroke_deepisles_demo/core/config.py:100,106`

```python
@computed_field  # type: ignore[prop-decorator]
@property
def is_hf_spaces(self) -> bool:
```

**Reason:** A typing quirk of the pydantic-settings `computed_field` decorator.

**Status:** These are industry-standard workarounds, not technical debt.


## P3: Latent SSRF Vector in Staging Utility

**Location:** `src/stroke_deepisles_demo/data/staging.py:124-133`

**Issue:** The `_materialize_nifti()` helper contains code that downloads files from arbitrary HTTP/HTTPS URLs using `requests.get()`.

**Evidence:**

```python
elif isinstance(source, str):
    if source.startswith(("http://", "https://")):
        import requests
        response = requests.get(source, stream=True, timeout=30)
        response.raise_for_status()
        # ...writes to local file
```

**Current State:** This code path is currently unreachable because:

- `CaseFiles` from `adapter.py` only contains local `Path` objects
- No user-facing interface accepts URL input

**Risk:** If a future feature allows user-supplied URLs (e.g., a "Load from URL" button), this becomes a Server-Side Request Forgery (SSRF) vulnerability. An attacker could:

- Probe internal networks from the HF Space container
- Access cloud metadata services (169.254.169.254)
- Exfiltrate data to attacker-controlled servers

**Recommended Fix:**

```python
# Option A: Remove the HTTP code path entirely (recommended if not needed)

# Option B: Add a domain allowlist if URL loading is required
import urllib.parse

ALLOWED_DOMAINS = {"huggingface.co", "github.com"}
parsed = urllib.parse.urlparse(source)
if parsed.netloc not in ALLOWED_DOMAINS:
    raise ValueError(f"URL domain not allowed: {parsed.netloc}")
```

**Status:** Acceptable if no URL input features are added; document for future developers.


## P3: Redundant float64 Cast in NIfTI Loading

**Location:** `src/stroke_deepisles_demo/metrics.py:27`

**Issue:** The code explicitly casts NIfTI data to float64, but nibabel's `get_fdata()` already returns float64 by default.

**Evidence:**

```python
data = img.get_fdata().astype(np.float64)  # Redundant - get_fdata() already returns float64
```

**Analysis:**

- The `.astype(np.float64)` call is a no-op in most cases
- It does NOT double memory as initially claimed (nibabel already loads as float64)
- However, float32 would be sufficient for Dice computation and would halve memory usage

**Risk:** Increased memory usage for large volumes. For a 512×512×256 volume:

- float64: ~512 MB
- float32: ~256 MB (50% savings)

**Recommended Fix (optional optimization):**

```python
data = img.get_fdata(dtype=np.float32)  # Use float32 for memory efficiency
```
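
For context, Dice compares binarized masks, so nothing is lost by loading float32. A minimal sketch of the computation (not the project's `metrics.compute_dice`, whose signature is not shown here):

```python
import numpy as np


def dice_coefficient(pred: np.ndarray, gt: np.ndarray) -> float:
    # Dice = 2|A ∩ B| / (|A| + |B|) on binarized masks. Thresholding at 0
    # discards float precision, so float32 and float64 give identical scores.
    a = pred > 0
    b = gt > 0
    denom = int(a.sum()) + int(b.sum())
    if denom == 0:
        return 1.0  # both masks empty: treated as a perfect match by convention
    return 2.0 * int(np.logical_and(a, b).sum()) / denom
```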

**Status:** Low-priority optimization; current behavior is correct but suboptimal.


## P3: Base64 Data URL Overhead for NiiVue Viewer

**Location:** `src/stroke_deepisles_demo/ui/viewer.py:47-51`

**Issue:** NIfTI files are embedded in HTML as base64 data URLs, incurring ~33% size overhead.

**Evidence:**

```python
with nifti_path.open("rb") as f:
    nifti_bytes = f.read()
nifti_b64 = base64.b64encode(nifti_bytes).decode("utf-8")
return f"data:application/octet-stream;base64,{nifti_b64}"
```

**Impact:**

- A 5 MB NIfTI file becomes ~6.7 MB in the DOM
- Large DOM sizes can freeze browser tabs
- Server RAM spikes during encoding

**Why It Exists:** This is the standard pattern for Gradio HTML components; Gradio does not expose a straightforward static file serving API.

**Alternative (significant refactor):**

- Use `gr.File` output and let NiiVue load from a relative URL
- Add a custom FastAPI route to serve files as binary streams (see the sketch below)
- Use Gradio's `add_static_files` (if supported in Gradio 6.x)
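
A sketch of the FastAPI option, assuming the app is mounted via `gr.mount_gradio_app`; `STAGING_DIR` and the route name are hypothetical, not the project's real layout:

```python
# Sketch only: serve NIfTI files as binary streams instead of base64 data URLs.
from pathlib import Path

import gradio as gr
from fastapi import FastAPI, HTTPException
from fastapi.responses import FileResponse

STAGING_DIR = Path("/tmp/deepisles_staging")  # hypothetical output location

app = FastAPI()


@app.get("/nifti/{name}")
def serve_nifti(name: str) -> FileResponse:
    # Resolve strictly inside STAGING_DIR to block path traversal (e.g. "../").
    path = (STAGING_DIR / name).resolve()
    if STAGING_DIR.resolve() not in path.parents or not path.is_file():
        raise HTTPException(status_code=404, detail="Not found")
    return FileResponse(path, media_type="application/octet-stream")


# NiiVue can then load "/nifti/<filename>" as a plain relative URL.
demo = gr.Blocks()  # placeholder for the real create_app()
app = gr.mount_gradio_app(app, demo, path="/")
```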

**Status:** Acceptable for demo; revisit if performance issues arise in production.


## Good Patterns Observed

### Error Handling

- No bare `except:` or `except: pass` statements
- All exceptions are re-raised with context using `from e`
- `logger.exception()` is used before re-raising for a full traceback (sketched below)
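
For illustration, the re-raise pattern described above looks roughly like this (a generic sketch; the function and Docker command are hypothetical, not an excerpt from the codebase):

```python
# Generic sketch of the observed pattern; names are hypothetical.
import logging
import subprocess

logger = logging.getLogger(__name__)


def run_case(case_id: str) -> None:
    try:
        subprocess.run(["docker", "run", "deepisles", case_id], check=True)
    except subprocess.CalledProcessError as e:
        logger.exception("Docker inference failed")  # full traceback first
        raise RuntimeError(f"Inference failed for {case_id}") from e  # keep context
```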

### Fail-Loud Design

- UI components raise `RuntimeError` on missing data
- CLI returns non-zero exit codes on failure
- Index bounds are explicitly validated before access

### Logging

- Consistent use of module-level loggers via `get_logger(__name__)`
- Warnings for skipped cases include counts and examples
- Debug logging for Docker commands and inference paths

### Type Safety

- Proper use of `Path` vs `str` throughout
- `TypedDict` for the `CaseFiles` structure
- Return types explicitly annotated

## Architecture Decisions (Not Debt)

### 1. Dice Score Failure Handling

**Location:** `pipeline.py:129-133`

```python
if compute_dice and ground_truth:
    try:
        dice_score = metrics.compute_dice(...)
    except Exception:
        logger.warning("Failed to compute Dice score", exc_info=True)
```

**Decision:** The pipeline continues if Dice computation fails. This is intentional: inference results are more valuable than failing the entire pipeline over a metrics issue.

### 2. Direct Invocation Module

**Location:** `inference/direct.py`

**Decision:** A separate module for HF Spaces direct Python invocation keeps the Docker path clean and follows the single-responsibility principle.

### 3. Lazy Demo Initialization

**Location:** `ui/app.py:159-168`

```python
_demo: gr.Blocks | None = None

def get_demo() -> gr.Blocks:
    global _demo
    if _demo is None:
        _demo = create_app()
    return _demo
```

**Decision:** Avoids import-time side effects; the demo is only created when first accessed.


## Conclusion

The codebase is well-architected and production-ready with documented limitations.

### Before Production Deployment

1. **Fix P2: Temp dir leak** - add `cleanup_staging=True` to the UI path (a one-line change)

### Monitor / Low Priority

1. **P2: Git dependency** - track the upstream merge; fork if needed
2. **P2: Empty dataset** - already mitigated downstream
3. **P3 issues** - document for future developers; no immediate action

### Verdict

The codebase follows clean code principles with proper error handling, type safety, and fail-loud design. The identified issues are manageable and do not block deployment.