VibecoderMcSwaggins commited on
Commit
785d976
·
unverified ·
1 Parent(s): 262b3cb

fix(security): address P1 findings from post-refactor audit

Browse files

Security hardening: path validation in cleanup, configurable traceback display, pinned GitHub Actions, subject_id regex validation.

.github/workflows/ci.yml CHANGED
@@ -94,7 +94,7 @@ jobs:
94
  - uses: actions/checkout@v4
95
 
96
  - name: Free disk space
97
- uses: jlumbroso/free-disk-space@main
98
  with:
99
  tool-cache: false
100
  android: true
@@ -131,7 +131,7 @@ jobs:
131
  - uses: actions/checkout@v4
132
 
133
  - name: Free disk space
134
- uses: jlumbroso/free-disk-space@main
135
  with:
136
  tool-cache: false
137
  android: true
 
94
  - uses: actions/checkout@v4
95
 
96
  - name: Free disk space
97
+ uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
98
  with:
99
  tool-cache: false
100
  android: true
 
131
  - uses: actions/checkout@v4
132
 
133
  - name: Free disk space
134
+ uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1
135
  with:
136
  tool-cache: false
137
  android: true
docs/bugs/POST-REFACTOR-AUDIT-2024-12-13.md ADDED
@@ -0,0 +1,196 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Post-Refactor Audit Validation - 2024-12-13
2
+
3
+ **Context:** Senior audit performed after data loading refactor (PR #42).
4
+ **Method:** Each claim validated from first principles by reading source code.
5
+
6
+ ---
7
+
8
+ ## P0 Claims - VALIDATION
9
+
10
+ ### CLAIM: Concurrency limiting isn't atomic (TOCTOU race)
11
+ **File:** `src/stroke_deepisles_demo/api/routes.py:92-98`
12
+
13
+ **Validation:** The concurrency check exists (BUG-006 fix):
14
+ ```python
15
+ if store.get_active_job_count() >= get_settings().max_concurrent_jobs:
16
+ raise HTTPException(status_code=503, ...)
17
+ ```
18
+
19
+ **Verdict:** PARTIALLY TRUE - Fix exists but has a small TOCTOU race window between check and `create_job()`. For a demo app, this is acceptable. The worst case is N+1 jobs run instead of N.
20
+
21
+ **Status:** DOCUMENTED - Acceptable for demo, not a P0 blocker.
22
+
23
+ ---
24
+
25
+ ### CLAIM: No authn/authz and no rate limiting
26
+ **File:** `src/stroke_deepisles_demo/api/main.py`
27
+
28
+ **Validation:** Already documented in `docs/bugs/BUGS-AUDIT-2024-12-12.md`:
29
+ > "BUG-007 (No Authentication) was removed from audit - intentional design for public demo."
30
+
31
+ **Verdict:** FALSE - This is intentional. The app is a public demo on HuggingFace Spaces.
32
+
33
+ **Status:** NOT A BUG - Intentional design decision.
34
+
35
+ ---
36
+
37
+ ## P1 Claims - VALIDATION
38
+
39
+ ### CLAIM: Path traversal via HF subject_id
40
+ **File:** `src/stroke_deepisles_demo/data/loader.py:104-114`
41
+
42
+ **Validation:** TRUE - `subject_id` from HF dataset is used directly in path construction:
43
+ ```python
44
+ case_dir = self._temp_dir / subject_id # No validation
45
+ ```
46
+
47
+ **Mitigation:** The `subject_id` comes from the HuggingFace dataset we control, not user input. All ISLES24 subject IDs are `sub-strokeXXXX` format.
48
+
49
+ **Verdict:** TRUE (theoretical) but MITIGATED in practice. Adding validation is defense-in-depth.
50
+
51
+ **Status:** FIX - Add regex validation for defense-in-depth.
52
+
53
+ ---
54
+
55
+ ### CLAIM: Gradio state arbitrary delete
56
+ **File:** `src/stroke_deepisles_demo/ui/app.py:55-66`
57
+
58
+ **Validation:** TRUE - `previous_results_dir` from Gradio state used in `shutil.rmtree()`:
59
+ ```python
60
+ def _cleanup_previous_results(previous_results_dir: str | None) -> None:
61
+ prev_path = Path(previous_results_dir)
62
+ if prev_path.exists():
63
+ shutil.rmtree(prev_path) # No validation!
64
+ ```
65
+
66
+ Gradio state is client-side (base64 encoded JSON), potentially manipulable.
67
+
68
+ **Verdict:** TRUE - Should validate path is under allowed root.
69
+
70
+ **Status:** FIX - Add path validation.
71
+
72
+ ---
73
+
74
+ ### CLAIM: show_error=True exposes tracebacks
75
+ **File:** `src/stroke_deepisles_demo/ui/app.py:288`
76
+
77
+ **Validation:** TRUE
78
+ ```python
79
+ get_demo().launch(
80
+ show_error=True, # Show full Python tracebacks in UI for debugging
81
+ )
82
+ ```
83
+
84
+ **Verdict:** TRUE - Information disclosure risk. Should be configurable.
85
+
86
+ **Status:** FIX - Make configurable via settings, default to False in production.
87
+
88
+ ---
89
+
90
+ ### CLAIM: Unpinned GitHub Actions (supply-chain risk)
91
+ **File:** `.github/workflows/ci.yml:97,134`
92
+
93
+ **Validation:** TRUE
94
+ ```yaml
95
+ uses: jlumbroso/free-disk-space@main # Unpinned!
96
+ ```
97
+
98
+ **Verdict:** TRUE - Supply-chain risk. Should pin to commit SHA.
99
+
100
+ **Status:** FIX - Pin to commit SHA.
101
+
102
+ ---
103
+
104
+ ### CLAIM: DeepISLES output paths mismatch
105
+ **Files:** Multiple
106
+
107
+ **Validation:** Need to verify actual output structure.
108
+
109
+ **Status:** DEFERRED - Requires integration testing to validate.
110
+
111
+ ---
112
+
113
+ ## P2 Claims - VALIDATION
114
+
115
+ ### CLAIM: Dead Settings (hf_dataset_id, etc.)
116
+ **Files:** `src/stroke_deepisles_demo/core/config.py`
117
+
118
+ **Validation:** TRUE - `grep` shows these settings are defined but never accessed:
119
+ - `hf_dataset_id` - loader.py hardcodes `DEFAULT_HF_DATASET`
120
+ - `hf_token` - not used in new loader
121
+ - `deepisles_docker_image` - inference uses hardcoded default
122
+ - `deepisles_timeout_seconds` - not threaded through
123
+
124
+ **Verdict:** TRUE - Tech debt but not a bug. Settings exist for future use.
125
+
126
+ **Status:** DOCUMENTED - Low priority cleanup.
127
+
128
+ ---
129
+
130
+ ### CLAIM: Case ID reload per request
131
+ **File:** `src/stroke_deepisles_demo/api/routes.py:101`
132
+
133
+ **Validation:** TRUE - Each request calls `list_case_ids()` which may reload dataset.
134
+
135
+ **Verdict:** TRUE but ACCEPTABLE - For a demo with 149 cases, this is fine.
136
+
137
+ **Status:** DOCUMENTED - Acceptable for demo scale.
138
+
139
+ ---
140
+
141
+ ## P3 Claims - VALIDATION
142
+
143
+ ### CLAIM: Dockerfile uv unpinned
144
+ **File:** `Dockerfile:35`
145
+
146
+ **Validation:** TRUE
147
+ ```dockerfile
148
+ RUN pip install --no-cache-dir uv
149
+ ```
150
+
151
+ **Verdict:** TRUE but LOW RISK - `uv.lock` pins actual dependencies. Only `uv` tool version floats.
152
+
153
+ **Status:** DOCUMENTED - Low priority.
154
+
155
+ ---
156
+
157
+ ### CLAIM: Docs out of date
158
+ **Files:** `DATA-PIPELINE.md`, `docs/specs/00-data-loading-refactor.md`
159
+
160
+ **Validation:** TRUE - Docs reference removed HF parquet workaround architecture.
161
+
162
+ **Status:** FIX - Update docs.
163
+
164
+ ---
165
+
166
+ ## P4 Claims - VALIDATION
167
+
168
+ ### CLAIM: DataLoadError unused
169
+ **File:** `src/stroke_deepisles_demo/core/exceptions.py`
170
+
171
+ **Validation:** TRUE - Defined but no longer imported in data layer after refactor.
172
+
173
+ **Status:** CLEANUP - Remove or use consistently.
174
+
175
+ ---
176
+
177
+ ## Summary
178
+
179
+ | Finding | Severity | Status | Action |
180
+ |---------|----------|--------|--------|
181
+ | TOCTOU race in concurrency | P2 | Documented | Accept for demo |
182
+ | No auth | N/A | Documented | Intentional |
183
+ | Path traversal (subject_id) | P1 | New | Fix (defense-in-depth) |
184
+ | Gradio state arbitrary delete | P1 | New | Fix |
185
+ | show_error=True tracebacks | P1 | New | Fix |
186
+ | Unpinned GitHub Actions | P1 | New | Fix |
187
+ | Dead Settings | P3 | Documented | Accept |
188
+ | Case ID reload | P3 | Documented | Accept |
189
+ | Docs out of date | P3 | New | Fix |
190
+ | DataLoadError unused | P4 | New | Cleanup |
191
+
192
+ ---
193
+
194
+ **Audited by:** Claude Code
195
+ **Date:** 2024-12-13
196
+ **Branch:** fix/post-refactor-audit
src/stroke_deepisles_demo/core/config.py CHANGED
@@ -108,6 +108,8 @@ class Settings(BaseSettings):
108
  gradio_server_name: str = "0.0.0.0"
109
  gradio_server_port: int = 7860
110
  gradio_share: bool = False
 
 
111
 
112
  @computed_field # type: ignore[prop-decorator]
113
  @property
 
108
  gradio_server_name: str = "0.0.0.0"
109
  gradio_server_port: int = 7860
110
  gradio_share: bool = False
111
+ # Show full Python tracebacks in Gradio UI (security: disable in production)
112
+ gradio_show_error: bool = False
113
 
114
  @computed_field # type: ignore[prop-decorator]
115
  @property
src/stroke_deepisles_demo/data/loader.py CHANGED
@@ -2,6 +2,7 @@
2
 
3
  from __future__ import annotations
4
 
 
5
  import shutil
6
  import tempfile
7
  from dataclasses import dataclass, field
@@ -11,6 +12,10 @@ from typing import TYPE_CHECKING, Protocol, Self
11
  from stroke_deepisles_demo.core.logging import get_logger
12
  from stroke_deepisles_demo.core.types import CaseFiles # noqa: TC001
13
 
 
 
 
 
14
  if TYPE_CHECKING:
15
  from datasets import Dataset as HFDataset
16
 
@@ -103,6 +108,12 @@ class HuggingFaceDatasetWrapper:
103
  row = self.dataset[idx]
104
  subject_id = row["subject_id"]
105
 
 
 
 
 
 
 
106
  # Prepare temp dir
107
  if self._temp_dir is None:
108
  self._temp_dir = Path(tempfile.mkdtemp(prefix="isles24_hf_wrapper_"))
 
2
 
3
  from __future__ import annotations
4
 
5
+ import re
6
  import shutil
7
  import tempfile
8
  from dataclasses import dataclass, field
 
12
  from stroke_deepisles_demo.core.logging import get_logger
13
  from stroke_deepisles_demo.core.types import CaseFiles # noqa: TC001
14
 
15
+ # Security: Regex for valid ISLES24 subject IDs (defense-in-depth)
16
+ # Expected format: sub-strokeXXXX (e.g., sub-stroke0001)
17
+ _SAFE_SUBJECT_ID_PATTERN = re.compile(r"^sub-stroke\d{4}$")
18
+
19
  if TYPE_CHECKING:
20
  from datasets import Dataset as HFDataset
21
 
 
108
  row = self.dataset[idx]
109
  subject_id = row["subject_id"]
110
 
111
+ # Security: Validate subject_id before using in path (defense-in-depth)
112
+ if not _SAFE_SUBJECT_ID_PATTERN.match(subject_id):
113
+ raise ValueError(
114
+ f"Invalid subject_id format: {subject_id!r}. Expected format: sub-strokeXXXX"
115
+ )
116
+
117
  # Prepare temp dir
118
  if self._temp_dir is None:
119
  self._temp_dir = Path(tempfile.mkdtemp(prefix="isles24_hf_wrapper_"))
src/stroke_deepisles_demo/ui/app.py CHANGED
@@ -53,10 +53,30 @@ def initialize_case_selector() -> gr.Dropdown:
53
 
54
 
55
  def _cleanup_previous_results(previous_results_dir: str | None) -> None:
56
- """Clean up previous results directory (per-session, thread-safe)."""
 
 
 
 
57
  if previous_results_dir is None:
58
  return
59
- prev_path = Path(previous_results_dir)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
60
  if prev_path.exists():
61
  try:
62
  shutil.rmtree(prev_path)
@@ -285,5 +305,5 @@ if __name__ == "__main__":
285
  share=settings.gradio_share,
286
  theme=gr.themes.Soft(),
287
  css="footer {visibility: hidden}",
288
- show_error=True, # Show full Python tracebacks in UI for debugging
289
  )
 
53
 
54
 
55
  def _cleanup_previous_results(previous_results_dir: str | None) -> None:
56
+ """Clean up previous results directory (per-session, thread-safe).
57
+
58
+ Security: Validates path is under allowed results root to prevent
59
+ arbitrary file deletion via manipulated Gradio state.
60
+ """
61
  if previous_results_dir is None:
62
  return
63
+
64
+ from stroke_deepisles_demo.core.config import get_settings
65
+
66
+ prev_path = Path(previous_results_dir).resolve()
67
+ allowed_root = get_settings().results_dir.resolve()
68
+
69
+ # Security: Ensure path is under allowed root (prevent path traversal)
70
+ try:
71
+ prev_path.relative_to(allowed_root)
72
+ except ValueError:
73
+ logger.warning(
74
+ "Refusing to cleanup path outside allowed root: %s (root: %s)",
75
+ prev_path,
76
+ allowed_root,
77
+ )
78
+ return
79
+
80
  if prev_path.exists():
81
  try:
82
  shutil.rmtree(prev_path)
 
305
  share=settings.gradio_share,
306
  theme=gr.themes.Soft(),
307
  css="footer {visibility: hidden}",
308
+ show_error=settings.gradio_show_error, # Default False for security
309
  )