# API Reference — `core/`

## `core/__init__.py` — public exports

```python
from core import (
    # helpers.py
    run, jload, write_tmp, relpath, have_binary,
    # models.py
    make_finding, dedup_findings, sort_findings,
    # hf.py
    hf_space_to_git, list_user_spaces, comment_on_space,
    # bootstrap.py
    bootstrap_binaries,
    # baseline.py
    make_fingerprint, save_baseline, load_baseline,
    filter_by_baseline, parse_ignore_file, apply_ignore_rules,
)
```

---

## `core/scanner.py`

### `scan_repo()`



```python
def scan_repo(
    repo_url: str,
    hf_token: Optional[str] = None,
    deep_history: bool = False,
    run_security: bool = True,
    run_performance: bool = True,
    run_llm: bool = True,
    max_workers: int = 8,
    progress_cb: Optional[Callable[[float, str], None]] = None,
) -> Tuple[List[dict], List[str]]
```


Clone or copy the target, run all enabled scanners in parallel, and return `(findings, log)`.

**Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `repo_url` | `str` | — | HTTPS URL (HF Space or git repo) or local directory path |
| `hf_token` | `Optional[str]` | `None` | HF Bearer token for private/gated repos |
| `deep_history` | `bool` | `False` | If `True`, run full `git clone` (no `--depth 1`) and include gitleaks |
| `run_security` | `bool` | `True` | Enable security scanners (Semgrep, bandit, pip-audit, …) |
| `run_performance` | `bool` | `True` | Enable performance scanners (Semgrep:Perf, ruff) |
| `run_llm` | `bool` | `True` | Enable LLM/agent scanners (Semgrep:LLM, agent-audit) |
| `max_workers` | `int` | `8` | Maximum thread-pool workers |
| `progress_cb` | `Optional[Callable[[float, str], None]]` | `None` | Called with `(fraction: float, description: str)` as each scanner completes |

**Returns:** `(findings, log)` where `findings` is a deduplicated, sorted `List[dict]` and `log` is a `List[str]` of scanner messages (first entry is the summary `"OK (N unique findings)"`).

**Error handling:** Never raises. Returns `([], [error_message])` on clone failure or invalid target.

**Example:**

```python
from core.scanner import scan_repo

findings, log = scan_repo(
    "https://huggingface.co/spaces/owner/myspace",
    run_performance=False,
    progress_cb=lambda f, d: print(f"{f:.0%} {d}"),
)
print(log[0])   # "OK (23 unique findings)"
```

---

## `core/baseline.py`

### `make_fingerprint(finding)`



```python
def make_fingerprint(finding: dict) -> str
```



Return a 16-hex-char deterministic fingerprint:



```
sha256( tool:rule:file:line:message )[:16]
```
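The formula above can be sketched roughly as follows. This is an illustration only: the finding keys (`tool`, `rule`, `file`, `line`, `message`), the `:` separator, and the name `fingerprint_sketch` are assumptions read off the formula, not the actual implementation in `core/baseline.py`.

```python
import hashlib


def fingerprint_sketch(finding: dict) -> str:
    """Hypothetical re-creation of the documented fingerprint formula."""
    # Assumption: the five fields are joined with ":" in this exact order.
    fields = ("tool", "rule", "file", "line", "message")
    joined = ":".join(str(finding.get(name, "")) for name in fields)
    # SHA-256 hex digest, truncated to the first 16 hex characters.
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16]
```

Because the line number is part of the hashed string, a finding that moves to a different line would get a new fingerprint under this sketch.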



### `save_baseline(findings, path)`



```python
def save_baseline(findings: List[dict], path: Union[str, Path]) -> None
```

Persist fingerprints to a JSON file at `path`. Overwrites if it exists. The JSON format:

```json
{
  "created": "2025-01-01T12:00:00Z",
  "scanner_version": "4.0.0",
  "fingerprints": ["abc123...", ...]
}
```

### `load_baseline(path)`



```python
def load_baseline(path: Union[str, Path]) -> Set[str]
```



Return the set of fingerprint strings from a saved baseline JSON file.
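As a rough illustration of the save/load round trip, the sketch below writes and reads a file in the JSON shape shown above. The function names, the sorted fingerprint order, and the hard-coded version string are assumptions made for the example; the real functions take findings rather than pre-computed fingerprints.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Iterable, Set, Union


def save_fingerprints_sketch(fingerprints: Iterable[str], path: Union[str, Path]) -> None:
    """Hypothetical writer producing the documented baseline JSON shape."""
    payload = {
        "created": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "scanner_version": "4.0.0",  # assumption: copied from the sample above
        "fingerprints": sorted(set(fingerprints)),
    }
    Path(path).write_text(json.dumps(payload, indent=2), encoding="utf-8")


def load_fingerprints_sketch(path: Union[str, Path]) -> Set[str]:
    """Hypothetical reader returning the stored fingerprints as a set."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return set(data.get("fingerprints", []))


# Round trip through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    baseline_path = Path(tmp) / "baseline.json"
    save_fingerprints_sketch(["abc", "def", "abc"], baseline_path)
    loaded = load_fingerprints_sketch(baseline_path)
```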



### `filter_by_baseline(findings, baseline)`



```python
def filter_by_baseline(
    findings: List[dict],
    baseline: Set[str],
) -> Tuple[List[dict], List[dict]]
```

Return `(kept, suppressed)` — findings whose fingerprints are **not** in `baseline` vs those that are.
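The partition described above might look like the following sketch. The `fingerprint` callable is passed in explicitly here to keep the example self-contained; in the real module it would presumably be `make_fingerprint`. The name `split_by_baseline_sketch` is hypothetical.

```python
from typing import Callable, List, Set, Tuple


def split_by_baseline_sketch(
    findings: List[dict],
    baseline: Set[str],
    fingerprint: Callable[[dict], str],
) -> Tuple[List[dict], List[dict]]:
    """Split findings into (kept, suppressed) by fingerprint membership."""
    kept, suppressed = [], []
    for finding in findings:
        # Known fingerprints are suppressed; everything else is new and kept.
        target = suppressed if fingerprint(finding) in baseline else kept
        target.append(finding)
    return kept, suppressed


# Toy fingerprint for the demo only (assumption): rule and file joined by "@".
demo_fp = lambda f: f"{f['rule']}@{f['file']}"
demo_findings = [
    {"rule": "B101", "file": "app.py"},
    {"rule": "B608", "file": "db.py"},
]
kept, suppressed = split_by_baseline_sketch(demo_findings, {"B101@app.py"}, demo_fp)
```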

### `parse_ignore_file(path)`

```python
def parse_ignore_file(path: Union[str, Path]) -> List[IgnoreRule]
```

Parse a `.hfscanignore` file and return a list of `IgnoreRule` dataclass instances.

**`.hfscanignore` syntax:**

```
# comment
tests/                         # suppress all findings under tests/
* rule:B101                    # suppress rule everywhere
src/legacy/ severity:INFO      # suppress INFO severity under path
src/gen/ rule:B608             # suppress rule under path
```
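A minimal parser for this syntax could look like the sketch below. It works on a string rather than a file path, uses a stand-in dataclass named `IgnoreRuleSketch`, and assumes that `#` starts a comment anywhere on a line and that `*` maps to an empty path prefix. None of this is confirmed against the real `parse_ignore_file`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IgnoreRuleSketch:
    """Stand-in for IgnoreRule; empty strings mean 'no filter'."""
    path_prefix: str = ""
    rule_id: str = ""
    severity: str = ""


def parse_ignore_text_sketch(text: str) -> List[IgnoreRuleSketch]:
    rules = []
    for raw in text.splitlines():
        # Assumption: everything after '#' is a comment, even mid-line.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        rule = IgnoreRuleSketch()
        for token in line.split():
            if token.startswith("rule:"):
                rule.rule_id = token[len("rule:"):]
            elif token.startswith("severity:"):
                rule.severity = token[len("severity:"):]
            elif token != "*":
                # Any other token is treated as the path prefix.
                rule.path_prefix = token
        rules.append(rule)
    return rules


sample = """
# comment
tests/                         # suppress all findings under tests/
* rule:B101                    # suppress rule everywhere
src/legacy/ severity:INFO      # suppress INFO severity under path
src/gen/ rule:B608             # suppress rule under path
"""
parsed = parse_ignore_text_sketch(sample)
```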

### `apply_ignore_rules(findings, rules)`

```python
def apply_ignore_rules(
    findings: List[dict],
    rules: List[IgnoreRule],
) -> Tuple[List[dict], int]
```

Return `(kept_findings, ignored_count)`. Rules are evaluated in order; first match wins.

### `IgnoreRule` dataclass

```python
@dataclass
class IgnoreRule:
    path_prefix: str        # "" = wildcard (applies everywhere)
    rule_id: str            # "" = no rule filter
    severity: str           # "" = no severity filter
```
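To make the matching semantics concrete, here is one way `apply_ignore_rules` could be approximated with these three fields. The finding keys (`file`, `rule`, `severity`), the prefix match on `file`, and the helper names are assumptions; the dataclass is redeclared with defaults only so the block runs on its own.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class IgnoreRule:
    path_prefix: str = ""   # "" = wildcard (applies everywhere)
    rule_id: str = ""       # "" = no rule filter
    severity: str = ""      # "" = no severity filter


def rule_matches_sketch(rule: IgnoreRule, finding: dict) -> bool:
    """A rule matches when every non-empty field agrees with the finding."""
    if rule.path_prefix and not finding.get("file", "").startswith(rule.path_prefix):
        return False
    if rule.rule_id and finding.get("rule") != rule.rule_id:
        return False
    if rule.severity and finding.get("severity") != rule.severity:
        return False
    return True


def apply_rules_sketch(findings: List[dict], rules: List[IgnoreRule]) -> Tuple[List[dict], int]:
    # any() stops at the first matching rule, mirroring "first match wins".
    kept = [f for f in findings if not any(rule_matches_sketch(r, f) for r in rules)]
    return kept, len(findings) - len(kept)


demo_rules = [IgnoreRule(path_prefix="tests/"), IgnoreRule(rule_id="B101")]
demo_findings = [
    {"file": "tests/test_app.py", "rule": "B608", "severity": "ERROR"},
    {"file": "src/app.py", "rule": "B101", "severity": "INFO"},
    {"file": "src/db.py", "rule": "B608", "severity": "ERROR"},
]
kept, ignored_count = apply_rules_sketch(demo_findings, demo_rules)
```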

---

## `core/models.py`

### `make_finding()`



```python
def make_finding(
    tool: str,
    rule: str,
    severity: str,
    file: str,
    line: int,
    message: str,
    owasp: Union[str, List[str]],
    category: str = "security",
    confidence: str = None,
    remediation: str = None,
) -> dict
```


Build a normalized finding dict. All scanner runners call this to ensure uniform output shape.

- `confidence`: if `None`, looked up from `TOOL_DEFAULT_CONFIDENCE`; falls back to `"possible"`.
- `remediation`: if `None`, looked up from `report.remediation.REMEDIATION` by `rule`; falls back to `""`.
- `owasp`: `str` is automatically wrapped in a list.
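The three defaulting rules above can be sketched as follows. The output dict keys simply mirror the parameter names, and the contents of the two lookup tables are placeholders invented for the example; the real tables live in `core/models.py` and `report.remediation`.

```python
from typing import List, Optional, Union

# Placeholder tables (assumptions); the real ones are defined elsewhere.
TOOL_DEFAULT_CONFIDENCE = {"bandit": "likely"}
REMEDIATION = {"B608": "Use parameterized queries."}


def make_finding_sketch(
    tool: str,
    rule: str,
    severity: str,
    file: str,
    line: int,
    message: str,
    owasp: Union[str, List[str]],
    category: str = "security",
    confidence: Optional[str] = None,
    remediation: Optional[str] = None,
) -> dict:
    if confidence is None:
        # Per-tool default, else "possible".
        confidence = TOOL_DEFAULT_CONFIDENCE.get(tool, "possible")
    if remediation is None:
        # Per-rule text, else an empty string.
        remediation = REMEDIATION.get(rule, "")
    return {
        "tool": tool,
        "rule": rule,
        "severity": severity,
        "file": file,
        "line": line,
        "message": message,
        # A bare string is wrapped so the field is always a list.
        "owasp": [owasp] if isinstance(owasp, str) else list(owasp),
        "category": category,
        "confidence": confidence,
        "remediation": remediation,
    }
```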

### `sort_findings(findings)`



```python
def sort_findings(findings: List[dict]) -> List[dict]
```



Sort by `severity` (ERROR < WARNING < INFO) → `confidence` (confirmed < likely < possible) → `file` → `line`.
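The ordering can be expressed as a tuple sort key. In the sketch below the rank tables are trimmed to the values named above (the `INFO` rank of `2` and the fallback rank of `99` for unknown values are assumptions), and `sort_sketch` is a hypothetical name.

```python
from typing import List

# Lower rank sorts first. Values here are illustrative, not the real tables.
SEVERITY_RANK = {"ERROR": 0, "WARNING": 1, "INFO": 2}
CONFIDENCE_RANK = {"confirmed": 0, "likely": 1, "possible": 2}


def sort_sketch(findings: List[dict]) -> List[dict]:
    """Sort by severity, then confidence, then file, then line."""
    return sorted(
        findings,
        key=lambda f: (
            SEVERITY_RANK.get(f.get("severity"), 99),
            CONFIDENCE_RANK.get(f.get("confidence"), 99),
            f.get("file", ""),
            f.get("line", 0),
        ),
    )
```

`sorted` is stable, so findings that tie on all four keys keep their input order.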



### `dedup_findings(findings)`



```python
def dedup_findings(findings: List[dict]) -> List[dict]
```

Remove duplicates keyed on `(tool, file, line, message)`. Preserves first occurrence order.
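An order-preserving de-duplication on that key might be written like this; `dedup_sketch` is a hypothetical name and the use of `dict.get` for missing keys is an assumption.

```python
from typing import List


def dedup_sketch(findings: List[dict]) -> List[dict]:
    """Keep the first finding for each (tool, file, line, message) key."""
    seen = set()
    unique = []
    for finding in findings:
        key = (
            finding.get("tool"),
            finding.get("file"),
            finding.get("line"),
            finding.get("message"),
        )
        if key not in seen:
            seen.add(key)
            unique.append(finding)
    return unique
```

Note that `rule` is not part of the key, so two findings from the same tool at the same location with the same message collapse into one even if their rule IDs differ.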

### Constants

```python
SEVERITY_RANK: dict      # {"ERROR": 0, "HIGH": 0, "WARNING": 1, ...}
CONFIDENCE_RANK: dict    # {"confirmed": 0, "likely": 1, "possible": 2}
TOOL_DEFAULT_CONFIDENCE: dict  # per-tool default confidence levels
FORBIDDEN_FILES: list    # file names that are always flagged
```

---

## `core/helpers.py`

### `run(cmd, cwd=None, timeout=300)`

```python
def run(cmd: List[str], cwd: str = None, timeout: int = 300) -> Tuple[str, int]
```

Run a subprocess. Returns `(stdout_stripped, returncode)`. Never raises.

| Exit code | Meaning |
|-----------|---------|
| Normal | Return code from the process |
| `124` | Timed out (`subprocess.TimeoutExpired`) |
| `127` | Binary not found (`FileNotFoundError`) |
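The exit-code mapping in the table could be implemented along these lines. This is a sketch under assumptions: the name `run_sketch` is hypothetical, and returning an empty string for stdout on timeout or missing binary is a guess rather than documented behavior.

```python
import subprocess
import sys
from typing import List, Optional, Tuple


def run_sketch(cmd: List[str], cwd: Optional[str] = None, timeout: int = 300) -> Tuple[str, int]:
    """Run a command and map the two failure modes to shell-style codes."""
    try:
        proc = subprocess.run(
            cmd, cwd=cwd, capture_output=True, text=True, timeout=timeout
        )
        return proc.stdout.strip(), proc.returncode
    except subprocess.TimeoutExpired:
        return "", 124  # same code the coreutils `timeout` command uses
    except FileNotFoundError:
        return "", 127  # same code a shell uses for "command not found"


# Usage: run the current interpreter so the example works on any platform.
out, code = run_sketch([sys.executable, "-c", "print('hi')"])
```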

### `jload(txt)`

```python
def jload(txt: str) -> Optional[Any]
```

Parse JSON from a string. Returns `None` for empty strings or parse errors.
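A tolerant JSON loader with this contract can be sketched in a few lines. The name `jload_sketch` is hypothetical, and treating whitespace-only input as empty is an assumption.

```python
import json
from typing import Any, Optional


def jload_sketch(txt: str) -> Optional[Any]:
    """Return parsed JSON, or None when the input is empty or invalid."""
    if not txt or not txt.strip():
        return None
    try:
        return json.loads(txt)
    except (json.JSONDecodeError, TypeError):
        return None
```

One consequence of this contract is that the valid JSON document `null` and a parse failure both come back as `None`, so callers cannot tell them apart.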

### `write_tmp(content, suffix=".yaml")`



```python
def write_tmp(content: str, suffix: str = ".yaml") -> str
```



Write `content` to a temp file and return its absolute path.



### `relpath(base, p)`



```python
def relpath(base: str, p: str) -> str
```

Return `p` relative to `base`. If `p` is not under `base`, return `str(p)` unchanged.
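One possible way to get this behavior with the standard library is shown below. The containment check via `os.path.commonpath` is an assumption about how "under `base`" is decided, and `relpath_sketch` is a hypothetical name.

```python
import os


def relpath_sketch(base: str, p: str) -> str:
    """Return p relative to base, or p unchanged when it lies outside base."""
    base_abs = os.path.abspath(base)
    p_abs = os.path.abspath(p)
    try:
        if os.path.commonpath([base_abs, p_abs]) == base_abs:
            return os.path.relpath(p_abs, base_abs)
    except ValueError:
        # commonpath raises ValueError for paths on different Windows drives.
        pass
    return str(p)
```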

### `have_binary(name)`



```python
def have_binary(name: str) -> bool
```



Return `True` if `name` is on `PATH` (via `shutil.which`).



---



## `core/hf.py`



### `hf_space_to_git(url, token=None)`



```python
def hf_space_to_git(url: str, token: str = None) -> Optional[str]
```

Convert `https://huggingface.co/spaces/<ns>/<name>` to a git-cloneable URL. Returns `None` for non-HF URLs. If `token` is provided, embeds it as HTTP basic auth (`USER:<token>@`).
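The conversion might be approximated as below. The exact shape of the returned URL, the regular expression, and the decision to reject URLs with extra path segments are all assumptions for illustration; only the `USER:<token>@` prefix comes from the description above.

```python
import re
from typing import Optional

# Assumption: only bare Space URLs (optionally with a trailing slash) qualify.
_SPACE_URL = re.compile(r"^https://huggingface\.co/spaces/([^/\s]+)/([^/\s]+?)/?$")


def space_to_git_sketch(url: str, token: Optional[str] = None) -> Optional[str]:
    """Hypothetical conversion of a Space page URL to a cloneable URL."""
    match = _SPACE_URL.match(url.strip())
    if match is None:
        return None  # not a Hugging Face Space URL
    namespace, name = match.groups()
    auth = f"USER:{token}@" if token else ""
    return f"https://{auth}huggingface.co/spaces/{namespace}/{name}"
```

Since the token ends up inside the URL, any log line or error message that prints the clone URL would also expose it, so callers may want to redact it before logging.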

### `list_user_spaces(username, hf_token=None, limit=500)`



```python
def list_user_spaces(
    username: str,
    hf_token: str = None,
    limit: int = 500,
) -> Tuple[List[str], str]
```


Return `(space_urls, status_message)`. Queries `https://huggingface.co/api/spaces?author=<username>`. Returns `([], error_message)` on HTTP error or network failure.

---

## `core/bootstrap.py`

### `bootstrap_binaries()`



```python
def bootstrap_binaries() -> dict
```



Download `gitleaks` and `hadolint` binaries for the current platform if not already on PATH. Returns a dict with keys `"gitleaks"` and `"hadolint"`, values `"ok"` / `"already installed"` / `"error: ..."`.



Binaries are placed in:

- **Windows**: `<venv>\Scripts\` (next to `python.exe`, so `shutil.which` finds them)

- **macOS/Linux**: `~/.local/bin/`



**Versions:** `GITLEAKS_VERSION = "8.18.4"`, `HADOLINT_VERSION = "2.12.0"` (defined as module constants).