# API Reference: `core/`

## `core/__init__.py`: public exports

```python
from core import (
    # helpers.py
    run, jload, write_tmp, relpath, have_binary,
    # models.py
    make_finding, dedup_findings, sort_findings,
    # hf.py
    hf_space_to_git, list_user_spaces, comment_on_space,
    # bootstrap.py
    bootstrap_binaries,
    # baseline.py
    make_fingerprint, save_baseline, load_baseline,
    filter_by_baseline, parse_ignore_file, apply_ignore_rules,
)
```

---

## `core/scanner.py`

### `scan_repo()`

```python
def scan_repo(
    repo_url: str,
    hf_token: Optional[str] = None,
    deep_history: bool = False,
    run_security: bool = True,
    run_performance: bool = True,
    run_llm: bool = True,
    max_workers: int = 8,
    progress_cb: Optional[Callable[[float, str], None]] = None,
) -> Tuple[List[dict], List[str]]
```

Clone or copy the target, run all enabled scanners in parallel, and return `(findings, log)`.

**Parameters:**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `repo_url` | `str` | (required) | HTTPS URL (HF Space or git repo) or local directory path |
| `hf_token` | `Optional[str]` | `None` | HF Bearer token for private/gated repos |
| `deep_history` | `bool` | `False` | If `True`, run full `git clone` (no `--depth 1`) and include gitleaks |
| `run_security` | `bool` | `True` | Enable security scanners (Semgrep, bandit, pip-audit, …) |
| `run_performance` | `bool` | `True` | Enable performance scanners (Semgrep:Perf, ruff) |
| `run_llm` | `bool` | `True` | Enable LLM/agent scanners (Semgrep:LLM, agent-audit) |
| `max_workers` | `int` | `8` | Maximum thread-pool workers |
| `progress_cb` | `Optional[Callable]` | `None` | Called with `(fraction: float, description: str)` as each scanner completes |

**Returns:** `(findings, log)` where `findings` is a deduplicated, sorted `List[dict]` and `log` is a `List[str]` of scanner messages (the first entry is the summary `"OK (N unique findings)"`).

**Error handling:** Never raises. Returns `([], [error_message])` on clone failure or invalid target.

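
Because `scan_repo()` never raises, a caller has to inspect the returned log to tell success from failure. The sketch below shows one way to do that, assuming the success summary always has the form `"OK (N unique findings)"` described above. `parse_summary` is a hypothetical helper, not part of `core`, and the log entries are made-up stand-ins rather than real scanner output.

```python
import re
from typing import List, Optional

# Assumed shape of the success summary, taken from the Returns note above.
_SUMMARY_RE = re.compile(r"^OK \((\d+) unique findings?\)$")


def parse_summary(log: List[str]) -> Optional[int]:
    """Return the finding count from log[0], or None if the scan failed.

    Hypothetical helper: any first entry that does not match the documented
    summary format is treated as an error message.
    """
    if not log:
        return None
    match = _SUMMARY_RE.match(log[0])
    return int(match.group(1)) if match else None


# Made-up stand-ins for the log that scan_repo() would return.
ok_log = ["OK (23 unique findings)", "example scanner message"]
failed_log = ["example clone error message"]

print(parse_summary(ok_log))      # 23
print(parse_summary(failed_log))  # None
```

Matching on the summary string is brittle if the format changes between versions, so treat this as an illustration of the contract rather than a stable interface.
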

**Example:**

```python
from core.scanner import scan_repo

findings, log = scan_repo(
    "https://huggingface.co/spaces/owner/myspace",
    run_performance=False,
    progress_cb=lambda f, d: print(f"{f:.0%} {d}"),
)
print(log[0])  # "OK (23 unique findings)"
```

---

## `core/baseline.py`

### `make_fingerprint(finding)`

```python
def make_fingerprint(finding: dict) -> str
```

Return a 16-hex-char deterministic fingerprint:

```
sha256( tool:rule:file:line:message )[:16]
```

### `save_baseline(findings, path)`

```python
def save_baseline(findings: List[dict], path: Union[str, Path]) -> None
```

Persist fingerprints to a JSON file at `path`. Overwrites the file if it exists. The JSON format:

```json
{
  "created": "2025-01-01T12:00:00Z",
  "scanner_version": "4.0.0",
  "fingerprints": ["abc123...", ...]
}
```

### `load_baseline(path)`

```python
def load_baseline(path: Union[str, Path]) -> Set[str]
```

Return the set of fingerprint strings from a saved baseline JSON file.

### `filter_by_baseline(findings, baseline)`

```python
def filter_by_baseline(
    findings: List[dict],
    baseline: Set[str],
) -> Tuple[List[dict], List[dict]]
```

Return `(kept, suppressed)`: `kept` holds the findings whose fingerprints are **not** in `baseline`, and `suppressed` holds those that are.

### `parse_ignore_file(path)`

```python
def parse_ignore_file(path: Union[str, Path]) -> List[IgnoreRule]
```

Parse a `.hfscanignore` file and return a list of `IgnoreRule` dataclass instances.

**`.hfscanignore` syntax:**

```
# comment
tests/                       # suppress all findings under tests/
* rule:B101                  # suppress rule everywhere
src/legacy/ severity:INFO    # suppress INFO severity under path
src/gen/ rule:B608           # suppress rule under path
```

### `apply_ignore_rules(findings, rules)`

```python
def apply_ignore_rules(
    findings: List[dict],
    rules: List[IgnoreRule],
) -> Tuple[List[dict], int]
```

Return `(kept_findings, ignored_count)`. Rules are evaluated in order; the first match wins.

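
The first-match evaluation can be sketched as follows. This is an illustration only, not the library's implementation. It assumes that a rule matches when every non-empty field agrees with the finding, that paths are compared with a plain prefix test, and that findings carry `file`, `rule`, and `severity` keys. The local `IgnoreRule` is a stand-in that mirrors the documented fields (with defaults added for brevity), and `rule_matches` and `apply_ignore_rules_sketch` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class IgnoreRule:
    # Local stand-in mirroring the fields documented for IgnoreRule.
    path_prefix: str = ""  # "" = applies everywhere
    rule_id: str = ""      # "" = no rule filter
    severity: str = ""     # "" = no severity filter


def rule_matches(rule: IgnoreRule, finding: dict) -> bool:
    """Assumed semantics: every non-empty field must match the finding."""
    if rule.path_prefix and not finding["file"].startswith(rule.path_prefix):
        return False
    if rule.rule_id and finding["rule"] != rule.rule_id:
        return False
    if rule.severity and finding["severity"] != rule.severity:
        return False
    return True


def apply_ignore_rules_sketch(
    findings: List[dict], rules: List[IgnoreRule]
) -> Tuple[List[dict], int]:
    kept, ignored = [], 0
    for finding in findings:
        # any() stops at the first matching rule, so later rules are skipped.
        if any(rule_matches(rule, finding) for rule in rules):
            ignored += 1
        else:
            kept.append(finding)
    return kept, ignored


rules = [
    IgnoreRule(path_prefix="tests/"),
    IgnoreRule(rule_id="B101"),
    IgnoreRule(path_prefix="src/legacy/", severity="INFO"),
]
findings = [
    {"file": "tests/test_app.py", "rule": "B105", "severity": "WARNING"},
    {"file": "app.py", "rule": "B101", "severity": "INFO"},
    {"file": "src/legacy/db.py", "rule": "B608", "severity": "ERROR"},
]
kept, ignored_count = apply_ignore_rules_sketch(findings, rules)
print(ignored_count, [f["file"] for f in kept])  # 2 ['src/legacy/db.py']
```

The third finding is kept because the `src/legacy/` rule only suppresses `INFO` severity, while the finding is an `ERROR`.
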

### `IgnoreRule` dataclass

```python
@dataclass
class IgnoreRule:
    path_prefix: str  # "" = wildcard (applies everywhere)
    rule_id: str      # "" = no rule filter
    severity: str     # "" = no severity filter
```

---

## `core/models.py`

### `make_finding()`

```python
def make_finding(
    tool: str,
    rule: str,
    severity: str,
    file: str,
    line: int,
    message: str,
    owasp: Union[str, List[str]],
    category: str = "security",
    confidence: str = None,
    remediation: str = None,
) -> dict
```

Build a normalized finding dict. All scanner runners call this to ensure a uniform output shape.

- `confidence`: if `None`, looked up from `TOOL_DEFAULT_CONFIDENCE`; falls back to `"possible"`.
- `remediation`: if `None`, looked up from `report.remediation.REMEDIATION` by `rule`; falls back to `""`.
- `owasp`: a `str` is automatically wrapped in a list.

### `sort_findings(findings)`

```python
def sort_findings(findings: List[dict]) -> List[dict]
```

Sort by `severity` (ERROR < WARNING < INFO), then `confidence` (confirmed < likely < possible), then `file`, then `line`.

### `dedup_findings(findings)`

```python
def dedup_findings(findings: List[dict]) -> List[dict]
```

Remove duplicates keyed on `(tool, file, line, message)`. Preserves first-occurrence order.

### Constants

```python
SEVERITY_RANK: dict            # {"ERROR": 0, "HIGH": 0, "WARNING": 1, ...}
CONFIDENCE_RANK: dict          # {"confirmed": 0, "likely": 1, "possible": 2}
TOOL_DEFAULT_CONFIDENCE: dict  # per-tool default confidence levels
FORBIDDEN_FILES: list          # file names that are always flagged
```

---

## `core/helpers.py`

### `run(cmd, cwd=None, timeout=300)`

```python
def run(cmd: List[str], cwd: str = None, timeout: int = 300) -> Tuple[str, int]
```

Run a subprocess. Returns `(stdout_stripped, returncode)`. Never raises.


| Exit code | Meaning |
|-----------|---------|
| Normal | Return code from the process |
| `124` | Timed out (`subprocess.TimeoutExpired`) |
| `127` | Binary not found (`FileNotFoundError`) |

### `jload(txt)`

```python
def jload(txt: str) -> Optional[Any]
```

Parse JSON from a string. Returns `None` for empty strings or parse errors.

### `write_tmp(content, suffix=".yaml")`

```python
def write_tmp(content: str, suffix: str = ".yaml") -> str
```

Write `content` to a temp file and return its absolute path.

### `relpath(base, p)`

```python
def relpath(base: str, p: str) -> str
```

Return `p` relative to `base`. If `p` is not under `base`, return `str(p)` unchanged.

### `have_binary(name)`

```python
def have_binary(name: str) -> bool
```

Return `True` if `name` is on `PATH` (via `shutil.which`).

---

## `core/hf.py`

### `hf_space_to_git(url, token=None)`

```python
def hf_space_to_git(url: str, token: str = None) -> Optional[str]
```

Convert `https://huggingface.co/spaces//` to a git-cloneable URL. Returns `None` for non-HF URLs. If `token` is provided, embeds it as HTTP basic auth (`USER:@`).

### `list_user_spaces(username, hf_token=None, limit=500)`

```python
def list_user_spaces(
    username: str,
    hf_token: str = None,
    limit: int = 500,
) -> Tuple[List[str], str]
```

Return `(space_urls, status_message)`. Queries `https://huggingface.co/api/spaces?author=`. Returns `([], error_message)` on HTTP error or network failure.

---

## `core/bootstrap.py`

### `bootstrap_binaries()`

```python
def bootstrap_binaries() -> dict
```

Download `gitleaks` and `hadolint` binaries for the current platform if not already on `PATH`. Returns a dict with keys `"gitleaks"` and `"hadolint"`, values `"ok"` / `"already installed"` / `"error: ..."`.

Binaries are placed in:

- **Windows**: `\Scripts\` (next to `python.exe`, so `shutil.which` finds them)
- **macOS/Linux**: `~/.local/bin/`

**Versions:** `GITLEAKS_VERSION = "8.18.4"`, `HADOLINT_VERSION = "2.12.0"` (defined as module constants).
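
A caller may want to check the status dict before enabling scanners that depend on these binaries. The sketch below assumes only the value formats listed above (`"ok"`, `"already installed"`, `"error: ..."`). `failed_tools` is a hypothetical helper, not part of `core`, and the error text is a made-up stand-in.

```python
from typing import Dict, List


def failed_tools(status: Dict[str, str]) -> List[str]:
    """Return the names of tools whose status string reports an error."""
    return [name for name, state in status.items() if state.startswith("error:")]


# Stand-in for the dict bootstrap_binaries() would return.
status = {"gitleaks": "already installed", "hadolint": "error: example failure"}

missing = failed_tools(status)
if missing:
    print("Scanners unavailable:", ", ".join(missing))  # Scanners unavailable: hadolint
```
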