cmpatino (HF Staff), Claude Opus 4.8 (1M context) committed
Commit f05c667 · 1 Parent(s): ec0742d

Log audit to a private out-of-org bucket

Audit records carry sensitive metadata (caller_ip, user_agent, source
URIs). They were written to gemma-main-bucket/audit/, readable by every
org contributor. Move them to a private bucket outside the org
(cmpatino/gemma-challenge-audit), owned by the Space's HF_TOKEN (Option A).

- config.py: new fixed audit_bucket constant.
- hub.py: append_jsonl_central -> append_jsonl_audit, backed by a generic
_append_jsonl(bucket, ...) targeting the audit bucket (read-modify-write
now reads+writes there).
- audit.py: retarget the single call site.
- DESIGN.md: update §8, constants, file layout, and deployment bootstrap.

Existing audit history was moved to the new bucket out-of-band.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Files changed (4)
  1. DESIGN.md +12 -6
  2. app/audit.py +1 -1
  3. app/config.py +3 -0
  4. app/hub.py +11 -3
DESIGN.md CHANGED
@@ -34,7 +34,7 @@ This Space holds the API for **one** collab (here: the Gemma collab in `gemma-ch
 | Shared resource | `shared_resources/…_{agent_id}{.ext|/…}` (the `_{agent_id}` segment is mandatory somewhere in the leaf path) |
 | Audit log | `audit/{YYYYMM}.jsonl` |
 
-Space-config constants: `ORG=gemma-challenge`, `COLLAB_SLUG=gemma`, `CENTRAL_BUCKET=gemma-challenge/gemma-main-bucket`.
+Space-config constants: `ORG=gemma-challenge`, `COLLAB_SLUG=gemma`, `CENTRAL_BUCKET=gemma-challenge/gemma-main-bucket`, `AUDIT_BUCKET=cmpatino/gemma-challenge-audit` (private, out-of-org).
 
 ### HF Hub dependencies
 - `huggingface_hub` SDK (or equivalent REST calls): bucket existence check, bucket creator metadata, bucket list, bucket cp/sync, bucket-to-bucket copy.
@@ -237,7 +237,7 @@ Rate-limit responses include `Retry-After`.
 
 ## 8. Audit Log
 
-Append-one-JSON-line-per-write to `audit/{YYYYMM}.jsonl` in the central bucket, via admin token. Append-only; readable by every contributor.
+Append-one-JSON-line-per-write to `audit/{YYYYMM}.jsonl` in the **private audit bucket** (`cmpatino/gemma-challenge-audit`), via admin token. Append-only. This bucket lives **outside the org** so collab contributors cannot read it — audit records carry sensitive metadata (`caller_ip`, `user_agent`, source URIs). The Space's `HF_TOKEN` owns the bucket and is the only reader/writer (Option A; see §11).
 
 ```json
 {
@@ -292,7 +292,7 @@ bucket-sync/
   DESIGN.md ← this document
   app/
     main.py ← FastAPI entry; mounts routers
-    config.py ← fixed: ORG, COLLAB_SLUG, CENTRAL_BUCKET; env: HF_TOKEN
+    config.py ← fixed: ORG, COLLAB_SLUG, CENTRAL_BUCKET, AUDIT_BUCKET; env: HF_TOKEN
     hub.py ← huggingface_hub wrapper (read/write/copy/list/whoami/creator)
     naming.py ← derives bucket names, filenames, target paths
     frontmatter.py ← parse / merge / serialise YAML frontmatter
@@ -322,15 +322,21 @@ Synchronous: the client waits for the copy to finish. Implementation streams obj
 # 1. Create central bucket as admin
 hf buckets create gemma-challenge/gemma-main-bucket
 
-# 2. Seed it
+# 1b. Create the PRIVATE audit bucket, outside the org, owned by the
+# account whose token is the Space's HF_TOKEN (Option A). Contributors
+# must NOT be able to read it.
+hf buckets create cmpatino/gemma-challenge-audit --private
+
+# 2. Seed the central bucket
 hf buckets cp ./README.md hf://buckets/gemma-challenge/gemma-main-bucket/
 hf buckets cp ./enwik8 hf://buckets/gemma-challenge/gemma-main-bucket/shared_resources/
 hf buckets cp ./shared_resources/README.md \
   hf://buckets/gemma-challenge/gemma-main-bucket/shared_resources/
 
 # 3. Configure this Space (Settings → Secrets)
-# HF_TOKEN = <admin token, scope: org-admin, repo write>
-# (ORG, COLLAB_SLUG, CENTRAL_BUCKET are fixed in config.py — no env needed)
+# HF_TOKEN = <token that is BOTH gemma-challenge org-admin AND owner
+# of cmpatino/gemma-challenge-audit, with repo write>
+# (ORG, COLLAB_SLUG, CENTRAL_BUCKET, AUDIT_BUCKET are fixed in config.py)
 
 # 4. Push code. Space rebuilds. Health-check:
 curl https://gemma-challenge-gemma-bucket-sync.hf.space/v1/healthz
app/audit.py CHANGED
@@ -51,6 +51,6 @@ class AuditLogger:
         line = json.dumps(record, separators=(",", ":"), ensure_ascii=False)
         path = audit_log_path(now)
         try:
-            self._hub.append_jsonl_central(path, line)
+            self._hub.append_jsonl_audit(path, line)
         except Exception:
             log.exception("audit append failed; record=%s", record)
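
The call site in this hunk can be exercised without touching the Hub. The following is a minimal sketch, not the repository's code: the body of `audit_log_path` is an assumption that follows the `audit/{YYYYMM}.jsonl` layout from DESIGN.md, and `FakeHub`, `write_audit`, and the `op` field are hypothetical names introduced only for illustration.

```python
import json
import logging
from datetime import datetime, timezone

log = logging.getLogger("audit")


def audit_log_path(now: datetime) -> str:
    """Assumed layout: one file per calendar month, audit/{YYYYMM}.jsonl."""
    return f"audit/{now:%Y%m}.jsonl"


class FakeHub:
    """Hypothetical stand-in for HubClient; records appends in memory."""

    def __init__(self, fail: bool = False) -> None:
        self.fail = fail
        self.appended: list[tuple[str, str]] = []

    def append_jsonl_audit(self, path: str, line: str) -> None:
        if self.fail:
            raise RuntimeError("simulated hub outage")
        self.appended.append((path, line))


def write_audit(hub: FakeHub, record: dict, now: datetime) -> None:
    """Serialise compactly and append; a failed append is logged, not raised."""
    line = json.dumps(record, separators=(",", ":"), ensure_ascii=False)
    path = audit_log_path(now)
    try:
        hub.append_jsonl_audit(path, line)
    except Exception:
        log.exception("audit append failed; record=%s", record)


if __name__ == "__main__":
    hub = FakeHub()
    when = datetime(2025, 1, 15, tzinfo=timezone.utc)
    write_audit(hub, {"caller_ip": "203.0.113.7", "op": "write"}, when)
    print(hub.appended)
```

In this sketch, as in the hunk, a failure to append is swallowed after logging, so an outage of the audit bucket would not fail the request being audited, at the cost of a missing audit record.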
app/config.py CHANGED
@@ -13,6 +13,9 @@ class Settings(BaseSettings):
     org: ClassVar[str] = "gemma-challenge"
     collab_slug: ClassVar[str] = "gemma"
     central_bucket: ClassVar[str] = "gemma-challenge/gemma-main-bucket"
+    # Audit log lives in a private bucket OUTSIDE the org so collab members
+    # cannot read it. The Space's HF_TOKEN owns this bucket (Option A).
+    audit_bucket: ClassVar[str] = "cmpatino/gemma-challenge-audit"
 
     hf_token: str | None = Field(None, alias="HF_TOKEN")
 
app/hub.py CHANGED
@@ -142,14 +142,22 @@ class HubClient:
     def write_text_central(self, target_path: str, text: str) -> None:
         self.write_bytes_central(target_path, text.encode("utf-8"))
 
-    def append_jsonl_central(self, target_path: str, line: str) -> None:
+    def append_jsonl_audit(self, target_path: str, line: str) -> None:
+        """Append to the audit log in the private (out-of-org) audit bucket."""
+        self._append_jsonl(self._settings.audit_bucket, target_path, line)
+
+    def _append_jsonl(self, bucket: str, target_path: str, line: str) -> None:
         try:
-            existing = self.read_central_bytes(target_path)
+            existing = self._download_one(bucket, target_path)
         except FileNotFoundError:
             existing = b""
         if existing and not existing.endswith(b"\n"):
             existing += b"\n"
-        self.write_bytes_central(target_path, existing + line.encode("utf-8") + b"\n")
+        batch_bucket_files(
+            bucket_id=bucket,
+            add=[(existing + line.encode("utf-8") + b"\n", target_path)],
+            token=self._token,
+        )
 
     # ───────────────────────── Cross-bucket copy ─────────────────────────
 
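
The read-modify-write append in this hunk can be illustrated in isolation. The sketch below is an assumption-laden stand-in rather than the commit's implementation: `InMemoryBucket` and its `download`/`upload` methods are hypothetical and replace the real `_download_one` and `batch_bucket_files` calls, so only the append logic itself is shown.

```python
class InMemoryBucket:
    """Hypothetical stand-in for a remote bucket: maps path -> bytes."""

    def __init__(self) -> None:
        self.files: dict[str, bytes] = {}

    def download(self, path: str) -> bytes:
        try:
            return self.files[path]
        except KeyError:
            raise FileNotFoundError(path) from None

    def upload(self, path: str, data: bytes) -> None:
        self.files[path] = data


def append_jsonl(bucket: InMemoryBucket, path: str, line: str) -> None:
    """Read the whole object, add one line, write the whole object back."""
    try:
        existing = bucket.download(path)
    except FileNotFoundError:
        existing = b""  # no file yet: start from empty
    if existing and not existing.endswith(b"\n"):
        existing += b"\n"  # repair a missing trailing newline before appending
    bucket.upload(path, existing + line.encode("utf-8") + b"\n")


if __name__ == "__main__":
    bucket = InMemoryBucket()
    append_jsonl(bucket, "audit/202501.jsonl", '{"n":1}')
    append_jsonl(bucket, "audit/202501.jsonl", '{"n":2}')
    print(bucket.files["audit/202501.jsonl"])
```

Because every append re-uploads the full file and nothing in the hunk shown takes a lock, two concurrent appends could in principle read the same snapshot and one record would be overwritten. Whether the surrounding code serialises writes elsewhere is not visible in this diff.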