# Architecture
This document explains how the Docsifer codebase is structured and why.
## Goals
1. **Correctness**: eliminate the bugs identified in the audit (SSRF,
   path traversal, broken cookies, race conditions, deprecated FastAPI APIs).
2. **Production-grade safety**: never crash a free-tier container under load.
3. **Operability**: configuration-driven, observable, well-tested.
4. **Performance**: minimize redundant I/O, reuse expensive clients.
## Module map
```
docsifer/
├── api/                    # HTTP layer
│   ├── deps.py
│   ├── error_handlers.py
│   ├── middleware.py       # request-id, body limit, security headers
│   ├── schemas.py          # Pydantic request/response models
│   └── v1/
│       ├── convert.py      # POST /v1/convert
│       ├── stats.py        # GET /v1/stats
│       └── health.py       # GET /v1/healthz, /v1/readyz
├── core/                   # Pure business logic
│   ├── service.py          # DocsiferService
│   ├── llm_registry.py     # TTL-cached MarkItDown(LLM) instances
│   ├── html_cleaner.py     # selectolax-based HTML scrub
│   ├── tokenizer.py        # tiktoken wrapper with fallbacks
│   ├── mime.py             # safe MIME detection
│   └── url_guard.py        # SSRF protection
├── analytics/              # Lifespan-managed analytics
│   ├── service.py
│   ├── periods.py          # ISO 8601 week numbering
│   └── store.py            # AnalyticsStore protocol + Upstash & in-memory
├── safety/                 # Anti-crash primitives (Section N of the audit)
│   ├── conversion_gate.py
│   ├── per_ip_limiter.py
│   ├── resource_guard.py
│   ├── circuit_breaker.py
│   ├── memory_watchdog.py
│   └── disk_cleanup.py
├── ui/                     # Gradio UI (optional)
│   └── gradio_app.py
├── config.py               # pydantic-settings
├── exceptions.py
├── logging_config.py
└── main.py                 # FastAPI app factory + lifespan
```
## Request flow (POST /v1/convert)
```
Client
  │  multipart/form-data
  ▼
[ middleware ]
  ├─ request_id       → X-Request-ID header + ContextVar
  ├─ body_limit       → 413 if Content-Length > MAX
  └─ security_headers → X-Content-Type-Options, X-Frame-Options, …
  │
  ▼
[ route handler convert.py ]
  ├─ parse JSON forms via Pydantic (422 on errors)
  ├─ acquire PerIPLimiter slot (429 on overflow)
  ├─ acquire ConversionGate slot (503 when full)
  ├─ stream upload to disk in worker thread
  ├─ ResourceGuard checks RAM/disk
  ├─ asyncio.wait_for() guards against runaway conversions
  └─ DocsiferService.convert_file(...)
  │
  ▼
[ DocsiferService (core) ]
  ├─ choose converter:
  │    ├─ no api_key → cached `MarkItDown()`
  │    └─ otherwise  → LLMRegistry.get(LLMConfig)
  ├─ HTML path → clean in-memory → MarkItDown.convert_stream()
  └─ everything else → normalize_extension() → MarkItDown.convert(path)
```
The route returns an `ORJSONResponse`; large markdown payloads are gzipped by
the `GZipMiddleware`.
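
The timeout step in the flow above can be sketched as follows. This is a minimal illustration, not the actual route code: `slow_convert` and `convert_with_timeout` are hypothetical names, and `time.sleep` stands in for a blocking conversion running in a worker thread.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


def slow_convert(seconds: float) -> str:
    """Hypothetical blocking conversion; sleeps instead of parsing a file."""
    time.sleep(seconds)
    return "# markdown"


async def convert_with_timeout(
    pool: ThreadPoolExecutor, seconds: float, timeout: float
) -> Optional[str]:
    loop = asyncio.get_running_loop()
    # Blocking work runs in the pool so the event loop stays free.
    future = loop.run_in_executor(pool, slow_convert, seconds)
    try:
        return await asyncio.wait_for(future, timeout=timeout)
    except asyncio.TimeoutError:
        # The caller stops waiting, but the worker thread is not interrupted;
        # it runs to completion in the background.
        return None


async def demo() -> tuple:
    with ThreadPoolExecutor(max_workers=2) as pool:
        fast = await convert_with_timeout(pool, 0.01, timeout=1.0)
        slow = await convert_with_timeout(pool, 0.2, timeout=0.05)
    return fast, slow
```

Because a timed-out thread keeps running, the timeout alone does not free a worker; bounding the pool and the number of admitted requests is what keeps resource use predictable.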
## Lifespan
`docsifer.main._lifespan` owns the lifecycle of every singleton:
- `DocsiferService`: built once, has a bounded `ThreadPoolExecutor`.
- `AnalyticsService`: loads totals from Upstash, kicks off the
  background sync loop, and **flushes pending counters on shutdown**.
- `ConversionGate`, `PerIPLimiter`, `ResourceGuard`: no I/O, just config.
- `disk_cleanup_loop`: periodic temp-file sweeper.
- `memory_watchdog_loop`: optional, sends SIGTERM on RSS overrun.
All background tasks share an `asyncio.Event` so shutdown is deterministic.
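
A shared stop event of this kind might look like the sketch below. The names `periodic` and `lifespan_sketch` are illustrative assumptions, and the `log` list exists only to make the ordering visible; the real `_lifespan` receives the FastAPI app and builds the actual services.

```python
import asyncio
from contextlib import asynccontextmanager, suppress


async def periodic(name: str, interval: float, stop: asyncio.Event, log: list) -> None:
    """Hypothetical background loop that runs until `stop` is set."""
    while not stop.is_set():
        log.append(f"{name}: tick")
        # Wait on the event rather than sleeping, so shutdown is not delayed
        # by a full interval.
        with suppress(asyncio.TimeoutError):
            await asyncio.wait_for(stop.wait(), timeout=interval)
    log.append(f"{name}: stopped")


@asynccontextmanager
async def lifespan_sketch(log: list):
    stop = asyncio.Event()
    tasks = [
        asyncio.create_task(periodic("disk_cleanup", 0.01, stop, log)),
        asyncio.create_task(periodic("analytics_sync", 0.01, stop, log)),
    ]
    try:
        yield
    finally:
        stop.set()                    # signal every loop at once
        await asyncio.gather(*tasks)  # wait until each loop has exited
        log.append("final flush")     # stand-in for the analytics flush


async def demo() -> list:
    log: list = []
    async with lifespan_sketch(log):
        await asyncio.sleep(0.03)
    return log
```

Waiting for the loops to exit before the final flush means no background task can write after the last sync.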
## Concurrency model
- One global `asyncio.Semaphore` (in `ConversionGate`) bounds in-flight
conversions. Anything beyond `max_concurrent + max_queue` is rejected with
`503 + Retry-After`.
- `PerIPLimiter` enforces fairness so a single IP cannot monopolize the gate.
- `ResourceGuard` runs ahead of every conversion to short-circuit OOM.
- All synchronous work happens in a dedicated `ThreadPoolExecutor` to keep
the event loop responsive.
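
The admission rule for the gate can be sketched as below, assuming a simple counter of admitted requests next to the semaphore. `ConversionGateSketch` and `GateFull` are hypothetical names; the real `ConversionGate` may track its queue differently.

```python
import asyncio
from contextlib import asynccontextmanager


class GateFull(Exception):
    """Hypothetical error; a route would map it to 503 + Retry-After."""

    def __init__(self, retry_after: int) -> None:
        super().__init__("conversion gate is full")
        self.retry_after = retry_after


class ConversionGateSketch:
    def __init__(self, max_concurrent: int, max_queue: int, retry_after: int = 5) -> None:
        self._sem = asyncio.Semaphore(max_concurrent)
        self._capacity = max_concurrent + max_queue
        self._admitted = 0  # requests running or waiting for the semaphore
        self._retry_after = retry_after

    @asynccontextmanager
    async def slot(self):
        # There is no await between the check and the increment, so on a
        # single event loop the counter cannot be raced.
        if self._admitted >= self._capacity:
            raise GateFull(self._retry_after)
        self._admitted += 1
        try:
            async with self._sem:
                yield
        finally:
            self._admitted -= 1


async def demo() -> dict:
    gate = ConversionGateSketch(max_concurrent=2, max_queue=1)
    results = {"ok": 0, "rejected": 0}

    async def convert() -> None:
        try:
            async with gate.slot():
                await asyncio.sleep(0.01)  # stand-in for a conversion
                results["ok"] += 1
        except GateFull:
            results["rejected"] += 1

    await asyncio.gather(*(convert() for _ in range(5)))
    return results
```

With two slots and one queue position, five simultaneous requests yield three completions and two immediate rejections, so excess load is shed instead of piling up in memory.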
## Analytics
- Increments are stored in an in-process `pending` counter and applied to a
`totals` counter atomically. The lock-free `_snapshot` dict is what
  `/v1/stats` returns, so readers are never blocked by writers.
- The background sync loop pipelines `HINCRBY` operations into Upstash so a
full flush is a single HTTP round-trip when the server supports pipelines.
- Failures keep `pending` intact for the next attempt; on shutdown the
service performs one final flush.
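
The pending/totals/snapshot arrangement can be sketched as follows. This is an illustration under assumptions: `AnalyticsSketch` is a hypothetical class, and the injected `flush` coroutine stands in for the pipelined Upstash write.

```python
import asyncio
from collections import Counter
from typing import Awaitable, Callable, Dict

Flush = Callable[[Dict[str, int]], Awaitable[None]]


class AnalyticsSketch:
    def __init__(self, flush: Flush) -> None:
        self._flush = flush
        self._lock = asyncio.Lock()
        self._pending: Counter = Counter()
        self._totals: Counter = Counter()
        self._snapshot: Dict[str, int] = {}

    async def increment(self, key: str, amount: int = 1) -> None:
        async with self._lock:
            self._pending[key] += amount
            self._totals[key] += amount
            # Publish a new dict; readers holding the old one are unaffected.
            self._snapshot = dict(self._totals)

    def snapshot(self) -> Dict[str, int]:
        return self._snapshot  # no lock: a single reference read

    async def sync(self) -> bool:
        async with self._lock:
            batch = dict(self._pending)
        if not batch:
            return True
        try:
            await self._flush(batch)
        except Exception:
            return False  # pending is left intact for the next attempt
        async with self._lock:
            self._pending.subtract(batch)
            self._pending = +self._pending  # drop entries that reached zero
        return True


async def demo() -> tuple:
    remote: Counter = Counter()
    state = {"up": False}

    async def flush(batch: Dict[str, int]) -> None:
        if not state["up"]:
            raise ConnectionError("simulated outage")
        remote.update(batch)

    svc = AnalyticsSketch(flush)
    await svc.increment("convert")
    await svc.increment("convert")
    first = await svc.sync()   # fails, counters are retained
    state["up"] = True
    second = await svc.sync()  # succeeds, counters are delivered once
    return first, second, dict(remote), svc.snapshot()
```

Subtracting the flushed batch, rather than clearing `pending`, preserves increments that arrive while a flush is in flight.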
## Security posture
- **SSRF**: every URL goes through `validate_url()`, which resolves the host
  and rejects private/loopback/link-local/multicast addresses by default.
- **Path traversal**: `Path(filename).name` strips any directory component.
- **Body limit**: enforced by middleware before the body is consumed.
- **Extension allowlist**: server-side allowlist independent of the UI.
- **CORS**: `allow_credentials` is automatically disabled when origins
  contain `*` (which would otherwise be invalid per the spec).
- **Error responses**: never echo internal exception messages; only the
  `public_message` of the typed exception is returned.
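
The SSRF and path-traversal checks can be sketched with the standard library as below. This is not the contents of `url_guard.py`: `validate_url_sketch`, `is_blocked`, `safe_filename`, and `UnsafeURL` are hypothetical names, and the set of rejected address classes is an assumption based on the list above.

```python
import ipaddress
import socket
from pathlib import Path
from urllib.parse import urlsplit


class UnsafeURL(ValueError):
    """Hypothetical error type for rejected URLs."""


def is_blocked(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_multicast
        or addr.is_unspecified
    )


def validate_url_sketch(url: str) -> str:
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise UnsafeURL("only absolute http(s) URLs are accepted")
    try:
        infos = socket.getaddrinfo(parts.hostname, parts.port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        raise UnsafeURL("host does not resolve") from exc
    for info in infos:
        # info[4][0] is the resolved address; strip an IPv6 zone id if present.
        ip = info[4][0].split("%", 1)[0]
        if is_blocked(ip):
            raise UnsafeURL("host resolves to a blocked address")
    return url


def safe_filename(filename: str) -> str:
    # Keeps only the final path component. On POSIX, backslashes are not
    # separators, so Windows-style paths would need separate handling.
    return Path(filename).name
```

Checking every resolved address matters because a hostname can map to several records. A resolve-then-check step like this does not by itself prevent DNS rebinding, since the HTTP client may resolve the name again when it connects.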
## Free-tier defaults
The defaults in `config.py` target a 2 vCPU / 16 GB RAM container (Hugging
Face Spaces basic):
- `max_upload_bytes = 10 MB`
- `max_concurrent_conversions = 2`
- `max_queue_depth = 10`
- `max_per_ip_concurrent = 1`
- `request_timeout_sec = 55` (just under HF's 60 s gateway)
- `analytics_sync_interval_sec = 1800`
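
For illustration, the defaults above can be expressed as a small settings object. This sketch uses a standard-library dataclass rather than pydantic-settings so it runs without dependencies; the `DOCSIFER_` environment prefix, the `from_env` helper, and the reading of "10 MB" as 10 MiB are all assumptions.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping, Optional


@dataclass(frozen=True)
class SettingsSketch:
    max_upload_bytes: int = 10 * 1024 * 1024  # assumes "10 MB" means 10 MiB
    max_concurrent_conversions: int = 2
    max_queue_depth: int = 10
    max_per_ip_concurrent: int = 1
    request_timeout_sec: int = 55
    analytics_sync_interval_sec: int = 1800

    @classmethod
    def from_env(
        cls, env: Optional[Mapping[str, str]] = None, prefix: str = "DOCSIFER_"
    ) -> "SettingsSketch":
        """Override defaults from variables such as DOCSIFER_MAX_QUEUE_DEPTH."""
        env = os.environ if env is None else env
        overrides = {}
        for f in fields(cls):
            raw = env.get(prefix + f.name.upper())
            if raw is not None:
                overrides[f.name] = int(raw)
        return cls(**overrides)


defaults = SettingsSketch.from_env({})
# Total requests admitted at once: running plus queued.
capacity = defaults.max_concurrent_conversions + defaults.max_queue_depth
```

With these defaults, at most 12 requests are admitted at a time, and each one is abandoned after 55 seconds, before the 60-second gateway limit is reached.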