# Architecture

This document explains how the Docsifer codebase is structured and why.

## Goals

1. **Correctness**: eliminate the bugs identified in the audit (SSRF,
   path traversal, broken cookies, race conditions, deprecated FastAPI APIs).
2. **Production-grade safety**: never crash a free-tier container under load.
3. **Operability**: config-driven, observable, well-tested.
4. **Performance**: minimize redundant I/O, reuse expensive clients.

## Module map

```
docsifer/
├── api/              # HTTP layer
│   ├── deps.py
│   ├── error_handlers.py
│   ├── middleware.py        # request-id, body limit, security headers
│   ├── schemas.py           # Pydantic request/response models
│   └── v1/
│       ├── convert.py       # POST /v1/convert
│       ├── stats.py         # GET  /v1/stats
│       └── health.py        # GET  /v1/healthz, /v1/readyz
├── core/             # Pure business logic
│   ├── service.py           # DocsiferService
│   ├── llm_registry.py      # TTL-cached MarkItDown(LLM) instances
│   ├── html_cleaner.py      # selectolax-based HTML scrub
│   ├── tokenizer.py         # tiktoken wrapper with fallbacks
│   ├── mime.py              # safe MIME detection
│   └── url_guard.py         # SSRF protection
├── analytics/        # Lifespan-managed analytics
│   ├── service.py
│   ├── periods.py           # ISO 8601 week numbering
│   └── store.py             # AnalyticsStore protocol + Upstash & in-memory
├── safety/           # Anti-crash primitives (Section N of the audit)
│   ├── conversion_gate.py
│   ├── per_ip_limiter.py
│   ├── resource_guard.py
│   ├── circuit_breaker.py
│   ├── memory_watchdog.py
│   └── disk_cleanup.py
├── ui/               # Gradio UI (optional)
│   └── gradio_app.py
├── config.py         # pydantic-settings
├── exceptions.py
├── logging_config.py
└── main.py           # FastAPI app factory + lifespan
```

## Request flow (POST /v1/convert)

```
Client
  │  multipart/form-data
  ▼
[ middleware ]
  ├─ request_id          → X-Request-ID header + ContextVar
  ├─ body_limit          → 413 if Content-Length > MAX
  └─ security_headers    → X-Content-Type-Options, X-Frame-Options, …
  │
  ▼
[ route handler convert.py ]
  ├─ parse JSON forms via Pydantic (422 on errors)
  ├─ acquire PerIPLimiter slot (429 on overflow)
  ├─ acquire ConversionGate slot (503 when full)
  ├─ stream upload to disk in worker thread
  ├─ ResourceGuard checks RAM/disk
  ├─ asyncio.wait_for() guards against runaway conversions
  └─ DocsiferService.convert_file(...)
       │
       ▼
[ DocsiferService (core) ]
  ├─ choose converter:
  │     ├─ no api_key → cached `MarkItDown()`
  │     └─ otherwise  → LLMRegistry.get(LLMConfig)
  ├─ HTML path  → clean in-memory → MarkItDown.convert_stream()
  └─ everything else → normalize_extension() → MarkItDown.convert(path)
```

The route returns an `ORJSONResponse`; large markdown payloads are gzipped by
the `GZipMiddleware`.

## Lifespan

`docsifer.main._lifespan` owns the lifecycle of every singleton:

- `DocsiferService`: built once, has a bounded `ThreadPoolExecutor`.
- `AnalyticsService`: loads totals from Upstash, kicks off the
  background sync loop, **flushes pending counters on shutdown**.
- `ConversionGate`, `PerIPLimiter`, `ResourceGuard`: no I/O, just config.
- `disk_cleanup_loop`: periodic temp-file sweeper.
- `memory_watchdog_loop`: optional, sends SIGTERM on RSS overrun.

All background tasks share an `asyncio.Event` so shutdown is deterministic.
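
As a rough illustration of the shared-event pattern, the sketch below uses only `asyncio` and `contextlib`, with no FastAPI dependency. The names `periodic`, `lifespan`, and `main`, the task names, and the intervals are hypothetical stand-ins, not the actual contents of `docsifer.main`.

```python
import asyncio
from contextlib import asynccontextmanager


async def periodic(name: str, interval: float, stop: asyncio.Event, log: list[str]) -> None:
    """Hypothetical background loop: tick until the shared stop event is set."""
    while not stop.is_set():
        log.append(f"{name}: tick")
        try:
            # Wake up as soon as shutdown is requested instead of
            # sleeping through the whole interval.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            continue  # interval elapsed with no shutdown request
    log.append(f"{name}: stopped")


@asynccontextmanager
async def lifespan(log: list[str]):
    """Start background tasks on entry; stop and await them on exit."""
    stop = asyncio.Event()
    tasks = [
        asyncio.create_task(periodic("disk_cleanup", 0.01, stop, log)),
        asyncio.create_task(periodic("analytics_sync", 0.01, stop, log)),
    ]
    try:
        yield
    finally:
        stop.set()                    # one signal reaches every loop
        await asyncio.gather(*tasks)  # wait until each loop has exited
        log.append("final flush")     # stand-in for the analytics flush


async def main() -> list[str]:
    log: list[str] = []
    async with lifespan(log):
        await asyncio.sleep(0.03)     # the application would serve requests here
    return log
```

Because every loop waits on the same event, shutdown does not depend on task cancellation: each task finishes its current iteration, exits, and only then does the final flush run.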

## Concurrency model

- One global `asyncio.Semaphore` (in `ConversionGate`) bounds in-flight
  conversions. Anything beyond `max_concurrent + max_queue` is rejected with
  `503 + Retry-After`.
- `PerIPLimiter` enforces fairness so a single IP cannot monopolize the gate.
- `ResourceGuard` runs ahead of every conversion to short-circuit OOM.
- All synchronous work happens in a dedicated `ThreadPoolExecutor` to keep
  the event loop responsive.
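
The gate's admission rule can be sketched as follows. This is a minimal sketch, not the code in `safety/conversion_gate.py`: the `GateFull` exception, the `slot()` method, and the `retry_after` parameter are assumptions made for illustration. A route handler would translate `GateFull` into `503` with a `Retry-After` header.

```python
import asyncio
from contextlib import asynccontextmanager


class GateFull(Exception):
    """Hypothetical error: running plus queued requests are at capacity."""

    def __init__(self, retry_after: int) -> None:
        super().__init__("conversion gate is full")
        self.retry_after = retry_after


class ConversionGate:
    def __init__(self, max_concurrent: int, max_queue: int, retry_after: int = 5) -> None:
        self._sem = asyncio.Semaphore(max_concurrent)
        self._capacity = max_concurrent + max_queue
        self._admitted = 0  # requests currently running or waiting
        self._retry_after = retry_after

    @asynccontextmanager
    async def slot(self):
        # Reject immediately rather than letting the queue grow unbounded.
        if self._admitted >= self._capacity:
            raise GateFull(self._retry_after)
        self._admitted += 1
        try:
            async with self._sem:  # waits here while all slots are busy
                yield
        finally:
            self._admitted -= 1


async def demo() -> tuple[int, int]:
    # Build the gate inside the running loop so the semaphore binds to it.
    gate = ConversionGate(max_concurrent=1, max_queue=1)
    done = rejected = 0

    async def job() -> None:
        nonlocal done, rejected
        try:
            async with gate.slot():
                await asyncio.sleep(0.01)  # stand-in for a conversion
                done += 1
        except GateFull:
            rejected += 1

    await asyncio.gather(*(job() for _ in range(4)))
    return done, rejected
```

With one running slot and one queue slot, four simultaneous jobs yield two completions and two rejections.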

## Analytics

- Increments are stored in an in-process `pending` counter and applied to a
  `totals` counter atomically. The lock-free `_snapshot` dict is what
  `/v1/stats` returns, so readers are never blocked by writers.
- The background sync loop pipelines `HINCRBY` operations into Upstash so a
  full flush is a single HTTP round-trip when the server supports pipelines.
- Failures keep `pending` intact for the next attempt; on shutdown the
  service performs one final flush.
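
The pending/totals bookkeeping might look like the sketch below. The class name `PendingCounters` and the `send` callable are hypothetical; `send` stands in for whatever pipelines the `HINCRBY` calls to Upstash, and the real store interface may differ.

```python
import threading
from collections import Counter
from typing import Callable, Mapping


class PendingCounters:
    """Illustrative only: pending deltas plus a lock-free read snapshot."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._pending: Counter[str] = Counter()
        self._totals: Counter[str] = Counter()
        self._snapshot: dict[str, int] = {}

    def increment(self, key: str, n: int = 1) -> None:
        with self._lock:
            self._pending[key] += n
            self._totals[key] += n
            # Rebind to a fresh dict; readers holding the old one are unaffected.
            self._snapshot = dict(self._totals)

    def snapshot(self) -> Mapping[str, int]:
        return self._snapshot  # no lock: a single attribute read

    def flush(self, send: Callable[[dict[str, int]], None]) -> bool:
        with self._lock:
            batch = dict(self._pending)
            self._pending.clear()
        if not batch:
            return True  # nothing to send
        try:
            send(batch)  # e.g. one pipelined request carrying all increments
            return True
        except Exception:
            with self._lock:
                # Counter.update adds counts, so increments that arrived
                # during the failed attempt are preserved as well.
                self._pending.update(batch)
            return False
```

The network call happens outside the lock, so a slow or failing backend never stalls request handlers that are recording increments.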

## Security posture

- **SSRF**: every URL goes through `validate_url()`, which resolves the host
  and rejects private/loopback/link-local/multicast addresses by default.
- **Path traversal**: `Path(filename).name` strips any directory component.
- **Body limit**: enforced by middleware before the body is consumed.
- **Extension allowlist**: server-side allowlist independent of the UI.
- **CORS**: `allow_credentials` is automatically disabled when origins
  contain `*` (which would otherwise be invalid per the spec).
- **Error responses**: never echo internal exception messages; only the
  `public_message` of the typed exception is returned.
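
The first two checks can be sketched with the standard library alone. The signature of `validate_url`, the `UnsafeURL` exception, and the `safe_filename` helper are assumptions for illustration; the real `core/url_guard.py` may differ. The extra guard for empty and `..` names is an addition of this sketch, since `Path("..").name` returns `".."` unchanged.

```python
import ipaddress
import socket
from pathlib import Path
from urllib.parse import urlsplit


class UnsafeURL(ValueError):
    """Hypothetical error raised for URLs the service refuses to fetch."""


def validate_url(url: str) -> str:
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise UnsafeURL("only absolute http(s) URLs are accepted")
    try:
        infos = socket.getaddrinfo(parts.hostname, None, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        raise UnsafeURL("host does not resolve") from exc
    for info in infos:
        # info[4][0] is the address string; drop any IPv6 zone id ("%eth0").
        addr = ipaddress.ip_address(info[4][0].split("%", 1)[0])
        if (addr.is_private or addr.is_loopback or addr.is_link_local
                or addr.is_multicast or addr.is_reserved or addr.is_unspecified):
            raise UnsafeURL(f"{parts.hostname} resolves to a blocked address")
    return url


def safe_filename(filename: str) -> str:
    name = Path(filename).name  # drops every directory component
    if name in ("", ".", ".."):
        raise ValueError("filename has no usable final component")
    return name
```

Checking every resolved address matters because a hostname can map to several records. A check like this validates the resolution made at call time; if the HTTP client resolves the host again when fetching, the address it connects to may differ.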

## Free-tier defaults

The defaults in `config.py` target a 2 vCPU / 16 GB RAM container (Hugging
Face Spaces basic):

- `max_upload_bytes = 10 MB`
- `max_concurrent_conversions = 2`
- `max_queue_depth = 10`
- `max_per_ip_concurrent = 1`
- `request_timeout_sec = 55` (just under HF's 60 s gateway)
- `analytics_sync_interval_sec = 1800`