PDF Processing System Architecture
System Overview
┌───────────────────────────────────────────────────────────┐
│  Frontend (Next.js)                                       │
│  ┌─────────────────────────────────────────────────────┐  │
│  │  PDF Upload Component                               │  │
│  │    • File input                                     │  │
│  │    • Drag & drop                                    │  │
│  │    • Progress indicator                             │  │
│  └─────────────────────────────────────────────────────┘  │
└─────────────────────────────┬─────────────────────────────┘
                              │
                        HTTP/FormData
                              │
                              ▼
┌───────────────────────────────────────────────────────────┐
│  FastAPI Backend                                          │
├───────────────────────────────────────────────────────────┤
│                                                           │
│  POST /api/v1/process-pdf                                 │
│    └─► Extract sentences only                             │
│                                                           │
│  POST /api/v1/process-pdf-to-bias                         │
│    └─► Extract + analyze bias (complete pipeline)         │
│                                                           │
│  GET /api/v1/pdf-health                                   │
│    └─► Service status check                               │
│                                                           │
└─────────────────────────────┬─────────────────────────────┘
                              │
             ┌────────────────┴────────────────┐
             │                                 │
             ▼                                 ▼
       ┌───────────┐                   ┌───────────────┐
       │ PDF Bytes │                   │ PDFProcessor  │
       │           │                   │ (utility/)    │
       └─────┬─────┘                   └───────┬───────┘
             │                                 │
             └────────────────┬────────────────┘
                              │
                              ▼
               ┌─────────────────────────────┐
               │ Step 1: Extract Text        │
               │ PyMuPDF (fitz)              │
               │  • Read PDF pages           │
               │  • Extract raw text         │
               │  • Handle multi-page        │
               └──────────────┬──────────────┘
                              │
                              ▼
               ┌─────────────────────────────┐
               │ Step 2: Clean Text          │
               │ Regex processing            │
               │  • Remove extra whitespace  │
               │  • Normalize formatting     │
               └──────────────┬──────────────┘
                              │
                              ▼
               ┌─────────────────────────────┐
               │ Step 3: Segment Sentences   │
               │ Nepali-aware regex          │
               │  • Split on danda (।)       │
               │  • Handle punctuation       │
               │  • Filter short fragments   │
               └──────────────┬──────────────┘
                              │
                              ▼
               ┌─────────────────────────────┐
               │ Step 4: Refine (Optional)   │
               │ Mistral LLM API             │
               │  • Correct segmentation     │
               │  • Remove duplicates        │
               │  • JSON formatting          │
               └──────────────┬──────────────┘
                              │
          ┌───────────────────┴─────────┐
          │                             │
          ▼                             ▼
   ┌─────────────┐              ┌───────────────┐
   │  Sentences  │              │  Return JSON  │
   │    List     │              │     Array     │
   └──────┬──────┘              └───────┬───────┘
          │                             │
          │                 ┌───────────┴────────┐
          │                 │                    │
          │                 ▼                    ▼
          │         ┌───────────────┐    ┌───────────────┐
          │         │ Option A:     │    │ Option B:     │
          │         │ Return to     │    │ Pass to Bias  │
          │         │ User          │    │ Detection     │
          │         └───────────────┘    └───────┬───────┘
          │                                      │
          │                                      ▼
          │                           ┌─────────────────────┐
          │                           │ Bias Detection Model│
          │                           │ DistilBERT Nepali   │
          │                           │ (module_b)          │
          │                           ├─────────────────────┤
          │                           │ • Classify sentence │
          │                           │ • 11 categories:    │
          │                           │    - neutral        │
          │                           │    - gender         │
          │                           │    - caste          │
          │                           │    - religion       │
          │                           │    - political      │
          │                           │    - age            │
          │                           │    - disability     │
          │                           │    - appearance     │
          │                           │    - social status  │
          │                           │    - religiosity    │
          │                           │    - ambiguity      │
          │                           └──────────┬──────────┘
          │                                      │
          └───────────────────┬──────────────────┘
                              │
                              ▼
           ┌─────────────────────────────────────┐
           │ Response to User                    │
           ├─────────────────────────────────────┤
           │ • Extracted sentences               │
           │ • Bias classification results       │
           │ • Confidence scores                 │
           │ • Biased/neutral counts             │
           │ • Processing metadata               │
           └──────────────────┬──────────────────┘
                              │
                              ▼
                   ┌─────────────────────┐
                   │ Frontend Display    │
                   │ • Show results      │
                   │ • Highlight biases  │
                   │ • Display stats     │
                   └─────────────────────┘
Data Flow Diagram
INPUT: PDF File
 │
 ├─ Metadata
 │   ├─ Filename
 │   ├─ File size
 │   └─ Upload timestamp
 │
 └─ Binary content
     │
     ▼
PDFProcessor.process_pdf_from_bytes()
 │
 ├─ extract_text_from_pdf()
 │   │  Uses: fitz.open() (PyMuPDF)
 │   │  Output: Raw text string + page count
 │   └─ ~200-500ms
 │
 ├─ clean_text()
 │   │  Uses: Regex replacements
 │   │  Output: Normalized text string
 │   └─ ~50ms
 │
 ├─ split_into_sentences()
 │   │  Uses: Nepali-aware regex patterns
 │   │  Output: List of sentences
 │   └─ ~50-150ms
 │
 └─ refine_sentences_with_llm() [OPTIONAL]
     │  Uses: Mistral API
     │  Input: JSON-formatted sentences
     │  Output: Refined list of sentences
     └─ ~2-5s (includes API latency)
     │
     ▼
OUTPUT 1 (Extract Only):
{
  "success": true,
  "sentences": ["वाक्य १", "वाक्य २", ...],
  "total_sentences": 15,
  "raw_text": "फुल text...",
  "filename": "doc.pdf"
}
OUTPUT 2 (Extract + Bias):
{
  "success": true,
  "total_sentences": 15,
  "biased_count": 2,
  "neutral_count": 13,
  "results": [
    {
      "sentence": "वाक्य १",
      "category": "neutral",
      "confidence": 0.95,
      "is_biased": false
    },
    ...
  ],
  "filename": "doc.pdf"
}
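The cleaning and segmentation steps that feed OUTPUT 1 can be sketched as follows. This is a minimal illustration and not the project's actual code: the regex patterns, the `MIN_SENTENCE_CHARS` cut-off, and the `build_extract_response` helper are assumptions, since the source only states that whitespace is normalized, text is split on the danda (।) and other punctuation, and short fragments are filtered.

```python
import re

# Hypothetical cut-off: the source only says that short fragments are filtered.
MIN_SENTENCE_CHARS = 3


def clean_text(text: str) -> str:
    """Collapse newlines and repeated whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def split_into_sentences(text: str) -> list[str]:
    """Split after a danda (।), '?' or '!' and drop very short fragments."""
    # The lookbehind keeps each terminator attached to its sentence.
    parts = re.split(r"(?<=[।?!])\s*", text)
    return [p.strip() for p in parts if len(p.strip()) >= MIN_SENTENCE_CHARS]


def build_extract_response(raw_text: str, filename: str) -> dict:
    """Assemble a dict shaped like OUTPUT 1 (extract only)."""
    sentences = split_into_sentences(clean_text(raw_text))
    return {
        "success": True,
        "sentences": sentences,
        "total_sentences": len(sentences),
        "raw_text": raw_text,
        "filename": filename,
    }
```

Keeping the terminator attached to each sentence means the downstream classifier sees the same punctuation the author wrote; a real implementation would likely also need to handle abbreviations and numbered lists.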
Component Architecture
┌───────────────────────────────────────────────────────────────┐
│  utility/ module                                              │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  PDFProcessor class                                           │
│  ├─ __init__(mistral_api_key)                                 │
│  │   └─ Initialize MistralClient                              │
│  │                                                            │
│  ├─ extract_text_from_pdf(pdf_path)                           │
│  │   ├─ fitz.open(pdf_path)                                   │
│  │   ├─ Iterate pages                                         │
│  │   ├─ get_text("text")                                      │
│  │   └─ Return: raw text                                      │
│  │                                                            │
│  ├─ clean_text(text)                                          │
│  │   ├─ Remove newlines                                       │
│  │   ├─ Normalize spaces                                      │
│  │   └─ Return: cleaned text                                  │
│  │                                                            │
│  ├─ split_into_sentences(text)                                │
│  │   ├─ Apply Nepali regex patterns                           │
│  │   ├─ Filter short fragments                                │
│  │   └─ Return: sentence list                                 │
│  │                                                            │
│  ├─ refine_sentences_with_llm(sentences)                      │
│  │   ├─ Format as JSON                                        │
│  │   ├─ Send to Mistral API                                   │
│  │   ├─ Parse JSON response                                   │
│  │   └─ Return: refined sentences                             │
│  │                                                            │
│  ├─ process_pdf(pdf_path, refine_with_llm)                    │
│  │   └─ Complete pipeline (file path)                         │
│  │                                                            │
│  └─ process_pdf_from_bytes(pdf_bytes, refine_with_llm)        │
│      └─ Complete pipeline (bytes)                             │
│                                                               │
└───────────────────────────────────────────────────────────────┘
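For the bytes-based entry point, text extraction with PyMuPDF might look like the sketch below. The function name `extract_text_from_bytes` and the empty-payload check are assumptions for illustration; only `fitz.open`, page iteration, and `get_text("text")` come from the outline above. The import is deferred so the module can be loaded even where PyMuPDF is not installed.

```python
def extract_text_from_bytes(pdf_bytes: bytes) -> tuple[str, int]:
    """Return (raw_text, page_count) for an in-memory PDF.

    Hypothetical helper: the real PDFProcessor method may differ.
    """
    if not pdf_bytes:
        # Reject empty uploads before touching the PDF library.
        raise ValueError("empty PDF payload")

    import fitz  # PyMuPDF, imported lazily

    # stream= plus filetype= opens a document from memory instead of a path.
    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        pages = [page.get_text("text") for page in doc]
        return "\n".join(pages), doc.page_count
```

Opening from a stream avoids writing the upload to a temporary file, which matches the "PDF Bytes" path in the system overview.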
┌───────────────────────────────────────────────────────────────┐
│  api/routes/ module                                           │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  pdf_processing.py (FastAPI routes)                           │
│  ├─ POST /api/v1/process-pdf                                  │
│  │   ├─ Receive: file (UploadFile), refine_with_llm           │
│  │   ├─ Call: PDFProcessor.process_pdf_from_bytes()           │
│  │   └─ Return: PDFProcessingResponse                         │
│  │                                                            │
│  ├─ POST /api/v1/process-pdf-to-bias                          │
│  │   ├─ Receive: file, refine_with_llm, confidence_threshold  │
│  │   ├─ Call: PDFProcessor.process_pdf_from_bytes()           │
│  │   ├─ Call: run_bias_detection()                            │
│  │   └─ Return: PDFToBiasDetectionResponse                    │
│  │                                                            │
│  └─ GET /api/v1/pdf-health                                    │
│      ├─ Check: PDFProcessor availability                      │
│      ├─ Check: Mistral client status                          │
│      └─ Return: health status                                 │
│                                                               │
└───────────────────────────────────────────────────────────────┘
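How the `/api/v1/process-pdf-to-bias` route might turn classifier output into its response can be sketched without the web framework. Everything here is illustrative: `build_bias_response` is a hypothetical helper, the `(sentence, category, confidence)` tuple shape is assumed, and the rule for applying `confidence_threshold` is a guess, since the source does not say how the threshold is used.

```python
from typing import Iterable, Tuple


def build_bias_response(
    predictions: Iterable[Tuple[str, str, float]],
    filename: str,
    confidence_threshold: float = 0.5,  # hypothetical default
) -> dict:
    """Assemble a dict shaped like OUTPUT 2 (extract + bias)."""
    results = []
    for sentence, category, confidence in predictions:
        # Assumption: a sentence counts as biased only when the model picks a
        # non-neutral category with at least the requested confidence.
        is_biased = category != "neutral" and confidence >= confidence_threshold
        results.append(
            {
                "sentence": sentence,
                "category": category,
                "confidence": confidence,
                "is_biased": is_biased,
            }
        )
    biased_count = sum(1 for r in results if r["is_biased"])
    return {
        "success": True,
        "total_sentences": len(results),
        "biased_count": biased_count,
        "neutral_count": len(results) - biased_count,
        "results": results,
        "filename": filename,
    }
```

Under this rule a low-confidence non-neutral prediction is counted with the neutral sentences; a different policy (for example, a separate "uncertain" bucket) would change the counts but not the response shape.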
┌───────────────────────────────────────────────────────────────┐
│  api/schemas.py (Pydantic models)                             │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  PDFProcessingRequest / Response                              │
│  PDFToBiasDetectionRequest / Response                         │
│  BiasResult (reused from bias_detection)                      │
│                                                               │
└───────────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────────┐
│  External Dependencies                                        │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  PyMuPDF (fitz)                                               │
│  └─ PDF text extraction                                       │
│                                                               │
│  Mistral API Client                                           │
│  └─ LLM-based sentence refinement                             │
│                                                               │
│  FastAPI                                                      │
│  └─ API framework (from module_a)                             │
│                                                               │
│  Pydantic                                                     │
│  └─ Data validation                                           │
│                                                               │
└───────────────────────────────────────────────────────────────┘
Processing Timeline
Timeline for a single PDF (~10 KB):
Time        Component               Duration    Cumulative
──────────────────────────────────────────────────────────
0ms      ├─ API receives upload     ~5ms        5ms
         │
5ms      ├─ Read bytes              ~10ms       15ms
         │
15ms     ├─ PyMuPDF extraction      ~200ms      215ms
         │
215ms    ├─ Text cleaning           ~30ms       245ms
         │
245ms    ├─ Sentence split          ~100ms      345ms
         │
345ms    ├─ LLM refinement          ~3500ms     3845ms
         │  (if enabled)
         │
3845ms   ├─ Bias detection          ~500ms      4345ms
         │  (if enabled)
         │
4345ms   └─ Return response         ~5ms        4350ms
Total time:
 ├─ With LLM + bias: ~4.35 seconds
 ├─ With LLM only: ~3.85 seconds
 ├─ Without LLM: ~0.35 seconds
 └─ Bias detection stage alone: ~0.5 seconds (on top of sentence extraction)
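The totals follow directly from the per-stage durations in the timeline. The short sketch below reproduces that arithmetic; the stage keys and the `total_ms` helper are illustrative names, and the figures are the approximate values from the table rather than measurements.

```python
# Approximate per-stage durations in milliseconds, copied from the timeline.
STAGE_MS = {
    "receive_upload": 5,
    "read_bytes": 10,
    "pymupdf_extraction": 200,
    "text_cleaning": 30,
    "sentence_split": 100,
    "llm_refinement": 3500,  # optional stage
    "bias_detection": 500,   # optional stage
    "return_response": 5,
}


def total_ms(use_llm: bool, use_bias: bool) -> int:
    """Sum the stages that run for a given combination of options."""
    enabled = {"llm_refinement": use_llm, "bias_detection": use_bias}
    # Stages that are not optional default to enabled.
    return sum(ms for stage, ms in STAGE_MS.items() if enabled.get(stage, True))


if __name__ == "__main__":
    for llm, bias in [(True, True), (True, False), (False, False), (False, True)]:
        print(f"LLM={llm!s:5} bias={bias!s:5} -> {total_ms(llm, bias)} ms")
```

With these figures, LLM refinement accounts for roughly 80% of the full-pipeline time, which is why it is the stage worth making optional.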
Error Handling Flow
API Request
 │
 ├─ Validate file
 │   ├─ Is PDF? → No → 400 Bad Request
 │   ├─ Is empty? → Yes → 400 Bad Request
 │   └─ Is valid? → Continue
 │
 └─ Process PDF
     ├─ Extract text
     │   ├─ File not found? → FileNotFoundError → 500
     │   ├─ Permission denied? → Exception → 500
     │   └─ Success? → Continue
     │
     ├─ Split sentences
     │   ├─ No sentences? → Warning in response → 200 (empty results)
     │   └─ Success? → Continue
     │
     └─ Refine with LLM (if enabled)
         ├─ API key missing? → Warning, use regex → 200 (fallback)
         ├─ Network error? → Warning, use regex → 200 (fallback)
         ├─ Invalid JSON? → Warning, use regex → 200 (fallback)
         └─ Success? → Return refined sentences → 200
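The fallback branch for LLM refinement can be sketched as a function that never raises and instead returns the regex sentences plus a warning. This is an assumption-laden illustration: `refine_with_fallback` and the injected `call_llm` callable are hypothetical stand-ins for the Mistral client call, whose real interface is not shown in this document.

```python
import json
from typing import Callable, List, Optional, Tuple


def refine_with_fallback(
    sentences: List[str],
    call_llm: Optional[Callable[[str], str]],
) -> Tuple[List[str], Optional[str]]:
    """Return (sentences, warning); warning is None when refinement succeeded."""
    if call_llm is None:
        # Corresponds to "API key missing": no client could be configured.
        return sentences, "LLM not configured; using regex segmentation"
    try:
        reply = call_llm(json.dumps(sentences, ensure_ascii=False))
        refined = json.loads(reply)
    except Exception as exc:  # network errors and invalid JSON both land here
        return sentences, f"LLM refinement failed ({exc}); using regex segmentation"
    if not isinstance(refined, list) or not all(isinstance(s, str) for s in refined):
        return sentences, "LLM returned an unexpected shape; using regex segmentation"
    return refined, None
```

Because every failure path returns the original list, the route can always answer with 200 and surface the warning in the response body, as the flow above describes.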
State Diagram
             ┌─────────────┐
             │    Idle     │
             └──────┬──────┘
                    │
            User uploads PDF
                    │
                    ▼
           ┌─────────────────┐
           │   Validating    │
           └────────┬────────┘
                    │
          ┌─────────┴─────────┐
          │                   │
       Invalid              Valid
          │                   │
          ▼                   ▼
     ┌─────────┐     ┌─────────────────┐
     │  Error  │     │   Extracting    │
     └─────────┘     └────────┬────────┘
                              │
                     ┌────────┴────────┐
                     │                 │
                  Success           No Text
                     │                 │
                     ▼                 ▼
              ┌─────────────┐     ┌─────────┐
              │  Splitting  │     │  Error  │
              └──────┬──────┘     └─────────┘
                     │
            ┌────────┴────────┐
            │                 │
         Success        No Sentences
            │                 │
            ▼                 ▼
     ┌─────────────┐     ┌─────────┐
     │  Refining   │     │  Error  │
     │ (Optional)  │     └─────────┘
     └──────┬──────┘
            │
       ┌────┴────────────┐
       │                 │
    Success           Failed
       │                 │
       ▼                 ▼
┌─────────────┐    ┌───────────┐
│ Formatting  │    │ Fallback  │
│ Response    │    │ (Regex)   │
└──────┬──────┘    └─────┬─────┘
       │                 │
       └────────┬────────┘
                │
                ▼
       ┌─────────────────┐
       │ Bias Detection  │
       │ (if enabled)    │
       └────────┬────────┘
                │
       ┌────────┴────────┐
       │                 │
    Success            Error
       │                 │
       ▼                 ▼
┌─────────────┐    ┌───────────┐
│ Format      │    │ Return    │
│ Response    │    │ Error     │
└──────┬──────┘    └─────┬─────┘
       │                 │
       └────────┬────────┘
                │
                ▼
       ┌─────────────────┐
       │ Send Response   │
       │ to Client       │
       └─────────────────┘
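The state diagram can also be written down as a transition table, which makes the allowed paths easy to check. This is a simplified sketch: the event names are invented for illustration, the two response-formatting boxes are folded into a single `RESPONDING` state, and nothing here is taken from the project's code.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    VALIDATING = auto()
    EXTRACTING = auto()
    SPLITTING = auto()
    REFINING = auto()
    FALLBACK = auto()
    BIAS_DETECTION = auto()
    RESPONDING = auto()
    ERROR = auto()


# (current state, event) -> next state; event names are hypothetical.
TRANSITIONS = {
    (State.IDLE, "upload"): State.VALIDATING,
    (State.VALIDATING, "valid"): State.EXTRACTING,
    (State.VALIDATING, "invalid"): State.ERROR,
    (State.EXTRACTING, "success"): State.SPLITTING,
    (State.EXTRACTING, "no_text"): State.ERROR,
    (State.SPLITTING, "success"): State.REFINING,
    (State.SPLITTING, "no_sentences"): State.ERROR,
    (State.REFINING, "success"): State.BIAS_DETECTION,
    (State.REFINING, "failed"): State.FALLBACK,
    (State.FALLBACK, "done"): State.BIAS_DETECTION,
    # Both outcomes of bias detection end with a response to the client.
    (State.BIAS_DETECTION, "success"): State.RESPONDING,
    (State.BIAS_DETECTION, "error"): State.RESPONDING,
}


def run(events):
    """Fold a sequence of events over the table, starting from IDLE."""
    state = State.IDLE
    for event in events:
        try:
            state = TRANSITIONS[(state, event)]
        except KeyError:
            raise ValueError(f"no transition from {state.name} on {event!r}") from None
    return state
```

A table like this can double as a test fixture: any path through the diagram is a list of events, and an illegal step fails loudly instead of silently continuing.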
Integration Points
Frontend (Next.js)
 │
 ├─► /api/v1/process-pdf
 │    └─ Use for: Sentence extraction only
 │
 ├─► /api/v1/process-pdf-to-bias
 │    └─ Use for: Full analysis (PDF → Bias)
 │
 └─► /api/v1/pdf-health
      └─ Use for: Service status check
Internal Integration
 │
 ├─► PDFProcessor class
 │    └─ Use in: Custom workflows
 │
 └─► run_bias_detection() function
      └─ Use in: Direct bias analysis
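This architecture provides:
 ✓ A scalable processing pipeline
 ✓ Clear separation of concerns (utility, routes, schemas)
 ✓ Reusable components (PDFProcessor, run_bias_detection())
 ✓ Error resilience (regex fallback when LLM refinement fails)
 ✓ Performance options (LLM refinement and bias detection are optional stages)
 ✓ Easy integration points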