# PDF Processing System Architecture

## System Overview

```
┌───────────────────────────────────────────────────────────┐
│                    Frontend (Next.js)                     │
│  ┌─────────────────────────────────────────────────────┐  │
│  │                PDF Upload Component                 │  │
│  │  • File input                                       │  │
│  │  • Drag & drop                                      │  │
│  │  • Progress indicator                               │  │
│  └─────────────────────────────────────────────────────┘  │
└─────────────────────────────┬─────────────────────────────┘
                              │
                        HTTP/FormData
                              │
                              ▼
┌───────────────────────────────────────────────────────────┐
│                      FastAPI Backend                      │
├───────────────────────────────────────────────────────────┤
│                                                           │
│  POST /api/v1/process-pdf                                 │
│    └─► Extract sentences only                             │
│                                                           │
│  POST /api/v1/process-pdf-to-bias                         │
│    └─► Extract + Analyze bias (complete pipeline)         │
│                                                           │
│  GET /api/v1/pdf-health                                   │
│    └─► Service status check                               │
│                                                           │
└─────────────────────────────┬─────────────────────────────┘
                              │
                 ┌────────────┴────────────┐
                 │                         │
                 ▼                         ▼
          ┌─────────────┐         ┌─────────────────┐
          │  PDF Bytes  │         │  PDFProcessor   │
          │             │         │   (utility/)    │
          └──────┬──────┘         └────────┬────────┘
                 │                         │
                 └────────────┬────────────┘
                              │
                              ▼
               ┌─────────────────────────────┐
               │  Step 1: Extract Text       │
               │  PyMuPDF (fitz)             │
               │  • Read PDF pages           │
               │  • Extract raw text         │
               │  • Handle multi-page        │
               └──────────────┬──────────────┘
                              │
                              ▼
               ┌─────────────────────────────┐
               │  Step 2: Clean Text         │
               │  Regex Processing           │
               │  • Remove extra whitespace  │
               │  • Normalize formatting     │
               └──────────────┬──────────────┘
                              │
                              ▼
               ┌─────────────────────────────┐
               │  Step 3: Segment Sentences  │
               │  Nepali-aware Regex         │
               │  • Split on danda (।)       │
               │  • Handle punctuation       │
               │  • Filter short fragments   │
               └──────────────┬──────────────┘
                              │
                              ▼
               ┌─────────────────────────────┐
               │  Step 4: Refine (Optional)  │
               │  Mistral LLM API            │
               │  • Correct segmentation     │
               │  • Remove duplicates        │
               │  • JSON formatting          │
               └──────────────┬──────────────┘
                              │
        ┌─────────────────────┴─────────┐
        │                               │
        ▼                               ▼
 ┌─────────────┐               ┌─────────────────┐
 │  Sentences  │               │   Return JSON   │
 │    List     │               │      Array      │
 └──────┬──────┘               └────────┬────────┘
        │                               │
        │            ┌──────────────────┴─────┐
        │            │                        │
        │            ▼                        ▼
        │   ┌─────────────────┐     ┌───────────────────┐
        │   │  Option A:      │     │  Option B:        │
        │   │  Return to      │     │  Pass to Bias     │
        │   │  User           │     │  Detection        │
        │   └─────────────────┘     └─────────┬─────────┘
        │                                     │
        │                                     ▼
        │                         ┌───────────────────────┐
        │                         │  Bias Detection Model │
        │                         │  DistilBERT Nepali    │
        │                         │  (module_b)           │
        │                         ├───────────────────────┤
        │                         │  • Classify sentence  │
        │                         │  • 11 categories:     │
        │                         │    - neutral          │
        │                         │    - gender           │
        │                         │    - caste            │
        │                         │    - religion         │
        │                         │    - political        │
        │                         │    - age              │
        │                         │    - disability       │
        │                         │    - appearance       │
        │                         │    - social status    │
        │                         │    - religiosity      │
        │                         │    - ambiguity        │
        │                         └───────────┬───────────┘
        │                                     │
        └─────────────────────┬───────────────┘
                              │
                              ▼
            ┌───────────────────────────────────┐
            │         Response to User          │
            ├───────────────────────────────────┤
            │  • Extracted sentences            │
            │  • Bias classification results    │
            │  • Confidence scores              │
            │  • Biased/neutral counts          │
            │  • Processing metadata            │
            └─────────────────┬─────────────────┘
                              │
                              ▼
                  ┌───────────────────────┐
                  │  Frontend Display     │
                  │  • Show results       │
                  │  • Highlight biases   │
                  │  • Display stats      │
                  └───────────────────────┘
```
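
Steps 2 and 3 of the overview can be sketched with the standard `re` module. This is a minimal illustration, not the project's actual `PDFProcessor` code: the exact regex patterns, the set of terminators (`।`, `?`, `!`) and the `MIN_SENTENCE_CHARS` cutoff are assumptions, since the diagram only says that the text is split on the danda and that short fragments are filtered.

```python
import re

# Assumed cutoff: the diagram says short fragments are filtered but gives no length.
MIN_SENTENCE_CHARS = 5

# A sentence is a run of non-terminators plus an optional closing danda, '?' or '!'.
SENTENCE_PATTERN = re.compile(r"[^।?!]+[।?!]?")


def clean_text(text: str) -> str:
    """Collapse newlines and repeated whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def split_into_sentences(text: str, min_chars: int = MIN_SENTENCE_CHARS) -> list[str]:
    """Split cleaned text into sentences and drop fragments shorter than min_chars."""
    candidates = (match.strip() for match in SENTENCE_PATTERN.findall(text))
    return [sentence for sentence in candidates if len(sentence) >= min_chars]


raw = "यो पहिलो\nवाक्य   हो। के यो दोस्रो वाक्य हो? हो।"
sentences = split_into_sentences(clean_text(raw))
print(sentences)
```

The trailing `हो।` is shorter than the assumed cutoff, so only the first two sentences survive the filter.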

## Data Flow Diagram

```
INPUT: PDF File
│
├─ Metadata
│  ├─ Filename
│  ├─ File size
│  └─ Upload timestamp
│
└─ Binary content
   │
   ▼
PDFProcessor.process_pdf_from_bytes()
│
├─ extract_text_from_pdf()
│  │ Uses: fitz.open() (PyMuPDF)
│  │ Output: Raw text string + page count
│  └─ ~200-500ms
│
├─ clean_text()
│  │ Uses: Regex replacements
│  │ Output: Normalized text string
│  └─ ~50ms
│
├─ split_into_sentences()
│  │ Uses: Nepali-aware regex patterns
│  │ Output: List[sentences]
│  └─ ~50-150ms
│
└─ refine_sentences_with_llm() [OPTIONAL]
   │ Uses: Mistral API
   │ Input: JSON-formatted sentences
   │ Output: Refined List[sentences]
   └─ ~2-5s (includes API latency)
   ▼
OUTPUT 1 (Extract Only):
{
  "success": true,
  "sentences": ["वाक्य १", "वाक्य २", ...],
  "total_sentences": 15,
  "raw_text": "फुल text...",
  "filename": "doc.pdf"
}

OUTPUT 2 (Extract + Bias):
{
  "success": true,
  "total_sentences": 15,
  "biased_count": 2,
  "neutral_count": 13,
  "results": [
    {
      "sentence": "वाक्य १",
      "category": "neutral",
      "confidence": 0.95,
      "is_biased": false
    },
    ...
  ],
  "filename": "doc.pdf"
}
```
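
To show how the fields of OUTPUT 2 relate to one another, here is a rough sketch that assembles the response from per-sentence predictions. The helper `build_bias_response` and the rule for `is_biased` (a non-neutral category whose confidence reaches the threshold) are assumptions for illustration; the document does not say how `confidence_threshold` is applied.

```python
from dataclasses import asdict, dataclass


@dataclass
class BiasResult:
    sentence: str
    category: str
    confidence: float
    is_biased: bool


def build_bias_response(predictions, filename, confidence_threshold=0.5):
    """Assemble an OUTPUT 2 style payload from (sentence, category, confidence) tuples.

    Assumption: a sentence counts as biased only when its category is not
    "neutral" and its confidence reaches the threshold.
    """
    results = [
        BiasResult(sentence, category, confidence,
                   category != "neutral" and confidence >= confidence_threshold)
        for sentence, category, confidence in predictions
    ]
    biased_count = sum(result.is_biased for result in results)
    return {
        "success": True,
        "total_sentences": len(results),
        "biased_count": biased_count,
        "neutral_count": len(results) - biased_count,
        "results": [asdict(result) for result in results],
        "filename": filename,
    }


response = build_bias_response(
    [("वाक्य १", "neutral", 0.95), ("वाक्य २", "gender", 0.90), ("वाक्य ३", "caste", 0.30)],
    filename="doc.pdf",
)
print(response["biased_count"], response["neutral_count"])
```

Under this rule `biased_count + neutral_count` always equals `total_sentences`, which matches the sample payload (2 + 13 = 15).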

## Component Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                       utility/ module                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  PDFProcessor Class                                         │
│  ├─ __init__(mistral_api_key)                               │
│  │  └─ Initialize MistralClient                             │
│  │                                                          │
│  ├─ extract_text_from_pdf(pdf_path)                         │
│  │  ├─ fitz.open(pdf_path)                                  │
│  │  ├─ Iterate pages                                        │
│  │  ├─ get_text("text")                                     │
│  │  └─ Return: raw text                                     │
│  │                                                          │
│  ├─ clean_text(text)                                        │
│  │  ├─ Remove newlines                                      │
│  │  ├─ Normalize spaces                                     │
│  │  └─ Return: cleaned text                                 │
│  │                                                          │
│  ├─ split_into_sentences(text)                              │
│  │  ├─ Apply Nepali regex patterns                          │
│  │  ├─ Filter short fragments                               │
│  │  └─ Return: sentence list                                │
│  │                                                          │
│  ├─ refine_sentences_with_llm(sentences)                    │
│  │  ├─ Format as JSON                                       │
│  │  ├─ Send to Mistral API                                  │
│  │  ├─ Parse JSON response                                  │
│  │  └─ Return: refined sentences                            │
│  │                                                          │
│  ├─ process_pdf(pdf_path, refine_with_llm)                  │
│  │  └─ Complete pipeline (file path)                        │
│  │                                                          │
│  └─ process_pdf_from_bytes(pdf_bytes, refine_with_llm)      │
│     └─ Complete pipeline (bytes)                            │
│                                                             │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                     api/routes/ module                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  pdf_processing.py (FastAPI Routes)                         │
│  ├─ POST /api/v1/process-pdf                                │
│  │  ├─ Receive: file (UploadFile), refine_with_llm          │
│  │  ├─ Call: PDFProcessor.process_pdf_from_bytes()          │
│  │  └─ Return: PDFProcessingResponse                        │
│  │                                                          │
│  ├─ POST /api/v1/process-pdf-to-bias                        │
│  │  ├─ Receive: file, refine_with_llm, confidence_threshold │
│  │  ├─ Call: PDFProcessor.process_pdf_from_bytes()          │
│  │  ├─ Call: run_bias_detection()                           │
│  │  └─ Return: PDFToBiasDetectionResponse                   │
│  │                                                          │
│  └─ GET /api/v1/pdf-health                                  │
│     ├─ Check: PDFProcessor availability                     │
│     ├─ Check: Mistral client status                         │
│     └─ Return: health status                                │
│                                                             │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│              api/schemas.py (Pydantic Models)               │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  PDFProcessingRequest / Response                            │
│  PDFToBiasDetectionRequest / Response                       │
│  BiasResult (reused from bias_detection)                    │
│                                                             │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                    External Dependencies                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  PyMuPDF (fitz)                                             │
│  └─ PDF text extraction                                     │
│                                                             │
│  Mistral API Client                                         │
│  └─ LLM-based sentence refinement                           │
│                                                             │
│  FastAPI                                                    │
│  └─ API framework (from module_a)                           │
│                                                             │
│  Pydantic                                                   │
│  └─ Data validation                                         │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
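
The two checks behind `GET /api/v1/pdf-health` might look roughly like the sketch below. It is only a guess at the logic: the environment variable name `MISTRAL_API_KEY`, the status strings and the response keys are assumptions, and a real route would wrap this function in a FastAPI handler.

```python
import importlib.util
import os
from typing import Mapping


def pdf_health(env: Mapping[str, str] = os.environ, pdf_module: str = "fitz") -> dict:
    """Report whether PDF extraction and LLM refinement look usable.

    Assumptions: PyMuPDF is importable as `fitz`, and the Mistral key is
    read from an environment variable named MISTRAL_API_KEY.
    """
    processor_available = importlib.util.find_spec(pdf_module) is not None
    mistral_configured = bool(env.get("MISTRAL_API_KEY"))

    if not processor_available:
        status = "unavailable"   # extraction cannot run at all
    elif mistral_configured:
        status = "healthy"       # full pipeline, including optional refinement
    else:
        status = "degraded"      # regex-only segmentation still works

    return {
        "status": status,
        "pdf_processor_available": processor_available,
        "mistral_client_configured": mistral_configured,
    }


print(pdf_health())
```

Treating a missing Mistral key as "degraded" rather than as a failure mirrors the error-handling flow, where LLM problems fall back to the regex sentences.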

## Processing Timeline

```
Timeline for Single PDF (~10 KB):

Time      Component               Duration    Cumulative
────────────────────────────────────────────────────────
0ms       ┌─ API receives upload  ~5ms        5ms
          │
5ms       ├─ Read bytes           ~10ms       15ms
          │
15ms      ├─ PyMuPDF extraction   ~200ms      215ms
          │
215ms     ├─ Text cleaning        ~30ms       245ms
          │
245ms     ├─ Sentence split       ~100ms      345ms
          │
345ms     ├─ LLM refinement       ~3500ms     3845ms
          │  (if enabled)
          │
3845ms    ├─ Bias detection       ~500ms      4345ms
          │  (if enabled)
          │
4345ms    └─ Return response      ~5ms        4350ms

Total Time:
├─ With LLM + Bias:     ~4.3 seconds
├─ With LLM only:       ~3.8 seconds
├─ Without LLM:         ~0.35 seconds
└─ Bias detection step: ~0.5 seconds (added on top of sentence extraction)
```
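
The totals follow directly from the per-stage estimates in the timeline. As a quick check of the arithmetic, the sketch below sums the stages for each configuration; the `STAGES` table simply restates the timeline, and the function name is illustrative.

```python
# (stage name, estimated duration in ms, option that enables it or None if always run)
STAGES = [
    ("API receives upload", 5, None),
    ("Read bytes", 10, None),
    ("PyMuPDF extraction", 200, None),
    ("Text cleaning", 30, None),
    ("Sentence split", 100, None),
    ("LLM refinement", 3500, "llm"),
    ("Bias detection", 500, "bias"),
    ("Return response", 5, None),
]


def total_ms(llm: bool, bias: bool) -> int:
    """Sum the estimated durations of every stage that runs in this configuration."""
    enabled = {"llm": llm, "bias": bias}
    return sum(ms for _, ms, option in STAGES if option is None or enabled[option])


for llm, bias in [(True, True), (True, False), (False, False), (False, True)]:
    print(f"llm={llm!s:5} bias={bias!s:5} total={total_ms(llm, bias)} ms")
```

The last configuration (bias detection without LLM refinement) is not listed in the timeline; by the same arithmetic it comes to about 0.85 seconds.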

## Error Handling Flow

```
API Request
│
├─ Validate file
│  ├─ Is PDF? → No → 400 Bad Request
│  ├─ Is empty? → Yes → 400 Bad Request
│  └─ Is valid? → Continue
│
└─ Process PDF
   ├─ Extract text
   │  ├─ File not found? → FileNotFoundError → 500
   │  ├─ Permission denied? → Exception → 500
   │  └─ Success? → Continue
   │
   ├─ Split sentences
   │  ├─ No sentences? → Warning in response → 200 (empty results)
   │  └─ Success? → Continue
   │
   └─ Refine with LLM (if enabled)
      ├─ API key missing? → Warning, use regex → 200 (fallback)
      ├─ Network error? → Warning, use regex → 200 (fallback)
      ├─ Invalid JSON? → Warning, use regex → 200 (fallback)
      └─ Success? → Return refined sentences → 200
```
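
The validation and fallback branches above can be sketched as two small helpers. The names `InvalidUpload`, `validate_upload` and `refine_with_fallback` are hypothetical, and checking the `.pdf` extension plus the `%PDF-` file signature is only one plausible way to answer "Is PDF?"; the document does not say how the real route decides.

```python
from typing import Callable, Optional


class InvalidUpload(ValueError):
    """Raised for uploads that should be answered with 400 Bad Request."""
    status_code = 400


def validate_upload(filename: str, content: bytes) -> None:
    """Reject empty files and files that do not look like PDFs."""
    if not content:
        raise InvalidUpload("Uploaded file is empty")
    # Assumed check: .pdf extension plus the "%PDF-" signature that PDF files start with.
    if not filename.lower().endswith(".pdf") or not content.startswith(b"%PDF-"):
        raise InvalidUpload("Uploaded file is not a PDF")


def refine_with_fallback(
    sentences: list[str],
    refine: Optional[Callable[[list[str]], list[str]]],
) -> tuple[list[str], list[str]]:
    """Return (sentences, warnings); keep the regex sentences if refinement fails."""
    if refine is None:  # for example, no API key was configured
        return sentences, ["LLM refinement unavailable; using regex sentences"]
    try:
        return refine(sentences), []
    except Exception as exc:  # network error, invalid JSON, ...; a real service may narrow this
        return sentences, [f"LLM refinement failed ({exc}); using regex sentences"]


def broken_refiner(sentences: list[str]) -> list[str]:
    raise ConnectionError("network unreachable")


print(refine_with_fallback(["वाक्य १"], broken_refiner))
```

Because every refinement failure degrades to a warning, the request still ends in a 200 response, as in the flow above.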

## State Diagram

```
                     ┌─────────────┐
                     │    Idle     │
                     └──────┬──────┘
                            │
                    User uploads PDF
                            │
                            ▼
                   ┌─────────────────┐
                   │   Validating    │
                   └────────┬────────┘
                            │
                  ┌─────────┴─────────┐
                  │                   │
               Invalid              Valid
                  │                   │
                  ▼                   ▼
             ┌─────────┐     ┌─────────────────┐
             │  Error  │     │   Extracting    │
             └─────────┘     └────────┬────────┘
                                      │
                            ┌─────────┴─────────┐
                            │                   │
                         Success             No Text
                            │                   │
                            ▼                   ▼
                     ┌─────────────┐       ┌─────────┐
                     │  Splitting  │       │  Error  │
                     └──────┬──────┘       └─────────┘
                            │
                  ┌─────────┴─────────┐
                  │                   │
               Success          No Sentences
                  │                   │
                  ▼                   ▼
           ┌─────────────┐       ┌─────────┐
           │  Refining   │       │  Error  │
           │ (Optional)  │       └─────────┘
           └──────┬──────┘
                  │
        ┌─────────┴─────────┐
        │                   │
     Success             Failed
        │                   │
        ▼                   ▼
┌───────────────┐     ┌───────────┐
│  Formatting   │     │ Fallback  │
│   Response    │     │  (Regex)  │
└───────┬───────┘     └─────┬─────┘
        │                   │
        └─────────┬─────────┘
                  │
                  ▼
         ┌─────────────────┐
         │ Bias Detection  │
         │  (if enabled)   │
         └────────┬────────┘
                  │
        ┌─────────┴─────────┐
        │                   │
     Success              Error
        │                   │
        ▼                   ▼
 ┌─────────────┐       ┌─────────┐
 │   Format    │       │ Return  │
 │  Response   │       │  Error  │
 └──────┬──────┘       └────┬────┘
        │                   │
        └─────────┬─────────┘
                  │
                  ▼
         ┌─────────────────┐
         │  Send Response  │
         │    to Client    │
         └─────────────────┘
```
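
One way to read the state diagram is as a transition table. The sketch below is a simplified interpretation, not code from the project: the state and event names are invented for illustration, the intermediate "Formatting Response" box is folded into the next state, and both outcomes of bias detection lead to the response being sent.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    VALIDATING = auto()
    EXTRACTING = auto()
    SPLITTING = auto()
    REFINING = auto()
    FALLBACK = auto()
    BIAS_DETECTION = auto()
    RESPONDING = auto()
    ERROR = auto()


# (current state, event) -> next state; event names are assumptions.
TRANSITIONS = {
    (State.IDLE, "upload"): State.VALIDATING,
    (State.VALIDATING, "valid"): State.EXTRACTING,
    (State.VALIDATING, "invalid"): State.ERROR,
    (State.EXTRACTING, "success"): State.SPLITTING,
    (State.EXTRACTING, "no_text"): State.ERROR,
    (State.SPLITTING, "success"): State.REFINING,
    (State.SPLITTING, "no_sentences"): State.ERROR,
    (State.REFINING, "success"): State.BIAS_DETECTION,
    (State.REFINING, "failed"): State.FALLBACK,
    (State.FALLBACK, "success"): State.BIAS_DETECTION,
    (State.BIAS_DETECTION, "success"): State.RESPONDING,
    (State.BIAS_DETECTION, "error"): State.RESPONDING,  # an error response is still sent
}


def advance(state: State, event: str) -> State:
    """Return the next state, or raise ValueError for an undefined transition."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"No transition from {state.name} on {event!r}") from None


def run(events: list[str]) -> State:
    """Replay a sequence of events starting from IDLE."""
    state = State.IDLE
    for event in events:
        state = advance(state, event)
    return state


print(run(["upload", "valid", "success", "success", "failed", "success", "success"]))
```

Keeping the transitions in a single table makes it easy to compare this diagram with the error-handling flow when the two are revised.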

## Integration Points

```
Frontend (Next.js)
│
├─► /api/v1/process-pdf
│   └─ Use for: Sentence extraction only
│
├─► /api/v1/process-pdf-to-bias
│   └─ Use for: Full analysis (PDF → Bias)
│
└─► /api/v1/pdf-health
    └─ Use for: Service status check

Internal Integration
│
├─► PDFProcessor class
│   └─ Use in: Custom workflows
│
└─► run_bias_detection() function
    └─ Use in: Direct bias analysis
```
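
For scripts or tests that call the endpoints without the Next.js frontend, a multipart request can be built with the standard library alone. This is a sketch under assumptions: `BASE_URL` is a placeholder for a local development server, and sending `refine_with_llm` as a form field next to `file` is a guess, since the document lists the parameters but not how they are transmitted.

```python
import uuid
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # assumed address of a local development server


def build_pdf_request(endpoint: str, filename: str, pdf_bytes: bytes,
                      refine_with_llm: bool = False) -> Request:
    """Build (but do not send) a multipart/form-data POST carrying a PDF upload."""
    boundary = uuid.uuid4().hex
    flag = str(refine_with_llm).lower()
    body = b"".join([
        # Assumed: refine_with_llm travels as a plain form field.
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="refine_with_llm"\r\n\r\n'
        f"{flag}\r\n".encode(),
        # The "file" field matches the UploadFile parameter named in the routes.
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: application/pdf\r\n\r\n".encode(),
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return Request(BASE_URL + endpoint, data=body, headers=headers, method="POST")


request = build_pdf_request("/api/v1/process-pdf-to-bias", "doc.pdf", b"%PDF-1.4 sample")
print(request.get_method(), request.full_url, len(request.data), "bytes")
```

Passing the returned object to `urllib.request.urlopen` would send it; the sketch stops short of that so it runs without a server.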

---

This architecture provides:

✓ Scalable processing pipeline
✓ Clear separation of concerns
✓ Reusable components
✓ Error resilience
✓ Performance optimization options
✓ Easy integration points