Architecture: UPPSC PCS 2024 Statistical Audit
System Overview
A two-script data extraction and analysis pipeline that operates entirely on official public PDFs. No external services, no API dependencies. The pipeline reads raw government PDF result files, extracts roll numbers using regex, groups by series prefix, and outputs structured counts for statistical comparison.
Components
Extraction Layer (extract_counts.py + verify_extraction.py)
- extract_counts.py: Opens each of the three PDFs (prelims, mains, final) using pdfplumber, extracts all text page by page, applies a regex pattern to find 7-digit roll numbers (with a 6-digit fallback), groups them by their first two digits (the series prefix), and writes the full breakdown to counts.json
- verify_extraction.py: A single-purpose validation script that confirms total candidate counts match the officially published figures (15,066 prelims / 2,720 mains / 933 final), as a check that no candidates were missed or double-counted
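The regex and grouping step described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of extract_counts.py: the names `ROLL_7`, `ROLL_6`, `extract_roll_numbers`, and `count_by_series` are hypothetical, and applying the 6-digit fallback per page (only when a page yields no 7-digit matches) is an assumed reading of "fallback".

```python
import re
from collections import Counter

# Hypothetical patterns: \b keeps a 6-digit match from firing inside a 7-digit number.
ROLL_7 = re.compile(r"\b\d{7}\b")
ROLL_6 = re.compile(r"\b\d{6}\b")


def extract_roll_numbers(page_text):
    """Return roll numbers found in one page of extracted text.

    Assumption: the 6-digit pattern is tried only when the page
    contains no 7-digit roll numbers at all.
    """
    rolls = ROLL_7.findall(page_text)
    if not rolls:
        rolls = ROLL_6.findall(page_text)
    return rolls


def count_by_series(roll_numbers):
    """Group roll numbers by their first two digits (series prefix)."""
    return Counter(roll[:2] for roll in roll_numbers)


if __name__ == "__main__":
    sample = "0012345 0198765 1234567 declared in 2024"
    print(count_by_series(extract_roll_numbers(sample)))
```

Roll numbers are kept as strings throughout so that leading zeros in prefixes such as 00 and 01 are preserved.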
Analysis Layer (report.md)
- A structured markdown report containing four analysis tables:
  - Prelims baseline distribution by series
  - Prelims → Mains survival rate by group
  - Mains → Final conversion rate by group
  - End-to-end selection rate (prelims to final seats)
- Includes an expected vs. actual variance calculation: +136 excess seats for the 00 & 01 group over their proportional expectation
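One way to read "proportional expectation" is that a group's expected share of final seats equals its share of prelims qualifiers. The sketch below works under that assumption. The function names are hypothetical, and the numbers in the usage example are toy values chosen for easy arithmetic, not figures from the audit.

```python
def expected_seats(group_prelims, total_prelims, total_final):
    """Seats a group would receive if final seats were allocated
    in proportion to the group's share of prelims qualifiers."""
    return total_final * group_prelims / total_prelims


def excess_seats(group_final, group_prelims, total_prelims, total_final):
    """Actual seats minus proportional expectation (positive = over-represented)."""
    return group_final - expected_seats(group_prelims, total_prelims, total_final)


if __name__ == "__main__":
    # Toy numbers only: a group holding 200 of 1000 prelims slots
    # that ends up with 35 of 100 final seats.
    print(expected_seats(200, 1000, 100))    # 20.0
    print(excess_seats(35, 200, 1000, 100))  # 15.0
```

A positive result means the group holds more final seats than its prelims share would predict. On its own it says nothing about the cause.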
Data Flow
Official UPPSC PDFs (pre_2024.pdf, mains_result.pdf, final_result.pdf)
↓
pdfplumber text extraction, page by page
↓
Regex: extract all 7-digit roll numbers
↓
Group by first 2 digits (series prefix)
↓
counts.json: per-stage, per-series breakdown
↓
report.md: stage-by-stage statistical comparison
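The counts.json step in the flow above might look something like the sketch below. The JSON layout (a `total` and a `by_series` map per stage) and the names `build_counts` and `write_counts` are assumptions for illustration. The source only states that the file holds a per-stage, per-series breakdown.

```python
import json
from collections import Counter


def build_counts(rolls_by_stage):
    """Turn {stage: [roll numbers]} into a per-stage, per-series breakdown.

    The output schema is hypothetical, not the project's actual format.
    """
    counts = {}
    for stage, rolls in rolls_by_stage.items():
        per_series = Counter(roll[:2] for roll in rolls)
        counts[stage] = {
            "total": len(rolls),
            "by_series": dict(sorted(per_series.items())),
        }
    return counts


def write_counts(counts, path="counts.json"):
    """Serialize the breakdown to disk as indented JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(counts, fh, indent=2)


if __name__ == "__main__":
    demo = {"prelims": ["0012345", "0012346", "0112345"], "final": ["0012345"]}
    print(json.dumps(build_counts(demo), indent=2))
```

Storing a per-stage total next to the series map makes it easy to compare against the officially published figures before any series analysis runs.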
Edge Cases Handled
- Mains PDF has a cover page (non-data page 1), which is skipped with skip_page_1=True
- Some UPPSC roll numbers are 6 digits; these are handled with a fallback regex
- Verification script independently confirms total counts before any series analysis is run
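The cover-page skip and the total-count check listed above could be sketched as below. The function names `page_texts` and `verify_totals` are hypothetical, and only the `skip_page_1` flag and the official totals come from this document. pdfplumber is imported inside the function so that the verification helper can be used without the third-party dependency.

```python
# Officially published totals quoted in this document.
OFFICIAL_TOTALS = {"prelims": 15066, "mains": 2720, "final": 933}


def page_texts(pdf_path, skip_page_1=False):
    """Return extracted text per page, optionally dropping the first page."""
    import pdfplumber  # third-party; imported lazily

    with pdfplumber.open(pdf_path) as pdf:
        pages = pdf.pages[1:] if skip_page_1 else pdf.pages
        # extract_text() may yield no text for an empty page, so fall back to "".
        return [page.extract_text() or "" for page in pages]


def verify_totals(extracted_totals, official=OFFICIAL_TOTALS):
    """Return {stage: (extracted, official)} for every stage that disagrees.

    An empty dict means every extracted total matches the published figure.
    """
    mismatches = {}
    for stage, expected in official.items():
        actual = extracted_totals.get(stage)
        if actual != expected:
            mismatches[stage] = (actual, expected)
    return mismatches


if __name__ == "__main__":
    print(verify_totals({"prelims": 15066, "mains": 2719, "final": 933}))
```

A matching total is a necessary check rather than a complete one: a missed candidate and a duplicated one would cancel out, so a separate duplicate check on the extracted roll numbers may be worth adding.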