# Key Decisions — UPPSC PCS 2024 Statistical Audit

## Decision 1: pdfplumber for text extraction instead of pypdf
**What:** Used pdfplumber to extract text from the UPPSC result PDFs.
**Why:** pdfplumber was the first choice for this kind of forensic PDF work because it extracts raw text reliably from government-style PDFs, where page layout can be inconsistent.

## Decision 2: Regex on raw text instead of table parsing
**What:** Roll numbers were extracted using regex (`\b\d{7}\b`) on raw page text rather than trying to parse structured table data.
**Why:** The UPPSC PDFs contain many numbers besides roll numbers, including page numbers, registration IDs, and other numeric data. Table parsing would have depended on consistent column alignment, which these PDFs do not reliably provide. A regex targeting the 7-digit roll number pattern (with a 6-digit fallback) proved more robust and more precise.
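
A minimal sketch of the extraction described above. The function name `extract_roll_numbers` is hypothetical, and the fallback rule is an assumption: the 6-digit pattern is tried only when a page yields no 7-digit matches. The project's actual trigger for the fallback may differ. Matches are kept as strings so that leading zeros survive.

```python
import re

# 7-digit pattern from Decision 2; \b keeps it from matching inside longer IDs.
SEVEN_DIGIT = re.compile(r"\b\d{7}\b")
# 6-digit fallback; when it applies is an assumption in this sketch.
SIX_DIGIT = re.compile(r"\b\d{6}\b")


def extract_roll_numbers(page_text):
    """Return roll-number strings found in one page of raw text."""
    matches = SEVEN_DIGIT.findall(page_text)
    if not matches:
        # Assumed rule: fall back only if the page has no 7-digit numbers.
        matches = SIX_DIGIT.findall(page_text)
    return matches


sample = "Page 12\n0012345 0198765\nReg. No. 123456789"
print(extract_roll_numbers(sample))
```

In this sample the page number `12` and the 9-digit registration ID are ignored because neither is a standalone 7-digit token.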

## Decision 3: Series prefix grouping by first 2 digits
**What:** Candidates are grouped by the first two digits of their roll number (`r[:2]`) to identify their series.
**Why:** The investigation began with the observation that the anomaly was concentrated among candidates whose roll numbers start with `00` and `01`. Grouping by the first two digits was the most direct way to isolate these series and compare their selection rates against the rest of the pool across all three exam stages.
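
The grouping can be sketched as follows. This is an illustration under stated assumptions, not the project's script: `selection_rate_by_series` is a hypothetical helper, and the roll numbers in the example are synthetic, not taken from the UPPSC data. Roll numbers are handled as strings, which `r[:2]` requires and which preserves the leading zeros of the `00` and `01` series.

```python
from collections import Counter


def selection_rate_by_series(pool, selected):
    """Map each 2-digit series prefix to (selected count / pool count)."""
    pool_counts = Counter(r[:2] for r in pool)
    selected_counts = Counter(r[:2] for r in selected)
    # A Counter returns 0 for prefixes that have no selected candidates.
    return {
        prefix: selected_counts[prefix] / total
        for prefix, total in pool_counts.items()
    }


# Synthetic example: candidates at one stage and those who cleared it.
pool = ["0012345", "0012346", "0112345", "0112346",
        "2512345", "2512346", "2512347", "2512348"]
selected = ["0012345", "0012346", "0112345", "2512345"]

for prefix, rate in sorted(selection_rate_by_series(pool, selected).items()):
    print(prefix, f"{rate:.2f}")
```

Running the same function on each pair of consecutive stages (prelims to mains, mains to final) gives per-series rates that can be compared side by side.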

## Decision 4: Separate verification script before analysis
**What:** A standalone `verify_extraction.py` script confirms total candidate counts before any series-level analysis is run.
**Why:** Because the source data is official government examination results, the integrity of the extraction had to be independently provable. The verification script confirms that the extracted totals (15,066 prelims / 2,720 mains / 933 final) match the officially published figures exactly, which makes the analysis reproducible and publicly defensible. Anyone questioning the methodology can run the script themselves.
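
A hedged sketch of the kind of check such a script could perform. The expected totals come from the figures above; the function name `verify_counts`, the dictionary layout, and the choice to count unique roll numbers are assumptions, since the contents of `verify_extraction.py` are not shown here.

```python
# Officially published totals, as stated in Decision 4.
EXPECTED_COUNTS = {"prelims": 15066, "mains": 2720, "final": 933}


def verify_counts(extracted, expected=EXPECTED_COUNTS):
    """Return {stage: (got, want)} for every stage whose count is off.

    `extracted` maps a stage name to its list of roll-number strings.
    Counting unique roll numbers is an assumption of this sketch.
    """
    mismatches = {}
    for stage, want in expected.items():
        got = len(set(extracted.get(stage, [])))
        if got != want:
            mismatches[stage] = (got, want)
    return mismatches


# Tiny synthetic run with hypothetical expected totals.
demo = {"prelims": ["0000001", "0000002"], "mains": ["0000001"]}
print(verify_counts(demo, expected={"prelims": 2, "mains": 1, "final": 1}))
```

An empty result means every stage matches. Running a check like this before any series-level analysis stops the pipeline early if extraction has drifted from the published totals.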

## Decision 5: Mains PDF page-1 skip
**What:** Extraction from the mains PDF skips the first page by passing `skip_page_1=True`.
**Why:** Page 1 of the mains PDF contained no roll numbers; it was a cover or header page. Including it in the extraction loop added no data and risked pulling in stray numbers from the page layout. Skipping it explicitly kept the extraction clean.
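
One way the page skip could look, as a sketch rather than the project's actual code. Only the `skip_page_1` flag comes from the description above; the function names `iter_page_texts` and `read_pdf_texts` are hypothetical. The first helper works on any sequence of objects with an `extract_text()` method, which is what pdfplumber's `pdf.pages` provides.

```python
def iter_page_texts(pages, skip_page_1=False):
    """Yield the raw text of each page, optionally skipping the first."""
    for index, page in enumerate(pages):
        if skip_page_1 and index == 0:
            continue  # cover/header page: no roll numbers expected
        # Treat a page with no extractable text as an empty string.
        yield page.extract_text() or ""


def read_pdf_texts(path, skip_page_1=False):
    """Open a PDF with pdfplumber and return a list of page texts."""
    import pdfplumber  # third-party dependency, imported only when needed

    with pdfplumber.open(path) as pdf:
        return list(iter_page_texts(pdf.pages, skip_page_1=skip_page_1))
```

With a hypothetical file name, the mains call would be `read_pdf_texts("mains.pdf", skip_page_1=True)`, while the other PDFs would use the default.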