Spaces:

neural-arun
/

ArunCore

Sleeping

App Files Files Community

ArunCore / data /github /result_anomaly /decisions.md

Neural Arun

ArunCore Deployment

9ae77d7 about 1 month ago

preview code

raw

history blame contribute delete

2.42 kB

Key Decisions — UPPSC PCS 2024 Statistical Audit

Decision 1: pdfplumber for text extraction instead of pypdf

What: Used pdfplumber to extract text from the UPPSC result PDFs. Why: pdfplumber was the natural first choice for this type of forensic PDF work. It handles raw text extraction well from government-style PDFs where layout can be inconsistent.

Decision 2: Regex on raw text instead of table parsing

What: Roll numbers were extracted using regex (\b\d{7}\b) on raw page text rather than trying to parse structured table data. Why: The UPPSC PDFs contain a large volume of numbers beyond just roll numbers — page numbers, registration IDs, and other numeric data. Table parsing would have required precise column alignment assumptions that the PDFs do not reliably provide. Regex targeting the 7-digit roll number pattern (with a 6-digit fallback) proved to be the more robust and precise approach.

Decision 3: Series prefix grouping by first 2 digits

What: Candidates are grouped by the first two digits of their roll number (r[:2]) to identify their series. Why: The investigation started from observing that the anomaly was concentrated among candidates whose roll numbers started with 00 and 01. Grouping by the first two digits was the natural way to isolate these series and compare their selection rates against the rest of the pool across all three exam stages.

Decision 4: Separate verification script before analysis

What: A standalone verify_extraction.py script confirms total candidate counts before any series-level analysis is run. Why: Because the source data is official government examination results, the integrity of the extraction had to be independently provable. The verification script confirms that the total counts extracted (15,066 prelims / 2,720 mains / 933 final) match the officially published figures exactly — making the entire analysis fully reproducible and publicly defensible. Anyone questioning the methodology can run the script themselves.

Decision 5: Mains PDF page-1 skip

What: The mains PDF skips the first page with skip_page_1=True. Why: Page 1 of the mains PDF did not contain any roll numbers — it was a cover or header page. Including it in the extraction loop added no data and risked pulling in stray numbers from the page layout. Skipping it explicitly kept the extraction clean.