Spaces: bhuvan-2005/question-extractor
bhuvan-2005 committed on Nov 17, 2025

Commit 19d31b9 (verified) · 1 parent: 1b0427e

Update question_extractor.py

Files changed (1)

question_extractor.py

@@ -663,8 +663,15 @@ def process_question_paper(image_path, output_path):
     text = extract_text_from_image(image_path)
     subject = extract_subject_name(text)
-    # Use text-line based generic extraction as the primary method.
-    questions = extract_questions_from_text(text)
+    # 1) Try layout-based extraction first (uses Tesseract's positional
+    # data to find question numbers in the left column). This is
+    # particularly robust for table-style papers like VIT's CAT format.
+    questions = extract_questions_with_layout(image_path)
+    # 2) If that fails or finds too few questions, fall back to the
+    # generic text-line based extractor which uses only OCR'd text.
+    if not questions or len(questions) < 3:
+        questions = extract_questions_from_text(text)
     # Write out the results in a structured layout
     with open(output_path, 'w', encoding='utf-8') as f:
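
The body of extract_questions_with_layout is not part of this diff, so the following is only a minimal sketch of the idea the new comments describe: scan Tesseract's positional word data for number-like tokens near the left edge, then apply the same "too few results" fallback rule. The names find_question_numbers and choose_questions, the QUESTION_NUMBER pattern, and the left_fraction threshold are all hypothetical and chosen for illustration. The input is assumed to have the dictionary shape returned by pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT).

```python
import re

# Assumed numbering styles: "1", "1.", "12)", "Q3", "Q.3". Hypothetical pattern.
QUESTION_NUMBER = re.compile(r"^(?:Q\.?)?(\d{1,2})[.)]?$", re.IGNORECASE)


def find_question_numbers(ocr_data, page_width, left_fraction=0.15):
    """Return (number, top) pairs for number-like words in the left margin.

    ocr_data holds parallel lists under 'text', 'left' and 'top', as in the
    dictionary output of pytesseract.image_to_data. left_fraction is an
    assumed cut-off for what counts as the left column.
    """
    max_left = page_width * left_fraction
    found = []
    for text, left, top in zip(ocr_data["text"], ocr_data["left"], ocr_data["top"]):
        match = QUESTION_NUMBER.match(text.strip())
        if match and left <= max_left:
            found.append((int(match.group(1)), top))
    # Sort top to bottom so the pairs follow the order on the page.
    return sorted(found, key=lambda pair: pair[1])


def choose_questions(layout_questions, text_fallback, min_questions=3):
    """Apply the fallback rule from the diff: keep layout results unless too few."""
    if not layout_questions or len(layout_questions) < min_questions:
        # text_fallback is called lazily, only when the layout pass falls short.
        return text_fallback()
    return layout_questions
```

In this sketch the `not layout_questions` check also covers a layout pass that returns None, which is why it precedes the len() comparison, as in the committed condition.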