Space: MasterMap / mastermap-cleaner
andrewbejjani committed about 4 hours ago

Commit c6a3f44

1 Parent(s): ad5ab1d

Added functional doc in README.md and added basic comments to the code
modified: README.md
modified: apply_blueprint.py
modified: main.py
modified: newest_model.py
modified: src/config.py
modified: src/data_pipeline.py
modified: src/llm_router.py
modified: src/process_runner.py
modified: src/utils.py
modified: src/workbook_io.py
modified: ui_app.py

Files changed (11)

README.md +130 -14
apply_blueprint.py +12 -7
main.py +21 -20
newest_model.py +5 -0
src/config.py +5 -3
src/data_pipeline.py +25 -29
src/llm_router.py +11 -2
src/process_runner.py +6 -0
src/utils.py +10 -0
src/workbook_io.py +5 -0
ui_app.py +33 -3

README.md CHANGED

@@ -1,18 +1,134 @@
----
-title: MasterMap Cleaner
-sdk: docker
-app_port: 7860
----
-## Environment
-For local development, copy `.env.example` to `.env` and fill only the values you need.
-Set production secrets in the Hugging Face Space settings, not in committed files.
-- `GROQ_API_KEY`: required for Groq model calls.
-- `APP_PASSWORD`: required to password-protect the deployed Space; set it as a Hugging Face Secret.
-- `APP_USERNAME`: optional, defaults to `mastermap` when `APP_PASSWORD` is set.
-- `APP_SECRET_KEY`: recommended for stable, isolated per-browser sessions.
-- `HF_TOKEN`: Hugging Face Space Secret only; optional, required only for the `Save Manual References` button.
-`Save Manual References` only enables on Hugging Face Spaces when `SPACE_ID` is present and `HF_TOKEN` is configured. It commits the current `refdata/manual_references.json` back to the Space repository.

+# MasterMap Cleaner
+MasterMap Cleaner is a web tool used to clean and standardize MasterMap Excel files.
+The tool takes an uploaded workbook, checks selected columns against approved reference lists, uses AI only when needed, and creates a cleaned version of the workbook. It also generates a review file called a **Blueprint**, where uncertain values can be checked and corrected by a human before the final workbook is downloaded.
+## What The Tool Produces
+After a cleaning run, the tool can produce:
+- a cleaned workbook with a new cleaned sheet
+- a Blueprint file for human review
+- a final workbook after the reviewed Blueprint has been applied
+## Before You Start
+You need:
+- the Hugging Face Space link for the tool
+- the username and password provided by the tool administrator
+- the Excel workbook you want to clean
+- access to Excel or another spreadsheet editor to review the Blueprint
+Accepted workbook formats:
+- `.xlsx`
+- `.xlsm`
+## Recommended Workflow
+1. Open the tool.
+2. Upload the Excel workbook.
+3. Select the source sheet to clean.
+4. Run the cleaning process.
+5. Download the cleaned workbook and Blueprint.
+6. Review the Blueprint in Excel.
+7. Upload the reviewed Blueprint back into the tool.
+8. Apply the Blueprint.
+9. Download the final cleaned workbook.
+10. Save manual references if new approved values should be remembered for future runs.
+## Step 1: Open The Tool
+Open the Hugging Face Space link in your browser.
+If prompted, enter the username and password provided by the tool administrator.
+## Step 2: Upload The Dataset
+In the **Dataset to Clean** section:
+1. Drop the Excel file into the upload box, or click the box and select the file.
+2. Wait until the file is loaded.
+3. Select the source sheet that contains the data to clean.
+4. Choose the output sheet name.
+The output sheet is the new sheet that will be created inside the workbook for cleaned data.
+Do not use the same name as the source sheet.
+## Step 3: Run Cleaning
+Click **Run Cleaning**.
+While the tool is running, it will show progress for each cleaned column. Some files may take time depending on the number of rows and the number of values that require AI review.
+When the run finishes, the tool will show download links for:
+- **Blueprint**
+- **Cleaned Workbook**
+Download both files.
+## Step 4: Review The Blueprint
+Open the Blueprint file in Excel.
+The Blueprint contains values that the tool wants a human to review. Each row represents one value or correction candidate.
+Main columns:
+- `Row_Index`: the row in the workbook where the value appears
+- `Column`: the field being reviewed
+- `Original_Raw_Text`: the original value from the uploaded file
+- `AI_Suggested_Match`: the tool's suggested cleaned value
+- `Human_Override`: the reviewer correction field
+- `Confidence`: how confident the tool was
+- `Match_Source`: how the suggestion was produced
+How to review each row:
+- If the suggested value is correct, leave `Human_Override` empty.
+- If the suggested value is wrong, choose the correct value from the dropdown.
+- If the correct value is not in the dropdown, type it manually.
+- Focus especially on rows marked `LOW` or `MEDIUM` confidence.
+The dropdown is there to help, but manual typing is allowed.
+After reviewing, save the Blueprint file.
+## Step 5: Apply The Reviewed Blueprint
+Return to the tool and go to the **Apply Blueprint** section.
+1. Upload the workbook that should receive the corrections.
+2. Select the sheet to update.
+3. Upload the reviewed Blueprint file.
+4. Click **Apply Blueprint**.
+Use the same cleaned sheet that was created during the cleaning step unless you were instructed otherwise.
+When the apply step finishes, download the final cleaned workbook.
+## Step 6: Save Manual References
+After applying a Blueprint, the tool may learn newly approved values so future runs can recognize them automatically.
+If the **Save Manual References** button is available, click it after applying the Blueprint.
+Use this button when:
+- the Blueprint contained manually approved values
+- those values should be remembered in future cleaning runs
+- the administrator instructed you to preserve the updated references
+If the button is disabled, continue using the tool normally. The administrator may handle reference saving separately.
+## Important Usage Notes
+- Keep the browser tab open while a cleaning or apply process is running.
+- Download your files before closing the page.
+- If you refresh the page, you may need to upload the files again.
+- Do not share the tool password outside the approved user group.
+- Do not upload files unless they are intended to be processed by this tool.
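
The review rules in Step 4 of the new README reduce to a simple precedence: a non-empty `Human_Override` wins, otherwise the `AI_Suggested_Match` stands. The sketch below illustrates that rule only. The helper name `resolve_reviewed_value` and the plain-dict rows are assumptions made for illustration; the tool itself reads Blueprint rows with pandas.

```python
def resolve_reviewed_value(row):
    """Return the value a reviewed Blueprint row settles on.

    `row` is a plain dict keyed by the Blueprint column names listed in the
    README. Hypothetical helper, not a function from this repository.
    """
    # A reviewer correction takes priority whenever one was entered.
    override = (row.get("Human_Override") or "").strip()
    if override:
        return override
    # An empty override means the reviewer accepted the suggestion.
    return (row.get("AI_Suggested_Match") or "").strip()


# Sample rows are invented for the example.
rows = [
    {"Column": "Country", "AI_Suggested_Match": "France", "Human_Override": ""},
    {"Column": "Country", "AI_Suggested_Match": "Georgia", "Human_Override": "United States"},
]
resolved = [resolve_reviewed_value(r) for r in rows]
```

This mirrors the `approved_val` expression in `apply_blueprint.py`, where a blank override falls back to the AI suggestion.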

apply_blueprint.py CHANGED

@@ -14,6 +14,7 @@ from src.config import (
 from src.utils import normalize_ref, prune_manual_refs_against_official, ref_contains
 def parse_args():
     parser = argparse.ArgumentParser(description="Apply Blueprint Human Overrides")
     parser.add_argument("--input", required=True, help="Master Excel file name inside data/")
     parser.add_argument("--blueprint", default=DEFAULT_BLUEPRINT_FILE, help="Blueprint Excel file name inside data/")
@@ -29,6 +30,7 @@ def parse_args():
     return args
 def load_json_safe(filepath):
     try:
         with open(filepath, 'r', encoding='utf-8-sig') as f:
             return json.load(f)
@@ -36,16 +38,19 @@ def load_json_safe(filepath):
         return {}
 def split_approved_parts(value):
     if pd.isna(value):
         return []
     return [part.strip() for part in str(value).split(",") if part.strip()]
 def ensure_manual_bucket(manual_refs, official_refs, column_name):
     if column_name not in manual_refs:
         manual_refs[column_name] = {} if isinstance(official_refs.get(column_name), dict) else []
     return manual_refs[column_name]
 def remember_approved_value(manual_refs, official_refs, column_name, approved_value):
     manual_bucket = ensure_manual_bucket(manual_refs, official_refs, column_name)
     added_count = 0
@@ -85,7 +90,7 @@ if __name__ == "__main__":
         print(f"Error: Blueprint is missing required columns: {sorted(missing_columns)}")
         exit()
-    # Load the target Excel workbook
     wb = openpyxl.load_workbook(args.input)
     if args.sheet not in wb.sheetnames:
         print(f"Error: No '{args.sheet}' sheet found in {args.input}.")
@@ -98,7 +103,7 @@ if __name__ == "__main__":
         if sheet.cell(row=1, column=c).value
     }
-    # Load the memory dictionaries using the synced CLI path
     official_refs = load_json_safe(args.refs)
     manual_refs = load_json_safe(args.manual_refs)
@@ -107,6 +112,7 @@ if __name__ == "__main__":
     print("Applying manual overrides and updating memory...")
     for _, row in bp_df.iterrows():
         human_val = str(row['Human_Override']).strip() if pd.notna(row['Human_Override']) else ""
         approved_val = human_val if human_val else str(row['AI_Suggested_Match']).strip() if pd.notna(row['AI_Suggested_Match']) else ""
         confidence = str(row['Confidence']).strip().upper() if pd.notna(row['Confidence']) else ""
@@ -117,7 +123,7 @@ if __name__ == "__main__":
         raw_col = str(row['Column']).strip()
         if human_val:
-            # 1. Update the Excel File
             try:
                 excel_row = int(row['Row_Index'])
             except (TypeError, ValueError):
@@ -136,7 +142,7 @@ if __name__ == "__main__":
             sheet.cell(row=excel_row, column=col_idx).value = human_val
             changes_made += 1
-        # 2. Update Manual References for human overrides and accepted AI suggestions.
         if raw_col == "Degree":
             continue
@@ -152,11 +158,10 @@ if __name__ == "__main__":
     memory_pruned = prune_manual_refs_against_official(manual_refs, official_refs)
-    # Save Excel
     wb.save(args.input)
-    # Save JSONs
-    # Make sure the data directory exists before dumping
     manual_refs_dir = os.path.dirname(args.manual_refs)
     if manual_refs_dir:
         os.makedirs(manual_refs_dir, exist_ok=True)

 from src.utils import normalize_ref, prune_manual_refs_against_official, ref_contains
 def parse_args():
+    """Parse workbook, Blueprint, and reference paths for the apply step."""
     parser = argparse.ArgumentParser(description="Apply Blueprint Human Overrides")
     parser.add_argument("--input", required=True, help="Master Excel file name inside data/")
     parser.add_argument("--blueprint", default=DEFAULT_BLUEPRINT_FILE, help="Blueprint Excel file name inside data/")
     return args
 def load_json_safe(filepath):
+    """Load JSON memory files and fall back to an empty dict if absent/corrupt."""
     try:
         with open(filepath, 'r', encoding='utf-8-sig') as f:
             return json.load(f)
         return {}
 def split_approved_parts(value):
+    """Split multi-value approvals into individual reference candidates."""
     if pd.isna(value):
         return []
     return [part.strip() for part in str(value).split(",") if part.strip()]
 def ensure_manual_bucket(manual_refs, official_refs, column_name):
+    """Create the correct manual-ref container for list or dict reference columns."""
     if column_name not in manual_refs:
         manual_refs[column_name] = {} if isinstance(official_refs.get(column_name), dict) else []
     return manual_refs[column_name]
 def remember_approved_value(manual_refs, official_refs, column_name, approved_value):
+    """Remember approved values that are not already official or manual refs."""
     manual_bucket = ensure_manual_bucket(manual_refs, official_refs, column_name)
     added_count = 0
         print(f"Error: Blueprint is missing required columns: {sorted(missing_columns)}")
         exit()
+    # Human overrides are applied directly to the selected cleaned sheet.
     wb = openpyxl.load_workbook(args.input)
     if args.sheet not in wb.sheetnames:
         print(f"Error: No '{args.sheet}' sheet found in {args.input}.")
         if sheet.cell(row=1, column=c).value
     }
+    # Reference files use the same CLI defaults as the cleaning pipeline.
     official_refs = load_json_safe(args.refs)
     manual_refs = load_json_safe(args.manual_refs)
     print("Applying manual overrides and updating memory...")
     for _, row in bp_df.iterrows():
+        # Empty Human_Override means the reviewer accepted the AI suggestion.
         human_val = str(row['Human_Override']).strip() if pd.notna(row['Human_Override']) else ""
         approved_val = human_val if human_val else str(row['AI_Suggested_Match']).strip() if pd.notna(row['AI_Suggested_Match']) else ""
         confidence = str(row['Confidence']).strip().upper() if pd.notna(row['Confidence']) else ""
         raw_col = str(row['Column']).strip()
         if human_val:
+            # Blueprint row indices already include the skipped MasterMap filter row.
             try:
                 excel_row = int(row['Row_Index'])
             except (TypeError, ValueError):
             sheet.cell(row=excel_row, column=col_idx).value = human_val
             changes_made += 1
+        # Only approved non-low-confidence values should teach future runs.
         if raw_col == "Degree":
             continue
     memory_pruned = prune_manual_refs_against_official(manual_refs, official_refs)
+    # Persist workbook updates before writing the learned memory file.
     wb.save(args.input)
+    # Manual refs may be written to an empty deployment volume, so ensure the folder exists.
     manual_refs_dir = os.path.dirname(args.manual_refs)
     if manual_refs_dir:
         os.makedirs(manual_refs_dir, exist_ok=True)
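
The two small helpers documented in this file are easiest to understand by running them. The sketch below copies `split_approved_parts` and `ensure_manual_bucket` from the diff, with one substitution: `is_missing` is a stdlib stand-in for `pd.isna` so the example runs without pandas. The reference data is invented for illustration.

```python
import math


def is_missing(value):
    """Stdlib stand-in for pd.isna on scalars: None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def split_approved_parts(value):
    """Split a comma-separated approval into stripped, non-empty parts."""
    if is_missing(value):
        return []
    return [part.strip() for part in str(value).split(",") if part.strip()]


def ensure_manual_bucket(manual_refs, official_refs, column_name):
    """Create a dict or list bucket that mirrors the official ref's shape."""
    if column_name not in manual_refs:
        manual_refs[column_name] = {} if isinstance(official_refs.get(column_name), dict) else []
    return manual_refs[column_name]


# Invented sample data: one list-style column and one dict-style column.
official_refs = {"Country": ["France", "Spain"], "Institution": {"mit": "MIT"}}
manual_refs = {}

parts = split_approved_parts(" France, , Spain ,")
country_bucket = ensure_manual_bucket(manual_refs, official_refs, "Country")
institution_bucket = ensure_manual_bucket(manual_refs, official_refs, "Institution")
```

Empty fragments between commas are dropped, and each new manual bucket takes the same container type as its official counterpart.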

main.py CHANGED

@@ -9,13 +9,13 @@ from openpyxl.utils import get_column_letter
 from openpyxl.worksheet.datavalidation import DataValidation
 from openpyxl.workbook.defined_name import DefinedName
-# Import our new modular architecture
 from src.config import parse_cli_args, GROQ_API_KEY, AVAILABLE_MODELS, DEFAULT_SIMILARITY_THRESHOLD
 from src.llm_router import GroqRouter
 from src.data_pipeline import process_column, cluster_degrees_by_institution
 from src.utils import prune_manual_refs_against_official
-# --- 1. CONFIGURATION ---
 COLUMNS_CONFIG = {
     "Country": r',|;|\n|/',
     "Institution": r'[,/;|\n]',
@@ -30,10 +30,12 @@ COLUMNS_CONFIG = {
 master_cache = {}
 def load_json_safe(filepath):
     with open(filepath, 'r', encoding='utf-8-sig') as f:
         return json.load(f)
 def validate_official_refs(official_refs):
     missing = []
     for column_name in COLUMNS_CONFIG:
         if column_name == "Degree":
@@ -51,29 +53,28 @@ def validate_official_refs(official_refs):
         )
 def inject_searchable_dropdowns(blueprint_path, master_unique_lists):
-    """Injects robust, static searchable dropdowns into the Blueprint."""
     print("Injecting static searchable dropdowns into Blueprint...")
     wb = openpyxl.load_workbook(blueprint_path)
     main_sheet = wb.active
-    # 1. Create the Reference Sheet
     ref_sheet = wb.create_sheet(title="Reference_Lists")
     col_idx = 1
     for column_name, unique_items in master_unique_lists.items():
         safe_name = column_name.replace(" ", "_")
-        # Write the header
         ref_sheet.cell(row=1, column=col_idx, value=safe_name)
-        # Clean and alphabetize the list for a better user experience
         valid_items = sorted([item for item in unique_items if item and isinstance(item, str)])
         # Write the items
         for row_idx, item in enumerate(valid_items, start=2):
             ref_sheet.cell(row=row_idx, column=col_idx, value=item)
-        # 2. Create the Excel "Named Range"
         if valid_items:
             letter = get_column_letter(col_idx)
             range_str = f"Reference_Lists!${letter}$2:${letter}${len(valid_items) + 1}"
@@ -82,7 +83,7 @@ def inject_searchable_dropdowns(blueprint_path, master_unique_lists):
         col_idx += 1
-    # 3. Locate Target & Override Columns
     target_col_idx = None
     override_col_letter = None
     for cell in main_sheet[1]:
@@ -91,7 +92,6 @@ def inject_searchable_dropdowns(blueprint_path, master_unique_lists):
         elif cell.value == "Human_Override":
             override_col_letter = get_column_letter(cell.column)
-    # 4. Apply Data Validation
     if target_col_idx and override_col_letter:
         dv = DataValidation(
             type="list",
@@ -108,7 +108,7 @@ def inject_searchable_dropdowns(blueprint_path, master_unique_lists):
 if __name__ == "__main__":
-    # --- 2. INITIALIZATION ---
     args = parse_cli_args()
     source_sheet_name = args.sheet
     output_sheet_name = args.output_sheet
@@ -117,7 +117,7 @@ if __name__ == "__main__":
     print("Loading AI Model (this may take a few seconds)...")
     model = SentenceTransformer('all-MiniLM-L6-v2')
-    # Initialize our LLM Router
     router = GroqRouter(api_key=GROQ_API_KEY, available_models=available_models)
     if not os.path.exists(args.refs):
@@ -131,6 +131,8 @@ if __name__ == "__main__":
     official_refs = load_json_safe(args.refs)
     manual_refs = load_json_safe(args.manual_refs)
     validate_official_refs(official_refs)
     memory_pruned = prune_manual_refs_against_official(manual_refs, official_refs)
     if memory_pruned:
         print(f"[INFO] Removed {memory_pruned} manual reference duplicate(s) already covered by official refs.")
@@ -138,10 +140,11 @@ if __name__ == "__main__":
     print(f"Loading Excel dataset from {args.input}, sheet '{source_sheet_name}'...")
     data = pd.read_excel(args.input, sheet_name=source_sheet_name, skiprows=[1])
-    # Initialize the global Blueprint Logger
     blueprint_records = []
-    # --- 3. EXECUTE BATCH PIPELINE ---
     for col, pattern in COLUMNS_CONFIG.items():
         if col == "Degree":
             inst_col = 'Cleaned_Institution' if 'Cleaned_Institution' in data.columns else 'Institution'
@@ -157,16 +160,15 @@ if __name__ == "__main__":
                 split_pattern=pattern, blueprint_data=blueprint_records
             )
-    # --- 4. EXPORT RESULTS ---
     print("\nSaving all memory files...")
     with open(args.manual_refs, 'w', encoding='utf-8') as f: json.dump(manual_refs, f, indent=4, ensure_ascii=False)
-    # 4a. Export the Blueprint for Human Review
     if blueprint_records:
         bp_df = pd.DataFrame(blueprint_records)
         bp_df.to_excel(args.blueprint, index=False)
-        # --- Format the Blueprint Visually ---
         bp_wb = openpyxl.load_workbook(args.blueprint)
         bp_sheet = bp_wb.active
@@ -195,9 +197,8 @@ if __name__ == "__main__":
         bp_wb.save(args.blueprint)
         print(f"[!] Saved and formatted {len(bp_df)} rows requiring review to {args.blueprint}")
-        # --- NEW: Build master lists and inject dropdowns ---
         def extract_uniques(ref_data):
-            """Helper to extract names whether the memory file is a list or a dict"""
             if isinstance(ref_data, dict): return list(ref_data.values())
             elif isinstance(ref_data, list): return ref_data
             return []
@@ -206,7 +207,7 @@ if __name__ == "__main__":
         for category in COLUMNS_CONFIG.keys():
             off_items = extract_uniques(official_refs.get(category, []))
             man_items = extract_uniques(manual_refs.get(category, []))
-            # Merge, deduplicate, and remove blanks
             master_lists[category] = list(set([x for x in (off_items + man_items) if x]))
         inject_searchable_dropdowns(args.blueprint, master_lists)
@@ -214,7 +215,7 @@ if __name__ == "__main__":
     else:
         print("[!] No blueprint generated. All matches were HIGH confidence!")
-    # 4b. Inject Cleaned Data to Mastermap
     print("\nOpening original Excel file to preserve formatting...")
     wb = openpyxl.load_workbook(args.input)
     new_sheet_name = output_sheet_name

 from openpyxl.worksheet.datavalidation import DataValidation
 from openpyxl.workbook.defined_name import DefinedName
 from src.config import parse_cli_args, GROQ_API_KEY, AVAILABLE_MODELS, DEFAULT_SIMILARITY_THRESHOLD
 from src.llm_router import GroqRouter
 from src.data_pipeline import process_column, cluster_degrees_by_institution
 from src.utils import prune_manual_refs_against_official
+# Each cleaned column has its own conservative split pattern. Avoid splitting
+# on words like "and" because they can be part of official country names.
 COLUMNS_CONFIG = {
     "Country": r',|;|\n|/',
     "Institution": r'[,/;|\n]',
 master_cache = {}
 def load_json_safe(filepath):
+    """Load reference JSON files, accepting UTF-8 files with or without a BOM."""
     with open(filepath, 'r', encoding='utf-8-sig') as f:
         return json.load(f)
 def validate_official_refs(official_refs):
+    """Fail early if required reference buckets are missing or empty."""
     missing = []
     for column_name in COLUMNS_CONFIG:
         if column_name == "Degree":
         )
 def inject_searchable_dropdowns(blueprint_path, master_unique_lists):
+    """Add hidden reference lists and dropdowns to the generated Blueprint."""
     print("Injecting static searchable dropdowns into Blueprint...")
     wb = openpyxl.load_workbook(blueprint_path)
     main_sheet = wb.active
+    # Store all dropdown values on a hidden sheet so Excel can reference them.
     ref_sheet = wb.create_sheet(title="Reference_Lists")
     col_idx = 1
     for column_name, unique_items in master_unique_lists.items():
         safe_name = column_name.replace(" ", "_")
         ref_sheet.cell(row=1, column=col_idx, value=safe_name)
+        # Clean and alphabetize the list for a better review experience.
         valid_items = sorted([item for item in unique_items if item and isinstance(item, str)])
         # Write the items
         for row_idx, item in enumerate(valid_items, start=2):
             ref_sheet.cell(row=row_idx, column=col_idx, value=item)
+        # Named ranges let data validation reference long lists safely.
         if valid_items:
             letter = get_column_letter(col_idx)
             range_str = f"Reference_Lists!${letter}$2:${letter}${len(valid_items) + 1}"
         col_idx += 1
+    # The override dropdown changes based on the row's target column.
     target_col_idx = None
     override_col_letter = None
     for cell in main_sheet[1]:
         elif cell.value == "Human_Override":
             override_col_letter = get_column_letter(cell.column)
     if target_col_idx and override_col_letter:
         dv = DataValidation(
             type="list",
 if __name__ == "__main__":
+    # Parse CLI/UI arguments before loading any expensive model assets.
     args = parse_cli_args()
     source_sheet_name = args.sheet
     output_sheet_name = args.output_sheet
     print("Loading AI Model (this may take a few seconds)...")
     model = SentenceTransformer('all-MiniLM-L6-v2')
+    # The router owns Groq fallback order and rate-limit switching.
     router = GroqRouter(api_key=GROQ_API_KEY, available_models=available_models)
     if not os.path.exists(args.refs):
     official_refs = load_json_safe(args.refs)
     manual_refs = load_json_safe(args.manual_refs)
     validate_official_refs(official_refs)
+    # Manual memory should only contain values not already covered by official refs.
     memory_pruned = prune_manual_refs_against_official(manual_refs, official_refs)
     if memory_pruned:
         print(f"[INFO] Removed {memory_pruned} manual reference duplicate(s) already covered by official refs.")
     print(f"Loading Excel dataset from {args.input}, sheet '{source_sheet_name}'...")
     data = pd.read_excel(args.input, sheet_name=source_sheet_name, skiprows=[1])
+    # Every uncertain or changed value is logged here for human review.
     blueprint_records = []
+    # Run each configured column through the normalization pipeline. Degree
+    # values are clustered within each institution instead of matched to refs.
     for col, pattern in COLUMNS_CONFIG.items():
         if col == "Degree":
             inst_col = 'Cleaned_Institution' if 'Cleaned_Institution' in data.columns else 'Institution'
                 split_pattern=pattern, blueprint_data=blueprint_records
             )
     print("\nSaving all memory files...")
     with open(args.manual_refs, 'w', encoding='utf-8') as f: json.dump(manual_refs, f, indent=4, ensure_ascii=False)
+    # Export the review workbook only when there is something to inspect.
     if blueprint_records:
         bp_df = pd.DataFrame(blueprint_records)
         bp_df.to_excel(args.blueprint, index=False)
+        # Basic formatting helps reviewers scan confidence levels quickly.
         bp_wb = openpyxl.load_workbook(args.blueprint)
         bp_sheet = bp_wb.active
         bp_wb.save(args.blueprint)
         print(f"[!] Saved and formatted {len(bp_df)} rows requiring review to {args.blueprint}")
         def extract_uniques(ref_data):
+            """Extract display values from list-style or dict-style references."""
             if isinstance(ref_data, dict): return list(ref_data.values())
             elif isinstance(ref_data, list): return ref_data
             return []
         for category in COLUMNS_CONFIG.keys():
             off_items = extract_uniques(official_refs.get(category, []))
             man_items = extract_uniques(manual_refs.get(category, []))
+            # Merge official and manual values for the Blueprint dropdowns.
             master_lists[category] = list(set([x for x in (off_items + man_items) if x]))
         inject_searchable_dropdowns(args.blueprint, master_lists)
     else:
         print("[!] No blueprint generated. All matches were HIGH confidence!")
+    # Copy the source sheet to preserve formatting, then overwrite cleaned columns.
     print("\nOpening original Excel file to preserve formatting...")
     wb = openpyxl.load_workbook(args.input)
     new_sheet_name = output_sheet_name
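
The new comment on `COLUMNS_CONFIG` says the split patterns deliberately avoid words like "and". The sketch below applies the two patterns visible in the diff with `re.split` to show the effect. How the pipeline actually applies them is not shown in this commit, so `split_values` is a hypothetical helper and the sample strings are invented.

```python
import re

# Patterns copied from COLUMNS_CONFIG in the diff.
COUNTRY_PATTERN = r',|;|\n|/'
INSTITUTION_PATTERN = r'[,/;|\n]'


def split_values(raw, pattern):
    """Split a raw cell on the pattern and drop empty fragments."""
    return [part.strip() for part in re.split(pattern, raw) if part.strip()]


# "and" is not a delimiter, so the two-word country name stays intact.
countries = split_values("Trinidad and Tobago; France/Spain", COUNTRY_PATTERN)

# Inside a character class, "|" is a literal pipe and therefore a delimiter.
institutions = split_values("MIT | Harvard, Oxford", INSTITUTION_PATTERN)
```

Note the difference between the two patterns: the `Country` alternation does not split on a pipe character, while the `Institution` character class does.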

newest_model.py CHANGED

@@ -37,6 +37,7 @@ PREFERRED_MODEL_IDS = {model_id.lower() for model_id in PREFERRED_PRODUCTION_CHA
 def fetch_groq_models(api_key: str) -> list[dict[str, Any]]:
     headers = {
         "Authorization": f"Bearer {api_key}",
         "Content-Type": "application/json",
@@ -47,6 +48,7 @@ def fetch_groq_models(api_key: str) -> list[dict[str, Any]]:
 def is_active_chat_model(model: dict[str, Any]) -> bool:
     model_id = str(model.get("id", "")).lower()
     if not model_id:
         return False
@@ -58,6 +60,7 @@ def is_active_chat_model(model: dict[str, Any]) -> bool:
 def rank_model(model: dict[str, Any]) -> tuple[int, int, int, str]:
     model_id = str(model.get("id", ""))
     model_id_lower = model_id.lower()
@@ -75,6 +78,7 @@ def rank_model(model: dict[str, Any]) -> tuple[int, int, int, str]:
 def select_groq_chat_models(limit: int = len(PREFERRED_PRODUCTION_CHAT_MODELS), strategy: str = "stable") -> list[str]:
     api_key = os.getenv("GROQ_API_KEY")
     if not api_key:
         raise RuntimeError("GROQ_API_KEY is missing. Add it to .env first.")
@@ -98,6 +102,7 @@ def select_groq_chat_models(limit: int = len(PREFERRED_PRODUCTION_CHAT_MODELS),
 def main() -> None:
     parser = argparse.ArgumentParser(description="Select currently available Groq chat models.")
     parser.add_argument("--limit", type=int, default=len(PREFERRED_PRODUCTION_CHAT_MODELS), help="Number of fallback models to print.")
     parser.add_argument(

 def fetch_groq_models(api_key: str) -> list[dict[str, Any]]:
+    """Fetch the current Groq model catalog using the OpenAI-compatible API."""
     headers = {
         "Authorization": f"Bearer {api_key}",
         "Content-Type": "application/json",
 def is_active_chat_model(model: dict[str, Any]) -> bool:
+    """Keep only active preferred chat models that are suitable for judging."""
     model_id = str(model.get("id", "")).lower()
     if not model_id:
         return False
 def rank_model(model: dict[str, Any]) -> tuple[int, int, int, str]:
+    """Sort models by preferred production order, then by recency/capacity."""
     model_id = str(model.get("id", ""))
     model_id_lower = model_id.lower()
 def select_groq_chat_models(limit: int = len(PREFERRED_PRODUCTION_CHAT_MODELS), strategy: str = "stable") -> list[str]:
+    """Return a comma-ready fallback list for GROQ_MODEL."""
     api_key = os.getenv("GROQ_API_KEY")
     if not api_key:
         raise RuntimeError("GROQ_API_KEY is missing. Add it to .env first.")
 def main() -> None:
+    """CLI entry point used when refreshing the recommended Groq model list."""
     parser = argparse.ArgumentParser(description="Select currently available Groq chat models.")
     parser.add_argument("--limit", type=int, default=len(PREFERRED_PRODUCTION_CHAT_MODELS), help="Number of fallback models to print.")
     parser.add_argument(
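
The docstrings describe sorting models by a preferred order and returning a comma-ready list for `GROQ_MODEL`. The body of `rank_model` is not visible in this diff, so the following is only a sketch of the general technique, sorting with a tuple key. `PREFERRED_ORDER`, `rank_key`, and the model IDs are placeholders, not the repository's values, and the real key has more fields.

```python
# Placeholder IDs; the real list lives in PREFERRED_PRODUCTION_CHAT_MODELS.
PREFERRED_ORDER = ["model-a", "model-b", "model-c"]


def rank_key(model_id):
    """Sort key: position in the preferred list first, then the ID itself."""
    lowered = model_id.lower()
    try:
        position = PREFERRED_ORDER.index(lowered)
    except ValueError:
        # Unlisted models sort after every preferred model.
        position = len(PREFERRED_ORDER)
    return (position, lowered)


catalog = ["model-c", "Model-A", "unlisted-model", "model-b"]
fallback_order = sorted(catalog, key=rank_key)

# Joined form, suitable for a comma-separated environment variable.
groq_model_value = ",".join(fallback_order)
```

Tuple keys compare element by element, so the preferred position decides the order and the ID only breaks ties.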

src/config.py CHANGED

@@ -2,12 +2,13 @@ import os
 import argparse
 from dotenv import load_dotenv
-# Load environment variables
 load_dotenv()
 # --- ENVIRONMENT VARIABLES to be set up in .env ---
 GROQ_API_KEY = os.getenv("GROQ_API_KEY")
-RAW_MODELS = os.getenv("GROQ_MODEL")
 APP_USERNAME = os.getenv("APP_USERNAME")
 APP_PASSWORD = os.getenv("APP_PASSWORD")
 SPACE_ID = os.getenv("SPACE_ID")
@@ -46,7 +47,7 @@ def resolve_ref_path(file_arg):
     return os.path.join(REFDATA_DIR, file_arg)
 def parse_cli_args():
-    """Sets up the command line arguments so you don't have to hardcode filenames."""
     parser = argparse.ArgumentParser(description="MasterMap Data Normalization Pipeline")
     parser.add_argument("--input", required=True, help="Raw input Excel file name inside data/")
     parser.add_argument("--blueprint", default=DEFAULT_BLUEPRINT_FILE, help="Blueprint Excel file name inside data/")
@@ -57,6 +58,7 @@ def parse_cli_args():
     parser.add_argument("--models", default="", help="Comma-separated Groq models to use in fallback order")
     args = parser.parse_args()
     args.input = resolve_data_path(args.input)
     args.blueprint = resolve_data_path(args.blueprint)
     args.refs = resolve_ref_path(args.refs)

 import argparse
 from dotenv import load_dotenv
+# Load local .env values for development; Hugging Face injects the same names
+# as environment variables in production.
 load_dotenv()
 # --- ENVIRONMENT VARIABLES to be set up in .env ---
 GROQ_API_KEY = os.getenv("GROQ_API_KEY")
+RAW_MODELS = os.getenv("GROQ_MODEL", "")
 APP_USERNAME = os.getenv("APP_USERNAME")
 APP_PASSWORD = os.getenv("APP_PASSWORD")
 SPACE_ID = os.getenv("SPACE_ID")
     return os.path.join(REFDATA_DIR, file_arg)
 def parse_cli_args():
+    """Parse shared CLI arguments used by both local runs and the Flask UI."""
     parser = argparse.ArgumentParser(description="MasterMap Data Normalization Pipeline")
     parser.add_argument("--input", required=True, help="Raw input Excel file name inside data/")
     parser.add_argument("--blueprint", default=DEFAULT_BLUEPRINT_FILE, help="Blueprint Excel file name inside data/")
     parser.add_argument("--models", default="", help="Comma-separated Groq models to use in fallback order")
     args = parser.parse_args()
+    # Keep CLI calls short by treating bare names as files under data/refdata.
     args.input = resolve_data_path(args.input)
     args.blueprint = resolve_data_path(args.blueprint)
     args.refs = resolve_ref_path(args.refs)
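
The one behavioral change in this file is the default added to `os.getenv("GROQ_MODEL", "")`. Without it, an unset variable yields `None`, and any later `.split(",")` would raise `AttributeError`. The sketch below shows the safer pattern. `parse_model_list` is a hypothetical helper, since the code that derives `AVAILABLE_MODELS` is not part of this diff, and the model names are placeholders.

```python
import os


def parse_model_list(raw):
    """Turn a comma-separated string into a clean list of model names."""
    return [name.strip() for name in raw.split(",") if name.strip()]


# Simulate an environment where GROQ_MODEL is not set.
os.environ.pop("GROQ_MODEL", None)
raw_models = os.getenv("GROQ_MODEL", "")  # "" instead of None
no_models = parse_model_list(raw_models)

# Simulate a configured fallback list (placeholder names).
os.environ["GROQ_MODEL"] = "model-a, model-b,"
some_models = parse_model_list(os.getenv("GROQ_MODEL", ""))
```

With the empty-string default, the unset case produces an empty list rather than an exception, and callers can fall back to another source such as the `--models` argument.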

src/data_pipeline.py CHANGED

@@ -5,7 +5,6 @@ from collections import Counter
 from sentence_transformers import util
 from tqdm import tqdm
-# Import our pure text manipulation functions
 from src.utils import (
     clean_degree_text,
     normalize_text,
@@ -13,11 +12,8 @@ from src.utils import (
     smart_format
 )
 from src.config import TOP_K_CANDIDATES, DEFAULT_SIMILARITY_THRESHOLD
-# ---------------------------------------------------------------------------
-# ML & CLUSTERING ENGINE
-# ---------------------------------------------------------------------------
 def self_cluster_degrees(raw_degrees_list, model, school_cache, threshold=0.93):
     cleaned_list = [clean_degree_text(raw) for raw in raw_degrees_list if isinstance(raw, str)]
     raw_to_clean = {raw: clean_degree_text(raw) for raw in raw_degrees_list if isinstance(raw, str)}
     clean_counts = Counter(cleaned_list)
@@ -32,7 +28,7 @@ def self_cluster_degrees(raw_degrees_list, model, school_cache, threshold=0.93):
     embeddings = model.encode(unique_cleans, convert_to_tensor=True)
     clean_to_clustered = {}
-    merge_info = {} # Tracks similarity scores for the Blueprint
     for i, current_deg in enumerate(unique_cleans):
         if current_deg in clean_to_clustered: continue
@@ -45,8 +41,7 @@ def self_cluster_degrees(raw_degrees_list, model, school_cache, threshold=0.93):
                 if score.item() >= threshold and target_deg not in clean_to_clustered:
                     pair_key = f"{min(current_deg, target_deg)}|||{max(current_deg, target_deg)}"
-                    # We still use school_cache as a temporary runtime speedup,
-                    # but it is NOT saved to the json memory.
                     cached_action = school_cache.get(pair_key)
                     if cached_action:
@@ -83,6 +78,7 @@ def self_cluster_degrees(raw_degrees_list, model, school_cache, threshold=0.93):
 def cluster_degrees_by_institution(df, degree_col, inst_col, model, master_cache, blueprint_data, threshold=0.93):
     print(f"\n[INFO] Auto-Clustering '{degree_col}'. (Merges will be logged to Blueprint...)")
     cleaned_col_name = f'Cleaned_{degree_col}'
     df[cleaned_col_name] = df[degree_col].copy()
@@ -92,7 +88,7 @@ def cluster_degrees_by_institution(df, degree_col, inst_col, model, master_cache
     school_mappings = {}
-    # 1. Wrap the AI bottleneck (school clustering) in tqdm
     for school in tqdm(unique_schools, desc=f"Mapping {degree_col}s by Institution"):
         school_mask = (df[inst_col] == school) & (df[degree_col].notna())
         raw_degs = df.loc[school_mask, degree_col].astype(str).tolist()
@@ -101,7 +97,7 @@ def cluster_degrees_by_institution(df, degree_col, inst_col, model, master_cache
         if school not in master_cache["Degree_Decisions"]: master_cache["Degree_Decisions"][school] = {}
         school_mappings[school] = self_cluster_degrees(raw_degs, model, master_cache["Degree_Decisions"][school], threshold)
-    # 2. Wrap the DataFrame injection and Blueprint logging in tqdm
     for idx, row in tqdm(df.iterrows(), total=len(df), desc=f"Applying & Logging {degree_col}s"):
         school = row[inst_col]
         raw_deg = str(row[degree_col])
@@ -113,7 +109,6 @@ def cluster_degrees_by_institution(df, degree_col, inst_col, model, master_cache
             final_val, src, conf = mapping_data
             df.at[idx, cleaned_col_name] = final_val
-            # Log to Blueprint if modified or auto-merged
             if str(raw_deg).strip() != final_val.strip() or conf != "HIGH":
                 blueprint_data.append({
                     "Row_Index": idx + 3,
@@ -128,6 +123,7 @@ def cluster_degrees_by_institution(df, degree_col, inst_col, model, master_cache
 def get_deterministic_match(value, combined_valid_targets):
     val_clean = normalize_text(value)
     for target in combined_valid_targets:
         target_clean = normalize_text(target)
@@ -138,6 +134,7 @@ def get_deterministic_match(value, combined_valid_targets):
 def get_top_candidates(model, value, combined_valid_targets, reference_embeddings, k=5):
     if not combined_valid_targets: return []
     query_embedding = model.encode(value, convert_to_tensor=True)
     similarities = util.pytorch_cos_sim(query_embedding, reference_embeddings)[0]
@@ -146,6 +143,7 @@ def get_top_candidates(model, value, combined_valid_targets, reference_embedding
     return [combined_valid_targets[idx] for idx in top_matches.indices]
 def get_dict_exact_match(value, combined_dict):
     value_clean = normalize_text(value)
     for alias, canonical in combined_dict.items():
@@ -159,6 +157,7 @@ def get_dict_exact_match(value, combined_dict):
     return None
 def get_dict_rule_match(value, combined_dict):
     aliases = list(combined_dict.keys())
     canonical_values = list(dict.fromkeys(combined_dict.values()))
@@ -173,6 +172,7 @@ def get_dict_rule_match(value, combined_dict):
     return None
 def as_reference_list(ref_data):
     if isinstance(ref_data, list):
         return ref_data
     if isinstance(ref_data, dict):
@@ -180,6 +180,7 @@ def as_reference_list(ref_data):
     return []
 def as_reference_dict(ref_data):
     if isinstance(ref_data, dict):
         return ref_data
     if isinstance(ref_data, list):
@@ -187,6 +188,7 @@ def as_reference_dict(ref_data):
     return {}
 def update_match_postfix(progress, source_counts):
     progress.set_postfix({
         "Exact_Match": source_counts["Exact_Match"],
         "Rule_Match": source_counts["Rule_Match"],
@@ -202,6 +204,7 @@ def match_cache_key(column_name, value):
 def append_unique_cleaned_part(cleaned_parts, value):
     seen = set()
     for existing_value in cleaned_parts:
         for existing_part in str(existing_value).split(","):
@@ -226,11 +229,8 @@ def append_unique_cleaned_part(cleaned_parts, value):
     return added
-# ---------------------------------------------------------------------------
-# CORE DATA PIPELINE
-# ---------------------------------------------------------------------------
 def process_column(df, column_name, model, groq_router, official_refs, manual_refs, master_cache, split_pattern, blueprint_data):
     if column_name not in df.columns: return df
     core_data = official_refs.get(column_name, [])
@@ -241,6 +241,7 @@ def process_column(df, column_name, model, groq_router, official_refs, manual_re
     is_dict_mode = isinstance(core_data, dict)
     def get_updated_embeddings():
         if is_dict_mode:
             c_dict = {**as_reference_dict(core_data), **as_reference_dict(added_data)}
             c_keys = list(c_dict.keys())
@@ -261,6 +262,7 @@ def process_column(df, column_name, model, groq_router, official_refs, manual_re
     if not is_dict_mode and not combined_valid_targets:
         raise ValueError(f"No list references loaded for '{column_name}'. Refusing to call Groq for every value.")
     uniques = set()
     for cell in df[column_name].dropna():
         for p in re.split(split_pattern, str(cell), flags=re.IGNORECASE):
@@ -273,14 +275,14 @@ def process_column(df, column_name, model, groq_router, official_refs, manual_re
     for word in progress:
         word_clean = match_cache_key(column_name, word)
-        # 1. Check Memory Cache
         if word_clean in master_cache[column_name]:
             detailed_cache[word_clean] = {"val": master_cache[column_name][word_clean], "src": "Memory_Cache", "conf": "HIGH"}
             source_counts["Memory_Cache"] += 1
             update_match_postfix(progress, source_counts)
             continue
-        # 2. Check Exact Targets
         if is_dict_mode:
             exact = get_dict_exact_match(word, combined_dict)
         else:
@@ -293,7 +295,6 @@ def process_column(df, column_name, model, groq_router, official_refs, manual_re
             update_match_postfix(progress, source_counts)
             continue
-        # 3. Deterministic / Rule Match
         if is_dict_mode:
             suggested_match = get_dict_rule_match(word, combined_dict)
         else:
@@ -305,7 +306,7 @@ def process_column(df, column_name, model, groq_router, official_refs, manual_re
             update_match_postfix(progress, source_counts)
             continue
-        # 4. LLM API Match
         candidates = []
         if is_dict_mode:
             cand_keys = get_top_candidates(model, word, combined_valid_targets, reference_embeddings)
@@ -314,12 +315,11 @@ def process_column(df, column_name, model, groq_router, official_refs, manual_re
         else:
             candidates = get_top_candidates(model, word, combined_valid_targets, reference_embeddings)
-        # Call the router instance
         ans_val, src, conf = groq_router.ask_judge(word, candidates, column_name)
         source_counts[src] += 1
         update_match_postfix(progress, source_counts)
-        # Process every valid string, regardless of confidence (skip if API crashed)
         if "API_Error" not in conf and ans_val != "UNKNOWN" and ans_val != "LLM_Failed":
             llm_parts = [p.strip() for p in ans_val.split(",")]
             corrected_parts = []
@@ -338,24 +338,20 @@ def process_column(df, column_name, model, groq_router, official_refs, manual_re
                             corrected_parts.append(part)
                             all_matched = False
                 else:
-                    # 1. Exact Match Check (Case-insensitive)
                     exact_match = next((c for c in candidates if c.lower() == part.lower()), None)
                     if exact_match:
                         corrected_parts.append(exact_match)
                     else:
-                        # 2. Rule-Based Match Check
                         rule_match = get_deterministic_match(part, candidates)
                         if rule_match:
                             corrected_parts.append(rule_match)
                         else:
-                            # 3. No match in dictionary. Keep LLM's version, but flag that we couldn't verify it.
                             corrected_parts.append(part)
                             all_matched = False
-            # Remove duplicates while preserving the exact order
             unique_parts = list(dict.fromkeys(corrected_parts))
-            # Glue it back together
             ans_val = ", ".join(unique_parts)
             raw_parts_for_check = [
@@ -376,7 +372,7 @@ def process_column(df, column_name, model, groq_router, official_refs, manual_re
         detailed_cache[word_clean] = {"val": ans_val, "src": src, "conf": conf}
-    # Reconstruct cells and capture low/medium confidence matches for the Blueprint
     for idx, row in df.iterrows():
         cell_val = row[column_name]
         if pd.isna(cell_val): continue
@@ -390,7 +386,7 @@ def process_column(df, column_name, model, groq_router, official_refs, manual_re
         while i < len(raw_parts):
             curr = raw_parts[i]
-            # Check for combined pairs (e.g., "University of, Manchester" split by mistake)
             if i + 1 < len(raw_parts):
                 combo_clean = match_cache_key(column_name, f"{curr}, {raw_parts[i+1]}")
                 if combo_clean in detailed_cache:
@@ -416,7 +412,7 @@ def process_column(df, column_name, model, groq_router, official_refs, manual_re
         final_stitched_val = ", ".join(cleaned_parts)
         df.at[idx, f'Cleaned_{column_name}'] = final_stitched_val
-        # Log EVERY change made to the Excel file, plus any low/medium confidence guesses
         if str(cell_val).strip() != final_stitched_val.strip() or lowest_conf != "HIGH":
             blueprint_data.append({
                 "Row_Index": idx + 3,

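Two small idioms in this file are easy to misread in the diff: the `pair_key` built with `min`/`max` in `self_cluster_degrees`, and the `dict.fromkeys` call that de-duplicates `corrected_parts`. A minimal standalone sketch of both; the function names `symmetric_pair_key` and `dedupe_keep_order` are illustrative and do not exist in the repository.

```python
def symmetric_pair_key(a: str, b: str) -> str:
    """Build a cache key that is the same for (a, b) and (b, a)."""
    # Sorting the pair with min/max makes the key independent of argument order.
    return f"{min(a, b)}|||{max(a, b)}"


def dedupe_keep_order(parts: list[str]) -> list[str]:
    """Remove repeated parts while keeping the first occurrence of each."""
    # Dicts preserve insertion order, so the keys come back in first-seen order.
    return list(dict.fromkeys(parts))
```

The symmetric key means a merge decision cached for one ordering of two degree labels is found again when the loop meets them in the opposite order.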
 from sentence_transformers import util
 from tqdm import tqdm
 from src.utils import (
     clean_degree_text,
     normalize_text,
     smart_format
 )
 from src.config import TOP_K_CANDIDATES, DEFAULT_SIMILARITY_THRESHOLD
 def self_cluster_degrees(raw_degrees_list, model, school_cache, threshold=0.93):
+    """Cluster similar degree labels inside one institution."""
     cleaned_list = [clean_degree_text(raw) for raw in raw_degrees_list if isinstance(raw, str)]
     raw_to_clean = {raw: clean_degree_text(raw) for raw in raw_degrees_list if isinstance(raw, str)}
     clean_counts = Counter(cleaned_list)
     embeddings = model.encode(unique_cleans, convert_to_tensor=True)
     clean_to_clustered = {}
+    merge_info = {} # Track similarity scores for Blueprint transparency.
     for i, current_deg in enumerate(unique_cleans):
         if current_deg in clean_to_clustered: continue
                 if score.item() >= threshold and target_deg not in clean_to_clustered:
                     pair_key = f"{min(current_deg, target_deg)}|||{max(current_deg, target_deg)}"
+                    # Runtime cache avoids repeated decisions within one run only.
                     cached_action = school_cache.get(pair_key)
                     if cached_action:
 def cluster_degrees_by_institution(df, degree_col, inst_col, model, master_cache, blueprint_data, threshold=0.93):
+    """Apply degree clustering separately for each institution."""
     print(f"\n[INFO] Auto-Clustering '{degree_col}'. (Merges will be logged to Blueprint...)")
     cleaned_col_name = f'Cleaned_{degree_col}'
     df[cleaned_col_name] = df[degree_col].copy()
     school_mappings = {}
+    # Build school-specific mappings before mutating the dataframe.
     for school in tqdm(unique_schools, desc=f"Mapping {degree_col}s by Institution"):
         school_mask = (df[inst_col] == school) & (df[degree_col].notna())
         raw_degs = df.loc[school_mask, degree_col].astype(str).tolist()
         if school not in master_cache["Degree_Decisions"]: master_cache["Degree_Decisions"][school] = {}
         school_mappings[school] = self_cluster_degrees(raw_degs, model, master_cache["Degree_Decisions"][school], threshold)
+    # Apply mappings and log only changed/merged values for review.
     for idx, row in tqdm(df.iterrows(), total=len(df), desc=f"Applying & Logging {degree_col}s"):
         school = row[inst_col]
         raw_deg = str(row[degree_col])
             final_val, src, conf = mapping_data
             df.at[idx, cleaned_col_name] = final_val
             if str(raw_deg).strip() != final_val.strip() or conf != "HIGH":
                 blueprint_data.append({
                     "Row_Index": idx + 3,
 def get_deterministic_match(value, combined_valid_targets):
+    """Match obvious aliases/acronyms without calling embeddings or Groq."""
     val_clean = normalize_text(value)
     for target in combined_valid_targets:
         target_clean = normalize_text(target)
 def get_top_candidates(model, value, combined_valid_targets, reference_embeddings, k=5):
+    """Return the nearest reference candidates for one raw value."""
     if not combined_valid_targets: return []
     query_embedding = model.encode(value, convert_to_tensor=True)
     similarities = util.pytorch_cos_sim(query_embedding, reference_embeddings)[0]
     return [combined_valid_targets[idx] for idx in top_matches.indices]
 def get_dict_exact_match(value, combined_dict):
+    """Exact match against alias keys first, then canonical values."""
     value_clean = normalize_text(value)
     for alias, canonical in combined_dict.items():
     return None
 def get_dict_rule_match(value, combined_dict):
+    """Rule match dictionary-style refs while returning canonical values."""
     aliases = list(combined_dict.keys())
     canonical_values = list(dict.fromkeys(combined_dict.values()))
     return None
 def as_reference_list(ref_data):
+    """Convert list/dict reference data to display values."""
     if isinstance(ref_data, list):
         return ref_data
     if isinstance(ref_data, dict):
     return []
 def as_reference_dict(ref_data):
+    """Convert list/dict reference data to an alias-to-canonical mapping."""
     if isinstance(ref_data, dict):
         return ref_data
     if isinstance(ref_data, list):
     return {}
 def update_match_postfix(progress, source_counts):
+    """Expose match-source counts in tqdm without noisy per-row prints."""
     progress.set_postfix({
         "Exact_Match": source_counts["Exact_Match"],
         "Rule_Match": source_counts["Rule_Match"],
 def append_unique_cleaned_part(cleaned_parts, value):
+    """Append comma-separated cleaned parts while preserving first-seen order."""
     seen = set()
     for existing_value in cleaned_parts:
         for existing_part in str(existing_value).split(","):
     return added
 def process_column(df, column_name, model, groq_router, official_refs, manual_refs, master_cache, split_pattern, blueprint_data):
+    """Clean one dataframe column using refs, embeddings, then Groq fallback."""
     if column_name not in df.columns: return df
     core_data = official_refs.get(column_name, [])
     is_dict_mode = isinstance(core_data, dict)
     def get_updated_embeddings():
+        """Build current reference candidates after manual memory is loaded."""
         if is_dict_mode:
             c_dict = {**as_reference_dict(core_data), **as_reference_dict(added_data)}
             c_keys = list(c_dict.keys())
     if not is_dict_mode and not combined_valid_targets:
         raise ValueError(f"No list references loaded for '{column_name}'. Refusing to call Groq for every value.")
+    # Work on unique split values first so repeated cells reuse one decision.
     uniques = set()
     for cell in df[column_name].dropna():
         for p in re.split(split_pattern, str(cell), flags=re.IGNORECASE):
     for word in progress:
         word_clean = match_cache_key(column_name, word)
+        # Fast path: reuse a decision made earlier in this run.
         if word_clean in master_cache[column_name]:
             detailed_cache[word_clean] = {"val": master_cache[column_name][word_clean], "src": "Memory_Cache", "conf": "HIGH"}
             source_counts["Memory_Cache"] += 1
             update_match_postfix(progress, source_counts)
             continue
+        # Exact/rule matches are trusted and avoid LLM calls.
         if is_dict_mode:
             exact = get_dict_exact_match(word, combined_dict)
         else:
             update_match_postfix(progress, source_counts)
             continue
         if is_dict_mode:
             suggested_match = get_dict_rule_match(word, combined_dict)
         else:
             update_match_postfix(progress, source_counts)
             continue
+        # Last resort: send only the top reference candidates to Groq.
         candidates = []
         if is_dict_mode:
             cand_keys = get_top_candidates(model, word, combined_valid_targets, reference_embeddings)
         else:
             candidates = get_top_candidates(model, word, combined_valid_targets, reference_embeddings)
         ans_val, src, conf = groq_router.ask_judge(word, candidates, column_name)
         source_counts[src] += 1
         update_match_postfix(progress, source_counts)
+        # Re-check Groq output against refs so canonical casing/names are preserved.
         if "API_Error" not in conf and ans_val != "UNKNOWN" and ans_val != "LLM_Failed":
             llm_parts = [p.strip() for p in ans_val.split(",")]
             corrected_parts = []
                             corrected_parts.append(part)
                             all_matched = False
                 else:
                     exact_match = next((c for c in candidates if c.lower() == part.lower()), None)
                     if exact_match:
                         corrected_parts.append(exact_match)
                     else:
                         rule_match = get_deterministic_match(part, candidates)
                         if rule_match:
                             corrected_parts.append(rule_match)
                         else:
+                            # Keep unverifiable LLM text, but do not upgrade confidence.
                             corrected_parts.append(part)
                             all_matched = False
             unique_parts = list(dict.fromkeys(corrected_parts))
             ans_val = ", ".join(unique_parts)
             raw_parts_for_check = [
         detailed_cache[word_clean] = {"val": ans_val, "src": src, "conf": conf}
+    # Reconstruct full cell values in original row order for workbook injection.
     for idx, row in df.iterrows():
         cell_val = row[column_name]
         if pd.isna(cell_val): continue
         while i < len(raw_parts):
             curr = raw_parts[i]
+            # Recover obvious accidental splits such as "University of, Manchester".
             if i + 1 < len(raw_parts):
                 combo_clean = match_cache_key(column_name, f"{curr}, {raw_parts[i+1]}")
                 if combo_clean in detailed_cache:
         final_stitched_val = ", ".join(cleaned_parts)
         df.at[idx, f'Cleaned_{column_name}'] = final_stitched_val
+        # Review every changed cell and every low/medium-confidence result.
         if str(cell_val).strip() != final_stitched_val.strip() or lowest_conf != "HIGH":
             blueprint_data.append({
                 "Row_Index": idx + 3,

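The order of checks in `process_column` (run cache, exact reference match, then Groq as a last resort) can be sketched in isolation. This is a simplified illustration, not the pipeline code: `resolve_value` and the `ask_llm` callable are hypothetical, the rule-match tier is omitted, and a plain `strip().lower()` key stands in for `match_cache_key` and `normalize_text`.

```python
def resolve_value(word, cache, targets, ask_llm):
    """Hypothetical cascade: try the cheapest check first, call the LLM last."""
    key = word.strip().lower()  # simplification of the real cache key

    # Tier 1: a decision already stored in the cache.
    if key in cache:
        return cache[key], "Memory_Cache", "HIGH"

    # Tier 2: exact match against the approved reference values.
    for target in targets:
        if target.strip().lower() == key:
            return target, "Exact_Match", "HIGH"

    # Last resort: ask_llm stands in for the router call and must
    # return a (value, source, confidence) tuple.
    return ask_llm(word, targets)
```

Because each tier returns early, the LLM stand-in is only invoked for values that no reference data could resolve.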
src/llm_router.py CHANGED

@@ -3,9 +3,13 @@ import time
 from tqdm import tqdm
 from tenacity import retry, wait_exponential, stop_after_attempt, retry_if_exception_type
-class RateLimitException(Exception): pass
 class GroqRouter:
     def __init__(self, api_key, available_models):
         self.api_key = api_key
         self.available_models = available_models
@@ -13,6 +17,7 @@ class GroqRouter:
         self.last_printed_model = None
     def ask_judge(self, word, candidates, column_name):
         if self.current_model_index >= len(self.available_models):
             return (word, "API_Error_All_Models_Dead", "LOW")
@@ -21,6 +26,8 @@ class GroqRouter:
         headers = {"Authorization": f"Bearer {self.api_key}", "Content-Type": "application/json"}
         if column_name in ["Institution", "Degree"]:
             specific_rules = (
                 "- Split distinct separate schools or global alliances with a comma (e.g., 'Harvard & MIT' -> 'Harvard University, MIT').\n"
@@ -66,7 +73,6 @@ class GroqRouter:
             "max_tokens": 50
         }
-        # --- SIMPLIFIED RETRY LOGIC ---
         @retry(
             retry=retry_if_exception_type(RateLimitException),
             wait=wait_exponential(multiplier=2, min=2, max=30),
@@ -74,6 +80,7 @@ class GroqRouter:
             reraise=True
         )
         def fire_request():
             res = requests.post("https://api.groq.com/openai/v1/chat/completions", headers=headers, json=payload, timeout=30)
             if res.status_code == 429:
@@ -90,6 +97,7 @@ class GroqRouter:
                 self.last_printed_model = active_model
             try:
                 time.sleep(0.3)
                 response = fire_request()
@@ -106,6 +114,7 @@ class GroqRouter:
             except RateLimitException:
                 tqdm.write(f"  [!] Limits exhausted for {active_model}!")
                 self.current_model_index += 1
                 if self.current_model_index < len(self.available_models):

 from tqdm import tqdm
 from tenacity import retry, wait_exponential, stop_after_attempt, retry_if_exception_type
+class RateLimitException(Exception):
+    """Raised when a Groq model hits rate limits and should be rotated."""
+    pass
 class GroqRouter:
+    """Small Groq client that rotates through configured fallback models."""
     def __init__(self, api_key, available_models):
         self.api_key = api_key
         self.available_models = available_models
         self.last_printed_model = None
     def ask_judge(self, word, candidates, column_name):
+        """Ask Groq to normalize one raw value against likely candidates."""
         if self.current_model_index >= len(self.available_models):
             return (word, "API_Error_All_Models_Dead", "LOW")
         headers = {"Authorization": f"Bearer {self.api_key}", "Content-Type": "application/json"}
+        # Column-specific prompt rules prevent the model from over-splitting
+        # institutions while still translating geography values to English.
         if column_name in ["Institution", "Degree"]:
             specific_rules = (
                 "- Split distinct separate schools or global alliances with a comma (e.g., 'Harvard & MIT' -> 'Harvard University, MIT').\n"
             "max_tokens": 50
         }
         @retry(
             retry=retry_if_exception_type(RateLimitException),
             wait=wait_exponential(multiplier=2, min=2, max=30),
             reraise=True
         )
         def fire_request():
+            """Fire one request; tenacity retries only explicit rate-limit errors."""
             res = requests.post("https://api.groq.com/openai/v1/chat/completions", headers=headers, json=payload, timeout=30)
             if res.status_code == 429:
                 self.last_printed_model = active_model
             try:
+                # Light throttling reduces avoidable rate-limit pressure.
                 time.sleep(0.3)
                 response = fire_request()
             except RateLimitException:
                 tqdm.write(f"  [!] Limits exhausted for {active_model}!")
+                # Move to the next configured model and keep processing.
                 self.current_model_index += 1
                 if self.current_model_index < len(self.available_models):
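
The rotation behaviour described in the new comments (move to the next configured model on a rate limit, give up only when every model is exhausted) can be shown without any network call. A minimal sketch: `RotatingRouter` and the injected `send` callable are hypothetical, and using the model name as the source label with `"HIGH"` confidence is an assumption. Only the `(word, "API_Error_All_Models_Dead", "LOW")` fallback tuple is taken from the diff.

```python
class RateLimitException(Exception):
    """Raised when a model hits its rate limit and should be rotated."""


class RotatingRouter:
    """Hypothetical stand-in for GroqRouter with the HTTP call injected."""

    def __init__(self, available_models, send):
        self.available_models = available_models
        self.send = send  # callable(model, word) -> str; may raise RateLimitException
        self.current_model_index = 0

    def ask(self, word):
        # Keep trying until a model answers or the list is exhausted.
        while self.current_model_index < len(self.available_models):
            active_model = self.available_models[self.current_model_index]
            try:
                return (self.send(active_model, word), active_model, "HIGH")
            except RateLimitException:
                # Move to the next configured model and keep processing.
                self.current_model_index += 1
        # Same fallback tuple as ask_judge: raw value, error source, low confidence.
        return (word, "API_Error_All_Models_Dead", "LOW")
```

The index is stored on the instance, so once a model is exhausted later calls skip it instead of retrying it for every value.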

src/process_runner.py CHANGED

@@ -10,6 +10,7 @@ ACTIVE_PROCESSES = {}
 def stop_process(job_id: str) -> bool:
     process = ACTIVE_PROCESSES.get(job_id)
     if not process or process.poll() is not None:
         return False
@@ -26,6 +27,7 @@ def stop_process(job_id: str) -> bool:
 def stream_process(command, cwd: Path, job_id=None):
     env = os.environ.copy()
     env["PYTHONUNBUFFERED"] = "1"
     popen_kwargs = {
@@ -36,6 +38,7 @@ def stream_process(command, cwd: Path, job_id=None):
         "env": env,
     }
     if os.name == "nt":
         popen_kwargs["creationflags"] = subprocess.CREATE_NEW_PROCESS_GROUP
     process = subprocess.Popen(
@@ -43,6 +46,7 @@ def stream_process(command, cwd: Path, job_id=None):
         **popen_kwargs,
     )
     if job_id:
         ACTIVE_PROCESSES[job_id] = process
     try:
         assert process.stdout is not None
@@ -59,6 +63,8 @@ def stream_process(command, cwd: Path, job_id=None):
         trailing_chunk = decoder.decode(b"", final=True)
         if trailing_chunk:
             yield f"data: {json.dumps(trailing_chunk)}\n\n"
         yield f"data: {json.dumps(chr(10) + f'Process exited with code {exit_code}' + chr(10))}\n\n"
         event_name = "done" if exit_code == 0 else "failed"
         yield f"event: {event_name}\ndata: {{}}\n\n"

 def stop_process(job_id: str) -> bool:
+    """Stop a tracked subprocess by frontend job id."""
     process = ACTIVE_PROCESSES.get(job_id)
     if not process or process.poll() is not None:
         return False
 def stream_process(command, cwd: Path, job_id=None):
+    """Run a command and yield stdout/stderr as server-sent event chunks."""
     env = os.environ.copy()
     env["PYTHONUNBUFFERED"] = "1"
     popen_kwargs = {
         "env": env,
     }
     if os.name == "nt":
+        # Required so CTRL_BREAK can stop child Python processes on Windows.
         popen_kwargs["creationflags"] = subprocess.CREATE_NEW_PROCESS_GROUP
     process = subprocess.Popen(
         **popen_kwargs,
     )
     if job_id:
+        # The UI can later call /stop with this id.
         ACTIVE_PROCESSES[job_id] = process
     try:
         assert process.stdout is not None
         trailing_chunk = decoder.decode(b"", final=True)
         if trailing_chunk:
             yield f"data: {json.dumps(trailing_chunk)}\n\n"
+        # The frontend distinguishes a real success from a crashed subprocess.
         yield f"data: {json.dumps(chr(10) + f'Process exited with code {exit_code}' + chr(10))}\n\n"
         event_name = "done" if exit_code == 0 else "failed"
         yield f"event: {event_name}\ndata: {{}}\n\n"
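
For reviewers unfamiliar with the pattern, here is a reduced sketch of streaming a subprocess as server-sent events, modelled on the fragments of `stream_process` visible above. It is not the repository function: `stream_sse` is a hypothetical name, and the job tracking, environment handling, Windows process-group flag, and the "Process exited with code" message are left out.

```python
import codecs
import json
import subprocess
import sys


def stream_sse(command):
    """Yield a command's combined stdout/stderr as server-sent event frames."""
    # An incremental decoder copes with multi-byte characters split across chunks.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    with subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) as process:
        assert process.stdout is not None
        for chunk in iter(lambda: process.stdout.read(1024), b""):
            text = decoder.decode(chunk)
            if text:
                yield f"data: {json.dumps(text)}\n\n"
        # Flush any bytes still buffered inside the decoder.
        trailing = decoder.decode(b"", final=True)
        if trailing:
            yield f"data: {json.dumps(trailing)}\n\n"
        exit_code = process.wait()
    # A named event lets the client tell success from a crashed subprocess.
    event_name = "done" if exit_code == 0 else "failed"
    yield f"event: {event_name}\ndata: {{}}\n\n"


if __name__ == "__main__":
    for frame in stream_sse([sys.executable, "-c", "print('hello')"]):
        print(frame, end="")
```

JSON-encoding each chunk keeps embedded newlines from breaking the SSE framing, which relies on a blank line to end every event.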

src/utils.py CHANGED

@@ -5,6 +5,7 @@ import unicodedata
 from src.config import HF_TOKEN, SPACE_ID
 def strip_degrees_for_search(text):
     if not isinstance(text, str): return text
     degree_pattern = r'\b(MSc|MBA|BBA|BSc|Ph\.?D\.?|BA|MA|BS|MS|EMBA|Master|Bachelor|Masters|Bachelors|Licence)\b'
     cleaned = re.sub(degree_pattern, '', text, flags=re.IGNORECASE)
@@ -14,6 +15,7 @@ def strip_degrees_for_search(text):
     return cleaned
 def smart_format(text):
     if not isinstance(text, str): return text
     res = text.title()
     acronyms = ['Ma', 'Ba', 'Mba', 'Bba', 'Hr', 'It', 'Bs', 'Ms', 'Phd', 'Bsc', 'Msc', 'Llm', 'Pge', 'Cems']
@@ -24,6 +26,7 @@ def smart_format(text):
     return res.strip()
 def clean_degree_text(text):
     if not isinstance(text, str): return ""
     text = re.sub(r'\band\b', '&', text, flags=re.IGNORECASE)
     text = re.sub(r'\bet\b', '&', text, flags=re.IGNORECASE)
@@ -32,14 +35,17 @@ def clean_degree_text(text):
     return smart_format(text)
 def normalize_text(text):
     if not isinstance(text, str): return ""
     normalized = ''.join(c for c in unicodedata.normalize('NFD', text) if unicodedata.category(c) != 'Mn')
     return normalized.strip().lower()
 def normalize_ref(value):
     return normalize_text(str(value))
 def iter_ref_values(ref_data):
     if isinstance(ref_data, dict):
         yield from (item for item in ref_data.keys() if isinstance(item, str))
         yield from (item for item in ref_data.values() if isinstance(item, str))
@@ -47,10 +53,12 @@ def iter_ref_values(ref_data):
         yield from (item for item in ref_data if isinstance(item, str))
 def ref_contains(ref_data, value):
     needle = normalize_ref(value)
     return any(normalize_ref(item) == needle for item in iter_ref_values(ref_data))
 def prune_manual_refs_against_official(manual_refs, official_refs):
     removed_count = 0
     for column_name, manual_bucket in list(manual_refs.items()):
@@ -100,6 +108,7 @@ def prune_manual_refs_against_official(manual_refs, official_refs):
 MANUAL_REFERENCES_REPO_PATH = "refdata/manual_references.json"
 def reference_sync_status():
     if not SPACE_ID:
         return {
             "enabled": False,
@@ -121,6 +130,7 @@ def reference_sync_status():
     }
 def save_manual_references_to_hub(app_root: Path):
     status = reference_sync_status()
     if not status["enabled"]:
         raise RuntimeError(status["reason"])

 from src.config import HF_TOKEN, SPACE_ID
 def strip_degrees_for_search(text):
+    """Remove common degree words before matching institution names."""
     if not isinstance(text, str): return text
     degree_pattern = r'\b(MSc|MBA|BBA|BSc|Ph\.?D\.?|BA|MA|BS|MS|EMBA|Master|Bachelor|Masters|Bachelors|Licence)\b'
     cleaned = re.sub(degree_pattern, '', text, flags=re.IGNORECASE)
     return cleaned
 def smart_format(text):
+    """Title-case free text while preserving common academic/business acronyms."""
     if not isinstance(text, str): return text
     res = text.title()
     acronyms = ['Ma', 'Ba', 'Mba', 'Bba', 'Hr', 'It', 'Bs', 'Ms', 'Phd', 'Bsc', 'Msc', 'Llm', 'Pge', 'Cems']
     return res.strip()
 def clean_degree_text(text):
+    """Normalize degree titles before within-school clustering."""
     if not isinstance(text, str): return ""
     text = re.sub(r'\band\b', '&', text, flags=re.IGNORECASE)
     text = re.sub(r'\bet\b', '&', text, flags=re.IGNORECASE)
     return smart_format(text)
 def normalize_text(text):
+    """Normalize text for accent-insensitive, case-insensitive comparisons."""
     if not isinstance(text, str): return ""
     normalized = ''.join(c for c in unicodedata.normalize('NFD', text) if unicodedata.category(c) != 'Mn')
     return normalized.strip().lower()
 def normalize_ref(value):
+    """Normalize a reference value or alias for dictionary/set lookups."""
     return normalize_text(str(value))
 def iter_ref_values(ref_data):
+    """Yield all searchable strings from list-style or dict-style references."""
     if isinstance(ref_data, dict):
         yield from (item for item in ref_data.keys() if isinstance(item, str))
         yield from (item for item in ref_data.values() if isinstance(item, str))
         yield from (item for item in ref_data if isinstance(item, str))
 def ref_contains(ref_data, value):
+    """Return whether a reference bucket already contains a value/alias."""
     needle = normalize_ref(value)
     return any(normalize_ref(item) == needle for item in iter_ref_values(ref_data))
 def prune_manual_refs_against_official(manual_refs, official_refs):
+    """Remove manual values that are duplicates of official references."""
     removed_count = 0
     for column_name, manual_bucket in list(manual_refs.items()):
 MANUAL_REFERENCES_REPO_PATH = "refdata/manual_references.json"
 def reference_sync_status():
+    """Report whether the app can commit manual refs back to Hugging Face."""
     if not SPACE_ID:
         return {
             "enabled": False,
     }
 def save_manual_references_to_hub(app_root: Path):
+    """Commit the current manual references file back to the Space repository."""
     status = reference_sync_status()
     if not status["enabled"]:
         raise RuntimeError(status["reason"])
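
The reference-lookup helpers documented in this file can be tried in isolation. The following is a minimal sketch assembled from the lines visible in the diff, not the full module. The `else` branch in `iter_ref_values` is an assumption, because the diff elides the lines between the dict branch and the final `yield from`.

```python
import unicodedata


def normalize_text(text):
    """Strip accents, surrounding whitespace, and case for comparisons."""
    if not isinstance(text, str):
        return ""
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" marks the combining accents produced by NFD decomposition.
    without_marks = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return without_marks.strip().lower()


def normalize_ref(value):
    """Normalize a reference value or alias for lookups."""
    return normalize_text(str(value))


def iter_ref_values(ref_data):
    """Yield searchable strings from a dict-style or list-style bucket."""
    if isinstance(ref_data, dict):
        yield from (key for key in ref_data.keys() if isinstance(key, str))
        yield from (val for val in ref_data.values() if isinstance(val, str))
    else:
        # Assumption: any non-dict bucket is treated as an iterable of values.
        yield from (item for item in ref_data if isinstance(item, str))


def ref_contains(ref_data, value):
    """Return whether the bucket already holds the value or an alias of it."""
    needle = normalize_ref(value)
    return any(normalize_ref(item) == needle for item in iter_ref_values(ref_data))
```

With these definitions, `ref_contains(["École Polytechnique"], "ecole polytechnique")` matches despite the accent and case differences, which is presumably what lets `prune_manual_refs_against_official` detect manual entries that duplicate official ones.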

src/workbook_io.py CHANGED

@@ -9,6 +9,7 @@ ALLOWED_EXCEL_EXTENSIONS = (".xlsx", ".xlsm")
 def save_uploaded_excel(uploaded, upload_dir: Path):
     if not uploaded or not uploaded.filename:
         raise ValueError("No file uploaded.")
@@ -18,6 +19,7 @@ def save_uploaded_excel(uploaded, upload_dir: Path):
     stem = Path(filename).stem
     suffix = Path(filename).suffix
     saved_filename = f"{stem}_{uuid.uuid4().hex[:8]}{suffix}"
     destination = upload_dir / saved_filename
     uploaded.save(destination)
@@ -25,6 +27,7 @@ def save_uploaded_excel(uploaded, upload_dir: Path):
 def read_workbook_sheets(path: Path) -> list[str]:
     workbook = openpyxl.load_workbook(path, read_only=True, data_only=False)
     try:
         return workbook.sheetnames
@@ -33,6 +36,7 @@ def read_workbook_sheets(path: Path) -> list[str]:
 def resolve_allowed_path(raw_path: str, app_root: Path, allowed_roots: list[Path]) -> Path:
     if not raw_path:
         raise ValueError("Path is required.")
@@ -42,6 +46,7 @@ def resolve_allowed_path(raw_path: str, app_root: Path, allowed_roots: list[Path
     resolved = candidate.resolve()
     allowed = [root.resolve() for root in allowed_roots]
     if not any(resolved == root or resolved.is_relative_to(root) for root in allowed):
         raise ValueError("Path is outside the application data directory.")

 def save_uploaded_excel(uploaded, upload_dir: Path):
+    """Validate and save an uploaded Excel file with a collision-safe name."""
     if not uploaded or not uploaded.filename:
         raise ValueError("No file uploaded.")
     stem = Path(filename).stem
     suffix = Path(filename).suffix
+    # UUID suffixes keep simultaneous users from overwriting each other.
     saved_filename = f"{stem}_{uuid.uuid4().hex[:8]}{suffix}"
     destination = upload_dir / saved_filename
     uploaded.save(destination)
 def read_workbook_sheets(path: Path) -> list[str]:
+    """Read sheet names without loading workbook cell data into memory."""
     workbook = openpyxl.load_workbook(path, read_only=True, data_only=False)
     try:
         return workbook.sheetnames
 def resolve_allowed_path(raw_path: str, app_root: Path, allowed_roots: list[Path]) -> Path:
+    """Resolve a user-supplied path and ensure it stays inside allowed roots."""
     if not raw_path:
         raise ValueError("Path is required.")
     resolved = candidate.resolve()
     allowed = [root.resolve() for root in allowed_roots]
+    # Prevent download/apply endpoints from reading arbitrary server files.
     if not any(resolved == root or resolved.is_relative_to(root) for root in allowed):
         raise ValueError("Path is outside the application data directory.")
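
The containment check in `resolve_allowed_path` can be exercised on its own. This is a hedged sketch rather than the committed function: the way `candidate` is built (relative paths joined onto `app_root`) and the final `return resolved` are assumptions, since the diff elides those lines. `Path.is_relative_to` requires Python 3.9 or newer.

```python
from pathlib import Path


def resolve_allowed_path(raw_path: str, app_root: Path, allowed_roots: list[Path]) -> Path:
    """Resolve a user-supplied path and reject anything outside the allowed roots."""
    if not raw_path:
        raise ValueError("Path is required.")
    candidate = Path(raw_path)
    if not candidate.is_absolute():
        # Assumption: relative paths are interpreted against the app root.
        candidate = app_root / candidate
    # resolve() collapses ".." segments and symlinks before the comparison,
    # so "data/../secret.txt" cannot slip past the check.
    resolved = candidate.resolve()
    allowed = [root.resolve() for root in allowed_roots]
    if not any(resolved == root or resolved.is_relative_to(root) for root in allowed):
        raise ValueError("Path is outside the application data directory.")
    return resolved
```

Resolving before comparing is the important design choice here: a plain string-prefix test on the raw input would accept traversal paths that only look like they sit under `data/`.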

ui_app.py CHANGED

@@ -23,6 +23,8 @@ from src.workbook_io import read_workbook_sheets, resolve_allowed_path, save_upl
 APP_ROOT = Path(__file__).resolve().parent
 UPLOAD_DIR = APP_ROOT / DATA_DIR / "uploads"
 UPLOAD_DIR.mkdir(parents=True, exist_ok=True)
 ALLOWED_FILE_ROOTS = [APP_ROOT / DATA_DIR]
 app = Flask(
@@ -31,7 +33,7 @@ app = Flask(
     static_folder=str(APP_ROOT / "ui" / "static"),
 )
 app.config["MAX_CONTENT_LENGTH"] = 100 * 1024 * 1024
-app.secret_key = os.getenv("APP_SECRET_KEY", "mastermap-local-dev-secret")
 DEFAULT_STATE = {
     "clean_path": "",
@@ -50,6 +52,7 @@ DEFAULT_STATE = {
 def fresh_state():
     return {
         key: list(value) if isinstance(value, list) else value
         for key, value in DEFAULT_STATE.items()
@@ -57,16 +60,19 @@ def fresh_state():
 def get_state():
     if "ui_state" not in session:
         session["ui_state"] = fresh_state()
     return session["ui_state"]
 def mark_state_changed():
     session.modified = True
 def auth_required_response() -> Response:
     return Response(
         "Authentication required",
         401,
@@ -75,15 +81,17 @@ def auth_required_response() -> Response:
 def missing_auth_config_response() -> Response:
     return Response(
-        "APP_PASSWORD Space Secret is not configured.",
         503,
     )
 @app.before_request
 def require_basic_auth():
-    if not APP_PASSWORD:
         if SPACE_ID:
             return missing_auth_config_response()
         return None
@@ -110,6 +118,7 @@ def prevent_browser_cache(response):
 def default_models() -> str:
     preferred_model_ids = {model.lower() for model in PREFERRED_PRODUCTION_CHAT_MODELS}
     env_preferred_models = [
         model
@@ -120,6 +129,7 @@ def default_models() -> str:
 def render_page(message: str = "", error: str = ""):
     state = get_state()
     if state["clean_sheets"]:
         state["clean_selected_sheet"] = pick_sheet(state["clean_sheets"], state["clean_selected_sheet"], state)
@@ -139,6 +149,7 @@ def render_page(message: str = "", error: str = ""):
 def can_apply_blueprint() -> bool:
     state = get_state()
     return bool(
         state["apply_workbook_path"]
@@ -149,10 +160,12 @@ def can_apply_blueprint() -> bool:
 def wants_json_response() -> bool:
     return "application/json" in request.headers.get("Accept", "")
 def ui_state_payload(message: str = "", error: str = ""):
     state = get_state()
     return {
         "message": message,
@@ -168,6 +181,7 @@ def ui_state_payload(message: str = "", error: str = ""):
 def pick_sheet(sheets, preferred_sheet=None, state=None):
     state = state or get_state()
     if preferred_sheet and preferred_sheet in sheets:
         return preferred_sheet
@@ -177,6 +191,7 @@ def pick_sheet(sheets, preferred_sheet=None, state=None):
 def update_ui_state_from_form(form):
     state = get_state()
     state["clean_selected_sheet"] = form.get("clean_selected_sheet") or state["clean_selected_sheet"]
     state["output_sheet"] = form.get("output_sheet") or state["output_sheet"] or DEFAULT_OUTPUT_SHEET_NAME
@@ -200,6 +215,7 @@ def prepare_clean():
     except Exception as exc:
         return render_page(error=str(exc))
     state["clean_path"] = str(path)
     state["clean_filename"] = filename
     state["clean_sheets"] = sheets
@@ -214,6 +230,7 @@ def prepare_clean():
 @app.route("/remove-clean", methods=["POST"])
 def remove_clean():
     update_ui_state_from_form(request.form)
     state = get_state()
     old_path = state["clean_path"]
@@ -232,6 +249,7 @@ def remove_clean():
 @app.route("/prepare-apply-workbook", methods=["POST"])
 def prepare_apply_workbook():
     try:
         update_ui_state_from_form(request.form)
         state = get_state()
@@ -254,6 +272,7 @@ def prepare_apply_workbook():
 @app.route("/prepare-apply-blueprint", methods=["POST"])
 def prepare_apply_blueprint():
     try:
         update_ui_state_from_form(request.form)
         state = get_state()
@@ -276,6 +295,7 @@ def prepare_apply_blueprint():
 @app.route("/models")
 def models_endpoint():
     try:
         models = select_groq_chat_models(limit=len(PREFERRED_PRODUCTION_CHAT_MODELS))
     except Exception as exc:
@@ -285,11 +305,13 @@ def models_endpoint():
 @app.route("/references/status")
 def references_status():
     return jsonify(reference_sync_status())
 @app.route("/references/save", methods=["POST"])
 def save_references():
     try:
         result = save_manual_references_to_hub(APP_ROOT)
     except Exception as exc:
@@ -299,6 +321,7 @@ def save_references():
 @app.route("/sheets")
 def sheets_endpoint():
     try:
         workbook_path = resolve_allowed_path(request.args.get("path", ""), APP_ROOT, ALLOWED_FILE_ROOTS)
         if not workbook_path.is_file():
@@ -310,6 +333,7 @@ def sheets_endpoint():
 @app.route("/download-blueprint")
 def download_blueprint():
     state = get_state()
     requested_path = request.args.get("path") or state["apply_blueprint_path"]
     if not requested_path:
@@ -322,6 +346,7 @@ def download_blueprint():
 @app.route("/download-cleaned-workbook")
 def download_cleaned_workbook():
     state = get_state()
     requested_path = request.args.get("path") or state["clean_path"]
     if not requested_path:
@@ -339,6 +364,7 @@ def download_cleaned_workbook():
 @app.route("/download-applied-workbook")
 def download_applied_workbook():
     state = get_state()
     requested_path = request.args.get("path") or state["apply_workbook_path"]
     if not requested_path:
@@ -356,6 +382,7 @@ def download_applied_workbook():
 @app.route("/run")
 def run():
     job_id = request.args.get("job_id", uuid.uuid4().hex)
     input_path = request.args.get("input", "")
     sheet = request.args.get("sheet", "")
@@ -370,6 +397,7 @@ def run():
     except ValueError as exc:
         return jsonify({"error": str(exc)}), 400
     blueprint_path = UPLOAD_DIR / f"Blueprint_{job_id}.xlsx"
     state = get_state()
     state["apply_blueprint_path"] = str(blueprint_path)
@@ -397,6 +425,7 @@ def run():
 @app.route("/stop", methods=["POST"])
 def stop():
     job_id = request.args.get("job_id", "")
     if not stop_process(job_id):
         return jsonify({"stopped": False, "message": "No active run found."}), 404
@@ -406,6 +435,7 @@ def stop():
 @app.route("/apply")
 def apply_blueprint():
     input_path = request.args.get("input", "")
     blueprint_path = request.args.get("blueprint", "")
     sheet = request.args.get("sheet", DEFAULT_OUTPUT_SHEET_NAME)

 APP_ROOT = Path(__file__).resolve().parent
 UPLOAD_DIR = APP_ROOT / DATA_DIR / "uploads"
 UPLOAD_DIR.mkdir(parents=True, exist_ok=True)
+# Download/apply routes only accept files inside data/ to avoid arbitrary file reads.
 ALLOWED_FILE_ROOTS = [APP_ROOT / DATA_DIR]
 app = Flask(
     static_folder=str(APP_ROOT / "ui" / "static"),
 )
 app.config["MAX_CONTENT_LENGTH"] = 100 * 1024 * 1024
+app.secret_key = os.getenv("APP_SECRET_KEY", "local-dev-secret")
 DEFAULT_STATE = {
     "clean_path": "",
 def fresh_state():
+    """Create a clean UI state for one browser session."""
     return {
         key: list(value) if isinstance(value, list) else value
         for key, value in DEFAULT_STATE.items()
 def get_state():
+    """Return the current browser's state, creating it on first visit."""
     if "ui_state" not in session:
         session["ui_state"] = fresh_state()
     return session["ui_state"]
 def mark_state_changed():
+    """Tell Flask to re-sign the session cookie after nested state edits."""
     session.modified = True
 def auth_required_response() -> Response:
+    """Ask the browser for basic-auth credentials."""
     return Response(
         "Authentication required",
         401,
 def missing_auth_config_response() -> Response:
+    """Fail closed on Hugging Face if password protection was not configured."""
     return Response(
+        "App login credentials are not configured.",
         503,
     )
 @app.before_request
 def require_basic_auth():
+    """Protect every app route with a shared password when configured."""
+    if not APP_PASSWORD or not APP_USERNAME:
         if SPACE_ID:
             return missing_auth_config_response()
         return None
 def default_models() -> str:
+    """Prefer configured Groq models, falling back to the curated production list."""
     preferred_model_ids = {model.lower() for model in PREFERRED_PRODUCTION_CHAT_MODELS}
     env_preferred_models = [
         model
 def render_page(message: str = "", error: str = ""):
+    """Render the app with state scoped to the current browser session."""
     state = get_state()
     if state["clean_sheets"]:
         state["clean_selected_sheet"] = pick_sheet(state["clean_sheets"], state["clean_selected_sheet"], state)
 def can_apply_blueprint() -> bool:
+    """The Apply button requires workbook, blueprint, and target sheet."""
     state = get_state()
     return bool(
         state["apply_workbook_path"]
 def wants_json_response() -> bool:
+    """AJAX upload routes ask for JSON; normal form posts render the page."""
     return "application/json" in request.headers.get("Accept", "")
 def ui_state_payload(message: str = "", error: str = ""):
+    """Return just enough state for the frontend to update without a reload."""
     state = get_state()
     return {
         "message": message,
 def pick_sheet(sheets, preferred_sheet=None, state=None):
+    """Choose a stable sheet selection when workbooks are uploaded/refreshed."""
     state = state or get_state()
     if preferred_sheet and preferred_sheet in sheets:
         return preferred_sheet
 def update_ui_state_from_form(form):
+    """Preserve current UI selections while a file upload request is processed."""
     state = get_state()
     state["clean_selected_sheet"] = form.get("clean_selected_sheet") or state["clean_selected_sheet"]
     state["output_sheet"] = form.get("output_sheet") or state["output_sheet"] or DEFAULT_OUTPUT_SHEET_NAME
     except Exception as exc:
         return render_page(error=str(exc))
+    # The uploaded workbook becomes both the cleaning input and default apply target.
     state["clean_path"] = str(path)
     state["clean_filename"] = filename
     state["clean_sheets"] = sheets
 @app.route("/remove-clean", methods=["POST"])
 def remove_clean():
+    """Clear the current session's cleaning workbook without touching other sessions."""
     update_ui_state_from_form(request.form)
     state = get_state()
     old_path = state["clean_path"]
 @app.route("/prepare-apply-workbook", methods=["POST"])
 def prepare_apply_workbook():
+    """AJAX upload for the workbook that will receive Blueprint corrections."""
     try:
         update_ui_state_from_form(request.form)
         state = get_state()
 @app.route("/prepare-apply-blueprint", methods=["POST"])
 def prepare_apply_blueprint():
+    """AJAX upload for an externally reviewed Blueprint workbook."""
     try:
         update_ui_state_from_form(request.form)
         state = get_state()
 @app.route("/models")
 def models_endpoint():
+    """Fetch the currently usable Groq fallback model list for the UI."""
     try:
         models = select_groq_chat_models(limit=len(PREFERRED_PRODUCTION_CHAT_MODELS))
     except Exception as exc:
 @app.route("/references/status")
 def references_status():
+    """Tell the UI whether Hugging Face reference sync can be used."""
     return jsonify(reference_sync_status())
 @app.route("/references/save", methods=["POST"])
 def save_references():
+    """Commit manual references back to the Space repo when HF sync is configured."""
     try:
         result = save_manual_references_to_hub(APP_ROOT)
     except Exception as exc:
 @app.route("/sheets")
 def sheets_endpoint():
+    """Return workbook sheet names for dynamic apply-sheet selection."""
     try:
         workbook_path = resolve_allowed_path(request.args.get("path", ""), APP_ROOT, ALLOWED_FILE_ROOTS)
         if not workbook_path.is_file():
 @app.route("/download-blueprint")
 def download_blueprint():
+    """Download either the session Blueprint or an explicitly requested run file."""
     state = get_state()
     requested_path = request.args.get("path") or state["apply_blueprint_path"]
     if not requested_path:
 @app.route("/download-cleaned-workbook")
 def download_cleaned_workbook():
+    """Download the cleaned workbook copy for this session/run."""
     state = get_state()
     requested_path = request.args.get("path") or state["clean_path"]
     if not requested_path:
 @app.route("/download-applied-workbook")
 def download_applied_workbook():
+    """Download the workbook after Blueprint corrections have been applied."""
     state = get_state()
     requested_path = request.args.get("path") or state["apply_workbook_path"]
     if not requested_path:
 @app.route("/run")
 def run():
+    """Start the cleaning subprocess and stream its logs as server-sent events."""
     job_id = request.args.get("job_id", uuid.uuid4().hex)
     input_path = request.args.get("input", "")
     sheet = request.args.get("sheet", "")
     except ValueError as exc:
         return jsonify({"error": str(exc)}), 400
+    # Each run gets its own Blueprint so simultaneous users cannot overwrite it.
     blueprint_path = UPLOAD_DIR / f"Blueprint_{job_id}.xlsx"
     state = get_state()
     state["apply_blueprint_path"] = str(blueprint_path)
 @app.route("/stop", methods=["POST"])
 def stop():
+    """Stop a running cleaning subprocess for the given frontend job id."""
     job_id = request.args.get("job_id", "")
     if not stop_process(job_id):
         return jsonify({"stopped": False, "message": "No active run found."}), 404
 @app.route("/apply")
 def apply_blueprint():
+    """Start the Blueprint-apply subprocess and stream its logs."""
     input_path = request.args.get("input", "")
     blueprint_path = request.args.get("blueprint", "")
     sheet = request.args.get("sheet", DEFAULT_OUTPUT_SHEET_NAME)
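
The fail-closed behaviour added to `require_basic_auth` can be illustrated without Flask. In the sketch below, `parse_basic_auth` and `auth_status` are hypothetical helper names, and the credential comparison is an assumption because the diff does not show that part of the function. Only the first branch (503 on a Space, open access locally when credentials are unset) mirrors the committed lines.

```python
import base64
import hmac


def parse_basic_auth(header):
    """Return (username, password) from a Basic Authorization header, or None."""
    if not header or not header.startswith("Basic "):
        return None
    try:
        decoded = base64.b64decode(header[len("Basic "):], validate=True).decode("utf-8")
    except ValueError:
        # Covers malformed base64 and undecodable bytes (both subclass ValueError).
        return None
    username, sep, password = decoded.partition(":")
    return (username, password) if sep else None


def auth_status(header, app_username, app_password, space_id):
    """Return None to allow the request, or the HTTP status code to reject it with."""
    if not app_password or not app_username:
        # Mirrors the diff: fail closed on a Hugging Face Space, stay open locally.
        return 503 if space_id else None
    credentials = parse_basic_auth(header)
    if credentials is None:
        return 401
    # Assumption: constant-time comparison of both fields.
    user_ok = hmac.compare_digest(credentials[0].encode(), app_username.encode())
    pass_ok = hmac.compare_digest(credentials[1].encode(), app_password.encode())
    return None if (user_ok and pass_ok) else 401
```

A deployed Space with no credentials configured therefore answers 503 instead of silently serving the app, while a local run without credentials remains usable for development.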