Spaces:
Running
Running
Commit ·
c6a3f44
1
Parent(s): ad5ab1d
Added functional doc in README.md and added basic
Browse filescomments to the code
modified: README.md
modified: apply_blueprint.py
modified: main.py
modified: newest_model.py
modified: src/config.py
modified: src/data_pipeline.py
modified: src/llm_router.py
modified: src/process_runner.py
modified: src/utils.py
modified: src/workbook_io.py
modified: ui_app.py
- README.md +130 -14
- apply_blueprint.py +12 -7
- main.py +21 -20
- newest_model.py +5 -0
- src/config.py +5 -3
- src/data_pipeline.py +25 -29
- src/llm_router.py +11 -2
- src/process_runner.py +6 -0
- src/utils.py +10 -0
- src/workbook_io.py +5 -0
- ui_app.py +33 -3
README.md
CHANGED
|
@@ -1,18 +1,134 @@
|
|
| 1 |
-
|
| 2 |
-
title: MasterMap Cleaner
|
| 3 |
-
sdk: docker
|
| 4 |
-
app_port: 7860
|
| 5 |
-
---
|
| 6 |
|
| 7 |
-
|
| 8 |
|
| 9 |
-
|
| 10 |
-
Set production secrets in the Hugging Face Space settings, not in committed files.
|
| 11 |
|
| 12 |
-
|
| 13 |
-
- `APP_PASSWORD`: required to password-protect the deployed Space; set it as a Hugging Face Secret.
|
| 14 |
-
- `APP_USERNAME`: optional, defaults to `mastermap` when `APP_PASSWORD` is set.
|
| 15 |
-
- `APP_SECRET_KEY`: recommended for stable, isolated per-browser sessions.
|
| 16 |
-
- `HF_TOKEN`: Hugging Face Space Secret only; optional, required only for the `Save Manual References` button.
|
| 17 |
|
| 18 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# MasterMap Cleaner
|
|
|
|
|
|
|
|
|
|
|
|
|
| 2 |
|
| 3 |
+
MasterMap Cleaner is a web tool used to clean and standardize MasterMap Excel files.
|
| 4 |
|
| 5 |
+
The tool takes an uploaded workbook, checks selected columns against approved reference lists, uses AI only when needed, and creates a cleaned version of the workbook. It also generates a review file called a **Blueprint**, where uncertain values can be checked and corrected by a human before the final workbook is downloaded.
|
|
|
|
| 6 |
|
| 7 |
+
## What The Tool Produces
|
|
|
|
|
|
|
|
|
|
|
|
|
| 8 |
|
| 9 |
+
After a cleaning run, the tool can produce:
|
| 10 |
+
|
| 11 |
+
- a cleaned workbook with a new cleaned sheet
|
| 12 |
+
- a Blueprint file for human review
|
| 13 |
+
- a final workbook after the reviewed Blueprint has been applied
|
| 14 |
+
|
| 15 |
+
## Before You Start
|
| 16 |
+
|
| 17 |
+
You need:
|
| 18 |
+
|
| 19 |
+
- the Hugging Face Space link for the tool
|
| 20 |
+
- the username and password provided by the tool administrator
|
| 21 |
+
- the Excel workbook you want to clean
|
| 22 |
+
- access to Excel or another spreadsheet editor to review the Blueprint
|
| 23 |
+
|
| 24 |
+
Accepted workbook formats:
|
| 25 |
+
|
| 26 |
+
- `.xlsx`
|
| 27 |
+
- `.xlsm`
|
| 28 |
+
|
| 29 |
+
## Recommended Workflow
|
| 30 |
+
|
| 31 |
+
1. Open the tool.
|
| 32 |
+
2. Upload the Excel workbook.
|
| 33 |
+
3. Select the source sheet to clean.
|
| 34 |
+
4. Run the cleaning process.
|
| 35 |
+
5. Download the cleaned workbook and Blueprint.
|
| 36 |
+
6. Review the Blueprint in Excel.
|
| 37 |
+
7. Upload the reviewed Blueprint back into the tool.
|
| 38 |
+
8. Apply the Blueprint.
|
| 39 |
+
9. Download the final cleaned workbook.
|
| 40 |
+
10. Save manual references if new approved values should be remembered for future runs.
|
| 41 |
+
|
| 42 |
+
## Step 1: Open The Tool
|
| 43 |
+
|
| 44 |
+
Open the Hugging Face Space link in your browser.
|
| 45 |
+
|
| 46 |
+
If prompted, enter the username and password provided by the tool administrator.
|
| 47 |
+
|
| 48 |
+
## Step 2: Upload The Dataset
|
| 49 |
+
|
| 50 |
+
In the **Dataset to Clean** section:
|
| 51 |
+
|
| 52 |
+
1. Drop the Excel file into the upload box, or click the box and select the file.
|
| 53 |
+
2. Wait until the file is loaded.
|
| 54 |
+
3. Select the source sheet that contains the data to clean.
|
| 55 |
+
4. Choose the output sheet name.
|
| 56 |
+
|
| 57 |
+
The output sheet is the new sheet that will be created inside the workbook for cleaned data.
|
| 58 |
+
|
| 59 |
+
Do not use the same name as the source sheet.
|
| 60 |
+
|
| 61 |
+
## Step 3: Run Cleaning
|
| 62 |
+
|
| 63 |
+
Click **Run Cleaning**.
|
| 64 |
+
|
| 65 |
+
While the tool is running, it will show progress for each cleaned column. Some files may take time depending on the number of rows and the number of values that require AI review.
|
| 66 |
+
|
| 67 |
+
When the run finishes, the tool will show download links for:
|
| 68 |
+
|
| 69 |
+
- **Blueprint**
|
| 70 |
+
- **Cleaned Workbook**
|
| 71 |
+
|
| 72 |
+
Download both files.
|
| 73 |
+
|
| 74 |
+
## Step 4: Review The Blueprint
|
| 75 |
+
|
| 76 |
+
Open the Blueprint file in Excel.
|
| 77 |
+
|
| 78 |
+
The Blueprint contains values that the tool wants a human to review. Each row represents one value or correction candidate.
|
| 79 |
+
|
| 80 |
+
Main columns:
|
| 81 |
+
|
| 82 |
+
- `Row_Index`: the row in the workbook where the value appears
|
| 83 |
+
- `Column`: the field being reviewed
|
| 84 |
+
- `Original_Raw_Text`: the original value from the uploaded file
|
| 85 |
+
- `AI_Suggested_Match`: the tool's suggested cleaned value
|
| 86 |
+
- `Human_Override`: the reviewer correction field
|
| 87 |
+
- `Confidence`: how confident the tool was
|
| 88 |
+
- `Match_Source`: how the suggestion was produced
|
| 89 |
+
|
| 90 |
+
How to review each row:
|
| 91 |
+
|
| 92 |
+
- If the suggested value is correct, leave `Human_Override` empty.
|
| 93 |
+
- If the suggested value is wrong, choose the correct value from the dropdown.
|
| 94 |
+
- If the correct value is not in the dropdown, type it manually.
|
| 95 |
+
- Focus especially on rows marked `LOW` or `MEDIUM` confidence.
|
| 96 |
+
|
| 97 |
+
The dropdown is there to help, but manual typing is allowed.
|
| 98 |
+
|
| 99 |
+
After reviewing, save the Blueprint file.
|
| 100 |
+
|
| 101 |
+
## Step 5: Apply The Reviewed Blueprint
|
| 102 |
+
|
| 103 |
+
Return to the tool and go to the **Apply Blueprint** section.
|
| 104 |
+
|
| 105 |
+
1. Upload the workbook that should receive the corrections.
|
| 106 |
+
2. Select the sheet to update.
|
| 107 |
+
3. Upload the reviewed Blueprint file.
|
| 108 |
+
4. Click **Apply Blueprint**.
|
| 109 |
+
|
| 110 |
+
Use the same cleaned sheet that was created during the cleaning step unless you were instructed otherwise.
|
| 111 |
+
|
| 112 |
+
When the apply step finishes, download the final cleaned workbook.
|
| 113 |
+
|
| 114 |
+
## Step 6: Save Manual References
|
| 115 |
+
|
| 116 |
+
After applying a Blueprint, the tool may learn newly approved values so future runs can recognize them automatically.
|
| 117 |
+
|
| 118 |
+
If the **Save Manual References** button is available, click it after applying the Blueprint.
|
| 119 |
+
|
| 120 |
+
Use this button when:
|
| 121 |
+
|
| 122 |
+
- the Blueprint contained manually approved values
|
| 123 |
+
- those values should be remembered in future cleaning runs
|
| 124 |
+
- the administrator instructed you to preserve the updated references
|
| 125 |
+
|
| 126 |
+
If the button is disabled, continue using the tool normally. The administrator may handle reference saving separately.
|
| 127 |
+
|
| 128 |
+
## Important Usage Notes
|
| 129 |
+
|
| 130 |
+
- Keep the browser tab open while a cleaning or apply process is running.
|
| 131 |
+
- Download your files before closing the page.
|
| 132 |
+
- If you refresh the page, you may need to upload the files again.
|
| 133 |
+
- Do not share the tool password outside the approved user group.
|
| 134 |
+
- Do not upload files unless they are intended to be processed by this tool.
|
apply_blueprint.py
CHANGED
|
@@ -14,6 +14,7 @@ from src.config import (
|
|
| 14 |
from src.utils import normalize_ref, prune_manual_refs_against_official, ref_contains
|
| 15 |
|
| 16 |
def parse_args():
|
|
|
|
| 17 |
parser = argparse.ArgumentParser(description="Apply Blueprint Human Overrides")
|
| 18 |
parser.add_argument("--input", required=True, help="Master Excel file name inside data/")
|
| 19 |
parser.add_argument("--blueprint", default=DEFAULT_BLUEPRINT_FILE, help="Blueprint Excel file name inside data/")
|
|
@@ -29,6 +30,7 @@ def parse_args():
|
|
| 29 |
return args
|
| 30 |
|
| 31 |
def load_json_safe(filepath):
|
|
|
|
| 32 |
try:
|
| 33 |
with open(filepath, 'r', encoding='utf-8-sig') as f:
|
| 34 |
return json.load(f)
|
|
@@ -36,16 +38,19 @@ def load_json_safe(filepath):
|
|
| 36 |
return {}
|
| 37 |
|
| 38 |
def split_approved_parts(value):
|
|
|
|
| 39 |
if pd.isna(value):
|
| 40 |
return []
|
| 41 |
return [part.strip() for part in str(value).split(",") if part.strip()]
|
| 42 |
|
| 43 |
def ensure_manual_bucket(manual_refs, official_refs, column_name):
|
|
|
|
| 44 |
if column_name not in manual_refs:
|
| 45 |
manual_refs[column_name] = {} if isinstance(official_refs.get(column_name), dict) else []
|
| 46 |
return manual_refs[column_name]
|
| 47 |
|
| 48 |
def remember_approved_value(manual_refs, official_refs, column_name, approved_value):
|
|
|
|
| 49 |
manual_bucket = ensure_manual_bucket(manual_refs, official_refs, column_name)
|
| 50 |
added_count = 0
|
| 51 |
|
|
@@ -85,7 +90,7 @@ if __name__ == "__main__":
|
|
| 85 |
print(f"Error: Blueprint is missing required columns: {sorted(missing_columns)}")
|
| 86 |
exit()
|
| 87 |
|
| 88 |
-
#
|
| 89 |
wb = openpyxl.load_workbook(args.input)
|
| 90 |
if args.sheet not in wb.sheetnames:
|
| 91 |
print(f"Error: No '{args.sheet}' sheet found in {args.input}.")
|
|
@@ -98,7 +103,7 @@ if __name__ == "__main__":
|
|
| 98 |
if sheet.cell(row=1, column=c).value
|
| 99 |
}
|
| 100 |
|
| 101 |
-
#
|
| 102 |
official_refs = load_json_safe(args.refs)
|
| 103 |
manual_refs = load_json_safe(args.manual_refs)
|
| 104 |
|
|
@@ -107,6 +112,7 @@ if __name__ == "__main__":
|
|
| 107 |
|
| 108 |
print("Applying manual overrides and updating memory...")
|
| 109 |
for _, row in bp_df.iterrows():
|
|
|
|
| 110 |
human_val = str(row['Human_Override']).strip() if pd.notna(row['Human_Override']) else ""
|
| 111 |
approved_val = human_val if human_val else str(row['AI_Suggested_Match']).strip() if pd.notna(row['AI_Suggested_Match']) else ""
|
| 112 |
confidence = str(row['Confidence']).strip().upper() if pd.notna(row['Confidence']) else ""
|
|
@@ -117,7 +123,7 @@ if __name__ == "__main__":
|
|
| 117 |
raw_col = str(row['Column']).strip()
|
| 118 |
|
| 119 |
if human_val:
|
| 120 |
-
#
|
| 121 |
try:
|
| 122 |
excel_row = int(row['Row_Index'])
|
| 123 |
except (TypeError, ValueError):
|
|
@@ -136,7 +142,7 @@ if __name__ == "__main__":
|
|
| 136 |
sheet.cell(row=excel_row, column=col_idx).value = human_val
|
| 137 |
changes_made += 1
|
| 138 |
|
| 139 |
-
#
|
| 140 |
if raw_col == "Degree":
|
| 141 |
continue
|
| 142 |
|
|
@@ -152,11 +158,10 @@ if __name__ == "__main__":
|
|
| 152 |
|
| 153 |
memory_pruned = prune_manual_refs_against_official(manual_refs, official_refs)
|
| 154 |
|
| 155 |
-
#
|
| 156 |
wb.save(args.input)
|
| 157 |
|
| 158 |
-
#
|
| 159 |
-
# Make sure the data directory exists before dumping
|
| 160 |
manual_refs_dir = os.path.dirname(args.manual_refs)
|
| 161 |
if manual_refs_dir:
|
| 162 |
os.makedirs(manual_refs_dir, exist_ok=True)
|
|
|
|
| 14 |
from src.utils import normalize_ref, prune_manual_refs_against_official, ref_contains
|
| 15 |
|
| 16 |
def parse_args():
|
| 17 |
+
"""Parse workbook, Blueprint, and reference paths for the apply step."""
|
| 18 |
parser = argparse.ArgumentParser(description="Apply Blueprint Human Overrides")
|
| 19 |
parser.add_argument("--input", required=True, help="Master Excel file name inside data/")
|
| 20 |
parser.add_argument("--blueprint", default=DEFAULT_BLUEPRINT_FILE, help="Blueprint Excel file name inside data/")
|
|
|
|
| 30 |
return args
|
| 31 |
|
| 32 |
def load_json_safe(filepath):
|
| 33 |
+
"""Load JSON memory files and fall back to an empty dict if absent/corrupt."""
|
| 34 |
try:
|
| 35 |
with open(filepath, 'r', encoding='utf-8-sig') as f:
|
| 36 |
return json.load(f)
|
|
|
|
| 38 |
return {}
|
| 39 |
|
| 40 |
def split_approved_parts(value):
|
| 41 |
+
"""Split multi-value approvals into individual reference candidates."""
|
| 42 |
if pd.isna(value):
|
| 43 |
return []
|
| 44 |
return [part.strip() for part in str(value).split(",") if part.strip()]
|
| 45 |
|
| 46 |
def ensure_manual_bucket(manual_refs, official_refs, column_name):
|
| 47 |
+
"""Create the correct manual-ref container for list or dict reference columns."""
|
| 48 |
if column_name not in manual_refs:
|
| 49 |
manual_refs[column_name] = {} if isinstance(official_refs.get(column_name), dict) else []
|
| 50 |
return manual_refs[column_name]
|
| 51 |
|
| 52 |
def remember_approved_value(manual_refs, official_refs, column_name, approved_value):
|
| 53 |
+
"""Remember approved values that are not already official or manual refs."""
|
| 54 |
manual_bucket = ensure_manual_bucket(manual_refs, official_refs, column_name)
|
| 55 |
added_count = 0
|
| 56 |
|
|
|
|
| 90 |
print(f"Error: Blueprint is missing required columns: {sorted(missing_columns)}")
|
| 91 |
exit()
|
| 92 |
|
| 93 |
+
# Human overrides are applied directly to the selected cleaned sheet.
|
| 94 |
wb = openpyxl.load_workbook(args.input)
|
| 95 |
if args.sheet not in wb.sheetnames:
|
| 96 |
print(f"Error: No '{args.sheet}' sheet found in {args.input}.")
|
|
|
|
| 103 |
if sheet.cell(row=1, column=c).value
|
| 104 |
}
|
| 105 |
|
| 106 |
+
# Reference files use the same CLI defaults as the cleaning pipeline.
|
| 107 |
official_refs = load_json_safe(args.refs)
|
| 108 |
manual_refs = load_json_safe(args.manual_refs)
|
| 109 |
|
|
|
|
| 112 |
|
| 113 |
print("Applying manual overrides and updating memory...")
|
| 114 |
for _, row in bp_df.iterrows():
|
| 115 |
+
# Empty Human_Override means the reviewer accepted the AI suggestion.
|
| 116 |
human_val = str(row['Human_Override']).strip() if pd.notna(row['Human_Override']) else ""
|
| 117 |
approved_val = human_val if human_val else str(row['AI_Suggested_Match']).strip() if pd.notna(row['AI_Suggested_Match']) else ""
|
| 118 |
confidence = str(row['Confidence']).strip().upper() if pd.notna(row['Confidence']) else ""
|
|
|
|
| 123 |
raw_col = str(row['Column']).strip()
|
| 124 |
|
| 125 |
if human_val:
|
| 126 |
+
# Blueprint row indices already include the skipped MasterMap filter row.
|
| 127 |
try:
|
| 128 |
excel_row = int(row['Row_Index'])
|
| 129 |
except (TypeError, ValueError):
|
|
|
|
| 142 |
sheet.cell(row=excel_row, column=col_idx).value = human_val
|
| 143 |
changes_made += 1
|
| 144 |
|
| 145 |
+
# Only approved non-low-confidence values should teach future runs.
|
| 146 |
if raw_col == "Degree":
|
| 147 |
continue
|
| 148 |
|
|
|
|
| 158 |
|
| 159 |
memory_pruned = prune_manual_refs_against_official(manual_refs, official_refs)
|
| 160 |
|
| 161 |
+
# Persist workbook updates before writing the learned memory file.
|
| 162 |
wb.save(args.input)
|
| 163 |
|
| 164 |
+
# Manual refs may be written to an empty deployment volume, so ensure the folder exists.
|
|
|
|
| 165 |
manual_refs_dir = os.path.dirname(args.manual_refs)
|
| 166 |
if manual_refs_dir:
|
| 167 |
os.makedirs(manual_refs_dir, exist_ok=True)
|
main.py
CHANGED
|
@@ -9,13 +9,13 @@ from openpyxl.utils import get_column_letter
|
|
| 9 |
from openpyxl.worksheet.datavalidation import DataValidation
|
| 10 |
from openpyxl.workbook.defined_name import DefinedName
|
| 11 |
|
| 12 |
-
# Import our new modular architecture
|
| 13 |
from src.config import parse_cli_args, GROQ_API_KEY, AVAILABLE_MODELS, DEFAULT_SIMILARITY_THRESHOLD
|
| 14 |
from src.llm_router import GroqRouter
|
| 15 |
from src.data_pipeline import process_column, cluster_degrees_by_institution
|
| 16 |
from src.utils import prune_manual_refs_against_official
|
| 17 |
|
| 18 |
-
#
|
|
|
|
| 19 |
COLUMNS_CONFIG = {
|
| 20 |
"Country": r',|;|\n|/',
|
| 21 |
"Institution": r'[,/;|\n]',
|
|
@@ -30,10 +30,12 @@ COLUMNS_CONFIG = {
|
|
| 30 |
master_cache = {}
|
| 31 |
|
| 32 |
def load_json_safe(filepath):
|
|
|
|
| 33 |
with open(filepath, 'r', encoding='utf-8-sig') as f:
|
| 34 |
return json.load(f)
|
| 35 |
|
| 36 |
def validate_official_refs(official_refs):
|
|
|
|
| 37 |
missing = []
|
| 38 |
for column_name in COLUMNS_CONFIG:
|
| 39 |
if column_name == "Degree":
|
|
@@ -51,29 +53,28 @@ def validate_official_refs(official_refs):
|
|
| 51 |
)
|
| 52 |
|
| 53 |
def inject_searchable_dropdowns(blueprint_path, master_unique_lists):
|
| 54 |
-
"""
|
| 55 |
print("Injecting static searchable dropdowns into Blueprint...")
|
| 56 |
wb = openpyxl.load_workbook(blueprint_path)
|
| 57 |
main_sheet = wb.active
|
| 58 |
|
| 59 |
-
#
|
| 60 |
ref_sheet = wb.create_sheet(title="Reference_Lists")
|
| 61 |
|
| 62 |
col_idx = 1
|
| 63 |
for column_name, unique_items in master_unique_lists.items():
|
| 64 |
safe_name = column_name.replace(" ", "_")
|
| 65 |
|
| 66 |
-
# Write the header
|
| 67 |
ref_sheet.cell(row=1, column=col_idx, value=safe_name)
|
| 68 |
|
| 69 |
-
# Clean and alphabetize the list for a better
|
| 70 |
valid_items = sorted([item for item in unique_items if item and isinstance(item, str)])
|
| 71 |
|
| 72 |
# Write the items
|
| 73 |
for row_idx, item in enumerate(valid_items, start=2):
|
| 74 |
ref_sheet.cell(row=row_idx, column=col_idx, value=item)
|
| 75 |
|
| 76 |
-
#
|
| 77 |
if valid_items:
|
| 78 |
letter = get_column_letter(col_idx)
|
| 79 |
range_str = f"Reference_Lists!${letter}$2:${letter}${len(valid_items) + 1}"
|
|
@@ -82,7 +83,7 @@ def inject_searchable_dropdowns(blueprint_path, master_unique_lists):
|
|
| 82 |
|
| 83 |
col_idx += 1
|
| 84 |
|
| 85 |
-
#
|
| 86 |
target_col_idx = None
|
| 87 |
override_col_letter = None
|
| 88 |
for cell in main_sheet[1]:
|
|
@@ -91,7 +92,6 @@ def inject_searchable_dropdowns(blueprint_path, master_unique_lists):
|
|
| 91 |
elif cell.value == "Human_Override":
|
| 92 |
override_col_letter = get_column_letter(cell.column)
|
| 93 |
|
| 94 |
-
# 4. Apply Data Validation
|
| 95 |
if target_col_idx and override_col_letter:
|
| 96 |
dv = DataValidation(
|
| 97 |
type="list",
|
|
@@ -108,7 +108,7 @@ def inject_searchable_dropdowns(blueprint_path, master_unique_lists):
|
|
| 108 |
|
| 109 |
|
| 110 |
if __name__ == "__main__":
|
| 111 |
-
#
|
| 112 |
args = parse_cli_args()
|
| 113 |
source_sheet_name = args.sheet
|
| 114 |
output_sheet_name = args.output_sheet
|
|
@@ -117,7 +117,7 @@ if __name__ == "__main__":
|
|
| 117 |
print("Loading AI Model (this may take a few seconds)...")
|
| 118 |
model = SentenceTransformer('all-MiniLM-L6-v2')
|
| 119 |
|
| 120 |
-
#
|
| 121 |
router = GroqRouter(api_key=GROQ_API_KEY, available_models=available_models)
|
| 122 |
|
| 123 |
if not os.path.exists(args.refs):
|
|
@@ -131,6 +131,8 @@ if __name__ == "__main__":
|
|
| 131 |
official_refs = load_json_safe(args.refs)
|
| 132 |
manual_refs = load_json_safe(args.manual_refs)
|
| 133 |
validate_official_refs(official_refs)
|
|
|
|
|
|
|
| 134 |
memory_pruned = prune_manual_refs_against_official(manual_refs, official_refs)
|
| 135 |
if memory_pruned:
|
| 136 |
print(f"[INFO] Removed {memory_pruned} manual reference duplicate(s) already covered by official refs.")
|
|
@@ -138,10 +140,11 @@ if __name__ == "__main__":
|
|
| 138 |
print(f"Loading Excel dataset from {args.input}, sheet '{source_sheet_name}'...")
|
| 139 |
data = pd.read_excel(args.input, sheet_name=source_sheet_name, skiprows=[1])
|
| 140 |
|
| 141 |
-
#
|
| 142 |
blueprint_records = []
|
| 143 |
|
| 144 |
-
#
|
|
|
|
| 145 |
for col, pattern in COLUMNS_CONFIG.items():
|
| 146 |
if col == "Degree":
|
| 147 |
inst_col = 'Cleaned_Institution' if 'Cleaned_Institution' in data.columns else 'Institution'
|
|
@@ -157,16 +160,15 @@ if __name__ == "__main__":
|
|
| 157 |
split_pattern=pattern, blueprint_data=blueprint_records
|
| 158 |
)
|
| 159 |
|
| 160 |
-
# --- 4. EXPORT RESULTS ---
|
| 161 |
print("\nSaving all memory files...")
|
| 162 |
with open(args.manual_refs, 'w', encoding='utf-8') as f: json.dump(manual_refs, f, indent=4, ensure_ascii=False)
|
| 163 |
|
| 164 |
-
#
|
| 165 |
if blueprint_records:
|
| 166 |
bp_df = pd.DataFrame(blueprint_records)
|
| 167 |
bp_df.to_excel(args.blueprint, index=False)
|
| 168 |
|
| 169 |
-
#
|
| 170 |
bp_wb = openpyxl.load_workbook(args.blueprint)
|
| 171 |
bp_sheet = bp_wb.active
|
| 172 |
|
|
@@ -195,9 +197,8 @@ if __name__ == "__main__":
|
|
| 195 |
bp_wb.save(args.blueprint)
|
| 196 |
print(f"[!] Saved and formatted {len(bp_df)} rows requiring review to {args.blueprint}")
|
| 197 |
|
| 198 |
-
# --- NEW: Build master lists and inject dropdowns ---
|
| 199 |
def extract_uniques(ref_data):
|
| 200 |
-
"""
|
| 201 |
if isinstance(ref_data, dict): return list(ref_data.values())
|
| 202 |
elif isinstance(ref_data, list): return ref_data
|
| 203 |
return []
|
|
@@ -206,7 +207,7 @@ if __name__ == "__main__":
|
|
| 206 |
for category in COLUMNS_CONFIG.keys():
|
| 207 |
off_items = extract_uniques(official_refs.get(category, []))
|
| 208 |
man_items = extract_uniques(manual_refs.get(category, []))
|
| 209 |
-
# Merge
|
| 210 |
master_lists[category] = list(set([x for x in (off_items + man_items) if x]))
|
| 211 |
|
| 212 |
inject_searchable_dropdowns(args.blueprint, master_lists)
|
|
@@ -214,7 +215,7 @@ if __name__ == "__main__":
|
|
| 214 |
else:
|
| 215 |
print("[!] No blueprint generated. All matches were HIGH confidence!")
|
| 216 |
|
| 217 |
-
#
|
| 218 |
print("\nOpening original Excel file to preserve formatting...")
|
| 219 |
wb = openpyxl.load_workbook(args.input)
|
| 220 |
new_sheet_name = output_sheet_name
|
|
|
|
| 9 |
from openpyxl.worksheet.datavalidation import DataValidation
|
| 10 |
from openpyxl.workbook.defined_name import DefinedName
|
| 11 |
|
|
|
|
| 12 |
from src.config import parse_cli_args, GROQ_API_KEY, AVAILABLE_MODELS, DEFAULT_SIMILARITY_THRESHOLD
|
| 13 |
from src.llm_router import GroqRouter
|
| 14 |
from src.data_pipeline import process_column, cluster_degrees_by_institution
|
| 15 |
from src.utils import prune_manual_refs_against_official
|
| 16 |
|
| 17 |
+
# Each cleaned column has its own conservative split pattern. Avoid splitting
|
| 18 |
+
# on words like "and" because they can be part of official country names.
|
| 19 |
COLUMNS_CONFIG = {
|
| 20 |
"Country": r',|;|\n|/',
|
| 21 |
"Institution": r'[,/;|\n]',
|
|
|
|
| 30 |
master_cache = {}
|
| 31 |
|
| 32 |
def load_json_safe(filepath):
|
| 33 |
+
"""Load reference JSON files, accepting UTF-8 files with or without a BOM."""
|
| 34 |
with open(filepath, 'r', encoding='utf-8-sig') as f:
|
| 35 |
return json.load(f)
|
| 36 |
|
| 37 |
def validate_official_refs(official_refs):
|
| 38 |
+
"""Fail early if required reference buckets are missing or empty."""
|
| 39 |
missing = []
|
| 40 |
for column_name in COLUMNS_CONFIG:
|
| 41 |
if column_name == "Degree":
|
|
|
|
| 53 |
)
|
| 54 |
|
| 55 |
def inject_searchable_dropdowns(blueprint_path, master_unique_lists):
|
| 56 |
+
"""Add hidden reference lists and dropdowns to the generated Blueprint."""
|
| 57 |
print("Injecting static searchable dropdowns into Blueprint...")
|
| 58 |
wb = openpyxl.load_workbook(blueprint_path)
|
| 59 |
main_sheet = wb.active
|
| 60 |
|
| 61 |
+
# Store all dropdown values on a hidden sheet so Excel can reference them.
|
| 62 |
ref_sheet = wb.create_sheet(title="Reference_Lists")
|
| 63 |
|
| 64 |
col_idx = 1
|
| 65 |
for column_name, unique_items in master_unique_lists.items():
|
| 66 |
safe_name = column_name.replace(" ", "_")
|
| 67 |
|
|
|
|
| 68 |
ref_sheet.cell(row=1, column=col_idx, value=safe_name)
|
| 69 |
|
| 70 |
+
# Clean and alphabetize the list for a better review experience.
|
| 71 |
valid_items = sorted([item for item in unique_items if item and isinstance(item, str)])
|
| 72 |
|
| 73 |
# Write the items
|
| 74 |
for row_idx, item in enumerate(valid_items, start=2):
|
| 75 |
ref_sheet.cell(row=row_idx, column=col_idx, value=item)
|
| 76 |
|
| 77 |
+
# Named ranges let data validation reference long lists safely.
|
| 78 |
if valid_items:
|
| 79 |
letter = get_column_letter(col_idx)
|
| 80 |
range_str = f"Reference_Lists!${letter}$2:${letter}${len(valid_items) + 1}"
|
|
|
|
| 83 |
|
| 84 |
col_idx += 1
|
| 85 |
|
| 86 |
+
# The override dropdown changes based on the row's target column.
|
| 87 |
target_col_idx = None
|
| 88 |
override_col_letter = None
|
| 89 |
for cell in main_sheet[1]:
|
|
|
|
| 92 |
elif cell.value == "Human_Override":
|
| 93 |
override_col_letter = get_column_letter(cell.column)
|
| 94 |
|
|
|
|
| 95 |
if target_col_idx and override_col_letter:
|
| 96 |
dv = DataValidation(
|
| 97 |
type="list",
|
|
|
|
| 108 |
|
| 109 |
|
| 110 |
if __name__ == "__main__":
|
| 111 |
+
# Parse CLI/UI arguments before loading any expensive model assets.
|
| 112 |
args = parse_cli_args()
|
| 113 |
source_sheet_name = args.sheet
|
| 114 |
output_sheet_name = args.output_sheet
|
|
|
|
| 117 |
print("Loading AI Model (this may take a few seconds)...")
|
| 118 |
model = SentenceTransformer('all-MiniLM-L6-v2')
|
| 119 |
|
| 120 |
+
# The router owns Groq fallback order and rate-limit switching.
|
| 121 |
router = GroqRouter(api_key=GROQ_API_KEY, available_models=available_models)
|
| 122 |
|
| 123 |
if not os.path.exists(args.refs):
|
|
|
|
| 131 |
official_refs = load_json_safe(args.refs)
|
| 132 |
manual_refs = load_json_safe(args.manual_refs)
|
| 133 |
validate_official_refs(official_refs)
|
| 134 |
+
|
| 135 |
+
# Manual memory should only contain values not already covered by official refs.
|
| 136 |
memory_pruned = prune_manual_refs_against_official(manual_refs, official_refs)
|
| 137 |
if memory_pruned:
|
| 138 |
print(f"[INFO] Removed {memory_pruned} manual reference duplicate(s) already covered by official refs.")
|
|
|
|
| 140 |
print(f"Loading Excel dataset from {args.input}, sheet '{source_sheet_name}'...")
|
| 141 |
data = pd.read_excel(args.input, sheet_name=source_sheet_name, skiprows=[1])
|
| 142 |
|
| 143 |
+
# Every uncertain or changed value is logged here for human review.
|
| 144 |
blueprint_records = []
|
| 145 |
|
| 146 |
+
# Run each configured column through the normalization pipeline. Degree
|
| 147 |
+
# values are clustered within each institution instead of matched to refs.
|
| 148 |
for col, pattern in COLUMNS_CONFIG.items():
|
| 149 |
if col == "Degree":
|
| 150 |
inst_col = 'Cleaned_Institution' if 'Cleaned_Institution' in data.columns else 'Institution'
|
|
|
|
| 160 |
split_pattern=pattern, blueprint_data=blueprint_records
|
| 161 |
)
|
| 162 |
|
|
|
|
| 163 |
print("\nSaving all memory files...")
|
| 164 |
with open(args.manual_refs, 'w', encoding='utf-8') as f: json.dump(manual_refs, f, indent=4, ensure_ascii=False)
|
| 165 |
|
| 166 |
+
# Export the review workbook only when there is something to inspect.
|
| 167 |
if blueprint_records:
|
| 168 |
bp_df = pd.DataFrame(blueprint_records)
|
| 169 |
bp_df.to_excel(args.blueprint, index=False)
|
| 170 |
|
| 171 |
+
# Basic formatting helps reviewers scan confidence levels quickly.
|
| 172 |
bp_wb = openpyxl.load_workbook(args.blueprint)
|
| 173 |
bp_sheet = bp_wb.active
|
| 174 |
|
|
|
|
| 197 |
bp_wb.save(args.blueprint)
|
| 198 |
print(f"[!] Saved and formatted {len(bp_df)} rows requiring review to {args.blueprint}")
|
| 199 |
|
|
|
|
| 200 |
def extract_uniques(ref_data):
|
| 201 |
+
"""Extract display values from list-style or dict-style references."""
|
| 202 |
if isinstance(ref_data, dict): return list(ref_data.values())
|
| 203 |
elif isinstance(ref_data, list): return ref_data
|
| 204 |
return []
|
|
|
|
| 207 |
for category in COLUMNS_CONFIG.keys():
|
| 208 |
off_items = extract_uniques(official_refs.get(category, []))
|
| 209 |
man_items = extract_uniques(manual_refs.get(category, []))
|
| 210 |
+
# Merge official and manual values for the Blueprint dropdowns.
|
| 211 |
master_lists[category] = list(set([x for x in (off_items + man_items) if x]))
|
| 212 |
|
| 213 |
inject_searchable_dropdowns(args.blueprint, master_lists)
|
|
|
|
| 215 |
else:
|
| 216 |
print("[!] No blueprint generated. All matches were HIGH confidence!")
|
| 217 |
|
| 218 |
+
# Copy the source sheet to preserve formatting, then overwrite cleaned columns.
|
| 219 |
print("\nOpening original Excel file to preserve formatting...")
|
| 220 |
wb = openpyxl.load_workbook(args.input)
|
| 221 |
new_sheet_name = output_sheet_name
|
newest_model.py
CHANGED
|
@@ -37,6 +37,7 @@ PREFERRED_MODEL_IDS = {model_id.lower() for model_id in PREFERRED_PRODUCTION_CHA
|
|
| 37 |
|
| 38 |
|
| 39 |
def fetch_groq_models(api_key: str) -> list[dict[str, Any]]:
|
|
|
|
| 40 |
headers = {
|
| 41 |
"Authorization": f"Bearer {api_key}",
|
| 42 |
"Content-Type": "application/json",
|
|
@@ -47,6 +48,7 @@ def fetch_groq_models(api_key: str) -> list[dict[str, Any]]:
|
|
| 47 |
|
| 48 |
|
| 49 |
def is_active_chat_model(model: dict[str, Any]) -> bool:
|
|
|
|
| 50 |
model_id = str(model.get("id", "")).lower()
|
| 51 |
if not model_id:
|
| 52 |
return False
|
|
@@ -58,6 +60,7 @@ def is_active_chat_model(model: dict[str, Any]) -> bool:
|
|
| 58 |
|
| 59 |
|
| 60 |
def rank_model(model: dict[str, Any]) -> tuple[int, int, int, str]:
|
|
|
|
| 61 |
model_id = str(model.get("id", ""))
|
| 62 |
model_id_lower = model_id.lower()
|
| 63 |
|
|
@@ -75,6 +78,7 @@ def rank_model(model: dict[str, Any]) -> tuple[int, int, int, str]:
|
|
| 75 |
|
| 76 |
|
| 77 |
def select_groq_chat_models(limit: int = len(PREFERRED_PRODUCTION_CHAT_MODELS), strategy: str = "stable") -> list[str]:
|
|
|
|
| 78 |
api_key = os.getenv("GROQ_API_KEY")
|
| 79 |
if not api_key:
|
| 80 |
raise RuntimeError("GROQ_API_KEY is missing. Add it to .env first.")
|
|
@@ -98,6 +102,7 @@ def select_groq_chat_models(limit: int = len(PREFERRED_PRODUCTION_CHAT_MODELS),
|
|
| 98 |
|
| 99 |
|
| 100 |
def main() -> None:
|
|
|
|
| 101 |
parser = argparse.ArgumentParser(description="Select currently available Groq chat models.")
|
| 102 |
parser.add_argument("--limit", type=int, default=len(PREFERRED_PRODUCTION_CHAT_MODELS), help="Number of fallback models to print.")
|
| 103 |
parser.add_argument(
|
|
|
|
| 37 |
|
| 38 |
|
| 39 |
def fetch_groq_models(api_key: str) -> list[dict[str, Any]]:
|
| 40 |
+
"""Fetch the current Groq model catalog using the OpenAI-compatible API."""
|
| 41 |
headers = {
|
| 42 |
"Authorization": f"Bearer {api_key}",
|
| 43 |
"Content-Type": "application/json",
|
|
|
|
| 48 |
|
| 49 |
|
| 50 |
def is_active_chat_model(model: dict[str, Any]) -> bool:
|
| 51 |
+
"""Keep only active preferred chat models that are suitable for judging."""
|
| 52 |
model_id = str(model.get("id", "")).lower()
|
| 53 |
if not model_id:
|
| 54 |
return False
|
|
|
|
| 60 |
|
| 61 |
|
| 62 |
def rank_model(model: dict[str, Any]) -> tuple[int, int, int, str]:
|
| 63 |
+
"""Sort models by preferred production order, then by recency/capacity."""
|
| 64 |
model_id = str(model.get("id", ""))
|
| 65 |
model_id_lower = model_id.lower()
|
| 66 |
|
|
|
|
| 78 |
|
| 79 |
|
| 80 |
def select_groq_chat_models(limit: int = len(PREFERRED_PRODUCTION_CHAT_MODELS), strategy: str = "stable") -> list[str]:
|
| 81 |
+
"""Return a comma-ready fallback list for GROQ_MODEL."""
|
| 82 |
api_key = os.getenv("GROQ_API_KEY")
|
| 83 |
if not api_key:
|
| 84 |
raise RuntimeError("GROQ_API_KEY is missing. Add it to .env first.")
|
|
|
|
| 102 |
|
| 103 |
|
| 104 |
def main() -> None:
|
| 105 |
+
"""CLI entry point used when refreshing the recommended Groq model list."""
|
| 106 |
parser = argparse.ArgumentParser(description="Select currently available Groq chat models.")
|
| 107 |
parser.add_argument("--limit", type=int, default=len(PREFERRED_PRODUCTION_CHAT_MODELS), help="Number of fallback models to print.")
|
| 108 |
parser.add_argument(
|
src/config.py
CHANGED
|
@@ -2,12 +2,13 @@ import os
|
|
| 2 |
import argparse
|
| 3 |
from dotenv import load_dotenv
|
| 4 |
|
| 5 |
-
# Load
|
|
|
|
| 6 |
load_dotenv()
|
| 7 |
|
| 8 |
# --- ENVIRONMENT VARIABLES to be set up in .env ---
|
| 9 |
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
|
| 10 |
-
RAW_MODELS = os.getenv("GROQ_MODEL")
|
| 11 |
APP_USERNAME = os.getenv("APP_USERNAME")
|
| 12 |
APP_PASSWORD = os.getenv("APP_PASSWORD")
|
| 13 |
SPACE_ID = os.getenv("SPACE_ID")
|
|
@@ -46,7 +47,7 @@ def resolve_ref_path(file_arg):
|
|
| 46 |
return os.path.join(REFDATA_DIR, file_arg)
|
| 47 |
|
| 48 |
def parse_cli_args():
|
| 49 |
-
"""
|
| 50 |
parser = argparse.ArgumentParser(description="MasterMap Data Normalization Pipeline")
|
| 51 |
parser.add_argument("--input", required=True, help="Raw input Excel file name inside data/")
|
| 52 |
parser.add_argument("--blueprint", default=DEFAULT_BLUEPRINT_FILE, help="Blueprint Excel file name inside data/")
|
|
@@ -57,6 +58,7 @@ def parse_cli_args():
|
|
| 57 |
parser.add_argument("--models", default="", help="Comma-separated Groq models to use in fallback order")
|
| 58 |
|
| 59 |
args = parser.parse_args()
|
|
|
|
| 60 |
args.input = resolve_data_path(args.input)
|
| 61 |
args.blueprint = resolve_data_path(args.blueprint)
|
| 62 |
args.refs = resolve_ref_path(args.refs)
|
|
|
|
| 2 |
import argparse
|
| 3 |
from dotenv import load_dotenv
|
| 4 |
|
| 5 |
+
# Load local .env values for development; Hugging Face injects the same names
|
| 6 |
+
# as environment variables in production.
|
| 7 |
load_dotenv()
|
| 8 |
|
| 9 |
# --- ENVIRONMENT VARIABLES to be set up in .env ---
|
| 10 |
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
|
| 11 |
+
RAW_MODELS = os.getenv("GROQ_MODEL", "")
|
| 12 |
APP_USERNAME = os.getenv("APP_USERNAME")
|
| 13 |
APP_PASSWORD = os.getenv("APP_PASSWORD")
|
| 14 |
SPACE_ID = os.getenv("SPACE_ID")
|
|
|
|
| 47 |
return os.path.join(REFDATA_DIR, file_arg)
|
| 48 |
|
| 49 |
def parse_cli_args():
|
| 50 |
+
"""Parse shared CLI arguments used by both local runs and the Flask UI."""
|
| 51 |
parser = argparse.ArgumentParser(description="MasterMap Data Normalization Pipeline")
|
| 52 |
parser.add_argument("--input", required=True, help="Raw input Excel file name inside data/")
|
| 53 |
parser.add_argument("--blueprint", default=DEFAULT_BLUEPRINT_FILE, help="Blueprint Excel file name inside data/")
|
|
|
|
| 58 |
parser.add_argument("--models", default="", help="Comma-separated Groq models to use in fallback order")
|
| 59 |
|
| 60 |
args = parser.parse_args()
|
| 61 |
+
# Keep CLI calls short by treating bare names as files under data/refdata.
|
| 62 |
args.input = resolve_data_path(args.input)
|
| 63 |
args.blueprint = resolve_data_path(args.blueprint)
|
| 64 |
args.refs = resolve_ref_path(args.refs)
|
src/data_pipeline.py
CHANGED
|
@@ -5,7 +5,6 @@ from collections import Counter
|
|
| 5 |
from sentence_transformers import util
|
| 6 |
from tqdm import tqdm
|
| 7 |
|
| 8 |
-
# Import our pure text manipulation functions
|
| 9 |
from src.utils import (
|
| 10 |
clean_degree_text,
|
| 11 |
normalize_text,
|
|
@@ -13,11 +12,8 @@ from src.utils import (
|
|
| 13 |
smart_format
|
| 14 |
)
|
| 15 |
from src.config import TOP_K_CANDIDATES, DEFAULT_SIMILARITY_THRESHOLD
|
| 16 |
-
# ---------------------------------------------------------------------------
|
| 17 |
-
# ML & CLUSTERING ENGINE
|
| 18 |
-
# ---------------------------------------------------------------------------
|
| 19 |
-
|
| 20 |
def self_cluster_degrees(raw_degrees_list, model, school_cache, threshold=0.93):
|
|
|
|
| 21 |
cleaned_list = [clean_degree_text(raw) for raw in raw_degrees_list if isinstance(raw, str)]
|
| 22 |
raw_to_clean = {raw: clean_degree_text(raw) for raw in raw_degrees_list if isinstance(raw, str)}
|
| 23 |
clean_counts = Counter(cleaned_list)
|
|
@@ -32,7 +28,7 @@ def self_cluster_degrees(raw_degrees_list, model, school_cache, threshold=0.93):
|
|
| 32 |
|
| 33 |
embeddings = model.encode(unique_cleans, convert_to_tensor=True)
|
| 34 |
clean_to_clustered = {}
|
| 35 |
-
merge_info = {} #
|
| 36 |
|
| 37 |
for i, current_deg in enumerate(unique_cleans):
|
| 38 |
if current_deg in clean_to_clustered: continue
|
|
@@ -45,8 +41,7 @@ def self_cluster_degrees(raw_degrees_list, model, school_cache, threshold=0.93):
|
|
| 45 |
if score.item() >= threshold and target_deg not in clean_to_clustered:
|
| 46 |
pair_key = f"{min(current_deg, target_deg)}|||{max(current_deg, target_deg)}"
|
| 47 |
|
| 48 |
-
#
|
| 49 |
-
# but it is NOT saved to the json memory.
|
| 50 |
cached_action = school_cache.get(pair_key)
|
| 51 |
|
| 52 |
if cached_action:
|
|
@@ -83,6 +78,7 @@ def self_cluster_degrees(raw_degrees_list, model, school_cache, threshold=0.93):
|
|
| 83 |
|
| 84 |
|
| 85 |
def cluster_degrees_by_institution(df, degree_col, inst_col, model, master_cache, blueprint_data, threshold=0.93):
|
|
|
|
| 86 |
print(f"\n[INFO] Auto-Clustering '{degree_col}'. (Merges will be logged to Blueprint...)")
|
| 87 |
cleaned_col_name = f'Cleaned_{degree_col}'
|
| 88 |
df[cleaned_col_name] = df[degree_col].copy()
|
|
@@ -92,7 +88,7 @@ def cluster_degrees_by_institution(df, degree_col, inst_col, model, master_cache
|
|
| 92 |
|
| 93 |
school_mappings = {}
|
| 94 |
|
| 95 |
-
#
|
| 96 |
for school in tqdm(unique_schools, desc=f"Mapping {degree_col}s by Institution"):
|
| 97 |
school_mask = (df[inst_col] == school) & (df[degree_col].notna())
|
| 98 |
raw_degs = df.loc[school_mask, degree_col].astype(str).tolist()
|
|
@@ -101,7 +97,7 @@ def cluster_degrees_by_institution(df, degree_col, inst_col, model, master_cache
|
|
| 101 |
if school not in master_cache["Degree_Decisions"]: master_cache["Degree_Decisions"][school] = {}
|
| 102 |
school_mappings[school] = self_cluster_degrees(raw_degs, model, master_cache["Degree_Decisions"][school], threshold)
|
| 103 |
|
| 104 |
-
#
|
| 105 |
for idx, row in tqdm(df.iterrows(), total=len(df), desc=f"Applying & Logging {degree_col}s"):
|
| 106 |
school = row[inst_col]
|
| 107 |
raw_deg = str(row[degree_col])
|
|
@@ -113,7 +109,6 @@ def cluster_degrees_by_institution(df, degree_col, inst_col, model, master_cache
|
|
| 113 |
final_val, src, conf = mapping_data
|
| 114 |
df.at[idx, cleaned_col_name] = final_val
|
| 115 |
|
| 116 |
-
# Log to Blueprint if modified or auto-merged
|
| 117 |
if str(raw_deg).strip() != final_val.strip() or conf != "HIGH":
|
| 118 |
blueprint_data.append({
|
| 119 |
"Row_Index": idx + 3,
|
|
@@ -128,6 +123,7 @@ def cluster_degrees_by_institution(df, degree_col, inst_col, model, master_cache
|
|
| 128 |
|
| 129 |
|
| 130 |
def get_deterministic_match(value, combined_valid_targets):
|
|
|
|
| 131 |
val_clean = normalize_text(value)
|
| 132 |
for target in combined_valid_targets:
|
| 133 |
target_clean = normalize_text(target)
|
|
@@ -138,6 +134,7 @@ def get_deterministic_match(value, combined_valid_targets):
|
|
| 138 |
|
| 139 |
|
| 140 |
def get_top_candidates(model, value, combined_valid_targets, reference_embeddings, k=5):
|
|
|
|
| 141 |
if not combined_valid_targets: return []
|
| 142 |
query_embedding = model.encode(value, convert_to_tensor=True)
|
| 143 |
similarities = util.pytorch_cos_sim(query_embedding, reference_embeddings)[0]
|
|
@@ -146,6 +143,7 @@ def get_top_candidates(model, value, combined_valid_targets, reference_embedding
|
|
| 146 |
return [combined_valid_targets[idx] for idx in top_matches.indices]
|
| 147 |
|
| 148 |
def get_dict_exact_match(value, combined_dict):
|
|
|
|
| 149 |
value_clean = normalize_text(value)
|
| 150 |
|
| 151 |
for alias, canonical in combined_dict.items():
|
|
@@ -159,6 +157,7 @@ def get_dict_exact_match(value, combined_dict):
|
|
| 159 |
return None
|
| 160 |
|
| 161 |
def get_dict_rule_match(value, combined_dict):
|
|
|
|
| 162 |
aliases = list(combined_dict.keys())
|
| 163 |
canonical_values = list(dict.fromkeys(combined_dict.values()))
|
| 164 |
|
|
@@ -173,6 +172,7 @@ def get_dict_rule_match(value, combined_dict):
|
|
| 173 |
return None
|
| 174 |
|
| 175 |
def as_reference_list(ref_data):
|
|
|
|
| 176 |
if isinstance(ref_data, list):
|
| 177 |
return ref_data
|
| 178 |
if isinstance(ref_data, dict):
|
|
@@ -180,6 +180,7 @@ def as_reference_list(ref_data):
|
|
| 180 |
return []
|
| 181 |
|
| 182 |
def as_reference_dict(ref_data):
|
|
|
|
| 183 |
if isinstance(ref_data, dict):
|
| 184 |
return ref_data
|
| 185 |
if isinstance(ref_data, list):
|
|
@@ -187,6 +188,7 @@ def as_reference_dict(ref_data):
|
|
| 187 |
return {}
|
| 188 |
|
| 189 |
def update_match_postfix(progress, source_counts):
|
|
|
|
| 190 |
progress.set_postfix({
|
| 191 |
"Exact_Match": source_counts["Exact_Match"],
|
| 192 |
"Rule_Match": source_counts["Rule_Match"],
|
|
@@ -202,6 +204,7 @@ def match_cache_key(column_name, value):
|
|
| 202 |
|
| 203 |
|
| 204 |
def append_unique_cleaned_part(cleaned_parts, value):
|
|
|
|
| 205 |
seen = set()
|
| 206 |
for existing_value in cleaned_parts:
|
| 207 |
for existing_part in str(existing_value).split(","):
|
|
@@ -226,11 +229,8 @@ def append_unique_cleaned_part(cleaned_parts, value):
|
|
| 226 |
return added
|
| 227 |
|
| 228 |
|
| 229 |
-
# ---------------------------------------------------------------------------
|
| 230 |
-
# CORE DATA PIPELINE
|
| 231 |
-
# ---------------------------------------------------------------------------
|
| 232 |
-
|
| 233 |
def process_column(df, column_name, model, groq_router, official_refs, manual_refs, master_cache, split_pattern, blueprint_data):
|
|
|
|
| 234 |
if column_name not in df.columns: return df
|
| 235 |
|
| 236 |
core_data = official_refs.get(column_name, [])
|
|
@@ -241,6 +241,7 @@ def process_column(df, column_name, model, groq_router, official_refs, manual_re
|
|
| 241 |
is_dict_mode = isinstance(core_data, dict)
|
| 242 |
|
| 243 |
def get_updated_embeddings():
|
|
|
|
| 244 |
if is_dict_mode:
|
| 245 |
c_dict = {**as_reference_dict(core_data), **as_reference_dict(added_data)}
|
| 246 |
c_keys = list(c_dict.keys())
|
|
@@ -261,6 +262,7 @@ def process_column(df, column_name, model, groq_router, official_refs, manual_re
|
|
| 261 |
if not is_dict_mode and not combined_valid_targets:
|
| 262 |
raise ValueError(f"No list references loaded for '{column_name}'. Refusing to call Groq for every value.")
|
| 263 |
|
|
|
|
| 264 |
uniques = set()
|
| 265 |
for cell in df[column_name].dropna():
|
| 266 |
for p in re.split(split_pattern, str(cell), flags=re.IGNORECASE):
|
|
@@ -273,14 +275,14 @@ def process_column(df, column_name, model, groq_router, official_refs, manual_re
|
|
| 273 |
for word in progress:
|
| 274 |
word_clean = match_cache_key(column_name, word)
|
| 275 |
|
| 276 |
-
#
|
| 277 |
if word_clean in master_cache[column_name]:
|
| 278 |
detailed_cache[word_clean] = {"val": master_cache[column_name][word_clean], "src": "Memory_Cache", "conf": "HIGH"}
|
| 279 |
source_counts["Memory_Cache"] += 1
|
| 280 |
update_match_postfix(progress, source_counts)
|
| 281 |
continue
|
| 282 |
|
| 283 |
-
#
|
| 284 |
if is_dict_mode:
|
| 285 |
exact = get_dict_exact_match(word, combined_dict)
|
| 286 |
else:
|
|
@@ -293,7 +295,6 @@ def process_column(df, column_name, model, groq_router, official_refs, manual_re
|
|
| 293 |
update_match_postfix(progress, source_counts)
|
| 294 |
continue
|
| 295 |
|
| 296 |
-
# 3. Deterministic / Rule Match
|
| 297 |
if is_dict_mode:
|
| 298 |
suggested_match = get_dict_rule_match(word, combined_dict)
|
| 299 |
else:
|
|
@@ -305,7 +306,7 @@ def process_column(df, column_name, model, groq_router, official_refs, manual_re
|
|
| 305 |
update_match_postfix(progress, source_counts)
|
| 306 |
continue
|
| 307 |
|
| 308 |
-
#
|
| 309 |
candidates = []
|
| 310 |
if is_dict_mode:
|
| 311 |
cand_keys = get_top_candidates(model, word, combined_valid_targets, reference_embeddings)
|
|
@@ -314,12 +315,11 @@ def process_column(df, column_name, model, groq_router, official_refs, manual_re
|
|
| 314 |
else:
|
| 315 |
candidates = get_top_candidates(model, word, combined_valid_targets, reference_embeddings)
|
| 316 |
|
| 317 |
-
# Call the router instance
|
| 318 |
ans_val, src, conf = groq_router.ask_judge(word, candidates, column_name)
|
| 319 |
source_counts[src] += 1
|
| 320 |
update_match_postfix(progress, source_counts)
|
| 321 |
|
| 322 |
-
#
|
| 323 |
if "API_Error" not in conf and ans_val != "UNKNOWN" and ans_val != "LLM_Failed":
|
| 324 |
llm_parts = [p.strip() for p in ans_val.split(",")]
|
| 325 |
corrected_parts = []
|
|
@@ -338,24 +338,20 @@ def process_column(df, column_name, model, groq_router, official_refs, manual_re
|
|
| 338 |
corrected_parts.append(part)
|
| 339 |
all_matched = False
|
| 340 |
else:
|
| 341 |
-
# 1. Exact Match Check (Case-insensitive)
|
| 342 |
exact_match = next((c for c in candidates if c.lower() == part.lower()), None)
|
| 343 |
if exact_match:
|
| 344 |
corrected_parts.append(exact_match)
|
| 345 |
else:
|
| 346 |
-
# 2. Rule-Based Match Check
|
| 347 |
rule_match = get_deterministic_match(part, candidates)
|
| 348 |
if rule_match:
|
| 349 |
corrected_parts.append(rule_match)
|
| 350 |
else:
|
| 351 |
-
#
|
| 352 |
corrected_parts.append(part)
|
| 353 |
all_matched = False
|
| 354 |
|
| 355 |
-
# Remove duplicates while preserving the exact order
|
| 356 |
unique_parts = list(dict.fromkeys(corrected_parts))
|
| 357 |
|
| 358 |
-
# Glue it back together
|
| 359 |
ans_val = ", ".join(unique_parts)
|
| 360 |
|
| 361 |
raw_parts_for_check = [
|
|
@@ -376,7 +372,7 @@ def process_column(df, column_name, model, groq_router, official_refs, manual_re
|
|
| 376 |
|
| 377 |
detailed_cache[word_clean] = {"val": ans_val, "src": src, "conf": conf}
|
| 378 |
|
| 379 |
-
# Reconstruct
|
| 380 |
for idx, row in df.iterrows():
|
| 381 |
cell_val = row[column_name]
|
| 382 |
if pd.isna(cell_val): continue
|
|
@@ -390,7 +386,7 @@ def process_column(df, column_name, model, groq_router, official_refs, manual_re
|
|
| 390 |
while i < len(raw_parts):
|
| 391 |
curr = raw_parts[i]
|
| 392 |
|
| 393 |
-
#
|
| 394 |
if i + 1 < len(raw_parts):
|
| 395 |
combo_clean = match_cache_key(column_name, f"{curr}, {raw_parts[i+1]}")
|
| 396 |
if combo_clean in detailed_cache:
|
|
@@ -416,7 +412,7 @@ def process_column(df, column_name, model, groq_router, official_refs, manual_re
|
|
| 416 |
final_stitched_val = ", ".join(cleaned_parts)
|
| 417 |
df.at[idx, f'Cleaned_{column_name}'] = final_stitched_val
|
| 418 |
|
| 419 |
-
#
|
| 420 |
if str(cell_val).strip() != final_stitched_val.strip() or lowest_conf != "HIGH":
|
| 421 |
blueprint_data.append({
|
| 422 |
"Row_Index": idx + 3,
|
|
|
|
| 5 |
from sentence_transformers import util
|
| 6 |
from tqdm import tqdm
|
| 7 |
|
|
|
|
| 8 |
from src.utils import (
|
| 9 |
clean_degree_text,
|
| 10 |
normalize_text,
|
|
|
|
| 12 |
smart_format
|
| 13 |
)
|
| 14 |
from src.config import TOP_K_CANDIDATES, DEFAULT_SIMILARITY_THRESHOLD
|
|
|
|
|
|
|
|
|
|
|
|
|
| 15 |
def self_cluster_degrees(raw_degrees_list, model, school_cache, threshold=0.93):
|
| 16 |
+
"""Cluster similar degree labels inside one institution."""
|
| 17 |
cleaned_list = [clean_degree_text(raw) for raw in raw_degrees_list if isinstance(raw, str)]
|
| 18 |
raw_to_clean = {raw: clean_degree_text(raw) for raw in raw_degrees_list if isinstance(raw, str)}
|
| 19 |
clean_counts = Counter(cleaned_list)
|
|
|
|
| 28 |
|
| 29 |
embeddings = model.encode(unique_cleans, convert_to_tensor=True)
|
| 30 |
clean_to_clustered = {}
|
| 31 |
+
merge_info = {} # Track similarity scores for Blueprint transparency.
|
| 32 |
|
| 33 |
for i, current_deg in enumerate(unique_cleans):
|
| 34 |
if current_deg in clean_to_clustered: continue
|
|
|
|
| 41 |
if score.item() >= threshold and target_deg not in clean_to_clustered:
|
| 42 |
pair_key = f"{min(current_deg, target_deg)}|||{max(current_deg, target_deg)}"
|
| 43 |
|
| 44 |
+
# Runtime cache avoids repeated decisions within one run only.
|
|
|
|
| 45 |
cached_action = school_cache.get(pair_key)
|
| 46 |
|
| 47 |
if cached_action:
|
|
|
|
| 78 |
|
| 79 |
|
| 80 |
def cluster_degrees_by_institution(df, degree_col, inst_col, model, master_cache, blueprint_data, threshold=0.93):
|
| 81 |
+
"""Apply degree clustering separately for each institution."""
|
| 82 |
print(f"\n[INFO] Auto-Clustering '{degree_col}'. (Merges will be logged to Blueprint...)")
|
| 83 |
cleaned_col_name = f'Cleaned_{degree_col}'
|
| 84 |
df[cleaned_col_name] = df[degree_col].copy()
|
|
|
|
| 88 |
|
| 89 |
school_mappings = {}
|
| 90 |
|
| 91 |
+
# Build school-specific mappings before mutating the dataframe.
|
| 92 |
for school in tqdm(unique_schools, desc=f"Mapping {degree_col}s by Institution"):
|
| 93 |
school_mask = (df[inst_col] == school) & (df[degree_col].notna())
|
| 94 |
raw_degs = df.loc[school_mask, degree_col].astype(str).tolist()
|
|
|
|
| 97 |
if school not in master_cache["Degree_Decisions"]: master_cache["Degree_Decisions"][school] = {}
|
| 98 |
school_mappings[school] = self_cluster_degrees(raw_degs, model, master_cache["Degree_Decisions"][school], threshold)
|
| 99 |
|
| 100 |
+
# Apply mappings and log only changed/merged values for review.
|
| 101 |
for idx, row in tqdm(df.iterrows(), total=len(df), desc=f"Applying & Logging {degree_col}s"):
|
| 102 |
school = row[inst_col]
|
| 103 |
raw_deg = str(row[degree_col])
|
|
|
|
| 109 |
final_val, src, conf = mapping_data
|
| 110 |
df.at[idx, cleaned_col_name] = final_val
|
| 111 |
|
|
|
|
| 112 |
if str(raw_deg).strip() != final_val.strip() or conf != "HIGH":
|
| 113 |
blueprint_data.append({
|
| 114 |
"Row_Index": idx + 3,
|
|
|
|
| 123 |
|
| 124 |
|
| 125 |
def get_deterministic_match(value, combined_valid_targets):
|
| 126 |
+
"""Match obvious aliases/acronyms without calling embeddings or Groq."""
|
| 127 |
val_clean = normalize_text(value)
|
| 128 |
for target in combined_valid_targets:
|
| 129 |
target_clean = normalize_text(target)
|
|
|
|
| 134 |
|
| 135 |
|
| 136 |
def get_top_candidates(model, value, combined_valid_targets, reference_embeddings, k=5):
|
| 137 |
+
"""Return the nearest reference candidates for one raw value."""
|
| 138 |
if not combined_valid_targets: return []
|
| 139 |
query_embedding = model.encode(value, convert_to_tensor=True)
|
| 140 |
similarities = util.pytorch_cos_sim(query_embedding, reference_embeddings)[0]
|
|
|
|
| 143 |
return [combined_valid_targets[idx] for idx in top_matches.indices]
|
| 144 |
|
| 145 |
def get_dict_exact_match(value, combined_dict):
|
| 146 |
+
"""Exact match against alias keys first, then canonical values."""
|
| 147 |
value_clean = normalize_text(value)
|
| 148 |
|
| 149 |
for alias, canonical in combined_dict.items():
|
|
|
|
| 157 |
return None
|
| 158 |
|
| 159 |
def get_dict_rule_match(value, combined_dict):
|
| 160 |
+
"""Rule match dictionary-style refs while returning canonical values."""
|
| 161 |
aliases = list(combined_dict.keys())
|
| 162 |
canonical_values = list(dict.fromkeys(combined_dict.values()))
|
| 163 |
|
|
|
|
| 172 |
return None
|
| 173 |
|
| 174 |
def as_reference_list(ref_data):
|
| 175 |
+
"""Convert list/dict reference data to display values."""
|
| 176 |
if isinstance(ref_data, list):
|
| 177 |
return ref_data
|
| 178 |
if isinstance(ref_data, dict):
|
|
|
|
| 180 |
return []
|
| 181 |
|
| 182 |
def as_reference_dict(ref_data):
|
| 183 |
+
"""Convert list/dict reference data to an alias-to-canonical mapping."""
|
| 184 |
if isinstance(ref_data, dict):
|
| 185 |
return ref_data
|
| 186 |
if isinstance(ref_data, list):
|
|
|
|
| 188 |
return {}
|
| 189 |
|
| 190 |
def update_match_postfix(progress, source_counts):
|
| 191 |
+
"""Expose match-source counts in tqdm without noisy per-row prints."""
|
| 192 |
progress.set_postfix({
|
| 193 |
"Exact_Match": source_counts["Exact_Match"],
|
| 194 |
"Rule_Match": source_counts["Rule_Match"],
|
|
|
|
| 204 |
|
| 205 |
|
| 206 |
def append_unique_cleaned_part(cleaned_parts, value):
|
| 207 |
+
"""Append comma-separated cleaned parts while preserving first-seen order."""
|
| 208 |
seen = set()
|
| 209 |
for existing_value in cleaned_parts:
|
| 210 |
for existing_part in str(existing_value).split(","):
|
|
|
|
| 229 |
return added
|
| 230 |
|
| 231 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 232 |
def process_column(df, column_name, model, groq_router, official_refs, manual_refs, master_cache, split_pattern, blueprint_data):
|
| 233 |
+
"""Clean one dataframe column using refs, embeddings, then Groq fallback."""
|
| 234 |
if column_name not in df.columns: return df
|
| 235 |
|
| 236 |
core_data = official_refs.get(column_name, [])
|
|
|
|
| 241 |
is_dict_mode = isinstance(core_data, dict)
|
| 242 |
|
| 243 |
def get_updated_embeddings():
|
| 244 |
+
"""Build current reference candidates after manual memory is loaded."""
|
| 245 |
if is_dict_mode:
|
| 246 |
c_dict = {**as_reference_dict(core_data), **as_reference_dict(added_data)}
|
| 247 |
c_keys = list(c_dict.keys())
|
|
|
|
| 262 |
if not is_dict_mode and not combined_valid_targets:
|
| 263 |
raise ValueError(f"No list references loaded for '{column_name}'. Refusing to call Groq for every value.")
|
| 264 |
|
| 265 |
+
# Work on unique split values first so repeated cells reuse one decision.
|
| 266 |
uniques = set()
|
| 267 |
for cell in df[column_name].dropna():
|
| 268 |
for p in re.split(split_pattern, str(cell), flags=re.IGNORECASE):
|
|
|
|
| 275 |
for word in progress:
|
| 276 |
word_clean = match_cache_key(column_name, word)
|
| 277 |
|
| 278 |
+
# Fast path: reuse a decision made earlier in this run.
|
| 279 |
if word_clean in master_cache[column_name]:
|
| 280 |
detailed_cache[word_clean] = {"val": master_cache[column_name][word_clean], "src": "Memory_Cache", "conf": "HIGH"}
|
| 281 |
source_counts["Memory_Cache"] += 1
|
| 282 |
update_match_postfix(progress, source_counts)
|
| 283 |
continue
|
| 284 |
|
| 285 |
+
# Exact/rule matches are trusted and avoid LLM calls.
|
| 286 |
if is_dict_mode:
|
| 287 |
exact = get_dict_exact_match(word, combined_dict)
|
| 288 |
else:
|
|
|
|
| 295 |
update_match_postfix(progress, source_counts)
|
| 296 |
continue
|
| 297 |
|
|
|
|
| 298 |
if is_dict_mode:
|
| 299 |
suggested_match = get_dict_rule_match(word, combined_dict)
|
| 300 |
else:
|
|
|
|
| 306 |
update_match_postfix(progress, source_counts)
|
| 307 |
continue
|
| 308 |
|
| 309 |
+
# Last resort: send only the top reference candidates to Groq.
|
| 310 |
candidates = []
|
| 311 |
if is_dict_mode:
|
| 312 |
cand_keys = get_top_candidates(model, word, combined_valid_targets, reference_embeddings)
|
|
|
|
| 315 |
else:
|
| 316 |
candidates = get_top_candidates(model, word, combined_valid_targets, reference_embeddings)
|
| 317 |
|
|
|
|
| 318 |
ans_val, src, conf = groq_router.ask_judge(word, candidates, column_name)
|
| 319 |
source_counts[src] += 1
|
| 320 |
update_match_postfix(progress, source_counts)
|
| 321 |
|
| 322 |
+
# Re-check Groq output against refs so canonical casing/names are preserved.
|
| 323 |
if "API_Error" not in conf and ans_val != "UNKNOWN" and ans_val != "LLM_Failed":
|
| 324 |
llm_parts = [p.strip() for p in ans_val.split(",")]
|
| 325 |
corrected_parts = []
|
|
|
|
| 338 |
corrected_parts.append(part)
|
| 339 |
all_matched = False
|
| 340 |
else:
|
|
|
|
| 341 |
exact_match = next((c for c in candidates if c.lower() == part.lower()), None)
|
| 342 |
if exact_match:
|
| 343 |
corrected_parts.append(exact_match)
|
| 344 |
else:
|
|
|
|
| 345 |
rule_match = get_deterministic_match(part, candidates)
|
| 346 |
if rule_match:
|
| 347 |
corrected_parts.append(rule_match)
|
| 348 |
else:
|
| 349 |
+
# Keep unverifiable LLM text, but do not upgrade confidence.
|
| 350 |
corrected_parts.append(part)
|
| 351 |
all_matched = False
|
| 352 |
|
|
|
|
| 353 |
unique_parts = list(dict.fromkeys(corrected_parts))
|
| 354 |
|
|
|
|
| 355 |
ans_val = ", ".join(unique_parts)
|
| 356 |
|
| 357 |
raw_parts_for_check = [
|
|
|
|
| 372 |
|
| 373 |
detailed_cache[word_clean] = {"val": ans_val, "src": src, "conf": conf}
|
| 374 |
|
| 375 |
+
# Reconstruct full cell values in original row order for workbook injection.
|
| 376 |
for idx, row in df.iterrows():
|
| 377 |
cell_val = row[column_name]
|
| 378 |
if pd.isna(cell_val): continue
|
|
|
|
| 386 |
while i < len(raw_parts):
|
| 387 |
curr = raw_parts[i]
|
| 388 |
|
| 389 |
+
# Recover obvious accidental splits such as "University of, Manchester".
|
| 390 |
if i + 1 < len(raw_parts):
|
| 391 |
combo_clean = match_cache_key(column_name, f"{curr}, {raw_parts[i+1]}")
|
| 392 |
if combo_clean in detailed_cache:
|
|
|
|
| 412 |
final_stitched_val = ", ".join(cleaned_parts)
|
| 413 |
df.at[idx, f'Cleaned_{column_name}'] = final_stitched_val
|
| 414 |
|
| 415 |
+
# Review every changed cell and every low/medium-confidence result.
|
| 416 |
if str(cell_val).strip() != final_stitched_val.strip() or lowest_conf != "HIGH":
|
| 417 |
blueprint_data.append({
|
| 418 |
"Row_Index": idx + 3,
|
src/llm_router.py
CHANGED
|
@@ -3,9 +3,13 @@ import time
|
|
| 3 |
from tqdm import tqdm
|
| 4 |
from tenacity import retry, wait_exponential, stop_after_attempt, retry_if_exception_type
|
| 5 |
|
| 6 |
-
class RateLimitException(Exception):
|
|
|
|
|
|
|
| 7 |
|
| 8 |
class GroqRouter:
|
|
|
|
|
|
|
| 9 |
def __init__(self, api_key, available_models):
|
| 10 |
self.api_key = api_key
|
| 11 |
self.available_models = available_models
|
|
@@ -13,6 +17,7 @@ class GroqRouter:
|
|
| 13 |
self.last_printed_model = None
|
| 14 |
|
| 15 |
def ask_judge(self, word, candidates, column_name):
|
|
|
|
| 16 |
if self.current_model_index >= len(self.available_models):
|
| 17 |
return (word, "API_Error_All_Models_Dead", "LOW")
|
| 18 |
|
|
@@ -21,6 +26,8 @@ class GroqRouter:
|
|
| 21 |
|
| 22 |
headers = {"Authorization": f"Bearer {self.api_key}", "Content-Type": "application/json"}
|
| 23 |
|
|
|
|
|
|
|
| 24 |
if column_name in ["Institution", "Degree"]:
|
| 25 |
specific_rules = (
|
| 26 |
"- Split distinct separate schools or global alliances with a comma (e.g., 'Harvard & MIT' -> 'Harvard University, MIT').\n"
|
|
@@ -66,7 +73,6 @@ class GroqRouter:
|
|
| 66 |
"max_tokens": 50
|
| 67 |
}
|
| 68 |
|
| 69 |
-
# --- SIMPLIFIED RETRY LOGIC ---
|
| 70 |
@retry(
|
| 71 |
retry=retry_if_exception_type(RateLimitException),
|
| 72 |
wait=wait_exponential(multiplier=2, min=2, max=30),
|
|
@@ -74,6 +80,7 @@ class GroqRouter:
|
|
| 74 |
reraise=True
|
| 75 |
)
|
| 76 |
def fire_request():
|
|
|
|
| 77 |
res = requests.post("https://api.groq.com/openai/v1/chat/completions", headers=headers, json=payload, timeout=30)
|
| 78 |
|
| 79 |
if res.status_code == 429:
|
|
@@ -90,6 +97,7 @@ class GroqRouter:
|
|
| 90 |
self.last_printed_model = active_model
|
| 91 |
|
| 92 |
try:
|
|
|
|
| 93 |
time.sleep(0.3)
|
| 94 |
response = fire_request()
|
| 95 |
|
|
@@ -106,6 +114,7 @@ class GroqRouter:
|
|
| 106 |
except RateLimitException:
|
| 107 |
tqdm.write(f" [!] Limits exhausted for {active_model}!")
|
| 108 |
|
|
|
|
| 109 |
self.current_model_index += 1
|
| 110 |
|
| 111 |
if self.current_model_index < len(self.available_models):
|
|
|
|
| 3 |
from tqdm import tqdm
|
| 4 |
from tenacity import retry, wait_exponential, stop_after_attempt, retry_if_exception_type
|
| 5 |
|
| 6 |
+
class RateLimitException(Exception):
|
| 7 |
+
"""Raised when a Groq model hits rate limits and should be rotated."""
|
| 8 |
+
pass
|
| 9 |
|
| 10 |
class GroqRouter:
|
| 11 |
+
"""Small Groq client that rotates through configured fallback models."""
|
| 12 |
+
|
| 13 |
def __init__(self, api_key, available_models):
|
| 14 |
self.api_key = api_key
|
| 15 |
self.available_models = available_models
|
|
|
|
| 17 |
self.last_printed_model = None
|
| 18 |
|
| 19 |
def ask_judge(self, word, candidates, column_name):
|
| 20 |
+
"""Ask Groq to normalize one raw value against likely candidates."""
|
| 21 |
if self.current_model_index >= len(self.available_models):
|
| 22 |
return (word, "API_Error_All_Models_Dead", "LOW")
|
| 23 |
|
|
|
|
| 26 |
|
| 27 |
headers = {"Authorization": f"Bearer {self.api_key}", "Content-Type": "application/json"}
|
| 28 |
|
| 29 |
+
# Column-specific prompt rules prevent the model from over-splitting
|
| 30 |
+
# institutions while still translating geography values to English.
|
| 31 |
if column_name in ["Institution", "Degree"]:
|
| 32 |
specific_rules = (
|
| 33 |
"- Split distinct separate schools or global alliances with a comma (e.g., 'Harvard & MIT' -> 'Harvard University, MIT').\n"
|
|
|
|
| 73 |
"max_tokens": 50
|
| 74 |
}
|
| 75 |
|
|
|
|
| 76 |
@retry(
|
| 77 |
retry=retry_if_exception_type(RateLimitException),
|
| 78 |
wait=wait_exponential(multiplier=2, min=2, max=30),
|
|
|
|
| 80 |
reraise=True
|
| 81 |
)
|
| 82 |
def fire_request():
|
| 83 |
+
"""Fire one request; tenacity retries only explicit rate-limit errors."""
|
| 84 |
res = requests.post("https://api.groq.com/openai/v1/chat/completions", headers=headers, json=payload, timeout=30)
|
| 85 |
|
| 86 |
if res.status_code == 429:
|
|
|
|
| 97 |
self.last_printed_model = active_model
|
| 98 |
|
| 99 |
try:
|
| 100 |
+
# Light throttling reduces avoidable rate-limit pressure.
|
| 101 |
time.sleep(0.3)
|
| 102 |
response = fire_request()
|
| 103 |
|
|
|
|
| 114 |
except RateLimitException:
|
| 115 |
tqdm.write(f" [!] Limits exhausted for {active_model}!")
|
| 116 |
|
| 117 |
+
# Move to the next configured model and keep processing.
|
| 118 |
self.current_model_index += 1
|
| 119 |
|
| 120 |
if self.current_model_index < len(self.available_models):
|
src/process_runner.py
CHANGED
|
@@ -10,6 +10,7 @@ ACTIVE_PROCESSES = {}
|
|
| 10 |
|
| 11 |
|
| 12 |
def stop_process(job_id: str) -> bool:
|
|
|
|
| 13 |
process = ACTIVE_PROCESSES.get(job_id)
|
| 14 |
if not process or process.poll() is not None:
|
| 15 |
return False
|
|
@@ -26,6 +27,7 @@ def stop_process(job_id: str) -> bool:
|
|
| 26 |
|
| 27 |
|
| 28 |
def stream_process(command, cwd: Path, job_id=None):
|
|
|
|
| 29 |
env = os.environ.copy()
|
| 30 |
env["PYTHONUNBUFFERED"] = "1"
|
| 31 |
popen_kwargs = {
|
|
@@ -36,6 +38,7 @@ def stream_process(command, cwd: Path, job_id=None):
|
|
| 36 |
"env": env,
|
| 37 |
}
|
| 38 |
if os.name == "nt":
|
|
|
|
| 39 |
popen_kwargs["creationflags"] = subprocess.CREATE_NEW_PROCESS_GROUP
|
| 40 |
|
| 41 |
process = subprocess.Popen(
|
|
@@ -43,6 +46,7 @@ def stream_process(command, cwd: Path, job_id=None):
|
|
| 43 |
**popen_kwargs,
|
| 44 |
)
|
| 45 |
if job_id:
|
|
|
|
| 46 |
ACTIVE_PROCESSES[job_id] = process
|
| 47 |
try:
|
| 48 |
assert process.stdout is not None
|
|
@@ -59,6 +63,8 @@ def stream_process(command, cwd: Path, job_id=None):
|
|
| 59 |
trailing_chunk = decoder.decode(b"", final=True)
|
| 60 |
if trailing_chunk:
|
| 61 |
yield f"data: {json.dumps(trailing_chunk)}\n\n"
|
|
|
|
|
|
|
| 62 |
yield f"data: {json.dumps(chr(10) + f'Process exited with code {exit_code}' + chr(10))}\n\n"
|
| 63 |
event_name = "done" if exit_code == 0 else "failed"
|
| 64 |
yield f"event: {event_name}\ndata: {{}}\n\n"
|
|
|
|
| 10 |
|
| 11 |
|
| 12 |
def stop_process(job_id: str) -> bool:
|
| 13 |
+
"""Stop a tracked subprocess by frontend job id."""
|
| 14 |
process = ACTIVE_PROCESSES.get(job_id)
|
| 15 |
if not process or process.poll() is not None:
|
| 16 |
return False
|
|
|
|
| 27 |
|
| 28 |
|
| 29 |
def stream_process(command, cwd: Path, job_id=None):
|
| 30 |
+
"""Run a command and yield stdout/stderr as server-sent event chunks."""
|
| 31 |
env = os.environ.copy()
|
| 32 |
env["PYTHONUNBUFFERED"] = "1"
|
| 33 |
popen_kwargs = {
|
|
|
|
| 38 |
"env": env,
|
| 39 |
}
|
| 40 |
if os.name == "nt":
|
| 41 |
+
# Required so CTRL_BREAK can stop child Python processes on Windows.
|
| 42 |
popen_kwargs["creationflags"] = subprocess.CREATE_NEW_PROCESS_GROUP
|
| 43 |
|
| 44 |
process = subprocess.Popen(
|
|
|
|
| 46 |
**popen_kwargs,
|
| 47 |
)
|
| 48 |
if job_id:
|
| 49 |
+
# The UI can later call /stop with this id.
|
| 50 |
ACTIVE_PROCESSES[job_id] = process
|
| 51 |
try:
|
| 52 |
assert process.stdout is not None
|
|
|
|
| 63 |
trailing_chunk = decoder.decode(b"", final=True)
|
| 64 |
if trailing_chunk:
|
| 65 |
yield f"data: {json.dumps(trailing_chunk)}\n\n"
|
| 66 |
+
|
| 67 |
+
# The frontend distinguishes a real success from a crashed subprocess.
|
| 68 |
yield f"data: {json.dumps(chr(10) + f'Process exited with code {exit_code}' + chr(10))}\n\n"
|
| 69 |
event_name = "done" if exit_code == 0 else "failed"
|
| 70 |
yield f"event: {event_name}\ndata: {{}}\n\n"
|
src/utils.py
CHANGED
|
@@ -5,6 +5,7 @@ import unicodedata
|
|
| 5 |
from src.config import HF_TOKEN, SPACE_ID
|
| 6 |
|
| 7 |
def strip_degrees_for_search(text):
|
|
|
|
| 8 |
if not isinstance(text, str): return text
|
| 9 |
degree_pattern = r'\b(MSc|MBA|BBA|BSc|Ph\.?D\.?|BA|MA|BS|MS|EMBA|Master|Bachelor|Masters|Bachelors|Licence)\b'
|
| 10 |
cleaned = re.sub(degree_pattern, '', text, flags=re.IGNORECASE)
|
|
@@ -14,6 +15,7 @@ def strip_degrees_for_search(text):
|
|
| 14 |
return cleaned
|
| 15 |
|
| 16 |
def smart_format(text):
|
|
|
|
| 17 |
if not isinstance(text, str): return text
|
| 18 |
res = text.title()
|
| 19 |
acronyms = ['Ma', 'Ba', 'Mba', 'Bba', 'Hr', 'It', 'Bs', 'Ms', 'Phd', 'Bsc', 'Msc', 'Llm', 'Pge', 'Cems']
|
|
@@ -24,6 +26,7 @@ def smart_format(text):
|
|
| 24 |
return res.strip()
|
| 25 |
|
| 26 |
def clean_degree_text(text):
|
|
|
|
| 27 |
if not isinstance(text, str): return ""
|
| 28 |
text = re.sub(r'\band\b', '&', text, flags=re.IGNORECASE)
|
| 29 |
text = re.sub(r'\bet\b', '&', text, flags=re.IGNORECASE)
|
|
@@ -32,14 +35,17 @@ def clean_degree_text(text):
|
|
| 32 |
return smart_format(text)
|
| 33 |
|
| 34 |
def normalize_text(text):
|
|
|
|
| 35 |
if not isinstance(text, str): return ""
|
| 36 |
normalized = ''.join(c for c in unicodedata.normalize('NFD', text) if unicodedata.category(c) != 'Mn')
|
| 37 |
return normalized.strip().lower()
|
| 38 |
|
| 39 |
def normalize_ref(value):
|
|
|
|
| 40 |
return normalize_text(str(value))
|
| 41 |
|
| 42 |
def iter_ref_values(ref_data):
|
|
|
|
| 43 |
if isinstance(ref_data, dict):
|
| 44 |
yield from (item for item in ref_data.keys() if isinstance(item, str))
|
| 45 |
yield from (item for item in ref_data.values() if isinstance(item, str))
|
|
@@ -47,10 +53,12 @@ def iter_ref_values(ref_data):
|
|
| 47 |
yield from (item for item in ref_data if isinstance(item, str))
|
| 48 |
|
| 49 |
def ref_contains(ref_data, value):
|
|
|
|
| 50 |
needle = normalize_ref(value)
|
| 51 |
return any(normalize_ref(item) == needle for item in iter_ref_values(ref_data))
|
| 52 |
|
| 53 |
def prune_manual_refs_against_official(manual_refs, official_refs):
|
|
|
|
| 54 |
removed_count = 0
|
| 55 |
|
| 56 |
for column_name, manual_bucket in list(manual_refs.items()):
|
|
@@ -100,6 +108,7 @@ def prune_manual_refs_against_official(manual_refs, official_refs):
|
|
| 100 |
MANUAL_REFERENCES_REPO_PATH = "refdata/manual_references.json"
|
| 101 |
|
| 102 |
def reference_sync_status():
|
|
|
|
| 103 |
if not SPACE_ID:
|
| 104 |
return {
|
| 105 |
"enabled": False,
|
|
@@ -121,6 +130,7 @@ def reference_sync_status():
|
|
| 121 |
}
|
| 122 |
|
| 123 |
def save_manual_references_to_hub(app_root: Path):
|
|
|
|
| 124 |
status = reference_sync_status()
|
| 125 |
if not status["enabled"]:
|
| 126 |
raise RuntimeError(status["reason"])
|
|
|
|
| 5 |
from src.config import HF_TOKEN, SPACE_ID
|
| 6 |
|
| 7 |
def strip_degrees_for_search(text):
|
| 8 |
+
"""Remove common degree words before matching institution names."""
|
| 9 |
if not isinstance(text, str): return text
|
| 10 |
degree_pattern = r'\b(MSc|MBA|BBA|BSc|Ph\.?D\.?|BA|MA|BS|MS|EMBA|Master|Bachelor|Masters|Bachelors|Licence)\b'
|
| 11 |
cleaned = re.sub(degree_pattern, '', text, flags=re.IGNORECASE)
|
|
|
|
| 15 |
return cleaned
|
| 16 |
|
| 17 |
def smart_format(text):
|
| 18 |
+
"""Title-case free text while preserving common academic/business acronyms."""
|
| 19 |
if not isinstance(text, str): return text
|
| 20 |
res = text.title()
|
| 21 |
acronyms = ['Ma', 'Ba', 'Mba', 'Bba', 'Hr', 'It', 'Bs', 'Ms', 'Phd', 'Bsc', 'Msc', 'Llm', 'Pge', 'Cems']
|
|
|
|
| 26 |
return res.strip()
|
| 27 |
|
| 28 |
def clean_degree_text(text):
|
| 29 |
+
"""Normalize degree titles before within-school clustering."""
|
| 30 |
if not isinstance(text, str): return ""
|
| 31 |
text = re.sub(r'\band\b', '&', text, flags=re.IGNORECASE)
|
| 32 |
text = re.sub(r'\bet\b', '&', text, flags=re.IGNORECASE)
|
|
|
|
| 35 |
return smart_format(text)
|
| 36 |
|
| 37 |
def normalize_text(text):
|
| 38 |
+
"""Normalize text for accent-insensitive, case-insensitive comparisons."""
|
| 39 |
if not isinstance(text, str): return ""
|
| 40 |
normalized = ''.join(c for c in unicodedata.normalize('NFD', text) if unicodedata.category(c) != 'Mn')
|
| 41 |
return normalized.strip().lower()
|
| 42 |
|
| 43 |
def normalize_ref(value):
|
| 44 |
+
"""Normalize a reference value or alias for dictionary/set lookups."""
|
| 45 |
return normalize_text(str(value))
|
| 46 |
|
| 47 |
def iter_ref_values(ref_data):
|
| 48 |
+
"""Yield all searchable strings from list-style or dict-style references."""
|
| 49 |
if isinstance(ref_data, dict):
|
| 50 |
yield from (item for item in ref_data.keys() if isinstance(item, str))
|
| 51 |
yield from (item for item in ref_data.values() if isinstance(item, str))
|
|
|
|
| 53 |
yield from (item for item in ref_data if isinstance(item, str))
|
| 54 |
|
| 55 |
def ref_contains(ref_data, value):
|
| 56 |
+
"""Return whether a reference bucket already contains a value/alias."""
|
| 57 |
needle = normalize_ref(value)
|
| 58 |
return any(normalize_ref(item) == needle for item in iter_ref_values(ref_data))
|
| 59 |
|
| 60 |
def prune_manual_refs_against_official(manual_refs, official_refs):
|
| 61 |
+
"""Remove manual values that are duplicates of official references."""
|
| 62 |
removed_count = 0
|
| 63 |
|
| 64 |
for column_name, manual_bucket in list(manual_refs.items()):
|
|
|
|
| 108 |
MANUAL_REFERENCES_REPO_PATH = "refdata/manual_references.json"
|
| 109 |
|
| 110 |
def reference_sync_status():
|
| 111 |
+
"""Report whether the app can commit manual refs back to Hugging Face."""
|
| 112 |
if not SPACE_ID:
|
| 113 |
return {
|
| 114 |
"enabled": False,
|
|
|
|
| 130 |
}
|
| 131 |
|
| 132 |
def save_manual_references_to_hub(app_root: Path):
|
| 133 |
+
"""Commit the current manual references file back to the Space repository."""
|
| 134 |
status = reference_sync_status()
|
| 135 |
if not status["enabled"]:
|
| 136 |
raise RuntimeError(status["reason"])
|
src/workbook_io.py
CHANGED
|
@@ -9,6 +9,7 @@ ALLOWED_EXCEL_EXTENSIONS = (".xlsx", ".xlsm")
|
|
| 9 |
|
| 10 |
|
| 11 |
def save_uploaded_excel(uploaded, upload_dir: Path):
|
|
|
|
| 12 |
if not uploaded or not uploaded.filename:
|
| 13 |
raise ValueError("No file uploaded.")
|
| 14 |
|
|
@@ -18,6 +19,7 @@ def save_uploaded_excel(uploaded, upload_dir: Path):
|
|
| 18 |
|
| 19 |
stem = Path(filename).stem
|
| 20 |
suffix = Path(filename).suffix
|
|
|
|
| 21 |
saved_filename = f"{stem}_{uuid.uuid4().hex[:8]}{suffix}"
|
| 22 |
destination = upload_dir / saved_filename
|
| 23 |
uploaded.save(destination)
|
|
@@ -25,6 +27,7 @@ def save_uploaded_excel(uploaded, upload_dir: Path):
|
|
| 25 |
|
| 26 |
|
| 27 |
def read_workbook_sheets(path: Path) -> list[str]:
|
|
|
|
| 28 |
workbook = openpyxl.load_workbook(path, read_only=True, data_only=False)
|
| 29 |
try:
|
| 30 |
return workbook.sheetnames
|
|
@@ -33,6 +36,7 @@ def read_workbook_sheets(path: Path) -> list[str]:
|
|
| 33 |
|
| 34 |
|
| 35 |
def resolve_allowed_path(raw_path: str, app_root: Path, allowed_roots: list[Path]) -> Path:
|
|
|
|
| 36 |
if not raw_path:
|
| 37 |
raise ValueError("Path is required.")
|
| 38 |
|
|
@@ -42,6 +46,7 @@ def resolve_allowed_path(raw_path: str, app_root: Path, allowed_roots: list[Path
|
|
| 42 |
|
| 43 |
resolved = candidate.resolve()
|
| 44 |
allowed = [root.resolve() for root in allowed_roots]
|
|
|
|
| 45 |
if not any(resolved == root or resolved.is_relative_to(root) for root in allowed):
|
| 46 |
raise ValueError("Path is outside the application data directory.")
|
| 47 |
|
|
|
|
| 9 |
|
| 10 |
|
| 11 |
def save_uploaded_excel(uploaded, upload_dir: Path):
|
| 12 |
+
"""Validate and save an uploaded Excel file with a collision-safe name."""
|
| 13 |
if not uploaded or not uploaded.filename:
|
| 14 |
raise ValueError("No file uploaded.")
|
| 15 |
|
|
|
|
| 19 |
|
| 20 |
stem = Path(filename).stem
|
| 21 |
suffix = Path(filename).suffix
|
| 22 |
+
# UUID suffixes keep simultaneous users from overwriting each other.
|
| 23 |
saved_filename = f"{stem}_{uuid.uuid4().hex[:8]}{suffix}"
|
| 24 |
destination = upload_dir / saved_filename
|
| 25 |
uploaded.save(destination)
|
|
|
|
| 27 |
|
| 28 |
|
| 29 |
def read_workbook_sheets(path: Path) -> list[str]:
|
| 30 |
+
"""Read sheet names without loading workbook cell data into memory."""
|
| 31 |
workbook = openpyxl.load_workbook(path, read_only=True, data_only=False)
|
| 32 |
try:
|
| 33 |
return workbook.sheetnames
|
|
|
|
| 36 |
|
| 37 |
|
| 38 |
def resolve_allowed_path(raw_path: str, app_root: Path, allowed_roots: list[Path]) -> Path:
|
| 39 |
+
"""Resolve a user-supplied path and ensure it stays inside allowed roots."""
|
| 40 |
if not raw_path:
|
| 41 |
raise ValueError("Path is required.")
|
| 42 |
|
|
|
|
| 46 |
|
| 47 |
resolved = candidate.resolve()
|
| 48 |
allowed = [root.resolve() for root in allowed_roots]
|
| 49 |
+
# Prevent download/apply endpoints from reading arbitrary server files.
|
| 50 |
if not any(resolved == root or resolved.is_relative_to(root) for root in allowed):
|
| 51 |
raise ValueError("Path is outside the application data directory.")
|
| 52 |
|
ui_app.py
CHANGED
|
@@ -23,6 +23,8 @@ from src.workbook_io import read_workbook_sheets, resolve_allowed_path, save_upl
|
|
| 23 |
APP_ROOT = Path(__file__).resolve().parent
|
| 24 |
UPLOAD_DIR = APP_ROOT / DATA_DIR / "uploads"
|
| 25 |
UPLOAD_DIR.mkdir(parents=True, exist_ok=True)
|
|
|
|
|
|
|
| 26 |
ALLOWED_FILE_ROOTS = [APP_ROOT / DATA_DIR]
|
| 27 |
|
| 28 |
app = Flask(
|
|
@@ -31,7 +33,7 @@ app = Flask(
|
|
| 31 |
static_folder=str(APP_ROOT / "ui" / "static"),
|
| 32 |
)
|
| 33 |
app.config["MAX_CONTENT_LENGTH"] = 100 * 1024 * 1024
|
| 34 |
-
app.secret_key = os.getenv("APP_SECRET_KEY", "
|
| 35 |
|
| 36 |
DEFAULT_STATE = {
|
| 37 |
"clean_path": "",
|
|
@@ -50,6 +52,7 @@ DEFAULT_STATE = {
|
|
| 50 |
|
| 51 |
|
| 52 |
def fresh_state():
|
|
|
|
| 53 |
return {
|
| 54 |
key: list(value) if isinstance(value, list) else value
|
| 55 |
for key, value in DEFAULT_STATE.items()
|
|
@@ -57,16 +60,19 @@ def fresh_state():
|
|
| 57 |
|
| 58 |
|
| 59 |
def get_state():
|
|
|
|
| 60 |
if "ui_state" not in session:
|
| 61 |
session["ui_state"] = fresh_state()
|
| 62 |
return session["ui_state"]
|
| 63 |
|
| 64 |
|
| 65 |
def mark_state_changed():
|
|
|
|
| 66 |
session.modified = True
|
| 67 |
|
| 68 |
|
| 69 |
def auth_required_response() -> Response:
|
|
|
|
| 70 |
return Response(
|
| 71 |
"Authentication required",
|
| 72 |
401,
|
|
@@ -75,15 +81,17 @@ def auth_required_response() -> Response:
|
|
| 75 |
|
| 76 |
|
| 77 |
def missing_auth_config_response() -> Response:
|
|
|
|
| 78 |
return Response(
|
| 79 |
-
"
|
| 80 |
503,
|
| 81 |
)
|
| 82 |
|
| 83 |
|
| 84 |
@app.before_request
|
| 85 |
def require_basic_auth():
|
| 86 |
-
|
|
|
|
| 87 |
if SPACE_ID:
|
| 88 |
return missing_auth_config_response()
|
| 89 |
return None
|
|
@@ -110,6 +118,7 @@ def prevent_browser_cache(response):
|
|
| 110 |
|
| 111 |
|
| 112 |
def default_models() -> str:
|
|
|
|
| 113 |
preferred_model_ids = {model.lower() for model in PREFERRED_PRODUCTION_CHAT_MODELS}
|
| 114 |
env_preferred_models = [
|
| 115 |
model
|
|
@@ -120,6 +129,7 @@ def default_models() -> str:
|
|
| 120 |
|
| 121 |
|
| 122 |
def render_page(message: str = "", error: str = ""):
|
|
|
|
| 123 |
state = get_state()
|
| 124 |
if state["clean_sheets"]:
|
| 125 |
state["clean_selected_sheet"] = pick_sheet(state["clean_sheets"], state["clean_selected_sheet"], state)
|
|
@@ -139,6 +149,7 @@ def render_page(message: str = "", error: str = ""):
|
|
| 139 |
|
| 140 |
|
| 141 |
def can_apply_blueprint() -> bool:
|
|
|
|
| 142 |
state = get_state()
|
| 143 |
return bool(
|
| 144 |
state["apply_workbook_path"]
|
|
@@ -149,10 +160,12 @@ def can_apply_blueprint() -> bool:
|
|
| 149 |
|
| 150 |
|
| 151 |
def wants_json_response() -> bool:
|
|
|
|
| 152 |
return "application/json" in request.headers.get("Accept", "")
|
| 153 |
|
| 154 |
|
| 155 |
def ui_state_payload(message: str = "", error: str = ""):
|
|
|
|
| 156 |
state = get_state()
|
| 157 |
return {
|
| 158 |
"message": message,
|
|
@@ -168,6 +181,7 @@ def ui_state_payload(message: str = "", error: str = ""):
|
|
| 168 |
|
| 169 |
|
| 170 |
def pick_sheet(sheets, preferred_sheet=None, state=None):
|
|
|
|
| 171 |
state = state or get_state()
|
| 172 |
if preferred_sheet and preferred_sheet in sheets:
|
| 173 |
return preferred_sheet
|
|
@@ -177,6 +191,7 @@ def pick_sheet(sheets, preferred_sheet=None, state=None):
|
|
| 177 |
|
| 178 |
|
| 179 |
def update_ui_state_from_form(form):
|
|
|
|
| 180 |
state = get_state()
|
| 181 |
state["clean_selected_sheet"] = form.get("clean_selected_sheet") or state["clean_selected_sheet"]
|
| 182 |
state["output_sheet"] = form.get("output_sheet") or state["output_sheet"] or DEFAULT_OUTPUT_SHEET_NAME
|
|
@@ -200,6 +215,7 @@ def prepare_clean():
|
|
| 200 |
except Exception as exc:
|
| 201 |
return render_page(error=str(exc))
|
| 202 |
|
|
|
|
| 203 |
state["clean_path"] = str(path)
|
| 204 |
state["clean_filename"] = filename
|
| 205 |
state["clean_sheets"] = sheets
|
|
@@ -214,6 +230,7 @@ def prepare_clean():
|
|
| 214 |
|
| 215 |
@app.route("/remove-clean", methods=["POST"])
|
| 216 |
def remove_clean():
|
|
|
|
| 217 |
update_ui_state_from_form(request.form)
|
| 218 |
state = get_state()
|
| 219 |
old_path = state["clean_path"]
|
|
@@ -232,6 +249,7 @@ def remove_clean():
|
|
| 232 |
|
| 233 |
@app.route("/prepare-apply-workbook", methods=["POST"])
|
| 234 |
def prepare_apply_workbook():
|
|
|
|
| 235 |
try:
|
| 236 |
update_ui_state_from_form(request.form)
|
| 237 |
state = get_state()
|
|
@@ -254,6 +272,7 @@ def prepare_apply_workbook():
|
|
| 254 |
|
| 255 |
@app.route("/prepare-apply-blueprint", methods=["POST"])
|
| 256 |
def prepare_apply_blueprint():
|
|
|
|
| 257 |
try:
|
| 258 |
update_ui_state_from_form(request.form)
|
| 259 |
state = get_state()
|
|
@@ -276,6 +295,7 @@ def prepare_apply_blueprint():
|
|
| 276 |
|
| 277 |
@app.route("/models")
|
| 278 |
def models_endpoint():
|
|
|
|
| 279 |
try:
|
| 280 |
models = select_groq_chat_models(limit=len(PREFERRED_PRODUCTION_CHAT_MODELS))
|
| 281 |
except Exception as exc:
|
|
@@ -285,11 +305,13 @@ def models_endpoint():
|
|
| 285 |
|
| 286 |
@app.route("/references/status")
|
| 287 |
def references_status():
|
|
|
|
| 288 |
return jsonify(reference_sync_status())
|
| 289 |
|
| 290 |
|
| 291 |
@app.route("/references/save", methods=["POST"])
|
| 292 |
def save_references():
|
|
|
|
| 293 |
try:
|
| 294 |
result = save_manual_references_to_hub(APP_ROOT)
|
| 295 |
except Exception as exc:
|
|
@@ -299,6 +321,7 @@ def save_references():
|
|
| 299 |
|
| 300 |
@app.route("/sheets")
|
| 301 |
def sheets_endpoint():
|
|
|
|
| 302 |
try:
|
| 303 |
workbook_path = resolve_allowed_path(request.args.get("path", ""), APP_ROOT, ALLOWED_FILE_ROOTS)
|
| 304 |
if not workbook_path.is_file():
|
|
@@ -310,6 +333,7 @@ def sheets_endpoint():
|
|
| 310 |
|
| 311 |
@app.route("/download-blueprint")
|
| 312 |
def download_blueprint():
|
|
|
|
| 313 |
state = get_state()
|
| 314 |
requested_path = request.args.get("path") or state["apply_blueprint_path"]
|
| 315 |
if not requested_path:
|
|
@@ -322,6 +346,7 @@ def download_blueprint():
|
|
| 322 |
|
| 323 |
@app.route("/download-cleaned-workbook")
|
| 324 |
def download_cleaned_workbook():
|
|
|
|
| 325 |
state = get_state()
|
| 326 |
requested_path = request.args.get("path") or state["clean_path"]
|
| 327 |
if not requested_path:
|
|
@@ -339,6 +364,7 @@ def download_cleaned_workbook():
|
|
| 339 |
|
| 340 |
@app.route("/download-applied-workbook")
|
| 341 |
def download_applied_workbook():
|
|
|
|
| 342 |
state = get_state()
|
| 343 |
requested_path = request.args.get("path") or state["apply_workbook_path"]
|
| 344 |
if not requested_path:
|
|
@@ -356,6 +382,7 @@ def download_applied_workbook():
|
|
| 356 |
|
| 357 |
@app.route("/run")
|
| 358 |
def run():
|
|
|
|
| 359 |
job_id = request.args.get("job_id", uuid.uuid4().hex)
|
| 360 |
input_path = request.args.get("input", "")
|
| 361 |
sheet = request.args.get("sheet", "")
|
|
@@ -370,6 +397,7 @@ def run():
|
|
| 370 |
except ValueError as exc:
|
| 371 |
return jsonify({"error": str(exc)}), 400
|
| 372 |
|
|
|
|
| 373 |
blueprint_path = UPLOAD_DIR / f"Blueprint_{job_id}.xlsx"
|
| 374 |
state = get_state()
|
| 375 |
state["apply_blueprint_path"] = str(blueprint_path)
|
|
@@ -397,6 +425,7 @@ def run():
|
|
| 397 |
|
| 398 |
@app.route("/stop", methods=["POST"])
|
| 399 |
def stop():
|
|
|
|
| 400 |
job_id = request.args.get("job_id", "")
|
| 401 |
if not stop_process(job_id):
|
| 402 |
return jsonify({"stopped": False, "message": "No active run found."}), 404
|
|
@@ -406,6 +435,7 @@ def stop():
|
|
| 406 |
|
| 407 |
@app.route("/apply")
|
| 408 |
def apply_blueprint():
|
|
|
|
| 409 |
input_path = request.args.get("input", "")
|
| 410 |
blueprint_path = request.args.get("blueprint", "")
|
| 411 |
sheet = request.args.get("sheet", DEFAULT_OUTPUT_SHEET_NAME)
|
|
|
|
| 23 |
APP_ROOT = Path(__file__).resolve().parent
|
| 24 |
UPLOAD_DIR = APP_ROOT / DATA_DIR / "uploads"
|
| 25 |
UPLOAD_DIR.mkdir(parents=True, exist_ok=True)
|
| 26 |
+
|
| 27 |
+
# Download/apply routes only accept files inside data/ to avoid arbitrary file reads.
|
| 28 |
ALLOWED_FILE_ROOTS = [APP_ROOT / DATA_DIR]
|
| 29 |
|
| 30 |
app = Flask(
|
|
|
|
| 33 |
static_folder=str(APP_ROOT / "ui" / "static"),
|
| 34 |
)
|
| 35 |
app.config["MAX_CONTENT_LENGTH"] = 100 * 1024 * 1024
|
| 36 |
+
app.secret_key = os.getenv("APP_SECRET_KEY", "local-dev-secret")
|
| 37 |
|
| 38 |
DEFAULT_STATE = {
|
| 39 |
"clean_path": "",
|
|
|
|
| 52 |
|
| 53 |
|
| 54 |
def fresh_state():
|
| 55 |
+
"""Create a clean UI state for one browser session."""
|
| 56 |
return {
|
| 57 |
key: list(value) if isinstance(value, list) else value
|
| 58 |
for key, value in DEFAULT_STATE.items()
|
|
|
|
| 60 |
|
| 61 |
|
| 62 |
def get_state():
|
| 63 |
+
"""Return the current browser's state, creating it on first visit."""
|
| 64 |
if "ui_state" not in session:
|
| 65 |
session["ui_state"] = fresh_state()
|
| 66 |
return session["ui_state"]
|
| 67 |
|
| 68 |
|
| 69 |
def mark_state_changed():
|
| 70 |
+
"""Tell Flask to re-sign the session cookie after nested state edits."""
|
| 71 |
session.modified = True
|
| 72 |
|
| 73 |
|
| 74 |
def auth_required_response() -> Response:
|
| 75 |
+
"""Ask the browser for basic-auth credentials."""
|
| 76 |
return Response(
|
| 77 |
"Authentication required",
|
| 78 |
401,
|
|
|
|
| 81 |
|
| 82 |
|
| 83 |
def missing_auth_config_response() -> Response:
|
| 84 |
+
"""Fail closed on Hugging Face if password protection was not configured."""
|
| 85 |
return Response(
|
| 86 |
+
"App login credentials are not configured.",
|
| 87 |
503,
|
| 88 |
)
|
| 89 |
|
| 90 |
|
| 91 |
@app.before_request
|
| 92 |
def require_basic_auth():
|
| 93 |
+
"""Protect every app route with a shared password when configured."""
|
| 94 |
+
if not APP_PASSWORD or not APP_USERNAME:
|
| 95 |
if SPACE_ID:
|
| 96 |
return missing_auth_config_response()
|
| 97 |
return None
|
|
|
|
| 118 |
|
| 119 |
|
| 120 |
def default_models() -> str:
|
| 121 |
+
"""Prefer configured Groq models, falling back to the curated production list."""
|
| 122 |
preferred_model_ids = {model.lower() for model in PREFERRED_PRODUCTION_CHAT_MODELS}
|
| 123 |
env_preferred_models = [
|
| 124 |
model
|
|
|
|
| 129 |
|
| 130 |
|
| 131 |
def render_page(message: str = "", error: str = ""):
|
| 132 |
+
"""Render the app with state scoped to the current browser session."""
|
| 133 |
state = get_state()
|
| 134 |
if state["clean_sheets"]:
|
| 135 |
state["clean_selected_sheet"] = pick_sheet(state["clean_sheets"], state["clean_selected_sheet"], state)
|
|
|
|
| 149 |
|
| 150 |
|
| 151 |
def can_apply_blueprint() -> bool:
|
| 152 |
+
"""The Apply button requires workbook, blueprint, and target sheet."""
|
| 153 |
state = get_state()
|
| 154 |
return bool(
|
| 155 |
state["apply_workbook_path"]
|
|
|
|
| 160 |
|
| 161 |
|
| 162 |
def wants_json_response() -> bool:
|
| 163 |
+
"""AJAX upload routes ask for JSON; normal form posts render the page."""
|
| 164 |
return "application/json" in request.headers.get("Accept", "")
|
| 165 |
|
| 166 |
|
| 167 |
def ui_state_payload(message: str = "", error: str = ""):
|
| 168 |
+
"""Return just enough state for the frontend to update without a reload."""
|
| 169 |
state = get_state()
|
| 170 |
return {
|
| 171 |
"message": message,
|
|
|
|
| 181 |
|
| 182 |
|
| 183 |
def pick_sheet(sheets, preferred_sheet=None, state=None):
|
| 184 |
+
"""Choose a stable sheet selection when workbooks are uploaded/refreshed."""
|
| 185 |
state = state or get_state()
|
| 186 |
if preferred_sheet and preferred_sheet in sheets:
|
| 187 |
return preferred_sheet
|
|
|
|
| 191 |
|
| 192 |
|
| 193 |
def update_ui_state_from_form(form):
|
| 194 |
+
"""Preserve current UI selections while a file upload request is processed."""
|
| 195 |
state = get_state()
|
| 196 |
state["clean_selected_sheet"] = form.get("clean_selected_sheet") or state["clean_selected_sheet"]
|
| 197 |
state["output_sheet"] = form.get("output_sheet") or state["output_sheet"] or DEFAULT_OUTPUT_SHEET_NAME
|
|
|
|
| 215 |
except Exception as exc:
|
| 216 |
return render_page(error=str(exc))
|
| 217 |
|
| 218 |
+
# The uploaded workbook becomes both the cleaning input and default apply target.
|
| 219 |
state["clean_path"] = str(path)
|
| 220 |
state["clean_filename"] = filename
|
| 221 |
state["clean_sheets"] = sheets
|
|
|
|
| 230 |
|
| 231 |
@app.route("/remove-clean", methods=["POST"])
|
| 232 |
def remove_clean():
|
| 233 |
+
"""Clear the current session's cleaning workbook without touching other sessions."""
|
| 234 |
update_ui_state_from_form(request.form)
|
| 235 |
state = get_state()
|
| 236 |
old_path = state["clean_path"]
|
|
|
|
| 249 |
|
| 250 |
@app.route("/prepare-apply-workbook", methods=["POST"])
|
| 251 |
def prepare_apply_workbook():
|
| 252 |
+
"""AJAX upload for the workbook that will receive Blueprint corrections."""
|
| 253 |
try:
|
| 254 |
update_ui_state_from_form(request.form)
|
| 255 |
state = get_state()
|
|
|
|
| 272 |
|
| 273 |
@app.route("/prepare-apply-blueprint", methods=["POST"])
|
| 274 |
def prepare_apply_blueprint():
|
| 275 |
+
"""AJAX upload for an externally reviewed Blueprint workbook."""
|
| 276 |
try:
|
| 277 |
update_ui_state_from_form(request.form)
|
| 278 |
state = get_state()
|
|
|
|
| 295 |
|
| 296 |
@app.route("/models")
|
| 297 |
def models_endpoint():
|
| 298 |
+
"""Fetch the currently usable Groq fallback model list for the UI."""
|
| 299 |
try:
|
| 300 |
models = select_groq_chat_models(limit=len(PREFERRED_PRODUCTION_CHAT_MODELS))
|
| 301 |
except Exception as exc:
|
|
|
|
| 305 |
|
| 306 |
@app.route("/references/status")
|
| 307 |
def references_status():
|
| 308 |
+
"""Tell the UI whether Hugging Face reference sync can be used."""
|
| 309 |
return jsonify(reference_sync_status())
|
| 310 |
|
| 311 |
|
| 312 |
@app.route("/references/save", methods=["POST"])
|
| 313 |
def save_references():
|
| 314 |
+
"""Commit manual references back to the Space repo when HF sync is configured."""
|
| 315 |
try:
|
| 316 |
result = save_manual_references_to_hub(APP_ROOT)
|
| 317 |
except Exception as exc:
|
|
|
|
| 321 |
|
| 322 |
@app.route("/sheets")
|
| 323 |
def sheets_endpoint():
|
| 324 |
+
"""Return workbook sheet names for dynamic apply-sheet selection."""
|
| 325 |
try:
|
| 326 |
workbook_path = resolve_allowed_path(request.args.get("path", ""), APP_ROOT, ALLOWED_FILE_ROOTS)
|
| 327 |
if not workbook_path.is_file():
|
|
|
|
| 333 |
|
| 334 |
@app.route("/download-blueprint")
|
| 335 |
def download_blueprint():
|
| 336 |
+
"""Download either the session Blueprint or an explicitly requested run file."""
|
| 337 |
state = get_state()
|
| 338 |
requested_path = request.args.get("path") or state["apply_blueprint_path"]
|
| 339 |
if not requested_path:
|
|
|
|
| 346 |
|
| 347 |
@app.route("/download-cleaned-workbook")
|
| 348 |
def download_cleaned_workbook():
|
| 349 |
+
"""Download the cleaned workbook copy for this session/run."""
|
| 350 |
state = get_state()
|
| 351 |
requested_path = request.args.get("path") or state["clean_path"]
|
| 352 |
if not requested_path:
|
|
|
|
| 364 |
|
| 365 |
@app.route("/download-applied-workbook")
|
| 366 |
def download_applied_workbook():
|
| 367 |
+
"""Download the workbook after Blueprint corrections have been applied."""
|
| 368 |
state = get_state()
|
| 369 |
requested_path = request.args.get("path") or state["apply_workbook_path"]
|
| 370 |
if not requested_path:
|
|
|
|
| 382 |
|
| 383 |
@app.route("/run")
|
| 384 |
def run():
|
| 385 |
+
"""Start the cleaning subprocess and stream its logs as server-sent events."""
|
| 386 |
job_id = request.args.get("job_id", uuid.uuid4().hex)
|
| 387 |
input_path = request.args.get("input", "")
|
| 388 |
sheet = request.args.get("sheet", "")
|
|
|
|
| 397 |
except ValueError as exc:
|
| 398 |
return jsonify({"error": str(exc)}), 400
|
| 399 |
|
| 400 |
+
# Each run gets its own Blueprint so simultaneous users cannot overwrite it.
|
| 401 |
blueprint_path = UPLOAD_DIR / f"Blueprint_{job_id}.xlsx"
|
| 402 |
state = get_state()
|
| 403 |
state["apply_blueprint_path"] = str(blueprint_path)
|
|
|
|
| 425 |
|
| 426 |
@app.route("/stop", methods=["POST"])
|
| 427 |
def stop():
|
| 428 |
+
"""Stop a running cleaning subprocess for the given frontend job id."""
|
| 429 |
job_id = request.args.get("job_id", "")
|
| 430 |
if not stop_process(job_id):
|
| 431 |
return jsonify({"stopped": False, "message": "No active run found."}), 404
|
|
|
|
| 435 |
|
| 436 |
@app.route("/apply")
|
| 437 |
def apply_blueprint():
|
| 438 |
+
"""Start the Blueprint-apply subprocess and stream its logs."""
|
| 439 |
input_path = request.args.get("input", "")
|
| 440 |
blueprint_path = request.args.get("blueprint", "")
|
| 441 |
sheet = request.args.get("sheet", DEFAULT_OUTPUT_SHEET_NAME)
|