andrewbejjani commited on
Commit
c6a3f44
·
1 Parent(s): ad5ab1d

Added functional doc in README.md and added basic

Browse files

comments to the code
modified: README.md
modified: apply_blueprint.py
modified: main.py
modified: newest_model.py
modified: src/config.py
modified: src/data_pipeline.py
modified: src/llm_router.py
modified: src/process_runner.py
modified: src/utils.py
modified: src/workbook_io.py
modified: ui_app.py

README.md CHANGED
@@ -1,18 +1,134 @@
1
- ---
2
- title: MasterMap Cleaner
3
- sdk: docker
4
- app_port: 7860
5
- ---
6
 
7
- ## Environment
8
 
9
- For local development, copy `.env.example` to `.env` and fill only the values you need.
10
- Set production secrets in the Hugging Face Space settings, not in committed files.
11
 
12
- - `GROQ_API_KEY`: required for Groq model calls.
13
- - `APP_PASSWORD`: required to password-protect the deployed Space; set it as a Hugging Face Secret.
14
- - `APP_USERNAME`: optional, defaults to `mastermap` when `APP_PASSWORD` is set.
15
- - `APP_SECRET_KEY`: recommended for stable, isolated per-browser sessions.
16
- - `HF_TOKEN`: Hugging Face Space Secret only; optional, required only for the `Save Manual References` button.
17
 
18
- `Save Manual References` only enables on Hugging Face Spaces when `SPACE_ID` is present and `HF_TOKEN` is configured. It commits the current `refdata/manual_references.json` back to the Space repository.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # MasterMap Cleaner
 
 
 
 
2
 
3
+ MasterMap Cleaner is a web tool used to clean and standardize MasterMap Excel files.
4
 
5
+ The tool takes an uploaded workbook, checks selected columns against approved reference lists, uses AI only when needed, and creates a cleaned version of the workbook. It also generates a review file called a **Blueprint**, where uncertain values can be checked and corrected by a human before the final workbook is downloaded.
 
6
 
7
+ ## What The Tool Produces
 
 
 
 
8
 
9
+ After a cleaning run, the tool can produce:
10
+
11
+ - a cleaned workbook with a new cleaned sheet
12
+ - a Blueprint file for human review
13
+ - a final workbook after the reviewed Blueprint has been applied
14
+
15
+ ## Before You Start
16
+
17
+ You need:
18
+
19
+ - the Hugging Face Space link for the tool
20
+ - the username and password provided by the tool administrator
21
+ - the Excel workbook you want to clean
22
+ - access to Excel or another spreadsheet editor to review the Blueprint
23
+
24
+ Accepted workbook formats:
25
+
26
+ - `.xlsx`
27
+ - `.xlsm`
28
+
29
+ ## Recommended Workflow
30
+
31
+ 1. Open the tool.
32
+ 2. Upload the Excel workbook.
33
+ 3. Select the source sheet to clean.
34
+ 4. Run the cleaning process.
35
+ 5. Download the cleaned workbook and Blueprint.
36
+ 6. Review the Blueprint in Excel.
37
+ 7. Upload the reviewed Blueprint back into the tool.
38
+ 8. Apply the Blueprint.
39
+ 9. Download the final cleaned workbook.
40
+ 10. Save manual references if new approved values should be remembered for future runs.
41
+
42
+ ## Step 1: Open The Tool
43
+
44
+ Open the Hugging Face Space link in your browser.
45
+
46
+ If prompted, enter the username and password provided by the tool administrator.
47
+
48
+ ## Step 2: Upload The Dataset
49
+
50
+ In the **Dataset to Clean** section:
51
+
52
+ 1. Drop the Excel file into the upload box, or click the box and select the file.
53
+ 2. Wait until the file is loaded.
54
+ 3. Select the source sheet that contains the data to clean.
55
+ 4. Choose the output sheet name.
56
+
57
+ The output sheet is the new sheet that will be created inside the workbook for cleaned data.
58
+
59
+ Do not use the same name as the source sheet.
60
+
61
+ ## Step 3: Run Cleaning
62
+
63
+ Click **Run Cleaning**.
64
+
65
+ While the tool is running, it will show progress for each cleaned column. Some files may take time depending on the number of rows and the number of values that require AI review.
66
+
67
+ When the run finishes, the tool will show download links for:
68
+
69
+ - **Blueprint**
70
+ - **Cleaned Workbook**
71
+
72
+ Download both files.
73
+
74
+ ## Step 4: Review The Blueprint
75
+
76
+ Open the Blueprint file in Excel.
77
+
78
+ The Blueprint contains values that the tool wants a human to review. Each row represents one value or correction candidate.
79
+
80
+ Main columns:
81
+
82
+ - `Row_Index`: the row in the workbook where the value appears
83
+ - `Column`: the field being reviewed
84
+ - `Original_Raw_Text`: the original value from the uploaded file
85
+ - `AI_Suggested_Match`: the tool's suggested cleaned value
86
+ - `Human_Override`: the reviewer correction field
87
+ - `Confidence`: how confident the tool was
88
+ - `Match_Source`: how the suggestion was produced
89
+
90
+ How to review each row:
91
+
92
+ - If the suggested value is correct, leave `Human_Override` empty.
93
+ - If the suggested value is wrong, choose the correct value from the dropdown.
94
+ - If the correct value is not in the dropdown, type it manually.
95
+ - Focus especially on rows marked `LOW` or `MEDIUM` confidence.
96
+
97
+ The dropdown is there to help, but manual typing is allowed.
98
+
99
+ After reviewing, save the Blueprint file.
100
+
101
+ ## Step 5: Apply The Reviewed Blueprint
102
+
103
+ Return to the tool and go to the **Apply Blueprint** section.
104
+
105
+ 1. Upload the workbook that should receive the corrections.
106
+ 2. Select the sheet to update.
107
+ 3. Upload the reviewed Blueprint file.
108
+ 4. Click **Apply Blueprint**.
109
+
110
+ Use the same cleaned sheet that was created during the cleaning step unless you were instructed otherwise.
111
+
112
+ When the apply step finishes, download the final cleaned workbook.
113
+
114
+ ## Step 6: Save Manual References
115
+
116
+ After applying a Blueprint, the tool may learn newly approved values so future runs can recognize them automatically.
117
+
118
+ If the **Save Manual References** button is available, click it after applying the Blueprint.
119
+
120
+ Use this button when:
121
+
122
+ - the Blueprint contained manually approved values
123
+ - those values should be remembered in future cleaning runs
124
+ - the administrator instructed you to preserve the updated references
125
+
126
+ If the button is disabled, continue using the tool normally. The administrator may handle reference saving separately.
127
+
128
+ ## Important Usage Notes
129
+
130
+ - Keep the browser tab open while a cleaning or apply process is running.
131
+ - Download your files before closing the page.
132
+ - If you refresh the page, you may need to upload the files again.
133
+ - Do not share the tool password outside the approved user group.
134
+ - Do not upload files unless they are intended to be processed by this tool.
apply_blueprint.py CHANGED
@@ -14,6 +14,7 @@ from src.config import (
14
  from src.utils import normalize_ref, prune_manual_refs_against_official, ref_contains
15
 
16
  def parse_args():
 
17
  parser = argparse.ArgumentParser(description="Apply Blueprint Human Overrides")
18
  parser.add_argument("--input", required=True, help="Master Excel file name inside data/")
19
  parser.add_argument("--blueprint", default=DEFAULT_BLUEPRINT_FILE, help="Blueprint Excel file name inside data/")
@@ -29,6 +30,7 @@ def parse_args():
29
  return args
30
 
31
  def load_json_safe(filepath):
 
32
  try:
33
  with open(filepath, 'r', encoding='utf-8-sig') as f:
34
  return json.load(f)
@@ -36,16 +38,19 @@ def load_json_safe(filepath):
36
  return {}
37
 
38
  def split_approved_parts(value):
 
39
  if pd.isna(value):
40
  return []
41
  return [part.strip() for part in str(value).split(",") if part.strip()]
42
 
43
  def ensure_manual_bucket(manual_refs, official_refs, column_name):
 
44
  if column_name not in manual_refs:
45
  manual_refs[column_name] = {} if isinstance(official_refs.get(column_name), dict) else []
46
  return manual_refs[column_name]
47
 
48
  def remember_approved_value(manual_refs, official_refs, column_name, approved_value):
 
49
  manual_bucket = ensure_manual_bucket(manual_refs, official_refs, column_name)
50
  added_count = 0
51
 
@@ -85,7 +90,7 @@ if __name__ == "__main__":
85
  print(f"Error: Blueprint is missing required columns: {sorted(missing_columns)}")
86
  exit()
87
 
88
- # Load the target Excel workbook
89
  wb = openpyxl.load_workbook(args.input)
90
  if args.sheet not in wb.sheetnames:
91
  print(f"Error: No '{args.sheet}' sheet found in {args.input}.")
@@ -98,7 +103,7 @@ if __name__ == "__main__":
98
  if sheet.cell(row=1, column=c).value
99
  }
100
 
101
- # Load the memory dictionaries using the synced CLI path
102
  official_refs = load_json_safe(args.refs)
103
  manual_refs = load_json_safe(args.manual_refs)
104
 
@@ -107,6 +112,7 @@ if __name__ == "__main__":
107
 
108
  print("Applying manual overrides and updating memory...")
109
  for _, row in bp_df.iterrows():
 
110
  human_val = str(row['Human_Override']).strip() if pd.notna(row['Human_Override']) else ""
111
  approved_val = human_val if human_val else str(row['AI_Suggested_Match']).strip() if pd.notna(row['AI_Suggested_Match']) else ""
112
  confidence = str(row['Confidence']).strip().upper() if pd.notna(row['Confidence']) else ""
@@ -117,7 +123,7 @@ if __name__ == "__main__":
117
  raw_col = str(row['Column']).strip()
118
 
119
  if human_val:
120
- # 1. Update the Excel File
121
  try:
122
  excel_row = int(row['Row_Index'])
123
  except (TypeError, ValueError):
@@ -136,7 +142,7 @@ if __name__ == "__main__":
136
  sheet.cell(row=excel_row, column=col_idx).value = human_val
137
  changes_made += 1
138
 
139
- # 2. Update Manual References for human overrides and accepted AI suggestions.
140
  if raw_col == "Degree":
141
  continue
142
 
@@ -152,11 +158,10 @@ if __name__ == "__main__":
152
 
153
  memory_pruned = prune_manual_refs_against_official(manual_refs, official_refs)
154
 
155
- # Save Excel
156
  wb.save(args.input)
157
 
158
- # Save JSONs
159
- # Make sure the data directory exists before dumping
160
  manual_refs_dir = os.path.dirname(args.manual_refs)
161
  if manual_refs_dir:
162
  os.makedirs(manual_refs_dir, exist_ok=True)
 
14
  from src.utils import normalize_ref, prune_manual_refs_against_official, ref_contains
15
 
16
  def parse_args():
17
+ """Parse workbook, Blueprint, and reference paths for the apply step."""
18
  parser = argparse.ArgumentParser(description="Apply Blueprint Human Overrides")
19
  parser.add_argument("--input", required=True, help="Master Excel file name inside data/")
20
  parser.add_argument("--blueprint", default=DEFAULT_BLUEPRINT_FILE, help="Blueprint Excel file name inside data/")
 
30
  return args
31
 
32
  def load_json_safe(filepath):
33
+ """Load JSON memory files and fall back to an empty dict if absent/corrupt."""
34
  try:
35
  with open(filepath, 'r', encoding='utf-8-sig') as f:
36
  return json.load(f)
 
38
  return {}
39
 
40
  def split_approved_parts(value):
41
+ """Split multi-value approvals into individual reference candidates."""
42
  if pd.isna(value):
43
  return []
44
  return [part.strip() for part in str(value).split(",") if part.strip()]
45
 
46
  def ensure_manual_bucket(manual_refs, official_refs, column_name):
47
+ """Create the correct manual-ref container for list or dict reference columns."""
48
  if column_name not in manual_refs:
49
  manual_refs[column_name] = {} if isinstance(official_refs.get(column_name), dict) else []
50
  return manual_refs[column_name]
51
 
52
  def remember_approved_value(manual_refs, official_refs, column_name, approved_value):
53
+ """Remember approved values that are not already official or manual refs."""
54
  manual_bucket = ensure_manual_bucket(manual_refs, official_refs, column_name)
55
  added_count = 0
56
 
 
90
  print(f"Error: Blueprint is missing required columns: {sorted(missing_columns)}")
91
  exit()
92
 
93
+ # Human overrides are applied directly to the selected cleaned sheet.
94
  wb = openpyxl.load_workbook(args.input)
95
  if args.sheet not in wb.sheetnames:
96
  print(f"Error: No '{args.sheet}' sheet found in {args.input}.")
 
103
  if sheet.cell(row=1, column=c).value
104
  }
105
 
106
+ # Reference files use the same CLI defaults as the cleaning pipeline.
107
  official_refs = load_json_safe(args.refs)
108
  manual_refs = load_json_safe(args.manual_refs)
109
 
 
112
 
113
  print("Applying manual overrides and updating memory...")
114
  for _, row in bp_df.iterrows():
115
+ # Empty Human_Override means the reviewer accepted the AI suggestion.
116
  human_val = str(row['Human_Override']).strip() if pd.notna(row['Human_Override']) else ""
117
  approved_val = human_val if human_val else str(row['AI_Suggested_Match']).strip() if pd.notna(row['AI_Suggested_Match']) else ""
118
  confidence = str(row['Confidence']).strip().upper() if pd.notna(row['Confidence']) else ""
 
123
  raw_col = str(row['Column']).strip()
124
 
125
  if human_val:
126
+ # Blueprint row indices already include the skipped MasterMap filter row.
127
  try:
128
  excel_row = int(row['Row_Index'])
129
  except (TypeError, ValueError):
 
142
  sheet.cell(row=excel_row, column=col_idx).value = human_val
143
  changes_made += 1
144
 
145
+ # Only approved non-low-confidence values should teach future runs.
146
  if raw_col == "Degree":
147
  continue
148
 
 
158
 
159
  memory_pruned = prune_manual_refs_against_official(manual_refs, official_refs)
160
 
161
+ # Persist workbook updates before writing the learned memory file.
162
  wb.save(args.input)
163
 
164
+ # Manual refs may be written to an empty deployment volume, so ensure the folder exists.
 
165
  manual_refs_dir = os.path.dirname(args.manual_refs)
166
  if manual_refs_dir:
167
  os.makedirs(manual_refs_dir, exist_ok=True)
main.py CHANGED
@@ -9,13 +9,13 @@ from openpyxl.utils import get_column_letter
9
  from openpyxl.worksheet.datavalidation import DataValidation
10
  from openpyxl.workbook.defined_name import DefinedName
11
 
12
- # Import our new modular architecture
13
  from src.config import parse_cli_args, GROQ_API_KEY, AVAILABLE_MODELS, DEFAULT_SIMILARITY_THRESHOLD
14
  from src.llm_router import GroqRouter
15
  from src.data_pipeline import process_column, cluster_degrees_by_institution
16
  from src.utils import prune_manual_refs_against_official
17
 
18
- # --- 1. CONFIGURATION ---
 
19
  COLUMNS_CONFIG = {
20
  "Country": r',|;|\n|/',
21
  "Institution": r'[,/;|\n]',
@@ -30,10 +30,12 @@ COLUMNS_CONFIG = {
30
  master_cache = {}
31
 
32
  def load_json_safe(filepath):
 
33
  with open(filepath, 'r', encoding='utf-8-sig') as f:
34
  return json.load(f)
35
 
36
  def validate_official_refs(official_refs):
 
37
  missing = []
38
  for column_name in COLUMNS_CONFIG:
39
  if column_name == "Degree":
@@ -51,29 +53,28 @@ def validate_official_refs(official_refs):
51
  )
52
 
53
  def inject_searchable_dropdowns(blueprint_path, master_unique_lists):
54
- """Injects robust, static searchable dropdowns into the Blueprint."""
55
  print("Injecting static searchable dropdowns into Blueprint...")
56
  wb = openpyxl.load_workbook(blueprint_path)
57
  main_sheet = wb.active
58
 
59
- # 1. Create the Reference Sheet
60
  ref_sheet = wb.create_sheet(title="Reference_Lists")
61
 
62
  col_idx = 1
63
  for column_name, unique_items in master_unique_lists.items():
64
  safe_name = column_name.replace(" ", "_")
65
 
66
- # Write the header
67
  ref_sheet.cell(row=1, column=col_idx, value=safe_name)
68
 
69
- # Clean and alphabetize the list for a better user experience
70
  valid_items = sorted([item for item in unique_items if item and isinstance(item, str)])
71
 
72
  # Write the items
73
  for row_idx, item in enumerate(valid_items, start=2):
74
  ref_sheet.cell(row=row_idx, column=col_idx, value=item)
75
 
76
- # 2. Create the Excel "Named Range"
77
  if valid_items:
78
  letter = get_column_letter(col_idx)
79
  range_str = f"Reference_Lists!${letter}$2:${letter}${len(valid_items) + 1}"
@@ -82,7 +83,7 @@ def inject_searchable_dropdowns(blueprint_path, master_unique_lists):
82
 
83
  col_idx += 1
84
 
85
- # 3. Locate Target & Override Columns
86
  target_col_idx = None
87
  override_col_letter = None
88
  for cell in main_sheet[1]:
@@ -91,7 +92,6 @@ def inject_searchable_dropdowns(blueprint_path, master_unique_lists):
91
  elif cell.value == "Human_Override":
92
  override_col_letter = get_column_letter(cell.column)
93
 
94
- # 4. Apply Data Validation
95
  if target_col_idx and override_col_letter:
96
  dv = DataValidation(
97
  type="list",
@@ -108,7 +108,7 @@ def inject_searchable_dropdowns(blueprint_path, master_unique_lists):
108
 
109
 
110
  if __name__ == "__main__":
111
- # --- 2. INITIALIZATION ---
112
  args = parse_cli_args()
113
  source_sheet_name = args.sheet
114
  output_sheet_name = args.output_sheet
@@ -117,7 +117,7 @@ if __name__ == "__main__":
117
  print("Loading AI Model (this may take a few seconds)...")
118
  model = SentenceTransformer('all-MiniLM-L6-v2')
119
 
120
- # Initialize our LLM Router
121
  router = GroqRouter(api_key=GROQ_API_KEY, available_models=available_models)
122
 
123
  if not os.path.exists(args.refs):
@@ -131,6 +131,8 @@ if __name__ == "__main__":
131
  official_refs = load_json_safe(args.refs)
132
  manual_refs = load_json_safe(args.manual_refs)
133
  validate_official_refs(official_refs)
 
 
134
  memory_pruned = prune_manual_refs_against_official(manual_refs, official_refs)
135
  if memory_pruned:
136
  print(f"[INFO] Removed {memory_pruned} manual reference duplicate(s) already covered by official refs.")
@@ -138,10 +140,11 @@ if __name__ == "__main__":
138
  print(f"Loading Excel dataset from {args.input}, sheet '{source_sheet_name}'...")
139
  data = pd.read_excel(args.input, sheet_name=source_sheet_name, skiprows=[1])
140
 
141
- # Initialize the global Blueprint Logger
142
  blueprint_records = []
143
 
144
- # --- 3. EXECUTE BATCH PIPELINE ---
 
145
  for col, pattern in COLUMNS_CONFIG.items():
146
  if col == "Degree":
147
  inst_col = 'Cleaned_Institution' if 'Cleaned_Institution' in data.columns else 'Institution'
@@ -157,16 +160,15 @@ if __name__ == "__main__":
157
  split_pattern=pattern, blueprint_data=blueprint_records
158
  )
159
 
160
- # --- 4. EXPORT RESULTS ---
161
  print("\nSaving all memory files...")
162
  with open(args.manual_refs, 'w', encoding='utf-8') as f: json.dump(manual_refs, f, indent=4, ensure_ascii=False)
163
 
164
- # 4a. Export the Blueprint for Human Review
165
  if blueprint_records:
166
  bp_df = pd.DataFrame(blueprint_records)
167
  bp_df.to_excel(args.blueprint, index=False)
168
 
169
- # --- Format the Blueprint Visually ---
170
  bp_wb = openpyxl.load_workbook(args.blueprint)
171
  bp_sheet = bp_wb.active
172
 
@@ -195,9 +197,8 @@ if __name__ == "__main__":
195
  bp_wb.save(args.blueprint)
196
  print(f"[!] Saved and formatted {len(bp_df)} rows requiring review to {args.blueprint}")
197
 
198
- # --- NEW: Build master lists and inject dropdowns ---
199
  def extract_uniques(ref_data):
200
- """Helper to extract names whether the memory file is a list or a dict"""
201
  if isinstance(ref_data, dict): return list(ref_data.values())
202
  elif isinstance(ref_data, list): return ref_data
203
  return []
@@ -206,7 +207,7 @@ if __name__ == "__main__":
206
  for category in COLUMNS_CONFIG.keys():
207
  off_items = extract_uniques(official_refs.get(category, []))
208
  man_items = extract_uniques(manual_refs.get(category, []))
209
- # Merge, deduplicate, and remove blanks
210
  master_lists[category] = list(set([x for x in (off_items + man_items) if x]))
211
 
212
  inject_searchable_dropdowns(args.blueprint, master_lists)
@@ -214,7 +215,7 @@ if __name__ == "__main__":
214
  else:
215
  print("[!] No blueprint generated. All matches were HIGH confidence!")
216
 
217
- # 4b. Inject Cleaned Data to Mastermap
218
  print("\nOpening original Excel file to preserve formatting...")
219
  wb = openpyxl.load_workbook(args.input)
220
  new_sheet_name = output_sheet_name
 
9
  from openpyxl.worksheet.datavalidation import DataValidation
10
  from openpyxl.workbook.defined_name import DefinedName
11
 
 
12
  from src.config import parse_cli_args, GROQ_API_KEY, AVAILABLE_MODELS, DEFAULT_SIMILARITY_THRESHOLD
13
  from src.llm_router import GroqRouter
14
  from src.data_pipeline import process_column, cluster_degrees_by_institution
15
  from src.utils import prune_manual_refs_against_official
16
 
17
+ # Each cleaned column has its own conservative split pattern. Avoid splitting
18
+ # on words like "and" because they can be part of official country names.
19
  COLUMNS_CONFIG = {
20
  "Country": r',|;|\n|/',
21
  "Institution": r'[,/;|\n]',
 
30
  master_cache = {}
31
 
32
  def load_json_safe(filepath):
33
+ """Load reference JSON files, accepting UTF-8 files with or without a BOM."""
34
  with open(filepath, 'r', encoding='utf-8-sig') as f:
35
  return json.load(f)
36
 
37
  def validate_official_refs(official_refs):
38
+ """Fail early if required reference buckets are missing or empty."""
39
  missing = []
40
  for column_name in COLUMNS_CONFIG:
41
  if column_name == "Degree":
 
53
  )
54
 
55
  def inject_searchable_dropdowns(blueprint_path, master_unique_lists):
56
+ """Add hidden reference lists and dropdowns to the generated Blueprint."""
57
  print("Injecting static searchable dropdowns into Blueprint...")
58
  wb = openpyxl.load_workbook(blueprint_path)
59
  main_sheet = wb.active
60
 
61
+ # Store all dropdown values on a hidden sheet so Excel can reference them.
62
  ref_sheet = wb.create_sheet(title="Reference_Lists")
63
 
64
  col_idx = 1
65
  for column_name, unique_items in master_unique_lists.items():
66
  safe_name = column_name.replace(" ", "_")
67
 
 
68
  ref_sheet.cell(row=1, column=col_idx, value=safe_name)
69
 
70
+ # Clean and alphabetize the list for a better review experience.
71
  valid_items = sorted([item for item in unique_items if item and isinstance(item, str)])
72
 
73
  # Write the items
74
  for row_idx, item in enumerate(valid_items, start=2):
75
  ref_sheet.cell(row=row_idx, column=col_idx, value=item)
76
 
77
+ # Named ranges let data validation reference long lists safely.
78
  if valid_items:
79
  letter = get_column_letter(col_idx)
80
  range_str = f"Reference_Lists!${letter}$2:${letter}${len(valid_items) + 1}"
 
83
 
84
  col_idx += 1
85
 
86
+ # The override dropdown changes based on the row's target column.
87
  target_col_idx = None
88
  override_col_letter = None
89
  for cell in main_sheet[1]:
 
92
  elif cell.value == "Human_Override":
93
  override_col_letter = get_column_letter(cell.column)
94
 
 
95
  if target_col_idx and override_col_letter:
96
  dv = DataValidation(
97
  type="list",
 
108
 
109
 
110
  if __name__ == "__main__":
111
+ # Parse CLI/UI arguments before loading any expensive model assets.
112
  args = parse_cli_args()
113
  source_sheet_name = args.sheet
114
  output_sheet_name = args.output_sheet
 
117
  print("Loading AI Model (this may take a few seconds)...")
118
  model = SentenceTransformer('all-MiniLM-L6-v2')
119
 
120
+ # The router owns Groq fallback order and rate-limit switching.
121
  router = GroqRouter(api_key=GROQ_API_KEY, available_models=available_models)
122
 
123
  if not os.path.exists(args.refs):
 
131
  official_refs = load_json_safe(args.refs)
132
  manual_refs = load_json_safe(args.manual_refs)
133
  validate_official_refs(official_refs)
134
+
135
+ # Manual memory should only contain values not already covered by official refs.
136
  memory_pruned = prune_manual_refs_against_official(manual_refs, official_refs)
137
  if memory_pruned:
138
  print(f"[INFO] Removed {memory_pruned} manual reference duplicate(s) already covered by official refs.")
 
140
  print(f"Loading Excel dataset from {args.input}, sheet '{source_sheet_name}'...")
141
  data = pd.read_excel(args.input, sheet_name=source_sheet_name, skiprows=[1])
142
 
143
+ # Every uncertain or changed value is logged here for human review.
144
  blueprint_records = []
145
 
146
+ # Run each configured column through the normalization pipeline. Degree
147
+ # values are clustered within each institution instead of matched to refs.
148
  for col, pattern in COLUMNS_CONFIG.items():
149
  if col == "Degree":
150
  inst_col = 'Cleaned_Institution' if 'Cleaned_Institution' in data.columns else 'Institution'
 
160
  split_pattern=pattern, blueprint_data=blueprint_records
161
  )
162
 
 
163
  print("\nSaving all memory files...")
164
  with open(args.manual_refs, 'w', encoding='utf-8') as f: json.dump(manual_refs, f, indent=4, ensure_ascii=False)
165
 
166
+ # Export the review workbook only when there is something to inspect.
167
  if blueprint_records:
168
  bp_df = pd.DataFrame(blueprint_records)
169
  bp_df.to_excel(args.blueprint, index=False)
170
 
171
+ # Basic formatting helps reviewers scan confidence levels quickly.
172
  bp_wb = openpyxl.load_workbook(args.blueprint)
173
  bp_sheet = bp_wb.active
174
 
 
197
  bp_wb.save(args.blueprint)
198
  print(f"[!] Saved and formatted {len(bp_df)} rows requiring review to {args.blueprint}")
199
 
 
200
  def extract_uniques(ref_data):
201
+ """Extract display values from list-style or dict-style references."""
202
  if isinstance(ref_data, dict): return list(ref_data.values())
203
  elif isinstance(ref_data, list): return ref_data
204
  return []
 
207
  for category in COLUMNS_CONFIG.keys():
208
  off_items = extract_uniques(official_refs.get(category, []))
209
  man_items = extract_uniques(manual_refs.get(category, []))
210
+ # Merge official and manual values for the Blueprint dropdowns.
211
  master_lists[category] = list(set([x for x in (off_items + man_items) if x]))
212
 
213
  inject_searchable_dropdowns(args.blueprint, master_lists)
 
215
  else:
216
  print("[!] No blueprint generated. All matches were HIGH confidence!")
217
 
218
+ # Copy the source sheet to preserve formatting, then overwrite cleaned columns.
219
  print("\nOpening original Excel file to preserve formatting...")
220
  wb = openpyxl.load_workbook(args.input)
221
  new_sheet_name = output_sheet_name
newest_model.py CHANGED
@@ -37,6 +37,7 @@ PREFERRED_MODEL_IDS = {model_id.lower() for model_id in PREFERRED_PRODUCTION_CHA
37
 
38
 
39
  def fetch_groq_models(api_key: str) -> list[dict[str, Any]]:
 
40
  headers = {
41
  "Authorization": f"Bearer {api_key}",
42
  "Content-Type": "application/json",
@@ -47,6 +48,7 @@ def fetch_groq_models(api_key: str) -> list[dict[str, Any]]:
47
 
48
 
49
  def is_active_chat_model(model: dict[str, Any]) -> bool:
 
50
  model_id = str(model.get("id", "")).lower()
51
  if not model_id:
52
  return False
@@ -58,6 +60,7 @@ def is_active_chat_model(model: dict[str, Any]) -> bool:
58
 
59
 
60
  def rank_model(model: dict[str, Any]) -> tuple[int, int, int, str]:
 
61
  model_id = str(model.get("id", ""))
62
  model_id_lower = model_id.lower()
63
 
@@ -75,6 +78,7 @@ def rank_model(model: dict[str, Any]) -> tuple[int, int, int, str]:
75
 
76
 
77
  def select_groq_chat_models(limit: int = len(PREFERRED_PRODUCTION_CHAT_MODELS), strategy: str = "stable") -> list[str]:
 
78
  api_key = os.getenv("GROQ_API_KEY")
79
  if not api_key:
80
  raise RuntimeError("GROQ_API_KEY is missing. Add it to .env first.")
@@ -98,6 +102,7 @@ def select_groq_chat_models(limit: int = len(PREFERRED_PRODUCTION_CHAT_MODELS),
98
 
99
 
100
  def main() -> None:
 
101
  parser = argparse.ArgumentParser(description="Select currently available Groq chat models.")
102
  parser.add_argument("--limit", type=int, default=len(PREFERRED_PRODUCTION_CHAT_MODELS), help="Number of fallback models to print.")
103
  parser.add_argument(
 
37
 
38
 
39
  def fetch_groq_models(api_key: str) -> list[dict[str, Any]]:
40
+ """Fetch the current Groq model catalog using the OpenAI-compatible API."""
41
  headers = {
42
  "Authorization": f"Bearer {api_key}",
43
  "Content-Type": "application/json",
 
48
 
49
 
50
  def is_active_chat_model(model: dict[str, Any]) -> bool:
51
+ """Keep only active preferred chat models that are suitable for judging."""
52
  model_id = str(model.get("id", "")).lower()
53
  if not model_id:
54
  return False
 
60
 
61
 
62
  def rank_model(model: dict[str, Any]) -> tuple[int, int, int, str]:
63
+ """Sort models by preferred production order, then by recency/capacity."""
64
  model_id = str(model.get("id", ""))
65
  model_id_lower = model_id.lower()
66
 
 
78
 
79
 
80
  def select_groq_chat_models(limit: int = len(PREFERRED_PRODUCTION_CHAT_MODELS), strategy: str = "stable") -> list[str]:
81
+ """Return a comma-ready fallback list for GROQ_MODEL."""
82
  api_key = os.getenv("GROQ_API_KEY")
83
  if not api_key:
84
  raise RuntimeError("GROQ_API_KEY is missing. Add it to .env first.")
 
102
 
103
 
104
  def main() -> None:
105
+ """CLI entry point used when refreshing the recommended Groq model list."""
106
  parser = argparse.ArgumentParser(description="Select currently available Groq chat models.")
107
  parser.add_argument("--limit", type=int, default=len(PREFERRED_PRODUCTION_CHAT_MODELS), help="Number of fallback models to print.")
108
  parser.add_argument(
src/config.py CHANGED
@@ -2,12 +2,13 @@ import os
2
  import argparse
3
  from dotenv import load_dotenv
4
 
5
- # Load environment variables
 
6
  load_dotenv()
7
 
8
  # --- ENVIRONMENT VARIABLES to be set up in .env ---
9
  GROQ_API_KEY = os.getenv("GROQ_API_KEY")
10
- RAW_MODELS = os.getenv("GROQ_MODEL")
11
  APP_USERNAME = os.getenv("APP_USERNAME")
12
  APP_PASSWORD = os.getenv("APP_PASSWORD")
13
  SPACE_ID = os.getenv("SPACE_ID")
@@ -46,7 +47,7 @@ def resolve_ref_path(file_arg):
46
  return os.path.join(REFDATA_DIR, file_arg)
47
 
48
  def parse_cli_args():
49
- """Sets up the command line arguments so you don't have to hardcode filenames."""
50
  parser = argparse.ArgumentParser(description="MasterMap Data Normalization Pipeline")
51
  parser.add_argument("--input", required=True, help="Raw input Excel file name inside data/")
52
  parser.add_argument("--blueprint", default=DEFAULT_BLUEPRINT_FILE, help="Blueprint Excel file name inside data/")
@@ -57,6 +58,7 @@ def parse_cli_args():
57
  parser.add_argument("--models", default="", help="Comma-separated Groq models to use in fallback order")
58
 
59
  args = parser.parse_args()
 
60
  args.input = resolve_data_path(args.input)
61
  args.blueprint = resolve_data_path(args.blueprint)
62
  args.refs = resolve_ref_path(args.refs)
 
2
  import argparse
3
  from dotenv import load_dotenv
4
 
5
+ # Load local .env values for development; Hugging Face injects the same names
6
+ # as environment variables in production.
7
  load_dotenv()
8
 
9
  # --- ENVIRONMENT VARIABLES to be set up in .env ---
10
  GROQ_API_KEY = os.getenv("GROQ_API_KEY")
11
+ RAW_MODELS = os.getenv("GROQ_MODEL", "")
12
  APP_USERNAME = os.getenv("APP_USERNAME")
13
  APP_PASSWORD = os.getenv("APP_PASSWORD")
14
  SPACE_ID = os.getenv("SPACE_ID")
 
47
  return os.path.join(REFDATA_DIR, file_arg)
48
 
49
  def parse_cli_args():
50
+ """Parse shared CLI arguments used by both local runs and the Flask UI."""
51
  parser = argparse.ArgumentParser(description="MasterMap Data Normalization Pipeline")
52
  parser.add_argument("--input", required=True, help="Raw input Excel file name inside data/")
53
  parser.add_argument("--blueprint", default=DEFAULT_BLUEPRINT_FILE, help="Blueprint Excel file name inside data/")
 
58
  parser.add_argument("--models", default="", help="Comma-separated Groq models to use in fallback order")
59
 
60
  args = parser.parse_args()
61
+ # Keep CLI calls short by treating bare names as files under data/refdata.
62
  args.input = resolve_data_path(args.input)
63
  args.blueprint = resolve_data_path(args.blueprint)
64
  args.refs = resolve_ref_path(args.refs)
src/data_pipeline.py CHANGED
@@ -5,7 +5,6 @@ from collections import Counter
5
  from sentence_transformers import util
6
  from tqdm import tqdm
7
 
8
- # Import our pure text manipulation functions
9
  from src.utils import (
10
  clean_degree_text,
11
  normalize_text,
@@ -13,11 +12,8 @@ from src.utils import (
13
  smart_format
14
  )
15
  from src.config import TOP_K_CANDIDATES, DEFAULT_SIMILARITY_THRESHOLD
16
- # ---------------------------------------------------------------------------
17
- # ML & CLUSTERING ENGINE
18
- # ---------------------------------------------------------------------------
19
-
20
  def self_cluster_degrees(raw_degrees_list, model, school_cache, threshold=0.93):
 
21
  cleaned_list = [clean_degree_text(raw) for raw in raw_degrees_list if isinstance(raw, str)]
22
  raw_to_clean = {raw: clean_degree_text(raw) for raw in raw_degrees_list if isinstance(raw, str)}
23
  clean_counts = Counter(cleaned_list)
@@ -32,7 +28,7 @@ def self_cluster_degrees(raw_degrees_list, model, school_cache, threshold=0.93):
32
 
33
  embeddings = model.encode(unique_cleans, convert_to_tensor=True)
34
  clean_to_clustered = {}
35
- merge_info = {} # Tracks similarity scores for the Blueprint
36
 
37
  for i, current_deg in enumerate(unique_cleans):
38
  if current_deg in clean_to_clustered: continue
@@ -45,8 +41,7 @@ def self_cluster_degrees(raw_degrees_list, model, school_cache, threshold=0.93):
45
  if score.item() >= threshold and target_deg not in clean_to_clustered:
46
  pair_key = f"{min(current_deg, target_deg)}|||{max(current_deg, target_deg)}"
47
 
48
- # We still use school_cache as a temporary runtime speedup,
49
- # but it is NOT saved to the json memory.
50
  cached_action = school_cache.get(pair_key)
51
 
52
  if cached_action:
@@ -83,6 +78,7 @@ def self_cluster_degrees(raw_degrees_list, model, school_cache, threshold=0.93):
83
 
84
 
85
  def cluster_degrees_by_institution(df, degree_col, inst_col, model, master_cache, blueprint_data, threshold=0.93):
 
86
  print(f"\n[INFO] Auto-Clustering '{degree_col}'. (Merges will be logged to Blueprint...)")
87
  cleaned_col_name = f'Cleaned_{degree_col}'
88
  df[cleaned_col_name] = df[degree_col].copy()
@@ -92,7 +88,7 @@ def cluster_degrees_by_institution(df, degree_col, inst_col, model, master_cache
92
 
93
  school_mappings = {}
94
 
95
- # 1. Wrap the AI bottleneck (school clustering) in tqdm
96
  for school in tqdm(unique_schools, desc=f"Mapping {degree_col}s by Institution"):
97
  school_mask = (df[inst_col] == school) & (df[degree_col].notna())
98
  raw_degs = df.loc[school_mask, degree_col].astype(str).tolist()
@@ -101,7 +97,7 @@ def cluster_degrees_by_institution(df, degree_col, inst_col, model, master_cache
101
  if school not in master_cache["Degree_Decisions"]: master_cache["Degree_Decisions"][school] = {}
102
  school_mappings[school] = self_cluster_degrees(raw_degs, model, master_cache["Degree_Decisions"][school], threshold)
103
 
104
- # 2. Wrap the DataFrame injection and Blueprint logging in tqdm
105
  for idx, row in tqdm(df.iterrows(), total=len(df), desc=f"Applying & Logging {degree_col}s"):
106
  school = row[inst_col]
107
  raw_deg = str(row[degree_col])
@@ -113,7 +109,6 @@ def cluster_degrees_by_institution(df, degree_col, inst_col, model, master_cache
113
  final_val, src, conf = mapping_data
114
  df.at[idx, cleaned_col_name] = final_val
115
 
116
- # Log to Blueprint if modified or auto-merged
117
  if str(raw_deg).strip() != final_val.strip() or conf != "HIGH":
118
  blueprint_data.append({
119
  "Row_Index": idx + 3,
@@ -128,6 +123,7 @@ def cluster_degrees_by_institution(df, degree_col, inst_col, model, master_cache
128
 
129
 
130
  def get_deterministic_match(value, combined_valid_targets):
 
131
  val_clean = normalize_text(value)
132
  for target in combined_valid_targets:
133
  target_clean = normalize_text(target)
@@ -138,6 +134,7 @@ def get_deterministic_match(value, combined_valid_targets):
138
 
139
 
140
  def get_top_candidates(model, value, combined_valid_targets, reference_embeddings, k=5):
 
141
  if not combined_valid_targets: return []
142
  query_embedding = model.encode(value, convert_to_tensor=True)
143
  similarities = util.pytorch_cos_sim(query_embedding, reference_embeddings)[0]
@@ -146,6 +143,7 @@ def get_top_candidates(model, value, combined_valid_targets, reference_embedding
146
  return [combined_valid_targets[idx] for idx in top_matches.indices]
147
 
148
  def get_dict_exact_match(value, combined_dict):
 
149
  value_clean = normalize_text(value)
150
 
151
  for alias, canonical in combined_dict.items():
@@ -159,6 +157,7 @@ def get_dict_exact_match(value, combined_dict):
159
  return None
160
 
161
  def get_dict_rule_match(value, combined_dict):
 
162
  aliases = list(combined_dict.keys())
163
  canonical_values = list(dict.fromkeys(combined_dict.values()))
164
 
@@ -173,6 +172,7 @@ def get_dict_rule_match(value, combined_dict):
173
  return None
174
 
175
  def as_reference_list(ref_data):
 
176
  if isinstance(ref_data, list):
177
  return ref_data
178
  if isinstance(ref_data, dict):
@@ -180,6 +180,7 @@ def as_reference_list(ref_data):
180
  return []
181
 
182
  def as_reference_dict(ref_data):
 
183
  if isinstance(ref_data, dict):
184
  return ref_data
185
  if isinstance(ref_data, list):
@@ -187,6 +188,7 @@ def as_reference_dict(ref_data):
187
  return {}
188
 
189
  def update_match_postfix(progress, source_counts):
 
190
  progress.set_postfix({
191
  "Exact_Match": source_counts["Exact_Match"],
192
  "Rule_Match": source_counts["Rule_Match"],
@@ -202,6 +204,7 @@ def match_cache_key(column_name, value):
202
 
203
 
204
  def append_unique_cleaned_part(cleaned_parts, value):
 
205
  seen = set()
206
  for existing_value in cleaned_parts:
207
  for existing_part in str(existing_value).split(","):
@@ -226,11 +229,8 @@ def append_unique_cleaned_part(cleaned_parts, value):
226
  return added
227
 
228
 
229
- # ---------------------------------------------------------------------------
230
- # CORE DATA PIPELINE
231
- # ---------------------------------------------------------------------------
232
-
233
  def process_column(df, column_name, model, groq_router, official_refs, manual_refs, master_cache, split_pattern, blueprint_data):
 
234
  if column_name not in df.columns: return df
235
 
236
  core_data = official_refs.get(column_name, [])
@@ -241,6 +241,7 @@ def process_column(df, column_name, model, groq_router, official_refs, manual_re
241
  is_dict_mode = isinstance(core_data, dict)
242
 
243
  def get_updated_embeddings():
 
244
  if is_dict_mode:
245
  c_dict = {**as_reference_dict(core_data), **as_reference_dict(added_data)}
246
  c_keys = list(c_dict.keys())
@@ -261,6 +262,7 @@ def process_column(df, column_name, model, groq_router, official_refs, manual_re
261
  if not is_dict_mode and not combined_valid_targets:
262
  raise ValueError(f"No list references loaded for '{column_name}'. Refusing to call Groq for every value.")
263
 
 
264
  uniques = set()
265
  for cell in df[column_name].dropna():
266
  for p in re.split(split_pattern, str(cell), flags=re.IGNORECASE):
@@ -273,14 +275,14 @@ def process_column(df, column_name, model, groq_router, official_refs, manual_re
273
  for word in progress:
274
  word_clean = match_cache_key(column_name, word)
275
 
276
- # 1. Check Memory Cache
277
  if word_clean in master_cache[column_name]:
278
  detailed_cache[word_clean] = {"val": master_cache[column_name][word_clean], "src": "Memory_Cache", "conf": "HIGH"}
279
  source_counts["Memory_Cache"] += 1
280
  update_match_postfix(progress, source_counts)
281
  continue
282
 
283
- # 2. Check Exact Targets
284
  if is_dict_mode:
285
  exact = get_dict_exact_match(word, combined_dict)
286
  else:
@@ -293,7 +295,6 @@ def process_column(df, column_name, model, groq_router, official_refs, manual_re
293
  update_match_postfix(progress, source_counts)
294
  continue
295
 
296
- # 3. Deterministic / Rule Match
297
  if is_dict_mode:
298
  suggested_match = get_dict_rule_match(word, combined_dict)
299
  else:
@@ -305,7 +306,7 @@ def process_column(df, column_name, model, groq_router, official_refs, manual_re
305
  update_match_postfix(progress, source_counts)
306
  continue
307
 
308
- # 4. LLM API Match
309
  candidates = []
310
  if is_dict_mode:
311
  cand_keys = get_top_candidates(model, word, combined_valid_targets, reference_embeddings)
@@ -314,12 +315,11 @@ def process_column(df, column_name, model, groq_router, official_refs, manual_re
314
  else:
315
  candidates = get_top_candidates(model, word, combined_valid_targets, reference_embeddings)
316
 
317
- # Call the router instance
318
  ans_val, src, conf = groq_router.ask_judge(word, candidates, column_name)
319
  source_counts[src] += 1
320
  update_match_postfix(progress, source_counts)
321
 
322
- # Process every valid string, regardless of confidence (skip if API crashed)
323
  if "API_Error" not in conf and ans_val != "UNKNOWN" and ans_val != "LLM_Failed":
324
  llm_parts = [p.strip() for p in ans_val.split(",")]
325
  corrected_parts = []
@@ -338,24 +338,20 @@ def process_column(df, column_name, model, groq_router, official_refs, manual_re
338
  corrected_parts.append(part)
339
  all_matched = False
340
  else:
341
- # 1. Exact Match Check (Case-insensitive)
342
  exact_match = next((c for c in candidates if c.lower() == part.lower()), None)
343
  if exact_match:
344
  corrected_parts.append(exact_match)
345
  else:
346
- # 2. Rule-Based Match Check
347
  rule_match = get_deterministic_match(part, candidates)
348
  if rule_match:
349
  corrected_parts.append(rule_match)
350
  else:
351
- # 3. No match in dictionary. Keep LLM's version, but flag that we couldn't verify it.
352
  corrected_parts.append(part)
353
  all_matched = False
354
 
355
- # Remove duplicates while preserving the exact order
356
  unique_parts = list(dict.fromkeys(corrected_parts))
357
 
358
- # Glue it back together
359
  ans_val = ", ".join(unique_parts)
360
 
361
  raw_parts_for_check = [
@@ -376,7 +372,7 @@ def process_column(df, column_name, model, groq_router, official_refs, manual_re
376
 
377
  detailed_cache[word_clean] = {"val": ans_val, "src": src, "conf": conf}
378
 
379
- # Reconstruct cells and capture low/medium confidence matches for the Blueprint
380
  for idx, row in df.iterrows():
381
  cell_val = row[column_name]
382
  if pd.isna(cell_val): continue
@@ -390,7 +386,7 @@ def process_column(df, column_name, model, groq_router, official_refs, manual_re
390
  while i < len(raw_parts):
391
  curr = raw_parts[i]
392
 
393
- # Check for combined pairs (e.g., "University of, Manchester" split by mistake)
394
  if i + 1 < len(raw_parts):
395
  combo_clean = match_cache_key(column_name, f"{curr}, {raw_parts[i+1]}")
396
  if combo_clean in detailed_cache:
@@ -416,7 +412,7 @@ def process_column(df, column_name, model, groq_router, official_refs, manual_re
416
  final_stitched_val = ", ".join(cleaned_parts)
417
  df.at[idx, f'Cleaned_{column_name}'] = final_stitched_val
418
 
419
- # Log EVERY change made to the Excel file, plus any low/medium confidence guesses
420
  if str(cell_val).strip() != final_stitched_val.strip() or lowest_conf != "HIGH":
421
  blueprint_data.append({
422
  "Row_Index": idx + 3,
 
5
  from sentence_transformers import util
6
  from tqdm import tqdm
7
 
 
8
  from src.utils import (
9
  clean_degree_text,
10
  normalize_text,
 
12
  smart_format
13
  )
14
  from src.config import TOP_K_CANDIDATES, DEFAULT_SIMILARITY_THRESHOLD
 
 
 
 
15
  def self_cluster_degrees(raw_degrees_list, model, school_cache, threshold=0.93):
16
+ """Cluster similar degree labels inside one institution."""
17
  cleaned_list = [clean_degree_text(raw) for raw in raw_degrees_list if isinstance(raw, str)]
18
  raw_to_clean = {raw: clean_degree_text(raw) for raw in raw_degrees_list if isinstance(raw, str)}
19
  clean_counts = Counter(cleaned_list)
 
28
 
29
  embeddings = model.encode(unique_cleans, convert_to_tensor=True)
30
  clean_to_clustered = {}
31
+ merge_info = {} # Track similarity scores for Blueprint transparency.
32
 
33
  for i, current_deg in enumerate(unique_cleans):
34
  if current_deg in clean_to_clustered: continue
 
41
  if score.item() >= threshold and target_deg not in clean_to_clustered:
42
  pair_key = f"{min(current_deg, target_deg)}|||{max(current_deg, target_deg)}"
43
 
44
+ # Runtime cache avoids repeated decisions within one run only.
 
45
  cached_action = school_cache.get(pair_key)
46
 
47
  if cached_action:
 
78
 
79
 
80
  def cluster_degrees_by_institution(df, degree_col, inst_col, model, master_cache, blueprint_data, threshold=0.93):
81
+ """Apply degree clustering separately for each institution."""
82
  print(f"\n[INFO] Auto-Clustering '{degree_col}'. (Merges will be logged to Blueprint...)")
83
  cleaned_col_name = f'Cleaned_{degree_col}'
84
  df[cleaned_col_name] = df[degree_col].copy()
 
88
 
89
  school_mappings = {}
90
 
91
+ # Build school-specific mappings before mutating the dataframe.
92
  for school in tqdm(unique_schools, desc=f"Mapping {degree_col}s by Institution"):
93
  school_mask = (df[inst_col] == school) & (df[degree_col].notna())
94
  raw_degs = df.loc[school_mask, degree_col].astype(str).tolist()
 
97
  if school not in master_cache["Degree_Decisions"]: master_cache["Degree_Decisions"][school] = {}
98
  school_mappings[school] = self_cluster_degrees(raw_degs, model, master_cache["Degree_Decisions"][school], threshold)
99
 
100
+ # Apply mappings and log only changed/merged values for review.
101
  for idx, row in tqdm(df.iterrows(), total=len(df), desc=f"Applying & Logging {degree_col}s"):
102
  school = row[inst_col]
103
  raw_deg = str(row[degree_col])
 
109
  final_val, src, conf = mapping_data
110
  df.at[idx, cleaned_col_name] = final_val
111
 
 
112
  if str(raw_deg).strip() != final_val.strip() or conf != "HIGH":
113
  blueprint_data.append({
114
  "Row_Index": idx + 3,
 
123
 
124
 
125
  def get_deterministic_match(value, combined_valid_targets):
126
+ """Match obvious aliases/acronyms without calling embeddings or Groq."""
127
  val_clean = normalize_text(value)
128
  for target in combined_valid_targets:
129
  target_clean = normalize_text(target)
 
134
 
135
 
136
  def get_top_candidates(model, value, combined_valid_targets, reference_embeddings, k=5):
137
+ """Return the nearest reference candidates for one raw value."""
138
  if not combined_valid_targets: return []
139
  query_embedding = model.encode(value, convert_to_tensor=True)
140
  similarities = util.pytorch_cos_sim(query_embedding, reference_embeddings)[0]
 
143
  return [combined_valid_targets[idx] for idx in top_matches.indices]
144
 
145
  def get_dict_exact_match(value, combined_dict):
146
+ """Exact match against alias keys first, then canonical values."""
147
  value_clean = normalize_text(value)
148
 
149
  for alias, canonical in combined_dict.items():
 
157
  return None
158
 
159
  def get_dict_rule_match(value, combined_dict):
160
+ """Rule match dictionary-style refs while returning canonical values."""
161
  aliases = list(combined_dict.keys())
162
  canonical_values = list(dict.fromkeys(combined_dict.values()))
163
 
 
172
  return None
173
 
174
  def as_reference_list(ref_data):
175
+ """Convert list/dict reference data to display values."""
176
  if isinstance(ref_data, list):
177
  return ref_data
178
  if isinstance(ref_data, dict):
 
180
  return []
181
 
182
  def as_reference_dict(ref_data):
183
+ """Convert list/dict reference data to an alias-to-canonical mapping."""
184
  if isinstance(ref_data, dict):
185
  return ref_data
186
  if isinstance(ref_data, list):
 
188
  return {}
189
 
190
  def update_match_postfix(progress, source_counts):
191
+ """Expose match-source counts in tqdm without noisy per-row prints."""
192
  progress.set_postfix({
193
  "Exact_Match": source_counts["Exact_Match"],
194
  "Rule_Match": source_counts["Rule_Match"],
 
204
 
205
 
206
  def append_unique_cleaned_part(cleaned_parts, value):
207
+ """Append comma-separated cleaned parts while preserving first-seen order."""
208
  seen = set()
209
  for existing_value in cleaned_parts:
210
  for existing_part in str(existing_value).split(","):
 
229
  return added
230
 
231
 
 
 
 
 
232
  def process_column(df, column_name, model, groq_router, official_refs, manual_refs, master_cache, split_pattern, blueprint_data):
233
+ """Clean one dataframe column using refs, embeddings, then Groq fallback."""
234
  if column_name not in df.columns: return df
235
 
236
  core_data = official_refs.get(column_name, [])
 
241
  is_dict_mode = isinstance(core_data, dict)
242
 
243
  def get_updated_embeddings():
244
+ """Build current reference candidates after manual memory is loaded."""
245
  if is_dict_mode:
246
  c_dict = {**as_reference_dict(core_data), **as_reference_dict(added_data)}
247
  c_keys = list(c_dict.keys())
 
262
  if not is_dict_mode and not combined_valid_targets:
263
  raise ValueError(f"No list references loaded for '{column_name}'. Refusing to call Groq for every value.")
264
 
265
+ # Work on unique split values first so repeated cells reuse one decision.
266
  uniques = set()
267
  for cell in df[column_name].dropna():
268
  for p in re.split(split_pattern, str(cell), flags=re.IGNORECASE):
 
275
  for word in progress:
276
  word_clean = match_cache_key(column_name, word)
277
 
278
+ # Fast path: reuse a decision made earlier in this run.
279
  if word_clean in master_cache[column_name]:
280
  detailed_cache[word_clean] = {"val": master_cache[column_name][word_clean], "src": "Memory_Cache", "conf": "HIGH"}
281
  source_counts["Memory_Cache"] += 1
282
  update_match_postfix(progress, source_counts)
283
  continue
284
 
285
+ # Exact/rule matches are trusted and avoid LLM calls.
286
  if is_dict_mode:
287
  exact = get_dict_exact_match(word, combined_dict)
288
  else:
 
295
  update_match_postfix(progress, source_counts)
296
  continue
297
 
 
298
  if is_dict_mode:
299
  suggested_match = get_dict_rule_match(word, combined_dict)
300
  else:
 
306
  update_match_postfix(progress, source_counts)
307
  continue
308
 
309
+ # Last resort: send only the top reference candidates to Groq.
310
  candidates = []
311
  if is_dict_mode:
312
  cand_keys = get_top_candidates(model, word, combined_valid_targets, reference_embeddings)
 
315
  else:
316
  candidates = get_top_candidates(model, word, combined_valid_targets, reference_embeddings)
317
 
 
318
  ans_val, src, conf = groq_router.ask_judge(word, candidates, column_name)
319
  source_counts[src] += 1
320
  update_match_postfix(progress, source_counts)
321
 
322
+ # Re-check Groq output against refs so canonical casing/names are preserved.
323
  if "API_Error" not in conf and ans_val != "UNKNOWN" and ans_val != "LLM_Failed":
324
  llm_parts = [p.strip() for p in ans_val.split(",")]
325
  corrected_parts = []
 
338
  corrected_parts.append(part)
339
  all_matched = False
340
  else:
 
341
  exact_match = next((c for c in candidates if c.lower() == part.lower()), None)
342
  if exact_match:
343
  corrected_parts.append(exact_match)
344
  else:
 
345
  rule_match = get_deterministic_match(part, candidates)
346
  if rule_match:
347
  corrected_parts.append(rule_match)
348
  else:
349
+ # Keep unverifiable LLM text, but do not upgrade confidence.
350
  corrected_parts.append(part)
351
  all_matched = False
352
 
 
353
  unique_parts = list(dict.fromkeys(corrected_parts))
354
 
 
355
  ans_val = ", ".join(unique_parts)
356
 
357
  raw_parts_for_check = [
 
372
 
373
  detailed_cache[word_clean] = {"val": ans_val, "src": src, "conf": conf}
374
 
375
+ # Reconstruct full cell values in original row order for workbook injection.
376
  for idx, row in df.iterrows():
377
  cell_val = row[column_name]
378
  if pd.isna(cell_val): continue
 
386
  while i < len(raw_parts):
387
  curr = raw_parts[i]
388
 
389
+ # Recover obvious accidental splits such as "University of, Manchester".
390
  if i + 1 < len(raw_parts):
391
  combo_clean = match_cache_key(column_name, f"{curr}, {raw_parts[i+1]}")
392
  if combo_clean in detailed_cache:
 
412
  final_stitched_val = ", ".join(cleaned_parts)
413
  df.at[idx, f'Cleaned_{column_name}'] = final_stitched_val
414
 
415
+ # Review every changed cell and every low/medium-confidence result.
416
  if str(cell_val).strip() != final_stitched_val.strip() or lowest_conf != "HIGH":
417
  blueprint_data.append({
418
  "Row_Index": idx + 3,
src/llm_router.py CHANGED
@@ -3,9 +3,13 @@ import time
3
  from tqdm import tqdm
4
  from tenacity import retry, wait_exponential, stop_after_attempt, retry_if_exception_type
5
 
6
- class RateLimitException(Exception): pass
 
 
7
 
8
  class GroqRouter:
 
 
9
  def __init__(self, api_key, available_models):
10
  self.api_key = api_key
11
  self.available_models = available_models
@@ -13,6 +17,7 @@ class GroqRouter:
13
  self.last_printed_model = None
14
 
15
  def ask_judge(self, word, candidates, column_name):
 
16
  if self.current_model_index >= len(self.available_models):
17
  return (word, "API_Error_All_Models_Dead", "LOW")
18
 
@@ -21,6 +26,8 @@ class GroqRouter:
21
 
22
  headers = {"Authorization": f"Bearer {self.api_key}", "Content-Type": "application/json"}
23
 
 
 
24
  if column_name in ["Institution", "Degree"]:
25
  specific_rules = (
26
  "- Split distinct separate schools or global alliances with a comma (e.g., 'Harvard & MIT' -> 'Harvard University, MIT').\n"
@@ -66,7 +73,6 @@ class GroqRouter:
66
  "max_tokens": 50
67
  }
68
 
69
- # --- SIMPLIFIED RETRY LOGIC ---
70
  @retry(
71
  retry=retry_if_exception_type(RateLimitException),
72
  wait=wait_exponential(multiplier=2, min=2, max=30),
@@ -74,6 +80,7 @@ class GroqRouter:
74
  reraise=True
75
  )
76
  def fire_request():
 
77
  res = requests.post("https://api.groq.com/openai/v1/chat/completions", headers=headers, json=payload, timeout=30)
78
 
79
  if res.status_code == 429:
@@ -90,6 +97,7 @@ class GroqRouter:
90
  self.last_printed_model = active_model
91
 
92
  try:
 
93
  time.sleep(0.3)
94
  response = fire_request()
95
 
@@ -106,6 +114,7 @@ class GroqRouter:
106
  except RateLimitException:
107
  tqdm.write(f" [!] Limits exhausted for {active_model}!")
108
 
 
109
  self.current_model_index += 1
110
 
111
  if self.current_model_index < len(self.available_models):
 
3
  from tqdm import tqdm
4
  from tenacity import retry, wait_exponential, stop_after_attempt, retry_if_exception_type
5
 
6
+ class RateLimitException(Exception):
7
+ """Raised when a Groq model hits rate limits and should be rotated."""
8
+ pass
9
 
10
  class GroqRouter:
11
+ """Small Groq client that rotates through configured fallback models."""
12
+
13
  def __init__(self, api_key, available_models):
14
  self.api_key = api_key
15
  self.available_models = available_models
 
17
  self.last_printed_model = None
18
 
19
  def ask_judge(self, word, candidates, column_name):
20
+ """Ask Groq to normalize one raw value against likely candidates."""
21
  if self.current_model_index >= len(self.available_models):
22
  return (word, "API_Error_All_Models_Dead", "LOW")
23
 
 
26
 
27
  headers = {"Authorization": f"Bearer {self.api_key}", "Content-Type": "application/json"}
28
 
29
+ # Column-specific prompt rules prevent the model from over-splitting
30
+ # institutions while still translating geography values to English.
31
  if column_name in ["Institution", "Degree"]:
32
  specific_rules = (
33
  "- Split distinct separate schools or global alliances with a comma (e.g., 'Harvard & MIT' -> 'Harvard University, MIT').\n"
 
73
  "max_tokens": 50
74
  }
75
 
 
76
  @retry(
77
  retry=retry_if_exception_type(RateLimitException),
78
  wait=wait_exponential(multiplier=2, min=2, max=30),
 
80
  reraise=True
81
  )
82
  def fire_request():
83
+ """Fire one request; tenacity retries only explicit rate-limit errors."""
84
  res = requests.post("https://api.groq.com/openai/v1/chat/completions", headers=headers, json=payload, timeout=30)
85
 
86
  if res.status_code == 429:
 
97
  self.last_printed_model = active_model
98
 
99
  try:
100
+ # Light throttling reduces avoidable rate-limit pressure.
101
  time.sleep(0.3)
102
  response = fire_request()
103
 
 
114
  except RateLimitException:
115
  tqdm.write(f" [!] Limits exhausted for {active_model}!")
116
 
117
+ # Move to the next configured model and keep processing.
118
  self.current_model_index += 1
119
 
120
  if self.current_model_index < len(self.available_models):
src/process_runner.py CHANGED
@@ -10,6 +10,7 @@ ACTIVE_PROCESSES = {}
10
 
11
 
12
  def stop_process(job_id: str) -> bool:
 
13
  process = ACTIVE_PROCESSES.get(job_id)
14
  if not process or process.poll() is not None:
15
  return False
@@ -26,6 +27,7 @@ def stop_process(job_id: str) -> bool:
26
 
27
 
28
  def stream_process(command, cwd: Path, job_id=None):
 
29
  env = os.environ.copy()
30
  env["PYTHONUNBUFFERED"] = "1"
31
  popen_kwargs = {
@@ -36,6 +38,7 @@ def stream_process(command, cwd: Path, job_id=None):
36
  "env": env,
37
  }
38
  if os.name == "nt":
 
39
  popen_kwargs["creationflags"] = subprocess.CREATE_NEW_PROCESS_GROUP
40
 
41
  process = subprocess.Popen(
@@ -43,6 +46,7 @@ def stream_process(command, cwd: Path, job_id=None):
43
  **popen_kwargs,
44
  )
45
  if job_id:
 
46
  ACTIVE_PROCESSES[job_id] = process
47
  try:
48
  assert process.stdout is not None
@@ -59,6 +63,8 @@ def stream_process(command, cwd: Path, job_id=None):
59
  trailing_chunk = decoder.decode(b"", final=True)
60
  if trailing_chunk:
61
  yield f"data: {json.dumps(trailing_chunk)}\n\n"
 
 
62
  yield f"data: {json.dumps(chr(10) + f'Process exited with code {exit_code}' + chr(10))}\n\n"
63
  event_name = "done" if exit_code == 0 else "failed"
64
  yield f"event: {event_name}\ndata: {{}}\n\n"
 
10
 
11
 
12
  def stop_process(job_id: str) -> bool:
13
+ """Stop a tracked subprocess by frontend job id."""
14
  process = ACTIVE_PROCESSES.get(job_id)
15
  if not process or process.poll() is not None:
16
  return False
 
27
 
28
 
29
  def stream_process(command, cwd: Path, job_id=None):
30
+ """Run a command and yield stdout/stderr as server-sent event chunks."""
31
  env = os.environ.copy()
32
  env["PYTHONUNBUFFERED"] = "1"
33
  popen_kwargs = {
 
38
  "env": env,
39
  }
40
  if os.name == "nt":
41
+ # Required so CTRL_BREAK can stop child Python processes on Windows.
42
  popen_kwargs["creationflags"] = subprocess.CREATE_NEW_PROCESS_GROUP
43
 
44
  process = subprocess.Popen(
 
46
  **popen_kwargs,
47
  )
48
  if job_id:
49
+ # The UI can later call /stop with this id.
50
  ACTIVE_PROCESSES[job_id] = process
51
  try:
52
  assert process.stdout is not None
 
63
  trailing_chunk = decoder.decode(b"", final=True)
64
  if trailing_chunk:
65
  yield f"data: {json.dumps(trailing_chunk)}\n\n"
66
+
67
+ # The frontend distinguishes a real success from a crashed subprocess.
68
  yield f"data: {json.dumps(chr(10) + f'Process exited with code {exit_code}' + chr(10))}\n\n"
69
  event_name = "done" if exit_code == 0 else "failed"
70
  yield f"event: {event_name}\ndata: {{}}\n\n"
src/utils.py CHANGED
@@ -5,6 +5,7 @@ import unicodedata
5
  from src.config import HF_TOKEN, SPACE_ID
6
 
7
  def strip_degrees_for_search(text):
 
8
  if not isinstance(text, str): return text
9
  degree_pattern = r'\b(MSc|MBA|BBA|BSc|Ph\.?D\.?|BA|MA|BS|MS|EMBA|Master|Bachelor|Masters|Bachelors|Licence)\b'
10
  cleaned = re.sub(degree_pattern, '', text, flags=re.IGNORECASE)
@@ -14,6 +15,7 @@ def strip_degrees_for_search(text):
14
  return cleaned
15
 
16
  def smart_format(text):
 
17
  if not isinstance(text, str): return text
18
  res = text.title()
19
  acronyms = ['Ma', 'Ba', 'Mba', 'Bba', 'Hr', 'It', 'Bs', 'Ms', 'Phd', 'Bsc', 'Msc', 'Llm', 'Pge', 'Cems']
@@ -24,6 +26,7 @@ def smart_format(text):
24
  return res.strip()
25
 
26
  def clean_degree_text(text):
 
27
  if not isinstance(text, str): return ""
28
  text = re.sub(r'\band\b', '&', text, flags=re.IGNORECASE)
29
  text = re.sub(r'\bet\b', '&', text, flags=re.IGNORECASE)
@@ -32,14 +35,17 @@ def clean_degree_text(text):
32
  return smart_format(text)
33
 
34
  def normalize_text(text):
 
35
  if not isinstance(text, str): return ""
36
  normalized = ''.join(c for c in unicodedata.normalize('NFD', text) if unicodedata.category(c) != 'Mn')
37
  return normalized.strip().lower()
38
 
39
  def normalize_ref(value):
 
40
  return normalize_text(str(value))
41
 
42
  def iter_ref_values(ref_data):
 
43
  if isinstance(ref_data, dict):
44
  yield from (item for item in ref_data.keys() if isinstance(item, str))
45
  yield from (item for item in ref_data.values() if isinstance(item, str))
@@ -47,10 +53,12 @@ def iter_ref_values(ref_data):
47
  yield from (item for item in ref_data if isinstance(item, str))
48
 
49
  def ref_contains(ref_data, value):
 
50
  needle = normalize_ref(value)
51
  return any(normalize_ref(item) == needle for item in iter_ref_values(ref_data))
52
 
53
  def prune_manual_refs_against_official(manual_refs, official_refs):
 
54
  removed_count = 0
55
 
56
  for column_name, manual_bucket in list(manual_refs.items()):
@@ -100,6 +108,7 @@ def prune_manual_refs_against_official(manual_refs, official_refs):
100
  MANUAL_REFERENCES_REPO_PATH = "refdata/manual_references.json"
101
 
102
  def reference_sync_status():
 
103
  if not SPACE_ID:
104
  return {
105
  "enabled": False,
@@ -121,6 +130,7 @@ def reference_sync_status():
121
  }
122
 
123
  def save_manual_references_to_hub(app_root: Path):
 
124
  status = reference_sync_status()
125
  if not status["enabled"]:
126
  raise RuntimeError(status["reason"])
 
5
  from src.config import HF_TOKEN, SPACE_ID
6
 
7
  def strip_degrees_for_search(text):
8
+ """Remove common degree words before matching institution names."""
9
  if not isinstance(text, str): return text
10
  degree_pattern = r'\b(MSc|MBA|BBA|BSc|Ph\.?D\.?|BA|MA|BS|MS|EMBA|Master|Bachelor|Masters|Bachelors|Licence)\b'
11
  cleaned = re.sub(degree_pattern, '', text, flags=re.IGNORECASE)
 
15
  return cleaned
16
 
17
  def smart_format(text):
18
+ """Title-case free text while preserving common academic/business acronyms."""
19
  if not isinstance(text, str): return text
20
  res = text.title()
21
  acronyms = ['Ma', 'Ba', 'Mba', 'Bba', 'Hr', 'It', 'Bs', 'Ms', 'Phd', 'Bsc', 'Msc', 'Llm', 'Pge', 'Cems']
 
26
  return res.strip()
27
 
28
  def clean_degree_text(text):
29
+ """Normalize degree titles before within-school clustering."""
30
  if not isinstance(text, str): return ""
31
  text = re.sub(r'\band\b', '&', text, flags=re.IGNORECASE)
32
  text = re.sub(r'\bet\b', '&', text, flags=re.IGNORECASE)
 
35
  return smart_format(text)
36
 
37
  def normalize_text(text):
38
+ """Normalize text for accent-insensitive, case-insensitive comparisons."""
39
  if not isinstance(text, str): return ""
40
  normalized = ''.join(c for c in unicodedata.normalize('NFD', text) if unicodedata.category(c) != 'Mn')
41
  return normalized.strip().lower()
42
 
43
  def normalize_ref(value):
44
+ """Normalize a reference value or alias for dictionary/set lookups."""
45
  return normalize_text(str(value))
46
 
47
  def iter_ref_values(ref_data):
48
+ """Yield all searchable strings from list-style or dict-style references."""
49
  if isinstance(ref_data, dict):
50
  yield from (item for item in ref_data.keys() if isinstance(item, str))
51
  yield from (item for item in ref_data.values() if isinstance(item, str))
 
53
  yield from (item for item in ref_data if isinstance(item, str))
54
 
55
  def ref_contains(ref_data, value):
56
+ """Return whether a reference bucket already contains a value/alias."""
57
  needle = normalize_ref(value)
58
  return any(normalize_ref(item) == needle for item in iter_ref_values(ref_data))
59
 
60
  def prune_manual_refs_against_official(manual_refs, official_refs):
61
+ """Remove manual values that are duplicates of official references."""
62
  removed_count = 0
63
 
64
  for column_name, manual_bucket in list(manual_refs.items()):
 
108
  MANUAL_REFERENCES_REPO_PATH = "refdata/manual_references.json"
109
 
110
  def reference_sync_status():
111
+ """Report whether the app can commit manual refs back to Hugging Face."""
112
  if not SPACE_ID:
113
  return {
114
  "enabled": False,
 
130
  }
131
 
132
  def save_manual_references_to_hub(app_root: Path):
133
+ """Commit the current manual references file back to the Space repository."""
134
  status = reference_sync_status()
135
  if not status["enabled"]:
136
  raise RuntimeError(status["reason"])
src/workbook_io.py CHANGED
@@ -9,6 +9,7 @@ ALLOWED_EXCEL_EXTENSIONS = (".xlsx", ".xlsm")
9
 
10
 
11
  def save_uploaded_excel(uploaded, upload_dir: Path):
 
12
  if not uploaded or not uploaded.filename:
13
  raise ValueError("No file uploaded.")
14
 
@@ -18,6 +19,7 @@ def save_uploaded_excel(uploaded, upload_dir: Path):
18
 
19
  stem = Path(filename).stem
20
  suffix = Path(filename).suffix
 
21
  saved_filename = f"{stem}_{uuid.uuid4().hex[:8]}{suffix}"
22
  destination = upload_dir / saved_filename
23
  uploaded.save(destination)
@@ -25,6 +27,7 @@ def save_uploaded_excel(uploaded, upload_dir: Path):
25
 
26
 
27
  def read_workbook_sheets(path: Path) -> list[str]:
 
28
  workbook = openpyxl.load_workbook(path, read_only=True, data_only=False)
29
  try:
30
  return workbook.sheetnames
@@ -33,6 +36,7 @@ def read_workbook_sheets(path: Path) -> list[str]:
33
 
34
 
35
  def resolve_allowed_path(raw_path: str, app_root: Path, allowed_roots: list[Path]) -> Path:
 
36
  if not raw_path:
37
  raise ValueError("Path is required.")
38
 
@@ -42,6 +46,7 @@ def resolve_allowed_path(raw_path: str, app_root: Path, allowed_roots: list[Path
42
 
43
  resolved = candidate.resolve()
44
  allowed = [root.resolve() for root in allowed_roots]
 
45
  if not any(resolved == root or resolved.is_relative_to(root) for root in allowed):
46
  raise ValueError("Path is outside the application data directory.")
47
 
 
9
 
10
 
11
  def save_uploaded_excel(uploaded, upload_dir: Path):
12
+ """Validate and save an uploaded Excel file with a collision-safe name."""
13
  if not uploaded or not uploaded.filename:
14
  raise ValueError("No file uploaded.")
15
 
 
19
 
20
  stem = Path(filename).stem
21
  suffix = Path(filename).suffix
22
+ # UUID suffixes keep simultaneous users from overwriting each other.
23
  saved_filename = f"{stem}_{uuid.uuid4().hex[:8]}{suffix}"
24
  destination = upload_dir / saved_filename
25
  uploaded.save(destination)
 
27
 
28
 
29
  def read_workbook_sheets(path: Path) -> list[str]:
30
+ """Read sheet names without loading workbook cell data into memory."""
31
  workbook = openpyxl.load_workbook(path, read_only=True, data_only=False)
32
  try:
33
  return workbook.sheetnames
 
36
 
37
 
38
  def resolve_allowed_path(raw_path: str, app_root: Path, allowed_roots: list[Path]) -> Path:
39
+ """Resolve a user-supplied path and ensure it stays inside allowed roots."""
40
  if not raw_path:
41
  raise ValueError("Path is required.")
42
 
 
46
 
47
  resolved = candidate.resolve()
48
  allowed = [root.resolve() for root in allowed_roots]
49
+ # Prevent download/apply endpoints from reading arbitrary server files.
50
  if not any(resolved == root or resolved.is_relative_to(root) for root in allowed):
51
  raise ValueError("Path is outside the application data directory.")
52
 
ui_app.py CHANGED
@@ -23,6 +23,8 @@ from src.workbook_io import read_workbook_sheets, resolve_allowed_path, save_upl
23
  APP_ROOT = Path(__file__).resolve().parent
24
  UPLOAD_DIR = APP_ROOT / DATA_DIR / "uploads"
25
  UPLOAD_DIR.mkdir(parents=True, exist_ok=True)
 
 
26
  ALLOWED_FILE_ROOTS = [APP_ROOT / DATA_DIR]
27
 
28
  app = Flask(
@@ -31,7 +33,7 @@ app = Flask(
31
  static_folder=str(APP_ROOT / "ui" / "static"),
32
  )
33
  app.config["MAX_CONTENT_LENGTH"] = 100 * 1024 * 1024
34
- app.secret_key = os.getenv("APP_SECRET_KEY", "mastermap-local-dev-secret")
35
 
36
  DEFAULT_STATE = {
37
  "clean_path": "",
@@ -50,6 +52,7 @@ DEFAULT_STATE = {
50
 
51
 
52
  def fresh_state():
 
53
  return {
54
  key: list(value) if isinstance(value, list) else value
55
  for key, value in DEFAULT_STATE.items()
@@ -57,16 +60,19 @@ def fresh_state():
57
 
58
 
59
  def get_state():
 
60
  if "ui_state" not in session:
61
  session["ui_state"] = fresh_state()
62
  return session["ui_state"]
63
 
64
 
65
  def mark_state_changed():
 
66
  session.modified = True
67
 
68
 
69
  def auth_required_response() -> Response:
 
70
  return Response(
71
  "Authentication required",
72
  401,
@@ -75,15 +81,17 @@ def auth_required_response() -> Response:
75
 
76
 
77
  def missing_auth_config_response() -> Response:
 
78
  return Response(
79
- "APP_PASSWORD Space Secret is not configured.",
80
  503,
81
  )
82
 
83
 
84
  @app.before_request
85
  def require_basic_auth():
86
- if not APP_PASSWORD:
 
87
  if SPACE_ID:
88
  return missing_auth_config_response()
89
  return None
@@ -110,6 +118,7 @@ def prevent_browser_cache(response):
110
 
111
 
112
  def default_models() -> str:
 
113
  preferred_model_ids = {model.lower() for model in PREFERRED_PRODUCTION_CHAT_MODELS}
114
  env_preferred_models = [
115
  model
@@ -120,6 +129,7 @@ def default_models() -> str:
120
 
121
 
122
  def render_page(message: str = "", error: str = ""):
 
123
  state = get_state()
124
  if state["clean_sheets"]:
125
  state["clean_selected_sheet"] = pick_sheet(state["clean_sheets"], state["clean_selected_sheet"], state)
@@ -139,6 +149,7 @@ def render_page(message: str = "", error: str = ""):
139
 
140
 
141
  def can_apply_blueprint() -> bool:
 
142
  state = get_state()
143
  return bool(
144
  state["apply_workbook_path"]
@@ -149,10 +160,12 @@ def can_apply_blueprint() -> bool:
149
 
150
 
151
  def wants_json_response() -> bool:
 
152
  return "application/json" in request.headers.get("Accept", "")
153
 
154
 
155
  def ui_state_payload(message: str = "", error: str = ""):
 
156
  state = get_state()
157
  return {
158
  "message": message,
@@ -168,6 +181,7 @@ def ui_state_payload(message: str = "", error: str = ""):
168
 
169
 
170
  def pick_sheet(sheets, preferred_sheet=None, state=None):
 
171
  state = state or get_state()
172
  if preferred_sheet and preferred_sheet in sheets:
173
  return preferred_sheet
@@ -177,6 +191,7 @@ def pick_sheet(sheets, preferred_sheet=None, state=None):
177
 
178
 
179
  def update_ui_state_from_form(form):
 
180
  state = get_state()
181
  state["clean_selected_sheet"] = form.get("clean_selected_sheet") or state["clean_selected_sheet"]
182
  state["output_sheet"] = form.get("output_sheet") or state["output_sheet"] or DEFAULT_OUTPUT_SHEET_NAME
@@ -200,6 +215,7 @@ def prepare_clean():
200
  except Exception as exc:
201
  return render_page(error=str(exc))
202
 
 
203
  state["clean_path"] = str(path)
204
  state["clean_filename"] = filename
205
  state["clean_sheets"] = sheets
@@ -214,6 +230,7 @@ def prepare_clean():
214
 
215
  @app.route("/remove-clean", methods=["POST"])
216
  def remove_clean():
 
217
  update_ui_state_from_form(request.form)
218
  state = get_state()
219
  old_path = state["clean_path"]
@@ -232,6 +249,7 @@ def remove_clean():
232
 
233
  @app.route("/prepare-apply-workbook", methods=["POST"])
234
  def prepare_apply_workbook():
 
235
  try:
236
  update_ui_state_from_form(request.form)
237
  state = get_state()
@@ -254,6 +272,7 @@ def prepare_apply_workbook():
254
 
255
  @app.route("/prepare-apply-blueprint", methods=["POST"])
256
  def prepare_apply_blueprint():
 
257
  try:
258
  update_ui_state_from_form(request.form)
259
  state = get_state()
@@ -276,6 +295,7 @@ def prepare_apply_blueprint():
276
 
277
  @app.route("/models")
278
  def models_endpoint():
 
279
  try:
280
  models = select_groq_chat_models(limit=len(PREFERRED_PRODUCTION_CHAT_MODELS))
281
  except Exception as exc:
@@ -285,11 +305,13 @@ def models_endpoint():
285
 
286
  @app.route("/references/status")
287
  def references_status():
 
288
  return jsonify(reference_sync_status())
289
 
290
 
291
  @app.route("/references/save", methods=["POST"])
292
  def save_references():
 
293
  try:
294
  result = save_manual_references_to_hub(APP_ROOT)
295
  except Exception as exc:
@@ -299,6 +321,7 @@ def save_references():
299
 
300
  @app.route("/sheets")
301
  def sheets_endpoint():
 
302
  try:
303
  workbook_path = resolve_allowed_path(request.args.get("path", ""), APP_ROOT, ALLOWED_FILE_ROOTS)
304
  if not workbook_path.is_file():
@@ -310,6 +333,7 @@ def sheets_endpoint():
310
 
311
  @app.route("/download-blueprint")
312
  def download_blueprint():
 
313
  state = get_state()
314
  requested_path = request.args.get("path") or state["apply_blueprint_path"]
315
  if not requested_path:
@@ -322,6 +346,7 @@ def download_blueprint():
322
 
323
  @app.route("/download-cleaned-workbook")
324
  def download_cleaned_workbook():
 
325
  state = get_state()
326
  requested_path = request.args.get("path") or state["clean_path"]
327
  if not requested_path:
@@ -339,6 +364,7 @@ def download_cleaned_workbook():
339
 
340
  @app.route("/download-applied-workbook")
341
  def download_applied_workbook():
 
342
  state = get_state()
343
  requested_path = request.args.get("path") or state["apply_workbook_path"]
344
  if not requested_path:
@@ -356,6 +382,7 @@ def download_applied_workbook():
356
 
357
  @app.route("/run")
358
  def run():
 
359
  job_id = request.args.get("job_id", uuid.uuid4().hex)
360
  input_path = request.args.get("input", "")
361
  sheet = request.args.get("sheet", "")
@@ -370,6 +397,7 @@ def run():
370
  except ValueError as exc:
371
  return jsonify({"error": str(exc)}), 400
372
 
 
373
  blueprint_path = UPLOAD_DIR / f"Blueprint_{job_id}.xlsx"
374
  state = get_state()
375
  state["apply_blueprint_path"] = str(blueprint_path)
@@ -397,6 +425,7 @@ def run():
397
 
398
  @app.route("/stop", methods=["POST"])
399
  def stop():
 
400
  job_id = request.args.get("job_id", "")
401
  if not stop_process(job_id):
402
  return jsonify({"stopped": False, "message": "No active run found."}), 404
@@ -406,6 +435,7 @@ def stop():
406
 
407
  @app.route("/apply")
408
  def apply_blueprint():
 
409
  input_path = request.args.get("input", "")
410
  blueprint_path = request.args.get("blueprint", "")
411
  sheet = request.args.get("sheet", DEFAULT_OUTPUT_SHEET_NAME)
 
23
  APP_ROOT = Path(__file__).resolve().parent
24
  UPLOAD_DIR = APP_ROOT / DATA_DIR / "uploads"
25
  UPLOAD_DIR.mkdir(parents=True, exist_ok=True)
26
+
27
+ # Download/apply routes only accept files inside data/ to avoid arbitrary file reads.
28
  ALLOWED_FILE_ROOTS = [APP_ROOT / DATA_DIR]
29
 
30
  app = Flask(
 
33
  static_folder=str(APP_ROOT / "ui" / "static"),
34
  )
35
  app.config["MAX_CONTENT_LENGTH"] = 100 * 1024 * 1024
36
+ app.secret_key = os.getenv("APP_SECRET_KEY", "local-dev-secret")
37
 
38
  DEFAULT_STATE = {
39
  "clean_path": "",
 
52
 
53
 
54
  def fresh_state():
55
+ """Create a clean UI state for one browser session."""
56
  return {
57
  key: list(value) if isinstance(value, list) else value
58
  for key, value in DEFAULT_STATE.items()
 
60
 
61
 
62
  def get_state():
63
+ """Return the current browser's state, creating it on first visit."""
64
  if "ui_state" not in session:
65
  session["ui_state"] = fresh_state()
66
  return session["ui_state"]
67
 
68
 
69
  def mark_state_changed():
70
+ """Tell Flask to re-sign the session cookie after nested state edits."""
71
  session.modified = True
72
 
73
 
74
  def auth_required_response() -> Response:
75
+ """Ask the browser for basic-auth credentials."""
76
  return Response(
77
  "Authentication required",
78
  401,
 
81
 
82
 
83
  def missing_auth_config_response() -> Response:
84
+ """Fail closed on Hugging Face if password protection was not configured."""
85
  return Response(
86
+ "App login credentials are not configured.",
87
  503,
88
  )
89
 
90
 
91
  @app.before_request
92
  def require_basic_auth():
93
+ """Protect every app route with a shared password when configured."""
94
+ if not APP_PASSWORD or not APP_USERNAME:
95
  if SPACE_ID:
96
  return missing_auth_config_response()
97
  return None
 
118
 
119
 
120
  def default_models() -> str:
121
+ """Prefer configured Groq models, falling back to the curated production list."""
122
  preferred_model_ids = {model.lower() for model in PREFERRED_PRODUCTION_CHAT_MODELS}
123
  env_preferred_models = [
124
  model
 
129
 
130
 
131
  def render_page(message: str = "", error: str = ""):
132
+ """Render the app with state scoped to the current browser session."""
133
  state = get_state()
134
  if state["clean_sheets"]:
135
  state["clean_selected_sheet"] = pick_sheet(state["clean_sheets"], state["clean_selected_sheet"], state)
 
149
 
150
 
151
  def can_apply_blueprint() -> bool:
152
+ """The Apply button requires workbook, blueprint, and target sheet."""
153
  state = get_state()
154
  return bool(
155
  state["apply_workbook_path"]
 
160
 
161
 
162
  def wants_json_response() -> bool:
163
+ """AJAX upload routes ask for JSON; normal form posts render the page."""
164
  return "application/json" in request.headers.get("Accept", "")
165
 
166
 
167
  def ui_state_payload(message: str = "", error: str = ""):
168
+ """Return just enough state for the frontend to update without a reload."""
169
  state = get_state()
170
  return {
171
  "message": message,
 
181
 
182
 
183
  def pick_sheet(sheets, preferred_sheet=None, state=None):
184
+ """Choose a stable sheet selection when workbooks are uploaded/refreshed."""
185
  state = state or get_state()
186
  if preferred_sheet and preferred_sheet in sheets:
187
  return preferred_sheet
 
191
 
192
 
193
  def update_ui_state_from_form(form):
194
+ """Preserve current UI selections while a file upload request is processed."""
195
  state = get_state()
196
  state["clean_selected_sheet"] = form.get("clean_selected_sheet") or state["clean_selected_sheet"]
197
  state["output_sheet"] = form.get("output_sheet") or state["output_sheet"] or DEFAULT_OUTPUT_SHEET_NAME
 
215
  except Exception as exc:
216
  return render_page(error=str(exc))
217
 
218
+ # The uploaded workbook becomes both the cleaning input and default apply target.
219
  state["clean_path"] = str(path)
220
  state["clean_filename"] = filename
221
  state["clean_sheets"] = sheets
 
230
 
231
  @app.route("/remove-clean", methods=["POST"])
232
  def remove_clean():
233
+ """Clear the current session's cleaning workbook without touching other sessions."""
234
  update_ui_state_from_form(request.form)
235
  state = get_state()
236
  old_path = state["clean_path"]
 
249
 
250
  @app.route("/prepare-apply-workbook", methods=["POST"])
251
  def prepare_apply_workbook():
252
+ """AJAX upload for the workbook that will receive Blueprint corrections."""
253
  try:
254
  update_ui_state_from_form(request.form)
255
  state = get_state()
 
272
 
273
  @app.route("/prepare-apply-blueprint", methods=["POST"])
274
  def prepare_apply_blueprint():
275
+ """AJAX upload for an externally reviewed Blueprint workbook."""
276
  try:
277
  update_ui_state_from_form(request.form)
278
  state = get_state()
 
295
 
296
  @app.route("/models")
297
  def models_endpoint():
298
+ """Fetch the currently usable Groq fallback model list for the UI."""
299
  try:
300
  models = select_groq_chat_models(limit=len(PREFERRED_PRODUCTION_CHAT_MODELS))
301
  except Exception as exc:
 
305
 
306
  @app.route("/references/status")
307
  def references_status():
308
+ """Tell the UI whether Hugging Face reference sync can be used."""
309
  return jsonify(reference_sync_status())
310
 
311
 
312
  @app.route("/references/save", methods=["POST"])
313
  def save_references():
314
+ """Commit manual references back to the Space repo when HF sync is configured."""
315
  try:
316
  result = save_manual_references_to_hub(APP_ROOT)
317
  except Exception as exc:
 
321
 
322
  @app.route("/sheets")
323
  def sheets_endpoint():
324
+ """Return workbook sheet names for dynamic apply-sheet selection."""
325
  try:
326
  workbook_path = resolve_allowed_path(request.args.get("path", ""), APP_ROOT, ALLOWED_FILE_ROOTS)
327
  if not workbook_path.is_file():
 
333
 
334
  @app.route("/download-blueprint")
335
  def download_blueprint():
336
+ """Download either the session Blueprint or an explicitly requested run file."""
337
  state = get_state()
338
  requested_path = request.args.get("path") or state["apply_blueprint_path"]
339
  if not requested_path:
 
346
 
347
  @app.route("/download-cleaned-workbook")
348
  def download_cleaned_workbook():
349
+ """Download the cleaned workbook copy for this session/run."""
350
  state = get_state()
351
  requested_path = request.args.get("path") or state["clean_path"]
352
  if not requested_path:
 
364
 
365
  @app.route("/download-applied-workbook")
366
  def download_applied_workbook():
367
+ """Download the workbook after Blueprint corrections have been applied."""
368
  state = get_state()
369
  requested_path = request.args.get("path") or state["apply_workbook_path"]
370
  if not requested_path:
 
382
 
383
  @app.route("/run")
384
  def run():
385
+ """Start the cleaning subprocess and stream its logs as server-sent events."""
386
  job_id = request.args.get("job_id", uuid.uuid4().hex)
387
  input_path = request.args.get("input", "")
388
  sheet = request.args.get("sheet", "")
 
397
  except ValueError as exc:
398
  return jsonify({"error": str(exc)}), 400
399
 
400
+ # Each run gets its own Blueprint so simultaneous users cannot overwrite it.
401
  blueprint_path = UPLOAD_DIR / f"Blueprint_{job_id}.xlsx"
402
  state = get_state()
403
  state["apply_blueprint_path"] = str(blueprint_path)
 
425
 
426
  @app.route("/stop", methods=["POST"])
427
  def stop():
428
+ """Stop a running cleaning subprocess for the given frontend job id."""
429
  job_id = request.args.get("job_id", "")
430
  if not stop_process(job_id):
431
  return jsonify({"stopped": False, "message": "No active run found."}), 404
 
435
 
436
  @app.route("/apply")
437
  def apply_blueprint():
438
+ """Start the Blueprint-apply subprocess and stream its logs."""
439
  input_path = request.args.get("input", "")
440
  blueprint_path = request.args.get("blueprint", "")
441
  sheet = request.args.get("sheet", DEFAULT_OUTPUT_SHEET_NAME)