lsdf committed on
Commit c204306 · 1 Parent(s): 7c286d9

refactor(optimizer): apply unified per-goal iteration budget across all stages

Build per-stage goal lists and iterate each goal with `max_iterations` attempts before moving on. The total run budget now scales with the number of actionable goals, and the docs are updated to match the universal rule.

Made-with: Cursor
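
For orientation, here is a minimal, self-contained sketch of the budget rule this commit applies, under simplified assumptions; `run_budget` and `iterate_goals` are illustrative helpers, not functions from optimizer.py (the cap of 240 comes from the diff below).

```python
from typing import Dict, List


def run_budget(goal_counts: Dict[str, int], max_iterations: int, cap: int = 240) -> int:
    """Total step budget scales with the number of actionable goals, clamped to a cap."""
    return min(cap, max(1, sum(n * max_iterations for n in goal_counts.values())))


def iterate_goals(goals: List[str], max_iterations: int) -> List[str]:
    """Give every goal the same attempt budget before moving to the next goal."""
    attempts: List[str] = []
    for goal in goals:
        for attempt in range(1, max_iterations + 1):
            # A real attempt would stop early once the goal's metric clears its
            # threshold; this sketch always spends the full per-goal budget.
            attempts.append(f"{goal}: attempt {attempt}/{max_iterations}")
    return attempts


print(run_budget({"bert": 5, "ngram": 3}, max_iterations=4))           # 32 = (5 + 3) * 4
print(len(iterate_goals(["phrase a", "phrase b"], max_iterations=4)))  # 8
```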

docs/FULL_FUNCTIONAL_DOCUMENTATION.md CHANGED
@@ -472,7 +472,7 @@ HTML extraction pipeline:
 - `_is_stage_complete` for `bert`:
   - the stage counts as complete only when **every** tracked key phrase reaches `bert_stage_target` (checked via `min(bert_phrase_scores)`);
   - a single "strong" phrase hitting the threshold no longer finishes the BERT stage.
-- the BERT stage is not skipped automatically by the `Stage plateau: no primary progress for 3 steps` rule; plateau auto-advance remains for the later stages (ngram/semantic/title/bm25).
+- unified loop over goals: at every stage, **each** discovered goal/phrase gets the same budget of `max_iterations` attempts; once the limit is exhausted, the optimizer moves on to the next goal of the same stage.
 - `_validate_candidate_text`:
   - rejects low-quality/spammy candidates (duplicated words/entities, suspicious token concatenations);
   - adds an anti-stuffing filter for the BERT goal (exact-phrase repeats and excessive focus-term repeats).
@@ -480,7 +480,7 @@ HTML extraction pipeline:
 ### Main function `optimize_text`
 The iteration loop:
 1. baseline metrics.
-   - loop iteration count: `min(80, max_iterations + addon)`, where `addon = min(56, N_ngram_goals × 3)`, so that a low `max_iterations` does not cut the n-gram stage off after three log lines when the goal list is large.
+   - the total step budget is estimated as `sum(stage_goals × max_iterations)` across all stages (with an upper cap in the code), i.e. it scales with the number of goals that actually need improvement.
 2. pick a goal.
 3. pick a chunk pool and a cascade operation.
    - **`title` stage:** if the average BERT similarity of the Title to the keywords (`title_bert_score`) is below the threshold (`TITLE_TARGET_THRESHOLD` ≈ 0.65), the goal is to **rewrite only the text of the Title field** (`target_title`), not a paragraph of the body. The LLM receives the current title, an excerpt from the body, and the keywords; metrics are recomputed with the new title. Batch edits to the body are never mixed with title edits.
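
The `title` stage gate above can be made concrete with a small hedged sketch; `title_goal` is a hypothetical helper, not the optimizer.py API, though the threshold and the goal shape follow the documented behavior.

```python
from typing import Any, Dict, Optional

TITLE_TARGET_THRESHOLD = 0.65  # approximate value per the docs


def title_goal(title_bert_score: Optional[float]) -> Optional[Dict[str, Any]]:
    """Emit a title-only rewrite goal when the Title lags the keywords."""
    if title_bert_score is None or float(title_bert_score) >= TITLE_TARGET_THRESHOLD:
        return None
    # Only the Title field (`target_title`) is rewritten; body edits stay separate.
    return {"type": "title", "label": "title alignment"}


print(title_goal(0.52))  # {'type': 'title', 'label': 'title alignment'}
print(title_goal(0.71))  # None
```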
optimizer.py CHANGED
@@ -545,18 +545,48 @@ def _choose_optimization_goal(
     bert_stage_target: float = BERT_TARGET_THRESHOLD,
     stage_cursor: int = 0,
 ) -> Dict[str, Any]:
-    candidates: Dict[str, Dict[str, Any]] = {}
+    goals = _collect_optimization_goals(
+        analysis=analysis,
+        semantic=semantic,
+        keywords=keywords,
+        language=language,
+        stage=stage,
+        bert_stage_target=bert_stage_target,
+    )
+    if not goals:
+        return {"type": "none", "label": "no-op", "focus_terms": [], "avoid_terms": []}
+    pick = max(0, int(stage_cursor))
+    if pick >= len(goals):
+        return {"type": "none", "label": "no-op", "focus_terms": [], "avoid_terms": []}
+    return goals[pick]
+
+
+def _collect_optimization_goals(
+    analysis: Dict[str, Any],
+    semantic: Dict[str, Any],
+    keywords: List[str],
+    language: str,
+    stage: str = "bert",
+    bert_stage_target: float = BERT_TARGET_THRESHOLD,
+) -> List[Dict[str, Any]]:
+    goals: List[Dict[str, Any]] = []
     bert_details = analysis.get("bert_analysis", {}).get("detailed", []) or []
     low_bert = [x for x in bert_details if float(x.get("my_max_score", 0)) < float(bert_stage_target)]
     if low_bert:
-        worst = sorted(low_bert, key=lambda x: float(x.get("my_max_score", 0)))[0]
-        focus_terms = _filter_stopwords(_tokenize(worst.get("phrase", "")), language)[:4]
-        candidates["bert"] = {"type": "bert", "label": str(worst.get("phrase", "")), "focus_terms": focus_terms, "avoid_terms": []}
+        for row in sorted(low_bert, key=lambda x: float(x.get("my_max_score", 0))):
+            phrase = str(row.get("phrase", "")).strip()
+            if not phrase:
+                continue
+            focus_terms = _filter_stopwords(_tokenize(phrase), language)[:4]
+            goals.append({"type": "bert", "label": phrase, "focus_terms": focus_terms, "avoid_terms": []})
 
     bm25_remove = [x for x in (analysis.get("bm25_recommendations") or []) if x.get("action") == "remove"]
     if len(bm25_remove) >= 4:
-        spam_terms = [str(x.get("word", "")) for x in sorted(bm25_remove, key=lambda r: int(r.get("count", 0)), reverse=True)[:4]]
-        candidates["bm25"] = {"type": "bm25", "label": "reduce spam", "focus_terms": [], "avoid_terms": spam_terms}
+        for row in sorted(bm25_remove, key=lambda r: int(r.get("count", 0)), reverse=True)[:8]:
+            word = str(row.get("word", "")).strip()
+            if not word:
+                continue
+            goals.append({"type": "bm25", "label": f"reduce spam: {word}", "focus_terms": [], "avoid_terms": [word]})
 
     # Semantic keyword gaps
     lang_stop = STOP_WORDS.get(language, STOP_WORDS["en"])
@@ -579,19 +609,14 @@ def _choose_optimization_goal(
         if _is_semantic_gap(target_w, comp_w):
             candidate_rows.append((term, gap))
     if candidate_rows:
-        top_term = sorted(candidate_rows, key=lambda x: x[1], reverse=True)[0][0]
-        candidates["semantic"] = {"type": "semantic", "label": top_term, "focus_terms": [top_term], "avoid_terms": []}
+        for term, _gap in sorted(candidate_rows, key=lambda x: x[1], reverse=True)[:12]:
+            goals.append({"type": "semantic", "label": term, "focus_terms": [term], "avoid_terms": []})
 
     # N-gram balancing (toward competitor average with tolerance policy).
     ngram_rows = _build_ngram_stage_rows(analysis, keywords, language)
     if ngram_rows:
-        pick = max(0, int(stage_cursor))
-        if pick >= len(ngram_rows):
-            # No more n-gram targets in current stage cursor window.
-            pass
-        else:
-            label, target, comp_avg, tol, _, _ = ngram_rows[pick]
-            candidates["ngram"] = {
+        for rank, (label, target, comp_avg, tol, _, _) in enumerate(ngram_rows):
+            goals.append({
                 "type": "ngram",
                 "label": label,
                 "focus_terms": [label],
@@ -602,9 +627,9 @@ def _choose_optimization_goal(
                 "ngram_lower_bound": round(comp_avg * (1.0 - tol), 3),
                 "ngram_upper_bound": round(comp_avg * (1.0 + tol), 3),
                 "ngram_direction": "increase" if target < comp_avg else "decrease",
-                "ngram_rank_index": pick,
+                "ngram_rank_index": rank,
                 "ngram_candidates_total": len(ngram_rows),
-            }
+            })
 
     title_bert = analysis.get("title_analysis", {}).get("bert", {}) or {}
     title_target_score = title_bert.get("target_score")
@@ -613,17 +638,14 @@ def _choose_optimization_goal(
         and title_target_score is not None
         and float(title_target_score) < TITLE_TARGET_THRESHOLD
     ):
-        candidates["title"] = {
+        goals.append({
             "type": "title",
             "label": "title alignment",
             "focus_terms": _filter_stopwords(_tokenize(" ".join(keywords[:8])), language)[:8],
            "avoid_terms": [],
-        }
+        })
 
-    if stage in candidates:
-        return candidates[stage]
-
-    return {"type": "none", "label": "no-op", "focus_terms": [], "avoid_terms": []}
+    return [g for g in goals if g.get("type") == stage]
 
 
 def _choose_sentence_idx(sentences: List[str], focus_terms: List[str], avoid_terms: List[str], language: str) -> int:
@@ -1485,11 +1507,24 @@ def optimize_text(
         baseline_analysis, baseline_semantic, keywords, language, bert_stage_target=bert_stage_target
     )
 
-    # Global max_iterations caps early stages; n-gram stage gets extra steps so each target
-    # can use NGRAM_ATTEMPTS_PER_TERM tries without being cut off at the user iteration cap.
-    ngram_row_count = len(_build_ngram_stage_rows(baseline_analysis, keywords, language))
-    ngram_step_addon = min(56, max(0, ngram_row_count) * NGRAM_ATTEMPTS_PER_TERM)
-    total_loop_steps = min(80, max_iterations + ngram_step_addon)
+    # Unified per-goal budget for all stages:
+    # total steps = sum(goals_in_stage * max_iterations)
+    baseline_goal_counts = {
+        st: len(
+            _collect_optimization_goals(
+                baseline_analysis,
+                baseline_semantic,
+                keywords,
+                language,
+                stage=st,
+                bert_stage_target=bert_stage_target,
+            )
+        )
+        for st in STAGE_ORDER
+    }
+    ngram_row_count = int(baseline_goal_counts.get("ngram", 0))
+    estimated_total = sum(int(c) * int(max_iterations) for c in baseline_goal_counts.values())
+    total_loop_steps = min(240, max(1, estimated_total))
 
     current_text = target_text
     current_title = (target_title or "").strip()
@@ -1574,15 +1609,40 @@ def optimize_text(
             break
 
         active_stage = STAGE_ORDER[stage_idx]
-        goal = _choose_optimization_goal(
+        goals_for_stage = _collect_optimization_goals(
             current_analysis,
             current_semantic,
             keywords,
             language,
             stage=active_stage,
             bert_stage_target=bert_stage_target,
-            stage_cursor=int((stage_goal_cursor.get(active_stage) or {}).get("term_index", 0)),
         )
+        state = stage_goal_cursor.get(active_stage) or {"goal_index": 0, "attempt_count": 0}
+        goal_index = int(state.get("goal_index", 0))
+        attempt_count = int(state.get("attempt_count", 0))
+
+        # Advance across goals that exhausted per-goal iteration budget.
+        while goal_index < len(goals_for_stage) and attempt_count >= max_iterations:
+            goal_index += 1
+            attempt_count = 0
+
+        if goal_index >= len(goals_for_stage):
+            stage_idx += 1
+            stage_no_progress_steps = 0
+            logs.append(
+                {
+                    "step": step + 1,
+                    "status": "stage_skipped",
+                    "stage": active_stage,
+                    "reason": f"All goals exhausted for stage '{active_stage}' (max_iterations={max_iterations} per goal).",
+                }
+            )
+            stage_goal_cursor[active_stage] = {"goal_index": goal_index, "attempt_count": attempt_count}
+            continue
+
+        goal = goals_for_stage[goal_index]
+        attempt_count += 1
+        stage_goal_cursor[active_stage] = {"goal_index": goal_index, "attempt_count": attempt_count}
         if goal["type"] == "none":
             stage_idx += 1
             stage_no_progress_steps = 0
@@ -2176,8 +2236,6 @@ def optimize_text(
                 stage_no_progress_steps = 0
             else:
                 stage_no_progress_steps += 1
-                if active_stage == "ngram":
-                    _advance_ngram_term_cursor(stage_goal_cursor, active_stage)
             applied_changes += 1
             queued_candidates = []
 
@@ -2322,8 +2380,6 @@ def optimize_text(
                 stage_no_progress_steps = 0
             else:
                 stage_no_progress_steps += 1
-                if active_stage == "ngram":
-                    _advance_ngram_term_cursor(stage_goal_cursor, active_stage)
             applied_changes += 1
             batch_applied = True
             batch_info = {
@@ -2442,18 +2498,7 @@ def optimize_text(
                 }
             )
             stage_no_progress_steps += 1
-            if active_stage == "ngram":
-                _advance_ngram_term_cursor(stage_goal_cursor, active_stage)
-            # Do not auto-skip BERT stage on local plateau while threshold is unmet.
-            # For BERT we keep iterating (with cascade escalation) until either:
-            # - per-phrase threshold is met in _is_stage_complete, or
-            # - global step budget is exhausted.
-            can_advance_on_plateau = active_stage != "bert"
-            if can_advance_on_plateau and stage_no_progress_steps >= 3 and stage_idx < len(STAGE_ORDER) - 1:
-                stage_idx += 1
-                stage_no_progress_steps = 0
-                logs[-1]["advanced_to_stage"] = STAGE_ORDER[stage_idx]
-                logs[-1]["reason"] = f"{logs[-1].get('reason', '-') } Stage plateau: no primary progress for 3 steps."
+            # Stage transition is controlled by per-stage iteration budget and completion checks.
             consecutive_failures += 1
             if consecutive_failures >= 2 and cascade_level < 4:
                 cascade_level += 1
@@ -2487,8 +2532,6 @@ def optimize_text(
                 stage_no_progress_steps = 0
             else:
                 stage_no_progress_steps += 1
-                if active_stage == "ngram":
-                    _advance_ngram_term_cursor(stage_goal_cursor, active_stage)
             applied_changes += 1
             queued_candidates = []
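
To show how the new per-stage cursor and per-goal budget interact, here is a hedged, self-contained sketch; the stub goal dicts stand in for `_collect_optimization_goals()` output, and `next_goal` is an illustrative condensation of the inline loop in the diff above, not a function from optimizer.py.

```python
from typing import Any, Dict, List, Optional


def next_goal(
    goals_for_stage: List[Dict[str, Any]],
    state: Dict[str, int],
    max_iterations: int,
) -> Optional[Dict[str, Any]]:
    """Return the goal to attempt next, or None once every goal's budget is spent."""
    goal_index = int(state.get("goal_index", 0))
    attempt_count = int(state.get("attempt_count", 0))
    # Skip past goals whose per-goal budget is already exhausted.
    while goal_index < len(goals_for_stage) and attempt_count >= max_iterations:
        goal_index += 1
        attempt_count = 0
    if goal_index >= len(goals_for_stage):
        state.update(goal_index=goal_index, attempt_count=attempt_count)
        return None  # the caller advances stage_idx, as in the diff
    state.update(goal_index=goal_index, attempt_count=attempt_count + 1)
    return goals_for_stage[goal_index]


goals = [{"type": "bert", "label": "phrase a"}, {"type": "bert", "label": "phrase b"}]
state = {"goal_index": 0, "attempt_count": 0}
picked: List[str] = []
for _ in range(5):
    goal = next_goal(goals, state, max_iterations=2)
    picked.append(goal["label"] if goal else "stage done")
print(picked)  # ['phrase a', 'phrase a', 'phrase b', 'phrase b', 'stage done']
```

Note that the real loop re-collects the goal list on every step, so it can shrink as metrics improve, while the cursor state persists across steps in `stage_goal_cursor`.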