lsdf committed
Commit ad42da6 · 1 Parent(s): a52efa4

N-gram: rank by Freq/Avg, extra loop budget, overlapping window chunking

docs/FULL_FUNCTIONAL_DOCUMENTATION.md CHANGED
@@ -464,10 +464,12 @@ HTML extraction pipeline:
 ### Main function `optimize_text`
 Iteration loop:
 1. baseline metrics.
+ - number of loop iterations: `min(80, max_iterations + addon)`, where `addon = min(56, N_ngram_targets × 3)`, so that a low `max_iterations` does not cut the n-gram stage off after three log rows when the target list is long.
 2. choose a goal.
 3. choose the chunk pool and the cascade operation.
 - several span candidates are selected per step (multi-chunk selection), not just one;
 - ranking accounts for `focus_terms/avoid_terms`, chunk-level relevance, and noise heuristics (menu/CTA/header penalties);
+ - for **n-gram** goals, sentences are ranked via **sliding overlapping windows** of 2–4 sentences (stride 1): each sentence is assigned the best score among its windows; the score penalizes local phrase repetition and noisy blocks;
 - for BERT goals, ranking is not limited to spans with already-present occurrences: relevant spans with underrepresented core terms, where the terms can be added naturally, are additionally prioritized;
 - `attempt_cursor` per goal and `attempted_spans` are used to avoid looping over the same span.
 4. generate `N` candidates for each selected span.
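The loop-budget rule documented in this hunk can be sketched as a tiny helper (a sketch only: the function name `loop_budget` and the argument `ngram_targets` are illustrative, while the constants 80, 56, and 3 come from the diff):

```python
def loop_budget(max_iterations: int, ngram_targets: int) -> int:
    # Extra steps reserved for the n-gram stage: 3 attempts per target, capped at 56.
    addon = min(56, max(0, ngram_targets) * 3)
    # Hard ceiling of 80 keeps a huge target list from exploding total run time.
    return min(80, max_iterations + addon)
```

For example, with `max_iterations=10` and 30 eligible targets the addon saturates at 56, giving 66 loop steps instead of stopping at 10.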
docs/TEXT_OPTIMIZER_PRINCIPLES.md CHANGED
@@ -44,10 +44,21 @@ Update it whenever optimization policy changes.
 - Selection rules (multi-competitor mode, `competitors > 1`):
   - bi-grams and tri-grams are eligible when present in `>= 2` competitors;
   - unigrams are eligible only if they are part of user keyword phrases and present in `>= 2` competitors.
+ - Target ranking (which n-gram to work on next):
+   - sort eligible **underrepresented** rows by **Freq(K)** (`comp_occurrence`) descending,
+     then **Avg(K)** (`competitor_avg`) descending,
+     then **deviation** from the competitor average descending (larger gap first).
 - Iteration behavior:
   - optimizer works on one n-gram target at a time per step;
   - per eligible n-gram target it allocates `3` attempts, then moves to the next target;
   - if the target list ends, the stage advances to the next optimization stage.
+ - **Global step budget:** the UI `max_iterations` cap still limits total loop iterations, but the
+   optimizer **adds** extra steps reserved for the n-gram stage (`targets × 3`, capped) so a low
+   `max_iterations` value does not stop the run after only three n-gram rows while many targets remain.
+ - **Chunk selection (n-gram stage):** candidate sentences are ranked using **overlapping multi-sentence
+   windows** (stride 1). Each sentence receives the best window score; windows favor low local phrase
+   duplication, topical overlap with phrase tokens, and non-noisy prose. The document-level phrase count
+   remains the primary acceptance signal.
 
 ## 5.1 Summary logic memory (current)
 
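The target-ranking rule above can be reduced to one sort key (a sketch: `rank_targets` and the simplified row tuples are illustrative stand-ins, not the optimizer's real row layout):

```python
from typing import List, Tuple

# Simplified stand-in row: (ngram, freq_k, avg_k, deviation).
Row = Tuple[str, int, float, float]

def rank_targets(rows: List[Row]) -> List[Row]:
    # Freq(K) descending, then Avg(K) descending, then deviation descending.
    return sorted(rows, key=lambda r: (r[1], r[2], r[3]), reverse=True)

rows = [
    ("seo audit", 3, 2.0, 0.5),
    ("site speed", 5, 1.5, 0.9),
    ("seo tools", 5, 2.5, 0.2),
]
# "seo tools" and "site speed" share Freq(K)=5; the higher Avg(K) wins the tie.
```

Note deviation only breaks ties after both Freq(K) and Avg(K) are equal, matching the stated priority order.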
optimizer.py CHANGED
@@ -303,6 +303,95 @@ def _chunk_ngram_count(text: str, ngram_label: str, language: str) -> int:
     return count
 
 
+def _build_ngram_stage_rows(
+    analysis: Dict[str, Any],
+    keywords: List[str],
+    language: str,
+) -> List[Tuple[str, float, float, float, int, float]]:
+    """
+    Eligible underrepresented n-gram targets for the ngram stage.
+    Each row: (ngram_label, target_count, comp_avg, tolerance_pct, comp_occurrence, dev_ratio).
+
+    Priority (user policy): maximize competitor coverage Freq(K), then Avg(K), then
+    how far below the band the target is (larger deviation first).
+    Only terms with My=0 or outside the tolerance band below average are included.
+    """
+    ngram_rows: List[Tuple[str, float, float, float, int, float]] = []
+    ngram_stats = analysis.get("ngram_stats", {}) or {}
+    competitor_count = len((analysis.get("word_counts", {}) or {}).get("competitors", []) or [])
+    keyword_unigrams = _keyword_unigram_set(keywords, language)
+    for _bucket_name, bucket in ngram_stats.items():
+        if not isinstance(bucket, list):
+            continue
+        for item in bucket:
+            ngram_label = str(item.get("ngram", "")).strip()
+            if not ngram_label:
+                continue
+            target = float(item.get("target_count", 0))
+            comp_avg = float(item.get("competitor_avg", 0))
+            comp_occ = int(item.get("comp_occurrence", 0))
+            if not _is_ngram_stage_candidate(ngram_label, comp_occ, competitor_count, keyword_unigrams):
+                continue
+            if not _is_ngram_outside_tolerance(target, comp_avg):
+                continue
+            if target >= comp_avg:
+                continue
+            tol = _ngram_tolerance_pct(comp_avg)
+            dev_ratio = _ngram_deviation_ratio(target, comp_avg)
+            ngram_rows.append((ngram_label, target, comp_avg, tol, comp_occ, dev_ratio))
+    ngram_rows.sort(key=lambda x: (x[4], x[2], x[5]), reverse=True)
+    return ngram_rows
+
+
+def _score_ngram_candidate_window(window_sentences: List[str], goal_label: str, language: str) -> float:
+    """Heuristic: good place to add phrase — low local duplication, topical proximity, not boilerplate."""
+    chunk = " ".join(s for s in window_sentences if s).strip()
+    if not chunk:
+        return -1e6
+    phrase_count = float(_chunk_ngram_count(chunk, goal_label, language))
+    noise_n = sum(1 for s in window_sentences if _is_noise_like_sentence(s))
+    noise_frac = noise_n / max(1, len(window_sentences))
+    phrase_tokens = [t.lower() for t in _filter_stopwords(_tokenize(goal_label), language) if t]
+    chunk_l = chunk.lower()
+    unigram_hits = sum(1 for t in phrase_tokens if t and len(t) > 1 and t in chunk_l)
+    rel_proxy = unigram_hits / max(1, len(phrase_tokens)) if phrase_tokens else 0.0
+    return (
+        -3.0 * phrase_count
+        + 2.2 * rel_proxy
+        - 4.0 * noise_frac
+        + min(len(chunk) / 1200.0, 0.35)
+    )
+
+
+def _rank_ngram_overlap_sentence_indices(
+    sentences: List[str],
+    goal_label: str,
+    language: str,
+) -> List[int]:
+    """
+    Slide overlapping multi-sentence windows over the document; each sentence gets the
+    best score among windows that contain it. Order sentences by that score (desc).
+    """
+    n = len(sentences)
+    if n <= 0:
+        return []
+    if n == 1:
+        return [0]
+    # 2–4 sentences per window, stride 1 for strong overlap.
+    w = min(4, max(2, n))
+    best: List[float] = [-1e9] * n
+    for start in range(0, n - w + 1):
+        win = sentences[start : start + w]
+        sc = _score_ngram_candidate_window(win, goal_label, language)
+        for j in range(start, start + w):
+            if sc > best[j]:
+                best[j] = sc
+    center = (n - 1) / 2.0
+    scored_idx = [(i, best[i], -abs(i - center)) for i in range(n)]
+    scored_idx.sort(key=lambda t: (t[1], t[2]), reverse=True)
+    return [t[0] for t in scored_idx]
+
+
 def _compute_metrics(
     analysis: Dict[str, Any],
     semantic: Dict[str, Any],
@@ -470,32 +559,8 @@ def _choose_optimization_goal(
         candidates["semantic"] = {"type": "semantic", "label": top_term, "focus_terms": [top_term], "avoid_terms": []}
 
     # N-gram balancing (toward competitor average with tolerance policy).
-    ngram_rows: List[Tuple[str, float, float, float, int, float]] = []
-    ngram_stats = analysis.get("ngram_stats", {}) or {}
-    competitor_count = len((analysis.get("word_counts", {}) or {}).get("competitors", []) or [])
-    keyword_unigrams = _keyword_unigram_set(keywords, language)
-    for bucket_name, bucket in ngram_stats.items():
-        if not isinstance(bucket, list):
-            continue
-        for item in bucket:
-            ngram_label = str(item.get("ngram", "")).strip()
-            if not ngram_label:
-                continue
-            target = float(item.get("target_count", 0))
-            comp_avg = float(item.get("competitor_avg", 0))
-            comp_occ = int(item.get("comp_occurrence", 0))
-            if not _is_ngram_stage_candidate(ngram_label, comp_occ, competitor_count, keyword_unigrams):
-                continue
-            if not _is_ngram_outside_tolerance(target, comp_avg):
-                continue
-            # N-gram stage is for underrepresented terms only.
-            if target >= comp_avg:
-                continue
-            tol = _ngram_tolerance_pct(comp_avg)
-            dev_ratio = _ngram_deviation_ratio(target, comp_avg)
-            ngram_rows.append((ngram_label, target, comp_avg, tol, comp_occ, dev_ratio))
+    ngram_rows = _build_ngram_stage_rows(analysis, keywords, language)
     if ngram_rows:
-        ngram_rows.sort(key=lambda x: (x[5], x[4], x[2]), reverse=True)
         pick = max(0, int(stage_cursor))
         if pick >= len(ngram_rows):
             # No more n-gram targets in current stage cursor window.
@@ -576,6 +641,11 @@ def _rank_sentence_indices(
     avoid = [x for x in avoid_terms if x]
     center = (len(sentences) - 1) / 2.0
 
+    # N-gram stage: overlapping sentence windows — pick spans where insertion is natural
+    # while document-level phrase count remains the primary optimization signal.
+    if goal_type == "ngram" and (goal_label or "").strip():
+        return _rank_ngram_overlap_sentence_indices(sentences, str(goal_label).strip(), language)
+
     # For BERT optimization prefer natural prose chunks over list/menu/noisy blocks.
     candidate_indices = list(range(len(sentences)))
     if goal_type == "bert":
@@ -1250,6 +1320,12 @@ def optimize_text(request_data: Dict[str, Any]) -> Dict[str, Any]:
         baseline_analysis, baseline_semantic, keywords, language, bert_stage_target=bert_stage_target
     )
 
+    # Global max_iterations caps early stages; n-gram stage gets extra steps so each target
+    # can use NGRAM_ATTEMPTS_PER_TERM tries without being cut off at the user iteration cap.
+    ngram_row_count = len(_build_ngram_stage_rows(baseline_analysis, keywords, language))
+    ngram_step_addon = min(56, max(0, ngram_row_count) * NGRAM_ATTEMPTS_PER_TERM)
+    total_loop_steps = min(80, max_iterations + ngram_step_addon)
+
     current_text = target_text
     current_analysis = baseline_analysis
     current_semantic = baseline_semantic
@@ -1266,7 +1342,7 @@ def optimize_text(request_data: Dict[str, Any]) -> Dict[str, Any]:
     stage_no_progress_steps = 0
     stage_goal_cursor: Dict[str, Dict[str, int]] = {}
 
-    for step in range(max_iterations):
+    for step in range(total_loop_steps):
         while stage_idx < len(STAGE_ORDER) and _is_stage_complete(
             STAGE_ORDER[stage_idx], current_metrics, bert_stage_target=bert_stage_target
         ):
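The overlapping-window ranking introduced in this commit can be illustrated with a self-contained toy version (hypothetical names: `rank_by_overlapping_windows` and the passed-in scorer stand in for `_rank_ngram_overlap_sentence_indices` and `_score_ngram_candidate_window`; only the best-score-per-sentence mechanics mirror the diff):

```python
from typing import Callable, List

def rank_by_overlapping_windows(
    sentences: List[str],
    score: Callable[[List[str]], float],
    width: int = 4,
) -> List[int]:
    # Slide windows of `width` sentences with stride 1; each sentence keeps
    # the best score among the windows that contain it.
    n = len(sentences)
    if n == 0:
        return []
    w = min(width, n)
    best = [float("-inf")] * n
    for start in range(n - w + 1):
        sc = score(sentences[start:start + w])
        for j in range(start, start + w):
            best[j] = max(best[j], sc)
    # Rank sentence indices by best window score, descending (stable on ties).
    return sorted(range(n), key=lambda i: best[i], reverse=True)

# Toy scorer: penalize windows that already contain the phrase "seo".
sents = ["seo seo seo", "about seo", "clean prose here", "more clean prose"]
order = rank_by_overlapping_windows(
    sents, lambda win: -sum("seo" in s for s in win), width=2
)
```

With this scorer the clean-prose sentences rank first, matching the diff's intent of preferring spans with low local phrase duplication as insertion points.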