lsdf committed on
Commit
09a2c0e
·
1 Parent(s): dd4e1d6

Add configurable phrase strategy mode for LLM optimizer.

Browse files

Expose Auto/Distributed/Exact phrase strategy in UI and API, then enforce it in prompt generation and response metadata to reduce unnatural exact-phrase stuffing.

Made-with: Cursor

docs/FULL_FUNCTIONAL_DOCUMENTATION.md CHANGED
@@ -164,7 +164,8 @@
164
  ### Вход (`OptimizerRequest`)
165
  - аналитические данные: `target_text`, `competitors`, `keywords`, `language`, `target_title`, `competitor_titles`
166
  - LLM: `api_key`, `api_base_url`, `model`, `temperature`
167
- - стратегия: `max_iterations`, `candidates_per_iteration`, `optimization_mode`
 
168
 
169
  ### Выход (`OptimizerResponse`)
170
  - `optimized_text`
@@ -172,6 +173,7 @@
172
  - `iterations[]` (подробный лог шагов)
173
  - `applied_changes`
174
  - `optimization_mode`
 
175
  - `error` (если есть)
176
 
177
  ---
@@ -428,7 +430,10 @@ HTML extraction pipeline:
428
  - учитывает `cascade_level` и тип операции (`rewrite`/`insert`)
429
  - явно требует грамматически корректный и естественный текст
430
  - ограничивает число предложений по уровню
431
- - для BERT допускает 2 валидные схемы: exact phrase один раз **или** естественное разнесённое использование core-термов (`mbit`, `alternatives`) в одном абзаце.
 
 
 
432
  - для `rewrite` явно требует сохранить исходный смысл `sentence-by-sentence` и не менять субъект/ключевую сущность без необходимости.
433
 
434
  ### Применение правок
@@ -448,6 +453,9 @@ HTML extraction pipeline:
448
  - hard constraints (не ухудшать критичные метрики сверх допустимого);
449
  - режимы `conservative/balanced/aggressive` задают пороги регрессии;
450
  - решение учитывает и `goal_improved`, и общий `delta_score`.
 
 
 
451
 
452
  ### Главная функция `optimize_text`
453
  Итерационный цикл:
 
164
  ### Вход (`OptimizerRequest`)
165
  - аналитические данные: `target_text`, `competitors`, `keywords`, `language`, `target_title`, `competitor_titles`
166
  - LLM: `api_key`, `api_base_url`, `model`, `temperature`
167
+ - стратегия: `max_iterations`, `candidates_per_iteration`, `optimization_mode`, `phrase_strategy_mode`
168
+ - `phrase_strategy_mode`: `auto | distributed_preferred | exact_preferred`
169
 
170
  ### Выход (`OptimizerResponse`)
171
  - `optimized_text`
 
173
  - `iterations[]` (подробный лог шагов)
174
  - `applied_changes`
175
  - `optimization_mode`
176
+ - `phrase_strategy_mode`
177
  - `error` (если есть)
178
 
179
  ---
 
430
  - учитывает `cascade_level` и тип операции (`rewrite`/`insert`)
431
  - явно требует грамматически корректный и естественный текст
432
  - ограничивает число предложений по уровню
433
+ - для BERT динамически выбирает стратегию по длине целевой фразы:
434
+ - короткие цели: допустим один natural exact match;
435
+ - длинные multi-word цели: приоритет у distributed semantic coverage (части фразы/леммы/близкие формулировки), без forced exact match.
436
+ - exact phrase не должен повторяться: при неестественном звучании он запрещается в пользу распределённой формулировки.
437
  - для `rewrite` явно требует сохранить исходный смысл `sentence-by-sentence` и не менять субъект/ключевую сущность без необходимости.
438
 
439
  ### Применение правок
 
453
  - hard constraints (не ухудшать критичные метрики сверх допустимого);
454
  - режимы `conservative/balanced/aggressive` задают пороги регрессии;
455
  - решение учитывает и `goal_improved`, и общий `delta_score`.
456
+ - `_validate_candidate_text`:
457
+ - отклоняет некачественные/спамные кандидаты (дубли слов/сущностей, подозрительные склейки токенов);
458
+ - добавляет anti-stuffing фильтр для цели BERT (повторы exact phrase и чрезмерные повторы focus-термов).
459
 
460
  ### Главная функция `optimize_text`
461
  Итерационный цикл:
models.py CHANGED
@@ -91,6 +91,7 @@ class OptimizerRequest(BaseModel):
91
  candidates_per_iteration: int = 2
92
  temperature: float = 0.25
93
  optimization_mode: str = "balanced"
 
94
 
95
 
96
  class OptimizerResponse(BaseModel):
@@ -101,4 +102,5 @@ class OptimizerResponse(BaseModel):
101
  iterations: List[Dict[str, Any]] = Field(default_factory=list)
102
  applied_changes: int = 0
103
  optimization_mode: str = "balanced"
 
104
  error: str = ""
 
91
  candidates_per_iteration: int = 2
92
  temperature: float = 0.25
93
  optimization_mode: str = "balanced"
94
+ phrase_strategy_mode: str = "auto" # auto | exact_preferred | distributed_preferred
95
 
96
 
97
  class OptimizerResponse(BaseModel):
 
102
  iterations: List[Dict[str, Any]] = Field(default_factory=list)
103
  applied_changes: int = 0
104
  optimization_mode: str = "balanced"
105
+ phrase_strategy_mode: str = "auto"
106
  error: str = ""
optimizer.py CHANGED
@@ -60,7 +60,13 @@ def _max_sentences_for_level(cascade_level: int, operation: str) -> int:
60
  return 4
61
 
62
 
63
- def _validate_candidate_text(edited_text: str, cascade_level: int, operation: str) -> List[str]:
 
 
 
 
 
 
64
  reasons: List[str] = []
65
  text = (edited_text or "").strip()
66
  if not text:
@@ -79,6 +85,29 @@ def _validate_candidate_text(edited_text: str, cascade_level: int, operation: st
79
  if re.search(r"\b[a-z]{6,}[A-Z][a-z]+\b", text):
80
  reasons.append("suspicious_token_join")
81
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
82
  return reasons
83
 
84
 
@@ -685,6 +714,7 @@ def _llm_edit_chunk(
685
  focus_terms: List[str],
686
  avoid_terms: List[str],
687
  temperature: float,
 
688
  ) -> Dict[str, Any]:
689
  endpoint = base_url.rstrip("/") + "/chat/completions"
690
  op = operation if operation in {"rewrite", "insert"} else "rewrite"
@@ -701,17 +731,45 @@ def _llm_edit_chunk(
701
  else "Create a short bridge chunk (1-2 sentences) to insert after the chunk."
702
  )
703
  max_sent = _max_sentences_for_level(cascade_level, op)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
704
  user_msg = (
705
  f"Language: {language}\n"
706
  f"Operation: {op}\n"
707
  f"Cascade level: L{cascade_level}\n"
708
  f"Goal: {goal_type} ({goal_label})\n"
 
709
  f"Instruction: {op_instruction}\n"
710
  f"Must preserve overall narrative and style.\n"
711
  "Text must be grammatically correct and natural for native readers.\n"
712
  "Keep edits tightly local to the provided chunk and immediate context only.\n"
713
  "Edit must be substantive (not just synonyms) and should increase relevance to the goal phrase.\n"
714
  "Do not change the sentence subject/entity focus unless absolutely required by grammar.\n"
 
715
  f"Focus terms to strengthen: {', '.join(focus_terms) if focus_terms else '-'}\n"
716
  f"Terms to de-emphasize/avoid overuse: {', '.join(avoid_terms) if avoid_terms else '-'}\n\n"
717
  f"Chunk to edit/expand:\n{chunk_text}\n\n"
@@ -722,11 +780,13 @@ def _llm_edit_chunk(
722
  "2) Keep local coherence with surrounding text.\n"
723
  f"3) Max {max_sent} sentence(s) in edited_text.\n"
724
  "4) Keep key named entities from the original chunk unchanged when possible.\n"
725
- "5) For BERT goal, improve semantic match to goal phrase without keyword stuffing.\n"
726
- "6) For BERT goals you may use either: (a) exact phrase once, or (b) natural distributed use of core terms in one paragraph.\n"
727
- "7) For rewrite: preserve original meaning sentence-by-sentence while improving relevance.\n"
728
- "8) Provide rationale in one short sentence.\n"
729
- "9) Only output JSON object."
 
 
730
  )
731
  payload = {
732
  "model": model,
@@ -764,6 +824,9 @@ def _llm_edit_chunk(
764
  "goal_label": goal_label,
765
  "focus_terms": focus_terms,
766
  "avoid_terms": avoid_terms,
 
 
 
767
  "max_sentences": max_sent,
768
  "chunk_text": chunk_text,
769
  "context_before": context_before,
@@ -930,6 +993,9 @@ def optimize_text(request_data: Dict[str, Any]) -> Dict[str, Any]:
930
  candidates_per_iteration = max(1, min(5, candidates_per_iteration))
931
  temperature = float(request_data.get("temperature", 0.25) or 0.25)
932
  optimization_mode = str(request_data.get("optimization_mode", "balanced") or "balanced")
 
 
 
933
 
934
  baseline_analysis = _build_analysis_snapshot(
935
  target_text, competitors, keywords, language, target_title, competitor_titles
@@ -1022,6 +1088,7 @@ def optimize_text(request_data: Dict[str, Any]) -> Dict[str, Any]:
1022
  focus_terms=goal["focus_terms"],
1023
  avoid_terms=goal["avoid_terms"],
1024
  temperature=temp,
 
1025
  )
1026
  edited_text = str((llm_result or {}).get("edited_text", "")).strip()
1027
  llm_rationale = str((llm_result or {}).get("rationale", "")).strip()
@@ -1029,7 +1096,13 @@ def optimize_text(request_data: Dict[str, Any]) -> Dict[str, Any]:
1029
  if not edited_text or edited_text == original_span_text:
1030
  continue
1031
 
1032
- quality_issues = _validate_candidate_text(edited_text, cascade_level, operation)
 
 
 
 
 
 
1033
  before_rel, after_rel = _chunk_relevance_pair(
1034
  original_span_text,
1035
  edited_text,
@@ -1576,4 +1649,5 @@ def optimize_text(request_data: Dict[str, Any]) -> Dict[str, Any]:
1576
  "iterations": logs,
1577
  "applied_changes": applied_changes,
1578
  "optimization_mode": optimization_mode,
 
1579
  }
 
60
  return 4
61
 
62
 
63
+ def _validate_candidate_text(
64
+ edited_text: str,
65
+ cascade_level: int,
66
+ operation: str,
67
+ goal_label: str = "",
68
+ focus_terms: Optional[List[str]] = None,
69
+ ) -> List[str]:
70
  reasons: List[str] = []
71
  text = (edited_text or "").strip()
72
  if not text:
 
85
  if re.search(r"\b[a-z]{6,}[A-Z][a-z]+\b", text):
86
  reasons.append("suspicious_token_join")
87
 
88
+ # Anti-stuffing checks for BERT phrase goals.
89
+ focus_terms = focus_terms or []
90
+ phrase = (goal_label or "").strip().lower()
91
+ normalized = re.sub(r"\s+", " ", text.lower())
92
+
93
+ if phrase:
94
+ phrase_occurrences = normalized.count(phrase)
95
+ phrase_token_count = len(_tokenize(phrase))
96
+ # For long goal phrases, repeated exact matches are usually unnatural.
97
+ if phrase_token_count >= 3 and phrase_occurrences > 1:
98
+ reasons.append("exact_phrase_stuffing")
99
+ elif phrase_occurrences > 2:
100
+ reasons.append("exact_phrase_stuffing")
101
+
102
+ for term in focus_terms:
103
+ tok = (term or "").strip().lower()
104
+ if not tok:
105
+ continue
106
+ term_occurrences = len(re.findall(rf"\b{re.escape(tok)}\b", normalized))
107
+ if term_occurrences > 3:
108
+ reasons.append("focus_term_overuse")
109
+ break
110
+
111
  return reasons
112
 
113
 
 
714
  focus_terms: List[str],
715
  avoid_terms: List[str],
716
  temperature: float,
717
+ phrase_strategy_mode: str = "auto",
718
  ) -> Dict[str, Any]:
719
  endpoint = base_url.rstrip("/") + "/chat/completions"
720
  op = operation if operation in {"rewrite", "insert"} else "rewrite"
 
731
  else "Create a short bridge chunk (1-2 sentences) to insert after the chunk."
732
  )
733
  max_sent = _max_sentences_for_level(cascade_level, op)
734
+ phrase_tokens = _filter_stopwords(_tokenize(goal_label or ""), language)
735
+ phrase_len = len(phrase_tokens)
736
+ strategy_mode = (phrase_strategy_mode or "auto").strip().lower()
737
+ if strategy_mode not in {"auto", "exact_preferred", "distributed_preferred"}:
738
+ strategy_mode = "auto"
739
+ if strategy_mode == "exact_preferred":
740
+ phrase_strategy = (
741
+ "Prefer one natural exact phrase mention when grammatically correct; otherwise use distributed core-term coverage."
742
+ )
743
+ elif strategy_mode == "distributed_preferred":
744
+ phrase_strategy = (
745
+ "Prefer distributed semantic coverage: spread core terms/lemmas naturally and avoid exact phrase unless absolutely natural."
746
+ )
747
+ elif phrase_len >= 3:
748
+ phrase_strategy = (
749
+ "Prefer distributed semantic coverage for long phrases: naturally spread core terms/lemmas across the local paragraph. "
750
+ "Use exact phrase only if it is grammatically natural."
751
+ )
752
+ elif phrase_len == 2:
753
+ phrase_strategy = (
754
+ "For two-term goals, use either one natural exact phrase or distributed use of both terms without repetition."
755
+ )
756
+ else:
757
+ phrase_strategy = (
758
+ "For single-term goals, improve relevance using natural lexical variants and nearby semantic anchors."
759
+ )
760
  user_msg = (
761
  f"Language: {language}\n"
762
  f"Operation: {op}\n"
763
  f"Cascade level: L{cascade_level}\n"
764
  f"Goal: {goal_type} ({goal_label})\n"
765
+ f"Goal token count (without stopwords): {phrase_len}\n"
766
  f"Instruction: {op_instruction}\n"
767
  f"Must preserve overall narrative and style.\n"
768
  "Text must be grammatically correct and natural for native readers.\n"
769
  "Keep edits tightly local to the provided chunk and immediate context only.\n"
770
  "Edit must be substantive (not just synonyms) and should increase relevance to the goal phrase.\n"
771
  "Do not change the sentence subject/entity focus unless absolutely required by grammar.\n"
772
+ f"Phrase strategy: {phrase_strategy}\n"
773
  f"Focus terms to strengthen: {', '.join(focus_terms) if focus_terms else '-'}\n"
774
  f"Terms to de-emphasize/avoid overuse: {', '.join(avoid_terms) if avoid_terms else '-'}\n\n"
775
  f"Chunk to edit/expand:\n{chunk_text}\n\n"
 
780
  "2) Keep local coherence with surrounding text.\n"
781
  f"3) Max {max_sent} sentence(s) in edited_text.\n"
782
  "4) Keep key named entities from the original chunk unchanged when possible.\n"
783
+ "5) For BERT goals, prioritize semantic alignment over exact phrase repetition.\n"
784
+ "6) If exact phrase sounds unnatural, do NOT force it; use grammatically correct distributed wording.\n"
785
+ "7) Exact phrase may appear at most once, and only when it reads naturally.\n"
786
+ "8) Avoid repeating the same focus term more than needed; no stuffing.\n"
787
+ "9) For rewrite: preserve original meaning sentence-by-sentence while improving relevance.\n"
788
+ "10) Provide rationale in one short sentence.\n"
789
+ "11) Only output JSON object."
790
  )
791
  payload = {
792
  "model": model,
 
824
  "goal_label": goal_label,
825
  "focus_terms": focus_terms,
826
  "avoid_terms": avoid_terms,
827
+ "phrase_strategy_mode": strategy_mode,
828
+ "goal_token_count": phrase_len,
829
+ "phrase_strategy": phrase_strategy,
830
  "max_sentences": max_sent,
831
  "chunk_text": chunk_text,
832
  "context_before": context_before,
 
993
  candidates_per_iteration = max(1, min(5, candidates_per_iteration))
994
  temperature = float(request_data.get("temperature", 0.25) or 0.25)
995
  optimization_mode = str(request_data.get("optimization_mode", "balanced") or "balanced")
996
+ phrase_strategy_mode = str(request_data.get("phrase_strategy_mode", "auto") or "auto").strip().lower()
997
+ if phrase_strategy_mode not in {"auto", "exact_preferred", "distributed_preferred"}:
998
+ phrase_strategy_mode = "auto"
999
 
1000
  baseline_analysis = _build_analysis_snapshot(
1001
  target_text, competitors, keywords, language, target_title, competitor_titles
 
1088
  focus_terms=goal["focus_terms"],
1089
  avoid_terms=goal["avoid_terms"],
1090
  temperature=temp,
1091
+ phrase_strategy_mode=phrase_strategy_mode,
1092
  )
1093
  edited_text = str((llm_result or {}).get("edited_text", "")).strip()
1094
  llm_rationale = str((llm_result or {}).get("rationale", "")).strip()
 
1096
  if not edited_text or edited_text == original_span_text:
1097
  continue
1098
 
1099
+ quality_issues = _validate_candidate_text(
1100
+ edited_text,
1101
+ cascade_level,
1102
+ operation,
1103
+ goal_label=goal.get("label", ""),
1104
+ focus_terms=goal.get("focus_terms", []) or [],
1105
+ )
1106
  before_rel, after_rel = _chunk_relevance_pair(
1107
  original_span_text,
1108
  edited_text,
 
1649
  "iterations": logs,
1650
  "applied_changes": applied_changes,
1651
  "optimization_mode": optimization_mode,
1652
+ "phrase_strategy_mode": phrase_strategy_mode,
1653
  }
templates/index.html CHANGED
@@ -310,6 +310,14 @@
310
  <option value="aggressive">Aggressive</option>
311
  </select>
312
  </div>
 
 
 
 
 
 
 
 
313
  </div>
314
  <div class="d-flex gap-2 mt-3">
315
  <button class="btn btn-dark" onclick="runLlmOptimization()">Запустить оптимизацию</button>
@@ -545,7 +553,8 @@
545
  optimizer_iterations: Number(document.getElementById('optimizerIterations').value || 2),
546
  optimizer_candidates: Number(document.getElementById('optimizerCandidates').value || 2),
547
  optimizer_temperature: Number(document.getElementById('optimizerTemp').value || 0.25),
548
- optimizer_mode: document.getElementById('optimizerMode').value
 
549
  },
550
  state: {
551
  analysis_result: currentData,
@@ -596,6 +605,7 @@
596
  document.getElementById('optimizerCandidates').value = 2;
597
  document.getElementById('optimizerTemp').value = 0.25;
598
  document.getElementById('optimizerMode').value = 'balanced';
 
599
 
600
  // Competitor text fields
601
  const competitorsList = document.getElementById('competitorsList');
@@ -649,6 +659,7 @@
649
  document.getElementById('optimizerCandidates').value = inp.optimizer_candidates ?? 2;
650
  document.getElementById('optimizerTemp').value = inp.optimizer_temperature ?? 0.25;
651
  document.getElementById('optimizerMode').value = inp.optimizer_mode || 'balanced';
 
652
 
653
  // Title character counter refresh
654
  const titleLen = (inp.target_title || '').length;
@@ -924,7 +935,10 @@
924
  <div class="stat-card">
925
  <h6 class="card-title">Результат оптимизации</h6>
926
  <div class="small mb-2">Применено правок: <strong>${data.applied_changes || 0}</strong></div>
927
- <div class="small mb-2">Режим: <strong>${data.optimization_mode || 'balanced'}</strong></div>
 
 
 
928
  <div class="table-responsive">
929
  <table class="table table-sm table-bordered mb-0">
930
  <thead class="table-light"><tr><th>Метрика</th><th>До</th><th>После</th></tr></thead>
@@ -981,7 +995,8 @@
981
  max_iterations: Number(document.getElementById('optimizerIterations').value || 2),
982
  candidates_per_iteration: Number(document.getElementById('optimizerCandidates').value || 2),
983
  temperature: Number(document.getElementById('optimizerTemp').value || 0.25),
984
- optimization_mode: document.getElementById('optimizerMode').value || 'balanced'
 
985
  };
986
 
987
  document.getElementById('loader').style.display = 'block';
 
310
  <option value="aggressive">Aggressive</option>
311
  </select>
312
  </div>
313
+ <div class="col-md-3">
314
+ <label class="form-label small text-muted mb-1">Phrase Strategy</label>
315
+ <select id="optimizerPhraseStrategy" class="form-select">
316
+ <option value="auto" selected>Auto</option>
317
+ <option value="distributed_preferred">Distributed preferred</option>
318
+ <option value="exact_preferred">Exact phrase preferred</option>
319
+ </select>
320
+ </div>
321
  </div>
322
  <div class="d-flex gap-2 mt-3">
323
  <button class="btn btn-dark" onclick="runLlmOptimization()">Запустить оптимизацию</button>
 
553
  optimizer_iterations: Number(document.getElementById('optimizerIterations').value || 2),
554
  optimizer_candidates: Number(document.getElementById('optimizerCandidates').value || 2),
555
  optimizer_temperature: Number(document.getElementById('optimizerTemp').value || 0.25),
556
+ optimizer_mode: document.getElementById('optimizerMode').value,
557
+ optimizer_phrase_strategy: document.getElementById('optimizerPhraseStrategy').value
558
  },
559
  state: {
560
  analysis_result: currentData,
 
605
  document.getElementById('optimizerCandidates').value = 2;
606
  document.getElementById('optimizerTemp').value = 0.25;
607
  document.getElementById('optimizerMode').value = 'balanced';
608
+ document.getElementById('optimizerPhraseStrategy').value = 'auto';
609
 
610
  // Competitor text fields
611
  const competitorsList = document.getElementById('competitorsList');
 
659
  document.getElementById('optimizerCandidates').value = inp.optimizer_candidates ?? 2;
660
  document.getElementById('optimizerTemp').value = inp.optimizer_temperature ?? 0.25;
661
  document.getElementById('optimizerMode').value = inp.optimizer_mode || 'balanced';
662
+ document.getElementById('optimizerPhraseStrategy').value = inp.optimizer_phrase_strategy || 'auto';
663
 
664
  // Title character counter refresh
665
  const titleLen = (inp.target_title || '').length;
 
935
  <div class="stat-card">
936
  <h6 class="card-title">Результат оптимизации</h6>
937
  <div class="small mb-2">Применено правок: <strong>${data.applied_changes || 0}</strong></div>
938
+ <div class="small mb-2">
939
+ Режим: <strong>${data.optimization_mode || 'balanced'}</strong>
940
+ · Phrase Strategy: <strong>${data.phrase_strategy_mode || 'auto'}</strong>
941
+ </div>
942
  <div class="table-responsive">
943
  <table class="table table-sm table-bordered mb-0">
944
  <thead class="table-light"><tr><th>Метрика</th><th>До</th><th>После</th></tr></thead>
 
995
  max_iterations: Number(document.getElementById('optimizerIterations').value || 2),
996
  candidates_per_iteration: Number(document.getElementById('optimizerCandidates').value || 2),
997
  temperature: Number(document.getElementById('optimizerTemp').value || 0.25),
998
+ optimization_mode: document.getElementById('optimizerMode').value || 'balanced',
999
+ phrase_strategy_mode: document.getElementById('optimizerPhraseStrategy').value || 'auto'
1000
  };
1001
 
1002
  document.getElementById('loader').style.display = 'block';