Space: lsdf/ai-seo-analyzer
lsdf committed on Mar 17

Commit aa803bd (1 parent: f65551f)

Refine LLM prompt for semantic-vector chunk optimization.

Update optimizer prompts to enforce subject preservation and sentence-by-sentence meaning retention for rewrites, allow exact or distributed core-term usage for BERT goals, and collect one-line rationale from the model. Surface rationale in debug UI and synchronize full documentation.

Made-with: Cursor

Files changed (3)

docs/FULL_FUNCTIONAL_DOCUMENTATION.md +6 -1
optimizer.py +46 -5
templates/index.html +2 -0

docs/FULL_FUNCTIONAL_DOCUMENTATION.md CHANGED

@@ -424,9 +424,12 @@ HTML extraction pipeline:
 ### Candidate generation
 - `_llm_edit_chunk`: sends a structured prompt to an OpenAI-compatible API.
   - takes `cascade_level` and the operation type (`rewrite`/`insert`) into account
   - explicitly requires grammatically correct, natural text
   - limits the number of sentences per level
 ### Applying edits
 - `_replace_span`: replaces a range of sentences.
@@ -448,6 +451,7 @@ HTML extraction pipeline:
 3. select the chunk pool and the cascade operation.
    - several span candidates are selected per step (multi-chunk selection) rather than one;
    - ranking takes into account `focus_terms/avoid_terms`, chunk-level relevance, and noise heuristics (menu/CTA/header penalties);
    - a per-goal `attempt_cursor` and `attempted_spans` are used to avoid looping over the same region.
 4. generate `N` candidates for each selected span.
 5. pre-validation (format/quality/lengths).
@@ -472,7 +476,8 @@ HTML extraction pipeline:
    - `L4`: a wider window rewrite (up to 5 sentences, with variable coverage).
 11. keep a detailed log for each candidate.
    - the debug table records chunk-level signals (`local+`, `chunk Δ`, `rel before->after`) alongside global ones (`Δ score`, `valid`, `goal+`);
-   - `llm_prompt_debug` is stored for each candidate (operation, goal, focus terms, chunk, and nearest context), which makes it possible to analyze the actual LLM input.
    - `metrics_delta` is also stored (the contribution of BM25/BERT/Semantic/N-gram/Title to the overall shift), including `semantic_gap_sum` and the change in the set of gap terms (`semantic_gap_terms_added/removed`), to show what makes `score` fall or rise.
 ---

 ### Candidate generation
 - `_llm_edit_chunk`: sends a structured prompt to an OpenAI-compatible API.
+  - model role in the prompt: **semantic-vector optimizer for SEO**, not a generic “copy editor”.
   - takes `cascade_level` and the operation type (`rewrite`/`insert`) into account
   - explicitly requires grammatically correct, natural text
   - limits the number of sentences per level
+  - for BERT, allows 2 valid schemes: the exact phrase once **or** natural, distributed use of the core terms (`mbit`, `alternatives`) within one paragraph.
+  - for `rewrite`, explicitly requires preserving the original meaning `sentence-by-sentence` and not changing the subject/key entity unless necessary.
 ### Applying edits
 - `_replace_span`: replaces a range of sentences.
 3. select the chunk pool and the cascade operation.
    - several span candidates are selected per step (multi-chunk selection) rather than one;
    - ranking takes into account `focus_terms/avoid_terms`, chunk-level relevance, and noise heuristics (menu/CTA/header penalties);
+   - for BERT goals, ranking is not limited to regions with already-present occurrences: relevant regions with underrepresented core terms, where they can be added naturally, are additionally prioritized;
    - a per-goal `attempt_cursor` and `attempted_spans` are used to avoid looping over the same region.
 4. generate `N` candidates for each selected span.
 5. pre-validation (format/quality/lengths).
    - `L4`: a wider window rewrite (up to 5 sentences, with variable coverage).
 11. keep a detailed log for each candidate.
    - the debug table records chunk-level signals (`local+`, `chunk Δ`, `rel before->after`) alongside global ones (`Δ score`, `valid`, `goal+`);
+   - `llm_prompt_debug` is stored for each candidate (operation, goal, focus terms, chunk, and nearest context), which makes it possible to analyze the actual LLM input;
+   - the LLM returns a `rationale` field (1 line): a brief explanation of why the edit should increase relevance to the goal.
    - `metrics_delta` is also stored (the contribution of BM25/BERT/Semantic/N-gram/Title to the overall shift), including `semantic_gap_sum` and the change in the set of gap terms (`semantic_gap_terms_added/removed`), to show what makes `score` fall or rise.
 ---
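
The two BERT usage schemes named in the documentation change (the exact phrase once, or distributed use of the core terms) can be illustrated with a small check. This is a minimal sketch and not code from the repository: `classify_bert_usage`, its regex tokenizer, and the labels it returns are hypothetical and chosen only for illustration.

```python
import re
from typing import List


def _tokens(text: str) -> List[str]:
    """Lowercase word tokens (a simplification; the real tokenizer is not shown in this commit)."""
    return re.findall(r"\w+", text.lower())


def classify_bert_usage(text: str, goal_phrase: str, core_terms: List[str]) -> str:
    """Hypothetical helper: label how a paragraph uses the goal phrase.

    Returns "exact" (phrase appears exactly once), "overused" (more than once),
    "distributed" (every core term appears somewhere, but not as the phrase),
    or "none".
    """
    words = _tokens(text)
    phrase = _tokens(goal_phrase)
    n = len(phrase)
    # Count token-level occurrences of the whole phrase.
    occurrences = (
        sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == phrase)
        if n
        else 0
    )
    if occurrences == 1:
        return "exact"
    if occurrences > 1:
        return "overused"
    present = set(words)
    if core_terms and all(t.lower() in present for t in core_terms):
        return "distributed"
    return "none"
```

With the goal phrase `mbit alternatives`, a paragraph that mentions `MBIT` in one clause and `alternatives` in another would be labeled `distributed` by this sketch, while a paragraph containing the phrase verbatim once would be labeled `exact`.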

optimizer.py CHANGED

@@ -376,6 +376,7 @@ def _rank_sentence_indices(
         return [0]
     stop = STOP_WORDS.get(language, STOP_WORDS["en"])
     focus = [x for x in focus_terms if x and x not in stop]
     avoid = [x for x in avoid_terms if x]
     center = (len(sentences) - 1) / 2.0
@@ -394,8 +395,30 @@ def _rank_sentence_indices(
         avoid_score = sum(lower.count(t.lower()) for t in avoid)
         chunk_rel = _chunk_goal_relevance(s, goal_type, goal_label, focus_terms, language)
         noise_penalty = 1.0 if _is_noise_like_sentence(s) else 0.0
-        # Prefer semantically relevant and lexical matches; push noisy headers/CTA lower.
-        score = (chunk_rel * 4.0) + (focus_score * 3.0) + (avoid_score * 2.0) - (noise_penalty * 3.0) - (abs(idx - center) * 0.05)
         scored.append((idx, score, len(s)))
     scored.sort(key=lambda x: (x[1], -x[2]), reverse=True)
@@ -666,8 +689,10 @@ def _llm_edit_chunk(
     endpoint = base_url.rstrip("/") + "/chat/completions"
     op = operation if operation in {"rewrite", "insert"} else "rewrite"
     system_msg = (
-        "You are an SEO copy editor. Work locally, preserve narrative flow, factual tone, and language. "
-        "Return strict JSON only: {\"edited_text\": \"...\"}. "
         "Do not rewrite the whole text. Never change topic or introduce unrelated entities."
     )
     op_instruction = (
@@ -686,6 +711,7 @@ def _llm_edit_chunk(
         "Text must be grammatically correct and natural for native readers.\n"
         "Keep edits tightly local to the provided chunk and immediate context only.\n"
         "Edit must be substantive (not just synonyms) and should increase relevance to the goal phrase.\n"
         f"Focus terms to strengthen: {', '.join(focus_terms) if focus_terms else '-'}\n"
         f"Terms to de-emphasize/avoid overuse: {', '.join(avoid_terms) if avoid_terms else '-'}\n\n"
         f"Chunk to edit/expand:\n{chunk_text}\n\n"
@@ -697,7 +723,10 @@ def _llm_edit_chunk(
         f"3) Max {max_sent} sentence(s) in edited_text.\n"
         "4) Keep key named entities from the original chunk unchanged when possible.\n"
         "5) For BERT goal, improve semantic match to goal phrase without keyword stuffing.\n"
-        "6) Only output JSON object."
     )
     payload = {
         "model": model,
@@ -719,12 +748,15 @@ def _llm_edit_chunk(
     )
     parsed = _extract_json_object(content)
     edited = ""
     if parsed:
         edited = str(parsed.get("edited_text") or parsed.get("revised_sentence") or parsed.get("rewrite") or "").strip()
     if not edited:
         raise ValueError("LLM returned invalid JSON edit payload.")
     return {
         "edited_text": edited,
         "prompt_debug": {
             "operation": op,
             "cascade_level": cascade_level,
@@ -940,6 +972,7 @@ def optimize_text(request_data: Dict[str, Any]) -> Dict[str, Any]:
                         temperature=temp,
                     )
                     edited_text = str((llm_result or {}).get("edited_text", "")).strip()
                     prompt_debug = (llm_result or {}).get("prompt_debug", {})
                     if not edited_text or edited_text == original_span_text:
                         continue
@@ -980,6 +1013,7 @@ def optimize_text(request_data: Dict[str, Any]) -> Dict[str, Any]:
                                 "chunk_relevance_after": after_rel,
                                 "term_diff": _term_diff(original_span_text, edited_text, language),
                                 "llm_prompt_debug": prompt_debug,
                                 "operation": operation,
                                 "sentence_index": sent_idx,
                                 "span_start": span_start,
@@ -1007,6 +1041,7 @@ def optimize_text(request_data: Dict[str, Any]) -> Dict[str, Any]:
                                 "chunk_relevance_after": after_rel,
                                 "term_diff": _term_diff(original_span_text, edited_text, language),
                                 "llm_prompt_debug": prompt_debug,
                                 "operation": operation,
                                 "sentence_index": sent_idx,
                                 "span_start": span_start,
@@ -1058,6 +1093,7 @@ def optimize_text(request_data: Dict[str, Any]) -> Dict[str, Any]:
                             "chunk_relevance_after": after_rel,
                             "term_diff": _term_diff(original_span_text, edited_text, language),
                             "llm_prompt_debug": prompt_debug,
                             "invalid_reasons": invalid_reasons,
                             "delta_score": delta_score,
                             "candidate_score": cand_metrics.get("score"),
@@ -1088,6 +1124,7 @@ def optimize_text(request_data: Dict[str, Any]) -> Dict[str, Any]:
                                 "goal_type": goal.get("type"),
                                 "goal_label": goal.get("label"),
                             },
                             "operation": operation,
                             "sentence_index": sent_idx,
                             "span_start": span_start,
@@ -1178,6 +1215,7 @@ def optimize_text(request_data: Dict[str, Any]) -> Dict[str, Any]:
                                 "chunk_relevance_after": c.get("chunk_relevance_after"),
                                 "term_diff": c.get("term_diff"),
                                 "llm_prompt_debug": c.get("llm_prompt_debug"),
                                 "metrics_delta": c.get("metrics_delta"),
                                 "invalid_reasons": c.get("invalid_reasons", []),
                                 "delta_score": c.get("delta_score"),
@@ -1324,6 +1362,7 @@ def optimize_text(request_data: Dict[str, Any]) -> Dict[str, Any]:
                                 "chunk_relevance_after": c.get("chunk_relevance_after"),
                                 "term_diff": c.get("term_diff"),
                                 "llm_prompt_debug": c.get("llm_prompt_debug"),
                                 "metrics_delta": c.get("metrics_delta"),
                                 "invalid_reasons": c.get("invalid_reasons", []),
                                 "delta_score": c.get("delta_score"),
@@ -1372,6 +1411,7 @@ def optimize_text(request_data: Dict[str, Any]) -> Dict[str, Any]:
                             "chunk_relevance_after": c.get("chunk_relevance_after"),
                             "term_diff": c.get("term_diff"),
                             "llm_prompt_debug": c.get("llm_prompt_debug"),
                             "metrics_delta": c.get("metrics_delta"),
                             "invalid_reasons": c.get("invalid_reasons", []),
                             "delta_score": c.get("delta_score"),
@@ -1439,6 +1479,7 @@ def optimize_text(request_data: Dict[str, Any]) -> Dict[str, Any]:
                         "chunk_relevance_after": c.get("chunk_relevance_after"),
                         "term_diff": c.get("term_diff"),
                         "llm_prompt_debug": c.get("llm_prompt_debug"),
                         "metrics_delta": c.get("metrics_delta"),
                         "invalid_reasons": c.get("invalid_reasons", []),
                         "delta_score": c.get("delta_score"),

         return [0]
     stop = STOP_WORDS.get(language, STOP_WORDS["en"])
     focus = [x for x in focus_terms if x and x not in stop]
+    goal_phrase = (goal_label or "").strip().lower()
     avoid = [x for x in avoid_terms if x]
     center = (len(sentences) - 1) / 2.0
         avoid_score = sum(lower.count(t.lower()) for t in avoid)
         chunk_rel = _chunk_goal_relevance(s, goal_type, goal_label, focus_terms, language)
         noise_penalty = 1.0 if _is_noise_like_sentence(s) else 0.0
+        # For BERT goals, do not over-focus on existing occurrences only:
+        # prioritize semantically relevant chunks where phrase/terms may still be underrepresented.
+        if goal_type == "bert":
+            tokenized = _filter_stopwords(_tokenize(lower), language)
+            token_set = set(tokenized)
+            core_terms = [t.lower() for t in focus if t]
+            core_hits = sum(1 for t in core_terms if t in token_set)
+            coverage = (core_hits / max(1, len(core_terms))) if core_terms else 0.0
+            phrase_present = 1.0 if (goal_phrase and goal_phrase in lower) else 0.0
+            # Boost candidates where semantic context is relevant but explicit core terms are not saturated yet.
+            missing_term_boost = (1.0 - coverage) * 1.4 if chunk_rel >= 0.18 else 0.0
+            phrase_absent_boost = 0.35 if (goal_phrase and not phrase_present and chunk_rel >= 0.2) else 0.0
+            score = (
+                (chunk_rel * 5.0)
+                + (focus_score * 1.2)
+                + (missing_term_boost + phrase_absent_boost)
+                + (avoid_score * 1.5)
+                - (noise_penalty * 3.0)
+                - (abs(idx - center) * 0.05)
+            )
+        else:
+            # Prefer semantically relevant and lexical matches; push noisy headers/CTA lower.
+            score = (chunk_rel * 4.0) + (focus_score * 3.0) + (avoid_score * 2.0) - (noise_penalty * 3.0) - (abs(idx - center) * 0.05)
         scored.append((idx, score, len(s)))
     scored.sort(key=lambda x: (x[1], -x[2]), reverse=True)
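
The effect of the BERT branch above is easier to see with the arithmetic isolated. The sketch below copies the weights from the diff into two standalone functions. The function names and the idea of passing precomputed signals (`chunk_rel`, `focus_score`, `coverage`, and so on) as arguments are assumptions made for illustration, and the goal phrase is assumed to be non-empty.

```python
def generic_rank_score(chunk_rel, focus_score, avoid_score, is_noise, idx, center):
    """Non-BERT formula from the diff: lexical matches carry a weight of 3.0."""
    noise_penalty = 1.0 if is_noise else 0.0
    return (
        (chunk_rel * 4.0)
        + (focus_score * 3.0)
        + (avoid_score * 2.0)
        - (noise_penalty * 3.0)
        - (abs(idx - center) * 0.05)
    )


def bert_rank_score(chunk_rel, focus_score, avoid_score, coverage,
                    phrase_present, is_noise, idx, center):
    """BERT formula from the diff, with the boosts computed from plain arguments.

    `coverage` is the share of core terms already present in the sentence (0.0 to 1.0).
    """
    noise_penalty = 1.0 if is_noise else 0.0
    # Boosts apply only when the chunk is already semantically relevant.
    missing_term_boost = (1.0 - coverage) * 1.4 if chunk_rel >= 0.18 else 0.0
    phrase_absent_boost = 0.35 if (not phrase_present and chunk_rel >= 0.2) else 0.0
    return (
        (chunk_rel * 5.0)
        + (focus_score * 1.2)
        + (missing_term_boost + phrase_absent_boost)
        + (avoid_score * 1.5)
        - (noise_penalty * 3.0)
        - (abs(idx - center) * 0.05)
    )
```

Take two equally relevant sentences (`chunk_rel = 0.5`, no avoid terms, no noise, both at the center index). One has no core terms (`focus_score = 0`, `coverage = 0.0`); the other has both terms and the exact phrase (`focus_score = 2`, `coverage = 1.0`). The generic formula separates them by 6.0 points, while the BERT formula separates them by roughly 0.65, so a relevant sentence that still lacks the terms stays competitive as an edit target.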
     endpoint = base_url.rstrip("/") + "/chat/completions"
     op = operation if operation in {"rewrite", "insert"} else "rewrite"
     system_msg = (
+        "You are a semantic-vector optimizer for SEO tasks. "
+        "Your task is to improve chunk relevance to the focus terms/goal phrase with minimal local edits. "
+        "Preserve narrative flow, factual tone, and language. "
+        "Return strict JSON only: {\"edited_text\": \"...\", \"rationale\": \"...\"}. "
         "Do not rewrite the whole text. Never change topic or introduce unrelated entities."
     )
     op_instruction = (
         "Text must be grammatically correct and natural for native readers.\n"
         "Keep edits tightly local to the provided chunk and immediate context only.\n"
         "Edit must be substantive (not just synonyms) and should increase relevance to the goal phrase.\n"
+        "Do not change the sentence subject/entity focus unless absolutely required by grammar.\n"
         f"Focus terms to strengthen: {', '.join(focus_terms) if focus_terms else '-'}\n"
         f"Terms to de-emphasize/avoid overuse: {', '.join(avoid_terms) if avoid_terms else '-'}\n\n"
         f"Chunk to edit/expand:\n{chunk_text}\n\n"
         f"3) Max {max_sent} sentence(s) in edited_text.\n"
         "4) Keep key named entities from the original chunk unchanged when possible.\n"
         "5) For BERT goal, improve semantic match to goal phrase without keyword stuffing.\n"
+        "6) For BERT goals you may use either: (a) exact phrase once, or (b) natural distributed use of core terms in one paragraph.\n"
+        "7) For rewrite: preserve original meaning sentence-by-sentence while improving relevance.\n"
+        "8) Provide rationale in one short sentence.\n"
+        "9) Only output JSON object."
     )
     payload = {
         "model": model,
     )
     parsed = _extract_json_object(content)
     edited = ""
+    rationale = ""
     if parsed:
         edited = str(parsed.get("edited_text") or parsed.get("revised_sentence") or parsed.get("rewrite") or "").strip()
+        rationale = str(parsed.get("rationale") or parsed.get("why") or "").strip()
     if not edited:
         raise ValueError("LLM returned invalid JSON edit payload.")
     return {
         "edited_text": edited,
+        "rationale": rationale,
         "prompt_debug": {
             "operation": op,
             "cascade_level": cascade_level,
                         temperature=temp,
                     )
                     edited_text = str((llm_result or {}).get("edited_text", "")).strip()
+                    llm_rationale = str((llm_result or {}).get("rationale", "")).strip()
                     prompt_debug = (llm_result or {}).get("prompt_debug", {})
                     if not edited_text or edited_text == original_span_text:
                         continue
                                 "chunk_relevance_after": after_rel,
                                 "term_diff": _term_diff(original_span_text, edited_text, language),
                                 "llm_prompt_debug": prompt_debug,
+                                "llm_rationale": llm_rationale,
                                 "operation": operation,
                                 "sentence_index": sent_idx,
                                 "span_start": span_start,
                                 "chunk_relevance_after": after_rel,
                                 "term_diff": _term_diff(original_span_text, edited_text, language),
                                 "llm_prompt_debug": prompt_debug,
+                                "llm_rationale": llm_rationale,
                                 "operation": operation,
                                 "sentence_index": sent_idx,
                                 "span_start": span_start,
                             "chunk_relevance_after": after_rel,
                             "term_diff": _term_diff(original_span_text, edited_text, language),
                             "llm_prompt_debug": prompt_debug,
+                            "llm_rationale": llm_rationale,
                             "invalid_reasons": invalid_reasons,
                             "delta_score": delta_score,
                             "candidate_score": cand_metrics.get("score"),
                                 "goal_type": goal.get("type"),
                                 "goal_label": goal.get("label"),
                             },
+                            "llm_rationale": "",
                             "operation": operation,
                             "sentence_index": sent_idx,
                             "span_start": span_start,
                                 "chunk_relevance_after": c.get("chunk_relevance_after"),
                                 "term_diff": c.get("term_diff"),
                                 "llm_prompt_debug": c.get("llm_prompt_debug"),
+                                "llm_rationale": c.get("llm_rationale"),
                                 "metrics_delta": c.get("metrics_delta"),
                                 "invalid_reasons": c.get("invalid_reasons", []),
                                 "delta_score": c.get("delta_score"),
                                 "chunk_relevance_after": c.get("chunk_relevance_after"),
                                 "term_diff": c.get("term_diff"),
                                 "llm_prompt_debug": c.get("llm_prompt_debug"),
+                                "llm_rationale": c.get("llm_rationale"),
                                 "metrics_delta": c.get("metrics_delta"),
                                 "invalid_reasons": c.get("invalid_reasons", []),
                                 "delta_score": c.get("delta_score"),
                             "chunk_relevance_after": c.get("chunk_relevance_after"),
                             "term_diff": c.get("term_diff"),
                             "llm_prompt_debug": c.get("llm_prompt_debug"),
+                            "llm_rationale": c.get("llm_rationale"),
                             "metrics_delta": c.get("metrics_delta"),
                             "invalid_reasons": c.get("invalid_reasons", []),
                             "delta_score": c.get("delta_score"),
                         "chunk_relevance_after": c.get("chunk_relevance_after"),
                         "term_diff": c.get("term_diff"),
                         "llm_prompt_debug": c.get("llm_prompt_debug"),
+                        "llm_rationale": c.get("llm_rationale"),
                         "metrics_delta": c.get("metrics_delta"),
                         "invalid_reasons": c.get("invalid_reasons", []),
                         "delta_score": c.get("delta_score"),

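The `rationale` handling added to `_llm_edit_chunk` can be exercised in isolation. The sketch below mirrors the key fallbacks visible in the diff (`edited_text`/`revised_sentence`/`rewrite` and `rationale`/`why`). `extract_json_object` is a hypothetical stand-in for the repository's `_extract_json_object`, whose implementation is not part of this commit, and `parse_edit_reply` is a name invented here.

```python
import json
from typing import Any, Dict, Optional


def extract_json_object(content: str) -> Optional[Dict[str, Any]]:
    """Hypothetical stand-in: parse the outermost {...} span of a model reply."""
    start, end = content.find("{"), content.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        obj = json.loads(content[start:end + 1])
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None


def parse_edit_reply(content: str) -> Dict[str, str]:
    """Return the edited text and a one-line rationale; the rationale is optional."""
    parsed = extract_json_object(content)
    edited = ""
    rationale = ""
    if parsed:
        edited = str(
            parsed.get("edited_text")
            or parsed.get("revised_sentence")
            or parsed.get("rewrite")
            or ""
        ).strip()
        rationale = str(parsed.get("rationale") or parsed.get("why") or "").strip()
    if not edited:
        # Same failure mode as in the diff: no usable edit means an error.
        raise ValueError("LLM returned invalid JSON edit payload.")
    return {"edited_text": edited, "rationale": rationale}
```

A missing rationale leaves the field as an empty string rather than raising, which matches how the debug UI in this commit falls back to `-` when `llm_rationale` is empty.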
templates/index.html CHANGED

@@ -860,6 +860,7 @@
                 const termDiff = c.term_diff ? safeHtml(JSON.stringify(c.term_diff)) : '-';
                 const metricDelta = c.metrics_delta ? safeHtml(JSON.stringify(c.metrics_delta)) : '-';
                 const promptDbg = c.llm_prompt_debug ? safeHtml(JSON.stringify(c.llm_prompt_debug, null, 2)) : '-';
                 return `
                     <tr>
                         <td>${c.candidate_index ?? '-'}</td>
@@ -878,6 +879,7 @@
                         </td>
                         <td>
                             <div style="max-width: 520px; white-space: normal;">${sentAfter}</div>
                             <details class="mt-1">
                                 <summary class="small">LLM input</summary>
                                 <pre class="small mb-0" style="white-space: pre-wrap;">${promptDbg}</pre>

                 const termDiff = c.term_diff ? safeHtml(JSON.stringify(c.term_diff)) : '-';
                 const metricDelta = c.metrics_delta ? safeHtml(JSON.stringify(c.metrics_delta)) : '-';
                 const promptDbg = c.llm_prompt_debug ? safeHtml(JSON.stringify(c.llm_prompt_debug, null, 2)) : '-';
+                const rationale = c.llm_rationale ? safeHtml(c.llm_rationale) : '-';
                 return `
                     <tr>
                         <td>${c.candidate_index ?? '-'}</td>
                         </td>
                         <td>
                             <div style="max-width: 520px; white-space: normal;">${sentAfter}</div>
+                            <div class="small text-muted mt-1">rationale: ${rationale}</div>
                             <details class="mt-1">
                                 <summary class="small">LLM input</summary>
                                 <pre class="small mb-0" style="white-space: pre-wrap;">${promptDbg}</pre>