N-gram: rank by Freq/Avg, extra loop budget, overlapping window chunking
Files changed:
- docs/FULL_FUNCTIONAL_DOCUMENTATION.md +2 -0
- docs/TEXT_OPTIMIZER_PRINCIPLES.md +11 -0
- optimizer.py +102 -26
docs/FULL_FUNCTIONAL_DOCUMENTATION.md
CHANGED
@@ -464,10 +464,12 @@ HTML extraction pipeline:
 ### Main function `optimize_text`
 Iteration loop:
 1. baseline metrics.
+   - loop iteration count: `min(80, max_iterations + addon)`, where `addon = min(56, N_ngram_targets × 3)`, so that a low `max_iterations` does not cut the n-gram stage off after three log rows when the target list is large.
 2. choose a goal.
 3. choose the chunk pool and the cascade operation.
    - several span candidates are selected per step (multi-chunk selection), not just one;
    - ranking accounts for `focus_terms/avoid_terms`, chunk-level relevance, and noise heuristics (menu/CTA/header penalties);
+   - for **n-gram** goals, sentences are ranked via **sliding overlapping windows** of 2-4 sentences (stride 1): each sentence is assigned the best score among its windows; the score penalizes local phrase repeats and noisy blocks;
    - for BERT goals, ranking is not limited to spans with already-present occurrences: relevant spans with underrepresented core terms, where they can be added naturally, are additionally prioritized;
    - `attempt_cursor` per goal and `attempted_spans` are used to avoid cycling over the same span.
 4. generate `N` candidates for each selected span.
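The step-budget formula above can be sketched as a small helper. This is a minimal sketch: the constants 3, 56, and 80 come from this commit, but the function name is illustrative; the real code computes the budget inline in `optimize_text`.

```python
NGRAM_ATTEMPTS_PER_TERM = 3  # attempts allocated per eligible n-gram target


def total_loop_steps(max_iterations: int, ngram_target_count: int) -> int:
    """Loop budget: the user cap plus a reserved n-gram allowance, hard-capped at 80."""
    addon = min(56, max(0, ngram_target_count) * NGRAM_ATTEMPTS_PER_TERM)
    return min(80, max_iterations + addon)
```

With no n-gram targets the user cap applies unchanged; with many targets the addon saturates at 56 and the whole budget at 80, so the loop can never grow unbounded.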
docs/TEXT_OPTIMIZER_PRINCIPLES.md
CHANGED
@@ -44,10 +44,21 @@ Update it whenever optimization policy changes.
 - Selection rules (multi-competitor mode, `competitors > 1`):
   - bi-grams and tri-grams are eligible when present in `>= 2` competitors;
   - unigrams are eligible only if they are part of user keyword phrases and present in `>= 2` competitors.
+- Target ranking (which n-gram to work on next):
+  - sort eligible **underrepresented** rows by **Freq(K)** (`comp_occurrence`) descending,
+    then **Avg(K)** (`competitor_avg`) descending,
+    then **deviation** from competitor average descending (larger gap first).
 - Iteration behavior:
   - optimizer works on one n-gram target at a time per step;
   - per eligible n-gram target it allocates `3` attempts, then moves to the next target;
   - if the target list ends, the stage advances to the next optimization stage.
+- **Global step budget:** the UI `max_iterations` cap still limits total loop iterations, but the
+  optimizer **adds** extra steps reserved for the n-gram stage (`targets × 3`, capped) so a low
+  `max_iterations` value does not stop the run after only three n-gram rows while many targets remain.
+- **Chunk selection (n-gram stage):** candidate sentences are ranked using **overlapping multi-sentence
+  windows** (stride 1). Each sentence receives the best window score; windows favor low local phrase
+  duplication, topical overlap with phrase tokens, and non-noisy prose. Document-level phrase count
+  remains the primary acceptance signal.
 
 ## 5.1 Summary logic memory (current)
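The target-ranking rule above amounts to a single three-key descending sort. A minimal sketch, where the row dicts, the `deviation` field name, and the sample values are illustrative:

```python
def rank_ngram_targets(rows):
    """Sort underrepresented n-gram rows by Freq(K), then Avg(K), then deviation, all descending."""
    return sorted(
        rows,
        key=lambda r: (r["comp_occurrence"], r["competitor_avg"], r["deviation"]),
        reverse=True,
    )


rows = [
    {"ngram": "seo audit", "comp_occurrence": 3, "competitor_avg": 4.0, "deviation": 0.5},
    {"ngram": "site speed", "comp_occurrence": 5, "competitor_avg": 2.0, "deviation": 0.9},
    {"ngram": "page title", "comp_occurrence": 5, "competitor_avg": 6.0, "deviation": 0.2},
]
# Rows with Freq(K)=5 come first; among them the higher Avg(K) wins,
# so "page title" outranks "site speed" despite its smaller deviation.
ranked = rank_ngram_targets(rows)
```

A single tuple key with `reverse=True` keeps all three criteria descending without custom comparators.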
optimizer.py
CHANGED
@@ -303,6 +303,95 @@ def _chunk_ngram_count(text: str, ngram_label: str, language: str) -> int:
     return count
 
 
+def _build_ngram_stage_rows(
+    analysis: Dict[str, Any],
+    keywords: List[str],
+    language: str,
+) -> List[Tuple[str, float, float, float, int, float]]:
+    """
+    Eligible underrepresented n-gram targets for the ngram stage.
+    Each row: (ngram_label, target_count, comp_avg, tolerance_pct, comp_occurrence, dev_ratio).
+
+    Priority (user policy): maximize competitor coverage Freq(K), then Avg(K), then
+    how far below the band the target is (larger deviation first).
+    Only terms with My=0 or outside the tolerance band below average are included.
+    """
+    ngram_rows: List[Tuple[str, float, float, float, int, float]] = []
+    ngram_stats = analysis.get("ngram_stats", {}) or {}
+    competitor_count = len((analysis.get("word_counts", {}) or {}).get("competitors", []) or [])
+    keyword_unigrams = _keyword_unigram_set(keywords, language)
+    for _bucket_name, bucket in ngram_stats.items():
+        if not isinstance(bucket, list):
+            continue
+        for item in bucket:
+            ngram_label = str(item.get("ngram", "")).strip()
+            if not ngram_label:
+                continue
+            target = float(item.get("target_count", 0))
+            comp_avg = float(item.get("competitor_avg", 0))
+            comp_occ = int(item.get("comp_occurrence", 0))
+            if not _is_ngram_stage_candidate(ngram_label, comp_occ, competitor_count, keyword_unigrams):
+                continue
+            if not _is_ngram_outside_tolerance(target, comp_avg):
+                continue
+            if target >= comp_avg:
+                continue
+            tol = _ngram_tolerance_pct(comp_avg)
+            dev_ratio = _ngram_deviation_ratio(target, comp_avg)
+            ngram_rows.append((ngram_label, target, comp_avg, tol, comp_occ, dev_ratio))
+    ngram_rows.sort(key=lambda x: (x[4], x[2], x[5]), reverse=True)
+    return ngram_rows
+
+
+def _score_ngram_candidate_window(window_sentences: List[str], goal_label: str, language: str) -> float:
+    """Heuristic: good place to add phrase — low local duplication, topical proximity, not boilerplate."""
+    chunk = " ".join(s for s in window_sentences if s).strip()
+    if not chunk:
+        return -1e6
+    phrase_count = float(_chunk_ngram_count(chunk, goal_label, language))
+    noise_n = sum(1 for s in window_sentences if _is_noise_like_sentence(s))
+    noise_frac = noise_n / max(1, len(window_sentences))
+    phrase_tokens = [t.lower() for t in _filter_stopwords(_tokenize(goal_label), language) if t]
+    chunk_l = chunk.lower()
+    unigram_hits = sum(1 for t in phrase_tokens if t and len(t) > 1 and t in chunk_l)
+    rel_proxy = unigram_hits / max(1, len(phrase_tokens)) if phrase_tokens else 0.0
+    return (
+        -3.0 * phrase_count
+        + 2.2 * rel_proxy
+        - 4.0 * noise_frac
+        + min(len(chunk) / 1200.0, 0.35)
+    )
+
+
+def _rank_ngram_overlap_sentence_indices(
+    sentences: List[str],
+    goal_label: str,
+    language: str,
+) -> List[int]:
+    """
+    Slide overlapping multi-sentence windows over the document; each sentence gets the
+    best score among windows that contain it. Order sentences by that score (desc).
+    """
+    n = len(sentences)
+    if n <= 0:
+        return [0]
+    if n == 1:
+        return [0]
+    # 2–4 sentences per window, stride 1 for strong overlap.
+    w = min(4, max(2, n))
+    best: List[float] = [-1e9] * n
+    for start in range(0, n - w + 1):
+        win = sentences[start : start + w]
+        sc = _score_ngram_candidate_window(win, goal_label, language)
+        for j in range(start, start + w):
+            if sc > best[j]:
+                best[j] = sc
+    center = (n - 1) / 2.0
+    scored_idx = [(i, best[i], -abs(i - center)) for i in range(n)]
+    scored_idx.sort(key=lambda t: (t[1], t[2]), reverse=True)
+    return [t[0] for t in scored_idx]
+
+
 def _compute_metrics(
     analysis: Dict[str, Any],
     semantic: Dict[str, Any],
@@ -470,32 +559,8 @@ def _choose_optimization_goal(
     candidates["semantic"] = {"type": "semantic", "label": top_term, "focus_terms": [top_term], "avoid_terms": []}
 
     # N-gram balancing (toward competitor average with tolerance policy).
-    ngram_rows
-    ngram_stats = analysis.get("ngram_stats", {}) or {}
-    competitor_count = len((analysis.get("word_counts", {}) or {}).get("competitors", []) or [])
-    keyword_unigrams = _keyword_unigram_set(keywords, language)
-    for bucket_name, bucket in ngram_stats.items():
-        if not isinstance(bucket, list):
-            continue
-        for item in bucket:
-            ngram_label = str(item.get("ngram", "")).strip()
-            if not ngram_label:
-                continue
-            target = float(item.get("target_count", 0))
-            comp_avg = float(item.get("competitor_avg", 0))
-            comp_occ = int(item.get("comp_occurrence", 0))
-            if not _is_ngram_stage_candidate(ngram_label, comp_occ, competitor_count, keyword_unigrams):
-                continue
-            if not _is_ngram_outside_tolerance(target, comp_avg):
-                continue
-            # N-gram stage is for underrepresented terms only.
-            if target >= comp_avg:
-                continue
-            tol = _ngram_tolerance_pct(comp_avg)
-            dev_ratio = _ngram_deviation_ratio(target, comp_avg)
-            ngram_rows.append((ngram_label, target, comp_avg, tol, comp_occ, dev_ratio))
+    ngram_rows = _build_ngram_stage_rows(analysis, keywords, language)
     if ngram_rows:
-        ngram_rows.sort(key=lambda x: (x[5], x[4], x[2]), reverse=True)
         pick = max(0, int(stage_cursor))
         if pick >= len(ngram_rows):
             # No more n-gram targets in current stage cursor window.
@@ -576,6 +641,11 @@ def _rank_sentence_indices(
     avoid = [x for x in avoid_terms if x]
     center = (len(sentences) - 1) / 2.0
 
+    # N-gram stage: overlapping sentence windows — pick spans where insertion is natural
+    # while document-level phrase count remains the primary optimization signal.
+    if goal_type == "ngram" and (goal_label or "").strip():
+        return _rank_ngram_overlap_sentence_indices(sentences, str(goal_label).strip(), language)
+
     # For BERT optimization prefer natural prose chunks over list/menu/noisy blocks.
     candidate_indices = list(range(len(sentences)))
     if goal_type == "bert":
@@ -1250,6 +1320,12 @@ def optimize_text(request_data: Dict[str, Any]) -> Dict[str, Any]:
         baseline_analysis, baseline_semantic, keywords, language, bert_stage_target=bert_stage_target
     )
 
+    # Global max_iterations caps early stages; n-gram stage gets extra steps so each target
+    # can use NGRAM_ATTEMPTS_PER_TERM tries without being cut off at the user iteration cap.
+    ngram_row_count = len(_build_ngram_stage_rows(baseline_analysis, keywords, language))
+    ngram_step_addon = min(56, max(0, ngram_row_count) * NGRAM_ATTEMPTS_PER_TERM)
+    total_loop_steps = min(80, max_iterations + ngram_step_addon)
+
     current_text = target_text
     current_analysis = baseline_analysis
     current_semantic = baseline_semantic
@@ -1266,7 +1342,7 @@ def optimize_text(request_data: Dict[str, Any]) -> Dict[str, Any]:
     stage_no_progress_steps = 0
     stage_goal_cursor: Dict[str, Dict[str, int]] = {}
 
-    for step in range(
+    for step in range(total_loop_steps):
         while stage_idx < len(STAGE_ORDER) and _is_stage_complete(
             STAGE_ORDER[stage_idx], current_metrics, bert_stage_target=bert_stage_target
         ):