lsdf committed
Commit ad42da6 · 1 Parent(s): a52efa4

N-gram: rank by Freq/Avg, extra loop budget, overlapping window chunking

docs/FULL_FUNCTIONAL_DOCUMENTATION.md CHANGED
@@ -464,10 +464,12 @@ HTML extraction pipeline:
 ### Main function `optimize_text`
 Iteration loop:
 1. baseline metrics.
+ - number of loop iterations: `min(80, max_iterations + addon)`, where `addon = min(56, N_ngram_targets × 3)`, so that a low `max_iterations` does not cut the n-gram stage off after three log rows when the target list is long.
 2. choose a goal.
 3. choose the chunk pool and the cascade operation.
 - several span candidates are selected per step (multi-chunk selection), not just one;
 - ranking accounts for `focus_terms/avoid_terms`, chunk-level relevance, and noise heuristics (menu/CTA/header penalties);
+ - for **n-gram** goals, sentences are ranked via **sliding overlapping windows** of 2–4 sentences (stride 1): each sentence is assigned the best score among its windows; the score penalizes local phrase repetition and noisy blocks;
 - for BERT goals, ranking is not limited to spans with already-present occurrences: relevant spans with underrepresented core terms, where the terms can be added naturally, are additionally prioritized;
 - `attempt_cursor` per goal and `attempted_spans` are used to avoid looping over the same span.
 4. generate `N` candidates for each selected span.
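The loop-budget rule documented in this hunk can be sketched as a tiny helper (a sketch only: the function name `loop_budget` and the argument `ngram_targets` are illustrative, while the constants 80, 56, and 3 come from the diff):

```python
def loop_budget(max_iterations: int, ngram_targets: int) -> int:
    # Extra steps reserved for the n-gram stage: 3 attempts per target, capped at 56.
    addon = min(56, max(0, ngram_targets) * 3)
    # Hard ceiling of 80 keeps a huge target list from exploding total run time.
    return min(80, max_iterations + addon)
```

For example, with `max_iterations=10` and 30 eligible targets the addon saturates at 56, giving 66 loop steps instead of stopping at 10.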
docs/TEXT_OPTIMIZER_PRINCIPLES.md CHANGED
@@ -44,10 +44,21 @@ Update it whenever optimization policy changes.
 - Selection rules (multi-competitor mode, `competitors > 1`):
   - bi-grams and tri-grams are eligible when present in `>= 2` competitors;
   - unigrams are eligible only if they are part of user keyword phrases and present in `>= 2` competitors.
+ - Target ranking (which n-gram to work on next):
+   - sort eligible **underrepresented** rows by **Freq(K)** (`comp_occurrence`) descending,
+     then **Avg(K)** (`competitor_avg`) descending,
+     then **deviation** from the competitor average descending (larger gap first).
 - Iteration behavior:
   - optimizer works on one n-gram target at a time per step;
   - per eligible n-gram target it allocates `3` attempts, then moves to the next target;
   - if the target list ends, the stage advances to the next optimization stage.
+ - **Global step budget:** the UI `max_iterations` cap still limits total loop iterations, but the
+   optimizer **adds** extra steps reserved for the n-gram stage (`targets × 3`, capped) so a low
+   `max_iterations` value does not stop the run after only three n-gram rows while many targets remain.
+ - **Chunk selection (n-gram stage):** candidate sentences are ranked using **overlapping multi-sentence
+   windows** (stride 1). Each sentence receives the best window score; windows favor low local phrase
+   duplication, topical overlap with phrase tokens, and non-noisy prose. The document-level phrase count
+   remains the primary acceptance signal.
 
 ## 5.1 Summary logic memory (current)
 
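The target-ranking rule above can be reduced to one sort key (a sketch: `rank_targets` and the simplified row tuples are illustrative stand-ins, not the optimizer's real row layout):

```python
from typing import List, Tuple

# Simplified stand-in row: (ngram, freq_k, avg_k, deviation).
Row = Tuple[str, int, float, float]

def rank_targets(rows: List[Row]) -> List[Row]:
    # Freq(K) descending, then Avg(K) descending, then deviation descending.
    return sorted(rows, key=lambda r: (r[1], r[2], r[3]), reverse=True)

rows = [
    ("seo audit", 3, 2.0, 0.5),
    ("site speed", 5, 1.5, 0.9),
    ("seo tools", 5, 2.5, 0.2),
]
# "seo tools" and "site speed" share Freq(K)=5; the higher Avg(K) wins the tie.
```

Note deviation only breaks ties after both Freq(K) and Avg(K) are equal, matching the stated priority order.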
optimizer.py CHANGED
@@ -303,6 +303,95 @@ def _chunk_ngram_count(text: str, ngram_label: str, language: str) -> int:
     return count
 
 
+def _build_ngram_stage_rows(
+    analysis: Dict[str, Any],
+    keywords: List[str],
+    language: str,
+) -> List[Tuple[str, float, float, float, int, float]]:
+    """
+    Eligible underrepresented n-gram targets for the ngram stage.
+    Each row: (ngram_label, target_count, comp_avg, tolerance_pct, comp_occurrence, dev_ratio).
+
+    Priority (user policy): maximize competitor coverage Freq(K), then Avg(K), then
+    how far below the band the target is (larger deviation first).
+    Only terms with My=0 or outside the tolerance band below average are included.
+    """
+    ngram_rows: List[Tuple[str, float, float, float, int, float]] = []
+    ngram_stats = analysis.get("ngram_stats", {}) or {}
+    competitor_count = len((analysis.get("word_counts", {}) or {}).get("competitors", []) or [])
+    keyword_unigrams = _keyword_unigram_set(keywords, language)
+    for _bucket_name, bucket in ngram_stats.items():
+        if not isinstance(bucket, list):
+            continue
+        for item in bucket:
+            ngram_label = str(item.get("ngram", "")).strip()
+            if not ngram_label:
+                continue
+            target = float(item.get("target_count", 0))
+            comp_avg = float(item.get("competitor_avg", 0))
+            comp_occ = int(item.get("comp_occurrence", 0))
+            if not _is_ngram_stage_candidate(ngram_label, comp_occ, competitor_count, keyword_unigrams):
+                continue
+            if not _is_ngram_outside_tolerance(target, comp_avg):
+                continue
+            if target >= comp_avg:
+                continue
+            tol = _ngram_tolerance_pct(comp_avg)
+            dev_ratio = _ngram_deviation_ratio(target, comp_avg)
+            ngram_rows.append((ngram_label, target, comp_avg, tol, comp_occ, dev_ratio))
+    ngram_rows.sort(key=lambda x: (x[4], x[2], x[5]), reverse=True)
+    return ngram_rows
+
+
+def _score_ngram_candidate_window(window_sentences: List[str], goal_label: str, language: str) -> float:
+    """Heuristic: good place to add phrase — low local duplication, topical proximity, not boilerplate."""
+    chunk = " ".join(s for s in window_sentences if s).strip()
+    if not chunk:
+        return -1e6
+    phrase_count = float(_chunk_ngram_count(chunk, goal_label, language))
+    noise_n = sum(1 for s in window_sentences if _is_noise_like_sentence(s))
+    noise_frac = noise_n / max(1, len(window_sentences))
+    phrase_tokens = [t.lower() for t in _filter_stopwords(_tokenize(goal_label), language) if t]
+    chunk_l = chunk.lower()
+    unigram_hits = sum(1 for t in phrase_tokens if t and len(t) > 1 and t in chunk_l)
+    rel_proxy = unigram_hits / max(1, len(phrase_tokens)) if phrase_tokens else 0.0
+    return (
+        -3.0 * phrase_count
+        + 2.2 * rel_proxy
+        - 4.0 * noise_frac
+        + min(len(chunk) / 1200.0, 0.35)
+    )
+
+
+def _rank_ngram_overlap_sentence_indices(
+    sentences: List[str],
+    goal_label: str,
+    language: str,
+) -> List[int]:
+    """
+    Slide overlapping multi-sentence windows over the document; each sentence gets the
+    best score among windows that contain it. Order sentences by that score (desc).
+    """
+    n = len(sentences)
+    if n <= 0:
+        return []
+    if n == 1:
+        return [0]
+    # 2–4 sentences per window, stride 1 for strong overlap.
+    w = min(4, max(2, n))
+    best: List[float] = [-1e9] * n
+    for start in range(0, n - w + 1):
+        win = sentences[start : start + w]
+        sc = _score_ngram_candidate_window(win, goal_label, language)
+        for j in range(start, start + w):
+            if sc > best[j]:
+                best[j] = sc
+    center = (n - 1) / 2.0
+    scored_idx = [(i, best[i], -abs(i - center)) for i in range(n)]
+    scored_idx.sort(key=lambda t: (t[1], t[2]), reverse=True)
+    return [t[0] for t in scored_idx]
+
+
 def _compute_metrics(
     analysis: Dict[str, Any],
     semantic: Dict[str, Any],
@@ -470,32 +559,8 @@ def _choose_optimization_goal(
         candidates["semantic"] = {"type": "semantic", "label": top_term, "focus_terms": [top_term], "avoid_terms": []}
 
     # N-gram balancing (toward competitor average with tolerance policy).
-    ngram_rows: List[Tuple[str, float, float, float, int, float]] = []
-    ngram_stats = analysis.get("ngram_stats", {}) or {}
-    competitor_count = len((analysis.get("word_counts", {}) or {}).get("competitors", []) or [])
-    keyword_unigrams = _keyword_unigram_set(keywords, language)
-    for bucket_name, bucket in ngram_stats.items():
-        if not isinstance(bucket, list):
-            continue
-        for item in bucket:
-            ngram_label = str(item.get("ngram", "")).strip()
-            if not ngram_label:
-                continue
-            target = float(item.get("target_count", 0))
-            comp_avg = float(item.get("competitor_avg", 0))
-            comp_occ = int(item.get("comp_occurrence", 0))
-            if not _is_ngram_stage_candidate(ngram_label, comp_occ, competitor_count, keyword_unigrams):
-                continue
-            if not _is_ngram_outside_tolerance(target, comp_avg):
-                continue
-            # N-gram stage is for underrepresented terms only.
-            if target >= comp_avg:
-                continue
-            tol = _ngram_tolerance_pct(comp_avg)
-            dev_ratio = _ngram_deviation_ratio(target, comp_avg)
-            ngram_rows.append((ngram_label, target, comp_avg, tol, comp_occ, dev_ratio))
+    ngram_rows = _build_ngram_stage_rows(analysis, keywords, language)
     if ngram_rows:
-        ngram_rows.sort(key=lambda x: (x[5], x[4], x[2]), reverse=True)
         pick = max(0, int(stage_cursor))
         if pick >= len(ngram_rows):
             # No more n-gram targets in current stage cursor window.
@@ -576,6 +641,11 @@ def _rank_sentence_indices(
     avoid = [x for x in avoid_terms if x]
     center = (len(sentences) - 1) / 2.0
 
+    # N-gram stage: overlapping sentence windows — pick spans where insertion is natural
+    # while document-level phrase count remains the primary optimization signal.
+    if goal_type == "ngram" and (goal_label or "").strip():
+        return _rank_ngram_overlap_sentence_indices(sentences, str(goal_label).strip(), language)
+
     # For BERT optimization prefer natural prose chunks over list/menu/noisy blocks.
     candidate_indices = list(range(len(sentences)))
     if goal_type == "bert":
@@ -1250,6 +1320,12 @@ def optimize_text(request_data: Dict[str, Any]) -> Dict[str, Any]:
         baseline_analysis, baseline_semantic, keywords, language, bert_stage_target=bert_stage_target
     )
 
+    # Global max_iterations caps early stages; n-gram stage gets extra steps so each target
+    # can use NGRAM_ATTEMPTS_PER_TERM tries without being cut off at the user iteration cap.
+    ngram_row_count = len(_build_ngram_stage_rows(baseline_analysis, keywords, language))
+    ngram_step_addon = min(56, max(0, ngram_row_count) * NGRAM_ATTEMPTS_PER_TERM)
+    total_loop_steps = min(80, max_iterations + ngram_step_addon)
+
     current_text = target_text
     current_analysis = baseline_analysis
     current_semantic = baseline_semantic
@@ -1266,7 +1342,7 @@ def optimize_text(request_data: Dict[str, Any]) -> Dict[str, Any]:
     stage_no_progress_steps = 0
     stage_goal_cursor: Dict[str, Dict[str, int]] = {}
 
-    for step in range(max_iterations):
+    for step in range(total_loop_steps):
         while stage_idx < len(STAGE_ORDER) and _is_stage_complete(
             STAGE_ORDER[stage_idx], current_metrics, bert_stage_target=bert_stage_target
         ):
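The overlapping-window ranking introduced in this commit can be illustrated with a self-contained toy version (hypothetical names: `rank_by_overlapping_windows` and the passed-in scorer stand in for `_rank_ngram_overlap_sentence_indices` and `_score_ngram_candidate_window`; only the best-score-per-sentence mechanics mirror the diff):

```python
from typing import Callable, List

def rank_by_overlapping_windows(
    sentences: List[str],
    score: Callable[[List[str]], float],
    width: int = 4,
) -> List[int]:
    # Slide windows of `width` sentences with stride 1; each sentence keeps
    # the best score among the windows that contain it.
    n = len(sentences)
    if n == 0:
        return []
    w = min(width, n)
    best = [float("-inf")] * n
    for start in range(n - w + 1):
        sc = score(sentences[start:start + w])
        for j in range(start, start + w):
            best[j] = max(best[j], sc)
    # Rank sentence indices by best window score, descending (stable on ties).
    return sorted(range(n), key=lambda i: best[i], reverse=True)

# Toy scorer: penalize windows that already contain the phrase "seo".
sents = ["seo seo seo", "about seo", "clean prose here", "more clean prose"]
order = rank_by_overlapping_windows(
    sents, lambda win: -sum("seo" in s for s in win), width=2
)
```

With this scorer the clean-prose sentences rank first, matching the diff's intent of preferring spans with low local phrase duplication as insertion points.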