Luigi committed
Commit 27363be · 1 Parent(s): d207764

fix: prevent models from copying schema descriptions as extracted content

Models (Gemma-3, LFM2-Extract) were treating descriptive values in schema
examples as actual extracted content to copy.
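One way to spot this failure mode in model output is to compare each extracted item against the schema descriptions. The following is a minimal sketch, not code from this repository: the helper name `find_copied_descriptions` and the sample output are hypothetical, and only the description strings come from the prompt.

```python
# Description strings as they appeared in the English prompt before this change.
DESCRIPTIONS = {
    "action_items": "Specific action items with owner and deadline",
    "decisions": "Decisions made with rationale",
    "key_points": "Important discussion points",
    "open_questions": "Unresolved questions or concerns",
}


def find_copied_descriptions(extracted: dict, descriptions: dict) -> dict:
    """Return, per field, the items that merely repeat that field's description."""
    copied = {}
    for field, items in extracted.items():
        description = descriptions.get(field, "").strip().lower()
        hits = [
            item for item in items
            if description and item.strip().lower() == description
        ]
        if hits:
            copied[field] = hits
    return copied


# Hypothetical model output in which one field echoes the schema text.
sample_output = {
    "action_items": ["Specific action items with owner and deadline"],
    "decisions": ["The team agreed to ship the beta on Friday."],
    "key_points": [],
    "open_questions": [],
}
copied = find_copied_descriptions(sample_output, DESCRIPTIONS)
print(copied)
```

An exact, case-insensitive match is deliberately conservative; a real check might also need fuzzy matching for partially copied descriptions.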

Changed prompts to use empty arrays in schema with separate descriptions:
- Before: "action_items": ["Specific action items with owner and deadline"]
- After: "action_items": [] + separate description line

This forces models to generate actual extracted content instead of copying
schema descriptions.
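The new prompt layout can be sketched as a small helper that renders the empty-array schema first and the description lines after it. This is an illustration under assumptions: `build_schema_block` and `FIELD_DESCRIPTIONS` are hypothetical names, and the actual prompts in `extraction.py` are hand-written string literals rather than generated.

```python
import json

# Field descriptions taken from the English prompt in this commit.
FIELD_DESCRIPTIONS = {
    "action_items": "Specific action items with owner and deadline",
    "decisions": "Decisions made with rationale",
    "key_points": "Important discussion points",
    "open_questions": "Unresolved questions or concerns",
}


def build_schema_block(descriptions: dict) -> str:
    """Render an empty-array JSON schema, then one 'field: description' line per field."""
    schema = json.dumps({name: [] for name in descriptions}, indent=2)
    lines = [f"{name}: {text}" for name, text in descriptions.items()]
    return schema + "\n\n" + "\n".join(lines)


prompt_block = build_schema_block(FIELD_DESCRIPTIONS)
print(prompt_block)
```

Keeping the descriptions outside the JSON means the schema contains no string values for a model to copy into its output.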

Applies to both prompt builders, in their English and Traditional Chinese versions:
- _build_schema_extraction_prompt() (LFM2-Extract optimized)
- _build_reasoning_extraction_prompt() (Qwen3 hybrid optimized)

Files changed (1)
  1. meeting_summarizer/extraction.py +36 -16
meeting_summarizer/extraction.py CHANGED
@@ -330,23 +330,33 @@ def _build_schema_extraction_prompt(output_language: str) -> str:
  return """以 JSON 格式返回資料,使用以下架構:

  {
- "action_items": ["包含負責人和截止日期的具體行動項目"],
- "decisions": ["包含理由的決策"],
- "key_points": ["重要討論要點"],
- "open_questions": ["未解決的問題或疑慮"]
+ "action_items": [],
+ "decisions": [],
+ "key_points": [],
+ "open_questions": []
  }

+ action_items: 包含負責人和截止日期的具體行動項目
+ decisions: 包含理由的決策
+ key_points: 重要討論要點
+ open_questions: 未解決的問題或疑慮
+
  從使用者提供的逐字稿中提取。逐字稿可能包含重複、雜訊或不完整內容,請專注於有意義的對話內容,忽略重複的詞句。"""
  else:
  return """Return data as a JSON object with the following schema:

  {
- "action_items": ["Specific action items with owner and deadline"],
- "decisions": ["Decisions made with rationale"],
- "key_points": ["Important discussion points"],
- "open_questions": ["Unresolved questions or concerns"]
+ "action_items": [],
+ "decisions": [],
+ "key_points": [],
+ "open_questions": []
  }

+ action_items: Specific action items with owner and deadline
+ decisions: Decisions made with rationale
+ key_points: Important discussion points
+ open_questions: Unresolved questions or concerns
+
  Extract from the transcript provided by the user. The transcript may contain repetitions, noise, or incomplete sentences - focus on meaningful dialogue content and ignore repetitive phrases."""


@@ -366,12 +376,17 @@ def _build_reasoning_extraction_prompt(output_language: str) -> str:

  推理後,以 JSON 格式返回資料,使用以下架構:
  {
- "action_items": ["包含負責人和截止日期的具體行動項目"],
- "decisions": ["包含理由的決策"],
- "key_points": ["重要討論要點"],
- "open_questions": ["未解決的問題或疑慮"]
+ "action_items": [],
+ "decisions": [],
+ "key_points": [],
+ "open_questions": []
  }

+ action_items: 包含負責人和截止日期的具體行動項目
+ decisions: 包含理由的決策
+ key_points: 重要討論要點
+ open_questions: 未解決的問題或疑慮
+
  規則:
  - 每個項目必須是完整、獨立的句子
  - 在每個項目中包含上下文(誰、什麼、何時)
@@ -391,12 +406,17 @@ The transcript may contain repetitions, noise, or incomplete sentences - focus o

  After reasoning, return data as a JSON object with the following schema:
  {
- "action_items": ["Specific action items with owner and deadline"],
- "decisions": ["Decisions made with rationale"],
- "key_points": ["Important discussion points"],
- "open_questions": ["Unresolved questions or concerns"]
+ "action_items": [],
+ "decisions": [],
+ "key_points": [],
+ "open_questions": []
  }

+ action_items: Specific action items with owner and deadline
+ decisions: Decisions made with rationale
+ key_points: Important discussion points
+ open_questions: Unresolved questions or concerns
+
  Rules:
  - Each item must be a complete, standalone sentence
  - Include context (who, what, when) in each item
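
A regression check for this change could parse the JSON schema out of a prompt string and confirm that every value is an empty array. This is a rough sketch, assuming the schema is the first flat `{...}` object in the prompt; `schema_values_are_empty` and the two abbreviated sample prompts are hypothetical and are not part of `extraction.py`.

```python
import json


def schema_values_are_empty(prompt: str) -> bool:
    """Return True if the first flat JSON object in the prompt holds only empty arrays."""
    start = prompt.index("{")
    end = prompt.index("}", start) + 1  # assumes no nested objects or braces in strings
    schema = json.loads(prompt[start:end])
    return all(value == [] for value in schema.values())


# Abbreviated stand-ins for the prompt text before and after this commit.
OLD_PROMPT = """Return data as a JSON object with the following schema:

{
  "action_items": ["Specific action items with owner and deadline"],
  "decisions": ["Decisions made with rationale"]
}
"""

NEW_PROMPT = """Return data as a JSON object with the following schema:

{
  "action_items": [],
  "decisions": []
}

action_items: Specific action items with owner and deadline
decisions: Decisions made with rationale
"""

print(schema_values_are_empty(OLD_PROMPT), schema_values_are_empty(NEW_PROMPT))
```

Running the same check against the strings returned by both prompt builders, in both languages, would help catch a later edit that reintroduces descriptive values into the schema.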