riacho committed on
Commit
f4e0387
·
verified ·
1 Parent(s): d7b2228

Upload folder using huggingface_hub

Files changed (5)
  1. README.md +46 -6
  2. app.py +439 -0
  3. prompts/rule_aug.txt +144 -0
  4. requirements.txt +7 -0
  5. rules.py +266 -0
README.md CHANGED
@@ -1,12 +1,52 @@
  ---
- title: Solar News Translator Final
- emoji: 📈
- colorFrom: green
- colorTo: purple
  sdk: gradio
- sdk_version: 6.13.0
  app_file: app.py
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: Korean News Translator Final (Rule-augmented)
+ emoji: 🌐
+ colorFrom: indigo
+ colorTo: blue
  sdk: gradio
+ sdk_version: 4.44.1
+ python_version: "3.11"
  app_file: app.py
  pinned: false
+ license: mit
  ---

+ # Korean News Translator Final (Rule-augmented)
+
+ Demo of the production candidate (v1) for Chosun Ilbo Korean→English translation.
+
+ - **Model**: Solar Pro2 (`reasoning_effort=high`, `temperature=0.0`)
+ - **System prompt**: `prompts/rule_aug.txt` (baseline prompt + RULE-PRECOMPUTED METADATA section)
+ - **Preprocessing**: `rules.py`, which inline-replaces Korean numbers (만/억/조) plus their units and adds the published-date anchor
+
+ ## Key changes (vs baseline)
+
+ 1. **Inline number replacement**: unit conversions such as `120억원` → `12 billion Korean won` are handled by deterministic code. The model outputs the English expression as-is.
+ 2. **Published-date anchor**: a single `[Article published date: YYYY-MM-DD]` line is appended to the very end of the system prompt. The model uses it as the reference for resolving relative dates such as "지난 X일", "지난달", and "올해" correctly.
+ 3. **Removed the baseline's long DATE TRANSLATION RULE section**, simplifying it to the single line INSTRUCTION #5.
+
+ ## Flow
+
+ 1. Enter an article URL (crawled with Firecrawl) or paste the Korean body text directly
+ 2. **Article published date (required, `YYYY-MM-DD`)**: manual in the demo, filled in automatically from the CMS in the real service
+ 3. Rule preprocessing → Solar Pro2 call → English translation output
+ 4. Write a qualitative comment → persisted to the dataset on submission
+
+ ## Environment variables (HF Space → Settings → Repository secrets)
+
+ ```
+ UPSTAGE_API_KEY=... # required
+ FIRECRAWL_API_KEY=... # when using URL crawling
+ HF_TOKEN=hf_... # persistent comment storage
+ FEEDBACK_DATASET_REPO=user/...-feedback # persistent comment storage
+ ```
+
+ `FEEDBACK_DATASET_REPO` must be a dataset-type repository created in advance.
+
+ ## Running locally
+
+ ```bash
+ pip install -r requirements.txt
+ cp .env.example .env # fill in the keys
+ python app.py
+ ```
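The inline number replacement described in the README can be illustrated with a minimal sketch. It is not the repository's `rules.py`: the names `MAGNITUDES`, `UNITS`, `PATTERN`, `to_value`, `to_english`, and `inline_replace` are hypothetical, only three units are covered, and approximation markers are ignored. It assumes magnitudes appear in descending order (조, 억, 만), as in ordinary Korean number writing.

```python
import re

# Assumed subset of markers and units, chosen for illustration only.
MAGNITUDES = (("조", 10**12), ("억", 10**8), ("만", 10**4))
UNITS = {"원": "Korean won", "달러": "dollars", "명": "people"}

# One or more "<digits><marker>" groups, an optional digit tail, then a unit.
PATTERN = re.compile(
    r"(?P<num>(?:\d[\d,]*(?:\.\d+)?[조억만])+\d*)(?P<unit>원|달러|명)"
)


def to_value(num: str) -> float:
    """Sum each '<digits><marker>' group of a compound number such as '1조4000억'."""
    total, rest = 0.0, num.replace(",", "")
    for marker, scale in MAGNITUDES:
        head, sep, tail = rest.partition(marker)
        if sep:  # marker present: consume the digits in front of it
            total += float(head) * scale
            rest = tail
    return total + (float(rest) if rest else 0.0)


def to_english(value: float, unit: str) -> str:
    """Pick the largest English scale word that fits; small values stay plain."""
    for name, scale in (("trillion", 10**12), ("billion", 10**9), ("million", 10**6)):
        if value >= scale:
            # ':g' keeps at most 6 significant digits, enough for a sketch.
            return f"{value / scale:g} {name} {UNITS[unit]}"
    return f"{value:,.0f} {UNITS[unit]}"


def inline_replace(text: str) -> str:
    """Replace every matched Korean number+unit span with its English form."""
    return PATTERN.sub(lambda m: to_english(to_value(m["num"]), m["unit"]), text)


print(inline_replace("매출 120억원, 영업이익 1조4000억원"))
```

Doing the arithmetic in code rather than in the model keeps the magnitude conversion (억 is 10^8, while "billion" is 10^9) deterministic, which is the motivation the README gives for this step.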
app.py ADDED
@@ -0,0 +1,439 @@
+ """
+ Korean News Translator — Final (Rule-augmented) Qualitative Eval
+
+ Single-version demo for production confirmation:
+   - Solar Pro2 (reasoning_effort=high, temperature=0.0)
+   - prompts/rule_aug.txt — baseline prompt + RULE-PRECOMPUTED METADATA section
+   - rules.py — Korean number→English unit inline replacement + article-date anchor
+
+ Flow:
+   1. User provides article URL or pastes Korean body
+   2. User MUST provide article published date (YYYY-MM-DD) — manual in this demo;
+      in production this comes from CMS publish timestamp automatically.
+   3. App preprocesses: inline-replaces Korean numbers, appends date to system prompt
+   4. App calls Solar Pro2 once → shows English translation
+   5. User leaves a qualitative comment (no rating/vote) → persisted
+
+ Persistence:
+   - If HF_TOKEN + FEEDBACK_DATASET_REPO set → push/pull JSONL on HF Datasets
+   - Otherwise → local ./feedback.jsonl (dev mode)
+ """
+
+ from __future__ import annotations
+
+ import json
+ import os
+ import re
+ import time
+ import uuid
+ from datetime import datetime
+ from pathlib import Path
+
+ import gradio as gr
+ import pandas as pd
+ import requests
+ from dotenv import load_dotenv
+ from openai import OpenAI
+
+ from rules import preprocess, system_prompt_date_suffix
+
+ load_dotenv()
+
+ # ── Config ────────────────────────────────────────────────────────────────
+ UPSTAGE_BASE_URL = "https://api.upstage.ai/v1"
+ UPSTAGE_API_KEY = os.getenv("UPSTAGE_API_KEY") or os.getenv("OPENAI_API_KEY")
+ FIRECRAWL_API_KEY = os.getenv("FIRECRAWL_API_KEY")
+ HF_TOKEN = os.getenv("HF_TOKEN")
+ FEEDBACK_DATASET_REPO = os.getenv("FEEDBACK_DATASET_REPO")
+ LOCAL_FEEDBACK_FILE = Path(__file__).parent / "feedback.jsonl"
+
+ SCRIPT_DIR = Path(__file__).parent
+ SYSTEM_PROMPT = (SCRIPT_DIR / "prompts" / "rule_aug.txt").read_text(encoding="utf-8")
+
+ MODEL = "solar-pro2"
+ REASONING_EFFORT = "high"
+ TEMPERATURE = 0.0
+
+ client = OpenAI(api_key=UPSTAGE_API_KEY, base_url=UPSTAGE_BASE_URL)
+
+
+ # ── Translation ───────────────────────────────────────────────────────────
+ def call_upstage(system_prompt: str, user_text: str) -> str:
+     resp = client.chat.completions.create(
+         model=MODEL,
+         messages=[
+             {"role": "system", "content": system_prompt},
+             {"role": "user", "content": user_text},
+         ],
+         temperature=TEMPERATURE,
+         reasoning_effort=REASONING_EFFORT,
+         max_tokens=8000,
+     )
+     # content can be None (e.g. an empty completion), so guard before strip()
+     return (resp.choices[0].message.content or "").strip()
+
+
+ _META_MARKERS = (
+     "verification checklist", "checklist confirmation",
+     "before submitting", "### verification", "**verification",
+ )
+
+
+ def _strip_meta_trailer(text: str) -> str:
+     """Cut the output at the earliest checklist-style marker, if any."""
+     if not text:
+         return text
+     lowered = text.lower()
+     earliest = len(text)
+     for marker in _META_MARKERS:
+         idx = lowered.find(marker)
+         if 0 <= idx < earliest:
+             earliest = idx
+     if earliest < len(text):
+         return text[:earliest].rstrip()
+     return text
+
+
+ # ── Crawling (same as reference) ──────────────────────────────────────────
+ def crawl_article(url: str) -> tuple[str, str]:
+     if not FIRECRAWL_API_KEY:
+         return "", "Firecrawl API 키가 설정되지 않았습니다. 본문을 직접 붙여넣어 주세요."
+     if not url.strip():
+         return "", "URL을 입력해주세요."
+
+     payload = {
+         "url": url,
+         "formats": [{
+             "type": "json",
+             "prompt": (
+                 "뉴스 기사의 본문 텍스트만 추출하세요.\n\n"
+                 "반드시 제외할 것:\n"
+                 "- 기사 제목, 기자/저자 이름, 입력일/업데이트일\n"
+                 "- **이미지/사진/그래픽/일러스트/도표/인포그래픽 캡션**\n"
+                 "- 광고, 편집자 주, 관련기사 링크\n"
+                 "- 추천기사, 댓글, 구독 안내, 사이트 내비게이션, 인기기사/핫뉴스 위젯\n"
+                 "- '저작권자 ⓒ ...', 'ⓒ' 표시 단락\n\n"
+                 "오직 기자가 작성한 본문 단락만 순서대로 이어서 반환하세요. "
+                 '형식: {"body": "본문 텍스트 전체 (단락은 \\n\\n 으로 구분)"}'
+             ),
+         }],
+         "onlyMainContent": True, "blockAds": True, "removeBase64Images": True,
+     }
+     try:
+         r = requests.post(
+             "https://api.firecrawl.dev/v2/scrape",
+             headers={"Authorization": f"Bearer {FIRECRAWL_API_KEY}",
+                      "Content-Type": "application/json"},
+             json=payload, timeout=(10, 180),
+         )
+         if r.status_code != 200:
+             return "", f"스크래핑 실패 ({r.status_code}): {r.text[:200]}"
+         body = ((r.json().get("data") or {}).get("json") or {}).get("body", "").strip()
+         if not body:
+             return "", "본문을 찾을 수 없습니다."
+         return body, ""
+     except Exception as e:
+         return "", f"크롤링 오류: {e}"
+
+
+ # ── Persistence ───────────────────────────────────────────────────────────
+ def _dataset_file_path() -> str:
+     return "feedback.jsonl"
+
+
+ def append_feedback(row: dict) -> None:
+     if HF_TOKEN and FEEDBACK_DATASET_REPO:
+         try:
+             from huggingface_hub import hf_hub_download
+             remote_path = hf_hub_download(
+                 repo_id=FEEDBACK_DATASET_REPO,
+                 filename=_dataset_file_path(),
+                 repo_type="dataset", token=HF_TOKEN, force_download=True,
+             )
+             with open(remote_path, "rb") as src, LOCAL_FEEDBACK_FILE.open("wb") as dst:
+                 dst.write(src.read())
+         except Exception as e:
+             # NOTE: any download failure (not only a missing file) resets the
+             # local copy, so a transient network error followed by a successful
+             # upload below would overwrite the remote history with one row.
+             print(f"[INFO] No existing dataset file — resetting local: {e}")
+             if LOCAL_FEEDBACK_FILE.exists():
+                 LOCAL_FEEDBACK_FILE.unlink()
+
+     with LOCAL_FEEDBACK_FILE.open("a", encoding="utf-8") as f:
+         f.write(json.dumps(row, ensure_ascii=False) + "\n")
+
+     if HF_TOKEN and FEEDBACK_DATASET_REPO:
+         try:
+             from huggingface_hub import HfApi
+             HfApi(token=HF_TOKEN).upload_file(
+                 path_or_fileobj=str(LOCAL_FEEDBACK_FILE),
+                 path_in_repo=_dataset_file_path(),
+                 repo_id=FEEDBACK_DATASET_REPO,
+                 repo_type="dataset",
+                 commit_message=f"feedback {row.get('id','')}",
+             )
+         except Exception as e:
+             print(f"[WARN] HF dataset upload failed: {e}")
+
+
+ def load_feedback() -> list[dict]:
+     if HF_TOKEN and FEEDBACK_DATASET_REPO:
+         try:
+             from huggingface_hub import hf_hub_download
+             p = hf_hub_download(
+                 repo_id=FEEDBACK_DATASET_REPO,
+                 filename=_dataset_file_path(),
+                 repo_type="dataset", token=HF_TOKEN, force_download=True,
+             )
+             # Use a context manager so the file handle is closed promptly.
+             with open(p, encoding="utf-8") as f:
+                 return [json.loads(line) for line in f if line.strip()]
+         except Exception as e:
+             print(f"[INFO] HF dataset fetch failed ({e}). Returning empty.")
+             return []
+     if LOCAL_FEEDBACK_FILE.exists():
+         with LOCAL_FEEDBACK_FILE.open(encoding="utf-8") as f:
+             return [json.loads(line) for line in f if line.strip()]
+     return []
+
+
+ # ── Validation ────────────────────────────────────────────────────────────
+ DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
+
+
+ def _valid_date(s: str) -> bool:
+     if not s or not DATE_RE.match(s.strip()):
+         return False
+     try:
+         datetime.strptime(s.strip(), "%Y-%m-%d")
+         return True
+     except ValueError:
+         return False
+
+
+ # ── Translation pipeline ──────────────────────────────────────────────────
+ def run_translation(url: str, direct_text: str, article_date: str):
+     """Generator: streams progressive UI updates."""
+     if not _valid_date(article_date):
+         yield ("", "", "", "❌ 기사 발행일을 `YYYY-MM-DD` 형식으로 입력해주세요. "
+                "(실제 서비스에서는 CMS의 발행 시각이 자동 입력됩니다.)",
+                gr.update(visible=False), {})
+         return
+
+     src = direct_text.strip()
+     crawl_note = ""
+
+     if not src:
+         if not url.strip():
+             yield ("", "", "", "❌ URL 또는 한국어 본문 중 하나를 입력해주세요.",
+                    gr.update(visible=False), {})
+             return
+         # The crawl request uses a 180-second read timeout, so report that bound.
+         yield ("", "", "", "🌐 URL 크롤링 중... (최대 3분)",
+                gr.update(visible=False), {})
+         crawled, err = crawl_article(url)
+         if err:
+             yield ("", "", "", f"❌ {err}", gr.update(visible=False), {})
+             return
+         src = crawled
+         crawl_note = f"(크롤링 완료, {len(src)} chars) "
+
+     # Apply rule preprocessing
+     augmented_user_text, info = preprocess(src, article_date)
+     system_prompt = SYSTEM_PROMPT + system_prompt_date_suffix(article_date)
+
+     rule_summary_lines = [
+         f"📅 발행일 anchor: `{info['date']}`",
+     ]
+     if info["conversions"]:
+         rule_summary_lines.append(
+             f"🔢 인라인 치환 {len(info['conversions'])}건:"
+         )
+         for c in info["conversions"][:12]:
+             rule_summary_lines.append(f" - `{c['span']}` → **{c['english']}**")
+         if len(info["conversions"]) > 12:
+             rule_summary_lines.append(f" - … 외 {len(info['conversions']) - 12}건")
+     else:
+         rule_summary_lines.append("🔢 인라인 치환: 없음")
+     rule_summary = "\n".join(rule_summary_lines)
+
+     yield (src, augmented_user_text, "",
+            f"✅ 원문 추출 완료 {crawl_note}— 룰 적용 후 Solar Pro2 호출 중...",
+            gr.update(visible=False), {})
+
+     t0 = time.time()
+     try:
+         translation = _strip_meta_trailer(call_upstage(system_prompt, augmented_user_text))
+     except Exception as e:
+         yield (src, augmented_user_text, "",
+                f"❌ 번역 오류: {e}", gr.update(visible=False), {})
+         return
+     elapsed = time.time() - t0
+
+     state_val = {
+         "source": src,
+         "source_url": url.strip(),
+         "article_date": article_date.strip(),
+         "augmented_user_text": augmented_user_text,
+         "translation": translation,
+         "rule_info": {
+             "date": info["date"],
+             "conversions": info["conversions"],
+         },
+         "elapsed_sec": round(elapsed, 1),
+     }
+     status = (
+         f"✅ 번역 완료 ({elapsed:.1f}s)\n\n{rule_summary}\n\n"
+         "정성평가 의견을 남겨주세요."
+     )
+     yield (src, augmented_user_text, translation, status,
+            gr.update(visible=True), state_val)
+
+
+ def submit_comment(comment: str, rater_name: str, state: dict):
+     if not state:
+         return "먼저 번역을 생성해주세요.", gr.update()
+     if not (comment or "").strip():
+         return "코멘트를 입력해주세요.", gr.update()
+
+     row = {
+         "id": uuid.uuid4().hex[:12],
+         "timestamp": datetime.now().isoformat(timespec="seconds"),
+         "rater": (rater_name or "anonymous").strip()[:40],
+         "comment": comment.strip(),
+         "source_url": state.get("source_url", ""),
+         "article_date": state.get("article_date", ""),
+         "source_full": state.get("source", ""),
+         "augmented_user_text": state.get("augmented_user_text", ""),
+         "translation": state.get("translation", ""),
+         "rule_info": state.get("rule_info", {}),
+         "source_excerpt": (state.get("source", "")[:200]).replace("\n", " "),
+         "elapsed_sec": state.get("elapsed_sec", 0),
+         "model": MODEL,
+         "reasoning_effort": REASONING_EFFORT,
+         "temperature": TEMPERATURE,
+     }
+     append_feedback(row)
+     return (
+         f"### ✅ 제출 완료\n\nID: `{row['id']}` — 감사합니다!",
+         gr.update(value=""),
+     )
+
+
+ def refresh_results():
+     rows = load_feedback()
+     cols = ["timestamp", "rater", "comment", "article_date",
+             "source_url", "source_excerpt"]
+     if not rows:
+         return pd.DataFrame(columns=cols), "아직 제출된 코멘트가 없습니다."
+     df = pd.DataFrame(rows)
+     show_cols = [c for c in cols if c in df.columns]
+     df = df[show_cols].sort_values("timestamp", ascending=False)
+     return df, f"**총 {len(df)}건**의 정성 코멘트가 누적되어 있습니다."
+
+
+ # ── UI ────────────────────────────────────────────────────────────────────
+ CSS = """
+ .gradio-container { font-family: system-ui, -apple-system, sans-serif !important; }
+ .trans-box textarea { font-size: 0.95em !important; line-height: 1.55 !important; }
+ .feedback-card {
+     background: linear-gradient(180deg, #f8fafc 0%, #eef2ff 100%) !important;
+     border: 2px solid #6366f1 !important;
+     border-radius: 12px !important;
+     padding: 20px !important;
+     margin-top: 16px !important;
+ }
+ .submit-btn { font-size: 1.1em !important; padding: 12px !important;
+     margin-top: 8px !important; }
+ """
+
+ INTRO = (
+     "# Korean → English 번역 — 룰 기반 전처리 적용 버전\n\n"
+     "**최종 운영 후보(v1)**: Solar Pro2 + `rules.py` 전처리 + `prompts/rule_aug.txt`. "
+     "한국어 숫자 단위(만/억/조)와 퍼센트 포인트는 결정론적 룰로 인라인 치환되고, "
+     "기사 발행일은 시스템 프롬프트의 anchor로 주입되어 상대 날짜(지난 X일 등)를 정확히 해석합니다.\n\n"
+     "기사 URL 또는 한국어 본문, 그리고 **기사 발행일(필수)** 을 입력하세요. "
+     "발행일은 이번 데모에서만 수동 입력이며, 실제 서비스에서는 CMS의 발행 시각이 자동으로 들어갑니다."
+ )
+
+ with gr.Blocks(title="Korean News Translator — Final", css=CSS,
+                theme=gr.themes.Default()) as demo:
+     gr.Markdown(INTRO)
+
+     with gr.Tabs():
+         # ── Tab 1: 번역 + 코멘트 ──────────────────────────────────────
+         with gr.TabItem("번역 & 정성평가"):
+             with gr.Row():
+                 url_in = gr.Textbox(label="기사 URL (선택)",
+                                     placeholder="https://...", scale=3)
+                 date_in = gr.Textbox(
+                     label="기사 발행일 (필수, YYYY-MM-DD)",
+                     placeholder="2026-04-30",
+                     scale=2,
+                     info="실제 서비스에서는 자동 입력. 데모에서만 수동.",
+                 )
+                 gen_btn = gr.Button("번역 생성", variant="primary", scale=1)
+
+             with gr.Accordion("직접 한국어 본문 입력 (URL 대신)", open=False):
+                 text_in = gr.Textbox(label="한국어 본문", lines=8,
+                                      placeholder="본문을 붙여넣기 하세요")
+
+             status_md = gr.Markdown()
+
+             with gr.Row():
+                 with gr.Column():
+                     gr.Markdown("### 원문 (Korean)")
+                     src_box = gr.Textbox(label="", lines=18, interactive=False,
+                                          elem_classes=["trans-box"])
+                 with gr.Column():
+                     gr.Markdown("### 룰 적용 후 user message (인라인 치환)")
+                     aug_box = gr.Textbox(label="", lines=18, interactive=False,
+                                          elem_classes=["trans-box"])
+                 with gr.Column():
+                     gr.Markdown("### 영문 번역 결과")
+                     out_box = gr.Textbox(label="", lines=18, interactive=False,
+                                          elem_classes=["trans-box"])
+
+             with gr.Group(visible=False, elem_classes=["feedback-card"]) as fb_group:
+                 gr.Markdown("## 📝 정성 코멘트")
+                 with gr.Row():
+                     rater_in = gr.Textbox(
+                         label="닉네임 (선택)", placeholder="anonymous",
+                         max_lines=1, scale=1,
+                     )
+                 comment_in = gr.Textbox(
+                     label="평가 의견 (자유 기술)",
+                     lines=5,
+                     placeholder=(
+                         "자연스러움 / 뉴스체 / 직역투 / 오역 / 누락 / "
+                         "숫자·단위 / 날짜·시점 / 인용 처리 / 고유명사 등 "
+                         "자유롭게 남겨주세요."
+                     ),
+                 )
+                 submit_btn = gr.Button(
+                     "✅ 코멘트 제출", variant="primary",
+                     elem_classes=["submit-btn"],
+                 )
+                 ack_md = gr.Markdown()
+
+             state_box = gr.State({})
+
+             gen_btn.click(
+                 fn=run_translation,
+                 inputs=[url_in, text_in, date_in],
+                 outputs=[src_box, aug_box, out_box, status_md, fb_group, state_box],
+             )
+             submit_btn.click(
+                 fn=submit_comment,
+                 inputs=[comment_in, rater_in, state_box],
+                 outputs=[ack_md, comment_in],
+             )
+
+         # ── Tab 2: 누적 코멘트 ────────────────────────────────────────
+         with gr.TabItem("누적 코멘트"):
+             refresh_btn = gr.Button("🔄 새로고침")
+             summary_md = gr.Markdown()
+             results_df = gr.Dataframe(
+                 label="제출된 정성 코멘트",
+                 headers=["timestamp", "rater", "comment", "article_date",
+                          "source_url", "source_excerpt"],
+                 wrap=True, interactive=False,
+             )
+             refresh_btn.click(fn=refresh_results, outputs=[results_df, summary_md])
+             demo.load(fn=refresh_results, outputs=[results_df, summary_md])
+
+
+ if __name__ == "__main__":
+     demo.launch()
prompts/rule_aug.txt ADDED
@@ -0,0 +1,144 @@
+ #ROLE:
+ You are a professional Korean-to-English news translator. Your task is to translate Korean news articles into high-quality English while following the rules and verification steps below.
+
+ #GROUND RULE 1
+ CRITICAL: Preserve Special Tokens
+ 1. Preserve every <fig></fig>, <description></description>, <h></h>, and <quote></quote> token exactly as they appear in the original text, maintaining both their position and format without any modifications.
+ 2. The total number of each token in the translation must exactly match the original Korean text.
+ 3. <description></description> and <h></h> tokens contain content that should be translated while the tags themselves must remain intact and correctly opened and closed.
+ - The inner content of <description></description> and <h></h> must be translated into fluent, professional English.
+ - You must preserve the tags exactly as written, including both opening (<description>, <h>) and closing (</description>, </h>) tokens
+ 4. <fig></fig> and <quote></quote> tokens should be output exactly as they appear - do not add any content between these tokens.
+ - If <quote></quote> appears MID-SENTENCE in Korean → Must appear MID-SENTENCE in English
+ - If <quote></quote> appears BETWEEN sentences in Korean → Must appear BETWEEN sentences in English
+
+ #GROUND RULE 2
+ CRITICAL: English Proper Noun Handling (NEVER include Korean text)
+ 1. ALWAYS PRESERVE the exact spelling of any English text that appears in the input, regardless of how famous or well-known the person/organization is
+ 2. NEVER use your knowledge of standard English spellings for names - use ONLY what appears in the input
+ 3. PRESERVE English proper nouns exactly as they appear in the input (organization names, place names, etc.)
+ 4. TRANSLATE general English terms, mathematical terms, and common words naturally, even if they appear in English in the input
+ 5. PARSING ERRORS: If English text appears to be incorrectly parsed Korean text (e.g., "Hyundai인" → "현대인" → modern people), translate based on the intended Korean meaning, not the literal English text
+ 6. NEVER include Korean text or romanization alongside English translations - do not use parallel notation with Korean terms in parentheses or brackets
+
+ Examples:
+ "Yi Cheul soo 대통령" → "President Yi Cheul soo" (preserve name spelling, NOT "Yi Cheul-soo")
+ "Jisoo함수" → "Exponential function" (translate mathematical term, even if it appears as "Jisoo function" in input)
+ "Lee Young-Hee" → "Lee Young-Hee" (preserve exact spelling, NOT "Lee Younghee")
+ "Park Ji Won" → "Park Ji Won" (preserve exact spelling, NOT "Park Ji-won")
+
+ #INSTRUCTIONS:
+ FIRST PRIORITIES:
+ 1. KEEP every detail, meaning, original tone/style, wording, and format exactly as written in Korean.
+ 2. DO NOT use any prior knowledge of names, titles, positions, or proper nouns. PRESERVE ONLY what appears in the input
+ 3. DO NOT add, omit, or change any information from the original text.
+ 4. KEEP the same subject-object relationships and speaker's intention from Korean.
+ 5. Resolve relative date expressions (e.g., 지난 X일, 지난달, 작년, 올해) using the [Article published date: YYYY-MM-DD] line at the bottom of this system prompt as the year/month anchor.
+
+ SECOND PRIORITIES:
+ - Use vocabulary, tone, and style that are natural and appropriate for a professional English-language news article.
+ - Ensure the translation is accurate and faithful to the original content, while sounding natural to native English readers.
+ - Do not summarize or paraphrase. Translate all details as they appear.
+ - Do not include any explanations or extra comments.
+ - Do not generate or add any headline, title, or summary. Only provide the translated article body.
+ - Do not add parentheses(), semicolon(;), bold (**), italic (*), or any markdown formatting unless present in the original.
+
+ # Parentheses to Comma Conversion
+ - Convert parenthetical expressions to comma-separated phrases for better readability
+ - Korean format: 아이유(이지은·31) → English format: IU, Lee Ji-eun, 31 years old
+ - Korean format: 서울시청(시민 서비스 센터) → English format: Seoul City Hall, the Citizen Service Center
+ - Korean format: 오후 3시(현지시간) → English format: 3 p.m., local time
+ - Maintain the same informational content while using commas instead of parentheses for smoother English flow
+
+ # Names and Ages
+ - Korean format: 김철수(35) → English format: Kim Chulsoo, 35 years old
+ - Always spell out "years old" in full and use a comma
+
+ # Anonymous Source Attribution
+ - Use the "a source from [organization]" format for anonymous statements attributed to "관계자"
+ - Korean format: 정부 관계자에 따르면 → English format: According to a source from the government
+
+ # Anonymous Name Format - CRITICAL TRANSLATION RULE
+
+ ## Core Principles
+ - **NEVER assign gender unless explicitly stated in the original Korean text**
+ - **Maintain anonymity level exactly as presented in source material**
+ - **Only reflect information actually provided in the original**
+
+ ## Basic Anonymized Name Patterns
+ - For anonymized names ending in 모씨: translate as "[Family Name]"
+ - Examples: 김모씨 → "Kim", 우모씨 → "Woo", 정모씨 → "Jeong"
+
+ ## NEVER translate 모씨 patterns as:
+ ❌ "Lee Mo" / "Kim Mo" / "[Name] Mo"
+
+ # Title Translation (books, reports, movies, and other content)
+ - Use literal translation to preserve the original meaning and nuance of the title as much as possible.
+ - Format book titles with single quotes 'title' instead of asterisks *title*
+
+ # Word Translation Guide
+ - Translate "본지" → "this newspaper"
+ - Death references → "died"
+ - AVOID euphemisms like 'passed away' or 'departed'. Use 'died' unless there's a compelling reason to do otherwise (e.g. quoting someone)
+
+ # Knowledge-Based Information Updates
+ Use ONLY the specific updated information provided in the information guidelines
+ Based on the current timeframe - Update key figures and titles accurately:
+
+ ## Official Title Updates
+
+ 1. South Korean President: Lee Jae Myung
+ Elected June 3, 2025, in office since June 4, 2025 (current)
+ Examples: "이재명 후보" → "President Lee Jae Myung" / "이재명 대표" → "President Lee Jae Myung"
+
+ 2. Former South Korean President: Yoon Suk Yeol
+ Removed from office April 4, 2025 (impeachment upheld)
+ Example: "윤석열 대통령" → "former President Yoon Suk Yeol"
+
+ 3. US President: Donald Trump
+ Current term: January 20, 2025~present (47th president)
+ Example: "도널드 트럼프 대통령" → "President Donald Trump"
+
+ 4. Former US President: Joe Biden
+ Term: January 20, 2021~January 20, 2025 (46th president)
+ Example: "조 바이든 대통령" → "former President Joe Biden"
+
+ 5. Japanese Prime Minister: Shigeru Ishiba
+ In office since October 1, 2024 ~ September 7, 2025
+ Example: "기시다 총리" → "Prime Minister Shigeru Ishiba"
+
+ # RULE-PRECOMPUTED METADATA
+ The source text has been preprocessed before reaching you:
+ 1. Korean monetary/number expressions (with 조 / 억 / 만 markers) have been REPLACED INLINE with their English-unit equivalents (e.g., "12 billion Korean won", "5.4 trillion Korean won", "89.7 billion dollars"). When you see such an English-unit expression in the source, output it VERBATIM in the translation — do not modify, recompute, or reformat the magnitude. Treat these expressions as pre-translated tokens that must pass through unchanged.
+ 2. The "[Article published date: YYYY-MM-DD]" line at the very bottom of this system prompt is the year/month anchor for relative dates. "지난 X일" means the Xth of the same month as the published date (or the most recent past Xth if X > current day-of-month). Do NOT translate "지난 X일" as "the Xth of last month" unless the source explicitly contains "지난달".
+ 3. NEVER translate "지난 X일" as "X days ago" or any relative-distance phrasing — it always refers to a specific calendar day of the same month, so render it as "on [Month] [X]" (e.g., "on November 3") or "on the Xth".
+
+ #VERIFICATION
+ CRITICAL WARNING: Failure to follow these rules will result in incorrect translation.
+ CHECKLIST:
+ 1. English proper nouns preserved exactly as they appear in input without applying standard spellings
+ 2. All original numbers preserved exactly and number units properly converted (no omissions and no additions)
+ 3. MANDATORY: Keep all tokens (<fig></fig>, <quote></quote>, <description></description>, <h></h>) in exact same POSITION as original with exact same COUNT maintaining both OPENING and CLOSING tokens
+ 3-1. Verify that each <h></h> token has both opening and closing tags, intact and correctly placed.
+ 3-2. If <quote></quote> tokens EXIST, verify count by checking numbered identifiers inside tags - count quote1, quote2, quote3, etc. in original, then ensure translation has exactly the same numbered tokens.
+ 4. Professional English news style while preserving original tone
+ 5. Every sentence is translated including all necessary words and nuances from the original
+ 6. Age format: "Name, XX years old" with proper spacing
+ 7. Subject-object relationships maintained
+ 8. NO KOREAN OUTPUT: Final output must contain ONLY English text
+ 9. DO NOT add parentheses(), semicolon(;), bold (**), italic (*), or any markdown formatting unless present in the original.
+ 10. For lists, use COMMAS instead of semicolons
+
+ # FINAL TOKEN VERIFICATION BEFORE OUTPUT:
+ 1. Count ALL special tokens in the original Korean text: <fig></fig>, <quote></quote>, <description></description>, <h></h>
+ 2. Ensure your translation has the EXACT SAME NUMBER of each token type
+ 3. If original has ZERO tokens, your translation must have ZERO tokens
+ 4. If original has tokens, DO NOT omit any - preserve every single one in the exact same position
+
+ !!! REMEMBER: Verify and correct every number individually during the thinking process, and apply the corrections to the final output !!!
+ !!! CRITICAL: Do not use parallel notation with Korean terms in parentheses or brackets !!!
+
+ --------
+
+ Provide ONLY the FINAL ENGLISH professional news article without any explanation text. (NEVER include Korean text even in examples or parentheses)
+ Do not output any explanations, intermediate reasoning, or translation - only the final article.
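The "지난 X일" rule in the RULE-PRECOMPUTED METADATA section asks the model to do the date resolution itself. To make the intended behaviour concrete, here is a minimal sketch of the same rule as deterministic code. Both function names are hypothetical, the exact string produced by the repository's `system_prompt_date_suffix` is not shown in this commit view, and the leading blank lines in the suffix are an assumption.

```python
from datetime import date


def build_date_suffix(published: str) -> str:
    """Hypothetical helper: format the anchor line appended to the system prompt."""
    date.fromisoformat(published)  # raises ValueError for an invalid date
    return f"\n\n[Article published date: {published}]"


def resolve_jinan_day(published: date, day: int) -> date:
    """Resolve '지난 X일' to the most recent Xth on or before the published date."""
    if not 1 <= day <= 31:
        raise ValueError(f"day out of range: {day}")
    year, month = published.year, published.month
    if day > published.day:
        # The Xth has not happened yet this month, so step back one month.
        year, month = (year - 1, 12) if month == 1 else (year, month - 1)
    while True:
        try:
            return date(year, month, day)
        except ValueError:
            # That month has no Xth (e.g. February 30), so keep stepping back.
            year, month = (year - 1, 12) if month == 1 else (year, month - 1)


print(build_date_suffix("2025-11-20").strip())
print(resolve_jinan_day(date(2025, 11, 20), 3))   # same month
print(resolve_jinan_day(date(2025, 11, 2), 28))   # previous month
```

A helper like this could also serve as a reference when reviewing translations, since it gives the calendar day the prompt expects for a given published date.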
requirements.txt ADDED
@@ -0,0 +1,7 @@
+ gradio==4.44.1
+ starlette<0.40
+ huggingface_hub>=0.23.0,<0.26
+ openai>=1.0.0
+ python-dotenv>=1.0.0
+ pandas>=2.0.0
+ requests>=2.31.0
rules.py ADDED
@@ -0,0 +1,266 @@
+ """Rule-based preprocessing: Korean number→English unit conversion + article date injection."""
+
+ import re
+ from datetime import datetime
+
+ # Korean magnitude markers (cumulative within a compound number)
+ KO_MAGNITUDE = [("조", 10**12), ("억", 10**8), ("만", 10**4)]
+
+ # Recognized currency / counter units. Currency names follow international
+ # naming conventions ("Korean won", "Japanese yen", "Chinese yuan"). Counter
+ # units stay simple.
+ UNIT_MAP = {
+     "원": "Korean won",
+     "달러": "dollars",
+     "유로": "euros",
+     "엔": "Japanese yen",
+     "위안": "Chinese yuan",
+     "명": "people",
+     "개": "units",
+     "건": "cases",
+     "대": "vehicles",
+     "톤": "tons",
+ }
+
+ # Compound Korean number followed by unit. The number portion may contain any
+ # combination of 조 / 억 / 만 plus a tail digit run. The 억 and 만 groups may
+ # carry an approximation marker (여 / 가량 / 쯤 / 약) between the digits and
+ # the magnitude. Matches without any digit are skipped by the caller, so bare
+ # units are not converted.
+ APPROX = r"(?:여|여\s+|가량|가량\s+|쯤|쯤\s+|약\s+|약)"
+ NUM_UNIT_PATTERN = re.compile(
+     r"(?P<num>"
+     r"(?:\d[\d,]*(?:\.\d+)?\s*조\s*)?"
+     r"(?:\d[\d,]*(?:\.\d+)?(?:\s*" + APPROX + r")?\s*억\s*)?"
+     r"(?:\d[\d,]*(?:\.\d+)?(?:\s*" + APPROX + r")?\s*만\s*)?"
+     r"(?:\d[\d,]*(?:\.\d+)?)?"
+     r")\s*(?P<unit>원|달러|유로|엔|위안|명|개|건|대|톤)"
+ )
+
+ # Percentage points. Matches all common Korean variants of "X%p / X%포인트 /
+ # X퍼센트 포인트". Must be detected and replaced BEFORE plain numbers so that
+ # "10%" doesn't slip through unattended (we want "10 percentage points", never
+ # just "10%").
+ PERCENT_POINT_PATTERN = re.compile(
+     r"(?P<num>\d[\d,]*(?:\.\d+)?)\s*"
+     r"(?:%\s*p|%\s*포인트|퍼센트\s*포인트)",
+     re.IGNORECASE,
+ )
49
+
50
+
+ def parse_korean_number(text: str) -> float | None:
+     """Parse a compound Korean magnitude string (e.g., '1조4000억', '120억', '5000만',
+     '7800여억') → numeric value. Approximation markers (여 / 가량 / 쯤 / 약) are stripped."""
+     # Drop commas and every kind of whitespace: NUM_UNIT_PATTERN allows \s
+     # (tabs, newlines) between segments, not only plain spaces.
+     s = re.sub(r"[\s,]+", "", text)
+     # Strip approximation markers anywhere they appear within the number string
+     s = re.sub(r"(여|가량|쯤|약)", "", s)
+     total = 0.0
+     matched = False
+     for marker, mult in KO_MAGNITUDE:
+         m = re.match(r"(\d+(?:\.\d+)?)" + marker, s)
+         if m:
+             total += float(m.group(1)) * mult
+             s = s[m.end():]
+             matched = True
+     if s:
+         try:
+             total += float(s)
+             matched = True
+         except ValueError:
+             pass
+     return total if matched else None
+
+
+ def format_english_amount(amount: float, unit_kor: str) -> str:
+     """Format amount in English unit. Uses 'million/billion/trillion' for large numbers."""
+     unit_en = UNIT_MAP.get(unit_kor, unit_kor)
+
+     def _fmt(n: float) -> str:
+         # Preserve up to 4 decimals; trim trailing zeros; thousands-separate the integer part.
+         if abs(n - round(n)) < 1e-9:
+             return f"{int(round(n)):,}"
+         s = f"{n:.4f}".rstrip("0").rstrip(".")
+         if "." in s:
+             int_part, dec = s.split(".")
+             return f"{int(int_part):,}.{dec}"
+         return f"{int(s):,}"
+
+     if amount >= 10**12:
+         return f"{_fmt(amount / 10**12)} trillion {unit_en}"
+     if amount >= 10**9:
+         return f"{_fmt(amount / 10**9)} billion {unit_en}"
+     if amount >= 10**6:
+         return f"{_fmt(amount / 10**6)} million {unit_en}"
+     # < 10^6: write as thousand-separated integer (avoid awkward "135.574 thousand")
+     return f"{_fmt(amount)} {unit_en}"
+
+
+ def detect_korean_numbers(text: str) -> list[dict]:
+     """Return a list of {span, start, end, amount, unit_ko, english} entries found in text.
+
+     Identical spans are reported once, with the offsets of their first occurrence.
+     """
+     results = []
+     seen_spans = set()
+     for m in NUM_UNIT_PATTERN.finditer(text):
+         num_str = m.group("num").strip()
+         unit = m.group("unit")
+         if not re.search(r"\d", num_str):
+             continue  # skip bare unit
+         amount = parse_korean_number(num_str)
+         if amount is None or amount == 0:
+             continue
+         full = f"{num_str}{unit}"
+         # Dedup identical spans (e.g., the same "100억원" appearing many times)
+         if full in seen_spans:
+             continue
+         seen_spans.add(full)
+         results.append({
+             "span": full,
+             "start": m.start(),
+             "end": m.end(),
+             "amount": amount,
+             "unit_ko": unit,
+             "english": format_english_amount(amount, unit),
+         })
+     return results
+
+
+ def parse_article_date(date_str: str) -> str | None:
+     """Normalize an article date to a YYYY-MM-DD string, return None if unparseable."""
+     if not date_str:
+         return None
+     s = date_str.strip()
+     # Try a few common formats, slicing to each format's own width so that a
+     # trailing time component (e.g. "20250315T0900") does not break the compact form.
+     for fmt, width in (("%Y-%m-%d", 10), ("%Y.%m.%d", 10), ("%Y/%m/%d", 10), ("%Y%m%d", 8)):
+         try:
+             return datetime.strptime(s[:width], fmt).strftime("%Y-%m-%d")
+         except ValueError:
+             continue
+     return None
+
+
+ def _format_simple_number(num_str: str) -> str:
+     """Format a plain digit string (with possible commas/decimals) keeping the value as-is."""
+     s = num_str.replace(",", "")
+     try:
+         n = float(s)
+     except ValueError:
+         return num_str
+     if abs(n - round(n)) < 1e-9:
+         return f"{int(round(n)):,}"
+     return f"{n:,g}"
+
+
+ def replace_numbers_inline(text: str) -> tuple[str, list[dict]]:
+     """Replace Korean number+unit spans in `text` with their English equivalents.
+
+     Two passes (each non-overlapping internally), applied in priority order so
+     that longer / more specific patterns win:
+       1. Percentage points ("10%포인트" → "10 percentage points")
+       2. Korean magnitude+unit ("120억원" → "12 billion Korean won")
+
+     Returns (rewritten_text, list_of_replacements_with_offsets_in_original).
+     """
+     matches: list[dict] = []
+     used_ranges: list[tuple[int, int]] = []
+
+     def overlaps(s: int, e: int) -> bool:
+         return any(not (e <= us or s >= ue) for us, ue in used_ranges)
+
+     # Pass 1: percentage points (highest priority)
+     for m in PERCENT_POINT_PATTERN.finditer(text):
+         s, e = m.start(), m.end()
+         if overlaps(s, e):
+             continue
+         num_str = m.group("num")
+         english = f"{_format_simple_number(num_str)} percentage points"
+         matches.append({
+             "start": s,
+             "end": e,
+             "span": text[s:e],
+             "amount": None,
+             "unit_ko": "%포인트",
+             "english": english,
+         })
+         used_ranges.append((s, e))
+
+     # Pass 2: Korean magnitude+currency/counter
+     for m in NUM_UNIT_PATTERN.finditer(text):
+         s, e = m.start(), m.end()
+         if overlaps(s, e):
+             continue
+         num_str = m.group("num").strip()
+         unit = m.group("unit")
+         if not re.search(r"\d", num_str):
+             continue
+         amount = parse_korean_number(num_str)
+         if amount is None or amount == 0:
+             continue
+         english = format_english_amount(amount, unit)
+         matches.append({
+             "start": s,
+             "end": e,
+             "span": text[s:e],
+             "amount": amount,
+             "unit_ko": unit,
+             "english": english,
+         })
+         used_ranges.append((s, e))
+
+     # Sort by position, then apply replacements right-to-left so that the
+     # offsets of earlier matches stay valid.
+     matches.sort(key=lambda r: r["start"])
+     out = text
+     for r in reversed(matches):
+         out = out[:r["start"]] + r["english"] + out[r["end"]:]
+
+     return out, matches
+
+
+ def preprocess(text: str, article_date: str | None) -> tuple[str, dict]:
+     """Inline-replace Korean numbers with English units inside the user text.
+
+     The article date is NOT injected into the user message; callers should append
+     it to the system prompt via `system_prompt_date_suffix()` instead.
+
+     Returns (rewritten_text, debug_info).
+     """
+     norm_date = parse_article_date(article_date) if article_date else None
+     rewritten, replacements = replace_numbers_inline(text)
+     return rewritten, {"date": norm_date, "conversions": replacements}
+
+
+ def system_prompt_date_suffix(article_date: str | None) -> str:
+     """Return the line to append to the system prompt for date anchoring.
+
+     Empty string if no date is available.
+     """
+     norm_date = parse_article_date(article_date) if article_date else None
+     if not norm_date:
+         return ""
+     return f"\n\n[Article published date: {norm_date}]"
+
+
+ # Small system-prompt addendum: explains both the inline numbers and the date line.
+ RULE_PROMPT_ADDENDUM = """
+
+ # RULE-PRECOMPUTED METADATA
+ The source text has been preprocessed before reaching you:
+ 1. Korean monetary/number expressions (with 조 / 억 / 만 markers) have been REPLACED INLINE with their English-unit equivalents (e.g., "12 billion Korean won", "5.4 trillion Korean won", "89.7 billion dollars"). When you see such an English-unit expression in the source, output it VERBATIM in the translation. Do not modify, recompute, or reformat the magnitude. Treat these expressions as pre-translated tokens that must pass through unchanged.
+ 2. If a "[Article published date: YYYY-MM-DD]" line is provided at the end of this prompt, use it to resolve relative date expressions. "지난 X일" means the Xth of the same month as the published date (or the most recent past Xth if X is greater than the published day-of-month). Do NOT translate "지난 X일" as "the Xth of last month" unless the source explicitly contains "지난달".
+ """
+
+
+ if __name__ == "__main__":
+     # Smoke test
+     samples = [
+         ("1023억원과 5조4000억원, 그리고 300만원", "2025-03-15"),
+         ("897억달러 매출", "2025-04-02"),
+         ("지난 19일 회담이 열렸다", "2025-01-23"),
+         ("100억원대 분쟁이 발생", "2024-11-01"),
+         ("5000만원의 보너스를 받았다", "2025-02-10"),
+     ]
+     for text, date in samples:
+         out, info = preprocess(text, date)
+         print(f"\n--- date={date} ---")
+         print(f"input: {text}")
+         print(f"detected: {info['conversions']}")
+         print(f"resolved date: {info['date']}")
+         print(f"augmented:\n{out}")
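The "지난 X일" rule that `RULE_PROMPT_ADDENDUM` hands to the model can also be written out deterministically, which may help when checking model output. The following is a minimal sketch and not part of this commit: `resolve_last_day` is a hypothetical helper, and treating X equal to the publication day as "the same day" is an assumption the prompt does not spell out.

```python
from datetime import date


def _previous_month(year: int, month: int) -> tuple[int, int]:
    """Step one calendar month back, rolling over the year boundary."""
    return (year - 1, 12) if month == 1 else (year, month - 1)


def resolve_last_day(published: date, day: int) -> date:
    """Hypothetical helper: resolve '지난 X일' against the publication date.

    Returns the Xth of the publication month, or the most recent earlier Xth
    when X is greater than the publication day-of-month.
    """
    if not 1 <= day <= 31:
        raise ValueError(f"day of month out of range: {day}")
    year, month = published.year, published.month
    if day > published.day:
        # The Xth has not happened yet this month, so start one month back.
        year, month = _previous_month(year, month)
    while True:
        try:
            return date(year, month, day)
        except ValueError:
            # That month has no such day (e.g. the 31st of February); keep stepping back.
            year, month = _previous_month(year, month)


if __name__ == "__main__":
    print(resolve_last_day(date(2025, 1, 23), 19))  # 2025-01-19
    print(resolve_last_day(date(2025, 2, 3), 28))   # 2025-01-28
```

The first call mirrors the smoke-test sample "지난 19일 회담이 열렸다" with a publication date of 2025-01-23. The second shows the fallback to the previous month when the day has not yet occurred in the publication month.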