Space: riacho/Solar-News-Translator-Final

riacho committed on Apr 29

Commit f4e0387 (verified) · 1 parent: d7b2228

Upload folder using huggingface_hub

Files changed (5)

README.md +46 -6
app.py +439 -0
prompts/rule_aug.txt +144 -0
requirements.txt +7 -0
rules.py +266 -0

README.md CHANGED

@@ -1,12 +1,52 @@
 ---
-title: Solar News Translator Final
-emoji: 📈
-colorFrom: green
-colorTo: purple
 sdk: gradio
-sdk_version: 6.13.0
 app_file: app.py
 pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: Korean News Translator — Final (Rule-augmented)
+emoji: 🌐
+colorFrom: indigo
+colorTo: blue
 sdk: gradio
+sdk_version: 4.44.1
+python_version: "3.11"
 app_file: app.py
 pinned: false
+license: mit
 ---
+# Korean News Translator — Final (Rule-augmented)
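+Demo of the production candidate (v1) for Chosun Ilbo Korean→English translation.
+- **Model**: Solar Pro2 (`reasoning_effort=high`, `temperature=0.0`)
+- **System prompt**: `prompts/rule_aug.txt` (baseline prompt + RULE-PRECOMPUTED METADATA section)
+- **Preprocessing**: `rules.py`, which replaces Korean numbers (만/억/조) plus units inline and supplies the publication-date anchor
+## Key changes (vs baseline)
+1. **Inline number replacement**: unit conversions such as `120억원` → `12 billion Korean won` are handled by deterministic code. The model outputs the English expression as-is.
+2. **Publication-date anchor**: a single line, `[Article published date: YYYY-MM-DD]`, is appended to the very end of the system prompt. The model uses it as the reference for resolving relative dates such as "지난 X일" (on the Xth), "지난달" (last month), and "올해" (this year).
+3. **Removed the baseline's long DATE TRANSLATION RULE section**, simplifying it to the single line INSTRUCTION #5.
+## Flow
+1. Enter an article URL (crawled with Firecrawl) or paste the Korean body text directly
+2. **Article publication date (required, `YYYY-MM-DD`)**: entered manually in the demo, filled in automatically from the CMS in production
+3. Rule preprocessing → Solar Pro2 call → English translation output
+4. Write a qualitative comment → persisted to the dataset on submission
+## Environment variables (HF Space → Settings → Repository secrets)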
+```
+UPSTAGE_API_KEY=...                       # required
+FIRECRAWL_API_KEY=...                     # needed for URL crawling
+HF_TOKEN=hf_...                            # persistent comment storage
+FEEDBACK_DATASET_REPO=user/...-feedback    # persistent comment storage
+```
+`FEEDBACK_DATASET_REPO` must be a dataset-type repository created in advance.
+## Running locally
+```bash
+pip install -r requirements.txt
+cp .env.example .env   # fill in the keys
+python app.py
+```
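
The inline number replacement described in the README can be illustrated with a much smaller sketch. This is not the repository's `rules.py`: the names `korean_to_int`, `to_english`, and `replace_won` are hypothetical, and the sketch assumes integer 조/억/만 compounds followed by 원 only (no decimals, commas, approximation markers, or other units).

```python
import re

# Hypothetical, simplified stand-in for rules.py (assumption: integers and 원 only).
MAGNITUDES = [("조", 10**12), ("억", 10**8), ("만", 10**4)]
WON = re.compile(r"((?:\d+조)?(?:\d+억)?(?:\d+만)?\d*)원")


def korean_to_int(num: str) -> int:
    """Sum each magnitude group, then add any trailing plain digits."""
    total = 0
    for marker, mult in MAGNITUDES:
        m = re.match(r"(\d+)" + marker, num)
        if m:
            total += int(m.group(1)) * mult
            num = num[m.end():]
    return total + (int(num) if num else 0)


def to_english(amount: int) -> str:
    """Pick the largest English scale word that fits the amount."""
    for name, size in (("trillion", 10**12), ("billion", 10**9), ("million", 10**6)):
        if amount >= size:
            return f"{amount / size:g} {name} Korean won"
    return f"{amount:,} Korean won"


def replace_won(text: str) -> str:
    def _sub(m: re.Match) -> str:
        num = m.group(1)
        if not num:
            return m.group(0)  # bare "원" with no digits: leave untouched
        return to_english(korean_to_int(num))
    return WON.sub(_sub, text)


print(replace_won("매출 120억원"))
```

Doing the conversion in code rather than in the prompt means the magnitude arithmetic (억 is 10^8, so 120억 is 12 × 10^9) never depends on the model.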

app.py ADDED

	@@ -0,0 +1,439 @@

+"""
+Korean News Translator — Final (Rule-augmented) Qualitative Eval
+Single-version demo for production confirmation:
+  - Solar Pro2 (reasoning_effort=high, temperature=0.0)
+  - prompts/rule_aug.txt — baseline prompt + RULE-PRECOMPUTED METADATA section
+  - rules.py — Korean number→English unit inline replacement + article-date anchor
+Flow:
+  1. User provides article URL or pastes Korean body
+  2. User MUST provide article published date (YYYY-MM-DD) — manual in this demo;
+     in production this comes from CMS publish timestamp automatically.
+  3. App preprocesses: inline-replaces Korean numbers, appends date to system prompt
+  4. App calls Solar Pro2 once → shows English translation
+  5. User leaves a qualitative comment (no rating/vote) → persisted
+Persistence:
+  - If HF_TOKEN + FEEDBACK_DATASET_REPO set → push/pull JSONL on HF Datasets
+  - Otherwise → local ./feedback.jsonl (dev mode)
+"""
+from __future__ import annotations
+import json
+import os
+import re
+import time
+import uuid
+from datetime import datetime
+from pathlib import Path
+import gradio as gr
+import pandas as pd
+import requests
+from dotenv import load_dotenv
+from openai import OpenAI
+from rules import preprocess, system_prompt_date_suffix
+load_dotenv()
+# ── Config ────────────────────────────────────────────────────────────────
+UPSTAGE_BASE_URL = "https://api.upstage.ai/v1"
+UPSTAGE_API_KEY = os.getenv("UPSTAGE_API_KEY") or os.getenv("OPENAI_API_KEY")
+FIRECRAWL_API_KEY = os.getenv("FIRECRAWL_API_KEY")
+HF_TOKEN = os.getenv("HF_TOKEN")
+FEEDBACK_DATASET_REPO = os.getenv("FEEDBACK_DATASET_REPO")
+LOCAL_FEEDBACK_FILE = Path(__file__).parent / "feedback.jsonl"
+SCRIPT_DIR = Path(__file__).parent
+SYSTEM_PROMPT = (SCRIPT_DIR / "prompts" / "rule_aug.txt").read_text(encoding="utf-8")
+MODEL = "solar-pro2"
+REASONING_EFFORT = "high"
+TEMPERATURE = 0.0
+client = OpenAI(api_key=UPSTAGE_API_KEY, base_url=UPSTAGE_BASE_URL)
+# ── Translation ───────────────────────────────────────────────────────────
+def call_upstage(system_prompt: str, user_text: str) -> str:
+    resp = client.chat.completions.create(
+        model=MODEL,
+        messages=[
+            {"role": "system", "content": system_prompt},
+            {"role": "user", "content": user_text},
+        ],
+        temperature=TEMPERATURE,
+        reasoning_effort=REASONING_EFFORT,
+        max_tokens=8000,
+    )
+    # message.content can be None (e.g. an empty completion), so guard before strip().
+    return (resp.choices[0].message.content or "").strip()
+_META_MARKERS = (
+    "verification checklist", "checklist confirmation",
+    "before submitting", "### verification", "**verification",
+)
+def _strip_meta_trailer(text: str) -> str:
+    if not text:
+        return text
+    lowered = text.lower()
+    earliest = len(text)
+    for marker in _META_MARKERS:
+        idx = lowered.find(marker)
+        if 0 <= idx < earliest:
+            earliest = idx
+    if earliest < len(text):
+        return text[:earliest].rstrip()
+    return text
+# ── Crawling (same as reference) ──────────────────────────────────────────
+def crawl_article(url: str) -> tuple[str, str]:
+    if not FIRECRAWL_API_KEY:
+        return "", "Firecrawl API 키가 설정되지 않았습니다. 본문을 직접 붙여넣어 주세요."
+    if not url.strip():
+        return "", "URL을 입력해주세요."
+    payload = {
+        "url": url,
+        "formats": [{
+            "type": "json",
+            "prompt": (
+                "뉴스 기사의 본문 텍스트만 추출하세요.\n\n"
+                "반드시 제외할 것:\n"
+                "- 기사 제목, 기자/저자 이름, 입력일/업데이트일\n"
+                "- **이미지/사진/그래픽/일러스트/도표/인포그래픽 캡션**\n"
+                "- 광고, 편집자 주, 관련기사 링크\n"
+                "- 추천기사, 댓글, 구독 안내, 사이트 내비게이션, 인기기사/핫뉴스 위젯\n"
+                "- '저작권자 ⓒ ...', 'ⓒ' 표시 단락\n\n"
+                "오직 기자가 작성한 본문 단락만 순서대로 이어서 반환하세요. "
+                '형식: {"body": "본문 텍스트 전체 (단락은 \\n\\n 으로 구분)"}'
+            ),
+        }],
+        "onlyMainContent": True, "blockAds": True, "removeBase64Images": True,
+    }
+    try:
+        r = requests.post(
+            "https://api.firecrawl.dev/v2/scrape",
+            headers={"Authorization": f"Bearer {FIRECRAWL_API_KEY}",
+                     "Content-Type": "application/json"},
+            json=payload, timeout=(10, 180),
+        )
+        if r.status_code != 200:
+            return "", f"스크래핑 실패 ({r.status_code}): {r.text[:200]}"
+        body = ((r.json().get("data") or {}).get("json") or {}).get("body", "").strip()
+        if not body:
+            return "", "본문을 찾을 수 없습니다."
+        return body, ""
+    except Exception as e:
+        return "", f"크롤링 오류: {e}"
+# ── Persistence ───────────────────────────────────────────────────────────
+def _dataset_file_path() -> str:
+    return "feedback.jsonl"
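+def append_feedback(row: dict) -> None:
+    use_hub = bool(HF_TOKEN and FEEDBACK_DATASET_REPO)
+    if use_hub:
+        from huggingface_hub import hf_hub_download
+        from huggingface_hub.utils import EntryNotFoundError, LocalEntryNotFoundError
+        try:
+            remote_path = hf_hub_download(
+                repo_id=FEEDBACK_DATASET_REPO,
+                filename=_dataset_file_path(),
+                repo_type="dataset", token=HF_TOKEN, force_download=True,
+            )
+            with open(remote_path, "rb") as src, LOCAL_FEEDBACK_FILE.open("wb") as dst:
+                dst.write(src.read())
+        except Exception as e:
+            # Only a genuinely missing remote file justifies starting from scratch.
+            # LocalEntryNotFoundError signals a connection problem, not a missing file.
+            missing = (isinstance(e, EntryNotFoundError)
+                       and not isinstance(e, LocalEntryNotFoundError))
+            if missing:
+                print(f"[INFO] No existing dataset file, resetting local: {e}")
+                if LOCAL_FEEDBACK_FILE.exists():
+                    LOCAL_FEEDBACK_FILE.unlink()
+            else:
+                # Network/auth failure: keep the local file and skip the upload so
+                # a partial local copy never overwrites the remote history.
+                print(f"[WARN] HF dataset fetch failed, saving locally only: {e}")
+                use_hub = False
+    with LOCAL_FEEDBACK_FILE.open("a", encoding="utf-8") as f:
+        f.write(json.dumps(row, ensure_ascii=False) + "\n")
+    if use_hub:
+        try:
+            from huggingface_hub import HfApi
+            HfApi(token=HF_TOKEN).upload_file(
+                path_or_fileobj=str(LOCAL_FEEDBACK_FILE),
+                path_in_repo=_dataset_file_path(),
+                repo_id=FEEDBACK_DATASET_REPO,
+                repo_type="dataset",
+                commit_message=f"feedback {row.get('id','')}",
+            )
+        except Exception as e:
+            print(f"[WARN] HF dataset upload failed: {e}")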
+def load_feedback() -> list[dict]:
+    if HF_TOKEN and FEEDBACK_DATASET_REPO:
+        try:
+            from huggingface_hub import hf_hub_download
+            p = hf_hub_download(
+                repo_id=FEEDBACK_DATASET_REPO,
+                filename=_dataset_file_path(),
+                repo_type="dataset", token=HF_TOKEN, force_download=True,
+            )
+            return [json.loads(l) for l in open(p, encoding="utf-8") if l.strip()]
+        except Exception as e:
+            print(f"[INFO] HF dataset fetch failed ({e}). Returning empty.")
+            return []
+    if LOCAL_FEEDBACK_FILE.exists():
+        return [json.loads(l) for l in LOCAL_FEEDBACK_FILE.open(encoding="utf-8") if l.strip()]
+    return []
+# ── Validation ────────────────────────────────────────────────────────────
+DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
+def _valid_date(s: str) -> bool:
+    if not s or not DATE_RE.match(s.strip()):
+        return False
+    try:
+        datetime.strptime(s.strip(), "%Y-%m-%d")
+        return True
+    except ValueError:
+        return False
+# ── Translation pipeline ──────────────────────────────────────────────────
+def run_translation(url: str, direct_text: str, article_date: str):
+    """Generator: streams progressive UI updates."""
+    if not _valid_date(article_date):
+        yield ("", "", "", "❌ 기사 발행일을 `YYYY-MM-DD` 형식으로 입력해주세요. "
+               "(실제 서비스에서는 CMS의 발행 시각이 자동 입력됩니다.)",
+               gr.update(visible=False), {})
+        return
+    src = direct_text.strip()
+    crawl_note = ""
+    if not src:
+        if not url.strip():
+            yield ("", "", "", "❌ URL 또는 한국어 본문 중 하나를 입력해주세요.",
+                   gr.update(visible=False), {})
+            return
+        yield ("", "", "", "🌐 URL 크롤링 중... (최대 약 3분)",
+               gr.update(visible=False), {})
+        crawled, err = crawl_article(url)
+        if err:
+            yield ("", "", "", f"❌ {err}", gr.update(visible=False), {})
+            return
+        src = crawled
+        crawl_note = f"(크롤링 완료, {len(src)} chars) "
+    # Apply rule preprocessing
+    augmented_user_text, info = preprocess(src, article_date)
+    system_prompt = SYSTEM_PROMPT + system_prompt_date_suffix(article_date)
+    rule_summary_lines = [
+        f"📅 발행일 anchor: `{info['date']}`",
+    ]
+    if info["conversions"]:
+        rule_summary_lines.append(
+            f"🔢 인라인 치환 {len(info['conversions'])}건:"
+        )
+        for c in info["conversions"][:12]:
+            rule_summary_lines.append(f"  - `{c['span']}` → **{c['english']}**")
+        if len(info["conversions"]) > 12:
+            rule_summary_lines.append(f"  - … 외 {len(info['conversions']) - 12}건")
+    else:
+        rule_summary_lines.append("🔢 인라인 치환: 없음")
+    rule_summary = "\n".join(rule_summary_lines)
+    yield (src, augmented_user_text, "",
+           f"✅ 원문 추출 완료 {crawl_note}— 룰 적용 후 Solar Pro2 호출 중...",
+           gr.update(visible=False), {})
+    t0 = time.time()
+    try:
+        translation = _strip_meta_trailer(call_upstage(system_prompt, augmented_user_text))
+    except Exception as e:
+        yield (src, augmented_user_text, "",
+               f"❌ 번역 오류: {e}", gr.update(visible=False), {})
+        return
+    elapsed = time.time() - t0
+    state_val = {
+        "source": src,
+        "source_url": url.strip(),
+        "article_date": article_date.strip(),
+        "augmented_user_text": augmented_user_text,
+        "translation": translation,
+        "rule_info": {
+            "date": info["date"],
+            "conversions": info["conversions"],
+        },
+        "elapsed_sec": round(elapsed, 1),
+    }
+    status = (
+        f"✅ 번역 완료 ({elapsed:.1f}s)\n\n{rule_summary}\n\n"
+        "정성평가 의견을 남겨주세요."
+    )
+    yield (src, augmented_user_text, translation, status,
+           gr.update(visible=True), state_val)
+def submit_comment(comment: str, rater_name: str, state: dict):
+    if not state:
+        return "먼저 번역을 생성해주세요.", gr.update()
+    if not (comment or "").strip():
+        return "코멘트를 입력해주세요.", gr.update()
+    row = {
+        "id": uuid.uuid4().hex[:12],
+        "timestamp": datetime.now().isoformat(timespec="seconds"),
+        "rater": (rater_name or "anonymous").strip()[:40],
+        "comment": comment.strip(),
+        "source_url": state.get("source_url", ""),
+        "article_date": state.get("article_date", ""),
+        "source_full": state.get("source", ""),
+        "augmented_user_text": state.get("augmented_user_text", ""),
+        "translation": state.get("translation", ""),
+        "rule_info": state.get("rule_info", {}),
+        "source_excerpt": (state.get("source", "")[:200]).replace("\n", " "),
+        "elapsed_sec": state.get("elapsed_sec", 0),
+        "model": MODEL,
+        "reasoning_effort": REASONING_EFFORT,
+        "temperature": TEMPERATURE,
+    }
+    append_feedback(row)
+    return (
+        f"### ✅ 제출 완료\n\nID: `{row['id']}` — 감사합니다!",
+        gr.update(value=""),
+    )
+def refresh_results():
+    rows = load_feedback()
+    cols = ["timestamp", "rater", "comment", "article_date",
+            "source_url", "source_excerpt"]
+    if not rows:
+        return pd.DataFrame(columns=cols), "아직 제출된 코멘트가 없습니다."
+    df = pd.DataFrame(rows)
+    show_cols = [c for c in cols if c in df.columns]
+    df = df[show_cols].sort_values("timestamp", ascending=False)
+    return df, f"**총 {len(df)}건**의 정성 코멘트가 누적되어 있습니다."
+# ── UI ────────────────────────────────────────────────────────────────────
+CSS = """
+.gradio-container { font-family: system-ui, -apple-system, sans-serif !important; }
+.trans-box textarea { font-size: 0.95em !important; line-height: 1.55 !important; }
+.feedback-card {
+    background: linear-gradient(180deg, #f8fafc 0%, #eef2ff 100%) !important;
+    border: 2px solid #6366f1 !important;
+    border-radius: 12px !important;
+    padding: 20px !important;
+    margin-top: 16px !important;
+}
+.submit-btn { font-size: 1.1em !important; padding: 12px !important;
+              margin-top: 8px !important; }
+"""
+INTRO = (
+    "# Korean → English 번역 — 룰 기반 전처리 적용 버전\n\n"
+    "**최종 운영 후보(v1)**: Solar Pro2 + `rules.py` 전처리 + `prompts/rule_aug.txt`. "
+    "한국어 숫자 단위(만/억/조)와 퍼센트 포인트는 결정론적 룰로 인라인 치환되고, "
+    "기사 발행일은 시스템 프롬프트의 anchor로 주입되어 상대 날짜(지난 X일 등)를 정확히 해석합니다.\n\n"
+    "기사 URL 또는 한국어 본문, 그리고 **기사 발행일(필수)** 을 입력하세요. "
+    "발행일은 이번 데모에서만 수동 입력이며, 실제 서비스에서는 CMS의 발행 시각이 자동으로 들어갑니다."
+)
+with gr.Blocks(title="Korean News Translator — Final", css=CSS,
+               theme=gr.themes.Default()) as demo:
+    gr.Markdown(INTRO)
+    with gr.Tabs():
+        # ── Tab 1: 번역 + 코멘트 ──────────────────────────────────────
+        with gr.TabItem("번역 & 정성평가"):
+            with gr.Row():
+                url_in = gr.Textbox(label="기사 URL (선택)",
+                                    placeholder="https://...", scale=3)
+                date_in = gr.Textbox(
+                    label="기사 발행일 (필수, YYYY-MM-DD)",
+                    placeholder="2026-04-30",
+                    scale=2,
+                    info="실제 서비스에서는 자동 입력. 데모에서만 수동.",
+                )
+                gen_btn = gr.Button("번역 생성", variant="primary", scale=1)
+            with gr.Accordion("직접 한국어 본문 입력 (URL 대신)", open=False):
+                text_in = gr.Textbox(label="한국어 본문", lines=8,
+                                     placeholder="본문을 붙여넣기 하세요")
+            status_md = gr.Markdown()
+            with gr.Row():
+                with gr.Column():
+                    gr.Markdown("### 원문 (Korean)")
+                    src_box = gr.Textbox(label="", lines=18, interactive=False,
+                                         elem_classes=["trans-box"])
+                with gr.Column():
+                    gr.Markdown("### 룰 적용 후 user message (인라인 치환)")
+                    aug_box = gr.Textbox(label="", lines=18, interactive=False,
+                                         elem_classes=["trans-box"])
+                with gr.Column():
+                    gr.Markdown("### 영문 번역 결과")
+                    out_box = gr.Textbox(label="", lines=18, interactive=False,
+                                         elem_classes=["trans-box"])
+            with gr.Group(visible=False, elem_classes=["feedback-card"]) as fb_group:
+                gr.Markdown("## 📝 정성 코멘트")
+                with gr.Row():
+                    rater_in = gr.Textbox(
+                        label="닉네임 (선택)", placeholder="anonymous",
+                        max_lines=1, scale=1,
+                    )
+                comment_in = gr.Textbox(
+                    label="평가 의견 (자유 기술)",
+                    lines=5,
+                    placeholder=(
+                        "자연스러움 / 뉴스체 / 직역투 / 오역 / 누락 / "
+                        "숫자·단위 / 날짜·시점 / 인용 처리 / 고유명사 등 "
+                        "자유롭게 남겨주세요."
+                    ),
+                )
+                submit_btn = gr.Button(
+                    "✅ 코멘트 제출", variant="primary",
+                    elem_classes=["submit-btn"],
+                )
+                ack_md = gr.Markdown()
+            state_box = gr.State({})
+            gen_btn.click(
+                fn=run_translation,
+                inputs=[url_in, text_in, date_in],
+                outputs=[src_box, aug_box, out_box, status_md, fb_group, state_box],
+            )
+            submit_btn.click(
+                fn=submit_comment,
+                inputs=[comment_in, rater_in, state_box],
+                outputs=[ack_md, comment_in],
+            )
+        # ── Tab 2: 누적 코멘트 ────────────────────────────────────────
+        with gr.TabItem("누적 코멘트"):
+            refresh_btn = gr.Button("🔄 새로고침")
+            summary_md = gr.Markdown()
+            results_df = gr.Dataframe(
+                label="제출된 정성 코멘트",
+                headers=["timestamp", "rater", "comment", "article_date",
+                         "source_url", "source_excerpt"],
+                wrap=True, interactive=False,
+            )
+            refresh_btn.click(fn=refresh_results, outputs=[results_df, summary_md])
+            demo.load(fn=refresh_results, outputs=[results_df, summary_md])
+if __name__ == "__main__":
+    demo.launch()
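
`run_translation` builds the final system prompt as `SYSTEM_PROMPT + system_prompt_date_suffix(article_date)`. The body of `system_prompt_date_suffix` is not visible in this part of the diff, so the following is only a sketch of what such a helper might look like, assuming the `[Article published date: YYYY-MM-DD]` line format given in the README. `date_suffix`, `build_system_prompt`, and `PROMPT_BODY` are hypothetical names.

```python
from datetime import datetime

PROMPT_BODY = "You are a translator."  # placeholder for prompts/rule_aug.txt


def date_suffix(article_date: str) -> str:
    """Hypothetical stand-in for rules.system_prompt_date_suffix()."""
    # strptime raises ValueError for impossible dates such as 2026-02-30.
    parsed = datetime.strptime(article_date.strip(), "%Y-%m-%d")
    return f"\n\n[Article published date: {parsed:%Y-%m-%d}]"


def build_system_prompt(article_date: str) -> str:
    # The anchor goes last, matching the "very bottom of this system prompt"
    # wording used in prompts/rule_aug.txt.
    return PROMPT_BODY + date_suffix(article_date)


print(build_system_prompt("2026-04-30"))
```

Re-formatting the parsed date, rather than echoing the raw input, keeps the anchor line in one canonical shape even if the caller passes surrounding whitespace.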

prompts/rule_aug.txt ADDED

	@@ -0,0 +1,144 @@

+#ROLE:
+You are a professional Korean-to-English news translator. Your task is to translate Korean news articles into high-quality English while following the rules and verification steps below.
+#GROUND RULE 1
+CRITICAL: Preserve Special Tokens
+1. Preserve every <fig></fig>, <description></description>, <h></h>, and <quote></quote> token exactly as it appears in the original text, maintaining both its position and format without any modifications.
+2. The total number of each token in the translation must exactly match the original Korean text.
+3. <description></description> and <h></h> tokens contain content that should be translated while the tags themselves must remain intact and correctly opened and closed.
+- The inner content of <description></description> and <h></h> must be translated into fluent, professional English.
+- You must preserve the tags exactly as written, including both the opening (<description>, <h>) and closing (</description>, </h>) tokens
+4. <fig></fig> and <quote></quote> tokens should be output exactly as they appear - do not add any content between these tokens.
+- If <quote></quote> appears MID-SENTENCE in Korean → Must appear MID-SENTENCE in English
+- If <quote></quote> appears BETWEEN sentences in Korean → Must appear BETWEEN sentences in English
+#GROUND RULE 2
+CRITICAL: English Proper Noun Handling (NEVER include Korean text)
+1. ALWAYS PRESERVE the exact spelling of any English text that appears in the input, regardless of how famous or well-known the person/organization is
+2. NEVER use your knowledge of standard English spellings for names - use ONLY what appears in the input
+3. PRESERVE English proper nouns exactly as they appear in the input (organization names, place names, etc.)
+4. TRANSLATE general English terms, mathematical terms, and common words naturally, even if they appear in English in the input
+5. PARSING ERRORS: If English text appears to be incorrectly parsed Korean text (e.g., "Hyundai인" → "현대인" → modern people), translate based on the intended Korean meaning, not the literal English text
+6. NEVER include Korean text or romanization alongside English translations - do not use parallel notation with Korean terms in parentheses or brackets
+Examples:
+"Yi Cheul soo 대통령" → "President Yi Cheul soo" (preserve name spelling, NOT "Yi Cheul-soo")
+"Jisoo함수" → "Exponential function" (translate mathematical term, even if it appears as "Jisoo function" in input)
+"Lee Young-Hee" → "Lee Young-Hee" (preserve exact spelling, NOT "Lee Younghee")
+"Park Ji Won" → "Park Ji Won" (preserve exact spelling, NOT "Park Ji-won")
+#INSTRUCTIONS:
+FIRST PRIORITIES:
+1. KEEP every detail meaning, original tone/style, words and formats exactly as written in Korean.
+2. DO NOT use any prior knowledge of names, titles, positions, or proper nouns. PRESERVE ONLY what appears in the input
+3. DO NOT add, omit, or change any information from the original text.
+4. KEEP the same subject-object relationships and speaker's intention from Korean.
+5. Resolve relative date expressions (e.g., 지난 X일, 지난달, 작년, 올해) using the [Article published date: YYYY-MM-DD] line at the bottom of this system prompt as the year/month anchor.
+SECOND PRIORITIES:
+- Use vocabulary, tone, and style that are natural and appropriate for a professional English-language news article.
+- Ensure the translation is accurate and faithful to the original content, while sounding natural to native English readers.
+- Do not summarize or paraphrase. Translate all details as they appear.
+- Do not include any explanations or extra comments.
+- Do not generate or add any headline, title, or summary. Only provide the translated article body.
+- Do not add parentheses(), semicolon(;), bold (**), italic (*), or any markdown formatting unless present in the original.
+# Parentheses to Comma Conversion
+- Convert parenthetical expressions to comma-separated phrases for better readability
+- Korean format: 아이유(이지은·31) → English format: IU, Lee Ji-eun, 31 years old
+- Korean format: 서울시청(시민 서비스 센터) → English format: Seoul City Hall, the Citizen Service Center
+- Korean format: 오후 3시(현지시간) → English format: 3 p.m., local time
+- Maintain the same informational content while using commas instead of parentheses for smoother English flow
+# Names and Ages
+- Korean format: 김철수(35) → English format: Kim Chulsoo, 35 years old
+- Always spell out "years old" in full and use a comma
+# Anonymous Source Attribution
+- Use "a source from [organization]" format for anonymous statements, "관계자"
+- Korean format: 정부 관계자에 따르면 → English format: According to a source from the government
+# Anonymous Name Format - CRITICAL TRANSLATION RULE
+## Core Principles
+- **NEVER assign gender unless explicitly stated in the original Korean text**
+- **Maintain anonymity level exactly as presented in source material**
+- **Only reflect information actually provided in the original**
+## Basic Anonymized Name Patterns
+- For anonymized names ending in 모씨: translate as "[Family Name]"
+- Examples: 김모씨 → "Kim", 우모씨 → "Woo", 정모씨 → "Jeong"
+## NEVER translate 모씨 patterns as:
+❌ "Lee Mo" / "Kim Mo" / "[Name] Mo"
+# Title Translation (books, reports, movies, and other content)
+- Use literal translation to preserve the original meaning and nuance of the title as much as possible.
+- Format book titles with single quotes 'title' instead of asterisks *title*
+# Word Translation Guide
+- Translate "본지" → "this newspaper"
+- Death references → "died"
+- AVOID euphemisms like 'passed away' or 'departed'. Use 'died' unless there's a compelling reason to do otherwise (e.g. quoting someone)
+# Knowledge-Based Information Updates
+Use ONLY the specific updated information provided in the information guidelines
+Based on the current timeframe - Update key figures and titles accurately:
+## Official Title Updates
+1. South Korean President: Lee Jae Myung
+Elected June 3, 2025; in office since June 4, 2025 (current)
+Examples: "이재명 후보" → "President Lee Jae Myung" / "이재명 대표" → "President Lee Jae Myung"
+2. Former South Korean President: Yoon Suk Yeol
+Removed from office April 4, 2025 (impeachment upheld by the Constitutional Court)
+Example: "윤석열 대통령" → "former President Yoon Suk Yeol"
+3. US President: Donald Trump
+Current term: January 20, 2025~present (47th president)
+Example: "도널드 트럼프 대통령" → "President Donald Trump"
+4. Former US President: Joe Biden
+Term: January 20, 2021~January 20, 2025 (46th president)
+Example: "조 바이든 대통령" → "former President Joe Biden"
+5. Japanese Prime Minister: Shigeru Ishiba
+In office: October 1, 2024 ~ September 7, 2025
+Example: "기시다 총리" → "Prime Minister Shigeru Ishiba"
+# RULE-PRECOMPUTED METADATA
+The source text has been preprocessed before reaching you:
+1. Korean monetary/number expressions (with 조 / 억 / 만 markers) have been REPLACED INLINE with their English-unit equivalents (e.g., "12 billion won", "5.4 trillion won", "89.7 billion dollars"). When you see such an English-unit expression in the source, output it VERBATIM in the translation — do not modify, recompute, or reformat the magnitude. Treat these expressions as pre-translated tokens that must pass through unchanged.
+2. The "[Article published date: YYYY-MM-DD]" line at the very bottom of this system prompt is the year/month anchor for relative dates. "지난 X일" means the Xth of the same month as the published date (or the most recent past Xth if X > current day-of-month). Do NOT translate "지난 X일" as "the Xth of last month" unless the source explicitly contains "지난달".
+3. NEVER translate "지난 X일" as "X days ago" or any relative-distance phrasing — it always refers to a specific calendar day of the same month, so render it as "on [Month] [X]" (e.g., "on November 3") or "on the Xth".
+#VERIFICATION
+CRITICAL WARNING: Failure to follow these rules will result in incorrect translation.
+CHECKLIST:
+1. English proper nouns preserved exactly as they appear in input without applying standard spellings
+2. All original numbers preserved exactly and number units properly converted (no omissions and no additions)
+3. MANDATORY: Keep all tokens (<fig></fig>, <quote></quote>, <description></description>, <h></h>) in exact same POSITION as original with exact same COUNT maintaining both OPENING and CLOSING token
+3-1. Verify that each <h></h> token has both opening and closing tags, intact and correctly placed.
+3-2. If <quote></quote> tokens EXIST, verify count by checking numbered identifiers inside tags - count quote1, quote2, quote3, etc. in original, then ensure translation has exactly the same numbered tokens.
+4. Professional English news style while preserving original tone
+5. Every sentence is translated, including all necessary words and nuances from the original
+6. Age format: "Name, XX years old" with proper spacing
+7. Subject-object relationships maintained
+8. NO KOREAN OUTPUT: Final output must contain ONLY English text
+9. DO NOT add parentheses(), semicolon(;), bold (**), italic (*), or any markdown formatting unless present in the original.
+10. For lists, use COMMAS instead of semicolons
+# FINAL TOKEN VERIFICATION BEFORE OUTPUT:
+1. Count ALL special tokens in the original Korean text: <fig></fig>, <quote></quote>, <description></description>, <h></h>
+2. Ensure your translation has the EXACT SAME NUMBER of each token type
+3. If original has ZERO tokens, your translation must have ZERO tokens
+4. If original has tokens, DO NOT omit any - preserve every single one in the exact same position
+!!! REMEMBER: Verify and correct every number individually during the thinking process, and apply the corrections to the final output !!!
+!!! CRITICAL: Do not use parallel notation with Korean terms in parentheses or brackets !!!
+--------
+Provide ONLY the FINAL ENGLISH professional news article without any explanation text.  (NEVER include Korean text even in examples or parentheses)
+Do not output any explanations, intermediate reasoning, or translation - only the final article.
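
Item 2 of the RULE-PRECOMPUTED METADATA section leaves the resolution of "지난 X일" to the model. For comparison, the same rule can be written as deterministic code. This is a sketch of one reading of that rule, not part of the commit: `resolve_past_day` and `_prev_month` are hypothetical names, and stepping back over months that lack day X (for example the 30th in February) is an added assumption.

```python
from datetime import date


def _prev_month(year: int, month: int) -> tuple[int, int]:
    return (year - 1, 12) if month == 1 else (year, month - 1)


def resolve_past_day(published: date, day: int) -> date:
    """Resolve "지난 X일" against the article's published date.

    Same month if X has already occurred, otherwise the most recent
    earlier month that actually contains day X.
    """
    if not 1 <= day <= 31:
        raise ValueError(f"day out of range: {day}")
    year, month = published.year, published.month
    if day > published.day:
        year, month = _prev_month(year, month)
    while True:
        try:
            return date(year, month, day)
        except ValueError:
            # This month has no such day (e.g. Feb 30): step back again.
            year, month = _prev_month(year, month)


print(resolve_past_day(date(2025, 11, 10), 3))
```

A helper like this could be used to spot-check model output during evaluation, since the expected calendar day is fully determined by the published date and X.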

requirements.txt ADDED

	@@ -0,0 +1,7 @@

+gradio==4.44.1
+starlette<0.40
+huggingface_hub>=0.23.0,<0.26
+openai>=1.0.0
+python-dotenv>=1.0.0
+pandas>=2.0.0
+requests>=2.31.0

rules.py ADDED

	@@ -0,0 +1,266 @@

+"""Rule-based preprocessing: Korean number→English unit conversion + article date injection."""
+import re
+from datetime import datetime
+# Korean magnitude markers (cumulative within a compound number)
+KO_MAGNITUDE = [("조", 10**12), ("억", 10**8), ("만", 10**4)]
+# Recognized currency / counter units. Currency names follow international
+# naming conventions ("Korean won", "Japanese yen", "Chinese yuan"). Counter
+# units stay simple.
+UNIT_MAP = {
+    "원": "Korean won",
+    "달러": "dollars",
+    "유로": "euros",
+    "엔": "Japanese yen",
+    "위안": "Chinese yuan",
+    "명": "people",
+    "개": "units",
+    "건": "cases",
+    "대": "vehicles",
+    "톤": "tons",
+}
+# Compound Korean number followed by unit. The number portion may contain any
+# combination of 조 / 억 / 만 plus a tail digit run. The 억 and 만 groups may carry
+# an approximation marker (여 / 가량 / 쯤 / 약) between the digits and the
+# magnitude marker. Every group is optional, so the pattern can also match a
+# bare unit; detect_korean_numbers() skips matches that contain no digit.
+# Trailing whitespace after the marker is consumed by the \s* that follows it.
+APPROX = r"(?:여|가량|쯤|약)"
+NUM_UNIT_PATTERN = re.compile(
+    r"(?P<num>"
+    r"(?:\d[\d,]*(?:\.\d+)?\s*조\s*)?"
+    r"(?:\d[\d,]*(?:\.\d+)?(?:\s*" + APPROX + r")?\s*억\s*)?"
+    r"(?:\d[\d,]*(?:\.\d+)?(?:\s*" + APPROX + r")?\s*만\s*)?"
+    r"(?:\d[\d,]*(?:\.\d+)?)?"
+    r")\s*(?P<unit>원|달러|유로|엔|위안|명|개|건|대|톤)"
+)
+# Percentage points. Matches all common Korean variants of "X%p / X%포인트 /
+# X퍼센트 포인트". Must be detected and replaced BEFORE plain numbers so that
+# "10%" doesn't slip through unattended (we want "10 percentage points", never
+# just "10%").
+PERCENT_POINT_PATTERN = re.compile(
+    r"(?P<num>\d[\d,]*(?:\.\d+)?)\s*"
+    r"(?:%\s*p|%\s*포인트|퍼센트\s*포인트|％\s*포인트)",
+    re.IGNORECASE,
+)
+def parse_korean_number(text: str) -> float | None:
+    """Parse a compound Korean magnitude string (e.g., '1조4000억', '120억', '5000만',
+    '7800여억') → numeric value. Approximation markers (여 / 가량 / 쯤 / 약) are stripped."""
+    s = text.replace(",", "").replace(" ", "")
+    # Strip approximation markers anywhere they appear within the number string
+    s = re.sub(r"(여|가량|쯤|약)", "", s)
+    total = 0.0
+    matched = False
+    for marker, mult in KO_MAGNITUDE:
+        m = re.match(r"(\d+(?:\.\d+)?)" + marker, s)
+        if m:
+            total += float(m.group(1)) * mult
+            s = s[m.end():]
+            matched = True
+    if s:
+        try:
+            total += float(s)
+            matched = True
+        except ValueError:
+            pass
+    return total if matched else None
+def format_english_amount(amount: float, unit_kor: str) -> str:
+    """Format amount in English unit. Uses 'million/billion/trillion' for large numbers."""
+    unit_en = UNIT_MAP.get(unit_kor, unit_kor)
+    def _fmt(n: float) -> str:
+        # Preserve up to 4 decimals; trim trailing zeros; thousands-separate the integer part.
+        if abs(n - round(n)) < 1e-9:
+            return f"{int(round(n)):,}"
+        s = f"{n:.4f}".rstrip("0").rstrip(".")
+        if "." in s:
+            int_part, dec = s.split(".")
+            return f"{int(int_part):,}.{dec}"
+        return f"{int(s):,}"
+    if amount >= 10**12:
+        return f"{_fmt(amount / 10**12)} trillion {unit_en}"
+    if amount >= 10**9:
+        return f"{_fmt(amount / 10**9)} billion {unit_en}"
+    if amount >= 10**6:
+        return f"{_fmt(amount / 10**6)} million {unit_en}"
+    # < 10^6: no scale word; write the thousand-separated value as-is (avoid awkward "135.574 thousand")
+    return f"{_fmt(amount)} {unit_en}"
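Korean magnitude markers step by 10^4 while English scale words step by 10^3, which is why the conversion is easy to get wrong by hand. The sketch below isolates the threshold logic from `format_english_amount`; `to_english_scale` is a hypothetical name, and unit lookup via `UNIT_MAP` is left out because that table is not shown in this hunk.

```python
def to_english_scale(amount: float) -> tuple[float, str]:
    """Return (scaled value, scale word); the word is '' below one million."""
    for threshold, word in ((10**12, "trillion"), (10**9, "billion"), (10**6, "million")):
        if amount >= threshold:
            return amount / threshold, word
    return amount, ""


# Values written out by hand from their Korean magnitudes.
examples = {
    "120억": 120 * 10**8,
    "5조4000억": 5 * 10**12 + 4000 * 10**8,
    "300만": 300 * 10**4,
}
for label, value in examples.items():
    scaled, word = to_english_scale(value)
    print(label, scaled, word)
```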
+def detect_korean_numbers(text: str) -> list[dict]:
+    """Return a list of {span, start, end, amount, unit_ko, english} entries found in text.
+    Identical spans are reported once; start/end refer to the first occurrence."""
+    results = []
+    seen_spans = set()
+    for m in NUM_UNIT_PATTERN.finditer(text):
+        num_str = m.group("num").strip()
+        unit = m.group("unit")
+        if not re.search(r"\d", num_str):
+            continue  # skip bare unit
+        amount = parse_korean_number(num_str)
+        if amount is None or amount == 0:
+            continue
+        full = f"{num_str}{unit}"
+        # Dedup identical spans (e.g., the same "100억원" appearing many times)
+        if full in seen_spans:
+            continue
+        seen_spans.add(full)
+        results.append({
+            "span": full,
+            "start": m.start(),
+            "end": m.end(),
+            "amount": amount,
+            "unit_ko": unit,
+            "english": format_english_amount(amount, unit),
+        })
+    return results
+def parse_article_date(date_str: str) -> str | None:
+    """Normalize an article date to a YYYY-MM-DD string, return None if unparseable."""
+    if not date_str:
+        return None
+    s = date_str.strip()
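+    # Try a few common formats. Each format is paired with the width of a
+    # well-formed date in that format, so trailing time-of-day text
+    # (e.g. "20250315 09:00") is cut off before parsing.
+    for fmt, width in (("%Y-%m-%d", 10), ("%Y.%m.%d", 10), ("%Y/%m/%d", 10), ("%Y%m%d", 8)):
+        try:
+            return datetime.strptime(s[:width], fmt).strftime("%Y-%m-%d")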
+        except ValueError:
+            continue
+    return None
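+def _format_simple_number(num_str: str) -> str:
+    """Format a plain digit string (with possible commas/decimals) keeping the value as-is.
+    Works on the digit string directly rather than via float, so long values are
+    neither rounded nor switched to scientific notation."""
+    int_part, _, dec = num_str.replace(",", "").partition(".")
+    if not int_part.isdigit() or (dec and not dec.isdigit()):
+        return num_str
+    dec = dec.rstrip("0")  # "1.50" -> "1.5", "2.0" -> "2"
+    grouped = f"{int(int_part):,}"
+    return f"{grouped}.{dec}" if dec else grouped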
+def replace_numbers_inline(text: str) -> tuple[str, list[dict]]:
+    """Replace Korean number+unit spans in `text` with their English equivalents.
+    Two passes (each non-overlapping internally), applied in priority order so
+    that longer / more specific patterns win:
+      1. Percentage points  ("10%포인트" → "10 percentage points")
+      2. Korean magnitude+unit ("120억원" → "12 billion Korean won")
+    Returns (rewritten_text, list_of_replacements_with_offsets_in_original).
+    """
+    matches: list[dict] = []
+    used_ranges: list[tuple[int, int]] = []
+    def overlaps(s: int, e: int) -> bool:
+        return any(not (e <= us or s >= ue) for us, ue in used_ranges)
+    # Pass 1: percentage points (highest priority)
+    for m in PERCENT_POINT_PATTERN.finditer(text):
+        s, e = m.start(), m.end()
+        if overlaps(s, e):
+            continue
+        num_str = m.group("num")
+        english = f"{_format_simple_number(num_str)} percentage points"
+        matches.append({
+            "start": s,
+            "end": e,
+            "span": text[s:e],
+            "amount": None,
+            "unit_ko": "%포인트",
+            "english": english,
+        })
+        used_ranges.append((s, e))
+    # Pass 2: Korean magnitude+currency/counter
+    for m in NUM_UNIT_PATTERN.finditer(text):
+        s, e = m.start(), m.end()
+        if overlaps(s, e):
+            continue
+        num_str = m.group("num").strip()
+        unit = m.group("unit")
+        if not re.search(r"\d", num_str):
+            continue
+        amount = parse_korean_number(num_str)
+        if amount is None or amount == 0:
+            continue
+        english = format_english_amount(amount, unit)
+        matches.append({
+            "start": s,
+            "end": e,
+            "span": text[s:e],
+            "amount": amount,
+            "unit_ko": unit,
+            "english": english,
+        })
+        used_ranges.append((s, e))
+    # Sort so the returned list is in document order, then apply replacements
+    # right-to-left so that the offsets of earlier matches stay valid.
+    matches.sort(key=lambda r: r["start"])
+    out = text
+    for r in reversed(matches):
+        out = out[:r["start"]] + r["english"] + out[r["end"]:]
+    return out, matches
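The right-to-left application step is the part that keeps original offsets usable after each splice. A minimal sketch of that technique, independent of the regex passes, is shown here; `splice_right_to_left` and the sample sentence are illustrative assumptions, and the English strings are typed in by hand rather than produced by `format_english_amount`.

```python
def splice_right_to_left(text: str, edits: list[tuple[int, int, str]]) -> str:
    """Apply non-overlapping (start, end, replacement) edits given in original offsets."""
    out = text
    # Working from the end means a splice never shifts the offsets of edits
    # that are still waiting to be applied further left.
    for start, end, new in sorted(edits, key=lambda e: e[0], reverse=True):
        out = out[:start] + new + out[end:]
    return out


text = "영업이익 120억원, 점유율 3%p 상승"
edits = []
for span, english in (("120억원", "12 billion Korean won"), ("3%p", "3 percentage points")):
    start = text.index(span)
    edits.append((start, start + len(span), english))

print(splice_right_to_left(text, edits))
```

Applying the same edits left to right would require re-computing every later offset after each replacement, since the English text is usually longer than the Korean span it replaces.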
+def preprocess(text: str, article_date: str | None) -> tuple[str, dict]:
+    """Inline-replace Korean numbers with English units inside the user text.
+    The article date is NOT injected into the user message; callers should append
+    it to the system prompt via `system_prompt_date_suffix()` instead.
+    Returns (rewritten_text, debug_info).
+    """
+    norm_date = parse_article_date(article_date) if article_date else None
+    rewritten, replacements = replace_numbers_inline(text)
+    return rewritten, {"date": norm_date, "conversions": replacements}
+def system_prompt_date_suffix(article_date: str | None) -> str:
+    """Return the line to append to the system prompt for date anchoring.
+    Empty string if no date is available.
+    """
+    norm_date = parse_article_date(article_date) if article_date else None
+    if not norm_date:
+        return ""
+    return f"\n\n[Article published date: {norm_date}]"
+# Small system-prompt addendum: explains both the inline numbers and the date anchor line.
+RULE_PROMPT_ADDENDUM = """
+# RULE-PRECOMPUTED METADATA
+The source text has been preprocessed before reaching you:
+1. Korean monetary/number expressions (with 조 / 억 / 만 markers) have been REPLACED INLINE with their English-unit equivalents (e.g., "12 billion Korean won", "5.4 trillion Korean won", "89.7 billion dollars"). When you see such an English-unit expression in the source, output it VERBATIM in the translation — do not modify, recompute, or reformat the magnitude. Treat these expressions as pre-translated tokens that must pass through unchanged.
+2. If a "[Article published date: YYYY-MM-DD]" line is provided at the end of this prompt, use it to resolve relative date expressions. "지난 X일" means the Xth of the same month as the published date (or the most recent past Xth if X is greater than the published date's day-of-month). Do NOT translate "지난 X일" as "the Xth of last month" unless the source explicitly contains "지난달".
+"""
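Item 2 of the addendum asks the model to resolve "지난 X일" against the published date. The repository leaves that resolution to the model; the sketch below is only one possible reading of the rule expressed as date arithmetic, with `resolve_last_day` as a hypothetical helper that is not part of `rules.py`.

```python
from datetime import date


def _previous_month(year: int, month: int) -> tuple[int, int]:
    return (year - 1, 12) if month == 1 else (year, month - 1)


def resolve_last_day(published: date, day: int) -> date:
    """Most recent date on or before `published` whose day-of-month is `day`."""
    if not 1 <= day <= 31:
        raise ValueError("day must be between 1 and 31")
    year, month = published.year, published.month
    if day > published.day:
        # The Xth has not happened yet this month, so step back one month.
        year, month = _previous_month(year, month)
    while True:
        try:
            return date(year, month, day)
        except ValueError:
            # That month has no such day (e.g. the 31st); keep stepping back.
            year, month = _previous_month(year, month)


print(resolve_last_day(date(2025, 1, 23), 19))  # same month as the published date
print(resolve_last_day(date(2025, 1, 23), 28))  # falls back to the previous month
```

With a published date of 2025-01-23, "지난 19일" resolves within January, while "지난 28일" can only refer to a date in December 2024.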
+if __name__ == "__main__":
+    # Smoke test
+    samples = [
+        ("1023억원과 5조4000억원, 그리고 300만원", "2025-03-15"),
+        ("897억달러 매출", "2025-04-02"),
+        ("지난 19일 회담이 열렸다", "2025-01-23"),
+        ("100억원대 분쟁이 발생", "2024-11-01"),
+        ("5000만원의 보너스를 받았다", "2025-02-10"),
+    ]
+    for text, date in samples:
+        out, info = preprocess(text, date)
+        print(f"\n--- date={date} ---")
+        print(f"input:  {text}")
+        print(f"detected: {info['conversions']}")
+        print(f"resolved date: {info['date']}")
+        print(f"augmented:\n{out}")