Space: Peterase/rag-api-node-1
Peterase committed 16 days ago

Commit 0b5eb95, 1 parent: d13f5bc

feat: structured LLM output + clean content for LLM

rag_chat_use_case.py:
- New _clean_content_for_llm(): strips image tags, bare URLs, nav bullets,
  horizontal rules from Jina markdown before sending to LLM
- Applied in _limit_context() so all sources are cleaned before tokenizing
- New structured prompt format enforces:
    ## [Headline]
    **[Topic]**
    - bullet point [N]
    > summary callout
  instead of verbose paragraph-per-source style
- Both execute_chat and execute_stream use the same new prompt
- Shorter, cleaner prompt reduces token usage and improves response quality
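
The cleaning steps listed above can be tried outside the service with a standalone sketch. The function below is a module-level copy of the logic shown in the diff (the committed version is a method on the use-case class), and the sample text is invented for illustration rather than taken from real Jina output.

```python
import re


def clean_content_for_llm(content: str) -> str:
    """Standalone copy of the cleaning steps, for experimentation only."""
    # Drop image markdown entirely: ![alt](url)
    content = re.sub(r'!\[.*?\]\(.*?\)', '', content)
    # Keep link text, drop the target: [text](url) -> text
    content = re.sub(r'\[([^\]]+)\]\([^)]+\)', r'\1', content)
    # Drop bare URLs
    content = re.sub(r'https?://\S+', '', content)
    # Drop "Skip to content" / "Skip to main" up to the end of the line
    content = re.sub(r'\[?Skip to [^\n]+\n?', '', content, flags=re.IGNORECASE)
    kept = []
    for line in content.split('\n'):
        stripped = line.strip()
        # Short "* " bullets without a period are treated as navigation
        if stripped.startswith('* ') and len(stripped) < 60 and '.' not in stripped:
            continue
        # Lines made only of dashes or equals signs are horizontal rules
        if re.match(r'^[-=]{3,}$', stripped):
            continue
        kept.append(line)
    # Collapse runs of three or more newlines into one blank line
    return re.sub(r'\n{3,}', '\n\n', '\n'.join(kept)).strip()


# Invented sample resembling scraped markdown
sample = (
    "![logo](https://example.com/logo.png)\n"
    "[Skip to content](#main)\n"
    "* Home\n"
    "* About\n"
    "---\n"
    "# Headline\n"
    "\n"
    "The [ministry](https://example.com/m) announced a new policy.\n"
    "Details at https://example.com/full\n"
)
cleaned = clean_content_for_llm(sample)
print(cleaned)
```

Two side effects are worth keeping in mind when reading the diff: removing a bare URL can leave a dangling fragment such as "Details at", and the navigation heuristic also drops any genuine article bullet that starts with `* `, is shorter than 60 characters, and contains no period.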

Files changed (1)

src/core/use_cases/rag_chat_use_case.py +93 -77

@@ -90,6 +90,8 @@ Document:
         for doc in docs:
             content = doc.get("content", "")
             metadata = doc.get("metadata", {})
             # Extract source name from multiple possible fields
@@ -746,7 +748,37 @@ JSON:"""
         # ── Step 8: Token limitation ──────────────────────────────────────────
         return self._limit_context(query, deduped_final)
-    def _get_history_text(self, session_id: str) -> str:
         past_messages = self.chat_history_db.get_history(session_id, limit=6)
         return "".join([f"{msg.role}: {msg.content}\n" for msg in past_messages])
@@ -831,46 +863,38 @@ JSON:"""
                 source_label = source_name
             source_index_lines += f"[{idx}] {source_label}\n"
-        prompt = f"""You are ARKI AI, a real-time news assistant. Today's date is {datetime.utcnow().strftime("%B %d, %Y")}.
-════════════════════════════════════════════════════════
-SOURCE INDEX — ONLY THESE SOURCES EXIST. DO NOT INVENT ANY OTHERS.
-════════════════════════════════════════════════════════
 {source_index_lines if source_index_lines else "NO SOURCES RETRIEVED."}
-════════════════════════════════════════════════════════
-CRITICAL CITATION RULE:
-- You have EXACTLY {len(final_sources)} source(s) listed above.
-- ONLY cite numbers that appear in the Source Index above (e.g. if you have 2 sources, only use [1] and [2]).
-- NEVER write [3], [4], [5]... if those numbers are not in the Source Index.
-- NEVER invent sources, facts, or citations from your training data.
-- Every fact you state MUST come from the News Context below AND be cited with its number.
-STEP 1 — EVALUATE THE SOURCES:
-Read the News Context below and determine:
-A) DIRECT MATCH — Sources directly answer the question:
-   → Answer using ONLY facts from the context, cite each fact with [number]
-   → Use **bold** headlines for structure
-B) RELATED INFORMATION — Sources have related but not exact information:
-   → Say: "I found articles about [related topic], but not specifically about [exact query]."
-   → Share what IS in the context, citing with [number]
-C) NO SOURCES / NO RELEVANT INFORMATION:
-   → Say clearly: "I couldn't find relevant news on that topic in today's feed."
-   → STOP. Do not add any information from your training data.
-STEP 2 — ANSWER RULES:
-1. Use ONLY facts from the News Context below. NEVER use training data or general knowledge.
-2. Cite every fact with its source number: [1] or [2] etc. Only use numbers from the Source Index.
-3. Non-English articles — translate content to English in your answer.
-4. Always respond in English.
-5. At the END of your answer, on a new line, write exactly:
-   FOLLOW_UP: question1 | question2 | question3
-   (3 short follow-up questions based only on what you actually found)
-News Context (from live multilingual database):
 {context_text if context_text else "NO CONTEXT RETRIEVED."}
 Conversation History:
@@ -966,46 +990,38 @@ Answer:"""
             source_index_lines += f"[{idx}] {source_label}\n"
             doc["citation_index"] = idx
-        prompt_stream = f"""You are ARKI AI, a real-time news assistant. Today's date is {datetime.utcnow().strftime("%B %d, %Y")}.
-════════════════════════════════════════════════════════
-SOURCE INDEX — ONLY THESE SOURCES EXIST. DO NOT INVENT ANY OTHERS.
-════════════════════════════════════════════════════════
 {source_index_lines if source_index_lines else "NO SOURCES RETRIEVED."}
-════════════════════════════════════════════════════════
-CRITICAL CITATION RULE:
-- You have EXACTLY {len(final_sources)} source(s) listed above.
-- ONLY cite numbers that appear in the Source Index above (e.g. if you have 2 sources, only use [1] and [2]).
-- NEVER write [3], [4], [5]... if those numbers are not in the Source Index.
-- NEVER invent sources, facts, or citations from your training data.
-- Every fact you state MUST come from the News Context below AND be cited with its number.
-STEP 1 — EVALUATE THE SOURCES:
-Read the News Context below and determine:
-A) DIRECT MATCH — Sources directly answer the question:
-   → Answer using ONLY facts from the context, cite each fact with [number]
-   → Use **bold** headlines for structure
-B) RELATED INFORMATION — Sources have related but not exact information:
-   → Say: "I found articles about [related topic], but not specifically about [exact query]."
-   → Share what IS in the context, citing with [number]
-C) NO SOURCES / NO RELEVANT INFORMATION:
-   → Say clearly: "I couldn't find relevant news on that topic in today's feed."
-   → STOP. Do not add any information from your training data.
-STEP 2 — ANSWER RULES:
-1. Use ONLY facts from the News Context below. NEVER use training data or general knowledge.
-2. Cite every fact with its source number: [1] or [2] etc. Only use numbers from the Source Index.
-3. Non-English articles — translate content to English in your answer.
-4. Always respond in English.
-5. At the END of your answer, on a new line, write exactly:
-   FOLLOW_UP: question1 | question2 | question3
-   (3 short follow-up questions based only on what you actually found)
-News Context (from live multilingual database):
 {context_text if context_text else "NO CONTEXT RETRIEVED."}
 Conversation History:

         for doc in docs:
             content = doc.get("content", "")
+            # Clean Jina markdown artifacts before tokenizing/sending to LLM
+            content = self._clean_content_for_llm(content)
             metadata = doc.get("metadata", {})
             # Extract source name from multiple possible fields
         # ── Step 8: Token limitation ──────────────────────────────────────────
         return self._limit_context(query, deduped_final)
+    def _clean_content_for_llm(self, content: str) -> str:
+        """
+        Strip markdown artifacts from Jina-extracted content before sending to LLM.
+        Removes: image tags, link targets (link text is kept), bare URLs,
+        skip-to-content lines, short navigation bullets, horizontal rules.
+        Keeps: article text, headings, paragraphs.
+        """
+        import re
+        # Remove image markdown: ![alt](url)
+        content = re.sub(r'!\[.*?\]\(.*?\)', '', content)
+        # Remove inline links but keep link text: [text](url) → text
+        content = re.sub(r'\[([^\]]+)\]\([^)]+\)', r'\1', content)
+        # Remove bare URLs
+        content = re.sub(r'https?://\S+', '', content)
+        # Remove Skip to content / Skip to main
+        content = re.sub(r'\[?Skip to [^\n]+\n?', '', content, flags=re.IGNORECASE)
+        # Remove lines that are just navigation items (short lines with *)
+        lines = content.split('\n')
+        cleaned = []
+        for line in lines:
+            stripped = line.strip()
+            # Skip pure navigation bullets (short, no sentence structure)
+            if stripped.startswith('* ') and len(stripped) < 60 and '.' not in stripped:
+                continue
+            # Skip lines that are just dashes or equals
+            if re.match(r'^[-=]{3,}$', stripped):
+                continue
+            cleaned.append(line)
+        content = '\n'.join(cleaned)
+        # Collapse multiple blank lines
+        content = re.sub(r'\n{3,}', '\n\n', content)
+        return content.strip()
         past_messages = self.chat_history_db.get_history(session_id, limit=6)
         return "".join([f"{msg.role}: {msg.content}\n" for msg in past_messages])
                 source_label = source_name
             source_index_lines += f"[{idx}] {source_label}\n"
+        prompt = f"""You are ARKI AI, a real-time Ethiopia & Africa news assistant. Today: {datetime.utcnow().strftime("%B %d, %Y")}.
+SOURCE INDEX (cite by number — these are the ONLY sources you may use):
 {source_index_lines if source_index_lines else "NO SOURCES RETRIEVED."}
+STRICT RULES:
+- Use ONLY facts from the News Context below. NEVER use training data.
+- Cite every fact: [1], [2], etc. Only use numbers that exist in the Source Index above.
+- Non-English articles: translate to English in your answer.
+- Always respond in English.
+OUTPUT FORMAT — use this exact structure:
+## [Short headline summarizing the main news]
+**[Topic 1]**
+- Key fact from source [N]
+- Key fact from source [N]
+**[Topic 2]** (if applicable)
+- Key fact from source [N]
+> 💡 *[One sentence summary of the overall situation]*
+FOLLOW_UP: question1 | question2 | question3
+EVALUATION GUIDE:
+- If sources directly answer the question → use the format above
+- If sources are related but not exact → start with "I found related news:" then use the format
+- If no relevant sources → say "I couldn't find relevant news on that topic in today's feed." and STOP
+News Context:
 {context_text if context_text else "NO CONTEXT RETRIEVED."}
 Conversation History:
             source_index_lines += f"[{idx}] {source_label}\n"
             doc["citation_index"] = idx
+        prompt_stream = f"""You are ARKI AI, a real-time Ethiopia & Africa news assistant. Today: {datetime.utcnow().strftime("%B %d, %Y")}.
+SOURCE INDEX (cite by number — these are the ONLY sources you may use):
 {source_index_lines if source_index_lines else "NO SOURCES RETRIEVED."}
+STRICT RULES:
+- Use ONLY facts from the News Context below. NEVER use training data.
+- Cite every fact: [1], [2], etc. Only use numbers that exist in the Source Index above.
+- Non-English articles: translate to English in your answer.
+- Always respond in English.
+OUTPUT FORMAT — use this exact structure:
+## [Short headline summarizing the main news]
+**[Topic 1]**
+- Key fact from source [N]
+- Key fact from source [N]
+**[Topic 2]** (if applicable)
+- Key fact from source [N]
+> 💡 *[One sentence summary of the overall situation]*
+FOLLOW_UP: question1 | question2 | question3
+EVALUATION GUIDE:
+- If sources directly answer the question → use the format above
+- If sources are related but not exact → start with "I found related news:" then use the format
+- If no relevant sources → say "I couldn't find relevant news on that topic in today's feed." and STOP
+News Context:
 {context_text if context_text else "NO CONTEXT RETRIEVED."}
 Conversation History: