Guiyom committed on
Commit
ed43430
·
verified ·
1 Parent(s): 1eacd00

Update app.py

Browse files
Files changed (1) hide show
  1. app.py +76 -22
app.py CHANGED
@@ -394,16 +394,77 @@ def openai_call(prompt: str, messages: list = None, model: str = "o3-mini",
394
  logging.error(err_msg)
395
  return err_msg
396
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
397
  def analyze_with_gpt4o(query: str, snippet: str, breadth: int, temperature: float = 0.7, max_tokens: int = 8000) -> dict:
398
  # If snippet is a callable, call it to get the string.
399
  if callable(snippet):
400
  snippet = snippet()
401
  snippet_words = len(snippet.split())
402
- # decide a proportional max tokens (cap at 3000 for example)
403
- # e.g. 1 token ~ ~0.75 words, so we do something simplistic:
 
 
 
 
 
 
 
404
  dynamic_tokens = min(3000, max(250, int(snippet_words * 0.5)))
405
 
406
- client = openai.OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
 
 
 
407
  prompt = (f"""Analyze the following content from a query result:
408
 
409
  {snippet}
@@ -412,40 +473,33 @@ Research topic:
412
  {query}
413
 
414
  Instructions:
415
- 1. Relevance: Determine if the content is relevant to the research topic. Answer with a single word: 'yes' or 'no'.
416
 
417
- 2. Structure: If the content is relevant, provide a comprehensive summary structured into the following sections. Prioritize extreme conciseness and token efficiency while preserving all key information. Aim for the shortest possible summary that retains all essential facts, figures, arguments, and quotes. The total summary should not exceed 1000 words, but shorter is strongly preferred.
418
- - Key Facts (at least 5): List the core factual claims. Use short, declarative sentences or bullet points. Apply lemmatization, common abbreviations (e.g., vs., e.g., i.e., AI, LLM), and remove unnecessary words.
419
- - Key Figures (at least 5): Extract numerical data, statistics, dates, percentages. Use numerical representation. Present concisely (list or table format).
420
  - Key Arguments (at least 5): Identify main arguments/claims. Summarize supporting evidence and counter-arguments. Use lemmatization, abbreviations, and concise phrasing. Remove redundant phrases.
421
- - Key Quotes (at least 1 f any): Include significant quotes (with the name of the author between parenthesis). Attribute quotes correctly. Choose quotes that are concise and impactful. If a quote can be paraphrased concisely without losing essential meaning, paraphrase it and note that it's a paraphrase. Use symbols instead of words (&, +, ->, =, ...).
422
- - Structured summary (10 to 50 sentences depending on the length): mention anecdotes, people, locations, anything that make will make the end report relatable and grounded
423
 
424
  Note: General Optimization Guidelines:
425
  - Lemmatize: Use the root form of words (e.g., "running" -> "run").
426
- - Abbreviate: Use common abbreviations
427
  - Remove Redundancy: Eliminate unnecessary words and phrases. Be concise.
428
  - Shorten Words (Carefully): If a shorter word conveys the same meaning (e.g., "information" -> "info"), use it, but avoid ambiguity.
429
  - Implicit Representation: Remove redundant terms.
430
- - Use Symbols: Use symbols instead of words (&, +, ->, =, ...).
431
 
432
- 3. Follow-up Search Queries: Generate at least {breadth} follow-up search queries. These should be relevant to the research topic but also developments from the content summarized, aim for deeper understanding, use search operators (AND, OR, quotation marks), and be represented as a Python list of strings.
433
- For example: "Artificial intelligence" AND (mathematics OR geometry) -algebra,science AND history AND mathematics,...
434
- Return the result as a JSON object with the keys 'relevant', 'structure', and 'followups'. The 'structure' value should itself be a JSON object with keys 'Key Facts', 'Key Figures', 'Key Arguments', 'Key Quotes' and 'Summary'.
435
 
436
- 4. Ensure that the summary length and level of detail is proportional to the source length.
437
  Source length: {snippet_words} words. You may produce a more detailed summary if the text is long.
438
 
439
  Proceed."""
440
  )
441
  try:
442
- response = client.chat.completions.create(
443
- model="gpt-4o-mini",
444
- messages=[{"role": "user", "content": prompt}],
445
- temperature=temperature,
446
- max_tokens=max_tokens
447
- )
448
- res_text = response.choices[0].message.content.strip()
449
  # Remove Markdown code fences if present
450
  if res_text.startswith("```"):
451
  res_text = re.sub(r"^```(json)?", "", res_text)
 
394
  logging.error(err_msg)
395
  return err_msg
396
 
397
def summarize_large_text(text: str, target_length: int, chunk_size: int = 1000, overlap: int = 200) -> str:
    """
    Summarize a large text via map-reduce over overlapping word chunks.

    The text is split into overlapping chunks, each chunk is summarized with an
    intermediate LLM call, and the intermediate summaries are then fused into a
    single final summary. The prompts for these calls explicitly instruct the
    model to preserve key details and to include any tables or structured data
    present.

    Parameters:
        text         : The input text to summarize.
        target_length: Approximate maximum number of tokens for the final summary.
        chunk_size   : The number of words to include in each chunk.
        overlap      : The number of overlapping words between consecutive chunks.

    Returns:
        The final combined summary as a string. If the text is at most
        `chunk_size` words, it is returned unchanged (no LLM call is made).
    """
    words = text.split()
    if len(words) <= chunk_size:
        # Short texts need no chunking — return as-is.
        return text

    # BUGFIX: the original loop advanced by (chunk_size - overlap), which is
    # zero or negative whenever overlap >= chunk_size — an infinite loop.
    # Clamp the step to at least 1 word so the loop always terminates.
    step = max(1, chunk_size - overlap)
    chunks = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

    summary_chunks = []
    for chunk in chunks:
        chunk_prompt = (
            "Summarize the following text, preserving all key details and ensuring that any tables or structured "
            "data are also summarized:\n\n" + chunk
        )
        # Use a relatively small max_tokens value for each chunk summarization;
        # the fusion pass below gets the full target_length budget.
        summary_chunk = openai_call(prompt=chunk_prompt, model="gpt-4o-mini", max_tokens_param=500, temperature=0.7)
        summary_chunks.append(summary_chunk.strip())

    combined_summary = "\n".join(summary_chunks)
    # Produce one final summary that fuses all the intermediate summaries.
    final_prompt = (
        "Combine the following summaries into one concise summary that preserves all critical details, "
        "including any relevant table or structured data:\n\n" + combined_summary
    )
    final_summary = openai_call(prompt=final_prompt, model="gpt-4o-mini", max_tokens_param=target_length, temperature=0.7)
    return final_summary.strip()
446
+
447
+
448
  def analyze_with_gpt4o(query: str, snippet: str, breadth: int, temperature: float = 0.7, max_tokens: int = 8000) -> dict:
449
  # If snippet is a callable, call it to get the string.
450
  if callable(snippet):
451
  snippet = snippet()
452
  snippet_words = len(snippet.split())
453
+
454
+ # Define a word threshold after which we start the chunking summarization.
455
+ CHUNK_WORD_THRESHOLD = 1500
456
+ if snippet_words > CHUNK_WORD_THRESHOLD:
457
+ # Adjust the target_length as needed (here using 2000 tokens as an example).
458
+ snippet = summarize_large_text(snippet, target_length=2000, chunk_size=1000, overlap=200)
459
+ snippet_words = len(snippet.split())
460
+
461
+ # Decide a proportional dynamic token count (for reference; not used to limit the API call below)
462
  dynamic_tokens = min(3000, max(250, int(snippet_words * 0.5)))
463
 
464
+ client = os.getenv('OPENAI_API_KEY') # alternatively, pass your API key here if needed.
465
+ # (Assuming you use a client instance from your OpenAI library elsewhere.)
466
+ # Here, we assume that openai.OpenAI(api_key=...) is wrapped by openai_call.
467
+
468
  prompt = (f"""Analyze the following content from a query result:
469
 
470
  {snippet}
 
473
  {query}
474
 
475
  Instructions:
476
+ 1. Relevance: Determine if the content is relevant to the research topic. Answer with a single word: 'yes' or 'no'.
477
 
478
+ 2. Structure: If the content is relevant, provide a comprehensive summary structured into the following sections. Prioritize extreme conciseness and token efficiency while preserving all key information. Aim for the shortest possible summary that retains all essential facts, figures, arguments, and quotes. The total summary should not exceed 1000 words, but shorter is strongly preferred.
479
+ - Key Facts (at least 5): List the core factual claims. Use short, declarative sentences or bullet points. Apply lemmatization, common abbreviations (e.g., vs., e.g., i.e., AI, LLM), and remove unnecessary words.
480
+ - Key Figures (at least 5): Extract numerical data, statistics, dates, percentages. Use numerical representation and present concisely (list or table format). If the content includes tables or structured data, extract and summarize the critical information from them.
481
  - Key Arguments (at least 5): Identify main arguments/claims. Summarize supporting evidence and counter-arguments. Use lemmatization, abbreviations, and concise phrasing. Remove redundant phrases.
482
+ - Key Quotes (at least 1 if any): Include significant quotes (with the name of the author in parentheses). Attribute quotes correctly. Choose quotes that are concise and impactful. If a quote can be paraphrased concisely without losing essential meaning, paraphrase it and note that it's a paraphrase. Use symbols instead of words (&, +, ->, =, ...).
483
+ - Structured Summary (10 to 50 sentences depending on the length): Mention anecdotes, people, locations, and any additional context that will make the end report relatable and grounded.
484
 
485
  Note: General Optimization Guidelines:
486
  - Lemmatize: Use the root form of words (e.g., "running" -> "run").
487
+ - Abbreviate: Use common abbreviations.
488
  - Remove Redundancy: Eliminate unnecessary words and phrases. Be concise.
489
  - Shorten Words (Carefully): If a shorter word conveys the same meaning (e.g., "information" -> "info"), use it, but avoid ambiguity.
490
  - Implicit Representation: Remove redundant terms.
491
+ - Use Symbols: Use symbols instead of words (&, +, ->, =, ...).
492
 
493
+ 3. Follow-up Search Queries: Generate at least {breadth} follow-up search queries. These should be relevant to the research topic and build upon the summarized content. Aim for deeper understanding by using search operators (AND, OR, quotation marks) where appropriate. Represent these queries as a Python list of strings, e.g., ["query1", "query2", ...].
 
 
494
 
495
+ 4. Ensure that the summary length and level of detail is proportional to the source length.
496
  Source length: {snippet_words} words. You may produce a more detailed summary if the text is long.
497
 
498
  Proceed."""
499
  )
500
  try:
501
+ response = openai_call(prompt=prompt, model="gpt-4o-mini", max_tokens_param=max_tokens, temperature=temperature)
502
+ res_text = response.strip()
 
 
 
 
 
503
  # Remove Markdown code fences if present
504
  if res_text.startswith("```"):
505
  res_text = re.sub(r"^```(json)?", "", res_text)