Agent Tools Documentation
This document outlines the granular tools that can be created or extracted from the TaijiChat R Shiny application. These tools are intended for an agent system to access data, calculations, methodologies, tables, and graphs from the application.
Tool Name: get_raw_excel_data
Description: Reads a specified Excel file and returns its raw content as a list of lists, where each inner list represents a row. This tool is generic; the file_path should be an absolute path or a path relative to the project root (e.g., "www/some_data.xlsx"). For predefined datasets within the application structure, other more specific tools should be preferred if available.
Input: file_path (string) - The path to the Excel file.
Output: data (list of lists of strings/numbers) - The raw data from the Excel sheet. Returns an empty list if the file is not found or cannot be read.
Tool Name: get_processed_tf_data
Description: Reads and processes a TF-related Excel file identified by its dataset_identifier (e.g., "Naive", "Overall_TF_PageRank"). It uses an internal mapping (get_tf_catalog_dataset_path) to find the actual file path within the 'www/tablePagerank/' directory. The standard processing includes: reading the Excel file, transposing it, using the original first row as new column headers, and then removing this header row from the data.
Input: dataset_identifier (string) - The identifier for the dataset. Valid identifiers include: "Overall_TF_PageRank", "Naive", "TE", "MP", "TCM", "TEM", "TRM", "TEXprog", "TEXeff", "TEXterm".
Output: data (list of lists of strings/numbers) - The processed data, where the first inner list contains the headers, and subsequent lists are data rows. Returns an empty list if processing fails or the identifier is invalid.
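The transpose-and-reheader pipeline described above can be sketched in plain Python. This is an illustrative helper (`process_tf_table` is a hypothetical name, and it operates on already-read rows rather than the Excel file itself); the exact orientation of the real sheets may differ:

```python
def process_tf_table(raw_rows):
    """Sketch of the processing described for get_processed_tf_data:
    transpose the raw sheet, promote the first row of the transposed
    data to column headers, and drop that header row from the body."""
    # Transpose: columns of the raw sheet become rows.
    transposed = [list(col) for col in zip(*raw_rows)]
    headers = transposed[0]   # new column headers
    body = transposed[1:]     # data rows, header row removed
    return [headers] + body

# Toy sheet: first column holds row labels, remaining columns hold values.
raw = [
    ["TF", "BATF", "TCF7"],
    ["Naive", 0.12, 0.95],
    ["TEXterm", 0.88, 0.05],
]
processed = process_tf_table(raw)
# processed[0] -> ["TF", "Naive", "TEXterm"]  (headers)
# processed[1] -> ["BATF", 0.12, 0.88]
```

This mirrors the common pandas idiom of transposing a frame and promoting its first row to column labels.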
Tool Name: filter_data_by_column_keywords
Description: Filters a dataset (list of lists, where the first list is headers) based on keywords matching its column names. This is for data that has already been processed (e.g., by get_processed_tf_data) where TFs or genes are column headers. The keyword search is case-insensitive and supports multiple comma-separated keywords. If no keywords are provided, the original dataset is returned.
Input:
dataset (list of lists) - The data to filter, with the first list being headers.
keywords (string) - Comma-separated keywords to search for in column headers.
Output: filtered_dataset (list of lists) - The subset of the data containing only the matching columns (including the header row). Returns an empty list (with headers only) if no columns match.
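The column-matching behaviour can be sketched as follows (hypothetical helper; the real tool's handling of the no-match edge case may differ slightly):

```python
def filter_columns_by_keywords(dataset, keywords):
    """Keep only columns whose header matches any of the
    comma-separated keywords (case-insensitive substring match).
    Returns the dataset unchanged if keywords is empty."""
    if not keywords.strip():
        return dataset
    terms = [k.strip().lower() for k in keywords.split(",") if k.strip()]
    headers = dataset[0]
    keep = [i for i, h in enumerate(headers)
            if any(t in str(h).lower() for t in terms)]
    # Emit every row restricted to the matching columns.
    return [[row[i] for i in keep] for row in dataset]

data = [["BATF", "TCF7", "TOX"],
        [0.1, 0.9, 0.4],
        [0.8, 0.2, 0.6]]
filter_columns_by_keywords(data, "tcf7, tox")
# -> [["TCF7", "TOX"], [0.9, 0.4], [0.2, 0.6]]
```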
Tool Name: get_tf_wave_search_data
Description: Reads the searchtfwaves.xlsx file from www/waveanalysis/, which contains TF names organized by "waves" (Wave1 to Wave7 as columns).
Input: tf_search_term (string, optional) - A specific TF name to search for. If empty or not provided, all TF wave data is returned. The search is case-insensitive.
Output: wave_data (dictionary) - If tf_search_term is provided and matches, returns a structure like {"WaveX": ["TF1", "TF2"], "WaveY": ["TF1"]} showing which waves the TF belongs to. If no tf_search_term is provided, returns the full data as {"Wave1": ["All TFs in Wave1"], "Wave2": ["All TFs in Wave2"], ...}. If no matches are found for a search term, an empty dictionary is returned.
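The wave-membership search described above reduces to a dictionary filter. A minimal sketch, assuming the Wave1..Wave7 columns have already been read into a dict (the helper name `search_tf_waves` is hypothetical):

```python
def search_tf_waves(wave_columns, tf_search_term=""):
    """wave_columns maps wave name -> list of TF names, mirroring the
    Wave1..Wave7 columns of searchtfwaves.xlsx."""
    if not tf_search_term:
        return wave_columns  # no search term: return everything
    term = tf_search_term.lower()
    hits = {wave: [tf for tf in tfs if tf.lower() == term]
            for wave, tfs in wave_columns.items()}
    # Keep only waves where the TF actually appears.
    return {w: tfs for w, tfs in hits.items() if tfs}

waves = {"Wave1": ["TCF7", "LEF1"], "Wave2": ["TCF7", "BATF"]}
search_tf_waves(waves, "tcf7")
# -> {"Wave1": ["TCF7"], "Wave2": ["TCF7"]}
```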
Tool Name: get_tf_correlation_data
Description: Reads the TF-TFcorTRMTEX.xlsx file from www/TFcorintextrm/. If a tf_name is provided, it filters the data for that TF (case-insensitive match on the primary TF identifier column, typically "TF Name" or the first column).
Input: tf_name (string, optional) - The specific TF name to search for. If empty or not provided, returns the full dataset.
Output: correlation_data (list of lists) - The filtered (or full) data from the correlation table. The first list is headers. Returns an empty list (with headers only) if tf_name is provided but not found or if the file cannot be processed.
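The filtering step amounts to a case-insensitive row match on the first column while always preserving the header row. An illustrative sketch (`filter_rows_by_tf` is a hypothetical name, not the tool's internal implementation):

```python
def filter_rows_by_tf(table, tf_name=""):
    """table is a list of lists whose first list is the header row and
    whose first column holds the TF identifier."""
    if not tf_name:
        return table
    target = tf_name.lower()
    matches = [row for row in table[1:] if str(row[0]).lower() == target]
    # Headers are always kept, even when no row matches.
    return [table[0]] + matches

corr = [["TF Name", "Correlation"], ["BATF", 0.7], ["TOX", -0.2]]
filter_rows_by_tf(corr, "batf")
# -> [["TF Name", "Correlation"], ["BATF", 0.7]]
```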
Tool Name: get_tf_correlation_image_path
Description: Reads the TF-TFcorTRMTEX.xlsx file from www/TFcorintextrm/, finds the row for the given tf_name (case-insensitive match on the primary TF identifier column), and returns the path stored in the "TF Merged Graph Path" column. The returned path is relative to the project's www directory (e.g., "www/networkanalysis/images/BATF_graph.png").
Input: tf_name (string) - The specific TF name.
Output: image_path (string) - The relative web path to the image or an empty string if not found or if the file cannot be processed.
Tool Name: list_all_tfs_in_correlation_data
Description: Reads the TF-TFcorTRMTEX.xlsx file from www/TFcorintextrm/ and returns a list of all unique TF names from the primary TF identifier column (typically "TF Name" or the first column). Filters out empty strings and 'nan'.
Input: None
Output: tf_list (list of strings) - A list of TF names. Returns an empty list if the file cannot be processed.
Tool Name: get_tf_community_sheet_data
Description: Reads one of the TF community Excel files (trmcommunities.xlsx or texcommunities.xlsx) located in www/tfcommunities/.
Input: community_type (string) - Either "trm" or "texterm".
Output: community_data (list of lists) - Data from the specified community sheet (raw format, first list is headers). Returns an empty list if the type is invalid or the file cannot be found or processed.
Tool Name: get_static_image_path
Description: Returns the predefined relative web path (e.g., "www/images/logo.png") for a known static image asset. These paths are typically relative to the project root.
Input: image_identifier (string) - A unique key representing the image (e.g., "home_page_diagram", "ucsd_logo", "naive_bubble_plot", "wave1_main_img", "wave1_gokegg_img", "wave1_ranked_text1_img", "tfcat_overview_img", "network_correlation_desc_img").
Output: image_path (string) - The relative path (e.g., "www/homedesc.png"). Returns an empty string if identifier is unknown. This tool relies on an internal mapping (_STATIC_IMAGE_WEB_PATHS in tools.agent_tools).
Tool Name: get_ui_descriptive_text
Description: Retrieves predefined descriptive text, methodology explanations, or captions by its identifier, primarily from tools/ui_texts.json.
Input: text_identifier (string) - A unique key representing the text block (e.g., "tf_score_calculation_info", "cell_state_specificity_info", "wave_analysis_overview_text", "wave_1_analysis_placeholder_details").
Output: descriptive_text (string) - The requested text block. Returns an empty string if identifier is unknown.
Tool Name: list_available_tf_catalog_datasets
Description: Returns a list of valid dataset_identifier strings that can be used with the get_processed_tf_data tool.
Input: None
Output: dataset_identifiers (list of strings) - E.g., ["Overall_TF_PageRank", "Naive", "TE", "MP", "TCM", "TEM", "TRM", "TEXprog", "TEXeff", "TEXterm"].
Tool Name: list_available_cell_state_bubble_plots
Description: Returns a list of identifiers for available cell-state specific bubble plot images. These identifiers can be used with get_static_image_path.
Input: None
Output: image_identifiers (list of strings) - E.g., ["naive_bubble_plot", "te_bubble_plot", ...]. Derived from internal mapping in tools.agent_tools.
Tool Name: list_available_wave_analysis_assets
Description: Returns a structured dictionary of available asset identifiers for a specific TF wave (main image, GO/KEGG image, ranked text images). Identifiers can be used with get_static_image_path.
Input: wave_number (integer, 1-7) - The wave number.
Output: asset_info (dictionary) - E.g., {"main_image_id": "waveX_main_img", "gokegg_image_id": "waveX_gokegg_img", "ranked_text_image_ids": ["waveX_ranked_text1_img", ...]}. Returns empty if wave number is invalid. Derived from internal mapping in tools.agent_tools.
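Given the identifier pattern used by get_static_image_path (e.g., "wave1_main_img", "wave1_gokegg_img", "wave1_ranked_text1_img"), the asset bundle can be sketched like this. The `ranked_text_count` parameter is hypothetical; the real per-wave counts live in the internal mapping in tools.agent_tools:

```python
def wave_asset_identifiers(wave_number, ranked_text_count=1):
    """Sketch of the identifier bundle described for
    list_available_wave_analysis_assets."""
    if not 1 <= wave_number <= 7:
        return {}  # invalid wave number
    return {
        "main_image_id": f"wave{wave_number}_main_img",
        "gokegg_image_id": f"wave{wave_number}_gokegg_img",
        "ranked_text_image_ids": [
            f"wave{wave_number}_ranked_text{i}_img"
            for i in range(1, ranked_text_count + 1)
        ],
    }

wave_asset_identifiers(1)
# -> {"main_image_id": "wave1_main_img",
#     "gokegg_image_id": "wave1_gokegg_img",
#     "ranked_text_image_ids": ["wave1_ranked_text1_img"]}
```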
Tool Name: get_internal_navigation_info
Description: Provides information about where an internal UI link (like those on the homepage image map or wave overview images) is intended to navigate within the application structure.
Input: link_id (string) - The identifier of the link (e.g., "to_tfcat", "to_tfwave", "to_tfnet", "c1_link", "c2_link", etc.).
Output: navigation_target_description (string) - A human-readable description of the target (e.g., "Navigates to the 'TF Catalog' section.", "Navigates to the 'Wave 1 Analysis' tab."). Derived from internal mapping in tools.agent_tools.
Tool Name: get_biorxiv_paper_url
Description: Returns the URL for the main bioRxiv paper referenced in the application.
Input: None
Output: url (string) - The bioRxiv paper URL.
Tool Name: list_all_files_in_www_directory
Description: Scans the entire www/ directory (and its subdirectories, excluding common hidden/system files) and returns a list of all files found. For each file, it provides its relative path from the project root (e.g., "www/images/logo.png"), its detected MIME type (e.g., "image/png", "text/csv", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"), and its size in bytes. This tool helps in understanding all available static assets and data files within the web-accessible www directory.
Input: None
Output: file_manifest (list of dictionaries) - Each dictionary represents a file and contains the keys: path (string), type (string), size (integer). Example item: {"path": "www/data/report.txt", "type": "text/plain", "size": 1024}. Returns an empty list if the www directory isn't found or is empty.
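A directory walk of this kind can be sketched with the standard library alone (`build_file_manifest` is an illustrative name; the real tool's hidden-file rules and MIME fallback may differ):

```python
import mimetypes
import os

def build_file_manifest(root="www"):
    """Walk root and emit {"path", "type", "size"} entries,
    skipping hidden files and directories (e.g. .DS_Store, .git)."""
    manifest = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune hidden directories in place so os.walk skips them.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in sorted(filenames):
            if name.startswith("."):
                continue
            full = os.path.join(dirpath, name)
            mime, _ = mimetypes.guess_type(full)
            manifest.append({
                "path": full.replace(os.sep, "/"),
                "type": mime or "application/octet-stream",
                "size": os.path.getsize(full),
            })
    return manifest
```

Note that os.walk yields nothing for a nonexistent root, so a missing www directory naturally produces the documented empty list.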
multi_source_literature_search(queries: list[str], max_results_per_query_per_source: int = 1, max_total_unique_papers: int = 10) -> list[dict]
Searches for academic literature across multiple sources (Semantic Scholar, PubMed, ArXiv) using a list of provided search queries. It then de-duplicates the results based primarily on DOI, and secondarily on a combination of title and first author if DOI is not available. The search process stops early if the max_total_unique_papers limit is reached.
Args:
queries (list[str]): A list of search query strings. The GenerationAgent should brainstorm 3-5 diverse queries relevant to the user's request.
max_results_per_query_per_source (int): The maximum number of results to fetch from EACH academic source (Semantic Scholar, PubMed, ArXiv) for EACH query string. Defaults to 1.
max_total_unique_papers (int): The maximum total number of unique de-duplicated papers to return across all queries and sources. Defaults to 10. The tool will stop fetching more data once this limit is met.
Returns:
list[dict]: A consolidated and de-duplicated list of paper details, containing up to max_total_unique_papers items. Each dictionary in the list represents a paper and has the following keys:
"title" (str): The title of the paper. "N/A" if not available.
"authors" (list[str]): A list of author names. ["N/A"] if not available.
"year" (str | int): The publication year. "N/A" if not available.
"abstract" (str): A snippet of the abstract (typically up to 500 characters followed by "..."). "N/A" if not available.
"doi" (str | None): The Digital Object Identifier. None if not available.
"url" (str): A direct URL to the paper (e.g., PubMed link, ArXiv link, Semantic Scholar link). "N/A" if not available.
"venue" (str): The publication venue (e.g., journal name, "ArXiv"). "N/A" if not available.
"source_api" (str): The API from which this record was retrieved (e.g., "Semantic Scholar", "PubMed", "ArXiv").
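The de-duplication strategy (DOI first, then title plus first author) can be sketched as below. This is a simplified stand-in for the tool's internal logic, not its actual code:

```python
def dedupe_papers(papers, max_total=10):
    """De-duplicate by DOI when present, otherwise by (title, first
    author), keeping the first occurrence, up to max_total papers."""
    seen, unique = set(), []
    for p in papers:
        doi = p.get("doi")
        if doi:
            key = ("doi", doi.lower())
        else:
            first_author = (p.get("authors") or ["N/A"])[0]
            key = ("title", p.get("title", "").lower(), first_author.lower())
        if key in seen:
            continue  # duplicate record, skip it
        seen.add(key)
        unique.append(p)
        if len(unique) >= max_total:
            break  # stop early once the cap is reached
    return unique

papers = [
    {"title": "A", "doi": "10.1/x", "authors": ["Smith"]},
    {"title": "A (preprint)", "doi": "10.1/X", "authors": ["Smith"]},  # same DOI
    {"title": "B", "doi": None, "authors": ["Lee"]},
    {"title": "b", "doi": None, "authors": ["Lee"]},  # same title + first author
]
dedupe_papers(papers)  # -> keeps only the first and third entries
```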
GenerationAgent Usage Example (for python_code field when status is AWAITING_DATA):
# Example: User asks for up to 3 papers
print(json.dumps({'intermediate_data_for_llm': tools.multi_source_literature_search(queries=["T-cell exhaustion markers AND cancer", "immunotherapy for melanoma AND biomarkers"], max_results_per_query_per_source=1, max_total_unique_papers=3)}))
# Example: Defaulting to 10 total unique papers
print(json.dumps({'intermediate_data_for_llm': tools.multi_source_literature_search(queries=["COVID-19 long-term effects"], max_results_per_query_per_source=2)}))
Important Considerations for GenerationAgent:
- When results are returned from this tool, the GenerationAgent's explanation (for CODE_COMPLETE status) should present a summary of the found papers (e.g., titles, authors, URLs). It should clearly state that these are potential literature leads and should not yet claim to have read or summarized the full content of these papers in that same turn, unless a subsequent tool call for summarization is planned and executed.
fetch_text_from_urls(paper_info_list: list[dict], max_chars_per_paper: int = 15000) -> list[dict]
Attempts to fetch and extract textual content from the URLs of papers provided in a list. This tool is typically used after multi_source_literature_search to gather content for summarization by the GenerationAgent.
Args:
paper_info_list (list[dict]): A list of paper dictionaries, as returned by multi_source_literature_search. Each dictionary is expected to have at least a "url" key. Other keys like "title" and "source_api" are used for logging.
max_chars_per_paper (int): The maximum number of characters of text to retrieve and store for each paper. Defaults to 15000. Text longer than this will be truncated.
Returns:
list[dict]: The input paper_info_list, where each paper dictionary is augmented with a new key "retrieved_text_content".
- If successful, "retrieved_text_content" (str) will contain the extracted text (up to max_chars_per_paper characters).
- If fetching or parsing fails for a paper, "retrieved_text_content" (str) will contain an error message (e.g., "Error: Invalid or missing URL.", "Error fetching URL: ...", "Error: No text could be extracted.").
GenerationAgent Usage Example (for python_code field when status is AWAITING_DATA):
This tool is usually the second step in a literature review process.
# Assume 'list_of_papers_from_search' is a variable holding the output from a previous
# call to tools.multi_source_literature_search(...)
print(json.dumps({'intermediate_data_for_llm': tools.fetch_text_from_urls(paper_info_list=list_of_papers_from_search, max_chars_per_paper=10000)}))
Important Considerations for GenerationAgent:
- After this tool returns the paper_info_list (now with "retrieved_text_content"), the GenerationAgent is responsible for using its own LLM capabilities to read the "retrieved_text_content" for each paper and generate summaries if requested by the user or if it's part of its plan.
- The GenerationAgent should be prepared for "retrieved_text_content" to contain error messages and handle them gracefully in its summarization logic (e.g., by stating that text for a particular paper could not be retrieved).
- Web scraping is inherently unreliable; success in fetching and parsing text can vary greatly between websites. The agent should not assume text will always be available.
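Since all documented failure messages begin with "Error", the graceful-handling pattern can be sketched as a simple partition (the helper name `summarize_or_report` is hypothetical):

```python
def summarize_or_report(papers):
    """Separate papers whose text was retrieved from those whose
    retrieved_text_content carries an error message, so failures can
    be reported gracefully instead of being summarized."""
    readable, failed = [], []
    for p in papers:
        text = p.get("retrieved_text_content", "")
        if text.startswith("Error"):
            failed.append(p.get("title", "Unknown paper"))
        else:
            readable.append(p)
    return readable, failed

papers = [
    {"title": "A", "retrieved_text_content": "Full text ..."},
    {"title": "B", "retrieved_text_content": "Error fetching URL: timeout"},
]
readable, failed = summarize_or_report(papers)
# readable contains paper "A"; failed == ["B"]
```

The agent would then summarize only the readable papers and note in its explanation that text for the failed ones could not be retrieved.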