# Agent Tools Documentation
This document outlines the granular tools that can be created or extracted from the TaijiChat R Shiny application. These tools are intended for an agent system to access data, calculations, methodologies, tables, and graphs from the application.
---
Tool Name: `get_raw_excel_data`
Description: Reads a specified Excel file and returns its raw content as a list of lists, where each inner list represents a row. This tool is generic; the `file_path` should be an absolute path or a path relative to the project root (e.g., "www/some_data.xlsx"). For predefined datasets within the application structure, other more specific tools should be preferred if available.
Input: `file_path` (string) - The path to the Excel file.
Output: `data` (list of lists of strings/numbers) - The raw data from the Excel sheet. Returns an empty list if the file is not found or cannot be read.
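A minimal sketch of how a reader with this contract could behave. The function name mirrors the tool; the use of `openpyxl` is an assumption (the application itself is R/Shiny and may read Excel differently), and the key point illustrated is the empty-list fallback on any failure:

```python
from pathlib import Path

def get_raw_excel_data(file_path):
    """Return the first sheet's contents as a list of row lists; [] on any failure."""
    path = Path(file_path)
    if not path.is_file():
        return []
    try:
        # openpyxl is an assumed dependency choice for this sketch.
        from openpyxl import load_workbook
        workbook = load_workbook(path, read_only=True, data_only=True)
        rows = [list(row) for row in workbook.active.iter_rows(values_only=True)]
        workbook.close()
        return rows
    except Exception:
        return []
```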
---
Tool Name: `get_processed_tf_data`
Description: Reads and processes a TF-related Excel file identified by its `dataset_identifier` (e.g., "Naive", "Overall_TF_PageRank"). It uses an internal mapping (`get_tf_catalog_dataset_path`) to find the actual file path within the `www/tablePagerank/` directory. The standard processing includes: reading the Excel file, transposing it, using the original first row as new column headers, and then removing this header row from the data.
Input: `dataset_identifier` (string) - The identifier for the dataset. Valid identifiers include: "Overall_TF_PageRank", "Naive", "TE", "MP", "TCM", "TEM", "TRM", "TEXprog", "TEXeff", "TEXterm".
Output: `data` (list of lists of strings/numbers) - The processed data, where the first inner list contains the headers and subsequent lists are data rows. Returns an empty list if processing fails or the identifier is invalid.
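One plausible reading of the processing steps, sketched on an in-memory matrix. The assumption here is that the raw sheet lists TFs as rows, so transposing turns TF names into the header row that downstream tools (e.g. `filter_data_by_column_keywords`) match on:

```python
def process_tf_sheet(raw_rows):
    """Transpose the raw sheet so its rows become columns; the first
    transposed row then serves as the header row of the output."""
    transposed = [list(column) for column in zip(*raw_rows)]
    return transposed  # transposed[0] = headers, transposed[1:] = data rows
```

For example, a raw sheet `[["", "s1", "s2"], ["BATF", 1, 2], ["TCF7", 3, 4]]` becomes `[["", "BATF", "TCF7"], ["s1", 1, 3], ["s2", 2, 4]]`, with TF names as column headers.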
---
Tool Name: `filter_data_by_column_keywords`
Description: Filters a dataset (list of lists, where the first list is headers) based on keywords matching its column names. This is for data that has already been processed (e.g., by `get_processed_tf_data`) where TFs or genes are column headers. The keyword search is case-insensitive and supports multiple comma-separated keywords. If no keywords are provided, the original dataset is returned.
Input:
`dataset` (list of lists) - The data to filter, with the first list being headers.
`keywords` (string) - Comma-separated keywords to search for in column headers.
Output: `filtered_dataset` (list of lists) - The subset of the data containing only the matching columns (including the header row). If no columns match, a dataset containing only the header row is returned.
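A sketch of the matching rule as described (case-insensitive, comma-separated terms). Substring matching on headers is an assumption; exact matching is equally plausible:

```python
def filter_data_by_column_keywords(dataset, keywords):
    """Keep only the columns whose header matches any keyword; with no
    keywords the original dataset is returned unchanged."""
    if not dataset or not keywords.strip():
        return dataset
    terms = [k.strip().lower() for k in keywords.split(",") if k.strip()]
    headers = dataset[0]
    keep = [i for i, header in enumerate(headers)
            if any(term in str(header).lower() for term in terms)]
    return [[row[i] for i in keep] for row in dataset]
```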
---
Tool Name: `get_tf_wave_search_data`
Description: Reads the `searchtfwaves.xlsx` file from `www/waveanalysis/`, which contains TF names organized by "waves" (Wave1 to Wave7 as columns).
Input: `tf_search_term` (string, optional) - A specific TF name to search for. If empty or not provided, all TF wave data is returned. The search is case-insensitive.
Output: `wave_data` (dictionary) - If `tf_search_term` is provided and matches, returns a structure like `{"WaveX": ["TF1", "TF2"], "WaveY": ["TF1"]}` showing which waves the TF belongs to. If no `tf_search_term`, returns the full data as `{"Wave1": ["All TFs in Wave1"], "Wave2": ["All TFs in Wave2"], ...}`. If no matches are found for a search term, an empty dictionary is returned.
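A sketch of the search semantics over already-loaded wave columns. Exact, case-insensitive matching is assumed here; the real tool may use substring matching:

```python
def search_tf_in_waves(wave_columns, tf_search_term=""):
    """wave_columns maps wave names to TF lists, e.g. {"Wave1": ["BATF", ...]}.
    Without a search term the full mapping is returned; with one, only the
    waves containing that TF (and the matching entries) are kept."""
    if not tf_search_term:
        return wave_columns
    term = tf_search_term.strip().lower()
    hits = {wave: [tf for tf in tfs if tf.lower() == term]
            for wave, tfs in wave_columns.items()}
    return {wave: tfs for wave, tfs in hits.items() if tfs}
```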
---
Tool Name: `get_tf_correlation_data`
Description: Reads the `TF-TFcorTRMTEX.xlsx` file from `www/TFcorintextrm/`. If a `tf_name` is provided, it filters the data for that TF (case-insensitive match on the primary TF identifier column, typically "TF Name" or the first column).
Input: `tf_name` (string, optional) - The specific TF name to search for. If empty or not provided, returns the full dataset.
Output: `correlation_data` (list of lists) - The filtered (or full) data from the correlation table. The first list is headers. Returns only the header row if `tf_name` is provided but not found, or if the file cannot be processed.
---
Tool Name: `get_tf_correlation_image_path`
Description: Reads the `TF-TFcorTRMTEX.xlsx` file from `www/TFcorintextrm/`, finds the row for the given `tf_name` (case-insensitive match on the primary TF identifier column), and returns the path stored in the "TF Merged Graph Path" column. The returned path is relative to the project root (e.g., "www/networkanalysis/images/BATF_graph.png").
Input: `tf_name` (string) - The specific TF name.
Output: `image_path` (string) - The relative web path to the image or an empty string if not found or if the file cannot be processed.
---
Tool Name: `list_all_tfs_in_correlation_data`
Description: Reads the `TF-TFcorTRMTEX.xlsx` file from `www/TFcorintextrm/` and returns a list of all unique TF names from the primary TF identifier column (typically "TF Name" or the first column). Filters out empty strings and 'nan'.
Input: None
Output: `tf_list` (list of strings) - A list of TF names. Returns an empty list if the file cannot be processed.
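The de-duplication and filtering described here is straightforward; a sketch follows (order preservation is an assumption, and the `'nan'` check reflects the string that pandas-style reads leave behind for empty cells):

```python
def unique_tf_names(column_values):
    """Collapse a raw identifier column to unique, non-empty TF names,
    dropping empty strings and the literal string 'nan'."""
    seen, names = set(), []
    for value in column_values:
        name = str(value).strip()
        if not name or name.lower() == "nan":
            continue
        if name not in seen:
            seen.add(name)
            names.append(name)
    return names
```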
---
Tool Name: `get_tf_community_sheet_data`
Description: Reads one of the TF community Excel files (`trmcommunities.xlsx` or `texcommunities.xlsx`) located in `www/tfcommunities/`.
Input: `community_type` (string) - Either "trm" or "texterm".
Output: `community_data` (list of lists) - Data from the specified community sheet (raw format, first list is headers). Returns an empty list if the type is invalid or the file is not found or cannot be processed.
---
Tool Name: `get_static_image_path`
Description: Returns the predefined relative web path (e.g., "www/images/logo.png") for a known static image asset. These paths are typically relative to the project root.
Input: `image_identifier` (string) - A unique key representing the image (e.g., "home_page_diagram", "ucsd_logo", "naive_bubble_plot", "wave1_main_img", "wave1_gokegg_img", "wave1_ranked_text1_img", "tfcat_overview_img", "network_correlation_desc_img").
Output: `image_path` (string) - The relative path (e.g., "www/homedesc.png"). Returns an empty string if the identifier is unknown. This tool relies on an internal mapping (`_STATIC_IMAGE_WEB_PATHS` in `tools.agent_tools`).
---
Tool Name: `get_ui_descriptive_text`
Description: Retrieves predefined descriptive text, methodology explanations, or captions by its identifier, primarily from `tools/ui_texts.json`.
Input: `text_identifier` (string) - A unique key representing the text block (e.g., "tf_score_calculation_info", "cell_state_specificity_info", "wave_analysis_overview_text", "wave_1_analysis_placeholder_details").
Output: `descriptive_text` (string) - The requested text block. Returns an empty string if the identifier is unknown.
---
Tool Name: `list_available_tf_catalog_datasets`
Description: Returns a list of valid `dataset_identifier` strings that can be used with the `get_processed_tf_data` tool.
Input: None
Output: `dataset_identifiers` (list of strings) - E.g., ["Overall_TF_PageRank", "Naive", "TE", "MP", "TCM", "TEM", "TRM", "TEXprog", "TEXeff", "TEXterm"].
---
Tool Name: `list_available_cell_state_bubble_plots`
Description: Returns a list of identifiers for available cell-state specific bubble plot images. These identifiers can be used with `get_static_image_path`.
Input: None
Output: `image_identifiers` (list of strings) - E.g., ["naive_bubble_plot", "te_bubble_plot", ...]. Derived from internal mapping in `tools.agent_tools`.
---
Tool Name: `list_available_wave_analysis_assets`
Description: Returns a structured dictionary of available asset identifiers for a specific TF wave (main image, GO/KEGG image, ranked text images). Identifiers can be used with `get_static_image_path`.
Input: `wave_number` (integer, 1-7) - The wave number.
Output: `asset_info` (dictionary) - E.g., `{"main_image_id": "waveX_main_img", "gokegg_image_id": "waveX_gokegg_img", "ranked_text_image_ids": ["waveX_ranked_text1_img", ...]}`. Returns an empty dictionary if the wave number is invalid. Derived from internal mapping in `tools.agent_tools`.
---
Tool Name: `get_internal_navigation_info`
Description: Provides information about where an internal UI link (like those on the homepage image map or wave overview images) is intended to navigate within the application structure.
Input: `link_id` (string) - The identifier of the link (e.g., "to_tfcat", "to_tfwave", "to_tfnet", "c1_link", "c2_link", etc.).
Output: `navigation_target_description` (string) - A human-readable description of the target (e.g., "Navigates to the 'TF Catalog' section.", "Navigates to the 'Wave 1 Analysis' tab."). Derived from internal mapping in `tools.agent_tools`.
---
Tool Name: `get_biorxiv_paper_url`
Description: Returns the URL for the main bioRxiv paper referenced in the application.
Input: None
Output: `url` (string) - The bioRxiv paper URL.
---
Tool Name: `list_all_files_in_www_directory`
Description: Scans the entire `www/` directory (and its subdirectories, excluding common hidden/system files) and returns a list of all files found. For each file, it provides its relative path from the project root (e.g., "www/images/logo.png"), its detected MIME type (e.g., "image/png", "text/csv", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"), and its size in bytes. This tool helps in understanding all available static assets and data files within the web-accessible `www` directory.
Input: None
Output: `file_manifest` (list of dictionaries) - Each dictionary represents a file and contains the keys: `path` (string), `type` (string), `size` (integer). Example item: `{"path": "www/data/report.txt", "type": "text/plain", "size": 1024}`. Returns an empty list if the `www` directory isn't found or is empty.
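A sketch of the manifest scan using only Python's standard library (`pathlib`, `mimetypes`). The hidden-file rule and the fallback MIME type are assumptions:

```python
import mimetypes
from pathlib import Path

def list_all_files(root="www"):
    """Recursively build the file manifest described above."""
    root_path = Path(root)
    if not root_path.is_dir():
        return []
    manifest = []
    for path in sorted(root_path.rglob("*")):
        if not path.is_file() or path.name.startswith("."):
            continue  # skip directories and hidden/system files
        mime_type, _ = mimetypes.guess_type(path.name)
        manifest.append({
            "path": str(path),
            "type": mime_type or "application/octet-stream",
            "size": path.stat().st_size,
        })
    return manifest
```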
---
### `multi_source_literature_search(queries: list[str], max_results_per_query_per_source: int = 1, max_total_unique_papers: int = 10) -> list[dict]`
Searches for academic literature across multiple sources (Semantic Scholar, PubMed, ArXiv) using a list of provided search queries. It then de-duplicates the results, primarily by DOI and, when a DOI is not available, by a combination of title and first author. The search process stops early if the `max_total_unique_papers` limit is reached.
**Args:**
* `queries (list[str])`: A list of search query strings. The GenerationAgent should brainstorm 3-5 diverse queries relevant to the user's request.
* `max_results_per_query_per_source (int)`: The maximum number of results to fetch from EACH academic source (Semantic Scholar, PubMed, ArXiv) for EACH query string. Defaults to `1`.
* `max_total_unique_papers (int)`: The maximum total number of unique de-duplicated papers to return across all queries and sources. Defaults to `10`. The tool will stop fetching more data once this limit is met.
**Returns:**
* `list[dict]`: A consolidated and de-duplicated list of paper details, containing up to `max_total_unique_papers`. Each dictionary in the list represents a paper and has the following keys:
* `"title" (str)`: The title of the paper. "N/A" if not available.
* `"authors" (list[str])`: A list of author names. ["N/A"] if not available.
* `"year" (str | int)`: The publication year. "N/A" if not available.
* `"abstract" (str)`: A snippet of the abstract (typically up to 500 characters followed by "..."). "N/A" if not available.
* `"doi" (str | None)`: The Digital Object Identifier. `None` if not available.
* `"url" (str)`: A direct URL to the paper (e.g., PubMed link, ArXiv link, Semantic Scholar link). "N/A" if not available.
* `"venue" (str)`: The publication venue (e.g., journal name, "ArXiv"). "N/A" if not available.
* `"source_api" (str)`: The API from which this record was retrieved (e.g., "Semantic Scholar", "PubMed", "ArXiv").
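The de-duplication rule (DOI first, then title plus first author) can be sketched independently of the network calls. The key normalization (strip + lowercase) below is an assumption:

```python
def paper_key(paper):
    """Primary key: lowercased DOI; fallback: normalized title + first author."""
    doi = paper.get("doi")
    if doi:
        return ("doi", doi.strip().lower())
    title = str(paper.get("title", "N/A")).strip().lower()
    authors = paper.get("authors") or ["N/A"]
    return ("title+author", title, str(authors[0]).strip().lower())

def dedupe_papers(papers, max_total_unique_papers=10):
    """Keep the first occurrence of each paper, stopping at the cap."""
    seen, unique = set(), []
    for paper in papers:
        key = paper_key(paper)
        if key in seen:
            continue
        seen.add(key)
        unique.append(paper)
        if len(unique) >= max_total_unique_papers:
            break
    return unique
```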
**GenerationAgent Usage Example (for `python_code` field when `status` is `AWAITING_DATA`):**
```python
# Example: User asks for up to 3 papers
print(json.dumps({'intermediate_data_for_llm': tools.multi_source_literature_search(queries=["T-cell exhaustion markers AND cancer", "immunotherapy for melanoma AND biomarkers"], max_results_per_query_per_source=1, max_total_unique_papers=3)}))
# Example: Defaulting to 10 total unique papers
print(json.dumps({'intermediate_data_for_llm': tools.multi_source_literature_search(queries=["COVID-19 long-term effects"], max_results_per_query_per_source=2)}))
```
**Important Considerations for GenerationAgent:**
* When results are returned from this tool, the `GenerationAgent`'s `explanation` (for `CODE_COMPLETE` status) should present a summary of the *found papers* (e.g., titles, authors, URLs). It should clearly state that these are potential literature leads and should *not* yet claim to have read or summarized the full content of these papers in that same turn, unless a subsequent tool call for summarization is planned and executed.
---
### `fetch_text_from_urls(paper_info_list: list[dict], max_chars_per_paper: int = 15000) -> list[dict]`
Attempts to fetch and extract textual content from the URLs of papers provided in a list. This tool is typically used after `multi_source_literature_search` to gather content for summarization by the GenerationAgent.
**Args:**
* `paper_info_list (list[dict])`: A list of paper dictionaries, as returned by `multi_source_literature_search`. Each dictionary is expected to have at least a `"url"` key. Other keys like `"title"` and `"source_api"` are used for logging.
* `max_chars_per_paper (int)`: The maximum number of characters of text to retrieve and store for each paper. Defaults to `15000`. Text longer than this will be truncated.
**Returns:**
* `list[dict]`: The input `paper_info_list`, where each paper dictionary is augmented with a new key `"retrieved_text_content"`.
* If successful, `"retrieved_text_content" (str)` will contain the extracted text (up to `max_chars_per_paper`).
* If fetching or parsing fails for a paper, `"retrieved_text_content" (str)` will contain an error message (e.g., "Error: Invalid or missing URL.", "Error fetching URL: ...", "Error: No text could be extracted.").
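The per-paper augmentation contract (success text truncated to `max_chars_per_paper`, or an error string) can be sketched with the network fetcher injected as a callable, so the sketch stays offline; using an injected `fetch` instead of a real HTTP client is an assumption, and the error strings mirror the ones listed above:

```python
def attach_text_content(paper_info_list, fetch, max_chars_per_paper=15000):
    """Add 'retrieved_text_content' to each paper dict in place.
    `fetch` is any callable url -> str that raises on failure."""
    for paper in paper_info_list:
        url = paper.get("url")
        if not url or url == "N/A":
            paper["retrieved_text_content"] = "Error: Invalid or missing URL."
            continue
        try:
            text = fetch(url)
        except Exception as exc:
            paper["retrieved_text_content"] = f"Error fetching URL: {exc}"
            continue
        if not text:
            paper["retrieved_text_content"] = "Error: No text could be extracted."
        else:
            paper["retrieved_text_content"] = text[:max_chars_per_paper]
    return paper_info_list
```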
**GenerationAgent Usage Example (for `python_code` field when `status` is `AWAITING_DATA`):**
This tool is usually the second step in a literature review process.
```python
# Assume 'list_of_papers_from_search' is a variable holding the output from a previous
# call to tools.multi_source_literature_search(...)
print(json.dumps({'intermediate_data_for_llm': tools.fetch_text_from_urls(paper_info_list=list_of_papers_from_search, max_chars_per_paper=10000)}))
```
**Important Considerations for GenerationAgent:**
* After this tool returns the `paper_info_list` (now with `"retrieved_text_content"`), the `GenerationAgent` is responsible for using its own LLM capabilities to read the `"retrieved_text_content"` for each paper and generate summaries if requested by the user or if it's part of its plan.
* The `GenerationAgent` should be prepared for `"retrieved_text_content"` to contain error messages and handle them gracefully in its summarization logic (e.g., by stating that text for a particular paper could not be retrieved).
* Web scraping is inherently unreliable; success in fetching and parsing text can vary greatly between websites. The agent should not assume text will always be available.
---