# Agent Tools Documentation
This document outlines the granular tools that can be created or extracted from the TaijiChat R Shiny application. These tools are intended for an agent system to access data, calculations, methodologies, tables, and graphs from the application.
---
Tool Name: `get_raw_excel_data`
Description: Reads a specified Excel file and returns its raw content as a list of lists, where each inner list represents a row. This tool is generic; the `file_path` should be an absolute path or a path relative to the project root (e.g., "www/some_data.xlsx"). For predefined datasets within the application structure, other more specific tools should be preferred if available.
Input: `file_path` (string) - The path to the Excel file.
Output: `data` (list of lists of strings/numbers) - The raw data from the Excel sheet. Returns an empty list if the file is not found or cannot be read.
---
Tool Name: `get_processed_tf_data`
Description: Reads and processes a TF-related Excel file identified by its `dataset_identifier` (e.g., "Naive", "Overall_TF_PageRank"). It uses an internal mapping (`get_tf_catalog_dataset_path`) to find the actual file path within the `www/tablePagerank/` directory. The standard processing includes: reading the Excel file, transposing it, using the original first row as new column headers, and then removing this header row from the data.
Input: `dataset_identifier` (string) - The identifier for the dataset. Valid identifiers include: "Overall_TF_PageRank", "Naive", "TE", "MP", "TCM", "TEM", "TRM", "TEXprog", "TEXeff", "TEXterm".
Output: `data` (list of lists of strings/numbers) - The processed data, where the first inner list contains the headers, and subsequent lists are data rows. Returns an empty list if processing fails or the identifier is invalid.
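For illustration, a minimal Python sketch of the transposition step described above (the helper name and sample data are hypothetical; the application's internal reader may differ in details):

```python
def process_tf_sheet(rows):
    """Transpose a sheet read as a list of row-lists, so that the
    first inner list of the result can serve as the header row."""
    if not rows:
        return []
    # zip(*rows) turns each original column into one row of the result.
    return [list(col) for col in zip(*rows)]

# Sample sheet: TF names down the first column, cell states across the top.
sheet = [
    ["TF", "Naive", "TE"],
    ["BATF", 0.12, 0.30],
    ["TBX21", 0.08, 0.41],
]
processed = process_tf_sheet(sheet)
print(processed[0])  # ['TF', 'BATF', 'TBX21']
```

After this step the TF names appear as column headers, which is the layout `filter_data_by_column_keywords` expects.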
---
Tool Name: `filter_data_by_column_keywords`
Description: Filters a dataset (list of lists, where the first list is headers) based on keywords matching its column names. This is for data that has already been processed (e.g., by `get_processed_tf_data`) so that TFs or genes are column headers. The keyword search is case-insensitive and supports multiple comma-separated keywords. If no keywords are provided, the original dataset is returned.
Input:
`dataset` (list of lists) - The data to filter, with the first list being headers.
`keywords` (string) - Comma-separated keywords to search for in column headers.
Output: `filtered_dataset` (list of lists) - The subset of the data containing only the matching columns (including the header row). Returns a list containing only the header row if no columns match.
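A minimal sketch of the described column filter (hypothetical helper; case-insensitive substring matching on headers is an assumption):

```python
def filter_columns_by_keywords(dataset, keywords):
    """Keep only the columns whose header matches one of the
    comma-separated keywords (case-insensitive substring match)."""
    if not keywords.strip():
        return dataset  # no keywords: return the dataset unchanged
    terms = [k.strip().lower() for k in keywords.split(",") if k.strip()]
    headers = dataset[0]
    keep = [i for i, h in enumerate(headers)
            if any(t in str(h).lower() for t in terms)]
    if not keep:
        return [headers]  # no matches: header row only, per the contract
    return [[row[i] for i in keep] for row in dataset]

data = [["BATF", "TBX21", "IRF4"],
        [0.1, 0.2, 0.3]]
print(filter_columns_by_keywords(data, "batf, irf"))  # [['BATF', 'IRF4'], [0.1, 0.3]]
```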
---
Tool Name: `get_tf_wave_search_data`
Description: Reads the `searchtfwaves.xlsx` file from `www/waveanalysis/`, which contains TF names organized by "waves" (Wave1 to Wave7 as columns).
Input: `tf_search_term` (string, optional) - A specific TF name to search for. If empty or not provided, all TF wave data is returned. The search is case-insensitive.
Output: `wave_data` (dictionary) - If `tf_search_term` is provided and matches, returns a structure like `{"WaveX": ["TF1", "TF2"], "WaveY": ["TF1"]}` showing which waves the TF belongs to. If no `tf_search_term`, returns the full data as `{"Wave1": ["All TFs in Wave1"], "Wave2": ["All TFs in Wave2"], ...}`. If no matches are found for a search term, an empty dictionary is returned.
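A sketch of the wave lookup in Python (the in-memory dictionary stands in for the contents of `searchtfwaves.xlsx`; exact-name matching is an assumption):

```python
def search_tf_waves(wave_table, tf_search_term=""):
    """Return the waves containing the searched TF, or the full table
    when no search term is given (case-insensitive match)."""
    if not tf_search_term:
        return wave_table
    term = tf_search_term.lower()
    hits = {}
    for wave, tfs in wave_table.items():
        matched = [tf for tf in tfs if tf.lower() == term]
        if matched:
            hits[wave] = matched
    return hits  # empty dict when nothing matches

waves = {"Wave1": ["BATF", "IRF4"], "Wave2": ["BATF"], "Wave3": ["TBX21"]}
print(search_tf_waves(waves, "batf"))  # {'Wave1': ['BATF'], 'Wave2': ['BATF']}
```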
---
Tool Name: `get_tf_correlation_data`
Description: Reads the `TF-TFcorTRMTEX.xlsx` file from `www/TFcorintextrm/`. If a `tf_name` is provided, it filters the data for that TF (case-insensitive match on the primary TF identifier column, typically "TF Name" or the first column).
Input: `tf_name` (string, optional) - The specific TF name to search for. If empty or not provided, returns the full dataset.
Output: `correlation_data` (list of lists) - The filtered (or full) data from the correlation table. The first list is headers. Returns a list containing only the header row if `tf_name` is provided but not found, or if the file cannot be processed.
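An illustrative sketch of the row filter (assumes the TF identifier sits in the first column, as the description suggests; sample table values are invented):

```python
def filter_rows_by_tf(table, tf_name=""):
    """Return the header row plus the rows whose first cell matches
    `tf_name` (case-insensitive); the full table if no name is given."""
    if not tf_name:
        return table
    headers, rows = table[0], table[1:]
    matched = [r for r in rows if str(r[0]).lower() == tf_name.lower()]
    return [headers] + matched

tbl = [["TF Name", "Correlated TF", "r"],
       ["BATF", "IRF4", 0.82],
       ["TBX21", "EOMES", 0.77]]
print(filter_rows_by_tf(tbl, "batf"))
# [['TF Name', 'Correlated TF', 'r'], ['BATF', 'IRF4', 0.82]]
```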
---
Tool Name: `get_tf_correlation_image_path`
Description: Reads the `TF-TFcorTRMTEX.xlsx` file from `www/TFcorintextrm/`, finds the row for the given `tf_name` (case-insensitive match on the primary TF identifier column), and returns the path stored in the "TF Merged Graph Path" column. The returned path is relative to the project's `www` directory (e.g., "www/networkanalysis/images/BATF_graph.png").
Input: `tf_name` (string) - The specific TF name.
Output: `image_path` (string) - The relative web path to the image, or an empty string if the TF is not found or the file cannot be processed.
---
Tool Name: `list_all_tfs_in_correlation_data`
Description: Reads the `TF-TFcorTRMTEX.xlsx` file from `www/TFcorintextrm/` and returns a list of all unique TF names from the primary TF identifier column (typically "TF Name" or the first column). Filters out empty strings and 'nan'.
Input: None
Output: `tf_list` (list of strings) - A list of TF names. Returns an empty list if the file cannot be processed.
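A sketch of the described de-duplication over the first column (hypothetical helper; first-seen ordering is an assumption):

```python
def list_unique_tfs(table):
    """Collect unique names from the first column, skipping empty
    strings and 'nan' placeholders, preserving first-seen order."""
    seen, out = set(), []
    for row in table[1:]:  # skip the header row
        name = str(row[0]).strip()
        if name and name.lower() != "nan" and name not in seen:
            seen.add(name)
            out.append(name)
    return out

tbl = [["TF Name", "r"],
       ["BATF", 0.8], ["nan", 0.1], ["BATF", 0.7], ["", 0.2], ["TBX21", 0.9]]
print(list_unique_tfs(tbl))  # ['BATF', 'TBX21']
```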
---
Tool Name: `get_tf_community_sheet_data`
Description: Reads one of the TF community Excel files (`trmcommunities.xlsx` or `texcommunities.xlsx`) located in `www/tfcommunities/`.
Input: `community_type` (string) - Either "trm" or "texterm".
Output: `community_data` (list of lists) - Data from the specified community sheet (raw format, first list is headers). Returns an empty list if the type is invalid or the file cannot be found or processed.
---
Tool Name: `get_static_image_path`
Description: Returns the predefined relative web path (e.g., "www/images/logo.png") for a known static image asset. These paths are typically relative to the project root.
Input: `image_identifier` (string) - A unique key representing the image (e.g., "home_page_diagram", "ucsd_logo", "naive_bubble_plot", "wave1_main_img", "wave1_gokegg_img", "wave1_ranked_text1_img", "tfcat_overview_img", "network_correlation_desc_img").
Output: `image_path` (string) - The relative path (e.g., "www/homedesc.png"). Returns an empty string if the identifier is unknown. This tool relies on an internal mapping (`_STATIC_IMAGE_WEB_PATHS` in `tools.agent_tools`).
---
Tool Name: `get_ui_descriptive_text`
Description: Retrieves predefined descriptive text, methodology explanations, or captions by identifier, primarily from `tools/ui_texts.json`.
Input: `text_identifier` (string) - A unique key representing the text block (e.g., "tf_score_calculation_info", "cell_state_specificity_info", "wave_analysis_overview_text", "wave_1_analysis_placeholder_details").
Output: `descriptive_text` (string) - The requested text block. Returns an empty string if the identifier is unknown.
---
Tool Name: `list_available_tf_catalog_datasets`
Description: Returns a list of valid `dataset_identifier` strings that can be used with the `get_processed_tf_data` tool.
Input: None
Output: `dataset_identifiers` (list of strings) - E.g., ["Overall_TF_PageRank", "Naive", "TE", "MP", "TCM", "TEM", "TRM", "TEXprog", "TEXeff", "TEXterm"].
---
Tool Name: `list_available_cell_state_bubble_plots`
Description: Returns a list of identifiers for available cell-state specific bubble plot images. These identifiers can be used with `get_static_image_path`.
Input: None
Output: `image_identifiers` (list of strings) - E.g., ["naive_bubble_plot", "te_bubble_plot", ...]. Derived from an internal mapping in `tools.agent_tools`.
---
Tool Name: `list_available_wave_analysis_assets`
Description: Returns a structured dictionary of available asset identifiers for a specific TF wave (main image, GO/KEGG image, ranked text images). Identifiers can be used with `get_static_image_path`.
Input: `wave_number` (integer, 1-7) - The wave number.
Output: `asset_info` (dictionary) - E.g., `{"main_image_id": "waveX_main_img", "gokegg_image_id": "waveX_gokegg_img", "ranked_text_image_ids": ["waveX_ranked_text1_img", ...]}`. Returns an empty dictionary if the wave number is invalid. Derived from an internal mapping in `tools.agent_tools`.
---
Tool Name: `get_internal_navigation_info`
Description: Provides information about where an internal UI link (like those on the homepage image map or wave overview images) is intended to navigate within the application structure.
Input: `link_id` (string) - The identifier of the link (e.g., "to_tfcat", "to_tfwave", "to_tfnet", "c1_link", "c2_link", etc.).
Output: `navigation_target_description` (string) - A human-readable description of the target (e.g., "Navigates to the 'TF Catalog' section.", "Navigates to the 'Wave 1 Analysis' tab."). Derived from an internal mapping in `tools.agent_tools`.
---
Tool Name: `get_biorxiv_paper_url`
Description: Returns the URL for the main bioRxiv paper referenced in the application.
Input: None
Output: `url` (string) - The bioRxiv paper URL.
---
Tool Name: `list_all_files_in_www_directory`
Description: Scans the entire `www/` directory (and its subdirectories, excluding common hidden/system files) and returns a list of all files found. For each file, it provides its relative path from the project root (e.g., "www/images/logo.png"), its detected MIME type (e.g., "image/png", "text/csv", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"), and its size in bytes. This tool helps in understanding all available static assets and data files within the web-accessible `www` directory.
Input: None
Output: `file_manifest` (list of dictionaries) - Each dictionary represents a file and contains the keys: `path` (string), `type` (string), `size` (integer). Example item: `{"path": "www/data/report.txt", "type": "text/plain", "size": 1024}`. Returns an empty list if the `www` directory is not found or is empty.
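The scan can be sketched with the standard library alone (a minimal, hypothetical version; the hidden-file rules and the MIME fallback are assumptions, not the tool's actual implementation):

```python
import mimetypes
import os

def build_file_manifest(root="www"):
    """Walk `root`, returning {path, type, size} dicts for every
    non-hidden file found, with forward-slash relative paths."""
    manifest = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune hidden directories in place so os.walk skips them.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in sorted(filenames):
            if name.startswith("."):
                continue  # skip hidden/system files such as .DS_Store
            path = os.path.join(dirpath, name)
            mime, _ = mimetypes.guess_type(path)
            manifest.append({
                "path": path.replace(os.sep, "/"),
                "type": mime or "application/octet-stream",
                "size": os.path.getsize(path),
            })
    return manifest
```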
---
### `multi_source_literature_search(queries: list[str], max_results_per_query_per_source: int = 1, max_total_unique_papers: int = 10) -> list[dict]`
Searches for academic literature across multiple sources (Semantic Scholar, PubMed, ArXiv) using a list of provided search queries. It then de-duplicates the results, primarily by DOI and secondarily by a combination of title and first author when a DOI is not available. The search process stops early if the `max_total_unique_papers` limit is reached.
**Args:**
* `queries (list[str])`: A list of search query strings. The GenerationAgent should brainstorm 3-5 diverse queries relevant to the user's request.
* `max_results_per_query_per_source (int)`: The maximum number of results to fetch from EACH academic source (Semantic Scholar, PubMed, ArXiv) for EACH query string. Defaults to `1`.
* `max_total_unique_papers (int)`: The maximum total number of unique de-duplicated papers to return across all queries and sources. Defaults to `10`. The tool will stop fetching more data once this limit is met.
**Returns:**
* `list[dict]`: A consolidated and de-duplicated list of paper details, containing up to `max_total_unique_papers`. Each dictionary in the list represents a paper and has the following keys:
  * `"title" (str)`: The title of the paper. "N/A" if not available.
  * `"authors" (list[str])`: A list of author names. ["N/A"] if not available.
  * `"year" (str | int)`: The publication year. "N/A" if not available.
  * `"abstract" (str)`: A snippet of the abstract (typically up to 500 characters followed by "..."). "N/A" if not available.
  * `"doi" (str | None)`: The Digital Object Identifier. `None` if not available.
  * `"url" (str)`: A direct URL to the paper (e.g., PubMed link, ArXiv link, Semantic Scholar link). "N/A" if not available.
  * `"venue" (str)`: The publication venue (e.g., journal name, "ArXiv"). "N/A" if not available.
  * `"source_api" (str)`: The API from which this record was retrieved (e.g., "Semantic Scholar", "PubMed", "ArXiv").
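The de-duplication and early-stop behaviour described above can be sketched as follows (a simplified, hypothetical illustration; the actual tool may key and order records differently):

```python
def dedup_papers(papers, max_total):
    """Sketch of the de-duplication rule: a paper's DOI (when present)
    and its (title, first author) pair are both recorded, and any later
    record sharing either key is dropped. Stops at `max_total` papers."""
    seen, unique = set(), []
    for p in papers:
        keys = []
        if p.get("doi"):
            keys.append(("doi", p["doi"].lower()))
        first_author = (p.get("authors") or ["N/A"])[0]
        keys.append(("ta", str(p.get("title", "")).lower(), str(first_author).lower()))
        if any(k in seen for k in keys):
            continue  # duplicate of an earlier record
        seen.update(keys)
        unique.append(p)
        if len(unique) >= max_total:
            break  # stop early once the cap is reached
    return unique

papers = [
    {"title": "A", "authors": ["Smith"], "doi": "10.1/x"},
    {"title": "a", "authors": ["smith"], "doi": None},      # same title+author
    {"title": "A", "authors": ["Smith"], "doi": "10.1/X"},  # same DOI, case differs
    {"title": "B", "authors": ["Jones"], "doi": None},
]
print(len(dedup_papers(papers, 10)))  # 2
```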
**GenerationAgent Usage Example (for `python_code` field when `status` is `AWAITING_DATA`):**
```python
# Example: User asks for up to 3 papers
print(json.dumps({'intermediate_data_for_llm': tools.multi_source_literature_search(queries=["T-cell exhaustion markers AND cancer", "immunotherapy for melanoma AND biomarkers"], max_results_per_query_per_source=1, max_total_unique_papers=3)}))
# Example: Defaulting to 10 total unique papers
print(json.dumps({'intermediate_data_for_llm': tools.multi_source_literature_search(queries=["COVID-19 long-term effects"], max_results_per_query_per_source=2)}))
```
**Important Considerations for GenerationAgent:**
* When results are returned from this tool, the `GenerationAgent`'s `explanation` (for `CODE_COMPLETE` status) should present a summary of the *found papers* (e.g., titles, authors, URLs). It should clearly state that these are potential literature leads and should *not* yet claim to have read or summarized the full content of these papers in that same turn, unless a subsequent tool call for summarization is planned and executed.
---
### `fetch_text_from_urls(paper_info_list: list[dict], max_chars_per_paper: int = 15000) -> list[dict]`
Attempts to fetch and extract textual content from the URLs of papers provided in a list. This tool is typically used after `multi_source_literature_search` to gather content for summarization by the GenerationAgent.
**Args:**
* `paper_info_list (list[dict])`: A list of paper dictionaries, as returned by `multi_source_literature_search`. Each dictionary is expected to have at least a `"url"` key. Other keys like `"title"` and `"source_api"` are used for logging.
* `max_chars_per_paper (int)`: The maximum number of characters of text to retrieve and store for each paper. Defaults to `15000`. Text longer than this will be truncated.
**Returns:**
* `list[dict]`: The input `paper_info_list`, where each paper dictionary is augmented with a new key `"retrieved_text_content"`.
  * If successful, `"retrieved_text_content" (str)` will contain the extracted text (up to `max_chars_per_paper`).
  * If fetching or parsing fails for a paper, `"retrieved_text_content" (str)` will contain an error message (e.g., "Error: Invalid or missing URL.", "Error fetching URL: ...", "Error: No text could be extracted.").
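The augmentation step can be sketched as below. The injected `fetch` callable is purely illustrative so the sketch stays offline (the real tool performs its own HTTP fetching and HTML-to-text extraction); the error-message strings mirror those listed above:

```python
def fetch_text_for_papers(paper_info_list, fetch, max_chars_per_paper=15000):
    """Augment each paper dict with a 'retrieved_text_content' key,
    storing either extracted text (truncated) or an error message."""
    for paper in paper_info_list:
        url = paper.get("url")
        if not url or url == "N/A":
            paper["retrieved_text_content"] = "Error: Invalid or missing URL."
            continue
        try:
            text = fetch(url)
        except Exception as exc:  # network or parsing failure
            paper["retrieved_text_content"] = f"Error fetching URL: {exc}"
            continue
        if not text or not text.strip():
            paper["retrieved_text_content"] = "Error: No text could be extracted."
        else:
            paper["retrieved_text_content"] = text[:max_chars_per_paper]
    return paper_info_list

papers = [{"title": "A", "url": "https://example.org/paper"},
          {"title": "B", "url": "N/A"}]
out = fetch_text_for_papers(papers, fetch=lambda url: "Lorem ipsum " * 50,
                            max_chars_per_paper=20)
print(out[0]["retrieved_text_content"])  # 'Lorem ipsum Lorem ip'
print(out[1]["retrieved_text_content"])  # 'Error: Invalid or missing URL.'
```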
**GenerationAgent Usage Example (for `python_code` field when `status` is `AWAITING_DATA`):**
This tool is usually the second step in a literature review process.
```python
# Assume 'list_of_papers_from_search' is a variable holding the output from a previous
# call to tools.multi_source_literature_search(...)
print(json.dumps({'intermediate_data_for_llm': tools.fetch_text_from_urls(paper_info_list=list_of_papers_from_search, max_chars_per_paper=10000)}))
```
**Important Considerations for GenerationAgent:**
* After this tool returns the `paper_info_list` (now with `"retrieved_text_content"`), the `GenerationAgent` is responsible for using its own LLM capabilities to read the `"retrieved_text_content"` for each paper and generate summaries if requested by the user or if it is part of its plan.
* The `GenerationAgent` should be prepared for `"retrieved_text_content"` to contain error messages and handle them gracefully in its summarization logic (e.g., by stating that text for a particular paper could not be retrieved).
* Web scraping is inherently unreliable; success in fetching and parsing text can vary greatly between websites. The agent should not assume text will always be available.
---