# Agent Tools Documentation

This document outlines the granular tools that can be created or extracted from the TaijiChat R Shiny application. These tools are intended for an agent system to access data, calculations, methodologies, tables, and graphs from the application.

---

Tool Name: `get_raw_excel_data`
Description: Reads a specified Excel file and returns its raw content as a list of lists, where each inner list represents a row. This tool is generic; the `file_path` should be an absolute path or a path relative to the project root (e.g., "www/some_data.xlsx"). For predefined datasets within the application structure, other more specific tools should be preferred if available.
Input: `file_path` (string) - The path to the Excel file.
Output: `data` (list of lists of strings/numbers) - The raw data from the Excel sheet. Returns an empty list if the file is not found or cannot be read.

---

Tool Name: `get_processed_tf_data`
Description: Reads and processes a TF-related Excel file identified by its `dataset_identifier` (e.g., "Naive", "Overall_TF_PageRank"). It uses an internal mapping (`get_tf_catalog_dataset_path`) to find the actual file path within the 'www/tablePagerank/' directory. The standard processing includes: reading the Excel file, transposing it, using the original first row as new column headers, and then removing this header row from the data.
Input: `dataset_identifier` (string) - The identifier for the dataset. Valid identifiers include: "Overall_TF_PageRank", "Naive", "TE", "MP", "TCM", "TEM", "TRM", "TEXprog", "TEXeff", "TEXterm".
Output: `data` (list of lists of strings/numbers) - The processed data, where the first inner list contains the headers, and subsequent lists are data rows. Returns an empty list if processing fails or identifier is invalid.

---

Tool Name: `filter_data_by_column_keywords`
Description: Filters a dataset (list of lists, where the first list is headers) based on keywords matching its column names.
This is for data that has already been processed (e.g., by `get_processed_tf_data`) where TFs or genes are column headers. The keyword search is case-insensitive and supports multiple comma-separated keywords. If no keywords are provided, the original dataset is returned.
Input: `dataset` (list of lists) - The data to filter, with the first list being headers. `keywords` (string) - Comma-separated keywords to search for in column headers.
Output: `filtered_dataset` (list of lists) - The subset of the data containing only the matching columns (including the header row). Returns an empty list (with headers only) if no columns match.

---

Tool Name: `get_tf_wave_search_data`
Description: Reads the `searchtfwaves.xlsx` file from `www/waveanalysis/`, which contains TF names organized by "waves" (Wave1 to Wave7 as columns).
Input: `tf_search_term` (string, optional) - A specific TF name to search for. If empty or not provided, all TF wave data is returned. The search is case-insensitive.
Output: `wave_data` (dictionary) - If `tf_search_term` is provided and matches, returns a structure like `{"WaveX": ["TF1", "TF2"], "WaveY": ["TF1"]}` showing which waves the TF belongs to. If no `tf_search_term`, returns the full data as `{"Wave1": ["All TFs in Wave1"], "Wave2": ["All TFs in Wave2"], ...}`. If no matches are found for a search term, an empty dictionary is returned.

---

Tool Name: `get_tf_correlation_data`
Description: Reads the `TF-TFcorTRMTEX.xlsx` file from `www/TFcorintextrm/`. If a `tf_name` is provided, it filters the data for that TF (case-insensitive match on the primary TF identifier column, typically "TF Name" or the first column).
Input: `tf_name` (string, optional) - The specific TF name to search for. If empty or not provided, returns the full dataset.
Output: `correlation_data` (list of lists) - The filtered (or full) data from the correlation table. The first list is headers.
Returns an empty list (with headers only) if `tf_name` is provided but not found or if the file cannot be processed.

---

Tool Name: `get_tf_correlation_image_path`
Description: Reads the `TF-TFcorTRMTEX.xlsx` file from `www/TFcorintextrm/`, finds the row for the given `tf_name` (case-insensitive match on the primary TF identifier column), and returns the path stored in the "TF Merged Graph Path" column. The returned path is relative to the project's `www` directory (e.g., "www/networkanalysis/images/BATF_graph.png").
Input: `tf_name` (string) - The specific TF name.
Output: `image_path` (string) - The relative web path to the image or an empty string if not found or if the file cannot be processed.

---

Tool Name: `list_all_tfs_in_correlation_data`
Description: Reads the `TF-TFcorTRMTEX.xlsx` file from `www/TFcorintextrm/` and returns a list of all unique TF names from the primary TF identifier column (typically "TF Name" or the first column). Filters out empty strings and 'nan'.
Input: None
Output: `tf_list` (list of strings) - A list of TF names. Returns an empty list if the file cannot be processed.

---

Tool Name: `get_tf_community_sheet_data`
Description: Reads one of the TF community Excel files (`trmcommunities.xlsx` or `texcommunities.xlsx`) located in `www/tfcommunities/`.
Input: `community_type` (string) - Either "trm" or "texterm".
Output: `community_data` (list of lists) - Data from the specified community sheet (raw format, first list is headers). Returns an empty list if the type is invalid or file not found/processed.

---

Tool Name: `get_static_image_path`
Description: Returns the predefined relative web path (e.g., "www/images/logo.png") for a known static image asset. These paths are typically relative to the project root.
Input: `image_identifier` (string) - A unique key representing the image (e.g., "home_page_diagram", "ucsd_logo", "naive_bubble_plot", "wave1_main_img", "wave1_gokegg_img", "wave1_ranked_text1_img", "tfcat_overview_img", "network_correlation_desc_img").
Output: `image_path` (string) - The relative path (e.g., "www/homedesc.png"). Returns an empty string if identifier is unknown. This tool relies on an internal mapping (`_STATIC_IMAGE_WEB_PATHS` in `tools.agent_tools`).

---

Tool Name: `get_ui_descriptive_text`
Description: Retrieves predefined descriptive text, methodology explanations, or captions by its identifier, primarily from `tools/ui_texts.json`.
Input: `text_identifier` (string) - A unique key representing the text block (e.g., "tf_score_calculation_info", "cell_state_specificity_info", "wave_analysis_overview_text", "wave_1_analysis_placeholder_details").
Output: `descriptive_text` (string) - The requested text block. Returns an empty string if identifier is unknown.

---

Tool Name: `list_available_tf_catalog_datasets`
Description: Returns a list of valid `dataset_identifier` strings that can be used with the `get_processed_tf_data` tool.
Input: None
Output: `dataset_identifiers` (list of strings) - E.g., ["Overall_TF_PageRank", "Naive", "TE", "MP", "TCM", "TEM", "TRM", "TEXprog", "TEXeff", "TEXterm"].

---

Tool Name: `list_available_cell_state_bubble_plots`
Description: Returns a list of identifiers for available cell-state specific bubble plot images. These identifiers can be used with `get_static_image_path`.
Input: None
Output: `image_identifiers` (list of strings) - E.g., ["naive_bubble_plot", "te_bubble_plot", ...]. Derived from internal mapping in `tools.agent_tools`.

---

Tool Name: `list_available_wave_analysis_assets`
Description: Returns a structured dictionary of available asset identifiers for a specific TF wave (main image, GO/KEGG image, ranked text images). Identifiers can be used with `get_static_image_path`.
Input: `wave_number` (integer, 1-7) - The wave number.
Output: `asset_info` (dictionary) - E.g., `{"main_image_id": "waveX_main_img", "gokegg_image_id": "waveX_gokegg_img", "ranked_text_image_ids": ["waveX_ranked_text1_img", ...]}`. Returns empty if wave number is invalid. Derived from internal mapping in `tools.agent_tools`.

---

Tool Name: `get_internal_navigation_info`
Description: Provides information about where an internal UI link (like those on the homepage image map or wave overview images) is intended to navigate within the application structure.
Input: `link_id` (string) - The identifier of the link (e.g., "to_tfcat", "to_tfwave", "to_tfnet", "c1_link", "c2_link", etc.).
Output: `navigation_target_description` (string) - A human-readable description of the target (e.g., "Navigates to the 'TF Catalog' section.", "Navigates to the 'Wave 1 Analysis' tab."). Derived from internal mapping in `tools.agent_tools`.

---

Tool Name: `get_biorxiv_paper_url`
Description: Returns the URL for the main bioRxiv paper referenced in the application.
Input: None
Output: `url` (string) - The bioRxiv paper URL.

---

Tool Name: `list_all_files_in_www_directory`
Description: Scans the entire `www/` directory (and its subdirectories, excluding common hidden/system files) and returns a list of all files found. For each file, it provides its relative path from the project root (e.g., "www/images/logo.png"), its detected MIME type (e.g., "image/png", "text/csv", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"), and its size in bytes. This tool helps in understanding all available static assets and data files within the web-accessible `www` directory.
Input: None
Output: `file_manifest` (list of dictionaries) - Each dictionary represents a file and contains the keys: `path` (string), `type` (string), `size` (integer). Example item: `{"path": "www/data/report.txt", "type": "text/plain", "size": 1024}`.
Returns an empty list if the `www` directory isn't found or is empty.

---

### `multi_source_literature_search(queries: list[str], max_results_per_query_per_source: int = 1, max_total_unique_papers: int = 10) -> list[dict]`

Searches for academic literature across multiple sources (Semantic Scholar, PubMed, ArXiv) using a list of provided search queries. It then de-duplicates the results based primarily on DOI, and secondarily on a combination of title and first author if DOI is not available. The search process stops early if the `max_total_unique_papers` limit is reached.

**Args:**

* `queries (list[str])`: A list of search query strings. The GenerationAgent should brainstorm 3-5 diverse queries relevant to the user's request.
* `max_results_per_query_per_source (int)`: The maximum number of results to fetch from EACH academic source (Semantic Scholar, PubMed, ArXiv) for EACH query string. Defaults to `1`.
* `max_total_unique_papers (int)`: The maximum total number of unique de-duplicated papers to return across all queries and sources. Defaults to `10`. The tool will stop fetching more data once this limit is met.

**Returns:**

* `list[dict]`: A consolidated and de-duplicated list of paper details, containing up to `max_total_unique_papers`. Each dictionary in the list represents a paper and has the following keys:
    * `"title" (str)`: The title of the paper. "N/A" if not available.
    * `"authors" (list[str])`: A list of author names. ["N/A"] if not available.
    * `"year" (str | int)`: The publication year. "N/A" if not available.
    * `"abstract" (str)`: A snippet of the abstract (typically up to 500 characters followed by "..."). "N/A" if not available.
    * `"doi" (str | None)`: The Digital Object Identifier. `None` if not available.
    * `"url" (str)`: A direct URL to the paper (e.g., PubMed link, ArXiv link, Semantic Scholar link). "N/A" if not available.
    * `"venue" (str)`: The publication venue (e.g., journal name, "ArXiv"). "N/A" if not available.
    * `"source_api" (str)`: The API from which this record was retrieved (e.g., "Semantic Scholar", "PubMed", "ArXiv").

**GenerationAgent Usage Example (for `python_code` field when `status` is `AWAITING_DATA`):**

```python
# Example: User asks for up to 3 papers
print(json.dumps({'intermediate_data_for_llm': tools.multi_source_literature_search(queries=["T-cell exhaustion markers AND cancer", "immunotherapy for melanoma AND biomarkers"], max_results_per_query_per_source=1, max_total_unique_papers=3)}))

# Example: Defaulting to 10 total unique papers
print(json.dumps({'intermediate_data_for_llm': tools.multi_source_literature_search(queries=["COVID-19 long-term effects"], max_results_per_query_per_source=2)}))
```

**Important Considerations for GenerationAgent:**

* When results are returned from this tool, the `GenerationAgent`'s `explanation` (for `CODE_COMPLETE` status) should present a summary of the *found papers* (e.g., titles, authors, URLs). It should clearly state that these are potential literature leads and should *not* yet claim to have read or summarized the full content of these papers in that same turn, unless a subsequent tool call for summarization is planned and executed.

---

### `fetch_text_from_urls(paper_info_list: list[dict], max_chars_per_paper: int = 15000) -> list[dict]`

Attempts to fetch and extract textual content from the URLs of papers provided in a list. This tool is typically used after `multi_source_literature_search` to gather content for summarization by the GenerationAgent.

**Args:**

* `paper_info_list (list[dict])`: A list of paper dictionaries, as returned by `multi_source_literature_search`. Each dictionary is expected to have at least a `"url"` key. Other keys like `"title"` and `"source_api"` are used for logging.
* `max_chars_per_paper (int)`: The maximum number of characters of text to retrieve and store for each paper. Defaults to `15000`. Text longer than this will be truncated.

**Returns:**

* `list[dict]`: The input `paper_info_list`, where each paper dictionary is augmented with a new key `"retrieved_text_content"`.
    * If successful, `"retrieved_text_content" (str)` will contain the extracted text (up to `max_chars_per_paper`).
    * If fetching or parsing fails for a paper, `"retrieved_text_content" (str)` will contain an error message (e.g., "Error: Invalid or missing URL.", "Error fetching URL: ...", "Error: No text could be extracted.").

**GenerationAgent Usage Example (for `python_code` field when `status` is `AWAITING_DATA`):**

This tool is usually the second step in a literature review process.

```python
# Assume 'list_of_papers_from_search' is a variable holding the output from a previous
# call to tools.multi_source_literature_search(...)
print(json.dumps({'intermediate_data_for_llm': tools.fetch_text_from_urls(paper_info_list=list_of_papers_from_search, max_chars_per_paper=10000)}))
```

**Important Considerations for GenerationAgent:**

* After this tool returns the `paper_info_list` (now with `"retrieved_text_content"`), the `GenerationAgent` is responsible for using its own LLM capabilities to read the `"retrieved_text_content"` for each paper and generate summaries if requested by the user or if it's part of its plan.
* The `GenerationAgent` should be prepared for `"retrieved_text_content"` to contain error messages and handle them gracefully in its summarization logic (e.g., by stating that text for a particular paper could not be retrieved).
* Web scraping is inherently unreliable; success in fetching and parsing text can vary greatly between websites. The agent should not assume text will always be available.

---
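The error-handling consideration for `fetch_text_from_urls` can be illustrated with a small sketch that separates papers with usable text from those whose `"retrieved_text_content"` holds an error message. The helper name `split_by_retrieval_status` and the sample data are hypothetical. The check assumes that every failure message starts with "Error", as the example messages in this document do; the tool itself may not guarantee that convention.

```python
def split_by_retrieval_status(papers):
    """Partition fetch_text_from_urls-style output into usable and failed papers.

    Assumption: a failure is signalled by text that starts with "Error",
    matching the documented example messages. A missing or empty
    "retrieved_text_content" value is treated as a failure as well.
    """
    usable, failed = [], []
    for paper in papers:
        text = paper.get("retrieved_text_content") or ""
        if not text or text.startswith("Error"):
            failed.append(paper)
        else:
            usable.append(paper)
    return usable, failed


# Hypothetical sample data shaped like the documented tool output.
sample_papers = [
    {"title": "Paper A", "url": "https://example.org/a",
     "retrieved_text_content": "Introduction. T cell exhaustion is ..."},
    {"title": "Paper B", "url": "N/A",
     "retrieved_text_content": "Error: Invalid or missing URL."},
]

usable, failed = split_by_retrieval_status(sample_papers)

# Report failures explicitly instead of summarizing an error string.
for paper in failed:
    print(f"Text for '{paper['title']}' could not be retrieved: "
          f"{paper['retrieved_text_content']}")
```

A prefix check is deliberately simple and would misclassify a genuine paper whose extracted text happens to begin with the word "Error"; a stricter rule would need a dedicated status field, which this document does not describe.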