---
title: Crowdsourced Evaluation
emoji: 🌍
colorFrom: yellow
colorTo: indigo
sdk: gradio
sdk_version: 5.31.0
app_file: app.py
header: mini
pinned: false
---

# TxAgent Crowdsourcing Evaluation Portal

This Gradio application provides a user-friendly interface for human evaluation of TxAgent and other biomedical language models. Evaluators compare and rate model responses to clinical questions, and their evaluations are stored for analysis.


## Current Challenges and Future Enhancements

While this evaluation portal offers a robust framework, there are a few areas for improvement:

  1. Scrolling Behavior: Despite efforts to implement scroll_to_output, pages may not consistently scroll to the top when transitioning. This can impact user experience, especially on longer pages.
  2. Tool Configuration Updates: The JSON files for tool configurations, while loaded from the tool_lists directory, are not currently updated in real-time from the ToolUniverse repository. This means any new or updated tools in ToolUniverse require a manual refresh of these local files to be reflected in the evaluation.
  3. Specialty-Specific Question Assignment: Currently, if a user's selected specialty has no assigned questions, the system doesn't automatically default to a random question. This could be adjusted by modifying the get_evaluator_questions function to ensure evaluators always have questions available.
  4. Flexible Evaluation Tracks: The portal currently supports a single evaluation track, comparing TxAgent against other models. It lacks the ability to simultaneously manage a separate evaluation track, such as comparing TxAgent-Qwen against other models. This would require modifications to the question retrieval and assignment logic.
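The fix suggested in item 3 could be sketched as follows. This is a hypothetical implementation, not the app's actual code: the function name get_evaluator_questions comes from this README, but the data shapes and field names are assumptions.

```python
import random

# Hypothetical sketch: field names ("specialty", "id") are assumptions.
def get_evaluator_questions(questions, specialty, answered_ids):
    """Return unanswered questions for a specialty, falling back to a
    random unanswered question when the specialty has none assigned."""
    pool = [q for q in questions
            if q["specialty"] == specialty and q["id"] not in answered_ids]
    if pool:
        return pool
    # Fallback proposed in item 3: draw from any unanswered question
    # so evaluators always have something to rate.
    remaining = [q for q in questions if q["id"] not in answered_ids]
    return [random.choice(remaining)] if remaining else []
```

With this fallback, an evaluator whose specialty has no assigned questions still receives a question instead of an empty queue.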

## Application Components

The Gradio application is structured into several interconnected pages, each serving a specific purpose in the evaluation workflow:

  • page_minus1 (Initial Landing Page): This is the very first page users encounter. It provides an overview of the TxAgent project and offers two main calls to action: "Submit Questions for TxAgent Evaluation" and "Participate in TxAgent Evaluation." The "Submit Questions" button redirects users to an external Google Form, while "Participate in Evaluation" transitions to page0.

  • page0 (Welcome and User Information): On this page, evaluators are welcomed to the study and provided with important instructions. Users are required to input their name, email, medical specialty (and subspecialty if applicable), and years of experience. This information is crucial for assigning relevant questions and tracking evaluation progress. A "Next" button moves the user to page1 (via a confirmation modal), and a "Home" button returns them to page_minus1.

  • eval_progress_modal (Question Progress Confirmation): This is a small pop-up modal that appears after a user submits their information on page0. It informs the evaluator about the number of questions they have remaining (or if they've completed all questions for their profile) and prompts them to proceed to the next question.

  • page1 (Pairwise Comparison): This is the first main evaluation page. It displays a clinical question (prompt) and the responses from two models (Model A and Model B) side by side in scrollable chat windows. Users perform a pairwise comparison for five criteria: Problem Resolution, Helpfulness, Scientific Consensus, Accuracy, and Completeness. For each criterion, they select which model performed better, or indicate a tie or that neither did well, and they can optionally provide free-text reasons for their choice. A "This question does not make sense or is not biomedically-relevant" button lets evaluators flag problematic questions. A "Next" button leads to page2, and "Back" returns to page0.

  • page2 (Individual Model Rating): This page is for detailed individual ratings of Model A and Model B based on the same five criteria. The prompt and model responses are again displayed. Based on the pairwise comparison choices made on page1, the choices for the individual ratings on page2 are constrained to ensure consistency. For example, if Model A was chosen as "better" for Problem Resolution on page1, Model A's score for Problem Resolution on page2 cannot be lower than Model B's. A "Submit" button initiates the data submission process, and "Back" returns to page1.

  • final_page (Completion Message): This page is displayed once an evaluator has completed all available questions for their profile. It provides a thank you message and indicates that no more questions are available for evaluation.

  • error_modal (Validation Error Display): A pop-up modal used to display any validation errors that occur during the evaluation process (e.g., if pairwise comparison and individual ratings are inconsistent).

  • confirm_modal (Submission Confirmation): A pop-up modal that asks for final confirmation before submitting the evaluation data, ensuring the user is aware that responses cannot be edited after submission.
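The page transitions among these components all follow one pattern: exactly one page is visible at a time. A minimal sketch of that pattern, modeling visibility as a plain dictionary (the real app returns Gradio visibility updates for each page container instead):

```python
# Hypothetical sketch of the one-visible-page pattern; the page names come
# from this README, but the real app toggles Gradio component visibility.
PAGES = ["page_minus1", "page0", "page1", "page2", "final_page"]

def show_page(target):
    """Return a visibility mapping with only `target` visible."""
    if target not in PAGES:
        raise ValueError(f"unknown page: {target}")
    return {page: (page == target) for page in PAGES}
```

Navigation handlers such as go_to_page0_from_minus1 or go_to_page1 then reduce to calling this helper with the destination page.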


## How Components Interact

  1. User Onboarding (page_minus1 -> page0 -> eval_progress_modal):

    • The user starts at page_minus1. Clicking "Participate in Evaluation" triggers go_to_page0_from_minus1, hiding page_minus1 and showing page0.
    • On page0, the user inputs their details. Clicking "Next" calls go_to_eval_progress_modal, which validates the input, fetches all relevant questions for the user's specialty, and displays the eval_progress_modal with the number of remaining questions. It also populates the initial content for page1 (chat_a, chat_b, page1_prompt, page1_reference_answer) and stores the user_info_state and data_subset_state.
    • Clicking "OK, proceed to question evaluation" in the modal triggers go_to_page1, hiding the modal and page0, and showing page1.
  2. Evaluation Flow (page1 -> page2):

    • On page1, users perform pairwise comparisons. Their selections and reasons are captured by the pairwise_inputs and comparison_reasons_inputs. The nonsense_btn updates the nonsense_btn_clicked state.
    • Clicking "Next: Rate Responses" on page1 calls go_to_page2. This function stores the pairwise choices in pairwise_state and comparison_reasons, updates page2_prompt, page2_reference_answer, and the chatbot content for chat_a_rating and chat_b_rating. It also populates pairwise_results_for_display on page2 to remind users of their previous choices, which then restricts the choices for individual ratings on page2.
    • On page2, users input individual ratings for each model. The restrict_choices function dynamically adjusts the available options for each gr.Radio component in ratings_A and ratings_B based on the corresponding pairwise choice in pairwise_state. This ensures the ratings are consistent with the user's initial comparison.
  3. Submission and Next Question Logic (page2 -> confirm_modal -> Data Storage / Next Question / final_page):

    • Clicking "Submit" on page2 first triggers validate_ratings. This function checks for consistency between the pairwise choices (pairwise_state) and the individual ratings (ratings_A, ratings_B).
    • The process_result function then determines the next step based on the validation result:
      • If there are validation errors, the error_modal is displayed.
      • If validation passes, the confirm_modal appears, asking for final confirmation.
    • Clicking "Yes, please submit" in the confirm_modal calls final_submit. This function:
      • Constructs a row dictionary from all collected data (user_info_state, data_subset_state, pairwise_state, comparison_reasons, nonsense_btn_clicked, and the individual ratings).
      • Appends this data to a Google Sheet (using append_to_sheet).
      • Crucially, it then re-fetches and filters the list of available questions for the user based on their email and specialty, by calling get_evaluator_questions again.
      • Based on the number of remaining_count questions:
        • If remaining_count is 0, the final_page is shown.
        • If remaining_count is greater than 0, the eval_progress_modal is displayed again to inform the user about the remaining questions, and the application's internal states are reset and repopulated with a new question for the next evaluation round.
    • Clicking "Cancel" in the confirm_modal simply hides it, allowing the user to make changes.
  4. Navigation and State Management:

    • Back Buttons: All "Back" buttons (back_btn_0, back_btn_2) simply toggle visibility to return to the previous page.
    • Home Buttons: "Home Page" buttons (home_btn_0, home_btn_1, home_btn_2) return the user to page_minus1. These buttons preserve the current question's progress internally if it has been populated, but they do not submit it.
    • State Variables: Gradio's gr.State() components (user_info_state, pairwise_state, scores_A_state, comparison_reasons, nonsense_btn_clicked, unqualified_A_state, data_subset_state) are essential for preserving data across page transitions and ensuring that information collected on one page is available for processing on subsequent pages or during submission.
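The restrict_choices behavior described above can be sketched as follows. The function name comes from this README, but the 1-5 score scale and the choice labels are assumptions:

```python
# Hypothetical sketch; the real function returns updated gr.Radio choices.
ALL_SCORES = [1, 2, 3, 4, 5]

def restrict_choices(pairwise_choice, other_model_score):
    """Limit the scores a model may receive so individual ratings stay
    consistent with the earlier pairwise comparison."""
    if other_model_score is None:
        return ALL_SCORES  # nothing to constrain against yet
    if pairwise_choice == "better":
        # This model was judged better, so it cannot score below the other.
        return [s for s in ALL_SCORES if s >= other_model_score]
    if pairwise_choice == "worse":
        return [s for s in ALL_SCORES if s <= other_model_score]
    # Tie: both models must receive the same score.
    return [other_model_score]
```

In the app, this logic runs per criterion against the choice stored in pairwise_state, narrowing the options shown in each rating radio on page2.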

This structured approach allows for a multi-step evaluation process, guiding the user through comparisons and detailed ratings, while ensuring data integrity and efficient handling of evaluation rounds.
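As an illustration of the data-integrity check described above, a validate_ratings-style consistency pass might look like this sketch. The criterion names come from this README; the signature and data shapes are assumptions:

```python
# Hypothetical sketch of the consistency check between pairwise choices
# and individual ratings; labels "A"/"B" and dict shapes are assumptions.
CRITERIA = ["Problem Resolution", "Helpfulness", "Scientific Consensus",
            "Accuracy", "Completeness"]

def validate_ratings(pairwise, ratings_a, ratings_b):
    """Return a list of error messages where individual ratings
    contradict the pairwise comparison."""
    errors = []
    for criterion in CRITERIA:
        choice = pairwise.get(criterion)
        a, b = ratings_a.get(criterion), ratings_b.get(criterion)
        if a is None or b is None:
            errors.append(f"{criterion}: both models must be rated")
        elif choice == "A" and a < b:
            errors.append(f"{criterion}: Model A was judged better but rated lower")
        elif choice == "B" and b < a:
            errors.append(f"{criterion}: Model B was judged better but rated lower")
    return errors
```

An empty return value lets the submission proceed to the confirmation modal; any messages are surfaced in the error_modal instead.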