陈俊杰
committed on
Commit
·
c3f31ee
1
Parent(s):
9a263a1
fontSize
Browse files
app.py
CHANGED
|
@@ -126,6 +126,7 @@ st.markdown("""
|
|
| 126 |
.main-text {
|
| 127 |
font-size: 18px;
|
| 128 |
line-height: 1.6;
|
|
|
|
| 129 |
}
|
| 130 |
</style>
|
| 131 |
""", unsafe_allow_html=True)
|
|
@@ -142,8 +143,8 @@ elif page == "Methodology":
|
|
| 142 |
st.image("asserts/method.svg", use_column_width=True)
|
| 143 |
st.markdown("""
|
| 144 |
<ol class='main-text'>
|
| 145 |
-
<li>First, we choose four subtasks as shown in the table below:</li>
|
| 146 |
-
<table>
|
| 147 |
<thead>
|
| 148 |
<tr>
|
| 149 |
<th style="text-align: left">Task</th>
|
|
@@ -174,9 +175,9 @@ elif page == "Methodology":
|
|
| 174 |
</tr>
|
| 175 |
</tbody>
|
| 176 |
</table>
|
| 177 |
-
<li>Second, we choose a series of popular LLMs during the competition to generate answers.</li>
|
| 178 |
-
<li>Third, we manually annotate the answer sets for each question, which will be used as gold standards for evaluating the performance of different evaluation methods.</li>
|
| 179 |
-
<li>Last, we will collect evaluation results from participants and calculate consistency with manually annotated results. We will use Accuracy, Kendall’s tau and Spearman correlation coefficient as the evaluation metrics.</li>
|
| 180 |
</ol>
|
| 181 |
""",unsafe_allow_html=True)
|
| 182 |
|
|
@@ -196,39 +197,31 @@ elif page == "Datasets":
|
|
| 196 |
elif page == "Important Dates":
|
| 197 |
st.header("Important Dates")
|
| 198 |
st.markdown("""
|
| 199 |
-
<p class='main-text'><em>All deadlines are at 11:59pm in the Anywhere on Earth (AOE) timezone.</em><br />
|
| 200 |
-
<span class=
|
| 201 |
-
<span class=
|
| 202 |
-
<span class=
|
| 203 |
-
<span class=
|
| 204 |
-
<span class=
|
| 205 |
-
<span class=
|
| 206 |
-
<span class=
|
| 207 |
-
<span class=
|
| 208 |
""",unsafe_allow_html=True)
|
| 209 |
elif page == "Evaluation Measures":
|
| 210 |
st.header("Evaluation Measures")
|
| 211 |
st.markdown("""
|
| 212 |
-
<
|
| 213 |
-
|
| 214 |
-
|
| 215 |
-
|
| 216 |
-
$$
|
| 217 |
-
\\tau=\\frac{C-D}{\\frac{1}{2}n(n-1)}
|
| 218 |
-
$$
|
| 219 |
-
|
| 220 |
-
where:
|
| 221 |
-
- C is the number of concordant pairs,
|
| 222 |
-
- D is the number of discordant pairs,
|
| 223 |
-
- n is the number of pairs.
|
| 224 |
-
- **Spearman's Rank Correlation Coefficient:** Measures the strength and direction of the association between two ranked variables.
|
| 225 |
-
$$
|
| 226 |
-
\\rho = 1 - \\frac{6 \sum d_i^2}{n(n^2 - 1)}
|
| 227 |
-
$$
|
| 228 |
where:
|
| 229 |
-
|
| 230 |
-
|
| 231 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 232 |
""",unsafe_allow_html=True)
|
| 233 |
elif page == "Data and File format":
|
| 234 |
st.header("Data and File format")
|
|
@@ -254,10 +247,12 @@ elif page == "LeaderBoard":
|
|
| 254 |
st.markdown("""
|
| 255 |
<div class='main-text'>
|
| 256 |
This leaderboard is used to show the performance of the **automatic evaluation methods of LLMs** submitted by the **AEOLLM team** on four tasks:
|
| 257 |
-
-
|
| 258 |
-
|
| 259 |
-
|
| 260 |
-
|
|
|
|
|
|
|
| 261 |
</div>
|
| 262 |
""", unsafe_allow_html=True)
|
| 263 |
# 创建示例数据
|
|
@@ -309,19 +304,19 @@ This leaderboard is used to show the performance of the **automatic evaluation m
|
|
| 309 |
tab1, tab2, tab3, tab4 = st.tabs(["DG", "TE", "SG", "NFQA"])
|
| 310 |
|
| 311 |
with tab1:
|
| 312 |
-
st.markdown("""Task: Dialogue Generation; Dataset: DailyDialog""", unsafe_allow_html=True)
|
| 313 |
st.dataframe(df1, use_container_width=True)
|
| 314 |
|
| 315 |
with tab2:
|
| 316 |
-
st.markdown("""Task: Text Expansion; Dataset: WritingPrompts""", unsafe_allow_html=True)
|
| 317 |
st.dataframe(df2, use_container_width=True)
|
| 318 |
|
| 319 |
with tab3:
|
| 320 |
-
st.markdown("""Task: Summary Generation; Dataset: Xsum""", unsafe_allow_html=True)
|
| 321 |
st.dataframe(df3, use_container_width=True)
|
| 322 |
|
| 323 |
with tab4:
|
| 324 |
-
st.markdown("""Task: Non-Factoid QA; Dataset: NF_CATS""", unsafe_allow_html=True)
|
| 325 |
st.dataframe(df4, use_container_width=True)
|
| 326 |
elif page == "Organisers":
|
| 327 |
st.header("Organisers")
|
|
|
|
| 126 |
.main-text {
|
| 127 |
font-size: 18px;
|
| 128 |
line-height: 1.6;
|
| 129 |
+
color: #4CAF50;
|
| 130 |
}
|
| 131 |
</style>
|
| 132 |
""", unsafe_allow_html=True)
|
|
|
|
| 143 |
st.image("asserts/method.svg", use_column_width=True)
|
| 144 |
st.markdown("""
|
| 145 |
<ol class='main-text'>
|
| 146 |
+
<li class='main-text'>First, we choose four subtasks as shown in the table below:</li>
|
| 147 |
+
<table class='main-text'>
|
| 148 |
<thead>
|
| 149 |
<tr>
|
| 150 |
<th style="text-align: left">Task</th>
|
|
|
|
| 175 |
</tr>
|
| 176 |
</tbody>
|
| 177 |
</table>
|
| 178 |
+
<li class='main-text'>Second, we choose a series of popular LLMs during the competition to generate answers.</li>
|
| 179 |
+
<li class='main-text'>Third, we manually annotate the answer sets for each question, which will be used as gold standards for evaluating the performance of different evaluation methods.</li>
|
| 180 |
+
<li class='main-text'>Last, we will collect evaluation results from participants and calculate consistency with manually annotated results. We will use Accuracy, Kendall’s tau and Spearman correlation coefficient as the evaluation metrics.</li>
|
| 181 |
</ol>
|
| 182 |
""",unsafe_allow_html=True)
|
| 183 |
|
|
|
|
| 197 |
elif page == "Important Dates":
|
| 198 |
st.header("Important Dates")
|
| 199 |
st.markdown("""
|
| 200 |
+
<p class='main-text'><em class='main-text'>All deadlines are at 11:59pm in the Anywhere on Earth (AOE) timezone.</em><br />
|
| 201 |
+
<span class='main-text'><strong>Kickoff Event</strong>:</span> <span class='main-text'>March 29, 2024</span><br />
|
| 202 |
+
<span class='main-text'><strong>Dataset Release</strong>:</span> <span class='main-text'>👉May 1, 2024</span><br />
|
| 203 |
+
<span class='main-text'><strong>System Output Submission Deadline</strong>:</span> <span class='main-text'>Jan 15, 2025</span><br />
|
| 204 |
+
<span class='main-text'><strong>Evaluation Results Release</strong>:</span> <span class='main-text'>Feb 1, 2025</span> <br />
|
| 205 |
+
<span class='main-text'><strong>Task overview release (draft)</strong>:</span> <span class='main-text'>Feb 1, 2025</span><br />
|
| 206 |
+
<span class='main-text'><strong>Submission Due of Participant Papers (draft)</strong>:</span> <span class='main-text'>March 1, 2025</span><br />
|
| 207 |
+
<span class='main-text'><strong>Camera-Ready Participant Paper Due</strong>:</span> <span class='main-text'>May 1, 2025</span><br />
|
| 208 |
+
<span class='main-text'><strong>NTCIR-18 Conference</strong>:</span> <span class='main-text'>Jun 10-13 2025</span><br /></p>
|
| 209 |
""",unsafe_allow_html=True)
|
| 210 |
elif page == "Evaluation Measures":
|
| 211 |
st.header("Evaluation Measures")
|
| 212 |
st.markdown("""
|
| 213 |
+
<ul class='main-text'>
|
| 214 |
+
<li><strong>Acc(Accuracy): </strong>The proportion of identical preference results between the model and human annotations. Specifically, we first convert individual scores (ranks) into pairwise preferences and then calculate consistency with human annotations.</li>
|
| 215 |
+
<li><strong>Kendall's tau: </strong>Measures the ordinal association between two ranked variables. $$\\tau = \\frac{C-D}{\\frac{1}{2}n(n-1)}$$
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 216 |
where:
|
| 217 |
+
C is the number of concordant pairs,
|
| 218 |
+
D is the number of discordant pairs,
|
| 219 |
+
n is the number of pairs.</li>
|
| 220 |
+
<li><strong>Spearman's Rank Correlation Coefficient: </strong>Measures the strength and direction of the association between two ranked variables. $$\\rho = 1 - \\frac{6 \\sum d_i^2}{n(n^2 - 1)}$$
|
| 221 |
+
where:
|
| 222 |
+
\(d_i\) is the difference between the ranks of corresponding elements in the two lists,
|
| 223 |
+
n is the number of elements.</li>
|
| 224 |
+
</ul>
|
| 225 |
""",unsafe_allow_html=True)
|
| 226 |
elif page == "Data and File format":
|
| 227 |
st.header("Data and File format")
|
|
|
|
| 247 |
st.markdown("""
|
| 248 |
<div class='main-text'>
|
| 249 |
This leaderboard is used to show the performance of the **automatic evaluation methods of LLMs** submitted by the **AEOLLM team** on four tasks:
|
| 250 |
+
<ul class='main-text'>
|
| 251 |
+
<li>Dialogue Generation (DG)</li>
|
| 252 |
+
<li>Text Expansion (TE)</li>
|
| 253 |
+
<li>Summary Generation (SG)</li>
|
| 254 |
+
<li>Non-Factoid QA (NFQA)</li>
|
| 255 |
+
</ul>
|
| 256 |
</div>
|
| 257 |
""", unsafe_allow_html=True)
|
| 258 |
# 创建示例数据
|
|
|
|
| 304 |
tab1, tab2, tab3, tab4 = st.tabs(["DG", "TE", "SG", "NFQA"])
|
| 305 |
|
| 306 |
with tab1:
|
| 307 |
+
st.markdown("""<div class='main-text'>Task: Dialogue Generation; Dataset: DailyDialog</div>""", unsafe_allow_html=True)
|
| 308 |
st.dataframe(df1, use_container_width=True)
|
| 309 |
|
| 310 |
with tab2:
|
| 311 |
+
st.markdown("""<div class='main-text'>Task: Text Expansion; Dataset: WritingPrompts</div>""", unsafe_allow_html=True)
|
| 312 |
st.dataframe(df2, use_container_width=True)
|
| 313 |
|
| 314 |
with tab3:
|
| 315 |
+
st.markdown("""<div class='main-text'>Task: Summary Generation; Dataset: Xsum</div>""", unsafe_allow_html=True)
|
| 316 |
st.dataframe(df3, use_container_width=True)
|
| 317 |
|
| 318 |
with tab4:
|
| 319 |
+
st.markdown("""<div class='main-text'>Task: Non-Factoid QA; Dataset: NF_CATS</div>""", unsafe_allow_html=True)
|
| 320 |
st.dataframe(df4, use_container_width=True)
|
| 321 |
elif page == "Organisers":
|
| 322 |
st.header("Organisers")
|