Polish_Cultural_Vision_Benchmark


djstrong committed on Jun 4, 2025

Commit fd35185 (1 parent: 74253ba)

Refactor app.py to use JSON for benchmark data, removing CSV and metadata dependencies. Update performance plotting to reflect new data structure and enhance visualization with cultural context. Introduce benchmark report JSON file for structured model evaluation results.

Files changed (7)

app.py +55 -81
benchmark_report.json +142 -0
benchmark_results.csv +0 -189
metadata.json +0 -355
plot_results.py +100 -98
script.py +0 -322
src/about.py +27 -9

app.py CHANGED

@@ -19,92 +19,54 @@ with demo:
     gr.HTML(TITLE)
     gr.Markdown(INTRODUCTION_TEXT, elem_classes="markdown-text")
-    NUMBER_OF_QUESTIONS = 171.0
-    # load dataframe from csv
-    # leaderboard_df = pd.read_csv("benchmark_results.csv")
-    leaderboard_df = []
-    with open("benchmark_results.csv", "r") as f:
-        header = f.readline().strip().split(",")
-        header = [h.strip() for h in header]
-        for i, line in enumerate(f):
-            leaderboard_df.append(line.strip().split(",", 13))
-    metadata = json.load(open('metadata.json'))
-    for k, v in list(metadata.items()):
-        metadata[k.split(",")[0]] = v
-    # create dataframe from list and header
-    leaderboard_df = pd.DataFrame(leaderboard_df, columns=header)
-    # filter column with value eq-bench_v2_pl
-    print(header)
-    leaderboard_df = leaderboard_df[(leaderboard_df["Benchmark Version"] == "eq-bench_v2_pl") | (
-            leaderboard_df["Benchmark Version"] == 'eq-bench_pl')]
-    # fix: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
-    # leave only defined columns
-    leaderboard_df = leaderboard_df[["Model Path", "Benchmark Score", "Num Questions Parseable", "Error"]]
-    # create new column with model name
-    def parse_parseable(x):
-        if x["Num Questions Parseable"] == 'FAILED':
-            m = re.match(r'(\d+)\.0 questions were parseable', x["Error"])
-            return m.group(1)
-        return x["Num Questions Parseable"]
-    leaderboard_df["Num Questions Parseable"] = leaderboard_df[["Num Questions Parseable", "Error"]].apply(
-        lambda x: parse_parseable(x), axis=1)
-    def fraction_to_percentage(numerator: float, denominator: float) -> float:
-        return (numerator / denominator) * 100
-    leaderboard_df["Num Questions Parseable"] = leaderboard_df["Num Questions Parseable"].apply(lambda x: fraction_to_percentage(float(x), NUMBER_OF_QUESTIONS))
-    def get_params(model_name):
-        if model_name in metadata:
-            return metadata[model_name]
-        else:
-            print(model_name)
-        return numpy.nan
-    leaderboard_df["Params"] = leaderboard_df["Model Path"].apply(lambda x: get_params(x))
-    # move column order
-    leaderboard_df = leaderboard_df[["Model Path", "Params", "Benchmark Score", "Num Questions Parseable", 'Error']]
-    # change value of column to nan
-    leaderboard_df["Benchmark Score"] = leaderboard_df["Benchmark Score"].replace('FAILED', numpy.nan)
-    #scale Benchmark Score by Num Questions Parseable*100
-    leaderboard_df["Benchmark Score"] = leaderboard_df["Benchmark Score"].astype(float) * ((leaderboard_df["Num Questions Parseable"].astype(float) / 100))
-    # set datatype of column
-    leaderboard_df["Benchmark Score"] = leaderboard_df["Benchmark Score"].astype(float)
-    leaderboard_df["Num Questions Parseable"] = leaderboard_df["Num Questions Parseable"].astype(float)
-    # set nan if value of column is less than 0
-    leaderboard_df.loc[leaderboard_df["Benchmark Score"] < 0, "Benchmark Score"] = 0
-    # sort by 2 columns
-    leaderboard_df = leaderboard_df.sort_values(by=["Benchmark Score", "Num Questions Parseable"],
-                                                ascending=[False, False])
     # Print model names and scores to console before HTML formatting
     print("\n===== MODEL RESULTS =====")
     for index, row in leaderboard_df.iterrows():
-        print(f"{row['Model Path']}: {row['Benchmark Score']:.2f}")
     print("========================\n")
     # Apply HTML formatting for display
     leaderboard_df["Model Path"] = leaderboard_df["Model Path"].apply(lambda x: make_clickable_model(x))
-    # rename columns
     leaderboard_df = leaderboard_df.rename(columns={"Model Path": "Model"})
-    leaderboard_df = leaderboard_df.rename(columns={"Num Questions Parseable": "Percentage Questions Parseable"})
     leaderboard_df.to_csv("output.csv")
     # Set midpoint for gradient coloring based on data ranges
@@ -118,17 +80,29 @@ with demo:
         vmax=150
     )
-    rounding = {}
-    # for col in ["Benchmark Score", "Num Questions Parseable"]:
-    rounding["Benchmark Score"] = "{:.2f}"
-    rounding["Percentage Questions Parseable"] = "{:.2f}"
-    rounding["Params"] = "{:.0f}"
     leaderboard_df_styled = leaderboard_df_styled.format(rounding)
     leaderboard_table = gr.components.Dataframe(
         value=leaderboard_df_styled,
-        datatype=['markdown', 'number', 'number', 'number', 'str'],
         elem_id="leaderboard-table",
         interactive=False,
         visible=True,

     gr.HTML(TITLE)
     gr.Markdown(INTRODUCTION_TEXT, elem_classes="markdown-text")
+    # Load dataframe from JSON
+    with open("benchmark_report.json", "r") as f:
+        json_data = json.load(f)
+    # Create dataframe from JSON data
+    leaderboard_df = pd.DataFrame(json_data)
+    # Rename columns for consistency
+    leaderboard_df = leaderboard_df.rename(columns={
+        "Model Name": "Model Path",
+        "Model Size": "Params"
+    })
+    # Calculate overall benchmark score as average of Avg (object) and Avg (country)
+    leaderboard_df["Avg"] = (leaderboard_df["Avg (object)"] + leaderboard_df["Avg (country)"]) / 2
+    # Select and reorder columns for display (removed Percentage Questions Parseable)
+    display_columns = [
+        "Model Path", "Params", "Avg",
+        "Avg (object)", "Avg (country)",
+        "History (object)", "History (country)",
+        "Geography (object)", "Geography (country)",
+        "Art & Entertainment (object)", "Art & Entertainment (country)",
+        "Culture & Tradition (object)", "Culture & Tradition (country)"
+    ]
+    leaderboard_df = leaderboard_df[display_columns]
+    # Convert Params column - replace "-" with NaN and convert numeric strings to float
+    leaderboard_df["Params"] = leaderboard_df["Params"].replace("-", numpy.nan)
+    # Convert numeric strings directly to float (no regex needed since values are already clean numbers)
+    leaderboard_df.loc[leaderboard_df["Params"].notna(), "Params"] = leaderboard_df.loc[leaderboard_df["Params"].notna(), "Params"].astype(float)
+    # Sort by benchmark score
+    leaderboard_df = leaderboard_df.sort_values(by=["Avg"], ascending=[False])
     # Print model names and scores to console before HTML formatting
     print("\n===== MODEL RESULTS =====")
+    print("Avg is calculated as: (Avg (object) + Avg (country)) / 2")
     for index, row in leaderboard_df.iterrows():
+        print(f"{row['Model Path']}: {row['Avg']:.2f}")
     print("========================\n")
     # Apply HTML formatting for display
     leaderboard_df["Model Path"] = leaderboard_df["Model Path"].apply(lambda x: make_clickable_model(x))
+    # Rename column for display
     leaderboard_df = leaderboard_df.rename(columns={"Model Path": "Model"})
     leaderboard_df.to_csv("output.csv")
     # Set midpoint for gradient coloring based on data ranges
         vmax=150
     )
+    # Set up number formatting (removed Percentage Questions Parseable)
+    rounding = {
+        "Avg": "{:.2f}",
+        "Params": "{:.0f}",
+        "Avg (object)": "{:.2f}",
+        "Avg (country)": "{:.2f}",
+        "History (object)": "{:.2f}",
+        "History (country)": "{:.2f}",
+        "Geography (object)": "{:.2f}",
+        "Geography (country)": "{:.2f}",
+        "Art & Entertainment (object)": "{:.2f}",
+        "Art & Entertainment (country)": "{:.2f}",
+        "Culture & Tradition (object)": "{:.2f}",
+        "Culture & Tradition (country)": "{:.2f}"
+    }
     leaderboard_df_styled = leaderboard_df_styled.format(rounding)
+    # Create dataframe component with appropriate datatypes
+    datatypes = ['markdown', 'number'] + ['number'] * (len(display_columns) - 1)
     leaderboard_table = gr.components.Dataframe(
         value=leaderboard_df_styled,
+        datatype=datatypes,
         elem_id="leaderboard-table",
         interactive=False,
         visible=True,
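
The new loading path can be tried outside the Gradio app. The following is a minimal sketch, not the committed code: `build_leaderboard` is a hypothetical helper, the two records are abridged from benchmark_report.json, and `pd.to_numeric(..., errors="coerce")` is swapped in for the `replace("-", numpy.nan)` / `astype(float)` pair in the diff. It also builds the datatype list with exactly one entry per column; the expression in the diff, `['markdown', 'number'] + ['number'] * (len(display_columns) - 1)`, evaluates to 14 entries for the 13 display columns.

```python
import pandas as pd

# Two records abridged from benchmark_report.json (category columns omitted).
records = [
    {"Model Name": "Qwen 2.5 VL 72B", "Model Size": "72",
     "Avg (object)": 23.91, "Avg (country)": 51.51},
    {"Model Name": "Anthropic Claude 3.7 Sonnet", "Model Size": "-",
     "Avg (object)": 37.06, "Avg (country)": 62.46},
]


def build_leaderboard(records):
    """Hypothetical helper mirroring the JSON loading path in the diff."""
    df = pd.DataFrame(records).rename(
        columns={"Model Name": "Model Path", "Model Size": "Params"}
    )
    # Assumption: to_numeric stands in for the replace/astype pair in the diff.
    # "-" becomes NaN and the column ends up with a float dtype.
    df["Params"] = pd.to_numeric(df["Params"], errors="coerce")
    # Overall score: mean of the object-level and country-level averages.
    df["Avg"] = (df["Avg (object)"] + df["Avg (country)"]) / 2
    columns = ["Model Path", "Params", "Avg", "Avg (object)", "Avg (country)"]
    return df[columns].sort_values(by="Avg", ascending=False).reset_index(drop=True)


leaderboard = build_leaderboard(records)

# One datatype per column: markdown for the model link, number for the rest.
datatypes = ["markdown"] + ["number"] * (len(leaderboard.columns) - 1)

print(leaderboard)
print(datatypes)
```

Whether the Gradio component tolerates a datatype list longer than the column count is not checked here; keeping the two lengths equal avoids having to rely on that.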

benchmark_report.json ADDED

	@@ -0,0 +1,142 @@

+[
+  {
+    "Model Name":"Anthropic Claude 3.7 Sonnet",
+    "Model Size":"-",
+    "Avg (object)":37.06,
+    "Avg (country)":62.46,
+    "History (object)":52.5,
+    "History (country)":80.0,
+    "Geography (object)":58.33,
+    "Geography (country)":83.33,
+    "Art & Entertainment (object)":22.41,
+    "Art & Entertainment (country)":44.83,
+    "Culture & Tradition (object)":15.0,
+    "Culture & Tradition (country)":41.67
+  },
+  {
+    "Model Name":"OpenAI GPT-4o",
+    "Model Size":"-",
+    "Avg (object)":28.94,
+    "Avg (country)":42.49,
+    "History (object)":30.0,
+    "History (country)":37.5,
+    "Geography (object)":45.0,
+    "Geography (country)":55.0,
+    "Art & Entertainment (object)":22.41,
+    "Art & Entertainment (country)":24.14,
+    "Culture & Tradition (object)":18.33,
+    "Culture & Tradition (country)":53.33
+  },
+  {
+    "Model Name":"Qwen 2.5 VL 72B",
+    "Model Size":"72",
+    "Avg (object)":23.91,
+    "Avg (country)":51.51,
+    "History (object)":35.0,
+    "History (country)":70.0,
+    "Geography (object)":31.67,
+    "Geography (country)":71.67,
+    "Art & Entertainment (object)":18.97,
+    "Art & Entertainment (country)":31.03,
+    "Culture & Tradition (object)":10.0,
+    "Culture & Tradition (country)":33.33
+  },
+  {
+    "Model Name":"Qwen 2.5 VL 32B",
+    "Model Size":"32",
+    "Avg (object)":22.27,
+    "Avg (country)":48.8,
+    "History (object)":30.0,
+    "History (country)":67.5,
+    "Geography (object)":28.33,
+    "Geography (country)":66.67,
+    "Art & Entertainment (object)":22.41,
+    "Art & Entertainment (country)":31.03,
+    "Culture & Tradition (object)":8.33,
+    "Culture & Tradition (country)":30.0
+  },
+  {
+    "Model Name":"Qwen 2.5 VL 7B",
+    "Model Size":"7",
+    "Avg (object)":21.62,
+    "Avg (country)":44.72,
+    "History (object)":32.5,
+    "History (country)":65.0,
+    "Geography (object)":28.33,
+    "Geography (country)":66.67,
+    "Art & Entertainment (object)":18.97,
+    "Art & Entertainment (country)":15.52,
+    "Culture & Tradition (object)":6.67,
+    "Culture & Tradition (country)":31.67
+  },
+  {
+    "Model Name":"Google Gemma 3 27B",
+    "Model Size":"27",
+    "Avg (object)":19.14,
+    "Avg (country)":43.76,
+    "History (object)":12.5,
+    "History (country)":52.5,
+    "Geography (object)":28.33,
+    "Geography (country)":48.33,
+    "Art & Entertainment (object)":22.41,
+    "Art & Entertainment (country)":25.86,
+    "Culture & Tradition (object)":13.33,
+    "Culture & Tradition (country)":48.33
+  },
+  {
+    "Model Name":"Meta Llama 4 Maverick",
+    "Model Size":"402",
+    "Avg (object)":17.49,
+    "Avg (country)":42.98,
+    "History (object)":17.5,
+    "History (country)":52.5,
+    "Geography (object)":20.0,
+    "Geography (country)":50.0,
+    "Art & Entertainment (object)":24.14,
+    "Art & Entertainment (country)":32.76,
+    "Culture & Tradition (object)":8.33,
+    "Culture & Tradition (country)":36.67
+  },
+  {
+    "Model Name":"Mistral Medium 3",
+    "Model Size":"-",
+    "Avg (object)":17.45,
+    "Avg (country)":45.99,
+    "History (object)":12.5,
+    "History (country)":65.0,
+    "Geography (object)":31.67,
+    "Geography (country)":56.67,
+    "Art & Entertainment (object)":18.97,
+    "Art & Entertainment (country)":18.97,
+    "Culture & Tradition (object)":6.67,
+    "Culture & Tradition (country)":43.33
+  },
+  {
+    "Model Name":"Google Gemma 3 12B",
+    "Model Size":"12",
+    "Avg (object)":13.06,
+    "Avg (country)":40.04,
+    "History (object)":10.0,
+    "History (country)":42.5,
+    "Geography (object)":15.0,
+    "Geography (country)":46.67,
+    "Art & Entertainment (object)":17.24,
+    "Art & Entertainment (country)":29.31,
+    "Culture & Tradition (object)":10.0,
+    "Culture & Tradition (country)":41.67
+  },
+  {
+    "Model Name":"Google Gemma 3 4B",
+    "Model Size":"4",
+    "Avg (object)":9.72,
+    "Avg (country)":35.84,
+    "History (object)":5.0,
+    "History (country)":47.5,
+    "Geography (object)":8.33,
+    "Geography (country)":38.33,
+    "Art & Entertainment (object)":17.24,
+    "Art & Entertainment (country)":25.86,
+    "Culture & Tradition (object)":8.33,
+    "Culture & Tradition (country)":31.67
+  }
+]
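
The file lists models in descending order of "Avg (object)", while app.py sorts by the combined "Avg" column, so the displayed order can differ from the file order. A small standard-library sketch of that recomputation, using three records abridged from the file above (`overall_avg` is a hypothetical helper name, not part of the commit):

```python
import json

# Three records abridged from benchmark_report.json (category columns omitted).
REPORT = json.loads("""
[
  {"Model Name": "Anthropic Claude 3.7 Sonnet", "Avg (object)": 37.06, "Avg (country)": 62.46},
  {"Model Name": "OpenAI GPT-4o", "Avg (object)": 28.94, "Avg (country)": 42.49},
  {"Model Name": "Qwen 2.5 VL 72B", "Avg (object)": 23.91, "Avg (country)": 51.51}
]
""")


def overall_avg(record):
    """Mean of the object-level and country-level averages, as in app.py."""
    return (record["Avg (object)"] + record["Avg (country)"]) / 2


# Sort by the combined score, highest first.
ranked = sorted(REPORT, key=overall_avg, reverse=True)
ranking = [r["Model Name"] for r in ranked]

for r in ranked:
    print(f"{r['Model Name']}: {overall_avg(r):.2f}")
```

With these three records, Qwen 2.5 VL 72B moves ahead of OpenAI GPT-4o once the country-level average is included.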

benchmark_results.csv DELETED

@@ -1,189 +0,0 @@
-Run ID, Benchmark Completed, Prompt Format, Model Path, Lora Path, Quantization, Benchmark Score, Benchmark Version, Num Questions Parseable, Num Iterations, Inference Engine, Ooba Params, Download Filters, Error
-Bielik_v0.1,2024-06-18 12:48:51,,speakleash/Bielik-7B-Instruct-v0.1,,,47.1,eq-bench_v2,170.0,1,transformers, ,,
-Bielik_v0.1,2024-06-18 13:44:54,,speakleash/Bielik-7B-Instruct-v0.1,,,34.17,eq-bench_v2_pl,149.0,1,transformers, ,,
-Bielik_v0.1,2024-06-18 14:01:46,,speakleash/Bielik-7B-Instruct-v0.1,,,34.27,eq-bench_v2_pl,156.0,1,transformers, ,,
-openchat-gemma,2024-06-18 14:03:04,,openchat/openchat-3.5-0106-gemma,,,FAILED,eq-bench,FAILED,1,transformers, ,,System role not supported
-openchat-35-0106,2024-06-18 14:30:24,,openchat/openchat-3.5-0106,,,45.69,eq-bench_v2_pl,170.0,1,transformers, ,,
-openchat-35-0106,2024-06-18 15:15:03,,openchat/openchat-3.5-0106,,,45.69,eq-bench_v2_pl,170.0,1,transformers, ,,
-glm-4-9b-chat,2024-06-18 15:16:14,,THUDM/glm-4-9b-chat,,,FAILED,eq-bench,FAILED,1,transformers, ,,
-openchat-35-0106,2024-06-18 15:19:01,,openchat/openchat-3.5-0106,,,72.92,eq-bench_v2,171.0,1,transformers, ,,
-glm-4-9b-chat,2024-06-18 15:20:10,,THUDM/glm-4-9b-chat,,,FAILED,eq-bench,FAILED,1,transformers, ,,
-openchat-35-0106,2024-06-18 15:22:41,,openchat/openchat-3.5-0106,,,45.69,eq-bench_v2_pl,170.0,1,transformers, ,,
-glm-4-9b-chat,2024-06-18 15:23:50,,THUDM/glm-4-9b-chat,,,FAILED,eq-bench,FAILED,1,transformers, ,,
-glm-4-9b-chat,2024-06-18 15:26:30,,THUDM/glm-4-9b-chat,,,FAILED,eq-bench,FAILED,1,transformers, ,,
-glm-4-9b-chat,2024-06-18 16:30:21,,THUDM/glm-4-9b-chat,,,FAILED,eq-bench,FAILED,1,transformers, ,,
-glm-4-9b-chat-1m,2024-06-18 16:54:28,,THUDM/glm-4-9b-chat-1m,,,FAILED,eq-bench,FAILED,1,transformers, ,,
-glm-4-9b-chat-1m,2024-06-18 17:05:16,,THUDM/glm-4-9b-chat-1m,,,FAILED,eq-bench,FAILED,1,transformers, ,,
-openchat-3.6-8b-20240522,2024-06-18 17:12:00,,openchat/openchat-3.6-8b-20240522,,,-1.339640900815702e+23,eq-bench_v2,171.0,1,transformers, ,,
-openchat-gemma,2024-06-18 17:13:12,,openchat/openchat-3.5-0106-gemma,,,FAILED,eq-bench,FAILED,1,transformers, ,,System role not supported
-Meta-Llama-3-8B-Instruct,2024-06-18 21:29:03,,meta-llama/Meta-Llama-3-8B-Instruct,,,69.09,eq-bench_v2,171.0,1,transformers, ,,
-Starling-LM-7B-alpha,2024-06-18 21:45:18,,berkeley-nest/Starling-LM-7B-alpha,,,49.63,eq-bench_v2_pl,171.0,1,transformers, ,,
-Starling-LM-7B-beta,2024-06-18 21:51:54,,Nexusflow/Starling-LM-7B-beta,,,44.91,eq-bench_v2_pl,159.0,1,transformers, ,,
-Mistral-7B-Instruct-v0.2,2024-06-18 21:52:17,,mistralai/Mistral-7B-Instruct-v0.2,,,FAILED,eq-bench,FAILED,1,transformers, ,,Conversation roles must alternate user/assistant/user/assistant/...
-Mistral-7B-Instruct-v0.1,2024-06-18 22:26:07,,mistralai/Mistral-7B-Instruct-v0.1,,,FAILED,eq-bench,FAILED,1,transformers, ,,Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)
-Meta-Llama-3-8B-Instruct,2024-06-18 22:35:53,,meta-llama/Meta-Llama-3-8B-Instruct,,,46.53,eq-bench_v2_pl,171.0,1,transformers, ,,
-openchat-gemma,2024-06-19 09:30:28,,openchat/openchat-3.5-0106-gemma,,,FAILED,eq-bench,FAILED,1,transformers, ,,System role not supported
-Mistral-7B-Instruct-v0.2,2024-06-19 09:30:46,,mistralai/Mistral-7B-Instruct-v0.2,,,FAILED,eq-bench,FAILED,1,transformers, ,,Conversation roles must alternate user/assistant/user/assistant/...
-openchat-gemma,2024-06-19 09:35:50,,openchat/openchat-3.5-0106-gemma,,,FAILED,eq-bench,FAILED,1,transformers, ,,System role not supported
-Mistral-7B-Instruct-v0.2,2024-06-19 09:36:01,,mistralai/Mistral-7B-Instruct-v0.2,,,FAILED,eq-bench,FAILED,1,transformers, ,,Conversation roles must alternate user/assistant/user/assistant/...
-openchat-gemma,2024-06-19 09:43:53,,openchat/openchat-3.5-0106-gemma,,,60.11,eq-bench_v2_pl,169.0,1,transformers, ,,
-Mistral-7B-Instruct-v0.2,2024-06-19 09:49:42,,mistralai/Mistral-7B-Instruct-v0.2,,,52.99,eq-bench_v2_pl,148.0,1,transformers, ,,
-openchat-gemma,2024-06-19 09:54:01,,openchat/openchat-3.5-0106-gemma,,,60.11,eq-bench_v2_pl,169.0,1,transformers, ,,
-openchat-gemma,2024-06-19 10:16:52,,openchat/openchat-3.5-0106-gemma,,,59.93,eq-bench_v2_pl,170.0,1,transformers, ,,
-openchat-gemma,2024-06-19 10:19:44,,openchat/openchat-3.5-0106-gemma,,,59.93,eq-bench_v2_pl,170.0,1,transformers, ,,
-Nous-Hermes-2-SOLAR-10.7B,2024-06-19 10:27:36,,NousResearch/Nous-Hermes-2-SOLAR-10.7B,,,48.22,eq-bench_v2_pl,169.0,1,transformers, ,,
-SOLAR-10.7B-Instruct-v1.0,2024-06-19 10:43:47,,upstage/SOLAR-10.7B-Instruct-v1.0,,,57.57,eq-bench_v2_pl,164.0,1,transformers, ,,
-Qwen2-7B-Instruct,2024-06-19 10:46:52,,Qwen/Qwen2-7B-Instruct,,,53.08,eq-bench_v2_pl,171.0,1,transformers, ,,
-models/gwint2,2024-06-19 11:21:15,,speakleash/Bielik-11B-v2.0-Instruct,,,68.24,eq-bench_v2_pl,171.0,1,transformers, ,,
-Azurro/APT3-275M-Base,2024-06-19 11:36:43,,Azurro/APT3-275M-Base,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,0.0 questions were parseable (min is 83%)
-Qwen/Qwen2-0.5B,2024-06-19 11:47:44,,Qwen/Qwen2-0.5B,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,18.0 questions were parseable (min is 83%)
-Qwen/Qwen2-0.5B-Instruct,2024-06-19 11:51:21,,Qwen/Qwen2-0.5B-Instruct,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,125.0 questions were parseable (min is 83%)
-allegro/plt5-large,2024-06-19 11:51:22,,allegro/plt5-large,,,FAILED,eq-bench,FAILED,1,transformers, ,,Unrecognized configuration class <class 'transformers.models.t5.configuration_t5.T5Config'> for this kind of AutoModel: AutoModelForCausalLM. Model type should be one of BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BloomConfig, CamembertConfig, LlamaConfig, CodeGenConfig, CohereConfig, CpmAntConfig, CTRLConfig, Data2VecTextConfig, DbrxConfig, ElectraConfig, ErnieConfig, FalconConfig, FuyuConfig, GemmaConfig, GitConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, JambaConfig, JetMoeConfig, LlamaConfig, MambaConfig, MarianConfig, MBartConfig, MegaConfig, MegatronBertConfig, MistralConfig, MixtralConfig, MptConfig, MusicgenConfig, MusicgenMelodyConfig, MvpConfig, OlmoConfig, OpenLlamaConfig, OpenAIGPTConfig, OPTConfig, PegasusConfig, PersimmonConfig, PhiConfig, Phi3Config, PLBartConfig, ProphetNetConfig, QDQBertConfig, Qwen2Config, Qwen2MoeConfig, RecurrentGemmaConfig, ReformerConfig, RemBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, Speech2Text2Config, StableLmConfig, Starcoder2Config, TransfoXLConfig, TrOCRConfig, WhisperConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig.
-APT3-1B-Instruct-e1,2024-06-19 11:51:22,,APT3-1B-Instruct-e1,,,FAILED,eq-bench,FAILED,1,transformers, ,,APT3-1B-Instruct-e1 is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models' If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`
-APT3-1B-Instruct-e2,2024-06-19 11:51:23,,APT3-1B-Instruct-e2,,,FAILED,eq-bench,FAILED,1,transformers, ,,APT3-1B-Instruct-e2 is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models' If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`
-Azurro/APT3-1B-Base,2024-06-19 12:00:40,,Azurro/APT3-1B-Base,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,0.0 questions were parseable (min is 83%)
-OPI-PG/Qra-1b,2024-06-19 12:13:15,,OPI-PG/Qra-1b,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,0.0 questions were parseable (min is 83%)
-TinyLlama/TinyLlama-1.1B-Chat-v1.0,2024-06-19 12:23:45,,TinyLlama/TinyLlama-1.1B-Chat-v1.0,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,36.0 questions were parseable (min is 83%)
-Qwen/Qwen2-1.5B,2024-06-19 12:35:37,,Qwen/Qwen2-1.5B,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,54.0 questions were parseable (min is 83%)
-Qwen/Qwen2-1.5B-Instruct,2024-06-19 12:38:29,,Qwen/Qwen2-1.5B-Instruct,,,15.33,eq-bench_v2_pl,165.0,1,transformers, ,,
-sdadas/polish-gpt2-xl,2024-06-19 12:54:39,,sdadas/polish-gpt2-xl,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,0.0 questions were parseable (min is 83%)
-internlm/internlm2-1_8b,2024-06-19 13:08:50,,internlm/internlm2-1_8b,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,0.0 questions were parseable (min is 83%)
-internlm/internlm2-chat-1_8b,2024-06-19 13:13:21,,internlm/internlm2-chat-1_8b,,,13.83,eq-bench_v2_pl,150.0,1,transformers, ,,
-google/gemma-1.1-2b-it,2024-06-19 13:15:24,,google/gemma-1.1-2b-it,,,16.47,eq-bench_v2_pl,171.0,1,transformers, ,,
-microsoft/phi-2,2024-06-19 13:28:07,,microsoft/phi-2,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,0.0 questions were parseable (min is 83%)
-google/mt5-xl,2024-06-19 13:28:10,,google/mt5-xl,,,FAILED,eq-bench,FAILED,1,transformers, ,,Unrecognized configuration class <class 'transformers.models.mt5.configuration_mt5.MT5Config'> for this kind of AutoModel: AutoModelForCausalLM. Model type should be one of BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BloomConfig, CamembertConfig, LlamaConfig, CodeGenConfig, CohereConfig, CpmAntConfig, CTRLConfig, Data2VecTextConfig, DbrxConfig, ElectraConfig, ErnieConfig, FalconConfig, FuyuConfig, GemmaConfig, GitConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, JambaConfig, JetMoeConfig, LlamaConfig, MambaConfig, MarianConfig, MBartConfig, MegaConfig, MegatronBertConfig, MistralConfig, MixtralConfig, MptConfig, MusicgenConfig, MusicgenMelodyConfig, MvpConfig, OlmoConfig, OpenLlamaConfig, OpenAIGPTConfig, OPTConfig, PegasusConfig, PersimmonConfig, PhiConfig, Phi3Config, PLBartConfig, ProphetNetConfig, QDQBertConfig, Qwen2Config, Qwen2MoeConfig, RecurrentGemmaConfig, ReformerConfig, RemBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, Speech2Text2Config, StableLmConfig, Starcoder2Config, TransfoXLConfig, TrOCRConfig, WhisperConfig, XGLMConfig, XLMConfig, XLMProphetNetConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig, InternLM2Config, InternLM2Config.
-microsoft/Phi-3-mini-4k-instruct,2024-06-19 13:34:56,,microsoft/Phi-3-mini-4k-instruct,,,28.05,eq-bench_v2_pl,159.0,1,transformers, ,,
-ssmits/Falcon2-5.5B-Polish,2024-06-19 13:47:21,,ssmits/Falcon2-5.5B-Polish,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,0.0 questions were parseable (min is 83%)
-01-ai/Yi-1.5-6B,2024-06-19 14:04:20,,01-ai/Yi-1.5-6B,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,1.0 questions were parseable (min is 83%)
-01-ai/Yi-1.5-6B-Chat,2024-06-19 14:11:22,,01-ai/Yi-1.5-6B-Chat,,,5.19,eq-bench_v2_pl,161.0,1,transformers, ,,
-THUDM/chatglm3-6b,2024-06-19 14:12:11,,THUDM/chatglm3-6b,,,FAILED,eq-bench,FAILED,1,transformers, ,,too many values to unpack (expected 2)
-THUDM/chatglm3-6b-base,2024-06-19 14:13:00,,THUDM/chatglm3-6b-base,,,FAILED,eq-bench,FAILED,1,transformers, ,,too many values to unpack (expected 2)
-alpindale/Mistral-7B-v0.2-hf,2024-06-19 14:16:37,,alpindale/Mistral-7B-v0.2-hf,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,45.0 questions were parseable (min is 83%)
-berkeley-nest/Starling-LM-7B-alpha,2024-06-19 14:22:32,,berkeley-nest/Starling-LM-7B-alpha,,,46.26,eq-bench_v2_pl,171.0,1,transformers, ,,
-google/gemma-7b,2024-06-19 14:38:02,,google/gemma-7b,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,0.0 questions were parseable (min is 83%)
-google/gemma-7b-it,2024-06-19 14:53:28,,google/gemma-7b-it,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,0.0 questions were parseable (min is 83%)
-HuggingFaceH4/zephyr-7b-alpha,2024-06-19 15:05:31,,HuggingFaceH4/zephyr-7b-alpha,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,99.0 questions were parseable (min is 83%)
-HuggingFaceH4/zephyr-7b-beta,2024-06-19 15:18:24,,HuggingFaceH4/zephyr-7b-beta,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,88.0 questions were parseable (min is 83%)
-internlm/internlm2-7b,2024-06-19 15:36:06,,internlm/internlm2-7b,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,43.0 questions were parseable (min is 83%)
-internlm/internlm2-base-7b,2024-06-19 15:54:53,,internlm/internlm2-base-7b,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,6.0 questions were parseable (min is 83%)
-internlm/internlm2-chat-7b,2024-06-19 16:02:07,,internlm/internlm2-chat-7b,,,40.0,eq-bench_v2_pl,169.0,1,transformers, ,,
-internlm/internlm2-chat-7b-sft,2024-06-19 16:07:04,,internlm/internlm2-chat-7b-sft,,,41.62,eq-bench_v2_pl,170.0,1,transformers, ,,
-lex-hue/Delexa-7b,2024-06-19 16:12:19,,lex-hue/Delexa-7b,,,49.03,eq-bench_v2_pl,169.0,1,transformers, ,,
-meta-llama/Llama-2-7b-chat-hf,2024-06-19 16:21:08,,meta-llama/Llama-2-7b-chat-hf,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,116.0 questions were parseable (min is 83%)
-meta-llama/Llama-2-7b-hf,2024-06-19 16:36:41,,meta-llama/Llama-2-7b-hf,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,1.0 questions were parseable (min is 83%)
-microsoft/WizardLM-2-7B,2024-06-19 16:44:22,,microsoft/WizardLM-2-7B,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,137.0 questions were parseable (min is 83%)
-mistralai/Mistral-7B-Instruct-v0.1,2024-06-19 16:44:33,,mistralai/Mistral-7B-Instruct-v0.1,,,FAILED,eq-bench,FAILED,1,transformers, ,,Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)
-mistralai/Mistral-7B-Instruct-v0.2,2024-06-19 16:50:36,,mistralai/Mistral-7B-Instruct-v0.2,,,53.25,eq-bench_v2_pl,151.0,1,transformers, ,,
-mistralai/Mistral-7B-Instruct-v0.3,2024-06-19 16:54:49,,mistralai/Mistral-7B-Instruct-v0.3,,,45.21,eq-bench_v2_pl,171.0,1,transformers, ,,
-mistralai/Mistral-7B-v0.1,2024-06-19 16:59:50,,mistralai/Mistral-7B-v0.1,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,65.0 questions were parseable (min is 83%)
-mistralai/Mistral-7B-v0.3,2024-06-19 17:16:38,,mistralai/Mistral-7B-v0.3,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,14.0 questions were parseable (min is 83%)
-Nexusflow/Starling-LM-7B-beta,2024-06-19 17:23:18,,Nexusflow/Starling-LM-7B-beta,,,45.1,eq-bench_v2_pl,166.0,1,transformers, ,,
-openchat/openchat-3.5-0106,2024-06-19 17:27:10,,openchat/openchat-3.5-0106,,,43.81,eq-bench_v2_pl,171.0,1,transformers, ,,
-openchat/openchat-3.5-0106-gemma,2024-06-19 17:30:31,,openchat/openchat-3.5-0106-gemma,,,58.62,eq-bench_v2_pl,169.0,1,transformers, ,,
-openchat/openchat-3.5-1210,2024-06-19 17:34:27,,openchat/openchat-3.5-1210,,,49.04,eq-bench_v2_pl,171.0,1,transformers, ,,
-OPI-PG/Qra-7b,2024-06-19 17:50:28,,OPI-PG/Qra-7b,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,0.0 questions were parseable (min is 83%)
-Qwen/Qwen1.5-7B,2024-06-19 17:57:53,,Qwen/Qwen1.5-7B,,,23.11,eq-bench_v2_pl,155.0,1,transformers, ,,
-Qwen/Qwen1.5-7B-Chat,2024-06-19 18:03:34,,Qwen/Qwen1.5-7B-Chat,,,25.0,eq-bench_v2_pl,164.0,1,transformers, ,,
-Qwen/Qwen2-7B,2024-06-19 18:09:23,,Qwen/Qwen2-7B,,,36.58,eq-bench_v2_pl,166.0,1,transformers, ,,
-Qwen/Qwen2-7B-Instruct,2024-06-19 18:12:42,,Qwen/Qwen2-7B-Instruct,,,53.74,eq-bench_v2_pl,171.0,1,transformers, ,,
-Remek/Kruk-7B-SP-001,2024-06-19 18:17:13,,Remek/Kruk-7B-SP-001,,,44.44,eq-bench_v2_pl,171.0,1,transformers, ,,
-Remek/OpenChat-3.5-0106-PL-Omnibusv2,2024-06-19 18:17:24,,Remek/OpenChat-3.5-0106-PL-Omnibusv2,,,FAILED,eq-bench,FAILED,1,transformers, ,,'system_message' is undefined
-Remek/OpenChat3.5-0106-Spichlerz-Bocian,2024-06-19 18:24:08,,Remek/OpenChat3.5-0106-Spichlerz-Bocian,,,44.13,eq-bench_v2_pl,166.0,1,transformers, ,,
-Remek/OpenChat3.5-0106-Spichlerz-Inst-001,2024-06-19 18:28:48,,Remek/OpenChat3.5-0106-Spichlerz-Inst-001,,,41.6,eq-bench_v2_pl,171.0,1,transformers, ,,
-RWKV/HF_v5-Eagle-7B,2024-06-19 19:16:27,,RWKV/HF_v5-Eagle-7B,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,0.0 questions were parseable (min is 83%)
-RWKV/v5-Eagle-7B-HF,2024-06-19 20:04:12,,RWKV/v5-Eagle-7B-HF,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,0.0 questions were parseable (min is 83%)
-speakleash/Bielik-7B-v0.1,2024-06-19 20:11:16,,speakleash/Bielik-7B-v0.1,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,139.0 questions were parseable (min is 83%)
-szymonrucinski/Curie-7B-v1,2024-06-19 20:29:24,,szymonrucinski/Curie-7B-v1,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,1.0 questions were parseable (min is 83%)
-teknium/OpenHermes-2.5-Mistral-7B,2024-06-19 20:34:12,,teknium/OpenHermes-2.5-Mistral-7B,,,37.48,eq-bench_v2_pl,171.0,1,transformers, ,,
-Voicelab/trurl-2-7b,2024-06-19 20:39:26,,Voicelab/trurl-2-7b,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,141.0 questions were parseable (min is 83%)
-microsoft/Phi-3-small-8k-instruct,2024-06-19 20:39:31,,microsoft/Phi-3-small-8k-instruct,,,FAILED,eq-bench,FAILED,1,transformers, ,,No module named 'pytest'
-CohereForAI/aya-23-8B,2024-06-19 20:44:01,,CohereForAI/aya-23-8B,,,45.43,eq-bench_v2_pl,171.0,1,transformers, ,,
-meta-llama/Meta-Llama-3-8B,2024-06-19 21:01:55,,meta-llama/Meta-Llama-3-8B,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,0.0 questions were parseable (min is 83%)
-meta-llama/Meta-Llama-3-8B-Instruct,2024-06-19 21:06:08,,meta-llama/Meta-Llama-3-8B-Instruct,,,46.27,eq-bench_v2_pl,171.0,1,transformers, ,,
-mlabonne/NeuralDaredevil-8B-abliterated,2024-06-19 21:13:31,,mlabonne/NeuralDaredevil-8B-abliterated,,,54.74,eq-bench_v2_pl,171.0,1,transformers, ,,
-NousResearch/Hermes-2-Pro-Llama-3-8B,2024-06-19 21:18:18,,NousResearch/Hermes-2-Pro-Llama-3-8B,,,54.57,eq-bench_v2_pl,171.0,1,transformers, ,,
-NousResearch/Hermes-2-Theta-Llama-3-8B,2024-06-19 21:25:22,,NousResearch/Hermes-2-Theta-Llama-3-8B,,,54.88,eq-bench_v2_pl,171.0,1,transformers, ,,
-nvidia/Llama3-ChatQA-1.5-8B,2024-06-19 22:27:24,,nvidia/Llama3-ChatQA-1.5-8B,,,40.55,eq-bench_v2_pl,166.0,1,transformers, ,,
-openchat/openchat-3.6-8b-20240522,2024-06-19 22:34:56,,openchat/openchat-3.6-8b-20240522,,,-2.0090659464796595e+18,eq-bench_v2_pl,170.0,1,transformers, ,,
-Remek/Llama-3-8B-Omnibus-1-PL-v01-INSTRUCT,2024-06-19 22:39:46,,Remek/Llama-3-8B-Omnibus-1-PL-v01-INSTRUCT,,,26.63,eq-bench_v2_pl,171.0,1,transformers, ,,
-01-ai/Yi-1.5-9B,2024-06-19 23:07:56,,01-ai/Yi-1.5-9B,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,1.0 questions were parseable (min is 83%)
-01-ai/Yi-1.5-9B-Chat,2024-06-19 23:19:16,,01-ai/Yi-1.5-9B-Chat,,,48.78,eq-bench_v2_pl,163.0,1,transformers, ,,
-google/recurrentgemma-9b-it,2024-06-19 23:28:19,,google/recurrentgemma-9b-it,,,52.82,eq-bench_v2_pl,171.0,1,transformers, ,,
-THUDM/glm-4-9b,2024-06-19 23:28:41,,THUDM/glm-4-9b,,,FAILED,eq-bench,FAILED,1,transformers, ,,too many values to unpack (expected 2)
-THUDM/glm-4-9b-chat,2024-06-19 23:29:01,,THUDM/glm-4-9b-chat,,,FAILED,eq-bench,FAILED,1,transformers, ,,too many values to unpack (expected 2)
-NousResearch/Nous-Hermes-2-SOLAR-10.7B,2024-06-19 23:51:07,,NousResearch/Nous-Hermes-2-SOLAR-10.7B,,,49.85,eq-bench_v2_pl,169.0,1,transformers, ,,
-TeeZee/Bielik-SOLAR-LIKE-10.7B-Instruct-v0.1,2024-06-20 00:00:02,,TeeZee/Bielik-SOLAR-LIKE-10.7B-Instruct-v0.1,,,35.63,eq-bench_v2_pl,164.0,1,transformers, ,,
-upstage/SOLAR-10.7B-Instruct-v1.0,2024-06-20 00:19:48,,upstage/SOLAR-10.7B-Instruct-v1.0,,,57.35,eq-bench_v2_pl,162.0,1,transformers, ,,
-upstage/SOLAR-10.7B-v1.0,2024-06-20 01:12:51,,upstage/SOLAR-10.7B-v1.0,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,1.0 questions were parseable (min is 83%)
-tiiuae/falcon-11B,2024-06-20 01:23:54,,tiiuae/falcon-11B,,,42.41,eq-bench_v2_pl,171.0,1,transformers, ,,
-lmsys/vicuna-13b-v1.5,2024-06-20 01:43:40,,lmsys/vicuna-13b-v1.5,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,84.0 questions were parseable (min is 83%)
-OPI-PG/Qra-13b,2024-06-20 02:07:48,,OPI-PG/Qra-13b,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,0.0 questions were parseable (min is 83%)
-teknium/OpenHermes-13B,2024-06-20 02:32:04,,teknium/OpenHermes-13B,,,36.85,eq-bench_v2_pl,162.0,1,transformers, ,,
-Voicelab/trurl-2-13b-academic,2024-06-20 02:38:04,,Voicelab/trurl-2-13b-academic,,,25.92,eq-bench_v2_pl,162.0,1,transformers, ,,
-microsoft/Phi-3-medium-4k-instruct,2024-06-20 02:46:38,,microsoft/Phi-3-medium-4k-instruct,,,57.07,eq-bench_v2_pl,169.0,1,transformers, ,,
-Qwen/Qwen1.5-14B-Chat,2024-06-20 02:52:13,,Qwen/Qwen1.5-14B-Chat,,,51.26,eq-bench_v2_pl,160.0,1,transformers, ,,
-internlm/internlm2-20b,2024-06-20 09:04:33,,internlm/internlm2-20b,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,4.0 questions were parseable (min is 83%)
-internlm/internlm2-chat-20b,2024-06-20 09:47:11,,internlm/internlm2-chat-20b,,,36.52,eq-bench_v2_pl,170.0,1,transformers, ,,
-Qwen/Qwen1.5-32B,2024-06-20 13:25:12,,Qwen/Qwen1.5-32B,,,54.35,eq-bench_v2_pl,170.0,1,transformers, ,,
-Qwen/Qwen1.5-32B-Chat,2024-06-20 13:34:52,,Qwen/Qwen1.5-32B-Chat,,,60.69,eq-bench_v2_pl,168.0,1,transformers, ,,
-01-ai/Yi-1.5-34B-Chat,2024-06-20 13:51:30,,01-ai/Yi-1.5-34B-Chat,,,46.32,eq-bench_v2_pl,171.0,1,transformers, ,,
-CohereForAI/aya-23-35B,2024-06-20 14:03:07,,CohereForAI/aya-23-35B,,,58.41,eq-bench_v2_pl,171.0,1,transformers, ,,
-CohereForAI/c4ai-command-r-v01,2024-06-20 14:14:54,,CohereForAI/c4ai-command-r-v01,,,56.43,eq-bench_v2_pl,171.0,1,transformers, ,,
-mistralai/Mixtral-8x7B-Instruct-v0.1,2024-06-20 14:35:28,,mistralai/Mixtral-8x7B-Instruct-v0.1,,,58.64,eq-bench_v2_pl,168.0,1,transformers, ,,
-mistralai/Mixtral-8x7B-v0.1,2024-06-20 15:30:24,,mistralai/Mixtral-8x7B-v0.1,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,10.0 questions were parseable (min is 83%)
-Qwen/Qwen2-57B-A14B-Instruct,2024-06-20 16:19:41,,Qwen/Qwen2-57B-A14B-Instruct,,,57.64,eq-bench_v2_pl,171.0,1,transformers, ,,
-meta-llama/Meta-Llama-3-70B,2024-06-20 16:59:30,,meta-llama/Meta-Llama-3-70B,,,46.1,eq-bench_v2_pl,145.0,1,transformers, ,,
-meta-llama/Meta-Llama-3-70B-Instruct,2024-06-20 17:15:58,,meta-llama/Meta-Llama-3-70B-Instruct,,,71.21,eq-bench_v2_pl,171.0,1,transformers, ,,
-Qwen/Qwen1.5-72B,2024-06-20 17:50:17,,Qwen/Qwen1.5-72B,,,53.96,eq-bench_v2_pl,163.0,1,transformers, ,,
-Qwen/Qwen1.5-72B-Chat,2024-06-20 18:06:58,,Qwen/Qwen1.5-72B-Chat,,,68.03,eq-bench_v2_pl,171.0,1,transformers, ,,
-Qwen/Qwen2-72B,2024-06-20 18:36:22,,Qwen/Qwen2-72B,,,69.75,eq-bench_v2_pl,169.0,1,transformers, ,,
-Qwen/Qwen2-72B-Instruct,2024-06-20 18:55:02,,Qwen/Qwen2-72B-Instruct,,,72.07,eq-bench_v2_pl,169.0,1,transformers, ,,
-mistralai/Mixtral-8x22B-v0.1,2024-06-21 20:20:37,,mistralai/Mixtral-8x22B-v0.1,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,34.0 questions were parseable (min is 83%)
-mistralai/Mixtral-8x22B-Instruct-v0.1,2024-06-26 23:40:01,,mistralai/Mixtral-8x22B-Instruct-v0.1,,,67.63,eq-bench_v2_pl,171.0,1,transformers, ,,
-mistralai/Mixtral-8x22B-v0.1,2024-06-27 01:17:13,,mistralai/Mixtral-8x22B-v0.1,,,FAILED,eq-bench_pl,FAILED,1,transformers, ,,50.0 questions were parseable (min is 83%)
-alpindale/WizardLM-2-8x22B,2024-06-27 01:50:42,,alpindale/WizardLM-2-8x22B,,,69.56,eq-bench_v2_pl,171.0,1,transformers, ,,
-Bielik_v2.2b,2024-08-24 09:54:33,,speakleash/Bielik-11B-v2.2-Instruct,,,69.05,eq-bench_v2_pl,171.0,1,transformers, ,,
-Bielik_v2.1,2024-08-24 10:07:46,,speakleash/Bielik-11B-v2.1-Instruct,,,66.27,eq-bench_v2_pl,155.0,1,transformers, ,,
-meta-llama/Meta-Llama-3.1-70B-Instruct,2024-08-24 21:24:39,,meta-llama/Meta-Llama-3.1-70B-Instruct,,,FAILED,eq-bench,FAILED,1,transformers, ,,`rope_scaling` must be a dictionary with two fields, `type` and `factor`, got {'factor': 8.0, 'low_freq_factor': 1.0, 'high_freq_factor': 4.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}
-mistralai/Mistral-Large-Instruct-2407,2024-08-24 21:51:53,,mistralai/Mistral-Large-Instruct-2407,,,78.07,eq-bench_v2_pl,171.0,1,transformers, ,,
-meta-llama/Meta-Llama-3.1-70B-Instruct,2024-08-24 22:23:40,,meta-llama/Meta-Llama-3.1-70B-Instruct,,,72.53,eq-bench_v2_pl,171.0,1,transformers, ,,
-meta-llama/Meta-Llama-3.1-405B-Instruct-FP8,2024-08-25 20:59:04,openai_api,meta-llama/Meta-Llama-3.1-405B-Instruct-FP8,,,77.23,eq-bench_v2_pl,171.0,1,openai,,,
-gpt-3.5-turbo,2024-08-25 21:14:25,openai_api,gpt-3.5-turbo,,,57.7,eq-bench_v2_pl,171.0,1,openai,,,
-gpt-4o-mini-2024-07-18,2024-08-25 21:17:34,openai_api,gpt-4o-mini-2024-07-18,,,71.15,eq-bench_v2_pl,171.0,1,openai,,,
-gpt-4o-2024-08-06,2024-08-25 21:24:35,openai_api,gpt-4o-2024-08-06,,,75.15,eq-bench_v2_pl,171.0,1,openai,,,
-gpt-4-turbo-2024-04-09,2024-08-25 21:31:42,openai_api,gpt-4-turbo-2024-04-09,,,77.77,eq-bench_v2_pl,164.0,1,openai,,,
-Bielik_v2.3,2024-09-14 10:40:57,,speakleash/Bielik-11B-v2.3-Instruct,,,70.86,eq-bench_v2_pl,171.0,1,transformers, ,,
-PLLuM-12B-nc-chat,2025-02-24 15:02:07,,CYFRAGOVPL/PLLuM-12B-nc-chat,,,49.23,eq-bench_pl,123.0,1,transformers, ,,123.0 questions were parseable (min is 83%)
-Llama-PLLuM-8B-instruct,2025-02-24 16:55:16,,CYFRAGOVPL/Llama-PLLuM-8B-instruct,,,43.56,eq-bench_pl,124.0,1,transformers, ,,124.0 questions were parseable (min is 83%)
-PLLuM-12B-nc-instruct,2025-02-24 17:38:48,,CYFRAGOVPL/PLLuM-12B-nc-instruct,,,29.50,eq-bench_pl,76.0,1,transformers, ,,76.0 questions were parseable (min is 83%)
-PLLuM-12B-chat,2025-02-24 17:56:34,,CYFRAGOVPL/PLLuM-12B-chat,,,57.29,eq-bench_v2_pl,156.0,1,transformers, ,,
-PLLuM-12B-instruct,2025-02-24 18:03:06,,CYFRAGOVPL/PLLuM-12B-instruct,,,40.21,eq-bench_v2_pl,154.0,1,transformers, ,,
-Llama-PLLuM-8B-chat,2025-02-24 18:40:04,,CYFRAGOVPL/Llama-PLLuM-8B-chat,,,50.97,eq-bench_v2_pl,155.0,1,transformers, ,,
-Llama-PLLuM-70B-instruct,2025-02-23 22:45:37,,CYFRAGOVPL/Llama-PLLuM-70B-instruct,,,69.99,eq-bench_v2_pl,171.0,1,transformers, ,,
-Llama-PLLuM-70B-chat,2025-02-24 22:32:57,,CYFRAGOVPL/Llama-PLLuM-70B-chat,,,72.99,eq-bench_v2_pl,170.0,1,transformers, ,,
-PLLuM-8x7B-nc-chat,2025-02-23 14:33:22,openai_api,CYFRAGOVPL/PLLuM-8x7B-nc-chat,,,47.29,eq-bench_v2_pl,171.0,1,openai,,,
-PLLuM-8x7B-nc-instruct,2025-02-23 14:33:22,openai_api,CYFRAGOVPL/PLLuM-8x7B-nc-instruct,,,41.75,eq-bench_v2_pl,171.0,1,openai,,,
-PLLuM-8x7B-chat,2025-02-23 14:33:22,openai_api,CYFRAGOVPL/PLLuM-8x7B-chat,,,45.22,eq-bench_v2_pl,171.0,1,openai,,,
-PLLuM-8x7B-instruct,2025-02-23 14:33:22,openai_api,CYFRAGOVPL/PLLuM-8x7B-instruct,,,39.55,eq-bench_v2_pl,171.0,1,openai,,,
-Qwen2.5-7B-Instruct,2025-03-01 11:49:28,,Qwen/Qwen2.5-7B-Instruct,,,58.58,eq-bench_v2_pl,171.0,1,transformers,,,
-Qwen2.5-14B-Instruct,2025-03-01 12:01:56,,Qwen/Qwen2.5-14B-Instruct,,,69.58,eq-bench_v2_pl,170.0,1,transformers,,,
-Qwen2.5-1.5B-Instruct,2025-03-01 12:09:18,,Qwen/Qwen2.5-1.5B-Instruct,,,27.79,eq-bench_v2_pl,170.0,1,transformers,,,
-phi-4,2025-03-01 12:19:38,,microsoft/phi-4,,,64.37,eq-bench_v2_pl,157.0,1,transformers,,,
-glm-4-9b-chat,2025-03-01 12:23:46,,THUDM/glm-4-9b-chat,,,61.79,eq-bench_v2_pl,171.0,1,transformers,,,
-openchat-3.6-8b-20240522,2025-03-01 12:29:29,,openchat/openchat-3.6-8b-20240522,,,-2.0090659464796536e+18,eq-bench_v2_pl,170.0,1,transformers,,,
-Qwen2.5-32B-Instruct,2025-03-02 14:08:52,,Qwen/Qwen2.5-32B-Instruct,,,71.15,eq-bench_v2_pl,171.0,1,transformers,,,
-Qwen2.5-72B-Instruct,2025-03-02 14:25:32,,Qwen/Qwen2.5-72B-Instruct,,,68.89,eq-bench_v2_pl,170.0,1,transformers,,,
-Llama-3.1-Nemotron-70B-Instruct-HF,2025-03-02 15:04:25,,nvidia/Llama-3.1-Nemotron-70B-Instruct-HF,,,74.75,eq-bench_pl,133.0,1,transformers,,,133.0 questions were parseable (min is 83%)
-Llama-3.2-1B-Instruct,2025-03-02 16:35:24,,meta-llama/Llama-3.2-1B-Instruct,,,20.59,eq-bench_v2_pl,148.0,1,transformers,,,
-EuroLLM-9B-Instruct,2025-03-02 16:41:02,,utter-project/EuroLLM-9B-Instruct,,,54.75,eq-bench_v2_pl,169.0,1,transformers,,,
-Llama-3.3-70B-Instruct,2025-03-02 16:59:31,,meta-llama/Llama-3.3-70B-Instruct,,,72.86,eq-bench_v2_pl,166.0,1,transformers,,,
-Llama-3.2-3B-Instruct,2025-03-02 17:14:17,,meta-llama/Llama-3.2-3B-Instruct,,,46.46,eq-bench_v2_pl,170.0,1,transformers,,,
-Qwen2.5-3B-Instruct,2025-03-02 17:26:57,,Qwen/Qwen2.5-3B-Instruct,,,36.08,eq-bench_v2_pl,170.0,1,transformers,,,
-Mistral-Small-24B-Instruct-2501,2025-03-02 17:33:14,,mistralai/Mistral-Small-24B-Instruct-2501,,,70.52,eq-bench_v2_pl,171.0,1,transformers,,,
-Mistral-Small-Instruct-2409,2025-03-02 17:43:01,,mistralai/Mistral-Small-Instruct-2409,,,72.85,eq-bench_v2_pl,171.0,1,transformers,,,
-Mistral-Nemo-Instruct-2407,2025-03-03 10:29:42,,mistralai/Mistral-Nemo-Instruct-2407,,,61.76,eq-bench_v2_pl,171.0,1,transformers,,,
-Phi-4-mini-instruct,2025-03-03 13:20:03,,microsoft/Phi-4-mini-instruct,,,50.82,eq-bench_v2_pl,170.0,1,transformers,,,
-Mistral-Large-Instruct-2411,2025-03-07 12:17:17,,mistralai/Mistral-Large-Instruct-2411,,,77.29,eq-bench_v2_pl,171.0,1,transformers,,,
-Bielik-11B-v2.5-Instruct,2025-05-01 19:27:42,,speakleash/Bielik-11B-v2.5-Instruct,,,72.42,eq-bench_v2_pl,170.0,1,transformers,,,
-Bielik-1.5B-v3.0-Instruct,2025-05-04 00:28:45,,speakleash/Bielik-1.5B-v3.0-Instruct,,,18.99,eq-bench_pl,125.0,1,transformers,,,125.0 questions were parseable (min is 83%)
-Bielik-4.5B-v3.0-Instruct,2025-05-04 16:16:42,,speakleash/Bielik-4.5B-v3.0-Instruct,,,56.21,eq-bench_v2_pl,163.0,1,transformers,,,
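
Each removed row above has 14 comma-separated fields, and the last one (the error message) can itself contain commas, as in the `rope_scaling` row. The following is a minimal sketch of how such a row could be split and scored. It assumes the field positions visible in the rows above (score in the 7th field, benchmark version in the 8th, parseable count in the 9th, error text in the 14th) and the 171-question total used by the removed code. `parse_row` and `adjusted_score` are hypothetical helper names, not functions from this repository.

```python
NUMBER_OF_QUESTIONS = 171.0  # total question count used by the removed code


def parse_row(line):
    """Split one CSV row into at most 14 fields; extra commas stay in the error text."""
    fields = line.rstrip("\n").split(",", 13)
    return {
        "run_id": fields[0],
        "completed": fields[1],
        "model_path": fields[3],
        "score": fields[6],
        "version": fields[7],
        "parseable": fields[8],
        "error": fields[13] if len(fields) > 13 else "",
    }


def adjusted_score(row):
    """Scale the raw score by the share of parseable questions, or return None."""
    try:
        score = float(row["score"])
        parseable = float(row["parseable"])
    except ValueError:  # "FAILED" in either column
        return None
    return score * parseable / NUMBER_OF_QUESTIONS


ok = parse_row(
    "CohereForAI/aya-23-8B,2024-06-19 20:44:01,,CohereForAI/aya-23-8B,,,"
    "45.43,eq-bench_v2_pl,171.0,1,transformers, ,,"
)
failed = parse_row(
    "Voicelab/trurl-2-7b,2024-06-19 20:39:26,,Voicelab/trurl-2-7b,,,"
    "FAILED,eq-bench_pl,FAILED,1,transformers, ,,"
    "141.0 questions were parseable (min is 83%)"
)
print(adjusted_score(ok))      # a full parse rate leaves the score essentially unchanged
print(adjusted_score(failed))  # None, because the score column holds "FAILED"
```

Limiting the split to 13 separators keeps an error message with embedded commas in one piece, which a plain `split(",")` would not. Rows such as the `openchat-3.6-8b-20240522` entries, whose scores are around -2e18, would still need separate outlier handling.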

metadata.json DELETED

@@ -1,355 +0,0 @@
-{
-  "Azurro/APT3-1B-Base": 1,
-  "HuggingFaceH4/zephyr-7b-alpha": 7,
-  "Voicelab/trurl-2-13b-academic": 13,
-  "HuggingFaceH4/zephyr-7b-beta": 7,
-  "Voicelab/trurl-2-7b": 7,
-  "mistralai/Mistral-7B-v0.1": 7,
-  "mistralai/Mistral-7B-v0.1,peft=lora/output/mistral-7b-v0.1-lora-pl/checkpoint-400/adapter_model": 7,
-  "mistralai/Mistral-7B-v0.1,peft=lora/output/mistral-7b-v0.1-lora-pl/checkpoint-200/adapter_model": 7,
-  "mistralai/Mistral-7B-v0.1,load_in_8bit=True": 7,
-  "Nondzu/zephyr-speakleash-007-pl-8192-32-16-0.05": 7,
-  "openchat/openchat-3.5-0106": 7,
-  "mistralai/Mistral-7B-v0.1,peft=lora/output/mistral-7b-v0.1-lora-pl/checkpoint-2000/adapter_model": 7,
-  "mistralai/Mistral-7B-v0.1,peft=lora/output/mistral-7b-v0.1-lora-pl/checkpoint-2200/adapter_model": 7,
-  "mistralai/Mistral-7B-Instruct-v0.1": 7,
-  "APT3-1B-Instruct-e1": 1,
-  "mistralai/Mistral-7B-v0.1,peft=lora/output/mistral-7b-v0.1-lora-pl/checkpoint-800/adapter_model": 7,
-  "mistralai/Mistral-7B-v0.1,peft=lora/output/mistral-7b-v0.1-lora-pl/checkpoint-600/adapter_model": 7,
-  "APT3-1B-Instruct-e2": 1,
-  "mistralai/Mistral-7B-v0.1,load_in_4bit=True": 7,
-  "speakleash/3-5B_high_base/epoch_2_hf": 3.5,
-  "speakleash/3-5B_high_base/epoch_1_hf": 3.5,
-  "speakleash/3-5B_high_base/epoch_0_hf": 3.5,
-  "speakleash/7B_high_base/epoch_1_hf": 7,
-  "speakleash/7B_high_base/epoch_0_hf": 7,
-  "Nondzu/zephyr-speakleash-010-pl-3072-32-16-0.01": 7,
-  "google/mt5-xl": 3.7,
-  "speakleash/7B_high_sft/epoch_2_base/epoch_2_hf": 7,
-  "OPI-PG/Qra-1b": 1,
-  "OPI-PG/Qra-13b": 13,
-  "OPI-PG/Qra-7b": 7,
-  "teknium/OpenHermes-2.5-Mistral-7B": 7,
-  "openchat/openchat-3.5-1210": 7,
-  "speakleash/apt3-1B_base/apt3-1B-sequential_hf": 1,
-  "speakleash/apt3-1B_base/apt3-1B-shuffled_hf": 1,
-  "speakleash/1B_high_base/like_apt3-1B_hf": 1,
-  "speakleash/1B_high_base/epoch_3_hf": 1,
-  "speakleash/7B_high_sft/epoch_1_base/epoch_2_hf": 7,
-  "speakleash/7B_high_sft/epoch_1_base/epoch_1_hf": 7,
-  "speakleash/7B_high_sft/epoch_0_base/epoch_0_hf": 7,
-  "speakleash/7B_high_sft/epoch_2_base/epoch_1_hf": 7,
-  "speakleash/3-5B_high_sft/epoch_3_base/epoch_2_hf": 3.5,
-  "allegro/plt5-large": 0.82,
-  "internlm/internlm2-7b": 7,
-  "sdadas/polish-gpt2-xl": 1.67,
-  "speakleash/1B_4k_high_sft/epoch_3_base/epoch_1_hf": 1,
-  "speakleash/mistral-PL_7B/epoch_0_hf": 7,
-  "speakleash/1B_high_sft/epoch_3_base/epoch_1_hf": 1,
-  "speakleash/polish-mistral-7B/epoch_0_hf": 7,
-  "speakleash/3-5B_high_sft/epoch_0_base/epoch_2_hf": 3.5,
-  "speakleash/3-5B_high_sft/epoch_0_base/epoch_1_hf": 3.5,
-  "speakleash/3-5B_high_sft/epoch_0_base/epoch_0_hf": 3.5,
-  "speakleash/7B_high_base/epoch_2_hf": 7,
-  "speakleash/10B-4k_high_sft/epoch_3_base/epoch_1_hf": 10,
-  "speakleash/3-5B_high_base/epoch_3_hf": 3.5,
-  "microsoft/phi-2": 2.7,
-  "RWKV/HF_v5-Eagle-7B": 7,
-  "mistralai/Mistral-7B-Instruct-v0.2": 7,
-  "speakleash/llama-apt3-7B/only-spi-e0_hf": 7,
-  "speakleash/llama-apt3-7B/spkl-only_sft/e4_hf": 7,
-  "speakleash/llama-apt3-7B/spkl-only_sft/e5_hf": 7,
-  "speakleash/llama-apt3-7B/spkl-only_sft/e3_hf": 7,
-  "speakleash/llama-apt3-7B/spkl-only_sft/e2_hf": 7,
-  "meta-llama/Llama-2-7b-hf": 7,
-  "meta-llama/Llama-2-7b-chat-hf": 7,
-  "internlm/internlm2-chat-7b": 7,
-  "internlm/internlm2-base-7b": 7,
-  "internlm/internlm2-1_8b": 1.8,
-  "internlm/internlm2-chat-1_8b": 1.8,
-  "speakleash/mistral-apt3-7B/only-spi_sft/e0_hf": 7,
-  "speakleash/mistral-apt3-7B/only-spi-e0_hf": 7,
-  "speakleash/mistral-apt3-7B/apt3-e0_hf": 7,
-  "speakleash/mistral-apt3-7B/spi-e0_hf": 7,
-  "speakleash/mistral-apt3-7B/spkl_sft_v2/e4_hf": 7,
-  "speakleash/mistral-apt3-7B/spkl_sft_v2/e5_hf": 7,
-  "speakleash/mistral-apt3-7B/spkl_sft_v2/e3_hf": 7,
-  "speakleash/mistral-apt3-7B/spkl_sft_v2/e2_hf": 7,
-  "speakleash/mistral-apt3-7B/only-spi_sft_v2/e4_bb62a5b8": 7,
-  "speakleash/mistral-apt3-7B/only-spi_sft_v2/e6_6b0aa8d6": 7,
-  "speakleash/mistral-apt3-7B/only-spi_sft_v2/e3_f8b5e568": 7,
-  "speakleash/mistral-apt3-7B/only-spi_sft_v2/e2_3b7fc53e": 7,
-  "speakleash/mistral-apt3-7B/only-spi_sft_v2/e5_f75cbc76": 7,
-  "speakleash/mistral-apt3-7B/only-spi_sft_v2/e7_642f3822": 7,
-  "speakleash/mistral-apt3-7B/spkl_sft/e3_17ef3119": 7,
-  "speakleash/mistral-apt3-7B/spkl_sft/e2_7dc8df86": 7,
-  "google/gemma-7b": 7,
-  "google/gemma-7b-it": 7,
-  "SOTA FT HerBERT (large)": 1,
-  "Baseline (majority class)": 0,
-  "SOTA FT Polish RoBERTa": 1,
-  "SOTA FT ULMFiT-SP-PL": 0.1,
-  "speakleash/llama-apt3-13B/spkl-plus/e0_caa5ad79": 13,
-  "speakleash/llama-apt3-13B/spkl-only/e0_cc0931c5": 13,
-  "eryk-mazus/polka-1.1b": 1.1,
-  "berkeley-nest/Starling-LM-7B-alpha": 7,
-  "Remek/OpenChat3.5-0106-Spichlerz-Inst-001": 7,
-  "speakleash/mistral_7B-v2/spkl-all-e2_5bd6027d": 7,
-  "speakleash/mistral_7B-v2/spkl-all-e0_8cf0987d": 7,
-  "speakleash/mistral_7B-v2/spkl-only-e0_ef715d74": 7,
-  "speakleash/mistral_7B-v2/spkl-only-e1_333887a5": 7,
-  "speakleash/mistral_7B-v2/spkl-all-e1_0b514ce9": 7,
-  "speakleash/mistral_7B-v2/spkl-only-e2_5dac700d": 7,
-  "speakleash/llama-apt3-13B/spkl-only_e0_sft/ext_e3_23b6bc9b": 13,
-  "speakleash/llama-apt3-13B/spkl-only_e0_sft/spkl_e4_e3a666b1": 13,
-  "speakleash/llama-apt3-13B/spkl-only_e0_sft/spkl_e3_45ef6b63": 13,
-  "speakleash/llama-apt3-13B/spkl-only_e0_sft/spkl_e5_bf95416b": 13,
-  "speakleash/llama-apt3-13B/spkl-only_e0_sft/ext_e2_f7606252": 13,
-  "speakleash/llama-apt3-13B/spkl-only_e0_sft/spkl_e2_898ae6c6": 13,
-  "speakleash/apt4-1B/spkl-only-e3_756856c4": 1,
-  "speakleash/apt4-1B/spkl-all-e0_7f6a991e": 1,
-  "speakleash/apt4-1B/spkl-only-e2_969e76b4": 1,
-  "speakleash/apt4-1B/spkl-all-e2_bfb44ded": 1,
-  "speakleash/apt4-1B/spkl-all-e3_063753f9": 1,
-  "speakleash/apt4-1B/spkl-all-e1_74a293c8": 1,
-  "speakleash/apt4-1B/spkl-only-e0_b9c8bb39": 1,
-  "speakleash/apt4-1B/spkl-only-e1_fea4b41b": 1,
-  "upstage/SOLAR-10.7B-Instruct-v1.0": 10.7,
-  "upstage/SOLAR-10.7B-v1.0": 10.7,
-  "speakleash/mistral_7B-v2/spkl-all_sft/e1_base/spkl-all-e1_9aee511a": 7,
-  "speakleash/mistral_7B-v2/spkl-all_sft/e1_base/spkl-all-e0_dd9d2777": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft/e1_base/spkl-only-e1_d0ac34b7": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft/e1_base/spkl-only-e0_9eea5944": 7,
-  "Remek/Kruk-7B-SP-001": 7,
-  "TinyLlama/TinyLlama-1.1B-Chat-v1.0": 1.1,
-  "internlm/internlm2-chat-7b-sft": 7,
-  "speakleash/mistral_7B-v2/spkl-all_sft/e1_base/spkl-all-e3_72a6c52a": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft/e1_base/spkl-only-e3_08a0fd89": 7,
-  "speakleash/mistral_7B-v2/spkl-all_sft/e1_base/spkl-all-e2_0a1a62c0": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft/e1_base/spkl-only-e2_a7c66ac5": 7,
-  "speakleash/mistral-apt3-7B_v2/spkl-only_2e5-e0_116fa2bc": 7,
-  "speakleash/mistral-apt3-7B_v2/spkl-only_7e6-e0_8544bbd3": 7,
-  "speakleash/mistral-apt3-7B_v2/spkl-only_2e5-e1_013bd434": 7,
-  "speakleash/mistral-apt3-7B_v2/spkl-only-e1_87bfffac": 7,
-  "speakleash/mistral-apt3-7B_v2/spkl-only-e2_939d897f": 7,
-  "speakleash/mistral-apt3-7B_v2/spkl-only-e0_2a5be0dc": 7,
-  "speakleash/mistral-apt3-7B_v2/spkl-only_sft/e1_base_2e5/spkl-only-e1_0303962d": 7,
-  "speakleash/mistral-apt3-7B_v2/spkl-only_sft/e1_base_2e5/spkl-only-e0_f4aaf490": 7,
-  "speakleash/mistral-apt3-7B_v2/spkl-only_sft/e0_base_2e5/spkl-only-e0_009b090e": 7,
-  "speakleash/mistral-apt3-7B_v2/spkl-only_sft/e0_base_2e5/spkl-only-e1_91aae327": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft_v2/e1_base/spkl-only_7e6w-e1_14d52992": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft_v2/e1_base/spkl-only_7e6w-e2_72422a32": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft_v2/e1_base/spkl-only-e2_dcb87efc": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft_v2/e1_base/spkl-only_7e6-e2_04382c38": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft_v2/e1_base/spkl-only_7e6-e3_860889b1": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft_v2/e1_base/spkl-only_7e6w-e3_78cf3243": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft_v2/e1_base/spkl-only_9e7-e0_27275908": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft_v2/e1_base/spkl-only-e0_d31a18b7": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft_v2/e1_base/spkl-only_7e6-e0_c26126c8": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft_v2/e1_base/spkl-only-e3_a5833b75": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft_v2/e1_base/spkl-only_7e6w-e0_6c834bf7": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft_v2/e1_base/spkl-only_7e6-e1_87b7c12f": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft_v2/e1_base/spkl-only_9e7-e2_5ce06dd2": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft_v2/e1_base/spkl-only_9e7-e1_561ac4bb": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft_v2/e1_base/spkl-only-e1_392d55d9": 7,
-  "speakleash/mistral_7B-v2/spkl-all_sft_v2/e1_base/spkl-all_2e6-e2_db0cd739": 7,
-  "speakleash/mistral_7B-v2/spkl-all_sft_v2/e1_base/spkl-all_2e6-e3_4960543c": 7,
-  "speakleash/mistral_7B-v2/spkl-all_sft_v2/e1_base/spkl-all_2e6-e0_1b65c3ac": 7,
-  "speakleash/mistral_7B-v2/spkl-all_sft_v2/e1_base/spkl-all_2e6-e1_70c70cc6": 7,
-  "speakleash/mistral-apt3-7B-v2/spkl-only_sft/e1_base/spkl-only-e2_3a071212": 7,
-  "speakleash/mistral-apt3-7B-v2/spkl-only_sft/e1_base/spkl-only-e0_6dc2e217": 7,
-  "speakleash/mistral-apt3-7B-v2/spkl-only_sft/e1_base/spkl-only-e1_46610eb1": 7,
-  "speakleash/mistral-apt3-7B-v2/spkl-only_sft-weighted/e1_base/spkl-only-e0_e79dcb9f": 7,
-  "speakleash/mistral-apt3-7B-v2/spkl-only_sft-weighted/e1_base/spkl-only-e1_10a78140": 7,
-  "Remek/OpenChat3.5-0106-Spichlerz-Bocian": 7,
-  "alpindale/Mistral-7B-v0.2-hf": 7,
-  "Azurro/APT3-275M-Base": 0.3,
-  "szymonrucinski/Curie-7B-v1": 7,
-  "speakleash/mistral-apt3-7B/spkl-all_sft_v4/e0_base/spkl-all-e0-lr5e5_a47a2047": 7,
-  "speakleash/mistral-apt3-7B/spkl-all_sft_v4/e0_base/spkl-all-e1_1774eb92": 7,
-  "speakleash/mistral-apt3-7B/spkl-all_sft_v4/e0_base/spkl-all-e0-lr2e6_71659188": 7,
-  "speakleash/mistral-apt3-7B/spkl-all_sft_v4/e0_base/spkl-all-e0_35239ee5": 7,
-  "speakleash/mistral-apt3-7B/spkl-all_sft_v4/e0_base/spkl-all-e2_5257da77": 7,
-  "speakleash/mistral-apt3-7B/spkl-all_sft_v4/e0_base/spkl-all-e3_5ca4603b": 7,
-  "speakleash/mistral-apt3-7B/spkl-only_sft_v3/e0_base/spkl-only-e3_90666ab5": 7,
-  "speakleash/mistral-apt3-7B/spkl-only_sft_v3/e0_base/spkl-only-e1_4e524cad": 7,
-  "speakleash/mistral-apt3-7B/spkl-only_sft_v3/e0_base/spkl-only-e0_40cdde38": 7,
-  "speakleash/mistral-apt3-7B/spkl-all_sft_v3/e0_base/spkl-all-e0_67274d1b": 7,
-  "speakleash/mistral-apt3-7B/spkl-all_sft_v3/e0_base/spkl-all-e1_695e8b44": 7,
-  "speakleash/mistral-apt3-7B/spkl-all_sft_v3/e0_base/spkl-all-e2_a9e6a2f0": 7,
-  "speakleash/mistral-apt3-7B/spkl-all_sft_v3/e0_base/spkl-all-e3_2ff00c2b": 7,
-  "speakleash/mistral-apt3-7B/spkl_sft_v2/e1_4067e14e": 7,
-  "speakleash/mistral-apt3-7B/spkl_sft_v2/e0_6214300a": 7,
-  "speakleash/mistral-apt3-7B/only-spi_sft_v2/e1_596202b3": 7,
-  "speakleash/mistral-apt3-7B/only-spi_sft_v2/e0_c4ea165e": 7,
-  "speakleash/mistral-apt3-7B/spkl-only_sft_v4/e0_base/spkl-only-e0_c00001c4": 7,
-  "speakleash/mistral-apt3-7B/spkl-only_sft_v4/e0_base/spkl-only-e3_2bcd3961": 7,
-  "speakleash/mistral-apt3-7B/spkl-only_sft_v4/e0_base/spkl-only-e1_f2730438": 7,
-  "speakleash/mistral-apt3-7B/spkl-only_sft_v4/e0_base/spkl-only-e2_f39a22a2": 7,
-  "speakleash/mistral-apt3-7B/spkl-all_sft_v3-lr2/e0_base/spkl-all-e0-lr6_376eb1d5": 7,
-  "speakleash/mistral-apt3-7B/spkl-all_sft_v3-lr2/e0_base/spkl-all-e0-lr5_54b6226f": 7,
-  "speakleash/mistral-apt3-7B/spkl-only_sft_v3/e0_base/spkl-only-e2_f036d0fd": 7,
-  "speakleash/mistral-apt3-7B_v2/spkl-only_7e5-e0_e143e6ce": 7,
-  "Nexusflow/Starling-LM-7B-beta": 7,
-  "RWKV/v5-Eagle-7B-HF": 7,
-  "speakleash/mistral-apt3-7B_v2/spkl-only_sft/e0_base_2e5/spkl-only-e2_afcfbe2d": 7,
-  "speakleash/mistral-apt3-7B_v2/spkl-only_sft/e0_base_2e5/spkl-only-e3_6908149d": 7,
-  "speakleash/mistral-apt3-7B_v2/spkl-only_sft/e1_base_2e5/spkl-only-e2_d5a874b1": 7,
-  "speakleash/mistral-apt3-7B_v2/spkl-only_sft/e1_base_2e5/spkl-only-e3_1be744af": 7,
-  "speakleash/mistral-apt3-7B_v2/spkl-only_sft/e1_base_2e5_v6/spkl-only-e0_4efab00a": 7,
-  "speakleash/mistral-apt3-7B_v2/spkl-only_sft/e1_base_2e5_v6/spkl-only-e1_1b706f85": 7,
-  "speakleash/mistral-apt3-7B_v2/spkl-only_sft/e1_base_2e5_v6/spkl-only-e2_f86f7889": 7,
-  "speakleash/mistral-apt3-7B_v2/spkl-only_sft/e1_base_2e5_v6/spkl-only-e3_13641875": 7,
-  "speakleash/mistral-apt3-7B_v2/spkl-only_sft/e1_base_2e5_v7w/spkl-only-e0_1f5f4968": 7,
-  "speakleash/mistral-apt3-7B_v2/spkl-only_sft/e1_base_2e5_v7w/spkl-only-e1_50de9812": 7,
-  "speakleash/mistral-apt3-7B_v2/spkl-only_sft/e1_base_2e5_v7w/spkl-only-e2_dd38abb9": 7,
-  "speakleash/mistral-apt3-7B_v2/spkl-only_sft/e1_base_2e5_v7w/spkl-only-e3_36236df3": 7,
-  "speakleash/mistral-apt3-7B_v2/spkl-only_sft/e1_base_2e5_v8w/spkl-only-e0_e185fb84": 7,
-  "speakleash/mistral-apt3-7B_v2/spkl-only_sft/e1_base_2e5_v8w/spkl-only-e1_fb5d327f": 7,
-  "speakleash/mistral-apt3-7B_v2/spkl-only_sft/e1_base_2e5_v8w/spkl-only-e2_dd71be08": 7,
-  "speakleash/mistral_7B-v2/spkl-all_sft_v2/e1_base/spkl-all_3e6_v8w-e0_d2d8a320": 7,
-  "speakleash/mistral_7B-v2/spkl-all_sft_v2/e1_base/spkl-all_3e6_v8w-e1_cd7c61a1": 7,
-  "speakleash/mistral_7B-v2/spkl-all_sft_v2/e1_base/spkl-all_v8wa_9e6-e0_32c27aa5": 7,
-  "speakleash/mistral_7B-v2/spkl-all_sft_v2/e1_base/spkl-all_v8wa_9e6-e1_518b38ca": 7,
-  "speakleash/mistral_7B-v2/spkl-all_sft_v2/e1_base/spkl-all_v8wa_9e6-e2_84fb05a1": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft_v2/e1_base/spkl-only_v10wa_3e6-e0_2ba34bd9": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft_v2/e1_base/spkl-only_v10wa_3e6-e1_35ecfaaa": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft_v2/e1_base/spkl-only_v10wa_3e6-e2_920b5c3f": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft_v2/e1_base/spkl-only_v10wa_7e6-e0_d137146f": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft_v2/e1_base/spkl-only_v10wa_7e6-e1_5bddbd74": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft_v2/e1_base/spkl-only_v10wa_7e6-e2_bbc67e89": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft_v2/e1_base/spkl-only_v10wa_7e6-e2b_53f28c53": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft_v2/e1_base/spkl-only_v10wa_7e6-e3_9931f988": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft_v2/e1_base/spkl-only_v10wa_7e6-e4_0bc82b61": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft_v2/e1_base/spkl-only_v8wa_9e6-e0_8aa4a0ae": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft_v2/e1_base/spkl-only_v8wa_9e6-e1_57357d6c": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft_v2/e1_base/spkl-only_v8wa_9e6-e2_5eb84913": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft_v2/e1_base/spkl-only_v9wa_3e6-e0_ae5e354c": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft_v2/e1_base/spkl-only_v9wa_7e6-e0_724b2d41": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft_v2/e1_base/spkl-only_v9wa_7e6-e1_d962636b": 7,
-  "speakleash/Bielik-7B-v0.1": 7,
-  "NousResearch/Nous-Hermes-2-SOLAR-10.7B": 10.7,
-  "Qwen/Qwen1.5-7B-Chat": 7,
-  "THUDM/chatglm3-6b-base": 6,
-  "THUDM/chatglm3-6b": 6,
-  "TeeZee/Bielik-SOLAR-LIKE-10.7B-Instruct-v0.1": 10.7,
-  "google/gemma-1.1-2b-it": 2,
-  "meta-llama/Meta-Llama-3-8B-Instruct": 8,
-  "meta-llama/Meta-Llama-3-8B-Instruct,max_length=4096": 8,
-  "meta-llama/Meta-Llama-3-8B": 8,
-  "meta-llama/Meta-Llama-3-8B,max_length=4096": 8,
-  "microsoft/WizardLM-2-7B": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft_v2/e1_base/spkl-only_v10wa4_9e6-e0_193ad881": 7,
-  "speakleash/mistral_7B-v2/spkl-only_sft_v2/e1_base/spkl-only_v10wa4_9e6-e1_f40e0808": 7,
-  "speakleash/Bielik-7B-Instruct-v0.1": 7,
-  "speakleash/mistral_7B-v3/spkl-only_sft_v0/e0_base/spkl-only_v11wa_9e6-e0_fe38d62e": 7,
-  "speakleash/mistral_7B-v3/spkl-only_sft_v0/e0_base/spkl-only_v11wa_9e6-e1_6f84698e": 7,
-  "speakleash/mistral_7B-v3/spkl-only_sft_v0/e0_base/spkl-only_v11wap_9e6-e0_5c6927dd": 7,
-  "speakleash/mistral_7B-v3/spkl-only_sft_v0/e0_base/spkl-only_v11wap_9e6-e1_1d6755a9": 7,
-  "speakleash/mistral_7B-v3/spkl-only_v0-e0_b93294c8": 7,
-  "speakleash/mistral_7B-v3/spkl-only_v2-e0_e5547fd5": 7,
-  "speakleash/Bielik-7B-Instruct-v0.1-GPTQ,autogptq=True": 7,
-  "speakleash/Bielik-7B-Instruct-v0.1,load_in_4bit=True": 7,
-  "speakleash/Test-v02-ep3": 7,
-  "speakleash/mistral_7B-v3/spkl-only_v2-e1.34500_a9c75816": 7,
-  "CohereForAI/c4ai-command-r-v01,max_length=4096": 35,
-  "Qwen/Qwen1.5-14B-Chat": 14,
-  "Remek/Llama-3-8B-Omnibus-1-PL-v01-INSTRUCT": 8,
-  "Remek/Llama-3-8B-Omnibus-1-PL-v01-INSTRUCT,max_length=4096": 8,
-  "internlm/internlm2-20b,max_length=4096": 20,
-  "internlm/internlm2-chat-20b,max_length=4096": 20,
-  "lex-hue/Delexa-7b": 7,
-  "lmsys/vicuna-13b-v1.5": 13,
-  "maciek-pioro/Mixtral-8x7B-v0.1-pl,max_length=4096": 46.7,
-  "mistralai/Mixtral-8x7B-Instruct-v0.1,max_length=4096": 46.7,
-  "mistralai/Mixtral-8x7B-v0.1,max_length=4096": 46.7,
-  "speakleash/Test-001-wiki": 7,
-  "speakleash/Test-002": 7,
-  "teknium/OpenHermes-13B": 13,
-  "meta-llama/Meta-Llama-3-70B-Instruct,max_length=4096": 70,
-  "meta-llama/Meta-Llama-3-70B,max_length=4096": 70,
-  "mistralai/Mixtral-8x22B-Instruct-v0.1,max_length=4096": 141,
-  "mistralai/Mixtral-8x22B-v0.1,max_length=4096": 141,
-  "Qwen/Qwen1.5-14B-Chat,max_length=4096": 14,
-  "Qwen/Qwen1.5-32B-Chat,max_length=4096": 32,
-  "Qwen/Qwen1.5-72B-Chat,max_length=4096": 72,
-  "Qwen/Qwen1.5-32B,max_length=4096": 32,
-  "Qwen/Qwen1.5-72B,max_length=4096": 72,
-  "Qwen/Qwen1.5-7B": 7,
-  "Qwen/Qwen2-0.5B-Instruct": 0.5,
-  "Qwen/Qwen2-0.5B": 0.5,
-  "Qwen/Qwen2-1.5B-Instruct": 1.5,
-  "Qwen/Qwen2-1.5B": 1.5,
-  "Qwen/Qwen2-7B-Instruct": 7,
-  "Qwen/Qwen2-7B": 7,
-  "model=gpt-3.5-turbo-instruct": 20,
-  "model=gpt-4-turbo-2024-04-09": 1000,
-  "01-ai/Yi-1.5-6B-Chat": 6,
-  "01-ai/Yi-1.5-6B": 6,
-  "01-ai/Yi-1.5-9B-Chat": 9,
-  "01-ai/Yi-1.5-9B": 9,
-  "CohereForAI/aya-23-35B,max_length=4096": 35,
-  "CohereForAI/aya-23-8B": 8,
-  "NousResearch/Hermes-2-Pro-Llama-3-8B": 8,
-  "NousResearch/Hermes-2-Theta-Llama-3-8B": 8,
-  "Remek/OpenChat-3.5-0106-PL-Omnibusv2": 7,
-  "mistralai/Mistral-7B-Instruct-v0.3": 7,
-  "mistralai/Mistral-7B-v0.3": 7,
-  "nvidia/Llama3-ChatQA-1.5-8B": 8,
-  "openchat/openchat-3.5-0106-gemma": 7,
-  "openchat/openchat-3.6-8b-20240522": 8,
-  "tiiuae/falcon-11B": 11,
-  "mlabonne/NeuralDaredevil-8B-abliterated": 8,
-  "01-ai/Yi-1.5-34B-Chat,max_length=4096": 34,
-  "Qwen/Qwen2-57B-A14B-Instruct,max_length=4096": 57,
-  "Qwen/Qwen2-72B-Instruct,max_length=4096": 72,
-  "Qwen/Qwen2-72B,max_length=4096": 72,
-  "THUDM/glm-4-9b-chat": 9,
-  "THUDM/glm-4-9b": 9,
-  "google/recurrentgemma-9b-it": 9,
-  "microsoft/Phi-3-medium-4k-instruct,max_length=4096": 14,
-  "microsoft/Phi-3-mini-4k-instruct": 3.8,
-  "microsoft/Phi-3-small-8k-instruct": 7.4,
-  "ssmits/Falcon2-5.5B-Polish": 5.5,
-  "alpindale/WizardLM-2-8x22B,max_length=4096": 141,
-  "dreamgen/WizardLM-2-7B": 7,
-  "mistralai/Mistral-Large-Instruct-2407": 123,
-  "meta-llama/Meta-Llama-3.1-70B-Instruct": 70,
-  "meta-llama/Meta-Llama-3.1-405B-Instruct-FP8": 405,
-  "speakleash/Bielik-11B-v2.0-Instruct": 11,
-  "speakleash/Bielik-11B-v2.2-Instruct": 11,
-  "speakleash/Bielik-11B-v2.1-Instruct": 11,
-  "speakleash/Bielik-11B-v2.3-Instruct": 11,
-  "CYFRAGOVPL/PLLuM-12B-nc-chat": 12,
-  "CYFRAGOVPL/PLLuM-12B-chat": 12,
-  "CYFRAGOVPL/PLLuM-12B-instruct": 12,
-  "CYFRAGOVPL/Llama-PLLuM-8B-instruct": 8,
-  "CYFRAGOVPL/PLLuM-12B-nc-instruct": 12,
-  "CYFRAGOVPL/Llama-PLLuM-8B-chat": 8,
-  "CYFRAGOVPL/PLLuM-8x7B-nc-chat": 46.7,
-  "CYFRAGOVPL/PLLuM-8x7B-nc-instruct": 46.7,
-  "CYFRAGOVPL/PLLuM-8x7B-chat": 46.7,
-  "CYFRAGOVPL/PLLuM-8x7B-instruct": 46.7,
-  "CYFRAGOVPL/Llama-PLLuM-70B-chat": 70,
-  "CYFRAGOVPL/Llama-PLLuM-70B-instruct": 70,
-  "Qwen/Qwen2.5-7B-Instruct": 7,
-  "Qwen/Qwen2.5-14B-Instruct": 14,
-  "Qwen/Qwen2.5-1.5B-Instruct": 1.5,
-  "microsoft/phi-4": 14.7,
-  "Qwen/Qwen2.5-32B-Instruct": 32,
-  "Qwen/Qwen2.5-72B-Instruct": 72,
-  "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF": 70,
-  "meta-llama/Llama-3.2-1B-Instruct": 1,
-  "utter-project/EuroLLM-9B-Instruct": 9,
-  "mistralai/Mistral-Small-Instruct-2409": 22.2,
-  "mistralai/Mistral-Small-24B-Instruct-2501": 24,
-  "meta-llama/Llama-3.3-70B-Instruct": 70,
-  "meta-llama/Llama-3.2-3B-Instruct": 3,
-  "Qwen/Qwen2.5-3B-Instruct": 3,
-  "mistralai/Mistral-Nemo-Instruct-2407": 12,
-  "microsoft/Phi-4-mini-instruct": 4,
-  "mistralai/Mistral-Large-Instruct-2411": 123,
-  "speakleash/Bielik-11B-v2.5-Instruct": 11,
-  "speakleash/Bielik-4.5B-v3.0-Instruct": 4.5,
-  "speakleash/Bielik-1.5B-v3.0-Instruct": 1.5
-}
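
Many keys in the removed mapping above carry an argument suffix such as `,max_length=4096`, while the CSV rows store the bare model path. The following is a minimal sketch of a lookup that tolerates this, modeled on the `get_model_size` helper removed from `plot_results.py`. Passing `metadata` as a parameter is an assumption made here to keep the example self-contained, and the values appear to be parameter counts in billions, which the file itself does not state.

```python
def get_model_size(model_path, metadata):
    """Return the size recorded for a model path, or None if it is unknown."""
    # Try the exact key first.
    if model_path in metadata:
        return metadata[model_path]
    # Fall back to the key with the max_length suffix.
    return metadata.get(f"{model_path},max_length=4096")


# Two entries copied from the removed metadata.json.
metadata = {
    "CohereForAI/aya-23-8B": 8,
    "Qwen/Qwen2-72B,max_length=4096": 72,
}

print(get_model_size("CohereForAI/aya-23-8B", metadata))
print(get_model_size("Qwen/Qwen2-72B", metadata))
print(get_model_size("unknown/model", metadata))
```

Models for which the lookup returns `None` are the ones the removed plotting code listed under "Models without size assigned" and then dropped.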

plot_results.py CHANGED

@@ -2,90 +2,78 @@ import pandas as pd
 import matplotlib.pyplot as plt
 import numpy as np
 import json
-import csv
-def create_performance_plot(csv_path='benchmark_results.csv', metadata_path='metadata.json'):
     # Define whitelist of interesting models (partial matches)
     WHITELIST = [
-        'Meta-Llama-3.1-70B-Instruct'
     ]
-    # Read the benchmark results with error handling for inconsistent rows
-    valid_rows = []
-    expected_fields = 14  # Number of expected fields in each row
-    with open(csv_path, 'r') as f:
-        reader = csv.reader(f)
-        header = next(reader)  # Get header row
-        # Strip whitespace from header names
-        header = [h.strip() for h in header]
-        for row in reader:
-            if len(row) == expected_fields:  # Only keep rows with correct number of fields
-                # Strip whitespace from values
-                valid_rows.append([val.strip() for val in row])
-    # Create DataFrame from valid rows
-    df = pd.DataFrame(valid_rows, columns=header)
-    # Read model sizes from metadata
-    with open(metadata_path, 'r') as f:
-        metadata = json.load(f)
-    # Process the data
-    # Keep only successful runs (where Benchmark Score is not FAILED)
-    df = df[df['Benchmark Score'] != 'FAILED']
-    df = df[df['Benchmark Score'].notna()]
-    # Convert score to numeric, handling invalid values
-    df['Benchmark Score'] = pd.to_numeric(df['Benchmark Score'], errors='coerce')
-    df = df[df['Benchmark Score'].notna()]  # Remove rows where conversion failed
-    # Convert Num Questions Parseable to numeric and calculate adjusted score
-    df['Num Questions Parseable'] = pd.to_numeric(df['Num Questions Parseable'], errors='coerce')
-    df['Benchmark Score'] = df['Benchmark Score'] * (df['Num Questions Parseable'] / 171)
-    # For each model, keep only the latest run
-    df['Run ID'] = df['Run ID'].fillna('')
-    df['timestamp'] = pd.to_datetime(df['Benchmark Completed'])
-    df = df.sort_values('timestamp')
-    df = df.drop_duplicates(subset=['Model Path'], keep='last')
-    # Get model sizes
-    def get_model_size(model_path):
-        # Try exact match first
-        if model_path in metadata:
-            return metadata[model_path]
-        # Try with max_length suffix
-        if f"{model_path},max_length=4096" in metadata:
-            return metadata[f"{model_path},max_length=4096"]
-        return None
     # Print models without size before filtering
     print("\nModels without size assigned:")
-    models_without_size = df[df['Model Path'].apply(get_model_size).isna()]
-    for model in models_without_size['Model Path']:
-        print(f"- {model}")
-    df['Model Size'] = df['Model Path'].apply(get_model_size)
-    df = df[df['Model Size'].notna()]
     # Remove extreme outliers (scores that are clearly errors)
-    q1 = df['Benchmark Score'].quantile(0.25)
-    q3 = df['Benchmark Score'].quantile(0.75)
-    iqr = q3 - q1
-    df = df[
-        (df['Benchmark Score'] >= q1 - 1.5 * iqr) &
-        (df['Benchmark Score'] <= q3 + 1.5 * iqr)
-    ]
     # Find models on Pareto frontier
-    sizes = sorted(df['Model Size'].unique())
     frontier_points = []
     max_score = float('-inf')
     frontier_models = set()
     for size in sizes:
         # Get scores for models of this size or smaller
-        subset = df[df['Model Size'] <= size]
         if len(subset) > 0:
             max_score_idx = subset['Benchmark Score'].idxmax()
             current_max = subset.loc[max_score_idx, 'Benchmark Score']
@@ -95,59 +83,73 @@ def create_performance_plot(csv_path='benchmark_results.csv', metadata_path='met
                 frontier_models.add(subset.loc[max_score_idx, 'Model Path'])
     # Filter models - keep those on Pareto frontier or matching whitelist
-    df['Keep'] = False
-    for idx, row in df.iterrows():
         if row['Model Path'] in frontier_models:
-            df.loc[idx, 'Keep'] = True
         else:
             for pattern in WHITELIST:
                 if pattern in row['Model Path']:
-                    df.loc[idx, 'Keep'] = True
                     break
-    df = df[df['Keep']]
     # Create the plot
     fig = plt.figure(figsize=(12, 8))
-    # Create scatter plot
-    plt.scatter(df['Model Size'],
-               df['Benchmark Score'],
-               alpha=0.6)
-    # Add labels for points
-    for idx, row in df.iterrows():
-        # Get model name - either last part of path or full name for special cases
-        model_name = row['Model Path'].split('/')[-1]
-        if any(pattern in row['Model Path'] for pattern in ['gpt-3', 'gpt-4']):
             model_name = row['Model Path']
-        plt.annotate(model_name,
-                    (row['Model Size'], row['Benchmark Score']),
-                    xytext=(5, 5), textcoords='offset points',
-                    fontsize=8,
-                    bbox=dict(facecolor='white', alpha=0.7, edgecolor='none', pad=0.5))
-    # Plot the Pareto frontier line
-    if frontier_points:
-        frontier_x, frontier_y = zip(*frontier_points)
-        plt.plot(frontier_x, frontier_y, 'r--', label='Pareto frontier')
-    # Add vertical line for consumer GPU budget
-    plt.axvline(x=12, color='gray', linestyle=':', label='Consumer-budget GPU limit', ymin=-0.15, clip_on=False)
-    plt.text(12, -0.15, 'Consumer-budget\nGPU (24GB) limit\nin half precision',
-             horizontalalignment='center', verticalalignment='top',
-             transform=plt.gca().get_xaxis_transform())
     # Customize the plot
     plt.grid(True, linestyle='--', alpha=0.7)
     plt.xlabel('Model Size (billions of parameters)')
-    plt.ylabel('Benchmark Score')
-    plt.title('Model Performance vs Size (Pareto Frontier)')
     # Add legend
     plt.legend()
     # Adjust layout to prevent label cutoff
     plt.tight_layout()

 import matplotlib.pyplot as plt
 import numpy as np
 import json
+def create_performance_plot(json_path='benchmark_report.json'):
     # Define whitelist of interesting models (partial matches)
     WHITELIST = [
+        'Meta Llama 4 Maverick',
+        'Anthropic Claude 3.7 Sonnet',
+        'OpenAI GPT-4o'
     ]
+    # Load the benchmark results from JSON
+    with open(json_path, 'r') as f:
+        json_data = json.load(f)
+    # Create DataFrame from JSON data
+    df = pd.DataFrame(json_data)
+    # Rename columns for consistency
+    df = df.rename(columns={
+        "Model Name": "Model Path",
+        "Model Size": "Model Size Raw"
+    })
+    # Calculate overall benchmark score as average of Avg (object) and Avg (country)
+    df['Benchmark Score'] = (df['Avg (object)'] + df['Avg (country)']) / 2
+    # Process model sizes - convert to numeric, handle "-" and extract numbers
+    df['Model Size'] = df['Model Size Raw'].replace("-", np.nan)
+    # Extract the first number from a size value, e.g. "72" -> 72.0; any unit suffix is ignored, so sizes are assumed to be in billions of parameters
+    def extract_size(size_val):
+        if pd.isna(size_val):
+            return np.nan
+        if isinstance(size_val, (int, float)):
+            return float(size_val)
+        if isinstance(size_val, str):
+            # Take the first number in the string and ignore any trailing unit text
+            import re
+            match = re.search(r'(\d+(?:\.\d+)?)', str(size_val))
+            if match:
+                return float(match.group(1))
+        return np.nan
+    df['Model Size'] = df['Model Size'].apply(extract_size)
+    # Remove models without size information for plotting
+    df_with_size = df[df['Model Size'].notna()].copy()
     # Print models without size before filtering
     print("\nModels without size assigned:")
+    models_without_size = df[df['Model Size'].isna()]
+    for idx, row in models_without_size.iterrows():
+        print(f"- {row['Model Path']}")
     # Remove extreme outliers (scores that are clearly errors)
+    if len(df_with_size) > 0:
+        q1 = df_with_size['Benchmark Score'].quantile(0.25)
+        q3 = df_with_size['Benchmark Score'].quantile(0.75)
+        iqr = q3 - q1
+        df_with_size = df_with_size[
+            (df_with_size['Benchmark Score'] >= q1 - 1.5 * iqr) &
+            (df_with_size['Benchmark Score'] <= q3 + 1.5 * iqr)
+        ]
     # Find models on Pareto frontier
+    sizes = sorted(df_with_size['Model Size'].unique())
     frontier_points = []
     max_score = float('-inf')
     frontier_models = set()
     for size in sizes:
         # Get scores for models of this size or smaller
+        subset = df_with_size[df_with_size['Model Size'] <= size]
         if len(subset) > 0:
             max_score_idx = subset['Benchmark Score'].idxmax()
             current_max = subset.loc[max_score_idx, 'Benchmark Score']
                 frontier_models.add(subset.loc[max_score_idx, 'Model Path'])
     # Filter models - keep those on Pareto frontier or matching whitelist
+    df_with_size['Keep'] = False
+    for idx, row in df_with_size.iterrows():
         if row['Model Path'] in frontier_models:
+            df_with_size.loc[idx, 'Keep'] = True
         else:
             for pattern in WHITELIST:
                 if pattern in row['Model Path']:
+                    df_with_size.loc[idx, 'Keep'] = True
                     break
+    # Also include models without size if they're in whitelist
+    df_no_size = df[df['Model Size'].isna()].copy()
+    df_no_size['Keep'] = False
+    for idx, row in df_no_size.iterrows():
+        for pattern in WHITELIST:
+            if pattern in row['Model Path']:
+                df_no_size.loc[idx, 'Keep'] = True
+                break
+    # Plot only models with a known size; df_no_size has no x-coordinate and is not merged in
+    plot_df = df_with_size[df_with_size['Keep']].copy()
     # Create the plot
     fig = plt.figure(figsize=(12, 8))
+    if len(plot_df) > 0:
+        # Create scatter plot
+        plt.scatter(plot_df['Model Size'],
+                   plot_df['Benchmark Score'],
+                   alpha=0.6, s=60)
+        # Add labels for points
+        for idx, row in plot_df.iterrows():
+            # Use the full model name for labeling
             model_name = row['Model Path']
+            plt.annotate(model_name,
+                        (row['Model Size'], row['Benchmark Score']),
+                        xytext=(5, 5), textcoords='offset points',
+                        fontsize=8,
+                        bbox=dict(facecolor='white', alpha=0.7, edgecolor='none', pad=0.5))
+        # Plot the Pareto frontier line
+        if frontier_points:
+            frontier_x, frontier_y = zip(*frontier_points)
+            plt.plot(frontier_x, frontier_y, 'r--', label='Pareto frontier', linewidth=2)
+        # Add vertical line for consumer GPU budget (assuming 24GB can handle ~12B parameters)
+        plt.axvline(x=12, color='gray', linestyle=':', label='Consumer-budget GPU limit', ymin=-0.15, clip_on=False)
+        plt.text(12, plt.ylim()[0] - (plt.ylim()[1] - plt.ylim()[0]) * 0.1,
+                 'Consumer-budget\nGPU (24GB) limit\nin half precision',
+                 horizontalalignment='center', verticalalignment='top')
     # Customize the plot
     plt.grid(True, linestyle='--', alpha=0.7)
     plt.xlabel('Model Size (billions of parameters)')
+    plt.ylabel('Benchmark Score (Average of Object & Country Recognition)')
+    plt.title('Polish Photo Recognition: Model Performance vs Size')
     # Add legend
     plt.legend()
+    # Set reasonable axis limits
+    if len(plot_df) > 0:
+        plt.xlim(left=0)
+        plt.ylim(bottom=0)
     # Adjust layout to prevent label cutoff
     plt.tight_layout()
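
The hunks above omit the unchanged lines that update `max_score` and `frontier_points`, so the frontier logic is only partly visible in this diff. The following is a minimal, self-contained sketch of the same idea, not the code from `plot_results.py`: a model is on the frontier if it scores higher than every model of equal or smaller size. The function name `pareto_frontier` and the demo data are hypothetical; the column names match those used in the diff.

```python
import pandas as pd


def pareto_frontier(df, size_col="Model Size", score_col="Benchmark Score"):
    """Return the rows that beat every model of equal or smaller size.

    Hypothetical helper for illustration; not part of the commit.
    """
    # Sort by size ascending; within one size, put the best score first.
    ordered = df.sort_values([size_col, score_col], ascending=[True, False])
    best_so_far = float("-inf")
    keep = []
    for idx, row in ordered.iterrows():
        # A row joins the frontier only if it improves on the running best.
        if row[score_col] > best_so_far:
            best_so_far = row[score_col]
            keep.append(idx)
    return df.loc[keep]


# Made-up data, only to exercise the function.
demo = pd.DataFrame({
    "Model Path": ["a", "b", "c", "d"],
    "Model Size": [1.0, 3.0, 7.0, 70.0],
    "Benchmark Score": [20.0, 35.0, 30.0, 60.0],
})
frontier = pareto_frontier(demo)
print(frontier["Model Path"].tolist())  # ['a', 'b', 'd']
```

Model `c` is dropped because the smaller model `b` already scores higher. A single sorted pass like this avoids re-filtering the DataFrame for each distinct size.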

script.py DELETED

@@ -1,322 +0,0 @@
-import pandas as pd
-import json
-import re
-# Load the CSV file
-leaderboard_df = []
-with open("benchmark_results.csv", "r") as f:
-    header = f.readline().strip().split(",")
-    header = [h.strip() for h in header]
-    for i, line in enumerate(f):
-        leaderboard_df.append(line.strip().split(",", 13))
-# Load metadata
-metadata = json.load(open('metadata.json'))
-for k, v in list(metadata.items()):
-    metadata[k.split(",")[0]] = v
-# Create DataFrame
-leaderboard_df = pd.DataFrame(leaderboard_df, columns=header)
-# Filter and process DataFrame
-leaderboard_df = leaderboard_df[(leaderboard_df["Benchmark Version"] == "eq-bench_v2_pl") | (
-        leaderboard_df["Benchmark Version"] == 'eq-bench_pl')]
-leaderboard_df = leaderboard_df[["Model Path", "Benchmark Score", "Num Questions Parseable", "Error"]]
-def parse_parseable(x):
-    if x["Num Questions Parseable"] == 'FAILED':
-        m = re.match(r'(\d+)\.0 questions were parseable', x["Error"])
-        return m.group(1)
-    return x["Num Questions Parseable"]
-leaderboard_df["Num Questions Parseable"] = leaderboard_df[["Num Questions Parseable", "Error"]].apply(
-    lambda x: parse_parseable(x), axis=1)
-NUMBER_OF_QUESTIONS = 171.0
-def fraction_to_percentage(numerator: float, denominator: float) -> float:
-    return (numerator / denominator) * 100
-leaderboard_df["Num Questions Parseable"] = leaderboard_df["Num Questions Parseable"].apply(lambda x: fraction_to_percentage(float(x), NUMBER_OF_QUESTIONS))
-def get_params(model_name):
-    if model_name in metadata:
-        return metadata[model_name]
-    else:
-        print(model_name)
-    return None
-leaderboard_df["Params"] = leaderboard_df["Model Path"].apply(lambda x: get_params(x))
-leaderboard_df["Benchmark Score"] = leaderboard_df["Benchmark Score"].replace('FAILED', None)
-leaderboard_df["Benchmark Score"] = leaderboard_df["Benchmark Score"].astype(float) * ((leaderboard_df["Num Questions Parseable"].astype(float) / 100))
-leaderboard_df.loc[leaderboard_df["Benchmark Score"] < 0, "Benchmark Score"] = 0
-leaderboard_df = leaderboard_df.sort_values(by=["Benchmark Score", "Num Questions Parseable"], ascending=[False, False])
-leaderboard_df = leaderboard_df.rename(columns={"Model Path": "Model", "Num Questions Parseable": "Percentage Questions Parseable"})
-# Generate HTML with DataTables
-html = """
-<!DOCTYPE html>
-<html lang="en">
-<head>
-    <meta charset="UTF-8">
-    <meta name="viewport" content="width=device-width, initial-scale=1.0">
-    <title>Leaderboard</title>
-    <link rel="stylesheet" href="https://cdn.datatables.net/1.11.5/css/jquery.dataTables.min.css">
-    <script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>
-    <script src="https://cdn.datatables.net/1.11.5/js/jquery.dataTables.min.js"></script>
-    <style>
-        body {
-            font: 90%/1.45em "Helvetica Neue", HelveticaNeue, Verdana, Arial, Helvetica, sans-serif;
-            margin: 0;
-            padding: 20px;
-            color: #333;
-            background-color: #fff;
-        }
-        .numeric-cell {
-            text-align: right;
-            padding: 8px !important;
-        }
-    </style>
-    <script>
-        (function($) {
-            $.fn.colorize = function(oOptions) {
-                var settings = $.extend({
-                    parse: function(e) {
-                        return parseFloat(e.html());
-                    },
-                    min: undefined,
-                    max: undefined,
-                    readable: true,
-                    themes: {
-                        "default": {
-                            color_min: "#C80000",
-                            color_mid: "#FFFFFF",
-                            color_max: "#10A54A"
-                        }
-                    },
-                    theme: "default",
-                    center: undefined,
-                    percent: false
-                }, oOptions);
-                function getColor(color1, color2, ratio) {
-                    var hex = function(x) {
-                        x = x.toString(16);
-                        return (x.length == 1) ? '0' + x : x;
-                    }
-                    color1 = (color1.charAt(0) == "#") ? color1.slice(1) : color1
-                    color2 = (color2.charAt(0) == "#") ? color2.slice(1) : color2
-                    var r = Math.ceil(parseInt(color1.substring(0,2), 16) * ratio + parseInt(color2.substring(0,2), 16) * (1-ratio));
-                    var g = Math.ceil(parseInt(color1.substring(2,4), 16) * ratio + parseInt(color2.substring(2,4), 16) * (1-ratio));
-                    var b = Math.ceil(parseInt(color1.substring(4,6), 16) * ratio + parseInt(color2.substring(4,6), 16) * (1-ratio));
-                    return "#" + (hex(r) + hex(g) + hex(b)).toUpperCase();
-                }
-                function getContrastYIQ(hexcolor) {
-                    var hex = (hexcolor.charAt(0) == "#") ? hexcolor.slice(1) : hexcolor;
-                    var r = parseInt(hex.substr(0,2),16);
-                    var g = parseInt(hex.substr(2,2),16);
-                    var b = parseInt(hex.substr(4,2),16);
-                    var yiq = ((r*299)+(g*587)+(b*114))/1000;
-                    return (yiq >= 128) ? 'black' : 'white';
-                }
-                var min = settings.min;
-                var max = settings.max;
-                if (min === undefined || max === undefined) {
-                    min = Infinity;
-                    max = -Infinity;
-                    this.each(function() {
-                        var value = parseFloat(settings.parse($(this)));
-                        if (!isNaN(value) && isFinite(value)) {
-                            min = Math.min(min, value);
-                            max = Math.max(max, value);
-                        }
-                    });
-                }
-                var center = settings.center !== undefined ? settings.center : (max + min) / 2;
-                var adj = Math.max(Math.abs(max - center), Math.abs(center - min));
-                this.each(function() {
-                    var value = parseFloat(settings.parse($(this)));
-                    if (isNaN(value) || !isFinite(value)) return;
-                    var ratio = (value - center) / adj;
-                    var color1, color2;
-                    if (value < center) {
-                        ratio = Math.abs(ratio);
-                        if (ratio > 1) ratio = 1;
-                        color1 = settings.themes[settings.theme].color_min;
-                        color2 = settings.themes[settings.theme].color_mid;
-                    } else {
-                        ratio = Math.abs(ratio);
-                        if (ratio > 1) ratio = 1;
-                        color1 = settings.themes[settings.theme].color_max;
-                        color2 = settings.themes[settings.theme].color_mid;
-                    }
-                    var color = getColor(color1, color2, ratio);
-                    $(this).css('background-color', color);
-                    if (settings.readable)
-                        $(this).css('color', getContrastYIQ(color));
-                });
-                return this;
-            };
-        }(jQuery));
-        $(document).ready(function() {
-            // Add custom filtering function
-            $.fn.dataTable.ext.search.push(function(settings, data, dataIndex) {
-                var searchValue = $('.dataTables_filter input').val();
-                if (!searchValue) return true;
-                // Split search terms by semicolon and trim whitespace
-                var searchTerms = searchValue.split(';').map(term => term.trim().toLowerCase());
-                var modelName = data[0].toLowerCase(); // Model name is in first column
-                // Return true if ANY search terms are found in the model name (OR logic)
-                return searchTerms.some(term => modelName.includes(term));
-            });
-            // Custom sorting function for benchmark scores
-            $.fn.dataTable.ext.type.order['score-pre'] = function(data) {
-                var score = parseFloat(data);
-                return isNaN(score) ? -Infinity : score;
-            };
-            // Get min/max values for each numeric column before initializing DataTables
-            var columnRanges = {
-                1: { min: Infinity, max: -Infinity },  // Params
-                2: { min: Infinity, max: -Infinity },  // Benchmark Score
-                3: { min: Infinity, max: -Infinity }   // Percentage Questions Parseable
-            };
-            $('#leaderboard tbody td').each(function() {
-                var columnIdx = $(this).index();
-                if (columnIdx in columnRanges) {
-                    var value = parseFloat($(this).text());
-                    if (!isNaN(value) && isFinite(value)) {
-                        columnRanges[columnIdx].min = Math.min(columnRanges[columnIdx].min, value);
-                        columnRanges[columnIdx].max = Math.max(columnRanges[columnIdx].max, value);
-                    }
-                }
-            });
-            var table = $('#leaderboard').DataTable({
-                "order": [[2, "desc"]],  // Sort by Benchmark Score by default
-                "pageLength": 20,  // Show 20 results per page
-                "lengthMenu": [[10, 20, 50, 100, -1], [10, 20, 50, 100, "All"]],  // Update length menu options
-                "columnDefs": [
-                    {
-                        "targets": [1],
-                        "className": "numeric-cell"
-                    },
-                    {
-                        "type": "score",
-                        "targets": [2],  // Apply custom sorting to Benchmark Score column
-                        "className": "numeric-cell"
-                    },
-                    {
-                        "targets": [3],
-                        "className": "numeric-cell"
-                    }
-                ],
-                "drawCallback": function() {
-                    // Apply colorization with pre-calculated ranges
-                    $("#leaderboard tbody td:nth-child(2)").colorize({
-                        parse: function(e) { return parseFloat($(e).text()); },
-                        min: columnRanges[1].min,
-                        max: columnRanges[1].max,
-                        themes: {
-                            "default": {
-                                color_min: "#10A54A",    // Green for smaller models
-                                color_mid: "#FFD700",    // Gold/yellow for medium models
-                                color_max: "#C80000"     // Red for larger models
-                            }
-                        }
-                    });
-                    $("#leaderboard tbody td:nth-child(3)").colorize({
-                        parse: function(e) { return parseFloat($(e).text()); },
-                        min: columnRanges[2].min,
-                        max: columnRanges[2].max,
-                        themes: {
-                            "default": {
-                                color_min: "#C80000",    // Red for lower scores
-                                color_mid: "#FFD700",    // Gold/yellow for medium scores
-                                color_max: "#10A54A"     // Green for higher scores
-                            }
-                        }
-                    });
-                    $("#leaderboard tbody td:nth-child(4)").colorize({
-                        parse: function(e) { return parseFloat($(e).text()); },
-                        min: columnRanges[3].min,
-                        max: columnRanges[3].max,
-                        themes: {
-                            "default": {
-                                color_min: "#C80000",    // Red for lower percentages
-                                color_mid: "#FFD700",    // Gold/yellow for medium percentages
-                                color_max: "#10A54A"     // Green for higher percentages
-                            }
-                        }
-                    });
-                },
-                // Override the default search behavior
-                "search": {
-                    "smart": false
-                },
-                // Update search on input change
-                "initComplete": function() {
-                    var table = this.api();
-                    $('.dataTables_filter input')
-                        .off() // Remove default binding
-                        .on('input', function() {
-                            table.draw();
-                        });
-                }
-            });
-        });
-    </script>
-</head>
-<body>
-    <h1>Leaderboard</h1>
-    <table id="leaderboard" class="display" style="width:100%">
-        <thead>
-            <tr>
-                <th>Model</th>
-                <th>Params</th>
-                <th>Benchmark Score</th>
-                <th>Percentage Questions Parseable</th>
-                <th>Error</th>
-            </tr>
-        </thead>
-        <tbody>
-"""
-# Add rows to the HTML table
-for _, row in leaderboard_df.iterrows():
-    html += f"""
-            <tr>
-                <td>{row['Model']}</td>
-                <td>{row['Params']}</td>
-                <td>{row['Benchmark Score']:.2f}</td>
-                <td>{row['Percentage Questions Parseable']:.2f}</td>
-                <td>{row['Error']}</td>
-            </tr>
-    """
-# Close the HTML tags
-html += """
-        </tbody>
-    </table>
-</body>
-</html>
-"""
-# Save the HTML to a file
-with open("leaderboard.html", "w") as file:
-    file.write(html)
-print("HTML leaderboard generated and saved as leaderboard.html")

src/about.py CHANGED

@@ -2,25 +2,43 @@
 TITLE = """<div style="display: flex; flex-wrap: wrap; justify-content: space-around;">
     <img src="https://speakleash.org/wp-content/uploads/2023/09/SpeakLeash_logo.svg">
     <div>
-        <h1 align="center" id="space-title">Polish EQ-Bench Leaderboard</h1>
-        <h2 align="center" id="space-subtitle">Leaderboard was created as part of an open-science project SpeakLeash.org</h2>
     </div>
 </div>"""
 # What does your leaderboard evaluate?
 INTRODUCTION_TEXT = """
-Polish Emotional Intelligence Benchmark for LLMs
 Help us develop Polish Large Language Model Bielik by using [Arena](https://arena.speakleash.org.pl/).
 We gratefully acknowledge Polish high-performance computing infrastructure PLGrid (HPC Centers: ACK Cyfronet AGH) for providing computer facilities and support within computational grant no. PLG/2024/016951.
 """
-AUTHORS = """Authors:
-* Automatic translation: [Remigiusz Kinas](https://www.linkedin.com/in/remigiusz-kinas/)
-* Translation proofreading and localization: [Maria Filipkowska](https://www.linkedin.com/in/maria-filipkowska/), [Zuzanna Dabić](https://www.linkedin.com/in/zuzanna-dabic/)
-* Preparing dataset: [Kacper Milan](https://www.linkedin.com/in/kacper-milan/)
-* Running benchmark and leaderboard: [Krzysztof Wróbel](https://www.linkedin.com/in/wrobelkrzysztof/)
-Based on: EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models, Samuel J. Paech, 2023"""

 TITLE = """<div style="display: flex; flex-wrap: wrap; justify-content: space-around;">
     <img src="https://speakleash.org/wp-content/uploads/2023/09/SpeakLeash_logo.svg">
     <div>
+        <h1 align="center" id="space-title">Polish Cultural Vision Benchmark (PCVB)</h1>
+        <h2 align="center" id="space-subtitle">Evaluating Vision-Language Models on Polish Cultural Heritage</h2>
     </div>
 </div>"""
 # What does your leaderboard evaluate?
 INTRODUCTION_TEXT = """
+A specialized evaluation dataset designed to assess vision-language models' understanding of Polish cultural heritage, history, geography, and traditions. This benchmark addresses a critical gap in multilingual and culturally specific evaluation of multimodal AI systems.
+**Benchmark Scope:**
+- **Domain**: Polish Cultural Knowledge
+- **Modality**: Vision + Language
+- **Task Type**: Visual Recognition and Cultural Classification
+- **Dataset Size**: ~220 curated image-text pairs across 11 subcategories
+**Categories Evaluated:**
+- 🎭 **Art & Entertainment**: Movies, Art, Theatre
+- 🏛️ **Culture & Tradition**: Food, Folk Culture, Traditions
+- 🗺️ **Geography**: Cities, Nature, Architecture
+- 📚 **History**: Historical Figures, Historical Sites
 Help us develop Polish Large Language Model Bielik by using [Arena](https://arena.speakleash.org.pl/).
 We gratefully acknowledge Polish high-performance computing infrastructure PLGrid (HPC Centers: ACK Cyfronet AGH) for providing computer facilities and support within computational grant no. PLG/2024/016951.
 """
+AUTHORS = """**Benchmark Details:**
+**Methodology**: Each test item consists of carefully selected and manually verified images that represent authentic Polish cultural elements. Models are prompted to identify specific cultural objects, landmarks, foods, or personalities shown in images, along with their country of origin.
+**Evaluation Protocol**: Responses are evaluated for both object accuracy and geographical attribution using binary scoring (correct/incorrect) across all categories.
+**Unique Value Proposition**:
+- Cultural Specificity: Tests deep understanding of Polish heritage beyond generic object recognition
+- Multimodal Integration: Requires both visual processing and cultural knowledge
+- Bias Detection: Reveals potential Western-centric biases in vision-language models
+- Real-world Relevance: Evaluates practically useful cultural knowledge for Polish applications
+This benchmark is maintained as a private evaluation suite to ensure result integrity and prevent training data contamination."""
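
The evaluation protocol in `src/about.py` describes binary scoring on two axes (object and country), and `plot_results.py` averages `Avg (object)` and `Avg (country)` into one score. The sketch below shows how such numbers could be derived from per-item results. It is an illustration under assumptions, not the benchmark's scoring code: the `ItemResult` class, the `summarize` function, the 0-100 percentage scale, and per-item (rather than per-category) averaging are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ItemResult:
    """Binary outcome for one image (hypothetical structure)."""
    object_correct: bool
    country_correct: bool


def summarize(results):
    """Aggregate binary outcomes into percentage accuracies.

    Assumes a 0-100 scale and a plain per-item mean; the real report
    may aggregate differently.
    """
    n = len(results)
    if n == 0:
        raise ValueError("no results to score")
    avg_object = 100.0 * sum(r.object_correct for r in results) / n
    avg_country = 100.0 * sum(r.country_correct for r in results) / n
    return {
        "Avg (object)": avg_object,
        "Avg (country)": avg_country,
        # Same combination as in plot_results.py: mean of the two averages.
        "Benchmark Score": (avg_object + avg_country) / 2,
    }


# Made-up outcomes for four images.
results = [
    ItemResult(True, True),
    ItemResult(False, True),
    ItemResult(False, False),
    ItemResult(True, True),
]
summary = summarize(results)
print(summary)
```

With these four items the object accuracy is 50.0, the country accuracy is 75.0, and the combined score is 62.5.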