TITLE = "# Mewsic Bench Leaderboard"

INTRODUCTION_TEXT = """
#### I run model evaluations via API on request.

To request a model evaluation, open the **Request Evaluation** tab and enter the model ID.
"""

CITATION_BUTTON_TEXT = """
## What is this anyway?

I test the lyrical capabilities of LLMs by telling them to write songs entirely in meows.

## WHY is this?

There's been a lot of grandstanding over the past few years about models learning how to write poetry, but I've found that generating actual poetry is a poor measure of poetic competency: a model can "fake it" by using common phrases and line structures that are close enough to fool a human reader.

Forcing the model to generate nonsensical content condenses the test down to the essential characteristics of meter and line.

Also, it's funny.

## Citation

If you use this benchmark, please cite:

```bibtex
@misc{mewsicbench2026,
    title={Mewsic Bench},
    author={CatsCanWrite},
    year={2026},
}
```

## Contact

For questions or issues, please open a discussion on the Hugging Face community tab.
"""

METRIC_INFO_TEXT = """
## About the Metrics

- **Meter** - How closely the model sticks to the meter of the lines.
- **Verse** - How closely the model aligns the lines to the verse and chorus structure.
- **Focus** - How much of the response is extraneous commentary instead of the song. *(Focus contributes only slightly to the final score.)*
- **Thinking** - The **estimated average** number of thinking tokens per response. Zero means it's not a reasoning model (or is a hybrid model with reasoning off).
"""