	TITLE = "# Mewsic Bench Leaderboard"

	INTRODUCTION_TEXT = """
	#### I run model evaluations via API on request.

	To request a model evaluation, open the Request Evaluation tab and enter the model ID.
	"""

	CITATION_BUTTON_TEXT = """
	## What is this anyway?

	I test the lyrical capabilities of LLMs by telling them to write songs entirely as meowing.

	## WHY is this?

	There's been a lot of grandstanding over the past few years about models learning to write poetry, but I've found that generating actual poetry is a poor measure of a model's poetic competency. Models can "fake it" by using common phrases and line structures that are close enough to fool a human reader.

	Forcing the model to generate nonsensical content condenses the test down to the essential characteristics of meter and line.

	Also, it's funny.

	## Citation

	If you use this benchmark, please cite:

	```bibtex
	@misc{mewsicbench2026,
	title={Mewsic Bench},
	author={CatsCanWrite},
	year={2026},
	}
	```

	## Contact

	For questions or issues, please open a discussion on the Hugging Face community tab.
	"""

	METRIC_INFO_TEXT = """
	## About the Metrics

	- Meter - How closely the model sticks to the meter of the lines.
	- Verse - How closely the model's lines follow the verse and chorus structure.
	- Focus - How much of the response is extraneous commentary instead of the song. (Focus makes only a very minor contribution to the final score.)
	- Thinking - The estimated average number of thinking tokens per response. Zero means it's not a reasoning model (or is a hybrid model with reasoning off).
	"""