Noteworthy
App:
https://huggingface.co/spaces/build-small-hackathon/noteworthy
The Problem
As a keen musician, I often work with music teachers who teach either me or one of my children an instrument. One effective way to learn a large piece of music is to break it into chunks, typically a couple of bars or a single line at a time. Many of the teachers I work with use software that lets them easily edit and excerpt sheet music, so they often send these chunks to us to read from our iPads during practice sessions.
When learning an instrument, students often struggle during practice to check whether they're playing a new piece of music correctly. I wanted to give them a tool to help with this, and naturally thought of using an LLM.
I was surprised to find that none of Claude, Gemini, or GPT-5.5 was accurate, and all three were very slow in reasoning mode. That's a big deal, because millions of people learn and read music every day.
The Dataset
- The PrIMuS dataset seemed the most relevant for the task, as it includes both sheet music images and their associated notes, along with several other files.
- Each sample in the dataset had 10 items. We clearly needed the .png image of the sheet music, and after examining the other 9 items, the semantic file turned out to hold the most relevant note information.
- Looking at the dataset in more detail, I found several clefs (treble, bass, alto, tenor, ...), so I needed a way to select only the treble clef, which was the one I was interested in.
%G-2@... ← treble clef (G clef on staff line 2)
%F-4@... ← bass clef (F clef on staff line 4)
%C-1@... ← soprano clef (not treble/bass)
- This allowed me to separate out only the treble clef images and their associated data, which became the dataset for fine-tuning.
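The clef filter described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes each sample carries a header string that starts with a clef marker in the form shown above (`%G-2`, `%F-4`, `%C-1`), and the helper names (`parse_clef`, `is_treble`, `keep_treble`) and the sample file names are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Clef(NamedTuple):
    shape: str  # clef letter: "G", "F", or "C"
    line: int   # staff line the clef sits on


# Matches the leading clef marker, e.g. "%G-2" in "%G-2@...".
_CLEF_RE = re.compile(r"^%([GFC])-(\d)")


def parse_clef(code: str) -> Optional[Clef]:
    """Return the clef encoded at the start of `code`, or None if absent."""
    match = _CLEF_RE.match(code.strip())
    if match is None:
        return None
    return Clef(match.group(1), int(match.group(2)))


def is_treble(code: str) -> bool:
    """Treble clef is a G clef on staff line 2."""
    return parse_clef(code) == Clef("G", 2)


def keep_treble(samples: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only (image_name, header) pairs whose header is in treble clef."""
    return [sample for sample in samples if is_treble(sample[1])]


# Hypothetical sample records: (image file name, header string).
samples = [
    ("a.png", "%G-2@..."),
    ("b.png", "%F-4@..."),
    ("c.png", "%C-1@..."),
]
treble_only = keep_treble(samples)
```

Parsing the clef into a shape and a line, instead of only checking for the `%G-2` prefix, makes it simple to extend the filter to bass clef later if that becomes useful.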
The Model
- OpenBMB's vision-to-text model seemed ideal: it has only 1B parameters, and it did not correctly transcribe notes out of the box, so there was room for fine-tuning to help.
- I ran initial experiments with LoRA fine-tuning and ran inference on the model after 100 steps of fine-tuning. The results were more accurate than the base model. I continued training to 400 steps, but the results were still not accurate enough to be helpful to students.
- I read a little more about the model and discovered that, by default, images are downsampled by a factor of 16, which would reduce accuracy.
- I couldn't combine a fine-tune that used a downsampling factor of 4 with the earlier work done at 16, so I decided to go all in on a full fine-tune with a downsampling factor of 4.
- Inference results from the 100-step checkpoint were significantly better.
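To give a rough feel for why the downsampling factor matters, here is a small arithmetic sketch. It assumes the factor applies to the number of visual tokens the model keeps per image, which is an interpretation and not something the notes above confirm, and the starting count of 1024 tokens is purely hypothetical.

```python
def tokens_after_downsampling(num_tokens: int, factor: int) -> int:
    """Number of visual tokens left after compressing by `factor`.

    Assumption: the downsampling factor divides the token count directly.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    return num_tokens // factor


hypothetical_tokens = 1024  # illustrative value only, not taken from the model

at_16 = tokens_after_downsampling(hypothetical_tokens, 16)  # 64
at_4 = tokens_after_downsampling(hypothetical_tokens, 4)    # 256
```

Under these assumptions, a factor of 4 keeps four times as many tokens as a factor of 16, which leaves more detail for telling apart notes that sit on adjacent staff lines.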
Training
- Everything was done on Modal compute, both fine-tuning and model quantization.