Noteworthy

Community Article

Published June 15, 2026

Jonathan Fernandes

jon-fernandes

build-small-hackathon

App: https://huggingface.co/spaces/build-small-hackathon/noteworthy

The Problem

As a keen musician, I often work with music teachers, who teach either me or one of my children an instrument. One effective way to learn a large piece of music is to break it into chunks, typically a couple of bars or a single line at a time. Many of the teachers I work with use software that lets them easily edit and excerpt sheet music, so they often send these chunks to us to read from our iPads during practice sessions.

When learning an instrument, students often struggle during practice to check whether they're playing a new piece of music correctly. I wanted to give them a tool to help with this, and naturally thought of using an LLM.

I was surprised to find that none of Claude, Gemini, or GPT-5.5 read the notes accurately, and all three were incredibly slow in reasoning mode. That's a pretty big deal, because millions of people learn and read music every day.

The Dataset

The PrIMuS dataset seemed the most relevant for the task, as it includes both sheet music images and their associated notes, along with several other files.
Each sample has 10 items. We clearly needed the .png image of the sheet music, and after examining the other 9 items, I found that the semantic file had the most relevant note information.
Looking at the dataset in a little more detail, I found that it contains several clefs (treble, bass, alto, tenor, ...), so I needed a way to grab only the treble clef samples, as that was what I was interested in. The clef is marked at the start of the encoding, for example:

%G-2@...   ← treble clef (G clef on staff line 2)
%F-4@...   ← bass clef (F clef on staff line 4)
%C-1@...   ← soprano clef (not treble/bass)

This allowed me to separate out only the treble clef images and their associated data, which became the dataset for fine-tuning.
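The filtering step can be sketched roughly as follows. This is a minimal illustration, assuming each sample is an (image path, encoding string) pair and that the clef is the prefix shown in the examples above. The names `CLEF_PREFIXES`, `clef_of`, and `keep_treble` are hypothetical, and how the pairs are loaded from disk is left out.

```python
from typing import List, Optional, Tuple

# Prefixes copied from the examples above; this mapping is illustrative only.
CLEF_PREFIXES = {
    "%G-2": "treble",
    "%F-4": "bass",
    "%C-1": "soprano",
}


def clef_of(encoding: str) -> Optional[str]:
    """Return the clef name for an encoding string, or None if unrecognised."""
    text = encoding.lstrip()
    for prefix, name in CLEF_PREFIXES.items():
        if text.startswith(prefix):
            return name
    return None


def keep_treble(samples: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only (image_path, encoding) pairs whose encoding starts with the treble clef."""
    return [(img, enc) for img, enc in samples if clef_of(enc) == "treble"]


# Hypothetical samples, using the three prefixes from the examples above.
samples = [
    ("a.png", "%G-2@..."),
    ("b.png", "%F-4@..."),
    ("c.png", "%C-1@..."),
]
treble_only = keep_treble(samples)
```

Clefs that are not in the mapping return `None` and are dropped, so anything unexpected is excluded rather than silently kept.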

The Model

OpenBMB's vision-to-text model seemed ideal. It has only 1B parameters, and it did not correctly transcribe notes out of the box.
I ran initial experiments with LoRA fine-tuning and ran inference on the model after 100 steps of fine-tuning. The results were more accurate than the base model's. I continued training to 400 steps, but the results were still not accurate enough to be helpful to students.
Reading a little more about the model, I discovered that by default the images are downsampled by a factor of 16, which would reduce accuracy.
I couldn't combine fine-tunes trained with a downsample factor of 4 with the previous work done at 16, so I decided to go all in on a full fine-tune with a downsample factor of 4.
Inference results from checkpoints at 100 steps were significantly better.
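One way to put a number on "more accurate" when comparing checkpoints is a symbol error rate: the edit distance between the predicted and reference symbol sequences, divided by the reference length. The sketch below is an assumption about how such a comparison could be scored, not necessarily how the checkpoints above were judged. The function names and the token strings in the example are illustrative.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two sequences of symbols."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # symbol missing from the prediction
                curr[j - 1] + 1,     # extra symbol in the prediction
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def symbol_error_rate(reference: str, prediction: str) -> float:
    """Edit distance over whitespace-separated symbols, divided by reference length."""
    ref = reference.split()
    hyp = prediction.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


# Illustrative token strings: one wrong note out of four symbols.
reference = "clef-G2 note-C4_quarter note-D4_quarter note-E4_half"
prediction = "clef-G2 note-C4_quarter note-E4_quarter note-E4_half"
rate = symbol_error_rate(reference, prediction)  # 1 substitution / 4 symbols = 0.25
```

Lower is better, and averaging the rate over a held-out set of treble clef samples gives a single figure per checkpoint.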

Training

Everything was done on Modal compute, both the fine-tuning and the model quantization.
