---
title: Noteworthy - Sheet Music Transcription
emoji: 🎵
colorFrom: blue
colorTo: purple
sdk: gradio
app_file: app.py
python_version: "3.10"
pinned: false
license: apache-2.0
huggingface username: jon-fernandes
social media post: https://www.linkedin.com/feed/update/urn:li:activity:7472430895275114497/
tracks:
- Backyard AI
badges:
- Best MiniCPM Build
- Best Use of Codex
- Best Use of Modal
- Off Brand
- Tiny Titan
- Best Demo
- Bonus Quest Champion
tags:
- track:backyard
- sponsor:openbmb
- sponsor:openai
- sponsor:modal
- achievement:offgrid
- achievement:welltuned
- achievement:offbrand
- achievement:llama
- achievement:sharing
- achievement:fieldnotes
---
Hugging Face username: jon-fernandes
[LinkedIn post](https://www.linkedin.com/feed/update/urn:li:activity:7472430895275114497/)
# Noteworthy — Sheet Music Transcription
Upload an image of a single line of treble clef sheet music to transcribe its notes.
# The Problem
[Video of The Problem and Demo of solution](https://youtu.be/TjLOu0hy26Q)
As a keen musician, I often work with music teachers or music students. One effective way to learn a large piece of sheet music is to break it into chunks, typically a couple of bars or a single line at a time.
Many of the teachers I work with use software that lets them easily edit and extract sheet music, so they often send these chunks to us to read from our iPads during music practice sessions.
When learning an instrument, students often struggle during practice to check whether they're playing a new piece of music correctly. I wanted to give them a tool to help with this, and naturally thought of using an LLM.
I was surprised to find that, in general, Claude, Gemini, and GPT-5.5 could not transcribe sheet music accurately. Even transcribing a single line of music took 5-10 minutes. All three hallucinated significantly, which is a big deal because **millions of people learn and read music every day**.
The treble clef is the symbol at the start of the five horizontal lines that music is written on. It is used for higher-pitched instruments and voices, such as recorder, flute, and trumpet. These are the instruments I deal with on a day-to-day basis, so that is where my focus is. (Other instruments use different clefs, such as the bass clef, or a mix of both, as with piano.)
I often work with students who are new to music. The Noteworthy app addresses a very specific problem that music teachers and students face in school:
1. **Treble clef** (music students learning recorder, flute, or trumpet).
2. **A couple of bars or a single line of music at most** (this is how large pieces of music are broken down to be learned and practiced in the students' own time between lessons).
See Forward Looking Improvements below.
# Students actually used the app
[Video of students using the app](https://youtu.be/fk7704sOi9A)
I also got a couple of music students who are relatively new to their musical instrument to test out the app. They all said that it would help them when they didn’t have access to someone who could read music.
# Why small models and why OpenBMB's MiniCPM-V-4.6?
1. Noteworthy needs to run on the iPad or phone that a student already uses to read their chunk or line of music.
2. MiniCPM-V-4.6 was the perfect size at 1B parameters. The model could be further quantized to run with llama.cpp on a small end-user device like an iPad or phone.
3. It needs to be responsive, so low latency is essential. Young students are unlikely to have the patience for the response times of most LLMs (5-10 minutes for a line).
4. I wanted to find the right balance of cost (GPU hardware) and performance, as the model would need to be fully fine-tuned to reach maximum accuracy.
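As a rough illustration of why quantization matters for on-device use, the sketch below estimates weight storage from the parameter count and the bits per weight. The 16-bit and 4-bit figures are nominal round numbers chosen for illustration, not measurements of the actual files; real GGUF files also carry metadata and may keep some tensors at higher precision.

```python
def estimated_size_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 1_000_000_000  # the 1B parameter count stated above

# Nominal bit widths, assumed for illustration only.
size_16_bit = estimated_size_gb(PARAMS, 16)
size_4_bit = estimated_size_gb(PARAMS, 4)

print(f"16-bit weights: ~{size_16_bit} GB")
print(f" 4-bit weights: ~{size_4_bit} GB")
```

Under these assumptions, the 4-bit weights need roughly a quarter of the storage of the 16-bit weights.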
# Works offline using llama.cpp
[Video of working offline using llama.cpp](https://youtu.be/jT4iilgr6UM)
As MiniCPM-V-4.6 is a vision-to-text model, it requires two files to operate:
1. The GGUF file: the quantized model weights and related metadata.
2. The mmproj file: the multimodal projector that handles image processing for the model.
```bash
./llama.cpp/build/bin/llama-mtmd-cli -m MiniCPM-V-4.6-sheetmusic-checkpoint-2400-Q4_K_M.gguf \
  --mmproj mmproj-MiniCPM-V-4.6-sheetmusic-checkpoint-2400-F16.gguf \
  --image examples/000100059-1_1_1.png \
  -p "Transcribe these music notes"
```
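If you want to call the same command from Python (for example from `app.py`), a minimal sketch might look like the following. It only wraps the CLI invocation shown above using the standard library; the helper names `build_command` and `transcribe` are hypothetical and are not taken from the app's source, and the paths assume you run it from the directory that contains the model files.

```python
import subprocess

# Paths and file names are copied from the command above; adjust for your machine.
LLAMA_CLI = "./llama.cpp/build/bin/llama-mtmd-cli"
MODEL = "MiniCPM-V-4.6-sheetmusic-checkpoint-2400-Q4_K_M.gguf"
MMPROJ = "mmproj-MiniCPM-V-4.6-sheetmusic-checkpoint-2400-F16.gguf"


def build_command(image_path, prompt="Transcribe these music notes"):
    """Return the argument list for one transcription run (hypothetical helper)."""
    return [
        LLAMA_CLI,
        "-m", MODEL,
        "--mmproj", MMPROJ,
        "--image", str(image_path),
        "-p", prompt,
    ]


def transcribe(image_path):
    """Run the CLI and return whatever it prints to stdout (hypothetical helper)."""
    result = subprocess.run(
        build_command(image_path),
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode output as str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.strip()
```

Calling `transcribe("examples/000100059-1_1_1.png")` would then return the raw CLI output as a string. Passing the arguments as a list avoids shell quoting problems with the prompt.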
# MCP Support
[Video of MCP support](https://youtu.be/O0NnSHE16UI)
Because general-purpose LLMs take so long to transcribe music, I also exposed Noteworthy as an MCP server, so that an MCP client such as Claude Desktop can delegate sheet music transcription to it.
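To give a feel for what an MCP client sends, the sketch below builds a `tools/call` request, which MCP expresses as a JSON-RPC 2.0 message. The tool name `transcribe_sheet_music` and the `image_base64` argument are assumptions made for illustration; the actual Noteworthy server defines its own tool names and input schema.

```python
import base64
import json


def build_tool_call(request_id, image_bytes, tool_name="transcribe_sheet_music"):
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message.

    `tool_name` and the `image_base64` argument are hypothetical.
    """
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": tool_name,
            "arguments": {
                # Binary image data is base64-encoded so it can travel as JSON text.
                "image_base64": base64.b64encode(image_bytes).decode("ascii"),
            },
        },
    }


# Placeholder bytes stand in for a real PNG file here.
message = build_tool_call(1, b"fake-png-bytes")
print(json.dumps(message, indent=2))
```

In practice the MCP client builds and sends this message for you; the sketch is only meant to show the shape of the exchange.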
# Infrastructure
- All model-related and quantization tasks were performed exclusively on Modal.
- MiniCPM-V-4.6 was fine-tuned on the PrIMuS (Printed Images of Music Staves) dataset on Modal, using 4 x A100 GPUs for 2400 steps.
- The quantized models and mmproj models were also created on Modal.
- Initial model checkpoints were tested with model inference on Modal.
# OpenAI Codex
- All coding tasks (app creation), model-related tasks on Modal (fine-tuning, inference), and quantization were performed using OpenAI Codex.
# Forward Looking Improvements
- Supporting other clefs (bass, tenor, alto, ...)
- Support for reading a whole page of music
- Creating MIDI files and a metronome to accompany the transcribing of notes, so students can both see and hear
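As a starting point for the MIDI and metronome idea, the sketch below converts note names to MIDI note numbers and computes the time between metronome clicks. It assumes notes are written in scientific pitch notation (for example `C4` or `F#5`) with middle C as `C4` = 60; the model's actual output format may differ, so treat the input format and the helper names as hypothetical.

```python
# Semitone offsets of the natural notes above C.
SEMITONES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def note_to_midi(name):
    """Convert a note such as 'C4', 'F#5' or 'Bb3' to a MIDI note number."""
    letter, rest = name[0].upper(), name[1:]
    offset = 0
    if rest and rest[0] in "#b":
        offset = 1 if rest[0] == "#" else -1  # sharp raises, flat lowers
        rest = rest[1:]
    octave = int(rest)
    # Assumed convention: C4 (middle C) is MIDI note 60.
    return 12 * (octave + 1) + SEMITONES[letter] + offset


def seconds_per_beat(bpm):
    """Interval between metronome clicks at the given tempo."""
    return 60.0 / bpm


notes = ["C4", "E4", "G4", "F#5", "Bb3"]
print([note_to_midi(n) for n in notes])
print(seconds_per_beat(120))
```

The resulting note numbers could then be passed to any MIDI library to write a file or play the line back.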
[Field notes](https://huggingface.co/blog/build-small-hackathon/noteworthy)