---
license: mit
tags:
- art
- music
- midi
- emotion
- clip
- multimodal
base_model:
- openai/clip-vit-large-patch14-336
---

# ARIA - Artistic Rendering of Images into Audio

ARIA is a multimodal model that generates MIDI music from the emotional content of artwork. It uses a CLIP-based image encoder to extract emotional valence and arousal from an image, then conditions a MIDI generation model on those values to produce emotionally matched music.

## Model Description

- **Developed by:** Vincent Amato
- **Model type:** Multimodal (image-to-MIDI) generation
- **Language(s):** English
- **License:** MIT
- **Parent model:** Uses CLIP for image encoding and midi-emotion for music generation

### Model Architecture

ARIA consists of two main components:
1. A CLIP-based image encoder fine-tuned to predict emotional valence and arousal from images
2. A transformer-based MIDI generation model (midi-emotion) that conditions on these emotional values

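As a concrete picture of component 1, the image encoder can be thought of as CLIP image features feeding a small valence-arousal regression head. The sketch below is an illustrative assumption, not the released architecture: the `EmotionHead` name, the 768-dim input, the hidden size, and the [-1, 1] output range are all placeholders.

```python
import torch
import torch.nn as nn


class EmotionHead(nn.Module):
    """Illustrative regression head mapping pooled image features to a
    (valence, arousal) pair in [-1, 1].

    The feature dimension and layer sizes are assumptions for this
    sketch, not values taken from the released ARIA checkpoint.
    """

    def __init__(self, feature_dim: int = 768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feature_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 2),  # one output each for valence and arousal
            nn.Tanh(),          # bound both values to [-1, 1]
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.proj(features)
```

The bounded output is what lets the two continuous conditioning modes below consume the prediction directly as a small emotion vector.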
The model offers three conditioning modes:
- `continuous_concat`: emotion values as continuous vectors concatenated to every token embedding
- `continuous_token`: emotion values as continuous vectors prepended to the sequence
- `discrete_token`: emotion values quantized into discrete tokens

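For the `discrete_token` mode, quantization means mapping each continuous value onto a fixed set of bins. A minimal sketch of uniform binning follows; the bin count of 5 and the [-1, 1] input range are illustrative assumptions, as the real vocabulary lives in the model's token mappings.

```python
def quantize_emotion(value: float, n_bins: int = 5) -> int:
    """Map a valence or arousal value in [-1, 1] to a bin index in
    [0, n_bins - 1] via uniform binning.

    The default of 5 bins is an illustrative choice, not the model's
    actual emotion vocabulary size.
    """
    clipped = max(-1.0, min(1.0, value))       # guard against out-of-range inputs
    idx = int((clipped + 1.0) / 2.0 * n_bins)  # scale [-1, 1] onto [0, n_bins]
    return min(idx, n_bins - 1)                # fold value == 1.0 into the top bin
```

With 5 bins, -1.0 maps to bin 0, 0.0 to bin 2, and 1.0 to bin 4; each (valence bin, arousal bin) pair would then select a discrete emotion token.
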
### Usage

The repository contains three variants of the MIDI generation model, each trained with a different conditioning strategy. Each variant includes:
- `model.pt`: the trained model weights
- `mappings.pt`: token mappings for MIDI generation
- `model_config.pt`: the model configuration

Additionally, `image_encoder.pt` contains the CLIP-based image emotion encoder.

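A minimal loading sketch, assuming PyTorch and the file layout above; the `load_aria_variant` helper and its directory argument are illustrative, not a published API.

```python
from pathlib import Path

import torch


def load_aria_variant(variant_dir: str) -> dict:
    """Load the three checkpoint files of one conditioning variant,
    e.g. a directory holding the continuous_concat files.

    Hypothetical helper: only the file names come from this model card.
    """
    root = Path(variant_dir)
    return {
        "weights": torch.load(root / "model.pt", map_location="cpu"),
        "mappings": torch.load(root / "mappings.pt", map_location="cpu"),
        "config": torch.load(root / "model_config.pt", map_location="cpu"),
    }
```

Depending on your PyTorch version, checkpoints that pickle full configuration objects may additionally require `weights_only=False` in `torch.load`.
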
## Intended Use

This model is designed for:
- Generating music that matches the emotional content of artwork
- Exploring emotional transfer between the visual and musical domains
- Creative applications in art and music generation

### Limitations

- Music generation quality depends on the emotional interpretation of the input image
- Generated MIDI may require human curation for professional use
- The model's emotional understanding is limited to the two-dimensional valence-arousal space

## Training Data

The model combines:
1. Image encoder: fine-tuned on a curated dataset of artwork with emotional annotations
2. MIDI generation: uses the Lakh-Spotify dataset as processed by the midi-emotion project

## Attribution

This project builds upon:
- **midi-emotion** by Serkan Sulun et al. ([GitHub](https://github.com/serkansulun/midi-emotion))
  - Paper: "Symbolic music generation conditioned on continuous-valued emotions" ([IEEE Access](https://ieeexplore.ieee.org/document/9762257))
  - Citation: S. Sulun, M. E. P. Davies and P. Viana, "Symbolic Music Generation Conditioned on Continuous-Valued Emotions," IEEE Access, vol. 10, pp. 44617-44626, 2022
- **CLIP** by OpenAI for the base image encoder architecture

## License

This model is released under the MIT License. However, use of the midi-emotion component should comply with its GPL-3.0 license.