---
license: mit
tags:
- art
- music
- midi
- emotion
- clip
- multimodal
base_model:
- openai/clip-vit-large-patch14-336
---

# ARIA - Artistic Rendering of Images into Audio

ARIA is a multimodal AI model that generates MIDI music based on the emotional content of artwork. It uses a CLIP-based image encoder to extract emotional valence and arousal from images, then generates emotionally appropriate music using conditional MIDI generation.

## Model Description

- **Developed by:** Vincent Amato
- **Model type:** Multimodal (Image-to-MIDI) Generation
- **Language(s):** English
- **License:** MIT
- **Parent Model:** Uses CLIP for image encoding and midi-emotion for music generation

### Model Architecture

ARIA consists of two main components:
1. A CLIP-based image encoder fine-tuned to predict emotional valence and arousal from images
2. A transformer-based MIDI generation model (midi-emotion) that conditions on these emotional values

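At inference time the two components chain together: the image encoder produces a (valence, arousal) pair, and the MIDI generator consumes it as its conditioning signal. A minimal sketch of that contract with stub components (the function names, signatures, and stub outputs are illustrative assumptions, not ARIA's actual API):

```python
# Hypothetical two-stage pipeline: image -> (valence, arousal) -> MIDI tokens.
# The component interfaces below are assumptions for illustration only.
def generate_music(image, image_encoder, midi_generator):
    valence, arousal = image_encoder(image)   # stage 1: emotion prediction
    return midi_generator(valence, arousal)   # stage 2: conditioned generation

# Stub components standing in for the real models:
stub_encoder = lambda img: (0.6, -0.3)  # e.g. a calm, mildly positive painting
stub_generator = lambda v, a: [f"<v={v}>", f"<a={a}>", "NOTE_ON_60", "NOTE_OFF_60"]

print(generate_music("artwork.png", stub_encoder, stub_generator))
```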
The model offers three different conditioning modes:
- `continuous_concat`: Emotions as continuous vectors concatenated to all tokens
- `continuous_token`: Emotions as continuous vectors prepended to the sequence
- `discrete_token`: Emotions quantized into discrete tokens

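The difference between the modes can be sketched in plain Python, with embeddings shown as lists (the toy dimensions and bin count below are illustrative assumptions, not the model's actual hyperparameters):

```python
def continuous_concat(token_embeddings, valence, arousal):
    # continuous_concat mode: append (valence, arousal) to every token
    # embedding, so the emotion signal is visible at each sequence position.
    return [emb + [valence, arousal] for emb in token_embeddings]

def quantize_emotion(value, n_bins=5):
    # discrete_token mode: map a value in [-1, 1] to one of n_bins bins,
    # each of which would correspond to a dedicated vocabulary token.
    value = max(-1.0, min(1.0, value))
    return min(int((value + 1.0) / 2.0 * n_bins), n_bins - 1)

tokens = [[0.0] * 8 for _ in range(4)]  # 4 tokens, toy embedding dim of 8
out = continuous_concat(tokens, 0.8, -0.2)
print(len(out), len(out[0]))                          # 4 10
print(quantize_emotion(0.8), quantize_emotion(-0.2))  # 4 2
```

In `continuous_token` mode the same (valence, arousal) vector would instead be projected to a single embedding and prepended once, rather than concatenated at every position.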
### Usage

The repository contains three variants of the MIDI generation model, each trained with a different conditioning strategy. Each variant includes:
- `model.pt`: The trained model weights
- `mappings.pt`: Token mappings for MIDI generation
- `model_config.pt`: Model configuration

Additionally, `image_encoder.pt` contains the CLIP-based image emotion encoder.

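A hedged sketch of loading one variant, assuming the `.pt` files are standard PyTorch checkpoints and that a variant's files sit in a local directory named after its conditioning mode (the directory layout is an assumption; consult the repository for the authoritative loading code):

```python
import torch  # assumption: the .pt files are PyTorch checkpoints
from pathlib import Path

def load_variant(variant_dir):
    """Load the three files the card lists for one conditioning variant."""
    variant_dir = Path(variant_dir)
    config = torch.load(variant_dir / "model_config.pt", map_location="cpu")
    mappings = torch.load(variant_dir / "mappings.pt", map_location="cpu")
    weights = torch.load(variant_dir / "model.pt", map_location="cpu")
    return config, mappings, weights

# Hypothetical local path; download the variant's files from this repo first.
variant = Path("continuous_concat")
if variant.is_dir():
    config, mappings, weights = load_variant(variant)
```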
## Intended Use

This model is designed for:
- Generating music that matches the emotional content of artwork
- Exploring emotional transfer between visual and musical domains
- Creative applications in art and music generation

### Limitations

- Music generation quality depends on the emotional interpretation of input images
- Generated MIDI may require human curation for professional use
- The model's emotional understanding is limited to the valence-arousal space

## Training Data

The model combines:
1. Image encoder: Fine-tuned on a curated dataset of artwork with emotional annotations
2. MIDI generation: Uses the Lakh-Spotify dataset as processed by the midi-emotion project

## Attribution

This project builds upon:
- **midi-emotion** by Serkan Sulun et al. ([GitHub](https://github.com/serkansulun/midi-emotion))
  - Paper: "Symbolic music generation conditioned on continuous-valued emotions" ([IEEE Access](https://ieeexplore.ieee.org/document/9762257))
  - Citation: S. Sulun, M. E. P. Davies, and P. Viana, "Symbolic Music Generation Conditioned on Continuous-Valued Emotions," in IEEE Access, vol. 10, pp. 44617-44626, 2022
- **CLIP** by OpenAI for the base image encoder architecture

## License

This model is released under the MIT License. However, usage of the midi-emotion component should comply with its GPL-3.0 license.