Update README.md
Foundation-1 is designed for **pure sample generation**. It excels at generating coherent musical loops that stay locked to tempo and phrase length while allowing layered prompting across instrument families, timbre descriptors, FX, and notation-driven musical behavior.

---

## What Foundation-1 Does

- **Generates musically coherent loops** for production workflows
- **Understands BPM and bar count** for structured loop generation

- **Understands Wet vs Dry production context** — adding terms like *Dry* encourages minimal FX processing, while *Wet* or FX tags produce more processed, spatial, or effected sounds.

---

## Why It Feels Different

Most audio models can react to broad prompt terms like “warm pad” or “bright synth”, but with inconsistent results. Foundation-1 was designed to go further by treating the sound as a layered system:

This layered conditioning approach is a major reason Foundation-1 is able to deliver both **high musicality** and **high prompt control** at the same time.

---

## Audio Showcase

*(Audio demo table not reproduced in this excerpt.)*

---

## Core Capabilities

### 1. Musical Structure
Foundation-1 was trained to produce structured musical material rather than full music or generic textures. Musical notation terms can encourage chord progressions, melodies, arps, phrase direction, rhythmic density, and other musically relevant behaviors.

Foundation-1 is built for **production-ready loop generation**, including BPM-aware and bar-aware structure within supported denominations.

---

## Conditioning Architecture

Foundation-1 was trained with a layered tagging hierarchy designed to improve control, composability, and prompt clarity.

This makes it possible to prompt at different levels of abstraction. A user can stay broad with a family-level prompt like **Synth** or **Keys**, or get more specific with terms like **Synth Lead**, **Wavetable Bass**, **Grand Piano**, **Violin**, or **Trumpet**, then further shape the output using timbral and FX descriptors.
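As a rough illustration, this broad-to-specific layering can be sketched as a small helper. This is not an official API; the function and the tag choices below are hypothetical, and the real tag vocabulary lives in the Tag Reference Sheet.

```python
# Hypothetical helper: assembles a Foundation-1 style prompt from broad
# to specific layers (family -> sub-family -> timbre -> FX -> timing).
# Tag names here are illustrative examples, not an authoritative list.
def build_prompt(family, sub_family=None, timbre=(), fx=(), bpm=None, bars=None):
    parts = [family]
    if sub_family:
        parts.append(sub_family)
    parts.extend(timbre)          # timbre descriptors, e.g. "Warm"
    parts.extend(fx)              # FX descriptors, e.g. "Plate Reverb"
    if bpm is not None:
        parts.append(f"{bpm} BPM")
    if bars is not None:
        parts.append(f"{bars} Bars")
    return ", ".join(parts)

print(build_prompt("Synth", "Synth Lead",
                   timbre=["Warm", "Analog"], fx=["Plate Reverb"],
                   bpm=120, bars=4))
# Synth, Synth Lead, Warm, Analog, Plate Reverb, 120 BPM, 4 Bars
```

Staying broad is then as simple as `build_prompt("Synth")`, with descriptors layered in only as needed.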

---

## Instrument Coverage

### Major Families

<center><img src="./Charts/subfamilites_pie.PNG" alt="Sub-Family Chart" width="80%"></center>

---

## Timbre System

One of Foundation-1’s main strengths is that it was not trained to treat timbre as an afterthought. Timbral character is directly represented in the prompt system, giving users control over not only *what* is being generated, but also *how it sounds*.

Representative timbre descriptors include:

<center><img src="./Charts/timbre_tags_pie.PNG" alt="Timbre Chart" width="80%"></center>

### Why This Matters

This tagging design makes prompts much more flexible. Instead of only asking for an instrument, users can shape:
- tonal balance

For a list of supported tags, please see the **[Tag Reference Sheet](./Master_Tag_Reference.md)**.

---

## FX Layer

Foundation-1 includes a dedicated FX descriptor layer spanning multiple common production effects.

Representative FX tags include:

<center><img src="./Charts/fx_pie.PNG" alt="FX Chart" width="80%"></center>

---

## Musical Notation and Structure

Foundation-1 was trained with structured musical descriptors designed to improve phrase coherence, rhythmic intent, melodic motion, and prompt control.

This notation layer is one of the main reasons Foundation-1 produces unusually coherent musical material instead of static or loosely related phrases. These can be mixed and matched as desired.

---

## Tonal and Timing Support

Foundation-1 is designed for structured music production workflows and supports:

- Supported BPM denominations: **100 BPM, 110 BPM, 120 BPM, 128 BPM, 130 BPM, 140 BPM, 150 BPM**
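Since the model is BPM- and bar-aware, the loop length implied by a prompt follows directly from tempo and bar count. A minimal sketch, assuming 4/4 time (the time signature is an assumption, not stated in this document):

```python
# Loop length in seconds for a given tempo and bar count.
# Assumes 4/4 time; beats_per_bar is an assumption, not a documented value.
def loop_seconds(bpm: float, bars: int, beats_per_bar: int = 4) -> float:
    return bars * beats_per_bar * 60.0 / bpm

# A 4-bar loop at each supported BPM denomination:
for bpm in (100, 110, 120, 128, 130, 140, 150):
    print(f"{bpm} BPM -> {loop_seconds(bpm, 4):.2f} s")
# e.g. 120 BPM gives exactly 8.00 s, 128 BPM gives 7.50 s
```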

---

## Prompt Structure

For best results, use **rich prompts built around the model’s tags**. These tags can be mixed and matched as needed. The model was trained on a structured hierarchy designed to encourage musically coherent sample generation.

Use **FX and timbre tags sparingly at first**, then layer more once you understand the model’s behavior.

---

## One Prompt → Multiple Outputs

Each row below uses the **exact same prompt**, but a different random seed.
The **timbre tags remain unchanged**, so the overall sound character stays consistent while the **melodic and musical content varies** between generations.
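Conceptually, the seed controls the sampling noise while the prompt stays fixed. The sketch below illustrates the idea only; `generate()` is a hypothetical stand-in for the real inference call, not part of any published API.

```python
import random

PROMPT = "Synth Lead, Warm, Analog, 120 BPM, 4 Bars"  # hypothetical example prompt

def generate(prompt: str, seed: int) -> float:
    """Stand-in for the real model call: the seed fully determines the
    sampling noise, so the same (prompt, seed) pair reproduces an output."""
    rng = random.Random(seed)
    return rng.random()  # placeholder for the sampled latent/audio

a = generate(PROMPT, seed=1)
b = generate(PROMPT, seed=2)
assert generate(PROMPT, seed=1) == a   # same seed -> identical output
assert a != b                          # new seed -> varied content
```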

---

## Recommended Workflow

Foundation-1 is best used with the **RC Stable Audio Fork**, which is tuned around this model’s metadata and prompting structure.

Generation speed will vary depending on GPU model and system configuration.

On an **RTX 3090**, generation time is approximately **7–8 seconds per sample**.

---

## Dataset and Training Philosophy

Foundation-1 was built around a **structured sample-generation philosophy**, rather than generic or genre-based audio captioning. The dataset consists entirely of **hand-crafted and labeled audio**, produced through a controlled augmentation pipeline.

For more details on the dataset and training methodology, see the **[Training & Dataset Notes](./training_dataset_info.md)**.

---

## Limitations

Foundation-1 is a specialized model for **music sample generation**, not a general-purpose music generator.

The **RC Stable Audio Fork** automatically handles the timing alignment between the requested generation duration and the musical structure implied by the prompt, making this workflow much easier.
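When generating outside the fork, a similar check can be done by hand: make sure the requested duration is at least as long as the loop the prompt implies. A minimal sketch, assuming 4/4 time (an assumption, not a documented value):

```python
# Does a requested generation duration fit the loop implied by the prompt?
# Assumes 4/4 time; beats_per_bar is an assumption, not a documented value.
def implied_loop_seconds(bpm: float, bars: int, beats_per_bar: int = 4) -> float:
    return bars * beats_per_bar * 60.0 / bpm

def duration_fits(bpm: float, bars: int, gen_seconds: float) -> bool:
    return gen_seconds >= implied_loop_seconds(bpm, bars)

print(duration_fits(140, 4, 8.0))  # ~6.86 s needed -> True
print(duration_fits(100, 4, 8.0))  # 9.6 s needed  -> False
```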

---

## License

This model is licensed under the Stability AI Community License. It is available for non-commercial use or limited commercial use by entities with annual revenues below USD $1M. For revenues exceeding USD $1M, please refer to the repository license file for full terms.

---

### Companion Video

Further information on the model and design philosophy can be found in the companion video:

🎥 **[Watch the Foundation-1 overview and design philosophy video](https://www.youtube.com/watch?v=O2iBBWeWaL8)**

---

## Final Notes

Foundation-1 is intended as a **producer-facing foundation model for structured sample generation**, designed to augment music production rather than replace it.