	---
	language:
	- en
	library_name: stable-audio-3
	license: other
	license_name: stable-audio-community
	license_link: LICENSE
	tags:
	- music
	- sound-effects
	- audio
	- autoencoder
	---

	# SAME: A Semantically-Aligned Music Autoencoder

	Please note: for commercial use, refer to [https://stability.ai/license](https://stability.ai/license).

	## Model Description
	Latent representations are at the heart of most modern generative models.
	In the audio domain they are typically produced by a neural-audio-codec autoencoder.
	In this work we introduce SAME (Semantically Aligned Music autoEncoder),
	a transformer-based autoencoder for stereo music and general audio that reaches a 4096x temporal compression ratio (roughly twice the current standard)
	while maintaining excellent reconstruction quality and strong downstream generative performance.
	We achieve this by combining a set of semantic regularisation approaches with phase-aware reconstruction losses.
	The architecture also delivers substantial computational cost benefits, through both its high compression ratio and its reliance on well-optimised transformer primitives.
	Two variants (a large SAME-L and a CPU-deployable SAME-S) are released in open-weights form.
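To make the 4096x figure concrete, the sketch below estimates how many latent frames a clip would occupy. It assumes the ratio means "input samples per latent frame" and that inputs are padded up to a whole frame. The 44.1 kHz sample rate is a hypothetical value for illustration only, since this card does not state one; read the real value from the model config. The function names are illustrative and not part of any library.

```python
# Back-of-the-envelope latent length for a 4096x temporal compression ratio.
COMPRESSION_RATIO = 4096      # from the model description above
ASSUMED_SAMPLE_RATE = 44_100  # ASSUMPTION: not stated in this card


def latent_frames(num_samples: int, ratio: int = COMPRESSION_RATIO) -> int:
    """Latent frames needed to cover num_samples (ceiling division, i.e. padded)."""
    return -(-num_samples // ratio)


def latent_rate_hz(sample_rate: int, ratio: int = COMPRESSION_RATIO) -> float:
    """Latent frames per second of audio at the given sample rate."""
    return sample_rate / ratio


ten_seconds = 10 * ASSUMED_SAMPLE_RATE
print(latent_frames(ten_seconds))                     # frames for a 10 s clip
print(round(latent_rate_hz(ASSUMED_SAMPLE_RATE), 2))  # frames per second
```

Under these assumptions a 10-second clip maps to 108 latent frames, or roughly 10.8 frames per second of audio.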

	## Usage

	This model can be used with:
	1. the [`stable-audio-3`](https://github.com/Stability-AI/stable-audio-3) inference and fine-tuning library
	2. the [`stable-audio-tools`](https://github.com/Stability-AI/stable-audio-tools) research library


	### Using with `stable-audio-3`
	```python
	import torchaudio
	from stable_audio_3 import AutoencoderModel

	ae = AutoencoderModel.from_pretrained("same-s")
	waveform, sr = torchaudio.load("audio.wav")
	latents = ae.encode(waveform, sr)
	audio_out = ae.decode(latents)
	```

	### Using with `stable-audio-tools`

	```python
	import torch
	import torchaudio
	from stable_audio_tools import get_pretrained_model

	device = "cuda" if torch.cuda.is_available() else "cpu"
	# Use half precision only when running on a GPU
	model_half = device == "cuda"

	# Download model
	model, model_config = get_pretrained_model("stabilityai/SAME-S")
	sample_rate = model_config["sample_rate"]
	sample_size = model_config["sample_size"]

	model = model.to(device)
	if model_half:
	    model = model.to(torch.float16)

	audio, sr = torchaudio.load("/path/to/audiofile")  # [channels, samples]
	# Resample if the file's rate differs from the model's sample rate
	if sr != sample_rate:
	    audio = torchaudio.functional.resample(audio, sr, sample_rate)
	# Duplicate a mono channel to form a stereo input
	if audio.shape[0] == 1:
	    audio = audio.repeat(2, 1)

	audio = audio.unsqueeze(0).to(device)  # [batch, channels, samples]
	if model_half:
	    audio = audio.half()
	with torch.no_grad():
	    latents = model.encode_audio(audio)
	    reconstructed = model.decode_audio(latents)
	reconstructed = reconstructed.squeeze(0)  # [channels, samples]
	# Convert to 16-bit PCM: clamp to [-1, 1], scale, cast to int16
	reconstructed = reconstructed.to(torch.float32).clamp(-1, 1).mul(32767).to(torch.int16).cpu()

	```
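The example above ends with an int16 tensor but does not write it to disk. As a minimal sketch, the same clamp, scale, and truncate arithmetic can be reproduced in plain Python and written out with the standard-library `wave` module. The helpers `float_to_int16` and `write_wav` are hypothetical names introduced here for illustration and are not part of `stable-audio-tools`.

```python
import struct
import wave


def float_to_int16(sample: float) -> int:
    """Clamp to [-1, 1], scale by 32767, and truncate toward zero."""
    clamped = max(-1.0, min(1.0, sample))
    return int(clamped * 32767)


def write_wav(path: str, channels: list, sample_rate: int) -> None:
    """Write 16-bit PCM audio.

    channels: one list of float samples per channel, all the same length.
    """
    num_frames = len(channels[0])
    # WAV stores samples interleaved: L0, R0, L1, R1, ...
    interleaved = [
        float_to_int16(channel[i])
        for i in range(num_frames)
        for channel in channels
    ]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(len(channels))
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack(f"<{len(interleaved)}h", *interleaved))
```

With `reconstructed` converted to nested lists via `reconstructed.tolist()`, a call such as `write_wav("output.wav", channels, sample_rate)` would produce a playable file. When torchaudio is available, `torchaudio.save("output.wav", reconstructed, sample_rate)` is the more direct route.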


	## Model Details
	* Model type: `SAME` is a continuous autoencoder model based on a transformer architecture.
	* Language(s): English
	* License: [Stability AI Community License](https://stability.ai/license).
	* Research Paper: [https://arxiv.org/abs/2605.18613](https://arxiv.org/abs/2605.18613)


	## Training dataset

	### Datasets Used
	Our dataset consists of ~19,500 hours of licensed production audio from [AudioSparx](https://www.audiosparx.com/), comprising a 66/25/9% mix of music, sound effects, and instrument stems, respectively.
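For a rough sense of scale, the stated mix can be turned into approximate hours per category. This sketch assumes the percentages are measured by duration rather than by file count, which the card does not specify, and the total is itself approximate.

```python
TOTAL_HOURS = 19_500  # approximate figure from this card
MIX_PERCENT = {"music": 66, "sound effects": 25, "instrument stems": 9}


def hours_per_category(total_hours: int, mix_percent: dict) -> dict:
    """Split total_hours according to whole-number percentages."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("percentages must sum to 100")
    # Integer arithmetic avoids floating-point rounding noise
    return {name: total_hours * pct // 100 for name, pct in mix_percent.items()}


for name, hours in hours_per_category(TOTAL_HOURS, MIX_PERCENT).items():
    print(f"{name}: ~{hours:,} h")
```

Under that assumption the split is roughly 12,870 hours of music, 4,875 hours of sound effects, and 1,755 hours of instrument stems.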