Stability AI takes AI-powered audio generation to the next level with Stable Audio 2.0




Key notes

  • Stability AI has announced the Stable Audio 2.0 model.
  • Stable Audio 2.0 can generate full-length tracks.
  • The new model can also generate output from audio samples.

After introducing 3D video generation from 2D images last month, Stability AI has announced Stable Audio 2.0 to take AI-generated audio to the next level. Stable Audio 2.0 builds upon Stable Audio 1.0 and lets users generate songs up to three minutes long, complete with an intro, development, and outro, along with stereo sound effects. Beyond full-length tracks, Stable Audio 2.0 brings several other noteworthy enhancements.

While full-track generation will be useful, what seems to be most appreciated by music artists is the newly added audio-to-audio capability. Just as a text prompt can generate music, it's now possible to upload short audio samples for Stability AI to transform into "a wide array of sounds". In other words, what used to be a small idea can now be turned into a fully produced sample, thanks to Stable Audio 2.0.

It's worth pointing out that the final output is customizable: if you don't like something in the generated audio, you can change the style and tone to match your specific needs. That said, the uploaded content should be free of copyright claims.

Sharing some research details about the Stable Audio 2.0 model in its official blog post, Stability AI wrote:

The architecture of the Stable Audio 2.0 latent diffusion model is specifically designed to enable the generation of full tracks with coherent structures. To achieve this, we have adapted all components of the system for improved performance over long time scales. A new, highly compressed autoencoder compresses raw audio waveforms into much shorter representations. For the diffusion model, we employ a diffusion transformer (DiT), akin to that used in Stable Diffusion 3, in place of the previous U-Net, as it is more adept at manipulating data over long sequences. The combination of these two elements results in a model capable of recognizing and reproducing the large-scale structures that are essential for high-quality musical compositions.
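The idea behind that design can be illustrated in miniature. The sketch below is not Stability AI's code; it is an assumed, toy stand-in showing why compressing raw audio into a short latent sequence matters: a three-minute waveform has millions of samples, far too many for a transformer to attend over directly, while the compressed latent is orders of magnitude shorter. The compression factor and function names here are illustrative assumptions.

```python
import numpy as np

# Toy compression factor; the real autoencoder's ratio and architecture differ.
COMPRESSION = 64

def encode(waveform: np.ndarray) -> np.ndarray:
    """Toy 'autoencoder' encoder: compress raw audio into a much
    shorter latent by block-averaging (stand-in for a learned model)."""
    trimmed = waveform[: len(waveform) // COMPRESSION * COMPRESSION]
    return trimmed.reshape(-1, COMPRESSION).mean(axis=1)

def decode(latent: np.ndarray) -> np.ndarray:
    """Toy decoder: expand the latent back toward waveform length."""
    return np.repeat(latent, COMPRESSION)

sample_rate = 44_100
seconds = 180  # a full three-minute track
waveform = np.zeros(sample_rate * seconds, dtype=np.float32)

latent = encode(waveform)
# The diffusion transformer would operate on `latent`, a sequence
# ~64x shorter than the raw waveform, making long-range structure
# (intro, development, outro) tractable to model.
restored = decode(latent)
```

The point of the example is only the sequence-length arithmetic: the transformer sees roughly 124 thousand latent steps instead of nearly 8 million raw samples, which is what makes coherent full-track structure feasible.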

Not only does Stable Audio 2.0 generate full-length tracks, but it also helps you produce various sound and audio effects, ranging from the clatter of keyboard typing to the roar of a crowd.

If all of this sounds impressive, you can start using it today for free on the Stable Audio website. Meanwhile, Stable Audio 2.0 will be available on the Stable Audio API "soon".

More about the topics: audio-generation, Stability AI, Stable Audio 2.0