Stability AI takes AI-powered audio generation to the next level with Stable Audio 2.0




Key notes

  • Stability AI has announced the Stable Audio 2.0 model.
  • Stable Audio 2.0 can generate full-length tracks.
  • The new model can also generate output from audio samples.

After introducing 3D video generation from 2D images last month, Stability AI has announced Stable Audio 2.0 to take AI-generated audio to the next level. Stable Audio 2.0 builds upon Stable Audio 1.0 and lets users generate songs up to three minutes long, complete with an intro, development, and outro, along with stereo sound effects. Beyond full-length tracks, Stable Audio 2.0 brings several other noteworthy enhancements.

While full-track generation will be useful, what seems to be most appreciated by music artists is the newly added audio-to-audio capability. Just as a text prompt can generate music, it's now possible to upload short audio samples for Stability AI to transform into "a wide array of sounds". In other words, what used to be a small idea can now be turned into a fully produced sample, thanks to Stable Audio 2.0.

It's worth pointing out that the final output is customizable: if you don't like something in the generated audio, you can change the style and tone to match your specific needs. That said, the uploaded content should be free of copyright claims.

Sharing some research details about the Stable Audio 2.0 model in its official blog post, Stability AI wrote:

The architecture of the Stable Audio 2.0 latent diffusion model is specifically designed to enable the generation of full tracks with coherent structures. To achieve this, we have adapted all components of the system for improved performance over long time scales. A new, highly compressed autoencoder compresses raw audio waveforms into much shorter representations. For the diffusion model, we employ a diffusion transformer (DiT), akin to that used in Stable Diffusion 3, in place of the previous U-Net, as it is more adept at manipulating data over long sequences. The combination of these two elements results in a model capable of recognizing and reproducing the large-scale structures that are essential for high-quality musical compositions.
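The idea behind that design can be illustrated in miniature. The sketch below is not Stability AI's code; it is an assumed, toy stand-in showing why compressing raw audio into a short latent sequence matters: a three-minute waveform has millions of samples, far too many for a transformer to attend over directly, while the compressed latent is orders of magnitude shorter. The compression factor and function names here are illustrative assumptions.

```python
import numpy as np

# Toy compression factor; the real autoencoder's ratio and architecture differ.
COMPRESSION = 64

def encode(waveform: np.ndarray) -> np.ndarray:
    """Toy 'autoencoder' encoder: compress raw audio into a much
    shorter latent by block-averaging (stand-in for a learned model)."""
    trimmed = waveform[: len(waveform) // COMPRESSION * COMPRESSION]
    return trimmed.reshape(-1, COMPRESSION).mean(axis=1)

def decode(latent: np.ndarray) -> np.ndarray:
    """Toy decoder: expand the latent back toward waveform length."""
    return np.repeat(latent, COMPRESSION)

sample_rate = 44_100
seconds = 180  # a full three-minute track
waveform = np.zeros(sample_rate * seconds, dtype=np.float32)

latent = encode(waveform)
# The diffusion transformer would operate on `latent`, a sequence
# ~64x shorter than the raw waveform, making long-range structure
# (intro, development, outro) tractable to model.
restored = decode(latent)
```

The point of the example is only the sequence-length arithmetic: the transformer sees roughly 124 thousand latent steps instead of nearly 8 million raw samples, which is what makes coherent full-track structure feasible.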

Not only does Stable Audio 2.0 generate full-length tracks, but it also helps you produce various sound and audio effects, ranging from the clatter of keyboard typing to the roar of a crowd.

If all of this sounds impressive, you can start using it today for free on the Stable Audio website. Meanwhile, Stable Audio 2.0 will be available on the Stable Audio API "soon".

More about the topics: audio-generation, Stability AI, Stable Audio 2.0