State-of-the-art audio enhancement in 5 minutes

In multimedia and podcasting, user experience often hinges on audio quality. Clean, clear audio captivates an audience and keeps them engaged with your content, whether that content is a podcast, a tutorial, or a promo. Audio quality is not just about user experience, either: it also plays a crucial role in the accuracy of downstream AI tasks such as automated transcription, for example with OpenAI's Whisper. In fact, an entire industry of startups, including Descript and Krisp AI, is working in this space.

With the help of two open-source models, AudioSR and DeepFilterNet, we’ve launched an audio enhancement app that removes background noise and enhances speech, making sure your audio quality is always up to par.

Here’s a quick example of the sound quality difference using open-source models:

In this post, we’ll go through a quick demonstration of how you might integrate this solution into your AI project and some background on the models we used.

Trying it out

You can try out the pre-built audio enhancement app in a few clicks here with your own audio samples. Here are a few more samples from podcasts and YouTube videos:

Sample 1

Original

Enhanced

Sample 2

Original

Enhanced

Run via API or Python

You can also integrate the app into your current workflow through an API call or Python call with the following steps:

Run audio enhancement via API (or see below for Python)

curl -X POST https://mango.sievedata.com/v2/push \
-H "X-API-Key: <your-api-key>" \
-d '{
  "function": "sieve/audio_enhancement",
  "inputs": {
    "audio": {
      "url": "<your-audio-url>"
    }
  }
}'
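If you would rather build the same request from a script without installing the client, the curl call above can be sketched with Python's standard library. This is a minimal sketch, not an official client: the endpoint, header name, and payload are copied from the curl example, while the `build_push_request` helper and the explicit `Content-Type` header are assumptions added for illustration.

```python
import json
import urllib.request

# Endpoint copied from the curl example above.
PUSH_URL = "https://mango.sievedata.com/v2/push"


def build_push_request(api_key: str, audio_url: str) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example.

    Hypothetical helper: the name and signature are illustrative only.
    """
    payload = {
        "function": "sieve/audio_enhancement",
        "inputs": {"audio": {"url": audio_url}},
    }
    return urllib.request.Request(
        PUSH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "X-API-Key": api_key,
            # Assumption: the curl example sets no Content-Type explicitly;
            # JSON is declared here because the body is JSON.
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send the request. The response format is not described in this post, so the sketch stops short of parsing it.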

Run via Python client

Install the Python client

pip install sievedata

sieve login

Run this Python script with your own audio!

import sieve

audio_enhancer = sieve.function.get('sieve/audio_enhancement')

# Specify "upsample", "noise" or "all" for the filtering type
enhanced_audio = audio_enhancer.run(sieve.Audio(path="./speech.wav"), "upsample")

# View results on Sieve dashboard or locally from this path
print(enhanced_audio.path)

It’s as simple as that! Results will now be viewable on your Sieve dashboard or directly from your Python code.

How it works

The magic lies in pairing two open-source AI models: AudioSR, which handles upsampling, and DeepFilterNet, which handles noise removal.

AudioSR

AudioSR is a generative model that uses a diffusion-based approach to estimate the high-frequency components of a low-resolution audio signal. Under the hood, a latent diffusion model is trained to generate high-resolution spectrograms conditioned on low-resolution ones. The model accepts a flexible input sampling rate between 4 kHz and 32 kHz, which covers most real-world scenarios. AudioSR has achieved promising results on speech, music, and sound effects across different input sampling rates, and it has been verified as a plug-and-play module for improving the output quality of various audio generation models.
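To make the sampling-rate range concrete: a signal sampled at a given rate can only carry frequencies up to half that rate (the Nyquist limit), so everything above that limit is what a super-resolution model has to estimate. The sketch below works out that missing band. The 48 kHz target rate is an assumption based on what the AudioSR authors report, not something stated in this post, and the function names are made up for illustration.

```python
from typing import Tuple


def nyquist_bandwidth_hz(sample_rate_hz: float) -> float:
    """Highest frequency representable at the given sampling rate."""
    return sample_rate_hz / 2.0


def missing_band_hz(
    input_rate_hz: float, target_rate_hz: float = 48_000.0
) -> Tuple[float, float]:
    """Frequency range absent from the input that upsampling must fill in.

    target_rate_hz defaults to 48 kHz, an assumed output rate.
    """
    low = nyquist_bandwidth_hz(input_rate_hz)
    high = nyquist_bandwidth_hz(target_rate_hz)
    if low >= high:
        raise ValueError("input already covers the target bandwidth")
    return (low, high)
```

For example, `missing_band_hz(8_000)` returns `(4000.0, 24000.0)`: telephone-quality audio sampled at 8 kHz has nothing above 4 kHz, so under this assumption the model would need to synthesize the band from 4 kHz to 24 kHz.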

DeepFilterNet

DeepFilterNet is a deep learning-based speech enhancement framework that exploits the harmonic structure of speech to remove unwanted noise from audio efficiently. It operates in two stages: the first enhances the speech envelope in the ERB (equivalent rectangular bandwidth) domain, and the second uses deep filtering to enhance the periodic component. Optimizations to the training procedure, data augmentation, and network structure yield state-of-the-art speech enhancement performance while reducing the processing cost, which makes the model suitable for running on embedded devices in real time.
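For readers unfamiliar with the ERB domain, the idea is to group frequencies into bands that are narrow at low frequencies and wide at high frequencies, roughly following how hearing resolves pitch. The sketch below computes such band edges with the widely used Glasberg and Moore approximation of the ERB-rate scale. It is only an illustration of the scale: DeepFilterNet's actual band layout is not reproduced here, and the function names, the 32-band count, and the 24 kHz upper limit in the usage note are assumptions.

```python
import math
from typing import List


def hz_to_erb_rate(freq_hz: float) -> float:
    """Map a frequency in Hz to the ERB-rate scale (Glasberg and Moore approximation)."""
    return 21.4 * math.log10(1.0 + 0.00437 * freq_hz)


def erb_rate_to_hz(erb_rate: float) -> float:
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (erb_rate / 21.4) - 1.0) / 0.00437


def erb_band_edges(n_bands: int, max_hz: float) -> List[float]:
    """Edges of n_bands bands spaced evenly on the ERB-rate scale from 0 Hz to max_hz."""
    if n_bands < 1 or max_hz <= 0:
        raise ValueError("n_bands must be >= 1 and max_hz must be positive")
    top = hz_to_erb_rate(max_hz)
    # Even steps on the ERB-rate scale become progressively wider steps in Hz.
    return [erb_rate_to_hz(top * i / n_bands) for i in range(n_bands + 1)]
```

Calling `erb_band_edges(32, 24_000.0)` gives 33 edges whose spacing in Hz grows steadily from the lowest band to the highest. Working on a few dozen such bands instead of hundreds of individual frequency bins is one reason an envelope stage of this kind can stay cheap.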

What next?

Audio enhancement can be combined with most other AI audio functionality, typically as an upstream step. For example, a feature like speech editing (also available on Sieve) can use audio enhancement to improve the quality of its generated voices.

Sieve's cloud platform makes combining functionality in this way easy. To try it, create an account!