In the world of multimedia and podcasting, user experience often hinges on audio quality. High-quality audio captivates audiences and keeps them engaged with your content, whether it is a podcast, a tutorial, or a promo. Beyond user experience, audio quality also plays a crucial role in the accuracy of downstream AI tasks such as automated transcription with OpenAI's Whisper. In fact, an entire industry of startups, including Descript and Krisp AI, is working in this space.
With the help of a few open source models, namely AudioSR and DeepFilterNet, we’ve launched an audio enhancement app that helps remove background noise and enhance speech, making sure your audio quality is always up to par.
Here’s a quick example of the sound quality difference using open-source models:
In this post, we’ll go through a quick demonstration of how you might integrate this solution into your AI project and some background on the models we used.
Trying it out
You can try out the pre-built audio enhancement app in a few clicks here with your own audio samples. Here are a few more samples from podcasts and YouTube videos:
Sample 1
Sample 2
Run via API or Python
You can also integrate the app into your current workflow through an API call or Python call with the following steps:
- Sign up for a Sieve account and find your API key here.
- Run audio enhancement via API (or see below for Python)
    curl -X POST https://mango.sievedata.com/v2/push \
      -H "X-API-Key: <your-api-key>" \
      -d '{
        "function": "sieve/audio_enhancement",
        "inputs": {
          "audio": {
            "url": "<your-audio-url>"
          }
        }
      }'
- Run via Python client
  - Install the Python client
      pip install sievedata
  - Log in with your API key
      sieve login
  - Run this Python script with your own audio!
      import sieve

      audio_enhancer = sieve.function.get('sieve/audio_enhancement')
      # Specify "upsample", "noise" or "all" for the filtering type
      enhanced_audio = audio_enhancer.run(sieve.Audio(path="./speech.wav"), "upsample")
      # View results on Sieve dashboard or locally from this path
      print(enhanced_audio.path)
It’s as simple as that! Results will now be viewable on your Sieve dashboard or directly from your Python code.
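If you would rather not shell out to curl, the same request can be assembled with Python's standard library. This is a minimal sketch, assuming the endpoint and JSON payload from the curl command above. The `build_push_request` helper name and the explicit `Content-Type: application/json` header are additions made here for illustration, and the shape of the response is not covered.

```python
import json
import urllib.request

# Endpoint taken from the curl example in this post.
SIEVE_PUSH_URL = "https://mango.sievedata.com/v2/push"


def build_push_request(api_key: str, audio_url: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for sieve/audio_enhancement."""
    payload = {
        "function": "sieve/audio_enhancement",
        "inputs": {"audio": {"url": audio_url}},
    }
    return urllib.request.Request(
        SIEVE_PUSH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "X-API-Key": api_key,
            # Assumption: the curl example sets no content type; JSON is declared here.
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_push_request("<your-api-key>", "<your-audio-url>")
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # To actually submit the job, uncomment the next lines:
    # with urllib.request.urlopen(request) as response:
    #     print(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it easy to inspect the payload before spending any API credits.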
How it works
The app chains two open-source AI models: AudioSR and DeepFilterNet.
AudioSR
AudioSR is a generative model that uses a diffusion-based approach to estimate the high-frequency components of a low-resolution audio signal. Under the hood, a latent diffusion model is trained to generate high-resolution spectrograms conditioned on low-resolution ones. The model accepts any input sampling rate between 4 kHz and 32 kHz, which covers most real-world scenarios. AudioSR has shown promising results on speech, music, and sound effects across different input sampling rates, and it has been verified to work as a plug-and-play module for improving the output of various audio generation models.
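To make "high-frequency components" concrete: a signal sampled at a given rate can only represent content up to half that rate (the Nyquist frequency), so everything above that limit has to be generated by the model. The sketch below computes that missing band for a few input rates. The 48 kHz default target is the output rate reported for AudioSR rather than something stated in this post, and `missing_band` is a hypothetical helper used only for illustration.

```python
def missing_band(input_sr_hz: float, target_sr_hz: float = 48_000.0) -> tuple[float, float]:
    """Return the (low, high) frequency range in Hz that super-resolution must fill in."""
    if not 0 < input_sr_hz <= target_sr_hz:
        raise ValueError("input rate must be positive and no higher than the target rate")
    # A sampled signal carries content only up to half its sampling rate (Nyquist).
    return input_sr_hz / 2, target_sr_hz / 2


if __name__ == "__main__":
    for sr in (4_000, 16_000, 32_000):
        low, high = missing_band(sr)
        print(f"{sr} Hz input: generate {low:.0f} Hz to {high:.0f} Hz")
```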
DeepFilterNet
DeepFilterNet is a deep learning-based speech enhancement framework that exploits the harmonic structure of speech to remove unwanted noise from audio efficiently. It operates in two stages: the first enhances the speech envelope in the ERB (equivalent rectangular bandwidth) domain, and the second uses deep filtering to enhance the periodic component. Several optimizations to the training procedure, data augmentation, and network structure yield state-of-the-art speech enhancement performance at a reduced computational cost, making the model suitable for running on embedded devices in real time.
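The ERB domain mentioned above groups frequencies into bands that are narrow at low frequencies and wide at high frequencies, roughly mirroring human hearing, so the first stage only has to predict a gain for a few dozen bands instead of hundreds of frequency bins. The sketch below lays out such bands using the widely used Glasberg and Moore ERB-rate formula. It is illustrative only and is not DeepFilterNet's own filterbank code; the band count of 32 and the 24 kHz upper limit are assumptions chosen for the example.

```python
import math


def hz_to_erb(freq_hz: float) -> float:
    """Map a frequency in Hz to the ERB-rate scale (Glasberg and Moore, 1990)."""
    return 21.4 * math.log10(1.0 + 0.00437 * freq_hz)


def erb_to_hz(erb: float) -> float:
    """Inverse of hz_to_erb."""
    return (10.0 ** (erb / 21.4) - 1.0) / 0.00437


def erb_band_edges(num_bands: int, max_hz: float) -> list[float]:
    """Return num_bands + 1 band edges in Hz, equally spaced on the ERB scale."""
    top = hz_to_erb(max_hz)
    return [erb_to_hz(top * i / num_bands) for i in range(num_bands + 1)]


if __name__ == "__main__":
    # Assumed values for illustration: 32 bands covering 0 Hz to 24 kHz.
    edges = erb_band_edges(num_bands=32, max_hz=24_000.0)
    print([round(e) for e in edges[:4]], "...", round(edges[-1]))
```

Because the spacing is uniform on the ERB scale, each band is wider in Hz than the one below it, which is what lets a small number of bands cover the full spectrum.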
Other Solutions
Let's take a look at how the open-source pipeline compares with a prominent commercial alternative, the Dolby Enhance API:
Original
Dolby
AudioSR + DeepFilterNet
Very promising results from a purely open-source solution!
What next?
Audio enhancement is a feature that can be combined with most other AI audio functionality. For example, a feature like speech editing (also featured on Sieve) can use audio enhancement to improve the quality of its generated voices.
Sieve's cloud platform makes combining functionality in this way easy. To try it, create an account!