Our team recently saw the new “sound effect” feature from Pika Labs and found it extremely fun to play with. So we decided to make our own version! Basically, the goal is to take a 5-10 second input clip and generate corresponding sounds that match it.
Generally we think something like this could be useful for folks trying to add sound effects to stock video content. Instead of using stock audio and sound effect libraries like Storyblocks, Adobe, or PremiumBeat — why not generate it with AI?
Adding the right sound effects to a video means being able to understand what’s going on in the video, imagine what it could sound like, and then generate appropriate sounds given all of that context.
How it Works
We’ve seen multimodal AI models like OpenAI’s GPT-4 take in simple text prompts and generate images. This works by getting the LLM to first expand the text prompt, which is then fed into DALL·E 3. Similarly, our idea is to do the following:
- extract some frames from the stock video
- get a model to describe what’s happening in the video and what it might sound like
- get a model to generate audio using that description
And it turns out there are some great open-source models that let us do just this. Vision language models (VLMs) are starting to get really good. These models can take in an image along with a prompt and respond based on what’s in the image. We have a few of these available on Sieve, such as CogVLM, InternLM, and Moondream, each of which comes with its own cost / quality tradeoff. For this app, we picked CogVLM, which we found to be really descriptive. Here’s how we use Sieve to prompt it.
import sieve

# Ask CogVLM to describe what the image might sound like.
image = sieve.Image(path="./some_image.jpg")
prompt = "describe what you might hear in this image in detail."
cogvlm_chat = sieve.function.get("sieve/cogvlm-chat")
output = cogvlm_chat.run(image, prompt)
print(output)
We then take this description and feed it into AudioLDM, a state-of-the-art audio generation model.
# Generate audio that matches the description.
audioldm = sieve.function.get("sieve/audioldm")
audio = audioldm.run(output)
print(audio)
In the case of video, we decided to simply sample the middle frame to pass into our VLM, though there could be more robust approaches to this using video-native models in the future. Once all of this is complete, we’re able to create videos like these. All of the sound you hear was generated with AI!
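If you’re curious what that frame-sampling step looks like in code, here’s a minimal sketch that grabs the middle frame with OpenCV and chains it with the two Sieve calls shown above. The cv2 dependency and the middle_frame helper are just our illustration here; the open-sourced app may implement this differently.

import cv2
import sieve

def middle_frame(video_path: str) -> sieve.Image:
    # Sample the middle frame of the clip and save it as an image for the VLM.
    capture = cv2.VideoCapture(video_path)
    frame_count = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
    capture.set(cv2.CAP_PROP_POS_FRAMES, frame_count // 2)
    ok, frame = capture.read()
    capture.release()
    if not ok:
        raise RuntimeError(f"could not read a frame from {video_path}")
    cv2.imwrite("middle_frame.jpg", frame)
    return sieve.Image(path="middle_frame.jpg")

# Chain the two Sieve functions: describe the frame, then generate matching audio.
image = middle_frame("./some_video.mp4")
prompt = "describe what you might hear in this image in detail."
description = sieve.function.get("sieve/cogvlm-chat").run(image, prompt)
audio = sieve.function.get("sieve/audioldm").run(description)
print(audio)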
You can try the app for yourself here, and we’ve open-sourced all the code here so you can modify it however you like.