An active speaker detection model to detect which people are speaking in a video.

TalkNet-ASD

This is a heavily modified version of the TalkNet model from this repository. All credit goes to the author.

Pricing

We price per minute of video, bucketed by resolution: standard definition (≤720p), high definition (>720p, up to 1080p), and 4K (>1080p). If you supply face boxes, we multiply the price by 0.7, since we can skip the expensive face detection step. If a debug visualization is generated, we multiply the price by 1.5 to cover the rendering time. The pay-as-you-go rates are listed below.

| Resolution | Price / Minute | Price / Minute with Debug Visualization | Price / Minute with Face Boxes | Price / Minute with Debug Visualization and Face Boxes |
| --- | --- | --- | --- | --- |
| > 1080p (4k) | $0.13 | $0.195 | $0.091 | $0.1365 |
| > 720p (up to 1080p) | $0.065 | $0.0975 | $0.0455 | $0.06825 |
| ≤ 720p | $0.052 | $0.078 | $0.0364 | $0.0546 |

Note: If your video is poorly encoded, we will re-encode it for you, since it would otherwise make the pipeline prohibitively slow. Re-encoding is charged at $0.01 per compute minute.
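
As a rough illustration of how the multipliers combine, here is a minimal sketch of a cost estimate in Python. The function and constant names are hypothetical, and two details are assumptions not stated above: that the resolution bucket is chosen by frame height, and that partial minutes are billed proportionally. The re-encoding charge is not included.

```python
# Base pay-as-you-go rates per minute of video, taken from the pricing table.
BASE_RATE_PER_MINUTE = {"720p": 0.052, "1080p": 0.065, "4k": 0.13}

FACE_BOXES_MULTIPLIER = 0.7    # face detection step is skipped
DEBUG_VISUAL_MULTIPLIER = 1.5  # extra rendering time


def resolution_bucket(height: int) -> str:
    """Map a frame height in pixels to a pricing bucket (assumed to be height-based)."""
    if height <= 720:
        return "720p"
    if height <= 1080:
        return "1080p"
    return "4k"


def estimate_cost(minutes: float, height: int,
                  face_boxes_supplied: bool = False,
                  debug_visual: bool = False) -> float:
    """Estimate the cost in USD, excluding any re-encoding charge."""
    rate = BASE_RATE_PER_MINUTE[resolution_bucket(height)]
    if face_boxes_supplied:
        rate *= FACE_BOXES_MULTIPLIER
    if debug_visual:
        rate *= DEBUG_VISUAL_MULTIPLIER
    return minutes * rate
```

For example, `estimate_cost(10, 1080, debug_visual=True)` corresponds to ten minutes at the $0.0975 rate, or about $0.975.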

Notes

Face Boxes

If you have already detected face bounding boxes and only want to perform speaker detection, you can skip the S3FD face detection step by supplying your own boxes as a string in the format frame_1,x0,y0,x1,y1,confidence, with one box per line. Here is an example of a valid input:

10,767.00,219.00,1060.00,654.00,0.9
11,753.00,218.00,1064.00,651.00,0.9
...
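
If your boxes come from your own detector, a small helper can build this string. The sketch below is one possible way to do it in Python. The function name and the tuple layout are hypothetical, and formatting coordinates to two decimal places is an assumption based on the example above, not a documented requirement.

```python
from typing import Iterable, Tuple

# (frame number, x0, y0, x1, y1, confidence): a hypothetical in-memory layout.
FaceBox = Tuple[int, float, float, float, float, float]


def format_face_boxes(boxes: Iterable[FaceBox]) -> str:
    """Serialize boxes as newline-separated 'frame,x0,y0,x1,y1,confidence' rows."""
    lines = []
    for frame, x0, y0, x1, y1, confidence in boxes:
        if x1 <= x0 or y1 <= y0:
            raise ValueError(f"degenerate box on frame {frame}")
        # Two-decimal coordinates mirror the example input (assumed, not required).
        lines.append(f"{frame},{x0:.2f},{y0:.2f},{x1:.2f},{y1:.2f},{confidence}")
    return "\n".join(lines)


face_boxes = format_face_boxes([
    (10, 767, 219, 1060, 654, 0.9),
    (11, 753, 218, 1064, 651, 0.9),
])
```

The resulting `face_boxes` string reproduces the first two rows of the example and can be passed as the face boxes input.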

In Memory Threshold

For the in_memory_threshold parameter, we recommend a value of 3000 or less, as anything higher will cause a memory overload. Keeping frames in memory is a great way to make your request process faster. The default is 3000, and there shouldn't be a need to change it.