Sapiens
Sapiens is a collection of models released by Meta focused on human-centric vision tasks.
Repo: https://github.com/facebookresearch/sapiens
Paper: https://arxiv.org/abs/2408.12569
We host the following Sapiens model types. Choose one by setting the mode parameter to one of the following:
normal: Surface Normal Estimation
segmentation: Body Part Segmentation
depth: Depth Estimation
Each model is available in two sizes, 0.3b and 1b. Choose one by setting the model_size parameter to one of the following:
0.3b: faster, lower quality
1b: slower, higher quality
We found results are best when the model is applied to an image/video with a single person in-frame, where the person's body takes up the majority of the image. It also helps for the video/image to have a simple background.
The endpoint accepts videos of any length, but be aware that the Sapiens models are expensive to run. We advise trying shorter clips before submitting longer videos (2+ minutes).
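One way to follow the advice above is to cut a short test clip locally before submitting the full video. The sketch below assumes the ffmpeg command-line tool is installed; it is not part of the Sapiens endpoint, and the helper names build_trim_command and trim_clip are hypothetical.

```python
import subprocess


def build_trim_command(src: str, dst: str, start_s: float = 0.0,
                       duration_s: float = 10.0) -> list:
    """Build an ffmpeg command that copies a short clip out of `src`."""
    return [
        "ffmpeg", "-y",           # -y: overwrite dst if it already exists
        "-ss", str(start_s),      # seek to the start time in the input
        "-t", str(duration_s),    # read at most this many seconds
        "-i", src,
        "-c", "copy",             # stream copy: no re-encoding, cuts land on keyframes
        dst,
    ]


def trim_clip(src: str, dst: str, start_s: float = 0.0,
              duration_s: float = 10.0) -> None:
    """Run ffmpeg to write the clip (assumes ffmpeg is on PATH)."""
    subprocess.run(build_trim_command(src, dst, start_s, duration_s), check=True)


# Show the command that would be run for a 10-second test clip.
print(" ".join(build_trim_command("path/to/video.mp4", "path/to/clip.mp4")))
```

Because the clip is stream-copied, its start and end snap to keyframes, so the result may be slightly longer or shorter than requested. That is usually acceptable for a quick trial run.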
See Examples for a walkthrough on using each of the Sapiens models.
Pricing
This function runs on an A100 40GB GPU and is billed based on our compute-based pricing rates at $4.20/hr.
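As a rough illustration of what the rate above implies, the sketch below converts compute time into an estimated cost. It assumes billing is linear in compute time; any rounding, minimum charge, or queueing overhead applied by the platform is not modeled, and the function name estimate_cost_usd is hypothetical.

```python
# Rate quoted above for the A100 40GB GPU.
A100_RATE_USD_PER_HOUR = 4.20


def estimate_cost_usd(compute_seconds: float,
                      rate_per_hour: float = A100_RATE_USD_PER_HOUR) -> float:
    """Estimate the cost of a run, assuming cost is linear in compute time."""
    if compute_seconds < 0:
        raise ValueError("compute_seconds must be non-negative")
    return compute_seconds / 3600.0 * rate_per_hour


# Example: a run that keeps the GPU busy for 10 minutes.
print(f"${estimate_cost_usd(10 * 60):.2f}")  # prints "$0.70"
```

Note that compute time is not the same as video length: how long the GPU is busy depends on the mode, the model size, and the resolution of the input.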
Output
This endpoint supports both video and image inputs.
Each model returns the following items:
raw output: The raw output of the model (a NumPy .npy file).
visualization: A visualization of the output (a .png image).
If the raw_output parameter is set to true, only the raw output will be returned. This reduces costs, because creating the visualizations for depth estimation and surface normal estimation requires running the segmentation model as well.
Video Outputs
Frame-level video outputs are returned as a dictionary of zipfiles, one for each output type. Each zipfile contains the output for every frame in the video, named by its zero-based frame number (e.g. the visualization for the 3rd frame will have the file name 000002.png).
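The frame naming scheme above can be sketched with the standard-library zipfile module. This is a minimal illustration, assuming every member is named by a zero-padded six-digit frame number as in the 000002.png example; the helper names frame_filename and list_frames are hypothetical, and the demo builds a small in-memory archive rather than using a real endpoint output.

```python
import io
import zipfile


def frame_filename(frame_index: int, extension: str = "png") -> str:
    """File name for a zero-based frame index, e.g. 2 -> '000002.png'."""
    return f"{frame_index:06d}.{extension}"


def list_frames(zip_source) -> list:
    """Return (frame_index, member_name) pairs sorted by frame index.

    `zip_source` is a path or file-like object accepted by zipfile.ZipFile.
    Members whose base name is not a plain number are skipped.
    """
    frames = []
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            stem = name.rsplit("/", 1)[-1].rsplit(".", 1)[0]
            if stem.isdigit():
                frames.append((int(stem), name))
    return sorted(frames)


# Demo: an in-memory zip standing in for a frame-level output archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for i in (2, 0, 1):
        archive.writestr(frame_filename(i), b"")
buffer.seek(0)
print(list_frames(buffer))
```

With a real result, the path of a returned zipfile (for example output_dict["visualization"].path) could be passed to list_frames in place of the in-memory buffer.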
For video inputs, setting the make_video parameter to true will return, in addition to the other outputs, a visualization of the segmentation overlaid onto the original video. Setting raw_output=True overrides make_video: no visualization video will be returned.
When a video is returned, the output format is video, dict[str, zipfile]. Otherwise, the output format is dict[str, file].
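Because the shape of the result for a video input depends on two flags, it can help to normalize it in one place. The sketch below encodes the rules described above (a visualization video is returned only when make_video is true and raw_output is not); split_video_result is a hypothetical helper, not part of the sieve package, and the demo uses placeholder strings instead of real file objects.

```python
from typing import Any, Optional, Tuple


def split_video_result(result: Any, make_video: bool,
                       raw_output: bool) -> Tuple[Optional[Any], dict]:
    """Normalize a video-input result to (video_or_None, outputs_dict)."""
    if make_video and not raw_output:
        # Format: video, dict[str, zipfile]
        video, outputs = result
        return video, outputs
    # Format: a dictionary only (raw_output=True overrides make_video).
    return None, result


# Demo with placeholder values standing in for returned file objects.
video, outputs = split_video_result(
    ("video-file", {"raw": "raw-zip", "visualization": "vis-zip"}),
    make_video=True,
    raw_output=False,
)
print(video, sorted(outputs))

video, outputs = split_video_result(
    {"raw": "raw-zip"}, make_video=True, raw_output=True
)
print(video, sorted(outputs))
```

Passing the same flag values to this helper that were passed to the endpoint keeps downstream code from having to guess whether it received a tuple or a dictionary.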
Image Outputs
The output format for a single image is raw_output if raw_output=True; otherwise it is visualization, raw_output.
For more information on the output formats for each model, see Models.
Models
Body Part Segmentation
This mode is used to segment the body parts of a person in an image. The classes and their corresponding ids can be found here: classes
The following items are returned:
raw: The class ids for each pixel in a frame.
visualization: A visualization of the segmentation overlaid onto the original image.
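Once the raw segmentation output is loaded with NumPy, per-class statistics and masks follow directly from the class ids. The sketch below assumes the raw output is a 2D integer array of shape (height, width); the class ids in the demo array are made up for illustration and do not correspond to the real class table, and the helper names are hypothetical.

```python
import numpy as np


def class_pixel_counts(class_ids: np.ndarray) -> dict:
    """Map each class id present in the frame to its pixel count."""
    ids, counts = np.unique(class_ids, return_counts=True)
    return {int(i): int(c) for i, c in zip(ids, counts)}


def class_mask(class_ids: np.ndarray, class_id: int) -> np.ndarray:
    """Boolean mask that is True where a pixel belongs to `class_id`."""
    return class_ids == class_id


# Demo on a tiny 2x3 "frame" with made-up class ids.
frame = np.array([[0, 0, 1],
                  [2, 1, 1]])
print(class_pixel_counts(frame))  # {0: 2, 1: 3, 2: 1}
print(class_mask(frame, 1))
```

For a real frame, the array would come from np.load on the returned raw output file, and the ids would be looked up in the class list linked above.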
Depth Estimation
This mode estimates the depth of each pixel belonging to a person in an image.
The following items are returned:
raw output: The estimated depth for each pixel in a frame.
depth visualization: A normalized visualization of the depth map.
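To work with the raw depth map directly, a common first step is to rescale it to a fixed range. The sketch below shows plain min-max normalization over the valid pixels; it is not necessarily the normalization the endpoint uses for its own visualization, and it assumes pixels outside the person are marked with non-finite values such as NaN, which this document does not confirm.

```python
import numpy as np


def normalize_depth(depth: np.ndarray) -> np.ndarray:
    """Min-max scale finite depth values to [0, 1]; other pixels become NaN."""
    depth = depth.astype(np.float64)
    valid = np.isfinite(depth)
    out = np.full(depth.shape, np.nan)
    if not valid.any():
        return out  # nothing to normalize
    lo, hi = depth[valid].min(), depth[valid].max()
    if hi == lo:
        out[valid] = 0.0  # flat depth: avoid division by zero
    else:
        out[valid] = (depth[valid] - lo) / (hi - lo)
    return out


# Demo: one background pixel (NaN) and three person pixels.
depth = np.array([[np.nan, 1.0],
                  [2.0, 3.0]])
print(normalize_depth(depth))
```

If the background is encoded differently in practice (for example as zeros), the valid mask would need to be built from the segmentation output instead.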
Surface Normal Estimation
This mode estimates the surface normal of each pixel belonging to a person in an image.
The following items are returned:
raw output: The surface normal map for each pixel in a frame.
surface normal visualization: A visualization of the surface normal map.
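A raw normal map can be turned into an image with the widely used convention of mapping each component from [-1, 1] to [0, 255]. The sketch below illustrates that convention only; it assumes the raw output has shape (height, width, 3) with components in [-1, 1], which this document does not state, and it may differ from how the endpoint renders its own visualization.

```python
import numpy as np


def normals_to_rgb(normals: np.ndarray) -> np.ndarray:
    """Map normal components in [-1, 1] to 8-bit RGB values in [0, 255]."""
    clipped = np.clip(normals.astype(np.float64), -1.0, 1.0)
    scaled = (clipped + 1.0) / 2.0 * 255.0
    return np.rint(scaled).astype(np.uint8)


# Demo: a 1x2 "image" with two example normal vectors.
normals = np.array([[[-1.0, 1.0, 1.0],
                     [1.0, -1.0, -1.0]]])
print(normals_to_rgb(normals))
```

The resulting array can be saved with any image library that accepts uint8 RGB data.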
Examples
# make sure you've installed the sieve package with `pip install sievedata`
# and set your api key with `export SIEVE_API_KEY=<your_api_key>`
# or use
# os.environ["SIEVE_API_KEY"] = "<your_api_key>"
import numpy as np
import sieve
# get the sapiens sieve function
sapiens_fn = sieve.function.get("sieve/sapiens")
# choose one of the following modes:
# mode = "segmentation"
# mode = "depth"
# mode = "normal"
mode = "segmentation"
# choose one of the following model sizes:
# model_size = "0.3b"
# model_size = "1b"
model_size = "1b"
# segment a single image; returns a visualization file and a raw output file
visualization, raw_output = sapiens_fn.run(sieve.File(path="path/to/image.png"), mode=mode, model_size=model_size)
# load raw output as numpy array
raw_output = np.load(raw_output.path)
# segment a video, return a visualization video
visualization_video, output_dict = sapiens_fn.run(sieve.File(path="path/to/video.mp4"), mode=mode, model_size=model_size, make_video=True)
# path to output zip files containing frame-level outputs
path_to_raw_output_zip = output_dict["raw"].path
path_to_frame_visualization_zip = output_dict["visualization"].path
License
Please read the license for Sapiens here.