Segment Anything 2 (Beta)
This is Sieve's optimized implementation of Segment Anything 2!
SAM 2 (Segment Anything 2) is a new foundation model for video and image segmentation released by the FAIR team at Meta. It is the successor to the original Segment Anything Model (SAM), an image segmentation model released by Meta in 2023. SAM 2 introduces the capability to segment and track objects throughout a video sequence with as little as a single image prompt. See the official announcement here.
See Meta's SAM 2 GitHub repository here.
For information on prompting, click here.
For information on output options, click here.
For pricing, click here.
For an interactive tutorial, click here.
Models
We currently host two models, which can be selected with the model_type parameter:

sam2_hiera_large: Higher quality, more expensive (default)
sam2_hiera_tiny: Lower quality, cheaper
Trimming & Frame Interval
We support trimming videos via the start_frame and end_frame parameters. If start_frame is set to -1 (default), the video will be processed from the beginning. If end_frame is set to -1 (default), the video will be processed until the end.

Additionally, the frame_interval parameter allows for processing only a subset of frames in the video. This can be useful for reducing the cost of processing long videos. Setting frame_interval=n will process every nth frame in the video. All outputs will be mapped back to the original frame index (e.g. if frame_interval=2, masks/confidences/bboxes will be associated with the 0th, 2nd, 4th, ... frames in the original video, and their filenames will reflect this).

Note: Outputs will also be returned for any prompted frames, even if they do not fall on the frames collected with frame_interval=n.
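For example, a call using these parameters through the Sieve Python client might look like the sketch below. The function name sieve/sam2 is an assumption for illustration; check the function's page on Sieve for the published name and exact signature.

import sieve

video = sieve.File(path="soccer.mp4")
sam2 = sieve.function.get("sieve/sam2")  # function name is an assumption

outputs = sam2.run(
    video,
    model_type="sam2_hiera_tiny",  # cheaper model
    start_frame=0,                 # begin at the first frame
    end_frame=300,                 # stop after frame 300
    frame_interval=2,              # process every 2nd frame
)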
Prompting
A "prompt" in the context of SAM 2 is simply a labeling of a region of interest in an image. This region of interest can be annotated with a bounding box or a point with a positive or negative label. If we'd like to segment the soccer ball in the following image, we could place a positive point on the ball, and negative points on background objects (like the arm of the person holding it).
This visualization was generated using Meta's SAM 2 demo page. It's a great tool for understanding how SAM 2 works!
The great thing about SAM 2 is that it works on videos! With this prompt for a single frame, we can track the soccer ball throughout the video:
When more convenient, we can also use a bounding box to label a region of interest.
It's important to note that SAM 2 can handle multiple prompts across different frames. This is useful in scenarios where a new object enters the frame or for scene cuts.
An example set of prompts for a video might look like this:
example_prompts = [
    {  # Point-Label Prompt
        "frame_index": 0,  # the index of the frame to apply the prompt to
        "object_id": 1,    # the id of the object to track
        "points": [[300, 200], [200, 300]],  # 2D array of x,y points corresponding to labels
        "labels": [1, 0],  # labels for each point (1 for positive, 0 for negative)
    },
    {  # Bounding Box Prompt
        "frame_index": 50,  # the index of the frame to apply the prompt to
        "object_id": 2,     # the id of the object to track
        "box": [200, 200, 300, 400],  # xmin, ymin, xmax, ymax
    },
    # ... you can add as many prompts as you want!
]
This prompt identifies an object in the first frame with a positive point and a negative point, then identifies a different object in the 50th frame with a bounding box.
Importantly, SAM 2 can track multiple objects at once; just make sure to give each one a unique object_id. When providing multiple prompts for the same object, use the same object_id for each prompt.
View the examples below for more information.
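Under the same assumptions as the sketch above (Sieve Python client, function published as sieve/sam2), passing these prompts might look like the following; the prompts parameter name is itself an assumption:

# "prompts" is an assumed parameter name; check the function page for the exact signature.
outputs = sam2.run(video, prompts=example_prompts)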
Mask Prompts
SAM 2 also supports optionally providing masks directly as a prompt. To use a mask prompt, pass a zip file containing .png or .npy files as the mask_prompts parameter. Each file in the zip should correspond to a single prompt and should follow the naming convention {frame_index}_{object_id}.png or {frame_index}_{object_id}.npy in order to be correctly parsed. Mask prompts may be supplied alongside point/bounding box prompts. Unless the provided prompt masks are very accurate, it is recommended to supply a point/bounding box prompt instead, as SAM 2 is quite capable of inferring a high-quality segmentation mask.
For example, to give mask prompts for object_id="soccer_ball" for frame 0, and object_id="person" for frame 1, supply a zip file containing two files of the following format:
0_soccer_ball.png
1_person.png
Mask inputs should be of the shape (video_height, video_width), and will be automatically cast to boolean values.
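As an illustration, here is a minimal sketch of building such a zip with NumPy. The mask regions are made-up placeholders; only the {frame_index}_{object_id} naming convention comes from the description above.

import io
import zipfile
import numpy as np

video_height, video_width = 720, 1280  # placeholder dimensions

# One boolean mask per prompt, shaped (video_height, video_width).
ball_mask = np.zeros((video_height, video_width), dtype=bool)
ball_mask[300:400, 500:600] = True  # made-up region covering the soccer ball in frame 0

person_mask = np.zeros((video_height, video_width), dtype=bool)
person_mask[100:700, 200:450] = True  # made-up region covering the person in frame 1

with zipfile.ZipFile("mask_prompts.zip", "w") as zf:
    for name, mask in [("0_soccer_ball", ball_mask), ("1_person", person_mask)]:
        buf = io.BytesIO()
        np.save(buf, mask)  # saved as {frame_index}_{object_id}.npy
        zf.writestr(f"{name}.npy", buf.getvalue())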
Outputs
We supply a few ways to generate outputs. By default, binary masks are returned for each segmented object-frame pair. We also include options for fine-grained pixel confidences, bounding box tracking, and previewing segmentation masks as an overlay over the original video.

The endpoint supports both video and image inputs. It supports any video length, but please be advised that Sieve jobs have a maximum runtime of 5 hours. To stay under this limit, we recommend not exceeding 2.5 hours at 30 fps when using default settings.
Segmentation Masks
The primary output of SAM 2 is segmentation masks for each frame-object pair. We return these as a zip of .png files, one per frame-object id pair, with filenames of the form {frame_index}_{object_id}.png. The zip can be accessed via outputs["masks"].
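A minimal sketch of reading the returned masks, assuming the returned file object exposes a local .path attribute (an assumption about the SDK):

import zipfile
from PIL import Image

with zipfile.ZipFile(outputs["masks"].path) as zf:  # .path is an SDK assumption
    for name in zf.namelist():  # e.g. "0_1.png" -> frame 0, object 1
        frame_index, object_id = name.removesuffix(".png").split("_", 1)
        with zf.open(name) as f:
            mask = Image.open(f).convert("L")  # binary mask as an 8-bit image
        print(frame_index, object_id, mask.size)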
Debug Masks
When set to true, the original video is returned with the segmentation masks overlaid. This is returned as a separate object, so the output format becomes (debug_video, outputs).
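A sketch of consuming this output shape is below; note the debug_masks parameter name is a guess based on this section's title, not a confirmed part of the signature:

# "debug_masks" is a hypothetical flag name; check the function page for the exact parameter.
debug_video, outputs = sam2.run(video, prompts=example_prompts, debug_masks=True)
print(debug_video.path)  # the original video with masks overlaid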
Pixel Confidences
This option produces mask-like outputs with confidence values in the range 0-255, representing the model's confidence score for a given pixel from 0.0 to 1.0. A zip file with confidence masks for each frame-object pair is returned, which can be accessed via outputs["pixel_confidences"].
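To recover float scores, divide by 255. A sketch, assuming the same .path attribute and {frame_index}_{object_id}.png naming convention as the masks output (the naming here is an assumption):

import zipfile
import numpy as np
from PIL import Image

with zipfile.ZipFile(outputs["pixel_confidences"].path) as zf:
    with zf.open("0_1.png") as f:  # assumed filename: frame 0, object 1
        confidences = np.asarray(Image.open(f), dtype=np.float32) / 255.0
print(confidences.min(), confidences.max())  # per-pixel scores in [0.0, 1.0]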
Bbox Tracking
Setting bbox_tracking=True will track the bounding boxes for objects across each frame. This can be useful for object-tracking use cases!
If an object is not present in a frame, it will not have a bounding box included for that frame. The bounding box format is [x_min, y_min, x_max, y_max]. Each bounding box is also associated with the frame's timestep, given in seconds.
The output maps a frame index to a list of bounding boxes for that frame. Example output:
{
    "0": [
        {
            "frame_index": 0,
            "object_id": 1,
            "bbox": [100, 400, 300, 600],
            "timestep": 0.0
        },
        {
            "frame_index": 0,
            "object_id": 2,
            "bbox": [200, 500, 400, 700],
            "timestep": 0.0
        }
    ],
    "1": [
        {
            "frame_index": 1,
            "object_id": 1,
            "bbox": [105, 405, 305, 605],
            "timestep": 0.033
        },
        {
            "frame_index": 1,
            "object_id": 2,
            "bbox": [195, 495, 395, 695],
            "timestep": 0.033
        }
    ],
    ...
}
When enabled, bounding boxes can be accessed via outputs["bbox_tracking"].
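A sketch of iterating over this structure to print each object's trajectory, assuming the output is returned as a dict keyed by frame-index strings as in the example above:

bbox_tracking = outputs["bbox_tracking"]

for frame_index in sorted(bbox_tracking, key=int):  # keys are frame indices as strings
    for det in bbox_tracking[frame_index]:
        x_min, y_min, x_max, y_max = det["bbox"]
        print(f"t={det['timestep']:.3f}s object {det['object_id']}: "
              f"({x_min}, {y_min}) -> ({x_max}, {y_max})")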
Preview
Sometimes it can be useful to visualize the effects of prompts before segmenting your video. Setting preview=True will generate annotated segmentation visualizations for each frame prompted. This is a great way to ensure your SAM 2 prompts are functioning as intended. The first object segmented is returned as an individual output (for visualizing in the UI), in addition to the full outputs dictionary.
Multimask Output (Images only)
If true, the model will return three masks. For ambiguous input prompts (such as a single click), this will often produce better masks than a single prediction. If only a single mask is needed, the model's predicted quality score can be used to select the best mask. For non-ambiguous prompts, such as multiple input prompts, multimask_output=False can give better results. When enabled, this option generates 3 versions of each output (e.g. masks_0, masks_1, masks_2), one for each of the three masks produced.
Pricing
Pricing is determined by the length of the input video in minutes (L), as well as the number of objects being tracked (O).
For model_type="large"
, we charge:
(0.216 * L) + (0.0288 * O * L)
For model_type="tiny"
, we charge:
(0.144 * L) + (0.0288 * O * L)
If the calculated price is less than 0.01
, we round up to 0.01
.
Both pricing formulas assume the input video is 30fps. For higher/lower framerates, the price will be adjusted accordingly (e.g. a 1-minute 60fps video will cost the same as a 2-minute 30fps video).
For videos where preview=True, we charge a flat rate of 0.01.

For image inputs, we charge a flat rate of 0.01.
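As a worked example, the sketch below reproduces these formulas; estimate_price is a hypothetical helper, not part of the API:

def estimate_price(minutes, objects, model_type="sam2_hiera_large", fps=30.0):
    # Hypothetical helper reproducing the pricing formulas above.
    base = 0.216 if model_type == "sam2_hiera_large" else 0.144
    adjusted_minutes = minutes * (fps / 30.0)  # pricing assumes a 30 fps input
    price = (base + 0.0288 * objects) * adjusted_minutes
    return max(price, 0.01)  # prices below 0.01 round up to 0.01

print(estimate_price(2, 3))  # 0.6048 for a 2-minute 30 fps video with 3 objects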
Examples
We created a Google Colab notebook that walks through using SAM 2 on Sieve! It includes an interactive prompt generator as well as examples of each of the output options. You can check it out here.
License
Please view the SAM 2 license here.