Model Endpoints¶
Model endpoints are deployments of models that receive requests and return the predictions produced by the model's inference. Each model endpoint is associated with a model bundle, which contains the model's code. An endpoint also specifies deployment parameters, such as the minimum and maximum number of workers, and the resources requested for each worker, such as the number of CPUs, the amount of memory, and the GPU count and type.
Endpoints can be asynchronous, synchronous, or streaming. Asynchronous endpoints return a future immediately after receiving a request; the future can later be used to retrieve the prediction once it is ready. Synchronous endpoints return the prediction directly in the response to the request. Streaming endpoints are variants of synchronous endpoints that return a stream of server-sent events (SSEs) instead of a single HTTP response.
Choosing the right inference mode¶
Here are some tips for choosing between SyncEndpoint, StreamingEndpoint, AsyncEndpoint, and BatchJob when deploying your ModelBundle:
A SyncEndpoint is good if:
- You have strict latency requirements (e.g. on the order of seconds or less).
- You are willing to have resources continually allocated.
A StreamingEndpoint is good if:
- You have stricter requirements on perceived latency than a SyncEndpoint can support (e.g. you want the model's tokens returned as they are generated rather than waiting for generation to finish).
- You are willing to have resources continually allocated.
An AsyncEndpoint is good if:
- You want to save on compute costs.
- Your inference code takes a long time to run.
- Your latency requirements are on the order of minutes.
A BatchJob is good if:
- You know there is a large batch of inputs ahead of time.
- You want to optimize for throughput instead of latency.
Creating Async Model Endpoints¶
Async model endpoints are the most cost-efficient way to perform inference on tasks that are less latency-sensitive, because min_workers can be set to 0, allowing the endpoint to scale down and free its resources when idle.
import os
from launch import LaunchClient

client = LaunchClient(api_key=os.getenv("LAUNCH_API_KEY"))
endpoint = client.create_model_endpoint(
    endpoint_name="demo-endpoint-async",
    model_bundle="test-bundle",
    cpus=1,
    min_workers=0,
    endpoint_type="async",
    update_if_exists=True,
    labels={
        "team": "MY_TEAM",
        "product": "MY_PRODUCT",
    },
)
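Once the endpoint finishes building, predictions can be requested through an endpoint handle. Here is a minimal sketch, assuming the bundle accepts a JSON-serializable args payload (the payload contents below are placeholders):

import os
from launch import EndpointRequest, LaunchClient

client = LaunchClient(api_key=os.getenv("LAUNCH_API_KEY"))

# Look up the async endpoint created above.
endpoint = client.get_model_endpoint("demo-endpoint-async")

# predict() on an async endpoint returns a future immediately.
future = endpoint.predict(request=EndpointRequest(args={"x": 2}))

# get() blocks until the prediction is ready.
response = future.get()
print(response)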
Creating Sync Model Endpoints¶
Sync model endpoints are useful for latency-sensitive tasks, such as real-time inference. Because they keep at least one worker allocated at all times (see the note below), sync endpoints are more expensive than async endpoints.
Note
Sync model endpoints require min_workers of at least 1.
import os
from launch import LaunchClient

client = LaunchClient(api_key=os.getenv("LAUNCH_API_KEY"))
endpoint = client.create_model_endpoint(
    endpoint_name="demo-endpoint-sync",
    model_bundle="test-bundle",
    cpus=1,
    min_workers=1,
    endpoint_type="sync",
    update_if_exists=True,
    labels={
        "team": "MY_TEAM",
        "product": "MY_PRODUCT",
    },
)
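A sync endpoint returns its prediction in the same call, so no future is involved. A minimal sketch under the same assumptions as the async example above:

import os
from launch import EndpointRequest, LaunchClient

client = LaunchClient(api_key=os.getenv("LAUNCH_API_KEY"))

# Look up the sync endpoint created above.
endpoint = client.get_model_endpoint("demo-endpoint-sync")

# predict() on a sync endpoint blocks and returns the prediction directly.
response = endpoint.predict(request=EndpointRequest(args={"x": 2}))
print(response)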
Creating Streaming Model Endpoints¶
Streaming model endpoints are variants of sync model endpoints that are useful for tasks with strict requirements on perceived latency. Streaming endpoints are more expensive than async endpoints.
Note
Streaming model endpoints require min_workers of at least 1.
import os
from launch import LaunchClient

client = LaunchClient(api_key=os.getenv("LAUNCH_API_KEY"))
endpoint = client.create_model_endpoint(
    endpoint_name="demo-endpoint-streaming",
    model_bundle="test-streaming-bundle",
    cpus=1,
    min_workers=1,
    per_worker=1,
    endpoint_type="streaming",
    update_if_exists=True,
    labels={
        "team": "MY_TEAM",
        "product": "MY_PRODUCT",
    },
)
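Streaming output is consumed incrementally rather than as a single response. The sketch below is hypothetical: it assumes that predict() on a streaming endpoint returns an iterable of server-sent events, and the args payload is a placeholder; check the client reference for the exact streaming response type.

import os
from launch import EndpointRequest, LaunchClient

client = LaunchClient(api_key=os.getenv("LAUNCH_API_KEY"))

# Look up the streaming endpoint created above.
endpoint = client.get_model_endpoint("demo-endpoint-streaming")

# Assumption: predict() on a streaming endpoint yields server-sent
# events, each carrying a chunk of the model's output.
for event in endpoint.predict(request=EndpointRequest(args={"prompt": "hello"})):
    print(event)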
Managing Model Endpoints¶
Model endpoints can be listed, updated, and deleted using the Launch API.
To list the model endpoints visible to your account:

import os
from launch import LaunchClient

client = LaunchClient(api_key=os.getenv("LAUNCH_API_KEY"))
endpoints = client.list_model_endpoints()
To update an endpoint's deployment parameters, for example its maximum worker count:

import os
from launch import LaunchClient

client = LaunchClient(api_key=os.getenv("LAUNCH_API_KEY"))
client.edit_model_endpoint(
    model_endpoint="demo-endpoint-sync",
    max_workers=2,
)
To delete an endpoint (here, a temporary endpoint created just for this example):

import os
import time
from launch import LaunchClient

client = LaunchClient(api_key=os.getenv("LAUNCH_API_KEY"))
endpoint = client.create_model_endpoint(
    endpoint_name="demo-endpoint-tmp",
    model_bundle="test-bundle",
    cpus=1,
    min_workers=0,
    endpoint_type="async",
    update_if_exists=True,
    labels={
        "team": "MY_TEAM",
        "product": "MY_PRODUCT",
    },
)
time.sleep(15)  # Wait for Launch to build the endpoint
client.delete_model_endpoint(model_endpoint_name="demo-endpoint-tmp")