
Model Endpoints

Model Endpoints are deployments of models that can receive requests and return predictions containing the results of the model's inference. Each model endpoint is associated with a model bundle, which contains the model's code. An endpoint specifies deployment parameters, such as the minimum and maximum number of workers, as well as the requested resources for each worker, such as the number of CPUs, amount of memory, GPU count, and type of GPU.

Endpoints can be asynchronous, synchronous, or streaming. An asynchronous endpoint returns a future immediately after receiving a request; the future can then be used to retrieve the prediction once it is ready. A synchronous endpoint returns the prediction directly in its response to the request. A streaming endpoint is a variant of a synchronous endpoint that returns a stream of server-sent events (SSEs) instead of a single HTTP response.
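The difference between the three response styles can be illustrated without any Launch code. The sketch below uses only the Python standard library; `fake_model`, `sync_predict`, `async_predict`, and `stream_predict` are hypothetical stand-ins and are not part of the Launch client.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Iterator


def fake_model(x: int) -> int:
    """Hypothetical stand-in for model inference (not part of Launch)."""
    time.sleep(0.05)  # simulate some inference work
    return x * 2


def sync_predict(x: int) -> int:
    # Synchronous style: the caller blocks until the prediction is ready.
    return fake_model(x)


_executor = ThreadPoolExecutor(max_workers=2)


def async_predict(x: int) -> Future:
    # Asynchronous style: a future is returned immediately,
    # and the prediction is retrieved from it later.
    return _executor.submit(fake_model, x)


def stream_predict(x: int) -> Iterator[int]:
    # Streaming style: partial results are yielded one at a time
    # instead of being returned as a single response.
    for step in range(1, x + 1):
        yield step * 2


sync_result = sync_predict(21)

future = async_predict(21)                # returns right away
async_result = future.result(timeout=5)   # blocks until the prediction is ready

streamed = list(stream_predict(3))

_executor.shutdown()
```

In this sketch the synchronous and asynchronous calls produce the same value; what differs is when the caller has to wait for it.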

Info

Choosing the right inference mode

Here are some tips for choosing between SyncEndpoint, StreamingEndpoint, AsyncEndpoint, and BatchJob when deploying your ModelBundle:

A SyncEndpoint is good if:

  • You have strict latency requirements (e.g. on the order of seconds or less).
  • You are willing to have resources continually allocated.

A StreamingEndpoint is good if:

  • You have stricter requirements on perceived latency than SyncEndpoint can support (e.g. you want tokens generated by the model to start being returned almost immediately rather than waiting for the model generation to finish).
  • You are willing to have resources continually allocated.

An AsyncEndpoint is good if:

  • You want to save on compute costs.
  • Your inference code takes a long time to run.
  • Your latency requirements are on the order of minutes.

A BatchJob is good if:

  • You know there is a large batch of inputs ahead of time.
  • You want to optimize for throughput instead of latency.
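The tips above can be condensed into a small decision helper. This is only a rough sketch: `suggest_inference_mode` is a hypothetical function, and the 10-second cutoff is an arbitrary assumption standing in for "on the order of seconds or less".

```python
def suggest_inference_mode(
    max_latency_seconds: float,
    inputs_known_ahead_of_time: bool = False,
    needs_incremental_output: bool = False,
) -> str:
    """Suggest an inference mode from the rules of thumb in this section.

    Hypothetical helper; the thresholds are illustrative assumptions.
    """
    # A large batch known up front favors throughput over latency.
    if inputs_known_ahead_of_time:
        return "BatchJob"
    # Results should start arriving before generation finishes.
    if needs_incremental_output:
        return "StreamingEndpoint"
    # Strict latency, on the order of seconds or less (assumed cutoff: 10 s).
    if max_latency_seconds <= 10:
        return "SyncEndpoint"
    # Latency on the order of minutes: trade latency for lower compute cost.
    return "AsyncEndpoint"


mode_realtime = suggest_inference_mode(max_latency_seconds=1)
mode_tokens = suggest_inference_mode(max_latency_seconds=1, needs_incremental_output=True)
mode_slow = suggest_inference_mode(max_latency_seconds=600)
mode_batch = suggest_inference_mode(max_latency_seconds=3600, inputs_known_ahead_of_time=True)
```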

Creating Async Model Endpoints

Async model endpoints are the most cost-efficient way to perform inference on tasks that are less latency-sensitive.

Creating an Async Model Endpoint
import os
from launch import LaunchClient

client = LaunchClient(api_key=os.getenv("LAUNCH_API_KEY"))
endpoint = client.create_model_endpoint(
    endpoint_name="demo-endpoint-async",
    model_bundle="test-bundle",
    cpus=1,
    min_workers=0,
    endpoint_type="async",
    update_if_exists=True,
    labels={
        "team": "MY_TEAM",
        "product": "MY_PRODUCT",
    },
)

Creating Sync Model Endpoints

Sync model endpoints are useful for latency-sensitive tasks, such as real-time inference. Sync endpoints are more expensive than async endpoints because at least one worker must stay allocated at all times.

Note

Sync model endpoints require min_workers to be at least 1.
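A client-side check can catch this constraint before a request is sent. The sketch below is an assumption about how such a check might look; `check_min_workers` is a hypothetical helper, not a Launch function. It also covers streaming endpoints, which have the same requirement.

```python
def check_min_workers(endpoint_type: str, min_workers: int) -> None:
    """Raise ValueError if min_workers is invalid for the endpoint type.

    Hypothetical helper that mirrors the constraint documented here.
    """
    if min_workers < 0:
        raise ValueError("min_workers must not be negative")
    # Sync and streaming endpoints need at least one worker allocated.
    if endpoint_type in {"sync", "streaming"} and min_workers < 1:
        raise ValueError(
            f"{endpoint_type} endpoints require min_workers >= 1, got {min_workers}"
        )


check_min_workers("async", 0)  # accepted: async endpoints may use 0 workers
check_min_workers("sync", 1)   # accepted

try:
    check_min_workers("sync", 0)
    sync_zero_rejected = False
except ValueError:
    sync_zero_rejected = True
```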

Creating a Sync Model Endpoint
import os
from launch import LaunchClient

client = LaunchClient(api_key=os.getenv("LAUNCH_API_KEY"))
endpoint = client.create_model_endpoint(
    endpoint_name="demo-endpoint-sync",
    model_bundle="test-bundle",
    cpus=1,
    min_workers=1,
    endpoint_type="sync",
    update_if_exists=True,
    labels={
        "team": "MY_TEAM",
        "product": "MY_PRODUCT",
    },
)

Creating Streaming Model Endpoints

Streaming model endpoints are variants of sync model endpoints that are useful for tasks with strict requirements on perceived latency. Like sync endpoints, streaming endpoints are more expensive than async endpoints because at least one worker must stay allocated at all times.

Note

Streaming model endpoints require min_workers to be at least 1.

Creating a Streaming Model Endpoint
import os
from launch import LaunchClient

client = LaunchClient(api_key=os.getenv("LAUNCH_API_KEY"))
endpoint = client.create_model_endpoint(
    endpoint_name="demo-endpoint-streaming",
    model_bundle="test-streaming-bundle",
    cpus=1,
    min_workers=1,
    per_worker=1,
    endpoint_type="streaming",
    update_if_exists=True,
    labels={
        "team": "MY_TEAM",
        "product": "MY_PRODUCT",
    },
)
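Because a streaming endpoint responds with a stream of SSEs, the consumer has to split that stream into individual events. The following is a minimal sketch of reading `data:` fields from an SSE stream using only the standard library. `iter_sse_data` is a hypothetical helper, and the JSON payloads in the sample are assumptions; this page does not specify what a Launch streaming endpoint actually emits.

```python
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each server-sent event in a stream of lines.

    Minimal sketch: handles `data:` fields and comment lines only and
    ignores the `event:`, `id:`, and `retry:` fields.
    """
    data_parts: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_parts:
                yield "\n".join(data_parts)
                data_parts = []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if field == "data":
            data_parts.append(value)
    # An event that is not terminated by a blank line is discarded.


# Sample stream; the payload shape is an assumption for illustration only.
sample_stream = [
    'data: {"token": "Hel"}\n',
    "\n",
    ": keep-alive\n",
    'data: {"token": "lo"}\n',
    "\n",
]
events = list(iter_sse_data(sample_stream))
```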

Managing Model Endpoints

Model endpoints can be listed, updated, and deleted using the Launch API.

Listing Model Endpoints
import os
from launch import LaunchClient

client = LaunchClient(api_key=os.getenv("LAUNCH_API_KEY"))
endpoints = client.list_model_endpoints()

Updating a Model Endpoint
import os
from launch import LaunchClient

client = LaunchClient(api_key=os.getenv("LAUNCH_API_KEY"))
client.edit_model_endpoint(
    model_endpoint="demo-endpoint-sync",
    max_workers=2,
)

Deleting a Model Endpoint
import time
import os
from launch import LaunchClient

client = LaunchClient(api_key=os.getenv("LAUNCH_API_KEY"))
endpoint = client.create_model_endpoint(
    endpoint_name="demo-endpoint-tmp",
    model_bundle="test-bundle",
    cpus=1,
    min_workers=0,
    endpoint_type="async",
    update_if_exists=True,
    labels={
        "team": "MY_TEAM",
        "product": "MY_PRODUCT",
    },
)
time.sleep(15)  # Wait for Launch to build the endpoint
client.delete_model_endpoint(model_endpoint_name="demo-endpoint-tmp")