
Custom Models 🚧

Warning

The Model API is currently in Beta. The API and CLI commands described in this guide are subject to change.

Models can be deployed with SGP using the SDK or through the scale-egp CLI, which is bundled with the Python SDK.

If you'd like to use the scale-egp CLI, install it first if you haven't already done so.

Note

All CLI commands below assume that the EGP_API_KEY environment variable has been set to your SGP API key.
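For example, in a POSIX-compatible shell:

export EGP_API_KEY="<your SGP API key>"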

Model Template vs Model

Custom models are created using the EGPClient().models().create() method. Notice, however, that this method requires a model_template_id parameter, so before a model can be created, a model template must first be registered.

Let's distinguish between a Model Template and a Model.

A Model Template is a static configuration for a model that defines how it should be loaded and executed, whereas a Model is an actual running instance of a model that can be invoked with requests. When the Model is created, the configuration of the model template can be used to reserve the required computing resources, pull the correct docker image, etc.

A Model Template, by definition, is static and incomplete. It expects certain parameters to be provided when a Model is created. For example, a Model Template may define a docker image that loads and executes a model, but it will not specify where the model weights are loaded from. This information is provided when a Model is created.

Deploy a Model

To deploy and execute a new model, follow these steps:

  1. Create a Model Template (Optional)
  2. Create a Model
  3. Execute a Model

Create a Model Template (Optional)

If an SGP Model Template already exists for the model you want to deploy, you can skip this step and go directly to Create a Model.

To list all available model templates that you can use, run the following code:

from scale_egp.sdk.client import EGPClient

client = EGPClient(api_key="...", account_id="...")
model_templates = client.model_templates().list()
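Each returned template includes the id you will later pass as model_template_id. As a minimal sketch (assuming the returned objects expose the id and name fields shown in the CLI output later in this guide):

for template in model_templates:
    # Print each template's id alongside its human-readable name.
    print(template.id, template.name)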
via the CLI

To use the scale-egp CLI to list all available model templates, run:

scale-egp model_template list

To demonstrate how a model template works, we will deploy a cross-encoder model that can be used for reranking. In this example, we have already created a docker image that loads and executes the model. We have also tested this docker image on a machine with a specific hardware configuration, so we will directly specify the hardware configuration in the model template.

Note

While the Model API is in Beta, we will not go into detail about how to create this docker image because the creation of Model Templates will initially be reserved for power users. However, we will release these guides once the Model API is in General Availability.

We can use the following SDK code to create a model template.

from scale_egp.sdk.client import EGPClient
from scale_egp.sdk.enums import ModelEndpointType, ModelType
from scale_egp.sdk.types.model_templates import (
    ModelBundleConfig,
    LaunchVendorConfiguration,
    ModelEndpointConfig,
)
from scale_egp.sdk.types.models import ParameterSchemaField, ParameterSchema

client = EGPClient(api_key="...", account_id="...")
model_template = client.model_templates().create(
    name="custom_reranking_model_template",
    endpoint_type=ModelEndpointType.ASYNC,
    model_type=ModelType.RERANKING,
    vendor_configuration=LaunchVendorConfiguration(
        bundle_config=ModelBundleConfig(
            registry="docker.io",
            image="egp-test/sentence-transformer-reranker",
            tag="2023-12-08-1820",
            env={"DEVICE": "cuda"},
            readiness_initial_delay_seconds=120,
        ),
        endpoint_config=ModelEndpointConfig(
            cpus=1,
            memory="8Gi",
            storage="16Gi",
            gpus=1,
            min_workers=0,
            max_workers=1,
            per_worker=10,
            gpu_type="nvidia-ampere-a10",
            endpoint_type=ModelEndpointType.ASYNC,
            high_priority=False,
        ),
    ),
    # These parameters can be set by the user who creates a Model from this Model Template
    # to deploy a live model endpoint.
    model_creation_parameters_schema=ParameterSchema(
        parameters=[
            ParameterSchemaField(
                name="MODEL_BUCKET",
                type="str",
                description="S3 bucket containing model files",
                required=False,
            ),
            ParameterSchemaField(
                name="MODEL_PATH",
                type="str",
                description="S3 path within bucket containing model files",
                required=False,
            ),
        ]
    ),
)
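The returned object carries the new template's id, which is required to create a Model in the next step:

# Save this id; it is passed as model_template_id when creating a Model.
print(model_template.id)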
via the CLI

Warning

This feature is currently in beta, and the format of the model template files is likely to change.

To use the scale-egp CLI to create a model template equivalent to the one above, save the following json as model-template.json, and then run:

scale-egp model_template create model-template.json
# model-template.json
{
  "name": "custom_reranking_model_template",
  "endpoint_type": "ASYNC",
  "model_type": "RERANKING",
  "vendor_configuration": {
    "vendor": "LLMENGINE",
    "bundle_config": {
      "registry": "docker.io",
      "image": "egp-test/sentence-transformer-reranker",
      "tag": "latest",
      "env": {
        "DEVICE": "cuda"
      },
      "readiness_initial_delay_seconds": 120
    },
    "endpoint_config": {
      "cpus": 1,
      "memory": "8Gi",
      "storage": "16Gi",
      "gpus": 1,
      "min_workers": 0,
      "max_workers": 1,
      "per_worker": 10,
      "gpu_type": "nvidia-ampere-a10",
      "endpoint_type": "ASYNC",
      "high_priority": false
    }
  },
  "model_creation_parameters_schema": {
    "parameters": [
      {
        "name": "MODEL_BUCKET",
        "type": "str",
        "description": "S3 bucket containing model files",
        "required": false
      },
      {
        "name": "MODEL_PATH",
        "type": "str",
        "description": "S3 path within bucket containing model files",
        "required": false
      }
    ]
  }
}

The command above outputs the id of the newly created model template, in our case 39d2f5fd-482d-412a-862c-1c5a593112f0:

> scale-egp model_template create model-template.json
                                      Output of model_template create
โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”“
โ”ƒ field                            โ”ƒ value                                                                โ”ƒ
โ”กโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ฉ
โ”‚ name                             โ”‚ custom_reranking_model_template                                      โ”‚
โ”‚ endpoint_type                    โ”‚ ASYNC                                                                โ”‚
โ”‚ model_type                       โ”‚ RERANKING                                                            โ”‚
โ”‚ vendor_configuration             โ”‚ {                                                                    โ”‚
โ”‚                                  โ”‚   "vendor": "LLMENGINE",                                             โ”‚
โ”‚                                  โ”‚   "bundle_config": {                                                 โ”‚
โ”‚                                  โ”‚     "registry": "docker.io",                                         โ”‚
โ”‚                                  โ”‚     "image": "egp-test/sentence-transformer-reranker",               โ”‚
โ”‚                                  โ”‚     "tag": "latest",                                                 โ”‚
โ”‚                                  โ”‚     "command": [],                                                   โ”‚
โ”‚                                  โ”‚     "env": {                                                         โ”‚
โ”‚                                  โ”‚       "DEVICE": "cuda"                                               โ”‚
โ”‚                                  โ”‚     },                                                               โ”‚
โ”‚                                  โ”‚     "readiness_initial_delay_seconds": 120                           โ”‚
โ”‚                                  โ”‚   },                                                                 โ”‚
โ”‚                                  โ”‚   "endpoint_config": {                                               โ”‚
โ”‚                                  โ”‚     "cpus": 1,                                                       โ”‚
โ”‚                                  โ”‚     "memory": "8Gi",                                                 โ”‚
โ”‚                                  โ”‚     "storage": "16Gi",                                               โ”‚
โ”‚                                  โ”‚     "gpus": 1,                                                       โ”‚
โ”‚                                  โ”‚     "min_workers": 0,                                                โ”‚
โ”‚                                  โ”‚     "max_workers": 1,                                                โ”‚
โ”‚                                  โ”‚     "per_worker": 10,                                                โ”‚
โ”‚                                  โ”‚     "gpu_type": "nvidia-ampere-a10",                                 โ”‚
โ”‚                                  โ”‚     "endpoint_type": "ASYNC",                                        โ”‚
โ”‚                                  โ”‚     "high_priority": false                                           โ”‚
โ”‚                                  โ”‚   }                                                                  โ”‚
โ”‚                                  โ”‚ }                                                                    โ”‚
โ”‚ model_creation_parameters_schema โ”‚ {                                                                    โ”‚
โ”‚                                  โ”‚   "parameters": [                                                    โ”‚
โ”‚                                  โ”‚     {                                                                โ”‚
โ”‚                                  โ”‚       "name": "MODEL_BUCKET",                                        โ”‚
โ”‚                                  โ”‚       "type": "str",                                                 โ”‚
โ”‚                                  โ”‚       "description": "S3 bucket containing model files",             โ”‚
โ”‚                                  โ”‚       "required": false                                              โ”‚
โ”‚                                  โ”‚     },                                                               โ”‚
โ”‚                                  โ”‚     {                                                                โ”‚
โ”‚                                  โ”‚       "name": "MODEL_PATH",                                          โ”‚
โ”‚                                  โ”‚       "type": "str",                                                 โ”‚
โ”‚                                  โ”‚       "description": "S3 path within bucket containing model files", โ”‚
โ”‚                                  โ”‚       "required": true                                               โ”‚
โ”‚                                  โ”‚     }                                                                โ”‚
โ”‚                                  โ”‚   ]                                                                  โ”‚
โ”‚                                  โ”‚ }                                                                    โ”‚
โ”‚ id                               โ”‚ 39d2f5fd-482d-412a-862c-1c5a593112f0                                 โ”‚
โ”‚ created_at                       โ”‚ 2023-12-23 11:29:38.232825                                           โ”‚
โ”‚ account_id                       โ”‚ 5aec8217-7b07-4564-92da-d62f7f15e800                                 โ”‚
โ”‚ created_by_user_id               โ”‚ 5aec8217-7b07-4564-92da-d62f7f15e800                                 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Keep track of the id of the model template that was created. We will need it in the next step.

Create a Model

Now that we have a model template, we can create a model from it.

Coming Soon

Commands for intelligently managing model resources in a more cost-effective way will be available in the SGP SDK and the scale-egp command line utility soon. Until then, we recommend setting min_workers to 0 during testing so that idle models scale down to use no resources. You can then raise the minimum number of workers if you want to use the model in production and need to ensure that it is always available.

The following code creates a model from the model template created in the previous step:

model = client.models().create(
    name="my_reranking_model",
    model_template_id=model_template.id,
    model_creation_parameters={
        "MODEL_BUCKET": "my-bucket",
        "MODEL_PATH": "finetuned-cross-encoder_ms-marco-MiniLM-L-12-v2.tar.gz"
    },
)
via the CLI

Warning

This feature is currently in beta, and the format of the model files is likely to change.

To use the scale-egp CLI to create a model equivalent to the one above, save the following json as model.json, and then run:

scale-egp model create --model-template-id 39d2f5fd-482d-412a-862c-1c5a593112f0 model.json

Note that the model template defines two model creation parameters, the S3 bucket and path used to load the model's weight files, so we need to set these in our model.json.

# model.json
{
  "name": "my_reranking_model",
  "model_template_id": "39d2f5fd-482d-412a-862c-1c5a593112f0",
  "model_creation_parameters": {
    "MODEL_BUCKET": "my-bucket",
    "MODEL_PATH": "finetuned-cross-encoder_ms-marco-MiniLM-L-12-v2.tar.gz"
  }
}

The command executes and returns the newly created model's id, in our case: 205b9bfb-287d-4097-bd45-ef7cf84cc996.

It can take several minutes for the model to be deployed depending on hardware availability, docker image size, model size, etc.

To check the model's deployment status, we can fetch the model by id and inspect its status:

model = client.models().get(id=model.id)
print(model.status)
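If you prefer not to re-run this by hand, a minimal wait loop might look like the following sketch (it assumes the status is reported as the string READY on success, as described below):

import time

# Re-fetch the model until it reports READY. Deployment can take several
# minutes, so poll at a relaxed interval.
while model.status != "READY":
    time.sleep(30)
    model = client.models().get(id=model.id)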
via the CLI

The scale-egp CLI can also be used to get the model's status using the get or describe commands:

scale-egp model get 205b9bfb-287d-4097-bd45-ef7cf84cc996
scale-egp model describe 205b9bfb-287d-4097-bd45-ef7cf84cc996

When the status is READY, we can move on to executing the model.

Execute a Model

Once a model is created, it can be invoked with requests. The following code executes the model created in the previous step:

from scale_egp.sdk.types.models import RerankingRequest

model_response = client.models().execute(
    id=model.id,
    # The RerankingRequest type is required since the model's type is RERANKING.
    request=RerankingRequest(
        query="What's the name of the largest continent?",
        chunks=[
            "I like to have milk with cereals for breakfast.",
            "The largest continent is called Asia.",
            "The largest country is called Ibiza.",
            "Asia has the highest population.",
        ],
    ),
)
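The response contains one relevance score per input chunk, in the same order as the request (see the RerankingResponse schema below). Assuming the SDK parses the response into a RerankingResponse, the scores can be inspected like this:

# Higher scores indicate chunks that are more relevant to the query.
for score in model_response.chunk_scores:
    print(score)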

Because this is a ranking model, you can also use this custom model directly in the Chunk Rank API. In the SDK, this looks like the following:

from scale_egp.sdk.types.chunks import ModelRankParams, ModelRankStrategy

model_response = client.chunks().rank(
    query="What's the name of the largest continent?",
    relevant_chunks=[
        "I like to have milk with cereals for breakfast.",
        "The largest continent is called Asia.",
        "The largest country is called Ibiza.",
        "Asia has the highest population.",
    ],
    rank_strategy=ModelRankStrategy(
        params=ModelRankParams(
            model_id=model.id,
        ),
    ),
    top_k=3,
)

The request and response schemas for each model type are listed and described below in the Model Types and Schemas section.

via the CLI

Warning

This feature is currently in beta, and the format of the model files is likely to change.

To use the scale-egp CLI to execute a model equivalent to the one above, first save the following json as model-request.json.

# model-request.json
{
  "query": "What's the name of the largest continent?",
  "chunks": [
    "I like to have milk with cereals for breakfast.",
    "The largest continent is called Asia.",
    "The largest country is called Ibiza.",
    "Asia has the highest population."
  ]
}

Before executing the model, first validate that your model request schema is correct using the following command:

scale-egp model validate-request 205b9bfb-287d-4097-bd45-ef7cf84cc996 model-request.json

If the model request passes validation, you can then execute the model:

scale-egp model execute 205b9bfb-287d-4097-bd45-ef7cf84cc996 model-request.json

Model Types and Schemas

When creating and executing a model, its request and response payloads must match the predefined schemas for the supported SGP model types. These types and schemas are listed below in tabular format.

You can also use the scale-egp CLI to view these schemas in JSON-schema format on your command line using the following command:

scale-egp model_template show-model-schemas

Here is a list of each model type and links to their corresponding request and response schemas.

Model Type        Request Schema          Response Schema
EMBEDDING         EmbeddingRequest        EmbeddingResponse
RERANKING         RerankingRequest        RerankingResponse
COMPLETION        CompletionRequest       CompletionResponse
CHAT_COMPLETION   ChatCompletionRequest   ChatCompletionResponse
AGENT             AgentRequest            AgentResponse

Embedding

Request schema for embedding models.

Attributes:

- texts (List[str]): List of texts to get embeddings for.

Response schema for embedding models.

Attributes:

- embeddings (List[Tuple[str, List[float]]]): List of text, embedding pairs.
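For illustration, executing a deployed EMBEDDING model could look like the following sketch. The model id is hypothetical, and the import path for EmbeddingRequest is assumed to match that of RerankingRequest used earlier in this guide:

from scale_egp.sdk.types.models import EmbeddingRequest  # assumed import path

model_response = client.models().execute(
    id="<embedding-model-id>",  # hypothetical id of a deployed EMBEDDING model
    request=EmbeddingRequest(texts=["What is the largest continent?"]),
)
# embeddings is a list of (text, vector) pairs per the schema above.
for text, vector in model_response.embeddings:
    print(text, len(vector))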

Reranking

Request schema for reranking models.

Attributes:

- query (str): Query to rerank chunks against in order of relevance.
- chunks (List[str]): List of chunks to rerank.

Response schema for reranking models.

Attributes:

- chunk_scores (List[float]): List of scores for each chunk, in the same order as the input chunks.

Completion

Request schema for completion models.

Attributes:

- temperature (float): What sampling temperature to use, between [0, 1]. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. Setting temperature=0.0 will enable fully deterministic (greedy) sampling.
- stop_sequences (Optional[List[str]]): List of up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
- max_tokens (Optional[int]): The maximum number of tokens to generate in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length. If not specified, max_tokens will be determined based on the model used.
- prompts (List[str]): List of prompts to generate completions for.

Response schema for completion models.

Attributes:

- completions (List[Tuple[str, List[str]]]): List of prompt, completion pairs.
- finish_reason (Optional[str]): The reason the completion finished.
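As a sketch, executing a deployed COMPLETION model could look like this. The model id is hypothetical, and the import path for CompletionRequest is assumed to match that of RerankingRequest:

from scale_egp.sdk.types.models import CompletionRequest  # assumed import path

model_response = client.models().execute(
    id="<completion-model-id>",  # hypothetical id of a deployed COMPLETION model
    request=CompletionRequest(
        prompts=["Write one sentence about the ocean."],
        temperature=0.2,
        max_tokens=64,
    ),
)
# completions is a list of (prompt, completions) pairs per the schema above.
for prompt, generations in model_response.completions:
    print(prompt, generations)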

Chat Completion

Request schema for chat completion models.

Attributes:

- temperature (float): What sampling temperature to use, between [0, 1]. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. Setting temperature=0.0 will enable fully deterministic (greedy) sampling.
- stop_sequences (Optional[List[str]]): List of up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
- max_tokens (Optional[int]): The maximum number of tokens to generate in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length. If not specified, max_tokens will be determined based on the model used.
- messages (List[Message]): List of messages for the chat completion to consider when generating a response.

Response schema for chat completion models.

Attributes:

- message (Message): The generated message from the chat completion model.
- finish_reason (Optional[str]): The reason the chat completion finished.
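A CHAT_COMPLETION model follows the same pattern, as in this sketch. The model id is hypothetical, and the Message type, its role/content fields, and the import path are all assumptions:

from scale_egp.sdk.types.models import ChatCompletionRequest, Message  # assumed

model_response = client.models().execute(
    id="<chat-model-id>",  # hypothetical id of a deployed CHAT_COMPLETION model
    request=ChatCompletionRequest(
        messages=[Message(role="user", content="Name the largest continent.")],
        temperature=0.0,
        max_tokens=32,
    ),
)
# message is the generated Message per the schema above.
print(model_response.message)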

Agent

Request schema for agents. See the Execute Agent REST API for more information.

Attributes:

- memory_strategy (Optional[MemoryStrategy]): The memory strategy to use for the agent. A memory strategy is a way to prevent the underlying LLM's context limit from being exceeded. Each memory strategy uses a different technique to condense the input message list into a smaller payload for the underlying LLM.
- tools (List[Tool]): The list of specs of tools that the agent can use. Each spec must contain a name key set to the name of the tool, a description key set to the description of the tool, and an arguments key set to a JSON Schema compliant object describing the tool arguments.

  The name and description of each tool are used by the agent to decide when to use certain tools. Because some queries are complex and may require multiple tools to complete, it is important to make these descriptions as informative as possible. If a tool is not being chosen when it should be, it is common practice to tune the description of the tool to make it more apparent to the agent when the tool can be used effectively. A hypothetical tool spec is sketched after this list.

- messages (List[Message]): The list of messages in the conversation.
- instructions (Optional[str]): The initial instructions to provide to the agent. Use this to guide the agent to act in more specific ways. For example, if you want the agent to always use certain tools before others, you can write that rule in these instructions.

  Good prompt engineering is crucial to getting performant results from the agent. If you are having trouble getting the agent to perform well, try writing more specific instructions here before trying more expensive techniques such as swapping in other models or finetuning the underlying LLM.
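As a purely illustrative sketch, a tool spec following the structure described above might look like the following; the tool's name, description, and argument schema are all hypothetical:

# Hypothetical tool spec. Only the name/description/arguments structure is
# prescribed by the schema above; the values are invented for illustration.
weather_tool = {
    "name": "get_current_weather",
    "description": (
        "Returns the current weather for a given city. Use this whenever "
        "the user asks about weather conditions."
    ),
    # JSON Schema compliant object describing the tool's arguments.
    "arguments": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "Name of the city"}
        },
        "required": ["city"],
    },
}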

Response schema for agents. See the Execute Agent REST API for more information.

Attributes:

- action (AgentAction): The action that the agent performed.
- context (ActionContext): Context object containing the output payload. This contains a key for every action that the agent can perform. However, only the key corresponding to the action that the agent performed will have a populated value; the rest of the values will be null.