Custom Models¶
Warning
The Model API is currently in Beta. The API and CLI commands described in this guide are subject to change.
Models can be deployed with SGP using the SDK or through the `scale-egp` CLI, which is bundled together with the Python SDK. If you'd like to use the `scale-egp` CLI, please install it if you haven't done so yet.
Note
All CLI commands below assume that the `EGP_API_KEY` environment variable has been set to your SGP API key.
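For example, in a POSIX shell:

export EGP_API_KEY="<your-sgp-api-key>"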
Model Template vs Model¶
Custom models are created using the `EGPClient().models().create()` method. However, notice that this method requires a `model_template_id` parameter. So, before a model can be created, a model template must first be registered.
Let's distinguish between a Model Template and a Model.
A Model Template is a static configuration for a model that defines how it should be loaded and executed, whereas a Model is an actual running instance of a model that can be invoked with requests. When the Model is created, the configuration of the model template can be used to reserve the required computing resources, pull the correct docker image, etc.
A Model Template, by definition, is static and incomplete. It expects certain parameters to be provided when a Model is created. For example, a Model Template may define a docker image that loads and executes the model, but it will not specify where the model weights are loaded from. This information is provided when a Model is created.
Deploy a Model¶
To deploy and execute a new model, follow these steps:
Create a Model Template (Optional)¶
If an SGP Model Template already exists for the model you want to deploy, you can skip this step and go directly to Create a Model.
To list all available model templates that you can use, run the following code:
from scale_egp.sdk.client import EGPClient
client = EGPClient(api_key="...", account_id="...")
model_templates = client.model_templates().list()
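Each returned template carries the fields shown in the CLI output later in this guide; a quick way to inspect them (attribute names assumed from that output):

for template in model_templates:
    print(template.id, template.name, template.model_type)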
via the CLI
To use the `scale-egp` CLI to list all available model templates, run:
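Assuming the CLI follows the same subcommand pattern as the `model_template create` invocation shown later in this guide, the listing command would be:

> scale-egp model_template list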
To demonstrate how a model template works, we will deploy a cross encoder model that can be used for reranking. In this example we have already created a docker image that loads and executes the model. We have also tested this docker image on a machine with a specific hardware configuration, so we will directly specify the hardware configuration in the model template.
Note
While the Model API is in Beta, we will not go into detail about how to create this docker image because the creation of Model Templates will initially be reserved for power users. However, we will release these guides once the Model API is in General Availability.
We can use the following SDK code to create a model template.
from scale_egp.sdk.client import EGPClient
from scale_egp.sdk.enums import ModelEndpointType, ModelType
from scale_egp.sdk.types.model_templates import (
    ModelBundleConfig,
    LaunchVendorConfiguration,
    ModelEndpointConfig,
)
from scale_egp.sdk.types.models import ParameterSchemaField, ParameterSchema
client = EGPClient(api_key="...", account_id="...")
model_template = client.model_templates().create(
    name="custom_reranking_model_template",
    endpoint_type=ModelEndpointType.ASYNC,
    model_type=ModelType.RERANKING,
    vendor_configuration=LaunchVendorConfiguration(
        bundle_config=ModelBundleConfig(
            registry="docker.io",
            image="egp-test/sentence-transformer-reranker",
            tag="2023-12-08-1820",
            env={"DEVICE": "cuda"},
            readiness_initial_delay_seconds=120,
        ),
        endpoint_config=ModelEndpointConfig(
            cpus=1,
            memory="8Gi",
            storage="16Gi",
            gpus=1,
            min_workers=0,
            max_workers=1,
            per_worker=10,
            gpu_type="nvidia-ampere-a10",
            endpoint_type=ModelEndpointType.ASYNC,
            high_priority=False,
        ),
    ),
    # These parameters can be set by the user who creates a Model from this
    # Model Template to deploy a live model endpoint.
    model_creation_parameters_schema=ParameterSchema(
        parameters=[
            ParameterSchemaField(
                name="MODEL_BUCKET",
                type="str",
                description="S3 bucket containing model files",
                required=False,
            ),
            ParameterSchemaField(
                name="MODEL_PATH",
                type="str",
                description="S3 path within bucket containing model files",
                required=False,
            ),
        ]
    ),
)
via the CLI
Warning
This feature is currently in beta, and the format of the model template files is likely to change.
To use the `scale-egp` CLI to create a model template equivalent to the one above, save the following JSON as `model-template.json`, and then run the create command shown after it:
# model-template.json
{
  "name": "custom_reranking_model_template",
  "endpoint_type": "ASYNC",
  "model_type": "RERANKING",
  "vendor_configuration": {
    "vendor": "LLMENGINE",
    "bundle_config": {
      "registry": "docker.io",
      "image": "egp-test/sentence-transformer-reranker",
      "tag": "2023-12-08-1820",
      "env": {
        "DEVICE": "cuda"
      },
      "readiness_initial_delay_seconds": 120
    },
    "endpoint_config": {
      "cpus": 1,
      "memory": "8Gi",
      "storage": "16Gi",
      "gpus": 1,
      "min_workers": 0,
      "max_workers": 1,
      "per_worker": 10,
      "gpu_type": "nvidia-ampere-a10",
      "endpoint_type": "ASYNC",
      "high_priority": false
    }
  },
  "model_creation_parameters_schema": {
    "parameters": [
      {
        "name": "MODEL_BUCKET",
        "type": "str",
        "description": "S3 bucket containing model files",
        "required": false
      },
      {
        "name": "MODEL_PATH",
        "type": "str",
        "description": "S3 path within bucket containing model files",
        "required": false
      }
    ]
  }
}
> scale-egp model_template create model-template.json

This command outputs the id of the newly created model template, in our case `39d2f5fd-482d-412a-862c-1c5a593112f0`:

Output of model_template create
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ field                            ┃ value                                                                ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ name                             │ custom_reranking_model_template                                      │
│ endpoint_type                    │ ASYNC                                                                │
│ model_type                       │ RERANKING                                                            │
│ vendor_configuration             │ {                                                                    │
│                                  │   "vendor": "LLMENGINE",                                             │
│                                  │   "bundle_config": {                                                 │
│                                  │     "registry": "docker.io",                                         │
│                                  │     "image": "egp-test/sentence-transformer-reranker",               │
│                                  │     "tag": "2023-12-08-1820",                                        │
│                                  │     "command": [],                                                   │
│                                  │     "env": {                                                         │
│                                  │       "DEVICE": "cuda"                                               │
│                                  │     },                                                               │
│                                  │     "readiness_initial_delay_seconds": 120                           │
│                                  │   },                                                                 │
│                                  │   "endpoint_config": {                                               │
│                                  │     "cpus": 1,                                                       │
│                                  │     "memory": "8Gi",                                                 │
│                                  │     "storage": "16Gi",                                               │
│                                  │     "gpus": 1,                                                       │
│                                  │     "min_workers": 0,                                                │
│                                  │     "max_workers": 1,                                                │
│                                  │     "per_worker": 10,                                                │
│                                  │     "gpu_type": "nvidia-ampere-a10",                                 │
│                                  │     "endpoint_type": "ASYNC",                                        │
│                                  │     "high_priority": false                                           │
│                                  │   }                                                                  │
│                                  │ }                                                                    │
│ model_creation_parameters_schema │ {                                                                    │
│                                  │   "parameters": [                                                    │
│                                  │     {                                                                │
│                                  │       "name": "MODEL_BUCKET",                                        │
│                                  │       "type": "str",                                                 │
│                                  │       "description": "S3 bucket containing model files",             │
│                                  │       "required": false                                              │
│                                  │     },                                                               │
│                                  │     {                                                                │
│                                  │       "name": "MODEL_PATH",                                          │
│                                  │       "type": "str",                                                 │
│                                  │       "description": "S3 path within bucket containing model files", │
│                                  │       "required": false                                              │
│                                  │     }                                                                │
│                                  │   ]                                                                  │
│                                  │ }                                                                    │
│ id                               │ 39d2f5fd-482d-412a-862c-1c5a593112f0                                 │
│ created_at                       │ 2023-12-23 11:29:38.232825                                           │
│ account_id                       │ 5aec8217-7b07-4564-92da-d62f7f15e800                                 │
│ created_by_user_id               │ 5aec8217-7b07-4564-92da-d62f7f15e800                                 │
└──────────────────────────────────┴──────────────────────────────────────────────────────────────────────┘
Keep track of the `id` of the model template that was created. We will need it in the next step.
Create a Model¶
Now that we have a model template, we can create a model from it.
Coming Soon
Commands for intelligently managing model resources in a more cost-effective way will be available in the SGP SDK and the `scale-egp` command line utility soon. Until then, we recommend setting `min_workers` to 0 during testing so that idle models scale down to use no resources. You can then raise the minimum number of workers if you use the model in production and need to ensure that it is always available.
The following code creates a model from the model template created in the previous step:
model = client.models().create(
    name="my_reranking_model",
    model_template_id=model_template.id,
    model_creation_parameters={
        "MODEL_BUCKET": "my-bucket",
        "MODEL_PATH": "finetuned-cross-encoder_ms-marco-MiniLM-L-12-v2.tar.gz"
    },
)
via the CLI
Warning
This feature is currently in beta, and the format of the model files is likely to change.
To use the `scale-egp` CLI to create a model equivalent to the one above, save the following JSON as `model.json`, and then run the create command shown after it.
Note that the model template defines two model creation parameters, the S3 bucket and the path within that bucket for the model's weights files, so we need to set both in our `model.json`.
# model.json
{
  "name": "my_reranking_model",
  "model_template_id": "39d2f5fd-482d-412a-862c-1c5a593112f0",
  "model_creation_parameters": {
    "MODEL_BUCKET": "my-bucket",
    "MODEL_PATH": "finetuned-cross-encoder_ms-marco-MiniLM-L-12-v2.tar.gz"
  }
}
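Assuming the same subcommand pattern as the `model_template create` invocation shown earlier:

> scale-egp model create model.json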
The command executes and returns the newly created model's id, in our case `205b9bfb-287d-4097-bd45-ef7cf84cc996`.
It can take several minutes for the model to be deployed depending on hardware availability, docker image size, model size, etc.
To check the model's deployment status, we can poll the status of the model by its id:
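A minimal SDK polling sketch might look like the following; note that the `get` accessor and the `status` attribute are assumptions, not confirmed SDK API:

import time

# Poll until the deployment is ready (accessor and attribute names assumed).
model = client.models().get(id=model.id)
while model.status != "READY":
    time.sleep(30)
    model = client.models().get(id=model.id)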
via the CLI
The `scale-egp` CLI can also be used to get the model's status using the get or describe commands:
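For example, with the model id returned above:

> scale-egp model get 205b9bfb-287d-4097-bd45-ef7cf84cc996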
When the status is `READY`, we can move on to executing the model.
Execute a Model¶
Once a model is created, it can be invoked with requests. The following code executes the model created in the previous step:
from scale_egp.sdk.types.models import RerankingRequest
model_response = client.models().execute(
    id=model.id,
    # The RerankingRequest type is required since the model's type is RERANKING.
    request=RerankingRequest(
        query="What's the name of the largest continent?",
        chunks=[
            "I like to have milk with cereals for breakfast.",
            "The largest continent is called Asia.",
            "The largest country is called Ibiza.",
            "Asia has the highest population.",
        ],
    ),
)
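The response follows the RerankingResponse schema described below, so the returned scores line up one-to-one with the input chunks:

# One relevance score per input chunk, in the same order as the input.
print(model_response.chunk_scores)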
Because this is a reranking model, you can also use this custom model directly in the Chunk Rank API. In the SDK, this looks like the following:
from scale_egp.sdk.types.chunks import ModelRankParams, ModelRankStrategy
model_response = client.chunks().rank(
    query="What's the name of the largest continent?",
    relevant_chunks=[
        "I like to have milk with cereals for breakfast.",
        "The largest continent is called Asia.",
        "The largest country is called Ibiza.",
        "Asia has the highest population.",
    ],
    rank_strategy=ModelRankStrategy(
        params=ModelRankParams(
            model_id=model.id,
        ),
    ),
    top_k=3,
)
The request and response schemas for each model type are listed and described below.
via the CLI
Warning
This feature is currently in beta, and the format of the model request files is likely to change.
To use the `scale-egp` CLI to execute a model equivalent to the one above, first save the following JSON as `model-request.json`.
# model-request.json
{
  "query": "What's the name of the largest continent?",
  "chunks": [
    "I like to have milk with cereals for breakfast.",
    "The largest continent is called Asia.",
    "The largest country is called Ibiza.",
    "Asia has the highest population."
  ]
}
Before executing the model, first validate that your model request schema is correct using the following command:
If the model request passes validation, you can then execute the model:
> scale-egp model execute 205b9bfb-287d-4097-bd45-ef7cf84cc996 model-request.json
Model Types and Schemas¶
When creating and executing a model, its request and response payload schema must match predefined schemas for supported SGP model types. These types and schemas are listed below in tabular format.
You can also use the `scale-egp` CLI to view these schemas in JSON-schema format in your command line using the following command:
Here is a list of each model type and links to their corresponding request and response schemas.
| Model Type | Request Schema | Response Schema |
|---|---|---|
| EMBEDDING | EmbeddingRequest | EmbeddingResponse |
| RERANKING | RerankingRequest | RerankingResponse |
| COMPLETION | CompletionRequest | CompletionResponse |
| CHAT_COMPLETION | ChatCompletionRequest | ChatCompletionResponse |
| AGENT | AgentRequest | AgentResponse |
Embedding¶
Request schema for embedding models.
Attributes:

| Name | Type | Description |
|---|---|---|
| texts | List[str] | List of texts to get embeddings for. |
Response schema for embedding models.
Attributes:

| Name | Type | Description |
|---|---|---|
| embeddings | List[Tuple[str, List[float]]] | List of text, embedding pairs. |
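As a sketch, a request for this model type could be built as follows; the import path is an assumption, mirroring the RerankingRequest import used earlier in this guide:

from scale_egp.sdk.types.models import EmbeddingRequest  # assumed import path

# One embedding is returned per input text.
request = EmbeddingRequest(
    texts=[
        "What's the name of the largest continent?",
        "The largest continent is called Asia.",
    ],
)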
Reranking¶
Request schema for reranking models.
Attributes:

| Name | Type | Description |
|---|---|---|
| query | str | Query to rerank chunks against in order of relevance. |
| chunks | List[str] | List of chunks to rerank. |
Response schema for reranking models.
Attributes:

| Name | Type | Description |
|---|---|---|
| chunk_scores | List[float] | List of scores for each chunk in the same order as the input chunks. |
Completion¶
Request schema for completion models.
Attributes:

| Name | Type | Description |
|---|---|---|
| temperature | float | What sampling temperature to use, between [0, 1]. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. Setting temperature=0.0 will enable fully deterministic (greedy) sampling. |
| stop_sequences | Optional[List[str]] | List of up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence. |
| max_tokens | Optional[int] | The maximum number of tokens to generate in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length. If not specified, max_tokens will be determined based on the model used. |
| prompts | List[str] | List of prompts to generate completions for. |
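For illustration, a request built from these fields might look like this; the import path is an assumption, mirroring the RerankingRequest import used earlier in this guide:

from scale_egp.sdk.types.models import CompletionRequest  # assumed import path

request = CompletionRequest(
    prompts=["Write a one-sentence summary of the water cycle."],
    temperature=0.2,          # mostly deterministic sampling
    max_tokens=64,            # cap the generated length
    stop_sequences=["\n\n"],  # stop at the first blank line
)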
Response schema for completion models.
Attributes:

| Name | Type | Description |
|---|---|---|
| completions | List[Tuple[str, List[str]]] | List of prompt, completion pairs. |
| finish_reason | Optional[str] | The reason the completion finished. |
Chat Completion¶
Request schema for chat completion models.
Attributes:

| Name | Type | Description |
|---|---|---|
| temperature | float | What sampling temperature to use, between [0, 1]. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. Setting temperature=0.0 will enable fully deterministic (greedy) sampling. |
| stop_sequences | Optional[List[str]] | List of up to 4 sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence. |
| max_tokens | Optional[int] | The maximum number of tokens to generate in the completion. The token count of your prompt plus max_tokens cannot exceed the model's context length. If not specified, max_tokens will be determined based on the model used. |
| messages | List[Message] | List of messages for the chat completion to consider when generating a response. |
Response schema for chat completion models.
Attributes:

| Name | Type | Description |
|---|---|---|
| message | Message | The generated message from the chat completion model. |
| finish_reason | Optional[str] | The reason the chat completion finished. |
Agent¶
Request schema for agents. See the Execute Agent REST API for more information.
Attributes:

| Name | Type | Description |
|---|---|---|
| memory_strategy | Optional[MemoryStrategy] | The memory strategy to use for the agent. A memory strategy is a way to prevent the underlying LLM's context limit from being exceeded. Each memory strategy uses a different technique to condense the input message list into a smaller payload for the underlying LLM. |
| tools | List[Tool] | The list of specs of tools that the agent can use. Each spec must contain at minimum a name and a description. The name and description of each tool is used by the agent to decide when to use certain tools. Because some queries are complex and may require multiple tools to complete, it is important to make these descriptions as informative as possible. If a tool is not being chosen when it should, it is common practice to tune the description of the tool to make it more apparent to the agent when the tool can be used effectively. |
| messages | List[Message] | The list of messages in the conversation. |
| instructions | Optional[str] | The initial instructions to provide to the agent. Use this to guide the agent to act in more specific ways. For example, if you have specific rules you want to restrict the agent to follow, you can specify them here, such as a rule that the agent should always use certain tools before others. Good prompt engineering is crucial to getting performant results from the agent. If you are having trouble getting the agent to perform well, try writing more specific instructions here before trying more expensive techniques such as swapping in other models or finetuning the underlying LLM. |
Response schema for agents.
See the Execute Agent REST API for more information.
Attributes:

| Name | Type | Description |
|---|---|---|
| action | AgentAction | The action that the agent performed. |
| context | ActionContext | Context object containing the output payload. This will contain a key for all actions that the agent can perform. However, only the key corresponding to the action that the agent performed will have a populated value; the rest of the values will be null. |