This guide shows you how to deploy a Holo model as a real-time endpoint using the managed service Amazon SageMaker.
Prerequisites
Make sure you have subscribed to the model in AWS Marketplace. The notebook does not require a GPU; its purpose is to use the AWS API (boto3) to deploy the endpoint. Ensure that the selected IAM role has sufficient privileges. You may start with a role that has the AmazonSageMakerFullAccess policy attached, and check that its trust relationship policy allows the action sts:AssumeRole for the service principal sagemaker.amazonaws.com.
Step 1: Install required Python dependencies
Install the required packages and import the libraries used in the rest of this guide.

Step 2: Set up the SageMaker session and client
Set up a SageMaker session and client so you can connect to AWS and run your models.

Step 3: Select the Holo model package
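Your Marketplace subscription exposes a per-region model package ARN. The ARN and helper below are illustrative placeholders; copy the real ARN from your subscription's listing page:

```python
# Placeholder ARN; replace with the real one from your Marketplace subscription.
MODEL_PACKAGE_ARN = (
    "arn:aws:sagemaker:us-east-1:123456789012:model-package/holo-1-placeholder"
)

def model_package_for(region: str, arn_by_region: dict) -> str:
    """Look up the per-region model package ARN, failing loudly if unmapped."""
    try:
        return arn_by_region[region]
    except KeyError:
        raise ValueError(f"Model package not available in region {region!r}")
```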
Choose your Holo model package.

Step 4: Deploy Holo
Deploy a SageMaker real-time endpoint hosted on a GPU instance. For general information on real-time inference with Amazon SageMaker, refer to the SageMaker documentation. The deployed endpoint leverages vLLM serve and therefore supports the OpenAI APIs, exposing the v1/chat/completions endpoint.
Step 4a. Define the endpoint configuration
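A sketch of the endpoint configuration, assuming hypothetical resource names and a single ml.g5.2xlarge GPU instance (adjust the instance type and count to your needs). The create calls are shown commented out because they require valid credentials and your real model package ARN:

```python
MODEL_NAME = "holo-1-model"                      # hypothetical resource names
ENDPOINT_CONFIG_NAME = "holo-1-endpoint-config"

def build_production_variant(model_name: str,
                             instance_type: str = "ml.g5.2xlarge",
                             instance_count: int = 1) -> dict:
    """Describe how SageMaker hosts the model: one GPU instance by default."""
    return {
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        "InstanceType": instance_type,
        "InitialInstanceCount": instance_count,
    }

# With credentials and the real model package ARN, register the model and the
# endpoint configuration (sagemaker_client is the boto3 client from Step 2):
# sagemaker_client.create_model(
#     ModelName=MODEL_NAME,
#     ExecutionRoleArn=role_arn,
#     Containers=[{"ModelPackageName": MODEL_PACKAGE_ARN}],
#     EnableNetworkIsolation=True,  # typically required for Marketplace models
# )
# sagemaker_client.create_endpoint_config(
#     EndpointConfigName=ENDPOINT_CONFIG_NAME,
#     ProductionVariants=[build_production_variant(MODEL_NAME)],
# )
```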
Step 4b. Create the endpoint
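Endpoint creation is asynchronous. A sketch using the hypothetical names above, with the boto3 waiter blocking until the endpoint reports InService (this can take several minutes):

```python
ENDPOINT_NAME = "holo-1-endpoint"  # hypothetical endpoint name

def is_in_service(description: dict) -> bool:
    """True once describe_endpoint reports the endpoint ready for traffic."""
    return description.get("EndpointStatus") == "InService"

# With credentials, create the endpoint from the Step 4a configuration and
# block until it is in service:
# sagemaker_client.create_endpoint(
#     EndpointName=ENDPOINT_NAME,
#     EndpointConfigName=ENDPOINT_CONFIG_NAME,
# )
# sagemaker_client.get_waiter("endpoint_in_service").wait(EndpointName=ENDPOINT_NAME)
# assert is_in_service(sagemaker_client.describe_endpoint(EndpointName=ENDPOINT_NAME))
```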
Step 5: Run an example
The endpoint is now in service. You can use the SageMaker invoke_endpoint API to perform real-time inference on the deployed Holo-1 model.
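A sketch of an invocation, assuming a hypothetical endpoint name and the OpenAI chat-completions request shape served by vLLM; the exact fields accepted (including any model field) depend on the container:

```python
import json

def build_chat_payload(prompt: str, max_tokens: int = 256) -> dict:
    """OpenAI-style chat-completions request body, as served by vLLM."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_payload("Locate the search button in this screenshot.")

# With credentials and an InService endpoint (sagemaker_runtime from Step 2):
# response = sagemaker_runtime.invoke_endpoint(
#     EndpointName="holo-1-endpoint",       # hypothetical endpoint name
#     ContentType="application/json",
#     Body=json.dumps(payload),
# )
# result = json.loads(response["Body"].read())
# print(result["choices"][0]["message"]["content"])
```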
- Using AWS SageMaker to invoke Holo-1 for a localization task
- Using AWS SageMaker to invoke Holo-1 for a navigation task