This guide shows you how to deploy Holo1 (3B or 7B) as a real-time endpoint using the managed service Amazon SageMaker.

Pre-requisites

Make sure you have already subscribed to the model in AWS Marketplace. The notebook does not require a GPU; its only purpose is to call the AWS API (through boto3) to deploy the endpoint. Ensure that the IAM role you use has sufficient privileges. You can start with a role that has the AmazonSageMakerFullAccess policy attached and whose trust relationship policy allows the action sts:AssumeRole for the service principal sagemaker.amazonaws.com.
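The trust relationship described above can be sketched as a policy document. The snippet below is a minimal illustration, not an official template: the helper names build_trust_policy and allows_assume_role are hypothetical, and the check is deliberately simplified (it ignores wildcards and Condition blocks).

```python
def build_trust_policy(service_principal: str = "sagemaker.amazonaws.com") -> dict:
    """Return a trust policy document that lets the given service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": service_principal},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def _as_list(value):
    """IAM allows a single value or a list in several fields; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def allows_assume_role(policy: dict, service_principal: str = "sagemaker.amazonaws.com") -> bool:
    """Simplified check: is there an Allow statement for sts:AssumeRole by this service?"""
    for statement in _as_list(policy.get("Statement")):
        if statement.get("Effect") != "Allow":
            continue
        actions = _as_list(statement.get("Action"))
        principal = statement.get("Principal", {})
        services = _as_list(principal.get("Service")) if isinstance(principal, dict) else []
        if "sts:AssumeRole" in actions and service_principal in services:
            return True
    return False
```

To check an existing role, you could pass the role's trust policy document (for example, the AssumeRolePolicyDocument field returned by the IAM get_role call) to allows_assume_role.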

Step 1: Install required Python dependencies

Use the following code to install the required packages and import the necessary libraries:
!pip install -qU boto3 sagemaker
import boto3
from datetime import datetime
from typing import Literal
import sagemaker

Step 2: Set up the SageMaker session and client

Set up a SageMaker session and a runtime client so you can connect to AWS and invoke your model.
# Specify the profile name to use for invoking the model
AWS_PROFILE: str | None = None

session = boto3.Session(profile_name=AWS_PROFILE)
sm_session = sagemaker.Session(boto_session=session)
sm_rt_client = session.client("sagemaker-runtime")
sm_exec_role = "<execution role allowed to deploy sagemaker model>"  # or sagemaker.get_execution_role(sagemaker_session=sm_session)

Step 3: Select Holo1 model package

Choose your Holo1 model package (either holo1-3b or holo1-7b).
# Select model holo1-3b or holo1-7b
MODEL_NAME: Literal["holo1-3b", "holo1-7b"] = "holo1-3b"
# Select a region where G6e instance family is available
HOLO1_MODEL_PACKAGES = {
    "holo1-3b": {
        "ap-northeast-1": "arn:aws:sagemaker:ap-northeast-1:977537786026:model-package/holo1-3b-20250521-5ce382a175493f1ab5666f65ee4774b7",  # Tokyo
        "ap-northeast-2": "arn:aws:sagemaker:ap-northeast-2:745090734665:model-package/holo1-3b-20250521-5ce382a175493f1ab5666f65ee4774b7",  # Seoul
        "eu-central-1": "arn:aws:sagemaker:eu-central-1:446921602837:model-package/holo1-3b-20250521-5ce382a175493f1ab5666f65ee4774b7",  # Frankfurt
        "eu-north-1": "arn:aws:sagemaker:eu-north-1:136758871317:model-package/holo1-3b-20250521-5ce382a175493f1ab5666f65ee4774b7",  # Stockholm
        "us-east-1": "arn:aws:sagemaker:us-east-1:865070037744:model-package/holo1-3b-20250521-5ce382a175493f1ab5666f65ee4774b7",  # N. Virginia
        "us-east-2": "arn:aws:sagemaker:us-east-2:057799348421:model-package/holo1-3b-20250521-5ce382a175493f1ab5666f65ee4774b7",  # Ohio
        "us-west-2": "arn:aws:sagemaker:us-west-2:594846645681:model-package/holo1-3b-20250521-5ce382a175493f1ab5666f65ee4774b7",  # Oregon
    },
    "holo1-7b": {
        "ap-northeast-1": "arn:aws:sagemaker:ap-northeast-1:977537786026:model-package/holo1-7b-20250521-9e6a3648689635a9a554de600c864e48",  # Tokyo
        "ap-northeast-2": "arn:aws:sagemaker:ap-northeast-2:745090734665:model-package/holo1-7b-20250521-9e6a3648689635a9a554de600c864e48",  # Seoul
        "eu-central-1": "arn:aws:sagemaker:eu-central-1:446921602837:model-package/holo1-7b-20250521-9e6a3648689635a9a554de600c864e48",  # Frankfurt
        "eu-north-1": "arn:aws:sagemaker:eu-north-1:136758871317:model-package/holo1-7b-20250521-9e6a3648689635a9a554de600c864e48",  # Stockholm
        "us-east-1": "arn:aws:sagemaker:us-east-1:865070037744:model-package/holo1-7b-20250521-9e6a3648689635a9a554de600c864e48",  # N. Virginia
        "us-east-2": "arn:aws:sagemaker:us-east-2:057799348421:model-package/holo1-7b-20250521-9e6a3648689635a9a554de600c864e48",  # Ohio
        "us-west-2": "arn:aws:sagemaker:us-west-2:594846645681:model-package/holo1-7b-20250521-9e6a3648689635a9a554de600c864e48",  # Oregon
    },
}

# Fail early with a proper exception if the client region has no model package
if session.region_name not in HOLO1_MODEL_PACKAGES[MODEL_NAME]:
    raise ValueError(
        f"The region {session.region_name!r} does not support the {MODEL_NAME} model package. "
        "Please change your client region."
    )

holo1_model_package = HOLO1_MODEL_PACKAGES[MODEL_NAME][session.region_name]

Step 4: Deploy Holo1

Deploy a SageMaker real-time endpoint hosted on a GPU instance. For general information on real-time inference with Amazon SageMaker, refer to the SageMaker documentation. The deployed endpoint runs vLLM serve and therefore supports the OpenAI API, exposing the v1/chat/completions endpoint.

Step 4a. Define the endpoint configuration

INSTANCE_TYPE = "ml.g6e.4xlarge"
# timeout (in seconds) for downloading the model data from S3
MODEL_DATA_DOWNLOAD_TIMEOUT = 1200
# timeout (in seconds) before the container must be ready to serve requests
CONTAINER_STARTUP_HEALTH_CHECK_TIMEOUT = 1200

Step 4b. Create the endpoint

# create a deployable model from the model package.
model = sagemaker.ModelPackage(
    role=sm_exec_role,
    model_package_arn=holo1_model_package,
    sagemaker_session=sm_session,
)

# create a unique endpoint name
timestamp = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
endpoint_name = f"{MODEL_NAME}-{timestamp}"
print(f"Deploying endpoint {endpoint_name}")

# deploy the model
response = model.deploy(
    initial_instance_count=1,
    instance_type=INSTANCE_TYPE,
    endpoint_name=endpoint_name,
    model_data_download_timeout=MODEL_DATA_DOWNLOAD_TIMEOUT,
    container_startup_health_check_timeout=CONTAINER_STARTUP_HEALTH_CHECK_TIMEOUT,
)
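
If you want to check the endpoint status independently of the deploy call (for example, from another notebook), a minimal sketch follows. It assumes a boto3 SageMaker client, which this guide does not create itself; the helper name get_endpoint_status is hypothetical.

```python
def get_endpoint_status(sm_client, endpoint_name: str) -> str:
    """Return the current status of a SageMaker endpoint, e.g. "Creating" or "InService"."""
    response = sm_client.describe_endpoint(EndpointName=endpoint_name)
    return response["EndpointStatus"]


def is_in_service(sm_client, endpoint_name: str) -> bool:
    """True once the endpoint is ready to serve requests."""
    return get_endpoint_status(sm_client, endpoint_name) == "InService"
```

With the session from Step 2, the client could be created with session.client("sagemaker") and passed in together with endpoint_name.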

Step 5: Run an example

Once the endpoint is in service, you can use the SageMaker invoke_endpoint API to perform real-time inference on the deployed Holo1 model.

Step 6: Clean-up

Now that you have successfully performed a real-time inference, you no longer need the endpoint. Run the cells below to delete all resources and avoid unnecessary charges.
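sm_session.delete_endpoint(endpoint_name)
sm_session.delete_endpoint_config(endpoint_name)  # by default the SDK names the endpoint configuration after the endpoint
model.delete_model()  # the model name generated during deploy() is stored on the model object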