How to Deploy a Pytorch Model on SageMaker

Serving Pytorch model inferences with Sagemaker

James Samuel
5 min readMar 21, 2020


Credit: Created by me with

In this tutorial, you’ll learn how to deploy a Pytorch model on AWS cloud infrastructure.

In my previous post, we saw how to train an image classifier that is capable of recognizing images of objects in 10 categories. Today, we’ll deploy the model from the training job on SageMaker hosting services (Pytorch model server) for real-time inferencing. The full source code for this tutorial can be found on this Github repository.

SageMaker is a fully managed service from AWS that allows you to build, train, and deploy ML models into a production-ready environment at scale. It completely abstracts the complexity of running machine learning models in production away.

First, before AWS SageMaker hosting services can serve your model, you have to upload your model artifacts to an S3 bucket where SageMaker can access it.

Upload Model Artifacts to S3

Create an S3 bucket to host your Gzip compressed model artifacts and ensure that you grant SageMaker permissions to access it.

To keep things simple, we’ll stick to SageMakerFullAccess policy. SageMakerFullAccess policy is a managed policy that includes all the necessary permissions required to perform most actions on SageMaker. This policy is very handy when working on SageMaker as it defines most permissions you would need to work with SageMaker easily.

When using SageMakerFullAccess policy, you should ensure that your bucket name contains sagemaker and should be in the same region as your notebook instance.

We can upload the compressed Pytorch model to a bucket using an AWS CLI command:

aws s3 cp ./model.tar.gz s3://pytorch-sagemaker-example

Create an Inference Handler Script

SageMaker Pytorch model server allows you to configure how you deserialized your saved model (model.pth) and how you transform request calls to inference calls on the loaded model.

# filename: inference.pydef model_fn(model_dir):
def input_fn(request_body, request_content_type):
def predict_fn(input_data, model):
def output_fn(prediction, content_type):


For model loading, SageMaker calls model_fn which loads the model from the directory specified and returns a deserialized Pytorch model. We can implement the function as:

def model_fn(model_dir):
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')'Loading the model.')
model = models.resnet50(pretrained=False)
fc_inputs = model.fc.in_features

model.fc = nn.Sequential(
nn.Linear(fc_inputs, 2048),
nn.Linear(2048, 10),

with open(os.path.join(model_dir, 'model_0.pth'), 'rb') as f:
model.load_state_dict(torch.load(f))'Done loading model')
return model

After loading the model, the next step is request handling, which is broken down into three phases.

  • Input processing
  • Prediction
  • Output processing

Each step above has a dedicated function that must be implemented in


The input_fn takes request data and deserializes it into an object ready for prediction. It receives two arguments when a SageMaker endpoint is Invoked: request body and the content type.

def input_fn(request_body, content_type='application/json'):'Deserializing the input data.')
if content_type == 'application/json':
input_data = json.loads(request_body)
url = input_data['url']'Image url: {url}')
image_data =, stream=True).raw)

image_transform = transforms.Compose([
transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])

return image_transform(image_data)
raise Exception(f'Requested unsupported ContentType in content_type {content_type}')

The code above receives a JSON object, which contains the URL of the image to predict and the content type as arguments. We perform some transformation on the image and returns a transformed tensor object ready to be passed to predict_fn.


This function takes the transformed input object from input_fn and performs an inference against the loaded model.

def predict_fn(input_data, model):'Generating prediction based on input parameters.')
if torch.cuda.is_available():
input_data = input_data.view(1, 3, 224, 224).cuda()
input_data = input_data.view(1, 3, 224, 224)
with torch.no_grad():
out = model(input_data)
ps = torch.exp(out)
return ps


The output_fn serializes the output of the prediction according to the response content type and maps the predicted outputs to classes of objects used in training the model.

def output_fn(prediction_output, accept='application/json'):'Serializing the generated output.')
classes = {0: 'Leopards', 1: 'airplanes', 2: 'butterfly', 3: 'camera', 4: 'elephant', 5: 'lamp', 6: 'rhino',
7: 'umbrella', 8: 'watch'}

topk, topclass = prediction_output.topk(3, dim=1)
result = []

for i in range(3):
pred = {'prediction': classes[topclass.cpu().numpy()[0][i]], 'score': f'{topk.cpu().numpy()[0][i] * 100}%'}'Adding pediction: {pred}')

if accept == 'application/json':
return json.dumps(result), accept
raise Exception(f'Requested unsupported ContentType in Accept:{accept}')

Having specified how the SageMaker model server should load and serve the model, we create a new Jupiter notebook in the same directory as with the code below and execute it to provision resources.

#filename deploy.ipynbfrom sagemaker.pytorch import PyTorchModel
from sagemaker import get_execution_role
role = get_execution_role()# You can also configure a sagemaker role and reference it by its name.
# role = "CustomSageMakerRoleName"
pytorch_model = PyTorchModel(model_data='s3://pytorch-sagemaker-example/model.tar.gz', role=role, entry_point='', framework_version='1.3.1')predictor = pytorch_model.deploy(instance_type='ml.t2.medium', initial_instance_count=1)

model_data : A path to the compressed, saved Pytorch model on S3
role: An IAM role name or arn for SageMaker to access AWS resources on your behalf.
entry_point: Path a to the python script created earlier as the entry point to the model hosting
instance_type: Type of EC2 instance to use for inferencing.

At this point, you will have two files: and deploy.ipynb in the Jupyter notebook instance, as shown in the image below.

If you open Deploy.ipynb notebook and execute the notebook by choosing Run All from the cell menu, it will deploy the model, as well as the endpoint.

On successful deployment, you can make real-time predictions using InvokeEndpoint by sending a JSON object with a url of image to predict.

For example:


Everyday, I tweet about interesting things people are doing with machine learning and data science from the community; follow @hubofml on twitter to stay up to date.



James Samuel

scaling and growing software teams | Creator of @hubofml | Growing together @ blogging @