Understanding Amazon SageMaker: Built-in Algorithms for Machine Learning
Introduction
I held a study meeting titled “Amazon SageMaker Introduction - Try Machine Learning with Built-in Algorithms”.
In this blog post, I aim to share the presentation content, offering insights into Amazon SageMaker and its powerful capabilities. You can access the example code used in this post from my GitHub repository.
Prerequisites
Target Readers
This post is intended for readers who:
- Are interested in Amazon SageMaker.
- Have a basic understanding of machine learning and AWS.
Goals
- Provide an overview of machine learning and the SageMaker ecosystem.
- Demonstrate supervised machine learning using SageMaker’s built-in algorithms.
Machine Learning Overview
Supervised Learning
Supervised learning is commonly used for classification and regression tasks. It relies on labeled training data, which can make dataset preparation labor-intensive.
Unsupervised Learning
Unsupervised learning focuses on tasks like clustering and dimensionality reduction. It works with unlabeled data, but its inference results are often hard to interpret.
Reinforcement Learning
Reinforcement learning combines statistical methods and psychological approaches, emphasizing actions that yield higher rewards. Training requires reward mechanisms and appropriate datasets.
Deep Learning
Deep learning utilizes multi-layered neural networks to achieve remarkable accuracy, often surpassing human capabilities. While feature selection is largely automated, the interpretability of results may be limited. It also demands substantial computational resources.
Neural Networks
The complexity of a neural network can be measured by its total number of weights. Below is an example with 4 input features, two hidden layers of 3 units each, and 1 output, for a total complexity of 24:
- 4 (Features) × 3 (Hidden Layer 1) = 12
- 3 (Hidden Layer 1) × 3 (Hidden Layer 2) = 9
- 3 (Hidden Layer 2) × 1 (Output) = 3
- Total: 12 + 9 + 3 = 24
| Value | Description |
|---|---|
| y | Output (Result) |
| x | Feature |
| w | Weight |
| h1, h2 | Hidden Layers |
| h1[0], h1[1], … h2[2] | Hidden Units |
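As a quick check of the computation above, here is a minimal Python sketch that derives the same weight count from the layer sizes:

layer_sizes = [4, 3, 3, 1]  # features, hidden layer 1, hidden layer 2, output
# Sum of products of consecutive layer sizes (bias terms are not counted)
total_weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
print(total_weights)  # 24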
Features
Features represent the attributes of the data. Rows are referred to as samples or data points, while columns are features. Feature engineering, including scaling, encoding, and preprocessing, enhances accuracy.
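For example, a minimal scikit-learn sketch of two common feature-engineering steps, scaling a numeric column and one-hot encoding a categorical one (the toy columns here are purely illustrative):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({'length': [5.1, 4.9, 6.3], 'color': ['red', 'blue', 'red']})
# Scaling: standardize the numeric column to zero mean and unit variance
scaled = StandardScaler().fit_transform(df[['length']])
# Encoding: convert the categorical column into one-hot vectors
encoded = OneHotEncoder().fit_transform(df[['color']]).toarray()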
Model Evaluation
k-Fold Cross Validation
This method splits the data into multiple subsets for training and testing, ensuring better generalization and reduced risk of overfitting.
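As a brief illustration with scikit-learn, using 5 folds on the same Iris dataset that appears later in this post:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# Each fold serves once as the test set while the other four train the model
scores = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=5)
print(scores.mean())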
Confusion Matrix
The confusion matrix provides metrics such as accuracy, precision, recall, and f-score, helping evaluate classification models comprehensively.
| Actual \ Prediction | Positive | Negative |
|---|---|---|
| Positive | True Positive (TP) | False Negative (FN) |
| Negative | False Positive (FP) | True Negative (TN) |
| Name | Expression | Description | Problem |
|---|---|---|---|
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Ratio of correct predictions | Misleading for imbalanced data: if 99 of 100 samples are negative, a model that predicts everything as negative still reaches 99% accuracy. |
| Precision | TP / (TP + FP) | Ratio of predicted positives that are truly positive | Precision alone is not enough when false negatives matter, such as in cancer diagnosis. |
| Recall | TP / (TP + FN) | Ratio of actual positives that are correctly predicted | Recall alone is not enough when false positives are costly, e.g. when a thorough follow-up examination is very expensive; predicting every sample as positive yields 100% recall. |
| f-score | 2 * (Precision * Recall) / (Precision + Recall) | Harmonic mean of precision and recall | Division by zero occurs when precision and recall are both zero, e.g. when there are no true positives at all. |
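All four metrics can be computed directly from the confusion matrix; a small scikit-learn sketch with toy labels:

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)                   # 2 1 1 4
print(accuracy_score(y_true, y_pred))   # (TP + TN) / total = 0.75
print(precision_score(y_true, y_pred))  # TP / (TP + FP) ≈ 0.67
print(recall_score(y_true, y_pred))     # TP / (TP + FN) ≈ 0.67
print(f1_score(y_true, y_pred))         # harmonic mean ≈ 0.67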
SageMaker Overview
Amazon SageMaker is a managed service for machine learning that streamlines the ML lifecycle. SageMaker supports 17 built-in algorithms as of Dec 2021, along with BYOM (Bring Your Own Model) capabilities.
Workflow and Ecosystem
The following diagram illustrates AWS services associated with the machine learning workflow. These services cover every stage, from data preparation to model deployment and monitoring.
Inference Endpoints
Amazon SageMaker offers several endpoint types to serve machine learning models based on specific requirements:
- SageMaker Hosting Services: Provides persistent endpoints that remain active, similar to EC2 instances, ensuring minimal latency for real-time inference.
- SageMaker Serverless Endpoints (Preview): A cost-efficient option with endpoints that incur charges only during use. They experience a cold start latency when idle.
- Asynchronous Inference: Ideal for batch processing or infrequent requests. SageMaker auto-scales the instance count to zero when there are no active requests, significantly reducing costs.
Each endpoint type caters to different use cases, enabling flexibility in deploying models efficiently.
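For instance, a serverless endpoint can be requested through the SageMaker Python SDK; a hedged sketch (the capacity values are illustrative, and the model object is assumed to already exist):

from sagemaker.serverless import ServerlessInferenceConfig

# Charged only while handling requests; scales to zero when idle
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=2048,  # memory allocated to each container
    max_concurrency=5,       # maximum concurrent invocations
)
# predictor = model.deploy(serverless_inference_config=serverless_config)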
SageMaker Studio
Amazon SageMaker offers several environments to suit different user requirements. This post primarily focuses on SageMaker Studio, a fully integrated development environment for machine learning. Below are the available environment options:
- SageMaker Studio / RStudio on SageMaker:
  - A comprehensive machine learning IDE.
  - Includes tools like SageMaker JumpStart, ideal for learning and quick experimentation.
- SageMaker Notebook Instances:
  - Standalone Jupyter Notebook instances.
  - Best suited for temporary or ad-hoc use by a small number of data scientists.
- Local Environment + AWS SDK + SageMaker SDK:
  - Allows developers to interact with SageMaker from their local setup.
  - Useful for small-scale development requiring flexibility.
- SageMaker Studio Lab:
  - Free access to some SageMaker Studio features without needing an AWS account.
  - Ideal for beginners exploring machine learning concepts.
Each option provides unique benefits, allowing users to choose the most suitable environment based on their project needs.
Onboard to SageMaker Domain
To use SageMaker Studio, you first need to complete the onboarding process.
Step 1: Choose Setup Type
- For a quick start, select Quick Setup.
- For production environments requiring advanced configurations, choose Custom Setup.
Step 2: Select a VPC
During the onboarding process, select a VPC appropriate for your setup.
Step 3: Launch SageMaker Studio
Launch SageMaker Studio from the menu of the automatically created user.
Step 4: Access SageMaker Studio
Once launched, the SageMaker Studio interface will appear.
AWS Design
Quick Setup
The Quick Setup is an easy way to get started. Learn more in the official documentation: Quick Setup Documentation
Custom Setup
The Custom Setup option is designed for production environments, allowing network traffic to remain within AWS’s internal network using VPC Endpoints. Learn more: Custom Setup Documentation
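As an illustrative boto3 sketch (all IDs and the region are placeholders), interface VPC endpoints for the SageMaker API and runtime can be created like this:

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')
# Interface endpoints keep SageMaker API and runtime traffic inside the VPC
for service in ('com.amazonaws.us-east-1.sagemaker.api',
                'com.amazonaws.us-east-1.sagemaker.runtime'):
    ec2.create_vpc_endpoint(
        VpcEndpointType='Interface',
        VpcId='vpc-xxxxxxxx',              # placeholder
        ServiceName=service,
        SubnetIds=['subnet-xxxxxxxx'],     # placeholder
        SecurityGroupIds=['sg-xxxxxxxx'],  # placeholder
    )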
Practical Example: Using SageMaker’s Built-in Algorithms
In this example, we use SageMaker’s k-NN algorithm with the popular Iris dataset.
To begin, click the Notebook button in the center of the SageMaker Studio screen.
Preparing Dataset
Start by setting up your environment variables, replacing <YOUR_S3_BUCKET> and <YOUR_SAGEMAKER_ROLE> with your specific values.
%env S3_DATASET_BUCKET=<YOUR_S3_BUCKET>
%env S3_DATASET_TRAIN=knn/input/iris_train.csv
%env S3_DATASET_TEST=knn/input/iris_test.csv
%env S3_TRAIN_OUTPUT=knn/output
%env SAGEMAKER_ROLE=<YOUR_SAGEMAKER_ROLE>
Next, create a cell with the following Python imports. The Python 3 (Data Science) instance comes with these libraries pre-installed.
import os
import random
import string
import boto3
import matplotlib.pyplot as plt
import pandas as pd
import sagemaker
from IPython.display import display
from sagemaker import image_uris
from sagemaker.deserializers import JSONDeserializer
from sagemaker.estimator import Estimator, Predictor
from sagemaker.inputs import TrainingInput
from sagemaker.serializers import CSVSerializer
from sklearn.model_selection import train_test_split
Define constants and variables.
# Define constants
CSV_PATH = './tmp/iris.csv'
S3_DATASET_BUCKET = os.getenv('S3_DATASET_BUCKET')
S3_DATASET_TRAIN = os.getenv('S3_DATASET_TRAIN')
S3_DATASET_TEST = os.getenv('S3_DATASET_TEST')
S3_TRAIN_OUTPUT = os.getenv('S3_TRAIN_OUTPUT')
SAGEMAKER_ROLE = os.getenv('SAGEMAKER_ROLE')
ESTIMATOR_INSTANCE_COUNT = 1
ESTIMATOR_INSTANCE_TYPE = 'ml.m5.large'
PREDICTOR_INSTANCE_TYPE = 'ml.t2.medium'
PREDICTOR_ENDPOINT_NAME = f'sagemaker-knn-{PREDICTOR_INSTANCE_TYPE}'.replace('.', '-')
# Define variables
bucket = boto3.resource('s3').Bucket(S3_DATASET_BUCKET)
train_df = None
test_df = None
train_object_path = None
test_object_path = None
knn = None
predictor = None
Download the Iris dataset from AWS’s SageMaker Examples repository. This dataset is widely used for demonstrating classification models due to its simplicity and well-defined structure. For additional details about the Iris dataset, including its attributes and applications, please refer to the official page.
!mkdir -p tmp
!curl -o "$(pwd)/tmp/iris.csv" -L https://raw.githubusercontent.com/aws/amazon-sagemaker-examples/master/hyperparameter_tuning/r_bring_your_own/iris.csv
After downloading, use the following code to load and preprocess the CSV.
- SageMaker requires the first column of the CSV to be the target label or class, so the Species column must be moved to the first position.
- The target labels (species names) must also be converted to integers for compatibility.
Refer to the SageMaker documentation for more details: CSV Format Requirements
def load_csv(path: str) -> pd.DataFrame:
# Load the CSV into a Pandas DataFrame
df = pd.read_csv(path)
# Move the label column ('Species') to the first position
df = df[['Species', 'Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width']]
# Convert target labels ('Species') to integers
df['Species'] = df['Species'].map({'setosa': 0, 'versicolor': 1, 'virginica': 2})
return df
This function processes the Iris dataset to meet SageMaker’s requirements, preparing it for training with built-in algorithms.
Visualize Dataset
Create a scatter plot of the dataset to understand its features and distributions.
def plot(df: pd.DataFrame) -> None:
pd.plotting.scatter_matrix(df, figsize=(15, 15), c=df['Species'])
plt.show()
The generated scatter matrix uses the following axis mappings:
- X-axis: Species, Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width, from left to right.
- Y-axis: Petal.Width, Petal.Length, Sepal.Width, Sepal.Length, and Species, from bottom to top.
Observing the plot, it seems likely that species can be predicted from the features, as the data points cluster into distinct groups.
Upload Dataset to S3
Upload the preprocessed dataset to S3:
def upload_csv_to_s3(df: pd.DataFrame, object_path: str) -> str:
filename = ''.join([random.choice(string.digits + string.ascii_lowercase) for i in range(10)])
path = os.path.abspath(os.path.join('./tmp', filename))
df.to_csv(path, header=False, index=False)
# Change content-type because the default is binary/octet-stream
bucket.upload_file(path, object_path, ExtraArgs={'ContentType': 'text/csv'})
return f's3://{bucket.name}/{object_path}'
Execute Preprocessing Steps
if __name__ == '__main__':
df = load_csv(CSV_PATH)
display(df)
plot(df)
train_df, test_df = train_test_split(df, shuffle=True, random_state=0)
train_object_path = upload_csv_to_s3(train_df, S3_DATASET_TRAIN)
test_object_path = upload_csv_to_s3(test_df, S3_DATASET_TEST)
Training
Configure the k-NN estimator and start the training process:
def get_estimator(**hyperparams) -> Estimator:
estimator = Estimator(
image_uri=image_uris.retrieve('knn', boto3.Session().region_name),
role=SAGEMAKER_ROLE,
instance_count=ESTIMATOR_INSTANCE_COUNT,
instance_type=ESTIMATOR_INSTANCE_TYPE,
input_mode='Pipe',
output_path=f's3://{S3_DATASET_BUCKET}/{S3_TRAIN_OUTPUT}',
sagemaker_session=sagemaker.Session(),
)
hyperparams.update({'predictor_type': 'classifier'})
estimator.set_hyperparameters(**hyperparams)
return estimator
def train(estimator: Estimator, train_object_path: str, test_object_path: str) -> None:
train_input = TrainingInput(train_object_path, content_type='text/csv', input_mode='Pipe')
test_input = TrainingInput(test_object_path, content_type='text/csv', input_mode='Pipe')
estimator.fit({'train': train_input, 'test': test_input})
if __name__ == '__main__':
knn = get_estimator(k=1, sample_size=1000)
train(knn, train_object_path, test_object_path)
ECR Container URI
The image_uri argument of the Estimator specifies the ECR container URI of the k-NN training algorithm provided by AWS. You can retrieve the appropriate URI for your region using SageMaker's image_uris.retrieve utility function.
For detailed information about the container URIs for built-in algorithms, refer to the official documentation: Amazon SageMaker Built-in Algorithm ECR URIs.
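For example, the URI for a given region can be resolved as follows (the region name here is only an example):

from sagemaker import image_uris

# Resolves the regional ECR image of the built-in k-NN algorithm
print(image_uris.retrieve('knn', 'us-east-1'))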
Channel Names
The name of the training channel for built-in algorithms in Amazon SageMaker is fixed to train. If you also include a test channel when creating the training job, your ML model will automatically be evaluated on the test data after training.
Using Pipe Mode
To enhance data streaming efficiency, you can enable Pipe mode by setting the input_mode parameter to 'Pipe' on both the Estimator and the TrainingInput definitions. Pipe mode streams data directly from S3 to the training instance, reducing the startup latency and memory requirements of downloading the entire dataset.
k-NN Hyperparameters
The k-NN algorithm includes several configurable hyperparameters. For complete details on their usage and effects, consult the official documentation: k-NN Hyperparameters.
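As a hedged illustration, a couple of commonly tuned parameters can be passed straight through the get_estimator helper defined above (the values here are arbitrary):

# k: number of neighbors consulted per prediction
# sample_size: number of data points sampled to build the index
knn = get_estimator(k=3, sample_size=1000)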
Training Log
After starting the training job, you will observe logs similar to the following:
2022-01-08 13:38:34 Starting - Starting the training job...
2022-01-08 13:38:57 Starting - Launching requested ML instancesProfilerReport-1641649113: InProgress
......
[01/08/2022 13:43:00 INFO 140667182901056] #test_score (algo-1) : ('accuracy', 0.9736842105263158)
[01/08/2022 13:43:00 INFO 140667182901056] #test_score (algo-1) : ('macro_f_1.000', 0.97170347)
- The log provides metrics like accuracy and macro F1 score, offering insights into the model's performance on the test dataset.
- Leveraging both the train and test channels during training ensures the built-in algorithm evaluates the model's generalization ability automatically.
Inference
Deploy the trained model to an endpoint and validate predictions. The serializer and deserializer in SageMaker specify the formats of the input and output data exchanged with a deployed inference endpoint.
def deploy(estimator: Estimator) -> Predictor:
return estimator.deploy(
initial_instance_count=1,
instance_type=PREDICTOR_INSTANCE_TYPE,
serializer=CSVSerializer(),
deserializer=JSONDeserializer(),
endpoint_name=PREDICTOR_ENDPOINT_NAME,
)
def validate(predictor: Predictor, test_df: pd.DataFrame) -> pd.DataFrame:
rows = []
for _, data in test_df.iterrows():
predict = predictor.predict(
pd.DataFrame([data.drop('Species')]).to_csv(header=False, index=False),
initial_args={'ContentType': 'text/csv'},
)
predicted_label = predict['predictions'][0]['predicted_label']
row = data.tolist()
row.append(predicted_label)
row.append(data['Species'] == predicted_label)
rows.append(row)
return pd.DataFrame(rows, columns=('Species', 'Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Prediction', 'Result'))
if __name__ == '__main__':
predictor = deploy(knn)
predictions = validate(predictor, test_df)
display(predictions)
The inference results include the Prediction and Result columns, displayed in tabular format:
- Prediction: The predicted label for each sample provided to the model.
- Result: A boolean value (True or False) indicating whether the prediction matches the actual label.
Clean-Up
Release the resources to avoid unnecessary costs:
def delete_model(predictor: Predictor) -> None:
predictor.delete_model()
def delete_endpoint(predictor: Predictor) -> None:
predictor.delete_endpoint(delete_endpoint_config=True)
if __name__ == '__main__':
delete_model(predictor)
delete_endpoint(predictor)
Conclusion
By using SageMaker’s built-in algorithms like k-NN, you can simplify the ML workflow, from training to deployment. SageMaker empowers developers to focus on building effective models while handling the underlying infrastructure seamlessly.
Happy Coding! 🚀
Appendix: Setting Serializer and Deserializer in Predictor
When deploying a model, you can define the serializer and deserializer to control the input/output format:
from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer
predictor = Predictor(
endpoint_name="my-endpoint",
serializer=CSVSerializer(), # Input in CSV format
deserializer=JSONDeserializer() # Output in JSON format
)
These settings ensure seamless communication with your SageMaker inference endpoint, making it easier to send requests and process responses.
Serializer
The serializer converts input data (e.g., Python objects) into the format expected by the SageMaker endpoint. For example:
- CSVSerializer: Converts data into CSV format.
- JSONSerializer: Converts data into JSON format.
- NumpySerializer: Converts NumPy arrays into binary format.
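For instance, CSVSerializer turns a list of rows into the CSV payload sent to the endpoint:

from sagemaker.serializers import CSVSerializer

payload = CSVSerializer().serialize([[5.1, 3.5, 1.4, 0.2]])
print(payload)  # 5.1,3.5,1.4,0.2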
Deserializer
The deserializer interprets the output data from the SageMaker endpoint and converts it back into a usable Python object. For example:
- JSONDeserializer: Converts JSON responses into Python dictionaries or lists.
- BytesDeserializer: Returns raw bytes.
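Likewise, JSONDeserializer parses the endpoint's JSON response body; here an in-memory stream stands in for the HTTP response:

import io
from sagemaker.deserializers import JSONDeserializer

stream = io.BytesIO(b'{"predictions": [{"predicted_label": 0.0}]}')
result = JSONDeserializer().deserialize(stream, 'application/json')
print(result['predictions'][0]['predicted_label'])  # 0.0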