Use AWS SageMaker with Teradata Vantage
Author: Wenjie Tehan
Last updated: February 8th, 2022
This how-to will help you to integrate Amazon SageMaker with Teradata Vantage. The approach this guide explains is one of many potential approaches to integrate with the service.
Amazon SageMaker provides a fully managed Machine Learning Platform. There are two use cases for Amazon SageMaker and Teradata:
Data resides on Teradata Vantage and Amazon SageMaker will be used for both the Model definition and subsequent scoring. Under this use case Teradata will provide data into the Amazon S3 environment so that Amazon SageMaker can consume training and test data sets for the purpose of model development. Teradata would further make data available via Amazon S3 for subsequent scoring by Amazon SageMaker. Under this model Teradata is a data repository only.
Data resides on Teradata Vantage and Amazon SageMaker will be used for the Model definition, and Teradata for the subsequent scoring. Under this use case Teradata will provide data into the Amazon S3 environment so that Amazon SageMaker can consume training and test data sets for the purpose of model development. Teradata will need to import the Amazon SageMaker model into a Teradata table for subsequent scoring by Teradata Vantage. Under this model Teradata is a data repository and a scoring engine.
The first use case is discussed in this document.
Amazon SageMaker consumes training and test data from an Amazon S3 bucket. This article describes how you can load Teradata analytics data sets into an Amazon S3 bucket. The data can then available to Amazon SageMaker to build and train machine learning models and deploy them into a production environment.
Access to a Teradata Vantage instance.
IAM permission to access Amazon S3 bucket, and to use Amazon SageMaker service.
An Amazon S3 bucket to store training data.
Amazon SageMaker trains data from an Amazon S3 bucket. Following are the steps to load training data from Vantage to an Amazon S3 bucket:
Go to Amazon SageMaker console and create a notebook instance. See Amazon SageMaker Developer Guide for instructions on how to create a notebook instance:
Open your notebook instance:
Start a new file by clicking on
New → conda_python3:
Install Teradata Python library:
!pip install teradataml
In a new cell and import additional libraries:
import teradataml as tdml from teradataml import create_context, get_context, remove_context from teradataml.dataframe.dataframe import DataFrame import pandas as pd import boto3, os
In a new cell, connect to Teradata Vantage. Replace
<database user name>,
<database password>to match your Vantage environment:
create_context(host = '<hostname>', username = '<database user name>', password = '<database password>')
Retrieve data rom the table where the training dataset resides using TeradataML DataFrame API:
train_data = tdml.DataFrame('table_with_training_data') trainDF = train_data.to_pandas()
Write data to a local file:
trainFileName = 'train.csv' trainDF.to_csv(trainFileName, header=None, index=False)
Upload the file to Amazon S3:
bucket = 'sagedemo' prefix = 'sagemaker/train' trainFile = open(trainFileName, 'rb') boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, localFile)).upload_fileobj(trainFile)
Training jobson the left menu under
Training, then click on
Create training job:
Create training jobwindow, fill in the
Create a new rolefor the IAM role. Choose
Any S3 bucketfor the Amazon S3 buckets and
Back in the
Create training jobwindow, use
XGBoostas the algorithm:
Use the default
ml.m4.xlargeinstance type, and 30GB of additional storage volume per instance. This is a short training job, shouldn’t take more than 10 minutes.
Fill in following hyperparameters and leave everything else as default:
num_round=100 silent=0 eta=0.2 gamma=4 max_depth=5 min_child_weight=6 subsample=0.8 objective='binary:logistic'
Input data configuration, enter the Amazon S3 bucket where you stored your training data. Input mode is
File. Content type is
S3 locationis where the file uploaded to:
Output data configuration, enter path where the output data will be stored:
Leave everything else as default, and click on “Create training job”. Detail instructions on how to configure the training job can be found in Amazon SageMaker Developer Guide.
Once the training job’s created, Amazon SageMaker launches the ML instances to train the model, and stores the resulting model artifacts and other output in the
Output data configuration (
path/<training job name>/output by default).
After you train your model, deploy it using a persistent endpoint
Inferencefrom the left panel, then
Create model. Fill in the model name (e.g.
xgboost-bank), and choose the IAM role you created from the previous step.
Container definition 1, use
Location of inference code image.
Location of model artifactsis the output path of your training job
Leave everything else as default, then
Select the model you just created, then click on
Create endpoint configuration:
Fill in the name (e.g.
xgboost-bank) and use default for everything else. The model name and training job should be automatically populated for you. Click on
Create endpoint configuration.
Modelsfrom the left panel, select the model again, and click on
Create endpointthis time:
Fill in the name (e.g.
xgboost-bank), and select
Use an existing endpoint configuration: image::sagemaker-with-teradata-vantage/attach.endpoint.configuration.png[Attach endpoint configuration]
Select the endpoint configuration created from last step, and click on
Select endpoint configuration:
Leave everything else as default and click on
Now the model is deployed to the endpoint and can be used by client applications.
This how-to demonstrated how to extract training data from Vantage and use it to train a model in Amazon SageMaker. The solution used a Jupyter notebook to extract data from Vantage and write it to an S3 bucket. A SageMaker training job read data from the S3 bucket and produced a model. The model was deployed to AWS as a service endpoint.