Train ML models in Vantage

Author: Adam Tworkiewicz
Last updated: September 12th, 2021

Overview

There are situations when you want to quickly validate a machine learning model idea. You have a model type in mind. You don’t want to operationalize with an ML pipeline just yet. You just want to test out if the relationship you had in mind exists. Also, sometimes even your production deployment doesn’t require constant relearning with MLops. In such cases, you can use Vantage Analytics Library (VAL) and multiple ML model types it supports.

Prerequsites

You need access to a Teradata Vantage instance. If you need a new instance of Vantage, you can install Vantage Express on your local machine using VMware, VirtualBox, or UTM. You can also run Vantage Express in the cloud on GCP.

Install Vantage Analytics Library

Support for ML in Vantage requires Vantage Analytics Library (VAL). In this section, we will install VAL and load some sample data.

  1. VAL is distributed as an rpm file. Go to Teradata Downloads and download the VAL rpm to your local machine.

  2. Upload the file to your Vantage install. If you are running Vantage Express locally, you have many ways to do it:

    • If you installed Vantage Express on VirtualBox, you should be able to drag & drop the file to the VM’s desktop. You can also use scp by connecting to port 4422, e.g.:

      scp -P 4422 ~/Downloads/VAL-2.0.0.3-1.x86_64.rpm root@localhost:/root/Desktop
    • If you use VMware and you have enabled drag & drop, you should be able to drag and drop the file to the VM’s desktop.

    • If you have SSH access to your Vantage nodes, you can use scp to upload the binary, e.g.:

      scp ~/Downloads/VAL-2.0.0.3-1.x86_64.rpm root@vantage.server.name:/tmp/
  3. We will now create a new database where VAL functions and procedures will be installed. You could install VAL in a global location such as SYSLIB, but installing VAL in a specific database will make it easier to start over if things go wrong. Let’s create a database called val and grant appropriate permissions to our user. Please edit to match your database name and user id:

    CREATE DATABASE val
    AS PERMANENT = 60e6, -- 60MB
        SPOOL = 120e6; -- 120MB
    
    GRANT CREATE FUNCTION ON val to dbc;
    GRANT ALTER FUNCTION ON val to dbc;
    GRANT EXECUTE PROCEDURE on SQLJ.REMOVE_JAR to dbc;
    GRANT EXECUTE PROCEDURE on SQLJ.INSTALL_JAR to dbc;
    GRANT EXECUTE PROCEDURE on SQLJ.REPLACE_JAR to dbc;
    GRANT CREATE EXTERNAL PROCEDURE ON val to dbc;
  4. Open Gnome Terminal in the VM and start the installation process. Adjust the rpm path as necessary:

    rpm -Uvh --nodeps ~/Desktop/VAL-2.0.0.3-1.x86_64.rpm
  5. The install wizard will ask you for the hostname, user id, and password. If you are running the install on your Vantage Express VM, the values are:

    • Hostname: localhost

    • Userid: dbc

    • Password: dbc

    • Account string: leave empty, press ENTER

    • BTEQ or FASTLOAD command: leave empty, press ENTER

  6. The wizard will ask you to choose which part of VAL you want to install. We want to start with installing td_analyze procedure, i.e. option 1. Once you select option 1, the script will ask for the database name where td_analyze will be installed. Enter val and press ENTER.

  7. While still in the wizard, install option 5, i.e. Tutorial Tables. These are sample tables with data that we are going to use to build a sample model.

Sample data

Now, that we have VAL and sample tables loaded, let’s explore the data. It’s a simplistic, fictitious dataset of banking customers (1K-ish rows), Accounts (10K-ish rows) and Transactions (100K-ish rows). They are related to each other in the following ways:

Banking Model

In later parts of this quickstart we are going to explore if we can build a model that predicts average monthly balance that a banking customer has on their credit card based on all non-credit card related variables in the tables.

Create a linear regression model

Let’s start by creating a wide table (Analytic Data Set, or ADS) that joins the three tables above.

You must have CREATE TABLE permissions on the Database where the Vantage Analytic Library is installed.
-- Switch to val database.
DATABASE val;

-- Create the ADS.
CREATE TABLE VAL_ADS AS (
    SELECT
        T1.cust_id  AS cust_id
       ,MIN(T1.income) AS tot_income
       ,MIN(T1.age) AS tot_age
       ,MIN(T1.years_with_bank) AS tot_cust_years
       ,MIN(T1.nbr_children) AS tot_children
       ,CASE WHEN MIN(T1.marital_status) = 1 THEN 1 ELSE 0 END AS single_ind
       ,CASE WHEN MIN(T1.gender) = 'F' THEN 1 ELSE 0 END AS female_ind
       ,CASE WHEN MIN(T1.marital_status) = 2 THEN 1 ELSE 0 END AS married_ind
       ,CASE WHEN MIN(T1.marital_status) = 3 THEN 1 ELSE 0 END AS separated_ind
       ,MAX(CASE WHEN T1.state_code = 'CA' THEN 1 ELSE 0 END) AS ca_resident_ind
       ,MAX(CASE WHEN T1.state_code = 'NY' THEN 1 ELSE 0 END) AS ny_resident_ind
       ,MAX(CASE WHEN T1.state_code = 'TX' THEN 1 ELSE 0 END) AS tx_resident_ind
       ,MAX(CASE WHEN T1.state_code = 'IL' THEN 1 ELSE 0 END) AS il_resident_ind
       ,MAX(CASE WHEN T1.state_code = 'AZ' THEN 1 ELSE 0 END) AS az_resident_ind
       ,MAX(CASE WHEN T1.state_code = 'OH' THEN 1 ELSE 0 END) AS oh_resident_ind
       ,MAX(CASE WHEN T2.acct_type = 'CK' THEN 1 ELSE 0 END) AS ck_acct_ind
       ,MAX(CASE WHEN T2.acct_type = 'SV' THEN 1 ELSE 0 END) AS sv_acct_ind
       ,MAX(CASE WHEN T2.acct_type = 'CC' THEN 1 ELSE 0 END) AS cc_acct_ind
       ,AVG(CASE WHEN T2.acct_type = 'CK' THEN T2.starting_balance+T2.ending_balance ELSE 0 END) AS ck_avg_bal
       ,AVG(CASE WHEN T2.acct_type = 'SV' THEN T2.starting_balance+T2.ending_balance ELSE 0 END) AS sv_avg_bal
       ,AVG(CASE WHEN T2.acct_type = 'CC' THEN T2.starting_balance+T2.ending_balance ELSE 0 END) AS cc_avg_bal
       ,AVG(CASE WHEN T2.acct_type = 'CK' THEN T3.principal_amt+T3.interest_amt ELSE 0 END) AS ck_avg_tran_amt
       ,AVG(CASE WHEN T2.acct_type = 'SV' THEN T3.principal_amt+T3.interest_amt ELSE 0 END) AS sv_avg_tran_amt
       ,AVG(CASE WHEN T2.acct_type = 'CC' THEN T3.principal_amt+T3.interest_amt ELSE 0 END) AS cc_avg_tran_amt
       ,COUNT(CASE WHEN ((EXTRACT(MONTH FROM T3.tran_date) + 2) / 3) = 1 THEN T3.tran_id ELSE NULL END) AS q1_trans_cnt
       ,COUNT(CASE WHEN ((EXTRACT(MONTH FROM T3.tran_date) + 2) / 3) = 2 THEN T3.tran_id ELSE NULL END) AS q2_trans_cnt
       ,COUNT(CASE WHEN ((EXTRACT(MONTH FROM T3.tran_date) + 2) / 3) = 3 THEN T3.tran_id ELSE NULL END) AS q3_trans_cnt
       ,COUNT(CASE WHEN ((EXTRACT(MONTH FROM T3.tran_date) + 2) / 3) = 4 THEN T3.tran_id ELSE NULL END) AS q4_trans_cnt
    FROM Customer AS T1
        LEFT OUTER JOIN Accounts AS T2
            ON T1.cust_id = T2.cust_id
        LEFT OUTER JOIN Transactions AS T3
            ON T2.acct_nbr = T3.acct_nbr
GROUP BY T1.cust_id) WITH DATA UNIQUE PRIMARY INDEX (cust_id);

We will now build a linear regression model that takes parameters from the dataset and tries to predict the monthly credit card balance.

We call td_analyze and tell it we want a linear model. The input is in table VAL_ADS and consists of multiple columns. The dependent variable is cc_avg_bal. We want the model to be written to val database in table called LINEAR_REGRESSION_DEMO:

call td_analyze('linear',
  'database=val;
  tablename=VAL_ADS;
  columns=tot_age,tot_income,tot_cust_years,tot_children,single_ind,female_ind,married_ind,separated_ind,ck_acct_ind,sv_acct_ind,sv_avg_bal,ck_avg_bal,ca_resident_ind,ny_resident_ind,tx_resident_ind,il_resident_ind,az_resident_ind,oh_resident_ind;
  dependent=cc_avg_bal;
  outputdatabase=val;
  outputtablename=linear_regression_demo');

The procedure creates several output tables. For now, we don’t have to analyze what is in the tables. Let’s see how we can use the newly created model to perform scoring.

Scoring

Let’s use the model to perform predictions and evaluate the scores. To do this, we call td_analyze with linearscore parameter. We point to the input table (VAL_ADS), the model tables (prefix linear_regression_demo) and define the target table (linear_regression_score) in val database:

call td_analyze('linearscore',
  'database=val;
  tablename=VAL_ADS;
  modeldatabase=val;
  modeltablename=linear_regression_demo;
  outputdatabase=val;
  outputtablename=linear_regression_score;
  predicted=estimate;
  retain=cc_avg_bal;
  scoringmethod=scoreandevaluate;');

As a result, we get linear_regression_score table that contains the real balance, the predicted balance and the difference between these two. Let’s have a look at a sample:

SELECT * FROM linear_regression_score SAMPLE 10;

You will see results similar to:

cust_id|cc_avg_bal        |estimate          |Residual           |
-------+------------------+------------------+-------------------+
1362498|               0.0| 284.7057772484358| -284.7057772484358|
1362828|           1184.35|463.74177458594215|  720.6082254140578|
1362839| 2933.135802469136| 982.9240031182255| 1950.2117993509103|
1362986| 500.9148148148148| 881.4116539412856| -380.4968391264708|
1362511|235.85941489361701|294.35369563202846|-58.494280738411426|
1363134|               0.0|430.27950420065997|-430.27950420065997|
1363481|               0.0| 411.2359958542745| -411.2359958542745|
1362644| 209.3304347826087|279.75770904482033| -70.42727426221163|
1363141|               0.0| 550.1681921045503| -550.1681921045503|
1363290|               0.0|120.35348558871233|-120.35348558871233|

Summary

In this quick start we have learned how to create ML models in SQL. The method used Vantage Analytics Library (VAL). We were able to build a linear regression model and run predictions using the model. We have done that using SQL without any coding.

Did this page help?