Build a Feast feature store in Teradata Vantage

Introduction

The Feast connector for Teradata is a complete implementation that supports all Feast features and uses Teradata Vantage as both the online and the offline store.

Prerequisites

Access to a Teradata Vantage instance.

If you need a test instance of Vantage, you can provision one for free at https://clearscape.teradata.com.

Overview

This how-to assumes you know Feast terminology. If you need a refresher, check out the official Feast documentation.

This document demonstrates how developers can integrate Teradata’s offline and online stores with Feast. Offline stores let users use any underlying data store as their offline feature store. Features can be retrieved from the offline store for model training and can be materialized into the online feature store for use during model inference.

Online stores, on the other hand, serve features at low latency. The materialize command loads feature values from the data sources (or offline stores) into the online store.

The feast-teradata library adds support for Teradata as:

  • OfflineStore

  • OnlineStore

Additionally, using Teradata as the registry (catalog) is already supported via registry_type: sql and is included in our examples. This means that everything can be located in Teradata. However, depending on requirements, installation, etc., this can be mixed and matched with other systems as appropriate.

Getting Started

To get started, install the feast-teradata library:

pip install feast-teradata

Let’s create a simple Feast setup with Teradata using the standard driver dataset. Note that you cannot use feast init, as this command only works for templates that are part of the core Feast library. We intend to get this library merged into Feast core eventually, but for now you need to use the following CLI command for this specific task. All other feast CLI commands work as expected.

feast-td init-repo

This prompts you for the required Teradata connection information and uploads the example dataset. Let’s assume you used the repo name demo when running the above command. The generated repository contains the feature repo files along with a file called test_workflow.py. Running test_workflow.py executes a complete Feast workflow with Teradata as the registry, OfflineStore, and OnlineStore.

demo/
    feature_repo/
        driver_repo.py
        feature_store.yaml
    test_workflow.py

From within the demo/feature_repo directory, execute the following feast command to apply (import/update) the repo definition into the registry. After running this command, you can see the registry metadata tables in the Teradata database.

feast apply

To see the registry information in the Feast UI, run the following command. Setting --registry_ttl_sec matters because the default polling interval is 5 seconds; a larger value such as 120 reduces how often the registry is queried.

feast ui --registry_ttl_sec=120

Offline Store Config

project: <name of project>
registry: <registry>
provider: local
offline_store:
   type: feast_teradata.offline.teradata.TeradataOfflineStore
   host: <db host>
   database: <db name>
   user: <username>
   password: <password>
   log_mech: <connection mechanism>
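
Before running feast apply, it can help to confirm that the offline_store block contains every connection field shown in the template above. The following is a minimal sketch and not part of feast-teradata: the helper name missing_offline_store_keys and all sample values are hypothetical, and the required-key list simply mirrors the template.

```python
# Hypothetical pre-flight check; the key list mirrors the config template above.
REQUIRED_KEYS = ("type", "host", "database", "user", "password", "log_mech")


def missing_offline_store_keys(config: dict) -> list:
    """Return the offline_store keys that are absent or empty in a parsed config."""
    offline = config.get("offline_store") or {}
    return [key for key in REQUIRED_KEYS if not offline.get(key)]


# Sample parsed config (placeholder values, not real credentials).
sample_config = {
    "project": "demo",
    "provider": "local",
    "offline_store": {
        "type": "feast_teradata.offline.teradata.TeradataOfflineStore",
        "host": "example-host",
        "database": "demo_db",
        "user": "demo_user",
        "password": "",  # left empty on purpose to show a detected gap
    },
}

print(missing_offline_store_keys(sample_config))
```

An empty list means every field in the template has a value; the check says nothing about whether the values are correct.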

Repo Definition

Below is an example definition.py that shows how to set up the entity, source connector, and feature view. The components are:

  • TeradataSource: Data Source for features stored in Teradata (Enterprise or Lake) or accessible via a Foreign Table from Teradata (NOS, QueryGrid)

  • Entity: A collection of semantically related features

  • Feature View: A feature view is a group of feature data from a specific data source. Feature views allow you to consistently define features and their data sources, enabling the reuse of feature groups across a project

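from datetime import timedelta

import yaml
from feast import Entity, FeatureView, Field
from feast.types import Float32, Int64
# Import path as used in the feast-teradata template; adjust if your installed version differs.
from feast_teradata.offline.teradata_source import TeradataSource

# Read the project name and offline store database from the repo configuration.
with open("feature_store.yaml") as f:
    repo_config = yaml.safe_load(f)
project_name = repo_config["project"]

# The entity whose join key links feature rows to drivers.
driver = Entity(name="driver", join_keys=["driver_id"])

# The Teradata table that holds the raw feature values.
driver_stats_source = TeradataSource(
    database=repo_config["offline_store"]["database"],
    table=f"{project_name}_feast_driver_hourly_stats",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

# The feature view groups the driver features read from the source above.
driver_stats_fv = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(weeks=52 * 10),
    schema=[
        Field(name="driver_id", dtype=Int64),
        Field(name="conv_rate", dtype=Float32),
        Field(name="acc_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    source=driver_stats_source,
    tags={"team": "driver_performance"},
)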
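
Feast retrieval calls refer to a feature by a string of the form "<feature view name>:<feature name>". As a small illustration of how the names in a feature view map to those strings, here is a sketch with a hypothetical helper, feature_refs, which is not part of Feast or feast-teradata. The join key driver_id is left out because it identifies rows rather than being requested as a feature.

```python
def feature_refs(view_name: str, feature_names: list) -> list:
    """Build '<view>:<feature>' reference strings for one feature view."""
    return [f"{view_name}:{name}" for name in feature_names]


# Feature names taken from the driver_hourly_stats schema (join key excluded).
refs = feature_refs(
    "driver_hourly_stats",
    ["conv_rate", "acc_rate", "avg_daily_trips"],
)
print(refs)
```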

Offline Store Usage

Before testing the offline store, make sure the repo definition has been applied with feast apply, as described in Getting Started.

Now, let’s batch-read some features for training, using only entities (population) for which we have seen an event in the last 60 days. The predicates (filters) can be on anything relevant to selecting the entities (population) for the given training dataset. The event_timestamp filter is used here only as an example.

from feast import FeatureStore
store = FeatureStore(repo_path="feature_repo")
training_df = store.get_historical_features(
    entity_df="""
            SELECT
                driver_id,
                event_timestamp
            FROM demo_feast_driver_hourly_stats
            WHERE event_timestamp BETWEEN (CURRENT_TIMESTAMP - INTERVAL '60' DAY) AND CURRENT_TIMESTAMP
        """,
    features=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate",
        "driver_hourly_stats:avg_daily_trips"
    ],
).to_df()
print(training_df.head())
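
The 60-day window in the entity query is only one possible population filter. As a sketch of how the lookback could be parameterized, the hypothetical helper below (entity_sql, not part of Feast) builds the same SELECT for a given table and number of days. The two-digit limit on days is a conservative assumption, since wider intervals may need an explicit DAY(n) precision in Teradata.

```python
def entity_sql(table: str, days: int) -> str:
    """Build the entity (population) query for the last `days` days."""
    # Allow only plain identifiers so the table name cannot inject SQL.
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table!r}")
    # Conservative assumption: stay within two digits for the DAY interval.
    if not 1 <= days <= 99:
        raise ValueError("days must be between 1 and 99")
    return (
        "SELECT driver_id, event_timestamp\n"
        f"FROM {table}\n"
        "WHERE event_timestamp BETWEEN "
        f"(CURRENT_TIMESTAMP - INTERVAL '{days}' DAY) AND CURRENT_TIMESTAMP"
    )


print(entity_sql("demo_feast_driver_hourly_stats", 60))
```

The returned string can be passed as entity_df to get_historical_features in place of the inline query.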

The feast-teradata library allows you to use the complete set of feast APIs and functionality. Please refer to the official feast quickstart for more details on the various things you can do.

Online Store

Feast materializes data to online stores for low-latency lookup at model inference time. Key-value stores are typically used as online stores, but relational databases can serve this purpose as well.

Users can develop their own online stores by creating a class that implements the contract in the OnlineStore class.

Online Store Config

project: <name of project>
registry: <registry>
provider: local
online_store:
   type: feast_teradata.online.teradata.TeradataOnlineStore
   host: <db host>
   database: <db name>
   user: <username>
   password: <password>
   log_mech: <connection mechanism>

Online Store Usage

Before the online store can be tested, features must be materialized into it.

The materialize-incremental command incrementally materializes features into the online store. If there are no new feature values to add, the command does nothing. With feast materialize-incremental, the start time is either now - ttl (the ttl defined in the feature view) or the time of the most recent materialization. Once features have been materialized at least once, subsequent runs only fetch feature values that were not present in the store at the time of the previous materialization.

CURRENT_TIME=$(date +'%Y-%m-%dT%H:%M:%S')
feast materialize-incremental $CURRENT_TIME
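
The start-time rule for incremental materialization can be written down in a few lines. This is a sketch of the rule as described in this section, not Feast's actual implementation; the function name incremental_start and the sample dates are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def incremental_start(
    now: datetime, ttl: timedelta, last_materialized: Optional[datetime]
) -> datetime:
    """Return the start of the window for an incremental materialization."""
    if last_materialized is not None:
        # A previous run exists: continue from where it stopped.
        return last_materialized
    # First run: look back one ttl from the current time.
    return now - ttl


now = datetime(2024, 1, 10)
print(incremental_start(now, timedelta(days=3), None))
print(incremental_start(now, timedelta(days=3), datetime(2024, 1, 9)))
```

The first call looks back one ttl from now; the second resumes from the previous materialization.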

Next, when fetching online features, there are two parameters: features and entity_rows. The features parameter is a list and can contain any number of features defined in the feature view (driver_hourly_stats here). The example below requests all three features, but a subset works as well. The entity_rows parameter is also a list; each element is a dictionary of the form {entity_key_column: value_to_fetch}. In our case, the column driver_id uniquely identifies the rows of the entity driver, and the example fetches feature values for driver_id 1001 and 1002. Any number of rows can be requested using the format [{driver_id: val_1}, {driver_id: val_2}, .., {driver_id: val_n}].

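from feast import FeatureStore

store = FeatureStore(repo_path="feature_repo")

# One dictionary per entity row; driver_id is the join key of the driver entity.
entity_rows = [
    {"driver_id": 1001},
    {"driver_id": 1002},
]

# Feature references use the form "<feature view>:<feature>".
features_to_fetch = [
    "driver_hourly_stats:acc_rate",
    "driver_hourly_stats:conv_rate",
    "driver_hourly_stats:avg_daily_trips",
]

returned_features = store.get_online_features(
    features=features_to_fetch,
    entity_rows=entity_rows,
).to_dict()

# Print each returned column with its list of values.
for key, value in sorted(returned_features.items()):
    print(key, " : ", value)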
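
The dictionary returned by to_dict() is column-oriented: each key maps to a list with one value per requested entity row. The sketch below shows one way to build entity_rows from a list of ids and to turn such a result into one dictionary per entity. Both helper names are hypothetical, and the feature values in sample_result are made up for illustration.

```python
def build_entity_rows(key: str, values: list) -> list:
    """Build the entity_rows list: one {key: value} dictionary per entity."""
    return [{key: value} for value in values]


def rows_from_columns(columns: dict) -> list:
    """Turn a column-oriented result into one dictionary per entity row."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


entity_rows = build_entity_rows("driver_id", [1001, 1002])

# Made-up values in the shape of a to_dict() result.
sample_result = {
    "driver_id": [1001, 1002],
    "acc_rate": [0.5, 0.25],
    "conv_rate": [0.1, 0.2],
}

for row in rows_from_columns(sample_result):
    print(row)
```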

How to set SQL Registry

The SQL registry is another important component. We first build a path variable from the username, password, database name, and so on; this connection string is then used to establish a connection to the Teradata database.

path = f"teradatasql://{teradata_user}:{teradata_password}@{host}/?database={teradata_database}&LOGMECH={teradata_log_mech}"
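
A connection string of this form is a URL, so characters such as @ or / in a password would be misread as delimiters unless they are percent-encoded. The sketch below builds the same string with encoding applied, using only the Python standard library. The function name registry_path and the sample credentials are hypothetical.

```python
from urllib.parse import quote_plus


def registry_path(user: str, password: str, host: str,
                  database: str, log_mech: str) -> str:
    """Build the registry connection string with URL-encoded components."""
    return (
        f"teradatasql://{quote_plus(user)}:{quote_plus(password)}@{host}"
        f"/?database={quote_plus(database)}&LOGMECH={quote_plus(log_mech)}"
    )


# Placeholder credentials; the password contains characters that need encoding.
print(registry_path("demo_user", "p@ss/word", "example-host", "demo_db", "TDNEGO"))
```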

The SQL registry creates the following tables in your database:

  • entities (entity_name, project_id, last_updated_timestamp, entity_proto)

  • data_sources (data_source_name, project_id, last_updated_timestamp, data_source_proto)

  • feature_views (feature_view_name, project_id, last_updated_timestamp, materialized_intervals, feature_view_proto, user_metadata)

  • request_feature_views (feature_view_name, project_id, last_updated_timestamp, feature_view_proto, user_metadata)

  • stream_feature_views (feature_view_name, project_id, last_updated_timestamp, feature_view_proto, user_metadata)

  • managed_infra (infra_name, project_id, last_updated_timestamp, infra_proto)

  • validation_references (validation_reference_name, project_id, last_updated_timestamp, validation_reference_proto)

  • saved_datasets (saved_dataset_name, project_id, last_updated_timestamp, saved_dataset_proto)

  • feature_services (feature_service_name, project_id, last_updated_timestamp, feature_service_proto)

  • on_demand_feature_views (feature_view_name, project_id, last_updated_timestamp, feature_view_proto, user_metadata)

Additionally, if you want to see a complete (but not real-world) end-to-end example workflow, see the demo/test_workflow.py script. It is used for testing the complete Feast functionality.

An enterprise feature store shortens the path from raw data to usable features in key stages of data analysis, which improves productivity and reduces time to market. Integrating Teradata with Feast brings Teradata’s parallel processing to the feature store, which can improve performance.

Further reading

If you have any questions or need further assistance, please visit our community forum where you can get support and interact with other community members.