Use Airbyte to load data from external sources to Teradata Vantage
This tutorial showcases how to use Airbyte (an open-source Extract Load Transform tool) with Teradata Vantage. We work with a very simple end-to-end setup to load data from Google Sheets to Teradata Vantage using Airbyte.
Source: Google Sheets
Destination: Teradata Vantage
Access to a Teradata Vantage Instance. This will be defined as the destination of the Airbyte connection. You will need a database
Passwordfor Airbyte’s configuration.
If you need a test instance of Vantage, you can provision one for free at https://clearscape.teradata.com.
Data from the source system. In this case, we use a sample spreadsheet from google sheets. The sample data is a breakdown of payrate by employee type.
Google Cloud Platform API enabled for your personal or organizational account. You’ll need to authenticate your Google account via OAuth or via Service Account Key Authenticator. In this example, we use Service Account Key Authenticator.
Clone the Airbyte Open Source repository and go to the airbyte directory.
git clone --depth 1 https://github.com/airbytehq/airbyte.git cd airbyte
Make Sure to have Docker Desktop running before running the shell script
Run the shell script
You can run the above commands with
git bashin Windows. Please refer to the Airbyte Local Deployment for more details.
Log in to the web app http://localhost:8000/ by entering the default credentials found in the
.envfile included in the repository.
When logging in for the first time, Airbyte will prompt you to provide your email address and specify your preferences for product improvements. Enter your preferences and click on "Get started."
Once Airbyte Open Source is launched you will see a connections dashboard. If you launched Airbyte Open Source for the first time, it would not show any connections.
You can either click "Create your first connection" or click on the top right corner to initiate the new connection workflow on Airbyte’s Connections dashboard.
Airbyte will ask you for the Source, you can select from an existing source (if you have set it up already) or you can set up a new source, in this case we select
For authentication we are using
Service Account Key Authenticationwhich uses a service account key in JSON format. Toggle from the default
Service Account Key Authentication. To authenticate your Google account via Service Account Key Authentication, enter your Google Cloud service account key in JSON format.
Make sure the Service Account has the Project Viewer permission. If your spreadsheet is viewable by anyone with its link, no further action is needed. If not, give your Service account access to your spreadsheet.
Add the link to the source spreadsheet as
For more details, please refer Setting Google Sheets as Source Connector in Airbyte Open Source
Click Set up source, if the configuration is correct, you will get the message
All connection tests passed!
Assuming you want to create a fresh new connection with
Teradata Vantage, Select
Teradata Vantageas the destination type under the "Set up the destination" section.
Password. These are the same as the
Passwordrespectively, used by your Clearscape Analytics Environment.
Provide a default schema name appropriate to your specific context. Here we have provided
If you do not provide a
Click Set up destination, if the configuration is correct, you will get the message
All connection tests passed!
You might get a configuration check failed error. Make sure your Teradata Vantage instance is running properly before making a connection through Airbyte.
A namespace is a group of streams (tables) in a source or destination. A schema in a relational database system is an example of a namespace. In a source, the namespace is the location from where the data is replicated to the destination. In a destination, the namespace is the location where the replicated data is stored in the destination. For more details please refer to Airbyte Namespace.
In our example the destination is a database, so the namespace is the default schema
gsheet_airbyte_td we defined when we configured the destination. The stream name is a table that is mirroring the name of the spreadsheet in the source, which is
sample_employee_payrate in this case. Since we are using the single spreadsheet connector, it only supports one stream (the active spreadsheet).
Other type of sources and destinations might have a different layout. In this example, Google sheets, as source, does not support a namespace.
In our example, we have used
<destination schema> as the Namespace of the destination, this is the default namespace assigned by Airbyte based on the
Default Schema we declared in the destination settings. The database
gsheet_airbyte_td will be created in our Teradata Vantage Instance.
We use the term "schema", as it is the term used by Airbyte. In a Teradata context the term "database" is the equivalent.
It shows how often data should sync to destination. You can select every hour, 2 hours, 3 hours etc. In our case we used every 24 hours.
You can also use a Cron expression to specify the time when the sync should run. In the example below, we set the Cron expression to run the sync on every Wednesday at 12:43 PM (US/Pacific) time.
Airbyte tracks synchronization attempts in the "Sync History" section of the
Next, you can go to the ClearScape Analytics Experience and run a Jupyter notebook, notebooks in ClearScape Analytics Experience are configured to run Teradata SQL queries, to verify if the database
gsheet_airbyte_td, streams (tables) and complete data is present.
SELECT DatabaseName, TableName, CreateTimeStamp, LastAlterTimeStamp FROM DBC.TablesV WHERE DatabaseName = 'gsheet_airbyte_td' ORDER BY TableName;
SELECT * FROM _airbyte_raw_sample_employee_payrate;
The stream (table) name in destination is prefixed with
_airbyte_raw_ because Normalization and Transformation are not supported for this connection, and we only have the raw table. Each stream (table) contains 3 columns:
_airbyte_ab_id: a uuid assigned by Airbyte to each event that is processed. The column type in Teradata is
_airbyte_emitted_at: a timestamp representing when the event was pulled from the data source. The column type in Teradata is
_airbyte_data: a json blob representing the event data. The column type in Teradata is
Here in the
_airbyte_data column, we see 9 rows, the same as we have in the source Google sheet, and the data is in JSON format which can be transformed further as needed.
You can close the connection in Airbyte by disabling the connection. This will stop the data sync process.
You can also delete the connection.
This tutorial demonstrated how to extract data from a source system like Google sheets and use the Airbyte ELT tool to load the data into the Teradata Vantage Instance. We saw the end-to-end data flow and complete configuration steps for running Airbyte Open Source locally, and configuring the source and destination connections. We also discussed about the available data sync configurations based on replication frequency. We validated the results in the destination using Cloudscape Analytics Experience and finally we saw the methods to pause and delete the Airbyte connection.