How To Set Up A Batch Data Pipeline For CSV Files To Redshift, Redshift Spectrum, Athena or BigQuery

Simple and easy bulk import of data to Amazon Redshift, Amazon Redshift Spectrum, Amazon Athena and BigQuery.


A batch data pipeline allows you to deliver, process and route data sources to a target warehouse system like Amazon Redshift, Amazon Redshift Spectrum, Amazon Athena or Google BigQuery. Consolidating your data to a warehouse allows you to easily use your favorite analytics tools like Tableau, Qlik, Mode or Looker.

What are some use cases for this service?

  • Perfect for automated exports from internal systems such as an ERP, an ETL tool, MySQL, SQL Server, Oracle, or other enterprise systems

  • Processes exports from 3rd-party systems like Salesforce (see How to export data from ExactTarget) and Adobe Analytics (see How to export data from Adobe Analytics).

  • Supports secure file transfer protocol (SFTP) with a variety of SFTP clients. The SFTP protocol is secure and commonly used for transferring files via command-line or graphical interfaces.

  • Lastly, supports use cases where you have an ad hoc CSV file (e.g., sales reports, media plans, lookup files or any other CSV file) that you want loaded into your data warehouse.

Setting up a batch data pipeline is a simple and easy process. Let’s get started!

Step 1: Visit the Data Pipeline — Batch Product Page

The first step is to visit our Data Pipeline — Batch product page. If you don’t have an Openbridge account, you will need to create one. It takes about 20 seconds to set one up!

Step 2: Create a User

The next step is to create a user. A user allows you to log into our system to transfer your data.

Give your user a unique nickname, something that you can use to identify it in the future. Next, set a desired username. This is what you will use when you log in to transfer data. Last, we auto-generate a password for you. Click “Show” to reveal it. Don’t worry about remembering it now; we provide a downloadable config file at the end of your order with all these details :)

When you are done, click “Create New Identity”.

Your new identity will now show up as a selectable option in the left column along with any other previous users you have set up:

Select the desired user account and then click “Continue”.

Step 3: Activate Your Data Pipeline Subscription

In this step, we configure the specifics of your data pipeline. Start by giving your pipeline a unique name. If this is loyalty data, you could call it loyalty-purchases. If this were Salesforce tracking data, you could call it salesforce-email-opens.

The name should always be unique and reflect the data you are delivering to this pipeline.

Next, you want to set the name of the target table. This is what we will use to load your data into the destination data warehouse. We will create this table for you, so make sure the name is unique within your data warehouse.

In our example, you are sending loyalty data, so we will name the table loyalty_purchases. If this were Salesforce tracking data, you could name it salesforce_email_opens.

Next, select the target destination data warehouse. This is the warehouse location(s) you set up as part of your Openbridge account. This tells us the data warehouse where you want the data delivered. In our example below, we have registered a collection of warehouses for different teams:

The next step is to select the file size of the data being sent. If the data is never bigger than 1 MB, then you can select “Extra Small”. If your file sizes are closer to 1 GB, then you will want “Extra Large”.

Data Trainer: Uploading Sample Data

An important step is to “teach” our system about the data you are sending. The sample data will train our system to understand the schema and data types present.

Let’s use a sample file called salestransactions.csv for our example. This file reflects the data that will be delivered:

street,city,zip,state,beds,baths,sq_ft,type,sale_date,price,latitude,longitude
3526 HIGH ST,SACRAMENTO,95838,CA,2,1,836,Residential,Wed May 21 00:00:00 EDT 2008,59222,38.631913,-121.434879
51 OMAHA CT,SACRAMENTO,95823,CA,3,1,1167,Residential,Wed May 21 00:00:00 EDT 2008,68212,38.478902,-121.431028
2796 BRANCH ST,SACRAMENTO,95815,CA,2,1,796,Residential,Wed May 21 00:00:00 EDT 2008,68880,38.618305,-121.443839
2805 JANETTE WAY,SACRAMENTO,95815,CA,2,1,852,Residential,Wed May 21 00:00:00 EDT 2008,69307,38.616835,-121.439146
6001 MCMAHON DR,SACRAMENTO,95824,CA,2,1,797,Residential,Wed May 21 00:00:00 EDT 2008,81900,38.51947,-121.435768
5828 PEPPERMILL CT,SACRAMENTO,95841,CA,3,1,1122,Condo,Wed May 21 00:00:00 EDT 2008,89921,38.662595,-121.327813
6048 OGDEN NASH WAY,SACRAMENTO,95842,CA,3,2,1104,Residential,Wed May 21 00:00:00 EDT 2008,90895,38.681659,-121.351705
2561 19TH AVE,SACRAMENTO,95820,CA,3,1,1177,Residential,Wed May 21 00:00:00 EDT 2008,91002,38.535092,-121.481367
11150 TRINITY RIVER DR Unit 114,RANCHO CORDOVA,95670,CA,2,2,941,Condo,Wed May 21 00:00:00 EDT 2008,94905,38.621188,-121.270555

We will use this file to “train” the system to understand the data being delivered to it. Select the “Upload file” option.
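
If your real exports are large, you typically do not need to upload the entire file as the trainer; a smaller, representative sample that includes the header row and enough rows to cover your typical values should work. Here is a minimal sketch of one way to build such a sample with Python (the file names are hypothetical; adjust them to your own export):

# Build a small trainer file from the top of a larger CSV export.
# File names here are hypothetical examples; replace them with your own.
import csv

SAMPLE_ROWS = 100  # header row plus the first 100 data rows

with open("salestransactions_full.csv", newline="") as src, \
     open("salestransactions.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for i, row in enumerate(reader):
        if i > SAMPLE_ROWS:  # row 0 is the header
            break
        writer.writerow(row)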

You will see that the trainer file was successfully loaded and processed (A). Next, you will need to download the schema the system generated (B):

Take a look at the schema we generated based on the sample file. This will define the table we will create for you. It should reflect the structure of the original CSV file with corresponding data types:

column_name,data_type
street,VARCHAR (1024)
city,VARCHAR (1024)
zip,BIGINT
state,VARCHAR (1024)
beds,BIGINT
baths,BIGINT
sq_ft,BIGINT
type,VARCHAR (1024)
sale_date,VARCHAR (1024)
price,BIGINT
latitude,DOUBLE PRECISION
longitude,DOUBLE PRECISION
ob_transaction_id,varchar(256)
ob_file_name,varchar(2048)
ob_processed_at,varchar(256)
ob_modified_date,datetime
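
Before confirming, you can optionally double-check that the generated schema covers every column in your source file. The ob_transaction_id, ob_file_name, ob_processed_at and ob_modified_date columns are metadata added by the pipeline, so they will not appear in your CSV. Here is a minimal sketch of such a check in Python, assuming you saved the downloaded schema as schema.csv next to your sample file (both file names are illustrative):

# Compare the generated schema's column names against the source CSV header.
# "salestransactions.csv" and "schema.csv" are illustrative file names.
import csv

with open("salestransactions.csv", newline="") as f:
    source_columns = next(csv.reader(f))  # header row of the sample file

with open("schema.csv", newline="") as f:
    schema_columns = [row["column_name"] for row in csv.DictReader(f)
                      if not row["column_name"].startswith("ob_")]  # skip pipeline metadata

print("Missing from schema:", set(source_columns) - set(schema_columns) or "none")
print("Not in source file:", set(schema_columns) - set(source_columns) or "none")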

Does everything look okay?

If YES, click “Confirm file is OK” ( C ) as shown above.

Almost Done…Review Your Pipeline

You are almost done! Let’s review everything you have set up to make sure it looks correct. If everything looks good, click “Pay and Launch” as shown below:

Congratulations! Your Pipeline Is Ready

After you click “Pay and Launch”, a text file will be downloaded to your computer. It contains the connection details, delivery settings, and credentials needed to send your “loyalty_purchases” data to your target data warehouse destination, “athena-customer-database”:

################
DATA PIPELINE - BATCH DETAILS
################
Welcome! You have set up a connection to batch data for processing, routing and loading. Below are your connection details, credentials and some prerequisites for connecting to your system.
----------
CONNECTION
----------
The following are your connection details.
host: pipeline-01.openbridge.io
port: 22 or 443
database destination: athena-customer-database
table destination: loyalty_purchases
Note: port 443 is often used in the event a corporate firewall blocks outbound connections on port 22.
----------
CREDENTIALS
----------
The following are your Data Pipeline Batch credentials. Keep them safe and secure. They provide READ and WRITE access to the data resident in your pipeline.
username: yourusername
password: yourpassword
----------
TIPS & RESOURCES
----------
General overview, requirements and best practices:
https://docs.openbridge.com/data-pipelines/data-pipeline-overview
Set Up Prerequisites
You will need a software client to transfer data to Openbridge. You have a number of options available. Pick one that is right for your team:
https://docs.openbridge.com/data-pipelines/data-pipeline-batch-client-software
We also have an open-source solution that leverages rclone for advanced file transfers from various cloud and local storage systems:
https://github.com/openbridge/ob_bulkstash

Next, you will see the “Thank You” page:

You have now activated your batch data pipeline and can start delivering data exports via SFTP to your data warehouse!
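
If you prefer to script your deliveries instead of using a desktop SFTP client, here is a minimal sketch using Python and the third-party paramiko library (pip install paramiko). The host, port, username and password come from the config file you downloaded; the file name and remote path below are illustrative placeholders, so check the batch pipeline docs linked in your config file for the exact delivery conventions for your pipeline:

# Upload a CSV export to the batch pipeline over SFTP.
# Connection details come from your downloaded config file; the file
# name and remote path below are illustrative placeholders.
import paramiko

HOST = "pipeline-01.openbridge.io"
PORT = 22                      # use 443 if a firewall blocks outbound port 22
USERNAME = "yourusername"      # from the CREDENTIALS section
PASSWORD = "yourpassword"

LOCAL_FILE = "loyalty_purchases_2020-01-01.csv"   # your export on disk
REMOTE_PATH = "loyalty_purchases_2020-01-01.csv"  # delivered file name

transport = paramiko.Transport((HOST, PORT))
transport.connect(username=USERNAME, password=PASSWORD)
sftp = paramiko.SFTPClient.from_transport(transport)
try:
    sftp.put(LOCAL_FILE, REMOTE_PATH)  # transfer the file for processing and loading
finally:
    sftp.close()
    transport.close()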

Get Started!

Getting started is simple, fast and free. Log in to Openbridge or sign up for an account if you haven’t done so already.

If you have any questions about batch or bulk data processing and want to know more about how to unlock powerful data from systems that hold your data, feel free to leave a comment or contact us at hello@openbridge.com.
