What Is A Batch Data Pipeline?
A data pipeline is a series of well-designed, intuitive, and cohesive components built on top of the Openbridge platform. If you are familiar with loading data into relational data stores, this will feel familiar, but simpler and automated.
An important concept for data pipelines is understanding how your data is delivered and organized. Typically, you will want to make sure that the file(s) being delivered align with how you want them stored in the database. Not only does this ensure accuracy and consistency, it is a key element of the "hands-off" automation of your data pipeline workflows.
Organizing And Delivering Data
The first step is to log in to your Data Pipeline location using the credentials you were provided. The name of each directory is important because it will be the name of the table in the database where the data is loaded and where you will access it.
Directories
In this guide, we will show you an example of how to transfer Salesforce data to a pipeline called loyalty_purchases. The pipeline was configured during the setup process to load data to a target table called loyalty_purchases within a Redshift data warehouse. This process shows how loading data to the target upload location triggers the import to your data warehouse.
Connecting To Your Pipeline And Transferring Data
The first step is to connect to the server with your preferred tools:
Host: pipeline-01.openbridge.io
Protocol: SFTP
Port: 22 or 443 (Note: port 443 is often used in the event a corporate firewall blocks outbound connections on port 22)
User: yourusername
Password: yourpassword
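If you connect from the command line, a session with the standard OpenSSH sftp client might look like the sketch below; the username is a placeholder and the uploaded file name is illustrative.

$ sftp -P 443 yourusername@pipeline-01.openbridge.io
sftp> ls
loyalty_purchases   testing
sftp> cd testing
sftp> put 2016-08-01_salesforce_click.zip
sftp> quit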
After logging in, you will see two folders: /loyalty_purchases and /testing. There are two important concepts relating to where you upload data. The /testing directory is where you can experiment and test your file transfers. This is your sandbox. The /testing directory allows you to upload data without it getting processed and loaded to a warehouse.
The /loyalty_purchases folder is your production delivery location. Only push the data you configured in the batch setup process to this location. This reflects the data you want loaded into a loyalty_purchases table in your target data warehouse destination.
In our example, we transferred all of our *_salesforce_click.zip data to the /loyalty_purchases production location. We transferred 15 files totaling close to 74 MB. All 15 files will be processed and loaded into your target warehouse destination. Your *_salesforce_click.zip data will make its way to the loyalty_purchases table in about 5-10 minutes.
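A production transfer like this can also be scripted. The sketch below uses the OpenSSH sftp client in batch mode; the batch file name and local file locations are assumptions for illustration.

$ cat upload.txt
cd loyalty_purchases
put *_salesforce_click.zip
$ sftp -b upload.txt -P 22 yourusername@pipeline-01.openbridge.io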
File Naming
The Openbridge Data Pipeline accepts any UTF-8 encoded delimited text format (e.g. CSV) with column headers (header row) and a valid file type extension (.csv or .txt).
File names should meet the 3 criteria outlined below.
Descriptive: The file name should include the data source, data description, and date/date range associated with the data in each file (e.g. socialcom_paidsocial_20140801.txt or socialcom_paidsocial_20140701_20140801.txt).
Unique: Unique files will be stored and accessible for auditing purposes. Files posted with the same name will be overwritten, making auditing impossible (e.g. socialcom_paidsocial_20140801.txt, socialcom_paidsocial_20140901.txt, socialcom_paidsocial_20141001.txt).
Consistent: File naming patterns should be kept consistent over time to enable automated auditing (e.g. socialcom_paidsocial_20140801.txt, socialcom_paidsocial_20140901.txt, socialcom_paidsocial_20141001.txt). This will facilitate any required auditing/QC of data received.
It is highly likely that use of non-standard file naming conventions will cause your import pipeline to fail or result in an outcome different than expected.
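For example, a nightly export script might build a compliant file name from the current date; the source and description strings here are placeholders.

$ fname="socialcom_paidsocial_$(date +%Y%m%d).txt"   # e.g. socialcom_paidsocial_20140801.txt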
Example: File Delivery
Let's walk through an example, based on the customers directory structure we defined previously, when a collection of csv and zip files has been delivered to these directories.
customers/
├── 2016-01-01_customers.csv
├── 2016-01-02_customers.csv
├── 2016-01-03_customers.csv
rewards/
├── 2016-01-01_rewards.zip
├── 2016-01-02_rewards.zip
├── 2016-01-03_rewards.zip
├── 2016-01-04_rewards.zip
└── 2016-01-05_rewards.zip
transactions/
├── 2016-01-01_transactions.csv
├── 2016-01-02_transactions.csv
├── 2016-01-03_transactions.csv
├── 2016-01-04_transactions.csv
├── 2016-01-05_transactions.csv
└── 2016-01-06_transactions.csv
In this example, a collection of files has been delivered to the customers, rewards, and transactions directories. In this use case, a table is generated for each directory to which a file is delivered. In the above example, you will have customers, rewards, and transactions tables. The content of each table will be based on the corresponding .csv or .zip files placed in each directory.
Understanding Your File Layouts
Please note that each .csv within a directory must share the same data structure/layout. For example, all the "*_transactions.csv" files located in the customers/transactions/ directory share the following structure:
"ID","DATE","CUSTOMERID","PURCHASE","STOREID"
"0122","December 10, 2015","123432100A","8.43","4985"
"0123","December 10, 2015","543421111A","2.61","3212"
When additional "*_transactions.csv" files are delivered to customers/transactions/, they will be loaded to a table named transactions.
File Structure/Layout
The file column headers are important as they will also be the field names used in the corresponding table. As with directory names, the column headers should be descriptive enough for the end user to understand their contents. The column headers must meet the same syntax requirements, outlined below:
Must start with a letter
Must contain only lowercase letters
Can contain numbers and underscores ('_'), but no spaces or special characters (e.g. '@', '#', '%')
Valid column header: "inquiries_to_purchase_pct"
Invalid column header: "Inquiries to Purchase %"
Automated Header Cleaning
If the field names in the batch file do not meet these requirements, the system will automatically perform the following transformations:
Uppercase letters will be converted to lowercase
Spaces will be converted to underscores ('_')
Special characters will be dropped
If the system cannot automatically perform these transformations, the data will fail to load or will load in a manner that is inconsistent with expectations.
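As a rough sketch of what these transformations amount to (not the exact logic Openbridge runs), you could normalize a header row yourself before delivery:

# lowercase the header, replace spaces with underscores, drop special characters
$ head -1 foo.csv
"Inquiries to Purchase","Store ID"
$ head -1 foo.csv | tr '[:upper:]' '[:lower:]' | tr ' ' '_' | tr -d '@#%'
"inquiries_to_purchase","store_id"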
Dealing With File Layout Changes
The first file delivered to each directory acts as a 'blueprint' for the corresponding database table, defining the field layout (number and order of fields), field formats (text, integer, date, decimal), and expected field delimiter (tab, comma, pipe). The successful load of each subsequent file delivered to that directory depends on it following the layout established by the initial 'blueprint' file. Let's walk through an example to illustrate.
Let's assume that the initial file for the transactions directory looked like the sample file below, and that *_transactions.csv files following this format were delivered to the transactions directory on a daily basis for most of the year:
"ID","DATE","CUSTOMERID","PURCHASE","STOREID"
"0122","December 10, 2015","123432100A","8.43","4985"
"0123","December 10, 2015","543421111A","2.61","3212"
However, let us say that in November a change was made upstream which changed the layout of the *_transactions.csv file being delivered. A file named 2016-11-15_transactions.csv was delivered to customers/transactions/ and the file contained an additional field called LOYALTYID:
"ID","DATE","CUSTOMERID","PURCHASE","LOYALTYID","STOREID"
"0122","December 10, 2015","123432100A","8.43","A102B","4985"
"0123","December 10, 2015","543421111A","2.61","A904F","3212"
The addition of the LOYALTYID field creates a mismatch between the old structure and the new one. The load process for 2016-11-15_transactions.csv (or any other file like it) to the transactions table will fail because the underlying structure is different.
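A quick way to catch this before delivery is to compare the header row of a new file against a known-good 'blueprint' file. This is only a sketch using the file names from the example above; any output from diff means the layouts differ.

$ diff <(head -1 2016-01-01_transactions.csv) <(head -1 2016-11-15_transactions.csv)
1c1
< "ID","DATE","CUSTOMERID","PURCHASE","STOREID"
---
> "ID","DATE","CUSTOMERID","PURCHASE","LOYALTYID","STOREID"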
If this situation arises, please contact Openbridge Support (support@openbridge.com).
Compressed Files
Openbridge supports the delivery and processing of compressed files in zip, gzip, or tar.gz formats. However, since we do not know the contents of an archive before it arrives and we unpack it, custom processing is needed to ensure we handle the contents in the manner you want. This is especially important when a single archive contains a number of distinct files that have different data types and structures. Here are a few different use cases that arise with compressed files.
Simple Use Case: One Archive, One File
In this example we have *_rewards.zip archives that are delivered to customers/rewards/.
The 2016-01-01_rewards.zip archive contains three files: 2016-01-01a_rewards.csv, 2016-01-01b_rewards.csv, and 2016-01-01c_rewards.csv. The archive structure looks like this:
customers/
└── rewards/
    └── 2016-01-01_rewards.zip
        ├── 2016-01-01a_rewards.csv
        ├── 2016-01-01b_rewards.csv
        └── 2016-01-01c_rewards.csv
The system will unpack 2016-01-01_rewards.zip according to the contents of the archive. This would result in three directories, one for each file.
customers/
└── rewards/
    ├── 2016-01-01_rewards.zip
    ├── 2016-01-01a_rewards/
    │   └── 2016-01-01a_rewards.csv
    ├── 2016-01-01b_rewards/
    │   └── 2016-01-01b_rewards.csv
    └── 2016-01-01c_rewards/
        └── 2016-01-01c_rewards.csv
The unpacked directory is based on the name of the file. For example, a filename of 2016-01-01a_rewards.csv results in a directory named 2016-01-01a_rewards/.
In this example all *_rewards.zip archives share a common schema. This means 2016-01-01a_rewards.csv, 2016-01-01b_rewards.csv, and 2016-01-01c_rewards.csv share the same exact structure:
"ID","DATE","CUSTOMERID","PURCHASE","LOYALTYID","STOREID"
"0122","December 10, 2015","123432100A","8.43","A102B","4985"
"0123","December 10, 2015","543421111A","2.61","A904F","3212"
So regardless of the unpacking directory hierarchy, everything in rewards/ shares a schema, and all data will be routed to a common rewards table.
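Before delivering an archive like this, you may want to confirm that every file inside really does share one schema. A possible pre-flight check, assuming the unzip utility is available:

# print the header row of every CSV inside the archive; the rows should be identical
for f in $(unzip -Z1 2016-01-01_rewards.zip '*.csv'); do
  unzip -p 2016-01-01_rewards.zip "$f" | head -1
done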
Complex Use Case: One Archive, Many Files
In this example we have more complex archives called *_offers.zip. These archives are delivered to customers/offers/offers_targeted.
The 2016-01-01_offers.zip archive contains three different files: 2016-01-01_source.csv, 2016-01-01_campaigns.csv, and 2016-01-01_costs.csv.
The unpacked archive structure follows the same pattern as the simple example: each file is unpacked into its own directory, with the directory named after the file.
customers/
└── offers/
    ├── offers_open/
    └── offers_targeted/
        ├── 2016-01-01_offers.zip
        ├── 2016-01-01_source/
        │   └── 2016-01-01_source.csv
        ├── 2016-01-01_campaigns/
        │   └── 2016-01-01_campaigns.csv
        └── 2016-01-01_costs/
            └── 2016-01-01_costs.csv
The files 2016-01-01_source.csv, 2016-01-01_campaigns.csv, and 2016-01-01_costs.csv have the following structures.
Source
"ID","DATE","CHANNELID", "CAMPAIGNID","PARTNERID"
"0122","December 10, 2015","123432100A","8.43","A102B","4985"
"0123","December 10, 2015","543421111A","2.61","A904F","3212"
Campaigns
"ID","DATE","CAMPAIGNID","RESPONSE","ROI","OFFERID"
"0122","December 10, 2015","123432100A","8.43","A102B","4985"
"0123","December 10, 2015","543421111A","2.61","A904F","3212"
Costs
"ID","DATE","SPEND","BUDGET","ADID","CAMPAIGNID"
"0122","December 10, 2015","123432100A","8.43","A102B","4985"
"0123","December 10, 2015","543421111A","2.61","A904F","3212"
Unlike the earlier rewards example, these files represent three related but different classes of data. Because of these differences, they are not stored in a single offers table the way the rewards files were.
The table names are user defined, which allows flexibility in how you want to handle this situation. For example, you may want 2016-01-01_source.csv, 2016-01-01_campaigns.csv, and 2016-01-01_costs.csv routed to offers_source, offers_campaigns, and offers_costs tables.
To properly support delivery of complex compressed files, please contact Openbridge support so the proper handling and routing of compressed data can occur.
Encoding Your Files
Openbridge supports UTF-8 filename character sets. It also supports ASCII, a subset of UTF-8. The system will perform character set decoding checks as the file is being delivered and will automatically discover the encoding used.
While we may accept the delivery of a file, there is no guarantee that downstream data processing pipelines will work when the data is not UTF-8 encoded. A data pipeline will automatically attempt to convert the encoding to UTF-8, but this might not work in all cases. We suggest, if feasible, sending data pre-encoded as UTF-8. Most systems generate UTF-8 encoded data.
Check Encoding
Not sure how to check encoding? Here is an example using the command line. To check the encoding of a file called foo.csv, type the following in your terminal:
For Linux you would type:
$ file -i foo.csv
For OS X it is almost the same command:
$ file -I foo.csv
The output should contain a charset= value, which indicates the character encoding of the file. In the case of our foo.csv, the results are as follows:
foo.csv: text/plain; charset=us-ascii
Notice the charset=us-ascii says the file is ASCII. Perfect! The file is ready to be transferred.
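If the check reports a different charset, you can usually convert the file yourself before delivery. A sketch using iconv, assuming a hypothetical file bar.csv turned out to be Latin-1 (substitute whatever charset your check actually reported):

$ file -i bar.csv
bar.csv: text/plain; charset=iso-8859-1
$ iconv -f ISO-8859-1 -t UTF-8 bar.csv > bar_utf8.csv
$ file -i bar_utf8.csv
bar_utf8.csv: text/plain; charset=utf-8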
Large File Transfers
If you are sending larger files (>1 GB), be aware of transfer behaviour that can give the appearance that the transfer has failed at or near 100%.
How Does it Work?
Our system will not send a completion message until the end-to-end transaction is complete. There are two parts to the system: (1) file transfer and (2) file processing. The transfer part is the time it takes for your client to deliver the file over the network to us. The second part is Openbridge processing the file once it arrives. Both impact the total time for the completion of the transfer.
What Is Happening?
A transfer is not complete when Openbridge has received the file. The system still needs to perform post-processing activities before it can send a success message to your client. This may give the appearance that your transfer has hung or failed. The reason for this is that the server puts the file into a remote temp location during the initial upload. This is to ensure the file is valid and meets the requirements of the pipeline process. Once the file is validated, the system transfers it to the production path for loading to a database. The larger the file, the longer this process takes (temp -> production).
Example
Let's assume you have a 1 GB file. The initial transfer takes about 66 seconds to reach 100%. However, the connection is still processing as previously described. It would likely take about the same amount of time, 60-70 seconds, to complete the (temp -> production) process. So the total time would be about 2.2 minutes.
Now, if it were a 10 GB file, the initial transfer may take about 840 seconds to reach 100% and then another 800-1000 seconds to complete the (temp -> production) process. So the total time is about 30 minutes, half of which is processing time.
What Should You Do?
Your system may think the transfer timed out, so you will want to increase your client timeout windows.
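How you do that depends on your client. For the OpenSSH sftp client, for example, keepalive options can stop the session from being dropped while Openbridge finishes processing; the interval values below are illustrative and allow roughly 30 minutes of post-upload processing.

$ sftp -o ServerAliveInterval=60 -o ServerAliveCountMax=30 -P 22 yourusername@pipeline-01.openbridge.io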
The process can, in various situations, block other transfers from occurring while the original transfer is still in progress. Queue your transfers so they occur after the confirmation of each file delivery. Also, try not to perform ancillary commands to LIST directories while the process is ongoing.