What Is A Batch Data Pipeline?
A data pipeline is a series of well-designed, intuitive, and cohesive components built on top of the Openbridge platform. If you are familiar with loading data into relational data stores, this will feel familiar, but simpler and automated.
An important concept for data pipelines is understanding how your data is delivered and organized. Typically, you will want to make sure that the file(s) being delivered align with how you want them stored in the database. Not only does this ensure accuracy and consistency, it is a key element of the "hands-off" automation of your data pipeline workflows.
Organizing And Delivering Data
The first step is to log in to your Data Pipeline location using the credentials you were provided. The name of each directory is important because it will be the name of the table in the database where the data is loaded and where you will access it.
Directories
In this guide, we will show you an example of how to transfer Salesforce data to a pipeline called loyalty_purchases. The pipeline was configured during the setup process to load data to a target table called loyalty_purchases within a Redshift data warehouse. This process shows how loading data to the target upload location triggers the import to your data warehouse.
Connecting To Your Pipeline And Transferring Data
The first step is to connect to the server with your preferred tools:
Host: pipeline-01.openbridge.io
Protocol: SFTP
Port: 22 or 443 (Note: port 443 is often used in the event a corporate firewall blocks outbound connections on port 22)
User: yourusername
Password: yourpassword
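If you connect from the command line, a session with the standard OpenSSH sftp client might look like the sketch below; the username is a placeholder and the uploaded file name is illustrative.

$ sftp -P 443 yourusername@pipeline-01.openbridge.io
sftp> ls
loyalty_purchases   testing
sftp> cd testing
sftp> put 2016-08-01_salesforce_click.zip
sftp> quit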
After logging in, you will see two folders: /loyalty_purchases and /testing. There are two important concepts relating to where you upload data. The /testing directory is where you can experiment and test your file transfers. This is your sandbox. The /testing directory allows you to upload data without it getting processed and loaded to a warehouse.
The /loyalty_purchases folder is your production delivery location. Only push the data you configured in the batch setup process to this location. This reflects the data you want loaded into a loyalty_purchases table in your target data warehouse destination.
In our example, we transferred all of our *_salesforce_click.zip data to the /loyalty_purchases production location. We transferred 15 files totaling close to 74 MB. All 15 files will be processed and loaded into your target warehouse destination. Your *_salesforce_click.zip data will make its way to the loyalty_purchases table in about 5-10 minutes.
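A production transfer like this can also be scripted. The sketch below uses the OpenSSH sftp client in batch mode; the batch file name and local file locations are assumptions for illustration.

$ cat upload.txt
cd loyalty_purchases
put *_salesforce_click.zip
$ sftp -b upload.txt -P 22 yourusername@pipeline-01.openbridge.io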
File Naming
The Openbridge Data Pipeline accepts any UTF-8 encoded delimited text format (e.g. CSV) with column headers (header row) and a valid file type extension (.csv or .txt).
File names should meet the 3 criteria outlined below.
Descriptive: The file name should include the data source, data description, and date/date range associated with the data in each file (e.g. socialcom_paidsocial_20140801.txt or socialcom_paidsocial_20140701_20140801.txt).
Unique: Unique files will be stored and accessible for auditing purposes. Files posted with the same name will be overwritten, making auditing impossible (e.g. socialcom_paidsocial_20140801.txt, socialcom_paidsocial_20140901.txt, socialcom_paidsocial_20141001.txt).
Consistent: File naming patterns should be kept consistent over time to enable automated auditing (e.g. socialcom_paidsocial_20140801.txt, socialcom_paidsocial_20140901.txt, socialcom_paidsocial_20141001.txt). This will facilitate any required auditing/QC of data received.
It is highly likely that use of non-standard file naming conventions will cause your import pipeline to fail or result in an outcome different than expected.
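For example, a nightly export script might build a compliant file name from the current date; the source and description strings here are placeholders.

$ fname="socialcom_paidsocial_$(date +%Y%m%d).txt"   # e.g. socialcom_paidsocial_20140801.txt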
Example: File Delivery
Let's walk through an example, based on the customers directory structure we defined previously, when a collection of csv and zip files has been delivered to these directories.
customers/
├── 2016-01-01_customers.csv
├── 2016-01-02_customers.csv
├── 2016-01-03_customers.csv
rewards/
├── 2016-01-01_rewards.zip
├── 2016-01-02_rewards.zip
├── 2016-01-03_rewards.zip
├── 2016-01-04_rewards.zip
└── 2016-01-05_rewards.zip
transactions/
├── 2016-01-01_transactions.csv
├── 2016-01-02_transactions.csv
├── 2016-01-03_transactions.csv
├── 2016-01-04_transactions.csv
├── 2016-01-05_transactions.csv
└── 2016-01-06_transactions.csv
In this example, a collection of files has been delivered to the customers, rewards, and transactions directories. In this use case, a table is generated for each directory to which a file is delivered. In the above example, you will have customers, rewards, and transactions tables. The content of each table will be based on the corresponding .csv or .zip files placed in each directory.
Understanding Your File Layouts
Please note that each .csv within a directory must share the same data structure/layout. For example, all the "*_transactions.csv" files located in the customers/transactions/ directory share the following structure:
"ID","DATE","CUSTOMERID","PURCHASE","STOREID"
"0122","December 10, 2015","123432100A","8.43","4985"
"0123","December 10, 2015","543421111A","2.61","3212"
When additional "*_transactions.csv" files are delivered to customers/transactions/, they will be loaded to a table named transactions.
File Structure/Layout
The file column headers are important as they will also be the field names used in the corresponding table. As with directory names, the column headers should be descriptive enough for the end user to understand their contents. The column headers must meet the same syntax requirements, outlined below:
Must start with a letter
Must contain only lowercase letters
Can contain numbers and underscores ('_'), but no spaces or special characters (e.g. '@', '#', '%')
Valid column header: "inquiries_to_purchase_pct"
Invalid column header: "Inquiries to Purchase %"
Automated Header Cleaning
If the field names in the batch file do not meet these requirements, the system will automatically perform the following transformations:
Uppercase letters will be converted to lowercase
Spaces will be converted to underscores ('_')
Special characters will be dropped
If the system cannot automatically perform these transformations, the data will fail to load or will load in a manner that is inconsistent with expectations.
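As a rough sketch of what these transformations amount to (not the exact logic Openbridge runs), you could normalize a header row yourself before delivery:

# lowercase the header, replace spaces with underscores, drop special characters
$ head -1 foo.csv
"Inquiries to Purchase","Store ID"
$ head -1 foo.csv | tr '[:upper:]' '[:lower:]' | tr ' ' '_' | tr -d '@#%'
"inquiries_to_purchase","store_id"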
Dealing With File Layout Changes
The first file delivered to each directory acts as a 'blueprint' for the corresponding database table, defining the field layout (number and order of fields), field formats (text, integer, date, decimal), and expected field delimiter (tab, comma, pipe). The successful load of each subsequent file delivered to that directory depends on it following the layout established by the initial 'blueprint' file. Let's walk through an example to illustrate.
Let's assume that the initial file for the transactions directory looked like the sample file below, and that *_transactions.csv files following this format were delivered to the transactions directory on a daily basis for most of the year:
"ID","DATE","CUSTOMERID","PURCHASE","STOREID"
"0122","December 10, 2015","123432100A","8.43","4985"
"0123","December 10, 2015","543421111A","2.61","3212"
However, let us say that in November a change was made upstream which changed the layout of the *_transactions.csv file being delivered. A file named 2016-11-15_transactions.csv was delivered to customers/transactions/ and the file contained an additional field called LOYALTYID:
"ID","DATE","CUSTOMERID","PURCHASE","LOYALTYID","STOREID"
"0122","December 10, 2015","123432100A","8.43","A102B","4985"
"0123","December 10, 2015","543421111A","2.61","A904F","3212"
The addition of the LOYALTYID field creates a mismatch between the old structure and the new one. The load process for 2016-11-15_transactions.csv (or any other file like it) to the transactions table will fail because the underlying structure is different.
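A quick way to catch this before delivery is to compare the header row of a new file against a known-good 'blueprint' file. This is only a sketch using the file names from the example above; any output from diff means the layouts differ.

$ diff <(head -1 2016-01-01_transactions.csv) <(head -1 2016-11-15_transactions.csv)
1c1
< "ID","DATE","CUSTOMERID","PURCHASE","STOREID"
---
> "ID","DATE","CUSTOMERID","PURCHASE","LOYALTYID","STOREID"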
If this situation arises, please contact Openbridge Support (support@openbridge.com).
Compressed Files
Openbridge supports the delivery and processing of compressed files in zip, gzip, or tar.gz formats. However, since we do not know the contents of an archive before it arrives and we unpack it, custom processing is needed to ensure we handle the contents in the manner you want. This is especially important when a single archive contains a number of distinct files that have different data types and structures. Here are a few different use cases that arise with compressed files.
Simple Use Case: One Archive, One File
In this example we have *_rewards.zip archives that are delivered to customers/rewards/.
The 2016-01-01_rewards.zip archive contains three files: 2016-01-01a_rewards.csv, 2016-01-01b_rewards.csv, and 2016-01-01c_rewards.csv. The archive structure looks like this:
customers/
└── rewards/
    └── 2016-01-01_rewards.zip
        ├── 2016-01-01a_rewards.csv
        ├── 2016-01-01b_rewards.csv
        └── 2016-01-01c_rewards.csv
The system will unpack 2016-01-01_rewards.zip according to the contents of the archive. This would result in three directories, one for each file.
customers/
└── rewards/
    ├── 2016-01-01_rewards.zip
    ├── 2016-01-01a_rewards/
    │   └── 2016-01-01a_rewards.csv
    ├── 2016-01-01b_rewards/
    │   └── 2016-01-01b_rewards.csv
    └── 2016-01-01c_rewards/
        └── 2016-01-01c_rewards.csv
The unpacked directory is based on the name of the file. For example, a filename of 2016-01-01a_rewards.csv results in a directory named 2016-01-01a_rewards/.
In this example all *_rewards.zip archives share a common schema. This means 2016-01-01a_rewards.csv, 2016-01-01b_rewards.csv, and 2016-01-01c_rewards.csv share the same exact structure:
"ID","DATE","CUSTOMERID","PURCHASE","LOYALTYID","STOREID"
"0122","December 10, 2015","123432100A","8.43","A102B","4985"
"0123","December 10, 2015","543421111A","2.61","A904F","3212"
So regardless of the unpacking directory hierarchy, everything in rewards/ shares a schema, and all data will be routed to a common rewards table.
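Before delivering an archive like this, you may want to confirm that every file inside really does share one schema. A possible pre-flight check, assuming the unzip utility is available:

# print the header row of every CSV inside the archive; the rows should be identical
for f in $(unzip -Z1 2016-01-01_rewards.zip '*.csv'); do
  unzip -p 2016-01-01_rewards.zip "$f" | head -1
done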
Complex Use Case: One Archive, Many Files
In this example we have more complex archives called *_offers.zip. These archives are delivered to customers/offers/offers_targeted.
The 2016-01-01_offers.zip archive contains three different files: 2016-01-01_source.csv, 2016-01-01_campaigns.csv, and 2016-01-01_costs.csv.
The unpacked archive structure follows the same pattern as the simple example: each file is unpacked into its own directory, with the directory named after the file.
customers/
└── offers/
    ├── offers_open/
    └── offers_targeted/
        ├── 2016-01-01_offers.zip
        ├── 2016-01-01_source/
        │   └── 2016-01-01_source.csv
        ├── 2016-01-01_campaigns/
        │   └── 2016-01-01_campaigns.csv
        └── 2016-01-01_costs/
            └── 2016-01-01_costs.csv
The files 2016-01-01_source.csv, 2016-01-01_campaigns.csv, and 2016-01-01_costs.csv have the following structures.
Source
"ID","DATE","CHANNELID", "CAMPAIGNID","PARTNERID"
"0122","December 10, 2015","123432100A","8.43","A102B","4985"
"0123","December 10, 2015","543421111A","2.61","A904F","3212"
Campaigns
"ID","DATE","CAMPAIGNID","RESPONSE","ROI","OFFERID"
"0122","December 10, 2015","123432100A","8.43","A102B","4985"
"0123","December 10, 2015","543421111A","2.61","A904F","3212"
Costs
"ID","DATE","SPEND","BUDGET","ADID","CAMPAIGNID"
"0122","December 10, 2015","123432100A","8.43","A102B","4985"
"0123","December 10, 2015","543421111A","2.61","A904F","3212"
Unlike the earlier rewards example, these files represent three related but different classes of data. Because of these differences, they are not stored in a single offers table the way the rewards files were.
The table names are user defined, which allows flexibility in how you want to handle this situation. For example, you may want 2016-01-01_source.csv, 2016-01-01_campaigns.csv, and 2016-01-01_costs.csv routed to offers_source, offers_campaigns, and offers_costs tables.
To properly support delivery of complex compressed files, please contact Openbridge support so the proper handling and routing of compressed data can occur.
Encoding Your Files
Openbridge supports UTF-8 filename character sets. It also supports ASCII, a subset of UTF-8. The system will perform character set decoding checks as the file is being delivered and will automatically discover the encoding used.
While we may accept the delivery of a file, there is no guarantee that downstream data processing pipelines will work when the data is not UTF-8 encoded. A data pipeline will automatically attempt to convert the encoding to UTF-8, but this might not work in all cases. We suggest, if feasible, sending data pre-encoded as UTF-8. Most systems generate UTF-8 encoded data.
Check Encoding
Not sure how to check encoding? Here is an example using the command line. To check the encoding of a file called foo.csv, type the following in your terminal:
For Linux you would type:
$ file -i foo.csv
For OS X it is almost the same command:
$ file -I foo.csv
The output should contain a charset= value, which indicates the character encoding of the file. In the case of our foo.csv, the results are as follows:
foo.csv: text/plain; charset=us-ascii
Notice the charset=us-ascii says the file is ASCII. Perfect! The file is ready to be transferred.
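If the check reports a different charset, you can usually convert the file yourself before delivery. A sketch using iconv, assuming a hypothetical file bar.csv turned out to be Latin-1 (substitute whatever charset your check actually reported):

$ file -i bar.csv
bar.csv: text/plain; charset=iso-8859-1
$ iconv -f ISO-8859-1 -t UTF-8 bar.csv > bar_utf8.csv
$ file -i bar_utf8.csv
bar_utf8.csv: text/plain; charset=utf-8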
Large File Transfers
If you are sending larger files (>1 GB), be aware of transfer behaviour that can give the appearance that the transfer has failed at or near 100%.
How Does it Work?
Our system will not send a completion message until the end-to-end transaction is complete. There are two parts to the system: (1) file transfer and (2) file processing. The transfer part is the time it takes for your client to deliver the file over the network to us. The second part is Openbridge processing the file once it arrives. Both impact the total time for the completion of the transfer.
What Is Happening?
A transfer is not complete when Openbridge has received the file. The system still needs to perform post-processing activities before it can send a success message to your client. This may give the appearance that your transfer has hung or failed. The reason for this is that the server puts the file into a remote temp location during the initial upload. This is to ensure the file is valid and meets the requirements of the pipeline process. Once the file is validated, the system transfers it to the production path for loading to a database. The larger the file, the longer this process takes (temp -> production).
Example
Let's assume you have a 1 GB file. The initial transfer takes about 66 seconds to reach 100%. However, the connection is still processing as previously described. It would likely take about the same amount of time, 60-70 seconds, to complete the (temp -> production) process. So the total time would be about 2.2 minutes.
Now, if it were a 10 GB file, the initial transfer may take about 840 seconds to reach 100% and then another 800-1000 seconds to complete the (temp -> production) process. So the total time is about 30 minutes, half of which is processing time.
What Should You Do?
Your system may think the transfer timed out, so you will want to increase your client timeout windows.
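How you do that depends on your client. For the OpenSSH sftp client, for example, keepalive options can stop the session from being dropped while Openbridge finishes processing; the interval values below are illustrative and allow roughly 30 minutes of post-upload processing.

$ sftp -o ServerAliveInterval=60 -o ServerAliveCountMax=30 -P 22 yourusername@pipeline-01.openbridge.io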
The process can, in various situations, block other transfers from occurring while the original transfer is still in progress. Queue your transfers so they occur after the confirmation of each file delivery. Also, try not to perform ancillary commands to LIST directories while the process is ongoing.