One of our foundational tasks is managing the risk associated with changes in upstream data from source systems. For example, what should happen when Google or Facebook adds or removes columns in their data feeds? Should the pipeline for that data "fail" gracefully, or should it adapt to the changes in real time?
Since the goal is continuous operation of data pipelines, you need a data catalog to keep track of these changes, especially with a data lake. The Openbridge system dynamically adapts to the changes it detects in each data source, producing a robust data catalog.
Here is an example data catalog in BigQuery:
For more details on how this process works, continue reading.
Dealing With Change Via Table Versioning
When data for a particular source first appears within Openbridge, that data is used to define the schema (blueprint) for the data warehouse "source" table. This schema definition includes field names, field order, and field data types. These source tables are generated following a standardized naming convention: the source name plus a version suffix (e.g., customer_v1).
For example, let's assume we have source data called ‘customer’. This data contains 4 columns (
cust_id, cust_first, cust_last, cust_zip ). The result of this data is a table named
customer_v1 in your data warehouse. It will look like this:
What happens if there are changes to the original data schema? Typically, changes to data schema will fall into one of the following scenarios:
- A field is added
- A field is deleted
- A field name is changed
- A field type is changed
The automated treatment of data for each of these scenarios is described in detail below.
Scenario 1: A field is added
Let's continue to use the sample use case for illustrative purposes. If at some point data posted to the 'customer' pipeline includes an additional field (e.g., cust_age), the system will recognize that the new data schema does not match the existing blueprint for the 'customer' table. It will automatically create a new version of that table with a _v2 suffix (e.g., customer_v2) with the new field layout…
Source Tables and Views
To mitigate potential issues associated with using multiple versioned tables, the system also automatically generates a 'master' view with a suffix of _master that contains the superset of all fields from each table version. In our example, a view named customer_master would be generated that looks like this…
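To make the mechanics concrete, here is a minimal sketch of how such a master view can be built, using an in-memory SQLite database for illustration. Openbridge generates the real view in your warehouse automatically; the table contents and names of the sample rows below are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# customer_v1: the original four-column layout
cur.execute("CREATE TABLE customer_v1 (cust_id, cust_first, cust_last, cust_zip)")
cur.execute("INSERT INTO customer_v1 VALUES (1, 'Ada', 'Lovelace', '10001')")

# customer_v2: the same feed after cust_age was added
cur.execute("CREATE TABLE customer_v2 (cust_id, cust_first, cust_last, cust_zip, cust_age)")
cur.execute("INSERT INTO customer_v2 VALUES (2, 'Alan', 'Turing', '10002', 41)")

# customer_master: the superset of all fields; NULL fills the column
# that does not exist in the older table version
cur.execute("""
    CREATE VIEW customer_master AS
    SELECT cust_id, cust_first, cust_last, cust_zip, NULL AS cust_age
    FROM customer_v1
    UNION ALL
    SELECT cust_id, cust_first, cust_last, cust_zip, cust_age
    FROM customer_v2
""")

for row in cur.execute("SELECT * FROM customer_master ORDER BY cust_id"):
    print(row)
```

Rows from customer_v1 simply carry a NULL in the cust_age column, so queries against the master view see one consistent field list regardless of which table version a row came from.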
Scenario 2: A field is deleted
In this scenario, let's assume the delivery of the two data feeds in the example above was reversed: the data first delivered to the 'customer' pipeline has five fields (cust_id, cust_first, cust_last, cust_zip, cust_age). In this case, the customer_v1 table will look like this…
At some point, data in the 'customer' pipeline no longer has the cust_age field. As with the previous scenario, the system will recognize the different schema layout and generate a _v2 version of the customer table with the new layout.
Once again, the system will automatically generate a master view that includes the superset of fields from each table, which will look just like the master view in the previous example…
Scenario 3: A field is renamed
There may be a use case where the name for one of the fields in your source data changes. Since it is impossible for the system to know whether a field has been deleted and another field added or a field has been renamed, this scenario will also result in a new table version. Let’s continue with our sample data to illustrate.
The data posted to the 'customer' pipeline has five fields (cust_id, cust_first, cust_last, cust_zip, cust_age) and generates the customer_v1 table. Subsequent data in the customer pipeline includes the same number of fields, but one of the fields has been renamed from cust_zip to cust_zipcode. This change will generate a _v2 table that looks like this:
As in previous scenarios, the system will automatically create a master view with the superset of the fields from the _v1 and _v2 tables, which in this case will look like this:
Creating Customized Views For Source Data
If it is necessary that the data for the cust_zip and cust_zipcode fields be consolidated into one field, you have a couple of options. The first option would be to create a user-generated custom view that consolidates the two fields in question. In our example, the SQL for this custom view would look like this:
CREATE OR REPLACE VIEW customer_master_consolidated AS
SELECT customer_v1.cust_id, customer_v1.cust_first, customer_v1.cust_last, customer_v1.cust_zip
FROM customer_v1
UNION ALL
SELECT customer_v2.cust_id, customer_v2.cust_first, customer_v2.cust_last, customer_v2.cust_zipcode AS cust_zip
FROM customer_v2
The view resulting from this SQL statement would look like this…
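Here is a runnable sketch of the same consolidation, using an in-memory SQLite database for illustration (SQLite uses CREATE VIEW rather than BigQuery's CREATE OR REPLACE VIEW, and the sample rows are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# customer_v1 uses the original cust_zip name; customer_v2 uses the
# renamed cust_zipcode
cur.execute("CREATE TABLE customer_v1 (cust_id, cust_first, cust_last, cust_zip, cust_age)")
cur.execute("CREATE TABLE customer_v2 (cust_id, cust_first, cust_last, cust_zipcode, cust_age)")
cur.execute("INSERT INTO customer_v1 VALUES (1, 'Ada', 'Lovelace', '10001', 36)")
cur.execute("INSERT INTO customer_v2 VALUES (2, 'Alan', 'Turing', '10002', 41)")

# The custom view maps both names onto a single cust_zip column
cur.execute("""
    CREATE VIEW customer_master_consolidated AS
    SELECT cust_id, cust_first, cust_last, cust_zip, cust_age
    FROM customer_v1
    UNION ALL
    SELECT cust_id, cust_first, cust_last, cust_zipcode AS cust_zip, cust_age
    FROM customer_v2
""")

for row in cur.execute(
    "SELECT cust_id, cust_zip FROM customer_master_consolidated ORDER BY cust_id"
):
    print(row)
```

Because the second SELECT aliases cust_zipcode to cust_zip, every row in the consolidated view exposes the zip data under one column name, whichever table version it originated in.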
If custom view generation is not a desirable option, you can also contact Openbridge Support (firstname.lastname@example.org), and a member of our Professional Services team can discuss your requirements.
Scenario 4: The data type for a field changes
When data first arrives for a particular source, the system uses an automated "trainer" to model the data. This process determines the appropriate field type (e.g., string, integer, date, boolean). This becomes part of the blueprint (schema) for each table.
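As a rough illustration of what a type trainer does, the toy function below tries progressively wider types until one accepts every sample value. This is an invented sketch of the general technique, not Openbridge's actual trainer, which is more sophisticated (it also handles dates and other types).

```python
def infer_type(values):
    """Pick the narrowest type that fits every sample value (toy version)."""
    def is_int(v):
        try:
            int(v)
            return True
        except ValueError:
            return False

    def is_float(v):
        try:
            float(v)
            return True
        except ValueError:
            return False

    if all(v.lower() in ("true", "false") for v in values):
        return "boolean"
    if all(is_int(v) for v in values):
        return "integer"
    if all(is_float(v) for v in values):
        return "float"
    return "string"  # the widest type: anything fits

print(infer_type(["true", "false"]))      # boolean
print(infer_type(["1001", "1002"]))       # integer
print(infer_type(["1001", "CUST-1003"]))  # string
```

The key property is that the inferred type becomes part of the table's blueprint, so later data that no longer fits (e.g., a string arriving in an integer column) is detectable as a schema change.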
If, at some point, the data type for a field changes to one that is not compatible with the blueprint field type, a new table version will be generated. Let’s use our sample data to illustrate.
The customer data pipeline has four fields (cust_id, cust_first, cust_last, cust_zip). This will result in a table named customer_v1 that looks like this (field types included in parens):
If subsequent data in the customer pipeline contains data with a new field type (in our example, the cust_id field changes from an integer to a string), a _v2 version of that table will be generated with the new field type:
In this scenario, the master view will include data from both table versions, using the field type that is compatible with both. In our example, data for the cust_id field can be stored as a string for both table versions, so that is the field type used in the master view:
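A minimal sketch of this type widening, again using in-memory SQLite for illustration (the sample rows are invented; your warehouse's CAST syntax may differ slightly):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# cust_id was an integer in v1 and became a string in v2
cur.execute("CREATE TABLE customer_v1 (cust_id INTEGER, cust_first TEXT)")
cur.execute("CREATE TABLE customer_v2 (cust_id TEXT, cust_first TEXT)")
cur.execute("INSERT INTO customer_v1 VALUES (1001, 'Ada')")
cur.execute("INSERT INTO customer_v2 VALUES ('CUST-1002', 'Alan')")

# The master view casts each version to the common compatible type:
# integers can always be represented as strings, so string wins
cur.execute("""
    CREATE VIEW customer_master AS
    SELECT CAST(cust_id AS TEXT) AS cust_id, cust_first FROM customer_v1
    UNION ALL
    SELECT cust_id, cust_first FROM customer_v2
""")

for row in cur.execute("SELECT * FROM customer_master ORDER BY cust_id"):
    print(row)
```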
Data Lake Storage Pattern
When you select a data lake as a destination, your data will be organized in a specific manner. Specifically, the hierarchy of the structure follows this pattern:
At the top level is a parquet directory. Within the parquet directory are your data feeds. To preserve the lineage of data feeds, we version each feed into a new directory as the upstream source changes. These might be changes in schemas, data types, or other changes that trigger a new version.
Within each data feed, everything is date partitioned to optimize for common queries using date and a primary dimension:
Lastly, within a date partition is one or more Google Snappy compressed parquet objects:
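Putting the pattern together, a feed named customer might be laid out roughly as follows. This is a hypothetical sketch: the bucket name, partition key, and object names are illustrative, not the exact paths Openbridge produces.

```
<your-bucket>/
  parquet/                        # top level
    customer_v1/                  # feed, version 1
      dt=2021-06-01/              # date partition (illustrative key name)
        part-00000.snappy.parquet
    customer_v2/                  # new version after a schema change
      dt=2021-06-02/
        part-00000.snappy.parquet
```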
As part of this process, we also append metadata to each table and view. Read about that here:
If you have any questions about any of these scenarios or are experiencing an issue not addressed here, please reach out to Openbridge Support (email@example.com).