One of our foundational tasks is managing the risk associated with changes in upstream data from source systems. For example, what should happen when Google or Facebook adds or removes columns in a data feed? Should the pipeline for that data fail gracefully, or should it adapt in real time to the changes?

Ensuring the continuous operation of data pipelines requires a data catalog to keep track of these changes, especially with a data lake. The Openbridge system detects changes from various data sources and adjusts to them dynamically, creating a robust data catalog.

Here is an example data catalog in BigQuery:

For more details on how this process works, continue reading.

Dealing With Change Via Table Versioning

When data for a particular source first appears within Openbridge, that data is used to define the schema (blueprint) for the data warehouse "source" table. This schema definition includes field names, field order, and field data types. These source tables are generated following a standardized naming convention such as <sourceA>_v1, <sourceA>_v2, <sourceA>_v3.
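In BigQuery, for example, you can list every version of a source table with a metadata query like this (a sketch; my_dataset is a placeholder for your actual dataset name):

```sql
-- List all versioned tables for the "customer" source, oldest version first.
-- "my_dataset" is a placeholder; substitute your own dataset name.
SELECT table_name
FROM my_dataset.INFORMATION_SCHEMA.TABLES
WHERE STARTS_WITH(table_name, 'customer_v')
ORDER BY table_name;
```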

For example, let's assume we have source data called ‘customer’. This data contains four columns (cust_id, cust_first, cust_last, cust_zip) and results in a table named customer_v1 in your data warehouse. It will look like this:

customer_v1
cust_id | cust_first | cust_last | cust_zip
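In DDL terms, the blueprint for this first version might resemble the following (a sketch; the exact field types the trainer infers are assumptions):

```sql
-- Hypothetical blueprint for the first version of the "customer" source table.
CREATE TABLE customer_v1 (
  cust_id INT64,     -- assumed integer (see Scenario 4)
  cust_first STRING,
  cust_last STRING,
  cust_zip STRING    -- assumed string; ZIP codes can have leading zeros
);
```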

What happens if there are changes to the original data schema? Typically, schema changes fall into one of the following scenarios:

  1. A field is added
  2. A field is deleted
  3. A field name is changed 
  4. A field type is changed

The automated treatment of data for each of these scenarios is described in detail below.

Scenario 1: A field is added

Let’s continue to use the sample use case for illustrative purposes. If at some point data is posted to the ‘customer’ pipeline that includes an additional field (e.g., cust_age), the system will recognize that the new data schema does not match the existing blueprint for the ‘customer’ table. It will automatically create a new version of that table with a _v2 suffix (e.g., customer_v2) with the new field layout:

customer_v2
cust_id | cust_first | cust_last | cust_zip | cust_age

Source Tables and Views

To mitigate potential issues associated with using multiple versioned tables, the system also automatically generates a ‘master’ view with a suffix of _master that contains the superset of all fields from each table version. In our example, a view named customer_master would be generated that looks like this:

customer_master
cust_id | cust_first | cust_last | cust_zip | cust_age
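Conceptually, a master view like this selects the superset of fields from every table version, filling in NULL for any field a version lacks. Here is a sketch of what such generated SQL might look like (the actual system-generated SQL may differ, and the cust_age type is an assumption):

```sql
CREATE OR REPLACE VIEW customer_master AS
SELECT cust_id, cust_first, cust_last, cust_zip,
       CAST(NULL AS INT64) AS cust_age  -- customer_v1 predates cust_age
FROM customer_v1
UNION ALL
SELECT cust_id, cust_first, cust_last, cust_zip, cust_age
FROM customer_v2;
```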

Scenario 2: A field is deleted

In this scenario, let’s assume the delivery of the two data feeds in the example above was reversed. The data delivered to the ‘customer’ pipeline has five fields (cust_id, cust_first, cust_last, cust_zip, cust_age). In this case, the customer_v1 table will look like this:

customer_v1
cust_id | cust_first | cust_last | cust_zip | cust_age

At some point, data in the ‘customer’ pipeline no longer has the cust_age field. As with the previous scenario, the system will recognize the different schema layout and generate a _v2 version of the customer table with the new layout.

customer_v2
cust_id | cust_first | cust_last | cust_zip

Once again, the system will automatically generate a master view that includes the superset of fields from each table, which will look just like the master view in the previous example:

customer_master
cust_id | cust_first | cust_last | cust_zip | cust_age

Scenario 3: A field is renamed

There may be a use case where the name of one of the fields in your source data changes. Since the system cannot know whether a field was deleted and another added, or whether an existing field was renamed, this scenario also results in a new table version. Let’s continue with our sample data to illustrate.

The data posted to the ‘customer’ pipeline has five fields (cust_id, cust_first, cust_last, cust_zip, cust_age) and generates the customer_v1 table:

customer_v1
cust_id | cust_first | cust_last | cust_zip | cust_age

Subsequent data in the customer pipeline includes the same number of fields, but one of the fields has been renamed from cust_zip to cust_zipcode. This change will generate a _v2 table that looks like this:

customer_v2
cust_id | cust_first | cust_last | cust_zipcode | cust_age

As in previous scenarios, the system will automatically create a master view with the superset of the v1  and v2  tables, which in this case will look like this:

customer_master
cust_id | cust_first | cust_last | cust_zip | cust_age | cust_zipcode
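Because the rename is treated as one field deleted and another added, the master view carries both columns, with NULL wherever a version lacks a field. A sketch of what such a view might look like (field types are assumptions):

```sql
CREATE OR REPLACE VIEW customer_master AS
SELECT cust_id, cust_first, cust_last,
       cust_zip,
       cust_age,
       CAST(NULL AS STRING) AS cust_zipcode  -- v1 predates the rename
FROM customer_v1
UNION ALL
SELECT cust_id, cust_first, cust_last,
       CAST(NULL AS STRING) AS cust_zip,     -- v2 no longer has cust_zip
       cust_age,
       cust_zipcode
FROM customer_v2;
```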

Creating Customized Views For Source Data

If the data in the cust_zip and cust_zipcode fields needs to be consolidated into one field, you have a couple of options. The first is to create a user-generated custom view that consolidates the two fields in question. In our example, the SQL for this custom view would look like this:

CREATE OR REPLACE VIEW customer_master_consolidated
(
  cust_id,
  cust_first,
  cust_last,
  cust_zip
)
AS
SELECT customer_v1.cust_id, customer_v1.cust_first, customer_v1.cust_last, customer_v1.cust_zip
FROM customer_v1
UNION ALL
SELECT customer_v2.cust_id, customer_v2.cust_first, customer_v2.cust_last, customer_v2.cust_zipcode
FROM customer_v2;

The view resulting from this SQL statement would look like this:

customer_master_consolidated
cust_id | cust_first | cust_last | cust_zip
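Alternatively, since the system-generated customer_master view already exposes both cust_zip and cust_zipcode, the same consolidation can be expressed with COALESCE over the master view instead of a UNION of the versioned tables (a sketch):

```sql
CREATE OR REPLACE VIEW customer_master_consolidated AS
SELECT
  cust_id,
  cust_first,
  cust_last,
  COALESCE(cust_zip, cust_zipcode) AS cust_zip  -- whichever column is populated
FROM customer_master;
```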

If custom view generation is not a desirable option, you can also contact Openbridge Support (support@openbridge.com) and a member of our Professional Services team can discuss your requirements.

Scenario 4: The data type for a field changes

When data first arrives for a particular source, the system uses an automated "trainer" to model the data. This process determines the appropriate field type (e.g., string, integer, date, or boolean) for each field, which becomes part of the blueprint (schema) for each table.

If, at some point, the data type for a field changes to one that is not compatible with the blueprint field type, a new table version will be generated.  Let’s use our sample data to illustrate.

Our customer data pipeline has four fields (cust_id, cust_first, cust_last, cust_zip). This results in a table named customer_v1 that looks like this (field types included in parentheses):

customer_v1
cust_id (integer) | cust_first (string) | cust_last (string) | cust_zip (string)

If subsequent data in the customer pipeline contains a new field type (in our example, the cust_id field changes from an integer to a string), a _v2 version of that table will be generated with the new field type:

customer_v2
cust_id (string) | cust_first (string) | cust_last (string) | cust_zip (string)

In this scenario, the master view generated by the system will include data from both table versions, using the field type that is compatible with both. In our example, the cust_id data can be stored as a string for both table versions, so that is the field type used in the master view:

customer_master
cust_id (string) | cust_first (string) | cust_last (string) | cust_zip (string)
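Here the master view can reconcile the two versions by casting the older integer values to strings, since every integer has a valid string representation. A sketch of what such generated SQL might look like (the actual system-generated SQL may differ):

```sql
CREATE OR REPLACE VIEW customer_master AS
SELECT CAST(cust_id AS STRING) AS cust_id,  -- customer_v1 stored cust_id as an integer
       cust_first, cust_last, cust_zip
FROM customer_v1
UNION ALL
SELECT cust_id, cust_first, cust_last, cust_zip
FROM customer_v2;
```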

Metadata

As part of this process, we also append metadata to each table and view. Read about that here:

https://docs.openbridge.com/en/articles/4111506-metadata-appended-to-a-data-pipeline

If you have any questions about any of these scenarios or are experiencing an issue not addressed here, please reach out to Openbridge Support (support@openbridge.com).
