Data Catalog: How We Organize and Manage Data In Your Data Lake or Cloud Warehouse

Dealing with changes in schemas, tables, and system metadata


One of our foundational tasks is managing the risk associated with changes in upstream data from source systems. For example, what should happen when Google or Facebook adds or removes columns in their data feeds? Should the pipeline for that data "fail" gracefully, or should it adapt to the changes?

Since our goal is to ensure the continuous operation of data pipelines, you need a data catalog to keep track of these changes, especially with a data lake. The Openbridge system is designed to detect changes from various data sources and adapt to them dynamically, resulting in a robust data catalog.


Dealing With Change Via Table Versioning

When data for a particular source first appears within Openbridge, that data is used to define the schema (blueprint) for the "source" table in your data lake or warehouse. This schema definition includes field names, field order, and field data types. Source tables are generated following a standardized naming convention such as <sourceA>_v1, <sourceA>_v2, <sourceA>_v3.

For example, let's assume we have source data called ‘customer’. This data contains four columns (cust_id, cust_first, cust_last, cust_zip) and results in a table named customer_v1 in your data lake or warehouse. It will look like this:

customer_v1 (cust_id, cust_first, cust_last, cust_zip)
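
For illustration, the blueprint behind this first table version corresponds to DDL along the following lines. This is a sketch only: Openbridge creates and manages the table automatically, and the field types shown here are assumptions, since the system infers the actual types from your data.

CREATE TABLE customer_v1
(
  cust_id    STRING,  -- types shown are assumptions; actual types are inferred from the data
  cust_first STRING,
  cust_last  STRING,
  cust_zip   STRING   -- STRING is BigQuery syntax; other warehouses use VARCHAR
);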

What happens if there are changes to the original data schema? Typically, schema changes fall into one of the following scenarios:

  1. A field is added

  2. A field is deleted

  3. A field name is changed 

  4. A field type is changed

The automated treatment of data for each of these scenarios is described in detail below.

Scenario 1: A field is added

Let’s continue with the sample use case for illustrative purposes. If at some point data posted to the ‘customer’ pipeline includes an additional field (e.g., cust_age), the system will recognize that the new data schema does not match the existing blueprint for the ‘customer’ table. It will automatically create a new version of that table with a _v2 suffix (e.g., customer_v2) with the new field layout…

customer_v2 (cust_id, cust_first, cust_last, cust_zip, cust_age)

Source Tables and Views

To mitigate potential issues associated with using multiple versioned tables, the system also automatically generates a ‘master’ view with a suffix of _master  that contains the superset of all fields from each table version.  In our example, a view named customer_master would be generated that looks like this…

customer_master (cust_id, cust_first, cust_last, cust_zip, cust_age)
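
To make the mechanics concrete, here is a sketch of the kind of SQL such a master view represents. This is illustrative only; the system generates the actual view automatically, and its SQL may differ. Rows from customer_v1, which lacks cust_age, are padded with NULL:

CREATE OR REPLACE VIEW customer_master AS
SELECT cust_id, cust_first, cust_last, cust_zip,
       NULL AS cust_age  -- some warehouses require CAST(NULL AS <type>) here
  FROM customer_v1
UNION ALL
SELECT cust_id, cust_first, cust_last, cust_zip, cust_age
  FROM customer_v2;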

Scenario 2: A field is deleted

In this scenario, let’s assume the delivery of the two data feeds in the example above was reversed: the data first delivered to the ‘customer’ pipeline has five fields (cust_id, cust_first, cust_last, cust_zip, cust_age). In this case, the customer_v1 table will look like this…

customer_v1 (cust_id, cust_first, cust_last, cust_zip, cust_age)

At some point, data in the ‘customer’ pipeline no longer has the cust_age field. As with the previous scenario, the system will recognize the different schema layouts and generate a _v2 version of the customer table with the new layout.

customer_v2 (cust_id, cust_first, cust_last, cust_zip)

Once again, the system will automatically generate a master view that includes the superset of fields from each table, which will look just like the master view in the previous example…

customer_master (cust_id, cust_first, cust_last, cust_zip, cust_age)

Scenario 3:  A field is renamed

There may be a use case where the name of one of the fields in your source data changes. Since it is impossible for the system to know whether a field was renamed or one field was deleted and another added, this scenario will also result in a new table version. Let’s continue with our sample data to illustrate.

The data posted to the ‘customer’ pipeline has five fields (cust_id, cust_first, cust_last, cust_zip, cust_age ) and generates the customer_v1  table…

customer_v1 (cust_id, cust_first, cust_last, cust_zip, cust_age)

Subsequent data in the customer pipeline includes the same number of fields, but one of the fields has been renamed from cust_zip to cust_zipcode.  This change will generate a _v2  table that looks like this:

customer_v2 (cust_id, cust_first, cust_last, cust_zipcode, cust_age)

As in previous scenarios, the system will automatically create a master view with the superset of the v1  and v2  tables, which in this case will look like this:

customer_master (cust_id, cust_first, cust_last, cust_zip, cust_zipcode, cust_age)

Creating Customized Views For Source Data

If the data in the cust_zip and cust_zipcode fields needs to be consolidated into one field, you have a couple of options. The first option is to create a user-generated custom view that consolidates the two fields in question. In our example, the SQL for this custom view would look like this:

CREATE OR REPLACE VIEW customer_master_consolidated
(
  cust_id,
  cust_first,
  cust_last,
  cust_zip,
  cust_age
)
AS
SELECT cust_id, cust_first, cust_last, cust_zip, cust_age
  FROM customer_v1
UNION ALL
-- cust_zipcode from customer_v2 feeds the view's cust_zip column
SELECT cust_id, cust_first, cust_last, cust_zipcode, cust_age
  FROM customer_v2;

The view resulting from this SQL statement would look like this…

customer_master_consolidated (cust_id, cust_first, cust_last, cust_zip, cust_age)
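
Once the view exists, you can query it like any other table. For example (the filter value below is hypothetical):

SELECT cust_id, cust_first, cust_last, cust_zip
  FROM customer_master_consolidated
 WHERE cust_zip = '90210';  -- quote the literal only if cust_zip was modeled as a string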

If custom view generation is not a desirable option, you can also contact Openbridge Support (support@openbridge.com), and a member of our Professional Services team can discuss your requirements.

Scenario 4: The data type for a field changes

When data first arrives for a particular source, the system uses a "trainer" to model the data. This process determines the appropriate field type (e.g., string, integer, date, boolean), which becomes part of the blueprint (schema) for each table.

If, at some point, the data type for a field changes to one that is not compatible with the blueprint field type, a new table version will be generated.  Let’s use our sample data to illustrate.

Our customer data pipeline has four fields (cust_id, cust_first, cust_last, cust_zip ). This will result in a table named customer_v1 that looks like this (field types included in parens):

customer_v1 (cust_id integer, cust_first, cust_last, cust_zip)

If subsequent data in the customer pipeline contains data with a new field type (in our example, the cust_id field changes from an integer to a string), a _v2 version of that table will be generated with the new field type:

customer_v2 (cust_id string, cust_first, cust_last, cust_zip)

In this scenario, the master view will include data from both table versions, using the field type that is compatible with both. In our example, data for the cust_id field can be stored as a string for both table versions, so that is the field type used in the master view:

customer_master (cust_id string, cust_first, cust_last, cust_zip)
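
As a sketch of how such a view can reconcile the two types (illustrative only; the SQL the system actually generates may differ), the integer cust_id from customer_v1 is cast to a string so it is compatible with customer_v2:

CREATE OR REPLACE VIEW customer_master AS
SELECT CAST(cust_id AS STRING) AS cust_id,  -- STRING is BigQuery syntax; use VARCHAR elsewhere
       cust_first, cust_last, cust_zip
  FROM customer_v1
UNION ALL
SELECT cust_id, cust_first, cust_last, cust_zip
  FROM customer_v2;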

Data Lake Storage Pattern

When you select a data lake as a destination, your data will be organized in a specific manner. For more information, see our article on the data lake storage pattern.

Column Order

Openbridge maintains internal consistency for table and schema column order since source systems offer no guarantees, guidance, or consistency on this topic.

Constantly changing column order can affect ETL code, BI tools, and applications that depend on the specific order of columns. These include queries, views, stored procedures, user-defined functions, and client applications. Openbridge's use of internal consistency mitigates risks when a source system constantly changes column orders in API responses or data exports. Our goal is to ensure some level of schema stability, even if the source system does not.

Column order is normally not an issue for most people. As we have detailed, the best practice is to specify the order in which columns are returned at the application and query level. As a result, you should not rely on SELECT * to return all columns in the order in which they are defined in the table. SELECT * is also a highly inefficient and expensive form of query that should be avoided*. Always specify the columns by name in your queries and applications, in the order in which you would like them to appear.

* See the Beginners Guide For Faster, Cost-Effective SQL Queries for details on running cost-effective queries.
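
For example, using the tables from this article:

-- Avoid: depends on physical column order and scans every column
SELECT * FROM customer_master;

-- Preferred: name the columns explicitly, in the order your application expects
SELECT cust_id, cust_first, cust_last, cust_zip
  FROM customer_master;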

Metadata

As part of this process, we also append metadata to each table and view. You can read more about this in our article on table and view metadata.

If you have any questions about any of these scenarios or are experiencing an issue not addressed here, please reach out to Openbridge Support (support@openbridge.com).
