Data Catalog: How We Organize and Manage Data In Your Data Lake or Cloud Warehouse

One of our foundational tasks is caring for the risk associated with changes in upstream data from source systems. For example, what should happen when Google or Facebook add or remove columns in their data feeds? Should the pipeline for that data "fail" gracefully, or should it adapt to the changes?

Since our goal is to ensure the continuous operation of data pipelines, you need a data catalog to keep track of these changes, especially with a data lake. The Openbridge system is designed to adapt and adjust dynamically to changes it detects from various data sources, resulting in a robust data catalog.

Here is an example data catalog in BigQuery:

Why Create A Snapshot And View Diven Approach To Your Data Catalog?

The approach for table versioning snapshots and master views is designed to ensure flexibility, traceability, and reliability in managing evolving data schemas.

Schema Evolution Management: Table versioning allows for seamless handling of schema changes, such as adding, deleting, or renaming fields. Each version captures the schema at a specific time, preserving historical context and making it easier to adapt to upstream changes without disrupting ongoing processes.
Data Consistency and Stability: By creating new versions for schema changes, we ensure that existing processes and queries remain stable and consistent. This avoids breaking older schemas' dependencies while allowing for new updates and improvements.
Master Views for Unified Access: Master views provide a consolidated view of all versions, allowing users to access the complete dataset without manually managing individual table versions. This approach reduces complexity and provides a single source of truth for data consumers.
Reduced Risk of Data Loss: With table versioning, no data is lost during schema changes. Older versions remain intact, providing a fallback if issues arise with newer versions. This ensures that historical data remains accessible and reliable.
Simplified Data Integration: Data integration processes benefit from versioned tables by allowing downstream systems to adapt incrementally to schema changes. This prevents unexpected disruptions and ensures that data flows smoothly between systems.
Improved Auditability: Table versioning and master views enhance auditability by providing a clear lineage of how the data schema has evolved. This is particularly useful in compliance and regulatory contexts, where it is essential to demonstrate when and how data structures have changed.
Ease of Troubleshooting: When issues arise, having versioned tables allows for easier troubleshooting by comparing different versions and understanding what changes may have introduced the problem. This helps diagnose and resolve issues more efficiently.

How Does it Work? Dealing With Change Via Table Versioning Snapshots

When data for a particular source first appears within Openbridge, data is used to define the schema (blueprint) for the data lake or warehouse "source" table. This schema definition includes field names, field order, and field data types. These source tables are generated following a standardized naming convention like <sourceA>_v1 , <sourceA>_v2 , <sourceA>_v3 .

For example, let's assume we have source data called ‘customer’. This data contains 4 columns (cust_id, cust_first, cust_last, cust_zip ). The result of this data is a table named customer_v1 in your data lake or warehouse. It will look like this:

customer_v1

What happens if there are changes to the original data schema? Typically, changes to data schema will fall into one of the following scenarios:

A field is added
A field is deleted
A field name is changed
A field type is changed

The automated data treatment for each of these scenarios is described in detail below.

Scenario 1: A field is added

Let’s continue to use the sample use case for illustrative purposes. If, at some point, data is posted to the ‘customer’ pipeline, it includes an additional field (e.g., cust_age ), the system will recognize that the new data schema does not match the existing blueprint for the ‘customer’ table. This will automatically create a new version of that table with a _v2 suffix (e.g. customer_v2 ) with the new field layout…

customer_v2

Source Tables and Views

To mitigate potential issues associated with using multiple versioned tables, the system also automatically generates a ‘master’ view with a suffix of _master that contains the superset of all fields from each table version. In our example, a view named customer_master would be generated that looks like this…

customer_master

Scenario 2: A field is deleted

In this scenario, let’s assume the delivery of the 2 data feeds in the example above was reversed. In this scenario, the data delivered to the ‘customer’ pipeline has five fields (cust_id, cust_first, cust_last, cust_zip, cust_age ). In this case, the customer_v1 table will look like this…

customer_v1

At some point, data in the ‘customer’ pipeline no longer has the cust_age field. As with the previous scenario, the system will recognize the different schema layouts and generate a _v2 version of the customer table with the new layout.

customer_v2

Once again, the system will automatically generate a master view that includes the superset of fields from each table, which will look just like the master view in the previous example…

customer_master

Scenario 3: A field is renamed

There may be a use case where the name for one of the fields in your source data changes. Since it is impossible for the system to know whether a field has been deleted and another field added or renamed, this scenario will also result in a new table version. Let’s continue with our sample data to illustrate.

The data posted to the ‘customer’ pipeline has five fields (cust_id, cust_first, cust_last, cust_zip, cust_age ) and generates the customer_v1 table…

customer_v1

Subsequent data in the customer pipeline includes the same number of fields, but one of the fields has been renamed from cust_zip to cust_zipcode. This change will generate a _v2 table that looks like this:

customer_v2

As in previous scenarios, the system will automatically create a master view with the superset of the v1 and v2 tables, which in this case will look like this:

customer_master

Creating Customized Views For Source Data

If it is necessary that the data for the cust_zip and cust_zipcode fields are consolidated into one field; you have a couple of options. The first option would be creating a user-generated custom view consolidating the two fields in question. In our example, the SQL for this custom view would look like this:

CREATE OR REPLACE VIEW customer_master_consolidated
(
  cust_id,
  cust_first,
  cust_last,
  cust_zip
)
AS 
SELECT customer_v1.cust_id, customer_v1.cust_first, customer_v1.cust_last,  customer_v1.cust_zip
   FROM customer_v1
UNION ALL 
SELECT customer_v2.cust_id, customer_v2.cust_first, customer_v2.cust_last,  customer_v2.cust_zipcode
   FROM customer_v2;

The view resulting from this SQL statement would look like this…

customer_master_consolidated

If custom view generation is not a desirable option, you can also contact Openbridge Support (support@openbridge.com), and a member of our Professional Services team can discuss your requirements.

Scenario 4: The data type for a field changes

When data first arrives for a particular source, the system uses a "trainer" to model it. This process determines the appropriate field type (e.g., string, integer, date, boolean), which becomes part of the blueprint (schema) for each table.

If a field's data type changes to one that is incompatible with the blueprint field type, a new table version will be generated. Let’s use our sample data to illustrate.

Our customer data pipeline has four fields (cust_id, cust_first, cust_last, cust_zip ). This will result in a table named customer_v1 that looks like this (field types included in parens):

customer_v1

If subsequent data in the customer pipeline contains data with a new field type (in our example, the cust_id field changes from an integer to a string), a _v2 version of that table will be generated with the new field type:

customer_v2

In this scenario, the master view will include data from both table versions and use the field type that is compatible with both versions. In our example, data for the cust_id field can be stored as strings for both table versions, so that is the field type used in the master view:

customer_master

Data Lake Storage Pattern

When you select a data lake as a destination, your data will be organized in a specific manner. For more information, see this article:

https://docs.openbridge.com/en/articles/6004989-data-lake-partitioning

Column Naming

We leverage the column names as provided by the source system and only modify them to align with the column naming requirements for the target data destination.

For example, if the source system has a report column called "Amazon Order ID," this will translate to a column named "amazon_order_id" in the target data destination.

This approach ensures we are not unnecessarily obfuscating, distorting, or changing the columns, facilitating easy mapping. It also allows for better traceability, making it simple to understand the lineage from the table or view column name to its definition upstream in the source.

This consistency helps maintain a clear understanding of how data moves from the source system to the data lake or warehouse.

Additional Benefits

Reduced Maintenance Effort: Keeping column names consistent with the source system reduces the need for documentation and maintenance. Developers and analysts can quickly understand the data without constantly referring to transformation rules or mappings.
Simplified Onboarding: New team members can more easily get up to speed when working with datasets that retain familiar naming conventions from source systems. This reduces the learning curve and increases productivity.
Improved Data Quality: Minimizing changes to column names reduces the risk of introducing errors during transformation, helping ensure the data remains accurate and reliable throughout the pipeline.
Ease of Automation: Automated processes such as ETL jobs benefit from consistent column naming, reducing the complexity of scripts and configurations. This makes automated data integration more straightforward and less error-prone.
Enhanced Collaboration: Consistent naming conventions across the source and target systems facilitate communication when multiple teams or stakeholders access the data. Everyone involved can understand the data more intuitively, leading to better collaboration.
Regulatory Compliance: Retaining original column names can also help in compliance scenarios, where traceability and auditability are crucial. It becomes easier to demonstrate how data flows and transforms from the source to the destination.

Column Order

Openbridge maintains internal consistency for table and schema column order since source systems offer no guarantees, guidance, or consistency.

Constantly changing column order can affect ETL code, BI tools, and applications that depend on the specific order of columns. These include queries, views, stored procedures, user-defined functions, and client applications.

Openbridge's use of internal consistency mitigates risks when a source system constantly changes column orders in API responses or data exports. Our goal is to ensure some schema stability, even if the source system does not.

Column order is normally not an issue for most people. As we have detailed, best practices are to specify the order in which the columns are returned at the application and query levels.

As a result, you should not rely on the use of SELECT * to return all columns in an expected order based on the order in which they are defined in the table. This is also a highly inefficient and expensive form of query that should be avoided*.

Always specify the columns by name in your questions and applications and in the order in which you would like them to appear.

* See the Beginners Guide For Faster, Cost-Effective SQL Queries for details on running cost-effective queries.

Metadata

We also append metadata to each table and view it as part of this process. Read about that here:

https://docs.openbridge.com/en/articles/4111506-metadata-appended-to-a-data-pipeline

If you have any questions about any of these scenarios or have an issue not addressed here, please contact Openbridge Support (support@openbridge.com).