When delivering data to destinations like Azure Data Lake, Databricks, Snowflake, BigQuery, AWS Athena, AWS Redshift, or Redshift Spectrum, we append additional metadata unique to the information resident in a record.
This approach ensures that your tables and views include a series of system-generated fields that provide vital information about the data we collect on your behalf. This offers critical context about each record and simplifies queries and data modeling.
System-Generated Metadata Fields
In addition to the fields retrieved from source files, your tables will include a series of system-generated fields (all prefixed with ob_*
). Each of these fields is described below.
ob_date
Description: The UTC date used in the request to retrieve data from the source system represents the data or "report date."
Applicability: This field is only in non-batch integrations (e.g., Facebook, Amazon Seller Central, Amazon Advertising).
Benefit: Helps identify the specific date for which the data was requested, aiding in time-series analyses.
ob_transaction_id
Description: A system-generated unique ID based on a hash of field values for a given row of data.
Function: Used to prevent or minimize duplicate data from being loaded into a table. The system generates the hash value for each loaded row and compares it to all previously loaded data. The new row is excluded from the load process if a matching record is found.
Note: The de-duplication process applies to cloud warehouses, not data lakes.
Additional Information: For more on data de-duplication, see De-Duplication and Dealing with Updates Over a Long Time Horizon.
ob_file_name
Description: Includes a temporary path and file name from which the data was loaded.
Purpose: Used for quality control to validate that the source outputs and data from a particular source file were successfully loaded to the target destination.
Benefit: Aids in tracking data lineage and troubleshooting data loading issues.
ob_processed_at
Description: Represents the UTC timestamp when data processing started.
Special Cases: For some sources that provide lifetime metrics, this date indicates the date for which lifetime metrics are valid.
Benefit: Helps understand the freshness of the data and coordinates processing timelines.
ob_modified_date
Description: Represents the UTC timestamp when a record was modified.
Applicability: This can define the most recent update for sources that use attribution or lifetime metrics.
Benefit: Crucial for tracking changes over time and managing incremental data updates.
Additional Considerations
Data Consistency Across Systems: These metadata fields ensure consistent data tracking across different systems and platforms, facilitating seamless data integration.
Custom Metadata Options: Depending on your needs, additional custom metadata fields can be included to capture more detailed information.
Storage Implications: Be mindful that adding metadata fields increases the size of your datasets, which may impact storage costs and performance, especially with large volumes of data.
Security and Privacy: Ensure the metadata does not include sensitive information that could pose security risks or violate privacy regulations.
Benefits of Using System-Generated Metadata Fields
Enhanced Data Context: Gain deeper insights into each data record with added metadata.
Simplified Data Modeling: Streamline your data modeling process with standardized metadata fields.
Improved Data Quality: Utilize metadata for quality control to ensure data integrity and consistency.
Efficient De-Duplication: Prevent or minimize duplicate data entries, maintaining accurate datasets.
Easier Auditing and Compliance: Leverage metadata for auditing purposes and to meet compliance requirements.
Better Data Lineage Tracking: Trace your data's origin and processing history with detailed metadata.
Incorporating system-generated metadata fields into your data delivery process offers significant benefits regarding data context, quality, and management. By leveraging these fields, you can enhance your data analytics capabilities, streamline data modeling, and maintain high data integrity across your systems.
Frequently Asked Questions (FAQs)
Why are the system-generated metadata fields prefixed with ob_*
?
The ob_*
prefix stands for "Openbridge," indicating that these fields are generated by the Openbridge system. This naming convention helps differentiate system-generated metadata from your original data fields.
Can I exclude these metadata fields from my tables?
No, these are standard parts of Openbridge's data pipeline processing.
How does ob_transaction_id
help in de-duplication?
The ob_transaction_id
is a unique hash generated from the field values of each row. By comparing this hash to those of existing records, the system can identify and exclude duplicate entries, ensuring data integrity.
Is the de-duplication process automatic, or do I need to configure it?
The de-duplication process is automatic within cloud warehouses. No additional configuration is required on your part.
How can I use ob_processed_at
to monitor data freshness?
The ob_processed_at
timestamp indicates when data processing started for a particular record. By reviewing this field, you can assess how up-to-date your data is and identify any potential delays in data processing.
Are these metadata fields compatible with all data destinations?
Yes, these system-generated metadata fields are designed to be compatible with all supported data destinations, such as Azure Data Lake, Google BigQuery, AWS Athena, AWS Redshift, and Redshift Spectrum.
Will the additional metadata fields affect query performance?
Generally, there is no impact on query performance. However, the added fields can provide more options for indexing and querying, potentially improving performance for specific queries.