Data Destination FAQs
Written by Openbridge Support

What is a data destination?

Openbridge collects, processes, routes, and loads data to a target destination of your choosing. A target is a single data warehouse or data lake that reflects a unique cluster, host, data set, or database. Openbridge allows you to connect to multiple destinations based on your plan. Once a warehouse or data lake is registered within Openbridge, it counts as a destination.

Why do I need a data lake or warehouse?

Maximizing productivity means your data needs to be organized, processed, and loaded into a data lake or warehouse. This data lake or warehouse acts as a central repository of data aggregated from disparate sources. The data lake or warehouse becomes the hub that allows you to plug in your favorite tools so you can query, discover, visualize or model that data.

Do you charge for data lake or cloud warehouse infrastructure?

No, there are no charges from Openbridge for your warehouse. Any charges are directly billed by the warehouse provider (i.e., Amazon, Google, Microsoft, or Snowflake) to you.

What are the costs for a data lake or cloud warehouse?

Every data lake or cloud warehouse has its own pricing model. Pricing varies by usage, which is defined by the compute and storage consumed or provisioned. Depending on your situation and requirements, different price-performance considerations may come into play. For example, if you need to start with a no-cost or low-cost option, Amazon Athena and Google BigQuery charge only according to usage. On-demand usage pricing may provide the essentials to kickstart your efforts. If you have questions, feel free to reach out to us. We can offer tips and best practices on how best to set up a data lake or warehouse based on your needs.

Which data lake or cloud warehouse should I be using?

When it comes to building your data strategy and architecture, it's essential to understand which data lakes or warehouses should be candidates for consideration. Typically, teams ask themselves questions like "How do I install and configure a data warehouse?", "Which data warehouse solution will give me the fastest query times?", or "Which of my analytics tools are supported?"

This article covers the key features and benefits of five widely used data lake and warehouse solutions supported by Openbridge to help you choose the right one: How to Choose a Data Warehouse Solution that Fits Your Needs. If you have questions, feel free to reach out to us.

Do I need to authorize Openbridge access to my data lake or warehouse?

Yes. Amazon Athena, Google BigQuery, Amazon Redshift, and other destinations typically require authorization before we can load your data. You provide us with the correct authorizations so we can connect to these systems properly. The process takes a few minutes to set up in your Openbridge account. You can read about how to set up AWS Athena, Redshift Spectrum, Redshift, and Google BigQuery.
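
For illustration only, here is a minimal sketch of what granting load access can look like on the BigQuery side using the Google Cloud Python client; the project, dataset, and service-account email are placeholders, and the exact permissions you grant should follow the setup guides above.

```python
# A sketch only, assuming the google-cloud-bigquery client library is installed
# and credentials are configured. Project, dataset, and service-account values
# are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")            # placeholder project ID
dataset = client.get_dataset("my-project.analytics")      # placeholder dataset

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="WRITER",
        entity_type="userByEmail",
        entity_id="loader@my-project.iam.gserviceaccount.com",  # placeholder account
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])  # persist the new access entry
```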

What do I do with the data once it is in my target data destination?

The data is yours, so you can do whatever you wish with it! Once it is in your target data destination, you have the freedom to choose your preferred visualization or business intelligence solutions. Most of our customers use tools such as Looker, Mode, Power BI, or Tableau.

We focus on being a simple, cost-effective data pipeline. The goal of our pipelines is to deliver "analytics-ready" data.

Why is my data not available in my target data destination?

Openbridge is only as good as the data sources and destinations we connect. If a source has a delay, outage, or failure, there is a delay in data replication to your destination. Likewise, if your target data destination is unavailable, for example because of failed authorizations or firewalls, because it has been removed, or because it is extremely busy, there will be a delay (or an outright failure) in data replication.

Does Openbridge offer an on-premises solution?

No. Openbridge is a cloud-based, hosted solution.

How does Openbridge treat target data destinations differently?

Openbridge optimizes for each data destination we support. Each target destination has its own data types, naming restrictions and conventions, deduplication behavior, and load process, and we account for those differences.

What queries does Openbridge run on my target data destination?

Other than loading and deduplicating data in your target data destination, we only run queries that verify permissions and create a data catalog for change control and risk mitigation.

What happens if the data lake or warehouse gets disconnected?

Openbridge is architected to prevent data loss or duplication. We buffer data once it's in the pipeline, so if a data warehouse gets disconnected, nothing is lost as long as it's reconnected before the buffer expires. Most customers have a two-week buffer; Enterprise customers can define custom data retention policies and expiration intervals.

Can I add multiple data lake or warehouse destinations to my Openbridge account?

Yes, you can attach multiple destinations to your account. For example, if you wanted to partition your client data into unique BigQuery destinations, one for each client, you can set up multiple Google BigQuery destinations. We also support a hybrid model that attaches different technologies such as Amazon Athena, Redshift, or BigQuery. Ultimately, you choose the destination each data source is delivered to.

Why can't I see any data on my destination?

If you just set up your trial, it could take anywhere from a couple of hours to a couple of days to complete the historical sync depending on the size of your data source. If it has been several days, please submit a support ticket, and we will look into it.

How is data lake or warehouse data encrypted?

Most vendors encrypt data in transit and at rest. In transit, vendors support SSL-enabled connections between your client application and your data destination. At rest, vendors encrypt data using AES-256 or customer-defined methods.
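
As a simple illustration of encryption in transit, the sketch below opens a TLS/SSL-required connection to a Redshift-style destination with psycopg2; the endpoint and credentials are placeholders.

```python
# A sketch only: the endpoint, database, and credentials are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439,
    dbname="analytics",
    user="analyst",
    password="********",
    sslmode="require",  # fail rather than fall back to an unencrypted connection
)
print("SSL in use:", conn.info.ssl_in_use)  # True when the session is encrypted
conn.close()
```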

How do I load data into my Amazon Redshift data warehouse?

You can load data into Amazon Redshift from a range of data sources like Amazon Seller Central, Google Ads, Facebook, YouTube, or from on-premises systems via batch import. Openbridge automates these data pipelines so you can ingest data into your data warehouse cluster code-free.
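
For context, here is a minimal sketch of the kind of manual batch load that Openbridge automates: a Redshift COPY of files staged in Amazon S3. The cluster endpoint, table, bucket, and IAM role are placeholders.

```python
# A sketch only: the cluster endpoint, table, bucket, and IAM role ARN are
# placeholders, and Openbridge performs this kind of load for you automatically.
import psycopg2

COPY_SQL = """
    COPY analytics.orders
    FROM 's3://example-bucket/exports/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-load-role'
    FORMAT AS PARQUET;
"""

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439, dbname="analytics", user="loader", password="********",
)
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute(COPY_SQL)  # Redshift pulls the staged files from S3 in parallel
conn.close()
```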

Does Redshift support querying a data lake?

Yes, Redshift supports querying data in a lake via Redshift Spectrum. Data lakes are the future, and Amazon Redshift Spectrum allows you to query data in your data lake in place, without first loading it into your Redshift cluster. Openbridge provides the automated data cataloging, conversion, and partitioning that keeps those Spectrum queries efficient.
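
As an illustration, here is a minimal sketch of a Spectrum workflow: register an external schema backed by the AWS Glue Data Catalog, then query S3-resident tables in place. All of the names and the IAM role ARN are placeholders.

```python
# A sketch only: schema, database, table, role ARN, and cluster endpoint are
# placeholders; your external schema setup may differ.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439, dbname="analytics", user="analyst", password="********",
)
conn.autocommit = True
with conn.cursor() as cur:
    # Register an external schema backed by the AWS Glue Data Catalog.
    cur.execute("""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
        FROM DATA CATALOG DATABASE 'datalake_db'
        IAM_ROLE 'arn:aws:iam::123456789012:role/example-spectrum-role';
    """)
    # Query the S3-resident table in place, without loading it into the cluster.
    cur.execute("""
        SELECT dt, SUM(revenue) AS revenue
        FROM spectrum.orders
        GROUP BY dt
        ORDER BY dt;
    """)
    print(cur.fetchall())
conn.close()
```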

Do I need an expert services engagement for Amazon Redshift?

Typically, expert or consulting services are not needed for Amazon Redshift; most customers are up and running with their Amazon Redshift data quickly. However, there may be situations where you have specific needs that require expert assistance to tailor Amazon Redshift data to fit your requirements. For those cases, we offer expert services.

Ultimately, our mission is to help you get value from data, and this can often happen more quickly with the assistance of our passionate expert services team.

Do you follow best practices for data partitioning in AWS Athena, Azure Data Lake, or Ahana?

Yes! Amazon suggests that partitioning can help reduce the volume of data scanned per query, thereby improving performance and reducing cost. Because partitions act as virtual columns, queries that filter on them scan only the relevant data. When you combine partitioning with columnar data formats like Apache Parquet, you are following best practices.
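
As a simplified illustration of that layout, the sketch below writes a small partitioned Parquet dataset with pyarrow; the bucket and columns are placeholders, and Openbridge handles this partitioning for you automatically.

```python
# A sketch only: the bucket, prefix, and columns are placeholders, and it assumes
# AWS credentials are available to pyarrow's S3 filesystem.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "order_id": [1, 2, 3],
    "revenue":  [19.99, 5.00, 42.50],
    "dt":       ["2023-01-01", "2023-01-01", "2023-01-02"],  # partition key
})

# Produces s3://example-bucket/orders/dt=2023-01-01/... style prefixes, so a query
# that filters on dt only scans the files in the matching partition.
pq.write_to_dataset(
    table,
    root_path="s3://example-bucket/orders",
    partition_cols=["dt"],
)
```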

Do you optimize for Amazon Athena, Azure, Ahana, or Redshift Spectrum data lakes?

Yes! We follow best practices for the file sizes of the objects we partition, split, and compress. Doing so ensures queries run more efficiently, and reading data can be parallelized because blocks of data are read sequentially. This mostly benefits larger files; smaller files (generally less than 128 MB) do not always realize the same performance gains.

Do you support file compression and file splitting in data lakes?

Yes! Amazon suggests that compression and file splitting can significantly speed up Athena queries. Smaller data sizes mean optimized queries and reduced network traffic between Amazon S3 and Athena.

When your data is splittable, Openbridge performs this Athena optimization for you. It allows the execution engine in Athena to parallelize the reading of a file and reduce the amount of data scanned. If a file is not splittable, only a single reader can read it; in practice, this is generally a concern only for smaller files (less than about 128 MB).

Do you support columnar data formats like Apache Parquet?

Yes! Amazon suggests the use of columnar data formats. We have chosen Apache Parquet over other columnar formats. Parquet stores data efficiently with column-wise compression, applying different encoding and compression schemes based on the data type.

Openbridge automatically handles the conversion of data to Parquet format, saving you time and money, primarily when Athena executes queries that are ad hoc in nature. Also, using Parquet-formatted files means reading fewer bytes from Amazon S3, leading to better Athena query performance.
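
As a simplified illustration of that conversion, here is a minimal pyarrow sketch that turns a CSV extract into a snappy-compressed Parquet file; the file names are placeholders.

```python
# A sketch only: the file names are placeholders, and Openbridge performs this
# conversion for you automatically in its pipelines.
import pyarrow.csv as pv
import pyarrow.parquet as pq

table = pv.read_csv("orders.csv")                              # raw source extract
pq.write_table(table, "orders.parquet", compression="snappy")  # columnar, compressed

# Spot-check what was written: row count and number of row groups.
meta = pq.read_metadata("orders.parquet")
print(meta.num_rows, meta.num_row_groups)
```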

Is there a Google Cloud Platform SDK for BigQuery?

Yes, Google provides a collection of client libraries in their SDK package. You can download the command-line tools for Google Cloud Platform products and services here.

We have also bundled the Google SDK with Docker for a "run anywhere" solution. This includes a set of services that can export and import operations for BigQuery. Get the pre-built Docker image with Google Cloud SDK.
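
As an example of the kind of export operation that tooling supports, here is a minimal sketch using the BigQuery Python client to extract a table to Cloud Storage; the project, dataset, table, and bucket names are placeholders.

```python
# A sketch only: the project, dataset, table, and bucket names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

extract_job = client.extract_table(
    "my-project.analytics.orders",                  # source table (placeholder)
    "gs://example-bucket/exports/orders-*.csv.gz",  # sharded, gzipped output files
    job_config=bigquery.ExtractJobConfig(
        destination_format="CSV",
        compression="GZIP",
    ),
)
extract_job.result()  # block until the export job completes
```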

Does BigQuery provide a query cache?

Yes, BigQuery provides a per-user cache at no charge. If your data doesn't change, the results of your queries are automatically cached for 24 hours.
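
For example, with the BigQuery Python client you can leave the cache enabled (it is on by default) and check whether a result was served from it; the project and table names are placeholders.

```python
# A sketch only: the project and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
job_config = bigquery.QueryJobConfig(use_query_cache=True)  # the default behavior

job = client.query(
    "SELECT COUNT(*) AS orders FROM `my-project.analytics.orders`",
    job_config=job_config,
)
rows = list(job.result())
print(rows[0].orders, "cache hit:", job.cache_hit)  # True when served from cache
```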
