Often when people are starting off writing queries with SQL or via tools like Tableau or Power BI, they are often not very well optimized. Writing well-structured, narrow queries accomplish a few goals.
The first is allowing you to ask questions and get results as quickly as possible. Overly broad queries can take a LONG time to complete.
The second goal is reducing costs. Systems like BigQuery, Snowflake, Amazon Redshift Spectrum and Amazon Athena charge you for the compute resources required to execute your queries. Overly broad queries are much more expensive to run.
For example, let us assume you have a table of Amazon orders data from Seller Central. There are about two years of data present which totals 200 GB of data. Here is a simple, unoptimized query;
SELECT * FROM myorders;
Running an unoptimized query like this will take some time to complete. Not only will it be slower, but it is also more expensive. For example, Amazon Athena charges $5 per TB of data scanned. Running this query in Athena costs you about $1. If you are running queries like this all the time, you are spending unnecessary time waiting for queries to complete and increasing the costs from Google, AWS, and Snowflake.
How can we improve on this?
Let’s say you wanted to find orders after a specific date. If you were to run a query to search the same data but with a data constraint, the results will be returned faster.
SELECT * FROM myorders WHERE order_date > '2021-07-01'
An important caveat is that while you improve the response time of the query by using a date, you still may be charged $1. Why? Since you are using
SELECT *, you still are scanning all 200GB of data. Systems that charge for compute services charge based on the amount of data scanned. If
myorders has 50 columns, using
* will ensure all 50 columns are included in the query which means you still scanned 200GB of data.
Rather than query all 50 columns, you can narrow the scope of your query. By specifying columns names, you are improving response time and lowering costs.
SELECT order_amount, order_id FROM myorders;
This query will be significantly faster and cost less. In our Amazon Athena use case, this query will have only scanned about 100MB of data. The cost of this query will be about 0.01. Big price between difference between $1 and $.01 simply by specifying the column names of interest.
To further optimize your query and cost performance, you can combine named columns with a date:
SELECT order_amount, order_id FROM myorders WHERE order_date > '2021-07-01'
While we used SQL queries as examples, creating well-composed queries is true for tools like Tableau, Power BI, Data Studio, and hundreds of others. Behind the scenes, these tools are issuing SQL queries. The more optimized your queries in your, the greater your performance will be.
There may be cases where doing
SELECT * is needed. However, in cases where you can choose to narrow the scope of a query, you will reap the benefits in both time and money.