Objective
Across many clients we see that most dashboard queries cover a date range of 12 months or less from the current date. This motivates splitting large tables that span multiple years into two tables, one holding historical data and one holding more current data, in order to reduce the amount of data that must be read and substantially improve performance. Older date partitions can also be moved into alternative storage such as Redshift Spectrum, reducing overall database size and cost. The purpose of this write-up is to provide a blueprint outlining the steps needed to implement this architecture.
Architecture Overview
Smart partitioning relies on three key components:
- Splitting the large table in two: an archive table holding older, seldom-queried data and a current table holding the newer data that dashboards read under their most commonly used filters.
- A database optimizer that eliminates a table scan when a provably false condition in the query predicate makes the whole WHERE clause false. More on this below.
- Liquid variables in LookML that allow us to dynamically shape the SQL.
Database Considerations
We have successfully tested this approach on Redshift and MemSQL, but suspect it applies to most optimizers. Databases such as Postgres that support partitioning natively have date-range optimization built into their planners to limit the amount of data scanned, but the administrative expense of maintaining partitions is not always convenient. We will show below in the Query Structure section how to test whether your database supports this architecture.
Table Structure
For discussion purposes, we’ll refer to an example table, orders, which looks like this:
create table orders (
id bigint, -- primary key
start_time datetime,
end_time datetime,
content_event varchar(30)
)
We want to split the orders table into two tables by a date range. The start_time field works well for that. Though there are other date/timestamp fields in the table (end_time), they are not used as much in analytics queries, and queries that don’t use start_time will query all partitions.
We also want to place the majority of the data into an archive table, orders_archive. To do this, we'll archive data daily from the current table to the archive table. This can easily be accomplished with a simple SQL script running under cron or a more advanced workflow tool. For our scenario, the script splits on the current date minus 365 days. We could just as well use the current date minus 1 year; it doesn't particularly matter as long as the script and what we code in Looker use the same cutoff.
To illustrate, we create the partitioned tables:
create table orders_archive as
select * from orders where start_time < adddate(current_date, interval -365 day);
create table orders_current as
select * from orders where start_time >= adddate(current_date, interval -365 day);
We then replace the original table with a view. The UNION ALL keeps the view performant, whereas a plain UNION would spend query time deduplicating rows. Note that the ETL that populates this table will now need to write to orders_current.
drop table orders;
create view orders as
select * from orders_current
union all
select * from orders_archive;
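We can sanity-check that the view is equivalent to the original table. A quick sketch using SQLite via Python (the view definition is the same UNION ALL shown above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table orders_current (id integer primary key, start_time text);
    create table orders_archive (id integer primary key, start_time text);
    insert into orders_current values (2, '2020-09-10'), (3, '2020-11-01');
    insert into orders_archive values (1, '2015-06-01');

    -- the view stands in for the original orders table
    create view orders as
        select * from orders_current
        union all
        select * from orders_archive;
""")

# Every row from both partitions comes back through the view.
rows = conn.execute("select id from orders order by id").fetchall()
print(rows)  # [(1,), (2,), (3,)]
```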
Query Structure
The strategy we're leveraging relies on an observation we made: when a false condition in a query predicate makes the whole WHERE clause false, the optimizer doesn't actually read any data from the table.
We can illustrate this in MemSQL.
-- run a basic query
memsql> profile select count(*) from orders_current where start_time > '2020-09-01';
count(*)
4020
memsql> show profile;
. . .
TableScan orders_current, PRIMARY KEY (id) est_table_rows:7,184 est_filtered:3,592 actual_rows: 7,167 exec_time: 1ms start_time: [00:00:00.007, 00:00:00.011]
-- now run the same query with a false statement that will make the whole WHERE clause false
memsql> profile select count(*) from orders_current where start_time > '2020-09-01' and 1=2;
count(*)
0
memsql> show profile;
. . .
TableScan orders_current, PRIMARY KEY (id) est_table_rows:7,184 est_filtered:3,592 actual_rows: 0 exec_time: 0ms
-- the profile shows no data was actually read (actual_rows: 0, exec_time: 0ms)
From this, we build our query:
select * from orders_current
where start_time >= '2020-09-01'
union all
select * from orders_archive
where '2020-09-01' < adddate(current_date, interval -365 day)
and start_time > '2020-09-01'
Remember, the goal is to query the older data only when we have to. Since the current table is smaller, it is more performant, and it does no harm if it is queried when not strictly needed. We looked at eliminating reads from the current table in the same way, but determined that it was not technically viable.
Also note that relying on a fixed time range is not ideal, since the daily archiving could fall behind. However, we observed in testing that using a subquery to determine the archive date, i.e.
where '2020-09-01' < (select max(start_time) from orders_archive)
caused the table to be scanned.
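The shape of the query can be verified with a small Python/SQLite sketch. SQLite won't demonstrate the scan elimination itself (that is what the MemSQL profile above shows), but we can confirm that the union query with the extra literal predicate returns the correct rows whether or not the filter reaches into the archive. Here CUTOFF is a fixed literal standing in for adddate(current_date, interval -365 day), so the example is deterministic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table orders_current (id integer primary key, start_time text);
    create table orders_archive (id integer primary key, start_time text);
    insert into orders_current values (3, '2020-09-10'), (4, '2020-11-01');
    insert into orders_archive values (1, '2015-06-01'), (2, '2019-03-01');
""")

# Stands in for adddate(current_date, interval -365 day).
CUTOFF = "2020-01-01"

def partitioned_query(filter_start):
    # The archive branch carries the extra literal-vs-literal predicate; when
    # filter_start is on or after the cutoff, that predicate is false and (on
    # Redshift/MemSQL) the optimizer skips reading the archive table entirely.
    sql = """
        select id from orders_current where start_time >= :d
        union all
        select id from orders_archive
        where :d < :cutoff and start_time >= :d
        order by id
    """
    return [r[0] for r in conn.execute(sql, {"d": filter_start, "cutoff": CUTOFF})]

print(partitioned_query("2020-09-01"))  # archive branch pruned -> [3, 4]
print(partitioned_query("2016-01-01"))  # archive branch included -> [2, 3, 4]
```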
Looker Structure
In Looker, there are three steps needed to implement this:
1. Change the view to use the union along with liquid variables and add a date_time filter:
view: orders {
view_label: "Order Data"
derived_table: {
sql:
select * from orders_current
where {% condition orders.start_time_filter %} start_time {% endcondition %}
union all
select * from orders_archive
where coalesce({% date_start orders.start_time_filter %},
adddate(current_date, interval -365 day))
<= adddate(current_date, interval -365 day)
and {% condition orders.start_time_filter %} start_time {% endcondition %}
;;
}
filter: start_time_filter {
type: date_time
}
}
2. Add the time filter in the dashboards:
listen:
time_filter: orders.start_time_filter
3. Add the filter to all explore joins on the view. For example:
explore: content {
extends: [content_base]
join: orders {
sql_on: ${content.event} = ${orders.content_event} ;;
type: left_outer
relationship: one_to_many
fields: [start_time_filter]
}
}
By: Mike Taluc – Bytecode IO’s Senior Engineer and Architect