I have large historical data and I want to group by it month. I am preparing below query on HP Vertica database to get my data grouped by Month.
SELECT
Region, Country,
TO_TIMESTAMP(TO_CHAR(Order_Date,'m/yyyy'),'m/yyyy') as MONTH_Order_Date,
SUM(CAST(Unit_Price as NUMERIC(37,15))) as Sum_Unit_Price,
SUM(CAST(Total_Revenue as NUMERIC(37,15))) as Sum_Total_Revenue
FROM SalesRecords
GROUP BY Region,Country,TO_TIMESTAMP(TO_CHAR(Order_Date,'m/yyyy'),'m/yyyy')
ORDER BY Region ASC,Country ASC,TO_TIMESTAMP(TO_CHAR(Order_Date,'m/yyyy'),'m/yyyy') ASC
Now problem with this query is performance, over 1 billion data it is taking almost 2 minute and return record set after group by is 0.1 million rows. I need data in below shown format and due to that I need to parse date at DB level to get formatted date and this formatting taking time on DB server.
Please suggest me any other way which I can use because on same table if I am grouping data on some other column except the date type column performance is good, returned record set almost in 40 second over 1 billion record.
I need to do this performance for all major RDBMS.
Region Country MONTH_Order_Date Sum_Unit_Price Sum_Total_Revenue
Asia Bangladesh 2010-01-01 00:00:00 186252.349999999999937 920607119.140000001169023
Asia Bangladesh 2011-01-01 00:00:00 186456.190000000000641 931633189.440000000452752
Asia Bangladesh 2012-01-01 00:00:00 194925.550000000000312 969718040.969999998663069
Asia Bangladesh 2013-01-01 00:00:00 194048.289999999999566 1017253078.219999998891571
Asia Bangladesh 2014-01-01 00:00:00 184143.090000000000413 915867255.449999996567606
Asia Bangladesh 2015-01-01 00:00:00 193697.769999999999864 959097995.869999999053554
Asia Bangladesh 2016-01-01 00:00:00 184833.730000000000529 955360230.500000001682568
Asia Bangladesh 2017-01-01 00:00:00 111476.840000000000014 563824376.189999998257226
Asia Bhutan 2010-01-01 00:00:00 186506.900000000000474 916963415.479999997623498
Vertica run-length encoding (RLE) allows you to leverage low-cardinality columns for improved query performance and disk space savings, and converting a very granular data type like timestamp to mmyyyy might be perfect for the job.
1) Put the mmyyyy conversion into your data load process
2) create a new projection with a new computed column called month_year and formatted mmyyyy.
3) In the projection column list and in the projection ORDER BY, put this and other low-card columns at the beginning of the list. High-card columns go towards the end of the list. But any high-card join columns (like the key) should be immediately after the low-card columns.
4) Be sure to specify ENCODING_RLE on this and all other low-card columns.
These two links to Vertica docs explain the concepts involved:
Choosing Sort Order: Best Practices
Optimizing Queries for Projections with Predicates
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With