I am looking to load ~10GB of data into memory and perform SQL queries on it.
What might be a good choice for performance? Some solutions I've come across that could possibly work are TimesTen, ExtremeDB, and SQL In-memory, or even dataframes such as Vaex or Cudf.
I am looking to optimize query times -- that is really all I care about. For a conceptual example, think about something like an Excel table where a user can sort or filter any column (and the application does not know ahead of time which columns to 'index', because all columns may be used).
Update: I'm posting my benchmarks from pandas below. I know pandas isn't ideal for this, but it's great to prototype and get benchmarking figures:
File (20M rows, 1.2GB): https://storage.googleapis.com/gcp-files/Sales20M.csv.
- pd.read_csv: 10.7s
- df.groupby('currency_code_id').count(): 3.3s
- df.sort_values('price'): 6.8s
- df.pivot_table(index='code', columns='territory_id', values='id', aggfunc=len, fill_value=0): 3.4s

If using a database, please do not create indexes, since the use case is that we do not know which columns will be used beforehand. (Alternatively, I suppose you could create an index on every field -- but if so, please include the creation of all those indexes in the load time.)
Which tool would be the best for this?
I guess you want to materialize a random data file and perform sub-second queries over it, and you are ready to pay the price (as in-memory features are usually enterprise).
For SQL Server, for example, there are many options: features that can achieve gains of up to 10 times the query performance, or 5 to 20 times faster performance, or just using partitioning; the same goes for PostgreSQL or MongoDB shards. There are so many examples and demonstrations of such technologies showing sub-second performance ... but it depends on the case, because there are limitations.
In your case, having 10 GB of data and wanting good performance, you are not required to do anything special. Just analyze and normalize the data prior to the insert and create the corresponding indexes.
Well begun is half done, and spending some time to get the data written in the right way will give you the performance you need.
For example:
- Insert the data file into a table.
- For each column in the table, calculate count distinct.
- If count distinct / count is smaller than some threshold X, create a separate table with columns id and value (a sketch of this step follows the list).
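As a sketch of that last step, assume a hypothetical staging table dbo.SalesRaw with a low-cardinality text column city; these names are illustrative only, not from the question:

-- Illustrative only: dbo.SalesRaw and its city column are assumed, not taken from the post.
-- Build a lookup table of the distinct text values ...
CREATE TABLE dbo.City
(
    id INT IDENTITY(1, 1) PRIMARY KEY
   ,value NVARCHAR(100) NOT NULL UNIQUE
);

INSERT INTO dbo.City (value)
SELECT DISTINCT city
FROM dbo.SalesRaw;

-- ... and keep only the small integer id in the main table.
SELECT R.id
      ,C.id AS city_id
      ,R.price
INTO dbo.Sales
FROM dbo.SalesRaw R
JOIN dbo.City C
    ON C.value = R.city;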
Reducing the size of the table will reduce the number of I/O operations, and searching and grouping by numbers is faster than doing so by text.
Of course, you need to change the application: instead of searching by some city name, you will filter by its id, and after counting cities per country id, you will perform a second query to transform these ids to names.
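A sketch of that two-step pattern, again with hypothetical lookup tables dbo.City(id, name, country_id) and dbo.Country(id, name):

-- Hypothetical tables; the point is that the heavy query touches only integers.
-- 1) Count cities per country id.
SELECT country_id
      ,COUNT(*) AS city_count
FROM dbo.City
GROUP BY country_id;

-- 2) A second, tiny query translates the returned ids to names for display.
SELECT id
      ,name
FROM dbo.Country
WHERE id IN (SELECT DISTINCT country_id FROM dbo.City);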
I feel that applying some fundamental principles will serve your case better than using some high-level tech with a high price and limitations that can become critical in the future, when new requirements come to the application.
The test is on a virtual machine with 8 GB RAM and 4 virtual processors. Unfortunately, it is on an HDD, so the I/O is pretty bad. It is running SQL Server 2019 Standard edition. So, because of the hardware, the data load is slow (with bcp it would be better, but ... HDD after all).

So, the table looks like:
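What follows is a plausible reconstruction of it, with column names and types inferred from the queries below; the original definition may differ:

-- Hypothetical reconstruction; the actual types in the original test may differ.
CREATE TABLE dbo.Sales20M
(
    id INT
   ,code VARCHAR(64)
   ,currency_code_id INT
   ,territory_id CHAR(2)
   ,price_in_usd DECIMAL(18, 2)
);

-- An index on each field, as mentioned next:
CREATE INDEX IX_Sales20M_id ON dbo.Sales20M (id);
CREATE INDEX IX_Sales20M_code ON dbo.Sales20M (code);
CREATE INDEX IX_Sales20M_currency_code_id ON dbo.Sales20M (currency_code_id);
CREATE INDEX IX_Sales20M_territory_id ON dbo.Sales20M (territory_id);
CREATE INDEX IX_Sales20M_price_in_usd ON dbo.Sales20M (price_in_usd);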
You can see how lazy I am: no normalization, and an index on each field, leading to 3.2 GB allocated for the data and indexes:
exec sp_spaceused 'dbo.Sales20M'
But here are some results:
select count(*) , currency_code_id from dbo.Sales20M group by currency_code_id
Of course, 0 seconds, as we use the index.
select TOP 10 * from dbo.Sales20M order by PRICE_IN_USD desc;
0 seconds, but note that I am using TOP; basically, when you are sorting, you only need to display part of the rows, right? And if I sort all the rows for some reason:
select *
INTO #X
from dbo.Sales20M order by PRICE_IN_USD desc;
it runs for 2 seconds (I am inserting the data into a table because the rendering takes time, too).
As to the PIVOT, it is not very fast in SQL Server, but you can use R if you need something massive. I do not fully understand yours, but I made a PIVOT by Code and territory_id, calculating the average price in USD:
SELECT *
FROM
(
    SELECT Code
          ,territory_id
          ,AVG(price_in_usd) price_in_usd
    FROM dbo.Sales20M
    GROUP BY Code
            ,territory_id
) DS
PIVOT
(
MAX(price_in_usd) FOR territory_id IN ([AE], [AG], [AI], [AL], [AM], [AO], [AR], [AT], [AU], [AW], [AZ], [BA], [BE], [BF], [BG], [BH], [BJ], [BM], [BN], [BO], [BR], [BS], [BW], [BY], [BZ], [CA], [CH], [CI], [CL], [CO], [CR], [CV], [CY], [CZ], [DE], [DK], [DM], [DO], [EC], [EE], [EG], [ES], [FI], [FJ], [FM], [FR], [GA], [GB], [GD], [GH], [GM], [GR], [GT], [GW], [HK], [HN], [HR], [HT], [HU], [ID], [IE], [IL], [IN], [IS], [IT], [JM], [JO], [JP], [KG], [KH], [KN], [KR], [KW], [KY], [KZ], [LA], [LB], [LK], [LT], [LU], [LV], [MD], [MK], [ML], [MN], [MO], [MT], [MU], [MX], [MY], [MZ], [NA], [NE], [NI], [NL], [NO], [NP], [NZ], [OM], [PA], [PE], [PG], [PH], [PL], [PT], [PY], [QA], [RU], [RW], [SA], [SE], [SG], [SI], [SK], [SN], [SV], [SZ], [TG], [TH], [TJ], [TM], [TR], [TT], [TW], [TZ], [UA], [UG], [US], [UY], [UZ], [VE], [VG], [VN], [ZA], [ZM], [ZW])
) PVT;
I am lazy again and not using dynamic PIVOT. It takes 0-1 seconds.
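If the set of territory_id values is not known in advance, a dynamic PIVOT can build the column list at run time. A minimal sketch, assuming SQL Server 2017+ for STRING_AGG:

-- Sketch of a dynamic PIVOT (not part of the original test); assumes SQL Server 2017+.
DECLARE @cols NVARCHAR(MAX), @sql NVARCHAR(MAX);

-- Build the [AE],[AG],... column list from the data itself.
SELECT @cols = STRING_AGG(QUOTENAME(territory_id), ',') WITHIN GROUP (ORDER BY territory_id)
FROM (SELECT DISTINCT territory_id FROM dbo.Sales20M) T;

SET @sql = N'
SELECT *
FROM
(
    SELECT Code, territory_id, AVG(price_in_usd) price_in_usd
    FROM dbo.Sales20M
    GROUP BY Code, territory_id
) DS
PIVOT
(
    MAX(price_in_usd) FOR territory_id IN (' + @cols + N')
) PVT;';

EXEC sys.sp_executesql @sql;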
Conclusion:
My point is that even though my setup is bad and I am being super lazy, not spending time to normalize the data and create proper indexes, I am still getting close to 0-second results. You can simply start with something free like PostgreSQL, and I believe you will get good results. Of course, the "fancy" stuff is always there if you need it in order to optimize a particular use case.