I am looking to load ~10GB of data into memory and perform SQL queries on it.
What might be a good choice for performance? Some solutions I've come across that could possibly work are TimesTen, ExtremeDB, and SQL In-memory, or even dataframes such as Vaex or Cudf.
I am looking to optimize query times -- that is really all I care about. For a conceptual example, think about something like an Excel table where a user can sort or filter any column (and the application does not know ahead of time which columns to 'index', because all columns may be used).
Update: I'm posting my benchmarks from pandas below. I know pandas isn't ideal for this, but it's great to prototype and get benchmarking figures:
File (20M rows, 1.2GB): https://storage.googleapis.com/gcp-files/Sales20M.csv.
- pd.read_csv: 10.7s
- df.groupby('currency_code_id').count(): 3.3s
- df.sort_values('price'): 6.8s
- df.pivot_table(index='code', columns='territory_id', values='id', aggfunc=len, fill_value=0): 3.4s

If using a database, please do not create indexes, since the use case is that we do not know which columns will be used beforehand. (Alternatively, I suppose you could create an index on every field -- but if so, please include the creation of all those indexes in the load time.)
Which tool would be the best for this?
I guess you want to materialize a random data file and perform sub-second queries over it, and you are ready to pay the price (as in-memory features are usually enterprise).
For SQL Server, for example, there are many options: features that can achieve gains of up to 10 times the query performance, or 5 to 20 times faster performance, or just using partitioning; the same goes for PostgreSQL or MongoDB shards. There are so many examples and demonstrations of such technologies showing sub-second performance ... but it depends on the case, because there are limitations.
In your case, having 10 GB of data and wanting good performance, you are not required to do anything special. Just analyze and normalize the data prior to the insert and create the corresponding indexes.
Well begun is half done, and spending some time to get the data written in the right way will give you the performance you need.
For example:
- Insert the data file into a table.
- For each column in the table, calculate count distinct.
- If count distinct / count is smaller than some threshold X, create a separate table with columns id and value (a sketch of this step follows the list).
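As a sketch of that last step, assume a hypothetical staging table dbo.SalesRaw with a low-cardinality text column city; these names are illustrative only, not from the question:

-- Illustrative only: dbo.SalesRaw and its city column are assumed, not taken from the post.
-- Build a lookup table of the distinct text values ...
CREATE TABLE dbo.City
(
    id INT IDENTITY(1, 1) PRIMARY KEY
   ,value NVARCHAR(100) NOT NULL UNIQUE
);

INSERT INTO dbo.City (value)
SELECT DISTINCT city
FROM dbo.SalesRaw;

-- ... and keep only the small integer id in the main table.
SELECT R.id
      ,C.id AS city_id
      ,R.price
INTO dbo.Sales
FROM dbo.SalesRaw R
JOIN dbo.City C
    ON C.value = R.city;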
Reducing the size of the table will reduce the number of I/O operations, and searching and grouping by numbers is faster than doing so by text.
Of course, you need to change the application: instead of searching by some city name, you will filter by its id, and after counting cities per country id, you will perform a second query to transform these ids to names.
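A sketch of that two-step pattern, again with hypothetical lookup tables dbo.City(id, name, country_id) and dbo.Country(id, name):

-- Hypothetical tables; the point is that the heavy query touches only integers.
-- 1) Count cities per country id.
SELECT country_id
      ,COUNT(*) AS city_count
FROM dbo.City
GROUP BY country_id;

-- 2) A second, tiny query translates the returned ids to names for display.
SELECT id
      ,name
FROM dbo.Country
WHERE id IN (SELECT DISTINCT country_id FROM dbo.City);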
I feel that applying some fundamental principles will serve your case better than using some high-level tech with a high price and limitations that can become critical in the future, when new requirements come to the application.
The test is on a virtual machine with 8 GB RAM and 4 virtual processors. Unfortunately, it is on an HDD, so the I/O is pretty bad. It is running SQL Server 2019 Standard edition. So, because of the hardware, the data load is slow (with bcp it would be better, but ... HDD after all).

So, the table looks like:
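What follows is a plausible reconstruction of it, with column names and types inferred from the queries below; the original definition may differ:

-- Hypothetical reconstruction; the actual types in the original test may differ.
CREATE TABLE dbo.Sales20M
(
    id INT
   ,code VARCHAR(64)
   ,currency_code_id INT
   ,territory_id CHAR(2)
   ,price_in_usd DECIMAL(18, 2)
);

-- An index on each field, as mentioned next:
CREATE INDEX IX_Sales20M_id ON dbo.Sales20M (id);
CREATE INDEX IX_Sales20M_code ON dbo.Sales20M (code);
CREATE INDEX IX_Sales20M_currency_code_id ON dbo.Sales20M (currency_code_id);
CREATE INDEX IX_Sales20M_territory_id ON dbo.Sales20M (territory_id);
CREATE INDEX IX_Sales20M_price_in_usd ON dbo.Sales20M (price_in_usd);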
You can see how lazy I am: no normalization, and an index on each field, leading to 3.2 GB allocated for the data and indexes:
exec sp_spaceused 'dbo.Sales20M'
But here are some results:
select count(*) , currency_code_id from dbo.Sales20M group by currency_code_id
Of course, 0 seconds, as we use the index.
select TOP 10 * from dbo.Sales20M order by PRICE_IN_USD desc;
0 seconds, but note that I am using TOP; basically, when you are sorting, you only need to display part of the rows, right? And if I sort all the rows for some reason:
select *
INTO #X
from dbo.Sales20M order by PRICE_IN_USD desc;
it runs for 2 seconds (I am inserting the data into a table because the rendering takes time, too).
As to the PIVOT, it is not very fast in SQL Server, but you can use R if you need something massive. I do not fully understand yours, but I made a PIVOT by Code and territory_id, calculating the average price in USD:
SELECT *
FROM
(
    SELECT Code
          ,territory_id
          ,AVG(price_in_usd) price_in_usd
    FROM dbo.Sales20M
    GROUP BY Code
            ,territory_id
) DS
PIVOT
(
MAX(price_in_usd) FOR territory_id IN ([AE], [AG], [AI], [AL], [AM], [AO], [AR], [AT], [AU], [AW], [AZ], [BA], [BE], [BF], [BG], [BH], [BJ], [BM], [BN], [BO], [BR], [BS], [BW], [BY], [BZ], [CA], [CH], [CI], [CL], [CO], [CR], [CV], [CY], [CZ], [DE], [DK], [DM], [DO], [EC], [EE], [EG], [ES], [FI], [FJ], [FM], [FR], [GA], [GB], [GD], [GH], [GM], [GR], [GT], [GW], [HK], [HN], [HR], [HT], [HU], [ID], [IE], [IL], [IN], [IS], [IT], [JM], [JO], [JP], [KG], [KH], [KN], [KR], [KW], [KY], [KZ], [LA], [LB], [LK], [LT], [LU], [LV], [MD], [MK], [ML], [MN], [MO], [MT], [MU], [MX], [MY], [MZ], [NA], [NE], [NI], [NL], [NO], [NP], [NZ], [OM], [PA], [PE], [PG], [PH], [PL], [PT], [PY], [QA], [RU], [RW], [SA], [SE], [SG], [SI], [SK], [SN], [SV], [SZ], [TG], [TH], [TJ], [TM], [TR], [TT], [TW], [TZ], [UA], [UG], [US], [UY], [UZ], [VE], [VG], [VN], [ZA], [ZM], [ZW])
) PVT;
I am lazy again and not using dynamic PIVOT. It takes 0-1 seconds.
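If the set of territory_id values is not known in advance, a dynamic PIVOT can build the column list at run time. A minimal sketch, assuming SQL Server 2017+ for STRING_AGG:

-- Sketch of a dynamic PIVOT (not part of the original test); assumes SQL Server 2017+.
DECLARE @cols NVARCHAR(MAX), @sql NVARCHAR(MAX);

-- Build the [AE],[AG],... column list from the data itself.
SELECT @cols = STRING_AGG(QUOTENAME(territory_id), ',') WITHIN GROUP (ORDER BY territory_id)
FROM (SELECT DISTINCT territory_id FROM dbo.Sales20M) T;

SET @sql = N'
SELECT *
FROM
(
    SELECT Code, territory_id, AVG(price_in_usd) price_in_usd
    FROM dbo.Sales20M
    GROUP BY Code, territory_id
) DS
PIVOT
(
    MAX(price_in_usd) FOR territory_id IN (' + @cols + N')
) PVT;';

EXEC sys.sp_executesql @sql;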
Conclusion:
My point is that even though my setup is bad and I am being super lazy, not spending time to normalize the data and create proper indexes, I am still getting close to 0-second results. You can simply start with something free like PostgreSQL, and I believe you will get good results. Of course, the "fancy" stuff is always there if you need it in order to optimize a particular use case.