
SQL Server performance: 50 columns vs single binary/varbinary

Is it possible to improve SQL Server 2008 R2 (and newer) insert performance by replacing (say) 50 float columns with a single binary(n) (n being 50 x 4)?

I would presume that a fixed-size binary(n) should improve performance: the amount of data is the same, there is less work needed to handle all the columns, and the SQL statements are shorter. However, many sites recommend against using binary columns, so I would like to know whether there are real issues with this approach.

Also, the table is rather denormalized and usually not all columns are filled with values, so varbinary(n) would allow me to reduce the row size in many cases. Sometimes only a single column is filled, but ~10 on average.

And then the third question: how about going a step further and replacing (say) 5 rows of 50 float32 columns with a single varbinary(5 * 50 * 4)?

So it would be cool to get some insights into:

  1. Replacing 1 row of 50 float columns with a single binary(200);
  2. Replacing 1 row of 50 x float with a single varbinary(204) (several bytes for flags/length info), to save space when columns are unused;
  3. Replacing 5 rows of 50 x float with a single varbinary(1024) (several bytes for flags/length info).

The entire row is always read at once in all cases; a rough sketch of the first two options follows.
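For concreteness, options 1 and 2 would look roughly like this (table and column names are just placeholders I made up; real is SQL Server's 4-byte float):

-- Current wide layout: one 4-byte float column per 20 ms sample
CREATE TABLE dbo.Readings (
    Timestamp_rounded datetime2(0) NOT NULL PRIMARY KEY,
    Value_0ms  real NULL,
    Value_20ms real NULL,
    /* ... 47 more columns ... */
    Value_980ms real NULL
);

-- Option 1: fixed-size blob, 50 x 4 bytes
CREATE TABLE dbo.ReadingsPacked (
    Timestamp_rounded datetime2(0) NOT NULL PRIMARY KEY,
    Values_packed binary(200) NOT NULL
);

-- Option 2: variable-size blob with a few leading bytes for flags/length,
-- so unused values need not be stored at all
CREATE TABLE dbo.ReadingsPackedVar (
    Timestamp_rounded datetime2(0) NOT NULL PRIMARY KEY,
    Values_packed varbinary(204) NOT NULL
);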

(Update)

To clarify, the data being stored is:

 Timestamp_rounded    Value_0ms  Value_20ms  Value_40ms ... Value_980ms
 2016-01-10 10:00:00    10.0       11.1        10.5     ...    10.5

I am always reading the entire row, the clustered primary key is the first column (the timestamp), and I will never have to query the table by any of the other columns.

Normalized data would obviously have a Timestamp/Value pair, where Timestamp would then have millisecond precision. But then I would have to store 50 rows of two columns, instead of 1 row (Timestamp + BLOB).
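For illustration, that normalized alternative would be something like this (again, placeholder names):

CREATE TABLE dbo.ReadingsNormalized (
    [Timestamp] datetime2(3) NOT NULL PRIMARY KEY,   -- millisecond precision
    [Value]     real         NOT NULL
);
-- one second of data = 50 narrow rows here, versus 1 wide row (or 1 row with a BLOB) above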

Asked Jan 11 '17 by Lou


People also ask

What is the max size of Varbinary in SQL Server?

varbinary [ ( n | max ) ] n can be a value from 1 through 8,000. max indicates that the maximum storage size is 2^31-1 bytes. The storage size is the actual length of the data entered + 2 bytes.

When would you use Varbinary data type?

The VARBINARY data type holds variable-length binary data. Use this type when the data is expected to vary in size. The maximum size for VARBINARY is 8,000 bytes. As an aside, the word VARBINARY stands for varying binary.

What is the difference between Varbinary and binary?

Even though BINARY and VARBINARY are both binary byte data types, they differ in how values are stored. BINARY stores values at a fixed length, while VARBINARY stores them at a variable length depending on the value. Values in a BINARY column are padded with 0x00, which is not the case with VARBINARY.
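A small T-SQL illustration of the padding difference:

DECLARE @fixed    binary(8)    = 0x0102;
DECLARE @variable varbinary(8) = 0x0102;
SELECT DATALENGTH(@fixed)    AS fixed_bytes,     -- 8: right-padded with 0x00 to the declared length
       DATALENGTH(@variable) AS variable_bytes;  -- 2: only the bytes actually supplied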

How many columns should a SQL table have?

There is a maximum limit of 1,024 columns per table. SQL Server does have a wide-table feature that allows a table to have up to 30,000 columns instead of 1,024.


2 Answers

This is a bad idea. Replacing 50 columns of 4 bytes with one column of 200 bytes obliterates any hope of optimizing queries against any of those 50 values. To begin with, from a 'classic' SQL Server point of view (a small illustration follows the list):

  • You eliminate push-down predicates and scan time filtering
  • You eliminate indexing possibilities
  • You eliminate data purity checks (especially important for floats, since not all bit patterns make valid floats!)
  • You eliminate column-statistics-based cost optimizations
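To illustrate the first two points, consider a filter on one of the 50 values (table, column and index names here are placeholders):

-- With a real column per value the engine can index it, filter during the scan
-- and use column statistics:
CREATE INDEX IX_Readings_Value_40ms ON dbo.Readings (Value_40ms);

SELECT Timestamp_rounded
FROM   dbo.Readings
WHERE  Value_40ms > 10.5;   -- index seek / pushed-down predicate

-- With a single packed binary(200) column the same filter can use no index,
-- no statistics and no data-purity checks; the relevant 4 bytes must be
-- unpacked before they can even be compared.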

As you go more 'modern' and start considering newer SQL Server options:

  • You eliminate in-row compression options
  • You eliminate columnar storage options
  • You eliminate in-memory storage optimizations

All of this without even considering the pain you inflict on your fellow humans trying to query the data.

the table is rather denormalized and usually not all columns are filled with values, so varbinary(n) would allow me to reduce the row size in many cases. Sometimes only a single column is filled, but ~10 on average.

Then use row-compressed storage (row compression stores NULL and 0 values in essentially no bytes, so the mostly-empty columns cost almost nothing):

ALTER TABLE <your table> REBUILD PARTITION = ALL  
   WITH (DATA_COMPRESSION = ROW);

If the data is append-only, seldom updated/deleted, and most queries are analytical, then columnstores are an even better fit. Since SQL Server 2016 SP1, columnstores are available in every SQL Server edition.
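A minimal sketch with a placeholder table name (the table cannot already have a rowstore clustered index, so a clustered primary key would need to be declared NONCLUSTERED first):

CREATE CLUSTERED COLUMNSTORE INDEX CCI_Readings ON dbo.Readings;
-- each float column is then compressed column-wise, and analytical scans
-- read only the columns they actually touch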

Answered Oct 23 '22 by Remus Rusanu


As an experiment I tried out the two different methods to compare them.

I found that, after some tuning, the binary version was about 3x faster than the 50-column version.

This scenario is very specific, and my test measured only one very specific thing; any deviation from my test setup will affect the result.

How the test was made

For the 50-column version I had 50 nullable float columns, all of which I populated with float.MaxValue.

For the binary version I had a single column. Its value was constructed by concatenating float.MaxValue + "|" 50 times into a single long string, which was then converted to byte[] and stored in the table.

Both tables were heaps with no indexes or constraints.
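Schematically, the two tables looked something like this (exact column names and the varbinary width differ; see the repo linked below for the real definitions):

CREATE TABLE dbo.Test50Cols (
    Col01 float NULL, Col02 float NULL, /* ... */ Col50 float NULL
);

CREATE TABLE dbo.TestBinary (
    PackedValues varbinary(8000) NULL
);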

My test code can be found here: https://github.com/PeterHenell/binaryBulkInsertComparison

I ran the tests on SQL Server 2014 Developer Edition on a 6-core workstation with SSD drives.

Answered Oct 23 '22 by Peter Henell