
SQL Server performance: 50 columns vs single binary/varbinary

Is it possible to improve SQL Server 2008 R2 (and newer) insert performance by replacing (say) 50 float columns with a single binary(n) (n being 50 x 4)?

I would presume that a fixed-size binary(n) should improve performance: the amount of data is the same, there is less work needed to handle all the columns, and the SQL statements are shorter. However, many sites recommend against using binary columns, so I would like to know whether there are real issues with this approach.

Also, the table is rather denormalized and usually not all columns are filled with values, so varbinary(n) would allow me to reduce the row size in many cases. Sometimes only a single column is filled, but ~10 on average.

And then the third question: how about going a step further and replacing (say) 5 rows of 50 float32 columns with a single varbinary(5 * 50 * 4)?

So it would be cool to get some insights into:

  1. Replacing 1 row of 50 float columns with a single binary(200);
  2. Replacing 1 row of 50 x float with a single varbinary(204) (several bytes for flags/length info), to save space when columns are unused;
  3. Replacing 5 rows of 50 x float with a single varbinary(1024) (several bytes for flags/length info).

The entire row is always read at once in all cases; a rough sketch of the first two options follows.
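For concreteness, options 1 and 2 would look roughly like this (table and column names are just placeholders I made up; real is SQL Server's 4-byte float):

-- Current wide layout: one 4-byte float column per 20 ms sample
CREATE TABLE dbo.Readings (
    Timestamp_rounded datetime2(0) NOT NULL PRIMARY KEY,
    Value_0ms  real NULL,
    Value_20ms real NULL,
    /* ... 47 more columns ... */
    Value_980ms real NULL
);

-- Option 1: fixed-size blob, 50 x 4 bytes
CREATE TABLE dbo.ReadingsPacked (
    Timestamp_rounded datetime2(0) NOT NULL PRIMARY KEY,
    Values_packed binary(200) NOT NULL
);

-- Option 2: variable-size blob with a few leading bytes for flags/length,
-- so unused values need not be stored at all
CREATE TABLE dbo.ReadingsPackedVar (
    Timestamp_rounded datetime2(0) NOT NULL PRIMARY KEY,
    Values_packed varbinary(204) NOT NULL
);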

(Update)

To clarify, the data being stored is:

 Timestamp_rounded    Value_0ms  Value_20ms  Value_40ms ... Value_980ms
 2016-01-10 10:00:00    10.0       11.1        10.5     ...    10.5

I am always reading the entire row, the clustered primary key is the first column (the timestamp), and I will never have to query the table by any of the other columns.

Normalized data would obviously have a Timestamp/Value pair, where Timestamp would then have millisecond precision. But then I would have to store 50 rows of two columns, instead of 1 row (Timestamp + BLOB).
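For illustration, that normalized alternative would be something like this (again, placeholder names):

CREATE TABLE dbo.ReadingsNormalized (
    [Timestamp] datetime2(3) NOT NULL PRIMARY KEY,   -- millisecond precision
    [Value]     real         NOT NULL
);
-- one second of data = 50 narrow rows here, versus 1 wide row (or 1 row with a BLOB) above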

Asked Jan 11 '17 by Lou


People also ask

What is the max size of Varbinary in SQL Server?

varbinary [ ( n | max ) ] n can be a value from 1 through 8,000. max indicates that the maximum storage size is 2^31-1 bytes. The storage size is the actual length of the data entered + 2 bytes.

When would you use Varbinary data type?

The VARBINARY data type holds variable-length binary data. Use this type when the data is expected to vary in size. The maximum size for VARBINARY is 8,000 bytes. As an aside, the word VARBINARY stands for varying binary.

What is the difference between Varbinary and binary?

Even though BINARY and VARBINARY are both binary byte data types, they differ in how values are stored. BINARY stores values at a fixed length, while VARBINARY stores them at a variable length depending on the value. Values in a BINARY column are padded with 0x00, which is not the case with VARBINARY.
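A small T-SQL illustration of the padding difference:

DECLARE @fixed    binary(8)    = 0x0102;
DECLARE @variable varbinary(8) = 0x0102;
SELECT DATALENGTH(@fixed)    AS fixed_bytes,     -- 8: right-padded with 0x00 to the declared length
       DATALENGTH(@variable) AS variable_bytes;  -- 2: only the bytes actually supplied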

How many columns should a SQL table have?

There is a maximum limit of 1,024 columns per table. SQL Server does have a wide-table feature that allows a table to have up to 30,000 columns instead of 1,024.


2 Answers

This is a bad idea. Replacing 50 columns of 4 bytes with one column of 200 bytes obliterates any hope of optimizing queries against any of those 50 values. To begin with, from a 'classic' SQL Server point of view (a small illustration follows the list):

  • You eliminate push-down predicates and scan time filtering
  • You eliminate indexing possibilities
  • You eliminate data purity checks (especially important for floats, since not all bit patterns make valid floats!)
  • You eliminate column-statistics-based cost optimizations
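To illustrate the first two points, consider a filter on one of the 50 values (table, column and index names here are placeholders):

-- With a real column per value the engine can index it, filter during the scan
-- and use column statistics:
CREATE INDEX IX_Readings_Value_40ms ON dbo.Readings (Value_40ms);

SELECT Timestamp_rounded
FROM   dbo.Readings
WHERE  Value_40ms > 10.5;   -- index seek / pushed-down predicate

-- With a single packed binary(200) column the same filter can use no index,
-- no statistics and no data-purity checks; the relevant 4 bytes must be
-- unpacked before they can even be compared.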

As you go more 'modern' and start considering newer SQL Server options:

  • You eliminate in-row compression options
  • You eliminate columnar storage options
  • You eliminate in-memory storage optimizations

All of this without even considering the pain you inflict on your fellow humans trying to query the data.

the table is rather denormalized and usually not all columns are filled with values, so varbinary(n) would allow me to reduce the row size in many cases. Sometimes only a single column is filled, but ~10 on average.

Then use row-compressed storage (row compression stores NULL and 0 values in essentially no bytes, so the mostly-empty columns cost almost nothing):

ALTER TABLE <your table> REBUILD PARTITION = ALL  
   WITH (DATA_COMPRESSION = ROW);

If the data is append-only, seldom updated/deleted, and most queries are analytical, then columnstores are an even better fit. Since SQL Server 2016 SP1, columnstores are available in every SQL Server edition.
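A minimal sketch with a placeholder table name (the table cannot already have a rowstore clustered index, so a clustered primary key would need to be declared NONCLUSTERED first):

CREATE CLUSTERED COLUMNSTORE INDEX CCI_Readings ON dbo.Readings;
-- each float column is then compressed column-wise, and analytical scans
-- read only the columns they actually touch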

Answered Oct 23 '22 by Remus Rusanu


As an experiment I tried out the two different methods to compare them.

I found that, after some tuning, the binary version was about 3x faster than the 50-column version.

This scenario is very specific, and my test measured only one very specific thing; any deviation from my test setup will affect the result.

How the test was made

For the 50-column version I had 50 nullable float columns, all of which I populated with float.MaxValue.

For the binary version I had a single column. Its value was constructed by concatenating float.MaxValue + "|" 50 times into a single long string, which was then converted to byte[] and stored in the table.

Both tables were heaps with no indexes or constraints.
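Schematically, the two tables looked something like this (exact column names and the varbinary width differ; see the repo linked below for the real definitions):

CREATE TABLE dbo.Test50Cols (
    Col01 float NULL, Col02 float NULL, /* ... */ Col50 float NULL
);

CREATE TABLE dbo.TestBinary (
    PackedValues varbinary(8000) NULL
);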

My test code can be found here: https://github.com/PeterHenell/binaryBulkInsertComparison

I ran the tests on SQL Server 2014 Developer Edition on a 6-core workstation with SSD drives.

Answered Oct 23 '22 by Peter Henell