First off all, i am not a database expert but a contractor. I hired a (good) programmer but now have some doubts about a certain part of the database design due to some problems we experience and all the information i am reading. Let's start.
We build a housing site that uses a parser to process all the data and store it in a ms-sql database. Every day the feeds contain about 70.000 records of wich most of them also have pictures (average 3) attached. The pictures vary in size from 30kb to 400kb. The database has about the same amount of records. There are about 400 new objects to be processed. This means that every day, all the records in the database have to be entered to see if data has been changes, an object has been deleted or if it is a new object and therefor has to be inserted. The pictures are stored in the database. The feeds are processed on a dual quadcore machine with 32GB of memory and SSA disks. The database is now 600GB in size.
Currently we have about 3000 users a day that look at 6 houses and view 10 images per user on average.
This is what we experience: - The whole process of parsing takes about 13 hours. - We get a lot of timeout errors in the log - We get quit a few of deadlock errors - Google complains about the timeout errors and as a result indexes not much pages. - Google rates the site as slow due to some directories taking more than 10 seconds to load.
I personaly think it has something to do with the pictures in the database and some bad queries. But before i start complaining to my programmer i would like to hear your opinion on this. Thanks in advance for your time.
Update from my programmer: Here is some information on the structure of the tables. There are 2 tables for the images, 1 called imageinfo which is used for quering on the images (like getting a list of imageid’s and content-type for instance) and an image table which contains the id of the image and the BLOB. The imageinfo table has the same id as the image table (1:1 relationship) and has some extra information like the name, type and a hash of the image. That hash is used in the parser process to determine if an image has changed. So the only time when the image table is touched is when there is an insert/update/delete from the parser and the site accesses an image. The time it takes to access and download one image is around the ~350ms.
You tell us two problems:
(2) is easy: You probably need to understand your read queries and index them. This is definitely solvable.
(1) Is much harder to say something about without more specifics. I understand you need to compare lots of blobs - you could store a compact hash of those blogs besides the actual data. That way you don't need to retrieve the blob for comparison purposes and can even index the hash.
Should you have images in the database?
The biggest pros are: Consistent and easy backups, developer convenience. The biggest con is potential misuse. You really can't say in general that images belong into the file system. The database is usually fine for them except if specific and concrete reasons are present to put them somewhere else.
My guess is your usage of those blogs falls under misuse and you would have the same problems if the files were stored in the file system.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With