Archive for January 5th, 2008

 

large databases & limited resources

Jan 05, 2008 in mysql

I’m stuck here in lovely Sierra Vista on a weekend, total bummer.

Anyway, I told myself I should work on some useful training for work, but screw it. I decided instead to waste some perfectly valuable time on the boinc site. No idea why, but whatever.

The issue with the site is the daily updates for the entire site. The source data downloaded is a snapshot, meaning the entire dataset rather than updates since the previous day/week/month/whatever. The amount of data is quite large in size to reprocess every single day for little to no benefit. I used to use mysql’s ability to allow me to send an ‘INSERT…ON DUPLICATE KEY…UPDATE’ statement. Unfortunately, it performed a full table scan on each statement, took way too long. Plus, the updates to an individual object was to maybe four or five fields, depending what happened to the object in question. As an example, a user could have changed their website or other useful and fun information in addition to their result count data. Moving on, I now issue DELETE statements on the data based on the project ID that I assign when the project is added to the site. Given there are a few million records, the DELETE statements were taking a long time given the need to again scan the table.

This then led to me giving each project it’s own set of tables, i.e. teams, users, hosts for project x simply by appending ‘_x’ to the table name after copying the table structure from a template table. Simple stuff. The issue I ran into was when I TRUNCATE the sets of tables, all of that data is lost until I can basically re-insert that data from the data refresh I downloaded for that day. Of course, anything accessing the site will not see that data until the refresh is complete. I could of course do a bunch of things, like cache web pages, create temp tables, etc. I created a caching application to store the generated pages on disk, but a few million pages stored on disk caused major performance issues.

The main issue I have is server and financial resources, or the lack thereof. It’s a dedicated box with 1GB RAM and 7200RPM SATA drives, very basic and low on power. My goal is to keep all data available on the site, all the time. I had thoughts of caching pages (Custom, PEAR, memcache, etc.) on a different machine or another database server that I could simply pull data from, but that doubles the cost for little benefit, and I wanted to overcome this challenge using just the one server. Why not simply have two or more boinc databases? Don’t have the resources, both storage and RAM.

Unfortunately, I am definitely out of RAM and thrashing the hard drives quite nicely at this point given the biggest table now has 300+ million records and appropriately sized index. This table is the overall user results history table. I used to have a non-normalized version of the user results history table where I selected the user given a project ID and user ID and selected a pipe delimited field and appended this update’s information to that same field in one SQL statement. This proved to be quite slow when updating daily given the sheer size of selecting the large user object, updating the data, then saving it back to the database. I then opted to simply dump the user’s daily result data into a normalized set of tables, works much faster in both getting the data in and the SELECT statements for the web pages.

The program I wrote to actually parse and insert the XML data performs quite nicely I think. I originally had it to where I simply insert the XML block representing a user into the database in one field. The web application simply deserializes that XML data into an object and I use it from there as an instantiated object. Today, the batch application dumps the XML object and breaks out the information so that I can do stuff with that data when the updates are taking place. In essence, the database stores a user object as the raw XML data in one field and is broken out into individual fields for that particular user in multiple tables. Frankly, the reading and parsing the XML portion screams, can’t get it into the database quick enough. Yay.

So, at this point, the application truncates the project’s tables given there is new data to put back in. The issue I continue to deal with is how to replace data quickly, yet continue to respond to requests for data from the web applications. Definitely a slow server during that time. Plus, once the RAM is fully used, the hard drives thrashing away trying to truncate/replace/insert/select data quickly offers lots of opportunities to solve problems creatively using interesting methods, which is a developer’s dream job/world.