Massive Daily MySQL Imports

pr0gr4mm3r's picture

He has: 1,502 posts

Joined: Sep 2006

I am working on a site that has to import a large amount of SQL data every night from a master feed provided by a third party. The import takes about 8 minutes (I hope to get that time down before deployment). The script has to truncate the tables before importing the fresh data, so there is a brief period of time when most or all of the data is unavailable on the site. Do you think it is necessary to disable the site with a sort of "under construction" page while this import is happening? I think a better option would be to somehow import the data without messing with what is already there. What would be the best way to do that?

teammatt3's picture

He has: 2,102 posts

Joined: Sep 2003

Have you thought about creating two databases? While one is being used on the production site, you could be importing into the other. Once the import is done, a script runs that switches out your dbconfig file with another one that points to the freshly loaded DB. You would just swap which database you import into every night (that could be automated).
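Roughly, the swap can be as small as a pointer the import script flips once the fresh copy is loaded. Here is a minimal PHP sketch of the nightly side, assuming two identical databases called feed_a and feed_b and a little pointer file named active_db.txt (all of those names are made up for illustration):

<?php
// nightly_import.php - sketch of "import into the idle DB, then flip".
$pointerFile = __DIR__ . '/active_db.txt';

// Which database is the site using right now?
$active = trim(file_get_contents($pointerFile));           // e.g. "feed_a"
$idle   = ($active === 'feed_a') ? 'feed_b' : 'feed_a';    // tonight's import target

$pdo = new PDO("mysql:host=localhost;dbname=$idle;charset=utf8", 'user', 'pass');

// 1) Clear the idle copy and re-import the feed into it.
$pdo->exec('TRUNCATE TABLE listings');
// ... run the normal import against $pdo here ...

// 2) Only after the import succeeds, flip the pointer. The site reads the
//    fresh database on its next request; the old copy stays around as a fallback.
file_put_contents($pointerFile, $idle, LOCK_EX);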

pr0gr4mm3r's picture

He has: 1,502 posts

Joined: Sep 2006

The only problem I see with that is that I would have to have a third database to store the data that doesn't change every day. I don't have any other tables right now, since the site is fairly static apart from the import data...it's just a potential problem for the future.

decibel.places's picture

He has: 1,494 posts

Joined: Jun 2008

Depends on the traffic, but 8+ minutes down can alienate users.

The three-database solution sounds workable.

If it's a Drupal site, you can automate the site maintenance setting, but that would not be necessary with database switching.

I work on a ratings site that has a monthly reset of some tables in the database, but that is different: we don't import anything, just back up, truncate, and update.

pr0gr4mm3r's picture

He has: 1,502 posts

Joined: Sep 2006

Depends on the traffic, but 8+ minutes down can alienate users.

I really doubt many people will be browsing the site in the middle of the night, but it does flag a scalability issue, and it makes me paranoid.

If it's a Drupal site, you can automate the site maintenance setting, but that would not be necessary with database switching.

Nah, I do most of my sites with CodeIgniter.

JeevesBond's picture

He has: 3,956 posts

Joined: Jun 2002

I really doubt many people will be browsing the site in the middle of the night

Well, unless it's a regional site, what's the middle of the night for you is the middle of the day for others.

Are you using LOAD DATA LOCAL INFILE to load the data? If so, I think the best way is the one Matt suggested. There are ways around the problem of static data.

You could use two databases: one is your live DB, the other you treat as a temporary data store. So pull down the data and do a LOAD DATA LOCAL INFILE into the temp DB; optionally, delete records in the temp DB that are already present in the live DB; then use an INSERT IGNORE ... SELECT ... statement to copy the data from the temp DB to the live one.

Using table locking and temporarily switching off indexes will speed up DB-to-DB copying significantly.
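A rough PHP/PDO sketch of that flow (database, table, and file names are invented; LOAD DATA LOCAL INFILE needs local_infile enabled on both client and server, and DISABLE KEYS only skips non-unique index maintenance on MyISAM tables):

<?php
// Sketch: bulk-load the feed into a staging DB, then copy it into the live DB.
$pdo = new PDO('mysql:host=localhost;charset=utf8', 'user', 'pass', [
    PDO::MYSQL_ATTR_LOCAL_INFILE => true,   // needed for LOAD DATA LOCAL
]);

// 1) Load the raw CSV into the staging table.
$pdo->exec('TRUNCATE TABLE staging.listings');
$sql = <<<'SQL'
LOAD DATA LOCAL INFILE '/tmp/feed.csv'
INTO TABLE staging.listings
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
SQL;
$pdo->exec($sql);

// 2) Optionally drop rows that are already present in the live table.
$pdo->exec('DELETE s FROM staging.listings s
            JOIN live.listings l ON l.listing_id = s.listing_id');

// 3) Copy the rest across. Locking the tables and (on MyISAM) disabling
//    non-unique key updates speeds up the bulk copy considerably.
$pdo->exec('LOCK TABLES live.listings WRITE, staging.listings READ');
$pdo->exec('ALTER TABLE live.listings DISABLE KEYS');
$pdo->exec('INSERT IGNORE INTO live.listings SELECT * FROM staging.listings');
$pdo->exec('ALTER TABLE live.listings ENABLE KEYS');
$pdo->exec('UNLOCK TABLES');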

a Padded Cell, our articles site!

pr0gr4mm3r's picture

He has: 1,502 posts

Joined: Sep 2006

Well, unless it's a regional site, what's the middle of the night for you is the middle of the day for others.

It's a real-estate site for a local area, so there wouldn't be many international visitors.

Are you using LOAD DATA LOCAL INFILE to load the data?

No...I wish I could, but the third party we're getting the data from has a habit of changing the format of the CSV files frequently, and without telling us. I parse them using PHP's fgetcsv() and then format everything for the database.
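For reference, a stripped-down version of that parse-and-insert loop might look like this (the column positions, table, and file name are invented, since the real feed layout isn't shown here):

<?php
// Sketch: parse the third-party CSV with fgetcsv() and insert via a prepared statement.
$pdo  = new PDO('mysql:host=localhost;dbname=feed;charset=utf8', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO listings (listing_id, price, city) VALUES (?, ?, ?)');

$fh = fopen('/tmp/feed.csv', 'r');
fgetcsv($fh);   // skip the header row

while (($row = fgetcsv($fh)) !== false) {
    // Hypothetical column positions: 0 = id, 3 = price, 7 = city.
    $stmt->execute([(int) $row[0], (float) $row[3], trim($row[7])]);
}
fclose($fh);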

They have: 121 posts

Joined: Dec 2008

I'm (a lot) late to the party, but here are a couple more ideas:

1) Wrap everything in a transaction*:
BEGIN WORK;
DELETE FROM tblname;
-- ...bunch of INSERTs...
-- if success:
COMMIT;
-- otherwise:
ROLLBACK;

* Never tried this in MySQL, though it apparently supports transactions now...
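In PHP, that pattern can look roughly like the sketch below, using PDO's transaction calls. It only does what you'd hope if the tables are InnoDB (MyISAM ignores transactions), and the table has to be emptied with DELETE rather than TRUNCATE, because TRUNCATE causes an implicit commit. Table and column names here are placeholders:

<?php
// Sketch: delete-and-reload inside one transaction, so the site never sees
// an empty table. Assumes InnoDB tables and placeholder names.
$pdo = new PDO('mysql:host=localhost;dbname=feed;charset=utf8', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$freshRows = [
    [101, 250000.00],   // placeholder rows standing in for the parsed feed
    [102, 189900.00],
];

try {
    $pdo->beginTransaction();
    $pdo->exec('DELETE FROM listings');   // not TRUNCATE: that would commit implicitly

    $stmt = $pdo->prepare('INSERT INTO listings (listing_id, price) VALUES (?, ?)');
    foreach ($freshRows as $row) {
        $stmt->execute($row);
    }

    $pdo->commit();       // readers switch to the new data all at once
} catch (Exception $e) {
    $pdo->rollBack();     // the old data is left untouched
    error_log('Import failed: ' . $e->getMessage());
}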

2) Implement a sort of 'row versioning':

Add a table 'imports' with an import_id, start timestamp, end timestamp, and a 'status' field, as well as anything else you might like (number of rows inserted, etc.).

Add a field to your import target, 'import_id', and change your unique indexes/keys to be compound over 'import_id' and listing_id.

Before beginning the import, insert a row into 'imports' to obtain a new unique 'import_id'. Have the status = 'incomplete'.

Insert your data, with each row carrying the new 'import_id' you've obtained.

When the import is done and successful, update the 'imports' table row with a 'success' status.

When successful, you can either delete the previous 'version' (import_id) of the rows, or leave them there for historical purposes.

In your application, you'll be selecting rows with your current criteria plus "and import_id = (select import_id from imports where status = 'success' and end_timestamp = (select max(end_timestamp) from imports where status = 'success'))".
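A compact PHP sketch of that flow; the table layout (imports with import_id/start_ts/end_ts/status, listings with a compound unique key) and the column names are illustrative, not taken from a real feed:

<?php
// Sketch of the row-versioning import: tag every row with an import_id and
// only flip the run's status to 'success' at the very end.
$pdo = new PDO('mysql:host=localhost;dbname=feed;charset=utf8', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$rowsFromFeed = [
    ['listing_id' => 101, 'price' => 250000.00],   // placeholder feed rows
    ['listing_id' => 102, 'price' => 189900.00],
];

// 1) Register the run and grab a fresh import_id.
$pdo->exec("INSERT INTO imports (start_ts, status) VALUES (NOW(), 'incomplete')");
$importId = (int) $pdo->lastInsertId();

// 2) Insert the feed rows tagged with that import_id.
$stmt = $pdo->prepare('INSERT INTO listings (import_id, listing_id, price) VALUES (?, ?, ?)');
foreach ($rowsFromFeed as $row) {
    $stmt->execute([$importId, $row['listing_id'], $row['price']]);
}

// 3) Mark the run successful; only now do the site's queries pick it up,
//    because they filter on the latest import with status = 'success'.
$done = $pdo->prepare("UPDATE imports SET end_ts = NOW(), status = 'success' WHERE import_id = ?");
$done->execute([$importId]);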

Cheers,
Shaggy

greg's picture

He has: 1,581 posts

Joined: Nov 2005

Shaggy wrote:
I'm (a lot) late to the party
Yeah, the beer's all gone, mate. Looks like you're on the Lambrini.

Greg K's picture

He has: 2,145 posts

Joined: Nov 2003

(Yes, the solution is probably already in place by now, but I like posting anyway for anyone who finds this thread through a search engine while looking for similar solutions.)

Also, something I did to majorly speed up the import: instead of issuing individual SQL INSERT statements, I use mysqlimport (http://dev.mysql.com/doc/refman/5.0/en/mysqlimport.html) to import a whole CSV file straight into the table.

Some tips:

If you're working on a local MySQL server (the CSV file is on the same server), use LOAD DATA instead, so that if you get any warnings you can run SHOW WARNINGS afterwards.

I needed to delete certain records as well, and only insert most of the rows from the feed I received. I had a script go through the CSV file and, for any record that needed to be deleted, set a flag on that record. For the rows that needed to be inserted, I'm picky and do a data integrity check to make sure I didn't get any bad rows (i.e. make sure the field count matches for each row). I also get dates in MM/DD/YYYY format, so this is a good time to swap those out to YYYY-MM-DD. Save all the good data into a new CSV file, as in the sketch below.
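Roughly like this; the expected field count, the date column position, and the file names are all placeholders, since the real feed layout isn't shown here:

<?php
// Sketch: validate the raw feed, normalise the dates, and write a clean CSV
// that mysqlimport (or LOAD DATA) can swallow in one go.
$in  = fopen('/tmp/raw_feed.csv', 'r');
$out = fopen('/tmp/clean_feed.csv', 'w');

$expectedFields = 12;   // placeholder: however many columns the feed should have

while (($row = fgetcsv($in)) !== false) {
    if (count($row) !== $expectedFields) {
        continue;       // integrity check: drop malformed rows
    }
    // Hypothetical column 5 holds MM/DD/YYYY; rewrite it as YYYY-MM-DD.
    list($m, $d, $y) = explode('/', $row[5]);
    $row[5] = sprintf('%04d-%02d-%02d', $y, $m, $d);

    fputcsv($out, $row);
}
fclose($in);
fclose($out);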

As this checking and processing is what takes the majority of the time, the data in your database stays intact until you are ready to finally do the update.

Now you can do a single DELETE FROM tblWhatever WHERE DelFlag='Y', then immediately run the mysqlimport (you will need to run it via PHP's system() command), as in the sketch below.
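For example (database, table, and file names are placeholders; note that mysqlimport derives the table name from the CSV file's base name, so the file has to be named after the table):

<?php
// Sketch: flush the flagged rows, then bulk-load the cleaned CSV with mysqlimport.
$pdo = new PDO('mysql:host=localhost;dbname=feed;charset=utf8', 'user', 'pass');
$pdo->exec("DELETE FROM tblWhatever WHERE DelFlag = 'Y'");

// mysqlimport loads /tmp/tblWhatever.csv into the table tblWhatever.
$cmd = 'mysqlimport --local --user=user --password=pass'
     . ' --fields-terminated-by=, --fields-optionally-enclosed-by=\'"\''
     . ' feed /tmp/tblWhatever.csv';
system($cmd, $exitCode);

if ($exitCode !== 0) {
    error_log('mysqlimport failed with exit code ' . $exitCode);
}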

I have found on two projects that this is so much faster than doing individual inserts/updates. Another method, instead of writing an extra CSV file, would be to insert all the cleansed data into a temporary table and then issue one statement to copy all the records from the temp table into the live one.

I stick with the CSV file here, though, for other purposes relevant to the project.

Also, I'm not sure how you are getting the data from the third party, but if it is like us, where their inventory system exports about 4 CSV files each night (plus accompanying "long text" .txt files for descriptions in another directory), then rsync (on the Linux end) and DeltaCopy (which easily wraps and sets up rsync on the Windows end) are great tools. They're extremely fast and have ways to transfer just what has changed, not everything all over again. It blew away establishing an FTP connection.
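For anyone curious, the pull side is basically a one-liner; the host and paths below are invented, and -a/-v/-z just mean "preserve attributes", "verbose", and "compress in transit":

<?php
// Sketch: pull only the changed feed files from the third party's server,
// e.g. from a nightly cron job, before the import script runs.
system('rsync -avz feeduser@feed.example.com:/exports/ /var/feeds/incoming/', $rc);
if ($rc !== 0) {
    error_log('rsync failed with exit code ' . $rc);
}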

-Greg

pr0gr4mm3r's picture

He has: 1,502 posts

Joined: Sep 2006

Also, something I did to majorly speed up the import: instead of issuing individual SQL INSERT statements, I use mysqlimport (http://dev.mysql.com/doc/refman/5.0/en/mysqlimport.html) to import a whole CSV file straight into the table.

That was my first thought, but the CSV files are not consistent, and the slightest change would cause a big problem.

I ended up going with Matt's solution. The website has three databases. DB1 holds the consistent data, and DB2 & DB3 hold the daily imported data. If we are using DB2 today, then when the import script runs, DB3 will be truncated and repopulated while the website is serving DB2. Once the import is done, the active database is swapped for the freshly updated one by changing a constant.
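The application side of that swap can be tiny. A sketch along CodeIgniter lines, assuming the import script rewrites a one-line include that defines the constant (the include name, constant, and DB names are placeholders, not the actual project code):

<?php
// application/config/database.php (sketch)
// active_db.php is rewritten by the import script and contains something like:
//     define('ACTIVE_IMPORT_DB', 'db3');
require_once APPPATH . 'config/active_db.php';

$db['default']['hostname'] = 'localhost';
$db['default']['username'] = 'user';
$db['default']['password'] = 'pass';
$db['default']['database'] = ACTIVE_IMPORT_DB;   // db2 or db3, whichever was just refreshed
$db['default']['dbdriver'] = 'mysql';

// The static data in db1 can live in a second connection group, e.g. $db['static'].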

Clunky? Maybe, compared to some of the other solutions posted here, but this company is a bit of a penny-pincher, and they weren't willing to pay me to come up with something better. And it's worked so far. I support my code for free for a year, so the company would get me on the phone in a heartbeat if there was a problem. I had to iron out a couple of minor glitches after the launch, but I haven't heard from them on this project since late last year, so it must be working for them.

They have: 121 posts

Joined: Dec 2008

That also sounds like a good approach. Another option could be two tables that you switch between, instead of two databases; see the sketch below.
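If you go that route, MySQL's RENAME TABLE makes the switch atomic, so nothing ever queries a half-loaded table. A sketch with placeholder names:

<?php
// Sketch: import into listings_staging, then swap it with the live table.
$pdo = new PDO('mysql:host=localhost;dbname=feed;charset=utf8', 'user', 'pass');

$pdo->exec('TRUNCATE TABLE listings_staging');
// ... import the feed into listings_staging here ...

// RENAME TABLE performs all the renames as one atomic operation.
$pdo->exec('RENAME TABLE listings         TO listings_old,
                         listings_staging TO listings,
                         listings_old     TO listings_staging');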

Ah, real estate data. It's always interesting to see what weird and wonderful ways the boards come up with to distribute that stuff, eh? Have you had the 'pleasure' of dealing with (board's interpretation of) RETS feeds yet?

Cheers,
Shaggy

pr0gr4mm3r's picture

He has: 1,502 posts

Joined: Sep 2006

Have you had the 'pleasure' of dealing with (board's interpretation of) RETS feeds yet?

At least that is in XML. This is what I had to deal with.

They have: 121 posts

Joined: Dec 2008

Yes, I feel for you. Be happy, though: this board actually supplied a data dictionary; many don't even go that far...


