Wednesday, April 6, 2011

Is MySQL appropriate for a read-heavy database with 3.5m+ rows? If so, which engine?

My experience with databases is with fairly small web applications, but now I'm working with a dataset of voter information for an entire state. There are approximately 3.5m voters and I will need to do quite a bit of reporting on them based on their address, voting history, age, etc. The web application itself will be written with Django, so I have a few choices of database including MySQL and PostgreSQL.

In the past I've almost exclusively used MySQL since it was so easily available. I realize that 3.5m rows in a table isn't really all that much, but it's the largest dataset I've personally worked with, so I'm out of my personal comfort zone. Also, this project isn't a quickie throw-away application, so I want to make sure I choose the best database for the job and not just the one I'm most comfortable with.

If MySQL is an appropriate tool for the job I would also like to know if it makes sense to use InnoDB or MyISAM. I understand the basic differences between the two, but some sources say to use MyISAM for speed but InnoDB if you want a "real" database, while others say all modern uses of MySQL should use InnoDB.

Thanks!

From stackoverflow
  • I've run DBs far bigger than this on MySQL - you should be fine. Just tune your indexes carefully.

    InnoDB supports better locking semantics, so if there will be occasional or frequent writes (or if you want better data integrity), I'd suggest starting there, and then benchmarking MyISAM later if you can't hit your performance targets.

    zombat : +1 - my sentiments exactly. I don't use MyISAM anymore unless I have a specific reason to. It's very capable - I used to work with tables of 20 million+ records with no problems - but the transactional capabilities of InnoDB usually outweigh any performance gain you are likely to see, at least until you run into specific performance issues.
    Branden Hall : Thanks Tim! Can I ask you to expand on "just tune your indexes carefully"? I get the purpose of indexes, but I'm not sure what the tradeoff is - i.e. why not index everything. I'm also a bit confused by what it means to tune an index - I was under the impression you decided to index a column or not and that was it.
    Tim Howland : When you create an index, you add extra overhead when doing inserts (not much, just a little, but it adds up). If you index everything, it can really slow things down. Indexes can be based on a single field, on the first N bytes (or characters) of a field, or on two or more fields. You need to work with EXPLAIN and benchmark your system to figure out the best mix for your particular data set and the searches you typically run (there's a rough sketch of this after the answers below). Check the MySQL docs here: http://dev.mysql.com/doc/refman/5.0/en/create-index.html for more info.
    Branden Hall : Thanks so much Tim - that explains a lot!
    Greg Smith : Indexes also take up additional disk space. If you create too many of them, the overhead when doing inserts can be huge--it's not safe to assume it will be minor. The problem is that all the little writes needed to update each of them involve a lot more seeking over the disk surface, and seeks are very slow on most drives.
  • MyISAM only makes sense if you need speed so badly that you're willing to accept serious data integrity downsides to achieve it. You can end up with database corruption on any unclean shutdown, there are no foreign keys, no transactions--it's really limited. And since 3.5 million rows on modern hardware is a trivial data set (unless your rows are huge), you're certainly not at the point where you're forced to optimize for performance instead of reliability because there's no other way to hit your performance goals--and that's the only situation where you should have to put up with MyISAM.

    As for whether to choose PostgreSQL instead, you won't really see a big performance difference between the two on an app this small. If you're familiar with MySQL already, you could certainly justify just using it again to keep your learning curve down.

    I don't like MySQL because there are so many ways you can get bad data into the database, whereas PostgreSQL is intolerant of that kind of behavior (see Comparing Speed and Reliability); the bad MyISAM behavior is just a subset of the concerns there. Given how fractured the MySQL community is now and the uncertainty about what Oracle is going to do with it, you might want to consider taking a look at PostgreSQL just so you have more options in the future. There's a lot less drama around the always-free, BSD-licensed PostgreSQL lately, and while its community is smaller, at least the whole of it is pushing in the same direction.

    Branden Hall : Thanks Greg, the politics of MySQL definitely do scare me a bit. Looks like I should do some reading on PostgreSQL and see if I can fit some ramp-up time into my development schedule.
  • Since it's a read-heavy table, I would recommend using the MyISAM table type. If you do not use foreign keys, you can avoid bugs like this and that.

    Backing up or copying the table to another server is as simple as copying the .frm, .MYI and .MYD files.

  • If you need to compute reports and complex aggregates, be aware that Postgres's query optimizer is rather smart and ingenious, whereas the MySQL "optimizer" is quite simple and dumb.

    On a big join the difference can be huge.

    The only advantage MySQL has is that it can hit the indexes without hitting the tables--that is, it can answer some queries from a covering index alone.

    You should load your dataset into both databases and experiment with the bigger queries you intend to run (the sketches after the answers below show roughly what that looks like). It is better to spend a few days experimenting than to be stuck with the wrong choice.
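
A couple of the answers above mention index tuning and EXPLAIN without showing what that looks like in practice, so here is a minimal MySQL sketch. The voters table and every column in it are hypothetical stand-ins for whatever the real voter file contains; only the shape of the statements matters.

    -- Hypothetical voter table (all names invented for illustration).
    -- ENGINE=InnoDB per the advice above; only fall back to MyISAM if
    -- benchmarking shows you truly need it.
    CREATE TABLE voters (
        id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        last_name  VARCHAR(64) NOT NULL,
        first_name VARCHAR(64) NOT NULL,
        birth_date DATE        NOT NULL,
        county     VARCHAR(32) NOT NULL,
        zip        CHAR(5)     NOT NULL,
        last_voted DATE        NULL
    ) ENGINE=InnoDB;

    -- Index only the columns your reports actually filter, join or sort on;
    -- each extra index costs disk space and slows every insert a little.
    CREATE INDEX idx_voters_county_birth ON voters (county, birth_date);
    CREATE INDEX idx_voters_zip          ON voters (zip);

    -- EXPLAIN shows whether a reporting query uses an index: check the
    -- "key" and "rows" columns of its output.
    EXPLAIN
    SELECT county, COUNT(*) AS voter_count
    FROM voters
    WHERE county = 'Montgomery'
      AND birth_date < '1946-01-01'
    GROUP BY county;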
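
For the load-it-into-both-databases advice in the last answer, the PostgreSQL side of the comparison might look like the following, again assuming a hypothetical voters table with county and birth_date columns has been loaded there (with SERIAL in place of AUTO_INCREMENT and no ENGINE clause).

    -- PostgreSQL: EXPLAIN ANALYZE runs the query and reports actual row
    -- counts and timings, which makes planner comparisons concrete.
    EXPLAIN ANALYZE
    SELECT county, COUNT(*) AS voter_count
    FROM voters
    WHERE county = 'Montgomery'
      AND birth_date < '1946-01-01'
    GROUP BY county;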
