Ian and I ventured down to MySQL UC with one main goal, which was to figure out a way to scale MySQL efficiently. Specifically, I was looking for a way to scale MySQL that didn’t reduce feature sets and was transparent to my application. My second goal was to learn of ways to optimize MySQL and find out about new features of interest.
The answer to my first question is simple: you can’t do that. MySQL offers two options for scaling out and high availability. One is replication, which is asynchronous and requires your application to know where to send reads, writes and “critical” reads. In other words, it’s not transparent to the application. The second option is clustering with NDB, which lacks foreign key constraints and comes with the suggestion that you denormalize your schemas to avoid joins. However, there is a glimmer of hope, which is Continuent’s m/cluster solution; it was exactly what I was looking for. Unfortunately, at $5k per socket per database node, it comes with a hefty price tag. They do have an open source version, though it only supports ANSI SQL, which I may look into.
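To make the "not transparent" complaint concrete, here is a minimal sketch of the routing logic an application ends up carrying when it sits on top of asynchronous replication. The class name, host names and the `critical` flag are all hypothetical, and the read detection is deliberately naive.

```python
import itertools


class ReplicationRouter:
    """Hypothetical router: writes and critical reads go to the master,
    ordinary reads rotate across the slaves."""

    def __init__(self, master, slaves):
        self.master = master
        # Fall back to the master if there are no slaves to read from.
        self._slaves = itertools.cycle(slaves or [master])

    def route(self, sql, critical=False):
        # Naive classification: anything that is not a SELECT/SHOW is a write.
        is_read = sql.lstrip().upper().startswith(("SELECT", "SHOW"))
        if is_read and not critical:
            return next(self._slaves)
        # Writes, and reads that cannot tolerate replication lag.
        return self.master


router = ReplicationRouter("db-master", ["db-slave1", "db-slave2"])
print(router.route("SELECT * FROM posts"))
print(router.route("INSERT INTO posts VALUES (1)"))
print(router.route("SELECT balance FROM accounts", critical=True))
```

Every query site in the application has to decide whether its read is "critical", which is exactly the kind of knowledge I was hoping to keep out of the application.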
One area in which Ian and I have learned quite a bit while at MySQL UC is the abundance of new features in 5.0 and 5.1. Triggers, views, stored procedures and events are all finding their way into these versions. Ian and I have already discussed areas where we’ll be using triggers and stored procedures to create aggregate tables for caching purposes. We have a few areas where we’ll be using events in 5.1 as well.
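As a rough sketch of the trigger-maintained aggregate table idea, the following uses Python's built-in `sqlite3` module as a stand-in so it runs anywhere. The table and trigger names are made up, and the trigger syntax is SQLite's, not MySQL 5.0's, so treat it as an illustration of the pattern only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER NOT NULL);

-- Aggregate ("cache") table: one row per post with a running total.
CREATE TABLE comment_counts (post_id INTEGER PRIMARY KEY, total INTEGER NOT NULL);

-- Keep the aggregate in step with every insert into the detail table.
CREATE TRIGGER comments_after_insert AFTER INSERT ON comments
BEGIN
    INSERT OR IGNORE INTO comment_counts (post_id, total) VALUES (NEW.post_id, 0);
    UPDATE comment_counts SET total = total + 1 WHERE post_id = NEW.post_id;
END;
""")

conn.executemany("INSERT INTO comments (post_id) VALUES (?)", [(1,), (1,), (2,)])

# Reports read the small aggregate table instead of counting detail rows.
counts = dict(conn.execute("SELECT post_id, total FROM comment_counts"))
print(counts)
```

The point is that the expensive `COUNT(*)` never runs at read time; the cost is paid a little at a time on each write.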
Some other random thoughts …
- The lunches weren’t all that great. I wish I could opt out of the lunches, save some cash on that and buy my own.
- The weather in Seattle has been better than it has been in Santa Clara, which negates the purpose of having the conference in “sunny” California.
- In discussions with fellow attendees it’s become fairly evident that both Ian and I are pushing the limits of PHP and MySQL far beyond what most people are doing. This makes it difficult to find people to turn to for help.
- There are a lot of people wearing SCAMP shirts. I can’t believe those dirty bastards have the guts to show up at a conference such as this.
As a side note to my non-techie friends, this is the last in a long string of wholly geek related entries. So, please, quit sending me email complaining about this.
- PHP5 comes with a new MySQL extension called mysqli. The i stands for improved, interface and incompatible (some say incomplete – HA!).
- Supports MySQL versions starting with MySQL 4.1.
- The new function was basically a way to start over and clean things up to work with the new features in 4.1+.
- Includes SSL connections, stronger password algorithm, prepared statements prevent SQL injection, no default connection parameters. Overall, their goal was to make it safer.
- Can make use of new and more efficient MySQL binary protocol, prepared statements give massive performance improvements on large data sets, faster overall code, support for gzip compressed connections. Additionally, the MySQL server can be embedded into PHP (wtf?).
- You can use either OOP or procedural interfaces, prepared statements make certain operations easier and there’s less that can go wrong (which seems a bit ambiguous).
- Some redundant functions have been dropped, some new functions have been added to support new features and persistent connections are no longer supported (about damn time).
- The OOP interface is “marginally” slower than the procedural interface, but tiny compared to the cost of actually getting the data. In other words, use whichever you like.
- In PHP5, the OOP interface supports Exceptions (ie. ConnectException, etc.).
- Added autocommit(), commit() and rollback() functions to the OOP interface.
- Now supports multiple queries with the multi_query() function. This looks absurdly awkward.
Not sure if I can even think of a reason to use this. This functionality was added specifically for stored procedures which can return multiple result sets.
Nevermind, Ian just noticed he’s reading the notes verbatim from the manual. That being said, this looks like an extremely interesting enhancement over the old MySQL client library. We’ll probably look into switching things over when we get back and start forming a larger MySQL strategy.
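For what it's worth, here is the prepared statement and commit/rollback idea in miniature. This is not mysqli; it uses Python's `sqlite3` module as a stand-in, with a made-up `users` table, purely to show why bound parameters shut down SQL injection and what explicit transaction control buys you.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")
conn.commit()

hostile = "x' OR '1'='1"

# String concatenation: the input rewrites the query and matches every row.
unsafe = conn.execute(
    "SELECT name FROM users WHERE name = '" + hostile + "'"
).fetchall()

# Bound parameter: the input is treated purely as data and matches nothing.
safe = conn.execute("SELECT name FROM users WHERE name = ?", (hostile,)).fetchall()

# Explicit transaction control, analogous in spirit to commit()/rollback().
conn.execute("INSERT INTO users VALUES (?)", ("bob",))
conn.rollback()
remaining = [row[0] for row in conn.execute("SELECT name FROM users")]

print(unsafe, safe, remaining)
```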
- They’re now doing 3,500 page views per second and serving that off of about 100 servers.
- Wikipedia is LAMP (Linux, Apache, MySQL and PHP).
- They use 90% of memory in InnoDB buffer pool.
- This is laughable: their operational guideline is “general availability” as opposed to five nines. Their main operational goals are maximum efficiency and always being able to scale, as people are always coming in. These are a direct result of them being a donation powered open source community site.
- They use aggregated tables for reports, which doubled overall performance. They also killed some metadata included on every page because it was too expensive to generate.
Overall, not very interesting. They don’t really tell how they’re scaling to that many requests with only 100 servers. I’ve found most of the short sessions missing any kind of meat as to exactly how these guys are scaling MySQL. The feeling I get is that most of them are doing exactly what I’ve been doing, which makes me rethink MySQL’s place in the “high performance” column of databases.
- If your applications use features which were well optimized in 4.1 you might see performance degradation (wtf?).
- As of now 4.1 will only get critical bug fixes.
- Amazingly it appears that 5.0 is slightly slower than 4.1. Why this is I don’t know.
I’m leaving this one to go to the dispersed storage engine tutorial as I’m bored to death with this. Nothing interesting.
MySQL’s new partitioning feature is a way to split tables up into chunks based on various criteria. For instance, you could break up a table based on a timestamp and rows older than a specific date are stored in MySQL’s archive storage engine.
- Partitioning will first appear in version 5.1.
- Partitioning is a “Divide and Conquer” solution for large tables.
- You can partition data based on year. For instance you can put data from before 1999 into /var/data/part1, before 2001 onto
- You can move data from partition to partition as data becomes stale as well.
- Sits above the storage engine layer and works with all table types. Sits in its own layer.
- You can remove partitioning with ALTER TABLE t1 REMOVE PARTITIONING.
- Limits on partitioning include that the partition function must return an integer result, all partitions must use the same storage engine, and it increases response time with many partitions.
- You can partition by odd and even values. For instance, you can put even values into partitions 2, 4, 6 and 8 and odd values into partitions 1, 3, 5 and 7.
- You can partition by hashing, which lets MySQL decide where to put the data. There isn’t a logical distribution between partitions so it makes pruning more difficult. For instance, you never really know which partition the data is on (unlike the above example).
- You can create composite partitions as well, which are essentially sub-partitions.
- If a primary key is defined, no fields outside of the primary key are allowed in the partition function (wtf?).
- You can reorganize partitions whenever you want. For instance you can roll partition 2 into partition 3 and move their locations. This, however, locks the table until the partition operations have completed.
- If you have both a PRIMARY KEY and a UNIQUE index in a single table you cannot do partitioning on that table.
- If a storage engine has a table size limit of 64GB (ie. InnoDB) you can have 64GB on each partition. This essentially allows you to scale beyond your storage engine’s table size limit.
- Did I just hear something about foreign key constraints being put into the meta storage engine in the near future? This would allow foreign key constraints for all table types I believe.
- Creating, updating, altering, etc. of partitions is done atomically. Either the change to the partition happens or it’s rolled back and an error message is returned.
- If one of your partitions goes away all of your data goes away. Yikes!
- It doesn’t sound like you can partition federated tables.
- Partitioning does work with replication.
- You can use it with auto-increment primary keys.
This is a very cool feature for people who have very large data sets and are looking to break those data sets up into smaller data sets. For instance, if you have a table with 1m records in it you could split it up into two data sets of 500k records in each data set, which would reduce the number of rows MySQL needs to scan during queries.
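Here is a toy model of how range and hash partitioning decide where a row lives, and why range partitions are easier to prune. It is plain Python, not MySQL; the boundaries reuse the 1999 and 2001 cut-offs from the notes above, the function names are made up, and the hash is assumed to be a simple modulo on an integer key.

```python
from bisect import bisect_right

# Exclusive upper bounds per range partition: rows before 1999, rows before
# 2001, and an implicit catch-all partition for everything later.
RANGE_BOUNDS = [1999, 2001]


def range_partition(year):
    """Index of the range partition holding rows for this year."""
    return bisect_right(RANGE_BOUNDS, year)


def hash_partition(key, partitions=4):
    """Index of the partition for an integer key under modulo hashing."""
    return key % partitions


def prune(year_from, year_to):
    """Partitions a query on year_from <= year <= year_to has to scan."""
    return list(range(range_partition(year_from), range_partition(year_to) + 1))


print(range_partition(1998), range_partition(1999), range_partition(2005))
print(prune(1990, 1995))
print([hash_partition(k) for k in (10, 11, 12, 13)])
```

With range partitioning a query on a span of years only touches the partitions covering that span. With hashing, consecutive keys land on different partitions, so a range query has to look everywhere.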
This is specifically about new features in the 5.1 release. As much as I like Australia I think it’s ridiculous to be talking about that at a MySQL conference, though it is funny I suppose.
- They now support variable sized rows, which reduces memory usage. This equates to more rows per gigabyte.
- Add/Update Index has been optimized over 5.0. Before it copied the entire table, added the new index and then moved it back over the old table. This all happened over the wire so you can imagine how long that took. This optimization speeds things up about four times.
- 5.1 now allows you to replicate across clusters. Used for geographical redundancy, split the processing load (why not add more nodes to the cluster?), etc. You have to use the row-based replication to enable this feature.
- Failover of replication channels is manual.
- They’re adding support for data on disk in 5.1 and indexes on disk in the future.
This absolutely amazes me. The moral of the story at MySQL UC has been that clustering is a nightmare to set up, maintain and use. However, storing data on disk is a step in the right direction. The solution from Continuent seems to be infinitely more elegant than both clustering and replication.
- MySQL 5.0 supports auto-increment variables for bi-directional replication to avoid auto-increment collisions.
- MySQL 5.0 replicates character sets and time zones.
- 5.0 now replicates stored procedures, triggers and views.
- MySQL 5.1 introduces row-based logging and replication (RBR)
- Dynamic switching of binary log format between
- 5.1 allows for cluster replication.
- Replication method cannot be configured per table, but since it’s dynamic you can change it from the client before a transaction, etc.
- With auto_increment_offset and auto_increment_increment you can change the starting point and how many you step between values. Works with most table types. Works with InnoDB and MyISAM. This is specifically for multi-master setups.
- RBR allows clusters to replicate and also allows the server to replicate non-deterministic statements such as LOAD_FILE(). That being said you can’t have different table definitions on the slave as you can with SBR.
Wow, nothing super impressive here that makes me excited about new features in MySQL’s 5.0/5.1 replication. Their answer to the possible auto_increment collisions seems a bit simplistic and short-sighted, but then again I’m not a maintainer for MySQL’s replication code.
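Simplistic or not, the auto-increment trick is easy to picture. This sketch just generates the sequences two masters would hand out if one used an offset of 1 and the other an offset of 2, both with an increment of 2. The helper name is made up and this is plain Python, not MySQL.

```python
from itertools import count, islice


def auto_increment_ids(offset, increment):
    """IDs a server hands out: offset, offset + increment, offset + 2 * increment, ..."""
    return count(start=offset, step=increment)


# Two masters in a bi-directional setup: same increment, different offsets.
master_a = list(islice(auto_increment_ids(offset=1, increment=2), 5))
master_b = list(islice(auto_increment_ids(offset=2, increment=2), 5))

print(master_a)
print(master_b)
```

The sequences interleave and never overlap, so inserts on both masters can't collide on the key. It does mean the increment has to be chosen with the maximum number of masters in mind up front.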
What annoys me greatly is MySQL’s refusal to simply add features to storage engines in favor of simply adding new storage engines. MyISAM doesn’t support transactions or foreign key constraints? Use InnoDB. InnoDB doesn’t support FULLTEXT? Use MyISAM. You need synchronous replication of data? Use NDB, but you need to denormalize and reduce your use of JOINs.
It’s enough to make me switch to PostgreSQL.
Today I thought I’d check out the session by the CTO of Continuent about their clustering solution. Before heading into the session we checked out Continuent’s booth in the exhibit hall. It certainly sounds like a great product, however the $5,000 per CPU licensing seems a bit Oracle’ish to me.
- A shared-nothing architecture. Split up into two layers; the controllers and the actual database nodes.
- Low latency.
- Single-copy equivalence for reads.
- Supports load balancing and heterogeneous databases (ie. MySQL and SQL Server sharing the same cluster).
- 100% Java and based on Sequoia.
- They might have an open source version. Will need to check this out.
- Fully transparent (including failover) to the application.
- The controllers act as a proxy to the database. To the applications they appear as the actual database. Below these controllers is where the databases actually sit.
- You need to compile/load their specific driver, which I think would require a new PEAR DB driver.
- When a request comes in, the controller determines what type of request it is (read vs. write). It then broadcasts writes to all controllers. All requests are executed in identical order. From there it’s sent to the scheduler, which makes sure the underlying databases remain in identical states. These requests can be sent in parallel to each database server.
- The request controller then aggregates the responses from the databases. If there was an error across all of the systems then it’s a bad query of some sort; if only a single node responds with an error then that system is dropped. This can be configured to return a success message once one of the underlying nodes responds with a success.
- Reads are simply load balanced across the underlying nodes to the node with the least number of requests (sounds like LVS’s weighted least connection algorithm).
- Works with MyISAM, InnoDB and heap table types.
- The commits are synchronous.
- Once you have a dump with a starting date you can apply the dump to the new node and then the cluster controllers apply the logs until it’s up to date.
- The open source solution only supports ANSI SQL, doesn’t come with their own group talk protocol (so the clusters can communicate with each other) and doesn’t come with the database-specific dump and load.
- The cluster controllers keep track of the position in the overall sequence of each underlying node and sends reads only to those nodes that are up-to-date with the current position.
- The largest cluster they have is four nodes. Has been tested with as many as 64 nodes and supports tiering.
- Failover between cluster controllers happens in the driver level.
- Stored procedures were a “challenge”. Their approach allows you to tell their controllers what is inside of the stored procedure (what tables it changes, etc.) so the controller knows how to handle each procedure.
After sitting in on the MySQL Clustering tutorial I can honestly say this approach is infinitely better. It supports disk write, InnoDB, MyISAM and, on top of that, allows you to tier the clusters. The only downside I see is that it’s written in Java, but these days I’m not sure if that’s really a downside.
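To pin down my understanding of what the controllers do, here is a small sketch of the two decisions described in the session: sending reads to the least busy node that is up to date, and deciding what to do with mixed results from a broadcast write. The names are all made up and this is my reading of the talk, not Continuent's code. In particular, dropping every failing node when only some fail is my generalization of the "single node" case.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    position: int             # last write in the global sequence this node applied
    active_requests: int = 0  # requests currently in flight on this node


def pick_read_node(nodes, cluster_position):
    """Least busy node among those caught up to the cluster's position."""
    fresh = [n for n in nodes if n.position >= cluster_position]
    if not fresh:
        raise RuntimeError("no up-to-date node available")
    return min(fresh, key=lambda n: n.active_requests)


def aggregate_write(results):
    """results maps node name -> True (success) or False (error).

    Returns (query_ok, nodes_to_drop).
    """
    failed = [name for name, ok in results.items() if not ok]
    if len(failed) == len(results):
        return False, []   # every node failed: treat it as a bad query
    return True, failed    # some nodes failed: drop them from the cluster


nodes = [Node("a", 10, 3), Node("b", 10, 1), Node("c", 9, 0)]
print(pick_read_node(nodes, cluster_position=10).name)
print(aggregate_write({"a": True, "b": False, "c": True}))
```

Note that node "c" is idle but still gets skipped for reads because it is one write behind, which is how they get single-copy equivalence.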
The second session today is replication for scaling and high availability, which I’m extremely interested in hearing about as we’ve been having problems with race conditions on our current replication setup.
- Replication works with all table types.
- Any “critical” reads must be done on the master as replication is asynchronous.
- Master will rotate binary logs automatically for every 1G of log records.
- You must purge any old, unused logs yourself. See the PURGE MASTER LOGS command. Using file system commands is not recommended as there are index files that need to be updated.
- Master and slave must be similar hardware as slaves generally do as much work and, sometimes, more work than the master. In other words, don’t skimp on the slave.
- Keep a “spare” slave that you can take down and clone to bring up new slaves.
- Use a load balancer for managing access to slaves.
- There is no raw limit to the number of slaves that a master can host, but each slave takes up one connection so you’re limited by your max connections on the master.
- You can create “relay” slaves that relay data from the master to slaves sitting below them. Doing this increases replication delays though.
- If a slave will also be a master (see above) you need to enable
- Relying on the master-* lines in my.cnf can be problematic compared to using the CHANGE MASTER SQL statement. The slave remembers master information.
- Datacenter failover involves the www connecting through a proxy, which then decides who the current writeable master is.
- Should avoid DNS when doing inter-datacenter failover. DNS caching, propagation, etc.
- Using proxy allows instantaneous and complete switching of traffic between masters. Another option is stunnel.
- A connection comes through A and goes directly to that DB server. On B it comes through and is forwarded to A instead of going to B. None of the connections connect directly to either A or B.
- Writes go to only one master at a time. The firewall rules ensure that.
- Both masters should use skip-slave-start and read-only. Entire setup should come up read-only at boot. Essentially, make sure they come up in the state they failed in.
- Need to handle “Server is read only” and “Connection refused” in your application when failover is occurring.
- Master<->Master replication is only a problem if you’re writing to both masters at the same time.
- They recommend switching traffic between datacenters manually rather than failing over automatically, as it’s difficult to determine when a failover is actually needed.
- Steps for failing over
- Set current master to read-only
- Wait for writes to flush through, checking how far behind the other master is. A way to do this is by creating a table on the master every second and seeing how long it takes to show up on the slave.
- Remove forward rules with old settings
- Add forwarding rules with new settings
- Remove read-only from new master using SET GLOBAL read_only
- Make sure to STOP SLAVE on the new master when master fails. When you bring it back up check the data first. If all is well then allow replication on the failed master to catch up from the new master. If not then restore from the new master.
- Reverse the above steps to bring the master back up (usually done during a planned failover later). You’ll remain live against the new master until then.
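The failover steps above are mostly about ordering, so here is a sketch of them as code. Everything in it is hypothetical: the `Master` class, the `proxy` dict standing in for the forwarding rules, and the `wait_for_catch_up` callback standing in for the lag check. It models the sequence only and doesn't talk to a real server.

```python
class Master:
    def __init__(self, name, read_only=True):
        self.name = name
        self.read_only = read_only  # both masters come up read-only at boot
        self.applied = 0            # position of the last write applied


def fail_over(proxy, old, new, wait_for_catch_up):
    """Move write traffic from `old` to `new`, in the order given in the talk."""
    old.read_only = True              # 1. set current master to read-only
    wait_for_catch_up(old, new)       # 2. wait for writes to flush through
    if new.applied < old.applied:
        raise RuntimeError("new master is still behind; not switching")
    proxy["write_target"] = None      # 3. remove forwarding rule (old settings)
    proxy["write_target"] = new.name  # 4. add forwarding rule (new settings)
    new.read_only = False             # 5. remove read-only from the new master


a = Master("A", read_only=False)
a.applied = 42
b = Master("B")
b.applied = 40
proxy = {"write_target": "A"}


def catch_up(old, new):
    # Stand-in for polling replication lag until the new master has everything.
    new.applied = old.applied


fail_over(proxy, a, b, catch_up)
print(proxy["write_target"], a.read_only, b.read_only)
```

Between steps 1 and 5 nothing accepts writes, which is why the application has to cope with “Server is read only” and “Connection refused” for a short window.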