I’ve said it once, I’ve said it twice, I’ve screamed it from the top of mountains and yet nobody listens. I’m sitting in a session at the MySQL Conference and the person presenting just said, “You have to have well written code to avoid bottlenecks.” This is, put bluntly, stupid and patently false. Let me explain.
- Your true bottlenecks when scaling are very rarely, if ever, because of your language. Sure Ruby is slower than PHP or Perl or Python, but only incrementally so and it’s only going to get faster. Even if your language is your problem it’s the easiest part of your architecture to scale; add more hardware.
- Just because your code is well written doesn’t mean it will perform well and, conversely, just because you write shitty code doesn’t mean your code will not perform well. I’ve seen some seriously shitty PHP code that’s blazing fast because it’s so simple.
- Depending on your application, as you grow you’ll find that your scaling issues come down to one fundamental problem: I/O. DB I/O, file system / disk I/O, network traffic, etc, etc. Ask anyone who’s written a large scale application where their growing pains were and I’ll bet my last dollar it wasn’t “PHP/Python/Ruby/Perl/Java/COBOL is slow”. I’m betting they’ll say something along the lines of “MySQL took a crap on us after we hit 200,000,000 records and had to do date range scans.” Or they’ll say, “I was storing user generated content and NFS couldn’t scale to the amount of requests for that content.”
I’m sick and tired of the language zealots who say PHP is slower than Perl or Ruby is slower than PHP or Java sucks because which language you’re using has zero to do with that missing index on your table or the fact that you can’t store all of that user generated content.
It comes down to your architecture and, despite what the zealots would have you believe, the language you choose is only one component of your overall architecture. Choose what you know and run with it.
- If your applications use features which were well optimized in 4.1 you might see performance degredation (wtf?).
- As of now 4.1 will only get critical bug fixes.
- Amazingly it appears that 5.0 is slightly slower than 4.1. Why this is I don’t know.
I’m leaving this one to go to the dispearsed storage engine tutorial as I’m bored to death with this. Nothing interesting.
MySQL’s new partitioning feature is a way to split tables up into chunks based on various criteria. For instance, you could break up a table based on a timestamp and rows older than a specific date are stored in MySQL’s archive storage engine.
- Partitioning will first appear in version 5.1.
- Partitioning is a “Divide and Conquer” solution for large tables.
- You can partition data based on year. For instance you can put data from before 1999 into
/var/data/part1, before 2001 onto
- You can move data from partition to partition as data becomes stale as well.
- Sits above the storage engine layer and works with all table types. Sits in its own layer.
- You can remove partitioning with
ALTER TABLE t1 REMOVE PARTITIONING.
- Limits on partitioning include that the partition function must return an integer result, all partitions must use the same storage engine, and it increases response time with many partitions.
- You can partition by odd and even values. For instance, you can put even values into partitions 2, 4, 6 and 8 and odd values into partitions 1, 3, 5 and 7.
- You can partition by hashing, which lets MySQL decide where to put the data. There isn’t a logical distribution between partitions so it makes pruning more difficult. For instance, you never really know which partition the data is on (unlike the above example).
- You can create composit partitions as well, which are essentially sub-partitions.
- If a primary key is defined no fields outside of primary key is allowed in partition function (wtf?).
- You can reorganize partitions whenever you want. For instance you can roll partition 2 into partition 3 and move their locations. This, however, locks the table until the partition operations have completed.
- If you have both a
PRIMARY KEY and a
UNIQUE index in a single you cannot do partitioning on that table.
- If a storage engine has a table size limit of 64GB (ie. InnoDB) you can have 64GB on each partition. This essentially allows you to scale beyond your storage engine’s table size limit.
- Did I just hear something about foreign key constraints being put into the meta storage engine in the near future? This would allow foreign key constraints for all table types I believe.
- Creating, updating, altering, etc. of partitions is done atomically. Either the change to the partition happens or it’s rolled back and an error message is returned.
- If one of your partitions goes away all of your data goes away. Yikes!
- It doesn’t sound like you can partition federated tables.
- Paritioning does work with replication.
- You can use it with auto-increment primary keys.
This is a very cool feature for people who have very large data sets and are looking to break those data sets up into smaller data sets. For instance, if you have a table with 1m records in it you could split it up into two data sets of 500k records in each data set, which would reduce the number of rows MySQL needs to scan during queries.
This specifically about new features in the 5.1 release. As much as I like Australia I think it’s retarded to be talking about that at a MySQL conference, though it is funny I suppose.
- They now support variable sized rows, which reduces memory usage. This equates to more rows per gigabyte.
- Add/Update Index has been optimized over 5.0. Before it copied the entire table, added the new index and then moved it back over the old table. This all happened over the wire so you can imagine how long that took. This optimization speeds things up about four times.
- 5.1 now allows you to replicate across clusters. Used for geographical redundancy, split the processing load (why not add more nodes to the cluster?), etc. You have to use the row-based replication to enable this feature.
- Failover of replication channels is manual.
- They’re adding support for data on disk in 5.1 and indexes on disk in the future.
This absolutely amazes me. The moral of the story at MySQL UC has been that clustering is a nightmare to set up, maintain and use. However, storing data on disk is a step in the right direction. The solution from Continuent seems to be infinitely more elegant than both clustering and replication.
- MySQL 5.0 supports auto-increment variables for bi-directional replication to avoid auto-increment collisions.
- MySQL 5.0 replicates character sets and time zones.
- 5.0 now replicates stored procedures, triggers and views.
- MySQL 5.1 introduces row-based logging and replication (RBR)
- Dyanmic switching of binary log format between
- 5.1 allows for cluster replication.
- Replication method cannot be configured per table, but since it’s dynamic you can change it from the client before a transaction, etc.
auto_increment_increment you can change starting points and how many you step between. Works with most table types. Works with InnoDB and MyISAM. This is specifically for multi-master setups.
- RBR allows clusters to replicate and also allows the server to replicate non-deterministic statements such as LOAD_FILE(). That being said you can’t have different table definitions on the slave as you can with SBR.
Wow, nothing super impressive here that makes me excited about new features in MySQL’s 5.0/5.1 replication. Their answer to the possible auto_increment collisions seems a bit simplistic and short-sighted, but then again I’m not a maintainer for MySQL’s replication code.
What annoys me greatly is MySQL’s refusal to simply add features to storage engines in favor of simply adding new storage engines. MyISAM doesn’t support transactions or foreign key constraints? Use InnoDB. InnoDB doesn’t support FULLTEXT? Use MyISAM. You need synchronous replication of data? Use NDB, but you need to denormalize and reduce your use of JOIN’s.
It’s enough to make me switch to PostgreSQL.
Today I thought I’d check out session by the CTO of Continuent about their clustering solution. Before heading into the session we checked out Continuent’s booth in the exhibit hall. It certainly sounds like a great product, however the $5,000 per CPU licensing seems a bit Oracle’ish to me.
- A share nothing architecture. Split up into two layers; the controllers and the actual database nodes.
- Low latency.
- Single-copy equivalence for reads.
- Supports load balancing and heterogeneous databases (ie. MySQL and SQL Server sharing the same cluster).
- 100% Java and based on Sequoia.
- They might have an open source version. Will need to check this out.
- Fully transparent (including failover) to the application.
- The controllers act as a proxy to the database. To the applications they appear as the actual database. Below these controllers is where the databases actually sit.
- You need to compile/load your their specific driver, which I think would require a near PEAR DB driver.
- Requests come in where it determines what type of request it is (read vs. write). It then broadcasts writes to all controllers. All requests are executed in identical order. From there it’s sent to the scheduler, which makes sure the underlying databases remain in identical states. These requests can be sent in parrallel to each database server.
- The request controller then aggregates the responses from the databases. If there was an error across all of the systems then it’s a bad query of some sort, if only a single node respondes with an error then that systems is dropped. This can be configured to return a success message once one of the underlying nodes responds with a success.
- Reads are simply load balanced across the underlying nodes to the node with the least number of requests (sounds like LVS’s weighted least connection algorithm).
- Works with MyISAM, InnoDB and heap table types.
- The commits are synchronous.
- Once you have a dump with a starting date you can apply the dump to the new node and then the cluster controllers apply the logs until it’s up to date.
- The open source solution only supports ANSI SQL, doesn’t come with their own group talk protocol (so the clusters can communicate with each other) and doesn’t come with the database-specific dump and load.
- The cluster controllers keep track of the position in the overall sequence of each underlying node and sends reads only to those nodes that are up-to-date with the current position.
- The largest cluster they have are four nodes. Has been tested with as many as 64 nodes and supports tiering.
- Failover between cluster controllers happens in the driver level.
- Stored procedures where a “challenge”. Their approach allows you to tell their controllers what is inside of the stored procedure (what tables it changes, etc.) so the controller knows how to handle each procedure.
After sitting in on the MySQL Clustering tutorial I can honestly say this approach is infinitely better. It supports disk write, InnoDB, MyISAM and, on top of that, allows you to tier the clusters. The only downside I see is that it’s written in Java, but these days I’m not sure if that’s really a downside.
The second session today is replication for scaling and high availability, which I’m extremely interested in hearing about as we’ve been having problems with race conditions on our current replication setup.
- Replication works with all table types.
- Any “critical” reads must be done on the master as replication is asynchronous.
- Master will rotate binary logs automatically for every 1G of log records.
- You must purge any old, unused logs yourself. See the
PURGE MASTER LOGS command. Using file system commands is not recommended as there are index files that need to be updated.
- Master and slave must be similar hardware as slaves generally do as much work and, sometimes, more work than the master. In other words, don’t skimp on the slave.
- Keep a “spare” slave that you can take down and clone to bring up new slaves.
- Use a load balancer for managing access to slaves.
- There is no raw limit to the number of slaves that a master can host, but each slaves takes up one connection so you’re limited by your max connections on the master.
- You can create “relay” slaves that relay data from the master to slaves sitting below them. Doing this increases replication delays though.
- If a slave will also be a master (see above) you need to enable
- Relying on the
master-* lines in
my.cnf can be problematic over using
CHANGE MASTER SQL statement. The slave rememebers master information.
- Datacenter failover involves the www connecting through a proxy, which then decides who the current writeable master is.
- Should avoid DNS when doing inter-datacenter failover. DNS caching, propagation, etc.
- Using proxy allows instantaneous and complete switching of traffic between masters. Another option is stunnel.
- A connection comes through A and goes directly to that DB server. On B it comes through and is forwarded to A instead of going to B. None of the connections connect directly to either A or B.
- Writes go to only one master at a time. The firewall rules ensures that.
- Both masters should use
skip-slave-start and read-only. Entire setup should come up
read-only at boot. Essentially, make sure they come up in the state they failed in.
- Need to handle “Server is read only” and “Connection refused” in your application when failover is occurring.
- Master<->Master replication is only a problem if you’re writing to both masters at the same time.
- They recommend switching traffic between datacenters manually as it’s difficult to determine when to actually failover and not actually automatically failing over.
- Steps for failing over
- Set current master to read-only
- Wait for writes to flush through. Checking how far behind the other master is up-to-date. A way to do this is by creating a table on the master every second and seeing how long it takes to show up on the slave.
- Remove forward rules with old settings
- Add fowarding rules with new settings
- Remove read-only from new master using
SET GLOBAL read_only
- Make sure to STOP SLAVE on the new master when master fails. When you bring it back up check the data first. If all is well then allow replication on the failed master to catch up from the new master. If not then restore from the new master.
- Reverse the above steps to bring the master back up (usually done during a planned failover later). You’ll remain live against the new master until then.
Today is the 1st day of MySQL UC and I’m sitting in on the clustering tutorial, which promises to be three hours of information on MySQL’s clustering feature. I’m sure you’re thrilled that I’m, ahem, live blogging this. In reality, this is just so I have a central place for my notes.
- Broken up into three parts. The MySQL servers sit separate from the NDB Storage Engine, which are storage nodes (NDB nodes). The third part is called a management server. The management server, oddly enough, isn’t required once the cluster is up and running unless you want to add another storage node.
- Memory based storage engine. If you’re not using 5.1+ then you must have enough RAM in each storage node to store the data set. This means that if you have four machines with 4GB of RAM each you can store 8GB of data (16GB of total storage divided by two for two copies of the data set).
- Storage nodes are static and pre-allocate resources on startup.
- Supports transactions and row level locking.
- Should be noted this is a storage engine so you can’t create MyISAM or InnoDB tables inside of a cluster.
- Uses fixed sized records. This means if you have a varchar(255) and put a single byte into it that field is still using 255 bytes.
- No foreign key constraint support.
- Replication across nodes is syncronous across nodes. I assume this means that an INSERT happens once all of the nodes have completed the INSERT. This is different than regular replication which is asyncronous and introduces race conditions.
- Tables are divided into fragments (one fragment for each storage node). Each storage node is responsible for each fragment. Each fragment also has a secondary fragment, which is a copy of another node’s primary fragment. This data distribution happens automatically.
- NDB takes your primary key, creates a hash and then converts that to a fragment. So you’ll have various rows on each different storage node.
- A node group is a set of nodes that share the same fragment information. If you lose an entire node group you’ve lost half of the table and the cluster will not continue to operate. However, if one node in a node group fails there will still be enough data to keep the cluster up and running.
- If a node fails and it’s secondary counterpart takes over it will, essentially, have to perform the job of two nodes. Until a node has fully recovered it will not rejoin the cluster.
- Backups are hot and non-locking. Each node writes its own set of backup files. No support for incremental backups.
- Because it’s memory based you could lose data on a system crash (as you might have transactions sitting in RAM when a crash occurs). The COMMIT command does not write changes to disk meaning that you could have data sitting in memory that’s not on disk when a node crashes. This means the odd truth is that MySQL Clusters support syncronous replication, but are not atomic.
- NDB nodes will checkpoint data to disk (data + logs), which are used for system recovery. They write two logs, the UNDO and REDO logs.
- They recommend using TRUNCATE to delete all rows from a table.
- Modification operations are distributed to both the primary and secondary fragments (obviously).
- NDB will run on 64-bit machines. They recommend Dual CPU 64-bit machines. NDB is threaded. Application nodes (MySQL servers) can be whatever.
- SCI offers 30-100% better performance over gigabit.
- They actually recommend avoiding joins and to denormalize your schemas. Are you kidding me? He actually said “Performance for joins sucks.”
Overall, I’m underwhelmed by MySQL Clustering. You’re limited in storage with the RAM and you can’t optimize your schemas due to fixed field sizes. And any RDBMS “solution” that recommends you denormalize puts me off.
That being said the actual technology is pretty interesting and I suspect that in a few years we’ll see the clustering features in MySQL come into their own. As of now I suspect few people would be able to justify the sacrifices for the gains clustering allows.