Only the best powering Lightning

It is Sunday night, CBC is going well and there are no server hiccups at all, so I thought I’d take a bit of time to post about some of the benchmarks we’ve hit with Lightning. Lightning is the name we are giving our new platform. Not only does it sound better, it also works with a few other products that are coming out that support Lightning. Lightning is a name that has meaning for the goals we are looking to accomplish.

Lightning is a name that could mean a lot on the market, but the platform would have to live up to it. You can’t have a slow API called Lightning; it would be like trying to call an elephant skinny. I now had a new goal: make Lightning a lightning-fast API. The original builds of Lightning, powered by Apache, PHP 5.3 and MySQL + Memcached, didn’t make me jump in the air and yell out “yay, it’s lightning fast”. I’ll be honest, using that configuration, our API calls averaged 500 ms at about 10k requests per minute. Running that required six 1GB web nodes and a MySQL cluster with 1 master and 2 slaves. That is something we had to improve, with our goal being around 150ms.

Each API call requires the authentication of the call, authentication of the user, logging the user activity, then processing the request. Some requests are simple, while others are far more complex. Complex requests take more time to process: they may have more DB calls or need more data processing. Whatever the reason they are slower, the more complex API calls have been the focus of all our tests and improvements. Finding ways to improve our most complex calls would immediately improve the latency of all our calls.
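To make that four-step flow concrete, here is a minimal sketch in Python (not the PHP that Lightning actually runs). Every name in it, from the credential tables to the `echo` action, is a hypothetical stand-in chosen for illustration. It only shows the order of the steps described above.

```python
class AuthError(Exception):
    """Raised when the API key or the user token is rejected."""


# Hypothetical stand-ins for real credential stores and an activity log.
VALID_API_KEYS = {"demo-key"}
VALID_USER_TOKENS = {"token-alice": "alice"}
ACTIVITY_LOG = []


def authenticate_call(request):
    # Step 1: is the calling application allowed to use the API at all?
    if request.get("api_key") not in VALID_API_KEYS:
        raise AuthError("unknown API key")


def authenticate_user(request):
    # Step 2: which user is this call made on behalf of?
    user = VALID_USER_TOKENS.get(request.get("user_token"))
    if user is None:
        raise AuthError("unknown user token")
    return user


def log_activity(user, request):
    # Step 3: record what the user did before doing the real work.
    ACTIVITY_LOG.append((user, request["action"]))


def process(user, request):
    # Step 4: the request itself; complex calls spend most of their time here.
    if request["action"] == "echo":
        return {"user": user, "echo": request.get("payload", {})}
    raise ValueError("unsupported action: " + request["action"])


def handle(request):
    authenticate_call(request)
    user = authenticate_user(request)
    log_activity(user, request)
    return process(user, request)
```

Because the first three steps run on every call, any time shaved off them benefits simple and complex requests alike, while the fourth step is where the complex calls differ.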

I was on a quest.  A quest to improve request times, but also work on cutting infrastructure costs, as well as ensuring our infrastructure scales.  As cliché as the saying “expect the unexpected” is, you really do want to try to plan for the worst, and if possible even plan for what happens when you exceed the worst.  This is what we’ve done with a lot of what we have for Lightning, including the technology currently powering it.

Apache is a hog, anyone can tell you that. But Apache does a lot of things well, which is why it is so widely used. However, NGINX has a clear advantage over Apache: its footprint is tiny and it queues requests. NGINX workers are great and are highly configurable. As for PHP, we are using PHP-FPM, the FastCGI Process Manager for PHP, which is super fast. Just to make sure we didn’t miss anything, we turned on, tweaked and tested the APC cache to make sure that all our opcode is optimized, fast, and cached. I really don’t have much else to say on PHP-FPM other than WOW. This stuff is so good that PHP will be shipping with it by default in upcoming versions.

We ran some pretty ridiculous tests on our web environment. The results between Apache and NGINX were night and day. Apache crashed while attempting to handle more than 100 requests per second, whereas we exceeded 1000 requests per second with NGINX while still only using 25% of our memory and about 10% CPU. As things got heavy for NGINX, rather than denying our requests, it would queue them up and process them eventually. While the queue would eventually get so large that some calls wouldn’t be satisfied, it gives us room to play with, rather than panic because we are down.
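The difference between denying requests and queueing them can be illustrated with a toy model. This is a sketch under simplifying assumptions, not a description of NGINX internals: time advances in ticks, the server handles a fixed number of requests per tick, and `simulate`, `capacity_per_tick` and `max_queue` are hypothetical names.

```python
from collections import deque


def simulate(arrivals_per_tick, capacity_per_tick, max_queue):
    """Return (served, dropped, still_queued) for a burst of traffic."""
    queue = deque()
    served = 0
    dropped = 0
    for arriving in arrivals_per_tick:
        # New requests wait in the queue, or are dropped once it is full.
        for _ in range(arriving):
            if len(queue) < max_queue:
                queue.append(1)
            else:
                dropped += 1
        # The server then works through as much of the queue as it can.
        for _ in range(min(capacity_per_tick, len(queue))):
            queue.popleft()
            served += 1
    return served, dropped, len(queue)


burst = [5, 20, 5, 0, 0]  # a spike in the second tick

# No room to wait beyond what one tick can handle: the spike is partly lost.
print(simulate(burst, capacity_per_tick=10, max_queue=10))   # (20, 10, 0)

# A deeper queue absorbs the spike and drains it over the next ticks.
print(simulate(burst, capacity_per_tick=10, max_queue=100))  # (30, 0, 0)
```

The queued requests in the second run finish later than they otherwise would, which is the trade-off: higher latency during a spike instead of failed calls.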

I’ve lived through moments where Apache and PHP gave up and would simply deny the majority of requests. Unfortunately for Apache, even requests for static files had an absolutely massive overhead. NGINX, given that it is still small, just loads up the resources it needs to serve a static file. Scaling Apache has been a pain and, at times, quite expensive for us to maintain. The solution we had to implement was two-fold: add more nodes and ensure that absolutely no static files are served by our services. Naturally, to solve the static file issue, we got a CDN implemented.

Our database layer required a lot of research, tests and development before we finally chose one that would scale and meet all our requirements. Unfortunately, I can’t divulge details on these requirements, but they included what would get us to market faster, what gave us the most scaling potential and what supported our future plans best. Our hosting partner, Joyent, recently started offering Basho Riak NoSQL machines. Looking at that and reading people’s thoughts on it made us look at more available technologies, which led us to look into over a dozen before narrowing the list down to Couchbase, VoltDB, and MongoDB.

MySQL Cluster was great, but still required sharding. While sharding isn’t by any means impossible, it would be a lot of work to convert the number of tables we had, along with updating our database code to support sharding. There would be some obvious problems where data may exist in multiple shards, which would need us to build further logic to support that. This would have been a lot of work just to continue using Percona. This is why I went on to read up on NoSQL and NewSQL.
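For readers who haven’t dealt with sharding, the extra logic looks roughly like the following sketch. It assumes simple hash-based routing with a fixed shard count; the function names and the key format are hypothetical, and this is not the scheme we evaluated.

```python
import hashlib

SHARD_COUNT = 4  # hypothetical number of database shards


def shard_for(key, shard_count=SHARD_COUNT):
    """Map a record key to a shard number, the same way every time."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def group_by_shard(keys, shard_count=SHARD_COUNT):
    """Group keys by shard; more than one group means a cross-shard query."""
    groups = {}
    for key in keys:
        groups.setdefault(shard_for(key, shard_count), []).append(key)
    return groups


# A single-key lookup goes to exactly one shard.
print(shard_for("user:42"))

# A query over several keys may have to visit several shards and merge
# the results in application code.
print(group_by_shard(["user:1", "user:2", "user:3", "user:4", "user:5"]))
```

Every table and every query in the application has to go through routing like this, and anything that joins data living on different shards needs its own merge logic, which is where the work adds up.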

I’ve always loved having a schema, and for what we want to do, it works best. With that said, I was open to trying something different, since we were fans of Memcached and how easy it was to store data. MongoDB fared well, but Couchbase was a clear winner on paper and in the tests. I ran into some issues, but they were likely issues that could be solved with more knowledge of Couchbase. VoltDB did not have a very good PHP extension, which made testing difficult, but the database technology looked extremely promising. In order to run tests, I put on my Java cap and wrote some tests in Java to get it running. The results were quite impressive, including those from the tests they give you when you download the database.

It was a very tough choice between Couchbase and VoltDB. Couchbase would require that we rewrite all of our code and database layers to properly support it. VoltDB would be the more natural fit, but the PHP extension did not at all meet our needs. I talked with the teams from both of those companies to see how they could support us and our goals. VoltDB was the most receptive and offered to build a PHP extension that met our needs and matched the performance characteristics of competing databases’ PHP extensions, along with a far simpler coding style.

The path of implementation for VoltDB was bumpy, due to my limited knowledge of it and the very short implementation period. Probably a recipe for disaster, but it was not. Much of what one will learn about scaling MySQL can be applied to VoltDB. This includes understanding how data is partitioned, how indexes work, and how tables are joined in queries. VoltDB primarily uses stored procedures, with support for AdHoc queries. AdHoc queries aren’t planned ahead of time and have a higher latency because of it. The place where VoltDB differed the most is in our inserts and updates. We’ve opted to use upserts to simplify our code. This level of complexity required that these be written in Java.
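An upsert simply means “update the row if it exists, otherwise insert it”, done as one unit of work. The sketch below shows the idea in Python against an in-memory SQLite database. It is a stand-in for illustration only, not our VoltDB stored procedures (those are Java), and the `users` table and its columns are hypothetical.

```python
import sqlite3


def upsert_user(conn, user_id, name):
    """Update the row for user_id if it exists, otherwise insert it."""
    with conn:  # one transaction: commit on success, roll back on error
        cursor = conn.execute(
            "UPDATE users SET name = ? WHERE user_id = ?", (name, user_id)
        )
        if cursor.rowcount == 0:  # nothing was updated, so the row is new
            conn.execute(
                "INSERT INTO users (user_id, name) VALUES (?, ?)",
                (user_id, name),
            )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT)")

upsert_user(conn, 1, "alice")   # no row yet, so this inserts
upsert_user(conn, 1, "alicia")  # row exists, so this updates it

print(conn.execute("SELECT user_id, name FROM users").fetchall())
# [(1, 'alicia')]
```

The calling code never has to ask whether a record already exists, which is the simplification the upsert buys: the branching lives in one place next to the data.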

We partitioned all our data, retained our indexes from MySQL, and copied all our queries to a project XML or stored procedures. We use the project XML for 90-95% of queries; it allows us to skip the process of making a Java file for a very simple query. The process was quite simple, but time-consuming given the large number of queries and tables we have. Needless to say, we had Lightning running on VoltDB within days of being head down in code. I spent a few sleepless nights to get it to that point, but it was well worth it. I think it’s the most Amp energy drinks I’ve had, ever. They are tasty, but let’s be honest, how good can they be for you?

Skip ahead a few days, and Lightning is running on the new infrastructure and ready for prime time… testing. Running load tests, as you would expect, found bugs and some performance issues. It was a fix-and-retest scenario that went on for a few days. But looking at our stats today, it was well worth it. At around 10k requests per minute, Lightning now averages less than 30ms. That is nearly 17 times faster, or a roughly 94% reduction in latency compared to what we had before! Want to throw double the amount of requests at us over the span of a minute? Sure! We managed to satisfy those requests in 30ms as well. Lightning is built to scale databases out and add web nodes on the fly with ease. Whenever either of those layers needs more capacity to handle more requests, nodes can be added in less than 5 minutes.
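For anyone who wants to check the arithmetic, here it is as a few lines of Python, using the 500 ms and 30 ms averages and the 10k requests per minute quoted in this post. The helper names are just for illustration.

```python
def speedup(before_ms, after_ms):
    """How many times faster the new average latency is."""
    return before_ms / after_ms


def latency_reduction_pct(before_ms, after_ms):
    """Percentage of the old latency that has been removed."""
    return (before_ms - after_ms) / before_ms * 100


before_ms, after_ms = 500, 30
requests_per_minute = 10_000

print(f"{speedup(before_ms, after_ms):.1f}x faster")                         # 16.7x faster
print(f"{latency_reduction_pct(before_ms, after_ms):.0f}% less latency")     # 94% less latency
print(f"{requests_per_minute / 60:.0f} requests per second on average")      # 167 requests per second on average
```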

I am extremely proud of the things we’ve managed to accomplish with Lightning and our infrastructure. I am now far more confident calling this platform “Lightning”, as its speed lives up to the name.