Category Archives: Web

Use the cache, Luke, Part 2: don’t put all your eggs into the memcached buck … basket

This is the second part of a series called: Use the cache, Luke. If you missed the first part, here it is: From memcached to Membase memcached buckets. Meanwhile, the AWS ElastiCache service proved to have better network latency than our own rolled out Membase setup, therefore the migration was easily done by simply switching the memcached config. No vendor lock in.

However, it took me a while to write this second part.

If you can see this, then you might need a Flash Player upgrade or you need to install Flash Player if it’s missing. Get Flash Player from Adobe. This error may appear if the URL path to the embedded object is broken or you have connectivity issue to the embedded object. Powered BY XVE Various Embed.

Please have a look at the above video. Besides the general common sense guidelines about how to scale your stuff, and the Postgres typical stuff, there’s a general rule: cache, cache, and then cache some more.

However, too much caching in memcache (whatever implementation) may kill the application at some point. The application may not be database dependent, but it is cache dependent. Anything that affects the cache may have the effect of a sledgehammer on your database. Of couse, you can always scale vertically that DB instance, scale horizontally by adding read-only replicas, but the not-so-fun part is that it costs a lot just to have the provisioned resources in order to survive a cache failure.

The second option is to have a short lived failover cache on the application server. Something like five minutes, while the distributed cache from memcache may last for hours. Enough to keep the database from being hit from live traffic, while you don’t have to provision a really large database instance. Of course, it won’t work with stuff that needs some “real time” junk, but it works with data that doesn’t change with each request.

There are a lot of options for a failover cache since there’s no distributed setup to think about. It may be a memcached daemon, something like PHP’s APC API, or, the fastest option: the file based caching. Now you may think that I’m insane, but memcached still has the IPC penalty, especially for TCP communication, while if you’re a PHP user, APC doesn’t perform as expected.

I say file based caching, not disk based caching, as the kernel does a pretty good job at “eating your RAM” with the disk caching stuff. It takes more to implement it since the cache management logic must be implemented into the application itself, you don’t have stuff like LRU, expiration, etc. by default, but for failover reasons, it is good enough to worth the effort. In fact, it ran for a few days on the failover cache without any measurable impact.

The next part for not using the same basket for all of your eggs is: cache everywhere you can. For example, by using the nginx FastCGI cache, we could shave off 40% of our CPU load. Nothing experimental about this last part. It is production for the last 18 months. If you get it right, then it could be a really valuable addition to a web stack. However, a lot of testing is required before pushing the changes to production. We hit a lot of weird bugs for edge cases. The rule of thumb is: if you get the cache key right, then most of the issues are gone before going live.

In fact, by adding the cache control stuff from the application itself, we could push relatively shortly lived pages to the CDN edges, shaving off a lot of latency for repeated requests as there’s no round trip from the hosting data center to the CDN edge. Yes, it’s the latency, stupid. The dynamic acceleration that CDNs provide is nice. Leveraging the HTTP caching capabilities is nicer. Having the application in a data center closer to the client is desirable, but unless your target market is more distributed than having a bunch of machines into the same geo location, it doesn’t make any sense to deploy into a new data center which adds its fair share of complexity when scaling the data layer.

Use the cache, Luke, Part 1: from memcached to Membase memcached buckets

I start with a quote:

Matt Ingenthron said internally at Membase Inc they view Memcached as a rabbit. It is fast, but it is pretty dumb and procreates quickly. Before you know it, it will be running wild all over your system.

But this post isn’t about switching from a volatile cache to a persistent solution. It is about removing the dumb part from the memcached setup.

We started with memcached as this is the first step. The setup had its quirks since AWS EC2 doesn’t provide by default a fixed addressing method while the memcached client from PHP still has issues with the timeouts. Therefore, the fallback was the plain memcache client.

The fixed addressing issue was resolved by deploying Elastic IPs with a little trick for the internal network, as explained by Eric Hammond. This might be unfeasible for large enough deployments, but it wasn’t our case. Amazon introduced ElastiCache since then which removes this limitation, but having a bunch of t1.micros with reservation is still way much cheaper. Which makes me wonder why they won’t introduce machine addresses which internally resolve as internal address. They have this technology for a lot of their services, but it is simply unavailable for plain EC2 instances.

Back to the memcached issues. Having a Membase cluster that provides a memcached bucket is a nice drop-in replacement, if you lower a little bit your memory allocation. Membase over memcached still has some overhead as its services tend to occupy more RAM. The great thing is that the cluster requires fewer machines with fixed addressing. We use a couple for high availability reasons, but this is not the rule. The rest have the EC2 provided dynamic addresses. If a machine happens to go down, another one can take up its place.

But there still is the client issue. memcached for PHP is dumb. memcache for PHP is even dumber. None of these can actually speak the Membase goodies. This is the part where Moxi (Memcached Proxy) kicks in. For memcached buckets, Moxi can discover the newly added machines to the Membase cluster without doing any client configuration. Without any Moxi server configuration as the config is streamed to the servers via the machines that have the fixed addresses. With plain memcached, every time there was a change, we needed to deploy the application. The memcached cluster was basically nullified till it was refilled. Doesn’t happen with Moxi + Membase. Since there no “smart client” for PHP which includes the Moxi logic, we use client side Moxi in order to reduce the network round-trips. There still is a local communication over the loopback interface, but the latency is far smaller than doing server-side Moxi. Basically the memcache for PHP client connects to aka where Moxi lives, then the request hits the appropriate Membase server that holds our cached data. It also uses the binary protocol and SASL authentication which is unsupported by the memcache for PHP client.

The last of the goodies about the Membase cluster: it actually has an interface. I may not be an UI fan, I live most of my time in /bin/bash, but I am a stats junkie. The Membase web console can give you realtime info about how the cluster is doing. With plain memcached you’re left in the dust with wrapping up your own interface or calling stats over plain TCP. Which is so wrong at so many levels.

PS: v2.0 will be called Couchbase for political reasons. But currently the stable release is still called Membase.

Why sometimes I hate RFCs

Every time when there’s a debate about the format of something that floats around the Internets, people go to RFCs in order to figure out who’s right and who’s wrong. Which may be a great thing in theory. In practice, the rocket scientists that wrote those papers might squeeze a lot of confusion into a single page of text, as the G-WAN manual states.

Today’s case was a debate about the Expires header timestamps as defined by the HTTP/1.1 specs (RFC 2616). If you read the 14.21 section regarding the Expires header, you can see the following statement:

The format is an absolute date and time as defined by HTTP-date in section 3.3.1; it MUST be in RFC 1123 date format:

Expires = “Expires” “:” HTTP-date

I made a newb mistake in thinking that the RFC 1123 dates are legal Expires timestamps. Actually, by proof reading 3.3.1 of RFC 2616 you may deduce the following: the dates in use by the HTTP/1.1 protocol are not the dates into the RFC 1123 format, but the actual format is a subset of RFC 1123. The debate started around the GMT specification which in the HTTP/1.1 contexts is actually UTC, but it must be specified as GMT anyway. Even more, +0000 which is valid timezone specifier as defined by RFC 1123 is not valid for Expires timestamps. Although some caches accept +0000 as valid timezone specifier for the HTTP timestamps, some of them don’t.

It isn’t that the RFCs are broken per se, but the language they use can be very confusing sometimes.

nginx + PHP-FPM for high loaded virtual private servers

The VPSes that use the OS level virtualization have some challenges that you won’t find in setups that use full OS virtualization / paravirtualization. Stuff like OpenVZ doesn’t provide you a method for tuning OS parameters which may impair certain setups such as the above mentioned into the article title. All the stuff mentioned in my previous article applies here as well, except the sysctl tuning part. It would be redundant to mention those tips again.

However, the thing that applies differently is the fact that the maximum connection backlog on a OpenVZ VPS is typically 128, therefore you can not increase the listen.backlog parameter of PHP-FPM over that value. The kernel denies you that. Even more, due to the low memory setups of a lot of VPSes out there in the wild, you may be forced to use a low value for pm.max_children which translates this into a shorter cycle for the process respawn. A process that is respawning, can’t handle incoming connections, therefore the heavy lifting is done by the kernel backlog. The concurrency may not increase linearly with the number of available processes into the FPM pool because of this consequence.

Since the backlog is kept by the kernel, it is common for all services, therefore it may affect nginx as well if it can not accept() a connection for some reason. Load balancing over UNIX sockets doesn’t increase the possible concurrency level. It simply adds useless complexity. Using TCP sockets may increase the reliability of the setup, at the cost of added latency and used up system resources, but it would fail in the end. Some measures need to be taken.

I gave a thought about using HAProxy as connections manager since HAProxy has a nice maxconn feature that would pass the connections to an upstream up to a defined number. But that would add another layer of complexity. It has its benefits. But it also means that’s another point of failure. Having more services to process the same request pipeline is clearly suboptimal if the services composing the request pipeline won’t add some clear value to the setup, for example the way a proxy cache does.

Then I thought about node.js, but implementing FastCGI on top of node seems a hackish solution at best. One of the Orchestra guys did it, but I wouldn’t go into production with a node.js HTTP server for static objects and FastCGI response serving, no matter how “cool” this may sound. Then the revelation hit me: Ryan Dahl, the author of node.js, wrote a nginx module that adds the maxconn feature to nginx: nginx-ey-balancer. Hopefully, someday this would make it into upstream.

The module adds some boilerplate to the configuration since it requires a fastcgi_pass to an upstream unlike direct fastcgi_pass to an UDS path, but otherwise that this, it works as advertised. Although the module wasn’t actively maintained, or, at least this is how things look from outside, the v0.8.32 patch works even for nginx v1.0.2. Having nginx to act as connection manager instead of sending all the connections straight to the FPM upstream may have clear benefits from the concurrency point of view. It is recommended to set max_connections to the value of net.core.somaxconn. That guarantees the fact that no connection gets dropped because the FPM pool processes are respawning due to a short cycle policy.

By using this module, nginx could handle easily around 4000 concurrent connections for a 4 worker process setup, but increasing the workers number does not increase linearly the possible concurrency. Anyway, at that concurrency level, most probably the issues caused by the IPC between nginx and PHP-FPM would be your last concern. This setup simply removes an IPC limit which is ridiculously small most of the times.

nginx + PHP-FPM for high loaded websites

The title of the post is quite obvious. Apache2 + mod_php5 lost the so called “crown” for quite a while. The fist eye opener was a thing that got me pretty annoyed back in 2008. Dual-quad machine, 4 GiB of RAM, RAID 5 SAS @ 10k RPM – the server was looking pretty good, apparently. The only show stopper: Apache2 + mod_php5 that choked at 350 concurrent clients. What? No more free RAM? WTF? LAMP is not my cup of tea for obvious reasons. LEMP seems to be more appropriate while I ditched Apache from all the production servers.

Since then, I keep telling people that Apache2 + mod_php5 + prefork MPM is a memory hog, while most of the stuff comes from the brain-dead client connection management that Apache does. One process per connection is a model that ought to be killed. Probably the best bang for the buck is using nginx as front-end for serving the static objects and buffering the response from Apache in order to spoon-feed the slow clients. But here’s the kicker: for virtual hosting setups, Apache/prefork/PHP is pretty dull to configure safely aka isolate the virtual hosts. Packing more applications together over a production cluster is an obvious way for doing server consolidation.

There’s mod_fcgid/FastCGI … but nginx supports FastCGI as well. Therefore, cutting off the middle man was the obvious solution. By using this setup, you won’t lose … much. PHP-FPM was an obvious solution as the default FastCGI manager that used to come as the sole solution from the PHP upstream is pretty dumb. Instead of having a dedicated PHP service for each virtual host, one can have a dedicated process pool for each virtual host. Believe me, this method is much easier to maintain.

While nginx comes pretty well tuned by default, except the open file cache settings, for PHP-FPM you need to tune some essential settings. nginx has a predictable memory usage. I can say that most of the time the memory used by nginx is negligible compared to the memory used by the PHP subsystem. However, PHP-FPM has a predictable memory consumption if used properly. The process respawn feature ought to be used in order to keep the memory leaks under control. pm.max_requests is the option that you need to tune properly. One may use 10000 served request before respawning, but under very restricted memory requirements I even had to use 10 (very low memory VPS).

The pm.max_children option may be used to spawn an adaptive number of processes, based onto the server load, but IMHO that might overcommit the system resources into the worse case scenario. Having a fixed number of processes per pool is preferred. Usually I have a rough estimation of the memory consumption in order to keep all the runtime into the RAM. Thrashing the memory is not something you would want onto a loaded web server. For everything else, there’s Master … cough! munin.

The Inter-Process Communication is preferred over an UNIX Domain Socket. Unlike the TCP sockets, the UDS doesn’t have the TCP overload that the IPC has even for connections over the loopback interface. For small payloads, UDS might have a 30% performance boost due to lower latency than the TCP stack. Rememer: you can’t beat the latency. For larger payloads, the TCP latency has a lower impact, but it’s still there. Another nice thing about the UDS is the namespace. UDS uses the filesystem for defining a new listening socket. Under Linux, classic ACLs may be used for restricting what user can read / write to the UDS. BSD systems may be more permissive for this kind of stuff. The TCP sockets require a numerical port that can’t be used for something else while the management from an application that generates the configuration files is more difficult for this kind of settings. For UDS the socket name can be derived from the host name. As I said into a previous article, I don’t write the configuration files by hand. Having a decent way of keeping the IPC namespace is always a plus.

Another thing that you should take care of is the listen.backlog option of PHP-FPM. Using TCP sockets seem to be more reliable for PHP-FPM. However, that’s untrue if you dig enough. The IPC starts to fail around 500 concurrent connections, while for UDS this happens way faster aka for an 1 process pool, you can serve at most 129 concurrent clients. 129 is not a random number. PHP-FPM can keep 128 connections into its backlog while the 129th connection is the active process. The default listen.backlog for Linux is 128, although the PHP-FPM documentation may state -1 aka the maximum allowed value by the system. Taking a peek at the PHP-FPM source code reveals this (sapi/fpm/fpm/fpm_sockets.h):

  On FreeBSD and OpenBSD, backlog negative values are truncated to SOMAXCONN
#if (__FreeBSD__) || (__OpenBSD__)

The default configuration file that is distributed with the PHP-FPM source tree states that the value is 128 for Linux. The statement that it defaults to -1 gave me a lot of grief as I though the manual won’t give me rubbish instead of usable information. However, since PHP 5.3.5 you may debug the configuration by using the -t flag for the php-fpm binary. You can use it like:

php-fpm -tt -y /path/to/php-fpm.conf

The doube t flag is not a mistake. If you’re using NOTICE as the debug level, the double t testing level prints the internal values of the PHP-FPM configuration:

php-fpm -tt -y /etc/php-fpm/php-fpm.conf
[08-May-2011 20:53:26] NOTICE: [General]
[08-May-2011 20:53:26] NOTICE:  pid = /var/run/
[08-May-2011 20:53:26] NOTICE:  daemonize = yes
[08-May-2011 20:53:26] NOTICE:  error_log = /var/log/php-fpm.log
[08-May-2011 20:53:26] NOTICE:  log_level = NOTICE
[08-May-2011 20:53:26] NOTICE:  process_control_timeout = 0s
[08-May-2011 20:53:26] NOTICE:  emergency_restart_interval = 0s
[08-May-2011 20:53:26] NOTICE:  emergency_restart_threshold = 0
[08-May-2011 20:53:26] NOTICE:
[08-May-2011 20:53:26] NOTICE: [www]
[08-May-2011 20:53:26] NOTICE:  prefix = undefined
[08-May-2011 20:53:26] NOTICE:  user = www-data
[08-May-2011 20:53:26] NOTICE:  group = www-data
[08-May-2011 20:53:26] NOTICE:  chroot = undefined
[08-May-2011 20:53:26] NOTICE:  chdir = undefined
[08-May-2011 20:53:26] NOTICE:  listen = /var/run/php-fpm.sock
[08-May-2011 20:53:26] NOTICE:  listen.backlog = -1
[08-May-2011 20:53:26] NOTICE:  listen.owner = undefined
[08-May-2011 20:53:26] NOTICE: = undefined
[08-May-2011 20:53:26] NOTICE:  listen.mode = undefined
[08-May-2011 20:53:26] NOTICE:  listen.allowed_clients = undefined
[08-May-2011 20:53:26] NOTICE:  pm = static
[08-May-2011 20:53:26] NOTICE:  pm.max_children = 1
[08-May-2011 20:53:26] NOTICE:  pm.max_requests = 0
[08-May-2011 20:53:26] NOTICE:  pm.start_servers = 0
[08-May-2011 20:53:26] NOTICE:  pm.min_spare_servers = 0
[08-May-2011 20:53:26] NOTICE:  pm.max_spare_servers = 0
[08-May-2011 20:53:26] NOTICE:  pm.status_path = undefined
[08-May-2011 20:53:26] NOTICE:  ping.path = undefined
[08-May-2011 20:53:26] NOTICE:  ping.response = undefined
[08-May-2011 20:53:26] NOTICE:  catch_workers_output = no
[08-May-2011 20:53:26] NOTICE:  request_terminate_timeout = 0s
[08-May-2011 20:53:26] NOTICE:  request_slowlog_timeout = 0s
[08-May-2011 20:53:26] NOTICE:  slowlog = undefined
[08-May-2011 20:53:26] NOTICE:  rlimit_files = 0
[08-May-2011 20:53:26] NOTICE:  rlimit_core = 0
[08-May-2011 20:53:26] NOTICE:
[08-May-2011 20:53:26] NOTICE: configuration file /etc/php-fpm/php-fpm.conf test is successful

This stuff is not documented properly. I discovered it by having a nice afternoon at work, reading the PHP-FPM sources. That could save me some hours of debugging the internal state of PHP-FPM by other means. Maybe, for the 1st time, saying “undocumented feature” doesn’t sound like marketing crap implying “undiscovered bug”.

You may use listen.backlog = -1 for the system to decide, or you may use your own limit. -1 is a valid value as the listen(3) man page says. I am planning for opening a new issue as -1 is a more appropriate default for Linux as well. However, please keep in mind that a high backlog value may be truncated by the Linux kernel. For example, under Ubuntu Server this limit is … 128. The same manual page for listen(3) states that the maximum value for the backlog option is the SOMAXCONN value. While reading the Linux kernel sources is not exactly toilet reading, I could find the exact implementation of the listen syscall (net/socket.c):

 *      Perform a listen. Basically, we allow the protocol to do anything
 *      necessary for a listen, and if that works, we mark the socket as
 *      ready for listening.
SYSCALL_DEFINE2(listen, int, fd, int, backlog)
        struct socket *sock;
        int err, fput_needed;
        int somaxconn;
        sock = sockfd_lookup_light(fd, &err, &fput_needed);
        if (sock) {
                somaxconn = sock_net(sock->sk)->core.sysctl_somaxconn;
                if ((unsigned)backlog > somaxconn)
                        backlog = somaxconn;
                err = security_socket_listen(sock, backlog);
                if (!err)
                        err = sock->ops->listen(sock, backlog);
                fput_light(sock->file, fput_needed);
        return err;

In plain English: the backlog value can not be higher than net.core.somaxconn. In order to be able to queue more idle connections into the kernel backlog, you ought to inrease the SOMAXCONN value:

root@localhost~# sysctl net.core.somaxconn=1024

The sysctl utility however modifies this value till the system is rebooted. In order to make it persistent, you have to define it as a new file into /etc/sysctl.d/. Or at least, using sysctl.d is recommended as it keeps the configuration to be more structured. I used /etc/sysctl.d/10-unix.conf:


for having 1024 queued connections per listening UDS + the number of active connections that equals the size of the process pool. Remember that you need to restart the PHP-FPM daemon for the new backlog setting to be enabled. You may increase the limit as the usage model seems fit. Since nginx doesn’t queue any FastCGI connections, you need to be very careful about this setting. All the requests go straight to the kernel backlog. If there’s no more room for new connections, a 502 response is returned to the client. I can safely assume that you would like to avoid this.

Another thing that you should take care of for the number of idle connections to the PHP upstream is the fact that nginx opens a file descriptor for each UDS connection. If you increase too much the SOMAXCONN limit without increasing the number of allowed file descriptors per process, you will run into 502 errors as well. By default, a process may open up to 1024 file descriptors. Usually I increase this limit by adding a ulimit -n $fd_value to the init script of a certain service instead of increasing this limit as system wide.

You may want to buffer the FastCGI response in nginx as well. Buffering the response doesn’t tie the upstream PHP process for longer than needed. As nginx properly does the spoon-feeding to slow clients, the system is free to process more requests from the queue. fastcgi_buffer_size and fastcgi_buffers are the couple of options that you need to tune in order to fit your application usage mode.

Update (Aug 24, 2011):

Increasing the SO_SNDBUF also helps. Writes to the socket won’t block as it would be the kernel’s job to stream the data to the clients. For a server with enough memory, nginx could be free to do something else. The socket(7) man page comes to the rescue in order to demystify the SO_SNDBUF concept. Basically net.core.wmem_max is the one to blame when writes to the socket are blocking. By default the net.core.wmem_max is 128k which is very small for a busy server. If the server has a fat network pipe available, then you can get some more hints here: Linux Tuning. It may not be the case for most EC2 scenarios where the networking is shared. Therefore smaller buffers will do just fine. But it may be the case if you’re playing like me with toys that have dual 1G network interfaces.