Category Archives: Systems Administration

Snapshots are not backups

Some people will try to plant in your head the idea that by doing snapshots, you're free from the burden of doing proper backups. While this may sound good in theory, in practice there are a bunch of caveats. There are technologies that use the snapshot methodology at the core, but they also make sure that your data isn't corrupted. Some may even provide access to the actual file revisions.

Data corruption is the specific issue that snapshots simply don't care about, at least in Amazon's way of doing things. For EC2 this isn't exactly Amazon's fault: EBS stands for Elastic Block Store. They provide you a block device, and you do whatever you want with it. For RDS they should do a better job though, as it's a managed service where you don't have access to the actual instance. The real issue is those 'specialists' who put emphasis on the 'easy, cloud-ish way' of doing backups by using snapshots. If you're new to the 'cloud' stuff, as I used to be, you may actually believe that crap. As I used to believe.

A couple of real-life examples:

  • An EBS-backed instance suffered some filesystem-level corruption. Since ext3 is not as smart as ZFS when it comes to silent data corruption, you may never know until it's too late. Going back through revisions in order to find the last good piece of data is a pain. I could fix the filesystem corruption and I could retrieve the lost data, but I had to work quite a lot for that. Luck is an important skill, but I'd rather not put all my eggs into the luck basket.
  • An RDS instance ran out of space. There was no notification to tell me: 'yo dumbass, ya ran out of space'. Statistically it shouldn't have happened, but a huge data import proved me wrong. I increased the available storage. Problem solved. A day later, somebody dropped a couple of tables by accident. I had to restore them. How? Take the latest snapshot, spin up a new instance, dig through the data. The latest snapshot contained a couple of databases corrupted by the space issue, one of them being the database I needed to restore. I had to spend a bunch of time repairing the database before the restoration process. Fortunately nothing really bad happened. But it was a signal that the RDS snapshot methodology is broken by design.

Lesson learned. A proper way of doing backups puts the data, not the block storage, first. If you're doing EBS snapshots as the sole method, you may need to rethink your strategy.
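The post doesn't prescribe a specific tool, but a data-first backup can be as simple as dumping the database itself on a schedule. A minimal sketch, where the database name, destination path, and retention count are all made up for illustration:

```shell
#!/bin/sh
# Hypothetical data-first backup: dump the actual data,
# don't just snapshot the block device. Names are illustrative.
DB=myapp
DEST=/var/backups/mysql
STAMP=$(date +%Y%m%d-%H%M%S)

mkdir -p "$DEST"
# --single-transaction gives a consistent InnoDB dump without locking tables
mysqldump --single-transaction "$DB" | gzip > "$DEST/$DB-$STAMP.sql.gz"

# keep the last 7 dumps, drop the rest
ls -1t "$DEST"/"$DB"-*.sql.gz | tail -n +8 | xargs -r rm -f
```

Unlike an EBS snapshot, a logical dump either succeeds or fails loudly, and restoring a single dropped table doesn't require spinning up a whole instance from a possibly-corrupted image.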

nginx + PHP-FPM for heavily loaded virtual private servers

VPSes that use OS-level virtualization have some challenges that you won't find in setups using full OS virtualization / paravirtualization. Stuff like OpenVZ doesn't provide you a method for tuning OS parameters, which may impair certain setups such as the one mentioned in the article title. All the tips from my previous article apply here as well, except the sysctl tuning part; it would be redundant to repeat them.

However, one thing that applies differently is the fact that the maximum connection backlog on an OpenVZ VPS is typically 128, therefore you cannot increase the listen.backlog parameter of PHP-FPM over that value. The kernel denies you that. Even more, due to the low memory of a lot of VPSes out there in the wild, you may be forced to use a low value for pm.max_children, which translates into a shorter process respawn cycle. A process that is respawning can't handle incoming connections, therefore the heavy lifting is done by the kernel backlog. As a consequence, the concurrency may not increase linearly with the number of available processes in the FPM pool.
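You can check the ceiling yourself; on an OpenVZ container the value below is typically fixed at 128 and can't be raised from inside the container:

```shell
# The kernel silently truncates any listen() backlog to this value
cat /proc/sys/net/core/somaxconn
```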

Since the backlog limit is enforced by the kernel, the same cap applies to all services, therefore it may affect nginx as well if it cannot accept() a connection for some reason. Load balancing over UNIX sockets doesn't increase the possible concurrency level; it simply adds useless complexity. Using TCP sockets may increase the reliability of the setup, at the cost of added latency and system resources, but it would fail in the end. Some measures need to be taken.

I thought about using HAProxy as a connection manager, since HAProxy has a nice maxconn feature that passes connections to an upstream only up to a defined number. But that would add another layer of complexity. It has its benefits, but it also means another point of failure. Having more services process the same request pipeline is clearly suboptimal unless each service in the pipeline adds some clear value to the setup, the way a proxy cache does, for example.

Then I thought about node.js, but implementing FastCGI on top of node seems a hackish solution at best. One of the Orchestra guys did it, but I wouldn't go into production with a node.js HTTP server for static objects and FastCGI response serving, no matter how "cool" this may sound. Then the revelation hit me: Ryan Dahl, the author of node.js, wrote an nginx module that adds the maxconn feature to nginx: nginx-ey-balancer. Hopefully, someday it will make it into upstream.

The module adds some boilerplate to the configuration, since it requires a fastcgi_pass to an upstream instead of a direct fastcgi_pass to a UDS path, but other than that, it works as advertised. Although the module isn't actively maintained, or at least this is how things look from the outside, the v0.8.32 patch works even for nginx v1.0.2. Having nginx act as the connection manager, instead of sending all the connections straight to the FPM upstream, has clear benefits from the concurrency point of view. It is recommended to set max_connections to the value of net.core.somaxconn. That guarantees that no connection gets dropped while the FPM pool processes are respawning due to a short respawn cycle.
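A sketch of the resulting configuration; the upstream name and socket path are made up, and max_connections is set to 128 to match the OpenVZ somaxconn ceiling discussed above:

```nginx
# hypothetical names/paths -- adjust to your layout
upstream php_backend {
    server unix:/var/run/php-fpm.sock;
    # nginx-ey-balancer: queue connections in nginx and pass at most
    # this many to the upstream; match it to net.core.somaxconn
    max_connections 128;
}

location ~ \.php$ {
    include fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    fastcgi_pass php_backend;
}
```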

By using this module, nginx could easily handle around 4000 concurrent connections with a 4 worker process setup, although increasing the number of workers does not increase the possible concurrency linearly. Anyway, at that concurrency level, the issues caused by the IPC between nginx and PHP-FPM would most probably be your last concern. This setup simply removes an IPC limit which is ridiculously small most of the time.

nginx + PHP-FPM for heavily loaded websites

The title of the post is quite obvious. Apache2 + mod_php5 lost the so-called "crown" quite a while ago. The first eye opener was a thing that got me pretty annoyed back in 2008. Dual quad-core machine, 4 GiB of RAM, RAID 5 SAS @ 10k RPM – the server was looking pretty good, apparently. The only show stopper: Apache2 + mod_php5, which choked at 350 concurrent clients. What? No more free RAM? WTF? LAMP is not my cup of tea for obvious reasons. LEMP seems more appropriate, and I ditched Apache from all the production servers.

Since then, I keep telling people that Apache2 + mod_php5 + prefork MPM is a memory hog, and most of it comes from the brain-dead client connection management that Apache does. One process per connection is a model that ought to be killed. Probably the best bang for the buck is using nginx as a front-end for serving the static objects and buffering the response from Apache in order to spoon-feed the slow clients. But here's the kicker: for virtual hosting setups, Apache/prefork/PHP is pretty dull to configure safely, i.e. to isolate the virtual hosts. Packing more applications together on a production cluster is an obvious way of doing server consolidation.

There’s mod_fcgid/FastCGI … but nginx supports FastCGI as well. Therefore, cutting off the middle man was the obvious solution. By using this setup, you won’t lose … much. PHP-FPM was an obvious solution as the default FastCGI manager that used to come as the sole solution from the PHP upstream is pretty dumb. Instead of having a dedicated PHP service for each virtual host, one can have a dedicated process pool for each virtual host. Believe me, this method is much easier to maintain.

While nginx comes pretty well tuned by default, except for the open file cache settings, for PHP-FPM you need to tune some essential settings. nginx has predictable memory usage; I can say that most of the time the memory used by nginx is negligible compared to the memory used by the PHP subsystem. PHP-FPM can also have predictable memory consumption if used properly. The process respawn feature ought to be used in order to keep memory leaks under control, and pm.max_requests is the option you need to tune for that. One may use 10000 served requests before respawning, but under very restricted memory requirements I even had to use 10 (very low memory VPS).

The pm.max_children option may be used to spawn an adaptive number of processes based on the server load, but IMHO that might overcommit the system resources in the worst-case scenario. Having a fixed number of processes per pool is preferred. Usually I make a rough estimation of the memory consumption in order to keep the whole runtime in RAM. Thrashing the memory is not something you would want on a loaded web server. For everything else, there's Master … cough! munin.
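Putting the above together, a per-virtual-host pool might look like this; the pool name, user, socket path, and numbers are illustrative, not a prescription:

```ini
; hypothetical pool for one virtual host
[example.com]
user = www-example
group = www-example
; one UDS per vhost, name derived from the host name
listen = /var/run/php-fpm-example.com.sock

; fixed pool size: predictable memory usage, no overcommit under load
pm = static
pm.max_children = 4

; respawn each worker after N requests to keep memory leaks under control
pm.max_requests = 10000
```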

For the inter-process communication, a UNIX domain socket is preferred over a TCP socket. Unlike TCP sockets, UDS doesn't carry the TCP overhead, which exists even for connections over the loopback interface. For small payloads, UDS might give a 30% performance boost due to lower latency than the TCP stack. Remember: you can't beat the latency. For larger payloads, the TCP latency has a lower impact, but it's still there. Another nice thing about UDS is the namespace. UDS uses the filesystem for defining a new listening socket. Under Linux, classic ACLs may be used for restricting which users can read / write to the UDS. BSD systems may be more permissive for this kind of stuff. TCP sockets require a numerical port that can't be used for something else, and managing this kind of setting from an application that generates the configuration files is more difficult. For UDS, the socket name can be derived from the host name. As I said in a previous article, I don't write the configuration files by hand. Having a decent way of keeping the IPC namespace is always a plus.

Another thing that you should take care of is the listen.backlog option of PHP-FPM. Using TCP sockets seems to be more reliable for PHP-FPM. However, that's untrue if you dig enough. TCP connections start to fail at around 500 concurrent connections, while for UDS this happens way sooner: for a 1 process pool, you can serve at most 129 concurrent clients. 129 is not a random number. PHP-FPM can keep 128 connections in its backlog while the 129th connection is being handled by the active process. The default listen.backlog for Linux is 128, although the PHP-FPM documentation may state -1, aka the maximum value allowed by the system. Taking a peek at the PHP-FPM source code reveals this (sapi/fpm/fpm/fpm_sockets.h):

/*
  On FreeBSD and OpenBSD, backlog negative values are truncated to SOMAXCONN
*/
#if (__FreeBSD__) || (__OpenBSD__)
#define FPM_BACKLOG_DEFAULT -1
#else
#define FPM_BACKLOG_DEFAULT 128
#endif

The default configuration file that is distributed with the PHP-FPM source tree states that the value is 128 for Linux. The php.net statement that it defaults to -1 gave me a lot of grief, as I thought the manual wouldn't give me rubbish instead of usable information. However, since PHP 5.3.5 you may debug the configuration by using the -t flag of the php-fpm binary. You can use it like:

php-fpm -tt -y /path/to/php-fpm.conf

The double t flag is not a mistake. If you're using NOTICE as the debug level, the double t testing level prints the internal values of the PHP-FPM configuration:

php-fpm -tt -y /etc/php-fpm/php-fpm.conf
[08-May-2011 20:53:26] NOTICE: [General]
[08-May-2011 20:53:26] NOTICE:  pid = /var/run/php-fpm.pid
[08-May-2011 20:53:26] NOTICE:  daemonize = yes
[08-May-2011 20:53:26] NOTICE:  error_log = /var/log/php-fpm.log
[08-May-2011 20:53:26] NOTICE:  log_level = NOTICE
[08-May-2011 20:53:26] NOTICE:  process_control_timeout = 0s
[08-May-2011 20:53:26] NOTICE:  emergency_restart_interval = 0s
[08-May-2011 20:53:26] NOTICE:  emergency_restart_threshold = 0
[08-May-2011 20:53:26] NOTICE:
[08-May-2011 20:53:26] NOTICE: [www]
[08-May-2011 20:53:26] NOTICE:  prefix = undefined
[08-May-2011 20:53:26] NOTICE:  user = www-data
[08-May-2011 20:53:26] NOTICE:  group = www-data
[08-May-2011 20:53:26] NOTICE:  chroot = undefined
[08-May-2011 20:53:26] NOTICE:  chdir = undefined
[08-May-2011 20:53:26] NOTICE:  listen = /var/run/php-fpm.sock
[08-May-2011 20:53:26] NOTICE:  listen.backlog = -1
[08-May-2011 20:53:26] NOTICE:  listen.owner = undefined
[08-May-2011 20:53:26] NOTICE:  listen.group = undefined
[08-May-2011 20:53:26] NOTICE:  listen.mode = undefined
[08-May-2011 20:53:26] NOTICE:  listen.allowed_clients = undefined
[08-May-2011 20:53:26] NOTICE:  pm = static
[08-May-2011 20:53:26] NOTICE:  pm.max_children = 1
[08-May-2011 20:53:26] NOTICE:  pm.max_requests = 0
[08-May-2011 20:53:26] NOTICE:  pm.start_servers = 0
[08-May-2011 20:53:26] NOTICE:  pm.min_spare_servers = 0
[08-May-2011 20:53:26] NOTICE:  pm.max_spare_servers = 0
[08-May-2011 20:53:26] NOTICE:  pm.status_path = undefined
[08-May-2011 20:53:26] NOTICE:  ping.path = undefined
[08-May-2011 20:53:26] NOTICE:  ping.response = undefined
[08-May-2011 20:53:26] NOTICE:  catch_workers_output = no
[08-May-2011 20:53:26] NOTICE:  request_terminate_timeout = 0s
[08-May-2011 20:53:26] NOTICE:  request_slowlog_timeout = 0s
[08-May-2011 20:53:26] NOTICE:  slowlog = undefined
[08-May-2011 20:53:26] NOTICE:  rlimit_files = 0
[08-May-2011 20:53:26] NOTICE:  rlimit_core = 0
[08-May-2011 20:53:26] NOTICE:
[08-May-2011 20:53:26] NOTICE: configuration file /etc/php-fpm/php-fpm.conf test is successful

This stuff is not documented properly. I discovered it by having a nice afternoon at work, reading the PHP-FPM sources. It could have saved me some hours of debugging the internal state of PHP-FPM by other means. Maybe, for the first time, saying "undocumented feature" doesn't sound like marketing crap implying "undiscovered bug".

You may use listen.backlog = -1 to let the system decide, or you may use your own limit. -1 is a valid value, as the listen(2) man page says. I am planning to open a new issue, as -1 is a more appropriate default for Linux as well. However, please keep in mind that a high backlog value may be truncated by the Linux kernel. For example, under Ubuntu Server this limit is … 128. The same listen(2) manual page states that the maximum value for the backlog option is the SOMAXCONN value. While reading the Linux kernel sources is not exactly toilet reading, I could find the exact implementation of the listen syscall (net/socket.c):

/*
 *      Perform a listen. Basically, we allow the protocol to do anything
 *      necessary for a listen, and if that works, we mark the socket as
 *      ready for listening.
 */
 
SYSCALL_DEFINE2(listen, int, fd, int, backlog)
{
        struct socket *sock;
        int err, fput_needed;
        int somaxconn;
 
        sock = sockfd_lookup_light(fd, &err, &fput_needed);
        if (sock) {
                somaxconn = sock_net(sock->sk)->core.sysctl_somaxconn;
                if ((unsigned)backlog > somaxconn)
                        backlog = somaxconn;
 
                err = security_socket_listen(sock, backlog);
                if (!err)
                        err = sock->ops->listen(sock, backlog);
 
                fput_light(sock->file, fput_needed);
        }
        return err;
}

In plain English: the backlog value cannot be higher than net.core.somaxconn. In order to be able to queue more idle connections into the kernel backlog, you ought to increase the SOMAXCONN value:

root@localhost~# sysctl net.core.somaxconn=1024

The sysctl utility, however, only modifies this value until the system is rebooted. In order to make it persistent, you have to define it in a new file under /etc/sysctl.d/. Or at least, using sysctl.d is recommended, as it keeps the configuration more structured. I used /etc/sysctl.d/10-unix.conf:

net.core.somaxconn=1024

for having 1024 queued connections per listening UDS, plus the number of active connections, which equals the size of the process pool. Remember that you need to restart the PHP-FPM daemon for the new backlog setting to take effect. You may increase the limit as the usage model sees fit. Since nginx doesn't queue any FastCGI connections, you need to be very careful about this setting. All the requests go straight to the kernel backlog. If there's no more room for new connections, a 502 response is returned to the client. I can safely assume that you would like to avoid this.

Another thing that you should take care of, regarding the number of idle connections to the PHP upstream, is the fact that nginx opens a file descriptor for each UDS connection. If you increase the SOMAXCONN limit too much without increasing the number of allowed file descriptors per process, you will run into 502 errors as well. By default, a process may open up to 1024 file descriptors. Usually I increase this limit by adding a ulimit -n $fd_value to the init script of a certain service instead of raising it system-wide.
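The init-script approach boils down to one line before the daemon is started; the 4096 figure is illustrative, pick it from your SOMAXCONN value plus headroom:

```shell
# in the service's init script, before exec'ing the daemon:
# raise the per-process fd limit just for this service
ulimit -n 4096
# the daemon started from this shell inherits the new soft limit
```

This keeps the larger limit scoped to the one service that needs it, instead of raising it system-wide for every process.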

You may want to buffer the FastCGI response in nginx as well. Buffering the response doesn't tie up the upstream PHP process for longer than needed. As nginx properly does the spoon-feeding of slow clients, the system is free to process more requests from the queue. fastcgi_buffer_size and fastcgi_buffers are the two options that you need to tune in order to fit your application usage model.
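An illustrative starting point; the sizes below are made up and should be tuned to your typical response size:

```nginx
location ~ \.php$ {
    include fastcgi_params;
    fastcgi_pass unix:/var/run/php-fpm.sock;   # hypothetical socket path

    # buffer the response so the PHP worker is released while
    # nginx spoon-feeds slow clients
    fastcgi_buffer_size 16k;   # first chunk: response headers
    fastcgi_buffers 32 16k;    # the rest of the response body
}
```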

Update (Aug 24, 2011):

Increasing the SO_SNDBUF also helps. Writes to the socket won't block, as it becomes the kernel's job to stream the data to the clients. For a server with enough memory, nginx is then free to do something else. The socket(7) man page comes to the rescue in order to demystify the SO_SNDBUF concept. Basically, net.core.wmem_max is the one to blame when writes to the socket are blocking. By default net.core.wmem_max is 128k, which is very small for a busy server. If the server has a fat network pipe available, then you can get some more hints here: Linux Tuning. It may not be the case for most EC2 scenarios where the networking is shared, therefore smaller buffers will do just fine. But it may be the case if you're playing, like me, with toys that have dual 1G network interfaces.
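Persisting it works the same way as the somaxconn tweak above; the file name and the 1 MiB value are illustrative:

```
# /etc/sysctl.d/10-netbuf.conf
# raise the maximum socket send buffer (default is 128k)
net.core.wmem_max = 1048576
```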

Why I don’t benchmark HTTP static object serving

Usually I tend to pick a solution based on its features and architecture. That doesn't mean I'd pick a dog-slow application to push the network traffic. But here's the kicker: in a distributed system, you play by the lowest common denominator, aka the real bottleneck. One thing I learned from experience is that given the appropriate software, the hardware usually isn't the limit. The network pipe, on the other hand, is.

A few months ago I had an interesting task: place something in front of Amazon S3 in order to cut down the file distribution costs. While S3 is great for storage, for the actual distribution you'd see that the traffic counter spins fast and the USD counter even faster. Certain web hosting providers have good deals for dedicated servers, therefore something like the S3 traffic could be turned into a fixed cost, given by the number of machines and the bandwidth pool specified in the contract.

This is the part where the technology kicks in. As I said earlier, the hardware rarely is the limit. Remember the C10K problem? Some servers deal with this stuff better than others. This means that I wouldn't pick Apache, or any other web server that deals with connections by the same model. The network has a high latency compared to the rest of the hardware, even the local disk. People should stop thinking in blocking mode.

For the above task, nginx did the load balancing, fail-over, and URL protection part really well. squid did the other part, aka the proxy cache job. Now you may wonder why the good old squid, since I think nginx needs no introduction.

I tried Varnish as the proxy cache, but the people behind it thought it was funny not to have an unbuffered transfer mode. This might work well with "most of the web traffic", as they say, but pushing a large file over HTTP is precisely the real-world usage for our case:

HAI, I ARE TEH CLIENT. PLZ GIVE ME /large-file.zip
LOL, OK, BUT WAI TILL HAZ IT IN ME CACHE

I considered trying Varnish because of all the hype surrounding it. I kept myself from going straight to squid because of the so-called StackOverflow type of specialists who don't recommend it since Varnish is "way faster". Got a little news flash for you, brother: it doesn't matter. The fact that a client needs to wait until the proxy cache buffers all the stuff from the origin server (S3) doesn't count as "speed". It may work great for CDN-like traffic where certain patterns emerge, but not here, where we have to deal even with hundreds of megabytes for a single object.

The next bus stop was Apache Traffic Server. While it looks good on paper, it requires a rocket scientist to configure a simple usage model. While I like the pictures with the baseball bat explaining the concept of a cache hit or miss, it still lacks proper, understandable documentation for a newcomer to get something up and running in a timely manner. I am not a sysadmin newb, but some things are harder to pick up from scratch than others. Work against a short deadline and you'll get my point. My squid configuration file has exactly 18 lines, all human readable and easy to understand. The default Traffic Server installation had a whole pile of configuration files. How am I supposed to figure out this mess?

Things I also considered: nginx – it is dreadful as a proxy cache due to the lack of HTTP 1.1 support for the proxy backend, therefore no If-Modified-Since or If-None-Match support for revalidating a cached object. Amazon S3 sends both Last-Modified and ETag. A half-baked product from this side of the story. We have HTTP 1.1 and the 304 status. Igor, are you there?

lighttpd – also annoyed me with some proxy cache limitations that didn't play nice. Don't get me wrong: setting aside its security record, lighttpd is actually a great web server, performance wise. It pushed our file distribution traffic for almost 4 years from a couple of FreeBSD machines. We changed the hardware every 2 years though. I never saw those machines going over 200 MiB of RAM even at "rush hour", while the CPU used to sip power from the socket instead of going crazy. I didn't even install a monitoring system, since there was nothing to alert on. The hardware load balancer did the fail-over stuff back then. Believe it or not, I actually felt sorry for turning off a web server with almost 600 days of uptime after moving to EC2. But I still hate to use lighttpd as a proxy cache.

I thought about Apache + mod_proxy for 3.14 seconds, but I moved on. Even with the Event MPM it falls behind the competition. While I have a great deal of respect for the Apache Foundation, I still don't get how they manage to screw things up with the Apache HTTP Server. I mean, come on, I expect more from the so-called market share leader.

I gave a thought to node.js, which I think is a terrific piece of technology, if you can ignore some of the silliness of JavaScript. But I wanted a piece of software that's maintained toward my needs by a bunch of people who have been doing this stuff for quite a while, not a home-brewed proxy cache.

Which brought me down to squid. A lot of "expert opinions" kept me from picking it in the first place, even though I've been using it as a forward proxy for years; going with it directly would have spared me the pain of installing, configuring, and testing a bunch of alternatives. It does everything that it needs to do as a proxy cache with more or less configuration. It is an asynchronous server. It isn't as slow as advertised, even though it's written in that "slow, modern language" called C++. The configuration is straightforward. In fact, since the cluster went into production, it hasn't gone anywhere near the hardware limits.

Something bad happened in the first couple of days, due to production traffic that didn't match the theoretical testing: clients who send invalidation headers along with ranged requests, which squid doesn't handle very well. This sent the whole cluster into a vicious circle. The solution was simple, at the nginx layer: strip down the headers and pass just the relevant stuff. nginx does header management much better than squid, which tries to implement some sort of "standards compliant" behavior that makes it very stiff. But even then, even though the cluster was working like crazy, the CPU didn't come close to 20% while the RAM was mostly doing nothing. The network pipe, on the other hand, was working at full capacity, therefore some high I/O waits started to build up. Same goes for nginx.
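The post doesn't list the exact headers involved, but the nginx-side fix looks something like this; the header names and upstream name are chosen for illustration, and setting a header to an empty value stops nginx from passing it upstream:

```nginx
location / {
    # don't forward client validators / ranges that trip up squid
    proxy_set_header If-Modified-Since "";
    proxy_set_header If-None-Match "";
    proxy_set_header Range "";
    proxy_pass http://squid_backend;   # hypothetical upstream
}
```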

Therefore no, I won’t do the same mistake twice. I stopped doing benchmarks. I don’t care about faster, but just about fast enough and well enough engineered. Business wise, we don’t pay less for using less CPU or RAM. Or even for drawing less power from the socket. Even Amazon doesn’t do that with it’s pay-as-you-go-only-what-you-use model. We have a bunch of dedicated hardware that we can use at 1% or at 100% load. There’s a bandwidth pool that’s the actual interest in saving money. I encourage fellow sysadmins do the same: stop caring about irrelevant stuff like: this web server is 10% faster. Even if it’s 200% faster (made up number), if the network pipe is the limit, it doesn’t matter overall. That stuff matters into the application server market where you have to shuffle data around, instead of taking a bunch of bytes of the disk and push them over the network interface. Today’s hardware is stupid fast. When 10 Gbps is going to be a standard, then yes, some of the great debates may actually have some ground. Currently they don’t. People like to argue about anything. Even toiler paper positioning. I fail to see where’s the productive part.

Don’t get me wrong. I hate bloat-ware. I love the raw speed. I love the machines that return a fast answer. But sometimes good enough is enough. Don’t keep “the religion” standing in your way of properly doing your job. People should focus on finding solutions instead of fighting religious wars with colorful graphs.

For example, serving 55,000 requests per second for a 2 KiB object over the loopback interface requires more bandwidth than a 1 Gbps network pipe can provide. 2 KiB is a very small file. Most JavaScript libs and stylesheets go over that limit, therefore the bandwidth limit is even easier to hit. I don't have to mention the images or plain large objects. Yes, some tests may be great for the ego of a developer behind a certain product. But most of the time some sort of apples-to-oranges comparison is involved. Certain features cost CPU time. But most of the time, the lack of specific features costs us time. Time is money. The CPU time, in this case, isn't.
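The claim is easy to check with back-of-the-envelope arithmetic:

```shell
# 55,000 req/s * 2 KiB per object, expressed in bits per second
echo $((55000 * 2048 * 8))
# prints 901120000, i.e. ~0.9 Gbps of payload alone --
# add HTTP and TCP overhead and a 1 Gbps pipe is saturated
```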

PS: we’re caching hundreds of GiB. We’re serving at least one TiB of data per day. Not bad for an old, slow, proxy cache running on a couple of machines that nobody wants to use since we have this new cool technology. Do we?