Not too long ago, after upgrading from Lucid LTS to Precise LTS, our uWSGI servers started erroring with the following message:

Sun Sep 23 03:29:31 2012 - *** uWSGI listen queue of socket 3 full !!! (129/128) ***

After some light research I found that the backlog limit passed to the listen syscall had been exceeded. The listen queue size defaults to 128 for uWSGI, and that number is capped by a system-wide limit set in /proc/sys/net/core/somaxconn. This led me to ask a few questions I thought were worth researching some more.

  • Why is the default limit set to 128?
  • What exactly does the limit do?
  • What are the repercussions of raising the limit?
  • What is the optimal limit for our machines?

This is what I found:

From the listen manpage

The backlog argument defines the maximum length to which the queue of pending connections for sockfd may grow.  If a connection request arrives when the queue is full,  the  client may  receive  an  error  with an indication of ECONNREFUSED or, if the underlying protocol supports retransmission, the request may be ignored so that a later reattempt  at  connection succeeds.
...
If the backlog argument is greater than the value in /proc/sys/net/core/somaxconn, then it is  silently  truncated  to that value; the default value in this file is 128.  In kernels before 2.4.25, this limit was a hard coded value, SOMAXCONN, with the value 128.

Please note the second paragraph: the value is silently truncated if it exceeds the system max. This caused some service degradation for us, because one of the servers had its listen limit increased but not the system limit, so the larger value never took effect. More on that later.
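To see the silent truncation for yourself, a minimal sketch like the following is enough (loopback only, error handling trimmed): listen() happily accepts a backlog far above the 128 default and gives no indication that the kernel clamped it.

/* Minimal sketch: listen() accepts a backlog far above somaxconn
 * without complaint; the kernel silently clamps it to the value in
 * /proc/sys/net/core/somaxconn. Error handling trimmed for brevity. */
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        addr.sin_port = 0;                      /* any free port */
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));

        /* Ask for far more than the default somaxconn of 128. */
        if (listen(fd, 65535) == 0)
                printf("listen() succeeded; the effective queue is still capped by somaxconn\n");
        else
                perror("listen");
        return 0;
}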

From UNP (Stevens' Unix Network Programming)

When a SYN arrives from a client, TCP creates a new entry on the incomplete queue and then responds with the second segment of the three-way handshake: the server's SYN with an ACK of the client's SYN (Section 2.6). This entry will remain on the incomplete queue until the third segment of the three-way handshake arrives (the client's ACK of the server's SYN), or until the entry times out. (Berkeley-derived implementations have a timeout of 75 seconds for these incomplete entries.) If the three-way handshake completes normally, the entry moves from the incomplete queue to the end of the completed queue. When the process calls accept, which we will describe in the next section, the first entry on the completed queue is returned to the process, or if the queue is empty, the process is put to sleep until an entry is placed onto the completed queue. 
There are several points to consider regarding the handling of these two queues. 

The backlog argument to the listen function has historically specified the maximum value for the sum of both queues.
There has never been a formal definition of what the backlog means. The 4.2BSD man page says that it "defines the maximum length the queue of pending connections may grow to." Many man pages and even the POSIX specification copy this definition verbatim, but this definition does not say whether a pending connection is one in the SYN_RCVD state, one in the ESTABLISHED state that has not yet been accepted, or either. The historical definition in this bullet is the Berkeley implementation, dating back to 4.2BSD, and copied by many others.

Berkeley-derived implementations add a fudge factor to the backlog: It is multiplied by 1.5 (p. 257 of TCPv1 and p. 462 of TCPV2). For example, the commonly specified backlog of 5 really allows up to 8 queued entries on these systems, as we show in Figure 4.10.
The reason for adding this fudge factor appears lost to history [Joy 1994]. But if we consider the backlog as specifying the maximum number of completed connections that the kernel will queue for a socket ([Borman 1997b], as discussed shortly), then the reason for the fudge factor is to take into account incomplete connections on the queue.

Do not specify a backlog of 0, as different implementations interpret this differently (Figure 4.10). If you do not want any clients connecting to your listening socket, close the listening socket.

Assuming the three-way handshake completes normally (i.e., no lost segments and no retransmissions), an entry remains on the incomplete connection queue for one RTT, whatever that value happens to be between a particular client and server. Section 14.4 of TCPv3 shows that for one Web server, the median RTT between many clients and the server was 187 ms. (The median is often used for this statistic, since a few large values can noticeably skew the mean.)

Historically, sample code always shows a backlog of 5, as that was the maximum value supported by 4.2BSD. This was adequate in the 1980s when busy servers would handle only a few hundred connections per day. But with the growth of the World Wide Web (WWW), where busy servers handle millions of connections per day, this small number is completely inadequate (pp. 187–192 of TCPv3). Busy HTTP servers must specify a much larger backlog, and newer kernels must support larger values.
Many current systems allow the administrator to modify the maximum value for the backlog.

A problem is: What value should the application specify for the backlog, since 5 is often inadequate? There is no easy answer to this. HTTP servers now specify a larger value, but if the value specified is a constant in the source code, to increase the constant requires recompiling the server. Another method is to assume some default but allow a command-line option or an environment variable to override the default. It is always acceptable to specify a value that is larger than supported by the kernel, as the kernel should silently truncate the value to the maximum value that it supports, without returning an error (p. 456 of TCPv2).
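One of the suggestions in that passage, taking the backlog from an environment variable with a sane default, might look something like this. The LISTEN_BACKLOG name is just an illustration, not an existing convention.

/* Sketch of the environment-variable override Stevens suggests: read
 * the backlog from a hypothetical LISTEN_BACKLOG variable and fall
 * back to a compile-time default. */
#include <stdlib.h>

#define DEFAULT_BACKLOG 128

static int backlog_from_env(void)
{
        const char *val = getenv("LISTEN_BACKLOG");     /* hypothetical name */

        if (val != NULL) {
                int n = atoi(val);
                if (n > 0)
                        return n;
        }
        return DEFAULT_BACKLOG;
}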

So some years ago 5 was a questionable limit, which also somewhat explains the silent truncation. Further inspection of the kernel source led to this comment:

net/core/request_sock.c:

/*
 * Maximum number of SYN_RECV sockets in queue per LISTEN socket.
 * One SYN_RECV socket costs about 80bytes on a 32bit machine.
 * It would be better to replace it with a global counter for all sockets
 * but then some measure against one socket starving all other sockets
 * would be needed.
 *
 * The minimum value of it is 128. Experiments with real servers show that
 * it is absolutely not enough even at 100conn/sec. 256 cures most
 * of problems.
 * This value is adjusted to 128 for low memory machines,
 * and it will increase in proportion to the memory of machine.
 * Note : Dont forget somaxconn that may limit backlog too.
 */

So now things are becoming clearer. The backlog bounds the number of clients that are in the handshake phase of the TCP connection, and from what I can tell from Stevens, the default value is mostly of legacy significance from the days when servers had memory measured in megabytes, not the gigabytes we have today.

Continuing the inspection shows that the queue is a list of request_sock structures.

net/request_sock.h

/* struct request_sock - mini sock to represent a connection request
 */
struct request_sock {
        struct request_sock             *dl_next; /* Must be first member! */
        u16                             mss;
        u8                              retrans;
        u8                              cookie_ts; /* syncookie: encode tcpopts in timestamp */
        /* The following two fields can be easily recomputed I think -AK */
        u32                             window_clamp; /* window clamp at creation time */
        u32                             rcv_wnd;          /* rcv_wnd offered first time */
        u32                             ts_recent;
        unsigned long                   expires;
        const struct request_sock_ops   *rsk_ops;
        struct sock                     *sk;
        u32                             secid;
        u32                             peer_secid;
};

This is where the queue is allocated.

net/core/request_sock.c

int reqsk_queue_alloc(struct request_sock_queue *queue,
                      unsigned int nr_table_entries)
{
        size_t lopt_size = sizeof(struct listen_sock);
        struct listen_sock *lopt;

        nr_table_entries = min_t(u32, nr_table_entries, sysctl_max_syn_backlog);
        nr_table_entries = max_t(u32, nr_table_entries, 8);
        nr_table_entries = roundup_pow_of_two(nr_table_entries + 1);
        lopt_size += nr_table_entries * sizeof(struct request_sock *);
        if (lopt_size > PAGE_SIZE)
                lopt = vzalloc(lopt_size);
        else
                lopt = kzalloc(lopt_size, GFP_KERNEL);
        if (lopt == NULL)
                return -ENOMEM;

        for (lopt->max_qlen_log = 3;
             (1 << lopt->max_qlen_log) < nr_table_entries;
             lopt->max_qlen_log++);

        get_random_bytes(&lopt->hash_rnd, sizeof(lopt->hash_rnd));
        rwlock_init(&queue->syn_wait_lock);
        queue->rskq_accept_head = NULL;
        lopt->nr_table_entries = nr_table_entries;

        write_lock_bh(&queue->syn_wait_lock);
        queue->listen_opt = lopt;
        write_unlock_bh(&queue->syn_wait_lock);

        return 0;
}
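To make the sizing concrete, here is the same arithmetic pulled out into a standalone sketch. The 1024 for tcp_max_syn_backlog is an assumption for illustration; the real value comes from the net.ipv4.tcp_max_syn_backlog sysctl.

/* Standalone sketch of the sizing logic in reqsk_queue_alloc(): the
 * requested backlog is clamped by tcp_max_syn_backlog, floored at 8,
 * then rounded up to the next power of two. */
#include <stdio.h>

static unsigned int roundup_pow_of_two(unsigned int n)
{
        unsigned int p = 1;

        while (p < n)
                p <<= 1;
        return p;
}

int main(void)
{
        unsigned int backlog = 128;             /* value passed to listen() */
        unsigned int max_syn_backlog = 1024;    /* assumed sysctl value */
        unsigned int nr_table_entries = backlog;

        if (nr_table_entries > max_syn_backlog)
                nr_table_entries = max_syn_backlog;
        if (nr_table_entries < 8)
                nr_table_entries = 8;
        nr_table_entries = roundup_pow_of_two(nr_table_entries + 1);

        printf("backlog %u -> %u hash table slots\n", backlog, nr_table_entries);
        return 0;
}

For the default backlog of 128 this works out to 256 slots.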

Conclusions

So now we have some answers. somaxconn caps the number of request_sock slots allocated for each listen call. The queue is persistent through the life of the listening socket, and the default value of 128 is not appropriate for today's servers. Increasing that number 5x would have an insignificant effect on system resources while helping ensure your server stays up under load.

From the kernel comment, each entry takes about 80 bytes on a 32-bit machine. Doubling that for 64-bit still leads to an inconsequential number for today's machines: increasing the limit to 8,192 would lead to roughly 1.25 MB of memory usage.
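As a back-of-the-envelope check on that figure, assuming roughly 160 bytes per entry on a 64-bit machine (double the 80 bytes the kernel comment quotes for 32-bit):

/* Back-of-the-envelope check of the memory figure above. The 160
 * bytes per entry is an assumption: double the 80 bytes quoted in the
 * kernel comment for a 32-bit machine. */
#include <stdio.h>

int main(void)
{
        unsigned long entries = 8192;
        unsigned long bytes_per_entry = 160;    /* assumed: 2 * 80 bytes */
        unsigned long total = entries * bytes_per_entry;

        printf("%lu entries * %lu bytes = %lu bytes (~%.2f MB)\n",
               entries, bytes_per_entry, total, total / (1024.0 * 1024.0));
        return 0;
}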

Actions Taken

Since there is no error when the listen backlog limit exceeds the system max, I submitted a patch to uWSGI to hard fail if the number passed through the --listen parameter exceeds the system max. The patch was accepted, so the case that caused the second failure will not happen again.
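The shape of that check is roughly the following (an illustrative sketch, not the actual uWSGI code): read the system max from /proc and refuse to start if the requested backlog exceeds it.

/* Illustrative sketch of the startup check, not the actual uWSGI
 * patch: read net.core.somaxconn and hard fail if the requested
 * backlog exceeds it. */
#include <stdio.h>
#include <stdlib.h>

static long read_somaxconn(void)
{
        long value = -1;
        FILE *f = fopen("/proc/sys/net/core/somaxconn", "r");

        if (f != NULL) {
                if (fscanf(f, "%ld", &value) != 1)
                        value = -1;
                fclose(f);
        }
        return value;
}

int main(int argc, char **argv)
{
        long requested = argc > 1 ? atol(argv[1]) : 128;
        long somaxconn = read_somaxconn();

        if (somaxconn > 0 && requested > somaxconn) {
                fprintf(stderr,
                        "listen backlog %ld exceeds net.core.somaxconn (%ld), refusing to start\n",
                        requested, somaxconn);
                return 1;
        }
        printf("backlog %ld is within the system limit\n", requested);
        return 0;
}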

Going Forward

I would like to know the maximum number of slots actually being used in the queue. All I know for certain is that the number is greater than 128 and less than 8192, so the optimal size is somewhere in between. Considering that the cost of increasing the size is so low, though, this is a low priority.

However, I do think it would be an interesting case study. For instance, would the queue be better served by a linked list that grows and shrinks with demand? In the highly available environments we work in, this seems like an area for improvement. I just cannot see a good reason for loss of service due to the listen backlog queue limit, especially one that defaults to 128.

For uWSGI, I am happy with the first submission, but now I am thinking the --listen parameter should also accept a soft value like 'max' that would default to the system max, or maybe even override it. Again, my motivation is that I cannot see a good reason for such a small limit in the first place.