
Nginx bottleneck as load balancer?

We have a popular iPhone app where people duel each other, à la Wordfeud. We have almost 1M registered users today.

During peak hours the app gets really long response times, and there are also quite a lot of timeouts. We have tried to find the bottleneck, but have had a hard time doing so. CPU, memory and I/O are all under 50% on all servers. The problem ONLY appears during peak hours.

Our setup

1 VPS with nginx (1.1.9) as load balancer
4 front servers with Ruby (1.9.3p194) on Rails (3.2.5) / Unicorn (4.3.1)
1 database server with PostgreSQL 9.1.5
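For context, an nginx load balancer in front of a set of Unicorn front servers is typically an upstream block plus proxy_pass, roughly along these lines (the addresses and ports below are illustrative, not our actual config):

    upstream app_servers {
        # the four Rails/Unicorn front servers (hypothetical addresses)
        server 10.0.0.11:8080;
        server 10.0.0.12:8080;
        server 10.0.0.13:8080;
        server 10.0.0.14:8080;
    }

    server {
        listen 80;
        location / {
            proxy_pass http://app_servers;
        }
    }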

The database logs don't show enough slow queries to explain all the timeouts shown in the nginx error log.

We have also tried building the app to run directly against the front servers (during peak hours, while all other users go through the load balancer). The surprising thing is that the app bypassing the load balancer is fast as a bullet, even during peak hours.

NGINX SETTINGS

worker_processes=16
worker_connections=4096
multi_accept=on

LINUX SETTINGS

fs.file-max=13184484
net.ipv4.tcp_rmem="4096 87380 4194304"
net.ipv4.tcp_wmem="4096 16384 4194304"
net.ipv4.ip_local_port_range="32768 61000"

Why is the app so fast when it bypasses the load balancer? Can nginx as a load balancer be the bottleneck? Is there any good way to compare timeouts in nginx with timeouts in the Unicorns to see where the problem resides?
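(For reference, nginx can log both its own total request time and the time the upstream Unicorn took for each request, which makes it possible to see whether time is lost in nginx or in the app. A minimal custom log_format in the http block might look like this; the "timing" name and log path are just examples:)

    log_format timing '$remote_addr "$request" $status '
                      'request_time=$request_time '
                      'upstream_time=$upstream_response_time';
    access_log /var/log/nginx/timing.log timing;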

asked by lorgartzor


1 Answer

Depending on your settings, nginx might be the bottleneck...

Check/tune the following settings in nginx:

  1. the worker_processes setting (should equal the number of cores/CPUs)
  2. the worker_connections setting (should be very high if you have lots of connections at peak)
  3. set multi_accept on;
  4. if on Linux, make sure nginx is using epoll (the use epoll; directive); a config sketch follows this list
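A minimal nginx.conf sketch reflecting points 1-4 (the numbers are just the ones from the question; adjust worker_processes to your core count and worker_connections to your peak load):

    worker_processes  16;            # ideally = number of CPU cores

    events {
        worker_connections  4096;    # per worker process
        multi_accept        on;
        use                 epoll;   # on Linux
    }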

Check/tune the following OS settings:

  1. number of allowed open file descriptors (sysctl -w fs.file-max=999999 on Linux)
  2. TCP read and write buffers (sysctl -w net.ipv4.tcp_rmem="4096 4096 16777216" and sysctl -w net.ipv4.tcp_wmem="4096 4096 16777216" on Linux)
  3. local port range (sysctl -w net.ipv4.ip_local_port_range="1024 65536" on Linux); a persistent sysctl.conf sketch follows this list
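To make these survive a reboot, the same values can go into /etc/sysctl.conf and be applied with sysctl -p (the buffer sizes below are the example values above, not tuned recommendations):

    fs.file-max = 999999
    net.ipv4.tcp_rmem = 4096 4096 16777216
    net.ipv4.tcp_wmem = 4096 4096 16777216
    net.ipv4.ip_local_port_range = 1024 65536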

Update:

  • so you have 16 workers and 4096 connections per worker
  • which means a maximum of 4096*16 = 65536 concurrent connections
  • you probably have multiple requests per client (AJAX, CSS, JS, the page itself, any images on the page, ...), let's say 4 requests per client
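In numbers, using the figures above:

    16 workers × 4096 connections/worker = 65,536 concurrent connections
    65,536 connections ÷ 4 requests/user ≈ 16,384 concurrent users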

That allows for slightly over 16k concurrent users. Is that enough for your peaks?

answered by cobaco