Configuring Puma Workers for the Cloud

Our Ruby on Rails applications are served by the Puma web server. The total number of Puma threads is configured with two parameters: the number of processes and the number of threads per process. Their product defines the number of requests that can be handled in parallel. Let’s examine how these values depend on your system resources.

Life in the Cloud

We deploy our applications as containers in the cloud, where we get to choose (and pay) how many CPUs and how much memory each container gets. Memory often plays a more significant role in costs, prompting us to minimize its allocation.

Our applications do not see that much traffic, just a few requests per second; a hundred requests per second is already a lot for them. Still, they should not block completely on the odd long-running request (especially because when there is one long-running request, there usually is another. And another). To avoid blocking other requests, a sufficient number of total threads should be available to handle incoming requests. But how should we divide this number between processes and threads?

Ruby Processes

The standard Ruby runtime has one very important restriction: a process can only use up to one CPU. No matter how many concurrent threads live in a process, only one of them gets to use the CPU at a time. This is due to the global interpreter lock (GIL; not actually just a bad thing, but out of scope for this article). However, when one thread is blocked on IO, e.g. waiting for a database query to return, another can go ahead.
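A minimal sketch of this effect, using `sleep` as a stand-in for a blocking IO call such as a database query (it releases the lock while waiting):

```ruby
require "benchmark"

# Four threads waiting on IO overlap, so this takes roughly 0.2s, not 0.8s.
io_time = Benchmark.realtime do
  4.times.map { Thread.new { sleep 0.2 } }.each(&:join)
end

# Pure Ruby computation serializes behind the lock: the threads take turns
# on the single CPU instead of running simultaneously.
cpu_time = Benchmark.realtime do
  4.times.map { Thread.new { 1_000_000.times { Math.sqrt(rand) } } }.each(&:join)
end

puts format("IO-bound: %.2fs, CPU-bound: %.2fs", io_time, cpu_time)
```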

When it comes to processing web requests, and when we want them to respond as fast as possible, there should not be too many threads competing for the CPU. While many threads increase the number of requests that can be handled in parallel (also called throughput), the response time (latency) of each request will suffer. Instead of being processed in one go, a request gets the CPU only for a couple of operations and then has to wait again while others use it. Only while a request waits for the database can other requests profit from the unused CPU.

A large number of threads slows down the response times.

The relation between the time spent on the CPU and the time spent in the database (or other IO-based sources) is relevant when figuring out a good value for the number of threads per process. This has been intensely discussed in the Rails community recently, resulting in a smaller default of 3 threads per process.
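As a back-of-envelope illustration (our own rule of thumb, not an official formula): if a fraction of each request is spent waiting on IO, one CPU can keep roughly `1 / (1 - io_fraction)` threads busy before they start queueing for it.

```ruby
# Hypothetical helper, not part of Puma or Rails: estimate how many threads
# one process can run before they compete for the CPU, given the fraction
# of each request spent waiting on IO.
def threads_per_process(io_fraction)
  (1.0 / (1.0 - io_fraction)).ceil
end

threads_per_process(0.20) # => 2, DB time is ~20% of the response time
threads_per_process(0.66) # => 3, requests that mostly wait on IO
```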

The Memory Factor

To give full CPU power to your requests, you could set the thread count to 1 and increase the number of processes accordingly. This, however, comes at the cost of additional memory. While forked Puma processes share the memory used for the application’s code, each one allocates its own memory for the objects created during requests.
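One way to keep that code sharing effective is Puma’s `preload_app!`: loading the application once before forking lets the workers share the loaded code via copy-on-write. A minimal `config/puma.rb` sketch (the worker and thread counts are illustrative):

```ruby
# config/puma.rb (sketch)
workers 4        # number of forked processes
threads 2, 2     # min and max threads per process
preload_app!     # load the app before forking so code pages are shared
```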

The more processes you have, the more memory you need.

Measuring the World

On one hand, we want to optimize response times, on the other, memory usage. It’s very hard to optimize numbers without knowing them. A few metrics are very important here:

  • Used memory
  • Response duration quantiles
  • Free Puma threads

We use the prometheus_exporter gem to collect those metrics and have a nice Grafana dashboard to visualize them.
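For completeness, a sketch of how such Puma metrics can be hooked up in `config/puma.rb` (check the prometheus_exporter README for the exact API of your version; the snippet below is based on its documented instrumentation):

```ruby
# config/puma.rb (sketch)
require "prometheus_exporter/instrumentation"

after_worker_boot do
  # Reports Puma stats such as running threads and the request backlog
  # to the prometheus_exporter collector process.
  PrometheusExporter::Instrumentation::Puma.start
end
```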

No matter how we choose our values, we want to ensure that the number of free Puma threads is larger than zero at basically all times. When there are no free threads, requests are entirely blocked by others, which degrades the application’s availability. We choose the total number of threads large enough to avoid such situations as much as possible. For an application with a median response time of 25ms and a 95th percentile of about 300ms, we figured that a total of about 8 threads is fine.
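A back-of-envelope sanity check for that number (our own, via Little’s law, not part of the measurements above): the number of busy threads is roughly the arrival rate times the average time spent per request. The 50ms effective average below is an assumption for illustration, between the 25ms median and the 300ms 95th percentile:

```ruby
# Little's law: concurrency = arrival rate x average time in the system.
def busy_threads(requests_per_sec, avg_response_sec)
  (requests_per_sec * avg_response_sec).ceil
end

# ~100 req/s at an assumed 50ms average keeps about 5 threads busy,
# so 8 total threads leave some headroom for slow outliers.
busy_threads(100, 0.050) # => 5
```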

Number of free Puma threads and number of requests waiting in the backlog.

Next we reduced the number of threads per process step by step, until the measured response times did not improve anymore. For our application, where the database only uses about 20% of the total response times, they ceased to improve at about 3 threads per process. There was only a small difference between 2 and 3 threads. We chose to prioritize fast responses and set the number of threads per process to 2.

Given a total of 8 threads, the number of processes then came out to 4. If memory (cost) had been more crucial, 3 processes with 3 threads each might have been a better match.

Conclusion

For a long time, we did not think about the implications of the number of threads per process on response times. We naively thought this would be a “free” way to increase the total number of threads. While memory cost is crucial, fast response times should not be neglected either if you care about happy users.

So when your application grows, you should rather increase the number of processes to provide enough total threads. The number of processes also defines the maximum number of CPUs you are going to need for your web containers.

Another interesting correlation is that applications with faster requests require less memory. If you have many slow requests, you need more concurrent threads to handle them all. If you manage to optimize the response times in your application code, you might be able to reduce the number of processes. With fewer processes, Puma will use less memory.

In Rails, the new sensible default for the number of threads per process is 3 as of version 8. To configure both values in production environments, set the environment variables WEB_CONCURRENCY for the number of processes and RAILS_MAX_THREADS for the number of threads per process.
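Putting it together, the generated `config/puma.rb` reads both environment variables; the exact file varies slightly between Rails versions, so this is a sketch of its shape:

```ruby
# config/puma.rb (sketch of the Rails default shape)
max_threads = ENV.fetch("RAILS_MAX_THREADS", 3).to_i
threads max_threads, max_threads

# Number of forked worker processes; e.g. WEB_CONCURRENCY=4 with
# RAILS_MAX_THREADS=2 yields the 8 total threads discussed above.
workers ENV.fetch("WEB_CONCURRENCY", 1).to_i
```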
