PHP-FPM pm.max_children: Why Bitrix Fails Under Load

pm.max_children is the PHP-FPM configuration parameter that sets the maximum number of worker processes in a pool. On a busy Bitrix installation, getting this wrong is the most reliable way to turn a promotional spike into a wave of 504 errors, and one of the easiest misconfigurations to overlook during normal traffic.

Bitrix shows 504. Nginx logs are clean. MySQL is at 40% CPU. OPcache is warm. Composite cache is on. The site is dead anyway.

I've seen this three times on different projects. Each time, diagnosis took too long because we started with the wrong layer — code, database, network — when the answer was in the PHP-FPM config.

One concrete case: an online store runs a promotion and traffic triples. TTFB jumps to 8-12 seconds. About 600 errors per hour, all 504s. ps aux | grep php-fpm shows exactly 5 worker processes, all busy. Requests queue up and time out.

The config: pm.max_children = 5. The ISPmanager default. Nobody touched it since the server was provisioned.

How PHP-FPM manages worker processes

PHP-FPM runs a pool of worker processes. Nginx sends PHP requests to FPM over a FastCGI socket. Incoming connections wait in the socket's listen queue until a free worker accepts the next one. When there are no free workers, requests wait.

pm.max_children is the hard ceiling. If all workers are busy and a new request arrives, it joins the queue. Queue depth is set by listen.backlog (default: 511 on Linux), and the kernel caps it at net.core.somaxconn. Once the queue fills, new connections are refused.

The default value of 5 comes from shared hosting panel templates — ISPmanager, VestaCP, DirectAdmin. It's reasonable when you're sharing a server between 30 websites. It's a bottleneck when you're running a single Bitrix catalog with 10,000+ SKUs.

What 502 vs 504 actually tells you

Both errors mean Nginx didn't get a usable response from PHP-FPM, but they point to different causes.

504 Gateway Timeout means Nginx connected to PHP-FPM and sent the request, but no response arrived within fastcgi_read_timeout. Either the request itself is slow (heavy SQL, an external API call), the worker is stuck, or the request sat in the listen queue waiting for a free worker until the timeout expired.

502 Bad Gateway means Nginx couldn't connect to FPM at all, or the connection was closed before a valid response came back. This happens when the backlog queue is full and new connections are refused, or when an FPM process crashes.

In practice: if you're seeing 504s under load, look at request processing time and queue depth. If you're seeing 502s, look at the FPM process itself.

In our case it was 504s: the 5 workers were processing Bitrix catalog pages at 2-3 seconds each (cache miss on a promotion launch), new requests queued, timeouts hit before the queue cleared.
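The arithmetic behind that backlog can be sketched in a few lines of Python. This is a rough steady-state estimate, not a real queueing model. The function names are illustrative, the 2.5-second service time is simply the midpoint of the 2-3 second range, the queue length of 127 is the max listen queue figure from the status output later in this article, and the 60-second limit is Nginx's default fastcgi_read_timeout, assumed here because the actual Nginx config is not shown.

```python
def pool_capacity_rps(workers: int, seconds_per_request: float) -> float:
    """Upper bound on requests per second the pool can complete."""
    return workers / seconds_per_request


def queue_wait_seconds(queued: int, workers: int, seconds_per_request: float) -> float:
    """Approximate time until the last queued request reaches a worker."""
    return queued / pool_capacity_rps(workers, seconds_per_request)


# Assumed values: 5 workers, 2.5 s per uncached catalog page, 127 queued requests.
capacity = pool_capacity_rps(5, 2.5)
wait = queue_wait_seconds(127, 5, 2.5)
nginx_timeout = 60  # assumed: Nginx default fastcgi_read_timeout, in seconds

print(f"capacity: {capacity} req/s, wait for last queued request: {wait} s")
print("last queued request times out:", wait > nginx_timeout)
```

Under these assumptions the pool completes about two requests per second, so anything arriving faster than that only lengthens the queue until the timeout starts cutting requests off.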

Calculating pm.max_children for Bitrix production

The right value of pm.max_children is determined by how much RAM is available for PHP processes after the OS, database, and other services take their share. The formula:

pm.max_children = (total_RAM - OS_overhead - MySQL_RAM) / avg_php_process_memory

An average PHP-FPM process running Bitrix with prolog/epilog loaded takes 50-120 MB RAM. With sale, catalog, and iblock modules enabled, expect 80-100 MB without memory optimization. Measure yours with ps aux --sort -rss | grep php-fpm.
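If you prefer a number over eyeballing the ps output, the averaging can be sketched as below. It assumes the standard ps aux column layout (RSS in KiB in the sixth column) and that workers carry the usual "php-fpm: pool <name>" process title. The sample output is made up for illustration and is not from the server in this story. RSS also counts pages shared between workers, so treat the result as a conservative upper estimate.

```python
def average_worker_rss_mb(ps_output: str, marker: str = "php-fpm: pool") -> float:
    """Average resident set size, in MB, of FPM workers found in `ps aux` output."""
    sizes_kb = []
    for line in ps_output.splitlines():
        if marker not in line:
            continue  # skips the header, the master process, and unrelated commands
        fields = line.split(None, 10)  # 11 columns; COMMAND keeps its spaces
        sizes_kb.append(int(fields[5]))  # sixth column of `ps aux` is RSS in KiB
    if not sizes_kb:
        raise ValueError("no matching worker processes found")
    return sum(sizes_kb) / len(sizes_kb) / 1024


# Hypothetical sample, shaped like `ps aux` output.
SAMPLE_PS_OUTPUT = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      1001  0.0  0.5 250000 20480 ?        Ss   10:00   0:01 php-fpm: master process (/etc/php/8.1/fpm/php-fpm.conf)
www-data  1002  2.0  1.7 300000 69632 ?        S    10:00   0:30 php-fpm: pool www
www-data  1003  2.1  1.7 300000 71680 ?        S    10:00   0:31 php-fpm: pool www
"""

print(average_worker_rss_mb(SAMPLE_PS_OUTPUT))
```

On a live server the input could come from subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout. Measure under realistic traffic, since idle workers are smaller than workers that have just rendered a catalog page.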

Example for a 4 GB VPS:

OS + Nginx: ~512 MB
MySQL (innodb_buffer_pool_size 768 MB + overhead): ~1024 MB
Available for PHP: 4096 - 512 - 1024 = 2560 MB
Average Bitrix PHP process size (measured): ~68 MB
pm.max_children = 2560 / 68 ≈ 37

We set it to 38. TTFB under load dropped from 8-12 seconds to 1.1 seconds. The 504 errors stopped within five minutes of applying the config.

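Leave a 10-15% buffer, because processes grow during heavy requests. Strictly, 38 workers at 68 MB each come to 2584 MB (about 2.5 GB), slightly over the 2560 MB budget, so this setting leans on the slack in the OS and MySQL estimates. A value that honors the buffer would be in the low 30s.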
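The formula and the 4 GB example can be written as a small helper. This is a minimal sketch: the function name and the buffer_percent parameter are my own, and integer arithmetic is used so the result always rounds down, which is the safe direction for a memory budget.

```python
def max_children(total_mb: int, os_mb: int, db_mb: int, process_mb: int,
                 buffer_percent: int = 0) -> int:
    """Number of PHP-FPM workers that fit in the RAM left after OS and database."""
    available_mb = total_mb - os_mb - db_mb
    # Hold back a share of the budget for workers that grow on heavy requests.
    usable_mb = available_mb * (100 - buffer_percent) // 100
    return usable_mb // process_mb


# Figures from the 4 GB VPS example: 512 MB OS + Nginx, 1024 MB MySQL, 68 MB per worker.
print(max_children(4096, 512, 1024, 68))                     # no buffer
print(max_children(4096, 512, 1024, 68, buffer_percent=10))  # 10% buffer
print(max_children(4096, 512, 1024, 68, buffer_percent=15))  # 15% buffer
```

With no buffer this gives 37. With a 10% or 15% buffer it gives 33 or 32, which is the range to aim for if the measured process size was taken under light load.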

pm.dynamic vs pm.static — which to choose

There are three pool management modes.

First, pm.static: a fixed number of workers, all started at launch. Predictable. The downside is that memory is consumed even at 3am when traffic is minimal.

Second, pm.dynamic: workers are created and destroyed based on demand. Controlled by pm.max_children, pm.start_servers, pm.min_spare_servers, and pm.max_spare_servers. Saves RAM during quiet periods.

Third, pm.ondemand: no workers are started at launch. They are forked when requests arrive, up to pm.max_children, and killed after sitting idle for pm.process_idle_timeout. A poor fit for a busy production store, because the first requests after a quiet period pay the process startup overhead.

For mid-traffic production Bitrix, I use pm.dynamic with these values (example for 4 GB VPS, max_children=38):

pm = dynamic
pm.max_children = 38
pm.start_servers = 10
pm.min_spare_servers = 5
pm.max_spare_servers = 15
pm.max_requests = 500

pm.max_requests = 500 recycles workers after 500 requests — useful if your Bitrix version has memory leaks in long-running processes.
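The dynamic-mode values have to be consistent with each other, and a quick check before reloading can catch a typo. The sketch below encodes the relationships as I understand them from the comments in the stock pool config: spare-server limits must not exceed pm.max_children, and pm.start_servers must lie between the two spare limits. The function names are hypothetical, and php-fpm -t remains the authoritative check.

```python
def check_dynamic_pool(max_children: int, start_servers: int,
                       min_spare: int, max_spare: int) -> list:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    if min_spare < 1:
        problems.append("pm.min_spare_servers must be at least 1")
    if max_spare < min_spare:
        problems.append("pm.max_spare_servers is below pm.min_spare_servers")
    if max_spare > max_children:
        problems.append("pm.max_spare_servers exceeds pm.max_children")
    if not (min_spare <= start_servers <= max_spare):
        problems.append("pm.start_servers is outside the spare-server range")
    return problems


def default_start_servers(min_spare: int, max_spare: int) -> int:
    """Midpoint of the spare range, the default the stock config comments describe."""
    return (min_spare + max_spare) // 2


# Values from the example config for the 4 GB VPS.
print(check_dynamic_pool(max_children=38, start_servers=10, min_spare=5, max_spare=15))
print(default_start_servers(5, 15))
```

The example config passes with no problems reported, and its pm.start_servers of 10 matches the midpoint of the spare range.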

On servers with 8+ GB RAM and stable load, pm.static with max_children = 60-80 gives more predictable behavior.

Monitoring your pool without external agents

PHP-FPM has a built-in status endpoint. Enable it in www.conf:

pm.status_path = /fpm-status

Add a restricted Nginx location:

location = /fpm-status {
    allow 127.0.0.1;
    deny all;
    fastcgi_pass unix:/var/run/php/php-fpm.sock;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    include fastcgi_params;
}

curl http://localhost/fpm-status then shows:

pool:                 www
process manager:      dynamic
accepted conn:        12847
listen queue:         0
max listen queue:     127
idle processes:       8
active processes:     3
total processes:      11
max active processes: 38

listen queue above zero means requests are waiting. active processes near pm.max_children means the pool is at capacity. Both need alerting.
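Those two conditions are easy to check from a cron job. Here is a minimal sketch that parses the plain-text status page and returns alert messages. The function names and the 90% threshold for "near capacity" are assumptions for illustration, not anything FPM defines, and the sample text is the status output shown above.

```python
def parse_fpm_status(text: str) -> dict:
    """Parse the 'key: value' lines of the plain-text FPM status page."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'key: value' line
        value = value.strip()
        status[key.strip()] = int(value) if value.isdigit() else value
    return status


def pool_alerts(status: dict, max_children: int, busy_ratio: float = 0.9) -> list:
    """Return alert messages for a queued or nearly saturated pool."""
    alerts = []
    if status.get("listen queue", 0) > 0:
        alerts.append("requests are waiting in the listen queue")
    if status.get("active processes", 0) >= busy_ratio * max_children:
        alerts.append("active processes are near pm.max_children")
    return alerts


SAMPLE_STATUS = """\
pool:                 www
process manager:      dynamic
accepted conn:        12847
listen queue:         0
max listen queue:     127
idle processes:       8
active processes:     3
total processes:      11
max active processes: 38
"""

status = parse_fpm_status(SAMPLE_STATUS)
print(pool_alerts(status, max_children=38))
```

For the sample the alert list is empty, because nothing is queued and only 3 of 38 workers are active. In a real check the text would come from fetching the status URL on localhost, for example with urllib.request.urlopen.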

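Before the incident I described, max listen queue had already reached 127. That figure suggests the effective backlog was capped at 128 by net.core.somaxconn (the default on older Linux kernels) rather than by the listen.backlog default of 511. That's 127 requests either waiting for a worker or timing out. The pool had been a bottleneck for weeks; the promotion just made it visible.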

What not to touch while you're at it

fastcgi_read_timeout in Nginx is a separate knob. Increasing it to "fix" 504s is a trap: requests wait longer, the queue grows larger, and pages arrive no faster because the number of workers hasn't changed. Fix the pool first, then check whether slow requests remain.

memory_limit in php.ini doesn't directly control FPM process size. It caps the memory a single request's script may allocate, but the process's resident size also includes the interpreter, loaded extensions, and mapped shared memory such as OPcache. Always measure actual size via ps aux, not by reading memory_limit.

What this actually changes

pm.max_children = 5 isn't a bug or malice. It's a shared hosting default that nobody changed when the server went dedicated. The fix takes under an hour: measure actual process size, update the config, run systemctl reload php8.1-fpm (the unit name varies with PHP version and distribution). A reload is graceful, so no downtime is required.

The same Bitrix server likely has two more common misconfigurations in the same config layer. I wrote about them in How PHP OPcache Silently Degrades Bitrix in Production and PHP File Sessions Took Down Our Server at Peak Load. All three are fixable in one afternoon.