PHP File Sessions Took Down Our Server at Peak Load. Here's What We Found.

Our server went down at 23:00. Right in the middle of a marketing promotion.

CPU at 40%. Memory fine. Traffic about 30% above normal. Nginx was responding. PHP-FPM wasn't. New requests just hung.

We stared at the monitoring dashboard for 20 minutes and couldn't figure out what was happening.

What the PHP-FPM status page showed

Enable pm.status_path in the PHP-FPM pool config, expose that path through Nginx, and you get a live view of every FPM worker. When we opened it, all 128 workers were either Reading headers or Running. Listen queue: 340 waiting requests.

FPM had run out of workers. Every slot occupied, new requests stacking up.
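If you'd rather not eyeball 128 rows, the same page can be fetched as JSON (append ?full&json to whatever path pm.status_path points at) and tallied. A minimal sketch, assuming the payload has a top-level "listen queue" field and a "processes" list with a "state" per worker; the function name and the sample payload are made up for illustration:

```php
<?php
declare(strict_types=1);

/**
 * Summarise a PHP-FPM status payload fetched with "?full&json".
 *
 * @return array{queue: int, states: array<string, int>}
 */
function summariseFpmStatus(string $json): array
{
    $status = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

    // Count workers per state ("Running", "Reading headers", "Idle", ...).
    $states = [];
    foreach ($status['processes'] ?? [] as $process) {
        $state = $process['state'] ?? 'unknown';
        $states[$state] = ($states[$state] ?? 0) + 1;
    }
    arsort($states); // most common state first

    return [
        'queue'  => (int) ($status['listen queue'] ?? 0),
        'states' => $states,
    ];
}

// Trimmed, made-up payload; real FPM output carries many more fields.
$sample = '{"listen queue": 3, "processes": [
    {"pid": 101, "state": "Running"},
    {"pid": 102, "state": "Running"},
    {"pid": 103, "state": "Reading headers"}
]}';

$summary = summariseFpmStatus($sample);
printf("queue=%d\n", $summary['queue']);
foreach ($summary['states'] as $state => $count) {
    printf("%-16s %d\n", $state, $count);
}
```

A non-zero queue with almost no Idle workers is the pattern we were looking at.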

But why? This wasn't record traffic. We'd handled similar loads before without issues.

I ran strace on a few stuck workers:

flock(14, LOCK_EX)  = 0
write(14, "...")
flock(14, LOCK_UN)

File lock. On session files.

How PHP session locking works by default

PHP's default session handler (session.save_handler = files) stores each user session as an individual file and takes an exclusive file lock for the duration of every request that touches that session. One request holds the lock; any parallel request for the same user waits.

When your code calls session_start(), PHP takes an exclusive lock (LOCK_EX) on that file. The lock holds until the session closes — either via session_write_close() or when the script finishes.

For a single sequential request per user, that's fine. But in practice:

  • A user loads a catalog page → that's 1 main request + 3-5 AJAX calls in parallel (cart widget, recommendations, filters)
  • All requests share the same session ID → same file
  • PHP-FPM runs them concurrently → all try to grab LOCK_EX on the same file
  • One wins. The rest wait.
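How long each request holds that lock is partly under the application's control. A minimal sketch of the two standard ways to shorten it, assuming the endpoint only needs a few session values; the helper names are hypothetical, while session_write_close() and the read_and_close option of session_start() are standard PHP:

```php
<?php
declare(strict_types=1);

/**
 * Read-only access: load the session, copy what we need, release the lock.
 * With read_and_close, PHP closes the session right after reading it,
 * so the lock is not held while the rest of the request runs.
 *
 * @param string[] $keys
 * @return array<string, mixed>
 */
function readSessionAndRelease(array $keys): array
{
    session_start(['read_and_close' => true]);

    $snapshot = [];
    foreach ($keys as $key) {
        $snapshot[$key] = $_SESSION[$key] ?? null;
    }
    return $snapshot;
}

/**
 * Write access: hold the lock only while mutating the session.
 *
 * @param array<string, mixed> $changes
 */
function updateSessionAndRelease(array $changes): void
{
    session_start();                 // takes the exclusive lock
    foreach ($changes as $key => $value) {
        $_SESSION[$key] = $value;
    }
    session_write_close();           // persists the data and releases the lock
}

// Typical use at the top of an AJAX endpoint, before any slow work:
//   $user = readSessionAndRelease(['user_id', 'locale']);
//   ... database queries, API calls, rendering ...
```

This narrows the window but doesn't remove the lock, and it only helps in code you control.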

That day we had 340 concurrent users, each one running a modern frontend with parallel AJAX. At roughly four requests per page load, that's 340 × 4 ≈ 1,360 "simultaneous" PHP requests, half of them blocked on file locks.

That's what killed FPM. Not the load — the queuing.
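A back-of-the-envelope model shows why queuing costs more than it looks. It assumes all of a user's requests arrive at once, each needs the session for its whole duration, and each does the same amount of real work; the 200 ms figure is illustrative, not something we measured:

```php
<?php
declare(strict_types=1);

/**
 * Worker time (ms) consumed by one page load that fires $requests parallel
 * requests, each needing $serviceMs of real work.
 */
function workerMsPerPageLoad(int $requests, int $serviceMs, bool $sessionLocked): int
{
    if (!$sessionLocked) {
        // No lock: every request occupies a worker only while it works.
        return $requests * $serviceMs;
    }

    // With a per-session lock the requests run one after another, but each
    // one occupies a worker from the start: the k-th request waits for
    // k - 1 predecessors and then works, holding its worker for k * serviceMs.
    $total = 0;
    for ($k = 1; $k <= $requests; $k++) {
        $total += $k * $serviceMs;
    }
    return $total;
}

$unlocked = workerMsPerPageLoad(4, 200, false); // 800
$locked   = workerMsPerPageLoad(4, 200, true);  // 200 + 400 + 600 + 800 = 2000

printf("unlocked: %d ms, locked: %d ms\n", $unlocked, $locked);
```

Under these assumptions the same page load burns 2.5 times the worker capacity, and most of that is workers sitting in flock() doing nothing.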

Why it doesn't show up on a normal day

Below about 150 concurrent users, requests finish fast enough that the queue never gets long. The lock contention exists, but it clears before it compounds.

Once you cross a threshold — especially with SPA frontends that fire several parallel requests per page load — the effect becomes self-reinforcing. Workers are busy, new requests wait, users retry, the queue grows.

I ran lsof | grep sess_ and saw several thousand open file descriptors on session files. That's when it became obvious.

Switching to Redis sessions

We were already using Redis for Bitrix's page cache. Adding it as the session handler took about 20 minutes.

In php.ini (or inside a php-fpm.d/*.conf pool file):

session.save_handler = redis
session.save_path = "tcp://127.0.0.1:6379?weight=1&timeout=2"

For Bitrix specifically: the framework uses PHP's native session_start() by default, so changing php.ini is enough. Check bitrix/.settings.php — if there's no custom 'session' key defined, you're good.

One thing to get right: don't share the same Redis database for sessions and cache. A stray FLUSHDB on the cache DB will log out every user.

; sessions on database 1, cache stays on database 0
session.save_path = "tcp://127.0.0.1:6379?weight=1&timeout=2&database=1"
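It's easy to edit the wrong ini file, so it's worth verifying what PHP-FPM actually loaded. A small sketch of such a check; the function name is hypothetical and it assumes the cache lives on database 0, as in our setup:

```php
<?php
declare(strict_types=1);

/**
 * Sanity-check the effective session settings.
 *
 * @return string[] human-readable warnings; empty means nothing obvious is wrong
 */
function checkRedisSessionConfig(string $handler, string $savePath, int $cacheDb = 0): array
{
    if ($handler !== 'redis') {
        return ["session.save_handler is '{$handler}', expected 'redis'"];
    }

    // Pull the query string out of e.g. "tcp://127.0.0.1:6379?timeout=2&database=1".
    $query = [];
    parse_str((string) parse_url($savePath, PHP_URL_QUERY), $query);

    // Redis selects database 0 when the save_path doesn't name one.
    $sessionDb = (int) ($query['database'] ?? 0);

    $warnings = [];
    if ($sessionDb === $cacheDb) {
        $warnings[] = "sessions and cache share Redis database {$sessionDb}";
    }
    return $warnings;
}

// Run this through FPM (not the CLI, which may read a different php.ini):
//   $warnings = checkRedisSessionConfig(
//       (string) ini_get('session.save_handler'),
//       (string) ini_get('session.save_path')
//   );
```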

The parts that took an extra hour to figure out

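The phpredis session handler doesn't use file locks. Its locking is optional (redis.session.locking_enabled, off by default); when enabled, it is implemented via SET NX with a TTL, and a waiting request retries a bounded number of times rather than blocking on flock() until the holder finishes. With locking off, parallel requests for one session don't wait at all, but the last write wins. Either way the open-ended queue goes away, but locking introduces a subtler issue: if a long-running request holds a session open, the lock TTL can expire before the request finishes.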

redis.session.lock_expire controls this. We set it to 30 seconds, which covers our heaviest requests comfortably.
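To make the expiry problem concrete, here is a sketch of SET NX with a TTL using an in-memory stand-in instead of a real Redis connection. The class, key name, and tokens are hypothetical and this is not how phpredis is implemented internally; it only mimics the semantics, with time passed in explicitly so the sequence is easy to follow:

```php
<?php
declare(strict_types=1);

/** In-memory stand-in for Redis "SET key token NX EX ttl" (illustration only). */
final class FakeLockStore
{
    /** @var array<string, array{token: string, expiresAt: int}> */
    private $locks = [];

    /** Try to take the lock at time $now (seconds). True if acquired. */
    public function setNxEx(string $key, string $token, int $ttl, int $now): bool
    {
        $held = $this->locks[$key] ?? null;
        if ($held !== null && $held['expiresAt'] > $now) {
            return false; // a live lock exists, NX refuses to overwrite it
        }
        $this->locks[$key] = ['token' => $token, 'expiresAt' => $now + $ttl];
        return true;
    }

    /** Release only if the caller still owns a live lock. */
    public function release(string $key, string $token, int $now): bool
    {
        $held = $this->locks[$key] ?? null;
        if ($held === null || $held['token'] !== $token || $held['expiresAt'] <= $now) {
            return false;
        }
        unset($this->locks[$key]);
        return true;
    }
}

$store = new FakeLockStore();
$key   = 'session:abc:lock';

$aGotLock  = $store->setNxEx($key, 'request-A', 30, 0);  // A locks at t=0
$bBlocked  = $store->setNxEx($key, 'request-B', 30, 10); // t=10: A still holds it
$bGotLock  = $store->setNxEx($key, 'request-B', 30, 31); // t=31: A's TTL ran out
$aReleased = $store->release($key, 'request-A', 45);     // A finishes at t=45

var_dump($aGotLock, $bBlocked, $bGotLock, $aReleased);
```

Request A is still running at t=31, yet B now holds the lock and both write the same session. That is the case a generous lock_expire is meant to avoid.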

The other thing: session TTL isn't a separate Redis setting. phpredis expires each session key after session.gc_maxlifetime seconds, so the value in php.ini has to be the lifetime you actually want:

session.gc_maxlifetime = 1440
; lock_expire only applies when redis.session.locking_enabled = 1
redis.session.lock_expire = 30

And make sure Redis persistence is on (appendonly yes in redis.conf). With persistence disabled, a Redis restart wipes all sessions and logs everyone out; with RDB snapshots alone, you lose whatever changed since the last snapshot. Acceptable in development, not in production.

Before and after

Same server. Same traffic volume. No hardware changes between measurements.

| Metric | File sessions (before) | Redis sessions (after) |
|--------|------------------------|------------------------|
| p95 latency, catalog pages | 3.8s | 0.7s |
| Peak FPM workers occupied | 127 / 128 | 38–42 / 128 |
| 504 errors during promotion | ~1,100 | 0 |

The server handled the next promotional spike without incident.

What's on my checklist for every new PHP project

session.save_handler is now in the first batch of things I check when I come into a new Bitrix project. Same batch as OPcache hit ratio and N+1 queries in iblock calls. All three look fine on light traffic and blow up under load.

Same pattern with OPcache misconfiguration — I covered that case in a separate article.

File sessions aren't broken. They're a reasonable default for a site with 20 concurrent users. Once you're past ~200 simultaneous users with a modern frontend that fires parallel requests, session.save_handler = files is the same kind of quiet problem as OPcache without revalidate_freq — fine until it isn't.

Redis solves this specific problem completely. Just keep sessions in a separate database (or instance) from your cache.