From Polling to Event-Driven: A 70x Throughput Rewrite
March 4, 2026
We started noticing that our PostgreSQL database was getting pinned at 70–95% CPU for long periods of time. Not spikes, but sustained load that began affecting unrelated queries.
Looking at the active queries, we saw processes constantly streaming rows. The load wasn’t from one expensive query, but from something running continuously.
That eventually led back to a cron-driven pipeline.
The Old Pipeline
The system was split across a few services:
- Cron service → triggered processing every 30 minutes
- Versioning service → handled version creation and publishing
- Product + User services → provided data needed for processing
The flow looked like this:
- cron service scans the product table
- filters records to process
- calls the versioning service per record
- versioning service fetches product + user data
- performs versioning logic
- writes back to the database
This worked fine at smaller scale. It didn’t hold up once the dataset grew.
Where It Started Breaking
The issue wasn’t a single slow query. It was how the system did work.
- Every run scanned millions of rows, even if only a small subset had changed
- Processing was sequential across services
- Each record triggered the following:
  - one HTTP call (cron → versioning)
  - multiple DB reads (product + user)
  - multiple DB writes
For ~10,000 records, this meant:
- 10,000 HTTP calls
- ~60,000+ database operations
Throughput stayed around ~1 record per second.
At that rate:
- a single batch of ~10,000 records took close to three hours (10,000 s ≈ 2.8 h)
- new cron runs started before previous ones finished
- load stacked instead of resetting
That’s when the database started getting pinned.
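The stacking effect follows directly from the numbers above. As a rough back-of-the-envelope sketch (assuming every run handles about 10,000 records and, optimistically, that overlapping runs do not slow each other down; the function names are illustrative only):

```python
import math

RECORDS_PER_RUN = 10_000    # batch size quoted above
THROUGHPUT_PER_SEC = 1.0    # observed ~1 record per second
CRON_INTERVAL_MIN = 30      # cron fired every 30 minutes


def run_duration_hours(records: int, per_sec: float) -> float:
    """Wall-clock time for one run at a fixed throughput."""
    return records / per_sec / 3600


def overlapping_runs(records: int, per_sec: float, interval_min: int) -> int:
    """Runs in flight at once in steady state (never fewer than one)."""
    duration_min = records / per_sec / 60
    return max(1, math.ceil(duration_min / interval_min))


hours = run_duration_hours(RECORDS_PER_RUN, THROUGHPUT_PER_SEC)
in_flight = overlapping_runs(RECORDS_PER_RUN, THROUGHPUT_PER_SEC, CRON_INTERVAL_MIN)
print(f"{hours:.1f} h per run, up to {in_flight} runs in flight")
# 2.8 h per run, up to 6 runs in flight
```

Under these assumptions a run never finishes before the next several start, which matches the sustained (rather than spiky) load on the database.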
What Changed
OLD (Cron-based, per-record)
Cron Trigger (30 min interval)
↓
Cron Service
↓
Product Table (scan)
↓
Versioning Service (per record)
↓
Product + User Services
↓
Database Writes
- Repeated scans
- 1 request per record
- High DB + network overhead
NEW (Event-driven, batched)
Product Update
↓
Queue Table
↓
Cron Worker (batch consumer)
↓
Versioning Service
↓
Product + User Services (bulk reads)
↓
Database Writes (batched)
- No table scans
- Batched processing
- Parallel workers
The fix was not to optimize inside the loop, but to remove the loop entirely.
Work is no longer discovered by scanning, but captured at the time of change and processed in batches.
Queue (Triggering Work)
When a product becomes eligible, it is inserted into a queue table at write time (via a database trigger).
Instead of scanning large tables, the cron service now pulls only pending records from the queue.
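A minimal sketch of the idea, not the production schema. To stay self-contained it uses SQLite (via Python's `sqlite3`) rather than PostgreSQL, where the trigger would instead call a trigger function. The table names, the `eligible` flag, and the trigger name are all assumptions made for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE product (
    id       INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    eligible INTEGER NOT NULL DEFAULT 0   -- hypothetical eligibility flag
);

CREATE TABLE version_queue (
    product_id  INTEGER PRIMARY KEY,      -- at most one pending row per product
    enqueued_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- Capture work at write time instead of discovering it by scanning.
CREATE TRIGGER enqueue_eligible_product
AFTER UPDATE ON product
WHEN NEW.eligible = 1
BEGIN
    INSERT OR IGNORE INTO version_queue (product_id) VALUES (NEW.id);
END;
"""


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def pending_ids(conn: sqlite3.Connection) -> list:
    """Read only the queue; the product table is never scanned."""
    rows = conn.execute("SELECT product_id FROM version_queue ORDER BY product_id")
    return [row[0] for row in rows]


conn = make_db()
conn.executemany(
    "INSERT INTO product (id, name) VALUES (?, ?)",
    [(1, "a"), (2, "b"), (3, "c")],
)
conn.execute("UPDATE product SET eligible = 1 WHERE id = 2")
conn.execute("UPDATE product SET name = 'b2' WHERE id = 2")  # duplicate enqueue is ignored
print(pending_ids(conn))  # [2]
```

Keying the queue on the product ID keeps repeated updates from piling up duplicate work, so the queue size tracks the number of changed products rather than the number of writes.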
Batching (Reducing Cross-Service Overhead)
Previously, the cron service called the versioning service once per record.
This was replaced with batched processing:
- the cron service sends batches of 150 IDs
- the versioning service processes the entire batch in one request
This reduces both network overhead and repeated database access.
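To make the reduction concrete, here is a small sketch of the chunking step on the cron side. The helper name `batches` is hypothetical; only the batch size of 150 comes from the text above.

```python
from typing import Iterator, Sequence

BATCH_SIZE = 150  # batch size quoted above


def batches(ids: Sequence[int], size: int = BATCH_SIZE) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of at most `size` IDs."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


pending = list(range(1, 10_001))   # 10,000 pending product IDs
calls = list(batches(pending))
print(len(calls), len(calls[-1]))  # 67 100
```

For 10,000 pending records this means 67 requests to the versioning service instead of 10,000.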
Concurrency (Parallelizing Safely)
Multiple workers process the queue in parallel using:
FOR UPDATE SKIP LOCKED
Each worker claims a batch and processes it independently. Rows locked by one worker are skipped by the others, so workers run concurrently without an external coordinator and without duplicating work.
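A hedged sketch of what the claim step could look like. The table and column names (`version_queue`, `product_id`, `enqueued_at`) are assumptions, and `cur` stands for any DB-API cursor that uses `%s` placeholders, as PostgreSQL drivers such as psycopg do.

```python
from typing import Any, List

# Lock up to N pending rows; rows already locked by another worker are
# skipped instead of waited on. The locks last until the transaction ends.
CLAIM_SQL = """
SELECT product_id
FROM version_queue
ORDER BY enqueued_at
LIMIT %s
FOR UPDATE SKIP LOCKED
"""


def claim_batch(cur: Any, batch_size: int = 150) -> List[int]:
    """Return the product IDs this worker now holds row locks on.

    Must run inside an open transaction. Processing the batch and deleting
    its queue rows should happen before that transaction commits.
    """
    cur.execute(CLAIM_SQL, (batch_size,))
    return [row[0] for row in cur.fetchall()]
```

Because the row locks are released when the transaction ends, a worker that crashes mid-batch simply rolls back and its rows become claimable again, which is one reason this pattern needs no separate "in progress" bookkeeping.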
Bulk Processing Across Services (Doing Work Once)
Inside the versioning service, the workflow is no longer record-by-record:
- product data is fetched in bulk
- user data is fetched in bulk
- versioning logic runs in-memory across the batch
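As an illustration of the bulk-read step, the sketch below fetches every product in a batch with one query instead of one query per ID. It uses an in-memory SQLite table so it runs on its own; the `product` columns and the function name are assumptions. With PostgreSQL and psycopg, the list would more typically be passed as a single array parameter with `WHERE id = ANY(%s)`.

```python
import sqlite3
from typing import Dict, Sequence, Tuple


def fetch_products_bulk(
    conn: sqlite3.Connection, ids: Sequence[int]
) -> Dict[int, Tuple[int, str, int]]:
    """Fetch all requested products with a single query, keyed by ID."""
    if not ids:
        return {}
    placeholders = ", ".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, name, owner_id FROM product WHERE id IN ({placeholders})",
        list(ids),
    ).fetchall()
    return {row[0]: row for row in rows}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, owner_id INTEGER)")
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?)",
    [(1, "a", 10), (2, "b", 10), (3, "c", 11)],
)

products = fetch_products_bulk(conn, [1, 3, 99])  # 99 does not exist
print(sorted(products))  # [1, 3]

# The distinct owner IDs can then feed one bulk user lookup of the same shape.
owner_ids = sorted({row[2] for row in products.values()})
print(owner_ids)  # [10, 11]
```

Keeping the results in dictionaries keyed by ID lets the per-record logic run as plain in-memory lookups across the batch.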
Database writes are also grouped.
All operations for a batch are executed within a single transaction:
- insert versions
- update products
- upsert published data
- remove processed queue rows
This reduces write amplification and transaction overhead: hundreds of operations are committed as a single batch instead of as individual transactions. The per-record transactions had contributed to the sustained CPU load described earlier.
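The four write steps above can be sketched as one transaction. This is an illustration under assumptions, not the actual service code: it uses SQLite so it runs standalone, the schema and the `write_batch` name are hypothetical, and `executemany` stands in for the set-based statements a PostgreSQL implementation would more likely use.

```python
import sqlite3
from typing import List, Tuple

SCHEMA = """
CREATE TABLE product (id INTEGER PRIMARY KEY, current_version INTEGER NOT NULL DEFAULT 0);
CREATE TABLE version (product_id INTEGER NOT NULL, number INTEGER NOT NULL);
CREATE TABLE published (product_id INTEGER PRIMARY KEY, version INTEGER NOT NULL);
CREATE TABLE version_queue (product_id INTEGER PRIMARY KEY);
"""


def write_batch(conn: sqlite3.Connection, new_versions: List[Tuple[int, int]]) -> None:
    """Apply (product_id, version_number) pairs computed in memory.

    `with conn` commits once at the end, or rolls everything back on error,
    so a failed batch leaves its queue rows in place for a retry.
    """
    with conn:
        conn.executemany(
            "INSERT INTO version (product_id, number) VALUES (?, ?)", new_versions
        )
        conn.executemany(
            "UPDATE product SET current_version = ? WHERE id = ?",
            [(number, pid) for pid, number in new_versions],
        )
        conn.executemany(
            "INSERT INTO published (product_id, version) VALUES (?, ?) "
            "ON CONFLICT(product_id) DO UPDATE SET version = excluded.version",
            new_versions,
        )
        conn.executemany(
            "DELETE FROM version_queue WHERE product_id = ?",
            [(pid,) for pid, _ in new_versions],
        )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO product (id) VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO version_queue (product_id) VALUES (?)", [(1,), (2,), (3,)])
conn.execute("INSERT INTO published (product_id, version) VALUES (1, 1)")
conn.commit()

write_batch(conn, [(1, 2), (2, 1)])
print(conn.execute("SELECT product_id FROM version_queue").fetchall())  # [(3,)]
print(conn.execute("SELECT * FROM published ORDER BY product_id").fetchall())  # [(1, 2), (2, 1)]
```

Deleting the queue rows in the same transaction as the writes means a product is either fully processed and dequeued, or untouched and still pending.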
Measured Impact
Measured before and after moving from cron-based polling to queue-driven batching.
| Metric | Before | After |
|---|---|---|
| Execution | Cron + scan | Queue + batch |
| Service calls | 1 / record | 1 / batch |
| DB operations | ~6–8 / record | ~4 / batch |
| Throughput | ~1/sec | ~70/sec |
| Scaling | Table size | Change volume |
The shift to queue-driven batching increased throughput by ~70x while reducing database load and cross-service overhead.
What I Learned
- Polling does not scale with data size
  Time-based scans repeatedly process the same dataset, even when little has changed. This leads to unnecessary load and poor scalability.
- Work should be captured at the time of change
  Moving from periodic scanning to change-driven triggers eliminates the need to search for work.
- Batching reduces overhead across every layer
  Network calls, database operations, and transactions become significantly cheaper when work is grouped instead of processed per record.
- Throughput is often limited by system design, not query performance
  The bottleneck was not a slow query, but the way work was distributed and executed.
- Concurrency needs coordination primitives
  Using FOR UPDATE SKIP LOCKED allowed safe parallel processing without duplicate work or contention.
- Cross-service boundaries amplify inefficiency
  Per-record service calls multiplied latency and load. Batch-based communication reduced this drastically.
- Write amplification matters at scale
  Grouping writes into a single transaction reduced database pressure and improved consistency.
- Scaling should depend on change volume, not table size
  Systems that scale with total data size degrade over time. Systems that scale with actual changes remain stable.
Closing
This wasn’t a database problem. It was a work distribution problem.
Once the system stopped scanning and started reacting to changes, both performance and scalability followed naturally.