From Polling to Event-Driven: A 70x Throughput Rewrite

March 4, 2026

Event-Driven, Queues, Batching, PostgreSQL

We started noticing that our PostgreSQL database was getting pinned at 70–95% CPU for long periods of time. Not spikes, but sustained load that began affecting unrelated queries.

The active queries showed processes constantly streaming rows. The load came not from one expensive query, but from something running continuously.

That eventually led back to a cron-driven pipeline.


The Old Pipeline

The system was split across a few services:

  • Cron service → triggered processing every 30 minutes
  • Versioning service → handled version creation and publishing
  • Product + User services → provided data needed for processing

The flow looked like this:

  • cron service scans the product table
  • filters records to process
  • calls the versioning service per record
  • versioning service fetches product + user data
  • performs versioning logic
  • writes back to the database

This worked fine at smaller scale. It didn’t hold up once the dataset grew.


Where It Started Breaking

The issue wasn’t a single slow query. It was how the system did work.

  • Every run scanned millions of rows, even if only a small subset had changed

  • Processing was sequential across services

  • Each record triggered the following:

    • one HTTP call (cron → versioning)
    • multiple DB reads (product + user)
    • multiple DB writes

For ~10,000 records, this meant:

  • 10,000 HTTP calls
  • ~60,000+ database operations

Throughput stayed at roughly 1 record per second.

At that rate:

  • a single run took close to three hours (~10,000 records at ~1 record per second)
  • new cron runs started before previous ones finished
  • load stacked instead of resetting

That’s when the database started getting pinned.
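
To make the stacking concrete, here is a back-of-the-envelope calculation using only the figures quoted above. It is a sketch under simplifying assumptions: every run handles about 10,000 records, and overlapping runs do not slow each other down (in practice they would, which makes the real picture worse).

```python
import math

RECORDS_PER_RUN = 10_000      # from the text: ~10,000 records
THROUGHPUT_PER_SEC = 1.0      # from the text: ~1 record per second
CRON_INTERVAL_SEC = 30 * 60   # from the text: cron fires every 30 minutes
DB_OPS_PER_RECORD = 6         # lower bound implied by ~60,000 ops / 10,000 records

run_duration_sec = RECORDS_PER_RUN / THROUGHPUT_PER_SEC
run_duration_hours = run_duration_sec / 3600

# A new run starts every interval while earlier ones are still working,
# so this many runs can be active at the same time.
overlapping_runs = math.ceil(run_duration_sec / CRON_INTERVAL_SEC)

db_ops_per_run = RECORDS_PER_RUN * DB_OPS_PER_RECORD

print(f"run duration: {run_duration_hours:.1f} h")           # 2.8 h
print(f"runs active at once: up to {overlapping_runs}")      # 6
print(f"DB operations per run: {db_ops_per_run:,}")          # 60,000
```

With a run lasting about 2.8 hours and a new one starting every 30 minutes, up to six runs can be scanning and writing at the same time.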


What Changed

OLD (Cron-based, per-record)

Cron Trigger (30-minute interval)
  ↓
Cron Service
  ↓
Product Table (scan)
  ↓
Versioning Service (per record)
  ↓
Product + User Services
  ↓
Database Writes

  • Repeated scans
  • 1 request per record
  • High DB + network overhead

NEW (Event-driven, batched)

Product Update
  ↓
Queue Table
  ↓
Cron Worker (batch consumer)
  ↓
Versioning Service
  ↓
Product + User Services (bulk reads)
  ↓
Database Writes (batched)

  • No table scans
  • Batched processing
  • Parallel workers

The fix was not to optimize inside the loop, but to remove the loop entirely.

Work is no longer discovered by scanning, but captured at the time of change and processed in batches.


Queue (Triggering Work)

When a product becomes eligible, it is inserted into a queue table at write time (via a database trigger).

Instead of scanning large tables, the cron service now pulls only pending records from the queue.
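
The pattern can be sketched as follows. This is not the production schema: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so the example runs anywhere, and the table names (products, version_queue), the needs_version eligibility flag, and the trigger name are all hypothetical. In PostgreSQL the same idea would be a trigger function that inserts with ON CONFLICT DO NOTHING.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    needs_version INTEGER NOT NULL DEFAULT 0  -- hypothetical eligibility flag
);

CREATE TABLE version_queue (
    product_id INTEGER PRIMARY KEY,           -- at most one pending row per product
    enqueued_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- Capture work at write time: enqueue whenever an update leaves a product eligible.
CREATE TRIGGER enqueue_on_update
AFTER UPDATE ON products
WHEN NEW.needs_version = 1
BEGIN
    INSERT OR IGNORE INTO version_queue (product_id) VALUES (NEW.id);
END;
""")

conn.executemany(
    "INSERT INTO products (id, name) VALUES (?, ?)",
    [(1, "lamp"), (2, "desk"), (3, "sofa")],
)
conn.execute("UPDATE products SET needs_version = 1 WHERE id IN (1, 3)")
conn.execute("UPDATE products SET name = 'lamp v2' WHERE id = 1")  # already queued, ignored

# The consumer reads only pending work; it never scans the products table.
pending = [row[0] for row in conn.execute(
    "SELECT product_id FROM version_queue ORDER BY product_id"
)]
print(pending)  # [1, 3]
```

The primary key on the queue table keeps repeated updates to the same product from piling up duplicate work.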


Batching (Reducing Cross-Service Overhead)

Previously, the cron service called the versioning service once per record.

This was replaced with batched processing:

  • the cron service sends product IDs in batches of 150
  • the versioning service processes the entire batch in one request

This reduces both network overhead and repeated database access.
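
A minimal sketch of the batching step, assuming the pending IDs are already in memory. The helper names and the payload shape ({"product_ids": [...]}) are illustrative, not the actual API between the two services.

```python
from typing import Dict, Iterator, List, Sequence

BATCH_SIZE = 150  # from the text


def chunked(ids: Sequence[int], size: int = BATCH_SIZE) -> Iterator[List[int]]:
    """Yield consecutive slices of at most `size` IDs."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def build_batch_payload(batch: List[int]) -> Dict[str, List[int]]:
    """Hypothetical request body for one call to the versioning service."""
    return {"product_ids": batch}


pending_ids = list(range(1, 10_001))  # 10,000 pending records
batches = list(chunked(pending_ids))
payloads = [build_batch_payload(b) for b in batches]

print(len(batches), len(batches[0]), len(batches[-1]))  # 67 150 100
```

For 10,000 pending records this turns 10,000 service calls into 67.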


Concurrency (Parallelizing Safely)

Multiple workers process the queue in parallel using:

FOR UPDATE SKIP LOCKED

Each worker claims a batch and processes it independently. Rows locked by one worker are skipped by the others, so workers run concurrently without an external coordinator and without duplicating work.
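
A claim step along these lines might look like the sketch below. The table and column names (version_queue, product_id, enqueued_at) are assumptions, and the %s placeholder assumes a psycopg-style PostgreSQL driver. The function accepts any DB-API cursor, so the driver itself is not imported here.

```python
from typing import Any, List

# Hypothetical queue schema: version_queue(product_id, enqueued_at).
CLAIM_SQL = """
SELECT product_id
FROM version_queue
ORDER BY enqueued_at
LIMIT %s
FOR UPDATE SKIP LOCKED
"""


def claim_batch(cur: Any, batch_size: int = 150) -> List[int]:
    """Lock and return up to `batch_size` pending product IDs.

    Must be called inside an open transaction: the row locks are held
    until that transaction commits or rolls back. Rows already locked
    by another worker are skipped instead of waited on.
    """
    cur.execute(CLAIM_SQL, (batch_size,))
    return [row[0] for row in cur.fetchall()]
```

Because the locks last until the end of the transaction, the worker should do its writes and delete the claimed queue rows in that same transaction. If the worker crashes, the transaction rolls back and the rows become claimable again.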


Bulk Processing Across Services (Doing Work Once)

Inside the versioning service, the workflow is no longer record-by-record:

  • product data is fetched in bulk
  • user data is fetched in bulk
  • versioning logic runs in-memory across the batch
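
The shape of that workflow, sketched with in-memory stand-ins for the Product and User services. The fetch functions, the owner_id field, and the output record are hypothetical, and the real versioning logic is not shown in this post, so a placeholder takes its place.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def process_batch(
    product_ids: List[int],
    fetch_products: Callable[[List[int]], Dict[int, Row]],
    fetch_users: Callable[[List[int]], Dict[int, Row]],
) -> List[Row]:
    """Build version records for a whole batch with two bulk reads."""
    products = fetch_products(product_ids)                        # one bulk read
    owner_ids = sorted({p["owner_id"] for p in products.values()})
    users = fetch_users(owner_ids)                                # one bulk read

    versions: List[Row] = []
    for pid in product_ids:
        product = products.get(pid)
        if product is None:
            continue  # product no longer exists; nothing to version
        owner = users.get(product["owner_id"], {})
        # Placeholder for the real versioning logic.
        versions.append({
            "product_id": pid,
            "name": product["name"],
            "owner_name": owner.get("name"),
        })
    return versions


# In-memory stand-ins that count how often each service is called.
calls = {"products": 0, "users": 0}
PRODUCTS = {
    1: {"name": "lamp", "owner_id": 10},
    2: {"name": "desk", "owner_id": 10},
    3: {"name": "sofa", "owner_id": 11},
}
USERS = {10: {"name": "ana"}, 11: {"name": "raj"}}


def fake_fetch_products(ids: List[int]) -> Dict[int, Row]:
    calls["products"] += 1
    return {i: PRODUCTS[i] for i in ids if i in PRODUCTS}


def fake_fetch_users(ids: List[int]) -> Dict[int, Row]:
    calls["users"] += 1
    return {i: USERS[i] for i in ids if i in USERS}


versions = process_batch([1, 2, 3, 4], fake_fetch_products, fake_fetch_users)
print(len(versions), calls)  # 3 {'products': 1, 'users': 1}
```

The number of reads stays at two per batch no matter how many IDs the batch contains.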

Database writes are also grouped.

All operations for a batch are executed within a single transaction:

  • insert versions
  • update products
  • upsert published data
  • remove processed queue rows

This reduces write amplification and transaction overhead: hundreds of operations are committed together instead of as individual transactions. Those per-record transactions were one contributor to the sustained CPU load described earlier.
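
A runnable sketch of the four grouped writes, again using sqlite3 as a stand-in for PostgreSQL. The schema and the write_batch helper are hypothetical. INSERT OR REPLACE plays the role of the upsert here; in PostgreSQL that would typically be INSERT ... ON CONFLICT ... DO UPDATE.

```python
import sqlite3
from typing import List, Tuple

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, current_version INTEGER NOT NULL DEFAULT 0);
CREATE TABLE versions (
    product_id INTEGER NOT NULL,
    version INTEGER NOT NULL,
    PRIMARY KEY (product_id, version)
);
CREATE TABLE published (product_id INTEGER PRIMARY KEY, version INTEGER NOT NULL);
CREATE TABLE version_queue (product_id INTEGER PRIMARY KEY);
INSERT INTO products (id) VALUES (1), (2), (3);
INSERT INTO version_queue (product_id) VALUES (1), (2), (3);
""")


def write_batch(conn: sqlite3.Connection, new_versions: List[Tuple[int, int]]) -> None:
    """Apply all writes for a batch of (product_id, version) pairs atomically."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.executemany(
            "INSERT INTO versions (product_id, version) VALUES (?, ?)",
            new_versions,
        )
        conn.executemany(
            "UPDATE products SET current_version = ? WHERE id = ?",
            [(version, pid) for pid, version in new_versions],
        )
        conn.executemany(
            "INSERT OR REPLACE INTO published (product_id, version) VALUES (?, ?)",
            new_versions,
        )
        conn.executemany(
            "DELETE FROM version_queue WHERE product_id = ?",
            [(pid,) for pid, _ in new_versions],
        )


write_batch(conn, [(1, 1), (2, 1)])
remaining = [row[0] for row in conn.execute("SELECT product_id FROM version_queue")]
print(remaining)  # [3]
```

If any statement fails, the whole batch rolls back and its queue rows stay in place, so the work is retried rather than lost.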


Measured Impact

The figures below were measured before and after the move from cron-based polling to queue-driven batching.

Metric         | Before          | After
Execution      | Cron + scan     | Queue + batch
Service calls  | 1 / record      | 1 / batch
DB operations  | ~6–8 / record   | ~4 / batch
Throughput     | ~1 record/sec   | ~70 records/sec
Scales with    | Table size      | Change volume

The shift to queue-driven batching increased throughput by ~70x while reducing database load and cross-service overhead.
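
To put those per-record and per-batch figures side by side, the calculation below applies them to a 10,000-record workload with batches of 150. It is illustrative arithmetic on the numbers quoted in this post, not a separate measurement, and it reads "~4 / batch" as roughly four database statements per batch.

```python
import math

RECORDS = 10_000
BATCH_SIZE = 150

# Service calls: one per record before, one per batch after.
calls_before = RECORDS
calls_after = math.ceil(RECORDS / BATCH_SIZE)

# Database operations: ~6-8 per record before, ~4 per batch after.
db_ops_before_low = RECORDS * 6
db_ops_before_high = RECORDS * 8
db_ops_after = calls_after * 4

# Wall-clock time at the measured throughputs (records per second).
seconds_before = RECORDS / 1
seconds_after = RECORDS / 70
speedup = seconds_before / seconds_after

print(f"service calls: {calls_before:,} -> {calls_after}")
print(f"DB operations: {db_ops_before_low:,}-{db_ops_before_high:,} -> {db_ops_after}")
print(f"duration: {seconds_before / 3600:.1f} h -> {seconds_after / 60:.1f} min")
print(f"speedup: ~{round(speedup)}x")
```

On these assumptions the same workload drops from about 2.8 hours to about 2.4 minutes, with 67 service calls instead of 10,000.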


What I Learned

  • Polling does not scale with data size
    Time-based scans repeatedly process the same dataset, even when little has changed. This leads to unnecessary load and poor scalability.

  • Work should be captured at the time of change
    Moving from periodic scanning to change-driven triggers eliminates the need to search for work.

  • Batching reduces overhead across every layer
    Network calls, database operations, and transactions become significantly cheaper when work is grouped instead of processed per record.

  • Throughput is often limited by system design, not query performance
    The bottleneck was not a slow query, but the way work was distributed and executed.

  • Concurrency needs coordination primitives
    Using FOR UPDATE SKIP LOCKED allowed safe parallel processing without duplicate work or contention.

  • Cross-service boundaries amplify inefficiency
    Per-record service calls multiplied latency and load. Batch-based communication reduced this drastically.

  • Write amplification matters at scale
    Grouping writes into a single transaction reduced database pressure and improved consistency.

  • Scaling should depend on change volume, not table size
    Systems that scale with total data size degrade over time. Systems that scale with actual changes remain stable.


Closing

This wasn’t a database problem. It was a work distribution problem.

Once the system stopped scanning and started reacting to changes, both performance and scalability followed naturally.