
Pagination at Scale: Reliable Strategies for Large-Dataset Collection

How to collect complete, accurate data from paginated APIs and web sources without drift, duplication, or silent gaps.

March 21, 2026 • 11 min read • By ValenTech Engineering Team

Pagination is treated as a solved problem because the basic pattern is simple: request page one, get results, request page two, repeat until there are no more results. The simplicity of this pattern obscures a set of failure modes that only appear at scale — when the dataset is large, when collection runs frequently, when the source is actively updated, or when something goes wrong mid-run.

A collection system that works correctly on a 500-item dataset running once a week can fail silently on a 500,000-item dataset running every hour. The items it misses are not random noise. They tend to be the items that changed — new listings, updated prices, recently modified records — which is exactly the data that makes the collection useful in the first place.

This post covers the design decisions that determine whether a paginated collection system remains accurate and complete as volume, frequency, and source complexity increase.

Understand what the pagination model actually guarantees

Not all pagination is equivalent. The guarantees differ significantly between pagination models, and building a collection system around the wrong assumptions for a given model produces predictable gaps.

Offset-based pagination is the most common model in older APIs and many web sources. You request page=1&per_page=100, get the first 100 results, then request page=2&per_page=100, and so on. The position in the dataset is calculated by multiplying the page number by the page size.
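The offset loop just described can be sketched in a few lines. This is a minimal illustration, not a production client: `fetch_page` is a hypothetical stand-in for an HTTP call that accepts `page` and `per_page` parameters, and a short page is taken as the end of the dataset.

```python
# Minimal offset-pagination loop. `fetch_page` is a hypothetical stand-in
# for an HTTP request like GET /items?page=N&per_page=M.
def collect_offset(fetch_page, per_page=100):
    items, page = [], 1
    while True:
        batch = fetch_page(page=page, per_page=per_page)
        items.extend(batch)
        if len(batch) < per_page:  # a short page signals the last page
            break
        page += 1
    return items
```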

The critical failure mode of offset pagination is drift during collection. If new items are added to the beginning of the dataset while you are paginating through it — a common pattern for APIs that return results in reverse-chronological order — your page boundaries shift. By the time you request page 5, some items that were on page 4 during your page 3 request are now on page 5. You skip them. Conversely, items deleted during collection produce the opposite effect: you see the same item twice, once before the deletion shifts the boundary and once after.

Offset pagination drift is not a race condition you can prevent by running faster. On large datasets, a collection run takes long enough that source changes during the run are guaranteed. The only reliable mitigation is to treat the offset model as inherently approximate and implement a separate reconciliation pass.

Cursor-based pagination addresses the drift problem by anchoring each page to a stable position in the dataset rather than a numeric offset. The API returns an opaque cursor with each page — a token encoding the position of the last returned item — and you pass that cursor in the next request to get the items that follow it. Because the cursor is tied to a specific item rather than a position number, insertions and deletions earlier in the dataset do not shift your page boundaries.
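A cursor loop differs from the offset loop only in what it carries between requests. The sketch below assumes a response shape with `items` and `next_cursor` fields; real APIs vary in both field names and cursor encoding, so treat these names as illustrative.

```python
# Cursor-pagination loop. The response shape ({"items": [...],
# "next_cursor": ...}) is an assumption; field names vary by API.
def collect_cursor(fetch_page):
    items, cursor = [], None
    while True:
        resp = fetch_page(cursor=cursor)
        items.extend(resp["items"])
        cursor = resp.get("next_cursor")
        if cursor is None:  # a missing cursor signals the last page
            break
    return items
```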

Cursor pagination is safer than offset pagination for large, actively-updated datasets, but it introduces its own constraints. Cursors are typically not persistent — they expire after a short window, often 10 to 30 minutes. A collection job that pauses mid-run (due to a rate limit backoff, a transient error, or a system restart) may find its cursor has expired when it resumes. The job either needs to restart from the beginning or maintain a checkpoint strategy that allows resumption without relying on cursor validity.

Keyset pagination is a variant that uses the value of a sortable field — typically a timestamp or an auto-incrementing ID — as the pagination anchor. Instead of an opaque cursor, you request all items where id > 48291 or updated_at > 2026-03-15T14:00:00Z. This is more transparent and more durable than cursor-based pagination: the key value is stable, not time-limited, and can be stored as a collection checkpoint that survives job restarts indefinitely.
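Because the anchor is a plain key value, a keyset collector naturally produces its own durable checkpoint. The sketch below assumes a numeric `id` anchor and a hypothetical `fetch_after` callable standing in for a request like `GET /items?id_gt=<last_id>&limit=<n>`.

```python
# Keyset pagination anchored on a numeric `id`. The last id of each page
# doubles as a checkpoint that survives restarts. `fetch_after` is a
# hypothetical stand-in for GET /items?id_gt=<last_id>&limit=<n>.
def collect_keyset(fetch_after, limit=100, start_after=0):
    items, last_id = [], start_after
    while True:
        batch = fetch_after(id_gt=last_id, limit=limit)
        if not batch:  # empty page: nothing beyond the anchor
            break
        items.extend(batch)
        last_id = batch[-1]["id"]  # durable resume point
    return items, last_id
```

Passing a stored `start_after` value resumes collection exactly where the previous run stopped, with no dependence on cursor lifetimes.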

Keyset pagination requires that the source supports filtering by the anchor field and that the anchor field is indexed for efficient queries. It also requires that the anchor field is stable — that the sort order does not change and that items are not assigned new IDs or timestamps when they are updated. When these conditions hold, keyset pagination is the most reliable model for resumable, large-scale collection.

Design resumable jobs from the start

A collection job that cannot resume after interruption is forced to restart from the beginning when something goes wrong. On small datasets, a full restart is acceptable. On a dataset that takes four hours to collect, a restart two hours in because of a brief downstream outage wastes two hours of work and pushes the results out by four more hours instead of the two that remained.

Checkpoint storage records the current position in the pagination sequence durably, outside the collection process, so that a restarted job can continue from where it left off rather than from the beginning. The checkpoint should be written after each page is successfully fetched and stored, not after processing. Storing the checkpoint after processing introduces a window where the job has processed a page but not recorded the checkpoint — if the job restarts during this window, it will reprocess the page. Whether that is a problem depends on whether your processing logic is idempotent (covered in the ETL pipeline design post).
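The ordering matters enough to spell out. In this sketch, `store`, `save_checkpoint`, and `load_checkpoint` are injected callables assumed to be durable (a database, an object store); the checkpoint is written only after both the fetch and the store for a page have succeeded, so a restart re-fetches at most one page.

```python
# Offset-based collection with durable checkpoints. The checkpoint is
# written only after a page is fetched AND stored, so a restarted job
# re-fetches at most one page. `store`/`save_checkpoint`/`load_checkpoint`
# are assumptions standing in for durable external storage.
def run_with_checkpoints(fetch_page, store, save_checkpoint,
                         load_checkpoint, per_page=100):
    page = load_checkpoint() + 1  # resume after the last completed page
    while True:
        batch = fetch_page(page=page, per_page=per_page)
        if not batch:
            break
        store(batch)
        save_checkpoint(page)  # only after both fetch and store succeed
        if len(batch) < per_page:
            break
        page += 1
```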

For offset-based pagination, the checkpoint is the last successfully processed page number. For keyset pagination, it is the last successfully processed key value. For cursor-based pagination, the checkpoint is less useful because cursors expire, but storing the last processed item's ID allows the job to approximate a resume point by scanning forward from that item.

Partial page handling matters when a page fetch succeeds but storage fails. If you fetch page 7, store 80 of the 100 results successfully, and then the write fails, your checkpoint should reflect that page 7 was only partially processed. The simple approach is to treat any partial write as if the entire page was not processed — the job will re-fetch page 7 on resume and attempt to write all 100 items again. If your load logic uses upserts, this is safe. If it uses inserts, you will produce duplicates for the 80 items that were already stored.
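The upsert behavior that makes page re-writes safe is simple to demonstrate. Here a plain dict keyed on item id stands in for a table with a primary key; re-writing a page that was partially stored overwrites the items already present instead of duplicating them.

```python
# Upsert-style store keyed on item id. Re-writing a partially stored page
# is safe: already-present items are overwritten, never duplicated.
# `storage` is a dict standing in for a table with a primary key.
def upsert_page(storage, batch):
    for item in batch:
        storage[item["id"]] = item  # insert or overwrite
```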

Completion detection is not always obvious. An API that returns an empty final page is unambiguous — no results means no more pages. But many APIs signal completion through a has_more: false flag, a missing next_cursor field, a total count in the response headers that the client must compare against items fetched, or simply by returning fewer items than the requested page size. A collection system that does not correctly detect completion will either terminate early (missing the last partial page) or loop indefinitely making empty requests. Validate the completion detection logic against a controlled dataset before deploying to production.
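The completion signals listed above can be folded into one predicate. The field names (`has_more`, `next_cursor`, `items`) are illustrative assumptions; map them to whatever the source actually returns.

```python
# Completion detection covering the common signals an API may use.
# Field names are illustrative, not universal.
def is_last_page(resp, requested_size):
    if resp.get("has_more") is False:          # explicit flag
        return True
    if "next_cursor" in resp and resp["next_cursor"] is None:
        return True                            # cursor absent on last page
    return len(resp.get("items", [])) < requested_size  # short/empty page
```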

Handle rate limits at the pagination layer

Rate limiting and pagination interact in ways that are not obvious from looking at either problem in isolation. An API that allows 100 requests per minute sounds permissive until you calculate how long it takes to paginate through a 1,000,000-item dataset at 100 items per page: 10,000 requests at the rate limit of 100 per minute takes 100 minutes per full collection cycle.

Rate limit budgets should be calculated before the collection system is built, not after. For each source, determine the page size, the estimated total item count, the API rate limit, and the target collection frequency. If the math does not work — if completing a full collection cycle takes longer than the target frequency — you need either a higher rate limit (via a paid API tier or a negotiated arrangement with the source), a larger page size, or a scoped collection strategy that collects the full dataset less frequently and collects high-priority subsets more frequently.
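This budget check is arithmetic, which makes it worth encoding once and reusing per source. The sketch below reproduces the worked example from the previous paragraph: a million items at 100 per page under a 100-requests-per-minute limit.

```python
# Back-of-envelope rate-limit budget: how long one full collection cycle
# takes, and whether it fits inside the target collection frequency.
def cycle_minutes(total_items, page_size, requests_per_minute):
    requests = -(-total_items // page_size)  # ceiling division
    return requests / requests_per_minute

def fits_budget(total_items, page_size, rpm, target_minutes):
    return cycle_minutes(total_items, page_size, rpm) <= target_minutes
```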

Burst absorption allows collection jobs to use available rate limit capacity efficiently rather than distributing requests evenly over time. If the rate limit is 100 requests per minute and the job runs every hour, a burst-absorbing system can make 100 requests in the first minute, then wait, then make the next 100, rather than making 1.67 requests per second continuously. For many sources, concentrated bursts during off-peak hours produce lower latency and fewer transient errors than sustained low-rate traffic spread over the collection window.

Rate limit response handling should be explicit at the pagination level. When the source returns a 429 response or a Retry-After header, the collection job should respect it — stop making requests, wait the specified duration, and then resume from the last successful checkpoint. A collection job that treats 429 responses as transient errors and retries immediately will exhaust its retry budget quickly and back off for longer than a single respectful wait would have required.
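A minimal version of respectful 429 handling looks like this. `fetch` is a hypothetical callable that raises a `RateLimited` exception carrying the server-specified wait (as a real client would derive from a 429 status and `Retry-After` header); the wrapper sleeps once for that duration and retries the same page.

```python
import time

# Hypothetical exception a client raises on a 429 response, carrying the
# wait parsed from the Retry-After header.
class RateLimited(Exception):
    def __init__(self, retry_after):
        self.retry_after = retry_after

def fetch_with_backoff(fetch, page, sleep=time.sleep):
    while True:
        try:
            return fetch(page)
        except RateLimited as e:
            sleep(e.retry_after)  # one respectful wait, then retry the page
```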

Validate completeness, not just correctness

A collection run that completes without errors may still be incomplete. The job may have fetched 94,000 items from a source that had 100,000, with the remaining 6,000 silently missing due to a pagination boundary issue, an early termination, or a gap in the source's own indexing.

Expected count validation compares the number of items fetched against the expected count, when the source provides one. Many APIs include a total count in the response envelope — total: 100000 alongside the page results. After collection completes, compare the actual item count in storage against the expected total. A significant discrepancy — more than a small percentage difference for sources with ongoing updates — indicates an incomplete run.

Coverage spot-checks sample items from known positions in the dataset — the first page, the last page, and several pages in the middle — and verify that they are present in storage. If an item from the expected middle of the dataset is missing, the spot-check fails and triggers a targeted re-collection of that range.

Monotonic ID gap detection applies to sources where items have sequential numeric identifiers. After collection, scan the stored IDs for gaps. A range of missing IDs — say, no items with IDs between 48000 and 49500 — indicates a missed page or a collection gap that can be targeted for re-collection without re-running the entire job.
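A gap scan over stored sequential IDs is a short pass. The sketch below returns inclusive missing ranges that can be fed directly into a targeted re-collection.

```python
# Scan stored sequential IDs for contiguous missing ranges that can be
# targeted for re-collection.
def find_id_gaps(ids):
    gaps, s = [], sorted(ids)
    for prev, cur in zip(s, s[1:]):
        if cur - prev > 1:
            gaps.append((prev + 1, cur - 1))  # inclusive missing range
    return gaps
```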

Cross-run comparison detects systematic incompleteness that per-run validation might miss. If each individual collection run reports 100% completion but the item count consistently grows less than the known source growth rate, something is being missed on every run. Tracking item count over time and comparing it against source-reported totals or known growth rates catches this pattern.

Manage source-side state changes during collection

The most difficult aspect of paginated collection is that the source is not static. Items are added, updated, and deleted while your collection job runs. A collection strategy that assumes a frozen dataset produces results that are somewhere between a snapshot and an approximation, depending on how much the source changed during the run.

Collection ordering strategy determines which items are most likely to be captured accurately. Collecting in reverse-chronological order — newest items first — ensures that the highest-priority items (recent additions and updates) are captured early in the run, before the job is interrupted or rate-limited. Older items, collected later, are more stable and less likely to have changed since the previous collection run.

Update detection after collection closes the gap between items collected at the beginning of the run (which may have been updated by the time the run completes) and items collected at the end. After the full paginated collection completes, a targeted update scan fetches any items whose modification timestamp is more recent than the collection run start time. This ensures that updates that occurred during the collection run are captured without requiring a full re-collection.

Deletion handling is the hardest problem in paginated collection. Items deleted from the source do not produce any signal in a paginated response — they simply disappear. The only way to detect deletions is to compare the current collection against the previous collection and identify items that were present before and absent now. For large datasets, this comparison is expensive. The practical approach is to run deletion detection at a lower frequency than collection — perhaps once per day for a source collected every hour — and to flag detected deletions for human review before removing them from storage, since deletion detection false positives (items that were temporarily unavailable rather than permanently deleted) are common.
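The comparison itself is a set difference; the operational care lies in treating the result as candidates for review, not as deletions to apply. A minimal sketch:

```python
# Deletion candidates: present in the previous run, absent in the current
# one. Flagged for review rather than removed immediately, since temporary
# unavailability produces false positives.
def deletion_candidates(previous_ids, current_ids):
    return sorted(set(previous_ids) - set(current_ids))
```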


If your team is running data collection jobs that are silently incomplete or degrading under volume, see how ValenTech approaches data extraction and pipeline delivery or review our engagement process to discuss your requirements.
