Accelerating a Computer Vision Inference Pipeline

Executive Summary (Overview)

High-resolution images (e.g., 100k×100k pixels in digital pathology) routinely exceed GPU memory limits, so models are run on tiles and results are stitched afterward. That conventional post-processing merge stage can become a major bottleneck, especially when millions of tiles are written to and read from disk before any whole-slide visualization is possible [1–5].

In this project, our ML team redesigned merging as a real-time, in-pipeline process: as tiles are inferred, they’re blended into the full canvas immediately, stripe by stripe, rather than after inference completes. This eliminated a critical latency hotspot, halved merge time, cut peak memory by ~70%, and delivered a 1.6× end-to-end speedup, while enabling progressive, near-real-time visualization.

Background

The client works in digital pathology and “virtually stains” Whole Slide Images (WSIs), producing a stained counterpart from an unstained input via an image-to-image deep learning pipeline. WSIs are often gigapixel-scale (≈100,000×100,000 px), stored as multi-resolution pyramids (e.g., SVS, OME-TIFF), and typically accessed in tiles [1–5, 9–11, 16–19].

Because a single WSI can be tens of gigabytes uncompressed, prior practice split slides into 512×512 tiles with overlap, ran inference per tile, then merged all results at the end. That final stitching step alone could take >10 hours per slide, delaying delivery and visualization.

Why tiles and overlap?
Tiles are mandatory to fit GPU memory, while overlap (a “halo”) mitigates edge artifacts from convolution/normalization, with sliding-window strategies and overlap-aware blending widely used in medical and remote-sensing pipelines [2, 5, 20–22, 24–27]. Overlap values are often chosen as a fraction of the tile size (e.g., 12.5%–50%) to balance quality vs. throughput [24–27].
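
To make the quality vs. throughput trade-off concrete, the sketch below counts how many tiles a slide needs at different overlap fractions. It is illustrative arithmetic only: the helper name tiles_per_axis and the assumption of a uniform stride (tile size minus overlap) are ours, not taken from the client pipeline.

```python
import math


def tiles_per_axis(length: int, tile: int, overlap: int) -> int:
    """Number of tiles needed to cover `length` pixels along one axis.

    Assumes a uniform stride of (tile - overlap) pixels between tile origins.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    if length <= tile:
        return 1
    stride = tile - overlap
    return math.ceil((length - tile) / stride) + 1


# Hypothetical slide: 100,000 px per side, 512 px tiles.
side, tile = 100_000, 512
for fraction in (0.125, 0.25, 0.5):
    overlap = int(tile * fraction)
    n = tiles_per_axis(side, tile, overlap)
    print(f"overlap {overlap:3d} px: {n} x {n} = {n * n:,} tiles")
```

Because the tile count grows with the square of the per-axis count, moving from 12.5% to 50% overlap roughly triples the number of inferences per slide, which is why overlap is worth tuning per model.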

Challenges with the Previous Approach

Latency: Merging only after inference delayed both results and visualization.
I/O Bottlenecks: Writing hundreds of thousands of intermediate tiles to disk (then re-reading them) inflated wall-clock time; random I/O patterns further hurt throughput compared with sequential write-out [2, 6, 20, 28–31].
Visualization Lag: Users couldn’t view slides until everything was merged, blocking UI rendering [6, 12–13, 21].

Solution: Real-Time Merging During Tiled Inference

We built a custom stitcher tightly coupled to the inference engine. Instead of waiting for all tiles, we streamed results into a large virtual canvas as soon as row-aligned stripes were complete, blending overlaps deterministically and flushing full-width stripes to disk in row-major order. This preserved sequential I/O and cache locality, avoided race conditions from async tile arrivals, and enabled progressive rendering (users can see partial slides early).

Why this works:

Sequential writes (vs random) are far more efficient on both HDDs and SSDs; they reduce controller overhead and garbage-collection penalties [28–31].
Row-major, stripe-wise processing exploits CPU cache locality and mimics long-standing scanline/strip paradigms in TIFF and out-of-core imaging [7, 14–15, 32–36].
Progressive viewers (e.g., DZI/OpenSeadragon) and pyramid formats benefit from partial/level-wise availability, so streaming improves UX immediately [5–6, 12–13, 21].

Technical Implementation

1) Inference Scheduler with Tile Metadata

A scheduler partitions the WSI into overlapping tiles and attaches metadata (tile ID, (x,y) origin, overlap extents). It batches neighboring tiles so adjacent outputs arrive close in time, yet returns results asynchronously (no serial stall on stragglers). This aligns with sliding-window WSI practices in MONAI and WSI tooling [24–27].

Plain-English: We line up tiles in a logical order so that “neighbors” tend to finish near each other, which is great for merging stripes without waiting for distant tiles.
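
A minimal sketch of the planning step, assuming a simple row-major tile order and fixed-size batches. The names TileMeta, axis_origins, plan_tiles, and batches are hypothetical, and the production scheduler's actual batching policy is not described in the source.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class TileMeta:
    tile_id: int
    row: int      # index of the tile row (stripe) this tile belongs to
    col: int      # index of the tile within its row
    x: int        # tile origin in slide pixels
    y: int
    size: int     # tile edge length in pixels
    overlap: int  # nominal overlap with neighbors in pixels


def axis_origins(length: int, tile: int, overlap: int) -> List[int]:
    """Tile origins along one axis; the last tile is shifted back to stay in bounds.

    If the axis is shorter than one tile, a single origin at 0 is returned and
    the caller is expected to pad the tile.
    """
    if length <= tile:
        return [0]
    stride = tile - overlap
    origins = list(range(0, length - tile, stride))
    origins.append(length - tile)
    return origins


def plan_tiles(width: int, height: int, tile: int = 512, overlap: int = 64) -> Iterator[TileMeta]:
    """Yield tile metadata in row-major order so neighbors are scheduled together."""
    xs = axis_origins(width, tile, overlap)
    ys = axis_origins(height, tile, overlap)
    tile_id = 0
    for row, y in enumerate(ys):
        for col, x in enumerate(xs):
            yield TileMeta(tile_id, row, col, x, y, tile, overlap)
            tile_id += 1


def batches(tiles: Iterable[TileMeta], batch_size: int) -> Iterator[List[TileMeta]]:
    """Group consecutive tiles into batches for the inference engine."""
    batch: List[TileMeta] = []
    for meta in tiles:
        batch.append(meta)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Shifting the last tile back keeps every tile fully inside the slide, at the cost of a larger overlap on the final row and column. Padding the slide instead would be an equally reasonable choice.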

2) Stripe-Based Flushing & Deterministic Ordering

We accumulate tiles until a full-width row stripe (and its immediate successor) is complete. Then we flush that stripe to disk, in strict top-to-bottom, row-major order. Because adjacent rows are guaranteed before flushing, overlap blending across row boundaries is clean and seam-free. Sequential, row-major writes preserve throughput; they’re friendlier to caches and storage devices than scattered random writes [28–31, 33–36].

Plain-English: Imagine building a giant poster one horizontal band at a time; once a band is complete, we glue it down and move on, avoiding messy back-and-forth.
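
The flush rule described here (a row is written only once it and its immediate successor are complete, and always in top-to-bottom order) can be sketched as a small bookkeeping class. StripeTracker is a hypothetical name; the sketch assumes each tile is reported exactly once and that overlap is at most half the tile size, so a row only overlaps its immediate neighbors.

```python
from typing import Dict, List


class StripeTracker:
    """Decides which tile rows are safe to flush, given out-of-order tile arrivals."""

    def __init__(self, n_rows: int, n_cols: int) -> None:
        self.n_rows = n_rows
        self.n_cols = n_cols
        self.arrived: Dict[int, int] = {}  # tile row -> number of tiles received
        self.next_row = 0                  # next row to flush (strict top-to-bottom)

    def _complete(self, row: int) -> bool:
        return self.arrived.get(row, 0) == self.n_cols

    def add_tile(self, row: int) -> List[int]:
        """Record one finished tile and return the rows that may now be flushed, in order."""
        self.arrived[row] = self.arrived.get(row, 0) + 1
        ready: List[int] = []
        while self.next_row < self.n_rows and self._complete(self.next_row):
            is_last = self.next_row == self.n_rows - 1
            # A row needs its successor complete so the shared overlap is fully blended.
            if not is_last and not self._complete(self.next_row + 1):
                break
            ready.append(self.next_row)
            self.next_row += 1
        return ready
```

Since add_tile only ever returns rows in increasing order, the writer that consumes its output stays sequential even when tiles arrive asynchronously.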

3) Rolling Circular-Band Accumulator (for Overlap Blending)

Instead of holding the full canvas in memory, the stitcher maintains two small stripe-sized arrays:

an accumulator for weighted pixel sums, and
an accumulator for the corresponding weights.

Each incoming tile is multiplied by a precomputed per-pixel weight map (higher weight near the tile center, lower at edges), then added into the current rolling band; when a stripe is done and flushed, that band is zeroed, and the circular pointer advances. Memory use is therefore bounded by the stripe size (slide width × band height) rather than by the total slide area, and overlaps come out seamless, akin to distance-weighted “feathering” and related blending techniques used in mosaicking and pyramid blending [8–13, 22–23].

Plain-English: We keep only two skinny horizontal buffers in RAM and “pour” weighted tiles into them. Once a stripe is finalized, we empty the bucket and slide it down.
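
A minimal NumPy sketch of the two-accumulator idea, under stated assumptions: the ring buffer is indexed by pixel row modulo the band height, tiles are no taller than the band, and the weight map is a simple linear ramp. RollingBlender and linear_weight are illustrative names, not the production classes.

```python
import numpy as np


def linear_weight(size: int) -> np.ndarray:
    """Separable distance-to-edge weight map: highest at the center, lowest at the edges."""
    ramp = np.minimum(np.arange(1, size + 1), np.arange(size, 0, -1)).astype(np.float32)
    return np.outer(ramp, ramp) / (ramp.max() ** 2)


class RollingBlender:
    """Ring buffer of `band_height` pixel rows holding weighted sums and weights."""

    def __init__(self, width: int, band_height: int, channels: int = 3) -> None:
        self.band_height = band_height
        self.sum = np.zeros((band_height, width, channels), dtype=np.float32)
        self.wsum = np.zeros((band_height, width, 1), dtype=np.float32)

    def _ring(self, y0: int, y1: int) -> np.ndarray:
        # Map global pixel rows to positions in the circular band.
        return np.arange(y0, y1) % self.band_height

    def add_tile(self, tile: np.ndarray, weight: np.ndarray, x: int, y: int) -> None:
        """Accumulate a (h, w, C) tile at slide origin (x, y); requires h <= band_height."""
        h, w = tile.shape[:2]
        rows = self._ring(y, y + h)
        self.sum[rows, x:x + w] += tile * weight[..., None]
        self.wsum[rows, x:x + w] += weight[..., None]

    def flush(self, y0: int, y1: int) -> np.ndarray:
        """Normalize global rows [y0, y1), zero them in the band, and return the pixels."""
        rows = self._ring(y0, y1)
        out = self.sum[rows] / np.maximum(self.wsum[rows], 1e-8)
        self.sum[rows] = 0.0
        self.wsum[rows] = 0.0
        return out
```

A caller would invoke add_tile for every arriving tile and flush(y0, y1) once the rows in that range can receive no further contributions. Zeroing the flushed rows is what lets the same memory be reused for rows further down the slide.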

Note on blending choices. Weighted averaging (“feathering”) is fast and robust for consistent exposure/contrast; multi-resolution/Laplacian splines or gradient-domain blends can suppress tougher seams but at higher compute cost [10–13]. The chosen method balanced speed, memory, and visual quality for pathology outputs.

4) Partial Finalization Modes for Real-Time Visualization

Real systems face interruptions (GPU timeouts, network hiccups). We implemented:

Strict mode: flush only complete stripes (pixel-perfect guarantees).
Lenient mode: flush any available rows for partial slides, ideal for dashboards where users prefer “something now” over “everything later.”

This matches progressive viewing patterns in deep-zoom/pyramid viewers and improves perceived responsiveness [5–6, 12–13, 21].
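
One way the two modes could differ at flush time is sketched below. The source does not specify how lenient mode treats uncovered pixels; this sketch assumes they are left at zero and reported through a coverage mask, and the names FinalizeMode and finalize_band are hypothetical.

```python
from enum import Enum
from typing import Optional, Tuple

import numpy as np


class FinalizeMode(Enum):
    STRICT = "strict"
    LENIENT = "lenient"


def finalize_band(
    acc: np.ndarray,
    wsum: np.ndarray,
    complete: bool,
    mode: FinalizeMode,
) -> Optional[Tuple[np.ndarray, np.ndarray]]:
    """Normalize a band of weighted sums.

    acc has shape (h, w, C) and wsum has shape (h, w, 1). Returns
    (pixels, coverage_mask), or None when strict mode refuses to flush
    an incomplete band.
    """
    if mode is FinalizeMode.STRICT and not complete:
        return None
    covered = wsum > 0
    pixels = np.zeros_like(acc)
    # Divide only where at least one tile contributed; other pixels stay zero.
    np.divide(acc, wsum, out=pixels, where=covered)
    return pixels, covered[..., 0]
```

Returning the mask alongside the pixels lets a dashboard render missing regions distinctly (for example as a placeholder color) instead of showing them as black tissue.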

Results

Performance Metrics (per WSI)

Metric	Before	After	Improvement
Avg. merge time	~12 hours	~6 hours	2× faster
Peak memory usage	~100 GB	~30 GB	≈70% reduction

Overall, end-to-end pipeline time improved by ~1.6×, and visualization became effectively real-time at the stripe level. (Internal measurement details: client workloads, identical hardware, identical model checkpoints; the merge component was the only change.)

Scalability Gains

Merging is no longer the bottleneck, which frees headroom for larger slides and higher throughput.
Ready for horizontal scale: merging is amenable to multi-GPU/producer streams with deterministic consumers.
Cleaner streaming integration: straightforward fit for pyramidal formats (OME-TIFF/SVS) and deep-zoom viewers.

Business Impact

Faster turnaround for pathologists generating digital stains.
Earlier insights via progressive visualization (no “all-or-nothing” waiting).
Lower hardware pressure from reduced RAM/I/O overhead during merge.

Practical Notes & Design Choices

Tile size & overlap: 512×512 with overlap is a pragmatic default for many CV/medical tasks; increases in tile size typically improve prediction quality until GPU memory becomes limiting. Overlap of 12.5%–50% is common to suppress edge artifacts; choose empirically per model and normalization strategy [24–27].
Blending weights: Start with linear or cosine distance-to-edge weights; consider Gaussian weights or multi-resolution blends if residual seams appear [10–13, 22–23].
I/O format: Write stripes into pyramidal containers (OME-TIFF, Zarr, or DZI) to support progressive multi-resolution viewers and efficient random access at different zooms [3–6, 8–11, 18–19, 21].
Mem-mapping: Use memory-mapped stripe buffers for low-overhead access to on-disk arrays; this is an established pattern for large images [7, 15, 32–36].
Asynchrony: Keep inference and merging loosely coupled with a bounded queue; deterministically order commits (row-major) to avoid race conditions and ensure reproducibility.
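
As an illustration of the mem-mapping note, the sketch below writes finished stripes into a flat, row-major on-disk array with numpy.memmap. This is a simplified staging file, not an OME-TIFF or Zarr writer; the helper names open_canvas and write_stripe and the uint8 output type are assumptions made for the example.

```python
import numpy as np


def open_canvas(path: str, height: int, width: int, channels: int = 3) -> np.memmap:
    """Create a row-major (C-order) uint8 canvas backed by a file on disk."""
    return np.memmap(path, dtype=np.uint8, mode="w+", shape=(height, width, channels))


def write_stripe(canvas: np.memmap, y0: int, stripe: np.ndarray) -> int:
    """Write a full-width float stripe at row y0 and return the next free row.

    In C order a full-width block of rows is one contiguous byte range, so
    writing stripes top to bottom produces sequential file writes.
    """
    y1 = y0 + stripe.shape[0]
    canvas[y0:y1] = np.clip(np.rint(stripe), 0, 255).astype(np.uint8)
    canvas.flush()  # push dirty pages to disk so readers can see the stripe
    return y1
```

A downstream step could then convert this staging file into a pyramidal container, or a viewer could map the same file read-only to display the stripes written so far.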

Limitations & Future Work

Artifact-prone content: In rare cases (extreme contrast changes across tiles), simple feathering may leave faint seams; multi-resolution or gradient-domain blending can suppress them at higher compute cost [10–13].
Skewed arrivals: If scheduling can’t keep neighbors close in time (e.g., heterogeneous GPUs), stripe waits may increase; adaptive stripe height or micro-flushes can help.
Distributed merging: Next steps include sharded stripe ownership across consumers, then concatenation, while preserving deterministic, sequential write-out.

Conclusion

Treating slide reconstruction as a streaming, first-class stage, rather than a monolithic post-process, transformed this WSI pipeline: 2× faster merges, ~70% less memory, 1.6× end-to-end speedups, and progressive visualization for users. In production-grade CV systems, holistic optimization across model, data flow, I/O, and visualization often unlocks the biggest real-world wins.

Glossary (quick reader-friendly notes)

WSI (Whole Slide Image): Gigapixel-scale microscopy image, stored as a multi-resolution pyramid for zooming [1–5, 8–11, 16].
Tiled inference: Running the model on overlapping crops to fit GPU memory, then merging outputs [2, 24–27].
Feathering / weighted blending: Distance-to-edge weighting to hide seams where tiles overlap [11–13].
Stripe (row-major) flushing: Writing completed horizontal bands sequentially to maximize throughput and locality [28–36].
Memory-mapped arrays: On-disk arrays accessed like memory, without loading the whole file at once [7, 15].

References

[0] Source case material (internal): Case Study: Accelerating Computer Vision Inference Pipeline via Real-Time Large Image Merging from Tiled Inference.
[1] Wang et al., “Managing and Querying Whole Slide Images.” Notes typical ~100k×100k WSI sizes. PMC
[2] NVIDIA Developer Blog, “Accelerating Digital Pathology Workflows Using cuCIM & GPUDirect Storage.” WSI sizes, tiling necessity, I/O constraints. NVIDIA Developer
[3] OpenSlide Python Docs. Multi-resolution WSI reading. openslide.org
[4] OME-TIFF Spec. Pyramidal/multi-resolution support. docs.openmicroscopy.org
[5] OpenSeadragon DZI / deep zoom tile viewing (progressive visualization). OpenSeadragon
[6] Microsoft Research (Deep Zoom overview slides). Progressive tiling concepts. Microsoft
[7] NumPy memmap docs. Memory-mapped arrays for large files. numpy.org
[8] Bio-Formats WSI notes (multi-resolution, 100k+ pixels per side). docs.openmicroscopy.org
[9] OpenSlide “Aperio format” (SVS: pyramidal tiled TIFF). openslide.org
[10] Burt & Adelson (1983), “A Multiresolution Spline with Application to Image Mosaics.” Classic pyramid blending. persci.mit.edu
[11] ArcGIS Mosaicking “Blend” rule (distance/weight-based). desktop.arcgis.com
[12] UW Image Stitching notes: alpha/weighted blending basics. University of Washington Courses
[13] Panorama blending overview (weighted averages / feathering). UW Computer Sciences
[14] Kazhdan et al., “Streaming multigrid for gigapixel images” (out-of-core stripes/windows). hhoppe.com
[15] libtiff scanline/strip I/O pattern (sequential stripe processing). cs.rochester.edu
[16] IIPImage blog: OME-TIFF for whole-slide microscopy. IIPImage
[17] MONAI inferers (sliding window & WSI splitters with overlap/blending). docs.monai.io
[18] Reina et al. (2020): systematic evaluation of tile size/overlap impacts. PMC
[19] MDPI (2024): effects of tile size/overlap on segmentation accuracy. MDPI
[20] Mapscaping: mosaicking overview (adjacent vs overlapping rasters). mapscaping.com
[21] large_image / HistomicsTK: tiled access & processing for WSIs. digitalslidearchive.github.io
[22] Nature LSA reviews on virtual staining and WSI DL workflows. Nature
[23] GDAL/Whitebox/QGIS community notes on feathered mosaics. GitHub
[24] MONAI issue: overlap + Gaussian weighting for sliding-window inference. GitHub
[25] DeepMIB/QuPath study: overlapping tiles to avoid edge errors. Frontiers
[26] “No More Sliding Window” (2025): fixed 50% overlap baseline (context). arXiv
[27] Tiling artifacts & normalization trade-offs in large biological images. ResearchGate
[28] Sequential vs random write throughput context (SSDs/HDDs). Astute Group
[29] Storage performance primer: sequential vs random access patterns. Partition Wizard
[30] Cache locality & row-major iteration benefits. CS 61; Raygun
[31] Out-of-core/streaming algorithms for large data. cse.engineering.nyu.edu