Moving Infrastructure Inference to Hardware Accelerators
Last quarter we moved a couple of our ML inference workloads off general-purpose CPUs and onto NVIDIA T4 GPUs. The performance gains were immediate and dramatic. The operational complexity that came with them was… also immediate.
At TaskRabbit, we use ML models for ranking and recommendation—matching Taskers to jobs, surfacing relevant categories, scoring urgency. These aren’t massive models by research standards, but they run on every request. Latency matters. Cost matters. And for a while, our CPU-based inference was both too slow and too expensive.
The Numbers That Forced the Move #
Our primary ranking model was averaging 45ms per inference call on CPU (c5.2xlarge instances). Acceptable, technically. But we were running enough of these calls per second that the compute bill was climbing faster than traffic was growing. Something about the scaling curve didn’t feel right.
On a T4 instance (g4dn.xlarge), the same model runs in about 2ms. That’s not a typo—roughly 22x faster. Even accounting for the higher per-instance cost of GPU hardware, the total cost per inference dropped by about 60% because each GPU handles so many more requests.
The T4 is NVIDIA’s inference-optimized GPU. It’s not a V100 (which is the training workhorse); it’s designed for exactly this use case. Lower power draw, INT8 support via Tensor Cores, and a price point that makes economic sense for serving.
TensorRT Changed the Equation #
Raw GPU execution was only part of the improvement. NVIDIA’s TensorRT optimizer took our ONNX-exported model and produced an optimized inference engine with FP16 precision. That alone doubled throughput compared to running the same model on the same GPU without TensorRT.
```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Build an engine from an ONNX model
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, TRT_LOGGER)
parser.parse_from_file("model.onnx")

# Enable FP16 and compile the optimized engine
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
engine = builder.build_engine(network, config)
```

The conversion process isn’t trivial, but it’s a one-time cost per model version. We export from PyTorch to ONNX, then compile ONNX to TensorRT. Each step has its own set of operator-compatibility gotchas, but for standard architectures (which ours are), it works.
FP16 quantization gave us a 2x throughput gain with negligible accuracy loss—less than 0.1% difference on our evaluation metrics. INT8 quantization promises another 2x on top of that, but requires a calibration dataset and more careful validation. We haven’t pushed there yet.
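The validation itself is simple: run the eval set through both the baseline and quantized engines and compare the metric. A minimal sketch of the kind of gate we check against, with hypothetical function names and an illustrative 0.1% tolerance:

```python
def relative_metric_drop(baseline: float, candidate: float) -> float:
    """Relative drop in an eval metric (e.g. AUC) of a quantized model vs. its baseline."""
    return (baseline - candidate) / baseline

def passes_quantization_gate(baseline_metric: float,
                             quantized_metric: float,
                             tolerance: float = 0.001) -> bool:
    """Allow promotion only if the quantized model stays within tolerance of baseline."""
    return relative_metric_drop(baseline_metric, quantized_metric) <= tolerance
```

The same check would guard an INT8 rollout; the difference is that INT8 also needs a representative calibration dataset before the comparison is meaningful.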
What Nobody Tells You About GPU Inference in Production #
The benchmarks are the easy part. Here’s what was harder:
Batching is everything. A single inference call on a GPU is wasteful; the hardware is designed for parallel execution. We implemented dynamic batching with a 5ms window (collect requests for 5ms, batch them, run inference once). This was the difference between “GPU is slightly faster” and “GPU is 20x faster.”
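The collection step can be sketched in a few lines. This is a simplified version of the idea, not our production code (which also handles per-request timeouts and error propagation); the names are illustrative:

```python
import time
from queue import Queue, Empty

BATCH_WINDOW_S = 0.005  # collect requests for up to 5 ms
MAX_BATCH = 32          # cap batch size to bound latency and GPU memory

def collect_batch(request_q: Queue, window_s: float = BATCH_WINDOW_S,
                  max_batch: int = MAX_BATCH) -> list:
    """Block for the first request, then gather more until the window closes."""
    batch = [request_q.get()]  # wait for at least one request
    deadline = time.monotonic() + window_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_q.get(timeout=remaining))
        except Empty:
            break
    return batch

# Serving loop shape: one GPU call per collected batch.
#   outputs = run_inference(collect_batch(request_q))
```

The window size is a latency/throughput knob: a longer window fills bigger batches but adds that much tail latency to every request, which is why we settled on 5ms rather than something larger.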
Memory management is your problem now. GPU memory is finite and shared across all models loaded on the device. We run three models on each T4 (16GB VRAM), and keeping track of memory allocation requires more attention than CPU deployments ever did.
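We ended up doing simple budget accounting before loading a new engine onto a device. The numbers and names below are illustrative (in practice the engine sizes come from the serialized files, and the headroom covers activations, workspace, and the CUDA context):

```python
T4_VRAM_MIB = 16 * 1024  # 16 GB of VRAM on a g4dn.xlarge's T4
HEADROOM_MIB = 2048      # reserved for activations, workspace, CUDA context

def can_load(loaded_engines_mib: list, new_engine_mib: int,
             capacity_mib: int = T4_VRAM_MIB,
             headroom_mib: int = HEADROOM_MIB) -> bool:
    """Check whether a new engine fits alongside the engines already resident."""
    return sum(loaded_engines_mib) + new_engine_mib + headroom_mib <= capacity_mib
```

It is crude, but it turns an opaque CUDA out-of-memory crash at request time into an explicit deploy-time failure.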
Cold starts hurt. Loading a TensorRT engine into GPU memory takes 2-3 seconds. For a service that might scale to zero, that’s unacceptable. We keep at least one warm instance per model at all times, which adds baseline cost.
Monitoring is different. CPU utilization is a metric everyone understands. GPU utilization, GPU memory, SM (streaming multiprocessor) occupancy—these require new dashboards, new alerts, new intuition. nvidia-smi becomes your best friend.
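For dashboards, nvidia-smi’s query mode gives machine-readable CSV (e.g. `nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits`). A small sketch of how one might scrape it into per-GPU stats for a metrics exporter; the function and field names are ours, not part of any library:

```python
import subprocess

QUERY = "utilization.gpu,memory.used,memory.total"

def gpu_stats(raw: str = None) -> list:
    """Parse `nvidia-smi --query-gpu` CSV output into per-GPU stat dicts.

    Pass `raw` to parse pre-captured output (useful for testing); otherwise
    shell out to nvidia-smi on the local host.
    """
    if raw is None:
        raw = subprocess.check_output(
            ["nvidia-smi", f"--query-gpu={QUERY}",
             "--format=csv,noheader,nounits"],
            text=True)
    stats = []
    for line in raw.strip().splitlines():
        util, used, total = (float(v) for v in line.split(","))
        stats.append({
            "util_pct": util,
            "mem_used_mib": used,
            "mem_total_mib": total,
            "mem_pct": 100.0 * used / total,  # derived: % of VRAM in use
        })
    return stats
```

Raw utilization is a blunt instrument (it only tells you a kernel was running, not how well it saturated the SMs), but it is enough to catch the common failure mode of a GPU sitting idle because batching upstream broke.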
The Alternatives We Considered #
Before committing to GPUs, we looked at AWS Inferentia (their custom ML chip, launched late 2019) and Intel’s OpenVINO for optimized CPU inference.
Inferentia looked promising on paper—purpose-built for inference, competitive pricing. But the tooling was immature. Model compilation via the Neuron SDK had limited operator support, and debugging failures was painful. I think Inferentia will be a serious option eventually; it wasn’t ready for us in Q2 2020.
OpenVINO gave us a 3-4x improvement over vanilla CPU inference, which is solid. For teams that can’t justify GPU infrastructure, it’s worth trying. But it didn’t close the gap enough for our latency targets.
FPGAs were interesting conceptually—reconfigurable, energy-efficient, potentially great for edge deployments. In practice, the development cycle for FPGA inference is an order of magnitude longer than GPU. We didn’t have the expertise on the team, and hiring for it seemed like the wrong bet.
Where This Is Going #
We’re serving about 4x the traffic per dollar compared to our CPU setup. The engineering investment was real (maybe 6 weeks of one engineer’s time, including the monitoring and deployment pipeline work), but it’s paid for itself already.
If your models are small and your latency requirements are loose, CPUs are fine. Don’t over-engineer it. But if you’re running inference on every request and cost or latency is becoming a constraint, the T4 is a remarkably good piece of hardware for the price.
The model optimization pipeline (PyTorch -> ONNX -> TensorRT) adds complexity to your deployment process. It’s worth it, but go in with eyes open about the operational surface area you’re adding.