Inference Modes

Picking the Right Way to Run Your Model
Once you've trained or selected a model, you need to actually run it against data — that's inference. The decision of how you run inference has massive implications for latency, cost, and operational complexity. AWS gives you multiple deployment patterns across both SageMaker and Bedrock, plus purpose-built hardware to run it all on. Let's break it down.
The Four SageMaker Deployment Modes
SageMaker offers four distinct inference modes, each optimized for a different traffic pattern. Picking wrong here means you're either burning money or frustrating users.
Real-Time Inference (Hosting Services)
This is your always-on, low-latency option. You deploy a model to a persistent endpoint backed by one or more instances, and it sits there waiting for requests. Every invocation is synchronous — you send a request, you wait, you get a response.
Use this for chatbots, fraud detection, recommendation engines, patient risk scoring — anything where a user is waiting on the other end. SageMaker endpoints support autoscaling, so you can scale instance counts up and down based on traffic. The tradeoff is cost: you're paying for compute even when nobody's calling it.
The invocation pattern is an HTTPS request to the InvokeEndpoint API. You call InvokeEndpoint, the model runs inference on your payload, and you get the result back in the same HTTP response. Latency is usually milliseconds to low seconds depending on model size.
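The request/response cycle can be sketched as follows. The endpoint name and payload fields are invented for illustration, and the real `sagemaker-runtime` client call is stubbed out with a canned response so the flow is visible without AWS access:

```python
import json

# Hypothetical endpoint name -- illustration only.
ENDPOINT_NAME = "fraud-detector-prod"

def invoke_endpoint(endpoint_name, body):
    """Stand-in for the SageMaker runtime InvokeEndpoint call.
    A real client would be boto3's sagemaker-runtime; here we return
    a canned response so the example runs anywhere."""
    return {"Body": json.dumps({"score": 0.93})}

# Serialize the request, invoke synchronously, parse the response body.
payload = json.dumps({"transaction_amount": 412.50, "country": "DE"})
response = invoke_endpoint(ENDPOINT_NAME, payload)
result = json.loads(response["Body"])
print(result["score"])
```

The shape is the point: the caller blocks until the response arrives, which is exactly why this mode fits "someone is waiting" workloads.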
Serverless Inference
Serverless inference is the Lambda-style option for ML. SageMaker manages all the infrastructure, spinning up compute when requests arrive and scaling to zero when traffic stops. You pay only for what you use.
The sweet spot is intermittent, spiky workloads — a content recommendation model that gets hammered during peak hours and sits idle overnight. The catch is cold starts. When traffic scales from zero, the first request takes longer because SageMaker needs to spin up the container and load the model. If your use case can tolerate a few seconds of initial latency, this saves serious money compared to always-on endpoints.
SageMaker manages the underlying infrastructure for you and gives you built-in fault tolerance and automatic scaling without you configuring scaling policies or choosing instance types.
Asynchronous Inference
Async inference queues incoming requests and processes them in the background. This is the middle ground between real-time and batch — you still get a dedicated endpoint, but the execution is decoupled from the request.
The killer feature: payloads up to 1GB and processing times up to one hour. That's genomic analysis territory, large document processing, complex image pipelines. You submit the request, SageMaker queues it, processes it, and drops the result in S3 or sends a notification via SNS.
It also autoscales to zero when the queue is empty, so you're not paying for idle compute. This makes it cost-effective for workloads where you need near-real-time results but don't need them in the same HTTP response. For payloads under 1GB where immediate results aren't critical — say, daily sales analysis used for weekly meetings — async is often the right call over batch.
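The async flow is easier to see as a toy simulation: requests go into a queue, a background worker drains it, and results land in an output store (S3 in real life). All names here are this sketch's own, not SageMaker APIs:

```python
from collections import deque

queue = deque()
s3_output = {}  # stands in for the S3 output bucket

def submit(request_id, payload):
    """Enqueue a request; the caller immediately gets an output
    location, not a result -- that's the decoupling."""
    queue.append((request_id, payload))
    return f"s3://results/{request_id}.out"

def process_queue():
    """Drain the queue and write results to the output store.
    In SageMaker this runs on the endpoint, decoupled from submit."""
    while queue:
        request_id, payload = queue.popleft()
        s3_output[f"{request_id}.out"] = f"processed:{payload}"

loc = submit("req-1", "large-genomic-batch")
process_queue()
print(loc, s3_output["req-1.out"])
```

Note that `submit` returns before any inference happens; the caller polls S3 or waits for an SNS notification, which is why payloads and runtimes can be so much larger than real-time allows.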
Batch Transform
Batch transform is pure offline processing. No endpoint, no waiting around. You point it at a dataset in S3, it spins up compute, runs inference on the entire dataset, writes results back to S3, and shuts down.
Use this for periodic bulk scoring: running a churn model against your entire customer base overnight, generating predictions for a whole dataset before a quarterly review, or preprocessing data for downstream analytics. SageMaker can split multi-gigabyte input files for you when you set SplitType to Line and BatchStrategy to MultiRecord.
Batch is the cheapest per-inference option when you have large datasets and no latency requirements. The execution is fundamentally asynchronous — you kick it off and check results later.
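A rough sketch of what SplitType: Line plus BatchStrategy: MultiRecord does: the input file is split on newlines, then records are grouped into mini-batches. The batch size here is made up; in a real job it's derived from the payload limit:

```python
def split_into_batches(file_contents, max_records_per_batch):
    """Toy version of Line splitting + MultiRecord batching."""
    records = file_contents.strip().split("\n")  # SplitType: Line
    # BatchStrategy: MultiRecord -- group records into mini-batches
    return [records[i:i + max_records_per_batch]
            for i in range(0, len(records), max_records_per_batch)]

data = "row1\nrow2\nrow3\nrow4\nrow5\n"
batches = split_into_batches(data, 2)
print(batches)  # [['row1', 'row2'], ['row3', 'row4'], ['row5']]
```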
Quick Decision Matrix
| Mode | Latency | Payload | Scaling | Pay For |
|---|---|---|---|---|
| Real-time | Milliseconds | Small | Autoscaling instances | Always-on compute |
| Serverless | Seconds (cold start) | Small | Auto, scales to zero | Per-request |
| Async | Minutes | Up to 1GB | Auto, scales to zero | Per-request |
| Batch Transform | Hours | Multi-GB datasets | Job-based | Job duration |
The mental model: Real-time = someone's waiting. Serverless = someone might be waiting, but traffic is spiky. Async = nobody's waiting, payload is big. Batch = process the whole dataset offline.
Bedrock Inference: A Different Model
Amazon Bedrock's inference options are distinct from SageMaker's. Bedrock is serverless by nature — you don't manage instances. The three modes are:
On-Demand — Pay per token, no commitments. You call the API, you get charged for input and output tokens. This is the default and works for variable or unpredictable workloads. Cost is directly proportional to token count, so shorter prompts = cheaper inference.
Batch Inference — Submit a batch of prompts stored in S3, and Bedrock processes them asynchronously. The big draw: Bedrock offers batch inference at 50% of on-demand pricing for select models (Anthropic, Meta, Mistral, Amazon). If you can tolerate delays and you're running large volumes — media analysis, content generation at scale — batch cuts your bill in half.
Provisioned Throughput — Reserve a fixed amount of capacity for guaranteed performance. This is the enterprise option for predictable, high-volume workloads. Critically, custom models (fine-tuned or continued pre-training) can only be accessed via Provisioned Throughput — you cannot use on-demand or batch mode with customized Bedrock models.
Note that real-time inference and serverless inference are SageMaker concepts. Bedrock doesn't use those terms. If a question mentions serverless inference on Bedrock, that's a trick — Bedrock's default API is already serverless by design.
Inference Parameters
Regardless of which mode you use, the inference parameters control what comes out of the model. These apply across Bedrock and SageMaker-hosted LLMs:
Temperature (0 to 1) — Controls randomness/creativity. Low temperature = deterministic, focused outputs. High temperature = more creative, diverse, potentially less coherent. For factual Q&A, keep it low. For creative content, push it higher.
Top K — The number of most-likely next tokens the model considers. Lower K = more constrained vocabulary, more predictable output. Higher K = broader candidate pool, more variety.
Top P (nucleus sampling) — The percentage (cumulative probability) of most-likely tokens considered. Top P of 0.9 means the model considers the smallest set of tokens whose probabilities sum to 90%. It's a softer constraint than Top K.
Max Tokens / Response Length — Hard cap on output length. The model stops generating after this many tokens regardless of whether it's "done." Directly affects cost since you pay per output token.
Stop Sequences — Character sequences that halt generation. If the model outputs a stop sequence you've defined, it stops right there.
None of these parameters affect inference cost directly (except max tokens / response length, which caps output tokens). Temperature, top-p, and top-k control output quality and style, not billing.
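How the three sampling knobs interact is clearer in code. This is a toy next-token sampler, not any provider's actual implementation; the logits are invented:

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Sample one token from a dict of token -> raw score."""
    rng = rng or random.Random(0)
    # Temperature: divide logits before softmax; low T sharpens the distribution.
    items = [(tok, score / max(temperature, 1e-6)) for tok, score in logits.items()]
    m = max(s for _, s in items)
    probs = {tok: math.exp(s - m) for tok, s in items}
    total = sum(probs.values())
    ranked = sorted(((t, p / total) for t, p in probs.items()), key=lambda kv: -kv[1])
    # Top K: keep only the K most likely candidates.
    if top_k is not None:
        ranked = ranked[:top_k]
    # Top P: keep the smallest prefix whose probabilities sum to >= top_p.
    if top_p is not None:
        kept, cum = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            cum += p
            if cum >= top_p:
                break
        ranked = kept
    # Renormalize over survivors and sample.
    total = sum(p for _, p in ranked)
    r, cum = rng.random() * total, 0.0
    for tok, p in ranked:
        cum += p
        if r <= cum:
            return tok
    return ranked[-1][0]

logits = {"the": 5.0, "a": 3.0, "zebra": 0.5}
print(sample_next_token(logits, temperature=0.1))  # near-greedy: "the"
```

Low temperature collapses the distribution onto the top token; Top K and Top P both shrink the candidate pool before sampling, which is why they read as "harder" and "softer" versions of the same idea.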
Purpose-Built Hardware: Inferentia and Trainium
AWS builds custom silicon specifically for ML workloads, and the naming tells you exactly what each does:
AWS Inferentia — Purpose-built for inference. Inferentia-powered EC2 instances (Inf1, Inf2) deliver up to 2.3x higher throughput and up to 70% lower cost per inference compared to comparable GPU instances. If you're running inference at scale on SageMaker or self-managed endpoints, Inferentia is the cost-performance play.
AWS Trainium — Purpose-built for training. Trn1 instances use Trainium chips for training 100B+ parameter models at lower cost and up to 25% better energy efficiency than comparable GPU instances. Trainium is for the training phase — not inference.
The distinction matters: Trainium trains, Inferentia infers. Don't mix them up. And if a question mentions carbon footprint or energy efficiency for ML training workloads, Trainium is the answer — it's specifically marketed as the most energy-efficient option.
Cost Optimization Patterns
A few patterns that come up repeatedly when optimizing inference costs:
Reduce input tokens — On Bedrock, you pay per token. If you're using few-shot prompting with 10 examples, trimming those examples or making them more concise directly reduces cost. For prompt-heavy workloads this is often the most direct cost lever.
Use batch when latency doesn't matter — 50% savings on Bedrock, efficient resource utilization on SageMaker. If nobody's waiting for the result, batch it.
Scale to zero — Both serverless and async inference on SageMaker autoscale to zero. If your traffic has idle periods, these modes save you from paying for idle compute.
Right-size the model — Smaller models are cheaper per inference. If a smaller model meets your accuracy requirements, don't pay for a bigger one. The relationship is direct: bigger model = more compute = higher cost.
Provisioned Throughput for predictable workloads — If you have consistent, high-volume traffic on Bedrock, provisioned capacity with term commitments is cheaper than on-demand at scale. Plus it's mandatory for custom models.
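The first two levers are pure arithmetic, so a back-of-envelope model makes them concrete. The per-token prices below are placeholders, not real rate-card numbers; the structure is what matters — cost scales with tokens, and batch runs at half the on-demand rate for eligible models:

```python
# Hypothetical $/1K-token prices -- NOT actual Bedrock pricing.
PRICE_PER_1K_INPUT = 0.003
PRICE_PER_1K_OUTPUT = 0.015

def request_cost(input_tokens, output_tokens, batch=False):
    """Cost of one request; batch inference runs at 50% of on-demand."""
    cost = (input_tokens / 1000) * PRICE_PER_1K_INPUT \
         + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT
    return cost * 0.5 if batch else cost

# Trimming a verbose 10-example few-shot prompt from 2,000 to 800 input tokens:
verbose = request_cost(2000, 500)
trimmed = request_cost(800, 500)
batched = request_cost(800, 500, batch=True)
print(verbose, trimmed, batched)
```

Stacking the two levers — a trimmed prompt run through batch — yields the cheapest request, which is the general pattern: token reduction and batch discounts multiply rather than compete.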
Putting It Together
The inference mode decision tree is straightforward:
Need instant responses? → Real-time (SageMaker) or On-Demand (Bedrock)
Spiky traffic with idle periods? → Serverless (SageMaker)
Big payloads, can wait? → Async (SageMaker up to 1GB)
Whole dataset, no time pressure? → Batch Transform (SageMaker) or Batch (Bedrock at 50% off)
Custom fine-tuned Bedrock model? → Provisioned Throughput (mandatory)
Need to optimize training cost? → Trainium instances
Need to optimize inference cost at hardware level? → Inferentia instances
Each mode exists because different workloads have fundamentally different requirements. A fraud detection system and a quarterly churn analysis have nothing in common operationally, even if the underlying models are similar. Match the deployment mode to the workload pattern, and you'll save money while hitting your latency targets.
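The SageMaker side of that decision tree can be written down as a function. The workload attributes and return labels are this sketch's own vocabulary, not AWS API values; the ~6 MB threshold reflects real-time's small-payload constraint:

```python
def pick_inference_mode(user_waiting, spiky_traffic, payload_gb, whole_dataset):
    """Map workload traits to a SageMaker inference mode (sketch)."""
    if whole_dataset:
        return "batch-transform"        # offline bulk scoring, no endpoint
    if payload_gb > 0.006 or not user_waiting:
        # real-time payloads are small; big or non-urgent work goes async,
        # and anything past async's 1GB cap falls back to batch
        return "async" if payload_gb <= 1 else "batch-transform"
    if spiky_traffic:
        return "serverless"             # scale to zero between bursts
    return "real-time"                  # always-on, lowest latency

print(pick_inference_mode(True, False, 0.001, False))   # real-time
print(pick_inference_mode(True, True, 0.001, False))    # serverless
print(pick_inference_mode(False, False, 0.8, False))    # async
```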