By Ryan McBride in cloud engineering

Amazon Bedrock: What A Software Engineer Actually Needs To Know

If you strip away the marketing, Amazon Bedrock is a single HTTP API in front of a bunch of foundation models. That's it. You send a prompt to an endpoint, you get tokens back, you pay per token. No GPUs to rent, no containers to build, no model weights to download. If you've used the OpenAI API, Bedrock is AWS's answer to it — but with a rotating cast of models behind the same door and the usual AWS machinery (IAM, KMS, VPC endpoints, CloudTrail) wrapped around every call.

This article walks through the parts of Bedrock you'll actually touch as an engineer: the model catalog, how you call it, how you customize models without training one yourself, the two features that turn Bedrock from "chat API" into "app platform" (Knowledge Bases and Agents), and the pricing and security knobs that will show up the moment you try to ship anything to production.

The Shape Of The Thing
Bedrock is serverless. You don't provision an endpoint, you don't pick an instance type, you don't warm up a container. You call InvokeModel (or InvokeModelWithResponseStream for streaming) against a model ID like anthropic.claude-3-5-sonnet-20241022-v2:0, and AWS routes it to whatever hardware is running that model. You pay on-demand by input and output tokens. The first time you try a model in a new region you have to click "request access" in the console — this is a one-time gate, not a per-request thing, but it will bite you if you forget.
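Concretely, an InvokeModel call is boto3 plus a JSON body in the model vendor's format. The sketch below builds an Anthropic-style messages body; the helper name and prompt are illustrative, and the network call itself is commented out because it needs AWS credentials and model access granted in your region:

```python
import json

# Claude models on Bedrock expect an Anthropic "messages"-format JSON body.
# build_claude_body is an illustrative helper, not part of any SDK.
def build_claude_body(prompt: str, max_tokens: int = 512) -> str:
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    })

body = build_claude_body("Summarize this incident report in three bullets.")

# The real call needs credentials and one-time model access in the region:
# import boto3
# rt = boto3.client("bedrock-runtime", region_name="us-east-1")
# resp = rt.invoke_model(
#     modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
#     body=body,
# )
# print(json.loads(resp["body"].read())["content"][0]["text"])
```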

The model catalog is the important part. Bedrock isn't one model, it's a menu:

  • Anthropic Claude — the flagship text models most people reach for. Strong reasoning, long context windows.

  • Amazon Titan — Amazon's own family. Titan Text for chat/completion, Titan Embeddings for vectors, Titan Image for image generation.

  • Meta Llama — open-weight models, cheaper, good for bulk workloads where you don't need frontier quality.

  • Mistral — small, fast, cheap. Nice for latency-sensitive stuff.

  • Cohere — strong embeddings, good rerankers.

  • AI21 Labs Jurassic — another general-purpose text model family.

  • Stability AI Stable Diffusion — image generation.

The practical upshot: you can swap models by changing a string. That's the whole selling point over calling OpenAI directly — you're not locked to one vendor, and because it's all IAM-authenticated AWS API calls, your data never leaves your AWS account boundary to go to a third party.

Calling It
The unsexy reality: you use the AWS SDK (the bedrock-runtime client), authenticate with normal IAM credentials, and POST JSON. There are two client surfaces worth knowing:

  • bedrock — the control plane. List models, manage custom models, set up Provisioned Throughput, create Guardrails.

  • bedrock-runtime — the data plane. InvokeModel, InvokeModelWithResponseStream, Converse, ConverseStream.

Converse is the newer, normalized API — it takes the same message-shaped request regardless of which model you're calling, so you don't have to hand-craft Anthropic's messages format for Claude and Meta's format for Llama separately. If you're starting a new project, use Converse. If you're reading old tutorials, you'll see InvokeModel with model-specific JSON bodies, which still works but is more annoying.
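A minimal Converse sketch, assuming boto3 and the message shape from the AWS docs; swapping vendors means changing modelId and nothing else. The live call is commented out since it requires credentials and model access:

```python
# Converse takes the same request shape for every model; only modelId changes.
# build_converse_request is an illustrative helper, not an SDK function.
def build_converse_request(model_id: str, prompt: str) -> dict:
    return {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"temperature": 0.2, "maxTokens": 512},
    }

req = build_converse_request(
    "anthropic.claude-3-5-sonnet-20241022-v2:0",
    "Extract the order ID from this email: ...",
)

# Swapping models is just a different string, same request shape:
# req = build_converse_request("meta.llama3-70b-instruct-v1:0", "...")

# import boto3
# rt = boto3.client("bedrock-runtime")
# resp = rt.converse(**req)
# print(resp["output"]["message"]["content"][0]["text"])
```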

Inference Parameters
These aren't Bedrock-specific, but the exam (and reality) leans on them. They control what the model does at generation time, not at training time:

  • temperature — randomness of sampling. 0 = deterministic, higher = more creative/varied. If a user asks you to "make creative ad copy on Bedrock," the answer is turn temperature up. If you're doing structured extraction, turn it down.

  • top_p (nucleus sampling) — only sample from the smallest set of tokens whose cumulative probability exceeds p. So top_p=0.9 means "ignore the long tail."

  • top_k — only consider the k most-likely next tokens. A hard cap on candidate count. Use this when you want to "regulate the number of most-likely candidates considered for the next word" — that's the exam phrasing.

  • max_tokens — hard ceiling on output length. This is also a cost knob; output tokens are usually more expensive than input tokens.

Prompt engineering is the cheapest customization lever. Before you even think about fine-tuning, burn a few hours iterating on system prompts, few-shot examples, and chain-of-thought. The cost difference between "better prompt" and "fine-tuned model" is multiple orders of magnitude.

The Customization Ladder
This is the single most important mental model in Bedrock. When you need the model to behave differently — know your domain, follow your format, cite your documents — there's a hierarchy of options, and you should always try the cheaper ones first:

  1. Prompt engineering — free, instant. System prompts, few-shot examples, output format instructions. Try this first.

  2. RAG (Retrieval Augmented Generation) — inject fresh/private context into the prompt at inference time. Bedrock's managed path is Knowledge Bases. Cheap, no training.

  3. Fine-tuning — supervised training on labeled task data. You give it (input, desired output) pairs and the model learns to produce outputs in your style or format. Use when prompt+RAG aren't enough to enforce behavior.

  4. Continued pre-training — unsupervised training on unlabeled domain data (raw docs, manuals, tickets). Use when the model doesn't know your domain vocabulary at all. This broadens knowledge; fine-tuning sharpens a task.

  5. Train from scratch — almost never the right answer. You're not going to out-train Anthropic.

The mnemonic: fine-tuning = labeled data for a task; continued pre-training = unlabeled data for a domain.

One catch: once you've customized a model (either flavor), you can't call it on-demand. You have to run it through Provisioned Throughput — you reserve dedicated inference capacity at an hourly rate. This is a real cost cliff. Customized models are useful, but budget-wise they stop being "pay per token."

Knowledge Bases — Bedrock's Managed RAG
RAG is "stuff relevant documents into the prompt." The mechanical version:

  1. Chunk your documents.

  2. Embed each chunk into a vector.

  3. Store vectors in a vector DB.

  4. At query time, embed the user's question, find the top-k nearest chunks, stuff them into the prompt, generate.
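The four steps above can be sketched end to end without any AWS calls. This toy version uses word-count vectors where a real pipeline would call an embedding model such as Titan Embeddings; the chunks and helper names are illustrative:

```python
from collections import Counter
import math

# Toy embedding: word counts stand in for a real embedding model call.
def embed(text: str) -> Counter:
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return Counter(cleaned.split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Steps 1-3: chunk, embed, store. The "vector DB" is just a list here.
chunks = [
    "Returns are accepted within 30 days with a receipt.",
    "Shipping is free on orders over fifty dollars.",
    "Support hours are 9am to 5pm eastern time.",
]
index = [(embed(c), c) for c in chunks]

# Step 4: embed the question, take the top-k nearest chunks, build the prompt.
def retrieve(question: str, k: int = 1) -> list:
    q = embed(question)
    return [c for _, c in sorted(index, key=lambda p: -cosine(p[0], q))[:k]]

question = "Are returns accepted with a receipt?"
context = retrieve(question)[0]
prompt = f"Answer using only this context:\n{context}\n\nQ: {question}"
```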

You could build this yourself. Most teams don't want to. Knowledge Bases for Amazon Bedrock is the managed version: you point it at an S3 bucket of docs, pick an embedding model (usually Titan Embeddings), and pick a vector store. Bedrock handles ingestion, chunking, embedding, retrieval, and prompt assembly. At runtime you call RetrieveAndGenerate instead of InvokeModel.
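At runtime the managed path looks roughly like this; note that Knowledge Base queries go through the bedrock-agent-runtime client, not bedrock-runtime. The knowledge base ID and model ARN below are placeholders for your own resources, and the call is commented out because it needs them to exist:

```python
# Request shape for RetrieveAndGenerate against a Knowledge Base.
# "KB1234567890" and the model ARN are placeholders, not real resources.
request = {
    "input": {"text": "What is the return policy on product X?"},
    "retrieveAndGenerateConfiguration": {
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB1234567890",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/"
                        "anthropic.claude-3-5-sonnet-20241022-v2:0",
        },
    },
}

# import boto3
# agent_rt = boto3.client("bedrock-agent-runtime")
# resp = agent_rt.retrieve_and_generate(**request)
# print(resp["output"]["text"])   # the generated answer
# print(resp["citations"])        # which retrieved chunks it was grounded on
```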

Supported vector stores:

  • OpenSearch Serverless — the default, easiest path. Spin up, point the KB at it, done.

  • Aurora PostgreSQL with pgvector — good if you already run Aurora.

  • Pinecone — third-party, popular.

  • Redis Enterprise Cloud.

Why RAG matters: it's the fix for hallucinations and stale training data. If a user asks "what's the return policy on product X" and your docs live in PDFs, you don't fine-tune — you index the PDFs into a Knowledge Base. The model stays the same. The data lives in your account. You can update the docs and re-index without touching the model. This is the cheapest, most maintainable way to give an LLM access to your private data.

Agents — Multi-Step Tool Use
Agents for Amazon Bedrock are how you go from "answer a question" to "take an action." The idea: you define a set of tools (Lambda functions, typically) with a schema, and the agent can decide — across multiple LLM turns — which tools to call, pass the results back to itself, and keep going until it's done.

Example from the real exam bank: a manufacturing company wants a generative AI app that monitors inventory, sales data, and supply chain info and recommends reorder points. Answer: Agents for Bedrock. The agent can call a "read inventory" tool, a "read sales" tool, reason over the results, and call a "create reorder" tool. You're not writing the orchestration loop by hand; the agent runtime does it.

Under the hood, this is function calling + ReAct-style planning + short-term memory, wrapped in an AWS service so you don't manage it. If you've hand-rolled an agent loop with OpenAI, Bedrock Agents is the "I don't want to maintain that code" version.
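A tool is typically a Lambda function. The sketch below assumes the "function details" action-group style and a simplified version of the event/response contract (check the AWS docs for the exact shape); the inventory dict stands in for a real datastore:

```python
# Illustrative Lambda tool for an agent. The event/response shape here is a
# simplified assumption about the "function details" contract, not gospel.
INVENTORY = {"widget-a": 12, "widget-b": 140}  # stand-in for a real datastore

def handler(event, context=None):
    # Agent passes tool parameters as a list of name/value pairs.
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}
    if event["function"] == "get_inventory":
        level = INVENTORY.get(params["sku"], 0)
        body = f"{params['sku']}: {level} units on hand"
    else:
        body = "unknown tool"
    # The agent reads the tool result out of functionResponse and keeps going.
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "function": event["function"],
            "functionResponse": {"responseBody": {"TEXT": {"body": body}}},
        },
    }
```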

Guardrails
Guardrails are content filters you attach to a model invocation. They run on both the input prompt and the output response, and they can:

  • Block or mask PII (names, SSNs, credit cards, phone numbers).

  • Block prompts on denied topics you define ("no legal advice," "no competitor discussion").

  • Filter harmful content categories (hate, violence, sexual, insults) at tunable thresholds.

  • Enforce word filters (blocklist of specific strings).

  • Block prompt injection attempts.

The one exam scenario to remember: a company fine-tuned a model on data that included confidential info, and now wants to make sure the model's responses don't leak it. Guardrails with PII masking is the answer — you don't have to retrain the model, you just put a filter in front of the output.

Guardrails are a separate resource you create once and then reference by ID in your InvokeModel or Converse calls. Worth treating them as a security control, not a feature flag.
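Attaching one to a call is a couple of request fields. The guardrail ID below is a placeholder, and the live call is commented out:

```python
# Guardrails are referenced by ID + version on each invocation.
# "gr-abc123" is a placeholder; you get the real ID when you create one.
request = {
    "modelId": "anthropic.claude-3-5-sonnet-20241022-v2:0",
    "messages": [
        {"role": "user", "content": [{"text": "Draft a reply to this email."}]}
    ],
    "guardrailConfig": {
        "guardrailIdentifier": "gr-abc123",
        "guardrailVersion": "1",
    },
}

# import boto3
# rt = boto3.client("bedrock-runtime")
# resp = rt.converse(**request)
# resp["stopReason"] tells you whether the guardrail intervened.
```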

Pricing: On-Demand vs Provisioned Throughput
Two billing modes, and they're not interchangeable:

  • On-demand — pay per input and output token. Good for prototyping, spiky traffic, low volume. Works only for the base models in the catalog.

  • Provisioned Throughput — reserve model units by the hour (1-month or 6-month commitments for discounts). Required for customized models (fine-tuned or continued pre-trained). Also useful when you need predictable throughput for a high-traffic production workload, because on-demand has quotas and throttling.

The gotcha: the moment you fine-tune, you're no longer on on-demand pricing. Your tiny $50/month experiment becomes a dedicated capacity bill. Factor this in before you recommend fine-tuning to anyone.

Cost levers at inference time, ranked by how much they matter:

  1. Pick a smaller model. Smaller models are cheaper per token; Claude Haiku vs Sonnet vs Opus spans roughly an order of magnitude in per-token price.

  2. Reduce input tokens. If you're stuffing 10 few-shot examples into every prompt, that's 10x the input bill forever. RAG with targeted retrieval beats shotgun few-shot.

  3. Cap max_tokens. Don't pay for output you'll truncate.

  4. Cache responses where queries repeat.
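Lever 4 can be as simple as memoizing on (model, prompt), which only makes sense at temperature 0 where repeated queries genuinely expect the same answer. Everything here is illustrative; the fake invoke function stands in for a real Bedrock call:

```python
import hashlib

_cache = {}

def cache_key(model_id: str, prompt: str) -> str:
    # Hash model + prompt so the key stays small regardless of prompt size.
    return hashlib.sha256(f"{model_id}\x00{prompt}".encode()).hexdigest()

def cached_invoke(model_id: str, prompt: str, invoke):
    key = cache_key(model_id, prompt)
    if key not in _cache:
        _cache[key] = invoke(model_id, prompt)  # pay for the model only once
    return _cache[key]

# Demo with a stand-in for the real Bedrock call:
calls = []
def fake_invoke(model_id, prompt):
    calls.append(prompt)
    return "cached answer"

cached_invoke("model-x", "same question", fake_invoke)
cached_invoke("model-x", "same question", fake_invoke)
# the second lookup is served from the cache; fake_invoke ran once
```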

Security, Logging, And Networking
Everything in Bedrock flows through AWS's normal security controls, which is most of the reason to use it over a third-party API:

  • IAM — every InvokeModel call is authenticated. You scope permissions per-model via IAM policies (e.g. bedrock:InvokeModel on specific model ARNs). Least-privilege applies.

  • KMS — your custom models, training data, and Knowledge Base contents are encrypted at rest with customer-managed keys if you want.

  • VPC endpoints / PrivateLink — route Bedrock traffic over the AWS backbone, never touching the public internet. Standard enterprise ask.

  • CloudTrail — logs every control-plane call.

  • Model invocation logging — this is the one you'll actually need. It's off by default. Turn it on and Bedrock writes every request/response pair to S3 or CloudWatch Logs. This is the answer to any exam question about "audit what the model saw and said" or "debug production prompt behavior." Enable it on day one.
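A least-privilege data-plane policy might look like the sketch below; the region and model ID are placeholders, and note that foundation-model ARNs have an empty account field by design:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream"
      ],
      "Resource": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0"
    }
  ]
}
```

Attach this to the app's execution role and the app can call exactly one model and nothing else on Bedrock.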

Your prompts, completions, and fine-tuning data are not used to train Amazon's or any vendor's models. This is a contractual data-handling commitment and it's the main reason enterprises pick Bedrock over calling model providers directly.

Bedrock vs SageMaker vs Amazon Q
These three get confused constantly. The clean mental model:

  • Bedrock — you call an API, a foundation model answers. Build-your-own-app. Serverless. You pick the model.

  • SageMaker — the full ML platform. You build, train, tune, deploy your own models (traditional ML or your own FMs). Much more flexible, much more work. JumpStart is the SageMaker feature that overlaps with Bedrock — it lets you deploy pretrained FMs to your own endpoints.

  • Amazon Q — the finished product. Q Developer is a code assistant (in-IDE, like a Copilot competitor). Q Business is a RAG assistant over your company data with pre-built connectors to Slack, Salesforce, S3, etc. Q uses foundation models but doesn't let you pick them. Q is the finished end-user app; Bedrock is the LEGO set you build one with.

If the question is "build a custom generative AI app with choice of model," it's Bedrock. If the question is "give our employees a ChatGPT-over-our-docs with zero code," it's Q Business. If the question is "train a fraud model on our tabular data," it's SageMaker.

A Reasonable Mental Model To Ship With
If you're standing up a Bedrock-backed app today, the default stack looks roughly like this:

  1. Start with Claude (or whichever model's good at your task) via Converse on on-demand pricing.

  2. Iterate on prompts until you've squeezed everything you can out of them.

  3. Put your private docs behind a Knowledge Base with OpenSearch Serverless. Call RetrieveAndGenerate.

  4. Attach a Guardrail with PII masking and denied topics the day before you show it to anyone outside the team.

  5. Turn on model invocation logging to S3. You will need it.

  6. Put the whole thing behind a VPC endpoint if you're in an enterprise account.

  7. Only reach for fine-tuning or Agents when you've proven prompt+RAG isn't enough, and budget for Provisioned Throughput if you do.

That covers maybe 90% of real-world Bedrock usage. The rest is details — and most of the details are just regular AWS (IAM policies, KMS keys, CloudTrail) applied to a slightly new service.