
How to reduce AWS Bedrock costs

Quick answer
To reduce AWS Bedrock costs, cap maxTokens and keep prompts short to minimize token usage per request. Prefer lower-priced models like amazon.titan-text-express-v1 for routine tasks, and batch or cache requests when possible to lower overhead.

PREREQUISITES

  • Python 3.8+
  • AWS credentials configured (~/.aws/credentials or env vars)
  • pip install boto3

Setup

Install the boto3 library and configure your AWS credentials to access AWS Bedrock. Ensure your IAM identity is allowed to invoke Bedrock models (the bedrock:InvokeModel action) and that you have enabled access to the model in your region.
bash
pip install boto3

Step by step

Use the boto3 bedrock-runtime client to call models efficiently. Cap the output length with maxTokens (set inside inferenceConfig for the Converse API) and keep prompts concise to reduce token consumption and cost.
python
import boto3

# Initialize the Bedrock runtime client
client = boto3.client('bedrock-runtime', region_name='us-east-1')

# Define a concise prompt
prompt = "Summarize the benefits of renewable energy in two sentences."

# Prepare the message payload (Converse API content blocks are plain {"text": ...} dicts)
messages = [{"role": "user", "content": [{"text": prompt}]}]

# Call the model; maxTokens belongs in inferenceConfig for the Converse API
response = client.converse(
    modelId="amazon.titan-text-express-v1",
    messages=messages,
    inferenceConfig={"maxTokens": 100}  # Limit output tokens to reduce cost
)

# Extract and print the response text
output_text = response['output']['message']['content'][0]['text']
print("Response:", output_text)
output
Response: Renewable energy reduces greenhouse gas emissions and dependence on fossil fuels, promoting environmental sustainability and energy security.
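Capping maxTokens pays off because Bedrock bills input and output tokens separately. A back-of-the-envelope helper makes the effect visible; the per-1K-token prices below are illustrative placeholders, not actual Bedrock pricing, so check the pricing page for real numbers.

```python
def estimate_cost(input_tokens, output_tokens,
                  price_in_per_1k=0.0002, price_out_per_1k=0.0006):
    """Rough request cost in USD. Prices are illustrative placeholders."""
    return (input_tokens / 1000) * price_in_per_1k \
         + (output_tokens / 1000) * price_out_per_1k

# Halving the output cap halves the output-side cost of each request
full = estimate_cost(500, 200)
capped = estimate_cost(500, 100)
print(f"uncapped: ${full:.6f}  capped: ${capped:.6f}")
```

Plug in your model's real prices to decide whether a tighter cap or a cheaper model saves more for your workload.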

Common variations

Batch multiple prompts in one request to reduce overhead. Use cheaper models like amazon.titan-text-express-v1 for less critical tasks. Note that boto3 itself is synchronous; for async calls use a wrapper such as aiobotocore, or run the calls in a thread pool.
python
import boto3

# Note: boto3 is synchronous; a true async client requires aiobotocore
# or a similar wrapper. The coroutine below is a conceptual sketch.

async def async_bedrock_call(client, prompt):
    response = await client.converse(
        modelId="amazon.titan-text-express-v1",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 100}
    )
    return response['output']['message']['content'][0]['text']

# Reduce overhead by combining multiple prompts into one request,
# or fall back to sequential calls as shown here
client = boto3.client('bedrock-runtime', region_name='us-east-1')
prompts = ["Explain AI ethics.", "What is quantum computing?"]

# Sequential calls example
for p in prompts:
    response = client.converse(
        modelId="amazon.titan-text-express-v1",
        messages=[{"role": "user", "content": [{"text": p}]}],
        inferenceConfig={"maxTokens": 100}
    )
    print(response['output']['message']['content'][0]['text'])
output
Explain AI ethics: AI ethics involves principles ensuring AI systems are fair, transparent, and respect privacy.
Quantum computing: Quantum computing uses quantum bits to perform complex calculations faster than classical computers.
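One way to batch, as described above, is to fold several short questions into a single prompt so you pay one request's overhead instead of N. A minimal sketch follows; the helper name and numbering scheme are my own, and in practice you would need to split the answers back out of the single response.

```python
def combine_prompts(prompts):
    """Merge several questions into one numbered prompt for a single call."""
    numbered = "\n".join(f"{i}. {p}" for i, p in enumerate(prompts, 1))
    return ("Answer each numbered question in one short paragraph, "
            "keeping the same numbering:\n" + numbered)

combined = combine_prompts(["Explain AI ethics.", "What is quantum computing?"])
print(combined)
```

Remember to raise maxTokens proportionally when combining prompts, since one response now carries several answers.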

Troubleshooting

If you see high costs, check your maxTokens setting and prompt length and reduce both. Confirm you are using the most cost-effective model for each use case, monitor usage in AWS Cost Explorer to catch spikes early, and cache repeated queries to avoid unnecessary calls.
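Caching repeated queries can be as simple as memoizing on the prompt string. In this sketch the actual client.converse call is stubbed out so the caching behavior is visible and the counter tracks how many real requests would go out; in real code the stub body would issue the Bedrock request and return the response text.

```python
import functools

calls = {"count": 0}  # tracks how many real requests would be issued

@functools.lru_cache(maxsize=256)
def cached_ask(prompt):
    calls["count"] += 1
    # Real code would call client.converse(...) here and return the text
    return f"(model answer for: {prompt})"

cached_ask("Explain AI ethics.")
cached_ask("Explain AI ethics.")  # served from the cache, no second request
print("requests issued:", calls["count"])
```

lru_cache keys on the exact prompt string, so normalizing prompts (trimming whitespace, lowercasing where safe) improves the hit rate.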

Key Takeaways

  • Limit maxTokens and prompt length to control token usage and cost.
  • Use cost-effective models like amazon.titan-text-express-v1 for less critical tasks.
  • Batch requests and cache frequent queries to reduce API call overhead.
  • Monitor usage regularly with AWS Cost Explorer to detect and manage cost spikes.
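The Cost Explorer check can be scripted with boto3's ce client and its get_cost_and_usage call. The sketch below only builds the request parameters so it runs without AWS access; the SERVICE filter value for Bedrock is an assumption, so verify the exact name in your billing data, then pass the dict to boto3.client('ce').get_cost_and_usage(**params).

```python
# Request parameters for a daily Bedrock cost query (dates are examples)
params = {
    "TimePeriod": {"Start": "2026-04-01", "End": "2026-05-01"},
    "Granularity": "DAILY",
    "Metrics": ["UnblendedCost"],
    # Filter value is an assumption; check the service name in your bill
    "Filter": {"Dimensions": {"Key": "SERVICE", "Values": ["Amazon Bedrock"]}},
}
print(sorted(params))
```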
Verified 2026-04 · amazon.titan-text-express-v1