How to reduce AWS Bedrock costs
Quick answer
To reduce AWS Bedrock costs, cap maxTokens and keep prompts short to minimize token usage per request. Choose cost-effective models such as amazon.titan-text-express-v1 for routine tasks, and batch requests where possible to lower per-call overhead.
Prerequisites
- Python 3.8+
- AWS credentials configured (~/.aws/credentials or env vars)
- pip install boto3
Setup
Install the boto3 library and configure your AWS credentials to access AWS Bedrock. Ensure you have permissions to invoke Bedrock models.
pip install boto3

Step by step
Use the boto3 bedrock-runtime client's Converse API to call models efficiently. Keep prompts concise and cap maxTokens (set via inferenceConfig) to reduce token consumption and cost.
import boto3

# Initialize the Bedrock runtime client
client = boto3.client('bedrock-runtime', region_name='us-east-1')

# Define a concise prompt
prompt = "Summarize the benefits of renewable energy in two sentences."

# Prepare the message payload (Converse API content blocks use a "text" key)
messages = [{"role": "user", "content": [{"text": prompt}]}]

# Call the model; maxTokens is passed via inferenceConfig in the Converse API
response = client.converse(
    modelId="amazon.titan-text-express-v1",
    messages=messages,
    inferenceConfig={"maxTokens": 100},  # Limit tokens to reduce cost
)

# Extract and print the response text
output_text = response['output']['message']['content'][0]['text']
print("Response:", output_text)

Output
Response: Renewable energy reduces greenhouse gas emissions and dependence on fossil fuels, promoting environmental sustainability and energy security.
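The Converse API also returns a usage field with input and output token counts, which you can turn into a rough per-call cost estimate. The helper below is a hypothetical sketch: the per-1,000-token prices are placeholders, so substitute the real rates for your model and region from the AWS Bedrock pricing page.

```python
def estimate_cost(usage, input_price_per_1k, output_price_per_1k):
    """Estimate the cost of one Converse call from its 'usage' field.

    Prices are per 1,000 tokens; the values passed below are
    illustrative placeholders, not real Bedrock rates.
    """
    return (usage["inputTokens"] / 1000 * input_price_per_1k
            + usage["outputTokens"] / 1000 * output_price_per_1k)

# With a live call you would read the counts from the response:
# usage = response["usage"]  # {"inputTokens": ..., "outputTokens": ..., "totalTokens": ...}
usage = {"inputTokens": 12, "outputTokens": 35, "totalTokens": 47}
print(estimate_cost(usage, 0.0002, 0.0006))
```

Logging this estimate per request makes it easy to spot which prompts drive your spend before the bill arrives.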
Common variations
Batch multiple prompts into one request to reduce overhead, and prefer cheaper models like amazon.titan-text-express-v1 for less critical tasks. For streaming, use the converse_stream API; boto3 itself is synchronous, so async calls need a wrapper such as aiobotocore or aioboto3.
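One simple way to batch is to fold several questions into a single prompt so one Converse call answers them all, amortizing per-request overhead. The batch_prompt helper below is a hypothetical sketch of that idea:

```python
def batch_prompt(questions):
    """Combine several questions into one prompt so a single
    Converse call answers them all, amortizing per-request overhead."""
    numbered = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(questions))
    return f"Answer each question in one sentence:\n{numbered}"

# One combined prompt replaces two separate API calls
print(batch_prompt(["Explain AI ethics.", "What is quantum computing?"]))
```

The trade-off is that a combined prompt needs a larger maxTokens budget and a little output parsing, but you pay the fixed per-request overhead only once.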
# Conceptual example: boto3 is synchronous, so a true async call needs a
# wrapper such as aiobotocore or aioboto3 that exposes an awaitable client.
async def async_bedrock_call(client, prompt):
    response = await client.converse(
        modelId="amazon.titan-text-express-v1",
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 100},
    )
    return response['output']['message']['content'][0]['text']
# Batch by combining multiple prompts into one request, or loop sequentially
prompts = ["Explain AI ethics.", "What is quantum computing?"]

# Sequential calls example
for p in prompts:
    response = client.converse(
        modelId="amazon.titan-text-express-v1",
        messages=[{"role": "user", "content": [{"text": p}]}],
        inferenceConfig={"maxTokens": 100},
    )
    print(response['output']['message']['content'][0]['text'])

Output
AI ethics involves principles ensuring AI systems are fair, transparent, and respect privacy.
Quantum computing uses quantum bits to perform complex calculations faster than classical computers.
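For streaming, converse_stream delivers the reply as a sequence of events rather than one response object. The sketch below shows how to assemble the text from those events; since a live call needs AWS credentials, it runs against simulated events shaped like the real stream:

```python
def collect_stream_text(events):
    """Assemble the reply text from Bedrock converse_stream events.

    Each text chunk arrives in a 'contentBlockDelta' event; other event
    types (messageStart, messageStop, metadata) carry no reply text.
    """
    parts = []
    for event in events:
        if "contentBlockDelta" in event:
            parts.append(event["contentBlockDelta"]["delta"]["text"])
    return "".join(parts)

# With live credentials you would iterate the real stream instead:
# stream = client.converse_stream(
#     modelId="amazon.titan-text-express-v1",
#     messages=[{"role": "user", "content": [{"text": prompt}]}],
#     inferenceConfig={"maxTokens": 100},
# )
# text = collect_stream_text(stream["stream"])

# Simulated events for illustration:
events = [
    {"messageStart": {"role": "assistant"}},
    {"contentBlockDelta": {"delta": {"text": "Hello, "}}},
    {"contentBlockDelta": {"delta": {"text": "world."}}},
    {"messageStop": {"stopReason": "end_turn"}},
]
print(collect_stream_text(events))
```

Streaming does not change per-token pricing, but it lets you show partial output immediately and stop consuming the stream early if the answer is already sufficient.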
Troubleshooting
If you see high costs, check your maxTokens and prompt length; reduce them. Ensure you are using the most cost-effective model for your use case. Monitor usage in AWS Cost Explorer to identify spikes. Use caching for repeated queries to avoid unnecessary calls.
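The caching advice above can be sketched with functools.lru_cache: memoize identical prompts so repeated questions are served from memory instead of triggering another billed call. make_cached_caller is a hypothetical helper, demonstrated here with a stand-in function since a real Bedrock call needs credentials; it assumes cached (possibly stale) answers are acceptable for your use case.

```python
from functools import lru_cache

def make_cached_caller(call_model, maxsize=256):
    """Wrap any prompt -> text function (e.g. a Bedrock Converse call)
    so repeated prompts are served from an in-memory cache."""
    @lru_cache(maxsize=maxsize)
    def cached(prompt):
        return call_model(prompt)
    return cached

# Stand-in for the real Bedrock call; records how often it is invoked
calls = []
def fake_model(prompt):
    calls.append(prompt)
    return f"answer to: {prompt}"

ask = make_cached_caller(fake_model)
ask("What is Bedrock?")
ask("What is Bedrock?")  # served from cache; no second model call
print(len(calls))
```

To use it for real, pass a function that wraps client.converse; for multi-process deployments, swap the in-memory cache for a shared store such as Redis.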
Key Takeaways
- Limit maxTokens and prompt length to control token usage and cost.
- Use cost-effective models like amazon.titan-text-express-v1 for less critical tasks.
- Batch requests and cache frequent queries to reduce API call overhead.
- Monitor usage regularly with AWS Cost Explorer to detect and manage cost spikes.