How-to · Intermediate · 3 min read

AWS Bedrock cost optimization strategies

Quick answer
Optimize AWS Bedrock costs by selecting appropriate models based on your workload, batching requests to reduce overhead, and monitoring usage with AWS Cost Explorer and CloudWatch. Implement caching and limit token usage per request to control expenses effectively.
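The caching mentioned above can be as simple as an in-memory map keyed by model and prompt, so repeated identical requests never hit the paid API. A minimal sketch; the `cache_key` and `cached_invoke` helpers are illustrative, not part of boto3:

```python
import hashlib

# In-memory response cache; repeated identical prompts skip the paid Bedrock call
_cache = {}

def cache_key(model_id, prompt):
    # Stable key derived from the model id and the exact prompt text
    return hashlib.sha256(f"{model_id}\x00{prompt}".encode()).hexdigest()

def cached_invoke(invoke_fn, model_id, prompt):
    key = cache_key(model_id, prompt)
    if key in _cache:
        return _cache[key]  # cache hit: zero cost
    result = invoke_fn(model_id, prompt)  # invoke_fn wraps your actual Bedrock call
    _cache[key] = result
    return result
```

In production you would likely swap the dict for Redis or DynamoDB and add a TTL, but the cost mechanics are the same: identical prompts are billed once.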

PREREQUISITES

  • Python 3.8+
  • AWS CLI configured with appropriate permissions
  • boto3 installed (pip install boto3)
  • AWS Bedrock access enabled in your AWS account

Setup

Install boto3 (the bedrock-runtime client requires a recent release, 1.28.57 or later) and configure the AWS CLI with credentials that have access to AWS Bedrock. Ensure model access is enabled for your AWS account in the Bedrock console.

bash
pip install --upgrade boto3
output
Collecting boto3
  Downloading boto3-1.34.0-py3-none-any.whl (139 kB)
Installing collected packages: boto3
Successfully installed boto3-1.34.0

Step by step

This example calls AWS Bedrock with boto3, selects a model, caps tokens per request, and runs several prompts through one reused client — the basic levers for keeping invocation costs down.

python
import json

import boto3

# Initialize one Bedrock runtime client and reuse it across calls
client = boto3.client('bedrock-runtime', region_name='us-east-1')

# Helper: invoke the model once per prompt, capping tokens on every request

def call_bedrock_batch(prompts, model_id='anthropic.claude-3-5-sonnet-20241022-v2:0', max_tokens=512):
    responses = []
    for prompt in prompts:
        # max_tokens belongs in the request body, not in the invoke_model kwargs
        body = json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "messages": [{"role": "user", "content": prompt}],
        })
        response = client.invoke_model(
            modelId=model_id,
            body=body,
            contentType='application/json',
            accept='application/json',
        )
        data = json.loads(response['body'].read())
        # Anthropic models on Bedrock return text under 'content', not 'choices'
        responses.append(data['content'][0]['text'])
    return responses

# Example usage
prompts = [
    "Explain cost optimization in AWS Bedrock.",
    "How to batch requests for better efficiency?"
]

outputs = call_bedrock_batch(prompts)
for i, output in enumerate(outputs):
    print(f"Response {i+1}: {output}\n")
output
Response 1: AWS Bedrock cost optimization involves selecting the right model, batching requests, and monitoring usage to reduce expenses.

Response 2: Batching requests reduces overhead by sending multiple prompts in one call, improving efficiency and lowering costs.
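The loop above still pays per-request overhead at on-demand prices. For large offline workloads, Bedrock also offers batch inference jobs, submitted through the `bedrock` control-plane client (not `bedrock-runtime`) with prompts staged as JSONL in S3, which AWS prices at a discount. A hedged sketch — the job name, bucket URIs, and role ARN below are placeholders you must supply:

```python
def batch_job_request(job_name, model_id, role_arn, input_s3, output_s3):
    """Build the request for bedrock.create_model_invocation_job (batch inference)."""
    return {
        "jobName": job_name,
        "modelId": model_id,
        "roleArn": role_arn,  # IAM role Bedrock assumes to read/write your S3 data
        "inputDataConfig": {"s3InputDataConfig": {"s3Uri": input_s3}},
        "outputDataConfig": {"s3OutputDataConfig": {"s3Uri": output_s3}},
    }

# Live submission (needs JSONL prompts in S3 and an IAM role Bedrock can assume):
# import boto3
# bedrock = boto3.client("bedrock", region_name="us-east-1")
# bedrock.create_model_invocation_job(**batch_job_request(
#     "nightly-batch", "anthropic.claude-3-5-sonnet-20241022-v2:0",
#     "arn:aws:iam::123456789012:role/BedrockBatchRole",
#     "s3://my-bucket/prompts.jsonl", "s3://my-bucket/output/"))
```

Batch jobs run asynchronously and write results back to S3, so they suit nightly or bulk processing rather than interactive use.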

Common variations

You can optimize costs further by using smaller or specialized models for less complex tasks, implementing asynchronous calls for concurrency, and integrating AWS CloudWatch for real-time usage monitoring.

python
import asyncio
import json

import aiobotocore.session

async def async_call_bedrock(prompt, model_id='amazon.titan-text-express-v1', max_tokens=256):
    session = aiobotocore.session.get_session()
    async with session.create_client('bedrock-runtime', region_name='us-east-1') as client:
        # Titan text models take inputText/textGenerationConfig, not a messages array
        body = json.dumps({
            "inputText": prompt,
            "textGenerationConfig": {"maxTokenCount": max_tokens},
        })
        response = await client.invoke_model(
            modelId=model_id,
            body=body,
            contentType='application/json',
            accept='application/json',
        )
        payload = await response['body'].read()
        data = json.loads(payload)
        return data['results'][0]['outputText']

async def main():
    prompts = ["Summarize AWS Bedrock pricing.", "Best practices for cost control."]
    tasks = [async_call_bedrock(p) for p in prompts]
    results = await asyncio.gather(*tasks)
    for i, res in enumerate(results):
        print(f"Async response {i+1}: {res}\n")

# To run: asyncio.run(main())
output
Async response 1: AWS Bedrock pricing depends on model usage and token consumption; choose models wisely.

Async response 2: Use request batching, token limits, and monitoring to control costs effectively.
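Bedrock publishes per-model metrics to CloudWatch under the `AWS/Bedrock` namespace (e.g. `Invocations`, `InputTokenCount`, `OutputTokenCount`), which you can pull with boto3 to see where tokens are going. A sketch that builds the query parameters; the live call is commented out because it needs AWS credentials:

```python
import datetime

def token_usage_params(model_id, hours=24):
    """CloudWatch query parameters for hourly Bedrock input-token usage."""
    now = datetime.datetime.now(datetime.timezone.utc)
    return {
        "Namespace": "AWS/Bedrock",
        "MetricName": "InputTokenCount",
        "Dimensions": [{"Name": "ModelId", "Value": model_id}],
        "StartTime": now - datetime.timedelta(hours=hours),
        "EndTime": now,
        "Period": 3600,            # hourly buckets
        "Statistics": ["Sum"],
    }

# Live query (needs cloudwatch:GetMetricStatistics permission):
# import boto3
# cw = boto3.client("cloudwatch", region_name="us-east-1")
# stats = cw.get_metric_statistics(**token_usage_params("amazon.titan-text-express-v1"))
# for p in sorted(stats["Datapoints"], key=lambda d: d["Timestamp"]):
#     print(p["Timestamp"], int(p["Sum"]), "input tokens")
```

Summing `InputTokenCount` and `OutputTokenCount` per model gives you the inputs to a cost estimate, since Bedrock bills on-demand usage by tokens.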

Troubleshooting

  • If you encounter AccessDeniedException, verify your AWS IAM permissions include Bedrock access.
  • High costs? Lower max_tokens per request and trim prompts to cut token consumption.
  • Unexpected latency? Try a smaller model, or run prompts concurrently instead of sequentially.

Key Takeaways

  • Select models aligned with task complexity to avoid overpaying for large models.
  • Reuse one client across prompts, and move large offline workloads to Bedrock batch inference jobs, which are priced at a discount.
  • Monitor usage with AWS Cost Explorer and CloudWatch to identify and control spending.
  • Cap max_tokens per request to prevent unexpectedly high token consumption.
  • Use asynchronous calls and smaller models for scalable, cost-effective Bedrock integration.
Verified 2026-04 · anthropic.claude-3-5-sonnet-20241022-v2:0, amazon.titan-text-express-v1