Code intermediate · 3 min read

How to stream AWS Bedrock responses in Python

Direct answer
Use the boto3 bedrock-runtime client's converse_stream method to receive AWS Bedrock responses in Python as a stream of events, printing partial text as it arrives.

Setup

Install
bash
pip install boto3
Env vars
AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION
Imports
python
import boto3
import os

Examples

In: User message: 'Explain quantum computing in simple terms.'
Out: Streaming response chunks printing partial explanations as they arrive.
In: User message: 'Summarize the latest AI research breakthroughs.'
Out: Streamed text output showing the summary progressively.
In: User message: 'Tell me a joke.'
Out: Streamed joke text printed chunk by chunk.

Integration steps

  1. Initialize the boto3 client for bedrock-runtime with AWS credentials and region.
  2. Build the messages list in the Converse API format: a role plus a list of content blocks.
  3. Call converse_stream to request a streamed response.
  4. Iterate over the events in response['stream'] as they arrive from the API.
  5. Extract and print the partial text from each contentBlockDelta event for real-time display.

Full code

python
import boto3
import os

# Initialize the Bedrock runtime client
client = boto3.client('bedrock-runtime', region_name=os.environ.get('AWS_DEFAULT_REGION'))

# Prepare the user message in the Converse API format
user_message = "Explain quantum computing in simple terms."
messages = [
    {"role": "user", "content": [{"text": user_message}]}
]

# Request a streamed response
response = client.converse_stream(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    messages=messages,
    inferenceConfig={"maxTokens": 512}
)

print("Streaming response:")

# Iterate over events as they arrive; boto3 parses each one into a dict,
# and contentBlockDelta events carry the partial text
for event in response["stream"]:
    if "contentBlockDelta" in event:
        print(event["contentBlockDelta"]["delta"]["text"], end='', flush=True)

print()  # Newline after streaming completes
output
Streaming response:
Quantum computing is a type of computing that uses quantum bits, or qubits, which can represent both 0 and 1 simultaneously, allowing computers to solve certain problems much faster than classical computers.

API trace

Request
json
{"modelId": "anthropic.claude-3-5-sonnet-20241022-v2:0", "body": "{\"anthropic_version\": \"bedrock-2023-05-31\", \"max_tokens\": 512, \"messages\": [{\"role\": \"user\", \"content\": [{\"type\": \"text\", \"text\": \"Explain quantum computing in simple terms.\"}]}]}" , "stream": true}
Response
json
{"output": {"message": {"content": [{"type": "text", "text": "partial streamed text chunk"}]}}}
Extractchunk_json['output']['message']['content'][0]['text']
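The stream carries more than text deltas: messageStart, contentBlockStop, messageStop, and a final metadata event with token usage also arrive. The sketch below (a hypothetical helper, not part of boto3) accumulates text and the stop reason from already-parsed event dicts, so it can be exercised without an AWS connection:

```python
def collect_stream_text(events):
    """Accumulate partial text and the stop reason from Converse stream events.

    Operates on the parsed event dicts that boto3 yields from
    response["stream"], ignoring event types that carry no text.
    """
    parts = []
    stop_reason = None
    for event in events:
        if "contentBlockDelta" in event:
            # Text deltas carry the partial output
            parts.append(event["contentBlockDelta"]["delta"].get("text", ""))
        elif "messageStop" in event:
            stop_reason = event["messageStop"]["stopReason"]
    return "".join(parts), stop_reason


# Example with hand-written events mirroring the trace above
sample = [
    {"messageStart": {"role": "assistant"}},
    {"contentBlockDelta": {"delta": {"text": "Quantum "}, "contentBlockIndex": 0}},
    {"contentBlockDelta": {"delta": {"text": "computing"}, "contentBlockIndex": 0}},
    {"messageStop": {"stopReason": "end_turn"}},
]
text, reason = collect_stream_text(sample)
print(text, reason)  # Quantum computing end_turn
```

In the streaming loop, the same logic runs inline; separating it into a function makes the parsing testable.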

Variants

Non-streaming synchronous call

Use when you want the full response at once without streaming.

python
import boto3
import os

client = boto3.client('bedrock-runtime', region_name=os.environ.get('AWS_DEFAULT_REGION'))

user_message = "Explain quantum computing in simple terms."

response = client.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    messages=[
        {"role": "user", "content": [{"text": user_message}]}
    ],
    inferenceConfig={"maxTokens": 512}
)

text = response['output']['message']['content'][0]['text']
print(text)
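The non-streaming response also includes a usage block with token counts. A small formatter, sketched here as a hypothetical helper (not part of boto3), pulls it out:

```python
def summarize_usage(response):
    """Format the token accounting from a Converse API response dict."""
    usage = response.get("usage", {})
    return "in={} out={} total={}".format(
        usage.get("inputTokens", 0),
        usage.get("outputTokens", 0),
        usage.get("totalTokens", 0),
    )


# Example with a stubbed response shaped like the Converse API output
stub = {"usage": {"inputTokens": 12, "outputTokens": 48, "totalTokens": 60}}
print(summarize_usage(stub))  # in=12 out=48 total=60
```

Logging these counts per call is the simplest way to track spend against the per-token pricing below.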
Async streaming with aiobotocore

Use for asynchronous applications requiring non-blocking streaming.

python
import asyncio
import os

from aiobotocore.session import get_session

async def stream_bedrock():
    session = get_session()
    async with session.create_client('bedrock-runtime', region_name=os.environ.get('AWS_DEFAULT_REGION')) as client:
        user_message = "Explain quantum computing in simple terms."

        response = await client.converse_stream(
            modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
            messages=[
                {"role": "user", "content": [{"text": user_message}]}
            ],
            inferenceConfig={"maxTokens": 512}
        )

        # aiobotocore exposes the event stream as an async iterator
        async for event in response["stream"]:
            if "contentBlockDelta" in event:
                print(event["contentBlockDelta"]["delta"]["text"], end='', flush=True)

        print()

asyncio.run(stream_bedrock())

Performance

Latency: ~1-2 seconds to first streamed chunk for typical queries
Cost: ~$0.0025 per 500 tokens for Anthropic Claude models on Bedrock
Rate limits: Default AWS Bedrock limits vary by account; typically 60 RPM and 120,000 TPM
  • Limit maxTokens in inferenceConfig to reduce cost and latency.
  • Use concise prompts to minimize input tokens.
  • Stream responses to start processing output early and reduce perceived latency.

Approach | Latency | Cost/call | Best for
Streaming via boto3 converse_stream | ~1-2s to first chunk | ~$0.0025 per 500 tokens | Real-time UI updates, chatbots
Non-streaming converse call | ~3-5s total | ~$0.0025 per 500 tokens | Simple scripts, batch processing
Async streaming with aiobotocore | ~1-2s to first chunk | ~$0.0025 per 500 tokens | Async apps, web servers
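All three approaches share the same request shape, so a small builder (a hypothetical helper, not part of boto3) keeps the message format and maxTokens cap in one place:

```python
def build_converse_kwargs(model_id, prompt, max_tokens=512):
    """Assemble the keyword arguments shared by converse and converse_stream."""
    return {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": max_tokens},
    }


kwargs = build_converse_kwargs(
    "anthropic.claude-3-5-sonnet-20241022-v2:0", "Tell me a joke.", max_tokens=256
)
# Pass the same kwargs to either call style:
#   client.converse(**kwargs)
#   client.converse_stream(**kwargs)
print(kwargs["inferenceConfig"])  # {'maxTokens': 256}
```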

Quick tip

Use client.converse_stream() instead of client.converse() to receive partial responses as they are generated for better UX.

Common mistake

Beginners often try to decode and json-parse raw bytes from the stream; with converse_stream, boto3 already parses each event into a dict, and only contentBlockDelta events carry partial text, so indexing other event types for text raises KeyError.
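A defensive accessor sidesteps that mistake by returning an empty string for any event that is not a text delta (a sketch, not a boto3 API):

```python
def delta_text(event):
    """Return the partial text from a stream event, or '' for other event types."""
    return event.get("contentBlockDelta", {}).get("delta", {}).get("text", "")


print(delta_text({"contentBlockDelta": {"delta": {"text": "hi"}}}))   # hi
print(repr(delta_text({"messageStop": {"stopReason": "end_turn"}})))  # ''
```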

Verified 2026-04 · anthropic.claude-3-5-sonnet-20241022-v2:0