How-to · Intermediate · 4 min read

How to build an AI-powered data pipeline

Quick answer
Build an AI-powered data pipeline by ingesting raw data, preprocessing it, and then using an AI model like gpt-4o via the OpenAI API to analyze or transform the data. Automate the flow with Python scripts and schedule with tools like Airflow or Prefect for scalable, repeatable AI-driven workflows.

Prerequisites

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" (quoted so the shell doesn't treat >= as redirection)
  • Basic knowledge of data processing libraries (pandas, requests)

Setup environment

Install required Python packages and set your environment variable for the OpenAI API key.

bash
pip install openai pandas requests
export OPENAI_API_KEY="your-api-key-here"

Step-by-step pipeline code

This example ingests JSON data from a public API, preprocesses it with pandas, then sends a prompt to gpt-4o to generate insights. The pipeline is synchronous and prints the AI response.

python
import os
import requests
import pandas as pd
from openai import OpenAI

# Initialize OpenAI client
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Step 1: Data ingestion
response = requests.get("https://jsonplaceholder.typicode.com/posts", timeout=10)
response.raise_for_status()  # fail fast on HTTP errors
data = response.json()

# Step 2: Data preprocessing
# Convert to DataFrame and filter posts with userId=1
df = pd.DataFrame(data)
filtered = df[df["userId"] == 1]

# Step 3: Prepare prompt for AI
prompt = "Summarize the following post titles:\n" + "\n".join(filtered["title"].tolist())

# Step 4: Call AI model
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)

# Step 5: Output AI response
print("AI Summary:\n", response.choices[0].message.content)
output
AI Summary:
 The posts by user 1 cover various topics including sunt aut facere repellat provident occaecati excepturi optio reprehenderit, qui est esse, ea molestias quasi exercitationem repellat qui ipsa sit aut, eum et est occaecati, and nesciunt quas odio.

Common variations

  • Use async calls with httpx and asyncio for faster ingestion.
  • Swap in another provider's model, such as claude-3-5-haiku-20241022 via the Anthropic SDK, when it better fits your cost or latency requirements.
  • Integrate with workflow orchestrators like Apache Airflow or Prefect for scheduling and monitoring.
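The async variation above can be sketched with asyncio.gather. This toy version simulates the HTTP call so it runs offline; in a real pipeline you would replace the body of fetch with an httpx.AsyncClient request, as the bullet suggests:

```python
import asyncio

async def fetch(url: str) -> dict:
    # Stand-in for a real HTTP request. With httpx you would write:
    #   async with httpx.AsyncClient() as client:
    #       return (await client.get(url)).json()
    await asyncio.sleep(0.1)  # simulate network latency
    return {"url": url, "status": 200}

async def ingest(urls):
    # gather() runs all fetches concurrently, so total wall time is
    # roughly one request's latency rather than the sum of all of them
    return await asyncio.gather(*(fetch(u) for u in urls))

urls = [f"https://jsonplaceholder.typicode.com/posts/{i}" for i in range(1, 4)]
results = asyncio.run(ingest(urls))
print(f"Fetched {len(results)} responses")
```

The concurrency win grows with the number of endpoints: ten sequential 100 ms requests take about a second, while the gathered version still takes about 100 ms.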

Troubleshooting tips

  • If you get authentication errors, verify your OPENAI_API_KEY environment variable is set correctly.
  • For rate limit errors, implement exponential backoff retries.
  • If the AI response is incomplete, increase max_tokens in the API call.
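The rate-limit tip above can be sketched as a small retry wrapper. This is a minimal version with simulated failures; in production you would catch the SDK's specific rate-limit exception rather than a bare Exception:

```python
import random
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0):
    """Retry fn() with exponential backoff plus jitter on transient errors."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries, surface the error
            # delay doubles each attempt: base, 2*base, 4*base, ...
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)

# Demo: a flaky function that fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

result = call_with_backoff(flaky, base_delay=0.01)
print(result)  # ok
```

In the pipeline above you would wrap the client.chat.completions.create call, e.g. call_with_backoff(lambda: client.chat.completions.create(...)).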

Key Takeaways

  • Use Python with the OpenAI SDK for easy AI integration.
  • Preprocess data before sending to AI to reduce token usage and improve results.
  • Automate and schedule pipelines with tools like Airflow or Prefect for production use.
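The preprocessing takeaway can be made concrete with a rough character budget. This is a toy sketch using the common ≈4-characters-per-token rule of thumb; accurate counting would use a real tokenizer such as tiktoken:

```python
def cap_prompt_items(items, max_chars=2000):
    """Keep only as many items as fit within a character budget."""
    kept, used = [], 0
    for item in items:
        if used + len(item) + 1 > max_chars:  # +1 for the joining newline
            break
        kept.append(item)
        used += len(item) + 1
    return kept

titles = [f"post title number {i}" for i in range(100)]
trimmed = cap_prompt_items(titles, max_chars=200)
prompt = "Summarize the following post titles:\n" + "\n".join(trimmed)
print(f"Kept {len(trimmed)} of {len(titles)} titles")
```

Capping the prompt this way bounds both cost and latency, since pricing and processing time scale with input tokens.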
Verified 2026-04 · gpt-4o, claude-3-5-haiku-20241022