How-to · Beginner to intermediate · 3 min read

How to use Ray Serve for ML models

Quick answer
Use Ray Serve to deploy and scale ML models by defining model-serving classes or functions and running them as Ray Serve deployments. Serve handles HTTP routing, scaling, and concurrency, so you get production-ready model serving in plain Python.

PREREQUISITES

  • Python 3.8+
  • pip install "ray[serve]"
  • Basic knowledge of Python and ML model APIs

Set up Ray Serve

Install Ray with Serve support and start a Ray cluster locally. Ray Serve runs on top of Ray to provide scalable model serving.

bash
pip install "ray[serve]"
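
Note that ray.init() in the example below starts a local Ray instance on its own, so no extra step is strictly required. If you prefer an explicit, long-lived local cluster, you can start a head node first (optional):

```bash
# Start a local Ray head node; scripts then connect via ray.init(address="auto")
ray start --head

# Tear the local cluster down when you are done
ray stop
```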

Step-by-step example

This example shows how to deploy a simple ML model using Ray Serve. We define a deployment class that loads a model and handles prediction requests via HTTP.

python
import ray
from ray import serve
from starlette.requests import Request

ray.init()

@serve.deployment
class ModelDeployment:
    def __init__(self):
        # Simulate loading a model: this dummy "model" just reverses its input
        self.model = lambda x: x[::-1]

    async def __call__(self, request: Request):
        data = await request.json()
        input_text = data.get("input", "")
        prediction = self.model(input_text)
        return {"prediction": prediction}

# Deploy the model at /predict (Ray Serve 2.x API; serve.run also starts Serve)
serve.run(ModelDeployment.bind(), route_prefix="/predict")

# Client example (the HTTP proxy keeps serving while this process is alive)
import requests
response = requests.post("http://127.0.0.1:8000/predict", json={"input": "hello"})
print(response.json())
output
{"prediction": "olleh"}
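
The same endpoint can be exercised from the command line; this assumes the server from the example above is running on the default host and port:

```bash
curl -X POST http://127.0.0.1:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"input": "hello"}'
```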

Common variations

  • Use async model inference for I/O-bound models.
  • Deploy multiple models as separate Ray Serve deployments.
  • Integrate with FastAPI or other ASGI frameworks for advanced routing.
  • Use Ray Serve's autoscaling and batching features for production workloads.
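
As a rough sketch of the last two variations, deployment options such as autoscaling_config and the serve.batch decorator can be combined on one deployment. The parameter values below are illustrative placeholders, not tuned recommendations:

```python
from ray import serve

@serve.deployment(
    autoscaling_config={"min_replicas": 1, "max_replicas": 4},
    ray_actor_options={"num_cpus": 1},
)
class BatchedModel:
    # Collect up to 8 concurrent requests into a single batched model call
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
    async def predict_batch(self, inputs):
        return [x[::-1] for x in inputs]  # dummy batched "model"

    async def __call__(self, request):
        data = await request.json()
        return {"prediction": await self.predict_batch(data.get("input", ""))}
```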

Troubleshooting

  • If the server is unreachable, make sure Ray is running, the deployment was actually started, and the serving process stays alive while you send requests.
  • Check for port conflicts on 8000 (default HTTP port).
  • Use ray status to verify cluster health.
  • Enable Ray Serve logs for debugging by setting RAY_LOG_TO_STDERR=1.

Key Takeaways

  • Ray Serve lets you deploy ML models as scalable HTTP endpoints with minimal code.
  • Define model logic in Python classes or functions decorated with @serve.deployment.
  • Ray Serve handles concurrency, autoscaling, and batching for production readiness.
Verified 2026-04