How-to · Beginner to intermediate · 3 min read

How to use Ray Serve for ML models

Quick answer
Use Ray Serve to deploy and scale ML models by defining model-serving classes or functions and running them as Ray Serve deployments. Serve handles HTTP routing, scaling, and concurrency, so you get production-ready model serving in plain Python.

PREREQUISITES

  • Python 3.8+
  • pip install "ray[serve]"
  • Basic knowledge of Python and ML model APIs

Set up Ray Serve

Install Ray with Serve support and start a Ray cluster locally. Ray Serve runs on top of Ray to provide scalable model serving.

bash
pip install "ray[serve]"
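
Note that ray.init() in the example below starts a local Ray instance on its own, so no extra step is strictly required. If you prefer an explicit, long-lived local cluster, you can start a head node first (optional):

```bash
# Start a local Ray head node; scripts then connect via ray.init(address="auto")
ray start --head

# Tear the local cluster down when you are done
ray stop
```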

Step-by-step example

This example shows how to deploy a simple ML model using Ray Serve. We define a deployment class that loads a model and handles prediction requests via HTTP.

python
import ray
from ray import serve
from starlette.requests import Request

ray.init()

@serve.deployment
class ModelDeployment:
    def __init__(self):
        # Simulate loading a model: this dummy "model" just reverses its input
        self.model = lambda x: x[::-1]

    async def __call__(self, request: Request):
        data = await request.json()
        input_text = data.get("input", "")
        prediction = self.model(input_text)
        return {"prediction": prediction}

# Deploy the model at /predict (Ray Serve 2.x API; serve.run also starts Serve)
serve.run(ModelDeployment.bind(), route_prefix="/predict")

# Client example (the HTTP proxy keeps serving while this process is alive)
import requests
response = requests.post("http://127.0.0.1:8000/predict", json={"input": "hello"})
print(response.json())
output
{"prediction": "olleh"}
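
The same endpoint can be exercised from the command line; this assumes the server from the example above is running on the default host and port:

```bash
curl -X POST http://127.0.0.1:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"input": "hello"}'
```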

Common variations

  • Use async model inference for I/O-bound models.
  • Deploy multiple models as separate Ray Serve deployments.
  • Integrate with FastAPI or other ASGI frameworks for advanced routing.
  • Use Ray Serve's autoscaling and batching features for production workloads.
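
As a rough sketch of the last two variations, deployment options such as autoscaling_config and the serve.batch decorator can be combined on one deployment. The parameter values below are illustrative placeholders, not tuned recommendations:

```python
from ray import serve

@serve.deployment(
    autoscaling_config={"min_replicas": 1, "max_replicas": 4},
    ray_actor_options={"num_cpus": 1},
)
class BatchedModel:
    # Collect up to 8 concurrent requests into a single batched model call
    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
    async def predict_batch(self, inputs):
        return [x[::-1] for x in inputs]  # dummy batched "model"

    async def __call__(self, request):
        data = await request.json()
        return {"prediction": await self.predict_batch(data.get("input", ""))}
```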

Troubleshooting

  • If the server is unreachable, make sure Ray is running, the deployment was actually started, and the serving process stays alive while you send requests.
  • Check for port conflicts on 8000 (default HTTP port).
  • Use ray status to verify cluster health.
  • Enable Ray Serve logs for debugging by setting RAY_LOG_TO_STDERR=1.

Key Takeaways

  • Ray Serve lets you deploy ML models as scalable HTTP endpoints with minimal code.
  • Define model logic in Python classes or functions decorated with @serve.deployment.
  • Ray Serve handles concurrency, autoscaling, and batching for production readiness.
Verified 2026-04