How to · Intermediate · 4 min read

How to fine-tune models on Vertex AI

Quick answer
Use the vertexai Python SDK and the google-cloud-aiplatform client library to fine-tune models on Vertex AI. Initialize the SDK with your Google Cloud project, prepare your training dataset in Cloud Storage, submit a custom training job, and deploy the fine-tuned model for inference.

PREREQUISITES

  • Python 3.8+
  • Google Cloud project with Vertex AI enabled
  • Service account with Vertex AI permissions
  • gcloud CLI installed and authenticated
  • pip install vertexai google-cloud-aiplatform

Setup

Install the required Python packages and authenticate your Google Cloud environment.

  • Install the vertexai and google-cloud-aiplatform SDKs.
  • Set up authentication with a service account or use gcloud auth application-default login.
  • Initialize the Vertex AI SDK with your project and location.
bash
pip install vertexai google-cloud-aiplatform
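For local development, authentication is typically set up through the gcloud CLI; the project ID below is a placeholder.

```shell
# Create Application Default Credentials for local development
gcloud auth application-default login

# Point gcloud at your project (replace with your project ID)
gcloud config set project YOUR_PROJECT_ID

# Ensure the Vertex AI API is enabled for the project
gcloud services enable aiplatform.googleapis.com
```

In production, prefer attaching a service account to the workload instead of user credentials.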

Step by step

This example demonstrates fine-tuning a text generation model on Vertex AI using the Python SDK. It includes dataset preparation, training job creation, and model deployment.
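Before submitting a job, the training data referenced by the Cloud Storage URI needs to exist. The JSONL schema below (`input_text`/`output_text` field names) is an illustrative assumption; the actual format is whatever your training package parses. A minimal sketch of preparing such a file locally:

```python
import json
import os
import tempfile

# Hypothetical prompt/completion pairs; the real schema is defined by
# the code in your training package.
examples = [
    {"input_text": "Summarize: Vertex AI is ...", "output_text": "A managed ML platform."},
    {"input_text": "Translate to French: hello", "output_text": "bonjour"},
]

# Write one JSON object per line (JSONL)
path = os.path.join(tempfile.gettempdir(), "training_data.jsonl")
with open(path, "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Read the file back to verify it is valid JSONL
with open(path) as f:
    loaded = [json.loads(line) for line in f]
print(len(loaded))  # 2
```

Upload the resulting file to your bucket (for example with `gsutil cp training_data.jsonl gs://your-bucket/path/to/`) so the job can read it.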

python
import os

import vertexai
from google.cloud import aiplatform

# Initialize the high-level Vertex AI SDK (the gapic clients below are
# configured separately via their api_endpoint option)
PROJECT_ID = os.environ.get('GOOGLE_CLOUD_PROJECT')
LOCATION = 'us-central1'

vertexai.init(project=PROJECT_ID, location=LOCATION)

# Define dataset URI (Cloud Storage path to your training data)
training_data_uri = 'gs://your-bucket/path/to/training_data.jsonl'

# Create a Vertex AI client
client = aiplatform.gapic.JobServiceClient(client_options={"api_endpoint": f"{LOCATION}-aiplatform.googleapis.com"})

# Define the custom job. A CustomJob wraps its worker pool configuration
# in a job_spec. (Dataset fraction splits and automatic model upload are
# TrainingPipeline features, not CustomJob features, so any train/validation/
# test split is handled inside the training code itself.)
custom_job = {
    "display_name": "fine_tune_text_generation",
    "job_spec": {
        "worker_pool_specs": [
            {
                "machine_spec": {
                    "machine_type": "n1-standard-4"
                },
                "replica_count": 1,
                "python_package_spec": {
                    "executor_image_uri": "gcr.io/cloud-aiplatform/training/tf-cpu.2-8:latest",
                    "package_uris": ["gs://your-bucket/path/to/your_training_package.tar.gz"],
                    "python_module": "trainer.task",
                    "args": [
                        f"--data_path={training_data_uri}",
                        "--model_name=vertex-text-gen-base",
                        "--output_dir=gs://your-bucket/path/to/output"
                    ]
                }
            }
        ]
    }
}

parent = f"projects/{PROJECT_ID}/locations/{LOCATION}"

# Submit the training job
response = client.create_custom_job(parent=parent, custom_job=custom_job)
print(f"Training job submitted: {response.name}")

# After training completes, upload the trained artifacts as a Model
# (e.g. with ModelServiceClient.upload_model), then deploy it
model_client = aiplatform.gapic.ModelServiceClient(client_options={"api_endpoint": f"{LOCATION}-aiplatform.googleapis.com"})
model_name = f"projects/{PROJECT_ID}/locations/{LOCATION}/models/your-model-id"

endpoint_client = aiplatform.gapic.EndpointServiceClient(client_options={"api_endpoint": f"{LOCATION}-aiplatform.googleapis.com"})

# create_endpoint returns a long-running operation; wait for it to finish
endpoint = endpoint_client.create_endpoint(parent=parent, endpoint={"display_name": "fine_tuned_text_gen_endpoint"}).result()

# Deploy model to endpoint
deployed_model = {
    "model": model_name,
    "display_name": "fine_tuned_text_gen",
    "automatic_resources": {"min_replica_count": 1, "max_replica_count": 1}
}

# deploy_model is also a long-running operation; route all traffic to the new model
endpoint_client.deploy_model(endpoint=endpoint.name, deployed_model=deployed_model, traffic_split={"0": 100}).result()
print(f"Model deployed to endpoint: {endpoint.name}")
output
Training job submitted: projects/123456789/locations/us-central1/customJobs/987654321
Model deployed to endpoint: projects/123456789/locations/us-central1/endpoints/1234567890
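If you prefer to split the data yourself before uploading, rather than relying on a pipeline-managed split, an 80/10/10 split is easy to do client-side. A sketch (the seed and fractions are arbitrary choices):

```python
import random

def split_records(records, train=0.8, val=0.1, seed=42):
    """Shuffle records and split them into train/validation/test lists."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train)
    n_val = int(n * val)
    return (
        shuffled[:n_train],                    # training set
        shuffled[n_train:n_train + n_val],     # validation set
        shuffled[n_train + n_val:],            # test set (the remainder)
    )

records = [{"id": i} for i in range(100)]
train_set, val_set, test_set = split_records(records)
print(len(train_set), len(val_set), len(test_set))  # 80 10 10
```

Write each split to its own JSONL file and upload all three to Cloud Storage so the trainer can read them separately.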

Common variations

You can fine-tune different model types, such as text generation or image models, by adjusting the training package and its parameters. Jobs can be monitored asynchronously by polling the job state, and the deployed endpoint serves online predictions once the model is live.

python
from google.cloud import aiplatform
import time

client = aiplatform.gapic.JobServiceClient(client_options={"api_endpoint": f"{LOCATION}-aiplatform.googleapis.com"})

job_name = response.name  # from previous training job submission

# Poll job status
while True:
    job = client.get_custom_job(name=job_name)
    state = job.state
    print(f"Job state: {state.name}")
    if state == aiplatform.gapic.JobState.JOB_STATE_SUCCEEDED:
        print("Training completed successfully.")
        break
    elif state in (aiplatform.gapic.JobState.JOB_STATE_FAILED, aiplatform.gapic.JobState.JOB_STATE_CANCELLED):
        print("Training failed or cancelled.")
        break
    time.sleep(30)
output
Job state: JOB_STATE_RUNNING
Job state: JOB_STATE_RUNNING
Job state: JOB_STATE_SUCCEEDED
Training completed successfully.
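The polling loop above can be factored into a reusable helper that accepts any state-fetching callable, which also makes it easy to unit-test without touching the API. A sketch (the state names mirror `JobState` members; `wait_for_job` itself is illustrative, not part of the SDK):

```python
import time

SUCCESS_STATES = {"JOB_STATE_SUCCEEDED"}
FAILURE_STATES = {"JOB_STATE_FAILED", "JOB_STATE_CANCELLED"}

def wait_for_job(get_state, poll_interval=30, sleep=time.sleep):
    """Poll get_state() until a terminal state is reached.

    Returns the final state string. The sleep function is injectable
    so tests can skip real waiting.
    """
    while True:
        state = get_state()
        if state in SUCCESS_STATES or state in FAILURE_STATES:
            return state
        sleep(poll_interval)

# Stubbed state sequence standing in for client.get_custom_job(...).state.name
states = iter(["JOB_STATE_RUNNING", "JOB_STATE_RUNNING", "JOB_STATE_SUCCEEDED"])
final = wait_for_job(lambda: next(states), sleep=lambda _: None)
print(final)  # JOB_STATE_SUCCEEDED
```

In real use you would pass `lambda: client.get_custom_job(name=job_name).state.name` as the callable.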

Troubleshooting

  • Authentication errors: Ensure your service account has Vertex AI Admin and Storage Object Viewer roles.
  • Training job fails: Check logs in Google Cloud Console under Vertex AI > Training jobs.
  • Model deployment issues: Verify model resource availability and endpoint quota limits.

Key Takeaways

  • Use the vertexai SDK to manage fine-tuning workflows on Vertex AI.
  • Prepare your training data in Cloud Storage and package your training code for custom jobs.
  • Monitor training asynchronously and deploy models to endpoints for scalable inference.
Verified 2026-04 · vertexai CustomTrainingJob, vertexai endpoints