Vertex AI supervised fine-tuning guide
Quick answer
Use the google-cloud-aiplatform Python SDK to create a fine-tuning job on Vertex AI by preparing a labeled dataset, configuring a training job (a CustomJob in this guide), and deploying the fine-tuned model. The process involves uploading training data to Google Cloud Storage, defining training parameters, and monitoring the job via the SDK or the Google Cloud Console.

Prerequisites
- Python 3.8+
- Google Cloud project with Vertex AI enabled
- Google Cloud SDK installed and configured
- Service account with Vertex AI permissions
- pip install google-cloud-aiplatform
Setup
Install the google-cloud-aiplatform SDK and set environment variables for authentication and project configuration.
- Enable Vertex AI API in your Google Cloud project.
- Set GOOGLE_APPLICATION_CREDENTIALS to your service account JSON key.
- Install the SDK with pip install google-cloud-aiplatform.

pip install google-cloud-aiplatform

Step by step
This example demonstrates supervised fine-tuning on Vertex AI using a prepared dataset in Google Cloud Storage. It creates a CustomJob to train a model and deploys the fine-tuned model.
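The example assumes a labeled training CSV has already been uploaded to Cloud Storage. A minimal sketch of what building such a file might look like locally (the column names feature_a, feature_b, and label are hypothetical; use whatever schema your trainer expects):

```python
import csv
import io

# Hypothetical labeled rows: two feature columns and one label column
rows = [
    {"feature_a": 1.0, "feature_b": 2.5, "label": 1},
    {"feature_a": 0.3, "feature_b": 4.1, "label": 0},
]

# Write the rows as CSV text, header first
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["feature_a", "feature_b", "label"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
```

The resulting text can be saved and uploaded to the bucket (for example with gsutil cp or the google-cloud-storage client) at the path used for TRAINING_DATA_URI below.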
from google.cloud import aiplatform
import os
# Set your Google Cloud project and region
PROJECT_ID = os.environ.get('GOOGLE_CLOUD_PROJECT')
REGION = 'us-central1'
BUCKET_NAME = 'your-gcs-bucket'
TRAINING_DATA_URI = f'gs://{BUCKET_NAME}/training_data.csv'
# Initialize Vertex AI SDK
aiplatform.init(project=PROJECT_ID, location=REGION)
# Define training job parameters
job_display_name = 'vertex-ai-supervised-finetune'
# Define training container image (example: custom training container or prebuilt)
training_container_image = 'gcr.io/cloud-aiplatform/training/tf-cpu.2-11:latest'
# Define worker pool spec
worker_pool_specs = [
{
'machine_spec': {'machine_type': 'n1-standard-4'},
'replica_count': 1,
'container_spec': {
'image_uri': training_container_image,
'command': [
'python3', 'trainer/task.py',
'--data-path', TRAINING_DATA_URI
]
}
}
]
# Create CustomJob
custom_job = aiplatform.CustomJob(
display_name=job_display_name,
worker_pool_specs=worker_pool_specs
)
# Run training job
custom_job.run(sync=True)
# After training, deploy the model (example assumes model artifact is saved to GCS)
model_display_name = 'vertex-ai-finetuned-model'
model_artifact_uri = f'gs://{BUCKET_NAME}/model/'
model = aiplatform.Model.upload(
display_name=model_display_name,
artifact_uri=model_artifact_uri,
serving_container_image_uri='us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-11:latest'
)
endpoint = model.deploy(machine_type='n1-standard-4')
print(f'Model deployed to endpoint: {endpoint.resource_name}')

Output
Model deployed to endpoint: projects/PROJECT_ID/locations/us-central1/endpoints/1234567890
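After deployment, the endpoint serves online predictions via endpoint.predict(instances=[...]). A sketch of the request payload that call sends (build_prediction_request is a helper written here for illustration, not part of the SDK, and the feature names are hypothetical; instances must match your model's input schema):

```python
def build_prediction_request(instances):
    # Vertex AI online prediction expects a JSON body with an 'instances' list;
    # each entry must match the serving container's expected input schema.
    return {'instances': instances}

# Hypothetical feature values for a single prediction
request = build_prediction_request([{'feature_a': 1.0, 'feature_b': 2.5}])
```

With the SDK, the equivalent call is endpoint.predict(instances=[{'feature_a': 1.0, 'feature_b': 2.5}]), which returns an object whose predictions attribute holds the model outputs.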
Common variations
You can fine-tune using different training frameworks by specifying custom containers or prebuilt containers for PyTorch, TensorFlow, or scikit-learn.
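For example, switching to a PyTorch container only changes the worker pool spec. A sketch, where the image URI and script path are illustrative (check the Vertex AI documentation for the current list of prebuilt training container URIs):

```python
# Illustrative prebuilt PyTorch training image; verify the exact URI and tag
# against the Vertex AI prebuilt containers reference before using it.
pytorch_image = 'us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-13:latest'

pytorch_worker_pool_specs = [
    {
        'machine_spec': {'machine_type': 'n1-standard-8'},
        'replica_count': 1,
        'container_spec': {
            'image_uri': pytorch_image,
            # Hypothetical entry point and data path
            'command': [
                'python3', 'trainer/task.py',
                '--data-path', 'gs://your-gcs-bucket/training_data.csv',
            ],
        },
    }
]
```

Passing this list as worker_pool_specs to aiplatform.CustomJob is all that changes; the rest of the job creation and deployment flow stays the same.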
Use asynchronous job execution by setting sync=False in custom_job.run() to monitor progress separately.
For large datasets, use Vertex AI Dataset resources and AutoML training jobs for easier management.
import os
from google.cloud import aiplatform

aiplatform.init(project=os.environ['GOOGLE_CLOUD_PROJECT'], location='us-central1')
# Async training example (reuses worker_pool_specs defined in the previous example)
custom_job = aiplatform.CustomJob(
display_name='async-finetune-job',
worker_pool_specs=worker_pool_specs
)
custom_job.run(sync=False)
print(f'Training job started: {custom_job.resource_name}')

Output
Training job started: projects/PROJECT_ID/locations/us-central1/customJobs/1234567890
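With sync=False the call returns as soon as the job is created, so the script must check progress itself. One way to poll until completion (wait_for_job is a helper sketched here, not part of the SDK; it assumes the job object exposes the SDK's state property, which returns a JobState enum value):

```python
import time

def wait_for_job(job, poll_seconds=60, timeout_seconds=3600):
    """Poll a Vertex AI job object until it reaches a terminal state."""
    terminal = {'JOB_STATE_SUCCEEDED', 'JOB_STATE_FAILED', 'JOB_STATE_CANCELLED'}
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        state = job.state.name  # JobState enum member, e.g. JOB_STATE_RUNNING
        if state in terminal:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f'Job did not finish within {timeout_seconds} seconds')
```

Calling wait_for_job(custom_job) after custom_job.run(sync=False) blocks until the job succeeds, fails, or is cancelled, and returns the final state name.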
Troubleshooting
- Authentication errors: Ensure GOOGLE_APPLICATION_CREDENTIALS points to a valid service account JSON key with Vertex AI permissions.
- Permission denied: Verify your service account has roles like Vertex AI Admin and Storage Object Viewer.
- Training job fails: Check logs in the Google Cloud Console under Vertex AI > Training jobs for detailed error messages.
- Model deployment issues: Confirm the model artifact path is correct and the serving container image matches your model framework.
Key Takeaways
- Use the official google-cloud-aiplatform SDK to manage supervised fine-tuning jobs on Vertex AI.
- Prepare and upload your labeled training data to Google Cloud Storage before starting a fine-tuning job.
- Monitor training jobs asynchronously and deploy fine-tuned models with appropriate serving containers.
- Ensure proper IAM permissions and authentication setup to avoid common errors.
- Customize training with different containers or frameworks by adjusting the worker pool specs.