How to load CSV for machine learning in pandas

Use pandas.read_csv() to load a CSV file into a DataFrame, then split it into features and labels and preprocess it as needed for your machine learning task. This is the usual way to prepare tabular data before passing it to a framework such as PyTorch.

PREREQUISITES

Python 3.8+
pip install "pandas>=1.5.0"
Basic knowledge of pandas and PyTorch

Setup

Install pandas if it is not already installed; it handles the CSV loading and preprocessing. The tensor-conversion step below also requires PyTorch (the torch package).

bash

pip install pandas torch

Step by step

Load a CSV file into a pandas.DataFrame, inspect the data, and prepare features and labels for machine learning.

python

import pandas as pd
import torch

# Load CSV file into DataFrame
df = pd.read_csv('data.csv')

# Display first 5 rows
print(df.head())

# Separate features and target label
X = df.drop(columns='target')  # Replace 'target' with your label column name
y = df['target']

# Convert to PyTorch tensors if needed (all feature columns must be numeric)
X_tensor = torch.tensor(X.to_numpy(), dtype=torch.float32)
y_tensor = torch.tensor(y.to_numpy(), dtype=torch.long)  # or float32 for regression

print('Features tensor shape:', X_tensor.shape)
print('Labels tensor shape:', y_tensor.shape)

output

   feature1  feature2  target
0       1.0       3.5       0
1       2.1       4.2       1
2       1.3       3.8       0
3       0.7       3.0       1
4       1.8       4.0       0
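Features tensor shape: torch.Size([N, M])
Labels tensor shape: torch.Size([N])

In the shapes above, N stands for the number of rows in data.csv and M for the number of feature columns (2 in this example); the real output shows concrete integers.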
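As a self-contained illustration of the split above, the following sketch reads the sample rows from an in-memory string instead of data.csv. The inline CSV text, the numeric-column check, and the variable names are assumptions made for this example, not part of the original walkthrough. It produces NumPy arrays with the dtypes that the tensor conversion above expects.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for data.csv, using the rows shown in the output above.
csv_text = """feature1,feature2,target
1.0,3.5,0
2.1,4.2,1
1.3,3.8,0
0.7,3.0,1
1.8,4.0,0
"""

df = pd.read_csv(io.StringIO(csv_text))

# Split features and the label column.
X = df.drop(columns="target")
y = df["target"]

# Fail early if any feature column is non-numeric (e.g. strings),
# since those cannot be converted to a float array directly.
non_numeric = X.select_dtypes(exclude="number").columns.tolist()
if non_numeric:
    raise ValueError(f"Encode these columns first: {non_numeric}")

X_np = X.to_numpy(dtype=np.float32)  # shape (rows, features)
y_np = y.to_numpy(dtype=np.int64)    # class indices; use float32 for regression

print(X_np.shape, y_np.shape)  # (5, 2) (5,)
```

If PyTorch is installed, torch.from_numpy(X_np) and torch.from_numpy(y_np) wrap these arrays as tensors without copying the data.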

Common variations

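For datasets too large to fit in memory, you can use a library such as dask.dataframe, which reads the file lazily and in parallel, or pass the chunksize parameter to pandas.read_csv() to process the data in batches. With chunksize set, read_csv() returns an iterator of DataFrames rather than a single DataFrame.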

python

import pandas as pd

# Load CSV in chunks for large files
chunk_iter = pd.read_csv('data.csv', chunksize=1000)
for chunk in chunk_iter:
    print(chunk.head())  # Process each chunk separately

output

   feature1  feature2  target
0       1.0       3.5       0
1       2.1       4.2       1
2       1.3       3.8       0
3       0.7       3.0       1
4       1.8       4.0       0
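Printing each chunk is rarely the goal; more often you accumulate something across chunks. The sketch below is one way to do that, assuming the same column names as the sample data. The in-memory CSV and the tiny chunksize are hypothetical and chosen only so that the example runs without a file and produces several chunks.

```python
import io
from collections import Counter

import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk.
csv_text = (
    "feature1,feature2,target\n"
    "1.0,3.5,0\n"
    "2.1,4.2,1\n"
    "1.3,3.8,0\n"
    "0.7,3.0,1\n"
    "1.8,4.0,0\n"
)

total_rows = 0
feature1_sum = 0.0
label_counts = Counter()

# chunksize=2 only forces several chunks on this tiny example.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)
    feature1_sum += chunk["feature1"].sum()
    # Add this chunk's labels to the running class counts.
    label_counts.update(chunk["target"].tolist())

# Statistics computed without ever holding the whole file in memory.
feature1_mean = feature1_sum / total_rows

print(total_rows)          # 5
print(dict(label_counts))  # {0: 3, 1: 2}
```

Running totals like these can later be used to normalize features or to check class balance before training.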

Troubleshooting

If you encounter encoding errors (typically a UnicodeDecodeError), pass the encoding explicitly, for example encoding='utf-8' or encoding='latin1'. For missing values, use df.fillna(), df.ffill(), or df.dropna() to handle them before training.

python

import pandas as pd

df = pd.read_csv('data.csv', encoding='utf-8')
df = df.ffill()  # Forward fill missing values (fillna(method='ffill') is deprecated)
print(df.isnull().sum())

output

feature1    0
feature2    0
target      0
dtype: int64
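The two fixes above can be combined into a small routine. The following is a minimal sketch under stated assumptions: the helper read_csv_with_fallback, the sample bytes, and the median-fill policy are all hypothetical choices for illustration, not a pandas API or the only reasonable approach.

```python
import io

import pandas as pd

# Hypothetical file contents: Latin-1 encoded bytes with one missing feature value.
raw = "city,feature1,target\nMálaga,1.0,0\nOslo,,1\nRome,3.0,0\n".encode("latin1")


def read_csv_with_fallback(data: bytes, encodings=("utf-8", "latin1")) -> pd.DataFrame:
    """Try each encoding in order and return the first DataFrame that parses."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(io.BytesIO(data), encoding=enc)
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next encoding
    raise last_error


df = read_csv_with_fallback(raw)

# Count missing values per column before deciding how to handle them.
missing_before = df.isnull().sum()

# One possible policy: fill numeric gaps with the column median.
df["feature1"] = df["feature1"].fillna(df["feature1"].median())

print(missing_before["feature1"])  # 1
print(df["feature1"].tolist())     # [1.0, 2.0, 3.0]
```

Note that latin1 can decode any byte sequence, so it never raises an error; if the file is actually in a different encoding, the text will load but contain wrong characters. Use it as a last resort and inspect the result.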

Key Takeaways

Use pandas.read_csv() to load CSV data efficiently into DataFrames.
Separate features and labels before converting to tensors for PyTorch.
Handle missing data and encoding issues proactively to avoid training errors.

Verified 2026-04 · PyTorch