How to · Beginner · 3 min read

How to load CSV for machine learning in pandas

Quick answer
Use pandas.read_csv() to load CSV files into a DataFrame, then preprocess the data as needed for machine learning tasks. This method is standard for preparing datasets before feeding them into frameworks like PyTorch.

PREREQUISITES

  • Python 3.8+
  • pip install "pandas>=1.5.0" (quoted so the shell does not treat > as a redirect)
  • Basic knowledge of pandas and PyTorch

Setup

Install pandas if it is not already installed; it handles CSV loading and preprocessing. The tensor conversion in the next section also requires PyTorch (the torch package).

bash
pip install pandas torch

Step by step

Load a CSV file into a pandas.DataFrame, inspect the data, and prepare features and labels for machine learning.

python
import pandas as pd
import torch

# Load CSV file into DataFrame
df = pd.read_csv('data.csv')

# Display first 5 rows
print(df.head())

# Separate features and target label
X = df.drop(columns='target')  # Replace 'target' with your label column name
y = df['target']

# Convert to PyTorch tensors if needed (all feature columns must be numeric)
X_tensor = torch.tensor(X.to_numpy(), dtype=torch.float32)
y_tensor = torch.tensor(y.to_numpy(), dtype=torch.long)  # or float32 for regression

print('Features tensor shape:', X_tensor.shape)
print('Labels tensor shape:', y_tensor.shape)
output
   feature1  feature2  target
0       1.0       3.5       0
1       2.1       4.2       1
2       1.3       3.8       0
3       0.7       3.0       1
4       1.8       4.0       0
Features tensor shape: torch.Size([N, M])
Labels tensor shape: torch.Size([N])
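
The tensor conversion above assumes every feature column is numeric; a text column would cause the float32 conversion to fail. As a minimal sketch of one way to handle that, assuming a hypothetical color column that holds text categories, pd.get_dummies can one-hot encode it before the values are cast to float32. The inline CSV and its column names are illustrative only.

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for data.csv; 'color' is a text column
csv_text = """feature1,color,target
1.0,red,0
2.1,blue,1
1.3,red,0
"""
df = pd.read_csv(io.StringIO(csv_text))

X = df.drop(columns="target")
y = df["target"]

# One-hot encode non-numeric columns so every feature is numeric
X_encoded = pd.get_dummies(X, dtype="float32")

# A float32 NumPy array that torch.tensor() can consume directly
X_array = X_encoded.to_numpy(dtype="float32")

print(X_encoded.columns.tolist())
print(X_array.shape)
```

One-hot encoding is only one option; ordinal codes or dropping the column may suit some datasets better.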

Common variations

For datasets too large to fit in memory, libraries like dask.dataframe can load CSV files lazily and in parallel. Alternatively, pass the chunksize parameter to pandas.read_csv() to process the file in batches.

python
import pandas as pd

# Load CSV in chunks for large files
chunk_iter = pd.read_csv('data.csv', chunksize=1000)
for chunk in chunk_iter:
    print(chunk.head())  # Process each chunk separately
output
   feature1  feature2  target
0       1.0       3.5       0
1       2.1       4.2       1
2       1.3       3.8       0
3       0.7       3.0       1
4       1.8       4.0       0
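
Printing each chunk is rarely the end goal. As a rough sketch of what "processing" a chunk can look like, the loop below accumulates a running sum and row count so that a column mean is computed without holding the whole file in memory. The inline CSV, the chunksize of 4, and the variable names are assumptions chosen for illustration.

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for a large data.csv (10 rows)
csv_text = "feature1,target\n" + "\n".join(f"{i}.0,{i % 2}" for i in range(10))

total = 0.0
row_count = 0

# chunksize=4 yields DataFrames of at most 4 rows each
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["feature1"].sum()
    row_count += len(chunk)

mean_feature1 = total / row_count
print(row_count, mean_feature1)
```

The same pattern extends to other statistics that can be combined across batches, such as minimums, maximums, or per-class counts.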

Troubleshooting

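If you encounter encoding errors, pass the file's actual encoding through the encoding parameter, for example encoding='latin1'; pandas assumes UTF-8 by default, so encoding='utf-8' changes nothing. For missing values, use df.ffill(), df.fillna(), or df.dropna() to handle them before training. Note that fillna(method='ffill') is deprecated in recent pandas releases in favor of df.ffill().

python
import pandas as pd

df = pd.read_csv('data.csv', encoding='latin1')  # Use the file's actual encoding
df = df.ffill()  # Forward fill missing values
print(df.isnull().sum())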
output
feature1    0
feature2    0
target      0
dtype: int64
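
When the encoding of a file is not known in advance, one possible approach is to try a short list of candidates in order. The helper below, read_csv_with_fallback, is a hypothetical name and not part of pandas; it is a sketch that assumes UTF-8 should be tried first and Latin-1 second. Because Latin-1 can decode any byte sequence, it never raises, so a successful read does not prove the text was decoded correctly.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "latin1")):
    """Try each encoding in order and return the first DataFrame that decodes."""
    last_error = None
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding)
        except UnicodeDecodeError as err:
            last_error = err
    raise last_error


# Write a small Latin-1 file that is not valid UTF-8 (hypothetical sample data)
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write("name,target\ncafé,1\n".encode("latin1"))
    tmp_path = tmp.name

df = read_csv_with_fallback(tmp_path)
os.remove(tmp_path)
print(df)
```

It is worth spot-checking a few non-ASCII values after a fallback read to confirm they look as expected.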

Key Takeaways

  • Use pandas.read_csv() to load CSV data efficiently into DataFrames.
  • Separate features and labels before converting to tensors for PyTorch.
  • Handle missing data and encoding issues proactively to avoid training errors.
Verified 2026-04 · PyTorch