How to explore a dataset with pandas
Quick answer
Use pandas to load a dataset and inspect it with methods like head(), info(), and describe(). These methods show data types, missing values, and basic statistics, which helps you understand the data before using it in PyTorch pipelines.
Prerequisites
Python 3.8+
pip install "pandas>=1.5.0"
Setup
Install pandas if not already installed and import it in your Python environment.
pip install pandas
Step by step
Load a dataset using pandas.read_csv() and explore it with common methods to understand its structure and contents.
import pandas as pd
# Load dataset
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv'
df = pd.read_csv(url)
# Show first 5 rows
print(df.head())
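# Print summary info including data types and non-null counts
# (info() prints directly and returns None, so it is not wrapped in print())
df.info()
# Get descriptive statistics for numeric columns
print("Descriptive statistics:")
print(df.describe())
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())
output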
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object
Descriptive statistics:
       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
Missing values per column:
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64
Common variations
For large files, you can read the data in chunks by passing chunksize to read_csv() instead of loading everything at once. Other pandas methods, such as value_counts(), help with categorical data analysis.
import pandas as pd
# Load dataset
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv'
df = pd.read_csv(url)
# Count unique values in 'species' column
print(df['species'].value_counts())
output
setosa        50
versicolor    50
virginica     50
Name: species, dtype: int64
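For files too large to load at once, one option is the chunksize parameter of read_csv(), which returns an iterator of smaller DataFrames. The sketch below is only an illustration: the in-memory CSV text and the chunk size of 2 are assumptions chosen to keep the example self-contained, not values from this guide.

```python
import io
from collections import Counter

import pandas as pd

# Small in-memory CSV standing in for a large file (illustrative data).
csv_text = (
    "species,petal_length\n"
    "setosa,1.4\n"
    "setosa,1.3\n"
    "versicolor,4.7\n"
    "versicolor,4.5\n"
    "virginica,6.0\n"
    "virginica,5.1\n"
)

counts = Counter()
total_rows = 0

# With chunksize set, read_csv yields DataFrames of at most that many rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)
    # Accumulate per-chunk category counts into a running total.
    counts.update(chunk["species"].value_counts().to_dict())

print(total_rows)
print(dict(counts))
```

For a real file, the io.StringIO object would be replaced by a path or URL, and a much larger chunk size would usually be appropriate.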
Troubleshooting
- If read_csv() fails, check that the URL or file path is correct.
- If you see unexpected missing values, verify the dataset encoding or delimiters.
- Use df.sample(5) to inspect random rows if head() is not representative.
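The delimiter and sampling tips above can be sketched as follows. The semicolon-separated text is hypothetical data made up for this example. The sep parameter of read_csv() sets the delimiter, and the encoding parameter can be passed in the same way when a file is not UTF-8.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited data with one blank field.
raw = "name;score\nalice;9.5\nbob;\ncarol;7.0\n"

# With the default comma delimiter, each line is read as a single column.
wrong = pd.read_csv(io.StringIO(raw))
print(wrong.shape)

# Passing the actual delimiter recovers both columns.
right = pd.read_csv(io.StringIO(raw), sep=";")
print(right.shape)

# The blank field shows up as a missing value in the 'score' column.
print(right.isnull().sum())

# random_state makes the random sample reproducible between runs.
print(right.sample(2, random_state=0))
```

A column count that is lower than expected is often the first sign that the delimiter is wrong.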
Key Takeaways
- Use head() and info() to quickly understand dataset structure and data types.
- Check for missing values with isnull().sum() before training PyTorch models.
- Use describe() for numeric summary statistics to detect outliers or data issues.
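As a rough illustration of the last point, the quartiles reported by describe() can feed a simple outlier check. The 1.5 x IQR rule used here is a common convention and an assumption of this sketch, not something this guide prescribes. The sample values are made up.

```python
import pandas as pd

# Made-up values with one obviously extreme entry.
df = pd.DataFrame({"value": [10, 11, 12, 13, 14, 15, 100]})

# describe() returns a Series indexed by 'count', 'mean', '25%', '75%', etc.
stats = df["value"].describe()

# Interquartile range and the conventional 1.5 * IQR fences.
iqr = stats["75%"] - stats["25%"]
lower = stats["25%"] - 1.5 * iqr
upper = stats["75%"] + 1.5 * iqr

# Rows falling outside the fences are flagged as potential outliers.
outliers = df[(df["value"] < lower) | (df["value"] > upper)]
print(outliers)
```

Flagged rows are candidates for review and are not necessarily errors. Whether to drop, cap, or keep them depends on the dataset.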