
How to explore a dataset with pandas

Quick answer
Use pandas to load and explore datasets by inspecting data with methods like head(), info(), and describe(). These functions help you understand data types, missing values, and basic statistics before using the data in PyTorch pipelines.

PREREQUISITES

  • Python 3.8+
  • pip install "pandas>=1.5.0"

Setup

Install pandas if not already installed and import it in your Python environment.

bash
pip install pandas

Step by step

Load a dataset using pandas.read_csv() and explore it with common methods to understand its structure and contents.

python
import pandas as pd

# Load dataset
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv'
df = pd.read_csv(url)

# Show first 5 rows
print(df.head())

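# Print summary info, including data types and non-null counts
# (info() prints directly and returns None, so it is not wrapped in print())
df.info()

# Get descriptive statistics for numeric columns
print("\nDescriptive statistics:")
print(df.describe())

# Check for missing values
print("\nMissing values per column:")
print(df.isnull().sum())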
output
   sepal_length  sepal_width  petal_length  petal_width    species
0           5.1          3.5           1.4          0.2     setosa
1           4.9          3.0           1.4          0.2     setosa
2           4.7          3.2           1.3          0.2     setosa
3           4.6          3.1           1.5          0.2     setosa
4           5.0          3.6           1.4          0.2     setosa

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 

Descriptive statistics:
       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000

Missing values per column:
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64
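
The iris data has no gaps, so the missing-value check above returns all zeros. As a minimal sketch of what the same check reports when values are missing, the example below uses a small hypothetical DataFrame (the `toy` frame and its values are made up for illustration and are not part of the iris dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with deliberate gaps (not from the iris dataset)
toy = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 4.7, 4.6],
    "species": ["setosa", "setosa", None, "virginica"],
})

# Count missing values per column
missing_counts = toy.isnull().sum()
print(missing_counts)

# Fraction of rows that are missing in each column
missing_share = toy.isnull().mean()
print(missing_share)

# Select the rows that contain at least one missing value
rows_with_gaps = toy[toy.isnull().any(axis=1)]
print(rows_with_gaps)
```

Looking at the affected rows, not only the counts, can help you decide whether to drop or fill them.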

Common variations

For files too large to fit in memory, read them in pieces by passing the chunksize parameter to read_csv(). For categorical columns, use value_counts() to count how often each value occurs.

python
import pandas as pd

# Load dataset
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv'
df = pd.read_csv(url)

# Count unique values in 'species' column
print(df['species'].value_counts())
output
setosa        50
versicolor    50
virginica     50
Name: species, dtype: int64
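
For large files, one possible approach is to stream the data with the chunksize parameter of read_csv(), which returns an iterator of DataFrames instead of one big frame. The sketch below assumes a small in-memory CSV (the hypothetical `csv_text`) standing in for a large file, and tallies category counts chunk by chunk.

```python
import io
from collections import Counter

import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk
labels = ["setosa", "versicolor", "virginica"] * 4
csv_text = "species,petal_length\n" + "\n".join(
    f"{name},{1.0 + i}" for i, name in enumerate(labels)
)

total_rows = 0
species_counts = Counter()

# With chunksize set, read_csv yields DataFrames of at most 5 rows each
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=5):
    total_rows += len(chunk)
    # Add this chunk's per-category counts to the running tally
    species_counts.update(chunk["species"].value_counts().to_dict())

print(total_rows)
print(dict(species_counts))
```

Only one chunk is held in memory at a time, so the same loop should work with a file path or URL in place of the StringIO object.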

Troubleshooting

  • If read_csv() fails, check that the URL or file path is correct.
  • If you see unexpected missing values or columns merged into one, check the sep and encoding arguments passed to read_csv().
  • Use df.sample(5) to inspect random rows if head() is not representative.
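
The delimiter check in the list above can be sketched as follows. The semicolon-separated text in `raw` is a hypothetical example, not a real dataset; it shows how a wrong delimiter collapses every row into a single column, and how sample() draws random rows once the file is parsed correctly.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited data (made up for illustration)
raw = "sepal_length;species\n5.1;setosa\n4.9;setosa\n6.3;virginica\n"

# With the default sep="," each line is parsed as one single column
wrong = pd.read_csv(io.StringIO(raw))
print(wrong.shape)

# Passing the actual delimiter restores the two expected columns
right = pd.read_csv(io.StringIO(raw), sep=";")
print(right.shape)

# Inspect random rows; random_state makes the draw reproducible
picked = right.sample(2, random_state=0)
print(picked)
```

A column count of 1 when you expected several is usually a quick sign that the delimiter is wrong.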

Key Takeaways

  • Use head() and info() to quickly understand dataset structure and data types.
  • Check for missing values with isnull().sum() before training PyTorch models.
  • Use describe() for numeric summary statistics to detect outliers or data issues.
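
As a rough sketch of how describe() can support outlier detection, the quartiles it reports can feed the common 1.5 × IQR rule of thumb. This rule is one convention among several and is not prescribed by pandas; the values below are hypothetical and not taken from the iris dataset.

```python
import pandas as pd

# Hypothetical values with one obvious outlier (not from the iris dataset)
values = pd.Series([4.9, 5.0, 5.1, 5.2, 5.3, 12.0], name="sepal_length")

# describe() returns a Series indexed by statistic name
stats = values.describe()

# Interquartile range and the conventional 1.5 * IQR fences
iqr = stats["75%"] - stats["25%"]
lower = stats["25%"] - 1.5 * iqr
upper = stats["75%"] + 1.5 * iqr

# Keep only the values that fall outside the fences
outliers = values[(values < lower) | (values > upper)]
print(outliers)
```

Flagged values are candidates for a closer look, not automatically errors.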
Verified 2026-04