Missing data is a common problem in real-world datasets, and handling this problem effectively is a crucial step in any data analysis or machine-learning pipeline. In this guide, we’ll take a deep dive into filling missing values in Pandas using fillna and interpolate, two powerful methods the Pandas library provides to handle missing data gracefully.
Whether you’re cleaning survey data, preparing time series for forecasting, or just ensuring your dataset is model-ready, this guide will show you how to intelligently fill missing values in Pandas and why each method matters.
Why Missing Values Occur
Before we explore how to use fillna() and interpolate(), it’s important to understand why missing values occur in datasets:
- Data corruption during collection or transmission
- Incomplete surveys or forms
- System errors during data logging
- Manual entry errors
- Data filtering or merging with incompatible datasets
Pandas represents missing values using NaN (Not a Number). These NaN values can interfere with computations, visualizations, and machine-learning models. That’s why filling them appropriately is essential.
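To make the point about interference concrete, here is a minimal sketch on a made-up three-element Series (the values are assumptions for illustration). It shows two behaviours worth knowing before filling anything: NaN never compares equal to itself, and pandas reductions skip NaN by default while plain NumPy reductions propagate it.

```python
import numpy as np
import pandas as pd

# Made-up sample values, for illustration only
s = pd.Series([10.0, np.nan, 30.0])

# NaN is not equal to anything, including itself,
# so detect it with isna() rather than ==
print(np.nan == np.nan)
print(s.isna().tolist())

# pandas reductions skip NaN by default (skipna=True)
print(s.sum())
print(s.mean())

# NumPy reductions on the raw array propagate NaN instead
print(np.sum(s.to_numpy()))
```

Because the two libraries disagree by default, an unfilled NaN can silently change a result depending on which function touches the data.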
Understanding Missing Values in Pandas
Before filling in missing values, we need to see what they look like, so let’s create a sample DataFrame that contains some:
import pandas as pd
import numpy as np
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, np.nan, 30, np.nan, 22],
'Salary': [50000, 54000, np.nan, 58000, np.nan]
}
df = pd.DataFrame(data)
print(df)
# Output of the above code
Name Age Salary
0 Alice 25.0 50000.0
1 Bob NaN 54000.0
2 Charlie 30.0 NaN
3 David NaN 58000.0
4 Eve 22.0 NaN
Checking for Missing Values
Pandas provides several functions to check for null or NaN values in a DataFrame:
# Check if DataFrame has any missing values
print(df.isna().any())
# Count total missing values across columns
print(df.isna().sum().sum())
# Count missing values per column
print(df.isna().sum())
# Output
# Check Missing Values
Name False
Age True
Salary True
dtype: bool
# Count Total Missing Values
4
# Count Missing Values per Column
Name 0
Age 2
Salary 2
dtype: int64
Filling Missing Values with fillna()
The fillna() method is one of the most straightforward and widely used tools in Pandas for replacing missing values.
The syntax is:
DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None)
Parameters:
- value: The value used to replace missing entries. This can be a scalar, dictionary, Series, or DataFrame. The default is None.
- method: The fill strategy, such as 'ffill' (forward fill) or 'bfill' (backward fill). The default is None. This parameter is deprecated since pandas 2.1; call ffill() or bfill() directly instead.
- axis: The axis along which to fill. Use 0 (or 'index') to fill down each column and 1 (or 'columns') to fill across each row. The default is None.
- inplace: Whether to modify the DataFrame directly instead of returning a new copy. The default is False.
- limit: The maximum number of consecutive NaN values to fill when forward or backward filling. The default is None, meaning no limit.
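The value and limit parameters described above can be sketched as follows. This is a minimal illustration that reuses the Age and Salary columns from the sample DataFrame; the choice of 0 and the median as replacement values is an assumption, not a recommendation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [25, np.nan, 30, np.nan, 22],
    'Salary': [50000, 54000, np.nan, 58000, np.nan],
})

# A dictionary maps each column name to its own replacement value
filled = df.fillna({'Age': 0, 'Salary': df['Salary'].median()})
print(filled)

# limit caps how many NaN entries are replaced in the Series
limited_age = df['Age'].fillna(0, limit=1)
print(limited_age)
```

Passing a dictionary keeps per-column rules in one call, which is easier to audit than a chain of separate assignments.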
Filling with a Specific Value
df['Age'] = df['Age'].fillna(0)
print(df)
This replaces all NaN values in the Age column with 0. You can use any other default value based on domain knowledge.
# Output
Name Age Salary
0 Alice 25.0 50000.0
1 Bob 0.0 54000.0
2 Charlie 30.0 NaN
3 David 0.0 58000.0
4 Eve 22.0 NaN
Forward Fill (Propagation of Last Valid Observation)
df['Age'] = df['Age'].ffill()
print(df)
This fills each missing value with the last known non-null value. It is ideal for time-series data.
# Output
Name Age Salary
0 Alice 25.0 50000.0
1 Bob 25.0 54000.0
2 Charlie 30.0 NaN
3 David 30.0 58000.0
4 Eve 22.0 NaN
Backward Fill
df['Age'] = df['Age'].bfill()
print(df)
# Output
Name Age Salary
0 Alice 25.0 50000.0
1 Bob 30.0 54000.0
2 Charlie 30.0 NaN
3 David 22.0 58000.0
4 Eve 22.0 NaN
Column-wise Mean, Median, or Mode Imputation
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
print(df)
You can also use .median() or .mode()[0] to replace missing values based on other summary statistics.
# Output
Name Age Salary
0 Alice 25.0 50000.0
1 Bob NaN 54000.0
2 Charlie 30.0 54000.0
3 David NaN 58000.0
4 Eve 22.0 54000.0
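The median and mode variants mentioned above can be sketched like this. Both Series are made up for illustration: the salary values include one deliberately large entry to show why the median is often preferred for skewed data, and the dept column is a hypothetical categorical field.

```python
import numpy as np
import pandas as pd

# Made-up salaries with one large value that pulls the mean upward
salary = pd.Series([50000, 52000, np.nan, 90000, np.nan])

mean_filled = salary.fillna(salary.mean())      # sensitive to the large value
median_filled = salary.fillna(salary.median())  # robust to the large value
print(mean_filled.tolist())
print(median_filled.tolist())

# Hypothetical categorical column: fill with the most frequent value
dept = pd.Series(['HR', 'IT', None, 'IT', None])
mode_filled = dept.fillna(dept.mode()[0])
print(mode_filled.tolist())
```

Note that mode() returns a Series because ties are possible, which is why the first element is selected with [0].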
Interpolating Missing Values with interpolate()
Unlike fillna(), which replaces NaN values with constant or derived values, interpolate() estimates them from the surrounding data points, which is especially useful for numerical or time-series data.
Syntax:
DataFrame.interpolate(method='linear', axis=0, limit=None, inplace=False, limit_direction='forward', limit_area=None)
It shares the axis, limit, and inplace parameters with fillna(). The key additions are the interpolation method, limit_direction, and limit_area.
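As a rough illustration of how limit, limit_direction, and limit_area shape the result, the sketch below runs the default linear method on a made-up Series that has a leading NaN, an interior gap of three, and a trailing NaN. The data is an assumption chosen to expose each case.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, np.nan, 5.0, np.nan])

# Default: fills forward, so the leading NaN stays and the
# trailing NaN is filled with the last valid value
default = s.interpolate()

# limit=1: fill at most one consecutive NaN per gap
limited = s.interpolate(limit=1)

# limit_area='inside': only fill NaNs surrounded by valid values
inside = s.interpolate(limit_area='inside')

# limit_direction='both': also fill the leading NaN
both = s.interpolate(limit_direction='both')

print(pd.DataFrame({'default': default, 'limit=1': limited,
                    'inside': inside, 'both': both}))
```

Using limit_area='inside' is a cautious choice when you do not want values extended past the first or last real observation.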
Linear Interpolation (default method)
df['Age'] = df['Age'].interpolate()
print(df)
This places each missing value on a straight line between its nearest valid neighbors. The default method ignores the index and treats the values as evenly spaced, which suits regular numeric sequences.
# Output
Name Age Salary
0 Alice 25.0 50000.0
1 Bob 27.5 54000.0
2 Charlie 30.0 NaN
3 David 26.0 58000.0
4 Eve 22.0 NaN
Time-based Interpolation
Time-based interpolation is useful when your index is a DatetimeIndex.
df['Date'] = pd.date_range('2023-01-01', periods=5, freq='D')
df.set_index('Date', inplace=True)
df['Salary'] = df['Salary'].interpolate(method='time')
print(df)
This approach considers the time gaps between values and interpolates accordingly.
# Output
Name Age Salary
Date
2023-01-01 Alice 25.0 50000.0
2023-01-02 Bob NaN 54000.0
2023-01-03 Charlie 30.0 56000.0
2023-01-04 David NaN 58000.0
2023-01-05 Eve 22.0 58000.0
Polynomial Interpolation
df['Salary'] = df['Salary'].interpolate(method='polynomial', order=2)
print(df)
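This is suitable for data that follows a nonlinear pattern. The method requires SciPy and an explicit order. Note that the trailing NaN is left unfilled because the polynomial is not extrapolated past the last valid point.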
# Output
Name Age Salary
0 Alice 25.0 50000.000000
1 Bob NaN 54000.000000
2 Charlie 30.0 56666.666667
3 David NaN 58000.000000
4 Eve 22.0 NaN
Spline Interpolation
Spline interpolation is another smooth method, well suited to continuous curves. Like the polynomial method, it requires SciPy and an explicit order.
df['Salary'] = df['Salary'].interpolate(method='spline', order=2)
print(df)
# Output
Name Age Salary
0 Alice 25.0 50000.000000
1 Bob NaN 54000.000000
2 Charlie 30.0 56666.666667
3 David NaN 58000.000000
4 Eve 22.0 58000.000000
Real-life Examples of Handling Missing Values in Pandas
Let’s now put theory into practice and explore some real-life examples of filling missing values in Pandas using fillna() and interpolate(). These examples reflect common scenarios encountered by data analysts and data scientists when working with messy, real-world data.
Cleaning a Messy Dataset with Mixed Missing Values
In many practical cases, missing values aren’t always represented as NaN or None. Sometimes they appear as strings like 'NaN', 'NULL', or even blanks (''). Before you can fill these values, you need to standardize them.
import pandas as pd
import numpy as np
data = {
'A': [1, np.nan, 'NaN', 4],
'B': [5, np.nan, 'NaN', 8],
'C': ['a', 'b', None, 'd']
}
df = pd.DataFrame(data)
print(df)
# Initial Output:
A B C
0 1 5 a
1 NaN NaN b
2 NaN NaN None
3 4 8 d
Here, we have a mix of real NaN values, string-based "NaN" entries, and None values. Let’s clean it:
# Convert "NaN" strings to actual np.nan values
df = df.replace('NaN', np.nan)
# Give the numeric columns a numeric dtype, then forward-fill only those columns
df[['A', 'B']] = df[['A', 'B']].apply(pd.to_numeric).ffill()
# Fill the string column 'C' with a placeholder instead of forward-filling it
df['C'] = df['C'].fillna('Missing')
print(df)
# Cleaned Output
A B C
0 1.0 5.0 a
1 1.0 5.0 b
2 1.0 5.0 Missing
3 4.0 8.0 d
Always standardize your missing values first, then use fillna() to fill based on the context.
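When several different placeholders appear in the same dataset, the standardization step can be sketched as below. The raw DataFrame, its column names, and the placeholders list are all hypothetical; adjust the list to whatever markers your own source actually uses.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data where missing entries use different text markers
raw = pd.DataFrame({
    'score': ['10', 'NULL', '', '7', 'n/a'],
    'city': ['Pune', 'NaN', 'Delhi', '', 'Goa'],
})

# Assumed list of markers that should count as missing
placeholders = ['NaN', 'NULL', 'n/a', '']
clean = raw.replace(placeholders, np.nan)

# Convert the numeric column; errors='coerce' turns any
# leftover non-numeric text into NaN as well
clean['score'] = pd.to_numeric(clean['score'], errors='coerce')

print(clean)
print(clean.isna().sum())
```

If the data comes from a CSV file, passing the same list to the na_values argument of pd.read_csv handles this at load time.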
Filling Time-Series Gaps Using Time-based Interpolation
When working with time-series data, gaps are common due to missing entries or irregular time intervals. A good way to handle this is by using interpolate(method='time').
dates = pd.date_range('2022-01-01', periods=10, freq='W')
values = [1.5, np.nan, 2.1, np.nan, 6.3, np.nan, 4.6, 5.1, np.nan, 8.9]
ser = pd.Series(values, index=dates)
print(ser)
# Initial Output
2022-01-02 1.5
2022-01-09 NaN
2022-01-16 2.1
2022-01-23 NaN
2022-01-30 6.3
2022-02-06 NaN
2022-02-13 4.6
2022-02-20 5.1
2022-02-27 NaN
2022-03-06 8.9
Freq: W-SUN, dtype: float64
Let’s interpolate missing values based on time:
ser = ser.interpolate(method='time')
print(ser)
# Interpolated Output
2022-01-02 1.50
2022-01-09 1.80
2022-01-16 2.10
2022-01-23 4.20
2022-01-30 6.30
2022-02-06 5.45
2022-02-13 4.60
2022-02-20 5.10
2022-02-27 7.00
2022-03-06 8.90
Freq: W-SUN, dtype: float64
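Time-aware interpolation weights each estimate by the actual time elapsed between observations, so it captures temporal trends more accurately than static value filling. On an evenly spaced index like this weekly one, it gives the same result as linear interpolation; the difference shows up when timestamps are irregular.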
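To see where the time method and the linear method diverge, here is a minimal sketch on a made-up Series with irregular dates: the missing point sits one day after the first reading but eight days before the last one.

```python
import numpy as np
import pandas as pd

# Made-up, irregularly spaced timestamps
idx = pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-10'])
s = pd.Series([0.0, np.nan, 9.0], index=idx)

# 'linear' ignores the index and takes the midpoint of the neighbors
linear = s.interpolate(method='linear')

# 'time' weights by elapsed time: 1 day out of 9 from the first reading
timed = s.interpolate(method='time')

print(linear)
print(timed)
```

The linear estimate lands halfway between the neighbors, while the time-based estimate stays close to the first reading because the gap to it is much shorter.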
Filling Missing Weather Data Using Domain Knowledge
Domain-specific logic often improves the quality of imputation. Let’s look at a weather dataset with missing temperature readings alongside a weather event column.
weather = {
'Day': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
'Temperature': [28.5, np.nan, 27.2, 26.4, np.nan, 25.1, 29.7],
'Event': ['Sunny', 'Rain', 'Rain', 'Clouds', 'Rain', 'Sunny', 'Sunny']
}
df = pd.DataFrame(weather)
print(df)
# Initial Output
Day Temperature Event
0 Mon 28.5 Sunny
1 Tue NaN Rain
2 Wed 27.2 Rain
3 Thu 26.4 Clouds
4 Fri NaN Rain
5 Sat 25.1 Sunny
6 Sun 29.7 Sunny
Now apply intelligent filling strategies:
# Fill temperature gaps using nearest-neighbor interpolation (requires SciPy)
df['Temperature'] = df['Temperature'].interpolate(method='nearest')
# Fill any missing event with a likely common event (e.g., 'Clouds' in rainy seasons);
# this sample has no gaps in 'Event', so the line leaves the column unchanged
df['Event'] = df['Event'].fillna('Clouds')
print(df)
# Cleaned Output
Day Temperature Event
0 Mon 28.5 Sunny
1 Tue 28.5 Rain
2 Wed 27.2 Rain
3 Thu 26.4 Clouds
4 Fri 26.4 Rain
5 Sat 25.1 Sunny
6 Sun 29.7 Sunny
Use real-world context when choosing values to fill missing data, especially for categorical variables.
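Another way to bring domain knowledge into the fill, sketched here as an alternative rather than a replacement for the approach above, is to impute each missing temperature from days with the same weather event. The assumption (not stated in the dataset itself) is that days sharing an event tend to have similar temperatures.

```python
import numpy as np
import pandas as pd

weather = pd.DataFrame({
    'Day': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
    'Temperature': [28.5, np.nan, 27.2, 26.4, np.nan, 25.1, 29.7],
    'Event': ['Sunny', 'Rain', 'Rain', 'Clouds', 'Rain', 'Sunny', 'Sunny'],
})

# Mean temperature of each event group, aligned row by row with the frame
event_mean = weather.groupby('Event')['Temperature'].transform('mean')

# Only rows where Temperature is NaN take the group mean
weather['Temperature'] = weather['Temperature'].fillna(event_mean)
print(weather)
```

Here both missing readings fall on rainy days, so they take the mean of the observed rainy-day temperatures. A group with no observed values at all would remain NaN and need a fallback.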
These real-world examples demonstrate the practical power of Pandas’ fillna() and interpolate() functions. Whether you’re dealing with messy survey results, incomplete logs, or patchy time series, these tools help ensure your dataset is clean, consistent, and analysis-ready.
Difference Between fillna() and interpolate()
| Feature | fillna() | interpolate() |
|---|---|---|
| Method of Imputation | Static (constant value or strategy) | Dynamic (based on data pattern) |
| Best For | Categorical data, consistent replacements | Numeric or time-series data |
| Customizability | High (custom values, ffill, bfill, etc.) | Moderate (requires numeric context) |
| Accuracy | Less accurate but more predictable | More accurate if data follows a trend |
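The contrast between the two methods is easiest to see on one series. The sketch below uses a made-up, steadily rising Series and applies a constant fill, a forward fill, and linear interpolation side by side.

```python
import numpy as np
import pandas as pd

# Made-up series with a clear upward trend and a two-value gap
s = pd.Series([10.0, np.nan, np.nan, 40.0])

constant = s.fillna(s.mean())  # same value in every gap
carried = s.ffill()            # repeats the last observation
interp = s.interpolate()       # follows the trend between neighbors

print(pd.DataFrame({'constant': constant, 'ffill': carried,
                    'interpolate': interp}))
```

On trending data like this, only interpolate() produces values that continue the slope; the static strategies are predictable but flatten the gap.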
Conclusion: Filling Missing Values in Pandas using fillna and interpolate
Handling missing data is a fundamental step in data preprocessing, and choosing the right method can significantly influence your analysis or machine-learning results. In this blog, we explored how to fill missing values in Pandas using fillna() and interpolate(), both essential tools for every data analyst and data scientist.
- Use fillna() for simpler, rule-based imputation.
- Use interpolate() for data that has patterns or trends, especially time-series data.
Now that you know the difference and how to apply both, start experimenting with your datasets. Proper imputation = better insights and better models.