
Data Cleaning and Preprocessing in Pandas and Polars: A Comparison

6 May, 2023


Introduction

Data cleaning is the process of finding, fixing, and removing errors, inconsistencies, and inaccuracies in data. Data cleaning and preprocessing is a significant step in data analysis because it ensures that the results are accurate and reliable. Two popular tools for data cleaning and preprocessing are Pandas and Polars.

In this article, we will discuss the importance of data cleaning and several standard techniques for data cleaning and preprocessing in Python using Pandas and Polars.

So, let's dive into the world of Pandas and Polars and see how they stack up against each other!

Why is Data Cleaning Important?

Data cleaning is important because:

  1. Improves data quality: removing errors, inconsistencies, and inaccuracies leaves a cleaner, more trustworthy dataset.
  2. Increases accuracy: eliminating incorrect data prevents it from skewing the results of an analysis.
  3. Saves time and resources: analyses do not have to be repeated because of errors discovered in the data.
  4. Enables data integration: data from different sources can be combined reliably when it is consistent and accurate.

Common Techniques for Data Cleaning and Preprocessing 

  1. Data Exploration: In this step, the data is explored to identify errors, inconsistencies, and inaccuracies, and checked for completeness, accuracy, and consistency. A short sketch of this step follows this list.
  2. Handling Missing Data: Missing data is a frequent problem in real-world datasets. It can occur for various reasons, such as incomplete surveys, data entry errors, or sensor failures. Handling missing data is essential for accurate analysis, and Pandas provides several functions to help with this. Missing values can be handled by removing the rows or columns that contain them, imputing them with a default value such as the mean or median of the column, or estimating them with predictive modeling techniques. The fillna() function can be used to replace missing values with a specified value.
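
As a quick illustration of the exploration step (item 1), here is a minimal sketch using standard Pandas inspection methods; the data and column names are invented for the example:

import pandas as pd
import numpy as np

# a small DataFrame with a missing value and an implausible age
df = pd.DataFrame({'name': ['Joy', 'Gloria', 'David'],
                   'age': [22, np.nan, 190]})

df.info()               # column dtypes and non-null counts
print(df.describe())    # summary statistics reveal the implausible age
print(df.isna().sum())  # missing values per column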

Code example with pandas on how to handle missing values

Let's use Pandas to replace missing values in the age column with the column's mean:

import pandas as pd
import numpy as np

# create a DataFrame with missing values
df = pd.DataFrame({'name': ['Happiess', 'Joy', 'Gloria', 'David'],
                   'age': [22, np.nan, 18, 13],
                   'gender': ['F', 'F', 'F', 'M']})
df

# replace missing values with the mean of the column
df['age'] = df['age'].fillna(df['age'].mean())
df

Output:

       name        age gender
0  Happiess  22.000000      F
1       Joy  17.666667      F
2    Gloria  18.000000      F
3     David  13.000000      M

Another way to handle missing data is to remove the rows or columns that contain missing values, using the dropna() function. For example, the following code removes rows with missing values:

# remove rows with missing values
df = df.dropna()

Code example with polars on how to handle missing values

import polars as pl

# create a DataFrame with missing values
df = pl.DataFrame({'name': ['Happiess', 'Joy', 'Gloria', 'David'],
                   'age': [22, None, 18, 13],
                   'gender': ['F', 'F', 'F', 'M']})
df

# drop rows with missing values
df_dropped = df.drop_nulls()
df_dropped

# fill missing values with a specified value
df_filled = df.fill_null(0)
print(df_filled)

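To mirror the earlier Pandas example, a missing value can also be filled with a column statistic such as the mean, using a Polars expression. A minimal sketch, continuing from the DataFrame above:

# fill missing ages with the mean of the age column
df_mean_filled = df.with_columns(pl.col('age').fill_null(pl.col('age').mean()))
print(df_mean_filled)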

  3. Handling Outliers: Outliers are extreme values that differ significantly from the other values in the dataset. They may occur due to measurement error or other factors and can significantly skew the results of an analysis. Outliers can be removed from the dataset, replaced with a value such as the median, or identified and handled using more advanced statistical techniques. The describe() function can be used to identify potential outliers in a column by computing summary statistics such as the mean, standard deviation, and quartiles.

Code example with pandas on how to handle outliers

For example, the following code computes the summary statistics for the age column:


import pandas as pd
import numpy as np

# create a DataFrame with missing values
df = pd.DataFrame({'name': ['Happiess', 'Joy', 'Gloria', 'David'],
                   'age': [22, np.nan, 18, 13],
                   'gender': ['F', 'F', 'F', 'M']})

# compute summary statistics for the age column
summary = df['age'].describe()
print(summary)

Once potential outliers have been identified, they can be removed from the dataset using boolean indexing. For example, the following code removes rows where the age is more than two standard deviations away from the mean:

# remove rows where age is more than two standard deviations from the mean
df = df[((df['age'] - df['age'].mean()) / df['age'].std()).abs() <= 2]
df

In this small dataset, no age lies more than two standard deviations from the mean, so only the row with a missing age is dropped (a NaN never satisfies the comparison) and the remaining rows are unchanged.
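
The section above also mentions replacing outliers with the median rather than dropping them. Here is a minimal sketch using the common 1.5×IQR rule instead of the two-standard-deviation rule (the data is invented for the example):

import pandas as pd

# a column with one obvious outlier
df = pd.DataFrame({'age': [13, 18, 19, 22, 250]})

# flag values outside 1.5 * IQR of the quartiles
q1, q3 = df['age'].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df['age'] < q1 - 1.5 * iqr) | (df['age'] > q3 + 1.5 * iqr)

# replace flagged outliers with the column median
df.loc[mask, 'age'] = df['age'].median()
df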

Code example with polars on how to handle outliers

import polars as pl

# create a Polars DataFrame
df = pl.DataFrame({'name': ['Happiess', 'Joy', 'Gloria', 'David', 'xose', 'kenny'],
                   'age': [100, 200, 300, 400, 500, 1000],
                   'gender': ['F', 'F', 'F', 'M', 'M', 'M']})
df


# Calculate the mean and standard deviation of column age
mean_age = df['age'].mean()
std_age = df['age'].std()
print(mean_age)
print(std_age)


Create a boolean mask that identifies outliers (i.e., any value that is more than two standard deviations away from the mean).

# Create a boolean mask to identify outliers
mask = (df['age'] < mean_age + 2*std_age) & (df['age'] > mean_age - 2*std_age)
# Filter the dataframe to remove outliers
df_filtered = df.filter(mask)
df_filtered

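As in the Pandas section, outliers can also be replaced with the median instead of being dropped. A minimal sketch using a when/then/otherwise expression, continuing from the variables above:

# replace values more than two standard deviations from the mean with the median
df_replaced = df.with_columns(
    pl.when((pl.col('age') - mean_age).abs() <= 2 * std_age)
      .then(pl.col('age'))
      .otherwise(df['age'].median())
      .alias('age')
)
df_replaced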

  4. Handling Duplicate Data: Duplicate data can be handled by removing the duplicates or by keeping all duplicates and treating them as separate records.

Code example with pandas on how to handle duplicate data

import pandas as pd

# create a sample DataFrame with duplicate values
data = {'name': ['kenny', 'xosa', 'eben', 'kenny'],
        'age': [25, 34, 20, 25],
        'gender': ['F', 'M', 'M', 'F']}
df = pd.DataFrame(data)
df

Output:

    name  age gender
0  kenny   25      F
1   xosa   34      M
2   eben   20      M
3  kenny   25      F

# check for duplicates based on all columns
print(df.duplicated())

Output:

0    False
1    False
2    False
3     True
dtype: bool

# check for duplicates based on a subset of columns
print(df.duplicated(subset=['name', 'age']))

Output:

0    False
1    False
2    False
3     True
dtype: bool

# drop duplicates based on all columns
df.drop_duplicates(inplace=True)
df

Output:

    name  age gender
0  kenny   25      F
1   xosa   34      M
2   eben   20      M

# drop duplicates based on a subset of columns
df.drop_duplicates(subset=['name', 'age'], inplace=True)
df

Output:

    name  age gender
0  kenny   25      F
1   xosa   34      M
2   eben   20      M

# replace specific values in the 'name' and 'age' columns
df.replace({'name': {'kenny': 'taiye'}, 'age': {25: 24}}, inplace=True)
df

Output:

    name  age gender
0  taiye   24      F
1   xosa   34      M
2   eben   20      M

Code example with polars on how to handle duplicate data

import polars as pl
# create a sample DataFrame with duplicate values
data = {'name': ['kenny', 'xosa', 'eben', 'kenny'],
        'age': [25, 34, 20, 25],
        'gender': ['F', 'M', 'M', 'F']}
df = pl.DataFrame(data)
df


Using the unique() method

First, let’s remove the duplicates using the unique() method:

df.unique()


If you don't pass any arguments to the unique() method, duplicates are detected across all columns and only one copy of each duplicated row is kept.

Let's remove duplicates based on specific columns using the subset parameter:

# drop duplicate rows based on 'name' and 'age' columns
df.unique(subset=['name','age'], keep='first')


In the result, observe that the first 'kenny' row is kept while the last one is removed, because those two rows have duplicate values in the 'name' and 'age' columns. The keep='first' argument (the default) keeps the first duplicate row and removes the rest.

If you want to keep the last duplicate row, set keep to 'last':

# drop duplicate rows based on 'name' and 'age' columns
df.unique(subset=['name','age'], keep='last')


Observe that the last duplicate row is now kept instead.
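
Polars also accepts keep='none', which drops every row that has a duplicate rather than keeping one copy:

# drop all rows that are duplicated on 'name' and 'age'
df.unique(subset=['name','age'], keep='none')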

  5. Data Scaling and Normalization: Data scaling and normalization are important data preprocessing techniques involving transforming the data to have a specific range or distribution. This is often necessary for machine learning algorithms that require input data to be on a similar scale. One common technique for scaling data is the min-max scaler, which transforms the data to have a minimum value of 0 and a maximum value of 1. The MinMaxScaler class from the sklearn.preprocessing module can be used to apply this transformation.

Code example with pandas on how to perform data scaling and normalization

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# create a DataFrame of random numbers between 0 and 100
df = pd.DataFrame(np.random.randint(0, 100, size=(5, 3)), columns=['a', 'b', 'c'])
# print the original data
print('Original data:\n', df)


# create a MinMaxScaler object
scaler = MinMaxScaler()
# fit and transform the data, keeping the result as a DataFrame
scaled_data = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
# print the scaled data
print('Scaled data:\n', scaled_data)


Code example with polars on how to perform data scaling and normalization

import polars as pl
import numpy as np

# create a Polars DataFrame of random numbers between 0 and 100
df = pl.DataFrame(np.random.randint(0, 100, size=(5, 3)), schema=['a', 'b', 'c'])
df


# scale each column to the range [0, 1]
df_scaled = df.select([
    (pl.col(c) - pl.col(c).min()) / (pl.col(c).max() - pl.col(c).min())
    for c in df.columns
])
df_scaled


# normalize each column to have mean 0 and variance 1
df_normalized = df.select([
    (pl.col(c) - pl.col(c).mean()) / pl.col(c).std()
    for c in df.columns
])
df_normalized


FAQs

  1. What is the difference between Pandas and Polars?

Pandas and Polars are both data manipulation libraries in Python, but they differ in performance and memory usage. Polars is a newer library designed to be faster and more memory-efficient than Pandas, especially for large datasets.

  2. Can I use the same syntax for data cleaning and preprocessing in both Pandas and Polars?

Yes, many of the methods and functions used for data cleaning and preprocessing are similar in Pandas and Polars. However, there are differences in the syntax and functionality of certain methods, as the sketch below shows.
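
For instance, row filtering is conceptually the same but spelled differently in the two libraries. A small sketch (the data is invented for the example):

import pandas as pd
import polars as pl

data = {'name': ['Joy', 'Gloria', 'David'], 'age': [22, 35, 18]}

# Pandas: boolean indexing on the DataFrame
pdf = pd.DataFrame(data)
print(pdf[pdf['age'] >= 21])

# Polars: expression-based filtering
pldf = pl.DataFrame(data)
print(pldf.filter(pl.col('age') >= 21))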

  3. Which library is better for handling large datasets?

Polars is generally better for handling large datasets, as it is designed to be more memory-efficient and faster than Pandas. However, the performance difference may not be noticeable for smaller datasets.

  4. Are there any limitations to using Polars compared to Pandas?

Polars is a newer library and may therefore have some limitations compared to Pandas, which has been around longer and has a larger user base. However, Polars is under active development, so these limitations may be addressed in future releases.

  5. Is it possible to use both Pandas and Polars in the same project?

Yes, it is possible to use both libraries in the same project. However, it may be more efficient to choose one library for a specific task depending on the size and complexity of the dataset.

  6. Which library is better for visualizing data?

Neither Pandas nor Polars is explicitly designed for data visualization. Both libraries can be used to manipulate and preprocess data, which can then be visualized using other Python libraries such as Matplotlib or Seaborn.

Conclusion

Pandas and Polars are both important tools for data cleaning and preprocessing, each with its own strengths and weaknesses. Pandas has been the industry standard for many years and has a larger user community, while Polars is a newer, faster tool with a more modern syntax. Choosing between the two ultimately depends on the specific needs of your project, and it is worth trying both to see which one works best for you.

Thank you for reading!

