Data wrangling with Python is an essential skill for data scientists and analysts. As the volume of data from different sources and technologies grows, so does the need to clean and organize it. Data wrangling addresses this need by cleaning, transforming, and organizing data into a format that is ready for use.
In this article, we are going to learn about data wrangling with Python along with some examples. We will also learn the differences between data cleaning and data wrangling.
What is Data Wrangling?
Data wrangling is the process of cleaning, transforming, and organizing raw data so that it is suitable for analysis and machine learning algorithms. In Python, it is most commonly done with Pandas, an open-source library built for this kind of data manipulation.
Data used for analysis must be carefully cleaned and processed before analysts can use it. The main objective of data wrangling is to convert raw data into a usable format from which analysts can extract useful insights. It removes unstructured, noisy, and unfiltered content from the dataset, which makes analysis easier and more efficient.
Python Libraries for Data Wrangling
Python is widely used for data wrangling because of its extensive library support, which makes the process easy and effective. NumPy provides arrays for handling numerical data and the operations on it.
Pandas is a Python library for cleaning, filtering, and manipulating data. Scikit-learn is used to build machine learning models for clustering, classification, regression, and similar tasks.
Matplotlib is a data visualization library in Python used to create static and interactive charts, graphs, and other plots for clear communication and understanding. SciPy is used for technical and scientific computing and provides a wide range of functionality for mathematics, science, and engineering.
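As a rough illustration of how NumPy and Pandas divide the work, the sketch below does element-wise math on a NumPy array and then wraps the result in a Pandas DataFrame. The variable names and values are made up for this example.

```python
import numpy as np
import pandas as pd

# NumPy: fast element-wise math on a numerical array (illustrative values)
prices = np.array([1200.0, 500.0, 800.0])
discounted = prices * 0.9  # apply a 10% discount to every element

# Pandas: a labelled, tabular view of the same numbers
df_prices = pd.DataFrame({"Price": prices, "Discounted": discounted})
print(df_prices)
```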
Working of Data Wrangling with Python
Data wrangling with Python covers exploring the data, handling missing values, manipulating it, and filtering it so that it is suitable for analysis. Data exploration is the first step: the raw data is studied and understood, often with the visualization methods available in Python.
Most datasets contain duplicate and missing values that must be dealt with. A missing value is commonly replaced with the mean, median, or mode of its column. The data is then reshaped and modified to match the structure the analysis requires.
Any unwanted rows or columns are removed or filtered out at this stage. Finally, Matplotlib can be used to visualize the result and confirm that the data is ready for analysis, machine learning, and other uses.
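The steps described above can be sketched end to end on a tiny table. This is a minimal sketch, not a prescribed pipeline: the `raw` DataFrame, its column names, and the choice of the median for imputation are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with a duplicate row, a missing value, and an unwanted column
raw = pd.DataFrame({
    "Customer": ["Alice", "Bob", "Bob", "Charlie"],
    "Amount": [100.0, 250.0, 250.0, np.nan],
    "Notes": ["", "", "", ""],
})

# 1. Explore: column types and missing-value counts
print(raw.dtypes)
print(raw.isnull().sum())

# 2. Remove repeated rows
clean = raw.drop_duplicates()

# 3. Impute the missing Amount with the column median
clean = clean.assign(Amount=clean["Amount"].fillna(clean["Amount"].median()))

# 4. Drop a column that is not needed for analysis
clean = clean.drop(columns=["Notes"])

print(clean)
```

Each step returns a new DataFrame, so the original `raw` table stays untouched and can be compared against the cleaned result.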
Data Wrangling Examples: Data Exploration
Let us first create a small sample dataset as a Pandas DataFrame.
import pandas as pd

# Create sample data
data = {
    'OrderID': [101, 102, 103, 104, 105],
    'Customer': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Product': ['Laptop', 'Tablet', 'Smartphone', 'Laptop', 'Tablet'],
    'Quantity': [1, 2, 1, 3, 1],
    'Price': [1200, 500, 800, 1200, 500],
    'OrderDate': pd.to_datetime(['2025-01-01', '2025-01-02', '2025-01-03', '2025-01-04', '2025-01-05']),
}
df = pd.DataFrame(data)

# Display the DataFrame
print(df)
In this simple example, print(df) displays the whole table, with one row per order and one column per field. For descriptive statistics of the numeric columns, you can call df.describe(). You can also check directly for null or missing values with the command below.
# Check for null/missing values
print(df.isnull().sum())
You can also perform many other operations on the data, such as sorting, visualizing, and manipulating rows and columns.
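A few of these operations are sketched below on a table shaped like the sample orders above, re-created so the snippet runs on its own. The `Total` column and the variable names `orders`, `by_value`, and `laptops` are introduced here purely for illustration.

```python
import pandas as pd

# Re-create a table shaped like the sample orders so the snippet is self-contained
orders = pd.DataFrame({
    "OrderID": [101, 102, 103, 104, 105],
    "Product": ["Laptop", "Tablet", "Smartphone", "Laptop", "Tablet"],
    "Quantity": [1, 2, 1, 3, 1],
    "Price": [1200, 500, 800, 1200, 500],
})

# Manipulate columns: add a derived column with the total value of each order
orders["Total"] = orders["Quantity"] * orders["Price"]

# Sort by order value, largest first
by_value = orders.sort_values("Total", ascending=False)

# Filter rows: keep only the orders for one product
laptops = orders[orders["Product"] == "Laptop"]

# Summary statistics for the numeric columns
print(orders.describe())
print(by_value.head())
print(laptops)
```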
Data Wrangling Examples: How to Handle Missing Values?
Let us get an overview of how to handle missing values in rows or columns of the sample data used. Check the sample data given below.
import pandas as pd

# Sample data with missing values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', None],
    'Age': [25, 30, None, 40, 35],
    'Salary': [50000, None, 60000, 70000, 80000]
}
df = pd.DataFrame(data)

# Check for missing values
print(df.isnull())  # Returns a DataFrame of True/False for each cell
print(df.isnull().sum())  # Total missing values per column
The isnull() function checks every cell of the dataset for missing values.
A True value marks a missing entry. The sample dataset has three missing values, one in each column. For a clearer picture, you can plot the dataset with the missingno library, where blank gaps in each column mark the missing values.
Once the missing values are located, they have to be handled. The simplest option is to drop the rows or columns that contain them.
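# Drop every row that contains at least one missing value
df_dropped_rows = df.dropna()
print(df_dropped_rows)

# Drop every column that contains at least one missing value
df_dropped_columns = df.dropna(axis=1)
print(df_dropped_columns)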
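Dropping every row or column with a gap can discard a lot of data. As a hedged sketch of a more selective approach, the `subset` and `thresh` parameters of `dropna()` narrow down what gets removed. The DataFrame is re-created here under the assumed name `df_demo` so the snippet is self-contained.

```python
import pandas as pd

# Same sample data as above, re-created so the snippet runs on its own
df_demo = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", None],
    "Age": [25, 30, None, 40, 35],
    "Salary": [50000, None, 60000, 70000, 80000],
})

# Drop a row only when 'Name' is missing; gaps in other columns are kept
named_only = df_demo.dropna(subset=["Name"])

# Keep rows with at least 3 non-missing values
# (with 3 columns, that means fully populated rows in this dataset)
complete_rows = df_demo.dropna(thresh=3)

print(named_only)
print(complete_rows)
```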
Instead of dropping rows or columns, you can substitute the missing values with a constant. The call below replaces every NaN in the dataset with zero.
df_filled_constant = df.fillna(0)  # Replace NaN with 0
print(df_filled_constant)
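A single constant is not always appropriate for every column: a zero in a text column such as Name is rarely meaningful. One possible alternative, sketched here, is to pass a dictionary to `fillna()` so each column gets its own replacement. The placeholder "Unknown" and the name `df_demo` are assumptions chosen for this example.

```python
import pandas as pd

# Same sample data as above, re-created so the snippet runs on its own
df_demo = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", None],
    "Age": [25, 30, None, 40, 35],
    "Salary": [50000, None, 60000, 70000, 80000],
})

# Choose a different replacement per column instead of a blanket 0
filled = df_demo.fillna({"Name": "Unknown", "Age": 0, "Salary": 0})
print(filled)
```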
Alternatively, fill each column with its mean or median. To keep track of which values were imputed, flag the missing entries before filling them.
df['Missing_Age'] = df['Age'].isnull()  # Flag missing ages before they are filled
df['Age'] = df['Age'].fillna(df['Age'].mean())  # Replace NaN with mean
df['Salary'] = df['Salary'].fillna(df['Salary'].median())  # Replace NaN with median
print(df)
The Missing_Age column records which ages were originally NaN, so that information is not lost once the values are replaced with the mean, median, or mode.
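The mean and median only apply to numeric columns. For a categorical column, the mode (the most frequent value) is a common substitute. The sketch below uses a hypothetical `people` table with a `City` column that is not part of the sample data above.

```python
import pandas as pd

# Hypothetical data: 'City' is categorical, so the mean and median do not apply
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "City": ["Delhi", None, "Delhi", "Mumbai"],
})

# Record which rows were missing before filling them
people["Missing_City"] = people["City"].isnull()

# mode() returns a Series because there can be ties; take the first value
most_common = people["City"].mode()[0]
people["City"] = people["City"].fillna(most_common)

print(people)
```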
Data Cleaning Vs Data Wrangling with Python
Let us look at the difference between data cleaning and data wrangling in Python.
Data Cleaning | Data Wrangling |
The process of identifying and correcting inaccuracies, inconsistencies, and errors in data is data cleaning. | The process of transforming and mapping data from its raw form into a usable format for analysis is data wrangling with Python. |
It ensures data quality and accuracy. | It is used to prepare data for specific analyses or machine learning models. |
It helps in correcting errors like missing values, duplicates, or incorrect formatting. | It is used in reshaping, merging, and transforming data to align with the desired structure or requirements. |
It is used in removing duplicates and filling missing values. It also corrects inconsistent data types. | It is used in reshaping data, combining datasets, normalizing or aggregating data, and feature engineering. |
Pandas is used to handle missing or duplicate data, and NumPy handles numerical fixes. | Pandas is used for data transformation and overview. NumPy is used for numerical operations. |
Data cleaning lays emphasis on clean and error-free data. | It makes the data analysis-ready: structured data tailored to specific tasks. |
It replaces NaN values with averages and removes invalid entries in columns. | It is used in converting wide data to long format and creating new columns based on existing data. |
Cleaning is often a subset of wrangling as wrangling may involve cleaning as a first step. | Data Wrangling with Python builds upon cleaning to format and structure data for analysis. |
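To make the distinction in the table concrete, here is a small sketch in which cleaning comes first and wrangling builds on it. The `sales` table, its column names, and the derived `ShareOfRegion` column are all hypothetical.

```python
import pandas as pd

# Hypothetical wide-format sales table with a duplicate row and numbers stored as text
sales = pd.DataFrame({
    "Region": ["North", "South", "South"],
    "Q1": ["100", "80", "80"],
    "Q2": ["120", "90", "90"],
})

# Cleaning: remove the duplicate row and correct the data types
cleaned = sales.drop_duplicates().assign(
    Q1=lambda d: pd.to_numeric(d["Q1"]),
    Q2=lambda d: pd.to_numeric(d["Q2"]),
)

# Wrangling: reshape wide -> long, then derive a new column from existing data
long_form = cleaned.melt(id_vars="Region", var_name="Quarter", value_name="Sales")
region_totals = long_form.groupby("Region")["Sales"].transform("sum")
long_form["ShareOfRegion"] = long_form["Sales"] / region_totals

print(cleaned)
print(long_form)
```

The cleaning step leaves the table's shape alone and only fixes its contents, while the wrangling step changes the structure to suit a particular analysis.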
Learn Data Wrangling with Python PW Skills
Become proficient in Python libraries and learn to prepare, process, clean, and visualize data for analysis purposes. Learn about Python programming and data structures and algorithms. Get in-depth tutorials and practice exercises to strengthen your concepts in Python programming with PW Skills Decode DSA With Python Course.
Learn from real-world projects and experienced mentors in this self-paced course. Get a certificate after completion of the course on pwskills.com
Data Wrangling with Python FAQs
Q1. What is data wrangling, and why is it important in Python?
Ans: Data wrangling is the process of cleaning, transforming, and structuring raw data into a usable format for analysis or machine learning. It is crucial because raw data is often messy, inconsistent, or incomplete, and data wrangling helps ensure the data is accurate, consistent, and ready for analysis.
Q2. Which Python libraries are commonly used for data wrangling?
Ans: Python offers several libraries for data wrangling:
pandas: For data manipulation and analysis, including filtering, grouping, and reshaping data.
numpy: For numerical operations and handling multi-dimensional arrays.
pyjanitor: Extends pandas with additional data cleaning and wrangling utilities.
re: For text manipulation and handling regular expressions.
Q3. How is data wrangling different from data cleaning in Python?
Ans: Data cleaning focuses on correcting errors, handling missing values, and improving data quality. Data wrangling includes cleaning but also involves reshaping, transforming, and integrating data to make it suitable for specific analytical tasks or workflows.
Q4. What are some common challenges in data wrangling?
Ans:
Handling missing or null values: Deciding whether to impute or remove them.
Inconsistent data types: Converting columns to uniform formats.
Outliers: Identifying and managing extreme values.
Data merging and reshaping: Combining datasets with varying structures or formats.
Python libraries like pandas provide functions like merge(), pivot(), and fillna() to address these challenges.
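The challenges listed above can be sketched on two small tables. This is one possible approach under stated assumptions: the `orders` and `customers` tables are hypothetical, the median is an arbitrary imputation choice, and the 1.5 * IQR rule is a common rule of thumb for outliers rather than the only option.

```python
import pandas as pd

# Hypothetical tables; all names and values are for illustration only
orders = pd.DataFrame({
    "CustomerID": [1, 2, 2, 3],
    "Month": ["Jan", "Jan", "Feb", "Feb"],
    "Amount": ["100", "250", "n/a", "90000"],  # stored as text, one bad entry
})
customers = pd.DataFrame({"CustomerID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})

# Inconsistent data types: coerce to numbers; unparseable entries become NaN
orders["Amount"] = pd.to_numeric(orders["Amount"], errors="coerce")

# Missing values: impute with the median of the column
orders["Amount"] = orders["Amount"].fillna(orders["Amount"].median())

# Outliers: flag values more than 1.5 * IQR beyond the quartiles
q1, q3 = orders["Amount"].quantile([0.25, 0.75])
iqr = q3 - q1
orders["IsOutlier"] = ~orders["Amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Merging and reshaping: attach names, then one row per customer and one column per month
merged = orders.merge(customers, on="CustomerID", how="left")
wide = merged.pivot(index="Name", columns="Month", values="Amount")

print(merged)
print(wide)
```

Note that `pivot()` requires each index and column pair to be unique; when a customer can have several orders in the same month, `pivot_table()` with an aggregation function is the usual alternative.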