
Data wrangling with Python is an essential skill for data scientists and analysts. As the volume of data from different sources and technologies grows, so does the need to clean and organize it. Data wrangling meets this need by cleaning, transforming, and organizing raw data into a format suitable for use.
In this article, we are going to learn about data wrangling with Python along with some examples. We will also learn the differences between data cleaning and data wrangling.
Data used for analysis must be carefully cleaned and processed before analysts can use it. The main objective of data wrangling is to convert raw data into a usable format from which analysts can extract useful insights. It removes unstructured, noisy, and unfiltered content from the dataset, which makes analysis easier and more efficient.
Most datasets contain repeated values and missing values, both of which must be handled. Data wrangling with Python covers this work too: for example, a missing value can be replaced with the mean, median, or mode of its column. Existing data is also reshaped and modified to make it suitable for new inputs and manipulation.
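As a minimal sketch of handling repeated and missing values, the snippet below removes a duplicate row and then fills the remaining gaps. The `orders` table and its column names are hypothetical, invented for illustration; the median is used for the numeric column and the mode for the text column.

```python
import pandas as pd

# Hypothetical sample data: one exact duplicate row (Bob) and two missing values.
orders = pd.DataFrame({
    "Customer": ["Alice", "Bob", "Bob", "Charlie", "David"],
    "City": ["Paris", "Rome", "Rome", None, "Rome"],
    "Amount": [120.0, 80.0, 80.0, None, 100.0],
})

# Remove repeated rows, keeping the first occurrence of each.
deduped = orders.drop_duplicates().reset_index(drop=True)

# Fill the numeric gap with the column median and the text gap with the mode.
cleaned = deduped.copy()
cleaned["Amount"] = cleaned["Amount"].fillna(cleaned["Amount"].median())
cleaned["City"] = cleaned["City"].fillna(cleaned["City"].mode().iloc[0])

print(cleaned)
```

Whether the mean, median, or mode is the right replacement depends on the column: the median is less sensitive to outliers than the mean, and the mode is the usual choice for categorical values.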
Any unwanted rows or columns in the dataset are removed or filtered out at this stage. We can then use matplotlib to represent the data visually and confirm that it is ready for analysis, machine learning training, data visualization, and many other uses.
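Removing unwanted columns and filtering rows could look like the sketch below. The `sales` table is hypothetical, and `InternalNote` simply stands in for a column the analysis does not need.

```python
import pandas as pd

# Hypothetical sample data; "InternalNote" is a column we assume is not needed.
sales = pd.DataFrame({
    "OrderID": [1, 2, 3, 4],
    "Product": ["Laptop", "Tablet", "Laptop", "Phone"],
    "Quantity": [1, 0, 2, 3],
    "InternalNote": ["a", "b", "c", "d"],
})

# Drop the column that is not needed for analysis.
trimmed = sales.drop(columns=["InternalNote"])

# Keep only rows with a positive quantity, using a boolean mask.
filtered = trimmed[trimmed["Quantity"] > 0].reset_index(drop=True)

print(filtered)
```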
```python
import pandas as pd

# Create sample data
data = {
    'OrderID': [101, 102, 103, 104, 105],
    'Customer': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Product': ['Laptop', 'Tablet', 'Smartphone', 'Laptop', 'Tablet'],
    'Quantity': [1, 2, 1, 3, 1],
    'Price': [1200, 500, 800, 1200, 500],
    'OrderDate': pd.to_datetime(['2025-01-01', '2025-01-02', '2025-01-03',
                                 '2025-01-04', '2025-01-05']),
}
df = pd.DataFrame(data)

# Display the DataFrame
print(df)
```
```python
# Check for null/missing values
print(df.isnull().sum())
```
```python
import pandas as pd

# Sample data with missing values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', None],
    'Age': [25, 30, None, 40, 35],
    'Salary': [50000, None, 60000, 70000, 80000]
}
df = pd.DataFrame(data)

# Check for missing values
print(df.isnull())        # Returns a DataFrame of True/False for each cell
print(df.isnull().sum())  # Total missing values per column
```
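```python
# Drop every row that contains at least one missing value
df_dropped_rows = df.dropna()
print(df_dropped_rows)

# Drop every column that contains at least one missing value
df_dropped_columns = df.dropna(axis=1)
print(df_dropped_columns)
```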
```python
df_filled_constant = df.fillna(0)  # Replace NaN with 0
print(df_filled_constant)
```
```python
# Flag rows where Age is missing before filling, otherwise the flag is always False
df['Missing_Age'] = df['Age'].isnull()

df['Age'] = df['Age'].fillna(df['Age'].mean())  # Replace NaN with mean
df['Salary'] = df['Salary'].fillna(df['Salary'].median())  # Replace NaN with median
print(df)
```
Let us look at the difference between data cleaning and data wrangling in Python.
| Data Cleaning | Data Wrangling |
| --- | --- |
| Data cleaning is the process of identifying and correcting inaccuracies, inconsistencies, and errors in data. | Data wrangling is the process of transforming and mapping data from its raw form into a format usable for analysis. |
| It ensures data quality and accuracy. | It prepares data for specific analyses or machine learning models. |
| It corrects errors such as missing values, duplicates, or incorrect formatting. | It reshapes, merges, and transforms data to align with the desired structure or requirements. |
| Typical tasks are removing duplicates, filling missing values, and correcting inconsistent data types. | Typical tasks are reshaping data, combining datasets, normalizing or aggregating data, and feature engineering. |
| Pandas handles missing or duplicate data, and NumPy handles numerical fixes. | Pandas is used for data transformation and overview, and NumPy is used for numerical operations. |
| Data cleaning emphasizes clean, error-free data. | Data wrangling produces analysis-ready, structured data tailored to specific tasks. |
| For example, it replaces NaN values with averages and removes invalid entries in columns. | For example, it converts wide data to long format and creates new columns based on existing data. |
| Cleaning is often a subset of wrangling, since wrangling may involve cleaning as a first step. | Data wrangling with Python builds on cleaning to format and structure data for analysis. |
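
The wrangling operations named in the table (reshaping wide data to long format, combining datasets, creating new columns, and aggregating) can be sketched as follows. The `wide` and `prices` tables and their column names are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical wide table: one column of units sold per month.
wide = pd.DataFrame({
    "Product": ["Laptop", "Tablet"],
    "Jan": [10, 20],
    "Feb": [15, 25],
})

# Hypothetical price lookup table.
prices = pd.DataFrame({"Product": ["Laptop", "Tablet"], "Price": [1200, 500]})

# Reshape: convert the wide table to long format (one row per product and month).
long_df = wide.melt(id_vars="Product", var_name="Month", value_name="Units")

# Combine: attach the price of each product with a left join.
merged = long_df.merge(prices, on="Product", how="left")

# Create a new column based on existing columns.
merged["Revenue"] = merged["Units"] * merged["Price"]

# Aggregate: total revenue per product.
revenue_by_product = merged.groupby("Product")["Revenue"].sum()

print(merged)
print(revenue_by_product)
```

Long format is usually easier to group and plot, which is why the reshape comes before the merge and aggregation here; the order is a design choice for this sketch, not a requirement.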