top of page

US Household Income Project

  • nalwogaimmaculate3
  • Sep 13, 2025
  • 3 min read

Updated: Feb 22

Project Type: Personal Project

Tools Used: MySQL, Stored Procedures, Joins, Aggregations, Data Cleaning Techniques.


Link to Code:


Overview:

In this project we walk through the process of a full data workflow using SQL —

from cleaning the raw data to uncovering insights and building an automated cleaning system. The dataset contains U.S. household income information across states, cities, and counties. I approach this project in three phases.

  1. Data Cleaning

  2. Exploratory Data Analysis (EDA)

  3. Automated Data Cleaning Workflow


Phase 1: Data Cleaning

Objective: Prepare the raw data for analysis by resolving inconsistencies, duplicates, and missing values.


First, let's take a look at the data:



Table Inspection and Column Renaming

SELECT * 
FROM us_household_income;

SELECT * 
FROM us_household_income_statistics;



ALTER TABLE us_household_income_statistics 
RENAME COLUMN `id` TO `id`;

Row Count and Duplicate Detection

SELECT COUNT(id) 
FROM us_household_income;

SELECT COUNT(id) 
FROM us_household_income_statistics;

SELECT id, COUNT(id)
FROM us_household_income
GROUP BY id
HAVING COUNT(id) > 1;

SELECT * FROM (
	SELECT row_id, id,
	ROW_NUMBER() OVER(PARTITION BY id ORDER BY id) AS row_num
	FROM us_household_income) row_tab
	WHERE row_num > 1;

Duplicate Removal

DELETE FROM us_household_income
WHERE row_id IN (
  SELECT row_id FROM (
    SELECT row_id, id,
    ROW_NUMBER() OVER(PARTITION BY id ORDER BY id) AS row_num
    FROM us_household_income
  ) row_tab
  WHERE row_num > 1
);

Standardizing Text Fields

SELECT DISTINCT State_Name
 FROM us_household_income;

UPDATE us_household_income 
 SET State_Name = 'Alabama'
 WHERE State_Name = 'alabama';

UPDATE us_household_income 
SET State_Name = 'Georgia'
 WHERE State_Name = 'georia';

UPDATE us_household_income 
 SET State_Name = UPPER(State_Name);

UPDATE us_household_income 
 SET County = UPPER(County);
	
UPDATE us_household_income 
 SET State_ab = UPPER(State_ab);

UPDATE us_household_income 
 SET City = UPPER(City);

UPDATE us_household_income 
 SET Place = UPPER(Place);

UPDATE us_household_income 
 SET Type = 'Borough' 
 WHERE Type = 'Boroughs';

UPDATE us_household_income 
 SET Type = 'CDP'
 WHERE Type = 'CPD';

Handling Missing Values


SELECT * FROM us_household_income WHERE Place = '';

UPDATE us_household_income
SET Place = 'Autaugaville'
WHERE County = 'Autauga County' AND City = 'Vinemont';

Land and Water Validation


SELECT ALand, AWater
 FROM us_household_income
WHERE ALand = 0 OR ALand = '' OR ALand IS NULL;

SELECT ALand, AWater 
FROM us_household_income
WHERE AWater = 0 OR AWater = '' OR AWater IS NULL;

SELECT ALand, AWater 
FROM us_household_income
WHERE (ALand = 0 OR ALand = '' OR ALand IS NULL)
AND (AWater = 0 OR AWater = '' OR AWater IS NULL);


Phase 2: Exploratory Data Analysis

Objective: Analyze geographic and income patterns across U.S. states and cities.


Land and Water Distribution by State


SELECT State_Name, SUM(ALand), SUM(AWater)
FROM us_household_income
GROUP BY State_Name;

-- Top 10 states by land

SELECT State_Name, SUM(ALand)
FROM us_household_income
GROUP BY State_Name
ORDER BY 2 DESC
LIMIT 10

-- Top 10 states by water

SELECT State_Name, SUM(AWater)
FROM us_household_income
GROUP BY State_Name
ORDER BY 2 DESC
LIMIT 10;

Top 10 states by land

Top 10 states by water


Income Analysis by Type

SELECT Type, COUNT(Type),
ROUND(AVG(Mean),2) AS avg_mean,
ROUND(AVG(Median),2) AS avg_median
FROM us_household_income u
INNER JOIN us_household_income_statistics us
 ON u.id = us.id
WHERE mean <> 0
GROUP BY Type
ORDER BY avg_mean DESC;

Income Analysis by State

SELECT u.State_Name,
ROUND(AVG(Mean),2) AS avg_mean,
ROUND(AVG(Median),2) AS avg_median
FROM us_household_income u
INNER JOIN us_household_income_statistics us ON u.id = us.id
WHERE mean <> 0
GROUP BY u.State_Name
ORDER BY avg_mean DESC
LIMIT 10;

Top Cities by Income

SELECT u.State_Name, City,
ROUND(AVG(Mean),1) AS avg_mean,
ROUND(AVG(Median),1) AS avg_median
FROM us_household_income u
INNER JOIN us_household_income_statistics us ON u.id = us.id
WHERE mean <> 0
GROUP BY u.State_Name, City
ORDER BY avg_mean DESC
LIMIT 20;

Phase 3: Automated Data Cleaning

Objective: Build a reusable stored procedure to clean data and schedule it for regular execution.


Stored Procedure: Copy_and_Clean_Data

DELIMITER $$
DROP PROCEDURE IF EXISTS Copy_and_Clean_Data;
   CREATE PROCEDURE Copy_and_Clean_Data()
    BEGIN
  CREATE TABLE IF NOT EXISTS us_household_income_clean (...);
  INSERT INTO us_household_income_clean
  SELECT *, CURRENT_TIMESTAMP()
  FROM us_household_income_backup;

  DELETE FROM us_household_income_clean
  WHERE row_id IN (
  SELECT row_id FROM (
  SELECT row_id, id,
  ROW_NUMBER() OVER(PARTITION BY id, TimeStamp ORDER BY id, TimeStamp)      		   AS row_num FROM us_household_income_clean) row_tab
 WHERE row_num > 1
  );

 UPDATE us_household_income_clean 
 SET State_Name = 'Georgia'
 WHERE   State_Name = 'Georia';

UPDATE us_household_income_clean 
SET Type = 'CDP' 
WHERE Type = 'CPD';

 UPDATE us_household_income_clean
 SET Type = 'Borough'
 WHERE Type = 'Boroughs';

UPDATE us_household_income_clean 
SET County = UPPER(County);

UPDATE us_household_income_clean 
SET City = UPPER(City);

 UPDATE us_household_income_clean
 SET Place = UPPER(Place);

 UPDATE us_household_income_clean
 SET State_Name = UPPER(State_Name);

END $$

DELIMITER ;

Scheduling the Procedure

CREATE EVENT run_data_cleaning
ON SCHEDULE EVERY 30 DAY
DO CALL Copy_and_Clean_Data;

Final Reflection

"This project helped me strengthen my SQL skills across data cleaning, analysis, and automation. I learned how to structure queries for clarity, build reusable procedures, and uncover meaningful insights from raw data. It also taught me the importance of designing scalable workflows — something I'm excited to apply in future projects."



🔗 Additional Resources

  • 📂 Full SQL Code on GitHub:

  • 🖥️ Project Summary & Visuals: You’re already here! This post walks through the project phases with explanations and screenshots.


“This post summarizes the project. For full SQL scripts and automation logic, visit the GitHub repo.”







 
 
 

Comments


bottom of page