
In today’s data-driven world, ensuring the accuracy and integrity of your data is paramount. This is especially true for sensitive information like Permanent Account Numbers (PAN), which are crucial for financial and legal operations in India. This guide walks you through a practical SQL project on how to clean and validate a large dataset of PAN numbers, ensuring high data quality for your business.
The Project Breakdown
Our project focuses on a dataset of 10,000 PAN numbers and is broken down into three key phases:
- Data Cleaning and Pre-processing
- Validation against Specific Business Rules
- Categorization and Reporting
Step 1: Data Cleaning and Pre-processing
The first step is to transform raw data into a usable format. This involves a series of essential cleaning operations:
- Handling Missing Data: Remove records with missing or null values.
- Duplicate Removal: Eliminate any duplicate entries to ensure each record is unique.
- Whitespace Trimming: Trim leading and trailing spaces to standardize the format.
- Case Standardization: Convert all characters to uppercase for consistent validation.
Step 2: Implementing Validation Rules
Once the data is clean, the next critical phase is to validate each PAN number against a set of predefined rules. A PAN number must adhere to the format AAAAA9999A
and follow these specific checks:
- Length Check: The PAN number must be exactly 10 characters long.
- Format Check: The structure must be five alphabets, followed by four digits, and ending with one alphabet.
- Adjacent Character Rule: The first five alphabets cannot have any adjacent repeating characters (e.g., “AA” is invalid).
- Sequential Character Rule: The first five alphabets cannot be a sequential series (e.g., “ABCDE” is invalid).
- Adjacent Digit Rule: The four digits cannot have any adjacent repeating digits (e.g., “11” is invalid).
- Sequential Digit Rule: The four digits cannot be a sequential series (e.g., “1234” is invalid).
Step 3: Categorization and Reporting
The final step is to categorize the PAN numbers as “valid” or “invalid” based on the results of the validation process. Two types of reports can be generated to provide a clear overview:
- Detailed Report: A line-by-line report listing each PAN number and its validation status.
- Summary Report: A high-level report that shows the total number of records, and the count of valid, invalid, and missing PANs.
Implementation
Ensuring data integrity is critical for any organization. This blog post provides a step-by-step guide to a real-world SQL project focused on cleaning and validating a dataset of PAN (Permanent Account Number) numbers. You will learn how to use PostgreSQL to handle everything from data loading to generating comprehensive reports.
Phase 1: Data Loading & Pre-processing
The first step is to load your raw data into a database table and perform initial cleaning. This makes your data consistent and ready for validation.
Step 1: Create the Table
We begin by creating a table to store the raw PAN data.
SQL
CREATE TABLE STG_pan_numbers_dataset (
pan_number text
);
Step 2: Clean the Data with a CTE
A Common Table Expression (CTE) is used to apply all cleaning rules in a single, efficient query. This includes removing nulls, trimming whitespace, converting to uppercase, and eliminating duplicates.
SQL
WITH CTE_cleaned_pan AS (
SELECT DISTINCT
UPPER(TRIM(pan_number)) AS pan_number
FROM STG_pan_numbers_dataset
WHERE
pan_number IS NOT NULL
AND TRIM(pan_number) != ''
)
SELECT * FROM CTE_cleaned_pan;
Phase 2: Implementing Validation Logic
PAN numbers follow a strict format (AAAAA9999A
). In addition to this format, a valid PAN cannot have adjacent repeating characters or digits, or sequential series. We create user-defined functions to enforce these custom rules.
Step 3: Create User-Defined Functions
These functions will be used to check for adjacent and sequential characters/digits.
Function to check for adjacent same characters/digits:
SQL
CREATE OR REPLACE FUNCTION fn_check_adjacent_characters(p_string text)
RETURNS boolean
LANGUAGE plpgsql
AS $$
DECLARE
i integer;
BEGIN
FOR i IN 1..(LENGTH(p_string) - 1) LOOP
IF SUBSTRING(p_string FROM i FOR 1) = SUBSTRING(p_string FROM (i + 1) FOR 1) THEN
RETURN TRUE;
END IF;
END LOOP;
RETURN FALSE;
END;
$$;
Function to check for sequential characters/digits:
SQL
CREATE OR REPLACE FUNCTION fn_check_sequential_characters(p_string text)
RETURNS boolean
LANGUAGE plpgsql
AS $$
DECLARE
i integer;
BEGIN
FOR i IN 1..(LENGTH(p_string) - 1) LOOP
IF ASCII(SUBSTRING(p_string FROM (i + 1) FOR 1)) - ASCII(SUBSTRING(p_string FROM i FOR 1)) != 1 THEN
RETURN FALSE;
END IF;
END LOOP;
RETURN TRUE;
END;
$$;
Phase 3: Categorization & Reporting
The final phase involves applying all validation rules and generating insightful reports to summarize the data quality.
Step 4: Identify Valid PANs with a CTE
This CTE filters the cleaned data to find all valid PANs using the format check (~ '^[A-Z]{5}[0-9]{4}[A-Z]{1}$'
) and the custom functions.
SQL
WITH CTE_cleaned_pan AS (
-- ( ... previous data cleaning query ... )
),
CTE_valid_pans AS (
SELECT pan_number FROM CTE_cleaned_pan
WHERE
pan_number ~ '^[A-Z]{5}[0-9]{4}[A-Z]{1}$'
AND fn_check_adjacent_characters(SUBSTRING(pan_number FROM 1 FOR 5)) = FALSE
AND fn_check_sequential_characters(SUBSTRING(pan_number FROM 1 FOR 5)) = FALSE
AND fn_check_adjacent_characters(SUBSTRING(pan_number FROM 6 FOR 4)) = FALSE
AND fn_check_sequential_characters(SUBSTRING(pan_number FROM 6 FOR 4)) = FALSE
)
Step 5: Generate a Detailed Report
A final query categorizes each cleaned PAN as ‘valid’ or ‘invalid’ by joining the cleaned data with the valid PANs.
SQL
SELECT
CLN.pan_number,
CASE
WHEN VLD.pan_number IS NOT NULL THEN 'valid pan'
ELSE 'invalid pan'
END AS status
FROM CTE_cleaned_pan CLN
LEFT JOIN CTE_valid_pans VLD
ON VLD.pan_number = CLN.pan_number;
Step 6: Create a Summary Report
This query provides a high-level overview of the dataset, including counts of valid, invalid, and missing PANs.
SQL
WITH CTE AS ( SELECT (SELECT COUNT(*) FROM STG_pan_numbers_dataset) AS total_processed_records, COUNT(*) FILTER (WHERE status = 'valid pan') AS total_valid_pans, COUNT(*) FILTER (WHERE status = 'invalid pan') AS total_invalid_pans FROM view_valid_invalid_pans ) SELECT total_processed_records, total_valid_pans, total_invalid_pans, total_processed_records - (total_valid_pans + total_invalid_pans) AS total_missing_or_incomplete_pans FROM CTE;