In today’s data-driven world, ensuring the accuracy and integrity of your data is paramount. This is especially true for sensitive information like Permanent Account Numbers (PAN), which are crucial for financial and legal operations in India. This guide walks you through a practical SQL project on how to clean and validate a large dataset of PAN numbers, ensuring high data quality for your business.

The Project Breakdown

Our project focuses on a dataset of 10,000 PAN numbers and is broken down into three key phases:

Data Cleaning and Pre-processing

Validation against Specific Business Rules

Categorization and Reporting

Step 1: Data Cleaning and Pre-processing

The first step is to transform raw data into a usable format. This involves a series of essential cleaning operations:

Handling Missing Data: Remove records with missing or null values.

Duplicate Removal: Eliminate any duplicate entries to ensure each record is unique.

Whitespace Trimming: Trim leading and trailing spaces to standardize the format.

Case Standardization: Convert all characters to uppercase for consistent validation.

Step 2: Implementing Validation Rules

Once the data is clean, the next critical phase is to validate each PAN number against a set of predefined rules. A PAN number must adhere to the format AAAAA9999A and follow these specific checks:

Length Check: The PAN number must be exactly 10 characters long.

Format Check: The structure must be five alphabets, followed by four digits, and ending with one alphabet.

Adjacent Character Rule: The first five alphabets cannot have any adjacent repeating characters (e.g., “AA” is invalid).

Sequential Character Rule: The first five alphabets cannot be a sequential series (e.g., “ABCDE” is invalid).

Adjacent Digit Rule: The four digits cannot have any adjacent repeating digits (e.g., “11” is invalid).

Sequential Digit Rule: The four digits cannot be a sequential series (e.g., “1234” is invalid).

Step 3: Categorization and Reporting

The final step is to categorize the PAN numbers as “valid” or “invalid” based on the results of the validation process. Two types of reports can be generated to provide a clear overview:

Detailed Report: A line-by-line report listing each PAN number and its validation status.

Summary Report: A high-level report that shows the total number of records, and the count of valid, invalid, and missing PANs.

Implementation

Ensuring data integrity is critical for any organization. This blog post provides a step-by-step guide to a real-world SQL project focused on cleaning and validating a dataset of PAN (Permanent Account Number) numbers. You will learn how to use PostgreSQL to handle everything from data loading to generating comprehensive reports.

Phase 1: Data Loading & Pre-processing

The first step is to load your raw data into a database table and perform initial cleaning. This makes your data consistent and ready for validation.

Step 1: Create the Table

We begin by creating a table to store the raw PAN data.

SQL

CREATE TABLE STG_pan_numbers_dataset (
    pan_number text
);

Step 2: Clean the Data with a CTE

A Common Table Expression (CTE) is used to apply all cleaning rules in a single, efficient query. This includes removing nulls, trimming whitespace, converting to uppercase, and eliminating duplicates.

SQL

WITH CTE_cleaned_pan AS (
    SELECT DISTINCT
        UPPER(TRIM(pan_number)) AS pan_number
    FROM STG_pan_numbers_dataset
    WHERE
        pan_number IS NOT NULL
        AND TRIM(pan_number) != ''
)
SELECT * FROM CTE_cleaned_pan;

Phase 2: Implementing Validation Logic

PAN numbers follow a strict format (AAAAA9999A). In addition to this format, a valid PAN cannot have adjacent repeating characters or digits, or sequential series. We create user-defined functions to enforce these custom rules.

Step 3: Create User-Defined Functions

These functions will be used to check for adjacent and sequential characters/digits.

Function to check for adjacent same characters/digits:

SQL

CREATE OR REPLACE FUNCTION fn_check_adjacent_characters(p_string text)
RETURNS boolean
LANGUAGE plpgsql
AS $$
DECLARE
    i integer;
BEGIN
    FOR i IN 1..(LENGTH(p_string) - 1) LOOP
        IF SUBSTRING(p_string FROM i FOR 1) = SUBSTRING(p_string FROM (i + 1) FOR 1) THEN
            RETURN TRUE;
        END IF;
    END LOOP;
    RETURN FALSE;
END;
$$;

Function to check for sequential characters/digits:

SQL

CREATE OR REPLACE FUNCTION fn_check_sequential_characters(p_string text)
RETURNS boolean
LANGUAGE plpgsql
AS $$
DECLARE
    i integer;
BEGIN
    FOR i IN 1..(LENGTH(p_string) - 1) LOOP
        IF ASCII(SUBSTRING(p_string FROM (i + 1) FOR 1)) - ASCII(SUBSTRING(p_string FROM i FOR 1)) != 1 THEN
            RETURN FALSE;
        END IF;
    END LOOP;
    RETURN TRUE;
END;
$$;

Phase 3: Categorization & Reporting

The final phase involves applying all validation rules and generating insightful reports to summarize the data quality.

Step 4: Identify Valid PANs with a CTE

This CTE filters the cleaned data to find all valid PANs using the format check (~ '^[A-Z]{5}[0-9]{4}[A-Z]{1}$') and the custom functions.

SQL

WITH CTE_cleaned_pan AS (
    -- ( ... previous data cleaning query ... )
),
CTE_valid_pans AS (
    SELECT pan_number FROM CTE_cleaned_pan
    WHERE
        pan_number ~ '^[A-Z]{5}[0-9]{4}[A-Z]{1}$'
        AND fn_check_adjacent_characters(SUBSTRING(pan_number FROM 1 FOR 5)) = FALSE
        AND fn_check_sequential_characters(SUBSTRING(pan_number FROM 1 FOR 5)) = FALSE
        AND fn_check_adjacent_characters(SUBSTRING(pan_number FROM 6 FOR 4)) = FALSE
        AND fn_check_sequential_characters(SUBSTRING(pan_number FROM 6 FOR 4)) = FALSE
)

Step 5: Generate a Detailed Report

A final query categorizes each cleaned PAN as ‘valid’ or ‘invalid’ by joining the cleaned data with the valid PANs.

SQL

SELECT
    CLN.pan_number,
    CASE
        WHEN VLD.pan_number IS NOT NULL THEN 'valid pan'
        ELSE 'invalid pan'
    END AS status
FROM CTE_cleaned_pan CLN
LEFT JOIN CTE_valid_pans VLD
    ON VLD.pan_number = CLN.pan_number;

Step 6: Create a Summary Report

This query provides a high-level overview of the dataset, including counts of valid, invalid, and missing PANs.

SQL

WITH CTE AS ( SELECT (SELECT COUNT(*) FROM STG_pan_numbers_dataset) AS total_processed_records, COUNT(*) FILTER (WHERE status = 'valid pan') AS total_valid_pans, COUNT(*) FILTER (WHERE status = 'invalid pan') AS total_invalid_pans FROM view_valid_invalid_pans ) SELECT total_processed_records, total_valid_pans, total_invalid_pans, total_processed_records - (total_valid_pans + total_invalid_pans) AS total_missing_or_incomplete_pans FROM CTE;