
Data is the fuel for analytics and machine learning — but bad data leads to bad decisions. Imagine building a sales dashboard where customer ages are negative, emails are missing, or revenue numbers don’t add up. That’s where data validation comes in.
The good news? With Great Expectations, you can validate your data in just 10 minutes.
Let’s walk through a real-world example, step by step.
⏱️ What is Great Expectations?
Great Expectations (GE) is an open-source data quality framework that helps you:
- Profile your data (understand what’s inside).
- Validate your data (set rules, aka expectations).
- Document your data (auto-generate data quality reports).
Think of it as unit tests for your data.
🔧 Prerequisites
- Python 3.8+ installed
- A dataset to validate (we’ll use a simple Sales CSV file)
Sample sales.csv:
order_id,customer_email,order_amount,order_date
1,john@example.com,120,2025-08-01
2,mary@example.com,300,2025-08-02
3,,250,2025-08-03
4,bob@example.com,-50,2025-08-04
👉 Notice the issues: missing email and a negative order amount.
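If you want to see those problem rows yourself before bringing in GE, a quick pandas check will do it (a small sketch using the same file):
import pandas as pd
df = pd.read_csv("sales.csv")
print(df[df["customer_email"].isna()])   # the row with the missing email
print(df[df["order_amount"] <= 0])       # the row with the negative amount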
🚀 Step 1: Install Great Expectations
Run in your terminal:
pip install great-expectations
🚀 Step 2: Initialize Great Expectations
In your project folder:
great_expectations init
This creates a great_expectations/ directory with the project configuration.
🚀 Step 3: Connect Your Data
Load the CSV with pandas and wrap it in a GE dataset:
import great_expectations as gx
import pandas as pd
# Load the dataset
df = pd.read_csv("sales.csv")
# Wrap the dataframe in a GE dataset so expectations can be attached directly
# (this uses the legacy PandasDataset interface from Great Expectations 0.x releases)
dataset = gx.dataset.PandasDataset(df)
🚀 Step 4: Define Expectations (Validation Rules)
Let’s validate that:
- order_id should never be null.
- customer_email should not be null.
- order_amount should be positive.
- order_date should follow the YYYY-MM-DD format.
# 1. order_id must not be null
dataset.expect_column_values_to_not_be_null("order_id")
# 2. customer_email must not be null
dataset.expect_column_values_to_not_be_null("customer_email")
# 3. order_amount must be greater than 0
dataset.expect_column_values_to_be_between("order_amount", min_value=1)
# 4. order_date must match YYYY-MM-DD format
dataset.expect_column_values_to_match_strftime_format(
"order_date", "%Y-%m-%d"
)
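Each expect_* call both registers the rule and returns an immediate result, so you can spot-check a single expectation on its own (a quick sketch, assuming the same legacy PandasDataset API as above):
# Inspect one expectation's result directly
email_check = dataset.expect_column_values_to_not_be_null("customer_email")
print(email_check["success"])                     # False: one email is missing
print(email_check["result"]["unexpected_count"])  # 1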
🚀 Step 5: Run Validation
results = dataset.validate()
print(results)
Sample output (abridged):
{
  "success": false,
  "statistics": {
    "evaluated_expectations": 4,
    "successful_expectations": 2,
    "unsuccessful_expectations": 2
  },
  "results": [
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "success": false,
      "unexpected_count": 1
    },
    {
      "expectation_type": "expect_column_values_to_be_between",
      "success": false,
      "unexpected_count": 1
    }
  ]
}
👉 The validation caught the missing email and negative order amount! 🎉
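In a pipeline you usually don’t want to read this JSON by eye; you want the run to stop when checks fail. A minimal sketch, reusing the results object from the step above (the error message is just an example):
# Summarize and fail fast so bad data never reaches downstream consumers
stats = results["statistics"]
print(f'{stats["unsuccessful_expectations"]} of {stats["evaluated_expectations"]} expectations failed')
if not results["success"]:
    raise ValueError("sales.csv failed validation; fix the data before loading it")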
🚀 Step 6: Auto-Generate Data Docs
GE can generate HTML reports:
great_expectations docs build
This creates a beautiful validation report you can open in your browser:
- Shows passed & failed checks.
- Helps teams track data quality over time.
🔍 Real-Time Example in Action
Imagine you’re a data engineer at an e-commerce company. Every day, you ingest thousands of orders. If you don’t validate:
- Wrong order amounts could distort revenue.
- Missing customer emails could break CRM integration.
- Wrong dates could misalign sales reports.
By running a 10-minute GE validation, you catch these issues before they hit dashboards or ML models.
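Wrapped up as one reusable function, that daily check could look roughly like this (a sketch: validate_daily_orders and the file name are hypothetical, and it reuses the legacy PandasDataset API from the steps above):
import great_expectations as gx
import pandas as pd
def validate_daily_orders(csv_path):
    """Run the sales expectations against one day's order file."""
    df = pd.read_csv(csv_path)
    dataset = gx.dataset.PandasDataset(df)
    dataset.expect_column_values_to_not_be_null("order_id")
    dataset.expect_column_values_to_not_be_null("customer_email")
    dataset.expect_column_values_to_be_between("order_amount", min_value=1)
    dataset.expect_column_values_to_match_strftime_format("order_date", "%Y-%m-%d")
    return dataset.validate()["success"]
# Example: block the load if today's file fails its checks
if not validate_daily_orders("orders_2025-08-04.csv"):
    raise SystemExit("Orders file failed data quality checks; aborting load")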
⚡ Best Practices
- Automate checks in your ETL/ELT pipelines (Airflow, Prefect, dbt); a minimal Airflow sketch follows this list.
- Version-control your expectations (treat them like code).
- Share Data Docs with analysts so everyone trusts the data.
- Start small (few expectations) → grow over time.
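For that first bullet, wiring the check into an Airflow DAG could look roughly like this (a minimal sketch, assuming Airflow 2.x; the DAG name, schedule, file path, and the sales_checks module holding validate_daily_orders are all assumptions):
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
def run_sales_validation():
    # Hypothetical module containing the validate_daily_orders function from above
    from sales_checks import validate_daily_orders
    if not validate_daily_orders("/data/incoming/sales.csv"):
        raise ValueError("Sales data failed Great Expectations checks")
with DAG(
    dag_id="daily_sales_validation",
    start_date=datetime(2025, 8, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="validate_sales",
        python_callable=run_sales_validation,
    )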
🧠 Key Takeaways
- Great Expectations = unit testing for data.
- In under 10 minutes, you can validate a dataset with just a few lines of code.
- Catching issues early saves time, money, and credibility.
- Use it in real-time pipelines to keep your data trustworthy.
✨ Bottom line: If you want reliable, production-ready data, Great Expectations should be part of your workflow. It’s fast, scalable, and developer-friendly.