# Data Cleaning and Preprocessing Checklist Builder
Generate a comprehensive data cleaning and preprocessing checklist tailored to your dataset type — covering missing values, outliers, encoding, normalization, and validation steps before analysis or model training.
## Prompt Template
You are a senior data engineer and analyst. Create a comprehensive data cleaning and preprocessing checklist for:

**Dataset description:** [what the data contains — e.g., 'e-commerce transaction records with 500K rows']
**Data source:** [CSV export, database dump, API pull, web scrape, survey tool, etc.]
**Number of columns/features:** [approximate count and types — numeric, categorical, text, date, boolean]
**Known issues (if any):** [e.g., 'lots of nulls in address fields', 'dates in mixed formats', 'duplicate customer entries']
**End goal:** [e.g., build a churn prediction model, create a dashboard, generate a report, feed into ML pipeline]
**Tools/language:** [Python/pandas, R, SQL, Excel, dbt, etc.]
**Compliance requirements:** [GDPR, HIPAA, PII handling, anonymization needs]

Provide:

1. **Initial data profiling checklist** — first steps to understand the dataset's shape, types, and quality
2. **Missing data strategy** — decision framework for handling nulls (drop, impute, flag) based on the percentage and pattern of missingness
3. **Duplicate detection and resolution** — methods to identify and handle exact and fuzzy duplicates
4. **Outlier detection** — techniques appropriate for the data type (statistical, IQR, domain-specific)
5. **Data type and format standardization** — date parsing, string cleaning, encoding categorical variables
6. **Validation rules** — domain-specific checks (e.g., no negative prices, valid email formats, date ranges)
7. **Feature engineering suggestions** — derived columns that could improve analysis or modeling
8. **Code snippets** — ready-to-use code in [specified language] for each step
9. **Data quality report template** — summary table to document cleaning decisions and their impact
10. **Pre-analysis sanity check** — final validation steps before using the cleaned data
## Example Output

### Data Cleaning Checklist — E-Commerce Transaction Dataset (500K rows)

#### Step 1: Initial Data Profiling
```python
import pandas as pd

df = pd.read_csv('transactions.csv')

# Shape and types
print(f"Rows: {df.shape[0]:,} | Columns: {df.shape[1]}")
print(df.dtypes)
print(df.describe(include='all'))

# Missing values summary
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_report = pd.DataFrame({'count': missing, 'pct': missing_pct})
print(missing_report[missing_report['count'] > 0].sort_values('pct', ascending=False))

# Unique values per column
print(df.nunique())
```
#### Step 2: Missing Data Strategy
| Missing % | Strategy | Rationale |
|-----------|----------|----------|
| <5% | Drop rows OR mean/median impute | Minimal impact on dataset size |
| 5-30% | Impute (median for numeric, mode for categorical) + add `_missing` flag column | Preserve rows while flagging imputation |
| >30% | Consider dropping the column OR building a separate model | Too much imputation introduces noise |
| MCAR (random) | Safe to impute with simple methods | Missingness is not informative |
| MAR/MNAR (systematic) | Investigate WHY it's missing — the pattern may be a feature itself | e.g., missing income = customer didn't provide = possible signal |
```python
# Impute numeric with median, categorical with mode; add `_missing` flags first
for col in df.select_dtypes(include='number').columns:
    if df[col].isnull().sum() > 0:
        df[f'{col}_missing'] = df[col].isnull().astype(int)
        df[col] = df[col].fillna(df[col].median())

for col in df.select_dtypes(include=['object', 'category']).columns:
    if df[col].isnull().sum() > 0:
        df[f'{col}_missing'] = df[col].isnull().astype(int)
        df[col] = df[col].fillna(df[col].mode().iloc[0])
```
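The MCAR vs. MAR/MNAR distinction in the table above can be probed with a quick heuristic: compare the averages of other columns in rows where a value is missing versus present. A large gap suggests systematic missingness. A minimal sketch, where the `demo` frame and its columns are purely illustrative:

```python
import pandas as pd

def missingness_association(df, target_col):
    """For each numeric column, compare its mean in rows where
    target_col is missing vs. present. Large gaps between the two
    groups hint at MAR/MNAR (systematic) rather than MCAR (random)."""
    mask = df[target_col].isnull()
    numeric = df.select_dtypes(include='number').columns.drop(target_col, errors='ignore')
    return pd.DataFrame({
        'mean_when_missing': df.loc[mask, numeric].mean(),
        'mean_when_present': df.loc[~mask, numeric].mean(),
    })

# Illustrative check: does a missing address coincide with unusual order values?
demo = pd.DataFrame({
    'address': ['a', None, 'c', None],
    'transaction_amount': [10.0, 500.0, 12.0, 480.0],
})
print(missingness_association(demo, 'address'))
```

In this toy frame the mean amount is far higher when the address is missing, the kind of pattern that should trigger the "investigate WHY" row of the table rather than blind imputation.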
#### Step 3: Duplicate Detection
```python
# Exact duplicates
exact_dupes = df.duplicated().sum()
print(f"Exact duplicates: {exact_dupes}")

# Business-key duplicates (same customer + same order)
biz_dupes = df.duplicated(subset=['customer_id', 'order_id'], keep=False)
print(f"Business-key duplicates: {biz_dupes.sum()}")

# Resolution: keep first occurrence
df = df.drop_duplicates(subset=['customer_id', 'order_id'], keep='first')
```
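Exact and business-key matching will not catch the fuzzy duplicates the prompt template asks about, such as the same customer entered as 'Jon Smith' and 'John Smith'. A minimal sketch using only the standard library's `difflib`; note that pairwise comparison is O(n²), so for large datasets a blocking strategy or a dedicated matching library would be needed:

```python
from difflib import SequenceMatcher
from itertools import combinations

def fuzzy_name_pairs(names, threshold=0.85):
    """Return pairs of distinct names whose similarity ratio
    (case-insensitive) meets or exceeds the threshold."""
    pairs = []
    for a, b in combinations(sorted(set(names)), 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b, round(ratio, 2)))
    return pairs

print(fuzzy_name_pairs(['John Smith', 'Jon Smith', 'Alice Wong']))
# Flags ('John Smith', 'Jon Smith') as likely duplicates
```

Flagged pairs should be reviewed by a human or resolved against a trusted key (e.g., email) rather than merged automatically.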
#### Step 4: Outlier Detection
```python
# IQR method for numeric columns
def flag_outliers_iqr(series, factor=1.5):
    Q1, Q3 = series.quantile([0.25, 0.75])
    IQR = Q3 - Q1
    return (series < Q1 - factor * IQR) | (series > Q3 + factor * IQR)

df['price_outlier'] = flag_outliers_iqr(df['transaction_amount'])
print(f"Price outliers: {df['price_outlier'].sum()}")
```
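The IQR rule works best on roughly symmetric distributions, while transaction amounts are often heavily skewed. A common robust alternative is the modified z-score based on the median absolute deviation (MAD). A sketch with an illustrative series:

```python
import pandas as pd

def flag_outliers_mad(series, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold.
    MAD is more robust to skew and extreme values than the
    mean/std z-score; 0.6745 rescales MAD to be comparable to a
    standard deviation under normality."""
    median = series.median()
    mad = (series - median).abs().median()
    if mad == 0:  # degenerate case: more than half the values identical
        return pd.Series(False, index=series.index)
    modified_z = 0.6745 * (series - median) / mad
    return modified_z.abs() > threshold

s = pd.Series([10, 12, 11, 13, 9, 500])
print(flag_outliers_mad(s).tolist())
# → [False, False, False, False, False, True]
```

As with the IQR flag, the point is to mark suspect rows for review, not to drop them automatically.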
#### Step 5: Validation Rules
```python
assert (df['transaction_amount'] >= 0).all(), "Negative transaction amounts found!"
# na=False: treat missing emails as failures instead of raising on NaN
assert df['email'].str.contains('@', na=False).all(), "Invalid or missing email addresses found!"
# Parse to datetime first so the range check is not a string comparison
assert pd.to_datetime(df['order_date']).between('2020-01-01', '2026-12-31').all(), "Dates out of expected range!"
```
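Hard `assert` statements stop the pipeline at the first failure. For a cleaning run it is often more useful to evaluate every rule and collect failure counts into a single report. A sketch with illustrative rules and data:

```python
import pandas as pd

def run_validation(df, rules):
    """Apply each named rule (a function returning a boolean Series,
    True = row passes) and report the number of failing rows."""
    report = []
    for name, rule in rules.items():
        passed = rule(df)
        report.append({'rule': name, 'failures': int((~passed).sum())})
    return pd.DataFrame(report)

demo = pd.DataFrame({
    'transaction_amount': [10.0, -5.0, 20.0],
    'email': ['a@x.com', 'bad-email', 'b@y.com'],
})
rules = {
    'non_negative_amount': lambda d: d['transaction_amount'] >= 0,
    'email_has_at': lambda d: d['email'].str.contains('@', na=False),
}
print(run_validation(demo, rules))
```

The resulting table slots directly into the data quality report below: one row per rule, with before/after counts once fixes are applied.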
#### Data Quality Report
| Check | Before | After | Action Taken |
|-------|--------|-------|--------------|
| Total rows | 500,000 | 487,231 | Removed 12,769 duplicates |
| Missing (address) | 23.4% | 0% | Imputed with 'Unknown' + flag |
| Outliers (price) | 342 rows | 342 flagged | Kept but flagged, not removed |
| Date format | Mixed (US/EU) | ISO 8601 | Standardized to YYYY-MM-DD |
| Negative prices | 17 rows | 0 | Investigated — data entry errors, corrected |
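The report above mentions standardizing mixed US/EU date formats to ISO 8601, a step the snippets do not show. One cautious approach is to try each known format explicitly rather than letting the parser guess. The format list here is illustrative, and truly ambiguous strings like '03/04/2021' can only be resolved by knowing which convention the source system uses:

```python
import pandas as pd

def standardize_dates(series, formats=('%Y-%m-%d', '%m/%d/%Y', '%d.%m.%Y')):
    """Try each known format in turn; values matching no format
    become NaT so they can be surfaced for manual review."""
    result = pd.Series(pd.NaT, index=series.index)
    remaining = series.copy()
    for fmt in formats:
        parsed = pd.to_datetime(remaining, format=fmt, errors='coerce')
        result = result.fillna(parsed)          # keep first successful parse
        remaining = remaining.where(parsed.isna())  # blank out parsed values
    return result.dt.strftime('%Y-%m-%d')

s = pd.Series(['2021-03-04', '03/04/2021', '04.03.2021'])
print(standardize_dates(s).tolist())
# → ['2021-03-04', '2021-03-04', '2021-03-04']
```

Any NaT left after all formats have been tried is itself a data quality finding worth a row in the report.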
## Tips for Best Results
- 💡 Always profile your data BEFORE cleaning — don't assume you know what's in there. A 5-minute profiling step can save hours of debugging.
- 💡 Document every cleaning decision in a data quality report. Future-you (or your teammate) will need to know why 12K rows were dropped.
- 💡 Don't delete outliers by default — flag them. In many business contexts, outliers ARE the interesting data (big deals, fraud, power users).
- 💡 Create a `_missing` flag column before imputing — the fact that data was missing is often a useful feature for models.
## Related Prompts

### Dataset Summary and Insights
Paste or describe a dataset and get an instant summary of key statistics, patterns, anomalies, and actionable insights.

### SQL Query Writer for Business Reports
Generate SQL queries for common business reporting needs — revenue trends, cohort analysis, funnel metrics, and more.

### Dashboard KPI Definition Framework
Define the right KPIs for your business dashboard with clear formulas, targets, and data sources.