
Data Cleaning and Preprocessing Checklist Builder

Generate a comprehensive data cleaning and preprocessing checklist tailored to your dataset type — covering missing values, outliers, encoding, normalization, and validation steps before analysis or model training.

Prompt Template

You are a senior data engineer and analyst. Create a comprehensive data cleaning and preprocessing checklist for:

**Dataset description:** [what the data contains — e.g., 'e-commerce transaction records with 500K rows']
**Data source:** [CSV export, database dump, API pull, web scrape, survey tool, etc.]
**Number of columns/features:** [approximate count and types — numeric, categorical, text, date, boolean]
**Known issues (if any):** [e.g., 'lots of nulls in address fields', 'dates in mixed formats', 'duplicate customer entries']
**End goal:** [e.g., build a churn prediction model, create a dashboard, generate a report, feed into ML pipeline]
**Tools/language:** [Python/pandas, R, SQL, Excel, dbt, etc.]
**Compliance requirements:** [GDPR, HIPAA, PII handling, anonymization needs]

Provide:
1. **Initial data profiling checklist** — first steps to understand the dataset's shape, types, and quality
2. **Missing data strategy** — decision framework for handling nulls (drop, impute, flag) based on the percentage and pattern of missingness
3. **Duplicate detection and resolution** — methods to identify and handle exact and fuzzy duplicates
4. **Outlier detection** — techniques appropriate for the data type (statistical, IQR, domain-specific)
5. **Data type and format standardization** — date parsing, string cleaning, encoding categorical variables
6. **Validation rules** — domain-specific checks (e.g., no negative prices, valid email formats, date ranges)
7. **Feature engineering suggestions** — derived columns that could improve analysis or modeling
8. **Code snippets** — ready-to-use code in [specified language] for each step
9. **Data quality report template** — summary table to document cleaning decisions and their impact
10. **Pre-analysis sanity check** — final validation steps before using the cleaned data

Example Output

Data Cleaning Checklist — E-Commerce Transaction Dataset (500K rows)

Step 1: Initial Data Profiling

```python
import pandas as pd

df = pd.read_csv('transactions.csv')

# Shape and types
print(f"Rows: {df.shape[0]:,} | Columns: {df.shape[1]}")
print(df.dtypes)
print(df.describe(include='all'))

# Missing values summary
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_report = pd.DataFrame({'count': missing, 'pct': missing_pct})
print(missing_report[missing_report['count'] > 0].sort_values('pct', ascending=False))

# Unique values per column
print(df.nunique())
```

Step 2: Missing Data Strategy

| Missing % | Strategy | Rationale |
|-----------|----------|-----------|
| <5% | Drop rows OR mean/median impute | Minimal impact on dataset size |
| 5-30% | Impute (median for numeric, mode for categorical) + add `_missing` flag column | Preserve rows while flagging imputation |
| >30% | Consider dropping the column OR building a separate model | Too much imputation introduces noise |
| MCAR (random) | Safe to impute with simple methods | Missingness is not informative |
| MAR/MNAR (systematic) | Investigate WHY it's missing — the pattern may be a feature itself | e.g., missing income = customer didn't provide = possible signal |

```python
# Impute numeric with median, categorical with mode; flag imputed rows first
for col in df.select_dtypes(include='number').columns:
    if df[col].isnull().sum() > 0:
        df[f'{col}_missing'] = df[col].isnull().astype(int)
        df[col] = df[col].fillna(df[col].median())

for col in df.select_dtypes(include=['object', 'category']).columns:
    if df[col].isnull().sum() > 0:
        df[f'{col}_missing'] = df[col].isnull().astype(int)
        df[col] = df[col].fillna(df[col].mode()[0])
```

Step 3: Duplicate Detection

```python
# Exact duplicates
exact_dupes = df.duplicated().sum()
print(f"Exact duplicates: {exact_dupes}")

# Business-key duplicates (same customer + same order)
biz_dupes = df.duplicated(subset=['customer_id', 'order_id'], keep=False)
print(f"Business-key duplicates: {biz_dupes.sum()}")

# Resolution: keep first occurrence
df = df.drop_duplicates(subset=['customer_id', 'order_id'], keep='first')
```
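The prompt also asks about fuzzy duplicates. A minimal sketch of one common approach (normalizing key strings before comparing), using hypothetical `customer_name` and `email` columns as the matching key:

```python
import pandas as pd

# Toy data: rows 0 and 1 are the same customer with cosmetic differences
df = pd.DataFrame({
    'customer_name': ['Jane Doe', 'jane  doe ', 'Bob Roe'],
    'email': ['JANE@EXAMPLE.COM', 'jane@example.com', 'bob@example.com'],
})

# Normalize: lowercase, trim, collapse internal whitespace
norm_key = (df['customer_name'].str.lower().str.strip().str.split().str.join(' ')
            + '|' + df['email'].str.lower().str.strip())

# Rows sharing a normalized key are near-duplicates to review or merge
fuzzy_dupes = norm_key.duplicated(keep=False)
print(df[fuzzy_dupes])
```

For messier cases (typos, transposed words), string-similarity libraries or record-linkage tools are the usual next step; normalization alone only catches cosmetic variants.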

Step 4: Outlier Detection

```python
# IQR method for numeric columns
def flag_outliers_iqr(series, factor=1.5):
    Q1, Q3 = series.quantile([0.25, 0.75])
    IQR = Q3 - Q1
    return (series < Q1 - factor * IQR) | (series > Q3 + factor * IQR)

df['price_outlier'] = flag_outliers_iqr(df['transaction_amount'])
print(f"Price outliers: {df['price_outlier'].sum()}")
```
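The quality report below records dates standardized to ISO 8601. A minimal sketch of that standardization step, assuming a hypothetical `order_date` column and pandas ≥ 2.0 (for `format='mixed'`):

```python
import pandas as pd

# Toy data with inconsistent date formats and one bad value
df = pd.DataFrame({'order_date': ['2024-03-05', '03/07/2024', 'not a date']})

# Parse per-row; unparseable values become NaT for manual review
df['order_date'] = pd.to_datetime(df['order_date'], format='mixed', errors='coerce')

# Store as ISO 8601 (YYYY-MM-DD) strings for export
df['order_date_iso'] = df['order_date'].dt.strftime('%Y-%m-%d')
print(df['order_date_iso'].tolist())
```

Note that truly ambiguous formats (is `03/07` March 7 or July 3?) cannot be resolved automatically; confirm the source convention and pass an explicit `format` string where possible.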

Step 5: Validation Rules

```python
import pandas as pd

# Domain checks: fail fast if the data violates business rules
assert (df['transaction_amount'] >= 0).all(), "Negative transaction amounts found!"

# na=False treats missing emails as invalid instead of passing silently
assert df['email'].str.contains('@', na=False).all(), "Invalid email addresses found!"

# Parse dates before range-checking
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')
assert df['order_date'].between('2020-01-01', '2026-12-31').all(), "Dates out of expected range!"
```

Data Quality Report

| Check | Before | After | Action Taken |
|-------|--------|-------|--------------|
| Total rows | 500,000 | 487,231 | Removed 12,769 duplicates |
| Missing (address) | 23.4% | 0% | Imputed with 'Unknown' + flag |
| Outliers (price) | 342 rows | 342 flagged | Kept but flagged, not removed |
| Date format | Mixed (US/EU) | ISO 8601 | Standardized to YYYY-MM-DD |
| Negative prices | 17 rows | 0 | Investigated — data entry errors, corrected |
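A report like this can also be assembled programmatically so it stays in sync with the cleaning steps. A minimal sketch, assuming each step appends a hypothetical check record to a list:

```python
import pandas as pd

# Hypothetical records collected as each cleaning step runs
checks = [
    {'check': 'Total rows', 'before': '500,000', 'after': '487,231',
     'action': 'Removed 12,769 duplicates'},
    {'check': 'Missing (address)', 'before': '23.4%', 'after': '0%',
     'action': "Imputed with 'Unknown' + flag"},
]

quality_report = pd.DataFrame(checks, columns=['check', 'before', 'after', 'action'])
print(quality_report.to_string(index=False))
```

Keeping the report as a DataFrame means it can be exported alongside the cleaned data (CSV, Excel sheet) as an audit trail.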

Tips for Best Results

  • 💡 Always profile your data BEFORE cleaning — don't assume you know what's in there. A 5-minute profiling step can save hours of debugging.
  • 💡 Document every cleaning decision in a data quality report. Future-you (or your teammate) will need to know why 12K rows were dropped.
  • 💡 Don't delete outliers by default — flag them. In many business contexts, outliers ARE the interesting data (big deals, fraud, power users).
  • 💡 Create a `_missing` flag column before imputing — the fact that data was missing is often a useful feature for models.