# Data Cleaning and Preprocessing Checklist Builder
Generate a comprehensive data cleaning and preprocessing checklist tailored to your dataset type — covering missing values, outliers, encoding, normalization, and validation steps before analysis or model training.
## Prompt Template
You are a senior data engineer and analyst. Create a comprehensive data cleaning and preprocessing checklist for:

**Dataset description:** [what the data contains — e.g., 'e-commerce transaction records with 500K rows']
**Data source:** [CSV export, database dump, API pull, web scrape, survey tool, etc.]
**Number of columns/features:** [approximate count and types — numeric, categorical, text, date, boolean]
**Known issues (if any):** [e.g., 'lots of nulls in address fields', 'dates in mixed formats', 'duplicate customer entries']
**End goal:** [e.g., build a churn prediction model, create a dashboard, generate a report, feed into ML pipeline]
**Tools/language:** [Python/pandas, R, SQL, Excel, dbt, etc.]
**Compliance requirements:** [GDPR, HIPAA, PII handling, anonymization needs]

Provide:

1. **Initial data profiling checklist** — first steps to understand the dataset's shape, types, and quality
2. **Missing data strategy** — decision framework for handling nulls (drop, impute, flag) based on the percentage and pattern of missingness
3. **Duplicate detection and resolution** — methods to identify and handle exact and fuzzy duplicates
4. **Outlier detection** — techniques appropriate for the data type (statistical, IQR, domain-specific)
5. **Data type and format standardization** — date parsing, string cleaning, encoding categorical variables
6. **Validation rules** — domain-specific checks (e.g., no negative prices, valid email formats, date ranges)
7. **Feature engineering suggestions** — derived columns that could improve analysis or modeling
8. **Code snippets** — ready-to-use code in [specified language] for each step
9. **Data quality report template** — summary table to document cleaning decisions and their impact
10. **Pre-analysis sanity check** — final validation steps before using the cleaned data
## Example Output
### Data Cleaning Checklist — E-Commerce Transaction Dataset (500K rows)
#### Step 1: Initial Data Profiling

```python
import pandas as pd

df = pd.read_csv('transactions.csv')

# Shape and types
print(f"Rows: {df.shape[0]:,} | Columns: {df.shape[1]}")
print(df.dtypes)
print(df.describe(include='all'))

# Missing values summary
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_report = pd.DataFrame({'count': missing, 'pct': missing_pct})
print(missing_report[missing_report['count'] > 0].sort_values('pct', ascending=False))

# Unique values per column
print(df.nunique())
```
#### Step 2: Missing Data Strategy
| Missing % | Strategy | Rationale |
|-----------|----------|----------|
| <5% | Drop rows OR mean/median impute | Minimal impact on dataset size |
| 5-30% | Impute (median for numeric, mode for categorical) + add `_missing` flag column | Preserve rows while flagging imputation |
| >30% | Consider dropping the column OR building a separate model | Too much imputation introduces noise |
| MCAR (random) | Safe to impute with simple methods | Missingness is not informative |
| MAR/MNAR (systematic) | Investigate WHY it's missing — the pattern may be a feature itself (see the sketch below) | e.g., missing income = customer didn't provide = possible signal |
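To act on the MCAR vs. MAR/MNAR distinction above, a quick check is whether missingness in one column correlates with another variable. A minimal sketch, assuming hypothetical `income` and `churned` columns (neither appears in the example dataset):

```python
# Is missingness in 'income' related to the outcome? ('income' and 'churned' are illustrative)
df['income_missing'] = df['income'].isnull()
print(df.groupby('income_missing')['churned'].mean())
# A large gap between the two group means suggests informative (MAR/MNAR) missingness:
# keep the flag as a feature instead of silently imputing and discarding the signal.
```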
```python
# Impute numeric with median, categorical with mode; flag imputed rows first
for col in df.select_dtypes(include='number').columns:
    if df[col].isnull().sum() > 0:
        df[f'{col}_missing'] = df[col].isnull().astype(int)
        df[col] = df[col].fillna(df[col].median())

for col in df.select_dtypes(include=['object', 'category']).columns:
    if df[col].isnull().sum() > 0:
        df[f'{col}_missing'] = df[col].isnull().astype(int)
        df[col] = df[col].fillna(df[col].mode()[0])
```
#### Step 3: Duplicate Detection

```python
# Exact duplicates
exact_dupes = df.duplicated().sum()
print(f"Exact duplicates: {exact_dupes}")

# Business-key duplicates (same customer + same order)
biz_dupes = df.duplicated(subset=['customer_id', 'order_id'], keep=False)
print(f"Business-key duplicates: {biz_dupes.sum()}")

# Resolution: keep first occurrence
df = df.drop_duplicates(subset=['customer_id', 'order_id'], keep='first')
```
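The prompt also asks for fuzzy duplicates, which `drop_duplicates` cannot catch (e.g., 'Jon Smith' vs. 'John Smith'). A minimal sketch using the standard library's `difflib`, assuming a hypothetical `customer_name` column; at 500K rows a blocking strategy or a dedicated library such as rapidfuzz is more practical:

```python
from difflib import SequenceMatcher

def is_similar(a: str, b: str, threshold: float = 0.9) -> bool:
    """Flag near-identical strings (0.9 is an arbitrary starting threshold)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Pairwise scan over unique names is O(n^2): fine for a few thousand values,
# not for hundreds of thousands. 'customer_name' is illustrative.
names = df['customer_name'].dropna().unique()
fuzzy_pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:] if is_similar(a, b)]
print(f"Candidate fuzzy-duplicate pairs: {len(fuzzy_pairs)}")
```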
#### Step 4: Outlier Detection

```python
# IQR method for numeric columns
def flag_outliers_iqr(series, factor=1.5):
    Q1, Q3 = series.quantile([0.25, 0.75])
    IQR = Q3 - Q1
    return (series < Q1 - factor * IQR) | (series > Q3 + factor * IQR)

df['price_outlier'] = flag_outliers_iqr(df['transaction_amount'])
print(f"Price outliers: {df['price_outlier'].sum()}")
```
#### Step 5: Validation Rules

```python
# Assumes order_date has already been parsed to datetime; na=False makes null emails fail the check
assert (df['transaction_amount'] >= 0).all(), "Negative transaction amounts found!"
assert df['email'].str.contains('@', na=False).all(), "Invalid email addresses found!"
assert df['order_date'].between('2020-01-01', '2026-12-31').all(), "Dates out of expected range!"
```
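Hard asserts stop at the first failure. When the goal is the quality report below, it is often more useful to count every violation first; a minimal sketch of the same three rules (the dict structure is an assumption, not part of the original example):

```python
# Count rule violations instead of aborting on the first failure
violations = {
    'negative_amounts': int((df['transaction_amount'] < 0).sum()),
    'bad_emails': int((~df['email'].str.contains('@', na=False)).sum()),
    'dates_out_of_range': int((~df['order_date'].between('2020-01-01', '2026-12-31')).sum()),
}
for rule, count in violations.items():
    print(f"{rule}: {count} rows")
```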
#### Data Quality Report
| Check | Before | After | Action Taken |
|-------|--------|-------|--------------|
| Total rows | 500,000 | 487,231 | Removed 12,769 duplicates |
| Missing (address) | 23.4% | 0% | Imputed with 'Unknown' + flag |
| Outliers (price) | 342 rows | 342 flagged | Kept but flagged, not removed |
| Date format | Mixed (US/EU) | ISO 8601 | Standardized to YYYY-MM-DD |
| Negative prices | 17 rows | 0 | Investigated — data entry errors, corrected |
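The date-format row above implies a parsing step. A minimal sketch, assuming pandas >= 2.0 for `format='mixed'`; genuinely ambiguous values such as 01/02/2020 cannot be auto-resolved and need a business rule (e.g., a known source locale):

```python
# Parse mixed-format dates to datetime; unparseable values become NaT for review
df['order_date'] = pd.to_datetime(df['order_date'], format='mixed', errors='coerce')
print(f"Unparseable dates: {df['order_date'].isnull().sum()}")
```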
## Tips for Best Results

- 💡 Always profile your data BEFORE cleaning — don't assume you know what's in there. A 5-minute profiling step can save hours of debugging.
- 💡 Document every cleaning decision in a data quality report. Future-you (or your teammate) will need to know why 12K rows were dropped.
- 💡 Don't delete outliers by default — flag them. In many business contexts, outliers ARE the interesting data (big deals, fraud, power users).
- 💡 Create a `_missing` flag column before imputing — the fact that data was missing is often a useful feature for models.