Dirty data is data that's duplicated, inconsistent, incomplete, or conflicting — mismatched units, part numbers coded differently across systems, missing fields, disagreeing definitions. Cleaning it means making records consistent, accurate, and reconciled, so BI and AI can rely on them.
It's the unglamorous middle of data engineering — and the step that decides whether everything above it is trustworthy.
Why it matters
Dirty data is expensive, even when it's invisible. Gartner puts the average cost of poor data quality at $12.9 million a year per organization, and the McKinsey Global Institute has linked poor-quality data to a ~20% drop in productivity and a ~30% rise in costs. On a factory floor that shows up as the scrapped batch, the wrong reorder, and the maintenance done too late. And for AI specifically, dirty data is fatal: a model trained on contradictions learns contradictions. Clean data isn't a nicety — it's the precondition for trusting any number you produce.
The common forms of dirty manufacturing data
Five problems account for most of it:
- Duplicates. The same part, order, or record entered more than once — inflating counts and confusing every downstream calculation.
- Mismatched units. Metric vs. imperial, per-unit vs. per-case, different time bases. Two true numbers that can't be compared because they're measured on different scales.
- Inconsistent identifiers. The same part, machine, or work order coded differently across ERP, MES, and QMS — so the systems literally can't tell they're talking about the same thing.
- Missing or incomplete fields. Gaps that quietly skew analysis and break models expecting complete records.
- Conflicting definitions. A "shift," a "good unit," or "downtime" defined differently by system or site — the root of the reports-that-disagree problem.
How to clean it
Cleaning tackles each problem deliberately:
- Deduplicate. Identify and merge duplicate records using consistent matching rules, so one real thing is represented once.
- Standardize units. Convert everything to one consistent unit system, so figures are genuinely comparable.
- Reconcile identifiers. Build one master — a single part master, machine registry, and work-order scheme — and map every system's codes to it.
- Handle missing data on purpose. Decide, per field, whether to fill (from a reliable source), flag, or exclude. Never ignore a gap silently.
- Agree on definitions. Settle on one definition per metric and enforce it everywhere, so "good unit" means the same thing across the plant.
The output is data that finally ties out — where the night shift's numbers and the ERP's numbers can be reconciled, and a single figure means one thing.
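The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production routine: the `PART_MASTER` mapping, the sample part codes, the `Record` shape, and the choice of kilograms as the standard unit are all hypothetical, and the "first valid record wins" merge rule is a simplification that a real part master would replace with its own matching and conflict rules.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from each system's local part code to one master ID.
PART_MASTER = {
    ("ERP", "P-00123"): "PART-123",
    ("MES", "123"): "PART-123",
    ("QMS", "p123"): "PART-123",
}

# Conversion factors to the assumed standard unit (kilograms).
TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}


@dataclass(frozen=True)
class Record:
    system: str
    part_code: str
    weight: Optional[float]
    unit: Optional[str]


def clean(records):
    """Return (cleaned, flagged): cleaned maps master ID -> weight in kg."""
    cleaned = {}
    flagged = []
    for rec in records:
        # Reconcile identifiers: look up the master ID for this system's code.
        master_id = PART_MASTER.get((rec.system, rec.part_code.strip()))
        if master_id is None:
            flagged.append((rec, "unmapped identifier"))
            continue
        # Handle missing data on purpose: flag it instead of guessing.
        if rec.weight is None or rec.unit is None:
            flagged.append((rec, "missing weight or unit"))
            continue
        # Standardize units: convert everything to kilograms.
        factor = TO_KG.get(rec.unit.lower())
        if factor is None:
            flagged.append((rec, "unknown unit"))
            continue
        weight_kg = round(rec.weight * factor, 3)
        # Deduplicate: keep one entry per master ID; the first valid record wins.
        cleaned.setdefault(master_id, weight_kg)
    return cleaned, flagged
```

Calling `clean()` on records from several systems returns one entry per master part plus a list of flagged records with the reason each was set aside, so nothing is dropped silently.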
Keeping it clean
Cleaning isn't a one-time scrub. New data flows in constantly, and without upkeep it drifts dirty again. Sustainable clean data needs automated quality checks built into the data pipelines, data governance to hold definitions and ownership steady, and ongoing monitoring to catch new issues early — part of continuous optimization. Clean once and walk away, and you'll be back to conflicting numbers within a year.
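One way to picture an automated quality check is a function that runs on every incoming batch and reports problems before the data lands. The sketch below is an assumption-laden example: the field names in `REQUIRED_FIELDS`, the single allowed unit, and the dict-per-row format are hypothetical, and a real pipeline would likely use its own schema and a dedicated validation tool.

```python
from collections import Counter

REQUIRED_FIELDS = ("part_id", "quantity", "unit")  # assumed schema
ALLOWED_UNITS = {"kg"}  # assumed plant-wide standard unit


def quality_issues(rows):
    """Return a list of human-readable issues found in a batch of dict rows."""
    issues = []
    # Duplicate check: the same part_id appearing more than once in the batch.
    counts = Counter(row.get("part_id") for row in rows)
    for part_id, n in counts.items():
        if part_id is not None and n > 1:
            issues.append(f"duplicate part_id {part_id} ({n} rows)")
    for i, row in enumerate(rows):
        # Completeness check: every required field must be present and non-empty.
        for field in REQUIRED_FIELDS:
            if row.get(field) in (None, ""):
                issues.append(f"row {i}: missing {field}")
        # Unit check: anything outside the standard unit set is reported.
        unit = row.get("unit")
        if unit not in (None, "") and unit not in ALLOWED_UNITS:
            issues.append(f"row {i}: non-standard unit {unit}")
    return issues
```

An empty list means the batch passed. A non-empty list can block the load or raise an alert, depending on how strict the pipeline needs to be.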
Where cleaning fits
Cleaning is the second move in building a connected data foundation, after connecting the sources and before structuring them for use. Skip it and "integration" just produces a bigger, faster mess: connected systems that still disagree. (For how it fits the whole build, see "How to integrate ERP, MES, and shop-floor data.") Done right, it's what turns connection into trust.
A real-world example
(Brief composite illustration — not a specific named client.)
A manufacturer running mixed metric and imperial equipment couldn't get a plant-wide yield number that made sense — half the lines reported in one unit system, half in the other, and the part master had drifted into duplicates over years of manual entry. Nobody trusted the rollup. The fix wasn't a new dashboard; it was cleaning: dedupe the part master, convert everything to one unit system, and lock in single definitions. Once the data tied out, the yield number was believable for the first time — and the analytics built on it actually got used.
Next steps
Data Readiness Scorecard
Gauge where your data stands before building anything on top of it.
Data Engineering
We build the pipelines and data layer that make every system downstream reliable.
Book a Discovery Call
See exactly how we'd approach this for your operation. No pitch decks.
Sources
- Gartner, *Magic Quadrant for Data Quality Solutions* — poor data quality costs organizations an average of $12.9 million/year.
- McKinsey Global Institute — poor-quality data linked to ~20% lower productivity and ~30% higher costs.