Cleaning Dirty Manufacturing Data: Deduplication, Units, and Standardization

The same part exists three times in your systems, under three slightly different numbers. One line reports output in pieces, another in cases. A "good unit" means one thing on the floor and another in the back office. None of it is anyone's fault — it's just what happens when systems grow up separately. But it's exactly why your reports never tie out and your AI can't be trusted. Here's what dirty manufacturing data looks like, what it costs, and how to clean it.

Dirty data is data that's duplicated, inconsistent, incomplete, or conflicting — mismatched units, part numbers coded differently across systems, missing fields, disagreeing definitions. Cleaning it means making records consistent, accurate, and reconciled, so BI and AI can rely on them.

It's the unglamorous middle of data engineering — and the step that decides whether everything above it is trustworthy.

Why it matters

Dirty data is expensive, even when it's invisible. Gartner puts the average cost of poor data quality at $12.9 million a year per organization, and the McKinsey Global Institute has linked poor-quality data to a ~20% drop in productivity and a ~30% rise in costs. On a factory floor that shows up as the scrapped batch, the wrong reorder, and the maintenance done too late. And for AI specifically, dirty data is fatal: a model trained on contradictions learns contradictions. Clean data isn't a nicety — it's the precondition for trusting any number you produce.

The common forms of dirty manufacturing data

Five problems account for most of it:

  • Duplicates. The same part, order, or record entered more than once — inflating counts and confusing every downstream calculation.
  • Mismatched units. Metric vs imperial, per-unit vs per-case, different time bases. Two true numbers that can't be compared because they're measured on different scales.
  • Inconsistent identifiers. The same part, machine, or work order coded differently across ERP, MES, and QMS — so the systems literally can't tell they're talking about the same thing.
  • Missing or incomplete fields. Gaps that quietly skew analysis and break models expecting complete records.
  • Conflicting definitions. A "shift," a "good unit," or "downtime" defined differently by system or site — the root of the reports-that-disagree problem.

How to clean it

Cleaning tackles each problem deliberately:

  • Deduplicate. Identify and merge duplicate records using consistent matching rules, so one real thing is represented once.
  • Standardize units. Convert everything to one consistent unit system, so figures are genuinely comparable.
  • Reconcile identifiers. Build one master — a single part master, machine registry, and work-order scheme — and map every system's codes to it.
  • Handle missing data on purpose. Decide, per field, whether to fill (from a reliable source), flag, or exclude — never silently ignore.
  • Agree definitions. Settle on one definition per metric and enforce it everywhere, so "good unit" means the same thing across the plant.
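As a rough illustration of the deduplicate and reconcile steps, here is a minimal Python sketch. The matching rule (uppercase, strip separators), the field names, and the sample part numbers are all assumptions made for the example; real matching rules have to be agreed with the people who own the part master.

```python
import re
from collections import defaultdict


def normalize_part_number(raw: str) -> str:
    """Hypothetical matching rule: drop separators and uppercase the rest."""
    return re.sub(r"[^A-Za-z0-9]", "", raw).upper()


def build_master(records):
    """Group records from several systems under one canonical part key."""
    master = defaultdict(list)
    for rec in records:
        master[normalize_part_number(rec["part_no"])].append(rec)
    return dict(master)


# Illustrative records: the same part coded three ways across systems.
records = [
    {"system": "ERP", "part_no": "AB-1001"},
    {"system": "MES", "part_no": "ab1001"},
    {"system": "QMS", "part_no": "AB 1001"},
    {"system": "ERP", "part_no": "AB-2002"},
]

master = build_master(records)
for key, recs in master.items():
    print(key, [r["system"] for r in recs])
```

With this rule the three spellings collapse into one master key, and each system's original code stays attached to it as a mapping. A rule this loose can also merge parts that are genuinely different, so candidate merges are worth reviewing before they are applied.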

The output is data that finally ties out — where the night shift's numbers and the ERP's numbers can be reconciled, and a single figure means one thing.
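To make "ties out" concrete, the sketch below converts output reported in pieces and in cases to a single base unit before totaling. The unit table, the pack size of 24, and the line figures are hypothetical; only the pound-to-kilogram factor is a fixed definition.

```python
# Each unit maps to (base unit, factor). The pack size is an illustrative
# assumption; lb -> kg uses the exact definition of the pound.
UNIT_TABLE = {
    "pieces": ("pieces", 1),
    "cases": ("pieces", 24),  # hypothetical: 24 pieces per case
    "kg": ("kg", 1.0),
    "lb": ("kg", 0.45359237),
}


def to_base(quantity, unit):
    """Convert a quantity to its base unit; fail loudly on unknown units."""
    if unit not in UNIT_TABLE:
        raise ValueError(f"unknown unit: {unit!r}")
    base_unit, factor = UNIT_TABLE[unit]
    return quantity * factor, base_unit


# Illustrative shift reports: one line counts pieces, the other counts cases.
line_reports = [("Line 1", 4800, "pieces"), ("Line 2", 150, "cases")]
total_pieces = sum(to_base(qty, unit)[0] for _, qty, unit in line_reports)
print(total_pieces)  # 8400
```

Raising an error on an unknown unit is a deliberate choice here: a silent pass-through would put an unconverted number into the rollup. In practice pack sizes usually vary by part, so the factor would likely come from the part master instead of a constant.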

Keeping it clean

Cleaning isn't a one-time scrub. New data flows in constantly, and without upkeep it drifts dirty again. Sustainable clean data needs automated quality checks built into the data pipelines, data governance to hold definitions and ownership steady, and ongoing monitoring to catch new issues early — part of continuous optimization. Clean once and walk away, and you'll be back to conflicting numbers within a year.
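An automated quality check can be as small as a function that runs on every incoming batch and reports what it finds. The following is a sketch under assumed field names, an assumed list of accepted units, and an assumed duplicate key (part number plus work order); it is not a description of any particular pipeline tool.

```python
REQUIRED_FIELDS = ("part_no", "quantity", "unit")  # hypothetical schema
KNOWN_UNITS = {"pieces", "cases"}  # hypothetical accepted units


def check_batch(records):
    """Return (row_index, problem) tuples; an empty list means the batch passed."""
    issues = []
    seen = set()
    for i, rec in enumerate(records):
        missing = [f for f in REQUIRED_FIELDS if rec.get(f) in (None, "")]
        if missing:
            # Flag incomplete rows instead of silently ignoring them.
            issues.append((i, f"missing fields: {', '.join(missing)}"))
            continue
        if rec["unit"] not in KNOWN_UNITS:
            issues.append((i, f"unknown unit: {rec['unit']}"))
        key = (rec["part_no"], rec.get("work_order"))
        if key in seen:
            issues.append((i, f"duplicate record for {key}"))
        seen.add(key)
    return issues


# Illustrative batch with one duplicate, one unknown unit, one missing value.
batch = [
    {"part_no": "AB1001", "work_order": "WO-1", "quantity": 500, "unit": "pieces"},
    {"part_no": "AB1001", "work_order": "WO-1", "quantity": 500, "unit": "pieces"},
    {"part_no": "AB2002", "work_order": "WO-2", "quantity": 20, "unit": "boxes"},
    {"part_no": "AB3003", "work_order": "WO-3", "quantity": None, "unit": "cases"},
]

for row, problem in check_batch(batch):
    print(row, problem)
```

The check only reports problems. Whether a flagged row is filled, quarantined, or excluded is a governance decision, which is why the function returns findings and leaves the batch untouched.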

Where cleaning fits

Cleaning is the second move in building a connected data foundation — after connecting the sources and before structuring them for use. Skip it and "integration" just produces a bigger, faster mess: connected systems that still disagree. (How it fits the whole build: How to integrate ERP, MES, and shop-floor data.) Done right, it's what turns connection into trust.

A composite example

(Brief composite illustration — not a specific named client.)

A manufacturer running mixed metric and imperial equipment couldn't get a plant-wide yield number that made sense — half the lines reported in one unit system, half in the other, and the part master had drifted into duplicates over years of manual entry. Nobody trusted the rollup. The fix wasn't a new dashboard; it was cleaning: dedupe the part master, convert everything to one unit system, and lock in single definitions. Once the data tied out, the yield number was believable for the first time — and the analytics built on it actually got used.

FAQs

Frequently asked questions

How dirty is our data likely to be?
Almost always dirtier than you think. Duplicate part numbers, mismatched units, and inconsistent definitions are nearly universal in manufacturing, and you usually don't see the full extent until a readiness audit maps it.

Can AI work around dirty data?
No. That's a costly myth. A model trained on contradictory or inconsistent data learns the contradictions, which is a leading reason AI pilots fail. Clean data comes first.

Is data cleaning a one-time project?
No. New data arrives constantly, so cleaning has to be sustained with automated checks and governance. A one-time scrub degrades back to dirty data without upkeep.

Sources

  • Gartner, *Magic Quadrant for Data Quality Solutions* — poor data quality costs organizations an average of $12.9 million/year.
  • McKinsey Global Institute — poor-quality data linked to ~20% lower productivity and ~30% higher costs.