How to Create Clean Data 

If you poll 100 data scientists on what they spend most of their time doing, 99 of them will answer: cleaning data.
Here are some simple steps that can simplify the process of cleaning data and reduce the time and cost of data analysis:

Identify Known Anomalies

Identifying anomalies is a key element of any analysis effort. Identify or remove known anomalies at the point of data capture to save time and money.


  • During machine startup and stops, sensors may produce values well outside the typical range seen during machine operation.
  • A newly replaced sensor may shift the measurement baseline. Mark sensor changes in the data set and consider calibrating new sensors.
  • Mark the data set when the machine is shut down improperly, for example after an unexpected loss of power.
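The markers above can be attached at the point of capture. Here is a minimal sketch in Python; the record fields (`state`, `sensor_replaced`, `improper_shutdown`) and the normal operating range are illustrative assumptions, not part of any real system:

```python
# Sketch: attach anomaly markers at the point of data capture.
# Field names and the normal range below are illustrative assumptions.

NORMAL_RANGE = (10.0, 90.0)  # expected reading range during steady operation

def annotate(record):
    """Return a copy of a raw sensor record with anomaly flags attached."""
    flags = []
    if record["state"] in ("startup", "shutdown"):
        flags.append("transient_state")       # readings may be far out of range
    if not NORMAL_RANGE[0] <= record["value"] <= NORMAL_RANGE[1]:
        flags.append("out_of_range")
    if record.get("sensor_replaced"):
        flags.append("new_sensor_baseline")   # baseline may have shifted
    if record.get("improper_shutdown"):
        flags.append("improper_shutdown")     # e.g. unexpected loss of power
    return {**record, "flags": flags}

annotate({"state": "startup", "value": 250.0})  # flagged as transient and out of range
```

Flagging rather than deleting preserves the raw reading, so an analyst can later decide whether to exclude it.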

Normalize Data

Develop a standard for how data should be captured so that all systems record data in one consistent format; this can save valuable time.


  • Is a temperature reported in Fahrenheit or Celsius?
  • Is it reported as an integer or a floating point? How many decimal places are required?
  • If the value can be negative, what format is used to record the negative number?
  • What is the maximum and minimum value for the given sensor?
  • What values indicate failure for a given sensor?
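The answers to these questions can be encoded in a single normalization routine that every system runs at capture time. A minimal sketch for temperature, assuming a canonical format of Celsius stored as a float with one decimal place (the specific convention here is an illustrative assumption):

```python
# Sketch: normalize temperature readings to one canonical format.
# Assumed convention: Celsius, floating point, one decimal place.

def normalize_temperature(value, unit):
    """Convert a reading to Celsius, rounded to one decimal place."""
    if unit == "F":
        value = (value - 32.0) * 5.0 / 9.0
    elif unit != "C":
        raise ValueError(f"unknown temperature unit: {unit}")
    return round(float(value), 1)

normalize_temperature(212, "F")  # boiling point of water, in the canonical format
```

Rejecting unknown units loudly, instead of passing them through, keeps a mislabeled sensor from silently polluting the data set.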


Time is the most critical parameter in time series data. Develop a time stamp strategy that can be deployed across all systems, including time calibration and time adjustments.

Key time-stamp considerations include:

  • The time format, including date and year
  • Time resolution
  • How the time is set and calibrated
  • How the time is verified, and how often
  • How time errors are reported
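One common way to settle the format, resolution, and time-zone questions at once is a single helper used by every system. A sketch assuming one possible convention, UTC in ISO 8601 form with millisecond resolution (the convention itself is an assumption, not a requirement from this article):

```python
# Sketch: one time-stamp convention applied everywhere.
# Assumed convention: UTC, ISO 8601, millisecond resolution.

from datetime import datetime, timezone

def capture_timestamp(now=None):
    """Return the given (or current) time as a UTC ISO 8601 string
    with millisecond resolution."""
    now = now or datetime.now(timezone.utc)
    return now.isoformat(timespec="milliseconds")

capture_timestamp(datetime(2024, 1, 2, 3, 4, 5, 678000, tzinfo=timezone.utc))
```

Fixing the resolution in the format also answers the "how many decimal places" question for time, the same way the data plan answers it for sensor values.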

Clean Data by Design

The best way to ensure a data set is clean is by designing it to be clean from the start.


Data Plan Document

The goal of this document is to define formats for all the data types that could be captured. This can simplify comparisons across different systems and eliminate the need for significant normalization efforts.

The plan does not explicitly define all data that will be collected or every sensor type that will be used. Attempting to create a very specific data plan may force engineers to work outside the specification. A general specification that can be applied consistently across all systems is preferred.
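Such a general specification can also live as a small machine-readable table that capture code validates against. A sketch, where the measurement names, units, and ranges are purely illustrative:

```python
# Sketch: a data plan as a small, general specification that capture
# code checks against. All names, units, and ranges are illustrative.

DATA_PLAN = {
    "pressure":    {"unit": "psi",     "dtype": "float32", "min": 0.0,   "max": 1024.0},
    "temperature": {"unit": "celsius", "dtype": "float32", "min": -40.0, "max": 150.0},
}

def validate(measurement, value):
    """Check a captured value against the data plan before storing it."""
    spec = DATA_PLAN[measurement]
    if not spec["min"] <= value <= spec["max"]:
        raise ValueError(f"{measurement} value {value} outside plan range")
    return value
```

Because the plan names measurement types rather than individual sensors, a new sensor model slots in without changing the specification.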

Data Plan Measurements 

The data plan should define units used for all measurements in the systems.
Here is a simple list of measurements and a few possible units of measure for each:

  • Pressure: Pascal, PSI
  • Mass: Gram, Pound, Ton
  • Distance: Yards, Meters
  • Temperature: Celsius, Fahrenheit
  • Volume: Liter, Gallon


Numerical Format

The numerical format for each sample value should be defined in the data plan, and an effort should be made to minimize the number of different formats supported. Selecting a slightly larger numerical format, so that all values can be stored in it, may simplify programming later. For example, the document may define that all pressure measurements are made in pounds per square inch and stored as 32-bit floating point values, allowing a maximum value of 2^10 with a step size of 0.5, and a step size of 0.0005 for numbers with an absolute value less than 1. This value might fit into a 16-bit floating point value, but 32-bit is chosen because that level of accuracy is required for distance measurements.
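The trade-off between 16-bit and 32-bit storage can be checked directly. A sketch using Python's standard `struct` module, which can pack half-precision (`"e"`) and single-precision (`"f"`) floats; the sample values are illustrative:

```python
# Sketch: compare what survives 16-bit vs. 32-bit float storage.
import struct

def roundtrip_f16(x):
    """Store x as a 16-bit float and read it back."""
    return struct.unpack("e", struct.pack("e", x))[0]

def roundtrip_f32(x):
    """Store x as a 32-bit float and read it back."""
    return struct.unpack("f", struct.pack("f", x))[0]

# A fine step near 1 survives 32-bit storage but is coarsened in 16-bit:
roundtrip_f32(0.9995)
roundtrip_f16(0.9995)

# Near 2^10, a 16-bit float's step size grows to 1.0, losing sub-unit detail:
roundtrip_f16(1024.25)  # 1024.0
roundtrip_f32(1024.25)  # 1024.25, stored exactly
```

This is the kind of quick check that lets the data plan justify its format choices with numbers rather than intuition.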


What Could Go Wrong?

Planning for and creating clean data with additional context markers is a significant investment, one that is easily justified when the consequences of data analysis errors are considered.

Data analytics is an iterative process, commonly performed by someone several times removed from the actual company operation. The process is slow and may take some time to generate measurable company-wide performance results. Typically, an effort focuses on simple conclusions that are then tested and analyzed over a period of months. A poor decision based on a data anomaly that is not well understood can hurt the performance of the company by wasting team members' time, as the following example shows.

Sampling Data Too Infrequently

Assume a company is collecting data in the field, and the system records pallets stacked per month. In the graph to the right, the number of units produced drops by 50% in the month before a customer returns a machine. This may look like an early indication that a user is going to terminate their lease and return the machine.



If data were collected per day or per hour, it would show that customers use the product at the same level right up until they return the system. The falloff appears because leases terminate in the middle of the month. Additional context (the lease return date) or a higher sampling frequency (daily) would eliminate this incorrect conclusion.
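The effect is easy to reproduce with invented numbers that mirror the example: a lease ending mid-month halves the monthly total even though daily usage never drops.

```python
# Sketch: the same usage data summed monthly vs. viewed daily.
# All numbers are invented to mirror the example in the text.

DAILY_UNITS = 10                             # constant daily production
FULL_MONTH = [DAILY_UNITS] * 30              # customer active all month
HALF_MONTH = [DAILY_UNITS] * 15 + [0] * 15   # lease ends on day 15

# Monthly totals suggest a 50% usage drop before the return...
monthly_view = [sum(FULL_MONTH), sum(HALF_MONTH)]   # [300, 150]

# ...but the daily view shows flat usage right up to the return date.
daily_view = HALF_MONTH[:15]                         # all 10s
```

The aggregation window, not customer behavior, created the apparent drop.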

At its core, Data Mining is the process of finding relevant patterns in data

Unfortunately, there are also many irrelevant patterns in the data, and spending time on them is costly. By investing time upfront, product engineers developing data capture solutions can create data sets that data scientists can make dance, opening the door to a better customer experience and more sales.


Questions about how to clean your data? Contact one of our IIoT experts.
