How to Create Clean Data 

Poll 100 data scientists on what they spend most of their time doing, and 99% of the time the answer will be cleaning data.
Here are some simple steps that make cleaning data easier and reduce the time and cost of data analysis:

Identify Known Anomalies

Identifying anomalies is a key element of any analysis effort. Mark or remove known anomalies at the point of data capture to save time and money.

Examples:

  • During machine startups and stops, sensors may produce values well outside the typical range seen during machine operation.
  • A newly replaced sensor may produce a shift in the measurement baseline. Mark sensor changes in the data set and consider calibrating new sensors.
  • Mark the data set when the machine is shut down improperly, for example after an unexpected loss of power.
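As a minimal sketch of marking known anomalies at capture time, the Python below attaches context flags to each reading. The `Reading` type, the flag names, and the 30-second startup window are all assumptions made for illustration, not part of any particular system.

```python
from dataclasses import dataclass, field

# Assumed warm-up period; a real value would come from the machine's behavior.
STARTUP_WINDOW_S = 30.0


@dataclass
class Reading:
    seconds_since_power_on: float  # hypothetical time base for this sketch
    value: float
    flags: set = field(default_factory=set)


def mark_known_anomalies(readings, sensor_replaced_at=None,
                         startup_window_s=STARTUP_WINDOW_S):
    """Attach context flags instead of silently dropping data."""
    for r in readings:
        # Values captured during startup may be far outside the normal range.
        if r.seconds_since_power_on < startup_window_s:
            r.flags.add("startup")
        # A replaced sensor may shift the baseline, so mark everything after it.
        if (sensor_replaced_at is not None
                and r.seconds_since_power_on >= sensor_replaced_at):
            r.flags.add("new_sensor")
    return readings


readings = [Reading(5.0, 250.0), Reading(60.0, 71.5), Reading(120.0, 74.0)]
marked = mark_known_anomalies(readings, sensor_replaced_at=100.0)
steady_state = [r for r in marked if "startup" not in r.flags]
```

Keeping the flagged readings, rather than deleting them, lets an analyst decide later whether a marked value should be excluded.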

Normalize Data

Develop a standard for how data should be captured, so that the recorded data is in one standard format across all systems. This can save valuable time.

Examples:

  • Is a temperature reported in Fahrenheit or Celsius?
  • Is it reported as an integer or a floating point? How many decimal places are required?
  • If the value can be negative, what format is used to record the negative number?
  • What are the maximum and minimum values for the given sensor?
  • What values indicate failure for a given sensor?
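The questions above can be answered once and encoded in a single normalization step. The sketch below assumes Celsius as the standard unit, one decimal place, a valid range of -40 to 125, and -9999.0 as the failure value; all of these are hypothetical choices that a real data standard would define.

```python
from typing import Optional, Tuple

# All constants below are assumptions for illustration only.
TEMP_MIN_C = -40.0
TEMP_MAX_C = 125.0
FAILURE_SENTINEL = -9999.0
DECIMALS = 1


def normalize_temperature(raw: float, unit: str) -> Tuple[Optional[float], str]:
    """Return (temperature in Celsius or None, status string)."""
    # A known failure value is reported as such, not treated as a measurement.
    if raw == FAILURE_SENTINEL:
        return None, "sensor_failure"
    if unit == "C":
        celsius = raw
    elif unit == "F":
        celsius = (raw - 32.0) * 5.0 / 9.0
    else:
        raise ValueError(f"unknown temperature unit: {unit!r}")
    # Values outside the sensor's stated range are marked, not passed through.
    if not (TEMP_MIN_C <= celsius <= TEMP_MAX_C):
        return None, "out_of_range"
    return round(celsius, DECIMALS), "ok"


print(normalize_temperature(212.0, "F"))    # (100.0, 'ok')
print(normalize_temperature(-9999.0, "C"))  # (None, 'sensor_failure')
```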

Time

Time is the most critical parameter in time series data. Develop a time stamp strategy that can be deployed across all systems, including time calibration and time adjustments.

Questions a time stamp strategy should answer:

  • What is the time format, including date and year?
  • What is the time resolution?
  • How is the time set and calibrated?
  • How is the time verified, and how often is it verified?
  • How are errors reported?
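One possible time stamp strategy, sketched here under assumptions, is to store every time stamp as ISO 8601 in UTC with millisecond resolution and to compare the device clock against a trusted reference on a schedule. The function names and the choice of millisecond resolution are illustrative only.

```python
from datetime import datetime, timedelta, timezone


def to_utc_iso(dt: datetime) -> str:
    """Format a time stamp as ISO 8601 in UTC with millisecond resolution."""
    # Refuse naive time stamps: without an offset the instant is ambiguous.
    if dt.tzinfo is None:
        raise ValueError("time stamp must carry a UTC offset")
    return dt.astimezone(timezone.utc).isoformat(timespec="milliseconds")


def clock_drift_seconds(device_time: datetime, reference_time: datetime) -> float:
    """Positive result means the device clock is ahead of the reference."""
    return (device_time - reference_time).total_seconds()


# A reading taken at 08:30 local time on a machine running at UTC-5.
local = datetime(2024, 3, 1, 8, 30, 0, tzinfo=timezone(timedelta(hours=-5)))
print(to_utc_iso(local))  # 2024-03-01T13:30:00.000+00:00
```

Recording the measured drift alongside the data gives analysts a way to correct or discount readings from a clock that was found to be wrong.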

Questions on how to leverage data to improve the quality of your products?

Download our whitepaper on leveraging the data in your machines. Learn more here:

Leveraging IIoT Data Whitepaper

Data Plan Measurements 

The data plan should define the units used for all measurements in the system.
Here is a simple list of measurements and a few possible units of measure for each:

  • Pressure (Pascal, PSI)
  • Mass (Gram, Pound, Ton)
  • Distance (Yards, Meters)
  • Temperature (Celsius, Fahrenheit)
  • Volume (Liter, Gallon)
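A data plan like this can also be written down in machine-readable form. The sketch below is one hypothetical layout: each measurement names a single standard unit and the factors for converting other units into it. The choice of standard units is an assumption. Temperature is left out because its conversion needs an offset as well as a factor, and "Ton" is left out because the plan would first have to say which ton is meant.

```python
# Hypothetical data plan: one standard unit per measurement, plus the
# multiplication factors that convert other accepted units into it.
DATA_PLAN = {
    "pressure": {"unit": "Pa", "factors": {"Pa": 1.0, "PSI": 6894.757}},
    "mass": {"unit": "g", "factors": {"g": 1.0, "lb": 453.59237}},
    "distance": {"unit": "m", "factors": {"m": 1.0, "yd": 0.9144}},
    "volume": {"unit": "L", "factors": {"L": 1.0, "gal_us": 3.785411784}},
}


def to_standard(measurement: str, value: float, unit: str) -> float:
    """Convert a value into the standard unit named in the data plan."""
    try:
        factor = DATA_PLAN[measurement]["factors"][unit]
    except KeyError:
        # Reject anything the plan does not cover rather than guessing.
        raise ValueError(f"{measurement!r} in {unit!r} is not in the data plan")
    return value * factor


yards = 100.0
meters = to_standard("distance", yards, "yd")
print(f"{yards} yd = {meters:.2f} {DATA_PLAN['distance']['unit']}")
```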

What Could Go Wrong?

Planning for and creating clean data with additional context markers is a significant investment, but one that is easily justified when the consequences of data analysis errors are considered.

Data analytics is an iterative process commonly performed by someone several times removed from the actual company operation. The process is slow and may take some time to generate measurable company-wide performance results. Typically, an effort focuses on simple conclusions that are then tested and analyzed over a period of months. A poor decision based on a data anomaly that is not well understood can hurt the performance of the company by wasting team members' time, as shown in the following example.

Sampling Data Too Infrequently

Assume a company is collecting data in the field. The system records the number of pallets stacked each month. In the graph, the number of units produced drops by 50% in the month before a customer returns a machine. This might be read as an early indication that a user is going to terminate their lease and return the machine.

[Graph: units produced per month, dropping by 50% before a machine is returned]

Solution

If the data were collected per day or per hour, it would show that customers use the product at the same level right up until they return the system. The falloff appears only because leases terminate in the middle of the month, so the final monthly total covers only part of a month. Additional context (the lease return date) or a higher sampling frequency (daily) would eliminate this incorrect conclusion.
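The effect is easy to reproduce with made-up numbers. The sketch below assumes a customer who stacks a constant 100 pallets every day and whose lease ends on the 15th of the month; the figures and function names are invented for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta


def simulate_daily(start: date, last_day: date, per_day: int) -> dict:
    """Constant daily usage from start through last_day, inclusive."""
    daily = {}
    day = start
    while day <= last_day:
        daily[day] = per_day
        day += timedelta(days=1)
    return daily


def monthly_totals(daily: dict) -> dict:
    """Sum the daily counts into (year, month) buckets."""
    totals = defaultdict(int)
    for day, count in daily.items():
        totals[(day.year, day.month)] += count
    return dict(totals)


# Assumed scenario: steady usage, lease returned in the middle of June.
daily = simulate_daily(date(2024, 4, 1), date(2024, 6, 15), per_day=100)
print(monthly_totals(daily))
# {(2024, 4): 3000, (2024, 5): 3100, (2024, 6): 1500}
print(set(daily.values()))
# {100}
```

The monthly view shows June at roughly half of May, while the daily view shows no change in usage at all. The drop comes entirely from the partial month.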

At its core, Data Mining is the process of finding relevant patterns in data

Unfortunately, there are many irrelevant patterns in the data, and spending time on irrelevant data is costly. By investing time up front, product engineers developing data capture solutions can create data sets that data scientists can make dance, opening the door to a better customer experience and more sales. Learn more here:

Leveraging IIoT Data Whitepaper