Data Engineering

Data Engineering – Only Some Data Matters

There is a critical difference between classical statistical analysis and Internet-of-Things (IoT) analytics: in classical statistics, most data matter; in the Internet-of-Things, most data do not.
Why do I care?

Because if this is true, then the conventional approach of putting raw data into historians and “data lakes” in the hope of being able to use statistics to make sense of it later will not work. To overcome this, IoT Architects need a Data Engineering step in their Information Value Chain.

Think about it in terms of the famous thought experiment often attributed to the 18th-century Irish philosopher George Berkeley:

If a tree falls in a forest and no one is around to hear it, does it make a sound?

Let’s restate this:

If a datum is collected, but it does not affect anything I care about, does it matter?

And the answer is NO.

Data does not have value in its own right, but only to the extent that it explains or predicts an outcome which itself has value.

Humans determine value, not things

The most important thing to understand when monitoring, measuring and analyzing the real world is that humans, not the “things” themselves, determine whether “things”, and what happens to those “things”, have value. Value may differ for different people, but the fact that event “X” happened to object “Y” has no value in its own right. For it to be valuable, we must be able to connect the event to a consequence for people.

Connecting events to consequences is a complicated and shifting process.

  • The connection is rarely direct: one event often has to trigger another and another before the final human impact comes to pass.
  • The event is rarely the only thing happening at any given time: often many other events are being logged simultaneously, but most turn out to be inconsequential.
  • Events are ephemeral: the event may well — in fact often does — disappear before all its consequences have run their course. It may even have disappeared before its consequences are first observed.
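The three properties above suggest one practical consequence: because events are ephemeral and consequences arrive late, something has to retain recent events long enough to connect them to the consequences that eventually appear. A minimal sketch of that idea, in which every name (`EventBuffer`, `horizon_s`, `candidate_causes`) is our own illustrative assumption, not a reference to any real product:

```python
from collections import deque

class EventBuffer:
    """Keep a bounded, time-stamped log of ephemeral events so that
    when a consequence is finally observed, we can still look back
    for its candidate causes."""

    def __init__(self, horizon_s=300):
        self.horizon_s = horizon_s      # how far back a cause may lie
        self.events = deque()           # (timestamp, event_id) pairs

    def record(self, ts, event_id):
        self.events.append((ts, event_id))
        # discard events older than the lookback horizon
        while self.events and self.events[0][0] < ts - self.horizon_s:
            self.events.popleft()

    def candidate_causes(self, consequence_ts):
        # every retained event inside the horizon is a candidate cause
        return [e for ts, e in self.events
                if consequence_ts - self.horizon_s <= ts <= consequence_ts]
```

The buffer does not decide which event mattered; it only preserves the candidates long enough for a later step to test them against the consequence.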
To make this real, let’s look at the example of machinery faults

When managing machinery, it has long been commonplace among leading manufacturers to collect fault codes from machines in the hope of using them to work out why production was lost and to prevent that loss from recurring. As any Automation Engineer will tell you, the problem these days is not so much collecting the fault codes as distinguishing the codes that matter from those that do not.

  • A fault is often followed by a blizzard of fault codes – tens, even hundreds of them. In the midst of this noise is the report of the fault that actually caused the stoppage. Most or all of the rest are irrelevant, but cannot be ignored, because in a different context, they might matter.
  • That machine’s fault may not actually cause a loss. If the machine is not a constraint — think “on the critical path” in project management terms — then the machine may just “catch up” with the rest of the line when it restarts. In truth, this machine can continue stopping intermittently like this forever. So long as it continues not to affect the results a human cares about, knowing that it stopped or why it is stopping will probably have no value at all.
  • Which machine in a process is the constraint may change from moment to moment as machines, material and people interact with each other.
  • Similarly, the context makes a difference: if the machine is shut down anyway — for the weekend, for maintenance, for a lunch break — then a very real fault may occur, but have no impact on performance.
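The bullets above amount to a simple filtering rule: a fault code can only cost output if it occurred on the machine that is currently the constraint, while the line was scheduled to run. A hypothetical sketch of that rule, with all field names, machine names, and codes invented for illustration:

```python
def is_consequential(fault, constraint_machine, line_scheduled):
    """Return True only when this fault can actually cost output."""
    if not line_scheduled:              # weekend, maintenance, lunch break
        return False
    if fault["machine"] != constraint_machine:
        return False                    # non-constraint machines catch up
    return True

# A "blizzard" of fault codes, only one of which is on the constraint
faults = [
    {"machine": "filler",  "code": 104},   # on the constraint machine
    {"machine": "capper",  "code": 221},   # will catch up after restart
    {"machine": "labeler", "code": 17},    # part of the fault blizzard
]

relevant = [f for f in faults
            if is_consequential(f, constraint_machine="filler",
                                line_scheduled=True)]
```

Note that the filter's inputs — which machine is the constraint, whether the line is scheduled — are themselves context that shifts over time, which is why a stateless filter like this is only a first approximation.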
Let us introduce some terminology
  • Data about an event is consequential if that event increases or decreases value to a human being or organisation; inconsequential if its occurrence does not change that value.
  • Determining whether data is consequential or not and measuring its impact is “Data Engineering.”
  • How a consequential event and its data affect the value delivered is a mechanism.
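The definitions above can be made concrete in code: treat a mechanism as a function mapping an event to a change in delivered value, and call an event's data consequential exactly when some mechanism yields a non-zero change. A purely illustrative encoding, in which the mechanisms and event fields (`lost_output`, `scrap_cost`, `on_constraint`) are all our own assumptions:

```python
def lost_output(event):
    # mechanism: a stoppage on the constraint loses saleable units
    return -event["duration_min"] * 10 if event["on_constraint"] else 0

def scrap_cost(event):
    # mechanism: a quality fault scraps in-process material
    return -event.get("units_scrapped", 0) * 2

MECHANISMS = [lost_output, scrap_cost]

def consequential(event):
    """An event is consequential iff some mechanism links it
    to a non-zero change in value."""
    return any(m(event) != 0 for m in MECHANISMS)
```

The point of the encoding is the direction of the test: we do not ask whether the data looks interesting, we ask whether any known mechanism connects it to value.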

Previously we have suggested the Information Value Chain to readers as a useful conceptual model for getting from real-world data to value. Converting data to information is a critical link in this value chain. In those earlier, more general discussions, we did not specify how to achieve this conversion: in different domains, different tools may be appropriate. Clearly in IoT, a Data Engineering step is critical to this conversion. How, then, should this Data Engineering step be implemented?

Given the indeterminate nature of the connections between events and consequences, the indeterminate temporal relationships between them, and the overwhelming volume of inconsequential data compared with consequential data, it will be no surprise that we have found traditional statistical tools, such as ANOVA, inadequate to the task:

  • There is simply too much noise in the data to be confident in the results.
  • Even where the data suggests a relationship, the mechanism can rarely be established with any confidence.
  • Without a mechanism, the diagnosis cannot be prescriptive — you cannot tell people what knobs to turn to achieve the desired outcome.

Fraysen Systems has found the Digital Twin model we described previously to be a very effective tool for Data Engineering. Its stateful nature lends itself to separating the data that matter from the data that do not as the context changes, and because it effectively runs tests on multiple mechanisms in parallel, its results tend to translate easily into prescriptions for action. We recommend the real-time, continuous simulation achieved by a Digital Twin for your consideration when deciding how to implement this critical step in your Information Value Chain.
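To show what "stateful filtering" means in practice, here is a minimal sketch of the idea, and only the idea: a twin of the line consumes the raw event stream, tracks which machine is currently the constraint and whether production is scheduled, and passes through only the events a mechanism connects to lost value. The class, event kinds, and field names are our own illustrative assumptions, not Fraysen Systems' actual implementation.

```python
class LineTwin:
    """Stateful filter: the same fault code is kept or dropped
    depending on the line's current state."""

    def __init__(self, constraint):
        self.constraint = constraint    # shifts as machines interact
        self.scheduled = True

    def apply(self, event):
        """Update state; return the event only if it is consequential."""
        kind = event["kind"]
        if kind == "constraint_moved":
            self.constraint = event["machine"]   # context shift
            return None
        if kind == "schedule":
            self.scheduled = event["running"]
            return None
        if kind == "fault":
            # mechanism test: a fault only costs output on the
            # constraint machine while production is scheduled
            if self.scheduled and event["machine"] == self.constraint:
                return event
        return None

twin = LineTwin(constraint="filler")
stream = [
    {"kind": "fault", "machine": "capper", "code": 221},   # dropped
    {"kind": "constraint_moved", "machine": "capper"},
    {"kind": "fault", "machine": "capper", "code": 221},   # now kept
    {"kind": "schedule", "running": False},
    {"kind": "fault", "machine": "capper", "code": 221},   # dropped again
]
kept = [e for e in (twin.apply(ev) for ev in stream) if e]
```

The identical fault code appears three times; only the middle occurrence survives, because only then is the capper the constraint and the line scheduled. That context-dependence is what a stateless historian query cannot express.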