When working with data, people intuitively know that it represents the world around them. However, few see data through the lens of signal processing. As part 2 of our journey through architecting a safety science platform, let’s look at how data drives decisions about the pipeline. After all, the type of information handled has a significant impact on the tools we use.
Data is never truly static.
The object under measurement — the stock ticker, the temperature of a pot on the stove, the motion of a person, the playful barking of a nearby dog, your ability to hear and perceive the world around you — is in analog motion. You may gather a sample of the data, and for a fleeting moment of time capture a value that represents the state of the object, but this is only a snapshot.
It follows that data always has a timestamp, whether or not it appears in the data explicitly.
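This idea can be made concrete with a minimal sketch: even when a source provides no timestamp of its own, the act of sampling implies one. The `read_sensor` callable and the `Sample` type below are stand-ins invented for illustration, not part of any particular library.

```python
import time
from dataclasses import dataclass

@dataclass
class Sample:
    timestamp: float  # seconds since the epoch, attached at capture time
    value: float

def capture(read_sensor) -> Sample:
    # Even if the source carries no timestamp, sampling implies one:
    # the moment the snapshot of the analog world was taken.
    return Sample(timestamp=time.time(), value=read_sensor())

# read_sensor here is a stand-in for any analog source being polled
snapshot = capture(lambda: 21.5)
```

The value is only valid for the instant recorded in `timestamp`; a moment later, the object under measurement has already moved on.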
Precision and accuracy are critical.
The number of digits used to represent the thing in the world you are tracking, its precision, is key to a relevant solution. Each choice along the way defines how precise your engineered solution will be.
In creating models that predict, we must also consider accuracy, by definition “the degree to which the result of a measurement, calculation, or specification conforms to the correct value or a standard.” The data elements we aim for have both high precision and high accuracy; with anything less, the prediction, the model, and the target solution break down.
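The distinction is easy to blur, so here is a small sketch with made-up sensor readings (the numbers are illustrative assumptions, not real measurements): one sensor reports many digits but is consistently biased, the other is close to the truth but coarse.

```python
true_temp = 100.0  # the "correct value" the measurements are judged against

# A precise but inaccurate sensor: many digits, consistently off by about +2.
precise_inaccurate = [102.0013, 102.0011, 102.0012]

# An accurate but imprecise sensor: centered on the truth, few digits.
accurate_imprecise = [100.0, 99.9, 100.1]

def mean(xs):
    return sum(xs) / len(xs)

# Accuracy error: how far the average sits from the correct value.
bias = mean(precise_inaccurate) - true_temp

# Precision: how tightly repeated readings cluster together.
spread = max(accurate_imprecise) - min(accurate_imprecise)
```

Only a source that keeps both `bias` and `spread` small gives the high-precision, high-accuracy data a predictive model needs.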
Data is an estimation.
Numbers represent an estimation of the world around us; they are exactly as imperfect as we choose them to be. The models we build with these numbers carry built-in inaccuracies that must be understood to grasp the problem.
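Quantization is the simplest example of choosing your own imperfection: the resolution you select bounds how wrong the stored number can be. The `quantize` helper below is a hypothetical illustration, not a library function.

```python
def quantize(value: float, step: float) -> float:
    # Snap a continuous value to a grid of the chosen resolution.
    # The worst-case error is half a step: we selected that imperfection.
    return round(value / step) * step

reading = 3.14159
coarse = quantize(reading, 0.1)    # stored as 3.1, error bounded by 0.05
fine = quantize(reading, 0.001)    # stored as 3.142, error bounded by 0.0005
```

Neither stored value is the "real" reading; each is an estimate whose error bound was decided the moment the step size was picked.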
Data travels in a pipeline.
If data is collected in one spot and used in another, then information arrives in discrete samples; examples include a person’s ears, or any digital system where information is sampled. Sampled data flows, and as it flows, each part of the system must be in sync to receive the information in a timely fashion. If any part of the system is incapable of managing the flow of data for some reason, then the system breaks down. The breakdown causes ripple effects in the system and data.
As data crosses a system, it will be “caught up in the gears,” so to speak. Data in motion relies on transport mechanisms to move it from one place to another, and those mechanisms can only move so quickly. There are many of them: logical elements of code, interactions between components, and the physical properties of the electronics that make up a small or large system. The performance bottleneck will “move” to a new part of the system as the system is modified or performance changes in one area. Understanding where the data bottleneck sits is critical to capacity planning.
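A bounded queue is the smallest model of such a transport mechanism. In this sketch (a toy, with no real consumer attached), the producer outruns the queue’s capacity, and the bottleneck surfaces as dropped samples that can be counted:

```python
from queue import Full, Queue

# A bounded queue models a transport with finite capacity. When the
# downstream side falls behind, the producer must block, drop, or fail;
# the bottleneck becomes visible at exactly this boundary.
pipe = Queue(maxsize=3)

dropped = 0
for sample in range(10):
    try:
        pipe.put_nowait(sample)   # non-blocking put surfaces the bottleneck
    except Full:
        dropped += 1              # capacity planning starts by counting these

# With no consumer draining the queue, only the first 3 samples fit.
```

Whether you block, drop, or buffer more is a design choice, but each option moves the pressure somewhere else in the system rather than eliminating it.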
Pipelines will break.
Just like in electrical engineering and code, parts of the system will sometimes “break.” There will be an ingestion problem, there will be data that was never supposed to be in the system to begin with, or your system will need to deal with a bunch of data that it was never built to handle. In these cases, you will need to know and accept this reality: Things should break.
If they do not break visibly, problems in the data pipeline will go unnoticed for long periods, and the data will be unusable for far longer than it would have taken you to fix a clean break. (This is also a friendly reminder: build a monitoring system; it’s worth it.)
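A monitoring system does not have to start out elaborate. Here is a minimal staleness-monitor sketch, under the assumption that each pipeline stage can record a heartbeat; the stage names and the `heartbeats` structure are illustrative, not a prescribed design.

```python
import time

# Each stage records a heartbeat; anything that has not reported within
# its deadline is flagged rather than silently ignored.
heartbeats: dict[str, float] = {}

def beat(stage: str) -> None:
    heartbeats[stage] = time.monotonic()

def stale_stages(deadline_s: float) -> list[str]:
    now = time.monotonic()
    return [s for s, t in heartbeats.items() if now - t > deadline_s]

beat("ingest")
beat("transform")
# A stage that breaks simply stops beating, and shows up in stale_stages().
```

The point is the inversion: a broken stage announces itself by its silence, instead of quietly poisoning everything downstream.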
These principles are all important as we build, architect, and control information flows. However, they’re difficult to put into practice without real-world examples. In the next installment of this series, we dive in and use information in these contexts to see how the pieces interconnect.