It’s that time of the year when everyone thinks about time. As the Earth makes another trip around the sun, all around the world there will be events, celebrations, reflections, and resolutions. We will be looking backward and also contemplating the future. When better to think about how time comes into play when we look at data? What sort of assumptions are we making when we create and use data that includes some aspect of time? How do these assumptions influence the kinds of decisions we make? The reality is that much of what we do when we think about time is relative, and somewhat subjective. The inherent “squishiness” of dealing with time introduces some fascinating nuances that we may not realize. So without delay, let’s seize the day and take a look at time in data.
Time is quirky: Observing the basics
From the outset, time is quirky. At first, it seems like the most precise of things. We have stopwatches, chronometers, and expensive timepieces that are intended to keep track of time within tiny fractions of a second. But what exactly are they keeping track of? Let’s start with that “new year” we are about to celebrate. Do we actually think there is some line in space that the Earth will pass through to mark another trip around the sun? Let’s say someone asked me my age. I would tell them how many “years” old I am. If they wanted to know how many sunrises I had experienced, they might calculate the answer by multiplying that age by 365 (the number of days in a year), right? Of course not. We have leap years. Our calendar is imprecise because the Earth makes roughly 365.2422 diurnal rotations during the time it takes to return to the same place in its orbit around the sun. So every 4 years, we add a day to February to create a year with 366 days. But since 0.2422 is not exactly one quarter, there is an exception: a year divisible by 100 is a leap year only if it is also divisible by 400. So the year 2000 was a leap year, but the year 2100 will not be. Even at that, we have added “leap seconds” from time to time since 1972 to keep the time kept by atomic clocks in step with the Earth’s slightly irregular rotation.
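The leap-year rule described above is small enough to sketch in a few lines of Python. The function names `is_leap_year` and `days_lived` are illustrative, and the birth date in the usage note is made up; the sketch only shows why "age times 365" undercounts.

```python
from datetime import date


def is_leap_year(year: int) -> bool:
    """Gregorian rule: divisible by 4, except century years not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def days_lived(birth: date, today: date) -> int:
    """Count calendar days between two dates; leap days are included automatically."""
    return (today - birth).days


def naive_days(age_in_years: int) -> int:
    """The tempting shortcut that ignores leap years."""
    return age_in_years * 365
```

For a hypothetical person born on 1 January 1980, `days_lived(date(1980, 1, 1), date(2020, 1, 1))` exceeds `naive_days(40)` by exactly the number of leap days in between.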
All of this naming of time is approximate, but the point is that our concepts of “days” and “years” are a bit off from what we think they measure. Over the years, there have been various attempts to address this problem. Our current calendar is only one of many that have existed to try to mark the spinning of the Earth and the orbit of the Earth around the sun. What if we try to measure time between now and some time before the current calendar system? How many days have there been since the day Caesar was murdered? The answer is more difficult to calculate because we used a different calendar during Caesar’s time. Even in current times, there are many calendars in use around the world.
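One common way to compare dates across calendar systems is to convert both to a continuous day count such as the Julian Day Number. The sketch below uses a widely published integer formula, not anything taken from this article. It assumes astronomical year numbering (1 BC is year 0, so 44 BC is year -43) and a proleptic Julian calendar. Leap years were applied inconsistently in the early decades of the Julian calendar, so any answer for Caesar’s era should be treated as approximate.

```python
def _shift(year: int, month: int):
    """Move the year start to March so the leap day falls at the end of the year."""
    a = (14 - month) // 12
    return year + 4800 - a, month + 12 * a - 3


def gregorian_to_jdn(year: int, month: int, day: int) -> int:
    """Julian Day Number for a (proleptic) Gregorian calendar date."""
    y, m = _shift(year, month)
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - y // 100 + y // 400 - 32045


def julian_to_jdn(year: int, month: int, day: int) -> int:
    """Julian Day Number for a (proleptic) Julian calendar date."""
    y, m = _shift(year, month)
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083


def days_between(jdn_start: int, jdn_end: int) -> int:
    """Elapsed days between two Julian Day Numbers."""
    return jdn_end - jdn_start
```

Under those assumptions, `days_between(julian_to_jdn(-43, 3, 15), gregorian_to_jdn(2024, 1, 1))` gives a day count from the Ides of March, 44 BC to the start of 2024. The two converters also reproduce the 1582 calendar switch, where Julian 4 October was followed directly by Gregorian 15 October.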
Most of us know all of this time stuff, or vaguely remember it from school. But how does it play out in our data? The answer is not always obvious. Let’s say we have a field called “Effectivity date” that tells us when a particular change, such as a price change, is to take effect. Many Enterprise Resource Planning (ERP) systems have such dates. For example, these systems might store the fact that a price change goes into effect at midnight on November 8th. That “fact” seems very precise. However, what if the company is operating globally? Is it midnight on November 8th locally, in which case the price change would happen at 24 or more different times as the various time zones mark midnight on November 8th, or relative to some standard like Greenwich Mean Time? If we make it relative to a single time zone, we ensure the price change will happen at the same “time”, avoiding price arbitrage and other confusion that may arise when communicating data between time zones. However, by doing so we create a seemingly arbitrary mid-day price change in many parts of the world. Solving one problem seems to cause another. A recent manifestation of this sort of problem can be seen in US elections, as the polls “opened” and “closed” at different times around the country while simultaneously influencing behavior in different time zones (the US has six different time zones counting Alaska and Hawaii).
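The two readings of “midnight on November 8th” can be made concrete with a small Python sketch. The offices, their fixed UTC offsets, and the function names are hypothetical. Real time zones shift with daylight saving rules, so a production system would more likely use a time zone database than fixed offsets.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical offices with fixed UTC offsets (an assumption; real zones observe DST).
OFFICES = {
    "New York": timezone(timedelta(hours=-5)),
    "London": timezone(timedelta(hours=0)),
    "Tokyo": timezone(timedelta(hours=9)),
}


def local_midnight_in_utc(year: int, month: int, day: int) -> dict:
    """Option 1: the change happens at local midnight; show each instant in UTC."""
    return {
        city: datetime(year, month, day, tzinfo=tz).astimezone(timezone.utc)
        for city, tz in OFFICES.items()
    }


def single_instant_in_local(instant_utc: datetime) -> dict:
    """Option 2: the change happens at one UTC instant; show each local wall-clock time."""
    return {city: instant_utc.astimezone(tz) for city, tz in OFFICES.items()}
```

With option 1, the three offices change prices at three different instants, spread over many hours. With option 2, there is one instant, but it lands in the evening of the previous day in New York and mid-morning in Tokyo.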
All of this chronological confusion brings up the first truth about time-related data. All references to time are relative. They may be relative to a time zone, a calendar system, or some other reference. When comparing things from one relativistic frame to another (e.g. across time zones, or between now and some time before the current calendar system was adopted), we must take those assumptions into consideration or we will introduce bias into our conclusions.
All time is relative. Analytical and process-driven activities must consider the implications of that relativity, especially when the action or analysis spans more than one frame of reference.
Aging: Maturity and decay
Not all true data is true at the same time. We know this phenomenon, but we sometimes forget it. Like many people, I have various apps on my phone which can “alert” me to breaking news and events. Invariably, when something newsworthy happens, I receive multiple alerts with conflicting information. It’s not that the information is wrong per se. It’s just that the various sources have different latency. The delay between when the information was true and when it was published can cause the published information to diverge from the later truth, thus setting up a cycle where multiple versions of the “truth” seem to exist. This is not such a big problem when reporting snowfall predictions, but it can be very problematic when reporting airport closures relating to snowfall predictions!
Data maturity (how long it takes before a certain conclusion is valid) and decay (how long it takes before a valid conclusion becomes invalid) are often overlooked in analytical approaches. For example, machine learning often starts with taking a large corpus of data and drawing conclusions based on training sets, in which people have already made decisions on similar data. Such an approach must consider how the behavior of experts might change over time and how the data itself may become less relevant or more distracting to learning algorithms because of the passage of time.
A good example of this phenomenon comes from sentiment analysis, which looks at large bodies of data to see how people “feel” based on what they say or write. We train such algorithms by having experts tell us how certain language conveys feeling (e.g. burned toast is bad, burning the competition is good). We must also consider, however, that such an analytical approach would be impacted if the data decayed (e.g. if we start talking about burning flags or “burning” becomes slang for something new).
A best practice is to always consider maturity and decay in any analytical approach related to time, or time-series data. We can ask simple questions such as: How often does the data change? What causes it to change? What is the relative frame for time-based data and how does it differ from the analytical frame?
Always consider maturity and decay in any analysis relating to time. Consider how conclusions change with time, as well as how the data itself changes with time.
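One way to keep maturity and decay visible in an analysis is to attach both windows to every conclusion and check them against the time of use. The following Python sketch is illustrative only. The function name, the three status labels, and the idea of measuring both windows from the observation time are assumptions made for this example, not an established method.

```python
from datetime import datetime, timedelta


def conclusion_status(
    observed_at: datetime,
    as_of: datetime,
    maturity: timedelta,
    shelf_life: timedelta,
) -> str:
    """Classify a conclusion by its age at the moment it is used.

    maturity:   how long after observation before the conclusion can be trusted
    shelf_life: how long after observation before it should be considered decayed
    """
    age = as_of - observed_at
    if age < timedelta(0):
        raise ValueError("as_of is earlier than the observation itself")
    if age < maturity:
        return "immature"
    if age <= shelf_life:
        return "valid"
    return "decayed"
```

For a hypothetical snowfall report that needs an hour to be confirmed and is stale after twelve hours, the same record would be "immature" thirty minutes after observation, "valid" three hours later, and "decayed" the next day.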
Changing data: Data at rest and in motion
Another quirky time-related aspect of data is the rate at which an attribute changes vs. the rate at which the data that describes that attribute changes. Using the age example from above, let’s say we stored my age in a data field for the purposes of analysis. We all realize that the data will decay. When I become one year older, that data will be incorrect. This error highlights the concept of data decay, the attribute of data that causes it to become less valuable over time. In this simple example, we could get around the problem of decay by storing my birth date, or at least the day on which I said I was a certain age. In large, complex datasets, however, the problem becomes much more nuanced.
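The birth date workaround can be sketched in Python as follows. The function name `age_on` is illustrative. The point is that a stored birth date stays correct indefinitely, while the age is derived at the moment it is needed.

```python
from datetime import date


def age_on(birth_date: date, as_of: date) -> int:
    """Derive age in whole years from a stored birth date, as of a given day."""
    years = as_of.year - birth_date.year
    # Subtract one if this year's birthday has not happened yet.
    if (as_of.month, as_of.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years
```

Note that the choice of `as_of` is itself a frame of reference: the same record yields a different age depending on when, and in which time zone, “today” is evaluated.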
In big data terms, we sometimes refer to data at rest vs. data in motion. Data at rest is data that doesn’t change from time of creation. For example, if I stored the capital cities of all of the countries on Earth, that data would not change very much (unless a country changed the capital city or a new country came about). Much of this sort of data is descriptive, such as dimensions of parts, names of compounds, or product descriptions. These things are not “completely” at rest, but for the most part, we do not have to worry much about such data in terms of decay.
Data in motion is very different. It changes, sometimes constantly. An example would be the current speed and altitude of an airplane passing through a controlled airspace. This data contains relativistic assumptions (altitudes are based on distance from the ground to the plane, but above a certain altitude planes measure their altitude based on barometric pressure so that they don’t have to constantly increase and decrease altitude to stay at the same flight level when flying over changing terrain). There is also data that is simply changing over time, such as the speed of the plane (which also can be measured in many ways, such as what the plane experiences passing through moving air vs. over a fixed point on the ground). Data in motion is fascinating. Storing it for analytical purposes is very tricky.
Of course, not all data in motion is as complicated as tracking a moving airplane. Consider data we store in a manufacturing company’s ERP. Typically, ERPs distinguish Master Data, which is data that does not change with the business activity of the organization, from Transactional Data, which does change with business activity. Examples of master data are the names and addresses of customers and vendors. If the company makes bicycles, the names of the vendors and customers don’t change specifically because they made another bicycle. Master data for the bicycle company changes because of purchasing (vendors) or sales (customers) activity. Transactional data might include in-process inventory on hand, production schedules, and quality inspection data. All of this data changes during the normal course of making bicycles.
The danger in mixing data at rest and data in motion for analytical purposes is that any answer produced is useful only for a limited time, and that window depends on how quickly the underlying data changes. That does not make such analysis invalid. Using our aircraft example, a computation of the increase in density of traffic over a point over the seasons of the year may be valid if we take sample data in the right way, even though the data itself is changing quite rapidly. Using our manufacturing example, computing the average number of idle workers during equipment downtime for unscheduled repair would be an equivalent example of combining data at rest and data in motion for a valid analytical purpose.
Perhaps the biggest mistake made with dynamic data is failing to consider the changing nature of data when designing an analytic approach. Storing data in motion in a repository turns it into data at rest.
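A stored reading of data in motion is only meaningful together with the time at which it was observed. The Python sketch below keeps that timestamp with each snapshot and refuses to answer when the newest snapshot is too old for the question being asked. The `Reading` type, its `altitude_ft` field, and the `max_age` threshold are all hypothetical names chosen for this illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Reading:
    observed_at: datetime  # when the value was true, not when it was stored
    altitude_ft: int


def latest_as_of(
    readings: List[Reading], as_of: datetime, max_age: timedelta
) -> Optional[Reading]:
    """Return the newest reading at or before as_of, or None if it is too stale.

    Assumes readings are sorted by observed_at in ascending order.
    """
    times = [r.observed_at for r in readings]
    index = bisect_right(times, as_of)
    if index == 0:
        return None  # nothing had been observed yet
    candidate = readings[index - 1]
    if as_of - candidate.observed_at > max_age:
        return None  # the snapshot exists but has decayed
    return candidate
```

Returning `None` rather than the last known value is a deliberate choice in this sketch: it forces the caller to decide what a missing or decayed reading means for the analysis.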
Theodor Geisel captured the paradoxical nature of time beautifully when he wrote “How did it get so late so soon? It’s night before it’s afternoon. December is here before it’s June. My goodness how the time has flewn. How did it get so late so soon?” (n.b. at the time, he was writing as Dr. Seuss!). This brilliant and amusing rhyme seems to capture the elusive nature of time: what was true may change over time, as may our perception of it. So as the new year dawns and the old year passes, think about time and how we use it in drawing meaning from data.
About the Author: Anthony Scriffignano, Ph.D., SVP, Chief Data Scientist at Dun & Bradstreet. Leads the efforts to source, curate, and synthesize insight for one of the world’s largest commercial databases. Holder of several patents related to entity extraction, synthesis, and multilingual disambiguation. Anthony moderated a panel at the BIIA 10th Anniversary Conference in 2015.