Three fatal flaws of historical data sets, and how to avoid them
Historical data – it’s a bad system, but it’s the best we’ve got, right? Wrong!
It’s biased, out of date, and based on the flawed assumption that the future will look like the past. Historical data is far from ideal for training your artificial intelligence (AI) systems on.
Questions you need to think about:
- What problems will you encounter using historical data?
- How can you ensure that those problems don’t damage your business?
- Is there any alternative to using historical data?
We’ve collated some expert advice to answer those questions for you here:
“If you really want an AI system that delivers good business value, it’s got to be forward looking, and therefore it must look at real time execution data.”
“If it’s not, then it’s always looking in the rear-view mirror, which doesn’t help me make good decisions that increase my revenue and decrease my cost.”
“If I’m looking at a data warehouse, all I’m doing is looking to see what happened, not at what is going to happen and how to make sure it happens in a positive way.”
“Let’s take the example of supply chain management (SCM): in SCM, data can’t be old, it can’t be stale. If you have to physically move data from your transportation provider’s system into your system, or from your own store point of sales system into your system – by the time that data moves it’s no longer accurate.”
“In the retail consumer goods space, you’re typically moving products daily. What if it takes you 17 hours to move the data that you need to execute tomorrow morning’s delivery plan?”
“In a lot of cases, the supply chain has already executed past the point where the data you’re planning with is still relevant. AI technology needs to run on data that is as close to real time as possible, or you’ll just make bad decisions faster.”
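To make the staleness point concrete, here is a minimal Python sketch of a freshness check. The 17-hour transfer time comes from the quote above; everything else (the function names, the dates, and the 4-hour freshness limit) is a hypothetical assumption chosen purely for illustration, not a recommendation for any real supply chain.

```python
from datetime import datetime, timedelta


def data_age_at_execution(captured_at: datetime, execute_at: datetime) -> timedelta:
    """Return how old the data will be at the moment the plan is executed."""
    return execute_at - captured_at


def is_stale(captured_at: datetime, execute_at: datetime, max_age: timedelta) -> bool:
    """Return True if the data is older than max_age when the plan runs."""
    return data_age_at_execution(captured_at, execute_at) > max_age


# Illustrative values only: the dates and the freshness limit are assumptions.
captured_at = datetime(2024, 3, 4, 13, 0)   # snapshot taken in the source system
transfer_time = timedelta(hours=17)         # the 17-hour move from the quote
arrives_at = captured_at + transfer_time    # when the copy lands in the planning system
execute_at = datetime(2024, 3, 5, 6, 0)     # tomorrow morning's delivery plan
max_age = timedelta(hours=4)                # hypothetical freshness limit

stale = is_stale(captured_at, execute_at, max_age)
print(f"Data arrives at {arrives_at}, age at execution: "
      f"{data_age_at_execution(captured_at, execute_at)}, stale: {stale}")
```

With these assumed numbers the data only arrives at the moment the plan runs, and by then it is already well past the freshness limit. A check like this could gate a planning job so that it refuses to run, or at least raises a warning, on data that is too old.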
“From my perspective, I think that historical data absolutely has a shelf life.”
“If we look at the world of predictive AI, it’s been heavily focused on creating look-a-like modelling based on historical data to recommend actions that will produce more of the same. While that can be useful as one variable, it is extremely flawed as the only one.”
“If we think about it in the sales and marketing context, for example, the prospects and customers that you’ve been successful/not successful at selling to in the past – they’re one data point, but they should not be the only data point to determine what you should do in the future.”
“Your customers are changing over time, and just because you identified a set of customers to pitch and sell to in the past, it doesn’t necessarily mean that those are the optimal set of prospects that would drive more revenue per unit of time.”
“It’s advisable to take a graph approach, where you can make sense of all the data points that you’re working with and understand how they connect together; what is their relationship to one another?”
“Once you’ve done that, you can leverage other proxies in order to prescribe future actions, not using any of the historical data to do it.”
Falon Fatemi is the CEO and founder of Node.io, a first-of-its-kind AI platform that transforms how businesses are able to analyze relationships between entities on the web to uncover new opportunities. She previously worked at Google, where she was the youngest employee, starting at age 19.
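As a rough illustration of the graph approach described in the quotes above, the following Python sketch stores entities as nodes and their relationships as labelled edges, then asks which connections two entities share. It is not a description of how Node.io works; the class, the relationship labels, and the company and person names are all hypothetical.

```python
from collections import defaultdict


class EntityGraph:
    """A tiny undirected graph of entities joined by labelled relationships."""

    def __init__(self):
        # entity -> set of (relation, other_entity) pairs
        self._edges = defaultdict(set)

    def relate(self, a: str, relation: str, b: str) -> None:
        """Record a relationship in both directions."""
        self._edges[a].add((relation, b))
        self._edges[b].add((relation, a))

    def neighbors(self, entity: str) -> set:
        """Return every entity directly connected to the given one."""
        return {other for _, other in self._edges.get(entity, ())}

    def shared_connections(self, a: str, b: str) -> set:
        """Return entities connected to both a and b."""
        return self.neighbors(a) & self.neighbors(b)


# Hypothetical data: none of these entities come from the article.
graph = EntityGraph()
graph.relate("Acme Corp", "employs", "Dana")
graph.relate("Dana", "previously_worked_at", "Globex")
graph.relate("Globex", "customer_of", "Our Company")
graph.relate("Acme Corp", "partner_of", "Initech")
graph.relate("Initech", "customer_of", "Our Company")

shared = graph.shared_connections("Acme Corp", "Our Company")
print(f"Shared connections: {sorted(shared)}")
```

In this toy example, a prospect that has never appeared in past sales data still surfaces through its relationship to an existing customer, which is one way a present-day proxy could stand in for historical records.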
“Because it uses historical data, machine learning (ML) is inherently biased. Just look at some of the ethical problems that are getting attention now, such as ML systems discriminating when they are used for hiring.”
“There have been many experiments showing that these systems can lock on to the wrong correlations simply because the past data reflects past biases. The danger is that, by default, ML reflects the biases we currently have.”
“That’s why it’s important to make sure that the system doesn’t just provide a recommendation, or make a decision, without providing context around how it came to that decision, and how it is interpreting the data.”
“You need a system where you don’t just get a prediction and know whether it’s 60% or 70% accurate. Instead, you need to know which inputs pushed the decision one way and which factors are pushing it the other way.”
Misha Bilenko is now head of the Machine Intelligence and Research (MIR) department at Yandex, after a decade working at Microsoft. Yandex is a technology company that builds intelligent products and services powered by machine learning. It is one of Europe’s largest internet companies and the leading search provider in Russia.
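One simple way to see which inputs push a prediction in each direction is to break a linear model's score into per-feature contributions. The Python sketch below does this for a logistic model in a hiring-style setting. It is an illustration only, not a method attributed to Yandex or to the quotes above: the feature names, weights, and bias are invented, and more complex models would need a dedicated explanation technique.

```python
import math


def explain_linear_prediction(weights: dict, bias: float, features: dict):
    """Split a logistic model's score into per-feature contributions.

    Each contribution is weight * value. Positive contributions push the
    prediction up; negative contributions push it down.
    """
    contributions = {name: weights[name] * value for name, value in features.items()}
    score = bias + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-score))

    # Strongest upward pushes first, then strongest downward pushes first.
    pushes_up = sorted(
        ((n, c) for n, c in contributions.items() if c > 0), key=lambda item: -item[1]
    )
    pushes_down = sorted(
        ((n, c) for n, c in contributions.items() if c < 0), key=lambda item: item[1]
    )
    return probability, pushes_up, pushes_down


# Hypothetical model and candidate: every number here is an assumption.
weights = {"years_experience": 0.5, "skills_match": 1.5, "employment_gap_months": -0.25}
bias = -1.0
candidate = {"years_experience": 4.0, "skills_match": 1.0, "employment_gap_months": 6.0}

probability, pushes_up, pushes_down = explain_linear_prediction(weights, bias, candidate)
print(f"Predicted probability: {probability:.2f}")
print(f"Pushing the decision up: {pushes_up}")
print(f"Pushing the decision down: {pushes_down}")
```

Seeing that a feature such as an employment gap is pulling a score down gives a reviewer the chance to ask whether that input reflects a past bias rather than a genuine signal.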