How to derive actionable insights from raw data
Have you ever wondered how much of your data can provide actual, actionable insights? I constantly get asked how companies can better use their data for insights and build machine learning programs around it.
I typically start by asking a few questions that are relatively simple; you can find them in any statistics book. Below are just a few samples of what typically comes up in a non-technical conversation about data.
- What’s the size of your data sets and how many sets do you have?
- How often are data sets updated and how many new entries do you get on a daily basis?
- Just by looking at your data from a macro perspective what relationships can you define at a glance between different data sets?
- What are you trying to optimize using this data?
- How many different data sources do you consolidate and do you combine them all into one large cluster or is it housed separately?
Once you understand the structure of your data, you can move forward with seeing what you can do with it. Most of your data's usefulness lies in its derivatives; raw data sets are very hard to work with because they limit what can be learned from them.
The example: optimizing operations in logistics services
So, for the sake of this post, let's take the simple case of a company that runs a logistics service. It tracks the inflow of packages from vendors, then handles package sorting and deliveries. The company has close to 5 million data points, mainly quantitative numerical data, and a major bottleneck at the sorting stage, since it's all done manually.
Management usually looks at the total processing time from package intake to the package going out the door into a delivery van. When things start to slow down on the line, their first instinct is that employees A, B, and C aren't sorting fast enough, so let's get a set of fresh hands. The only data they have is:
- Time the package arrives at the facility
- Weight of the package
- Size of the package
- Time each package enters the conveyor belt after it’s sorted
- The final destination of the package
- The original location of the package
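To make that list concrete, here is a minimal sketch of what one package record might look like in Python. The field names, types, and units are my own assumptions for illustration, not the company's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class PackageRecord:
    arrival_time: datetime   # time the package arrives at the facility
    weight_kg: float         # weight of the package
    size_cm: tuple           # (length, width, height) of the package
    conveyor_time: datetime  # time it enters the conveyor belt after sorting
    origin: str              # original location of the package
    destination: str         # final destination of the package

# Example record
pkg = PackageRecord(
    arrival_time=datetime(2019, 6, 1, 9, 0),
    weight_kg=2.5,
    size_cm=(30, 20, 10),
    conveyor_time=datetime(2019, 6, 1, 9, 30),
    origin="Austin",
    destination="Dallas",
)
```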
There's a missing component in there: quantitatively, I don't really know what happens during sorting beyond the time it takes to sort a package. My default assumption is that dispatch time minus arrival time is what I should optimize for. The shorter it takes to process and sort each package, the more money we make. That's one very simple derivation I can always rely on, but it ignores the big picture.
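That first derivation is trivial to compute. A minimal sketch in Python, where the function name and sample timestamps are mine:

```python
from datetime import datetime

def processing_minutes(arrival, dispatch):
    """Total time from intake to dispatch onto the conveyor, in minutes."""
    return (dispatch - arrival).total_seconds() / 60

arrival = datetime(2019, 6, 1, 9, 0)
dispatch = datetime(2019, 6, 1, 9, 45)
processing_minutes(arrival, dispatch)  # 45.0
```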
Let's derive more data from those four numerical data points and create some relationships.
- Humans are involved in sorting the packages: they pick packages from their respective conveyor belts and sort them one at a time. Do packages of a certain size or weight move through sorting faster? Heavier packages are harder to handle, and awkward box sizes are even worse. We can now create subsets from our initial data: boxes of sizes E, F, and G take X amount of time in sorting.
- Some data is static and will not change, and we can easily add it to our calculations or data tables. Packages always spend 3 minutes on Conveyor A and Conveyor B, so we can easily compute the net time a package spends in sorting.
- There are 3 shifts in a day, with 20 workers each, but sorting times may vary across shifts. We should check how many packages are received every hour.
- But wait: there are more factors I haven't looked at, outside the scope of the data I already have, that don't relate to sorting directly. Through my data assessment, my fastest sorting times are from 12pm to 1pm, and I can see that most of those packages are light and similarly sized. If I look outside my data set, I know these packages come from a large drop-off by a big e-commerce company. Maybe I should try to win more business from that company, because I can tell them we're optimized for this type of work and can improve their delivery times.
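The subsetting and net-time ideas above can be sketched in plain Python. The size classes, the toy records, and the assumption that the fixed conveyor time is 3 minutes on each of the two belts are all mine for illustration:

```python
from collections import defaultdict

# Assumed static component: 3 minutes each on Conveyor A and Conveyor B.
CONVEYOR_MINUTES = 3 + 3

# Toy records: (size_class, total_minutes_from_arrival_to_conveyor)
records = [
    ("small", 12), ("small", 10), ("large", 25),
    ("large", 22), ("awkward", 30),
]

def net_sorting_minutes(total_minutes):
    """Strip the static conveyor time to isolate time spent in manual sorting."""
    return total_minutes - CONVEYOR_MINUTES

# Subset net sorting times by package size class.
by_size = defaultdict(list)
for size, total in records:
    by_size[size].append(net_sorting_minutes(total))

# Average net sorting time per size class.
avg_sort = {size: sum(v) / len(v) for size, v in by_size.items()}
# e.g. {'small': 5.0, 'large': 17.5, 'awkward': 24.0}
```

With real data, the same grouping could be repeated per shift or per hour of arrival to check the variation mentioned above.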
These are just a few of the derivatives I can come up with, each of which creates new data sets. As we derive new data sets from our initial raw data, we can reach concrete conclusions about how to improve specific processes and business goals.
Enter machine learning and deep learning
There are plenty of places here where machine learning and deep learning can be applied too. For example, we can help the company optimize drop-off windows for vendors, normalizing the distribution of packages flowing through sorting by automating the drop-off timetables.
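As a rough illustration of the timetable idea, here is a greedy heuristic that spreads expected vendor volume across hourly drop-off windows. The vendor names, volumes, and windows are made up, and a real system would learn these volumes rather than hard-code them:

```python
# Expected daily package volume per vendor (assumed toy numbers).
vendors = {"VendorA": 500, "VendorB": 300, "VendorC": 450, "VendorD": 200}
windows = ["09:00", "10:00", "11:00"]

load = {w: 0 for w in windows}   # volume assigned to each window so far
schedule = {}

# Largest vendors first, each assigned to the currently lightest window.
for vendor, volume in sorted(vendors.items(), key=lambda kv: -kv[1]):
    slot = min(load, key=load.get)
    schedule[vendor] = slot
    load[slot] += volume

# load is now roughly balanced: {'09:00': 500, '10:00': 450, '11:00': 500}
```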
We can also pass package insights back to the vendors directly and help them design better package sizes, with the aim of getting deliveries to their customers faster. We can even add a hardware component with machine vision: packages that come off the conveyor flipped, with the bar code facing down, can be diverted to a separate conveyor and turned right side up before they reach sorting.
We can also attach this derived data to external data sources. There are opportunities for benchmarking, adding third-party APIs, and improving processes that relate to sorting both directly and indirectly. Again, machine learning algorithms can help identify new feature sets beyond the simplistic ones I've stated.
The possibilities are endless, and this is where custom software plays a major role in improving processes. Whether it's in supply chain, software development, marketing, or even manufacturing, for companies to make the most of their warehoused data, it has to be looked at from all sides, not just through a magnifying glass.
Syed Ahmed is the CTO of Tara.ai, a software company facilitating talent acquisition and recruiting automation