Breaking Down Data Silos
We have data scattered across the organization, and a unified view of that data can provide valuable insights. It seems like it should be simple: pull everything from various sources – both internal and external – and combine it into one useful dataset. The frustrating part is that it really does look easy, yet the process always falls one critical step short of what we need.
Let’s tackle the typical issues in breaking down data silos and how to solve them, unlocking that elusive high-level view of your organization’s data.
Two key requirements exist for consistent, stable, and reliable analytics:
- Automation to move data through various pipelines and get it where it needs to be
- Data correlation practices to make disparate systems’ data work together (especially crucial for third-party services)
Most attempts at bringing data together and breaking down these silos fail due to manual data management and/or missing key correlation points.
Why Manual Attempts Make Us Pull Our Hair Out
We’ve seen it a hundred times… export a spreadsheet from this system, a CSV from over there, dump some data from an in-house database, and load it all into one big Excel workbook to try to get it under control – usually just a month’s worth of data from each system.
Then the problems start rolling in. We have billing, payroll, and scheduling data from here, there, and wherever. Sure, we might have some relevant CustomerIDs or EmployeeIDs. We can even correlate the billing and payroll data. But when we start trying to merge the datasets, things get messy. That scheduling data just doesn’t want to fit.
It’s like trying to complete a puzzle where a handful of pieces were cut from different templates and just won’t go together.
Maybe we can force it a bit, get a 70% match perhaps – at least it’ll give us something, right? But now everything downstream has a confidence gremlin lurking. The more you try to build on top of it, the more that confidence dissipates. What good are insights if we can’t trust the data? Correlation integrity is a big deal.
Even if you solve the correlation puzzle (often a grueling task), the next major issue crops up. After spending a week or two getting this data into some semblance of focus, it’s already outdated. We’re halfway through the month, and we wanted this answer two weeks ago. Two-week-old data just isn’t that helpful, if we can even trust it.
The state of affairs settles into something along the lines of “I’m always half a step behind where I need to be, and it takes too many person-hours to get there…” Kind of like trying to climb a hill littered with loose stone where those last ten steps always send you backward instead of forward.
Automation: The Real MVP
If manual processing is time-consuming, tedious, and leaves us in a half-step-away-from-sufficient state, what’s the answer? Automation, of course.
Automating the process of moving and correlating data is the savior of data analytics. Usually running overnight, it supplies us every morning with a fresh set of clean data from all our sources, with zero human effort past the initial setup phases.
Sound dreamy? It should if you’ve been spinning your wheels with manual attempts long enough. This is formally referred to as an ETL process (Extract, Transform, Load). ETL comes in a few variations (it’s often lumped under the broader term “data pipeline”), but the core concept remains the same.
ETL processes handle extracting from both internal and external data sources, usually through database reads, file drops, or external service APIs. These processes gather data from various sources, then set about transforming it to obtain those key correlations. Once the data is cleaned up, it’s loaded into a centralized warehouse for further analytical processing.
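To make that concrete, here’s a minimal sketch of what a nightly ETL job might look like in Python, assuming a hypothetical billing CSV drop, an internal payroll database, and a third-party scheduling API. Every name here (file paths, tables, endpoints, join keys) is illustrative rather than a description of any real system:

```python
# Minimal nightly ETL sketch (all sources and keys are assumptions).
import sqlite3

import pandas as pd
import requests


def extract():
    # File drop from the billing system.
    billing = pd.read_csv("drops/billing_latest.csv")
    # Direct read from an internal payroll database.
    payroll = pd.read_sql("SELECT * FROM payroll", sqlite3.connect("internal.db"))
    # Pull from a hypothetical third-party scheduling API.
    shifts = requests.get("https://api.example-scheduler.com/v1/shifts", timeout=30).json()
    scheduling = pd.DataFrame(shifts)
    return billing, payroll, scheduling


def transform(billing, payroll, scheduling):
    # Normalize the shared keys so the datasets can be correlated.
    for df in (billing, payroll, scheduling):
        df.columns = [c.lower() for c in df.columns]
    # Assumed correlation points: payroll and scheduling share employee_id,
    # and billing ties back to payroll via customer_id.
    staffing = payroll.merge(scheduling, on="employee_id", how="left")
    return staffing.merge(billing, on="customer_id", how="left")


def load(merged):
    # Append the cleaned, correlated data into the central warehouse table.
    with sqlite3.connect("warehouse.db") as conn:
        merged.to_sql("daily_activity", conn, if_exists="append", index=False)


if __name__ == "__main__":
    load(transform(*extract()))
```

Scheduled overnight (cron, a workflow orchestrator, whatever fits), a job shaped like this is what delivers that fresh dataset every morning.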
The beauty is that once the ETL process is set up and running, it tends to run very well and requires no manual input. Furthermore, data from new, future services can be brought into the fold quite easily, without disrupting any existing processing. It’s a very future-proof means of handling data.
Data Correlation: The Secret Sauce
Correlating data from disparate systems is often the real key to getting good insights. While correlations should be handled via automations (the Transform part of ETL), the tasks are quite different from the automations that get us the data in the first place.
Here’s the thing about data correlation: it needs rock-solid integrity. Low-integrity correlations erode confidence in the big picture, and that erosion compounds with each “layer” of analytical processing you put on top. Most valuable insights are built in layers – if we have X, Y, and Z, we can get answers to A, B, and C, and if we have those, we can then get answers to D, E, and F. The underlying issue is that correlation almost always happens at the base level, so any integrity problem there leaves a lot of room for cumulative error. Incorrect insights are probably worse than no insights at all.
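To put rough numbers on that compounding: if each layer of processing rests on a correlation that only matches, say, 90% of records (an assumed figure, purely for illustration), the share of results you can fully trust shrinks multiplicatively as the layers stack up:

```python
# Back-of-the-envelope illustration of how a per-layer match rate
# compounds into overall confidence (numbers are assumptions).
match_rate = 0.90  # fraction of records correctly correlated at each layer
for layers in range(1, 5):
    print(f"{layers} layer(s): ~{match_rate ** layers:.0%} of results fully trustworthy")
# 1 layer(s): ~90% ... 4 layer(s): ~66%
```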
The key is ensuring sufficient data points in each dataset that tie it to the other relevant datasets. Sometimes these exist naturally and we just need to extract them; other times we have to force them into existence – maybe by adding a custom field on a third-party service, preferably populated through some automated integration (because let’s face it, relying on users to input identifiers correctly is a recipe for disaster).
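As an example of forcing a correlation point into existence, here’s a sketch of pushing an internal EmployeeID into a custom field on a hypothetical third-party scheduling service, so later ETL runs can join on it. The endpoint, field name, and auth scheme are assumptions, not any particular vendor’s API:

```python
# Sketch: tag a third-party record with our internal identifier
# (hypothetical service, endpoint, and field names).
import requests

SCHEDULER_API = "https://api.example-scheduler.com/v1"


def tag_worker_with_employee_id(worker_id: str, employee_id: str, token: str) -> None:
    resp = requests.patch(
        f"{SCHEDULER_API}/workers/{worker_id}",
        json={"custom_fields": {"internal_employee_id": employee_id}},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
```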
The Payoff: More Than Just Data
The result of having well-correlated, high-integrity data that refreshes automatically is incredibly powerful. We end up with an organization-wide dataset that we can treat as a single entity rather than as data from separate systems. That alone is worth celebrating, but there’s more.
Beyond having a robust data source we can query, we’re also building a rich data history for our organization. As the daily imports continue, previously processed data isn’t thrown away but marked with valid date windows. This means we always have both an “active as of now” dataset and a complete history of what things were “back then,” unlocking even more analytical power.
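One common way to keep that history is a valid-from/valid-to window on every record (what warehousing folks call a type 2 slowly changing dimension). A minimal sketch, assuming a hypothetical customers table in the warehouse:

```python
# Sketch (assumed schema): when a nightly import changes a record, close the
# old row's validity window and insert the new version, so both "as of now"
# and "back then" views stay queryable.
import sqlite3
from datetime import date


def upsert_with_history(conn: sqlite3.Connection, customer_id: str, new_name: str) -> None:
    today = date.today().isoformat()
    row = conn.execute(
        "SELECT name FROM customers WHERE customer_id = ? AND valid_to IS NULL",
        (customer_id,),
    ).fetchone()
    if row and row[0] == new_name:
        return  # nothing changed; the current row stays open
    if row:
        # Close the window on the previous version of the record.
        conn.execute(
            "UPDATE customers SET valid_to = ? WHERE customer_id = ? AND valid_to IS NULL",
            (today, customer_id),
        )
    # Insert the new version with an open-ended window.
    conn.execute(
        "INSERT INTO customers (customer_id, name, valid_from, valid_to) VALUES (?, ?, ?, NULL)",
        (customer_id, new_name, today),
    )
```

Querying “active as of now” then just filters on an open window (valid_to is null), while historical questions filter on a date falling inside the window.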
This rich history is also the gateway to working with organizationally-centric AI: trend analysis gives way to predictive analysis, which is the foundation of many advanced AI and ML applications.
Only then can we finally put these data silo headaches behind us and focus on the real analytical work to get those valuable insights we’ve been chasing all along.