From Black Box to Crystal Clear: Implementing Data Quality Monitoring During Platform Migration

September 16, 2025 · 8 min read
Ivan Nikalayeu

Introduction

The challenge of migrating a critical system while working with a “black box” is a familiar scenario for many lead developers. This was the situation we faced during a platform migration project for a B2B music streaming service provider. The company’s infrastructure, supporting thousands of business locations worldwide, combines IoT devices with streaming technology to deliver licensed music solutions. As the lead developer, I encountered a fundamental challenge: ensuring data quality without clear visibility into source systems. This challenge led us to develop an innovative data quality monitoring system that became essential to our migration success. Let me walk you through our journey — from initial challenges to successful implementation, including those “aha” moments that made it all worthwhile.

The Challenge and Our Journey to a Solution

Picture this: It’s 2 AM, and I’m staring at our monitoring dashboard, trying to figure out why our data pipeline suddenly started behaving strangely. This was a common scene during our migration from Databricks to AWS. The main headache? We had what we jokingly called our “mystery box” — external systems feeding data into our analytical platform with zero documentation and even less visibility.

We were essentially flying blind, trying to figure out if we were getting all the data we should (spoiler: sometimes we weren’t), validate data transfers without knowing what “complete” looked like, monitor source systems we barely understood, and maintain data quality when we couldn’t even define what “quality” meant. Our old monitoring approach was about as useful as a chocolate teapot — sure, it told us if jobs completed, but that’s like saying your car works because the engine turns on, ignoring the fact that it’s making concerning noises and the check engine light is on.

The challenge was particularly acute because we needed to monitor multiple aspects of our data pipeline. Here are just some examples of the metrics we tracked (and there were many more):

  • The number of files being processed each day
  • The volume of data in each table
  • The number of rows being inserted or updated
  • The size of incoming data files
  • Data quality scores

Without clear specifications, we couldn’t simply set fixed thresholds for these metrics. We needed a more sophisticated approach that would account for natural variations in our data patterns.
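
To make that concrete, here is a rough sketch of the kind of per-table, per-run record such monitoring needs to capture. The field names are illustrative assumptions, not our production schema, and the real set of metrics was larger:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class LoadMetrics:
    """One monitoring record per table per pipeline run.

    Field names are illustrative; the actual set of metrics was larger."""
    table_name: str
    load_ts: datetime
    files_processed: int   # number of input files picked up in this run
    bytes_ingested: int    # total size of the incoming data files
    rows_inserted: int
    rows_updated: int
    quality_score: float   # composite of column-level checks, 0.0-1.0
```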

After countless late-night calls with source system teams (and probably too much coffee), we realized we weren’t getting detailed specifications anytime soon. You know that moment when you’re stuck in traffic and realize you should have taken the back road? That’s how we felt about our initial approach. But then it hit us — we had years of historical data sitting right there! It was like finding an old map in your glove compartment when you’re lost.

This historical data became our goldmine. By analyzing patterns over time, we discovered fascinating insights about our 24/7 operation:

  • Our music streaming data showed distinct patterns throughout the week, with predictable variations not just between weekdays and weekends, but also between different times of day
  • We noticed that weekend patterns differed significantly from weekday patterns, with their own unique characteristics for both Saturday and Sunday
  • Royalty calculation data followed clear monthly cycles, aligning with billing periods
  • File processing volumes had seasonal variations we hadn’t previously noticed
  • Certain tables showed consistent growth rates that we could use as baseline indicators
  • We discovered time-of-day patterns that were consistent across all days of the week
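
As a minimal sketch of how such patterns can be surfaced, assuming a hypothetical metrics history file with load_ts and rows_inserted columns (not our actual schema), a simple pivot already makes the weekly and hourly shape visible:

```python
import pandas as pd

# Illustrative only: the file name and column names (load_ts, rows_inserted)
# are assumptions, not the real monitoring schema.
history = pd.read_parquet("metrics_history.parquet")
history["day"] = history["load_ts"].dt.day_name()
history["hour"] = history["load_ts"].dt.hour

# Average rows inserted per hour and day of week: weekday/weekend differences,
# time-of-day shape, and the distinct Saturday and Sunday profiles all show up
# directly in this pivot.
weekly_profile = history.pivot_table(
    values="rows_inserted", index="hour", columns="day", aggfunc="mean"
)
print(weekly_profile.round(0))
```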

Using this historical context, we developed a statistical approach to define “normal” behavior. Instead of rigid thresholds, we created dynamic ranges that accounted for:

  • Time-of-day variations (24/7 operation)
  • Day-of-week patterns (including weekend specifics)
  • Seasonal trends
  • Growth trajectories
  • Holiday and special event impacts

This approach allowed us to identify true anomalies while avoiding false alarms from normal variations. For example, if a table typically grew by 5–7% monthly, we could quickly spot when growth suddenly dropped to 2% or jumped to 15%, even if both numbers might seem reasonable in isolation.
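
A minimal sketch of this idea, using the same hypothetical metrics history as above: compute a baseline per (table, day-of-week, hour) bucket and flag values that fall outside mean ± k standard deviations. The column names and the k = 3 cutoff are illustrative assumptions, not our production configuration:

```python
import pandas as pd

# Same assumed metrics history as in the previous sketch.
history = pd.read_parquet("metrics_history.parquet")
history["dow"] = history["load_ts"].dt.dayofweek   # 0 = Monday
history["hour"] = history["load_ts"].dt.hour

# Baseline per (table, day-of-week, hour) bucket, so a Friday-evening load is
# judged only against other Friday evenings.
baseline = (
    history.groupby(["table_name", "dow", "hour"])["rows_inserted"]
    .agg(["mean", "std"])
    .reset_index()
)

def is_anomalous(table: str, ts: pd.Timestamp, rows: int, k: float = 3.0) -> bool:
    """Flag a load whose row count falls outside mean +/- k * std for its bucket."""
    bucket = baseline[
        (baseline["table_name"] == table)
        & (baseline["dow"] == ts.dayofweek)
        & (baseline["hour"] == ts.hour)
    ]
    if bucket.empty:
        return False  # no history for this bucket yet; better quiet than crying wolf
    mean, std = bucket.iloc[0]["mean"], bucket.iloc[0]["std"]
    if pd.isna(std):
        return False  # a single historical point is not enough to define "normal"
    return abs(rows - mean) > k * max(std, 1.0)  # floor std so flat series keep a band
```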

The beauty of this approach was that it didn’t require perfect knowledge of the source systems. Instead, it let the data tell us what “normal” looked like, and we built our monitoring system around these self-discovered patterns.

Building Our Solution: From Infrastructure to Insights

We had two options: build a new monitoring system from scratch (the “shiny new toy” approach) or integrate monitoring into our existing infrastructure. After one particularly long whiteboard session (and several pizzas), we chose the latter. Why? Because sometimes the boring solution is the right one. Our existing ETL pipelines had already proven themselves during migration — they were like that reliable friend who always shows up when you need them.

This decision led us into the fascinating world of pipeline analysis. We spent weeks diving into our pipelines, and what we found was like discovering hidden patterns in your favorite song. For example, we noticed our music streaming data had this weird but consistent pattern — like clockwork, it would dip on Mondays (nobody’s favorite day) and peak on Fridays (everyone’s favorite day). It was like watching the heartbeat of our business through data.

Designing our monitoring schema in Redshift was like trying to teach a computer to understand jazz — you need to capture not just the notes, but the soul of the thing. We had to look at years of data and figure out what “normal” even meant. Some days it felt like we were data archaeologists, carefully brushing away the digital dust to reveal the patterns beneath.
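
For illustration only, a hypothetical shape for such a monitoring table in Redshift; the real schema was richer, and the column names below are assumptions rather than our actual design:

```python
# Hypothetical DDL for a central monitoring table in Redshift: one row per
# table per load, keyed so that per-bucket baselines are cheap to compute.
MONITORING_TABLE_DDL = """
CREATE TABLE IF NOT EXISTS dq_monitoring.load_metrics (
    table_name      VARCHAR(128) NOT NULL,
    load_ts         TIMESTAMP    NOT NULL,
    files_processed INTEGER,
    bytes_ingested  BIGINT,
    rows_inserted   BIGINT,
    rows_updated    BIGINT,
    quality_score   DECIMAL(5, 4)
)
DISTKEY (table_name)
SORTKEY (table_name, load_ts);
"""
```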

The most challenging part was modifying our pipelines while they were actively processing data. It was like trying to repair a car while it’s driving down the highway. Every change had to be perfect because any failure would mean manually reloading data for specific time periods — a time-consuming and error-prone process. We learned this lesson the hard way when a small configuration change caused a system failure, requiring us to manually reload several days of historical data. This experience led us to implement an extremely careful, step-by-step approach to all modifications.
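
One pattern consistent with that careful approach (a sketch, not our actual code) is to isolate each new monitoring step so that its failure can never fail the load itself:

```python
import logging

logger = logging.getLogger("dq_monitoring")

def record_metrics_safely(write_metrics, metrics) -> None:
    """Persist a monitoring record without endangering the load itself.

    `write_metrics` is any callable that writes a metrics record to the
    monitoring table; both names here are illustrative.
    """
    try:
        write_metrics(metrics)
    except Exception:
        # Deliberately broad: monitoring is best-effort, the pipeline is not.
        logger.exception("Failed to record data quality metrics; load continues")
```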

Creating our Tableau dashboard was where the magic really happened. Being the primary consumers of this monitoring data, we had a clear understanding of what information was crucial for our daily operations. We needed a dashboard that would help us quickly identify anomalies and system health issues without getting lost in the details. The first version suffered from the common “more is better” syndrome — it displayed every possible metric, making it difficult to focus on what really mattered. Through multiple iterations and incorporating feedback from our team members (who relied on these dashboards for their daily monitoring tasks), we streamlined the interface to highlight the most important indicators. This focus on practical usability helped us maintain data quality and respond to issues more effectively.

The Impact: Transforming Our Operations and Business

The transformation in our operations was like upgrading from a flip phone to a smartphone. What used to take hours of manual checking became automated. Our team members, who previously looked like they were auditioning for a zombie movie after long nights of data checking, now had time for actual strategic work.

The business impact was immediate and visible. Remember that skeptical stakeholder who always asked “but how do you know the data is right?” Well, now we could show them exactly why we were confident. It was like having receipts for every business decision. From a technical perspective, we didn’t just build a monitoring system — we created a data quality ecosystem that grew with us. The system’s performance impact was minimal (which was crucial because our data volumes were growing faster than my coffee consumption during this project).

Lessons Learned and Future Directions

The biggest surprise throughout this journey? Historical data became our best friend. What we initially thought was just digital clutter turned out to be a goldmine of insights. We learned that sometimes the simplest solution (like using basic statistical analysis) works better than the fancy, complex approaches we initially considered.

Conclusion

This journey taught us that effective data quality monitoring doesn’t need perfect information or fancy tools — it needs creativity, persistence, and a willingness to learn from both successes and failures. Like any good story, ours isn’t finished — we’re still writing new chapters in our data quality journey, learning from each challenge and improving with every iteration.

About the Author: Senior Data Engineer with 9+ years of experience in building and optimizing enterprise-level data solutions. Expert in implementing complex data integration projects across banking, IoT, and enterprise systems. Specializes in designing scalable data architectures and ensuring data quality through advanced monitoring solutions. Passionate about sharing knowledge and experiences from real-world data engineering challenges.