Architect Microsoft Fabric Data Pipelines for Mid-Sized Healthcare Orgs
The art and science of building robust Microsoft data pipelines in healthcare
Here we go with a more “hands-on” issue. The last four weeks have taught me a lot about Microsoft Fabric data pipelines. Orchestrating them with no prior experience in Microsoft data engineering tools proved challenging.
This issue is a direct follow-up to the Microsoft Fabric data pipeline issue we wrote a few weeks ago.
This issue’s intent is to make healthcare data leaders aware of the quirks and caveats of Microsoft Fabric when building scalable data pipelines. I am by no means an expert, so feel free to add to this pile of insights in the comments. I’d love to continue learning from real Microsoft data experts.
I’m putting myself in the shoes of hundreds of healthcare data leaders just like me. With the marketing hype and internal pressure they’re facing, it’s easy to see why Fabric is gaining so much traction and interest from healthcare companies. Because it integrates directly into the Office 365 suite, it’s a tempting tech stack to default to.
I’m here to break the news that, like any other tool, it comes with complexities. To truly master Fabric, it helps to have tenured Microsoft data people to guide you. The Microsoft data pipelining tools can be confusing and complex, with sparse documentation and behavior that shifts as Microsoft ships updates continuously. It’s hard to know which solution to use for a given scenario.
Balancing Scalability with Maintainability
At Steinert Analytics, we strive to build data infrastructure that can be managed and maintained cost-effectively once we turn it over to our clients. We’re not in the business of making ourselves sticky just to drive more revenue for ourselves.
With that in mind, the solutions we’ve built for our clients, specifically in DME, balance scalability with maintainability. These companies typically have a small IT team, composed of one or maybe two people (leader and DBA). Many of them do not have data-specific expertise. They’re technical enough to know their way around SQL and a few classic Microsoft tools like SSIS and Power BI, but are not true experts.
For clients in this scenario, the data pipeline solution we architect in Microsoft Fabric directly caters to the strengths and constraints of these stakeholders. We’ve found this results in more efficient, confident and happy clients because they’re able to quickly wrap their heads around how to unlock data for high impact decisions without drowning in mounds of code and complexity.
True Microsoft Fabric experts may not agree with the approach I’m about to share. Keep in mind that large healthcare orgs (the ones most commonly using Microsoft tech) often run Fabric with effectively unlimited budgets.
When you’re a small to medium-sized company, you have to consider cost for any data pipelining activity. When you’re dealing with hundreds of millions of records and thousands of columns, things can get out of control quickly. We keep cost top of mind for mid-sized healthcare companies like yours.
The Solution
We mostly work with our clients’ Electronic Medical Record (EMR) data. These EMR systems contain tables with millions of rows and hundreds of columns. Microsoft Fabric, no matter the data pipeline solution used, can tear through millions of rows. The issue arises with how wide many of these EMR tables are: wide meaning hundreds of columns in a single table.
When you’re dealing with tables this wide, a true Fabric data architect would suggest using pipeline-orchestrated notebooks written in PySpark (or the big data language of your choice). This gives the developer full control over unique data load patterns (like incremental merge) and column data type mapping. Furthermore, Spark’s performance is incredible when dealing with big data sets. But you have to be on a higher-capacity Fabric SKU (I’d suggest F32 or above) to run it comfortably.
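For context, here is roughly what your team would be signing up to maintain: a minimal PySpark/Delta sketch of an incremental merge in a Fabric notebook. The table paths, key column, and watermark handling are illustrative placeholders, not code from a real client pipeline.

```python
# A minimal sketch of the incremental-merge pattern a Spark notebook gives you
# full control over. Paths, key columns, and the watermark value are
# illustrative placeholders.
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# `spark` is the session provided by the Fabric notebook runtime.
last_watermark = "2025-01-01T00:00:00"  # would normally come from a control table

updates_df = (
    spark.read.format("delta")
    .load("Tables/staging_payments")                 # hypothetical staging table
    .filter(F.col("modified_at") > last_watermark)   # only new or changed rows
)

target = DeltaTable.forPath(spark, "Tables/payments")  # hypothetical destination table

(
    target.alias("t")
    .merge(updates_df.alias("s"), "t.payment_id = s.payment_id")
    .whenMatchedUpdateAll()       # update rows that already exist
    .whenNotMatchedInsertAll()    # insert rows that are new
    .execute()
)
```

It’s powerful and flexible, but it’s also exactly the kind of code a one- or two-person IT team tends to struggle to own long term.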
So what are we to do? Sitting on an F4 license, we often need to step up to an F8 because initial table loads burst past what an F4 can handle. Side note: if you burst your capacity hard enough that Fabric temporarily blocks further use, the lockout doesn’t last. Fabric smooths bursts of usage across the billing period and burns down the overage over time (Microsoft calls this burndown), so a spike doesn’t permanently put you over your monthly capacity.
Keep in mind that many of our clients sit around 150 million rows and 200 GB of data.
As we noted in our previous data pipelining issue, Spark notebooks are too compute-heavy at this capacity tier. Additionally, we don’t like complexity, especially with clients that have a small IT team. Managing mounds of PySpark isn’t ideal, so we took a powerful, cost-effective, low-code approach.
Leverage a combination of Copy Jobs and Fabric Data Factory Copy Activities.
Our approach strikes the optimal balance between performance, cost, and maintainability for SMB healthcare organizations working with EMR data. Here's the technical reasoning behind our architecture decisions:
For tables < 10 million rows: Use Fabric Data Factory Copy Activities
Technical Rationale:
Copy Activities within Data Factory pipelines excel for smaller EMR tables due to their sophisticated orchestration capabilities and cost-effective compute consumption model. Data movement in a pipeline Copy activity consumes 1.5 CU hours for each intelligent throughput optimization used, scaled by the copy duration, which keeps smaller datasets economical.
Performance Advantages:
Intelligent Parallelism: You can set the 'Degree of copy parallelism' option in the Settings tab of the Copy activity to control how many parallel copies run. For wide EMR tables with hundreds of columns, Copy Activities otherwise optimize parallel processing automatically without overwhelming your F4/F8 capacity.
Dynamic Resource Allocation: The service dynamically applies optimal parallel copy settings based on your source-destination pair and data pattern, crucial for EMR systems with varying table structures.
Advanced Partitioning: Copy Activities support both physical partitions and dynamic range partitioning, enabling efficient handling of wide EMR tables by intelligently splitting the workload across available compute resources.
Cost Optimization:
With an F4 capacity (often bursting to F8), the 1.5 CU-hour rate for Copy Activities keeps you within capacity limits while processing smaller EMR tables efficiently. The 0.0056 CU hours consumed per orchestration activity run add minimal overhead, making pipeline orchestration cost-effective for complex EMR workflows.
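To make that concrete, here is the rough capacity math we sketch for clients. The run counts, durations, and the single throughput-optimization unit are illustrative assumptions, not measured numbers.

```python
# Rough daily capacity math for an F4, using the rates above.
COPY_RATE_CU_HOURS = 1.5          # per throughput-optimization unit, per hour of copy
ORCHESTRATION_CU_HOURS = 0.0056   # per activity run

daily_copy_runs = 20              # assumed number of small-table copies per day
avg_copy_hours = 0.25             # assumed average duration per copy (15 minutes)

data_movement = daily_copy_runs * COPY_RATE_CU_HOURS * avg_copy_hours   # 7.5 CU-hours
orchestration = daily_copy_runs * ORCHESTRATION_CU_HOURS                # ~0.1 CU-hours

f4_daily_budget = 4 * 24          # an F4 provides 4 CUs, smoothed over 24 hours
print(f"Estimated daily usage: {data_movement + orchestration:.1f} "
      f"of {f4_daily_budget} CU-hours available on an F4")
```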
Things to watch out for:
Tables >= 10 million rows: We noticed significant performance degradation for tables exceeding our hard-and-fast threshold of 10 million rows. A payment table for one of our clients ended up timing out after running for 12.5 hours (not good!).
The copy from our source required Fabric to create a backend staging layer. The real bottleneck was the second hop, copying the data from staging to the Lakehouse destination, which is what caused the timeout.
For tables >= 10 million rows: Use Copy Jobs
Technical Rationale:
Copy Jobs represent Microsoft's optimized solution for high-volume data movement, specifically engineered for scenarios like large EMR datasets with millions of records and hundreds of columns.
Superior Performance for Large EMR Datasets:
Serverless Scalability: Copy Jobs move large amounts of data quickly and reliably thanks to a serverless, scalable backend. That architecture scales automatically to handle massive EMR extracts without manual capacity planning.
Optimized for Wide Tables: Unlike pipeline Copy Activities that can struggle with EMR tables containing 200+ columns, Copy Jobs are architected to handle wide schemas efficiently through intelligent column processing and memory management.
Native Incremental Loading: Copy Jobs support an incremental copy mode: the first run copies everything, and subsequent runs move only new or changed data. If your source database has Change Data Capture (CDC) enabled, inserted, updated, and deleted rows are all included. This is critical for EMR systems, where you need to capture patient data updates, new records, and deletions efficiently (i.e., upserts).
Things to watch out for:
Data mapping bugs: When moving wide, large transaction tables, we’ve encountered data type mapping issues from source to destination. We had to run the job, follow the error message for each field it complained about, change the data type, and rerun, repeating the cycle to isolate the problematic fields one by one.
The issue we encountered was a handful of metric columns of data type “long” that had to be converted to string for the job to run. In the source, they were integers. Very weird, and we believe this to be a Copy Job bug given how new the feature is.
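If we had to do it over, we’d pre-scan the source schema for likely offenders instead of chasing run-time errors one field at a time. A hypothetical helper along these lines would do it; the connection string, table name, and the types we filter on are assumptions specific to our situation, not a documented fix.

```python
# Hypothetical pre-scan of a source SQL Server table for integer/bigint columns,
# so they can be mapped explicitly in the Copy Job before the first run.
# Connection details, driver name, and table name are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=your-emr-sql-server;DATABASE=your_emr_db;"
    "Trusted_Connection=yes;"
)

query = """
    SELECT COLUMN_NAME, DATA_TYPE
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_NAME = ?
      AND DATA_TYPE IN ('int', 'bigint')  -- the types we saw surface as 'long'
"""

for column_name, data_type in conn.execute(query, "payment"):
    # Flag candidates to map explicitly (e.g., to string) in the Copy Job mapping.
    print(f"Review mapping for {column_name} ({data_type})")
```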
Cost Efficiency at Scale:
The pricing model reveals why Copy Jobs are superior for large EMR tables:
Full Copy: billed at 1.5 CU hours, scaled by copy duration, which is the same rate as data movement Copy activities in a pipeline
Incremental Copy: billed at 3 CU hours, scaled by copy duration, during which only delta data is moved using a more efficient approach that reduces processing time
While incremental copies are billed at double the rate (3 CU hours versus 1.5 for full copies), the efficiency gains are substantial. For a 10+ million row EMR table, an incremental copy might process only 50,000 changed records instead of reloading 10 million rows, resulting in dramatically reduced processing time and a lower overall cost.
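Here is that trade-off in rough numbers, using the rates above. The run durations are assumptions for illustration only, not benchmarks from a client environment.

```python
# Illustrative comparison of full vs. incremental Copy Job cost.
FULL_RATE = 1.5          # CU-hours per hour of full-copy duration
INCREMENTAL_RATE = 3.0   # CU-hours per hour of incremental-copy duration

full_copy_hours = 2.0    # assumed time to reload all ~10M rows
incremental_hours = 0.1  # assumed time to move ~50K changed rows

full_cost = FULL_RATE * full_copy_hours                   # 3.0 CU-hours
incremental_cost = INCREMENTAL_RATE * incremental_hours   # 0.3 CU-hours

# Even at double the rate, the incremental run is ~10x cheaper because it
# touches a tiny fraction of the data.
print(f"Full reload: {full_cost:.1f} CU-hours vs incremental: {incremental_cost:.1f} CU-hours")
```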
EMR-Specific Advantages:
CDC Support: If your source database has CDC enabled, a Copy Job automatically captures and replicates inserts, updates, and deletes to the destination. Most modern EMR databases support CDC, enabling near-real-time patient data synchronization.
Multi-table Management: EMR systems typically require synchronized loading of related tables (patients, encounters, procedures, medications). Copy Jobs can handle multiple large tables within a single job, maintaining referential integrity while optimizing resource usage.
Automatic Schema Handling: Copy Jobs automatically detect and handle schema changes common in EMR systems, reducing maintenance overhead for your small IT team.
Cost Comparison with Alternative Fabric Solutions
vs. Spark Notebooks/Dataflows: Traditional big data approaches like Spark notebooks would consume significantly more capacity:
Spark operations typically require 16 CU/hour for standard compute
Our EMR clients' 150 million row, 200GB datasets would require F32+ capacity (~$4,000/month) vs. our F4/F8 approach (~$500-1,000/month)
Spark's overhead for wide tables creates memory pressure, forcing expensive capacity upgrades
vs. Pure Pipeline Approach: Using only Copy Activities for all tables would create several issues:
Large EMR tables (10M+ rows) would exceed F4/F8 capacity limits, forcing expensive upgrades
No native incremental loading optimization, leading to unnecessary full reloads
Complex orchestration required for large multi-table EMR loads
vs. Copy Jobs Only: Using Copy Jobs for all tables would be suboptimal because:
Smaller tables don't benefit from Copy Jobs' overhead
For changing requirements and staying flexible, we recommend you to create one copy job for each table. This gets to be way too much to maintain.
Versus Data Factory Copy Activities that allow multiple tables to be loaded or turned off in one place (more flexibility)
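Boiled down, the routing rule we apply looks something like this sketch. The 10-million-row threshold is our heuristic, not a Microsoft-documented limit, so tune it to your own capacity.

```python
# Minimal sketch of the table-routing heuristic described above.
ROW_THRESHOLD = 10_000_000

def choose_load_mechanism(estimated_row_count: int) -> str:
    """Return which Fabric tool we default to for a given EMR table."""
    if estimated_row_count < ROW_THRESHOLD:
        return "Data Factory pipeline Copy Activity"
    return "Copy Job (incremental where CDC is available)"

# Example: a 150M-row payments table routes to a Copy Job.
print(choose_load_mechanism(150_000_000))
```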
Implementation Strategy for SMB Healthcare
Capacity Planning: Our approach allows SMB healthcare organizations to operate efficiently on F4 capacity with occasional F8 bursting:
F4 Base Capacity: 4 CUs of continuous compute, sufficient for Copy Activities on smaller EMR tables
Burst to F8: Microsoft Fabric bursting lets you temporarily use more compute than your provisioned capacity, which covers large Copy Jobs during off-hours
Cost Control: Smoothing and autoscaling let you run intensive data processing jobs during the day without paying for expensive always-on high capacity
Operational Benefits:
Simplified Management: Your IT team manages two simple, well-documented tools instead of complex Spark code
Predictable Costs: Clear CU consumption patterns enable accurate budget forecasting
Scalable Architecture: Easy to upgrade individual table loads from Copy Activities to Copy Jobs as data volumes grow
This hybrid approach delivers enterprise-grade performance for EMR data processing while maintaining the simplicity and cost-effectiveness essential for mid-sized healthcare organizations. The architecture leverages Microsoft Fabric's strengths while mitigating the complexity and costs that typically challenge SMB IT teams.
So there you go…
Boom! A scalable and maintainable low-code solution to all your Microsoft Fabric data pipelining needs. Hopefully the callouts, explanations, and tips help you understand how to leverage these tools to your company’s benefit.
You just received five weeks of hands-on Fabric learning, failures, and successes, all in the few minutes it took to read this article. Congrats!
If you’re in need of guidance on Microsoft Fabric in your healthcare org, we’d be happy to help. Feel free to book a call here or reach out to me on LinkedIn.
Happy data pipelining, healthcare data leaders!
Christian Steinert is the founder of Steinert Analytics, helping healthcare & roofing organizations turn data into actionable insights. Subscribe to Rooftop Insights for weekly perspectives on analytics and business intelligence in these industries. Thank you!