When Data Pipelines Go Rogue: Our Microsoft Fabric Reality Check
The challenges and quirks of a new data tool in healthcare
We’ll revisit our data modeling in healthcare series in the coming weeks. This week we’re pivoting away from it to recap our experience with Microsoft Fabric data pipelining. This article is mainly strategic, with a dash of technical troubleshooting tips. If you’ve been subscribed for a while, you know that I focus more on data strategy, architecture, and business outcomes than on hard technical skills.
In the same vein as last week, this issue leans less on practical application and more on tips and considerations to be aware of when leading a healthcare data implementation with Microsoft Fabric. My goal is for this to appeal to the Directors and VPs of Technology at healthcare orgs.
So without further ado, let’s dive in…
A Plate of Chaos
We’ve been thrown some surprising curveballs around data pipelining in recent healthcare data modernizations. A large benefit of Fabric is its dynamic OneLake storage and seamless compatibility with Snowflake. Interestingly, one of our clients keeps all of their EMR data in a vendor-permissioned Snowflake reader account.
Our gut reaction was to leverage Snowflake mirroring with Fabric. However, battling vendor permissions on the Snowflake side proved challenging: mirroring requires specific permissions on the Snowflake tables that we didn’t have, and the EMR vendor was slow to respond. In turn, we had to figure out alternative options.
A Mess Moving Data
When considering alternative solutions, it was helpful to lay out all the Fabric data pipeline options with pros and cons. This created visibility not only for our team, but also for the client.
Here’s our list:
1. PySpark
Pros:
- Successfully connected to Snowflake
- Verified that MERGE works (just needs tweaking)
Cons:
- Compute isn't scalable (especially on an F2 capacity)
2. Copy Job
Pros:
- More efficient than PySpark
- Functionality for upserts available
Cons:
- Fabric was unable to authenticate to the Snowflake connection (we figured out a workaround)
- Potential bug or IP whitelist problem
3. Airflow DAG
Pros:
- Workaround solution
- Scalable
Cons:
- Code-heavy, creating a lot of maintenance overhead
- Technical issues & complexity
4. Dataflow Gen2
Pros:
- Successfully tested
- Low code
Cons:
- Not scalable (Dataflow queries become too much to maintain and troubleshoot)
- Incremental refresh not working
5. Snowflake Mirroring
Pros:
- Easiest method for pipelining/upserts
Cons:
- EMR vendor was slow to enable it for us
We wanted simplicity; however, the Fabric bugs we encountered (the inability to connect to Snowflake via the connector, plus the IP whitelisting issue) forced us to build the pipeline in PySpark.
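To make that concrete, here’s a minimal sketch of what the Snowflake read looks like from a Fabric PySpark notebook. This assumes the Snowflake Spark connector is available in your runtime; every connection value and object name below is a placeholder, not our client’s actual setup.

```python
# Minimal sketch: reading EMR data from Snowflake in a Fabric PySpark notebook.
# All values below are placeholders -- use your own account and objects, and
# pull credentials from a secret store rather than hardcoding them.
sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<service_user>",
    "sfPassword": "<password>",
    "sfDatabase": "<emr_database>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
}

# Filter on a change-tracking column so each run only pulls new or updated
# rows. ENCOUNTERS and LAST_MODIFIED are illustrative names. Depending on
# your connector version, you may need the fully qualified source name
# "net.snowflake.spark.snowflake" instead of the "snowflake" short name.
incoming = (
    spark.read.format("snowflake")
    .options(**sf_options)
    .option("query", "SELECT * FROM ENCOUNTERS WHERE LAST_MODIFIED > '2024-01-01'")
    .load()
)
```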
Building in PySpark comes with its own set of challenges and advantages. For one, we had an incremental MERGE requirement for the data load, and PySpark gave us the flexibility a Dataflow Gen2 doesn’t have.
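And here’s roughly what that incremental MERGE can look like, using Delta Lake’s Python API against a Lakehouse table. Again, just a sketch under assumptions: the table name (emr.encounters) and key column (encounter_id) are hypothetical stand-ins for whatever your EMR model uses.

```python
from delta.tables import DeltaTable

# Upsert the incoming Snowflake rows into the Lakehouse Delta table.
# "emr.encounters" and "encounter_id" are illustrative names only.
target = DeltaTable.forName(spark, "emr.encounters")

(
    target.alias("t")
    .merge(incoming.alias("s"), "t.encounter_id = s.encounter_id")
    .whenMatchedUpdateAll()     # refresh rows that already exist
    .whenNotMatchedInsertAll()  # insert brand-new rows
    .execute()
)
```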
However, PySpark can be extremely compute-heavy, which is definitely something to be aware of as a small to mid-sized healthcare company. Additionally, PySpark is reserved for the advanced data engineer, requiring more senior development skills to build and maintain the data pipelines.
I’m more familiar with low-code ETL tools such as Keboola, so this added complexity doesn’t resonate with me. However, Microsoft Fabric’s Copy Job option is promising. It took a few tries to troubleshoot and connect Snowflake to the Copy Job Fabric Data Factory connector.
A Little Troubleshooting
We kept getting an “invalid credentials” error in the Copy Job connection workflow. This was extremely frustrating, and it further highlights the immaturity of Fabric. In our experience thus far, the tool is quirky. However, we figured out a workaround: I was able to connect to the data source by first adding the connection via the Connections & Gateways admin settings.
After that, we configured the Fabric Data Factory connector by searching for the newly created connection and adding it to the Copy Job settings. This allowed us to execute the data pipeline.
What a ridiculous journey just to move some data!
See you next week
I’m excited to continue part 2 of our series on data modeling next week. We have some great insights to share on the complexities of net revenue reporting in healthcare.
We’re glad you could stick with us through a rather lackluster experience in Microsoft Fabric. Each experience comes with a large amount of learning, and we’re excited to continue sharing our knowledge and mistakes so you don’t have to learn the hard way.