Why Your Data Team Needs a Bulletproof Data Dictionary
Data Catalogue Series #1: The foundation that prevents data disasters
Last week, I highlighted the importance of documentation in one of my LinkedIn posts. Data teams that fail to document end up with inconsistent definitions, little context, and a mound of code that produces wrong numbers.
When a data team builds something new, no one knows where to begin with documentation. It really comes down to a failure to prioritize organization. I've been part of data teams that move so quickly they can't keep up with documentation. I've also been part of data teams so concerned about their delivery output that they fail to document at all.
The result of avoiding documentation was a 30-table schema with no understanding of what actually drives the business. The backend data and models in highly customizable source systems become a pool of assumptions with no real confirmation from the business. One experience that is permanently part of my data nightmares involved a financial asset management system that we made zero effort to understand.
We dove into the source system's API and codebase without interviewing stakeholders or briefing ourselves on legacy reports and the fields the business relies on to build metrics and KPIs and to make decisions.
Building a Data Catalogue That Actually Works
As a data management company in healthcare, Steinert Analytics is committed to providing thorough documentation to all our clients. That's why we want to focus our content on building out a robust data catalogue for the next few weeks.
We'll cover a section of the data catalogue build in each issue, why it's important, and some tips to enhance your workflows. Let's kick this series off with the data dictionary.
What Exactly Is a Data Dictionary?
The IBM Dictionary of Computing defines a data dictionary as a "centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format". It is an extremely metadata-heavy piece of documentation that gives in-depth technical and business context to data points.
I've built data dictionaries for both Fortune 500 and start-up data teams. I agree with IBM's definition (big surprise ;) ). Typically, we use a Google Sheet or a Microsoft Excel workbook. I realize there are other tools, such as dbt Cloud's Data Catalog or dedicated data documentation software like Collibra. To be transparent, I haven't used or explored these tools extensively. The only fantastic data catalogue tool I've used is Keboola (we're certified implementation partners if you're interested).
However, from a cost-effectiveness standpoint, I like spreadsheets for data documentation & governance. They have a low barrier to entry and are familiar to everyone in the business. The risk is that, without a proper strategy and role assignment, documentation goes stale.
Essential Fields for Your Data Dictionary
Typical fields included in a dictionary are:
Field Name
Definition
Table
Schema
Data Type
Associated Report(s)
Source System
Transformation
Data Load Logic
Data Pipeline Name
Refresh Cadence
I realize this is technically heavy. A business user isn't going to understand or care about which database schema a data point lives in or what data type it has. So why is all this metadata important to document? For the data team, the IT department, and any administrators or users of these backend data systems, it's critical to have it documented. It gives context to the technical components and architecture of the system, so the engineers can actually build data solutions with confidence.
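To make the field list concrete, here is a minimal Python sketch of one dictionary row and a helper that writes rows out as CSV, which can then be imported into a Google Sheet or Excel. The class name, the helper name, and every example value are hypothetical and chosen only for illustration; the attribute names simply mirror the fields listed above.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class DictionaryEntry:
    """One row of a data dictionary; attributes mirror the typical fields."""
    field_name: str
    definition: str
    table: str
    schema: str
    data_type: str
    associated_reports: str
    source_system: str
    transformation: str
    data_load_logic: str
    data_pipeline_name: str
    refresh_cadence: str


def to_csv(entries):
    """Serialize entries to CSV text, with one header row and one row per entry."""
    buffer = io.StringIO()
    header = [f.name for f in fields(DictionaryEntry)]
    writer = csv.DictWriter(buffer, fieldnames=header)
    writer.writeheader()
    for entry in entries:
        writer.writerow(asdict(entry))
    return buffer.getvalue()


# Hypothetical example row; none of these values come from a real system.
example = DictionaryEntry(
    field_name="claim_amount",
    definition="Amount billed on a claim, in US dollars.",
    table="claims",
    schema="billing",
    data_type="NUMERIC(12,2)",
    associated_reports="Monthly Revenue; Denials Dashboard",
    source_system="Practice management system",
    transformation="Cast from string and rounded to 2 decimals",
    data_load_logic="Incremental on updated_at",
    data_pipeline_name="billing_daily_load",
    refresh_cadence="Daily",
)

print(to_csv([example]))
```

Keeping the row as a typed record rather than free-form text makes it harder to forget a column when someone adds a new field to the dictionary.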
Data Dictionary vs. Data Glossary: The Line Gets Blurry
Key point: A data dictionary IS NOT a data glossary.
A data dictionary's field definition is strictly the technical definition of a source system's data model, schemas, and the data generation that results from its workflows. A data glossary captures a field's definition based on how it's used to answer business questions in reporting and BI.
Now, I'd be lying if I said I stuck to this black-and-white distinction between a data dictionary and a glossary. Traditionally, these are two separate documents. I might get eaten alive for saying this - but I've been in many environments where the actual "data dictionary" asset is a hybrid between a data dictionary and a glossary.
This hybrid data dictionary contains all the relevant metadata, but the field definition describes strictly how the field is used in the business, not its technical description. Data experts - I'd love to hear your thoughts on this. I'm speaking strictly from what I've seen in real data product and engineering teams. Quite frankly, I like the consolidation as well because it's less to maintain.
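As a rough illustration of the hybrid approach, the sketch below keeps the technical metadata from the dictionary and swaps in the business definition from the glossary wherever one exists. All names and values are hypothetical, and falling back to the technical definition when no business definition exists is my own assumption for the example, not a rule.

```python
# Technical dictionary rows (hypothetical values for illustration only).
dictionary_rows = [
    {
        "field_name": "claim_amount",
        "data_type": "NUMERIC(12,2)",
        "definition": "Column claims.claim_amount populated by the billing export.",
    },
    {
        "field_name": "payer_id",
        "data_type": "INTEGER",
        "definition": "Foreign key to payers.payer_id.",
    },
]

# Business glossary: field name -> how the business uses the field (hypothetical).
glossary = {
    "claim_amount": "Total dollars billed to the payer for a claim, before adjustments.",
}


def build_hybrid(dictionary_rows, glossary):
    """Return new rows that keep the metadata but prefer the business definition."""
    hybrid = []
    for row in dictionary_rows:
        merged = dict(row)  # copy so the original technical rows are untouched
        # Assumption: fall back to the technical definition if no glossary entry exists.
        merged["definition"] = glossary.get(row["field_name"], row["definition"])
        hybrid.append(merged)
    return hybrid


hybrid = build_hybrid(dictionary_rows, glossary)
for row in hybrid:
    print(row["field_name"], "->", row["definition"])
```

The trade-off is the one described above: a single asset is less to maintain, but the purely technical description is no longer stored in the definition column.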
A Few Tips For Maintaining a Data Dictionary Manually
I'm not going to sit here and tout a manual data dictionary as the perfect solution. In fact, as I learn more about established data governance programs, I'm finding it may be a poor choice. If you don't have a designated analyst or data custodian constantly keeping documentation up to date, it's almost guaranteed to go stale.
Expensive data documentation and governance tools aren't the only alternative, either. Automating a Google Sheet so that it's integrated with your data integration workflows is something I hadn't thought of. Shout out to Sebastian Hewing for this idea!
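One way the automation idea could look, sketched under heavy assumptions: pull table, column, and data type metadata straight from the database, then merge it with the definitions people have already written so nothing gets overwritten. I use an in-memory SQLite database here purely so the example runs anywhere; the table, the function names, and the `needs_definition` flag are all hypothetical, and pushing the rows to a Google Sheet is deliberately left out.

```python
import sqlite3


def extract_columns(conn):
    """Return one dict per column found in the database's user tables."""
    rows = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        for _cid, name, data_type, *_rest in conn.execute(
            f'PRAGMA table_info("{table}")'
        ):
            rows.append({"table": table, "field_name": name, "data_type": data_type})
    return rows


def refresh_dictionary(conn, existing_definitions):
    """Merge live metadata with human-written definitions keyed by (table, field)."""
    refreshed = []
    for column in extract_columns(conn):
        key = (column["table"], column["field_name"])
        column["definition"] = existing_definitions.get(key, "")
        # Flag new or undocumented columns so the owner knows what to fill in.
        column["needs_definition"] = column["definition"] == ""
        refreshed.append(column)
    return refreshed


# Hypothetical source table standing in for a real warehouse.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE claims (claim_id INTEGER PRIMARY KEY, claim_amount REAL, submitted_at TEXT)"
)

existing_definitions = {
    ("claims", "claim_amount"): "Amount billed on a claim, in US dollars.",
}

dictionary = refresh_dictionary(conn, existing_definitions)
for row in dictionary:
    print(row)
```

Run on a schedule alongside the pipelines, something like this would keep the technical columns current while leaving the definitions in human hands.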
However, pulling from my own experiences, here are a few tips and considerations to keep in mind:
1. Decide on a data catalogue owner
This may seem obvious, but don't begin creating documentation until you've established who will update it consistently. The need for accountability is paramount and without clearly defined owners, documentation is guaranteed to go stale.
Typically, as a lead analyst, I've been the one in charge of keeping it updated. In larger data teams, the product manager is also a good person to handle this.
2. Block out time monthly or weekly to review and update
Data pipelines change often. There are modifications, additional pipelines, and new data sources that need to be logged. If you don't have set times to update documentation, you are guaranteed to fall behind.
I typically review and update documentation once every two weeks.
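If the dictionary carries a review date per row, a small check can surface what is overdue before each review session. This is only a sketch: the `last_reviewed` column is a hypothetical addition to the dictionary, the example rows are made up, and the 14-day interval just mirrors the two-week cadence mentioned above.

```python
from datetime import date, timedelta

# Matches a two-week review cadence; adjust to your own schedule.
REVIEW_INTERVAL = timedelta(days=14)


def stale_entries(entries, today, interval=REVIEW_INTERVAL):
    """Return the entries whose last review is older than the interval."""
    return [e for e in entries if today - e["last_reviewed"] > interval]


# Hypothetical dictionary rows with a made-up `last_reviewed` column.
entries = [
    {"field_name": "claim_amount", "last_reviewed": date(2024, 3, 1)},
    {"field_name": "patient_id", "last_reviewed": date(2024, 3, 20)},
]

overdue = stale_entries(entries, today=date(2024, 3, 25))
for entry in overdue:
    print(entry["field_name"], "last reviewed", entry["last_reviewed"].isoformat())
```

Passing `today` in explicitly keeps the check easy to test and lets you preview what will be overdue by the next session.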
3. Start small and build incrementally
Don't try to document your entire data ecosystem in one go. Pick your most critical tables and fields first - the ones that power your key business metrics and reports. Build out these core definitions thoroughly before expanding to secondary datasets.
I've seen too many teams get overwhelmed trying to document everything at once, only to abandon the effort halfway through. Start with what matters most to the business and expand from there. Your users will thank you for having accurate documentation on the fields they actually use, rather than incomplete documentation on everything.
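One simple, admittedly crude way to decide where to start is to count how many reports depend on each field and document the most-used fields first. The report names, field names, and the idea of ranking by report count are all assumptions for this sketch; your own definition of "critical" may weigh some reports more heavily than others.

```python
from collections import Counter

# Hypothetical mapping of report name -> fields that report depends on.
report_fields = {
    "monthly_revenue": ["claim_amount", "payer_id"],
    "denials_dashboard": ["claim_amount", "denial_code"],
    "payer_mix": ["payer_id", "claim_amount"],
}


def rank_fields(report_fields):
    """Rank fields by how many reports use them, most-used first."""
    counts = Counter(
        field
        for used_fields in report_fields.values()
        for field in set(used_fields)  # count each field once per report
    )
    # Sort by descending count, then alphabetically to keep ties stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


for field, report_count in rank_fields(report_fields):
    print(field, report_count)
```

The top of that list is a reasonable first batch to define thoroughly before moving on to secondary datasets.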
Making It Stick
The reality is that documentation is only as good as its adoption and maintenance. The best data dictionary in the world is useless if it's buried in a folder no one can find or is six months out of date.
Make your data dictionary easily accessible - pin it to your team's workspace, include links in your standard operating procedures, and reference it in onboarding materials. When new team members join, walk them through it. When business users ask questions about data definitions, point them to the dictionary first.
Most importantly, treat your data dictionary as a living document. It should evolve with your data architecture and business needs. The moment it becomes static is the moment it becomes obsolete.
Next week, we'll dive into data lineage documentation - because knowing what your data means is only half the battle. Understanding where it comes from and how it flows through your systems is the other half.
What's your experience with data dictionaries? Are you team hybrid or do you keep strict separation between dictionaries and glossaries? Hit reply and let me know - I read every response.
Christian Steinert is the founder of Steinert Analytics, helping healthcare & roofing organizations turn data into actionable insights. Subscribe to Rooftop Insights for weekly perspectives on analytics and business intelligence in these industries.
Feel free to book a call with us here or reach out to Christian on LinkedIn. Thank you!
Love this. I also try to include relevant sections of the data glossary/dictionary in my Power BI reports. I include a 'plain English' description as well as the relevant code, e.g. the SQL used in a stored procedure and/or the DAX behind a measure. This code bit helps future me answer any questions from report users that the plain English description might not have answered for them.