Data leaders from HubSpot, Justworks, and The Movement Cooperative unpack the evolving role of governance in building trustworthy, AI-ready data systems.
Trust is the linchpin of the data team’s relationship with the rest of the company. It’s no surprise, then, that our State of Data Teams report found that 84% of data leaders consider data quality and reliability their highest area of focus.
But there’s a surprise just below the surface: 50% of data leaders have invested in a semantic layer, yet many are skeptical that semantic layers are actually helping. Of the leaders whose semantic layer is part of their BI tool, for example, 57% are considering a change in tools this year.
To find answers, we brought together three data leaders from organizations of varying sizes:
Tony Avino from HubSpot, a CRM platform with 8,000 employees and a 150-person centralized analytics team.
Mona Khalil from Justworks, a payroll and PEO platform with over 1000 employees and a 40-person centralized data team.
Bella Wang from The Movement Cooperative, a progressive nonprofit providing data tech infrastructure for over 90 members.
Our participants' answers have been lightly edited for clarity.
Tony Avino, HubSpot: We think about data governance and consistency centrally across our department, in a somewhat democratized way, but we also align on the standards we want to adhere to from an analytics engineering perspective.
That can include things like data cataloging: What tool of choice do we consider for cataloging? What tools do we consider for semantic modeling? More recently, with AI, it's about accelerating how quickly we need to move to audit the quality of our data: How well is it documented? How many assets are out there that might be duplicative? How many things are managed or monitored?
We've been going through that this year, thinking about how we adhere to these standards so that other capabilities, like AI and LLMs, can start to consume from them, and other analyst groups can too. We've been thinking about governance in a variety of ways, but I think AI has been a very important accelerator pushing us to move quicker and to control the quality of our data.
Mona Khalil, Justworks: We're a very wide business. For anything I'm talking about here, think 10,000-plus tables representing all the business processes from first-party sources alone. We're not even counting the third-party sources yet.
Accessibility is top of mind, and part of that is minimizing the cognitive load needed to answer questions: knowing whether people are talking about the same thing and where to find the right assets should all come with relative ease as part of data governance. First and foremost, consistency should be a tool that makes things easier for the business.
Bella Wang, The Movement Cooperative: We have a complex web on both the data intake side and the usage side, particularly around data sharing and the types of data that come in.
On the intake side, a lot of our members use very similar tools: there might be 20 to 40 tools, of which any given member uses maybe 5 to 10. Many of these tools serve similar purposes, so they may share underlying concepts, but they use different pipeline styles, different data models, and different schemas. We need to figure out how to wrangle that kudzu of tools and data.
On the other side, compliance and data sharing are some of the main things we need to think about, because a lot of our members work with each other. They might collaborate on a project for a few months, partner on something long-term, or share affiliates. We have to figure out how to wrangle all of this data in a way that allows a clear conceptual combination of these tools, while keeping the sharing of that data very privacy-forward and security-forward.
We have a very warehouse-centric approach where we try to make sure that we have a central backbone where data is tagged accurately, where we know what the data should look like, what the quality should be, and what kinds of metrics people might be using out of it.
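For readers who want to picture what that tagging can look like in practice: if the central backbone lives in a dbt project (an assumption here, not something Bella specifies), a schema file might mark shared assets and flag sensitive columns roughly like this sketch, with hypothetical model, column, and metadata names.

```yaml
# models/core/schema.yml -- hypothetical sketch of tagging a warehouse backbone
models:
  - name: dim_members               # hypothetical shared-backbone model
    description: Central member dimension shared across cooperative programs.
    config:
      tags: ["shared_backbone"]     # marks assets that belong to the central spine
    columns:
      - name: member_id
        description: Stable identifier for joining member data across tools.
      - name: email
        description: Member contact email.
        meta:
          contains_pii: true        # custom metadata a privacy review can key off
          sharing: restricted       # downstream tooling can read this to gate access
```

Column-level `meta` is free-form, so privacy and sharing rules can be expressed wherever governance needs them and then read programmatically by downstream tooling.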
Mona Khalil, Justworks: We're starting with the most business-critical metrics and moving outward. Think industry-standard financial calculations: revenue, ARR, you name it.
Over the life of the business, those calculations are essentially never going to change, and the implications of getting them wrong or breaking from the standard calculations are quite high. Those are the best places to start building out semantic models and then testing your AI tools on top of them, because it's unlikely they'll change at any point in the coming year.
After that, move outward to other metrics around the business that are industry standard; that's our step two. Think metrics that have a standard calculation, that aren't as critical to your business's bottom line, but that aren't likely to change in any reasonable amount of time.
We have to figure out whether the semantic layer is actually going to be a good use of our time after we put this in place and lift and shift everything over to that model. That's an open question. I think if you're starting from scratch, that's probably the way to do it, but I'm still skeptical. I couldn't tell you the exact threshold of what belongs in a semantic model. I think it's definitely a great tool for lowering the barrier to entry for using SQL and building some type of AI on top of it, though.
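As a concrete illustration of starting with stable, business-critical metrics, here is a minimal sketch of how a revenue metric could be pinned to one sanctioned definition in dbt's semantic layer (MetricFlow); the table, entity, and column names are hypothetical, not Justworks' actual models.

```yaml
# models/marts/revenue.yml -- hypothetical sketch of a business-critical metric
semantic_models:
  - name: orders
    model: ref('fct_orders')        # hypothetical fact table
    defaults:
      agg_time_dimension: ordered_at
    entities:
      - name: order_id
        type: primary
    dimensions:
      - name: ordered_at
        type: time
        type_params:
          time_granularity: day
    measures:
      - name: order_amount
        agg: sum
        expr: amount                # a standard calculation, unlikely to change

metrics:
  - name: revenue
    label: Revenue
    description: Total order revenue; the single sanctioned definition.
    type: simple
    type_params:
      measure: order_amount
```

Because every downstream tool, and any AI layered on top, queries the same definition, a metric like this only needs to be encoded once, which is exactly why the stable ones are the cheapest place to start.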
Bella Wang, The Movement Cooperative: We're really in the early stages of figuring out how to do this, how to build concepts that can extend across multiple organizations and different kinds of programs – concepts that allow our stakeholders to weigh in and control their data.
A delicate balance we're always managing is how to make it easier for people to collaborate with each other while still letting them own their data, their metrics, and their programs. We have all different kinds of data, so we have to figure out how to combine and stack those kinds of data in a way that lets us ask: what are the metrics?
Tony Avino, HubSpot: More recently, we have been focused on how semantic layers could unlock the potential of interfacing with AI and LLMs. We're thinking about it as a scaling unlock, so we want to back into making sure AI has the right context to answer those questions confidently. We've seen problems today with AI hallucinating, which has put an emphasis on how semantic layers could be leveraged to mitigate some of that.
I think the prerequisite for the semantic layers we've been working on this year has been getting the sources of truth and the building blocks in place. Across each of our disciplines – sales, marketing, product, etc. – there are curated sources of truth that we've started to hyperfocus on elevating and promoting to be eligible for a semantic layer. We've been defining the metrics that back into those sources of truth, then trying to isolate them from all the noise of conflicting assets and metrics. We've been doing a lot of that, which I think is groundwork for any semantic layer.
What's been helpful along the way is digging into each of the functions and recognizing which of the questions they ask might be easiest to address and build with the semantic layer.
Tony Avino, HubSpot: Today, the thing we're trying to guard against is this: if we're using AI to generate an answer, is it the right answer? We don't want AI to generate just any answer, so how do we make sure we're confident in the output it's producing?
In situations where it doesn't have enough context, it can stay grounded and say, “Actually, I don't know,” or “I haven't been asked that question before.” At least that gives us feedback to figure out: do we not have a source for that question? Is it not well documented?
There's an important feedback loop here that we're trying to back into. We're an Atlan house, so our data cataloging covers all the core assets we'd want AI to consume from, but we want to make sure the content filled in there has meaningful details that an LLM can actually reason with. We've been getting stricter and stricter about what that bar is. And then, on the dbt side, we want to make sure that our columns are well-defined, that our YAML files are properly architected, and that we have the proper tests.
That's been what we've been building up to, effectively having a core centralized set of assets that the semantic layer can read from, and then meeting that threshold across a checklist of things – documentation, structure, what have you – that could then be relied on and elevated to a semantic layer.
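To make that checklist concrete, the bar Tony describes for a dbt asset might look something like the sketch below: documented columns plus tests before a model is eligible for the semantic layer. The model and column names are hypothetical, not HubSpot's actual assets.

```yaml
# models/marts/schema.yml -- hypothetical bar for a curated source of truth
models:
  - name: fct_deals                  # hypothetical curated sales asset
    description: >
      Source of truth for closed deals, documented and tested to the standard
      required before the semantic layer or an LLM is allowed to consume it.
    columns:
      - name: deal_id
        description: Surrogate key for a deal; one row per deal.
        tests:
          - unique
          - not_null
      - name: closed_at
        description: Timestamp when the deal was marked closed-won.
        tests:
          - not_null
      - name: deal_amount_usd
        description: Deal value in US dollars at close.
        tests:
          - not_null
```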
Mona Khalil, Justworks: To get to a point where we could start testing a tool like that, we leaned heavily into using AI to enforce those standards. We rebuilt our staging layer with AI by first putting a number of metadata standards in place at that level, ensuring consistency of keys, data types, and naming conventions across almost a hundred different data sources.
We put that in place in a span of weeks instead of one or more quarters. Then we used it to build successive layers of documentation: we started with documentation on the schema, then the tables, then used that as input for the columns, and finally populated it all in Atlan using their API.
I think it's very helpful for foundational pieces. If you're able to incorporate a human in the loop for review purposes at the final stage, it's great for accelerating and getting everything on the same page so you can even begin to think about that innovation.
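One way staging-level standards like these can be made machine-enforceable, rather than just written down, is with dbt model contracts, which fail the build whenever keys, data types, or column names drift. The sketch below uses hypothetical source and column names, not Justworks' actual conventions.

```yaml
# models/staging/schema.yml -- hypothetical contract enforcing staging standards
models:
  - name: stg_billing__payments     # naming convention: stg_<source>__<object>
    config:
      contract:
        enforced: true              # the build fails if columns or types drift
    columns:
      - name: payment_id            # convention: <entity>_id keys on every table
        data_type: varchar
        constraints:
          - type: not_null
      - name: payment_amount
        data_type: numeric(18, 2)   # one sanctioned type for currency amounts
      - name: created_at            # convention: _at suffix for timestamps
        data_type: timestamp
```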
Tony Avino, HubSpot: Frame your approach around a crawl, walk, run setup (learn more about HubSpot's approach here). We're running parallel paths right now: we're experimenting with AI, but we're still building our data foundation. As you go through those steps, don't jump straight to AI. Sure, you can connect a dataset to an LLM and get a response, but there's an accuracy balance you have to play out there. I'd say start with the foundation: How do we make sure we know what our sources of truth should be? How do we get those sources of truth adhering to the standards? And what are those standards?
Along the way, make sure you're using AI to become more efficient at filling in those standards and consuming the information. There's still human intervention here, especially in the current state today: take a trust-but-verify approach to whatever AI generates, and make sure you can build from that.
Don't get over your skis by jumping straight to AI; understand there's a set of building blocks you have to work through before you can unlock the true value of what an LLM could do for your space.
Mona Khalil, Justworks: I'd love to throw a wrench in your question of balance as a spectrum of flexibility and control. Really think about what your end state is – in many ways, that is the trust and accessibility of your data. If you're working toward that, it's a lot easier to think about what components of a governance strategy will actually help move you toward that faster. You might not be ready for comprehensive standards in certain areas of the data that you collect, but it's more valuable to start working with your engineering team to align on standards that are most useful for bringing value to your data downstream.
Maybe don't start with the semantic layer for AI, but instead use it to put your foundation in place and just make your data more trustworthy – going back to the adage “garbage in, garbage out.” AI helps you go a long way a lot faster, so make sure that your input is of better quality and keep your eye on the prize of usability and time to insight for data.
Work with folks so that you can demonstrate how a governance strategy saves them time, gives them better information, and brings value to the business. Too often, business audiences see data governance as something that's being enforced upon them and not something that's in place to help them. I love the idea of making that a partnership. It's not an ivory tower. It's a team sport.
To build an effective semantic layer, start with the most stable, most queried metrics and build outward from there.
Success with AI requires a foundation in high-quality data governance.
Data governance can benefit everyone in the company.
Any investment in data quality or data trust is an investment in AI, because the models benefit from the same documentation and controls that governance puts in place. Hex enables data teams and business users to collaborate, tightening the feedback loop between teams and allowing value to compound.
And check out the rest of our State of Data Teams live event series here!