
You're probably evaluating AI analytics tools wrong

Carlos Aguilar, Hex's Head of Product, shares why traditional checklist-based evaluations fail for AI — success depends on context management and real-world user interaction, not feature comparison.


Given how impactful generative AI is for analytics, it has taken surprisingly long for conversational analytics to actually work. It feels like we're just now hitting a turning point where it's starting to unlock self-service for business users (Hex launched Threads a little over a month ago!).

There are probably a dozen or more products that claim to let non-technical users answer analytical questions with AI, and teams are struggling to figure out how to compare them. BI evaluations used to be long lists of features you just needed to “check the box” on. This works well for deterministic software, but LLMs are not deterministic, and we need to rethink how we evaluate these tools and how effective they will be with actual business users.

The challenge right now is that the context used by AI analytics tools can be very different, the mechanisms for setting the tools up are different, and the user interfaces for interacting with data are different. To account for all of these differences, you should keep the full workflow in mind — from the data team through end-users — as you evaluate tools.

Given this shift, it’s the data team who should be driving evaluations and weighing the following two workflows:

  • End-user experience

    How well does it answer questions and explain answers to non-technical users? You should be using real users and real questions.

  • Data team experience

    What is the process and system for monitoring question quality, managing context, and improving answers over time? Consider the data team's experience managing the platform in addition to the end-user experience, since they will be responsible for maintaining the quality of the system.

The obvious AI evaluation (and why it doesn’t work)

The obvious way to evaluate these interfaces is to take some sample questions, feed them into each system, and grade the answers. This seems so reasonable and fair! But this approach doesn’t actually test either of the crucial criteria I listed above and has numerous other shortcomings.

Why this approach doesn’t work:

  • It doesn’t test the workflow for improving context or observing and monitoring answer quality over time.

  • It doesn’t test how well a system that is optimized for one set of questions generalizes across a domain.

  • It doesn’t test your ability to adjust context and improve answers over time. How will the data team operationalize improving answers over time and correcting them?

  • It doesn’t test the large differences between tools in usability and user experience, for both technical and non-technical users.

    • One example I’ve seen is an interface that forces users to select which table they would like an answer from. That can improve answers in a contrived evaluation, where the data team already knows which table each question should be answered from. But with real users asking real questions, that same UI will likely cause failures.

How to actually evaluate conversational AI

Having gone through the AI setup at a bunch of organizations, I want to share my pitch on how to get this right. My take: evaluations should help you get a real sense for how conversational AI will look at your organization.

Let’s go back to the criteria above:

  • Testing with real users

  • Testing the system of observation, quality management, and improvement

For each tool, I would:

1. Select five reference questions

Pick reference questions for each domain you want to evaluate. Ideally, these reference questions should use two or three reference tables or data models. Five is probably a lower bound; if you have the time, ten to fifteen questions might be worth it. These reference questions are used to properly set up context in each system; they are not used as evaluation criteria to test systems.

💡How to pick reference questions

For each domain, you should pick reference questions of varying complexity. You should include simple, critical questions like “what is our revenue?” and more nuanced (and even subjective) questions like “what marketing campaigns performed best last year?”

It may also be worth adding reference questions that cannot be answered with existing data to test the system’s ability to reject them correctly.
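
To make this concrete, here is a minimal sketch of what a reference-question set might look like, assuming you track each question's domain, complexity, the tables it should draw on, and whether it's answerable at all. The field names and example questions are illustrative, not something any particular tool requires.

```python
# A hypothetical reference-question set for one domain ("marketing").
# Field names are illustrative; adapt them to however your team tracks evaluations.
reference_questions = [
    {
        "domain": "marketing",
        "question": "What is our revenue?",
        "complexity": "simple",
        "expected_tables": ["orders"],
        "answerable": True,
    },
    {
        "domain": "marketing",
        "question": "Which marketing campaigns performed best last year?",
        "complexity": "nuanced",  # subjective: "best" needs a definition in context
        "expected_tables": ["campaigns", "attributed_orders"],
        "answerable": True,
    },
    {
        "domain": "marketing",
        "question": "What is our customer churn by region?",
        "complexity": "simple",
        "expected_tables": [],
        "answerable": False,  # no churn data exists; the tool should decline to answer
    },
]
```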

2. Improve context

Ask these reference questions in the conversational AI tool you’re evaluating, and add context to the tool to ensure that it gets the reference questions correct. Depending on the tool, this process could look somewhat different: the context might be reference queries, a semantic model, or free-text rules files. If it’s hard or impossible to add context to the system to get to the right answers, this is valuable information for evaluating tools!
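
What “context” looks like varies a lot by tool, so the sketch below is only a generic shape, assuming a tool that accepts reference queries and free-text rules. The exact mechanism (semantic model, rules file, saved queries) depends on the product you’re evaluating, and the dict structure here is purely illustrative.

```python
# Hypothetical context artifacts you might register with a tool during setup.
# Real tools expose this as semantic models, rules files, or saved reference
# queries rather than a Python dict; this only shows the kind of information involved.
context = {
    "reference_queries": {
        "What is our revenue?": """
            SELECT SUM(amount) AS revenue
            FROM orders
            WHERE status = 'completed'
        """,
    },
    "rules": [
        "Revenue always means completed orders only; exclude refunds and test accounts.",
        "'Last year' means the previous calendar year, not trailing 12 months.",
    ],
}
```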

3. Test with real users and real questions

Invite two or three business users who have questions about the domain you are testing. Have them test the tool with questions they naturally have in the course of their day-to-day jobs. These should be questions they genuinely want answered, not scripted test cases.

4. Monitor the responses

It’s also worth evaluating the tool’s observability and monitoring. In the real world, you’re going to want to know what types of questions users are asking, flag incorrect answers, and use those flags to improve context.

When evaluating tools, I wouldn’t over-optimize on the exact style of questions to include: just have users ask real questions (or use questions that come inbound to the data team, or land in the data-request Slack channel, if you have one).

The key is to use authentic questions that your actual users ask, rather than contrived test cases. Have real users ask the same questions in multiple tools and compare which responses they find most valuable and useful in their day-to-day work. Also, have the data team judge the accuracy of the answers and assess the workflow for raising issues with them.
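
As a rough illustration of what you’d want to capture while users test (and later, in production), here is a minimal record per question. The fields are assumptions about what tends to be useful for monitoring and flagging, not a schema any specific tool provides.

```python
from dataclasses import dataclass, field
from datetime import datetime

# A hypothetical log record for each question asked during the evaluation.
# Capturing these fields makes it possible to flag bad answers and route
# them back to the data team for context fixes.
@dataclass
class QuestionLog:
    question: str
    tool: str                      # which tool under evaluation answered it
    asked_by: str                  # the business user who asked
    answer_summary: str            # short description of what the tool returned
    flagged_incorrect: bool = False
    flag_reason: str = ""
    asked_at: datetime = field(default_factory=datetime.now)

logs = [
    QuestionLog(
        question="Which campaigns performed best last year?",
        tool="tool_a",
        asked_by="maria@example.com",
        answer_summary="Ranked campaigns by spend instead of attributed revenue",
        flagged_incorrect=True,
        flag_reason="Used the wrong metric for 'best'",
    ),
]
```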

Evaluation criteria

Your evaluation should be broken into two components:

  • Rating both the quality and relevance of answers during the test

  • Rating the quality of the workflow for monitoring and improving answers

Answer evaluation

After a few days of gathering and testing questions, both the data team and end-users should rate responses:

  • We want the answers to these questions to be accurate (rated by the data team). The data team should evaluate accuracy on a consistent rubric, but I’ll leave it for another post to define exactly how to think about accuracy.

  • We want answers to be relevant (rated by end-users).

  • Data teams should also rate question quality and priority.

    • Is this question one that you want to optimize the system for? Sometimes question quality is low, and you’ll want to weight that question less (one way to fold this weighting into the ratings is sketched below).
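
One way to combine these ratings is a simple weighted score per tool: accuracy from the data team, relevance from the end-user, and a priority weight that down-weights low-quality questions. This is a sketch of the arithmetic only; the 1–5 scales, the weights, and the tool names are assumptions you’d replace with your own rubric.

```python
# Hypothetical ratings collected after the test period. Scores use a 1-5 scale
# and "priority" down-weights questions the data team doesn't want to optimize
# for; all of these conventions are assumptions, not a fixed rubric.
ratings = [
    {"tool": "tool_a", "accuracy": 5, "relevance": 4, "priority": 1.0},
    {"tool": "tool_a", "accuracy": 2, "relevance": 5, "priority": 0.5},
    {"tool": "tool_b", "accuracy": 4, "relevance": 3, "priority": 1.0},
    {"tool": "tool_b", "accuracy": 3, "relevance": 4, "priority": 0.5},
]

def weighted_score(rows, tool):
    """Priority-weighted average of accuracy and relevance for one tool."""
    rows = [r for r in rows if r["tool"] == tool]
    total_weight = sum(r["priority"] for r in rows)
    combined = sum(r["priority"] * (r["accuracy"] + r["relevance"]) / 2 for r in rows)
    return combined / total_weight

for tool in ("tool_a", "tool_b"):
    print(tool, round(weighted_score(ratings, tool), 2))
```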

Workflow evaluation

The quality of a conversational system is not a fixed attribute; it will change continuously as the system evolves. The workflows below are often overlooked, but they are a critical part of actually putting these types of tools into production.

  • How easy is it for people to share analytical conversations with the data team or other experts?

  • How easy is it to debug or extend an analysis started by an agent?

  • How easy is it to fix an incorrect answer when an end-user reports one?

  • How strong are the observability workflows? Can you see and flag incorrect responses to the data team?

  • How well does the tool integrate with the rest of your tools and ecosystem?

I get that this evaluation approach is more work than running ten sample questions through each tool and picking the one with the most green checkmarks. But that's kind of the point. If you can't invest a week to test these tools with real users and real questions properly, you're probably not going to invest the ongoing effort needed to make them work in production either.

This evaluation approach — reference questions to set up context, real users asking authentic questions, and serious evaluation of the monitoring and improvement workflow — will tell you within a few days whether a tool is actually going to work at your organization. You'll see how business users interact with the interface. You'll experience what it's like for the data team to manage context and fix issues. You'll avoid the trap of buying a tool that nails the demo but crashes when it hits reality.

Conversational analytics feels like it's finally crossing over from an interesting experiment to something that can be operationalized. The evaluation process needs to catch up. The shallow checklist approach doesn't work here, and the sooner data teams recognize that, the better chance these tools have of delivering on their promise.
