Measuring AEO success

The metrics that tell you whether AI visibility work is moving, how to read mention rate and recommendation rate separately, and how to run before-and-after measurement that proves a change worked.

Most brands new to AI visibility measurement start by asking: "Are we in the answers?" That is a useful first question. It is not, on its own, a useful measurement framework. Whether you are mentioned and whether you are recommended are different things: they move at different speeds and respond to different levers. Conflating them produces a single number that can look fine while you are losing commercial ground.

This guide covers the metrics that matter, how to read them separately, and how to run measurement that proves whether a specific action actually changed something.

The primary metrics

There are three metrics that carry the most signal for AI visibility work.

Mention rate is the share of questions in your tracked question set where your brand is named in the answer. It counts presence without judging quality: a mention can be passing, sceptical, or clearly positive, and mention rate does not distinguish between them. It is useful as a baseline and as a ceiling check, because recommendation rate can never exceed mention rate; if your mention rate is low, your recommendation rate is low too. But a high mention rate is not a victory signal on its own.

Recommendation rate is the share of questions where your brand is actively named as a preferred choice, not just mentioned as an option or cited as a source. This is the metric most directly connected to commercial outcomes. A buyer who reads an AI answer recommending your brand is in a different position than one who reads an answer that mentions your brand alongside five others without ranking them. Track the two rates separately and always report recommendation rate on its own line; it is the one that moves pipeline.

AI share of voice is your recommendation rate relative to the named competitors in your category. If your recommendation rate is 30 percent but a competitor's is 60 percent, your absolute number alone hides the gap. Share of voice answers the relative question: of all the recommendations going to brands in your category, what portion goes to you? This is the number most useful for board and leadership reporting, because it maps onto the share-of-mind framing CMOs already use in traditional media.
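The three definitions reduce to simple counting. The following is a minimal Python sketch, assuming each scanned answer has already been labelled with the brands it mentions and the brands it recommends. The `Answer` record, the brand names, and the sample questions are hypothetical, and how an answer gets classified as a recommendation is left to whatever rubric you use. Share of voice is computed here as your recommendations divided by all recommendations given to tracked brands, which is one reasonable reading of the definition rather than the only one.

```python
from dataclasses import dataclass, field


@dataclass
class Answer:
    """One AI answer to one tracked question (hypothetical record shape)."""
    question: str
    mentioned: set = field(default_factory=set)     # brands named in any way
    recommended: set = field(default_factory=set)   # brands named as a preferred choice


def mention_rate(answers, brand):
    """Share of tracked questions whose answer names the brand at all."""
    return sum(brand in a.mentioned for a in answers) / len(answers)


def recommendation_rate(answers, brand):
    """Share of tracked questions whose answer recommends the brand."""
    return sum(brand in a.recommended for a in answers) / len(answers)


def share_of_voice(answers, brand, tracked_brands):
    """Brand's recommendations as a share of all recommendations to tracked brands."""
    total = sum(len(a.recommended & tracked_brands) for a in answers)
    ours = sum(brand in a.recommended for a in answers)
    return ours / total if total else 0.0


# Made-up scan of four questions on a single engine.
answers = [
    Answer("best crm for small teams", {"Acme", "Borealis"}, {"Borealis"}),
    Answer("acme vs borealis pricing", {"Acme", "Borealis", "Cobalt"}, {"Acme"}),
    Answer("most reliable crm", {"Borealis"}, {"Borealis"}),
    Answer("crm with offline mode", {"Acme", "Cobalt"}, set()),
]
tracked = {"Acme", "Borealis", "Cobalt"}

print(mention_rate(answers, "Acme"))             # 0.75
print(recommendation_rate(answers, "Acme"))      # 0.25
print(share_of_voice(answers, "Acme", tracked))  # 1 of the 3 recommendations given
```

In this made-up scan the brand is named in three of four answers but recommended in only one, which is the pattern a single blended number would hide.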

Why citations are a diagnostic, not the headline metric

Being cited by an AI engine means the engine retrieved your content as evidence for an answer and attributed it as a source. Citation is distinct from recommendation. You can be cited while a competitor is recommended, which means your content is being used to build a case that does not end in your favour.

Track citation separately as a diagnostic tool. A high citation rate with a low recommendation rate tells you something specific: your content is credible enough to be retrieved, but the synthesis judgment is not landing in your favour. The gap is usually in comparison evidence, proof of quality, or the breadth of third-party description. A low citation rate tells you the retrieval problem needs addressing first. These two diagnoses lead to different actions, which is why keeping the metrics separate matters.
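The same diagnosis can be made question by question. The following is a small Python sketch, assuming each scan result records whether your content was cited and whether your brand was recommended. The field names, bucket labels, and sample questions are hypothetical.

```python
from collections import Counter


def diagnose(cited: bool, recommended: bool) -> str:
    """Bucket one question by which stage is failing (labels are illustrative)."""
    if recommended:
        return "recommended"
    if cited:
        # Retrieved and attributed as a source, but the answer favours someone else.
        return "synthesis gap"
    # Not recommended and not even retrieved as evidence.
    return "retrieval gap"


# Made-up results for one brand on one engine.
results = [
    {"question": "best payroll tool for startups", "cited": True, "recommended": False},
    {"question": "payroll tool with tax filing", "cited": True, "recommended": False},
    {"question": "cheapest payroll tool", "cited": False, "recommended": False},
    {"question": "payroll tool for contractors", "cited": True, "recommended": True},
]

buckets = Counter(diagnose(r["cited"], r["recommended"]) for r in results)
print(buckets)
```

Counting the buckets shows which problem dominates: a large "synthesis gap" bucket points to comparison evidence and proof, while a large "retrieval gap" bucket points to the retrieval problem.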

Per-engine measurement

Different AI engines retrieve from different sources, apply different synthesis weights, and produce different recommendation patterns. A brand recommended confidently on one engine may be described vaguely or not named at all on another, even for the same question. Averaging across engines produces a number that hides the variation and makes it impossible to know where to act.

Measure per-engine: ChatGPT, Gemini, and Perplexity at minimum. For category questions where AI Overviews appear in Indian or global search, add that surface to your tracked set. The per-engine view tells you where your gaps are largest, which is where to focus the first round of improvement work.
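To see why the blended number misleads, consider this short Python sketch that groups the same recommendation flags by engine. The rows are made-up illustration data, and the tuple layout is an assumption about how scan results might be stored.

```python
from collections import defaultdict

# (engine, question_id, recommended) rows; values are invented for illustration.
rows = [
    ("ChatGPT", "q1", True), ("ChatGPT", "q2", True),
    ("ChatGPT", "q3", True), ("ChatGPT", "q4", False),
    ("Gemini", "q1", False), ("Gemini", "q2", False),
    ("Gemini", "q3", False), ("Gemini", "q4", False),
    ("Perplexity", "q1", True), ("Perplexity", "q2", False),
    ("Perplexity", "q3", True), ("Perplexity", "q4", False),
]


def recommendation_rate_by_engine(rows):
    """Recommendation rate computed separately for each engine."""
    hits, totals = defaultdict(int), defaultdict(int)
    for engine, _question_id, recommended in rows:
        totals[engine] += 1
        hits[engine] += int(recommended)
    return {engine: hits[engine] / totals[engine] for engine in totals}


by_engine = recommendation_rate_by_engine(rows)
blended = sum(r[2] for r in rows) / len(rows)
weakest = min(by_engine, key=by_engine.get)

print(by_engine)  # per-engine rates: 0.75, 0.0 and 0.5
print(blended)    # 5 of 12 overall, which hides the engine sitting at zero
print(weakest)    # Gemini
```

The blended figure sits in the middle and says nothing about the engine where the brand is never recommended, which is where the first round of work would go.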

The question set: fixed, versioned, and demand-grounded

Measurement is only meaningful if the question set stays constant across measurement cycles. If you change the questions between measurements, the before-and-after comparison is noise, not signal. You cannot tell whether a change in your recommendation rate reflects something you did or a change in what you were measuring.

Build your question set from real buyer demand: the questions your buyers actually ask when researching your category. This is described in more detail in "What is AEO and GEO". A question set built from genuine buyer demand captures the commercial surface you are trying to protect; a question set built from keyword guesses or branded queries captures a narrower and less informative picture.

Once the question set is set, version it. Record which questions were in each measurement run, so you can defend the comparison if it is challenged. The question set can evolve over time as buyer demand shifts, but changes should be deliberate and documented, not incidental.
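One lightweight way to version a question set is to store a fingerprint of it with every measurement run. The sketch below is an assumption about how you might do this, not a prescribed method: the function names, the record fields, the dates, and the choice to ignore case, order, and spacing are all illustrative.

```python
import hashlib
import json


def question_set_fingerprint(questions):
    """Short stable ID for a question set, insensitive to order, case and spacing."""
    normalised = sorted({" ".join(q.split()).lower() for q in questions})
    payload = json.dumps(normalised, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def assert_comparable(run_a, run_b):
    """Refuse a before-and-after comparison across different question sets."""
    if run_a["question_set_id"] != run_b["question_set_id"]:
        raise ValueError("runs used different question sets")


v1 = ["best crm for small teams", "most reliable crm"]
v2 = v1 + ["crm with offline mode"]   # a deliberate, documented addition

# Hypothetical run records; the dates are placeholders.
baseline = {"date": "2025-01-06", "question_set_id": question_set_fingerprint(v1)}
rerun = {"date": "2025-02-03", "question_set_id": question_set_fingerprint(v1)}
assert_comparable(baseline, rerun)    # same set, so the comparison stands
```

A run measured on `v2` would carry a different ID, so an accidental comparison against the `v1` baseline fails loudly instead of producing a misleading delta.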

Before-and-after measurement for single actions

The most useful measurement discipline is to take a baseline, make one change, wait the appropriate recovery period, and run the same question set again. One change at a time is the key constraint. If you publish a new comparison page, improve your FAQ schema, and launch a PR campaign in the same two-week window, and then your recommendation rate improves, you cannot tell which lever moved it.

The recovery period varies by lever. Changes that affect the retrieval stage, like publishing a new page or adding structured data, can show up in measurement within a few weeks once the page is indexed. Changes that affect parametric memory, like an earned Wikipedia mention or a sustained press campaign, take longer and are measured in months.

For a typical before-and-after on a content or schema change:

Baseline measurement on the fixed question set before the change.
Publish the change. Wait for indexing to complete, typically two to three weeks for a page on a well-indexed domain.
Re-measure the same question set.
Compare mention rate, recommendation rate, and AI share of voice between the two runs.
Note which questions moved, not just the aggregate. A change that moved a specific cluster of questions tells you the lever worked for that question type.

Document the result: which lever, which questions, what the change was, and how much the metric moved. That record is the foundation of a learning loop that becomes more precise over time, because each action teaches you which levers work in your category.
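The comparison step can be sketched in a few lines of Python. The sketch assumes each run is stored as a mapping from question ID to whether the brand was recommended. The function name, question IDs, and sample values are hypothetical.

```python
def compare_runs(before, after):
    """Compare two runs of the same question set, keeping per-question movement."""
    if before.keys() != after.keys():
        raise ValueError("question sets differ; the comparison is not valid")
    n = len(before)
    return {
        "rate_before": sum(before.values()) / n,
        "rate_after": sum(after.values()) / n,
        # Questions that flipped, not just the aggregate.
        "gained": sorted(q for q in before if after[q] and not before[q]),
        "lost": sorted(q for q in before if before[q] and not after[q]),
    }


# Made-up runs: True means the brand was recommended for that question.
before = {"q1": False, "q2": False, "q3": True, "q4": False, "q5": True}
after = {"q1": True, "q2": True, "q3": True, "q4": False, "q5": False}

result = compare_runs(before, after)
print(result["rate_before"], result["rate_after"])  # 0.4 0.6
print(result["gained"])                             # ['q1', 'q2']
print(result["lost"])                               # ['q5']
```

The aggregate moved from 0.4 to 0.6, but the per-question lists carry the lesson: two questions were gained and one was lost, and each list is worth recording alongside the lever that was pulled.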

Re-measurement cadence

Measuring once and then doing nothing with the result is common and wasteful. The value of measurement is in the loop: measure, act, re-measure, keep what worked.

A practical cadence for most brands:

Baseline scan when you start, to understand where you stand before any action.

Re-measurement after each material change: a new page, a schema update, a significant press placement, a round of review collection. Not every minor edit warrants a full re-measurement, but any action you are committing budget or time to should be paired with a re-measurement plan.

Periodic full scans, quarterly at minimum, to catch changes you did not cause: shifts in how an engine is answering category questions, new competitors entering the citation pool, or seasonal query pattern changes.

The measurement-to-execution playbook describes the full loop in more detail, including how to pick the cell worth fixing from a measurement result.

Questions

What is the difference between mention rate and recommendation rate?

Mention rate counts questions where your brand is named in the answer, regardless of how it is framed. Recommendation rate counts only questions where your brand is named as a preferred choice. You can be mentioned sceptically, listed among many options, or cited as a source without being recommended. Recommendation rate is more directly connected to whether the buyer considers you a frontrunner.

What is AI share of voice?

AI share of voice is your recommendation rate expressed relative to the named competitors in your tracked set. If three brands share a category and you receive 30 percent of recommendations in AI answers to category questions, your AI share of voice is 30 percent. It is the relative metric, analogous to share of voice in traditional media, and the most useful number for benchmarking and leadership reporting.

How many questions should be in my tracked question set?

The right size depends on your category's question complexity and buyer journey. Treat 20-30 questions as a floor: with fewer, a single answer flipping moves the recommendation rate by five percentage points or more, and the number is too noisy to compare across runs. More questions, covering different buyer stages and intent types, produce a richer and more stable picture. The practical ceiling is the cost and time of running scans; most brands settle on 50-150 core questions that represent their commercial question set well.
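To get a feel for how set size affects precision, you can put a confidence interval around a measured rate. The sketch below uses the Wilson score interval, a standard statistical technique that this guide does not itself prescribe. It treats each question as an independent trial, which is a simplification, since questions in a set are often related and AI answers vary from run to run.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a rate of successes out of n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# The same 30 percent recommendation rate measured on two set sizes.
small = wilson_interval(9, 30)     # 9 recommendations out of 30 questions
large = wilson_interval(45, 150)   # 45 recommendations out of 150 questions

print(small)
print(large)
```

Under these assumptions, a 30 percent rate on 30 questions is consistent with a true rate anywhere from roughly 17 to 48 percent, while on 150 questions the range narrows to roughly 23 to 38 percent. A small set is therefore best read as a directional signal, not a precise figure.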

How do I know which lever to pull after measuring?

Look at the questions where the gap is largest, specifically where your recommendation rate is low but your mention rate is not zero. Those are questions where the engine knows about you but is not recommending you, which usually signals a comparison evidence or proof gap. Questions where you are not even mentioned point to a retrieval or parametric gap. These different patterns lead to different actions, as described in the measurement-to-execution playbook.

Is AI share of voice comparable across different AI engines?

Not directly, because each engine retrieves from different sources and applies different synthesis judgments. Measure AI share of voice per-engine and report it separately. Your share on one engine may be higher or lower than on another, and the drivers of each difference are distinct. Averaging across engines produces a number that is neither actionable nor comparable.

How long should I wait between a change and re-measurement?

For changes that affect the retrieval stage, such as publishing a new page or adding structured data, two to four weeks after indexing is a reasonable minimum. For changes that primarily affect parametric memory, such as a Wikipedia update or a sustained press campaign, the effect is slower; three to six months is a more realistic window for meaningful parametric shift. Measuring too soon produces noise rather than signal.

Can I measure AI share of voice for free?

You can run spot checks manually by asking AI engines specific questions and recording the answers, which gives you a rough read. This approach is time-consuming, non-reproducible, and impossible to scale across the question set size needed for meaningful measurement. A structured measurement tool, such as AI Native, automates the scan across a fixed question set and produces consistent, comparable results across engines and time periods.

What should I do if my recommendation rate does not improve after a change?

First check that the page was indexed and that the engine is retrieving it, by inspecting the source layer in your scan results. If the page is being retrieved but not moving the recommendation rate, the gap is likely in the synthesis judgment, which weighs comparison evidence and proof from across sources. If the page is not being retrieved, the issue is at the retrieval stage, which requires domain authority or page quality work. See "How to get cited" for the retrieval-stage levers.