Measuring the Right Thing

The internet era gave us a single metric for value: engagement. The AI era does not give us a single replacement. It gives us two — one for each mode of cognitive work. The error most AI products are making is not optimising for the wrong metric. It is failing to ask which metric applies before they optimise at all.

What engagement actually measured

In the distribution era, engagement was a reasonable approximation of value. The logic was coherent: if users voluntarily spent time on a platform, the platform was delivering something worth their attention. Time-on-app, session depth, daily active users, notification open rates — these metrics tracked well enough with value creation that two decades of product thinking organised itself around them.

But engagement was always a proxy, never the underlying reality. And two inconvenient truths were already visible within the internet era itself. First, a user scrolling endlessly through a content feed was often not finding value — she was failing to find it. High engagement frequently masked product failure dressed as success. Second, the mechanism through which engagement was sustained was often friction, not delight — unclear navigation, buried content, dark patterns that extended sessions without extending satisfaction.

The internet era's engagement metrics measured attention captured. They said nothing about whether that attention was exchanged for something worth having. This ambiguity was manageable when distribution was the product. It becomes structurally problematic when cognition is the product — because cognitive work is not uniform, and the value of engagement depends entirely on what kind of cognitive work is being performed.

two modes, two measurement logics

Cognition is not a single activity

Part III of this series introduces the input certainty framework in full. But its core implication for measurement is simple: cognitive work exists on a spectrum defined by how clearly the problem is specified before work begins.

At one end, the problem is open. The mandate is to discover, frame, and structure something that does not yet exist — a market thesis, a platform strategy, a product concept. This is generative discovery. At the other end, the problem is closed. The mandate is to convert a well-specified brief into a precise, high-quality output as efficiently as possible — a compliance review, a trade execution memo, a structured decision package. This is deterministic execution.

These are not just different tasks. They have fundamentally different value structures — which means they require fundamentally different measurement frameworks. Applying a single metric logic across both produces a product that is miscalibrated for at least one of them, and often both.

Generative discovery

Where engagement has value

The iterative process is the work product in its formative state. A venture investor constructing a sector thesis, a chief strategy officer stress-testing a platform shift, a management consultant mapping an undefined problem space — these professionals are paid to think through problems without clean edges.

For them, AI is a thought multiplier. Depth of engagement is directionally positive — provided it produces genuine novelty, surfaces non-obvious options, or constructs frames that did not previously exist. The qualifying condition is output quality, not session length.

Deterministic execution

Where engagement becomes cost

The problem is specified before engagement begins. A capital markets analyst with a structured mandate, a risk officer reviewing against a compliance framework, an operations manager processing a defined brief — these users are not paying for dialogue. They are paying for the output.

For them, AI is a compression engine. Prolonged engagement signals failure: the system demanded cognitive labour from the user instead of absorbing it. Every additional prompt is a cost, not a feature.

The inversion — scoped correctly

In deterministic execution, the internet-era metric logic inverts completely. Where the distribution era rewarded more — more time, more sessions, more depth — the AI compression mode rewards less. Fewer prompts, shorter sessions, faster time to a decision-ready artifact. The user's goal is not to engage with the system. It is to disengage from it, output in hand, and go execute in the real world.

The clearest articulation of this is the McKinsey dynamic. A major strategy engagement does not ask the client to participate in the analytical process. The consultants absorb the complexity, run the cognitive work internally, and return a clean set of prioritised action directives. The client's engagement with the deliverable is minimal by design — and that minimal engagement is precisely what the client is paying for. The AI equivalent of McKinsey is not a brilliant conversationalist. It is a system that does the same: ingests messy inputs, processes silently, returns executable outputs with minimum demand on the user's time.

"In deterministic execution, the North Star is not how long the user stays. It is how quickly they leave — output in hand — and how reliably they return with the next problem."

In generative discovery, no such inversion applies. The internet era's instinct — that depth of engagement correlates with depth of value — is not wrong here. It simply requires qualification. Engagement must be paired with output quality to be meaningful. A long session that produces a genuinely novel strategic frame is leverage. A long session that cycles through obvious options and confirms pre-existing assumptions is exploration without discovery — and it is a product failure regardless of how much time the user enjoyed spending.

North Star metrics across all three columns

The table below places the internet era baseline alongside both AI-era modes. Reading across a row reveals how the same metric dimension shifts — sometimes inverting, sometimes qualifying, occasionally holding constant — as the underlying value proposition changes.

Internet era

AI era · Expansion

AI era · Compression

Dimension	Distribution leverage	Generative discovery	Deterministic execution
North Star	Time on platform · DAU / MAU	Output quality × cognitive hours invested	Non-engagement rate + strategic repeat rate
Engagement signal	Session depth · scroll depth — unconditionally positive	Exploration depth — positive when paired with novelty of output	Prompts per engagement — lower is strictly better
Session success	User returns to feed; another session begins	Novel framing, option set, or synthesis produced	Decision-ready artifact in minimum interactions
Retention signal	Return frequency · stickiness · push-notification response rate	User returns to deepen or extend a prior thread	User returns with a new, independent mandate
Willingness to pay	Ad revenue per attention hour · subscription against habit	10× output per cognitive hour — personal leverage captured	Time saved per decision cycle — institutional ROI
Failure signal	High churn despite high engagement	Long sessions producing confirmatory, low-novelty output	High babysitting rate — user correcting and re-prompting excessively
Cost bearer	User pays with attention; platform monetises the aggregate	Individual — exploration is the billable work	Institution — token consumption is an operational cost
Moat mechanism	Network effects · content volume · algorithmic lock-in	Synthesis depth impossible to replicate without the tool	Memory flywheel — persistent context reducing re-explanation cost

Fig. 2 — North Star metrics across the internet era and both AI-era cognitive modes

How to read this table. The internet era column is the baseline every product team has been optimising toward for two decades. In AI-era expansion mode, the direction of the engagement signal is similar to the internet era — depth is still directionally positive — but the qualifying condition changes from volume to novelty. In AI-era compression mode, the inversion is complete: every metric the internet era rewarded becomes a warning sign. The mistake most AI products are making is applying the internet-era column to a compression-mode product, either because they have not classified their use case or because their most vocal users are expansion-mode power users pulling the product roadmap toward engagement depth.

The willingness-to-pay test

Each mode has a distinct commercial logic rooted in what the user is actually purchasing.

In generative discovery, the user is buying cognitive leverage — the ability to think further, faster, and across more dimensions than unaided reasoning allows. The value is personal and compounds: better thinking at the exploration stage produces better decisions downstream. Power users in this mode will pay meaningfully because the delta between assisted and unassisted output is large and attributable. They are paying for the quality of what the tool helps them produce.

In deterministic execution, the user is buying time recovery — the return of hours that would otherwise be consumed by cognitive grind on a well-specified problem. Institutional buyers in this mode calibrate willingness-to-pay against a straightforward calculation: cost of the tool versus cost of the analyst-hours it replaces or accelerates. The commercial case breaks sharply the moment the tool requires significant analyst time to operate — because at that point, the net time saving approaches zero and the value proposition collapses.

In both modes, the same underlying test applies: is the AI doing the work, or is the user doing the work to operate the AI? Where the answer is the latter, the pricing case is structurally weak regardless of how sophisticated the underlying model is. Cognition leverage is only leverage when the machine absorbs the burden. When it redistributes the burden back to the user under a different name, it is not a new era — it is an expensive interface on top of the old one.

Three metrics that hold across all modes

Metric	Why it transcends era and mode
Strategic repeat rate	Voluntary return with a new mandate — across any era or mode — signals the prior output was acted on. It is the cleanest proxy for delivered value that does not require the user to self-report satisfaction.
Output actionability	Whether the session produced a discovery synthesis or a decision memo, the artifact must be usable. The format changes across modes; the standard of usability does not.
Memory flywheel	A system requiring progressively less re-explanation as the relationship matures builds structural advantage in every mode. Persistent context reduces the cost of each subsequent engagement for both user and institution.