
Are you drowning in vanity metrics?

Measuring the outcomes of T&S work is hard — here’s how to do it

I'm Alice Hunsberger. Trust & Safety Insider is my weekly rundown on the topics, industry trends and workplace strategies that trust and safety professionals need to know about to do their job. 

When Trust & Safety works, the result is no harm done. But how do you measure that as an outcome? I'm continuing my “How To” series: this week we're looking at how to build a measurement program that keeps you honest about whether things are working, rather than one that just produces vanity metrics.

Get in touch if you'd like your questions answered or just want to share your feedback. Please send along questions, comments, feedback, rants, raves, existential problems… I hope to dedicate an upcoming edition of the newsletter to questions I get from EiM subscribers and via LinkedIn.

Here we go! — Alice

For the month of May, T&S Insider will be guest-edited by Georgia Iacovou, author of Horrific/Terrific


How to measure T&S impact

Why this matters: measuring the impact of this work is both challenging and urgent, precisely because the role of T&S is to prevent harm. That's why your measurement infrastructure needs to be accurate: keep it up to date, measure the right things, and you'll ultimately give users a better experience.

I used to be head of T&S at a platform that used a BPO for moderation. One day I went digging into our QA scores. We'd been seeing user complaint patterns that didn't match the picture the metrics painted. The QA scores looked excellent week after week, with the moderation team hitting accuracy targets that any T&S leader would be happy with. But the complaints kept coming, and I wanted to understand the gap.

What I found was that the QA team and the moderation team were both working from a slightly out-of-date version of the spam and scam guidelines. Nobody had done anything wrong; the space was moving fast and the documentation simply hadn't kept up. The moderation team were seeing new patterns that the QA rubric did not yet reflect. So everyone was being scored against documentation that was missing the things that actually mattered. The aggregate numbers looked great, but the actual enforcement had real gaps — and the moderators, who are best positioned to see those gaps, had no structured way to surface this into the policy or measurement work.

I later ended up running the Trust & Safety line of business at a BPO myself, where I tried to structure our internal teams specifically to avoid this problem. I learned that the dynamic I'd seen from the platform side wasn't unusual, and it illustrates what I think is the most important distinction to understand in T&S measurement: the difference between metrics that look good and metrics that mean something.

Why measurement is harder in T&S than anywhere else

In marketing you can tie an investment to a conversion rate; in engineering you can measure uptime against a clear target. But with T&S you're trying to prevent harms that are difficult to quantify, in an environment that's constantly changing, using systems that are inevitably imperfect. Even with a perfect system, it's genuinely hard to prove that something didn't happen, which is ultimately what T&S is all about: stopping harm before it happens, not just reacting after the fact.

Also, there's no universal set of metrics that works across all platforms. A measurement framework that makes sense for a large social platform may be completely the wrong tool for a gaming platform, a dating app, or a marketplace. The metrics that matter depend on what your policies cover, what your enforcement model looks like, and what the realistic harms on your platform actually are. 

Finally, improving your measurement often makes your numbers look worse before they look better. If you've been undercounting violations because your sampling methodology was flawed, fixing it will make prevalence appear to spike. If you expand your detection coverage, incident rates will appear to climb. This is a precondition for improvement, not evidence that things are getting worse, but it's a hard argument to make when a leadership team sees the numbers move in the wrong direction.

All of these are worth knowing before you start, because they shape what good measurement actually looks like.

The metrics that matter: prevalence, detection/decision, and user experience

There are three distinct questions T&S measurement needs to answer, and they require different metrics, different methodologies, and different responses when something goes wrong.

How much harm exists on the platform?

This is the hardest question to answer. The proportion of content or interactions on your platform that are violating — or prevalence — is the closest thing T&S has to a guiding metric. It tells you something the other metrics can't: how much harm exists, regardless of how well your enforcement systems are performing. A platform can have excellent enforcement metrics and still have rising prevalence if the underlying volume of harmful content is growing faster than enforcement can keep up.

Measuring prevalence well requires labelling a sufficiently large random sample of platform content. Flagged and reported content is already a biased sample that systematically understates prevalence, so the sample needs to be genuinely random. Historically, doing this was expensive, which is why most platforms skipped it. LLMs have changed this: if you're already running automated labelling across your content for moderation purposes, prevalence measurement is largely already built into your pipeline. Pull a random sample, run it through your classifier, and you have a prevalence number. That said, this only works if your labels are well-calibrated against ground truth, which is why the QA work below matters so much.
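Here's a minimal sketch of that pipeline step, assuming a placeholder label_content() classifier and a normal-approximation confidence interval:

```python
import math
import random

def estimate_prevalence(content_ids, label_content, sample_size=2000):
    """Estimate prevalence from a genuinely random sample, with a 95% CI.

    `content_ids` is the full content population and `label_content` is
    whatever labelling you already run (human QA or an LLM classifier),
    returning True for violating content. Both names are placeholders.
    """
    sample = random.sample(content_ids, min(sample_size, len(content_ids)))
    hits = sum(1 for cid in sample if label_content(cid))
    n = len(sample)
    p = hits / n
    # Normal-approximation 95% interval. For very rare harms this breaks
    # down, which is why high-severity categories get incident tracking
    # instead (see below).
    margin = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - margin), min(1.0, p + margin)
```

The interval matters as much as the point estimate: for rare harms, a sample of 2,000 items may be far too small to say anything meaningful, which is exactly the limitation the next paragraph addresses.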

For high-severity, low-volume harms (CSAM, terrorism, coordinated inauthentic behaviour), prevalence sampling won't catch enough examples to be meaningful. Track these through incident rate instead, and do regular post-incident reviews (similar to engineering post-mortems) that capture not just that an incident happened but how it was detected, how long it took to respond, and what needs to change.

How well are we detecting harm and dealing with it?

Let’s get one thing straight: detecting harms, and dealing with harms, are two very different things. Detection is about whether your systems are finding harm. ‘Dealing with it’ — or decision quality — is about whether your systems are making the right calls on what they detect. These have completely different failure modes and fixes. A detection failure points to your classifiers, your signal sources, your sensitivity thresholds. A decision failure points to policy clarity or reviewer calibration. When something looks wrong, the first question to ask is which kind of failure you're dealing with.

Key metrics for detection (a sketch of computing these follows the list):

  • Proactive detection rate – the proportion of harm caught before users report it
  • Recall – the share of total harm you're actually catching
  • Time to detection – how quickly harm is found once it appears. This is similar to, but distinct from, proactive detection rate: a system that catches 95% of violations but takes 48 hours to do so is performing very differently from one that catches 90% within an hour.
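Here's a rough sketch of computing all three from labelled ground truth; the Violation record and its fields are illustrative, not a real schema:

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median

@dataclass
class Violation:
    created_at: datetime          # when the violating content appeared
    detected_at: datetime | None  # None = never caught (a false negative)
    found_proactively: bool       # caught by your systems, not a user report

def detection_metrics(violations: list[Violation]):
    """Compute recall, proactive detection rate, and median hours to detection."""
    detected = [v for v in violations if v.detected_at is not None]
    if not detected:
        return 0.0, 0.0, None
    recall = len(detected) / len(violations)
    proactive_rate = sum(v.found_proactively for v in detected) / len(detected)
    hours_to_detect = median(
        (v.detected_at - v.created_at).total_seconds() / 3600 for v in detected
    )
    return recall, proactive_rate, hours_to_detect
```

Note that recall can only be computed against ground truth that includes the violations you missed, which is another reason the prevalence sampling above earns its keep.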

Key metrics for decision quality: 

  • Precision – how often you're right when you flag something
  • False positive rate – how often you incorrectly label benign content as violative
  • Appeal overturn rate – the share of appealed decisions where you end up agreeing with the user
  • Time to enforcement – how long it takes you to take action after detection
  • Inter-rater agreement – how often different reviewers, or different automated systems, reach the same decision on the same content. It's one of the most useful early signals that something is going wrong: when it drops, it usually means policy guidance has gaps or reviewer calibration has drifted, before that shows up anywhere else. (A way to compute it is sketched after this list.)
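Raw percent agreement flatters you whenever one label dominates (and on most platforms, most content is fine), so Cohen's kappa, which corrects for chance agreement, is a more honest measure. A minimal sketch:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two reviewers' decisions on the same items,
    corrected for the agreement you'd expect by chance alone."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in counts_a.keys() | counts_b.keys()
    )
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# A hypothetical trigger: kappa below 0.6 in any policy category is a
# reasonable prompt for a calibration session and a rubric review.
```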

For every metric you track, you should be able to answer two questions: what level or trend would cause you to take action, and what action would you take? If you can't answer both, you're tracking something for the sake of it — this is just dashboard noise.
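One lightweight way to enforce that discipline is to keep the trigger and the response written down next to the metric itself. The thresholds here are purely hypothetical:

```python
# If you can't fill in both fields for a metric, it probably
# doesn't belong on the dashboard.
METRIC_PLAYBOOK = {
    "appeal_overturn_rate": {
        "trigger": "above 15% for two consecutive weeks",
        "action": "policy owner reviews overturned cases for a guidance gap",
    },
    "inter_rater_agreement": {
        "trigger": "kappa below 0.6 in any policy category",
        "action": "schedule a calibration session and re-examine the rubric",
    },
    "median_time_to_detection": {
        "trigger": "above 24 hours for high-severity categories",
        "action": "review classifier thresholds and queue prioritisation",
    },
}
```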

How are users experiencing the system?

This is the question that unfortunately gets the least attention in most T&S measurement frameworks. A system that performs well on detection and decision metrics but creates a hostile environment for legitimate users is failing at its actual job, even if every dashboard says otherwise.

Example: the tradeoff between precision and recall has a direct effect on user experience. If you over-index on recall, you may catch more bad actors, but you'll also drive up false positives, and that erodes trust. A system tuned hard for precision lets bad actors fall through the cracks, and your good users will be stuck with harassment and scams.

Where you sit on this tradeoff shouldn't be a default that emerged from your system architecture; it should be a deliberate, values-driven decision, shaped by the severity of the harms you face and the user populations you serve.

Key metrics: 

  • Appeal volume – not just overturn rate, but raw volume. High appeal volume means users are dissatisfied with decisions even when those decisions are technically correct
  • False report rate – if users are consistently reporting things that don't violate anything, that's a clarity problem in your policies, not just an enforcement problem
  • User complaint patterns
  • Time to resolution – measured from the user's perspective, rather than from yours

But metrics only capture what reaches your formal channels. Most users who have a bad experience don't file an appeal. They don't always report. They tell friends, post on Reddit, or simply stop using the platform. A real voice-of-the-customer program with user research interviews, user surveys, and structured monitoring of external channels will surface patterns the formal metrics miss. Quantitative measurement tells you what's happening; qualitative input tells you what it feels like to be on the receiving end. Both are important.

How to actually do it

Build your QA process before you worry about dashboards. Most of the metrics that tell you whether your operation is working (inter-rater agreement, appeal overturn rate, decision accuracy) depend on a QA process that generates reliable ground truth. At minimum, regularly sample a meaningful slice of review decisions, have a senior reviewer or policy owner evaluate them against the written guidance, and track results by category over time. Deliberately oversample from the categories where you expect the most ambiguity and from the edges of your policy thresholds.
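A minimal sketch of that sampling step, with ambiguous categories oversampled; the rates and field names are placeholders:

```python
import random

def draw_qa_sample(decisions, base_rate=0.02, oversample=(), boost=3):
    """Sample review decisions for QA, oversampling ambiguous categories.

    `decisions` is any iterable of records with a `.category` attribute;
    categories listed in `oversample` are drawn at `boost` times the base
    rate, so policy edges get disproportionate QA attention.
    """
    picked = []
    for d in decisions:
        rate = base_rate * (boost if d.category in oversample else 1)
        if random.random() < rate:
            picked.append(d)
    return picked
```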

The failure mode to avoid is what I described in the opening: QA structured around a fixed rubric, without a mechanism for updating that rubric as content patterns evolve. The rubric needs to be a living document, owned by someone close enough to the content to notice when new patterns aren't covered, with both the authority and the responsibility to keep it current.

For automated systems, the equivalent is a golden dataset: a curated set of examples with known correct labels that you run your model or LLM against on a regular cadence. Content patterns evolve, and a system that was well-calibrated three months ago may have drifted significantly since, not because anything changed in the system but because what users are posting has changed around it.
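A sketch of that recurring check, assuming placeholder classify and golden_set objects:

```python
def golden_set_accuracy(classify, golden_set):
    """Run the current model or LLM pipeline against a curated golden set.

    `golden_set` is a list of (content, correct_label) pairs. Run this on
    a regular cadence and compare against the score recorded when the
    system was last calibrated: a drop means drift, even though nothing
    in the system itself has changed.
    """
    correct = sum(classify(content) == label for content, label in golden_set)
    return correct / len(golden_set)

# e.g. alert if accuracy falls more than a couple of points below the
# calibration-time baseline, and break the result down by policy category.
```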

Disaggregate everything. If you look at everything in aggregate, failure is invisible: a 95% decision accuracy rate across your full review queue can coexist with 70% accuracy on a specific content type, in a specific language, or for a specific user population. Build your measurement infrastructure so you can break down key metrics by content type, policy category, language, user population, and enforcement channel from the beginning. This matters operationally, because you need to know where the problems are, but it matters for equity too. Bias in enforcement almost always looks like accuracy in the aggregate.
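A sketch of what that breakdown can look like, assuming QA results carry the slice attributes as fields (the names are illustrative):

```python
from collections import defaultdict

def accuracy_by_slice(qa_results, keys=("content_type", "language", "user_population")):
    """Break decision accuracy down by slice instead of in aggregate.

    `qa_results` is a list of dicts with a boolean "correct" field plus
    the slice attributes. A healthy aggregate can hide a badly failing
    slice, which is exactly what this view exists to expose.
    """
    buckets = defaultdict(list)
    for r in qa_results:
        for key in keys:
            buckets[(key, r[key])].append(r["correct"])
    return {
        slice_key: sum(vals) / len(vals)
        for slice_key, vals in sorted(buckets.items())
    }
```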

Different user populations also need different measurement priorities. For vulnerable populations like young people or marginalised communities, the cost of a missed detection is often catastrophic, so your measurement should weigh false negatives heavily. For paying adult users, over-enforcement has direct revenue consequences alongside the experienced harm, and false positive rates deserve close attention. These aren't the same measurement problem, and treating them as one will mean you're optimising for the wrong thing in at least one direction.

Connect measurement to your review cadence. A weekly or biweekly calibration meeting that brings together policy owners, QA, and frontline moderators or prompt engineers is the right cadence for active policy areas. What matters is that measurement findings have a regular, structured moment to become operational decisions rather than sitting in a report. QA findings and appeal data should be reviewed together in the same conversation, so the team can see whether they're pointing at the same problems from different angles.

Who owns this matters enormously. Measurement, policy, and operations teams will all touch this, but each has its own sense of what good looks like. When this works, it's usually because someone is doing active connective work to make sure findings from one group reach the others and translate into changes. When it doesn't, measurement findings sit in dashboards nobody outside the data team reviews, and the organisational gaps become the measurement failure.

Build in genuine human oversight, not reflexive approval. As automation scales, the informal pattern-recognition that comes from humans seeing large volumes of content gets lost. Human-in-the-loop processes have to replace those insights, and that's harder than it sounds. The impulse to agree with what an AI suggests is real and well-documented, and it gets stronger the more accurate the system generally is. Skepticism has to be designed in, not assumed. That means structured review of automated decisions (not just the ones the system is uncertain about), regular sampling of automated approvals to check for drift, and explicitly rewarding reviewers who catch automation errors.

Listen to your frontline team. Whether you're using in-house moderators or an outsourced team, the people doing the day-to-day review see emerging patterns, new edge cases, and gradual shifts in how harm is showing up. This is important information. Build structured channels for those observations to reach the policy and measurement work. 

What it looks like when it's working

You catch problems before users do. If you're consistently learning about enforcement failures from outside the team rather than from your own data, your measurement isn't looking at the right things or isn't being reviewed frequently enough.

Your calibration meetings generate policy changes rather than just acknowledging trends. If you're seeing the same pattern in your QA data for three months running, this should trigger a policy change.

Your escalation queue contains genuinely hard cases, not easy ones. If escalation volume drops to near zero, that's a sign of policy drift: your reviewers may be deciding edge cases on their own instead of flagging them.

And, most importantly: you find yourself comfortable defending honest measurement even when the numbers move in directions leadership doesn't love. You don't want to fall into the trap I described at the start, where measurement looks great but the methodology is flawed. Often, when you catch issues early and start measuring correctly, the numbers look “worse”, but that is a sign of progress. Your org's culture should be comfortable with this “worse before it gets better” phase.


Resources


I’ll leave you with a measurement checklist. There will be more on avoiding vanity metrics in a future issue.

What to measure:

  • Prevalence methodology exists for your highest-risk content categories, using genuinely random sampling
  • If using LLM-based labelling for prevalence, golden dataset calibration is in place
  • Incident rate is tracked for high-severity, low-volume harms, with regular post-incident reviews
  • Detection metrics and decision metrics are tracked separately
  • Time to detection and time to enforcement are tracked, not just rates
  • False positive and false negative rates are tracked separately
  • Inter-rater agreement is tracked for human review; golden dataset performance for automated systems
  • Appeal volume, false report rate, and user complaint patterns are tracked
  • Voice-of-the-customer program captures qualitative input on a regular cadence

How to measure it:

  • QA process exists with a mechanism to update the rubric as content patterns evolve
  • All key metrics can be disaggregated by content type, language, user population, and enforcement channel
  • For every metric, the response to a change is defined in advance
  • Efficiency metrics are reviewed alongside quality metrics, never instead of them
  • Measurement is reviewed on the same cadence as calibration (weekly or biweekly for active policy areas)
  • A named person owns the connective work between measurement, policy, and operations
  • Drift monitoring is in place for automated systems
  • Human-in-the-loop processes are designed for genuine review, not reflexive approval
  • Frontline moderators have structured channels to surface observations into policy and measurement work

Leadership and culture:

  • Leadership understands that better measurement may temporarily make numbers look worse
  • Performance management does not reward throughput or metric improvement at the expense of accuracy
  • Leadership incentives reward honest measurement over metrics that look good

You ask, I answer

Send me your questions — or things you need help to think through — and I'll answer them in an upcoming edition of T&S Insider, only with Everything in Moderation*

Get in touch