
How to use AI for policy creation & iteration

An actionable list of ideas for how AI can supercharge your platform policy work

I'm Alice Hunsberger. Trust & Safety Insider is my weekly rundown on the topics, industry trends and workplace strategies that trust and safety professionals need to know about to do their job. 

This is the third installment of my series on platform policy, and for this one, we’re talking AI. As a reminder, the first covered how to write user-facing community guidelines, and the second covered how to operationalize them across human reviewers, LLMs, and ML models. Today we’re going to look at how to use AI to produce and pressure-test those policies faster and more thoroughly than used to be possible.

Get in touch with questions, comments, feedback, rants, raves, existential problems…. I'm hoping to use the next issue to answer questions from readers (that's YOU!), so if there’s anything you want to share, please don’t hesitate.

Here we go! — Alice


Stress-test policies at scale with AI

Why this matters: You might have millions of decisions running through your policies; manually testing how well those policies hold up at that volume is pretty much impossible. AI models give you many ways to stress-test your policies sensibly and rigorously at scale.

What’s great is, you don't need specialized software for any of this. You can start today with a model you already have and a policy you're already working on. Before we get into the exercises you can do, some general (maybe obvious) AI tips:

  • Use the latest models: ideally a large thinking model. I tend to use Claude Opus 4.7.
  • Use technical industry jargon: weirdly, this improves the quality of AI answers. Use words like bias, precision, recall, tradeoffs, taxonomy, etc. for the best responses.
  • Build a custom skill/GPT/gem for these stress tests: with step-by-step instructions and context for your specific platform, especially if you’re going to run these regularly. (Here's a policy analysis skill I created for Claude; feel free to copy, use, or modify it.)
  • Don’t let AI do all the thinking for you! It’s best when used as a mirror to reflect back your policies to you, warts and all. Don’t use it as an end-to-end solution or replacement for critical thinking. 

Now here are some actionable ideas on how you can use AI to improve and build on your policies. Let me know if you try any of these, or if you use AI in any other ways. 

1. Gap analysis against peer platforms

Pull community guidelines from three to five comparable platforms, paste them alongside your own, and ask the model to produce a coverage matrix, so you can see which categories each platform addresses that you don't. For each gap, ask what harm it's addressing and whether your platform actually has that surface, so you're deciding what fits rather than copying someone else's list.
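If you ask the model to return its findings as structured data, you can tabulate the gaps yourself instead of relying on the model's summary. Here is a minimal sketch in Python. The platform names and category lists are made up for illustration and don't come from any real guidelines.

```python
# Hypothetical data: which harm categories each platform's guidelines address.
coverage = {
    "our_platform": {"harassment", "spam", "hate_speech"},
    "peer_a": {"harassment", "spam", "hate_speech", "self_harm"},
    "peer_b": {"harassment", "hate_speech", "impersonation", "self_harm"},
}


def coverage_gaps(coverage, ours="our_platform"):
    """Return {category: [peers covering it]} for categories we don't address."""
    gaps = {}
    for platform, categories in coverage.items():
        if platform == ours:
            continue
        # Set difference: what this peer covers that we don't.
        for category in categories - coverage[ours]:
            gaps.setdefault(category, []).append(platform)
    # Sort categories and peers so the output is stable between runs.
    return {cat: sorted(peers) for cat, peers in sorted(gaps.items())}


gaps = coverage_gaps(coverage)
for category, peers in gaps.items():
    print(f"{category}: covered by {', '.join(peers)}")
```

Each gap in the output is a prompt for the follow-up question above (what harm is this addressing, and does your platform have that surface?), not a to-do item.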

2. Coverage check against a reference taxonomy

Peer analysis shows you what comparable platforms cover. A taxonomy check shows you what you're missing entirely, including harms that aren't well-represented in peer guidelines at all. Ask the model to check your policy against a structured library of harm categories, and flag what you haven't addressed. (Note: I have a project on this in beta testing – reach out if you want a free analysis in exchange for giving me feedback).

3. Research on academic, advocacy, and regulatory sources

Ask AI to surface what researchers, advocacy groups, and regulators have written about best practices in your policy area. This can pull up useful work you might not have found otherwise. Be aware that models can produce plausible-sounding references to regulations or standards that don't exist, or mischaracterize sources that do; always ask for citations, and go read them yourself before taking anything on board.

4. Generate the basic structure of a policy from a problem statement

Give the model a problem statement and ask for the scaffolding a policy needs. Structure it like this:

  • A clear statement of what problem you're addressing
  • A list of defined terms for anything that could have multiple definitions
  • Explicit "this includes" and "this does not include" sections
  • Different ways to frame platform values
  • And three to five boundary examples for each rule.

Reacting to a first structure is considerably easier than facing a blank page. The structure AI generates won't be right. But it will be something you can push back on, cut from, and redirect.

5. Narrow vs. broad variants to surface the precision/recall tradeoff

Ask AI to write a narrow version of a rule and a broad version of the same rule, then read them side by side. The narrow version will miss things; the broad version will catch things it shouldn't. Seeing both in concrete language makes the precision/recall tradeoff visible as a choice rather than something you discover later in your false-positive rate.
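To put numbers on that tradeoff, you can score both variants against a small set of cases you've labeled yourself. A rough sketch follows; the eight cases and the rulings are invented for illustration, and in practice the flags would come from a reviewer or model applying each variant.

```python
def precision_recall(flags, truth):
    """Compute (precision, recall) for a rule's flags against true labels."""
    tp = sum(f and t for f, t in zip(flags, truth))      # correctly flagged
    fp = sum(f and not t for f, t in zip(flags, truth))  # flagged but fine
    fn = sum(t and not f for f, t in zip(flags, truth))  # missed violations
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels: True means the case really is a violation.
truth = [True, True, True, True, False, False, False, False]
# Hypothetical rulings from each variant of the rule on the same cases.
narrow = [True, True, False, False, False, False, False, False]
broad = [True, True, True, True, True, True, False, False]

for name, flags in [("narrow", narrow), ("broad", broad)]:
    p, r = precision_recall(flags, truth)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

In this toy data the narrow rule never flags a good-faith case but misses half the violations, while the broad rule catches every violation and wrongly flags two acceptable cases.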

6. Write carve-outs and edge cases for every rule

For every rule, ask the model to write what should not be caught: the things that look similar to a violation but aren't. On a harassment rule, that means naming explicitly that a heated disagreement or a blunt criticism isn't itself a violation, and that someone describing abuse they received isn't either. On a self-harm rule, it means naming harm-reduction content, clinical discussion, and survivor testimony.

Carve-outs are the highest-leverage move for precision and the easiest to skip, because when you're writing a rule you're thinking about what to catch, not what to spare. Delegating the first draft of carve-outs to AI and then stress-testing them means you're less likely to miss the obvious ones.

When generating edge cases, ask the model to produce borderline scenarios along three lines: how a bad actor routes around the rule while technically complying; how the rule accidentally catches good-faith users operating across a cultural difference, a language barrier, a different communication style, or a crisis; and what happens when this rule meets your other rules and your platform's features at scale. Then label the cases yourself. The act of labeling is what surfaces the calls you hadn't actually made yet.

7. Replace placeholder words with observable criteria

Whenever a rule uses a word like "excessive," "egregious," or "inappropriate," or a phrase like "intended to," ask the model to rewrite it in observable terms. Then check whether you can actually answer the question that forces. If you as the policy maker can't say what observable feature actually makes something "excessive", then neither can any reviewer or model applying that policy.

Vague policies and principled-but-abstract ones tend to converge here. Both produce placeholder language that sounds like a decision rule but is actually asking the enforcer to make the real call. When you force the rewrite and still can't answer the question it raises, the policy is showing you something: what you had wasn't actually a decision rule. It was a vibe.
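Before you hand anything to a model, a quick script can find the placeholder language for you. This is a simple sketch: the term list only contains the examples named above, the sample policy text is invented, and a real list would need tuning for your own policies.

```python
import re

# Placeholder terms from the examples above; extend this for your own policies.
PLACEHOLDER_TERMS = ["excessive", "egregious", "inappropriate", "intended to"]


def find_placeholders(policy_text, terms=PLACEHOLDER_TERMS):
    """Return (line_number, term) for every placeholder term found."""
    hits = []
    for line_no, line in enumerate(policy_text.splitlines(), start=1):
        for term in terms:
            # Whole-word, case-insensitive match.
            if re.search(rf"\b{re.escape(term)}\b", line, flags=re.IGNORECASE):
                hits.append((line_no, term))
    return hits


# Invented policy text for illustration.
policy = """Do not post excessive promotional content.
Do not share content intended to mislead voters.
Accounts must not send more than 5 unsolicited messages per day."""

hits = find_placeholders(policy)
for line_no, term in hits:
    print(f"line {line_no}: '{term}' needs observable criteria")
```

The third line passes because it already states something a reviewer can count. The first two are the ones to take to the model for a rewrite.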

8. Check labeled data against written policy

If you have labeled data, like historical enforcement decisions, moderation queue samples, or literally anything where a human has made a call, run your policy against it and look at the disagreements. In nearly every policy I work with, there's a gap between what the policy says and where the labels actually draw the line. Those divergences are where your written policy and your real enforcement intent have come apart.

Hand both to a model, ask it to apply the policy to the labeled cases, and the divergences will surface quickly. The hard cases should go back into the policy as examples so it gets sharper with each pass. If you don't have labeled data, generate cases instead, but label them yourself, because the act of labeling is what surfaces the calls you hadn't actually made.
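Once you have the model's rulings next to the human labels, finding the divergences is a simple comparison. A minimal sketch, assuming you've collected both into a list of records (the field names and the four cases are hypothetical):

```python
# Hypothetical records: a human label and the ruling a model produced
# when asked to apply the written policy to the same case.
cases = [
    {"id": 1, "human_label": "violation", "policy_ruling": "violation"},
    {"id": 2, "human_label": "allowed", "policy_ruling": "violation"},
    {"id": 3, "human_label": "allowed", "policy_ruling": "allowed"},
    {"id": 4, "human_label": "violation", "policy_ruling": "allowed"},
]


def find_divergences(cases):
    """Return cases where the policy ruling and the human label disagree."""
    divergences = []
    for case in cases:
        if case["human_label"] != case["policy_ruling"]:
            kind = (
                "policy stricter than labels"
                if case["policy_ruling"] == "violation"
                else "policy looser than labels"
            )
            divergences.append({"id": case["id"], "kind": kind})
    return divergences


divergences = find_divergences(cases)
agreement_rate = 1 - len(divergences) / len(cases)
print(f"agreement: {agreement_rate:.0%}")
for d in divergences:
    print(f"case {d['id']}: {d['kind']}")
```

Separating "stricter" from "looser" matters because they call for different fixes: one usually needs a carve-out, the other an added example or criterion.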

9. Check for consistency across multiple models

Hand your draft and a set of borderline cases to multiple models and see where the rulings diverge. Where different models agree, the clarity is in the policy. Where they diverge, the policy is leaning on one model's interpretation, and that's the part that will break when the model underneath it changes.

This was genuinely out of reach for a solo practitioner before these tools existed. You could train moderators on a handful of test cases, but not at a volume that predicted how things would behave once millions of decisions ran through the policy. Running a policy across several models and a real body of content, before anything reaches a user, is a different category of validation than what was previously available to teams without significant resources.
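If you record each model's ruling per case, spotting the split decisions takes a few lines. This sketch uses placeholder model names and invented rulings; how you collect the rulings (by hand or through each provider's API) is up to you.

```python
# Hypothetical rulings from three models on the same borderline cases.
rulings = {
    "case_1": {"model_a": "violation", "model_b": "violation", "model_c": "violation"},
    "case_2": {"model_a": "violation", "model_b": "allowed", "model_c": "violation"},
    "case_3": {"model_a": "allowed", "model_b": "allowed", "model_c": "allowed"},
}


def split_cases(rulings):
    """Return the IDs of cases where the models did not all agree."""
    return sorted(
        case_id
        for case_id, by_model in rulings.items()
        if len(set(by_model.values())) > 1
    )


splits = split_cases(rulings)
print("Cases to review:", splits)
```

The split cases are the ones to read closely: they point to the sentences in the policy that each model had to interpret on its own.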

10. Test for cross-rule conflicts

Paste all your policies together and ask the model to find places where rules contradict each other, or where a carve-out in one policy would swallow a prohibition in another. These conflicts are hard to catch manually when you're looking at one policy at a time.

11. Flag where your policy requires signals that the enforcement system won't have

Ask the model to read your policy and surface every rule that depends on intent, account history, or contextual information that a content-level enforcement system can't access. This is a direct precursor to operationalization. If a rule says "designed to target a specific individual" but you're evaluating single pieces of content with no account context, the rule can't actually be applied.
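One way to make this check repeatable is to have the model tag each rule with the signals it depends on, then compare those tags against what your enforcement system can actually see. A small sketch, with invented rule names and signal names standing in for your own:

```python
# Hypothetical: the signals a content-level enforcement system can access.
AVAILABLE_SIGNALS = {"content_text", "content_media"}

# Hypothetical: signals each rule depends on, as tagged by a model or by you.
rule_signals = {
    "no_targeted_harassment": {"content_text", "account_history", "intent"},
    "no_spam_links": {"content_text"},
    "no_graphic_violence": {"content_media"},
}


def missing_signals(rule_signals, available):
    """Return {rule: [signals it needs that the system lacks]}."""
    return {
        rule: sorted(needed - available)
        for rule, needed in rule_signals.items()
        if needed - available
    }


blocked = missing_signals(rule_signals, AVAILABLE_SIGNALS)
for rule, signals in blocked.items():
    print(f"{rule}: needs {', '.join(signals)}")
```

Any rule that shows up here needs either rewording in terms the system can observe or a different enforcement path with more context.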

12. Test for demographic consistency and bias

Ask the model to generate matched scenario pairs for each rule: the same behavior, the same content, the same framing, with the only variable being the identity of the people involved. Run the policy against both scenarios in each pair and check whether the rulings are consistent across race, gender, sexual orientation, religion, political affiliation, nationality, and other dimensions relevant to your platform's userbase.

What this surfaces isn't always what you expect. Some asymmetry is intentional — a hate speech policy that treats in-group and out-group use of slurs differently is making a deliberate choice. But a lot of asymmetry isn't intentional at all. It comes from policy language that carries different cultural weight across communities, from examples that implicitly center one population, or from placeholder terms like "offensive" or "inappropriate" that give biased enforcement room to operate without detection. This is one of the reasons observable criteria matter beyond consistency: they replace terms that mean different things to different people with criteria that don't, which makes it harder for bias to hide inside a policy that looks neutral on its surface.
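The matched-pair check can be sketched in code. In the example below, the template, the neutral "group A/B/C" labels, and especially toy_ruling are all hypothetical; toy_ruling is a deliberately biased stand-in for whatever reviewer or model applies your policy, included only so the check has something to catch.

```python
from itertools import combinations


def matched_pairs(template, identities):
    """Build every pair of scenarios that differ only in the identity named."""
    variants = {i: template.format(identity=i) for i in identities}
    return [((a, variants[a]), (b, variants[b])) for a, b in combinations(identities, 2)]


def inconsistent_pairs(pairs, rule_fn):
    """Return identity pairs where the same scenario got different rulings."""
    return [
        (a, b)
        for (a, text_a), (b, text_b) in pairs
        if rule_fn(text_a) != rule_fn(text_b)
    ]


def toy_ruling(text):
    # Hypothetical, deliberately biased stand-in for a real policy ruling.
    return "violation" if "group B" in text else "allowed"


template = "A user from {identity} posts a blunt criticism of a moderator."
pairs = matched_pairs(template, ["group A", "group B", "group C"])
flagged = inconsistent_pairs(pairs, toy_ruling)
print("Inconsistent pairs:", flagged)
```

A flagged pair isn't automatically a problem, since some asymmetry is deliberate. It's a prompt to decide whether the difference is a choice you made or one the policy language made for you.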

I hope this list has been useful. Is there anything missing? How do you use AI in platform policy? As usual, you can get in touch any time.

You ask, I answer

Send me your questions — or things you need help to think through — and I'll answer them in an upcoming edition of T&S Insider, only with Everything in Moderation*
