5 min read

I read the 219-page AI Safety report; here are my five takeaways

The International AI Safety Report is long and technical, but its insights will feel very familiar to anyone who’s worked in Trust & Safety

I'm Alice Hunsberger. Trust & Safety Insider is my weekly rundown on the topics, industry trends and workplace strategies that trust and safety professionals need to know about to do their job.

Over the weekend, I read the mammoth International AI Safety Report that was published last week. I'm sharing my key takeaways here for anyone who didn't get round to reading it in full.

Get in touch with your reflections on the report, questions about my step-by-step guide to running RFPs, or your feedback on today's edition. Here we go! — Alice


SPONSOR EVERYTHING IN MODERATION AND SEE YOUR BRAND HERE

Get in front of the smartest, most engaged audience in Trust & Safety. Sponsoring Everything in Moderation puts your brand directly in the inboxes of policy experts, technologists, and researchers shaping the future of internet safety and platform governance. New sponsors get a discount before the end of February.

FIND OUT MORE ABOUT SPONSORSHIP

AI safety, meet T&S

The second International AI Safety Report, authored by over 100 AI experts, was published last week. It's dense and technical but, like last year's report, there’s a lot of interesting stuff in there.

One idea I've written about here in T&S Insider is how Trust & Safety and AI Safety — not to mention Risk and Fraud and Cybersecurity — are all different branches of the same tree. My take is that the challenges facing AI safety today aren’t wholly new; they’re variations of the same problems that Trust & Safety professionals have been dealing with for decades. This report certainly backs that theory up.

As well as crystallising my view on the overlap of these two safety fields, I wanted to share my five takeaways from the 219-page report. In no particular order:

Detection is a losing game

First up, AI-generated content is now indistinguishable from human content for most people in most contexts. After a five-minute conversation, participants in one study misidentified AI-generated text as human-written 77% of the time. For audio deepfakes, people mistook AI voice clones for real speakers 80% of the time. And of course, photos are getting much more realistic as well.

While there are promising technical approaches like watermarking, they come with problems of their own: watermarks are easy to remove, and content can be harmful even if it's labelled as AI-generated. That means detection is a useful tool (when it works), but it can't be the foundation of your safety strategy.

We need to stop hoping that AI-detection systems will solve T&S issues. Instead, we must focus on the harm itself, regardless of how it was created. A deepfake used for extortion or disinformation is the problem, not the fact that it's an AI-generated deepfake. Yes, AI is certainly making it easier and faster to create fake accounts, but I moderated fake accounts on dating apps over 15 years ago. It’s not a new problem.

Safeguards will be bypassed

The report is very clear that there’s no one magic AI safeguard: you have to layer multiple approaches together and look at content, behaviour, and the actors themselves (both human and AI). In practice, this means the following (I've sketched how the layers might fit together after the list):

  • Pre-deployment AI safety testing and tuning (this is solidly AI Safety work, but is no longer effective on its own: models are now sophisticated enough to distinguish between test environments and production, behaving differently when they think they're being evaluated)

  • Monitoring user interactions and behaviour to identify and remove bad actors (traditional T&S tactics)
  • Monitoring AI reasoning or chain-of-thought for AI deception (just like you would monitor user content)
  • Input filtering (aka screening user prompts/creating guardrails – note that prompt injection attacks succeed approximately 50% of the time when attackers make 10 attempts)
  • Output filtering (screening AI-generated content for content violations)
  • Human oversight and overrides (i.e. have a human make the final decision when it’s something really critical)
  • Incident response protocols (when something goes wrong, because it will)
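
To make that layering concrete, here's a minimal sketch of how several of these checks might stack in a single moderation pass. It's a toy under obvious assumptions: the keyword checks, strike store, and thresholds are hypothetical placeholders for real classifiers and behavioural signals, and pre-deployment testing and chain-of-thought monitoring would sit outside this request path entirely.

```python
# Illustrative only: the classifiers, thresholds, and helper names below are
# hypothetical stand-ins, not the report's (or any vendor's) actual tooling.
from dataclasses import dataclass, field

STRIKES = {"user_123": 4}  # pretend behavioural-history store


@dataclass
class Decision:
    allowed: bool = True
    needs_human_review: bool = False
    reasons: list = field(default_factory=list)


def input_filter(prompt: str) -> bool:
    """Guardrail on the user prompt; a keyword check stands in for a real classifier."""
    return "ignore previous instructions" not in prompt.lower()


def output_filter(response: str) -> bool:
    """Screen the model's output for policy violations (placeholder)."""
    return "policy_violation" not in response


def moderate_interaction(user_id: str, prompt: str, response: str) -> Decision:
    decision = Decision()

    # Layer 1: input filtering (prompt injection still gets through often,
    # so this can't be the only line of defence).
    if not input_filter(prompt):
        decision.allowed = False
        decision.reasons.append("input_filter")

    # Layer 2: output filtering.
    if decision.allowed and not output_filter(response):
        decision.allowed = False
        decision.reasons.append("output_filter")

    # Layer 3: actor-level monitoring, i.e. traditional T&S signals.
    if STRIKES.get(user_id, 0) >= 3:
        decision.needs_human_review = True
        decision.reasons.append("actor_history")

    # Layer 4: human oversight, so a person makes the final call on anything critical.
    if not decision.allowed and "output_filter" in decision.reasons:
        decision.needs_human_review = True

    # Layer 5: incident response starts with a record of what happened.
    print(f"audit_log user={user_id} allowed={decision.allowed} reasons={decision.reasons}")
    return decision


if __name__ == "__main__":
    print(moderate_interaction("user_123", "Hello!", "Here is a normal answer."))
```

The point isn't these specific checks; it's that each layer catches some of what the others miss, and everything gets logged so incident response has something to work with.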

We need to start moderating AI agents as if they’re potential bad actors, especially when there can be real human bad actors behind them. My recent deep dive into Moltbook, the social media platform for agentic AI, found many of the same things there as you would find on any human social platform: crypto scams, karma farming, violent philosophical posts, and spam posts galore. These are problems that T&S teams know how to monitor.

Design Human-in-the-Loop for humans

Human oversight seems like an obvious fix for AI reliability problems, and I hear about Human-in-the-Loop (HITL) all the time. But there's a trap: AI systems exhibit sycophancy (wanting to agree with humans) while humans exhibit automation bias (wanting to agree with AI). If you don’t mitigate these tendencies, you get rubber-stamping.

I’ve spent a lot of time thinking about what good HITL systems should look like, and a lot of it comes down to what you’re actually incentivising:

  • Make it easy to override or correct AI decisions (like, physically easy, not buried in a bunch of menus or clicks)
  • Train reviewers specifically on automation bias, so they know to look out for it
  • Monitor for rubber-stamping by looking at individual decisions and moderator behaviour as a whole (one rough way to do this is sketched at the end of this section)
  • Celebrate when moderators catch AI errors
  • Use those learnings to update policies or guardrails so the same error doesn’t happen again

Focus on outcomes here as well as process. I’m continually surprised at how few people are talking about HITL system design, when to me it feels like one of the most critical things to get right.
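
Monitoring for rubber-stamping doesn't need to be fancy. Here's a rough sketch of the idea, assuming you log the AI's recommended action, the reviewer's final action, and how long the review took; the field names, sample data, and thresholds are all invented for illustration.

```python
# Rough sketch: surface possible rubber-stamping from review logs.
# The log fields, sample data, and thresholds are hypothetical; tune to your own queues.
from collections import defaultdict

review_log = [
    # each record: reviewer, AI-recommended action, reviewer's final action, seconds spent
    {"reviewer": "mod_a", "ai_action": "remove", "final_action": "remove", "seconds": 4},
    {"reviewer": "mod_a", "ai_action": "allow", "final_action": "allow", "seconds": 3},
    {"reviewer": "mod_b", "ai_action": "remove", "final_action": "allow", "seconds": 41},
]

stats = defaultdict(lambda: {"reviews": 0, "overrides": 0, "total_seconds": 0})
for r in review_log:
    s = stats[r["reviewer"]]
    s["reviews"] += 1
    s["total_seconds"] += r["seconds"]
    if r["final_action"] != r["ai_action"]:
        s["overrides"] += 1

for reviewer, s in stats.items():
    override_rate = s["overrides"] / s["reviews"]
    avg_seconds = s["total_seconds"] / s["reviews"]
    # Very low override rates plus very fast decisions are a signal worth
    # auditing, not proof of rubber-stamping on their own.
    flagged = override_rate < 0.05 and avg_seconds < 10
    print(f"{reviewer}: override_rate={override_rate:.0%} avg_seconds={avg_seconds:.0f} flag={flagged}")
```

Pair that kind of monitoring with the incentives above (like celebrating caught errors), otherwise it just becomes another metric people learn to game.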

Performance is uneven across languages and cultures

I just wrote a detailed article about how to measure and mitigate bias in moderation systems, including AI systems, so I was glad to see a section in the report on this. The summary is that AI performance is highest in English, particularly on topics, places, and things that are most often represented in training data. This means there is a lot of inequality in results when it comes to language and geography, and even along socioeconomic lines.

For example, the report mentions a study showing that AI models correctly answered 79% of questions about everyday US culture but only 12% of questions about Ethiopian culture. This means if you’re using AI for content moderation in Ethiopia, it will have much less cultural context to make an accurate decision.

AI has enormous potential for finally enabling T&S teams to moderate more fairly, because it is steerable, specific, and scalable in ways that 100% human teams never can be. But anyone using AI or designing systems with it needs to understand that these systems have built-in bias as well, and should take steps to mitigate that bias (see my point about designing robust HITL systems).
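
One simple first step is to measure the gap: slice whatever quality metric you already track by language or region instead of reporting a single global number. Here's a toy sketch with invented sample data ("am" stands in for Amharic); the record format and labels are assumptions, not anything from the report.

```python
# Toy sketch: slice moderation accuracy by language rather than reporting one
# global number. The sample data below is invented purely for illustration.
from collections import defaultdict

# each record: language of the content, model decision, human "golden" decision
labelled_sample = [
    {"lang": "en", "model": "remove", "golden": "remove"},
    {"lang": "en", "model": "allow", "golden": "allow"},
    {"lang": "am", "model": "allow", "golden": "remove"},
    {"lang": "am", "model": "remove", "golden": "remove"},
]

totals = defaultdict(lambda: {"n": 0, "correct": 0})
for row in labelled_sample:
    t = totals[row["lang"]]
    t["n"] += 1
    t["correct"] += row["model"] == row["golden"]

for lang, t in sorted(totals.items()):
    accuracy = t["correct"] / t["n"]
    print(f"{lang}: accuracy={accuracy:.0%} over {t['n']} labelled items")
# Large gaps between languages are the signal to add local review capacity,
# better training data, or tighter HITL for the lower-performing slices.
```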

Why T&S expertise still matters (and what we can teach the world)

Human bad actors are still at the root of many AI safety problems. AI didn't create deepfake pornography or phishing or CSAM; it just made these harms cheaper, faster, and more accessible.

Luckily, T&S professionals have been solving these human problems in adversarial environments for decades. Core T&S skills like critical thinking, creative problem-solving, investigation, understanding human psychology and behaviour, and cultural competency are more relevant than ever. The work is getting harder and more complex, and the systems and layers are changing quickly along with technology, but the core of what we do remains the same.

However, the critical thinking, human judgment, and healthy scepticism we rely on for risk mitigation can't just live inside our teams. We need to be actively teaching these skills to the general public because societal resilience against AI harms matters just as much as technical safeguards. This can look like building AI literacy in schools, creating a culture of scepticism and critical thinking, and explaining what goes on behind the scenes to keep people safe. If we do this right, then the general public will understand basic T&S principles, and maybe – finally – will appreciate what we do.

You ask, I answer

Send me your questions — or things you need help to think through — and I'll answer them in an upcoming edition of T&S Insider, only with Everything in Moderation*

Get in touch

Also worth reading

Six Months Of ‘AI CSAM Crisis’ Headlines Were Based On Misleading Data (Techdirt)
Why? This was equal parts predictable and completely frustrating.

Can you sue for social media addiction? (Usermag)
Why? A great piece on the difference between habits and addiction.

Related, read this great piece from Nita Farahany's class on the same topic.