
Is this what assessing risk *actually* looks like?

Regulators have spent years trying to get platforms to anticipate harm before it happens. Anthropic’s Mythos release suggests some AI labs may already be adopting similar principles.

I'm Ben Whitelaw, the founder and editor of Everything in Moderation*. I'm standing in for Alice in this week's Trust & Safety Insider. She'll be back as usual next week.

Today, I'm looking at Anthropic's incredibly detailed Mythos release documentation and teasing out what it says about the regulatory shift towards risk assessment. The full version is available to paid EiM members only as part of a series of experiments I'm running. Now's a great time to join the club if you've been thinking about it but are yet to take the plunge.

As ever, drop me an email if you have views about today's topic, want to talk about sponsoring T&S Insider or have ideas about how your organisation/institution can collaborate with EiM. Enough small talk, here we go! — Ben


When AI companies show their working

Why this matters: Regulators have spent years talking about systemic risk assessment and mitigation. Anthropic’s voluntary Alignment Risk Update for Mythos, its powerful new model, suggests that such logic may now be shaping how some AI labs release models.

The Grok scandal earlier this year triggered an entirely justified wave of rage: media and civil society organisations condemned it, politicians demanded answers and the general public expressed outrage that this could happen at all.

But buried within that fury, there was an important question with regulatory implications: what did X/Twitter do to anticipate and mitigate foreseeable harms before rolling out its model to all users?

That kind of pre-emptive approach is what the European Commission and Ofcom have made core to their landmark online safety laws, and it is what they put front and centre in their investigations into Grok.

Ofcom’s investigation under the Online Safety Act explicitly seeks to examine whether X/Twitter failed to “assess the risk their service poses to UK children, and to carry out an updated risk assessment before making any significant changes to their service” (12th January). The Commission, meanwhile, is using the Digital Services Act to test whether “the company properly assessed and mitigated risks associated with the deployment of Grok's functionalities into X in the EU” (26th January). Both investigations are ongoing.

In both cases, the primary focus is not whether harm occurred — that part is obvious from the harrowing stories shared by women and girls and the ongoing lawsuits about its impact — but whether the company had thought in advance about what might go wrong and tried to prevent those eventualities from happening.

Grok-blocked

I was reminded of the Grok drama last week when Anthropic released its Mythos model to a select group of companies because it was deemed “too dangerous” for wider deployment. Project Glasswing took the headlines for the model’s ability to fix decades-old, previously undetected bugs, but I was most interested in the Alignment Risk Update, a first-of-its-kind document that Anthropic released alongside the model’s announcement.

The 58-page document complements the company’s usual safety card analysis by laying out the chances of misalignment leading to one particular risk: harmful autonomous behaviour within an organisation. It sets out in unusual detail how Anthropic assessed the risk that Mythos might attempt harmful actions and the likelihood that it would succeed, concluding that the model can “sometimes employ concerning actions to work around obstacles to task success”.

The significance of one document shouldn’t be overstated. But right now, taking such a step sets Anthropic apart from other large technology companies when it comes to publicly pre-empting risk. Andrew Clearwater, who works for AI automation platform Airia and writes a newsletter on AI, called it:

The first document I’ve seen from any frontier lab that I think every executive, every AI lead, every governance person at any company using AI models should actually read. Not because it’s scary. Because it’s honest. And in AI right now, honest is rare.

Honest, yes — although to what degree we’ll never fully know. But I’d also say pre-emptive, which is where the contrast with Grok is notable.

Get access to the rest of this edition of EiM and 200+ others by becoming a paying member