What is AI red teaming?

AI red teaming is the practice of adversarially testing an AI system before deployment — trying to get it to produce harmful, dangerous, or unintended outputs. It borrows the concept from cybersecurity red teams (attackers hired to probe defenses) and applies it to AI models. Red teamers try to find ways to make the model bypass its safety guidelines, produce misinformation, assist with dangerous activities, or behave inconsistently. The goal is to find these failure modes before real users or attackers do.

What did OpenAI find when red teaming GPT-4?

According to OpenAI's published GPT-4 System Card, red teamers found that early versions of GPT-4 could provide meaningful uplift for certain dangerous activities including generating detailed synthesis routes for chemical and biological agents, providing cybersecurity attack assistance, and producing persuasive disinformation. OpenAI mitigated these before launch. The system card also documented that GPT-4 was significantly harder to jailbreak than GPT-3.5 but not impossible — the team observed a 29% vs 82% "disallowed content" rate on internal benchmarks comparing GPT-3.5 and GPT-4 respectively.

How can I red team my own AI application?

Start with these categories: (1) Prompt injection — can users inject instructions that override your system prompt? (2) Jailbreaks — can users use roleplay, hypotheticals, or multi-step prompts to bypass content restrictions? (3) Data leakage — can users extract your system prompt or internal context? (4) Excessive agency — if your AI can take actions, can it be tricked into unauthorized ones? (5) Misinformation — does the model confidently produce false information in your domain? Run Custodia against your codebase to find the structural vulnerability patterns first, then layer in manual red team testing for the model behavior side.

Blog→AI Compliance

AI ComplianceCase StudyApril 1, 2026·9 min read

How OpenAI Red Teams GPT-4:
Inside the Process of Breaking
Their Own Model

Before GPT-4 launched, OpenAI paid more than 50 external experts — biosecurity researchers, ex-intelligence officers, disinformation specialists — to spend months trying to break it. Here's what they found, what OpenAI fixed, what they shipped anyway, and what every developer building on top of LLMs should steal from their playbook.

The 50-Person Team You Never Heard About

Three months before GPT-4 went live in March 2023, OpenAI did something unusual for a tech company: they handed early access to researchers whose entire job was to make the model do things it wasn't supposed to.

The team included biosecurity researchers from the Johns Hopkins Center for Health Security, cybersecurity professionals from the Alignment Research Center, disinformation researchers, and domain experts in areas like chemical weapons synthesization — people who'd spent careers in fields where model outputs could genuinely cause harm. OpenAI called it red teaming. The output was a 94-page document called the GPT-4 System Card, which became one of the most detailed public disclosures of AI safety testing ever published.

Reading that document carefully tells you a lot about how the best-resourced AI lab in the world thinks about security. Most developers building AI products have no idea this level of process exists — or that some version of it applies to them too.

"GPT-4 has the potential to be used for disinformation, influence operations, and other harmful activities." — OpenAI GPT-4 System Card, March 2023

OpenAI published this. About their own model. Before launch.

What They Actually Tested

The red team organized their work into five broad threat categories. Each one maps directly to a risk class developers should think about when they build on top of any LLM.

01 — CBRN Uplift

Chemical, biological, radiological, and nuclear. Biosecurity experts tested whether GPT-4 could meaningfully help someone synthesize dangerous compounds or pathogens. Early versions could. OpenAI mitigated this with fine-tuning and refusals — but acknowledged in the System Card that the mitigations aren't perfect and that "GPT-4 could provide real uplift to people seeking to create biological weapons with the potential for mass casualties." That sentence, published in their own document, tells you how seriously they took this.

02 — Cybersecurity Attacks

The model was tested for its ability to assist in writing malware, explain exploitation techniques, and help with reconnaissance. GPT-4 was better at this than GPT-3.5 — which is the risk that comes with capability improvements. The team found it could explain vulnerabilities in existing code, suggest attack vectors, and write working exploit scaffolding. Mitigations included content filtering on specific query patterns and refusal training on a dataset of known harmful cybersecurity requests.

03 — Disinformation & Influence Ops

Researchers tested whether GPT-4 could write convincing false news articles, generate fake quotes attributed to real people, create targeted propaganda for specific demographics, and generate content designed to inflame political divisions. It could do all of these, and was dramatically better at it than GPT-3.5. The bar for creating professional-quality disinformation dropped from "you need a skilled writer" to "you need a prompt."

04 — System Prompt Extraction

Red teamers tested whether they could get GPT-4 to reveal its system prompt — the confidential instructions operators put at the top of conversations. They found that with persistent adversarial prompting, early versions would often reveal or paraphrase the system prompt. For developers building products on top of OpenAI's API, this is directly relevant: your system prompt is never truly secret, and you should treat it that way.

05 — Jailbreaks & Persona Attacks

Classic jailbreaks — "pretend you're DAN (Do Anything Now)," "respond as your unfiltered self," "write this as fiction" — were tested at scale. GPT-4 was more resistant than GPT-3.5 but not immune. The system card showed that GPT-4 refused disallowed content 82% of the time vs. 62% for GPT-3.5 on OpenAI's internal benchmark. That means at launch, 18% of attempts at regulated content weren't blocked.

What They Shipped Anyway (And Why)

Here's the uncomfortable part of the GPT-4 System Card that most coverage missed: OpenAI knew about these risks, documented them publicly, and shipped the model anyway. That wasn't recklessness — it was a calculated decision based on a framework called the "preparedness score."

The logic goes like this: every capability also exists in some form in the world already. Information about biological agents is in academic papers. Disinformation techniques are in political science textbooks. The question isn't "can the model do harmful thing X" — it's "does the model provide meaningful uplift over what's already freely available?" If the answer is no, the risk calculus shifts.

For some things — like the most dangerous CBRN queries — OpenAI decided the uplift was real and the mitigations needed to be strong. For others — like explaining how SQL injection works — the information is so widely available that blocking it would make the model less useful without meaningfully improving safety.

What This Means for You

If you're building a product on top of GPT-4 or any LLM, the base model has already been hardened against the highest-risk scenarios. But the hardening is imperfect and highly context-dependent. Your system prompt, your user base, and your application domain create a new threat surface that OpenAI's base training didn't account for. You are responsible for that layer.

The Part of Their Process Every Developer Should Steal

You don't have 50 domain experts. But OpenAI's process breaks down into a structure that scales to any team:

Define your threat model before you build

OpenAI defined specific risk categories — CBRN, cyber, disinfo, etc. — before they started testing. If you ship an AI feature without a written threat model, you haven't red teamed it, you've just tested it. Write down: who are the adversarial users? What's the worst thing they could get the model to do in your application context?

Test the model in your specific application context

GPT-4's base safety training was built for general-purpose use. Your system prompt changes the threat surface. If you've told the model it's a medical assistant, test whether users can bypass it to get advice the base model would give but your application shouldn't. If you've given it tool access, test whether it can be prompted into unauthorized tool calls.

Treat your system prompt as semi-public

OpenAI found that system prompt extraction was achievable with persistent prompting. Design your system prompt assuming users will eventually see it. Don't put secrets in the system prompt. Don't put instructions that reveal sensitive business logic. Treat it like a `.env` file that you assume gets leaked eventually.

Document what you found before you ship

The GPT-4 System Card was valuable not just for OpenAI — it was valuable for every developer building on top of GPT-4. The act of writing it forced OpenAI to articulate the residual risks. Do the same thing internally. Even a one-pager that says "we tested these 5 attacks, here's what we found, here's what we mitigated, here's what we accepted" is infinitely better than nothing.

What's Changed Since GPT-4

OpenAI formalized their process further in late 2023 with the Preparedness Framework — a structured policy that defines how they evaluate new models before deployment. It assigns risk levels (low/medium/high/critical) across four domains: CBRN, cybersecurity, persuasion/influence, and model autonomy. Any model scoring "high" in any domain requires additional safety work before deployment. "Critical" means they won't deploy it at all.

They also created a Safety Advisory Group that reviews model evaluations before major releases. The frontier model commitments they signed — alongside Anthropic, Google DeepMind, and Microsoft — include sharing red team findings with each other and with governments before major releases. This is new. Six months ago, AI companies kept this entirely internal.

None of this means the models are "safe" in some absolute sense. It means the highest-capability AI lab in the world is building an increasingly structured process for understanding and documenting the risks of what they ship. That process is worth understanding — not just because you're using their models, but because some version of it should exist in your product too.

The Bottom Line

OpenAI's red team process is more structured, more thorough, and more publicly documented than anything else in the industry. It still doesn't catch everything. The base model still has residual risks. And the layer you build on top of it has a completely separate attack surface that they never touched.

The GPT-4 System Card is worth reading in full. It's 94 pages of the world's best AI safety team documenting what they found when they tried to break their most capable model. For developers building AI products, it's the closest thing to a field manual for AI security that exists.

AI ComplianceInside Constitutional AI: How Anthropic Bakes Security Into Claude Before It Ships OWASP LLMPrompt Injection Prevention: Stop LLM01 Attacks Before They Ship OWASP LLMSecuring MCP Servers: Attack Surfaces in AI Tool Use

Red Team Your Codebase

Start With the
Structural Vulnerabilities.

Prompt injection · Excessive agency · Insecure output handling · Hardcoded secrets. One scan. AI fix prompts for every finding.

Scan My Codebase Free View Demo Report →

How OpenAI Red Teams GPT-4:Inside the Process of BreakingTheir Own Model