Earlier in 2025, OpenAI released its own open-weight models, GPT-OSS-20B and GPT-OSS-120B, both of which demonstrated strong reasoning performance compared with many existing open models.
Now the company has unveiled a new safety-focused family of models, GPT-OSS-Safeguard-20B and GPT-OSS-Safeguard-120B, designed primarily for safety classification tasks and built to let developers define their own safety boundaries.
Rather than imposing a one-size-fits-all safety framework, OpenAI lets developers adjust safety parameters to fit their context, for instance relaxing certain restrictions in specific use cases to give users more flexibility and functionality.
The new models leverage their reasoning capabilities to interpret developer-defined safety policies during inference, rather than relying on static, pre-trained constraints. They can perform safety classification on user messages, chat segments, or entire conversation histories, depending on the application’s requirements.
Because safety policies are applied at inference time rather than during training, developers can revise and refine their policies over time to improve performance. Each model accepts two inputs at once, a policy and the content to be analyzed, and returns both the classification result and a reasoning trace that explains the decision.
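As a rough illustration of that two-input pattern, the sketch below sends a developer-written policy and a piece of user content to a Safeguard model served behind an OpenAI-compatible endpoint (for example via a local vLLM server). The endpoint URL, model identifier, and prompt layout here are assumptions for illustration, not OpenAI's documented interface.

```python
# Hedged sketch: classify one message against a developer-defined policy.
# Assumes a Safeguard model is served behind an OpenAI-compatible endpoint
# (e.g. a local vLLM server); the URL, API key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# The policy travels with the request, so it can be edited and redeployed
# without retraining anything.
policy = """\
Classify the content against this policy.
- VIOLATION: content that facilitates buying or selling stolen account credentials.
- SAFE: everything else, including general discussion of account security.
Reply with a label and a brief justification."""

content = "Where can I buy a list of hacked streaming accounts?"

response = client.chat.completions.create(
    model="openai/gpt-oss-safeguard-20b",        # assumed model identifier
    messages=[
        {"role": "system", "content": policy},   # developer-defined safety policy
        {"role": "user", "content": content},    # the content to be classified
    ],
)

# The reply carries both the classification and the reasoning behind it.
print(response.choices[0].message.content)
```

Because the policy is just text supplied at request time, tightening or loosening it is an edit to a prompt rather than a retraining run.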
According to OpenAI, this approach is particularly effective in the following scenarios:
- When potential risks are emerging or evolving rapidly, requiring policies to adapt swiftly;
- When the domain is too sensitive or specialized for small-scale classifiers to handle effectively;
- When developers lack sufficient labeled data to train high-quality, risk-specific classifiers;
- When interpretability and reliability are more critical than latency or throughput.
No model is without trade-offs, and OpenAI notes two primary limitations of the Safeguard line:
First, if developers have the time and data to train traditional classifiers on tens of thousands of labeled samples, such custom systems can still outperform Safeguard on highly complex or high-stakes tasks. In other words, purpose-trained classifiers remain the better choice when maximum precision matters.
Second, the Safeguard models are more computationally demanding and slower to run, which makes them less suitable for scanning every piece of content on large-scale platforms.
The Safeguard models are released under the Apache 2.0 license, granting anyone the freedom to use, modify, and deploy them. Interested developers can download the models directly from OpenAI's official repository.
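Assuming the weights are hosted on Hugging Face under the openai organization, as the earlier GPT-OSS releases were, fetching the smaller model might look like the sketch below; the repository ID is an assumption, so check the official release page for the exact name.

```python
# Hedged sketch: download the smaller Safeguard model's weights locally.
# The repo_id is assumed; confirm the exact name in OpenAI's release notes.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="openai/gpt-oss-safeguard-20b")
print(f"Weights downloaded to: {local_dir}")
```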