Earlier in 2025, OpenAI released its own open-weight models, GPT-OSS-20B and GPT-OSS-120B, both of which demonstrated strong reasoning performance compared with many existing open models.
Now the company has unveiled a new safety-focused family of models, GPT-OSS-Safeguard-20B and GPT-OSS-Safeguard-120B, designed primarily for safety classification tasks and built to let developers define their own safety boundaries.
Rather than imposing a one-size-fits-all safety framework, OpenAI lets developers adjust safety parameters to fit their context, for instance relaxing certain restrictions in specific use cases to give users more flexibility and functionality.
The new models leverage their reasoning capabilities to interpret developer-defined safety policies during inference, rather than relying on static, pre-trained constraints. They can perform safety classification on user messages, chat segments, or entire conversation histories, depending on the application’s requirements.
Because safety policies are applied at inference time rather than during training, developers can revise and refine their policies over time to improve performance. Each model accepts two inputs at once, a policy and the content to be analyzed, and returns both the classification result and a reasoning trace that explains the decision.
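As a rough illustration of that two-input pattern, the sketch below sends a developer-written policy and a piece of user content to a Safeguard model served behind an OpenAI-compatible endpoint (for example via a local vLLM server). The endpoint URL, model identifier, and prompt layout here are assumptions for illustration, not OpenAI's documented interface.

```python
# Hedged sketch: classify one message against a developer-defined policy.
# Assumes a Safeguard model is served behind an OpenAI-compatible endpoint
# (e.g. a local vLLM server); the URL, API key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# The policy travels with the request, so it can be edited and redeployed
# without retraining anything.
policy = """\
Classify the content against this policy.
- VIOLATION: content that facilitates buying or selling stolen account credentials.
- SAFE: everything else, including general discussion of account security.
Reply with a label and a brief justification."""

content = "Where can I buy a list of hacked streaming accounts?"

response = client.chat.completions.create(
    model="openai/gpt-oss-safeguard-20b",        # assumed model identifier
    messages=[
        {"role": "system", "content": policy},   # developer-defined safety policy
        {"role": "user", "content": content},    # the content to be classified
    ],
)

# The reply carries both the classification and the reasoning behind it.
print(response.choices[0].message.content)
```

Because the policy is just text supplied at request time, tightening or loosening it is an edit to a prompt rather than a retraining run.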
According to OpenAI, this approach is particularly effective in the following scenarios:
- When potential risks are emerging or evolving rapidly, requiring policies to adapt swiftly;
- When the domain is too sensitive or specialized for small-scale classifiers to handle effectively;
- When developers lack sufficient labeled data to train high-quality, risk-specific classifiers;
- When interpretability and reliability are more critical than latency or throughput.
No model is without trade-offs, and OpenAI notes two primary limitations of the Safeguard line:
First, if developers have the time and data to train traditional classifiers on tens of thousands of labeled samples, such custom systems can still outperform Safeguard on highly complex or high-stakes tasks. In other words, purpose-trained classifiers remain the better choice when maximum precision matters.
Second, the Safeguard models are more computationally demanding and slower to run, which makes them less suitable for scanning every piece of content on large-scale platforms.
The Safeguard models are released under the Apache 2.0 license, granting anyone the freedom to use, modify, and deploy them. Interested developers can download the models directly from OpenAI's official repository.
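Assuming the weights are hosted on Hugging Face under the openai organization, as the earlier GPT-OSS releases were, fetching the smaller model might look like the sketch below; the repository ID is an assumption, so check the official release page for the exact name.

```python
# Hedged sketch: download the smaller Safeguard model's weights locally.
# The repo_id is assumed; confirm the exact name in OpenAI's release notes.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="openai/gpt-oss-safeguard-20b")
print(f"Weights downloaded to: {local_dir}")
```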