AI safety is a priority at Amazon. Our investment in safe, transparent, and responsible AI (RAI) includes collaboration with the global community and policymakers. We are members of and collaborate with organizations including the Frontier Model Forum, the Partnership on AI, and other forums organized by government agencies such as the National Institute of Standards and Technology (NIST). Consistent with Amazon's endorsement of the Korea Frontier AI Safety Commitments, we published our Frontier Model Safety Framework earlier this year.
During the development of the Nova Premier model, we conducted a comprehensive evaluation to assess its performance and safety. This included testing on both internal and public benchmarks, as well as internal automated and third-party red teaming. Once the final model was ready, we prioritized obtaining unbiased, third-party evaluations of the model's robustness against RAI controls. In this post, we outline the key findings from these evaluations, which demonstrate the strength of our testing approach and Amazon Nova Premier's standing as a safe model. Specifically, we cover our evaluations with two third-party evaluators: PRISM Eval and ActiveFence.
Evaluating Nova Premier with PRISM Eval
PRISM Eval's Behavior Elicitation Tool (BET) dynamically and systematically stress-tests AI models' safety guardrails. The methodology focuses on measuring how many adversarial attempts (steps) it takes to get a model to generate harmful content across several key risk dimensions. The central metric is "steps to elicit": the number of increasingly sophisticated prompting attempts required before a model generates an inappropriate response. A higher step count indicates stronger safety measures, as the model is more resistant to manipulation. PRISM's risk taxonomy (inspired by MLCommons' AI Safety Benchmarks) includes CBRNE weapons, violent crimes, non-violent crimes, defamation, and hate, among several others.
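To make the "steps to elicit" metric concrete, here is a minimal sketch of the measurement loop it implies. The `model`, `escalating_prompts`, and `is_harmful` interfaces are hypothetical stand-ins; PRISM Eval's actual BET harness is proprietary and not described in detail here.

```python
def steps_to_elicit(model, escalating_prompts, is_harmful, max_steps=100):
    """Count adversarial attempts until the model produces harmful output.

    Returns the 1-based step at which harmful content first appeared, or
    max_steps + 1 if the model resisted every attempt (higher = safer).
    """
    for step, prompt in enumerate(escalating_prompts[:max_steps], start=1):
        response = model(prompt)  # query the model under test
        if is_harmful(response):  # judge the response (e.g., a classifier)
            return step
    return max_steps + 1
```

The key design point is that the metric rewards resistance to *escalation*: each successive prompt is more sophisticated than the last, so a model that fails only at step 43 has withstood far more pressure than one that fails at step 6.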
Using the BET Eval tool and its V1.0 metric, which is tailored to non-reasoning models, we compared the recently released Nova models (Pro and Premier) with the latest models in the same class: Claude (3.5 v2 and 3.7 non-reasoning) and Llama 4 Maverick, all available through Amazon Bedrock. PRISM BET conducts black-box evaluations (in which model developers do not have access to the tests) of models integrated with its API. The evaluation, conducted with BET Eval MAX, PRISM's most comprehensive and aggressive testing suite, revealed significant variations in safety against malicious instructions. Nova models demonstrated superior overall safety performance, with an average of 43 steps for Premier and 52 steps for Pro, compared to 37.7 for Claude 3.5 v2 and fewer than 12 steps for the other models in the comparison set (namely 9.9 for Claude 3.7, 11.5 for Claude 3.7 Thinking, and 6.5 for Maverick). This higher step count suggests that, on average, Nova's safety guardrails are more sophisticated and harder to circumvent through adversarial prompting. The figure below presents the number of steps per harm category evaluated through BET Eval MAX.
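The per-model averages reported above are simply the mean of per-harm-category step counts. The sketch below shows that aggregation; the per-category numbers are purely illustrative (chosen only so the averages come out near the reported 43 and 37.7) and are not PRISM's actual measurements.

```python
# Illustrative per-harm-category "steps to elicit" scores. The real
# per-category values come from BET Eval MAX and are not reproduced here.
category_steps = {
    "Nova Premier": [40, 46, 43],    # hypothetical numbers only
    "Claude 3.5 v2": [35, 39, 39],   # hypothetical numbers only
}

def average_steps(per_category):
    """Average steps-to-elicit across harm categories (higher = safer)."""
    return sum(per_category) / len(per_category)

averages = {model: average_steps(steps) for model, steps in category_steps.items()}
```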
PRISM's evaluation provides valuable insight into the relative safety of different Amazon Bedrock models. Nova's strong performance, particularly its resistance to hate speech and defamation, represents meaningful progress in AI safety. However, the results also highlight the ongoing challenge of building truly robust safety measures into AI systems. As the field continues to evolve, frameworks like BET will play an increasingly important role in benchmarking and improving AI safety. Commenting on this collaboration, Nicolas Miailhe, CEO of PRISM Eval, said: "It's incredibly rewarding for us to see Nova outperforming strong baselines using BET Eval MAX; our goal is to build a long-term partnership toward safer-by-design models and to make BET more broadly available to model providers."
Manual Red Teaming with ActiveFence
The AI safety and security company ActiveFence benchmarked Nova Premier on Bedrock using prompts distributed across Amazon's eight core RAI categories. ActiveFence also evaluated Claude 3.7 (non-reasoning) and the GPT-4.1 API on the same prompt set. The flag rate for Nova Premier was lower than for the other two models, indicating that Nova Premier is the safest of the three.
| Model | 3P flag rate [↓ is better] |
| --- | --- |
| Nova Premier | 12.0% |
| Sonnet 3.7 (non-reasoning) | 20.6% |
| GPT-4.1 API | 22.4% |
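The flag rate in the table is the share of red-teaming prompts whose responses the evaluators flagged as violating policy. A minimal sketch, assuming a hypothetical list of per-prompt flag decisions:

```python
def flag_rate(flags):
    """Fraction of red-teamed responses flagged as violations (lower = safer).

    `flags` is a sequence of booleans, one per evaluated prompt.
    """
    return sum(flags) / len(flags)

# Illustrative: 12 flagged responses out of 100 prompts -> 0.12 (12.0%)
sample_flags = [True] * 12 + [False] * 88
rate = flag_rate(sample_flags)
```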
"Our role is to think like an adversary while acting in service of safety," said Guy Paltieli from ActiveFence. "By conducting a blind stress test of Nova Premier under realistic threat scenarios, we helped evaluate its security posture in support of Amazon's broader responsible AI goals, ensuring the model can be deployed with greater confidence."
These evaluations conducted with PRISM and ActiveFence give us confidence in the strength of our guardrails and our ability to protect our customers' safety when they use our models. While these evaluations demonstrate strong safety performance, we recognize that AI safety is an ongoing challenge requiring continuous improvement. These assessments represent a point-in-time snapshot, and we remain committed to regularly testing and improving our safety measures. No AI system can guarantee perfect safety in every scenario, which is why we maintain monitoring and response systems after deployment.
Acknowledgments: Vincent Ponzo, Elyssa Vincent