OpenZeppelin Flags Methodological Flaws in OpenAI’s EVMbench Blockchain Security Benchmark

There is a controversy opening up regarding the use of AI and blockchain security. OpenZeppelin has looked into the new AI benchmarks issued by OpenAI regarding smart contracts (EVMbench), and has found some issues with methodology as well as contamination of the data being tested.

Designed to assess how well AI models can identify, remediate and exploit vulnerabilities in Ethereum Virtual Machine smart contracts, the benchmark is the result of a collaboration between the crypto investment company Paradigm and researchers from Stanford University.

OpenZeppelin expressed support for the proposal but used the same scrutiny used for measuring other major DeFi protocols when doing the same with this benchmark proposal. This led to an examination of the benchmark that raised numerous important questions regarding how we will measure AI performance related to blockchain security going forwards into the future.

What Is EVMbench Designed to Do

EVMbench serves as a benchmark for testing AI models against actual vulnerabilities in smart contracts under Solidade code and EVM, allowing you to:

Identify security vulnerabilities in Solidity code,
Classify the severity of those security vulnerabilities,
Recommend patches for weakened security,
Demonstrate how an attacker would exploit a weakness.

The goal of the benchmark is to provide developers with an objective measurement of how effectively their code will secure a blockchain based solution when financial stakes are high and exploiting the blockchain can result in immeasurable loss.

With the growing use of AI in auditing processes, these benchmarks could impact development teams' selection of AI tools for audit/protocol security.

However; comparing AI in high-risk/no win environments require a high degree of methodological discipline in benchmarking AI;

Image by Author

OpenZeppelin’s Review Process

According to an OpenZeppelin representative, the company has chosen to review EVMbench following the same general procedures as those utilized for auditing large decentralized finance protocols.

OpenZeppelin has completed audits on many projects, including Aave, Lido, and Uniswap, which all process billions of dollars worth of transactions.

OpenZeppelin stated that its purpose was not to challenge this initiative; rather, it was to ensure that security claims based on AI are supported by arbitrary and rigorous statistical methodology.

The company stated publicly and in discussions with the public that artificial intelligence benchmarks that will impact decisions regarding security for blockchain projects must pass an adversarial test.

Key Issue 1: Training Data Contamination

Findings from my research demonstrate that contamination of training data presents considerable risk.

Contamination occurs where benchmark dataset used to assess performance of machine learning (ML) algorithms overlaps partially or completely with the data used to train algorithms. This overlap will lead to inflated performance metrics.

In the context of EVMbench, there is concern about contamination.

If any vulnerabilities contained within benchmarking datasets were present within widely available public repositories (e.g., GitHub) or in other published studies, there is a chance that highly advanced ML algorithms will have memorized those patterns (i.e., learned to memorize the association between training data and corresponding performances).

Thus undermining the EVMbench benchmarks' credibility as a valid measure of an algorithm's ability to reason.

Reasoning is critical in the world of blockchain security where there exists an environment of adversarial creativity where reliance on interpreting memorized data (i.e., recall) is much more difficult than demonstrating consistent applications of analytical reasoning (i.e., logic).

Key Issue 2: Vulnerability Classification Errors

OpenZeppelin has stated in its second main concern regarding vulnerability classification that there appear to be numerous issues classified as very high severity that are not able to be exploited in a practical manner. They indicated to us that at least four of these high severity classifications are indeed invalid because, under actual blockchain conditions, these vulnerabilities cannot actually be exploited.

The importance of the severity classification system is that:

• Severity classifications help focus resources on fixing the most important issues first

• Severity classifications impact model scores

• Public perception of AI capability will be shaped by severity classifications

If a model is correctly deprioritising a non-exploitable issue but that issue has been assigned a high severity, then that model may be unfairly penalised for doing so. On the flip side, a model may be able to simply flag many more issues without being able to determine if they are exploitable or not and may receive a higher score.

These discrepancies also undermine the reliability of the benchmarks.

Image by Author

Why Benchmark Integrity Matters for Blockchain Security

A Critical Factor Shaping the Adoption of Artificial Intelligence

A benchmark that provides a measure of confidence that a particular AI model will be able to effectively identify and exploit vulnerabilities is something that can lead development teams to incorporate it into their production audit pipelines.

There can be severe consequences for using flawed auditing tools within Decentralized Finance (DeFi) that include:

- Loss of user funds

- Protocol insolvency

- Disruption of governance

- Damage to reputation

Blockchain smart contracts are typically deployed and immutable. Security vulnerabilities cannot easily be patched without governance coordination or migration. This increases the need for accurate vulnerability classifications and sound evaluation metrics. An unreliable benchmark can create an environment of misplaced trust in AI driven security products.

The Growing Role of AI in Smart Contract Auditing

Smart contracts are now commonly reviewed using artificial intelligence (AI). The use of AI in this respect can be summarised as follows:

- To pre-scan programming code and locate identified new vulnerabilities,

- Assist human auditors in analysing the code for functional or logical errors,

- Provide recommendations for code patches if errors are located, and

- Create test cases that simulate the exploitation of the vulnerability.

The effective use of artificial intelligence will complement, but not replace, the work of human auditors. Increasingly, we are seeing the use of artificial intelligence in this way. EVMbench is an effort to evaluate how well AI performs against established metrics in this sub-domain. OpenZeppelin offers a critique of this evaluation method, noting the need for a secure and usably designed evaluation process for benchmarking purposes.

Lastly, to be effective with respect to adversaries who will actively seek weaknesses, evaluation processes must be designed such that they will not be able to be 'gamed'.

Broader Implications for AI Evaluation in Crypto

The controversy surrounding EVMbench highlights an ongoing challenge when evaluating AI; distinguishing between true reasoning and pattern recognition.

As the capabilities of large language models continue to expand, the benchmarks used to assess their capabilities typically also improve. However, without properly isolating and validating a benchmark's underlying dataset, such capability improvements could be attributed to having been exposed to training data rather than having been developed by true analytical depth.

This distinction is especially important when evaluating the security of smart contracts, as these types of exploits frequently involve complex interactions, contextual constraints, and economic edge cases. To be a reliable benchmark, a benchmark must:

• Feasibility of Fulfilling Requirements through Practical Exploitability

• Economic Considerations about Feasibility

• Execution Constraints Related to On-Chain Transactions

• Attack Surfaces That Exist in the Physical World

Should either the severity levels or assumptions about vulnerabilities used in benchmarking be incorrect, those benchmarks could lead developers astray. OpenZeppelin’s comments indicate that the crypto secu rity industry has the same expectations of AI-based benchmarks as is expected within the protocol audit process.

A Constructive Tension Between AI and Security Experts

It should be noted that OpenZeppelin expressed their support for the initiative prior to publishing their criticism. This suggests that the argument is not against the use of AI for benchmarking, but rather to strengthen the process of benchmarking AI.

The interrelationship between the blockchain secure auditing community and the AI research community is a constructive tension that will create:

Working together to develop definitions, criteria, and standards for datasets will help reduce the chance of excessive confidence in automated systems while also encouraging innovation, as AI-based tools continue to rise in popularity within the Web3 development space.

As artificial intelligence tools gain more and more traction in the Web3 development community, it has become increasingly important to establish a transparent process for validating their use.

Conclusion

EVMbench's results from OpenZeppelin highlight how challenging it is to assess the quality of artificial intelligence used for assessing security in the blockchain space. The discovery of potential contamination of training data that could impact how well AI can identify vulnerabilities in contracts, has generated a very important conversation around the integrity of benchmarks used in this industry. This industry manages hundreds of billions of dollars in value that is held on chain, so using sound methods when performing any sort of analysis is critical.

For artificial intelligence to become a reliable contributor to auditing smart contracts, any framework used to evaluate AI will also need to be subject to the type of adversarial assessment that the underlying protocols that artificial intelligence will help to establish. The convergence of AI and blockchain is expected to yield significant efficiencies but as this case study has shown, innovation will need to be subject to exacting standards for this outcome to be realized.