GPT-4 has the world speculating about its potential to transform industry, but the important issue for the web3 community is to discover what this AI tool can do right now.
In the months since its March ‘23 release, developers have begun testing GPT-4’s ability to audit smart contracts. How accurately can it perform this critical task of verifying that a contract is safe and can’t be hacked? We have a roundup of several tests and their results for you.
Before we dig into the details, let’s go over a few basics about this technology, beginning with the core technology that underlies these chat tools.
We’ve all experienced how frustrating it can be to interact with simple chatbots that have limited capabilities. Large Language Models (LLMs) have greatly improved this situation. OpenAI’s ChatGPT, especially as powered by the company’s most recent model, GPT-4, is able to solve more complex problems by training on vast amounts of text.
GPT is an acronym for Generative Pre-trained Transformer. It’s called generative because it can synthesize what it’s given and generate something new, namely, textual responses to prompts. The model is pre-trained to give those responses. The transformer is the architecture that guides how the model will use the information it’s been given. The model is then fine-tuned for more particular tasks, such as detailed, useful chat responses in the case of ChatGPT.
GPT-4 is the most recent GPT model released by OpenAI. The research and deployment company describes GPT-4 as a major leap forward in “scaling up deep learning.” The result is “human-level performance on various professional and academic benchmarks.” For instance, GPT-4 got a top-10% result on a simulated bar exam.
What if that human is a blockchain developer? Can GPT-4 perform developer tasks such as a smart contract audit and find flaws that might be exploited by attackers? Web3 developers decided to find out.
4 Experiments that asked GPT-4 to audit smart contracts
Developers have taken several different approaches to testing GPT-4’s ability to conduct smart contract audits. As you’ll see, results have been mixed: some tests seem to go well, while in other cases GPT-4 can’t find contract flaws despite repeated and increasingly refined prompts.
Experiment 1: Coinbase tests a live Ethereum contract
Shortly after GPT-4’s release, Coinbase product head Conor Grogan popped a live Ethereum smart contract written in Solidity into GPT-4 to see if it could identify vulnerabilities.
The results seemed promising. But observers pointed out that since this contract was hacked in 2018, GPT-4 might have ‘figured out’ the contract’s problems simply from consuming internet discussions of the earlier hack.
Also, its results listed only issues that had been discussed online prior to the tool’s September ‘21 learning cutoff. So had it truly analyzed, or merely regurgitated? It’s unclear.
Experiment 2: Coinbase tests 20 more contracts
Not content with a single try, Coinbase then set up another test in which it submitted 20 different token contracts to GPT-4 for audits. As Decrypt reported, results this time were more worrying.
The AI tool gave results identical to those found in a manual, human review 12 of 20 times. But of the eight results that differed, in five cases GPT-4 incorrectly labeled high-risk assets as low-risk. Decrypt called this “the worst case failure.”
Also troubling: When the experiment was repeated with identical prompts, GPT-4 sometimes gave different answers than it had the first time. Coinbase’s Tom Ryan concluded that the tool “did not hit the accuracy bar to clearly demonstrate that it should be integrated into our asset review process…”
Experiment 3: Zellic asks GPT-4 to find a vault contract’s flaw
Also in March ‘23, the blockchain-security firm Zellic asked GPT-4 to find a major flaw in a vault contract. In a post about the test, Zellic’s Stephen Tong criticizes developers who’ve publicized successful audit tests as offering “cherry-picked” examples that obscure GPT-4’s shortcomings.
In his own test, he submitted a simple Solidity vault contract. The contract contained a serious, surface-level problem in the redeem function that could allow an attacker to drain the entire vault.
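Zellic’s post shows the actual contract; as an illustration of the kind of surface-level redeem bug that can drain a vault, here is a minimal Python simulation (a hypothetical stand-in, not Zellic’s Solidity code) in which redeem pays out without ever burning the caller’s shares:

```python
class Vault:
    """Toy vault: users deposit funds and receive shares 1:1."""

    def __init__(self):
        self.balance = 0   # total funds held by the vault
        self.shares = {}   # user -> share count

    def deposit(self, user, amount):
        self.balance += amount
        self.shares[user] = self.shares.get(user, 0) + amount

    def redeem(self, user, amount):
        assert self.shares.get(user, 0) >= amount, "not enough shares"
        self.balance -= amount
        # BUG: the caller's shares are never burned, so the same
        # shares can be redeemed over and over until the vault is empty.
        return amount

# An attacker with a small deposit drains funds belonging to others.
vault = Vault()
vault.deposit("victim", 90)
vault.deposit("attacker", 10)
stolen = 0
while vault.balance >= 10:
    stolen += vault.redeem("attacker", 10)

print(stolen)  # the attacker extracts far more than their 10-unit deposit
```

A bug like this is “surface-level” in the sense that a human auditor would spot the missing state update on a first read, which is what makes GPT-4’s repeated misses notable.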
At first, GPT-4 completely and “confidently” misses the problem, listing other non-critical issues. On a second try, Tong observes that GPT-4 “hallucinates” a bug in the redeem function, but once again misses the actual vulnerability.
Trying one final time, Tong submits the contract one function at a time, tailoring his prompt to ask GPT-4 specifically to “focus on serious security vulnerabilities with real potential impact.” The AI reports that it finds no such issues.
“This should be sufficient to show that ChatGPT is certainly not up to the task of auditing smart contracts, especially for mission-critical, financial code,” Tong concludes.
Experiment 4: Moonbeam tests a smart contract it had GPT-4 write
In May ‘23, developers at the cross-chain smart-contract platform Moonbeam decided to put GPT-4 through its paces: the team had the tool create, test, debug, and deploy a Solidity smart contract.
The process went fairly smoothly. GPT-4 first created a very simple contract based on prompts, then expanded it per instructions. It then wrote a set of tests and uncovered flaws.
With simple flaws, GPT-4 easily found the problem. If the problem was more complex, the tool generated a list of possible fixes to try. Developers tried adding a function that included a reentrancy vulnerability, and GPT-4 correctly diagnosed the problem.
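The reentrancy class of bug Moonbeam’s developers introduced can be illustrated without Solidity. Below is a minimal, hypothetical Python simulation (not Moonbeam’s code) of the pattern GPT-4 diagnosed: the contract makes an external call before updating the caller’s balance, letting the caller re-enter and drain funds:

```python
class Bank:
    """Toy contract: withdraw() sends funds via a callback BEFORE
    zeroing the caller's balance -- the classic reentrancy bug."""

    def __init__(self, funds):
        self.funds = funds    # total funds held by the contract
        self.balances = {}    # user -> credited balance

    def deposit(self, user, amount):
        self.funds += amount
        self.balances[user] = self.balances.get(user, 0) + amount

    def withdraw(self, user, receive_callback):
        amount = self.balances.get(user, 0)
        if amount > 0 and self.funds >= amount:
            self.funds -= amount      # funds leave at send time
            receive_callback(amount)  # external call happens first...
            self.balances[user] = 0   # ...balance is zeroed too late

class Attacker:
    def __init__(self, bank):
        self.bank = bank
        self.loot = 0

    def receive(self, amount):
        self.loot += amount
        # Re-enter withdraw() before the bank zeroes our balance.
        if self.bank.funds >= amount:
            self.bank.withdraw("attacker", self.receive)

bank = Bank(funds=90)   # other users' deposits
attacker = Attacker(bank)
bank.deposit("attacker", 10)
bank.withdraw("attacker", attacker.receive)
print(attacker.loot)    # far more than the 10 deposited
```

The standard fix is the checks-effects-interactions pattern: update the balance before making the external call, so a re-entrant call sees a zeroed balance.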
Despite the seemingly positive results, Moonbeam’s Kevin Neilson concluded that GPT-4 wasn’t yet sophisticated or reliable enough for smart contract audits:
“It's important to remember that ChatGPT can only act as an aid to a developer and cannot fulfill any sort of audit or review. Developers must be aware that generative AI tools like ChatGPT can produce inaccurate, buggy, or non-working code.”
Why Does GPT-4 Fall Short?
The tests show GPT-4’s auditing capabilities are very hit-and-miss. Why does the tool often fail at identifying major security issues in a smart contract? There are multiple challenges that leave GPT-4 missing the mark:
GPT-4’s training data ends in September 2021
The training data for GPT-4 was cut off in September 2021. Blockchain technology is a rapidly evolving sector, as are the means attackers use to compromise smart contracts. The further an LLM’s training cutoff recedes from the present, the more likely the model is to be unaware of vulnerabilities uncovered since. Future models may shorten this lag, but for now the substantial gap makes accurate, reliable smart contract audits harder.
Prompt engineering for GPT-4 is an art
As we saw in the tests, giving GPT-4 a generalized prompt can lead to higher failure rates, whereas a more specific, sharply tailored question can get better results. The model responds to the precise instructions it’s given. If the instructions are too general, GPT-4 may report false positives or overlook critical bugs.
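To make the contrast concrete, here is a hypothetical sketch of a generalized prompt next to a tailored one (the “focus on serious security vulnerabilities with real potential impact” wording comes from Tong’s test; the rest of the phrasing and the placeholder contract are our assumptions):

```python
# A placeholder standing in for real Solidity source.
CONTRACT = "contract Vault { /* ... Solidity source ... */ }"

# Generalized prompt: invites a laundry list of style nits and
# informational findings, burying (or missing) the critical bug.
vague_prompt = f"Review this smart contract:\n{CONTRACT}"

# Tailored prompt: narrows the model to high-impact findings only.
tailored_prompt = (
    "You are auditing a Solidity smart contract. "
    "Focus on serious security vulnerabilities with real potential impact, "
    "such as loss or theft of funds. Ignore style issues, gas "
    "optimizations, and informational findings.\n"
    f"{CONTRACT}"
)
```

Even with the tailored wording, Tong’s test showed GPT-4 still missed the vault’s critical flaw, so prompt engineering helps but is no guarantee.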
Perhaps as developers and testers become more experienced at writing prompts, GPT-4’s usefulness in smart contract audits will increase. You can learn more about how to optimize your prompts from this Microsoft guide.
The challenge of prioritizing smart contract flaws
Repeatedly in the tests, GPT-4 spotlights minor issues while ignoring critical safety risks. It sometimes must be prompted to ignore minor issues before it finds the big problems.
GPT-4 struggles as smart contract complexity increases
GPT-4 failed to identify some problems even in simple contracts, and its track record worsened as it encountered more sophisticated ones. Though LLM technology will no doubt continue improving rapidly, the complexity of smart contracts will likely continue to grow as well. That could leave LLM-based tools perpetually struggling to catch up with the advancing sophistication of new smart contracts.
GPT-4 is not a smart contract developer
Smart contract developers have a distinct advantage over LLM tools because they have experience writing smart contracts and watching them operate in real-world environments. While developers are experimenting with having GPT-4 write smart contracts, the model doesn’t yet have deep, focused training on them. That blind spot seems to leave the tool struggling to tell the difference between a massive security hole and a minor glitch.
Not there yet, but a tool to watch for smart contract audits
If you scan recent headlines about using GPT-4 for smart contract audits, it appears the blockchain community has reached consensus:
“ChatGPT can’t beat human smart contract auditors yet”–Cointelegraph
“Auditing with ChatGPT: Complementary But Incomplete”–Certik
“Smart Contract Auditors’ Jobs Are Safe, for Now”–BeInCrypto
“Experiments show AI could help audit smart contracts, but not yet”–Cointelegraph
“Can ChatGPT Really Replace Crypto Audits? Not Yet, Say Researchers”–Decrypt
We agree–LLMs still have a way to go before they can be considered a viable tool for performing a reliable smart contract audit. Should you take your smart contract live based on a GPT-4 audit alone? We wouldn’t advise it.
But one thing is certain: LLMs will continue to evolve and improve. They may rapidly progress from their current state–not mature enough to reliably audit smart contracts–to become a key part of developer teams’ cutting-edge toolkit for debugging contracts. Developers should definitely keep an eye on LLMs’ progress.
Perhaps the current state of things is best summed up by OpenZeppelin machine learning lead Mariko Wakabayashi: to date, LLMs haven’t been specifically trained to debug smart contracts. If and when that happens, we may see more promising results from the next iteration of LLMs for smart contract audits.