Pentestify LTD is a registered company in the UK.
Pentestify CEO Lucas Martin Calderon was invited to the Dappcon 2023 Summit to talk about marrying AI and blockchain security. In this talk, Lucas goes back to the basics of how AI actually works in order to better understand where it is heading: “Getting the basics right will help make a product that truly makes a difference and that will have the biggest impact”. He then ties together the latest developments in both open and closed source AI models, their advantages and disadvantages, and why Pentestify chose DRL and GNN models to build NEO: an automated, post-deployment smart contract vulnerability detection and remediation SaaS.
Find the full transcript below:
I’m super happy to be here, especially because I’ve been thinking quite a lot about what to speak about today: whether to speak quite technically about what AI is, or to compare our tools to other world-leading tools like static analyzers, dynamic analyzers or formal verification tools. At the end of the day, I decided to maximize the impact and go back to the basics of how AI works, to understand where the future is heading, so you can get the basics right and hence develop a product that really makes a difference.
As Elon Musk says: “One of the problems with entrepreneurs, with smart engineers, smart people, is that we focus on the wrong problems, so we try to optimize and maximize things that shouldn’t even exist.”
So who here in this room is afraid of AI replacing your job, or at least taking a bit of your job? Are there any takers?
Well, by the end of this, if you feel a little bit worse, then that means that I’ve done a good job.
Today we are going to talk about how AI works, why you should fear AI, and the ways to think about it. Then the exact same thing for the blockchain.
So what is a neural network?
You have some inputs, and between the inputs and the outputs is where the layers come in, right? That’s where you have some weights. The weights represent how strong the relations between these variables are, and those “b”s represent the biases. Training adjusts both to minimize the error, which is measured by the loss function, also called the cost function. Then you have an activation function, which could be softmax, ReLU or sigmoid, to shape the output. Finally you have the outputs, which could be “it’s a car”, “it’s a carpet” or “it’s a mug”, which would be classification, or it could be regression, which predicts numbers. We’ve all heard a lot about ChatGPT, LLMs and Transformers: what they do and how they work. I would like to give a very basic explanation, yet enough for you to understand why ChatGPT or other LLMs wouldn’t be apt for finding certain vulnerabilities.
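To make the pieces named above concrete, here is a minimal sketch of a forward pass through a tiny network. Everything in it is an assumption for illustration: the layer sizes, the weights and biases, and the three labels borrowed from the talk. A real model would learn the weights by minimizing a loss function rather than having them typed in.

```python
import math

def relu(x):
    # ReLU activation: negative values become zero
    return max(0.0, x)

def softmax(values):
    # Turn raw scores into probabilities that sum to 1
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases, activation=None):
    # One layer: weighted sum of the inputs plus a bias, per output neuron
    outputs = []
    for row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(row, inputs)) + b
        outputs.append(activation(z) if activation else z)
    return outputs

# Hand-picked, hypothetical numbers (not trained values)
inputs = [0.5, -1.0, 2.0]
hidden_w = [[0.2, -0.4, 0.1], [0.7, 0.3, -0.5]]
hidden_b = [0.1, -0.2]
out_w = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]
out_b = [0.0, 0.0, 0.0]
labels = ["car", "carpet", "mug"]

hidden = dense(inputs, hidden_w, hidden_b, relu)
probs = softmax(dense(hidden, out_w, out_b))
prediction = labels[probs.index(max(probs))]
print(prediction, [round(p, 3) for p in probs])
```

The classification case is shown here; for regression the final softmax would be dropped and the raw number used as the output.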
I don’t know if you can read it well, but the Transformer network that ChatGPT builds on is made of encoders and decoders, right?
Encoders take your input, the text you put in, and turn it into context, and decoders take that context and decode it into the end result, right? It’s quite important to understand how self-attention works. It comes from a Google paper called “Attention Is All You Need”, from 2017. It is quite old, but it’s the basis for all we’ve got right now. So if you want to use these tools to find vulnerabilities in smart contracts, which is more the job of smart contract security auditors, you already start wrong if you use a pre-trained model. Why? Because the attention mechanism of LLMs like ChatGPT is already tailored to maximize the probability of the next word, right? So given your question, the tokens and the whole embedding, what is the probability of each next word popping up?
This is important to know if the smart contract involves mathematical reasoning. When you ask it 5 + 5 and it outputs 10, it doesn’t know what 10 means, what addition means, or what five represents. It simply means that, across all the terabytes of training data, academic papers and so on, most of the time there is a 10 at the end of that sequence. So do you want that to be driving the security of your web3 company? I definitely hope not.
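The point about "5 + 5" can be illustrated with a deliberately crude sketch: a model that only counts which token followed a given context in its training text. This is not how a Transformer is implemented, and the corpus, function names and context size are all made up for illustration, but it shows an answer being produced from frequency alone, with no notion of addition.

```python
from collections import Counter, defaultdict

# Hypothetical training text; one line is deliberately wrong
corpus = [
    "5 + 5 = 10",
    "5 + 5 = 10",
    "5 + 5 = 10",
    "5 + 5 = 11",
    "2 + 2 = 4",
]

def train(lines, context_size=4):
    # Count which token follows each context of `context_size` tokens
    counts = defaultdict(Counter)
    for line in lines:
        tokens = line.split()
        for i in range(context_size, len(tokens)):
            context = tuple(tokens[i - context_size:i])
            counts[context][tokens[i]] += 1
    return counts

def next_token_probs(counts, prompt, context_size=4):
    # Probability of each next token, purely from observed frequencies
    context = tuple(prompt.split()[-context_size:])
    seen = counts.get(context, Counter())
    total = sum(seen.values())
    return {tok: n / total for tok, n in seen.items()} if total else {}

model = train(corpus)
probs = next_token_probs(model, "5 + 5 =")
print(probs)                                  # {'10': 0.75, '11': 0.25}
print(next_token_probs(model, "7 + 7 ="))     # {} : never seen, so no answer
```

The counting model gives "10" only because it appeared most often, and it has nothing to say about a sum it never saw. Real LLMs generalize far better than this, but the training objective is still the probability of the next token.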
So when thinking about all these things: on the right, as you can see, that’s simply how AI sees the input text that you give it. When we call it artificial intelligence, is it really intelligent? How intelligent is it?
The CEO of Stability AI said, and I might be wrong about this, that by the end of 2024 we will be able to have a downloadable ChatGPT on our phones, and the size would be 5GB, more or less.
Doesn’t that say a lot already? You’re able to train on terabytes of data: a big portion of the internet, the best books, the smartest academic papers. And then you’re able to download all that information in 5 gigabytes.
Are we talking about a sort of compression algorithm? Are we talking about something else? Well, that something else is what we interpret as intelligence, but we tend to project our own intelligence onto how AI works, right? It is something completely different. This is why, when really thinking about security, and even though certain models might output certain smart contract vulnerabilities, you truly need to understand how the model works first, and what kind of intelligence, compression algorithm or pattern recognition it has, to know the limits of its creativity.
This image is simply what the AI sees in the first step.
So if you look at this slide, these are the word embeddings. Each row is a token and each column is a feature of the token, a dimension. A dimension could represent things that we couldn’t even imagine; in this case it might be word similarity or semantic similarity, and it’s quite important for the self-attention algorithm.
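As a rough illustration of the row-per-token, column-per-dimension layout, the sketch below stores a few made-up four-dimensional embeddings and compares them with cosine similarity. The vectors are hand-written assumptions; real embeddings are learned and have hundreds or thousands of dimensions.

```python
import math

# Each row is a token, each column is one dimension (hypothetical values)
embeddings = {
    "animal": [0.9, 0.1, 0.0, 0.3],
    "dog":    [0.8, 0.2, 0.1, 0.4],
    "street": [0.1, 0.9, 0.7, 0.0],
}

def cosine(a, b):
    # Cosine similarity: near 1.0 means the vectors point the same way
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_animal_dog = cosine(embeddings["animal"], embeddings["dog"])
sim_animal_street = cosine(embeddings["animal"], embeddings["street"])
print(round(sim_animal_dog, 3), round(sim_animal_street, 3))
```

With these numbers "animal" sits much closer to "dog" than to "street", which is the kind of semantic similarity a dimension can end up encoding.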
This is actually taken from Jay Alammar, one of the best engineers at expressing with graphics how AI works. In this example the sentence is “the animal didn’t cross the street because it was too tired”. When you look at it the way the AI sees it, what does “it” refer to? Does it go back to the animal, or does it mean the street, right?
This is where attention comes in, and where it’s tailored to certain parameters. In this case, ChatGPT is tailored to make sense of the questions that users ask it, but it’s definitely not engineered to find vulnerabilities in smart contracts.
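The "it" example can be sketched with scaled dot-product attention on toy vectors. This is a simplification under stated assumptions: the three vectors are hand-picked so that "it" resembles "animal", and the same vectors are used as queries, keys and values, whereas a real Transformer derives each from learned projection matrices.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    # Scaled dot-product attention for a single query vector
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Output is the attention-weighted average of the value vectors
    output = [sum(w * v[j] for w, v in zip(weights, values))
              for j in range(len(values[0]))]
    return weights, output

tokens = ["animal", "street", "it"]
# Hypothetical vectors, chosen by hand for illustration
vectors = [
    [1.0, 0.0, 1.0, 0.0],  # animal
    [0.0, 1.0, 0.0, 1.0],  # street
    [0.9, 0.1, 0.8, 0.0],  # it
]

weights, output = attention(vectors[2], vectors, vectors)
for token, weight in zip(tokens, weights):
    print(f"{token}: {weight:.3f}")
```

Here "it" puts more weight on "animal" than on "street" only because the vectors were chosen that way. In a trained model, what attention focuses on is determined by the training objective, which is the speaker's argument for not expecting a chat-tuned model to focus on vulnerability patterns.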
Now, however, I would like to talk about the opposite of the monopoly of AI, and why AI companies should fear AI. It is really down to the community, and this is the message that I want to get across: it should be up to us, the community, to drive and set the pace for the future of development in web3, including AI and blockchain, and not up to these big corporations.
So GPT-4 was the first commercially available one, even though Google started way before. Then we’ve got Cohere, Stability AI, Anthropic. In this case it even seems that Mark Zuckerberg is actually one of the good guys, as Meta is one of the biggest research companies putting its research out there for free.
This is how fast AI has been moving. On February 24th LLaMA was released. It was open source, yet the weights of the model, which hold the intelligence it gained from training, were private. Somehow they got leaked on 4chan, and since then there has been a super fast acceleration of people building on top of these tools. To give you an example, on March 13th it was already running on consumer hardware, and on March 19th “Vicuna” and other models already surpassed Google’s “Bard”. So something that cost around $200 million to train in data and compute now takes $300 to fine-tune. Fine-tuning simply means making the model better, or tailoring it to your needs, after it has already been pre-trained.
GPT4All launches and already creates an entire ecosystem. The biggest open source LLM libraries, like LangChain, and vector databases like Weaviate were already popping up to catch up with that speed.
On March 30th BloombergGPT launches, and shortly after FinGPT, so Finance GPT, the open source version of that, launches as well. So, as you can see, since the weights of LLaMA were put out in the wild, definitely not by Meta but by someone else, I guess an insider, in less than six weeks we were already able to get the same or very similar accuracy, F1 score and precision as models that took years of development and hundreds of millions of dollars to train. So this is, again, the moat. There was also a leaked article from Google saying “we have no moat, and neither does OpenAI”. By moat they simply mean competitive edge or competitive advantage. Certain technologies and advancements in the field of AI allowed all of that to happen. However, it is important to note that you can also adapt a model simply by prompting, so you don’t always need to fine-tune it technically: prompting and giving context already helps.
Speaking of the technologies that helped the open source community carry all this advancement forward and reach the same precision as Google or OpenAI, there were two main ones: quantization and LoRA. LoRA, not to be confused with the LoRa of the electronics field, is low-rank adaptation, which can reduce the number of trainable parameters by 10,000 times. That helps free up the usual bottleneck of AI training, which is the GPUs. Quantization means faster inference and saves money: during training and inference you reduce the numerical precision used to store the model’s weights.
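Both ideas can be sketched with a little arithmetic. The first part counts trainable parameters for one weight matrix when the update is factored into two low-rank matrices, which is the core of low-rank adaptation; the second part shows a simple symmetric 8-bit quantization of a few weights. The matrix size, rank and weight values are assumptions chosen for illustration, and the single-matrix ratio printed here is not the 10,000 times figure quoted in the talk, which refers to a whole model.

```python
def full_params(d_in, d_out):
    # Trainable parameters when the whole weight matrix is updated
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    # Low-rank update W + B @ A, with B (d_out x rank) and A (rank x d_in)
    return rank * (d_in + d_out)

d, rank = 4096, 8  # hypothetical layer width and adapter rank
full = full_params(d, d)
lora = lora_params(d, d, rank)
print(full, lora, full // lora)

def quantize_int8(weights):
    # Symmetric quantization: map floats onto integers in [-127, 127]
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    # Recover approximate floats from the stored integers
    return [q * scale for q in quantized]

weights = [0.12, -0.5, 0.33, 0.9]  # made-up float weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
print(quantized, [round(r, 3) for r in restored])
```

Each quantized weight fits in one byte instead of four, at the cost of a small rounding error, which is why quantized models can run on consumer hardware.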
Now, the problems with generative AI. This is really where the research that I’ve been doing at Pentestify, in collaboration with University College London, comes in. We wanted to approach it in the most natural way, and by natural I mean that some of humanity’s best inventions, or rather engineering achievements, came from observing nature. So what better way to represent how a human finds and thinks about vulnerabilities than studying the brain and how it reacts when presented with a smart contract?
So we took 50 smart contract auditors and put them in an MRI machine to see which areas of the brain activated while they audited. There was a lot of noise, but thank God we were helped by medical professionals who know exactly how to read and interpret that data, and it aligned very well with the philosophy we had at Pentestify from the beginning. The question was what kind of AI we should develop to really find vulnerabilities without having the same weaknesses as OpenAI or ChatGPT. Again, we don’t want to optimize a problem that shouldn’t exist in the first place, right?
So we realized that the true intelligence for this very specific task was a mix of different types of intelligence. When the spatial and vision areas activated in the brain, we realized there’s already an AI model tailored for that: graph neural networks, which are able to reason across functions and through time. Again, as a smart contract auditor, when you’re writing or auditing code you might think of the control flow or data flow diagram, right? But you might also be thinking of all the different function calls across time, or in a different order than was originally intended. Graph neural networks are able to achieve that task and hold different pieces of information in parallel.
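The idea of information flowing along function calls can be sketched as message passing, the basic operation inside a graph neural network. This is not Pentestify's model: the call graph, function names and two hand-set features are hypothetical, and a real GNN would use learned weight matrices and non-linearities instead of the fixed averaging used here.

```python
# Hypothetical call graph: function -> functions it calls
graph = {
    "withdraw": ["_sendEther", "_updateBalance"],
    "_sendEther": ["externalCall"],
    "_updateBalance": [],
    "externalCall": [],
}

# Hypothetical node features: [makes external call, writes state]
features = {
    "withdraw": [0.0, 0.0],
    "_sendEther": [0.0, 0.0],
    "_updateBalance": [0.0, 1.0],
    "externalCall": [1.0, 0.0],
}

def message_pass(graph, features):
    # One round: each node mixes its own features with the mean of its callees
    updated = {}
    for node, neighbours in graph.items():
        own = features[node]
        if not neighbours:
            updated[node] = list(own)
            continue
        mean = [sum(features[n][i] for n in neighbours) / len(neighbours)
                for i in range(len(own))]
        updated[node] = [0.5 * o + 0.5 * m for o, m in zip(own, mean)]
    return updated

h1 = message_pass(graph, features)
h2 = message_pass(graph, h1)
print(h1["withdraw"], h2["withdraw"])
```

After one round `withdraw` only knows about its direct callees; after two rounds it also carries a signal from `externalCall`, two hops away. Stacking rounds is how a graph model can relate an external call to a state write elsewhere in the call graph.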
The language area of the brain was also quite active, and for that we use part of LLMs like ChatGPT, but only up to the embeddings. The embeddings, again, are the representation of the tokens, of the words that you put in. We wanted to make sure that the attention mechanism was tailored to find vulnerabilities instead of understanding the semantic context. Because at the end of the day we don’t want the AI to understand the semantics of the smart contract per se; we want it to understand the different instruction sets and their interactions with the EVM. There were a bunch of other algorithms as well, like long short-term memory (LSTM) networks for prediction. As for mathematical reasoning, we already know that an LLM might not be the best at it: even though it improves, it is from the base, from scratch, the wrong AI model for these things.
The way blockchain evolves: the good ways and the bad ways. The fact that those two are in red doesn’t mean that I don’t agree with them, but they are definitely something that we should take into consideration: the fact that both things exist at the same time, and the need for both to exist at the same time. For the first time in history, our money in banks will be programmable. That means that if you receive money from your job, or even from your own venture, you might need to spend it in a timely manner because otherwise it could get burned, or you won’t be able to route it through certain channels because that would be forbidden. Again, control has never been so active.
So what is the best way to marry blockchain and AI in the context of security? And why is it so interesting to mix them? Is it simply because AI was so prevalent in web2? Well, in this case the infrastructure of blockchain really helps: for the first time, the data is publicly available. We’ve accepted that our data and certain transactions are public, in exchange for at least the pleasure of not having them stored on a centralized server that someone else controls.
The sector is about to change as well with different encryption schemes: FHE and ZK are definitely there to change the game. And if you want to simulate different attacks on the blockchain, it’s never been so easy, with an infrastructure that you can fork, that you can literally copy and simulate.
Why is the security aspect so important? I’m sure this is not a hard sell that I need to make here, but in 2023 alone over three billion dollars were stolen. 70% came from smart contracts, and 92% of those smart contracts had already been audited by top firms. This says nothing against these top firms, but rather shows that the field evolves and that new vulnerabilities keep being found. Most tools, so static analyzers and dynamic analyzers, have predefined instructions, predefined vulnerabilities. Again, the very basis of AI, of deep learning, is being able to train and then run inference on unseen data, and so to infer new vulnerabilities whose patterns it hasn’t learned before or that haven’t been entered by an expert.
So that’s what we do at Pentestify, through five different AI models. You simply give us the address of the smart contract, and we get all the vulnerabilities, all the dependencies and the whole graph of the different smart contracts, and we continuously monitor it over time. We extract the potentially vulnerable patterns and store them in a database. Many of them are not vulnerable, but they live in a multi-dimensional space that not even humans can understand; as you saw with the embeddings, AI uses way more dimensions than we can possibly imagine. A pattern remains in the database until a smart contract with a similar vulnerability is found, and then we alert the team immediately. We are also able to find variations of vulnerabilities like reentrancy that a static analyzer wouldn’t detect unless it received an update.
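The store-and-match step described above can be sketched as a nearest-pattern lookup over embeddings. This is only an illustration of the general technique, not Pentestify's implementation: the pattern names, the three-dimensional vectors, the function name and the 0.95 threshold are all hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical database of stored pattern embeddings
pattern_db = {
    "reentrancy-variant": [0.9, 0.1, 0.8],
    "unchecked-call": [0.1, 0.9, 0.2],
}

def match_patterns(embedding, db, threshold=0.95):
    # Return stored patterns whose similarity to the new embedding passes the threshold
    hits = [(name, cosine(embedding, vec)) for name, vec in db.items()]
    hits = [(name, score) for name, score in hits if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Embedding of a newly observed contract (made-up values)
new_contract = [0.85, 0.15, 0.75]
alerts = match_patterns(new_contract, pattern_db)
for name, score in alerts:
    print(f"ALERT: similar to stored pattern '{name}' (similarity {score:.3f})")
```

Because the match is by similarity rather than by an exact rule, a variation of a stored pattern can still trigger an alert, which is the contrast with predefined static-analyzer rules drawn in the talk.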
So yeah, thank you very much. I was only able to give an overview of many, many topics, but I’m more than happy to answer any questions, now or after.
Yes, now it’s time for questions, so if anyone has questions please raise your hands.