1 00:00:19,852 --> 00:00:22,212 S1: All right, Michael, welcome to Unsupervised Learning. 2 00:00:22,892 --> 00:00:24,412 S2: Hey, it's great to be here. Thanks for having me. 3 00:00:25,532 --> 00:00:29,892 S1: Yeah. So, uh, lots to talk about here. Uh, can 4 00:00:29,892 --> 00:00:31,732 S1: you give a quick intro on yourself? 5 00:00:32,492 --> 00:00:34,172 S2: Yeah, sure. So, uh, my name is Michael Brown. I'm 6 00:00:34,172 --> 00:00:36,812 S2: a principal security engineer at Trail of Bits. I lead up our 7 00:00:36,812 --> 00:00:40,932 S2: company's AI and ML security research group. We really focus 8 00:00:40,932 --> 00:00:44,972 S2: on two kinds of, uh, intersections between AI, ML, and security. 9 00:00:45,012 --> 00:00:51,332 S2: It's primarily using AI/ML technologies to solve traditional cybersecurity problems 10 00:00:51,332 --> 00:00:54,292 S2: that are really hairy and really kind of sticky, and 11 00:00:54,292 --> 00:00:57,972 S2: conventional methods have kind of failed to address. And then 12 00:00:57,972 --> 00:01:01,412 S2: we also, uh, to a smaller degree, look at, um, 13 00:01:01,452 --> 00:01:06,292 S2: the security of AI/ML-based systems. So, um, I was 14 00:01:06,532 --> 00:01:10,332 S2: also the lead designer, um, and team lead for, um, 15 00:01:10,372 --> 00:01:14,092 S2: Trail of Bits' team that entered into the AI Cyber Challenge. Uh, 16 00:01:14,132 --> 00:01:16,892 S2: we built the tool called Buttercup, which took second place 17 00:01:16,892 --> 00:01:21,382 S2: overall in the AIxCC. And, um. Yeah, that's 18 00:01:21,382 --> 00:01:21,862 S2: about it. 19 00:01:22,462 --> 00:01:26,062 S1: Yeah. That's perfect. And that's exactly what I'd like to 20 00:01:26,102 --> 00:01:32,622 S1: chat about. 
Um, so I guess, um, I guess the 21 00:01:32,622 --> 00:01:35,702 S1: thing I'm most interested in is, uh, just the design 22 00:01:35,702 --> 00:01:41,622 S1: of the system, and, um, I guess overall, what you 23 00:01:41,622 --> 00:01:44,542 S1: know about the designs of the other system. So design 24 00:01:44,542 --> 00:01:49,222 S1: versus design, system versus system. What? Whatever you want to 25 00:01:49,222 --> 00:01:51,541 S1: share or can share. Like what? What are your thoughts 26 00:01:51,542 --> 00:01:54,541 S1: on that? Um, I guess everyone releases open source. So 27 00:01:54,862 --> 00:01:56,462 S1: maybe you've had a chance to look at some of 28 00:01:56,462 --> 00:01:59,662 S1: the other offerings. Maybe you've heard them talking, maybe you know, 29 00:01:59,662 --> 00:02:02,742 S1: the teams. Uh, so I guess what kind of Intel 30 00:02:02,742 --> 00:02:06,502 S1: do you have on what everyone else was doing versus 31 00:02:06,502 --> 00:02:10,062 S1: what you guys were doing? And how do you think 32 00:02:10,182 --> 00:02:11,142 S1: that went? 33 00:02:12,502 --> 00:02:14,622 S2: Yeah. Well, um, yeah, I guess I can answer that 34 00:02:14,622 --> 00:02:17,992 S2: last part pretty easily. It went pretty well for us. Um, 35 00:02:18,232 --> 00:02:20,712 S2: so we took second place. Uh, the team that finished 36 00:02:20,712 --> 00:02:23,952 S2: in first. Team Atlanta. Um, they had a pretty similar 37 00:02:23,952 --> 00:02:28,512 S2: setup to ours. Um, they had more components, more moving parts, uh, 38 00:02:28,512 --> 00:02:31,552 S2: more pieces. They had more hands. 
Um, larger team to 39 00:02:31,552 --> 00:02:34,112 S2: be able to kind of implement more, um, but ultimately 40 00:02:34,112 --> 00:02:37,952 S2: they had a really similar kind of set of design principles, um, 41 00:02:37,992 --> 00:02:41,632 S2: that worked out for us. The third-place finishing team, Theori, they, um, 42 00:02:41,672 --> 00:02:44,112 S2: had a bit of a deviation in terms of like 43 00:02:44,112 --> 00:02:47,232 S2: their conceptual, uh, principles that guided how they built their system. 44 00:02:47,232 --> 00:02:49,712 S2: But I can get into that in a bit. Um, 45 00:02:49,952 --> 00:02:51,392 S2: I guess I can first start off by talking a 46 00:02:51,392 --> 00:02:55,192 S2: little bit about our concept. So it's interesting. Um, you know, 47 00:02:55,232 --> 00:02:57,472 S2: the concept for Buttercup changed quite a bit over the 48 00:02:57,472 --> 00:03:00,032 S2: course of the AI Cyber Challenge. 49 00:03:00,032 --> 00:03:03,832 S2: So this got announced, um, a couple years back, and 50 00:03:03,832 --> 00:03:06,592 S2: there was a period of about 4 or 5 months, um, 51 00:03:06,672 --> 00:03:09,512 S2: after the cyber challenge was announced, but before DARPA had 52 00:03:09,512 --> 00:03:13,031 S2: really released any rules. So we didn't really know exactly 53 00:03:13,312 --> 00:03:15,282 S2: how the competition was going to be structured. 54 00:03:15,282 --> 00:03:16,682 S2: We just knew that we would have to build a 55 00:03:16,681 --> 00:03:22,281 S2: fully autonomous, AI-driven system that could find and patch vulnerabilities, um, 56 00:03:22,322 --> 00:03:26,202 S2: with a high degree of accuracy. Um, so originally, the 57 00:03:26,202 --> 00:03:29,962 S2: concept that I drew up along with my co-creator Ian Smith, um, 58 00:03:30,562 --> 00:03:34,042 S2: was originally really ambitious. 
Lots of moving parts, lots of 59 00:03:34,042 --> 00:03:39,602 S2: static analysis, dynamic analysis, lots of, um, conventional techniques, lots 60 00:03:39,602 --> 00:03:42,642 S2: of AI/ML-based techniques. But ultimately, once the rules came out, 61 00:03:42,642 --> 00:03:44,442 S2: it kind of got pared down quite a bit. Um, 62 00:03:44,442 --> 00:03:47,322 S2: some of the things that we wanted to do, um, were, 63 00:03:47,522 --> 00:03:49,122 S2: were marked as like out of scope. Some of the 64 00:03:49,122 --> 00:03:52,162 S2: stuff we wanted to do was marked as against the rules, um, 65 00:03:52,162 --> 00:03:54,322 S2: just for the tractability of the competition. 66 00:03:54,322 --> 00:03:56,802 S1: So is that because they were, they would have been 67 00:03:56,802 --> 00:03:59,402 S1: too expensive? Didn't you have budgets you had to stay under? 68 00:04:00,162 --> 00:04:02,722 S2: Yeah. So some of it was definitely, um, budgetary and 69 00:04:02,722 --> 00:04:04,562 S2: some stuff was just, you know, flat out against the rules. 70 00:04:04,562 --> 00:04:07,402 S2: We looked at fine-tuning a large language model, um, 71 00:04:07,442 --> 00:04:10,602 S2: with information about lots of open source software. And, um, 72 00:04:10,642 --> 00:04:15,022 S2: there ended up being a rule about pre-baking models. So, okay, really, 73 00:04:15,022 --> 00:04:17,702 S2: kudos to DARPA for making sure that, you know, competitors 74 00:04:17,702 --> 00:04:21,022 S2: didn't have the ability to kind of, um, skew the 75 00:04:21,022 --> 00:04:23,622 S2: systems that they build for the test, which is, you know, 76 00:04:23,662 --> 00:04:27,182 S2: finding and patching vulnerabilities in open source software. Um, so, yeah, 77 00:04:27,222 --> 00:04:29,541 S2: there was a lot of stuff that got cut down. Um, 78 00:04:29,582 --> 00:04:33,382 S2: but ultimately the design of our system, um, was, 79 00:04:33,382 --> 00:04:35,541 S2: was basically a pipeline. 
We kind of broke the 80 00:04:35,541 --> 00:04:37,491 S2: problem down. We realized we had to do basically 4 81 00:04:37,492 --> 00:04:40,421 S2: or 5 things really well to win this competition. We 82 00:04:40,422 --> 00:04:42,462 S2: had to be able to find vulnerabilities. And not only that, 83 00:04:42,462 --> 00:04:44,302 S2: we had to be able to prove they exist. So 84 00:04:44,302 --> 00:04:46,942 S2: it wasn't enough just to, you know, use a static 85 00:04:46,942 --> 00:04:49,302 S2: analysis scanner and say, hey, this thing thinks there's a 86 00:04:49,302 --> 00:04:54,501 S2: vulnerability on line 50 of, you know, whatever. Uh, you actually 87 00:04:54,502 --> 00:04:56,862 S2: had to have a crashing test case for the first 88 00:04:57,981 --> 00:05:01,582 S2: round of the competition, the semifinals. And in the finals, 89 00:05:01,702 --> 00:05:05,222 S2: they relaxed this requirement. But the pathway 90 00:05:05,222 --> 00:05:08,982 S2: to getting lots of points basically still required one. Um, 91 00:05:08,981 --> 00:05:11,462 S2: so you had to find vulnerabilities and also prove 92 00:05:11,462 --> 00:05:14,152 S2: they exist with a crashing input, or an input that 93 00:05:14,152 --> 00:05:19,232 S2: would trigger a sanitizer in the target function. Um, you 94 00:05:19,231 --> 00:05:22,032 S2: had to be able to contextualize and draw additional information 95 00:05:22,032 --> 00:05:25,952 S2: about this vulnerability. Otherwise, patching was doomed to fail. Um, 96 00:05:25,992 --> 00:05:30,272 S2: and then you had to actually patch the vulnerability. Um, 97 00:05:30,791 --> 00:05:35,072 S2: so this is a highly complex, uh, problem that conventional 98 00:05:35,072 --> 00:05:39,312 S2: approaches to software analysis have really kind of not addressed well, 99 00:05:39,312 --> 00:05:41,032 S2: in my opinion. And it was a great area to 100 00:05:41,072 --> 00:05:43,272 S2: use AI. 
And then we also, you know, finally we 101 00:05:43,272 --> 00:05:47,272 S2: had to orchestrate all of these functions and do really 102 00:05:47,272 --> 00:05:50,032 S2: high quality engineering around all of them so that the 103 00:05:50,032 --> 00:05:53,032 S2: system would stay up and running for several days. Um, 104 00:05:53,032 --> 00:05:54,872 S2: so based on those kind of 4 or 5, depending 105 00:05:54,872 --> 00:05:57,632 S2: on how you chop them up, core principles or core 106 00:05:57,632 --> 00:05:59,671 S2: tasks that we had to do, um, we kind of 107 00:05:59,712 --> 00:06:01,952 S2: decided on an approach that we kind of call the 108 00:06:01,952 --> 00:06:04,672 S2: best of both worlds, which was, you know, we knew 109 00:06:04,712 --> 00:06:08,752 S2: that conventional software analysis, whether it's dynamic, static, hybrid, whatever, um, 110 00:06:08,791 --> 00:06:12,162 S2: it really excels at certain subproblems within this pipeline. and 111 00:06:12,162 --> 00:06:15,722 S2: it really struggles with other ones. And AIML and specifically 112 00:06:15,722 --> 00:06:19,202 S2: generative AI, which the competition was, was kind of heavily 113 00:06:19,202 --> 00:06:22,522 S2: skewed towards generative AI. Generative AI does really well at 114 00:06:22,522 --> 00:06:25,162 S2: certain types of subproblems in this pipeline, but also really 115 00:06:25,162 --> 00:06:29,282 S2: struggles with others. So our approach is pretty straightforward. We're 116 00:06:29,282 --> 00:06:31,442 S2: going to merge the best in class capability for each 117 00:06:31,442 --> 00:06:35,481 S2: part of this pipeline. 
Uh, stitch them together with high uptime, 118 00:06:35,481 --> 00:06:39,842 S2: high reliability engineering code, um, and then focus on doing really, 119 00:06:39,842 --> 00:06:44,042 S2: really well for the largest number of, um, the largest 120 00:06:44,041 --> 00:06:48,282 S2: number of possible targets that we could possibly, um, that 121 00:06:48,282 --> 00:06:49,522 S2: we could possibly do well in. 122 00:06:51,322 --> 00:07:01,002 S1: Okay. Yeah. Interesting. So would you say that, um. Basically 123 00:07:01,002 --> 00:07:03,241 S1: those things that you described in the beginning, those 124 00:07:03,242 --> 00:07:05,882 S1: are like modules and they should almost like, kind of 125 00:07:05,922 --> 00:07:08,442 S1: work independently. So you can, like, hand a task to 126 00:07:08,481 --> 00:07:11,372 S1: each of them. Is that kind of the system 127 00:07:11,372 --> 00:07:12,332 S1: design idea? 128 00:07:12,892 --> 00:07:15,692 S2: Yeah. Yeah. So we, um, part of this was just 129 00:07:15,732 --> 00:07:20,052 S2: surviving a really rapid development cycle. This wasn't really advertised 130 00:07:20,092 --> 00:07:22,012 S2: all that well, but we actually only had about three 131 00:07:22,012 --> 00:07:26,292 S2: months to develop the first version of Buttercup in the semi-finals. Um, 132 00:07:26,292 --> 00:07:29,452 S2: and we actually only had about six months to develop, um, 133 00:07:29,492 --> 00:07:32,332 S2: the final version of Buttercup or Buttercup 2.0, which 134 00:07:32,332 --> 00:07:35,132 S2: took second place in the finals. Um, and that was 135 00:07:35,132 --> 00:07:37,852 S2: because even though each round of the competition ran for 136 00:07:37,852 --> 00:07:41,012 S2: a year, it took DARPA a while to solicit feedback 137 00:07:41,012 --> 00:07:45,572 S2: from competitors, other stakeholders, and actually solidify the rules. Um, 138 00:07:45,612 --> 00:07:47,732 S2: and until the rules were solidified, 
it was really a 139 00:07:47,732 --> 00:07:51,652 S2: risk to do really kind of any development on the system. Also, 140 00:07:51,652 --> 00:07:54,772 S2: certain things like the technical specifics on their competition 141 00:07:54,772 --> 00:07:58,492 S2: API weren't available until later in the, in these cycles. Um, 142 00:07:58,492 --> 00:08:01,292 S2: so part of the reason why we modularized each component 143 00:08:01,612 --> 00:08:04,812 S2: was so that we could take smaller subteams within my 144 00:08:04,812 --> 00:08:08,452 S2: larger team of about ten engineers, um, all working some 145 00:08:08,452 --> 00:08:10,862 S2: degree of part time on this system, so we can 146 00:08:10,862 --> 00:08:12,982 S2: modularize it, keep them kind of separate. You know, it 147 00:08:12,982 --> 00:08:14,782 S2: gives us this integration problem that we have to deal 148 00:08:14,782 --> 00:08:15,902 S2: with at the end. We have to kind of put 149 00:08:15,902 --> 00:08:18,302 S2: everything together and make sure that it runs well. Um, 150 00:08:18,342 --> 00:08:19,822 S2: but it was kind of a necessity. It was kind 151 00:08:19,822 --> 00:08:21,902 S2: of a necessity because we had to work on developing 152 00:08:21,902 --> 00:08:25,342 S2: everything independently. We couldn't afford to just do the first block 153 00:08:25,622 --> 00:08:27,302 S2: and have it become like, you know, that meme 154 00:08:27,302 --> 00:08:31,302 S2: of the horse drawing with a really finely defined head, and 155 00:08:31,302 --> 00:08:33,222 S2: then as it gets towards like the back parts 156 00:08:33,222 --> 00:08:35,382 S2: of the animal, it turns into like a rough sketch. 157 00:08:35,502 --> 00:08:37,381 S2: That was what was going to happen if 158 00:08:37,382 --> 00:08:40,822 S2: we didn't modularize this. 
Um, but it also helped because 159 00:08:40,822 --> 00:08:43,742 S2: as we decided to change out strategies or play with 160 00:08:43,742 --> 00:08:45,982 S2: different strategies, made it really easy to kind of plug 161 00:08:45,982 --> 00:08:48,262 S2: and play different parts to see what would work later on. 162 00:08:49,462 --> 00:08:52,822 S1: Yeah, that makes sense. So I keep having this debate 163 00:08:52,822 --> 00:08:56,381 S1: with a whole bunch of people. It's kind of around, um, 164 00:08:56,942 --> 00:09:00,542 S1: let the model do the work because the model is smarter. Um, 165 00:09:00,742 --> 00:09:04,781 S1: and it just understands what to do. And then there's uh, 166 00:09:05,462 --> 00:09:10,112 S1: the other argument, which is, um, build a robust system 167 00:09:10,832 --> 00:09:13,552 S1: and you have the model kind of just be the 168 00:09:13,552 --> 00:09:17,432 S1: intelligence that helps guide the system or moves things through 169 00:09:17,432 --> 00:09:22,392 S1: the system, or maybe routes, uh, across the system or whatever. 170 00:09:22,712 --> 00:09:25,432 S1: But the system itself should be set up really well, 171 00:09:26,232 --> 00:09:28,752 S1: and you're kind of like functioning as a router. And 172 00:09:28,752 --> 00:09:33,352 S1: then when the model gets updated, it makes the system better. Um, 173 00:09:33,592 --> 00:09:37,272 S1: but the counter to that is basically that we're just 174 00:09:37,272 --> 00:09:40,032 S1: going to design bad systems. So we should stop trying 175 00:09:40,032 --> 00:09:43,192 S1: to be rigid there and just use the model. Like 176 00:09:43,352 --> 00:09:44,752 S1: where do you guys fall on that? 177 00:09:45,432 --> 00:09:49,672 S2: Uh, I think it was probably closest to the second 178 00:09:49,672 --> 00:09:53,712 S2: one and maybe more like an an undescribed third thing. 
179 00:09:53,712 --> 00:09:56,911 S2: So I'll kind of go over it. For AI, um, you know, we've, 180 00:09:56,912 --> 00:09:58,792 S2: we've been, you know, and me in particular, I've been doing 181 00:09:58,792 --> 00:10:03,592 S2: research on like applied AI for, for security problems since before, uh, 182 00:10:03,592 --> 00:10:06,271 S2: the large language model became the predominant form of technology, 183 00:10:06,272 --> 00:10:12,402 S2: back in the, you know, 2018, 2019 time frame. Um, and uh, realistically, 184 00:10:12,402 --> 00:10:14,482 S2: like large language models are great at a good number 185 00:10:14,482 --> 00:10:17,842 S2: of things. Um, but they really struggle with certain things. 186 00:10:18,282 --> 00:10:20,881 S2: And particularly in a challenge like this where you have 187 00:10:20,881 --> 00:10:23,722 S2: to do multiple things right in sequence in order to 188 00:10:23,722 --> 00:10:27,082 S2: be successful, you have to worry about errors that start 189 00:10:27,122 --> 00:10:31,242 S2: off in early stages of an LLM-heavy pipeline that 190 00:10:31,242 --> 00:10:33,562 S2: compound over time, until eventually you get to the point 191 00:10:33,562 --> 00:10:36,521 S2: where everything kind of collapses. Um, so our philosophy 192 00:10:36,522 --> 00:10:40,122 S2: on using AI, uh, specifically within the AI Cyber Challenge 193 00:10:40,122 --> 00:10:42,602 S2: and also kind of more broadly, um, is to use 194 00:10:42,602 --> 00:10:49,082 S2: it for, um, tightly constrained, highly contextualized problems where, um, 195 00:10:49,362 --> 00:10:51,842 S2: the models are set up for success. Um, so this 196 00:10:51,842 --> 00:10:54,162 S2: is actually kind of an interesting anecdote. 
Um, during the 197 00:10:54,162 --> 00:10:58,122 S2: first round of, uh, during the first round of the 198 00:10:58,122 --> 00:11:03,202 S2: AI Cyber Challenge, um, the whole concept of like multi-agent systems, 199 00:11:03,442 --> 00:11:08,342 S2: systems that have, like, tools available to them, um, didn't 200 00:11:08,342 --> 00:11:10,622 S2: really exist. It was like in a couple of papers 201 00:11:10,622 --> 00:11:13,901 S2: on arXiv, and ultimately, um, the way we built our 202 00:11:13,902 --> 00:11:17,582 S2: patcher for the semi-finals and for the finals, um, 203 00:11:17,622 --> 00:11:20,862 S2: is now reflective of how LLM-driven systems are just 204 00:11:20,862 --> 00:11:23,742 S2: built today. So it's actually really vindicating. So like our 205 00:11:23,742 --> 00:11:28,021 S2: patcher is, like, a multi-agent system. It's got multiple 206 00:11:28,022 --> 00:11:30,662 S2: large language models, each with different roles to play within 207 00:11:30,662 --> 00:11:34,342 S2: this process, that collaborate to generate a patch and then 208 00:11:34,342 --> 00:11:38,021 S2: validate it to make sure that it, one, will actually compile; 209 00:11:38,062 --> 00:11:41,582 S2: two, will actually fix the vulnerability that we've discovered; and 210 00:11:41,582 --> 00:11:44,462 S2: three, doesn't break other functionality within the program. So we 211 00:11:44,462 --> 00:11:46,342 S2: found that trying to ask one large language model to 212 00:11:46,342 --> 00:11:48,982 S2: do all of that didn't really work out. And also 213 00:11:48,982 --> 00:11:52,662 S2: in the semi-finals, the, the reasoning models, um, or the 214 00:11:52,662 --> 00:11:55,702 S2: thinking models, depending on, on the branding, they didn't exist, 215 00:11:55,702 --> 00:11:57,702 S2: they weren't available. They weren't even available to us to 216 00:11:57,742 --> 00:12:02,222 S2: use as like, um, early adopter models in the AIxCC. 
217 00:12:02,222 --> 00:12:04,262 S2: So we were dealing with, with simple, you know, back 218 00:12:04,261 --> 00:12:09,592 S2: and forth, um, style chat models. Um, so we actually 219 00:12:09,592 --> 00:12:12,391 S2: had to build in a lot of this reasoning as 220 00:12:12,392 --> 00:12:14,912 S2: part of this, like multi-agent architecture, we had to build 221 00:12:14,912 --> 00:12:18,512 S2: in a lot of like reliability and engineering code around 222 00:12:18,511 --> 00:12:23,872 S2: maintaining the pipeline. Um, fortunately, the process for um, discovering 223 00:12:23,872 --> 00:12:27,272 S2: artifacts and submitting them was pretty rigid. Um, so it 224 00:12:27,272 --> 00:12:29,912 S2: didn't really affect us that much in terms of or 225 00:12:29,912 --> 00:12:31,232 S2: it didn't have to like put a lot of really 226 00:12:31,232 --> 00:12:34,632 S2: complex reasoning in, um, but actually we ended up even 227 00:12:34,631 --> 00:12:36,552 S2: by the end of the finals, we didn't use a 228 00:12:36,552 --> 00:12:39,952 S2: reasoning or a thinking model, um, in Buttercup, because we'd 229 00:12:39,952 --> 00:12:42,552 S2: actually built it in, it was part of the circuitry 230 00:12:42,552 --> 00:12:45,832 S2: or part of like the, um, the Python code, part 231 00:12:45,832 --> 00:12:49,312 S2: of our orchestration code. Um, so we had the opportunity 232 00:12:49,312 --> 00:12:50,712 S2: in the finals to take that out and let the 233 00:12:50,712 --> 00:12:52,992 S2: model do the work. We kind of explored it a 234 00:12:52,992 --> 00:12:55,592 S2: little bit, but ultimately we decided against it because the 235 00:12:55,592 --> 00:12:58,672 S2: best case scenario was that the model would kind of 236 00:12:58,712 --> 00:13:01,792 S2: figure out on its own how to break the problem 237 00:13:01,792 --> 00:13:03,752 S2: down and how to do individual things, and what tools 238 00:13:03,752 --> 00:13:07,242 S2: to call in sequence. 
Uh, but we were already subject 239 00:13:07,242 --> 00:13:09,202 S2: matter experts who did it exactly the way it should 240 00:13:09,202 --> 00:13:12,242 S2: be done. So the best case scenario is that 241 00:13:12,242 --> 00:13:14,762 S2: the model was able to replicate what we've done, only 242 00:13:14,761 --> 00:13:17,842 S2: at a more expensive per-call cost. Um, or more expensive, 243 00:13:17,881 --> 00:13:21,882 S2: like, volume of tokens. Um, so we actually kept, um, we, 244 00:13:21,881 --> 00:13:23,842 S2: we did upgrade our models. We went from the GPT 245 00:13:23,881 --> 00:13:28,161 S2: three series, um, and the Claude three, uh, series of 246 00:13:28,162 --> 00:13:33,122 S2: models and moved up to, um, the four and like 247 00:13:33,162 --> 00:13:36,362 S2: the basically the Gen four versions of models for the final. 248 00:13:36,362 --> 00:13:39,402 S2: So we, we upgraded the underlying models, but we very much, um, 249 00:13:39,442 --> 00:13:42,562 S2: kept the problems very small for the, for the AIs 250 00:13:42,682 --> 00:13:45,362 S2: or for the, um, for the AI models, so that 251 00:13:45,362 --> 00:13:48,122 S2: we would avoid this issue where you have compounding errors, 252 00:13:48,362 --> 00:13:51,642 S2: you have to worry about like these, these modulo errors of, 253 00:13:51,682 --> 00:13:54,082 S2: you know, deciding to do the wrong thing in sequence. 254 00:13:54,562 --> 00:13:56,682 S2: And that actually turns out to really, uh, to 255 00:13:56,682 --> 00:14:00,242 S2: penalize you heavily in these long systems because, you know, 256 00:14:00,282 --> 00:14:03,202 S2: when a system decides, you know, hey, I've got to 257 00:14:03,202 --> 00:14:05,692 S2: do A, B, C and D, and does C before B, 258 00:14:06,052 --> 00:14:09,651 S2: all of that information involved with dealing with this, like, 259 00:14:09,852 --> 00:14:12,852 S2: out-of-sequence task, it stays in the context window. 
260 00:14:12,852 --> 00:14:14,732 S2: And it kind of, for lack of a better term, 261 00:14:14,772 --> 00:14:17,532 S2: kind of pollutes the model's ability to kind of reorder 262 00:14:17,532 --> 00:14:19,132 S2: those tasks and do them correctly. It has a hard 263 00:14:19,132 --> 00:14:22,532 S2: time kind of forgetting information until it rolls out of 264 00:14:22,532 --> 00:14:24,692 S2: the context window. So it's a really long way to 265 00:14:24,692 --> 00:14:28,172 S2: say we probably did the latter version. But, um, one 266 00:14:28,172 --> 00:14:29,692 S2: thing I do want to say is like the actual 267 00:14:29,732 --> 00:14:33,052 S2: like processing of artifacts through the system, we didn't rely 268 00:14:33,052 --> 00:14:34,692 S2: on the AI to kind of figure out, okay, I've 269 00:14:34,692 --> 00:14:36,532 S2: got a vulnerability now I should patch it. That was 270 00:14:36,532 --> 00:14:40,772 S2: also all, um, that was also all orchestrated, um, by 271 00:14:40,772 --> 00:14:42,172 S2: our by our larger pipeline. 272 00:14:42,572 --> 00:14:46,132 S1: Okay. Okay. So yeah, I've seen this a lot as well. 273 00:14:46,172 --> 00:14:48,772 S1: I mean, I feel like this is a general concept 274 00:14:48,772 --> 00:14:54,252 S1: that people are coming to, which is, um, I don't 275 00:14:54,252 --> 00:14:59,372 S1: want to say legacy tech. Traditional tech is just like, deterministic. So, like, 276 00:14:59,372 --> 00:15:01,532 S1: that's the tech that you want to use to, like, 277 00:15:02,092 --> 00:15:05,342 S1: do things that matter, and then you kind of want 278 00:15:05,382 --> 00:15:09,702 S1: to use like AI for like a, um, I don't know, 279 00:15:09,742 --> 00:15:13,462 S1: like a router maybe, or like a, um, something intelligent 280 00:15:13,462 --> 00:15:19,222 S1: about choosing which standard tech to use, but not making like, choices. 281 00:15:19,222 --> 00:15:22,742 S1: Maybe necessarily. Um, I don't know. 
I'm trying to figure 282 00:15:22,742 --> 00:15:24,582 S1: out how to articulate that, but it's like. 283 00:15:24,782 --> 00:15:26,262 S2: Yeah, well, it's actually funny you bring this up. I've 284 00:15:26,262 --> 00:15:28,982 S2: had to kind of get good at articulating this, um, 285 00:15:28,982 --> 00:15:31,022 S2: over the last couple of years. So the way I've 286 00:15:31,022 --> 00:15:33,742 S2: explained this to people is that certain problems, particularly in 287 00:15:33,742 --> 00:15:37,462 S2: computer science with this kind of generalizes everywhere. Certain problems 288 00:15:37,502 --> 00:15:43,142 S2: lend themselves to prescriptive solutions. So prescriptive solution is something 289 00:15:43,142 --> 00:15:44,982 S2: that we do when we write an algorithm to solve 290 00:15:44,982 --> 00:15:47,502 S2: a problem. This could be like coming up with an 291 00:15:47,502 --> 00:15:50,302 S2: answer for the traveling salesman problem. You know, we know 292 00:15:50,342 --> 00:15:52,502 S2: it's a really difficult problem to solve, but there's greedy 293 00:15:52,502 --> 00:15:54,982 S2: algorithms that do a pretty good job and for the 294 00:15:54,982 --> 00:15:56,822 S2: most part, will get you a good answer. Maybe not 295 00:15:56,822 --> 00:15:58,542 S2: the best answer, but they'll get you a good one. 296 00:15:59,102 --> 00:16:01,952 S2: So for these types of problems, you can prescribe a 297 00:16:01,952 --> 00:16:04,552 S2: set of steps to the computer and let them execute them. 298 00:16:04,952 --> 00:16:09,032 S2: Now other problems are really, really challenging to prescribe a 299 00:16:09,032 --> 00:16:12,952 S2: solution for. So these types of problems lend themselves to 300 00:16:12,992 --> 00:16:15,592 S2: AI or ML techniques because you can use a descriptive 301 00:16:15,712 --> 00:16:19,352 S2: instead of prescriptive solution. So a good example of this 302 00:16:19,352 --> 00:16:22,432 S2: is like image recognition. 
So it's really really hard to 303 00:16:22,472 --> 00:16:25,112 S2: take a picture of a cat and write a computer 304 00:16:25,112 --> 00:16:28,832 S2: program that will say, okay, based on the pixel colors 305 00:16:28,832 --> 00:16:30,992 S2: of this pixel and this position, this is going to 306 00:16:30,992 --> 00:16:32,832 S2: be a cat, because a cat can be in a 307 00:16:32,832 --> 00:16:36,312 S2: million different contortions. It can have different hair, the face 308 00:16:36,312 --> 00:16:38,752 S2: can be half obscured. But what we can do is 309 00:16:38,792 --> 00:16:41,152 S2: we can describe to an AI ML model what a 310 00:16:41,192 --> 00:16:43,712 S2: cat looks like with millions of pictures, because we have 311 00:16:43,712 --> 00:16:46,032 S2: millions of pictures of cats. And then it can do 312 00:16:46,032 --> 00:16:48,512 S2: a good job of solving that problem. Now it might 313 00:16:48,512 --> 00:16:51,192 S2: make mistakes, but this is better than the option that 314 00:16:51,192 --> 00:16:54,152 S2: you had with the traditional approach, because that approach was 315 00:16:54,152 --> 00:16:57,152 S2: awful to begin with. So a good example of a 316 00:16:57,152 --> 00:17:01,242 S2: corollary for this in Buttercup is patch generation. There's a 317 00:17:01,282 --> 00:17:03,282 S2: lot of synthetic code generation tools and a lot of 318 00:17:03,282 --> 00:17:06,202 S2: research in this area. But in terms of like automatically 319 00:17:06,202 --> 00:17:10,242 S2: generating patches to fix bugs, unless your bug is like 320 00:17:10,282 --> 00:17:13,321 S2: dead obvious, like it's missing a bounds check and it's 321 00:17:13,322 --> 00:17:15,402 S2: really easy to apply some sort of pattern matching to 322 00:17:15,442 --> 00:17:17,202 S2: figure out what the lower bound is, or the upper 323 00:17:17,202 --> 00:17:20,882 S2: bound is that needs to be checked. Um, tools to 324 00:17:20,922 --> 00:17:24,482 S2: generate patches for weird bugs. 
Like they just don't exist. 325 00:17:24,922 --> 00:17:27,402 S2: So this is a great place for AI/ML to help 326 00:17:27,402 --> 00:17:29,401 S2: us out. And it actually turns out, um, you know, 327 00:17:29,402 --> 00:17:31,922 S2: this is really proven true by the AI Cyber Challenge 328 00:17:31,922 --> 00:17:35,921 S2: and by Buttercup, more specifically. Um, LLMs are great at 329 00:17:35,922 --> 00:17:38,402 S2: generating code, um, because it's one of the biggest value 330 00:17:38,402 --> 00:17:43,002 S2: propositions right now for the technology. So, um, generating patches 331 00:17:43,002 --> 00:17:45,602 S2: for bugs is tightly constrained. I'm not asking it 332 00:17:45,602 --> 00:17:48,561 S2: to generate all of the code that is necessary to 333 00:17:48,602 --> 00:17:51,042 S2: build this entire system that I've got a spec sheet for. 334 00:17:51,482 --> 00:17:54,121 S2: I'm only asking it: given this code, and given what 335 00:17:54,122 --> 00:17:56,002 S2: we know about this vulnerability, how would you change it 336 00:17:56,002 --> 00:17:59,122 S2: to fix it? The large language models have already internalized 337 00:17:59,122 --> 00:18:03,382 S2: large numbers of incremental commits to open source code 338 00:18:03,382 --> 00:18:06,262 S2: repositories that fix bugs, so they actually have a really 339 00:18:06,262 --> 00:18:09,742 S2: good track record with, um, more than I expected, even 340 00:18:09,742 --> 00:18:13,022 S2: when we started this, uh, with generating patches. So this 341 00:18:13,022 --> 00:18:15,382 S2: is a great example of where generating a patch is 342 00:18:15,382 --> 00:18:18,462 S2: something that lends itself towards a descriptive solution and a 343 00:18:18,462 --> 00:18:23,621 S2: descriptive algorithm, uh, or an AI/ML algorithm, versus something that's prescriptive, um, 344 00:18:23,902 --> 00:18:25,941 S2: which is fuzzing. 
Fuzzing is a good example of a 345 00:18:25,942 --> 00:18:28,341 S2: prescriptive solution. If you need to find a 346 00:18:28,342 --> 00:18:32,822 S2: vulnerability and you need a crashing input, um, you have 347 00:18:32,821 --> 00:18:35,061 S2: to be able to prove that it exists. It's really, 348 00:18:35,061 --> 00:18:37,262 S2: really hard to get an LLM to do that because 349 00:18:37,302 --> 00:18:42,262 S2: LLMs, the underlying reasoning, they don't have like data feedforward. Um, 350 00:18:42,302 --> 00:18:45,542 S2: they basically look at source code like they look 351 00:18:45,542 --> 00:18:49,621 S2: at natural language. Natural language doesn't describe the activities of 352 00:18:49,622 --> 00:18:53,022 S2: an underlying state machine that runs on hardware after it 353 00:18:53,022 --> 00:18:55,622 S2: passes through a compiler. So like, you know, the source 354 00:18:55,622 --> 00:18:57,982 S2: code when looked at by a model, models look at 355 00:18:57,982 --> 00:19:01,112 S2: source code in a really shallow way. Um, so when 356 00:19:01,112 --> 00:19:04,191 S2: we want to find, you know, a crashing input, a 357 00:19:04,192 --> 00:19:06,192 S2: fuzzer is a great way because we can prescribe a solution, 358 00:19:06,192 --> 00:19:10,072 S2: which is try everything, brute force it. Um, just come 359 00:19:10,071 --> 00:19:11,671 S2: up with different inputs, throw it in there, and then 360 00:19:11,672 --> 00:19:13,752 S2: if it crashes, well, there you go. You've proven it. 361 00:19:13,912 --> 00:19:16,952 S2: So that's why we use fuzzing heavily early on, you know, for 362 00:19:16,952 --> 00:19:19,192 S2: one type of problem, and we use patching heavily for another. 363 00:19:20,071 --> 00:19:24,552 S1: Yeah, that makes sense. 
And the other problem with, um, 364 00:19:25,632 --> 00:19:33,152 S1: finding vulns with, um, AI, it also seems to me that, um, they, 365 00:19:33,152 --> 00:19:35,992 S1: they want to please. They're heavily biased to be like, 366 00:19:35,992 --> 00:19:38,391 S1: this is it. This is one. Yeah. Well, this is 367 00:19:38,392 --> 00:19:40,831 S1: definitely a hit or whatever. And you look at it 368 00:19:40,872 --> 00:19:44,952 S1: and it's actually not. So I guess the intelligence is 369 00:19:44,952 --> 00:19:48,632 S1: deciding to use the fuzzer, which it could help make 370 00:19:48,632 --> 00:19:51,552 S1: that decision that a fuzzer should be used. Right. 371 00:19:52,512 --> 00:19:55,192 S2: Yeah. Yeah. So it's, it's funny you bring that up. 372 00:19:55,192 --> 00:19:59,402 S2: Large language models really struggle to solve problems that aren't 373 00:19:59,442 --> 00:20:02,162 S2: rooted in some kind of ground truth. Um, it turns 374 00:20:02,162 --> 00:20:04,242 S2: out there's a huge difference there. We have some internal 375 00:20:04,242 --> 00:20:08,242 S2: research that we haven't published. Anybody could reproduce it. But, um, 376 00:20:08,282 --> 00:20:09,482 S2: so it turns out if you, if you have a 377 00:20:09,482 --> 00:20:11,242 S2: bit of source code and you ask the model to 378 00:20:11,282 --> 00:20:15,522 S2: tell you where the vulnerability is, um, it will absolutely 379 00:20:15,561 --> 00:20:18,202 S2: hallucinate a vulnerability because it wants to please you. Uh, 380 00:20:18,202 --> 00:20:20,841 S2: we have one of our researchers, um, one of our 381 00:20:20,842 --> 00:20:23,522 S2: principal researchers, Artem. He's a great guy. He, um, he 382 00:20:23,522 --> 00:20:28,522 S2: downloaded the, um, formally, uh, the formally proven correct 383 00:20:28,522 --> 00:20:32,042 S2: portions of, uh, of Linux and asked a large language 384 00:20:32,042 --> 00:20:35,882 S2: model several hundred times.
Um, here's a snippet of code. 385 00:20:35,882 --> 00:20:37,722 S2: It has a vulnerability. Where is it? And every single 386 00:20:37,722 --> 00:20:40,802 S2: time it would manufacture a vulnerability because it 387 00:20:40,802 --> 00:20:43,722 S2: wants to find the answer. So it turns out when 388 00:20:43,722 --> 00:20:46,601 S2: we started asking it, is there a vulnerability? Um, it 389 00:20:46,602 --> 00:20:49,162 S2: messed up a little less, but it would still assume 390 00:20:49,162 --> 00:20:51,921 S2: that because you're asking that there's something to find and 391 00:20:51,922 --> 00:20:54,322 S2: it would still mess up quite a bit. So that's 392 00:20:54,321 --> 00:20:57,012 S2: why, in the context where we're using, um, 393 00:20:57,052 --> 00:21:00,252 S2: large language models for generating patches, it's great because we 394 00:21:00,252 --> 00:21:02,411 S2: know there's a vulnerability because we found it and we 395 00:21:02,412 --> 00:21:04,571 S2: proved it, and we can collect additional information. 396 00:21:04,612 --> 00:21:05,172 S1: Yeah. 397 00:21:05,412 --> 00:21:08,532 S2: So now I don't have to worry about asking the model, hey, 398 00:21:08,532 --> 00:21:10,611 S2: do you think there's a vulnerability? And if so, patch it. 399 00:21:10,612 --> 00:21:12,972 S2: I say no, there is a vulnerability. It's here. This 400 00:21:12,972 --> 00:21:15,772 S2: is extra information about the code that touches it. Now 401 00:21:15,772 --> 00:21:18,571 S2: generate a patch. And the model is very good at 402 00:21:18,571 --> 00:21:21,012 S2: doing that because it takes away the decision making or, 403 00:21:21,332 --> 00:21:24,092 S2: or the judgment call that large language models are really, 404 00:21:24,092 --> 00:21:27,252 S2: really bad at because they don't actually model judgment calls underneath.
405 00:21:27,612 --> 00:21:31,332 S2: In their architecture, they, they model, you know, sequencing information, 406 00:21:31,571 --> 00:21:34,052 S2: sequencing tokens. And when you write code, you're writing a 407 00:21:34,052 --> 00:21:36,851 S2: sequence of tokens. So these problems tend to be, um, 408 00:21:36,892 --> 00:21:39,972 S2: a lot more suitable than other problems where you're asking 409 00:21:39,972 --> 00:21:42,292 S2: it to find the ground truth for you. Bad problems 410 00:21:42,292 --> 00:21:45,332 S2: for LLMs. Asking it to take ground truth and expand 411 00:21:45,332 --> 00:21:47,611 S2: upon it: great applications for LLMs. 412 00:21:48,012 --> 00:21:49,772 S1: Oh man, I love that. And this also goes to 413 00:21:49,772 --> 00:21:52,652 S1: your previous point of not wanting to pollute the context 414 00:21:52,652 --> 00:21:56,622 S1: for the current task at hand, which is building that patch, 415 00:21:57,222 --> 00:22:00,582 S1: because if you have like some history of like there 416 00:22:00,582 --> 00:22:04,182 S1: were previous decisions made or previous questions asked or whatever, 417 00:22:04,222 --> 00:22:06,061 S1: it might get like diverted, you know? 418 00:22:06,942 --> 00:22:11,102 S2: Yeah, absolutely. It's um, it's a, it's a big challenge particularly, um, 419 00:22:11,582 --> 00:22:13,222 S2: I don't know, it's funny. I've, I've been kind of 420 00:22:13,262 --> 00:22:15,742 S2: trying to sing this gospel internally, uh, at Trail of 421 00:22:15,742 --> 00:22:18,222 S2: Bits and to other people who will listen that, um, 422 00:22:18,622 --> 00:22:22,502 S2: the increasing size of context window is not always your friend. Um, 423 00:22:23,061 --> 00:22:25,502 S2: by increasing the size of the context window.
I mean, 424 00:22:25,502 --> 00:22:27,102 S2: if you think about how the large language model works 425 00:22:27,102 --> 00:22:29,702 S2: under the hood, it's using these contexts to attune the 426 00:22:29,702 --> 00:22:32,302 S2: model to certain parts of its training data that are 427 00:22:32,302 --> 00:22:35,262 S2: going to be highly relevant to solving your particular problem. 428 00:22:35,622 --> 00:22:37,862 S2: And the more words and the more tokens you put 429 00:22:37,862 --> 00:22:41,262 S2: into the context window, the more you are kind of 430 00:22:41,302 --> 00:22:46,821 S2: nulling out or, um, numbing the attention mechanism. You're forcing 431 00:22:46,821 --> 00:22:48,742 S2: it to become more and more general, because now there 432 00:22:48,782 --> 00:22:52,822 S2: are more tokens that are affecting these attuned probabilities. So 433 00:22:52,821 --> 00:22:56,192 S2: you actually are better off with using less. Now, a big context window 434 00:22:56,232 --> 00:22:58,911 S2: is great because if you need, let's say a million, 435 00:22:59,152 --> 00:23:01,671 S2: you know, a million tokens in your context window to 436 00:23:01,712 --> 00:23:04,472 S2: constrain the problem, then use a million tokens. But if 437 00:23:04,472 --> 00:23:07,312 S2: you can do it with 1,000 or 10,000, you're going 438 00:23:07,311 --> 00:23:09,831 S2: to get better results because you're more likely to focus 439 00:23:09,832 --> 00:23:11,311 S2: that model where it needs to be. 440 00:23:12,512 --> 00:23:16,311 S1: Yeah, I love this. Like, by the way, this, this, 441 00:23:16,352 --> 00:23:20,392 S1: this is great. This is great. Um, I'm going to 442 00:23:20,792 --> 00:23:24,912 S1: create a lot of content out of this, um, because it's, 443 00:23:24,912 --> 00:23:30,631 S1: it's really crystallizing in like one starting to form something 444 00:23:30,632 --> 00:23:34,232 S1: in my mind. I'd love to work with you on it.
Um, essentially, 445 00:23:34,232 --> 00:23:37,391 S1: what I'm trying to think of is, um, what are 446 00:23:37,392 --> 00:23:40,592 S1: some general statements that we could make? Um, one that 447 00:23:40,592 --> 00:23:42,512 S1: I'm sort of heading in the direction of, you tell 448 00:23:42,512 --> 00:23:46,872 S1: me if I'm wrong, is like, and this might be 449 00:23:46,872 --> 00:23:51,032 S1: overstating it, but like, the system itself should be highly 450 00:23:51,032 --> 00:23:56,522 S1: modular and, as much as possible, made up 451 00:23:56,522 --> 00:24:01,602 S1: of traditional and deterministic tech. And then the way that 452 00:24:01,602 --> 00:24:05,082 S1: you use the AI is for the specific type of problem, 453 00:24:05,282 --> 00:24:08,121 S1: which we're going to articulate the way you articulated it, 454 00:24:09,482 --> 00:24:13,762 S1: for those types of problems where routing is needed to 455 00:24:13,762 --> 00:24:18,841 S1: the traditional tech. Um, and it's like, don't just go 456 00:24:18,882 --> 00:24:22,682 S1: crazy with AI. Don't ask it questions that the traditional 457 00:24:22,682 --> 00:24:27,122 S1: tech should be answering. Um, it's something like that. And 458 00:24:27,122 --> 00:24:33,362 S1: then ultimately you have like this dependable deterministic system with 459 00:24:33,762 --> 00:24:37,081 S1: the minimum amount of AI that is required to move 460 00:24:37,402 --> 00:24:39,162 S1: appropriately through that system. 461 00:24:40,522 --> 00:24:43,561 S2: Yeah. So yeah, really it comes down to problem formulation.
462 00:24:43,561 --> 00:24:46,722 S2: And this is like the the great part about and 463 00:24:46,722 --> 00:24:48,002 S2: this is part of the reason why you see such 464 00:24:48,002 --> 00:24:50,841 S2: a huge overlap in interest between people from the computer 465 00:24:50,842 --> 00:24:53,622 S2: science background and people from like data science backgrounds on 466 00:24:53,622 --> 00:24:55,782 S2: here because, you know, one of the basic things you 467 00:24:55,782 --> 00:24:57,582 S2: learn in computer science, like when you get to like 468 00:24:57,582 --> 00:25:01,821 S2: the graduate level is problem formulation. It's how to recognize 469 00:25:02,022 --> 00:25:07,302 S2: your problem as a derivative, or maybe a like dressed 470 00:25:07,302 --> 00:25:12,102 S2: up version of some other problem. So, you know, right away, um, okay, 471 00:25:12,102 --> 00:25:13,742 S2: I have this problem of, okay, I've got to manage 472 00:25:13,742 --> 00:25:16,742 S2: this delivery system. How do I make this delivery system, um, 473 00:25:16,742 --> 00:25:20,022 S2: for Amazon efficient? You can recognize this right away as, oh, 474 00:25:20,022 --> 00:25:22,782 S2: this is traveling salesman. There's no good way to do this. 475 00:25:22,821 --> 00:25:24,222 S2: But what I can do is I can. I'm going 476 00:25:24,262 --> 00:25:26,582 S2: to get a good answer. I just have to accept 477 00:25:26,942 --> 00:25:29,142 S2: that my answer is going to be imprecise or not 478 00:25:29,142 --> 00:25:33,742 S2: necessarily optimal. Um, and in applying AI and ML to 479 00:25:33,742 --> 00:25:37,342 S2: security problems or any problem in general, the first step 480 00:25:37,342 --> 00:25:40,782 S2: is very much like problem formulation. 
It's understanding what kind 481 00:25:40,782 --> 00:25:42,621 S2: of model is going to work best for this problem. 482 00:25:42,662 --> 00:25:45,742 S2: Is this a problem that will work well with 483 00:25:45,742 --> 00:25:47,661 S2: a time series model, because my data is coming in 484 00:25:47,662 --> 00:25:49,861 S2: over time, or is this a problem that's going to 485 00:25:49,862 --> 00:25:54,992 S2: work well with, um, let's say like a, like linear regression, 486 00:25:54,992 --> 00:25:59,472 S2: because there is some true underlying probability for how the 487 00:25:59,472 --> 00:26:02,152 S2: data is distributed that I'm trying to learn from? One 488 00:26:02,152 --> 00:26:05,432 S2: of like the kind of curses of large language models 489 00:26:05,752 --> 00:26:09,032 S2: is that they have abstracted all of this good data 490 00:26:09,032 --> 00:26:12,592 S2: science practice, all these good data science practices away. And 491 00:26:12,592 --> 00:26:15,992 S2: now it's great because it democratizes it. Anybody can use AI, 492 00:26:16,032 --> 00:26:18,071 S2: anybody can use an LLM. And all you have to 493 00:26:18,071 --> 00:26:20,272 S2: do is be able to articulate your problem. The problem 494 00:26:20,272 --> 00:26:23,232 S2: is that it also abstracts away problem formulation. And now 495 00:26:23,232 --> 00:26:26,311 S2: we're starting to use LLMs, because they're accessible, for certain 496 00:26:26,311 --> 00:26:29,831 S2: types of problems that they're really not well formulated for. Um. 497 00:26:30,672 --> 00:26:31,272 S1: Yeah. 498 00:26:31,432 --> 00:26:33,391 S2: So this is, this is kind of where we get 499 00:26:33,392 --> 00:26:35,712 S2: to the issue. So the good news is we don't 500 00:26:35,712 --> 00:26:38,152 S2: have to just like say, okay, well, I can't do 501 00:26:38,192 --> 00:26:40,232 S2: problem formulation with an LLM, so I just throw it away. 502 00:26:40,232 --> 00:26:42,111 S2: Don't use it.
I have to go back to, you know, 503 00:26:42,152 --> 00:26:44,431 S2: TensorFlow and writing my own models and stuff. What we 504 00:26:44,432 --> 00:26:46,592 S2: really have to do is get to what you were describing, 505 00:26:46,912 --> 00:26:50,282 S2: which is rather than throw the LLM at a large problem, 506 00:26:50,282 --> 00:26:52,722 S2: we take it a step further. We break the problem down. 507 00:26:52,722 --> 00:26:56,002 S2: Are there subproblems that are highly amenable to AI solutions? 508 00:26:56,242 --> 00:26:58,762 S2: I have a litmus test that I, that I pass, um, 509 00:26:58,802 --> 00:27:01,042 S2: you know, problems through. And I try to encourage my 510 00:27:01,042 --> 00:27:04,802 S2: team members to use, um, which is, you know, basically 511 00:27:04,802 --> 00:27:06,482 S2: like a check to see whether a problem is good 512 00:27:06,482 --> 00:27:09,042 S2: for AIML. And it's usually, you know, do you have 513 00:27:09,042 --> 00:27:11,802 S2: enough data that you can train the model? In 514 00:27:11,802 --> 00:27:13,722 S2: this case, it now becomes, does the 515 00:27:13,722 --> 00:27:15,722 S2: LLM have examples of this on the internet that it 516 00:27:15,722 --> 00:27:17,841 S2: can draw from, or are you asking it to do 517 00:27:17,842 --> 00:27:23,282 S2: something like reverse engineering, you know, firmware code on this 518 00:27:23,282 --> 00:27:26,162 S2: obscure chipset that like there's no examples on the internet? 519 00:27:26,162 --> 00:27:29,282 S2: Bad example. It won't have 520 00:27:29,282 --> 00:27:32,562 S2: anything to draw from. Number two, um, is there some 521 00:27:32,561 --> 00:27:36,361 S2: probabilistic nature to the data that's underlying?
This actually 522 00:27:36,362 --> 00:27:38,401 S2: makes large language models really bad for a lot of 523 00:27:38,402 --> 00:27:42,722 S2: security problems, because they're what we call non-differentiable, meaning that 524 00:27:42,722 --> 00:27:45,082 S2: they don't have like this nice curved space that you 525 00:27:45,082 --> 00:27:49,852 S2: can use stochastic gradient descent or virtually any optimization function 526 00:27:49,852 --> 00:27:52,052 S2: to try and climb and find a good answer. 527 00:27:52,172 --> 00:27:54,052 S2: It actually exists more like this kind of cloud 528 00:27:54,052 --> 00:27:56,012 S2: with dots of answers all over the place, if you 529 00:27:56,012 --> 00:27:58,811 S2: were to try and imagine the answers to security questions 530 00:27:59,132 --> 00:28:01,252 S2: in like a mathematical graph. 531 00:28:01,732 --> 00:28:04,772 S1: Okay, what's an example of, what's an example of one 532 00:28:04,772 --> 00:28:06,612 S1: of those? I'm, I'm trying to think of what that 533 00:28:06,612 --> 00:28:07,692 S1: space might look like. 534 00:28:08,172 --> 00:28:10,212 S2: Yeah. So a good example of like a problem that 535 00:28:10,212 --> 00:28:14,532 S2: is differentiable is like housing prices. So housing prices vary by, 536 00:28:14,571 --> 00:28:17,851 S2: you know, like the size by square footage. Yeah. Square footage, 537 00:28:17,852 --> 00:28:20,931 S2: number of rooms, zip code, quality of the schools. So 538 00:28:20,932 --> 00:28:22,691 S2: when you plot these all out you get something that 539 00:28:22,692 --> 00:28:24,732 S2: you can do linear regression on. You can see like. 540 00:28:24,732 --> 00:28:24,932 S1: A. 541 00:28:25,132 --> 00:28:28,052 S2: Little loop.
And that's called a differentiable function because it's 542 00:28:28,052 --> 00:28:31,052 S2: a continuous line that you can draw through the data 543 00:28:31,052 --> 00:28:33,212 S2: that more or less minimizes the error of those points 544 00:28:33,212 --> 00:28:34,012 S2: along the line. 545 00:28:34,252 --> 00:28:34,611 S1: Yep. 546 00:28:35,132 --> 00:28:37,732 S2: But if we want to think about, um, let's say 547 00:28:37,772 --> 00:28:40,332 S2: now optimizing a program, we can take a look at 548 00:28:40,332 --> 00:28:45,532 S2: how ordering certain steps or changing the way we implement 549 00:28:45,532 --> 00:28:48,342 S2: certain functions is changing the speed of a program up 550 00:28:48,342 --> 00:28:52,782 S2: and down, and that becomes kind of pseudo differentiable. It's, 551 00:28:52,782 --> 00:28:54,382 S2: it's more like a step function where you have kind 552 00:28:54,382 --> 00:28:56,502 S2: of like little lines where if I change this one thing, 553 00:28:56,502 --> 00:28:59,262 S2: it jumps up a little bit, it's more jagged, but 554 00:28:59,302 --> 00:29:03,022 S2: there's still, um, it's close to differentiable because I can 555 00:29:03,062 --> 00:29:06,662 S2: kind of map deterministically how if I run it on, 556 00:29:06,982 --> 00:29:09,622 S2: you know, with this set of compiler optimizations or that. 557 00:29:09,662 --> 00:29:13,102 S2: It's definitely not differentiable, but it's closer. Security is just 558 00:29:13,102 --> 00:29:17,622 S2: wild because the flaws in computer programs can come from 559 00:29:17,622 --> 00:29:19,382 S2: one of a million different sources. It can be a 560 00:29:19,382 --> 00:29:22,142 S2: logic bug, it can be a misimplemented function. It 561 00:29:22,142 --> 00:29:23,982 S2: can be the use of an unsafe function, which is 562 00:29:23,982 --> 00:29:27,702 S2: easy to find.
There's no way for us to take, um, 563 00:29:28,502 --> 00:29:32,262 S2: root causes for vulnerabilities in software and solutions to them 564 00:29:32,422 --> 00:29:35,062 S2: and plot them on a graph. Because they come from, 565 00:29:35,702 --> 00:29:39,502 S2: they come from unquantifiable sources. Some of them like, you know, 566 00:29:39,542 --> 00:29:42,982 S2: Spectre and Meltdown and stuff. They, they're resident in hardware 567 00:29:43,222 --> 00:29:45,942 S2: and the implementation there. Some are purely in software like 568 00:29:45,992 --> 00:29:50,312 S2: XSS-type vulnerabilities. We can't, they don't, they're, it's, um, 569 00:29:50,352 --> 00:29:52,272 S2: it's not even apples and oranges. It's like trying to 570 00:29:52,272 --> 00:29:55,512 S2: compare apples and fighter jets. Um. 571 00:29:56,752 --> 00:29:59,112 S1: Is it, is it a matter of, like the, the 572 00:29:59,272 --> 00:30:03,792 S1: tensor size or the, um, I think that's called tensor size. 573 00:30:03,792 --> 00:30:07,752 S1: I can't remember the, the, um, the number of dimensions 574 00:30:07,752 --> 00:30:10,112 S1: in the space, because when you're looking at square footage 575 00:30:10,112 --> 00:30:13,592 S1: and price, what, you have two, right? Is it the 576 00:30:13,592 --> 00:30:18,472 S1: problem in security that there's just so many dimensions that, um, 577 00:30:18,472 --> 00:30:20,912 S1: when you try to plot it, you try to simplify it, 578 00:30:20,952 --> 00:30:22,192 S1: it just becomes garbage? 579 00:30:22,912 --> 00:30:25,072 S2: Well, it's a matter of common dimensions. So if you 580 00:30:25,072 --> 00:30:28,112 S2: build a house, every house has square footage. 581 00:30:28,552 --> 00:30:29,192 S1: There you go. 582 00:30:29,712 --> 00:30:32,272 S2: And you can calculate the space underneath.
But a cross 583 00:30:32,312 --> 00:30:36,792 S2: site request forgery vulnerability in a, um, you know, piece 584 00:30:36,792 --> 00:30:39,552 S2: of JavaScript code that exists on the web has almost 585 00:30:39,552 --> 00:30:42,952 S2: nothing in common with a memory corruption vulnerability in a 586 00:30:42,992 --> 00:30:47,762 S2: C program running on a router in your home device. 587 00:30:48,082 --> 00:30:51,802 S2: They are implemented at different levels of abstraction. You know, 588 00:30:51,842 --> 00:30:54,482 S2: like even the program representations are different because some of 589 00:30:54,482 --> 00:30:57,122 S2: the vulnerabilities might exist only in binary code after it's 590 00:30:57,122 --> 00:31:02,362 S2: been compiled versus other vulnerabilities that are resident in source 591 00:31:02,362 --> 00:31:06,162 S2: code that's interpreted via web browser. Um, so really what 592 00:31:06,162 --> 00:31:07,962 S2: it is, is it's like trying to it's like trying 593 00:31:07,962 --> 00:31:10,682 S2: to plot, you know, the prices of homes, along with 594 00:31:11,002 --> 00:31:14,962 S2: the prices of, um, I don't know, oranges in a 595 00:31:14,962 --> 00:31:18,722 S2: particular year. You know, there's very little in common between 596 00:31:18,762 --> 00:31:21,802 S2: a house and an orange other than maybe some, like, 597 00:31:21,842 --> 00:31:25,402 S2: you know, global macro effects that might show some correlation, 598 00:31:25,802 --> 00:31:28,122 S2: you know. You know, economic factors like inflation. 599 00:31:28,522 --> 00:31:31,202 S1: Or like the beating of a whale's heart to determine 600 00:31:31,202 --> 00:31:35,962 S1: whether or not it's healthy. It's it's like completely different. Uh, yeah. 601 00:31:36,002 --> 00:31:39,161 S1: Completely different sports. Yeah. Yeah, yeah. 602 00:31:39,522 --> 00:31:40,962 S2: Yeah. 
So, so really, it's a, it's a lack of 603 00:31:40,962 --> 00:31:43,682 S2: common dimensions in cybersecurity, which is why, you know, if 604 00:31:43,682 --> 00:31:45,732 S2: we think about like if we were trying to model, 605 00:31:45,772 --> 00:31:47,532 S2: like what the data would look like, if we could 606 00:31:47,532 --> 00:31:50,012 S2: visualize it, it would just be a bunch of points 607 00:31:50,012 --> 00:31:54,332 S2: of presence out there. Um, uh, within this, like, kind 608 00:31:54,332 --> 00:31:57,332 S2: of large cloud. Um, and even then, that's another problem 609 00:31:57,332 --> 00:31:59,692 S2: that kind of makes cybersecurity really hard to model with 610 00:31:59,692 --> 00:32:05,092 S2: AIML, is that there is really comparatively little data, um, 611 00:32:05,692 --> 00:32:07,412 S2: in terms of like the volume of data. There's tons 612 00:32:07,412 --> 00:32:09,452 S2: of vulnerabilities out there. But if you're trying to make 613 00:32:09,452 --> 00:32:13,732 S2: a model that's really, really good at, let's say, detecting, um, 614 00:32:14,052 --> 00:32:17,692 S2: buffer overflows in embedded device code, um, you're going to 615 00:32:17,692 --> 00:32:19,452 S2: find some data for that, but there's not that much. 616 00:32:19,452 --> 00:32:21,492 S2: You have to rely on like POC write ups on, 617 00:32:21,492 --> 00:32:23,572 S2: on the internet from practitioners who put it out there 618 00:32:23,572 --> 00:32:27,412 S2: for fun. Um, but there's not a million examples 619 00:32:27,412 --> 00:32:28,772 S2: of that like there is if you want to say, 620 00:32:28,772 --> 00:32:30,732 S2: I want to train a model to write the Great 621 00:32:30,732 --> 00:32:33,972 S2: American novel. There you can take, you can take every 622 00:32:33,972 --> 00:32:36,132 S2: novel ever written, throw it in there and then see 623 00:32:36,132 --> 00:32:38,292 S2: what the model comes up with.
If you prompt it 624 00:32:38,292 --> 00:32:40,052 S2: with like a general plot line, it's going to do 625 00:32:40,052 --> 00:32:42,412 S2: a lot better at that because, you know, that data 626 00:32:42,412 --> 00:32:48,312 S2: fills in that space a lot more. Um, so, so, yeah, it's, um. Yeah. 627 00:32:48,352 --> 00:32:50,992 S2: Like the, the, the challenges in problem formulation are, are 628 00:32:50,992 --> 00:32:53,112 S2: really big and, um, yeah, that's why I kind of 629 00:32:53,152 --> 00:32:55,752 S2: encourage people when they look at these like, okay, I 630 00:32:55,752 --> 00:32:58,552 S2: want to build an AI, ML driven system. Um, take 631 00:32:58,552 --> 00:33:01,312 S2: a look at what subproblems are actually suitable for AIML. 632 00:33:01,592 --> 00:33:03,352 S2: Use them there. And I think you'll also find that 633 00:33:03,352 --> 00:33:05,472 S2: a lot of the times we have a tendency to 634 00:33:05,512 --> 00:33:08,152 S2: like say, okay, let's just kind of throw large language 635 00:33:08,152 --> 00:33:09,592 S2: models at some of these problems that we know we 636 00:33:09,592 --> 00:33:13,312 S2: could really solve with regular code. Um, and that's really 637 00:33:13,312 --> 00:33:16,191 S2: bad because of this compounding error problem. So, you know, 638 00:33:16,232 --> 00:33:18,432 S2: if I have, you know, five steps in sequence that I've 639 00:33:18,432 --> 00:33:20,232 S2: got to do, and step three is good for AIML 640 00:33:20,232 --> 00:33:23,272 S2: and step four is good for AIML. You know, like 641 00:33:23,272 --> 00:33:25,352 S2: it's like, okay, well, look, almost half of this problem is, 642 00:33:25,512 --> 00:33:26,792 S2: you know, is something I'm going to ask the model 643 00:33:26,792 --> 00:33:28,352 S2: to do anyway. I'll just ask it to do one, 644 00:33:28,352 --> 00:33:30,712 S2: two and five too. Well, the problem is it can 645 00:33:30,712 --> 00:33:32,352 S2: make a mistake in one.
It can make a mistake 646 00:33:32,352 --> 00:33:34,632 S2: in two. Those compound before you get to three and four. 647 00:33:34,832 --> 00:33:37,912 S2: So you're better off, you know, implementing one and two in code. 648 00:33:37,952 --> 00:33:39,912 S2: And then maybe you ask the model just to finish 649 00:33:39,912 --> 00:33:43,082 S2: it off and do step five because it's the final step. 650 00:33:43,082 --> 00:33:46,442 S2: It's had ground truth rooted in steps one and two. Steps 651 00:33:46,442 --> 00:33:49,482 S2: three and four, if they're well contextualized problems, maybe the 652 00:33:49,482 --> 00:33:52,082 S2: false positive rate is low enough that you can afford 653 00:33:52,082 --> 00:33:53,642 S2: to just let the model kind of finish it up 654 00:33:53,642 --> 00:33:56,442 S2: for you. But that's the biggest, that's the biggest jump 655 00:33:56,442 --> 00:33:59,722 S2: I would take. Usually that step five is like validation 656 00:33:59,722 --> 00:34:03,802 S2: or correctness, um, checking. And that's not something you want 657 00:34:03,802 --> 00:34:06,522 S2: to ask the model to do because it's, it's, it's, 658 00:34:07,242 --> 00:34:11,082 S2: it has the tendency to, um, one, be wanting to 659 00:34:11,122 --> 00:34:13,162 S2: kind of like please itself and say, oh yeah, it 660 00:34:13,162 --> 00:34:17,282 S2: looks great to me. Um, or two, um, depending on 661 00:34:17,282 --> 00:34:20,242 S2: how you phrase it, find something that doesn't exist. And 662 00:34:20,362 --> 00:34:23,482 S2: validation is a problem that typically is, uh, is pretty 663 00:34:23,522 --> 00:34:25,442 S2: amenable to like deterministic code. 664 00:34:27,042 --> 00:34:33,602 S1: So I really love this. Um. Where this is taking 665 00:34:33,602 --> 00:34:38,322 S1: me is designing, like, a, uh, a general problem solver. 666 00:34:38,922 --> 00:34:43,852 S1: And I'm imagining, like, the smartest model that you have.
667 00:34:43,892 --> 00:34:47,772 S1: You know, Opus, whatever. Or, like, the best Gemini or 668 00:34:47,772 --> 00:34:50,612 S1: whatever, or whatever the best model is. But, but then 669 00:34:50,612 --> 00:34:54,092 S1: what you do is you say, okay, uh, the problem 670 00:34:54,092 --> 00:34:59,972 S1: is we need to design a system that, uh, you know, properly, 671 00:34:59,972 --> 00:35:03,652 S1: deterministically solves this problem with a high level of accuracy. 672 00:35:04,252 --> 00:35:06,972 S1: For example, the vulnerability problem that you guys worked on. 673 00:35:07,372 --> 00:35:11,332 S1: And then what I love is the idea of you 674 00:35:11,372 --> 00:35:15,932 S1: present to the model all these different AI models and 675 00:35:15,932 --> 00:35:20,852 S1: all these different deterministic technologies, all as solutions. And then 676 00:35:20,852 --> 00:35:25,452 S1: you do what you said, which is you, um, break 677 00:35:25,452 --> 00:35:28,652 S1: down the problems that need to be solved at every 678 00:35:28,652 --> 00:35:33,852 S1: level of the subpieces. Right. And then you match each 679 00:35:33,852 --> 00:35:38,732 S1: of those little problems to either one or, uh, one 680 00:35:38,732 --> 00:35:42,022 S1: or many of these AIs, which are bigger or smaller, 681 00:35:42,062 --> 00:35:45,262 S1: have different weaknesses or whatever, or even ML, not even 682 00:35:45,302 --> 00:35:51,142 S1: LLM based. Yeah. Versus deterministic, with the rule of like, look, 683 00:35:51,182 --> 00:35:56,702 S1: use the appropriate one for this problem type. And then 684 00:35:56,702 --> 00:35:59,582 S1: maybe you have a whole bunch of training about problem 685 00:35:59,582 --> 00:36:03,742 S1: types and solution types. And then it picks which one 686 00:36:03,742 --> 00:36:07,382 S1: to use for each step. I mean is that. 687 00:36:08,102 --> 00:36:09,542 S2: You mentioned this.
I think this is what some of 688 00:36:09,542 --> 00:36:11,862 S2: like the large, you know, third party ML as a 689 00:36:11,862 --> 00:36:14,342 S2: service providers like OpenAI and Anthropic are kind of trying 690 00:36:14,342 --> 00:36:16,422 S2: to do. If you've heard of like this concept of 691 00:36:16,422 --> 00:36:19,942 S2: like mixture of experts models, um, it's uh. 692 00:36:19,942 --> 00:36:20,462 S1: That's true. 693 00:36:20,662 --> 00:36:22,622 S2: Yeah. It's this concept where, you know, like, you know, 694 00:36:22,662 --> 00:36:25,062 S2: like the, the actual interface we have to maybe 695 00:36:25,102 --> 00:36:27,462 S2: GPT-5, and, and I haven't looked at the source code. 696 00:36:27,462 --> 00:36:28,942 S2: I don't work at OpenAI, so I have no idea 697 00:36:28,942 --> 00:36:30,622 S2: if this works underneath the hood, but it's been kind 698 00:36:30,622 --> 00:36:33,542 S2: of theorized and it's even been mentioned, you know, a 699 00:36:33,542 --> 00:36:36,182 S2: bit in terms of, um, you know, people who've kind 700 00:36:36,182 --> 00:36:38,422 S2: of looked at the models a little bit closer that, 701 00:36:38,462 --> 00:36:40,352 S2: you know, um, you know, when we, when we, we 702 00:36:40,392 --> 00:36:41,992 S2: fine tune a model to make it really good or 703 00:36:41,992 --> 00:36:44,872 S2: really suitable for a particular purpose that's amenable to AIML, 704 00:36:45,232 --> 00:36:49,232 S2: it can still be challenging to, um, have it interface 705 00:36:49,232 --> 00:36:50,912 S2: with the user in the way that like a high 706 00:36:50,912 --> 00:36:54,472 S2: quality chatbot would. So, yeah, a mixture of experts 707 00:36:54,472 --> 00:36:56,912 S2: model suggests having like an interface, like a 708 00:36:57,432 --> 00:37:01,192 S2: bot that interacts with the user but then recognizes certain 709 00:37:01,192 --> 00:37:04,392 S2: classes of problems and directs them to the right expert.
So, oh, 710 00:37:04,432 --> 00:37:07,192 S2: they're asking me about cyber. I'll ask, you know, um, 711 00:37:08,072 --> 00:37:11,112 S2: cyber GPT to handle this one. Oh, they're asking about, 712 00:37:11,392 --> 00:37:14,392 S2: you know, mental health, I'll ask, you know, mental health 713 00:37:14,392 --> 00:37:19,552 S2: GPT to, to help out here. Um, so, you know, 714 00:37:19,592 --> 00:37:22,672 S2: this kind of like concept I think is, I think 715 00:37:22,672 --> 00:37:24,992 S2: it's trying to be created, or at least it's been 716 00:37:24,992 --> 00:37:27,392 S2: thought of, um, in terms of using like all AI, 717 00:37:27,392 --> 00:37:30,192 S2: ML solutions. But, but yeah, I agree, like the way 718 00:37:30,232 --> 00:37:32,992 S2: forward is to have, um, you know, for, for like 719 00:37:32,992 --> 00:37:38,482 S2: rapid like prototype development, have like components that do certain things well. Um, 720 00:37:38,522 --> 00:37:40,842 S2: and honestly, it's like reflected in software, like we have 721 00:37:40,842 --> 00:37:44,522 S2: libraries for, we have libraries for sorting. No one, or 722 00:37:44,562 --> 00:37:47,562 S2: we have libraries for cryptography. Nobody should be writing their 723 00:37:47,562 --> 00:37:50,962 S2: own cryptography code. Use a library. Um, you know, the 724 00:37:50,962 --> 00:37:54,882 S2: closer these high quality libraries and, um, fine tuned ML 725 00:37:54,882 --> 00:37:58,282 S2: applications or ML models for certain types of subproblems, the 726 00:37:58,282 --> 00:37:59,882 S2: closer we get to being able to kind of compose 727 00:37:59,882 --> 00:38:01,962 S2: all these together. And the good thing is that 728 00:38:01,962 --> 00:38:03,882 S2: an LLM is probably pretty good at writing the glue code 729 00:38:03,922 --> 00:38:05,362 S2: to sequence all this stuff together. 730 00:38:06,362 --> 00:38:09,082 S1: Yeah, yeah. Because, because that's the trick for me.
Because 731 00:38:09,122 --> 00:38:12,122 S1: inside of a mixture of experts, you're already inside the LLM. 732 00:38:12,442 --> 00:38:15,042 S1: What I'm thinking of with this higher level model is like, look, 733 00:38:15,082 --> 00:38:18,282 S1: we're doing it. We're doing, um, matrix math over here. 734 00:38:18,602 --> 00:38:22,962 S1: We're doing multiplication over here. Um, guess what? This problem 735 00:38:22,962 --> 00:38:26,442 S1: space is not associated with an AI. We don't even... 736 00:38:26,482 --> 00:38:29,042 S1: no AI will ever touch this. We hand it to 737 00:38:29,042 --> 00:38:34,922 S1: our fastest and best, you know, deterministic addition function or whatever, 738 00:38:34,962 --> 00:38:38,572 S1: you know, and it's like maybe 95% of the whole 739 00:38:38,572 --> 00:38:41,372 S1: app ends up being traditional tech that doesn't involve AI, 740 00:38:41,412 --> 00:38:43,092 S1: other than the routing to get there. 741 00:38:43,932 --> 00:38:45,372 S2: Yeah, I mean, that would be ideal. I mean, anything 742 00:38:45,372 --> 00:38:49,852 S2: you can route, anything. Anything you can. Yeah, I don't know. 743 00:38:49,852 --> 00:38:51,972 S2: It's funny. It's like really what it comes down to 744 00:38:52,052 --> 00:38:57,412 S2: is like using large language models and like, solving large problems. 745 00:38:57,412 --> 00:39:00,692 S2: It becomes a conditional probability problem. And even if you 746 00:39:00,692 --> 00:39:03,572 S2: have the answer, get the answer right 99% 747 00:39:03,572 --> 00:39:07,572 S2: of the time. Um, over and over and over again, 748 00:39:08,252 --> 00:39:11,052 S2: you still have a high likelihood of failure by the 749 00:39:11,052 --> 00:39:13,892 S2: time you compute all the conditional probability out. It's kind 750 00:39:13,932 --> 00:39:15,812 S2: of funny. Like, I kind of learned this lesson in like, 751 00:39:15,852 --> 00:39:19,052 S2: in a completely different walk of life.
Um, after I 752 00:39:19,052 --> 00:39:22,332 S2: got my bachelor's degree in CS, I, I worked for 753 00:39:22,332 --> 00:39:27,132 S2: like a year doing, um, software engineering and kind of 754 00:39:27,172 --> 00:39:30,852 S2: found it to be dull, so I, I, I did 755 00:39:30,852 --> 00:39:33,252 S2: something completely different. I joined the Army and I started 756 00:39:33,252 --> 00:39:37,432 S2: flying helicopters. Um, it's actually nice. That is, that's actually, 757 00:39:37,472 --> 00:39:39,712 S2: you know, I'm up at Camp Dwyer in 758 00:39:39,712 --> 00:39:43,152 S2: RC Southwest in Afghanistan. It's, um, picture was taken of 759 00:39:43,152 --> 00:39:45,792 S2: our aircraft on the flight line, and one of my 760 00:39:45,792 --> 00:39:48,632 S2: jobs as a pilot was to educate our junior pilots 761 00:39:48,632 --> 00:39:52,232 S2: on this concept of, like, mission survivability. Um, and that's 762 00:39:52,232 --> 00:39:55,192 S2: the idea that, um, you know, understanding what's called, like, 763 00:39:55,192 --> 00:39:57,192 S2: the kill chain. The kill chain has been pretty popularized 764 00:39:57,192 --> 00:40:00,392 S2: in security as well. But, you know, basically for a 765 00:40:00,432 --> 00:40:03,312 S2: for a compromise, whether it's shooting down an aircraft or 766 00:40:03,312 --> 00:40:05,432 S2: breaching a database, like a lot of things have to 767 00:40:05,472 --> 00:40:08,032 S2: happen and they all have some sort of probability. And 768 00:40:08,032 --> 00:40:10,312 S2: your goal in breaking the kill chain or breaking the 769 00:40:10,312 --> 00:40:13,672 S2: exploitation chain is to reduce any one probability down to zero, 770 00:40:13,912 --> 00:40:18,832 S2: because then the common or the conditional probability problem becomes zero. Um, 771 00:40:18,832 --> 00:40:20,752 S2: but the probabilities can be really weird.
I used to 772 00:40:20,752 --> 00:40:22,712 S2: talk to my junior pilots and ask them like, hey, 773 00:40:22,992 --> 00:40:25,632 S2: what do you think is like the acceptable loss rate 774 00:40:25,632 --> 00:40:28,432 S2: on any of the missions that we fly here in theater? 775 00:40:28,472 --> 00:40:30,432 S2: And they would usually give me answers like they were 776 00:40:30,432 --> 00:40:35,002 S2: pretty close. They'd say like 90% or 95% or even 99%. 777 00:40:35,922 --> 00:40:37,562 S2: So I would actually take them through the math problem. 778 00:40:37,562 --> 00:40:39,681 S2: I'd get out the whiteboard and I'd say, okay, let's 779 00:40:39,682 --> 00:40:42,882 S2: assume it's 99%. I say, okay, how many aircraft are 780 00:40:42,882 --> 00:40:44,962 S2: we flying a day? Okay. You know, we have ten 781 00:40:44,962 --> 00:40:47,602 S2: total aircraft. We go on five missions a day. So 782 00:40:47,602 --> 00:40:50,242 S2: that's five aircraft are going out there. And let's say 783 00:40:50,242 --> 00:40:52,162 S2: there's only a 1% chance that each one of them 784 00:40:52,162 --> 00:40:54,162 S2: gets shot down. Okay. So that's five aircraft a day. 785 00:40:54,162 --> 00:40:55,562 S2: But we're going to be in we're going to be 786 00:40:55,562 --> 00:40:58,122 S2: in theater for for nine months. We'll round it off. 787 00:40:58,122 --> 00:40:59,882 S2: We'll make it a year. We're going to be here 788 00:40:59,882 --> 00:41:03,722 S2: for 365 days. So now if I take 365 by 789 00:41:03,762 --> 00:41:06,642 S2: five and multiply it by five, that's the number of 790 00:41:06,642 --> 00:41:09,762 S2: missions we're flying in the entire time we're here. This 791 00:41:09,762 --> 00:41:11,482 S2: number comes out to be pretty high.
And now all 792 00:41:11,482 --> 00:41:14,802 S2: of a sudden, if I lose one aircraft for every 100, 793 00:41:14,842 --> 00:41:17,162 S2: you realize that I actually run out of aircraft in 794 00:41:17,162 --> 00:41:20,162 S2: the first two months of of being in theater and I. 795 00:41:20,162 --> 00:41:21,922 S2: And now all of a sudden, the troops don't have, 796 00:41:22,082 --> 00:41:23,402 S2: don't have helicopters to fly. 797 00:41:23,402 --> 00:41:23,762 S1: Yeah. 798 00:41:24,122 --> 00:41:26,762 S2: So I said, actually, believe it or not, our our 799 00:41:26,842 --> 00:41:32,962 S2: acceptable loss rate is something more like 99.99999%. Um, we 800 00:41:32,962 --> 00:41:35,252 S2: can almost never lose an aircraft because. Or we can 801 00:41:35,252 --> 00:41:38,452 S2: almost never accept any type of probability that means we 802 00:41:38,452 --> 00:41:41,052 S2: have even a remote chance of losing an aircraft because 803 00:41:41,052 --> 00:41:44,092 S2: we will deplete them. It's a limited resource. Um, solving 804 00:41:44,092 --> 00:41:46,372 S2: problems with LLMs is the same way. If you ask 805 00:41:46,372 --> 00:41:48,972 S2: them to solve 15 problems in a row, even if 806 00:41:48,972 --> 00:41:52,372 S2: it's got a 99% chance, which is which would be 807 00:41:52,372 --> 00:41:55,212 S2: amazing if any LLM could get anywhere close to that, 808 00:41:55,652 --> 00:41:57,812 S2: even if it has a 99% chance of answering every 809 00:41:57,812 --> 00:42:01,452 S2: single problem right over the course of a year, it's 810 00:42:01,452 --> 00:42:04,932 S2: probably going to give you answers that are wrong almost 80% 811 00:42:04,932 --> 00:42:07,652 S2: of the time if that chain is long enough. And 812 00:42:07,652 --> 00:42:09,572 S2: if you have enough problems that you feed through it.
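[Editor's sketch] The whiteboard arithmetic above can be punched out in a few lines. This is only a rough illustration, assuming five sorties a day for 365 days, a fleet of ten, and independent losses; the constant and function names are hypothetical, and the real planning math is not described in the conversation.

```python
# Hypothetical numbers taken from the whiteboard example in the transcript.
# Independence between sorties is an assumption made for this sketch.
SORTIES_PER_DAY = 5
DAYS_IN_THEATER = 365
FLEET_SIZE = 10


def expected_losses(p_loss_per_sortie: float) -> float:
    """Expected number of aircraft lost over the whole deployment."""
    return SORTIES_PER_DAY * DAYS_IN_THEATER * p_loss_per_sortie


def p_no_losses(p_loss_per_sortie: float) -> float:
    """Probability that every single sortie comes back."""
    total_sorties = SORTIES_PER_DAY * DAYS_IN_THEATER
    return (1.0 - p_loss_per_sortie) ** total_sorties


if __name__ == "__main__":
    # 1% loss per sortie versus the "99.99999%" survival figure (1e-7 loss).
    for p_loss in (0.01, 1e-7):
        print(
            f"loss per sortie {p_loss:g}: "
            f"expected losses {expected_losses(p_loss):.4f} "
            f"(fleet of {FLEET_SIZE}), "
            f"P(no losses) {p_no_losses(p_loss):.6f}"
        )
```

Under these assumptions a 1% per-sortie loss rate implies more expected losses over the year than there are aircraft in the fleet, which is the point being made on the whiteboard.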
813 00:42:10,012 --> 00:42:13,252 S2: So that's one thing I try to like, um, help 814 00:42:13,252 --> 00:42:16,972 S2: people conceptualize about over-relying on large language models and try 815 00:42:16,972 --> 00:42:20,412 S2: to help them understand this, like compounding error problem. It's 816 00:42:20,412 --> 00:42:26,132 S2: really a conditional probability, uh, compounding conditional probability problem. And 817 00:42:26,132 --> 00:42:28,812 S2: your tolerance for false positives is actually zero. So anywhere 818 00:42:28,812 --> 00:42:31,252 S2: in this chain that you can we have to think 819 00:42:31,252 --> 00:42:33,942 S2: about this differently now because I can't reduce anything to zero. 820 00:42:33,982 --> 00:42:35,422 S2: But what I can do is I can take certain 821 00:42:35,422 --> 00:42:36,902 S2: parts of the chain and I can bump them up 822 00:42:36,902 --> 00:42:40,102 S2: to 100%, meaning my chances of getting something right when 823 00:42:40,102 --> 00:42:42,902 S2: I use a deterministic algorithm are 100%. So now I 824 00:42:42,942 --> 00:42:45,862 S2: no longer have some sort of fractional probability out of. 825 00:42:45,902 --> 00:42:49,102 S2: So this 15 step problem now let's say 12 steps 826 00:42:49,102 --> 00:42:52,062 S2: I do deterministically. Now I only have a three step chain. 827 00:42:52,062 --> 00:42:55,622 S2: And now that 99% I'm getting it right only three times. 828 00:42:55,702 --> 00:42:58,422 S2: You simplify this problem. Now I might be able to 829 00:42:58,422 --> 00:43:00,702 S2: make it through a year's worth of operations at, you know, 830 00:43:00,742 --> 00:43:03,382 S2: 100 examples of the problem a day. I might be 831 00:43:03,382 --> 00:43:06,302 S2: able to make it through that with a false positive 832 00:43:06,302 --> 00:43:08,022 S2: rate of. I don't know what the math is in 833 00:43:08,022 --> 00:43:09,502 S2: my head. I'd have to I have to punch it out.
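[Editor's sketch] A minimal way to punch out the chain math described above, assuming each LLM step succeeds independently with probability 0.99 and deterministic steps never fail. The function names and the 100-problems-a-day figure used in the loop are illustrative, not taken from any tool discussed here.

```python
import math


def chain_success(p_step: float, llm_steps: int) -> float:
    """Probability that every LLM step in one chain is right.

    Deterministic steps are assumed to succeed with probability 1,
    so only the LLM-backed steps are counted.
    """
    return p_step ** llm_steps


def steps_until_failure_rate(p_step: float, target_failure: float) -> int:
    """Smallest chain length whose failure rate reaches target_failure."""
    return math.ceil(math.log(1.0 - target_failure) / math.log(p_step))


if __name__ == "__main__":
    p_step = 0.99
    problems_per_day = 100  # illustrative workload from the conversation
    # 15 LLM steps versus 3 LLM steps plus 12 deterministic ones.
    for llm_steps in (15, 3):
        failure = 1.0 - chain_success(p_step, llm_steps)
        print(
            f"{llm_steps} LLM steps: {failure:.2%} of chains fail, "
            f"about {failure * problems_per_day:.0f} wrong answers per day"
        )
    print(
        "chain length for an 80% failure rate:",
        steps_until_failure_rate(p_step, 0.80),
    )
```

With these assumed numbers, moving 12 of the 15 steps to deterministic code cuts the per-chain failure rate from roughly 14% to roughly 3%, which illustrates the trade being described rather than any measured result.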
834 00:43:09,502 --> 00:43:11,342 S2: But that false positive rate might be a lot more 835 00:43:11,342 --> 00:43:15,422 S2: survivable in an operational world than, you know, 15 conditional 836 00:43:15,422 --> 00:43:17,862 S2: probability problems that are all 99%. 837 00:43:18,902 --> 00:43:22,902 S1: Yeah, yeah, I love that. The way I describe it is, um, 838 00:43:23,222 --> 00:43:26,782 S1: what's 1% of 100 metric tons of problems. 839 00:43:27,942 --> 00:43:28,382 S2: A metric. 840 00:43:29,062 --> 00:43:31,182 S1: A metric ton of problems? 841 00:43:31,272 --> 00:43:33,272 S2: Yeah, I love that. I love that. 842 00:43:34,072 --> 00:43:40,152 S1: Yeah. Yeah. Um, so, uh, we share this in common, actually. So, um, 843 00:43:40,152 --> 00:43:42,952 S1: I was, um, I was also Army, and I was at. 844 00:43:43,832 --> 00:43:46,912 S1: I was at Fort Campbell, so I was air assault, 845 00:43:46,912 --> 00:43:48,832 S1: so I had to do all the helicopter stuff. 846 00:43:48,872 --> 00:43:50,552 S2: Uh, right on, man. Hell, yeah. Brother. 847 00:43:50,992 --> 00:43:54,872 S1: Yeah. That's cool. Airborne air assault. Right? Um, yeah. 848 00:43:55,592 --> 00:43:58,792 S2: No. Yeah, I, I was, um, uh, this this picture 849 00:43:58,792 --> 00:44:01,392 S2: was taken when we were doing, uh, medevac chase. Uh, 850 00:44:01,432 --> 00:44:03,712 S2: we we did security for those guys over there, but 851 00:44:03,752 --> 00:44:05,832 S2: I was in an air assault battalion, so we literally 852 00:44:05,832 --> 00:44:07,912 S2: did nothing but fly you guys around, so. 853 00:44:07,952 --> 00:44:08,352 S1: Oh. 854 00:44:08,352 --> 00:44:10,312 S2: Nice man. Small world. Dude. 855 00:44:10,752 --> 00:44:11,432 S1: Yeah, yeah. 856 00:44:12,272 --> 00:44:14,472 S2: Yeah, I was over at Fort Campbell. I, I was at, um. 
857 00:44:14,472 --> 00:44:16,712 S2: I was at Fort Riley, uh, in in the 1st 858 00:44:17,112 --> 00:44:19,872 S2: CAB and then, um, I PCS'd from there after I 859 00:44:19,872 --> 00:44:22,072 S2: went to Afghanistan and went to the 82nd. Um, so 860 00:44:22,072 --> 00:44:24,272 S2: I never got, never quite got to Campbell, which, like, 861 00:44:24,312 --> 00:44:26,472 S2: would have been great because I live here in Ohio 862 00:44:26,472 --> 00:44:29,272 S2: and Cincinnati. It's like where I was from. So I 863 00:44:29,272 --> 00:44:31,402 S2: was like always trying to get to Campbell because it 864 00:44:31,402 --> 00:44:33,202 S2: was like only like 4 or 5 hours from home 865 00:44:33,202 --> 00:44:35,322 S2: and be able to see family a lot easier. But 866 00:44:35,322 --> 00:44:38,722 S2: I ended up like 12 and nine hours away, respectively, so, uh. 867 00:44:39,042 --> 00:44:42,642 S1: Yeah. Well, that's super cool. Yeah, well, we need to 868 00:44:42,642 --> 00:44:46,802 S1: chat some more. Man. This is, like, really, really cool stuff. Um, 869 00:44:47,282 --> 00:44:49,122 S1: what you guys did on the team is cool, but 870 00:44:49,162 --> 00:44:51,762 S1: I'm even more excited just about the way you think 871 00:44:51,762 --> 00:44:58,522 S1: about these things. Um, I'm. I'm, uh, happy that, um, 872 00:44:58,762 --> 00:45:00,722 S1: the way you're thinking about it is similar to the 873 00:45:00,722 --> 00:45:03,082 S1: way I'm thinking about it. I you've taught me a 874 00:45:03,082 --> 00:45:06,082 S1: lot just during this thing. We should we should definitely 875 00:45:06,082 --> 00:45:08,362 S1: chat more after this. Um, anything else you want to 876 00:45:08,362 --> 00:45:14,482 S1: share about the the competition or, um, lessons learned? Um. 877 00:45:15,522 --> 00:45:17,082 S2: So I think one of the things that that came 878 00:45:17,082 --> 00:45:31,532 S2: out of the competition, um, was a lot of vindication. Sorry.
879 00:45:31,532 --> 00:45:36,612 S2: I nudged mouse in it. Oh. So, um, I'll just 880 00:45:36,612 --> 00:45:38,252 S2: I'll just go right into the answer. I assume you 881 00:45:38,252 --> 00:45:41,852 S2: can edit this later or something, but yeah. Um, so yeah, 882 00:45:41,852 --> 00:45:43,412 S2: one of the things that, um, that came out of 883 00:45:43,412 --> 00:45:46,572 S2: the competition was, was honestly a lot of vindication, um, 884 00:45:47,092 --> 00:45:49,252 S2: like I had mentioned before, you know, when we started 885 00:45:49,252 --> 00:45:53,092 S2: off this process, um, this was two years ago, which 886 00:45:53,092 --> 00:45:56,292 S2: has been two lifetimes in the development of like AI 887 00:45:56,292 --> 00:46:00,612 S2: enabled systems for any problem, much less cybersecurity. Um, so 888 00:46:00,612 --> 00:46:03,692 S2: a lot of the things that we did, like tool enabling, um, 889 00:46:03,732 --> 00:46:07,812 S2: and multi-agent systems were things that we did before things 890 00:46:07,812 --> 00:46:13,252 S2: like MCP or um, complicated, um, libraries for supporting this existed, 891 00:46:13,252 --> 00:46:17,012 S2: like we used early versions of um, of LangChain, uh, 892 00:46:17,012 --> 00:46:19,052 S2: for some of our multi-agent stuff, but we actually ended 893 00:46:19,052 --> 00:46:20,332 S2: up having to write a lot of and implement a 894 00:46:20,332 --> 00:46:23,692 S2: lot of our own glue code for this. Um, so 895 00:46:23,732 --> 00:46:26,612 S2: it's really vindicating to see, like, those techniques become, while 896 00:46:26,612 --> 00:46:29,952 S2: we're doing the competition, become not only one commonplace and 897 00:46:29,952 --> 00:46:33,832 S2: two supported by the major large language model providers, be 898 00:46:33,832 --> 00:46:37,192 S2: adopted and be used generally by the community.
Um, you know, 899 00:46:37,232 --> 00:46:39,112 S2: it was really great that we came in second and 900 00:46:39,112 --> 00:46:41,512 S2: that also the first place finisher also used this like 901 00:46:41,512 --> 00:46:46,632 S2: kind of, um, use, um, problem solving techniques that are 902 00:46:46,632 --> 00:46:51,232 S2: well suited for the problem approach. Yeah. Don't use AI everywhere. Um, 903 00:46:51,792 --> 00:46:55,512 S2: the third-place finisher, Theori, they were a little bit more LLM forward, 904 00:46:55,792 --> 00:46:58,352 S2: but they still had a lot of, like, traditional components. 905 00:46:58,512 --> 00:47:01,352 S2: I don't think any team really went after this, like, 906 00:47:01,352 --> 00:47:04,912 S2: all LLM, tried to just do everything within the LLM. Um. 907 00:47:05,432 --> 00:47:07,952 S1: I bet a lot started that way, and they they 908 00:47:08,112 --> 00:47:10,112 S1: fall back from it. Yeah. 909 00:47:10,152 --> 00:47:13,072 S2: Yeah. Yeah, I think I think at least one of them, um, 910 00:47:13,312 --> 00:47:14,672 S2: at least one team. I think All You Need Is 911 00:47:14,672 --> 00:47:17,872 S2: a Fuzzing Brain. I think in the semi-finals, their approach, um, 912 00:47:18,352 --> 00:47:20,752 S2: tried to just use an LLM to augment a fuzzer 913 00:47:20,752 --> 00:47:22,632 S2: to find vulnerabilities. And I don't think they really had 914 00:47:22,632 --> 00:47:24,672 S2: much of, like, a solution for patching, but it was 915 00:47:24,672 --> 00:47:26,552 S2: enough to get them to the finals. They had 916 00:47:26,552 --> 00:47:31,202 S2: a more well rounded system, I believe, uh, in the, 917 00:47:31,242 --> 00:47:33,442 S2: in the finals. Um, so yeah, it was kind of 918 00:47:33,482 --> 00:47:35,442 S2: vindicating to also see that all these other bright minds 919 00:47:35,442 --> 00:47:38,802 S2: out there were also similarly of the, of the mindset 920 00:47:38,802 --> 00:47:41,002 S2: to do this.
But um, one of the biggest takeaways 921 00:47:41,002 --> 00:47:43,682 S2: I have that I'll, that I'll say is that was 922 00:47:43,682 --> 00:47:45,802 S2: like different than what I expected because it's really easy 923 00:47:45,802 --> 00:47:47,442 S2: to pat myself on the back and say, oh yeah, 924 00:47:47,482 --> 00:47:49,522 S2: all the plan I came up with worked great. That's 925 00:47:49,682 --> 00:47:52,722 S2: that's awesome. But, um, I will say that I was 926 00:47:52,722 --> 00:47:56,362 S2: really surprised at how good large language models eventually became 927 00:47:56,362 --> 00:47:59,402 S2: at helping us generate patches and also helping us generate 928 00:47:59,402 --> 00:48:02,722 S2: seed inputs to improve fuzzer performance. Those were areas where 929 00:48:02,722 --> 00:48:04,922 S2: I didn't really give the LLM a lot of credit 930 00:48:04,922 --> 00:48:07,322 S2: up front, but I had to build an autonomous system, 931 00:48:07,322 --> 00:48:10,802 S2: so I had no choice. They really outperformed my expectations. 932 00:48:10,802 --> 00:48:12,802 S2: So I kind of came out of this with, um, 933 00:48:13,202 --> 00:48:17,122 S2: a bit of a healthier respect for the capabilities of 934 00:48:17,122 --> 00:48:20,322 S2: AI models. Once again, these are still highly constrained. 935 00:48:20,322 --> 00:48:21,322 S1: And yeah, yeah. 936 00:48:21,362 --> 00:48:23,962 S2: Very context rich problems that we ask them to do, 937 00:48:24,082 --> 00:48:26,212 S2: but they still did way better than I thought they 938 00:48:26,212 --> 00:48:27,612 S2: were going to do. Um. 939 00:48:27,892 --> 00:48:32,812 S1: Yeah. And also context constrained, not polluted, like a very 940 00:48:33,132 --> 00:48:35,612 S1: controlled context for that thing. Like like you were talking 941 00:48:35,612 --> 00:48:36,612 S1: about before, right? 942 00:48:37,092 --> 00:48:40,692 S2: Yeah. Yeah. Um, yeah, I think that's about it.
Unfortunately, 943 00:48:40,692 --> 00:48:41,892 S2: I do have to jump off. I gotta I got 944 00:48:41,892 --> 00:48:45,212 S2: another call at 1230, but, um. Yeah, I'd love to 945 00:48:45,212 --> 00:48:47,612 S2: chat more and talk more with you at some point. 946 00:48:47,612 --> 00:48:49,892 S2: If you want to do a follow up episode or, 947 00:48:49,932 --> 00:48:52,372 S2: I don't know, you just want to chat about other stuff. Um, 948 00:48:53,052 --> 00:48:54,372 S2: you know, we got a couple of friends in common 949 00:48:54,372 --> 00:48:58,652 S2: between Clint and, uh, between Clint and Keith, and it's, uh, 950 00:48:58,652 --> 00:49:00,172 S2: you know, I've. I've run into you a couple places 951 00:49:00,172 --> 00:49:03,052 S2: on various calls and stuff that we've been on, but, um, 952 00:49:03,132 --> 00:49:04,212 S2: it was good to get a chance to talk with 953 00:49:04,212 --> 00:49:06,052 S2: you one on one. I feel like we've been kind of, like, 954 00:49:06,292 --> 00:49:08,612 S2: circling around in the same circle for a while, but 955 00:49:08,612 --> 00:49:10,132 S2: I hadn't had a chance to, like, actually just chat 956 00:49:10,132 --> 00:49:10,812 S2: the two of us. 957 00:49:11,492 --> 00:49:15,452 S1: Yeah, absolutely. Well, thanks. Thanks for the, uh, the input. 958 00:49:15,452 --> 00:49:18,812 S1: This is just, uh, fantastic stuff. And, uh, let's definitely 959 00:49:18,812 --> 00:49:19,572 S1: catch up soon. 960 00:49:20,052 --> 00:49:21,372 S2: Yeah. Sounds good man. Take care of yourself. 961 00:49:21,412 --> 00:49:22,252 S1: All right. Take care.