1
00:00:19,852 --> 00:00:22,212
S1: All right, Michael, welcome to Unsupervised Learning.

2
00:00:22,892 --> 00:00:24,412
S2: Hey, it's great to be here. Thanks for having me.

3
00:00:25,532 --> 00:00:29,892
S1: Yeah. So, uh, lots to talk about here. Uh, can

4
00:00:29,892 --> 00:00:31,732
S1: you give a quick intro on yourself?

5
00:00:32,492 --> 00:00:34,172
S2: Yeah, sure. So, uh, my name is Michael Brown. I'm

6
00:00:34,172 --> 00:00:36,812
S2: a principal security engineer at Trail of Bits. I lead up our

7
00:00:36,812 --> 00:00:40,932
S2: company's AI and ML security research group. We really focus

8
00:00:40,932 --> 00:00:44,972
S2: on two kinds of, uh, intersections between AI, ML, and security.

9
00:00:45,012 --> 00:00:51,332
S2: It's primarily using AI/ML technologies to solve traditional cybersecurity problems

10
00:00:51,332 --> 00:00:54,292
S2: that are really hairy and really kind of sticky, and

11
00:00:54,292 --> 00:00:57,972
S2: conventional methods have kind of failed to address. And then

12
00:00:57,972 --> 00:01:01,412
S2: we also, uh, to a smaller degree, look at, um,

13
00:01:01,452 --> 00:01:06,292
S2: the security of AI/ML-based systems. So, um, I was

14
00:01:06,532 --> 00:01:10,332
S2: also the lead designer, um, and team lead for, um,

15
00:01:10,372 --> 00:01:14,092
S2: Trail of Bits' team that entered into the AI Cyber Challenge. Uh,

16
00:01:14,132 --> 00:01:16,892
S2: we built the tool called Buttercup, which took second place

17
00:01:16,892 --> 00:01:21,382
S2: overall in the AIxCC. And, um. Yeah, that's

18
00:01:21,382 --> 00:01:21,862
S2: about it.

19
00:01:22,462 --> 00:01:26,062
S1: Yeah. That's perfect. And that's exactly what I'd like to

20
00:01:26,102 --> 00:01:32,622
S1: chat about. Um, so I guess, um, I guess the

21
00:01:32,622 --> 00:01:35,702
S1: thing I'm most interested in is, uh, just the design

22
00:01:35,702 --> 00:01:41,622
S1: of the system, and, um, I guess overall, what you

23
00:01:41,622 --> 00:01:44,542
S1: know about the designs of the other systems. So design

24
00:01:44,542 --> 00:01:49,222
S1: versus design, system versus system. What? Whatever you want to

25
00:01:49,222 --> 00:01:51,541
S1: share or can share. Like what? What are your thoughts

26
00:01:51,542 --> 00:01:54,541
S1: on that? Um, I guess everyone releases open source. So

27
00:01:54,862 --> 00:01:56,462
S1: maybe you've had a chance to look at some of

28
00:01:56,462 --> 00:01:59,662
S1: the other offerings. Maybe you've heard them talking, maybe you know,

29
00:01:59,662 --> 00:02:02,742
S1: the teams. Uh, so I guess what kind of intel

30
00:02:02,742 --> 00:02:06,502
S1: do you have on what everyone else was doing versus

31
00:02:06,502 --> 00:02:10,062
S1: what you guys were doing? And how do you think

32
00:02:10,182 --> 00:02:11,142
S1: that went?

33
00:02:12,502 --> 00:02:14,622
S2: Yeah. Well, um, yeah, I guess I can answer that

34
00:02:14,622 --> 00:02:17,992
S2: last part pretty easily. It went pretty well for us. Um,

35
00:02:18,232 --> 00:02:20,712
S2: so we took second place. Uh, the team that finished

36
00:02:20,712 --> 00:02:23,952
S2: in first. Team Atlanta. Um, they had a pretty similar

37
00:02:23,952 --> 00:02:28,512
S2: setup to ours. Um, they had more components, more moving parts, uh,

38
00:02:28,512 --> 00:02:31,552
S2: more pieces. They had more hands. Um, larger team to

39
00:02:31,552 --> 00:02:34,112
S2: be able to kind of implement more, um, but ultimately

40
00:02:34,112 --> 00:02:37,952
S2: they had a really similar kind of set of design principles, um,

41
00:02:37,992 --> 00:02:41,632
S2: that worked out for us. The third-place finishing team, Theori, they, um,

42
00:02:41,672 --> 00:02:44,112
S2: had a bit of a deviation in terms of like

43
00:02:44,112 --> 00:02:47,232
S2: their conceptual, uh, principles that guided how they built their system.

44
00:02:47,232 --> 00:02:49,712
S2: But I can get into that in a bit. Um,

45
00:02:49,952 --> 00:02:51,392
S2: I guess I can first start off by talking a

46
00:02:51,392 --> 00:02:55,192
S2: little bit about our concept. So it's interesting. Um, you know,

47
00:02:55,232 --> 00:02:57,472
S2: the concept for Buttercup changed quite a bit over the

48
00:02:57,472 --> 00:03:00,032
S2: course of the AI Cyber Challenge.

49
00:03:00,032 --> 00:03:03,832
S2: So this got announced, um, a couple years back, and

50
00:03:03,832 --> 00:03:06,592
S2: there was a period of about 4 or 5 months, um,

51
00:03:06,672 --> 00:03:09,512
S2: after the cyber challenge was announced, but before DARPA had

52
00:03:09,512 --> 00:03:13,031
S2: really released any rules. So we didn't really know exactly

53
00:03:13,312 --> 00:03:15,282
S2: how the competition was going to be structured.

54
00:03:15,282 --> 00:03:16,682
S2: We just knew that we would have to build a

55
00:03:16,681 --> 00:03:22,281
S2: fully autonomous, AI driven system that could find and patch vulnerabilities, um,

56
00:03:22,322 --> 00:03:26,202
S2: with a high degree of accuracy. Um, so originally, the

57
00:03:26,202 --> 00:03:29,962
S2: concept that I drew up along with my co-creator Ian Smith, um,

58
00:03:30,562 --> 00:03:34,042
S2: was originally really ambitious. Lots of moving parts, lots of

59
00:03:34,042 --> 00:03:39,602
S2: static analysis, dynamic analysis, lots of, um, conventional techniques, lots

60
00:03:39,602 --> 00:03:42,642
S2: of AI/ML-based techniques. But ultimately, once the rules came out,

61
00:03:42,642 --> 00:03:44,442
S2: it kind of got pared down quite a bit. Um,

62
00:03:44,442 --> 00:03:47,322
S2: some of the things that we wanted to do, um, were,

63
00:03:47,522 --> 00:03:49,122
S2: were marked as like out of scope. Some of the

64
00:03:49,122 --> 00:03:52,162
S2: stuff we wanted to do was marked as against the rules, um,

65
00:03:52,162 --> 00:03:54,322
S2: just for the tractability of the competition.

66
00:03:54,322 --> 00:03:56,802
S1: So is that because they were, they would have been

67
00:03:56,802 --> 00:03:59,402
S1: too expensive. Didn't you have budgets you had to stay under?

68
00:04:00,162 --> 00:04:02,722
S2: Yeah. So some of it was definitely, um, budgetary and

69
00:04:02,722 --> 00:04:04,562
S2: some stuff was just, you know, flat out against the rules.

70
00:04:04,562 --> 00:04:07,402
S2: We looked at fine tuning a large language model, um,

71
00:04:07,442 --> 00:04:10,602
S2: with information about lots of open source software. And, um,

72
00:04:10,642 --> 00:04:15,022
S2: there ended up being a rule about pre-baking models. So, okay, really,

73
00:04:15,022 --> 00:04:17,702
S2: kudos to DARPA for making sure that, you know, competitors

74
00:04:17,702 --> 00:04:21,022
S2: didn't have the ability to kind of, um, skew the

75
00:04:21,022 --> 00:04:23,622
S2: systems that they build for the test, which is, you know,

76
00:04:23,662 --> 00:04:27,182
S2: finding and patching vulnerabilities in open source software. Um, so, yeah,

77
00:04:27,222 --> 00:04:29,541
S2: there was a lot of stuff that got cut down. Um,

78
00:04:29,582 --> 00:04:33,382
S2: but ultimately the design of our system, um, was,

79
00:04:33,382 --> 00:04:35,541
S2: was basically a pipeline. We kind of broke the

80
00:04:35,541 --> 00:04:37,491
S2: problem down. We realized we had to do basically 4

81
00:04:37,492 --> 00:04:40,421
S2: or 5 things really well to win this competition. We

82
00:04:40,422 --> 00:04:42,462
S2: had to be able to find vulnerabilities. And not only that,

83
00:04:42,462 --> 00:04:44,302
S2: we had to be able to prove they exist. So

84
00:04:44,302 --> 00:04:46,942
S2: it wasn't enough just to, you know, use a static

85
00:04:46,942 --> 00:04:49,302
S2: analysis scanner and say, hey, this thing thinks there's a

86
00:04:49,302 --> 00:04:54,501
S2: vulnerability on line 50 of, you know, whatever. Uh, you actually

87
00:04:54,502 --> 00:04:56,862
S2: had to have a crashing test case for the first

88
00:04:57,981 --> 00:05:01,582
S2: round of the competition, the semifinals. And in the finals,

89
00:05:01,702 --> 00:05:05,222
S2: they relaxed this requirement. But the pathway

90
00:05:05,222 --> 00:05:08,982
S2: to getting lots of points basically still required one. Um,

91
00:05:08,981 --> 00:05:11,462
S2: so you had to find vulnerabilities and also prove

92
00:05:11,462 --> 00:05:14,152
S2: they exist with a crashing input, or an input that

93
00:05:14,152 --> 00:05:19,232
S2: would trigger a sanitizer in the target function. Um, you

94
00:05:19,231 --> 00:05:22,032
S2: had to be able to contextualize and draw additional information

95
00:05:22,032 --> 00:05:25,952
S2: about this vulnerability. Otherwise, patching was doomed to fail. Um,

96
00:05:25,992 --> 00:05:30,272
S2: and then you had to actually patch the vulnerability. Um,

97
00:05:30,791 --> 00:05:35,072
S2: so this is a highly complex, uh, problem that conventional

98
00:05:35,072 --> 00:05:39,312
S2: approaches to software analysis have really kind of not addressed well,

99
00:05:39,312 --> 00:05:41,032
S2: in my opinion. And it was a great area to

100
00:05:41,072 --> 00:05:43,272
S2: use AI. And then we also, you know, finally we

101
00:05:43,272 --> 00:05:47,272
S2: had to orchestrate all of these functions and do really

102
00:05:47,272 --> 00:05:50,032
S2: high quality engineering around all of them so that the

103
00:05:50,032 --> 00:05:53,032
S2: system would stay up and running for several days. Um,

104
00:05:53,032 --> 00:05:54,872
S2: so based on those kind of 4 or 5, depending

105
00:05:54,872 --> 00:05:57,632
S2: on how you chop them up, core principles or core

106
00:05:57,632 --> 00:05:59,671
S2: tasks that we had to do, um, we kind of

107
00:05:59,712 --> 00:06:01,952
S2: decided on an approach that we kind of call the

108
00:06:01,952 --> 00:06:04,672
S2: best of both worlds, which was, you know, we knew

109
00:06:04,712 --> 00:06:08,752
S2: that conventional software analysis, whether it's dynamic, static, hybrid, whatever, um,

110
00:06:08,791 --> 00:06:12,162
S2: it really excels at certain subproblems within this pipeline, and

111
00:06:12,162 --> 00:06:15,722
S2: it really struggles with other ones. And AI/ML and specifically

112
00:06:15,722 --> 00:06:19,202
S2: generative AI, which the competition was, was kind of heavily

113
00:06:19,202 --> 00:06:22,522
S2: skewed towards generative AI. Generative AI does really well at

114
00:06:22,522 --> 00:06:25,162
S2: certain types of subproblems in this pipeline, but also really

115
00:06:25,162 --> 00:06:29,282
S2: struggles with others. So our approach is pretty straightforward. We're

116
00:06:29,282 --> 00:06:31,442
S2: going to merge the best in class capability for each

117
00:06:31,442 --> 00:06:35,481
S2: part of this pipeline. Uh, stitch them together with high uptime,

118
00:06:35,481 --> 00:06:39,842
S2: high reliability engineering code, um, and then focus on doing really,

119
00:06:39,842 --> 00:06:44,042
S2: really well for the largest number of, um, the largest

120
00:06:44,041 --> 00:06:48,282
S2: number of possible targets that we could possibly, um, that

121
00:06:48,282 --> 00:06:49,522
S2: we could possibly do well in.
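
A minimal sketch of the pipeline shape described above: find a vulnerability, prove it with a crashing input, gather context, then patch, with plain code fixing the order of the stages. The stage functions, the `Finding` fields, and the stub values are hypothetical placeholders for illustration, not Buttercup's actual interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Finding:
    """Everything the pipeline has learned about one target so far."""
    target: str
    crashing_input: Optional[bytes] = None  # proof that the bug exists
    context: dict = field(default_factory=dict)
    patch: Optional[str] = None


def run_pipeline(
    target: str,
    find: Callable[[str], Optional[bytes]],
    contextualize: Callable[[str, bytes], dict],
    patch: Callable[[str, dict], str],
) -> Finding:
    """Run the stages in a fixed order; no model decides what happens next."""
    finding = Finding(target=target)
    finding.crashing_input = find(target)
    if finding.crashing_input is None:
        return finding  # nothing proven, so nothing to patch
    finding.context = contextualize(target, finding.crashing_input)
    finding.patch = patch(target, finding.context)
    return finding


# Hypothetical stand-ins for a fuzzer, a context builder, and a patcher.
def stub_find(target: str) -> Optional[bytes]:
    return b"\xff" * 8


def stub_contextualize(target: str, crash: bytes) -> dict:
    return {"function": "parse_header", "crash_len": len(crash)}


def stub_patch(target: str, ctx: dict) -> str:
    return f"add bounds check in {ctx['function']}"


result = run_pipeline("libexample", stub_find, stub_contextualize, stub_patch)
print(result.patch)
```

Because each stage is just a callable, a different strategy for any one stage can be swapped in without touching the others.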

122
00:06:51,322 --> 00:07:01,002
S1: Okay. Yeah. Interesting. So would you say that, um. Basically

123
00:07:01,002 --> 00:07:03,241
S1: those things that you described in the beginning, those

124
00:07:03,242 --> 00:07:05,882
S1: are like modules and they should almost like, kind of

125
00:07:05,922 --> 00:07:08,442
S1: work independently. So you can, like, hand a task to

126
00:07:08,481 --> 00:07:11,372
S1: each of them. Is that kind of the system

127
00:07:11,372 --> 00:07:12,332
S1: design idea?

128
00:07:12,892 --> 00:07:15,692
S2: Yeah. Yeah. So we, um, part of this was just

129
00:07:15,732 --> 00:07:20,052
S2: surviving a really rapid development cycle. This wasn't really advertised

130
00:07:20,092 --> 00:07:22,012
S2: all that well, but we actually only had about three

131
00:07:22,012 --> 00:07:26,292
S2: months to develop the first version of Buttercup in the semi-finals. Um,

132
00:07:26,292 --> 00:07:29,452
S2: and we actually only had about six months to develop, um,

133
00:07:29,492 --> 00:07:32,332
S2: the final version of Buttercup or Buttercup 2.0, which, which

134
00:07:32,332 --> 00:07:35,132
S2: took second place in the finals. Um, and that was

135
00:07:35,132 --> 00:07:37,852
S2: because even though each round of the competition ran for

136
00:07:37,852 --> 00:07:41,012
S2: a year, it took DARPA a while to solicit feedback

137
00:07:41,012 --> 00:07:45,572
S2: from competitors, other stakeholders, and actually solidify the rules. Um,

138
00:07:45,612 --> 00:07:47,732
S2: and until the rules were solidified, it was really at

139
00:07:47,732 --> 00:07:51,652
S2: risk to do really kind of any development on the system. Also,

140
00:07:51,652 --> 00:07:54,772
S2: certain things like the technical specifics on their competition

141
00:07:54,772 --> 00:07:58,492
S2: API weren't available until later in the, in these cycles. Um,

142
00:07:58,492 --> 00:08:01,292
S2: so part of the reason why we modularized each component

143
00:08:01,612 --> 00:08:04,812
S2: was so that we could take smaller subteams within my

144
00:08:04,812 --> 00:08:08,452
S2: larger team of about ten engineers, um, all working some

145
00:08:08,452 --> 00:08:10,862
S2: degree of part time on this system so we can

146
00:08:10,862 --> 00:08:12,982
S2: modularize it, keep them kind of separate. You know, it

147
00:08:12,982 --> 00:08:14,782
S2: gives us this integration problem that we have to deal

148
00:08:14,782 --> 00:08:15,902
S2: with at the end. We have to kind of put

149
00:08:15,902 --> 00:08:18,302
S2: everything together and make sure that it runs well. Um,

150
00:08:18,342 --> 00:08:19,822
S2: but it was kind of a necessity. It was kind

151
00:08:19,822 --> 00:08:21,902
S2: of a necessity because we had to work on developing

152
00:08:21,902 --> 00:08:25,342
S2: everything independently. We couldn't afford to just do the first block.

153
00:08:25,622 --> 00:08:27,302
S2: And have it become like, you know, that meme

154
00:08:27,302 --> 00:08:31,302
S2: of the horse drawing with a really finely defined head, and

155
00:08:31,302 --> 00:08:33,222
S2: then as it gets towards like the back parts

156
00:08:33,222 --> 00:08:35,382
S2: of the animal, it turns into like a rough sketch.

157
00:08:35,502 --> 00:08:37,381
S2: That was what was going to happen if

158
00:08:37,382 --> 00:08:40,822
S2: we didn't modularize this. Um, but it also helped because

159
00:08:40,822 --> 00:08:43,742
S2: as we decided to change out strategies or play with

160
00:08:43,742 --> 00:08:45,982
S2: different strategies, it made it really easy to kind of plug

161
00:08:45,982 --> 00:08:48,262
S2: and play different parts to see what would work later on.

162
00:08:49,462 --> 00:08:52,822
S1: Yeah, that makes sense. So I keep having this debate

163
00:08:52,822 --> 00:08:56,381
S1: with a whole bunch of people. It's kind of around, um,

164
00:08:56,942 --> 00:09:00,542
S1: let the model do the work because the model is smarter. Um,

165
00:09:00,742 --> 00:09:04,781
S1: and it just understands what to do. And then there's uh,

166
00:09:05,462 --> 00:09:10,112
S1: the other argument, which is, um, build a robust system

167
00:09:10,832 --> 00:09:13,552
S1: and you have the model kind of just be the

168
00:09:13,552 --> 00:09:17,432
S1: intelligence that helps guide the system or moves things through

169
00:09:17,432 --> 00:09:22,392
S1: the system, or maybe routes, uh, across the system or whatever.

170
00:09:22,712 --> 00:09:25,432
S1: But the system itself should be set up really well,

171
00:09:26,232 --> 00:09:28,752
S1: and you're kind of like functioning as a router. And

172
00:09:28,752 --> 00:09:33,352
S1: then when the model gets updated, it makes the system better. Um,

173
00:09:33,592 --> 00:09:37,272
S1: but the counter to that is basically that we're just

174
00:09:37,272 --> 00:09:40,032
S1: going to design bad systems. So we should stop trying

175
00:09:40,032 --> 00:09:43,192
S1: to be rigid there and just use the model. Like

176
00:09:43,352 --> 00:09:44,752
S1: where do you guys fall on that?

177
00:09:45,432 --> 00:09:49,672
S2: Uh, I think it was probably closest to the second

178
00:09:49,672 --> 00:09:53,712
S2: one and maybe more like an an undescribed third thing.

179
00:09:53,712 --> 00:09:56,911
S2: So I'll kind of go over it. For AI, um, you know, we've,

180
00:09:56,912 --> 00:09:58,792
S2: we've been, you know, and me in particular, I've been doing

181
00:09:58,792 --> 00:10:03,592
S2: research on like applied AI for, for security problems since before, uh,

182
00:10:03,592 --> 00:10:06,271
S2: the large language model became the predominant form of technology.

183
00:10:06,272 --> 00:10:12,402
S2: Back to, you know, 2018, 2019 time frame. Um, and uh, realistically,

184
00:10:12,402 --> 00:10:14,482
S2: like large language models are great at a good number

185
00:10:14,482 --> 00:10:17,842
S2: of things. Um, but they really struggle with certain things.

186
00:10:18,282 --> 00:10:20,881
S2: And particularly in a challenge like this where you have

187
00:10:20,881 --> 00:10:23,722
S2: to do multiple things right in sequence in order to

188
00:10:23,722 --> 00:10:27,082
S2: be successful, you have to worry about errors that start

189
00:10:27,122 --> 00:10:31,242
S2: off in early stages of an LLM heavy pipeline that

190
00:10:31,242 --> 00:10:33,562
S2: compound over time, until eventually you get to the point

191
00:10:33,562 --> 00:10:36,521
S2: where it all kind of collapses. Um, so our philosophy

192
00:10:36,522 --> 00:10:40,122
S2: on using AI, uh, specifically within the AI cyber challenge

193
00:10:40,122 --> 00:10:42,602
S2: and also kind of more broadly, um, is to use

194
00:10:42,602 --> 00:10:49,082
S2: it for, um, tightly constrained, highly contextualized problems where, um,

195
00:10:49,362 --> 00:10:51,842
S2: the models are set up for success. Um, so this

196
00:10:51,842 --> 00:10:54,162
S2: is actually kind of an interesting anecdote. Um, during the

197
00:10:54,162 --> 00:10:58,122
S2: first round of, uh, during the first round of the

198
00:10:58,122 --> 00:11:03,202
S2: AI Cyber Challenge, um, the whole concept of like multi-agent systems,

199
00:11:03,442 --> 00:11:08,342
S2: systems that have, like, tools available to them. um, didn't

200
00:11:08,342 --> 00:11:10,622
S2: really exist. It was like in a couple of papers

201
00:11:10,622 --> 00:11:13,901
S2: on arXiv. And ultimately, um, the way we built our

202
00:11:13,902 --> 00:11:17,582
S2: patcher for the semi-finals and for the finals, um, is

203
00:11:17,622 --> 00:11:20,862
S2: now reflective of how LLM-driven systems are just

204
00:11:20,862 --> 00:11:23,742
S2: built today. So it's actually really vindicating. So like our

205
00:11:23,742 --> 00:11:28,021
S2: patcher is, like, a multi-agent system. It's got multiple

206
00:11:28,022 --> 00:11:30,662
S2: large language models, each with different roles to play within

207
00:11:30,662 --> 00:11:34,342
S2: this process that collaborate to generate a patch and then

208
00:11:34,342 --> 00:11:38,021
S2: validate it to make sure that it, one, will compile;

209
00:11:38,062 --> 00:11:41,582
S2: two, will actually fix the vulnerability that we've discovered; and

210
00:11:41,582 --> 00:11:44,462
S2: three, doesn't break other functionality within the program. So we

211
00:11:44,462 --> 00:11:46,342
S2: found that trying to ask one large language model to

212
00:11:46,342 --> 00:11:48,982
S2: do all of that didn't really work out. And also

213
00:11:48,982 --> 00:11:52,662
S2: in the semi-finals, the, the reasoning models, um, or the

214
00:11:52,662 --> 00:11:55,702
S2: thinking models, depending on, on the branding, they didn't exist,

215
00:11:55,702 --> 00:11:57,702
S2: they weren't available. They weren't even available to us to

216
00:11:57,742 --> 00:12:02,222
S2: use as like, um, early adopter models in the AIxCC.

217
00:12:02,222 --> 00:12:04,262
S2: So we were dealing with, with simple, you know, back

218
00:12:04,261 --> 00:12:09,592
S2: and forth, um, style chat models. Um, so we actually

219
00:12:09,592 --> 00:12:12,391
S2: had to build in a lot of this reasoning as

220
00:12:12,392 --> 00:12:14,912
S2: part of this, like multi-agent architecture, we had to build

221
00:12:14,912 --> 00:12:18,512
S2: in a lot of like reliability and engineering code around

222
00:12:18,511 --> 00:12:23,872
S2: maintaining the pipeline. Um, fortunately, the process for um, discovering

223
00:12:23,872 --> 00:12:27,272
S2: artifacts and submitting them was pretty rigid. Um, so it

224
00:12:27,272 --> 00:12:29,912
S2: didn't really affect us that much in terms of or

225
00:12:29,912 --> 00:12:31,232
S2: we didn't have to, like, put a lot of really

226
00:12:31,232 --> 00:12:34,632
S2: complex reasoning in, um, but actually we ended up even

227
00:12:34,631 --> 00:12:36,552
S2: by the end of the finals, we didn't use a

228
00:12:36,552 --> 00:12:39,952
S2: reasoning or a thinking model, um, in Buttercup, because we'd

229
00:12:39,952 --> 00:12:42,552
S2: actually built it in, it was part of the circuitry

230
00:12:42,552 --> 00:12:45,832
S2: or part of like the, um, the Python code, part

231
00:12:45,832 --> 00:12:49,312
S2: of our orchestration code. Um, so we had the opportunity

232
00:12:49,312 --> 00:12:50,712
S2: in the finals to take that out and let the

233
00:12:50,712 --> 00:12:52,992
S2: model do the work. We kind of explored it a

234
00:12:52,992 --> 00:12:55,592
S2: little bit, but ultimately we decided against it because the

235
00:12:55,592 --> 00:12:58,672
S2: best case scenario was that the model would kind of

236
00:12:58,712 --> 00:13:01,792
S2: figure out on its own how to break the problem

237
00:13:01,792 --> 00:13:03,752
S2: down and how to do individual things, and what tools

238
00:13:03,752 --> 00:13:07,242
S2: to call in sequence. Uh, but we were already subject

239
00:13:07,242 --> 00:13:09,202
S2: matter experts who did it exactly the way it should

240
00:13:09,202 --> 00:13:12,242
S2: be done. So the the best case scenario is that

241
00:13:12,242 --> 00:13:14,762
S2: the model was able to replicate what we've done only

242
00:13:14,761 --> 00:13:17,842
S2: at a higher cost per call. Um, or more expensive,

243
00:13:17,881 --> 00:13:21,882
S2: like, number or volume of tokens. Um, so we actually kept, um, we,

244
00:13:21,881 --> 00:13:23,842
S2: we did upgrade our models. We went from the GPT

245
00:13:23,881 --> 00:13:28,161
S2: three series, um, and the Claude three, uh, series of

246
00:13:28,162 --> 00:13:33,122
S2: models and moved up to, um, the four and like

247
00:13:33,162 --> 00:13:36,362
S2: the basically the Gen four versions of models for the final.

248
00:13:36,362 --> 00:13:39,402
S2: So we, we upgraded the underlying models, but we very much, um,

249
00:13:39,442 --> 00:13:42,562
S2: kept the problems very small for the, for the AIs,

250
00:13:42,682 --> 00:13:45,362
S2: or for the, um, for the AI models, so that

251
00:13:45,362 --> 00:13:48,122
S2: we would avoid this issue where you have compounding errors,

252
00:13:48,362 --> 00:13:51,642
S2: you have to worry about like these, these modulo errors of,

253
00:13:51,682 --> 00:13:54,082
S2: you know, deciding to do the wrong thing in sequence.

254
00:13:54,562 --> 00:13:56,682
S2: And that actually turns out to be really, uh, to

255
00:13:56,682 --> 00:14:00,242
S2: penalize you heavily in these long systems because, you know,

256
00:14:00,282 --> 00:14:03,202
S2: when a system decides, you know, hey, I've got to

257
00:14:03,202 --> 00:14:05,692
S2: do A, B, C and D, and does C before B.

258
00:14:06,052 --> 00:14:09,651
S2: All of that information involved with dealing with this like

259
00:14:09,852 --> 00:14:12,852
S2: out of sequence task. It stays in the context window.

260
00:14:12,852 --> 00:14:14,732
S2: And it kind of, for lack of a better term,

261
00:14:14,772 --> 00:14:17,532
S2: kind of pollutes the model's ability to kind of reorder

262
00:14:17,532 --> 00:14:19,132
S2: those tasks and do them correctly. It has a hard

263
00:14:19,132 --> 00:14:22,532
S2: time kind of forgetting information until it rolls out of

264
00:14:22,532 --> 00:14:24,692
S2: the context window. So it's a really long way to

265
00:14:24,692 --> 00:14:28,172
S2: say we probably did the latter version. But, um, one

266
00:14:28,172 --> 00:14:29,692
S2: thing I do want to say is like the actual

267
00:14:29,732 --> 00:14:33,052
S2: like processing of artifacts through the system, we didn't rely

268
00:14:33,052 --> 00:14:34,692
S2: on the AI to kind of figure out, okay, I've

269
00:14:34,692 --> 00:14:36,532
S2: got a vulnerability now I should patch it. That was

270
00:14:36,532 --> 00:14:40,772
S2: also all, um, that was also all orchestrated, um, by

271
00:14:40,772 --> 00:14:42,172
S2: our larger pipeline.
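
A minimal sketch of the three-part patch gate described above: a candidate patch is accepted only if it compiles, stops the known crash, and leaves existing functionality intact. The check functions are hypothetical stubs; in a real system they would wrap a build, a replay of the crashing input, and a regression test run. The control flow is plain Python, so the order of checks is fixed by code and not chosen by a model.

```python
from typing import Callable, Iterable, Optional

Check = Callable[[str], bool]


def first_valid_patch(
    candidates: Iterable[str],
    compiles: Check,
    crash_fixed: Check,
    tests_pass: Check,
) -> Optional[str]:
    """Return the first candidate that passes all three gates, in order."""
    for patch in candidates:
        # `and` short-circuits, so a patch that fails to build is never
        # replayed against the crash or run through the test suite.
        if compiles(patch) and crash_fixed(patch) and tests_pass(patch):
            return patch
    return None


# Hypothetical candidates, e.g. proposals from several model calls.
candidates = ["syntax error", "compiles only", "good fix"]

accepted = first_valid_patch(
    candidates,
    compiles=lambda p: p != "syntax error",
    crash_fixed=lambda p: p == "good fix",
    tests_pass=lambda p: True,
)
print(accepted)
```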

272
00:14:42,572 --> 00:14:46,132
S1: Okay. Okay. So yeah, I've seen this a lot as well.

273
00:14:46,172 --> 00:14:48,772
S1: I mean, I feel like this is a general concept

274
00:14:48,772 --> 00:14:54,252
S1: that people are coming to, which is, um, I don't

275
00:14:54,252 --> 00:14:59,372
S1: want to say legacy tech. Traditional tech is just like, deterministic. So, like,

276
00:14:59,372 --> 00:15:01,532
S1: that's the tech that you want to use to, like,

277
00:15:02,092 --> 00:15:05,342
S1: do things that matter, and then you kind of want

278
00:15:05,382 --> 00:15:09,702
S1: to use like AI for like a, um, I don't know,

279
00:15:09,742 --> 00:15:13,462
S1: like a router maybe, or like a, um, something intelligent

280
00:15:13,462 --> 00:15:19,222
S1: about choosing which standard tech to use, but not making like, choices.

281
00:15:19,222 --> 00:15:22,742
S1: Maybe necessarily. Um, I don't know. I'm trying to figure

282
00:15:22,742 --> 00:15:24,582
S1: out how to articulate that, but it's like.

283
00:15:24,782 --> 00:15:26,262
S2: Yeah, well, it's actually funny you bring this up. I've

284
00:15:26,262 --> 00:15:28,982
S2: had to kind of get good at articulating this, um,

285
00:15:28,982 --> 00:15:31,022
S2: over the last couple of years. So the way I've

286
00:15:31,022 --> 00:15:33,742
S2: explained this to people is that certain problems, particularly in

287
00:15:33,742 --> 00:15:37,462
S2: computer science, but this kind of generalizes everywhere. Certain problems

288
00:15:37,502 --> 00:15:43,142
S2: lend themselves to prescriptive solutions. So a prescriptive solution is something

289
00:15:43,142 --> 00:15:44,982
S2: that we do when we write an algorithm to solve

290
00:15:44,982 --> 00:15:47,502
S2: a problem. This could be like coming up with an

291
00:15:47,502 --> 00:15:50,302
S2: answer for the traveling salesman problem. You know, we know

292
00:15:50,342 --> 00:15:52,502
S2: it's a really difficult problem to solve, but there's greedy

293
00:15:52,502 --> 00:15:54,982
S2: algorithms that do a pretty good job and for the

294
00:15:54,982 --> 00:15:56,822
S2: most part, will get you a good answer. Maybe not

295
00:15:56,822 --> 00:15:58,542
S2: the best answer, but they'll get you a good one.

296
00:15:59,102 --> 00:16:01,952
S2: So for these types of problems, you can prescribe a

297
00:16:01,952 --> 00:16:04,552
S2: set of steps to the computer and let it execute them.
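
As an illustration of a prescriptive solution, here is a minimal sketch of one greedy heuristic for the traveling salesman problem, nearest neighbor: always visit the closest unvisited city next. The speaker does not name a specific greedy algorithm, so this choice and the sample coordinates are assumptions. It yields a reasonable tour, not necessarily the shortest one.

```python
import math


def nearest_neighbor_tour(points):
    """Greedy tour: start at city 0 and repeatedly go to the closest unvisited city."""
    unvisited = list(range(1, len(points)))
    tour = [0]
    while unvisited:
        last = points[tour[-1]]
        nxt = min(unvisited, key=lambda i: math.dist(last, points[i]))
        unvisited.remove(nxt)
        tour.append(nxt)
    return tour


# Four hypothetical cities on a line, listed out of order.
cities = [(0, 0), (5, 0), (1, 0), (3, 0)]
tour = nearest_neighbor_tour(cities)
print(tour)  # → [0, 2, 3, 1]
```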

298
00:16:04,952 --> 00:16:09,032
S2: Now other problems are really, really challenging to prescribe a

299
00:16:09,032 --> 00:16:12,952
S2: solution for. So these types of problems lend themselves to

300
00:16:12,992 --> 00:16:15,592
S2: AI or ML techniques because you can use a descriptive

301
00:16:15,712 --> 00:16:19,352
S2: instead of prescriptive solution. So a good example of this

302
00:16:19,352 --> 00:16:22,432
S2: is like image recognition. So it's really really hard to

303
00:16:22,472 --> 00:16:25,112
S2: take a picture of a cat and write a computer

304
00:16:25,112 --> 00:16:28,832
S2: program that will say, okay, based on the pixel colors

305
00:16:28,832 --> 00:16:30,992
S2: of this pixel and this position, this is going to

306
00:16:30,992 --> 00:16:32,832
S2: be a cat, because a cat can be in a

307
00:16:32,832 --> 00:16:36,312
S2: million different contortions. It can have different hair, the face

308
00:16:36,312 --> 00:16:38,752
S2: can be half obscured. But what we can do is

309
00:16:38,792 --> 00:16:41,152
S2: we can describe to an AI/ML model what a

310
00:16:41,192 --> 00:16:43,712
S2: cat looks like with millions of pictures, because we have

311
00:16:43,712 --> 00:16:46,032
S2: millions of pictures of cats. And then it can do

312
00:16:46,032 --> 00:16:48,512
S2: a good job of solving that problem. Now it might

313
00:16:48,512 --> 00:16:51,192
S2: make mistakes, but this is better than the option that

314
00:16:51,192 --> 00:16:54,152
S2: you had with the traditional approach, because that approach was

315
00:16:54,152 --> 00:16:57,152
S2: awful to begin with. So a good example of a

316
00:16:57,152 --> 00:17:01,242
S2: corollary for this in Buttercup is patch generation. There's a

317
00:17:01,282 --> 00:17:03,282
S2: lot of synthetic code generation tools and a lot of

318
00:17:03,282 --> 00:17:06,202
S2: research in this area. But in terms of like automatically

319
00:17:06,202 --> 00:17:10,242
S2: generating patches to fix bugs, unless your bug is like

320
00:17:10,282 --> 00:17:13,321
S2: dead obvious, like it's missing a bounds check and it's

321
00:17:13,322 --> 00:17:15,402
S2: really easy to apply some sort of pattern matching to

322
00:17:15,442 --> 00:17:17,202
S2: figure out what the lower bound is, or the upper

323
00:17:17,202 --> 00:17:20,882
S2: bound is that needs to be checked. Um, tools to

324
00:17:20,922 --> 00:17:24,482
S2: generate patches for weird bugs. Like they just don't exist.

325
00:17:24,922 --> 00:17:27,402
S2: So this is a great place for AI/ML to help

326
00:17:27,402 --> 00:17:29,401
S2: us out. And it actually turns out, um, you know,

327
00:17:29,402 --> 00:17:31,922
S2: this is really proven true by the AI Cyber Challenge

328
00:17:31,922 --> 00:17:35,921
S2: and by Buttercup, more specifically, um, LLMs are great at

329
00:17:35,922 --> 00:17:38,402
S2: generating code, um, because it's one of the biggest value

330
00:17:38,402 --> 00:17:43,002
S2: propositions right now for the technology. So, um, generating patches

331
00:17:43,002 --> 00:17:45,602
S2: for bugs is tightly constrained. I'm not asking it

332
00:17:45,602 --> 00:17:48,561
S2: to generate all of the code that is necessary to

333
00:17:48,602 --> 00:17:51,042
S2: build this entire system that I've got a spec sheet for.

334
00:17:51,482 --> 00:17:54,121
S2: I'm only asking it given this code, and given what

335
00:17:54,122 --> 00:17:56,002
S2: we know about this vulnerability, how would you change it

336
00:17:56,002 --> 00:17:59,122
S2: to fix it? The large language models have already internalized

337
00:17:59,122 --> 00:18:03,382
S2: large numbers of incremental commits to open source code

338
00:18:03,382 --> 00:18:06,262
S2: repositories that fix bugs, so they actually have a really

339
00:18:06,262 --> 00:18:09,742
S2: good track record with, um, more than I expected, even

340
00:18:09,742 --> 00:18:13,022
S2: when we started this, uh, with generating patches. So this

341
00:18:13,022 --> 00:18:15,382
S2: is a great example of where generating a patch is

342
00:18:15,382 --> 00:18:18,462
S2: something that lends itself towards a descriptive solution and a

343
00:18:18,462 --> 00:18:23,621
S2: descriptive algorithm, uh, or an AIML algorithm versus something that's prescriptive, um,

344
00:18:23,902 --> 00:18:25,941
S2: which is fuzzing. Fuzzing is a good example of a

345
00:18:25,942 --> 00:18:28,341
S2: prescriptive solution. If you if you need to find a

346
00:18:28,342 --> 00:18:32,822
S2: vulnerability and you need a crashing input, um, you have

347
00:18:32,821 --> 00:18:35,061
S2: to be able to prove that it exists. It's really,

348
00:18:35,061 --> 00:18:37,262
S2: really hard to get an LLM to do that because

349
00:18:37,302 --> 00:18:42,262
S2: LLMs, the underlying reasoning, they don't have like data feedforward. Um,

350
00:18:42,302 --> 00:18:45,542
S2: they basically they look at source code like they look

351
00:18:45,542 --> 00:18:49,621
S2: at natural language. Natural language doesn't describe the activities of

352
00:18:49,622 --> 00:18:53,022
S2: an underlying state machine that runs on hardware after it

353
00:18:53,022 --> 00:18:55,622
S2: passes through a compiler. So like, you know, the source

354
00:18:55,622 --> 00:18:57,982
S2: code when looked at by a model. Models look at

355
00:18:57,982 --> 00:19:01,112
S2: source code in a really shallow way. Um, so when

356
00:19:01,112 --> 00:19:04,191
S2: we want to find, you know, a crashing input, a

357
00:19:04,192 --> 00:19:06,192
S2: fuzzer is a great way because we can prescribe a solution,

358
00:19:06,192 --> 00:19:10,072
S2: which is try everything, brute force it. Um, just come

359
00:19:10,071 --> 00:19:11,671
S2: up with different inputs, throw it in there, and then

360
00:19:11,672 --> 00:19:13,752
S2: if it crashes, well, there you go. You've proven it.

361
00:19:13,912 --> 00:19:16,952
S2: So that's why we used fuzzing heavily early on. You know, for

362
00:19:16,952 --> 00:19:19,192
S2: one type of problem we use patching heavily for another.

363
00:19:20,071 --> 00:19:24,552
S1: Yeah, that makes sense. And the other problem with, um,

364
00:19:25,632 --> 00:19:33,152
S1: finding vulns with, um, AI, it also seems to me that, um, they,

365
00:19:33,152 --> 00:19:35,992
S1: they want to please. They're heavily biased to be like,

366
00:19:35,992 --> 00:19:38,391
S1: this is it. This is one. Yeah. Well, this is

367
00:19:38,392 --> 00:19:40,831
S1: definitely a hit or whatever. And you look at it

368
00:19:40,872 --> 00:19:44,952
S1: and it's actually not. So I guess the intelligence is

369
00:19:44,952 --> 00:19:48,632
S1: deciding to use the fuzzer, which it could help make

370
00:19:48,632 --> 00:19:51,552
S1: that decision that a fuzzer should be used. Right.

371
00:19:52,512 --> 00:19:55,192
S2: Yeah. Yeah. So it's it's funny you bring that up.

372
00:19:55,192 --> 00:19:59,402
S2: Large language models really struggle to solve problems that aren't

373
00:19:59,442 --> 00:20:02,162
S2: rooted in some kind of ground truth. Um, it turns

374
00:20:02,162 --> 00:20:04,242
S2: out there's a huge difference there. We have some internal

375
00:20:04,242 --> 00:20:08,242
S2: research that we haven't published. Anybody could reproduce it. But, um,

376
00:20:08,282 --> 00:20:09,482
S2: so it turns out if you if you have a

377
00:20:09,482 --> 00:20:11,242
S2: bit of source code and you ask the model to

378
00:20:11,282 --> 00:20:15,522
S2: tell you where the vulnerability is, um, it will absolutely

379
00:20:15,561 --> 00:20:18,202
S2: hallucinate a vulnerability because it wants to please you. Uh,

380
00:20:18,202 --> 00:20:20,841
S2: we have one of our researchers, um, one of our

381
00:20:20,842 --> 00:20:23,522
S2: principal researchers, Artem. He's a great guy. He, um, he

382
00:20:23,522 --> 00:20:28,522
S2: downloaded the, um, formally correct, uh, the formally proven correct

383
00:20:28,522 --> 00:20:32,042
S2: portions of, uh, of Linux and asked a large language

384
00:20:32,042 --> 00:20:35,882
S2: model several hundred times. Um, here's a snippet of code.

385
00:20:35,882 --> 00:20:37,722
S2: It has a vulnerability. Where is it? And every single

386
00:20:37,722 --> 00:20:40,802
S2: time it would find, it would manufacture a vulnerability because it

387
00:20:40,802 --> 00:20:43,722
S2: wants to find the answer. So it turns out when

388
00:20:43,722 --> 00:20:46,601
S2: we started asking it, is there a vulnerability? Um, it

389
00:20:46,602 --> 00:20:49,162
S2: messed up a little less, but it would still assume

390
00:20:49,162 --> 00:20:51,921
S2: that because you're asking that there's something to find and

391
00:20:51,922 --> 00:20:54,322
S2: it would still mess up quite a bit. So that's

392
00:20:54,321 --> 00:20:57,012
S2: why when we're in the context where we're, when we're using, um,

393
00:20:57,052 --> 00:21:00,252
S2: large language models for generating patches. It's great because we

394
00:21:00,252 --> 00:21:02,411
S2: know there's a vulnerability because we found it and we

395
00:21:02,412 --> 00:21:04,571
S2: proved it, and we can collect additional information.

396
00:21:04,612 --> 00:21:05,172
S1: Yeah.

397
00:21:05,412 --> 00:21:08,532
S2: So now I don't have to worry about asking the model. Hey,

398
00:21:08,532 --> 00:21:10,611
S2: do you think there's a vulnerability? And if so, patch it.

399
00:21:10,612 --> 00:21:12,972
S2: I say no, there is a vulnerability. It's here. This

400
00:21:12,972 --> 00:21:15,772
S2: is extra information about the code that touches it. Now

401
00:21:15,772 --> 00:21:18,571
S2: generate a patch. And the model is very good at

402
00:21:18,571 --> 00:21:21,012
S2: doing that because it takes away the decision making or,

403
00:21:21,332 --> 00:21:24,092
S2: or the judgment call that large language models are really,

404
00:21:24,092 --> 00:21:27,252
S2: really bad at because they don't actually model judgment calls underneath

405
00:21:27,612 --> 00:21:31,332
S2: in their architecture. They, they model, you know, sequencing information,

406
00:21:31,571 --> 00:21:34,052
S2: sequencing tokens. And when you write code, you're writing a

407
00:21:34,052 --> 00:21:36,851
S2: sequence of tokens. So these problems tend to be, um,

408
00:21:36,892 --> 00:21:39,972
S2: a lot more suitable than other problems where you're asking

409
00:21:39,972 --> 00:21:42,292
S2: it to find the ground truth for you. Bad problems

410
00:21:42,292 --> 00:21:45,332
S2: for LLMs. Asking it to take ground truth and expand

411
00:21:45,332 --> 00:21:47,611
S2: upon it: great applications for LLMs.

412
00:21:48,012 --> 00:21:49,772
S1: Oh man, I love that. And this also goes to

413
00:21:49,772 --> 00:21:52,652
S1: your previous point of not wanting to pollute the context

414
00:21:52,652 --> 00:21:56,622
S1: for the current task on hand, which is building that patch,

415
00:21:57,222 --> 00:22:00,582
S1: because if you have like some history of like there

416
00:22:00,582 --> 00:22:04,182
S1: were previous decisions made or previous questions asked or whatever

417
00:22:04,222 --> 00:22:06,061
S1: it might get like diverted, you know?

418
00:22:06,942 --> 00:22:11,102
S2: Yeah, absolutely. It's um, it's a, it's a big challenge particularly, um,

419
00:22:11,582 --> 00:22:13,222
S2: I don't know, it's funny. I've, I've been kind of

420
00:22:13,262 --> 00:22:15,742
S2: trying to sing this gospel internally, uh, at Trail of

421
00:22:15,742 --> 00:22:18,222
S2: Bits and to other people who will listen that, um,

422
00:22:18,622 --> 00:22:22,502
S2: the increasing size of context window is not always your friend. Um,

423
00:22:23,061 --> 00:22:25,502
S2: by increasing the size of the context window. I mean,

424
00:22:25,502 --> 00:22:27,102
S2: if you think about how the large language model works

425
00:22:27,102 --> 00:22:29,702
S2: under the hood, it's using these contexts to attune the

426
00:22:29,702 --> 00:22:32,302
S2: model to certain parts of its training data that are

427
00:22:32,302 --> 00:22:35,262
S2: going to be highly relevant to solving your particular problem.

428
00:22:35,622 --> 00:22:37,862
S2: And the more words and the more tokens you put

429
00:22:37,862 --> 00:22:41,262
S2: into the context window, the more you are kind of

430
00:22:41,302 --> 00:22:46,821
S2: nulling out or, um, numbing the attention mechanism. You're forcing

431
00:22:46,821 --> 00:22:48,742
S2: it to become more and more general, because now there

432
00:22:48,782 --> 00:22:52,822
S2: are more tokens that are affecting these attuned probabilities. So

433
00:22:52,821 --> 00:22:56,192
S2: you actually are better off with using... Now, context window

434
00:22:56,232 --> 00:22:58,911
S2: is great because if you need, let's say a million,

435
00:22:59,152 --> 00:23:01,671
S2: you know, a million tokens in your context window to

436
00:23:01,712 --> 00:23:04,472
S2: constrain the problem, then use a million tokens. But if

437
00:23:04,472 --> 00:23:07,312
S2: you can do it for 1000 or 10,000, you're going

438
00:23:07,311 --> 00:23:09,831
S2: to get better results because you're more likely to focus

439
00:23:09,832 --> 00:23:11,311
S2: that model where it needs to be.

440
00:23:12,512 --> 00:23:16,311
S1: Yeah, I love this. Like, by the way, this this

441
00:23:16,352 --> 00:23:20,392
S1: this is great. This is great. Um, I'm going to

442
00:23:20,792 --> 00:23:24,912
S1: create a lot of content out of this, um, because it's,

443
00:23:24,912 --> 00:23:30,631
S1: it's really crystallizing and, like, starting to form something

444
00:23:30,632 --> 00:23:34,232
S1: in my mind. I'd love to work with you on it. Um, essentially,

445
00:23:34,232 --> 00:23:37,391
S1: what I'm trying to think of is, um, what are

446
00:23:37,392 --> 00:23:40,592
S1: some general statements that we could make? Um, one that

447
00:23:40,592 --> 00:23:42,512
S1: I'm sort of heading in the direction of, you tell

448
00:23:42,512 --> 00:23:46,872
S1: me if I'm wrong is like. And this might be

449
00:23:46,872 --> 00:23:51,032
S1: overstating it, but like, the system itself should be highly

450
00:23:51,032 --> 00:23:56,522
S1: modular and, as much as possible, made up

451
00:23:56,522 --> 00:24:01,602
S1: of traditional and deterministic tech. And then the way that

452
00:24:01,602 --> 00:24:05,082
S1: you use the AI is for the specific type of problem,

453
00:24:05,282 --> 00:24:08,121
S1: which we're going to articulate the way you articulated it

454
00:24:09,482 --> 00:24:13,762
S1: for those types of problems where routing is needed to

455
00:24:13,762 --> 00:24:18,841
S1: the traditional tech. Um, and it's like, don't just go

456
00:24:18,882 --> 00:24:22,682
S1: crazy with AI. Don't ask it questions that the traditional

457
00:24:22,682 --> 00:24:27,122
S1: tech should be answering. Um, it's something like that. And

458
00:24:27,122 --> 00:24:33,362
S1: then ultimately you have like this dependable deterministic system with

459
00:24:33,762 --> 00:24:37,081
S1: the minimum amount of AI that is required to move

460
00:24:37,402 --> 00:24:39,162
S1: appropriately through that system.

461
00:24:40,522 --> 00:24:43,561
S2: Yeah. So yeah, really it comes down to problem formulation.

462
00:24:43,561 --> 00:24:46,722
S2: And this is like the the great part about and

463
00:24:46,722 --> 00:24:48,002
S2: this is part of the reason why you see such

464
00:24:48,002 --> 00:24:50,841
S2: a huge overlap in interest between people from the computer

465
00:24:50,842 --> 00:24:53,622
S2: science background and people from like data science backgrounds on

466
00:24:53,622 --> 00:24:55,782
S2: here because, you know, one of the basic things you

467
00:24:55,782 --> 00:24:57,582
S2: learn in computer science, like when you get to like

468
00:24:57,582 --> 00:25:01,821
S2: the graduate level is problem formulation. It's how to recognize

469
00:25:02,022 --> 00:25:07,302
S2: your problem as a derivative, or maybe a like dressed

470
00:25:07,302 --> 00:25:12,102
S2: up version of some other problem. So, you know, right away, um, okay,

471
00:25:12,102 --> 00:25:13,742
S2: I have this problem of, okay, I've got to manage

472
00:25:13,742 --> 00:25:16,742
S2: this delivery system. How do I make this delivery system, um,

473
00:25:16,742 --> 00:25:20,022
S2: for Amazon efficient? You can recognize this right away as, oh,

474
00:25:20,022 --> 00:25:22,782
S2: this is traveling salesman. There's no good way to do this.

475
00:25:22,821 --> 00:25:24,222
S2: But what I can do is I can. I'm going

476
00:25:24,262 --> 00:25:26,582
S2: to get a good answer. I just have to accept

477
00:25:26,942 --> 00:25:29,142
S2: that my answer is going to be imprecise or not

478
00:25:29,142 --> 00:25:33,742
S2: necessarily optimal. Um, and in applying AI and ML to

479
00:25:33,742 --> 00:25:37,342
S2: security problems or any problem in general, the first step

480
00:25:37,342 --> 00:25:40,782
S2: is very much like problem formulation. It's understanding what kind

481
00:25:40,782 --> 00:25:42,621
S2: of model is going to work best for this problem,

482
00:25:42,662 --> 00:25:45,742
S2: because is this a problem that will work well with

483
00:25:45,742 --> 00:25:47,661
S2: a time series model, because my data is coming in

484
00:25:47,662 --> 00:25:49,861
S2: over time, or is this a model that's going to

485
00:25:49,862 --> 00:25:54,992
S2: work well with, um, let's say like a, like linear regression,

486
00:25:54,992 --> 00:25:59,472
S2: because there is some true underlying probability for how the

487
00:25:59,472 --> 00:26:02,152
S2: data is distributed that I'm trying to learn from. One

488
00:26:02,152 --> 00:26:05,432
S2: of like the kind of curses of large language models

489
00:26:05,752 --> 00:26:09,032
S2: is that they have abstracted all of this good data

490
00:26:09,032 --> 00:26:12,592
S2: science practice, all these good data science practices away. And

491
00:26:12,592 --> 00:26:15,992
S2: now it's great because it democratizes it. Anybody can use AI,

492
00:26:16,032 --> 00:26:18,071
S2: anybody can use an LLM. And all you have to

493
00:26:18,071 --> 00:26:20,272
S2: do is be able to articulate your problem. The problem is,

494
00:26:20,272 --> 00:26:23,232
S2: is that it also abstracts away problem formulation. And now

495
00:26:23,232 --> 00:26:26,311
S2: we're starting to use LLMs because they're accessible for certain

496
00:26:26,311 --> 00:26:29,831
S2: types of problems that they're really not well formulated for. Um.

497
00:26:30,672 --> 00:26:31,272
S1: Yeah.

498
00:26:31,432 --> 00:26:33,391
S2: So this is this is kind of where we get

499
00:26:33,392 --> 00:26:35,712
S2: to the issue. So the good news is we don't

500
00:26:35,712 --> 00:26:38,152
S2: have to just like say, okay, well, I can't do

501
00:26:38,192 --> 00:26:40,232
S2: problem formulation with an LLM, so I just throw it away.

502
00:26:40,232 --> 00:26:42,111
S2: Don't use it. I have to go back to, you know,

503
00:26:42,152 --> 00:26:44,431
S2: TensorFlow and writing my own models and stuff. What we

504
00:26:44,432 --> 00:26:46,592
S2: really have to do is get to what you were describing,

505
00:26:46,912 --> 00:26:50,282
S2: which is rather than throw the LLM at a large problem.

506
00:26:50,282 --> 00:26:52,722
S2: We take it a step further. We break the problem down.

507
00:26:52,722 --> 00:26:56,002
S2: Are there subproblems that are highly amenable to AI solutions?

508
00:26:56,242 --> 00:26:58,762
S2: I have a litmus test that I, that I pass, um,

509
00:26:58,802 --> 00:27:01,042
S2: you know, problems through. And I try to encourage my

510
00:27:01,042 --> 00:27:04,802
S2: team members to use, um, which is, you know, basically

511
00:27:04,802 --> 00:27:06,482
S2: like a check to see whether a problem is good

512
00:27:06,482 --> 00:27:09,042
S2: for AIML. And it's usually, you know, do you have

513
00:27:09,042 --> 00:27:11,802
S2: enough data in the model that you can train? In

514
00:27:11,802 --> 00:27:13,722
S2: this case, it now becomes: does the

515
00:27:13,722 --> 00:27:15,722
S2: LLM have examples of this on the internet that it

516
00:27:15,722 --> 00:27:17,841
S2: can draw from, or are you asking it to do

517
00:27:17,842 --> 00:27:23,282
S2: something like reverse engineering, you know, firmware code on this

518
00:27:23,282 --> 00:27:26,162
S2: obscure chipset that like there's no examples on the internet,

519
00:27:26,162 --> 00:27:29,282
S2: bad example. It won't have, it won't have

520
00:27:29,282 --> 00:27:32,562
S2: anything to draw from. Number two, um, is there some

521
00:27:32,561 --> 00:27:36,361
S2: probabilistic nature to the data that's underlying? This actually

522
00:27:36,362 --> 00:27:38,401
S2: makes large language models really bad for a lot of

523
00:27:38,402 --> 00:27:42,722
S2: security problems, because those problems are what we call non-differentiable, meaning that

524
00:27:42,722 --> 00:27:45,082
S2: they don't have like this nice curved space that you

525
00:27:45,082 --> 00:27:49,852
S2: can use stochastic gradient descent or virtually any optimization function

526
00:27:49,852 --> 00:27:52,052
S2: to try and climb and find a good answer for.

527
00:27:52,172 --> 00:27:54,052
S2: It actually exists more as like this kind of cloud

528
00:27:54,052 --> 00:27:56,012
S2: with dots of answers all over the place. If you

529
00:27:56,012 --> 00:27:58,811
S2: were to try and imagine the answers to security questions

530
00:27:59,132 --> 00:28:01,252
S2: in like a mathematical graph.

531
00:28:01,732 --> 00:28:04,772
S1: Okay, what's an example of what's an example of one

532
00:28:04,772 --> 00:28:06,612
S1: of those? I'm, I'm trying to think of what that

533
00:28:06,612 --> 00:28:07,692
S1: space might look like.

534
00:28:08,172 --> 00:28:10,212
S2: Yeah. So a good example of like a problem that

535
00:28:10,212 --> 00:28:14,532
S2: is differentiable is like housing prices. So housing prices vary by,

536
00:28:14,571 --> 00:28:17,851
S2: you know, like the size by square footage. Yeah. Square footage,

537
00:28:17,852 --> 00:28:20,931
S2: number of rooms, zip code, quality of the schools. So

538
00:28:20,932 --> 00:28:22,691
S2: when you plot these all out you get something that

539
00:28:22,692 --> 00:28:24,732
S2: you can do linear regression on. You can see like.

540
00:28:24,732 --> 00:28:24,932
S1: A.

541
00:28:25,132 --> 00:28:28,052
S2: Little loop. And that's called a differentiable function because it's

542
00:28:28,052 --> 00:28:31,052
S2: a continuous line that you can draw through the data

543
00:28:31,052 --> 00:28:33,212
S2: that more or less minimizes the error of those points

544
00:28:33,212 --> 00:28:34,012
S2: along the line.

545
00:28:34,252 --> 00:28:34,611
S1: Yep.

546
00:28:35,132 --> 00:28:37,732
S2: But if we want to think about, um, let's say

547
00:28:37,772 --> 00:28:40,332
S2: now optimizing a program, we can take a look at

548
00:28:40,332 --> 00:28:45,532
S2: how ordering certain steps or changing the way we implement

549
00:28:45,532 --> 00:28:48,342
S2: certain functions as changing the speed of a program up

550
00:28:48,342 --> 00:28:52,782
S2: and down, and that becomes kind of pseudo differentiable. It's

551
00:28:52,782 --> 00:28:54,382
S2: it's more like a step function where you have kind

552
00:28:54,382 --> 00:28:56,502
S2: of like little lines where if I change this one thing,

553
00:28:56,502 --> 00:28:59,262
S2: it jumps up a little bit, it's more jagged, but

554
00:28:59,302 --> 00:29:03,022
S2: there's still, um, it's close to differentiable because I can

555
00:29:03,062 --> 00:29:06,662
S2: kind of map deterministically how if I run it on,

556
00:29:06,982 --> 00:29:09,622
S2: you know, with this set of compiler optimizations or that.

556
00:29:09,662 --> 00:29:13,102
S2: It's definitely not differentiable, but it's closer. Security is just

558
00:29:13,102 --> 00:29:17,622
S2: wild because the flaws in computer programs can come from

559
00:29:17,622 --> 00:29:19,382
S2: one of a million different sources. It can be a

560
00:29:19,382 --> 00:29:22,142
S2: logic bug, it can be a misimplemented function. It

561
00:29:22,142 --> 00:29:23,982
S2: can be the use of an unsafe function, which is

562
00:29:23,982 --> 00:29:27,702
S2: easy to find. There's no way for us to take, um,

563
00:29:28,502 --> 00:29:32,262
S2: root causes for vulnerabilities in software and solutions to them

564
00:29:32,422 --> 00:29:35,062
S2: and plot them on a graph. Because they come from

565
00:29:35,702 --> 00:29:39,502
S2: they come from unquantifiable sources. Some of them like, you know,

566
00:29:39,542 --> 00:29:42,982
S2: Spectre and Meltdown and stuff. They they're resident in hardware

567
00:29:43,222 --> 00:29:45,942
S2: and the implementation there. Some are purely in software like

568
00:29:45,992 --> 00:29:50,312
S2: XSS-type vulnerabilities. We can't, they don't, they're, it's, um,

569
00:29:50,352 --> 00:29:52,272
S2: it's not even apples and oranges. It's like trying to

570
00:29:52,272 --> 00:29:55,512
S2: compare apples and fighter jets. Um.

571
00:29:56,752 --> 00:29:59,112
S1: Is it, is it a matter of, like the, the

572
00:29:59,272 --> 00:30:03,792
S1: tensor size or the, um, I think that's called tensor size.

573
00:30:03,792 --> 00:30:07,752
S1: I can't remember the, the, um, the number of dimensions

574
00:30:07,752 --> 00:30:10,112
S1: in the space, because when you're looking at square footage

575
00:30:10,112 --> 00:30:13,592
S1: and price, what, you have two, right? Is it the

576
00:30:13,592 --> 00:30:18,472
S1: problem in security that it's just so many dimensions that, um,

577
00:30:18,472 --> 00:30:20,912
S1: when you try to plot it, you try to simplify it,

578
00:30:20,952 --> 00:30:22,192
S1: it just becomes garbage.

579
00:30:22,912 --> 00:30:25,072
S2: Well, it's a matter of common dimensions. So if you

580
00:30:25,072 --> 00:30:28,112
S2: build a house, every house has square footage.

581
00:30:28,552 --> 00:30:29,192
S1: There you go.

582
00:30:29,712 --> 00:30:32,272
S2: And you can calculate the space underneath. But a cross

583
00:30:32,312 --> 00:30:36,792
S2: site request forgery vulnerability in a, um, you know, piece

584
00:30:36,792 --> 00:30:39,552
S2: of JavaScript code that exists on the web has almost

585
00:30:39,552 --> 00:30:42,952
S2: nothing in common with a memory corruption vulnerability in a

586
00:30:42,992 --> 00:30:47,762
S2: C program running on a router in your home device.

587
00:30:48,082 --> 00:30:51,802
S2: They are implemented at different levels of abstraction. You know,

588
00:30:51,842 --> 00:30:54,482
S2: like even the program representations are different because some of

589
00:30:54,482 --> 00:30:57,122
S2: the vulnerabilities might exist only in binary code after it's

590
00:30:57,122 --> 00:31:02,362
S2: been compiled versus other vulnerabilities that are resident in source

591
00:31:02,362 --> 00:31:06,162
S2: code that's interpreted via web browser. Um, so really what

592
00:31:06,162 --> 00:31:07,962
S2: it is, is it's like trying to it's like trying

593
00:31:07,962 --> 00:31:10,682
S2: to plot, you know, the prices of homes, along with

594
00:31:11,002 --> 00:31:14,962
S2: the prices of, um, I don't know, oranges in a

595
00:31:14,962 --> 00:31:18,722
S2: particular year. You know, there's very little in common between

596
00:31:18,762 --> 00:31:21,802
S2: a house and an orange other than maybe some, like,

597
00:31:21,842 --> 00:31:25,402
S2: you know, global macro effects that might show some correlation,

598
00:31:25,802 --> 00:31:28,122
S2: you know. You know, economic factors like inflation.

599
00:31:28,522 --> 00:31:31,202
S1: Or like the beating of a whale's heart to determine

600
00:31:31,202 --> 00:31:35,962
S1: whether or not it's healthy. It's it's like completely different. Uh, yeah.

601
00:31:36,002 --> 00:31:39,161
S1: Completely different sports. Yeah. Yeah, yeah.

602
00:31:39,522 --> 00:31:40,962
S2: Yeah. So, so really, it's a it's a lack of

603
00:31:40,962 --> 00:31:43,682
S2: common dimensions in cybersecurity, which is why, you know, if

604
00:31:43,682 --> 00:31:45,732
S2: we think about like if we were trying to model,

605
00:31:45,772 --> 00:31:47,532
S2: like what the data would look like, if we could

606
00:31:47,532 --> 00:31:50,012
S2: visualize it, it would just be a bunch of points

607
00:31:50,012 --> 00:31:54,332
S2: of presence out there. Um, uh, within this, like, kind

608
00:31:54,332 --> 00:31:57,332
S2: of large cloud. Um, and even then, that's another problem

609
00:31:57,332 --> 00:31:59,692
S2: that kind of makes cybersecurity really hard to model with

610
00:31:59,692 --> 00:32:05,092
S2: AIML is that there is really comparatively little data, um,

611
00:32:05,692 --> 00:32:07,412
S2: in terms of like the volume of data, there's tons

612
00:32:07,412 --> 00:32:09,452
S2: of vulnerabilities out there. But if you're trying to make

613
00:32:09,452 --> 00:32:13,732
S2: a model that's really, really good at, let's say, detecting, um,

614
00:32:14,052 --> 00:32:17,692
S2: buffer overflows in embedded device code, um, you're going to

615
00:32:17,692 --> 00:32:19,452
S2: find some data for that, but there's not that much.

616
00:32:19,452 --> 00:32:21,492
S2: You have to rely on, like, PoC write-ups on,

617
00:32:21,492 --> 00:32:23,572
S2: on the internet for practitioners who put it out there

618
00:32:23,572 --> 00:32:27,412
S2: for fun. Um, but there's not a million of examples

619
00:32:27,412 --> 00:32:28,772
S2: of that, like there is if you want to say,

620
00:32:28,772 --> 00:32:30,732
S2: I want to train a model to write the Great

621
00:32:30,732 --> 00:32:33,972
S2: American novel, there you can take you can take every

622
00:32:33,972 --> 00:32:36,132
S2: novel ever written, throw it in there and then see

623
00:32:36,132 --> 00:32:38,292
S2: what the model comes up with. If you prompt it

624
00:32:38,292 --> 00:32:40,052
S2: with like a general plot line, it's going to do

625
00:32:40,052 --> 00:32:42,412
S2: a lot better at that because, you know, that data

626
00:32:42,412 --> 00:32:48,312
S2: fills in that space a lot more. Um, so so, yeah, it's, um. Yeah.

627
00:32:48,352 --> 00:32:50,992
S2: Like the, the, the challenges in problem formulation are, are

628
00:32:50,992 --> 00:32:53,112
S2: really big and, um, yeah, that's why I kind of

629
00:32:53,152 --> 00:32:55,752
S2: encourage people when they look at these like, okay, I

630
00:32:55,752 --> 00:32:58,552
S2: want to build an AI, ML driven system. Um, take

631
00:32:58,552 --> 00:33:01,312
S2: a look at what subproblems are actually suitable for AIML.

632
00:33:01,592 --> 00:33:03,352
S2: Use them there. And I think you'll also find that

633
00:33:03,352 --> 00:33:05,472
S2: a lot of the times we have a tendency to

634
00:33:05,512 --> 00:33:08,152
S2: like say, okay, let's just kind of throw large language

635
00:33:08,152 --> 00:33:09,592
S2: models at some of these problems that we know we

636
00:33:09,592 --> 00:33:13,312
S2: could really solve with regular code. Um, and that's really

637
00:33:13,312 --> 00:33:16,191
S2: bad because of this compounding error problem. So, you know,

638
00:33:16,232 --> 00:33:18,432
S2: if I have, you know, five steps in sequence that I've

639
00:33:18,432 --> 00:33:20,232
S2: got to do, and step three is good for AIML

640
00:33:20,232 --> 00:33:23,272
S2: and step four is good for AIML. You know, like

641
00:33:23,272 --> 00:33:25,352
S2: it's like, okay, well, look, almost half of this problem is,

642
00:33:25,512 --> 00:33:26,792
S2: you know, is something I'm going to ask the model

643
00:33:26,792 --> 00:33:28,352
S2: to do anyway. I'll just ask it to do one,

644
00:33:28,352 --> 00:33:30,712
S2: two and five too. Well, the problem is it can

645
00:33:30,712 --> 00:33:32,352
S2: make a mistake in one. It can make a mistake

646
00:33:32,352 --> 00:33:34,632
S2: in two. Those compound before you get to three and four.

647
00:33:34,832 --> 00:33:37,912
S2: So you're better off, you know, implementing one and two in code.

648
00:33:37,952 --> 00:33:39,912
S2: And then maybe you ask the model just to finish

649
00:33:39,912 --> 00:33:43,082
S2: it off and do step five because it's the final step.

650
00:33:43,082 --> 00:33:46,442
S2: It's had ground truth rooted in steps one and two. Steps

651
00:33:46,442 --> 00:33:49,482
S2: three and four, if they're well contextualized problems, maybe the

652
00:33:49,482 --> 00:33:52,082
S2: false positive rate is low enough that you can afford

653
00:33:52,082 --> 00:33:53,642
S2: to just let the model kind of finish it up

654
00:33:53,642 --> 00:33:56,442
S2: for you. But that's the biggest that's the biggest jump

655
00:33:56,442 --> 00:33:59,722
S2: I would take. Usually that step five is like validation

656
00:33:59,722 --> 00:34:03,802
S2: or correctness. Um, checking. And that's not something you want

657
00:34:03,802 --> 00:34:06,522
S2: to ask the model to do because it's, it's it's

658
00:34:07,242 --> 00:34:11,082
S2: it has the tendency to, um, one, be wanting to

659
00:34:11,122 --> 00:34:13,162
S2: kind of like please itself and say, oh yeah, it

660
00:34:13,162 --> 00:34:17,282
S2: looks great to me. Um, or two, um, depending on

661
00:34:17,282 --> 00:34:20,242
S2: how you phrase it, find something that doesn't exist. And

662
00:34:20,362 --> 00:34:23,482
S2: validation is a problem that typically is, uh, is pretty

663
00:34:23,522 --> 00:34:25,442
S2: amenable to like deterministic code.

664
00:34:27,042 --> 00:34:33,602
S1: So I really love this. Um. Where this is taking

665
00:34:33,602 --> 00:34:38,322
S1: me is designing, like, a, uh, a general problem solver.

666
00:34:38,922 --> 00:34:43,852
S1: And I'm imagining, like, the smartest model that you have.

667
00:34:43,892 --> 00:34:47,772
S1: You know, Opus, whatever. Or, like, the best Gemini or

668
00:34:47,772 --> 00:34:50,612
S1: whatever or whatever the best model is. But but then

669
00:34:50,612 --> 00:34:54,092
S1: what you do is you say, okay, uh, the problem

670
00:34:54,092 --> 00:34:59,972
S1: is we need to design a system that, uh, you know, properly,

671
00:34:59,972 --> 00:35:03,652
S1: deterministically solves this problem with a high level of accuracy.

672
00:35:04,252 --> 00:35:06,972
S1: For example, the vulnerability problem that you guys worked on.

673
00:35:07,372 --> 00:35:11,332
S1: And then what I love is the idea of you

674
00:35:11,372 --> 00:35:15,932
S1: present to the model all these different AI models and

675
00:35:15,932 --> 00:35:20,852
S1: all these different deterministic technologies, all as solutions. And then

676
00:35:20,852 --> 00:35:25,452
S1: you do what you said, which is you, um, break

677
00:35:25,452 --> 00:35:28,652
S1: down the problems that need to be solved at every

678
00:35:28,652 --> 00:35:33,852
S1: level of the subpieces. Right. And then you match each

679
00:35:33,852 --> 00:35:38,732
S1: of those little problems to either one or, uh, one

680
00:35:38,732 --> 00:35:42,022
S1: or many of these AIs, which are bigger or smaller,

681
00:35:42,062 --> 00:35:45,262
S1: have different weaknesses or whatever, or even ML, not even

682
00:35:45,302 --> 00:35:51,142
S1: LLM based. Yeah. Versus deterministic with the rule of like look,

683
00:35:51,182 --> 00:35:56,702
S1: use the appropriate one for this problem type. And then

684
00:35:56,702 --> 00:35:59,582
S1: maybe you have a whole bunch of training about problem

685
00:35:59,582 --> 00:36:03,742
S1: types and solution types. And then it picks which one

686
00:36:03,742 --> 00:36:07,382
S1: to use for each step. I mean is that.

687
00:36:08,102 --> 00:36:09,542
S2: You mentioned this. I think this is what some of

688
00:36:09,542 --> 00:36:11,862
S2: like the large, you know, third party ML as a

689
00:36:11,862 --> 00:36:14,342
S2: service providers like OpenAI and Anthropic are kind of trying

690
00:36:14,342 --> 00:36:16,422
S2: to do. If you've heard of like this concept of

691
00:36:16,422 --> 00:36:19,942
S2: like mixture of experts models, um, it's uh.

692
00:36:19,942 --> 00:36:20,462
S1: That's true.

693
00:36:20,662 --> 00:36:22,622
S2: Yeah. It's this concept where, you know, like, you know,

694
00:36:22,662 --> 00:36:25,062
S2: like the, the actual interface we have to maybe

695
00:36:25,102 --> 00:36:27,462
S2: GPT-5, and I haven't looked at the source code.

696
00:36:27,462 --> 00:36:28,942
S2: I don't work at OpenAI, so I have no idea

697
00:36:28,942 --> 00:36:30,622
S2: if this works underneath the hood, but it's been kind

698
00:36:30,622 --> 00:36:33,542
S2: of theorized and it's even been mentioned, you know, a

699
00:36:33,542 --> 00:36:36,182
S2: bit in terms of, um, you know, people who've kind

700
00:36:36,182 --> 00:36:38,422
S2: of looked at the models a little bit closer that,

701
00:36:38,462 --> 00:36:40,352
S2: you know, um, you know, when we, when we, we

702
00:36:40,392 --> 00:36:41,992
S2: fine tune a model to make it really good or

703
00:36:41,992 --> 00:36:44,872
S2: really suitable for a particular purpose that's amenable to AI/ML,

704
00:36:45,232 --> 00:36:49,232
S2: it can still be challenging to, um, have it interface

705
00:36:49,232 --> 00:36:50,912
S2: with the user in the way that like a high

706
00:36:50,912 --> 00:36:54,472
S2: quality chatbot would. So using yeah, a mixture of experts

707
00:36:54,472 --> 00:36:56,912
S2: models suggests that like having like an interface, like a

708
00:36:57,432 --> 00:37:01,192
S2: bot that interacts with the user but then recognizes certain

709
00:37:01,192 --> 00:37:04,392
S2: classes of problems and directs them to the right expert. So, oh,

710
00:37:04,432 --> 00:37:07,192
S2: they're asking me about cyber. I'll ask, you know, um,

711
00:37:08,072 --> 00:37:11,112
S2: cyber GPT to handle this one. Oh, they're asking about,

712
00:37:11,392 --> 00:37:14,392
S2: you know, mental health, I'll ask, you know, mental health

713
00:37:14,392 --> 00:37:19,552
S2: GPT to to help out here. Um, so, you know,

714
00:37:19,592 --> 00:37:22,672
S2: this kind of like concept I think is I think

715
00:37:22,672 --> 00:37:24,992
S2: it's trying to be created, or at least it's been

716
00:37:24,992 --> 00:37:27,392
S2: thought of, um, in terms of using like all AI/ML

717
00:37:27,392 --> 00:37:30,192
S2: solutions. But but yeah, I agree, like the way

718
00:37:30,232 --> 00:37:32,992
S2: forward is to have, um, you know, for, for like

719
00:37:32,992 --> 00:37:38,482
S2: rapid like prototype development have like components that do certain things. Well, um,

720
00:37:38,522 --> 00:37:40,842
S2: and honestly, it's like reflected in software, like we have

721
00:37:40,842 --> 00:37:44,522
S2: libraries for, we have libraries for sorting. No one or

722
00:37:44,562 --> 00:37:47,562
S2: we have libraries for cryptography. Nobody should be writing their

723
00:37:47,562 --> 00:37:50,962
S2: own cryptography code. Use a library. Um, you know, the

724
00:37:50,962 --> 00:37:54,882
S2: closer these high quality libraries and, um, fine tuned ML

725
00:37:54,882 --> 00:37:58,282
S2: applications or ML models for certain types of subproblems, the

726
00:37:58,282 --> 00:37:59,882
S2: closer we get to being able to kind of compose

727
00:37:59,882 --> 00:38:01,962
S2: all these together. And the good thing is, is that

728
00:38:01,962 --> 00:38:03,882
S2: an LLM is probably pretty good at writing the glue code

729
00:38:03,922 --> 00:38:05,362
S2: to sequence all this stuff together.

730
00:38:06,362 --> 00:38:09,082
S1: Yeah, yeah. Because because that's the trick for me. Because

731
00:38:09,122 --> 00:38:12,122
S1: inside of a mixture of experts, you're already inside the LLM.

732
00:38:12,442 --> 00:38:15,042
S1: What I'm thinking of this higher level model is like, look,

733
00:38:15,082 --> 00:38:18,282
S1: we're doing it. We're doing, um, matrix math over here.

734
00:38:18,602 --> 00:38:22,962
S1: We're doing multiplication over here. Um, guess what? This problem

735
00:38:22,962 --> 00:38:26,442
S1: space is not associated with an AI. We don't even,

736
00:38:26,482 --> 00:38:29,042
S1: no AI will ever touch this. We hand it to

737
00:38:29,042 --> 00:38:34,922
S1: our fastest and best, you know, deterministic addition function or whatever,

738
00:38:34,962 --> 00:38:38,572
S1: you know, and it's like maybe 95% of the whole

739
00:38:38,572 --> 00:38:41,372
S1: app ends up being traditional tech that doesn't involve AI,

740
00:38:41,412 --> 00:38:43,092
S1: other than the routing to get there.
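
The routing idea S1 describes can be sketched roughly as below. This is only an illustration, not how Buttercup or any provider's system works: the names `DETERMINISTIC_SOLVERS`, `model_fallback`, and `route` are hypothetical, and the model call is a stub. The point is that a subproblem goes to a deterministic function whenever one exists, and only otherwise reaches a model.

```python
from typing import Any, Callable, Dict, Tuple


def add(a: float, b: float) -> float:
    """Deterministic addition: no model involved."""
    return a + b


def multiply(a: float, b: float) -> float:
    """Deterministic multiplication: no model involved."""
    return a * b


# Hypothetical registry mapping a problem type to a deterministic solver.
DETERMINISTIC_SOLVERS: Dict[str, Callable[..., Any]] = {
    "addition": add,
    "multiplication": multiply,
    "sort": sorted,
}


def model_fallback(kind: str, args: Tuple[Any, ...]) -> str:
    """Stub standing in for a call to an LLM or other ML model (assumption)."""
    return f"[model would handle {kind!r} with {len(args)} argument(s)]"


def route(kind: str, *args: Any) -> Tuple[str, Any]:
    """Send a subproblem to a deterministic solver if one is registered,
    otherwise hand it to the model fallback."""
    solver = DETERMINISTIC_SOLVERS.get(kind)
    if solver is not None:
        return ("deterministic", solver(*args))
    return ("model", model_fallback(kind, args))


if __name__ == "__main__":
    print(route("addition", 2, 3))
    print(route("sort", [3, 1, 2]))
    print(route("summarize", "some free-form text"))
```

In a sketch like this, most of the application is ordinary code, and the only place a model is involved is deciding the problem type and handling whatever has no registered solver.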

741
00:38:43,932 --> 00:38:45,372
S2: Yeah, I mean, that would be ideal. I mean, anything

742
00:38:45,372 --> 00:38:49,852
S2: you can route, anything. Anything you can. Yeah, I don't know.

743
00:38:49,852 --> 00:38:51,972
S2: It's funny. It's like really what it comes down to

744
00:38:52,052 --> 00:38:57,412
S2: is like using large language models and like, solving large problems.

745
00:38:57,412 --> 00:39:00,692
S2: It becomes a conditional probability problem. And even if you

746
00:39:00,692 --> 00:39:03,572
S2: have the answer, get the right answer right at 99%

747
00:39:03,572 --> 00:39:07,572
S2: of the time. Um, over and over and over again,

748
00:39:08,252 --> 00:39:11,052
S2: you still have a high likelihood of failure by the

749
00:39:11,052 --> 00:39:13,892
S2: time you compute all the conditional probability out. It's kind

750
00:39:13,932 --> 00:39:15,812
S2: of funny. Like, I kind of learned this lesson in like,

751
00:39:15,852 --> 00:39:19,052
S2: in a completely different walk of life. Um, after I

752
00:39:19,052 --> 00:39:22,332
S2: got my bachelor's degree in CS, I, I worked for

753
00:39:22,332 --> 00:39:27,132
S2: like a year doing, um, software engineering and kind of

754
00:39:27,172 --> 00:39:30,852
S2: found it to be dull, so I, I, I did

755
00:39:30,852 --> 00:39:33,252
S2: something completely different. I joined the Army and I started

756
00:39:33,252 --> 00:39:37,432
S2: flying helicopters. Um, it's actually nice. That is, that's actually,

757
00:39:37,472 --> 00:39:39,712
S2: you know, I'm up at Camp Dwyer in

758
00:39:39,712 --> 00:39:43,152
S2: RC Southwest in Afghanistan. It's, um, picture was taken of

759
00:39:43,152 --> 00:39:45,792
S2: our aircraft on the flight line, and one of my

760
00:39:45,792 --> 00:39:48,632
S2: jobs as a pilot was to educate our junior pilots

761
00:39:48,632 --> 00:39:52,232
S2: on this concept of, like, mission survivability. Um, and that's

762
00:39:52,232 --> 00:39:55,192
S2: the idea that, um, you know, understanding what's called, like,

763
00:39:55,192 --> 00:39:57,192
S2: the kill chain. The kill chain has been pretty popularized

764
00:39:57,192 --> 00:40:00,392
S2: in security as well. But, you know, basically for a

765
00:40:00,432 --> 00:40:03,312
S2: for a compromise, whether it's shooting down an aircraft or

766
00:40:03,312 --> 00:40:05,432
S2: breaching a database, like a lot of things have to

767
00:40:05,472 --> 00:40:08,032
S2: happen and they all have some sort of probability. And

768
00:40:08,032 --> 00:40:10,312
S2: your goal in breaking the kill chain or breaking the

769
00:40:10,312 --> 00:40:13,672
S2: exploitation chain is to reduce any one probability down to zero,

770
00:40:13,912 --> 00:40:18,832
S2: because then the common or the conditional probability problem becomes zero. Um,

771
00:40:18,832 --> 00:40:20,752
S2: but the probabilities can be really weird. I used to

772
00:40:20,752 --> 00:40:22,712
S2: talk to my junior pilots and ask them like, hey,

773
00:40:22,992 --> 00:40:25,632
S2: what do you think is like the acceptable loss rate

774
00:40:25,632 --> 00:40:28,432
S2: on any of the missions that we fly here in theater?

775
00:40:28,472 --> 00:40:30,432
S2: And they would usually give me answers like they were

776
00:40:30,432 --> 00:40:35,002
S2: pretty close. They'd say like 90% or 95% or even 99%.

777
00:40:35,922 --> 00:40:37,562
S2: So I would actually take them to the math problem.

778
00:40:37,562 --> 00:40:39,681
S2: I'd get out the whiteboard and I'd say, okay, let's

779
00:40:39,682 --> 00:40:42,882
S2: assume it's 99%. I say, okay, how many aircraft are

780
00:40:42,882 --> 00:40:44,962
S2: we flying a day? Okay. You know, we have ten

781
00:40:44,962 --> 00:40:47,602
S2: total aircraft. We go on five missions a day. So

782
00:40:47,602 --> 00:40:50,242
S2: that's five aircraft are going out there. And let's say

783
00:40:50,242 --> 00:40:52,162
S2: there's only a 1% chance that each one of them

784
00:40:52,162 --> 00:40:54,162
S2: gets shot down. Okay. So that's five aircraft a day.

785
00:40:54,162 --> 00:40:55,562
S2: But we're going to be in we're going to be

786
00:40:55,562 --> 00:40:58,122
S2: in theater for for nine months. We'll round it off.

787
00:40:58,122 --> 00:40:59,882
S2: We'll make it a year. We're going to be here

788
00:40:59,882 --> 00:41:03,722
S2: for 365 days. So now if I take 365 and

789
00:41:03,762 --> 00:41:06,642
S2: multiply it by five, that's the number of

790
00:41:06,642 --> 00:41:09,762
S2: missions we're flying in the entire time we're here. This

791
00:41:09,762 --> 00:41:11,482
S2: number comes out to be pretty high. And now all

792
00:41:11,482 --> 00:41:14,802
S2: of a sudden, if I lose one aircraft for every 100,

793
00:41:14,842 --> 00:41:17,162
S2: you realize that I actually run out of aircraft in

794
00:41:17,162 --> 00:41:20,162
S2: the first two months of being in theater.

795
00:41:20,162 --> 00:41:21,922
S2: And now all of a sudden, the troops don't have,

796
00:41:22,082 --> 00:41:23,402
S2: don't have helicopters to fly.
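
The whiteboard arithmetic can be sketched as follows, using the numbers from the conversation (ten aircraft, five sorties a day, a 1% loss chance per sortie, 365 days). The function names are hypothetical, and the model is deliberately simple: it works in expected values and assumes the sortie rate stays constant. The spoken figures are rough, so the exact outputs matter less than the fact that expected losses exceed the whole fleet.

```python
FLEET_SIZE = 10          # total aircraft, from the conversation
SORTIES_PER_DAY = 5      # aircraft flying each day
DAYS_IN_THEATER = 365    # rounded up to a year, as in the conversation
LOSS_PROBABILITY = 0.01  # assumed 1% chance of loss per sortie


def total_sorties(sorties_per_day: int, days: int) -> int:
    """Number of sorties flown over the whole deployment."""
    return sorties_per_day * days


def expected_losses(sorties: int, loss_probability: float) -> float:
    """Expected number of aircraft lost across all sorties."""
    return sorties * loss_probability


def expected_days_to_deplete(fleet: int, sorties_per_day: int,
                             loss_probability: float) -> float:
    """Days until expected losses equal the fleet size (simplified model)."""
    return fleet / (sorties_per_day * loss_probability)


if __name__ == "__main__":
    sorties = total_sorties(SORTIES_PER_DAY, DAYS_IN_THEATER)
    print("sorties flown:", sorties)
    print("expected losses:", round(expected_losses(sorties, LOSS_PROBABILITY), 2))
    print("expected days to deplete fleet:",
          round(expected_days_to_deplete(FLEET_SIZE, SORTIES_PER_DAY,
                                         LOSS_PROBABILITY)))
```

Under these exact assumptions the year holds 1,825 sorties and about 18 expected losses against a fleet of ten, so the fleet is exhausted in expectation well before the deployment ends.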

797
00:41:23,402 --> 00:41:23,762
S1: Yeah.

798
00:41:24,122 --> 00:41:26,762
S2: So I said, actually, believe it or not, our our

799
00:41:26,842 --> 00:41:32,962
S2: acceptable loss rate is something more like 99.99999%. Um, we

800
00:41:32,962 --> 00:41:35,252
S2: can almost never lose an aircraft because. Or we can

801
00:41:35,252 --> 00:41:38,452
S2: almost never accept any type of probability. That means we

802
00:41:38,452 --> 00:41:41,052
S2: have even a remote chance of losing an aircraft because

803
00:41:41,052 --> 00:41:44,092
S2: we will deplete them. It's a limited resource. Um, solving

804
00:41:44,092 --> 00:41:46,372
S2: problems with LLMs is the same way. If you ask

805
00:41:46,372 --> 00:41:48,972
S2: them to solve 15 problems in a row, even if

806
00:41:48,972 --> 00:41:52,372
S2: it's got a 99% chance, which is which would be

807
00:41:52,372 --> 00:41:55,212
S2: amazing if any LLM could get anywhere close to that,

808
00:41:55,652 --> 00:41:57,812
S2: even if it has a 99% chance of answering every

809
00:41:57,812 --> 00:42:01,452
S2: single problem right over the course of a year, it's

810
00:42:01,452 --> 00:42:04,932
S2: probably going to give you answers that are wrong almost 80%

811
00:42:04,932 --> 00:42:07,652
S2: of the time if that chain is long enough. And

812
00:42:07,652 --> 00:42:09,572
S2: if you have enough problems that you feed through it.

813
00:42:10,012 --> 00:42:13,252
S2: So that's one thing I try to like, um, help

814
00:42:13,252 --> 00:42:16,972
S2: people conceptualize about over-relying on large language models, and try

815
00:42:16,972 --> 00:42:20,412
S2: to help them understand this, like compounding error problem. It's

816
00:42:20,412 --> 00:42:26,132
S2: really a conditional probability, uh, compounding conditional probability problem. And

817
00:42:26,132 --> 00:42:28,812
S2: your tolerance for false positives is actually zero. So anywhere

818
00:42:28,812 --> 00:42:31,252
S2: in this chain that you can we have to think

819
00:42:31,252 --> 00:42:33,942
S2: about this differently now because I can't reduce anything to zero.

820
00:42:33,982 --> 00:42:35,422
S2: But what I can do is I can take certain

821
00:42:35,422 --> 00:42:36,902
S2: parts of the chain and I can bump them up

822
00:42:36,902 --> 00:42:40,102
S2: to 100%, meaning my chances of getting something right when

823
00:42:40,102 --> 00:42:42,902
S2: I use a deterministic algorithm are 100%. So now I

824
00:42:42,942 --> 00:42:45,862
S2: no longer have some sort of fractional probability out of.

825
00:42:45,902 --> 00:42:49,102
S2: So this 15 step problem now let's say 12 steps

826
00:42:49,102 --> 00:42:52,062
S2: I do deterministically. Now I only have a three step chain.

827
00:42:52,062 --> 00:42:55,622
S2: And now that 99% I'm getting it right only three times.

828
00:42:55,702 --> 00:42:58,422
S2: You simplify this problem. Now I might be able to

829
00:42:58,422 --> 00:43:00,702
S2: make it through a year's worth of operations that, you know,

830
00:43:00,742 --> 00:43:03,382
S2: 100 examples of the problem a day. I might be

831
00:43:03,382 --> 00:43:06,302
S2: able to make it through that with a false positive

832
00:43:06,302 --> 00:43:08,022
S2: rate of. I don't know what the math is in

833
00:43:08,022 --> 00:43:09,502
S2: my head. I'd have to I have to punch it out.

834
00:43:09,502 --> 00:43:11,342
S2: But that false positive rate might be a lot more

835
00:43:11,342 --> 00:43:15,422
S2: survivable in an operational world than, you know, 15 conditional

836
00:43:15,422 --> 00:43:17,862
S2: probability problems that are all 99%.
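
The math S2 did not have in his head can be punched out with a short sketch. It assumes each step succeeds independently, that an LLM step is right 99% of the time, that a deterministic step is right 100% of the time, and that the system sees 100 problems a day for a year, all taken from the conversation. The function names are hypothetical.

```python
import math
from typing import Sequence


def chain_success(step_probabilities: Sequence[float]) -> float:
    """Probability that every step in a chain succeeds, assuming the
    steps are independent: the product of the per-step probabilities."""
    return math.prod(step_probabilities)


def expected_wrong(step_probabilities: Sequence[float], problems: int) -> float:
    """Expected number of problems with at least one failed step."""
    return problems * (1.0 - chain_success(step_probabilities))


# 15 steps, all handled by an LLM that is right 99% of the time.
ALL_LLM = [0.99] * 15
# Same 15 steps, but 12 are done deterministically (100%) and 3 by the LLM.
HYBRID = [1.0] * 12 + [0.99] * 3
# 100 problems a day for a year, as in the conversation.
PROBLEMS_PER_YEAR = 100 * 365

if __name__ == "__main__":
    for name, chain in (("all-LLM", ALL_LLM), ("hybrid", HYBRID)):
        print(name,
              "success per problem:", round(chain_success(chain), 4),
              "expected wrong per year:",
              round(expected_wrong(chain, PROBLEMS_PER_YEAR)))
    # Kill-chain view: driving any single step to zero zeroes the product.
    print("broken chain:", chain_success([0.9, 0.0, 0.99]))
```

With these assumptions the all-LLM chain gets a whole problem right about 86% of the time and the hybrid chain about 97%, which over 36,500 problems is the difference between roughly 5,100 and roughly 1,100 wrong answers.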

837
00:43:18,902 --> 00:43:22,902
S1: Yeah, yeah, I love that. The way I describe it is, um,

838
00:43:23,222 --> 00:43:26,782
S1: what's 1% of 100 metric tons of problems.

839
00:43:27,942 --> 00:43:28,382
S2: A metric.

840
00:43:29,062 --> 00:43:31,182
S1: A metric ton of problems?

841
00:43:31,272 --> 00:43:33,272
S2: Yeah, I love that. I love that.

842
00:43:34,072 --> 00:43:40,152
S1: Yeah. Yeah. Um, so, uh, we share this in common, actually. So, um,

843
00:43:40,152 --> 00:43:42,952
S1: I was, um, I was also Army, and I was at.

844
00:43:43,832 --> 00:43:46,912
S1: I was at Fort Campbell, so I was air assault,

845
00:43:46,912 --> 00:43:48,832
S1: so I had to do all the helicopter stuff.

846
00:43:48,872 --> 00:43:50,552
S2: Uh, right on, man. Hell, yeah. Brother.

847
00:43:50,992 --> 00:43:54,872
S1: Yeah. That's cool. Airborne air assault. Right? Um, yeah.

848
00:43:55,592 --> 00:43:58,792
S2: No. Yeah, I, I was, um, uh, this this picture

849
00:43:58,792 --> 00:44:01,392
S2: was taken when we were doing, uh, medevac chase. Uh,

850
00:44:01,432 --> 00:44:03,712
S2: we we did security for those guys over there, but

851
00:44:03,752 --> 00:44:05,832
S2: I was in an air assault battalion, so we literally

852
00:44:05,832 --> 00:44:07,912
S2: did nothing but fly you guys around, so.

853
00:44:07,952 --> 00:44:08,352
S1: Oh.

854
00:44:08,352 --> 00:44:10,312
S2: Nice man. Small world. Dude.

855
00:44:10,752 --> 00:44:11,432
S1: Yeah, yeah.

856
00:44:12,272 --> 00:44:14,472
S2: Yeah, I wasn't over at Fort Campbell. I, I was at, um.

857
00:44:14,472 --> 00:44:16,712
S2: I was at Fort Riley, uh, in the 1st

858
00:44:17,112 --> 00:44:19,872
S2: CAB and then, um, I PCS'd from there after I

859
00:44:19,872 --> 00:44:22,072
S2: went to Afghanistan and went to the 82nd. Um, so

860
00:44:22,072 --> 00:44:24,272
S2: I never got, never quite got to Campbell, which, like,

861
00:44:24,312 --> 00:44:26,472
S2: would have been great because I live here in Ohio

862
00:44:26,472 --> 00:44:29,272
S2: in Cincinnati. It's like where I was from. So I

863
00:44:29,272 --> 00:44:31,402
S2: was like always trying to get to Campbell because it

864
00:44:31,402 --> 00:44:33,202
S2: was like only like 4 or 5 hours from home

865
00:44:33,202 --> 00:44:35,322
S2: and be able to see family a lot easier. But

866
00:44:35,322 --> 00:44:38,722
S2: I ended up like 12 and nine hours away, respectively, so, uh.

867
00:44:39,042 --> 00:44:42,642
S1: Yeah. Well, that's super cool. Yeah, well, we need to

868
00:44:42,642 --> 00:44:46,802
S1: chat some more. Man. This is, like, really, really cool stuff. Um,

869
00:44:47,282 --> 00:44:49,122
S1: what you guys did on the team is cool, but

870
00:44:49,162 --> 00:44:51,762
S1: I'm even more excited just about the way you think

871
00:44:51,762 --> 00:44:58,522
S1: about these things. Um, I'm. I'm, uh, happy that, um,

872
00:44:58,762 --> 00:45:00,722
S1: the way you're thinking about it is similar to the

873
00:45:00,722 --> 00:45:03,082
S1: way I'm thinking about it. You've taught me a

874
00:45:03,082 --> 00:45:06,082
S1: lot just during this thing. We should we should definitely

875
00:45:06,082 --> 00:45:08,362
S1: chat more after this. Um, anything else you want to

876
00:45:08,362 --> 00:45:14,482
S1: share about the the competition or, um, lessons learned? Um.

877
00:45:15,522 --> 00:45:17,082
S2: So I think one of the things that that came

878
00:45:17,082 --> 00:45:31,532
S2: out of the competition, um, was a lot of vindication. Sorry.

879
00:45:31,532 --> 00:45:36,612
S2: I nudged mouse in it. Oh. So, um, I'll just

880
00:45:36,612 --> 00:45:38,252
S2: I'll just go right into the answer. I assume you

881
00:45:38,252 --> 00:45:41,852
S2: can edit this later or something, but yeah. Um, so yeah,

882
00:45:41,852 --> 00:45:43,412
S2: one of the things that, um, that came out of

883
00:45:43,412 --> 00:45:46,572
S2: the competition was, was honestly a lot of vindication, um,

884
00:45:47,092 --> 00:45:49,252
S2: like I had mentioned before, you know, when we started

885
00:45:49,252 --> 00:45:53,092
S2: off this process, um, this was two years ago, which

886
00:45:53,092 --> 00:45:56,292
S2: has been two lifetimes in the development of like AI

887
00:45:56,292 --> 00:46:00,612
S2: enabled systems for any problem, much less cybersecurity. Um, so

888
00:46:00,612 --> 00:46:03,692
S2: a lot of the things that we did, like tool enabling, um,

889
00:46:03,732 --> 00:46:07,812
S2: and multi-agent systems were things that we did before, things

890
00:46:07,812 --> 00:46:13,252
S2: like MCP or um, complicated, um, libraries for supporting this existed,

891
00:46:13,252 --> 00:46:17,012
S2: like we used early versions of um, of LangChain, uh,

892
00:46:17,012 --> 00:46:19,052
S2: for some of our multi-agent stuff, but we actually ended

893
00:46:19,052 --> 00:46:20,332
S2: up having to write a lot of and implement a

894
00:46:20,332 --> 00:46:23,692
S2: lot of our own glue code for this. Um, so

895
00:46:23,732 --> 00:46:26,612
S2: it's really vindicating to see, like, those techniques become, while

896
00:46:26,612 --> 00:46:29,952
S2: we're doing the competition, become not only one commonplace and

897
00:46:29,952 --> 00:46:33,832
S2: two supported by the major large language model providers, be

898
00:46:33,832 --> 00:46:37,192
S2: adopted and be used generally by the community. Um, you know,

899
00:46:37,232 --> 00:46:39,112
S2: it was really great that we came in second and

900
00:46:39,112 --> 00:46:41,512
S2: that also the first place finisher also used this like

901
00:46:41,512 --> 00:46:46,632
S2: kind of, um, use, um, problem solving techniques that are

902
00:46:46,632 --> 00:46:51,232
S2: well suited for the problem approach. Yeah. Don't use AI everywhere. Um,

903
00:46:51,792 --> 00:46:55,512
S2: third-place finisher Theori, they were a little bit more LLM-forward,

904
00:46:55,792 --> 00:46:58,352
S2: but they still had a lot of, like, traditional components.

905
00:46:58,512 --> 00:47:01,352
S2: I don't think any team really went after this. Like,

906
00:47:01,352 --> 00:47:04,912
S2: all-LLM, tried to just do everything within the LLM. Um.

907
00:47:05,432 --> 00:47:07,952
S1: I bet a lot started that way, and they they

908
00:47:08,112 --> 00:47:10,112
S1: fall back from it. Yeah.

909
00:47:10,152 --> 00:47:13,072
S2: Yeah. Yeah, I think I think at least one of them, um,

910
00:47:13,312 --> 00:47:14,672
S2: at least one team, I think All You Need IS

911
00:47:14,672 --> 00:47:17,872
S2: A Fuzzing Brain, I think in the semifinals, their approach, um,

912
00:47:18,352 --> 00:47:20,752
S2: tried to just use an LLM to augment a fuzzer

913
00:47:20,752 --> 00:47:22,632
S2: to find vulnerabilities. And I don't think they really had

914
00:47:22,632 --> 00:47:24,672
S2: much of, like, a solution for patching, but it was

915
00:47:24,672 --> 00:47:26,552
S2: enough to get them to the finals. I they had

916
00:47:26,552 --> 00:47:31,202
S2: a more well rounded system, I believe, uh, in the,

917
00:47:31,242 --> 00:47:33,442
S2: in the finals. Um, so yeah, it was kind of

918
00:47:33,482 --> 00:47:35,442
S2: vindicating to also see that all these other bright minds

919
00:47:35,442 --> 00:47:38,802
S2: out there were also similarly of the, of the mindset

920
00:47:38,802 --> 00:47:41,002
S2: to do this. But um, one of the biggest takeaways

921
00:47:41,002 --> 00:47:43,682
S2: I have that I'll, that I'll say is that was

922
00:47:43,682 --> 00:47:45,802
S2: like different than what I expected because it's really easy

923
00:47:45,802 --> 00:47:47,442
S2: to pat myself on the back and say, oh yeah,

924
00:47:47,482 --> 00:47:49,522
S2: all the plan I came up with worked great. That's

925
00:47:49,682 --> 00:47:52,722
S2: that's awesome. But, um, I will say that I was

926
00:47:52,722 --> 00:47:56,362
S2: really surprised at how well large language models eventually became

927
00:47:56,362 --> 00:47:59,402
S2: at helping us generate patches and also helping us generate

928
00:47:59,402 --> 00:48:02,722
S2: seed inputs to improve Fuzzer performance. Those were areas where

929
00:48:02,722 --> 00:48:04,922
S2: I didn't really give the LLM a lot of credit

930
00:48:04,922 --> 00:48:07,322
S2: up front, but I had to build an autonomous system,

931
00:48:07,322 --> 00:48:10,802
S2: so I had no choice. They really outperformed my expectations.

932
00:48:10,802 --> 00:48:12,802
S2: So I kind of came out of this with, um,

933
00:48:13,202 --> 00:48:17,122
S2: a bit of a healthier respect for the capabilities of

934
00:48:17,122 --> 00:48:20,322
S2: AI models. Once again, these are still highly constrained.

935
00:48:20,322 --> 00:48:21,322
S1: And yeah, yeah.

936
00:48:21,362 --> 00:48:23,962
S2: Very context rich problems that we ask them to do,

937
00:48:24,082 --> 00:48:26,212
S2: but they still did way better than I thought they

938
00:48:26,212 --> 00:48:27,612
S2: were going to do. Um.

939
00:48:27,892 --> 00:48:32,812
S1: Yeah. And also context constrained, not polluted, like a very

940
00:48:33,132 --> 00:48:35,612
S1: controlled context for that thing. Like like you were talking

941
00:48:35,612 --> 00:48:36,612
S1: about before, right?

942
00:48:37,092 --> 00:48:40,692
S2: Yeah. Yeah. Um, yeah, I think that's about it. Unfortunately,

943
00:48:40,692 --> 00:48:41,892
S2: I do have to jump off. I gotta I got

944
00:48:41,892 --> 00:48:45,212
S2: another call at 12:30, but, um. Yeah, I'd love to

945
00:48:45,212 --> 00:48:47,612
S2: chat more and talk more with you at some point.

946
00:48:47,612 --> 00:48:49,892
S2: If you want to do a follow up episode or,

947
00:48:49,932 --> 00:48:52,372
S2: I don't know, you just want to chat about other stuff. Um,

948
00:48:53,052 --> 00:48:54,372
S2: you know, we got a couple of friends in common

949
00:48:54,372 --> 00:48:58,652
S2: between Clint and, uh, between Clint and Keith, and it's, uh,

950
00:48:58,652 --> 00:49:00,172
S2: you know, I've. I've run into you a couple places

951
00:49:00,172 --> 00:49:03,052
S2: on various calls and stuff that we've been on, but, um,

952
00:49:03,132 --> 00:49:04,212
S2: it was good to get a chance to talk with

953
00:49:04,212 --> 00:49:06,052
S2: you one on one. I feel like we've been kind of, like,

954
00:49:06,292 --> 00:49:08,612
S2: circling around in the same circle for a while, but

955
00:49:08,612 --> 00:49:10,132
S2: I hadn't had a chance to, like, actually just chat

956
00:49:10,132 --> 00:49:10,812
S2: the two of us.

957
00:49:11,492 --> 00:49:15,452
S1: Yeah, absolutely. Well, thanks. Thanks for the, uh, the input.

958
00:49:15,452 --> 00:49:18,812
S1: This is just, uh, fantastic stuff. And, uh, let's definitely

959
00:49:18,812 --> 00:49:19,572
S1: catch up soon.

960
00:49:20,052 --> 00:49:21,372
S2: Yeah. Sounds good man. Take care of yourself.

961
00:49:21,412 --> 00:49:22,252
S1: All right. Take care.