WEBVTT - Inside View: The AI Personas Needed to Diagnose Disease 0:00:13.960 --> 0:00:17.239 Welcome to TechStuff. This is The Inside View. I'm 0:00:17.280 --> 0:00:19.440 Oz Woloshyn, here with Karah Preiss. 0:00:19.840 --> 0:00:23.319 Hello. So, Oz, I'm very curious to know more about 0:00:23.320 --> 0:00:25.000 the story you've brought me this week, since it's a 0:00:25.040 --> 0:00:27.440 topic we've discussed a lot on this podcast. 0:00:27.840 --> 0:00:31.320 Yes, so today I've got a story about AI in healthcare, 0:00:31.760 --> 0:00:35.960 specifically AI and diagnosis. I spoke with Dr. Matthew Lungren, 0:00:35.960 --> 0:00:39.040 who is the Chief Scientific Officer for Microsoft Health and 0:00:39.080 --> 0:00:43.320 Life Sciences, about this blog post that Microsoft recently published 0:00:43.600 --> 0:00:47.440 with the title "The Path to Medical Superintelligence." 0:00:47.880 --> 0:00:50.840 Do I want to know what medical superintelligence is? It's 0:00:50.880 --> 0:00:54.639 more big than just regular intelligence. But I actually heard 0:00:54.640 --> 0:00:57.760 about this study. It was everywhere, and if I remember correctly, 0:00:57.840 --> 0:01:00.520 it was that the AI was better at diagnosing than doctors. 0:01:00.600 --> 0:01:04.160 Right? Yeah, that's right. In fact, four times better. There 0:01:04.240 --> 0:01:07.240 was a headline in Time magazine which really says it all: 0:01:07.800 --> 0:01:12.240 "Microsoft's AI Is Better Than Doctors at Diagnosing Disease." Special 0:01:12.240 --> 0:01:14.679 shout-out here to Elliot Fishman, who's our old friend. 0:01:14.880 --> 0:01:18.000 He's a professor of radiology at Johns Hopkins and he 0:01:18.080 --> 0:01:22.039 runs this fascinating email group that discusses new developments in AI. 0:01:22.760 --> 0:01:24.880 Matthew Lungren and I are both members of this group, 0:01:25.240 --> 0:01:28.120 and Matthew is also one of the authors of the study. 
0:01:28.360 --> 0:01:29.880 What kind of doctor is Dr. Lungren? 0:01:30.240 --> 0:01:33.440 Like our friend Elliot Fishman, he's a radiologist by training 0:01:33.520 --> 0:01:36.280 and has a public health background. He was hired at 0:01:36.280 --> 0:01:41.400 Stanford, where he started using machine learning to analyze large 0:01:41.480 --> 0:01:43.200 data sets. Here's Matthew. 0:01:43.640 --> 0:01:46.520 Eventually my lab grew into a very large AI center 0:01:46.560 --> 0:01:49.880 at Stanford, which bridged the computer science department and the 0:01:49.880 --> 0:01:54.280 medical school and kind of saw translation of the newest techniques 0:01:54.320 --> 0:01:58.920 into healthcare applications accelerate. Taking that work further, I went 0:01:59.000 --> 0:02:03.520 to Microsoft on sabbatical at Microsoft Research and realized that 0:02:03.640 --> 0:02:07.400 a very similar opportunity was there in big tech if 0:02:07.440 --> 0:02:10.800 you could start to connect the latest technology to problems 0:02:10.800 --> 0:02:12.799 in healthcare. And so that's how I came to be here, 0:02:12.800 --> 0:02:14.679 and that's kind of what I still do all day. 0:02:15.160 --> 0:02:17.120 And Matthew is also one of the authors of the 0:02:17.120 --> 0:02:18.040 Microsoft study. 0:02:18.480 --> 0:02:22.240 I believe that the human expert plus these expert systems 0:02:22.280 --> 0:02:24.639 together will ultimately deliver better care. 0:02:25.040 --> 0:02:25.680 No matter what 0:02:25.560 --> 0:02:29.480 profession you're in, there's always a gray-haired person that has, 0:02:29.680 --> 0:02:31.440 you know, in some sense, seen it all and kind 0:02:31.440 --> 0:02:34.080 of compressed that into their brain and their pattern matching 0:02:34.120 --> 0:02:37.200 in a way that is just faster than folks that 0:02:37.240 --> 0:02:39.920 don't have as much experience. And that's true anywhere, but 0:02:40.000 --> 0:02:42.799 certainly in medicine, right? 
I think that the assistance or 0:02:42.880 --> 0:02:46.360 ability of AI to now sort of connect dots in 0:02:46.440 --> 0:02:51.680 ways that maybe can achieve that wisdom or that experience 0:02:52.160 --> 0:02:53.400 and bring that to the surface, 0:02:53.760 --> 0:02:55.160 it's kind of an unprecedented time. 0:02:55.480 --> 0:02:59.359 The AI's exceptional performance: four times better than human doctors. 0:03:00.040 --> 0:03:02.160 One of the things I found most interesting about the 0:03:02.200 --> 0:03:06.040 study was that it wasn't just one single AI model 0:03:06.120 --> 0:03:09.240 doing a diagnosis. It was a whole team of AI 0:03:09.320 --> 0:03:11.920 models that were able to talk to each other in 0:03:12.000 --> 0:03:15.280 order to come up with hypotheses, order tests, and ultimately come up 0:03:15.280 --> 0:03:16.120 with a diagnosis. 0:03:16.600 --> 0:03:20.360 So multiple AI models seems a little bit unfair. 0:03:20.240 --> 0:03:22.800 Yes, and in fact we talked about this. The doctors 0:03:22.800 --> 0:03:25.760 in the study were not allowed to call specialists to 0:03:25.800 --> 0:03:28.720 help them with their diagnosis, but the AIs were allowed 0:03:28.760 --> 0:03:31.160 to talk to each other. So doctors are not going 0:03:31.160 --> 0:03:32.600 to be made obsolete anytime soon. 0:03:32.680 --> 0:03:34.520 Well, good, because I have a physical coming up and 0:03:34.560 --> 0:03:37.760 I don't need four AI models being like, well, this 0:03:37.800 --> 0:03:39.520 girl got real big this year. 0:03:40.880 --> 0:03:43.400 Now, as you and I already know, people are already 0:03:43.520 --> 0:03:47.560 using AI regularly to diagnose themselves. In fact, I think 0:03:47.600 --> 0:03:51.440 more than ten percent of the overall ChatGPT traffic is 0:03:51.480 --> 0:03:54.800 around medical stuff. 
This is not always music to the 0:03:54.840 --> 0:03:57.480 ears of doctors, so it was interesting to look at 0:03:57.520 --> 0:04:00.200 an example where this is actually an AI built 0:04:00.360 --> 0:04:03.320 for doctors and to work with doctors rather than patient-facing. 0:04:03.800 --> 0:04:06.200 And the other interesting thing for me, which we talk 0:04:06.280 --> 0:04:09.160 about with Lungren, which we'll get to, is how this 0:04:09.280 --> 0:04:13.560 idea of multiple AIs talking to each other can simulate 0:04:13.640 --> 0:04:17.120 the experience of the best hospital systems in the US 0:04:17.360 --> 0:04:20.800 for people who otherwise might not have access to these 0:04:20.839 --> 0:04:22.000 panels of experts. 0:04:22.279 --> 0:04:24.920 I can't wait to hear what you learned from him. 0:04:25.240 --> 0:04:28.159 Well, here's the rest of my conversation with Dr. Matthew Lungren. 0:04:28.920 --> 0:04:31.920 So you're a trained doctor, and I want to start 0:04:31.960 --> 0:04:34.880 with the basics, which is diagnosis. I'm not sure when 0:04:34.920 --> 0:04:38.200 the last time you made a diagnosis on a patient was, 0:04:38.520 --> 0:04:40.880 but I'd love to hear from you as a doctor: 0:04:41.320 --> 0:04:43.040 what is the process of diagnosis? 0:04:43.320 --> 0:04:46.680 Yeah, I mean, it depends quite a bit on the specialty. 0:04:46.720 --> 0:04:50.680 But as most people know, the classic image of a physician, 0:04:50.760 --> 0:04:52.960 right, is to speak with the 0:04:52.880 --> 0:04:54.760 patient, kind of do a Sherlock Holmes kind of thing. 0:04:54.760 --> 0:04:56.840 Everyone's seen the shows like House, and things are kind 0:04:56.839 --> 0:04:58.960 of sensationalized, sort of, in the approach. 0:04:58.960 --> 0:05:01.200 But really there's a lot of unknowns that you have 0:05:01.240 --> 0:05:01.720 to tease out. 0:05:01.800 --> 0:05:01.880 Right. 
0:05:01.960 --> 0:05:04.200 You have to interview the patient, you have to obviously 0:05:04.320 --> 0:05:07.120 interpret labs and other information, and you have to start 0:05:07.160 --> 0:05:10.400 to narrow things down and order appropriate tests. Try not 0:05:10.440 --> 0:05:13.240 to chase too many of what we call the zebras, but 0:05:13.600 --> 0:05:15.520 keep those in mind in case you're dealing with one, and... 0:05:15.680 --> 0:05:18.160 The zebra would be the classic House episode. 0:05:17.839 --> 0:05:20.039 Right, yeah, right. Well, every House episode is a zebra, 0:05:20.080 --> 0:05:22.960 which actually has some relationship to the study we're going 0:05:23.040 --> 0:05:26.520 to talk about today. But in general, it's more common 0:05:26.560 --> 0:05:29.920 to have an uncommon presentation of a common disease than 0:05:30.000 --> 0:05:32.839 a common presentation of an uncommon disease, if that 0:05:32.880 --> 0:05:33.560 makes sense. 0:05:33.920 --> 0:05:38.600 Right, right, right. And this kind of relationship between AI 0:05:38.680 --> 0:05:41.440 and doctors has been going on for a few years. 0:05:41.760 --> 0:05:45.040 I remember reading a great piece in The New Yorker about 0:05:45.200 --> 0:05:48.840 how one of the challenges for AI was that the 0:05:48.839 --> 0:05:52.920 best doctors can't actually tell you in words why they're 0:05:52.920 --> 0:05:54.200 good at making diagnoses. 0:05:54.320 --> 0:05:55.600 That's right. It's interesting. 
0:05:55.640 --> 0:05:58.640 I think there are things that humans have, many cognitive 0:05:58.680 --> 0:06:01.000 biases that are well known, and I think, you know, 0:06:01.120 --> 0:06:05.760 keeping that in check while also trying to leverage the 0:06:05.800 --> 0:06:08.359 information in front of you, not be affected by the 0:06:08.400 --> 0:06:10.640 case you just saw or something you just heard at 0:06:10.640 --> 0:06:14.200 a conference, or an error that you experienced years ago 0:06:14.240 --> 0:06:18.040 that's still impacting the way that you think about diagnoses. 0:06:18.080 --> 0:06:21.760 And I think those biases have been well published and 0:06:21.839 --> 0:06:25.559 discussed ad nauseam in healthcare, but we're kind of dealing 0:06:25.560 --> 0:06:28.760 with this new human-plus-AI dance. 0:06:29.120 --> 0:06:33.560 That's fascinating. Yeah. I mean, I actually slipped and fell 0:06:33.640 --> 0:06:37.080 down a few stairs at the weekend and bashed my 0:06:37.080 --> 0:06:39.160 head slightly on one of the stairs, and then didn't 0:06:39.160 --> 0:06:41.080 feel very well, and I was like, I wonder if 0:06:41.080 --> 0:06:43.800 I could be concussed. So I did a selfie and 0:06:43.839 --> 0:06:45.800 sent it to ChatGPT and it said my eyes 0:06:45.839 --> 0:06:48.800 look fine. So actually, if I'd been more worried, 0:06:48.800 --> 0:06:50.159 I would have gone to the doctor. But there's 0:06:50.200 --> 0:06:51.440 kind of a dark side to that as well. 0:06:51.520 --> 0:06:53.400 Yeah, I mean, I think it sounds like you did okay. 0:06:53.400 --> 0:06:55.760 But I would say that the old saying in healthcare 0:06:55.920 --> 0:06:57.880 during, particularly, the rise of the Internet, right, which 0:06:57.920 --> 0:07:00.279 is kind of the other similar kind of technological 0:07:00.320 --> 0:07:03.840 advancement that impacted healthcare. 
We used to say to our patients, 0:07:04.400 --> 0:07:07.160 you know, your Google search does not replace our medical degree, right? 0:07:07.160 --> 0:07:09.480 And that wasn't meant to be condescending, but it 0:07:09.560 --> 0:07:11.120 was just sort of like we had to sort of 0:07:11.520 --> 0:07:13.960 pull them back from the abyss of going down a 0:07:14.000 --> 0:07:17.240 rabbit hole where every ache and pain was immediately terminal cancer, right, 0:07:17.240 --> 0:07:19.960 that kind of thing. But today it's different. To sort of 0:07:19.960 --> 0:07:24.240 reference the experience you just mentioned, that's happening everywhere. In fact, 0:07:24.280 --> 0:07:27.800 in the recent OpenAI launch of GPT-5, they spent 0:07:27.960 --> 0:07:32.720 fifteen minutes talking with a patient who went through a 0:07:32.840 --> 0:07:35.880 very difficult battle with cancer and worked with the model 0:07:35.880 --> 0:07:40.360 herself and was able to have very complex medical jargon 0:07:40.360 --> 0:07:44.000 explained to her in plain English, and it was able to 0:07:44.040 --> 0:07:47.120 help her with questions to ask the physician. And as 0:07:47.120 --> 0:07:50.360 someone who still practices and sees patients today, I have 0:07:50.400 --> 0:07:52.720 to say my patients are better informed than maybe ever, 0:07:52.840 --> 0:07:56.520 and it's kind of changing the bar with this classic 0:07:56.600 --> 0:07:59.600 information asymmetry problem where the patient has to kind of 0:07:59.680 --> 0:08:02.160 keep up with the technical speak and all the 0:08:02.160 --> 0:08:03.960 information that we spend decades learning. 0:08:04.400 --> 0:08:06.600 It feels like there's almost a better playing field. 0:08:06.600 --> 0:08:08.679 So I can have this conversation with my patient almost 0:08:08.680 --> 0:08:10.520 at a peer level, right, and then we can 0:08:10.560 --> 0:08:14.040 go through the care journey together. I'm extremely excited about 0:08:14.040 --> 0:08:15.560 that prospect. 
0:08:15.880 --> 0:08:17.880 Taking a couple of steps back, I mean, you mentioned 0:08:17.960 --> 0:08:21.240 you've been in and around this since twenty twelve, twenty thirteen. 0:08:22.240 --> 0:08:24.520 Why do people want to use AI in medicine? 0:08:24.600 --> 0:08:28.560 Well, it's an incredibly challenging discipline and it has only 0:08:28.720 --> 0:08:31.680 become more so, maybe in the last ten or fifteen years. 0:08:32.640 --> 0:08:34.320 One of the things that is going on is that 0:08:34.400 --> 0:08:39.439 information is doubling roughly every ninety days, medical information. That 0:08:39.760 --> 0:08:41.800 trend has been going on for a really long time. 0:08:41.840 --> 0:08:45.960 And what is that, publication of papers? Publication of papers, new therapies, 0:08:46.120 --> 0:08:49.240 new guidelines, all these things keep stacking up, right? And 0:08:49.280 --> 0:08:52.679 so just because you've been through medical school and training, right, 0:08:52.720 --> 0:08:54.760 we have lots of systems in place to help us 0:08:54.800 --> 0:08:57.880 continue our education. But really the reaction to that has 0:08:57.920 --> 0:09:01.480 been to sub-, in some cases sub-subspecialize. So 0:09:01.520 --> 0:09:04.640 to give you an example, I am a diagnostic radiologist, 0:09:04.640 --> 0:09:08.079 so that's the bigger specialty, and then I specialize in 0:09:08.200 --> 0:09:11.640 interventional radiology, which is image-guided procedures, basically, 0:09:12.080 --> 0:09:14.720 and then I am further specialized in the pediatric version of that. 0:09:15.080 --> 0:09:17.440 So that's like a Russian nesting doll of specialties. And 0:09:17.480 --> 0:09:21.000 you see that throughout healthcare. 
And that is partly due 0:09:21.040 --> 0:09:25.120 to the complexity of care that's required for some patients, 0:09:25.160 --> 0:09:29.120 but also it's due to the information tidal wave and 0:09:29.200 --> 0:09:31.600 being able to hold all that in a human mind, 0:09:32.000 --> 0:09:34.679 right, with all of our limitations. And so AI, I 0:09:34.760 --> 0:09:38.280 think, at least the work that we've been doing here, 0:09:38.480 --> 0:09:43.680 is starting to provide a counter-narrative to needing to 0:09:43.720 --> 0:09:46.480 be sub-subspecialized in order to be able to manage 0:09:46.520 --> 0:09:48.800 information and take really good care of your patients across 0:09:48.840 --> 0:09:52.760 a wide variety of complex diagnoses. And I think that 0:09:52.760 --> 0:09:55.040 that's really where the excitement is. I think right now 0:09:55.120 --> 0:09:59.360 it's: can I use this system to augment my ability 0:09:59.360 --> 0:10:00.120 to care for patients? 0:10:00.920 --> 0:10:05.360 And why isn't AI more ubiquitous in medicine? And what 0:10:05.760 --> 0:10:08.680 has been the integration challenge up until now? Well, 0:10:08.480 --> 0:10:11.160 there's a whole podcast just on that, Oz, I would say, 0:10:11.200 --> 0:10:14.439 but the short version is that we have been an 0:10:14.480 --> 0:10:20.480 incredibly skeptical discipline, skeptical of new technology and at 0:10:20.480 --> 0:10:24.120 the same time extraordinarily risk-averse, for good reason, right? 0:10:24.320 --> 0:10:28.800 We require significant evidence, right, to change the way we practice. 0:10:29.240 --> 0:10:31.680 We have, you know, as you know, clinical trials that take 0:10:31.920 --> 0:10:35.199 years and years, and some still fail, actually many fail, 0:10:35.320 --> 0:10:37.720 and we accept that as the system that keeps our 0:10:37.760 --> 0:10:40.400 patients safe and keeps us on the cutting edge. 
I 0:10:40.440 --> 0:10:44.120 think in terms of just the technical mechanics of adoption, 0:10:44.360 --> 0:10:46.719 we have a very rigid system in the software 0:10:46.760 --> 0:10:49.880 world that is changing. What's so, again, what's so exciting 0:10:49.880 --> 0:10:53.439 about this is that, again, any physician can pull out 0:10:53.480 --> 0:10:56.440 their cell phone and interact with this cutting-edge AI 0:10:56.559 --> 0:10:59.520 without needing to have, you know, three- or four-year-long 0:10:59.600 --> 0:11:03.120 cycles of integration with software, right? And it's just the 0:11:03.160 --> 0:11:04.880 early days, but those are the trends that we're seeing. 0:11:05.240 --> 0:11:05.520 Just to 0:11:05.480 --> 0:11:09.000 take a step back, I guess the classic model of 0:11:09.440 --> 0:11:14.160 measuring AI performance versus doctor performance was to present a 0:11:14.160 --> 0:11:18.120 hard problem or a hard diagnostic conundrum and ask for 0:11:18.160 --> 0:11:21.280 an answer and measure answer versus answer. How is that 0:11:21.280 --> 0:11:22.240 different to what you've done? 0:11:22.520 --> 0:11:25.480 Yeah, well, it's even less precise than that. 0:11:25.559 --> 0:11:28.400 So the way, up until now, at least for 0:11:28.520 --> 0:11:31.400 large language models, when people talked about them having medical capabilities, 0:11:32.160 --> 0:11:35.320 they were actually using medical examination questions. 0:11:35.320 --> 0:11:37.520 So there's a question stem and then there's a multiple- 0:11:37.600 --> 0:11:38.320 choice answer. 0:11:39.080 --> 0:11:41.680 That's not medicine, but it is how we, you know, 0:11:41.800 --> 0:11:45.640 qualify our humans, right, human physicians, to be granted a 0:11:45.679 --> 0:11:48.080 medical license, so I think we kind of used 0:11:48.120 --> 0:11:50.600 that for a long time as a surrogate 0:11:50.679 --> 0:11:52.360 or a bellwether. But it wasn't... 
0:11:52.480 --> 0:11:54.480 Could it pass a test to be a doctor, rather 0:11:54.520 --> 0:11:57.360 than could it actually be effective at acting as a doctor? 0:11:57.440 --> 0:12:00.280 That's interesting. Right. And we were able to show very 0:12:00.280 --> 0:12:03.840 early on with GPT-4 that these models outperform physicians 0:12:03.840 --> 0:12:06.240 on these multiple-choice tests. But there's all kinds of 0:12:06.280 --> 0:12:10.080 caveats there. Is that really medicine? Has it seen some 0:12:10.160 --> 0:12:11.959 of that data in its training? Assuredly, 0:12:12.080 --> 0:12:14.319 yes, right? And is that useful? 0:12:14.360 --> 0:12:18.120 I think those questions came up. Now, in practice, it's 0:12:18.480 --> 0:12:22.000 estimated that ten to twenty percent of AI interactions with 0:12:22.040 --> 0:12:27.200 these common chatbots like ChatGPT are around a medical use case. 0:12:27.200 --> 0:12:29.360 So we know that someone is getting value out 0:12:29.400 --> 0:12:31.520 of that somewhere, right, and we see it with our 0:12:31.520 --> 0:12:33.360 own eyes. So how do we bridge the gap to 0:12:33.400 --> 0:12:37.439 something slightly more realistic in terms of not giving 0:12:37.440 --> 0:12:39.160 you all the information up front, just like we would 0:12:39.360 --> 0:12:42.240 in real healthcare? One of the principal thoughts around the 0:12:42.240 --> 0:12:45.880 study was: is there a way to take advantage of 0:12:45.880 --> 0:12:50.520 the incredible capabilities that these models have in medical diagnosis 0:12:49.880 --> 0:12:53.760 and knowledge, but also push it a bit further 0:12:53.880 --> 0:12:56.400 and not have it kind of just be a question- 0:12:56.440 --> 0:12:59.240 answering machine? And so we thought, can we kind of 0:12:59.280 --> 0:13:01.679 have several versions of the model kind of act as 0:13:01.720 --> 0:13:04.400 different humans, this is that idea of an agent, 0:13:04.760 --> 0:13:07.840 and give them jobs? 
One job is to look at 0:13:07.840 --> 0:13:10.960 the economics of the tests that you're trying to order. 0:13:11.000 --> 0:13:15.360 One is to question your next decision point. So the 0:13:15.360 --> 0:13:17.559 information isn't just in and out with one model, it's 0:13:17.600 --> 0:13:20.160 actually in and out through a system of models. And 0:13:20.200 --> 0:13:22.120 we showed that no matter what model you use, whether 0:13:22.160 --> 0:13:24.880 it's Google's model, whether it's OpenAI's model, whether 0:13:24.880 --> 0:13:28.520 it's an open-source model, it improves that diagnostic capability 0:13:28.600 --> 0:13:32.240 on these extraordinarily challenging diagnostic tests. 0:13:32.640 --> 0:13:35.320 So you had ten co-authors on this study, and, 0:13:36.000 --> 0:13:38.080 you know, as we talked about, when it was released it 0:13:38.240 --> 0:13:40.600 took the world by storm. And so, I mean, how 0:13:40.600 --> 0:13:44.160 did you go about designing the study, and what was 0:13:44.200 --> 0:13:46.800 the hypothesis, and what have you found? 0:13:47.200 --> 0:13:50.520 So this was a cross-Microsoft collaboration, but Harsha Nori, 0:13:50.559 --> 0:13:53.000 who is the lead on this, really wanted to say, 0:13:53.240 --> 0:13:55.360 you know, we have a lot of evidence that these 0:13:55.400 --> 0:13:59.559 models perform well for these standardized tests, and then we 0:13:59.600 --> 0:14:03.320 see the real-world situation where that's not how people present. 0:14:03.400 --> 0:14:05.880 They don't show up with, hey, these are all my tests, 0:14:05.920 --> 0:14:07.320 these are all my problems, and these are the four 0:14:07.440 --> 0:14:10.000 choices of what I may have, right? 
And then taking 0:14:10.120 --> 0:14:13.079 what are essentially some of the most difficult questions out 0:14:13.120 --> 0:14:15.880 of the New England Journal and structuring them in a way 0:14:16.520 --> 0:14:20.120 that requires a model to ask for more information or 0:14:20.200 --> 0:14:21.000 order tests, just 0:14:20.960 --> 0:14:21.840 like a physician would. 0:14:22.640 --> 0:14:25.080 The hypothesis was that that would be interesting in and of itself, 0:14:25.200 --> 0:14:27.080 but then what if we also put humans through that 0:14:27.120 --> 0:14:32.680 same system? In other words, here's the first step: headache. Okay, 0:14:32.880 --> 0:14:33.680 what do you do next? 0:14:33.680 --> 0:14:33.880 Well, 0:14:33.920 --> 0:14:35.520 do I need to ask more questions? Do I need 0:14:35.520 --> 0:14:36.880 to order a test? Et cetera, et cetera. 0:14:37.520 --> 0:14:40.160 One of the really brilliant outcomes here was that having 0:14:40.240 --> 0:14:43.320 that system of agents, as opposed to just the single model, 0:14:43.720 --> 0:14:48.240 allowed us to have a more realistic understanding of the capabilities. 0:14:48.240 --> 0:14:50.560 In other words, if I wanted to know the answer, 0:14:50.600 --> 0:14:52.960 and I'm a chatbot, my answer could be, let's order 0:14:53.000 --> 0:14:56.080 every single test that there is, and that would probably 0:14:56.080 --> 0:14:56.920 get you the right answer. 0:14:57.160 --> 0:14:58.040 Is that feasible? 0:14:58.560 --> 0:14:58.760 No. 0:14:59.160 --> 0:14:59.320 Right? 0:14:59.400 --> 0:14:59.480 Yeah. 0:15:00.000 --> 0:15:04.000 So forcing it to think about resources, the cost of the 0:15:04.240 --> 0:15:07.120 care, actually found a very interesting, what we would call 0:15:07.200 --> 0:15:13.040 the Pareto frontier of capability under constrained resources. 
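NOTE The frontier mentioned here can be illustrated with a small Python sketch. It assumes each diagnostic strategy is summarized by an average cost per case and an accuracy; the `Run` type, the strategy names, and every number below are invented for illustration and are not results from the study. A strategy is on the frontier when no other strategy is both at least as cheap and at least as accurate.

```python
from typing import NamedTuple


class Run(NamedTuple):
    """One diagnostic strategy scored on a benchmark (illustrative only)."""
    name: str
    avg_cost: float   # average spend on tests per case, arbitrary units
    accuracy: float   # fraction of cases diagnosed correctly


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Return the runs that no other run beats on both cost and accuracy."""
    frontier: list[Run] = []
    best_accuracy = float("-inf")
    # Walk from cheapest to most expensive; on equal cost, most accurate first.
    for run in sorted(runs, key=lambda r: (r.avg_cost, -r.accuracy)):
        # Keep a run only if it is more accurate than everything cheaper.
        if run.accuracy > best_accuracy:
            frontier.append(run)
            best_accuracy = run.accuracy
    return frontier


# Made-up numbers, not figures from the study.
runs = [
    Run("order-everything", avg_cost=9000.0, accuracy=0.80),
    Run("ask-questions-only", avg_cost=300.0, accuracy=0.35),
    Run("cost-aware-panel", avg_cost=2500.0, accuracy=0.80),
    Run("expensive-first", avg_cost=3000.0, accuracy=0.60),
]
frontier_names = [run.name for run in pareto_frontier(runs)]
print(frontier_names)
```

In this toy data the "order-everything" strategy drops off the frontier because another strategy reaches the same accuracy for less money, which is the trade-off the speaker describes.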
So they 0:15:13.040 --> 0:15:16.600 were actually getting to incredible diagnoses very, very accurately, 0:15:17.240 --> 0:15:20.440 but also cost-efficiently, and that was really one of 0:15:20.440 --> 0:15:21.960 the biggest takeaways from this work. 0:15:22.960 --> 0:15:25.680 Can you, just to make it more concrete for our listeners, 0:15:25.720 --> 0:15:28.560 can you kind of set up one of these cases 0:15:28.800 --> 0:15:32.720 as though an episode of House, dare I say, and 0:15:32.760 --> 0:15:35.920 then what the human doctors did and what the AI 0:15:36.400 --> 0:15:38.840 agents did, and then how you compare that performance? 0:15:39.320 --> 0:15:42.040 Let's just say it was someone that had easy bleeding 0:15:42.120 --> 0:15:44.640 that was unexpected. They were brushing their teeth and they started 0:15:44.640 --> 0:15:46.560 bleeding and it was kind of unusual, and they noticed 0:15:46.600 --> 0:15:48.440 that they were getting a lot of bruising, and there's 0:15:48.480 --> 0:15:50.520 just a certain battery of tests. I think that was 0:15:50.560 --> 0:15:53.960 pretty comparable on both sides in terms of what they ordered. 0:15:54.320 --> 0:15:55.840 But they continued to... 0:15:55.840 --> 0:15:58.000 Both sides being what the AI ordered and what the human doctors ordered? 0:15:57.680 --> 0:15:59.280 Human and AI, pretty much, right. 0:15:59.360 --> 0:16:01.480 So the first few steps, I think there 0:16:01.520 --> 0:16:05.080 was a lot of similarity, which is expected. Where we 0:16:05.080 --> 0:16:08.680 started to see early divergence was because of that agent setup. 0:16:09.040 --> 0:16:11.880 Humans did kind of jump to more advanced tests more quickly, 0:16:11.880 --> 0:16:15.320 more expensive tests, and that was interesting because the models 0:16:15.320 --> 0:16:17.040 were able to kind of get to the next step 0:16:17.440 --> 0:16:19.720 with a battery of less expensive tests. 
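NOTE The step-by-step setup described in this exchange (start from one complaint, then ask questions or order tests, with spending tracked) might be modeled roughly as in the Python sketch below. The `Case` and `Encounter` classes, the prices, and the placeholder results are all hypothetical; the study's actual benchmark interface is not described in this conversation.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    """A case that opens with one complaint and hides everything else."""
    opening: str
    answers: dict[str, str]              # question -> patient's answer
    tests: dict[str, tuple[float, str]]  # test name -> (price, result)


@dataclass
class Encounter:
    """Reveals findings only on request and keeps a running bill."""
    case: Case
    spent: float = 0.0
    log: list[str] = field(default_factory=list)

    def ask(self, question: str) -> str:
        # Questions are treated as free here, which is a simplification.
        self.log.append(f"ask: {question}")
        return self.case.answers.get(question, "No information available.")

    def order(self, test: str) -> str:
        if test not in self.case.tests:
            self.log.append(f"order (unavailable): {test}")
            return "Test not available."
        price, result = self.case.tests[test]
        self.spent += price
        self.log.append(f"order: {test} (${price:.0f})")
        return result


# Hypothetical case; prices and results are placeholders, not clinical data.
case = Case(
    opening="Bleeding gums while brushing teeth, easy bruising",
    answers={"Any new medications?": "No new medications."},
    tests={
        "complete blood count": (30.0, "placeholder: blood count report"),
        "bone marrow biopsy": (2000.0, "placeholder: biopsy report"),
    },
)

cheap_first = Encounter(case)
cheap_first.ask("Any new medications?")
cheap_first.order("complete blood count")

expensive_first = Encounter(case)
expensive_first.order("bone marrow biopsy")

print(cheap_first.spent, expensive_first.spent)
```

Comparing `spent` across the two encounters mirrors the comparison in the conversation: two paths through the same case can differ widely in cost before either one reaches a diagnosis.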
So we thought 0:16:19.720 --> 0:16:21.360 that was kind of interesting, starting to see 0:16:21.360 --> 0:16:24.000 some divergence. And then, to be fair to the humans, 0:16:24.840 --> 0:16:27.680 they're still kind of handcuffed. In other words, they're just 0:16:27.760 --> 0:16:31.600 getting text feedback as they're interacting with the system, whereas 0:16:31.960 --> 0:16:34.680 when I'm with a patient, I'm seeing them, I'm able 0:16:34.720 --> 0:16:37.640 to kind of take some cues, I'm able to examine them. 0:16:37.640 --> 0:16:40.240 So there were some limitations there, but nevertheless, 0:16:40.240 --> 0:16:43.280 once it got to the stage where you had a 0:16:43.280 --> 0:16:47.280 differential diagnosis, so a list of likely things, more often 0:16:47.320 --> 0:16:49.720 than not, the model was ranking them in a much 0:16:49.760 --> 0:16:53.440 more data-driven order that ultimately led to the correct 0:16:53.440 --> 0:16:56.560 diagnosis much more quickly. Whereas, you know, as you 0:16:56.600 --> 0:16:58.600 would with humans, with these limitations, you're kind of going 0:16:58.600 --> 0:17:02.040 in some rabbit holes, you're maybe not ordering them in 0:17:02.120 --> 0:17:04.240 the best order, and so you're kind of going down 0:17:04.280 --> 0:17:07.120 other paths that end up increasing the time or expense 0:17:07.200 --> 0:17:08.920 or potentially leading to the wrong diagnosis. 0:17:15.720 --> 0:17:19.040 After the break, how the multi-agent system, the diagnostic 0:17:19.200 --> 0:17:21.880 orchestrator, actually works. Stay with us. 0:17:38.240 --> 0:17:42.560 I put the study through ChatGPT. It described the diagnostic 0:17:42.680 --> 0:17:45.720 orchestrator as like a virtual team of five doctors, each 0:17:45.760 --> 0:17:49.520 with a different role. 
One lists possible illnesses, one chooses 0:17:49.520 --> 0:17:53.680 the best tests, one plays devil's advocate, one watches the budget, 0:17:53.760 --> 0:17:56.359 and one checks the quality of everything. The team talks 0:17:56.400 --> 0:17:58.440 it out step by step and decides what to do next. 0:17:58.520 --> 0:18:00.600 Is that a fair summary? That's exactly right. 0:18:00.640 --> 0:18:03.000 And you can have infinite numbers of those agents. 0:18:03.040 --> 0:18:05.560 I think these five were just kind of scratching 0:18:05.560 --> 0:18:08.439 the surface of what's possible. I will say just quickly 0:18:08.480 --> 0:18:11.320 that I was incredibly happy to see that the curmudgeon 0:18:11.359 --> 0:18:13.679 agent, we called it, or the devil's advocate agent, was 0:18:13.720 --> 0:18:17.280 helpful, because you get into these groupthink situations, and 0:18:17.320 --> 0:18:20.399 it's kind of fun to watch a model argue with 0:18:20.520 --> 0:18:24.720 other models about some of the decisions being made and 0:18:24.840 --> 0:18:28.280 questioning the steps. So where the models fall short today 0:18:29.320 --> 0:18:32.640 is outside of the text domain. And what I mean 0:18:32.680 --> 0:18:37.320 by that is models are incredibly good at understanding medical 0:18:37.320 --> 0:18:41.320 concepts as they're communicated in text form, but when you 0:18:41.359 --> 0:18:44.280 get into the images and genomics and waveforms and all 0:18:44.280 --> 0:18:46.240 the other types of ways that we take care of 0:18:46.320 --> 0:18:51.720 our patients, the models are vastly underperforming humans. 
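NOTE The five-role panel summarized in this exchange could be sketched in Python along the following lines. This is a toy with hard-coded, rule-based stand-ins: in the system discussed each role would presumably be a language model given its own instructions, and the function names, `PanelState`, `PRICES`, and the placeholder conditions are all assumptions made for illustration, not the study's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical prices, invented for this sketch.
PRICES = {"advanced imaging": 1500.0, "basic blood panel": 40.0}


@dataclass
class PanelState:
    """Shared notes that every role on the panel can read and update."""
    findings: list[str]
    budget_left: float
    hypotheses: list[str] = field(default_factory=list)
    proposed_test: Optional[str] = None
    transcript: list[str] = field(default_factory=list)


def list_illnesses(state: PanelState) -> str:
    state.hypotheses = ["condition A", "condition B"]  # placeholder names
    return "Leading candidates: condition A, condition B."


def choose_test(state: PanelState) -> str:
    state.proposed_test = "advanced imaging"
    return "Proposing advanced imaging to separate A from B."


def devils_advocate(state: PanelState) -> str:
    state.hypotheses.append("condition C")
    return "Nothing so far rules out condition C; keep it on the list."


def watch_budget(state: PanelState) -> str:
    test = state.proposed_test
    if test is not None and PRICES[test] > state.budget_left:
        state.proposed_test = "basic blood panel"
        return "Over budget; start with the basic blood panel instead."
    return "Proposed test fits the budget."


def check_quality(state: PanelState) -> str:
    ok = bool(state.hypotheses) and state.proposed_test is not None
    return "Plan is complete." if ok else "Plan is missing a hypothesis or a test."


Role = Callable[[PanelState], str]
PANEL: list[tuple[str, Role]] = [
    ("hypotheses", list_illnesses),
    ("test chooser", choose_test),
    ("devil's advocate", devils_advocate),
    ("budget", watch_budget),
    ("quality", check_quality),
]


def run_round(state: PanelState) -> str:
    """Let each role speak once, in order, then return the next step."""
    for name, role in PANEL:
        state.transcript.append(f"{name}: {role(state)}")
    return f"order {state.proposed_test}"


state = PanelState(findings=["easy bruising"], budget_left=500.0)
next_action = run_round(state)
print(next_action)
```

Because `PANEL` is just a list, adding a sixth or seventh role means appending another entry, which matches the remark that these five only scratch the surface.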
And a 0:18:51.760 --> 0:18:53.520 good example of that is if I needed to look 0:18:53.520 --> 0:18:56.400 at a chest X-ray in one of these diagnostic steps 0:18:57.000 --> 0:18:58.840 and the model had to interpret the chest X-ray, it 0:18:58.840 --> 0:19:01.520 couldn't read the report, it actually had to look at the image, 0:19:01.880 --> 0:19:04.080 it would fall short and fail nine times out of ten. 0:19:04.560 --> 0:19:07.159 So we know that that's a significant gap. But on 0:19:07.200 --> 0:19:11.080 the other hand, most healthcare, right, eighty percent of physician 0:19:11.400 --> 0:19:15.320 or patient interactions with their healthcare systems involve some kind 0:19:15.359 --> 0:19:20.840 of other information like an ECG or a biopsy path 0:19:21.040 --> 0:19:25.600 slide, right, or an MRI, for example. So I'm hoping 0:19:25.600 --> 0:19:28.639 to see agents that have those competencies included in the mix, 0:19:29.320 --> 0:19:31.200 so we can start to really get to a place 0:19:31.240 --> 0:19:35.640 where the diagnostic environment meets how we're testing the systems. 0:19:36.160 --> 0:19:39.199 There was a study last year which I was fascinated by, 0:19:39.280 --> 0:19:44.120 which is that AI diagnosis in this study was better 0:19:44.200 --> 0:19:47.119 than human plus AI. In other words, it was a study, 0:19:47.119 --> 0:19:49.399 and you would assume, or you would hope, that a 0:19:49.400 --> 0:19:51.760 doctor using AI would be better than just an AI 0:19:51.800 --> 0:19:56.280 diagnosis alone. But in fact, the human-plus-AI model 0:19:56.400 --> 0:19:59.439 was worse than the pure AI model. And one of 0:19:59.440 --> 0:20:02.239 the conclusions from this was maybe that the doctors 0:20:02.240 --> 0:20:04.120 didn't want to listen to what AI was telling them. 0:20:04.119 --> 0:20:06.479 But I mean, did you see that study, and did 0:20:06.480 --> 0:20:07.240 it give you pause? 
0:20:07.680 --> 0:20:10.520 For more than a decade we've been kind of dealing 0:20:10.560 --> 0:20:14.359 with this unexpected result. This goes, again, all the 0:20:14.400 --> 0:20:16.440 way back to the earliest days of applying at least 0:20:16.440 --> 0:20:19.760 some of the powerful deep learning systems in healthcare. We 0:20:19.960 --> 0:20:23.840 have consistently seen that. In other words, in whatever 0:20:23.920 --> 0:20:26.640 setup, the AI, if you just leave it alone, typically 0:20:26.640 --> 0:20:28.920 does better than the human plus the AI or 0:20:28.840 --> 0:20:29.640 the human alone. 0:20:29.880 --> 0:20:34.719