WEBVTT - Can computers describe what they see? 0:00:00.160 --> 0:00:07.200 Brought to you by Toyota. Let's go places. Welcome to 0:00:07.400 --> 0:00:14.760 Forward Thinking. Hey there, and welcome to Forward Thinking, the podcast 0:00:14.840 --> 0:00:17.000 that looks at the future and says makes you think 0:00:17.040 --> 0:00:20.880 all the world's a sunny day. I'm Jonathan Strickland, 0:00:20.960 --> 0:00:24.680 and I'm Joe McCormick. So today we're going to talk 0:00:24.680 --> 0:00:28.440 about something that I think is a pretty interesting topic, 0:00:28.600 --> 0:00:32.159 and that's the automated description of images. This is yet 0:00:32.200 --> 0:00:35.960 another topic I came across in Alexis Madrigal's Five Intriguing 0:00:36.000 --> 0:00:38.600 Things email, which if you're not signed up for, you 0:00:38.640 --> 0:00:40.320 should get on that. It is one of my 0:00:40.400 --> 0:00:44.280 favorite sources of daily delights on the Internet. So what 0:00:44.479 --> 0:00:46.640 is this idea? Well, it's what it sounds like. But 0:00:46.880 --> 0:00:51.000 I'll start with an analogy. Okay, imagine you're going to 0:00:51.520 --> 0:00:56.320 something like Google image search. Um, now, what happens when 0:00:56.360 --> 0:00:58.720 you do a Google image search? You type in some 0:00:58.840 --> 0:01:04.080 words and it comes back with images. Well, that's kind 0:01:04.080 --> 0:01:07.279 of strange, because how does it translate the difference between 0:01:07.280 --> 0:01:09.840 a collection of pixels on the one hand and the 0:01:09.880 --> 0:01:13.320 words you've typed in? 
Well, one of the things, obviously, 0:01:13.400 --> 0:01:17.480 is that there's data associated with images, metadata, right? Um, 0:01:17.520 --> 0:01:20.720 it's either captions that people have manually typed in, or 0:01:20.800 --> 0:01:24.240 perhaps keywords that they've attached to those images, or something 0:01:24.240 --> 0:01:26.840 else on the website's page that's going to 0:01:26.959 --> 0:01:29.200 clue you into what that image is about, like a 0:01:29.200 --> 0:01:33.000 file name or whatever. You could also use an approach 0:01:33.160 --> 0:01:35.920 that is sort of refined by humans. So you could 0:01:35.920 --> 0:01:39.560 have humans sitting there working on your algorithm where they 0:01:39.600 --> 0:01:43.520 go through image after image from selected keywords and say 0:01:43.600 --> 0:01:45.840 this is a good match for that keyword and this 0:01:45.880 --> 0:01:48.280 is a bad match for that keyword, and that sort 0:01:48.320 --> 0:01:54.160 of helps you, uh, connect words to images. Or let's say, 0:01:54.200 --> 0:01:57.600 what if there was no text associated with an image, 0:01:57.760 --> 0:02:00.880 could you still do it? Well, in some cases you 0:02:00.920 --> 0:02:04.600 probably could, right, because we've gotten to a certain level 0:02:04.640 --> 0:02:08.600 with image recognition. Uh, there are automated programs that can 0:02:08.639 --> 0:02:10.840 look at this and say this is a human face 0:02:11.919 --> 0:02:14.080 or this is a cat, as we have discussed before 0:02:14.080 --> 0:02:17.680 on the show. And well, I'm sure, right, and 0:02:17.680 --> 0:02:20.360 I think we don't know the full extent to 0:02:20.639 --> 0:02:25.600 which artificial intelligence like that already figures into something like 0:02:25.639 --> 0:02:28.200 Google Image Search. I wouldn't be surprised if that was 0:02:28.240 --> 0:02:31.240 a small part of it. 
But obviously we're relying heavily 0:02:31.320 --> 0:02:35.840 on text associated with images. Okay, but now let's take 0:02:35.880 --> 0:02:40.760 that same last example, just identifying an image with no 0:02:40.880 --> 0:02:44.240 associated text, and say, could we do that with a 0:02:44.360 --> 0:02:47.639 complex scenario? So it's not just a picture of a 0:02:47.720 --> 0:02:50.560 human face, or say a bowling ball, which would be 0:02:50.560 --> 0:02:52.600 pretty easy to recognize, as you know, it's round, it's 0:02:52.600 --> 0:02:57.120 got three holes, but something like there is a pizza 0:02:57.360 --> 0:03:01.520 sitting in a bathtub, or a man throwing a sandwich 0:03:01.600 --> 0:03:04.919 off a cliff. You've got a lot of food-related 0:03:04.960 --> 0:03:08.280 imagery in your head. Well, I said those 0:03:08.320 --> 0:03:12.680 because I actually searched for them earlier, before I'd had lunch. Yeah, all right, 0:03:12.720 --> 0:03:16.119 and I found no images of someone throwing a sandwich 0:03:16.160 --> 0:03:19.160 off a cliff. Well, why would anyone want to throw 0:03:19.200 --> 0:03:21.080 a sandwich off a cliff? I don't know. But you 0:03:21.120 --> 0:03:23.480 know what, I would be really surprised if there wasn't 0:03:23.520 --> 0:03:26.760 at least one picture of that out there somewhere. If 0:03:26.800 --> 0:03:30.560 there's not, there's definitely a stock photography opportunity lying in wait. Yeah, 0:03:31.000 --> 0:03:35.360 you know what we're doing after the podcast. Um, there's 0:03:35.360 --> 0:03:38.120 a reason that this would be hard, right, to describe 0:03:38.160 --> 0:03:42.760 a complex image, not just one thing, but a complex 0:03:42.800 --> 0:03:47.440 sort of scene that requires a sentence to describe it, right, 0:03:47.720 --> 0:03:53.640 with prepositional phrases and situational, um, relativity. 
Yeah, exactly, 0:03:53.680 --> 0:03:55.760 you have to be able to describe the relationship between 0:03:55.800 --> 0:03:58.640 all the different elements that are inside that picture. Right. 0:03:58.640 --> 0:04:00.680 And so that last example is what we're going to 0:04:00.800 --> 0:04:04.480 talk about today, how computers can look at an image 0:04:04.520 --> 0:04:07.760 with a complex scene taking place and turn that into 0:04:07.880 --> 0:04:11.240 a correct and accurate description made out of words. And 0:04:11.600 --> 0:04:15.400 this goes into, you know, a key element 0:04:15.400 --> 0:04:19.000 of artificial intelligence. It's not just the ability to describe something, 0:04:19.000 --> 0:04:21.880 but the ability to recognize it. It's something that can 0:04:21.920 --> 0:04:25.760 go beyond just describing a picture. And we talk a 0:04:25.800 --> 0:04:28.440 lot about how there are things that we humans 0:04:28.800 --> 0:04:31.520 are really good at. It comes naturally to us, 0:04:31.520 --> 0:04:34.160 it's the way we work. But they are things that 0:04:34.200 --> 0:04:37.960 do not necessarily come naturally to the machines we make. Um, 0:04:38.000 --> 0:04:40.279 and the example I gave was if you had a 0:04:40.320 --> 0:04:42.440 group of people, you know, you've got a bunch of 0:04:42.440 --> 0:04:44.560 people together, and you told all of them, I want 0:04:44.600 --> 0:04:47.839 you to draw this picture, and you describe the picture 0:04:47.880 --> 0:04:49.960 to them. And in my case, I said, well, just 0:04:50.040 --> 0:04:53.800 imagine that there's a young lady sitting at a table 0:04:53.839 --> 0:04:56.560 reading a book, and just make the picture as detailed 0:04:56.600 --> 0:04:58.599 as you can. But all I give is 0:04:58.640 --> 0:05:00.880 just the elements that have to be there: there's 0:05:00.880 --> 0:05:04.080 a seated lady, uh, she's at a table, and 0:05:04.120 --> 0:05:06.920 she's reading a book. 
And so you could have all 0:05:06.920 --> 0:05:10.840 these different types of interpretations of that request. Sure, I 0:05:10.880 --> 0:05:14.560 might draw a picture of a lady sitting in a 0:05:14.600 --> 0:05:18.559 cafe reading a book. Or it could be a desk 0:05:18.600 --> 0:05:22.200 at a library. There are so many different opportunities there. Uh, 0:05:22.480 --> 0:05:24.720 it could just be a kitchen table, whatever. But if I 0:05:24.839 --> 0:05:28.640 took all the pictures that everyone drew, and then I 0:05:28.680 --> 0:05:31.279 went to a separate group of people and I showed 0:05:31.320 --> 0:05:33.640 them all those different pictures, and I said, all right, 0:05:33.680 --> 0:05:36.599 so what do all these pictures have in common? I'm pretty 0:05:36.600 --> 0:05:38.640 sure that a lot of people would end up saying, Okay, 0:05:38.680 --> 0:05:42.279 these are all pictures of a young lady reading a 0:05:42.279 --> 0:05:44.320 book at a table. It would be kind of weird 0:05:44.360 --> 0:05:46.920 if in every picture she was reading TekWar. That 0:05:46.960 --> 0:05:50.960 would be very weird. For lots of reasons that 0:05:51.000 --> 0:05:53.720 would be weird. But at any rate, yeah, I mean, 0:05:53.760 --> 0:05:56.760 we would have essentially people answering the same, you know, 0:05:56.760 --> 0:06:00.240 giving the same basic description. Now, there are a lot of 0:06:00.240 --> 0:06:02.719 things going on in that scenario I just described. It's 0:06:02.760 --> 0:06:05.120 not just the fact that you're able to recognize things, 0:06:05.160 --> 0:06:08.279 it's that you're able to draw the conclusion that all 0:06:08.360 --> 0:06:11.520 these different pictures, even though the details are different, are 0:06:11.520 --> 0:06:14.000 showing you essentially the same thing, which is something that 0:06:14.000 --> 0:06:16.360 would be necessary if we were using an image search 0:06:16.960 --> 0:06:21.200 that relied on this automated image description. 
Right, well, I mean, 0:06:21.240 --> 0:06:24.120 and furthermore, we as humans are able to recognize a 0:06:24.200 --> 0:06:27.000 lady sitting at a table reading a book from any 0:06:27.040 --> 0:06:30.200 angle that it's drawn from. Right, right, pretty easily. Yeah, 0:06:30.200 --> 0:06:32.160 we can tell, like, that this is 0:06:32.240 --> 0:06:34.440 the book, this is the lady, this is the table. 0:06:35.160 --> 0:06:39.600 If you show just pixels to a machine that hasn't 0:06:39.800 --> 0:06:42.520 had any way of telling the difference between, you know, 0:06:42.560 --> 0:06:46.239 what these pixels actually mean, that machine 0:06:46.279 --> 0:06:47.880 might not be able to tell that there are distinct 0:06:48.000 --> 0:06:50.479 elements in that picture. Right, it may all just look 0:06:50.520 --> 0:06:53.760 like one thing. So there are a lot of complications here. Uh, 0:06:54.040 --> 0:06:56.640 same sort of thing, we've brought this example up a few times, 0:06:56.680 --> 0:06:58.880 like I know what a cup 0:06:59.000 --> 0:07:00.679 is because I've seen a cup and I was told 0:07:00.880 --> 0:07:04.560 this is a cup, and you're able to extrapolate to many 0:07:04.560 --> 0:07:07.839 different kinds of cups. Exactly. You have an ideal in 0:07:07.880 --> 0:07:10.559 your head. It's sort of the platonic ideal of a cup, 0:07:10.600 --> 0:07:13.760 and there are many ways that an actual cup can 0:07:14.000 --> 0:07:17.600 vary that theme, but somehow you always recognize the theme. 0:07:17.760 --> 0:07:21.000 I recognize every single cup in existence as an imperfect 0:07:21.080 --> 0:07:25.840 realization of the ideal cup that's in my mind. 
Uh so, 0:07:26.520 --> 0:07:28.680 which has Grimace on it, by the way. But at 0:07:28.720 --> 0:07:33.200 any rate, yeah, and so again, with a computer, you 0:07:33.240 --> 0:07:36.840 could give it an image and program 0:07:36.840 --> 0:07:41.640 in some software that said, this particular image that you're 0:07:41.640 --> 0:07:44.560 seeing here is a cup. But then if you took a 0:07:44.600 --> 0:07:47.560 totally different kind of cup, different shape, different size, different color, 0:07:48.200 --> 0:07:50.640 the computer is not necessarily going to know what that is. 0:07:51.040 --> 0:07:53.080 Right, right, it's going to say, well, this one's blue. Yeah, 0:07:53.120 --> 0:07:56.320 it's shorter and it's from a different angle. 0:07:56.480 --> 0:07:57.960 It's one of those things where, you know, 0:07:58.600 --> 0:08:02.480 there's a difficult problem and there's not necessarily a simple 0:08:02.520 --> 0:08:06.560 solution to fix it so the computer understands what's going on. 0:08:07.280 --> 0:08:09.200 So the point I'm trying to make is that this 0:08:09.280 --> 0:08:12.120 is a nontrivial computer problem that a lot of 0:08:12.120 --> 0:08:14.200 people have worked on for a long time, and it's 0:08:14.560 --> 0:08:18.560 absolutely amazing to see how much progress we've made. Absolutely, 0:08:18.640 --> 0:08:21.800 and furthermore, this is only half of the issue that 0:08:21.800 --> 0:08:24.320 we're dealing with here overall. Because once, I mean, once 0:08:24.400 --> 0:08:27.840 you can teach a computer to identify, for example, a cup, 0:08:28.640 --> 0:08:32.160 how do you get it to explain what that 0:08:32.200 --> 0:08:34.600 cup is doing in relation to the other things? 
How 0:08:34.640 --> 0:08:37.360 does it describe it in a way that actually makes 0:08:37.400 --> 0:08:41.360 sense to... That's a natural language problem. Exactly right, because 0:08:41.400 --> 0:08:44.840 the computer doesn't think in English or whatever language you 0:08:44.840 --> 0:08:47.959 want it to spit out. Right. Yeah, and we've talked 0:08:48.000 --> 0:08:50.720 about natural language issues as well, the idea that machine 0:08:50.800 --> 0:08:53.480 language and natural language are extremely different, and in fact, 0:08:53.800 --> 0:08:58.440 programming languages are a bridge between pure machine language and 0:08:58.640 --> 0:09:01.040 human language. You might not think it if you're not 0:09:01.080 --> 0:09:04.800 a programmer and you look at raw code. You might think, well, 0:09:04.840 --> 0:09:08.440 this isn't language that any human would understand. But in 0:09:08.480 --> 0:09:11.160 fact that's precisely what it is. It's meant to be 0:09:11.679 --> 0:09:16.480 that bridging material. Uh, so getting computers so that 0:09:16.520 --> 0:09:20.920 they can interpret natural language innately is a very 0:09:21.000 --> 0:09:24.960 challenging issue. We've said it before, that you can word 0:09:25.160 --> 0:09:30.319 the exact same thought numerous ways. Human language is highly, 0:09:30.440 --> 0:09:34.280 highly redundant. Yes, there are all different kinds, and the differences 0:09:34.360 --> 0:09:38.240 in word choice might express subtle differences in tone 0:09:38.480 --> 0:09:41.440 and things like that, but you can basically describe the 0:09:41.480 --> 0:09:47.880 same thing a jillion different ways, different spellings. 
So, for example, 0:09:48.320 --> 0:09:51.319 if you're playing an old text adventure game like Zork, yes, 0:09:51.760 --> 0:09:55.679 you could type walk down the hall or go down 0:09:55.679 --> 0:09:59.959 the hall, and that text parsing system might be smart 0:10:00.160 --> 0:10:02.920 enough to know either one and get you down the hall. 0:10:03.040 --> 0:10:05.200 So like, okay, I know what the person just said, 0:10:05.360 --> 0:10:08.240 but if you type mosey on down the corridor, it 0:10:08.360 --> 0:10:11.320 may very well say I don't know what you're talking about. Right, 0:10:11.360 --> 0:10:13.559 what did Zork say when you said something? I'm sorry, 0:10:13.559 --> 0:10:16.000 I don't understand what you mean, or something along those lines, 0:10:16.480 --> 0:10:18.920 or you were eaten by a grue if it's too 0:10:19.000 --> 0:10:23.679 frustrated at you, if you're too confusing too often. Grue. Yeah, no, 0:10:24.000 --> 0:10:26.640 but that's a great example, the idea that, you know, obviously, 0:10:26.640 --> 0:10:30.600 those programs were only capable of accepting commands that 0:10:30.640 --> 0:10:34.120 had been pre-programmed into them, and anything that went 0:10:34.160 --> 0:10:38.800 outside those parameters was an error. It was missing. The 0:10:39.200 --> 0:10:41.920 message we would get is I didn't understand what you 0:10:41.960 --> 0:10:43.960 had to say, but in reality it could have just 0:10:44.040 --> 0:10:47.320 as well been file not found. Right. Okay, so now 0:10:47.679 --> 0:10:51.920 we're combining these two different artificial intelligence problems. On one 0:10:52.000 --> 0:10:56.400 hand, the complex problem of looking at a scene and recognizing 0:10:56.440 --> 0:10:59.520 what's going on there, connecting that to other images and context, 0:10:59.559 --> 0:11:01.640 and on the other, making sense of it in a 0:11:01.679 --> 0:11:05.640 written language. And, uh, why does this matter? Yeah. 
Why 0:11:05.640 --> 0:11:07.160 do we even want to do this if it's 0:11:07.200 --> 0:11:09.280 so difficult? A lot of reasons. 0:11:09.320 --> 0:11:11.640 Well, first of all, we talked about image search, 0:11:11.720 --> 0:11:15.199 and just being able to automate this would make 0:11:15.240 --> 0:11:18.160 image search way more efficient for the things that we're 0:11:18.200 --> 0:11:20.960 looking for. We could be much more specific. Oh man, 0:11:21.080 --> 0:11:22.720 if only I could tell it. So, part of my 0:11:22.920 --> 0:11:26.079 job is searching for stock images that I end up 0:11:26.080 --> 0:11:30.559 publishing on our website. And human labeling of stock images... 0:11:30.600 --> 0:11:32.280 If you guys have never had to search for stock 0:11:32.320 --> 0:11:34.720 images for your job before, let me tell you, it's 0:11:34.720 --> 0:11:38.160 one of the most joyous and terrifying things on the planet. 0:11:38.160 --> 0:11:40.880 Because no matter what you type in, what you're going 0:11:40.920 --> 0:11:44.120 to get back is some weird clip art, some sexy 0:11:44.200 --> 0:11:46.280 ladies doing some stuff that may or may not have 0:11:46.320 --> 0:11:49.040 anything to do with what you just said, 0:11:49.360 --> 0:11:52.120 and maybe what you are actually looking for. You might 0:11:52.160 --> 0:11:56.720 also find some, like, truly weird images, like an overweight, 0:11:57.120 --> 0:11:59.840 shirtless gentleman sitting at a table with a paper bag 0:12:00.000 --> 0:12:02.160 over his head, knife and fork in his hands, eating 0:12:02.160 --> 0:12:05.920 a cartoon hamster. I mean, there's some weird stuff on there. Seriously, 0:12:06.000 --> 0:12:09.080 I am almost positive that there is a stock photograph 0:12:09.240 --> 0:12:11.920 out there somewhere of somebody throwing a sandwich off a 0:12:11.960 --> 0:12:14.760 cliff, like they've made it for diet purposes or 0:12:14.840 --> 0:12:19.439 something like that. 
You didn't get endless examples of that, though, right? 0:12:19.520 --> 0:12:22.160 If I searched for man throwing a sandwich off a cliff, 0:12:22.200 --> 0:12:24.000 I should have tried this before we started, but I'm 0:12:24.000 --> 0:12:26.400 almost positive I would not get that. Instead, I'd get 0:12:26.440 --> 0:12:29.960 a lady in a bikini sitting on a pinball machine. Yeah, 0:12:30.000 --> 0:12:34.600 that's pretty accurate. So just increasing the accuracy of 0:12:34.720 --> 0:12:38.360 image search would be one reason, right? And there 0:12:38.400 --> 0:12:40.920 are lots of different reasons why our image searches 0:12:41.000 --> 0:12:44.160 on these things are imperfect. A large part of it 0:12:44.200 --> 0:12:46.400 is that you've got people gaming the system. They're essentially 0:12:46.400 --> 0:12:48.600 putting in any tag word they can possibly think of 0:12:49.120 --> 0:12:51.280 because they want their images to be the ones that 0:12:51.320 --> 0:12:54.679 are purchased. But if you want to play fair, 0:12:54.800 --> 0:12:58.040 then having an automated description would be best, because you 0:12:58.040 --> 0:13:00.600 can't curate everything by human eye. It would just take 0:13:00.920 --> 0:13:03.440 too long. We're generating too much content for that to 0:13:03.480 --> 0:13:06.880 be a realistic possibility. But another is that it could 0:13:06.920 --> 0:13:09.599 be a huge help for people who have visual impairments. 0:13:09.679 --> 0:13:14.400 So someone who is reading a news story, you know, 0:13:14.440 --> 0:13:17.080 someone who has some sort of visual impairment, 0:13:17.120 --> 0:13:20.680 maybe they're blind, and there could be pictures that give 0:13:20.880 --> 0:13:23.960 more context to whatever the story is, but they miss 0:13:24.000 --> 0:13:26.720 out on that if there's not an actual description of 0:13:26.760 --> 0:13:29.520 what that picture is. 
Particularly, I mean, there's some content 0:13:29.559 --> 0:13:33.880 out there where the caption might be playful but doesn't 0:13:33.920 --> 0:13:37.840 actually tell you what the picture is. Absolutely, so this 0:13:37.920 --> 0:13:40.000 would be a big help for people who are in 0:13:40.040 --> 0:13:42.960 that situation. It also could speed up web access for 0:13:43.000 --> 0:13:46.560 people who have limited connectivity to the Internet. Perhaps it's 0:13:46.600 --> 0:13:49.440 through a cellular network, and it may be that 0:13:49.520 --> 0:13:51.679 there's some important information they need to get hold of. 0:13:51.720 --> 0:13:54.200 But you know, if they're trying to load pictures, it's 0:13:54.280 --> 0:13:56.240 just taking too long for it to load any kind 0:13:56.240 --> 0:13:58.680 of thing. If you could have a quick summary of those pictures, 0:13:58.679 --> 0:14:01.079 that would really speed things up. And actually, I also 0:14:01.120 --> 0:14:06.240 think it's just an important contribution to general artificial intelligence. Absolutely, 0:14:06.240 --> 0:14:08.600 if you're trying to create a system that can mimic 0:14:08.679 --> 0:14:10.920 all of the functions of the human mind, well, one 0:14:10.960 --> 0:14:14.440 of the main things humans do is look at something 0:14:14.520 --> 0:14:17.560 and describe it, right? And you know, the description is 0:14:17.600 --> 0:14:20.720 just part of it. There's also the interaction, right? 0:14:20.760 --> 0:14:24.800 By recognizing things in our environment, we know how to proceed. 0:14:25.360 --> 0:14:28.520 We can make decisions on how to proceed. So if, 0:14:28.600 --> 0:14:31.800 for example, we walk into a room and we notice 0:14:31.840 --> 0:14:34.840 that there are a lot of pedestals around us, and 0:14:34.880 --> 0:14:38.040 there are delicate vases on each pedestal, we know not to 0:14:38.080 --> 0:14:41.640 go swinging our arms everywhere willy-nilly. 
But you know, 0:14:42.120 --> 0:14:45.720 a robot would not necessarily be able to tell that 0:14:46.360 --> 0:14:48.680 a vase sitting on a pedestal was not in 0:14:48.760 --> 0:14:52.440 fact a single piece. It might interpret that 0:14:52.520 --> 0:14:55.240 as a column. That's a good point. You know, or, 0:14:55.320 --> 0:14:57.720 even if it could recognize that was 0:14:58.000 --> 0:15:01.720 an object sitting on another object, that the vase was delicate 0:15:01.840 --> 0:15:05.520 or that it was worth not smashing. Right, yeah, 0:15:05.520 --> 0:15:08.760 teaching robots value is a very tricky thing. Also, they 0:15:08.800 --> 0:15:13.640 hate vases. They do. It's, I think, I think you 0:15:13.720 --> 0:15:18.360 program them, I think programming... Well, it's because we 0:15:18.640 --> 0:15:23.360 gave them the basic personality of Gallagher, is the problem. Okay, okay, 0:15:23.440 --> 0:15:28.600 so automated image description is a very difficult problem, 0:15:28.680 --> 0:15:31.120 but it's also very worth solving, both in the long 0:15:31.200 --> 0:15:34.240 term for artificial intelligence, and in some specific cases, in 0:15:34.240 --> 0:15:37.560 the short term. Who's actually working on this? Where did 0:15:37.560 --> 0:15:40.960 this come from? Well, I mean, there are lots of 0:15:41.000 --> 0:15:44.560 different people in computer science working on this problem. But 0:15:44.680 --> 0:15:48.160 the thing that kind of spurred on this particular 0:15:48.200 --> 0:15:51.240 podcast you discovered, right? Yeah, it was, well, as I said, 0:15:51.240 --> 0:15:54.600 it was through Alexis Madrigal's Five Intriguing Things newsletter, and 0:15:54.640 --> 0:15:59.040 it was a link to the Google Research blog, which 0:15:59.200 --> 0:16:02.920 is a cool little blog. Some of it's definitely over 0:16:02.960 --> 0:16:05.840 the average reader's head, but it's also just very interesting. 
0:16:05.920 --> 0:16:09.440 And yeah, yeah, um, this specific blog entry is from 0:16:09.560 --> 0:16:13.600 Google UK, uh, and it was posted by a bunch 0:16:13.640 --> 0:16:17.320 of research scientists. It's one of those things where 0:16:17.400 --> 0:16:19.600 if you go into the Google research blog, they do 0:16:19.680 --> 0:16:23.600 get, um, more technical than your average blog does. They're 0:16:23.640 --> 0:16:28.240 not so technical as to be completely incomprehensible, but I 0:16:28.240 --> 0:16:30.880 will say they're really good about linking out to terms 0:16:30.920 --> 0:16:33.040 that you might be unfamiliar with, which is important because I 0:16:33.080 --> 0:16:35.680 had to click on every single one of those links. 0:16:36.200 --> 0:16:39.120 I did so much reading for this one blog post, 0:16:39.120 --> 0:16:41.480 so I could really get a handle on what they 0:16:41.480 --> 0:16:44.800 were saying. But again, it illustrates the complexity of the problem. 0:16:45.360 --> 0:16:47.680 So again, we don't mean to suggest that these Google 0:16:47.720 --> 0:16:50.200 researchers are the only people working on this problem, or 0:16:50.240 --> 0:16:52.440 that their approach is the only way to do it. 0:16:53.280 --> 0:16:56.160 We're concentrating on it because it was really well documented 0:16:56.360 --> 0:16:58.800 and it was just published on November seventeenth. We are 0:16:58.840 --> 0:17:01.800 recording this on November twenty-first, so it was 0:17:01.880 --> 0:17:05.080 of immediate interest to us. Yes, okay, so how are 0:17:05.119 --> 0:17:08.640 we doing this? Well, first, you have to identify what 0:17:09.119 --> 0:17:10.960 needs to be done before you can figure out how 0:17:11.000 --> 0:17:13.159 to do it, right? You have to figure out 0:17:12.800 --> 0:17:15.520 the things that have to happen in order for this 0:17:15.600 --> 0:17:19.440 to be a possibility. 
And they identified several things, including 0:17:19.480 --> 0:17:23.080 computer vision, which is how machines acquire and analyze images. 0:17:23.119 --> 0:17:24.960 So how do they get the images in the first place? 0:17:25.359 --> 0:17:29.359 Is it purely through code? Is it actual visual, you know, 0:17:29.600 --> 0:17:32.520 is it like a camera system? I mean, if you're 0:17:32.520 --> 0:17:34.920 talking about robotics, then it's probably a camera system because 0:17:34.920 --> 0:17:37.560 they're looking around in their environment. It could just be 0:17:37.640 --> 0:17:40.680 sampling from the Internet or something like that. Right, right. Well, 0:17:40.880 --> 0:17:44.800 with something like an automated search, it could all 0:17:45.000 --> 0:17:48.119 be code. Like, it could be that there's no quote 0:17:48.160 --> 0:17:52.000 unquote looking at the image, right? But so there's that. 0:17:52.040 --> 0:17:56.080 There's also object detection, which sounds really easy, but it's 0:17:56.200 --> 0:17:58.760 incredibly hard. So this is what I was talking about, 0:17:58.800 --> 0:18:01.680 being able to recognize individual objects within an image. 0:18:02.000 --> 0:18:05.840 So what separates an object from its background? If I've 0:18:05.840 --> 0:18:08.440 got a shot, a top-down shot of a table, 0:18:08.800 --> 0:18:11.080 and there's a book sitting on the middle of that table, 0:18:11.640 --> 0:18:14.600 then when I look down, I can see that there's 0:18:14.640 --> 0:18:16.479 a book and there's a table, and I recognize those 0:18:16.560 --> 0:18:18.960 as two different things. But like I was saying before, 0:18:19.080 --> 0:18:21.600 if it's a machine and it doesn't have this way 0:18:21.600 --> 0:18:24.720 of telling the difference there, it may just think 0:18:24.720 --> 0:18:27.760 of that as a pattern that's on a table, you know, 0:18:27.920 --> 0:18:30.800 or even a raised part of that table. 
If it 0:18:30.840 --> 0:18:33.600 can detect depth, it's not a cut 0:18:33.640 --> 0:18:36.120 and dried thing. So getting to a point where you can 0:18:36.160 --> 0:18:38.960 have a computer that can tell that there are multiple 0:18:38.960 --> 0:18:42.000 objects within a scene, that's already a challenge, although we've 0:18:42.200 --> 0:18:46.640 come a long way to actually do that. But 0:18:46.680 --> 0:18:48.359 if you want to think 0:18:48.359 --> 0:18:50.800 about how hard this is, imagine you're looking at 0:18:50.800 --> 0:18:54.439 an overgrown field and there's someone in a ghillie 0:18:54.480 --> 0:18:56.960 suit out there. Ghillie suits are those camo suits that 0:18:57.000 --> 0:19:00.000 have all the plant-type material hanging off of them. 0:19:00.119 --> 0:19:02.280 Some of it's artificial, some of it may actually be 0:19:02.320 --> 0:19:06.960 gathered from wherever you're going. Those camouflage suits are really convincing. 0:19:07.000 --> 0:19:09.440 It's really hard to pick out someone who's good 0:19:09.440 --> 0:19:12.840 at being, uh, covert. You may not even know 0:19:12.920 --> 0:19:16.760 that there's a person there. And so, as hard 0:19:16.760 --> 0:19:19.359 as that is for us, that's about the equivalent of 0:19:19.400 --> 0:19:22.040 looking at a book on a table for a computer. Exactly. Yeah. 0:19:22.080 --> 0:19:26.360 You know, until you're able to teach a machine how 0:19:26.440 --> 0:19:29.560 to see, in a way, then it's going 0:19:29.680 --> 0:19:32.200 to be just as difficult 0:19:32.240 --> 0:19:33.960 to detect that as it would be for us to 0:19:33.960 --> 0:19:35.600 see that guy in the ghillie suit in the middle 0:19:35.640 --> 0:19:39.000 of the overgrown field. Um, but once you do get 0:19:39.040 --> 0:19:40.960 to that, you still have other things to keep in mind, 0:19:40.960 --> 0:19:43.959 like classification and labeling. 
So this is a measurement of 0:19:44.000 --> 0:19:48.520 how accurately a program can assign correct labels to an image. Uh, 0:19:48.640 --> 0:19:50.960 I love the example, like if you looked at certain 0:19:51.000 --> 0:19:53.280 examples within the Google blog post, it eventually took you 0:19:53.320 --> 0:19:56.240 to a picture of a dog wearing a sombrero, so 0:19:56.320 --> 0:19:59.480 you had dog, hat. There's a hat on a dog, 0:20:00.000 --> 0:20:04.439 a wide-brimmed hat. Yeah, these are all important elements. 0:20:04.520 --> 0:20:07.639 That's part of that classification and labeling. Uh, you know, 0:20:07.680 --> 0:20:11.119 instead of just saying that's one ugly dog. It's not 0:20:11.119 --> 0:20:15.840 a weird growth. Yeah, that would be, that would 0:20:15.840 --> 0:20:18.560 be more than that. It's, uh, dog with a 0:20:18.600 --> 0:20:22.440 hat and not just dog with something on it. Yeah, 0:20:22.560 --> 0:20:26.080 so again, not dog with stack of pancakes. Yeah. And 0:20:26.119 --> 0:20:29.080 identifying the fact that there is that relationship, that there 0:20:29.200 --> 0:20:31.920 is a hat on a dog, not just that there's 0:20:31.960 --> 0:20:34.320 a hat and a dog in the picture, but how 0:20:34.359 --> 0:20:36.960 do those two objects relate to one another within the 0:20:37.000 --> 0:20:41.200 context of that picture. So that's also pretty cool. So 0:20:41.320 --> 0:20:44.200 as we've said, lots of folks have been working on 0:20:44.760 --> 0:20:49.439 the problem of how to do this thing, and a 0:20:49.520 --> 0:20:52.520 common approach has been working on linking computer systems that 0:20:52.600 --> 0:20:55.600 understand what's going on in pictures with computer systems that 0:20:55.680 --> 0:20:59.120 understand what's going on in sentences, and letting the two 0:20:59.280 --> 0:21:03.200 match up images and phrases. But these scientists at Google 0:21:03.400 --> 0:21:06.080 were approaching it from a different way. 
They're trying to 0:21:06.119 --> 0:21:09.720 create a system where the two halves work together directly 0:21:09.840 --> 0:21:13.680 with the same data, rather than comparing and contrasting two 0:21:13.800 --> 0:21:16.520 separate sets of data. Okay, um, so they 0:21:16.520 --> 0:21:19.880 were inspired to do this by recent language translation research 0:21:19.960 --> 0:21:22.520 in which one half of the system would create a 0:21:22.600 --> 0:21:26.119 diagram of a sentence in one language, say English, and 0:21:26.280 --> 0:21:28.919 the second half would look at that diagram and generate 0:21:28.960 --> 0:21:32.200 a sentence from it in another language, say French. Yeah. 0:21:32.200 --> 0:21:34.919 And again, this was a part of classification. It wasn't 0:21:35.080 --> 0:21:38.639 just a word to word, you know, what 0:21:38.800 --> 0:21:41.040 is the analogous word in this other 0:21:41.119 --> 0:21:45.199 language for the one that we're detecting here, but 0:21:45.400 --> 0:21:48.439 rather what is the meaning of this sentence and what 0:21:48.640 --> 0:21:51.479