WEBVTT - Can computers describe what they see?

0:00:00.160 --> 0:00:07.200
<v Speaker 1>Brought to you by Toyota. Let's go places. Welcome to

0:00:07.400 --> 0:00:14.760
<v Speaker 1>Forward Thinking. Hey there, and welcome to Forward Thinking, the podcast

0:00:14.840 --> 0:00:17.000
<v Speaker 1>that looks at the future and says makes you think

0:00:17.040 --> 0:00:20.880
<v Speaker 1>all the world's a sunny day. I'm Jonathan Strickland, I'm Lauren Vogelbaum,

0:00:20.960 --> 0:00:24.680
<v Speaker 1>and I'm Joe McCormick. So today we're going to talk

0:00:24.680 --> 0:00:28.440
<v Speaker 1>about something that I think is a pretty interesting topic,

0:00:28.600 --> 0:00:32.159
<v Speaker 1>and that's the automated description of images. This is yet

0:00:32.200 --> 0:00:35.960
<v Speaker 1>another topic I came across in Alexis Madrigal's Five Intriguing

0:00:36.000 --> 0:00:38.600
<v Speaker 1>Things email, which, if you're not signed up for, you

0:00:38.640 --> 0:00:40.320
<v Speaker 1>should get on that. It is one of my

0:00:40.400 --> 0:00:44.280
<v Speaker 1>favorite sources of daily delights on the Internet. So what

0:00:44.479 --> 0:00:46.640
<v Speaker 1>is this idea? Well, it's what it sounds like. But

0:00:46.880 --> 0:00:51.000
<v Speaker 1>I'll start with an analogy. Okay, imagine you're going to

0:00:51.520 --> 0:00:56.320
<v Speaker 1>something like Google image search. Um, now, what happens when

0:00:56.360 --> 0:00:58.720
<v Speaker 1>you do a Google image search? You type in some

0:00:58.840 --> 0:01:04.080
<v Speaker 1>words and it comes back with images. Well, that's kind

0:01:04.080 --> 0:01:07.279
<v Speaker 1>of strange because how does it translate the difference between

0:01:07.280 --> 0:01:09.840
<v Speaker 1>a collection of pixels on the one hand and the

0:01:09.880 --> 0:01:13.320
<v Speaker 1>words you've typed in? Well, one of the things, obviously,

0:01:13.400 --> 0:01:17.480
<v Speaker 1>is that there's data associated with images. Metadata, right. Um,

0:01:17.520 --> 0:01:20.720
<v Speaker 1>it's either captions that people have manually typed in, or

0:01:20.800 --> 0:01:24.240
<v Speaker 1>perhaps keywords that they've attached to those images, or something

0:01:24.240 --> 0:01:26.840
<v Speaker 1>else on the on the website's page that's going to

0:01:26.959 --> 0:01:29.200
<v Speaker 1>clue you into what that image is about, like a

0:01:29.200 --> 0:01:33.000
<v Speaker 1>file name or whatever. You could also use an approach

0:01:33.160 --> 0:01:35.920
<v Speaker 1>that is sort of refined by humans. So you could

0:01:35.920 --> 0:01:39.560
<v Speaker 1>have humans sitting there working on your algorithm where they

0:01:39.600 --> 0:01:43.520
<v Speaker 1>go through image after image from selected keywords and say

0:01:43.600 --> 0:01:45.840
<v Speaker 1>this is a good match for that keyword and this

0:01:45.880 --> 0:01:48.280
<v Speaker 1>is a bad match for that keyword, and that sort

0:01:48.320 --> 0:01:54.160
<v Speaker 1>of helps you, uh, connect words to images. Or let's say,

0:01:54.200 --> 0:01:57.600
<v Speaker 1>what if there was no text associated with an image,

0:01:57.760 --> 0:02:00.880
<v Speaker 1>could you still do it? Well, in some cases you

0:02:00.920 --> 0:02:04.600
<v Speaker 1>probably could, right, because we've gotten to a certain level

0:02:04.640 --> 0:02:08.600
<v Speaker 1>with image recognition. Uh, there are automated programs that can

0:02:08.639 --> 0:02:10.840
<v Speaker 1>look at this and say this is a human face

0:02:11.919 --> 0:02:14.080
<v Speaker 1>or this is a cat, as we have discussed before

0:02:14.080 --> 0:02:17.680
<v Speaker 1>on the show. And well, I'm sure right, and we

0:02:17.680 --> 0:02:20.360
<v Speaker 1>we I think we don't know the full extent to

0:02:20.639 --> 0:02:25.600
<v Speaker 1>which artificial intelligence like that already figures into something like

0:02:25.639 --> 0:02:28.200
<v Speaker 1>Google Image Search. I wouldn't be surprised if that was

0:02:28.240 --> 0:02:31.240
<v Speaker 1>a small part of it. But obviously we're relying heavily

0:02:31.320 --> 0:02:35.840
<v Speaker 1>on text associated with images. Okay, but now let's take

0:02:35.880 --> 0:02:40.760
<v Speaker 1>that same last example, just identifying an image with no

0:02:40.880 --> 0:02:44.240
<v Speaker 1>associated text, and say, could we do that with a

0:02:44.360 --> 0:02:47.639
<v Speaker 1>complex scenario. So it's not just a picture of a

0:02:47.720 --> 0:02:50.560
<v Speaker 1>human face, or say a bowling ball, which would be

0:02:50.560 --> 0:02:52.600
<v Speaker 1>pretty easy to recognize, as you know, it's round, it's

0:02:52.600 --> 0:02:57.120
<v Speaker 1>got three holes, but something like there is a pizza

0:02:57.360 --> 0:03:01.520
<v Speaker 1>sitting in a bathtub, or a man throwing a sandwich

0:03:01.600 --> 0:03:04.919
<v Speaker 1>off a cliff. You've got a lot of food-related

0:03:04.960 --> 0:03:08.280
<v Speaker 1>imagery in your in your head. Well, I said those

0:03:08.320 --> 0:03:12.680
<v Speaker 1>because I actually searched for them earlier, before I'd had lunch. Yeah, all right,

0:03:12.720 --> 0:03:16.119
<v Speaker 1>and I found no images of someone throwing a sandwich

0:03:16.160 --> 0:03:19.160
<v Speaker 1>off a cliff. Well, why would anyone want to throw

0:03:19.200 --> 0:03:21.080
<v Speaker 1>a sandwich off a cliff? I don't know, But you

0:03:21.120 --> 0:03:23.480
<v Speaker 1>know what, I would be really surprised if there wasn't

0:03:23.520 --> 0:03:26.760
<v Speaker 1>at least one picture of that out there somewhere. If

0:03:26.800 --> 0:03:30.560
<v Speaker 1>there's not, there's definitely a stock photography opportunity lying in wait. Yeah,

0:03:31.000 --> 0:03:35.360
<v Speaker 1>you know what we're doing after the podcast. Um, there's

0:03:35.360 --> 0:03:38.120
<v Speaker 1>a reason that this would be hard, right to describe

0:03:38.160 --> 0:03:42.760
<v Speaker 1>a complex image not just one thing, but a complex

0:03:42.800 --> 0:03:47.440
<v Speaker 1>sort of scene that requires a sentence to describe it, right,

0:03:47.720 --> 0:03:53.640
<v Speaker 1>with prepositional phrases and and situational um relativity. Yeah, exactly,

0:03:53.680 --> 0:03:55.760
<v Speaker 1>you have to be able to describe the relationship between

0:03:55.800 --> 0:03:58.640
<v Speaker 1>all the different elements that are inside that picture. Right.

0:03:58.640 --> 0:04:00.680
<v Speaker 1>And so that last example is what we're going to

0:04:00.800 --> 0:04:04.480
<v Speaker 1>talk about today, how computers can look at an image

0:04:04.520 --> 0:04:07.760
<v Speaker 1>with a complex scene taking place and turn that into

0:04:07.880 --> 0:04:11.240
<v Speaker 1>a correct and accurate description made out of words. And

0:04:11.600 --> 0:04:15.400
<v Speaker 1>this this goes into all, you know, a key element

0:04:15.400 --> 0:04:19.000
<v Speaker 1>of artificial intelligence. It's not just the ability to describe something,

0:04:19.000 --> 0:04:21.880
<v Speaker 1>but the ability to recognize it. It's something that can

0:04:21.920 --> 0:04:25.760
<v Speaker 1>go beyond just describing a picture. And we talk a

0:04:25.800 --> 0:04:28.440
<v Speaker 1>lot about how there are things that that we humans

0:04:28.800 --> 0:04:31.520
<v Speaker 1>are really good at. It's it comes naturally to us,

0:04:31.520 --> 0:04:34.160
<v Speaker 1>it's the way we work. But they are things that

0:04:34.200 --> 0:04:37.960
<v Speaker 1>do not necessarily come naturally to the machines we make. Um,

0:04:38.000 --> 0:04:40.279
<v Speaker 1>And the example I gave was if you had a

0:04:40.320 --> 0:04:42.440
<v Speaker 1>group of people, you know, you've got a bunch of

0:04:42.440 --> 0:04:44.560
<v Speaker 1>people together, and you told all of them, I want

0:04:44.600 --> 0:04:47.839
<v Speaker 1>you to draw this picture, and you describe the picture

0:04:47.880 --> 0:04:49.960
<v Speaker 1>to them. And in my case, I said, well, just

0:04:50.040 --> 0:04:53.800
<v Speaker 1>imagine that there's a young lady sitting at a table

0:04:53.839 --> 0:04:56.560
<v Speaker 1>reading a book, and just make the picture as detailed

0:04:56.600 --> 0:04:58.599
<v Speaker 1>as you can. But all that's all I give is

0:04:58.640 --> 0:05:00.880
<v Speaker 1>just the elements that have to be there. Is there's

0:05:00.880 --> 0:05:04.080
<v Speaker 1>a seated lady. Uh that she's at a table and

0:05:04.120 --> 0:05:06.920
<v Speaker 1>she's reading a book. And so you could have all

0:05:06.920 --> 0:05:10.840
<v Speaker 1>these different types of interpretations of that request. Sure, I

0:05:10.880 --> 0:05:14.560
<v Speaker 1>might draw a picture of a lady sitting in a

0:05:14.600 --> 0:05:18.559
<v Speaker 1>cafe reading a book. Or it could be a desk

0:05:18.600 --> 0:05:22.200
<v Speaker 1>at a library. There's so many different opportunities there. Uh,

0:05:22.480 --> 0:05:24.720
<v Speaker 1>could just be a kitchen table whatever. But if I

0:05:24.839 --> 0:05:28.640
<v Speaker 1>took all the pictures that everyone drew, and then I

0:05:28.680 --> 0:05:31.279
<v Speaker 1>went to a separate group of people and I showed

0:05:31.320 --> 0:05:33.640
<v Speaker 1>them all those different pictures, and I said, all right,

0:05:33.680 --> 0:05:36.599
<v Speaker 1>so what do all these pictures have in common? Pretty

0:05:36.600 --> 0:05:38.640
<v Speaker 1>sure that a lot of people would end up saying, Okay,

0:05:38.680 --> 0:05:42.279
<v Speaker 1>these are all pictures of a young lady reading a

0:05:42.279 --> 0:05:44.320
<v Speaker 1>book at a table. It would be kind of weird

0:05:44.360 --> 0:05:46.920
<v Speaker 1>if in every picture she was reading TekWar. That

0:05:46.960 --> 0:05:50.960
<v Speaker 1>would be very weird. For lots of reasons, that

0:05:51.000 --> 0:05:53.720
<v Speaker 1>would be weird. But at any rate, yeah, I mean,

0:05:53.760 --> 0:05:56.760
<v Speaker 1>we would have essentially people answering the same you know,

0:05:56.760 --> 0:06:00.240
<v Speaker 1>giving the same basic description. Now, there are a lot of

0:06:00.240 --> 0:06:02.719
<v Speaker 1>things going on in that scenario I just described. It's

0:06:02.760 --> 0:06:05.120
<v Speaker 1>not just the fact that you're able to recognize things,

0:06:05.160 --> 0:06:08.279
<v Speaker 1>it's that you're able to draw the conclusion that all

0:06:08.360 --> 0:06:11.520
<v Speaker 1>these different pictures, even though the details are different, are

0:06:11.520 --> 0:06:14.000
<v Speaker 1>showing you essentially the same thing, which is something that

0:06:14.000 --> 0:06:16.360
<v Speaker 1>would be necessary if we were using an image search

0:06:16.960 --> 0:06:21.200
<v Speaker 1>that relied on this automated image description. Right, well, I mean,

0:06:21.240 --> 0:06:24.120
<v Speaker 1>and furthermore, we as humans are able to recognize a

0:06:24.200 --> 0:06:27.000
<v Speaker 1>lady sitting at a table reading a book from any

0:06:27.040 --> 0:06:30.200
<v Speaker 1>angle that it's drawn from. Right, right, pretty easily. Yeah,

0:06:30.200 --> 0:06:32.160
<v Speaker 1>we can tell. We can tell like that this is

0:06:32.240 --> 0:06:34.440
<v Speaker 1>the book, this is the lady, this is the table.

0:06:35.160 --> 0:06:39.600
<v Speaker 1>If you show just pixels to a machine that hasn't

0:06:39.800 --> 0:06:42.520
<v Speaker 1>had any way of telling the difference between, you know,

0:06:42.560 --> 0:06:46.239
<v Speaker 1>what these pixels actually mean, they might not... that machine

0:06:46.279 --> 0:06:47.880
<v Speaker 1>might not be able to tell that there are distinct

0:06:48.000 --> 0:06:50.479
<v Speaker 1>elements in that picture, right, it may all just look

0:06:50.520 --> 0:06:53.760
<v Speaker 1>like one thing. So there are a lot of complications here, uh,

0:06:54.040 --> 0:06:56.640
<v Speaker 1>same sort of We brought this example up a few times,

0:06:56.680 --> 0:06:58.880
<v Speaker 1>same sort of thing, like I know what a cup

0:06:59.000 --> 0:07:00.679
<v Speaker 1>is because I've seen a cup, and I was told

0:07:00.880 --> 0:07:04.560
<v Speaker 1>this is a cup, and you're able to extrapolate many

0:07:04.560 --> 0:07:07.839
<v Speaker 1>different kinds of cups. Exactly. You have an ideal in

0:07:07.880 --> 0:07:10.559
<v Speaker 1>your head. It's sort of the platonic ideal of a cup,

0:07:10.600 --> 0:07:13.760
<v Speaker 1>and there are many ways that an actual cup can

0:07:14.000 --> 0:07:17.600
<v Speaker 1>vary that theme, but somehow you always recognize the theme.

0:07:17.760 --> 0:07:21.000
<v Speaker 1>I recognize every single cup in existence as an imperfect

0:07:21.080 --> 0:07:25.840
<v Speaker 1>realization of the ideal cup that's in my mind. Uh so,

0:07:26.520 --> 0:07:28.680
<v Speaker 1>which has Grimace on it, by the way. But at

0:07:28.720 --> 0:07:33.200
<v Speaker 1>any rate, Yeah, and so the again, a computer, you

0:07:33.240 --> 0:07:36.840
<v Speaker 1>could if you gave it an image and you programmed

0:07:36.840 --> 0:07:41.640
<v Speaker 1>in some software and said, this particular image that you're

0:07:41.640 --> 0:07:44.560
<v Speaker 1>seeing here is a cup. But then you took a

0:07:44.600 --> 0:07:47.560
<v Speaker 1>totally different kind of cup, different shape, different size, different color.

0:07:48.200 --> 0:07:50.640
<v Speaker 1>The computer is not necessarily going to know what that is,

0:07:51.040 --> 0:07:53.080
<v Speaker 1>right right, it's going to say, well, this one's blue. Yeah,

0:07:53.120 --> 0:07:56.320
<v Speaker 1>you need a huge shorter and it's from a different angle.

0:07:56.480 --> 0:07:57.960
<v Speaker 1>It's it's one of those things where you you know

0:07:58.600 --> 0:08:02.480
<v Speaker 1>there's a difficult problem and there's not necessarily a simple

0:08:02.520 --> 0:08:06.560
<v Speaker 1>solution to fix it so the computer understands what's going on.

0:08:07.280 --> 0:08:09.200
<v Speaker 1>So the point I'm trying to make is that this

0:08:09.280 --> 0:08:12.120
<v Speaker 1>is a non-trivial computer problem that a lot of

0:08:12.120 --> 0:08:14.200
<v Speaker 1>people have worked on for a long time, and it's

0:08:14.560 --> 0:08:18.560
<v Speaker 1>absolutely amazing to see how much progress we've made. Absolutely

0:08:18.640 --> 0:08:21.800
<v Speaker 1>and furthermore, this is only half of the issue that

0:08:21.800 --> 0:08:24.320
<v Speaker 1>we're dealing with here overall. Because once, I mean, once

0:08:24.400 --> 0:08:27.840
<v Speaker 1>you can teach a computer to identify, for example, a cup,

0:08:28.640 --> 0:08:32.160
<v Speaker 1>how do you get it to to explain what that

0:08:32.200 --> 0:08:34.600
<v Speaker 1>cup is doing in relation to the other things? How

0:08:34.640 --> 0:08:37.360
<v Speaker 1>does it describe it in a way that actually makes

0:08:37.400 --> 0:08:41.360
<v Speaker 1>sense to... That's a natural language problem. Exactly, right, because

0:08:41.400 --> 0:08:44.840
<v Speaker 1>the computer doesn't think in English or whatever language you

0:08:44.840 --> 0:08:47.959
<v Speaker 1>want it to spit out, right? Yeah, that's and we've talked

0:08:48.000 --> 0:08:50.720
<v Speaker 1>about natural language issues as well, the idea that machine

0:08:50.800 --> 0:08:53.480
<v Speaker 1>language and natural language are extremely different, and in fact,

0:08:53.800 --> 0:08:58.440
<v Speaker 1>programming languages are a bridge between pure machine language and

0:08:58.640 --> 0:09:01.040
<v Speaker 1>human language. You might not think it if you're not

0:09:01.080 --> 0:09:04.800
<v Speaker 1>a programmer and you look at raw code, you might think, well,

0:09:04.840 --> 0:09:08.440
<v Speaker 1>this isn't language that any human would understand, But in

0:09:08.480 --> 0:09:11.160
<v Speaker 1>fact that's precisely what it is. It's meant to be

0:09:11.679 --> 0:09:16.480
<v Speaker 1>that that bridging material. Uh So, getting computers so that

0:09:16.520 --> 0:09:20.920
<v Speaker 1>they can interpret the natural language innately is a very

0:09:21.000 --> 0:09:24.960
<v Speaker 1>challenging issue. We've said it before that you can word

0:09:25.160 --> 0:09:30.319
<v Speaker 1>the exact same thought numerous ways. Human language is highly,

0:09:30.440 --> 0:09:34.280
<v Speaker 1>highly redundant. Yes, there are all different kinds, and the differences

0:09:34.360 --> 0:09:38.240
<v Speaker 1>in in word choice might express subtle differences in tone

0:09:38.480 --> 0:09:41.440
<v Speaker 1>and things like that, but you can basically describe the

0:09:41.480 --> 0:09:47.880
<v Speaker 1>same thing a jillion different ways different spellings. So, for example,

0:09:48.320 --> 0:09:51.319
<v Speaker 1>if you're playing an old text adventure game like Zork, yes,

0:09:51.760 --> 0:09:55.679
<v Speaker 1>you could type walk down the hall or go down

0:09:55.679 --> 0:09:59.959
<v Speaker 1>the hall, and that text parsing system might be smart

0:10:00.160 --> 0:10:02.920
<v Speaker 1>enough to know either one and get you down the hall.

0:10:03.040 --> 0:10:05.200
<v Speaker 1>So like, okay, I know what the person just said,

0:10:05.360 --> 0:10:08.240
<v Speaker 1>but if you type mosey on down the corridor, it

0:10:08.360 --> 0:10:11.320
<v Speaker 1>may very well say I don't know what you're talking about. Right,

0:10:11.360 --> 0:10:13.559
<v Speaker 1>What did Zork say when you said something? I'm sorry,

0:10:13.559 --> 0:10:16.000
<v Speaker 1>I don't understand what you mean, or something along those lines,

0:10:16.480 --> 0:10:18.920
<v Speaker 1>or you were eaten by a grue if it's too

0:10:19.000 --> 0:10:23.679
<v Speaker 1>frustrated at you, if you're too confusing too often. Grue! Yeah, no,

0:10:24.000 --> 0:10:26.640
<v Speaker 1>but that's a great example the idea that you know, obviously,

0:10:26.640 --> 0:10:30.600
<v Speaker 1>those those programs were only capable of accepting commands that

0:10:30.640 --> 0:10:34.120
<v Speaker 1>had been pre programmed into them, and anything that went

0:10:34.160 --> 0:10:38.800
<v Speaker 1>outside those parameters was an error. It was missing. The

0:10:39.200 --> 0:10:41.920
<v Speaker 1>message we would get is I didn't understand what you

0:10:41.960 --> 0:10:43.960
<v Speaker 1>had to say, but in reality it could have just

0:10:44.040 --> 0:10:47.320
<v Speaker 1>as well been "file not found." Right. Okay, so now

0:10:47.679 --> 0:10:51.920
<v Speaker 1>we're combining these two different artificial intelligence problems. On one hand,

0:10:52.000 --> 0:10:56.400
<v Speaker 1>the complex problem of looking at a scene and recognizing

0:10:56.440 --> 0:10:59.520
<v Speaker 1>what's going on there, connecting that to other images and context,

0:10:59.559 --> 0:11:01.640
<v Speaker 1>and the other one making sense of it in a

0:11:01.679 --> 0:11:05.640
<v Speaker 1>written language. And uh, why does this matter? Yeah? Why

0:11:05.640 --> 0:11:07.160
<v Speaker 1>why do we even want to do this if it's

0:11:07.200 --> 0:11:09.280
<v Speaker 1>so difficult? A lot a lot of reasons. One of

0:11:09.320 --> 0:11:11.640
<v Speaker 1>the Well, first of all, we talked about image search

0:11:11.720 --> 0:11:15.199
<v Speaker 1>and just being able to to automate this would make

0:11:15.240 --> 0:11:18.160
<v Speaker 1>image search way more efficient for the things that we're

0:11:18.200 --> 0:11:20.960
<v Speaker 1>looking for. We could be much more specific. Oh man,

0:11:21.080 --> 0:11:22.720
<v Speaker 1>if only I could tell it. So part of my

0:11:22.920 --> 0:11:26.079
<v Speaker 1>job is searching for stock images that I end up

0:11:26.080 --> 0:11:30.559
<v Speaker 1>publishing on our website and and human labeling of stock images.

0:11:30.600 --> 0:11:32.280
<v Speaker 1>If you guys have never had to search for stock

0:11:32.320 --> 0:11:34.720
<v Speaker 1>images for your job before, let me tell you it's

0:11:34.720 --> 0:11:38.160
<v Speaker 1>one of the most joyous and terrifying things on the planet.

0:11:38.160 --> 0:11:40.880
<v Speaker 1>Because no matter what you type in, what you're going

0:11:40.920 --> 0:11:44.120
<v Speaker 1>to get back is some weird clip art, some sexy

0:11:44.200 --> 0:11:46.280
<v Speaker 1>ladies doing some stuff that may or may not have

0:11:46.320 --> 0:11:49.040
<v Speaker 1>anything to do with what you just said and may not,

0:11:49.360 --> 0:11:52.120
<v Speaker 1>and and maybe what you are actually looking for. You might

0:11:52.160 --> 0:11:56.720
<v Speaker 1>also find some like truly weird images, like an overweight,

0:11:57.120 --> 0:11:59.840
<v Speaker 1>shirtless gentleman sitting at a table with a paper bag

0:12:00.000 --> 0:12:02.160
<v Speaker 1>over his head, knife and fork in his hands, eating

0:12:02.160 --> 0:12:05.920
<v Speaker 1>a cartoon hamster. I mean, it's some weird stuff on there. Seriously,

0:12:06.000 --> 0:12:09.080
<v Speaker 1>I am almost positive that there is a stock photograph

0:12:09.240 --> 0:12:11.920
<v Speaker 1>out there somewhere of somebody throwing a sandwich off a

0:12:11.960 --> 0:12:14.760
<v Speaker 1>cliff for like they've made it for diet purposes or

0:12:14.840 --> 0:12:19.439
<v Speaker 1>something like that. Didn't get endless examples of that by that, right?

0:12:19.520 --> 0:12:22.160
<v Speaker 1>If I searched for man throwing a sandwich off a cliff,

0:12:22.200 --> 0:12:24.000
<v Speaker 1>I should have tried this before we started, but I'm

0:12:24.000 --> 0:12:26.400
<v Speaker 1>almost positive I would not get that. Instead, I get

0:12:26.440 --> 0:12:29.960
<v Speaker 1>a lady in a bikini sitting on a pinball machine. Yeah,

0:12:30.000 --> 0:12:34.600
<v Speaker 1>that's that's pretty accurate. So just increasing the accuracy of

0:12:34.720 --> 0:12:38.360
<v Speaker 1>image search would be one reason, right, And and there

0:12:38.400 --> 0:12:40.920
<v Speaker 1>are lots of different reasons why our our image searches

0:12:41.000 --> 0:12:44.160
<v Speaker 1>on these things are imperfect. A large part of it

0:12:44.200 --> 0:12:46.400
<v Speaker 1>is that you've got people gaming the system. They're essentially

0:12:46.400 --> 0:12:48.600
<v Speaker 1>putting in any tag word they can possibly think of

0:12:49.120 --> 0:12:51.280
<v Speaker 1>because they want their images to be the ones that

0:12:51.320 --> 0:12:54.679
<v Speaker 1>are purchased. But but if you want to play fair,

0:12:54.800 --> 0:12:58.040
<v Speaker 1>then having an automated description would be best because you

0:12:58.040 --> 0:13:00.600
<v Speaker 1>can't curate everything by human eye. It would just take

0:13:00.920 --> 0:13:03.440
<v Speaker 1>too long. We're generating too much content for that to

0:13:03.480 --> 0:13:06.880
<v Speaker 1>be a realistic possibility. But another is that it could

0:13:06.920 --> 0:13:09.599
<v Speaker 1>be a huge help for people who have visual impairments.

0:13:09.679 --> 0:13:14.400
<v Speaker 1>So someone who is reading a news story, you know,

0:13:14.440 --> 0:13:17.080
<v Speaker 1>someone who has who has some sort of visual impairment,

0:13:17.120 --> 0:13:20.680
<v Speaker 1>maybe they're blind, and there could be pictures that give

0:13:20.880 --> 0:13:23.960
<v Speaker 1>more context to whatever the story is, but they miss

0:13:24.000 --> 0:13:26.720
<v Speaker 1>out on that if there's not an actual description of

0:13:26.760 --> 0:13:29.520
<v Speaker 1>what that picture is. Particularly, I mean, there's some content

0:13:29.559 --> 0:13:33.880
<v Speaker 1>out there where the caption might be playful but doesn't

0:13:33.920 --> 0:13:37.840
<v Speaker 1>actually tell you what the picture is. Absolutely, so this

0:13:37.920 --> 0:13:40.000
<v Speaker 1>would be a big help for people who are in

0:13:40.040 --> 0:13:42.960
<v Speaker 1>that situation. It also could speed up web access for

0:13:43.000 --> 0:13:46.560
<v Speaker 1>people who have limited connectivity to the Internet. Perhaps it's

0:13:46.600 --> 0:13:49.440
<v Speaker 1>over through a cellular network, and it may be that

0:13:49.520 --> 0:13:51.679
<v Speaker 1>there's some important information they need to get hold of.

0:13:51.720 --> 0:13:54.200
<v Speaker 1>But you know, if they're trying to load pictures, it's

0:13:54.280 --> 0:13:56.240
<v Speaker 1>just taking too long for it to load any kind

0:13:56.240 --> 0:13:58.680
<v Speaker 1>of thing. You could have a quick summary of those pictures.

0:13:58.679 --> 0:14:01.079
<v Speaker 1>That would really speed things up. Because actually, I also

0:14:01.120 --> 0:14:06.240
<v Speaker 1>think it's just an important contribution to general artificial intelligence. Absolutely,

0:14:06.240 --> 0:14:08.600
<v Speaker 1>if you're trying to create a system that can mimic

0:14:08.679 --> 0:14:10.920
<v Speaker 1>all of the functions of the human mind, well, one

0:14:10.960 --> 0:14:14.440
<v Speaker 1>of the main things humans do is look at something

0:14:14.520 --> 0:14:17.560
<v Speaker 1>and describe it, right, and you know, the description is

0:14:17.600 --> 0:14:20.720
<v Speaker 1>just part of it. There's also the interaction, right by

0:14:20.760 --> 0:14:24.800
<v Speaker 1>by recognizing things in our environment, we know how to proceed.

0:14:25.360 --> 0:14:28.520
<v Speaker 1>We can make decisions on how to proceed. So if,

0:14:28.600 --> 0:14:31.800
<v Speaker 1>for example, we walk into a room and we noticed

0:14:31.840 --> 0:14:34.840
<v Speaker 1>that there are a lot of pedestals around us, and

0:14:34.880 --> 0:14:38.040
<v Speaker 1>they're delicate vases on each pedestal, we know not to

0:14:38.080 --> 0:14:41.640
<v Speaker 1>go swinging our arms everywhere willy-nilly. But you know,

0:14:42.120 --> 0:14:45.720
<v Speaker 1>a robot would not necessarily be able to tell that

0:14:46.360 --> 0:14:48.680
<v Speaker 1>a vase sitting on a pedestal was not in

0:14:48.760 --> 0:14:52.440
<v Speaker 1>fact a single piece. It might it might interpret that

0:14:52.520 --> 0:14:55.240
<v Speaker 1>as a column. That's a good point. You know, or

0:14:55.320 --> 0:14:57.720
<v Speaker 1>that there's even even if it could recognize that was

0:14:58.000 --> 0:15:01.720
<v Speaker 1>an object sitting on another object, that the vase was delicate

0:15:01.840 --> 0:15:05.520
<v Speaker 1>or that it was worth not smashing. Right, yeah, robots

0:15:05.520 --> 0:15:08.760
<v Speaker 1>Teaching robots value is a very tricky thing. Also, they

0:15:08.800 --> 0:15:13.640
<v Speaker 1>hate vases. They do. It's I think, I think you

0:15:13.720 --> 0:15:18.360
<v Speaker 1>program them, I think programming. All well, it's because we

0:15:18.640 --> 0:15:23.360
<v Speaker 1>gave them the basic personality of Gallagher is the problem. Okay, okay,

0:15:23.440 --> 0:15:28.600
<v Speaker 1>so we've so automated image description is a very difficult problem,

0:15:28.680 --> 0:15:31.120
<v Speaker 1>but it's also very worth solving. But in the long

0:15:31.200 --> 0:15:34.240
<v Speaker 1>term for artificial intelligence, and in some specific cases, in

0:15:34.240 --> 0:15:37.560
<v Speaker 1>the short term. Who's actually working on this? Where did

0:15:37.560 --> 0:15:40.960
<v Speaker 1>this come from? Well, I mean there are lots of

0:15:41.000 --> 0:15:44.560
<v Speaker 1>different people in computer science working on this problem. But

0:15:44.680 --> 0:15:48.160
<v Speaker 1>the thing that kind of spurred on this particular

0:15:48.200 --> 0:15:51.240
<v Speaker 1>podcast you discovered, right, Yeah, it was well, as I said,

0:15:51.240 --> 0:15:54.600
<v Speaker 1>it was through Alexis Madrigal's Five Intriguing Things newsletter, and

0:15:54.640 --> 0:15:59.040
<v Speaker 1>it was a link to the Google Research blog, which

0:15:59.200 --> 0:16:02.920
<v Speaker 1>is a cool little blog. Some of it's definitely over

0:16:02.960 --> 0:16:05.840
<v Speaker 1>the average reader's head, but it's also just very interesting.

0:16:05.920 --> 0:16:09.440
<v Speaker 1>And yeah, yeah, um, this specific blog entry is from

0:16:09.560 --> 0:16:13.600
<v Speaker 1>Google UK uh and it was posted by a bunch

0:16:13.640 --> 0:16:17.320
<v Speaker 1>of research scientists. It's it's one of those things where

0:16:17.400 --> 0:16:19.600
<v Speaker 1>if you go into the Google research blog, they do

0:16:19.680 --> 0:16:23.600
<v Speaker 1>get um more technical than your average blog does. They're

0:16:23.640 --> 0:16:28.240
<v Speaker 1>not so technical as to be completely incomprehensible, but I

0:16:28.240 --> 0:16:30.880
<v Speaker 1>will say they're really good about linking out to terms

0:16:30.920 --> 0:16:33.040
<v Speaker 1>that you might be unfamiliar with, which is important because I

0:16:33.080 --> 0:16:35.680
<v Speaker 1>had to click on every single one of those links.

0:16:36.200 --> 0:16:39.120
<v Speaker 1>I did so much reading for this one blog post,

0:16:39.120 --> 0:16:41.480
<v Speaker 1>so I could really get a handle on what they

0:16:41.480 --> 0:16:44.800
<v Speaker 1>were saying. But again, it illustrates the complexity of the problem.

0:16:45.360 --> 0:16:47.680
<v Speaker 1>So again we don't mean to suggest that these Google

0:16:47.720 --> 0:16:50.200
<v Speaker 1>researchers are the only people working on this problem, or

0:16:50.240 --> 0:16:52.440
<v Speaker 1>that their approach is the only way to do it.

0:16:53.280 --> 0:16:56.160
<v Speaker 1>We're concentrating on it because it was really well documented

0:16:56.360 --> 0:16:58.800
<v Speaker 1>and it was just published on November seventeenth. We are

0:16:58.840 --> 0:17:01.800
<v Speaker 1>recording this on November twenty-first, so it was

0:17:01.880 --> 0:17:05.080
<v Speaker 1>of immediate interest to us. Yes, okay, so how are

0:17:05.119 --> 0:17:08.640
<v Speaker 1>we doing this? Well? First, you have to identify what

0:17:09.119 --> 0:17:10.960
<v Speaker 1>needs to be done before you can figure out how

0:17:11.000 --> 0:17:13.159
<v Speaker 1>to do it right. You have to figure out the

0:17:12.800 --> 0:17:15.520
<v Speaker 1>the things that have to happen in order for this

0:17:15.600 --> 0:17:19.440
<v Speaker 1>to be a possibility. And they identified several things, including

0:17:19.480 --> 0:17:23.080
<v Speaker 1>computer vision, which is how machines acquire and analyze images.

0:17:23.119 --> 0:17:24.960
<v Speaker 1>So how do they get the images in the first place.

0:17:25.359 --> 0:17:29.359
<v Speaker 1>Is it purely through code? Is it actual visual you know?

0:17:29.600 --> 0:17:32.520
<v Speaker 1>Is like like a camera system. I mean, if you're

0:17:32.520 --> 0:17:34.920
<v Speaker 1>talking about robotics, then it's probably a camera system because

0:17:34.920 --> 0:17:37.560
<v Speaker 1>they're looking around in their environment. It could just be

0:17:37.640 --> 0:17:40.680
<v Speaker 1>sampling from the Internet or something like that. Right, right. Well,

0:17:40.880 --> 0:17:44.800
<v Speaker 1>with with something like an automated search, it could all

0:17:45.000 --> 0:17:48.119
<v Speaker 1>be code. Like it could be that there's no quote

0:17:48.160 --> 0:17:52.000
<v Speaker 1>unquote looking at the image, right, but so there's that.

0:17:52.040 --> 0:17:56.080
<v Speaker 1>There's also object detection, which sounds really easy, but it's

0:17:56.200 --> 0:17:58.760
<v Speaker 1>incredibly hard. So this is what I was talking about,

0:17:58.800 --> 0:18:01.680
<v Speaker 1>being able to recognize individual objects within an image.

0:18:02.000 --> 0:18:05.840
<v Speaker 1>So what separates an object from its background? If I've

0:18:05.840 --> 0:18:08.440
<v Speaker 1>got a shot, a top down shot of a table,

0:18:08.800 --> 0:18:11.080
<v Speaker 1>and there's a book sitting on the middle of that table,

0:18:11.640 --> 0:18:14.600
<v Speaker 1>then when I look down, I can see that there's

0:18:14.640 --> 0:18:16.479
<v Speaker 1>a book and there's a table, and I recognize those

0:18:16.560 --> 0:18:18.960
<v Speaker 1>as two different things. But like I was saying before,

0:18:19.080 --> 0:18:21.600
<v Speaker 1>if it's a machine and it doesn't have this way

0:18:21.600 --> 0:18:24.720
<v Speaker 1>of of telling the difference there, it may just think

0:18:24.720 --> 0:18:27.760
<v Speaker 1>of that as a pattern that's on a table, you know,

0:18:27.920 --> 0:18:30.800
<v Speaker 1>or even a raised part of that table. If it

0:18:30.840 --> 0:18:33.600
<v Speaker 1>can detect depth. It's not a cut

0:18:33.640 --> 0:18:36.120
<v Speaker 1>and dry thing. So getting to a point where you can

0:18:36.160 --> 0:18:38.960
<v Speaker 1>have a computer that can tell that there are multiple

0:18:38.960 --> 0:18:42.000
<v Speaker 1>objects within a scene, that's already a challenge, although we've

0:18:42.200 --> 0:18:46.640
<v Speaker 1>come a long way toward actually doing that. But imagine

0:18:46.680 --> 0:18:48.359
<v Speaker 1>that you are looking at if you want to think

0:18:48.359 --> 0:18:50.800
<v Speaker 1>about how hard this is, imagine you're looking at

0:18:50.800 --> 0:18:54.439
<v Speaker 1>an overgrown field and there's someone in a ghillie

0:18:54.480 --> 0:18:56.960
<v Speaker 1>suit out there. Ghillie suits are those camo suits that

0:18:57.000 --> 0:19:00.000
<v Speaker 1>have all the plant type material hanging off of them.

0:19:00.119 --> 0:19:02.280
<v Speaker 1>Some of it's artificial, some of it may actually be

0:19:02.320 --> 0:19:06.960
<v Speaker 1>gathered from wherever you're going. Those camouflage suits are really convincing.

0:19:07.000 --> 0:19:09.440
<v Speaker 1>It's really hard to pick out someone who's good

0:19:09.440 --> 0:19:12.840
<v Speaker 1>at being, uh, covert. You may not even know

0:19:12.920 --> 0:19:16.760
<v Speaker 1>that there's a person there. And so, as hard

0:19:16.760 --> 0:19:19.359
<v Speaker 1>as that is for us, that's about the equivalent of

0:19:19.400 --> 0:19:22.040
<v Speaker 1>looking at a book on a table for a computer. Exactly. Yeah.

0:19:22.080 --> 0:19:26.360
<v Speaker 1>You know, until you're able to teach a machine how

0:19:26.440 --> 0:19:29.560
<v Speaker 1>to how to see in a way, then it's going

0:19:29.680 --> 0:19:32.200
<v Speaker 1>to be just as difficult

0:19:32.240 --> 0:19:33.960
<v Speaker 1>to detect that as it would be for us to

0:19:33.960 --> 0:19:35.600
<v Speaker 1>see that guy in the ghillie suit in the middle

0:19:35.640 --> 0:19:39.000
<v Speaker 1>of the overgrown field. Um. But once you do get

0:19:39.040 --> 0:19:40.960
<v Speaker 1>to that, you still have other things to keep in mind,

0:19:40.960 --> 0:19:43.959
<v Speaker 1>like classification and labeling. So this is a measurement of

0:19:44.000 --> 0:19:48.520
<v Speaker 1>how accurately a program can assign correct labels to an image. Uh.

0:19:48.640 --> 0:19:50.960
<v Speaker 1>I love the example, like if you looked at certain

0:19:51.000 --> 0:19:53.280
<v Speaker 1>examples within the Google blog post, it eventually took you

0:19:53.320 --> 0:19:56.240
<v Speaker 1>to a picture of a dog wearing a sombrero, so

0:19:56.320 --> 0:19:59.480
<v Speaker 1>you had dog hat. There's a hat on a dog,

0:20:00.000 --> 0:20:04.439
<v Speaker 1>a wide brimmed hat. Yeah, these are all important elements.

0:20:04.520 --> 0:20:07.639
<v Speaker 1>That's part of that classification and labeling. Uh. You know,

0:20:07.680 --> 0:20:11.119
<v Speaker 1>instead of just saying that's one ugly dog because not

0:20:11.119 --> 0:20:15.840
<v Speaker 1>being a weird growth, yeah, that would be that would

0:20:15.840 --> 0:20:18.560
<v Speaker 1>be more than that. It's it's uh dog with the

0:20:18.600 --> 0:20:22.440
<v Speaker 1>hat and not just dog with something on it. Yeah,

0:20:22.560 --> 0:20:26.080
<v Speaker 1>so again not dog with stack of pancakes. Yeah, And

0:20:26.119 --> 0:20:29.080
<v Speaker 1>identifying the fact that there is that relationship that there

0:20:29.200 --> 0:20:31.920
<v Speaker 1>is a hat on a dog, not just that there's

0:20:31.960 --> 0:20:34.320
<v Speaker 1>a hat and a dog in the picture, but how

0:20:34.359 --> 0:20:36.960
<v Speaker 1>do those two objects relate to one another within the

0:20:37.000 --> 0:20:41.200
<v Speaker 1>context of that picture. So that's also pretty cool. So

0:20:41.320 --> 0:20:44.200
<v Speaker 1>as we've said, lots of folks have been working on

0:20:44.760 --> 0:20:49.439
<v Speaker 1>the problem of how to do this thing, and a

0:20:49.520 --> 0:20:52.520
<v Speaker 1>common approach has been working on linking computer systems that

0:20:52.600 --> 0:20:55.600
<v Speaker 1>understand what's going on in pictures with computer systems that

0:20:55.680 --> 0:20:59.120
<v Speaker 1>understand what's going on in sentences, and letting the two

0:20:59.280 --> 0:21:03.200
<v Speaker 1>match up images and phrases. But these scientists at Google

0:21:03.400 --> 0:21:06.080
<v Speaker 1>were approaching it from a different way. They're trying to

0:21:06.119 --> 0:21:09.720
<v Speaker 1>create a system where the two halves work together directly

0:21:09.840 --> 0:21:13.680
<v Speaker 1>with the same data, rather than comparing and contrasting two

0:21:13.800 --> 0:21:16.520
<v Speaker 1>separate sets of data. Okay, um, so they were. They

0:21:16.520 --> 0:21:19.880
<v Speaker 1>were inspired to do this by recent language translation research

0:21:19.960 --> 0:21:22.520
<v Speaker 1>in which one half of the system would create a

0:21:22.600 --> 0:21:26.119
<v Speaker 1>diagram of a sentence in one language, say English, and

0:21:26.280 --> 0:21:28.919
<v Speaker 1>the second half would look at that diagram and generate

0:21:28.960 --> 0:21:32.200
<v Speaker 1>a sentence from it in another language, say French. Yeah.

0:21:32.200 --> 0:21:34.919
<v Speaker 1>And again this was a part of classification. It wasn't

0:21:35.080 --> 0:21:38.639
<v Speaker 1>It wasn't just a word to word, you know, what

0:21:38.800 --> 0:21:41.040
<v Speaker 1>is the what is the analogous word in this other

0:21:41.119 --> 0:21:45.199
<v Speaker 1>language for this the one that we're detecting here, but

0:21:45.400 --> 0:21:48.439
<v Speaker 1>rather what is the meaning of this sentence and what

0:21:48.640 --> 0:21:51.479
<v Speaker 1>is the what is the phrase in this other language

0:21:51.480 --> 0:21:54.760
<v Speaker 1>that has that same meaning exactly, because because doing word

0:21:54.760 --> 0:21:57.520
<v Speaker 1>for word translations, if you've ever worked in another language,

0:21:57.600 --> 0:21:59.920
<v Speaker 1>often doesn't work at all. Right, and you realize as you

0:22:00.040 --> 0:22:02.159
<v Speaker 1>say it to someone who's a native speaker of that language,

0:22:02.160 --> 0:22:04.479
<v Speaker 1>they say, that's a very weird way of putting what

0:22:04.560 --> 0:22:07.960
<v Speaker 1>you just said. Um, so yeah, it's it's a really

0:22:08.000 --> 0:22:13.919
<v Speaker 1>interesting idea, and the way they do it is particularly technical.

0:22:14.040 --> 0:22:15.920
<v Speaker 1>So I'm going to take this from a very kind

0:22:15.920 --> 0:22:19.320
<v Speaker 1>of high level because, well, first and foremost, I gotta

0:22:19.359 --> 0:22:22.840
<v Speaker 1>be totally honest, it gets so technical. I definitely don't

0:22:22.880 --> 0:22:26.040
<v Speaker 1>understand all of it. So we're talking about neural networks. Yeah,

0:22:26.080 --> 0:22:27.920
<v Speaker 1>so I'm trying to take it from a high enough

0:22:28.000 --> 0:22:30.720
<v Speaker 1>level where I feel like I still have a general

0:22:31.240 --> 0:22:35.080
<v Speaker 1>grip on what's happening. But if I dove down

0:22:35.080 --> 0:22:38.040
<v Speaker 1>any deeper, then I would most likely be giving at

0:22:38.119 --> 0:22:42.119
<v Speaker 1>least equal amounts of information and misinformation. Yeah. Yeah, but so,

0:22:42.240 --> 0:22:45.600
<v Speaker 1>like you just said, we're talking about artificial neural networks,

0:22:45.640 --> 0:22:48.760
<v Speaker 1>which are the systems that these language researchers were using,

0:22:48.760 --> 0:22:50.880
<v Speaker 1>and they decided to go with the same thing. Yeah,

0:22:50.920 --> 0:22:55.280
<v Speaker 1>they're systems that are attempting to replicate something that happens

0:22:55.320 --> 0:23:00.119
<v Speaker 1>inside an organic brain. Yeah, so anything that is an

0:23:00.160 --> 0:23:04.080
<v Speaker 1>artificial neural network is trying to to kind of mimic

0:23:04.200 --> 0:23:08.080
<v Speaker 1>nature essentially at some level. Some of them are more

0:23:08.840 --> 0:23:11.160
<v Speaker 1>like the neural networks you would see in our brains.

0:23:11.200 --> 0:23:15.080
<v Speaker 1>Some of them react like neurons, but they are arranged

0:23:15.119 --> 0:23:17.120
<v Speaker 1>in a way that's very different from the way our

0:23:17.160 --> 0:23:20.439
<v Speaker 1>brains are arranged. So, uh, you know, it's when we

0:23:20.520 --> 0:23:22.719
<v Speaker 1>say these neural networks, just keep in mind we're not

0:23:22.800 --> 0:23:26.440
<v Speaker 1>necessarily talking about an artificial brain. It's not the same thing.

0:23:26.520 --> 0:23:30.159
<v Speaker 1>It's more like the tiny constituents, you know, millions of

0:23:30.160 --> 0:23:32.880
<v Speaker 1>which make up a brain. Right. Think of each neuron

0:23:33.040 --> 0:23:37.480
<v Speaker 1>as capable of doing various processes, whatever those

0:23:37.480 --> 0:23:40.320
<v Speaker 1>processes might be, on information that comes to it, and

0:23:40.320 --> 0:23:42.840
<v Speaker 1>then send it on to other neurons, which will then

0:23:43.640 --> 0:23:45.959
<v Speaker 1>add their own element of whatever it may be. So

0:23:46.000 --> 0:23:48.280
<v Speaker 1>one of them might be all right, whenever I get

0:23:48.280 --> 0:23:52.240
<v Speaker 1>input from uh, this other neuron, I know to perform

0:23:52.320 --> 0:23:56.440
<v Speaker 1>this specific mathematic process and then I know to pass

0:23:56.440 --> 0:23:59.160
<v Speaker 1>it on to this other neuron. And uh, even that's

0:23:59.160 --> 0:24:01.640
<v Speaker 1>an oversimplification, but it gives you kind of an idea

0:24:01.680 --> 0:24:04.040
<v Speaker 1>of what's going on. So one of the two types

0:24:04.080 --> 0:24:06.439
<v Speaker 1>of artificial neural networks used by Google, and keep in

0:24:06.480 --> 0:24:10.520
<v Speaker 1>mind there are lots of different variations of artificial neural networks.

0:24:11.040 --> 0:24:15.480
<v Speaker 1>It's called a convolutional neural network. And it's funny because

0:24:15.520 --> 0:24:17.440
<v Speaker 1>when I think convoluted, I don't think of that as

0:24:17.440 --> 0:24:21.040
<v Speaker 1>being a positive thing. But convolutional neural network in this

0:24:21.080 --> 0:24:24.440
<v Speaker 1>case is a feed forward neural network, which means there's

0:24:24.440 --> 0:24:27.600
<v Speaker 1>an actual pathway to follow that has a beginning and

0:24:27.720 --> 0:24:30.040
<v Speaker 1>an end. So think of it like, uh, you know,

0:24:30.119 --> 0:24:33.800
<v Speaker 1>it's got a start and a destination. Uh, and

0:24:33.800 --> 0:24:36.160
<v Speaker 1>and things always begin at the start and they always

0:24:36.240 --> 0:24:39.680
<v Speaker 1>end up at the destination. And the pathway has lots

0:24:39.680 --> 0:24:43.320
<v Speaker 1>of neurons along it that can do work upon whatever

0:24:43.359 --> 0:24:47.199
<v Speaker 1>the input is. So you have this is the start

0:24:47.840 --> 0:24:51.800
<v Speaker 1>um and it's essentially there to classify objects within an image,

0:24:52.040 --> 0:24:55.199
<v Speaker 1>and that data ends up getting encoded according to however

0:24:55.280 --> 0:24:58.600
<v Speaker 1>you've programmed that network. All right, you wind up with

0:24:58.640 --> 0:25:01.800
<v Speaker 1>this with this really huge amount of data coming off

0:25:01.800 --> 0:25:04.119
<v Speaker 1>of a single image, right, and all of that gets

0:25:04.119 --> 0:25:08.720
<v Speaker 1>fed into the second artificial neural network, which is called

0:25:08.720 --> 0:25:12.480
<v Speaker 1>a recurrent neural network, which is not a direct pathway.

0:25:12.520 --> 0:25:15.080
<v Speaker 1>This is an interconnected model that creates what is

0:25:15.119 --> 0:25:18.480
<v Speaker 1>called a directed cycle. So there are several subsets of

0:25:18.520 --> 0:25:21.359
<v Speaker 1>this kind of network. Uh. The fully recurrent network is

0:25:21.400 --> 0:25:23.760
<v Speaker 1>probably the easiest to imagine. That's one where every

0:25:23.760 --> 0:25:27.560
<v Speaker 1>single neuron has a direct connection, or directed connection,

0:25:28.040 --> 0:25:31.480
<v Speaker 1>with every single other neuron within the network. Uh. Now,

0:25:31.680 --> 0:25:35.440
<v Speaker 1>these obviously get way more complicated the more neurons you add.

0:25:35.480 --> 0:25:37.919
<v Speaker 1>This is one of the reasons why having an artificial

0:25:37.960 --> 0:25:41.639
<v Speaker 1>brain model is so hard, because you're talking on the

0:25:41.760 --> 0:25:47.040
<v Speaker 1>order of eighty billion neurons to make a simple artificial brain.

0:25:47.200 --> 0:25:51.719
<v Speaker 1>Eighty billion units that have interconnections with not necessarily every

0:25:51.800 --> 0:25:55.160
<v Speaker 1>other node like every other neuron, but enough to make

0:25:55.200 --> 0:25:59.800
<v Speaker 1>this complicated and slow on the classical computing scale, and

0:25:59.800 --> 0:26:02.679
<v Speaker 1>anyway, getting back to theirs. Uh, this one is

0:26:02.720 --> 0:26:07.120
<v Speaker 1>what will end up uh processing all that information to

0:26:07.200 --> 0:26:10.480
<v Speaker 1>describe the images that were classified from the first one.

0:26:10.520 --> 0:26:13.120
<v Speaker 1>So the first one classifies all the stuff. This one

0:26:13.200 --> 0:26:16.960
<v Speaker 1>is what creates the language used to describe those images

0:26:17.000 --> 0:26:20.720
<v Speaker 1>to make them meaningful to a human audience, so that

0:26:20.840 --> 0:26:24.199
<v Speaker 1>when we get the description, it actually it reflects whatever

0:26:24.280 --> 0:26:26.359
<v Speaker 1>the picture shows. Like, it's not like a

0:26:26.440 --> 0:26:29.119
<v Speaker 1>lady sitting down at a table reading a book and

0:26:29.160 --> 0:26:33.600
<v Speaker 1>it says frog dancing on skyscraper. Well hopefully, I mean,

0:26:33.680 --> 0:26:36.080
<v Speaker 1>I mean the system is certainly not perfect. It's it's

0:26:36.119 --> 0:26:38.800
<v Speaker 1>a really good system, and it's neat because

0:26:39.000 --> 0:26:40.879
<v Speaker 1>these neural networks. The other thing that these do that

0:26:40.920 --> 0:26:44.760
<v Speaker 1>I didn't mention before is they can learn. They can

0:26:44.880 --> 0:26:48.159
<v Speaker 1>you know, once once a process goes through, if you

0:26:48.240 --> 0:26:51.880
<v Speaker 1>start making that process, you know, replicating that process, either

0:26:51.960 --> 0:26:54.399
<v Speaker 1>by feeding the same thing through over and over or feeding

0:26:54.440 --> 0:26:57.080
<v Speaker 1>similar things through, it starts to pick up on that.

0:26:57.119 --> 0:26:59.160
<v Speaker 1>So this is very similar to that idea we had

0:26:59.240 --> 0:27:03.440
<v Speaker 1>about feeding in the thousands and thousands of images

0:27:03.440 --> 0:27:06.720
<v Speaker 1>and videos of cats and how the machine was able

0:27:06.760 --> 0:27:10.040
<v Speaker 1>to learn what a cat was without anyone telling it

0:27:10.119 --> 0:27:13.160
<v Speaker 1>what a cat was. The same sort of thing here.

0:27:13.640 --> 0:27:16.840
<v Speaker 1>It's it's that same process. It ends up kind of

0:27:16.880 --> 0:27:20.919
<v Speaker 1>like a memory. How our memories are pathways of neurons

0:27:21.000 --> 0:27:24.240
<v Speaker 1>that fire in a specific sequence more or less, and

0:27:24.280 --> 0:27:27.200
<v Speaker 1>every time we remember it, we're replicating that as close

0:27:27.200 --> 0:27:31.600
<v Speaker 1>as we can, anyway. Uh, similar to what's going on here.

0:27:32.560 --> 0:27:35.840
<v Speaker 1>Pretty cool. And like I said, to get more

0:27:36.080 --> 0:27:39.679
<v Speaker 1>detailed is beyond me. Yeah, beyond any of us

0:27:39.720 --> 0:27:42.439
<v Speaker 1>sitting at this table. Um, I would check out that

0:27:42.480 --> 0:27:45.240
<v Speaker 1>blog post if you get a chance, because especially because

0:27:45.240 --> 0:27:48.560
<v Speaker 1>it's got a great image where it sort of shows

0:27:48.600 --> 0:27:51.240
<v Speaker 1>the difference between what a good description of an image

0:27:51.240 --> 0:27:54.440
<v Speaker 1>looks like and what a failed description looks like. Yeah,

0:27:54.680 --> 0:27:57.640
<v Speaker 1>gradients in between. Sure, it had this rating system

0:27:57.760 --> 0:28:01.160
<v Speaker 1>or has, it probably hasn't been destroyed since we've made

0:28:01.160 --> 0:28:04.280
<v Speaker 1>this podcast. I hope not. Otherwise we're futile in telling

0:28:04.280 --> 0:28:08.119
<v Speaker 1>you to check it out. We're backward thinkers. Um, but

0:28:08.119 --> 0:28:11.200
<v Speaker 1>but yeah, so so humans have ranked the photos by like, well,

0:28:11.280 --> 0:28:15.119
<v Speaker 1>this totally has this is an accurate description all the

0:28:15.160 --> 0:28:18.600
<v Speaker 1>way to this is not what that is. So, for example,

0:28:18.920 --> 0:28:22.040
<v Speaker 1>like a person riding a motorcycle on a dirt road

0:28:22.600 --> 0:28:25.359
<v Speaker 1>is described with the sentence a person riding a motorcycle

0:28:25.440 --> 0:28:29.000
<v Speaker 1>on a dirt road. Sort of a dirt road. Actually,

0:28:29.000 --> 0:28:31.040
<v Speaker 1>I would say it looks kind of like a motocross track.

0:28:31.119 --> 0:28:34.840
<v Speaker 1>But that's enough. Yeah, okay, And then it would have

0:28:35.080 --> 0:28:38.840
<v Speaker 1>sort of like describes with minor errors. So there was

0:28:38.920 --> 0:28:41.000
<v Speaker 1>one that says close up of a cat laying on

0:28:41.040 --> 0:28:43.800
<v Speaker 1>a couch, it's a cat sitting on a bed. But

0:28:45.240 --> 0:28:48.120
<v Speaker 1>cat in the photo, it's not really a close up,

0:28:48.160 --> 0:28:50.040
<v Speaker 1>but yeah, you know, it's sort of sort of what's

0:28:50.040 --> 0:28:53.479
<v Speaker 1>going on. Um. My favorite was the one that's labeled

0:28:53.520 --> 0:28:56.400
<v Speaker 1>a refrigerator filled with lots of food and drinks and

0:28:56.440 --> 0:28:59.640
<v Speaker 1>in fact it's some kind of parking sign with stickers

0:28:59.640 --> 0:29:02.720
<v Speaker 1>all over it. Yeah, in a busy city tunnel,

0:29:02.760 --> 0:29:05.440
<v Speaker 1>it looks like. And if you look at the images

0:29:05.480 --> 0:29:09.720
<v Speaker 1>the way that they they show how the computer quote

0:29:09.800 --> 0:29:12.920
<v Speaker 1>unquote sees the image, which really it's just it's just

0:29:12.960 --> 0:29:16.600
<v Speaker 1>a visualization for our benefit, but it shows with different

0:29:16.640 --> 0:29:20.440
<v Speaker 1>color boxes around each individual element to show how it

0:29:20.560 --> 0:29:23.120
<v Speaker 1>how it's picking them out and labeling them. It gives you

0:29:23.200 --> 0:29:25.000
<v Speaker 1>an idea like when you when you think of it

0:29:25.040 --> 0:29:27.440
<v Speaker 1>that way, you think, wow, this is a lot more

0:29:27.480 --> 0:29:30.560
<v Speaker 1>complicated than I imagined. I mean, we often think, like,

0:29:30.760 --> 0:29:32.480
<v Speaker 1>I don't know what you guys think. I shouldn't say

0:29:32.480 --> 0:29:36.720
<v Speaker 1>we, I should say I. I often think of automated image description,

0:29:37.080 --> 0:29:41.680
<v Speaker 1>I often will imagine the simplest of images in my head,

0:29:41.840 --> 0:29:44.800
<v Speaker 1>just thinking, like, you know, it would look like an

0:29:44.840 --> 0:29:48.800
<v Speaker 1>Apple commercial, you know, white background with one solitary image

0:29:48.800 --> 0:29:51.360
<v Speaker 1>in the middle, and the description of what that is.

0:29:52.080 --> 0:29:54.920
<v Speaker 1>I don't necessarily think, oh wait, no, this refers to

0:29:55.080 --> 0:29:57.479
<v Speaker 1>every kind of image, including you know, pictures of me

0:29:57.560 --> 0:30:00.600
<v Speaker 1>and my friends at at a restaurant with lots of

0:30:00.640 --> 0:30:03.120
<v Speaker 1>other stuff going on. Like when you think about that

0:30:03.160 --> 0:30:05.000
<v Speaker 1>and all the different elements that can appear in a

0:30:05.080 --> 0:30:08.160
<v Speaker 1>single picture, then you realize, wow, this is this is

0:30:08.240 --> 0:30:12.640
<v Speaker 1>really amazing that they've been able to create anything remotely

0:30:12.680 --> 0:30:15.920
<v Speaker 1>approaching automated image description, because, you know, that's the thing

0:30:16.000 --> 0:30:19.160
<v Speaker 1>that our eyes do. They naturally pick out items of

0:30:19.200 --> 0:30:22.680
<v Speaker 1>interest from visual scenes. So the same sort of thing

0:30:22.880 --> 0:30:25.840
<v Speaker 1>again applies with robotics. I mean, you know, I know

0:30:25.880 --> 0:30:29.959
<v Speaker 1>this is mostly about image description, but the same kind

0:30:30.240 --> 0:30:34.640
<v Speaker 1>of of processing is really important for machines, especially machines

0:30:34.640 --> 0:30:36.960
<v Speaker 1>that are going to be interacting with humans on a

0:30:37.000 --> 0:30:40.680
<v Speaker 1>more frequent basis, to be able to recognize an environment,

0:30:40.680 --> 0:30:43.840
<v Speaker 1>not just to pick out the potential obstacles that a

0:30:44.000 --> 0:30:46.640
<v Speaker 1>robot might encounter so it can maneuver around them, but

0:30:46.720 --> 0:30:50.080
<v Speaker 1>also just to understand, oh, these are the elements within

0:30:50.120 --> 0:30:53.800
<v Speaker 1>this scene that I need to be careful around because

0:30:53.840 --> 0:30:57.040
<v Speaker 1>they are either people and thus I don't want to

0:30:57.080 --> 0:30:59.840
<v Speaker 1>injure them, or they are things that are delicate and

0:31:00.000 --> 0:31:01.880
<v Speaker 1>I don't want to break them, or these are the

0:31:01.880 --> 0:31:04.160
<v Speaker 1>objects that I need to interact with, and this is

0:31:04.200 --> 0:31:06.880
<v Speaker 1>how I might go about interacting with them. Exactly. And

0:31:06.920 --> 0:31:10.360
<v Speaker 1>of course, beyond that, to communicate information about the environment.

0:31:10.640 --> 0:31:13.040
<v Speaker 1>And I mean that's something that would be amazingly helpful

0:31:13.120 --> 0:31:16.760
<v Speaker 1>if a robot could describe to you what happened three

0:31:16.800 --> 0:31:20.320
<v Speaker 1>minutes ago. Yeah, you know. Well, and for robots,

0:31:20.320 --> 0:31:22.800
<v Speaker 1>we often talk about robots used for first responders, a

0:31:22.880 --> 0:31:25.280
<v Speaker 1>robot that could be able to not just look for

0:31:25.320 --> 0:31:30.480
<v Speaker 1>signs of life, but actively describe the environment back to operators,

0:31:30.600 --> 0:31:34.080
<v Speaker 1>so that you know, say like, oh, hey, that pillar

0:31:34.160 --> 0:31:36.959
<v Speaker 1>is down over there, and the floor is caved in

0:31:37.000 --> 0:31:40.560
<v Speaker 1>over here, and there's a pile of debris. And yeah, yeah,

0:31:40.600 --> 0:31:42.640
<v Speaker 1>because there are a lot of sensors on a robot

0:31:42.720 --> 0:31:46.640
<v Speaker 1>that can detect things that don't necessarily translate into direct

0:31:46.720 --> 0:31:49.920
<v Speaker 1>visual data for us, you know, not just the cameras

0:31:49.960 --> 0:31:52.600
<v Speaker 1>but other things, infrared sensors, stuff that, you know,

0:31:52.680 --> 0:31:55.400
<v Speaker 1>would need some processing on our end to make it meaningful.

0:31:55.480 --> 0:31:58.600
<v Speaker 1>But if it's able to communicate directly that information, that

0:31:58.600 --> 0:32:02.640
<v Speaker 1>would be incredibly helpful. So uh, very important part of

0:32:02.720 --> 0:32:05.400
<v Speaker 1>artificial intelligence. And again I think it also illustrates that

0:32:05.520 --> 0:32:10.880
<v Speaker 1>artificial intelligence is a much bigger idea than just a

0:32:11.000 --> 0:32:14.320
<v Speaker 1>machine that quote unquote thinks like we do. That's the

0:32:14.360 --> 0:32:17.320
<v Speaker 1>way a lot of us will describe like that. That's

0:32:17.360 --> 0:32:19.640
<v Speaker 1>kind of my go to thought whenever I hear the

0:32:19.640 --> 0:32:23.160
<v Speaker 1>words artificial intelligence. Mainly I guess because of Hollywood, But

0:32:23.520 --> 0:32:28.800
<v Speaker 1>the reality is that it encompasses way more than that. So, uh,

0:32:28.960 --> 0:32:34.200
<v Speaker 1>excellent work on uncovering that little item and suggesting it, Joe.

0:32:34.280 --> 0:32:37.200
<v Speaker 1>I think it made a great topic. As I've said,

0:32:37.240 --> 0:32:40.480
<v Speaker 1>it weren't me. Well, you saw it and then you

0:32:40.560 --> 0:32:42.800
<v Speaker 1>brought it to our attention. If it hadn't been for you,

0:32:43.120 --> 0:32:45.320
<v Speaker 1>we'd probably have been talking about something that would not have

0:32:45.440 --> 0:32:48.640
<v Speaker 1>required me to fire my entire brain at it for

0:32:48.920 --> 0:32:51.240
<v Speaker 1>a day and a half of solid research. Well, I

0:32:51.240 --> 0:32:53.640
<v Speaker 1>spend most of my time trying to figure out weird

0:32:53.680 --> 0:32:56.560
<v Speaker 1>ways to make you exercise your neurons because I need

0:32:56.600 --> 0:32:58.840
<v Speaker 1>those healthy for something I'm planning on doing in a

0:32:58.880 --> 0:33:01.720
<v Speaker 1>couple of months. That's that's all good, good, that's good.

0:33:01.800 --> 0:33:04.280
<v Speaker 1>I personally thank you, Joe, because otherwise I'd just be

0:33:04.320 --> 0:33:08.840
<v Speaker 1>sitting here drooling. So, you know, sure, your plan

0:33:09.000 --> 0:33:12.160
<v Speaker 1>is likely nefarious, but I'm gonna go along with it

0:33:12.160 --> 0:33:15.040
<v Speaker 1>because it's benefiting me in the short term. What's your

0:33:15.080 --> 0:33:18.640
<v Speaker 1>blood type? So forward thinking? If you have suggestions for

0:33:18.760 --> 0:33:20.760
<v Speaker 1>future topics, you should get in touch with us and

0:33:20.840 --> 0:33:23.640
<v Speaker 1>let us know what those are. Our email address is FW

0:33:23.760 --> 0:33:25.880
<v Speaker 1>thinking at HowStuffWorks dot com, or drop us

0:33:25.880 --> 0:33:28.960
<v Speaker 1>a line on Google Plus, on Facebook or on Twitter.

0:33:29.040 --> 0:33:31.520
<v Speaker 1>At Twitter and Google Plus, we have the handle fw thinking.

0:33:31.640 --> 0:33:34.000
<v Speaker 1>Just search for that at Facebook, we'll pop right up.

0:33:34.320 --> 0:33:36.440
<v Speaker 1>Let us know what you want to hear about, or

0:33:36.440 --> 0:33:38.600
<v Speaker 1>maybe you want to chime in, or maybe you found

0:33:38.680 --> 0:33:41.800
<v Speaker 1>that stock image of a man throwing

0:33:41.800 --> 0:33:44.080
<v Speaker 1>a sandwich off a cliff. Oh I want to see that.

0:33:44.240 --> 0:33:46.600
<v Speaker 1>If you send it in, I will replace whatever image

0:33:46.640 --> 0:33:49.120
<v Speaker 1>I originally put with this podcast and use that and

0:33:49.240 --> 0:33:51.480
<v Speaker 1>uh and we'll even we'll even throw credit in. We'll

0:33:51.520 --> 0:33:53.479
<v Speaker 1>say that you know you're the one who found it

0:33:53.520 --> 0:33:58.160
<v Speaker 1>for us. So, well, assuming, now, there are photo permissions. Yes, yes,

0:33:58.240 --> 0:34:00.000
<v Speaker 1>we have to be able to we have to be able,

0:34:00.360 --> 0:34:03.160
<v Speaker 1>we have to be able to use it with permission.

0:34:03.280 --> 0:34:06.000
<v Speaker 1>So if it's if it's a stock photo that we

0:34:06.120 --> 0:34:08.160
<v Speaker 1>you know, from a stock company, that we actually use,

0:34:08.719 --> 0:34:11.279
<v Speaker 1>then we can we can totally do that. And uh

0:34:11.440 --> 0:34:13.319
<v Speaker 1>so as long as that all lines up. I know

0:34:13.400 --> 0:34:15.680
<v Speaker 1>that's a lot of ifs. You'll get a credit for it,

0:34:15.840 --> 0:34:18.600
<v Speaker 1>and just think, you too will have contributed to

0:34:19.280 --> 0:34:25.000
<v Speaker 1>our forward thinking quest of people throwing food off of

0:34:25.080 --> 0:34:28.759
<v Speaker 1>high elevations. I don't know at any rate. Get in

0:34:28.800 --> 0:34:31.040
<v Speaker 1>touch with us and we will talk to you again

0:34:31.719 --> 0:34:38.840
<v Speaker 1>really soon. For more on this topic and the future

0:34:38.880 --> 0:34:51.640
<v Speaker 1>of technology, visit Forward Thinking dot com. Brought to

0:34:51.680 --> 0:34:54.120
<v Speaker 1>you by Toyota. Let's go places.