WEBVTT - A Quick Chat About CAPTCHAs

0:00:04.480 --> 0:00:12.399
<v Speaker 1>Welcome to TechStuff, a production from iHeartRadio. Hey there,

0:00:12.440 --> 0:00:15.800
<v Speaker 1>and welcome to TechStuff. I'm your host, Jonathan Strickland.

0:00:15.800 --> 0:00:18.799
<v Speaker 1>I'm an executive producer with iHeart Podcasts. And how the

0:00:18.800 --> 0:00:21.560
<v Speaker 1>tech are you? I thought today we could do a

0:00:21.760 --> 0:00:25.560
<v Speaker 1>real quickie, because you know, sometimes it's nice just to

0:00:25.600 --> 0:00:29.400
<v Speaker 1>do a short thing to talk about a subject in tech.

0:00:29.920 --> 0:00:32.480
<v Speaker 1>And there are a whole bunch of TechStuff episodes

0:00:32.800 --> 0:00:36.559
<v Speaker 1>in which I have talked about the Turing test. So

0:00:37.200 --> 0:00:39.400
<v Speaker 1>there are a lot of different variations of the Turing test.

0:00:40.040 --> 0:00:43.559
<v Speaker 1>It's based off a thought experiment from Alan Turing, the

0:00:43.840 --> 0:00:48.680
<v Speaker 1>computer scientist, very influential, very important in World War Two.

0:00:49.200 --> 0:00:54.160
<v Speaker 1>He helped crack the Enigma code. And actually the movie

0:00:54.280 --> 0:00:59.720
<v Speaker 1>that sort of depicted his efforts in cracking Enigma is

0:00:59.760 --> 0:01:03.280
<v Speaker 1>called The Imitation Game. Well, the imitation game is what Turing

0:01:03.440 --> 0:01:07.160
<v Speaker 1>himself called the test we now know as the Turing test. So he

0:01:07.319 --> 0:01:09.760
<v Speaker 1>kind of proposed this test when people would ask him

0:01:09.760 --> 0:01:13.240
<v Speaker 1>if he thought machines would be capable of thought. Now

0:01:13.440 --> 0:01:15.520
<v Speaker 1>keep in mind, like this is back in the forties

0:01:15.560 --> 0:01:18.480
<v Speaker 1>and fifties. Do you think that machines will be capable

0:01:18.520 --> 0:01:20.240
<v Speaker 1>of thinking? And he said, I don't really think that's

0:01:20.240 --> 0:01:23.080
<v Speaker 1>a very interesting question. For one thing, I don't know

0:01:23.120 --> 0:01:25.840
<v Speaker 1>that there's any meaningful way to answer it. However, I

0:01:25.920 --> 0:01:29.120
<v Speaker 1>do think we can be a little more precise if

0:01:29.160 --> 0:01:32.399
<v Speaker 1>we think about it in terms of kind of a

0:01:32.400 --> 0:01:36.280
<v Speaker 1>thought experiment, a test. So imagine this is the situation

0:01:36.920 --> 0:01:40.560
<v Speaker 1>you find yourself in. You go into a room and

0:01:40.640 --> 0:01:43.760
<v Speaker 1>there's a computer terminal there, and that's it. You know,

0:01:43.800 --> 0:01:45.920
<v Speaker 1>you can't see into any other rooms or anything. It's

0:01:45.959 --> 0:01:48.800
<v Speaker 1>just a desk with a computer terminal. And you sit

0:01:48.880 --> 0:01:52.760
<v Speaker 1>down at this terminal and there's a little prompt there

0:01:52.880 --> 0:01:55.640
<v Speaker 1>that lets you get into a chat session. And you

0:01:55.800 --> 0:01:59.640
<v Speaker 1>enter into this chat session, and you have five minutes

0:01:59.760 --> 0:02:02.640
<v Speaker 1>and you can ask the person on the other end

0:02:02.680 --> 0:02:05.560
<v Speaker 1>of the chat session any questions you want within that

0:02:05.600 --> 0:02:08.760
<v Speaker 1>five minute time frame. And once those five minutes are up,

0:02:08.919 --> 0:02:12.959
<v Speaker 1>you're asked to determine was the person on the other

0:02:13.120 --> 0:02:16.120
<v Speaker 1>end of the chat an actual human being or was

0:02:16.160 --> 0:02:20.200
<v Speaker 1>it a computer program? Was it some form of artificial intelligence?

0:02:20.280 --> 0:02:23.200
<v Speaker 1>A bot is what we would call it today. And

0:02:23.800 --> 0:02:28.600
<v Speaker 1>if you are unable to determine whether the subject on

0:02:28.639 --> 0:02:31.120
<v Speaker 1>the other end of the chat is human or a

0:02:31.160 --> 0:02:35.720
<v Speaker 1>bot to any reliable degree, then you could say that, oh,

0:02:35.880 --> 0:02:40.239
<v Speaker 1>that program passed the Turing test. I could find

0:02:40.720 --> 0:02:45.399
<v Speaker 1>no way of telling the difference between that computer program

0:02:45.440 --> 0:02:48.919
<v Speaker 1>and an actual person. Turing suggested that due to advancements

0:02:48.919 --> 0:02:52.320
<v Speaker 1>in computer science, people would have at

0:02:52.440 --> 0:02:56.160
<v Speaker 1>best a success rate of around seventy percent to be

0:02:56.200 --> 0:02:59.160
<v Speaker 1>able to tell whether or not the quote unquote person

0:02:59.200 --> 0:03:00.720
<v Speaker 1>on the other end of the chat was a

0:03:00.800 --> 0:03:04.320
<v Speaker 1>human being or a computer program. And he said he

0:03:04.400 --> 0:03:06.720
<v Speaker 1>expected that to be achievable in about fifty years' time.

0:03:06.880 --> 0:03:10.040
<v Speaker 1>It took a little bit longer than that, but I

0:03:10.080 --> 0:03:14.120
<v Speaker 1>would say that with the sophistication we've reached with chatbots

0:03:14.200 --> 0:03:17.640
<v Speaker 1>these days, I think you could fairly conclusively say that

0:03:17.680 --> 0:03:21.600
<v Speaker 1>we've got programs out there that can quote unquote beat

0:03:21.639 --> 0:03:24.240
<v Speaker 1>the Turing test. Part of the problem is that Turing

0:03:24.360 --> 0:03:28.000
<v Speaker 1>was saying that in the future, these programs are going

0:03:28.040 --> 0:03:30.560
<v Speaker 1>to be sophisticated enough that they will fool people into

0:03:30.560 --> 0:03:33.720
<v Speaker 1>thinking it's another person. He wasn't saying, oh, you have

0:03:33.800 --> 0:03:38.480
<v Speaker 1>to meet this specific threshold for your system to have

0:03:39.560 --> 0:03:43.960
<v Speaker 1>beaten the Turing test. That came afterward. Other

0:03:44.000 --> 0:03:47.720
<v Speaker 1>people would kind of create the criteria. But since then

0:03:47.800 --> 0:03:51.320
<v Speaker 1>people have used the phrase Turing test to reference

0:03:51.480 --> 0:03:54.520
<v Speaker 1>essentially any kind of task designed to determine if a

0:03:54.560 --> 0:03:58.720
<v Speaker 1>machine has or at least appears to have the property

0:03:59.280 --> 0:04:01.800
<v Speaker 1>of intelligence. And when I say that, I mean really

0:04:01.960 --> 0:04:07.920
<v Speaker 1>general intelligence. But there's another specific use of Turing tests

0:04:08.000 --> 0:04:10.840
<v Speaker 1>that I would like to bring up today, and that

0:04:11.040 --> 0:04:16.080
<v Speaker 1>is the completely automated public Turing test to tell computers

0:04:16.200 --> 0:04:21.360
<v Speaker 1>and humans apart, which, once you turn it into an acronym,

0:04:21.440 --> 0:04:25.080
<v Speaker 1>becomes CAPTCHA. So these are those little tasks that

0:04:25.160 --> 0:04:29.520
<v Speaker 1>you occasionally encounter on certain websites, and they require you

0:04:29.640 --> 0:04:32.400
<v Speaker 1>to do something like you type in a string of

0:04:32.480 --> 0:04:36.280
<v Speaker 1>characters that are displayed on screen. They're usually deformed in

0:04:36.360 --> 0:04:40.040
<v Speaker 1>some way and up against a crazy background. Or you

0:04:40.160 --> 0:04:42.840
<v Speaker 1>might be given a big selection of images and told

0:04:42.880 --> 0:04:45.320
<v Speaker 1>to pick out all the ones that have a cat

0:04:45.360 --> 0:04:48.120
<v Speaker 1>in them or something. Or you might have to drag

0:04:48.279 --> 0:04:51.839
<v Speaker 1>a little picture of a puzzle piece into an image

0:04:52.000 --> 0:04:55.279
<v Speaker 1>where it fits into a very specific spot. All of

0:04:55.320 --> 0:04:58.919
<v Speaker 1>these are meant to separate out actual human visitors to

0:04:59.240 --> 0:05:03.560
<v Speaker 1>a website or service from all the automated programs or bots

0:05:03.640 --> 0:05:05.560
<v Speaker 1>or whatever you might want to call them. So I

0:05:05.560 --> 0:05:07.640
<v Speaker 1>thought it would be fun to do a quick episode

0:05:07.880 --> 0:05:11.640
<v Speaker 1>on where CAPTCHAs came from and what purpose they serve,

0:05:12.240 --> 0:05:14.560
<v Speaker 1>which I kind of touched on already, but also how they

0:05:14.560 --> 0:05:18.160
<v Speaker 1>fit into the grand picture of artificial intelligence, because interestingly

0:05:18.480 --> 0:05:22.039
<v Speaker 1>they play a pretty important part. They have helped drive

0:05:22.240 --> 0:05:26.640
<v Speaker 1>the development and advancement of artificial intelligence, not necessarily in

0:05:26.680 --> 0:05:31.080
<v Speaker 1>a way that is helpful to everybody out there, but

0:05:31.120 --> 0:05:34.480
<v Speaker 1>it certainly has served as a way to get people

0:05:34.960 --> 0:05:39.280
<v Speaker 1>thinking about how to tackle certain AI problems. So our

0:05:39.320 --> 0:05:43.599
<v Speaker 1>story actually begins with the good old website Yahoo. Y'all

0:05:43.600 --> 0:05:46.360
<v Speaker 1>remember Yahoo? I mean, it's still a thing, but I

0:05:46.400 --> 0:05:50.720
<v Speaker 1>remember a time where Yahoo was practically synonymous with Internet

0:05:50.800 --> 0:05:53.159
<v Speaker 1>for a lot of folks. You may not even remember

0:05:53.200 --> 0:05:56.000
<v Speaker 1>this if you haven't been on Yahoo in ages, but

0:05:56.480 --> 0:05:59.400
<v Speaker 1>once upon a time, Yahoo was sort of a portal

0:05:59.640 --> 0:06:02.159
<v Speaker 1>to the rest of the Internet. Yahoo was kind of

0:06:02.160 --> 0:06:03.919
<v Speaker 1>like a landing page. A lot of people had it

0:06:04.000 --> 0:06:06.960
<v Speaker 1>set as their homepage, so when they would go into

0:06:07.000 --> 0:06:10.440
<v Speaker 1>a web browser, they'd go right into Yahoo and you

0:06:10.440 --> 0:06:13.919
<v Speaker 1>would find articles there and all sorts of other links,

0:06:14.279 --> 0:06:18.000
<v Speaker 1>as well as chat rooms and of course the search

0:06:18.040 --> 0:06:21.440
<v Speaker 1>engine where you could search for other stuff online besides

0:06:21.520 --> 0:06:24.680
<v Speaker 1>the stuff that just popped up on Yahoo. Well, in

0:06:24.720 --> 0:06:29.240
<v Speaker 1>those chat rooms, moderators were running into a really serious problem.

0:06:29.560 --> 0:06:35.440
<v Speaker 1>The chat spaces were becoming invaded by bots posing as people.

0:06:35.600 --> 0:06:37.520
<v Speaker 1>Now this is in two thousand. The bots were not

0:06:37.560 --> 0:06:41.919
<v Speaker 1>particularly sophisticated, but they were creating a lot of spam. Like,

0:06:41.960 --> 0:06:45.880
<v Speaker 1>they were jamming up chat spaces with just spam messages

0:06:45.920 --> 0:06:48.560
<v Speaker 1>while people are trying to chat. In some cases, they

0:06:48.560 --> 0:06:52.279
<v Speaker 1>were gathering personal information of users in an effort to

0:06:52.839 --> 0:06:56.719
<v Speaker 1>exploit those users in some way or another. So Yahoo

0:06:56.720 --> 0:06:59.320
<v Speaker 1>didn't want this to keep going. It wasn't reflecting well

0:06:59.320 --> 0:07:02.440
<v Speaker 1>on the company. So they turned to the computer science

0:07:02.480 --> 0:07:06.400
<v Speaker 1>department at Carnegie Mellon University in order to see, Hey,

0:07:06.640 --> 0:07:10.200
<v Speaker 1>is there some way that we could, you know, kind

0:07:10.200 --> 0:07:13.400
<v Speaker 1>of like have a bouncer out front, a gate keeper

0:07:13.640 --> 0:07:18.040
<v Speaker 1>if you will, that would allow humans into the various

0:07:18.080 --> 0:07:19.880
<v Speaker 1>systems so that they can make use of them the

0:07:19.880 --> 0:07:23.800
<v Speaker 1>way they were intended, but prevent all the robots, all

0:07:23.840 --> 0:07:28.000
<v Speaker 1>the AI programs, all the computer software or algorithms, however

0:07:28.000 --> 0:07:31.440
<v Speaker 1>you want to define it, keep them from getting access.

0:07:31.680 --> 0:07:36.960
<v Speaker 1>So a team led by Manuel Blum and including folks

0:06:37.040 --> 0:06:40.120
<v Speaker 1>like John Langford,

0:07:40.360 --> 0:07:45.800
<v Speaker 1>Luis von Ahn, Nicholas Hopper, and others tackled this challenge,

0:07:46.200 --> 0:07:49.160
<v Speaker 1>so they needed to come up with a test. Now,

0:07:49.240 --> 0:07:53.200
<v Speaker 1>in an ideal world, the test would be a cinch

0:07:53.480 --> 0:07:56.520
<v Speaker 1>for a human being to complete, but it would be

0:07:56.560 --> 0:08:01.840
<v Speaker 1>a real stumper for algorithmically driven bots. And that is

0:08:01.920 --> 0:08:06.520
<v Speaker 1>the basic philosophy of CAPTCHA. Make a test that humans

0:08:06.560 --> 0:08:10.800
<v Speaker 1>find really easy to complete, perhaps even trivial, like it's

0:08:10.960 --> 0:08:15.360
<v Speaker 1>just a mild inconvenience, as they say, but for bots

0:08:15.640 --> 0:08:19.040
<v Speaker 1>it's like a turn away. You're never going to be

0:08:19.080 --> 0:08:21.720
<v Speaker 1>able to get this. Now, some of y'all might be

0:08:21.760 --> 0:08:24.920
<v Speaker 1>saying something along the lines of but Jonathan, whenever I

0:08:25.080 --> 0:08:29.320
<v Speaker 1>run into CAPTCHAs these days, they're sometimes really hard. Like

0:08:29.520 --> 0:08:32.560
<v Speaker 1>it's hard to see what they spell out. I'll try

0:08:32.559 --> 0:08:35.439
<v Speaker 1>and type things in three or four times and get kicked out.

0:08:35.760 --> 0:08:38.839
<v Speaker 1>And you're right, that is a problem. It is something

0:08:38.880 --> 0:08:42.720
<v Speaker 1>that actually is happening. It doesn't mean that you're not human.

0:08:43.400 --> 0:08:46.280
<v Speaker 1>If you're having an existential crisis, I would like to

0:08:46.320 --> 0:08:49.400
<v Speaker 1>set your mind at ease by saying you're probably human.

0:08:49.800 --> 0:08:52.520
<v Speaker 1>I mean, I don't think I could say anything for certain,

0:08:52.559 --> 0:08:55.480
<v Speaker 1>but I feel fairly confident saying you're probably human. But

0:08:55.880 --> 0:08:59.719
<v Speaker 1>the reason why CAPTCHAs have become really difficult in some

0:08:59.800 --> 0:09:03.240
<v Speaker 1>cases anyway, with some specific types of CAPTCHAs, is

0:09:03.360 --> 0:09:07.800
<v Speaker 1>largely because other programmers figured out how to make better

0:09:07.840 --> 0:09:12.000
<v Speaker 1>automated programs that can parse and respond to CAPTCHAs. So

0:09:12.200 --> 0:09:14.920
<v Speaker 1>as one group of programmers figured out how to design

0:09:15.080 --> 0:09:19.360
<v Speaker 1>tools to defeat a CAPTCHA, the CAPTCHA designers would go

0:09:19.440 --> 0:09:23.120
<v Speaker 1>back to the drawing board to create new tests to

0:09:23.160 --> 0:09:26.320
<v Speaker 1>be more challenging for those bots, to say, well, they

0:09:26.559 --> 0:09:31.280
<v Speaker 1>got good at this, let's change these things and reintroduce

0:09:31.400 --> 0:09:36.560
<v Speaker 1>the CAPTCHA so that this will trip up those systems

0:09:36.600 --> 0:09:39.560
<v Speaker 1>because while they're good at what we used to use

0:09:40.320 --> 0:09:45.920
<v Speaker 1>for gatekeeping, they've never run into this before, and unfortunately

0:09:46.080 --> 0:09:49.360
<v Speaker 1>that sometimes means that the tests become more challenging for

0:09:49.440 --> 0:09:52.440
<v Speaker 1>human beings as well. It no longer is a case

0:09:52.480 --> 0:09:56.160
<v Speaker 1>where something is trivial for a human but difficult for robots,

0:09:56.240 --> 0:09:59.679
<v Speaker 1>at least for certain types of CAPTCHAs. And that's particularly

0:09:59.720 --> 0:10:03.040
<v Speaker 1>true if the human has some impairments, like if

0:10:03.120 --> 0:10:07.000
<v Speaker 1>they have color blindness for example, or some other visual

0:10:07.480 --> 0:10:11.880
<v Speaker 1>impairment. There are real issues in making CAPTCHAs

0:10:11.920 --> 0:10:14.720
<v Speaker 1>that do what they're supposed to do, that is weed

0:10:14.760 --> 0:10:18.320
<v Speaker 1>out all the non humans but also be accessible to

0:10:18.520 --> 0:10:21.800
<v Speaker 1>all humans, even those who might have impairments that would

0:10:21.840 --> 0:10:25.640
<v Speaker 1>otherwise make it difficult or challenging to complete a CAPTCHA.

0:10:25.960 --> 0:10:29.320
<v Speaker 1>It is not an easy path to walk. We're going

0:10:29.400 --> 0:10:31.079
<v Speaker 1>to take a quick break. When we come back, I'll

0:10:31.200 --> 0:10:46.200
<v Speaker 1>talk more about the CAPTCHA story. We're back. So in

0:10:46.280 --> 0:10:50.400
<v Speaker 1>the early days of CAPTCHAs, they mostly took on the

0:10:50.440 --> 0:10:55.600
<v Speaker 1>form of distorted text that was printed over a busy background.

0:10:55.880 --> 0:10:58.960
<v Speaker 1>And the idea was that most automated programs would not

0:10:59.040 --> 0:11:02.360
<v Speaker 1>be able to recognize distorted text. Like, it would be

0:11:02.400 --> 0:11:05.760
<v Speaker 1>an image, not just text letters where it would be

0:11:05.800 --> 0:11:08.360
<v Speaker 1>able to read like the code used to generate the

0:11:08.440 --> 0:11:10.960
<v Speaker 1>letters and then say, oh, well, those are these letters, I

0:11:11.000 --> 0:11:13.839
<v Speaker 1>can replicate that and get through no problem. You had

0:11:13.840 --> 0:11:18.160
<v Speaker 1>to have something that was going to really stump them. Now,

0:11:18.240 --> 0:11:21.480
<v Speaker 1>image recognition is a pretty tricky science. I've talked about

0:11:21.480 --> 0:11:25.160
<v Speaker 1>it on this show before. Training computer systems to

0:11:25.240 --> 0:11:28.960
<v Speaker 1>recognize images takes a lot of time and effort and

0:11:29.040 --> 0:11:32.200
<v Speaker 1>lots and lots and lots of samples so that the

0:11:32.280 --> 0:11:37.640
<v Speaker 1>computer system can quote unquote learn what those images represent. Now,

0:11:37.679 --> 0:11:40.559
<v Speaker 1>it's one thing to teach a computer how to recognize

0:11:40.920 --> 0:11:44.600
<v Speaker 1>standard letters that are in a recognizable font. So if

0:11:44.640 --> 0:11:48.280
<v Speaker 1>the Internet only ever used one font and only used

0:11:48.320 --> 0:11:52.839
<v Speaker 1>one size of that font, then it would be relatively

0:11:53.120 --> 0:11:56.400
<v Speaker 1>trivial for those who want to defeat CAPTCHAs, because once

0:11:56.400 --> 0:11:58.720
<v Speaker 1>you train a computer vision system on what a lower

0:11:58.760 --> 0:12:02.480
<v Speaker 1>case T looks like, for example, then the system would

0:12:02.480 --> 0:12:05.160
<v Speaker 1>recognize a lowercase T every time one popped up. But

0:12:05.280 --> 0:12:08.400
<v Speaker 1>of course, there are lots of different fonts and typefaces

0:12:08.440 --> 0:12:11.080
<v Speaker 1>on the Internet, and they come in different sizes and

0:12:11.200 --> 0:12:15.240
<v Speaker 1>colors and on different backgrounds. So teaching a computer system

0:12:15.280 --> 0:12:18.520
<v Speaker 1>what a Times New Roman lowercase T looks like against

0:12:18.559 --> 0:12:21.439
<v Speaker 1>a blank background doesn't mean it's also going to recognize

0:12:21.440 --> 0:12:25.160
<v Speaker 1>a lowercase T in some other font on some crazy background.

0:12:25.320 --> 0:12:28.040
<v Speaker 1>Plus maybe the T is a little wavy, a little distorted,

0:12:28.440 --> 0:12:31.640
<v Speaker 1>so distorting that text makes it more challenging for image

0:12:31.679 --> 0:12:36.200
<v Speaker 1>recognition systems, like they're looking for defining features to be

0:12:36.360 --> 0:12:40.040
<v Speaker 1>able to match the image of a letter to the

0:12:40.120 --> 0:12:43.280
<v Speaker 1>actual letter. You see, humans, when we teach a human

0:12:43.440 --> 0:12:46.880
<v Speaker 1>what something looks like, it's a lot easier for humans

0:12:46.920 --> 0:12:50.280
<v Speaker 1>to associate other things that look kind of the way

0:12:50.880 --> 0:12:54.160
<v Speaker 1>the first example did, but maybe not exactly the same.

0:12:54.640 --> 0:12:57.640
<v Speaker 1>So in other words, the example I always use

0:12:57.640 --> 0:13:00.160
<v Speaker 1>is coffee mugs, right? If I show you a coffee mug,

0:13:00.200 --> 0:13:02.199
<v Speaker 1>and I say this is a coffee mug, and then

0:13:02.240 --> 0:13:05.240
<v Speaker 1>I show you a second kind that looks totally different,

0:13:05.280 --> 0:13:08.520
<v Speaker 1>different color, different size, you know, whatever, maybe has different

0:13:08.520 --> 0:13:10.800
<v Speaker 1>writing on it, whatever it might be. And I say,

0:13:10.840 --> 0:13:13.400
<v Speaker 1>this is also a coffee mug. And then I show

0:13:13.400 --> 0:13:16.200
<v Speaker 1>you a third example that looks unlike the first two.

0:13:16.440 --> 0:13:18.800
<v Speaker 1>You could say, oh, okay, I get the idea. I

0:13:18.840 --> 0:13:23.200
<v Speaker 1>get the different features that make up what a coffee

0:13:23.280 --> 0:13:26.560
<v Speaker 1>mug is. I understand now. And now when I encounter

0:13:27.160 --> 0:13:29.840
<v Speaker 1>different types of coffee mugs, even though they might not

0:13:29.920 --> 0:13:32.559
<v Speaker 1>look anything like any of the other ones I've encountered,

0:13:32.720 --> 0:13:36.360
<v Speaker 1>I know, Okay, that's probably a coffee mug. Until someone says, no,

0:13:36.440 --> 0:13:39.160
<v Speaker 1>that's a teacup, and then your world is turned upside down.

0:13:39.200 --> 0:13:41.880
<v Speaker 1>But you get what I'm saying. Computers don't work that way.

0:13:42.200 --> 0:13:45.840
<v Speaker 1>With computers, if you teach one that an example is a thing,

0:13:46.320 --> 0:13:52.120
<v Speaker 1>it doesn't necessarily understand that similar but distinctly different versions

0:13:52.160 --> 0:13:55.200
<v Speaker 1>of that same thing fall into the same category. That

0:13:55.320 --> 0:13:58.920
<v Speaker 1>takes lots and lots of training. So the whole idea

0:13:59.080 --> 0:14:02.760
<v Speaker 1>of distortion was that this would make it very tricky

0:14:03.120 --> 0:14:06.040
<v Speaker 1>for most systems to be able to parse that information

0:14:06.320 --> 0:14:09.080
<v Speaker 1>and be able to put it in reliably and to

0:14:09.240 --> 0:14:13.200
<v Speaker 1>fool the CAPTCHA system. That doesn't mean that it was foolproof.

0:14:13.360 --> 0:14:16.720
<v Speaker 1>Over time, those systems did get better at being able

0:14:16.760 --> 0:14:20.640
<v Speaker 1>to recognize those figures that were on screen, even better

0:14:20.800 --> 0:14:25.360
<v Speaker 1>than humans could in some cases, which is obviously a problem. Now,

0:14:25.400 --> 0:14:28.480
<v Speaker 1>there have been lots of other CAPTCHA systems, not just the original CAPTCHA.

0:14:28.760 --> 0:14:34.400
<v Speaker 1>For example, there's one called Asirra. Asirra did something

0:14:34.480 --> 0:14:37.200
<v Speaker 1>I mentioned earlier in the episode. It would present the

0:14:37.360 --> 0:14:40.680
<v Speaker 1>visitor with a collection of photographs and they would include

0:14:40.760 --> 0:14:43.960
<v Speaker 1>cats and dogs, and it would ask you, okay, identify

0:14:44.200 --> 0:14:47.240
<v Speaker 1>the pictures that have cats in them. So that was

0:14:47.280 --> 0:14:50.920
<v Speaker 1>one way to get around the problem: it wasn't

0:14:51.040 --> 0:14:55.240
<v Speaker 1>just figuring out text. It was differentiating between cats and dogs,

0:14:55.560 --> 0:14:59.080
<v Speaker 1>something that again computer systems couldn't do just natively. They

0:14:59.120 --> 0:15:02.600
<v Speaker 1>had to be taught how to recognize the features that

0:15:02.640 --> 0:15:05.200
<v Speaker 1>belonged to a cat versus those that belonged to a dog,

0:15:05.720 --> 0:15:08.720
<v Speaker 1>just the same as all other image recognition software. The

0:15:08.760 --> 0:15:13.280
<v Speaker 1>folks at Carnegie Mellon developed reCAPTCHA, which Google later acquired, and that actually served

0:15:13.280 --> 0:15:17.640
<v Speaker 1>a dual purpose. It was kind of sneaky. So with reCAPTCHA,

0:15:17.720 --> 0:15:19.400
<v Speaker 1>you would go to a website and you would be

0:15:19.440 --> 0:15:22.400
<v Speaker 1>greeted by some you know, kind of grainy text, and

0:15:22.440 --> 0:15:24.320
<v Speaker 1>you'd be asked to type it out. You'd actually get

0:15:24.360 --> 0:15:28.120
<v Speaker 1>a couple of different ones, not just one. And this

0:15:28.320 --> 0:15:32.920
<v Speaker 1>text was from scans of physical books that were being digitized, so

0:15:32.960 --> 0:15:36.200
<v Speaker 1>in other words, books where they had put the page

0:15:36.320 --> 0:15:39.200
<v Speaker 1>down on a scanner and created a scan. So some

0:15:39.240 --> 0:15:41.800
<v Speaker 1>of these books were in you know, pretty bad shape.

0:15:41.840 --> 0:15:46.200
<v Speaker 1>They weren't at all crisp, clear images. So with the first

0:15:46.480 --> 0:15:50.720
<v Speaker 1>word you'd be presented with, Google already knew the answer

0:15:50.880 --> 0:15:53.640
<v Speaker 1>to whatever the word was. So let's say the word

0:15:54.000 --> 0:15:58.800
<v Speaker 1>is salamander and you type in salamander, and so Google says,

0:15:58.800 --> 0:16:04.200
<v Speaker 1>all right, I already knew that this scanned word is salamander.

0:16:04.640 --> 0:16:06.960
<v Speaker 1>This is obviously a person who has typed this in.

0:16:07.280 --> 0:16:11.160
<v Speaker 1>But the second image would be a scan from a book.

0:16:11.200 --> 0:16:13.720
<v Speaker 1>Maybe it'd be a really smudged one, like one that's

0:16:13.760 --> 0:16:16.840
<v Speaker 1>harder to read, and it would ask you, okay, what was

0:16:16.880 --> 0:16:20.320
<v Speaker 1>this word? Let's say it's surgeon and you type in surgeon. Well,

0:16:20.360 --> 0:16:25.200
<v Speaker 1>the secret sauce here is that Google didn't know that

0:16:25.200 --> 0:16:28.280
<v Speaker 1>that scanned word was surgeon. What Google was doing was

0:16:28.320 --> 0:16:33.560
<v Speaker 1>crowdsourcing, crowdsourcing the effort to figure out what the text

0:16:33.840 --> 0:16:39.400
<v Speaker 1>in this scanned image actually said. So if you and

0:16:40.000 --> 0:16:43.280
<v Speaker 1>thousands of other people all put the same word in

0:16:43.760 --> 0:16:47.360
<v Speaker 1>when you were encountering this particular scan, Google would say,

0:16:47.400 --> 0:16:51.840
<v Speaker 1>all right, that word is very likely surgeon. Because you know,

0:16:52.040 --> 0:16:55.280
<v Speaker 1>ninety eight percent of the people who were shown this

0:16:55.680 --> 0:16:59.480
<v Speaker 1>reCAPTCHA typed surgeon in. So now we know what that

0:16:59.520 --> 0:17:04.880
<v Speaker 1>word is, which meant that they could essentially transcribe these

0:17:04.960 --> 0:17:09.840
<v Speaker 1>digitized texts by using the crowd to do the work

0:17:09.880 --> 0:17:13.159
<v Speaker 1>for them. And that is kind of the heart of

0:17:13.280 --> 0:17:17.840
<v Speaker 1>where CAPTCHA and AI meet. CAPTCHAs have been used

0:17:18.119 --> 0:17:21.880
<v Speaker 1>for one thing, to help train AI so that it's more effective.

0:17:21.920 --> 0:17:24.119
<v Speaker 1>Like if you've encountered other Google ones where it's like

0:17:24.320 --> 0:17:26.840
<v Speaker 1>pick all the images here that have motorcycles in them

0:17:27.119 --> 0:17:31.600
<v Speaker 1>or stairs. Well, part of that is training Google's image

0:17:31.640 --> 0:17:37.120
<v Speaker 1>recognition systems so that they're more accurate. Right, Like an

0:17:37.160 --> 0:17:41.359
<v Speaker 1>image recognition system might have trouble differentiating an actual like

0:17:41.440 --> 0:17:44.920
<v Speaker 1>stone staircase out in front of a building with a

0:17:44.960 --> 0:17:48.800
<v Speaker 1>pedestrian crosswalk, because you know you've got those those broken

0:17:48.920 --> 0:17:52.760
<v Speaker 1>lines on a crosswalk, those could look like stairs to

0:17:53.000 --> 0:17:57.440
<v Speaker 1>a computer image recognition system. So by giving users the

0:17:57.480 --> 0:18:01.080
<v Speaker 1>task of hey, identify all the excit samples in this

0:18:01.359 --> 0:18:05.080
<v Speaker 1>list that have stairs in them, Google starts to train

0:18:05.200 --> 0:18:09.119
<v Speaker 1>its own image recognition algorithms to be more effective and

0:18:09.200 --> 0:18:13.000
<v Speaker 1>more accurate. So in a way, we were essentially being

0:18:13.160 --> 0:18:18.640
<v Speaker 1>used as free labor to make these AI systems more accurate,

0:18:19.000 --> 0:18:20.800
<v Speaker 1>just so that we could get access to whatever it

0:18:20.920 --> 0:18:23.719
<v Speaker 1>was we were trying to visit, whether that was an

0:18:23.800 --> 0:18:27.159
<v Speaker 1>online shop or a chat room, or you know, whatever

0:18:27.200 --> 0:18:31.719
<v Speaker 1>it might be. So, yeah, we've been working for free, y'all.

0:18:32.080 --> 0:18:35.240
<v Speaker 1>Actually, in some cases we've been working for free

0:18:35.359 --> 0:18:38.880
<v Speaker 1>and denied access to tools that we wanted to use

0:18:39.040 --> 0:18:41.240
<v Speaker 1>because the CAPTCHAs were too hard for us to be

0:18:41.280 --> 0:18:44.480
<v Speaker 1>able to solve. But yeah, that's the quick story

0:18:44.720 --> 0:18:48.400
<v Speaker 1>about the history and evolution of CAPTCHAs. Clearly they're still

0:18:48.480 --> 0:18:51.600
<v Speaker 1>used today. Sometimes it's something simple like click this box

0:18:51.640 --> 0:18:54.040
<v Speaker 1>to prove you're human, that kind of thing where it

0:18:54.080 --> 0:18:57.400
<v Speaker 1>requires you to take an action. Those obviously are much

0:18:57.440 --> 0:19:02.159
<v Speaker 1>simpler for humans to complete than for robots, so

0:19:02.240 --> 0:19:06.080
<v Speaker 1>those still follow the philosophy of the original CAPTCHAs. A

0:19:06.119 --> 0:19:08.840
<v Speaker 1>lot of other ones, though they get pretty tricky, to

0:19:08.880 --> 0:19:11.720
<v Speaker 1>the point where sometimes I'm discouraged from even going further

0:19:12.080 --> 0:19:15.080
<v Speaker 1>and visiting the website in particular, just like, you know what,

0:19:15.400 --> 0:19:18.520
<v Speaker 1>I don't need to feel stupid because I couldn't find

0:19:18.560 --> 0:19:22.440
<v Speaker 1>all the fire hydrants in these photographs, so I'm just out.

0:19:22.720 --> 0:19:25.080
<v Speaker 1>But yeah, that's it. And like I said, it plays

0:19:25.119 --> 0:19:27.320
<v Speaker 1>a really important part with AI. It's kind of a

0:19:27.359 --> 0:19:33.159
<v Speaker 1>seesaw effect, right, Like you create a barrier that AI

0:19:33.359 --> 0:19:36.080
<v Speaker 1>can't get over until it can, and then you have

0:19:36.119 --> 0:19:39.520
<v Speaker 1>to go back and create a harder barrier. And meanwhile,

0:19:39.680 --> 0:19:42.840
<v Speaker 1>the folks developing the AI keep making advancements so the

0:19:42.880 --> 0:19:46.600
<v Speaker 1>AI gets more sophisticated and powerful over time. So yeah,

0:19:46.680 --> 0:19:50.960
<v Speaker 1>delicate balance, and not everybody benefits, as I said. I hope

0:19:51.000 --> 0:19:54.200
<v Speaker 1>that that was interesting and informative to y'all. I hope

0:19:54.240 --> 0:19:56.919
<v Speaker 1>you're all doing well, and I'll talk to you again

0:19:57.480 --> 0:20:07.640
<v Speaker 1>really soon. TechStuff is an iHeartRadio production. For more

0:20:07.720 --> 0:20:12.480
<v Speaker 1>podcasts from iHeartRadio, visit the iHeartRadio app, Apple Podcasts, or

0:20:12.480 --> 0:20:14.439
<v Speaker 1>wherever you listen to your favorite shows.