WEBVTT - Speech Recognition

0:00:00.200 --> 0:00:07.240
<v Speaker 1>Brought to you by Toyota. Let's go places. Welcome to

0:00:07.400 --> 0:00:17.520
<v Speaker 1>Forward Thinking. Welcome everyone to Forward Thinking, the podcast where

0:00:17.520 --> 0:00:20.400
<v Speaker 1>we think about the future. I'm Jonathan Strickland, I'm Lauren

0:00:20.440 --> 0:00:22.920
<v Speaker 1>Vogelbaum, and I'm Joe McCormick. And today we want

0:00:23.000 --> 0:00:28.480
<v Speaker 1>to talk about the evolution of speech recognition software and hardware,

0:00:28.800 --> 0:00:31.479
<v Speaker 1>and to talk about how did we get to where

0:00:31.480 --> 0:00:34.200
<v Speaker 1>we are and where are we going from here? Because

0:00:34.240 --> 0:00:37.840
<v Speaker 1>clearly this is a thing. I mean, we're seeing speech

0:00:37.880 --> 0:00:44.600
<v Speaker 1>recognition in lots of different devices and including computers, mobile devices.

0:00:44.960 --> 0:00:47.080
<v Speaker 1>I know that my phone allows me to talk to

0:00:47.120 --> 0:00:50.800
<v Speaker 1>it and ask questions and occasionally I get the correct response.

0:00:51.479 --> 0:00:56.680
<v Speaker 1>I also have programs that will create an automatic transcript

0:00:56.760 --> 0:01:01.680
<v Speaker 1>of voicemails. So how did we get to this point?

0:01:01.680 --> 0:01:05.400
<v Speaker 1>How did we get to a time where computers can,

0:01:05.640 --> 0:01:09.880
<v Speaker 1>at least on the surface level, appear to understand speech.

0:01:10.520 --> 0:01:12.280
<v Speaker 1>And to really understand this, we have to go back

0:01:12.319 --> 0:01:18.000
<v Speaker 1>a ways, and by a ways, I mean seventeen seventy three. What? Yeah,

0:01:18.080 --> 0:01:21.840
<v Speaker 1>all right, So seventeen seventy three, shortly before the Atari

0:01:21.840 --> 0:01:24.920
<v Speaker 1>twenty six hundred came out by a couple of centuries,

0:01:25.360 --> 0:01:30.120
<v Speaker 1>there was a Russian scientist named Christian Kratzenstein, and he

0:01:30.319 --> 0:01:35.480
<v Speaker 1>actually Kratzenstein, Kratzenstein, Kratzenstein. That's a great name, it's an

0:01:35.480 --> 0:01:40.000
<v Speaker 1>awesome name. Right. Well, during this time people were starting

0:01:40.040 --> 0:01:43.840
<v Speaker 1>to get really interested in the nature of sound and

0:01:43.920 --> 0:01:49.400
<v Speaker 1>ways of producing sound, and Kratzenstein actually created something very interesting.

0:01:49.480 --> 0:01:53.920
<v Speaker 1>He created a machine that was capable of producing vowel

0:01:54.040 --> 0:01:59.840
<v Speaker 1>like sounds using organ pipes and resonance tubes. Wow. So

0:02:00.760 --> 0:02:04.279
<v Speaker 1>totally synthetic. Totally synthetic. And in this case we're talking

0:02:04.320 --> 0:02:08.360
<v Speaker 1>about a machine producing sounds, not a machine taking in

0:02:08.520 --> 0:02:13.359
<v Speaker 1>sounds and analyzing them. Exactly. This is a machine. But the

0:02:13.880 --> 0:02:17.919
<v Speaker 1>history of speech recognition is also a history of designing

0:02:18.000 --> 0:02:21.360
<v Speaker 1>machines that talk back to us. They don't just listen

0:02:21.400 --> 0:02:23.680
<v Speaker 1>to what we have to say, but they can communicate

0:02:23.760 --> 0:02:26.560
<v Speaker 1>back to us. So this is one of those earliest

0:02:27.520 --> 0:02:31.320
<v Speaker 1>versions of that. And he wasn't alone. In seventeen ninety

0:02:31.360 --> 0:02:36.360
<v Speaker 1>one, Wolfgang von Kempelen in Vienna, he built the

0:02:36.480 --> 0:02:41.480
<v Speaker 1>Acoustic Mechanical Speech Machine or two for two. Yeah, yeah,

0:02:41.520 --> 0:02:47.200
<v Speaker 1>and then let's just skip the entire nineteenth century. But

0:02:47.840 --> 0:02:50.400
<v Speaker 1>wait a minute, actually, wasn't it, I think, that

0:02:50.480 --> 0:02:54.079
<v Speaker 1>Alexander Graham Bell's wife was deaf, correct? And he

0:02:54.360 --> 0:02:57.240
<v Speaker 1>originally when he was starting to play around with sound,

0:02:57.560 --> 0:03:00.360
<v Speaker 1>was trying to create the device

0:03:00.360 --> 0:03:03.280
<v Speaker 1>that would transform audible words into a visual output that

0:03:03.360 --> 0:03:07.440
<v Speaker 1>a deaf person could interpret. And I mean he wound

0:03:07.520 --> 0:03:10.960
<v Speaker 1>up creating pictures with sound, but his wife never really

0:03:11.000 --> 0:03:15.000
<v Speaker 1>managed to interpret them. However, that research started going into

0:03:15.000 --> 0:03:18.280
<v Speaker 1>things like the telephone. Well, and also there were early

0:03:18.639 --> 0:03:22.160
<v Speaker 1>attempts when the gramophone first became a thing, when they

0:03:22.200 --> 0:03:25.720
<v Speaker 1>started to use wax cylinders to record sound in a

0:03:25.720 --> 0:03:29.200
<v Speaker 1>physical medium. Was that Edison or somebody before him?

0:03:29.440 --> 0:03:32.000
<v Speaker 1>Bell did some of this as well, So we're talking

0:03:32.080 --> 0:03:34.760
<v Speaker 1>about there are actually quite a few inventors who were

0:03:34.760 --> 0:03:37.160
<v Speaker 1>working on this sort of technology. But there were people

0:03:37.200 --> 0:03:41.560
<v Speaker 1>who had created these different devices to record sound on

0:03:41.600 --> 0:03:44.360
<v Speaker 1>a physical medium, and there were already people thinking, well,

0:03:44.360 --> 0:03:46.040
<v Speaker 1>if we can do this, is there some way where

0:03:46.080 --> 0:03:49.080
<v Speaker 1>we can reverse the process, where we take the physical

0:03:49.120 --> 0:03:52.400
<v Speaker 1>medium and make that into an input of some type.

0:03:52.440 --> 0:03:55.040
<v Speaker 1>And some people were even thinking maybe we can make

0:03:55.320 --> 0:04:00.720
<v Speaker 1>an automatic typewriter. Once the mechanical typewriter came into being,

0:04:00.720 --> 0:04:03.600
<v Speaker 1>there were thoughts of if there's some way to make

0:04:03.800 --> 0:04:07.440
<v Speaker 1>this same process where we're recording sound onto a physical

0:04:07.440 --> 0:04:11.000
<v Speaker 1>medium then turn into a way of actually transmitting this

0:04:11.040 --> 0:04:14.360
<v Speaker 1>into text. That would be amazing. No one quite figured

0:04:14.400 --> 0:04:16.480
<v Speaker 1>it out at that point, but it got a lot

0:04:16.520 --> 0:04:19.800
<v Speaker 1>of people thinking, and in fact, Bell Laboratories was one

0:04:19.839 --> 0:04:24.640
<v Speaker 1>of the leading companies or leading research firms that was

0:04:24.720 --> 0:04:28.200
<v Speaker 1>really concentrated on this speech recognition problem. And in the

0:04:28.279 --> 0:04:32.400
<v Speaker 1>nineteen thirties there was a guy named Homer Dudley. I

0:04:32.400 --> 0:04:35.719
<v Speaker 1>guess probably not quite as colorful as Kratzenstein or Wolfgang.

0:04:36.040 --> 0:04:39.360
<v Speaker 1>It's okay, it's a more uh, what would you say,

0:04:39.400 --> 0:04:44.120
<v Speaker 1>it's a more Americana kind of name. Homer Dudley, Yeah,

0:04:44.120 --> 0:04:47.280
<v Speaker 1>Homer Dudley. He was at Bell Labs. He proposed a

0:04:47.360 --> 0:04:51.520
<v Speaker 1>system model for speech analysis and synthesis, and he also

0:04:51.600 --> 0:04:55.760
<v Speaker 1>designed the Voice Operating Demonstrator, also known as the VODER,

0:04:56.279 --> 0:04:59.480
<v Speaker 1>which was a speech synthesizer, And this was essentially building

0:04:59.480 --> 0:05:02.279
<v Speaker 1>on that same work that the other guys had done

0:05:02.640 --> 0:05:07.560
<v Speaker 1>centuries before, but in an electronic capacity as opposed to mechanical.

0:05:08.720 --> 0:05:11.480
<v Speaker 1>Then we get into this era where the researchers were

0:05:11.480 --> 0:05:13.640
<v Speaker 1>starting to try and figure out how to make machines

0:05:13.720 --> 0:05:17.520
<v Speaker 1>actually understand speech. At least on a surface level. And

0:05:17.839 --> 0:05:22.920
<v Speaker 1>the early emphasis was on phonetics, which is the sounds

0:05:22.960 --> 0:05:25.440
<v Speaker 1>we make in our language. You know, each language

0:05:25.440 --> 0:05:29.880
<v Speaker 1>has its own list of phonemes that we generate in

0:05:29.960 --> 0:05:32.640
<v Speaker 1>order to make the words. These are, yeah, the building

0:05:32.640 --> 0:05:36.520
<v Speaker 1>blocks kind of speech. It's the individual sounds, and they're

0:05:36.560 --> 0:05:39.880
<v Speaker 1>similar across languages. But yeah, for example, English has how

0:05:39.920 --> 0:05:43.760
<v Speaker 1>about forty. Linguists are actually kind of in a disagreement

0:05:43.760 --> 0:05:46.600
<v Speaker 1>about the exact number. It all depends on where you are,

0:05:46.760 --> 0:05:50.080
<v Speaker 1>like in the South. Next, we produce sounds in the South,

0:05:50.400 --> 0:05:52.839
<v Speaker 1>like we can make a one syllable word into at

0:05:52.920 --> 0:05:57.159
<v Speaker 1>least three or four syllables, y'all. So we have the

0:05:57.200 --> 0:06:00.320
<v Speaker 1>ability to insert sounds where no sound was before, which

0:06:00.320 --> 0:06:02.920
<v Speaker 1>is why things like Hooked on Phonics does not necessarily

0:06:02.960 --> 0:06:06.440
<v Speaker 1>work as well as advertised, not necessarily unless you create

0:06:06.480 --> 0:06:09.960
<v Speaker 1>a regionalized version, in which case that would be interesting.

0:06:10.360 --> 0:06:13.560
<v Speaker 1>But you've got to understand how hard this is for

0:06:13.640 --> 0:06:17.760
<v Speaker 1>computers to hear, right, So we're used to it. I mean,

0:06:17.800 --> 0:06:19.880
<v Speaker 1>we talk to people all the time. But just think

0:06:19.880 --> 0:06:23.880
<v Speaker 1>about like when you're on the phone and say somebody

0:06:23.960 --> 0:06:26.920
<v Speaker 1>is reading something to you, like spelling out a word

0:06:27.040 --> 0:06:29.719
<v Speaker 1>or something, and you go, did you say P or B?

0:06:30.720 --> 0:06:34.000
<v Speaker 1>Did you say P? I? Could you know what? That's

0:06:34.040 --> 0:06:37.240
<v Speaker 1>why you need the Yankee Hotel Foxtrot kind of, sure,

0:06:37.320 --> 0:06:42.400
<v Speaker 1>sure stuff. So when you it's so easy for computers

0:06:42.400 --> 0:06:45.840
<v Speaker 1>to mess this up, well yeah, and beyond that, even

0:06:45.880 --> 0:06:49.720
<v Speaker 1>if you are enunciating clearly, the speed at which you

0:06:49.760 --> 0:06:55.920
<v Speaker 1>say a word can completely make a computer misunderstand you, absolutely,

0:06:56.040 --> 0:06:59.000
<v Speaker 1>because if you've programmed the computer program in such a way,

0:06:59.040 --> 0:07:01.480
<v Speaker 1>you've built the computer program in such a way that

0:07:02.400 --> 0:07:06.080
<v Speaker 1>it analyzes a word based on a sequence of sounds,

0:07:06.120 --> 0:07:09.359
<v Speaker 1>and it expects each part of that sequence to be

0:07:09.360 --> 0:07:11.880
<v Speaker 1>a certain length. If you pronounce that word at a

0:07:11.920 --> 0:07:14.800
<v Speaker 1>different speed than someone else, then the computer might have

0:07:14.880 --> 0:07:18.080
<v Speaker 1>trouble figuring out that the two versions of that was

0:07:18.120 --> 0:07:22.760
<v Speaker 1>the same word. And this can vary within an individual's speech.

0:07:23.360 --> 0:07:26.160
<v Speaker 1>I could say the same word twice in one paragraph,

0:07:26.520 --> 0:07:29.880
<v Speaker 1>and the way I say it each time might be

0:07:29.960 --> 0:07:32.880
<v Speaker 1>different enough to cause problems. Right, So these are non

0:07:32.880 --> 0:07:35.520
<v Speaker 1>trivial problems, and in these early days they were mostly

0:07:35.600 --> 0:07:37.760
<v Speaker 1>focused on just trying to figure out how to teach

0:07:37.800 --> 0:07:42.480
<v Speaker 1>a computer to recognize those basic sounds. In nineteen fifty two,

0:07:42.480 --> 0:07:46.360
<v Speaker 1>Bell Labs introduced the Audrey system and that could recognize

0:07:46.440 --> 0:07:50.080
<v Speaker 1>spoken digits, which made it a little easier because you

0:07:50.240 --> 0:07:53.520
<v Speaker 1>eliminate everything that's not a digit, right, you're just going

0:07:53.560 --> 0:07:58.080
<v Speaker 1>through a series of what ten, twenty? It was probably

0:07:58.080 --> 0:08:01.080
<v Speaker 1>only nine actually, because you usually do one at a time.

0:08:01.800 --> 0:08:04.560
<v Speaker 1>Maybe maybe ten. If you include zero there as well,

0:08:05.160 --> 0:08:07.440
<v Speaker 1>I mean you might not. It all depends. They discovered

0:08:07.440 --> 0:08:10.360
<v Speaker 1>the number zero in the nineteen fifties? They did, but

0:08:10.440 --> 0:08:13.800
<v Speaker 1>they lost it for a while. Oh yeah, so, but

0:08:13.960 --> 0:08:16.040
<v Speaker 1>I think by nineteen fifty two they re found it.

0:08:16.120 --> 0:08:18.160
<v Speaker 1>The Mayans had it at least. Yeah, it was you know,

0:08:18.280 --> 0:08:22.720
<v Speaker 1>twenty thirteen, we had to have it. I mean yeah.

0:08:22.760 --> 0:08:26.080
<v Speaker 1>Bell Labs ended up having this Audrey system, and by

0:08:26.360 --> 0:08:29.440
<v Speaker 1>limiting it to just digits, it meant that they could

0:08:29.520 --> 0:08:34.959
<v Speaker 1>work very hard on a drastically simplified version of speech recognition,

0:08:35.120 --> 0:08:37.480
<v Speaker 1>because again, you just throw out anything that's non digit,

0:08:37.920 --> 0:08:41.240
<v Speaker 1>and it means the computer it can concentrate on which

0:08:41.840 --> 0:08:45.440
<v Speaker 1>digit did that sound like the most, based upon the

0:08:45.440 --> 0:08:48.640
<v Speaker 1>phonemes that are needed to say whatever that digit is.

0:08:49.400 --> 0:08:51.680
<v Speaker 1>In nineteen sixty two, so this is a decade later,

0:08:51.960 --> 0:08:55.480
<v Speaker 1>IBM demonstrated the Shoebox machine at the World's Fair

0:08:56.400 --> 0:09:01.080
<v Speaker 1>and it could understand sixteen words spoken in English. Another

0:09:01.160 --> 0:09:04.920
<v Speaker 1>good point is that speech recognition, some of these systems

0:09:04.960 --> 0:09:08.600
<v Speaker 1>are language specific. It's not that it can adapt to

0:09:08.679 --> 0:09:12.440
<v Speaker 1>any language. Most of the programs are programmed specifically

0:09:12.480 --> 0:09:15.280
<v Speaker 1>for a certain one, right exactly. So, Again, if the

0:09:15.320 --> 0:09:18.400
<v Speaker 1>phonemes that we produce here are different from ones in,

0:09:18.520 --> 0:09:22.840
<v Speaker 1>say China, then it's not going to give you like

0:09:23.040 --> 0:09:26.240
<v Speaker 1>whatever it produces is not going to be the response

0:09:26.320 --> 0:09:31.160
<v Speaker 1>that someone who's speaking Chinese would want, right So generally,

0:09:31.200 --> 0:09:34.400
<v Speaker 1>in the nineteen sixties, Japanese labs began to work on

0:09:34.920 --> 0:09:38.560
<v Speaker 1>vowel recognition phonemes, and they also did some early work

0:09:38.559 --> 0:09:41.840
<v Speaker 1>in continuous speech recognition. Now this is important because again

0:09:41.880 --> 0:09:45.000
<v Speaker 1>those early speech recognition programs, even when they got to

0:09:45.040 --> 0:09:47.319
<v Speaker 1>the point where they could recognize full words, you had

0:09:47.320 --> 0:09:55.960
<v Speaker 1>to put long pauses between each word or else it

0:09:56.000 --> 0:09:58.600
<v Speaker 1>never would. And unless you're William Shatner, that's not really

0:09:58.679 --> 0:10:03.840
<v Speaker 1>a natural way of speaking, or Christopher Walken. Yeah. Either way.

0:10:05.120 --> 0:10:08.480
<v Speaker 1>Also in the sixties, Fry and Denes, two researchers at

0:10:08.480 --> 0:10:12.120
<v Speaker 1>the University College in England, designed a phoneme recognizer

0:10:12.640 --> 0:10:16.640
<v Speaker 1>that could recognize four vowel sounds and nine consonant sounds,

0:10:16.960 --> 0:10:20.800
<v Speaker 1>and they used statistical data on phoneme sequences found in

0:10:20.840 --> 0:10:24.080
<v Speaker 1>English to help the system recognize more words than it

0:10:24.200 --> 0:10:27.440
<v Speaker 1>normally would. And this is kind of interesting. What you

0:10:27.480 --> 0:10:30.200
<v Speaker 1>do is you say, all right, there are a certain

0:10:30.720 --> 0:10:36.280
<v Speaker 1>limited number of sounds typically found in the spoken English language.

0:10:37.080 --> 0:10:40.040
<v Speaker 1>But those sounds are not you know, it's not that

0:10:40.040 --> 0:10:42.160
<v Speaker 1>those are completely interchangeable and that you're going to find

0:10:42.200 --> 0:10:45.280
<v Speaker 1>every single combination of those sounds in an English word.

0:10:45.280 --> 0:10:49.320
<v Speaker 1>There's certain sounds that are rarely, if ever, going to

0:10:49.360 --> 0:10:52.160
<v Speaker 1>go together. So if you start to take those sounds

0:10:52.160 --> 0:10:55.280
<v Speaker 1>out and then concentrate on the words that do use

0:10:55.559 --> 0:10:58.800
<v Speaker 1>the sounds that are left, you have reduced the number

0:10:58.880 --> 0:11:03.319
<v Speaker 1>of possibilities and thus made the system more efficient and reliable.
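
NOTE
A minimal sketch of the pruning idea described in the cues above, not Fry and Denes's actual method.
It assumes a toy training set of phoneme sequences (the symbols and data are invented for illustration)
and two hypothetical helpers, bigram_counts and plausible. Candidates that contain a phoneme pair
never seen in training are thrown out, which shrinks the set of possibilities.
```python
# Toy illustration only: phoneme symbols and training data are invented.
from collections import Counter
def bigram_counts(pronunciations):
    """Count adjacent phoneme pairs across a list of phoneme sequences."""
    counts = Counter()
    for phones in pronunciations:
        counts.update(zip(phones, phones[1:]))
    return counts
def plausible(candidate, counts, min_count=1):
    """True if every adjacent phoneme pair was seen at least min_count times."""
    return all(counts[pair] >= min_count for pair in zip(candidate, candidate[1:]))
training = [["s", "t", "aa", "p"], ["s", "p", "aa", "t"], ["t", "aa", "p", "s"]]
counts = bigram_counts(training)
candidates = [["s", "t", "aa", "t"], ["t", "s", "aa", "p"], ["p", "aa", "t"]]
# The middle candidate is dropped because the pair ("t", "s") never occurs in training.
kept = [c for c in candidates if plausible(c, counts)]
print(kept)
```
A real recognizer would more likely weight candidates by probability than discard them outright.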

0:11:03.800 --> 0:11:05.880
<v Speaker 1>So now are we starting to get into an era

0:11:06.120 --> 0:11:10.040
<v Speaker 1>of what you're talking about here where the machines are

0:11:10.080 --> 0:11:14.680
<v Speaker 1>doing some analysis, Yes, to uh, to figure out what

0:11:14.760 --> 0:11:18.400
<v Speaker 1>the language means, right, well, really what it means even

0:11:18.440 --> 0:11:20.400
<v Speaker 1>just to figure out what the word is exactly. Yes,

0:11:20.440 --> 0:11:23.800
<v Speaker 1>that's what I meant. To interpret the sounds into words.

0:11:24.200 --> 0:11:27.439
<v Speaker 1>It's not just drawing on things that have been directly

0:11:27.520 --> 0:11:31.280
<v Speaker 1>programmed into it, you know, the hard coded, right, understanding

0:11:31.320 --> 0:11:35.160
<v Speaker 1>that it's using statistical analysis. Yes, and and I mean

0:11:35.240 --> 0:11:37.800
<v Speaker 1>clearly this would be important if you're talking about any

0:11:37.840 --> 0:11:42.520
<v Speaker 1>sort of dictation software, right, because with dictation software, to

0:11:42.880 --> 0:11:47.800
<v Speaker 1>program every single word in the English language into a

0:11:48.040 --> 0:11:53.679
<v Speaker 1>vocabulary for this program and to do every variation of

0:11:53.720 --> 0:11:57.319
<v Speaker 1>the pronunciation of that word would be pretty that'd be

0:11:57.360 --> 0:11:59.440
<v Speaker 1>a lot of work. Yeah. So if you can create

0:11:59.559 --> 0:12:03.360
<v Speaker 1>a system that can analyze the phonemes and then, based

0:12:03.400 --> 0:12:06.760
<v Speaker 1>upon the certain statistical analysis, figure out or make a

0:12:06.760 --> 0:12:09.719
<v Speaker 1>best guess at what that word is, you've fixed a

0:12:09.720 --> 0:12:11.840
<v Speaker 1>lot of the problems. And in fact best guess becomes

0:12:11.920 --> 0:12:15.360
<v Speaker 1>really important in just a few decades. So in nineteen

0:12:15.440 --> 0:12:18.960
<v Speaker 1>seventy one, oh wait, I'm sorry, let me back up.

0:12:19.480 --> 0:12:23.080
<v Speaker 1>Late sixties early seventies, researchers start to look into non

0:12:23.360 --> 0:12:27.960
<v Speaker 1>uniform timescale approaches to speech recognition, which is what I

0:12:28.000 --> 0:12:30.960
<v Speaker 1>was talking about earlier. The fact that not everyone speaks

0:12:31.080 --> 0:12:33.360
<v Speaker 1>the same words at the same speed or uses the

0:12:33.360 --> 0:12:37.760
<v Speaker 1>same emphasis. So you have to figure out a way

0:12:37.800 --> 0:12:42.240
<v Speaker 1>of analyzing that and accounting for that. And it's called

0:12:42.320 --> 0:12:46.959
<v Speaker 1>the it's called dynamic time warping, which is not a

0:12:47.080 --> 0:12:48.800
<v Speaker 1>jump to the left and a step to the right.
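
NOTE
A minimal sketch of dynamic time warping, assuming each utterance has already been reduced to a
short list of numbers (real systems compare acoustic feature vectors; the sequences and the
function name dtw_distance here are invented for illustration). It shows that the same "word"
spoken at half speed still aligns with zero cost, while a different sequence does not.
```python
def dtw_distance(a, b):
    """Classic dynamic-programming DTW cost between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Allow stretching either sequence, or advancing both together.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
fast = [1, 3, 4, 1]
slow = [1, 1, 3, 3, 4, 4, 1, 1]  # same shape as fast, spoken at half speed
other = [4, 1, 1, 3]             # a different "word"
print(dtw_distance(fast, slow))
print(dtw_distance(fast, other))
```
The first distance is zero because every sample in the slow version can be matched to its twin in the fast one.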

0:12:49.600 --> 0:12:53.760
<v Speaker 1>I'm disappointed, Jonathan, I'm sorry. Dynamic to me, H, well,

0:12:53.800 --> 0:12:56.559
<v Speaker 1>you know, I'll take you to a movie on Friday.

0:12:56.920 --> 0:13:00.680
<v Speaker 1>Nineteen seventy one, the United States Department of Defense Advanced

0:13:00.880 --> 0:13:05.520
<v Speaker 1>Research Projects Agency, also known as DARPA, initiates a program

0:13:05.559 --> 0:13:10.480
<v Speaker 1>called Speech Understanding Research, or SUR, and it funded

0:13:10.520 --> 0:13:15.280
<v Speaker 1>several projects, including one by Carnegie Mellon University called Harpy,

0:13:15.840 --> 0:13:19.200
<v Speaker 1>which is just charming. But yes, there's a speech understanding

0:13:19.200 --> 0:13:23.240
<v Speaker 1>system which could understand one thousand and eleven words, which

0:13:23.240 --> 0:13:25.000
<v Speaker 1>I said was about the same as a vocabulary of

0:13:25.040 --> 0:13:28.440
<v Speaker 1>a three year old. And it used something called beam

0:13:28.720 --> 0:13:31.880
<v Speaker 1>search to narrow down the possibilities of what a spoken

0:13:32.040 --> 0:13:36.400
<v Speaker 1>sound could be by comparing it to the statistical data

0:13:36.400 --> 0:13:38.600
<v Speaker 1>and going with the most likely results. So it's going

0:13:38.600 --> 0:13:41.679
<v Speaker 1>with probabilities. And so this is really interesting to me

0:13:41.800 --> 0:13:44.320
<v Speaker 1>because it doesn't necessarily mean it's going to produce the

0:13:44.400 --> 0:13:47.200
<v Speaker 1>correct result. It's making a best guess based upon the

0:13:47.200 --> 0:13:50.560
<v Speaker 1>input that it got what it was you said. So

0:13:50.679 --> 0:13:53.199
<v Speaker 1>in this case, if I were having the conversation with you, Joe,

0:13:53.280 --> 0:13:55.200
<v Speaker 1>and I said a letter and you weren't sure if

0:13:55.240 --> 0:13:57.560
<v Speaker 1>it was p or b you. Instead of you asking me,

0:13:57.600 --> 0:13:59.079
<v Speaker 1>you just say, well, I think it probably was

0:13:59.120 --> 0:14:01.839
<v Speaker 1>a P. I'm just gonna write, well, I mean, if

0:14:02.200 --> 0:14:04.120
<v Speaker 1>your computer is smart enough and it has a large

0:14:04.200 --> 0:14:09.680
<v Speaker 1>enough dictionary, it might understand that, say, the word

0:14:09.720 --> 0:14:12.080
<v Speaker 1>starting with a P sound makes sense here, but the

0:14:12.080 --> 0:14:14.400
<v Speaker 1>word starting with a B sound does not. So like

0:14:14.800 --> 0:14:17.679
<v Speaker 1>I ate a pear or I ate a bear, and

0:14:17.960 --> 0:14:21.960
<v Speaker 1>now some days, some days the pear eats you. Exactly.

0:14:22.560 --> 0:14:25.200
<v Speaker 1>But of course i'd imagine the machine at that time

0:14:25.360 --> 0:14:29.240
<v Speaker 1>didn't have the resources to say, go figure out if

0:14:29.280 --> 0:14:31.960
<v Speaker 1>I ate a pear or I ate a bear made

0:14:31.960 --> 0:14:34.680
<v Speaker 1>more sense, right right? We need to remember that this

0:14:34.760 --> 0:14:38.160
<v Speaker 1>is you said, early seventies, so this is when you know,

0:14:38.240 --> 0:14:40.880
<v Speaker 1>computers were the size of like three of my car

0:14:40.960 --> 0:14:43.080
<v Speaker 1>at least, you know, right well, and there were I mean,

0:14:43.280 --> 0:14:46.200
<v Speaker 1>there was no Internet yet, and that'll come in a

0:14:46.400 --> 0:14:49.120
<v Speaker 1>big way in a little bit here. By the seventies

0:14:49.120 --> 0:14:54.280
<v Speaker 1>they had Arpanet, but that was very limited and that

0:14:54.440 --> 0:14:55.760
<v Speaker 1>didn't have anything to do with it. They had

0:14:55.800 --> 0:14:58.240
<v Speaker 1>no web to draw on, no no web at all

0:14:58.280 --> 0:15:02.800
<v Speaker 1>for massive sampling of of data. So nineteen seventy six,

0:15:03.080 --> 0:15:07.520
<v Speaker 1>the SUR program that DARPA had concludes. There were a

0:15:07.520 --> 0:15:11.200
<v Speaker 1>couple of other agencies that had tried to create speech

0:15:11.320 --> 0:15:16.000
<v Speaker 1>understanding algorithms and hardware, but had not quite met the

0:15:16.080 --> 0:15:18.280
<v Speaker 1>requirements by the end of the program to really count

0:15:18.320 --> 0:15:21.600
<v Speaker 1>as a success, but they did end up contributing quite

0:15:21.600 --> 0:15:26.360
<v Speaker 1>a bit to future endeavors. So then we've got the

0:15:26.480 --> 0:15:31.280
<v Speaker 1>nineteen eighties that typically follows the seventies, and that's when

0:15:31.320 --> 0:15:35.040
<v Speaker 1>they introduced a statistical method that was based on the

0:15:35.160 --> 0:15:38.640
<v Speaker 1>hidden Markov model. Have you guys heard of this? The HMM?

0:15:38.760 --> 0:15:42.360
<v Speaker 1>all right, it's a little complicated and it's difficult to

0:15:42.400 --> 0:15:47.760
<v Speaker 1>really explain without the benefit of complicated graphics behind me,

0:15:47.800 --> 0:15:52.840
<v Speaker 1>but I will try. So it's a probability model. And

0:15:53.520 --> 0:15:57.360
<v Speaker 1>let's say that you've got let's say you've got three

0:15:58.280 --> 0:16:01.200
<v Speaker 1>urns in front of you. Okay, three vases are

0:16:01.240 --> 0:16:04.320
<v Speaker 1>in front of you. They're solid, you can't see through them,

0:16:04.600 --> 0:16:07.960
<v Speaker 1>but you see that you've put a certain number of

0:16:08.480 --> 0:16:11.840
<v Speaker 1>orange ping pong balls in each. The first one has

0:16:11.880 --> 0:16:14.240
<v Speaker 1>the most. You put a certain number of white ping

0:16:14.280 --> 0:16:16.000
<v Speaker 1>pong balls in each, The middle one has the most

0:16:16.000 --> 0:16:18.040
<v Speaker 1>of those, and you put a certain number of yellow

0:16:18.120 --> 0:16:20.760
<v Speaker 1>ping pong balls in each. The third one has the

0:16:20.760 --> 0:16:23.720
<v Speaker 1>most of those, and then you already know the states

0:16:23.760 --> 0:16:26.280
<v Speaker 1>of the you're actually watching as you draw these ping

0:16:26.360 --> 0:16:28.440
<v Speaker 1>pong balls out, and then you're combining them to get

0:16:28.480 --> 0:16:32.400
<v Speaker 1>some sort of response at the end. It doesn't matter

0:16:32.400 --> 0:16:34.920
<v Speaker 1>what the response is, but you're drawing a ping pong

0:16:34.960 --> 0:16:38.480
<v Speaker 1>ball out from each combining them together, and that you

0:16:38.640 --> 0:16:41.160
<v Speaker 1>see the whole process. Now, that's a normal Markov model

0:16:41.200 --> 0:16:44.760
<v Speaker 1>because you know the state of each of those draws

0:16:44.960 --> 0:16:50.200
<v Speaker 1>from the vases, all right, so you observe the state. Now,

0:16:50.280 --> 0:16:52.400
<v Speaker 1>let's say those vases are in one room and

0:16:52.400 --> 0:16:54.480
<v Speaker 1>you're in another room, and you cannot see into the

0:16:54.880 --> 0:16:56.600
<v Speaker 1>other room. You just get to see the output of

0:16:56.640 --> 0:16:58.840
<v Speaker 1>the three ping pong balls as they come out of

0:16:58.880 --> 0:17:03.000
<v Speaker 1>this process. So you don't see which one's drawn from

0:17:03.000 --> 0:17:05.320
<v Speaker 1>which urn, but you know that one is drawn from

0:17:05.320 --> 0:17:08.520
<v Speaker 1>each one, and you see what the result is. Now,

0:17:08.840 --> 0:17:12.000
<v Speaker 1>you don't know the state of those individual urns, but

0:17:12.040 --> 0:17:14.680
<v Speaker 1>you do see the result, which gives you enough information

0:17:14.760 --> 0:17:18.520
<v Speaker 1>to draw some conclusions about the state of the urns

0:17:18.560 --> 0:17:20.959
<v Speaker 1>inside the room. Not enough for you to know for certain,

0:17:21.359 --> 0:17:24.280
<v Speaker 1>but you can get sort of a probability of what

0:17:24.680 --> 0:17:27.240
<v Speaker 1>happened in there to get the result that you have.

0:17:27.400 --> 0:17:31.560
<v Speaker 1>That's a hidden Markov model, and that is an oversimplification

0:17:31.760 --> 0:17:34.080
<v Speaker 1>of the hidden Markov model. So anyone out there who

0:17:34.119 --> 0:17:38.200
<v Speaker 1>actually works with systems that use this is screaming that's

0:17:38.400 --> 0:17:41.560
<v Speaker 1>way too simplistic, I know, but this is the easiest

0:17:41.560 --> 0:17:43.800
<v Speaker 1>way for me to explain it, Okay. But so basically

0:17:43.920 --> 0:17:47.560
<v Speaker 1>what you're saying is that it uses it looks at

0:17:47.560 --> 0:17:51.600
<v Speaker 1>the statistical prevalence of these three different colors appearing into

0:17:51.640 --> 0:17:56.600
<v Speaker 1>the room, and by that it makes judgments about how

0:17:56.680 --> 0:18:00.800
<v Speaker 1>common they probably are in the vases more or less.
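
NOTE
A minimal sketch of the urn picture above, assuming the textbook HMM setup in which one hidden urn
is chosen per draw and only the ball color is observed. All probabilities are invented for
illustration, and the Viterbi decoding used here is a standard HMM technique that the hosts do
not name. It guesses the most probable sequence of hidden urns from the colors alone.
```python
urns = ["urn1", "urn2", "urn3"]
# Assumed numbers: equal start odds, and the hidden process tends to stay on the same urn.
start = {u: 1 / 3 for u in urns}
trans = {u: {v: (0.5 if u == v else 0.25) for v in urns} for u in urns}
# Each urn holds mostly one color, as in the description: orange, white, yellow.
emit = {
    "urn1": {"orange": 0.7, "white": 0.15, "yellow": 0.15},
    "urn2": {"orange": 0.15, "white": 0.7, "yellow": 0.15},
    "urn3": {"orange": 0.15, "white": 0.15, "yellow": 0.7},
}
def viterbi(observed):
    """Return the most probable hidden urn sequence for the observed colors."""
    # best[u] holds (probability, path) of the best path that ends in urn u.
    best = {u: (start[u] * emit[u][observed[0]], [u]) for u in urns}
    for color in observed[1:]:
        new_best = {}
        for u in urns:
            prob, path = max(((best[p][0] * trans[p][u], best[p][1]) for p in urns), key=lambda t: t[0])
            new_best[u] = (prob * emit[u][color], path + [u])
        best = new_best
    return max(best.values(), key=lambda t: t[0])[1]
print(viterbi(["orange", "orange", "white", "yellow", "yellow"]))
```
The answer is a best guess, not a certainty, which matches the point made in the discussion.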

0:18:00.840 --> 0:18:04.800
<v Speaker 1>And so these models are used a lot in things

0:18:04.880 --> 0:18:08.520
<v Speaker 1>that require a lot of interpretation on the machine's part. Voice

0:18:08.520 --> 0:18:10.440
<v Speaker 1>recognition is a big part of that, but it's not

0:18:10.480 --> 0:18:15.840
<v Speaker 1>just voice recognition, gesture recognition, handwriting recognition, anything where you know,

0:18:15.960 --> 0:18:19.000
<v Speaker 1>two people could try and make the same result, but

0:18:19.119 --> 0:18:22.800
<v Speaker 1>because we are individuals and because we do think slightly differently,

0:18:23.280 --> 0:18:26.200
<v Speaker 1>even though we're both creating the same result, we're doing

0:18:26.200 --> 0:18:28.199
<v Speaker 1>it in a different way. The computer has to be

0:18:28.200 --> 0:18:31.000
<v Speaker 1>able to interpret that, right. So it's because it's taking

0:18:31.440 --> 0:18:35.720
<v Speaker 1>sort of ambiguous analog data from the world, sure, yeah,

0:18:35.760 --> 0:18:38.119
<v Speaker 1>and it has to be able to react to that

0:18:38.240 --> 0:18:42.280
<v Speaker 1>and create a meaningful result. So once people started to

0:18:42.440 --> 0:18:47.960
<v Speaker 1>concentrate on this form of statistical analysis, voice recognition pretty

0:18:48.040 --> 0:18:51.959
<v Speaker 1>much hit its peak as far as recognizing individual words,

0:18:52.520 --> 0:18:55.760
<v Speaker 1>not necessarily knowing what the context is or what the

0:18:55.800 --> 0:18:58.520
<v Speaker 1>meaning is, but it meant that if you were speaking

0:18:58.600 --> 0:19:02.959
<v Speaker 1>into a machine that had this kind of software in it,

0:19:02.960 --> 0:19:06.560
<v Speaker 1>it could determine with relative ease what it was you

0:19:06.640 --> 0:19:10.200
<v Speaker 1>were saying, not what it meant, but what the actual

0:19:10.240 --> 0:19:12.639
<v Speaker 1>words were. So if, for example, if it's a simple

0:19:12.800 --> 0:19:16.960
<v Speaker 1>speech to text program, it would be fairly accurate, and it

0:19:17.000 --> 0:19:19.840
<v Speaker 1>got more accurate as time went on. In nineteen eighty two,

0:19:19.880 --> 0:19:23.960
<v Speaker 1>that's when a certain Ray Kurzweil got involved. Our old

0:19:24.800 --> 0:19:29.000
<v Speaker 1>Kurzweil's a well known futurist, one of those evangelists for

0:19:29.080 --> 0:19:32.879
<v Speaker 1>the oncoming singularity, a fellow who I think is hoping

0:19:32.920 --> 0:19:38.200
<v Speaker 1>to achieve immortality through technology in some method or another personally, yes, definitely.

0:19:38.680 --> 0:19:42.120
<v Speaker 1>So he created in nineteen eighty two the Kurzweil Applied

0:19:42.240 --> 0:19:47.600
<v Speaker 1>Intelligence division, company, really, and it was all about creating

0:19:47.640 --> 0:19:52.119
<v Speaker 1>computer based speech recognition. And in nineteen eighty seven it

0:19:52.240 --> 0:19:56.960
<v Speaker 1>introduced a commercial speech recognition system. And Kurzweil was really

0:19:57.000 --> 0:20:02.320
<v Speaker 1>applying his expertise in two areas, computer science and pattern recognition.

0:20:02.800 --> 0:20:05.560
<v Speaker 1>He was really interested in the way that computers can

0:20:05.680 --> 0:20:10.520
<v Speaker 1>identify patterns and respond to them, and speech was certainly

0:20:10.680 --> 0:20:14.000
<v Speaker 1>part of that. So he applied that knowledge and that

0:20:14.040 --> 0:20:18.200
<v Speaker 1>expertise and really made some big contributions in the speech

0:20:18.240 --> 0:20:21.680
<v Speaker 1>recognition field. Skipping over to the nineteen nineties, I mean,

0:20:21.760 --> 0:20:24.879
<v Speaker 1>essentially we're having this field evolve over time. But in

0:20:24.920 --> 0:20:28.200
<v Speaker 1>the nineties we started seeing the development of real speech

0:20:28.280 --> 0:20:31.920
<v Speaker 1>enabled applications. So this is when we started getting those

0:20:32.640 --> 0:20:34.760
<v Speaker 1>telephone systems where you would call in and get an

0:20:34.800 --> 0:20:39.960
<v Speaker 1>automated response saying, say, "Say or press one," which is

0:20:40.119 --> 0:20:42.399
<v Speaker 1>again going all the way back to the Audrey system

0:20:42.400 --> 0:20:44.360
<v Speaker 1>in nineteen fifty two that Bell Labs did. Yeah,

0:20:44.520 --> 0:20:46.399
<v Speaker 1>You've only got ten responses, and so it just has

0:20:46.400 --> 0:20:48.800
<v Speaker 1>to figure out which one right, and then eventually it

0:20:48.800 --> 0:20:52.080
<v Speaker 1>would get to things like you know, say yes, or

0:20:52.760 --> 0:20:55.679
<v Speaker 1>like I can help you with that. What is your

0:20:55.720 --> 0:21:00.520
<v Speaker 1>problem? Use a keyword. Yeah, not that keyword. Sorry, I

0:21:00.560 --> 0:21:04.879
<v Speaker 1>don't understand. Can you restate that? Yeah. So by twenty

0:21:04.920 --> 0:21:09.040
<v Speaker 1>ten we get the we get Google's English Voice search system,

0:21:09.560 --> 0:21:14.400
<v Speaker 1>which incorporates around two hundred and thirty billion words from

0:21:14.560 --> 0:21:19.400
<v Speaker 1>actual user queries. Wow, have you all tried this thing? Oh? Yeah,

0:21:19.400 --> 0:21:22.160
<v Speaker 1>I use it all the time. No, I do, because

0:21:22.200 --> 0:21:24.560
<v Speaker 1>I've got an Android phone, so I actually do use

0:21:24.680 --> 0:21:28.280
<v Speaker 1>voice search all the time. Sometimes I think it's really

0:21:28.400 --> 0:21:31.600
<v Speaker 1>hilarious how accurate it is, Like, you know, it shouldn't

0:21:31.600 --> 0:21:35.280
<v Speaker 1>recognize that term, but it does. I use it mostly

0:21:35.320 --> 0:21:39.080
<v Speaker 1>for navigation purposes, So I'll pull up a map application

0:21:39.320 --> 0:21:42.119
<v Speaker 1>and it's a Google one. So then I, you know,

0:21:43.280 --> 0:21:45.960
<v Speaker 1>speak destination and I can say an address, or I

0:21:45.960 --> 0:21:49.040
<v Speaker 1>can say a business name, or you know, if I

0:21:49.080 --> 0:21:51.040
<v Speaker 1>have someone in my contact list, I can say their

0:21:51.160 --> 0:21:54.760
<v Speaker 1>name and it pulls up the information, which leads us

0:21:54.840 --> 0:21:59.320
<v Speaker 1>kind of into a second part of this speech recognition discussion.

0:21:59.520 --> 0:22:02.160
<v Speaker 1>We've got the idea that speech and search are really

0:22:03.000 --> 0:22:07.120
<v Speaker 1>tightly connected, actually to the point where advances in one

0:22:07.200 --> 0:22:10.600
<v Speaker 1>field often mean that the other field benefits as a result.

0:22:11.080 --> 0:22:15.240
<v Speaker 1>But now we're talking about not just recognizing words, but

0:22:15.680 --> 0:22:20.640
<v Speaker 1>pulling some sort of meaning from them. Right, Well, what

0:22:20.840 --> 0:22:25.399
<v Speaker 1>is the goal of input of an interface that takes

0:22:25.440 --> 0:22:29.080
<v Speaker 1>input from a human and turns it into data. I mean,

0:22:29.880 --> 0:22:31.800
<v Speaker 1>I don't know. We'll say I would argue that the

0:22:31.920 --> 0:22:36.080
<v Speaker 1>ultimate goal of an input interface is to become invisible,

0:22:36.840 --> 0:22:40.320
<v Speaker 1>to make things as easy and as natural and as

0:22:40.400 --> 0:22:44.560
<v Speaker 1>intuitive for you as it possibly could be, so that

0:22:44.640 --> 0:22:48.000
<v Speaker 1>you don't even recognize the tools you're using, right, Right,

0:22:48.040 --> 0:22:51.800
<v Speaker 1>to give the computer the ability to answer your questions

0:22:51.840 --> 0:22:54.640
<v Speaker 1>almost before you ask them, exactly and right now, you know,

0:22:54.920 --> 0:22:58.960
<v Speaker 1>we're still using tools that we have to learn how

0:22:59.000 --> 0:23:03.000
<v Speaker 1>to use. Right. So when you when you want to

0:23:03.040 --> 0:23:07.639
<v Speaker 1>talk to the voice recognition program on your smartphone, you

0:23:07.760 --> 0:23:10.560
<v Speaker 1>do have to be aware that it's only going to

0:23:10.640 --> 0:23:14.239
<v Speaker 1>be listening to certain keywords, right. You have to you

0:23:14.240 --> 0:23:17.760
<v Speaker 1>have to give it keywords and sort of specific commands

0:23:17.880 --> 0:23:21.439
<v Speaker 1>that it can understand in order for it to help you. Sure,

0:23:21.480 --> 0:23:23.440
<v Speaker 1>And in that sense, it's kind of like a program

0:23:23.480 --> 0:23:26.160
<v Speaker 1>where you know, you have a certain number of buttons

0:23:26.200 --> 0:23:29.000
<v Speaker 1>you can click on, or commands you can enter on

0:23:29.040 --> 0:23:32.399
<v Speaker 1>a command line that are chosen from a list of

0:23:32.480 --> 0:23:36.320
<v Speaker 1>pre selected commands, but you're just doing it with your voice, right.
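
NOTE
A rough sketch of the fixed-command idea described here, not how any real phone assistant works.
The command list and the interpret helper are hypothetical, chosen only for illustration.
```python
# Hypothetical command vocabulary; a real assistant's list would differ.
COMMANDS = {"open", "close", "call", "navigate"}
def interpret(utterance: str) -> str:
    """Return the first recognized command word, or "error" if none matches."""
    for word in utterance.lower().split():
        if word in COMMANDS:
            return word
    # Nothing in the utterance is on the pre-selected list.
    return "error"
print(interpret("Open the map"))   # open
print(interpret("French fries"))   # error
```
Anything outside the list falls through to the error case, which is the limitation the hosts are describing.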

0:23:36.400 --> 0:23:40.720
<v Speaker 1>Anything outside of that would just be interpreted as an error. Sure, yeah, Yeah,

0:23:40.760 --> 0:23:42.400
<v Speaker 1>you can say open and close, but if you say

0:23:42.400 --> 0:23:46.640
<v Speaker 1>French fries, it goes, "Wha?" Yeah. Yeah. So let's say

0:23:46.640 --> 0:23:49.400
<v Speaker 1>you had this and you're looking for something on Google, right,

0:23:49.480 --> 0:23:51.680
<v Speaker 1>you're looking at Google Maps and you're using your voice,

0:23:51.800 --> 0:23:54.280
<v Speaker 1>you could probably say French fries, though, right, should say

0:23:54.280 --> 0:23:57.040
<v Speaker 1>like French fries near my house. Well, even there it

0:23:57.400 --> 0:24:00.480
<v Speaker 1>might be able to understand those keywords. Right, You've given

0:24:00.520 --> 0:24:03.240
<v Speaker 1>it something that it knows how to work with. But

0:24:03.359 --> 0:24:06.480
<v Speaker 1>what if you've got a problem, like I'm trying to

0:24:06.520 --> 0:24:09.479
<v Speaker 1>remember this meal I had that was real good in

0:24:09.600 --> 0:24:12.680
<v Speaker 1>town and I don't know, and you're kind of describing it,

0:24:12.840 --> 0:24:15.479
<v Speaker 1>but right, it can't do anything with that, right, Right,

0:24:15.520 --> 0:24:17.560
<v Speaker 1>you'd have to talk to a person at that point.

0:24:17.720 --> 0:24:19.400
<v Speaker 1>You would either that or you would have to have

0:24:19.520 --> 0:24:23.199
<v Speaker 1>every single restaurant give every single possible explanation of what

0:24:23.280 --> 0:24:26.600
<v Speaker 1>its meals would be like exactly. Yeah. But so this

0:24:26.720 --> 0:24:29.679
<v Speaker 1>leads me to a question about the future of voice

0:24:29.720 --> 0:24:33.399
<v Speaker 1>and speech recognition. And here's a question. Why do we

0:24:33.480 --> 0:24:37.440
<v Speaker 1>call tech support? I mean, if you've I mean, most

0:24:37.440 --> 0:24:40.639
<v Speaker 1>people are working, most people have called tech support at

0:24:40.680 --> 0:24:45.000
<v Speaker 1>some point. But I will venture that almost any problem

0:24:45.320 --> 0:24:48.600
<v Speaker 1>that can be solved by tech support, there's already a

0:24:48.680 --> 0:24:53.600
<v Speaker 1>written out solution to the exact problem you have somewhere online. Sure, right,

0:24:54.040 --> 0:24:57.119
<v Speaker 1>somebody has already solved this problem, and they've probably typed

0:24:57.200 --> 0:25:00.480
<v Speaker 1>up instructions on how to fix it right, and they

0:25:00.520 --> 0:25:03.920
<v Speaker 1>may even be easy to follow instructions. But the challenge

0:25:03.960 --> 0:25:07.360
<v Speaker 1>there is for the person who's experiencing the problem, how

0:25:07.359 --> 0:25:09.320
<v Speaker 1>do they frame their problem in such a way that

0:25:09.359 --> 0:25:12.360
<v Speaker 1>they get exactly the right response. How do they

0:25:12.400 --> 0:25:17.480
<v Speaker 1>connect the problem they're experiencing to the solution that exists

0:25:17.520 --> 0:25:19.959
<v Speaker 1>somewhere out there, if they don't know what the correct

0:25:20.040 --> 0:25:21.919
<v Speaker 1>keywords are, If they don't they don't know what the

0:25:21.960 --> 0:25:24.240
<v Speaker 1>problem is itself? They're just going, my screen won't turn

0:25:24.280 --> 0:25:27.159
<v Speaker 1>on. Exactly. And that's why we call tech support. I

0:25:27.160 --> 0:25:30.159
<v Speaker 1>think you call tech support because you need something that

0:25:30.200 --> 0:25:34.359
<v Speaker 1>can process natural language, which is right now a person.

0:25:34.920 --> 0:25:37.639
<v Speaker 1>A person can listen to you describe your problem, and

0:25:37.720 --> 0:25:40.879
<v Speaker 1>whatever terms you come up with, can take that information,

0:25:41.320 --> 0:25:44.399
<v Speaker 1>get the gist of it, and connect that to a

0:25:44.440 --> 0:25:48.480
<v Speaker 1>piece of knowledge, right, and then in return, that person

0:25:48.600 --> 0:25:53.159
<v Speaker 1>can respond with language that the person who called in

0:25:53.200 --> 0:25:56.560
<v Speaker 1>for tech support can understand. So, for instance, if I'm

0:25:56.720 --> 0:25:59.600
<v Speaker 1>experiencing a problem and I call up Joe and Joe

0:25:59.640 --> 0:26:01.640
<v Speaker 1>tells me how to fix it, but I don't understand

0:26:01.680 --> 0:26:05.840
<v Speaker 1>his explanation, I can say I am sorry, I just

0:26:06.320 --> 0:26:08.600
<v Speaker 1>I don't get that. Joe can actually then take the

0:26:08.640 --> 0:26:11.560
<v Speaker 1>time to reframe what it was he said in a

0:26:11.600 --> 0:26:15.480
<v Speaker 1>way that my puny brain can comprehend, right, And then

0:26:15.520 --> 0:26:17.600
<v Speaker 1>I can turn off my computer and turn it back

0:26:17.640 --> 0:26:20.760
<v Speaker 1>on again and suddenly it works. So yeah, it totally works

0:26:20.800 --> 0:26:24.000
<v Speaker 1>both ways. I mean, but it's so especially important in

0:26:24.400 --> 0:26:26.960
<v Speaker 1>identifying what the problem is to begin with, because a

0:26:26.960 --> 0:26:29.359
<v Speaker 1>lot of times we just don't know the right way

0:26:29.440 --> 0:26:32.480
<v Speaker 1>to explain it to a computer in terms of commands

0:26:32.480 --> 0:26:36.560
<v Speaker 1>and keywords. And so I think this is sort of

0:26:36.600 --> 0:26:39.320
<v Speaker 1>the future of where voice recognition is going from here.

0:26:39.359 --> 0:26:41.520
<v Speaker 1>And there are a couple of things we need to

0:26:41.600 --> 0:26:46.520
<v Speaker 1>explore about voice and speech recognition. One of them is

0:26:47.359 --> 0:26:53.119
<v Speaker 1>how does the computer understand whole speech like sentences that

0:26:53.160 --> 0:26:55.399
<v Speaker 1>you're speaking to it, as opposed to just little words

0:26:55.440 --> 0:26:58.119
<v Speaker 1>at a time, and make sense of those in a

0:26:58.119 --> 0:27:00.919
<v Speaker 1>grammatical way and actually make sense of them instead of

0:27:00.960 --> 0:27:03.040
<v Speaker 1>instead of yeah, picking up on those keywords, because you know,

0:27:03.200 --> 0:27:06.760
<v Speaker 1>right now, the technology doesn't know what you're saying, right,

0:27:07.240 --> 0:27:10.919
<v Speaker 1>So yeah, well, I mean, and in a way, it

0:27:10.960 --> 0:27:17.520
<v Speaker 1>will probably never know what you're saying. Oh, we can

0:27:17.560 --> 0:27:20.159
<v Speaker 1>have a debate: will computers achieve consciousness? You know,

0:27:20.240 --> 0:27:24.040
<v Speaker 1>will the terminator learn to love? But whether or not

0:27:24.119 --> 0:27:28.359
<v Speaker 1>the terminator, terminator will understand the meaning of love. The

0:27:28.440 --> 0:27:31.560
<v Speaker 1>terminator will at least be able to make sense of

0:27:31.640 --> 0:27:34.760
<v Speaker 1>my grammar, even if it's spontaneous and kind of mangled.

0:27:34.840 --> 0:27:36.840
<v Speaker 1>So the terminator may not love, but it may be

0:27:36.880 --> 0:27:40.240
<v Speaker 1>able to mark up your paper. Yeah, it may be

0:27:40.359 --> 0:27:43.080
<v Speaker 1>able to help me figure out what restaurant I went

0:27:43.119 --> 0:27:45.200
<v Speaker 1>to when I was in town last year, just by

0:27:45.240 --> 0:27:47.919
<v Speaker 1>me describing some dish. If the terminator doesn't love you,

0:27:47.960 --> 0:27:50.240
<v Speaker 1>I don't see it taking the opportunity to actually help

0:27:50.280 --> 0:27:51.760
<v Speaker 1>you with that problem. And I'd like to put in

0:27:51.800 --> 0:27:53.200
<v Speaker 1>that I do not want the terminator to be my

0:27:53.280 --> 0:27:56.280
<v Speaker 1>English teacher ever. Thanks, I'm pretty sure I had the

0:27:56.359 --> 0:28:01.120
<v Speaker 1>terminator as my English teacher. You'll fail English. So so

0:28:01.880 --> 0:28:07.560
<v Speaker 1>before well, no, I want to want to introduce a

0:28:07.600 --> 0:28:13.359
<v Speaker 1>possible way of viewing the progress of our input through voice.

0:28:13.800 --> 0:28:18.679
<v Speaker 1>And that's a way of looking at the computer helper

0:28:18.840 --> 0:28:25.040
<v Speaker 1>as something that's that's got an obedient orientation versus a

0:28:25.080 --> 0:28:27.720
<v Speaker 1>sympathetic orientation. All right, and what do you mean by that?

0:28:27.760 --> 0:28:30.320
<v Speaker 1>And so I would say that right now, computers have

0:28:30.400 --> 0:28:35.879
<v Speaker 1>an obedient orientation, meaning they they solve directly problems that

0:28:35.920 --> 0:28:37.800
<v Speaker 1>you give them. Right, they do what you tell them

0:28:37.840 --> 0:28:40.840
<v Speaker 1>to do, and that's it. Yeah yeah. And when they're

0:28:40.960 --> 0:28:43.040
<v Speaker 1>not doing that, it's because you haven't told them how

0:28:43.040 --> 0:28:45.960
<v Speaker 1>to do it exactly right, Yeah yeah. And so you

0:28:46.120 --> 0:28:48.480
<v Speaker 1>enter a command and it follows the command exactly, It

0:28:48.520 --> 0:28:52.480
<v Speaker 1>performs the calculation, it searches for the search term, however

0:28:52.480 --> 0:28:55.520
<v Speaker 1>it goes like that. Now, what makes that person on

0:28:55.640 --> 0:28:59.560
<v Speaker 1>tech support different? That person has a sympathetic orientation as

0:28:59.560 --> 0:29:03.480
<v Speaker 1>opposed to an obedient orientation. What that person does is

0:29:03.800 --> 0:29:07.240
<v Speaker 1>listens to your whole problem, gets the gist of it,

0:29:07.240 --> 0:29:11.000
<v Speaker 1>figures out what's important, and then helps you solve it. Right,

0:29:11.080 --> 0:29:15.080
<v Speaker 1>They see the end. They see not just each of

0:29:15.120 --> 0:29:18.640
<v Speaker 1>the individual commands you're giving, but they understand what you're

0:29:18.720 --> 0:29:23.520
<v Speaker 1>trying to do overall. Right, and we're already making some

0:29:23.600 --> 0:29:28.960
<v Speaker 1>pretty big strides in natural language recognition. For instance, IBM's Watson,

0:29:29.040 --> 0:29:31.960
<v Speaker 1>which was famous for going on Jeopardy up against two

0:29:32.120 --> 0:29:35.480
<v Speaker 1>former Jeopardy champions and beating them, winning in a game

0:29:35.520 --> 0:29:39.080
<v Speaker 1>of Jeopardy. But what it had to do was it

0:29:39.160 --> 0:29:42.840
<v Speaker 1>essentially had a huge amount of information stored in its

0:29:42.880 --> 0:29:45.680
<v Speaker 1>in its data banks. Yeah, but it had no connection

0:29:45.960 --> 0:29:48.160
<v Speaker 1>to the Internet while it was playing the game, So

0:29:48.320 --> 0:29:50.640
<v Speaker 1>it was it had much of the Internet on it.

0:29:50.320 --> 0:29:52.600
<v Speaker 1>You know, it was self but it was self contained. Yeah,

0:29:52.600 --> 0:29:56.000
<v Speaker 1>all the YouTube comments were left off, but otherwise, yeah,

0:29:56.040 --> 0:30:00.680
<v Speaker 1>it was. Why didn't it need those? Worry about what

0:30:00.760 --> 0:30:03.239
<v Speaker 1>happened when when it learned Urban Dictionary? Right, Yeah, they

0:30:03.240 --> 0:30:05.360
<v Speaker 1>taught it Urban Dictionary and then they basically had to

0:30:05.480 --> 0:30:08.760
<v Speaker 1>nuke Urban Dictionary from orbit from from its data banks

0:30:08.800 --> 0:30:14.800
<v Speaker 1>because it started off. It's true. That's completely true. Okay,

0:30:14.880 --> 0:30:18.400
<v Speaker 1>so I understand why it doesn't need YouTube comments. They'd turn

0:30:18.440 --> 0:30:23.800
<v Speaker 1>Watson into a vicious sociopath. Yeah, yeah, which I will

0:30:23.880 --> 0:30:27.120
<v Speaker 1>kill you. It was essentially becoming the the Sean Connery

0:30:27.200 --> 0:30:30.280
<v Speaker 1>from the Saturday Night Live skits. So anyway I had

0:30:30.480 --> 0:30:32.480
<v Speaker 1>it was it was closed off, so it didn't have

0:30:32.520 --> 0:30:36.040
<v Speaker 1>an outside. It didn't have an outlet to uh to

0:30:36.040 --> 0:30:38.120
<v Speaker 1>to go out and do a search on the internet

0:30:38.240 --> 0:30:41.760
<v Speaker 1>for everything. So when a Jeopardy clue came up, it

0:30:41.840 --> 0:30:45.000
<v Speaker 1>had to analyze the clue, go through its database and

0:30:45.040 --> 0:30:49.160
<v Speaker 1>then determine which bits of information were most likely to

0:30:49.200 --> 0:30:52.280
<v Speaker 1>be the relevant ones to answer or to form the

0:30:52.360 --> 0:30:55.280
<v Speaker 1>question in the case of Jeopardy for that clue, and

0:30:55.400 --> 0:30:57.320
<v Speaker 1>uh and the way it did this was that it

0:30:57.360 --> 0:31:02.160
<v Speaker 1>would assign probabilities to answers based upon parsing out the clue.
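
NOTE
A minimal sketch of the idea described here: score candidate answers and respond only when the best one clears a confidence bar.
The candidate names, the scores, and the 0.7 threshold are all hypothetical placeholders, not Watson's actual data or settings.
```python
from typing import Optional
# Placeholder confidence bar; not Watson's real setting.
CONFIDENCE_THRESHOLD = 0.7
def choose_answer(candidates: dict[str, float]) -> Optional[str]:
    """Return the most probable candidate, or None to stay silent."""
    if not candidates:
        return None
    # Pick the candidate with the highest assigned probability.
    best = max(candidates, key=candidates.get)
    # Answer only when that probability is above the threshold.
    return best if candidates[best] > CONFIDENCE_THRESHOLD else None
# Made-up scores for illustration.
print(choose_answer({"Ninth Symphony": 0.92, "Fifth Symphony": 0.05}))  # Ninth Symphony
print(choose_answer({"Ninth Symphony": 0.40, "Fifth Symphony": 0.35}))  # None
```
Returning None models staying quiet when no candidate is confident enough.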

0:31:02.640 --> 0:31:05.120
<v Speaker 1>And the thing about Jeopardy is that it's not just

0:31:06.760 --> 0:31:11.120
<v Speaker 1>really straightforward answers. You know, things like you know, this

0:31:11.440 --> 0:31:14.800
<v Speaker 1>is the is Beethoven's symphony that contains Ode to Joy?

0:31:15.480 --> 0:31:18.000
<v Speaker 1>What is the ninth Symphony? You know it's not that's

0:31:18.120 --> 0:31:21.880
<v Speaker 1>there's word play in there. Right, exactly, there are puns, sure,

0:31:22.040 --> 0:31:24.880
<v Speaker 1>and yeah, yes, So they had to create programs that

0:31:24.920 --> 0:31:28.600
<v Speaker 1>could that could parse that language and determine what is

0:31:28.640 --> 0:31:32.160
<v Speaker 1>the underlying meaning of this phrase, not just what do

0:31:32.280 --> 0:31:34.120
<v Speaker 1>these words? What are you know? Not just using those

0:31:34.120 --> 0:31:36.800
<v Speaker 1>words as search terms, because if it did that, it

0:31:36.880 --> 0:31:39.480
<v Speaker 1>never would have won. It had to figure out the relevance.

0:31:39.520 --> 0:31:40.960
<v Speaker 1>And so what it would do is it would pull

0:31:41.040 --> 0:31:44.040
<v Speaker 1>up all these different answers and assign them probabilities

0:31:44.080 --> 0:31:48.160
<v Speaker 1>for being the correct one. And if the probability was

0:31:48.240 --> 0:31:50.480
<v Speaker 1>higher than a threshold and I can't remember what the

0:31:50.520 --> 0:31:52.719
<v Speaker 1>threshold was, like seventy or eighty percent or whatever,

0:31:53.000 --> 0:31:55.480
<v Speaker 1>but if it was higher than that threshold, then then

0:31:55.520 --> 0:31:59.520
<v Speaker 1>and only then would Watson guess. Yeah. Otherwise

0:31:59.560 --> 0:32:01.800
<v Speaker 1>Watson would be quiet and allow one of the other

0:32:01.840 --> 0:32:04.840
<v Speaker 1>two people to answer, which is really interesting to me

0:32:04.920 --> 0:32:08.360
<v Speaker 1>because that's a it's a step towards that natural language recognition,

0:32:08.440 --> 0:32:11.400
<v Speaker 1>the idea that it's not just looking at the words

0:32:11.480 --> 0:32:15.840
<v Speaker 1>as search terms, but as these are things units of meaning,

0:32:16.440 --> 0:32:19.600
<v Speaker 1>they have meaning, and therefore you need to find the

0:32:19.720 --> 0:32:23.120
<v Speaker 1>data that corresponds with that meaning. And that is incredible. Well,

0:32:23.120 --> 0:32:26.320
<v Speaker 1>it was searching for if I'm correct, it wasn't it.

0:32:26.320 --> 0:32:29.520
<v Speaker 1>It had something to do with like keywords would be

0:32:29.560 --> 0:32:33.160
<v Speaker 1>searched based on when they were in proximity to other

0:32:33.240 --> 0:32:36.880
<v Speaker 1>important terms, right as far as I can understand it, yes,

0:32:37.400 --> 0:32:40.080
<v Speaker 1>but I mean it gets really pretty complex. And then

0:32:41.640 --> 0:32:44.120
<v Speaker 1>beyond that, you know, we're starting to see Watson being

0:32:44.280 --> 0:32:48.600
<v Speaker 1>used in medical facilities. Yeah. You know, they're using it to

0:32:48.600 --> 0:32:51.360
<v Speaker 1>describe what's wrong, you know, kind of a diagnostic although

0:32:51.400 --> 0:32:54.160
<v Speaker 1>a lot of doctors will tell you that while it's

0:32:54.320 --> 0:32:58.200
<v Speaker 1>a useful tool, it's certainly not a replacement for a

0:32:58.240 --> 0:33:02.400
<v Speaker 1>doctor because so many cases can like two people with

0:33:02.480 --> 0:33:06.800
<v Speaker 1>the exact same condition can come in and present

0:33:06.880 --> 0:33:11.480
<v Speaker 1>different symptoms and even explain the same symptoms in very

0:33:11.480 --> 0:33:15.479
<v Speaker 1>different terms, and so it becomes increasingly difficult for a

0:33:15.520 --> 0:33:17.760
<v Speaker 1>machine to be able to interpret that and come up

0:33:17.760 --> 0:33:20.000
<v Speaker 1>with the right response, as opposed to a doctor who

0:33:20.040 --> 0:33:23.880
<v Speaker 1>has that experience and has the ability to be much

0:33:23.880 --> 0:33:27.840
<v Speaker 1>more dynamic and even proactive in asking the right questions

0:33:27.920 --> 0:33:30.719
<v Speaker 1>to get the right information. Well. Also, I mean, a doctor,

0:33:30.800 --> 0:33:34.480
<v Speaker 1>much like the tech support person is, though with much

0:33:34.520 --> 0:33:38.520
<v Speaker 1>higher stakes obviously, is able to identify what's important. I

0:33:38.520 --> 0:33:40.360
<v Speaker 1>mean a lot of most of the time, when you

0:33:40.440 --> 0:33:44.040
<v Speaker 1>come describing a problem, you're giving too much information. All

0:33:44.080 --> 0:33:47.640
<v Speaker 1>the time, You're giving all this information, and a huge

0:33:47.720 --> 0:33:51.440
<v Speaker 1>amount of it is probably not actually relevant to what's

0:33:51.560 --> 0:33:55.360
<v Speaker 1>really the problem. And that's when the human who's experienced

0:33:55.400 --> 0:33:58.920
<v Speaker 1>this before knows what to zero in on. Computers have

0:33:58.960 --> 0:34:01.120
<v Speaker 1>more trouble with that, right, Like, they have a hard

0:34:01.160 --> 0:34:04.240
<v Speaker 1>time figuring out what's important when you've given it a

0:34:04.280 --> 0:34:08.200
<v Speaker 1>list of sure, yeah, yeah, no, it without without giving

0:34:08.280 --> 0:34:12.440
<v Speaker 1>it some form of a way of recognizing context and

0:34:12.719 --> 0:34:18.799
<v Speaker 1>way of recognizing the importance of particular words and particular phrases. Yeah,

0:34:18.880 --> 0:34:21.480
<v Speaker 1>I mean, it's just how does a computer determine that

0:34:22.080 --> 0:34:24.560
<v Speaker 1>the third word in a sentence is more or less

0:34:24.560 --> 0:34:27.320
<v Speaker 1>important than the fifth word, right, It's all statistical probability

0:34:27.320 --> 0:34:29.239
<v Speaker 1>and at a certain point you're going to plateau on that

0:34:29.280 --> 0:34:31.200
<v Speaker 1>because the more the more input that you give to

0:34:31.239 --> 0:34:33.680
<v Speaker 1>these kinds of programs, you know, they'll analyze it and

0:34:33.719 --> 0:34:36.520
<v Speaker 1>analyze it, and it gets really accurate, and then kind

0:34:36.600 --> 0:34:39.160
<v Speaker 1>of stops getting more accurate. Yeah. In fact, that's been

0:34:39.160 --> 0:34:42.040
<v Speaker 1>a real issue with voice recognition in general, and a

0:34:42.160 --> 0:34:46.239
<v Speaker 1>very interesting thing that I think. Maybe it's interesting to

0:34:46.320 --> 0:34:50.440
<v Speaker 1>me because Kurzweil worked on voice recognition, and I know

0:34:50.640 --> 0:34:54.719
<v Speaker 1>the man must be aware that the technology increased at

0:34:54.719 --> 0:34:57.840
<v Speaker 1>a pretty rapid pace but then began to plateau off.

0:34:58.120 --> 0:35:00.120
<v Speaker 1>You know, really it even began to plateau off in

0:35:00.160 --> 0:35:03.480
<v Speaker 1>the eighties. We made improvements and we learned how to

0:35:03.560 --> 0:35:06.320
<v Speaker 1>use the technology we had created better in better ways.

0:35:06.719 --> 0:35:11.200
<v Speaker 1>But it's not like the advances we made are exponentially

0:35:11.239 --> 0:35:14.239
<v Speaker 1>better than the previous generations. So in a way, the

0:35:14.360 --> 0:35:16.799
<v Speaker 1>curve is starting to plateau off and level off. We're

0:35:16.840 --> 0:35:19.879
<v Speaker 1>still we're still making advancements, but not at an accelerated rate,

0:35:19.920 --> 0:35:24.240
<v Speaker 1>whereas with Moore's law, every two years, essentially we're seeing

0:35:24.360 --> 0:35:29.160
<v Speaker 1>computers get twice as powerful, and so I think that

0:35:29.160 --> 0:35:35.560
<v Speaker 1>that makes some futuristic predictions less likely because we recognize

0:35:35.640 --> 0:35:41.080
<v Speaker 1>not all elements of technology accelerate at this same rate. Yeah,

0:35:41.239 --> 0:35:45.720
<v Speaker 1>I think for the future of speech recognition, natural language

0:35:45.760 --> 0:35:51.600
<v Speaker 1>processing is key, and natural language processing, you know, Watson

0:35:51.680 --> 0:35:55.440
<v Speaker 1>seems amazing, but compared to what's probably going to come

0:35:55.480 --> 0:35:58.440
<v Speaker 1>in the future, Watson is actually very primitive, right And

0:35:58.960 --> 0:36:01.120
<v Speaker 1>also we've got to keep in mind that right now, even

0:36:01.160 --> 0:36:04.600
<v Speaker 1>though Watson does did do this amazing thing by beating

0:36:04.640 --> 0:36:09.720
<v Speaker 1>humans at their own game, it had thousands of processors

0:36:10.360 --> 0:36:13.080
<v Speaker 1>and tens of thousands of cores. So you're talking about

0:36:13.600 --> 0:36:20.600
<v Speaker 1>an incredibly powerful, energy hungry machine that was able to

0:36:20.640 --> 0:36:24.600
<v Speaker 1>do something that a human can do. Right, well, that

0:36:24.640 --> 0:36:28.279
<v Speaker 1>a tiny little meat thing. It was. It was able

0:36:28.320 --> 0:36:31.600
<v Speaker 1>to do what the tech support operator can do. And ultimately,

0:36:31.680 --> 0:36:35.160
<v Speaker 1>I think that's the endgame here. What we're talking about

0:36:35.560 --> 0:36:39.200
<v Speaker 1>in the far future. What we dream about is when

0:36:39.239 --> 0:36:44.400
<v Speaker 1>your computer is as sympathetic in its orientation as a

0:36:44.520 --> 0:36:48.960
<v Speaker 1>human helper, is that you can describe in spontaneous human

0:36:49.080 --> 0:36:51.840
<v Speaker 1>language what you're trying to do and it can actually

0:36:51.880 --> 0:36:54.879
<v Speaker 1>help you with that as opposed to just operating off

0:36:54.880 --> 0:36:57.959
<v Speaker 1>of set commands. And it's we're getting there. I mean,

0:36:58.040 --> 0:37:00.880
<v Speaker 1>if you talk to people who have used Apple's Siri

0:37:01.239 --> 0:37:03.839
<v Speaker 1>or the Google Voice search stuff. You know, you can

0:37:03.960 --> 0:37:07.879
<v Speaker 1>use some pretty you know, colloquial sayings to get what

0:37:07.960 --> 0:37:10.960
<v Speaker 1>you want. And it's it's getting better and better at

0:37:11.000 --> 0:37:13.880
<v Speaker 1>interpreting those and giving you the response that would be appropriate.

0:37:14.400 --> 0:37:17.160
<v Speaker 1>And and granted this is all again still on a

0:37:17.160 --> 0:37:21.240
<v Speaker 1>surface level, but it's it's seemingly deep, you know, to

0:37:21.400 --> 0:37:24.640
<v Speaker 1>the user experience. It seems like the machine understands what

0:37:24.680 --> 0:37:28.480
<v Speaker 1>you're saying, even though that's not really yes. And and

0:37:28.560 --> 0:37:30.520
<v Speaker 1>you know, maybe in the future we have the semantic

0:37:30.560 --> 0:37:33.880
<v Speaker 1>web that responds exactly to what we want even if

0:37:33.920 --> 0:37:35.799
<v Speaker 1>we were to you know, I know that it's really

0:37:35.800 --> 0:37:38.480
<v Speaker 1>hard to get tone across and text messages, but maybe

0:37:38.480 --> 0:37:41.799
<v Speaker 1>computers will be better than people are. By the way,

0:37:41.840 --> 0:37:45.359
<v Speaker 1>I'm always I'm always j slash k if you if

0:37:45.360 --> 0:37:48.719
<v Speaker 1>you're wondering, all right, well, you know we should wrap

0:37:48.760 --> 0:37:50.600
<v Speaker 1>this up. We've gone on quite a bit about voice

0:37:50.840 --> 0:37:54.000
<v Speaker 1>and speech recognition. It's a fascinating topic and it is

0:37:54.080 --> 0:37:57.239
<v Speaker 1>one that I am eager to see more advances in

0:37:57.280 --> 0:38:00.560
<v Speaker 1>the field. We've seen stuff not just in smartphones and tablets,

0:38:00.560 --> 0:38:04.879
<v Speaker 1>but also game consoles, things like Microsoft's Kinect and other

0:38:04.960 --> 0:38:09.080
<v Speaker 1>devices as well incorporate voice recognition, and I expect we're

0:38:09.080 --> 0:38:11.399
<v Speaker 1>going to see even more of that. I can't wait

0:38:11.400 --> 0:38:15.279
<v Speaker 1>for my thermostat to have it. It's too darn hot in

0:38:15.360 --> 0:38:17.760
<v Speaker 1>here and then it just immediately just cranks down five degrees.

0:38:18.400 --> 0:38:21.600
<v Speaker 1>That'd be fantastic because I don't have one that's connected

0:38:21.640 --> 0:38:23.760
<v Speaker 1>to the internet, so I can't just use my smartphone.

0:38:23.800 --> 0:38:26.359
<v Speaker 1>I actually have to. I can't believe it, get up

0:38:26.480 --> 0:38:30.600
<v Speaker 1>and walk to it. I know it's a My life

0:38:30.800 --> 0:38:34.640
<v Speaker 1>is a drama waiting to be filmed. So guys, that's

0:38:34.680 --> 0:38:37.359
<v Speaker 1>our episode about voice recognition. We hope you enjoyed it.

0:38:37.880 --> 0:38:40.560
<v Speaker 1>We highly recommend that if you have any topics that

0:38:40.600 --> 0:38:43.520
<v Speaker 1>you think Forward Thinking should cover, stuff about the future

0:38:43.560 --> 0:38:45.759
<v Speaker 1>that really has you excited, that you get in touch

0:38:45.760 --> 0:38:49.520
<v Speaker 1>with us. We have an email address now, it's

0:38:49.640 --> 0:38:53.239
<v Speaker 1>fwthinking at discovery dot com. You can also go to

0:38:53.360 --> 0:38:57.080
<v Speaker 1>fwthinking dot com for all of our content. We've got videos,

0:38:57.200 --> 0:39:00.279
<v Speaker 1>blog posts, podcasts, we have links to all of our

0:39:00.320 --> 0:39:03.680
<v Speaker 1>social networking stuff. Go there, connect with us, let us

0:39:03.719 --> 0:39:05.840
<v Speaker 1>know what you think. We look forward to hearing from you,

0:39:05.920 --> 0:39:12.560
<v Speaker 1>and we will talk to you again really soon. For

0:39:12.680 --> 0:39:15.440
<v Speaker 1>more on this topic and the future of technology, visit

0:39:15.520 --> 0:39:29.960
<v Speaker 1>fwthinking dot com. Brought to you by Toyota. Let's

0:39:30.000 --> 0:39:30.680
<v Speaker 1>go places.