WEBVTT - I'm sorry, what did you say?

0:00:04.120 --> 0:00:07.160
<v Speaker 1>Get in touch with technology with TechStuff from how

0:00:07.200 --> 0:00:13.680
<v Speaker 1>stuff works dot com. Hey there, and welcome to TechStuff.

0:00:13.720 --> 0:00:16.279
<v Speaker 1>I'm your host, Jonathan Strickland. I'm an executive producer with

0:00:16.280 --> 0:00:19.960
<v Speaker 1>HowStuffWorks and I love all things tech, and listener

0:00:20.079 --> 0:00:22.320
<v Speaker 1>Nate wrote in and asked that I do an episode

0:00:22.560 --> 0:00:27.720
<v Speaker 1>about personal digital assistants or virtual assistants or voice helpers.

0:00:28.200 --> 0:00:30.160
<v Speaker 1>This is hard because we don't really have a great

0:00:30.560 --> 0:00:33.839
<v Speaker 1>term for these things, but I'm talking about applications like

0:00:34.400 --> 0:00:36.559
<v Speaker 1>and I apologize ahead of time if I activate your

0:00:36.560 --> 0:00:42.840
<v Speaker 1>technology, Siri, Alexa, and Google Assistant. These sorts of voice

0:00:42.840 --> 0:00:45.920
<v Speaker 1>helpers that can respond to voice commands as well as

0:00:46.040 --> 0:00:48.000
<v Speaker 1>other means of input in a way that makes them

0:00:48.000 --> 0:00:51.479
<v Speaker 1>seem almost intelligent. Now, as it turns out, that's actually

0:00:51.479 --> 0:00:55.240
<v Speaker 1>a pretty complicated history because it requires a discussion about

0:00:55.320 --> 0:00:59.840
<v Speaker 1>a lot of different connected ideas that were all

0:01:00.040 --> 0:01:03.080
<v Speaker 1>independent and then ultimately converged. We're talking about stuff like

0:01:03.320 --> 0:01:08.560
<v Speaker 1>speech recognition, natural language processing, and technology that was meant

0:01:08.600 --> 0:01:12.200
<v Speaker 1>to improve accessibility and a whole lot more. So it

0:01:12.240 --> 0:01:15.280
<v Speaker 1>makes talking about these services somewhat challenging because it's not

0:01:15.360 --> 0:01:18.360
<v Speaker 1>like there was just one pathway that led to their development.

0:01:18.640 --> 0:01:22.960
<v Speaker 1>They exist largely because of these independent but converging areas

0:01:22.959 --> 0:01:25.920
<v Speaker 1>of innovation. Much of the work that made these services

0:01:25.959 --> 0:01:29.840
<v Speaker 1>possible took place in events that were concurrent with each other,

0:01:30.000 --> 0:01:36.920
<v Speaker 1>with different organizations all working towards similar but unconnected, disconnected goals.

0:01:36.959 --> 0:01:40.040
<v Speaker 1>So going by a strict timeline approach would be really hard,

0:01:40.120 --> 0:01:42.880
<v Speaker 1>if not impossible, just because you have to jump around

0:01:42.920 --> 0:01:46.200
<v Speaker 1>a lot to talk about different advances. So today I'm

0:01:46.240 --> 0:01:50.480
<v Speaker 1>going to focus solely on speech recognition. This in itself

0:01:50.640 --> 0:01:53.640
<v Speaker 1>is a huge topic, so it's more than enough for

0:01:53.680 --> 0:01:56.280
<v Speaker 1>a single episode of TechStuff. In the next episode,

0:01:56.320 --> 0:02:00.000
<v Speaker 1>I'm going to dive more into natural language processing, which

0:02:00.520 --> 0:02:03.800
<v Speaker 1>has some crossover with speech recognition, but it is its

0:02:03.840 --> 0:02:06.200
<v Speaker 1>own thing. And then after that we'll take a look

0:02:06.200 --> 0:02:09.640
<v Speaker 1>at how voice assistants like Siri and Alexa popped up

0:02:09.680 --> 0:02:14.240
<v Speaker 1>over time. First, the idea of creating a machine that

0:02:14.280 --> 0:02:18.920
<v Speaker 1>could interpret speech is older than computers. If you listen

0:02:18.960 --> 0:02:21.440
<v Speaker 1>to my episodes about the history of the turntable, you'll

0:02:21.480 --> 0:02:26.800
<v Speaker 1>remember the phonautograph, designed by Édouard-Léon Scott de Martinville

0:02:27.120 --> 0:02:31.680
<v Speaker 1>in eighteen fifty seven. The gadget had a small brush

0:02:31.760 --> 0:02:35.360
<v Speaker 1>that was attached to a parchment diaphragm and the bristles

0:02:35.400 --> 0:02:38.520
<v Speaker 1>on the brush rested against a sheet of paper that

0:02:38.600 --> 0:02:41.520
<v Speaker 1>itself was wrapped around a cylinder. On top of the

0:02:41.520 --> 0:02:44.520
<v Speaker 1>sheet of paper was a layer of soot. So to

0:02:44.560 --> 0:02:47.400
<v Speaker 1>operate the device, you would turn the cylinder, the brush

0:02:47.400 --> 0:02:50.679
<v Speaker 1>would drag across the soot on the paper, and you

0:02:50.720 --> 0:02:54.120
<v Speaker 1>would shout at the diaphragm. The vibrations of sound would

0:02:54.160 --> 0:02:56.959
<v Speaker 1>cause the parchment diaphragm to vibrate. That would make the

0:02:57.000 --> 0:02:59.720
<v Speaker 1>brush vibrate and move against the paper, and that would

0:02:59.720 --> 0:03:03.480
<v Speaker 1>create a pattern corresponding to the vibrations that were made

0:03:03.560 --> 0:03:06.799
<v Speaker 1>by the parchment diaphragm. The phonautograph was supposed to aid

0:03:06.840 --> 0:03:10.240
<v Speaker 1>in the study of language and sound. The machine itself

0:03:10.320 --> 0:03:14.639
<v Speaker 1>was not intended to interpret sound, but rather facilitate interpretation.

0:03:14.680 --> 0:03:19.240
<v Speaker 1>A human would take a look at these tracings essentially

0:03:19.520 --> 0:03:22.360
<v Speaker 1>and be able to analyze sound, or at least that

0:03:22.400 --> 0:03:24.760
<v Speaker 1>was the intent. It didn't quite work out that way,

0:03:24.760 --> 0:03:28.480
<v Speaker 1>but that was the concept behind it. Now, let's set

0:03:28.480 --> 0:03:32.160
<v Speaker 1>the way back machine to the nineteen fifties. In nineteen

0:03:32.200 --> 0:03:37.280
<v Speaker 1>fifty two, Bell Labs created the Audrey system, which was

0:03:37.320 --> 0:03:41.000
<v Speaker 1>not a mean green mother from outer space, but rather

0:03:41.480 --> 0:03:45.960
<v Speaker 1>the first documented speech recognizer system. It was an analog system,

0:03:46.000 --> 0:03:50.200
<v Speaker 1>not a digital one. It was its own dedicated massive circuit,

0:03:50.880 --> 0:03:53.400
<v Speaker 1>and it even had vacuum tubes in this thing, because

0:03:53.440 --> 0:03:56.960
<v Speaker 1>this is before the transistor. It could recognize strings of

0:03:57.000 --> 0:04:00.720
<v Speaker 1>digits spoken by its creator with about ninety percent accuracy.

0:04:00.800 --> 0:04:03.360
<v Speaker 1>If anyone else tried it, the accuracy dropped a bit.

0:04:03.440 --> 0:04:07.200
<v Speaker 1>This already shows that speech recognition is tough because not

0:04:07.320 --> 0:04:10.480
<v Speaker 1>everyone says things exactly the same way. I know that's

0:04:10.480 --> 0:04:12.680
<v Speaker 1>not a news flash, but it is important for the

0:04:12.720 --> 0:04:16.279
<v Speaker 1>concept of speech recognition. Uh. You also had to pause

0:04:16.800 --> 0:04:21.160
<v Speaker 1>between strings of numbers. You couldn't just rattle off conversationally.

0:04:21.240 --> 0:04:23.440
<v Speaker 1>You had to put pauses in there. But it was

0:04:23.480 --> 0:04:26.680
<v Speaker 1>also an enormous piece of machinery. It took up a

0:04:26.800 --> 0:04:29.640
<v Speaker 1>six-foot-high relay rack and it consumed a lot

0:04:29.640 --> 0:04:34.280
<v Speaker 1>of electricity. Then Big Blue, also known as IBM, had

0:04:34.360 --> 0:04:37.880
<v Speaker 1>scientists and engineers working on the possibility of designing technologies

0:04:37.880 --> 0:04:40.560
<v Speaker 1>that could recognize speech. They were kind of working around

0:04:40.600 --> 0:04:44.520
<v Speaker 1>the same time that Bell Labs was. Computer scientist Nathaniel Rochester,

0:04:44.600 --> 0:04:47.200
<v Speaker 1>who designed an IBM computer called the seven oh one

0:04:47.480 --> 0:04:50.120
<v Speaker 1>and also wrote the first assembler, headed up a group

0:04:50.160 --> 0:04:53.359
<v Speaker 1>of engineers at IBM who were researching pattern recognition and

0:04:53.480 --> 0:04:57.440
<v Speaker 1>information theory. That work, which was early research into fundamental

0:04:57.520 --> 0:05:01.039
<v Speaker 1>building blocks for artificial intelligence, would also become important for

0:05:01.080 --> 0:05:05.240
<v Speaker 1>speech recognition. In the late nineteen fifties, William C. Dersch,

0:05:05.360 --> 0:05:09.640
<v Speaker 1>another IBM computer scientist, developed a computer system as part

0:05:09.720 --> 0:05:14.720
<v Speaker 1>of IBM's Advanced Systems Development Division laboratory, and it incorporated

0:05:14.760 --> 0:05:18.679
<v Speaker 1>basic elements of speech recognition. He unveiled the device, called

0:05:18.720 --> 0:05:22.359
<v Speaker 1>the IBM Shoebox in nineteen sixty two at the World's Fair.

0:05:22.880 --> 0:05:26.480
<v Speaker 1>Using a microphone, you could speak basic digits from zero

0:05:26.520 --> 0:05:30.680
<v Speaker 1>to nine, and also six additional control words like plus

0:05:30.800 --> 0:05:34.279
<v Speaker 1>or minus, and the shoebox would recognize the words and

0:05:34.480 --> 0:05:39.960
<v Speaker 1>perform calculations. So essentially this was a basic voice-controlled calculator.

0:05:40.279 --> 0:05:43.920
<v Speaker 1>While the application was limited, this showed off a remarkable achievement.

0:05:44.120 --> 0:05:46.719
<v Speaker 1>Finding a way to program a machine to accept speech

0:05:46.880 --> 0:05:50.120
<v Speaker 1>as a command is a nontrivial problem. Throughout the

0:05:50.160 --> 0:05:53.880
<v Speaker 1>nineteen sixties, computer scientists took a brute force sort of

0:05:53.920 --> 0:05:57.520
<v Speaker 1>approach to solving speech recognition, which could work in very

0:05:57.640 --> 0:06:01.280
<v Speaker 1>narrow applications such as the calculator approach, but was by

0:06:01.320 --> 0:06:04.839
<v Speaker 1>its nature difficult to scale up. Even in the early

0:06:04.920 --> 0:06:09.200
<v Speaker 1>nineteen seventies, the Speech Understanding Research Project from ARPA,

0:06:09.279 --> 0:06:12.000
<v Speaker 1>the same organization that would help bring the Internet

0:06:12.040 --> 0:06:17.040
<v Speaker 1>into being, produced a brute force template called Harpy. While

0:06:17.120 --> 0:06:19.920
<v Speaker 1>it was reliant upon brute force, Harpy, which came out

0:06:19.960 --> 0:06:24.200
<v Speaker 1>of Carnegie Mellon Research, could recognize about one thousand words.

0:06:24.600 --> 0:06:28.320
<v Speaker 1>Harpy also made use of a process called beam search.

0:06:28.880 --> 0:06:31.560
<v Speaker 1>This is a search strategy in which a search algorithm

0:06:31.640 --> 0:06:35.120
<v Speaker 1>can consider multiple possible hits at a single time, rather

0:06:35.160 --> 0:06:38.720
<v Speaker 1>than looking through a large data set for a specific

0:06:38.839 --> 0:06:42.560
<v Speaker 1>perfect hit. Then the algorithm would determine the probability of

0:06:42.600 --> 0:06:44.880
<v Speaker 1>each of the hits as being the right word. The

0:06:44.960 --> 0:06:47.599
<v Speaker 1>number of potential hits is determined by a value called

0:06:47.640 --> 0:06:51.320
<v Speaker 1>the beam width, a setting the speech recognition application designer

0:06:51.360 --> 0:06:54.000
<v Speaker 1>can set. Beam search is a much more efficient way

0:06:54.000 --> 0:06:56.920
<v Speaker 1>to suss out speech, and it's frequently used today, not

0:06:56.960 --> 0:07:00.000
<v Speaker 1>just in speech recognition but also in natural language

0:07:00.000 --> 0:07:03.160
<v Speaker 1>processing and other sequential models, but it gets super technical, so

0:07:03.200 --> 0:07:05.800
<v Speaker 1>we're gonna leave it at that kind of high level approach.
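
NOTE
A minimal sketch of the beam search idea described above, not Harpy's actual code. The function name beam_search, the per-step word scores, and the beam width of 2 are all illustrative assumptions. At each step it extends every surviving guess, scores the results, and keeps only the best few.
```python
import math
def beam_search(step_scores, beam_width):
    """Keep only the `beam_width` most probable partial sequences at each step.
    step_scores is a list of dicts mapping a candidate word to a
    hypothetical probability for that position in the utterance."""
    beams = [([], 0.0)]  # each beam is (words so far, log-probability)
    for scores in step_scores:
        candidates = [
            (words + [word], logp + math.log(p))
            for words, logp in beams
            for word, p in scores.items()
        ]
        # Rank every extension, then prune down to the beam width.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams
# Hypothetical scores for a three-word utterance.
steps = [
    {"turn": 0.6, "burn": 0.3, "tern": 0.1},
    {"the": 0.7, "a": 0.2, "thee": 0.1},
    {"volume": 0.8, "column": 0.2},
]
best_words, best_logp = beam_search(steps, beam_width=2)[0]
print(" ".join(best_words))
```
In a real recognizer the scores would come from acoustic and language models rather than a fixed table. A wider beam considers more alternatives at a higher computing cost.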

0:07:06.600 --> 0:07:10.520
<v Speaker 1>But these systems still mapped all words to a template,

0:07:10.720 --> 0:07:14.480
<v Speaker 1>one template per word. They didn't break words up into sounds,

0:07:14.760 --> 0:07:18.080
<v Speaker 1>but looked for a match against a database of established

0:07:18.160 --> 0:07:21.240
<v Speaker 1>vocabulary words, which meant that if you did not pronounce

0:07:21.280 --> 0:07:23.640
<v Speaker 1>the word the same way as it was represented in

0:07:23.680 --> 0:07:26.760
<v Speaker 1>the database, you might not get a hit. You would

0:07:26.800 --> 0:07:29.520
<v Speaker 1>have to get it close enough to that template for

0:07:29.640 --> 0:07:31.400
<v Speaker 1>you to be able to get a hit. This is

0:07:31.440 --> 0:07:35.120
<v Speaker 1>a big problem. People speak with accents or dialects, or

0:07:35.160 --> 0:07:39.200
<v Speaker 1>they may have difficulty replicating certain sounds. The brute force

0:07:39.280 --> 0:07:42.120
<v Speaker 1>approach often meant you'd have to say the same

0:07:42.120 --> 0:07:45.560
<v Speaker 1>word a few times with clear enunciation and long pauses

0:07:45.600 --> 0:07:48.680
<v Speaker 1>to get a hit. And again, it just didn't scale

0:07:48.800 --> 0:07:52.040
<v Speaker 1>very well. It wasn't until the late nineteen seventies that

0:07:52.080 --> 0:07:55.360
<v Speaker 1>computer scientists were able to find a different approach that

0:07:55.400 --> 0:07:59.280
<v Speaker 1>would power more modern speech recognition systems. And let's go

0:07:59.360 --> 0:08:01.720
<v Speaker 1>through some of the steps that are necessary, from the

0:08:01.760 --> 0:08:07.040
<v Speaker 1>basic physical attributes of speech to the processing of the information. First, speech,

0:08:07.280 --> 0:08:11.800
<v Speaker 1>like all sound, ultimately is a physical phenomenon. It is vibration.

0:08:12.160 --> 0:08:15.520
<v Speaker 1>We produce these vibrations with vocal cords and our lips, teeth,

0:08:15.640 --> 0:08:18.680
<v Speaker 1>and tongue according to the rules of whatever language we

0:08:18.720 --> 0:08:22.160
<v Speaker 1>are speaking. These vibrations travel through a medium such as

0:08:22.160 --> 0:08:25.000
<v Speaker 1>the air, and then they get picked up by something else,

0:08:25.080 --> 0:08:28.360
<v Speaker 1>like someone else's ears or a microphone or whatever. But

0:08:28.440 --> 0:08:32.319
<v Speaker 1>at this stage we're talking about physical vibrations, an analog

0:08:32.920 --> 0:08:37.840
<v Speaker 1>form of input. Computers do not directly interpret physical vibrations.

0:08:37.840 --> 0:08:42.839
<v Speaker 1>Computers process digital information, and speech is an analog phenomenon.

0:08:42.960 --> 0:08:45.320
<v Speaker 1>So the first thing we need for a computer to

0:08:45.360 --> 0:08:49.000
<v Speaker 1>recognize speech is for some sort of analog to digital

0:08:49.040 --> 0:08:52.840
<v Speaker 1>converter that can accept the analog information and then translate it

0:08:52.920 --> 0:08:56.400
<v Speaker 1>into digital information. The ADC would typically sample

0:08:56.520 --> 0:09:00.400
<v Speaker 1>speech by taking precise measurements of the sound at frequent

0:09:00.480 --> 0:09:04.439
<v Speaker 1>intervals or samples such as thousands of times per second,

0:09:04.480 --> 0:09:07.319
<v Speaker 1>so you can almost think of it like snapshots,

0:09:07.320 --> 0:09:11.480
<v Speaker 1>like pictures. The ADC is measuring quantifiable elements

0:09:11.559 --> 0:09:14.920
<v Speaker 1>of the sound every time it takes a sample. That

0:09:15.000 --> 0:09:19.160
<v Speaker 1>might include stuff like amplitude and frequency, or volume and pitch.

0:09:19.240 --> 0:09:22.880
<v Speaker 1>If you're talking about how we perceive sound. There's usually

0:09:23.000 --> 0:09:25.679
<v Speaker 1>some sort of noise filter incorporated into this step as

0:09:25.720 --> 0:09:29.640
<v Speaker 1>well to help remove any unwanted sounds from the signal.
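
NOTE
A minimal sketch of the sampling step described above, under stated assumptions. The function name sample_and_quantize, the 440 Hz test tone, the 8000 samples per second rate, and the 8-bit depth are illustrative choices, not values from the episode. The sketch records only amplitude at each instant; frequency content is something later processing derives from the sequence of samples.
```python
import math
def sample_and_quantize(signal, duration_s, sample_rate_hz, bits):
    """Measure signal(t) at regular intervals and round each reading to an integer level.
    The signal function is assumed to return amplitudes between -1.0 and 1.0."""
    max_level = 2 ** (bits - 1) - 1  # 127 for 8-bit samples
    n_samples = round(duration_s * sample_rate_hz)
    return [round(signal(n / sample_rate_hz) * max_level) for n in range(n_samples)]
# Hypothetical input: a pure tone standing in for the microphone signal.
def tone(t):
    return math.sin(2 * math.pi * 440 * t)
samples = sample_and_quantize(tone, duration_s=0.01, sample_rate_hz=8000, bits=8)
print(len(samples), samples[:5])
```
Each number in the list is one snapshot. A higher sample rate or bit depth captures more detail and produces more data to process.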

0:09:30.080 --> 0:09:32.719
<v Speaker 1>The system has to be able to recognize which signals

0:09:32.960 --> 0:09:36.200
<v Speaker 1>represent a command and which ones are not important. This

0:09:36.280 --> 0:09:38.760
<v Speaker 1>is why I can do stuff like send vocal commands

0:09:38.840 --> 0:09:42.200
<v Speaker 1>to a voice assistant, even if there's another conversation going

0:09:42.240 --> 0:09:45.560
<v Speaker 1>on nearby, or if I have the radio or television on. Now,

0:09:45.559 --> 0:09:47.960
<v Speaker 1>I have a lot more to say about the technology

0:09:48.000 --> 0:09:50.880
<v Speaker 1>that makes speech recognition possible, but before I get into that,

0:09:50.960 --> 0:10:01.360
<v Speaker 1>let's take a quick break to thank our sponsor. So,

0:10:01.559 --> 0:10:05.840
<v Speaker 1>a speech recognition system typically has a database of sound

0:10:05.880 --> 0:10:10.679
<v Speaker 1>samples that will allow the recognition system to compare incoming

0:10:10.760 --> 0:10:15.480
<v Speaker 1>signals against that database. The speech recognition system might have

0:10:15.600 --> 0:10:19.760
<v Speaker 1>to put the incoming sound through a process called temporal alignment,

0:10:20.240 --> 0:10:22.240
<v Speaker 1>which is a fancy way of saying the system might

0:10:22.240 --> 0:10:25.480
<v Speaker 1>have to slow down or speed up the incoming sound.
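
NOTE
A minimal sketch of temporal alignment using dynamic time warping, a classic technique for comparing sequences spoken at different speeds. The episode does not name a specific algorithm, so treating alignment as dynamic time warping is an assumption here, as are the function name dtw_distance and the toy number sequences standing in for sound features.
```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of numbers."""
    inf = float("inf")
    # cost[i][j] holds the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Advance both sequences together, or stretch one of them in time.
            cost[i][j] = step + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    return cost[len(a)][len(b)]
# Hypothetical feature tracks: the same shape spoken at two speeds, plus an unrelated one.
template = [0, 1, 3, 1, 0]
slow = [0, 0, 1, 1, 3, 3, 1, 1, 0, 0]
different = [3, 3, 0, 0, 3, 3, 0, 0, 3, 3]
print(dtw_distance(template, slow), dtw_distance(template, different))
```
The slow version lines up with the template at no cost even though it is twice as long, while the unrelated sequence scores a clearly larger distance.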

0:10:25.960 --> 0:10:28.439
<v Speaker 1>You can think of this as like making a recording

0:10:28.600 --> 0:10:31.800
<v Speaker 1>and then almost immediately playing the recording back. Obviously, the

0:10:31.800 --> 0:10:34.920
<v Speaker 1>speech recognition system can't change the speed at which you're speaking,

0:10:35.600 --> 0:10:37.959
<v Speaker 1>though you might get a feature that prompts you to

0:10:38.880 --> 0:10:41.679
<v Speaker 1>slow down or speed up. The message may say,

0:10:41.679 --> 0:10:44.480
<v Speaker 1>could you say that again, but slower, that kind of thing.

0:10:44.520 --> 0:10:47.080
<v Speaker 1>If you happen to be someone from the Northeastern United States,

0:10:47.080 --> 0:10:50.120
<v Speaker 1>for example, you may frequently get these messages saying slow

0:10:50.160 --> 0:10:54.040
<v Speaker 1>the heck down. Temporal alignment allows the speech recognition system

0:10:54.040 --> 0:10:56.400
<v Speaker 1>to look for matches between the incoming sound and the

0:10:56.440 --> 0:10:59.839
<v Speaker 1>samples in the system's memory. The system must also divvy

0:11:00.000 --> 0:11:03.560
<v Speaker 1>up the sounds in the incoming signal into segments

0:11:03.600 --> 0:11:08.040
<v Speaker 1>that represent specific sounds in the native language, such as

0:11:08.080 --> 0:11:12.320
<v Speaker 1>the sound or the hard T sound. It looks for

0:11:12.440 --> 0:11:17.000
<v Speaker 1>matches in its memory that represent phonemes, and a phoneme

0:11:17.080 --> 0:11:20.360
<v Speaker 1>is a basic sound native to a specific language, to

0:11:20.440 --> 0:11:23.760
<v Speaker 1>a particular language, whichever one you're looking at. So, for example,

0:11:23.800 --> 0:11:28.280
<v Speaker 1>the English language has about forty phonemes. Linguists actually get

0:11:28.280 --> 0:11:31.200
<v Speaker 1>into some pretty vicious fights about exactly how many phonemes

0:11:31.280 --> 0:11:34.720
<v Speaker 1>the English language has, but it's around forty. Some people argue

0:11:34.760 --> 0:11:37.840
<v Speaker 1>that there are more phonemes, some say that

0:11:37.880 --> 0:11:41.199
<v Speaker 1>some of the supposed additional phonemes are in fact repeats

0:11:41.200 --> 0:11:45.280
<v Speaker 1>of existing ones. Other languages, though, will have a different number

0:11:45.280 --> 0:11:48.760
<v Speaker 1>of phonemes in them. Some may have far more than English,

0:11:48.800 --> 0:11:52.079
<v Speaker 1>some may have fewer than English. The system then has

0:11:52.120 --> 0:11:56.240
<v Speaker 1>to analyze the phonemes in sequence. So it's looking at

0:11:56.280 --> 0:11:59.480
<v Speaker 1>these little markers that represent different sounds, and this is

0:11:59.520 --> 0:12:01.760
<v Speaker 1>how the system can look for matches between a

0:12:01.760 --> 0:12:05.000
<v Speaker 1>series of phonemes and the words that it can recognize.

0:12:05.000 --> 0:12:08.679
<v Speaker 1>It can try and build words from these sounds. This

0:12:08.760 --> 0:12:12.040
<v Speaker 1>is way harder than I'm making it sound. Speech recognition

0:12:12.040 --> 0:12:16.280
<v Speaker 1>systems have complicated statistical models to help them determine what

0:12:16.440 --> 0:12:20.560
<v Speaker 1>a word might be. Even a simple speech recognition system

0:12:20.600 --> 0:12:25.280
<v Speaker 1>will have a complex statistical model to recognize individual words.

0:12:25.720 --> 0:12:30.120
<v Speaker 1>More sophisticated systems might also look at contextual information surrounding

0:12:30.160 --> 0:12:33.720
<v Speaker 1>the phonemes. In other words, a really sophisticated system isn't

0:12:33.760 --> 0:12:36.120
<v Speaker 1>just looking for a match in phonemes to suss out

0:12:36.160 --> 0:12:39.240
<v Speaker 1>what a single word is in a sentence. It's looking

0:12:39.240 --> 0:12:42.560
<v Speaker 1>at the phonemes that came before and after to determine

0:12:42.600 --> 0:12:45.559
<v Speaker 1>what those words were and to help increase the confidence

0:12:45.640 --> 0:12:48.600
<v Speaker 1>level overall. So let me give an example. Let's say

0:12:48.640 --> 0:12:51.040
<v Speaker 1>I've activated one of these voice assistants, and I've used

0:12:51.040 --> 0:12:53.720
<v Speaker 1>whatever voice command activates it. I'm not going to do

0:12:53.760 --> 0:12:55.480
<v Speaker 1>it here because some of you might be listening on

0:12:55.520 --> 0:12:58.800
<v Speaker 1>those devices. And then I say turn the volume up

0:12:58.920 --> 0:13:02.200
<v Speaker 1>thirty percent. The speech recognition system begins to parse what

0:13:02.280 --> 0:13:05.440
<v Speaker 1>I said by analyzing those sounds phoneme by phoneme,

0:13:05.640 --> 0:13:09.240
<v Speaker 1>identifying them, analyzing them, trying to group them together to

0:13:09.320 --> 0:13:11.839
<v Speaker 1>form words, and when it thinks it's found a word,

0:13:11.880 --> 0:13:15.760
<v Speaker 1>it assigns a certain probability to that, and when it

0:13:15.800 --> 0:13:18.160
<v Speaker 1>starts to analyze the phonemes that make up the

0:13:18.200 --> 0:13:21.160
<v Speaker 1>word volume, it's also looking at the words that came

0:13:21.200 --> 0:13:24.360
<v Speaker 1>before, turn the, and it's looking at the words that

0:13:24.400 --> 0:13:29.000
<v Speaker 1>came after, up. That boosts the system's confidence overall that

0:13:29.080 --> 0:13:32.240
<v Speaker 1>the keyword volume is in fact volume, and then it

0:13:32.280 --> 0:13:34.199
<v Speaker 1>does what I told it to do. When I talk

0:13:34.240 --> 0:13:37.559
<v Speaker 1>about confidence, I don't mean the system feels good about itself.

0:13:37.600 --> 0:13:41.440
<v Speaker 1>I'm talking about probabilities. These systems largely work in the

0:13:41.440 --> 0:13:44.920
<v Speaker 1>realm of probabilities. What is the probability that I said

0:13:45.000 --> 0:13:49.040
<v Speaker 1>volume rather than some other word. For a speech recognition system

0:13:49.040 --> 0:13:51.560
<v Speaker 1>to work, it needs to be able to assign a

0:13:51.640 --> 0:13:55.880
<v Speaker 1>confidence level to words. The higher the level, the more certain

0:13:56.160 --> 0:13:59.079
<v Speaker 1>quote unquote the system is that it got things correct.

0:13:59.559 --> 0:14:03.320
<v Speaker 1>Typically, computer engineers will design systems that will only execute

0:14:03.320 --> 0:14:07.240
<v Speaker 1>a command or return a result of some sort if

0:14:07.280 --> 0:14:10.160
<v Speaker 1>the system has reached a certain threshold of confidence, and

0:14:10.200 --> 0:14:13.600
<v Speaker 1>if it hasn't, you won't get a result. So, for example,

0:14:13.679 --> 0:14:17.000
<v Speaker 1>and this isn't about speech recognition exactly, but it illustrates

0:14:17.040 --> 0:14:20.760
<v Speaker 1>my point. IBM's Watson computer would not offer up

0:14:20.800 --> 0:14:24.760
<v Speaker 1>an answer on Jeopardy unless it met a certain threshold

0:14:24.880 --> 0:14:26.800
<v Speaker 1>of confidence in an answer, and I think it was

0:14:26.800 --> 0:14:30.600
<v Speaker 1>about eight percent. So if it or eight percent certain

0:14:30.680 --> 0:14:32.520
<v Speaker 1>that it had the right answer, it would buzz in.
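
NOTE
A minimal sketch of the confidence threshold idea described above. The function name choose_word, the candidate scores, and the 0.8 threshold are illustrative assumptions, not figures from Watson or from any real speech recognizer.
```python
def choose_word(candidates, threshold):
    """Return the most probable word, or None if its score is below the threshold."""
    word, probability = max(candidates.items(), key=lambda item: item[1])
    return word if probability >= threshold else None
# Hypothetical scores for a clearly spoken word and a mumbled one.
clear = {"volume": 0.92, "column": 0.05, "valium": 0.03}
muddy = {"volume": 0.45, "column": 0.40, "valium": 0.15}
print(choose_word(clear, threshold=0.8), choose_word(muddy, threshold=0.8))
```
Returning None models the case where the system gives no result instead of acting on a weak guess. Where to set the threshold is a design tradeoff between wrong actions and unanswered requests.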

0:14:32.600 --> 0:14:35.680
<v Speaker 1>But if it was less than eight sure, it would

0:14:35.680 --> 0:14:38.880
<v Speaker 1>not put forth that answer. There are two broad types

0:14:38.920 --> 0:14:43.120
<v Speaker 1>of statistical models in speech recognition systems today. There are

0:14:43.120 --> 0:14:45.400
<v Speaker 1>others that could be used, but there are two broad

0:14:45.480 --> 0:14:47.640
<v Speaker 1>ones that tend to be used these days. They are

0:14:47.640 --> 0:14:52.040
<v Speaker 1>the hidden Markov model and neural networks. Hidden Markov model,

0:14:52.080 --> 0:14:57.280
<v Speaker 1>by the way, is overwhelmingly the most popular method of

0:14:58.080 --> 0:15:02.640
<v Speaker 1>using a statistical model to analyze speech recognition. It is

0:15:02.680 --> 0:15:05.200
<v Speaker 1>the prevalent approach, and it works sort of how I

0:15:05.280 --> 0:15:07.640
<v Speaker 1>just described. It looks at each phoneme and starts

0:15:07.640 --> 0:15:10.080
<v Speaker 1>to build out a pathway. If you think of this

0:15:10.120 --> 0:15:13.120
<v Speaker 1>as like an actual physical path that you're following, you

0:15:13.120 --> 0:15:15.640
<v Speaker 1>would start off with the first phoneme that represents

0:15:15.680 --> 0:15:18.120
<v Speaker 1>the beginning of the path, and the phoneme might

0:15:18.160 --> 0:15:22.000
<v Speaker 1>eliminate other possible phonemes right away. By that, I

0:15:22.000 --> 0:15:26.560
<v Speaker 1>mean it might be a sound that doesn't combine with

0:15:26.640 --> 0:15:29.080
<v Speaker 1>certain other sounds within that language. There might be a

0:15:29.120 --> 0:15:33.800
<v Speaker 1>phoneme that does not combine with other specific phonemes.
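
NOTE
A minimal sketch of the pathway picture being described here, using the Viterbi algorithm, a standard way to find the most probable path through a hidden Markov model. The episode does not name Viterbi, so that choice is an assumption, and the three phonemes, the sound labels, and every probability below are made-up toy values. A transition probability of zero plays the role of a closed pathway.
```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence and its probability."""
    # best[s] is (probability, path) for the best path that ends in state s.
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Pick the best way to arrive at s; zero-probability transitions never win.
            prob, path = max(
                ((best[prev][0] * trans_p[prev][s], best[prev][1]) for prev in states),
                key=lambda pair: pair[0],
            )
            new_best[s] = (prob * emit_p[s][obs], path + [s])
        best = new_best
    prob, path = max(best.values(), key=lambda pair: pair[0])
    return path, prob
# Toy model: hidden phonemes and noisy sound labels heard by the system.
states = ["k", "ae", "t"]
start_p = {"k": 0.5, "ae": 0.1, "t": 0.4}
trans_p = {
    "k": {"k": 0.0, "ae": 1.0, "t": 0.0},
    "ae": {"k": 0.5, "ae": 0.0, "t": 0.5},
    "t": {"k": 0.0, "ae": 1.0, "t": 0.0},
}
emit_p = {
    "k": {"k-like": 0.6, "ae-like": 0.1, "t-like": 0.3},
    "ae": {"k-like": 0.1, "ae-like": 0.8, "t-like": 0.1},
    "t": {"k-like": 0.3, "ae-like": 0.1, "t-like": 0.6},
}
path, prob = viterbi(["k-like", "ae-like", "t-like"], states, start_p, trans_p, emit_p)
print(path, prob)
```
Real systems work with log-probabilities and far larger models, but the principle is the same: score every open path step by step and keep the most probable one.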

0:15:33.880 --> 0:15:36.880
<v Speaker 1>So imagine you have a path and originally it splits

0:15:36.920 --> 0:15:40.160
<v Speaker 1>into tons of other pathways, but a couple of those

0:15:40.160 --> 0:15:42.960
<v Speaker 1>pathways are blocked off with signs that say the pathway

0:15:43.000 --> 0:15:47.120
<v Speaker 1>is closed. It's closed because those pathways represent phonemes

0:15:47.200 --> 0:15:51.320
<v Speaker 1>that would never be paired with the initial one. You

0:15:51.360 --> 0:15:55.480
<v Speaker 1>just don't get that sound in English. The closed paths

0:15:55.600 --> 0:15:58.640
<v Speaker 1>would therefore be off limits, and only the open paths

0:15:58.640 --> 0:16:01.960
<v Speaker 1>would be the possibility. Then the hidden Markov model would

0:16:01.960 --> 0:16:04.640
<v Speaker 1>look at the next phoneme, the next step along

0:16:04.640 --> 0:16:08.080
<v Speaker 1>this pathway. That phoneme determines which of the viable

0:16:08.120 --> 0:16:11.800
<v Speaker 1>path options is actually the one to follow. All the

0:16:11.800 --> 0:16:14.640
<v Speaker 1>other options would be discarded, and so on. It would

0:16:14.680 --> 0:16:17.320
<v Speaker 1>go all the way down the list of phonemes

0:16:17.360 --> 0:16:19.840
<v Speaker 1>until the model arrives at a conclusion of the most

0:16:19.880 --> 0:16:23.280
<v Speaker 1>likely word that was spoken. It assigns a probability score

0:16:23.280 --> 0:16:26.720
<v Speaker 1>to each phoneme, thinking I'm pretty sure the sound

0:16:26.760 --> 0:16:30.600
<v Speaker 1>that I heard, quote unquote was this. That helps the

0:16:30.640 --> 0:16:33.280
<v Speaker 1>system make an educated guess as to what word was

0:16:33.320 --> 0:16:36.320
<v Speaker 1>actually spoken. Now, I've talked a lot about neural networks

0:16:36.320 --> 0:16:37.760
<v Speaker 1>in the past. I'm just going to give it a

0:16:37.840 --> 0:16:42.800
<v Speaker 1>quick cursory covering here, because they really aren't the dominant

0:16:43.080 --> 0:16:47.800
<v Speaker 1>statistical model in speech recognition. Neural networks have nodes,

0:16:48.080 --> 0:16:51.560
<v Speaker 1>computer nodes or algorithms that act like a neuron, right,

0:16:51.680 --> 0:16:54.760
<v Speaker 1>like a brain cell, and they execute operations

0:16:54.840 --> 0:16:58.440
<v Speaker 1>on data. The neurons also assign a probability score to

0:16:58.640 --> 0:17:02.080
<v Speaker 1>that execution on the data, which shows the

0:17:02.080 --> 0:17:05.400
<v Speaker 1>confidence of the system in the result before they pass

0:17:05.440 --> 0:17:08.080
<v Speaker 1>it on to another neuron in the network, which then

0:17:08.160 --> 0:17:10.879
<v Speaker 1>executes another operation on the data and so on, and

0:17:10.960 --> 0:17:13.760
<v Speaker 1>ultimately the network produces an end result of all those

0:17:13.800 --> 0:17:17.320
<v Speaker 1>operations and judges the probability of whether or not that

0:17:17.440 --> 0:17:20.560
<v Speaker 1>result is the right one, and again, if it meets

0:17:20.560 --> 0:17:24.560
<v Speaker 1>a certain threshold, then it's considered the correct answer or

0:17:24.560 --> 0:17:27.119
<v Speaker 1>the closest to correct that the system can manage. In

0:17:27.160 --> 0:17:30.840
<v Speaker 1>any case, speech recognition systems have to be trained, and

0:17:30.840 --> 0:17:34.200
<v Speaker 1>there are trillions of potential combinations of sounds that could

0:17:34.280 --> 0:17:37.640
<v Speaker 1>represent different words. In the HowStuffWorks article How

0:17:37.720 --> 0:17:41.000
<v Speaker 1>Speech Recognition Works, Ed Grabianowski, who is one of the

0:17:41.160 --> 0:17:44.120
<v Speaker 1>powerhouses of the site. He's written some of the best

0:17:44.240 --> 0:17:47.320
<v Speaker 1>articles on HowStuffWorks, gave a great example. He says,

0:17:47.600 --> 0:17:52.800
<v Speaker 1>take the phrase recognize speech, right? The phonemes in

0:17:52.800 --> 0:17:55.520
<v Speaker 1>that phrase happen to be pretty similar to a totally

0:17:55.600 --> 0:17:59.199
<v Speaker 1>different phrase, which would be wreck a nice beach. So you have

0:17:59.280 --> 0:18:04.720
<v Speaker 1>recognize speech or wreck a nice beach. The speech recognition

0:18:04.720 --> 0:18:07.480
<v Speaker 1>software has to be able to determine the difference, or

0:18:07.560 --> 0:18:09.200
<v Speaker 1>else the next thing you know, you're gonna have terminators

0:18:09.280 --> 0:18:12.960
<v Speaker 1>kicking sand in everyone's face, and that's no good. Alexander Waibel,

0:18:13.320 --> 0:18:17.360
<v Speaker 1>who worked on that system called Harpy that I mentioned earlier,

0:18:18.000 --> 0:18:21.200
<v Speaker 1>had another couple of examples. He said, you might say

0:18:21.320 --> 0:18:25.919
<v Speaker 1>euthanasia and get the result youth in Asia.
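
NOTE
Editor's sketch, not from the episode. One common way to choose between sound-alike transcripts is to score each candidate with a language model built from word-pair counts. The training text, function names, and add-one smoothing below are all illustrative assumptions, not the method any particular product uses.

```python
import math
from collections import Counter

# Tiny made-up training text (an assumption for illustration only).
# A real system would learn from vastly more transcribed speech and text.
TRAINING_TEXT = (
    "computers recognize speech . "
    "machines recognize speech well . "
    "we recognize speech patterns . "
    "storms wreck a nice beach ."
)


def train_bigrams(text):
    """Count single words and adjacent word pairs in the training text."""
    words = text.split()
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    return unigrams, bigrams


def log_score(phrase, unigrams, bigrams):
    """Average add-one smoothed bigram log-probability of a phrase.

    Averaging per word pair keeps longer candidates from being
    penalized simply for containing more pairs.
    """
    words = phrase.split()
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return 0.0
    vocab_size = len(unigrams)
    total = 0.0
    for prev, cur in pairs:
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total / len(pairs)


def pick_transcript(candidates, unigrams, bigrams):
    """Return the candidate the bigram model considers most likely."""
    return max(candidates, key=lambda c: log_score(c, unigrams, bigrams))


unigrams, bigrams = train_bigrams(TRAINING_TEXT)
best = pick_transcript(["wreck a nice beach", "recognize speech"], unigrams, bigrams)
print(best)
```

With this toy text the model prefers recognize speech only because that word pair shows up more often in the training data, which is the same reason real systems need so much of it.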

0:18:26.080 --> 0:18:29.280
<v Speaker 1>Or you might say give me a new display and

0:18:29.320 --> 0:18:32.040
<v Speaker 1>you get the result, give me a nudist play. If

0:18:32.080 --> 0:18:36.040
<v Speaker 1>you've ever used something like Google transcripts, where if you

0:18:36.040 --> 0:18:38.760
<v Speaker 1>had a Google Voice and you were reading the voicemails,

0:18:39.760 --> 0:18:43.159
<v Speaker 1>you could get hilarious results because of this. The speech recognition,

0:18:43.480 --> 0:18:46.640
<v Speaker 1>the speech to text feature could end up spelling out

0:18:47.560 --> 0:18:52.080
<v Speaker 1>truly ridiculous messages. I would get messages from my mother,

0:18:52.920 --> 0:18:56.000
<v Speaker 1>and I only wish my mom would leave me messages

0:18:56.080 --> 0:18:59.600
<v Speaker 1>the way that Google transcript thought she was leaving me messages,

0:18:59.640 --> 0:19:03.320
<v Speaker 1>because they were the most crazy messages ever. But it's

0:19:03.359 --> 0:19:05.720
<v Speaker 1>mostly because my mom has a Southern accent and so

0:19:05.800 --> 0:19:10.640
<v Speaker 1>Google would often misinterpret what she was saying. So these

0:19:10.640 --> 0:19:14.520
<v Speaker 1>systems have to undergo hours of training. John Garofolo, a

0:19:14.520 --> 0:19:17.800
<v Speaker 1>computer scientist who was cited in that HowStuffWorks article,

0:19:18.119 --> 0:19:22.600
<v Speaker 1>had this to say. These statistical systems need lots of

0:19:22.640 --> 0:19:26.960
<v Speaker 1>exemplary training data to reach their optimal performance, sometimes on

0:19:27.000 --> 0:19:30.840
<v Speaker 1>the order of thousands of hours of human transcribed speech

0:19:30.880 --> 0:19:34.960
<v Speaker 1>and hundreds of megabytes of text. These training data are

0:19:35.040 --> 0:19:38.919
<v Speaker 1>used to create acoustic models of words, word lists, and

0:19:39.040 --> 0:19:43.200
<v Speaker 1>multi word probability networks. There is some art into how

0:19:43.240 --> 0:19:47.680
<v Speaker 1>one selects, compiles, and prepares this training data for digestion

0:19:47.800 --> 0:19:50.680
<v Speaker 1>by the system, and how the system models are tuned

0:19:50.840 --> 0:19:54.400
<v Speaker 1>to a particular application. These details can make the difference

0:19:54.400 --> 0:19:57.719
<v Speaker 1>between a well performing system and a poorly performing system,

0:19:57.960 --> 0:20:01.639
<v Speaker 1>even when using the same basic algorithm. Speech recognition

0:20:01.680 --> 0:20:05.240
<v Speaker 1>also requires a decent amount of processing power. This was

0:20:05.280 --> 0:20:08.440
<v Speaker 1>a limiting factor on speech recognition for a really long time.

0:20:08.880 --> 0:20:12.639
<v Speaker 1>Systems were limited in their capabilities, which meant that for years,

0:20:12.680 --> 0:20:16.399
<v Speaker 1>if you wanted to incorporate speech recognition in a computer system,

0:20:16.480 --> 0:20:19.320
<v Speaker 1>then most of the computer's processing power would have

0:20:19.400 --> 0:20:22.119
<v Speaker 1>to dedicate itself just to parsing speech. You couldn't do

0:20:22.240 --> 0:20:25.320
<v Speaker 1>much else on that machine. But since Moore's law has held

0:20:25.400 --> 0:20:27.400
<v Speaker 1>up so well for decades, we got to a point

0:20:27.400 --> 0:20:30.000
<v Speaker 1>where the processing capabilities of machines reached a stage

0:20:30.200 --> 0:20:33.679
<v Speaker 1>where this isn't as big a concern. And another development

0:20:33.960 --> 0:20:38.680
<v Speaker 1>that Google really helped pioneer definitely changed things. I'll talk

0:20:38.720 --> 0:20:41.160
<v Speaker 1>more about that in our next section, but first let's

0:20:41.160 --> 0:20:51.560
<v Speaker 1>take another quick break to thank our sponsors. Okay, So,

0:20:51.600 --> 0:20:54.800
<v Speaker 1>advances in speech recognition in the late nineteen seventies paved

0:20:54.920 --> 0:20:58.280
<v Speaker 1>the way for how most systems work these days, though

0:20:58.320 --> 0:21:01.480
<v Speaker 1>of course the models have undergone multiple refinements and

0:21:01.520 --> 0:21:05.080
<v Speaker 1>tweaking over time. The first speech recognition product to ever

0:21:05.200 --> 0:21:09.200
<v Speaker 1>launch for consumers was a program called Dragon Dictate, which

0:21:09.240 --> 0:21:13.879
<v Speaker 1>debuted in Dragon Dictate. The original version that is, because

0:21:13.920 --> 0:21:17.119
<v Speaker 1>they still come out to this day, relied on discrete

0:21:17.160 --> 0:21:20.320
<v Speaker 1>speech recognition. Now, I don't mean you had to be

0:21:20.400 --> 0:21:22.879
<v Speaker 1>secretive and hush hush about it. It's not that kind

0:21:22.880 --> 0:21:25.879
<v Speaker 1>of discreet. Rather, I mean you had to pronounce each

0:21:26.040 --> 0:21:28.879
<v Speaker 1>word clearly, with a pause between words. You could not

0:21:29.080 --> 0:21:33.040
<v Speaker 1>speak conversationally, or the dictation software could not interpret what

0:21:33.080 --> 0:21:43.080
<v Speaker 1>you were saying, so using the software would sound like this.

0:21:44.840 --> 0:21:47.440
<v Speaker 1>It was limited and it was primitive compared to today's

0:21:47.520 --> 0:21:51.080
<v Speaker 1>speech recognition products, but it was a groundbreaking product in

0:21:51.119 --> 0:21:54.560
<v Speaker 1>the early nineties. And it also cost somewhere between six

0:21:54.600 --> 0:21:59.320
<v Speaker 1>thousand and nine thousand dollars. I saw differing accounts, but

0:21:59.359 --> 0:22:02.920
<v Speaker 1>that would be between nine and fourteen grand in today's dollars,

0:22:02.960 --> 0:22:07.320
<v Speaker 1>so pretty expensive software package. Dragon still produces speech recognition

0:22:07.359 --> 0:22:09.760
<v Speaker 1>technologies to this day, and of course they are much

0:22:09.800 --> 0:22:13.000
<v Speaker 1>more adept at recognizing and transcribing speech than the original

0:22:13.080 --> 0:22:16.160
<v Speaker 1>version was years ago. The software is also less expensive.

0:22:16.640 --> 0:22:19.399
<v Speaker 1>One version I saw retails for less than a hundred dollars,

0:22:19.400 --> 0:22:23.000
<v Speaker 1>so a nice big, deep price cut. Advancements in model design

0:22:23.040 --> 0:22:27.800
<v Speaker 1>and processor speed meant that speech recognition technology advanced rather quickly.

0:22:28.240 --> 0:22:33.280
<v Speaker 1>In BellSouth released VAL, V A L, the Voice

0:22:33.280 --> 0:22:36.840
<v Speaker 1>Portal. VAL was an automated interactive system that could respond

0:22:36.880 --> 0:22:40.520
<v Speaker 1>to questions over the phone. This was a basic implementation

0:22:40.560 --> 0:22:42.719
<v Speaker 1>that would evolve over time to the systems you may

0:22:42.760 --> 0:22:45.800
<v Speaker 1>have encountered when calling up automated menus where it's a

0:22:46.160 --> 0:22:49.800
<v Speaker 1>press three or say three and that kind of thing,

0:22:50.080 --> 0:22:53.600
<v Speaker 1>or do you have any questions? You can say anything

0:22:53.640 --> 0:22:56.479
<v Speaker 1>from check my balance to you know, that kind of stuff.

0:22:57.320 --> 0:22:59.920
<v Speaker 1>In two thousand five DARPA, which is the same branch

0:23:00.000 --> 0:23:02.199
<v Speaker 1>of the Department of Defense that used to be

0:23:02.240 --> 0:23:04.280
<v Speaker 1>known as ARPA, So in other words, it's the same

0:23:04.680 --> 0:23:09.120
<v Speaker 1>R and D arm that funded the creation of the Internet.

0:23:09.480 --> 0:23:11.760
<v Speaker 1>They funded a program in two thousand five called the

0:23:11.800 --> 0:23:17.359
<v Speaker 1>Global Autonomous Language Exploitation Project or GALE. The purpose of

0:23:17.359 --> 0:23:20.639
<v Speaker 1>this project was to advance research and development into automated

0:23:20.640 --> 0:23:24.480
<v Speaker 1>translation between languages. So not only were computers supposed to

0:23:24.520 --> 0:23:27.560
<v Speaker 1>be able to recognize speech, but also translate that speech

0:23:27.600 --> 0:23:31.320
<v Speaker 1>from one language into another, which adds another layer of

0:23:31.359 --> 0:23:35.120
<v Speaker 1>complexity on top, right? Well, according to SRI International,

0:23:35.680 --> 0:23:39.480
<v Speaker 1>the system should be able to quote automatically take multilingual

0:23:39.600 --> 0:23:44.239
<v Speaker 1>newscasts, text documents, and other forms of communication and

0:23:44.280 --> 0:23:47.679
<v Speaker 1>make their information available to human queries end quote. So

0:23:47.760 --> 0:23:50.840
<v Speaker 1>it wouldn't just translate the information, which was already even more

0:23:50.880 --> 0:23:54.760
<v Speaker 1>complicated than speech recognition, it could also index that information

0:23:54.800 --> 0:23:57.119
<v Speaker 1>in a meaningful way so you could search for stuff.

0:23:57.960 --> 0:24:02.720
<v Speaker 1>So layer upon layer of complexity for that project. Things

0:24:02.720 --> 0:24:06.080
<v Speaker 1>that helped push speech recognition as well as natural language

0:24:06.080 --> 0:24:10.680
<v Speaker 1>processing to new heights largely came from two competing companies,

0:24:11.040 --> 0:24:14.200
<v Speaker 1>Apple and Google. So let me explain that. In two

0:24:14.280 --> 0:24:18.000
<v Speaker 1>thousand seven, Apple introduced the iPhone, which was the first

0:24:18.040 --> 0:24:21.800
<v Speaker 1>truly successful consumer smartphone, especially here in the United States.

0:24:22.119 --> 0:24:25.959
<v Speaker 1>The smartphone introduced a new era and form of computing.

0:24:26.400 --> 0:24:31.720
<v Speaker 1>It created countless opportunities in numerous areas, including location based computing,

0:24:32.160 --> 0:24:36.280
<v Speaker 1>mobile interactions, and speech recognition. The computer was in a

0:24:36.440 --> 0:24:39.800
<v Speaker 1>phone form factor. Phones are designed for us to talk into,

0:24:39.880 --> 0:24:42.520
<v Speaker 1>So now you can walk around carrying a computer that

0:24:42.600 --> 0:24:45.080
<v Speaker 1>was designed to transmit your voice. It was only a matter

0:24:45.080 --> 0:24:47.480
<v Speaker 1>of time before someone figured out a way to leverage

0:24:47.520 --> 0:24:51.760
<v Speaker 1>that for speech recognition. Google, meanwhile, was pioneering an approach

0:24:51.800 --> 0:24:55.639
<v Speaker 1>to what would perform all the processing functions necessary to

0:24:55.720 --> 0:24:58.680
<v Speaker 1>support speech recognition. It was doing it in the cloud,

0:24:59.240 --> 0:25:02.480
<v Speaker 1>so instead of having the device itself have to run

0:25:02.560 --> 0:25:05.919
<v Speaker 1>all that processing power, the device would have a persistent

0:25:05.960 --> 0:25:10.520
<v Speaker 1>connection to a server on the Internet, and the server

0:25:10.640 --> 0:25:13.240
<v Speaker 1>would do the work. The device would just send the signal

0:25:13.280 --> 0:25:16.080
<v Speaker 1>to the server. The server would process and analyze the

0:25:16.119 --> 0:25:18.760
<v Speaker 1>signal and return the result back to the phone, and

0:25:18.800 --> 0:25:21.200
<v Speaker 1>the phone just was acting as a transmitter. It wasn't

0:25:21.280 --> 0:25:24.200
<v Speaker 1>really having to do any of that analysis itself. So

0:25:25.000 --> 0:25:29.600
<v Speaker 1>in two thousand and eight, Google launched the Google Voice

0:25:29.600 --> 0:25:33.040
<v Speaker 1>search app for the iPhone that would do all

0:25:33.400 --> 0:25:38.040
<v Speaker 1>this speech recognition processing. Right? You could speak into

0:25:38.080 --> 0:25:40.960
<v Speaker 1>it and have Google search the terms for you for

0:25:41.080 --> 0:25:43.720
<v Speaker 1>whatever it was you were saying. But again, what was

0:25:43.760 --> 0:25:45.920
<v Speaker 1>really going on was that Google was sending those search

0:25:46.080 --> 0:25:50.040
<v Speaker 1>terms, or rather that speech signal, over to a server

0:25:50.160 --> 0:25:53.520
<v Speaker 1>that Google operated, and then sending the results back down

0:25:53.560 --> 0:25:55.840
<v Speaker 1>to the phone. But to the user it looked like

0:25:55.880 --> 0:25:58.240
<v Speaker 1>the phone itself was doing all the work. The truth

0:25:58.400 --> 0:26:02.600
<v Speaker 1>was it was simply a very basic application of true

0:26:02.680 --> 0:26:05.840
<v Speaker 1>cloud computing, and that created a new method of rolling

0:26:05.880 --> 0:26:09.320
<v Speaker 1>out speech recognition in apps and services. No longer did

0:26:09.320 --> 0:26:12.119
<v Speaker 1>you have to worry about creating a really powerful piece

0:26:12.160 --> 0:26:15.280
<v Speaker 1>of equipment. You can have that be on the back end.
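
NOTE
Editor's sketch, not from the episode. The split described here, where the phone only forwards the captured signal and a back end does the analysis, can be illustrated with two small classes. Everything below is a hypothetical stand-in: there is no real networking, and the lookup table takes the place of actual recognition models.

```python
from dataclasses import dataclass, field


@dataclass
class RecognitionServer:
    """Stand-in for the remote service that does the heavy lifting."""

    # Hypothetical lookup table standing in for real acoustic and language models.
    known_audio: dict = field(default_factory=dict)

    def handle_request(self, audio: bytes) -> str:
        """Return the text for a captured signal, or an empty string if unknown."""
        return self.known_audio.get(audio, "")


@dataclass
class ThinClient:
    """Stand-in for the phone: it captures and forwards, nothing more."""

    server: RecognitionServer

    def voice_search(self, audio: bytes) -> str:
        # No local analysis happens here. The client forwards the raw
        # signal and hands back whatever the server returns.
        return self.server.handle_request(audio)


server = RecognitionServer(known_audio={b"\x01\x02": "weather today"})
phone = ThinClient(server=server)
print(phone.voice_search(b"\x01\x02"))
```

Because the client holds no model at all, the recognition logic can be improved on the server without changing anything on the device.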

0:26:15.840 --> 0:26:18.719
<v Speaker 1>The piece of equipment the user could have could be

0:26:18.760 --> 0:26:23.480
<v Speaker 1>a relatively underpowered terminal, essentially. Meanwhile, it also meant that

0:26:23.520 --> 0:26:27.879
<v Speaker 1>Google could collect enormous samples of data, not necessarily to

0:26:27.960 --> 0:26:31.520
<v Speaker 1>market to people or to identify specific individuals, but rather

0:26:31.680 --> 0:26:34.840
<v Speaker 1>it could collect a lot of data for training its

0:26:34.920 --> 0:26:39.200
<v Speaker 1>speech recognition and natural language recognition models. Google could build

0:26:39.200 --> 0:26:42.080
<v Speaker 1>out a much more robust model of human speech patterns

0:26:42.359 --> 0:26:46.840
<v Speaker 1>because they had thousands of real world uses going on

0:26:46.960 --> 0:26:49.520
<v Speaker 1>in real time. They could keep using that to build

0:26:49.520 --> 0:26:54.919
<v Speaker 1>out and bolster their models, and that improved Google's speech

0:26:54.920 --> 0:26:59.439
<v Speaker 1>recognition accuracy. Today, major speech recognition platforms typically have an

0:26:59.520 --> 0:27:02.800
<v Speaker 1>error rate below five percent, which is pretty darn impressive.
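
NOTE
Editor's sketch, not from the episode. The episode does not say how the error rate is measured. A common metric is word error rate: the number of word substitutions, insertions, and deletions needed to turn the system's output into a human reference transcript, divided by the length of the reference. The function name and sample phrases below are illustrative assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between two transcripts, divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Two of the five reference words come out wrong, so the rate is 2 / 5.
print(word_error_rate("give me a new display", "give me a nudist play"))
```

By this measure, a rate below five percent means fewer than one word in twenty needs correcting.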

0:27:03.280 --> 0:27:07.880
<v Speaker 1>According to a comScore estimate, by twenty half of

0:27:07.920 --> 0:27:11.600
<v Speaker 1>all searches on the Internet will be voice searches. So

0:27:11.640 --> 0:27:15.720
<v Speaker 1>speech recognition, along with natural language processing, could lead to

0:27:15.760 --> 0:27:19.679
<v Speaker 1>a future of ambient computing in which the environments we

0:27:19.880 --> 0:27:23.720
<v Speaker 1>move through are effectively computer interfaces, and we can access

0:27:23.760 --> 0:27:26.600
<v Speaker 1>them through voice commands and other ways of commanding, maybe

0:27:26.600 --> 0:27:29.480
<v Speaker 1>gesture commands, but that seems like it might be better

0:27:29.520 --> 0:27:32.360
<v Speaker 1>saved for our episode about voice assistants and where we're

0:27:32.359 --> 0:27:36.160
<v Speaker 1>headed with that technology. In our next episode, I'm going

0:27:36.200 --> 0:27:40.199
<v Speaker 1>to really explore natural language processing, how it works, and

0:27:40.240 --> 0:27:42.840
<v Speaker 1>how that field of research has evolved over the last

0:27:42.840 --> 0:27:46.119
<v Speaker 1>few decades. It's also really fascinating, and it does, in

0:27:46.200 --> 0:27:49.000
<v Speaker 1>fact cross over quite a bit with speech recognition. But

0:27:49.280 --> 0:27:53.679
<v Speaker 1>natural language processing goes beyond speech. It also includes text,

0:27:54.320 --> 0:27:56.320
<v Speaker 1>and that will be our next episode. But if you

0:27:56.359 --> 0:27:58.760
<v Speaker 1>have a suggestion for a future topic I should cover

0:27:58.960 --> 0:28:01.520
<v Speaker 1>on tech Stuff, send me a message let me know

0:28:01.560 --> 0:28:04.199
<v Speaker 1>about it. The email for the show is tech stuff

0:28:04.359 --> 0:28:07.440
<v Speaker 1>at how stuff works dot com, or you can drop

0:28:07.440 --> 0:28:09.240
<v Speaker 1>me a line on Facebook or Twitter. The handle for

0:28:09.320 --> 0:28:13.040
<v Speaker 1>both of those is TechStuff HSW, and you

0:28:13.119 --> 0:28:15.800
<v Speaker 1>can also follow us on Instagram. I would love it

0:28:15.840 --> 0:28:19.200
<v Speaker 1>if you did, and I'll talk to you again really soon.

0:28:24.960 --> 0:28:27.399
<v Speaker 1>For more on this and thousands of other topics,

0:28:27.400 --> 0:28:38.520
<v Speaker 1>visit how stuff works dot com