WEBVTT - Shh! The Tech is Listening! 0:00:04.240 --> 0:00:07.240 Welcome to Tech Stuff, a production of I Heart Radio's 0:00:07.320 --> 0:00:13.880 How Stuff Works. Hey there, and welcome to Tech Stuff. 0:00:13.880 --> 0:00:17.400 I'm your host, Jonathan Strickland. I'm an executive producer with 0:00:17.600 --> 0:00:19.560 I Heart Radio and How Stuff Works, and I love 0:00:19.600 --> 0:00:24.200 all things tech. And I'm sitting in the audience of 0:00:24.239 --> 0:00:28.240 a local theater, like a stage theater, not long ago. I'm 0:00:28.240 --> 0:00:31.440 waiting for the show to start, and there's a song 0:00:31.720 --> 0:00:34.440 that's playing over the sound system, and I'm really kind 0:00:34.440 --> 0:00:37.479 of digging the song, but I totally don't recognize it. 0:00:38.040 --> 0:00:40.920 And I glance down at my phone and I see 0:00:41.240 --> 0:00:44.320 that on the phone, below the time on the locked 0:00:44.400 --> 0:00:48.600 phone screen, it says that the song is Danger! High 0:00:48.680 --> 0:00:52.280 Voltage by Electric Six. Now this is obviously a hypothetical 0:00:52.280 --> 0:00:54.680 example because I would recognize that song anywhere, but you 0:00:54.720 --> 0:00:57.959 get the point. Anyway, I'm thinking, that's so cool. My 0:00:58.000 --> 0:01:01.640 phone knows what songs are playing around me. That's so neat. 0:01:02.360 --> 0:01:05.000 I didn't even have to tell it to do anything. And 0:01:05.040 --> 0:01:07.760 then a couple of hours later, as I think back 0:01:07.800 --> 0:01:11.560 on this moment, uncertainty and dread start to seep in. 0:01:11.680 --> 0:01:15.240 Wait a minute, if my phone can identify a song 0:01:15.440 --> 0:01:18.400 that's playing around me, that means my phone is actually 0:01:18.440 --> 0:01:21.319 listening to stuff. It wouldn't be able to tell me 0:01:21.680 --> 0:01:23.920 the song title otherwise. It has to be able to 0:01:23.959 --> 0:01:26.959 pick up the audio. I didn't activate any app. 
I 0:01:26.959 --> 0:01:30.880 didn't turn on Shazam or ask my phone or anything. 0:01:30.920 --> 0:01:33.560 My phone did it by itself. So my phone is 0:01:33.600 --> 0:01:36.800 detecting the sounds around it even when it's not in 0:01:36.920 --> 0:01:41.280 an active mode. Now, on a similar note, I'm sure 0:01:41.440 --> 0:01:45.640 we all have had these personal assistant experiences out there. 0:01:45.680 --> 0:01:48.520 Whether we use one ourselves or we've been around when someone 0:01:48.520 --> 0:01:52.880 else uses them, things like Google Assistant or Alexa or 0:01:52.920 --> 0:01:56.120 Siri or Cortana. There's more of them out there. You 0:01:56.160 --> 0:01:59.200 can activate these assistants with a specific word or phrase, 0:01:59.560 --> 0:02:01.640 and then you speak to them to carry out some 0:02:01.680 --> 0:02:04.560 sort of task or to get you some sort of 0:02:04.560 --> 0:02:07.400 information or something along those lines. We've got a Google 0:02:07.440 --> 0:02:10.200 Home device in our house, so we might use it 0:02:10.240 --> 0:02:13.480 to get a quick rundown on the weather report. We 0:02:13.560 --> 0:02:15.360 might ask it to play a track off an album 0:02:15.360 --> 0:02:19.000 by the jazz fusion band Weather Report. But wait, that 0:02:19.080 --> 0:02:22.120 means that device is listening, too. We didn't have to 0:02:22.120 --> 0:02:24.280 take any physical action. We didn't have to push a 0:02:24.320 --> 0:02:27.560 button to make it work. We just spoke the keyword 0:02:27.720 --> 0:02:31.160 or a key phrase, and off it goes. And then 0:02:31.200 --> 0:02:34.760 we get into stuff that seems super creepy. And I'm 0:02:34.800 --> 0:02:37.240 sure most of you have had some sort of experience 0:02:37.280 --> 0:02:40.840 like this. 
Say you're chatting with friends, maybe you're at 0:02:40.880 --> 0:02:44.400 a restaurant or you're just hanging out, and you're talking 0:02:44.440 --> 0:02:47.480 about this new snack food you just heard about, and 0:02:47.520 --> 0:02:50.519 this is just one part of a conversation that rambles 0:02:50.560 --> 0:02:55.200 all over the place. But then you talk a little 0:02:55.200 --> 0:02:56.840 bit about the snack food for a couple of minutes. 0:02:56.840 --> 0:02:58.760 You're like, you've heard about it, you wanted to try it, 0:02:58.880 --> 0:03:01.080 you haven't tried it yet. Later on, you pop on 0:03:01.120 --> 0:03:03.079 over to Facebook, and as you're scrolling through your feed, 0:03:03.160 --> 0:03:06.440 there it is. There's an ad for the very same 0:03:06.480 --> 0:03:09.480 snack food you mentioned to your friends just a little 0:03:09.480 --> 0:03:13.240 earlier that day. You've never purchased the snack as far 0:03:13.280 --> 0:03:15.520 as you remember, you haven't even searched for it on 0:03:15.560 --> 0:03:19.240 the web, and there's the ad. So is Facebook listening 0:03:19.280 --> 0:03:22.200 in on your conversation in an effort to serve up 0:03:22.240 --> 0:03:26.680 a laser focused targeted ad? In this episode, we're gonna 0:03:26.680 --> 0:03:29.840 take a look at the technology that allows our devices 0:03:29.880 --> 0:03:33.320 to listen in on us, and we'll explore the studies 0:03:33.320 --> 0:03:36.200 about whether or not anything hinky is going on and 0:03:36.200 --> 0:03:40.400 try to separate fact from FUD, F-U-D, that's fear, 0:03:40.520 --> 0:03:44.240 uncertainty and doubt. And we'll also chat about some recent 0:03:44.320 --> 0:03:47.120 news stories about how big companies have been handing over 0:03:47.160 --> 0:03:51.280 audio messages to third party human contractors and what that 0:03:51.360 --> 0:03:55.680 means in terms of privacy and ethics. 
Now, first, let's 0:03:55.720 --> 0:04:00.160 address a big reason why devices aren't constantly recording or 0:04:00.200 --> 0:04:05.520 broadcasting all the sounds within an environment that's reachable by microphone. 0:04:06.320 --> 0:04:10.840 It's because that's truly enormous, Like, that's a huge amount 0:04:10.960 --> 0:04:14.040 of data. So let's just take Facebook as an example. 0:04:14.680 --> 0:04:18.360 There are more than two billion people using Facebook every month. 0:04:18.880 --> 0:04:21.080 At least one and a half billion people pop on 0:04:21.080 --> 0:04:24.400 Facebook every single day. Now that's not necessarily the same 0:04:24.880 --> 0:04:27.680 one and a half billion people every day, but every 0:04:27.760 --> 0:04:31.640 day one point five billion people check Facebook, and out 0:04:31.640 --> 0:04:35.400 of that number, nearly one billion of them are accessing 0:04:35.440 --> 0:04:40.360 Facebook on mobile devices. So, just from a data management standpoint, 0:04:41.040 --> 0:04:45.240 there's no way any company, even one as large as Facebook, 0:04:45.400 --> 0:04:49.279 could be actively monitoring, recording, or even analyzing all that 0:04:49.360 --> 0:04:54.080 audio that would be coming in from a billion mobile handsets. 0:04:54.960 --> 0:04:56.960 We are in the age of big data, but we 0:04:57.040 --> 0:04:59.640 still have our limits. Plus you'd have to figure out 0:05:00.240 --> 0:05:03.520 that you know that that large amount of data, most 0:05:03.560 --> 0:05:06.640 of it wouldn't be useful to Facebook. Now, don't get 0:05:06.640 --> 0:05:08.880 me wrong. At the end of the day, you and 0:05:08.960 --> 0:05:14.000 I are the products being bought and sold on Facebook 0:05:14.080 --> 0:05:19.240 and Google and other providers out there. We're potential customers 0:05:19.279 --> 0:05:22.720 for all of the advertisers that use those companies like 0:05:22.760 --> 0:05:26.839 Facebook as a platform. 
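To put a rough number on that "huge amount of data" point, here is a back-of-envelope sketch. The one-billion figure comes from the episode; the 16 kbps bitrate and the always-streaming assumption are illustrative guesses, not anything Facebook has published.

```python
# Back-of-envelope estimate. Every constant here is an assumption for
# illustration, not a measured figure.
MOBILE_USERS = 1_000_000_000    # "nearly one billion" daily mobile users (from the episode)
BITRATE_BITS_PER_SEC = 16_000   # assumed: heavily compressed, speech-quality audio
SECONDS_PER_DAY = 24 * 60 * 60


def daily_audio_bytes(users: int, bitrate_bps: int, seconds: int = SECONDS_PER_DAY) -> int:
    """Total bytes produced if every handset streamed audio for `seconds` seconds."""
    return users * bitrate_bps * seconds // 8


total = daily_audio_bytes(MOBILE_USERS, BITRATE_BITS_PER_SEC)
print(f"per handset:  {total / MOBILE_USERS / 1e6:.1f} MB/day")
print(f"all handsets: {total / 1e15:.1f} PB/day")
```

Under those assumptions each handset would upload roughly 173 MB a day, and the whole fleet would generate on the order of 173 petabytes a day, before anyone has transcribed or analyzed a single word of it.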
So it benefits the advertisers and 0:05:27.040 --> 0:05:31.120 Facebook and sometimes even us as customers to match the 0:05:31.200 --> 0:05:34.360 right ads to the right people. So there's definitely an 0:05:34.400 --> 0:05:37.880 incentive to learn as much about users as possible to 0:05:38.000 --> 0:05:42.200 leverage their interests and potentially convert them into paying customers 0:05:42.240 --> 0:05:45.960 to an advertiser. Now, this is the very basic foundation 0:05:46.080 --> 0:05:50.520 of Facebook's business model. So if Facebook could do this 0:05:50.839 --> 0:05:54.160 from a technical standpoint, and if the company could get 0:05:54.200 --> 0:05:58.400 away with it from a public perception standpoint, I think 0:05:58.400 --> 0:06:03.000 there's little doubt that Facebook would do it. But honestly, 0:06:03.000 --> 0:06:05.440 it's just way too much information to process and to 0:06:05.480 --> 0:06:09.200 boil down into actionable plans. We talk about a lot 0:06:09.200 --> 0:06:12.080 of stuff in our day, you know, and some of 0:06:12.120 --> 0:06:14.159 it we may not really be interested in. We're just 0:06:14.200 --> 0:06:17.839 talking about something. So it wouldn't do Facebook any good 0:06:17.839 --> 0:06:20.239 to serve up ads for stuff that we weren't actually 0:06:20.279 --> 0:06:22.880 really interested in, so it has to pick and choose 0:06:22.880 --> 0:06:27.360 its moments. Facebook has denied using phone microphones in this way. 0:06:27.720 --> 0:06:30.320 In a June second, two thousand sixteen blog post on 0:06:30.360 --> 0:06:34.280 the Facebook newsroom site, a company representative wrote this, and 0:06:34.320 --> 0:06:39.720 here's a quote. Facebook does not use your phone's microphone 0:06:39.760 --> 0:06:42.359 to inform ads or to change what you see in 0:06:42.440 --> 0:06:45.800 news feed. 
Some recent articles have suggested that we must 0:06:45.839 --> 0:06:48.280 be listening to people's conversations in order to show them 0:06:48.279 --> 0:06:52.360 relevant ads. This is not true. We show ads based 0:06:52.400 --> 0:06:56.400 on people's interests and other profile information, not what you're 0:06:56.400 --> 0:07:00.160 talking out loud about. We only access your microphone if 0:07:00.200 --> 0:07:02.560 you have given our app permission, and if you are 0:07:02.600 --> 0:07:06.560 actively using a specific feature that requires audio. This might 0:07:06.600 --> 0:07:09.600 include recording a video or using an optional feature we 0:07:09.640 --> 0:07:12.560 introduced two years ago to include music or other audio 0:07:12.600 --> 0:07:18.240 in your status updates. End quote. Now, it's understandable that 0:07:18.320 --> 0:07:22.200 people would be a bit skeptical regarding Facebook's claims of innocence. 0:07:22.520 --> 0:07:25.840 In this regard. The company has had several high profile 0:07:25.920 --> 0:07:29.840 scandals and issues with privacy and security. Zuckerberg himself once 0:07:29.960 --> 0:07:35.240 famously declared that privacy is dead. Also, he simultaneously does 0:07:35.280 --> 0:07:38.400 his best to preserve his own privacy. But that's commentary 0:07:38.440 --> 0:07:42.400 for another episode. So I don't blame people for thinking 0:07:42.440 --> 0:07:45.480 that Facebook might actually be listening in on conversations because 0:07:45.480 --> 0:07:48.880 the company has already proven it hasn't been the best 0:07:49.000 --> 0:07:52.640 steward of user privacy in the past. But that doesn't 0:07:52.680 --> 0:07:56.040 mean the company has actually been spying on people. It 0:07:56.080 --> 0:08:00.480 doesn't have to, at least not in that way. 
And 0:08:00.720 --> 0:08:03.680 this is where we get into some troubling territory because 0:08:03.720 --> 0:08:06.200 it's where we start to learn how services like Google 0:08:06.280 --> 0:08:10.880 and Facebook and others can glean information about us, whether 0:08:10.960 --> 0:08:14.240 we have consciously shared that information or not, and it 0:08:14.240 --> 0:08:17.840 helps explain how these companies can advertise to us so effectively. 0:08:18.640 --> 0:08:22.200 One way Facebook does this is with an innovation called 0:08:22.360 --> 0:08:26.640 Facebook Pixel. Now, this is a piece of code that 0:08:27.000 --> 0:08:32.320 Facebook's clients advertisers really can put on their own websites. 0:08:32.720 --> 0:08:35.600 So it's the type of code you would insert into 0:08:35.640 --> 0:08:38.040 the website for a business. So let's say you own 0:08:38.080 --> 0:08:42.359 a specialty niche marketing shop. We'll say you sell figurines 0:08:42.400 --> 0:08:46.200 based off of iconic horror movie monsters and characters, and 0:08:46.240 --> 0:08:49.200 you're going to advertise on Facebook. The pixel code is 0:08:49.240 --> 0:08:52.920 one way Facebook can optimize that experience. The code pulls 0:08:52.960 --> 0:08:57.320 information off of user behavior on your website and sends 0:08:57.320 --> 0:09:00.760 it to Facebook. If people click over to your site 0:09:00.760 --> 0:09:03.560 because of an ad on Facebook, pixel will register it. 0:09:04.000 --> 0:09:07.120 This helps you see how effective or ineffective your ads 0:09:07.200 --> 0:09:10.800 are on the site. It also can target your ads 0:09:10.920 --> 0:09:13.520 to people on Facebook who would be most likely to 0:09:13.600 --> 0:09:17.160 click on those ads. 
It might analyze the traits common 0:09:17.200 --> 0:09:19.600 to people who are interacting with your ads, and then 0:09:19.640 --> 0:09:22.760 extrapolate that to target people who have similar traits and 0:09:22.880 --> 0:09:27.920 behaviors but haven't yet seen your advertisements. Facebook, meanwhile, 0:09:28.040 --> 0:09:30.360 can also use that data to serve up ads from 0:09:30.400 --> 0:09:33.559 other companies to users based on similar findings, and it 0:09:33.640 --> 0:09:36.400 can track other stuff too. Let's say you click over 0:09:36.480 --> 0:09:38.880 to an article on a blog or news site that 0:09:38.960 --> 0:09:42.680 incorporates Facebook Pixel in the site's code. Facebook can see 0:09:42.679 --> 0:09:45.160 how long you were on that article, which in turn 0:09:45.200 --> 0:09:48.600 indicates your interest and investment level in that topic. Then 0:09:48.640 --> 0:09:51.640 Facebook can serve up ads related to the contents of 0:09:51.679 --> 0:09:54.920 that article to you. In the end, it's all about 0:09:54.920 --> 0:09:58.760 analyzing user behavior to get the biggest return on investment, 0:09:59.080 --> 0:10:01.800 and it doesn't require using the microphone to do it. 0:10:02.160 --> 0:10:05.000 They can just look at who you are, where you've been, 0:10:05.440 --> 0:10:09.280 both in real life if it's tracking your location and 0:10:09.360 --> 0:10:12.720 on the Internet if it's tracking your browsing, and 0:10:12.800 --> 0:10:15.600 who your friends are. And all of this information combined 0:10:16.000 --> 0:10:19.240 gives Facebook a ton of data about what kind of 0:10:19.280 --> 0:10:21.920 ads to target towards you. Now, on top of that, 0:10:22.200 --> 0:10:26.120 Facebook can purchase information from data brokers to supplement its 0:10:26.120 --> 0:10:29.400 own gargantuan database. There are companies that manage 0:10:29.400 --> 0:10:33.160 stuff like loyalty programs, which also track what you buy. 
0:10:33.360 --> 0:10:36.000 They have to for the loyalty programs to work, and 0:10:36.040 --> 0:10:39.400 those purchases are linked to you as a person. They know, Oh, 0:10:39.480 --> 0:10:42.480 Jonathan goes to Starbucks all the time and he always 0:10:42.480 --> 0:10:45.520 gets those Nitro cold brews, so let's put an ad 0:10:46.000 --> 0:10:49.720 that targets him based on that information. Now, that data 0:10:49.800 --> 0:10:51.920 isn't just being used to help you get the best 0:10:52.200 --> 0:10:56.080 deal on whatever it happens to be. That information is valuable. 0:10:56.559 --> 0:11:00.480 So companies that manage these loyalty programs can and do 0:11:00.840 --> 0:11:03.600 buy and sell that data. You know, our spending 0:11:03.640 --> 0:11:07.400 habits are part of this sort of encyclopedia entry about 0:11:07.400 --> 0:11:11.080 our interests, priorities, and behaviors. Now, none of this needs 0:11:11.200 --> 0:11:15.200 to use a microphone to spy on us. So in 0:11:15.240 --> 0:11:17.800 the case of seeing that snack food pop up on 0:11:17.800 --> 0:11:20.480 the Facebook feed, it could simply be that you exhibit 0:11:20.559 --> 0:11:23.520 behaviors similar to ones that people who have bought that 0:11:23.600 --> 0:11:26.200 snack food tend to have as well. You've liked the 0:11:26.240 --> 0:11:29.480 same sort of pages. You may even have a lot 0:11:29.520 --> 0:11:32.080 of friends who have already bought this stuff. You may 0:11:32.120 --> 0:11:34.959 live in a region where it has recently been introduced. 0:11:35.360 --> 0:11:37.600 These are the kinds of points of data that Facebook 0:11:37.679 --> 0:11:39.320 might use in order to serve that ad up to 0:11:39.360 --> 0:11:41.840 you that have nothing to do with your microphone. 
So 0:11:41.880 --> 0:11:44.640 you got the ad not because you talked about the 0:11:44.640 --> 0:11:47.760 snack food, but because Facebook has sussed out you're the 0:11:47.760 --> 0:11:50.640 type of person who would like that snack food because, 0:11:51.400 --> 0:11:54.360 spoiler alert, you're not as special as you think you are, 0:11:54.880 --> 0:11:57.600 and I'm not as special as I think I am. 0:11:57.640 --> 0:12:00.080 Now you could argue, and I would agree with you 0:12:00.160 --> 0:12:03.480 on this, that what Facebook is doing is at least 0:12:03.559 --> 0:12:06.520 as creepy as listening in on a microphone, perhaps even 0:12:06.600 --> 0:12:10.760 more so. Facebook has filed patents that focus on technologies 0:12:10.840 --> 0:12:13.200 meant to predict where you're going to go next 0:12:13.559 --> 0:12:16.400 based on your history of location data. So, in other words, 0:12:16.640 --> 0:12:19.160 Facebook is trying to figure out where you're going to 0:12:19.240 --> 0:12:23.000 go before you go there. And it's not just you, 0:12:23.160 --> 0:12:25.680 it's all the people you know who are using Facebook, 0:12:25.720 --> 0:12:29.440 too. And so it's not just predicting where you'll go, 0:12:30.120 --> 0:12:33.600 it's also predicting which people you may be running into, 0:12:33.679 --> 0:12:35.800 because it's predicting those people are going to go to 0:12:35.840 --> 0:12:38.560 that same place and whether or not you might encounter 0:12:38.679 --> 0:12:41.199 one another. It can also use that to make suggestions 0:12:41.240 --> 0:12:44.480 to add people on Facebook who are going to those 0:12:44.520 --> 0:12:48.240 same places so that they become your friends online. Now 0:12:48.240 --> 0:12:51.400 why does Facebook care who your friends are? Because the 0:12:51.440 --> 0:12:55.120 more people who use Facebook and the more interconnected they become, 0:12:55.640 --> 0:12:59.480 the more useful the information they generate for Facebook. 
That 0:12:59.720 --> 0:13:03.640 ends up becoming more valuable to the company. So 0:13:05.040 --> 0:13:07.480 it is pretty creepy and invasive, and it doesn't have 0:13:07.520 --> 0:13:10.439 to use the microphone. But when we come back, I'll 0:13:10.440 --> 0:13:13.040 talk a bit more about these sound activated features and 0:13:13.080 --> 0:13:15.439 what's actually going on, because there is some stuff we've 0:13:15.480 --> 0:13:17.760 got to be worried about. But first, let's take a 0:13:17.880 --> 0:13:28.240 quick break. When I opened this show, I talked about 0:13:28.240 --> 0:13:30.920 how my phone could listen in on music and identify 0:13:31.000 --> 0:13:34.320 the song even when the phone was in its locked mode. 0:13:34.800 --> 0:13:38.200 Now that's because I have a Pixel 2 XL phone. 0:13:38.240 --> 0:13:41.839 It's an Android phone. It's actually a flagship Google phone, 0:13:42.160 --> 0:13:45.400 and there's a feature on the Pixel 2 that's called 0:13:45.640 --> 0:13:48.560 Now Playing. You have to activate this feature, you have 0:13:48.600 --> 0:13:51.679 to choose to opt into it. So I want to make 0:13:51.720 --> 0:13:54.679 that clear. I chose to activate this feature. It's not 0:13:54.760 --> 0:13:59.240 just active by default, and with it active, the phone 0:13:59.240 --> 0:14:01.920 can identify music that's playing, and it can tell me 0:14:01.960 --> 0:14:04.720 the title even when the phone is in its locked position. 0:14:04.800 --> 0:14:08.360 So what gives? Well, this is not as creepy and 0:14:08.440 --> 0:14:12.040 invasive as it sounds at first glance, because this feature, 0:14:12.480 --> 0:14:16.480 and this is incredible to me, is actually entirely local to 0:14:16.600 --> 0:14:21.320 the Pixel 2 phones. It works on the phone itself. 0:14:21.360 --> 0:14:24.320 It's not consulting the cloud at all, it's not sending 0:14:24.360 --> 0:14:28.760 any information. So how can that be possible? 
How can 0:14:29.320 --> 0:14:32.400 all this information exist on the phone already? Well, let's 0:14:32.440 --> 0:14:35.960 boil it down. First, if you've ever played with any 0:14:36.000 --> 0:14:40.920 digital sound recording software, you've likely seen sound recorded as 0:14:40.920 --> 0:14:44.880 a waveform, a visualization of sound, and typically it's 0:14:44.880 --> 0:14:47.120 pretty simple stuff. Like if you're using a very basic 0:14:47.240 --> 0:14:51.920 sound recording system, you're mostly looking at changes in amplitude, 0:14:52.280 --> 0:14:55.119 or volume, in other words. So you see a continuous 0:14:55.200 --> 0:14:57.520 series of peaks and valleys over the course of a 0:14:57.560 --> 0:15:02.200 sound recording. Those represent the loudest and the quietest parts 0:15:02.240 --> 0:15:05.200 of the recording, the changes in volume. You can also 0:15:05.240 --> 0:15:09.480 graph frequency or pitch, and you can, if you zoom 0:15:09.520 --> 0:15:12.480 way in, see shapes in the waveform that indicate 0:15:12.480 --> 0:15:17.080 specific phonetics and sounds. Anyone who has worked in audio 0:15:17.240 --> 0:15:20.760 editing for a while can identify at a glance certain 0:15:20.800 --> 0:15:26.000 distinctive sounds. Tari, my producer, can probably tell you just 0:15:26.160 --> 0:15:29.520 by looking at a waveform of my recording which moments 0:15:29.560 --> 0:15:34.400 represent the irritating mouth sounds she removes before publishing an episode. 0:15:35.080 --> 0:15:37.680 It doesn't take long before you can do this yourself. 0:15:38.040 --> 0:15:40.560 It's actually pretty easy to identify, say, like, a 0:15:40.640 --> 0:15:46.000 hi-hat cymbal in a music recording, because it's very distinctive. 
Now, 0:15:46.080 --> 0:15:49.200 that means that songs have these distinctive features, like a 0:15:49.240 --> 0:15:53.400 fingerprint, that represent the sound of the song, and if 0:15:53.440 --> 0:15:56.800 you can recognize the fingerprint, you can identify the song 0:15:57.040 --> 0:15:59.600 even if you're not listening to the song at that moment. 0:16:00.040 --> 0:16:03.000 And you could look at a printout of a 0:16:03.000 --> 0:16:06.280 waveform of a song and you can try and 0:16:06.360 --> 0:16:10.760 match it against a library of printouts. That's essentially 0:16:10.840 --> 0:16:14.280 what the Pixel 2 is doing. The program runs in 0:16:14.320 --> 0:16:17.960 the background. It activates when the sound profile indicates that 0:16:18.000 --> 0:16:22.160 there's music present, so it then analyzes the sound that's 0:16:22.160 --> 0:16:24.800 coming in through the microphone and it creates one of 0:16:24.800 --> 0:16:28.400 these digital fingerprints that I was just describing. Then, just 0:16:28.440 --> 0:16:31.040 like you would with a crime scene fingerprint, the Pixel 0:16:31.080 --> 0:16:34.760 2 will compare the digital analysis of the song that's 0:16:34.760 --> 0:16:38.560 playing against a local database on the phone of fingerprints 0:16:38.600 --> 0:16:42.640 that represent thousands of popular songs for your region. Now 0:16:42.680 --> 0:16:45.920 exactly how many hasn't really been released, but supposedly it's in 0:16:45.960 --> 0:16:49.560 the tens of thousands of songs range. And if the 0:16:49.560 --> 0:16:51.920 Pixel 2 finds a match between the song that is 0:16:51.960 --> 0:16:55.200 currently playing and the one that's in the database, it 0:16:55.280 --> 0:16:58.200 returns the result. This works even if the phone has 0:16:58.200 --> 0:17:01.840 cellular and WiFi data turned off, because again it's all local. 
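The fingerprint-and-lookup idea can be sketched in a few lines. This is a toy illustration and not Google's actual Now Playing algorithm: it assumes the audio has already been reduced to a list of (time index, frequency bin) peaks, and the song names, peak values, `fan_out` and `min_overlap` are all made up for the example.

```python
def fingerprint(peaks, fan_out=3):
    """Turn (time_index, frequency_bin) peaks into a set of hashes.

    Each hash pairs a peak with one of the next few peaks and stores the two
    frequency bins plus the time gap between them, so the result does not
    depend on where in the song the snippet starts.
    """
    peaks = sorted(peaks)
    hashes = set()
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1 : i + 1 + fan_out]:
            hashes.add((f1, f2, t2 - t1))
    return hashes


def best_match(snippet_peaks, database, min_overlap=3):
    """Return the title whose stored fingerprint shares the most hashes, or None."""
    query = fingerprint(snippet_peaks)
    best_title, best_score = None, 0
    for title, stored in database.items():
        score = len(query & stored)
        if score > best_score:
            best_title, best_score = title, score
    return best_title if best_score >= min_overlap else None


# Hypothetical on-device database: title -> precomputed fingerprint.
database = {
    "song_a": fingerprint([(0, 10), (1, 25), (2, 40), (3, 25), (4, 60), (5, 10), (6, 33)]),
    "song_b": fingerprint([(0, 12), (1, 12), (2, 50), (3, 70), (4, 50), (5, 12), (6, 90)]),
}

# A snippet from the middle of song_a, heard at an arbitrary later time.
snippet = [(100, 40), (101, 25), (102, 60), (103, 10)]
print(best_match(snippet, database))
```

Because only the compact fingerprints need to be stored, not the songs themselves, a database covering tens of thousands of tracks can plausibly live on the phone, which is what makes the offline lookup described above workable.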
0:17:02.440 --> 0:17:06.480 Now, the Now Playing feature doesn't run constantly because that 0:17:06.520 --> 0:17:10.119 would drain battery life like crazy. Instead, it samples the 0:17:10.160 --> 0:17:14.600 audio approximately every sixty seconds, and it takes time to 0:17:14.680 --> 0:17:17.560 match a song to an entry in the database. The 0:17:17.600 --> 0:17:20.959 cleaner the audio, in other words, the less background noise 0:17:21.040 --> 0:17:24.800 and less interference that's present, the faster this process tends 0:17:24.800 --> 0:17:28.440 to be. This means that when songs transition from one 0:17:28.480 --> 0:17:31.200 song to another, it can take a little bit before 0:17:31.240 --> 0:17:33.879 the phone registers the change. It all depends on the 0:17:33.920 --> 0:17:38.040 acoustic quality of the environment and where in this sampling 0:17:38.160 --> 0:17:42.440 cycle the phone is at any given time. So that's 0:17:42.480 --> 0:17:45.840 not quite as creepy because everything's local on the device. 0:17:45.920 --> 0:17:49.159 It's not sending any data out anywhere else. It's not 0:17:49.280 --> 0:17:52.240 listening to what I'm listening to and alerting Google 0:17:52.400 --> 0:17:55.359 to let them know, hey, Jonathan's once again listening to 0:17:55.400 --> 0:17:59.960 the soundtrack to Be More Chill, which would be an 0:18:00.040 --> 0:18:03.000 accurate suggestion that it would make because I do listen 0:18:03.040 --> 0:18:05.840 to that a lot. Anyway, you can use this feature 0:18:06.520 --> 0:18:09.560 to learn more about the track, the artist, the album, 0:18:09.600 --> 0:18:13.320 including potentially purchasing that music. And those features do connect 0:18:13.359 --> 0:18:16.679 to the outside world through WiFi or cellular connections, but 0:18:16.760 --> 0:18:20.639 that requires an extra step on the part of the user. 
Also, 0:18:20.680 --> 0:18:23.520 Google pushes out updates to this database with the most 0:18:23.520 --> 0:18:27.560 popular songs, and these are regionalized to reflect the country 0:18:27.560 --> 0:18:31.240 you're in, because you're less likely to run into, say, 0:18:31.600 --> 0:18:35.320 a Peruvian pop song when you're in Scotland. The push 0:18:35.440 --> 0:18:39.320 updates do happen over WiFi or cellular connections. But 0:18:39.960 --> 0:18:42.920 this is just the reference data that analyzed music 0:18:42.960 --> 0:18:47.080 gets compared against. An app like Shazam, on the other hand, 0:18:47.520 --> 0:18:50.400 connects to the cloud, but you also have to activate 0:18:50.440 --> 0:18:52.760 the app to have it listen to the audio, so 0:18:53.160 --> 0:18:56.439 it's a user choice to have the app listen. So 0:18:56.480 --> 0:18:59.040 this is more like a push to talk device, except 0:18:59.040 --> 0:19:02.439 it's push to listen. Shazam is also analyzing music to 0:19:02.480 --> 0:19:05.399 suss out a digital fingerprint for the audio, but it 0:19:05.480 --> 0:19:09.480 can compare the sampled audio against a much larger database 0:19:09.800 --> 0:19:13.239 consisting of millions of songs, rather than the tens of 0:19:13.280 --> 0:19:16.439 thousands you would find on the Pixel 2 Now Playing feature. 0:19:17.040 --> 0:19:20.320 More importantly, I think it's fair to say this isn't 0:19:20.359 --> 0:19:23.679 a creepy use of the technology, since the listening feature 0:19:23.760 --> 0:19:27.240 only activates on the user's command rather than just being 0:19:27.320 --> 0:19:30.320 on by default. Now, this isn't that much different than 0:19:30.359 --> 0:19:34.440 what virtual assistants are doing when you use them. 
Clearly, 0:19:35.000 --> 0:19:38.359 the microphone on a virtual assistant like Google Home or 0:19:38.440 --> 0:19:41.960 Siri or whatever has to be active all the time, 0:19:42.040 --> 0:19:44.879 otherwise you wouldn't get a response when you used whatever 0:19:44.920 --> 0:19:48.800 the keyword or phrase was to activate the assistant. I'm 0:19:48.800 --> 0:19:52.440 going to try and avoid saying any of those phrases, 0:19:52.520 --> 0:19:54.399 by the way, because I don't want those of you 0:19:54.520 --> 0:19:57.280 who have those devices to deal with the frustration of 0:19:57.320 --> 0:20:01.200 them going off in response to something I say. Now, 0:20:01.200 --> 0:20:05.000 those words or phrases have a specific sound, just like 0:20:05.240 --> 0:20:09.040 music does. In this case, we're talking about phonemes, which 0:20:09.040 --> 0:20:12.440 are recognizable sounds found in language. So in English there 0:20:12.480 --> 0:20:16.560 are forty four phonemes. The order and combination of those 0:20:16.560 --> 0:20:19.560 phonemes are the key. So if you say something that 0:20:19.680 --> 0:20:23.000 has those phonemes in the right order, or if it's 0:20:23.119 --> 0:20:26.440 close enough, if it's in a noisy environment, this can 0:20:26.480 --> 0:20:30.560 activate the virtual assistant. It's like a key fitting into 0:20:30.600 --> 0:20:33.640 a lock. Now, if you're saying other stuff, it's like 0:20:33.680 --> 0:20:37.000 the wrong key is inserted and nothing happens. It's only 0:20:37.000 --> 0:20:39.720 when you say something that fits the lock that the 0:20:39.760 --> 0:20:45.000 assistant activates. This process continues after activation. When you talk 0:20:45.080 --> 0:20:48.960 to the virtual assistant, it analyzes your speech by phonemes. 0:20:49.920 --> 0:20:53.000 Software processes those to figure out what words you are 0:20:53.080 --> 0:20:56.520 actually saying. 
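The key-in-a-lock idea, including the "close enough" allowance for a noisy room, can be sketched as a sliding window over a stream of phonemes. This is a simplified illustration and not how any particular assistant is implemented: the wake phrase ("hey device"), its phoneme spelling, and the one-mismatch tolerance are all invented for the example.

```python
# Hypothetical wake phrase "hey device", written as rough phoneme symbols.
WAKE_PHONEMES = ["HH", "EY", "D", "IH", "V", "AY", "S"]


def mismatches(window, target):
    """Count positions where the heard phoneme differs from the expected one."""
    return sum(heard != expected for heard, expected in zip(window, target))


def find_wake(stream, target=WAKE_PHONEMES, tolerance=1):
    """Return the index where the wake phrase starts in `stream`, or -1.

    `tolerance` is how many phonemes may be misheard and still count as a
    match, standing in for the "close enough" case in a noisy environment.
    """
    n = len(target)
    for start in range(len(stream) - n + 1):
        if mismatches(stream[start : start + n], target) <= tolerance:
            return start
    return -1


clean = ["AH", "HH", "EY", "D", "IH", "V", "AY", "S", "P", "L"]
noisy = ["AH", "HH", "EY", "D", "EH", "V", "AY", "S", "P", "L"]  # one phoneme misheard
other = ["G", "UH", "D", "M", "AO", "R", "N", "IH", "NG"]        # unrelated speech

print(find_wake(clean), find_wake(noisy), find_wake(other))
```

Everything that does not fit the lock simply falls through the loop and returns -1, which mirrors the point above: the device hears the sound, but nothing happens unless the pattern matches.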
Well, for the first step, that is, because 0:20:56.560 --> 0:21:00.199 it's actually more complicated than that. So, for example, there 0:21:00.240 --> 0:21:03.440 are homonyms. These are words that have a similar sound 0:21:03.760 --> 0:21:08.480 but different meanings and often different spellings. An easy example 0:21:08.600 --> 0:21:12.080 is the number eight and the past tense of to eat, 0:21:12.520 --> 0:21:16.520 such as I ate an entire bowl of cao. Mm 0:21:16.600 --> 0:21:22.840 hmm okay. So those two words, eight and ate, sound 0:21:22.920 --> 0:21:26.199 exactly the same, but they have different meanings. Now that 0:21:26.240 --> 0:21:29.400 means the software can't rely on just the sounds you're 0:21:29.440 --> 0:21:32.000 making when you speak to figure out what you mean. 0:21:32.480 --> 0:21:36.120 It has to actually analyze syntax and context and make judgment 0:21:36.160 --> 0:21:38.960 calls about what you actually mean when you say 0:21:38.960 --> 0:21:43.040 these things. Sometimes it gets things right, sometimes it gets 0:21:43.040 --> 0:21:45.840 things wrong. But don't be too hard on it, because 0:21:46.160 --> 0:21:50.000 humans misunderstand other humans all the time. Even when we 0:21:50.040 --> 0:21:52.719 are both communicating in the same language, we 0:21:52.760 --> 0:21:56.600 can misunderstand each other. Now, this is still just the 0:21:56.680 --> 0:22:00.000 first step. You can think of this as essentially speech 0:22:00.000 --> 0:22:02.960 to text. From there, you have to determine what 0:22:03.160 --> 0:22:06.320 is actually being asked by the speaker, what is the 0:22:06.400 --> 0:22:11.600 intent behind the words. If someone speaks French very slowly 0:22:11.640 --> 0:22:14.199 to me, I might be able to spell out what 0:22:14.359 --> 0:22:17.400 is being said phonetically, but that doesn't mean I understand 0:22:17.440 --> 0:22:21.360 the actual content of what was spoken. 
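To make the eight-versus-ate problem concrete, here is a deliberately tiny sketch of picking a spelling from context. It is only an illustration: real speech-to-text systems rely on statistical language models rather than hand-written word lists, and the `EYT` placeholder token and both cue lists are assumptions made up for this example.

```python
# Hypothetical context cues; a real system would learn these from data.
SUBJECT_PRONOUNS = {"i", "we", "you", "he", "she", "they"}
NUMBER_CUES = {"at", "o'clock", "people", "dollars", "number"}

SOUND = "EYT"  # placeholder for the ambiguous sound shared by "eight" and "ate"


def resolve(words, index):
    """Choose a spelling for the ambiguous sound at words[index] from its neighbours."""
    prev_word = words[index - 1].lower() if index > 0 else ""
    next_word = words[index + 1].lower() if index + 1 < len(words) else ""
    if prev_word in SUBJECT_PRONOUNS:
        return "ate"    # a subject right before the sound suggests a verb
    if prev_word in NUMBER_CUES or next_word in NUMBER_CUES:
        return "eight"  # number-like surroundings suggest the numeral
    return "eight"      # arbitrary fallback when the context gives no hint


def transcribe(words):
    """Replace every ambiguous sound with the spelling chosen from context."""
    return " ".join(resolve(words, i) if w == SOUND else w for i, w in enumerate(words))


print(transcribe(["I", "EYT", "an", "entire", "bowl"]))
print(transcribe(["meet", "me", "at", "EYT", "o'clock"]))
```

The arbitrary fallback on the last branch is exactly where a system like this gets things wrong, which is why context-free guessing from sound alone is not enough.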
And to complicate matters, 0:22:21.640 --> 0:22:23.560 there are a lot of different ways to ask for 0:22:23.600 --> 0:22:27.199 the same information. I might say what's the weather for 0:22:27.240 --> 0:22:30.280 this week? Or will I need an umbrella today, or 0:22:30.320 --> 0:22:32.879 one of a dozen other ways to inquire about the weather. 0:22:33.359 --> 0:22:36.479 The software has to be able to determine what the 0:22:36.560 --> 0:22:40.960 intent was behind my question, and then there's another step, 0:22:41.280 --> 0:22:45.280 which is matching intent with action. The assistant has to 0:22:45.359 --> 0:22:48.679 respond to my request, and hopefully it does so in 0:22:48.680 --> 0:22:51.320 a way that's relevant to whatever I was asking about 0:22:51.320 --> 0:22:53.840 in the first place. So if I ask my virtual 0:22:53.880 --> 0:22:56.720 assistant for an update on the weather, I'm not going 0:22:56.760 --> 0:22:59.679 to be impressed if it instead tells me about the 0:22:59.720 --> 0:23:03.720 track FAIC or vice versa. And as assistants get connected 0:23:03.760 --> 0:23:08.320 into more systems like security systems, lights, apps, and more, 0:23:08.760 --> 0:23:12.520 the software has to send appropriate commands to these other 0:23:12.600 --> 0:23:16.679 elements to produce the expected results. Now, this is all impressive, 0:23:17.000 --> 0:23:20.040 and because it's impressive, it could be a little scary 0:23:20.160 --> 0:23:23.639 when we think about assistants as hanging on our every word. 0:23:23.760 --> 0:23:27.440 Are they always listening? Are they always paying attention? Now, 0:23:27.480 --> 0:23:30.760 they're always monitoring sound, but they're not doing so in 0:23:30.800 --> 0:23:34.520