WEBVTT - TechStuff Makes Eye Contact with a Robot

0:00:04.400 --> 0:00:07.800
<v Speaker 1>Welcome to TechStuff, a production from iHeartRadio.

0:00:12.039 --> 0:00:14.760
<v Speaker 1>Hey there, and welcome to TechStuff. I'm your host,

0:00:14.880 --> 0:00:18.160
<v Speaker 1>Jonathan Strickland. I'm an executive producer with iHeartRadio,

0:00:18.200 --> 0:00:21.560
<v Speaker 1>and I love all things tech. And Halloween is over

0:00:21.800 --> 0:00:23.639
<v Speaker 1>by the time you hear this. I hope you had

0:00:23.680 --> 0:00:26.560
<v Speaker 1>a happy one. But I still have something that falls

0:00:26.680 --> 0:00:31.200
<v Speaker 1>into the kind of creepy category, at least in my opinion.

0:00:31.760 --> 0:00:34.960
<v Speaker 1>And I discovered this after looking around at tech news

0:00:35.000 --> 0:00:37.920
<v Speaker 1>in general, and I became fascinated by it and figured, hey,

0:00:37.960 --> 0:00:40.839
<v Speaker 1>you know, I haven't done a really focused episode on

0:00:40.880 --> 0:00:44.920
<v Speaker 1>a very specific implementation of technology in a long time,

0:00:45.560 --> 0:00:48.840
<v Speaker 1>so why not do that now? Now, anyone who knows

0:00:48.920 --> 0:00:52.199
<v Speaker 1>me can tell you that I am a sucker for

0:00:52.479 --> 0:00:57.320
<v Speaker 1>Disney Imagineering, which of course is the peculiar twist on

0:00:57.920 --> 0:01:03.560
<v Speaker 1>engineering and innovation that Disney champions, right? The inventiveness and

0:01:03.640 --> 0:01:06.400
<v Speaker 1>the attention to detail impressed me a great deal. Those

0:01:06.440 --> 0:01:11.160
<v Speaker 1>are hallmarks of Disney engineering, or Imagineering. And I've done

0:01:11.200 --> 0:01:14.319
<v Speaker 1>episodes covering various elements that tie into this, from the

0:01:14.400 --> 0:01:18.160
<v Speaker 1>history of EPCOT to how audio animatronics work. And it's

0:01:18.200 --> 0:01:22.440
<v Speaker 1>that last topic I wish to revisit because not long

0:01:22.480 --> 0:01:27.080
<v Speaker 1>ago I read a research paper from Disney Imagineers titled

0:01:27.480 --> 0:01:33.040
<v Speaker 1>"Realistic and Interactive Robot Gaze." That's G-A-Z-E,

0:01:33.720 --> 0:01:36.240
<v Speaker 1>you know, referring to where a person or in this

0:01:36.319 --> 0:01:41.040
<v Speaker 1>case, uh, an object, a robot, appears to be looking.

0:01:41.920 --> 0:01:44.679
<v Speaker 1>And the paper is fascinating and it's available for anyone

0:01:44.720 --> 0:01:47.480
<v Speaker 1>to read for free. So if you find this subject

0:01:47.480 --> 0:01:50.480
<v Speaker 1>matter neat, I really recommend you read it. Now, it

0:01:50.560 --> 0:01:54.280
<v Speaker 1>does get a bit technical. There's some math in there too,

0:01:54.600 --> 0:01:56.880
<v Speaker 1>but for the most part, I think it's a pretty

0:01:56.960 --> 0:02:02.880
<v Speaker 1>accessible paper. The pictures and, good gravy, y'all, the video

0:02:03.520 --> 0:02:08.320
<v Speaker 1>that are connected to this project are the stuff of nightmares,

0:02:09.120 --> 0:02:11.720
<v Speaker 1>but we'll get to that. The heart of the paper

0:02:12.160 --> 0:02:16.400
<v Speaker 1>is all about designing systems so that an audio animatronic

0:02:16.520 --> 0:02:20.800
<v Speaker 1>or just an animatronic figure can make and maintain

0:02:21.000 --> 0:02:24.639
<v Speaker 1>eye contact, or at least appear to, with someone who

0:02:24.720 --> 0:02:27.799
<v Speaker 1>is looking at that figure, an onlooker. So, in other words,

0:02:28.360 --> 0:02:32.360
<v Speaker 1>imagine that there's a Disney attraction at a park, and

0:02:32.720 --> 0:02:35.200
<v Speaker 1>in this attraction you can walk up to a robot.

0:02:35.560 --> 0:02:39.280
<v Speaker 1>It's probably going to be behind like a rail or

0:02:39.320 --> 0:02:41.440
<v Speaker 1>inside a booth or something, so that you can't you know,

0:02:41.960 --> 0:02:45.760
<v Speaker 1>touch it, and the robot notices you looking at it,

0:02:45.800 --> 0:02:48.720
<v Speaker 1>and it looks you in the eye. And then maybe

0:02:48.760 --> 0:02:51.639
<v Speaker 1>you get to chat with the robot and it maintains

0:02:51.639 --> 0:02:54.600
<v Speaker 1>eye contact with you, and occasionally maybe its eyes dart

0:02:54.639 --> 0:02:57.360
<v Speaker 1>around to glance at other stuff that's within its field

0:02:57.400 --> 0:03:00.799
<v Speaker 1>of view, or maybe even indicating that the robot is

0:03:00.840 --> 0:03:03.880
<v Speaker 1>appearing to like take a second to think of a response.

0:03:04.320 --> 0:03:07.080
<v Speaker 1>That's kind of what we're talking about here. And here's

0:03:07.120 --> 0:03:11.480
<v Speaker 1>the thing. This is surprisingly difficult to do, and it's

0:03:11.680 --> 0:03:17.000
<v Speaker 1>extra hard to do without dipping into super unsettling territory.

0:03:17.120 --> 0:03:20.079
<v Speaker 1>So today we're going to learn more about the technology

0:03:20.120 --> 0:03:24.000
<v Speaker 1>and the psychology behind this project, as well as what

0:03:24.160 --> 0:03:28.560
<v Speaker 1>makes it different from earlier audio animatronics, which is honestly

0:03:28.560 --> 0:03:32.000
<v Speaker 1>a good place for us to start. The original audio

0:03:32.040 --> 0:03:36.480
<v Speaker 1>animatronics were essentially puppets. In fact, you could argue that

0:03:36.640 --> 0:03:42.840
<v Speaker 1>all animatronics are ultimately puppets. Each puppet has a certain

0:03:42.920 --> 0:03:46.000
<v Speaker 1>number of degrees of freedom, and that refers to a

0:03:46.080 --> 0:03:49.760
<v Speaker 1>number of independent directions of motion. So let's take a

0:03:49.800 --> 0:03:54.320
<v Speaker 1>simple example. Let's say that a robot's neck only has

0:03:54.400 --> 0:03:56.920
<v Speaker 1>one degree of freedom. Well, that would mean the robot

0:03:57.040 --> 0:03:59.480
<v Speaker 1>might be able to nod its head up and down.

0:04:00.000 --> 0:04:01.520
<v Speaker 1>But if it could do that, it wouldn't be able

0:04:01.520 --> 0:04:03.600
<v Speaker 1>to shake its head or tilt its head, because that

0:04:03.600 --> 0:04:06.840
<v Speaker 1>would be an additional degree of freedom. Or maybe it's

0:04:06.880 --> 0:04:08.920
<v Speaker 1>able to shake its head, but it's not able to

0:04:09.000 --> 0:04:11.920
<v Speaker 1>nod or tilt because it only has that one degree

0:04:11.920 --> 0:04:14.720
<v Speaker 1>of freedom. That one degree is really limiting, and it

0:04:14.760 --> 0:04:20.520
<v Speaker 1>just tells us the full range of directional motions

0:04:21.040 --> 0:04:24.320
<v Speaker 1>that any one joint can do, and we typically talk

0:04:24.360 --> 0:04:27.680
<v Speaker 1>about degrees of freedom with joints to express the range

0:04:27.680 --> 0:04:32.160
<v Speaker 1>of possible motions the, you know, whatever it is can perform.

0:04:32.200 --> 0:04:36.000
<v Speaker 1>The Enchanted Tiki Room at Disneyland was an early example

0:04:36.080 --> 0:04:40.040
<v Speaker 1>of audio animatronic ingenuity. It wasn't the very first use

0:04:40.080 --> 0:04:43.280
<v Speaker 1>of audio animatronics, but it was an early one, and

0:04:43.360 --> 0:04:46.480
<v Speaker 1>when you learn how it worked behind the scenes, it's

0:04:46.560 --> 0:04:50.960
<v Speaker 1>pretty wacky. The various birds, flowers, and other elements in

0:04:50.960 --> 0:04:55.080
<v Speaker 1>the attraction connected to a very complex system, including some

0:04:55.200 --> 0:04:59.800
<v Speaker 1>pneumatic valves. A pneumatic system uses air under pressure to

0:05:00.040 --> 0:05:03.400
<v Speaker 1>do work, so these valves in turn connected to a

0:05:03.480 --> 0:05:08.400
<v Speaker 1>circuit that had thin metal reeds as switches. Now, normally

0:05:08.680 --> 0:05:11.640
<v Speaker 1>the switch would be open, meaning no electricity can flow

0:05:11.680 --> 0:05:14.720
<v Speaker 1>through the circuit and thus provide electricity to open or

0:05:14.839 --> 0:05:19.040
<v Speaker 1>close the valve. But when sounds of a certain frequency

0:05:19.160 --> 0:05:22.719
<v Speaker 1>would play near these reeds, it would cause those reeds

0:05:22.720 --> 0:05:25.240
<v Speaker 1>to vibrate, and you know, depending on the thickness and

0:05:25.320 --> 0:05:28.440
<v Speaker 1>length of the reed, that would determine what frequency of

0:05:28.520 --> 0:05:32.000
<v Speaker 1>sound would most likely get it to start vibrating. Once

0:05:32.000 --> 0:05:34.760
<v Speaker 1>it vibrated, it would close the circuit and thus allow

0:05:34.839 --> 0:05:39.160
<v Speaker 1>power to go through to the respective valve. And every

0:05:39.200 --> 0:05:41.200
<v Speaker 1>bird and flower in the attraction had this sort of

0:05:41.240 --> 0:05:45.279
<v Speaker 1>system where the sounds playing through the sound system would

0:05:45.320 --> 0:05:48.400
<v Speaker 1>actually cause the individual circuits for those birds and flowers

0:05:48.400 --> 0:05:51.919
<v Speaker 1>to activate. So the chirping of the bird, that chirping

0:05:51.920 --> 0:05:54.320
<v Speaker 1>sound was actually the sound that was opening and closing

0:05:54.360 --> 0:05:58.240
<v Speaker 1>the circuit and thus activating the valve that would

0:05:58.279 --> 0:06:01.719
<v Speaker 1>control the bird's beak. And because the figures relied on

0:06:01.760 --> 0:06:04.839
<v Speaker 1>the sound to close the circuit, they were audio animatronics.

0:06:05.480 --> 0:06:08.479
<v Speaker 1>Over the years, Disney would improve on this design, sometimes

0:06:08.520 --> 0:06:12.279
<v Speaker 1>by necessity. So for example, when the Imagineers set out

0:06:12.279 --> 0:06:15.680
<v Speaker 1>to create the attraction Great Moments with Mr. Lincoln,

0:06:16.279 --> 0:06:18.520
<v Speaker 1>they had to come up with new mechanisms to do

0:06:18.640 --> 0:06:22.600
<v Speaker 1>that because pneumatics would not be a good solution. With pneumatics,

0:06:22.600 --> 0:06:25.520
<v Speaker 1>you've got a couple of limitations that you're working with.

0:06:25.600 --> 0:06:29.560
<v Speaker 1>One is that you can't move really heavy stuff effectively

0:06:29.640 --> 0:06:34.159
<v Speaker 1>with pneumatics. Another is that pneumatic pistons tend to move

0:06:34.200 --> 0:06:38.320
<v Speaker 1>really fast. It's hard to do controlled slow movements with pneumatics.

0:06:38.320 --> 0:06:40.760
<v Speaker 1>So it might be okay for something like a bird

0:06:40.760 --> 0:06:44.320
<v Speaker 1>flapping its wings or opening and closing its beak fairly quickly,

0:06:44.720 --> 0:06:47.560
<v Speaker 1>but it's not so great for say, a revered US

0:06:47.640 --> 0:06:51.920
<v Speaker 1>president lifting his hand. But I've covered that in other episodes.

0:06:52.480 --> 0:06:54.880
<v Speaker 1>The really important thing I want to stress is that

0:06:55.000 --> 0:07:00.520
<v Speaker 1>audio animatronic figures have historically been limited to a specific,

0:07:00.920 --> 0:07:05.880
<v Speaker 1>preprogrammed sequence of motions, so calling them puppets is

0:07:06.279 --> 0:07:10.360
<v Speaker 1>fairly appropriate. These are figures that will do the exact

0:07:10.440 --> 0:07:14.160
<v Speaker 1>same sequence of motions until something goes wrong or the

0:07:14.200 --> 0:07:17.600
<v Speaker 1>attraction is shut off for some reason. The pirate in

0:07:17.680 --> 0:07:20.800
<v Speaker 1>Pirates of the Caribbean that is precariously attempting to step

0:07:20.840 --> 0:07:23.960
<v Speaker 1>onto a rowboat is never going to fall into the water.

0:07:24.360 --> 0:07:26.720
<v Speaker 1>He's never going to get into the boat, and he's

0:07:26.760 --> 0:07:29.800
<v Speaker 1>never gonna step back onto the shore. He will continue

0:07:29.960 --> 0:07:34.520
<v Speaker 1>his balancing act until the end of time. And this

0:07:34.600 --> 0:07:37.600
<v Speaker 1>is starting to sound like some sort of Greek myth

0:07:37.680 --> 0:07:40.640
<v Speaker 1>about the afterlife at this point. Now, the reason I'm

0:07:40.640 --> 0:07:43.880
<v Speaker 1>bringing this up, the reason it's important, is that creating

0:07:44.040 --> 0:07:48.520
<v Speaker 1>an animatronic figure that can actually detect an onlooker's gaze

0:07:48.960 --> 0:07:53.200
<v Speaker 1>and return it, making eye contact, can't be totally dedicated

0:07:53.240 --> 0:07:57.520
<v Speaker 1>to following the same set of motions on repeat. There

0:07:57.560 --> 0:08:01.240
<v Speaker 1>has to be some room for variability within it. At

0:08:01.240 --> 0:08:04.840
<v Speaker 1>the same time, Disney's whole gig is to create a show.

0:08:05.440 --> 0:08:09.000
<v Speaker 1>The amusement parks are show business. If you are in

0:08:09.160 --> 0:08:12.040
<v Speaker 1>a public space of one of those parks, like you're

0:08:12.120 --> 0:08:15.240
<v Speaker 1>inside the confines of the park itself, walking down

0:08:15.280 --> 0:08:19.520
<v Speaker 1>Main street or whatever, you are on stage. The employees

0:08:19.520 --> 0:08:23.400
<v Speaker 1>are called cast members, and shows, while they can have

0:08:23.480 --> 0:08:27.040
<v Speaker 1>some variation in them, are supposed to follow a general flow.

0:08:27.160 --> 0:08:30.720
<v Speaker 1>They follow a script. And so the Imagineers were working

0:08:30.720 --> 0:08:33.680
<v Speaker 1>on creating a figure that would follow a scripted set

0:08:33.679 --> 0:08:36.280
<v Speaker 1>of behaviors, but would have the freedom to throw in

0:08:36.360 --> 0:08:39.840
<v Speaker 1>stuff like eye contact now and then. The figure, in

0:08:39.880 --> 0:08:44.600
<v Speaker 1>a way, would be able to improvise. It's jazz, baby.

0:08:44.840 --> 0:08:46.839
<v Speaker 1>The tune is more or less set, but how you

0:08:46.920 --> 0:08:49.960
<v Speaker 1>go through it allows for a lot of variation. For

0:08:50.040 --> 0:08:53.040
<v Speaker 1>the purposes of this work, the team relied on an

0:08:53.040 --> 0:08:56.800
<v Speaker 1>animatronic bust. Now we've kind of dropped the audio at

0:08:56.800 --> 0:09:01.480
<v Speaker 1>this point. Modern animatronic figures are not really driven by

0:09:01.640 --> 0:09:06.520
<v Speaker 1>audio signals anymore. They're driven by circuitry and sophisticated computer

0:09:06.640 --> 0:09:11.720
<v Speaker 1>systems and programs. Though to be fair, they still often

0:09:11.760 --> 0:09:15.120
<v Speaker 1>are referred to as audio animatronic. But you really need

0:09:15.200 --> 0:09:18.240
<v Speaker 1>to see a picture of this thing. I'll do my

0:09:18.280 --> 0:09:21.080
<v Speaker 1>best to describe it, but really you should search this

0:09:21.240 --> 0:09:27.600
<v Speaker 1>Disney, uh, interactive gaze animatronic, because, hoo boy. So imagine

0:09:27.679 --> 0:09:32.000
<v Speaker 1>the V-shaped torso of a bust sculpture, right? It's

0:09:32.080 --> 0:09:34.640
<v Speaker 1>very narrow at the bottom, and it widens up to

0:09:34.679 --> 0:09:38.360
<v Speaker 1>the shoulders. It's clad in a white button up shirt,

0:09:38.640 --> 0:09:40.880
<v Speaker 1>you know, kind of like an Oxford shirt or business shirt.

0:09:41.880 --> 0:09:44.800
<v Speaker 1>It does have shoulders, but does not have arms. It

0:09:44.880 --> 0:09:48.920
<v Speaker 1>has a head, good golly, it has a head. The

0:09:49.000 --> 0:09:52.560
<v Speaker 1>head of this figure has a sort of plastic skull,

0:09:53.280 --> 0:09:56.680
<v Speaker 1>though it's kind of more like a plastic mask than

0:09:56.960 --> 0:10:00.199
<v Speaker 1>a human skull. It doesn't look like a skeleton skull.

0:10:00.679 --> 0:10:04.200
<v Speaker 1>It does have eyes, it's even got eyelids, and it's

0:10:04.240 --> 0:10:08.719
<v Speaker 1>got teeth. And looking at this thing is a little unsettling.

0:10:09.360 --> 0:10:12.920
<v Speaker 1>And that's before it even makes eye contact with you. Now,

0:10:13.000 --> 0:10:15.840
<v Speaker 1>why would you want to make something like this be

0:10:15.960 --> 0:10:18.920
<v Speaker 1>able to make eye contact in the first place. Well,

0:10:18.960 --> 0:10:24.280
<v Speaker 1>eye contact is an important social signal. It shows mutual acknowledgement,

0:10:24.360 --> 0:10:27.360
<v Speaker 1>and it can lead us to projecting certain things upon

0:10:27.400 --> 0:10:31.199
<v Speaker 1>the person or animal that's making eye contact with us.

0:10:31.480 --> 0:10:34.760
<v Speaker 1>We tend to perceive such creatures as possessing a certain

0:10:34.760 --> 0:10:38.960
<v Speaker 1>amount of intelligence and sincerity. For example, when I make

0:10:39.040 --> 0:10:42.360
<v Speaker 1>eye contact with my dog Ti Bolt, I perceive him

0:10:42.400 --> 0:10:46.440
<v Speaker 1>to be intelligent and alert and loving. Now I have

0:10:46.520 --> 0:10:49.400
<v Speaker 1>no way of knowing what is really going on in

0:10:49.520 --> 0:10:53.160
<v Speaker 1>his doggy mind. I suspect it's probably more along the

0:10:53.200 --> 0:10:55.760
<v Speaker 1>lines of is the bald man about to give me

0:10:55.840 --> 0:10:59.120
<v Speaker 1>a treat? I should pay attention, But I like to

0:10:59.160 --> 0:11:03.120
<v Speaker 1>think of it as sincere love. Now, as the paper states, quote,

0:11:03.640 --> 0:11:07.480
<v Speaker 1>given the importance of gaze in social interactions, as well

0:11:07.520 --> 0:11:11.280
<v Speaker 1>as its ability to communicate states and shape perceptions, it

0:11:11.400 --> 0:11:14.480
<v Speaker 1>is apparent that gaze can function as a significant

0:11:14.520 --> 0:11:19.160
<v Speaker 1>tool for an interactive robot character, end quote. And I

0:11:19.160 --> 0:11:21.840
<v Speaker 1>can totally grok that. I imagine what it might be

0:11:21.920 --> 0:11:25.160
<v Speaker 1>like to a child who's going to Disney World or

0:11:25.240 --> 0:11:28.400
<v Speaker 1>Disneyland for the very first time and going to a

0:11:28.520 --> 0:11:32.280
<v Speaker 1>ride or an attraction where there's an animatronic figure, perhaps

0:11:32.400 --> 0:11:35.400
<v Speaker 1>one that looks like a famous Disney character, and it

0:11:35.480 --> 0:11:38.560
<v Speaker 1>makes eye contact with that child, maybe it even speaks

0:11:38.600 --> 0:11:40.839
<v Speaker 1>to the child, and maybe it can respond to the

0:11:40.920 --> 0:11:44.400
<v Speaker 1>child if the child speaks back. That sort of interaction

0:11:44.720 --> 0:11:46.240
<v Speaker 1>would have been the kind of stuff that would have

0:11:46.240 --> 0:11:49.560
<v Speaker 1>stuck with me as a kid well into adulthood, and

0:11:49.600 --> 0:11:52.240
<v Speaker 1>I feel confident about that because I have a lot

0:11:52.280 --> 0:11:56.880
<v Speaker 1>of memories of the seemingly magical moments I've experienced at

0:11:56.920 --> 0:12:00.400
<v Speaker 1>Disney with far more primitive technologies that were in

0:12:00.440 --> 0:12:03.040
<v Speaker 1>the Disney parks when I first started visiting them in

0:12:03.080 --> 0:12:06.400
<v Speaker 1>the nineteen seventies, so I can certainly see the show

0:12:06.559 --> 0:12:09.800
<v Speaker 1>need for this sort of development. But there are numerous

0:12:09.920 --> 0:12:12.640
<v Speaker 1>challenges that stand in the way of achieving this goal,

0:12:12.760 --> 0:12:16.880
<v Speaker 1>and they fall into different broad categories. Perhaps the easiest

0:12:17.000 --> 0:12:20.079
<v Speaker 1>set of challenges to conquer is actually the electromechanical

0:12:20.240 --> 0:12:23.360
<v Speaker 1>side of things. That is, the actual mechanisms that you're

0:12:23.360 --> 0:12:27.240
<v Speaker 1>going to use to create these effects, the servos and

0:12:27.280 --> 0:12:29.920
<v Speaker 1>the motors and the other components that will create the

0:12:29.960 --> 0:12:33.720
<v Speaker 1>actual motions that will translate into the robot making eye

0:12:33.720 --> 0:12:38.920
<v Speaker 1>contact or behaving in otherwise realistic ways. That's one of

0:12:38.960 --> 0:12:42.280
<v Speaker 1>the set of challenges, but there are others. One is

0:12:42.280 --> 0:12:45.480
<v Speaker 1>giving the robot the ability to detect the gaze of

0:12:45.600 --> 0:12:48.160
<v Speaker 1>onlookers in the first place. There has to be some

0:12:48.240 --> 0:12:52.880
<v Speaker 1>sort of face recognition and maybe even eye tracking technology

0:12:52.960 --> 0:12:56.600
<v Speaker 1>so that the robot looks at the right spot. So

0:12:56.640 --> 0:12:59.360
<v Speaker 1>the electromechanical parts have to work correctly, but so

0:12:59.400 --> 0:13:03.600
<v Speaker 1>does the robot vision or perception. Otherwise the robot is

0:13:03.600 --> 0:13:06.199
<v Speaker 1>going to look in the wrong spot, perhaps staring off

0:13:06.240 --> 0:13:09.560
<v Speaker 1>to one side or above or below an onlooker's eye

0:13:09.559 --> 0:13:14.160
<v Speaker 1>contact or attempt at eye contact. Another challenge would be

0:13:14.200 --> 0:13:16.440
<v Speaker 1>on the programming side. You have to figure out how

0:13:16.440 --> 0:13:18.719
<v Speaker 1>to determine who the figure is going to look at.

0:13:19.000 --> 0:13:22.199
<v Speaker 1>You also have to figure out how long the robot

0:13:22.200 --> 0:13:26.120
<v Speaker 1>will look at somebody and what could distract the robot,

0:13:26.240 --> 0:13:29.320
<v Speaker 1>and whether or not the robot would return to looking at,

0:13:29.440 --> 0:13:32.240
<v Speaker 1>you know, the first person, or maybe look at a

0:13:32.280 --> 0:13:35.040
<v Speaker 1>second person, or maybe look at something else entirely. You

0:13:35.080 --> 0:13:38.319
<v Speaker 1>have to solve the challenge of the program and prioritize

0:13:38.360 --> 0:13:41.480
<v Speaker 1>the order of operations so that the robot behaves in

0:13:41.480 --> 0:13:43.920
<v Speaker 1>a way that makes sense, as opposed to a robot

0:13:43.920 --> 0:13:47.679
<v Speaker 1>that's just you know, reacting to all visual stimuli in

0:13:47.720 --> 0:13:51.880
<v Speaker 1>a random way, which would be at the very least disconcerting.

0:13:52.640 --> 0:13:54.480
<v Speaker 1>And then we get to something that's a bit harder

0:13:54.520 --> 0:13:57.760
<v Speaker 1>to define than degrees of freedom or range of motion

0:13:58.160 --> 0:14:02.120
<v Speaker 1>or the hierarchy of programming, and that's human psychology. Now,

0:14:02.160 --> 0:14:05.559
<v Speaker 1>as the paper points out, eye contact is an important

0:14:05.600 --> 0:14:09.160
<v Speaker 1>social cue for most of us, but there are a

0:14:09.160 --> 0:14:11.960
<v Speaker 1>whole range of humans out there, right? For people who

0:14:11.960 --> 0:14:15.600
<v Speaker 1>have autism, eye contact can be a really challenging task,

0:14:16.160 --> 0:14:19.040
<v Speaker 1>and it tends to make people who have this type

0:14:19.040 --> 0:14:21.880
<v Speaker 1>of autism. It makes their lives a little more difficult

0:14:22.040 --> 0:14:26.520
<v Speaker 1>or complicated as a result. It's something that people, some

0:14:26.560 --> 0:14:29.280
<v Speaker 1>people anyway, have to consciously deal with. They have to

0:14:30.040 --> 0:14:32.720
<v Speaker 1>remember to do this and work at it. It's not

0:14:32.920 --> 0:14:35.120
<v Speaker 1>it's not a natural behavior for them. So this is

0:14:35.160 --> 0:14:37.320
<v Speaker 1>something that can be tricky for human beings, let alone

0:14:37.400 --> 0:14:41.240
<v Speaker 1>for robots. Now, while eye contact can help create a

0:14:41.280 --> 0:14:44.320
<v Speaker 1>sense of sincerity and interest, it can also shift over

0:14:44.360 --> 0:14:48.560
<v Speaker 1>into more unpleasant territory, such as a sense of predatory

0:14:48.680 --> 0:14:52.840
<v Speaker 1>intent, or, as a comedian I once saw said, there's

0:14:52.840 --> 0:14:55.840
<v Speaker 1>a fine line between the casual eye contact of a

0:14:55.880 --> 0:14:59.040
<v Speaker 1>friend and the cold stare of a serial killer. He

0:14:59.120 --> 0:15:01.960
<v Speaker 1>was specifically talking about trying to navigate the tricky

0:15:02.040 --> 0:15:05.040
<v Speaker 1>territory of approaching people in order to get to know them.

0:15:05.400 --> 0:15:07.400
<v Speaker 1>But I think the meaning could be used for lots

0:15:07.400 --> 0:15:11.160
<v Speaker 1>of scenarios, including an encounter with a robotic figure. And

0:15:11.240 --> 0:15:15.640
<v Speaker 1>along with that is the issue of the uncanny valley,

0:15:15.680 --> 0:15:19.120
<v Speaker 1>which I have touched on in previous episodes. I'm not

0:15:19.200 --> 0:15:21.920
<v Speaker 1>sure if I've ever actually talked about the origin of

0:15:21.960 --> 0:15:25.400
<v Speaker 1>the phrase, however. A professor at the Tokyo Institute of

0:15:25.400 --> 0:15:28.960
<v Speaker 1>Technology named Masahiro Mori coined this phrase in the

0:15:29.040 --> 0:15:33.800
<v Speaker 1>nineteen seventies to describe a pretty odd phenomenon. As robots

0:15:33.880 --> 0:15:37.680
<v Speaker 1>become more human-like or more lifelike in general, they

0:15:37.720 --> 0:15:41.640
<v Speaker 1>become more appealing to us, but only up to a point,

0:15:42.120 --> 0:15:44.560
<v Speaker 1>and once they get to that point and go beyond it,

0:15:45.440 --> 0:15:51.040
<v Speaker 1>our reception of these robots plunges into the uncanny valley.

0:15:51.120 --> 0:15:54.680
<v Speaker 1>The valley in this case is how humans react to

0:15:54.880 --> 0:15:57.440
<v Speaker 1>the robot. This also applies to other stuff, like CGI

0:15:57.640 --> 0:16:00.920
<v Speaker 1>characters, for instance. In other words, a

0:16:01.000 --> 0:16:04.560
<v Speaker 1>robot that might be a simple industrial arm is one

0:16:04.640 --> 0:16:07.440
<v Speaker 1>we probably wouldn't feel very much affinity for, you know,

0:16:07.480 --> 0:16:11.680
<v Speaker 1>it's obviously a machine. A robot that still looks really robotic,

0:16:11.800 --> 0:16:14.240
<v Speaker 1>but has you know, arms and legs like a vaguely

0:16:14.280 --> 0:16:17.280
<v Speaker 1>humanoid shape. We would probably feel a little more affinity

0:16:17.320 --> 0:16:20.280
<v Speaker 1>towards that. Make it look a little bit more human,

0:16:20.560 --> 0:16:23.360
<v Speaker 1>but you know, not to the point where anyone would

0:16:23.800 --> 0:16:26.880
<v Speaker 1>mistake it for being human. We might like it even more.

0:16:27.280 --> 0:16:29.960
<v Speaker 1>But once you start getting close to but not quite

0:16:30.160 --> 0:16:33.960
<v Speaker 1>human in appearance and behavior, our response drops to a

0:16:34.000 --> 0:16:37.720
<v Speaker 1>point where a lot of people feel unsettled, or even

0:16:37.880 --> 0:16:41.960
<v Speaker 1>they might feel revulsion when looking at the figure. Something is,

0:16:42.000 --> 0:16:45.160
<v Speaker 1>you know, not right. The cues that would normally help

0:16:45.200 --> 0:16:48.800
<v Speaker 1>us identify with the synthetic figure now feel strange and

0:16:48.880 --> 0:16:52.960
<v Speaker 1>maybe even scary. It's possible to get beyond the uncanny

0:16:53.040 --> 0:16:56.080
<v Speaker 1>valley, to create a robot or CGI character

0:16:56.440 --> 0:17:00.720
<v Speaker 1>that doesn't initiate this kind of instant revulsion, but it

0:17:00.880 --> 0:17:03.480
<v Speaker 1>is very hard to do so. A big challenge is

0:17:03.520 --> 0:17:08.240
<v Speaker 1>building an animatronic that doesn't trigger the uncanny valley response,

0:17:08.320 --> 0:17:11.159
<v Speaker 1>either by avoiding the trap of being almost but not

0:17:11.280 --> 0:17:14.359
<v Speaker 1>quite human in behavior, you know, by keeping things a

0:17:14.359 --> 0:17:18.520
<v Speaker 1>bit more obviously robotic, so there's that clear and distinct

0:17:18.600 --> 0:17:22.840
<v Speaker 1>separation that kind of removes that response we have,

0:17:23.480 --> 0:17:27.160
<v Speaker 1>or creating something lifelike enough that we feel the same

0:17:27.200 --> 0:17:29.760
<v Speaker 1>sort of reactions we would experience if that were a

0:17:29.800 --> 0:17:34.399
<v Speaker 1>real human. So it's tough to do. It's easier to

0:17:34.440 --> 0:17:37.879
<v Speaker 1>do the robot approach than it is to get something

0:17:37.920 --> 0:17:40.960
<v Speaker 1>that seems human enough that we let our guard down.

0:17:41.600 --> 0:17:44.400
<v Speaker 1>None of these challenges are trivial, but they all require

0:17:44.480 --> 0:17:49.000
<v Speaker 1>distinct approaches that must ultimately converge into a single implementation.

0:17:49.760 --> 0:17:51.600
<v Speaker 1>When we come back, I'll talk about some of the

0:17:51.640 --> 0:17:55.359
<v Speaker 1>technologies in this animatronic figure and the engineering team's philosophy

0:17:55.440 --> 0:17:58.959
<v Speaker 1>behind their design choices. But first let's take a quick break.

0:18:06.560 --> 0:18:10.080
<v Speaker 1>The engineering team limited itself to parameters that related to

0:18:10.119 --> 0:18:13.680
<v Speaker 1>creating a robot that could direct its gaze towards onlookers,

0:18:13.840 --> 0:18:16.760
<v Speaker 1>which meant they didn't have to worry about it doing

0:18:17.280 --> 0:18:21.520
<v Speaker 1>literally anything else. The audio animatronic bust they used has

0:18:21.760 --> 0:18:25.640
<v Speaker 1>nineteen degrees of freedom total, but the team made no

0:18:25.840 --> 0:18:28.600
<v Speaker 1>use of ten of those. They only used nine degrees

0:18:28.680 --> 0:18:32.040
<v Speaker 1>of freedom. They focused on the neck, which has three

0:18:32.080 --> 0:18:35.919
<v Speaker 1>degrees of freedom. The eyelids, which have two degrees of freedom,

0:18:35.960 --> 0:18:39.400
<v Speaker 1>the eyes, which also have two, and the eyebrows, which

0:18:39.440 --> 0:18:42.479
<v Speaker 1>have two degrees of freedom. The unused degrees of freedom

0:18:42.480 --> 0:18:44.840
<v Speaker 1>are for moving the jaw and the lips of the figure,

0:18:45.240 --> 0:18:48.320
<v Speaker 1>but since that's not necessary to make eye contact, the

0:18:48.359 --> 0:18:51.400
<v Speaker 1>team just ignored those. They didn't need to mess with them,

0:18:51.440 --> 0:18:54.000
<v Speaker 1>which means we get the effect of a robotic skull

0:18:54.160 --> 0:18:57.920
<v Speaker 1>with an unchanging rictus grin staring at us as its

0:18:57.960 --> 0:19:01.679
<v Speaker 1>upper facial area remains animated. I guess what I'm

0:19:01.680 --> 0:19:06.159
<v Speaker 1>saying is I didn't find the overall effect particularly comforting.

0:19:06.880 --> 0:19:10.760
<v Speaker 1>According to the paper, the commands going to these components

0:19:10.800 --> 0:19:15.399
<v Speaker 1>come from a, quote, custom proprietary software stack operating on

0:19:15.440 --> 0:19:19.800
<v Speaker 1>a 100 hertz real-time loop, end quote. Hertz is

0:19:19.840 --> 0:19:23.160
<v Speaker 1>a cycle per second, so this means that the software

0:19:23.240 --> 0:19:26.919
<v Speaker 1>is pulsing out operations one hundred times every second to

0:19:27.080 --> 0:19:31.280
<v Speaker 1>control this animatronic bust. Many of those commands aren't only

0:19:31.440 --> 0:19:34.800
<v Speaker 1>about making the bust do something specific, but to do

0:19:34.960 --> 0:19:39.399
<v Speaker 1>it in a specific way. Let's get back to the

0:19:39.440 --> 0:19:43.119
<v Speaker 1>Tiki birds as an example. The pneumatic valve that would

0:19:43.119 --> 0:19:45.840
<v Speaker 1>control whether or not pressurized air could travel to a

0:19:45.920 --> 0:19:49.920
<v Speaker 1>specific place like the mechanism that operates a bird's beak

0:19:50.480 --> 0:19:52.920
<v Speaker 1>is a pretty simple on or off switch, meaning the

0:19:53.000 --> 0:19:55.399
<v Speaker 1>valve is either open, in which case air can flow,

0:19:56.000 --> 0:19:58.199
<v Speaker 1>or it's closed, in which case the air is blocked

0:19:58.200 --> 0:20:01.760
<v Speaker 1>from flowing through and activating the mechanism. So the

0:20:01.800 --> 0:20:05.000
<v Speaker 1>beak has a natural resting position, and for this example,

0:20:05.080 --> 0:20:08.720
<v Speaker 1>we'll just assume that the rest position is a closed beak,

0:20:09.600 --> 0:20:12.119
<v Speaker 1>and so that's what the beak will always return to

0:20:12.320 --> 0:20:16.080
<v Speaker 1>when there's no air flowing to the mechanism that opens

0:20:16.119 --> 0:20:19.040
<v Speaker 1>the beak. If we open the valve, it lets air through,

0:20:19.280 --> 0:20:21.399
<v Speaker 1>it rushes to the end point and forces the beak to

0:20:21.600 --> 0:20:25.280
<v Speaker 1>open rapidly. Closing and opening the valve quickly forces the

0:20:25.280 --> 0:20:28.560
<v Speaker 1>bird's beak to open and close quickly, and when matched

0:20:28.560 --> 0:20:31.080
<v Speaker 1>with a soundtrack, it looks as though the bird is

0:20:31.119 --> 0:20:34.240
<v Speaker 1>speaking or singing, or you know, whatever it's doing. But

0:20:34.320 --> 0:20:37.080
<v Speaker 1>that movement is rapid and, just as I mentioned earlier,

0:20:37.160 --> 0:20:41.919
<v Speaker 1>not suitable for all animatronic applications. Having life sized humanoids

0:20:41.960 --> 0:20:45.080
<v Speaker 1>move with that kind of alarming speed would be scary

0:20:45.119 --> 0:20:49.040
<v Speaker 1>and legitimately dangerous. The greater mass of the figures would

0:20:49.080 --> 0:20:51.800
<v Speaker 1>mean you're dealing with larger amounts of inertia. I mean,

0:20:51.840 --> 0:20:54.400
<v Speaker 1>just imagine what it would look like if Mr. Lincoln,

0:20:54.480 --> 0:20:56.760
<v Speaker 1>in an effort to raise his hand in a gentle

0:20:56.800 --> 0:21:01.400
<v Speaker 1>show of reserved determination, instead violently karate chopped his own

0:21:01.440 --> 0:21:05.159
<v Speaker 1>head off. It would be, as the kids say, a

0:21:05.240 --> 0:21:10.040
<v Speaker 1>bad look. To create the illusion of life, the animatronics

0:21:10.080 --> 0:21:14.480
<v Speaker 1>that Disney designs follow certain general strategies. One is called

0:21:14.640 --> 0:21:18.640
<v Speaker 1>slow in and slow out. Now, this refers to general

0:21:18.680 --> 0:21:22.280
<v Speaker 1>movements, and the idea is that any movement should start off

0:21:22.400 --> 0:21:26.240
<v Speaker 1>slowly and then pick up speed as the movement continues,

0:21:26.800 --> 0:21:30.080
<v Speaker 1>and then slow down again before coming to a stop.

0:21:30.440 --> 0:21:32.879
<v Speaker 1>And it makes the motions appear more fluid, and it

0:21:32.880 --> 0:21:35.320
<v Speaker 1>has the added benefit of not being quite so harsh

0:21:35.359 --> 0:21:38.680
<v Speaker 1>on the figures themselves. So when a Disney figure raises

0:21:38.800 --> 0:21:41.720
<v Speaker 1>its hand, the hand should start off moving upward with

0:21:41.760 --> 0:21:45.399
<v Speaker 1>a nice, smooth slow motion, pick up a bit of

0:21:45.440 --> 0:21:48.960
<v Speaker 1>speed as it's moving upward, and then slow down again

0:21:49.000 --> 0:21:52.199
<v Speaker 1>as it's approaching its end point. And this means that

0:21:52.280 --> 0:21:55.440
<v Speaker 1>the underlying motors and mechanical systems have to be capable

0:21:55.560 --> 0:21:59.240
<v Speaker 1>of achieving the strategy. It's why you can't use pneumatic systems.
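
NOTE
A minimal sketch of the slow in and slow out idea described above, written in Python. The smoothstep curve, the function names, and the 90 degree move are illustrative assumptions, not anything taken from Disney's software. The point is only that commanded positions change slowly at the start and end of a move and fastest in the middle.
```python
def smoothstep(t: float) -> float:
    """Ease-in/ease-out curve: 0 at t=0, 1 at t=1, zero slope at both ends."""
    t = min(1.0, max(0.0, t))  # clamp so out-of-range times stay at the endpoints
    return t * t * (3.0 - 2.0 * t)
#
def eased_positions(start: float, end: float, steps: int) -> list[float]:
    """Sample a move from start to end at evenly spaced times along the eased curve."""
    return [start + (end - start) * smoothstep(i / steps) for i in range(steps + 1)]
#
# Hypothetical move: raise a joint from 0 to 90 degrees in 10 command steps.
positions = eased_positions(0.0, 90.0, 10)
# Per-step change in position: small at the ends, largest in the middle.
deltas = [b - a for a, b in zip(positions, positions[1:])]
```
With a one hundred hertz command loop like the one mentioned earlier, a one second move would use steps=100.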

0:21:59.240 --> 0:22:02.320
<v Speaker 1>They can't be those simple single speed devices that are

0:22:02.320 --> 0:22:06.080
<v Speaker 1>either on or off, like the Tiki birds. Oh, and

0:22:06.119 --> 0:22:08.320
<v Speaker 1>I guess I should specify I'm talking in this case

0:22:08.320 --> 0:22:11.639
<v Speaker 1>about the original Tiki birds because the birds in the

0:22:11.680 --> 0:22:15.600
<v Speaker 1>attractions today work on updated and more sophisticated computer systems

0:22:15.600 --> 0:22:17.760
<v Speaker 1>that take up a fraction of a fraction of the

0:22:17.800 --> 0:22:21.960
<v Speaker 1>space of the old attraction, which essentially required an entire

0:22:22.119 --> 0:22:24.920
<v Speaker 1>room filled with cables and tubes to make everything work

0:22:25.040 --> 0:22:30.240
<v Speaker 1>underneath the actual attraction itself. Now a few computers handle

0:22:30.280 --> 0:22:35.359
<v Speaker 1>the whole shebang. Anyway, let's get back to animatronics. Some

0:22:35.520 --> 0:22:39.080
<v Speaker 1>of the other guiding principles in animatronic motion that in

0:22:39.160 --> 0:22:42.240
<v Speaker 1>turn dictate the types of motors and joints and other

0:22:42.280 --> 0:22:45.680
<v Speaker 1>mechanical elements that the team must use to make

0:22:45.760 --> 0:22:50.000
<v Speaker 1>these happen include designing motions as arcs, meaning the motion

0:22:50.040 --> 0:22:54.560
<v Speaker 1>should follow an arched trajectory. Another is that the motions

0:22:54.680 --> 0:22:58.960
<v Speaker 1>should have overlap, meaning a robot shouldn't move a single

0:22:59.040 --> 0:23:03.320
<v Speaker 1>element like an arm, stop, then go to move on

0:23:03.400 --> 0:23:07.840
<v Speaker 1>the next element like the head position, and then stop

0:23:07.880 --> 0:23:12.160
<v Speaker 1>and so on, because that would be well, really robotic. Instead,

0:23:12.200 --> 0:23:16.040
<v Speaker 1>the robot's motions should overlap with one another so that,

0:23:16.359 --> 0:23:18.879
<v Speaker 1>let's say, Mr. Lincoln is turning his head at the

0:23:18.920 --> 0:23:22.320
<v Speaker 1>same time his arm is going up in determination. Now,

0:23:22.400 --> 0:23:26.040
<v Speaker 1>another element that's connected to this concept is that of drag,

0:23:26.480 --> 0:23:29.040
<v Speaker 1>which means that the different body parts are moving at

0:23:29.119 --> 0:23:31.960
<v Speaker 1>different frequencies or timing. They're not moving all at the

0:23:32.000 --> 0:23:35.000
<v Speaker 1>same speed. So, in other words, the speed at which Mr.

0:23:35.040 --> 0:23:38.399
<v Speaker 1>Lincoln turns his head might be slightly faster or slower

0:23:38.440 --> 0:23:41.280
<v Speaker 1>than the speed at which his arm goes up. This

0:23:41.359 --> 0:23:44.560
<v Speaker 1>is all in an effort to create the illusion of life,

0:23:44.640 --> 0:23:47.960
<v Speaker 1>but it also means that the programming and hardware underlying

0:23:48.000 --> 0:23:51.840
<v Speaker 1>the figure has to support those strategies. For the purposes

0:23:51.880 --> 0:23:54.919
<v Speaker 1>of this project, the engineers had certain motions they wanted

0:23:54.960 --> 0:23:58.000
<v Speaker 1>to be included. One minimum set of motions needed were

0:23:58.080 --> 0:24:02.360
<v Speaker 1>some that would imply that the bust was a breathing entity,

0:24:02.400 --> 0:24:04.920
<v Speaker 1>so it needed to move slightly as if it were

0:24:05.040 --> 0:24:08.960
<v Speaker 1>drawing breath. Blinking was also an important motion to get down,

0:24:09.080 --> 0:24:11.359
<v Speaker 1>as it would be more than a little unnerving to

0:24:11.359 --> 0:24:14.440
<v Speaker 1>have an animatronic figure make eye contact with you and

0:24:14.480 --> 0:24:19.600
<v Speaker 1>then never ever blink. And then there were the saccades.

0:24:20.440 --> 0:24:23.040
<v Speaker 1>Now I have to confess something to you, guys. When

0:24:23.040 --> 0:24:26.639
<v Speaker 1>I first encountered the word saccades, which is S A

0:24:26.960 --> 0:24:30.920
<v Speaker 1>C C A D E S. I had no idea

0:24:30.960 --> 0:24:33.239
<v Speaker 1>what that meant. It was a new word to me,

0:24:33.840 --> 0:24:35.560
<v Speaker 1>and maybe it's a new word for some of you

0:24:35.640 --> 0:24:38.760
<v Speaker 1>out there too. So if you happen to be like me,

0:24:39.160 --> 0:24:42.720
<v Speaker 1>what the heck are saccades? Well, that refers to the quick,

0:24:43.000 --> 0:24:47.200
<v Speaker 1>simultaneous movement of both eyes from one point of focus

0:24:47.280 --> 0:24:50.240
<v Speaker 1>to another. So think about how you might take in

0:24:50.320 --> 0:24:52.719
<v Speaker 1>a scene that has a lot of stuff going on.

0:24:52.800 --> 0:24:57.640
<v Speaker 1>Let's say you walk up to a building that's

0:24:57.640 --> 0:25:00.920
<v Speaker 1>burning. Well, your eyes are going to dart

0:25:01.080 --> 0:25:03.679
<v Speaker 1>at different things that are going on in front of

0:25:03.720 --> 0:25:06.640
<v Speaker 1>you that catch your attention as you focus on them,

0:25:06.640 --> 0:25:09.000
<v Speaker 1>and then you file that information away. And perhaps you're

0:25:09.000 --> 0:25:13.320
<v Speaker 1>even doing this subconsciously. It means our gaze is

0:25:13.359 --> 0:25:17.280
<v Speaker 1>not always steady and unwavering. It moves around a

0:25:17.280 --> 0:25:19.840
<v Speaker 1>bit on occasion. And that's not the only way we

0:25:19.880 --> 0:25:22.159
<v Speaker 1>move our eyes. Of course, we can actually track things

0:25:22.200 --> 0:25:25.320
<v Speaker 1>that are moving and use our eyes to move in

0:25:25.400 --> 0:25:28.720
<v Speaker 1>a more smooth and gradual motion. But the team knew

0:25:28.720 --> 0:25:31.080
<v Speaker 1>that if they could incorporate saccades, that would give

0:25:31.119 --> 0:25:35.320
<v Speaker 1>the robot a more lifelike performance. But that decision meant

0:25:35.320 --> 0:25:37.560
<v Speaker 1>the team needed to figure out something else, which was

0:25:37.600 --> 0:25:40.960
<v Speaker 1>where to put the cameras. The animatronic needs its own

0:25:41.119 --> 0:25:44.480
<v Speaker 1>vision to be able to detect onlookers and then direct

0:25:44.520 --> 0:25:49.080
<v Speaker 1>its own gaze appropriately, and some robots do put cameras

0:25:49.080 --> 0:25:52.000
<v Speaker 1>in the eyes of the robot so that the eyes

0:25:52.040 --> 0:25:55.520
<v Speaker 1>are actually camera lenses, but that presents a challenge if

0:25:55.560 --> 0:25:58.760
<v Speaker 1>you wish to incorporate rapid eye movement like saccades,

0:25:58.800 --> 0:26:01.720
<v Speaker 1>because that sort of movement introduces motion blur in the

0:26:01.840 --> 0:26:04.679
<v Speaker 1>video imagery and makes it more challenging for the robot to

0:26:04.720 --> 0:26:06.520
<v Speaker 1>keep track of what's going on in front of it.

0:26:06.920 --> 0:26:09.600
<v Speaker 1>For that reason, the team decided that the cameras would

0:26:09.600 --> 0:26:13.360
<v Speaker 1>not be mounted in the eyes, but rather were

0:26:13.359 --> 0:26:18.640
<v Speaker 1>mounted on the animatronic's chest. Presumably, should the gaze tracking

0:26:18.720 --> 0:26:22.040
<v Speaker 1>technology find its way into full animatronic figures in the future,

0:26:22.440 --> 0:26:25.080
<v Speaker 1>the camera will be you know, hidden within the body

0:26:25.160 --> 0:26:29.040
<v Speaker 1>of the animatronic torso in order to avoid this problem,

0:26:29.160 --> 0:26:33.160
<v Speaker 1>or otherwise maybe mounted in an unobtrusive spot. One thing

0:26:33.160 --> 0:26:36.320
<v Speaker 1>that interests me with this particular approach is that the

0:26:36.359 --> 0:26:39.639
<v Speaker 1>system has to do some calculations as to where the

0:26:39.720 --> 0:26:43.040
<v Speaker 1>eyes of the animatronic are in relation to the physical

0:26:43.080 --> 0:26:46.680
<v Speaker 1>location of the cameras, you know, because for us,

0:26:46.680 --> 0:26:50.440
<v Speaker 1>our eyes are essentially the cameras, or at least the

0:26:50.480 --> 0:26:53.920
<v Speaker 1>camera lenses, so we don't have to make any adjustments.

0:26:54.000 --> 0:26:57.280
<v Speaker 1>Right where we're looking is like the point of our

0:26:57.320 --> 0:27:01.000
<v Speaker 1>gaze is the point of where we're taking in visual information.

0:27:01.480 --> 0:27:05.520
<v Speaker 1>For the animatronic, the eyes of the robot, the actual

0:27:05.640 --> 0:27:08.960
<v Speaker 1>eyes that are in the skull, don't function as eyes.

0:27:09.520 --> 0:27:14.359
<v Speaker 1>They aren't lenses. They're actually several inches above the actual camera.

0:27:15.000 --> 0:27:18.439
<v Speaker 1>And yet the eyes in the robot's head need to

0:27:18.480 --> 0:27:20.439
<v Speaker 1>point in the right direction. They need to be the

0:27:20.520 --> 0:27:23.760
<v Speaker 1>part that's pointed at the person who's looking at it. Right,

0:27:23.800 --> 0:27:26.359
<v Speaker 1>it doesn't make sense for the robot to just turn

0:27:26.440 --> 0:27:29.480
<v Speaker 1>its sternum towards you. It needs to be looking at

0:27:29.520 --> 0:27:33.040
<v Speaker 1>you with its robot eyes. And I think of this

0:27:33.200 --> 0:27:36.679
<v Speaker 1>kind of like someone who's working a hand puppet and

0:27:36.720 --> 0:27:39.760
<v Speaker 1>they've got the hand puppet up over their head, so

0:27:40.000 --> 0:27:42.359
<v Speaker 1>maybe they're behind a little stage, you know, like like

0:27:42.480 --> 0:27:45.719
<v Speaker 1>the muppets tend to be. You've got this hand puppet

0:27:45.720 --> 0:27:49.480
<v Speaker 1>and it needs to make eye contact with a human being. Well,

0:27:49.600 --> 0:27:52.000
<v Speaker 1>that just means the puppeteer has to take that into

0:27:52.040 --> 0:27:56.919
<v Speaker 1>account and angle their hand so that the puppet's eyes

0:27:57.119 --> 0:28:00.359
<v Speaker 1>appear to be locking on the eyes of the real

0:28:00.440 --> 0:28:04.440
<v Speaker 1>person that the muppet or puppet is interacting with. It's

0:28:04.440 --> 0:28:07.680
<v Speaker 1>a little tricky. It requires some skill. For the robot,

0:28:07.800 --> 0:28:10.600
<v Speaker 1>it means that there's some, you know, nifty geometry going

0:28:10.640 --> 0:28:13.960
<v Speaker 1>on in the processor side to make this work out.
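
NOTE
A rough sketch of the kind of geometry involved, in Python. Everything here is an assumption made for illustration: a camera frame with x to the right, y up, and z forward in meters, eyes whose axes are parallel to the camera's, and a hypothetical eye position 0.15 meters above the chest camera. It is not the team's actual method, only one way to turn a target seen by the camera into yaw and pitch angles for eyes mounted somewhere else.
```python
import math
#
def gaze_angles(target_cam, eye_offset):
    """Return (yaw, pitch) in degrees that point the eyes at a target.
    target_cam: (x, y, z) of the onlooker's eyes in the camera frame.
    eye_offset: (x, y, z) of the robot's eyes in that same camera frame.
    """
    # Re-express the target relative to the robot's eyes instead of the camera.
    dx = target_cam[0] - eye_offset[0]
    dy = target_cam[1] - eye_offset[1]
    dz = target_cam[2] - eye_offset[2]
    yaw = math.degrees(math.atan2(dx, dz))  # positive means look to the right
    pitch = math.degrees(math.atan2(dy, math.hypot(dx, dz)))  # positive means look up
    return yaw, pitch
#
EYE_OFFSET = (0.0, 0.15, 0.0)  # hypothetical: eyes sit 15 cm above the camera
# An onlooker whose eyes are level with the camera, one meter straight ahead:
yaw, pitch = gaze_angles((0.0, 0.0, 1.0), EYE_OFFSET)
```
In this example the yaw is zero and the pitch is negative, because eyes mounted above the camera have to look slightly downward to meet a target that the camera sees dead center.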

0:28:14.040 --> 0:28:17.320
<v Speaker 1>Like the image recognition has to identify where the eyes

0:28:17.400 --> 0:28:22.560
<v Speaker 1>are of the onlooker and then calculate where the robot's

0:28:22.560 --> 0:28:25.840
<v Speaker 1>eyes are in relation to that and direct them in

0:28:25.880 --> 0:28:29.359
<v Speaker 1>the right way, which to me is really fascinating because again,

0:28:29.720 --> 0:28:32.359
<v Speaker 1>the eyes of the robot are not where the visual

0:28:32.359 --> 0:28:36.399
<v Speaker 1>information is actually coming in. We'll talk more about the

0:28:36.440 --> 0:28:39.240
<v Speaker 1>behaviors of this robot in a second, but since we're

0:28:39.240 --> 0:28:41.720
<v Speaker 1>already chatting about cameras, it's good to talk about what

0:28:41.800 --> 0:28:44.600
<v Speaker 1>the team was actually using to give the robot its vision.

0:28:45.040 --> 0:28:47.840
<v Speaker 1>They went with an off the shelf solution. They used a

0:28:47.920 --> 0:28:52.320
<v Speaker 1>camera called the Mint Eye D one thousand and Mint

0:28:52.440 --> 0:28:56.760
<v Speaker 1>is spelled M Y N T. This particular camera has two

0:28:56.840 --> 0:29:00.400
<v Speaker 1>lenses in it for stereoscopic vision, and so together they

0:29:00.400 --> 0:29:03.240
<v Speaker 1>can create a stereo image, that is, an image with,

0:29:03.640 --> 0:29:05.640
<v Speaker 1>you know, kind of a depth like a three D

0:29:05.760 --> 0:29:09.160
<v Speaker 1>image with a resolution of two thousand, five hundred sixty

0:29:09.160 --> 0:29:12.920
<v Speaker 1>by seven twenty pixels at sixty frames per second, so

0:29:12.920 --> 0:29:16.480
<v Speaker 1>it can do you know, this is video information. There's

0:29:16.480 --> 0:29:20.120
<v Speaker 1>also a depth map mode which uses infrared light to

0:29:20.160 --> 0:29:23.800
<v Speaker 1>help judge the depth of the things within its field

0:29:23.800 --> 0:29:26.600
<v Speaker 1>of view, like how close is one thing versus another

0:29:27.240 --> 0:29:30.160
<v Speaker 1>relative to the camera, and the depth map's resolution is

0:29:30.200 --> 0:29:33.160
<v Speaker 1>at one thousand, two hundred eighty by seven twenty pixels

0:29:33.200 --> 0:29:36.200
<v Speaker 1>at sixty frames per second. As I mentioned, these two

0:29:36.320 --> 0:29:40.000
<v Speaker 1>lenses allow the camera to simulate human binocular vision. So

0:29:40.040 --> 0:29:42.400
<v Speaker 1>just as we perceive depth in the world around us

0:29:42.520 --> 0:29:45.320
<v Speaker 1>using two eyes, you know, most of us, this

0:29:45.440 --> 0:29:48.480
<v Speaker 1>camera can do the same thing and judge which things

0:29:48.520 --> 0:29:51.360
<v Speaker 1>are in the foreground versus the background, what things are

0:29:51.520 --> 0:29:54.280
<v Speaker 1>closest to it versus furthest away, and make a better

0:29:54.320 --> 0:29:57.440
<v Speaker 1>determination of which things within its field of view are

0:29:57.480 --> 0:30:00.680
<v Speaker 1>worthy of attention, which will become important in a little bit.

0:30:01.240 --> 0:30:03.880
<v Speaker 1>The camera has a more limited field of view than

0:30:03.920 --> 0:30:07.640
<v Speaker 1>a typical human. It has about half the horizontal field

0:30:07.720 --> 0:30:10.760
<v Speaker 1>of view of a person, so its periphery is more narrow,

0:30:11.400 --> 0:30:13.959
<v Speaker 1>and it has a little more than a third the

0:30:14.080 --> 0:30:16.440
<v Speaker 1>vertical field of view, so it can't see as much

0:30:16.520 --> 0:30:19.840
<v Speaker 1>up and down as your typical person can. So any

0:30:19.920 --> 0:30:23.080
<v Speaker 1>future animatronic figure might need a more expansive field of

0:30:23.160 --> 0:30:26.000
<v Speaker 1>view to be able to interact with guests who could

0:30:26.080 --> 0:30:28.640
<v Speaker 1>range in height from very small to quite tall. I mean,

0:30:28.680 --> 0:30:31.680
<v Speaker 1>all sorts of people go to Disney. So I do

0:30:31.760 --> 0:30:35.240
<v Speaker 1>see that as a potential limiting factor in the short run,

0:30:35.800 --> 0:30:38.920
<v Speaker 1>that any stereoscopic kind of camera would need to have

0:30:39.000 --> 0:30:43.480
<v Speaker 1>a pretty good field of view for a robot to

0:30:43.520 --> 0:30:49.080
<v Speaker 1>be able to interact properly with guests of different heights. Now,

0:30:49.120 --> 0:30:51.200
<v Speaker 1>I decided to see how much this camera would cost

0:30:51.240 --> 0:30:54.360
<v Speaker 1>for some normal schlub like myself, and the answer is

0:30:54.520 --> 0:30:56.840
<v Speaker 1>less than four hundred dollars. So this is actually a

0:30:56.840 --> 0:31:01.600
<v Speaker 1>pretty inexpensive solution all things considered. And again,

0:31:01.640 --> 0:31:05.520
<v Speaker 1>it's really more important for creating the basis for the

0:31:05.600 --> 0:31:08.400
<v Speaker 1>work as opposed to saying this is a final product.

0:31:08.960 --> 0:31:11.440
<v Speaker 1>And that's more or less the hardware side of things,

0:31:11.520 --> 0:31:13.680
<v Speaker 1>or at least as specific as I can get based

0:31:13.680 --> 0:31:16.840
<v Speaker 1>on the material available. Like, I don't know what

0:31:17.000 --> 0:31:19.880
<v Speaker 1>the power of their computer system was, you know, I

0:31:19.920 --> 0:31:23.120
<v Speaker 1>don't know the specific types of motors they were using

0:31:23.120 --> 0:31:26.720
<v Speaker 1>in the animatronic, but from a high level, we understand

0:31:26.720 --> 0:31:30.200
<v Speaker 1>what's going on. However, the real magic happens with the

0:31:30.240 --> 0:31:33.840
<v Speaker 1>system that gives this hardware its orders, and the team

0:31:33.880 --> 0:31:36.760
<v Speaker 1>made the conscious decision to create the illusion of life

0:31:37.120 --> 0:31:41.040
<v Speaker 1>rather than attempt to replicate human behaviors perfectly, which is

0:31:41.040 --> 0:31:43.400
<v Speaker 1>a bit of a challenging concept. You might think, well,

0:31:43.400 --> 0:31:46.160
<v Speaker 1>what's the difference, But I think I have a pretty

0:31:46.200 --> 0:31:50.640
<v Speaker 1>decent analogy. If you've ever gone to see a stage play,

0:31:51.000 --> 0:31:55.560
<v Speaker 1>then you've seen sets. Maybe the sets were really detailed,

0:31:56.040 --> 0:31:59.240
<v Speaker 1>maybe they were bare bones sets, But in any case,

0:31:59.280 --> 0:32:01.800
<v Speaker 1>the sets are meant to create the illusion of a

0:32:01.880 --> 0:32:04.800
<v Speaker 1>real place at a real moment of time. You know,

0:32:04.840 --> 0:32:07.000
<v Speaker 1>it could be a room in the eighteenth century in

0:32:07.040 --> 0:32:10.480
<v Speaker 1>a palatial estate, or it might be a

0:32:10.600 --> 0:32:14.560
<v Speaker 1>modern day real estate sales office if it's a Mamet play,

0:32:14.680 --> 0:32:18.600
<v Speaker 1>or maybe it's a campsite. In any case, the sets

0:32:18.600 --> 0:32:21.800
<v Speaker 1>and props are meant to convey the illusion of that

0:32:21.880 --> 0:32:24.920
<v Speaker 1>place and time, and if you were to actually get

0:32:25.000 --> 0:32:27.360
<v Speaker 1>up on stage and walk around, that illusion would very

0:32:27.440 --> 0:32:30.480
<v Speaker 1>quickly be broken. But when you're sitting in the audience,

0:32:30.920 --> 0:32:33.680
<v Speaker 1>it's up to you to use your imagination to fill

0:32:33.720 --> 0:32:36.840
<v Speaker 1>in some of the gaps and suspend disbelief. It is

0:32:37.000 --> 0:32:41.360
<v Speaker 1>a show. Likewise, the engineers who worked on this project

0:32:41.520 --> 0:32:45.600
<v Speaker 1>talk about robot behaviors in terms of a show, and

0:32:45.640 --> 0:32:48.040
<v Speaker 1>that means that the robot needs to react and move

0:32:48.080 --> 0:32:50.840
<v Speaker 1>in ways that create the illusion of life, but it

0:32:50.880 --> 0:32:56.320
<v Speaker 1>does not necessarily need to adhere completely to human behaviors.

0:32:56.360 --> 0:33:00.000
<v Speaker 1>This makes things much more simple, particularly since it removes

0:33:00.000 --> 0:33:03.800
<v Speaker 1>tricky questions regarding what sets of behaviors are the

0:33:03.800 --> 0:33:07.400
<v Speaker 1>most human, because I'm sure you've noticed human beings and

0:33:07.520 --> 0:33:11.600
<v Speaker 1>human behavior occur in a really broad spectrum, and what

0:33:11.840 --> 0:33:15.240
<v Speaker 1>might be a typical set of behaviors for one person

0:33:15.560 --> 0:33:19.000
<v Speaker 1>could be completely alien to another person. So it's a

0:33:19.000 --> 0:33:23.120
<v Speaker 1>good idea to not try and define what sets of

0:33:23.160 --> 0:33:28.000
<v Speaker 1>behaviors are quintessentially human. When we come back, I'll talk

0:33:28.040 --> 0:33:31.320
<v Speaker 1>about how the team determined how the robot would actually

0:33:31.360 --> 0:33:35.400
<v Speaker 1>behave. It's pretty cool, but first let's take another quick break.

0:33:42.760 --> 0:33:46.440
<v Speaker 1>The team created an architecture to describe the relationship of

0:33:46.560 --> 0:33:51.280
<v Speaker 1>various elements to create the behavior of an interactive robotic gaze.

0:33:51.400 --> 0:33:56.120
<v Speaker 1>To create this robotic eye contact, the layers include the camera,

0:33:56.400 --> 0:33:59.280
<v Speaker 1>which is you know, the point of perception from the robot,

0:33:59.800 --> 0:34:05.280
<v Speaker 1>a perception engine, and an attention engine, which determines

0:34:05.680 --> 0:34:08.560
<v Speaker 1>which things within the robot's perception are actually worthy of

0:34:08.680 --> 0:34:13.200
<v Speaker 1>attention or focus. A behavior selection engine and a library

0:34:13.280 --> 0:34:18.880
<v Speaker 1>of potential behaviors, and the audio animatronic figure's systems, its hardware.

0:34:18.960 --> 0:34:22.839
<v Speaker 1>The motor commands and motor states go to that, and

0:34:23.080 --> 0:34:26.120
<v Speaker 1>that's the layers in order from top to bottom. These

0:34:26.200 --> 0:34:29.320
<v Speaker 1>layers explain the relationship of each element in sort of

0:34:29.320 --> 0:34:32.600
<v Speaker 1>an abstract way, allowing us to understand how the robot

0:34:32.680 --> 0:34:36.640
<v Speaker 1>processes and reacts to information. So the perception engine is

0:34:36.680 --> 0:34:40.759
<v Speaker 1>designed to identify potential elements within the robotic vision, you know,

0:34:40.800 --> 0:34:44.200
<v Speaker 1>separating things out from say just a static background, and

0:34:44.239 --> 0:34:47.600
<v Speaker 1>the attention engine attempts to identify things within the robot's

0:34:47.680 --> 0:34:51.560
<v Speaker 1>vision that merit focus. The attention engine generates what the

0:34:51.560 --> 0:34:56.040
<v Speaker 1>team calls a curiosity score. So if that curiosity score

0:34:56.160 --> 0:35:00.600
<v Speaker 1>is below a certain threshold, the robot won't quote unquote

0:35:00.640 --> 0:35:03.719
<v Speaker 1>notice something within its field of view. It's not

0:35:03.880 --> 0:35:07.960
<v Speaker 1>enough to capture its attention. Certain actions, such as you know,

0:35:08.000 --> 0:35:12.040
<v Speaker 1>waving at the robot, merit a higher curiosity score. So

0:35:12.080 --> 0:35:15.640
<v Speaker 1>if the score ends up being above the curiosity score threshold,

0:35:15.960 --> 0:35:18.920
<v Speaker 1>the robot will look toward whatever it was that you know,

0:35:19.200 --> 0:35:22.560
<v Speaker 1>quote unquote got its attention. The team decided it would

0:35:22.560 --> 0:35:25.680
<v Speaker 1>be helpful to create a sort of scenario to work with,

0:35:25.760 --> 0:35:28.560
<v Speaker 1>not just have you know, a robot randomly looking around,

0:35:28.600 --> 0:35:32.800
<v Speaker 1>So their approach was to simulate an elderly man reading

0:35:32.880 --> 0:35:36.720
<v Speaker 1>something like a newspaper or a book. Most of the time,

0:35:36.960 --> 0:35:39.239
<v Speaker 1>the robot would be looking downward a bit, you know,

0:35:39.280 --> 0:35:41.839
<v Speaker 1>it's head tilted down a little, as if it were

0:35:41.840 --> 0:35:44.879
<v Speaker 1>reading something that was held more or less at torso level.

0:35:45.600 --> 0:35:48.960
<v Speaker 1>If something moves into the robot's field of view, the

0:35:49.080 --> 0:35:51.600
<v Speaker 1>robot could glance up quickly, just as a human would

0:35:51.600 --> 0:35:54.640
<v Speaker 1>to assess what's going on, and if whatever is within

0:35:54.719 --> 0:35:57.839
<v Speaker 1>the field of view creates a curiosity score lower than

0:35:58.000 --> 0:36:01.319
<v Speaker 1>what the threshold is, then the robot just goes back

0:36:01.360 --> 0:36:04.840
<v Speaker 1>to reading. If whatever is going on is above that

0:36:04.960 --> 0:36:09.560
<v Speaker 1>curiosity score threshold, the robot might look directly at whatever

0:36:09.600 --> 0:36:12.279
<v Speaker 1>it is that's happening, and then things could progress from there.

0:36:12.920 --> 0:36:16.520
<v Speaker 1>That's where the behavior selection engine and behavior library come

0:36:16.560 --> 0:36:19.319
<v Speaker 1>into play. There are a few possible reactions, and the

0:36:19.360 --> 0:36:22.880
<v Speaker 1>robot will choose one depending on several factors. For example,

0:36:23.120 --> 0:36:26.880
<v Speaker 1>one such factor was familiarity. The robot would behave differently

0:36:26.880 --> 0:36:30.800
<v Speaker 1>toward people it quote unquote recognized. It also wouldn't switch

0:36:30.920 --> 0:36:34.560
<v Speaker 1>focus every time someone tried to wave it down, So

0:36:34.719 --> 0:36:37.080
<v Speaker 1>if you were to distract the robot, it might look

0:36:37.120 --> 0:36:39.799
<v Speaker 1>away from whatever it was looking at before and then

0:36:39.840 --> 0:36:42.440
<v Speaker 1>look to you once. Then it might look back at

0:36:42.480 --> 0:36:45.479
<v Speaker 1>someone it quote unquote knows, and if you were to wave

0:36:45.480 --> 0:36:48.880
<v Speaker 1>at it again, you wouldn't necessarily get a response. So

0:36:49.040 --> 0:36:51.600
<v Speaker 1>kind of think about how adults can be with kids,

0:36:51.880 --> 0:36:54.800
<v Speaker 1>where the adults tend to develop a highly attuned skill

0:36:54.880 --> 0:36:57.560
<v Speaker 1>of ignoring the child after a bit, even if the

0:36:57.600 --> 0:37:03.279
<v Speaker 1>child is saying, but look, look, look, hey, look at

0:37:03.280 --> 0:37:07.800
<v Speaker 1>what I'm doing? Look? And so on. So the team

0:37:07.880 --> 0:37:12.840
<v Speaker 1>created four basic states. The default state was called read,

0:37:13.200 --> 0:37:15.920
<v Speaker 1>meaning it would appear as though the figure were reading

0:37:15.920 --> 0:37:19.480
<v Speaker 1>a book or newspaper at torso level. The next state

0:37:19.600 --> 0:37:23.040
<v Speaker 1>up is glance, whereupon the robot would appear to

0:37:23.120 --> 0:37:26.480
<v Speaker 1>glance away from the reading material to see what sort

0:37:26.480 --> 0:37:29.440
<v Speaker 1>of ruckus is going on. This involved movement of not

0:37:29.520 --> 0:37:31.759
<v Speaker 1>just the eyes but the head as well. So the

0:37:31.800 --> 0:37:35.000
<v Speaker 1>head tilts up a bit and it looks for a moment,

0:37:35.080 --> 0:37:38.799
<v Speaker 1>like the robot is looking away from the imaginary book

0:37:38.840 --> 0:37:42.319
<v Speaker 1>or newspaper. If the curiosity threshold is met, then the

0:37:42.360 --> 0:37:46.640
<v Speaker 1>next state, engage, would pop up. This means that whatever

0:37:46.680 --> 0:37:49.040
<v Speaker 1>it was that got the robot's attention is worthy of

0:37:49.200 --> 0:37:52.399
<v Speaker 1>further focus, and the robot will direct its gaze at

0:37:52.480 --> 0:37:56.560
<v Speaker 1>that thing. With the engage stage, which has a nice

0:37:56.640 --> 0:37:59.759
<v Speaker 1>rhyme to it, the robot will attempt to make eye contact,

0:38:00.000 --> 0:38:02.840
<v Speaker 1>which involves the cameras detecting the face of the person

0:38:02.920 --> 0:38:06.160
<v Speaker 1>of interest, and then the computer system commanding the robot's

0:38:06.160 --> 0:38:09.600
<v Speaker 1>head and eyes to aim towards that detected face. The

0:38:09.640 --> 0:38:11.920
<v Speaker 1>amount of time that the robot spends looking at a

0:38:11.960 --> 0:38:15.240
<v Speaker 1>person is determined both by a minimum countdown clock saying

0:38:15.640 --> 0:38:18.960
<v Speaker 1>you have to spend this amount at least looking at

0:38:19.040 --> 0:38:22.719
<v Speaker 1>this person, and then there's the curiosity score that the

0:38:22.800 --> 0:38:26.040
<v Speaker 1>robot has assigned to that person. So once that score

0:38:26.080 --> 0:38:30.600
<v Speaker 1>decreases below the engage threshold, the robot returns to read.
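
NOTE
Editor's sketch, not from the episode or the paper: one way to picture the read, glance, and engage states described above is as a small state machine driven by a decaying curiosity score. The threshold values, decay rate, minimum hold time, and every name below are illustrative assumptions.
```python
from dataclasses import dataclass
# Hypothetical tuning values; the episode gives no actual numbers.
GLANCE_THRESHOLD = 0.3
ENGAGE_THRESHOLD = 0.6
MIN_ENGAGE_TICKS = 3   # assumed minimum countdown while engaged
DECAY = 0.8            # assumed: curiosity fades a little every tick
@dataclass
class Gaze:
    state: str = "read"
    curiosity: float = 0.0
    hold: int = 0
    def tick(self, stimulus: float = 0.0) -> str:
        # Fade old interest, then add whatever the guest just did.
        self.curiosity = self.curiosity * DECAY + stimulus
        if self.state == "engage":
            self.hold = max(0, self.hold - 1)
            # Leave engage only after the countdown ends and interest drops.
            if self.hold == 0 and self.curiosity < ENGAGE_THRESHOLD:
                self.state = "read"
        elif self.curiosity >= ENGAGE_THRESHOLD:
            self.state, self.hold = "engage", MIN_ENGAGE_TICKS
        elif self.curiosity >= GLANCE_THRESHOLD:
            self.state = "glance"
        else:
            self.state = "read"
        return self.state
```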

0:38:30.760 --> 0:38:34.120
<v Speaker 1>So if you happen to be particularly interesting, the robot

0:38:34.160 --> 0:38:37.360
<v Speaker 1>will look at you for longer, and when you stop

0:38:37.360 --> 0:38:40.839
<v Speaker 1>being interesting, the robot eventually goes back to reading its

0:38:40.840 --> 0:38:45.399
<v Speaker 1>pretend book or whatever. The final stage is called acknowledge,

0:38:45.400 --> 0:38:47.279
<v Speaker 1>and that was the name that the team gave for

0:38:47.320 --> 0:38:49.960
<v Speaker 1>those times when the robot is seeing a person that

0:38:50.080 --> 0:38:53.720
<v Speaker 1>is familiar to the robot. For the purposes of the tests,

0:38:54.200 --> 0:38:58.000
<v Speaker 1>the familiarity variable was actually randomized, so in other words,

0:38:58.280 --> 0:39:02.480
<v Speaker 1>the robot wasn't necessarily familiar with people. It just

0:39:02.880 --> 0:39:06.800
<v Speaker 1>was told it was familiar with somebody. So, in other words,

0:39:06.800 --> 0:39:09.360
<v Speaker 1>it could be a totally new person that walks

0:39:09.440 --> 0:39:12.960
<v Speaker 1>up to the robot and the robot randomly assigns that

0:39:13.000 --> 0:39:16.879
<v Speaker 1>person the familiar tag, and the robot will behave as

0:39:16.880 --> 0:39:20.759
<v Speaker 1>if that's someone that the robot recognizes. Maybe they're just

0:39:20.880 --> 0:39:24.840
<v Speaker 1>an old friend the robot just met. Is there a

0:39:24.880 --> 0:39:28.640
<v Speaker 1>word for that? The robot system also had a sort

0:39:28.680 --> 0:39:32.480
<v Speaker 1>of short-term memory that the team called the guesthouse.

0:39:33.080 --> 0:39:35.799
<v Speaker 1>As people would come into the robot's field of view

0:39:36.080 --> 0:39:39.720
<v Speaker 1>or the scene as the team called it, the robot

0:39:39.760 --> 0:39:43.799
<v Speaker 1>would analyze that person and assign that person a numerical

0:39:43.960 --> 0:39:47.560
<v Speaker 1>value to keep track of that person, and it would

0:39:47.560 --> 0:39:49.920
<v Speaker 1>also keep track of how many times that particular person

0:39:50.200 --> 0:39:53.239
<v Speaker 1>had been within its field of view, and it would

0:39:53.320 --> 0:39:56.160
<v Speaker 1>keep track of the curiosity score that was assigned to

0:39:56.280 --> 0:39:59.760
<v Speaker 1>that person. In addition to the states, the team described

0:39:59.840 --> 0:40:03.520
<v Speaker 1>layers of show. Now this relates closely with the

0:40:03.560 --> 0:40:06.480
<v Speaker 1>states I just mentioned, but it helps explain how the

0:40:06.600 --> 0:40:09.879
<v Speaker 1>robot transitions from one set of behaviors to another, how

0:40:09.880 --> 0:40:13.239
<v Speaker 1>does it make the determination to change from one thing

0:40:13.320 --> 0:40:17.360
<v Speaker 1>to the next, and which behaviors will overwrite others versus

0:40:17.440 --> 0:40:21.160
<v Speaker 1>behaviors that will always be present with the robot. All

0:40:21.200 --> 0:40:24.640
<v Speaker 1>of this is necessary because of that variation I was

0:40:24.680 --> 0:40:27.640
<v Speaker 1>talking about at the beginning of the show. If the

0:40:27.760 --> 0:40:32.239
<v Speaker 1>robot were just following a scripted set of directions, it

0:40:32.280 --> 0:40:34.839
<v Speaker 1>wouldn't have to make these determinations because it would just

0:40:34.840 --> 0:40:38.520
<v Speaker 1>follow the same sequence over and over. But because we

0:40:38.600 --> 0:40:42.800
<v Speaker 1>have this variability, we have to build in a system

0:40:42.920 --> 0:40:45.359
<v Speaker 1>for the robot to follow in order to make decisions.

0:40:45.440 --> 0:40:48.040
<v Speaker 1>So at the base level you have what the team

0:40:48.040 --> 0:40:52.560
<v Speaker 1>calls zero show. This is essentially the robot in off mode.

0:40:52.640 --> 0:40:55.719
<v Speaker 1>It is inanimate. But the next layer up is a

0:40:55.800 --> 0:41:00.279
<v Speaker 1>live show, which has the baseline behaviors of simulated breathing,

0:41:00.960 --> 0:41:05.399
<v Speaker 1>eye blinking, and saccades. This level of show underlies

0:41:05.680 --> 0:41:08.960
<v Speaker 1>all the other higher levels, so this is sort of

0:41:09.920 --> 0:41:13.520
<v Speaker 1>always running in the background. You don't want the robot

0:41:13.560 --> 0:41:17.160
<v Speaker 1>to suddenly stop breathing while it does other stuff. The

0:41:17.280 --> 0:41:21.400
<v Speaker 1>next four show levels correspond with the four states of

0:41:21.440 --> 0:41:25.560
<v Speaker 1>the robot. So you have read, glance, engage, and acknowledge,

0:41:26.040 --> 0:41:30.920
<v Speaker 1>and an engage show will subsume the glance and read shows.

0:41:31.360 --> 0:41:34.040
<v Speaker 1>It will take over the robot's behaviors, so the robot's

0:41:34.080 --> 0:41:37.800
<v Speaker 1>not going to display the behaviors of read and glance

0:41:38.160 --> 0:41:43.279
<v Speaker 1>when engage happens. So it's that hierarchy of operations, and

0:41:43.360 --> 0:41:46.080
<v Speaker 1>I find it really interesting to look at robot behaviors

0:41:46.080 --> 0:41:49.359
<v Speaker 1>in this way as that hierarchy of potential states. It's

0:41:49.400 --> 0:41:52.800
<v Speaker 1>amazing when you break down those states and determine which

0:41:52.840 --> 0:41:57.520
<v Speaker 1>should take priority given certain circumstances, and how long that

0:41:57.640 --> 0:42:00.520
<v Speaker 1>state should remain active before it reverts to a

0:42:00.640 --> 0:42:04.400
<v Speaker 1>lower level state. Again, the team is trying to create

0:42:04.440 --> 0:42:07.240
<v Speaker 1>the illusion of life. The robot doesn't have to actually

0:42:07.320 --> 0:42:10.960
<v Speaker 1>lose interest or anything like that. It's just simulating it.
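
NOTE
Editor's sketch, not from the episode or the paper: the layering described above can be pictured as a priority list in which the highest active show overrides the ones below it while the baseline layer keeps running. The layer order follows the episode; the function name and behavior strings are illustrative assumptions.
```python
# Lowest to highest priority, following the order given in the episode.
STATE_SHOWS = ["read", "glance", "engage", "acknowledge"]
BASELINE = ["breathing", "blinking", "saccades"]  # always-on layer
def active_behaviors(powered_on: bool, active_states: set[str]) -> list[str]:
    if not powered_on:
        return []  # zero show: the figure is inanimate
    behaviors = list(BASELINE)
    # Only the highest active state show plays; it subsumes lower ones.
    for show in reversed(STATE_SHOWS):
        if show in active_states:
            behaviors.append(show)
            break
    else:
        behaviors.append("read")  # assumed default when nothing else is active
    return behaviors
```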

0:42:11.600 --> 0:42:15.040
<v Speaker 1>This particular project was working within some pretty well defined

0:42:15.080 --> 0:42:18.640
<v Speaker 1>parameters and restrictions. The team acknowledges that their work is

0:42:18.680 --> 0:42:22.120
<v Speaker 1>really meant to be a starting point for further improvements.

0:42:22.560 --> 0:42:26.759
<v Speaker 1>They point out that older audio animatronics might seem lifelike

0:42:26.840 --> 0:42:30.760
<v Speaker 1>at greater distances and for shorter durations. So, for example,

0:42:31.239 --> 0:42:33.920
<v Speaker 1>if you were to ride an attraction where you go

0:42:34.040 --> 0:42:37.520
<v Speaker 1>by a scene of audio animatronic figures at a decent

0:42:37.520 --> 0:42:40.800
<v Speaker 1>clip and they're, you know, a good twenty feet away,

0:42:41.200 --> 0:42:43.919
<v Speaker 1>the limited amount of time and the greater distance that

0:42:44.080 --> 0:42:48.160
<v Speaker 1>are involved can help support that illusion of life. The

0:42:48.200 --> 0:42:51.879
<v Speaker 1>animatronic figures don't have to be super convincing because you're

0:42:51.920 --> 0:42:55.480
<v Speaker 1>not spending enough time and attention to see through the illusion,

0:42:55.520 --> 0:42:58.120
<v Speaker 1>nor are you close enough to see it show through.

0:42:58.840 --> 0:43:01.800
<v Speaker 1>The more time you have and the less distance between

0:43:01.880 --> 0:43:04.840
<v Speaker 1>you and the animatronic figure, the harder it is to

0:43:04.920 --> 0:43:09.800
<v Speaker 1>create and maintain that illusion of life. Without an interactive gaze,

0:43:09.880 --> 0:43:13.800
<v Speaker 1>without eye contact, it becomes pretty clear that the animatronic

0:43:13.840 --> 0:43:17.480
<v Speaker 1>figure has no real lifelike quality to it. If you

0:43:17.520 --> 0:43:20.520
<v Speaker 1>were to stand close to one of these older animatronic figures,

0:43:20.920 --> 0:43:23.280
<v Speaker 1>you would notice that it's not really looking at anything

0:43:23.320 --> 0:43:26.520
<v Speaker 1>in particular, and that its movements are a matter of routine.

0:43:26.880 --> 0:43:32.000
<v Speaker 1>It's not a demonstration of spontaneous or seemingly spontaneous decisions.

0:43:32.640 --> 0:43:35.799
<v Speaker 1>The Interactive Gaze project takes this a step up. The

0:43:35.880 --> 0:43:39.080
<v Speaker 1>robot can recognize and acknowledge someone that is in the

0:43:39.160 --> 0:43:43.200
<v Speaker 1>robot's presence. It can direct its focus and attention at

0:43:43.239 --> 0:43:46.560
<v Speaker 1>that person. This definitely is a step up in creating

0:43:46.560 --> 0:43:50.560
<v Speaker 1>that illusion and works at much smaller distances of viewing

0:43:51.040 --> 0:43:54.200
<v Speaker 1>than the older methods do, but the engineers admit it

0:43:54.320 --> 0:43:57.759
<v Speaker 1>still has limitations. They point out that their approach, as

0:43:57.800 --> 0:44:00.680
<v Speaker 1>it stands, might serve as a way to preserve that

0:44:00.760 --> 0:44:04.320
<v Speaker 1>illusion of life for a couple of minutes at the most,

0:44:04.640 --> 0:44:08.040
<v Speaker 1>but beyond that the illusion would start to fade away.

0:44:08.280 --> 0:44:10.880
<v Speaker 1>They point out that as the distance between the robot

0:44:10.960 --> 0:44:14.359
<v Speaker 1>and the audience decreases, and as the time of observing

0:44:14.360 --> 0:44:19.560
<v Speaker 1>the robot increases, you have to incorporate increasingly complex and

0:44:19.640 --> 0:44:24.040
<v Speaker 1>natural behaviors to maintain that illusion of life, and interactive

0:44:24.080 --> 0:44:27.359
<v Speaker 1>gaze is just one element. Others could include stuff like

0:44:27.440 --> 0:44:31.440
<v Speaker 1>a display of emotion. The bust has sort of a

0:44:31.440 --> 0:44:34.040
<v Speaker 1>little bit of this. It can imply a

0:44:34.080 --> 0:44:36.640
<v Speaker 1>sense of emotion to some degree with the way it

0:44:36.719 --> 0:44:40.240
<v Speaker 1>holds its eyes, but because it doesn't have any movement

0:44:40.239 --> 0:44:42.960
<v Speaker 1>of its jaw or lips, and doesn't have any other

0:44:43.840 --> 0:44:48.400
<v Speaker 1>means of really indicating emotion, this is pretty limited. So

0:44:48.480 --> 0:44:52.080
<v Speaker 1>perhaps a robot that can hear and parse and respond

0:44:52.080 --> 0:44:54.440
<v Speaker 1>to speech, you know, sort of like the voice activated

0:44:54.480 --> 0:44:57.200
<v Speaker 1>digital assistants that are familiar to us, and you know,

0:44:57.239 --> 0:45:00.360
<v Speaker 1>probably like the Amazon Echo or the iPhone or Android phones.

0:45:00.960 --> 0:45:04.640
<v Speaker 1>That might be something that really pushes that illusion of life.

0:45:05.000 --> 0:45:09.279
<v Speaker 1>And of course there's also the physical appearance aspect. Now,

0:45:09.280 --> 0:45:12.800
<v Speaker 1>you would never mistake this animatronic bust for a human.

0:45:13.120 --> 0:45:16.319
<v Speaker 1>As I mentioned before, it's pretty creepy looking. It's got a

0:45:16.360 --> 0:45:19.879
<v Speaker 1>plastic and skeletal quality to it that prevents you from

0:45:19.880 --> 0:45:23.640
<v Speaker 1>ever mistaking it as a person. But the team points

0:45:23.640 --> 0:45:27.240
<v Speaker 1>out the physical appearance of the robot taps back into

0:45:27.320 --> 0:45:31.000
<v Speaker 1>that problem of the uncanny valley. It might take a while

0:45:31.080 --> 0:45:34.800
<v Speaker 1>to create something that's convincing enough and yet not repulsive

0:45:36.120 --> 0:45:39.560
<v Speaker 1>to work as a robotic human animatronic. If you make

0:45:39.600 --> 0:45:43.240
<v Speaker 1>it look too real, it's going to give people the creeps.

0:45:43.880 --> 0:45:46.680
<v Speaker 1>I think, at least in the short term, we're more

0:45:46.760 --> 0:45:49.400
<v Speaker 1>likely to see this technology used to create characters that

0:45:49.440 --> 0:45:53.960
<v Speaker 1>are human like but still distinctly not human, in order

0:45:54.000 --> 0:45:58.000
<v Speaker 1>to avoid that negative reaction when the uncanny valley gets involved.

0:45:58.320 --> 0:46:01.200
<v Speaker 1>In other words, using this to create an animatronic

0:46:01.239 --> 0:46:04.799
<v Speaker 1>figure that looks a lot like a cartoon character, even

0:46:04.840 --> 0:46:09.160
<v Speaker 1>a human cartoon character because well, you recognize the cartoon

0:46:09.239 --> 0:46:13.680
<v Speaker 1>character as representing a human. Cartoon characters don't really look

0:46:13.719 --> 0:46:18.239
<v Speaker 1>like humans. Usually they look like they have human qualities

0:46:18.280 --> 0:46:21.719
<v Speaker 1>to them, but they still have cartoonish qualities to them,

0:46:21.760 --> 0:46:24.480
<v Speaker 1>so you wouldn't mistake them for actually being human. Or

0:46:24.560 --> 0:46:27.040
<v Speaker 1>you just, you know, go the robot route or some

0:46:27.120 --> 0:46:31.239
<v Speaker 1>sort of animal creature and you sidestep that problem. The

0:46:31.280 --> 0:46:34.440
<v Speaker 1>engineers conclude their paper by talking about how the attention

0:46:34.520 --> 0:46:37.960
<v Speaker 1>engine could, with some evolution, work for a lot of

0:46:37.960 --> 0:46:42.160
<v Speaker 1>different applications. So imagine that you design an animatronic that

0:46:42.239 --> 0:46:46.400
<v Speaker 1>represents someone who's really frightened, and that kind of character

0:46:46.560 --> 0:46:49.600
<v Speaker 1>might have a very low threshold for stimuli to push

0:46:49.640 --> 0:46:52.400
<v Speaker 1>it to a higher state of attentiveness, right like a

0:46:52.440 --> 0:46:54.920
<v Speaker 1>little sound might cause that character to perk up and

0:46:54.960 --> 0:46:59.440
<v Speaker 1>look around quickly because that character is supposed to

0:46:59.440 --> 0:47:02.440
<v Speaker 1>be frightened. Or you could create something like, you know,

0:47:02.480 --> 0:47:06.279
<v Speaker 1>an absent minded book lover who only glances up from

0:47:06.440 --> 0:47:10.400
<v Speaker 1>whatever book they're studying if something really exciting is happening,

0:47:10.400 --> 0:47:14.120
<v Speaker 1>otherwise they just ignore it. They also talk about the

0:47:14.160 --> 0:47:18.520
<v Speaker 1>bottom up approach to layering behaviors and deciding which behaviors

0:47:18.560 --> 0:47:23.560
<v Speaker 1>will replace others that might inhabit a lower state. That

0:47:23.680 --> 0:47:26.520
<v Speaker 1>is really fascinating to me. Now, we're still a far

0:47:26.560 --> 0:47:29.560
<v Speaker 1>way off from seeing these sorts of technologies make their

0:47:29.600 --> 0:47:33.279
<v Speaker 1>way into official attractions, but based on what I've seen

0:47:33.440 --> 0:47:36.160
<v Speaker 1>and read, I wouldn't be surprised to find them making

0:47:36.200 --> 0:47:38.879
<v Speaker 1>their way into Disney parks in the next say, five

0:47:38.960 --> 0:47:42.760
<v Speaker 1>years or so, depending on how the company budgets stuff.

0:47:42.800 --> 0:47:46.839
<v Speaker 1>Of course, the pandemic has created a particularly tricky situation

0:47:46.920 --> 0:47:49.839
<v Speaker 1>for that branch of the Disney Company, even as other

0:47:49.880 --> 0:47:53.319
<v Speaker 1>branches of that company continue its global domination of all

0:47:53.360 --> 0:47:57.759
<v Speaker 1>things entertainment. But the technology itself and the design philosophy

0:47:57.760 --> 0:48:00.480
<v Speaker 1>of how to program a robot to behave as if

0:48:00.560 --> 0:48:03.600
<v Speaker 1>it were doing so naturally, it's really neat to me.

0:48:04.080 --> 0:48:06.040
<v Speaker 1>And as I said at the beginning, the paper is

0:48:06.040 --> 0:48:09.320
<v Speaker 1>available for free to read, so if you want to

0:48:09.400 --> 0:48:13.359
<v Speaker 1>check that out, I highly recommend it. I think it

0:48:13.480 --> 0:48:16.440
<v Speaker 1>is a fascinating piece of work, and as I said,

0:48:17.000 --> 0:48:20.840
<v Speaker 1>it's not that difficult to follow. There's some math stuff

0:48:20.880 --> 0:48:23.160
<v Speaker 1>that will probably, you know, lose a lot of you,

0:48:23.239 --> 0:48:25.719
<v Speaker 1>but it lost me. I'm not trying to

0:48:25.760 --> 0:48:28.279
<v Speaker 1>shame you. I couldn't follow all of it, but it

0:48:28.360 --> 0:48:31.000
<v Speaker 1>is otherwise pretty easy to understand. And like I said,

0:48:31.000 --> 0:48:36.279
<v Speaker 1>it is titled Realistic and Interactive Robot Gaze, G-A-

0:48:36.600 --> 0:48:40.040
<v Speaker 1>Z-E, so check that out. It is really a

0:48:40.080 --> 0:48:43.759
<v Speaker 1>neat paper. Just I apologize for the pictures that are

0:48:43.760 --> 0:48:48.080
<v Speaker 1>in there because they're creepy as all get out. That's

0:48:48.080 --> 0:48:50.640
<v Speaker 1>it for me. I hope you guys enjoyed this episode.

0:48:50.840 --> 0:48:53.400
<v Speaker 1>If you have suggestions for future topics I should tackle

0:48:53.520 --> 0:48:56.360
<v Speaker 1>in TechStuff, let me know on Twitter. The handle

0:48:56.480 --> 0:48:59.840
<v Speaker 1>is TechStuffHSW and I'll talk to

0:48:59.880 --> 0:49:08.160
<v Speaker 1>you again really soon. TechStuff is an iHeartRadio

0:49:08.280 --> 0:49:12.000
<v Speaker 1>production. For more podcasts from iHeartRadio, visit

0:49:12.040 --> 0:49:15.080
<v Speaker 1>the iHeartRadio app, Apple Podcasts, or wherever you

0:49:15.200 --> 0:49:16.520
<v Speaker 1>listen to your favorite shows.