[00:00] Good morning. Hi, my name is Amelie, and I'm going to be the session chair for the morning, so it's my great pleasure to introduce Koray Kavukcuoglu, who is going to give an invited talk. Koray is a Director of Research at DeepMind and one of the star researchers in our community. He has contributed to many highly influential projects at DeepMind, such as spatial transformer networks, autoregressive generative models such as pixel recurrent networks and WaveNet, and deep reinforcement learning for playing Atari games and AlphaGo. Today he will talk about going from generative models to generative agents. So let's welcome Koray.

[Applause]
[Music]
[00:53] Thank you very much, Amelie, for the very nice introduction, and thanks everyone for being here; it's an absolute pleasure. So, as was just mentioned, I'm going to try to talk about unsupervised learning in general, starting from generative models, maybe the classical way, and then give another view that I think is quite interesting and that we have been working on recently.

When I think about what the important things are for us to do as a community, I think everyone here more or less agrees that, in the end, what is important is to be doing unsupervised learning. We have realized that supervised learning has all sorts of successes, but in the end unsupervised learning is the next frontier. When I think about unsupervised learning, there are different explanations that come to my mind, and when talking to people I think we all have somewhat different opinions on this.

One common explanation is: we have an unsupervised learning algorithm, we run it on our data, and what we expect is for the algorithm to understand our data and explain our data, or our environment. We expect the algorithm to learn the intrinsic properties of our data, of our environment, and then to be able to explain the data through those properties. But most of the time, because of the kinds of models that we use, what happens is that in the end we resort to looking at samples. When we look at the samples, we try to see whether our model really understood the environment; if it understood the environment, then the samples should be meaningful. Of course we also look at all sorts of objective measures that we use during training, like Inception scores or log-likelihoods, but in the end we always come back to samples to understand whether our model can really explain what's going on in the environment.

The other general explanation that we all use is that the goal of unsupervised learning is to learn rich representations. It's already embedded in the name of this conference: the main goal of deep learning and unsupervised learning is learning those representations. But when we think about those representations, again, this explanation doesn't give us an objective measure. How are we going to decide whether those representations are good and useful? To me the most important bit is this: if we have good, rich representations, then they are useful for generalization and for transfer. If you have a good unsupervised learning model and it can give us good representations, then we can get generalization.

So what I'm going to do today is also tie this together with something else that is, for me, very important. A big chunk of the work that we have been doing at DeepMind, and that I have been doing, is about agents and reinforcement learning, and in this talk I'm going to look at unsupervised learning both in the classical sense of learning a generative model, and in the sense of learning an agent that can do unsupervised learning.
[04:03] So I'm going to start from the WaveNet model. As many of you hopefully know, it is a generative model of audio; it's a pure deep learning model, and with it you can model any audio signal, like speech and music, and get really realistic samples out of it. The next thing I'm going to do is explain another, newer approach to unsupervised learning that I find really interesting, which is based on deep reinforcement learning: learning an agent that does unsupervised learning. That model, called SPIRAL, is based on a new agent architecture that we have been working on and published recently, called IMPALA. It's a very large, highly scalable, efficient off-policy learning agent architecture that we use in SPIRAL to do unsupervised learning. The interesting bit about the SPIRAL work is that it generalizes by using a kind of tool space: tools that we as people have created so that we can solve not one specific problem but many different problems. By using the interface of a tool, and having an agent, you can now actually learn a generative model of your environment.
[05:19] All right, so without more delay, the first thing I'm going to quickly introduce is the WaveNet model. WaveNet is a generative model of audio. As I said, it models the raw audio signal; it doesn't use any sort of intermediate interface to model the audio. Audio in general is very high dimensional: we have moved up a bit since, but the standard audio signal that we started with at the beginning was 16,000 samples per second. If you compare that with our usual language modeling and machine translation kinds of tasks, it is several orders of magnitude more data, so the dependencies that one needs to capture to model audio well are very long. What this model does is model one sample at a time, using a softmax distribution to model each sample conditioned on all the previous samples of the signal. When you look at it more closely, it is an architecture that has quite a bit of resemblance to the PixelCNN model, which maybe some of you are also familiar with. In the end it is a stack of multiple convolutional layers. To be a little more specific, it has residual blocks, you use multiples of those residual blocks, and in each residual block there are dilated convolutional layers stacked on top of each other. Through those dilated convolutional layers, which are causal convolutions, we can model very long dependencies; that is how we get the modelling of dependencies in time.
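To make the dilated causal convolution idea concrete, here is a minimal sketch, assuming PyTorch (my own illustration, not DeepMind's code): each layer only looks at past samples, and doubling the dilation at every layer grows the receptive field exponentially, so a block of ten layers with kernel size 2 already covers 1024 time steps.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv(nn.Module):
    """One causal, dilated 1-D convolution: the output at time t sees only inputs <= t."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.left_pad = dilation          # (kernel_size - 1) * dilation, with kernel_size = 2
        self.conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)

    def forward(self, x):                 # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))  # pad on the left only, so the convolution stays causal
        return self.conv(x)

# Dilations 1, 2, 4, ..., 512: receptive field = 1 + (1 + 2 + ... + 512) = 1024 samples.
layers = nn.ModuleList(CausalDilatedConv(32, 2 ** i) for i in range(10))

x = torch.randn(1, 32, 16000)             # one second of 16 kHz audio, embedded into 32 channels
for layer in layers:
    x = torch.relu(layer(x))              # the real model uses gated activations plus residual and skip connections
print(x.shape)                             # torch.Size([1, 32, 16000])

On top of such a stack, the paper's model predicts each output sample with a 256-way softmax over quantized amplitude values.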
[07:06] Now, one of the biggest design considerations in WaveNet is that it is designed to be very efficient during training, because during training all the targets are known: you run the whole signal through the network at once, just like a convolutional net, and because you have the targets you get your error signal and propagate it back. So training is very efficient. But when it comes to sampling time, in the end this is an autoregressive model, and through those causal convolutions you need to generate one sample at a time. So if you are sampling at, say, 24 kilohertz, 24,000 samples per second, you need to generate one sample at a time, just like you see in this animation. This is painful, but in the end it works quite well and we can generate very high quality audio with it.
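The asymmetry between training and sampling is easy to see in a sketch. The toy loop below is my own illustration; next_sample_logits is a hypothetical stand-in for a trained WaveNet forward pass. It shows why naive generation is slow: producing one second of 24 kHz audio means 24,000 sequential passes through the network.

import numpy as np

def sample_autoregressively(next_sample_logits, n_samples, n_classes=256):
    """Draw samples one at a time, each conditioned on everything generated so far."""
    history = []
    for _ in range(n_samples):
        logits = next_sample_logits(history)        # one full network evaluation per output sample
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        history.append(np.random.choice(n_classes, p=probs))
    return np.array(history)

# Toy stand-in: a uniform distribution over the 256 quantized amplitude values.
audio = sample_autoregressively(lambda history: np.zeros(256), n_samples=100)
print(audio[:10])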
[07:58] So what I want to do is actually make you listen to unconditional samples from this model. We model the speech signal without any conditioning on text or anything; we just take the audio signal, model it with WaveNet, and then sample. This is the kind of thing you get. As you can hopefully hear, the quality is very high, and this is modelling the raw audio signal, completely unconditionally. Sometimes you even hear short words, like "okay", and if you listen, the intonation and everything sounds quite natural; sometimes it feels like you are listening to someone speaking in a language that you don't know. The main characteristics of the signal are all captured there, so in terms of dependencies, something like several thousand samples of dependencies are actually properly and correctly modelled.

And then, of course, what you can do is augment this model by conditioning on a text signal associated with the audio you want to generate. By conditioning on the text, you now have a conditional generative model that actually solves a real-world problem just by itself, with end-to-end deep learning: from the text you create linguistic embeddings, using those linguistic embeddings you generate the signal, and then it starts talking. So it's a solution to the whole text-to-speech synthesis problem, which, as you know, is very commonly used in the real world.
[10:03] So when we did the WaveNet model, and this was almost two years ago now, we looked at the quality when we use it as a TTS model. In green, what you see is the quality of human speech, measured through mean opinion scores; in blue you see WaveNet, and the other colors are the best other models around at the time. You can see that WaveNet closed the gap between human-quality speech and the other models by a big margin. At the time this really got us excited, because now we actually had a deep learning model that comes with all the flexibility and advantages of deep learning, and at the same time it models raw audio at very high quality.

I could play text-to-speech samples generated by this model, but actually, and this is what I'm going to go into next, if you are using Google Assistant right now you are already hearing it, because this is already in production. Anyone who is using Google Assistant and querying Wikipedia and things like that: the speech generated there is actually coming from the WaveNet model. What I want to do is explain how we did that, and that brings me to the next project we did in the WaveNet domain: the Parallel WaveNet project.
[11:24] Of course, when you have a research project and at some point you realize that it actually lends itself to the solution of a real-world problem, and you want to put it into production in a very challenging environment, it requires much more than our little research group. This was a big collaboration between the DeepMind research and applied teams and the Google speech team. In this slide, what I show are the basic ingredients of how we turned the WaveNet architecture into a feed-forward, parallel architecture. What we realized pretty soon when we attempted to put a system like this into production was that speed, of course, is very important, and quality is very important, but the speed requirement is not just to run in real time: the constraints we were chasing are orders of magnitude faster than real time, even being able to run in essentially constant time. And when the constraint becomes being able to run in constant time, the only thing you can do is create a feed-forward network and parallelize the signal generation. So that is what we did.

In this slide, at the top, what you see is the usual WaveNet model; we call it the teacher. In this setting the teacher WaveNet is pre-trained, it is fixed, and it is used as a scoring function. At the bottom, what you see is the generator, which we call the student. The student is again an architecture that is very close to WaveNet, but it is run as a feed-forward convolutional network. The way it is run and trained has two components. One component comes from WaveNet, which we know is very efficient in training, as I said, but slow in sampling. The other is based on the inverse autoregressive flow work that was done by Kingma and colleagues at OpenAI last year. This structure gives us the capability to take an input noise signal and slowly transform that noise into a proper distribution, which is going to be the speech signal. So the way we train this is: random noise goes in, together with the linguistic features, through layers and layers of these flows, and that random noise gets transformed into a speech signal. That speech signal goes into the teacher WaveNet; WaveNet is pretty much the best scoring function we can use, because it's a density model. The teacher scores the generated signal, from that score we get the gradients back into the generator, and then we update the generator. We call this process probability density distillation.
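For reference, the distillation term is a KL divergence between the student's distribution and the fixed teacher's. Writing P_S for the student and P_T for the teacher (my notation, not the slide's), it splits into a cross-entropy term minus the student's entropy:

\[
D_{\mathrm{KL}}\!\left(P_S \,\|\, P_T\right) \;=\; H(P_S, P_T) \;-\; H(P_S),
\]

so the student is pushed to put probability mass where the teacher assigns high density, while the entropy term discourages it from collapsing onto a single high-likelihood output.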
[14:15] But of course, when you are trying to do real-world things, and the signals are as challenging as speech, that by itself is not enough. I have highlighted two components here: one of them, as I said, is the WaveNet scoring function; the other is a power loss. What happens is that when we train the model in this manner alone, the generated signal tends to be very low energy, sort of like whispering: someone speaks, but they are whispering. So during training we add this extra loss that tries to conserve the energy of the generated speech. With these two, the WaveNet scoring and the power loss, we were already getting very high quality speech, but the constraints are very tough, so we trained yet another WaveNet model; we sort of used WaveNet everywhere. We are generating through a WaveNet-like convolutional student, we are using WaveNet as a scoring function, and we again trained another WaveNet, this time used as a speech recognition system; that is the perceptual loss that you see there. During training, of course, you have the text and the corresponding speech; we generate the corresponding speech through our generator, give it to the speech recognition system, and the speech recognition system now needs to decode the generated signal back into that text. We get the error from there and propagate it back into our generator. That's another quality improvement we get, by using speech recognition as a perceptual loss in our generation system.

The last thing we did was add a contrastive term: we generate a signal conditioned on some text, and you can create a contrastive loss by saying that the signal generated with the corresponding text should be different from the signal you would get if it were conditioned on a different text. So, more specifically, in the end we end up with these four terms: at the top, the original use of WaveNet as a scoring function, which is the probability density distillation idea; then the power loss, which internally uses Fourier transforms to conserve the energy; the contrastive term; and finally the perceptual loss that does the speech recognition.
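As a rough illustration of how such an objective is assembled, here is a sketch of just the power-loss term, assuming PyTorch. This is my own reconstruction under stated assumptions: the FFT size, hop length, and the way spectra are averaged are guesses, not the production settings, and the full objective would add the distillation, perceptual, and contrastive terms with their own weights.

import torch

def power_loss(generated, reference, n_fft=512, hop_length=128):
    """Penalize mismatch in average spectral power, so the student does not 'whisper'."""
    def avg_power(x):                                    # x: (batch, time) waveform
        window = torch.hann_window(n_fft, device=x.device)
        spec = torch.stft(x, n_fft=n_fft, hop_length=hop_length,
                          window=window, return_complex=True)
        return (spec.abs() ** 2).mean(dim=-1)            # average power per frequency bin over time
    return torch.mean((avg_power(generated) - avg_power(reference)) ** 2)

# total = distillation + w_power * power_loss(student_audio, real_audio)
#       + w_perceptual * perceptual_loss + w_contrastive * contrastive_loss   (weights are assumptions)
print(power_loss(torch.randn(2, 16000), torch.randn(2, 16000)))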
[16:42] With all of these in place, what we did, of course, was look at the quality again. What I'm showing here is the quality with respect to the best non-WaveNet models, roughly a year after the original research, pretty much exactly a year. During that time the best speech synthesis models also improved, but WaveNet was still better than anything else, and the new model, Parallel WaveNet, exactly matches the quality of the original WaveNet. What I'm showing here is three different US English voices and also Japanese, and this is the kind of thing we always want from deep learning: the ability to generalize to new datasets and new domains. We developed all of this on practically one single US English voice, and it was just a matter of collecting or getting another dataset, from another speaker or another language, say a speaker speaking Japanese; you just run it on that, and there you go, you have a production-quality speech synthesis system, just by doing that. This is the kind of thing we really like about deep learning, and if you are thinking about deep learning and about unsupervised learning, I think this is a very good demonstration of it.
[17:59] Before switching to the next topic, I also want to mention that we have done some further work on this, called WaveRNN, which was recently published; I encourage you to look into that one too, it's a very interesting piece of work, also for generating speech at very high speed. The next thing I want to talk about is the IMPALA architecture, the new agent architecture that I mentioned, because WaveNet is unsupervised learning in the classical sense, a model that can actually solve a real-world problem. The next thing I want to start talking about is this new, different way of doing unsupervised learning; but for that, the other exciting ingredient is to be able to do deep reinforcement learning at scale.
[18:47] All right, so I want to motivate why we want to push our deep reinforcement learning models further and further. Most of the time, because this is a new area, what we do is take fairly simple tasks in simple environments and try to train an agent that solves a single task in that environment well. What we want to do is go further than that; again, going back to the point of generalization and being able to solve multiple tasks, we have created a new task set. This is an open-source task set: we have an open-source environment called DeepMind Lab, and as part of that we have created this new task set, DMLab-30. It is 30 environments covering tasks around language, memory, navigation, and those kinds of things, and the goal is not to solve each one of them individually; the goal is to have one single agent, one single network, that solves all of those tasks at the same time. There is nothing custom in that agent that is specific to any single one of these environments. When you look at those environments, and I'm showing some of them here, the agent has a first-person view: it is in a maze-like environment, it gets a first-person camera input, and it can navigate around, go forward and backward, rotate, look up and down, jump, and so on. It is solving all sorts of different tasks designed to test different kinds of abilities, but the goal, as I said, is to solve all of them at the same time.

One thing that becomes really important in this case is, of course, the stability of our algorithms, because now we are not solving one single task, we are solving 30 of them, and we want really stable models, because we don't have the chance to tune hyperparameters for one single task anymore. The other thing that becomes really important is task interference: what we hope to see, again by using deep learning in this multi-task setting, is positive transfer rather than task interference, and we hope to demonstrate this in this challenging reinforcement learning domain, too.
[21:03] Okay, I realized that I needed to put in a slide about why deep reinforcement learning, because, a little to my surprise, there was actually not much reinforcement learning at this conference this year, and I wanted to touch on why I think it is important for the deep learning community, this community, to actually do deep reinforcement learning. To me, if one of the goals that we work towards here is AI, then it is at the core of all of it. Reinforcement learning is a very general framework for learning sequential decision-making tasks, and deep learning, on the other hand, is the best set of algorithms we have for learning representations. The combination of these two is the best answer we have so far for learning very good state representations for very challenging tasks, not just for solving toy domains but for solving challenging real-world problems. Of course, there are many open problems there. Some that are interesting, at least to me, are the idea of separating the computational power of a model from the number of weights or layers it has, and, again going back to unsupervised learning, learning to transfer: building these deep reinforcement learning models with the idea to actually generalize and transfer.
[22:39] So, the IMPALA agent is based on another piece of work that we did a couple of years ago, called the asynchronous advantage actor-critic, the A3C model. In the end it's a policy gradient method. What you have, as I've tried to explain cartoonishly in the figure, is that at every time step the agent sees the environment, and at that time step it outputs a policy distribution and also a value function. The value function is the agent's expectation of the total amount of reward it is going to get until the end of the episode, being in that state, and the policy is the distribution over the actions that the agent has. At every time step the agent looks at the environment and updates its policy, so that it can act in the environment, and it updates its value function. The way you train this is with the policy gradient, and intuitively it is actually very simple: the gradient of the policy is scaled by the difference between the total reward that the agent actually gets in the environment and the baseline, and the baseline is the value function. What that means is that if the agent ends up doing better than what the value function, its own assumption, predicted, then that's a good thing: you have a positive gradient and you reinforce your understanding of the environment. If the agent does worse than expected, so the value was higher than the total reward you got, then you have a negative gradient and you need to shuffle things around. And the way you learn the value function is with the usual n-step TD error.
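In symbols (my notation, following the standard A3C formulation rather than the slide), the policy parameters theta are updated along

\[
\nabla_\theta \log \pi(a_t \mid s_t; \theta)\,\big(R_t - V(s_t; \theta_v)\big),
\qquad
R_t = \sum_{i=0}^{n-1} \gamma^i r_{t+i} \;+\; \gamma^n V(s_{t+n}; \theta_v),
\]

and the value function is regressed toward the same n-step return by minimizing \(\big(R_t - V(s_t;\theta_v)\big)^2\). The sign of the advantage \(R_t - V(s_t;\theta_v)\) is exactly the "did better / did worse than expected" intuition above.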
[24:17] Now, that was the actor-critic part; the asynchronous part of the A3C algorithm is that it is composed of multiple actors, and each actor independently operates in the environment: it collects observations, acts in the environment, computes the policy gradients with respect to the parameters of its network, and then sends those gradients back to the parameter server. The parameter server collects all these gradients from all the different actors, combines them together, and then shares the updated parameters with all the actors. What happens in this case, as you increase the number of actors, is the usual asynchronous stochastic gradient descent setup: as the number of actors increases, the staleness of the gradients becomes a problem. So in the end, distributing the experience collection is actually very advantageous, it's very good, but communicating gradients can become a bottleneck as you try to really scale things up. So for that, we tried a different architecture.
[25:27] The idea of a central server is actually quite useful, but rather than using it just to accumulate the parameter updates, the idea here is to turn the centralized component into a learner, so the whole learning algorithm is contained in it. What the actors do is only act in the environment, not compute gradients or anything; they send the observations back to the learner, and the learner sends the parameters back. In this way you are completely decoupling what happens with your experience collection, in your environments, from your learning algorithm, and you gain a lot of robustness to noise in your environments: sometimes rendering times vary, some environments are slow, some are fast, and all of that is completely decoupled from your learning algorithm. But of course, what you need is a good learning algorithm that can deal with that kind of variation. So in the end, in IMPALA, what we have is a very efficient, decoupled backward pass, if you will: actors generate trajectories, as I said, but that decoupling creates off-policyness. The behaviour policy in the actors, if you will, is separate from the target policy in the learner.
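To make the decoupling concrete, here is a toy sketch of the actor/learner split in plain Python, entirely my own illustration with dummy data; the real system batches trajectories on accelerators and applies the V-trace correction described next. The point is structural: actors ship trajectories, never gradients, and the learner ships parameters back.

import queue
import threading
import numpy as np

trajectory_queue = queue.Queue()                  # actors -> learner: observations, actions, behaviour policy
params = {"version": 0, "weights": np.zeros(4)}   # learner -> actors: latest policy parameters
param_lock = threading.Lock()

def actor(n_trajectories=50):
    for _ in range(n_trajectories):
        with param_lock:
            local = {"version": params["version"], "weights": params["weights"].copy()}
        # Act in a (dummy) environment with the local copy; record which policy generated the data.
        trajectory = {"obs": np.random.randn(20, 4),
                      "actions": np.random.randint(0, 2, size=20),
                      "behaviour_version": local["version"]}
        trajectory_queue.put(trajectory)          # experience, not gradients, goes to the learner

def learner(n_updates=50, batch_size=4):
    for step in range(n_updates):
        batch = [trajectory_queue.get() for _ in range(batch_size)]
        # The real learner would compute the V-trace corrected policy-gradient update from `batch` here.
        with param_lock:
            params["weights"] += 0.01 * np.random.randn(4)   # placeholder update
            params["version"] = step + 1

threads = [threading.Thread(target=actor) for _ in range(4)] + [threading.Thread(target=learner)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("learner applied", params["version"], "updates")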
[26:45] So what we need is off-policy learning. Of course there are many off-policy learning algorithms, but we really wanted a policy gradient method, and for that we developed this new method called V-trace; it's an off-policy advantage actor-critic algorithm. The idea of V-trace is that it uses truncated importance sampling ratios to come up with an estimate of the value: because there is this mismatch between the learner and the actors, you need to correct for that difference. The good thing about it is that the algorithm makes a smooth transition between the on-policy case and the off-policy case. When the actors and the learner are completely in sync, so you are in the on-policy case, the algorithm boils down to the usual A3C update with the n-step Bellman equation; as they become more separated, the correction in the algorithm kicks in and you get the corrected estimate. The algorithm has two main components, two truncation factors, to control two different aspects of off-policy learning. One of them is rho, which controls which value function the algorithm is going to converge towards: the value function that corresponds to the behaviour policy, or the value function that corresponds to the target policy in the learner. The other one, the c factor, controls the speed of convergence: by controlling the truncation it can increase or decrease the variance in learning, and it can have an effect on the speed of convergence.
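For completeness, the V-trace target has roughly the following form (from the IMPALA paper, Espeholt et al. 2018; notation mine, writing pi for the learner's target policy and mu for the actors' behaviour policy):

\[
v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{\,t-s}\Big(\prod_{i=s}^{t-1} c_i\Big)\,\delta_t V,
\qquad
\delta_t V = \rho_t\big(r_t + \gamma V(x_{t+1}) - V(x_t)\big),
\]

with truncated importance weights \(\rho_t = \min\!\big(\bar\rho,\ \tfrac{\pi(a_t\mid x_t)}{\mu(a_t\mid x_t)}\big)\) and \(c_i = \min\!\big(\bar c,\ \tfrac{\pi(a_i\mid x_i)}{\mu(a_i\mid x_i)}\big)\). The truncation level \(\bar\rho\) determines which value function the estimate converges to, and \(\bar c\) trades variance against contraction speed; when the policies coincide, all the ratios are 1 and the target reduces to the usual n-step return.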
[28:24] Now, when we tested this, the goal of course is to test on all environments at once, but what we wanted to do first was look at single tasks. So we looked at five different environments, and we see that in these environments the IMPALA algorithm is always very stable and performs at the top. The comparisons here are the IMPALA algorithm, a batched A2C method, and different versions of the A3C algorithm, and you can see that IMPALA and batched A2C are always performing at the top; IMPALA, the dark blue curve, seems to be doing fine. This gives us the feeling that, okay, we have a nice algorithm. Of course, the other thing that is very important, and that is discussed a lot, is the stability of these algorithms. I actually really like these plots; since the A3C work we keep looking at them and we always put them in the papers. In the plot here, on the x-axis we have the hyperparameter combinations: when you train any model, what all of us do is some sort of hyperparameter sweep, and here we are looking at the final score achieved with every single hyperparameter setting, sorted. In this kind of plot, the curves that are at the top and that are the most flat are the better performing and most stable algorithms. What we see here is that IMPALA is, of course, achieving better results, but not because there is one lucky hyperparameter setting; it is consistently at the top. You can see it is not completely flat, because in the end we are searching over three orders of magnitude in the parameter settings, but we can see that the algorithm is actually quite stable.
Now, when we look at our main goal here: on the x-axis we have wall-clock time and on the y-axis we have the normalized score, and the red line that you see there is A3C. You can see that IMPALA not only achieves much better results, it achieves them much, much faster. The other thing is comparing the green and the orange lines: that is the comparison between training IMPALA in an expert setting versus a multi-task setting, and we see that the multi-task agent achieves better scores, and faster, which again suggests that we are actually seeing positive transfer. It's a like-for-like setting: all the details of the network and the agent are the same; in one case you train one network per task, and in the other case you train the same network on all the tasks, and what you achieve is a better result, because of the positive transfer between those tasks. And if you give IMPALA more resources, you end up with this almost vertical takeoff: you can actually solve this challenging thirty-task domain in under 24 hours, given the resources. That is the kind of algorithmic power that we want, to be able to train these very highly scalable agents. Now, why do we want to do that? That is the point I want to come to next, and in the final part, this is the new SPIRAL algorithm that I want to talk about.
[31:49] Just quickly going back to the original ideas that I talked about: unsupervised learning is also about explaining environments and generating samples, but maybe we can generate samples by explaining environments. We talked about the fact that when we have deep learning models like WaveNet we can generate amazing samples, but maybe there is a different, less implicit way to do these things, in the sense that when we generate these samples they come with some explanation, and that explanation can go through using some tools. In this particular case, what we are going to do is use a painting tool, and we are going to learn to control this painting tool. It's a real drawing program, and we are going to generate a program that the painting tool will use to generate the image. The main idea I want to convey is that by learning how to use tools that are already available, we can start thinking about different kinds of generalization, which I'll try to demonstrate. In the real world we have a lot of examples of programs, their executions, and the results of those programs: they can be arithmetic programs, plotting programs, or even architectural blueprints. Because we have information about that generation process, when we see the results we can go and try to infer what was the program, what was the blueprint, that generated that particular input. We can do this, and the goal is to be able to do this with our agents, too.
[33:22] Specifically, we are going to use this environment called libmypaint. It is actually a professional-grade, open-source drawing library, and it is used worldwide by many artists. We are using a limited interface, basically learning to draw brushstrokes, and we are going to have an agent that does that. The agent, in the end called SPIRAL, has three main components. First of all there is the agent that generates the brushstrokes; I like to see that as writing the program. The second is the environment, libmypaint: the brushstroke commands come in, and the environment turns them into brushstrokes on the canvas. And that canvas goes into a discriminator, and the discriminator is trained like in a GAN: it looks at the generated image, asks whether this looks like a real drawing, and gives a score. As opposed to the usual GAN training, rather than propagating the gradients back, we take that score and train our agent with it as a reward. So when you think about these three components coming together, you have an unsupervised learning model similar to GANs, but rather than generating in pixel space we generate in this program space, and the training is done through a reward that the agent itself also learns. We are trusting another neural network, just like in the GAN setup, to guide learning, but not through its gradients, just through its score function. In my opinion, in certain cases this makes it very capable of using different kinds of tools.
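Here is a structural sketch of one episode of that loop, in Python with toy stand-ins (the function names and stroke parameterization are my own placeholders, not the SPIRAL interface): the agent emits stroke actions, the painting environment renders them, and the discriminator's scalar score on the final canvas becomes the reinforcement learning reward, with no gradients flowing from the discriminator into the agent.

import numpy as np

def spiral_episode(policy_step, render_stroke, discriminator_score, n_steps=20, size=64):
    """One episode: write a brush-stroke 'program', render it, score the result."""
    canvas = np.zeros((size, size))
    program = []
    for _ in range(n_steps):
        action = policy_step(canvas)              # brush-stroke parameters (endpoint, pressure, ...)
        canvas = render_stroke(canvas, action)    # libmypaint-style renderer, treated as a black box
        program.append(action)
    reward = discriminator_score(canvas)          # "does this look like a real drawing?"
    return program, canvas, reward                # the reward trains the agent; real images train the discriminator

# Toy stand-ins so the sketch runs; the real system uses the IMPALA agent,
# the libmypaint environment, and a learned convolutional discriminator.
random_policy = lambda canvas: np.random.rand(4)
blur_render   = lambda canvas, a: np.clip(canvas + 0.05 * np.random.rand(*canvas.shape), 0.0, 1.0)
mean_score    = lambda canvas: float(canvas.mean())
print(spiral_episode(random_policy, blur_render, mean_score)[2])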
[34:52] As I said, the reinforcement learning part of this agent is exactly the same as IMPALA. Now that we have an agent that can solve really challenging reinforcement learning setups, we take it and put it into this environment, augmented with the ability to learn a discriminative function to actually provide the reward. To emphasize again, the important thing here is: yes, we have an agent, but there is no environment that says, okay, this is the reward the agent should get. The reward generation is also inside the agent, thanks to all the unsupervised learning models being studied here; we specifically use a GAN setup there.
  1855. so can we generate the first thing of
  1856. 35:35
  1857. course we try is when you are doing
  1858. 35:36
  1859. unsupervised learning from scratch again
  1860. 35:38
  1861. you go back to illness right you start
  1862. 35:40
  1863. from M&S; and initially of course it's
  1864. 35:42
  1865. generating various crash pad like things
  1866. 35:44
  1867. but then through training it becomes
  1868. 35:47
  1869. better and better and better here in the
  1870. 35:49
  1871. middle you see that now the the agent
  1872. 35:52
  1873. learned - these are complete
  1874. 35:53
  1875. unconditional samples again the ones
  1876. 35:55
  1877. that you see in the middle it learn to
  1878. 35:57
  1879. create these trucks that generates these
  1880. 35:59
  1881. digits right to emphasize this this
  1882. 36:01
  1883. agent has never seen strokes that are
  1884. 36:04
  1885. coming from real people how we draw
  1886. 36:06
  1887. digits it learned to experiment with
  1888. 36:09
  1889. these drugs and it's sort of built its
  1890. 36:11
  1891. own policy to create these strokes that
  1892. 36:14
  1893. would generate these images of course
  1894. 36:16
  1895. you can train the whole set up is a
  1896. 36:17
  1897. conditional generation process to
  1898. 36:19
  1899. recreate a given image - I think the
  1900. 36:22
  1901. main thing about this is it's learning
  1902. 36:24
  1903. an unsupervised way to throw the strokes
  1904. 36:26
  1905. I see it as the environment the the
  1906. 36:29
  1907. league my paint environment sort of
  1908. 36:31
  1909. gives us a grounded bottleneck to
  1910. 36:33
  1911. actually create a meaningful
  1912. 36:35
  1913. representation space of course the next
36:38
Of course, the next thing we tried was Omniglot, and again you see the same things: it can generate unconditional, meaningful, Omniglot-looking samples, or it can recreate Omniglot samples. But then, generalization, right? So here what we tried was: train the model on Omniglot and then ask it to generate MNIST digits. This is what you see in the middle row there -- can it draw MNIST digits? It has never seen MNIST digits before, but we all know that Omniglot is more general than MNIST, and it can do it, right? Given an MNIST digit, it can actually draw it, even though the network itself has never seen any MNIST digits during its training. Then we tried smileys -- they're line drawings -- and, given a smiley, it can also draw smileys. That is great.
37:25
So can we do more? We did this: we took this cartoon drawing, chopped it up into 64-by-64 pieces -- it's a general line drawing, right? Again, this is the same agent that was trained using Omniglot, and now you can see that it can actually recreate that drawing. Certain areas are rough -- around the eyes, for instance, the insides are really complicated -- but in general you can see that it is actually capable of generating those drawings. So this gives you an idea of generalization: I can train on one domain and generalize to new ones.
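The chopping step described here is just tiling the large line drawing into 64-by-64 patches the agent can redraw one at a time; a minimal sketch, with the patch size as the only assumption:

import numpy as np

def tile_image(image, patch=64):
    """Split a (H, W) grayscale drawing into non-overlapping patch x patch tiles.
    Height and width are cropped down to multiples of the patch size."""
    h, w = image.shape
    h, w = h - h % patch, w - w % patch
    tiles = (image[:h, :w]
             .reshape(h // patch, patch, w // patch, patch)
             .swapaxes(1, 2))            # (rows, cols, patch, patch)
    return tiles

def untile_image(tiles):
    """Reassemble the redrawn tiles back into one image."""
    rows, cols, patch, _ = tiles.shape
    return tiles.swapaxes(1, 2).reshape(rows * patch, cols * patch)

Each tile can then be given to the same Omniglot-trained agent as a conditional target, and the redrawn tiles stitched back together with untile_image.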
38:01
So can I push it further? The next thing we tried was this: the advantage of using a tool is that you have a meaningful representation space, and we can hopefully transfer that representation space into a new environment. So here what we do is, again, take the same agent that is trained using Omniglot and transfer it from that simulated environment into the real world. The way we do that is: we took that same program, and our friends at the robotics group at DeepMind wrote a controller for a robotic arm to take that program and draw it. This whole experiment happened in under a week, really, and what we ended up with was the same agent -- the same agent, not fine-tuned to this setup or anything -- generating its brushstroke programs, and then that program goes into a controller so it can be realized by a real robotic arm. The reason we can do this is that the environment we used is a real environment; we didn't create that environment ourselves. The latent space, if you will, is not some arbitrary latent space that we invented: it's a latent space defined by us as a meaningful tool space, and the reason we create such tools is to solve many different problems anyway. This is an example of that: using that tool space gives us the ability to actually transfer the agent's capability.
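To illustrate what handing the stroke program to a controller might look like, here is a heavily simplified sketch; the Stroke format and the arm's move_to/set_pen methods are hypothetical placeholders, not DeepMind's actual robot interface.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Stroke:
    """One brushstroke command emitted by the drawing agent (illustrative format)."""
    start: Tuple[float, float]      # canvas coordinates in [0, 1]
    end: Tuple[float, float]
    pressure: float                 # brush pressure in [0, 1]

def canvas_to_workspace(point, scale=0.20, origin=(0.40, 0.0)):
    """Map normalized canvas coordinates onto a drawing area on the table (metres)."""
    x, y = point
    return (origin[0] + scale * x, origin[1] + scale * y)

def execute_program(strokes: List[Stroke], arm):
    """Replay a stroke program on a robot arm. `arm` is assumed to expose
    move_to(x, y) and set_pen(down=...) -- placeholders for whatever
    controller API is actually available."""
    for s in strokes:
        arm.set_pen(down=False)
        arm.move_to(*canvas_to_workspace(s.start))
        arm.set_pen(down=True)       # pressure could additionally modulate pen force
        arm.move_to(*canvas_to_workspace(s.end))
    arm.set_pen(down=False)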
39:29
So with that, I want to conclude. I tried to give an explanation of how I think about generative models and unsupervised learning, and of course I'm a hundred percent sure everyone agrees that our aim is not just to look at images; our aim is to do much more than that. I tried to give two different aspects. One of them is that the kind of generative models we can build right now can already solve real-world problems, as we have seen with WaveNet. The other is that we can think about a different kind of setup, where we have agents actually training and generating interpretable programs. That is an important aspect, and we have seen that conversation come up through several of the talks here: being able to generate interpretable programs is one of the bottlenecks we face right now, because there are many critical applications we want to solve and many tools we want to utilize, and this is one step towards that, the way I see it. Being able to do this requires us to create very capable reinforcement learning agents that rely on new algorithms we still need to work on. With that, thank you very much. I want to thank all my collaborators for their help on this. Thank you very much.
[Applause]
[Music]
[Applause]
41:06
We have time for maybe one or two questions.
41:24
Okay, so I have one: how do you think about scaling to more general domains, beyond simple strokes -- how do you generate, say, realistic scenes?

Right, so one thing that I haven't shown here is creating realistic scenes; that is one case. One thing I haven't talked about, which is actually in the paper, is something the team did -- and by the way, I have to mention that this was worked on mostly by Yaroslav Ganin; he's actually a PhD student at MILA in Montreal and he spent his summer with us doing his internship, so it's an amazing job to have done this during an internship, big congratulations to him. One thing we did was actually try to generate images: we took the CelebA dataset and used the same drawing program to actually draw those, and in that case our setup just scales to that -- the same setup actually scales, because it's a general drawing tool and you can control the color, so we can do that. But it requires a little bit more work; it was one of the last experiments we did, and it is sort of in the works.
42:42
Thanks for a great talk. I had a question about the IMPALA results: you had a slide with a curve where all workers are learning versus having one centralized learner, and the all-workers-learning setup actually does better than the centralized learner. I found that not too surprising, but it's great to see the positive transfer between tasks. Have you tried that on other suites of tasks? Do you think it's just because the tasks in this suite are very similar to each other?

It definitely depends on that, but the reason we created those tasks is exactly that, right? In the real world, the visual structure of our world is unified, so the kind of setup we have in DeepMind Lab, with that task suite, is a unified visual environment: you have one kind of agent with a unified action space, and now you can focus on solving different kinds of tasks. Of course, that is the kind of thing we were testing: given all of this, is it possible to get the multi-task positive transfer that we see in supervised learning cases? And we were able to see that in reinforcement learning, yeah.
44:01
Hello, this is exciting. I have a question about extending this to maybe more open domains. What is the challenge -- is the challenge the number of actions to pick, because the stroke space is maybe smaller? What other challenges are there in extending to open domains?

What do you have in mind as open domains? The number of actions is definitely a challenge, right; it is definitely one of the big challenges, and as far as I know a lot of RL research goes into that. But that is, I think, only one of the main challenges. The other challenge, of course, is the state representation. That is mainly why we used deep learning, because we expect that with deep learning we are going to be able to learn better representations -- and that still remains a challenge, because being able to learn representations is not only an architectural problem; it is also about finding the right training setup. SPIRAL was an example of that, where we can get that reward function, that reward signal, in an unsupervised way. In many different domains there are many different ways we can do this, but actually finding those solutions is also part of that.
45:20
Okay, so let's thank Koray again.
[Music]
[Applause]