Good morning. Hi, my name is Amelie and I'm going to be the session chair for the morning, so it's my great pleasure to introduce Koray Kavukcuoglu, who is going to give an invited talk. Koray is a Director of Research at DeepMind and one of the star researchers in our community. He has contributed to many highly influential projects at DeepMind, such as spatial transformer networks, autoregressive generative models such as pixel recurrent networks and WaveNet, and deep reinforcement learning for playing Atari games and AlphaGo. Today he will talk about going from generative models to generative agents, so let's welcome Koray.

[Applause] [Music]
Thank you very much for the very nice introduction, and thanks everyone for being here; it's an absolute pleasure. As mentioned, I'm going to try to talk about unsupervised learning in general, starting from generative models, maybe the classical way, and then I'll try to give another view that I think is quite interesting and that we have been working on recently.

When I think about what the important things are for us to do as a community, I think everyone here sort of agrees that in the end what is important is to be doing unsupervised learning. We have realized that supervised learning has all sorts of successes, but in the end unsupervised learning is the next frontier. When I think about unsupervised learning, there are different explanations that come to my mind, and when talking to people I think we all have somewhat different opinions on this. One common explanation is: we have an unsupervised learning algorithm, we run it on our data, and what we expect is for the algorithm to understand and explain our data, or our environment. What we expect from this is that the algorithm is going to learn the intrinsic properties of our data, of our environment, and then it's going to be able to explain the data through those properties. But most of the time, because of the kinds of models that we use, what happens is that in the end we resort to looking at samples, and when we look at the samples we try to see whether our model really understood the environment; if it understood the environment, then the samples should be meaningful. Of course we also look at all sorts of objective measures during training, like Inception scores, log-likelihoods and such, but in the end we always go back to the samples to understand whether our model can really explain what's going on in the environment.
The other kind of general explanation that we all use is that the goal of unsupervised learning is to learn rich representations. It's already embedded in the name of this conference: the main goal of deep learning, and of unsupervised learning, is learning those representations. But when we think about those representations, again, this explanation doesn't give us an objective measure. How are we going to think about those representations in terms of being good and useful? To me the most important bit is that if we have good, rich representations, then they are useful for generalization, for transfer. If we have a good unsupervised learning model and it can give us good representations, then we can get generalization.

So what I'm going to do today is also tie this together with something else that is, for me, very important. As was mentioned, a big chunk of the work that we have been doing at DeepMind, and that I have been doing, is about agents and reinforcement learning, and in this talk I'm going to take a look at unsupervised learning both in the classical sense of learning a generative model, and in the sense of learning an agent that can do unsupervised learning.
So I'm going to start from the WaveNet model. Hopefully, as many of you know, it is a generative model of audio. It's a pure deep learning model, and it turns out you can model any audio signal, like speech and music, and get really realistic samples out of it. The next thing I'm going to do is explain a different, newer approach to unsupervised learning that I find really interesting, which is based on deep reinforcement learning: learning an agent that actually does unsupervised learning. This model, called SPIRAL, is based on a new agent architecture that we have been working on and published recently, called IMPALA. It's a highly scalable, efficient, off-policy learning agent architecture that we use in SPIRAL to do unsupervised learning. The interesting bit about the SPIRAL work is that it achieves generalization by using a tool space: tools that we as people have created so that we can solve not one specific problem but many different problems. By using the interface of a tool, and by having an agent, you can now learn a generative model of your environment.
All right, so without more delay, the first thing that I'm going to quickly introduce is the WaveNet model. WaveNet is a generative model of audio. As I said, it models the raw audio signal; it doesn't use any intermediate representation to model the audio. Audio in general is very high-dimensional: the standard audio signal that we started with (this moved a bit as we went along) was 16,000 samples per second. If you compare that to our usual language modeling and machine translation kinds of tasks, it is several orders of magnitude more data, so the dependencies that one needs to capture to model audio well are very long. What this model does is model one sample at a time, and it uses a softmax distribution to model each sample conditioned on all the previous samples of the signal. When you look at it more closely, it is an architecture that has quite a bit of resemblance to the PixelCNN model, which maybe some of you are also familiar with. In the end it is a stack of multiple convolutional layers. To be a little more specific, it has residual blocks, you use multiples of those residual blocks, and in each residual block there are dilated convolutional layers stacked on top of each other. Through those dilated convolutional layers, which are causal convolutions, we can model very long dependencies; that is how we get the modeling of dependencies in time.
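(To make the architectural idea concrete, here is a minimal sketch of a stack of dilated causal 1-D convolutions in PyTorch-style Python. All names, channel counts and the dilation schedule are illustrative, not the actual WaveNet implementation; the real model also has gated activations, residual and skip connections, and a softmax over quantized sample values.)

    import torch
    import torch.nn as nn

    class CausalDilatedConv(nn.Module):
        """One dilated causal 1-D convolution: the output at time t only sees inputs up to t."""
        def __init__(self, channels, dilation):
            super().__init__()
            self.pad = dilation  # left-pad so kernel_size=2 stays causal and keeps the length
            self.conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)

        def forward(self, x):                            # x: (batch, channels, time)
            x = nn.functional.pad(x, (self.pad, 0))      # pad only on the left, never the future
            return torch.relu(self.conv(x))

    # Dilations 1, 2, 4, ..., 512 (repeated a few times in the real model) give a receptive
    # field of thousands of samples, which is what captures the long-range dependencies.
    stack = nn.Sequential(*[CausalDilatedConv(64, 2 ** i) for i in range(10)])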
Now, one of the biggest design considerations about WaveNet is that it is designed to be very efficient during training. During training, because all the targets are known, you can process the whole signal at once: just run it like a convolutional net, you get your predictions, and because you have the targets you get your error signal and propagate it back, so training is very efficient. But when it comes to sampling, in the end this is an autoregressive model, and through those causal convolutions you need to generate one sample at a time. So if you are sampling at, let's say, 24 kilohertz, 24,000 samples per second, you need to generate one sample at a time, just like you see in this animation. Of course this is painful, but in the end it works quite well and we can generate very, very high-quality audio with this.
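(A minimal sketch of why sampling is the slow part: training is a single parallel pass with known targets, but generation has to loop sample by sample. This is illustrative Python with a stand-in model, not the real implementation.)

    import numpy as np

    def sample_sequentially(predict_next, n_samples, n_levels=256, seed=0):
        """Generate one quantized audio sample at a time; each step conditions on all previous ones."""
        rng = np.random.default_rng(seed)
        signal = [n_levels // 2]                          # start roughly at silence
        for _ in range(n_samples):                        # e.g. 24,000 iterations per second of audio
            probs = predict_next(np.array(signal))        # softmax over the next sample's levels
            signal.append(int(rng.choice(n_levels, p=probs)))
        return np.array(signal)

    # Stand-in "model" that returns a uniform distribution, just so the sketch runs:
    uniform_model = lambda history: np.full(256, 1.0 / 256)
    audio = sample_sequentially(uniform_model, n_samples=1000)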
So what I want to do now is make you listen to unconditional samples from this model. We model the speech signal without any conditioning on text or anything: just take the audio signal, model it with WaveNet, and then sample. As you can hopefully hear, the quality is very high, and this is modeling the raw audio signal, completely unconditionally. Sometimes you even hear short words, like "okay" or "from", and if you listen, all the intonation and everything sounds quite natural; sometimes it feels like you are listening to someone speaking in a language that you don't know. The main characteristics of the signal are all captured, so in terms of dependencies, something like several thousand samples of dependency are properly and correctly modeled there.

And then, of course, what you can do is augment this model by conditioning on a text signal that is associated with the audio you want to generate. By conditioning on the text, you now have a conditional generative model that solves a real-world problem just by itself, end to end with deep learning: from the text you create linguistic embeddings, using those linguistic embeddings you generate the signal, and then it starts talking. So it's a solution to the whole text-to-speech synthesis problem, which as you know is very commonly used in the real world.
All right. When we built the WaveNet model, and this was around almost two years ago now, we looked at the quality when we use it as a TTS model. In green, what you see is the quality of human speech, measured through mean opinion scores; in blue you see WaveNet, and the other colors are the best other models around at the time. You can see that WaveNet closed the gap between real human speech and the other models by a big margin. At the time this really got us excited, because now we actually had a deep learning model that comes with all the flexibility and advantages of doing deep learning, and at the same time it's modeling raw audio at very high quality. I could play text-to-speech samples generated by this model, but actually, if you are using Google Assistant right now, you are already hearing WaveNet, because this is already in production. Anyone who is using Google Assistant and querying Wikipedia and things like that: the speech that is generated there is coming from the WaveNet model. What I want to do next is explain how we did that, and that brings me to the next project in the WaveNet domain: the Parallel WaveNet project.
Of course, when you have a research project and at some point you realize that it actually lends itself to the solution of a real-world problem, and you want to put it into production in a very challenging environment, it requires much more than our little research group. This was a big collaboration between DeepMind research, DeepMind applied, and the Google speech teams. In this slide, what I show are the basic ingredients of how we turned the WaveNet architecture into a feed-forward, parallel architecture. What we realized pretty soon, when we started attempting to put a system like this into production, was that speed is very important; quality is very important too, but on the speed side it is not enough to run in real time. The constraints we were targeting are orders of magnitude faster than real time, even being able to run in constant time. And when the constraint becomes being able to run in constant time, the only thing you can do is create a feed-forward network and parallelize the signal generation. That is what we did.
So in this slide, at the top, what you see is the usual WaveNet model; we call it the teacher. In this setting the teacher WaveNet is pre-trained, it is fixed, and it is used as a scoring function. At the bottom, what you see is the generator, which we call the student. The student model is again an architecture that is very close to WaveNet, but it is run as a feed-forward convolutional network. The way it is run and trained has two components: one component comes from WaveNet, which as I said is very efficient in training but slow in sampling; the other is based on the inverse autoregressive flow work done by Kingma and colleagues at OpenAI last year. This structure gives us the capability to take an input noise signal and slowly transform that noise into a proper distribution, which is going to be the speech signal. So the way we train this is: random noise goes in, together with the linguistic features, through layers and layers of these flows, and that random noise gets transformed into a speech signal. That speech signal goes into WaveNet; WaveNet is already about the best scoring function we could use, because it's a density model. WaveNet scores the generated signal, we get the gradients from that score back into the generator, and then we update the generator. We call this process probability density distillation.
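(Schematically, and hedging that this is my summary of the idea rather than the exact published objective: the distillation term is the KL divergence from the student distribution P_S to the pre-trained teacher distribution P_T, which decomposes into a cross-entropy term and an entropy term,

    D_{KL}(P_S \,\|\, P_T) = H(P_S, P_T) - H(P_S),

where H(P_S, P_T) is the negative log-probability that the teacher WaveNet assigns to the student's samples, and H(P_S) is the student's own entropy, which the inverse-autoregressive-flow parameterization makes tractable to estimate.)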
But of course, when you are trying to do real-world things, and the things are as challenging as speech signals, that by itself is not enough. I have highlighted two components here: one of them, as I said, is the WaveNet scoring function; the other thing that we use is a power loss. What happens is that when we train the model in this manner, the generated signal tends to be very low energy, sort of like whispering: someone speaks, but as if they are whispering. So during training we add this extra loss that tries to conserve the energy of the generated speech.
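(To give a feel for what such a term can look like, here is a minimal sketch of a power loss in the spirit described above: penalize the difference in average spectral energy between the generated and the reference speech. The frame and FFT sizes are illustrative, and this is not the exact formulation from the paper.)

    import numpy as np

    def power_loss(generated, reference, frame=512, hop=256):
        """Mean squared difference of the time-averaged power spectra of two waveforms."""
        def avg_power(x):
            frames = np.stack([x[i:i + frame] for i in range(0, len(x) - frame, hop)])
            return (np.abs(np.fft.rfft(frames, axis=-1)) ** 2).mean(axis=0)
        return float(np.mean((avg_power(generated) - avg_power(reference)) ** 2))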
With these two, the WaveNet scoring and the power loss, we were already getting very high-quality speech, but the constraints are very tough, so what we did was train yet another WaveNet model; we sort of used WaveNet everywhere. We are generating through a WaveNet-like convolutional network, we are using WaveNet as a scoring function, and we trained another WaveNet model, this time used as a speech recognition system. That is the perceptual loss that you see there. During training you have the text and the corresponding speech; we generate the corresponding speech through our generator, give it to the speech recognition system, and the speech recognition system then needs to decode the generated signal back into that text. We get the error from there and propagate it back into our generator, so that's another quality improvement we get by using speech recognition as a perceptual loss in our generation system. The last thing we added was a contrastive term, which basically says: we generate a signal conditioned on some text, and you can create a contrastive loss saying that the signal generated with the corresponding text should be different from the signal you would get if it were conditioned on a separate text.

So, more specifically, we end up with these four terms: at the top, the original use of WaveNet as a scoring function, which is the probability density distillation idea; then the power loss, which uses Fourier transforms internally to conserve the energy; the contrastive term; and finally the perceptual loss, which does the speech recognition.
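(Putting the four terms together, the overall training objective for the student generator has the shape below; the lambda weights are placeholders for however the terms are balanced in practice, not published values.)

    \mathcal{L}_{student} = D_{KL}(P_S \,\|\, P_T) + \lambda_{power} \mathcal{L}_{power} + \lambda_{percept} \mathcal{L}_{percept} + \lambda_{contr} \mathcal{L}_{contrast}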
When we combined all these, of course, what we did was look at the quality again. What I'm showing here is the quality with respect to the best non-WaveNet model; this is about a year after the original research, pretty much exactly a year. During that time the best speech synthesis models also improved, but WaveNet was still better than anything else, and the new one matches it: Parallel WaveNet exactly matches the quality of the original WaveNet. What I'm showing here is three different US English voices and also Japanese, and this is the kind of thing that we always want from deep learning: the ability to generalize to new datasets, to new domains. We developed all of this on practically one single US English voice, and then it was just a matter of collecting another dataset from another speaker or another language, like a speaker speaking Japanese; you just get that, run it, and there you go, you have a production-quality speech synthesis system. This is the kind of thing that we really like from deep learning, and if you are thinking about unsupervised learning, I think this is a very good demonstration of it. Before switching to the next topic, I also want to mention that we have done some further work on this called WaveRNN, which was recently published, and I encourage you to look into that one too; it's a very interesting piece of work, also for generating speech at very high speed.
The next thing I want to talk about is the IMPALA architecture, the new agent architecture that I mentioned. WaveNet is unsupervised modeling in a classical sense that can actually solve a real-world problem; the next thing I want to start talking about is this new, different way of doing unsupervised learning, but for that the other exciting bit is to be able to do deep reinforcement learning at scale.

So I want to motivate why we want to push our deep reinforcement learning models further and further. Most of the time, because this is a new area, what we do is take fairly simple tasks in some simple environments, and we try to train an agent that solves a single task in that environment well. What we want to do is go further than that: again going back to the point of generalization and being able to solve multiple tasks, we have created a new task set. We have an open-source environment called DeepMind Lab, and as part of that we have created this new task set, DMLab-30. It is 30 environments covering tasks around language, memory, navigation and those kinds of things, and the goal is not to solve each one of them individually; the goal is to have one single agent, one single network, that solves all those tasks at the same time. There is nothing custom in that agent that is specific to any single one of these environments.
When you look at those environments, and I'm showing some of them here, the agent has a first-person view: it is in a maze-like environment, it has a first-person camera input, and it can navigate around, go forward and backwards, rotate, look up and down, jump, and those kinds of things. It is solving many different kinds of tasks that are designed to test different kinds of abilities, but the goal, as I said, is to solve all of them at the same time. One thing that becomes really important in this case is the stability of our algorithms, because now we are not solving one single task, we are solving 30 of them, and we want really stable models, because we don't have the chance to tune hyperparameters for a single task anymore. The other thing that becomes really important is task interference: hopefully, what we expect, again by using deep learning, is that in this multi-task setting we see positive transfer rather than task interference, and we hope to demonstrate this in this challenging reinforcement learning domain too.
Okay, I realized that I needed to put in a slide about why deep reinforcement learning, because, a little bit to my surprise, there was not much reinforcement learning at this conference this year, and I wanted to touch on why I think it is important for the deep learning community to actually do deep reinforcement learning. To me, if one of the goals that we work towards here is AI, then it is at the core of it: reinforcement learning is a very general framework for learning sequential decision-making tasks, and deep learning, on the other hand, is the best set of algorithms we have to learn representations. The combination of these two is the best answer we have so far for learning very good state representations for very challenging tasks, not just for solving toy domains but for solving challenging real-world problems. Of course there are many open problems there; some of the ones that are interesting, at least to me, are the idea of separating the computational power of a model from the number of weights or layers it has, and, going back to unsupervised learning again, learning to transfer, so that we build these deep reinforcement learning models with the aim of actually generalizing, of transferring.
Okay, so the IMPALA agent is based on another piece of work that we did a couple of years ago called the asynchronous advantage actor-critic, the A3C model. In the end it is a policy gradient method. As I have tried to show cartoonishly in the figure, at every time step the agent observes the environment, and at that time step it outputs a policy distribution and also a value function. The value function is the agent's expectation of the total amount of reward that it is going to get until the end of the episode, being in that state, and the policy is the distribution over the actions that the agent has. At every time step the agent looks at the environment, updates its policy so that it can act in the environment, and updates its value function. The way you train this is with the policy gradient, and intuitively it is very simple: the gradient of the log-policy is scaled by the difference between the total reward that the agent actually gets in the environment and a baseline, where the baseline is the value function. What that means is that if the agent ends up doing better than what the value function assumed, then it's a good thing: you have a positive scaling and you reinforce that behaviour, your understanding of the environment. If the agent does worse, so the value was higher than the total reward it got, then you have a negative scaling and you need to shuffle things around. The way you learn the value function is with the usual n-step TD error.
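(Written out in the usual notation, and assuming the standard advantage actor-critic formulation rather than anything specific to the slides: with an n-step return R_t, the policy parameters theta are updated along

    \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(R_t - V_\phi(s_t)\big), \qquad R_t = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n V_\phi(s_{t+n}),

and the value parameters phi are trained to minimize (R_t - V_\phi(s_t))^2, which is the n-step TD error mentioned above.)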
So that was the actor-critic part. The asynchronous part of the A3C algorithm is that it is composed of multiple actors, and each actor independently operates in the environment: it collects observations, acts in the environment, computes the policy gradients with respect to the parameters of its network, and then sends those gradients back to a parameter server. The parameter server collects the gradients from all the different actors, combines them together, and then shares the updated parameters with all the actors. Now, what happens is that as you increase the number of actors, this being the usual asynchronous stochastic gradient descent setup, the staleness of the gradients becomes a problem. Distributing the experience collection is very advantageous, but communicating gradients can become a bottleneck as you try to really scale things up.

So for that we tried a different architecture. The idea of a centralized server is quite useful, but rather than using it just to accumulate the parameter updates, the idea in IMPALA is to make the centralized component into a learner, so the whole learning algorithm is contained in it. What the actors do is only act in the environment, not compute gradients or anything; they send the observations back to the learner, and the learner sends the parameters back. In this way you are completely decoupling what happens in your experience collection, in your environments, from your learning algorithm, and you gain a lot of robustness to noise in your environments: sometimes rendering times vary, some environments are slow, some environments are fast, and all of that is completely decoupled from your learning algorithm. But of course, what you then need is a good learning algorithm that can deal with that kind of variation.
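(A toy sketch of that decoupling, purely illustrative and not the real IMPALA code: actors never compute gradients, they just push trajectories into a queue and read back the latest parameters, so slow or fast environments only affect data collection, never the learner's update step.)

    import queue
    import threading
    import numpy as np

    trajectories = queue.Queue()
    params = {"step": 0}                          # stands in for the network weights

    def actor(n_rollouts=100):
        for _ in range(n_rollouts):
            seen_at = params["step"]              # parameters may already be slightly stale
            rollout = np.random.randn(20)         # pretend trajectory collected in the environment
            trajectories.put((seen_at, rollout))

    actors = [threading.Thread(target=actor) for _ in range(4)]
    for t in actors:
        t.start()

    for step in range(400):                       # the centralized learner
        seen_at, rollout = trajectories.get()     # data is off-policy by (step - seen_at) updates
        params["step"] = step + 1                 # stands in for one gradient update
    for t in actors:
        t.join()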
So, in the end, what we have in IMPALA is a very efficient, decoupled backward pass, if you will. Actors generate trajectories, as I said, but that decoupling creates off-policy-ness: the policy in the actors, the behaviour policy if you will, is separate from the policy in the learner, the target policy. So what we need is off-policy learning. There are many off-policy learning algorithms, but we really wanted a policy gradient method, and for that we developed this new method called V-trace. It's an off-policy advantage actor-critic algorithm, and the key point of V-trace is that it uses truncated importance sampling ratios to come up with an estimate of the value: because there is this mismatch between the learner and the actors, you need to correct for that difference. The good thing about this algorithm is that it transitions smoothly between the on-policy case and the off-policy case. When the actors and the learner are completely in sync, so you are in the on-policy case, the algorithm boils down to the usual A3C update with the n-step Bellman equation; as they become more separated, the correction of the algorithm kicks in and you get a corrected estimate. The algorithm has two main truncation factors that control two different aspects of the off-policy learning. One of them is rho, which controls which value function the algorithm converges towards: the value function that corresponds to the behaviour policy, or the one that corresponds to the target policy in the learner. The other one, the c factor, controls the speed of convergence: by controlling that truncation, it can increase or decrease the variance in learning and so have an effect on the speed of convergence.
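(Schematically, and from my reading of the IMPALA paper rather than from the slides themselves, the V-trace target for the value at state x_s over an n-step trajectory collected with behaviour policy mu, while learning target policy pi, looks like

    v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s} \Big(\prod_{i=s}^{t-1} c_i\Big)\, \delta_t, \qquad \delta_t = \rho_t \big(r_t + \gamma V(x_{t+1}) - V(x_t)\big),

    \rho_t = \min\!\big(\bar\rho,\ \pi(a_t \mid x_t)/\mu(a_t \mid x_t)\big), \qquad c_i = \min\!\big(\bar c,\ \pi(a_i \mid x_i)/\mu(a_i \mid x_i)\big),

where the truncation level \bar\rho determines which value function the estimate converges to and \bar c controls the variance, and hence the speed, of the correction. When pi equals mu, and the truncation levels are at least 1, all the ratios are 1 and this reduces to the usual n-step Bellman target.)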
Now, when we tested this, of course the goal is to test on all the environments at once, but what we wanted to do first was look at single tasks. We looked at five different environments, and we see that in these environments the IMPALA algorithm is always very stable and performs at the top. The comparisons here are the IMPALA algorithm, the batched A3C method, the batched A2C method, and different versions of the A3C algorithm, and you can see that IMPALA and batched A2C are always performing at the top; IMPALA, the dark blue curve, seems to be doing fine, and this gave us the feeling that, okay, we have a nice algorithm.

Now, the other thing that is very important, and that is discussed a lot, is the stability of these algorithms. I actually really like these plots; since the A3C work we keep looking at them and we always put them in the papers. In the plot here, on the x-axis we have the hyperparameter combinations. When you train any model, what all of us do is some sort of hyperparameter sweep, and here we are looking at the final score achieved with every single hyperparameter setting, sorted. In this kind of plot, the curves, that is the algorithms, that are at the top and that are the most flat are the better-performing and most stable algorithms. What we see here is that IMPALA is not only achieving better results, it is not achieving those results because of one lucky hyperparameter setting: it is consistently at the top. You can see that it's not completely flat, of course, because in the end we are searching over three orders of magnitude in the parameter settings, but we can see that the algorithm is actually quite stable.
Now, when we look at our main goal here: on the x-axis we have wall-clock time and on the y-axis we have the normalized score. The red line that you see there is A3C, and you can see that IMPALA not only achieves much better scores, it achieves them much, much faster. The other thing is comparing the green and the orange lines: that is the comparison between training IMPALA in an expert setting versus a multi-task setting, and we see that the multi-task version achieves better scores, and faster, which again suggests that we are seeing positive transfer. It's a like-for-like setting: all the details of the network and the agent are the same; in one case you train one network per task, and in the other case you train the same network on all the tasks, and you achieve a better result because of the positive transfer between those tasks. And if you give IMPALA more resources, you end up with this almost vertical takeoff that you see there: you can actually solve this challenging thirty-task domain in under 24 hours, given the resources. That is the kind of algorithmic power that we want, to be able to train these very highly scalable agents. Now, why do we want to do that? That is the point I want to come to next, and in the final part this is the new SPIRAL algorithm that I want to talk about.
Now, just quickly going back to the original ideas that I talked about: unsupervised learning is also about explaining environments and generating samples, but maybe about generating samples by explaining environments. We talked about the fact that with these deep learning models, like WaveNet, we can generate amazing samples, but maybe there is a different, less implicit way we can do these things, in the sense that when we generate these samples, they come with some explanation, and that explanation can go through using some tools. In this particular case, what we are going to do is use a painting tool, and we are going to learn to control this painting tool. It's a real drawing program, and we are going to generate a program that the painting tool will execute to generate the image. The main idea that I want to convey is that by learning how to use tools that are already available, we can start thinking about different kinds of generalization, which I'll try to demonstrate. In the real world we have a lot of examples of programs, their executions, and the results of those programs; they can be arithmetic programs, plotting programs, or even architectural blueprints. Because we have information about that generation process, when we see the results we can go and try to infer what the program was, what the blueprint was, that generated that particular input. We can do this, and the goal is to be able to do this with our agents too.
Specifically, we are going to use this environment called libmypaint. It is a professional-grade, open-source drawing library, and it's used worldwide by many artists. We are using a limited interface to it, basically learning to draw brushstrokes, and we are going to have an agent that does that. The agent, in the end called SPIRAL, has three main components. First of all there is the agent that generates the brushstrokes; I like to see that as writing the program. The second one is the environment, libmypaint: the brushstrokes come in, and the environment turns them into brushstrokes on the canvas. That canvas goes into a discriminator, and the discriminator is trained like in a GAN: it looks at the generated image, asks whether it looks like a real drawing, and gives a score. As opposed to the usual GAN training, rather than propagating the gradients back, we take that score and train our agent with that score as a reward. So when you think about these three components coming together, you have an unsupervised learning model similar to GANs, but rather than generating in pixel space, we generate in this program space, and the training is done through a reward that the agent itself also learns. We are trusting another neural net, just like in the GAN setup, to guide learning, but not through its gradients, just through the score function.
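(A highly simplified sketch of that loop, with illustrative names and stand-ins so that it runs; this is not the actual SPIRAL implementation. The key point is that the discriminator's output is used as a scalar reward for a policy-gradient update, not back-propagated through the renderer.)

    import numpy as np

    def paint_episode(policy, render_stroke, discriminator_score, n_strokes=20, size=64):
        canvas = np.zeros((size, size))
        actions = []
        for _ in range(n_strokes):
            action = policy(canvas)                     # agent picks the next brushstroke command
            canvas = render_stroke(canvas, action)      # the drawing tool executes it on the canvas
            actions.append(action)
        reward = discriminator_score(canvas)            # "does this look like a real drawing?"
        return actions, canvas, reward                  # the reward drives the agent's RL update

    # Stand-ins just so the sketch executes: a random policy, a toy renderer, a dummy scorer.
    rng = np.random.default_rng(0)
    policy = lambda canvas: rng.integers(0, canvas.shape[0], size=4)    # (x0, y0, x1, y1)
    def render_stroke(canvas, a):
        canvas = canvas.copy()
        canvas[a[0], a[1]] = 1.0
        canvas[a[2], a[3]] = 1.0
        return canvas
    discriminator_score = lambda canvas: float(-abs(canvas.mean() - 0.1))
    actions, canvas, reward = paint_episode(policy, render_stroke, discriminator_score)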
In my opinion, this makes it, in certain cases, very capable of using different kinds of tools. As I said, the reinforcement learning part of this agent is exactly IMPALA: now that we have an agent that can solve really challenging reinforcement learning setups, we take it and put it into this environment, augmented with the ability to learn a discriminator function that provides the reward. To emphasize again, the important thing here is: yes, we have an agent, but there is no environment that says, okay, this is the reward that the agent should get. The reward generation is also inside the agent, thanks to the unsupervised learning models being studied here; specifically, we use a GAN setup there.

So, can we generate? The first thing we try, of course, when doing unsupervised learning from scratch, is to go back to MNIST. You start from MNIST, and initially it generates various scribble-like things, but through training it becomes better and better. Here in the middle you see what the agent has learned; these are completely unconditional samples. It has learned to create strokes that generate these digits. To emphasize this: the agent has never seen strokes coming from real people, how we draw digits; it learned by experimenting with these strokes and built its own policy to create strokes that generate these images. You can of course also train the whole setup as a conditional generation process, to recreate a given image.
I think the main thing about this is that it's learning, in an unsupervised way, how to draw the strokes. I see it as the libmypaint environment giving us a grounded bottleneck for creating a meaningful representation space. The next thing we tried was Omniglot, and again you see the same things: it can generate unconditional, meaningful, Omniglot-looking samples, or it can recreate given Omniglot samples. But then, generalization: here what we tried was to train the model on Omniglot and then ask it to generate MNIST digits; this is what you see in the middle row there. Can it draw MNIST digits? It has never seen MNIST digits before, but we all know Omniglot is more general than MNIST, and it can do it: given an MNIST image, it can actually draw it, even though the network itself has never seen any MNIST digits during its training. Then we tried smileys, which are line drawings, and given a smiley it can also draw smileys, which is great. So can we do more? We took this cartoon drawing, chopped it up into 64-by-64 pieces, and it's a general line drawing; again this is the agent that was trained using Omniglot, and now you can see that it can actually recreate that drawing. Certain areas are a bit rough, like around the eyes, where the insides are really complicated, but in general you can see that it is capable of generating those drawings. So this gives you an idea of generalization: I can train on one domain and generalize to new ones.
So can I push it further? The next thing that we tried was this: the advantage of using a tool is that you have a meaningful representation space that we can hopefully transfer into a new environment. Here, what we do is take again the same agent that was trained using Omniglot and transfer it from that simulated environment into the real world. The way we do that is we took that same program, and our friends in the robotics group at DeepMind wrote a controller for a robotic arm to take that program and draw it. This whole experiment happened in under a week, really, and what we ended up with was the same agent, not fine-tuned to the setup or anything: the same agent generates its brushstroke programs, and that program goes into a controller and is realized by a real robotic arm. The reason we can do this is that the environment we used is a real environment; we didn't invent it ourselves. The latent space, if you will, is not some arbitrary latent space that we created for this model; it is a space defined by a tool that we as people built, so it is a meaningful tool space, and the reason we create those tools in the first place is to solve many different problems anyway. This is an example of that: using that tool space gives us the ability to actually transfer this capability.
So with that I want to conclude. I tried to give an explanation of how I think about generative models and unsupervised learning. Of course, I'm a hundred percent sure everyone agrees that our aim is not just to look at images; our aim is to do much more than that, and I tried to give two different aspects. One of them is that the kinds of generative models we can build right now can solve real-world problems, as we have seen with WaveNet. We can also think about a different kind of setup where we have agents actually training and generating interpretable programs. That is an important aspect; we have seen that conversation come up through several of the talks here: being able to generate interpretable programs is one of the bottlenecks that we face right now, because there are many critical applications that we want to address and many tools with interfaces that we could use, and this is one step towards that, at least the way I see it. Being able to do these things requires us to create these very capable reinforcement learning agents, which rely on new algorithms that we need to work on. With that, thank you very much, and I want to thank all my collaborators for their help on this. Thank you very much.

[Applause]
[Music] [Applause]

[Session chair] We have time for maybe one or two questions.

[Audience] Okay, so I have a question: how do you think about scaling to more general domains, beyond simple strokes, for example generating realistic scenes?

[Koray] Right, so one thing that I haven't shown here: yes, creating realistic scenes is one case. One thing that I haven't talked about, which is actually part of this work and is in the paper, is something the team did. By the way, I have to mention that this was worked on mostly by Yaroslav Ganin; he's actually a PhD student at MILA and he spent his summer with us doing his internship, and he did an amazing job, especially for doing it during an internship, so big congratulations to him. One thing that we did was to actually try to generate images: we took the CelebA dataset and used the same drawing program to draw those, and in that case our setup just scales towards those; the same setup actually scales, because it's a general drawing tool and you can control the color, and we can do that. It requires a little bit more work; it was one of the last experiments that we did, but it is sort of in the works.
[Audience] Thanks for a great talk. I had a question about the IMPALA results: you had a slide with a curve comparing training one agent on all the tasks together versus training separate experts, and the agent trained on all the tasks actually does better. I found that not that surprising, but it's great to see the positive transfer between tasks. Have you tried that on other suites of tasks? Do you think it's just because the tasks in this suite are very similar?

[Koray] It definitely depends on that, but the reason we created those tasks is exactly that. In the real world, the visual structure of our world is unified, so the kind of setup that we have in DeepMind Lab and those tasks is a unified visual environment: you have one kind of agent with a unified action space, and now you can focus on solving different kinds of tasks. That is the kind of thing we were testing: given all of that, is it possible to get the multi-task positive transfer that we see in supervised learning cases? And we were able to see that in reinforcement learning. Yeah.
[Audience] Hello, this is exciting. I have a question about extending this to maybe more open domains. What is the challenge? Is the challenge the number of actions to pick, because here the stroke space is maybe smaller? What are the challenges in extending this to open domains?

[Koray] What do you have in mind as open domains? The number of actions is definitely a challenge; it is definitely one of the big challenges, and as far as I know a lot of research in RL goes into that. But that is, I think, only one of the main challenges. The other challenge, of course, is the state representation; that is mainly why we use deep learning, because we expect that with deep learning we are going to be able to learn better representations. And that still remains a challenge, because being able to learn representations is not only an architectural problem; it is also about finding the right training setup, and SPIRAL was an example of that, where we can get that reward function, that reward signal, in an unsupervised way. In many different domains there are many different ways we can do this, but actually finding those solutions is also part of it.

[Session chair] Okay, so let's thank our speaker again.

[Music]
[Applause]