- good morning hi my name is Amelie I'm going to be the session chair for the
- morning so it's my great pleasure to introduce Koray Kavukcuoglu who is going to
- give an invited talk Koray is a director of research at DeepMind and he's one
- of the star researchers in our community he has contributed to many highly
- influential projects at DeepMind such as spatial transformer networks
- autoregressive generative models such as Pixel Recurrent Networks and WaveNet
- and deep reinforcement learning for playing Atari games and AlphaGo today he will
- talk about from generative models to generative agents so let's welcome Koray
- [Applause]
- [Music]
- thank you very much Hong for the very nice introduction and thanks
- everyone for being here it's an absolute pleasure so as Hong mentioned
- I'm going to talk about unsupervised learning in general starting from
- generative models maybe the classical way and then I'll try to give another view
- that I think is quite interesting that we have been working on
- recently when I think about what are the important things for us to do as a
- community I think everyone here agrees that in the end what is
- important is to be doing unsupervised learning we sort of realize
- that supervised learning has all sorts of successes but in the end unsupervised
- learning is the next frontier and when I think about unsupervised
- learning there are different explanations that come to
- my mind and when talking to people I think we all have different
- opinions on this one common explanation is we
- have an unsupervised learning algorithm we run it on our data and what we expect is
- the algorithm to understand our data and to explain our data or our
- environment right and what we expect from this is that the algorithm is
- going to learn the intrinsic properties of our data or our environment and then
- it's going to be able to explain the data through those properties but most of the
- time what happens is because of the kinds of models that we use we resort
- in the end to looking at samples and when we look at the samples we try to see
- whether our model really understood the environment and if it understood the
- environment then the samples should be meaningful of course we look at all
- sorts of objective measures that we use during training
- like Inception scores or log-likelihoods and such but in the end we always
- resort to samples when it comes to understanding whether our model really can
- explain what's going on in the environment the other kind of general
- explanation that we all use is that the goal of unsupervised learning is to
- learn rich representations right it's already embedded in the name
- of this conference the main goal of deep learning and unsupervised learning is
- learning those representations but then when we think about those
- representations again this explanation doesn't give us an objective
- measure what we have to think about is how we are going to judge
- those representations in terms of being great and useful and to me the most
- important bit is if we have good and rich representations then they are useful
- for generalization and for transfer right if we
- have a good unsupervised learning model and it can give us good
- representations then we can get generalization so what I'm going to do today is
- also tie this together with something else that I think is
- very important as Hong mentioned a big chunk of the work that we
- have been doing at DeepMind and that I have been doing is about agents and
- reinforcement learning and in this talk I'm going to take a look at
- unsupervised learning in the classical sense of learning a
- generative model and also learning an agent that can do unsupervised learning
- so I'm going to start from the WaveNet model hopefully as many of you know it
- is a generative model of audio it's a pure deep learning model and it turns out
- you can model any audio signal like speech and music and you
- can get really realistic samples out of that and the next thing I'm going to do
- is explain this other sort of new approach that I find really
- interesting to unsupervised learning that is based on deep reinforcement
- learning learning an agent that actually does unsupervised learning
- this model called SPIRAL is based on a new agent architecture that we have
- been working on and that we published recently called IMPALA it's
- a very large highly scalable efficient off-policy learning agent architecture
- that we use in SPIRAL to do unsupervised learning and the interesting bit about
- the SPIRAL work is it achieves generalization by using tools
- tools that we as people have created so that we can
- actually solve not one specific problem but many different problems and
- using these tools and using the interface of a tool and having an agent you can
- actually now learn a generative model of your environment all right so without
- more delay the first thing that I'm going to introduce quickly is
- the WaveNet model WaveNet is a generative model of audio as I said
- it models the raw audio signal it doesn't use any intermediate
- representation to model the audio signal audio in general is very high dimensional
- the standard audio signal that we started with when we were at the beginning
- was 16,000 samples per second if you compare that to
- our usual language modeling and machine translation kinds of tasks it is
- several orders of magnitude more data so the dependencies that one
- needs to model to be able to model audio well are very long so
- what this model does is it models one sample at a time and it uses
- a softmax distribution to model each sample conditioned
- on all the previous samples of the signal when you look at it
- more closely though it is an architecture that has quite a bit of
- resemblance to the PixelCNN model maybe some of you are also familiar with
- that in the end it is a stack of multiple convolutional layers to be a little bit
- more specific it has these residual blocks you use multiples of those residual
- blocks and in each residual block there are these dilated
- convolutional layers that go on top of each other and through those
- dilated convolutional layers which are causal convolutions we can model very
- long dependencies so through that we can model the dependencies in time
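The long-dependency claim above can be made concrete with a quick receptive-field calculation. This is a sketch under the usual WaveNet-style assumptions (kernel size 2, dilations doubling per layer, the stack repeated several times); the exact dilation schedule here is illustrative, not taken from the talk.

```python
# Receptive field of a stack of dilated causal convolutions.
# Illustrative sketch: WaveNet-style blocks with dilations doubling
# 1, 2, 4, ..., 512, and the whole stack repeated.

def receptive_field(dilations, kernel_size=2):
    """Each causal conv with kernel size k and dilation d extends the
    receptive field by (k - 1) * d samples."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# One stack of 10 layers with doubling dilations.
stack = [2 ** i for i in range(10)]   # [1, 2, 4, ..., 512]
print(receptive_field(stack))         # 1024 samples from one stack

# Repeating the stack grows the field linearly with depth.
print(receptive_field(stack * 4))     # 4093 samples, ~0.26 s at 16 kHz
```

So a few thousand samples of context, matching the "several thousand samples of dependencies" mentioned later, comes from only tens of layers.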
- now one of the biggest design considerations about WaveNet is it is designed to
- be very efficient during training because during training
- all the targets are known so when you generate the signal you generate
- the whole signal at once just run it like a convolutional net you get your
- signal then because you have the targets you get your error signal from that and
- propagate back so training is very efficient but of course when it comes to
- sampling time in the end this is an autoregressive model and through those
- causal convolutions you need to run through them one sample at a time so if you are
- sampling let's say 24 kilohertz 24,000 samples per second you need to generate
- one sample at a time just like you see in this animation and of course this is
- painful this is painful but in the end it works quite well and we can generate
- very high quality audio with this so what I want to do is
- make you listen to the unconditional samples from
- this model so we model the speech signal without any conditioning on text
- or anything we just take the audio signal and model it with
- WaveNet and then when you sample this is the kind of thing you get so as you can
- hopefully hear the quality is very high and this is modeling really the raw
- audio signal and this is completely unconditional so what you hear
- is sometimes you even hear short words and then if you
- listen the intonation and everything sounds quite natural and sometimes it
- feels like you are listening to someone speaking in a language that you don't
- know so the main characteristics of the signal are all captured there
- so in terms of dependencies we are looking at something like several
- thousand samples of dependencies that are actually properly and correctly modelled
- there and then of course what you can do is
- augment this model by conditioning on a text signal that is associated
- with the signal that you want to generate and by conditioning on the text
- signal now you have a conditional generative model that
- actually solves a real-world problem just by itself end-to-end with deep learning
- right from the text you create the linguistic embeddings and using those
- linguistic embeddings you can generate the signal and then it just
- starts talking right so it's a solution to the whole text-to-speech
- synthesis problem that as you know is very commonly used in the real world
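The sample-at-a-time generation described above (and why it is slow) can be sketched as a loop. The `dummy_logits` function is a stand-in of my own for the real network's forward pass; the 256-level output comes from WaveNet's 8-bit mu-law quantization of audio.

```python
import numpy as np

# Toy sketch of WaveNet-style autoregressive sampling (not the real
# model): each new sample is drawn from a softmax over 256 quantized
# amplitude levels, conditioned on the samples generated so far.

rng = np.random.default_rng(0)

def dummy_logits(history, n_levels=256):
    # Placeholder: in WaveNet this would be the stack of dilated
    # causal convolutions applied to `history`.
    return rng.normal(size=n_levels)

def sample_audio(n_samples, n_levels=256):
    history = []
    for _ in range(n_samples):        # one network pass per sample -> slow
        logits = dummy_logits(history, n_levels)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        history.append(rng.choice(n_levels, p=probs))
    return np.array(history)

audio = sample_audio(1000)            # 1000 steps is ~62 ms at 16 kHz
print(audio.shape)                    # (1000,)
```

At 24,000 samples per second, every second of audio costs 24,000 sequential forward passes, which is exactly the bottleneck the parallel version later removes.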
- sorry alright so when we did this the WaveNet model and this was around
- almost two years ago now we looked at the quality when we
- use it as a TTS model and in green what you see is the quality of human
- speech obtained through these mean opinion scores and in blue you see
- WaveNet and the other colors are the best other models
- around at the time and you can see that WaveNet closed the gap between
- human speech and the other models by a big margin so at the time this
- really got us excited because now we actually had a deep learning
- model that comes with all the flexibilities and advantages of doing deep
- learning and at the same time it's modeling raw audio and it is
- very high quality I could play text-to-speech samples generated by
- this model but actually what I'm going to go into next is that if
- you are using Google Assistant right now you are already hearing WaveNet
- because this is already in production so anyone who's using Google
- Assistant and querying Wikipedia and things like that the speech that
- is generated there is actually coming from the WaveNet model and what I want
- to do is explain how we did that and that brings me to our
- next project in the WaveNet domain this is the
- Parallel WaveNet project so of course when you have a research
- project and at some point you realize that it actually lends
- itself to the solution of a real-world problem and you want to
- put it into production in a very challenging environment then of course it
- requires much more than our little research group so this was a big cooperation
- between the DeepMind research and applied teams and the Google speech team so in
- this slide what I show is basically the basic ingredients of
- how we turned a WaveNet architecture into a feed-forward and parallel
- architecture because what we realized pretty soon when we started to
- attempt putting a system like this into production was that
- speed is of course very important and quality is very important but
- the importance of speed is such that it is not enough to run something in
- real time the kind of constraints that we faced were orders of
- magnitude faster than real time even actually being able to run in constant
- time and when the constraint becomes being able to run in constant
- time the only thing you can do is create a feed-forward network and then
- parallelize the signal generation right so that is what we did so in this slide at
- the top what you see is the usual WaveNet model we call it the teacher in
- this setting this WaveNet model is pre-trained and it is fixed and it is used
- as a scoring function at the bottom what you see is the generator that we call
- the student and this student model is again an architecture that is very close
- to WaveNet but it is run as a feed-forward convolutional
- network the way it is run and trained is it actually has two
- components one component is coming from WaveNet which we know is very efficient
- in training as I said but slow in sampling the other component is based
- on the inverse autoregressive flow work that was done by Kingma and colleagues
- at OpenAI last year and this structure gives us the
- capability to take an input noise signal and slowly transform that
- noise signal into a proper distribution that is going to be the speech
- signal right so the way we train this is random noise goes in together with the
- linguistic features through layers and layers of these flows and
- that random noise gets transformed into a speech signal that speech signal
- goes into WaveNet WaveNet is already the best kind of scoring function that
- we can use because it's a density model and WaveNet scores that and from that
- score we get the gradients back into the generator and then we update
- the generator we call this process probability density distillation but of
- course when you are trying to do real-world things and if things are very
- challenging like speech signals that is by itself not enough so I have
- highlighted two components here one of them as I said is the WaveNet scoring
- function the other thing that we use is a power loss because what happens is
- when we train the model in this manner the signal tends to be very low energy
- sort of like whispering someone speaks but they are whispering so during
- training we added this extra loss that tries to conserve the energy of
- the generated speech and with these two the WaveNet scoring and the power
- loss we were already getting a very high quality speech signal but of course
- the constraints are very tough so what we did was we trained another WaveNet
- model so we sort of used WaveNet everywhere right we are generating
- through a WaveNet-like convolutional network we are using WaveNet as a scoring
- function and we again trained another WaveNet model this time used as a
- speech recognition system and that is the perceptual loss that you see there so
- we trained the WaveNet again as a speech recognition system and what we do during
- training is of course you have the text and the corresponding speech signal we
- generate the corresponding speech through our generator we have
- the text and we give the generated speech to the speech recognition system which
- of course now needs to decode the generated signal into that text
- right and we get the error from there and propagate it back into our generator so
- that's another sort of quality improvement that we get by using speech
- recognition as a perceptual loss in our generation system and the last thing
- that we did was using a contrastive term the idea is okay we generate a
- signal conditioned on some text so you can create a contrastive loss
- saying that the signal that is generated with the corresponding text
- should be different from the same signal if it was conditioned on a
- separate text right that's a contrastive loss so more specifically what we
- have in the end is these four terms at the top we have
- the original idea of using WaveNet as a scoring function the probability
- density distillation idea then we have the power loss that uses
- Fourier transforms internally to conserve the energy then the contrastive term
- and finally the perceptual loss that does the speech recognition
- and when we combined all of these then of course what we did was look at the quality
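The four training terms above could be combined roughly as below. This is a hedged sketch: the weights, frame size, and the exact form of `power_loss` are my illustrative assumptions; only the roles of the four terms (distillation, power, perceptual, contrastive) come from the talk.

```python
import numpy as np

def power_loss(generated, reference, frame=256):
    """Match spectral energy so the student doesn't 'whisper': compare
    power spectra (via Fourier transform) of generated vs. reference."""
    gen = np.abs(np.fft.rfft(generated[:frame])) ** 2
    ref = np.abs(np.fft.rfft(reference[:frame])) ** 2
    return float(np.mean((gen - ref) ** 2))

def total_loss(distill, power, perceptual, contrastive,
               weights=(1.0, 1.0, 1.0, 1.0)):
    # Weighted sum of the four Parallel WaveNet training terms
    # (weights here are placeholders, not the published values).
    terms = (distill, power, perceptual, contrastive)
    return sum(w * t for w, t in zip(weights, terms))

t = np.linspace(0, 1, 256, endpoint=False)
ref = np.sin(2 * np.pi * 8 * t)
quiet = 0.1 * ref                    # a low-energy "whispering" output
print(power_loss(quiet, ref) > power_loss(ref, ref))   # True
```

The check at the end shows the intended behavior: a low-energy output is penalized relative to one that conserves the reference energy.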
- now what I'm showing here is the quality with respect to again the
- best non-WaveNet model so this is sort of a year after the original
- research pretty much exactly a year and during that time of course the
- best speech synthesis models also improved but WaveNet was still better than
- anything else and it was matching the quality so the new WaveNet
- the Parallel WaveNet is exactly matching the quality of the original
- WaveNet and what I'm showing here is three different US English voices
- and also Japanese and this is the kind of thing that we always want from deep
- learning right the ability to generalize to new datasets to new domains so we
- developed all of this model on practically one single US English voice and
- it was just a matter of collecting or getting another dataset from another
- speaker or another language like some speaker speaking Japanese you just
- get that run it and there you go you have a
- production-quality speech synthesis system just by doing that this is the kind of
- thing that we really like about deep learning right and if you are thinking
- about deep learning and if you are thinking about unsupervised
- learning I think this is a very good demonstration of that so
- before switching to the next one I also want to mention that we have also done
- some further work on this called WaveRNN that was recently published and I
- encourage you to look into that one too that's a very interesting piece of work
- also for generating speech at very high speed the next thing I want to
- talk about is the IMPALA architecture the new agent architecture that I mentioned
- because as I said WaveNet is in a classical sense an
- unsupervised model that can actually solve a real-world problem now the next
- thing I want to start talking about is this new different way of doing
- unsupervised learning but for that the other exciting bit is to be able to
- do deep reinforcement learning at scale sorry all right so I want to
- motivate why we want to push our deep reinforcement learning
- models further and further because most of the time what we do because this
- is a new area is we take very simple tasks in some simple
- environments and we try to train an agent that solves a
- single task in that environment well what we want to do is
- go further than that right again going back to the point of
- generalization and being able to solve multiple tasks we have created a new
- task set this is an open source task set we have an open
- source environment called DeepMind Lab and as part of that we have created this new
- task set DMLab-30 it is 30 environments that cover tasks
- around language memory navigation and those kinds of things and the goal
- is not to solve each one of them individually the goal is to have one single
- agent one single network that is solving all those tasks all at
- the same time there is nothing custom in that agent that is specific to any
- single one of these environments when you look at those environments I'm
- showing some of those here the agent has a first-person view so it is in a
- maze-like environment and the agent has a first-person camera input and
- it can navigate around go forward backwards rotate look up and down jump
- and those kinds of things and it is solving all different kinds of tasks
- that are catered to test different kinds of abilities but the
- goal as I said again is to solve all of them at the same time one thing that
- becomes really important in this case is of course the stability of our
- algorithms because now we are not solving one single task we are solving 30 of
- them and we want really stable models because we don't have the chance to
- tune hyperparameters for a single task anymore and of course what becomes really
- important is task interference right what we hope again by using
- deep learning is that this is a multi-task setting and in this multi-task
- setting we hope to see positive transfer rather than task interference and
- we hope to demonstrate this in this challenging
- reinforcement learning domain too okay I realized that I needed to put a
- slide about why deep reinforcement learning because a little bit to my surprise
- there was actually not much reinforcement learning at this conference this year
- and I wanted to touch a little bit on why I think it is important for
- the deep learning community to actually do deep
- reinforcement learning because to me if one of
- the goals that we work towards here is AI then it is at the core of all of it right
- reinforcement learning is a very general framework for
- learning sequential decision-making tasks and deep learning
- on the other hand is of course the best set of
- algorithms we have to learn representations and the
- combination of these two is
- the best answer so far that we have in terms of learning very good state
- representations for very challenging tasks not just for solving
- toy domains but actually for solving challenging real-world problems of course
- there are many open problems there some of them
- that are interesting at least for me are the idea of separating the
- computational power of a model from the number of weights or the number of
- layers it has or basically again going back to unsupervised learning learning
- to transfer so we want to build these deep reinforcement learning models with the idea
- to actually generalize and to transfer okay so the IMPALA agent is based
- on another work that we did a couple of years ago called the asynchronous
- advantage actor-critic the A3C model in the end it's a policy gradient
- method what you have is what I tried to sort of cartoonishly explain
- in the figure at every time step the agent sees the environment
- and at that time step the agent outputs a policy distribution and also
- the value function the value function is the agent's expectation of the total
- amount of reward that it's going to get until the end of the episode being in
- that state all right and the policy is the distribution over the actions that
- the agent has and at every time step the agent looks at the environment and
- updates its policy so that it can actually act in the environment and it
- updates its value function and the way you train this is with the policy
- gradient intuitively this is actually very simple what you do
- is the gradient of the policy is scaled by the difference between the total
- reward that the agent actually gets in the environment minus a baseline and the
- baseline is the value function right so what it means is if the agent ends up
- doing better than what the value function its expectation was then it's a
- good thing you have a positive gradient you're going to reinforce your
- understanding of the environment if the agent does worse so
- the value was higher than the total reward that you got then you have a
- negative gradient you need to shuffle things around and the way you learn the
- value function is with the usual n-step TD error
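The update just described can be sketched numerically: the policy gradient is scaled by (return minus baseline), the baseline is the value function, and the value function is trained toward an n-step return. The toy rewards and value estimates below are mine, for illustration only.

```python
# Toy sketch of the advantage computation in an advantage actor-critic
# update (illustrative numbers, not DeepMind's code).

def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """Discounted sum of n rewards plus the discounted value estimate
    of the state reached after n steps (the n-step TD target)."""
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [0.0, 0.0, 1.0]     # a toy 3-step trajectory
v_boot = 0.5                  # critic's V(s_{t+3})
target = n_step_return(rewards, v_boot)
advantage = target - 0.8      # baseline V(s_t) = 0.8
# advantage > 0: the agent did better than its value estimate, so the
# policy gradient reinforces the actions taken.
print(round(target, 4), advantage > 0)   # 1.4652 True
```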
- so this was the actor-critic part the asynchronous part of the A3C algorithm is
- composed of multiple actors each actor independently operates in the
- environment it collects observations acts in the
- environment and computes the policy gradients with
- respect to the parameters of its network then what it does is it sends those
- gradients back to the parameter server the parameter server collects
- these gradients from all the different actors combines them together and then
- shares the parameters with all the actors now what happens in this
- case is as you increase the number of actors this is the usual asynchronous
- stochastic gradient descent setup as the number of actors increases
- the staleness of the gradients becomes a problem so what happens in
- the end is the distributed experience collection is actually something very
- advantageous it's very good but communicating gradients
- might become a bottleneck as you try to really scale things up so for that what
- we tried was a different architecture the idea of a parameter server is
- actually quite useful but rather than using it just to
- accumulate the parameter updates the idea is to make the
- centralized component into a learner so the whole learning algorithm is
- contained in that what the actors do is only act in the environment not
- compute the gradients or anything they send the observations back to
- the learner and the learner sends the parameters back and in this way
- what you are doing is you are completely decoupling
- experience collection in your environments from your learning algorithm and in
- this way you are actually gaining a lot of robustness to noise in your
- environments sometimes rendering times vary some environments are slow
- some environments are fast all that is completely decoupled from your learning
- algorithm but of course what you need is a good learning algorithm to be
- able to deal with that kind of variation so in the end in IMPALA what we have
- is we have a very efficient decoupled backward pass actors
- generate trajectories as I said but that decoupling creates
- off-policyness right the policy in the actors the behavior policy if you will is
- separate from the policy in the learner the target policy so what we need is
- off-policy learning of course there are many off-policy learning algorithms but we
- really wanted to have a policy gradient method and for that we developed this
- new method called V-trace it's an off-policy advantage actor-critic algorithm the
- advantage of V-trace is it uses truncated importance sampling ratios to
- actually come up with an estimate for the value because there is this
- imbalance between the learner and the actors you need to
- correct for that difference the good thing about this
- algorithm is it's a smooth transition between the on-policy and off-policy cases
- when the actors and the learner are completely in sync so you're in
- the on-policy case the algorithm actually boils down to the usual A3C update
- with the n-step Bellman equation if they become more separate then the
- correction of the algorithm kicks in and then you have the corrected
- estimate the algorithm has two main truncation factors that
- control two different aspects of off-policy learning one of them is rho-bar
- which controls which value function the algorithm is going to converge
- towards the value function that corresponds to the
- behavior policy or the value function that corresponds to the target policy in
- the learner and the other one the c-bar factor controls the speed of convergence
- by controlling the truncation it can increase or
- decrease the variance in learning and it can have
- an effect on the speed of convergence now when we tested this of course
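A toy sketch of the V-trace target construction described above: truncated importance ratios (capped by rho-bar and c-bar) scale TD errors, and the targets are built by a backward recursion. The recursion form follows the published IMPALA/V-trace description; the numbers are illustrative.

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap, ratios,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Value targets v_s from truncated importance-sampled TD errors:
    v_s = V(s) + delta_s + gamma * c_s * (v_{s+1} - V(s+1))."""
    T = len(rewards)
    rho = np.minimum(rho_bar, ratios)          # truncation via rho-bar
    c = np.minimum(c_bar, ratios)              # truncation via c-bar
    v_next = np.append(values[1:], bootstrap)
    deltas = rho * (rewards + gamma * v_next - values)
    vs = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):               # backward recursion
        acc = deltas[t] + gamma * c[t] * acc
        vs[t] = values[t] + acc
    return vs

# On-policy check: with ratios == 1 (actor and learner in sync), the
# targets reduce to ordinary n-step returns, as described in the talk.
vs = vtrace_targets(np.array([0.0, 0.0, 1.0]),
                    np.array([0.8, 0.6, 0.4]),
                    bootstrap=0.5, ratios=np.ones(3))
print(np.round(vs, 4))
```

The printed first target equals the plain 3-step return from the earlier A3C sketch, which is exactly the "smooth transition to the on-policy case" property.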
- the goal is to test on all environments at once but what we wanted to do was
- first look at single tasks so we looked at five different
- environments and we see that in these environments the IMPALA algorithm is always
- very stable and performs at the top so the comparisons here are the IMPALA
- algorithm the batched A3C method the batched A2C method and then different
- versions of the A3C algorithm and you can see that IMPALA and batched A2C
- are always performing at the top IMPALA seems to be doing fine they're
- the dark blue curve and this gives us the sort of feeling that okay we
- have a nice algorithm now of course the other thing that is very important and
- that is discussed a lot is the stability of these algorithms right I actually
- really like these plots since the A3C work we have kept looking at
- these plots and we always put them in the papers in the plot here on the
- x-axis we have the hyperparameter combinations when you
- train any model what we all do is some sort of hyperparameter
- sweep and here what we are doing is we are looking at the final score
- that we achieve with every single hyperparameter setting
- and you sort them and in this kind of plot the
- curves the algorithms that are at the top and that are most flat are the
- better performing and most stable algorithms right and what we see here is
- IMPALA is always at the top of course it's achieving better results but it's not achieving
- those results because of one lucky hyperparameter setting it is
- consistently at the top and you can see that it's not of course completely flat
- because in the end we are searching over three orders of magnitude in
- hyperparameter settings but we can see that the algorithm is actually quite
- stable now when we look at our main goal here what we are looking at
- on the x-axis is wall-clock time and on the y-axis we have the sort of
- normalized score and the red line that you see there is A3C
- and you can see that IMPALA not only achieves much better scores it achieves
- them much much faster the other thing is comparing the green and the
- orange lines that is the comparison between training IMPALA in an expert
- setting versus a multi-task setting and we see that it achieves better scores
- faster which again gives us the idea that we are actually seeing positive
- transfer it's a like-for-like setting all the details
- of the network and the agent are the same in one case you train one network per
- task and in the other case you train the same network on all the tasks and what
- you achieve is a better result because of the positive transfer between those
- tasks and what happens is if you give IMPALA more resources you end up with
- this almost vertical takeoff from there right and what you have is you can
- actually solve this challenging thirty-task domain in under 24 hours given the
- resources and that is the kind of algorithmic sort of power that we want in order
- to be able to train these very highly scalable agents now why do we want to do that
- that is the point that I want to come next and and and in the final part this
- is the new spiral algorithm that I want to talk about now just quickly going
- back to the original ideas that I talked about: unsupervised learning is
- also about explaining environments and generating samples, but maybe
- generating samples by explaining environments. We talked about the fact that
- when we have these deep learning models like WaveNet we can generate amazing
- samples, but at the same time maybe there's a different way we can do these
- things, less implicit in the sense that when we generate these samples they
- come with some explanation, and that explanation can come through using some
- tools. In this particular case we are going to use a painting tool, and we
- are going to learn to control this painting tool. It's a real drawing
- program, and we are going to basically generate a program that the painting
- tool will use to generate the image. The main idea that I want to convey is
- that by learning how to use tools that are already available, we can start
- thinking about different kinds of generalizations, as I'll try to
- demonstrate. So in the real world we have a lot of
- examples of programs, their executions, and the results of those programs;
- they can be arithmetic programs, plotting programs, or even architectural
- blueprints. And because we have information on that generation process, when
- we see the results we can go and try to infer what was the program, what was
- the blueprint, that generated that particular input. We can do this, and the
- goal is to be able to do this with our
- agents too. Specifically, we are going to use this environment called
- libmypaint; it is actually a professional-grade open-source painting
- library, and it's used worldwide by many artists. We are using a limited
- interface, basically learning to draw brushstrokes, and we are going to have
- an agent that does that. The agent, in the end called SPIRAL, has three main
- components. First of all is the agent that generates the brushstrokes; I
- like to see that as writing the program. The second one is the environment,
- libmypaint: the brushstrokes come in, and the environment turns those into
- brushstrokes on the canvas. That canvas goes into a discriminator, and the
- discriminator is trained like in a GAN: the discriminator looks at the
- generated image and asks, does this look like a real drawing? and then gives
- a score. And with that score, as opposed to the usual GAN training, rather
- than propagating the gradients back, we take the score and train our agent
- with it as a reward. So when you think about these three components coming
- together, you have an unsupervised learning model similar to GANs, but
- rather than generating in the pixel space we generate in this program space,
- and the training is done through the reward that the agent itself also
- learns. We are sort of trusting another neural net, just like in the GAN
- setup, to actually guide learning, but not through its gradients, just
- through the score function. In my opinion that makes it, in certain cases,
- very capable of using different kinds of tools. So as I said, this
- agent, the reinforcement learning part of the agent, is completely the same
- as IMPALA. Now that we have an agent that can actually solve really
- challenging reinforcement learning setups, we take it and put it into this
- environment, augmented with the ability to learn a discriminator function to
- actually provide the reward. To emphasize again, the important thing here
- is: yes, we have an agent, but there is no environment that actually says,
- OK, this is the reward that the agent should get. The reward generation is
- also inside the agent, thanks to all the unsupervised learning models that
- are being studied here; we specifically use a GAN setup there.
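Schematically, the loop just described, where the agent writes a stroke program, a black-box renderer executes it, and a discriminator's score comes back as a reward, might look like this toy sketch. Every function here is an invented stand-in (not the real agent, libmypaint, or discriminator), and simple score-based hill climbing stands in for the actual RL update:

```python
import random

# Toy SPIRAL-style loop: the agent emits a stroke "program", a
# non-differentiable renderer executes it on a canvas, and the
# discriminator's score is fed back as a reward. No gradient ever
# flows through the renderer -- only the scalar score does.

def agent_policy(theta, n_strokes=4):
    # Emit stroke commands; theta is a stand-in for the policy params.
    return [theta + 0.1 * random.gauss(0.0, 1.0) for _ in range(n_strokes)]

def render(strokes):
    # Stand-in for libmypaint: a black-box drawing program.
    return [round(s, 3) for s in strokes]

def discriminator(canvas, real_center=0.0):
    # Stand-in discriminator: higher score the closer the canvas
    # statistics are to "real" drawings centered at real_center.
    return -sum(abs(p - real_center) for p in canvas) / len(canvas)

def score_policy(theta):
    # Reward = discriminator score of the rendered stroke program.
    return discriminator(render(agent_policy(theta)))

def improve(theta, n_iters=200, sigma=0.5):
    # Score-based hill climbing as a stand-in for the RL update: we
    # can only compare scores, never backpropagate through render().
    for _ in range(n_iters):
        candidate = theta + random.gauss(0.0, sigma)
        if score_policy(candidate) > score_policy(theta):
            theta = candidate
    return theta
```

The design choice this illustrates is the one made above: because the tool is a black box, the discriminator can only be used as a scalar reward signal, which is exactly why a strong RL learner like IMPALA is needed in place of backpropagation through the generator.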
- So, can we generate? The first thing, of course, we try when doing
- unsupervised learning from scratch is to go back to MNIST; you start from
- MNIST, and initially of course it's generating various scribble-like things,
- but then through training it becomes better and better. Here in the middle
- you see what the agent learned; these are completely unconditional samples.
- The ones that you see in the middle: it learned to create the strokes that
- generate these digits. To emphasize this: this agent has never seen strokes
- coming from real people, how we draw digits. It learned by experimenting
- with these strokes, and it sort of built its own policy to create the
- strokes that would generate these images. Of course you can train the whole
- setup as a conditional generation process, to recreate a given image, too. I
- think the main thing about this is that it's learning, in an unsupervised
- way, to draw the strokes. I see it as the environment, the libmypaint
- environment, giving us a grounded bottleneck to actually create a meaningful
- representation space. Of course the next thing we tried was Omniglot, and
- again you see the same things: it can generate unconditional, meaningful
- Omniglot-looking samples, or it can recreate Omniglot samples. But then,
- generalization. So here what we tried was to train the model on Omniglot
- and then ask it to generate MNIST digits; this is what you see in the middle
- row there. Can it draw MNIST digits? It has never seen MNIST digits before,
- but we all know that Omniglot is more general than MNIST, and it can do it:
- given an MNIST digit, it can actually draw it, even though the network
- itself has never seen any MNIST digits during its training. Then we tried
- smileys, line drawings; given a smiley, it can also draw smileys, and that
- is great. So can we do more? We took this cartoon drawing, chopped it up
- into 64-by-64 pieces, and it's a general line drawing. Again, this is the
- agent that was trained using Omniglot, and now you can see that it can
- actually recreate that drawing. Certain areas, right around the eyes, are
- really complicated, but in general you can see that it is actually capable
- of generating those drawings. So this gives you an idea of generalization: I
- can train on one domain and generalize to new ones. So can I push
- it further? The next thing that we tried: the advantage of using a tool is
- that you have a meaningful representation space, and we can hopefully
- transfer that representation space into a new environment. So here what we
- do is, again with the same agent that is trained using Omniglot, we transfer
- from that simulated environment into the real world. The way we do that is
- we took that same program, and our friends at the robotics group at DeepMind
- wrote a controller for a robotic arm to take that program and draw it. This
- whole experiment happened in under a week, really, and what we ended up with
- was the same agent; it is not fine-tuned for this setup or anything. The
- same agent generates its brushstroke programs, and then that program goes
- into a controller that can be realized by a real robotic arm.
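A minimal sketch of what such a re-targeting step might look like, assuming, hypothetically, a 64x64 canvas and a small rectangular workspace; none of these names or numbers come from the actual DeepMind controller:

```python
# Hypothetical adapter between the agent's stroke program and an arm
# controller: the stroke program the agent emits for a 64x64 simulated
# canvas is mapped to metric waypoints a real arm could execute, while
# the agent itself stays completely unchanged.

def strokes_to_waypoints(strokes, canvas_size=64, workspace_m=(0.2, 0.2)):
    """Map (x, y) canvas pixels in [0, canvas_size) to metres."""
    sx, sy = workspace_m
    return [(x / canvas_size * sx, y / canvas_size * sy) for x, y in strokes]

# Only this adapter differs between the simulated canvas and the
# physical arm; the agent's program is reused as-is.
waypoints = strokes_to_waypoints([(0, 0), (32, 32), (63, 63)])
```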
- The advantage of doing this, the reason we can do this, is that the
- environment we used is a real environment; we didn't invent that
- environment. The latent space, if you will, is not some arbitrary latent
- space that we created; it's a latent space that is defined by the tool, and
- that is a meaningful tool space. The reason we create those tools is to
- solve many different problems anyway, and this is an example of that: using
- that tool space gives us the ability to actually transfer its capability. So
- with that I
- want to conclude. I tried to give an explanation of how to think about
- generative models and unsupervised learning, and of course I'm a hundred
- percent sure everyone agrees that our aim is not just to look at images; our
- aim is to do much more than that. I tried to give two different aspects. One
- of them is that the kind of generative models we can build right now can
- solve real-world problems, like we have seen with WaveNet. And also we can
- think about a different kind of setup where we have agents actually training
- and generating interpretable programs. That is an important aspect; we have
- seen that conversation coming up through several of the talks here, that
- being able to generate interpretable programs is one of the bottlenecks we
- face right now, because there are many critical applications that we want to
- solve and many tools with GUIs, and this is one sort of step towards that,
- the way I see it. Being able to do this requires us to create these very
- capable reinforcement learning agents that rely on new algorithms that we
- need to work on. With that, thank you very much; I want to thank all my
- collaborators for their help on this. Thank you very much. [Applause]
- [Music] [Applause] We have time for maybe one or two questions. Okay, so I
- have a question: how do you think about scaling to more general domains,
- beyond simple strokes, to generate realistic scenes? Right, so one thing
- that I haven't shown here: yes, creating realistic scenes is one case. One
- thing that I haven't talked about, actually, as part of this work, it's in
- the paper, one thing the team did, and by the way I have to mention that
- this was worked on mostly by Yaroslav Ganin; he's actually a PhD student at
- MILA and he spent his summer with us doing his internship, and he did an
- amazing job for doing it during an internship, big congratulations to him.
- One thing that we did was actually try to generate images: we took the
- CelebA dataset and used the same drawing program to actually draw those, and
- in that case our setup just scales towards that; the same setup actually
- scales because it's a general drawing tool and you can control the color. We
- can do that, but it requires a little bit more; it was one of the last
- experiments that we did, but it is sort of in the works. Thanks for a
- great talk! I had a question about the IMPALA results: you had a slide with
- a curve where all workers are learning versus having one centralized,
- sorry, centralized learner, and the all-workers-learning actually does
- better than the centralized learner. I found that quite surprising, but,
- you know, it's great, and it's great to see the positive transfer between
- tasks. Have you tried that on other suites of tasks? Do you think it's just
- because the tasks in this suite are very similar to each other? It
- definitely depends on that, but the reason we created those tasks is for
- that reason: in the real world, the visual structure of our world is
- unique. So the kind of setup that we have in DeepMind Lab, those tasks, is
- a unified visual environment; you have one kind of agent with a unified
- action space, and now you can focus on solving different kinds of tasks. Of
- course, that is the kind of thing that we were testing: given all these, is
- it actually possible to get the multi-task positive transfer that we see in
- supervised learning cases? And we were able to see that in reinforcement
- learning. Yeah.
- Hello, this is exciting. I have a question about extending this to maybe
- more open domains. What is the challenge? Is the challenge the number of
- actions to pick, because here the stroke space is maybe smaller? What are
- the challenges to extend to open domains? What do you have in mind as open
- domains? The number of actions is definitely a challenge; it is definitely
- one of the big challenges, and as far as I know a lot of research in RL
- goes into that. But that is, I think, only one of the main challenges. The
- other challenge, of course, is the state representation; that is mainly why
- we used deep learning, because we expect that with deep learning we are
- going to be able to learn better representations. And that still remains a
- challenge, because being able to learn representations is not only an
- architectural problem; it is also about finding the right sort of training
- setup, and SPIRAL was an example of that, where we can get that reward
- function, that reward signal, in an unsupervised way. In many different
- domains there are many different ways we can do this, but actually finding
- those solutions is also part of that. Okay, so let's thank Koray again.
- [Music]
- [Applause]