Good morning. Hi, my name is Amelie and I'm going to be the session chair for the morning, so it's my great pleasure to introduce Koray Kavukcuoglu, who is going to give an invited talk. Koray is a Director of Research at DeepMind and one of the star researchers in our community. He has contributed to many highly influential projects at DeepMind, such as spatial transformer networks, autoregressive generative models such as pixel recurrent networks and WaveNet, and deep reinforcement learning for playing Atari games and AlphaGo. Today he will talk about going from generative models to generative agents, so let's welcome Koray.

[Applause] [Music]
Thank you very much for the very nice introduction, and thanks everyone for being here; it's an absolute pleasure. As mentioned, I'm going to try to talk about unsupervised learning in general, starting from generative models, maybe the classical way, and then I'll try to give another view that I think is quite interesting and that we have been working on recently.

When I think about what the important things are for us to do as a community, I think everyone here sort of agrees that in the end what is important is to be doing unsupervised learning. We have realized that supervised learning has all sorts of successes, but in the end unsupervised learning is the next frontier. When I think about unsupervised learning, there are different explanations that come to my mind, and when talking to people I think we all have somewhat different opinions on this. One common explanation is: we have an unsupervised learning algorithm, we run it on our data, and what we expect is for the algorithm to understand and explain our data, or our environment. What we expect from this is that the algorithm is going to learn the intrinsic properties of our data, of our environment, and then it's going to be able to explain the data through those properties. But most of the time, because of the kinds of models that we use, what happens is that in the end we resort to looking at samples, and when we look at the samples we try to see whether our model really understood the environment; if it understood the environment, then the samples should be meaningful. Of course we also look at all sorts of objective measures during training, like Inception scores, log-likelihoods and such, but in the end we always go back to the samples to understand whether our model can really explain what's going on in the environment.
The other kind of general explanation that we all use is that the goal of unsupervised learning is to learn rich representations. It's already embedded in the name of this conference: the main goal of deep learning, and of unsupervised learning, is learning those representations. But when we think about those representations, again, this explanation doesn't give us an objective measure. How are we going to think about those representations in terms of being good and useful? To me the most important bit is that if we have good, rich representations, then they are useful for generalization, for transfer. If we have a good unsupervised learning model and it can give us good representations, then we can get generalization.

So what I'm going to do today is also tie this together with something else that is, for me, very important. As was mentioned, a big chunk of the work that we have been doing at DeepMind, and that I have been doing, is about agents and reinforcement learning, and in this talk I'm going to take a look at unsupervised learning both in the classical sense of learning a generative model, and in the sense of learning an agent that can do unsupervised learning.
So I'm going to start from the WaveNet model. Hopefully, as many of you know, it is a generative model of audio. It's a pure deep learning model, and it turns out you can model any audio signal, like speech and music, and get really realistic samples out of it. The next thing I'm going to do is explain a different, newer approach to unsupervised learning that I find really interesting, which is based on deep reinforcement learning: learning an agent that actually does unsupervised learning. This model, called SPIRAL, is based on a new agent architecture that we have been working on and published recently, called IMPALA. It's a highly scalable, efficient, off-policy learning agent architecture that we use in SPIRAL to do unsupervised learning. The interesting bit about the SPIRAL work is that it achieves generalization by using a tool space: tools that we as people have created so that we can solve not one specific problem but many different problems. By using the interface of a tool, and by having an agent, you can now learn a generative model of your environment.
All right, so without more delay, the first thing that I'm going to quickly introduce is the WaveNet model. WaveNet is a generative model of audio. As I said, it models the raw audio signal; it doesn't use any intermediate representation to model the audio. Audio in general is very high-dimensional: the standard audio signal that we started with (this moved a bit as we went along) was 16,000 samples per second. If you compare that to our usual language modeling and machine translation kinds of tasks, it is several orders of magnitude more data, so the dependencies that one needs to capture to model audio well are very long. What this model does is model one sample at a time, and it uses a softmax distribution to model each sample conditioned on all the previous samples of the signal. When you look at it more closely, it is an architecture that has quite a bit of resemblance to the PixelCNN model, which maybe some of you are also familiar with. In the end it is a stack of multiple convolutional layers. To be a little more specific, it has residual blocks, you use multiples of those residual blocks, and in each residual block there are dilated convolutional layers stacked on top of each other. Through those dilated convolutional layers, which are causal convolutions, we can model very long dependencies; that is how we get the modeling of dependencies in time.
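(To make the architectural idea concrete, here is a minimal sketch of a stack of dilated causal 1-D convolutions in PyTorch-style Python. All names, channel counts and the dilation schedule are illustrative, not the actual WaveNet implementation; the real model also has gated activations, residual and skip connections, and a softmax over quantized sample values.)

    import torch
    import torch.nn as nn

    class CausalDilatedConv(nn.Module):
        """One dilated causal 1-D convolution: the output at time t only sees inputs up to t."""
        def __init__(self, channels, dilation):
            super().__init__()
            self.pad = dilation  # left-pad so kernel_size=2 stays causal and keeps the length
            self.conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)

        def forward(self, x):                            # x: (batch, channels, time)
            x = nn.functional.pad(x, (self.pad, 0))      # pad only on the left, never the future
            return torch.relu(self.conv(x))

    # Dilations 1, 2, 4, ..., 512 (repeated a few times in the real model) give a receptive
    # field of thousands of samples, which is what captures the long-range dependencies.
    stack = nn.Sequential(*[CausalDilatedConv(64, 2 ** i) for i in range(10)])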
Now, one of the biggest design considerations about WaveNet is that it is designed to be very efficient during training. During training, because all the targets are known, you can process the whole signal at once: just run it like a convolutional net, you get your predictions, and because you have the targets you get your error signal and propagate it back, so training is very efficient. But when it comes to sampling, in the end this is an autoregressive model, and through those causal convolutions you need to generate one sample at a time. So if you are sampling at, let's say, 24 kilohertz, 24,000 samples per second, you need to generate one sample at a time, just like you see in this animation. Of course this is painful, but in the end it works quite well and we can generate very, very high-quality audio with this.
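(A minimal sketch of why sampling is the slow part: training is a single parallel pass with known targets, but generation has to loop sample by sample. This is illustrative Python with a stand-in model, not the real implementation.)

    import numpy as np

    def sample_sequentially(predict_next, n_samples, n_levels=256, seed=0):
        """Generate one quantized audio sample at a time; each step conditions on all previous ones."""
        rng = np.random.default_rng(seed)
        signal = [n_levels // 2]                          # start roughly at silence
        for _ in range(n_samples):                        # e.g. 24,000 iterations per second of audio
            probs = predict_next(np.array(signal))        # softmax over the next sample's levels
            signal.append(int(rng.choice(n_levels, p=probs)))
        return np.array(signal)

    # Stand-in "model" that returns a uniform distribution, just so the sketch runs:
    uniform_model = lambda history: np.full(256, 1.0 / 256)
    audio = sample_sequentially(uniform_model, n_samples=1000)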
So what I want to do now is make you listen to unconditional samples from this model. We model the speech signal without any conditioning on text or anything: just take the audio signal, model it with WaveNet, and then sample. As you can hopefully hear, the quality is very high, and this is modeling the raw audio signal, completely unconditionally. Sometimes you even hear short words, like "okay" or "from", and if you listen, all the intonation and everything sounds quite natural; sometimes it feels like you are listening to someone speaking in a language that you don't know. The main characteristics of the signal are all captured, so in terms of dependencies, something like several thousand samples of dependency are properly and correctly modeled there.

And then, of course, what you can do is augment this model by conditioning on a text signal that is associated with the audio you want to generate. By conditioning on the text, you now have a conditional generative model that solves a real-world problem just by itself, end to end with deep learning: from the text you create linguistic embeddings, using those linguistic embeddings you generate the signal, and then it starts talking. So it's a solution to the whole text-to-speech synthesis problem, which as you know is very commonly used in the real world.
All right. When we built the WaveNet model, and this was around almost two years ago now, we looked at the quality when we use it as a TTS model. In green, what you see is the quality of human speech, measured through mean opinion scores; in blue you see WaveNet, and the other colors are the best other models around at the time. You can see that WaveNet closed the gap between real human speech and the other models by a big margin. At the time this really got us excited, because now we actually had a deep learning model that comes with all the flexibility and advantages of doing deep learning, and at the same time it's modeling raw audio at very high quality. I could play text-to-speech samples generated by this model, but actually, if you are using Google Assistant right now, you are already hearing WaveNet, because this is already in production. Anyone who is using Google Assistant and querying Wikipedia and things like that: the speech that is generated there is coming from the WaveNet model. What I want to do next is explain how we did that, and that brings me to the next project in the WaveNet domain: the Parallel WaveNet project.
Of course, when you have a research project and at some point you realize that it actually lends itself to the solution of a real-world problem, and you want to put it into production in a very challenging environment, it requires much more than our little research group. This was a big collaboration between DeepMind research, DeepMind applied, and the Google speech teams. In this slide, what I show are the basic ingredients of how we turned the WaveNet architecture into a feed-forward, parallel architecture. What we realized pretty soon, when we started attempting to put a system like this into production, was that speed is very important; quality is very important too, but on the speed side it is not enough to run in real time. The constraints we were targeting are orders of magnitude faster than real time, even being able to run in constant time. And when the constraint becomes being able to run in constant time, the only thing you can do is create a feed-forward network and parallelize the signal generation. That is what we did.
So in this slide, at the top, what you see is the usual WaveNet model; we call it the teacher. In this setting the teacher WaveNet is pre-trained, it is fixed, and it is used as a scoring function. At the bottom, what you see is the generator, which we call the student. The student model is again an architecture that is very close to WaveNet, but it is run as a feed-forward convolutional network. The way it is run and trained has two components: one component comes from WaveNet, which as I said is very efficient in training but slow in sampling; the other is based on the inverse autoregressive flow work done by Kingma and colleagues at OpenAI last year. This structure gives us the capability to take an input noise signal and slowly transform that noise into a proper distribution, which is going to be the speech signal. So the way we train this is: random noise goes in, together with the linguistic features, through layers and layers of these flows, and that random noise gets transformed into a speech signal. That speech signal goes into WaveNet; WaveNet is already about the best scoring function we could use, because it's a density model. WaveNet scores the generated signal, we get the gradients from that score back into the generator, and then we update the generator. We call this process probability density distillation.
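(Schematically, and hedging that this is my summary of the idea rather than the exact published objective: the distillation term is the KL divergence from the student distribution P_S to the pre-trained teacher distribution P_T, which decomposes into a cross-entropy term and an entropy term,

    D_{KL}(P_S \,\|\, P_T) = H(P_S, P_T) - H(P_S),

where H(P_S, P_T) is the negative log-probability that the teacher WaveNet assigns to the student's samples, and H(P_S) is the student's own entropy, which the inverse-autoregressive-flow parameterization makes tractable to estimate.)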
But of course, when you are trying to do real-world things, and the things are as challenging as speech signals, that by itself is not enough. I have highlighted two components here: one of them, as I said, is the WaveNet scoring function; the other thing that we use is a power loss. What happens is that when we train the model in this manner, the generated signal tends to be very low energy, sort of like whispering: someone speaks, but as if they are whispering. So during training we add this extra loss that tries to conserve the energy of the generated speech.
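(To give a feel for what such a term can look like, here is a minimal sketch of a power loss in the spirit described above: penalize the difference in average spectral energy between the generated and the reference speech. The frame and FFT sizes are illustrative, and this is not the exact formulation from the paper.)

    import numpy as np

    def power_loss(generated, reference, frame=512, hop=256):
        """Mean squared difference of the time-averaged power spectra of two waveforms."""
        def avg_power(x):
            frames = np.stack([x[i:i + frame] for i in range(0, len(x) - frame, hop)])
            return (np.abs(np.fft.rfft(frames, axis=-1)) ** 2).mean(axis=0)
        return float(np.mean((avg_power(generated) - avg_power(reference)) ** 2))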
With these two, the WaveNet scoring and the power loss, we were already getting very high-quality speech, but the constraints are very tough, so what we did was train yet another WaveNet model; we sort of used WaveNet everywhere. We are generating through a WaveNet-like convolutional network, we are using WaveNet as a scoring function, and we trained another WaveNet model, this time used as a speech recognition system. That is the perceptual loss that you see there. During training you have the text and the corresponding speech; we generate the corresponding speech through our generator, give it to the speech recognition system, and the speech recognition system then needs to decode the generated signal back into that text. We get the error from there and propagate it back into our generator, so that's another quality improvement we get by using speech recognition as a perceptual loss in our generation system. The last thing we added was a contrastive term, which basically says: we generate a signal conditioned on some text, and you can create a contrastive loss saying that the signal generated with the corresponding text should be different from the signal you would get if it were conditioned on a separate text.

So, more specifically, we end up with these four terms: at the top, the original use of WaveNet as a scoring function, which is the probability density distillation idea; then the power loss, which uses Fourier transforms internally to conserve the energy; the contrastive term; and finally the perceptual loss, which does the speech recognition.
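(Putting the four terms together, the overall training objective for the student generator has the shape below; the lambda weights are placeholders for however the terms are balanced in practice, not published values.)

    \mathcal{L}_{student} = D_{KL}(P_S \,\|\, P_T) + \lambda_{power} \mathcal{L}_{power} + \lambda_{percept} \mathcal{L}_{percept} + \lambda_{contr} \mathcal{L}_{contrast}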
When we combined all these, of course, what we did was look at the quality again. What I'm showing here is the quality with respect to the best non-WaveNet model; this is about a year after the original research, pretty much exactly a year. During that time the best speech synthesis models also improved, but WaveNet was still better than anything else, and the new one matches it: Parallel WaveNet exactly matches the quality of the original WaveNet. What I'm showing here is three different US English voices and also Japanese, and this is the kind of thing that we always want from deep learning: the ability to generalize to new datasets, to new domains. We developed all of this on practically one single US English voice, and then it was just a matter of collecting another dataset from another speaker or another language, like a speaker speaking Japanese; you just get that, run it, and there you go, you have a production-quality speech synthesis system. This is the kind of thing that we really like from deep learning, and if you are thinking about unsupervised learning, I think this is a very good demonstration of it. Before switching to the next topic, I also want to mention that we have done some further work on this called WaveRNN, which was recently published, and I encourage you to look into that one too; it's a very interesting piece of work, also for generating speech at very high speed.
The next thing I want to talk about is the IMPALA architecture, the new agent architecture that I mentioned. WaveNet is unsupervised modeling in a classical sense that can actually solve a real-world problem; the next thing I want to start talking about is this new, different way of doing unsupervised learning, but for that the other exciting bit is to be able to do deep reinforcement learning at scale.

So I want to motivate why we want to push our deep reinforcement learning models further and further. Most of the time, because this is a new area, what we do is take fairly simple tasks in some simple environments, and we try to train an agent that solves a single task in that environment well. What we want to do is go further than that: again going back to the point of generalization and being able to solve multiple tasks, we have created a new task set. We have an open-source environment called DeepMind Lab, and as part of that we have created this new task set, DMLab-30. It is 30 environments covering tasks around language, memory, navigation and those kinds of things, and the goal is not to solve each one of them individually; the goal is to have one single agent, one single network, that solves all those tasks at the same time. There is nothing custom in that agent that is specific to any single one of these environments.
When you look at those environments, and I'm showing some of them here, the agent has a first-person view: it is in a maze-like environment, it has a first-person camera input, and it can navigate around, go forward and backwards, rotate, look up and down, jump, and those kinds of things. It is solving many different kinds of tasks that are designed to test different kinds of abilities, but the goal, as I said, is to solve all of them at the same time. One thing that becomes really important in this case is the stability of our algorithms, because now we are not solving one single task, we are solving 30 of them, and we want really stable models, because we don't have the chance to tune hyperparameters for a single task anymore. The other thing that becomes really important is task interference: hopefully, what we expect, again by using deep learning, is that in this multi-task setting we see positive transfer rather than task interference, and we hope to demonstrate this in this challenging reinforcement learning domain too.
Okay, I realized that I needed to put in a slide about why deep reinforcement learning, because, a little bit to my surprise, there was not much reinforcement learning at this conference this year, and I wanted to touch on why I think it is important for the deep learning community to actually do deep reinforcement learning. To me, if one of the goals that we work towards here is AI, then it is at the core of it: reinforcement learning is a very general framework for learning sequential decision-making tasks, and deep learning, on the other hand, is the best set of algorithms we have to learn representations. The combination of these two is the best answer we have so far for learning very good state representations for very challenging tasks, not just for solving toy domains but for solving challenging real-world problems. Of course there are many open problems there; some of the ones that are interesting, at least to me, are the idea of separating the computational power of a model from the number of weights or layers it has, and, going back to unsupervised learning again, learning to transfer, so that we build these deep reinforcement learning models with the aim of actually generalizing, of transferring.
Okay, so the IMPALA agent is based on another piece of work that we did a couple of years ago called the asynchronous advantage actor-critic, the A3C model. In the end it is a policy gradient method. As I have tried to show cartoonishly in the figure, at every time step the agent observes the environment, and at that time step it outputs a policy distribution and also a value function. The value function is the agent's expectation of the total amount of reward that it is going to get until the end of the episode, being in that state, and the policy is the distribution over the actions that the agent has. At every time step the agent looks at the environment, updates its policy so that it can act in the environment, and updates its value function. The way you train this is with the policy gradient, and intuitively it is very simple: the gradient of the log-policy is scaled by the difference between the total reward that the agent actually gets in the environment and a baseline, where the baseline is the value function. What that means is that if the agent ends up doing better than what the value function assumed, then it's a good thing: you have a positive scaling and you reinforce that behaviour, your understanding of the environment. If the agent does worse, so the value was higher than the total reward it got, then you have a negative scaling and you need to shuffle things around. The way you learn the value function is with the usual n-step TD error.
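(Written out in the usual notation, and assuming the standard advantage actor-critic formulation rather than anything specific to the slides: with an n-step return R_t, the policy parameters theta are updated along

    \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(R_t - V_\phi(s_t)\big), \qquad R_t = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n V_\phi(s_{t+n}),

and the value parameters phi are trained to minimize (R_t - V_\phi(s_t))^2, which is the n-step TD error mentioned above.)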
So that was the actor-critic part. The asynchronous part of the A3C algorithm is that it is composed of multiple actors, and each actor independently operates in the environment: it collects observations, acts in the environment, computes the policy gradients with respect to the parameters of its network, and then sends those gradients back to a parameter server. The parameter server collects the gradients from all the different actors, combines them together, and then shares the updated parameters with all the actors. Now, what happens is that as you increase the number of actors, this being the usual asynchronous stochastic gradient descent setup, the staleness of the gradients becomes a problem. Distributing the experience collection is very advantageous, but communicating gradients can become a bottleneck as you try to really scale things up.

So for that we tried a different architecture. The idea of a centralized server is quite useful, but rather than using it just to accumulate the parameter updates, the idea in IMPALA is to make the centralized component into a learner, so the whole learning algorithm is contained in it. What the actors do is only act in the environment, not compute gradients or anything; they send the observations back to the learner, and the learner sends the parameters back. In this way you are completely decoupling what happens in your experience collection, in your environments, from your learning algorithm, and you gain a lot of robustness to noise in your environments: sometimes rendering times vary, some environments are slow, some environments are fast, and all of that is completely decoupled from your learning algorithm. But of course, what you then need is a good learning algorithm that can deal with that kind of variation.
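(A toy sketch of that decoupling, purely illustrative and not the real IMPALA code: actors never compute gradients, they just push trajectories into a queue and read back the latest parameters, so slow or fast environments only affect data collection, never the learner's update step.)

    import queue
    import threading
    import numpy as np

    trajectories = queue.Queue()
    params = {"step": 0}                          # stands in for the network weights

    def actor(n_rollouts=100):
        for _ in range(n_rollouts):
            seen_at = params["step"]              # parameters may already be slightly stale
            rollout = np.random.randn(20)         # pretend trajectory collected in the environment
            trajectories.put((seen_at, rollout))

    actors = [threading.Thread(target=actor) for _ in range(4)]
    for t in actors:
        t.start()

    for step in range(400):                       # the centralized learner
        seen_at, rollout = trajectories.get()     # data is off-policy by (step - seen_at) updates
        params["step"] = step + 1                 # stands in for one gradient update
    for t in actors:
        t.join()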
So, in the end, what we have in IMPALA is a very efficient, decoupled backward pass, if you will. Actors generate trajectories, as I said, but that decoupling creates off-policy-ness: the policy in the actors, the behaviour policy if you will, is separate from the policy in the learner, the target policy. So what we need is off-policy learning. There are many off-policy learning algorithms, but we really wanted a policy gradient method, and for that we developed this new method called V-trace. It's an off-policy advantage actor-critic algorithm, and the key point of V-trace is that it uses truncated importance sampling ratios to come up with an estimate of the value: because there is this mismatch between the learner and the actors, you need to correct for that difference. The good thing about this algorithm is that it transitions smoothly between the on-policy case and the off-policy case. When the actors and the learner are completely in sync, so you are in the on-policy case, the algorithm boils down to the usual A3C update with the n-step Bellman equation; as they become more separated, the correction of the algorithm kicks in and you get a corrected estimate. The algorithm has two main truncation factors that control two different aspects of the off-policy learning. One of them is rho, which controls which value function the algorithm converges towards: the value function that corresponds to the behaviour policy, or the one that corresponds to the target policy in the learner. The other one, the c factor, controls the speed of convergence: by controlling that truncation, it can increase or decrease the variance in learning and so have an effect on the speed of convergence.
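(Schematically, and from my reading of the IMPALA paper rather than from the slides themselves, the V-trace target for the value at state x_s over an n-step trajectory collected with behaviour policy mu, while learning target policy pi, looks like

    v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s} \Big(\prod_{i=s}^{t-1} c_i\Big)\, \delta_t, \qquad \delta_t = \rho_t \big(r_t + \gamma V(x_{t+1}) - V(x_t)\big),

    \rho_t = \min\!\big(\bar\rho,\ \pi(a_t \mid x_t)/\mu(a_t \mid x_t)\big), \qquad c_i = \min\!\big(\bar c,\ \pi(a_i \mid x_i)/\mu(a_i \mid x_i)\big),

where the truncation level \bar\rho determines which value function the estimate converges to and \bar c controls the variance, and hence the speed, of the correction. When pi equals mu, and the truncation levels are at least 1, all the ratios are 1 and this reduces to the usual n-step Bellman target.)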
Now, when we tested this, of course the goal is to test on all the environments at once, but what we wanted to do first was look at single tasks. We looked at five different environments, and we see that in these environments the IMPALA algorithm is always very stable and performs at the top. The comparisons here are the IMPALA algorithm, the batched A3C method, the batched A2C method, and different versions of the A3C algorithm, and you can see that IMPALA and batched A2C are always performing at the top; IMPALA, the dark blue curve, seems to be doing fine, and this gave us the feeling that, okay, we have a nice algorithm.

Now, the other thing that is very important, and that is discussed a lot, is the stability of these algorithms. I actually really like these plots; since the A3C work we keep looking at them and we always put them in the papers. In the plot here, on the x-axis we have the hyperparameter combinations. When you train any model, what all of us do is some sort of hyperparameter sweep, and here we are looking at the final score achieved with every single hyperparameter setting, sorted. In this kind of plot, the curves, that is the algorithms, that are at the top and that are the most flat are the better-performing and most stable algorithms. What we see here is that IMPALA is not only achieving better results, it is not achieving those results because of one lucky hyperparameter setting: it is consistently at the top. You can see that it's not completely flat, of course, because in the end we are searching over three orders of magnitude in the parameter settings, but we can see that the algorithm is actually quite stable.
Now, when we look at our main goal here: on the x-axis we have wall-clock time and on the y-axis we have the normalized score. The red line that you see there is A3C, and you can see that IMPALA not only achieves much better scores, it achieves them much, much faster. The other thing is comparing the green and the orange lines: that is the comparison between training IMPALA in an expert setting versus a multi-task setting, and we see that the multi-task version achieves better scores, and faster, which again suggests that we are seeing positive transfer. It's a like-for-like setting: all the details of the network and the agent are the same; in one case you train one network per task, and in the other case you train the same network on all the tasks, and you achieve a better result because of the positive transfer between those tasks. And if you give IMPALA more resources, you end up with this almost vertical takeoff that you see there: you can actually solve this challenging thirty-task domain in under 24 hours, given the resources. That is the kind of algorithmic power that we want, to be able to train these very highly scalable agents. Now, why do we want to do that? That is the point I want to come to next, and in the final part this is the new SPIRAL algorithm that I want to talk about.
Now, just quickly going back to the original ideas that I talked about: unsupervised learning is also about explaining environments and generating samples, but maybe about generating samples by explaining environments. We talked about the fact that with these deep learning models, like WaveNet, we can generate amazing samples, but maybe there is a different, less implicit way we can do these things, in the sense that when we generate these samples, they come with some explanation, and that explanation can go through using some tools. In this particular case, what we are going to do is use a painting tool, and we are going to learn to control this painting tool. It's a real drawing program, and we are going to generate a program that the painting tool will execute to generate the image. The main idea that I want to convey is that by learning how to use tools that are already available, we can start thinking about different kinds of generalization, which I'll try to demonstrate. In the real world we have a lot of examples of programs, their executions, and the results of those programs; they can be arithmetic programs, plotting programs, or even architectural blueprints. Because we have information about that generation process, when we see the results we can go and try to infer what the program was, what the blueprint was, that generated that particular input. We can do this, and the goal is to be able to do this with our agents too.
Specifically, we are going to use this environment called libmypaint. It is a professional-grade, open-source drawing library, and it's used worldwide by many artists. We are using a limited interface to it, basically learning to draw brushstrokes, and we are going to have an agent that does that. The agent, in the end called SPIRAL, has three main components. First of all there is the agent that generates the brushstrokes; I like to see that as writing the program. The second one is the environment, libmypaint: the brushstrokes come in, and the environment turns them into brushstrokes on the canvas. That canvas goes into a discriminator, and the discriminator is trained like in a GAN: it looks at the generated image, asks whether it looks like a real drawing, and gives a score. As opposed to the usual GAN training, rather than propagating the gradients back, we take that score and train our agent with that score as a reward. So when you think about these three components coming together, you have an unsupervised learning model similar to GANs, but rather than generating in pixel space, we generate in this program space, and the training is done through a reward that the agent itself also learns. We are trusting another neural net, just like in the GAN setup, to guide learning, but not through its gradients, just through the score function.
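(A highly simplified sketch of that loop, with illustrative names and stand-ins so that it runs; this is not the actual SPIRAL implementation. The key point is that the discriminator's output is used as a scalar reward for a policy-gradient update, not back-propagated through the renderer.)

    import numpy as np

    def paint_episode(policy, render_stroke, discriminator_score, n_strokes=20, size=64):
        canvas = np.zeros((size, size))
        actions = []
        for _ in range(n_strokes):
            action = policy(canvas)                     # agent picks the next brushstroke command
            canvas = render_stroke(canvas, action)      # the drawing tool executes it on the canvas
            actions.append(action)
        reward = discriminator_score(canvas)            # "does this look like a real drawing?"
        return actions, canvas, reward                  # the reward drives the agent's RL update

    # Stand-ins just so the sketch executes: a random policy, a toy renderer, a dummy scorer.
    rng = np.random.default_rng(0)
    policy = lambda canvas: rng.integers(0, canvas.shape[0], size=4)    # (x0, y0, x1, y1)
    def render_stroke(canvas, a):
        canvas = canvas.copy()
        canvas[a[0], a[1]] = 1.0
        canvas[a[2], a[3]] = 1.0
        return canvas
    discriminator_score = lambda canvas: float(-abs(canvas.mean() - 0.1))
    actions, canvas, reward = paint_episode(policy, render_stroke, discriminator_score)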
In my opinion, this makes it, in certain cases, very capable of using different kinds of tools. As I said, the reinforcement learning part of this agent is exactly IMPALA: now that we have an agent that can solve really challenging reinforcement learning setups, we take it and put it into this environment, augmented with the ability to learn a discriminator function that provides the reward. To emphasize again, the important thing here is: yes, we have an agent, but there is no environment that says, okay, this is the reward that the agent should get. The reward generation is also inside the agent, thanks to the unsupervised learning models being studied here; specifically, we use a GAN setup there.

So, can we generate? The first thing we try, of course, when doing unsupervised learning from scratch, is to go back to MNIST. You start from MNIST, and initially it generates various scribble-like things, but through training it becomes better and better. Here in the middle you see what the agent has learned; these are completely unconditional samples. It has learned to create strokes that generate these digits. To emphasize this: the agent has never seen strokes coming from real people, how we draw digits; it learned by experimenting with these strokes and built its own policy to create strokes that generate these images. You can of course also train the whole setup as a conditional generation process, to recreate a given image.
I think the main thing about this is that it's learning, in an unsupervised way, how to draw the strokes. I see it as the libmypaint environment giving us a grounded bottleneck for creating a meaningful representation space. The next thing we tried was Omniglot, and again you see the same things: it can generate unconditional, meaningful, Omniglot-looking samples, or it can recreate given Omniglot samples. But then, generalization: here what we tried was to train the model on Omniglot and then ask it to generate MNIST digits; this is what you see in the middle row there. Can it draw MNIST digits? It has never seen MNIST digits before, but we all know Omniglot is more general than MNIST, and it can do it: given an MNIST image, it can actually draw it, even though the network itself has never seen any MNIST digits during its training. Then we tried smileys, which are line drawings, and given a smiley it can also draw smileys, which is great. So can we do more? We took this cartoon drawing, chopped it up into 64-by-64 pieces, and it's a general line drawing; again this is the agent that was trained using Omniglot, and now you can see that it can actually recreate that drawing. Certain areas are a bit rough, like around the eyes, where the insides are really complicated, but in general you can see that it is capable of generating those drawings. So this gives you an idea of generalization: I can train on one domain and generalize to new ones.
So can I push it further? The next thing that we tried was this: the advantage of using a tool is that you have a meaningful representation space that we can hopefully transfer into a new environment. Here, what we do is take again the same agent that was trained using Omniglot and transfer it from that simulated environment into the real world. The way we do that is we took that same program, and our friends in the robotics group at DeepMind wrote a controller for a robotic arm to take that program and draw it. This whole experiment happened in under a week, really, and what we ended up with was the same agent, not fine-tuned to the setup or anything: the same agent generates its brushstroke programs, and that program goes into a controller and is realized by a real robotic arm. The reason we can do this is that the environment we used is a real environment; we didn't invent it ourselves. The latent space, if you will, is not some arbitrary latent space that we created for this model; it is a space defined by a tool that we as people built, so it is a meaningful tool space, and the reason we create those tools in the first place is to solve many different problems anyway. This is an example of that: using that tool space gives us the ability to actually transfer this capability.
So with that I want to conclude. I tried to give an explanation of how I think about generative models and unsupervised learning. Of course, I'm a hundred percent sure everyone agrees that our aim is not just to look at images; our aim is to do much more than that, and I tried to give two different aspects. One of them is that the kinds of generative models we can build right now can solve real-world problems, as we have seen with WaveNet. We can also think about a different kind of setup where we have agents actually training and generating interpretable programs. That is an important aspect; we have seen that conversation come up through several of the talks here: being able to generate interpretable programs is one of the bottlenecks that we face right now, because there are many critical applications that we want to address and many tools with interfaces that we could use, and this is one step towards that, at least the way I see it. Being able to do these things requires us to create these very capable reinforcement learning agents, which rely on new algorithms that we need to work on. With that, thank you very much, and I want to thank all my collaborators for their help on this. Thank you very much.

[Applause]
[Music] [Applause]

[Session chair] We have time for maybe one or two questions.

[Audience] Okay, so I have a question: how do you think about scaling to more general domains, beyond simple strokes, for example generating realistic scenes?

[Koray] Right, so one thing that I haven't shown here: yes, creating realistic scenes is one case. One thing that I haven't talked about, which is actually part of this work and is in the paper, is something the team did. By the way, I have to mention that this was worked on mostly by Yaroslav Ganin; he's actually a PhD student at MILA and he spent his summer with us doing his internship, and he did an amazing job, especially for doing it during an internship, so big congratulations to him. One thing that we did was to actually try to generate images: we took the CelebA dataset and used the same drawing program to draw those, and in that case our setup just scales towards those; the same setup actually scales, because it's a general drawing tool and you can control the color, and we can do that. It requires a little bit more work; it was one of the last experiments that we did, but it is sort of in the works.
[Audience] Thanks for a great talk. I had a question about the IMPALA results: you had a slide with a curve comparing training one agent on all the tasks together versus training separate experts, and the agent trained on all the tasks actually does better. I found that not that surprising, but it's great to see the positive transfer between tasks. Have you tried that on other suites of tasks? Do you think it's just because the tasks in this suite are very similar?

[Koray] It definitely depends on that, but the reason we created those tasks is exactly that. In the real world, the visual structure of our world is unified, so the kind of setup that we have in DeepMind Lab and those tasks is a unified visual environment: you have one kind of agent with a unified action space, and now you can focus on solving different kinds of tasks. That is the kind of thing we were testing: given all of that, is it possible to get the multi-task positive transfer that we see in supervised learning cases? And we were able to see that in reinforcement learning. Yeah.
[Audience] Hello, this is exciting. I have a question about extending this to maybe more open domains. What is the challenge? Is the challenge the number of actions to pick, because here the stroke space is maybe smaller? What are the challenges in extending this to open domains?

[Koray] What do you have in mind as open domains? The number of actions is definitely a challenge; it is definitely one of the big challenges, and as far as I know a lot of research in RL goes into that. But that is, I think, only one of the main challenges. The other challenge, of course, is the state representation; that is mainly why we use deep learning, because we expect that with deep learning we are going to be able to learn better representations. And that still remains a challenge, because being able to learn representations is not only an architectural problem; it is also about finding the right training setup, and SPIRAL was an example of that, where we can get that reward function, that reward signal, in an unsupervised way. In many different domains there are many different ways we can do this, but actually finding those solutions is also part of it.

[Session chair] Okay, so let's thank our speaker again.

[Music]
[Applause]