```python
import torch
import torch.nn as nn

from utils import idx2onehot


class AE(nn.Module):

    def __init__(self, encoder_layer_sizes, latent_size, decoder_layer_sizes,
                 conditional=False, num_labels=0):

        super().__init__()

        if conditional:
            assert num_labels > 0

        assert type(encoder_layer_sizes) == list
        assert type(latent_size) == int
        assert type(decoder_layer_sizes) == list

        self.latent_size = latent_size

        self.encoder = Encoder(
            encoder_layer_sizes, latent_size, conditional, num_labels)
        self.decoder = Decoder(
            decoder_layer_sizes, latent_size, conditional, num_labels)

    def forward(self, x, c=None):

        # Flatten image batches (e.g. [B, 1, 28, 28]) to [B, 784].
        if x.dim() > 2:
            x = x.view(-1, 28 * 28)

        z = self.encoder(x, c)
        recon_x = self.decoder(z, c)

        return recon_x, z

    def inference(self, device, n=1, c=None):

        # Decode n latent codes drawn from a standard normal prior.
        batch_size = n
        z = torch.randn([batch_size, self.latent_size]).to(device)

        recon_x = self.decoder(z, c)

        return recon_x


class Encoder(nn.Module):

    def __init__(self, layer_sizes, latent_size, conditional, num_labels):

        super().__init__()

        self.conditional = conditional
        if self.conditional:
            layer_sizes[0] += num_labels

        self.MLP = nn.Sequential()

        for i, (in_size, out_size) in enumerate(zip(layer_sizes[:-1],
                                                    layer_sizes[1:])):
            print(i, ": ", in_size, out_size)
            self.MLP.add_module(name="L{:d}".format(i),
                                module=nn.Linear(in_size, out_size))
            # i runs from 0 to len(layer_sizes) - 2, so this condition is always
            # true: every encoder Linear layer is followed by a ReLU.
            if i != len(layer_sizes):
                print("ReLU added @ Encoder")
                self.MLP.add_module(name="A{:d}".format(i),
                                    module=nn.ReLU())
                # self.MLP.add_module(name="BN{:d}".format(i),
                #                     module=nn.BatchNorm1d(out_size))

        # Final linear projection to the latent code (no non-linearity).
        self.linear = nn.Linear(layer_sizes[-1], latent_size)

    def forward(self, x, c=None):

        if self.conditional:
            c = idx2onehot(c, n=10)
            x = torch.cat((x, c), dim=-1)

        x = self.MLP(x)
        z = self.linear(x)

        return z


class Decoder(nn.Module):

    def __init__(self, layer_sizes, latent_size, conditional, num_labels):

        super().__init__()

        self.MLP = nn.Sequential()

        self.conditional = conditional
        if self.conditional:
            input_size = latent_size + num_labels
        else:
            input_size = latent_size

        for i, (in_size, out_size) in enumerate(
                zip([input_size] + layer_sizes[:-1], layer_sizes)):
            print(i, ": ", in_size, out_size)
            self.MLP.add_module(
                name="L{:d}".format(i), module=nn.Linear(in_size, out_size))
            if i + 1 < len(layer_sizes):
                # The first Linear (latent -> first hidden size) is not followed
                # by a ReLU; the remaining hidden Linear layers are.
                if i != 0:
                    print("ReLU added @ Decoder")
                    self.MLP.add_module(name="A{:d}".format(i), module=nn.ReLU())
                    # self.MLP.add_module(name="BN{:d}".format(i),
                    #                     module=nn.BatchNorm1d(out_size))
            else:
                # Last layer: map back to pixel space in [0, 1].
                print("Sig step")
                self.MLP.add_module(name="sigmoid", module=nn.Sigmoid())

    def forward(self, z, c):

        if self.conditional:
            c = idx2onehot(c, n=10)
            z = torch.cat((z, c), dim=-1)

        x = self.MLP(z)

        return x
```

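The paste does not include a training or sampling script; the snippet below is only a minimal usage sketch of the classes above. The layer sizes, batch shape, and latent size are assumptions, chosen to match the MNIST-style `[784, 256]` setup used in the experiments.

```python
import torch

# Unconditional AE: 784 -> 256 (+ReLU) -> latent -> 256 -> 784 (sigmoid output).
model = AE(encoder_layer_sizes=[784, 256],
           latent_size=16,
           decoder_layer_sizes=[256, 784])

x = torch.rand(8, 1, 28, 28)                 # fake batch of MNIST-shaped images
recon_x, z = model(x)                        # recon_x: [8, 784], z: [8, 16]

samples = model.inference(torch.device("cpu"), n=4)   # decode 4 random latent codes
```
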
## Goal of the Project
The goal of the project is a way to determine the `optimal number of latent dimensions`.

First, the project introduces linearity and non-linearity and postulates the assumption that a linear map corresponds to `one` dimension, and that this dimension can be split into `two` non-overlapping dimensions by a single ReLU-based non-linearity.

Therefore, this project argues that the optimal number of latent dimensions primarily does `not depend on the data distribution itself`, but on `the network structure`; more specifically, on the `total number of dimensions that the model is able to express`. The paper calls this total number of dimensions the **model dimension**.

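Under this assumption the model dimension is simple arithmetic over the hidden-layer widths. A small illustration follows; the helper name is hypothetical, and `layer_sizes` follows the `encoder_layer_sizes` convention used in the code above.

```python
def model_dimension(layer_sizes):
    # Each hidden ReLU unit splits one linear dimension into two non-overlapping
    # pieces, so the model dimension is twice the number of hidden ReLU units.
    # e.g. model_dimension([784, 256]) == 512, model_dimension([784, 32, 32]) == 128
    return 2 * sum(layer_sizes[1:])
```
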
Once the model dimension is set, one can train the network and check whether it is possible to over-fit it on the given data. If the data points are over-fit at some point during training, the network can be considered "expressive enough for the data distribution". If it does not over-fit, one can enlarge the **model dimension** and repeat the over-fitting check.

## To-do
Define "over-fit". The classification threshold for over-fitting depends on the experiment.
- At which epoch of the training process should over-fitting be determined?

## Caution
It is better to use the whole dataset when determining the "model dimension", since the question is how much non-linearity the collected or targeted data domain requires.

## Convergence Determination Metric
When the EpochAVGLoss does not change by more than 1% over 5 consecutive epochs (checked from the first epoch onward), we consider the training loss to have converged.

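A minimal sketch of this rule as a helper function; the function name and the exact sliding-window reading of "for 5 epochs" are assumptions, not something the notes specify.

```python
def has_converged(epoch_avg_losses, patience=5, rel_tol=0.01):
    # epoch_avg_losses: EpochAVGLoss values recorded from the first epoch onward.
    # Converged when none of the last `patience` epochs deviates by more than
    # rel_tol (1%) from the epoch just before that window.
    if len(epoch_avg_losses) < patience + 1:
        return False
    window = epoch_avg_losses[-(patience + 1):]
    reference = window[0]
    return all(abs(loss - reference) <= rel_tol * abs(reference)
               for loss in window[1:])
```
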
## Experiment Workflow

#### Exp_1: 1 ReLU applied to 256 dimensions (then a linear transformation to LatentDim)

By the assumption, the **model dimension** is 512 (256 * 2). Thus, we verify the assumption by

1) checking for a sequential decrease of the Loss at a fixed training epoch while sequentially increasing the LatentDim.

with `1 * (MLP + ReLU) + LatentDim 1`

Epoch 09/10 Batch 0937/937, Loss 165.5437

with `1 * (MLP + ReLU) + LatentDim 2`

Epoch 09/10 Batch 0937/937, Loss 150.2990

with `1 * (MLP + ReLU) + LatentDim 3`

Epoch 09/10 Batch 0937/937, Loss 133.2206

with `1 * (MLP + ReLU) + LatentDim 4`

Epoch 09/10 Batch 0937/937, Loss 138.1151

with `1 * (MLP + ReLU) + LatentDim 8`

Epoch 09/10 Batch 0937/937, Loss 110.9839

with `1 * (MLP + ReLU) + LatentDim 16`

Epoch 09/10 Batch 0937/937, Loss 89.6707

with `1 * (MLP + ReLU) + LatentDim 32`

Epoch 09/10 Batch 0937/937, Loss 72.5663

with `1 * (MLP + ReLU) + LatentDim 64`

Epoch 09/10 Batch 0937/937, Loss 54.2545

> ... since the model converges at LatentDim 64 with Loss 52, we shrink the ReLU_InputDim down to 32 (go to Exp_3).

with `1 * (MLP + ReLU) + LatentDim 128`

Epoch 09/10 Batch 0937/937, Loss 54.3565

with `1 * (MLP + ReLU) + LatentDim 256`

Epoch 09/10 Batch 0937/937, Loss 52.3050

> ... the loss must keep decreasing; write code to automate this job (see the sketch at the end of this experiment).

with `1 * (MLP + ReLU) + LatentDim 512`

Epoch 09/10 Batch 0937/937, Loss 53.2412

> ... Check whether, for any LatentDim > 512, the Loss at the fixed training epoch no longer decreases.

with `1 * (MLP + ReLU) + LatentDim 1024`

Epoch 09/10 Batch 0937/937, Loss 54.3255

> As you can see, even with the LatentDim `doubled`, the LossAtFixedStep does not decrease, which means the model dimension is already saturated.
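
The sweep above can be automated as the note asks. The sketch below is one way to do it under assumptions the paste does not confirm: the optimizer, learning rate, batch size, and per-sample summed BCE reconstruction loss are all illustrative choices, not the author's script.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def loss_at_fixed_epoch(latent_dim, epochs=10, device="cpu"):
    # Train a fresh 1 * (MLP + ReLU) AE for a fixed number of epochs and
    # return the epoch-average reconstruction loss of the last epoch.
    loader = DataLoader(
        datasets.MNIST("data", train=True, download=True,
                       transform=transforms.ToTensor()),
        batch_size=64, shuffle=True)
    model = AE([784, 256], latent_dim, [256, 784]).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    bce = nn.BCELoss(reduction="sum")
    total, batches = 0.0, 0
    for _ in range(epochs):
        total, batches = 0.0, 0
        for x, _ in loader:
            x = x.to(device).view(-1, 784)
            recon_x, _ = model(x)
            loss = bce(recon_x, x) / x.size(0)   # summed BCE per sample
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total, batches = total + loss.item(), batches + 1
    return total / batches

# Double the LatentDim until the loss at the fixed epoch stops decreasing.
for ld in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]:
    print("LatentDim", ld, "->", loss_at_fixed_epoch(ld))
```
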
#### Exp_2: Introduce twice the model dimension via a second ReLU

with `2 * (MLP + ReLU) + LatentDim 1024`

> Epoch 09/10 Batch 0937/937, Loss 57.9039

(Without bias, the sequential ReLUs do not work.)

#### Exp_3: Shrink the ReLU InputDim down to 32 while keeping latentDim 64

### Summary of Algorithm

    If convergeLoss != 0:
        if modelDim > latentDim:
            enlarge latentDim
        if modelDim <= latentDim:
            increase #ReLUs

* modelDim = 2 * num_ReLUs

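A hedged Python restatement of this rule; the helper name, the near-zero tolerance, and the choice to grow capacity by doubling are illustrative assumptions, not something the notes prescribe.

```python
def next_config(layer_sizes, latent_dim, converge_loss, tol=1e-3):
    # layer_sizes: encoder layer sizes, e.g. [784, 32]; hidden widths carry the ReLUs.
    # modelDim = 2 * num_ReLUs (the input layer contributes no ReLU units).
    model_dim = 2 * sum(layer_sizes[1:])
    if converge_loss <= tol:                      # convergeLoss ~ 0: capacity is enough
        return layer_sizes, latent_dim
    if model_dim > latent_dim:                    # enlarge latentDim
        return layer_sizes, latent_dim * 2
    # modelDim <= latentDim: increase #ReLUs by widening the hidden layer
    return layer_sizes[:-1] + [layer_sizes[-1] * 2], latent_dim
```

For example, with layerSize [784, 32] and latentDim 64, modelDim is 64 <= latentDim, so the rule widens the hidden layer to [784, 64]; in the log below, that widening is the step that first brought the convergeLoss from 80 down to 65.
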
To verify this:

@ exp latentDim 64, convergeLoss 80, layerSize [784, 32]:
if one increases the latentDim, the convergeLoss should not drop below 80.

Let's check:
@ exp latentDim 128, convergeLoss 80, layerSize [784, 32], convergeLoss 80

Now stack a second ReLU layer, [784, 32, 32], which presumably represents 128 dimensions:
@ exp latentDim 128, convergeLoss 80, layerSize [784, 32, 32], convergeLoss 80 (still the same)

As you can see, without enlarging the foremost hidden dimension, the deeper ReLU does not help. This is consistent with Raghu (2017).

Now make it wider, [784, 64]:
@ exp_1555829642 latentDim 128, convergeLoss 80, layerSize [784, 64], the convergeLoss 65 < 80

Make it wider still, [784, 128]:
@ exp_1555829642 latentDim 128, convergeLoss 55, layerSize [784, 128], the convergeLoss 55 < 80

And wider again, [784, 256]:
@ exp_1555832143 latentDim 128, convergeLoss 55, layerSize [784, 256], the convergeLoss 55 = 55

Maybe the problem is latentDim; make sure the latentDim is sufficient:
@ exp_1555832638 latentDim 256, convergeLoss 55, layerSize [784, 256], the convergeLoss 55 = 55

===> Question! How can the latentDim be determined with less effort, without going through this cumbersome experimental procedure?

Again checking whether the latentDim is sufficient:
@ exp_1555832638 latentDim 128, convergeLoss 65, layerSize [784, 256, 256], the convergeLoss 65 > 55
@ exp_1555832638 latentDim 256, convergeLoss 65, layerSize [784, 256, 256], the convergeLoss 68 > 55
@ exp_1555832638 latentDim 64, convergeLoss 65, layerSize [784, 256, 256], the convergeLoss 68 > 55
@ exp_1555832638 latentDim 128, convergeLoss 60, layerSize [784, 256, 128], the convergeLoss 60 > 55
@ exp_1555834546 latentDim 64, convergeLoss 65, layerSize [784, 256, 256], the convergeLoss 55 = 55

=====> decreasing the latentDim makes the model learn better (Q1)

@ exp_1555834546 latentDim 32, convergeLoss 65, layerSize [784, 256, 256], the convergeLoss 60 > 55

The configurations that currently reach 55 are:
- [784, 128], ld 128
- [784, 128], ld 256
- [784, 256, 256], ld 64

@ 1555843696, ld64 [784, 128, 128], convergeLoss 60 > 55
@ 1555844254, ld128 [784, 128, 128], convergeLoss 64 > 55
@ 1555844254, ld32 [784, 128, 128], convergeLoss 66 > 55

Not sure why, but when the network is deeper, too large a latent space decreases the learning efficiency (Q1).

@ exp_1555832638 latentDim 32, convergeLoss 65, layerSize [784, 256, 256], the convergeLoss 55 = 55

Maybe, if the modelDim is too big and the latentDim is too small, as seen in exp [784, 32, 32], training might not work. Thus, we raise the latentDim in that same setting from 128 to 256:
@ exp_1555830495 convergeLoss 80 (still the same)