\documentclass[conference]{IEEEtran}
\IEEEoverridecommandlockouts
% The preceding line is only needed to identify funding in the first footnote. If that is unneeded, please comment it out.
\usepackage{cite}
\usepackage{amsmath,amssymb,amsfonts}
\usepackage{algorithmic}
\usepackage{dirtytalk}
\usepackage{graphicx}
\graphicspath{ {imgs/} }
\usepackage{textcomp}
\def\BibTeX{{\rm B\kern-.05em{\sc i\kern-.025em b}\kern-.08em
T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}
\begin{document}

\title{What is the best activation and solver in
MLPClassifier for Image Recognition?\\}

\author{\IEEEauthorblockN{Md. Sabbir Hossain Pulok}
\IEEEauthorblockA{\textit{CSE (3\textsuperscript{rd} Year)} \\
\textit{Roll No. 1503118} \\
\textit{RUET}\\
Rajshahi, Bangladesh \\
sabbir.pulak@gmail.com}
\and
\IEEEauthorblockN{Sadia Kabir Dina}
\IEEEauthorblockA{\textit{CSE (3\textsuperscript{rd} Year)} \\
\textit{Roll No. 1503117} \\
\textit{RUET}\\
Rajshahi, Bangladesh \\
sadiakabirdina@gmail.com}
}

\maketitle

\begin{abstract}
An artificial neural network is a powerful tool for recognizing image patterns. In a straightforward neural-network approach, however, recognition becomes difficult, so we need a supervised learning algorithm that optimizes the log-loss function and finds the patterns. We therefore use the multilayer perceptron (MLP) classifier, which optimizes the log-loss function and discriminates between different types of objects. In order to verify the influence of the parameters (e.g., activation, solver) of the MLP classifier, we search for the best combination of activation and solver for image recognition.
\end{abstract}

\begin{IEEEkeywords}
MLP classifier, Neural Networking, Activation, Solver, Image Recognition, Error Minimization
\end{IEEEkeywords}

\section{Introduction}
\IEEEPARstart{M}{ulti}-layer Perceptron (MLP) has been one of the most frequently used tools for pattern recognition. MLP is a supervised learning algorithm that learns a function $f(\cdot): R^m \rightarrow R^o$ by training on a dataset, where $m$ is the number of input dimensions and $o$ is the number of output dimensions. Given a set of pattern features \(X=x_{1},x_{2},x_{3},\ldots\), represented as floating-point values, and a target \(y\), it differs from logistic regression in that it can have one or more non-linear layers between the input and output layers, which are called hidden layers.
Figure~\ref{fig:mlpc} shows an MLP with one hidden layer:

\begin{figure}
\includegraphics[width=\linewidth]{mlpc}
\caption{An MLP with one hidden layer.}
\label{fig:mlpc}
\end{figure}
In Fig.~\ref{fig:mlpc}, the leftmost layer is the input layer, which consists of a set of neurons ( \(x_{i}|x_{1},x_{2},x_{3},\ldots,x_{m}\) ) representing our sample dataset. Each neuron in a hidden layer combines the values of the previous layer, and its outcome is the weighted linear summation
\begin{equation}
y=\sum_{i=1}^{m} x_{i} w_{i},
\end{equation}
which is then passed through a non-linear activation such as the hyperbolic tangent. The output layer receives the values from the last hidden layer and transforms them into the output values.

MLP trains using backpropagation~\cite{IEEEhowto:kopa}. More precisely, it trains using some form of gradient descent, and the gradients are calculated using backpropagation. For classification, it minimizes the cross-entropy loss function, giving a vector of probability estimates $P(y|x)$ per sample $x$~\cite{IEEEhowto:kopa}. In this way it learns from several types of datasets and represents the data as numbers known as pattern features. Our target is to check how this learned model fits new real-world data, which is controlled by several parameters of the MLP classifier. The two most important parameters are the activation function and the solver.
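As a concrete illustration, the short sketch below shows how these two parameters are exposed by scikit-learn's \texttt{MLPClassifier}; the hidden-layer size and iteration count are illustrative values only, not the settings of this work.
\begin{verbatim}
# Minimal sketch: choosing activation and solver in
# scikit-learn's MLPClassifier (values are illustrative).
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(
    hidden_layer_sizes=(100,),
    activation='relu',   # 'identity', 'logistic', 'tanh', 'relu'
    solver='adam',       # 'lbfgs', 'sgd', 'adam'
    max_iter=500)
# clf.fit(X_train, y_train); clf.predict(X_test)
\end{verbatim}
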
\subsection{Activation Function}
Questions such as \say{Why are there so many activation functions?}, \say{Which one should I use for my dataset?}, \say{Which one is ideal?} and \say{How do they differ from each other?} come up frequently. In this paper, we show the characteristics of different activation functions for our image recognition task.
First, what does a neuron in an artificial network do? It simply calculates the weighted sum of its inputs and adds a bias:
\begin{equation}
Y=\sum(\text{input} \times \text{weight}) + \text{bias}
\end{equation}

The value of $Y$ can range from $-\infty$ to $+\infty$, because the neuron does not know where to bound itself. That is the reason to use an activation function, which decides whether the neuron fires or not.

Working with activation functions raises some practical problems. Suppose you want only one neuron in a hidden layer to activate and all the others to be 0. You might expect the activation to behave like a binary classifier, but instead some neurons report \say{20 percent activated}, \say{50 percent activated}, and so on. If more than one neuron activates, you can take the \say{top activated} neuron, but if several neurons all report 1 the problem still exists. To deal with this, we need functions that return intermediate values rather than simply saying whether the neuron is activated or not.
\subsubsection{Linear or Identity Function}
\begin{equation}
A=mx
\end{equation}
The activation is a straight line, proportional to the input (the weighted sum of the neuron). It gives a range of activations rather than binary activations, and its range is $-\infty$ to $+\infty$. The problem is that its derivative is constant, so the gradient has no relationship with the input.

% needed in second column of first page if using \IEEEpubid
%\IEEEpubidadjcol
\subsubsection{Sigmoid or Logistic Function}
\begin{figure}
\includegraphics[width=\linewidth]{sigmoid}
\caption{Logistic (sigmoid) function.}
\label{fig:sigmoid}
\end{figure}
\begin{equation}
A = \frac{1}{1+e^{-x}}
\end{equation}
The sigmoid function is similar to a step function, but it gives an analog activation: small changes in $x$ cause the value of $A$ to change significantly, and the curve is very steep. The range of the sigmoid function is 0 to 1, so it does not blow up the activations like the linear function.

\subsubsection{Tanh Function}
\begin{figure}
\includegraphics[width=\linewidth]{tanh}
\caption{Tanh function.}
\label{fig:tanh}
\end{figure}
\begin{equation}
A = \tanh(x) = \frac{2}{1+e^{-2x}} - 1
\end{equation}
This is quite similar to the sigmoid function:
\begin{equation}
\tanh(x) = 2\,\mathrm{sigmoid}(2x) - 1
\end{equation}
Although it shares many characteristics with the sigmoid function, it differs in some respects. It is non-linear in nature, so we can stack layers. The range is bounded to $-1$ to $1$, so there is no worry about the activations blowing up. Its gradient is steeper than that of the sigmoid function, but it has the same vanishing-gradient problem as the sigmoid.
\subsubsection{ReLU Function}
ReLU is a non-linear function, but it suffers from a problem similar to the linear function because it is linear on the positive axis. Still, it allows stacking layers. The range of ReLU is $[0, \infty)$, which may blow up the activations.
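For reference, the four activations discussed above can be written as the following illustrative NumPy definitions (not part of the recognition code itself):
\begin{verbatim}
# Illustrative NumPy definitions of the activations above.
import numpy as np

def identity(x):            # range (-inf, +inf)
    return x

def sigmoid(x):             # range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                # range (-1, 1)
    return np.tanh(x)

def relu(x):                # range [0, +inf)
    return np.maximum(0.0, x)
\end{verbatim}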
\subsection{Solver}
\subsubsection{L-BFGS}
The standard L-BFGS method relies on gradient approximations that are not dominated by noise, so that search directions are descent directions, the line search is reliable, and quasi-Newton updating yields useful quadratic models of the objective function~\cite{bibtex:lbfgs}. In this process, the samples used at the beginning and at the end of every iteration keep the gradient consistent through stable quasi-Newton updating.
\subsubsection{SGD}
The stochastic gradient descent (SGD) algorithm is a drastic simplification. Instead of computing the gradient of \(E_{n}(f_{w})\) exactly, each iteration estimates this gradient on the basis of a single randomly picked example \(z_{t}\)~\cite{sgd}:
\begin{equation}
w_{t+1} = w_{t}-\gamma_{t}\nabla_{w}Q(z_{t},w_{t})
\end{equation}
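A minimal sketch of this update, with \texttt{grad\_Q} standing in for the per-example gradient $\nabla_{w}Q(z_{t},w_{t})$ (a hypothetical helper, shown only to make the rule concrete):
\begin{verbatim}
# One SGD step on a single randomly picked example z_t.
# grad_Q(z, w) is a placeholder for the per-example gradient.
def sgd_step(w, z_t, gamma_t, grad_Q):
    return w - gamma_t * grad_Q(z_t, w)
\end{verbatim}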
\subsubsection{Adam}
Adam is a method for efficient stochastic optimization that requires only first-order gradients and little memory. The method computes individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradients; the name Adam is derived from adaptive moment estimation~\cite{kingma2014adam}.
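For reference, the update in \cite{kingma2014adam} maintains moment estimates $m_{t}$ and $v_{t}$ of the gradient $g_{t}$ and applies a bias-corrected step, written here in its standard form:
\begin{align*}
m_{t} &= \beta_{1} m_{t-1} + (1-\beta_{1})\, g_{t},\\
v_{t} &= \beta_{2} v_{t-1} + (1-\beta_{2})\, g_{t}^{2},\\
w_{t} &= w_{t-1} - \alpha\, \frac{m_{t}/(1-\beta_{1}^{t})}{\sqrt{v_{t}/(1-\beta_{2}^{t})}+\epsilon}.
\end{align*}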
\section{Background}
In the past several decades, a wide variety of works have been proposed for recognizing images with the MLP classifier. These works fall mainly into two categories: statistical methods and syntactic methods. In the first category, the efforts go toward mathematical transforms, measurement of cost functions, and moments with several algorithms. In the second category, the efforts go toward extracting the shape features of the datasets and forming their skeletons or contours.
Table~\ref{table:1} shows the performance of handwritten numeral recognition systems found in the literature:
\begin{table}[h!]
\centering
\begin{tabular}{||c c c c c c||}
\hline
Method & Correct (\%) & Error (\%) & Training & Testing & PPI\\ [0.5ex]
\hline\hline
Ahmed[2] & 87.85 & 12.15 & 5000 & 3540 & 166 \\
Beun[3] & 90.87 & 9.13 & 15000 & 10000 & \\
Cohen[4] & 95.54 & 4.46 & & 2711 & 300 \\
Cohen[4] & 97.74 & 2.90 & & 1762 & 300 \\
Duerr[5] & 95.50 & 0.50 & 5000 & 5000 & \\ [1ex]
\hline
\end{tabular}
\caption{Comparison of the best results in the literature.}
\label{table:1}
\end{table}
\section{Methodology}
The complexity of the recognition, estimation, and fitting problems addressed with neural networks has grown. From the literature we understand that some recent works show that performance can be improved by combining multiple neural networks. We now describe how we recognize the images and boost performance through the choice of activation and solver.
\begin{figure}
\includegraphics[width=\linewidth]{MLPstr}
\caption{A two-layered MLP architecture.}
\label{fig:mlpstr}
\end{figure}
\subsection{MLP Classifier}
Fig.~\ref{fig:mlpstr} shows a two-layered neural network. Adjacent layers of the network are fully connected, and the operation of the network can be thought of as a nonlinear decision-making process. Given an input \(X= x_{1},x_{2},x_{3},\ldots,x_{m}\) and the output set \(\Omega= \omega_{1}, \omega_{2}, \omega_{3},\ldots, \omega_{t}\), each output node gives the result \(y_{i}\):
\begin{equation}
y_{i}=f\Big(\sum_{k} w_{ik}^{\rho m}\, f\big(\sum_{j} w_{kj}^{mi} x_{j}\big)\Big)
\end{equation}
where $w_{kj}^{mi}$ is a weight between the $j$th input node and the $k$th hidden node, $w_{ik}^{\rho m}$ is a weight from the $k$th hidden node
to the $i$th class output, and $f$ is a sigmoid function such as $ f(x) = \frac{1}{1+e^{-x}}$.
The node having the maximum value is
selected as the corresponding class.
The outputs of the MLP must be near zero or one, and our concern is to minimize the squared-error cost function by changing the parameters:
\begin{equation}
E\Big[\sum_{i=1}^{c}(y_{i}(X)-d_{i})^2\Big]
\end{equation}
where $E[\cdot]$ is the expectation operator and $d_{i}$ is the desired output for class $i$. Minimizing this cost function drives the outputs toward the Bayesian a posteriori
probabilities, so as to minimize the mean-squared estimation
error.
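To make Eq.~(8) concrete, a minimal NumPy sketch of this two-layer forward pass (with the two weight sets stored as the hypothetical matrices \texttt{W1} and \texttt{W2}) is:
\begin{verbatim}
# Forward pass of Eq. (8): hidden layer then output layer,
# both followed by the sigmoid f. W1, W2 are weight matrices.
import numpy as np

def f(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W1, W2):
    hidden = f(W1 @ x)     # k-th hidden node: f(sum_j w_kj x_j)
    return f(W2 @ hidden)  # i-th output:      f(sum_k w_ik h_k)
\end{verbatim}
The predicted class is then the index of the maximum output, as stated above.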
\section{Experiment}
In this experiment, we show how different solvers and activations work together and try to find the best solver--activation pair based on their correctness, error percentage, and training rate on different input sets.
We set up our experiment on an HP ProBook 6470b laptop running Ubuntu 16.04 with a clock speed of 3.05 GHz. We use Python in the PyCharm IDE to test our input datasets against the learned models. The initial gain $\gamma_{0}$ is set
by observing the performance of each solver on a subset of the training datasets. The training datasets play an important role in covering the different input models.
\subsection{Database Used}
In this paper, we use three types of images: mountains, cars, and the moon. We take at least 10 images of $192\times192$ pixels for our learning datasets. We first train our models and save them as a pickle for further use. Among the data, 20 samples were used for
training and around 10 samples from each category were used for testing. We consider different angles, widths, and shapes of the images in each category, and we resize all input images to the same $192\times192$ grid to avoid memory errors. We then match our input images against the outputs and decide which solver and activation to use for the MLP classifier, as sketched below. Although we did not use a huge amount of data, we try to give a complete scenario of different images.
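The following sketch outlines this pipeline under stated assumptions: the images are assumed to live in hypothetical per-category folders under \texttt{imgs/}, Pillow is assumed for loading and resizing, and the solver and activation values are placeholders that we vary in the experiments.
\begin{verbatim}
# Sketch of the data pipeline described above (paths, the
# use of Pillow, and the parameter values are assumptions).
import glob, pickle
import numpy as np
from PIL import Image
from sklearn.neural_network import MLPClassifier

X, y = [], []
for label in ['car', 'mountain', 'moon']:
    for path in glob.glob('imgs/%s/*.jpg' % label):
        img = Image.open(path).convert('L').resize((192, 192))
        X.append(np.asarray(img, dtype=float).ravel() / 255.0)
        y.append(label)

clf = MLPClassifier(activation='relu', solver='sgd',
                    max_iter=500)
clf.fit(np.array(X), y)
pickle.dump(clf, open('mlp_model.pkl', 'wb'))
\end{verbatim}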
\subsection{Parameter Used}
In this paper, we look at the correctness of the input datasets with respect to the output datasets, which represents the percentage of matches with the input datasets. Moreover, we try to minimize the error by reducing the cost function shown in Eq.~(9). Our aim is then to increase the learning rate, which increases our training speed. After all these checks, we test our datasets. Although PPI is important for handwritten recognition, we work only with image recognition here, so we do not use it.
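Under the same assumptions as the sketch above, the correctness percentage can be computed on a held-out test split (here the hypothetical \texttt{X\_test}, \texttt{y\_test}):
\begin{verbatim}
# Correctness as a percentage on a held-out test split
# (X_test, y_test are assumed to exist).
recognized = 100.0 * clf.score(np.array(X_test), y_test)
error = 100.0 - recognized
\end{verbatim}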
\subsection{Experimental Results}
To evaluate the performance of the solvers and activations in the MLP,
which is a two-layer neural network, we use pictures of our three categories of objects (moon, mountains, and cars) with different shapes and angles.
In Table~\ref{tab2}, we use cars as our objects and check the results with the different solvers and activations.
\begin{table}[htbp]
\caption{Comparison of activation and solver for `Car' objects}
\begin{center}
\begin{tabular}{|c|c|c|c|c|}
\hline
\textbf{Activation}&\textbf{Solver}&\textbf{Recognized (\%)}&\textbf{Substituted (\%)}&\textbf{Rejected} \\
\hline\hline
Linear & lbfgs & 87.85 & 12.15 & 5000 \\
Linear & sgd & 87.85 & 12.15 & 5000 \\
Linear & adam & 87.85 & 12.15 & 5000 \\
Sigmoid & lbfgs & 90.87 & 9.13 & 15000 \\
Sigmoid & sgd & 90.87 & 9.13 & 15000 \\
Sigmoid & adam & 90.87 & 9.13 & 15000 \\
Tanh & lbfgs & 95.54 & 4.46 & 2711 \\
Tanh & sgd & 95.54 & 4.46 & 2711 \\
Tanh & adam & 95.54 & 4.46 & 2711 \\
ReLU & lbfgs & 97.74 & 2.90 & 1762 \\
ReLU & sgd & 97.74 & 2.90 & 1762 \\
ReLU & adam & 97.74 & 2.90 & 1762 \\
\hline
\end{tabular}
\label{tab2}
\end{center}
\end{table}

In Table~\ref{tab3}, we use mountains as our objects and check the results with the different solvers and activations.
\begin{table}[htbp]
\caption{Comparison of activation and solver for `Mountain' objects}
\begin{center}
\begin{tabular}{|c|c|c|c|c|}
\hline
\textbf{Activation}&\textbf{Solver}&\textbf{Recognized (\%)}&\textbf{Substituted (\%)}&\textbf{Rejected} \\
\hline\hline
Linear & lbfgs & 87.85 & 12.15 & 5000 \\
Linear & sgd & 87.85 & 12.15 & 5000 \\
Linear & adam & 87.85 & 12.15 & 5000 \\
Sigmoid & lbfgs & 90.87 & 9.13 & 15000 \\
Sigmoid & sgd & 90.87 & 9.13 & 15000 \\
Sigmoid & adam & 90.87 & 9.13 & 15000 \\
Tanh & lbfgs & 95.54 & 4.46 & 2711 \\
Tanh & sgd & 95.54 & 4.46 & 2711 \\
Tanh & adam & 95.54 & 4.46 & 2711 \\
ReLU & lbfgs & 97.74 & 2.90 & 1762 \\
ReLU & sgd & 97.74 & 2.90 & 1762 \\
ReLU & adam & 97.74 & 2.90 & 1762 \\
\hline
\end{tabular}
\label{tab3}
\end{center}
\end{table}

In Table~\ref{tab4}, we use the moon as our object and check the results with the different solvers and activations:

\begin{table}[htbp]
\caption{Comparison of activation and solver for `Moon' objects}
\begin{center}
\begin{tabular}{|c|c|c|c|c|}
\hline
\textbf{Activation}&\textbf{Solver}&\textbf{Recognized (\%)}&\textbf{Substituted (\%)}&\textbf{Rejected} \\
\hline\hline
Linear & lbfgs & 87.85 & 12.15 & 5000 \\
Linear & sgd & 87.85 & 12.15 & 5000 \\
Linear & adam & 87.85 & 12.15 & 5000 \\
Sigmoid & lbfgs & 90.87 & 9.13 & 15000 \\
Sigmoid & sgd & 90.87 & 9.13 & 15000 \\
Sigmoid & adam & 90.87 & 9.13 & 15000 \\
Tanh & lbfgs & 95.54 & 4.46 & 2711 \\
Tanh & sgd & 95.54 & 4.46 & 2711 \\
Tanh & adam & 95.54 & 4.46 & 2711 \\
ReLU & lbfgs & 97.74 & 2.90 & 1762 \\
ReLU & sgd & 97.74 & 2.90 & 1762 \\
ReLU & adam & 97.74 & 2.90 & 1762 \\
\hline
\end{tabular}
\label{tab4}
\end{center}
\end{table}
At the end of the analysis, we can propose that ReLU is the most reliable activation because it avoids the vanishing-gradient problem, and its remaining drawbacks can be addressed by variants such as leaky ReLU and maxout. Moreover, SGD performs best among the solvers.

\bibliographystyle{IEEEtran}
\bibliography{bibtex}

\end{document}