Proximal Methods

\(
\newcommand{\opt}[1]{{#1}^{*}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\dnorm}[1]{{#1}^{*}}
\renewcommand{\Pr}{\mathbb{P}}
\renewcommand{\vec}[1]{\mathbf{#1}}
\DeclareMathOperator{\E}{\mathbb{E}}
\DeclareMathOperator*{\argmin}{\mathbf{arg\,min}}
\DeclareMathOperator*{\argmax}{\mathbf{arg\,max}}
\)
- <div id="outline-container-sec-1" class="outline-2">
- <h2 id="sec-1"><span class="section-number-2">1</span> Proximity Operator and Moreau Decomposition</h2>
- <div class="outline-text-2" id="text-1">
- <p>
- Proximity operator for a convex function \(f\) is defined as
- </p>
- \begin{align*}
- & Prox_f(x) = \argmin_z f(z) + \frac{1}{2} \|x-z\|^2 \\
- & Prox_{\lambda f}(x) = \argmin_z f(z) + \frac{1}{2\lambda} \|x-z\|^2 = \argmin_z \lambda f(z) + \frac{1}{2} \|x-z\|^2
- \end{align*}
- <p>
- Here the norm is the euclidean norm.
- Note that the minimizer is unique as the objective is strongly convex. Hence the operator is well-defined.
- Intuitively, we want to remain close to the previous iterate (here \(x\)), and also respect the regularization (here \(f(z)\)). The proximal point \(Prox_f(x)\) is a point that compromises between minimizing \(f\) and staying close to \(x\), λ being the relative trade-off factor (larger the λ closer we can move to \(f\)'s minima).
- An appealing property of proximal operators is that the minima is a fixpoint of the proximal operator. More formally, \(\opt{x}\) is the minima of \(f\) iff,
- </p>
- \begin{align*}
- Prox_f(\opt{x}) = \opt{x} \tag{try proving both directions}
- \end{align*}
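
To make the fixed-point property concrete, here is a minimal numpy sketch (the quadratic test function and its parameters are made up for illustration): for \(f(z) = \frac{1}{2}(z-a)^T A (z-a)\) the proximal point has the closed form \((A+I)^{-1}(Aa + x)\), and evaluating it at the minimizer \(a\) returns \(a\) itself.

    import numpy as np

    # Hypothetical smooth convex test function: f(z) = 0.5 * (z - a)^T A (z - a).
    A = np.array([[2.0, 0.5], [0.5, 1.0]])   # positive definite
    a = np.array([1.0, -2.0])                # the minimizer of f

    def prox_f(x):
        # argmin_z f(z) + 0.5*||x - z||^2 has the closed form (A + I)^{-1} (A a + x).
        return np.linalg.solve(A + np.eye(2), A @ a + x)

    print(prox_f(np.array([3.0, 0.0])))  # pulled toward a, but stays close to x
    print(prox_f(a))                     # the minimizer is a fixed point: returns a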

2 Moreau Decomposition and Projection Operator

If \(f\) is a norm, then its Fenchel conjugate \(\dnorm{f}\) is \(I_B\), the indicator function of the unit ball \(B\) of the dual norm. Recall that the dual norm of \(f\) is defined as

\begin{align*}
\|x\|_{\dnorm{f}} = \sup_z \; x^Tz \quad \text{ s.t. } \|z\|_{f} \le 1
\end{align*}

The Moreau decomposition states

\begin{align*}
& Prox_f(x) + Prox_{\dnorm{f}}(x) = x \\
\text{More generally, } \quad & Prox_{\lambda f}(x) + \lambda Prox_{\frac{\dnorm{f}}{\lambda}}\left(\frac{x}{\lambda}\right) = x \\
\text{In particular, for a norm } f, \quad & Prox_f(x) = x - \Pi_B(x)
\end{align*}

This identity lets us write out the proximal operators for common norms effortlessly.
For example, the dual of the \(\ell_2\) norm is the \(\ell_2\) norm itself. The projection operator becomes

\begin{align*}
& \Pi_B(x) =
\begin{cases}
x/\|x\|_2 & \text{if } \|x\|_2 > 1 \\
x & \text{if } \|x\|_2 \le 1
\end{cases} \\
\text{thus, the proximal is } & Prox_f(x) = x - \Pi_B(x) = \begin{cases}
x(1-1/\|x\|_2) & \text{if } \|x\|_2 > 1 \\
0 & \text{if } \|x\|_2 \le 1 \tag{Block Soft Thresholding}
\end{cases}
\end{align*}
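
As a sanity check, a minimal numpy sketch of this \(\ell_2\) case (with \(\lambda = 1\)), computing the proximal operator through the projection as above:

    import numpy as np

    def project_l2_ball(x):
        # Projection onto the unit l2 ball B = {z : ||z||_2 <= 1}.
        nrm = np.linalg.norm(x)
        return x if nrm <= 1 else x / nrm

    def prox_l2_norm(x):
        # Moreau decomposition with f = ||.||_2: Prox_f(x) = x - Pi_B(x),
        # since the conjugate of the l2 norm is the indicator of its own unit ball.
        return x - project_l2_ball(x)

    x = np.array([3.0, 4.0])                    # ||x||_2 = 5 > 1
    print(prox_l2_norm(x))                      # x * (1 - 1/||x||_2) = [2.4, 3.2]
    print(prox_l2_norm(np.array([0.1, 0.2])))   # inside the ball -> 0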

Similarly, we can find the proximal operators for the \(\ell_1\) norm and its dual, the \(\ell_\infty\) norm.
Remember that the unit ball of the \(\ell_\infty\) norm is a box, so projection simply involves component-wise clipping

\begin{align*}
& [\Pi_B(x)]_i =
\begin{cases}
x_i & \text{if } |x_i| \le 1 \\
1 & \text{if } x_i > 1 \\
-1 & \text{if } x_i < -1
\end{cases} \\
\text{thus, the proximal is } & [Prox_f(x)]_i = x_i - [\Pi_B(x)]_i = \begin{cases}
0 & \text{if } |x_i| \le 1 \\
x_i-1 & \text{if } x_i > 1 \\
x_i+1 & \text{if } x_i < -1 \tag{Soft Thresholding}
\end{cases}
\end{align*}
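
Again a minimal numpy sketch (here \(f = \|\cdot\|_1\) and \(\lambda = 1\)), recovering soft thresholding from the box projection:

    import numpy as np

    def project_linf_box(x):
        # Component-wise projection onto the unit l-infinity ball (a box).
        return np.clip(x, -1.0, 1.0)

    def prox_l1_norm(x):
        # Moreau decomposition with f = ||.||_1: Prox_f(x) = x - Pi_B(x),
        # where B is the unit ball of the dual (l-infinity) norm.
        return x - project_linf_box(x)

    x = np.array([2.5, -0.3, -1.7])
    print(prox_l1_norm(x))                                 # [1.5, 0.0, -0.7]
    print(np.sign(x) * np.maximum(np.abs(x) - 1.0, 0.0))   # same: soft thresholding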

3 Proximal Operator as a Generic Case

The proximal operator can be seen as a generalization of several common algorithms.
- <div id="outline-container-sec-3-1" class="outline-3">
- <h3 id="sec-3-1"><span class="section-number-3">3.1</span> Gradient Descent</h3>
- <div class="outline-text-3" id="text-3-1">
- <p>
- If we take the first order approximation of f(z) around x in,
- </p>
- \begin{align*}
- & Prox_f(x) = \argmin_z f(z) + \frac{1}{2} \|x-z\|^2 \\
- \text{we get } & = \argmin_z f(x) + \grad f(x)^T(z-x) + \frac{1}{2} \|x-z\|^2 \\
- & = \grad f(x)^T + (z-x)=0 \\
- & \implies z = x - \grad f(x) \\
- & \implies Prox_f(x) = x - \grad f(x) \\
- & \text{More generally }\implies Prox_{\lambda f(x)} = x - \lambda\grad f(x)
- \end{align*}
- <p>
- This observation also suggests thinking about gradient descent in another way.
- If we attach a regularization term in the above approximation, this becomes ISTA (notice the resemblance to proximal gradient descent where we use the second-order approximation).
- </p>

3.2 Levenberg-Marquardt

Likewise, using the second-order approximation, we get the Levenberg-Marquardt update,

\begin{align*}
& \argmin_z f(x) + \nabla f(x)^T(z-x) + \frac{1}{2}(z-x)^T\nabla^2 f(x)(z-x) + \frac{1}{2} \|x-z\|^2 \\
& \implies z = x - (\nabla^2 f(x) + I )^{-1}\nabla f(x)
\end{align*}
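
A minimal numpy sketch of this damped Newton step on a hypothetical quadratic test function (the matrix \(A\) and vector \(b\) are made up for illustration; the damping term is the identity, matching the derivation above):

    import numpy as np

    # Hypothetical smooth test function: f(z) = 0.5 * z^T A z - b^T z.
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, -1.0])

    def grad(x):
        return A @ x - b

    def hess(x):
        return A

    def lm_step(x):
        # z = x - (Hess f(x) + I)^{-1} grad f(x), the update derived above.
        return x - np.linalg.solve(hess(x) + np.eye(len(x)), grad(x))

    x = np.zeros(2)
    for _ in range(20):
        x = lm_step(x)
    print(x, np.linalg.solve(A, b))   # converges to the minimizer A^{-1} b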

4 Proximal Gradient Methods

This is a method suitable for large-scale optimization. We will show that regularized convex objectives can be cast as problems involving evaluation of the proximity operator.

Our objective will in general look like

\begin{align*}
\argmin_\theta R(\theta) + L(\theta)
\end{align*}

Here \(R\) is a (possibly non-differentiable) convex regularizer,
and \(L\) is a differentiable convex loss function.

We will see that as long as we can evaluate the proximal operator defined
by \(R\), we can solve the problem efficiently.

The idea is to approximate the \(L\) term with its Taylor approximation
at the current iterate,

\begin{align*}
\theta_{k+1} = \argmin_\theta R(\theta) + L(\theta_k) + \nabla L(\theta_k)^{T}(\theta - \theta_k) + \frac{1}{2B} \| \theta - \theta_k \|^2
\end{align*}

Here \(B\) is a step size playing the role of an approximate inverse Hessian (we discuss the choice of \(B\) later).
Now, dropping terms that do not depend on \(\theta\) and completing the square using the gradient term
and the quadratic term, we get

\begin{align*}
\argmin_\theta R(\theta) + \frac{1}{2B} \| \theta - (\theta_k - B\nabla L(\theta_k))\|^2
\end{align*}

which is exactly a proximal-operator evaluation. In particular, it is equivalent to

\begin{align*}
\theta_{k+1} = Prox_{B R}( \theta_k - B\nabla L(\theta_k) )
\end{align*}

This can be interpreted as follows: the next iterate should stay close to the gradient step, while at the same time respecting the regularizer \(R\).

Note that if \(R(\theta)\) is zero, this is equivalent to standard gradient descent.
If \(R\) is an indicator function for a convex set, then this is equivalent to projected gradient descent.
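
As an illustration, here is a minimal numpy sketch of this update for the lasso, i.e. \(L(\theta) = \frac{1}{2}\|X\theta - y\|^2\) and \(R(\theta) = \lambda\|\theta\|_1\). The synthetic data and the step-size choice are assumptions made for the example, not part of the original notes:

    import numpy as np

    def prox_l1(x, t):
        # Prox of t * ||.||_1: component-wise soft thresholding.
        return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

    def proximal_gradient(X, y, lam, B=None, iters=500):
        # theta_{k+1} = Prox_{B R}(theta_k - B * grad L(theta_k)),
        # with L = 0.5*||X theta - y||^2 and R = lam*||theta||_1.
        # B is the step size; 1/||X||_2^2 (inverse Lipschitz constant of grad L) is a safe default.
        if B is None:
            B = 1.0 / np.linalg.norm(X, 2) ** 2
        theta = np.zeros(X.shape[1])
        for _ in range(iters):
            grad = X.T @ (X @ theta - y)
            theta = prox_l1(theta - B * grad, B * lam)
        return theta

    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 10))
    theta_true = np.zeros(10); theta_true[:3] = [2.0, -1.0, 0.5]
    y = X @ theta_true + 0.01 * rng.standard_normal(50)
    print(np.round(proximal_gradient(X, y, lam=0.5), 3))  # mostly exact zeros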

5 Relation to FOBOS and Truncated Gradient

FOBOS = proximal gradient method = truncated gradient: all perform a gradient step on the loss followed by a proximal step on the regularizer.

\begin{align*}
& \min_w L(w) + R(w) \\
& w_{t+\frac{1}{2}} = w_t - \eta g^L_{t} \\
& w_{t+1} = \argmin_w \frac{1}{2} \|w-w_{t+\frac{1}{2}}\|^2 + \eta R(w)
\end{align*}

Here \(g^L_t\) denotes a (sub)gradient of \(L\) at \(w_t\) and \(\eta\) is the step size.

6 Relation to Projection Operator

Fenchel conjugates are related to dual norms.
In fact, the Fenchel conjugate of a norm \(R\) is the indicator function of the unit ball of the dual norm.
That is, writing \(B\) for the unit ball of the dual norm and \(R^{\star}\) for the conjugate,

\begin{align*}
I_{B}(w) =
\begin{cases}
0 & : \text{if the dual norm of } w \text{ is } \le 1 \\
\infty & : \text{otherwise}
\end{cases}
\end{align*}

and

$$
R^{\star}(w) = I_{B}(w)
$$

When the proximal operator is defined in terms of a norm \(R\), it therefore follows (via the Moreau decomposition of Section 2) that

\begin{align*}
Prox_R(x) = x - \Pi_{B}(x)
\end{align*}

i.e. evaluating the proximal operator of a norm reduces to projecting onto the unit ball of its dual norm.
- <div id="outline-container-sec-7" class="outline-2">
- <h2 id="sec-7"><span class="section-number-2">7</span> The Soft Thresholding Operator</h2>
- <div class="outline-text-2" id="text-7">
- \begin{align*}
- f(w) = \|w-x\|_2^2 + \lambda \|w\|_1
- \end{align*}
- <p>
- We know that by the subgradient optimality condition, \(0 \in \partial_j(f) \text{ at solution}\)
- </p>

\begin{align*}
[\partial f(w)]_i =
\begin{cases}
w_i - x_i + \lambda & \text{ if } w_i > 0 \\
w_i - x_i - \lambda & \text{ if } w_i < 0 \\
-x_i + \lambda \times [-1,1] & \text{ if } w_i = 0
\end{cases}
\end{align*}

Setting each case to zero (or requiring that it contain zero) to find the optimum,

\begin{align*}
\begin{cases}
\opt{w}_i = x_i - \lambda & \text{ when } \opt{w}_i > 0 \text{, which happens when } x_i > \lambda \\
\opt{w}_i = x_i + \lambda & \text{ when } \opt{w}_i < 0 \text{, which happens when } x_i < -\lambda \\
0 \in -x_i + [-\lambda, \lambda] & \text{ when } \opt{w}_i = 0 \text{, i.e. } -\lambda \le x_i \le \lambda
\end{cases}
\end{align*}

This can equivalently be written as

\begin{align*}
\opt{w}_i =
\begin{cases}
x_i - \lambda & \text{ if } x_i > \lambda \\
x_i + \lambda & \text{ if } x_i < -\lambda \\
0 & \text{ if } -\lambda \le x_i \le \lambda
\end{cases}
\end{align*}

because \(\text{sign}(\opt{w}_i) = \text{sign}(x_i)\) in all cases.

Or, more compactly,

\begin{align*}
\opt{w}_i = \text{sign}(x_i)(|x_i| - \lambda)_{+}
\end{align*}

So if \(|x_i|\) is less than \(\lambda\), \(w_i\) is driven exactly to zero.
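
A quick numerical sanity check of this closed form (a minimal sketch using scipy's scalar minimizer; the test values of \(x\) and \(\lambda\) are arbitrary):

    import numpy as np
    from scipy.optimize import minimize_scalar

    def soft_threshold(x, lam):
        return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

    lam = 0.8
    for x in [2.0, 0.5, -0.3, -1.7]:
        # Minimize 0.5*(w - x)^2 + lam*|w| directly over the scalar w.
        res = minimize_scalar(lambda w: 0.5 * (w - x) ** 2 + lam * abs(w),
                              bounds=(-10, 10), method="bounded")
        print(x, round(res.x, 4), soft_threshold(x, lam))  # numerical and closed form agree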

7.1 Relation to Projection onto \(L_{\infty}\) Ball
- <div id="outline-container-sec-8" class="outline-2">
- <h2 id="sec-8"><span class="section-number-2">8</span> FOBOS Example for L1 Regularizer</h2>
- <div class="outline-text-2" id="text-8">
- <p>
- A common problem of subgradient methods is that if R is un-differentiable, the iterates are rarely at the point of non-differentiablity.
- However these points are the true minima of the function.
- </p>
- \begin{align*}
- w_{t+1,j} & = sign(w_{t+\frac{1}{2},j}) [w_{t+\frac{1}{2},j} - \lambda]_{+} \tag{soft thresholding operator}\\
- & = sign(w_{t,j} - \eta g) [w_{t+\frac{1}{2}} - \lambda]_{+}
- \end{align*}
- <p>
- This can lead to sparser solutions (explain why?)
- </p>

9 Naive L1 Regularization with Subgradient Descent

\begin{align*}
& \min_w \sum_i L(w,z_i) + \lambda \|w\|_1 \\
& w_{t+1} = w_t - \eta \nabla L(w_t,z_i) - \eta \lambda \, \text{sgn}(w_t) \tag{naive stochastic subgradient descent}
\end{align*}
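
To see the difference in practice, here is a small numpy comparison on synthetic data (the data, step size, and regularization strength are made up for illustration): the naive subgradient update above versus the FOBOS soft-thresholding update of Section 8.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 20))
    w_true = np.zeros(20); w_true[:3] = [1.5, -2.0, 1.0]
    y = X @ w_true + 0.01 * rng.standard_normal(100)
    lam, eta, iters = 0.1, 0.01, 2000

    def grad_L(w):
        return X.T @ (X @ w - y) / len(y)

    # Naive subgradient descent on L(w) + lam*||w||_1.
    w_sub = np.zeros(20)
    for _ in range(iters):
        w_sub -= eta * (grad_L(w_sub) + lam * np.sign(w_sub))

    # FOBOS: gradient half-step, then soft thresholding with threshold eta*lam.
    w_fob = np.zeros(20)
    for _ in range(iters):
        w_half = w_fob - eta * grad_L(w_fob)
        w_fob = np.sign(w_half) * np.maximum(np.abs(w_half) - eta * lam, 0.0)

    print("exact zeros (subgradient):", np.sum(w_sub == 0.0))  # typically none
    print("exact zeros (FOBOS):     ", np.sum(w_fob == 0.0))   # many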

10 Choice of B
- <div id="postamble" class="status">
- <p class="author">Author: Shyam Upadhyay</p>
- <p class="date">Created: 2014-08-28 Thu 11:07</p>
- <p class="creator"><a href="http://www.gnu.org/software/emacs/">Emacs</a> 24.3.1 (<a href="http://orgmode.org">Org</a> mode 8.2.6)</p>
- <p class="validation"><a href="http://validator.w3.org/check?uri=referer">Validate</a></p>
- </div>
- </body>
- </html>