Proximal Methods

\(
\newcommand{\opt}[1]{{#1}^{*}}
\newcommand{\pred}[1]{\hat{#1}}
\newcommand{\dnorm}[1]{{#1}^{*}}
\renewcommand{\Pr}{\mathbb{P}}
\renewcommand{\vec}[1]{\mathbf{#1}}
\DeclareMathOperator{\E}{\mathbb{E}}
\DeclareMathOperator*{\argmin}{\mathbf{arg\,min}}
\DeclareMathOperator*{\argmax}{\mathbf{arg\,max}}
\)
- <div id="outline-container-sec-1" class="outline-2">
- <h2 id="sec-1"><span class="section-number-2">1</span> Proximity Operator and Moreau Decomposition</h2>
- <div class="outline-text-2" id="text-1">
- <p>
- Proximity operator for a convex function \(f\) is defined as
- </p>
- \begin{align*}
- & Prox_f(x) = \argmin_z f(z) + \frac{1}{2} \|x-z\|^2 \\
- & Prox_{\lambda f}(x) = \argmin_z f(z) + \frac{1}{2\lambda} \|x-z\|^2 = \argmin_z \lambda f(z) + \frac{1}{2} \|x-z\|^2
- \end{align*}
- <p>
- Here the norm is the euclidean norm.
- Note that the minimizer is unique as the objective is strongly convex. Hence the operator is well-defined.
- Intuitively, we want to remain close to the previous iterate (here \(x\)), and also respect the regularization (here \(f(z)\)). The proximal point \(Prox_f(x)\) is a point that compromises between minimizing \(f\) and staying close to \(x\), λ being the relative trade-off factor (larger the λ closer we can move to \(f\)'s minima).
- An appealing property of proximal operators is that the minima is a fixpoint of the proximal operator. More formally, \(\opt{x}\) is the minima of \(f\) iff,
- </p>
- \begin{align*}
- Prox_f(\opt{x}) = \opt{x} \tag{try proving both directions}
- \end{align*}
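
To make the fixed-point property concrete, here is a minimal numpy sketch (the quadratic test function and its parameters are made up for illustration): for \(f(z) = \frac{1}{2}(z-a)^T A (z-a)\) the proximal point has the closed form \((A+I)^{-1}(Aa + x)\), and evaluating it at the minimizer \(a\) returns \(a\) itself.

    import numpy as np

    # Hypothetical smooth convex test function: f(z) = 0.5 * (z - a)^T A (z - a).
    A = np.array([[2.0, 0.5], [0.5, 1.0]])   # positive definite
    a = np.array([1.0, -2.0])                # the minimizer of f

    def prox_f(x):
        # argmin_z f(z) + 0.5*||x - z||^2 has the closed form (A + I)^{-1} (A a + x).
        return np.linalg.solve(A + np.eye(2), A @ a + x)

    print(prox_f(np.array([3.0, 0.0])))  # pulled toward a, but stays close to x
    print(prox_f(a))                     # the minimizer is a fixed point: returns a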

2 Moreau Decomposition and Projection Operator

If \(f\) is a norm, then its Fenchel conjugate \(\dnorm{f}\) is \(I_B\), the indicator function of the unit ball \(B\) of the dual norm. Recall that the dual norm of \(f\) is defined as

\begin{align*}
\|x\|_{\dnorm{f}} = \sup_z \; x^Tz \quad \text{ s.t. } \|z\|_{f} \le 1
\end{align*}

The Moreau decomposition states

\begin{align*}
& Prox_f(x) + Prox_{\dnorm{f}}(x) = x \\
\text{More generally, } \quad & Prox_{\lambda f}(x) + \lambda Prox_{\frac{\dnorm{f}}{\lambda}}\left(\frac{x}{\lambda}\right) = x \\
\text{In particular, for a norm } f, \quad & Prox_f(x) = x - \Pi_B(x)
\end{align*}

This identity lets us write out the proximal operators for common norms effortlessly.
For example, the dual of the \(\ell_2\) norm is the \(\ell_2\) norm itself. The projection operator becomes

\begin{align*}
& \Pi_B(x) =
\begin{cases}
x/\|x\|_2 & \text{if } \|x\|_2 > 1 \\
x & \text{if } \|x\|_2 \le 1
\end{cases} \\
\text{thus, the proximal is } & Prox_f(x) = x - \Pi_B(x) = \begin{cases}
x(1-1/\|x\|_2) & \text{if } \|x\|_2 > 1 \\
0 & \text{if } \|x\|_2 \le 1 \tag{Block Soft Thresholding}
\end{cases}
\end{align*}
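
As a sanity check, a minimal numpy sketch of this \(\ell_2\) case (with \(\lambda = 1\)), computing the proximal operator through the projection as above:

    import numpy as np

    def project_l2_ball(x):
        # Projection onto the unit l2 ball B = {z : ||z||_2 <= 1}.
        nrm = np.linalg.norm(x)
        return x if nrm <= 1 else x / nrm

    def prox_l2_norm(x):
        # Moreau decomposition with f = ||.||_2: Prox_f(x) = x - Pi_B(x),
        # since the conjugate of the l2 norm is the indicator of its own unit ball.
        return x - project_l2_ball(x)

    x = np.array([3.0, 4.0])                    # ||x||_2 = 5 > 1
    print(prox_l2_norm(x))                      # x * (1 - 1/||x||_2) = [2.4, 3.2]
    print(prox_l2_norm(np.array([0.1, 0.2])))   # inside the ball -> 0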

Similarly, we can find the proximal operators for the \(\ell_1\) norm and its dual, the \(\ell_\infty\) norm.
Remember that the unit ball of the \(\ell_\infty\) norm is a box, so projection simply involves component-wise clipping

\begin{align*}
& [\Pi_B(x)]_i =
\begin{cases}
x_i & \text{if } |x_i| \le 1 \\
1 & \text{if } x_i > 1 \\
-1 & \text{if } x_i < -1
\end{cases} \\
\text{thus, the proximal is } & [Prox_f(x)]_i = x_i - [\Pi_B(x)]_i = \begin{cases}
0 & \text{if } |x_i| \le 1 \\
x_i-1 & \text{if } x_i > 1 \\
x_i+1 & \text{if } x_i < -1 \tag{Soft Thresholding}
\end{cases}
\end{align*}
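
Again a minimal numpy sketch (here \(f = \|\cdot\|_1\) and \(\lambda = 1\)), recovering soft thresholding from the box projection:

    import numpy as np

    def project_linf_box(x):
        # Component-wise projection onto the unit l-infinity ball (a box).
        return np.clip(x, -1.0, 1.0)

    def prox_l1_norm(x):
        # Moreau decomposition with f = ||.||_1: Prox_f(x) = x - Pi_B(x),
        # where B is the unit ball of the dual (l-infinity) norm.
        return x - project_linf_box(x)

    x = np.array([2.5, -0.3, -1.7])
    print(prox_l1_norm(x))                                 # [1.5, 0.0, -0.7]
    print(np.sign(x) * np.maximum(np.abs(x) - 1.0, 0.0))   # same: soft thresholding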

3 Proximal Operator as a Generic Case

The proximal operator can be seen as a generalization of several common algorithms.
- <div id="outline-container-sec-3-1" class="outline-3">
- <h3 id="sec-3-1"><span class="section-number-3">3.1</span> Gradient Descent</h3>
- <div class="outline-text-3" id="text-3-1">
- <p>
- If we take the first order approximation of f(z) around x in,
- </p>
- \begin{align*}
- & Prox_f(x) = \argmin_z f(z) + \frac{1}{2} \|x-z\|^2 \\
- \text{we get } & = \argmin_z f(x) + \grad f(x)^T(z-x) + \frac{1}{2} \|x-z\|^2 \\
- & = \grad f(x)^T + (z-x)=0 \\
- & \implies z = x - \grad f(x) \\
- & \implies Prox_f(x) = x - \grad f(x) \\
- & \text{More generally }\implies Prox_{\lambda f(x)} = x - \lambda\grad f(x)
- \end{align*}
- <p>
- This observation also suggests thinking about gradient descent in another way.
- If we attach a regularization term in the above approximation, this becomes ISTA (notice the resemblance to proximal gradient descent where we use the second-order approximation).
- </p>

3.2 Levenberg-Marquardt

Likewise, using the second-order approximation, we get the Levenberg-Marquardt update,

\begin{align*}
& \argmin_z f(x) + \nabla f(x)^T(z-x) + \frac{1}{2}(z-x)^T\nabla^2 f(x)(z-x) + \frac{1}{2} \|x-z\|^2 \\
& \implies z = x - (\nabla^2 f(x) + I )^{-1}\nabla f(x)
\end{align*}
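
A minimal numpy sketch of this damped Newton step on a hypothetical quadratic test function (the matrix \(A\) and vector \(b\) are made up for illustration; the damping term is the identity, matching the derivation above):

    import numpy as np

    # Hypothetical smooth test function: f(z) = 0.5 * z^T A z - b^T z.
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, -1.0])

    def grad(x):
        return A @ x - b

    def hess(x):
        return A

    def lm_step(x):
        # z = x - (Hess f(x) + I)^{-1} grad f(x), the update derived above.
        return x - np.linalg.solve(hess(x) + np.eye(len(x)), grad(x))

    x = np.zeros(2)
    for _ in range(20):
        x = lm_step(x)
    print(x, np.linalg.solve(A, b))   # converges to the minimizer A^{-1} b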

4 Proximal Gradient Methods

This is a method suitable for large-scale optimization. We will show that regularized convex objectives can be cast as problems involving evaluation of the proximity operator.

Our objective will in general look like

\begin{align*}
\argmin_\theta R(\theta) + L(\theta)
\end{align*}

Here \(R\) is a (possibly non-differentiable) convex regularizer,
and \(L\) is a differentiable convex loss function.

We will see that as long as we can evaluate the proximal operator defined
by \(R\), we can solve the problem efficiently.

The idea is to approximate the \(L\) term with its Taylor approximation
at the current iterate,

\begin{align*}
\theta_{k+1} = \argmin_\theta R(\theta) + L(\theta_k) + \nabla L(\theta_k)^{T}(\theta - \theta_k) + \frac{1}{2B} \| \theta - \theta_k \|^2
\end{align*}

Here \(B\) is a step size playing the role of an approximate inverse Hessian (we discuss the choice of \(B\) later).
Now, dropping terms that do not depend on \(\theta\) and completing the square using the gradient term
and the quadratic term, we get

\begin{align*}
\argmin_\theta R(\theta) + \frac{1}{2B} \| \theta - (\theta_k - B\nabla L(\theta_k))\|^2
\end{align*}

which is exactly a proximal-operator evaluation. In particular, it is equivalent to

\begin{align*}
\theta_{k+1} = Prox_{B R}( \theta_k - B\nabla L(\theta_k) )
\end{align*}

This can be interpreted as follows: the next iterate should stay close to the gradient step, while at the same time respecting the regularizer \(R\).

Note that if \(R(\theta)\) is zero, this is equivalent to standard gradient descent.
If \(R\) is an indicator function for a convex set, then this is equivalent to projected gradient descent.
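
As an illustration, here is a minimal numpy sketch of this update for the lasso, i.e. \(L(\theta) = \frac{1}{2}\|X\theta - y\|^2\) and \(R(\theta) = \lambda\|\theta\|_1\). The synthetic data and the step-size choice are assumptions made for the example, not part of the original notes:

    import numpy as np

    def prox_l1(x, t):
        # Prox of t * ||.||_1: component-wise soft thresholding.
        return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

    def proximal_gradient(X, y, lam, B=None, iters=500):
        # theta_{k+1} = Prox_{B R}(theta_k - B * grad L(theta_k)),
        # with L = 0.5*||X theta - y||^2 and R = lam*||theta||_1.
        # B is the step size; 1/||X||_2^2 (inverse Lipschitz constant of grad L) is a safe default.
        if B is None:
            B = 1.0 / np.linalg.norm(X, 2) ** 2
        theta = np.zeros(X.shape[1])
        for _ in range(iters):
            grad = X.T @ (X @ theta - y)
            theta = prox_l1(theta - B * grad, B * lam)
        return theta

    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 10))
    theta_true = np.zeros(10); theta_true[:3] = [2.0, -1.0, 0.5]
    y = X @ theta_true + 0.01 * rng.standard_normal(50)
    print(np.round(proximal_gradient(X, y, lam=0.5), 3))  # mostly exact zeros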

5 Relation to FOBOS and Truncated Gradient

FOBOS = proximal gradient method = truncated gradient: all perform a gradient step on the loss followed by a proximal step on the regularizer.

\begin{align*}
& \min_w L(w) + R(w) \\
& w_{t+\frac{1}{2}} = w_t - \eta g^L_{t} \\
& w_{t+1} = \argmin_w \frac{1}{2} \|w-w_{t+\frac{1}{2}}\|^2 + \eta R(w)
\end{align*}

Here \(g^L_t\) denotes a (sub)gradient of \(L\) at \(w_t\) and \(\eta\) is the step size.

6 Relation to Projection Operator

Fenchel conjugates are related to dual norms.
In fact, the Fenchel conjugate of a norm \(R\) is the indicator function of the unit ball of the dual norm.
That is, writing \(B\) for the unit ball of the dual norm and \(R^{\star}\) for the conjugate,

\begin{align*}
I_{B}(w) =
\begin{cases}
0 & : \text{if the dual norm of } w \text{ is } \le 1 \\
\infty & : \text{otherwise}
\end{cases}
\end{align*}

and

$$
R^{\star}(w) = I_{B}(w)
$$

When the proximal operator is defined in terms of a norm \(R\), it therefore follows (via the Moreau decomposition of Section 2) that

\begin{align*}
Prox_R(x) = x - \Pi_{B}(x)
\end{align*}

i.e. evaluating the proximal operator of a norm reduces to projecting onto the unit ball of its dual norm.
- <div id="outline-container-sec-7" class="outline-2">
- <h2 id="sec-7"><span class="section-number-2">7</span> The Soft Thresholding Operator</h2>
- <div class="outline-text-2" id="text-7">
- \begin{align*}
- f(w) = \|w-x\|_2^2 + \lambda \|w\|_1
- \end{align*}
- <p>
- We know that by the subgradient optimality condition, \(0 \in \partial_j(f) \text{ at solution}\)
- </p>

\begin{align*}
[\partial f(w)]_i =
\begin{cases}
w_i - x_i + \lambda & \text{ if } w_i > 0 \\
w_i - x_i - \lambda & \text{ if } w_i < 0 \\
-x_i + \lambda \times [-1,1] & \text{ if } w_i = 0
\end{cases}
\end{align*}

Setting each case to zero (or requiring that it contain zero) to find the optimum,

\begin{align*}
\begin{cases}
\opt{w}_i = x_i - \lambda & \text{ when } \opt{w}_i > 0 \text{, which happens when } x_i > \lambda \\
\opt{w}_i = x_i + \lambda & \text{ when } \opt{w}_i < 0 \text{, which happens when } x_i < -\lambda \\
0 \in -x_i + [-\lambda, \lambda] & \text{ when } \opt{w}_i = 0 \text{, i.e. } -\lambda \le x_i \le \lambda
\end{cases}
\end{align*}

This can equivalently be written as

\begin{align*}
\opt{w}_i =
\begin{cases}
x_i - \lambda & \text{ if } x_i > \lambda \\
x_i + \lambda & \text{ if } x_i < -\lambda \\
0 & \text{ if } -\lambda \le x_i \le \lambda
\end{cases}
\end{align*}

because \(\text{sign}(\opt{w}_i) = \text{sign}(x_i)\) in all cases.

Or, more compactly,

\begin{align*}
\opt{w}_i = \text{sign}(x_i)(|x_i| - \lambda)_{+}
\end{align*}

So if \(|x_i|\) is less than \(\lambda\), \(w_i\) is driven exactly to zero.
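
A quick numerical sanity check of this closed form (a minimal sketch using scipy's scalar minimizer; the test values of \(x\) and \(\lambda\) are arbitrary):

    import numpy as np
    from scipy.optimize import minimize_scalar

    def soft_threshold(x, lam):
        return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

    lam = 0.8
    for x in [2.0, 0.5, -0.3, -1.7]:
        # Minimize 0.5*(w - x)^2 + lam*|w| directly over the scalar w.
        res = minimize_scalar(lambda w: 0.5 * (w - x) ** 2 + lam * abs(w),
                              bounds=(-10, 10), method="bounded")
        print(x, round(res.x, 4), soft_threshold(x, lam))  # numerical and closed form agree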

7.1 Relation to Projection onto \(L_{\infty}\) Ball
- <div id="outline-container-sec-8" class="outline-2">
- <h2 id="sec-8"><span class="section-number-2">8</span> FOBOS Example for L1 Regularizer</h2>
- <div class="outline-text-2" id="text-8">
- <p>
- A common problem of subgradient methods is that if R is un-differentiable, the iterates are rarely at the point of non-differentiablity.
- However these points are the true minima of the function.
- </p>
- \begin{align*}
- w_{t+1,j} & = sign(w_{t+\frac{1}{2},j}) [w_{t+\frac{1}{2},j} - \lambda]_{+} \tag{soft thresholding operator}\\
- & = sign(w_{t,j} - \eta g) [w_{t+\frac{1}{2}} - \lambda]_{+}
- \end{align*}
- <p>
- This can lead to sparser solutions (explain why?)
- </p>

9 Naive L1 Regularization with Subgradient Descent

\begin{align*}
& \min_w \sum_i L(w,z_i) + \lambda \|w\|_1 \\
& w_{t+1} = w_t - \eta \nabla L(w_t,z_i) - \eta \lambda \, \text{sgn}(w_t) \tag{naive stochastic subgradient descent}
\end{align*}
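
To see the difference in practice, here is a small numpy comparison on synthetic data (the data, step size, and regularization strength are made up for illustration): the naive subgradient update above versus the FOBOS soft-thresholding update of Section 8.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 20))
    w_true = np.zeros(20); w_true[:3] = [1.5, -2.0, 1.0]
    y = X @ w_true + 0.01 * rng.standard_normal(100)
    lam, eta, iters = 0.1, 0.01, 2000

    def grad_L(w):
        return X.T @ (X @ w - y) / len(y)

    # Naive subgradient descent on L(w) + lam*||w||_1.
    w_sub = np.zeros(20)
    for _ in range(iters):
        w_sub -= eta * (grad_L(w_sub) + lam * np.sign(w_sub))

    # FOBOS: gradient half-step, then soft thresholding with threshold eta*lam.
    w_fob = np.zeros(20)
    for _ in range(iters):
        w_half = w_fob - eta * grad_L(w_fob)
        w_fob = np.sign(w_half) * np.maximum(np.abs(w_half) - eta * lam, 0.0)

    print("exact zeros (subgradient):", np.sum(w_sub == 0.0))  # typically none
    print("exact zeros (FOBOS):     ", np.sum(w_fob == 0.0))   # many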

10 Choice of B
- <div id="postamble" class="status">
- <p class="author">Author: Shyam Upadhyay</p>
- <p class="date">Created: 2014-08-28 Thu 11:07</p>
- <p class="creator"><a href="http://www.gnu.org/software/emacs/">Emacs</a> 24.3.1 (<a href="http://orgmode.org">Org</a> mode 8.2.6)</p>
- <p class="validation"><a href="http://validator.w3.org/check?uri=referer">Validate</a></p>
- </div>
- </body>
- </html>