Advertisement
Guest User

org-export html

a guest
Aug 28th, 2014
366
0
Never
Not a member of Pastebin yet? Sign Up, it unlocks many cool features!
HTML 17.69 KB | None | 0 0
  1. <?xml version="1.0" encoding="utf-8"?>
  2. <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  3. "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
  4. <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  5. <head>
  6. <title>Proximal Methods</title>
  7. <!-- 2014-08-28 Thu 11:07 -->
  8. <meta  http-equiv="Content-Type" content="text/html;charset=utf-8" />
  9. <meta  name="generator" content="Org-mode" />
  10. <meta  name="author" content="Shyam Upadhyay" />
  11. <style type="text/css">
  12.  <!--/*--><![CDATA[/*><!--*/
  13.  .title  { text-align: center; }
  14.  .todo   { font-family: monospace; color: red; }
  15.  .done   { color: green; }
  16.  .tag    { background-color: #eee; font-family: monospace;
  17.            padding: 2px; font-size: 80%; font-weight: normal; }
  18.  .timestamp { color: #bebebe; }
  19.  .timestamp-kwd { color: #5f9ea0; }
  20.  .right  { margin-left: auto; margin-right: 0px;  text-align: right; }
  21.  .left   { margin-left: 0px;  margin-right: auto; text-align: left; }
  22.  .center { margin-left: auto; margin-right: auto; text-align: center; }
  23.  .underline { text-decoration: underline; }
  24.  #postamble p, #preamble p { font-size: 90%; margin: .2em; }
  25.  p.verse { margin-left: 3%; }
  26.  pre {
  27.    border: 1px solid #ccc;
  28.    box-shadow: 3px 3px 3px #eee;
  29.    padding: 8pt;
  30.    font-family: monospace;
  31.    overflow: auto;
  32.    margin: 1.2em;
  33.  }
  34.  pre.src {
  35.    position: relative;
  36.    overflow: visible;
  37.    padding-top: 1.2em;
  38.  }
  39.  pre.src:before {
  40.    display: none;
  41.    position: absolute;
  42.    background-color: white;
  43.    top: -10px;
  44.    right: 10px;
  45.    padding: 3px;
  46.    border: 1px solid black;
  47.  }
  48.  pre.src:hover:before { display: inline;}
  49.  pre.src-sh:before    { content: 'sh'; }
  50.  pre.src-bash:before  { content: 'sh'; }
  51.  pre.src-emacs-lisp:before { content: 'Emacs Lisp'; }
  52.  pre.src-R:before     { content: 'R'; }
  53.  pre.src-perl:before  { content: 'Perl'; }
  54.  pre.src-java:before  { content: 'Java'; }
  55.  pre.src-sql:before   { content: 'SQL'; }
  56.  
  57.  table { border-collapse:collapse; }
  58.  caption.t-above { caption-side: top; }
  59.  caption.t-bottom { caption-side: bottom; }
  60.  td, th { vertical-align:top;  }
  61.  th.right  { text-align: center;  }
  62.  th.left   { text-align: center;   }
  63.  th.center { text-align: center; }
  64.  td.right  { text-align: right;  }
  65.  td.left   { text-align: left;   }
  66.  td.center { text-align: center; }
  67.  dt { font-weight: bold; }
  68.  .footpara:nth-child(2) { display: inline; }
  69.  .footpara { display: block; }
  70.  .footdef  { margin-bottom: 1em; }
  71.  .figure { padding: 1em; }
  72.  .figure p { text-align: center; }
  73.  .inlinetask {
  74.    padding: 10px;
  75.    border: 2px solid gray;
  76.    margin: 10px;
  77.    background: #ffffcc;
  78.  }
  79.  #org-div-home-and-up
  80.   { text-align: right; font-size: 70%; white-space: nowrap; }
  81.  textarea { overflow-x: auto; }
  82.  .linenr { font-size: smaller }
  83.  .code-highlighted { background-color: #ffff00; }
  84.  .org-info-js_info-navigation { border-style: none; }
  85.  #org-info-js_console-label
  86.    { font-size: 10px; font-weight: bold; white-space: nowrap; }
  87.  .org-info-js_search-highlight
  88.    { background-color: #ffff00; color: #000000; font-weight: bold; }
  89.  /*]]>*/-->
  90. </style>
  91. <script type="text/javascript">
  92. /*
  93. @licstart  The following is the entire license notice for the
  94. JavaScript code in this tag.
  95.  
  96. Copyright (C) 2012-2013 Free Software Foundation, Inc.
  97.  
  98. The JavaScript code in this tag is free software: you can
  99. redistribute it and/or modify it under the terms of the GNU
  100. General Public License (GNU GPL) as published by the Free Software
  101. Foundation, either version 3 of the License, or (at your option)
  102. any later version.  The code is distributed WITHOUT ANY WARRANTY;
  103. without even the implied warranty of MERCHANTABILITY or FITNESS
  104. FOR A PARTICULAR PURPOSE.  See the GNU GPL for more details.
  105.  
  106. As additional permission under GNU GPL version 3 section 7, you
  107. may distribute non-source (e.g., minimized or compacted) forms of
  108. that code without the copy of the GNU GPL normally required by
  109. section 4, provided you include this license notice and a URL
  110. through which recipients can access the Corresponding Source.
  111.  
  112.  
  113. @licend  The above is the entire license notice
  114. for the JavaScript code in this tag.
  115. */
  116. <!--/*--><![CDATA[/*><!--*/
  117. function CodeHighlightOn(elem, id)
  118. {
  119.   var target = document.getElementById(id);
  120.   if(null != target) {
  121.     elem.cacheClassElem = elem.className;
  122.     elem.cacheClassTarget = target.className;
  123.     target.className = "code-highlighted";
  124.     elem.className   = "code-highlighted";
  125.   }
  126. }
  127. function CodeHighlightOff(elem, id)
  128. {
  129.   var target = document.getElementById(id);
  130.   if(elem.cacheClassElem)
  131.     elem.className = elem.cacheClassElem;
  132.   if(elem.cacheClassTarget)
  133.     target.className = elem.cacheClassTarget;
  134. }
  135. /*]]>*///-->
  136. </script>
  137. <script type="text/javascript" src="http://orgmode.org/mathjax/MathJax.js"></script>
  138. <script type="text/javascript">
  139. <!--/*--><![CDATA[/*><!--*/
  140.    MathJax.Hub.Config({
  141.        // Only one of the two following lines, depending on user settings
  142.        // First allows browser-native MathML display, second forces HTML/CSS
  143.        //  config: ["MMLorHTML.js"], jax: ["input/TeX"],
  144.            jax: ["input/TeX", "output/HTML-CSS"],
  145.        extensions: ["tex2jax.js","TeX/AMSmath.js","TeX/AMSsymbols.js",
  146.                     "TeX/noUndefined.js"],
  147.        tex2jax: {
  148.            inlineMath: [ ["\\(","\\)"] ],
  149.            displayMath: [ ['$$','$$'], ["\\[","\\]"], ["\\begin{displaymath}","\\end{displaymath}"] ],
  150.            skipTags: ["script","noscript","style","textarea","pre","code"],
  151.            ignoreClass: "tex2jax_ignore",
  152.            processEscapes: false,
  153.            processEnvironments: true,
  154.            preview: "TeX"
  155.        },
  156.        showProcessingMessages: true,
  157.        displayAlign: "center",
  158.        displayIndent: "2em",
  159.  
  160.        "HTML-CSS": {
  161.             scale: 100,
  162.             availableFonts: ["STIX","TeX"],
  163.             preferredFont: "TeX",
  164.             webFont: "TeX",
  165.             imageFont: "TeX",
  166.             showMathMenu: true,
  167.        },
  168.        MMLorHTML: {
  169.             prefer: {
  170.                 MSIE:    "MML",
  171.                 Firefox: "MML",
  172.                 Opera:   "HTML",
  173.                 other:   "HTML"
  174.             }
  175.        }
  176.    });
  177. /*]]>*///-->
  178. </script>
  179. </head>
  180. <body>
  181. <div id="content">
  182. <h1 class="title">Proximal Methods</h1>
  183. \(
  184. \newcommand{\opt}[1]{{#1}^{*}}
  185. \newcommand{\pred}[1]{\hat{#1}}
  186. \newcommand{\dnorm}[1]{{#1}^{*}}
  187. \renewcommand{\Pr}{\mathbb{P}}
  188. \renewcommand{\vec}[1]{\mathbf{#1}}
  189. \DeclareMathOperator{\E}{\mathbb{E}}
  190. \DeclareMathOperator*{\argmin}{\mathbf{arg\,min}}
  191. \DeclareMathOperator*{\argmax}{\mathbf{arg\,max}}
  192. \DeclareMathOperator*{\sup}{sup}
  193. \)
  194. <div id="outline-container-sec-1" class="outline-2">
  195. <h2 id="sec-1"><span class="section-number-2">1</span> Proximity Operator and Moreau Decomposition</h2>
  196. <div class="outline-text-2" id="text-1">
  197. <p>
  198. Proximity operator for a convex function \(f\) is defined as
  199. </p>
  200. \begin{align*}
  201. & Prox_f(x) = \argmin_z f(z) + \frac{1}{2} \|x-z\|^2 \\
  202. & Prox_{\lambda f}(x) = \argmin_z f(z) + \frac{1}{2\lambda} \|x-z\|^2 =  \argmin_z \lambda f(z) + \frac{1}{2} \|x-z\|^2
  203. \end{align*}
  204. <p>
  205. Here the norm is the euclidean norm.
  206. Note that the minimizer is unique as the objective is strongly convex. Hence the operator is well-defined.
  207. Intuitively, we want to remain close to the previous iterate (here \(x\)), and also respect the regularization (here \(f(z)\)). The proximal point \(Prox_f(x)\) is a point that compromises between minimizing \(f\) and staying close to \(x\), &lambda; being the relative trade-off factor (larger the &lambda; closer we can move to \(f\)'s minima).
  208. An appealing property of proximal operators is that the minima is a fixpoint of the proximal operator. More formally, \(\opt{x}\) is the minima of \(f\) iff,
  209. </p>
  210. \begin{align*}
  211. Prox_f(\opt{x}) = \opt{x} \tag{try proving both directions}
  212. \end{align*}
  213. </div>
  214. </div>
  215. <div id="outline-container-sec-2" class="outline-2">
  216. <h2 id="sec-2"><span class="section-number-2">2</span> Moreau Decomposition and Projection Operator</h2>
  217. <div class="outline-text-2" id="text-2">
  218. <p>
  219. If \(f\) is a norm, then \(\dnorm{f}\) is \(I_B\) where \(B\) is the unit ball for the dual norm. Recall that the dual norm \(\dnorm{f}\) of \(f\) is defined as
  220. </p>
  221. \begin{align*}
  222. \dnorm{f}(x) = \|x\|_{\dnorm{f}} = \argmax_z x^Tz \text{ s.t. } \|z\|_{f} \le 1
  223. \end{align*}
  224. \begin{align*}  
  225. Prox_f(x) + Prox_{\dnorm{f}}(x) = x \\
  226. \text{More generally, } Prox_{\lambda f}(x) + \lambda Prox_{\frac{\dnorm{f}}{\lambda}}(\frac{x}{\lambda}) = x \\
  227. Prox_f(x) = x - \Pi_B(x) \\
  228. \end{align*}
  229.  
  230. <p>
  231. This identity helps in writing out the proximal operators effortlessly for common norms.
  232. For example, the dual of \(\ell_2\) norm is the norm itself. The projection operator becomes
  233. </p>
  234. \begin{align*}
  235. & \Pi_B(x) =
  236. \begin{cases}
  237. x/\|x\|_2 & \text{if } \|x\|_2 \ge 1 \\
  238. x & \text{if } \|x\|_2 \le 1
  239. \end{cases} \\
  240. \text{thus, the proximal is } & Prox_f(x) = x - \Pi_B(x) = \begin{cases}
  241. x(1-1/\|x\|_2) & \text{if } \|x\|_2 \ge 1 \\
  242. 0 & \text{if } \|x\|_2 \le 1 \tag{Block Soft Thresholding}
  243. \end{cases}
  244. \end{align*}
  245. <p>
  246. Similarly, we can find proximal operators for \(\ell_1\) and its dual \(\ell_\infty\) norm.
  247. Remember that the unit norm ball for \(\ell_\infty\) norm is a box, so projection simply involves component-wise thresholding
  248. </p>
  249. \begin{align*}
  250. & [\Pi_B(x)]_i =
  251. \begin{cases}
  252. x_i & \text{if } x_i \le 1 \\
  253. 1 & \text{if } x_i > 1 \\
  254. -1 & \text{if } x_i < -1
  255. \end{cases} \\
  256. \text{thus, the proximal is } & [Prox_f(x)]_i = x_i - [\Pi_B(x)]_i = \begin{cases}
  257. 0 & \text{if } x_i \le 1 \\
  258. x_i-1 & \text{if } x_i > 1 \\
  259. x_i+1 & \text{if } x_i < -1 \tag{Soft Thresholding}
  260. \end{cases}
  261. \end{align*}
  262. </div>
  263. </div>
  264. <div id="outline-container-sec-3" class="outline-2">
  265. <h2 id="sec-3"><span class="section-number-2">3</span> Proximal Operator as a Generic Case</h2>
  266. <div class="outline-text-2" id="text-3">
  267. <p>
  268. Proximal operator can be seen as a generalization of several common algorithms.
  269. </p>
  270. </div>
  271. <div id="outline-container-sec-3-1" class="outline-3">
  272. <h3 id="sec-3-1"><span class="section-number-3">3.1</span> Gradient Descent</h3>
  273. <div class="outline-text-3" id="text-3-1">
  274. <p>
  275. If we take the first order approximation of f(z) around x in,
  276. </p>
  277.  
  278. \begin{align*}
  279. & Prox_f(x) = \argmin_z f(z) + \frac{1}{2} \|x-z\|^2 \\
  280. \text{we get } & = \argmin_z f(x) + \grad f(x)^T(z-x)  + \frac{1}{2} \|x-z\|^2 \\
  281. & = \grad f(x)^T + (z-x)=0 \\
  282. & \implies z = x - \grad f(x) \\
  283. & \implies Prox_f(x) = x - \grad f(x) \\
  284. & \text{More generally }\implies Prox_{\lambda f(x)} = x - \lambda\grad f(x)
  285. \end{align*}
  286.  
  287. <p>
  288. This observation also suggests thinking about gradient descent in another way.
  289. If we attach a regularization term in the above approximation, this becomes ISTA (notice the resemblance to proximal gradient descent where we use the second-order approximation).
  290. </p>
  291. </div>
  292. </div>
  293. <div id="outline-container-sec-3-2" class="outline-3">
  294. <h3 id="sec-3-2"><span class="section-number-3">3.2</span> Levenberg Marquart</h3>
  295. <div class="outline-text-3" id="text-3-2">
  296. <p>
  297. Likewise, using the second order approximation, we get the Levenberg-Marquat update,
  298. </p>
  299.  
  300.  \begin{align*}
  301.  \argmin_z f(x) + \nabla f(x)^T(z-x) + \frac{\nabla^2 f(x)^T}{2}\|x-z\|^2 + \frac{1}{2} \|x-z\|^2 \\
  302.  z = x - (\nabla^2 f(x) + I )^{-1}\nabla f(x)
  303. \end{align*}
  304. </div>
  305. </div>
  306. </div>
  307. <div id="outline-container-sec-4" class="outline-2">
  308. <h2 id="sec-4"><span class="section-number-2">4</span> Proximal Gradient Methods</h2>
  309. <div class="outline-text-2" id="text-4">
  310. <p>
  311. This is a method suitable for large scale optimization. We will show that regularized convex objectives can be cast into problem involving evaluation of the proximity operator.
  312. </p>
  313.  
  314. <p>
  315. Our objective will in general look like,
  316. </p>
  317.  
  318. \begin{align*}
  319. \argmin_\theta R(\theta) + L(\theta)
  320. \end{align*}
  321.  
  322. <p>
  323. Here \(R\) is a (<i>possibly non-differentiable</i>) convex regularizer,
  324. and \(L\) is a <b>differentiable</b> convex loss function.
  325. </p>
  326.  
  327. <p>
  328. We will see that as long as we can evaluate the proximal operator defined
  329. by \(R\), we can solve the problem efficiently.
  330. </p>
  331.  
  332. <p>
  333. The idea is to approximate the \(L\) term with its taylor approximation
  334. at the current iterate,
  335. </p>
  336.  
  337. \begin{align*}  
  338. \theta_{k+1} = \argmin_\theta R(\theta) + L(\theta_k) + \nabla L(\theta_k)^{T}(\theta - \theta_k) + \frac{1}{2B} \| \theta - \theta_k \|^2
  339. \end{align*}  
  340.  
  341. <p>
  342. Here \(B\) is an approximate inverse Hessian (we will discuss the choice of B later).
  343. Now dropping terms which are not dependent on \(\theta\) and completing the squares using the second term
  344. and the third term in the quadratic approximation, we get,
  345. </p>
  346.  
  347. \begin{align*}
  348. \argmin_\theta R(\theta) + \frac{1}{2} \| \theta - (\theta_k - B\nabla L(\theta_k))\|^2
  349. \end{align*}
  350.  
  351. <p>
  352. Which is a problem of evaluating the proximal operator. In particular, it is equivalent to solving,
  353. </p>
  354.  
  355. \begin{align*}
  356. \theta_{k+1} = Prox_R( \theta_k - B\nabla L(\theta_k) )
  357. \end{align*}
  358. <p>
  359. This can be interpreted as follows -  the next iterate should be close to the gradient step, and at the same time respect the regularizer R.
  360. </p>
  361.  
  362. <p>
  363. Note that if \(R(\theta)\) is zero, this is equivalent to standard gradient descent.
  364. If \(R\) is a indicator function for a convex set, then this is equivalent to projected gradient descent.
  365. </p>
  366. </div>
  367. </div>
  368. <div id="outline-container-sec-5" class="outline-2">
  369. <h2 id="sec-5"><span class="section-number-2">5</span> Relation to FOBOS and Truncated Gradient</h2>
  370. <div class="outline-text-2" id="text-5">
  371. <p>
  372. FOBOS = Proximal Gradient method = Truncated Gradient.
  373. </p>
  374. \begin{align*}  
  375. \min L(w) + R(w) \\
  376. w_{t+\frac{1}{2}} = w_t - \eta g^L_{t} \\
  377. w_{t+1} = \argmin_w \frac{1}{2} \|w-w_{t+\frac{1}{2}}\|^2 + \eta R(w) \\
  378. \end{align*}
  379. </div>
  380. </div>
  381. <div id="outline-container-sec-6" class="outline-2">
  382. <h2 id="sec-6"><span class="section-number-2">6</span> Relation to Projection Operator</h2>
  383. <div class="outline-text-2" id="text-6">
  384. <p>
  385. Fenchel conjugates are related to dual norms.
  386. In fact, the conjugate of a norm R
  387. is the indicator function of a unit ball of the dual norm R*.
  388. That is,
  389. </p>
  390.  
  391. \begin{align*}
  392. I_R{w} =
  393. \begin{cases}
  394. 1 & : R(w) \le 1 \\
  395. \infty & : otherwise
  396. \end{cases}
  397. \end{align*}
  398.  
  399. <p>
  400. And,
  401. $$
  402.   R*(w) = I_{R}(w)
  403.   $$
  404. </p>
  405.  
  406. <p>
  407. When the proximal operator is defined in terms of a norm, it can be shown
  408. that
  409. </p>
  410. </div>
  411. </div>
  412. <div id="outline-container-sec-7" class="outline-2">
  413. <h2 id="sec-7"><span class="section-number-2">7</span> The Soft Thresholding Operator</h2>
  414. <div class="outline-text-2" id="text-7">
  415. \begin{align*}
  416. f(w) = \|w-x\|_2^2 + \lambda \|w\|_1
  417. \end{align*}
  418.  
  419. <p>
  420. We know that by the subgradient optimality condition, \(0 \in \partial_j(f) \text{ at solution}\)
  421. </p>
  422.  
  423. \begin{align*}
  424. \partial_j(f) =
  425. \begin{cases}
  426. w_i - x_i + \lambda & \text{ if } w_i > 0 \\
  427. w_i - x_i - \lambda & \text{ if } w_i < 0 \\
  428. -x_i + \lambda \times [-1,1] & \text{ if } w_i = 0
  429. \end{cases}
  430. \end{align*}
  431.  
  432. <p>
  433. Equating each case to zero to find the optimal,
  434. </p>
  435.  
  436. \begin{align*}
  437. \begin{cases}
  438. \opt{w}_i = x_i - \lambda & \text{ when } \opt{w}_i > 0 \text{ this happens when $x_i > \lambda$} \\
  439. \opt{w}_i = x_i + \lambda &  \text{ when } \opt{w}_i < 0 \text{ this happens when $x_i < \lambda$} \\
  440. 0 \in -x_i + [-\lambda, \lambda]
  441. \rightarrow \lambda \le x_i \le \lambda
  442. \end{cases}
  443. \end{align*}
  444.  
  445. <p>
  446. This can equivalently be written as,
  447. </p>
  448.  
  449. \begin{align*}
  450. \begin{cases}
  451. \opt{w}_i = x_i - \lambda & \text{ when } x_i > 0 \\
  452. \opt{w}_i = x_i + \lambda &  \text{ when } x_i < 0 \\
  453. 0 & \text{ if } \lambda \le x_i \le \lambda
  454. \end{cases}
  455. \end{align*}
  456.  
  457. <p>
  458. because sign(\opt{w}<sub>i</sub>) = sign(x<sub>i</sub>) in all cases.
  459. </p>
  460.  
  461. <p>
  462. Or more compactly,
  463. </p>
  464.  
  465. \begin{align*}
  466. w_i = sign(x_i)(|x_i| - \lambda)_{+}
  467. \end{align*}
  468.  
  469. <p>
  470. So if \(x_i\) is less than \(\lambda\), \(w_i\) will be driven to zero.
  471. </p>
  472. </div>
  473. <div id="outline-container-sec-7-1" class="outline-3">
  474. <h3 id="sec-7-1"><span class="section-number-3">7.1</span> Relation to Projection onto \(L_{\infty}\) Ball</h3>
  475. </div>
  476. </div>
  477. <div id="outline-container-sec-8" class="outline-2">
  478. <h2 id="sec-8"><span class="section-number-2">8</span> FOBOS Example for L1 Regularizer</h2>
  479. <div class="outline-text-2" id="text-8">
  480. <p>
  481. A common problem of subgradient methods is that if R is un-differentiable, the iterates are rarely at the point of non-differentiablity.
  482. However these points are the true minima of the function.
  483. </p>
  484. \begin{align*}
  485. w_{t+1,j} & = sign(w_{t+\frac{1}{2},j}) [w_{t+\frac{1}{2},j} - \lambda]_{+} \tag{soft thresholding operator}\\
  486. & = sign(w_{t,j} - \eta g) [w_{t+\frac{1}{2}} - \lambda]_{+}
  487. \end{align*}
  488. <p>
  489. This can lead to sparser solutions (explain why?)
  490. </p>
  491. </div>
  492. </div>
  493. <div id="outline-container-sec-9" class="outline-2">
  494. <h2 id="sec-9"><span class="section-number-2">9</span> Naive L1 Regularization with Subgradient Descent</h2>
  495. <div class="outline-text-2" id="text-9">
  496. \begin{align*}  
  497. \min \sum_i L(w,z_i) + \lambda \|w\|_1 \\
  498. w_{t+1} = w_t - \eta \nabla L(w_t,z_i) - \eta \lambda sgn(w_t) \tag{naive stochastic subgradient descent}\\
  499. \end{align*}
  500. </div>
  501. </div>
  502.  
  503. <div id="outline-container-sec-10" class="outline-2">
  504. <h2 id="sec-10"><span class="section-number-2">10</span> Choice of B</h2>
  505. </div>
  506. </div>
  507. <div id="postamble" class="status">
  508. <p class="author">Author: Shyam Upadhyay</p>
  509. <p class="date">Created: 2014-08-28 Thu 11:07</p>
  510. <p class="creator"><a href="http://www.gnu.org/software/emacs/">Emacs</a> 24.3.1 (<a href="http://orgmode.org">Org</a> mode 8.2.6)</p>
  511. <p class="validation"><a href="http://validator.w3.org/check?uri=referer">Validate</a></p>
  512. </div>
  513. </body>
  514. </html>
Advertisement
Add Comment
Please, Sign In to add comment
Advertisement