Vlad's Blog (Vlad Feinberg)
https://vlad17.github.io/
Sun, 10 Mar 2019 02:00:45 +0000

BERT, Part 1: Deep Learning Intro

<h1 id="a-modeling-introduction-to-deep-learning">A Modeling Introduction to Deep Learning</h1>
<p>In this post, I’d like to introduce you to some basic concepts of deep learning (DL) from a modeling perspective. I’ve tended to stay away from “intro” style blog posts because:</p>
<ul>
<li>There are so, so many of them.</li>
<li>They’re hard to keep in focus.</li>
</ul>
<p>That said, I was presenting on <a href="https://arxiv.org/abs/1810.04805">BERT</a> for a discussion group at work. This was our first DL paper, so I needed to warm-start a technical audience with a no-frills intro to modeling with deep nets. So here we are. To keep this post focused:</p>
<ul>
<li>It will presume a technically sophisticated reader.</li>
<li>No machine learning (ML) background is assumed.</li>
<li>The main goal is to set the stage for future discussion about BERT.</li>
</ul>
<p>Basically, this is me typing up those notes. Note the above leaves questions about optimization and generalization squarely out of scope.</p>
<h2 id="the-parametric-model">The Parametric Model</h2>
<p>Deep learning is a tool for the generic task of parametric modeling. Parametric modeling (PM) is a term I am generously applying from statistical estimation theory that encapsulates a broad variety of ML buzzwords, including supervised, unsupervised, reinforcement, and transfer learning.</p>
<p>In the most general sense, a parametric model \(M\) accepts some vector of parameters \(\theta\) and describes some structure in a random process. Goodness, what does that mean?</p>
<ul>
<li>Structure in a random process is everything that differentiates it from noise. But what’s “noise”?</li>
<li>When we fix the model \(M\), we’re basically saying there’s only some classes of structure we’re going to represent, and everything else is what we consider noise.</li>
<li>The goal is to pick a “good” model and find parameters for it.</li>
</ul>
<h3 id="a-simple-example">A Simple Example</h3>
<p>For instance, let’s take a simple random process, iid draws from the normal distribution \(z\sim \mathcal{D}= N(\mu, \sigma^2)\) with an unknown mean \(\mu\) and variance \(\sigma^2\). We’re going to try to capture the richest possible structure over \(z\): its actual distribution. One model might be the unit normal, \(M(\theta)=N(\theta, 1)\). Then our setup, and potential sources of error, look like this:</p>
<p><img src="/assets/2019-03-09-dl-intro/model-err.png" alt="sources of error" class="center-image" /></p>
<p>What I call parametric and model mismatch are also known as estimation and approximation error (<a href="https://papers.nips.cc/paper/3323-the-tradeoffs-of-large-scale-learning">Bottou and Bousquet 2007</a>).</p>
<p>Here, we have one of the most straightforward instances of PM, parameter estimation (we’re trying to estimate \(\mu\)).</p>
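<p>To make this concrete, here is a minimal numpy sketch of the setup (the values of mu and sigma are made up for illustration; in practice they are unknown):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 3.0  # hypothetical "unknown" truth, fixed for illustration
z = rng.normal(mu, sigma, size=100_000)

# Model M(theta) = N(theta, 1): the mean is the only free parameter.
# Estimating theta by the sample mean drives the parametric
# (estimation) error toward zero as we see more data...
theta_hat = z.mean()
print(abs(theta_hat - mu))  # small

# ...but the model (approximation) error is baked in: M(theta) has unit
# variance and can never match the true variance sigma**2 = 9.
```

<p>No amount of data fixes the second kind of error; only changing the model \(M\) can.</p>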
<h3 id="revisiting-our-definitions">Revisiting our definitions</h3>
<p>What constitutes a “good” model? Above, we probably want to call models with \(\theta\) near \(\mu\) good ones. But in other cases, it’s not so obvious what makes a good model.</p>
<p>One of the challenges in modeling in general is articulating what we want. This is done through a loss function \(\ell\), where we want models with small losses. In other words, we’d like to find a model \(M\) and related parameters \(\theta\) where
\[
\E_{z\sim \mathcal{D}}\ha{\ell(z, M(\theta))}
\]
is as small as possible (here, for our iid process). Note that this doesn’t have to be the same loss function we use during optimization to find \(\theta\), and there are several reasons to use a different one, but that’s another discussion.</p>
<h3 id="another-example">Another Example</h3>
<p>Now let’s jump into another modeling task, supervised learning. Here:</p>
<ul>
<li>Our iid random process \(\mathcal{D}\) will be generating pairs \(\pa{\text{some image}, \text{“cat” or “dog”}}\).</li>
<li>The structure we want to capture is that all images of dogs happen to be paired with the label \(\text{“dog”}\) and analogously so for cats.</li>
<li>We’ll gloss over what our model is for now.</li>
</ul>
<p>A loss that captures what we want for our desired structure would be the <em>zero-one loss</em>, which is \(1\) when we’re wrong and \(0\) when we’re right. Let’s fix some model and parameters that take an image and label it as a cat or dog (so \(M(\theta)\) is a <em>function</em> itself), and then let’s see how it does on our loss function.</p>
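<p>As a tiny sketch (the labels and predictions here are made up), evaluating the zero-one loss is just averaging a mismatch indicator over samples:</p>

```python
import numpy as np

# Hypothetical draws from our (image, label) process; the images are
# elided and we score a made-up classifier's outputs directly.
labels = np.array(["cat", "dog", "dog", "cat", "dog"])
preds = np.array(["cat", "dog", "cat", "cat", "dog"])

zero_one = (preds != labels).astype(float)  # 1 when wrong, 0 when right
print(zero_one.mean())  # empirical expected loss: 0.2
```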
<p><img src="/assets/2019-03-09-dl-intro/losses.png" alt="zero-one loss examples" class="center-image" /></p>
<h2 id="ok-so-why-deep-learning">OK, so why Deep Learning?</h2>
<p>This post was intentionally structured in a way that takes the attention away from DL. DL is a means to achieve the above PM goals, a means to an end; being able to reason about higher-level modeling concerns is crucial to understanding the tool.</p>
<p>So, DL is an approach to building models \(M\) and it studies how to find good parameters \(\theta\) for those models.</p>
<h3 id="deep-learning-models">Deep Learning Models</h3>
<p>A DL model is anything that vaguely resembles the prototype below: many parameterized functions composed together to create one larger function.</p>
<p>A function is usually good enough to capture most of the structure we’re interested in from random processes, given sufficiently sophisticated inputs and outputs. The inputs and outputs to this function can be (not exhaustive):</p>
<ul>
<li>fixed-width multidimensional arrays (casually known as tensors, sort of)</li>
<li>embeddings (numerical translations) of categories (like all the words in the English dictionary)</li>
<li>variable width tensors</li>
</ul>
<p>The parameters this function takes (which differ from its inputs and affect what the function looks like) are fixed-width tensors. I haven’t seen variable-width parameters in DL models, except in some Bayesian interpretations (<a href="https://www.cs.toronto.edu/~hinton/absps/colt93.pdf">Hinton 1993</a>).</p>
<h3 id="the-multi-layer-perceptron">The Multi-Layer Perceptron</h3>
<p>Our prototypical example of a neural network is the Multi-Layer Perceptron, or MLP, which takes a numerical vector input to a numerical vector output. For a parameter vector \(\theta=\mat{\theta_1& \theta_2&\cdots&\theta_L}\), which contains parameters for our \(L\) layers, an MLP looks like:
\[
M(\theta)= x\mapsto f_{\theta_L}^{(L)}\circ f_{\theta_{L-1}}^{(L-1)}\circ\cdots\circ f_{\theta_1}^{(1)}(x)\,,
\]
and we define each layer as
\[
f_{\theta_i}^{(i)}(x)=\max(0, W_ix+b_i)\,.
\]
The parameters \(W_i, b_i\) are set by the contents of \(\theta_i\).</p>
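<p>A minimal numpy sketch of this forward pass (layer sizes are made up for illustration):</p>

```python
import numpy as np

def mlp(thetas, x):
    """Apply L = len(thetas) layers; thetas[i] holds (W_i, b_i)."""
    for W, b in thetas:
        x = np.maximum(0.0, W @ x + b)  # linear transform, then ReLU
    return x

rng = np.random.default_rng(0)
# A made-up 3-layer MLP mapping R^4 -> R^8 -> R^8 -> R^2.
dims = [4, 8, 8, 2]
thetas = [(rng.standard_normal((dout, din)), rng.standard_normal(dout))
          for din, dout in zip(dims, dims[1:])]

y = mlp(thetas, rng.standard_normal(4))
print(y.shape)  # (2,)
```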
<p>This is the functional form of linear transforms followed by nonlinearities. It describes what’s going on in this image:</p>
<p><img src="/assets/2019-03-09-dl-intro/mlpi.png" alt="multi-layer perceptron" class="center-image" /></p>
<h3 id="why-dl">Why DL?</h3>
<p>While it might be believable that functions in general make for great models that could capture structure in a lot of phenomena, why have these particular parameterizations of functions taken off recently?</p>
<p>This is basically the only part of this post that has to do with DL, and most of it’s out of scope.</p>
<p>In my opinion, it boils down to three things.</p>
<p>Deep learning is simultaneously:</p>
<ul>
<li>Flexible: it can represent many different functions for a fixed parameter size.</li>
<li>Efficient: it lets us find so-called low-loss estimates of \(\theta\) fairly quickly.</li>
<li>Generalizable: it has working regularization strategies.</li>
</ul>
<h4 id="flexibility">Flexibility</h4>
<p>The MLP format above might seem strange, but this linearity-followed-by-non-linearity happens to be particularly expressive, in terms of the number of different functions we can represent with a small set of parameters.</p>
<p>The fact that a sufficiently wide neural network can well-approximate smooth functions is well known (<a href="https://en.wikipedia.org/wiki/Universal_approximation_theorem">Universal Approximation Theorem</a>), but what’s of particular interest is how linear increases in depth to a network exponentially increase its expressiveness (<a href="https://arxiv.org/abs/1402.1869">Montúfar, et al 2014</a>).</p>
<p><img src="/assets/2019-03-09-dl-intro/montufar2014.png" alt="expressiveness" class="center-image" /></p>
<p>An image from the cited work above demonstrates how composition with non-linearities increases expressiveness. Here, with an absolute value nonlinearity, we can reflect the input space on itself through composition. This means we double the number of linear regions in our neural net by adding a layer.</p>
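<p>A one-dimensional analogue of this folding (a made-up sketch, not the paper’s exact construction) is the tent map, expressible with a single absolute-value nonlinearity; each composition folds \([0,1]\) onto itself and doubles the number of linear pieces:</p>

```python
import numpy as np

def tent(t):
    # One absolute-value "layer": folds [0, 1] onto itself.
    return 1.0 - np.abs(2.0 * t - 1.0)

x = np.linspace(0.0, 1.0, 100_001)
y = x
for _ in range(3):  # compose three "layers"
    y = tent(y)

# Count linear pieces by counting jumps in the numerical slope.
slopes = np.diff(y) / np.diff(x)
pieces = 1 + int(np.sum(np.abs(np.diff(slopes)) > 1e-3))
print(pieces)  # 2^3 = 8 linear pieces after 3 compositions
```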
<h4 id="efficiency">Efficiency</h4>
<p>One of the papers that kicked off the DL craze was AlexNet (<a href="https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks">Krizhevsky et al. 2012</a>), and one of the reasons for its success was that we could efficiently compute the value of a neural network \(M(\theta)\) on a particular image \(x\) using specialized hardware.</p>
<p>Not only does the simple composition of simple functions enable fast <em>forward</em> computation of the model value \(M(\theta)(x)\), but because the operations can be expressed as a directed acyclic graph of almost-everywhere differentiable functions, one can quickly compute <em>reverse</em> automatic derivatives \(\partial_\theta M(\theta)(x)\) in just about the same amount of time.</p>
<p>This is a very happy coincidence. We can compute the functional value of a neural net and its derivative in time linear in the parameter size, and we have a lot of parameters. Here, efficiency matters a lot for the inner loop of the optimization (which uses derivatives with SGD) to find “good” parameters \(\theta\). This efficiency, in turn, enabled a lot of successful research.</p>
<h4 id="generalization">Generalization</h4>
<p>Finally, neural networks generalize well. This means that given a training set of examples, they are somehow able to have low loss on unseen examples coming from the same random process, just by training on a (possibly altered, or regularized) loss from given examples.</p>
<p>This is particularly counterintuitive for nets due to their expressivity, which is typically at odds with generalization with traditional ML analyses.</p>
<p><a href="https://arxiv.org/abs/1611.03530">Many</a> <a href="https://arxiv.org/abs/1710.05468">theories</a> <a href="https://arxiv.org/abs/1705.05502">for</a> <a href="https://arxiv.org/abs/1503.02406">why</a> <a href="https://arxiv.org/abs/1711.01530">this</a> <a href="https://arxiv.org/abs/1710.09553">occurs</a> have been proposed, but none of them are completely satisfying yet.</p>
<h2 id="next-time">Next time</h2>
<ol>
<li>We’ll review the Transformer, and what it does.</li>
<li>That’ll set us up for some BERT discussion.</li>
</ol>
Sat, 09 Mar 2019 00:00:00 +0000
https://vlad17.github.io/2019/03/09/dl-intro.html
Tags: deep-learning

Numpy Gems, Part 1

<h1 id="numpy-gems-1-approximate-dictionary-encoding-and-fast-python-mapping">Numpy Gems 1: Approximate Dictionary Encoding and Fast Python Mapping</h1>
<p>Welcome to the first installment of <em>Numpy Gems</em>, a deep dive into a library that probably shaped python itself into the language it is today, <a href="http://www.numpy.org/">numpy</a>.</p>
<p>I’ve spoken <a href="https://nbviewer.jupyter.org/github/vlad17/np-learn/blob/master/presentation.ipynb">extensively</a> on numpy (<a href="https://news.ycombinator.com/item?id=15996077">HN discussion</a>), but I think the library is full of delightful little gems that enable perfect instances of API-context fit, the situation where interfaces and algorithmic problem contexts fall in line oh-so-nicely and the resulting code is clean, expressive, and efficient.</p>
<h2 id="what-is-dictionary-encoding">What is dictionary encoding?</h2>
<p>A dictionary encoding is an efficient way of representing data with lots of repeated values. For instance, take the <a href="https://grouplens.org/datasets/movielens/">MovieLens dataset</a>, which contains a list of ratings for a variety of movies.</p>
<p><img src="/assets/2019-01-19-numpy-gems-1/joined.png" alt="movielens movies" class="center-image" /></p>
<p>But the dataset only has around 27K distinct movies for over 20M ratings. Since the average movie is rated around 700 times, it doesn’t make much sense to represent the list of movies for each rating as an array of strings. There are a lot of needless copies. If we’re trying to build a recommendation engine, then a key part of training is going to involve iterating over these ratings. With so much extra data being transferred between RAM and cache, we’re just asking for our bandwidth to be saturated. Not to mention the gross overuse of RAM in the first place.</p>
<p>That’s why this dataset actually comes with <code class="highlighter-rouge">movieId</code>s, and then each rating refers to a movie through its identifier. Then we store a “dictionary” mapping movie identifiers to movie names and their genre metadata. This solves our problems: no more duplication and much less memory use.</p>
<p>That’s basically it. It’s a very simple encoding, which makes it easy to integrate efficiently in many algorithms. So much so, that many, many libraries natively support dictionary encoding your data–see factors in <a href="https://www.stat.berkeley.edu/~s133/factors.html">R</a> and <a href="https://pandas.pydata.org/pandas-docs/stable/categorical.html">pandas</a>.</p>
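<p>In numpy itself, a dictionary encoding is one call away via <code class="highlighter-rouge">np.unique</code> with <code class="highlighter-rouge">return_inverse</code> (titles here are made up):</p>

```python
import numpy as np

movies = np.array(["Heat", "Fargo", "Heat", "Heat", "Fargo", "Casino"])

# `categories` is the deduplicated dictionary; `codes` indexes into it.
categories, codes = np.unique(movies, return_inverse=True)
print(categories)  # ['Casino' 'Fargo' 'Heat']
print(codes)       # [2 1 2 2 1 0]

assert (categories[codes] == movies).all()  # round-trips losslessly
```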
<h2 id="why-approximate">Why approximate?</h2>
<p>Let’s run with our example. Suppose we have a list of our movie titles, and we’re doing some NLP on them for better recommendations. Usually, that means each of these movies corresponds to some kind of encoding.</p>
<p><img src="/assets/2019-01-19-numpy-gems-1/titles.png" alt="titles" class="center-image" /></p>
<p>Let’s use the built-in pandas categorical dtype, which is a dictionary encoding.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>len(titles) # ---> 20000263
cat_titles = titles.astype(
pd.api.types.CategoricalDtype(
pd.unique(titles)))
len(cat_titles.cat.categories) # ---> 9260
len(cat_titles.cat.codes) # ---> 20000263
</code></pre></div></div>
<p>This stores our data as a densely packed array of integers, the codes, which index into the categories array, now a much smaller array of 9K deduplicated strings. Still, if our movie titles correspond to giant floating-point encodings, we’ll end up shuffling a bunch of memory around. Maybe 9K doesn’t sound so bad to you, but what if we had a larger dataset? Bear with this smaller one for demonstration purposes.</p>
<p>A key observation is that, like most datasets, we’ll observe a power-law like distribution of popularity:</p>
<p><img src="/assets/2019-01-19-numpy-gems-1/movie-popularity.png" alt="movie popularity" class="center-image" /></p>
<p>What this means is that we have a long tail of obscure movies that we just don’t care about. In fact, if we’re OK dropping 5% coverage, which won’t affect our performance too much, we can save a bunch of space.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cdf = counts_desc.cumsum() / counts_desc.sum()
np.searchsorted(cdf, [.95, .99, .999, 1])
# ---> array([3204, 5575, 7918, 9259])
</code></pre></div></div>
<p>Indeed, it looks like dropping the 5% least-popular movies corresponds to needing to support only 1/3 as many movies overall! This can be a huge win, especially if your model considers higher-order interactions (if you like movie X and movie Y, then you might like movie Z). For such third-order interactions, that 1/3 becomes 1/27!</p>
<h2 id="how-to-approximate">How to approximate?</h2>
<p>However, if we’re being asked to serve model predictions online or want to train a “catch-all” encoding, then we still need to have a general catch-all “movie title” corresponding to the unknown situation. We have a bunch of dictionary indices in <code class="highlighter-rouge">[0, d)</code>, like <code class="highlighter-rouge">[1, 3, 5, 2, 6, 1, 0, 11]</code>. In total we have <code class="highlighter-rouge">n</code> of these. We also have a list of <code class="highlighter-rouge">e</code> items we actually care about in our approximate dictionary, say <code class="highlighter-rouge">[5, 8, 10, 11]</code>, but this might not be a contiguous range.</p>
<p>What we want is an approximate dictionary encoding with a catch-all, namely we want to get a list of <code class="highlighter-rouge">n</code> numbers between <code class="highlighter-rouge">0</code> and <code class="highlighter-rouge">e</code>, with <code class="highlighter-rouge">e</code> being the catch all.</p>
<p>In the above example, <code class="highlighter-rouge">n = 8, d = 12, e = 4</code>, and the correct result array is <code class="highlighter-rouge">[4, 4, 0, 4, 4, 4, 4, 3]</code>. For something like embeddings, it’s clear how this is useful in greatly reducing the number of things we need to represent.</p>
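<p>In plain Python, the worked example above is just a lookup with a default:</p>

```python
# n = 8, d = 12, e = 4, as in the worked example.
dindices = [1, 3, 5, 2, 6, 1, 0, 11]
selected = [5, 8, 10, 11]
e = len(selected)

# Map a selected index to its position in `selected`; everything else
# gets the catch-all code e.
lookup = {v: i for i, v in enumerate(selected)}
result = [lookup.get(i, e) for i in dindices]
print(result)  # [4, 4, 0, 4, 4, 4, 4, 3]
```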
<h2 id="the-gem">The Gem</h2>
<p>The above is actually an instance of a translation problem, in the sense that we have some translation mapping from <code class="highlighter-rouge">[0, d)</code> into <code class="highlighter-rouge">[0, e]</code> and we’d like to apply it to every item in the array. Like many things in python, this is most efficient when pushed to C. Indeed, for strings, there’s <a href="https://docs.python.org/3/library/stdtypes.html#str.translate">translate</a> that does this.</p>
<p>We’ll consider two dummy distributions, which will either be extremely sparse (<code class="highlighter-rouge">d > n</code>) or more typical (<code class="highlighter-rouge">d <= n</code>). Both kinds show up in real life.
We extract the most popular <code class="highlighter-rouge">e</code> of these items (or maybe we have some other metric, not necessarily popularity, that extracts these items of interest).
There are more efficient ways of doing the below, but we’re just setting up.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import Counter
import numpy as np

if d < n:
    dindices = np.random.geometric(p=0.01, size=(n - d)) - 1
    dindices = np.concatenate([dindices, np.arange(d)])
    dcounts = np.bincount(dindices)
    selected = dcounts.argsort()[::-1][:e]
else:
    dindices = np.random.choice(d, n // 2)
    frequent = np.random.choice(n, n - n // 2)
    dindices = np.concatenate([dindices, frequent])
    c = Counter(dindices)
    selected = np.asarray(sorted(c, key=c.get, reverse=True)[:e])
selected.sort()  # np.searchsorted later requires sorted keys
</code></pre></div></div>
<p>Let’s look at the obvious implementation. We’d like to map contiguous integers, so let’s implement a mapping as an array, where the array value at an index is the mapping’s value for that index as input. This is the implementation that pandas uses under the hood when you ask it to change its categorical values.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mapping = np.full(d, e)
mapping[selected] = np.arange(e)
result = np.take(mapping, dindices)
</code></pre></div></div>
<p>As can be seen from the code, we’re going to get burned when <code class="highlighter-rouge">d</code> is large, and we can’t take advantage of the fact that <code class="highlighter-rouge">e</code> is small. These benchmarks, performed with <code class="highlighter-rouge">%%memit</code> and <code class="highlighter-rouge">%%timeit</code> jupyter magics on fresh kernels each run, back this sentiment up.</p>
<table class="table table-bordered">
<thead>
<tr>
<th><code class="highlighter-rouge">e</code></th>
<th><code class="highlighter-rouge">d</code></th>
<th><code class="highlighter-rouge">n</code></th>
<th>memory (MiB)</th>
<th>time (ms)</th>
</tr>
</thead>
<tbody>
<tr>
<td><code class="highlighter-rouge">10^3</code></td>
<td><code class="highlighter-rouge">10^4</code></td>
<td><code class="highlighter-rouge">10^8</code></td>
<td>763</td>
<td>345</td>
</tr>
<tr>
<td><code class="highlighter-rouge">10^3</code></td>
<td><code class="highlighter-rouge">10^6</code></td>
<td><code class="highlighter-rouge">10^6</code></td>
<td>11</td>
<td>9.62</td>
</tr>
<tr>
<td><code class="highlighter-rouge">10^3</code> </td>
<td><code class="highlighter-rouge">10^8</code> </td>
<td><code class="highlighter-rouge">10^4</code> </td>
<td>763</td>
<td>210</td>
</tr>
<tr>
<td><code class="highlighter-rouge">10</code></td>
<td><code class="highlighter-rouge">10^4</code></td>
<td><code class="highlighter-rouge">10^8</code></td>
<td>763</td>
<td>330</td>
</tr>
<tr>
<td><code class="highlighter-rouge">10</code></td>
<td><code class="highlighter-rouge">10^6</code></td>
<td><code class="highlighter-rouge">10^6</code></td>
<td>11</td>
<td>9.66</td>
</tr>
<tr>
<td><code class="highlighter-rouge">10</code></td>
<td><code class="highlighter-rouge">10^8</code></td>
<td><code class="highlighter-rouge">10^4</code></td>
<td>763</td>
<td>210</td>
</tr>
</tbody>
</table>
<p>This brings us to our first puzzle and numpy gem. How can we re-write this to take advantage of small <code class="highlighter-rouge">e</code>? The trick is to use a sparse representation of our mapping, namely just <code class="highlighter-rouge">selected</code>. We can look in this mapping very efficiently, thanks to <code class="highlighter-rouge">np.searchsorted</code>. Then with some extra tabulation (using <code class="highlighter-rouge">-1</code> as a sentinel value), all we have to ask is where in <code class="highlighter-rouge">selected</code> a given index from <code class="highlighter-rouge">dindices</code> was found.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code># note: np.searchsorted requires `selected` to be sorted
searched = np.searchsorted(selected, dindices)
selected2 = np.append(selected, [-1])  # -1 is a sentinel no real index matches
searched[selected2[searched] != dindices] = -1  # mark misses
searched[searched == -1] = e  # misses map to the catch-all code
result = searched
</code></pre></div></div>
<p>A couple of interesting things happen here: we switch our memory usage from linear in <code class="highlighter-rouge">d</code> to linear in <code class="highlighter-rouge">n</code>, and completely adapt our algorithm to being insensitive to a high number of unpopular values. Certainly, this performs horribly when <code class="highlighter-rouge">d</code> is small relative to <code class="highlighter-rouge">n</code>, where the mapping above is the clear way to go, but the benchmarks expose an interesting tradeoff frontier:</p>
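<p>As a sanity check (a small sketch; it keeps <code class="highlighter-rouge">selected</code> sorted, which <code class="highlighter-rouge">np.searchsorted</code> requires), the two implementations agree:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, e, n = 1000, 10, 10_000
selected = np.sort(rng.choice(d, size=e, replace=False))  # must stay sorted
dindices = rng.integers(0, d, size=n)

# Dense-mapping approach: O(d) memory.
mapping = np.full(d, e)
mapping[selected] = np.arange(e)
dense = np.take(mapping, dindices)

# searchsorted approach: O(n + e) memory.
searched = np.searchsorted(selected, dindices)
selected2 = np.append(selected, [-1])  # sentinel no real index matches
searched[selected2[searched] != dindices] = e  # misses get the catch-all
sparse = searched

assert (dense == sparse).all()
```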
<table class="table table-bordered">
<thead>
<tr>
<th><code class="highlighter-rouge">e</code></th>
<th><code class="highlighter-rouge">d</code></th>
<th><code class="highlighter-rouge">n</code></th>
<th>memory (MiB)</th>
<th>time (ms)</th>
</tr>
</thead>
<tbody>
<tr>
<td><code class="highlighter-rouge">10^3</code> </td>
<td><code class="highlighter-rouge">10^4</code> </td>
<td><code class="highlighter-rouge">10^8</code> </td>
<td>1546</td>
<td>5070</td>
</tr>
<tr>
<td><code class="highlighter-rouge">10^3</code></td>
<td><code class="highlighter-rouge">10^6</code></td>
<td><code class="highlighter-rouge">10^6</code></td>
<td>13</td>
<td>31</td>
</tr>
<tr>
<td><code class="highlighter-rouge">10^3</code></td>
<td><code class="highlighter-rouge">10^8</code></td>
<td><code class="highlighter-rouge">10^4</code></td>
<td>0.24</td>
<td>0.295</td>
</tr>
<tr>
<td><code class="highlighter-rouge">10</code></td>
<td><code class="highlighter-rouge">10^4 </code></td>
<td><code class="highlighter-rouge">10^8</code></td>
<td>1573</td>
<td>1940</td>
</tr>
<tr>
<td><code class="highlighter-rouge">10</code></td>
<td><code class="highlighter-rouge">10^6 </code></td>
<td><code class="highlighter-rouge">10^6</code></td>
<td>13</td>
<td>17</td>
</tr>
<tr>
<td><code class="highlighter-rouge">10</code></td>
<td><code class="highlighter-rouge">10^8 </code></td>
<td><code class="highlighter-rouge">10^4</code></td>
<td>0.20</td>
<td>0.117</td>
</tr>
</tbody>
</table>
<p><a href="/assets/2019-01-19-numpy-gems-1/numpy-gems-1.ipynb">Link to benchmarks.</a></p>
Sat, 19 Jan 2019 00:00:00 +0000
https://vlad17.github.io/2019/01/19/numpy-gems-1.html
Tags: hardware-acceleration, tools, numpy-gems

Subgaussian Concentration

<h1 id="subgaussian-concentration">Subgaussian Concentration</h1>
<p>This is a quick write-up of a brief conversation I had with Nilesh Tripuraneni and Aditya Guntuboyina a while ago that I thought others might find interesting.</p>
<p>This post focuses on the interplay between two types of concentration inequalities. Concentration inequalities usually describe how some random quantity \(X\) stays near a constant \(c\) (henceforth, \(c\) will be our stand-in for some constant which possibly changes equation-to-equation). Basically, we can quantify how infrequent a divergence \(t\) of \(X\) from \(c\) is with some rate \(r(t)\) which vanishes as \(t\rightarrow\infty\).</p>
<p>\[
\P\pa{\abs{X-c}>t}\le r(t)\,.
\]</p>
<p>In fact, going forward, if \(r(t)=c'\exp(-c'' O(g(t)))\), we’ll say \(X\) <em>concentrates about</em> \(c\) <em>in rate</em> \(g(t)\).</p>
<p>Subgaussian (sg) random variables (rvs) with parameter \(\sigma^2\) exhibit a strong form of this. They have zero mean and concentrate in rate \(t^2/\sigma^2\).
Equivalently, we may write \(X\in\sg(\sigma^2)\). Subgaussian rvs decay quickly because of a characteristic bound on their moment generating function. In particular, \(X\) is subgaussian if for all \(\lambda\), the following holds:
\[
\E\exp\pa{\lambda X}\le \exp\pa{\frac{1}{2}\lambda^2\sigma^2}\,.
\]</p>
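<p>As a quick numerical illustration, a Rademacher variable (\(\pm 1\) with equal probability) is \(\sg(1)\): its moment generating function is \(\cosh\lambda\), which sits below \(\exp(\lambda^2/2)\) everywhere:</p>

```python
import numpy as np

# E exp(lam * X) = cosh(lam) for Rademacher X; check the sg(1) bound
# cosh(lam) <= exp(lam**2 / 2) on a grid of lambda values.
lam = np.linspace(-10.0, 10.0, 1001)
assert (np.cosh(lam) <= np.exp(lam**2 / 2)).all()
```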
<p>On the other hand, suppose we have \(n\) independent (indep) bounded (bdd) rvs \(X=\ca{X_i}_{i=1}^n\) and a Lipschitz function \(f\) that’s convex (cvx) in each one. Note that being cvx in each variable isn’t so bad; for instance, the low-rank matrix completion loss \(\norm{A-UV^\top}^2\) is cvx in each of \(U, V\) separately. Then by Boucheron, Lugosi, and Massart (BLM) Thm. 6.10 (p. 180), \(f(X)\) concentrates about its mean quadratically.</p>
<p>This is pretty damn spiffy. You get a <em>function</em> that’s nothing but a little <a href="https://en.wikipedia.org/wiki/Jensen%27s_inequality">monotonic in averages</a>, and depends on a bunch of different knobs. Said knobs spin independently, and somehow this function behaves as if it were <a href="https://en.wikipedia.org/wiki/Talagrand%27s_concentration_inequality">basically constant</a>. This one isn’t a deep property of some distribution, like sg rvs, but rather a deep property of smooth functions on product measures.</p>
<h2 id="a-little-motivation">A Little Motivation</h2>
<p>Concentration lies at the heart of machine learning. For instance, take the well-known probably approximately correct (PAC) learning framework–it’s old, yes, and has been superseded by more generic techniques, but it still applies to simple classifiers we know and love. At its core, it seems to be making something analogous to a counting argument:</p>
<ol>
<li>The set of all possible classifiers is small by assumption.</li>
<li>Since there aren’t many classifiers overall, there can’t be many crappy classifiers.</li>
<li>Crappy classifiers have a tendency of fucking up on random samples of data (like our training set).</li>
<li>Therefore any solution we find that nails our training set is likely not crap (i.e., probably approximately correct).</li>
</ol>
<p>However, this argument can be viewed from a different lens, one which exposes machinery that underlies much more expressive theories about learning like M-estimation or empirical process analysis.</p>
<ol>
<li>The <em>generalization error</em> of our well-trained classifier is no more than twice the worst <em>generalization gap</em> (difference between training and test errors) in our hypothesis class (symmetrization).</li>
<li>For large sample sizes, this gap vanishes because training errors concentrate around the test errors (concentration).</li>
</ol>
<p>For this reason, being able to identify when a random variable (such as a classifier’s generalization gap, before we see its training dataset) concentrates is useful.</p>
<h2 id="ok-get-to-the-point">OK, Get to the Point</h2>
<p>Now that we’ve established why concentration is interesting, I’d like to present the conversation points. Namely, we have a general phenomenon, the <a href="https://en.wikipedia.org/wiki/Concentration_of_measure">concentration of measure</a>.</p>
<p>Recall the concentration of measure result from above: a convex, Lipschitz function \(f\) of independent bounded variables is basically constant. However, these are onerous conditions.</p>
<p>To some degree, these conditions can be weakened. For starters, convexity need only be quasi-convexity. The Wikipedia article is a bit nebulous, but the previously linked <a href="https://en.wikipedia.org/wiki/Talagrand%27s_concentration_inequality">Talagrand’s Inequality</a> can be used to weaken this requirement (BLM Thm. 7.12, p. 230).</p>
<p>Still:</p>
<ol>
<li>One can imagine that for a function that’s not necessarily globally Lipschitz, but instead just coordinate-wise Lipschitz, we can still give some guarantees.</li>
<li>Why do we need bounded random variables? Perhaps variables that are <em>effectively</em> bounded most of the time are good enough.</li>
</ol>
<p>Our goal here will be to see if there are smooth ways of relaxing the conditions above and framing the concentration rates \(r(t)\) in terms of these relaxations.</p>
<h3 id="coordinate-sensitivity-and-bounded-differences">Coordinate Sensitivity and Bounded Differences</h3>
<p>The concentration of measure bounds above rely on a global Lipschitz property: no matter which way you go, the function \(f\) must lie in a slope-bounded double cone, which can be centered at any of its points; this can be summarized by the property that our \(f:\R^n\rightarrow\R\) satisfies \(\abs{f(\vx)-f(\vy)}\le L\norm{\vx-\vy}\) for all \(\vx,\vy\).</p>
<p><img src="/assets/2018-12-22-subgaussian-concentration/lipschitz_continuity.png" alt="lipschitz continuity image" class="center-image" /></p>
<p>Moreover, why does it matter that the preimage metric space of our \(f\) needs to, effectively, be bounded? All that really matters is how the function \(f\) responds to changes in inputs, right?</p>
<p>Here’s where <a href="https://en.wikipedia.org/wiki/Doob_martingale#McDiarmid's_inequality">McDiarmid’s Inequality</a> comes in, which says that so long as we satisfy the bounded difference property, where
\[
\sup_{\vx, \vx^{(i)}}\abs{f(\vx)-f(\vx^{(i)})}\le c_i\,,
\]
holding wherever \(\vx, \vx^{(i)}\) only differ in position \(i\), then we concentrate with rate \(t^2/\sum_ic_i^2\). The proof basically works by computing the distance of \(f(X)\), our random observation, from \(\E f(X)\), the mean, through a series of successive approximations done by changing each coordinate, one at a time. Adding up these approximations happens to give us a martingale, and it turns out these bounded differences have a concentration (<a href="https://en.wikipedia.org/wiki/Hoeffding%27s_inequality">Hoeffding’s</a>) of their own.</p>
<p>Notice how the rate worsens individually according to the constants \(c_i\) in each dimension.</p>
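<p>As a simulation sketch (parameters made up), take \(f\) to be the mean of \(n\) iid \(\mathrm{Uniform}[0,1]\) variables, so each \(c_i = 1/n\) and McDiarmid gives a tail bound of \(2\exp(-2nt^2)\):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 100, 20_000, 0.1

# f = sample mean: changing one coordinate moves f by at most c_i = 1/n,
# so McDiarmid bounds P(|f - E f| > t) by 2 exp(-2 t^2 / sum_i c_i^2).
f = rng.uniform(size=(trials, n)).mean(axis=1)
empirical = np.mean(np.abs(f - 0.5) > t)
bound = 2.0 * np.exp(-2.0 * n * t**2)

print(empirical, bound)  # the empirical tail sits below the bound
assert empirical <= bound
```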
<h3 id="whats-in-the-middle">What’s in the Middle?</h3>
<p>We’ve seen how we can achieve concentration (that’s coordinate-wise sensitive in its bounds) by restricting ourselves to:</p>
<ul>
<li>Well-behaved functions and bounded random inputs (Talagrand’s).</li>
<li>Functions with bounded responses to coordinate change (McDiarmid’s).</li>
</ul>
<p>Can we get rid of boundedness altogether now, relaxing it to the probabilistic “boundedness” that is subgaussian concentration? Well, yes and no.</p>
<h3 id="hows-this-possible">How’s this possible?</h3>
<p><a href="https://arxiv.org/abs/1309.1007">Kontorovich 2014</a> claims concentration for generic Lipschitz functions of subgaussian inputs. At first, this may sound too good to be true. Indeed, a famous counterexample (BLM Problem 6.4, p. 211, which itself refers to Ledoux and Talagrand [LT], p. 25) finds a particular \(f\) where the following holds for sufficiently large \(n\).
\[
\P\ca{f(X)> \E f(X)+cn^{1/4}}\ge 1/4\,.
\]
Technically, the result is shown for the median, not the mean, of \(f\), but by integrating the median concentration inequality for Lipschitz functions of subgaussian variables (LT p. 21), we can state the above, since the mean and median are within a constant of each other (bounded random variables with zero mean are subgaussian).
From the proof (LT, p. 25), \(f(X)\) has rate no better than \(t^2n^{-1/2}\).</p>
<p>Therein lies the resolution for the apparent contradiction: we’re <em>pathologically</em> dependent on the dimension factor.
On the other hand, the bound proven in the aforementioned Kontorovich 2014 paper is that for sg \(X\), we can achieve a concentration rate \(t^2/\sum_i\Delta_{\text{SG}, i}^2\), where \(\Delta_{\text{SG}, i}\) is a subgaussian diameter, which for our purposes is just a constant times \(\sigma_i^2\), the subgaussian parameter for the \(i\)-th position in the \(n\)-dimensional vector \(X\). Taking \(\sigma^2=\max_i\sigma_i^2\), the hidden dimensionality emerges, since in the worst case the Kontorovich rate is \(t^2/(n\sigma^2)\).</p>
<p>The Kontorovich paper is a nice generalization of McDiarmid’s inequality which replaces the boundedness condition with a subgaussian one. We still incur the dimensionality penalty, but we don’t care about this if we’re making a one-dimensional or fixed-\(n\) statement. In fact, the rest of the Kontorovich paper investigates scenarios where this dimensionality term is cancelled out by a shrinking \(\sigma^2\sim n^{-1}\) (in the paper, this is observed for some stable learning algorithms).</p>
<p>In fact, there’s even quite a bit of room between the Kontorovich bound \(t^2/n\) (fixing the sg diameter now) and the counterexample lower bound \(t^2/\sqrt{n}\). This next statement might be made out of my own ignorance, but it seems like there’s still a lot of open space to map out in terms of what rates are possible to achieve in the non-convex case, if we care about the dimension \(n\) (which we do).</p>
<h1 id="references">References</h1>
<ol>
<li>BLM - Boucheron, Lugosi, Massart (2013), Concentration Inequalities</li>
<li>LT - Ledoux and Talagrand (1991), Probability in Banach Spaces</li>
</ol>
Sat, 22 Dec 2018 00:00:00 +0000
https://vlad17.github.io/2018/12/22/subgaussian-concentration.html
https://vlad17.github.io/2018/12/22/subgaussian-concentration.htmlmachine-learningDuplicate Finding<h1 id="duplicate-finding">Duplicate Finding</h1>
<p>I’ve been getting pretty good signal from a CS-fundamentals-type interview question lately. It’s got quite a lot of solutions, and it’s pretty hard, while still admitting decent naive solutions, so it scales well with a variety of candidates.</p>
<p>I’ll leave the merits of brain-teaser interview questions–though I think that’s a mis-characterization here–to HN commenters or another post.</p>
<h2 id="the-question">The Question</h2>
<p>You’re given a mutable array of \(n\) integers. Each one is valued in the range \([0,n-2]\), so there must be at least one duplicate. Return a duplicate.</p>
<h2 id="the-basics">The Basics</h2>
<p>What’s nice here is we can build up a memory/runtime frontier even with some crude asymptotic analysis. Let’s work on an idealized machine where we can process pointers in constant time, but bits still aren’t free (i.e., we can’t treat the integers as being arbitrarily wide).</p>
<p>One immediate solution is “bubble”. Let the input array be <code class="highlighter-rouge">xs</code>:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for x in xs:
    for y in (xs skipping x):
        if x == y:
            return y
</code></pre></div></div>
<p>Nice: \(O(n^2)\) runtime and \(O(1)\) extra space overhead, done. Let’s get less naive. We might notice bubble linearly compares for equality against every element, but a set can check membership among many elements in constant time.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vals = set()
for x in xs:
    if x in vals:
        return x
    vals.add(x)
</code></pre></div></div>
<p>Great: with a hash set we now have \(O(n)\) extra space overhead for a linear-time algorithm. One might ask whether this “linear time” is worst-case or average. A hash set with a doubling strategy that picks prime sizes and an identity hash function would actually avoid the worst-case separate-chaining performance that bad hash functions induce. But if we’re being pedantic, and we are, you’d need to build in an algorithm for generating arbitrarily large primes during resizes. Luckily, we can avoid all this nonsense by switching the <code class="highlighter-rouge">set</code> above to a <code class="highlighter-rouge">bitset</code> for worst-case linear performance.</p>
<p>One might be tempted to encode the <code class="highlighter-rouge">bitset</code> in the first bit (the sign bit) of the <code class="highlighter-rouge">xs</code> array’s entries, but this would still incur linear overhead according to our assumptions. Indeed, these assumptions are somewhat realistic (say the values run up to \(2^{31}\) or beyond while we use 4-byte integers, so the sign bit isn’t actually free).</p>
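<p>For concreteness, here’s a minimal sketch of the bitset variant just described (the helper name is my own; the idea is one bit per possible value, instead of a full word per hash-table entry):</p>

```python
def find_duplicate_bitset(xs):
    # One bit per possible value in [0, n-2]: worst-case linear time,
    # and n - 1 bits (still O(n) by our accounting, but a small constant)
    # of extra space.
    n = len(xs)
    seen = bytearray((n + 7) // 8)
    for x in xs:
        byte, bit = divmod(x, 8)
        if seen[byte] & (1 << bit):
            return x
        seen[byte] |= 1 << bit

print(find_duplicate_bitset([0, 2, 1, 2, 3]))  # prints 2
```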
<p>Now we might ask if there’s a nicer trade-off giving up runtime to save on memory use than going from the hash to the bubble approach. Indeed, we know another data structure that enables equality checks on multiple numbers in sublinear time: the sorted array!</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>xs.sort()
for x, y in zip(xs, xs[1:]):
    if x == y:
        return x
</code></pre></div></div>
<p>Now, what sort would we use? (From an interview perspective, it’s less valuable to see how many sorts you know than whether you know one sort really well, and what it is doing.) What does your chosen language do? If you’re using Python, then you get points for knowing Timsort is a combination of out-of-place mergesort and insertion sort, inheriting worst-case \(O(n)\) extra space and \(O(n\log n)\) time from the former.</p>
<p>Did you use Java? Props to you if you know JDK8 would be applying dual-pivot quicksort on a primitive-type array (bonus: Timsort on non-primitives).</p>
<p>If you choose quicksort, then you should be aware of its worst-case performance and how to mitigate it. One mitigation would be as in C++, to use a heapsort cutoff (“introsort”).</p>
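<p>A toy sketch of that idea (my own illustration, not any standard library’s actual implementation): quicksort with a depth budget of roughly \(2\log_2 n\), heapsorting any range that exhausts it, which bounds the worst case at \(O(n\log n)\):</p>

```python
import heapq
import math

def introsort(xs):
    # Quicksort (Hoare-style partition around the middle element) with a
    # recursion-depth budget; on exhaustion, heapsort the offending range.
    def sort(lo, hi, depth):
        if hi - lo <= 1:
            return
        if depth == 0:
            # depth budget spent: fall back to O(n log n) heapsort
            heap = xs[lo:hi]
            heapq.heapify(heap)
            xs[lo:hi] = [heapq.heappop(heap) for _ in range(hi - lo)]
            return
        pivot = xs[(lo + hi) // 2]
        i, j = lo, hi - 1
        while i <= j:
            while xs[i] < pivot:
                i += 1
            while xs[j] > pivot:
                j -= 1
            if i <= j:
                xs[i], xs[j] = xs[j], xs[i]
                i, j = i + 1, j - 1
        sort(lo, j + 1, depth - 1)
        sort(i, hi, depth - 1)

    sort(0, len(xs), 2 * max(1, math.floor(math.log2(max(2, len(xs))))))
    return xs

print(introsort([5, 3, 1, 4, 1, 5, 9, 2, 6]))  # [1, 1, 2, 3, 4, 5, 5, 6, 9]
```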
<p>Of course, this is all very extra, but someone who’s aware of what’s going on at all layers of the stack is very valuable.</p>
<h2 id="the-advanced">The Advanced</h2>
<p>Now here’s where I tell you there are at least three different ways to achieve \(O(n)\) run time and \(O(1)\) extra memory overhead for this problem.</p>
<p>They are all meaningfully different, and each one has a different runtime profile. Two of the solutions use \(\sim 2n\) array accesses or assignments in the worst case, and one of those has the bonus of being cache-friendly.</p>
<p>For starters, we could just apply radix sort to the sorting solution above. This is kind-of cheating, since radix sort really needs \(O(\log n)\) passes over the data, but we assumed pointers, and therefore indices into the array, are fixed-width, so we should count this solution as allowed.</p>
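<p>To make the radix option concrete, here’s one in-place variant I’d reach for (a sketch, under the fixed-width assumption above): most-significant-bit-first binary radix sort, followed by the same adjacent-pair scan as in the sorting solution.</p>

```python
def radix_sort_inplace(xs, lo=0, hi=None, bit=None):
    # MSD binary radix sort ("binary quicksort"): partition on each bit from
    # most significant down. No auxiliary array; recursion depth is bounded
    # by the bit width, which we treat as a fixed constant.
    if hi is None:
        hi = len(xs)
    if bit is None:
        # values are in [0, n-2] per the problem statement
        bit = max(len(xs) - 2, 1).bit_length() - 1
    if hi - lo <= 1 or bit < 0:
        return xs
    i, j = lo, hi
    while i < j:
        if (xs[i] >> bit) & 1:  # bit set: move element to the back half
            j -= 1
            xs[i], xs[j] = xs[j], xs[i]
        else:
            i += 1
    radix_sort_inplace(xs, lo, i, bit - 1)
    radix_sort_inplace(xs, i, hi, bit - 1)
    return xs

print(radix_sort_inplace([1, 3, 2, 3, 0]))  # [0, 1, 2, 3, 3]
```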
<p>After a few iterations, I’m convinced the most elegant solution to this problem is as follows:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def unwind(xs):
    while xs[-1] != xs[xs[-1]]:
        xs[xs[-1]], xs[-1] = xs[-1], xs[xs[-1]]
    return xs[-1]
</code></pre></div></div>
<p>The proof that the above takes linear time is the same as its correctness proof: every iteration, the number of elements “not in their slot” goes down by one. This count is bounded above by \(n\), so the algorithm must terminate, and when it does, the broken loop condition is proof that a duplicate was found.</p>
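<p>Tracing the loop on a concrete (hypothetical) input makes the invariant visible; the function is repeated here so the block is self-contained:</p>

```python
def unwind(xs):
    # same as the snippet above: keep moving xs[-1]'s value into its own
    # slot until a collision reveals a duplicate
    while xs[-1] != xs[xs[-1]]:
        xs[xs[-1]], xs[-1] = xs[-1], xs[xs[-1]]
    return xs[-1]

# n = 5, values in [0, 3]: the loop swaps 0 into slot 0 and 3 into slot 3,
# then finds xs[-1] == xs[3] == 3
xs = [1, 3, 2, 3, 0]
print(unwind(xs))  # prints 3, a genuine duplicate
```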
<p>I discovered a note by <a href="https://cs.stackexchange.com/questions/95379/">Yuval Filmus</a> which performs a similar computation, but without modifying the array (at the cost of running a cycle algorithm, which would require a few more array accesses).</p>
<p>While these solutions are quite satisfying in an aesthetic sense, they might not be the most performant in practice. For what it’s worth, I have not found the cache-friendly version online yet :)</p>
Sat, 03 Nov 2018 00:00:00 +0000
https://vlad17.github.io/2018/11/03/duplicates.html
https://vlad17.github.io/2018/11/03/duplicates.htmlinterview-questionBeating TensorFlow Training in-VRAM<h1 id="beating-tensorflow-training-in-vram">Beating TensorFlow Training in-VRAM</h1>
<p>In this post, I’d like to introduce a technique that I’ve found helps accelerate mini-batch SGD training in my use case. I suppose this post could also be read as a public grievance directed towards the TensorFlow Dataset API optimizing for the large vision deep learning use-case, but maybe I’m just not hitting the right incantation to get <code class="highlighter-rouge">tf.Dataset</code> working (in which case, <a href="https://github.com/vlad17/vlad17.github.io/issues/new">drop me a line</a>). The solution is to TensorFlow <em>harder</em> anyway, so this shouldn’t really be read as a complaint.</p>
<p>Nonetheless, if you are working with a new-ish GPU that has enough memory to hold a decent portion of your data alongside your neural network, you may find the final training approach I present here useful. The experiments I’ve run fall exactly in line with this “in-VRAM” use case (in particular, I’m training deep reinforcement learning value and policy networks on semi-toy environments, whose training profile is many iterations of training on a small replay buffer of examples). For some more context, you may want to check out an article on the <a href="https://reinforce.io/blog/end-to-end-computation-graphs-for-reinforcement-learning/">TensorForce blog</a>, which suggests that RL people should be building more of their TF graphs like this.</p>
<p>Briefly, if you have a dataset that fits into a GPU’s memory, you’re giving away a lot of speed with the usual TensorFlow pipelining or data-feeding approach, where the CPU delivers mini-batches whose forward/backward passes are computed on GPUs. This gets worse as you move to pricier GPUs, whose relative CPU-GPU bandwidth-to-GPU-speed ratio drops. Pretty easy change for a 2x.</p>
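<p>A quick back-of-envelope (my own arithmetic, using the benchmark sizes from the script) shows how much extra host-to-device traffic per-batch feeding costs before even counting per-call launch latency:</p>

```python
# Host-to-device traffic in float32 for the benchmark sizes used here.
# feed_dict re-ships every mini-batch; an in-graph loop ships the data once.
n, xdim, ydim = 1000 * 1024, 64, 4
nbatches, batch_size = 10_000, 512
bytes_per_float = 4

per_batch = batch_size * (xdim + ydim) * bytes_per_float
feed_dict_gb = nbatches * per_batch / 1e9                 # ~1.39 GB
one_shot_gb = n * (xdim + ydim) * bytes_per_float / 1e9   # ~0.28 GB
print(feed_dict_gb, one_shot_gb)  # 1.39264 0.278528
```

So per-batch feeding moves roughly five times the dataset over the bus, and that ratio only grows with more epochs over the same data.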
<h2 id="punchline">Punchline</h2>
<p>Let’s get to it. With numbers similar to my use case, 5 epochs of training take about <strong>16 seconds</strong> with the standard <code class="highlighter-rouge">feed_dict</code> approach, <strong>12-20 seconds</strong> with the TensorFlow Dataset API, and <strong>8 seconds</strong> with a custom TensorFlow control-flow construct.</p>
<p>This was tested on an Nvidia Tesla P100 with a compiled TensorFlow 1.4.1 (CUDA 9, cuDNN 7), Python 3.5. Here is the <a href="https://gist.github.com/vlad17/5d67eef9fb06c6a679aeac6d07b4dc9c">test script</a>. I didn’t test it too many times (<a href="https://gist.github.com/vlad17/f43dba5783adfc21b1abab520dd2a8f1">exec trace</a>). Feel free to change the data sizes to see if the proposed approach would still help in your setting.</p>
<p>Let’s fix the toy benchmark supervised task we’re looking at:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>
<span class="c"># pretend we don't have X, Y available until we're about</span>
<span class="c"># to train the network, so we have to use placeholders. This is the case</span>
<span class="c"># in, e.g., RL.</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">1234</span><span class="p">)</span>
<span class="c"># suffix tensors with their shape</span>
<span class="c"># n = number of data points, x = x dim, y = y dim</span>
<span class="n">X_nx</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">uniform</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="mi">1000</span> <span class="o">*</span> <span class="mi">1024</span><span class="p">,</span> <span class="mi">64</span><span class="p">))</span>
<span class="n">Y_ny</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">column_stack</span><span class="p">([</span><span class="n">X_nx</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span>
                        <span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">X_nx</span><span class="p">)</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span>
                        <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">X_nx</span><span class="p">)</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span>
                        <span class="n">np</span><span class="o">.</span><span class="n">sin</span><span class="p">(</span><span class="n">X_nx</span><span class="p">)</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)])</span>
<span class="n">nbatches</span> <span class="o">=</span> <span class="mi">10000</span> <span class="c"># == 5 epochs at batch size 512</span>
<span class="n">batch_size</span> <span class="o">=</span> <span class="mi">512</span></code></pre></figure>
<h3 id="vanilla-approach">Vanilla Approach</h3>
<p>This is the (docs-discouraged) approach that everyone really uses for training. Prepare a mini-batch on the CPU, ship it off to the GPU. <em>Note code here and below is excerpted (see the test script link above for the full code). It won’t work if you just copy it.</em></p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># b = batch size</span>
<span class="n">input_ph_bx</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">placeholder</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">float32</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="p">[</span><span class="bp">None</span><span class="p">,</span> <span class="n">X_nx</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]])</span>
<span class="n">output_ph_by</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">placeholder</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">float32</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="p">[</span><span class="bp">None</span><span class="p">,</span> <span class="n">Y_ny</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]])</span>
<span class="c"># mlp = a depth 5 width 32 MLP net</span>
<span class="n">pred_by</span> <span class="o">=</span> <span class="n">mlp</span><span class="p">(</span><span class="n">input_ph_bx</span><span class="p">)</span>
<span class="n">tot_loss</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">losses</span><span class="o">.</span><span class="n">mean_squared_error</span><span class="p">(</span><span class="n">output_ph_by</span><span class="p">,</span> <span class="n">pred_by</span><span class="p">)</span>
<span class="n">update</span> <span class="o">=</span> <span class="n">adam</span><span class="o">.</span><span class="n">minimize</span><span class="p">(</span><span class="n">tot_loss</span><span class="p">)</span>
<span class="k">with</span> <span class="n">tf</span><span class="o">.</span><span class="n">Session</span><span class="p">()</span> <span class="k">as</span> <span class="n">sess</span><span class="p">:</span>
    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">nbatches</span><span class="p">):</span>
        <span class="n">batch_ix_b</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="n">X_nx</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,))</span>
        <span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">update</span><span class="p">,</span> <span class="n">feed_dict</span><span class="o">=</span><span class="p">{</span>
            <span class="n">input_ph_bx</span><span class="p">:</span> <span class="n">X_nx</span><span class="p">[</span><span class="n">batch_ix_b</span><span class="p">],</span>
            <span class="n">output_ph_by</span><span class="p">:</span> <span class="n">Y_ny</span><span class="p">[</span><span class="n">batch_ix_b</span><span class="p">]})</span></code></pre></figure>
<p>This drops whole-dataset loss from around 4500 to around 4, taking around <strong>16 seconds</strong> for training. You might worry that random-number generation might be taking a while, but excluding that doesn’t drop the time more than <strong>0.5 seconds</strong>.</p>
<h3 id="dataset-api-approach">Dataset API Approach</h3>
<p>With the dataset API, we set up a pipeline where TensorFlow orchestrates some dataflow by synergizing more buzzwords on its worker threads. This should keep the GPU constantly fed by staging the next mini-batch while the current one is being consumed on the GPU. That might pay off when there’s a lot of data, but it doesn’t seem to work very well when the data is small and GPU-CPU latency, not throughput, is the bottleneck.</p>
<p>Another unpleasant thing to deal with is that all those orchestrated workers and staging areas and buffers and shuffle queues need magic constants to work well. I tried my best, but it seems performance is very sensitive to them in this use case. This could be fixed if Dataset detected (or could be told) that it could be placed onto the GPU, and then did so.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># make training dataset, which should swallow the entire dataset once</span>
<span class="c"># up-front and then feed it in mini-batches to the GPU</span>
<span class="c"># presumably since we only need to feed stuff in once it'll be faster</span>
<span class="n">ds</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">Dataset</span><span class="o">.</span><span class="n">from_tensor_slices</span><span class="p">((</span><span class="n">input_ph_nx</span><span class="p">,</span> <span class="n">output_ph_ny</span><span class="p">))</span>
<span class="n">ds</span> <span class="o">=</span> <span class="n">ds</span><span class="o">.</span><span class="n">repeat</span><span class="p">()</span>
<span class="n">ds</span> <span class="o">=</span> <span class="n">ds</span><span class="o">.</span><span class="n">shuffle</span><span class="p">(</span><span class="n">buffer_size</span><span class="o">=</span><span class="n">bufsize</span><span class="p">)</span>
<span class="n">ds</span> <span class="o">=</span> <span class="n">ds</span><span class="o">.</span><span class="n">batch</span><span class="p">(</span><span class="n">batch_size</span><span class="p">)</span>
<span class="c"># magic that Zongheng Yang (http://zongheng.me/) suggested I add that was</span>
<span class="c"># necessary to keep this from being *worse* than feed_dict</span>
<span class="n">ds</span> <span class="o">=</span> <span class="n">ds</span><span class="o">.</span><span class="n">prefetch</span><span class="p">(</span><span class="n">buffer_size</span><span class="o">=</span><span class="p">(</span><span class="n">batch_size</span> <span class="o">*</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">it</span> <span class="o">=</span> <span class="n">ds</span><span class="o">.</span><span class="n">make_initializable_iterator</span><span class="p">()</span>
<span class="c"># reddit user ppwwyyxx further suggests folding training into a single call</span>
<span class="k">def</span> <span class="nf">while_fn</span><span class="p">(</span><span class="n">t</span><span class="p">):</span>
    <span class="k">with</span> <span class="n">tf</span><span class="o">.</span><span class="n">control_dependencies</span><span class="p">([</span><span class="n">t</span><span class="p">]):</span>
        <span class="n">next_bx</span><span class="p">,</span> <span class="n">next_by</span> <span class="o">=</span> <span class="n">it</span><span class="o">.</span><span class="n">get_next</span><span class="p">()</span>
        <span class="n">pred_by</span> <span class="o">=</span> <span class="n">mlp</span><span class="p">(</span><span class="n">next_bx</span><span class="p">)</span>
        <span class="n">loss</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">losses</span><span class="o">.</span><span class="n">mean_squared_error</span><span class="p">(</span><span class="n">next_by</span><span class="p">,</span> <span class="n">pred_by</span><span class="p">)</span>
        <span class="n">update</span> <span class="o">=</span> <span class="n">adam</span><span class="o">.</span><span class="n">minimize</span><span class="p">(</span><span class="n">loss</span><span class="p">)</span>
        <span class="k">with</span> <span class="n">tf</span><span class="o">.</span><span class="n">control_dependencies</span><span class="p">([</span><span class="n">update</span><span class="p">]):</span>
            <span class="k">return</span> <span class="n">t</span> <span class="o">+</span> <span class="mi">1</span>
<span class="n">training</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">while_loop</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="n">t</span> <span class="o"><</span> <span class="n">nbatches</span><span class="p">,</span>
                         <span class="n">while_fn</span><span class="p">,</span> <span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">back_prop</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="k">with</span> <span class="n">tf</span><span class="o">.</span><span class="n">Session</span><span class="p">()</span> <span class="k">as</span> <span class="n">sess</span><span class="p">:</span>
    <span class="n">fd</span> <span class="o">=</span> <span class="p">{</span><span class="n">input_ph_nx</span><span class="p">:</span> <span class="n">X_nx</span><span class="p">,</span> <span class="n">output_ph_ny</span><span class="p">:</span> <span class="n">Y_ny</span><span class="p">}</span>
    <span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">it</span><span class="o">.</span><span class="n">initializer</span><span class="p">,</span> <span class="n">feed_dict</span><span class="o">=</span><span class="n">fd</span><span class="p">)</span>
    <span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">training</span><span class="p">)</span></code></pre></figure>
<p>For a small <code class="highlighter-rouge">bufsize</code>, like <code class="highlighter-rouge">1000</code>, this trains in around <strong>12 seconds</strong>. But then it’s not actually shuffling the data too well (since a data point can only move by up to 1000 positions). Still, the loss drops from around 4500 to around 4, as in the <code class="highlighter-rouge">feed_dict</code> case. A large <code class="highlighter-rouge">bufsize</code> like <code class="highlighter-rouge">1000000</code>, which you’d think should effectively move the dataset onto the GPU entirely, performs <em>worse</em> than <code class="highlighter-rouge">feed_dict</code> at around <strong>20 seconds</strong>.</p>
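<p>To see why a small buffer only shuffles locally, here is a pure-Python model of what I understand <code class="highlighter-rouge">shuffle(buffer_size=k)</code> to do (a sketch of the semantics, not TensorFlow’s actual code): an element can be emitted at most \(k-1\) positions earlier than where it started.</p>

```python
import random

def buffer_shuffle(stream, bufsize, rng):
    # Model of tf.data-style shuffling: keep a buffer of up to `bufsize`
    # items, and once it is full, emit a uniformly random buffered item
    # as each new one arrives.
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) >= bufsize:
            yield buf.pop(rng.randrange(len(buf)))
    rng.shuffle(buf)  # drain whatever is left at end of stream
    yield from buf

out = list(buffer_shuffle(range(10_000), 1000, random.Random(0)))
# every element appears, but none moves earlier by more than bufsize - 1
assert sorted(out) == list(range(10_000))
assert all(v - j <= 999 for j, v in enumerate(out))
```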
<p>I don’t think I’m unfair in counting <code class="highlighter-rouge">it.initializer</code> time in my benchmark (which isn’t that toy, either, since it’s similar to my RL use case size). All the training methods need to load the data onto the GPU, and the data isn’t available until run time.</p>
<h3 id="using-a-tensorflow-loop">Using a TensorFlow Loop</h3>
<p>This post isn’t a tutorial on <code class="highlighter-rouge">tf.while_loop</code> and friends, but this code does what was promised: just feed everything once into the GPU and do all your epochs without asking for permission to continue from the CPU.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># generate random batches up front</span>
<span class="c"># i = iterations</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">input_ph_nx</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">batches_ib</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">random_uniform</span><span class="p">((</span><span class="n">nbatches</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">),</span> <span class="mi">0</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">int32</span><span class="p">)</span>
<span class="c"># use a fold + control deps to make sure we only train on the next batch</span>
<span class="c"># after we're done with the first</span>
<span class="k">def</span> <span class="nf">fold_fn</span><span class="p">(</span><span class="n">prev</span><span class="p">,</span> <span class="n">batch_ix_b</span><span class="p">):</span>
    <span class="n">X_bx</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">gather</span><span class="p">(</span><span class="n">input_ph_nx</span><span class="p">,</span> <span class="n">batch_ix_b</span><span class="p">)</span>
    <span class="n">Y_by</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">gather</span><span class="p">(</span><span class="n">output_ph_ny</span><span class="p">,</span> <span class="n">batch_ix_b</span><span class="p">)</span>
    <span class="c"># removing control deps here probably gives you Hogwild!</span>
    <span class="k">with</span> <span class="n">tf</span><span class="o">.</span><span class="n">control_dependencies</span><span class="p">([</span><span class="n">prev</span><span class="p">]):</span>
        <span class="n">pred_by</span> <span class="o">=</span> <span class="n">mlp</span><span class="p">(</span><span class="n">X_bx</span><span class="p">)</span>
        <span class="n">loss</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">losses</span><span class="o">.</span><span class="n">mean_squared_error</span><span class="p">(</span><span class="n">Y_by</span><span class="p">,</span> <span class="n">pred_by</span><span class="p">)</span>
        <span class="k">with</span> <span class="n">tf</span><span class="o">.</span><span class="n">control_dependencies</span><span class="p">([</span><span class="n">adam</span><span class="o">.</span><span class="n">minimize</span><span class="p">(</span><span class="n">loss</span><span class="p">)]):</span>
            <span class="k">return</span> <span class="n">tf</span><span class="o">.</span><span class="n">constant</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="n">training</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">foldl</span><span class="p">(</span><span class="n">fold_fn</span><span class="p">,</span> <span class="n">batches_ib</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">back_prop</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="k">with</span> <span class="n">tf</span><span class="o">.</span><span class="n">Session</span><span class="p">()</span> <span class="k">as</span> <span class="n">sess</span><span class="p">:</span>
    <span class="n">fd</span> <span class="o">=</span> <span class="p">{</span><span class="n">input_ph_nx</span><span class="p">:</span> <span class="n">X_nx</span><span class="p">,</span> <span class="n">output_ph_ny</span><span class="p">:</span> <span class="n">Y_ny</span><span class="p">}</span>
    <span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">training</span><span class="p">,</span> <span class="n">feed_dict</span><span class="o">=</span><span class="n">fd</span><span class="p">)</span></code></pre></figure>
<p>This one crushes at around <strong>8 seconds</strong>, dropping loss again from around 4500 to around 4.</p>
<h2 id="discussion">Discussion</h2>
<p>It’s pretty clear Dataset isn’t feeding as aggressively as it can, and its many widgets and knobs don’t help (well, they do, but only after making me do more work). But, if TF wants to invalidate this blog post, I suppose it could add yet another option that plops the dataset into the GPU.</p>
Sat, 23 Dec 2017 00:00:00 +0000
https://vlad17.github.io/2017/12/23/beating-tf-api-in-vram.html
https://vlad17.github.io/2017/12/23/beating-tf-api-in-vram.htmlhardware-accelerationmachine-learningtoolsDeep Learning Learning<h1 id="deep-learning-learning-plan">Deep Learning Learning Plan</h1>
<p>This is my plan to on-board myself with recent deep learning practice (as of the publishing date of this post). Comments and recommendations <a href="https://github.com/vlad17/vlad17.github.io/issues">via GitHub issues</a> are welcome and appreciated! This plan presumes some probability, linear algebra, and machine learning theory already, but if you’re following along <a href="http://www.deeplearningbook.org/">Part 1 of the Deep Learning book</a> gives an overview of prerequisite topics to cover.</p>
<p>My notes on these sources are <a href="https://github.com/vlad17/ml-notes">publicly available</a>, as are my <a href="https://github.com/vlad17/learning-to-deep-learn">experiments</a>.</p>
<ol>
<li>Intro tutorials/posts.
<ul>
<li><a href="http://karpathy.github.io/neuralnets/">Karpathy</a></li>
<li>Skim lectures from weeks 1-6, 9-10 of <a href="https://www.coursera.org/learn/neural-networks">Hinton’s Coursera course</a></li>
</ul>
</li>
<li>Scalar supervised learning theory
<ul>
<li>Read Chapters 6, 7, 8, 9, 11, 12 of <a href="http://www.deeplearningbook.org/">Dr. Goodfellow’s Deep Learning Book</a> and <a href="http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf">Efficient Backprop</a></li>
</ul>
</li>
<li>Scalar supervised learning practice
<ul>
<li>Choose an environment.
<ul>
<li>Should be TensorFlow-based, given the wealth of ecosystem around it; stuff like <a href="https://github.com/deepmind/sonnet">Sonnet</a> and <a href="https://github.com/tensorflow/tensor2tensor">T2T</a>.</li>
<li>I tried <a href="https://github.com/tensorflow/models/blob/master/inception/inception/slim/README.md">TF-Slim</a> and <a href="https://github.com/zsdonghao/tensorlayer">TensorLayer</a>, but I still found <a href="https://keras.io/">Keras</a> easiest to rapidly prototype in (and expand). TensorFlow is still pretty easy to <a href="https://blog.keras.io/keras-as-a-simplified-interface-to-tensorflow-tutorial.html">drop down into</a> from the Keras models.</li>
<li>Even with Keras, TF is awkward to prototype in: it’s also worth considering <a href="http://pytorch.org/">PyTorch</a>.</li>
</ul>
</li>
<li>Google <a href="https://www.tensorflow.org/get_started/mnist/pros">MNIST</a></li>
<li>Lessons 0-4 from <a href="http://course.fast.ai/index.html">USF</a></li>
<li>Assignments 1-4 from <a href="https://www.udacity.com/course/deep-learning--ud730">Udacity</a></li>
<li><a href="https://www.tensorflow.org/tutorials/deep_cnn">CIFAR-10</a>
<ul>
<li>Extend to multiple GPUs</li>
<li>Visualizations (with Tensorboard): histogram summary for weights/biases/activations and layer-by-layer gradient norm recordings (+ how does batch norm affect them), graph visualization, cost over time</li>
<li>Visualizations for trained kernels: most-activating image from input set as viz, direct kernel image visualizations + maximizing image from input set as the viz <a href="https://blog.keras.io/how-convolutional-neural-networks-see-the-world.html">per maximizing inputs</a>, activations direct image viz (per <a href="http://yosinski.com/media/papers/Yosinski__2015__ICML_DL__Understanding_Neural_Networks_Through_Deep_Visualization__.pdf">Yosinki et al 2015</a>). For maximizing inputs use regularization from Yosinki paper.</li>
<li>Faster input pipeline and timing metrics for each stage of operation <a href="http://web.stanford.edu/class/cs20si/lectures/notes_09.pdf">input pipeline notes</a>.</li>
</ul>
</li>
<li>Assignment 2 from <a href="http://web.stanford.edu/class/cs20si/syllabus.html">Stanford CS20S1</a></li>
<li>Lab 1 from <a href="https://github.com/yala/introdeeplearning">MIT 6.S191</a></li>
<li><a href="http://cs231n.github.io/">Stanford CS231n</a></li>
<li>Try out slightly less common techniques: compare initialization (orthogonal vs LSUV vs uniform), weight normalization vs batch normalization vs layer normalization, Bayesian-inspired weight decay vs early stopping vs proximal regularization</li>
<li>Replicate <a href="https://arxiv.org/abs/1512.03385">ResNet by He et al 2015</a>, <a href="http://cs.nyu.edu/~wanli/dropc/">Dropconnect</a>, <a href="https://arxiv.org/abs/1302.4389">Maxout</a>, <a href="https://github.com/tensorflow/models/tree/master/inception">Inception</a> (do a fine-tuning example with Inception per <a href="http://proceedings.mlr.press/v32/donahue14.pdf">this paper</a>).</li>
<li>Do an end-to-end application from scratch. E.g., convert an equation image to LaTeX.</li>
</ul>
</li>
<li>Sequence supervised learning
<ul>
<li>Gentle introductions
<ul>
<li>Lessons 5-7 from <a href="http://course.fast.ai/index.html">USF</a></li>
<li>Assignments 5-6 from <a href="https://www.udacity.com/course/deep-learning--ud730">Udacity</a></li>
<li><a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">Karpathy RNN post</a></li>
<li>Weeks 7-8 of <a href="https://www.coursera.org/learn/neural-networks">Hinton’s Coursera course</a></li>
</ul>
</li>
<li>Theory
<ul>
<li>Chapter 10 from <a href="http://www.deeplearningbook.org/">Goodfellow</a></li>
</ul>
</li>
<li>Practice
<ul>
<li>Lab 2 from <a href="https://github.com/yala/introdeeplearning">MIT 6.S191</a></li>
<li>End-to-end application from scratch: a Swype keyboard (<a href="https://www.reddit.com/r/MachineLearning/comments/5ogbd5/d_training_lstms_in_practice_tips_and_tricks/">Reddit tips</a>)</li>
</ul>
</li>
<li>Paper recreations
<ul>
<li>Machine translation <a href="https://arxiv.org/abs/1409.3215">Sutskever et al 2014</a></li>
<li>NLP <a href="https://arxiv.org/abs/1412.7449">Vinyals et al 2015</a></li>
<li>Dense captioning <a href="http://cs.stanford.edu/people/karpathy/densecap/">Karpathy 2016</a></li>
<li><a href="https://arxiv.org/abs/1506.03134">Pointer nets</a></li>
<li><a href="https://arxiv.org/abs/1706.03762">Attention</a></li>
</ul>
</li>
</ul>
</li>
<li>Unsupervised and semi-supervised approaches
<ul>
<li>Theory
<ul>
<li>Weeks 11-16 of <a href="https://www.coursera.org/learn/neural-networks">Hinton’s Coursera course</a></li>
<li>Chapters 13, 16-20 from <a href="http://www.deeplearningbook.org/">Goodfellow</a></li>
<li>See also my links for <a href="https://github.com/vlad17/ml-notes/tree/master/deep-learning">VAE and RBM notes here</a></li>
</ul>
</li>
<li>Practice
<ul>
<li>Remaining <a href="http://deeplearning.net/tutorial/">deeplearning.net</a> tutorials, based on interest.</li>
<li>Notebooks 06, 11 from <a href="https://github.com/nlintz/TensorFlow-Tutorials">nlintz/TensorFlow-Tutorials</a>.</li>
</ul>
</li>
<li>Paper recreations
<ul>
<li><a href="https://arxiv.org/abs/1701.07875">WGAN</a></li>
<li><a href="https://arxiv.org/abs/1312.6114">VAE</a></li>
<li><a href="https://arxiv.org/abs/1606.04934">IAF VAE</a></li>
<li><a href="https://arxiv.org/abs/1507.02672">Ladder Nets</a></li>
</ul>
</li>
</ul>
</li>
</ol>
Sun, 09 Jul 2017 00:00:00 +0000
https://vlad17.github.io/2017/07/09/deep-learning-learning.html
https://vlad17.github.io/2017/07/09/deep-learning-learning.htmldeep-learningNon-convex First Order Methods<h1 id="non-convex-first-order-methods">Non-convex First Order Methods</h1>
<p>This is a high-level overview of first order local improvement optimization methods for non-convex, Lipschitz, (sub)differentiable, and regularized functions with efficient derivatives, with a particular focus on neural networks (NNs).</p>
<p>\[
\argmin_\vx f(\vx) = \argmin_\vx \frac{1}{n}\sum_{i=1}^nf_i(\vx)+\Omega(\vx)
\]</p>
<p>Make sure to read the <a href="/2017/06/19/neural-network-optimization-methods.html">general overview post</a> first. I’d also reiterate <a href="http://blog.mrtz.org/2013/09/07/the-zen-of-gradient-descent.html">as Moritz Hardt has</a> that one should be wary of only looking at convergence rates willy-nilly.</p>
<p><strong>Notation and Definitions</strong>.</p>
<ul>
<li>The \(t\)-th step stochastic gradient of \(f:\R^d\rightarrow\R\), computed in \(O(d)\) time at the location \(\vx_{t}\), by selecting either a single \(f_i\) or a mini-batch, is denoted \(\tilde{\nabla}_t\), with \(\E\tilde{\nabla}_t=\nabla_t=\nabla f(\vx_t)\).</li>
<li>Arithmetic operations may be applied elementwise to vectors.</li>
<li>If smooth and efficiently differentiable, e.g., \(\Omega(\vx)=\frac{1}{2}\norm{\vx}_2^2\), regularization can be folded into each \(f_i\) to make new \(f_i'=f_i+\frac{1}{n}\Omega\), as if it was never there in the first place. However, we may wish to apply \(L^1\) regularization or other non-smooth, non-differentiable but still convex functions; these are the problems I’ll label <em>composite</em>.</li>
<li>I’ll use \(x\simeq y\) to claim that equality holds up to some fixed multiplicative constants.</li>
<li>I will presume an initialization \(\vx_0\) (<a href="https://github.com/vlad17/ml-notes/blob/master/deep-learning/optimization.pdf">see discussion here</a>).</li>
<li>
<p>Finally, recall the two stationary point conditions:</p>
<ul>
<li>\(\epsilon\)-approximate critical point: \(\norm{\nabla f(\vx_*)}\le \epsilon\)</li>
<li>\(\epsilon\)-approximate local minimum: there exists a neighborhood \(N\) of \(\vx_*\) such that for any \(\vx\) in \(N\), \(f(\vx)-f(\vx_*)\le \epsilon\). For \(f\) twice-differentiable at \(\vx_*\), it suffices to be an \(\epsilon\)-approximate critical point and have \(\nabla^2 f(\vx_*)\succeq \sqrt{\epsilon}I\).</li>
</ul>
</li>
<li>In this post, many algorithms will depend on a fixed learning rate, even if it’s just an initial scale for the learning rate. Convergence is sensitive to this setting; a fixed recommendation will surely be a poor choice for some problem. For a first choice, setting \(\eta\) to one of \( \{0.001,0.01,0.1\}\) based on a guess about the magnitude of the smoothness of the problem at hand is a good bet.</li>
</ul>
<h2 id="stochastic-gradient-descent-sgd">Stochastic Gradient Descent (SGD)</h2>
<p>\[
\vx_{t+1}=\vx_t-\eta_t\tilde{\nabla}_t
\]</p>
<p><strong>Description</strong>. See <a href="https://arxiv.org/abs/1309.5549">Ghadimi and Lan 2013a</a> for analysis and TensorFlow’s <a href="https://www.tensorflow.org/api_docs/python/tf/train/GradientDescentOptimizer">non-composite</a>/<a href="https://www.tensorflow.org/api_docs/python/tf/train/ProximalGradientDescentOptimizer">composite</a> implementation. The intuition behind SGD is to travel in a direction we expect is downhill, at least from where we are now. Put another way, the gradient defines a local linear approximation to our function, and we head in the direction that most directly lowers the cost for that approximation. The learning rate \(\eta_t\) controls how far against the gradient we’d like to go (before we judge the linear approximation to be inaccurate).</p>
<p><strong>Assumptions</strong>. SGD makes the <em>gradient estimation assumption</em>, that \(\tilde{\nabla}_t\) is an unbiased estimator of \(\nabla_t\) with variance globally bounded, and the assumes that \(f\) is <em>\(L\)-gradient-Lipschitz</em>. <a href="https://arxiv.org/abs/1308.6594">Ghadimi et al 2013</a> extend to composite costs.</p>
<p><strong>Guarantees</strong>. For a <em>fixed-rate</em>, \(\eta_t=\eta\), we expect to converge to an approximate critical point in \(O\pa{ d\epsilon^{-4} }\) as long as \(\eta\simeq\min\pa{L^{-1},\epsilon^2}\). With <em>annealing</em>, \(\eta_t\simeq\min(L^{-1},\epsilon t^{-1/4})\) offers the same guarantees.</p>
<p><strong>Practical Notes</strong>. Vanilla SGD, though simple, has quite a few pitfalls without careful tuning.</p>
<ul>
<li>Its theoretical performance is poor, and convergence is only guaranteed when the step size is kept small relative to the smoothness constants of the cost function. That annealing doesn’t improve the worst-case runtime is a bit surprising, since it does in the strongly convex case, but I believe this is a testament to the fact that the general cost function shape is no longer bowl-like: it can be fractal in nature, so there might never be an end to directions to descend.</li>
<li>In practice, I’ve found that at least for simple problems like logistic regression, where we have \(L\) available, using a fixed learning rate of at most \(L^{-1}\) is many, many orders of magnitude slower than a “reasonable” constant. Global Lipschitz properties might be poorer than local ones, so you’re dooming yourself to slow learning.</li>
<li>A common strategy to cope with this is to use an exponential decay schedule, \(\eta_t\simeq e^{-t}\), with the idea being to traverse a large range of learning rates, hopefully spending most of the time in a range appropriate to the problem. Of course, this will be very sensitive to hyperparameters: note that using exponential decay bounds the diameter of exploration, and even using an inverse-time schedule \(\eta_t\simeq t^{-1}\) for \(T\) steps means you can only travel \(O(\log T)\) distance from your starting point! Inverse-time schedules, and more generally schedules with \(\sum_{t=1}^\infty\eta_t=\infty\) but \(\sum_{t=1}^\infty\eta_t^2<\infty\), can draw on more restrictive smoothness assumptions about \(f\) to guarantee almost-sure convergence (<a href="http://leon.bottou.org/papers/bottou-98x">Bottou 1998</a>).</li>
<li><a href="https://arxiv.org/abs/1309.5549">Ghadimi and Lan 2013a</a> also offer a treatment of “2-phase random stochastic gradient”, which is vanilla SGD with random restarts, for probabilistic guarantees of finding approximate stationary points. Finally, Ghadimi and Lan’s SGD technically expects to find \(\vx_*\) with \(\E\ha{\norm{\nabla f(\vx_*)}^2}<\epsilon\). This implies the above \(O(d\epsilon^{-4})\) convergence rate, but is technically slightly stronger.</li>
</ul>
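<p>To make the update rule and the decay schedules above concrete, here is a minimal NumPy sketch of SGD with an exponential decay schedule; the toy quadratic cost, the noise scale, and all constants are illustrative assumptions, not recommendations.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(grad_fn, x0, eta0=0.1, decay=0.999, steps=2000):
    """Minimize using stochastic gradients with eta_t = eta0 * decay**t."""
    x = np.array(x0, dtype=float)
    for t in range(steps):
        g = grad_fn(x)                # noisy gradient estimate
        x -= eta0 * (decay ** t) * g
    return x

# Toy cost f(x) = 0.5 * ||x||^2, whose gradient is x, plus additive noise.
noisy_grad = lambda x: x + 0.01 * rng.standard_normal(x.shape)
x_final = sgd(noisy_grad, x0=[5.0, -3.0])  # ends up near the minimum at 0
```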
<p>Most subsequent algorithms have been developed to handle finding \(\eta_t\) on their own, adapting the learning rate as they go along. This was done for the convex case, but that doesn’t stop us from applying the same improvements to the non-convex case!</p>
<h2 id="accelerated-stochastic-gradient-descent-agd">Accelerated (Stochastic) Gradient Descent (AGD)</h2>
<p>See <a href="https://www.tensorflow.org/api_docs/python/tf/train/MomentumOptimizer">tf.train.MomentumOptimizer</a> for implementation. AGD is motivated by momentum-added SGD from <a href="http://www.sciencedirect.com/science/article/pii/0041555364901375">Polyak 1964</a>. A modern version looks like this:
\[
\begin{align}
\vm_0&=0\\\<br />
\vm_{t+1}&=\beta \vm_t+\eta \nabla_t \\\<br />
\vx_{t+1}&=\vx_t-\vm_{t+1}
\end{align}
\]
<strong>Description</strong>. We can intuit this, in <a href="https://arxiv.org/abs/1609.04747">Ruder’s words</a>, as a ball rolling down a hill with growing momentum. In this way, we extend the hill metaphor, in effect trusting that we can continue further in the general downhill direction maintained by the momentum terms. Some momentum implementations just replace the above gradient with the estimator \(\tilde{\nabla}_t\) and set stuff running, like the linked TensorFlow optimizer does by default. However, even with full gradient information and assuming smoothness and convexity, momentum alone doesn’t perform optimally. Nesterov’s 1983 paper, <em>A method of solving a convex programming problem with convergence rate \(O(1/k^2)\)</em>, fixes this by correcting momentum to look ahead, which is helpful if the curvature of the function starts changing:
\[
\begin{align}
\vm_0&=0\\\<br />
\vm_{t+1}&=\beta \vm_t+\eta \nabla f(\vx_t -\beta\vm_t)\\\<br />
\vx_{t+1}&=\vx_t-\vm_{t+1}
\end{align}
\]
<strong>Practical Notes</strong>. While optimal in the smooth, convex, full gradient setting, and even optimally extended to non-smooth settings (see <a href="http://www.mit.edu/~dimitrib/PTseng/papers/apgm.pdf">Tseng 2008</a> for an overview), changing the above to use a random gradient estimator ruins asymptotic performance, concede <a href="http://proceedings.mlr.press/v28/sutskever13.html">Sutskever et al 2013</a>. <a href="http://www.deeplearningbook.org/contents/optimization.html">Goodfellow</a> claims momentum handles ill-conditioning in the Hessian of \(f\) and variance in the gradient through the introduction of the stabilizing term \(\vm_{t}\). Indeed, this seems to be the thesis laid out by Sutskever et al 2013, where the authors argue that a certain transient phase of optimization matters more for deep NNs, which AGD accelerates empirically (see also <a href="https://arxiv.org/abs/1212.0901">Bengio et al 2012</a>). Many authors set \(\beta=0.9\), but see <a href="http://proceedings.mlr.press/v28/sutskever13.html">Sutskever et al 2013</a> for detailed considerations on the momentum schedule.</p>
<p><strong>Guarantees</strong>. Later work by <a href="https://arxiv.org/abs/1310.3787">Ghadimi and Lan 2013b</a> solidifies the analysis for AGD for stochastic, smooth, composite, and non-convex costs, though it uses a slightly different formulation for momentum. Under the previous gradient estimation assumptions from SGD (including slightly stronger light-tail assumptions about the variance of \(\tilde{\nabla}_t\)), \(L\)-gradient-Lipschitz assumptions for \(f\), and a schedule which increases mini-batch size <em>linearly</em> in the iteration count to refine gradient estimation, AGD requires \(O(\epsilon^{-2})\) iterations but \(O(d\epsilon^{-4})\) runtime to converge to an approximate critical point. Perhaps with yet stronger assumptions about the concentration of \(\tilde{\nabla}_t\) around \(\nabla_t\) AGD has promise to perform better.</p>
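<p>The two momentum variants above differ only in where the gradient is evaluated. A minimal sketch with deterministic gradients on a toy quadratic (the constants are illustrative, not recommendations):</p>

```python
import numpy as np

def momentum_step(x, m, grad_fn, eta=0.01, beta=0.9):
    # classical (Polyak) momentum: gradient taken at the current iterate
    m = beta * m + eta * grad_fn(x)
    return x - m, m

def nesterov_step(x, m, grad_fn, eta=0.01, beta=0.9):
    # Nesterov correction: gradient taken at the look-ahead point x - beta*m
    m = beta * m + eta * grad_fn(x - beta * m)
    return x - m, m

grad = lambda x: x  # gradient of f(x) = 0.5 * ||x||^2
x_p, m_p = np.array([10.0]), np.zeros(1)
x_n, m_n = np.array([10.0]), np.zeros(1)
for _ in range(300):
    x_p, m_p = momentum_step(x_p, m_p, grad)
    x_n, m_n = nesterov_step(x_n, m_n, grad)
# both settle near the minimum at 0 on this easy problem
```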
<h2 id="adagrad">AdaGrad</h2>
<p>AdaGrad was proposed by <a href="http://jmlr.org/papers/v12/duchi11a.html">Duchi et al 2011</a> and is available in <a href="https://www.tensorflow.org/api_docs/python/tf/train/AdagradOptimizer">TensorFlow</a>.
\[
\begin{align}
\vv_0&=\epsilon\\\<br />
\vv_{t+1}&=\vv_t+\tilde{\nabla}_t^2\\\<br />
\vx_{t+1}&amp;=\vx_t-\frac{\eta}{\sqrt{\vv_{t+1}} }\tilde{\nabla}_t
\end{align}
\]
<strong>Description</strong>. AdaGrad is actually analyzed in the framework of <a href="http://ocobook.cs.princeton.edu/">online convex optimization (OCO)</a>. This adversarial, rather than stochastic, optimization setting can immediately be applied to stochastic optimization of convex functions over compact sets. Consider an \(L\)-gradient Lipschitz but possibly nonconvex cost \(f\) at some iterate \(\vx_t\), which implies an upper bound \(f(\vx)\le f(\vx_t)+(\vx-\vx_t)^\top\nabla_t+\frac{L}{2}\norm{\vx-\vx_t}^2_2\); the convexity inequality, if we had it, would sandwich \(f(\vx)\ge f(\vx_t)+(\vx-\vx_t)^\top\nabla_t\). Minimizing this upper bound, which yields full gradient descent (GD), then guarantees improvement in our cost. The quadratic term effectively quantifies how much we trust our linear approximation. An analogous technique applied to a sequence of cost functions in the online setting gives rise to Follow the Regularized Leader (FTRL): given past performance, create an upper bound on the global cost reconstructed from our stochastic information, and find the next best iterate subject to working within a trusted region. The difficulty is in defining this trusted region with an unfortunately named regularization function, which differs from \(\Omega\). AdaGrad improves the quadratic regularization \(\frac{L}{2}\norm{\vx-\vx_t}^2_I\) in GD to the less crude \(\frac{L}{2}\norm{\vx-\vx_t}^2_{G_t}\), where \(\norm{\vx}^2_A=\vx^\top A\vx\) and \(G_t=\diag \vv_t^{1/2}\) from the iterates above (see <a href="/assets/2017-06-20-nonconvex-first-order-methods/proximal_notes.pdf">these notes</a>, retrieved <a href="http://cs.stanford.edu/~ppasupat/a9online/uploads/proximal_notes.pdf">from here</a>, for discussion). This <em>adaptive</em> regularization function, at least in the OCO setting, is as good, in terms of convergence, as an optimal choice of quadratic regularization, up to multiplicative constants.
We see that the learning rate for every feature changes with respect to its history, so that new information is weighed against the old.</p>
<p><strong>Practical Notes</strong>. AdaGrad is a convex optimization algorithm, and it shows, but not in a good way.</p>
<ul>
<li>In nonconvex optimization problems, aggregates of gradients from the beginning of training are irrelevant to the curvature of the current location being optimized, Goodfellow claims. As a result, they cause an overly aggressive learning rate decrease.</li>
<li>The \(\epsilon\) constant is only for numerical stability. <a href="https://keras.io/optimizers/#adagrad">Keras</a> and <a href="https://arxiv.org/abs/1609.04747">Ruder</a> recommend setting it to \(10^{-8}\).</li>
<li>For noncomposite versions of AdaGrad, see <a href="https://www.tensorflow.org/api_docs/python/tf/train/AdagradDAOptimizer">tf.train.AdagradDAOptimizer</a>, mentioned in the original <a href="http://jmlr.org/papers/v12/duchi11a.html">Duchi et al 2011</a> and <a href="https://www.tensorflow.org/api_docs/python/tf/train/ProximalAdagradOptimizer">tf.train.ProximalAdagradOptimizer</a> based on FOBOS from <a href="https://papers.nips.cc/paper/3793-efficient-learning-using-forward-backward-splitting">Duchi et al 2009</a>. See <a href="https://arxiv.org/abs/1009.3240">McMahan 2011</a> for discussion.</li>
<li>While AdaGrad greatly improved performance by having a per-dimension learning rate, its use is frequently discouraged because of its maintenance of the entire gradient history.</li>
</ul>
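<p>A minimal sketch of the AdaGrad iterates above, with deterministic gradients on a toy quadratic whose curvature differs per coordinate (the problem and constants are illustrative assumptions). Note the accumulator \(\vv\), which only ever grows:</p>

```python
import numpy as np

def adagrad(grad_fn, x0, eta=0.5, eps=1e-8, steps=500):
    x = np.array(x0, dtype=float)
    v = np.full_like(x, eps)       # v_0 = eps, for numerical stability only
    for _ in range(steps):
        g = grad_fn(x)
        v += g ** 2                # accumulate the entire squared-gradient history
        x -= eta / np.sqrt(v) * g  # per-coordinate adapted learning rate
    return x

# Toy cost f(x) = 0.5 * x_0^2 + 2 * x_1^2; each coordinate adapts separately.
x_final = adagrad(lambda x: np.array([x[0], 4.0 * x[1]]), x0=[3.0, 1.0])
```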
<p><strong>AdaDelta</strong> attempts to address the aggressive learning rate decrease problem of AdaGrad by exponentially decaying an estimate of accumulated gradient term \(\vv_t\) (<a href="https://arxiv.org/abs/1212.5701">Zeiler 2012</a>). This adds a new parameter for the exponential decay \(\beta\), typically \(0.9\), and introduces a unit correction \(\tilde{\vx}_t\) in place of the learning rate:
\[
\begin{align}
\vv_0&=\tilde{\vx}_0=0\\\<br />
\vv_{t+1}&=\beta \vv_t+(1-\beta)\tilde{\nabla}_t^2\\\<br />
\Delta_{t+1}&=\frac{\sqrt{\tilde{\vx}_{t}+\epsilon} }{\sqrt{\vv_{t+1}+\epsilon} }\tilde{\nabla}_t\\\<br />
\tilde{\vx}_{t+1}&amp;=\beta \tilde{\vx}_{t}+(1-\beta)\Delta_{t+1}^2\\\<br />
\vx_{t+1}&=\vx_t-\Delta_{t+1}
\end{align}
\]
Similar update rules have been explored by <a href="https://arxiv.org/abs/1206.1106">Schaul et al 2012</a> in a sound but presumptive setting where \(\nabla^2f_i(\vx)\) are considered identical and diagonal for all \(i\in[n]\) and any fixed \(\vx\). <strong>RMSProp</strong> is similar to AdaDelta, but still relies on a fixed learning rate \(\tilde{\vx}_t=\eta\). Both RMSProp and AdaDelta have seen practical success, improving over AdaGrad in later iterations because they are unencumbered by previous gradient accumulation. RMSProp even has a Nesterov momentum variant. However, the exponential decay approximation may have high bias early in the iteration. The Adaptive Moment Estimation (Adam) paper corrects for this.</p>
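<p>Since the decayed accumulator is the shared idea, here is a minimal RMSProp sketch rather than full AdaDelta; it keeps the fixed rate \(\eta\) that AdaDelta would replace with the unit-correcting \(\tilde{\vx}_t\) term. The toy problem and constants are illustrative assumptions:</p>

```python
import numpy as np

def rmsprop(grad_fn, x0, eta=0.01, beta=0.9, eps=1e-8, steps=1000):
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x)
        v = beta * v + (1 - beta) * g ** 2  # decayed, unlike AdaGrad's sum
        x -= eta / np.sqrt(v + eps) * g
    return x

x_final = rmsprop(lambda x: x, x0=[3.0])    # gradient of f(x) = 0.5 * x^2
```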
<h2 id="adam">Adam</h2>
<p>The Adam method, proposed by <a href="https://arxiv.org/abs/1412.6980">Kingma and Ba 2014</a>, improves on AdaGrad-inspired adaptive rate methods by adding both a momentum term and removing first and second moment bias from exponential decay approximations to the gradient accumulators. See <a href="https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer">TensorFlow</a> for an implementation.
\[
\begin{align}
\vm_0&=\vv_0=0\\\<br />
\vm_{t+1}&=\beta_1\vm_t+(1-\beta_1)\tilde{\nabla}_t \\\<br />
\vv_{t+1}&=\beta_2 \vv_t+(1-\beta_2)\tilde{\nabla}_t^2\\\<br />
\vx_{t+1}&=\vx_t-\eta\pa{1-\beta_2^t}^{1/2}\pa{1-\beta_1^t}^{-1}\frac{\vm_{t+1} }{\sqrt{\vv_{t+1}+\epsilon} }
\end{align}
\]
<strong>Description</strong>. Adam seeks to combine AdaGrad’s adaptivity, which can learn the curvature of the space it’s optimizing in (making it able to deal with sparse gradients), and momentum-based approaches like RMSProp, which are able to adapt to new settings during the course of the optimization. The bias correction ensures that, roughly, \(\E\ha{ \tilde{\nabla}_t^2} =\vv_t(1-\beta_2^t)+\zeta\) and analogously for \(\vm_t\), with \(\zeta\) being the error that occurs from non-stationarity in the gradient. Under the assumption that appropriate \(\beta_1,\beta_2\) are selected, such that the non-stationarity error vanishes appropriately under the exponential decay, Adam has low bias for the gradient moments \(\vm,\vv\). As the paper describes, the unbiased \(\frac{\vm_{t+1} }{\sqrt{\vv_{t+1}+\epsilon} }\) captures the <em>signal-to-noise</em> ratio for the gradient.</p>
<p><strong>Guarantees</strong>. Adam reduces to Adagrad under certain parameter settings. Like Adagrad, it has strong guarantees in an OCO setting, which are valuable but not immediately applicable here.</p>
<p><strong>Practical Notes</strong>. Given its fairly intuitive hyperparameters, Adam has pretty decent performance across the board.</p>
<ul>
<li>As before, for stability, a small \(\epsilon=10^{-8}\) is typically used.</li>
<li>AdaGrad can be recovered with an annealing \(\eta\sim t^{-1/2}\) and near-0 values for \(\beta_1\) and \(1-\beta_2\): these are recommended in the convex setting.</li>
<li>For other, nonconvex, settings \(\beta_1\) should be higher, for instance, \(0.9\). Settings for \(\beta_2\) from the paper are among \(\{0.99, 0.999, 0.9999\}\). High settings for both \(\beta_1,\beta_2\) imply stationarity in the gradient moments.</li>
<li>Though Adam and other adaptive methods might seem like empirical improvements over SGD (though they certainly don’t seem to have any better convergence guarantees in the nonconvex case), they seem to struggle with generalization error, which is the ultimate goal for our optimization. Recall the point made in the <a href="/2017/06/19/neural-network-optimization-methods.html">overview post</a> about <a href="http://leon.bottou.org/papers/bottou-bousquet-2011">Bousquet and Bottou 2007</a>: the convergence guarantees for the training loss above are only part of the overall error equation. This is still an active area of research, but intuitively we can construct training sets where adaptive methods reach poorly generalizing minima but SGD methods approach well-generalizing good ones (<a href="https://arxiv.org/abs/1705.08292">Wilson et al 2017</a>). Empirical responses to this have found that momentum-based SGD can be tuned to address the convergence speed issues but avoid generalization error qualms (<a href="https://arxiv.org/abs/1706.03471">Zhang et al 2017</a>). I would posit that SGD perhaps finds “stable” minima (ones whose generalization gap is small, conceptually minima that exist on a large, flat basin), and that momentum does not affect this approach, whereas adaptive methods might find a minimum within a narrow valley that might have better training loss, but has a large generalization gap since the valley “feature” of this cost function terrain is unstable with respect to the training set.</li>
</ul>
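<p>A minimal sketch of the Adam iterates above, including the bias-correction factor, on a toy quadratic (problem and constants are illustrative assumptions):</p>

```python
import numpy as np

def adam(grad_fn, x0, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=3000):
    x = np.array(x0, dtype=float)
    m = np.zeros_like(x)                  # first-moment (mean) estimate
    v = np.zeros_like(x)                  # second-moment estimate
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        # bias correction, written as in the update rule above
        scale = (1 - beta2 ** t) ** 0.5 / (1 - beta1 ** t)
        x -= eta * scale * m / np.sqrt(v + eps)
    return x

x_final = adam(lambda x: x, x0=[3.0])     # gradient of f(x) = 0.5 * x^2
```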
<h2 id="visualization">Visualization</h2>
<p>This visualization comes from <a href="http://sebastianruder.com/optimizing-gradient-descent/index.html">Sebastian Ruder’s related post</a>; check it out for further discussion. Note that NAG is AGD and Momentum is uncorrected momentum added to SGD.</p>
<p><img src="/assets/2017-06-20-nonconvex-first-order-methods/update-rules-viz.gif" alt="visualization of different update rules in action" class="center-image" /></p>
<h1 id="future-directions">Future Directions</h1>
<h2 id="variance-reduction">Variance Reduction</h2>
<p>A new approach, Stochastic Variance Reduced Gradient (SVRG), was developed by <a href="https://papers.nips.cc/paper/4937-accelerating-stochastic-gradient-descent-using-predictive-variance-reduction">Johnson and Zhang 2013</a>. Its analysis, for strongly convex and smooth non-composite functions, didn’t improve any long-standing convergence rates, but the idea introduced was novel: we could use stale full-gradient information \(\nabla_t\) taken occasionally to de-noise stochastic estimations \(\tilde{\nabla}_t\). We update our full gradient every \(m\) steps; let \(\tilde{\nabla}_t(\vx)\) be the stochastic estimator for the cost gradient at time \(t\) at location \(\vx\), so that the earlier default notation has \(\tilde{\nabla}_t=\tilde{\nabla}_t(\vx_t)\):
\[
\begin{align}
\bar{\vx}_t &= \begin{cases}\E_{\xi}\bar{\vx}_{t-\xi}& t\equiv 0\pmod{m} \\\\ \bar{\vx}_{t-1} & \text{otherwise}\end{cases} \\\<br />
\vg_t &= \begin{cases} \nabla f(\bar{\vx}_t) & t\equiv 0\pmod{m} \\\\ {\vg}_{t-1} & \text{otherwise}\end{cases} \\\<br />
\vx_t &= \vx_{t-1}- \eta_t\pa{\tilde{\nabla}_t (\vx_{t-1})-\tilde{\nabla}_t(\bar{\vx}_{t})+\vg_t}
\end{align}
\]
Above, \(\xi\) is a random variable supported on \([m]\). The same guarantees hold without taking expectation wrt \(\xi\) for computing \(\bar{\vx}_t\). In particular, for certain \(\xi,\eta_t\) SVRG was shown to reach an approximate critical points in \(O(dn+dn^{2/3}\epsilon^{-2})\) time, at least in the non-composite setting, simultaneously by <a href="https://arxiv.org/abs/1603.06160">Reddi et al 2016</a> and <a href="https://arxiv.org/abs/1603.05643">Allen-Zhu and Hazan 2016</a>. For these problems this improves over the GD runtime cost \(O(dn\epsilon^{-2})\).</p>
<p>Still, it’s debatable whether the \(O(d\epsilon^{-4})\) SGD is improved upon by SVRG methods, since they depend on \(n\). Datasets can be extremely large, so the \(n^{2/3}\epsilon^{-2}\) term may be prohibitive. At least in convex settings, <a href="https://arxiv.org/abs/1511.01942">Babanezhad et al 2015</a> explore using mini-batches for a variance-reduction effect. Perhaps an extension of this to non-convex costs would be what’s necessary to see SVRG applied to NNs. Right now, its use doesn’t seem to be very standard.</p>
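<p>A minimal sketch of the SVRG recursion on a toy least-squares problem (the problem, step size, and epoch length are illustrative assumptions); for simplicity, the snapshot is refreshed deterministically every \(m\) inner steps rather than via the randomized \(\xi\) rule:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_i(x, i):                  # gradient of f_i(x) = 0.5 * (a_i.x - b_i)^2
    return A[i] * (A[i] @ x - b[i])

def full_grad(x):                  # gradient of (1/n) * sum_i f_i(x)
    return A.T @ (A @ x - b) / n

def svrg(x0, eta=0.02, m=400, epochs=30):
    x = np.array(x0, dtype=float)
    for _ in range(epochs):
        x_snap, g_snap = x.copy(), full_grad(x)  # occasional stale full gradient
        for _ in range(m):
            i = rng.integers(n)
            # variance-reduced stochastic gradient: unbiased, with variance
            # that vanishes as x and x_snap approach the optimum
            v = grad_i(x, i) - grad_i(x_snap, i) + g_snap
            x -= eta * v
    return x

x_hat = svrg(np.zeros(d))
x_star = np.linalg.lstsq(A, b, rcond=None)[0]    # exact least-squares solution
```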
<h2 id="noise-injected-sgd">Noise-injected SGD</h2>
<p><strong>Noisy SGD</strong> is a surprisingly cheap and viable new solution, proposed by <a href="https://arxiv.org/abs/1503.02101">Ge et al 2015</a>, for finding approximate <em>local minima</em>. Intuitively, adding jitter to the parameters ensures that the gradient-vanishing pathology of strict saddle points won’t be a problem. In particular, even if the gradient shrinks as you near a saddle point, the jitter will be strong enough that you won’t have to spend a long time around it before escaping.</p>
<p>\[
\begin{align}
\xi_{t}&\sim \Uniform \pa{B_{1} }\\\<br />
\vx_{t+1}&=\vx_t-\eta \tilde{\nabla_{t} }+\xi_t
\end{align}
\]</p>
<p>Above, \(B_r\) is a ball centered at the origin of radius \(r\). Unfortunately, noisy SGD is merely \(O(\poly(d/\epsilon))\). Its important contribution is showing that even stochastic first order methods could feasibly be used to arrive at local minima. With additional assumptions, and removing stochasticity, this was improved by <a href="https://arxiv.org/abs/1703.00887">Jin et al 2017</a> in <strong>Perturbed Gradient Descent</strong> (PGD):
\[
\begin{align}
\xi_{t}&\sim \Uniform \pa{B_{r_t} }\\\<br />
\vx_{t+1}&=\vx_t-\eta \nabla_{t}+\xi_t
\end{align}
\]
The radius \(r_t\) is carefully chosen depending on whether or not PGD detects we are near a saddle point. Usually, it is set to 0, so the algorithm mostly behaves like GD. With some additional second-order smoothness assumptions, this runs in time \(O(nd\epsilon^{-2}\log^4d)\), showing a cheap extension of GD for finding minima. However, until a similar analysis is performed for stochastic PGD, with equally friendly results, these methods aren’t yet ready for prime time. Recent work by Chi Jin adds acceleration to PGD, improving by a factor of \(\epsilon^{1/4}\).</p>
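<p>A minimal sketch of the perturbation idea: on a toy cost with a strict saddle at the origin, plain GD initialized exactly at the saddle never moves, while the injected noise lets the iterate escape to a minimum. The cost, radius, and constants are illustrative assumptions, and uniform box noise stands in for the ball sample:</p>

```python
import numpy as np

rng = np.random.default_rng(2)

def grad(w):
    # f(x, y) = (x^2 - 1)^2 / 4 + y^2 / 2: strict saddle at the origin,
    # minima at (+1, 0) and (-1, 0)
    x, y = w
    return np.array([x * (x * x - 1), y])

def perturbed_gd(w0, eta=0.1, r=1e-3, steps=2000):
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        xi = rng.uniform(-r, r, size=2)  # crude stand-in for a ball sample
        w = w - eta * grad(w) + xi
    return w

w = perturbed_gd([0.0, 0.0])  # plain GD would sit at the saddle forever
```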
Tue, 20 Jun 2017 00:00:00 +0000
https://vlad17.github.io/2017/06/20/nonconvex-first-order-methods.html
https://vlad17.github.io/2017/06/20/nonconvex-first-order-methods.htmlmachine-learningoptimizationdeep-learningNeural Network Optimization Methods<h1 id="neural-network-optimization-methods">Neural Network Optimization Methods</h1>
<p>The goal of this post and its related sub-posts is to explore at a high level how the theoretical guarantees of the various optimization methods interact with non-convex problems in practice, where we don’t really know Lipschitz constants, the validity of the assumptions that these methods make, or appropriate hyperparameters. Obviously, a detailed treatment would require delving into intricacies of cutting-edge research. That’s not the point of this post, which just seeks to offer a theoretical survey.</p>
<p>I should also caution the reader that I’m not drawing on any of my own experience when discussing “practical” aspects of neural network (NN) optimization, but rather <a href="http://www.deeplearningbook.org/">Dr. Goodfellow’s</a>. For the most part, I’ll be summarizing sections 8.5 and 8.6 of the <a href="http://www.deeplearningbook.org/contents/optimization.html">optimization chapter</a> in that book, but I’ll throw in some relevant background and research, too. Further, one departure from practicality that I’ll be making for simplicity is not considering parallelism. All mentioned analyses assume sequential execution, and may not have obvious parallel versions. Even if they do, most bets are off.</p>
<p>In part, I’ll also try to address exactly what theoretical guarantees we do have in a NN setting. Lots of work has been done for convex and adversarial online convex optimization, and most NNs are optimized by just throwing such a method at training. Luckily, a lot of very recent work, as of this posting, has addressed exactly what happens in this situation.</p>
<h2 id="setting">Setting</h2>
<p>A NN is a real-valued circuit \(\hat{y}_\bsth\) of computationally efficient, differentiable, and Lipschitz functions parameterized by \(\bsth\). This network is trained to minimize a loss, \(J(\bsth)\), based on empirical risk minimization (ERM); this minimization is the computationally hard part of training NNs. We are given a set of supervised examples, pairs \(\vx^{(i)},y^{(i)}\) for \(i\in[n]\). Under the assumption that these pairs come from some fixed, unknown distribution, learning can be done by ERM relative to a loss \(\ell\) on our training set, which amounts to the following:
\[
\argmin_\bsth J(\bsth) = \argmin_\bsth \frac{1}{n}\sum_{i=1}^n\ell(\hat{y}_\bsth(\vx^{(i)}), y^{(i)})+\Omega(\bsth)
\]
Above, \(\Omega\) is a regularization term, added to restrict the hypothesis class; its purpose is to improve generalization. Typically, \(\Omega\) is an \(L^2\) or \(L^1\) norm. In other cases it has a more complicated implicit form, as when we perform model averaging through dropout or implicitly regularize the weights through early stopping (regularization may also be some kind of smoothing, like gradient clipping). In any case, we will assume that there exist general strategies for reducing problems with nonzero \(\Omega\) to those where it is zero (see, for example, analysis and references in <a href="https://papers.nips.cc/paper/563-a-simple-weight-decay-can-improve-generalization">Krogh and Hertz 1991</a>, <a href="http://epubs.siam.org/doi/abs/10.1137/080716542">Beck and Teboulle 2009</a>, <a href="https://arxiv.org/abs/1603.05953">Allen-Zhu 2016</a>). The presence of regularization is nuanced in general, and its application requires deeper analyses for minimization methods, but we will skirt those concerns when discussing practical behavior for the time being.</p>
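<p>To make one such reduction concrete: for \(\Omega(\vx)=\lambda\norm{\vx}_1\), the proximal-gradient approach analyzed by Beck and Teboulle handles the nonsmooth term exactly, by following each gradient step on the smooth part of the cost with a closed-form soft-thresholding step. A minimal sketch (the function names are mine):</p>

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||.||_1: shrink each coordinate toward 0."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_gradient_step(x, grad_smooth, lr, lam):
    """One ISTA-style step: gradient step on the smooth part of the cost,
    then an exact proximal step for the L1 regularizer."""
    return soft_threshold(x - lr * grad_smooth(x), lr * lam)

# e.g., one step on f(x) = 0.5 ||x||^2 + lam ||x||_1 starting from x = (2, 0)
step = prox_gradient_step(np.array([2.0, 0.0]), lambda x: x, lr=0.5, lam=1.0)
```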
<p>An initial source of confusion about the above machine learning notation is the reuse of variable names in the optimization literature, where instead our parameters \(\bsth\) are points \(\vx\) and our training point errors \(\ell(\hat{y}_\bsth(\vx^{(i)}), y^{(i)})\) are replaced with opaque Lipschitz, differentiable costs \(f_i(\vx)\). We now summarize our general task of (unconstrained) NN optimization of our nonconvex composite (regularized) cost function \(f:\R^d\rightarrow \R\):
\[
\argmin_\vx f(\vx) = \argmin_\vx \frac{1}{n}\sum_{i=1}^nf_i(\vx)+\Omega(\vx)
\]</p>
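<p>As a concrete toy instance of this objective, take squared-error costs from a linear model with an \(L^2\) regularizer (the data and constants below are made up for illustration):</p>

```python
import numpy as np

def f_i(x, a_i, y_i):
    """Per-example cost: squared error of a linear predictor."""
    return 0.5 * (a_i @ x - y_i) ** 2

def f(x, A, y, lam=0.1):
    """Composite cost: (1/n) sum_i f_i(x) + Omega(x), with the
    regularizer Omega(x) = (lam/2) ||x||^2 (weight decay)."""
    n = len(y)
    risk = sum(f_i(x, A[i], y[i]) for i in range(n)) / n
    return risk + 0.5 * lam * x @ x

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))
true_x = np.array([1.0, -2.0, 0.5])
y = A @ true_x + 0.1 * rng.normal(size=50)
cost_origin = f(np.zeros(3), A, y)  # cost at the all-zeros parameter vector
```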
<p>When reading the literature that inspired these algorithms, it’s important to keep straight the various types of minimization problems being solved, and whether their assumptions are incompatible with the NN environment.</p>
<ul>
<li>Many algorithms are inspired by the general convex \(f\) case. NN losses are usually not convex.</li>
<li>Sometimes, full gradients \(\nabla f\) are assumed. A full gradient is intractable for NNs, as it requires a pass through the entire \(n\)-sized dataset. We are looking for stochastic approximations to the gradient, \(\E\ha{\tilde{\nabla} f}=\nabla f\).</li>
<li>Some algorithms assume \(\Omega = 0\), but that’s not usually the case.</li>
</ul>
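<p>The stochastic-approximation requirement in the second bullet holds trivially for the uniform-index estimator \(\tilde{\nabla} f = \nabla f_i\): its expectation is the average of the per-example gradients, which is exactly the full gradient. A quick numerical check on a toy least-squares sum (all data made up):</p>

```python
import numpy as np

# toy finite sum: f_i(x) = 0.5 (a_i . x - y_i)^2
rng = np.random.default_rng(0)
n, d = 100, 4
A = rng.normal(size=(n, d))
y = rng.normal(size=n)
x = rng.normal(size=d)

def grad_i(i):
    """Gradient of the i-th per-example cost at x."""
    return (A[i] @ x - y[i]) * A[i]

# E over a uniform index i of grad f_i is the average per-example gradient,
# which matches the analytic full gradient (1/n) A^T (A x - y).
stoch_mean = np.mean([grad_i(i) for i in range(n)], axis=0)
full_grad = A.T @ (A @ x - y) / n
```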
<h2 id="theoretical-convergence">Theoretical Convergence</h2>
<p>We’ll be looking at gradient descent (GD) optimization algorithms, which assume an initial point \(\vx_0\) and move in nearby directions to reduce the cost. As such, basically all asymptotic rates contain a hidden constant multiplicative term \(f(\vx_0) - \inf f\).</p>
<h3 id="problem-specification">Problem Specification</h3>
<p>Before discussing speed, it’s important to know what constitutes a solution. Globally minimizing a possibly non-convex function such as a deep NN loss is NP-hard. Even finding an approximate local minimum of just a quartic multivariate polynomial, or showing its convexity, is NP-hard (<a href="https://arxiv.org/abs/1012.1908">Ahmadi et al 2010</a>).</p>
<p>What we do, in theory, at least, is instead merely find <strong>approximate critical points</strong>; i.e., a typical non-convex optimization algorithm returns a point \(\vx_{*}\) satisfying \(\norm{\nabla f(\vx_*)}\le \epsilon\). This is an <strong>incredibly weak</strong> requirement: for NNs, there are significantly more saddle points than local minima, and these saddles have high cost. Luckily, local minima, as opposed to saddles, actually concentrate around the global minimum cost for NNs, so recent cutting-edge methods that find approximate local minima are worth keeping in mind. An approximate local minimum \(\vx_*\) has a neighborhood such that any \(\vx\) in that neighborhood will have \(f(\vx_*)-f(\vx)\le \epsilon\). <a href="https://github.com/vlad17/ml-notes/blob/master/deep-learning/optimization.pdf">See extended discussion here.</a></p>
<p>We’ll assume that \(f\) is differentiable and Lipschitz. Even though ReLU activations and \(L^1\) regularization may technically invalidate the differentiability, these functions have well-defined <strong>subgradients</strong> that respect the <a href="http://web.stanford.edu/class/msande318/notes/notes-first-order-nonsmooth.pdf">GD properties that we care about</a>. Certain algorithms might further assume \(f\in\mathcal{C}^2\) and that the Hessian is operator-norm Lipschitz or bounded.</p>
<p>There are two main runtime costs. The first is the desired degree of accuracy, \(\epsilon\). The second is due to the dimensionality of our input, \(d\). Ignoring representation issues, thanks to the circuit structure of \(f\), we can evaluate, for any \(i\in[n]\) and \(\vv\in\R^d\), all of \(f_i(\vx), \nabla f_i(\vx), {\nabla^2 f_i(\vx)} \vv\) in \(O(d)\) time. Finally, since gradients of \(f_i\) approximate gradients of \(f\) only <em>in expectation</em>, reported runtimes are usually worst-case bounds on the time by which we <em>expect</em> to arrive at an approximate stationary point (expectation taken over the random uniform selection of \(i\) in SGD).</p>
<h3 id="fundamental-lower-bounds">Fundamental Lower Bounds</h3>
<p>First, unless \(\ptime = \nptime\), we expect runtime to be at least \(\Omega\pa{\log\frac{d}{\epsilon}}\) due to the aforementioned hardness results.
</p>
<p>Less obviously, convex optimization lower bounds for smooth functions imply that any first-order non-convex algorithm requires at least \(\Omega(1/\epsilon)\) gradient steps (<a href="https://arxiv.org/abs/1405.4980">Bubeck 2014</a>, see also notes <a href="http://www.stat.cmu.edu/~larry/=sml/optrates.pdf">here</a> and <a href="http://www.cs.cmu.edu/~suvrit/teach/aaditya_lect23.pdf">here</a>). Note that this is nowhere near polynomial time in the bit size of \(\epsilon\)!</p>
<p>See also <a href="http://ieeexplore.ieee.org/document/585893/">Wolpert and Macready 1997</a>, <a href="https://papers.nips.cc/paper/125-training-a-3-node-neural-network-is-np-complete">Blum and Rivest 1988</a>, and a recent re-visiting of the topic in <a href="https://arxiv.org/abs/1410.1141">Livni et al 2014</a>. In other words, general non-convex optimization time lower bounds are too broad to apply usefully to NN, but specific approaches to fixed architectures may be appropriate.</p>
<h3 id="limitations-of-theoretical-descriptions">Limitations of Theoretical Descriptions</h3>
<p>There are a couple of limitations in using asymptotic, theoretical descriptions of convergence rates to analyze these algorithms.</p>
<p>First, the \(\epsilon\) in \(\epsilon\)-approximate critical points above is merely a small piece of the overall generalization error that the NN will experience. As explained in <a href="https://papers.nips.cc/paper/3323-the-tradeoffs-of-large-scale-learning">Bousquet and Bottou 2007</a> (<a href="http://leon.bottou.org/papers/bottou-bousquet-2011">extended version</a>), the generalization error breaks into approximation error (how accurately the entire function class of neural networks for a fixed architecture can represent the true function we’re learning), estimation error (how far we are from a global optimum within our hypothesis class of functions), and optimization error (our convergence tolerance). As that paper cautions, the tradeoff among these errors implies that even improvements in optimization convergence rate, like the use of full GD instead of stochastic GD (SGD), may not be helpful if they increase the other errors in hidden ways.</p>
<p>Second, early stopping might prevent convergence altogether: as mentioned in Goodfellow’s book, gradient norms can increase while training error decreases. It’s unclear whether we can fold early stopping in as an implicit term in \(\Omega\) and claim that we’re reaching a critical point of this virtual cost function.</p>
<p>The fact that theoretical lower-bound rates are far from the NN training times we see in practice shows that there is a wide gap between finding approximate local minima of general smooth non-convex functions and the same problem specialized to NNs.</p>
<h2 id="existing-algorithms">Existing Algorithms</h2>
<p>In the blog posts linked below, I will review the high-level details of existing algorithms for NN non-convex optimization. Most of these are methods that were developed for the composite <em>convex</em> smooth optimization problem, so they may not have any theoretical guarantees for the \(\epsilon\)-approximate critical point or local minimum problem. Indeed, we find a general dichotomy between these GD algorithms:</p>
<ul>
<li>Algorithms which are practically available, e.g., <a href="https://www.tensorflow.org/api_guides/python/train">TensorFlow’s first order methods</a>, but were initially developed for convex problems, and whose non-convex interpretations are usually only approximate critical point finders</li>
<li>Algorithms which are (as of June 2017) cutting-edge research and not widely available, yet have been designed for finding local minima efficiently in non-convex settings. Nonetheless, they’re still useful to mention since the respective paper implementations might be available and it may be worthwhile to manually implement the optimization, too.</li>
</ul>
<p>This list of existing algorithms will be somewhat redundant with the review in <a href="https://arxiv.org/abs/1609.04747">Ruder 2016</a>, but my intention is to be more comprehensive and rigorous, though less didactic, in terms of the update rules covered.</p>
<p>In general, all these rules have the format \(\vx_{t+1}=\vx_t-\eta_t\vg_t\) where \(\eta_t\) is a learning rate and \(\vg_t\) is the gradient descent direction, both making a small local improvement at the \(t\)-th discrete time. Theoretical analysis won’t be presented, but guarantees, assumptions, intuition, and update rules will be described. Proofs will be linked.</p>
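<p>Every rule covered in the linked post specializes this skeleton through its choice of \(\vg_t\) and \(\eta_t\). A bare-bones version with uniform-index stochastic directions, a constant learning rate, and the \(\epsilon\)-approximate critical point stopping rule from earlier (the test problem and constants are illustrative; checking the full gradient every step costs a full pass over the data and is done here only for clarity):</p>

```python
import numpy as np

def sgd(grad_i, n, x0, lr=0.05, eps=1e-4, max_steps=10_000, seed=0):
    """Generic x_{t+1} = x_t - eta_t g_t loop with stochastic descent
    directions, stopping once the full gradient norm drops below eps."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for t in range(max_steps):
        full = np.mean([grad_i(i, x) for i in range(n)], axis=0)
        if np.linalg.norm(full) <= eps:  # eps-approximate critical point
            return x, t
        i = rng.integers(n)          # uniform index, so E[g_t] = grad f(x_t)
        x = x - lr * grad_i(i, x)
    return x, max_steps

# noiseless least squares: every f_i is minimized at the same x*
rng = np.random.default_rng(1)
A = rng.normal(size=(30, 2))
y = A @ np.array([2.0, -1.0])
grad_i = lambda i, x: (A[i] @ x - y[i]) * A[i]
x_star, steps = sgd(grad_i, 30, np.zeros(2))
```

<p>On this toy problem the iterates converge to the shared minimizer, so the stopping rule fires well before the step budget is exhausted.</p>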
<ul>
<li><a href="/2017/06/20/nonconvex-first-order-methods.html">First order methods</a></li>
</ul>
<p>Technically, most neural networks don’t have smoothness or even differentiability everywhere. While those issues don’t seem to surface in practice, it turns out we can <a href="https://arxiv.org/abs/1804.07795">still make some strong statements</a> about first-order optimization methods.</p>
Mon, 19 Jun 2017 00:00:00 +0000
https://vlad17.github.io/2017/06/19/neural-network-optimization-methods.html
https://vlad17.github.io/2017/06/19/neural-network-optimization-methods.htmlmachine-learningoptimizationdeep-learningJupyter Tricks<h1 id="jupyter-tricks">Jupyter Tricks</h1>
<p>Here’s a list of my top-used Jupyter tricks, and what they do.</p>
<h2 id="ui">UI</h2>
<p>I find the UI to be intuitive, <code class="highlighter-rouge">Help > User Interface Tour</code> describes more. There are <strong>command</strong> (enter by pressing the escape button or clicking outside of a cell) and <strong>edit</strong> (enter by typing in a cell) modes. You can tell you’re in edit mode if the “pencil” corner indicator is present:</p>
<p><img src="/assets/2017-05-25-jupyter-tricks/corner-indicator.png" alt="corner indicator symbol" class="center-image" /></p>
<p>It’s also faster to use the commands as listed in <code class="highlighter-rouge">Help > Keyboard Shortcuts</code>; with those you can also remove the toolbar with <code class="highlighter-rouge">View > Toggle Toolbar</code>.</p>
<p><code class="highlighter-rouge">jupyter notebook existing-notebook.ipynb</code> - auto-launch an existing notebook without the Jupyter menu.</p>
<h2 id="remote-serving">Remote Serving</h2>
<p>Run the kernel on a beefy server, view with a browser on your laptop. You can change the ports appropriately to something high and unused.</p>
<ol>
<li>On laptop, initiate SSH with a tunnel <code class="highlighter-rouge">ssh -L8888:localhost:12321 vlad@my-beefy-server.com</code></li>
<li>On server, launch <code class="highlighter-rouge">tmux</code> if you’d like to persist the Jupyter server (useful if you need to keep running stuff and reconnect notebook later).</li>
<li>On server, <code class="highlighter-rouge">jupyter notebook --no-browser --port=12321</code></li>
<li>On laptop, navigate to <code class="highlighter-rouge">localhost:8888</code> in-browser.</li>
</ol>
<h2 id="tex">TeX</h2>
<p>The way I use TeX in Jupyter notebook depends on the end goal of the notebook itself.</p>
<ul>
<li><a href="#embedded-tex">Embedded (Math) TeX</a>. Here, the end product is the <strong>*.ipynb notebook itself</strong>, in which case the <em>TeX is auxiliary to the code</em>, placed only to elucidate the math involved.</li>
<li><a href="#jupyter-prepared-reports">Jupyter-prepared Reports</a>. Here, the end product is a prepared <strong>*.pdf report</strong>, in which case the <em>code is auxiliary to the TeX</em>, with Jupyter used to create the source code for a more formal report or document, one you would print out.</li>
</ul>
<p>In both of the above use cases, one may find it useful to generate rendered <a href="#tex-from-code">TeX from code</a>.</p>
<h3 id="embedded-tex">Embedded TeX</h3>
<p>Use MathJax in Markdown cells. In the first equation you can add convenience <code class="highlighter-rouge">\newcommand</code> items if you prefer (MathJax will evaluate cells top down).</p>
<p><em>Note:</em> The magic <code class="highlighter-rouge">%%latex</code> works, but I don’t use it. It’s treated like a code cell, but we’re really only interested in the output in this setting. Finally, you can embed images right in the Markdown, too, with <code class="highlighter-rouge">&lt;img src="path/to/image.png"&gt;</code>.</p>
<h3 id="jupyter-prepared-reports">Jupyter-prepared Reports</h3>
<p>In this setting, use raw input cells to create segments of <code class="highlighter-rouge">LaTeX</code> code. This won’t render within the notebook, but this is OK since we’re treating the notebook like source in this setting. To generate a report (with the inline evaluated code), I use <code class="highlighter-rouge">nbconvert</code>. For prepared scripts, check out <a href="https://github.com/vlad17/ipython-latex">my ipython-latex repo</a>.</p>
<h3 id="tex-from-code">TeX from Code</h3>
<p><img src="/assets/2017-05-25-jupyter-tricks/math.png" alt="generated math" class="center-image" /></p>
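<p>One generic way to produce output like the image above is to build a TeX string programmatically and hand it to IPython’s display machinery. The helper below is my own illustration of the pattern, not necessarily how that image was generated; in a notebook you would pass its output to <code class="highlighter-rouge">display(Math(...))</code> from <code class="highlighter-rouge">IPython.display</code>:</p>

```python
def poly_latex(coeffs):
    """Render polynomial coefficients (constant term first) as a LaTeX string."""
    terms = []
    for i, c in enumerate(coeffs):
        if c == 0:
            continue  # skip vanished terms entirely
        if i == 0:
            terms.append(f"{c}")
        elif i == 1:
            terms.append(f"{c}x")
        else:
            terms.append(f"{c}x^{{{i}}}")
    return " + ".join(terms) or "0"

latex = poly_latex([3, 0, 2])
# in a notebook: from IPython.display import Math, display; display(Math(latex))
```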
<h2 id="logging">Logging</h2>
<p>Bring logging to the cell output:</p>
<pre><code class="language-{python}">import logging

# attach a handler so log records show up in the cell output
logging.getLogger().addHandler(logging.StreamHandler())
# logger names are strings; raise the level for the module you care about
logging.getLogger("some_module.that.I.want.logged").setLevel(logging.INFO)
</code></pre>
<h2 id="matplotlib">Matplotlib</h2>
<pre><code class="language-{python}">%matplotlib inline
import matplotlib.pyplot as plt
plt.rc('font', family='serif', serif='Computer Modern Roman')
plt.rc('text', usetex=True)
</code></pre>
<p>This preamble will render generated Matplotlib objects in the Jupyter HTML. The font specified above is consistent with the LaTeX generated by ipython-latex in Jupyter-prepared reports. Notably, this is going to differ from the MathJax font that is used inside Markdown cells.</p>
<h2 id="magic">Magic</h2>
<p>Docs are accessible with <code class="highlighter-rouge">%&lt;magic&gt;?</code>.</p>
<ul>
<li><code class="highlighter-rouge">x = !! echo hi</code> - run a bash command in a subshell, save stdout in returned string, split on newlines (<code class="highlighter-rouge">!</code> for no split)</li>
<li><code class="highlighter-rouge">%%bash</code> - run cell as bash</li>
<li><code class="highlighter-rouge">%%timeit</code> - time cell</li>
<li><code class="highlighter-rouge">? f</code> - get docstring</li>
<li><code class="highlighter-rouge">?? f</code> - get source</li>
<li><code class="highlighter-rouge">%run nb.ipynb</code> - line magic, runs notebook</li>
<li><code class="highlighter-rouge">%pdb</code> - run debugger on cell evaluation</li>
<li><code class="highlighter-rouge">%env ENV_VAR=3</code> - set environment variable in kernel</li>
<li><a href="http://arogozhnikov.github.io/2016/09/10/jupyter-features.html#Profiling:-%prun,-%lprun,-%mprun">Cell code profiling</a>.</li>
</ul>
<h2 id="extensions">Extensions</h2>
<p>These are the extensions I find useful. Respectively, below, they hide cell input, auto-format code, toggle font size, auto-comment regions, spell check, and control the cutoff at which output starts scrolling.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>conda install -c conda-forge jupyter_contrib_nbextensions
pip install yapf # for code-prettification
for i in hide_input/main code_prettify/code_prettify code_font_size/code_font_size comment-uncomment/main spellchecker/main autoscroll/main; do jupyter nbextension enable $i ; done
</code></pre></div></div>
Thu, 25 May 2017 00:00:00 +0000
https://vlad17.github.io/2017/05/25/jupyter-tricks.html
https://vlad17.github.io/2017/05/25/jupyter-tricks.htmltoolsMy Princeton Senior Thesis<h1 id="my-princeton-senior-thesis">My Princeton Senior Thesis</h1>
<p><strong>Submitted to the university as part of completion of Computer Science BSE degree</strong> June 2017</p>
<p>Completed during the 2016-2017 academic year.</p>
<p><a href="https://arxiv.org/abs/1705.10813">A concise and more up-to-date paper version.</a></p>
<p><a href="/assets/2017-05-23-my-princeton-senior-thesis/thesis.pdf">Link to download report.</a></p>
<p><a href="https://github.com/vlad17/runlmc">Code repository.</a></p>
Tue, 23 May 2017 00:00:00 +0000
https://vlad17.github.io/2017/05/23/my-princeton-senior-thesis.html
https://vlad17.github.io/2017/05/23/my-princeton-senior-thesis.htmlmy-whitepapersmachine-learning