Vlad Feinberg
Vlad's Blog
https://vlad17.github.io/
Sun, 29 Jul 2018 21:53:20 +0000
Jekyll v3.7.3
<h1 id="beating-tensorflow-training-in-vram">Beating TensorFlow Training in-VRAM</h1>
<p>In this post, I’d like to introduce a technique that I’ve found helps accelerate mini-batch SGD training in my use case. I suppose this post could also be read as a public grievance directed towards the TensorFlow Dataset API optimizing for the large vision deep learning use-case, but maybe I’m just not hitting the right incantation to get <code class="highlighter-rouge">tf.Dataset</code> working (in which case, <a href="https://github.com/vlad17/vlad17.github.io/issues/new">drop me a line</a>). The solution is to TensorFlow <em>harder</em> anyway, so this shouldn’t really be read as a complaint.</p>
<p>Nonetheless, if you are working with a new-ish GPU that has enough memory to hold a decent portion of your data alongside your neural network, you may find the final training approach I present here useful. The experiments I’ve run fall exactly in line with this “in-VRAM” use case (in particular, I’m training deep reinforcement learning value and policy networks on semi-toy environments, whose training profile is many iterations of training on a small replay buffer of examples). For some more context, you may want to check out an article on the <a href="https://reinforce.io/blog/end-to-end-computation-graphs-for-reinforcement-learning/">TensorForce blog</a>, which suggests that RL people should be building more of their TF graphs like this.</p>
<p>Briefly, if you have a dataset that fits into a GPU’s memory, you’re giving away a lot of speed with the usual TensorFlow pipelining or data-feeding approach, where the CPU delivers mini-batches whose forward/backward passes are computed on GPUs. This gets worse as you move to pricier GPUs, where the ratio of CPU-to-GPU bandwidth to GPU compute speed drops. It’s a pretty easy change for a 2x speedup.</p>
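<p>As a rough back-of-envelope illustration (the ~12 GB/s effective PCIe bandwidth figure and float64 inputs are my assumptions, not measurements from the benchmark below), a single mini-batch at these sizes is small enough that per-transfer latency, not bandwidth, dominates the copy cost:</p>

```python
# Rough per-minibatch host-to-device transfer estimate for the benchmark's
# shapes: 512 examples x 64 features, assumed float64 (8 bytes each).
batch_bytes = 512 * 64 * 8                 # 262144 bytes, i.e. 256 KiB
pcie_bytes_per_sec = 12e9                  # assumed effective PCIe bandwidth
bandwidth_sec = batch_bytes / pcie_bytes_per_sec
print(batch_bytes, bandwidth_sec)
# Pure transfer time is on the order of 2e-5 s; per-call copy/launch
# latencies of tens of microseconds are comparable to or larger than that,
# so at this scale each feed_dict round trip is latency-bound.
```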
<h2 id="punchline">Punchline</h2>
<p>Let’s get to it. With numbers similar to my use case, 5 epochs of training take about <strong>16 seconds</strong> with the standard <code class="highlighter-rouge">feed_dict</code> approach, <strong>12-20 seconds</strong> with the TensorFlow Dataset API, and <strong>8 seconds</strong> with a custom TensorFlow control-flow construct.</p>
<p>This was tested on an Nvidia Tesla P100 with a compiled TensorFlow 1.4.1 (CUDA 9, cuDNN 7), Python 3.5. Here is the <a href="https://gist.github.com/vlad17/5d67eef9fb06c6a679aeac6d07b4dc9c">test script</a>. I didn’t test it too many times (<a href="https://gist.github.com/vlad17/f43dba5783adfc21b1abab520dd2a8f1">exec trace</a>). Feel free to change the data sizes to see if the proposed approach would still help in your setting.</p>
<p>Let’s fix the toy benchmark supervised task we’re looking at:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">tensorflow</span> <span class="k">as</span> <span class="n">tf</span>
<span class="c"># pretend we don't have X, Y available until we're about</span>
<span class="c"># to train the network, so we have to use placeholders. This is the case</span>
<span class="c"># in, e.g., RL.</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">1234</span><span class="p">)</span>
<span class="c"># suffix tensors with their shape</span>
<span class="c"># n = number of data points, x = x dim, y = y dim</span>
<span class="n">X_nx</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">uniform</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="mi">1000</span> <span class="o">*</span> <span class="mi">1024</span><span class="p">,</span> <span class="mi">64</span><span class="p">))</span>
<span class="n">Y_ny</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">column_stack</span><span class="p">([</span><span class="n">X_nx</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span>
<span class="n">np</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="n">X_nx</span><span class="p">)</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span>
<span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">X_nx</span><span class="p">)</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span>
<span class="n">np</span><span class="o">.</span><span class="n">sin</span><span class="p">(</span><span class="n">X_nx</span><span class="p">)</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)])</span>
<span class="n">nbatches</span> <span class="o">=</span> <span class="mi">10000</span> <span class="c"># == 5 epochs at 512 batch</span>
<span class="n">batch_size</span> <span class="o">=</span> <span class="mi">512</span></code></pre></figure>
<h3 id="vanilla-approach">Vanilla Approach</h3>
<p>This is the (docs-discouraged) approach that everyone really uses for training. Prepare a mini-batch on the CPU, ship it off to the GPU. <em>Note code here and below is excerpted (see the test script link above for the full code). It won’t work if you just copy it.</em></p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># b = batch size</span>
<span class="n">input_ph_bx</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">placeholder</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">float32</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="p">[</span><span class="bp">None</span><span class="p">,</span> <span class="n">X_nx</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]])</span>
<span class="n">output_ph_by</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">placeholder</span><span class="p">(</span><span class="n">tf</span><span class="o">.</span><span class="n">float32</span><span class="p">,</span> <span class="n">shape</span><span class="o">=</span><span class="p">[</span><span class="bp">None</span><span class="p">,</span> <span class="n">Y_ny</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]])</span>
<span class="c"># mlp = a depth 5 width 32 MLP net</span>
<span class="n">pred_by</span> <span class="o">=</span> <span class="n">mlp</span><span class="p">(</span><span class="n">input_ph_bx</span><span class="p">)</span>
<span class="n">tot_loss</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">losses</span><span class="o">.</span><span class="n">mean_squared_error</span><span class="p">(</span><span class="n">output_ph_by</span><span class="p">,</span> <span class="n">pred_by</span><span class="p">)</span>
<span class="n">update</span> <span class="o">=</span> <span class="n">adam</span><span class="o">.</span><span class="n">minimize</span><span class="p">(</span><span class="n">tot_loss</span><span class="p">)</span>
<span class="k">with</span> <span class="n">tf</span><span class="o">.</span><span class="n">Session</span><span class="p">()</span> <span class="k">as</span> <span class="n">sess</span><span class="p">:</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">nbatches</span><span class="p">):</span>
<span class="n">batch_ix_b</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="n">X_nx</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,))</span>
<span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">update</span><span class="p">,</span> <span class="n">feed_dict</span><span class="o">=</span><span class="p">{</span>
<span class="n">input_ph_bx</span><span class="p">:</span> <span class="n">X_nx</span><span class="p">[</span><span class="n">batch_ix_b</span><span class="p">],</span>
<span class="n">output_ph_by</span><span class="p">:</span> <span class="n">Y_ny</span><span class="p">[</span><span class="n">batch_ix_b</span><span class="p">]})</span></code></pre></figure>
<p>This drops whole-dataset loss from around 4500 to around 4, taking around <strong>16 seconds</strong> for training. You might worry that random-number generation might be taking a while, but excluding that doesn’t drop the time more than <strong>0.5 seconds</strong>.</p>
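<p>To check this claim directly, here is a minimal sketch (sizes copied from the benchmark above) that times just the index generation from the training loop:</p>

```python
import time

import numpy as np

# Time only the minibatch index generation from the feed_dict loop,
# using the same sizes as the benchmark (n = 1000 * 1024, 10000 batches).
n, batch_size, nbatches = 1000 * 1024, 512, 10000
rng = np.random.RandomState(1234)
start = time.time()
for _ in range(nbatches):
    batch_ix_b = rng.randint(n, size=(batch_size,))
elapsed = time.time() - start
# On the hardware I tried, this takes a small fraction of a second,
# consistent with RNG accounting for well under 0.5 s of the 16 s total.
```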
<h3 id="dataset-api-approach">Dataset API Approach</h3>
<p>With the dataset API, we set up a pipeline where TensorFlow orchestrates some dataflow by synergizing more buzzwords on its worker threads. This should constantly feed the GPU by staging the next mini-batch while the current one is sitting on the GPU. This might be the case when there’s a lot of data, but it doesn’t seem to work very well when the data is small and GPU-CPU latency, not throughput, is the bottleneck.</p>
<p>Another unpleasant thing to deal with is that all those orchestrated workers and staging areas and buffers and shuffle queues need magic constants to work well. I tried my best, but performance seems very sensitive to these constants in this use case. This could be fixed if Dataset detected (or could be told) that it fits on the GPU, and placed itself there.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># make training dataset, which should swallow the entire dataset once</span>
<span class="c"># up-front and then feed it in mini-batches to the GPU</span>
<span class="c"># presumably since we only need to feed stuff in once it'll be faster</span>
<span class="n">ds</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">data</span><span class="o">.</span><span class="n">Dataset</span><span class="o">.</span><span class="n">from_tensor_slices</span><span class="p">((</span><span class="n">input_ph_nx</span><span class="p">,</span> <span class="n">output_ph_ny</span><span class="p">))</span>
<span class="n">ds</span> <span class="o">=</span> <span class="n">ds</span><span class="o">.</span><span class="n">repeat</span><span class="p">()</span>
<span class="n">ds</span> <span class="o">=</span> <span class="n">ds</span><span class="o">.</span><span class="n">shuffle</span><span class="p">(</span><span class="n">buffer_size</span><span class="o">=</span><span class="n">bufsize</span><span class="p">)</span>
<span class="n">ds</span> <span class="o">=</span> <span class="n">ds</span><span class="o">.</span><span class="n">batch</span><span class="p">(</span><span class="n">batch_size</span><span class="p">)</span>
<span class="c"># magic that Zongheng Yang (http://zongheng.me/) suggested I add that was</span>
<span class="c"># necessary to keep this from being *worse* than feed_dict</span>
<span class="n">ds</span> <span class="o">=</span> <span class="n">ds</span><span class="o">.</span><span class="n">prefetch</span><span class="p">(</span><span class="n">buffer_size</span><span class="o">=</span><span class="p">(</span><span class="n">batch_size</span> <span class="o">*</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">it</span> <span class="o">=</span> <span class="n">ds</span><span class="o">.</span><span class="n">make_initializable_iterator</span><span class="p">()</span>
<span class="c"># reddit user ppwwyyxx further suggests folding training into a single call</span>
<span class="k">def</span> <span class="nf">while_fn</span><span class="p">(</span><span class="n">t</span><span class="p">):</span>
<span class="k">with</span> <span class="n">tf</span><span class="o">.</span><span class="n">control_dependencies</span><span class="p">([</span><span class="n">t</span><span class="p">]):</span>
<span class="n">next_bx</span><span class="p">,</span> <span class="n">next_by</span> <span class="o">=</span> <span class="n">it</span><span class="o">.</span><span class="n">get_next</span><span class="p">()</span>
<span class="n">pred_by</span> <span class="o">=</span> <span class="n">mlp</span><span class="p">(</span><span class="n">next_bx</span><span class="p">)</span>
<span class="n">loss</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">losses</span><span class="o">.</span><span class="n">mean_squared_error</span><span class="p">(</span><span class="n">next_by</span><span class="p">,</span> <span class="n">pred_by</span><span class="p">)</span>
<span class="n">update</span> <span class="o">=</span> <span class="n">adam</span><span class="o">.</span><span class="n">minimize</span><span class="p">(</span><span class="n">loss</span><span class="p">)</span>
<span class="k">with</span> <span class="n">tf</span><span class="o">.</span><span class="n">control_dependencies</span><span class="p">([</span><span class="n">update</span><span class="p">]):</span>
<span class="k">return</span> <span class="n">t</span> <span class="o">+</span> <span class="mi">1</span>
<span class="n">training</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">while_loop</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="n">t</span> <span class="o">&lt;</span> <span class="n">nbatches</span><span class="p">,</span>
<span class="n">while_fn</span><span class="p">,</span> <span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">back_prop</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="k">with</span> <span class="n">tf</span><span class="o">.</span><span class="n">Session</span><span class="p">()</span> <span class="k">as</span> <span class="n">sess</span><span class="p">:</span>
<span class="n">fd</span> <span class="o">=</span> <span class="p">{</span><span class="n">input_ph_nx</span><span class="p">:</span> <span class="n">X_nx</span><span class="p">,</span> <span class="n">output_ph_ny</span><span class="p">:</span> <span class="n">Y_ny</span><span class="p">}</span>
<span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">it</span><span class="o">.</span><span class="n">initializer</span><span class="p">,</span> <span class="n">feed_dict</span><span class="o">=</span><span class="n">fd</span><span class="p">)</span>
<span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">training</span><span class="p">)</span></code></pre></figure>
<p>For a small <code class="highlighter-rouge">bufsize</code>, like <code class="highlighter-rouge">1000</code>, this trains in around <strong>12 seconds</strong>. But then it’s not actually shuffling the data well (each data point can move by at most about 1000 positions). Still, the loss drops from around 4500 to around 4, as in the <code class="highlighter-rouge">feed_dict</code> case. A large <code class="highlighter-rouge">bufsize</code> like <code class="highlighter-rouge">1000000</code>, which you’d think should effectively move the dataset onto the GPU entirely, performs <em>worse</em> than <code class="highlighter-rouge">feed_dict</code> at around <strong>20 seconds</strong>.</p>
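<p>The weak shuffling is inherent to buffer-based shuffling, not a TensorFlow quirk. Here is a pure-Python sketch of the buffer semantics (my simplified model of <code class="highlighter-rouge">Dataset.shuffle</code>, not TensorFlow’s actual implementation), showing that an element can be emitted at most about <code class="highlighter-rouge">bufsize</code> positions earlier than where it started:</p>

```python
import random

def buffer_shuffle(iterable, bufsize, seed=0):
    # Mimics a shuffle buffer: keep up to bufsize elements, emit a uniformly
    # random one whenever the buffer overflows, then drain at the end.
    rng = random.Random(seed)
    buf = []
    for item in iterable:
        buf.append(item)
        if len(buf) > bufsize:
            yield buf.pop(rng.randrange(len(buf)))
    while buf:
        yield buf.pop(rng.randrange(len(buf)))

out = list(buffer_shuffle(range(20000), bufsize=1000))
# An element at source position i cannot be emitted before output position
# i - bufsize, so a small buffer only ever shuffles locally.
max_early = max(i - pos for pos, i in enumerate(out))
```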
<p>I don’t think I’m unfair in counting <code class="highlighter-rouge">it.initializer</code> time in my benchmark (which isn’t that toy, either, since it’s similar to my RL use case size). All the training methods need to load the data onto the GPU, and the data isn’t available until run time.</p>
<h3 id="using-a-tensorflow-loop">Using a TensorFlow Loop</h3>
<p>This post isn’t a tutorial on <code class="highlighter-rouge">tf.while_loop</code> and friends, but this code does what was promised: just feed everything once into the GPU and do all your epochs without asking for permission to continue from the CPU.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># generate random batches up front</span>
<span class="c"># i = iterations</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">input_ph_nx</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">batches_ib</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">random_uniform</span><span class="p">((</span><span class="n">nbatches</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">),</span> <span class="mi">0</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">tf</span><span class="o">.</span><span class="n">int32</span><span class="p">)</span>
<span class="c"># use a fold + control deps to make sure we only train on the next batch</span>
<span class="c"># after we're done with the first</span>
<span class="k">def</span> <span class="nf">fold_fn</span><span class="p">(</span><span class="n">prev</span><span class="p">,</span> <span class="n">batch_ix_b</span><span class="p">):</span>
<span class="n">X_bx</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">gather</span><span class="p">(</span><span class="n">input_ph_nx</span><span class="p">,</span> <span class="n">batch_ix_b</span><span class="p">)</span>
<span class="n">Y_by</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">gather</span><span class="p">(</span><span class="n">output_ph_ny</span><span class="p">,</span> <span class="n">batch_ix_b</span><span class="p">)</span>
<span class="c"># removing control deps here probably gives you Hogwild!</span>
<span class="k">with</span> <span class="n">tf</span><span class="o">.</span><span class="n">control_dependencies</span><span class="p">([</span><span class="n">prev</span><span class="p">]):</span>
<span class="n">pred_by</span> <span class="o">=</span> <span class="n">mlp</span><span class="p">(</span><span class="n">X_bx</span><span class="p">)</span>
<span class="n">loss</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">losses</span><span class="o">.</span><span class="n">mean_squared_error</span><span class="p">(</span><span class="n">Y_by</span><span class="p">,</span> <span class="n">pred_by</span><span class="p">)</span>
<span class="k">with</span> <span class="n">tf</span><span class="o">.</span><span class="n">control_dependencies</span><span class="p">([</span><span class="n">adam</span><span class="o">.</span><span class="n">minimize</span><span class="p">(</span><span class="n">loss</span><span class="p">)]):</span>
<span class="k">return</span> <span class="n">tf</span><span class="o">.</span><span class="n">constant</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="n">training</span> <span class="o">=</span> <span class="n">tf</span><span class="o">.</span><span class="n">foldl</span><span class="p">(</span><span class="n">fold_fn</span><span class="p">,</span> <span class="n">batches_ib</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">back_prop</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="k">with</span> <span class="n">tf</span><span class="o">.</span><span class="n">Session</span><span class="p">()</span> <span class="k">as</span> <span class="n">sess</span><span class="p">:</span>
<span class="n">fd</span> <span class="o">=</span> <span class="p">{</span><span class="n">input_ph_nx</span><span class="p">:</span> <span class="n">X_nx</span><span class="p">,</span> <span class="n">output_ph_ny</span><span class="p">:</span> <span class="n">Y_ny</span><span class="p">}</span>
<span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">training</span><span class="p">,</span> <span class="n">feed_dict</span><span class="o">=</span><span class="n">fd</span><span class="p">)</span></code></pre></figure>
<p>This one crushes at around <strong>8 seconds</strong>, dropping loss again from around 4500 to around 4.</p>
<h2 id="discussion">Discussion</h2>
<p>It’s pretty clear Dataset isn’t feeding as aggressively as it can, and its many widgets and knobs don’t help (well, they do, but only after making me do more work). But, if TF wants to invalidate this blog post, I suppose it could add yet another option that plops the dataset into the GPU.</p>
Sat, 23 Dec 2017 00:00:00 +0000
https://vlad17.github.io/2017/12/23/beating-tf-api-in-vram.html
Tags: hardware-acceleration, machine-learning, tools
<h1 id="deep-learning-learning-plan">Deep Learning Learning Plan</h1>
<p>This is my plan to on-board myself with recent deep learning practice (as of the publishing date of this post). Comments and recommendations <a href="https://github.com/vlad17/vlad17.github.io/issues">via GitHub issues</a> are welcome and appreciated! This plan presumes some background in probability, linear algebra, and machine learning theory; if you’re following along, <a href="http://www.deeplearningbook.org/">Part 1 of the Deep Learning book</a> gives an overview of the prerequisite topics to cover.</p>
<p>My notes on these sources are <a href="https://github.com/vlad17/ml-notes">publicly available</a>, as are my <a href="https://github.com/vlad17/learning-to-deep-learn">experiments</a>.</p>
<ol>
<li>Intro tutorials/posts.
<ul>
<li><a href="http://karpathy.github.io/neuralnets/">Karpathy</a></li>
<li>Skim lectures from weeks 1-6, 9-10 of <a href="https://www.coursera.org/learn/neural-networks">Hinton’s Coursera course</a></li>
</ul>
</li>
<li>Scalar supervised learning theory
<ul>
<li>Read Chapters 6, 7, 8, 9, 11, 12 of <a href="http://www.deeplearningbook.org/">Dr. Goodfellow’s Deep Learning Book</a> and <a href="http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf">Efficient Backprop</a></li>
</ul>
</li>
<li>Scalar supervised learning practice
<ul>
<li>Choose an environment.
<ul>
<li>Should be TensorFlow-based, given the wealth of ecosystem around it; stuff like <a href="https://github.com/deepmind/sonnet">Sonnet</a> and <a href="https://github.com/tensorflow/tensor2tensor">T2T</a>.</li>
<li>I tried <a href="https://github.com/tensorflow/models/blob/master/inception/inception/slim/README.md">TF-Slim</a> and <a href="https://github.com/zsdonghao/tensorlayer">TensorLayer</a>, but I still found <a href="https://keras.io/">Keras</a> easiest to rapidly prototype in (and expand). TensorFlow is still pretty easy to <a href="https://blog.keras.io/keras-as-a-simplified-interface-to-tensorflow-tutorial.html">drop down into</a> from the Keras models.</li>
<li>Even with Keras, TF is awkward to prototype in: it’s also worth considering <a href="http://pytorch.org/">PyTorch</a>.</li>
</ul>
</li>
<li>Google <a href="https://www.tensorflow.org/get_started/mnist/pros">MNIST</a></li>
<li>Lessons 0-4 from <a href="http://course.fast.ai/index.html">USF</a></li>
<li>Assignments 1-4 from <a href="https://www.udacity.com/course/deep-learning--ud730">Udacity</a></li>
<li><a href="https://www.tensorflow.org/tutorials/deep_cnn">CIFAR-10</a>
<ul>
<li>Extend to multiple GPUs</li>
<li>Visualizations (with Tensorboard): histogram summary for weights/biases/activations and layer-by-layer gradient norm recordings (+ how does batch norm affect them), graph visualization, cost over time</li>
<li>Visualizations for trained kernels: most-activating image from input set as viz, direct kernel image visualizations + maximizing image from input set as the viz <a href="https://blog.keras.io/how-convolutional-neural-networks-see-the-world.html">per maximizing inputs</a>, activations direct image viz (per <a href="http://yosinski.com/media/papers/Yosinski__2015__ICML_DL__Understanding_Neural_Networks_Through_Deep_Visualization__.pdf">Yosinki et al 2015</a>). For maximizing inputs use regularization from Yosinki paper.</li>
<li>Faster input pipeline and timing metrics for each stage of operation <a href="http://web.stanford.edu/class/cs20si/lectures/notes_09.pdf">input pipeline notes</a>.</li>
</ul>
</li>
<li>Assignment 2 from <a href="http://web.stanford.edu/class/cs20si/syllabus.html">Stanford CS20S1</a></li>
<li>Lab 1 from <a href="https://github.com/yala/introdeeplearning">MIT 6.S191</a></li>
<li><a href="http://cs231n.github.io/">Stanford CS231n</a></li>
<li>Try out slightly less common techniques: compare initialization (orthogonal vs LSUV vs uniform), weight normalization vs batch normalization vs layer normalization, Bayesian-inspired weight decay vs early stopping vs proximal regularization</li>
<li>Replicate <a href="https://arxiv.org/abs/1512.03385">ResNet by He et al 2015</a>, <a href="http://cs.nyu.edu/~wanli/dropc/">Dropconnect</a>, <a href="https://arxiv.org/abs/1302.4389">Maxout</a>, <a href="https://github.com/tensorflow/models/tree/master/inception">Inception</a> (do a fine-tuning example with Inception per <a href="http://proceedings.mlr.press/v32/donahue14.pdf">this paper</a>).</li>
<li>Do an end-to-end application from scratch. E.g., convert an equation image to LaTeX.</li>
</ul>
</li>
<li>Sequence supervised learning
<ul>
<li>Gentle introductions
<ul>
<li>Lessons 5-7 from <a href="http://course.fast.ai/index.html">USF</a></li>
<li>Assignments 5-6 from <a href="https://www.udacity.com/course/deep-learning--ud730">Udacity</a></li>
<li><a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">Karpathy RNN post</a></li>
<li>Weeks 7-8 of <a href="https://www.coursera.org/learn/neural-networks">Hinton’s Coursera course</a></li>
</ul>
</li>
<li>Theory
<ul>
<li>Chapter 10 from <a href="http://www.deeplearningbook.org/">Goodfellow</a></li>
</ul>
</li>
<li>Practice
<ul>
<li>Lab 2 from <a href="https://github.com/yala/introdeeplearning">MIT 6.S191</a></li>
<li>End-to-end application from scratch: a Swype keyboard (<a href="https://www.reddit.com/r/MachineLearning/comments/5ogbd5/d_training_lstms_in_practice_tips_and_tricks/">Reddit tips</a>)</li>
</ul>
</li>
<li>Paper recreations
<ul>
<li>Machine translation <a href="https://arxiv.org/abs/1409.3215">Sutskever et al 2014</a></li>
<li>NLP <a href="https://arxiv.org/abs/1412.7449">Vinyals et al 2015</a></li>
<li>Dense captioning <a href="http://cs.stanford.edu/people/karpathy/densecap/">Karpathy 2016</a></li>
<li><a href="https://arxiv.org/abs/1506.03134">Pointer nets</a></li>
<li><a href="https://arxiv.org/abs/1706.03762">Attention</a></li>
</ul>
</li>
</ul>
</li>
<li>Unsupervised and semi-supervised approaches
<ul>
<li>Theory
<ul>
<li>Weeks 11-16 of <a href="https://www.coursera.org/learn/neural-networks">Hinton’s Coursera course</a></li>
<li>Chapters 13, 16-20 from <a href="http://www.deeplearningbook.org/">Goodfellow</a></li>
<li>See also my links for <a href="https://github.com/vlad17/ml-notes/tree/master/deep-learning">VAE and RBM notes here</a></li>
</ul>
</li>
<li>Practice
<ul>
<li>Remaining <a href="http://deeplearning.net/tutorial/">deeplearning.net</a> tutorials, based on interest.</li>
<li>Notebooks 06, 11 from <a href="https://github.com/nlintz/TensorFlow-Tutorials">nlintz/TensorFlow-Tutorials</a>.</li>
</ul>
</li>
<li>Paper recreations
<ul>
<li><a href="https://arxiv.org/abs/1701.07875">WGAN</a></li>
<li><a href="https://arxiv.org/abs/1312.6114">VAE</a></li>
<li><a href="https://arxiv.org/abs/1606.04934">IAF VAE</a></li>
<li><a href="https://arxiv.org/abs/1507.02672">Ladder Nets</a></li>
</ul>
</li>
</ul>
</li>
</ol>
Sun, 09 Jul 2017 00:00:00 +0000
https://vlad17.github.io/2017/07/09/deep-learning-learning.html
Tags: deep-learning
<h1 id="non-convex-first-order-methods">Non-convex First Order Methods</h1>
<p>This is a high-level overview of first-order local-improvement optimization methods for non-convex, Lipschitz, (sub)differentiable, and regularized functions with efficient derivatives, with a particular focus on neural networks (NNs).</p>
<p>\[
\argmin_\vx f(\vx) = \argmin_\vx \frac{1}{n}\sum_{i=1}^nf_i(\vx)+\Omega(\vx)
\]</p>
<p>Make sure to read the <a href="/2017/06/19/neural-network-optimization-methods.html">general overview post</a> first. I’d also reiterate <a href="http://blog.mrtz.org/2013/09/07/the-zen-of-gradient-descent.html">as Moritz Hardt has</a> that one should be wary of only looking at convergence rates willy-nilly.</p>
<p><strong>Notation and Definitions</strong>.</p>
<ul>
<li>The \(t\)-th step stochastic gradient of \(f:\R^d\rightarrow\R\), computed in \(O(d)\) time at the location \(\vx_{t}\), by selecting either a single \(f_i\) or a mini-batch, is denoted \(\tilde{\nabla}_t\), with \(\E\tilde{\nabla}_t=\nabla_t=\nabla f(\vx_t)\).</li>
<li>Arithmetic operations may be applied elementwise to vectors.</li>
<li>If smooth and efficiently differentiable, e.g., \(\Omega(\vx)=\frac{1}{2}\norm{\vx}_2^2\), regularization can be folded into each \(f_i\) to make new \(f_i'=f_i+\frac{1}{n}\Omega\), as if it was never there in the first place. However, we may wish to apply \(L^1\) regularization or other non-smooth, non-differentiable but still convex functions; these are the problems I’ll label <em>composite</em>.</li>
<li>I’ll use \(x\simeq y\) to claim that equality holds up to some fixed multiplicative constants.</li>
<li>I will presume an initialization \(\vx_0\) (<a href="https://github.com/vlad17/ml-notes/blob/master/deep-learning/optimization.pdf">see discussion here</a>).</li>
<li>
<p>Finally, recall the two stationary point conditions:</p>
<ul>
<li>\(\epsilon\)-approximate critical point: \(\norm{\nabla f(\vx_*)}\le \epsilon\)</li>
<li>\(\epsilon\)-approximate local minimum: there exists a neighborhood \(N\) of \(\vx_*\) such that for any \(\vx\) in \(N\), \(f(\vx)-f(\vx_*)\le \epsilon\). For \(f\) twice-differentiable at \(\vx_*\), it suffices to be an \(\epsilon\)-approximate critical point and have \(\nabla^2 f(\vx_*)\succeq -\sqrt{\epsilon}I\).</li>
</ul>
</li>
<li>In this post, many algorithms will depend on a fixed learning rate, even if it’s just an initial scale for the learning rate. Convergence is sensitive to this setting; a fixed recommendation will surely be a poor choice for some problem. For a first choice, setting \(\eta\) to one of \( \{0.001,0.01,0.1\}\) based on a guess about the magnitude of the smoothness of the problem at hand is a good bet.</li>
</ul>
<h2 id="stochastic-gradient-descent-sgd">Stochastic Gradient Descent (SGD)</h2>
<p>\[
\vx_{t+1}=\vx_t-\eta_t\tilde{\nabla}_t
\]</p>
<p><strong>Description</strong>. See <a href="https://arxiv.org/abs/1309.5549">Ghadimi and Lan 2013a</a> for analysis and TensorFlow’s <a href="https://www.tensorflow.org/api_docs/python/tf/train/GradientDescentOptimizer">non-composite</a>/<a href="https://www.tensorflow.org/api_docs/python/tf/train/ProximalGradientDescentOptimizer">composite</a> implementation. The intuition behind SGD is to travel in a direction we expect is downhill, at least from where we are now. Put another way, the gradient defines a local linear approximation to our function, and we head in the direction that most directly lowers the cost for that approximation. The learning rate \(\eta_t\) controls how far against the gradient we’d like to go (before we judge the linear approximation to be inaccurate).</p>
<p><strong>Assumptions</strong>. SGD makes the <em>gradient estimation assumption</em>, that \(\tilde{\nabla}_t\) is an unbiased estimator of \(\nabla_t\) with variance globally bounded, and assumes that \(f\) is <em>\(L\)-gradient-Lipschitz</em>. <a href="https://arxiv.org/abs/1308.6594">Ghadimi et al 2013</a> extend to composite costs.</p>
<p><strong>Guarantees</strong>. For a <em>fixed-rate</em>, \(\eta_t=\eta\), we expect to converge to an approximate critical point in \(O\pa{ d\epsilon^{-4} }\) as long as \(\eta\simeq\min\pa{L^{-1},\epsilon^2}\). With <em>annealing</em>, \(\eta_t\simeq\min(L^{-1},\epsilon t^{-1/4})\) offers the same guarantees.</p>
<p><strong>Practical Notes</strong>. Vanilla SGD, though simple, has quite a few pitfalls without careful tuning.</p>
<ul>
<li>Its theoretical performance is poor, and convergence is only guaranteed when the step size is kept small relative to the smoothness constants of the cost function. The fact that annealing doesn’t benefit worst-case runtime is a bit surprising since that’s what happens in the strongly convex case, but I believe this is a testament to the fact that the general cost function shape is no longer bowl-like, but can be fractal in nature, so there might never be an end to directions to descend.</li>
<li>In practice, I’ve found that at least for simple problems like logistic regression, where we have \(L\) available, using a fixed learning rate of at most \(L^{-1}\) is many, many orders of magnitude slower than a “reasonable” constant. Global Lipschitz properties might be poorer than local ones, so you’re dooming yourself to slow learning.</li>
<li>A common strategy to cope with this is to use an exponential decay schedule, \(\eta_t\simeq e^{-t}\), with the idea being to traverse a large range of learning rates, hopefully spending most of the time in a range appropriate to the problem. Of course, this will be very sensitive to hyperparameters: note that using exponential decay bounds the diameter of exploration, and even using an inverse-time schedule \(\eta_t\simeq t^{-1}\) for \(T\) steps means you can only travel \(O(\log T)\) distance from your starting point! Inverse-time schedules, and more generally schedules with \(\sum_{t=1}^\infty\eta_t=\infty\) but \(\sum_{t=1}^\infty\eta_t^2<\infty\), can draw on more restrictive smoothness assumptions about \(f\) to guarantee almost-sure convergence (<a href="http://leon.bottou.org/papers/bottou-98x">Bottou 1998</a>).</li>
<li><a href="https://arxiv.org/abs/1309.5549">Ghadimi and Lan 2013a</a> also offer a treatment of “2-phase random stochastic gradient”, which is vanilla SGD with random restarts, for probabilistic guarantees of finding approximate stationary points. Finally, Ghadimi and Lan’s SGD technically expects to find \(\vx_*\) with \(\E\ha{\norm{\nabla f(\vx_*)}^2}<\epsilon\). This implies the above \(O(d\epsilon^{-4})\) convergence rate, but is technically slightly stronger.</li>
</ul>
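<p>To make the update rule and schedules concrete, here is a minimal NumPy sketch of vanilla SGD; the quadratic test function, the noise model, and the schedule constants are illustrative choices of mine, not taken from the cited papers.</p>

```python
import numpy as np

def sgd(grad_est, x0, steps, schedule):
    """Vanilla SGD: x_{t+1} = x_t - eta_t * (stochastic gradient at x_t)."""
    x = np.array(x0, dtype=float)
    for t in range(steps):
        x = x - schedule(t) * grad_est(x)
    return x

# Noisy gradient oracle for f(x) = 0.5*||x||^2, whose true gradient is x;
# additive Gaussian noise keeps the estimator unbiased with bounded variance.
rng = np.random.default_rng(0)
noisy_grad = lambda x: x + 0.1 * rng.standard_normal(x.shape)

fixed = lambda t: 0.1                     # eta_t = eta
inverse_time = lambda t: 1.0 / (1.0 + t)  # sum eta_t = inf, sum eta_t^2 < inf
exponential = lambda t: 0.1 * 0.99 ** t   # sweeps a range of rates, bounded travel

x_fixed = sgd(noisy_grad, np.ones(5), 500, fixed)
x_inv = sgd(noisy_grad, np.ones(5), 500, inverse_time)
```

<p>With the fixed rate, the iterate settles into a noise ball around the minimum whose radius scales with \(\eta\); the inverse-time schedule shrinks that ball over time, illustrating the tradeoff discussed in the notes above.</p>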
<p>Most subsequent algorithms have been developed to handle finding \(\eta_t\) on their own, adapting the learning rate as they go along. This was done for the convex case, but that doesn’t stop us from applying the same improvements to the non-convex case!</p>
<h2 id="accelerated-stochastic-gradient-descent-agd">Accelerated (Stochastic) Gradient Descent (AGD)</h2>
<p>See <a href="https://www.tensorflow.org/api_docs/python/tf/train/MomentumOptimizer">tf.train.MomentumOptimizer</a> for implementation. AGD is motivated by momentum-added SGD from <a href="http://www.sciencedirect.com/science/article/pii/0041555364901375">Polyak 1964</a>. A modern version looks like this:
\[
\begin{align}
\vm_0&=0\\\<br />
\vm_{t+1}&=\beta \vm_t+\eta \nabla_t \\\<br />
\vx_{t+1}&=\vx_t-\vm_{t+1}
\end{align}
\]
<strong>Description</strong>. We can intuit this, in <a href="https://arxiv.org/abs/1609.04747">Ruder’s words</a>, as a ball rolling down a hill, with a growing momentum. In this way, we extend the hill metaphor, in effect trusting that we can continue further in the general downhill direction maintained by the momentum terms. Some momentum implementations just replace the above gradient with the estimator \(\tilde{\nabla}_t\) and set stuff running, like the linked TensorFlow optimizer does by default. However, even with full gradient information and assuming smoothness and convexity, momentum alone doesn’t perform optimally. Nesterov’s 1983 paper, <em>A method of solving a convex programming problem with convergence rate \(O(1/k^2)\)</em>, fixes this by correcting momentum to look ahead, which is helpful if the curvature of the function starts changing:
\[
\begin{align}
\vm_0&=0\\\<br />
\vm_{t+1}&=\beta \vm_t+\eta \nabla f(\vx_t -\beta\vm_t)\\\<br />
\vx_{t+1}&=\vx_t-\vm_{t+1}
\end{align}
\]
<strong>Practical Notes</strong>. While optimal in the smooth, convex, full gradient setting, and even optimally extended to non-smooth settings (see <a href="http://www.mit.edu/~dimitrib/PTseng/papers/apgm.pdf">Tseng 2008</a> for an overview), changing the above to use a random gradient estimator ruins asymptotic performance, concede <a href="http://proceedings.mlr.press/v28/sutskever13.html">Sutskever et al 2013</a>. <a href="http://www.deeplearningbook.org/contents/optimization.html">Goodfellow</a> claims momentum handles ill-conditioning in the Hessian of \(f\) and variance in the gradient through the introduction of the stabilizing term \(\vm_{t}\). Indeed, this seems to be the thesis laid out by Sutskever et al 2013, where the authors argue that a certain transient phase of optimization matters more for deep NNs, which AGD accelerates empirically (see also <a href="https://arxiv.org/abs/1212.0901">Bengio et al 2012</a>). Many authors set \(\beta=0.9\), but see <a href="http://proceedings.mlr.press/v28/sutskever13.html">Sutskever et al 2013</a> for detailed considerations on the momentum schedule.</p>
<p><strong>Guarantees</strong>. Later work by <a href="https://arxiv.org/abs/1310.3787">Ghadimi and Lan 2013b</a> solidifies the analysis of AGD for stochastic, smooth, composite, and non-convex costs, though it uses a slightly different formulation for momentum. Under the previous gradient estimation assumptions from SGD (including slightly stronger light-tail assumptions about the variance of \(\tilde{\nabla}_t\)), \(L\)-gradient-Lipschitz assumptions for \(f\), and a schedule which increases mini-batch size <em>linearly</em> in the iteration count to refine gradient estimation, AGD requires \(O(\epsilon^{-2})\) iterations but \(O(d\epsilon^{-4})\) runtime to converge to an approximate critical point. Perhaps with yet stronger assumptions about the concentration of \(\tilde{\nabla}_t\) around \(\nabla_t\) AGD has promise to perform better.</p>
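<p>The two momentum variants above differ only in where the gradient is evaluated. A minimal sketch, using full gradients on a toy ill-conditioned quadratic (the function and all constants are my own illustrative choices):</p>

```python
import numpy as np

def momentum_run(grad, x0, steps, eta=0.005, beta=0.9, nesterov=False):
    """Momentum GD; with nesterov=True the gradient is evaluated at the
    look-ahead point x - beta*m (Nesterov's correction)."""
    x = np.array(x0, dtype=float)
    m = np.zeros_like(x)
    for _ in range(steps):
        g = grad(x - beta * m) if nesterov else grad(x)
        m = beta * m + eta * g
        x = x - m
    return x

# Ill-conditioned quadratic f(x) = 0.5 * x^T diag(1, 100) x; full gradients
# here, since the text notes the stochastic case loses the guarantees.
scales = np.array([1.0, 100.0])
grad = lambda x: scales * x

x_polyak = momentum_run(grad, [1.0, 1.0], 300)
x_nesterov = momentum_run(grad, [1.0, 1.0], 300, nesterov=True)
```

<p>Both variants tolerate the 100:1 condition number at this step size, where plain GD with the same \(\eta\) would crawl along the flat direction.</p>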
<h2 id="adagrad">AdaGrad</h2>
<p>AdaGrad was proposed by <a href="http://jmlr.org/papers/v12/duchi11a.html">Duchi et al 2011</a> and is available in <a href="https://www.tensorflow.org/api_docs/python/tf/train/AdagradOptimizer">TensorFlow</a>.
\[
\begin{align}
\vv_0&=\epsilon\\\<br />
\vv_{t+1}&=\vv_t+\tilde{\nabla}_t^2\\\<br />
\vx_{t+1}&=\vx_t-\frac{\eta}{\sqrt{\vv_{t+1}} }\tilde{\nabla}_t
\end{align}
\]
<strong>Description</strong>. AdaGrad is actually analyzed in the framework of <a href="http://ocobook.cs.princeton.edu/">online convex optimization (OCO)</a>. This adversarial, rather than stochastic, optimization setting can immediately be applied to stochastic optimization of convex functions over compact sets. Consider an \(L\)-gradient Lipschitz but possibly nonconvex cost \(f\) at some iterate \(\vx_t\), which implies an upper bound \(f(\vx)\le f(\vx_t)+(\vx-\vx_t)^\top\nabla_t+\frac{L}{2}\norm{\vx-\vx_t}^2_2\); the convexity inequality, if we had it, would sandwich \(f(\vx)\ge f(\vx_t)+(\vx-\vx_t)^\top\nabla_t\). Minimizing this upper bound, which results in full gradient descent (GD), then, guarantees improvement in our cost. The quadratic term effectively quantifies how much we trust our linear approximation. An analogous technique applied to a sequence of cost functions in the online setting gives rise to Follow the Regularized Leader (FTRL): given past performance, create an upper bound on the global cost reconstructed from our stochastic information, and find the next best iterate subject to working within a trusted region. The difficulty is in defining this trusted region with an unfortunately named regularization function, which differs from \(\Omega\). AdaGrad improves the quadratic regularization \(\frac{L}{2}\norm{\vx-\vx_t}^2_I\) in GD to the less crude \(\frac{L}{2}\norm{\vx-\vx_t}^2_{G_t}\), where \(\norm{\vx}^2_A=\vx^\top A\vx\) and \(G_t=\diag \vv_t^{1/2}\) from the iterates above (see <a href="/assets/2017-06-20-nonconvex-first-order-methods/proximal_notes.pdf">these notes</a>, retrieved <a href="http://cs.stanford.edu/~ppasupat/a9online/uploads/proximal_notes.pdf">from here</a>, for discussion). This <em>adaptive</em> regularization function, at least in the OCO setting, is as good, in terms of convergence, as an optimal choice of quadratic regularization, up to multiplicative constants.
We see that the learning rate for every feature changes with respect to its history, so that new information is weighed against the old.</p>
<p><strong>Practical Notes</strong>. AdaGrad is a convex optimization algorithm, and it shows, but not in a good way.</p>
<ul>
<li>In nonconvex optimization problems, Goodfellow claims, aggregates of gradients from the beginning of training are irrelevant to the curvature at the current iterate. As a result, they cause an overly aggressive learning rate decrease.</li>
<li>The \(\epsilon\) constant is only for numerical stability. <a href="https://keras.io/optimizers/#adagrad">Keras</a> and <a href="https://arxiv.org/abs/1609.04747">Ruder</a> recommend setting it to \(10^{-8}\).</li>
<li>For noncomposite versions of AdaGrad, see <a href="https://www.tensorflow.org/api_docs/python/tf/train/AdagradDAOptimizer">tf.train.AdagradDAOptimizer</a>, mentioned in the original <a href="http://jmlr.org/papers/v12/duchi11a.html">Duchi et al 2011</a> and <a href="https://www.tensorflow.org/api_docs/python/tf/train/ProximalAdagradOptimizer">tf.train.ProximalAdagradOptimizer</a> based on FOBOS from <a href="https://papers.nips.cc/paper/3793-efficient-learning-using-forward-backward-splitting">Duchi et al 2009</a>. See <a href="https://arxiv.org/abs/1009.3240">McMahan 2011</a> for discussion.</li>
<li>While AdaGrad greatly improved performance by having a per-dimension learning rate, its use is frequently discouraged because its accumulation of the entire gradient history continually shrinks the learning rate.</li>
</ul>
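<p>A minimal sketch of the iterates above on a toy ill-conditioned quadratic; the test function and \(\eta\) are illustrative choices, not recommendations:</p>

```python
import numpy as np

def adagrad(grad, x0, steps, eta=0.5, eps=1e-8):
    """AdaGrad: per-coordinate rate eta / sqrt(accumulated squared gradients)."""
    x = np.array(x0, dtype=float)
    v = np.full_like(x, eps)  # accumulator v_0 = eps, for numerical stability
    for _ in range(steps):
        g = grad(x)
        v += g ** 2
        x -= eta / np.sqrt(v) * g
    return x

# Quadratic with a 100:1 curvature ratio between coordinates; AdaGrad's
# per-dimension rates equalize progress without per-problem tuning.
scales = np.array([1.0, 100.0])
x_adagrad = adagrad(lambda x: scales * x, [1.0, 1.0], 500)
```

<p>Note how the steep coordinate accumulates squared gradients faster, which automatically shrinks its learning rate relative to the flat coordinate.</p>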
<p><strong>AdaDelta</strong> attempts to address the aggressive learning rate decrease problem of AdaGrad by exponentially decaying an estimate of accumulated gradient term \(\vv_t\) (<a href="https://arxiv.org/abs/1212.5701">Zeiler 2012</a>). This adds a new parameter for the exponential decay \(\beta\), typically \(0.9\), and introduces a unit correction \(\tilde{\vx}_t\) in place of the learning rate:
\[
\begin{align}
\vv_0&=\tilde{\vx}_0=0\\\<br />
\vv_{t+1}&=\beta \vv_t+(1-\beta)\tilde{\nabla}_t^2\\\<br />
\Delta_{t+1}&=\frac{\sqrt{\tilde{\vx}_{t}+\epsilon} }{\sqrt{\vv_{t+1}+\epsilon} }\tilde{\nabla}_t\\\<br />
\tilde{\vx}_{t+1}&=\beta \tilde{\vx}_{t}+(1-\beta)\Delta_{t+1}\\\<br />
\vx_{t+1}&=\vx_t-\Delta_{t+1}
\end{align}
\]
Similar update rules have been explored by <a href="https://arxiv.org/abs/1206.1106">Schaul et al 2012</a> in a sound but presumptive setting where \(\nabla^2f_i(\vx)\) are considered identical and diagonal for all \(i\in[n]\) and any fixed \(\vx\). <strong>RMSProp</strong> is similar to AdaDelta, but still relies on a fixed learning rate \(\tilde{\vx}_t=\eta\). Both RMSProp and AdaDelta have seen practical success, improving over AdaGrad in later iterations because they are unencumbered by previous gradient accumulation. RMSProp even has a Nesterov momentum variant. However, the exponential decay approximation may have high bias early in training. The Adaptive Moment Estimation (Adam) paper corrects for this.</p>
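<p>For contrast with AdaGrad's cumulative sum, here is a sketch of the simpler RMSProp variant, which keeps a fixed \(\eta\) in place of AdaDelta's unit correction; the test problem and constants are my own illustrative choices.</p>

```python
import numpy as np

def rmsprop(grad, x0, steps, eta=0.01, beta=0.9, eps=1e-8):
    """RMSProp: AdaGrad with an exponentially decayed accumulator, so
    gradients from early in training stop dominating the per-coordinate rate."""
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        g = grad(x)
        v = beta * v + (1 - beta) * g ** 2  # EMA of squared gradients
        x -= eta / np.sqrt(v + eps) * g
    return x

scales = np.array([1.0, 100.0])
x_rms = rmsprop(lambda x: scales * x, [1.0, 1.0], 500)
```

<p>On a deterministic gradient the EMA tracks the current squared gradient, so the effective step behaves roughly like \(\eta\cdot\operatorname{sign}(\tilde{\nabla}_t)\): fast initial progress, then oscillation in an \(O(\eta)\) band around the minimum.</p>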
<h2 id="adam">Adam</h2>
<p>The Adam method, proposed by <a href="https://arxiv.org/abs/1412.6980">Kingma and Ba 2014</a>, improves on AdaGrad-inspired adaptive rate methods by adding both a momentum term and removing first and second moment bias from exponential decay approximations to the gradient accumulators. See <a href="https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer">TensorFlow</a> for an implementation.
\[
\begin{align}
\vm_0&=\vv_0=0\\\<br />
\vm_{t+1}&=\beta_1\vm_t+(1-\beta_1)\tilde{\nabla}_t \\\<br />
\vv_{t+1}&=\beta_2 \vv_t+(1-\beta_2)\tilde{\nabla}_t^2\\\<br />
\vx_{t+1}&=\vx_t-\eta\pa{1-\beta_2^t}^{1/2}\pa{1-\beta_1^t}^{-1}\frac{\vm_{t+1} }{\sqrt{\vv_{t+1}+\epsilon} }
\end{align}
\]
<strong>Description</strong>. Adam seeks to combine AdaGrad’s adaptivity, which can learn the curvature of the space it’s optimizing in (making it able to deal with sparse gradients), and momentum-based approaches like RMSProp, which are able to adapt to new settings during the course of the optimization. The bias correction ensures that, roughly, \(\E\ha{ \tilde{\nabla}_t^2} =\vv_t(1-\beta_2^t)+\zeta\) and analogously for \(\vm_t\), with \(\zeta\) being the error that occurs from non-stationarity in the gradient. Under the assumption that appropriate \(\beta_1,\beta_2\) are selected, such that the non-stationarity error is appropriately vanished by the exponential decay, Adam has low bias for the gradient moments \(\vm,\vv\). As the paper describes, the unbiased \(\frac{\vm_{t+1} }{\sqrt{\vv_{t+1}+\epsilon} }\) captures the <em>signal-to-noise</em> ratio for the gradient.</p>
<p><strong>Guarantees</strong>. Adam reduces to Adagrad under certain parameter settings. Like Adagrad, it has strong guarantees in an OCO setting, which are valuable but not immediately applicable here.</p>
<p><strong>Practical Notes</strong>. Given that Adam has fairly intuitive hyperparameters, Adam has pretty decent performance across the board.</p>
<ul>
<li>As before, for stability, a small \(\epsilon=10^{-8}\) is typically used.</li>
<li>AdaGrad can be recovered with an annealing \(\eta\sim t^{-1/2}\) and near-0 values for \(\beta_1\) and \(1-\beta_2\): these are recommended in the convex setting.</li>
<li>For other, nonconvex, settings \(\beta_1\) should be higher, for instance, \(0.9\). Settings for \(\beta_2\) from the paper are among \(\{0.99, 0.999, 0.9999\}\). High settings for both \(\beta_1,\beta_2\) imply stationarity in the gradient moments.</li>
<li>Though Adam and other adaptive methods might seem like empirical improvements over SGD (though they certainly don’t seem to have any better convergence guarantees in the nonconvex case), they seem to struggle with generalization error, which is the ultimate goal for our optimization. Recall the point made in the <a href="/2017/06/19/neural-network-optimization-methods.html">overview post</a> about <a href="http://leon.bottou.org/papers/bottou-bousquet-2011">Bousquet and Bottou 2007</a>: the convergence guarantees for the training loss above are only part of the overall error equation. This is still an active area of research, but intuitively we can construct training sets where adaptive methods reach poorly generalizing minima but SGD methods approach well-generalizing good ones (<a href="https://arxiv.org/abs/1705.08292">Wilson et al 2017</a>). Empirical responses to this have found that momentum-based SGD can be tuned to address the convergence speed issues but avoid generalization error qualms (<a href="https://arxiv.org/abs/1706.03471">Zhang et al 2017</a>). I would posit that SGD perhaps finds “stable” minima (ones whose generalization gap is small, conceptually minima that exist on a large, flat basin), and that momentum does not affect this approach, whereas adaptive methods might find a minimum within a narrow valley that might have better training loss, but has a large generalization gap since the valley “feature” of this cost function terrain is unstable with respect to the training set.</li>
</ul>
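<p>A minimal sketch of the iterates above, again on an illustrative quadratic; note I place \(\epsilon\) outside the square root following Kingma and Ba's statement of the algorithm (a negligible difference from the display above), and the constants are demonstration choices.</p>

```python
import numpy as np

def adam(grad, x0, steps, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: EMA first moment (momentum), EMA second moment (RMSProp-style),
    with bias corrections for the zero initialization of both EMAs."""
    x = np.array(x0, dtype=float)
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # first-moment bias correction
        v_hat = v / (1 - beta2 ** t)  # second-moment bias correction
        x -= eta * m_hat / (np.sqrt(v_hat) + eps)
    return x

scales = np.array([1.0, 100.0])
x_adam = adam(lambda x: scales * x, [1.0, 1.0], 3000)
```

<p>At \(t=1\) the corrections make \(\hat{\vm}=\tilde{\nabla}_1\) and \(\hat{\vv}=\tilde{\nabla}_1^2\) exactly, so the very first step already has the intended \(O(\eta)\) magnitude rather than being damped by the zero initialization.</p>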
<h2 id="visualization">Visualization</h2>
<p>This visualization is coming from <a href="http://sebastianruder.com/optimizing-gradient-descent/index.html">Sebastian Ruder’s related post</a>. Check it out for discussion about the below visualization. Note that NAG is AGD and Momentum is uncorrected momentum added to SGD.</p>
<p><img src="/assets/2017-06-20-nonconvex-first-order-methods/update-rules-viz.gif" alt="visualization of different update rules in action" class="center-image" /></p>
<h1 id="future-directions">Future Directions</h1>
<h2 id="variance-reduction">Variance Reduction</h2>
<p>A new approach, Stochastic Variance Reduction Gradient (SVRG), was developed by <a href="https://papers.nips.cc/paper/4937-accelerating-stochastic-gradient-descent-using-predictive-variance-reduction">Johnson and Zhang 2013</a>. Its analysis, for strongly convex and smooth non-composite functions, didn’t improve any long-standing convergence rates, but the idea introduced was novel: we could use stale full-gradient information \(\nabla_t\) taken occasionally to de-noise stochastic estimations \(\tilde{\nabla}_t\). We update our full gradient every \(m\) steps; writing \(\tilde{\nabla}_t(\vx)\) for the stochastic estimator of the cost gradient at time \(t\) and location \(\vx\), the previous default notation has \(\tilde{\nabla}_t=\tilde{\nabla}_t(\vx_t)\):
\[
\begin{align}
\bar{\vx}_t &= \begin{cases}\E_{\xi}\vx_{t-\xi}& t\equiv 0\pmod{m} \\\\ \bar{\vx}_{t-1} & \text{otherwise}\end{cases} \\\<br />
\vg_t &= \begin{cases} \nabla f(\bar{\vx}_t) & t\equiv 0\pmod{m} \\\\ {\vg}_{t-1} & \text{otherwise}\end{cases} \\\<br />
\vx_t &= \vx_{t-1}- \eta_t\pa{\tilde{\nabla}_t (\vx_{t-1})-\tilde{\nabla}_t(\bar{\vx}_{t})+\vg_t}
\end{align}
\]
Above, \(\xi\) is a random variable supported on \([m]\). The same guarantees hold without taking expectation wrt \(\xi\) for computing \(\bar{\vx}_t\). In particular, for certain \(\xi,\eta_t\) SVRG was shown to reach an approximate critical point in \(O(dn+dn^{2/3}\epsilon^{-2})\) time, at least in the non-composite setting, simultaneously by <a href="https://arxiv.org/abs/1603.06160">Reddi et al 2016</a> and <a href="https://arxiv.org/abs/1603.05643">Allen-Zhu and Hazan 2016</a>. For these problems this improves over the GD runtime cost \(O(dn\epsilon^{-2})\).</p>
<p>Still, it’s debatable whether the \(O(d\epsilon^{-4})\) SGD is improved upon by SVRG methods, since they depend on \(n\). Datasets can be extremely large, so the \(n^{2/3}\epsilon^{-2}\) term may be prohibitive. At least in convex settings, <a href="https://arxiv.org/abs/1511.01942">Babanezhad et al 2015</a> explore using mini-batches for a variance-reduction effect. Perhaps an extension of this to non-convex costs would be what’s necessary to see SVRG applied to NNs. Right now, its use doesn’t seem to be very standard.</p>
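<p>The update above can be sketched in a few lines. This simplification takes the snapshot to be the current iterate (no expectation over \(\xi\), which the text notes is allowed), and the least-squares cost and constants are my own toy choices.</p>

```python
import numpy as np

def svrg(grads, full_grad, x0, outer, m, eta, seed=0):
    """SVRG sketch: every m inner steps, recompute a full gradient at a
    snapshot x_bar and use it to de-noise per-example stochastic gradients."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    n = len(grads)
    for _ in range(outer):
        x_bar = x.copy()          # snapshot (here: simply the current iterate)
        g_bar = full_grad(x_bar)  # stale full gradient, reused for m steps
        for _ in range(m):
            i = rng.integers(n)
            # unbiased estimate whose variance vanishes as x approaches x_bar
            x = x - eta * (grads[i](x) - grads[i](x_bar) + g_bar)
    return x

# Toy least-squares components: f_i(x) = 0.5 * (a_i . x - b_i)^2.
rng = np.random.default_rng(1)
A, b = rng.standard_normal((50, 5)), rng.standard_normal(50)
grads = [lambda x, a=A[i], y=b[i]: (a @ x - y) * a for i in range(50)]
full_grad = lambda x: A.T @ (A @ x - b) / 50
x_svrg = svrg(grads, full_grad, np.zeros(5), outer=30, m=100, eta=0.02)
```

<p>The correction term costs one extra stochastic gradient per step plus one full gradient per epoch, which is where the \(dn\) term in the runtime comes from.</p>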
<h2 id="noise-injected-sgd">Noise-injected SGD</h2>
<p><strong>Noisy SGD</strong> is a surprisingly cheap and viable new solution proposed to find approximate <em>local minima</em> by <a href="https://arxiv.org/abs/1503.02101">Ge et al 2015</a>. Intuitively, adding jitter to the parameters ensures that the gradient-vanishing pathology of strict saddle points won’t be a problem. In particular, even if the gradient shrinks as you near a saddle point, the jitter will be strong enough that you won’t have to spend a long time around it before escaping.</p>
<p>\[
\begin{align}
\xi_{t}&\sim \Uniform \pa{B_{1} }\\\<br />
\vx_{t+1}&=\vx_t-\eta \tilde{\nabla}_{t}+\xi_t
\end{align}
\]</p>
<p>Above, \(B_r\) is a ball centered at the origin of radius \(r\). Unfortunately, noisy SGD is merely \(O(\poly(d/\epsilon))\). Its important contribution is showing that even stochastic first order methods could feasibly be used to arrive at local minima. With additional assumptions, and removing stochasticity, this was improved by <a href="https://arxiv.org/abs/1703.00887">Jin et al 2017</a> in <strong>Perturbed Gradient Descent</strong> (PGD):
\[
\begin{align}
\xi_{t}&\sim \Uniform \pa{B_{r_t} }\\\<br />
\vx_{t+1}&=\vx_t-\eta \nabla_{t}+\xi_t
\end{align}
\]
The radius \(r_t\) is carefully chosen depending on whether or not PGD detects we are near a saddle point. Usually, it is set to 0, so the algorithm mostly behaves like GD. With some additional second-order smoothness assumptions, this runs in time \(O(nd\epsilon^{-2}\log^4d)\), showing a cheap extension of GD for finding minima. However, until a similar analysis is performed for stochastic PGD, with equally friendly results, these methods aren’t yet ready for prime time. Recent work by Chi Jin adds acceleration to PGD, improving by a factor of \(\epsilon^{1/4}\).</p>
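<p>To illustrate the escape mechanism only, here is a toy caricature of the perturbation idea; it is not the actual PGD algorithm, which chooses \(r_t\) and the perturbation condition far more carefully. The saddle function, thresholds, and radius are my own demonstration choices.</p>

```python
import numpy as np

def perturbed_gd(grad, x0, steps, eta=0.1, g_thresh=1e-3, r=1e-2, seed=0):
    """Caricature of perturbed GD: plain gradient descent, plus a small
    uniform-in-a-ball perturbation whenever the gradient is tiny (i.e., we
    may be sitting near a saddle point)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        g = grad(x)
        if np.linalg.norm(g) < g_thresh:
            u = rng.standard_normal(x.size)
            u *= r * rng.random() ** (1.0 / x.size) / np.linalg.norm(u)
            x = x + u  # kick off the saddle's stable manifold
        x = x - eta * g
    return x

# f(x, y) = 0.5*(x^2 - y^2) has a strict saddle at the origin. Started on
# the stable manifold (y = 0), plain GD converges to the saddle and stalls;
# the perturbation gives y a nonzero value that then grows geometrically.
saddle_grad = lambda v: np.array([v[0], -v[1]])
x_pgd = perturbed_gd(saddle_grad, [1.0, 0.0], 400)
```

<p>The ball sampling uses the standard direction-times-scaled-radius construction (\(rU^{1/d}\) for uniform \(U\)); the escape itself only needs the perturbation's component along the negative-curvature direction to be nonzero.</p>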
Tue, 20 Jun 2017 00:00:00 +0000
https://vlad17.github.io/2017/06/20/nonconvex-first-order-methods.html
<h1 id="neural-network-optimization-methods">Neural Network Optimization Methods</h1>
<p>The goal of this post and its related sub-posts is to explore at a high level how the theoretical guarantees of the various optimization methods interact with non-convex problems in practice, where we don’t really know Lipschitz constants, the validity of the assumptions that these methods make, or appropriate hyperparameters. Obviously, a detailed treatment would require delving into intricacies of cutting-edge research. That’s not the point of this post, which just seeks to offer a theoretical survey.</p>
<p>I should also caution the reader that I’m not drawing on any of my own experience when discussing “practical” aspects of neural network (NN) optimization, but rather <a href="http://www.deeplearningbook.org/">Dr. Goodfellow’s</a>. For the most part, I’ll be summarizing sections 8.5 and 8.6 of the <a href="http://www.deeplearningbook.org/contents/optimization.html">optimization chapter</a> in that book, but I’ll throw in some relevant background and research, too. Further, one departure from practicality that I’ll be making for simplicity is not considering parallelism. All mentioned analyses assume sequential execution, and may not have obvious parallel versions. Even if they do, most bets are off.</p>
<p>In part, I’ll also try to address exactly what theoretical guarantees we do have in a NN setting. Lots of work has been done for convex and adversarial online convex optimization, and most NNs are optimized by just throwing such a method at training. Luckily, a lot of very recent work, as of this posting, has addressed exactly what happens in this situation.</p>
<h2 id="setting">Setting</h2>
<p>A NN is a real-valued circuit \(\hat{y}_\bsth\) of computationally efficient, differentiable, and Lipschitz functions parameterized by \(\bsth\). This network is trained to minimize a loss, \(J(\bsth)\), based on empirical risk minimization (ERM). This is the hard part, computationally, for training NNs. We are given a set of supervised examples, pairs \(\vx^{(i)},y^{(i)}\) for \(i\in[n]\). Under the assumption that these pairs are coming from some fixed, unknown distribution, some learning can be done by ERM relative to a loss \(\ell\) on our training set, which amounts to the following:
\[
\argmin_\bsth J(\bsth) = \argmin_\bsth \frac{1}{n}\sum_{i=1}^n\ell(\hat{y}_\bsth(\vx^{(i)}), y^{(i)})+\Omega(\bsth)
\]
Above, \(\Omega\) is a regularization term (added to restrict the hypothesis class). Its purpose is for generalization. Typically, \(\Omega\) is of the form of an \(L^2\) or \(L^1\) norm. In other cases, it has a more complicated implicit form such as the case when we perform model averaging through dropout or weight regularization through early stopping (regularization may also be some kind of smoothing, like gradient clipping). In any case, we will assume that there exist some general strategies for reducing problems with nonzero \(\Omega\) to those where it is zero (see, for example, analysis and references in <a href="https://papers.nips.cc/paper/563-a-simple-weight-decay-can-improve-generalization">Krogh and Hertz 1991</a>, <a href="http://epubs.siam.org/doi/abs/10.1137/080716542">Beck and Teboulle 2009</a>, <a href="https://arxiv.org/abs/1603.05953">Allen-Zhu 2016</a>). The presence of regularization in general is nuanced, and its application requires deeper analyses for minimization methods, but we will skirt those concerns when discussing practical behavior for the time being.</p>
<p>An initial source of confusion about the above machine learning notation is the reuse of variable names in the optimization literature, where instead our parameters \(\bsth\) are points \(\vx\) and our training point errors \(\ell(\hat{y}_\bsth(\vx^{(i)}), y^{(i)})\) are replaced with opaque Lipschitz, differentiable costs \(f_i(\vx)\). We now summarize our general task of (unconstrained) NN optimization of our nonconvex composite (regularized) cost function \(f:\R^d\rightarrow \R\):
\[
\argmin_\vx f(\vx) = \argmin_\vx \frac{1}{n}\sum_{i=1}^nf_i(\vx)+\Omega(\vx)
\]</p>
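<p>As a toy check of this notation, here is a hypothetical least-squares instance where averaging the per-example gradients (with the regularizer's contribution folded in) recovers the full gradient \(\nabla f\), the unbiasedness property the methods below rely on; all names and constants are illustrative.</p>

```python
import numpy as np

# Toy regularized ERM in the optimization notation above:
# f(x) = (1/n) sum_i f_i(x) + Omega(x), with Omega(x) = 0.5*lam*||x||^2.
rng = np.random.default_rng(0)
n, d, lam = 20, 3, 0.1
A, b = rng.standard_normal((n, d)), rng.standard_normal(n)

def stoch_grad(x, i):
    """Gradient of the i-th squared loss plus the regularizer's gradient."""
    return (A[i] @ x - b[i]) * A[i] + lam * x

def full_grad(x):
    return A.T @ (A @ x - b) / n + lam * x

x = rng.standard_normal(d)
# Unbiasedness: the average over all i equals the full gradient exactly.
avg = np.mean([stoch_grad(x, i) for i in range(n)], axis=0)
```

<p>Sampling \(i\) uniformly then gives \(\E\ha{\tilde{\nabla} f}=\nabla f\) at one-\(n\)-th the cost per step.</p>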
<p>In a lot of literature inspiring these algorithms, it’s important to keep straight in one’s head the various types of minimization problems that are being solved, and whether they’re making incompatible assumptions with the NN environment.</p>
<ul>
<li>Many algorithms are inspired by the general convex \(f\) case. NN losses are usually not convex.</li>
<li>Sometimes, full gradients \(\nabla f\) are assumed. A full gradient is intractable for NNs, as it requires going through the entire \(n\)-sized dataset. We are looking for stochastic approximations to the gradient \(\E\ha{\tilde{\nabla} f}=\nabla f\).</li>
<li>Some algorithms assume \(\Omega = 0\), but that’s not usually the case.</li>
</ul>
<h2 id="theoretical-convergence">Theoretical Convergence</h2>
<p>We’ll be looking at gradient descent (GD) optimization algorithms, which assume an initial point \(\vx_0\) and move in nearby directions to reduce the cost. As such, basically all asymptotic rates contain a hidden constant multiplicative term \(f(\vx_0) - \inf f\).</p>
<h3 id="problem-specification">Problem Specification</h3>
<p>Before discussing speed, it’s important to know what constitutes a solution. Globally minimizing a possibly non-convex function such as a deep NN loss is NP-hard. Even finding an approximate local minimum of just a quartic multivariate polynomial or showing its convexity is NP-hard (<a href="https://arxiv.org/abs/1012.1908">Ahmadi et al 2010</a>).</p>
<p>What we do, in theory, at least, is instead merely find <strong>approximate critical points</strong>; i.e., a typical non-convex optimization algorithm would return a point \(\vx_{*}\) that satisfies \(\norm{\nabla f(\vx_*)}\le \epsilon\). This is an <strong>incredibly weak</strong> requirement: for NNs, there are significantly more saddle points than local minima, and they have high cost. Luckily, local minima actually concentrate around the global minimum cost for NNs, as opposed to saddles, so recent cutting-edge methods that find approximate local minima are worth keeping in mind. An approximate local minimum \(\vx_*\) has a neighborhood such that any \(\vx\) in that neighborhood will have \(f(\vx_*)-f(\vx)\le \epsilon\). <a href="https://github.com/vlad17/ml-notes/blob/master/deep-learning/optimization.pdf">See extended discussion here.</a></p>
<p>We’ll assume that \(f\) is differentiable and Lipschitz. Even though ReLU activations and \(L^1\) regularization may technically invalidate differentiability, these functions have well-defined <strong>subgradients</strong> that respect <a href="http://web.stanford.edu/class/msande318/notes/notes-first-order-nonsmooth.pdf">GD properties that we care about</a>. Certain algorithms further might assume \(f\in\mathcal{C}^2\) and that the Hessian is operator-norm Lipschitz or bounded.</p>
<p>There are two main runtime costs. The first is the desired degree of accuracy, \(\epsilon\). The second is due to the dimensionality of our input \(d\). Ignoring representation issues, thanks to the circuit structure of \(f\), we evaluate for any \(i\in[n]\) and \(\vv\in\R^d\) all of \(f_i(\vx), \nabla f_i(\vx), {\nabla^2 f_i(\vx)} \vv\) in \(O(d)\) time. Finally, since gradients of \(f_i\) approximate gradients of \(f\) only <em>in expectation</em>, reported worst-case runtimes count the steps until we <em>expect</em> to arrive at an approximate stationary point (expectation taken over the random uniform selection of \(i\) in SGD).</p>
<h3 id="fundamental-lower-bounds">Fundamental Lower Bounds</h3>
<p>First, unless \(\ptime = \nptime\), we expect runtime to be at least \(\Omega\pa{\log\frac{d}{\epsilon}}\) due to the aforementioned hardness results.</p>
<p>Less obviously, convex optimization lower bounds for smooth functions imply that any first-order non-convex algorithm requires at least \(\Omega(1/\epsilon)\) gradient steps (<a href="https://arxiv.org/abs/1405.4980">Bubeck 2014</a>, see also notes <a href="http://www.stat.cmu.edu/~larry/=sml/optrates.pdf">here</a> and <a href="http://www.cs.cmu.edu/~suvrit/teach/aaditya_lect23.pdf">here</a>). Note that this is nowhere near polynomial time in the bit size of \(\epsilon\)!</p>
<p>See also <a href="http://ieeexplore.ieee.org/document/585893/">Wolpert and Macready 1997</a>, <a href="https://papers.nips.cc/paper/125-training-a-3-node-neural-network-is-np-complete">Blum and Rivest 1988</a>, and a recent re-visiting of the topic in <a href="https://arxiv.org/abs/1410.1141">Livni et al 2014</a>. In other words, general non-convex optimization time lower bounds are too broad to apply usefully to NN, but specific approaches to fixed architectures may be appropriate.</p>
<h3 id="limitations-of-theoretical-descriptions">Limitations of Theoretical Descriptions</h3>
<p>There are a couple of limitations in using asymptotic, theoretical descriptions of convergence rates to analyze these algorithms.</p>
<p>First, the \(\epsilon\) in \(\epsilon\)-approximate critical points above is merely a small piece in the overall generalization error that the NN will experience. As explained in <a href="https://papers.nips.cc/paper/3323-the-tradeoffs-of-large-scale-learning">Bousquet and Bottou 2007</a> (<a href="http://leon.bottou.org/papers/bottou-bousquet-2011">extended version</a>), the generalization error is broken into approximation (how accurate the entire function class of neural networks for a fixed architecture is in representing the true function we’re learning), estimation (how far we are from a global optimum among our hypothesis class of functions), and optimization error (our convergence tolerance). As cautioned in the aforementioned paper, the tradeoff between the aforementioned errors implies that even improvements in optimization convergence rate, like the use of full GD instead of stochastic GD (SGD) may not be helpful if they increase other errors in hidden ways.</p>
<p>Second, early stopping might prevent convergence altogether: as mentioned in Goodfellow’s book, gradient norms can increase while training error decreases. It’s unclear whether we can fold in early stopping as an implicit term in \(\Omega\) and claim that we’re reaching a critical point in this virtual cost function.</p>
<p>The fact that theoretical lower-bound rates are not predictive of NN training time (compared to what we see in practice) shows that there is a wide gap between general non-convex, smooth approximate-local-minimum finding and the same problem restricted to NNs.</p>
<h2 id="existing-algorithms">Existing Algorithms</h2>
<p>In the blog posts linked below, I will review the high-level details of existing algorithms for NN non-convex optimization. Most of these are methods that were developed for the composite <em>convex</em> smooth optimization problem, so they may not have any theoretical guarantees for the \(\epsilon\)-approximate critical point or local minimum problem. It turns out that we indeed find a general dichotomy between these GD algorithms:</p>
<ul>
<li>Algorithms which are practically available, e.g., <a href="https://www.tensorflow.org/api_guides/python/train">TensorFlow’s first order methods</a>, but were initially developed for convex problems, and whose non-convex interpretations are usually only approximate critical point finders</li>
<li>Algorithms which are (as of June 2017) cutting-edge research and not widely available, yet have been designed for finding local minima efficiently in non-convex settings. Nonetheless, they’re still useful to mention since the respective paper implementations might be available and it may be worthwhile to manually implement the optimization, too.</li>
</ul>
<p>This list of existing algorithms will overlap with the review in <a href="https://arxiv.org/abs/1609.04747">Ruder 2016</a>, but my intention is to be more comprehensive and rigorous, though less didactic, in terms of the update rules covered.</p>
<p>In general, all these rules have the format \(\vx_{t+1}=\vx_t-\eta_t\vg_t\) where \(\eta_t\) is a learning rate and \(\vg_t\) is the gradient descent direction, both making a small local improvement at the \(t\)-th discrete time. Theoretical analysis won’t be presented, but guarantees, assumptions, intuition, and update rules will be described. Proofs will be linked.</p>
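<p>As a concrete illustration (my own toy example, not from any of the reviewed papers), the generic rule \(\vx_{t+1}=\vx_t-\eta_t\vg_t\) takes only a few lines when \(\vg_t\) is the exact gradient and \(\eta_t\) is a constant:</p>

```python
# Generic first-order update x_{t+1} = x_t - eta_t * g_t, sketched on the
# toy objective f(x) = (x - 3)^2; the objective, constant step size, and
# function names here are illustrative choices only.
def descend(grad, x0, eta=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x = x - eta * grad(x)  # g_t is the gradient evaluated at x_t
    return x

# grad of (x - 3)^2 is 2 * (x - 3); iterates contract toward the minimum at 3
x_star = descend(lambda x: 2 * (x - 3), x0=0.0)
assert abs(x_star - 3) < 1e-6
```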
<ul>
<li><a href="/2017/06/20/nonconvex-first-order-methods.html">First order methods</a></li>
</ul>
<p>Technically, most neural networks don’t have smoothness or even differentiability everywhere. While those issues don’t seem to surface in practice, it turns out we can <a href="https://arxiv.org/abs/1804.07795">still make some strong statements</a> about first-order optimization methods.</p>
Mon, 19 Jun 2017 00:00:00 +0000
https://vlad17.github.io/2017/06/19/neural-network-optimization-methods.html
https://vlad17.github.io/2017/06/19/neural-network-optimization-methods.htmlmachine-learningoptimizationdeep-learningJupyter Tricks<h1 id="jupyter-tricks">Jupyter Tricks</h1>
<p>Here’s a list of my top-used Jupyter tricks, and what they do.</p>
<h2 id="ui">UI</h2>
<p>I find the UI to be intuitive, <code class="highlighter-rouge">Help > User Interface Tour</code> describes more. There are <strong>command</strong> (enter by pressing the escape button or clicking outside of a cell) and <strong>edit</strong> (enter by typing in a cell) modes. You can tell you’re in edit mode if the “pencil” corner indicator is present:</p>
<p><img src="/assets/2017-05-25-jupyter-tricks/corner-indicator.png" alt="corner indicator symbol" class="center-image" /></p>
<p>It’s also faster to use the commands as listed in <code class="highlighter-rouge">Help > Keyboard Shortcuts</code>; with those you can also remove the toolbar with <code class="highlighter-rouge">View > Toggle Toolbar</code>.</p>
<p><code class="highlighter-rouge">jupyter notebook existing-notebook.ipynb</code> - auto-launch an existing notebook without the Jupyter menu.</p>
<h2 id="remote-serving">Remote Serving</h2>
<p>Run the kernel on a beefy server, view with a browser on your laptop. You can change the ports appropriately to something high and unused.</p>
<ol>
<li>On laptop, initiate SSH with a tunnel <code class="highlighter-rouge">ssh -L8888:localhost:12321 vlad@my-beefy-server.com</code></li>
<li>On server, launch <code class="highlighter-rouge">tmux</code> if you’d like to persist the Jupyter server (useful if you need to keep running stuff and reconnect notebook later).</li>
<li>On server, <code class="highlighter-rouge">jupyter notebook --no-browser --port=12321</code></li>
<li>On laptop, navigate to <code class="highlighter-rouge">localhost:8888</code> in-browser.</li>
</ol>
<h2 id="tex">TeX</h2>
<p>The way I use TeX in Jupyter notebook depends on the end goal of the notebook itself.</p>
<ul>
<li><a href="#embedded-latex">Embedded (Math) TeX</a>. Here, the end product is <strong>*.ipynb notebook itself</strong>, in which case the <em>TeX is auxiliary to the code</em>, placed only to elucidate the math involved.</li>
<li><a href="#jupyter-prepared-reports">Jupyter-prepared Reports</a>. Here, the end product is a prepared <strong>*.pdf report</strong>, in which case the <em>code is auxiliary to the TeX</em>, with Jupyter used to create the source code for a more formal report or document, one you would print out.</li>
</ul>
<p>In both of the above use cases, one may find it useful to generate rendered <a href="#tex-from-code">TeX from code</a>.</p>
<h3 id="embedded-tex">Embedded TeX</h3>
<p>Use MathJax in Markdown cells. In the first equation you can add convenience <code class="highlighter-rouge">\newcommand</code> items if you prefer (MathJax will evaluate cells top down).</p>
<p><em>Note:</em> The magic <code class="highlighter-rouge">%%latex</code> works, but I don’t use it. It’s treated like a code cell, but we’re really only interested in the output in this setting. You can also embed images right in the Markdown with <code class="highlighter-rouge"><img src="path/to/image.png"></code>.</p>
<h3 id="jupyter-prepared-reports">Jupyter-prepared Reports</h3>
<p>In this setting, use raw input cells to create segments of <code class="highlighter-rouge">LaTeX</code> code. This won’t render within the notebook, but this is OK since we’re treating the notebook like source in this setting. To generate a report (with the inline evaluated code), I use <code class="highlighter-rouge">nbconvert</code>. For prepared scripts, check out <a href="https://github.com/vlad17/ipython-latex">my ipython-latex repo</a>.</p>
<h3 id="tex-from-code">TeX from Code</h3>
<p><img src="/assets/2017-05-25-jupyter-tricks/math.png" alt="generated math" class="center-image" /></p>
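<p>One way to produce rendered TeX from code like the above (a sketch of the general idea, assuming a Jupyter kernel with IPython available; the <code class="highlighter-rouge">latex_matrix</code> helper is my own, not a library function):</p>

```python
# Build a LaTeX string programmatically, then hand it to MathJax via
# IPython's display machinery. `latex_matrix` is a hypothetical helper.
def latex_matrix(rows):
    """Format rows of numbers as a LaTeX pmatrix string."""
    body = r" \\ ".join(" & ".join(str(x) for x in row) for row in rows)
    return r"\begin{pmatrix} " + body + r" \end{pmatrix}"

tex = latex_matrix([[1, 2], [3, 4]])
try:
    from IPython.display import Math, display
    display(Math(tex))  # typeset output in the notebook cell
except ImportError:
    print(tex)  # plain string fallback outside Jupyter
```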
<h2 id="logging">Logging</h2>
<p>Bring logging to the cell output:</p>
<pre><code class="language-{python}">import logging
# attach a stderr StreamHandler to the root logger so records show in cell output
logging.getLogger().addHandler(logging.StreamHandler())
# "some_module.that.i.want.logged" is a placeholder logger name
logging.getLogger("some_module.that.i.want.logged").setLevel(logging.INFO)
</code></pre>
<h2 id="matplotlib">Matplotlib</h2>
<pre><code class="language-{python}">%matplotlib inline
import matplotlib.pyplot as plt
plt.rc('font', family='serif', serif='Computer Modern Roman')
plt.rc('text', usetex=True)
</code></pre>
<p>This preamble will render generated Matplotlib objects in the Jupyter HTML. The font specified above is consistent with the LaTeX generated by ipython-latex in Jupyter-prepared reports. Notably, this is going to differ from the MathJax font that is used inside Markdown cells.</p>
<h2 id="magic">Magic</h2>
<p>Docs accessible with <code class="highlighter-rouge">%<magic>?</code>.</p>
<ul>
<li><code class="highlighter-rouge">x = !! echo hi</code> - run a bash command in a subshell, save stdout in returned string, split on newlines (<code class="highlighter-rouge">!</code> for no split)</li>
<li><code class="highlighter-rouge">%%bash</code> - run cell as bash</li>
<li><code class="highlighter-rouge">%%timeit</code> - time cell</li>
<li><code class="highlighter-rouge">? f</code> - get docstring</li>
<li><code class="highlighter-rouge">?? f</code> - get source</li>
<li><code class="highlighter-rouge">%run nb.ipynb</code> - line magic, runs notebook</li>
<li><code class="highlighter-rouge">%pdb</code> - run debugger on cell evaluation</li>
<li><code class="highlighter-rouge">%env ENV_VAR=3</code> - set environment variable in kernel</li>
<li><a href="http://arogozhnikov.github.io/2016/09/10/jupyter-features.html#Profiling:-%prun,-%lprun,-%mprun">Cell code profiling</a>.</li>
</ul>
<h2 id="extensions">Extensions</h2>
<p>These are the extensions I find useful: hide input cells, auto-format code, toggle font size, toggle comments, spell check, and control the cutoff at which output starts scrolling, respectively, below.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>conda install -c conda-forge jupyter_contrib_nbextensions
pip install yapf # for code-prettification
for i in hide_input/main code_prettify/code_prettify code_font_size/code_font_size comment-uncomment/main spellchecker/main autoscroll/main; do jupyter nbextension enable $i ; done
</code></pre></div></div>
Thu, 25 May 2017 00:00:00 +0000
https://vlad17.github.io/2017/05/25/jupyter-tricks.html
https://vlad17.github.io/2017/05/25/jupyter-tricks.htmltoolsMy Princeton Senior Thesis<h1 id="my-princeton-senior-thesis">My Princeton Senior Thesis</h1>
<p><strong>Submitted to the university as part of completion of Computer Science BSE degree</strong> June 2017</p>
<p>Completed during the 2016-2017 academic year.</p>
<p><a href="https://arxiv.org/abs/1705.10813">A concise and more up-to-date paper version.</a></p>
<p><a href="/assets/2017-05-23-my-princeton-senior-thesis/thesis.pdf">Link to download report.</a></p>
<p><a href="https://github.com/vlad17/runlmc">Code repository.</a></p>
Tue, 23 May 2017 00:00:00 +0000
https://vlad17.github.io/2017/05/23/my-princeton-senior-thesis.html
https://vlad17.github.io/2017/05/23/my-princeton-senior-thesis.htmlmy-whitepapersmachine-learningThe Semaphore Barrier (Solution)<h1 id="the-semaphore-barrier">The Semaphore Barrier</h1>
<p>This is the answer post to the question <a href="/2017/01/24/semaphore-barrier.html">posed here</a>.</p>
<h2 id="a-useful-formalism">A Useful Formalism</h2>
<p>Reasoning about parallel systems is tough, so to make sure that our solution is correct we’ll have to introduce a formalism for parallel execution.</p>
<p>The notion is the following. Given some instructions for threads \(\{t_i\}_{i=0}^{n-1}\), we expect each thread’s individual instructions to execute in sequence, but instructions between threads can be interleaved arbitrarily.</p>
<p>In our simplified execution model without recursive functions, it suffices to assume each thread has a fixed set of instructions it will execute. Let this be the sequence \(t_i\), with \(k\)-th instruction \(t_{ik}\), which must be <code class="highlighter-rouge">s(j).up</code> or <code class="highlighter-rouge">s(j).down</code> for some <code class="highlighter-rouge">j</code>.</p>
<h3 id="order-of-execution">Order of Execution</h3>
<p>Our parallel machine is free to choose a global order of operations \(g\) among all threads \(\{t_i\}_{i}\), where each \(g_j=t_{ik}\) for all \(j\) and some corresponding \(i,k\). However, the machine has to choose an ordering that is <em>valid</em>.</p>
<p>A valid ordering \(g\) satisfies two criteria.</p>
<p>The <em>sequencing constraint</em> is as follows:</p>
<p>\[
k<m \implies t_{ik} <_g t_{im}
\]</p>
<p>Above, we define an ordering over operations \(x <_g y\) with respect to some ordering in the natural way: in \(g\), \(x\) comes before \(y\). If a statement holds for all (valid) \(g\), we omit the subscript: the conclusion of the sequencing constraint can be re-written \(t_{ik} < t_{im}\).</p>
<p>In addition, for every global ordering of operations \(g\), there’s a corresponding sequence \(s\) (which differs from the un-italicized <code class="highlighter-rouge">s(i)</code>, the code for the <code class="highlighter-rouge">i</code>-th semaphore). The \(j\)-th element in the sequence \(s\) is the state of each semaphore after the \(j\)-th instruction \(g_j\). We represent this state as a function from semaphore index to semaphore state. Letting \(s_0=(i\mapsto 0)\):</p>
<p>\[
s_{j}(i)=s_{j-1}(i)+\begin{cases}
1 & g_j=\text{s(i).up}\\ -1 & g_j=\text{s(i).down}\\ 0 & \text{otherwise}
\end{cases}
\]</p>
<p>The above just says that after the <code class="highlighter-rouge">i</code>-th semaphore is upped, its value should be 1 more than before, and vice-versa for down.</p>
<p>The <em>semaphore constraint</em> requires that the global order \(g\) is chosen such that:
\[
\forall i,j\,\,\,\,\, s_j(i)\ge 0
\]
Here, this constraint just makes sure that semaphores actually work as expected - it can’t be that a <code class="highlighter-rouge">down</code> call succeeds on a semaphore that had state 0 - it should wait until a corresponding <code class="highlighter-rouge">up</code> call completes, first.</p>
<h3 id="solution-criteria">Solution Criteria</h3>
<p>A solution (which defines the particular values \(\{t_i\}_{i}\)) must satisfy two criteria.</p>
<p>(<em>Correctness</em>): No thread can finish <code class="highlighter-rouge">b.wait()</code> before all threads have called the method:
\[
\forall i,j,\,\, t_{j1}<t_{i\left\vert t_i\right\vert}
\]</p>
<p>(<em>Liveness</em>): Eventually, every thread must complete <code class="highlighter-rouge">b.wait()</code>. There must exist at least one valid ordering \(g\) (if there is only one, the parallel processing system is forced to choose it).</p>
<h3 id="example">Example</h3>
<p>Let’s apply the formalism to the warmup solution for two threads:</p>
<table>
<thead>
<tr>
<th>t0</th>
<th> </th>
<th>t1</th>
</tr>
</thead>
<tbody>
<tr>
<td><code class="highlighter-rouge">s(0).up</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(1).up</code></td>
</tr>
<tr>
<td><code class="highlighter-rouge">s(1).down</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(0).down</code></td>
</tr>
</tbody>
</table>
<p>All the potential orderings respecting sequencing are:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0. s(0).up, s(1).up, s(1).down, s(0).down
1. s(0).up, s(1).up, s(0).down, s(1).down
2. s(1).up, s(0).up, s(1).down, s(0).down
3. s(1).up, s(0).up, s(0).down, s(1).down
4. s(0).up, s(1).down, s(1).up, s(0).down
5. s(1).up, s(0).down, s(0).up, s(1).down
</code></pre></div></div>
<p>Of these, we notice <code class="highlighter-rouge">4</code> and <code class="highlighter-rouge">5</code> violate the semaphore constraint. For <code class="highlighter-rouge">4</code>, the state function after the second step \(s_2(0)=1, s_2(1)=-1\), and vice-versa for <code class="highlighter-rouge">5</code>.</p>
<p>That leaves only <code class="highlighter-rouge">0,1,2,3</code> as the valid orderings. In turn, we satisfy liveness. Correctness is guaranteed by inspection: the last operations are only executed after the first ones.</p>
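<p>This enumeration is small enough to automate. Here’s a sketch (my own code, not from the original post) that generates every interleaving respecting the sequencing constraint and filters by the semaphore constraint:</p>

```python
# Enumerate interleavings of per-thread instruction sequences; per-thread
# order (the sequencing constraint) holds by construction, then we keep
# only orderings satisfying the semaphore non-negativity constraint.
def interleavings(seqs, prefix=()):
    if all(len(s) == 0 for s in seqs):
        yield prefix
    for i, s in enumerate(seqs):
        if s:  # advance thread i by one instruction
            rest = [t[1:] if j == i else t for j, t in enumerate(seqs)]
            yield from interleavings(rest, prefix + (s[0],))

def valid(order):
    state = {}  # semaphore index -> current value, starting at 0
    for op, sem in order:  # op is "up" or "down"
        state[sem] = state.get(sem, 0) + (1 if op == "up" else -1)
        if state[sem] < 0:
            return False
    return True

# the two-thread warmup: t0 = s(0).up, s(1).down; t1 = s(1).up, s(0).down
t0 = [("up", 0), ("down", 1)]
t1 = [("up", 1), ("down", 0)]
orders = list(interleavings([t0, t1]))
assert len(orders) == 6                    # the six orderings listed above
assert sum(valid(o) for o in orders) == 4  # orderings 4 and 5 are ruled out
```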
<h2 id="solution-1">Solution 1</h2>
<p>\(O(n^2)\) space and \(O(n)\) time.</p>
<p>This solution follows directly from reasoning about our formalism. Suppose <code class="highlighter-rouge">s(i)</code> was upped only once. For any \(g\) to be valid (no negative values), we must only down it once as well. Moreover, any down is guaranteed to occur after the up, again by the non-negativity requirement.</p>
<p>This could be proven formally - every state starts at 0, so if no ups occur before a down, by induction, the state of that semaphore is 0 right before the down and -1 after. This leads to a contradiction.</p>
<p>Suppose \(t_{ik}\) is <code class="highlighter-rouge">s(ij).up</code> and \(t_{jm}\) is <code class="highlighter-rouge">s(ij).down</code>. If we never use <code class="highlighter-rouge">s(ij)</code> again, the lemma above holds, in which case for every ordering \(t_{ik} < t_{jm}\). For any sequences \(t_i,t_j\), we must have \(k\in[1, \left\vert t_i\right\vert],m\in[1, \left\vert t_j\right\vert]\). Then by transitivity we conclude:
\[
t_{i1}\le t_{ik} < t_{jm} \le t_{j\left\vert t_j\right\vert}
\]</p>
<p>Thus, the presence of <code class="highlighter-rouge">s(ij).up</code> on thread <code class="highlighter-rouge">i</code> and <code class="highlighter-rouge">s(ij).down</code> on <code class="highlighter-rouge">j</code> guarantees correctness, if applied to all threads <code class="highlighter-rouge">i,j</code>. To guarantee some ordering exists, we will want to ignore the redundant case <code class="highlighter-rouge">s(ii)</code> and sequence our operations in a clear way:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def wait(thread i):
for all j != i:
s(ij).up
for all j != i:
s(ji).down
</code></pre></div></div>
<p>This solution is live: an ordering in which all the ups execute first (in any order), followed by all the downs, exists and is valid.</p>
<p>With 3 threads, this looks like:</p>
<table>
<thead>
<tr>
<th>t0</th>
<th> </th>
<th>t1</th>
<th> </th>
<th>t2</th>
</tr>
</thead>
<tbody>
<tr>
<td><code class="highlighter-rouge">s(01).up</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(12).up</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(20).up</code></td>
</tr>
<tr>
<td><code class="highlighter-rouge">s(02).up</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(10).up</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(21).up</code></td>
</tr>
<tr>
<td><code class="highlighter-rouge">s(10).down</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(21).down</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(02).down</code></td>
</tr>
<tr>
<td><code class="highlighter-rouge">s(20).down</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(01).down</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(12).down</code></td>
</tr>
</tbody>
</table>
<p>In other words, if we represent each pairwise constraint \(\forall i,j,T\triangleq\left\vert t_i\right\vert, t_{j1}<t_{iT}\) explicitly, we get a solution.</p>
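<p>As a sanity check, here’s my own translation of this solution to Python threads (assuming <code class="highlighter-rouge">threading.Semaphore</code>, whose <code class="highlighter-rouge">release</code>/<code class="highlighter-rouge">acquire</code> play the roles of up/down):</p>

```python
# Solution 1 with one zero-initialized semaphore per ordered pair (i, j),
# i != j; s[i][j] plays the role of s(ij) in the pseudocode above.
import threading

def make_barrier(n):
    s = [[threading.Semaphore(0) for _ in range(n)] for _ in range(n)]
    def wait(i):
        for j in range(n):
            if j != i:
                s[i][j].release()  # s(ij).up
        for j in range(n):
            if j != i:
                s[j][i].acquire()  # s(ji).down
    return wait

n = 4
wait = make_barrier(n)
passed = []
def worker(i):
    wait(i)
    passed.append(i)  # reached only after all n threads called wait

threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert sorted(passed) == list(range(n))  # everyone made it through
```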
<h2 id="solution-2">Solution 2</h2>
<p>\(O(n)\) space and \(O(n)\) time.</p>
<p>This solution can be constructed by augmenting our lemma from before: for any \(g\) to be valid (no negative values), any semaphore must be upped more times than it has been downed right before every down.</p>
<p>Then, if a single thread is responsible for upping its own semaphore, and all other threads down it exactly once, <em>at least one</em> up must’ve occurred before each of the downs. This lets us recover the transitive inequality from before for correctness.</p>
<p>In other words, the following works:</p>
<table>
<thead>
<tr>
<th>t0</th>
<th> </th>
<th>t1</th>
<th> </th>
<th>t2</th>
</tr>
</thead>
<tbody>
<tr>
<td><code class="highlighter-rouge">s(0).up</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(1).up</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(2).up</code></td>
</tr>
<tr>
<td><code class="highlighter-rouge">s(0).up</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(1).up</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(2).up</code></td>
</tr>
<tr>
<td><code class="highlighter-rouge">s(1).down</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(2).down</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(0).down</code></td>
</tr>
<tr>
<td><code class="highlighter-rouge">s(2).down</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(0).down</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(1).down</code></td>
</tr>
</tbody>
</table>
<p>With the same liveness argument, more generally the pseudocode is:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def wait(thread i):
do n-1 times:
s(i).up
for all j != i:
s(j).down
</code></pre></div></div>
<h2 id="solution-3">Solution 3</h2>
<p>\(O(n)\) space and \(O(1)\) average time, \(O(n)\) worst-case time</p>
<p>Now we need to start getting a little bit more clever. Previous solutions still performed a quadratic amount of work total, establishing the quadratic number of inequalities needed for correctness.</p>
<p>The goal here will be to get transitivity to do some of our heavy lifting.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def wait(thread i):
// (1)
if i < n-1:
s(i).down
s(i + 1).up
else:
s(0).up
s(n-1).down
// (2)
if i < n-1:
s(n).down
else:
do n-1 times:
s(n).up
</code></pre></div></div>
<p>By the reasoning from before, block (2) guarantees that \(t_{(n-1)k} < t_{j\left\vert t_j\right\vert}\) for every \(j\neq n-1\) and some \(k\in[3,n+1]\). Then by sequential validity of our orderings and transitivity we have a global property saying:
\[
\forall j, t_{(n-1)3}<t_{j\left\vert t_j\right\vert}
\]</p>
<p>In other words, all threads wait on thread \(n-1\) (eq. 1).</p>
<p>Next, we focus on block (1). We apply the lemma from solution 1 for each \(i\) between \(1\) and \(n-2\), which, by virtue of <code class="highlighter-rouge">s(i)</code> only being used once, says that the <code class="highlighter-rouge">s(i).down</code> instruction on thread \(i\) follows the <code class="highlighter-rouge">s(j + 1).up</code> one on thread \(j\), where \(j = i - 1\). For \(j<n-2\), this statement is \(t_{j2}<t_{i1}\). Next, by the sequence property, we have \(\forall j,t_{j1}<t_{j2}\). Finally, chaining all these inequalities together, we get for \(j<n-1\) (eq. 2):</p>
<p>\[
t_{j1}\le t_{(n-2)1}< t_{(n-2)2}
\]</p>
<p>We use the lemma from solution 1 once on the semaphore <code class="highlighter-rouge">s(n-1)</code>, upped exactly at \(t_{(n-2)2}\) and downed on \(t_{(n-1)2}\). In turn, we have (eq. 3):</p>
<p>\[
t_{(n-2)2} < t_{(n-1)2}< t_{(n-1)3}
\]</p>
<p>Let’s recap. All threads already wait on thread \(n-1\). We just need to check that all threads also wait on all threads \(i\) between \(0\) and \(n-2\). For all \(i,j\):</p>
<p>\[
\begin{align} t_{i1} &< t_{(n-2)2} & \text{eq. 2}\\ &<t_{(n-1)3} &\text{eq. 3} \\ &<t_{j\left\vert t_j\right\vert} &\text{eq. 1} \\ \end{align}
\]</p>
<p>This finishes the correctness proof. We show liveness by providing a valid ordering: \(t_{(n-1)1}\) first, followed by \(t_{i1}, t_{i2}\) for each \(i\) from \(0\) to \(n-2\) in order. Then we let \(t_{n-1}\) finish, after which the order doesn’t matter.</p>
<p>Here’s what this looks like on 5 threads:</p>
<table>
<thead>
<tr>
<th>t0</th>
<th> </th>
<th>t1</th>
<th> </th>
<th>t2</th>
<th> </th>
<th>t3</th>
<th> </th>
<th>t4</th>
</tr>
</thead>
<tbody>
<tr>
<td><code class="highlighter-rouge">s(0).down</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(1).down</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(2).down</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(3).down</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(0).up</code></td>
</tr>
<tr>
<td><code class="highlighter-rouge">s(1).up</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(2).up</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(3).up</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(4).up</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(4).down</code></td>
</tr>
<tr>
<td><code class="highlighter-rouge">s(5).down</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(5).down</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(5).down</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(5).down</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(5).up</code></td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td><code class="highlighter-rouge">s(5).up</code></td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td><code class="highlighter-rouge">s(5).up</code></td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td><code class="highlighter-rouge">s(5).up</code></td>
</tr>
</tbody>
</table>
<h2 id="solution-4">Solution 4</h2>
<p>\(O(n)\) space and \(O(1)\) worst-case time</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def wait(thread i):
// (1)
if i < n-1:
s(i).down
s(i + 1).up
else:
s(0).up
s(n-1).down
// (2)
if i > 0:
s(i).down
s(i - 1).up
else:
s(n-1).up
s(0).down
</code></pre></div></div>
<p>The proof is left as an exercise to the reader :)</p>
<h2 id="solution-5">Solution 5</h2>
<p>\(O(1)\) space and \(O(1)\) worst-case time. This solution works by simulating a mutex with a semaphore, and implementing the barrier with that mutex.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ctr = 0
def wait(thread i):
if i == 0:
ctr += 1
local = ctr
s(0).up
else:
s(0).down
ctr += 1
local = ctr
s(0).up
if local == n:
s(1).up
else:
s(1).down
s(1).up
</code></pre></div></div>
<p>This one introduces control flow that isn’t predictable given just <code class="highlighter-rouge">i</code>, so our model isn’t sufficient to prove that it works.</p>
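<p>Even without a proof in the formal model, we can at least check this construction empirically. Below is my own Python translation (assuming <code class="highlighter-rouge">threading.Semaphore</code>); initializing the mutex semaphore’s state to 1 stands in for the special-casing of thread 0 in the pseudocode above:</p>

```python
# Counter-plus-turnstile barrier: a semaphore with initial value 1 acts as
# s(0) (the mutex), and a zero-initialized semaphore acts as s(1) (the gate).
import threading

def make_barrier(n):
    mutex = threading.Semaphore(1)  # s(0), pre-upped rather than upped by thread 0
    gate = threading.Semaphore(0)   # s(1)
    state = {"ctr": 0}
    def wait():
        with mutex:                 # s(0).down ... s(0).up
            state["ctr"] += 1
            local = state["ctr"]
        if local == n:
            gate.release()          # s(1).up: the last arrival opens the gate
        else:
            gate.acquire()          # s(1).down
            gate.release()          # s(1).up: turnstile lets the next thread through
    return wait

n = 8
wait = make_barrier(n)
done = []
def worker():
    wait()
    done.append(1)  # reached only after all n threads called wait

threads = [threading.Thread(target=worker) for _ in range(n)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert len(done) == n  # every thread eventually passed the barrier
```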
Wed, 25 Jan 2017 00:00:00 +0000
https://vlad17.github.io/2017/01/25/semaphore-answer.html
https://vlad17.github.io/2017/01/25/semaphore-answer.htmlinterview-questionparallelThe Semaphore Barrier<h1 id="the-semaphore-barrier">The Semaphore Barrier</h1>
<p>I wanted to share an interview question I came up with. The idea came from my operating and distributed systems classes, where we were expected to implement synchronization primitives and reason about parallelism, respectively.</p>
<p>Synchronization primitives can be used to coordinate across multiple threads working on a task in parallel.</p>
<p>Most primitives can be implemented through the use of a condition variable and lock, but I was wondering about implementing other primitives in terms of semaphores.</p>
<h2 id="introduction-to-the-primitives">Introduction to the Primitives</h2>
<h3 id="semaphores">Semaphores</h3>
<p>Semaphores are a type of synchronization primitive that encapsulate the idea of “thresholding”.</p>
<p>A semaphore <code class="highlighter-rouge">s</code> has two operations: <code class="highlighter-rouge">s.up()</code> and <code class="highlighter-rouge">s.down()</code>. A semaphore also has an internal non-negative number representing its state. A thread calling <code class="highlighter-rouge">s.down()</code> is allowed to continue only if this number is positive, in which case the number is atomically decremented and the thread goes on with its work.</p>
<p><code class="highlighter-rouge">s.up()</code>, by contrast, never blocks: it atomically increments the number, potentially unblocking a thread waiting in <code class="highlighter-rouge">s.down()</code>.</p>
<p>If we wanted to make sure that only 5 threads executing <code class="highlighter-rouge">f()</code> (a function that we implement) ever printed <code class="highlighter-rouge">hello</code>, and we somehow preemptively set the state of semaphore <code class="highlighter-rouge">s</code> to <code class="highlighter-rouge">5</code>, then the following code would work:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def f():
s.down()
print("hello\n", end='')
</code></pre></div></div>
<p>Regardless of how many threads call <code class="highlighter-rouge">f</code> at the same time, because of the atomic guarantees on <code class="highlighter-rouge">s</code>’s state, only 5 threads will be let through to print <code class="highlighter-rouge">hello</code>.</p>
<p>If at any time in the above we had at least 6 threads call <code class="highlighter-rouge">f</code> and any thread also call <code class="highlighter-rouge">s.up()</code> at some point, eventually 6 <code class="highlighter-rouge">hello</code>s would be printed.</p>
<p>In the following, we’ll assume the OS provides a magic semaphore implementation.</p>
<h3 id="barriers">Barriers</h3>
<p>A barrier is similar to a semaphore, but it’s meant to be a one-off, well, barrier. A barrier is preconfigured to accept <code class="highlighter-rouge">n</code> threads. Its API is defined by <code class="highlighter-rouge">b.wait()</code>, where a thread waits until <code class="highlighter-rouge">n-1</code> other threads are <em>also</em> waiting on <code class="highlighter-rouge">b</code>, and only then are the threads allowed to continue.</p>
<p>Barriers are useful when we want to coordinate some work. Suppose we have 2 threads that want to draw a picture together. Say thread 0 can only draw red and thread 1 can only draw blue, but no two threads can draw on the same half of the screen at the same time.</p>
<p>Assuming <code class="highlighter-rouge">b</code> has been initialized with <code class="highlighter-rouge">n=2</code>, the following would work:</p>
<table>
<thead>
<tr>
<th>thread 0</th>
<th> </th>
<th>thread 1</th>
</tr>
</thead>
<tbody>
<tr>
<td>draw left half</td>
<td> </td>
<td>draw right half</td>
</tr>
<tr>
<td><code class="highlighter-rouge">b.wait()</code></td>
<td> </td>
<td><code class="highlighter-rouge">b.wait()</code></td>
</tr>
<tr>
<td>draw right half</td>
<td> </td>
<td>draw left half</td>
</tr>
</tbody>
</table>
<p>Now, no matter which thread is faster, we’ll never violate the condition that 2 threads write on the same half of the screen.</p>
<p>The only way thread 0 can be on the left half is if it hasn’t crossed the barrier <code class="highlighter-rouge">wait</code> yet. The only way thread 1 can be on the left half is if it crossed the barrier, but since <code class="highlighter-rouge">b</code> was initialized with <code class="highlighter-rouge">n=2</code>, it can only cross the barrier if thread 0 is waiting, in which case thread 0 must have finished drawing on the left half!</p>
<p>Similar logic can be applied to the right side; in other words, no side of the screen is ever shared by two threads at any given time, regardless of how fast one thread is compared to the other.</p>
<h2 id="the-challenge">The Challenge</h2>
<h3 id="warm-up">Warm-up</h3>
<p>Our goal will be to implement a barrier (namely, fill in what <code class="highlighter-rouge">b.wait()</code> does for a given <code class="highlighter-rouge">n</code>). Let’s focus on the case where we only have <code class="highlighter-rouge">n=2</code> threads.</p>
<p>This can be done with two semaphores.</p>
<h3 id="solution-to-the-warm-up">Solution to the Warm-up</h3>
<p>As you may have guessed, the only nontrivial semaphore arrangement works. From here on, we let <code class="highlighter-rouge">s(i)</code> be the <code class="highlighter-rouge">i</code>-th semaphore, initialized with state 0. Similarly, \(t_i\) will refer to the \(i\)-th thread. Here’s what we would want <code class="highlighter-rouge">b.wait()</code> to do on each thread.</p>
<table>
<thead>
<tr>
<th>t0</th>
<th> </th>
<th>t1</th>
</tr>
</thead>
<tbody>
<tr>
<td><code class="highlighter-rouge">s(0).up</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(1).up</code></td>
</tr>
<tr>
<td><code class="highlighter-rouge">s(1).down</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(0).down</code></td>
</tr>
</tbody>
</table>
<p>Indeed - \(t_0\) can’t advance past <code class="highlighter-rouge">b.wait()</code> unless <code class="highlighter-rouge">s(1)</code> is <code class="highlighter-rouge">up</code>ped, which only happens if \(t_1\) calls <code class="highlighter-rouge">b.wait()</code>. Symmetric logic shows that our barrier, if implemented to execute those instructions on each thread, will similarly stop \(t_1\) from advancing without \(t_0\) being ready.</p>
<h3 id="the-general-problem">The General Problem</h3>
<p>Now, here’s the main question:</p>
<p><strong>Can we implement an arbitrary barrier, capable of blocking <code class="highlighter-rouge">n</code> threads, with semaphores and no control flow? With control flow?</strong></p>
<p>Can we do so <em>efficiently</em>, using as few semaphores as possible? In as little time per thread as possible?</p>
<h4 id="attempt-extending-the-2-thread-case">Attempt: Extending the 2-thread Case</h4>
<p>Let’s try extending our approach from the 2-thread case. Maybe we can just use 3 semaphores now, reusing the “cycle” that seems to be built into the 2-thread example?</p>
<table>
<thead>
<tr>
<th>t0</th>
<th> </th>
<th>t1</th>
<th> </th>
<th>t2</th>
</tr>
</thead>
<tbody>
<tr>
<td><code class="highlighter-rouge">s(0).up</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(1).up</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(2).up</code></td>
</tr>
<tr>
<td><code class="highlighter-rouge">s(1).down</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(2).down</code></td>
<td> </td>
<td><code class="highlighter-rouge">s(0).down</code></td>
</tr>
</tbody>
</table>
<p>But this won’t work, unfortunately: suppose \(t_0\) is running slow. Both \(t_1\) and \(t_2\) finish well ahead of time, and each calls <code class="highlighter-rouge">b.wait()</code>. Then \(t_2\) ups <code class="highlighter-rouge">s(2)</code>, after which \(t_1\) can pass through without waiting for \(t_0\) to call <code class="highlighter-rouge">b.wait()</code>, a violation of our barrier behavior.</p>
<h4 id="answer">Answer</h4>
<p>Not so fast! Try it yourself first! How efficient is your solution? There are several solutions, in increasing order of difficulty. The following list gives the asymptotic space complexity (number of semaphores used) and time complexity (<strong>per thread</strong>) of each.</p>
<ol>
<li>\(O(n^2)\) space and \(O(n)\) time</li>
<li>\(O(n)\) space and \(O(n)\) time</li>
<li>\(O(n)\) space and \(O(1)\) average time, \(O(n)\) worst-case time</li>
<li>\(O(n)\) space and \(O(1)\) worst-case time</li>
<li>\(O(1)\) space and \(O(1)\) worst-case time</li>
</ol>
<p><a href="/2017/01/25/semaphore-answer.html">Link to answer</a></p>
<h3 id="a-note-on-thread-ids">A Note on Thread IDs</h3>
<p>The fact that we can write different code for each of the threads to execute in the above examples might seem a bit questionable. However, we can get around this by assuming that we have access to thread IDs. As long as we can procure a thread’s procedure given just its ID (and the function procuring such a procedure doesn’t take \(O(n^2)\) space), we should be fine.</p>
<p>Even if the thread ID isn’t available, we can use an atomic counter, which assigns effective thread IDs based on which thread called <code class="highlighter-rouge">b.wait()</code> first:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>atomic = AtomicInteger(0)
def wait():
    tid = atomic.increment_and_get()
    ...
</code></pre></div></div>
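<p>Python has no built-in <code class="highlighter-rouge">AtomicInteger</code>, but the same effect can be sketched with a lock-protected counter (a hypothetical stand-in for whatever atomic primitive the platform provides):</p>

```python
import threading

class AtomicInteger:
    """Lock-based stand-in for an atomic fetch-and-increment."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def increment_and_get(self):
        with self._lock:
            self._value += 1
            return self._value

atomic = AtomicInteger(0)
tids = []

def record_tid():
    # each caller is assigned a distinct effective thread ID,
    # in the order the calls win the lock
    tids.append(atomic.increment_and_get())

threads = [threading.Thread(target=record_tid) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
# tids is some permutation of 1..8
```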
Tue, 24 Jan 2017 00:00:00 +0000
https://vlad17.github.io/2017/01/24/semaphore-barrier.html
https://vlad17.github.io/2017/01/24/semaphore-barrier.htmlinterview-questionparallelMy Princeton Junior Year Research<h1 id="my-princeton-junior-year-research">My Princeton Junior Year Research</h1>
<p><strong>Unpublished</strong></p>
<p><strong>Submitted to the university as part of completion of Computer Science BSE degree</strong> January 2016</p>
<p>Completed during fall semester 2015-2016</p>
<p><a href="/assets/2016-11-03-my-princeton-junior-year-research/paper.pdf">Link to download report.</a></p>
Thu, 03 Nov 2016 00:00:00 +0000
https://vlad17.github.io/2016/11/03/my-princeton-junior-year-research.html
https://vlad17.github.io/2016/11/03/my-princeton-junior-year-research.htmlmy-whitepapersmachine-learninghardware-accelerationMapReduce<h1 id="mapreduce-simplified-data-processing-on-large-clusters">MapReduce: Simplified Data Processing on Large Clusters</h1>
<p><strong>Published</strong> December 2004</p>
<p><a href="http://research.google.com/archive/mapreduce.html">Paper link</a></p>
<h2 id="abstract">Abstract</h2>
<p>MapReduce offers an abstraction for large-scale computation: it manages scheduling, distribution, parallelism, partitioning, communication, and reliability uniformly on behalf of any application that adheres to its template for execution.</p>
<h2 id="introduction">Introduction</h2>
<h2 id="programming-model">Programming Model</h2>
<p>MR offers the application-level programmer two operations through which to express their large-scale computation.</p>
<p><em>Note</em>: the types I offer here are not identical to the original MapReduce; my version is simplified somewhat, though the two formulations are equivalent in the sense that each reduces to the other.</p>
<ol>
<li><code class="highlighter-rouge">type Map T K V = T -> [(K, V)]</code></li>
<li><code class="highlighter-rouge">type Reduce K V U = K -> [V] -> U</code></li>
</ol>
<p>Run-time type requirements (necessary for the implementation) are that <code class="highlighter-rouge">K</code> is both “shuffle-able” and equality-checkable. Whether a type can be shuffled depends on one’s partitioning function; in the paper, this amounts to a requirement of orderability or hashability.</p>
<p>Evaluation’s signature, as presented in the MR paper, in pseudo Haskell, would be:</p>
<pre><code class="language-haskell">-- Assume all are Serializable
evaluate :: (Hashable k, Eq k) => Map t k v -> Reduce k v u -> [t] -> [u]
</code></pre>
<h3 id="example">Example</h3>
<p>Word count can be implemented easily with the above definitions: we map over a list of words, converting each word into the pair (word, 1), and then the reduce operation sums the second component of each pair.</p>
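<p>Under the simplified types above, word count looks like the following; the single-process <code class="highlighter-rouge">evaluate</code> here is a hypothetical local stand-in for the distributed runtime, just to make the signatures concrete:</p>

```python
from collections import defaultdict

def evaluate(map_fn, reduce_fn, inputs):
    # shuffle: group all mapped values by key (stands in for the
    # distributed partition/sort phase)
    groups = defaultdict(list)
    for t in inputs:
        for k, v in map_fn(t):
            groups[k].append(v)
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

# Map T K V = T -> [(K, V)]: a line of text to (word, 1) pairs
def wc_map(line):
    return [(word, 1) for word in line.split()]

# Reduce K V U = K -> [V] -> U: sum the counts for one word
def wc_reduce(word, counts):
    return sum(counts)

counts = evaluate(wc_map, wc_reduce, ["a rose is a rose", "is a"])
# counts == {'a': 3, 'rose': 2, 'is': 2}
```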
<h3 id="more-examples">More Examples</h3>
<p>Other uses of map reduce include substring search, graph reversal (indexing), bag-of-words computation for language processing, and distributed sort (under specific shuffling functions).</p>
<h2 id="implementation">Implementation</h2>
<p>MapReduce was built with the intent of distributing work among up to thousands of nodes of commodity hardware. Storage is assumed to be inexpensive HDDs. The network is assumed to provide 100 Mbps to 1 Gbps per machine.</p>
<h3 id="execution-overview">Execution Overview</h3>
<p>For \(M\) map tasks and \(R\) reduce tasks, with a partitioning function (which may be either a hash or interval rank, depending on whether output was expected to be sorted or not), MR works in the following manner.</p>
<p><img src="/assets/2016-09-17-mapreduce/mapreduce-exec.png" alt="mapred-exec" class="center-image" /></p>
<ol>
<li>The input is split into \(M\) pieces and made available for distributed reading.</li>
<li>A master node initializes the state for the map tasks and reduce tasks; scheduling them with dependencies.</li>
<li>Map workers apply the map function to their chunk of the input, which was first copied to local disk. The outputted key/value pairs are buffered in memory and periodically flushed to the local disk.</li>
<li><strong>After evaluation of the entire map file</strong>, the worker notifies the master of the location of its local intermediate output. The master forwards this information to the respective reducer (each mapper creates intermediate output for each reducer).</li>
<li>A started reduce worker, when notified of the mapper’s location, starts reading the outputted key/value pairs that it is responsible for using an RPC.</li>
<li>After reading in <strong>all</strong> of the inputted data from <strong>all</strong> its map tasks, it performs a sort, possibly out-of-memory if necessary (<strong>not</strong> a hash-based local shuffle). Then it evaluates each key with the reduce function.</li>
<li>The reduce worker uploads the result to a distributed store (GFS), then performs an atomic rename upon completion, notifying the master.</li>
</ol>
<p>The output is then stored in \(R\) separate files, one for each reducer.</p>
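<p>The partitioning function mentioned above decides which of the \(R\) reduce tasks receives each key. A hash partitioner and a range partitioner for sorted output might be sketched as follows (illustrative code, not from the paper):</p>

```python
def hash_partition(key, R):
    # default choice: spread keys roughly uniformly across R reducers;
    # requires only that keys be hashable
    return hash(key) % R

def range_partition(key, splits):
    # for sorted output: splits holds R-1 boundary keys, so reducer i
    # receives all keys below splits[i]; requires orderable keys
    for i, boundary in enumerate(splits):
        if key < boundary:
            return i
    return len(splits)

# with splits ['h', 'p'], three reducers partition keys alphabetically
assert range_partition('cat', ['h', 'p']) == 0
assert range_partition('mouse', ['h', 'p']) == 1
assert range_partition('zebra', ['h', 'p']) == 2
```

Range partitioning is why sorted MR output lands in \(R\) files that can simply be concatenated.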
<p>From step (4), we see that the total metadata maintained on the master is \(O(MR)\). Scheduling decisions require an additional \(O(M+R)\) amount of work.</p>
<p>Failures are handled simply by re-launching a task when its worker fails to respond to a heartbeat within a certain amount of time. Deduplication is performed on the master (e.g., if a worker is presumed lost and its task restarted, but the worker then sends its results anyway, the duplicate is discarded). Exactly-once delivery of reducer input is maintained through <strong>synchronous</strong> evaluation: the master isn’t notified of map output until that output is completely ready.</p>
<h4 id="master-failure">Master Failure</h4>
<p>The master node is a single point of failure. It could be made reliable, but its failure is so rare that it is often easier to just restart the job.</p>
<h4 id="failure-semantics">Failure semantics</h4>
<p>Deterministic functions will be equivalent to a sequential run of the program.</p>
<p>Non-deterministic functions result in outputted reduce tasks from some combination of some runs of the program, so they are not guaranteed to be equal to any single run of the sequential program.</p>
<h3 id="locality">Locality</h3>
<p>The network-scarcity assumption means that the optimal block size for the computation should be around the size the distributed state store uses, to avoid passing around extra low-capacity blocks. For GFS, this was 64MB. By integrating with GFS, the master is able to schedule map tasks in locations that house the actual data. This allows step (1) from above to avoid any network reads.</p>
<h3 id="task-granularity">Task Granularity</h3>
<h3 id="backup-tasks">Backup Tasks</h3>
<p>Stragglers, caused by bugs or hardware failures, become more common as the number of workers increases. They are mitigated by launching backup copies of tasks near the job’s completion, which reduces the probability that any task straggles (only one copy needs to finish). Removing this optimization in the sorting experiment causes a 44% slowdown.</p>
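<p>A toy simulation (mine, not from the paper) makes the intuition concrete: a job finishes when its slowest task does, and running one backup copy of each task replaces every completion time with the minimum of two draws, pulling in the tail:</p>

```python
import random
random.seed(0)

N = 1000  # tasks per job
# task completion times with a heavy-ish (exponential) tail
solo = [random.expovariate(1.0) for _ in range(N)]
with_backup = [min(random.expovariate(1.0), random.expovariate(1.0))
               for _ in range(N)]

# the job waits on its straggler: the slowest task
solo_finish = max(solo)
backup_finish = max(with_backup)
# backup_finish is far smaller than solo_finish: the min of two
# exponentials has twice the rate, so the worst of N draws shrinks
```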
<h2 id="refinements">Refinements</h2>
<h3 id="combiner-function">Combiner Function</h3>
<p>A combiner function is an associative, commutative <code class="highlighter-rouge">Reduce</code>-type function that can be applied iteratively in a tree structure, shrinking the reduce task’s span (its critical path) and increasing parallelism. MR applies the combiner on the map-task side, which can reduce communication size and overall compute time.</p>
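<p>For word count, the combiner is the same summation the reducer performs, applied to each mapper’s local output before the shuffle. A hypothetical sketch of the map-side application:</p>

```python
from collections import defaultdict

def combine(pairs, combiner):
    # apply the associative, commutative combiner to one mapper's local
    # output, shrinking what must cross the network to the reducers
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return [(k, combiner(k, vs)) for k, vs in groups.items()]

# one mapper's raw (word, 1) output for "to be or not to be"
raw = [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
combined = combine(raw, lambda word, counts: sum(counts))
# six pairs shrink to four: [('to', 2), ('be', 2), ('or', 1), ('not', 1)]
```

Because the combiner is associative and commutative, the reducer produces the same totals whether it sees the raw pairs or the combined ones.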
<h2 id="performance">Performance</h2>
<p>Performance was tested on 1800 machines connected by a two-level tree-shaped switched network with 100 Gbps of aggregate bandwidth, with 2GHz processors and 3 GB of RAM per node.</p>
<p>Distributed grep used \(M=15000, R=1\), oversubscribing the map tasks for appropriate task granularity (the input was 1TB). End-to-end time was 150 seconds.</p>
<p>Terasort takes 891 seconds. For comparison, at the time, the best TeraSort was 1057 seconds. Note that for general sorting, MR recommends a sampling pre-pass for split computations.</p>
<p>MR was observed to be biased towards faster shuffle and input rates. Output is slower because it makes replicated writes.</p>
<h2 id="experience">Experience</h2>
<h2 id="related-work">Related Work</h2>
<h2 id="conclusions">Conclusions</h2>
<p>MR introduced the notion of a <strong>restricted, simple API</strong> that allows for expressiveness required for many tasks while maintaining simple semantics and enabling distributed processing. <strong>Network bandwidth</strong> is observed to be the bottleneck. Finally, replication-based variance reduction for latency is introduced as a technique (though it may be used for other goals as well).</p>
<h1 id="notes">Notes</h1>
<h2 id="observations">Observations</h2>
<ul>
<li>MR set the standard assumption that <strong>network is constraining</strong>; this notion was key in design of such distributed processing systems until newer technologies like Spark emerged, which moved bottlenecks elsewhere (see <a href="http://dl.acm.org/citation.cfm?id=2789791">this performance analysis for more details</a>).</li>
<li>Output is made reliable by storage to a replicated distributed state store (such as GFS). This interactivity between the execution engine (MR) and the store (GFS) is repeated in open-source versions of the product, such as Hadoop, with its MapReduce and HDFS.</li>
<li>MR chooses a master-in-the-loop synchronous evaluation style, where map task completion alerts the master, which then starts the reduce operation. This helps correctness, and the style was reused in subsequent execution engines (like Spark). Unfortunately, even though the \(O(M R)\) state in the master is manageable for one job, especially with an efficient implementation, as the number of concurrent MR jobs increases (as is common nowadays in shared cluster environments), scheduling becomes a large, unparallelizable portion of the overhead.</li>
</ul>
<h2 id="weaknesses">Weaknesses</h2>
<ul>
<li>For correctness, MR requires that functions with side effects respect parallel re-entrancy and thread-safety across machines (as well as locally, if multiple tasks can be scheduled on one thread). Typical operations that would violate this are non-idempotent, non-associative, or non-commutative transactions against a database.</li>
<li>As mentioned above, master-in-the-loop evaluation causes scheduling delays. Workers maintaining some metadata themselves could allow for faster transitions between mapping and reducing. With additional bookkeeping (for handling failures), even <strong>asynchronous</strong> information-passing can be introduced.</li>
</ul>
<h2 id="strengths">Strengths</h2>
<ul>
<li>Speeding up tail performance through replication is an innovation pioneered by MR (see <a href="#backup-tasks">Backup Tasks</a>).</li>
<li>MR also was very innovative in usability, considering that it was an early execution engine. Besides the interface itself, MR provided mechanisms for automatically detecting deterministic user-code bugs (via bad record tracking and ignoring), a local execution mode, out-of-band counters, and HTTP-based status information.</li>
<li>For its time, MR was massively enabling for large-scale computation.</li>
</ul>
<h1 id="open-questions">Open Questions</h1>
<ul>
<li>What would it take to have a master-out-of-the-loop asynchronous execution engine?</li>
<li>What needs to happen to alleviate the parallel (1) re-entrancy and (2) thread-safety requirements that MR places on its user code in case of failure?</li>
</ul>
Sat, 17 Sep 2016 00:00:00 +0000
https://vlad17.github.io/2016/09/17/mapreduce.html
https://vlad17.github.io/2016/09/17/mapreduce.htmlparalleldistributed-systems