Jekyll2021-08-13T15:24:06+00:00https://www.highonscience.com/feed.xmlHigh on ScienceMachine learning and data analysis for practicing data scientistsWill HighMachine Learning Likelihood, Loss, Gradient, and Hessian Cheat Sheet2021-06-18T19:00:00+00:002021-06-18T19:01:00+00:00https://www.highonscience.com/blog/2021/06/18/ml-loss-function-cheat-sheet<p>Cheat sheet for likelihoods, loss functions, gradients, and Hessians. This is a living document that I’ll update over time.</p> <h1 id="motivating-theory">Motivating theory</h1> <h2 id="bayes-theorem">Bayes theorem</h2> <p>Bayes’ theorem tells us that the posterior probability of a hypothesis $H$ given data $D$ is</p> <p>\begin{equation} P(H|D) = \frac{P(H) P(D|H)}{P(D)}, \end{equation}</p> <p>where</p> <ul> <li>$P(H \vert D)$ is the <strong>posterior</strong> probability of the (variable) hypothesis given the (fixed) observed data</li> <li>$P(H)$ is the <strong>prior</strong> probability of the hypothesis</li> <li>$P(D \vert H)$ is the <strong>likelihood</strong> $\mathcal{L}$, the probability that the observed data was generated by $H$</li> <li>$P(D)$ is the marginal likelihood, usually discarded because it’s not a function of $H$.</li> </ul> <p>In supervised machine learning, models are hypotheses and data are $y_i | \mathbf{x}_i$ label-feature vector tuples.</p> <p>We’re looking for the best model, which maximizes the posterior probability. If the prior is flat ($P(H) = 1$) this reduces to likelihood maximization.</p> <p>If the prior on model parameters is normal you get Ridge regression. If the prior on model parameters is Laplace distributed you get LASSO.</p> <h2 id="gradient-descent">Gradient descent</h2> <p>Objectives are derived as the negative of the log-likelihood function. 
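For intuition, here is a small numerical sketch (toy data assumed for illustration) showing that the Gaussian negative log-likelihood and the mean squared error are minimized by the same coefficient, since they differ only by constants and a positive scale factor:

```python
import numpy as np

# Toy one-feature regression data with a true coefficient of 2.0.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)

def neg_log_likelihood(beta, sigma=0.5):
    # Negative log of the product of normal densities of the residuals.
    r = y - beta * x
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2) + r**2 / (2 * sigma**2))

def mean_squared_error(beta):
    return np.mean((y - beta * x) ** 2)

# Both objectives pick the same coefficient on a grid search.
betas = np.linspace(1.5, 2.5, 101)
best_nll = betas[np.argmin([neg_log_likelihood(b) for b in betas])]
best_mse = betas[np.argmin([mean_squared_error(b) for b in betas])]
assert best_nll == best_mse
```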
Objectives with regularization can be thought of as the negative of the log-posterior probability function, but I’ll be ignoring regularizing priors here.</p> <p>The objective function can equivalently be expressed as the mean of a loss function $\ell$ over data points.</p> $L = -\log{\mathcal{L}} = \frac{1}{N}\sum_i^{N} \ell_i.$ <h3 id="in-linear-regression-gradient-descent-happens-in-parameter-space">In linear regression, gradient descent happens in parameter space</h3> <p>For linear models like least-squares and logistic regression,</p> $\ell_i = \ell(f(\beta; \mathbf{x}_i))$ <p>where</p> $f(\beta; \mathbf{x}_i) = \mathbf{x}_i^T \mathbf{\beta},$ <p>$\beta$ are the coefficients and $\mathbf{x}_i$ is the $i$-th feature vector. This formulation supports a y-intercept or offset term by defining $x_{i,0} = 1$. The rest of the entries $x_{i,j}: j&gt;0$ are the model features.</p> <p>Gradient descent minimization methods make use of the first partial derivative.</p> $\begin{equation} \ell^{\prime} = \frac{\partial \ell}{\partial \mathbf{\beta}} = \mathbf{x}_i \frac{\partial \ell}{\partial f} \end{equation}$ <p>Some gradient descent variants, like Newton-Raphson, use the second partial derivative or <em>Hessian</em>.</p> $\begin{equation} \ell^{\prime\prime} = \frac{\partial^2 \ell}{\partial \mathbf{\beta}^2} = \mathbf{x}_i^2 \frac{\partial^2 \ell}{\partial f^2} \end{equation}$ <h3 id="in-gradient-boosting-gradient-descent-happens-in-function-space">In gradient boosting, gradient descent happens in function space</h3> <p>In gradient boosting,</p> $\begin{equation} \ell_i = \ell(f(\mathbf{x}_i)) \end{equation}$ <p>where optimization is done over the set of different functions $\{f\}$ in functional space rather than over parameters of a single linear function. In this case the gradient is taken w.r.t. 
the function $f$.</p> $\begin{equation} \ell^{\prime} = \frac{\partial \ell}{\partial f} \end{equation}$ <p>and the Hessian is</p> $\begin{equation} \ell^{\prime\prime} = \frac{\partial^2 \ell}{\partial f^2}. \end{equation}$ <p>All derivatives below will be computed with respect to $f$. If you are using them in a gradient boosting context, this is all you need. If you are using them in a linear model context, you need to multiply the gradient and Hessian by $\mathbf{x}_i$ and $\mathbf{x}_i^2$, respectively.</p> <h1 id="likelihood-loss-gradient-hessian">Likelihood, loss, gradient, Hessian</h1> <p>The loss is the negative log-likelihood for a single data point.</p> <h2 id="square-loss">Square loss</h2> <p>Used in continuous variable regression problems.</p> <p><strong>Likelihood</strong></p> <p>Start by asserting normally distributed errors.</p> $\begin{equation} \prod_{i=1}^N\frac{1}{\sigma\sqrt{2\pi}}\exp{-\frac{(y_i - f(\mathbf{x}_i))^2}{2\sigma^2}} \end{equation}$ <p><strong>Loss</strong></p> $\begin{equation} \ell = (y_i - f(\mathbf{x}_i))^2 \end{equation}$ <p><strong>Gradient</strong></p> $\begin{equation} \frac{\partial \ell}{\partial f} = 2(f(\mathbf{x}_i) - y_i) \end{equation}$ <p><strong>Hessian</strong></p> $\begin{equation} \frac{\partial^2 \ell}{\partial f^2} = 2 \end{equation}$ <h2 id="log-loss">Log loss</h2> <p>Used in binary classification problems.</p> <p><strong>Likelihood</strong></p> <p>Start by asserting binary outcomes are Bernoulli distributed.</p> <p>\begin{equation} \prod_{i=1}^N p(\mathbf{x}_i)^{y_i} (1 - p(\mathbf{x}_i))^{1 - {y_i}} \end{equation}</p> <p>The model in this case is a function with support $f \in (-\infty, \infty)$ that maps to the Bernoulli probability parameter $p$ via the log-odds or “logit” link function.</p> <p>\begin{equation} f(\mathbf{x}_i) = \log{\frac{p(\mathbf{x}_i)}{1 - p(\mathbf{x}_i)}} \end{equation}</p> <p>This formulation maps the boundless hypotheses onto probabilities $p \in (0, 1)$ by just solving for 
$p$:</p> <p>\begin{equation} p(\mathbf{x}_i) = \frac{1}{1 + \exp{(-f(\mathbf{x}_i))}} \end{equation}</p> <p><strong>Loss</strong></p> <p>For labels following the binary indicator convention $y \in \{0, 1\}$, all of the following are equivalent. The easiest way to prove they are equivalent is to plug in $y = 0$ and $y = 1$ and rearrange.</p> $\begin{equation} \begin{split} \ell &amp; = -y_i\log{p(\mathbf{x}_i)} - (1 - y_i)\log{(1 - p(\mathbf{x}_i))} \\ &amp; = y_i \log{(1 + \exp{(-f(\mathbf{x}_i))})} \\ &amp; \qquad + (1 - y_i) \log{(1 + \exp{(f(\mathbf{x}_i))})} \\ &amp; = - y_i f(\mathbf{x}_i) + \log{(1 + \exp{(f(\mathbf{x}_i))})} \end{split} \end{equation}$ <p>The first form is useful if you want to use different link functions.</p> <p>For labels following the transformed convention $z = 2y-1 \in \{-1, 1\}$:</p> $\begin{equation} \ell = \log{(1 + \exp{(-z f(\mathbf{x}_i))})} \end{equation}$ <p><strong>Gradient</strong></p> $\begin{equation} \frac{\partial \ell}{\partial f} = p(\mathbf{x}_i) - y_i \end{equation}$ <p><strong>Hessian</strong></p> $\begin{equation} \frac{\partial^2 \ell}{\partial f^2} = p(\mathbf{x}_i)(1 - p(\mathbf{x}_i)) \end{equation}$ <h2 id="quantile-regression">Quantile regression</h2> <p><strong>Likelihood</strong></p> <p>I have not yet seen somebody write down a motivating likelihood function for quantile regression loss.</p> <p><strong>Loss</strong></p> <p>Sometimes called the pinball loss.</p> $\begin{equation} \begin{split} \ell &amp; = (y_i - f(\mathbf{x}_i)) ( \tau - \mathbb{1}_{y_i &lt; f(\mathbf{x}_i)} ) \\ &amp; = \sum_{y_i \geq f(\mathbf{x}_i)}\tau (y_i - f(\mathbf{x}_i)) \\ &amp; \qquad - \sum_{y_i &lt; f(\mathbf{x}_i)}(1 - \tau) (y_i - f(\mathbf{x}_i)) \end{split} \end{equation}$ <p><strong>Gradient</strong></p> $\begin{equation} \begin{split} \frac{\partial \ell}{\partial f} &amp; = - ( \tau - \mathbb{1}_{y_i &lt; f(\mathbf{x}_i)} ) \\ &amp; = - \sum_{y_i \geq f(\mathbf{x}_i)}\tau + \sum_{y_i &lt; f(\mathbf{x}_i)}(1 - \tau) 
\end{split} \end{equation}$ <p><strong>Hessian</strong></p> $\begin{equation} \frac{\partial^2 \ell}{\partial f^2} = 0 \end{equation}$ <h2 id="mean-absolute-deviation">Mean absolute deviation</h2> <p>Mean absolute deviation is quantile regression at $\tau=0.5$.</p> <h2 id="cox-proportional-hazards">Cox proportional hazards</h2> <p><strong>Likelihood</strong></p> <p>Start from the Cox proportional hazards partial likelihood function. The partial likelihood is, as you might guess, just part of a larger likelihood, but it is sufficient for maximum likelihood estimation and therefore regression.</p> $\begin{equation} L(f) = \prod_{i:C_i = 1} \frac{\exp{f_i}}{\sum_{j:t_j \geq t_i} \exp{f_j}} \end{equation}$ <p>Using the analogy of subscribers to a business who may or may not renew from period to period, the following is the terminology unique to survival analysis.</p> <ul> <li>$i$ and $j$ index users.</li> <li>$C_i = 1$ is a cancelation or churn event for user $i$ at time $t_i$</li> <li>$C_i = 0$ is a renewal or survival event for user $i$ at time $t_i$</li> <li>Subscribers $i:C_i = 1$ are users who canceled at time $t_i$.</li> <li>$j:t_j \geq t_i$ are users who have survived up to and including time $t_i$, which is the instant before subscriber $i$ canceled their subscription and churned out of the business. This is called the <em>risk set</em>, because they are the users at risk of canceling at the time user $i$ canceled. The risk set includes user $i$.</li> </ul> <p>In clinical studies, users are subjects and churn is non-survival, i.e. death.</p> <p><strong>Loss</strong></p> $\begin{equation} \ell_i = \delta_i \left[ - f_i + \log{\sum_{j:t_j \geq t_i} \exp{f_j}} \right] \end{equation}$ <p>where $\delta_i$ is the churn/death indicator.</p> <p><strong>Gradient</strong></p> <p>The efficient algorithm to compute the gradient and Hessian involves ordering the $n$ survival data points, which are indexed by $i$, by time $t_i$. 
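As a sketch of that sorted computation (toy times and scores are assumed for illustration; tied event times would need grouped handling), the risk-set sums $\sum_{j:t_j \geq t_i} \exp{f_j}$ fall out of a single cumulative sum after sorting by descending time:

```python
import numpy as np

# Toy survival data: distinct times t, churn indicators delta, scores f.
t = np.array([5.0, 3.0, 9.0, 1.0])
delta = np.array([1, 0, 1, 1])
f = np.array([0.2, -0.1, 0.4, 0.0])

# Naive O(n^2) risk-set sums: for each i, sum exp(f_j) over t_j >= t_i.
naive = np.array([np.exp(f[t >= ti]).sum() for ti in t])

# O(n log n) version: sort by descending time, cumulative-sum exp(f).
order = np.argsort(-t)
fast = np.empty_like(naive)
fast[order] = np.cumsum(np.exp(f[order]))
assert np.allclose(naive, fast)

# Per-instance loss terms delta_i * (-f_i + log(risk-set sum)).
loss_terms = delta * (-f + np.log(fast))
```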
This turns $n^2$ time complexity into $n\log{n}$ for the sort followed by $n$ for the progressive total-loss compute (<a href="https://arxiv.org/abs/2003.00116">ref</a>).</p> <p>For linear regression, the gradient for instance $i$ is</p> $\begin{equation} \frac{\partial \ell_i}{\partial \beta} = \delta_i \left[ - \mathbf{x}_i + \frac{\sum_{j:t_j \geq t_i} \mathbf{x}_j \exp{f(\beta; \mathbf{x}_j)}}{\sum_{j:t_j \geq t_i} \exp{f(\beta; \mathbf{x}_j)}} \right] \end{equation}$ <p>For gradient boosting, the gradient for instance $i$ is</p> $\begin{equation} \frac{\partial \ell_i}{\partial f} = \delta_i \left[ - 1 + \sum_{j=1}^n \delta_j \mathbb{1}_{t_i \geq t_j} \frac{\exp{(f(\mathbf{x}_i))}}{\sum_{r=1}^n \mathbb{1}_{t_r \geq t_j} \exp{(f(\mathbf{x}_r))}} \right] \end{equation}$ <p><strong>Hessian</strong></p> <p>To be written.</p> <!-- **References** * [Boosting proportional hazards models using smoothing splines, with applications to high-dimensional microarray data](https://academic.oup.com/bioinformatics/article/21/10/2403/206251) * [BigSurvSGD: Big Survival Data Analysis via Stochastic Gradient Descent](https://arxiv.org/abs/2003.00116) * [sksurv](https://scikit-survival.readthedocs.io/en/latest/index.html) * [On the Breslow estimator](https://dlin.web.unc.edu/wp-content/uploads/sites/1568/2013/04/Lin07.pdf) * [Component-wise gradient boosting and false discovery control in survival analysis with high-dimensional covariates](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4757968/) * [XGBoost CoxPH objective source code](https://github.com/dmlc/xgboost/blob/master/src/objective/regression_obj.cu#L279) * --> <h1 id="backlog">Backlog</h1> <ul> <li>Cross entropy for multiclass problems</li> <li>Accelerated failure time <ul> <li><a href="https://arxiv.org/abs/2006.04920">https://arxiv.org/abs/2006.04920</a></li> <li><a href="https://scikit-survival.readthedocs.io/en/stable/index.html">sksurv</a></li> </ul> </li> <li>Hinge</li> <li>Huber</li> <li>Poisson</li> 
<li>Kullback-Leibler</li> </ul> <h1 id="further-reading">Further reading</h1> <ul> <li><a href="https://cran.r-project.org/web/packages/gbm/vignettes/gbm.pdf">R GBM vignette, Section 4 “Available Distributions”</a></li> <li><a href="https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html">ML Cheat Sheet, Section “Loss Functions”</a></li> <li><a href="https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-supervised-learning">Supervised Learning cheatsheet</a></li> <li><a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2012/01/tricks-2012.pdf">Stochastic Gradient Descent Tricks</a></li> </ul>Will HighCheat sheet for likelihoods, loss functions, gradients, and Hessians.Deploy Custom Shiny Apps to AWS Elastic Beanstalk2021-06-02T20:00:00+00:002021-06-13T19:00:00+00:00https://www.highonscience.com/blog/2021/06/02/shiny-apps-elastic-beanstalk<h1 id="update-for-rocker-versioned2-r-4">Update for rocker-versioned2 (R 4)</h1> <p>Same basic setup except with two small changes:</p> <ul> <li>R dependency installation can be done using the <code class="language-plaintext highlighter-rouge">install2.r</code> convenience script</li> <li>The server startup entrypoint command is now <code class="language-plaintext highlighter-rouge">CMD ["/init"]</code></li> </ul> <p>So Dockerfile.base is</p> <div class="language-docker highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span><span class="s"> rocker/shiny-verse:4.0.5</span> <span class="k">RUN </span>install2.r <span class="nt">--error</span> <span class="nt">--skipinstalled</span> ROCR gbm </code></pre></div></div> <p>and Dockerfile is</p> <div class="language-docker highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span><span class="s"> &lt;aws_account_id&gt;.dkr.ecr.&lt;region&gt;.amazonaws.com/rshiny-base:latest</span> <span class="k">COPY</span><span class="s"> apps /srv/shiny-server</span> <span 
class="k">CMD</span><span class="s"> ["/init"]</span> </code></pre></div></div> <h1 id="overview">Overview</h1> <p>This is a fast way to stand up a Shiny server in the cloud that serves your own set of custom Shiny apps with very few lines of code, including the example app, thanks to <a href="https://www.rocker-project.org/">rocker</a>’s Shiny images and AWS. The time-consuming parts are Docker image data transfer, server start overheads, and of course any software installation and account signups that you need.</p> <p>Note that I could not get this to work when pulling rocker’s image from Dockerhub directly within Elastic Beanstalk. EB timed out. My solution, which appears to be pretty stable, is to rebuild the rocker image locally, push it to AWS’s own Docker image repository called ECR, and ask EB to pull that instead. The idea is that in-region data transfer across AWS services should generally be faster.</p> <p>You can also automate the rocker image build with AWS CI/CD services, which I have done successfully in the past using CodePipeline. 
But this post is just a “Hello, World!” and I’ll leave that part to you.</p> <h1 id="glossary">Glossary</h1> <ul> <li>AWS: <a href="https://aws.amazon.com/">Amazon Web Services</a></li> <li>EB: <a href="https://aws.amazon.com/elasticbeanstalk/">Elastic Beanstalk</a>, an AWS service that serves web applications like web sites and REST APIs</li> <li>ECR: <a href="https://aws.amazon.com/ecr/">Elastic Container Registry</a>, an AWS service that hosts Docker images</li> <li>CLI: Command line interface</li> </ul> <h1 id="requirements">Requirements</h1> <ul> <li><a href="https://www.docker.com/">Docker Desktop</a></li> <li>An <a href="https://aws.amazon.com/">AWS</a> account</li> <li>The <a href="https://aws.amazon.com/cli/">AWS CLI</a> (<code class="language-plaintext highlighter-rouge">brew install awscli</code>)</li> <li>The <a href="https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/eb-cli3.html">AWS Elastic Beanstalk CLI</a> (<code class="language-plaintext highlighter-rouge">brew install awsebcli</code>)</li> </ul> <h1 id="quickstart">Quickstart</h1> <ol> <li><code class="language-plaintext highlighter-rouge">mkdir new-shiny-app-repo &amp;&amp; cd new-shiny-app-repo</code></li> <li><code class="language-plaintext highlighter-rouge">mkdir apps</code> and then put a <a href="https://shiny.rstudio.com/gallery/">“Hello, World!” Shiny app</a> in there</li> <li>Create Dockerfile.base that just pulls <code class="language-plaintext highlighter-rouge">FROM rocker/shiny</code> on <a href="https://hub.docker.com/r/rocker/shiny">Docker Hub</a> (or rocker/shiny-verse to also make the tidyverse available) and installs any additional R packages your apps need</li> <li>Create an <a href="https://aws.amazon.com/ecr/">ECR repo</a> called <code class="language-plaintext highlighter-rouge">rshiny-base</code> on ECR</li> <li>Build the <code class="language-plaintext highlighter-rouge">rshiny-base</code> image locally from Dockerfile.base and push it to ECR</li> <li>Create the <code 
class="language-plaintext highlighter-rouge">Dockerfile</code> specified below</li> <li>On a Mac: install the EB CLI with <code class="language-plaintext highlighter-rouge">brew install awsebcli</code></li> <li>Git-commit your changes</li> <li><code class="language-plaintext highlighter-rouge">eb init shiny</code></li> <li><code class="language-plaintext highlighter-rouge">eb create shiny</code></li> </ol> <p>You should end up with a directory structure like this.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>new-shiny-app-repo ├── apps/ | ├── index.html # optional | └── hello-world/ | ├── server.R | └── ui.R ├── Dockerfile ├── Dockerfile.base └── .gitignore </code></pre></div></div> <h1 id="more-details">More Details</h1> <h2 id="docker-and-aws-preliminaries">Docker and AWS Preliminaries</h2> <p>Install the free version of <a href="https://www.docker.com/">Docker Desktop</a>.</p> <p>Create an <a href="https://aws.amazon.com/">AWS account</a>. Take note of your default region. My region is <code class="language-plaintext highlighter-rouge">us-west-1</code>. Let’s call it <code class="language-plaintext highlighter-rouge">$region</code>.</p> <p>Install the AWS CLI. I like to use homebrew a la <code class="language-plaintext highlighter-rouge">brew install awscli</code>.</p> <p>Install the AWS EB CLI. I like to use homebrew a la <code class="language-plaintext highlighter-rouge">brew install awsebcli</code>.</p> <p>Create an <a href="https://aws.amazon.com/ecr/">ECR repo</a> called <code class="language-plaintext highlighter-rouge">rshiny-base</code>. Take note here of your AWS account ID, which I will call <code class="language-plaintext highlighter-rouge">$aws_account_id</code> below. It’s a bunch of numbers.</p> <h2 id="create-a-base-dockerfile">Create a Base Dockerfile</h2> <p>Now make a file called <code class="language-plaintext highlighter-rouge">Dockerfile.base</code>. 
You’ll be pulling a base Shiny image and then this is where you’re going to want to install additional R packages. You’re installing them here because it could take a while, and Elastic Beanstalk would time out if you did it downstream of this. Here I’m installing ROCR and gbm.</p> <div class="language-docker highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span><span class="s"> rocker/shiny</span> <span class="c"># Install more R packages like this:</span> <span class="k">RUN </span><span class="nb">.</span> /etc/environment <span class="o">&amp;&amp;</span> R <span class="nt">-e</span> <span class="s2">"install.packages(c('ROCR', 'gbm'), repos='</span><span class="nv">$MRAN</span><span class="s2">')"</span> </code></pre></div></div> <p>For added stability you can pin to a specific <code class="language-plaintext highlighter-rouge">rocker/shiny</code> version, e.g. 3.4.4, with <code class="language-plaintext highlighter-rouge">FROM rocker/shiny:3.4.4</code>. I’m sure there’s a way to pin R packages as well but it’s not at my fingertips and so I’ll add that in later.</p> <p>Now build the base image locally and call it <code class="language-plaintext highlighter-rouge">rshiny-base</code>.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker build <span class="nt">-t</span> rshiny-base <span class="nt">-f</span> Dockerfile.base <span class="nb">.</span> </code></pre></div></div> <p>Then push the image to ECR. Follow the instructions on the ECR web site on how to authenticate and push, but here’s what it looked like at the time of writing. 
You’re logging into AWS, building the image locally (can skip this if you already did it), tagging the locally built image as latest, then pushing the local image to ECR.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># region="us-west-1"</span> <span class="c"># aws_account_id=123456789</span> aws ecr get-login-password <span class="nt">--region</span> <span class="nv">$region</span> | docker login <span class="nt">--username</span> AWS <span class="nt">--password-stdin</span> <span class="k">${</span><span class="nv">aws_account_id</span><span class="k">}</span>.dkr.ecr.<span class="k">${</span><span class="nv">region</span><span class="k">}</span>.amazonaws.com docker build <span class="nt">-t</span> rshiny-base <span class="nt">-f</span> Dockerfile.base <span class="nb">.</span> docker tag rshiny-base:latest <span class="k">${</span><span class="nv">aws_account_id</span><span class="k">}</span>.dkr.ecr.<span class="k">${</span><span class="nv">region</span><span class="k">}</span>.amazonaws.com/rshiny-base:latest docker push <span class="k">${</span><span class="nv">aws_account_id</span><span class="k">}</span>.dkr.ecr.<span class="k">${</span><span class="nv">region</span><span class="k">}</span>.amazonaws.com/rshiny-base:latest </code></pre></div></div> <p>The upload took a while for me. If I had to do this repeatedly I would set up an automated job to build and push using AWS Batch or, much more likely, as part of a CI/CD pipeline using AWS CodePipeline.</p> <h2 id="create-any-new-shiny-apps">Create Any New Shiny Apps</h2> <p>Make a directory called <code class="language-plaintext highlighter-rouge">apps/</code> and put a simple working app spec there, copied from, say, the <a href="https://shiny.rstudio.com/gallery/">Shiny gallery</a>. 
Here’s the subdirectory structure for a bunch of custom apps.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apps/ ├── hello-world/ | ├── server.R | └── ui.R ├── app1/ | ├── server.R | └── ui.R └── app2/ ├── server.R └── ui.R </code></pre></div></div> <p>You will access them after EB deployment at <code class="language-plaintext highlighter-rouge">http://&lt;url&gt;/hello-world/</code>, <code class="language-plaintext highlighter-rouge">http://&lt;url&gt;/app1/</code>, and so forth.</p> <h2 id="create-a-new-dockerfile-for-your-app-server">Create a New Dockerfile for Your App Server</h2> <p>Now create a file called <code class="language-plaintext highlighter-rouge">Dockerfile</code> with the following contents.</p> <div class="language-docker highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span><span class="s"> &lt;aws_account_id&gt;.dkr.ecr.&lt;region&gt;.amazonaws.com/rshiny-base</span> <span class="k">USER</span><span class="s"> shiny</span> <span class="k">COPY</span><span class="s"> apps /srv/shiny-server</span> <span class="k">EXPOSE</span><span class="s"> 3838</span> <span class="k">CMD</span><span class="s"> ["/usr/bin/shiny-server.sh"]</span> </code></pre></div></div> <p>The key thing we are doing here is copying your custom apps into the Docker image itself. You can try to build it and run it like this.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker build <span class="nt">-t</span> rshiny-apps <span class="nb">.</span> docker run <span class="nt">--rm</span> <span class="nt">-p</span> 3838:3838 rshiny-apps </code></pre></div></div> <p>Now open http://127.0.0.1:3838/. You should see a message letting you know the server is running properly. 
You can create and edit a custom home page at the apps root, <code class="language-plaintext highlighter-rouge">apps/index.html</code>.</p> <h2 id="commit-to-git">Commit to Git</h2> <p>The EB CLI zips your latest git commit on your configured default branch. <strong>It does not zip your latest changes if you have not git committed them.</strong> Can’t tell you how many times I’ve forgotten to commit.</p> <h2 id="push-it-to-elastic-beanstalk-for-the-first-time">Push It to Elastic Beanstalk for the First Time</h2> <p>Here’s what I did to create an application called <code class="language-plaintext highlighter-rouge">shiny</code>, from the root directory of the git repository.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>eb init eb create shiny </code></pre></div></div> <p>And you’re done! Go to the Elastic Beanstalk console to find your (obscure) URL. You should see your index and you can visit the <code class="language-plaintext highlighter-rouge">http://&lt;url&gt;/hello-world/</code> path from there.</p> <h2 id="make-changes-and-push-again">Make Changes and Push Again</h2> <p>Make your changes, git-commit, then</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>eb deploy </code></pre></div></div> <p>So easy.</p> <h1 id="next-steps">Next Steps</h1> <ul> <li>Build Dockerfile.base automatically in a CI/CD pipeline.</li> <li>Find a way to install packages in Dockerfile directly during app development to shorten the loop.</li> </ul> <p>Enjoy!</p>Will HighHow I tricked AWS into serving R Shiny with my local custom applications using rocker and Elastic Beanstalk.Debugging Metaflow Jobs2021-06-02T19:00:00+00:002021-06-02T19:00:00+00:00https://www.highonscience.com/blog/2021/06/02/metaflow-ml-debugging<h1 id="overview">Overview</h1> <p>This post addresses Metaflow job debugging and feature development. 
My aim is to make the entire cycle as short, painless, and accurate as possible.</p> <p>Let’s start with the setup.</p> <h1 id="setup">Setup</h1> <p>The basic setup involves opening up a Python IDE and a Jupyter notebook. The IDE is for editing the Python utility package. Changes made to your Python utility package can be immediately used in the Jupyter notebook at the cell level without rerunning import statements thanks to <a href="https://ipython.org/ipython-doc/3/config/extensions/autoreload.html">autoreload magic</a>. The Jupyter notebook can also access Metaflow artifacts from previous runs. Putting it together, you can edit code in your IDE and immediately test it in your notebook at the cell level.</p> <p>Here’s a picture.</p> <p><img src="https://docs.google.com/drawings/d/e/2PACX-1vQ0MFP42TOMeomoqIkNkD_v1B8VBAMoz9aRHetNwuhZvcKL5gw92s8x4Fx6BKjmQdA4cYtWpCfetWER/pub?w=960&amp;h=720" /></p> <p>Here are the components in a bit more detail.</p> <ul> <li>Create or reuse a git repository.</li> <li>Make a directory structure with (see <a href="/blog/2021/05/25/metaflow-best-practices-for-ml/#develop-a-separate-python-package">Metaflow Best Practices for Machine Learning</a> for more specifics on directory structures) <ul> <li>a subdirectory for Metaflow flows and local common code</li> <li>and a pip-installable Python package.</li> </ul> </li> <li>Use feature branches and pull requests to make changes.</li> <li>Write unit tests.</li> <li>Set up continuous integration and have it run the unit tests.</li> <li>Fire up an IDE to edit code in a feature branch (I like PyCharm).</li> <li>Fire up Jupyter Lab to load Metaflow data and object artifacts, and use <a href="https://ipython.org/ipython-doc/3/config/extensions/autoreload.html">autoreload magic</a> to test source code edits that I’m actively making in my IDE against those artifacts.</li> </ul> <p>I use this setup when developing examples in <a 
href="https://github.com/fwhigh/metaflow-helper">https://github.com/fwhigh/metaflow-helper</a>. I’ll be referring to those components quite a bit. Examples from this article are reproducible from the metaflow-helper repo commit tagged <strong>v0.0.1</strong>. The local Python package is used by doing, for example, <code class="language-plaintext highlighter-rouge">from metaflow_helper.utils import install_dependencies</code> at the top of flows. The flows live in multiple subdirectories of <code class="language-plaintext highlighter-rouge">examples/</code>, like <code class="language-plaintext highlighter-rouge">examples/model-selection/</code>.</p> <h1 id="pre-adoption">Pre Adoption</h1> <p>In the early stages of a project, prior to first adoption by my potential users, I do most of my prototyping in Jupyter and then slowly begin to copy-paste working functions and classes into Metaflow steps (train.py, predict.py in examples/model-selection/), my local common Python script (common.py in examples/model-selection/), and into my local Python package (metaflow_helper at the top level).</p> <p>The basic structure of the notebook looks like this. Look out for the autoreload magic command, Metaflow artifact accessors, the metaflow-helper Python package import, and the common.py utility import that is local to the example script. 
<script src="https://gist.github.com/c6f9c88cf94cedf2e96d6900ac0f1226.js?file=debug.ipynb"> </script></p> <h1 id="accessing-metaflow-artifacts">Accessing Metaflow Artifacts</h1> <p>Metaflow already provides simple artifact access patterns like</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">metaflow</span> <span class="kn">import</span> <span class="n">Metaflow</span> <span class="k">print</span><span class="p">(</span><span class="n">Metaflow</span><span class="p">().</span><span class="n">flows</span><span class="p">)</span> </code></pre></div></div> <p>and</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">metaflow</span> <span class="kn">import</span> <span class="n">Step</span> <span class="n">data</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">Step</span><span class="p">(</span><span class="sa">f</span><span class="s">'Train/1234/some_step'</span><span class="p">))[</span><span class="mi">0</span><span class="p">].</span><span class="n">data</span><span class="err"></span> </code></pre></div></div> <p>There’s nothing else I’ll need on the core Metaflow side.</p> <h1 id="editing-and-testing-my-own-code">Editing and Testing My Own Code</h1> <p>But to make and test edits on my own code I’ve got autoreload 2 enabled. 
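Concretely, the top of the notebook looks something like this sketch (these are IPython magics, so they run in Jupyter rather than plain Python; the metaflow_helper import is the one from the repo described above):

```
%load_ext autoreload
%autoreload 2

import common
from metaflow_helper.feature_engineer import FeatureEngineer
```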
In my notebook I’ll <code class="language-plaintext highlighter-rouge">import common</code> at the top and in a later cell use a function from common.py like</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">result</span> <span class="o">=</span> <span class="n">common</span><span class="p">.</span><span class="n">some_function</span><span class="p">()</span> </code></pre></div></div> <p>I can now make changes to the source code of <code class="language-plaintext highlighter-rouge">some_function</code> directly in common.py and see those changes reflected immediately. I don’t have to re-import common, I can just re-execute the cell that calls the function.</p> <p>The same is true for my local package called metaflow-helper, which I installed using <code class="language-plaintext highlighter-rouge">pip install -e .</code> at the top level of the repository. That <code class="language-plaintext highlighter-rouge">-e</code> means “editable mode”. I can <code class="language-plaintext highlighter-rouge">from metaflow_helper.feature_engineer import FeatureEngineer</code> and in later cells instantiate FeatureEngineer. When I make changes to member functions of FeatureEngineer, they will also immediately be reflected at the notebook cell level without having to reimport metaflow-helper.</p> <h1 id="putting-it-all-together">Putting It All Together</h1> <p>The really killer thing is now I can access Metaflow artifacts from successful or even failed Metaflow runs and feed them into common or metaflow-helper functions in the notebook while making on-the-fly changes to the code. At the cell level I can run and rerun to debug code edits.</p> <p>Once I’m happy with the result I can git-commit and -push and issue a pull request. 
Fix any failed unit test, get code reviewer approval, merge to the target branch, and I’m ready to go with the changes.</p> <h1 id="post-adoption">Post Adoption</h1> <p>Once I’ve done this Jupyter-to-source cycle enough times my source code becomes larger and more battle-hardened. If at some stage I’ve also achieved buy-in and adoption from my users, I’ve got a production-worthy flow and code.</p> <p>Runs still fail at these mature stages, and I’ll still need to debug. I can still use the Jupyter notebook debugging pattern from the earlier stages, but I’ll be skewing much more heavily to iterative changes to my production code rather than prototyping from scratch in Jupyter and pushing to source code scripts and packages.</p>Will HighThe combination of an IDE, a Jupyter notebook, and some best practices can radically shorten the Metaflow development and debugging cycle.Metaflow Best Practices for Machine Learning2021-05-25T19:00:00+00:002021-05-25T19:00:00+00:00https://www.highonscience.com/blog/2021/05/25/metaflow-best-practices-for-ml<p>I wanted to share with you some recommended best practices that I battle-tested over about four years working with Metaflow at Netflix. The goals are a short development loop; reusable, maintainable, and reliable code; and just an overall fun and rewarding developer experience. This list is probably not complete but I can add more later.</p> <h1 id="minimal-directory-structure">Minimal directory structure</h1> <p>I’ll start with a suggested minimal directory structure.
This example has a training flow and a prediction flow, with common code across the two plus a Jupyter notebook for debugging.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;your-repo&gt;
├── flows
|   ├── train.py
|   ├── predict.py
|   ├── debug.ipynb
|   └── common.py
└── .gitignore
</code></pre></div></div> <p>The training and prediction flow specs import common code and do… you know, machine learning. They are separated out to support ongoing predictions on a model that is potentially re-trained on a different schedule. So you might run train.py quarterly, and run predict.py weekly on fresh incoming data.</p> <p>Check out this minimal flow example.</p> <script src="https://gist.github.com/c6f9c88cf94cedf2e96d6900ac0f1226.js?file=train.py"> </script> <p>Note <code class="language-plaintext highlighter-rouge">import common</code>. That brings me to…</p> <h1 id="put-common-code-into-a-separate-script">Put common code into a separate script</h1> <p>In the minimal code structure and flow example I’ve got a script called common.py. That contains reusable functions, classes, and variables that both train.py and predict.py can use. Pulling your code into a separate script makes it more easily reusable and testable and shortens your Metaflow steps and overall flow spec.</p> <h1 id="git-ignore-metaflow">Git-ignore .metaflow</h1> <p>Don’t forget to add <code class="language-plaintext highlighter-rouge">.metaflow</code> to your .gitignore because those directories contain the local data artifacts.</p> <h1 id="debug-in-a-jupyter-notebook">Debug in a Jupyter notebook</h1> <p>You can debug common code and access Metaflow data artifacts in Jupyter notebooks.
Here’s a minimal example of all of that.</p> <script src="https://gist.github.com/c6f9c88cf94cedf2e96d6900ac0f1226.js?file=debug.ipynb"> </script> <p>I’m using autoreload magic so that I can make changes to common.py and have those changes immediately reflected at the cell level without having to re-import common or otherwise run a bunch of other cells. Your working directory in the notebook should be <code class="language-plaintext highlighter-rouge">&lt;your-repo&gt;/flows/</code> in this case.</p> <p>(Note: that debug snippet shows artifacts from my <a href="/blog/2021/05/24/ml-model-selection-with-metaflow/">previous Metaflow post</a>.)</p> <h1 id="develop-a-separate-python-package">Develop a separate Python package</h1> <p>When it makes sense to (and not earlier) I try to separate out the more broadly reusable code into a separate, pip-installable Python package. That’s <em>in addition to</em> using local common.py scripts. You can put the code in the same repo, or break it out into another one. Here’s a minimal example of putting it in the same repo. You’ll see a full working example of this at <a href="https://github.com/fwhigh/metaflow-helper">metaflow-helper</a>.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;your-repo&gt;
├── flows
|   ├── train.py
|   ├── predict.py
|   ├── debug.ipynb
|   └── common.py
├── .gitignore
├── your_package
|   ├── __init__.py
|   └── models.py
└── setup.py
</code></pre></div></div> <p>Adding the setup.py makes <code class="language-plaintext highlighter-rouge">your_package</code> pip installable. During development I’ll fire up a Python venv with <code class="language-plaintext highlighter-rouge">python -m venv venv &amp;&amp; .
venv/bin/activate</code> and then install the package in editable mode with <code class="language-plaintext highlighter-rouge">pip install -e .</code>, all from the top level of the repo.</p> <p>In each Metaflow step I’ll pip install from git if the package is not already locally available. Here are the functions that do that, which I put into common.py.</p> <script src="https://gist.github.com/c6f9c88cf94cedf2e96d6900ac0f1226.js?file=common.py"> </script> <p>Now at the top of each step you would do something like:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">common</span><span class="p">.</span><span class="n">install_dependencies</span><span class="p">(</span>
    <span class="p">{</span><span class="s">'your_package'</span><span class="p">:</span> <span class="s">'git+ssh://git@github.com/&lt;github-username&gt;/&lt;your-repo&gt;.git'</span><span class="p">}</span>
<span class="p">)</span>
</code></pre></div></div> <p>This will try to <code class="language-plaintext highlighter-rouge">import your_package</code> (the dictionary key), and if it fails, pip-install from Github. Doing this will seem like nonsense during development, but when you deploy to a production environment this will become necessary. Installing via pip lets you get the code from Github or PyPI, and will let you pin in both cases.
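For reference, a helper like that can be sketched in a few lines of plain Python. This is an illustrative stand-in, not the actual common.py implementation:

```python
import importlib
import subprocess
import sys

def install_dependencies(packages):
    """For each module-name -> pip-spec pair, import the module if it's
    already available; otherwise pip-install the spec into this interpreter.

    Example: install_dependencies({"your_package": "your_package==0.0.1"})
    """
    for module_name, pip_spec in packages.items():
        try:
            importlib.import_module(module_name)
        except ImportError:
            # Use sys.executable so the install targets the active environment
            subprocess.check_call(
                [sys.executable, "-m", "pip", "install", pip_spec]
            )

# If the module is already importable, this is a no-op (json is stdlib):
install_dependencies({"json": "json"})
```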
Here are some different ways to pin.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Install the latest commit from the default branch
</span><span class="p">{</span><span class="s">'your_package'</span><span class="p">:</span> <span class="s">'git+ssh://git@github.com/&lt;github-username&gt;/&lt;your-repo&gt;.git'</span><span class="p">}</span>

<span class="c1"># Pin by installing a tagged commit
</span><span class="p">{</span><span class="s">'your_package'</span><span class="p">:</span> <span class="s">'git+ssh://git@github.com/&lt;github-username&gt;/&lt;your-repo&gt;.git@v0.0.1'</span><span class="p">}</span>

<span class="c1"># Pin by installing a commit hash
</span><span class="p">{</span><span class="s">'your_package'</span><span class="p">:</span> <span class="s">'git+ssh://git@github.com/&lt;github-username&gt;/&lt;your-repo&gt;.git@00db203'</span><span class="p">}</span>

<span class="c1"># Install from PyPI
</span><span class="p">{</span><span class="s">'your_package'</span><span class="p">:</span> <span class="s">'your_package'</span><span class="p">}</span>

<span class="c1"># Pin by installing a PyPI version
</span><span class="p">{</span><span class="s">'your_package'</span><span class="p">:</span> <span class="s">'your_package==0.0.1'</span><span class="p">}</span>

<span class="c1"># etc etc
</span></code></pre></div></div> <p>You can call <code class="language-plaintext highlighter-rouge">install_dependencies</code> in your debugging Jupyter notebook, too. If <code class="language-plaintext highlighter-rouge">your_package</code> is already available to it, nothing will happen.
<em>This means you can test your Metaflow artifacts, common flow code, and your external package code all in the same notebook.</em></p> <p>And speaking of pinning…</p> <h1 id="pin-your-packages">Pin your packages</h1> <p>If you plan on running your flows on a cron schedule or against triggers over long periods of time, do yourself a favor and pin your packages. This increases stability of repeated flow runs that use artifacts from other flows that ran earlier. For example, predict.py needs to load the model artifact persisted in train.py, potentially days, weeks, or months later, depending on your design.</p> <p>It’s useful to think of your Metaflow jobs like you would any long-running application, for instance a web app. Pin for reproducibility and to minimize maintenance over the long term.</p> <p>You can take this thinking one step further with Metaflow: <em>think of each Metaflow step as an independent, long-running application</em> and pin potentially different packages at the top of every step. One example where I’ve seen this come up is in using Tensorflow. Tensorflow requires a specific version range of numpy, but otherwise I want access to a more recent numpy release elsewhere. If I isolate my Tensorflow modeling code to a single step or set of steps, and do pre- and post-processing in separate steps, I can pin Tensorflow with a floating numpy version and in the other steps I’ll in general get a different numpy version. 
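One way to organize per-step pins is a simple lookup keyed by step name; each step then installs only what it needs, at the versions it needs. The step names and version pins below are purely illustrative:

```python
# Hypothetical per-step pin registry (step names and versions are made up).
# The Tensorflow step gets its own pins; other steps float a different numpy.
step_requirements = {
    "preprocess": {"numpy": "numpy==1.21.0"},
    "train_tensorflow": {"tensorflow": "tensorflow==2.4.1"},
    "postprocess": {"numpy": "numpy==1.21.0"},
}

def requirements_for(step_name):
    """Look up the pin set for a step; empty dict if none is registered."""
    return step_requirements.get(step_name, {})

print(requirements_for("preprocess"))    # {'numpy': 'numpy==1.21.0'}
print(requirements_for("unknown_step"))  # {}
```

At the top of each step you'd pass the looked-up dict to your install helper.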
The <code class="language-plaintext highlighter-rouge">install_dependencies</code> function pattern I mentioned above in <a href="#develop-a-separate-python-package">Develop a separate Python package</a> will let me do this.</p> <p>Now that I’ve said all this stuff about pip…</p> <h1 id="migrate-to-conda">Migrate to conda</h1> <p>The pip-install pattern is useful for shortening the development cycle, but the <a href="https://docs.metaflow.org/metaflow/dependencies">Metaflow maintainers recommend adopting conda</a> to maximize reproducibility. I don’t have a good recommendation at this time on how to adopt the conda pattern while still keeping the development loop short. My guess is that one of <a href="https://stackoverflow.com/questions/19042389/conda-installing-upgrading-directly-from-github">pip-installing inside conda-decorated steps or conda-installing from git</a> might work. I’ll give these a try as soon as I have a need to.</p> <h1 id="keep-flows-and-flow-steps-short">Keep flows and flow steps short</h1> <p>If you pull common code into local Python scripts or into a separate package, you’ll be in a good position to make your flow spec and each of its steps as short as possible.</p> <p>Keeping them short is useful for readability and maintainability. You’ll also invariably have to do other high-level stuff at the step level without the option of pulling that code into common functions, for example <a href="https://docs.metaflow.org/metaflow/failures#catching-exceptions-with-the-catch-decorator">Metaflow step-level exception handling</a>. Do flow-control-level operations in steps and otherwise call just a few functions per step if you can.</p> <p>Keeping the scope of steps small is also useful for debugging different logical chunks of your pipeline without having to rerun upstream code and for resuming execution after a failed run.
Oftentimes in production you’ll get failures due to platform issues, and it’s useful to have completed as much upstream processing successfully as possible. Then you can resume from the failed steps forward. Or you’ll get a runtime failure from an unhandled edge case. It’s helpful when upstream, smaller-scope steps have completed successfully and the runtime failure is isolated to a small step. Small steps make the whole debugging and maintenance experience more enjoyable.</p> <h1 id="fail-fast-in-your-start-step">Fail fast in your start step</h1> <p>Your start step is an opportunity to fail fast. This means things like:</p> <ul> <li>Try to ping your external services.</li> <li>Load the pointers for your Metaflow artifact dependencies.</li> <li>Validate configuration and variables.</li> </ul> <p>If any of your canary procedures fail, let the flow error out and report back to you. It’s far better to fail in the start step if you can rather than failing toward the end of a potentially very long-running flow.</p> <h1 id="implement-a-test-mode">Implement a test mode</h1> <p>Implement a test mode that will run your flow as-is but on as small a data set as possible and with hyperparameter settings that make the ML training optimization as quick as possible. I like to do this by creating a flag parameter that I can use to subset the data and reduce parallelism to one concurrent task for any given step. If you’re training a model, reduce the number of maximum possible optimization iterations to something small like 10 epochs.
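The data-subsetting part of a test mode can be as simple as a guard function; the sizes here are illustrative:

```python
def maybe_subset(rows, test_mode, limit=1000):
    """In test mode, cut the data down so the flow finishes in minutes.

    The limit of 1000 is arbitrary; pick whatever still exercises
    your code paths end to end.
    """
    return rows[:limit] if test_mode else rows

data = list(range(100_000))
print(len(maybe_subset(data, test_mode=True)))   # 1000
print(len(maybe_subset(data, test_mode=False)))  # 100000
```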
Here’s one way to create a test-mode using a Metaflow Parameter.</p> <script src="https://gist.github.com/c6f9c88cf94cedf2e96d6900ac0f1226.js?file=test_mode.py"> </script> <p>Now I can run the flow normally with</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python train.py </code></pre></div></div> <p>or in test mode with</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python train.py <span class="nt">--test_mode</span> 1 </code></pre></div></div> <p>I did a variant of this in my model selection example from <a href="/blog/2021/05/24/ml-model-selection-with-metaflow/">my previous Metaflow post</a>. Instead of using a boolean flag I point to different configuration files by string, some of which perform the same tasks of subsetting the data down and shortening the model training times dramatically.</p> <h1 id="run-flows-in-test-mode-in-a-cicd-pipeline">Run flows in test mode in a CI/CD pipeline</h1> <p>If you’ve got a nice and short test mode working you can run it as part of continuous integration/continuous delivery &amp; deployment. You’ll see working examples of this in <a href="https://github.com/fwhigh/metaflow-helper">metaflow-helper</a>. I’ve got separate jobs and badges set up for unit testing and for running the Metaflow examples in test mode.</p> <h1 id="use-an-ide">Use an IDE</h1> <p>I prefer PyCharm. It plays nice with Metaflow. Debugging seems to work fine, but it can be a bit tricky to debug parallel tasks in foreach steps. Using test-mode (see <a href="#implement-a-test-mode">Implement a test mode</a>) and eliminating parallel tasks helps. Make sure to reuse your virtual environment interpreter if you set one up already. I’ve also set PyCharm up for Metaflow development against a remote server – that works pretty well, though there are a lot of configuration options to set.</p> <p>VS Code works as well. 
It’s faster but has reduced functionality. I especially miss refactoring and fully-featured inspection when I use VS Code. There’s a time and a place for both, and as always it boils down to personal preference.</p> <p>I haven’t tested other IDEs like Spyder. I’d like to hear if others have and what the good and the bad are about each one.</p> <h1 id="what-did-i-miss">What did I miss?</h1> <p>I didn’t talk about more advanced practices I like, such as <a href="https://docs.metaflow.org/metaflow/tagging#tagging">Metaflow run tagging</a> and setting up isolated test and development environments that operate without affecting the production environment. I’ll cover those in future posts.</p> <p>I’d like to hear from you on what I may have missed or how you do things differently!</p>Will HighSome of these are specific to Metaflow, some are more general to Python and ML.Machine Learning Model Selection with Metaflow2021-05-24T19:00:00+00:002021-08-12T19:00:00+00:00https://www.highonscience.com/blog/2021/05/24/ml-model-selection-with-metaflow<!-- NOTES examples/model-selection/results/1621113810652298 - 10_000 samples, noise 10, 1 category examples/model-selection/results/1621115192490006 - 10_000 samples, noise 100, 1 category examples/model-selection/results/1621116705680653 - 10_000 samples, noise 10, 2 categories examples/model-selection/results/1621117548416977 - 10_000 samples, noise 100, 2 categories examples/model-selection/results/1621302832648370 - 10_000 samples, noise 100, 1 categories, randomized search examples/model-selection/results/1621303262250162 - 10_000 samples, noise 100, 2 categories, randomized search --> <ul> <li><a href="https://gist.github.com/fwhigh/c6f9c88cf94cedf2e96d6900ac0f1226#file-aicamp_recipe-sh">AICamp recipe Github gist</a></li> <li><a href="https://github.com/fwhigh/metaflow-helper">fwhigh/metaflow-helper Github project</a></li> <li><a href="https://www.youtube.com/watch?v=kDIBTWDJGgY">AICamp talk on YouTube</a></li> </ul> <h1
id="overview">Overview</h1> <p>I worked with the <a href="https://metaflow.org/">Metaflow</a> creators at Netflix from the time they built their first proofs of concept. About six months later I built my first flows as one of the earliest adopters. I had been rolling my own Flask API services to serve machine learning model predictions but Metaflow provided a much more accessible, lower complexity path to keep the models and services up to date.</p> <p>I also had the privilege of working next to a lot of other talented developers who built some of their own spectacular ML based applications with Metaflow over the following years. Now that I’ve left Netflix I look forward to continuing to use it and helping others get the most out of it.</p> <p>What is Metaflow? It’s a framework that lets you write data pipelines in pure Python, and it’s particularly suited to scaling up machine learning applications. Pipelines are specified as multiple <em>steps</em> in a <em>flow</em>, and steps can consist of potentially many <em>tasks</em> executed in parallel in their own isolated containers in the cloud. Tasks are stateless and reproducible. Metaflow persists objects and data in a data store like S3 for easy retrieval, inspection, and further processing by downstream systems. Read more at <a href="https://metaflow.org/">https://metaflow.org/</a>.</p> <p>In this post I’ll demonstrate one of the ways I like to use it: doing repeatable machine learning model selection at scale. (This post does not address the ML model reproducibility crisis. Repeatable here means easily re-runnable.) I’ll compare 5 different hyperparameter settings for each of LightGBM and Keras regressors, with 5 fold cross validation and early stopping, for a total of 50 parallel model candidates. All of these instances are executed in parallel. 
The following box plots show the min and max and the 25th, 50th (median), and 75th percentiles of r-squared score from a mock regression data set.</p> <figure class="align-center" style="display: table;"> <a href="/assets/ml-model-selection-with-metaflow/1621302832648370/all-scores.png"><img width="100%" src="/assets/ml-model-selection-with-metaflow/1621302832648370/all-scores.png" /></a> <figcaption style="display: table-caption; caption-side: bottom; font-style: italic;" width="100%">Noisy regression, one category: any of the tested Keras architectures wins on out-of-sample r-squared score. The narrow single-hidden-layer Keras model happened to be best overall, with l1 factor 2.4e-7 and l2 factor 7.2e-6.</figcaption> </figure> <figure class="align-center" style="display: table;"> <a href="/assets/ml-model-selection-with-metaflow/1621303262250162/all-scores.png"><img width="100%" src="/assets/ml-model-selection-with-metaflow/1621303262250162/all-scores.png" /></a> <figcaption style="display: table-caption; caption-side: bottom; font-style: italic;" width="100%">Noisy regression, two categories: LightGBM with depth 3 interactions and learning rate 0.03 wins on out-of-sample r-squared score. 
The LightGBM model with depth 1 performed the worst.</figcaption> </figure> <p>Predictions from the best model settings on the held out test set look like this for the noisy one-category data set.</p> <figure class="align-center" style="display: table; "> <a href="/assets/ml-model-selection-with-metaflow/1621302832648370/predicted-vs-true.png"><img width="100%" src="/assets/ml-model-selection-with-metaflow/1621302832648370/predicted-vs-true.png" /></a> <figcaption style="display: table-caption; caption-side: bottom; font-style: italic;" width="100%">Predicted versus true for the noisy regression, one category.</figcaption> </figure> <p>For just 2 models each on a hyperparameter grid of size 10 to 100, and using 5 fold cross validation, cardinality can reach of order 100 to 1,000 jobs. It’s easy to imagine making that even bigger with more models or hyperparameter combinations. Running Metaflow in the cloud (e.g. AWS) lets you execute each one of them concurrently in isolated containers. I’ve seen the cardinality blow up to of order 10,000 or more and things still work just fine, as long as you’ve got the time, your settings are reasonable, and your account with your cloud provider is big enough.</p> <p>The code is available at <a href="https://github.com/fwhigh/metaflow-helper">https://github.com/fwhigh/metaflow-helper</a>. The examples in this article are reproducible from the commit tagged <strong>v0.0.1</strong>. You can also install the tagged package from PyPI with <code class="language-plaintext highlighter-rouge">pip install metaflow-helper==0.0.1</code>. Comments, issues, and pull requests are welcome.</p> <p>This post is <em>not</em> meant to conclude whether LightGBM is better than Keras or vice versa – I chose them for illustration purposes only. What model to choose, and which will win a tournament, are application-dependent. And that’s sort of the point!
This procedure outlines how you would productionalize model tournaments that you can run on many different data sets, and repeat the tournament over time as well.</p> <h1 id="quickstart">Quickstart</h1> <p>You can run the model selection tournament immediately like this. Install a convenience package called metaflow-helper at the commit tagged v0.0.1.</p> <div class="gist-embed-link"> <!-- <a href="https://gist.github.com/fwhigh/c6f9c88cf94cedf2e96d6900ac0f1226#file-model_selection_quickstart_install-sh"> gist</a> <a href="https://gist.github.com/fwhigh/c6f9c88cf94cedf2e96d6900ac0f1226/raw/model_selection_quickstart_install.sh"> raw</a> --> <code class="gist-embed-code" data-gist-id="fwhigh/c6f9c88cf94cedf2e96d6900ac0f1226" data-gist-file="model_selection_quickstart_install.sh" data-gist-hide-footer="false" data-gist-show-spinner="true" gist-enable-cache="true"></code> </div> <p>Then run the Metaflow tournament job at a small scale just to test it out. This one needs a few more packages, including Metaflow itself, which metaflow-helper doesn’t currently require.</p> <div class="gist-embed-link"> <!-- <a href="https://gist.github.com/fwhigh/c6f9c88cf94cedf2e96d6900ac0f1226#file-model_selection_quickstart_train_run-sh"> gist</a> <a href="https://gist.github.com/fwhigh/c6f9c88cf94cedf2e96d6900ac0f1226/raw/model_selection_quickstart_train_run.sh"> raw</a> --> <code class="gist-embed-code" data-gist-id="fwhigh/c6f9c88cf94cedf2e96d6900ac0f1226" data-gist-file="model_selection_quickstart_train_run.sh" data-gist-hide-footer="false" data-gist-show-spinner="true" gist-enable-cache="true"></code> </div> <p>Results are printed to the screen, but they are also summarized in a local file <code class="language-plaintext highlighter-rouge">results/&lt;run-id&gt;/summary.txt</code> along with some plots. There are full scale model selection configurations available in there as well.</p> <p>The following figure shows the flow you are running.
The mock data is generated in the start step. The next step splits across all hyperparameter grid points for all contenders – 10 total for 2 models in the case of this example. Then there are 5 tasks for each cross validation fold, for a total of 50 tasks. Models are trained in these tasks directly. The next step joins the folds and summarizes the results by model and hyperparameter grid point. Then there’s a join over all models and grid points, whereupon a final model with a held out test set is trained and evaluated. Finally, a model on all of the data is trained. The end step produces summary data and figures.</p> <figure class="align-center" style="display: table; "> <a href="/assets/ml-model-selection-with-metaflow/model-selection-flow.png"><img width="100%" src="/assets/ml-model-selection-with-metaflow/model-selection-flow.png" /></a> <figcaption style="display: table-caption; caption-side: bottom; font-style: italic;" width="100%">Model selection flow.</figcaption> </figure> <h1 id="mocking-a-data-set">Mocking A Data Set</h1> <p>The mock regression data is generated using Scikit-learn <a href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html">make_regression</a>. Keyword parameter settings are controlled entirely in configuration files like <a href="https://github.com/fwhigh/metaflow-helper/blob/v0.0.1/examples/model-selection/randomized_config.py">randomized_config.py</a> in an object called <code class="language-plaintext highlighter-rouge">make_regression_init_kwargs</code>.
If you set <code class="language-plaintext highlighter-rouge">n_categorical_features = 1</code> you’ll get a single data set with <code class="language-plaintext highlighter-rouge">n_numeric_features</code> continuous features, <code class="language-plaintext highlighter-rouge">n_informative_numeric_features</code> of which are “informative” to the target <code class="language-plaintext highlighter-rouge">y</code>, with noise given by <code class="language-plaintext highlighter-rouge">noise</code>, through the relationship <code class="language-plaintext highlighter-rouge">y = beta * X + noise</code>. <code class="language-plaintext highlighter-rouge">beta</code> are the coefficients, <code class="language-plaintext highlighter-rouge">n_numeric_features - n_informative_numeric_features</code> of which will be zero. You can add any other parameters <code class="language-plaintext highlighter-rouge">make_regression</code> accepts directly to <code class="language-plaintext highlighter-rouge">make_regression_init_kwargs</code>.</p> <p>If you set <code class="language-plaintext highlighter-rouge">n_categorical_features = 2</code> or more, you’ll get <code class="language-plaintext highlighter-rouge">n_categorical_features</code> independent regression sets concatenated together into a single data set. Each category corresponds to a totally independent set of coefficients. Which features are uninformative for each of the categories is entirely random. This is a silly construction but it allows for validation of the flow against at least one categorical variable.</p> <h1 id="specifying-contenders">Specifying Contenders</h1> <p>All ML model contenders, including their hyperparameter grids, are also specified in <a href="https://github.com/fwhigh/metaflow-helper/blob/v0.0.1/examples/model-selection/randomized_config.py">randomized_config.py</a> using the <code class="language-plaintext highlighter-rouge">contenders_spec</code> object. 
Implement this spec object like you would any hyperparameter grid that you would pass to Scikit-learn <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html">GridSearchCV</a> or <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html">RandomizedSearchCV</a>, or equivalently <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ParameterGrid.html">ParameterGrid</a> or <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ParameterSampler.html">ParameterSampler</a>. Randomized search is automatically used if the <code class="language-plaintext highlighter-rouge">'__n_iter'</code> key is present in the contender spec; otherwise the flow will fall back to grid search.</p> <p>Here’s an illustration of tuning two models. The LightGBM model is being tuned over 5 random <code class="language-plaintext highlighter-rouge">max_depth</code> and <code class="language-plaintext highlighter-rouge">learning_rate</code> settings. The Keras model is being tuned over 5 different combinations of layer architectures and regularizers. The layer architectures are</p> <ul> <li>no hidden layers,</li> <li>one hidden layer of size 15,</li> <li>two hidden layers each of size 15, and</li> <li>one wide hidden layer of size 225. The regularizers are l1 and l2 factors, log-uniformly sampled and applied globally to all biases, kernels, and activations. This specific example may well be a naive search, but the main purpose right now is to demonstrate what is possible.
The spec can be extended arbitrarily for real-world applications.</li> </ul> <div class="gist-embed-link"> <!-- <a href="https://gist.github.com/fwhigh/c6f9c88cf94cedf2e96d6900ac0f1226#file-model_selection_contenders_spec-py"> gist</a> <a href="https://gist.github.com/fwhigh/c6f9c88cf94cedf2e96d6900ac0f1226/raw/model_selection_contenders_spec.py"> raw</a> --> <code class="gist-embed-code" data-gist-id="fwhigh/c6f9c88cf94cedf2e96d6900ac0f1226" data-gist-file="model_selection_contenders_spec.py" data-gist-hide-footer="false" data-gist-show-spinner="true" gist-enable-cache="true"></code> </div> <p>The model is specified in a reserved key, <code class="language-plaintext highlighter-rouge">'__model'</code>. The value of <code class="language-plaintext highlighter-rouge">'__model'</code> is a fully qualified Python object path string. In this case I’m using metaflow-helper convenience objects I’m calling model helpers, which reimplement init, fit, and predict with a small number of required keyword arguments.</p> <p>Anything prepended with <code class="language-plaintext highlighter-rouge">'__init_kwargs__model'</code> gets passed to the model initializers and <code class="language-plaintext highlighter-rouge">'__fit_kwargs__model'</code> keys get passed to the fitters. I’m wrapping the model in a Scikit-learn <a href="https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html">Pipeline</a> with step-name <code class="language-plaintext highlighter-rouge">'model'</code>.</p> <p>I implemented two model wrappers, a LightGBM regressor and a Keras regressor. Sources for these are in <a href="https://github.com/fwhigh/metaflow-helper/tree/v0.0.1/metaflow_helper/models">metaflow_helper/models</a>. 
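To make the key-routing concrete, here's a sketch of how such prefixed keys could be split into initializer and fitter kwargs. The exact key separator and the model path string below are assumptions for illustration, not the metaflow-helper implementation:

```python
# Sketch of the prefix-routing convention described above. The "__" separator
# after the prefix is an assumption for illustration.
INIT_PREFIX = "__init_kwargs__model__"
FIT_PREFIX = "__fit_kwargs__model__"

def split_model_kwargs(spec):
    """Route spec entries to the model initializer or fitter by key prefix."""
    init_kwargs, fit_kwargs = {}, {}
    for key, value in spec.items():
        if key.startswith(INIT_PREFIX):
            init_kwargs[key[len(INIT_PREFIX):]] = value
        elif key.startswith(FIT_PREFIX):
            fit_kwargs[key[len(FIT_PREFIX):]] = value
    return init_kwargs, fit_kwargs

init_kwargs, fit_kwargs = split_model_kwargs({
    "__model": "your.module.ModelHelper",  # reserved key; path is hypothetical
    "__init_kwargs__model__max_depth": 3,
    "__fit_kwargs__model__early_stopping_rounds": 10,
})
print(init_kwargs)  # {'max_depth': 3}
print(fit_kwargs)   # {'early_stopping_rounds': 10}
```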
They’re straightforward, and you can implement additional ones for any other algo.</p> <h1 id="further-ideas-and-extensions">Further Ideas and Extensions</h1> <p>There are a number of ways to extend this idea.</p> <p><strong>Idea 1:</strong> It was interesting to do model selection on a continuous target variable, but it’s possible to do the same type of optimization for a classification task using Scikit-learn <a href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html">make_classification</a> and <a href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_multilabel_classification.html">make_multilabel_classification</a> to mock data.</p> <p><strong>Idea 2:</strong> You can add more model handlers for ever larger model selection searches.</p> <p><strong>Idea 3:</strong> It’d be especially interesting to try to use <em>all</em> models in the grid in an ensemble, which is definitely also possible with Metaflow by joining each model from parallel grid tasks and applying another model of models.</p> <p><strong>Idea 4:</strong> I do wish I could simply access each task in Scikit-learn’s cross-validation search (e.g. GridSearchCV) and distribute those tasks directly into Metaflow steps. Then I could recycle all of its Pipeline and CV search machinery and patterns, which I like. I poked around the Scikit-learn source code just a bit but it didn’t seem straightforward to implement things this way. I had to break some Scikit-learn patterns to make things work but it wasn’t too painful.</p> <p>I’m interested in any other ideas you might have.
Enjoy!</p>Will HighConfigurable, repeatable, parallel model selection using Metaflow, including randomized hyperparameter tuning, cross-validation, and early stopping.Where Those Loss Constants Come From2021-03-29T19:00:00+00:002021-03-29T19:00:00+00:00https://www.highonscience.com/blog/2021/03/29/loss-constants<h1 id="tldr">tl;dr</h1> <p>A common form of the objective function for LASSO, a.k.a. L1-regularized least-squares regression, for a batch (size $n$) of training data (total size $N$) is</p> $\begin{equation} L(\theta) = \lambda \vert \theta \vert + \frac{1}{n} \sum_i^n \frac{(y_i - f(\mathbf{x_i};\theta))^2}{2}. \end{equation}$ <p>The total loss $L$ consists of a regularization term and the mean of a loss function $\ell$. Why does that $1/n$ show up in the loss term? Where does that $2$ come from in the denominator? The objective function would be prettier without them!</p> <p>$1/n$ shows up to support batch gradient descent optimization. The mean loss over a batch of size $n&lt;N$ is an estimate of the mean loss over the entire data set. $n$ can be as small as $1$ and this is still true.</p> <p>As for that $2$… that is introduced in the loss purely so that the gradient has no dangling constants. That’s it. Not super compelling, but also not completely devoid of purpose.</p> <p>Along the way, we’ll get a bonus of seeing what other constants are implicitly packed into $\lambda$, the regularization strength.</p> <p>I’ll show my work as best I can.</p> <h1 id="posterior-probability">Posterior Probability</h1> <p>In typical machine learning problems the data is divided into the target or dependent variable $y$ (a scalar) and the features or independent variables $\mathbf{x}$ (a vector). The target and the features are paired into tuples $y_i\vert \mathbf{x}_i$ read “y given x” for all data points indexed by $i$ from 1 to $N$, the size of the data set.
The set of all data is $$\{ y_i \vert \mathbf{x}_i \}$$.</p> <p>The <em>posterior probability</em> $$p(\theta \vert \{ y_i \vert \mathbf{x}_i \})$$ is the probability of the model parameters given the observed data. You can use Bayes’ rule to show that it is proportional to the product of the prior on parameters $p(\theta)$ and the likelihood function $\mathcal{L}$:</p> $\begin{equation} p(\theta \vert \{ y_i \vert \mathbf{x}_i \}) \propto p(\theta) \mathcal{L}(\theta \vert \{ y_i \vert \mathbf{x}_i \}). \end{equation}$ <p>The likelihood is interpreted as the probability that the observed data was generated by the model. The prior is interpreted as the range of likely values of the parameters of the model, as expressed using probabilities.</p> <p>The likelihood function will lead to the loss function, which is the square-error in this case, and the prior will lead to the regularization term.</p> <p>The best model is the one that maximizes the posterior probability. This is called the maximum a posteriori or MAP model. If the prior is flat ($p(\theta)=1$), the maximum a posteriori model is also the maximum likelihood estimate, abbreviated MLE.</p> <p>I am ignoring an additional term called the model evidence or marginal likelihood, which is a probability that the product of the prior and the likelihood is divided by. This term is immediately discarded because it does not depend on model parameters $\theta$, and maximizing the full posterior probability is equivalent to maximizing just the prior times the likelihood. I bring this up because discarding the marginal likelihood does not affect the constants that enter into the optimization problem.</p> <p>Optimization is more conveniently performed in log space. Maximization is turned into minimization by multiplying by $-1$; this is entirely an historical convention.</p> $\begin{equation} -\log{p(\theta \vert \{ y_i \vert \mathbf{x}_i \})} = - \log{p(\theta)} - \log{\mathcal{L}(\theta \vert \{ y_i \vert \mathbf{x}_i \})}.
\end{equation}$ <h1 id="gaussian-likelihood">Gaussian Likelihood</h1> <p>Now to make things more concrete. Let’s say the dependent variable $y_i$ is continuous and unbounded from $$-\infty$$ to $$\infty$$, and errors between $y_i$ and predictions of $$y_i$$ from the model $f(\mathbf{x}_i;\theta)$ are normally distributed. Then the likelihood function is</p> $\begin{equation} \mathcal{L}(\theta \vert \{ y_i \vert \mathbf{x}_i \}) = \prod_{i=1}^N\frac{1}{\sigma\sqrt{2\pi}}\exp{\left(-\frac{(y_i - f(\mathbf{x}_i;\theta))^2}{2\sigma^2}\right)} \end{equation}$ <p>The negative log-likelihood is</p> $\begin{eqnarray} - \log{\mathcal{L}(\theta \vert \{ y_i \vert \mathbf{x}_i \})} &amp; = &amp; - \log{\left( \prod_{i=1}^N\frac{1}{\sigma\sqrt{2\pi}}\exp{\left(-\frac{(y_i - f(\mathbf{x}_i;\theta))^2}{2\sigma^2}\right)} \right)} \\ &amp; = &amp; - \sum_i^N \log{\left( \frac{1}{\sigma\sqrt{2\pi}} \exp{\left(-\frac{(y_i - f(\mathbf{x}_i;\theta))^2}{2\sigma^2}\right)} \right)} \\ &amp; = &amp; - \sum_i^N \log{\left( \frac{1}{\sigma\sqrt{2\pi}} \right)} - \sum_i^N \log{\exp{\left(-\frac{(y_i - f(\mathbf{x}_i;\theta))^2}{2\sigma^2}\right)}} \\ &amp; = &amp; - \sum_i^N \log{\left( \frac{1}{\sigma\sqrt{2\pi}} \right)} + \sum_i^N \frac{(y_i - f(\mathbf{x}_i;\theta))^2}{2\sigma^2} \end{eqnarray}$ <p>Minimization of the negative log-likelihood over $\theta$ can further simplify the task:</p> $\begin{eqnarray} \mathrm{argmin}_{\theta} \left( - \log{\mathcal{L}(\theta \vert \{ y_i \vert \mathbf{x}_i \})} \right) &amp; = &amp; \mathrm{argmin}_{\theta} \left( - \sum_i^N \log{\left( \frac{1}{\sigma\sqrt{2\pi}} \right)} + \sum_i^N \frac{(y_i - f(\mathbf{x}_i;\theta))^2}{2\sigma^2} \right) \\ &amp; = &amp; \mathrm{argmin}_{\theta} \left( \sum_i^N \frac{(y_i - f(\mathbf{x}_i;\theta))^2}{2\sigma^2} \right). 
\end{eqnarray}$ <p>That is to say, you can get rid of the constant term because it does not depend on $\theta$.</p> <h1 id="laplace-prior">Laplace Prior</h1> <p>Let’s say that the parameters of the model should be Laplace distributed about $0$, like this:</p> $\begin{eqnarray} p(\theta) = \frac{1}{2b} \exp{\left( - \frac{\vert \theta \vert}{b} \right)}. \end{eqnarray}$ <p>This prior asserts that I believe values of $\theta$ should usually be about $b$ or less away from $0$, and I most expect $\theta=0$ “with prejudice”. As far as priors on model parameters go, this prior amounts to a really silly assertion, because if I thought the model parameters should most likely be $0$ then it’s not clear why I’m building a model at all. More precisely, in real situations I always expect <em>at least some</em> model parameters to be different from zero, so the Laplace prior results in a biased MAP model.</p> <p>That said, I’m going to proceed with this prior anyway because it ends up being enormously useful.</p> <p>The negative log-prior is</p> $\begin{eqnarray} - \log{p(\theta)} &amp; = &amp; - \log \left( \frac{1}{2b} \exp{\left( - \frac{\vert \theta \vert}{b} \right)} \right) \\ &amp; = &amp; - \log \left( \frac{1}{2b} \right) - \log \exp{\left( - \frac{\vert \theta \vert}{b} \right)} \\ &amp; = &amp; - \log \left( \frac{1}{2b} \right) + \frac{\vert \theta \vert}{b} \end{eqnarray}$ <p>Minimization of the negative log-prior over $\theta$ can further simplify the task:</p> $\begin{eqnarray} \mathrm{argmin}_{\theta} \left( - \log{p(\theta)} \right) &amp; = &amp; \mathrm{argmin}_{\theta} \left( - \log \left( \frac{1}{2b} \right) + \frac{\vert \theta \vert}{b} \right) \\ &amp; = &amp; \mathrm{argmin}_{\theta} \left( \frac{\vert \theta \vert}{b} \right).
\end{eqnarray}$ <p>That is to say, you can get rid of the constant term because it does not depend on $\theta$.</p> <h1 id="total-loss">Total Loss</h1> <p>Math is beautiful to look at, so let’s write out the full posterior probability model I’ve developed so far simply to behold it:</p> $\begin{eqnarray} p(\theta \vert \{ y_i \vert \mathbf{x}_i \}) \propto \frac{1}{2b} \exp{\left( - \frac{\vert \theta \vert}{b} \right)} \prod_{i=1}^N\frac{1}{\sigma\sqrt{2\pi}}\exp{\left(-\frac{(y_i - f(\mathbf{x}_i;\theta))^2}{2\sigma^2}\right)} \end{eqnarray}$ <p>The objective function $Q(\theta)$ is defined as the negative log-posterior, and because I intend to minimize the objective with respect to $\theta$, I discard the constant terms immediately for simplicity:</p> $\begin{eqnarray} Q(\theta) &amp; = &amp; \frac{\vert \theta \vert}{b} + \sum_i^N \frac{(y_i - f(\mathbf{x}_i;\theta))^2}{2\sigma^2}. \end{eqnarray}$ <p>To be super clear, I’ve simply taken the negative log-prior term that depends on $\theta$ and the negative log-likelihood term that depends on $\theta$ and labeled their sum $Q(\theta)$. This is not quite yet the total loss I talked about at the beginning. We’re getting there.</p> <p>Machine learning reduces to learning the best $\theta$, which is the one that minimizes $Q(\theta)$. How is that done?</p> <h1 id="stochastic-gradient-descent">Stochastic Gradient Descent</h1> <p>Gradient descent (GD) is a class of optimization techniques where you search for a function’s minimum by taking steps in the direction of the function’s steepest descent. The direction of steepest descent is the negative of the function’s gradient.</p> <p>To do GD you do two things.</p> <ol> <li><strong>Take a measurement</strong>. Measure the objective function value at the current step $j$, $Q(\theta_j)$.</li> <li><strong>Take a step in the right direction</strong>.
Descend the objective function in the direction opposite the gradient, with a step size equal to the steepness times some tunable number $\eta_j$. This puts us in the next location $\theta_{j+1} = \theta_j - \eta_j \nabla Q(\theta_j)$.</li> </ol> <p>If you do these two steps repeatedly and judiciously shrink $\eta_j$ with each step, you will eventually reach a (local) minimum of the function $Q$.</p> <p>That’s great if $N$ is not absolutely enormous. If it is, computing each step can take forever and standard GD takes too long to be practically useful. So I have to use more tricks.</p> <h1 id="batching">Batching</h1> <p>A solution at huge $N$ is to break up the training data $$\{y_i \vert \mathbf{x}_i\}$$ into <em>batches</em> that are small enough to support quick computation of the objective and its gradient for steps 1 and 2, respectively.</p> <p>If the batches are $n&lt;N$ randomly sampled data points, how do I fairly estimate the total objective function in step 1 and the gradient in step 2? If I naively compute the objective over the subset of data points I get</p> $\begin{eqnarray} q(\theta) &amp; = &amp; \frac{\vert \theta \vert}{b} + \sum_i^n \frac{(y_i - f(\mathbf{x}_i;\theta))^2}{2\sigma^2}. \end{eqnarray}$ <p>The problem is that this is not a good estimate of $Q$ from the full data set. The reason is, $Q$ accumulates square-residuals (times constants) over every data point up to $N$. If square residuals are on average, say, $0.1$ for illustration’s sake, then</p> $\begin{eqnarray} Q = \frac{\vert \theta \vert}{b} + \frac{0.1N}{2\sigma^2}. \end{eqnarray}$ <p>But on a batch of data $q$ will be about $\vert \theta \vert/b + 0.1n/2\sigma^2 &lt; \vert \theta \vert/b + 0.1N/2\sigma^2$, so $q&lt;Q$ in expectation. A solution to this is to multiply the likelihood term by $N/n$ to scale it back up.
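This rescaling claim is easy to check numerically. Here is a small NumPy sketch (the residual values are made up for illustration) showing that the scaled batch sum matches the full-data sum on average:

```python
# Verify that (N/n) * (sum of batch losses) estimates the full-data sum.
# The squared residuals here are synthetic stand-ins for (y_i - f(x_i))^2.
import numpy as np

rng = np.random.default_rng(0)
N, n = 100_000, 256
sq_resid = rng.exponential(scale=0.1, size=N)

full_sum = sq_resid.sum()
# average the scaled batch sum over many random batches
estimates = [
    (N / n) * rng.choice(sq_resid, size=n, replace=False).sum()
    for _ in range(200)
]
batch_estimate = float(np.mean(estimates))
print(full_sum, batch_estimate)  # the two closely agree
```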
Then I get, in my cartoon, a new batch objective of</p> $\begin{eqnarray} q_{\mathrm{batch}} = \frac{\vert \theta \vert}{b} + \frac{N}{n} \frac{0.1n}{2\sigma^2} = \frac{\vert \theta \vert}{b} + \frac{0.1N}{2\sigma^2} = Q, \end{eqnarray}$ <p>on average. And if this is true for my randomly chosen average square residual of $0.1$ it must be true for every average square residual.</p> <p>So now the task is to minimize</p> $\begin{eqnarray} q_{\mathrm{batch}}(\theta) &amp; = &amp; \frac{\vert \theta \vert}{b} + \frac{N}{n} \sum_i^n \frac{(y_i - f(\mathbf{x}_i;\theta))^2}{2\sigma^2}. \end{eqnarray}$ <p>It is also true that, for the purposes of minimization,</p> $\begin{eqnarray} \mathrm{argmin}_{\theta} q_{\mathrm{batch}}(\theta) &amp; = &amp; \mathrm{argmin}_{\theta}\left( \frac{\vert \theta \vert}{b} + \frac{N}{n} \sum_i^n \frac{(y_i - f(\mathbf{x}_i;\theta))^2}{2\sigma^2}\right) \\ &amp; = &amp; \mathrm{argmin}_{\theta}\left( \frac{\sigma^2}{N} \frac{\vert \theta \vert}{b} + \frac{1}{n} \sum_i^n \frac{(y_i - f(\mathbf{x}_i;\theta))^2}{2}\right) \\ &amp; = &amp; \mathrm{argmin}_{\theta}\left( \lambda \vert \theta \vert + \frac{1}{n} \sum_i^n \ell_i(y_i\vert \mathbf{x}_i;\theta) \right) \end{eqnarray}$ <p>if $\lambda = n\sigma^2/Nb$ and $$\ell_i(y_i\vert \mathbf{x}_i;\theta) = (y_i - f(\mathbf{x}_i;\theta))^2/2$$. In the machine learning community, $\lambda$ is treated as a tunable hyperparameter and $\ell_i(y_i\vert\mathbf{x}_i;\theta)$ is a useful abstraction called the <em>loss</em>. The objective function to be minimized is called the total loss, as estimated from a batch of size $n$,</p> $\begin{eqnarray} L(\theta) = \lambda \vert \theta \vert + \frac{1}{n} \sum_i^n \ell_i(y_i\vert\mathbf{x}_i;\theta). 
\end{eqnarray}$ <p>The gradient in this case, to be used in gradient descent, is</p> $\begin{eqnarray} \nabla L(\theta) &amp; = &amp; \nabla \lambda \vert \theta \vert + \nabla \frac{1}{n} \sum_i^n \ell_i(y_i\vert\mathbf{x}_i;\theta) \\ &amp; = &amp; \lambda (2\times \mathbb{1}_{\theta &gt; 0} - 1) + \frac{1}{n} \sum_i^n \nabla \ell_i(y_i\vert\mathbf{x}_i;\theta) \\ &amp; = &amp; \lambda (2\times \mathbb{1}_{\theta &gt; 0} - 1) + \frac{1}{n} \sum_i^n 2 \mathbf{x}_i \frac{(f(\mathbf{x}_i;\theta) - y_i)}{2} \\ &amp; = &amp; \lambda (2\times \mathbb{1}_{\theta &gt; 0} - 1) + \frac{1}{n} \sum_i^n \mathbf{x}_i (f(\mathbf{x}_i;\theta) - y_i) \end{eqnarray}$ <p>where $\mathbb{1}_{\theta &gt; 0}$ is the indicator function equal to 1 when the argument $\theta &gt; 0$ is true and 0 otherwise. Note that the 2 in the denominator of this loss function canceled with the 2 that fell out of the gradient operation. This cancellation is the only reason to keep the 2 around in the denominator (and it was not a very good reason).</p> <h1 id="conclusion">Conclusion</h1> <p>That’s it! I’ve shown in gory detail where the $1/n$ and the $2$ came from. As a bonus I showed that $\lambda$ can be expressed entirely in terms of other fundamental constants of the problem.</p> <p>Here are some variants of this procedure to try so you can stay sharp and impress people at parties.</p> <ul> <li>Use a zero-centered Gaussian prior to reproduce Ridge regression.</li> <li>Use a Bernoulli likelihood function to reproduce (regularized) logistic regression.</li> <li>Think about L0 regularization in terms of its implied prior.
What makes L0 regression intractable?</li> </ul>Will HighHere's where that n and that 2 come from in the square-loss objective function, in gory detail.Parallel Grep and Awk2021-03-21T19:00:00+00:002021-06-03T19:00:00+00:00https://www.highonscience.com/blog/2021/03/21/parallel-grep <h1 id="tldr">tl;dr</h1> <!-- 2990406 appears at the top of kddb 13653924 appears in both kddb and kddb.t --> <p>Get <a href="https://www.gnu.org/software/parallel/">GNU parallel</a> (e.g. <code class="language-plaintext highlighter-rouge">brew install parallel</code>, <code class="language-plaintext highlighter-rouge">apt-get install parallel</code>, etc.).</p> <p>Run grep in parallel blocks on a single file.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>parallel <span class="nt">--pipepart</span> <span class="nt">--block</span> 10M <span class="nt">-a</span> &lt;filename&gt; <span class="nt">-k</span> <span class="nb">grep</span> &lt;grep-args&gt; </code></pre></div></div> <p>Run grep on multiple files in parallel, in this case all files in a directory and its subdirectories.
Add <code class="language-plaintext highlighter-rouge">/dev/null</code> to force grep to prepend the filename to the matching line.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>find <span class="nb">.</span> <span class="nt">-type</span> f | xargs <span class="nt">-n</span> 1 <span class="nt">-P</span> 4 <span class="nb">grep</span> &lt;grep-args&gt; /dev/null </code></pre></div></div> <p>Run grep in parallel blocks on multiple files in serial. Manually prepend the filename since grep can’t do it in this case.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># for-loop</span> <span class="k">for </span>filename <span class="k">in</span> <span class="sb">`</span>find <span class="nb">.</span> <span class="nt">-type</span> f<span class="sb">`</span> <span class="k">do </span>parallel <span class="nt">--pipepart</span> <span class="nt">--block</span> 10M <span class="nt">-a</span> <span class="nv">$filename</span> <span class="nt">-k</span> <span class="s2">"grep &lt;grep-args&gt; | awk -v OFS=: '{print </span><span class="se">\"</span><span class="nv">$filename</span><span class="se">\"</span><span class="s2">,</span><span class="se">\$</span><span class="s2">0}'"</span> <span class="k">done</span> <span class="c"># using xargs</span> find <span class="nb">.</span> <span class="nt">-type</span> f | xargs <span class="nt">-I</span> filename parallel <span class="nt">--pipepart</span> <span class="nt">--block</span> 10M <span class="nt">-a</span> filename <span class="nt">-k</span> <span class="s2">"grep &lt;grep-args&gt; | awk -v OFS=: '{print </span><span class="se">\"</span><span class="s2">filename</span><span class="se">\"</span><span class="s2">,</span><span class="se">\$</span><span class="s2">0}'"</span> </code></pre></div></div> <p>Run grep in parallel blocks on multiple files in parallel.
Take care to prepend the filename since grep can’t do it in this case. Warning, this may be an inefficient use of multithreading.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>find <span class="nb">.</span> <span class="nt">-type</span> f | xargs <span class="nt">-n</span> 1 <span class="nt">-P</span> 4 <span class="nt">-I</span> filename parallel <span class="nt">--pipepart</span> <span class="nt">--block</span> 10M <span class="nt">-a</span> filename <span class="nt">-k</span> <span class="s2">"grep &lt;grep-args&gt; | awk -v OFS=: '{print </span><span class="se">\"</span><span class="s2">filename</span><span class="se">\"</span><span class="s2">,</span><span class="se">\$</span><span class="s2">0}'"</span> </code></pre></div></div> <h1 id="parallel-grep-on-one-file">Parallel grep on one file</h1> <p>Say I want to know how many times feature “15577606” appears in the KDD CUP 2010 <a href="https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html">kddb LIBSVM</a> machine learning benchmark training set. This is a binary classification data set containing 19 million lines (each line is a feature vector) and 30 million features – a large grep task.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/kddb.bz2 bunzip2 kddb.bz2 <span class="nb">time grep </span>15577606 kddb <span class="o">&gt;</span> /dev/null </code></pre></div></div> <p>A standard grep takes me 1m24s. 
Grep picks out just 199 lines containing that feature.</p> <p><a href="https://www.gnu.org/software/parallel/">The GNU parallel utility</a> gives me a nice <strong>5.6x speedup at 15s</strong> using multiple threads.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">time</span> <span class="se">\</span> parallel <span class="nt">--pipepart</span> <span class="nt">--block</span> 10M <span class="nt">-a</span> kddb <span class="nt">-k</span> <span class="nb">grep </span>15577606 <span class="se">\</span> <span class="o">&gt;</span> /dev/null </code></pre></div></div> <h1 id="parallel-feature-cardinality-with-awk-on-one-file">Parallel feature cardinality with awk on one file</h1> <p>Now I’ll use this to do something useful: count the occurrence of each of the $O(10^7)$ features in the training file. I’ll use a map-reduce pattern. In the map phase I’ll run “feature_count.awk” with the following contents.</p> <div class="language-awk highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!/usr/bin/awk -f </span> <span class="p">{</span> <span class="c1"># loop over each feature but skip the label</span> <span class="k">for</span> <span class="p">(</span><span class="nx">i</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span> <span class="nx">i</span> <span class="o">&lt;=</span> <span class="kc">NF</span><span class="p">;</span> <span class="nx">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="c1"># split the feature at the ':' character</span> <span class="nb">split</span><span class="p">(</span><span class="nv">$i</span><span class="p">,</span> <span class="nx">a</span><span class="p">,</span> <span class="s2">":"</span><span class="p">)</span> <span class="c1"># count the number of times the feature appears</span> <span class="nx">n</span><span class="p">[</span><span class="nx">a</span><span
class="p">[</span><span class="mi">1</span><span class="p">]]</span><span class="o">++</span> <span class="p">}</span> <span class="p">}</span> <span class="kr">END</span> <span class="p">{</span> <span class="k">for</span> <span class="p">(</span><span class="nx">i</span> <span class="o">in</span> <span class="nx">n</span><span class="p">)</span> <span class="p">{</span> <span class="c1"># print out the feature and its count</span> <span class="k">print</span> <span class="nx">i</span><span class="p">,</span> <span class="nx">n</span><span class="p">[</span><span class="nx">i</span><span class="p">]</span> <span class="p">}</span> <span class="p">}</span> </code></pre></div></div> <p>The reduce stage is an awk one liner that adds the counts by feature. Naively you would run it like this.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">time</span> ./feature_count.awk kddb | <span class="se">\</span> <span class="nb">awk</span> <span class="s1">'{n[$1] +=$2} END {for (i in n) {print i, n[i]}}'</span> <span class="o">&gt;</span> <span class="se">\</span> kddb_feature_count_naive.txt </code></pre></div></div> <p>This takes me 19m50s. 
With <code class="language-plaintext highlighter-rouge">parallel</code> you could do</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">time </span>parallel <span class="nt">--pipepart</span> <span class="nt">--block</span> 10M <span class="nt">-a</span> kddb <span class="nt">-k</span> ./feature_count.awk | <span class="se">\</span> <span class="nb">awk</span> <span class="s1">'{n[$1] +=$2} END {for (i in n) {print i, n[i]}}'</span> <span class="o">&gt;</span> <span class="se">\</span> kddb_feature_count.txt </code></pre></div></div> <p>GNU parallel gives me 4m50s – a 4.1x speedup.</p> <h1 id="what-i-learned-about-the-data">What I learned about the data</h1> <p>A quick analysis of the feature statistics follows.</p> <p>First thing I learned, 72% of the features appear in the training set just once, as in, they appear in just one single feature vector. This is a red flag because normally you’d want features to appear many times for the model to learn something generalizable from them.</p> <p>I’ll do a separate feature cardinality run on the test data set.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/kddb.t.bz2 bunzip2 kddb.t.bz2 ./feature_count.awk kddb.t <span class="o">&gt;</span> kddb.t_feature_count.txt </code></pre></div></div> <p>There are 2,990,384 features in the test set.</p> <p>The superset of all features in the training and test sets is 29,890,095.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># set union for first column of two files </span> <span class="nb">awk</span> <span class="s1">' !($1 in n) {m++; n[$1]} END {print m} '</span> kddb.t_feature_count.txt kddb_feature_count.txt </code></pre></div></div> <p>But only 7% of the test set features appear in the training set.</p> <div 
class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># set intersection for first column of two files </span> <span class="nb">awk</span> <span class="s1">' NR == FNR {n[$1]} NR &gt; FNR &amp;&amp; ($1 in n) {m++} END {print m} '</span> kddb.t_feature_count.txt kddb_feature_count.txt </code></pre></div></div> <p>Even fewer (4%) make more than 10 appearances in the training set.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># set intersection plus filter for first column of two files </span> <span class="nb">awk</span> <span class="nt">-v</span> <span class="nv">min_occurrences</span><span class="o">=</span>10 <span class="s1">' NR == FNR &amp;&amp; $2 &gt; min_occurrences {n[$1]} NR &gt; FNR &amp;&amp; ($1 in n) {m++} END {print m} '</span> kddb_feature_count.txt kddb.t_feature_count.txt </code></pre></div></div> <p>If I were to build an ML model for this task, I would remove features that appear less than 10 times in the training set as a preprocessing step. Here’s a quick way to generate a feature include-list.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># features that appear more than min_occurrences times</span> <span class="nb">awk</span> <span class="nt">-v</span> <span class="nv">min_occurrences</span><span class="o">=</span>10 <span class="s1">'$2 &gt; min_occurrences {print $1} '</span> kddb_feature_count.txt <span class="o">&gt;</span> kddb_feature_include_list.txt </code></pre></div></div> <p>That’s 3,814,194 eligible features in the training set, 13% of the original dimensionality. 
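Applying that include-list as a preprocessing step is simple. Here is a hedged Python sketch of the per-line filter (the helper name is my own, and the file plumbing around it is left out):

```python
# Keep only include-listed features in a LIBSVM-format line
# ("label idx:val idx:val ..."). Hypothetical helper, not from the post.
def filter_libsvm_line(line, include):
    label, *feats = line.split()
    kept = [f for f in feats if f.split(":", 1)[0] in include]
    return " ".join([label] + kept)

include = {"3", "7"}
print(filter_libsvm_line("1 3:0.5 5:1.2 7:0.1", include))  # → 1 3:0.5 7:0.1
```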
This would bring nice speedups to model training and prediction, at no cost to accuracy.</p>Will HighI get a nearly 6x speedup over standard grep by using GNU parallel.Hacking a Serverless Machine-Learning Scoring Microservice with AWS Lambda2017-09-29T07:00:00+00:002017-09-29T07:00:00+00:00https://www.highonscience.com/blog/2017/09/29/ml-scoring-service-on-aws-lambda<p>In this post I’ll attempt to hack a <code class="language-plaintext highlighter-rouge">scikit-learn</code> model prediction microservice with AWS Lambda. This can be called a “serverless machine-learning scoring microservice”. It’s a mouthful o’ buzzwords, but it’s a sneak peek at an exciting advancement in efficient, scalable cloud technology for serving up machine learning model predictions over the web with low operational overhead and high availability for anybody and anything wishing to consume them, like web and mobile apps, dynamic visualizations, Shiny/Django/Rails apps and possibly processes requiring batch predictions. Lambda’s key advances are that you pay for compute time rather than servers, and that cumbersome low-level server management is eliminated.</p> <p>The service works great on my local machine. Lambda has a 50Mb limit on the deployed package size but the smallest I’ve managed to get mine is about 75 Mb, so AWS errors out when I deploy it to the cloud. If I could bump up that limit then I’m all but certain it will work in the cloud just as well, and better under Lambda’s infrastructure for availability.</p> <h1 id="dependencies">Dependencies</h1> <p>The app is pure Python.
The local Python dependencies are</p> <ul> <li><code class="language-plaintext highlighter-rouge">awscli</code> for programmatic AWS interaction.</li> <li><code class="language-plaintext highlighter-rouge">boto3</code> for AWS S3 interaction.</li> <li><code class="language-plaintext highlighter-rouge">chalice</code> to implement RESTful APIs.</li> <li><code class="language-plaintext highlighter-rouge">scikit-learn</code> for machine-learning modeling.</li> <li><code class="language-plaintext highlighter-rouge">scipy</code> is the only explicit additional <code class="language-plaintext highlighter-rouge">scikit-learn</code> dependency needed for the app given the model I trained.</li> <li><code class="language-plaintext highlighter-rouge">virtualenvwrapper</code> for simple Python virtual environment management.</li> </ul> <p>Remote Python dependencies are placed into a <code class="language-plaintext highlighter-rouge">requirements.txt</code> that ultimately contains</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>boto3==1.4.7 scikit-learn[alldeps] </code></pre></div></div> <h1 id="procedure">Procedure</h1> <p>Sign up for AWS free tier.</p> <p>On your local machine, <a href="https://virtualenvwrapper.readthedocs.io/en/latest/">install <code class="language-plaintext highlighter-rouge">virtualenvwrapper</code></a>, then create a virtual environment and install the AWS CLI and <code class="language-plaintext highlighter-rouge">chalice</code>.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mkvirtualenv aws-serverless-microservices pip <span class="nb">install </span>awscli pip <span class="nb">install </span>chalice pip <span class="nb">install </span>boto3 </code></pre></div></div> <p>In the AWS Console create an IAM user for programmatic CLI access with admin privileges, then configure the AWS CLI with your access information.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>aws configure </code></pre></div></div> <p>You’ll be prompted for your AWS region and access keys.</p> <p>Do the <a href="http://chalice.readthedocs.io/en/latest/quickstart.html"><code class="language-plaintext highlighter-rouge">chalice</code> helloworld</a>. When you deploy it you will get the Lambda URL. Set a Bash variable to this URL like this.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">BASEURL</span><span class="o">=</span>https://brycp1llxh.execute-api.us-west-2.amazonaws.com/api </code></pre></div></div> <p>When you deploy in local mode you’ll specify a port.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>chalice <span class="nb">local</span> <span class="nt">--port</span> 5005 </code></pre></div></div> <p>In this case you’ll set the base URL variable differently.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">PORT</span><span class="o">=</span>5005 <span class="nv">BASEURL</span><span class="o">=</span>http://localhost:<span class="nv">$PORT</span> </code></pre></div></div> <p>Test your hello-world endpoint.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nv">$BASEURL</span> </code></pre></div></div> <p>You should get</p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="nl">"hello"</span><span class="p">:</span><span class="w"> </span><span class="s2">"world"</span><span class="p">}</span><span class="w"> </span></code></pre></div></div> <p>if things are set up correctly.
Try <code class="language-plaintext highlighter-rouge">local</code> and <code class="language-plaintext highlighter-rouge">deploy</code> modes.</p> <p>You’ll be storing a static serialized (pickled) <code class="language-plaintext highlighter-rouge">scikit-learn</code> model on AWS S3, so first create an S3 bucket in the S3 console. Then do the <a href="http://chalice.readthedocs.io/en/latest/quickstart.html#tutorial-policy-generation">“policy generation” tutorial</a> to get a basic familiarity with accessing S3 using <code class="language-plaintext highlighter-rouge">boto3</code>. <code class="language-plaintext highlighter-rouge">chalice</code> recommends pinning the boto3 version in <code class="language-plaintext highlighter-rouge">requirements.txt</code>.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip freeze | <span class="nb">grep </span>boto3 <span class="o">&gt;&gt;</span> requirements.txt </code></pre></div></div> <p>Test out the S3 endpoints from the tutorial by putting a simple JSON object into your S3 bucket.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-X</span> PUT <span class="nt">-H</span> <span class="s2">"Content-Type: application/json"</span> <span class="nt">-d</span> <span class="s1">'{"key1":"value"}'</span> <span class="nv">$BASEURL</span>/objects/test.json curl <span class="nt">-X</span> GET <span class="nv">$BASEURL</span>/objects/test.json </code></pre></div></div> <p>If you get your object back, you’ve successfully configured your S3 connection. You’ll also be able to see the object you just created in the bucket using your S3 Management Console. This is where you’ll put your pickled model.</p> <p>On to the machine learning. 
Train a 3-class gradient-boosted decision tree model on the iris data set, using the <a href="http://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html"><code class="language-plaintext highlighter-rouge">scikit-learn</code> logistic regression tutorial</a> as a guide. Pickle the model as <code class="language-plaintext highlighter-rouge">model.pkl</code>. <strong>It doesn’t matter how good this model is</strong> for the purposes of this hack; it just needs to make predictions. Here’s my full model training and serialization script.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">pickle</span> <span class="kn">from</span> <span class="nn">sklearn</span> <span class="kn">import</span> <span class="n">datasets</span> <span class="kn">from</span> <span class="nn">sklearn.ensemble</span> <span class="kn">import</span> <span class="n">GradientBoostingClassifier</span> <span class="c1"># import some data to play with </span><span class="n">iris</span> <span class="o">=</span> <span class="n">datasets</span><span class="p">.</span><span class="n">load_iris</span><span class="p">()</span> <span class="n">X</span> <span class="o">=</span> <span class="n">iris</span><span class="p">.</span><span class="n">data</span><span class="p">[:,</span> <span class="p">:</span><span class="mi">2</span><span class="p">]</span> <span class="c1"># we only take the first two features. 
</span><span class="n">Y</span> <span class="o">=</span> <span class="n">iris</span><span class="p">.</span><span class="n">target</span> <span class="n">clf</span> <span class="o">=</span> <span class="n">GradientBoostingClassifier</span><span class="p">(</span><span class="n">n_estimators</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">learning_rate</span><span class="o">=</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">max_depth</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">0</span><span class="p">).</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="p">)</span> <span class="c1"># make prediction </span><span class="n">preds</span> <span class="o">=</span> <span class="n">clf</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="p">)</span> <span class="n">pickle</span><span class="p">.</span><span class="n">dump</span><span class="p">(</span><span class="n">clf</span><span class="p">,</span> <span class="nb">open</span><span class="p">(</span><span class="s">"model.pkl"</span><span class="p">,</span> <span class="s">"wb"</span><span class="p">))</span> <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'model.pkl'</span><span class="p">,</span> <span class="s">'rb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">myfile</span><span class="p">:</span> <span class="n">r</span> <span class="o">=</span> <span class="n">myfile</span><span class="p">.</span><span class="n">read</span><span class="p">()</span> <span class="n">model_r</span> <span class="o">=</span> <span class="n">pickle</span><span class="p">.</span><span class="n">loads</span><span 
class="p">(</span><span class="n">r</span><span class="p">)</span> <span class="n">preds_r</span> <span class="o">=</span> <span class="n">model_r</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="p">)</span> <span class="k">if</span> <span class="nb">any</span><span class="p">(</span><span class="n">preds</span> <span class="o">-</span> <span class="n">preds_r</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">):</span> <span class="k">raise</span> <span class="nb">Exception</span><span class="p">(</span><span class="s">"serialization error"</span><span class="p">)</span> </code></pre></div></div> <p>Now manually upload <code class="language-plaintext highlighter-rouge">model.pkl</code> to your S3 bucket using the S3 Management Console.</p> <p>Add <code class="language-plaintext highlighter-rouge">scikit-learn</code> to your requirements file.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">echo</span> <span class="s2">"scikit-learn[alldeps]"</span> <span class="o">&gt;&gt;</span> requirements.txt </code></pre></div></div> <p>Here are the contents of my final <code class="language-plaintext highlighter-rouge">app.py</code> file. It contains the original hello-world, the S3 tutorial endpoints, and a prediction endpoint.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">json</span> <span class="kn">import</span> <span class="nn">pickle</span> <span class="kn">import</span> <span class="nn">boto3</span> <span class="kn">from</span> <span class="nn">botocore.exceptions</span> <span class="kn">import</span> <span class="n">ClientError</span> <span class="kn">from</span> <span class="nn">chalice</span> <span class="kn">import</span> <span class="n">Chalice</span><span class="p">,</span> <span class="n">NotFoundError</span> <span class="c1"># Global variables. 
</span><span class="n">S3</span> <span class="o">=</span> <span class="n">boto3</span><span class="p">.</span><span class="n">client</span><span class="p">(</span><span class="s">'s3'</span><span class="p">,</span> <span class="n">region_name</span><span class="o">=</span><span class="s">'us-west-2'</span><span class="p">)</span> <span class="n">BUCKET</span> <span class="o">=</span> <span class="s">'helloworld-model'</span> <span class="n">MODEL_KEY</span> <span class="o">=</span> <span class="s">'model.pkl'</span> <span class="n">OBJECTS</span> <span class="o">=</span> <span class="p">{}</span> <span class="c1"># Global functions. </span><span class="k">def</span> <span class="nf">memoize</span><span class="p">(</span><span class="n">f</span><span class="p">):</span> <span class="n">memo</span> <span class="o">=</span> <span class="p">{}</span> <span class="k">def</span> <span class="nf">helper</span><span class="p">(</span><span class="n">x</span><span class="p">):</span> <span class="k">if</span> <span class="n">x</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">memo</span><span class="p">:</span> <span class="n">memo</span><span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">f</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="k">return</span> <span class="n">memo</span><span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="k">return</span> <span class="n">helper</span> <span class="o">@</span><span class="n">memoize</span> <span class="k">def</span> <span class="nf">get_model</span><span class="p">(</span><span class="n">model_key</span><span class="p">):</span> <span class="k">try</span><span class="p">:</span> <span class="n">response</span> <span class="o">=</span> <span class="n">S3</span><span class="p">.</span><span class="n">get_object</span><span class="p">(</span><span class="n">Bucket</span><span class="o">=</span><span class="n">BUCKET</span><span class="p">,</span> 
<span class="n">Key</span><span class="o">=</span><span class="n">model_key</span><span class="p">)</span> <span class="k">except</span> <span class="n">ClientError</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span> <span class="k">if</span> <span class="n">e</span><span class="p">.</span><span class="n">response</span><span class="p">[</span><span class="s">'Error'</span><span class="p">][</span><span class="s">'Code'</span><span class="p">]</span> <span class="o">==</span> <span class="s">"404"</span><span class="p">:</span> <span class="k">print</span><span class="p">(</span><span class="s">"The object does not exist."</span><span class="p">)</span> <span class="k">else</span><span class="p">:</span> <span class="k">raise</span> <span class="c1"># TODO find a way to persist this model </span> <span class="n">model_str</span> <span class="o">=</span> <span class="n">response</span><span class="p">[</span><span class="s">'Body'</span><span class="p">].</span><span class="n">read</span><span class="p">()</span> <span class="n">model</span> <span class="o">=</span> <span class="n">pickle</span><span class="p">.</span><span class="n">loads</span><span class="p">(</span><span class="n">model_str</span><span class="p">)</span> <span class="k">return</span> <span class="n">model</span> <span class="c1"># Begin chalice app endpoint definitions. 
</span><span class="n">app</span> <span class="o">=</span> <span class="n">Chalice</span><span class="p">(</span><span class="n">app_name</span><span class="o">=</span><span class="s">'helloworld'</span><span class="p">)</span> <span class="o">@</span><span class="n">app</span><span class="p">.</span><span class="n">route</span><span class="p">(</span><span class="s">'/'</span><span class="p">)</span> <span class="k">def</span> <span class="nf">index</span><span class="p">():</span> <span class="k">return</span> <span class="p">{</span><span class="s">'hello'</span><span class="p">:</span> <span class="s">'world'</span><span class="p">}</span> <span class="o">@</span><span class="n">app</span><span class="p">.</span><span class="n">route</span><span class="p">(</span><span class="s">'/objects/{key}'</span><span class="p">,</span> <span class="n">methods</span><span class="o">=</span><span class="p">[</span><span class="s">'GET'</span><span class="p">,</span> <span class="s">'PUT'</span><span class="p">])</span> <span class="k">def</span> <span class="nf">myobject</span><span class="p">(</span><span class="n">key</span><span class="p">):</span> <span class="n">request</span> <span class="o">=</span> <span class="n">app</span><span class="p">.</span><span class="n">current_request</span> <span class="k">if</span> <span class="n">request</span><span class="p">.</span><span class="n">method</span> <span class="o">==</span> <span class="s">'PUT'</span><span class="p">:</span> <span class="n">OBJECTS</span><span class="p">[</span><span class="n">key</span><span class="p">]</span> <span class="o">=</span> <span class="n">request</span><span class="p">.</span><span class="n">json_body</span> <span class="k">elif</span> <span class="n">request</span><span class="p">.</span><span class="n">method</span> <span class="o">==</span> <span class="s">'GET'</span><span class="p">:</span> <span class="k">try</span><span class="p">:</span> <span class="k">return</span> <span 
class="p">{</span><span class="n">key</span><span class="p">:</span> <span class="n">OBJECTS</span><span class="p">[</span><span class="n">key</span><span class="p">]}</span> <span class="k">except</span> <span class="nb">KeyError</span><span class="p">:</span> <span class="k">raise</span> <span class="n">NotFoundError</span><span class="p">(</span><span class="n">key</span><span class="p">)</span> <span class="o">@</span><span class="n">app</span><span class="p">.</span><span class="n">route</span><span class="p">(</span><span class="s">'/predict'</span><span class="p">,</span> <span class="n">methods</span><span class="o">=</span><span class="p">[</span><span class="s">'POST'</span><span class="p">])</span> <span class="k">def</span> <span class="nf">predict</span><span class="p">():</span> <span class="n">request</span> <span class="o">=</span> <span class="n">app</span><span class="p">.</span><span class="n">current_request</span> <span class="k">if</span> <span class="n">request</span><span class="p">.</span><span class="n">method</span> <span class="o">==</span> <span class="s">'POST'</span><span class="p">:</span> <span class="n">result</span> <span class="o">=</span> <span class="p">{}</span> <span class="n">model</span> <span class="o">=</span> <span class="n">get_model</span><span class="p">(</span><span class="n">MODEL_KEY</span><span class="p">)</span> <span class="n">body_dict</span> <span class="o">=</span> <span class="n">request</span><span class="p">.</span><span class="n">json_body</span> <span class="c1"># eg, {"data": [[ 6.2, 3.4]]} </span> <span class="n">data</span> <span class="o">=</span> <span class="n">body_dict</span><span class="p">[</span><span class="s">'data'</span><span class="p">]</span> <span class="c1"># eg, [[ 6.2, 3.4]] </span> <span class="n">pred</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">data</span><span 
class="p">).</span><span class="n">tolist</span><span class="p">()</span> <span class="n">result</span> <span class="o">=</span> <span class="p">{</span><span class="s">'prediction'</span><span class="p">:</span> <span class="n">pred</span><span class="p">}</span> <span class="k">return</span> <span class="n">json</span><span class="p">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">result</span><span class="p">)</span> </code></pre></div></div> <p>Deploying to Lambda will fail (see Appendix), so try it out in local mode and test the prediction endpoint.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-X</span> POST <span class="nt">-H</span> <span class="s2">"Content-Type: application/json"</span> <span class="nt">-d</span> <span class="s1">'{"data":[[6.2, 3.4], [6.2, 1]]}'</span> <span class="nv">$BASEURL</span>/predict </code></pre></div></div> <p>My response is</p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="nl">"prediction"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="mi">1</span><span class="p">]}</span><span class="w"> </span></code></pre></div></div> <p>Run it a second time, and you should notice a significantly faster prediction. This is because the model file is pulled from S3 just the first time and memoized (cached) forever thereafter, eliminating any need for additional S3 data transfer. I’m seeing sub-10ms response time in local mode.</p> <p>And that’s the hack. This is a great way to define and deploy scalable services that use resources efficiently. 
You never have to deal with Ubuntu or Docker, and I’m sure if I paid Amazon money they would increase my Lambda limits.</p> <h1 id="appendix">Appendix</h1> <p>Here’s what that <code class="language-plaintext highlighter-rouge">chalice deploy</code> error looks like.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Creating deployment package. Updating IAM policy for role: helloworld-dev Updating lambda function: helloworld-dev ERROR - While sending your chalice handler code to Lambda to update function "helloworld-dev", received the following error: Connection aborted. Lambda closed the connection before chalice finished sending all of the data. This is likely because the deployment package is 74.8 MB. Lambda only allows deployment packages that are 50.0 MB or less in size. To avoid this error, decrease the size of your chalice application by removing code or removing dependencies from your chalice application. </code></pre></div></div>Will HighIn this post I'll attempt to hack a scikit-learn model prediction microservice with AWS Lambda.Guaranteeing k Samples in Streaming Sampling Without Replacement2017-06-25T07:00:00+00:002017-06-25T07:00:00+00:00https://www.highonscience.com/blog/2017/06/25/guaranteeing-k-samples<h2 id="tldr">tl;dr</h2> <p>When doing streaming sampling without replacement of a finite data set of known size $N$ in Pig, you can do</p> <pre><code class="language-piglatin">data = LOAD 'data' AS (f1:int,f2:int,f3:int); X = SAMPLE data p; </code></pre> <p>for some number $p$ between 0 and 1. If you need $k$ samples, typically you’d naively choose $p = k/N$, <strong>but this only gives you $k$ on average – sometimes less, sometimes more</strong>. <strong>If you must have at least $k$ samples, use</strong></p> <p>\begin{equation} p = \frac{1}{N}\left(k + \frac{1}{2} z^2 + \frac{1}{2}\sqrt{z^2(4k + z^2)}\right). \end{equation}</p> <p>The table below helps in choosing $z$. 
It reads like this: setting $z$ to the specified value guarantees you’ll get at least $k$ about $CL$ of the time for any $k$ larger than 20 or so; for smaller $k$ you’ll do even better.</p> <table> <thead> <tr> <th>$z$</th> <th>$CL$</th> </tr> </thead> <tbody> <tr> <td>$0$</td> <td>$\geq 50\%$</td> </tr> <tr> <td>$1$</td> <td>$\geq 84\%$</td> </tr> <tr> <td>$2$</td> <td>$\geq 98\%$</td> </tr> <tr> <td>$3$</td> <td>$\geq 99.9\%$</td> </tr> <tr> <td>$4$</td> <td>$\geq 99.997\%$</td> </tr> <tr> <td>$5$</td> <td>$\geq 99.99997\%$</td> </tr> </tbody> </table> <p>You’ll get more than $k$ back, so as a final step maybe you’ll randomly shuffle the resulting sample and select the top $k$, assuming $k$ is not enormous.</p> <p>If you’re unlucky enough to get less than $k$ back, try again with a new random seed.</p> <aside class="sidebar__right"> <nav class="toc"> <header><h4 class="nav__title"><i class="fas fa-file-alt"></i> </h4></header> <ul class="toc__menu" id="markdown-toc"> <li><a href="#tldr" id="markdown-toc-tldr">tl;dr</a></li> <li><a href="#problem-statement" id="markdown-toc-problem-statement">Problem statement</a></li> <li><a href="#requirement-1-approximately-k" id="markdown-toc-requirement-1-approximately-k">Requirement 1: approximately $k$</a></li> <li><a href="#requirement-2-at-least-k" id="markdown-toc-requirement-2-at-least-k">Requirement 2: at least $k$</a> <ul> <li><a href="#random-sampling-of-a-big-data-stream-as-a-poisson-process" id="markdown-toc-random-sampling-of-a-big-data-stream-as-a-poisson-process">Random sampling of a big data stream as a Poisson process</a></li> <li><a href="#random-sampling-of-small-data-as-a-bernoulli-process" id="markdown-toc-random-sampling-of-small-data-as-a-bernoulli-process">Random sampling of small data as a Bernoulli process</a></li> <li><a href="#large-lambda" id="markdown-toc-large-lambda">Large $\lambda$</a></li> <li><a href="#monte-carlo-gut-check" id="markdown-toc-monte-carlo-gut-check">Monte Carlo gut 
check</a></li> </ul> </li> <li><a href="#requirement-3-exactly-k" id="markdown-toc-requirement-3-exactly-k">Requirement 3: exactly $k$</a></li> </ul> </nav> </aside> <h2 id="problem-statement">Problem statement</h2> <p>You have a data set of finite size $N$ and you want $k$ random samples without replacement. A computationally efficient procedure is to stream through the data and emit entries with some probability $p$ by generating a uniform random number at each entry and emitting the entry if that number is $\leq p$. Here’s what I just said in pseudocode:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>given table of size N given desired number of samples k let p &lt;= k/N for each row in table if rand_unif() ≤ p emit row end if end for </code></pre></div></div> <p>Pig provides this functionality with</p> <pre><code class="language-piglatin">data = LOAD 'data' AS (f1:int,f2:int,f3:int); X = SAMPLE data 0.001; </code></pre> <p>Hive provides this API</p> <pre><code class="language-hiveql">SELECT * FROM data TABLESAMPLE(0.1 PERCENT) s; </code></pre> <p>but it behaves differently as it gives you a 0.1% size block or more of the table. To replicate the pseudocode behavior maybe you’d do</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT * FROM data WHERE RAND() &lt; 0.001; </code></pre></div></div> <p>These queries are probabilistic insofar as the size of the output is $k = 0.001 N$ only on average. Algorithms in the <a href="https://en.wikipedia.org/wiki/Reservoir_sampling">reservoir sampling</a> family will give you exactly $k$ from a stream, but these are complicated solutions if you’re just issuing Hive/Pig queries during an ad hoc analysis. 
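</p> <p>The pseudocode above is only a few lines of real code in most languages. Here’s a minimal Python sketch (the function and variable names are mine):</p>

```python
import random

def stream_sample(rows, p, seed=None):
    """Emit each row of a stream independently with probability p."""
    rng = random.Random(seed)
    for row in rows:
        if rng.random() <= p:
            yield row

# Aim for k = 100 samples out of N = 100,000 rows.
N, k = 100_000, 100
sample = list(stream_sample(range(N), p=k / N, seed=42))
```

<p>The emitted sample preserves the stream order, and its size is only $k$ on average, which is the point of this post.</p> <p>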
Googling produces language documentation and posts such as <a href="http://www.joefkelley.com/736/">“Random Sampling in Hive”</a>, which underscore the problem.</p> <p>I got to thinking about what happens when you place stronger requirements on the final sample size in this kind of setting. I’ll touch on three requirements.</p> <p><strong>Requirement 1.</strong> Produce approximately $k$ samples.</p> <p><strong>Requirement 2.</strong> Produce at least $k$ samples.</p> <p><strong>Requirement 3.</strong> Produce exactly $k$ samples.</p> <p>My assumptions going into this are</p> <ol> <li>The data set is static, large, distributed and finite.</li> <li>Its size is known up front.</li> <li>It may be ordered or not.</li> <li>There is no key (say, a user hash, or rank) available for me to use, well distributed or otherwise.</li> </ol> <h2 id="requirement-1-approximately-k">Requirement 1: approximately $k$</h2> <p>In this case just use $p = k/N$. On average you get $k$, with variability of about $\pm\sqrt{k}$.</p> <h2 id="requirement-2-at-least-k">Requirement 2: at least $k$</h2> <p>At least $k$ can be guaranteed in one or sometimes (rarely) more passes. I’ll develop a model for the probability of generating at least $k$ in one pass. To strictly guarantee at least $k$ you would have to check the final sample size and, if it undershoots, make another pass through the data, but you can tune the probability such that the chance of this happening is very low.</p> <h3 id="random-sampling-of-a-big-data-stream-as-a-poisson-process">Random sampling of a big data stream as a Poisson process</h3> <p>The probabilistic emission of rows in the pseudocode above can be modeled as a Poisson process. The number of events, or points emitted, over the entire big data set follows a Poisson distribution. 
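</p> <p>A quick simulation makes this concrete: repeat the $p$-sampling pass many times and check that the emitted count has mean and variance both close to $\lambda = Np$, the Poisson signature. This is an illustrative sketch with made-up sizes:</p>

```python
import random

rng = random.Random(0)
N, p, passes = 5_000, 0.02, 600
lam = N * p  # expected count per pass, here 100

# One Bernoulli draw per row, repeated over many passes.
counts = [sum(rng.random() <= p for _ in range(N)) for _ in range(passes)]
mean = sum(counts) / passes
var = sum((c - mean) ** 2 for c in counts) / passes
# Both mean and var should land near lam.
```

<p>Both statistics should come out close to $\lambda = 100$.</p> <p>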
The Poisson distribution, $\mathrm{Poisson}(\lambda)$, is fully described by a single parameter $\lambda$, which is the mean rate of events over some fixed amount of time, or in our case, the mean number of samples over one pass. That is, $\lambda = Np$ in one pass over the full data set.</p> <p>The expectation of the number of samples $X$ is $\mathrm{E}(X) = \lambda$, and the variance is $\mathrm{Var}(X) = \lambda$.</p> <h3 id="random-sampling-of-small-data-as-a-bernoulli-process">Random sampling of small data as a Bernoulli process</h3> <p>Despite all the hype it’s still sometimes fun to think about not-big data. In this case you can think about random sampling without replacement as a Bernoulli process, so the number of emitted points is distributed as $\mathrm{Binomial}(N, p)$. You’re doing $N$ coin tosses with a coin whose heads probability is $p$.</p> <p>In the limit of large $N$ and fixed $p$, $\mathrm{Binomial}(N, p) \to \mathrm{Normal}(\mu = Np, \sigma^2 = Np(1-p))$. If $p$ is also small, $\mathrm{Binomial}(N, p) \to \mathrm{Normal}(\mu = Np, \sigma^2 = Np)$.</p> <p>In the limit of large $N$ and small $p$, $\mathrm{Binomial}(N, p) \to \mathrm{Poisson}(\lambda = Np)$, which is what I’ve already described in the previous section.</p> <p>You might guess by the transitive rule that $\mathrm{Poisson}(\lambda = Np) \to \mathrm{Normal}(\mu = \lambda, \sigma^2 = \lambda)$ when $Np$ is large and $p$ is small. This is what I’ll talk about next.</p> <h3 id="large-lambda">Large $\lambda$</h3> <p>When $\lambda$ is large the Poisson distribution converges to a normal distribution with mean $\lambda$ and variance $\lambda$. $\lambda$ can be as small as 20 for $\mathrm{Normal}(\lambda,\lambda)$ to be a good approximation. This is convenient because all of the usual Gaussian statistics can be applied.</p> <p>How often you get at least $k$ samples is described by a one-sided $z$-statistic and can be read off of a standard $z$-score table. 
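</p> <p>Those one-sided table values are just areas under the standard normal CDF, so they’re easy to reproduce with <code class="language-plaintext highlighter-rouge">math.erf</code> (helper name is mine):</p>

```python
from math import erf, sqrt

def cl_from_z(z):
    """One-sided confidence level: area under the standard normal below z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

for z in range(4):
    print(z, f"{100 * cl_from_z(z):.1f}%")  # 50.0%, 84.1%, 97.7%, 99.9%
```

<p>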
$z$ is the number of standard deviations from the mean. The probability of getting at least $k$ samples is $CL = 84\%$ at $z = 1$. Here are four such sets of useful values, with illustrations.</p> <p>$z$ table: area under $\mathrm{Normal}(\lambda,\lambda)$ above $k = \lambda - z\sqrt{\lambda}$.</p> <table> <thead> <tr> <th>$z$</th> <th>$CL$</th> <th>Illustration</th> </tr> </thead> <tbody> <tr> <td>$0$</td> <td>$50.0\%$</td> <td><img src="/assets/guaranteeing-k-samples/cl505.png" alt="" title="50.0% confidence limit" width="" /></td> </tr> <tr> <td>$1$</td> <td>$84.1\%$</td> <td><img src="/assets/guaranteeing-k-samples/cl843.png" alt="" title="84.1% confidence limit" width="" /></td> </tr> <tr> <td>$2$</td> <td>$97.7\%$</td> <td><img src="/assets/guaranteeing-k-samples/cl983.png" alt="" title="97.7% confidence limit" width="" /></td> </tr> <tr> <td>$3$</td> <td>$99.9\%$</td> <td><img src="/assets/guaranteeing-k-samples/cl99_93.png" alt="" title="99.9% confidence limit" width="" /></td> </tr> </tbody> </table> <p>To guarantee at least $k$ samples in a fraction $CL$ of your queries you’d choose $p = \lambda(k,z)/N$, where $\lambda(k,z)$ solves $\lambda - z\sqrt{\lambda} = k$. In other words, you’d choose a rate $p = \lambda/N$ such that $k$ is $z$ standard deviations below $\lambda$. The (useful) solution is $\lambda(k,z) = k + \frac{1}{2} z^2 + \frac{1}{2}\sqrt{z^2(4k + z^2)}$.</p> <h3 id="monte-carlo-gut-check">Monte Carlo gut check</h3> <p>To prove to myself the math is right, I’ll run 8 parallel Monte Carlo simulations of 10,000 iterations each in bash. 
I’ll try to sample $k=100$ out of $N=1000$ for different values of $z$.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function </span>monte_carlo <span class="o">()</span> <span class="o">{</span> <span class="nb">awk</span> <span class="s2">"</span><span class="nv">$@</span><span class="s2">"</span> <span class="nt">-v</span> <span class="nv">seed</span><span class="o">=</span><span class="nv">$RANDOM</span> <span class="s1">' BEGIN { lam = k + 0.5*z**2 + 0.5*sqrt(z**2*(4.0*k + z**2)) p = lam/n_data srand(seed) for (j=1;j&lt;=n_experiments;j++) { emit=0 for (i=1;i&lt;=n_data;i++) { if (rand()&lt;=p) { emit++ } } if (emit&gt;=k) { n++ } } print seed,k,lam,p,n_data,n }'</span> <span class="p">;</span> <span class="o">}</span> <span class="nb">export</span> <span class="nt">-f</span> monte_carlo parallel <span class="nt">-N0</span> monte_carlo <span class="nt">-v</span> <span class="nv">n_experiments</span><span class="o">=</span>1e4 <span class="nt">-v</span> <span class="nv">n_data</span><span class="o">=</span>1e3 <span class="nt">-v</span> <span class="nv">k</span><span class="o">=</span>100 <span class="nt">-v</span> <span class="nv">z</span><span class="o">=</span>0 ::: 1 2 3 4 5 6 7 8 </code></pre></div></div> <p>In one of the runs at $CL = 84\%$ I got at least $k = 100$ samples 8,664 out of 10,000 times. In this case, $\lambda = 110.5$. 
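</p> <p>As a sanity check on the algebra, the closed-form $\lambda(k,z)$ really does solve $\lambda - z\sqrt{\lambda} = k$ and reproduces the $\lambda = 110.5$ used in that run (a quick Python check; the function name is mine):</p>

```python
from math import sqrt

def lam(k, z):
    """Solve lam - z * sqrt(lam) = k for lam (the useful root)."""
    return k + 0.5 * z**2 + 0.5 * sqrt(z**2 * (4 * k + z**2))

value = lam(100, 1)
print(round(value, 1))  # 110.5
assert abs(value - 1 * sqrt(value) - 100) < 1e-9  # k sits one sigma below lam
print(round(lam(100, 3), 1))  # 134.8, matching the z = 3 row in the table below
```

<p>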
Here’s a table of typical results at different $z$ value settings.</p> <p>Monte Carlo runs of 10,000 iterations each, setting $k = 100$.</p> <table> <thead> <tr> <th>$z$</th> <th>$CL$</th> <th>num $\geq k$</th> <th>$\lambda$</th> </tr> </thead> <tbody> <tr> <td>$0$</td> <td>$\geq 50\%$</td> <td>$5\,183$</td> <td>$100.0$</td> </tr> <tr> <td>$1$</td> <td>$\geq 84\%$</td> <td>$8\,664$</td> <td>$110.5$</td> </tr> <tr> <td>$2$</td> <td>$\geq 98\%$</td> <td>$9\,846$</td> <td>$122.1$</td> </tr> <tr> <td>$3$</td> <td>$\geq 99.9\%$</td> <td>$9\,998$</td> <td>$134.8$</td> </tr> <tr> <td>$4$</td> <td>$\geq 99.997\%$</td> <td>$10\,000$</td> <td>$148.8$</td> </tr> </tbody> </table> <p>It’s clear from the simulations that the true confidence is slightly higher than advertised, but this is expected. There are two sources of bias: finite $k$ in the central limit theorem approximation, and discreteness of the random variable at $k$. Given how I implemented the Monte Carlo, both push the true confidence higher, so the stated lower limit on confidence holds.</p> <p>The “guarantee” of at least $k$ in a single pass is a probabilistic one, and it implies that at, say, a $CL = 99.9\%$ specification I would have to go over the data set a second time roughly 1 out of every $(1-CL)^{-1} = 1\,000$ times that I undertook this whole exercise. At this specification the need to rerun is rare, but it will eventually happen. When it does, I’d have to go through the full data set again with a smaller $p$ to get a fair random sample, specifically, I would reapply the same rule for a new $\lambda(k\to k-k_1,z)$, where $k_1$ is the actual sample size that the first iteration yielded me. It is even rarer that I’d have to do a third pass at $CL = 99.9\%$. This case happens one in a million times.</p> <p>It’s fairly obvious to me just thinking about it that it’s better to set $CL$ high and try to do just one pass than it is to set $CL$ low and to do multiple passes. 
For example, if $CL = 50\%$ ($z = 0$) then nearly half the time I’d be rerunning twice or more times to build up a fair sample. Passes over big data are expensive as it is, so it’s better to eat $k_1 - k$ too many samples in one pass than to have to do additional passes on the data.</p> <h2 id="requirement-3-exactly-k">Requirement 3: exactly $k$</h2> <p>Run the above, randomly shuffle and pick out the top $k$.</p> <p>If you get less than $k$ you were just very unlucky. Run the whole thing again with a different random seed.</p> <p>You may also consider implementing a reservoir sampler, but this is more work than is needed.</p>Will HighIf you need $k$ samples out of $N$ in Hive or Pig, typically you'd naively choose $p = k/N$, but this only gives you $k$ on average.The Streaming Distributed Bootstrap2017-06-15T07:05:20+00:002017-06-15T07:05:20+00:00https://www.highonscience.com/blog/2017/06/15/streaming-distributed-bootstrap<aside class="sidebar__right"> <nav class="toc"> <header><h4 class="nav__title"><i class="fas fa-file-alt"></i> </h4></header> <ul class="toc__menu" id="markdown-toc"> <li><a href="#refresher-the-standard-bootstrap" id="markdown-toc-refresher-the-standard-bootstrap">Refresher: the standard bootstrap</a></li> <li><a href="#thinking-more-deeply-about-the-resampling" id="markdown-toc-thinking-more-deeply-about-the-resampling">Thinking more deeply about the resampling</a></li> <li><a href="#the-leap-to-big-data-the-poisson-trick" id="markdown-toc-the-leap-to-big-data-the-poisson-trick">The leap to big data: the Poisson trick</a></li> <li><a href="#a-streaming-bootstrap-of-the-mean" id="markdown-toc-a-streaming-bootstrap-of-the-mean">A streaming bootstrap of the mean</a></li> <li><a href="#distributing-it" id="markdown-toc-distributing-it">Distributing it</a></li> <li><a href="#testing-it-on-the-twitter-firehose" id="markdown-toc-testing-it-on-the-twitter-firehose">Testing it on the Twitter firehose</a></li> <li><a href="#summary" 
id="markdown-toc-summary">Summary</a></li> <li><a href="#references" id="markdown-toc-references">References</a></li> </ul> </nav> </aside> <p>The <a href="https://en.wikipedia.org/wiki/Bootstrapping_(statistics)">bootstrap</a> (Efron 1979) is an incredibly practical method to estimate uncertainty from finite sampling on almost any quantity of interest. If, say, you’re training a model using just 30 training examples, you’ll likely want to know how uncertain your goodness-of-fit metric is. Is your AUC statistically consistent with 0.5? That’d be key to know, and you could estimate it with the bootstrap.</p> <p>The standard bootstrap, however, does not scale well to big data, and for unbounded data streams it’s in fact not well defined. The standard bootstrap assumes that all your data is locally available and static, that it fits into primary memory, and that your metric of interest (AUC in the above example) is easy to compute.</p> <p>There has been a lot of exciting new research around scaling the bootstrap to unbounded and distributed data. The <a href="http://arxiv.org/abs/1112.5016">Bag of Little Bootstraps</a> paper distributes the bootstrap, but issues a standard “static” bootstrap on each thread, so it doesn’t solve the unbounded data problem. The <a href="http://arxiv.org/abs/1312.5021">Vowpal Wabbit paper</a> solves the unbounded data problem, but on a single thread. A <a href="http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43157.pdf">Google paper</a> puts together a bootstrap that is both streaming and distributed.</p> <p>The streaming distributed bootstrap is a really fun solution, and I’ve mocked up a Python package to test it out. 
In this article, I’m going to assume you’re already a fairly technical person who understands why you’d want to estimate uncertainty on a big data application.</p> <p>At the end of this post I’ll set loose a streaming bootstrap on the Twitter firehose, computing the mean tweet rate on top trending terms at the time I ran it, with streaming one-sigma error bands.</p> <p>I’ll be using the following R libraries and global settings for the R snippets.</p> <div class="gist-embed-link"> <!-- <a href="https://gist.github.com/fwhigh/a68eb626018abb2985af1c2d8b7b93c1#file-bootstrap-r"> gist</a> <a href="https://gist.github.com/fwhigh/a68eb626018abb2985af1c2d8b7b93c1/raw/bootstrap.R"> raw</a> --> <code class="gist-embed-code" data-gist-id="fwhigh/a68eb626018abb2985af1c2d8b7b93c1" data-gist-file="bootstrap.R" data-gist-hide-footer="false" data-gist-show-spinner="true" gist-enable-cache="true" data-gist-line="1-9"></code> </div> <h2 id="refresher-the-standard-bootstrap">Refresher: the standard bootstrap</h2> <p>Let’s start with a fake data set of $N=30$ points, and let’s say the data is drawn from a random uniform distribution.</p> <div class="gist-embed-link"> <!-- <a href="https://gist.github.com/fwhigh/a68eb626018abb2985af1c2d8b7b93c1#file-bootstrap-r"> gist</a> <a href="https://gist.github.com/fwhigh/a68eb626018abb2985af1c2d8b7b93c1/raw/bootstrap.R"> raw</a> --> <code class="gist-embed-code" data-gist-id="fwhigh/a68eb626018abb2985af1c2d8b7b93c1" data-gist-file="bootstrap.R" data-gist-hide-footer="false" data-gist-show-spinner="true" gist-enable-cache="true" data-gist-line="11-13"></code> </div> <p>Here’s what it looks like.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> index count value 1: 1 1 0.23666352 2: 2 1 0.17972139 3: 3 1 0.04451190 4: 4 1 0.94861152 5: 5 1 0.97012678 6: 6 1 0.11215551 7: 7 1 0.81616358 8: 8 1 0.71385778 9: 9 1 0.54395965 10: 10 1 0.03084962 11: 11 1 0.58808657 12: 12 1 0.03196848 13: 13 1 
0.78770594 14: 14 1 0.57319726 15: 15 1 0.23914206 16: 16 1 0.45210949 17: 17 1 0.74648136 18: 18 1 0.76919459 19: 19 1 0.50524423 20: 20 1 0.68976405 21: 21 1 0.88024924 22: 22 1 0.52815155 23: 23 1 0.03672133 24: 24 1 0.16118379 25: 25 1 0.23268336 26: 26 1 0.51450148 27: 27 1 0.18569415 28: 28 1 0.54663857 29: 29 1 0.89967953 30: 30 1 0.34810339 index count value </code></pre></div></div> <p>The index identifies the data point, the count is simply the number of times that data point appears, and the value is “the data.” The mean of the data is 0.48 and the standard error on the mean is 0.05.</p> <p>I’m going to start by histogramming the counts of each index. This is a trivial histogram: there’s just one of everything.</p> <figure> <a href="/assets/streaming-distributed-bootstrap/example_raw_count_hist.png"><img width="70%" src="/assets/streaming-distributed-bootstrap/example_raw_count_hist.png" /></a> <figcaption width="70%">Trivial histogram of the counts of data points.</figcaption> </figure> <p>Now the bootstrap procedure is</p> <ol> <li>resample the $N$ data points $N$ times with replacement,</li> <li>compute your quantity of interest on the resampled data exactly as if it were the original data, and</li> <li>repeat hundreds or thousands of times.</li> </ol> <p>The resulting distribution of the quantity of interest is an empirical estimate of the sampling distribution of that quantity. This means the mean of the distro is an estimate of the quantity of interest itself, and the standard deviation is an estimate of the <em>standard error</em> of that quantity. That last point is tricky and worth memorizing to impress your statistics friends at parties. (Just kidding, statisticians don’t go to parties!)</p> <p>Here’s an implementation of the bootstrap for this data set. 
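The three-step procedure above takes only a few lines in any language. Here’s a minimal sketch in Python with NumPy, using fresh stand-in data rather than the R gist’s data set:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=30)  # stand-in for the N = 30 uniform data points

# 1. resample the N points N times with replacement,
# 2. compute the quantity of interest (here, the mean) on the resample,
# 3. repeat thousands of times.
boot_means = np.array(
    [rng.choice(x, size=x.size, replace=True).mean() for _ in range(10_000)]
)

# The mean of the bootstrap distribution estimates the quantity itself,
# and its standard deviation estimates the standard error of the mean.
print(boot_means.mean(), boot_means.std(ddof=1))
```

The standard deviation of `boot_means` should land close to the analytical standard error $s/\sqrt{N}$.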
The quantity of interest in this example is the mean of the data.</p> <div class="gist-embed-link"> <!-- <a href="https://gist.github.com/fwhigh/a68eb626018abb2985af1c2d8b7b93c1#file-bootstrap-r"> gist</a> <a href="https://gist.github.com/fwhigh/a68eb626018abb2985af1c2d8b7b93c1/raw/bootstrap.R"> raw</a> --> <code class="gist-embed-code" data-gist-id="fwhigh/a68eb626018abb2985af1c2d8b7b93c1" data-gist-file="bootstrap.R" data-gist-hide-footer="false" data-gist-show-spinner="true" gist-enable-cache="true" data-gist-line="22-33"></code> </div> <p>My session gives me</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>compare bootstrap mean 0.477468987332949 to sample quantity of interest 0.477104055796129 compare bootstrap standard deviation 0.0547528263387387 to the sample standard error of the quantity of interest 0.0555938381440978 </code></pre></div></div> <p>So both the bootstrap mean and the data mean are consistent with the population mean of 0.5 within one standard error of the mean, and the analytical estimator of the standard error is consistent with the bootstrap standard deviation. All is good.</p> <h2 id="thinking-more-deeply-about-the-resampling">Thinking more deeply about the resampling</h2> <p>Now it gets interesting. It turns out that step #1 in the bootstrap procedure can be thought of as rolling an unbiased $N$-sided die $N$ times, and counting the number of times each face (index) comes up. This is a result from elementary statistics: the vector of face counts is distributed as</p> <p>\begin{equation} \mathrm{Multinomial}(N,\pmb{p}=(1/N,1/N,\dots,1/N)), \end{equation}</p> <p>where $\pmb{p}$ is a vector of $N$ probabilities. 
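To make the die-rolling picture concrete, here’s a quick Python sketch (stand-in data, not the gist’s): one bootstrap resample is a single multinomial draw of face counts, and the resampled statistic is just a count-weighted statistic.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 30
x = rng.uniform(size=N)  # stand-in data

# One bootstrap resample: roll an unbiased N-sided die N times and
# count how often each face (data point index) comes up.
counts = rng.multinomial(N, np.full(N, 1.0 / N))

# The count-weighted mean is exactly the mean of the resampled data.
resampled_mean = (counts * x).sum() / counts.sum()
print(counts.sum(), resampled_mean)  # the counts always sum to N
```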
I’m going to simulate one single bootstrap iteration like this:</p> <div class="gist-embed-link"> <!-- <a href="https://gist.github.com/fwhigh/a68eb626018abb2985af1c2d8b7b93c1#file-bootstrap-r"> gist</a> <a href="https://gist.github.com/fwhigh/a68eb626018abb2985af1c2d8b7b93c1/raw/bootstrap.R"> raw</a> --> <code class="gist-embed-code" data-gist-id="fwhigh/a68eb626018abb2985af1c2d8b7b93c1" data-gist-file="bootstrap.R" data-gist-hide-footer="false" data-gist-show-spinner="true" gist-enable-cache="true" data-gist-line="35-36"></code> </div> <p>Here’s a histogram of the number of times each data point comes up in this iteration:</p> <figure> <a href="/assets/streaming-distributed-bootstrap/example_resample_hist.png"><img width="70%" src="/assets/streaming-distributed-bootstrap/example_resample_hist.png" /></a> <figcaption width="70%">Histogram of resample counts.</figcaption> </figure> <p>A bunch of points are resampled zero times, the 12th and 26th data points are resampled 3 times, and everything else is in between. I already know the resample counts are Multinomially distributed, but here’s a slightly different question: <em>over the course of a full bootstrap simulation, what’s the distribution of the sample rate of just the first data point?</em></p> <p>This time the answer comes from thinking about coin tosses. In one single draw in one bootstrap iteration, the chance that the first data point will be drawn is $1/N$. This is like flipping a coin with bias $p=1/N$. $N$ draws with replacement in one bootstrap iteration are like flipping that coin $N$ times, so the number of times the first data point comes up in one bootstrap iteration is distributed as $\mathrm{Binomial}(N,1/N)$.</p> <p>You can simulate this effect for 10000 bootstrap iterations in R by picking out the first row of 10000 Multinomial random number draws. 
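The same check is easy to sketch in Python (separate from the R gist): take the first coordinate of many multinomial draws and compare its frequency table to the $\mathrm{Binomial}(N,1/N)$ pmf.

```python
import math
import numpy as np

rng = np.random.default_rng(2)
N, iters = 30, 10_000

# How often data point 1 is resampled in each of 10,000 bootstrap
# iterations: the first coordinate of each Multinomial draw.
first = rng.multinomial(N, np.full(N, 1.0 / N), size=iters)[:, 0]

# Compare observed counts at each frequency to the Binomial(N, 1/N) pmf.
for f in range(int(first.max()) + 1):
    pmf = math.comb(N, f) * (1 / N) ** f * (1 - 1 / N) ** (N - f)
    print(f, int((first == f).sum()), round(iters * pmf, 1))
```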
I’m going to do this, then count the number of times each draw frequency occurs per bootstrap iteration, then compare to the Binomial random draw.</p> <div class="gist-embed-link"> <!-- <a href="https://gist.github.com/fwhigh/a68eb626018abb2985af1c2d8b7b93c1#file-bootstrap-r"> gist</a> <a href="https://gist.github.com/fwhigh/a68eb626018abb2985af1c2d8b7b93c1/raw/bootstrap.R"> raw</a> --> <code class="gist-embed-code" data-gist-id="fwhigh/a68eb626018abb2985af1c2d8b7b93c1" data-gist-file="bootstrap.R" data-gist-hide-footer="false" data-gist-show-spinner="true" gist-enable-cache="true" data-gist-line="45-51"></code> </div> <p>Let’s take a look:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> frequency count binomial 1: 0 3630 3678.610464 2: 1 3698 3678.978362 3: 2 1907 1839.489181 4: 3 607 613.101738 5: 4 135 153.244776 6: 5 17 30.639760 7: 6 6 5.104584 </code></pre></div></div> <p>Math works.</p> <h2 id="the-leap-to-big-data-the-poisson-trick">The leap to big data: the Poisson trick</h2> <p>So far I’ve been dealing with small data, at $N = 30$. With big data $N$ is either very large or, more often, unknown. Let’s think about what happens when $N$ gets huge.</p> <p>As $N$ grows the probability $p = 1/N$ of drawing the first data point is getting tiny, but the number of draws in a single bootstrap iteration is getting huge with $N$. It turns out this process converges, and the number of times you see the first data point is Poisson distributed with mean 1. That is to say,</p> <p>\begin{equation} \mathrm{Binomial}(N,1/N) \xrightarrow{N\to\infty} \mathrm{Poisson}(1). 
\end{equation}</p> <p><strong>The resampling procedure is independent of $N$!</strong> This was purely luck, and it means I don’t need to know the size of the full data set to bootstrap uncertainty estimates, and in fact if $N$ is large enough the bootstrapping procedure is nearly exact.</p> <p>Let’s take a look.</p> <div class="gist-embed-link"> <!-- <a href="https://gist.github.com/fwhigh/a68eb626018abb2985af1c2d8b7b93c1#file-bootstrap-r"> gist</a> <a href="https://gist.github.com/fwhigh/a68eb626018abb2985af1c2d8b7b93c1/raw/bootstrap.R"> raw</a> --> <code class="gist-embed-code" data-gist-id="fwhigh/a68eb626018abb2985af1c2d8b7b93c1" data-gist-file="bootstrap.R" data-gist-hide-footer="false" data-gist-show-spinner="true" gist-enable-cache="true" data-gist-line="53-60"></code> </div> <figure> <a href="/assets/streaming-distributed-bootstrap/first_index_freq_pois.png"><img width="70%" src="/assets/streaming-distributed-bootstrap/first_index_freq_pois.png" /></a> <figcaption width="70%">Comparing Poisson drawn, Binomial drawn and actual resampled counts.</figcaption> </figure> <p>The agreement between the Binomial and Poisson distributions is already extremely good at just $N=30$, and it only gets better with large $N$. Because the Poisson distribution is independent of the parameter $N$, I can turn the entire bootstrap process sideways: I’ll run 10000 bootstraps on each data point as it arrives in the stream, without regard to what will stream in later, as long as I can define an online update rule for the metric of interest.</p> <h2 id="a-streaming-bootstrap-of-the-mean">A streaming bootstrap of the mean</h2> <p>The mean as the quantity of interest is a useful example because it has a simple update rule. 
Pseudocode of the online weighted mean update rule is as follows.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Algorithm: WeightedMeanUpdate Input: Current mean, theta0 Aggregate weight of current mean, w0 New data, X Weight of new data, W Output: Updated mean, theta1 Aggregate weight of updated mean, w1 theta1 = (w0*theta0 + W*X)/(w0 + W) w1 = w0 + W return (theta1,w1) </code></pre></div></div> <p>When <code class="language-plaintext highlighter-rouge">W</code> = 1, this reduces to an online unweighted mean update rule. For the very first data point, the mean and aggregate weight are set to 0.</p> <p>Using the Poisson trick, the streaming serial bootstrap algorithm is:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Algorithm: StreamingSerialBootstrap Input: Unbounded set of data-weight tuples {(X,w),(X,w),...} Number of bootstrap realizations, r Output: Estimator value in each bootstrap realization, thetaInner[k], k in {1,2,..,r} // initialize for each k in 1 to r do thetaInner[k] = 0 wInner[k] = 0 end // data arrives from a stream for each i in stream // do r bootstrap iterations // approximate resampling with replacement with the Poisson trick for k in 1 to r do weight = w[i]*PoissonRandom(1) (thetaInner[k],wInner[k]) = InnerOnlineUpdate(thetaInner[k],wInner[k],X[i],weight) end end </code></pre></div></div> <p>The InnerOnlineUpdate must be set to <code class="language-plaintext highlighter-rouge">WeightedMeanUpdate</code> for the mean as the quantity of interest. <code class="language-plaintext highlighter-rouge">thetaInner</code> is an array representing a running estimate of the sample distribution of the mean at the $i$-th data point.</p> <p>I’ve mocked this up in a Python package called <a href="https://github.com/fwhigh/sdbootstrap"><code class="language-plaintext highlighter-rouge">sdbootstrap</code></a>. 
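Here’s a minimal runnable Python sketch of that algorithm, with the weighted mean as the inner update (illustrative only; the function names are mine and this is not the sdbootstrap package’s actual API):

```python
import numpy as np

def weighted_mean_update(theta0, w0, x, w):
    # WeightedMeanUpdate: fold one weighted observation into a running mean.
    w1 = w0 + w
    theta1 = (w0 * theta0 + w * x) / w1 if w1 > 0 else 0.0
    return theta1, w1

def streaming_serial_bootstrap(stream, r=1000, seed=0):
    # Maintain r bootstrap replicates of the mean over a stream of
    # (x, w) tuples, approximating resampling with the Poisson trick.
    rng = np.random.default_rng(seed)
    theta = np.zeros(r)  # running replicate means
    wsum = np.zeros(r)   # running replicate aggregate weights
    for x, w in stream:
        for k in range(r):
            weight = w * rng.poisson(1.0)
            theta[k], wsum[k] = weighted_mean_update(theta[k], wsum[k], x, weight)
    return theta

data_rng = np.random.default_rng(42)
theta = streaming_serial_bootstrap((x, 1.0) for x in data_rng.uniform(size=30))
# Mean of the replicates estimates the mean; their spread, its standard error.
print(theta.mean(), theta.std(ddof=1))
```

Note that the stream is consumed once, with no reference to its total length.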
First export the R data to file.</p> <div class="gist-embed-link"> <!-- <a href="https://gist.github.com/fwhigh/a68eb626018abb2985af1c2d8b7b93c1#file-bootstrap-r"> gist</a> <a href="https://gist.github.com/fwhigh/a68eb626018abb2985af1c2d8b7b93c1/raw/bootstrap.R"> raw</a> --> <code class="gist-embed-code" data-gist-id="fwhigh/a68eb626018abb2985af1c2d8b7b93c1" data-gist-file="bootstrap.R" data-gist-hide-footer="false" data-gist-show-spinner="true" gist-enable-cache="true" data-gist-line="62"></code> </div> <p>Then in the terminal,</p> <div class="gist-embed-link"> <!-- <a href="https://gist.github.com/fwhigh/a68eb626018abb2985af1c2d8b7b93c1#file-streaming_bootstrap-sh"> gist</a> <a href="https://gist.github.com/fwhigh/a68eb626018abb2985af1c2d8b7b93c1/raw/streaming_bootstrap.sh"> raw</a> --> <code class="gist-embed-code" data-gist-id="fwhigh/a68eb626018abb2985af1c2d8b7b93c1" data-gist-file="streaming_bootstrap.sh" data-gist-hide-footer="false" data-gist-show-spinner="true" gist-enable-cache="true" data-gist-line="3-6"></code> </div> <p>The output is</p> <ul> <li>master update ID, which is the timestamp of the update,</li> <li>bootstrap iteration ID,</li> <li>quantity of interest,</li> <li>number of total resamples</li> </ul> <p>Here are my top 10 lines:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1497630831.95 0 0.498431696583 35.0 1497630831.95 1 0.497802280908 32.0 1497630831.95 2 0.52108250027 34.0 1497630831.95 3 0.52590605129 34.0 1497630831.95 4 0.396820495147 26.0 1497630831.95 5 0.552172223237 32.0 1497630831.95 6 0.494006591897 29.0 1497630831.95 7 0.491453935547 27.0 1497630831.95 8 0.524614127751 31.0 1497630831.95 9 0.49028054256 25.0 </code></pre></div></div> <p>The bootstrap mean (mean of column 3) is 0.4784 and the bootstrap standard error (standard deviation of column 3) is 0.0551, both close to the regular bootstrap values 0.4775 and 0.0548. 
So this seems to work fine even on just 30 data points.</p> <h2 id="distributing-it">Distributing it</h2> <p>Here’s a picture of what I’m about to construct.</p> <figure> <a href="/assets/streaming-distributed-bootstrap/streaming_distributed_bootstrap_figure.png"><img width="70%" src="/assets/streaming-distributed-bootstrap/streaming_distributed_bootstrap_figure.png" /></a> <figcaption width="70%">Streaming distributed bootstrapping.</figcaption> </figure> <p>I’ll do it by way of example. Let’s generate $N = 10000$ data points uniformly distributed between 0 and 1 with awk’s random number generator.</p> <div class="gist-embed-link"> <!-- <a href="https://gist.github.com/fwhigh/a68eb626018abb2985af1c2d8b7b93c1#file-streaming_bootstrap-sh"> gist</a> <a href="https://gist.github.com/fwhigh/a68eb626018abb2985af1c2d8b7b93c1/raw/streaming_bootstrap.sh"> raw</a> --> <code class="gist-embed-code" data-gist-id="fwhigh/a68eb626018abb2985af1c2d8b7b93c1" data-gist-file="streaming_bootstrap.sh" data-gist-hide-footer="false" data-gist-show-spinner="true" gist-enable-cache="true" data-gist-line="17-22"></code> </div> <p>My top 10 entries are</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0.237788 1 0.291066 1 0.845814 1 0.152208 1 0.585537 1 0.193475 1 0.810623 1 0.173531 1 0.484983 1 0.151863 1 </code></pre></div></div> <p>A traditional bootstrap at 10000 iterations in R gives me</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>compare bootstrap mean 0.498704013190406 to sample quantity of interest 0.49865821011823 compare bootstrap standard deviation 0.00287165409456161 to the sample standard error of the quantity of interest 0.00288191093375261 </code></pre></div></div> <p>The Bag of Little Bootstrap authors pointed out that you can multithread the bootstrap by doing a bunch of independent bootstraps and collecting the results. 
Their approach is to over-resample the data and summarize immediately in each thread, then collect, but I want to maintain the full bootstrap distribution so I’ll do it a little differently. And importantly, I’ll do it for unbounded data: each thread will do the job of the streaming bootstrap in the previous section, and it won’t know how big the full data set is.</p> <p>This algo reuses the above <code class="language-plaintext highlighter-rouge">StreamingSerialBootstrap</code> in what I’m calling the inner bootstrap, and also implements an outer bootstrap procedure that collects each bootstrap iteration’s current state and does an aggregation to produce a master bootstrap distribution.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Algorithm: StreamingDistributedBootstrap Input: Unbounded set of data-weight tuples {(X,w),(X,w),...} Number of bootstrap realizations, r Number of nodes, s Output: Estimator value in each bootstrap realization, thetaMaster[k], k in {1,2,..,r} // initialize for each k in 1 to r do thetaMaster[k] = 0 wMaster[k] = 0 for each j in 1 to s do thetaInner[j,k] = 0 wInner[j,k] = 0 end end // data arrives from a stream for each i in stream Assign tuple (X[i],w[i]) to node j in {1,2,...,s} // inner bootstrap // do r bootstrap iterations // approximate resampling with replacement with the Poisson trick for k in 1 to r do weight = w[i]*PoissonRandom(1) (thetaInner[j,k],wInner[j,k]) = InnerOnlineUpdate(thetaInner[j,k],wInner[j,k],X[i],weight) end if UpdateMaster() do for k in 1 to r do (thetaMaster[k],wMaster[k]) = OuterOnlineUpdate(thetaMaster[k],wMaster[k],thetaInner[j,k],wInner[j,k]) // flush the inner bootstrap distros thetaInner[j,k] = 0 wInner[j,k] = 0 end end end </code></pre></div></div> <p>For the mean, set both <code class="language-plaintext highlighter-rouge">InnerOnlineUpdate</code> and <code class="language-plaintext highlighter-rouge">OuterOnlineUpdate</code> to <code class="language-plaintext 
highlighter-rouge">WeightedMeanUpdate</code>. <code class="language-plaintext highlighter-rouge">thetaMaster</code> is an array representing a running estimate of the sampling distribution of the mean at the $i$-th data point.</p> <p>Here it is with the Python package. I’m going to distribute the data over 6 threads using the lovely Gnu parallel package (<code class="language-plaintext highlighter-rouge">brew install parallel</code>).</p> <div class="gist-embed-link"> <!-- <a href="https://gist.github.com/fwhigh/a68eb626018abb2985af1c2d8b7b93c1#file-streaming_bootstrap-sh"> gist</a> <a href="https://gist.github.com/fwhigh/a68eb626018abb2985af1c2d8b7b93c1/raw/streaming_bootstrap.sh"> raw</a> --> <code class="gist-embed-code" data-gist-id="fwhigh/a68eb626018abb2985af1c2d8b7b93c1" data-gist-file="streaming_bootstrap.sh" data-gist-hide-footer="false" data-gist-show-spinner="true" gist-enable-cache="true" data-gist-line="41-43"></code> </div> <p>My top 10 output lines are</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1497631385.97 5988 0.500353215188 9843.0 1497631385.97 5989 0.495968983126 10111.0 1497631385.97 5982 0.504122756197 9995.0 1497631385.97 5983 0.497282486683 10121.0 1497631385.97 5980 0.500647175678 10013.0 1497631385.97 5981 0.493803255786 9907.0 1497631385.97 5986 0.497372163828 10076.0 1497631385.97 5987 0.500258192842 9963.0 1497631385.97 5984 0.495168146999 10054.0 1497631385.97 5985 0.499804540368 10074.0 </code></pre></div></div> <p>I’m getting a streaming distributed bootstrap mean of 0.49871 and a streaming distributed bootstrap standard deviation of 0.0028622 (compare to standard bootstrap values 0.49866 and 0.0028821, and to the direct data estimates 0.49866 and 0.0028819).</p> <p>This example and a lot of thinking convince me that the algorithm is right and my implementation is right. 
It is certainly not a complete or rigorous proof; I leave that to others.</p> <h2 id="testing-it-on-the-twitter-firehose">Testing it on the Twitter firehose</h2> <p>Just for fun, let’s set this loose on some Twitter search terms. The simplest average quantity I could think of was the mean time between tweets for different terms. Let’s call this inter-tweet time.</p> <p>I’m accessing the firehose via the <a href="https://github.com/sferik/t"><code class="language-plaintext highlighter-rouge">t</code></a> command line utility. The top trending term at the time of writing is “Whole Foods”. Here’s what some of the tweet data look like.</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ID,Posted at,Screen name,Text 875734500466081796,2017-06-16 15:19:09 +0000,RogueInformant,@ALT_uscis @amazon @WholeFoods How long until @WholeFoods offers its own credit card? 15% off you first 5 bundles o… https://t.co/fifTNCPlMs 875734502785548288,2017-06-16 15:19:10 +0000,Fronk83,RT @CNET: Grocery a-go-go: #Amazon to buy #WholeFoods for $13.7 billion$AMZN https://t.co/zF72YYe2Fo https://t.co/3dURTz7Rup 875734510230491136,2017-06-16 15:19:12 +0000,ultramet,So does this mean that @WholeFoods employees will now be treated in the same crappy way Amazon employees are? Feel bad for them. 875734517562183680,2017-06-16 15:19:14 +0000,framhammer,"RT @JacobAWare: Amazon, based on your recent purchase of #WholeFoods, you might also like: • South Park season 19 on blu-ray • asparagus water" 875734520993120256,2017-06-16 15:19:14 +0000,osobsamantar,"@GuledKnowmad @JeffBezos @amazon @WholeFoods Not any kind of milk, organic almond milk" 875734524080119808,2017-06-16 15:19:15 +0000,ChrisArvay,"#Roc @Wegmans Your move! 
@amazon buys @WholeFoods https://t.co/p2Nq1DXLal" </code></pre></div></div> <p>I’ll convert the date to a timestamp in seconds and subtract the timestamp of the previous line, skipping the first line of course.</p> <p>I’ll make use of a Gnu utility called stdbuf (<code class="language-plaintext highlighter-rouge">brew install gstdbuf</code> from Homebrew on my Mac) that lets me disable the buffering that the awk stage is apparently introducing. With no buffering I can turn on the firehose and get real-time updates of the bootstrap distribution every 10 tweets.</p> <p>I’m disabling the inner bootstrap flush so that I can look at the cumulative effect of all data as a function of time. Normally I need to flush the inner bootstrap estimates upon updating the master outer bootstrap distribution.</p> <p>And I’m not parallelizing the inner bootstraps per trending term because I have just one data stream each – so this is not a full-blown demo but it’s still cool and suggestive.</p> <div class="gist-embed-link"> <!-- <a href="https://gist.github.com/fwhigh/a68eb626018abb2985af1c2d8b7b93c1#file-twitter_example-sh"> gist</a> <a href="https://gist.github.com/fwhigh/a68eb626018abb2985af1c2d8b7b93c1/raw/twitter_example.sh"> raw</a> --> <code class="gist-embed-code" data-gist-id="fwhigh/a68eb626018abb2985af1c2d8b7b93c1" data-gist-file="twitter_example.sh" data-gist-hide-footer="false" data-gist-show-spinner="true" gist-enable-cache="true"></code> </div> <p>I let it run for a few minutes. 
Plotting it up:</p> <div class="gist-embed-link"> <!-- <a href="https://gist.github.com/fwhigh/a68eb626018abb2985af1c2d8b7b93c1#file-twitter_example-r"> gist</a> <a href="https://gist.github.com/fwhigh/a68eb626018abb2985af1c2d8b7b93c1/raw/twitter_example.R"> raw</a> --> <code class="gist-embed-code" data-gist-id="fwhigh/a68eb626018abb2985af1c2d8b7b93c1" data-gist-file="twitter_example.R" data-gist-hide-footer="false" data-gist-show-spinner="true" gist-enable-cache="true"></code> </div> <figure> <a href="/assets/streaming-distributed-bootstrap/tweet_rate.png"><img width="70%" src="/assets/streaming-distributed-bootstrap/tweet_rate.png" /></a> <figcaption width="70%">Cumulative mean inter-tweet time for a few top trending terms.</figcaption> </figure> <p>The top term is significantly more popular than the next two, which themselves are indistinguishable from one another over the period I accumulated data.</p> <p>While I did not distribute each trending term’s bootstrap, I did already demo parallelizing the weighted mean bootstrap above so hopefully that’s enough to convince you that it’s possible over distributed Twitter streams.</p> <!-- ## Subsampled streaming distributed bootstrap Reframing random resampling without replacement as Multinomial random draws, then as Poisson random draws, is particularly cool. You might imagine extending this idea to the subsampled bootstrap that the Bag of Little Bootstrap does, then making it online. In this case there are still $N$ data points but only $b=fN$ are subject to resampling in a given bootstrap iteration, where $f\in(0,1)$ is the subsample rate. The first data point therefore has a probability of $1-f$ of not being sampled, and therefore being zero, and otherwise is sampled as before with counts distributed as $\mathrm{Binomial}(n, 1/b)$. This is a minor modification of the streaming bootstrap resampling methodology. 
--> <h2 id="summary">Summary</h2> <p>I’ve sketched the train of logic that takes you from the standard bootstrap, to the streaming flavor, to the streaming-and-distributed flavor, and I did a cute Twitter firehose example. The purpose of this approach was not to prove anything rigorously but to make the concepts real and build deep intuition by actually doing it.</p> <p>This is not the streaming version of the Bag of Little Bootstrap. The BLB (1) maintains over-resampled bootstrap distributions over each shard and (2) immediately summarizes the sharded bootstraps locally on their own threads, then (3) collects and aggregates the summaries themselves to create a more precise summary. What I’ve done here is maintain exact streaming bootstraps over each shard and collected the <em>full bootstrap distribution</em> in a downstream master thread, then summarized. For the weighted mean and many data points, the streaming distributed bootstrap is equivalent to a single standard bootstrap. This is true for any online update rule that can handle aggregated versions of the statistic of interest. In cases where the update rule cannot handle aggregate versions of the statistic, maybe you’d just do a weighted mean update to compute the master bootstrap distribution.</p> <p>You can play many variations on this theme, as I’ve done a bit in this post, choosing any combination among</p> <ul> <li>online or batch</li> <li>distributed or single threaded</li> <li>master bootstrap distribution thread or immediate mean &amp; standard-deviation summarization</li> </ul> <p>I’ve toyed with more statistics, like quantiles and the exponentially weighted moving average (EWMA), both of which can be computed online. 
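The EWMA in particular has an update rule as simple as the weighted mean’s. A sketch (my own, not necessarily how the package’s updaters implement it):

```python
def ewma_update(theta0, x, alpha=0.1):
    # Online exponentially weighted moving average update: new data x
    # pulls the running estimate theta0 toward it with weight alpha.
    return (1.0 - alpha) * theta0 + alpha * x

theta = 0.0
for x in [1.0, 1.0, 1.0]:
    theta = ewma_update(theta, x)
# After three updates from 0 toward 1, theta = 1 - 0.9**3 = 0.271
```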
You can take a look at these updaters at <a href="https://github.com/fwhigh/sdbootstrap/tree/master/sdbootstrap/updater">https://github.com/fwhigh/sdbootstrap/tree/master/sdbootstrap/updater</a>.</p> <h2 id="references">References</h2> <ul> <li><a href="http://projecteuclid.org/euclid.aos/1176344552">Efron 1979</a></li> <li><a href="http://arxiv.org/abs/1112.5016">Bag of Little Bootstraps</a></li> <li><a href="http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43157.pdf">Google paper</a></li> <li><a href="http://arxiv.org/abs/1312.5021">Vowpal Wabbit paper</a></li> </ul>Will HighThe streaming distributed bootstrap is a really fun solution, and I've mocked up a Python package to test it out.