• Large scale machine learning means taking the learning algorithms we already have, but applying them to big data sets.


  • If you look back at the recent 5 or 10-year history of machine learning, one of the reasons learning algorithms work so much better now than 5 years ago is just the sheer amount of data we now have and can train our algorithms on.


  • So why do we want such large data sets? We've already seen that one of the best ways to get a high-performance machine learning system is to take a low-bias learning algorithm and train it on a lot of data.


  • An example: classifying between confusable words, as in "For breakfast I ate two (TWO) eggs." We saw in this example the sort of results where, so long as you feed the algorithm a lot of data, it seems to do very well.


  • And it's results like these that have led to the saying in machine learning that often it's not who has the best algorithm that wins; it's who has the most data.


  • So we want to learn from large data sets, at least when we can get them. But learning with large data sets comes with its own unique problems; specifically, computational problems.


  • Let's say your training set size is m = 100,000,000. This is actually pretty realistic for many modern data sets.


  • And let's say you want to train a linear regression model, or maybe a logistic regression model, in which case the gradient descent rule is as shown below.


  • And if you look at what you need to do to compute the gradient (the summation term in the rule below), then when m is a hundred million you need to carry out a summation over a hundred million terms in order to compute these derivative terms and perform a single step of descent.


  • \( \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})\, x_j^{(i)} \)
  • Because of the computational expense of summing over a hundred million entries just to compute one step of gradient descent, we will look for ways to either replace this with something else or find more efficient ways to compute this derivative.


  • Of course, before we put in the effort to train a model with a hundred million examples, we should also ask ourselves: why not use just a thousand examples?


  • Maybe we can randomly pick a subset of a thousand examples out of the hundred million and train our algorithm on just those.


  • So before investing the effort to actually develop the software needed to train these massive models, it is often a good sanity check to see whether training on just a thousand examples might do just as well.


  • The way to sanity-check whether a much smaller training set (say m = 1,000) might do just as well is the usual method of plotting learning curves. So if you were to plot the learning curves, your training objective \( J_{train}(\theta) \) might look like the figure given below.


  • [Figure: learning curves for the high-variance case; \( J_{train}(\theta) \) rises and \( J_{cv}(\theta) \) falls as m grows, with a large gap still between them]
  • And if your cross-validation set objective \( J_{cv}(\theta) \) looks like the figure above, then this looks like a high-variance learning algorithm, and we can be more confident that adding extra training examples would improve performance.


  • Whereas, in contrast, if you were to plot the learning curves and both your training objective and your cross-validation objective looked like the figure below, then this looks like the classic high-bias learning algorithm.


  • [Figure: learning curves for the high-bias case; \( J_{train}(\theta) \) and \( J_{cv}(\theta) \) quickly flatten out close to each other]
  • And in the latter case, if you were to plot this up to, say, m = 1,000 and the curves have already flattened out, then it seems unlikely that increasing m to a hundred million will do much better. You'd be just fine sticking with m = 1,000 rather than investing a lot of effort to figure out how to scale up the algorithm.


  • Of course, if you were in the situation shown by the second figure, one natural thing to do would be to add extra features, or extra hidden units to your neural network, and so on, so that you end up with a situation closer to the first figure (high variance up to m = 1,000). That gives you more confidence that adding the infrastructure to use much more than a thousand examples would actually be a good use of your time. A code sketch of this sanity check is given below.
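
  • A rough illustration (not from the lecture) of this learning-curve sanity check, run on synthetic data; make_data, the sizes list, and the noise level are all illustrative assumptions:

```python
# Learning-curve sanity check on synthetic linear-regression data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

def make_data(m, true_theta, noise=0.5):
    """Synthetic data standing in for a real training set."""
    X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, len(true_theta) - 1))])
    return X, X @ true_theta + noise * rng.normal(size=m)

def cost(theta, X, y):
    return np.sum((X @ theta - y) ** 2) / (2 * len(y))   # J = 1/(2m) * SSE

true_theta = rng.normal(size=6)
X, y = make_data(1000, true_theta)
X_cv, y_cv = make_data(300, true_theta)

sizes = [50, 100, 200, 500, 1000]
j_train, j_cv = [], []
for m in sizes:
    theta = np.linalg.pinv(X[:m]) @ y[:m]    # exact least-squares fit on subset
    j_train.append(cost(theta, X[:m], y[:m]))
    j_cv.append(cost(theta, X_cv, y_cv))

plt.plot(sizes, j_train, label="J_train")
plt.plot(sizes, j_cv, label="J_cv")
plt.xlabel("training set size m"); plt.ylabel("cost"); plt.legend(); plt.show()
# A persistent J_cv/J_train gap suggests high variance: more data should help.
```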


  • So in large-scale machine learning, we like to come up with computationally reasonable ways, or computationally efficient ways, to deal with very big data sets.






  • For many learning algorithms, we derived them by coming up with an optimization objective (cost function) and using an algorithm to minimize that cost function.


  • When you have a large dataset, gradient descent becomes very expensive. So here we'll define a different way to optimize for large data sets which will allow us to scale the algorithms.


  • Suppose you're training a linear regression model with gradient descent


  • Hypothesis : \( h_{\theta}(x) = \sum_{j=0}^{n} \theta_j x_j \)


  • Cost function : \( J_{train}(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})^2 \)


  • If we plot our two parameters vs. the cost function we get something like this


  • [Figure: bowl-shaped surface of the cost function \( J(\theta_0, \theta_1) \)]


  • Quick reminder - how does gradient descent work?


  • Repeat until convergence {
      \( \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})\, x_j^{(i)} \)   (simultaneously for every j = 0, ..., n)
    }


  • In the inner loop we repeatedly update the parameters θ. We will use linear regression as our algorithmic example here when talking about stochastic gradient descent, although the ideas apply to other algorithms too, such as logistic regression and neural networks.


  • Below we have a contour plot for gradient descent showing iteration to a global minimum


  • [Figure: contour plot of the cost function, with batch gradient descent iterating steadily to the global minimum]


  • As mentioned, if m is large, gradient descent can be very expensive. Although so far we've just referred to it as gradient descent, this kind of gradient descent is called batch gradient descent.


  • This just means we look at all the examples at the same time. Batch gradient descent is not great for huge datasets: if you have 300,000,000 records, you need to read all of them in from disk, because you can't store them all in memory.


  • Having read all the records, you can move one step (one iteration) through the algorithm; then repeat for EVERY step. This means it takes a LONG time to converge.


  • Especially because disk I/O is typically a system bottleneck anyway, this will inevitably require a huge number of reads. What we're going to do here is come up with a different algorithm which only needs to look at a single example at a time. For reference, a minimal sketch of batch gradient descent is given below.
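
  • A minimal sketch of batch gradient descent for linear regression (illustrative, not the course's own code), assuming numpy arrays X of shape (m, n+1), with a leading column of ones, and y of shape (m,) already fit in memory:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, num_iters=100):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        # Every single step sums over ALL m examples -- the expensive part.
        gradient = X.T @ (X @ theta - y) / m
        theta -= alpha * gradient
    return theta
```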






  • Let's define our cost function slightly differently, as


  • \( cost(\theta, (x^{(i)}, y^{(i)})) = \frac{1}{2} (h_{\theta}(x^{(i)}) - y^{(i)})^2 \)


  • So the function represents the cost of θ with respect to a specific example (x^(i), y^(i)), calculated as one half times the squared error on that example; it measures how well the hypothesis works on a single example. The overall cost function can now be re-written in the following form:


  • \( J_{train}(\theta) = \frac{1}{m} \sum_{i=1}^{m} cost(\theta, (x^{(i)}, y^{(i)})) \)


  • This is equivalent to the batch gradient descent cost function. With this slightly modified (but equivalent) view of linear regression we can write out how stochastic gradient descent works.


  • 1. Randomly shuffle the training examples.
    2. Repeat {
         for i := 1, ..., m {
           \( \theta_j := \theta_j - \alpha (h_{\theta}(x^{(i)}) - y^{(i)})\, x_j^{(i)} \)   (for every j = 0, ..., n)
         }
       }


  • So what's going on here? The term \( (h_{\theta}(x^{(i)}) - y^{(i)})\, x_j^{(i)} \)


  • is the same as the term found in the summation for batch gradient descent. It's possible to show that this term is equal to the partial derivative, with respect to the parameter θj, of the per-example cost: \( \frac{\partial}{\partial \theta_j} cost(\theta, (x^{(i)}, y^{(i)})) \)
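
  • For completeness, the one-line chain-rule check: \( \frac{\partial}{\partial \theta_j} \frac{1}{2}(h_{\theta}(x^{(i)}) - y^{(i)})^2 = (h_{\theta}(x^{(i)}) - y^{(i)}) \cdot \frac{\partial}{\partial \theta_j} h_{\theta}(x^{(i)}) = (h_{\theta}(x^{(i)}) - y^{(i)})\, x_j^{(i)} \), since \( h_{\theta}(x) = \sum_{j=0}^{n} \theta_j x_j \) gives \( \partial h_{\theta}(x) / \partial \theta_j = x_j \).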


  • What the stochastic gradient descent algorithm does is scan through the examples one at a time. The inner for-loop does something like this: looking at example 1, take a step with respect to the cost of just the 1st training example.


  • Having done this, we go on to the second training example and take a second step in parameter space to try and fit it better. Then move on to the third training example, and so on, until we get to the end of the data.


  • We may now repeat this whole procedure and take multiple passes over the data. The random shuffling at the start ensures the data is in a random order, so we don't bias the movement; randomization should speed up convergence a little bit.


  • Although stochastic gradient descent is a lot like batch gradient descent, rather than waiting to sum up the gradient terms over all m examples, we take just one example and immediately make progress in improving the parameters.


  • This means we update the parameters on EVERY step through the data, instead of at the end of each loop through all the data. A minimal code sketch is given below.
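
  • A minimal sketch of stochastic gradient descent, under the same illustrative assumptions as before (numpy arrays X of shape (m, n+1) and y of shape (m,)):

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, num_passes=3):
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(0)
    for _ in range(num_passes):            # typically 1-10 passes over the data
        for i in rng.permutation(m):       # randomly shuffle the examples
            error = X[i] @ theta - y[i]    # look at just ONE example...
            theta -= alpha * error * X[i]  # ...and update immediately
    return theta
```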


  • What does the algorithm do to the parameters? As we saw, batch gradient descent does something like this to get to a global minimum.


  • [Figure: contour plot; batch gradient descent descends smoothly to the global minimum]


  • With stochastic gradient descent every iteration is much faster, but every iteration is fitting a single example.


  • [Figure: contour plot; stochastic gradient descent takes a noisy, wandering path toward the global minimum]


  • What you find is that you "generally" move in the direction of the global minimum, but not always. The process never actually converges the way batch gradient descent does; it ends up wandering around some region close to the global minimum.


  • In practice, this isn't a problem - as long as you're close to the minimum that's probably OK.


  • One final implementation note: you may need to loop over the entire dataset 1-10 times. If you have a truly massive dataset, it's possible that by the time you've taken a single pass through the data you already have a perfectly good hypothesis.


  • In which case the outer loop might only need to happen once, if m is very, very large. Contrast this with batch gradient descent, where we have to make k passes through the entire dataset, where k is the number of gradient steps taken.






  • Mini-batch gradient descent is an additional approach which can work even faster than stochastic gradient descent.


  • To summarize our approaches so far:
    1. Batch gradient descent: use all m examples in each iteration.
    2. Stochastic gradient descent: use 1 example in each iteration.
    3. Mini-batch gradient descent: use b examples in each iteration, where b = the mini-batch size.


  • So it's just like batch gradient descent, except we use small batches. A typical range for b is 2-100 (maybe 10). For example, with b = 10: get the next 10 examples from the training set and perform a gradient descent update using those ten examples.


  • Mini-batch algorithm


  • Say b = 10, m = 1000. Repeat {
      for i := 1, 11, 21, ..., 991 {
        \( \theta_j := \theta_j - \alpha \frac{1}{10} \sum_{k=i}^{i+9} (h_{\theta}(x^{(k)}) - y^{(k)})\, x_j^{(k)} \)   (for every j = 0, ..., n)
      }
    }


  • So we for-loop through the m examples, b at a time. Compared to batch gradient descent this lets us get through the data much more efficiently: after just b examples we can already improve our parameters. We don't have to update the parameters after every single example, and we don't have to wait until we've cycled through all the data. A numpy sketch is given below.
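
  • A minimal sketch of mini-batch gradient descent under the same illustrative assumptions; the vectorized slice update is where mini-batching pays off in practice:

```python
import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.01, b=10, num_passes=3):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_passes):
        for i in range(0, m, b):
            Xb, yb = X[i:i + b], y[i:i + b]               # next b examples
            gradient = Xb.T @ (Xb @ theta - yb) / len(yb)  # vectorized sum
            theta -= alpha * gradient                      # one update per batch
    return theta
```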


  • Mini-batch gradient descent vs. stochastic gradient descent


    1. Why should we use mini-batch? It allows you to have a vectorized implementation, which makes the implementation much more efficient.


    2. It can also partially parallelize your computation (i.e. do 10 examples at once). A disadvantage of mini-batch gradient descent is having to optimize the extra parameter b, but this is often worth it!


    3. To be honest, stochastic gradient descent and batch gradient descent are just specific forms of mini-batch gradient descent.


    4. For mini-batch gradient descent, b is somewhere in between 1 (which makes it stochastic gradient descent) and m (which makes it batch gradient descent) and you can try to optimize for it!






  • We now know about stochastic gradient descent. But how do you know when it's done? And how do you tune the learning rate α?


  • The solution is checking for convergence. With batch gradient descent we could plot the cost function vs. the number of iterations; it should decrease on every iteration.


  • This works when the training set size is small, because we can afford to sum over all examples.


  • It doesn't work when you have a massive dataset: with stochastic gradient descent we don't want to have to pause the algorithm periodically to do a summation over all the data.


  • Moreover, the whole point of stochastic gradient descent is to avoid those whole-data summations, so for stochastic gradient descent we have to do something different. Take the per-example cost function: \( cost(\theta,(x^{(i)},y^{(i)})) = \frac{1}{2} (h_{\theta}(x^{(i)}) - y^{(i)} )^2 \)


  • This is one half the squared error on a single example.


  • While the algorithm is looking at the example (x^(i), y^(i)), but before it has updated θ, we can compute the cost of that example, cost(θ, (x^(i), y^(i))), i.e. how well the hypothesis is working on that training example.


  • We need to do this before we update θ, because if we did it after, the algorithm would appear to be performing a bit better (since we'd have just used (x^(i), y^(i)) to improve θ).


  • To check for convergence, every 1000 iterations we can plot these costs averaged over the last 1000 examples processed. This gives a running estimate of how well we've done on the last 1000 examples, and by looking at these plots we should be able to check that convergence is happening. A sketch of this monitoring is given below.
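
  • A hypothetical sketch of this convergence check (the function names and the window size of 1000 are illustrative): record each example's cost BEFORE updating θ, then plot averages over windows of 1000 examples.

```python
import numpy as np
import matplotlib.pyplot as plt

def sgd_with_monitoring(X, y, alpha=0.01, window=1000):
    m, n = X.shape
    theta = np.zeros(n)
    costs, averages = [], []
    for i in np.random.default_rng(0).permutation(m):
        error = X[i] @ theta - y[i]
        costs.append(0.5 * error ** 2)     # cost(theta, (x_i, y_i)), pre-update
        theta -= alpha * error * X[i]
        if len(costs) % window == 0:
            averages.append(np.mean(costs[-window:]))
    plt.plot(averages)
    plt.xlabel("windows of %d examples" % window)
    plt.ylabel("average cost")
    plt.show()
    return theta
```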


  • What do these plots look like? In general, they might be a bit noisy (1000 examples isn't that much to average over). If you get a figure like this:


  • [Figure: averaged cost decreasing over time, then flattening out]


  • That's a pretty decent run; the algorithm may have converged.


  • If you use a smaller learning rate you may get an even better final solution.


  • [Figure: with a smaller learning rate, the averaged cost decreases more slowly but settles at a slightly lower value]


  • This is because the parameters oscillate around the global minimum, and a smaller learning rate means smaller oscillations. Also, if you average over the last 5000 examples instead of the last 1000, you may get a smoother curve.


  • [Figure: the cost averaged over 5000 examples is a smoother curve than the 1000-example average]


  • The disadvantage of a larger averaging window is that you get less frequent feedback. Sometimes you may get a plot that looks like this:


  • [Figure: the 1000-example average looks flat and noisy, with no visible decrease]


  • It looks like the cost is not decreasing at all. But if you then average over a larger number of examples, you do see the general downward trend: the original line was simply too noisy, and that noise is ironed out by taking a greater number of entries per average.


  • Of course, the cost may not decrease even with a large averaging window. If you see a curve that looks like it is increasing, then the algorithm may be diverging.


  • [Figure: averaged cost increasing over time, indicating divergence]


  • In that case you should use a smaller learning rate.


  • Learning rate :


    1. We saw that with stochastic gradient descent we get this wandering around the minimum. In most implementations the learning rate is held constant.


    2. However, if you want to converge to a minimum you can slowly decrease the learning rate over time.


    3. A classic way of doing this is to calculate α as follows: α = const1 / (iterationNumber + const2). A code sketch is given after this list.


    4. This guarantees you'll converge somewhere rather than oscillating forever, but you also need to determine const1 and const2.


    5. BUT if you tune the parameters well, you can get something like this


    6. [Figure: with a well-tuned decaying learning rate, the averaged cost decreases and the parameters settle at the minimum rather than oscillating around it]
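
  • A hypothetical sketch of SGD with a decaying learning rate, following α = const1 / (iterationNumber + const2); const1 and const2 are tuning knobs you'd have to choose:

```python
import numpy as np

def sgd_decaying_alpha(X, y, const1=1.0, const2=50.0, num_passes=3):
    m, n = X.shape
    theta = np.zeros(n)
    rng = np.random.default_rng(0)
    iteration = 0
    for _ in range(num_passes):
        for i in rng.permutation(m):
            iteration += 1
            alpha = const1 / (iteration + const2)   # step size shrinks over time
            error = X[i] @ theta - y[i]
            theta -= alpha * error * X[i]
    return theta
```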






  • Online learning allows us to model problems where you have a continuous stream of data that you want an algorithm to learn from.


  • It uses a similar idea to stochastic gradient descent, in that you do slow updates using one example at a time.


  • Web companies use various types of online learning algorithms to learn from traffic


  • They can (for example) learn about user preferences and hence optimize your website.


  • Example - shipping service: users come and tell you the origin and destination of their package.


  • You offer to ship the package for some amount of money ($10 - $50). Based on the price you offer, sometimes the user uses your service (y = 1), sometimes they don't (y = 0)


  • We want to build an algorithm to optimize the price we offer users. We capture information about the user, like the origin and destination, and then work out what the probability of the user selecting the service is.


  • We want to optimize the price. To model this probability we have something like p(y = 1|x; θ) (the probability that y = 1, given x, parameterized by θ).


  • We build this model with something like logistic regression or a neural network. If you have a website that runs continuously, an online learning algorithm would do something like this:


  • A user comes, and is represented as an (x, y) pair, where x is a feature vector (including the price we offer, the origin and the destination) and y indicates whether they chose to use our service or not.


  • The algorithm updates θ using just this (x, y) pair:
    Repeat forever {
      Get the (x, y) pair corresponding to the next user.
      Update: \( \theta_j := \theta_j - \alpha (h_{\theta}(x) - y)\, x_j \)   (for every j = 0, ..., n)
    }


  • So we basically update all the θ parameters every time we get some new data
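
  • A hypothetical sketch of this loop for the logistic regression case; event_stream and the feature layout are illustrative assumptions, and each (x, y) pair is used once and then discarded:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_learning(event_stream, n, alpha=0.1):
    """event_stream yields (x, y): x is a length-n feature vector (offered
    price, origin, destination, ...); y is 1 if the user bought shipping."""
    theta = np.zeros(n)
    for x, y in event_stream:          # runs for as long as the site is up
        h = sigmoid(theta @ x)
        theta -= alpha * (h - y) * x   # update on just this one example
    return theta
```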


  • While in previous examples we might have described a data example as (x^(i), y^(i)), for an online learning problem we discard the idea of a data "set": instead we have a continuous stream of data, so indexing is largely irrelevant, as you're not storing the data (although presumably you could store it).


  • If you have a major website where you have a massive stream of data then this kind of algorithm is pretty reasonable


  • You're free of the need to deal with all your training data


  • If you had a small number of users you could save their data and then run a normal algorithm on a dataset


  • An online algorithm can adapt to changing user preferences. So over time users may become more price sensitive; the algorithm adapts to and learns from this, so your system is dynamic.






  • Say you run an online store that sells cellphones, and you have a UI where the user can type in a query like "Android phone 1080p camera".


  • We want to offer the user 10 phones per query. How do we do this? For each phone, given a specific user query, we create a feature vector x which has data like the features of the phone, how many words in the user query match the name of the phone, and how many words match the description of the phone.


  • Basically: how well does the phone match the user query? We want to estimate the probability of a user selecting a phone, so define y = 1 if a user clicks on a link and y = 0 otherwise.


  • So we want to learn p(y = 1|x; θ): this is the problem of learning the predicted click-through rate (CTR). If you can estimate the CTR for any phone, you can use this to show the highest-probability phones first. A small sketch follows.
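
  • A hypothetical sketch of the ranking step, assuming a logistic model θ has already been learned and phone_features is built per query (both names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def top_phones(theta, phone_features, k=10):
    """phone_features: shape (num_phones, n); one feature vector per phone
    (match counts against the query, phone attributes, ...)."""
    ctr = sigmoid(phone_features @ theta)   # p(y = 1 | x; theta) for each phone
    return np.argsort(-ctr)[:k]             # indices of the k most likely clicks
```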


  • If we display 10 phones per search, each search generates 10 training examples: the user can click through to one or more of them, or none of them, which tells us how well the predictions performed.


  • Other things you can do: Special offers to show the user, Show news articles - learn what users like, Product recommendation


  • These problems could have been formulated using standard techniques, but they are the kinds of problems where you have so much data that this is a better way to do things






  • Previously we spoke about stochastic gradient descent and other algorithms that can be run on one machine. But some problems are just too big for one computer.


  • Here we talk about a different approach called Map Reduce. Map Reduce example: we want to do batch gradient descent, using the update rule below.


  • \( \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})\, x_j^{(i)} \)


  • Assume m = 400 (normally m would be more like 400,000,000).


  • If m is large, this summation is really expensive, so we split the training set into different subsets; say, into 4 pieces.


  • Machine 1 uses (x^(1), y^(1)), ..., (x^(100), y^(100)), i.e. the first quarter of the training set, and just computes the summation for those first 100 examples:


  • \( temp_j^{(1)} = \sum_{i=1}^{100} (h_{\theta}(x^{(i)}) - y^{(i)})\, x_j^{(i)} \); machines 2, 3 and 4 compute \( temp_j^{(2)}, temp_j^{(3)}, temp_j^{(4)} \) on their quarters in the same way.


  • So now we have these four temp values, and each machine has done 1/4 of the work. Once we've got our temp variables, we send them to a centralized master server.


  • The master puts them back together, updating θ using \( \theta_j := \theta_j - \alpha \frac{1}{400} \left( temp_j^{(1)} + temp_j^{(2)} + temp_j^{(3)} + temp_j^{(4)} \right) \)   (for every j = 0, ..., n).


  • This equation does exactly the same thing as our original batch gradient descent algorithm. More generally, Map Reduce uses the following scheme (e.g. where you split into 4):
    [Figure: training set split across four machines; each computes a partial sum, which a centralized master server combines to update θ]


  • The bulk of the work in gradient descent is the summation. Because each of the computers does a quarter of the work at the same time, you get up to a 4x speedup.


  • Of course, in practice, because of network latency and the overhead of combining results, it's slightly less than 4x, but still good! A single-machine sketch of the idea is given below.
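
  • A hypothetical single-machine sketch of this split: a process pool stands in for the four machines. Map: each worker computes its partial sum (temp value); reduce: the "master" combines the temps into one gradient descent update.

```python
# On platforms that spawn processes, call this under `if __name__ == "__main__":`.
import numpy as np
from multiprocessing import Pool

def partial_gradient(args):
    """One machine's summation over its chunk of the training set."""
    X_chunk, y_chunk, theta = args
    return X_chunk.T @ (X_chunk @ theta - y_chunk)

def mapreduce_gradient_step(X, y, theta, alpha=0.01, num_workers=4):
    chunks = list(zip(np.array_split(X, num_workers),
                      np.array_split(y, num_workers),
                      [theta] * num_workers))
    with Pool(num_workers) as pool:
        temps = pool.map(partial_gradient, chunks)   # map: 4 partial sums
    gradient = sum(temps) / len(y)                   # reduce: combine on master
    return theta - alpha * gradient
```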


  • The important thing to ask is: "Can the algorithm be expressed as computing sums of functions over the training set?"


  • Many algorithms can! Another example: using an advanced optimization algorithm with logistic regression, whose cost function is
    \( J_{train}(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_{\theta}(x^{(i)}) + (1 - y^{(i)}) \log (1 - h_{\theta}(x^{(i)})) \right] \)


  • We need to calculate this cost function, and we can see it sums over the training set. So split the training set across x machines, and have each machine compute the sum over its 1/x of the data. The derivative terms have the same structure:
    \( \frac{\partial}{\partial \theta_j} J_{train}(\theta) = \frac{1}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})\, x_j^{(i)} \)


  • These terms are also sums over the training set, so the same approach applies: each machine sends its partial sums (temps) to a central server, which combines everything.


  • More broadly, by taking algorithms which compute sums you can scale them to very large datasets through parallelization


  • Parallelization can come from multiple machines, multiple CPUs, or multiple cores in each CPU; so even on a single computer you can implement parallelization.


  • The advantage of thinking about Map Reduce within one machine is that you don't need to worry about network issues.


  • It's all internal to the same machine. One final caveat/thought: depending on implementation details, certain numerical linear algebra libraries can automatically parallelize your calculations across multiple cores.


  • So if that is the case, and you have a good vectorized implementation, you don't need to worry about local parallelization yourself: the libraries sort the optimization out for you.






  • Hadoop is a good open source Map Reduce implementation


  • It represents a top-level Apache project developed by a global community of developers.


  • There is a large developer community all over the world. It is written in Java; Yahoo has been the biggest contributor, pushing it a lot early on, with support now from Cloudera.