• Example: Cat vs Non-cat


  • In this example, we want to build a mobile application that recognizes pictures of cats taken and uploaded by users.


  • There are two sources of data used to develop the mobile app. The first source is small: 10,000 pictures uploaded through the mobile application.


  • Since they come from amateur users, these pictures are not professionally shot, are poorly framed, and tend to be blurry. The second source is the web: 200,000 downloaded pictures in which the cats are professionally framed and in high resolution.


  • The problem is that you have two different distributions:


    1. a small data set of pictures uploaded by users. This is the distribution that matters for the mobile app.


    2. a much bigger data set of pictures from the web.


  • The guideline is to choose development and test sets that reflect the data you expect to get in the future and consider important to do well on.


  • The data is split as follows:


  • [Figure: how the data is split across the training, development, and test sets]


  • The advantage of splitting the data this way is that the target is well defined. The disadvantage is that the training distribution differs from the development and test set distributions. However, this way of splitting the data gives better performance in the long term.
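  • Below is a minimal sketch of this split in Python. The arrays are stand-ins for the real pictures, and the 50/50 dev/test proportions are assumptions for illustration; only the shape of the split (dev and test drawn purely from mobile uploads, everything else in training) follows the guideline above.

```python
import numpy as np

# Stand-ins for the two data sources (indices instead of real images).
web_data = np.arange(200_000)    # professionally shot web pictures
mobile_data = np.arange(10_000)  # amateur mobile-app uploads

rng = np.random.default_rng(seed=0)
rng.shuffle(mobile_data)

# Dev and test sets come ONLY from the distribution we care about:
# pictures uploaded through the mobile app.
dev_set = mobile_data[:2_500]
test_set = mobile_data[2_500:5_000]

# The remaining mobile pictures are mixed into the large web set.
train_set = np.concatenate([web_data, mobile_data[5_000:]])

print(len(train_set), len(dev_set), len(test_set))  # 205000 2500 2500
```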






  • Example: Cat classifier with mismatched data distributions


  • When the training set is from a different distribution than the development and test sets, the method to analyze bias and variance changes.


  • [Figure: error values for Scenarios A–F with mismatched data distributions]


  • Scenario A (training error 1%, development error 10%)


    1. If the development data came from the same distribution as the training set, we would conclude there is a large variance problem: the algorithm is not generalizing well from the training set.


    2. However, since the training data and the development data come from different distributions, this conclusion cannot be drawn.


    3. There isn't necessarily a variance problem. The development set might simply contain images that are harder to classify accurately. When the training, development, and test set distributions differ, two things change at the same time.


    4. First, the algorithm saw the training data during training but never saw the development data.


    5. Second, the distribution of the development data is different. It is difficult to know which of these two changes explains the 9% increase in error from the training set to the development set.


    6. To resolve this ambiguity, we define a new subset called the training-development set. It has the same distribution as the training set, but it is not used to train the neural network.
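  • A minimal sketch of carving out such a training-development set (the 5,000-example size is an arbitrary assumption):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# train_set as built above: 205,000 examples from the web + mobile mix.
train_set = np.arange(205_000)
rng.shuffle(train_set)

# Hold out a slice with the SAME distribution as the training set;
# the network is trained only on the remainder and never sees this slice.
train_dev_set = train_set[:5_000]
train_set = train_set[5_000:]
```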


  • Scenario B (training error 1%, training-development error 9%, development error 10%)


    1. The gap between the training error and the training-development error is 8%.


    2. Since the training set and the training-development set come from the same distribution, the only difference between them is that the network was trained on the training data but never saw the training-development data.


    3. The neural network is not generalizing well to unseen data from the same distribution. Therefore, we really have a variance problem.


  • Scenario C: In this case, we have a data mismatch problem, since the error jumps between the training-development set and the development set, which come from different distributions.


  • Scenario D: In this case, the avoidable bias is high, since the difference between Bayes error and training error is 10%.


  • Scenario E: In this case, there are two problems. The first is high avoidable bias, since the difference between Bayes error and training error is 10%; the second is a data mismatch problem.


  • Scenario F: Development should never be done on the test set. However, the difference between the development error and the test error indicates the degree of overfitting to the development set.


  • General formulation


  • [Figure: general formulation: the gaps between human-level, training, training-development, development, and test errors measure avoidable bias, variance, data mismatch, and the degree of overfitting to the development set]
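  • As a sketch, the gaps in the figure can be computed directly from the five error numbers, using human-level error as a proxy for Bayes error (errors below are in percent, and the example numbers match Scenario B above):

```python
def diagnose(human, train, train_dev, dev, test):
    """Decompose the error ladder into the four gaps from the figure.
    Human-level error is used as a proxy for Bayes error."""
    return {
        "avoidable bias": train - human,
        "variance": train_dev - train,
        "data mismatch": dev - train_dev,
        "overfitting to dev set": test - dev,
    }

# Scenario B: the variance gap of 8% dominates.
print(diagnose(human=1, train=1, train_dev=9, dev=10, test=10))
# {'avoidable bias': 0, 'variance': 8, 'data mismatch': 1,
#  'overfitting to dev set': 0}
```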






  • These are general guidelines to address data mismatch:


    1. Perform manual error analysis to understand the error differences between the training and development/test sets. Development should never be done on the test set, to avoid overfitting it.


    2. Make the training data more similar to the development and test sets, or collect more data similar to them. One way to make the training data more similar to your development set is artificial data synthesis. Be careful, however: you might accidentally be simulating data from only a tiny subset of the space of all possible examples, as sketched below.
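  • A minimal sketch of artificial data synthesis for this example, using Pillow to degrade high-resolution web pictures so they look more like mobile uploads. The specific transforms and parameter ranges are illustrative assumptions; the random variation is there precisely to avoid synthesizing from a tiny subset of possible degradations.

```python
import random
from PIL import Image, ImageFilter

def mobilify(web_image_path, out_path, rng=random.Random(0)):
    """Degrade a professional web picture so it resembles a typical
    amateur mobile upload (illustrative transforms, assumed parameters)."""
    img = Image.open(web_image_path)
    # Vary the degradation randomly: a single fixed blur/scale would
    # simulate only a tiny subset of real mobile pictures.
    scale = rng.choice([2, 3, 4])
    radius = rng.uniform(0.5, 2.0)
    img = img.resize((img.width // scale, img.height // scale))
    img = img.filter(ImageFilter.GaussianBlur(radius=radius))
    img.save(out_path)
```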