by Angela Guess
Douglas Merrill of Forbes has written a new article about the nature of Big Data and how math lies behind its power. He writes, “As a general rule, more data is always better than less data. You can do more math magic with more data. In general, it gives you more degrees of freedom. Most importantly, more data makes it easier to avoid a problem called overfitting. Stay tuned, I’ll come right back to that. First, to learn a new machine learning model, you need a bunch of stuff. You need data; at least some of that data needs to be tagged with the outcome you are hoping to learn. So, for example, if you are trying to predict the probability a borrower will default on a loan, you need some cases where borrowers defaulted, and some where borrowers paid off their loans.”
He continues, “So, you have a bunch of tagged observations, so you know that these loans went bad, these paid off, and so forth. Off to model! First, what do you do? Recall the math issues I mentioned last time; you can still screw up the math. More data will mean it will take a few more seconds to generate garbage data, but the data will still be garbage. Again, it’s not the number of bits, it’s the amount of information you can generate. However, since you’re drowning in bits, your first step will be to divide your data into a training segment — that you’ll use to teach the model — and a testing segment — that you will use to test the quality of your model. If you have extra data, you might also create a third group, called a validation group, that you will use to make sure you didn’t overfit your data.”
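The split Merrill describes can be sketched in a few lines of Python. This is a minimal illustration, not his method: the function name `split_observations`, the 60/20/20 proportions, the fixed seed, and the made-up loan records are all assumptions added here. The group names follow the quote, with a training segment, a testing segment, and a leftover validation group.

```python
import random


def split_observations(observations, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle tagged observations and cut them into three groups.

    The fractions are illustrative assumptions; whatever is left after
    the training and testing segments becomes the validation group.
    """
    if train_frac <= 0 or test_frac < 0 or train_frac + test_frac > 1:
        raise ValueError("fractions must be non-negative and sum to at most 1")

    shuffled = list(observations)          # copy so the input is not mutated
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable

    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)

    training = shuffled[:n_train]
    testing = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return training, testing, validation


# Hypothetical tagged loans: (loan_id, defaulted). Every fifth loan is
# marked as a default purely so both outcomes appear in the data.
loans = [(loan_id, loan_id % 5 == 0) for loan_id in range(100)]

training, testing, validation = split_observations(loans)
print(len(training), len(testing), len(validation))
```

Shuffling before cutting matters because records are often stored in some meaningful order (by date, say), and an unshuffled cut could put all the defaults in one segment. Each observation lands in exactly one group, so the model is never scored on a loan it was trained on.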