Click here to learn more about Ted Kwartler.
Each semester I am tempted to teach my Harvard Extension students using standard data sets like Iris or Titanic. It would be easy with so many examples floating around. Plus, the explanation of a K-Nearest Neighbor algorithm fits perfectly into an Iris world. But honestly, real-world data never classifies that cleanly, and who uses KNN in production anyway? Since KNN is non-parametric and keeps the training data around instead of fitting a compact model, every prediction has to be compared against that stored data, so it takes forever to get predictions with data of any appreciable size!
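To make the prediction-cost point concrete, here is a minimal brute-force KNN sketch in plain Python. It is not how any particular library implements KNN (many use tree-based indexes to speed up the search), and the function name `knn_predict`, the toy points, and the labels are all made up for illustration. What it shows is that there is no training step to speak of: each prediction measures the distance from the query to every stored row.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    Hypothetical helper for illustration only; a brute-force search.
    """
    # No fitted parameters exist, so every prediction scans all stored rows.
    distances = [
        (math.dist(row, query), label)
        for row, label in zip(train_X, train_y)
    ]
    # Sort by distance and keep the k closest neighbors.
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Made-up toy data: two well-separated clusters labeled "A" and "B".
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["A", "A", "A", "B", "B", "B"]

prediction = knn_predict(train_X, train_y, (1.1, 1.0), k=3)
```

With six rows this is instant, but the work per prediction grows with the number of stored rows, which is why a brute-force approach like this one gets painful on large data.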
Instead, I do the heavy lifting (read: teaching) using real-world data accompanied by multiple algorithm choices. Instructors who use only toy data sets are doing a disservice to students in the long run. Why not challenge students with messy data, full of leakage and missing values, or data that needs to be joined to other data sets as part of the learning process?
Instructors can also challenge pupils to think about an outcome that matters today. Calculating the probability of surviving a tragic boating accident from 1912 doesn’t help anyone. In my class, we learn logistic regression with college basketball data. One could argue that basketball isn’t consequential, but the $2B in lost productivity during March Madness each year and the $1B wagered on the event say otherwise. I still teach KNN because it’s one of the easiest algorithms for students to intuit, but I use East Side vs. West Side Cleveland housing data and mention a favorite rap group of mine, Bone Thugs-N-Harmony.
In the end, the class learns many of the same topics but from a fresh perspective with imperfect data. For both March Madness and Cleveland housing, the data doesn’t work out as well as Iris, but it’s realistic.
A contrarian could argue that new students need to focus on the algorithm and how it “learns” data. Thus, a simple data set, like Iris, lets them grow into more complex applications and does not overwhelm them. It’s a fair point, but one that assumes the student will seek out additional resources or messy data after their pure Data Science education. Instead, I try to inspire ingenuity and applicability during the first lesson in the hopes that the algorithm’s application, not just the underlying math, will be seen as relevant and engaging.
So, whether you are an accomplished machine learning engineer, a principal data scientist, or merely starting out on this learning journey, let me share the advice I give students when asked how to improve their Data Science skills:
1. Find a passion project. For me, it was figuring out March Madness using basketball statistics. Along the way, you will encounter obstacles that you can overcome if motivated. For example, to model basketball outcomes, you have to first get the data, which requires you to learn APIs and web scraping.
2. Subscribe to a blog aggregation service or YouTube channels. The R community has R-bloggers, and Python has Python Weekly. Even if you only read articles of interest, over time, services like these will help you stay current with new packages and constantly present you with novel work that could be applied to your passion project.
3. Teach someone else. Over time, you will gain skills from your passion project that are applicable to other domains. For example, in web scraping basketball data, you will probably also have to learn string manipulation. All of these micro-skills will help you professionally. I challenge you to teach these micro-skills because doing so is rewarding and forces you to develop a depth of knowledge. Keep in mind, “teaching” can be as simple as a team presentation or local meetup. It’s not limited to a blog, conference presentation, or research paper.
Your time and mental energy are limited resources, so be selective about where you learn your Data Science skills. You will be motivated if you choose a topic of interest, assuming historical tragedies and flowers aren’t your thing. Look for training, such as DataCamp or the recently announced 10x: The Applied Data Science Academy, that doesn’t rely on common data sets.
Lastly, if you follow people in the Data Science community, pay particular attention to realists like Hamel Husain, who can articulate the business value and horror stories of ML projects gone awry. Being selective will improve your skills and marketability, making you a better data scientist.