So You Want to be a Machine Learning Engineer?

Ideally, a machine learning engineer would have both the skills of a software engineer and the experience of a data scientist and data engineer. However, data scientists and software engineers usually come from very different backgrounds, and data scientists should not be expected to be great programmers, nor should software engineers be expected to provide statistical summaries. Nonetheless, a background in machine learning algorithms and how they can be implemented is critical to the machine learning engineer (MLE).  

An MLE works with different algorithms and applies them to different codebases and settings. Previous experience with software engineering and with existing codebases provides a very useful foundation for this career field. It’s not unusual for an MLE to start with a background in software engineering, and then gain the machine learning (ML) and statistical knowledge on the job. Keeping up with current machine learning and deep learning research, and implementing it, is also an important part of the MLE’s responsibilities.

The machine learning engineer is responsible for building ML systems capable of performing difficult tasks, or replacing slow, plodding human efforts. These goals require good engineering, the ability to write bug-free ML code, and the ability to develop the needed algorithms.

Algorithms

An algorithm is a series of steps or instructions used to solve problems or tell a computer what to do. Through the use of algorithms, machine learning provides computers with the ability to develop habitual responses. These responses are based on observations, and the repeated behaviors and actions of people the system interacts with. The ability to learn repetitive behaviors is important. As ML models receive new data, they adapt, with earlier experiences providing options on how to respond.

Having a baseline knowledge of the algorithms supporting machine learning will help in implementing models. Algorithms for machine learning can be broken into three broad categories:

  • Supervised learning is the most used. The training program acts as a supervisor, and corrects mistakes made by the algorithm after it has made a prediction.
  • Unsupervised learning can be useful in situations where the goal is to use the algorithms to discover and present otherwise unrecognizable structures within the data.
  • Reinforcement learning falls between the two. It provides some form of feedback, such as a reward or a penalty, for each response or action, but gives no precise evaluation of what the right response would have been.
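The supervised case, where the training program corrects the algorithm after each wrong prediction, can be sketched with a perceptron, one of the simplest learning algorithms. This is only an illustration: the toy data (a logical AND), the function names, and the parameter values are assumptions chosen for the example, not anything prescribed by the categories above.

```python
# Minimal perceptron sketch: predict, compare with the label, correct on mistakes.
# The data (logical AND) and every name here are illustrative assumptions.

def predict(weights, bias, features):
    """Return 1 if the weighted sum is positive, otherwise 0."""
    total = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if total > 0 else 0


def train(samples, labels, epochs=20, learning_rate=1.0):
    """Adjust the weights only when a prediction disagrees with its label."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            error = label - predict(weights, bias, features)  # 0 when correct
            if error != 0:
                # The "supervisor" step: nudge the weights toward the label.
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * error
    return weights, bias


samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]  # logical AND of the two inputs

weights, bias = train(samples, labels)
predictions = [predict(weights, bias, s) for s in samples]
print(predictions)
```

After training, the predictions match the labels for all four inputs. Real projects would normally rely on an established library rather than a hand-written loop like this one.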

A Tour of The Top Ten Algorithms For Machine Learning Newbies is good for getting started with the basics.

No single algorithm gives optimal results in every setting, so an algorithm usually has to be modified for the problem at hand. Modifying an existing algorithm is a common practice, and algorithms should be adjusted as circumstances change. When an algorithm is structured properly, new modifications can produce better algorithms.

Python

Learning Python is a necessity. Fortunately, it is one of the easier programming languages to learn. The majority of machine learning projects will use Python or C/C++ (Python is typically preferred). Some people would describe it as a useful, fairly easy to learn scripting language. For those not already familiar with Python, there are many free, user-friendly courses. One helpful tip: pay attention to tabbing and spacing, because Python uses indentation to group statements into blocks. With Python, white space is significant.
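A small sketch of why indentation matters: in the function below, the indentation level alone decides which statements belong to the loop and which to each branch. The function name and the sample input are made up for illustration.

```python
def label_numbers(numbers):
    labels = []
    for n in numbers:
        if n % 2 == 0:
            labels.append("even")   # indented under "if": runs for even numbers
        else:
            labels.append("odd")    # indented under "else": runs for odd numbers
    return labels                   # dedented: runs once, after the loop ends


result = label_numbers([1, 2, 3])
print(result)
```

Moving the `return` line one level to the right would place it inside the loop, and the function would stop after the first number. The usual convention is four spaces per level, and Python 3 rejects code that mixes tabs and spaces inconsistently.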

As a programming language, Python lets the work get done quickly and helps integrate systems more effectively. Python is typically used for general-purpose, high-level programming. It was created by Guido van Rossum and first released in 1991, and is now developed under the Python Software Foundation. It emphasizes code readability, with a syntax that allows programmers to express concepts in fewer lines of code.

As with most skills, Python requires practice, practice, practice. The Free Machine Learning interactive tool helps to develop Python skills and begin working with algorithms. Many data tools were built in Python, or were built with an API that allows for easy Python access. The language’s syntax is fairly easy to pick up. Python has an exceptional amount of training resources currently available and supports a variety of programming paradigms, including object-oriented programming and functional programming.
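As a rough illustration of what supporting more than one paradigm means, the sketch below computes the same sum of squares twice, once in an object-oriented style and once in a functional style. The class name, variable names, and input values are invented for the example.

```python
from functools import reduce

values = [1, 2, 3, 4]


# Object-oriented style: state lives inside an object and changes via methods.
class SquareSummer:
    def __init__(self):
        self.total = 0

    def add(self, value):
        self.total += value * value


summer = SquareSummer()
for v in values:
    summer.add(v)
total_object_oriented = summer.total

# Functional style: no mutable state, just a function folded over the data.
total_functional = reduce(lambda acc, v: acc + v * v, values, 0)

print(total_object_oriented, total_functional)
```

Both styles give the same answer here; which one reads better depends on the problem and on the conventions of the codebase.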

Application Programming Interface (API)

An application programming interface is, in general terms, a series of clearly defined methods for communication between various components. It includes communication protocols, tools for building software, and subroutine definitions.

When building applications, an API can simplify the process by hiding the underlying implementation and exposing only the objects, or actions, needed by the developer. For example, an API for file input/output might offer the developer a function that copies a file from one location to another, with the developer requiring no understanding of the file system operations taking place behind the scenes.
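Python's standard library happens to include an API of exactly this kind. The sketch below uses `shutil.copy` to copy a file without touching any low-level file system operations; the file names and contents are placeholders, and a temporary directory is used so nothing is left behind.

```python
import shutil
import tempfile
from pathlib import Path

# Work inside a throwaway directory; the file names are placeholders.
with tempfile.TemporaryDirectory() as workdir:
    source = Path(workdir) / "source.txt"
    destination = Path(workdir) / "copy.txt"
    source.write_text("hello, API")

    # One call does the work. Opening, reading, buffering, and writing
    # all happen behind the scenes.
    shutil.copy(source, destination)

    copied_text = destination.read_text()

print(copied_text)
```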

An API is typically related to a software library. The API describes and defines the “expected behavior,” while the library is an actual implementation of it. A single API can be implemented multiple times, with different libraries sharing the same programming interface. An MLE can take publicly available APIs and choose the best model, while learning procedures for the project.
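The idea of one interface with several implementations can be sketched as follows. Everything here is hypothetical: the `Model` interface, the two toy implementations, and the data are invented for the example and do not correspond to any particular library.

```python
from abc import ABC, abstractmethod


class Model(ABC):
    """Hypothetical interface: the expected behavior, with no implementation."""

    @abstractmethod
    def fit(self, xs, ys): ...

    @abstractmethod
    def predict(self, x): ...


class MeanModel(Model):
    """Always predicts the average of the training targets."""

    def fit(self, xs, ys):
        self.value = sum(ys) / len(ys)
        return self

    def predict(self, x):
        return self.value


class NearestNeighborModel(Model):
    """Predicts the target of the closest training point."""

    def fit(self, xs, ys):
        self.points = list(zip(xs, ys))
        return self

    def predict(self, x):
        nearest = min(self.points, key=lambda point: abs(point[0] - x))
        return nearest[1]


def mean_absolute_error(model, xs, ys):
    return sum(abs(model.predict(x) - y) for x, y in zip(xs, ys)) / len(xs)


train_x, train_y = [1, 2, 3, 4], [2, 4, 6, 8]
test_x, test_y = [1.1, 3.9], [2.2, 7.8]

# Because both classes share one interface, they can be compared by the same code.
candidates = {"mean": MeanModel(), "nearest_neighbor": NearestNeighborModel()}
errors = {
    name: mean_absolute_error(model.fit(train_x, train_y), test_x, test_y)
    for name, model in candidates.items()
}
best_name = min(errors, key=errors.get)
print(best_name)
```

The comparison loop never needs to know which implementation it is calling, which is what makes swapping models in and out cheap.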

Integrated Development Environments (IDE)

An IDE is a software program offering comprehensive tools for computer programmers to develop new software. An IDE is typically made up of a source code editor, a debugger, and build automation tools, at the very minimum. Integrated development environments, such as Eclipse and NetBeans, contain a compiler, or an interpreter, or both. Other IDEs, such as Lazarus and SharpDevelop, do not.

Because an IDE integrates many tools, the boundary between it and other software systems can become fuzzy. In some situations, a variety of integrated tools are used to streamline the construction of a graphical user interface (GUI). Several modern IDEs also include a class hierarchy diagram, an object browser, and a class browser for use in object-oriented software development.

A machine learning engineer makes complex engineering decisions about managing data and system deployments. This includes collecting data from SQL and NoSQL databases, APIs, and web scraping, and using pipeline frameworks (for example, Luigi or Airflow). When deploying applications, a container platform (such as Docker) might be used for its scalability and reliability. Tools (such as Flask) can be used to create APIs for the application.

What is the best Python IDE for Data Science? provides a good background on many of these tools.

Software Development Skills

A machine learning engineer should have a strong background in software development. Software development is a logical process with the goal of creating software that addresses business or personal goals. It is normally a planned process that consists of various steps during the creation of operational software:

  • Initial research
  • Specifying
  • Designing
  • Programming
  • Documenting
  • Testing
  • Bug fixing

Practice: GitHub has a variety of collaborative tools that can help in expanding a novice’s knowledge and experience. Testing frameworks such as “nose” can be used to test code, and tools such as Postman can be used to test APIs. CI systems like Jenkins can be used to make sure the code doesn’t break. GitHub is an excellent resource for developing good code review skills.
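To give a feel for what automated tests look like, here is a minimal sketch using `unittest` from Python's standard library, chosen here only because it needs no installation. The `normalize` function and its test cases are invented for the example.

```python
import unittest


def normalize(values):
    """Scale values to the range [0, 1]; a constant list maps to all zeros."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # avoid dividing by zero
    return [(v - low) / (high - low) for v in values]


class NormalizeTests(unittest.TestCase):
    def test_scales_to_unit_range(self):
        self.assertEqual(normalize([2, 4, 6]), [0.0, 0.5, 1.0])

    def test_constant_input_does_not_divide_by_zero(self):
        self.assertEqual(normalize([3, 3]), [0.0, 0.0])


# Build and run the suite explicitly so the script keeps running afterwards.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```

A CI system would typically run tests like these on every change, which is how it catches code that breaks.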

Practice: The open-source Jupyter Notebook is also an excellent resource. It is often installed alongside several important data science libraries, and has a clean, easy, interactive interface. The Jupyter Notebook is a free web application that allows for the creation of shared documents containing live code, visualizations, narrative text, and equations. Uses include statistical modeling, data cleaning and transformation, data visualization, numerical simulation, machine learning, and more. The Jupyter Notebook includes extensions that allow for the results to be shared. Additionally, the files work well with GitHub.

Practice: The Pandas Cookbook offers examples and resources for Pandas, a powerful data manipulation library. It provides easy examples of how to play with datasets.

Working with Datasets

A dataset is simply a collection of data. Generally, a dataset is associated with a database table, and each column of the table represents a particular variable. Each row corresponds to a given “member” of the dataset. There are thousands of data repositories available on the internet, offering access to literally millions of datasets, including the data of local and national governments from around the world. Google launched Dataset Search so researchers can find the data they need for their work. 19 Free Public Data Sets for Your First Data Science Project discusses different datasets available on the internet.
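The table view of a dataset, with columns as variables and rows as members, can be sketched with Python's built-in `csv` module. The city names and figures below are made up purely for illustration.

```python
import csv
import io

# A tiny made-up dataset: each column is a variable, each row is one member.
raw = """city,population,area_km2
Springfield,120000,80
Riverton,45000,30
Lakeside,75000,50
"""

rows = list(csv.DictReader(io.StringIO(raw)))
columns = list(rows[0].keys())

# Pull one variable (a column) out across all members (rows).
populations = [int(row["population"]) for row in rows]
mean_population = sum(populations) / len(populations)

print(columns)
print(len(rows), mean_population)
```

The same pattern applies to a file downloaded from a public repository: replace the `io.StringIO(raw)` wrapper with an opened file.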

Practice: Kaggle Datasets offers many publicly available datasets. It also shows what projects have been built using the same dataset.

Practice: Spark Jupyter notebooks are also hosted on Databricks, which offers a tutorial intro on working with Big Data. It also offers practice for production-level code examples.

Training Machine Learning Models

Deep learning, at its most basic, maps inputs to outputs by finding correlations between them. A deep neural network is sometimes referred to as a “universal approximator.” Deep learning is a form of machine learning that uses multiple layers, which progressively extract higher-level features from the raw input. With image processing, the lower layers might identify edges, while higher layers identify such things as numbers, letters, or faces.
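The idea of stacked layers mapping an input to an output can be sketched in plain Python. This shows only the forward pass of a tiny two-layer network: the weights are hand-picked for the example rather than learned, and all function names are assumptions. A real project would use a framework such as TensorFlow.

```python
def relu(values):
    """Common non-linearity: negative values become zero."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer: each output is a weighted sum of all inputs."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Feed the input through each layer in turn; later layers see earlier outputs."""
    activations = inputs
    for index, (weights, biases) in enumerate(layers):
        activations = dense(activations, weights, biases)
        if index < len(layers) - 1:  # no non-linearity after the final layer
            activations = relu(activations)
    return activations


# Hand-picked (not trained) weights, purely for illustration.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # layer 1: 2 inputs -> 2 hidden units
    ([[1.0, 2.0]], [0.5]),                    # layer 2: 2 hidden units -> 1 output
]

output = forward([2.0, 1.0], layers)
print(output)
```

Training is the process of adjusting those weights so that the outputs match known examples; that step is omitted here.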

Practice: TensorFlow and Deep Learning without a PhD was built by Google, and is an interactive course combining theory with practical labs and code.

Practice: Publicly Available Big Data Sets provides a list of very large datasets that are available for use, and for practice, practice, practice.

Image used under license from Shutterstock.com
