Interview Guide: Machine Learning
Impress your future employer
Interview Guide: Machine Learning
You are an aspiring Machine Learning engineer, data scientist or analyst, or researcher preparing for an interview. This guide will help you get into the right mindset, and review essential skills that you may be evaluated on.
Company and Role
If a company is looking to hire a Machine Learning Engineer, it’s clear that they are trying to solve a complex problem where traditional algorithmic solutions are hard to apply or simply do not work well enough. Needless to say, they are also extremely motivated to solve that problem (otherwise they wouldn’t be in this business!).
Identify a Core Problem
The first thing you need to do when applying for such a role, and this may sound cliché, is to imagine yourself in that role – find out as much as possible about the company and position. Then ask yourself: What is one core problem I need to solve? The answer should excite you, and drive you to find out more about the problem, existing approaches, recent developments in that domain, and lead you to a bunch of more specific challenges. If you know what team you are being interviewed for, picking an appropriate problem might be easy; otherwise, choose something that is essential for the company.
For example, as a Machine Learning Engineer at Udacity, your primary responsibility could be to improve student engagement and retention. It is widely known that MOOC completion rates are abysmally low. This is partly due to factors that are out of our control, but we’d surely like to do better. Why do students drop out of our programs? Do they find our content too hard, or too easy? How can we improve student engagement in the classroom? What do students really need - how can we add value to their lives?
Explore Potential Data Sources
Next, think about what data you would need to answer these questions. Some of this may be readily available, while you may have to build in additional hooks to gather certain pieces of information. Dig into the company’s infrastructure and operations - what stack do they operate on, what APIs do they have, what data are they already collecting, etc. Most companies today have a blog where they often discuss their challenges, approaches, successes, and failures. This should give you further insight into how they operate, and what products and services they might have in the pipeline.
Prepare to Discuss Machine Learning Solutions
Now you need to make a fairly big conceptual jump: How does machine learning fit into all this? Given what you’re trying to achieve, and the data you think might be available, can you frame it as a learning problem? What is an appropriate model to use? How would you go about training and evaluating it? To give you an example, the primary challenge that a lot of recommendation systems like Netflix and Amazon face is clustering, not prediction - i.e. once you are able to figure out groups of users who seem to have similar preference and behavior, it becomes a whole lot easier to recommend products that they may find useful.
This thought process will help you be prepared to talk about issues that matter to the company the most. Nobody expects you to walk into an interview and lay out a complete solution for something they’ve been working hard on for months or years! But everybody likes a candidate who shows genuine interest, motivation, and curiosity for a problem that is close to their hearts.
Depending on your interviewer and the stage of your interview, you may be asked more technical questions, but you should try to use any opportunity you get to demonstrate that you have thought about the company and role. When asked more open-ended questions such as “Describe a technical challenge you faced when working on a project and how you solved it?”- try to pick something that aligns well with the company’s interests.
Skills and Sample Questions
Essential skills that a Machine Learning Engineer needs can be divided into the following groups. Within each group are topics that you should be familiar with, and some sample questions that you may be asked in an interview.
Computer Science Fundamentals and Programming
- Data structures: Lists, stacks, queues, strings, hash maps, vectors, matrices, classes & objects, trees, graphs, etc.
- Algorithms: Recursion, searching, sorting, optimization, dynamic programming, etc.
- Computability and complexity: P vs. NP, NP-complete problems, big-O notation, approximate algorithms, etc.
- Computer architecture: Memory, cache, bandwidth, threads & processes, deadlocks, etc.
- How would you check if a linked list has cycles?
- Given two elements in a binary search tree, find their lowest common ancestor.
- Write a function to sort a given stack.
- What is the time complexity of any comparison-based sorting algorithm? Can you prove it?
- How will you find the shortest path from one node to another in a weighted graph? What if some weights are negative?
- Find all palindromic substrings in a given string.
For all such questions, you should be able to reason about the time and space complexity of your approach (usually in big-O notation), and try to aim for the lowest complexity possible.
Extensive practice is the only way to familiarize yourself with the different classes of problems, so that you can quickly converge on an efficient solution. Coding/interview prep platforms like InterviewBit, LeetCode and Pramp are highly beneficial for this purpose.
Probability and Statistics
- Basic probability: Conditional probability, Bayes rule, likelihood, independence, etc.
- Probabilistic models: Bayes Nets, Markov Decision Processes, Hidden Markov Models, etc.
- Statistical measures: Mean, median, mode, variance, population parameters vs. sample statistics etc.
- Proximity and error metrics: Cosine similarity, mean-squared error, Manhattan and Euclidean distance, log-loss, etc.
- Distributions and random sampling: Uniform, normal, binomial, Poisson, etc.
- Analysis methods: ANOVA, hypothesis testing, factor analysis, etc.
- The mean heights of men and women in a population were calculated to be Mand W. What is the mean height of the total population?
- A recent poll revealed that a third of the cars in Italy are Ferraris, and that half of those are red. If you spot a red car approaching from a distance, what is the likelihood that it is a Ferrari?
- You’re trying to find the best place to put in an advertisement banner on your website. You can make the size (thickness) small, medium or large, and choose vertical position top, middle or bottom. At least how many total page visits (n) and ad clicks (m) do you need to say with 95% confidence that one of the designs performs better than all the other possibilities?
- The time period between consecutive eruptions of the Old Faithful geyser in Yellowstone National Park is found to have the following distribution. How would you describe/characterize it? What can you infer from it?
Remember that many machine learning algorithms have a basis in probability and statistics. Conceptual clarity of these fundamentals is extremely important, but at the same time, you must be able to relate abstract formulae with real-world quantities.
Data Modeling and Evaluation
- Data preprocessing: Munging/wrangling, transforming, aggregating, etc.
- Pattern recognition: Correlations, clusters, trends, outliers & anomalies, etc.
- Dimensionality reduction: Eigenvectors, Principal Component Analysis, etc.
- Prediction: Classification, regression, sequence prediction, etc.; suitable error/accuracy metrics.
- Evaluation: Training-testing split, sequential vs. randomized cross-validation, etc.
- A dairy farmer is trying to understand the factors that affect milk production of her cattle. She has been keeping logs of the daily temperature (usually 30-40°C), humidity (60-90%), feed consumption (2000-2500 kgs), and milk produced (500-1000 liters).
- How would you begin processing the data in order to model it, with the goal of predicting liters of milk produced in a day?
- What kind of machine learning problem is this?
- Your company is building a facial expression coding system, which needs to take input images from a standard HD 1920x1080 pixel webcam, and continuously tell whether the user is in one of the following states: neutral, happy, sad, angry or afraid. When the user’s face is not visible in the camera frame, it should indicate a special state: none.
- What class of machine learning problems does this belong to?
- If each pixel is made up of 3 values (for red, green, blue channels), what is the raw input data complexity (no. of dimensions) for processing each image? Is there a way to reduce the no. of dimensions?
- How would you encode the output of the system? Explain why.
- Climate data collected over the past century reveals a cyclic pattern of rising and falling temperatures. How would you model this data (a sequence of average annual temperature values) to predict the average temperature over the next 5 years?
- Your job at an online news service is to collect text reports from around the world, and present each story as a single article with content aggregated from different sources. How would you go about designing such a system? What ML techniques would you apply?
Applying Machine Learning Algorithms and Libraries
- Models: Parametric vs. nonparametric, decision tree, nearest neighbor, neural net, support vector machine, ensemble of multiple models, etc.
- Learning procedure: Linear regression, gradient descent, genetic algorithms, bagging, boosting, and other model-specific methods; regularization, hyperparameter tuning, etc.
- Tradeoffs and gotchas: Relative advantages and disadvantages, bias and variance, overfitting and underfitting, vanishing/exploding gradients, missing data, data leakage, etc.
- You’re trying to classify images of cats and dogs. Plotting the images in some transformed 2-dimensional feature space reveals the following pattern (on the left). In some other space, images of dogs and wolves show a different pattern (on the right).
- What model would you use to classify cats vs. dogs, and what would you use for dogs vs. wolves? Why?
- I’m trying to fit a single hidden layer neural network to a given dataset, and I find that the weights are oscillating a lot over training iterations (varying wildly, often swinging between positive and negative values). What parameter do I need to tune to address this issue?
- When training a support vector machine, what value are you optimizing for?
- Lasso regression uses the L1-norm of coefficients as a penalty term, while ridge regression uses the L2-norm. Which of these regularization methods is more likely to result in sparse solutions, where one or more coefficients are exactly zero?
- When training a 10-layer neural net using backpropagation, I find that the weights for the top 3 layers are not changing at all! The next few layers (4-6) are changing, but very slowly. What’s going on and how do I fix this?
- I’ve found some data about wheat-growing regions in Europe that includes annual rainfall (R, in inches), mean altitude (A, in meters) and wheat output (O, in kgs/km2). A rough analysis and some plots make me believe that output is related to the square of rainfall, and log of altitude: O = β0 + β1 × R2 + β2 × loge(A)
Can I fit the coefficients (β) in my model to the data using linear regression?
Data science and Machine Learning challenges such as those on Kaggle are a great way to get exposed to different kinds of problems and their nuances. Try to participate in as many as you can, and apply different machine learning models.
Software Engineering and System Design
- Software interface: Library calls, REST APIs, data collection endpoints, database queries, etc.
- User interface: Capturing user inputs & application events, displaying results & visualization, etc.
- Scalability: Map-reduce, distributed processing, etc.
- Deployment: Cloud hosting, containers & instances, microservices, etc.
- You run an ecommerce website. When a user clicks on an item to open its details page, you would like to suggest 5 more items that the user may be interested in, based on item features as well as the user’s purchase history, and display them at the bottom of the page. What services and database tables would you need to support this behavior? Assuming they’re available, write a query or procedure to fetch the 5 items to suggest.
- What data would you like to collect from an online video player (like YouTube) to measure user engagement and video popularity?
- A very simple spam detection system works as follows: It processes one email at a time and counts the number of occurrences of each unique word in it (term frequency), and then it compares those counts with those of previously seen emails which have been marked as spam or not. In order to scale up this system to handle a large volume of email traffic, can you design a map-reduce scheme that can run on a cluster of computers?
- You want to generate a live visualization of what portion of a webpage users are currently viewing and clicking, sort of like a heat map. What components/services/APIs do you need in place, on the client and server end, to enable this?