How To Complete A Machine Learning Project | by Tutort Academy | January 2022

A machine learning project follows a set of typical steps. First, we collect data according to our business needs. The next step is to clean the data: handling missing values, removing outliers, dealing with imbalanced datasets, converting categorical variables to numeric values, and so on.

Next, a model is trained using one of several machine learning or deep learning methods. The model is then evaluated using metrics such as recall, F1-score, and precision. Finally, the model is deployed to the cloud and retrained when needed.

So, let’s start:

Structured data appears as a table (rows and columns, similar to an Excel spreadsheet). It includes several types of data, such as numerical, categorical, and time series data.

  • Nominal/categorical: this or that (mutually exclusive). Car color, for example, is a category. A car can be blue, but then it cannot also be white. The order of the categories does not matter.
  • Numerical: any continuous value where the difference between values is meaningful.
  • Ordinal data: data with an order but an unknown distance between values. For example, how would you rate your health on a scale of 1 to 5? One is poor and five is healthy. You can answer 1, 2, 3, 4, or 5, but the distance between the values does not necessarily mean that a 5 is five times better than a 1.
  • Time series data: data recorded over time. For example, the historical selling prices of bulldozers from 2012 to 2018.

Unstructured data: information that lacks a rigid structure (e.g. photos, video, and natural language in speech and writing).

Learn more about the data you process through exploratory data analysis (EDA).

  • Data preparation
  • What are the feature (input) variables and the target (output) variables? For example, the feature variables for predicting heart disease could be a person’s age, weight, average heart rate, and level of physical activity. The target variable would be whether or not they have the disease.
  • What kind of data do you have? Structured, unstructured, numerical, time series, or a combination? Are there missing values? Should they be removed or filled in using feature imputation?
  • Where are the outliers? How many are there? Why are they there? Are there any questions about the data you could ask a domain expert? Could a heart disease doctor, for example, shed some light on your heart disease dataset?
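Two of the EDA questions above (are there missing values, and where are the outliers?) can be answered with a few lines of code. The following is only a minimal sketch using the Python standard library. The toy patient rows, the column names, and the helper functions `count_missing` and `iqr_outliers` are made up for illustration, and the 1.5 × IQR rule is just one common rule of thumb for flagging outliers.

```python
import statistics

# Hypothetical toy dataset: one row per patient, None marks a missing value.
rows = [
    {"age": 52, "weight_kg": 80.0, "avg_heart_rate": 72, "has_disease": 1},
    {"age": 47, "weight_kg": None, "avg_heart_rate": 65, "has_disease": 0},
    {"age": 61, "weight_kg": 95.5, "avg_heart_rate": 210, "has_disease": 1},
    {"age": 39, "weight_kg": 70.2, "avg_heart_rate": 70, "has_disease": 0},
    {"age": 58, "weight_kg": 88.1, "avg_heart_rate": 75, "has_disease": 1},
    {"age": 44, "weight_kg": 77.3, "avg_heart_rate": 68, "has_disease": 0},
]


def count_missing(rows):
    """Count the None entries in each column."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            counts.setdefault(name, 0)
            if value is None:
                counts[name] += 1
    return counts


def iqr_outliers(values):
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


missing = count_missing(rows)
heart_rates = [r["avg_heart_rate"] for r in rows if r["avg_heart_rate"] is not None]
outliers = iqr_outliers(heart_rates)

print("Missing values per column:", missing)
print("Suspicious heart rates:", outliers)
```

A flagged value is only a candidate: whether an average heart rate of 210 is a typo or a real measurement is exactly the kind of question to take to a domain expert.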

Data preprocessing is the process of preparing your data for modeling.

Missing values can be filled in with feature imputation:

  • Single imputation: fill with the mean or median of the column.
  • Multiple imputation: model the missing values from the other features and fill them in with what the model predicts.
  • KNN (k-nearest neighbors): fill the missing data with a value from a similar example.
  • There are many others, including random imputation, last observation carried forward (for time series), moving window, and most frequent value.
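As a rough illustration of the first and third imputation approaches above, here is a small standard-library sketch. The function names `impute_single` and `impute_knn` are hypothetical, the KNN variant assumes every column other than the one being filled is complete, and a real project would more likely rely on an existing library implementation.

```python
import statistics


def impute_single(column, strategy="mean"):
    """Fill None entries with the mean or median of the observed values."""
    observed = [v for v in column if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in column]


def impute_knn(rows, target_idx, k=2):
    """Fill missing values in column target_idx with the average of that
    column over the k most similar complete rows (Euclidean distance
    computed on the other columns, which are assumed to be complete)."""
    complete = [r for r in rows if r[target_idx] is not None]
    filled = []
    for row in rows:
        if row[target_idx] is not None:
            filled.append(list(row))
            continue

        def distance(other, row=row):
            return sum(
                (a - b) ** 2
                for i, (a, b) in enumerate(zip(row, other))
                if i != target_idx
            ) ** 0.5

        nearest = sorted(complete, key=distance)[:k]
        new_row = list(row)
        new_row[target_idx] = statistics.mean(r[target_idx] for r in nearest)
        filled.append(new_row)
    return filled


# Toy data: (age, weight in kg); the second person's weight is missing.
weights = [80.0, None, 95.0, 70.0]
people = [(50, 80.0), (48, None), (30, 60.0), (52, 84.0)]

median_filled = impute_single(weights, strategy="median")
knn_filled = impute_knn(people, target_idx=1, k=2)

print(median_filled)
print(knn_filled)
```

Here the median fills the gap with a "typical" weight for the whole column, while the KNN version borrows from the two people closest in age.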

  • Feature encoding (turning values into numbers): all values in a machine learning model must be numeric.
  • One-hot encoding: turn each unique value into a list of 0s and 1s, where the position of the actual value is 1 and the rest are 0s. For example, if the car colors are green, red, and blue, a green car would be represented as [1, 0, 0] and a red car as [0, 1, 0].
  • Label encoding: turn labels into distinct integer values. If your target variables are different animals, such as a dog, cat, or bird, these can become 0, 1, and 2, respectively.
  • Embedding encoding: learn a representation across all data points. A language model, for example, is a representation of how different words relate to each other. Embeddings are also becoming more widely available for structured data.
  • Normalization (scaling) or standardization of features: some machine learning techniques don’t work well when your numeric variables are at different scales (e.g. number of bathrooms is between 1 and 5 and lot size is between 500 and 20,000 square feet). Scaling and normalizing can help with this.
  • Feature engineering is the process of transforming data into a more relevant representation using domain knowledge.
  • Feature selection is the process of selecting the most valuable features in your dataset to model. It can reduce overfitting and training time while improving accuracy.
  • Dealing with imbalance: does your data contain 10,000 examples of one class but only 100 of another?
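The encoding and scaling steps in the list above are simple enough to sketch by hand. The snippet below is a minimal standard-library illustration, not a reference implementation: the helper names (`one_hot_encode`, `label_encode`, `min_max_scale`, `standardize`) are made up for this example, and the label encoder assigns integers in order of first appearance, which is only one possible convention.

```python
import statistics


def one_hot_encode(values, categories):
    """Turn each value into a list of 0s with a single 1 at its category's position."""
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        vec = [0] * len(categories)
        vec[index[v]] = 1
        encoded.append(vec)
    return encoded


def label_encode(values):
    """Map each distinct label to an integer, in order of first appearance."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping))
    return [mapping[v] for v in values], mapping


def min_max_scale(values):
    """Normalization: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Standardization: shift to zero mean and scale to unit variance."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


colors = one_hot_encode(["green", "red"], categories=["green", "red", "blue"])
animals, animal_mapping = label_encode(["dog", "cat", "bird", "cat"])
bathrooms = min_max_scale([1, 3, 5])
lot_sizes = min_max_scale([500, 10250, 20000])
lot_sizes_std = standardize([500, 10250, 20000])

print(colors)
print(animals, animal_mapping)
print(bathrooms, lot_sizes)
print(lot_sizes_std)
```

After scaling, the number of bathrooms and the lot size live on the same 0 to 1 range, so neither dominates a distance-based or gradient-based model just because of its units.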

Data Splitting

  • Training set (typically 70-80% of the data): the model learns from this set.
  • Validation set (typically 10-15% of the data): the model’s hyperparameters are tuned on this set.
  • Test set (typically 10-15% of the data): the model’s final performance is measured on this set. If you did everything right, the results on the test set should give you a good idea of how the model will behave in the real world. Do not use this set to tune the model.
  • Supervised algorithms: linear regression, logistic regression, KNN, SVMs, decision trees and random forests, and boosting methods (AdaBoost, gradient boosting machines).
  • Unsupervised algorithms: clustering, dimensionality reduction (PCA, autoencoders, t-SNE), and anomaly detection.
  • Underfitting occurs when your model does not perform as well on your data as you would like. Try training for longer or using a more advanced model.
  • Overfitting occurs when your validation loss starts to increase, or when the model performs noticeably better on the training set than on the test set.
  • Regularization is a set of techniques that prevent or reduce overfitting.
  • Hyperparameter tuning means running a series of experiments with different settings to see which works best.
  • Classification metrics: accuracy, precision, recall, F1, confusion matrix, mean average precision.
  • Regression metrics: MSE, MAE, R².
  • Task-based metrics: for a self-driving car, for example, you might want to know the number of disengagements.
  • Put the model into production and see how it goes.
  • Tools you can use: Amazon SageMaker, TensorFlow Serving, TorchServe (for PyTorch), Google AI Platform.
  • MLOps: the intersection of software engineering with machine learning, essentially all the technologies needed to get a machine learning model working in production.
  • See how the model performs after (or before) release against various evaluation metrics, and repeat the previous steps as needed (remember that machine learning is highly experimental, so this is where you’ll want to track your data and experiments).
  • You will also notice that your model’s predictions start to “age” (usually not in a good way) or “drift” as data sources change or are upgraded (new hardware, etc.). This is when the model needs to be retrained.
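To make the data splitting and evaluation steps from the list above more concrete, here is a minimal standard-library sketch. The function names, the 70/15/15 default proportions, and the toy labels are assumptions chosen for illustration. In a real project you would normally use the equivalent utilities from a machine learning library instead of writing these by hand.

```python
import random


def train_val_test_split(rows, train=0.7, val=0.15, seed=0):
    """Shuffle the rows, then cut them into training, validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def regression_metrics(y_true, y_pred):
    """Mean squared error, mean absolute error, and R squared."""
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mse": ss_res / n,
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "r2": 1 - ss_res / ss_tot,
    }


train_set, val_set, test_set = train_val_test_split(list(range(100)))
clf = classification_metrics([1, 0, 1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 0, 1, 1, 0])
reg = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 8.0])

print(len(train_set), len(val_set), len(test_set))
print(clf)
print(reg)
```

The same metric functions can be rerun on fresh production data after release. A steady decline in those numbers is one simple signal that the model has drifted and should be retrained.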

Thanks for reading. Please share this post with your friends if you enjoyed it. If you have any suggestions or questions, please leave them in the comment section.

You can also reach us on LinkedIn and visit our website for more information on our employment-focused courses for working professionals in software development, data science, machine learning, and artificial intelligence.

Sherry J. Basler