Machine learning techniques for the identification of carcinogenic mutations, which cause breast adenocarcinoma

This study uses machine learning techniques for the detection of breast adenocarcinoma. Several machine learning algorithms are applied to identify the cancer. The overall scheme of the proposed system is shown in Fig. 3.

Figure 3

Systematic diagram of the proposed model.

Figure 3 explains the complete process step by step. Decision tree, random forest and Gaussian Naïve Bayes classifiers are applied under each evaluation method to identify the mutations and thereby detect breast adenocarcinoma. Researchers can use the proposed framework to develop an early-warning diagnostic system based on genomic data, which would allow oncologists to detect and treat breast adenocarcinoma in a more personalized way. The following sections explain the algorithms in detail, together with their test methods and ROC curves.

Collection of reference data sets

The dataset is the most critical factor for any study related to bioinformatics. Typically, the dataset is used for training, testing, and validation. This study aims to use a high-quality, accurate and study-relevant reference dataset to achieve the best results. A meaningful dataset of breast adenocarcinoma driver gene sequences is selected. Normal gene sequences are taken from [...]. Mutation information is taken from the most recent version available at [...]. The IntOGen database does not contain mutated sequences; it contains only information about mutations. An application is therefore developed in Python to incorporate this information into the normal gene sequences, extracted from [...], in order to construct the mutated sequences. Passenger mutations are not carcinogenic; sequences carrying them are therefore treated as normal sequences. Driver mutations are carcinogenic mutations. For the proposed study, 4127 human samples are used, with a total of 6170 mutations in a total of 99 genes involved in breast adenocarcinoma. The genes involved in breast adenocarcinoma are shown in Table 1.
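The construction of a mutated sequence from a normal sequence and a mutation record can be sketched as follows. This is only an illustration, assuming single-nucleotide substitutions described by a 1-based position, a reference base and an alternate base; the function name `apply_substitution` and the example sequence are hypothetical and are not taken from the application described above.

```python
def apply_substitution(sequence: str, position: int, ref: str, alt: str) -> str:
    """Return a copy of `sequence` with the base at 1-based `position`
    changed from `ref` to `alt` (hypothetical helper, substitutions only)."""
    index = position - 1
    if not 0 <= index < len(sequence):
        raise ValueError("position lies outside the sequence")
    if sequence[index] != ref:
        # The mutation record does not match the reference sequence.
        raise ValueError(f"expected {ref!r} at position {position}, found {sequence[index]!r}")
    return sequence[:index] + alt + sequence[index + 1:]


# Made-up example: G -> T at position 4 of a short normal sequence.
normal = "ATGGCCATTG"
mutated = apply_substitution(normal, 4, "G", "T")
print(mutated)  # ATGTCCATTG
```

Checking the reference base before substituting guards against applying a mutation record to the wrong gene or to a sequence with a different coordinate system.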

Table 1 Genes involved in breast adenocarcinoma and mutation.

A word cloud is a visualization technique, available in Python, for representing textual data. The size of each word indicates its frequency and importance [27]. The word cloud in Figure 4 shows the frequency and importance of each nucleotide in all gene sequences related to breast adenocarcinoma.

Figure 4

Breast adenocarcinoma data set word cloud.

Synthetic Minority Oversampling Technique (SMOTE)

SMOTE balances the dataset. An unbalanced dataset is one in which the classes are not equally represented. Two standard techniques are used to balance datasets: oversampling and undersampling. In undersampling, the number of majority-class instances is reduced to balance the dataset, so the overall number of records shrinks. In oversampling, the number of minority-class instances is increased. SMOTE is an oversampling technique: it randomly selects instances of the minority class and uses interpolation to generate new instances between each selected point and its nearby instances.

The steps involved in the SMOTE algorithm are as follows [28]:

  1. Input the dataset and mark its minority and majority classes.

  2. Calculate the number of instances to generate from the oversampling percentage.

  3. From the minority class, pick a random instance (K) and find its neighbours (N).

  4. For each selected neighbour, find the difference between (N) and (K).

  5. Multiply the difference by a random number between 0 and 1 and add the result to (K).

  6. Repeat the process until the required number of instances has been generated.
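The steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in the study: the function name `smote`, the default of five neighbours, the Euclidean distance and the toy data are all assumptions.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate `n_new` synthetic points from the minority-class points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Step 3: pick a random minority instance K and find its k nearest neighbours.
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(p, base))[:k]
        neighbour = rng.choice(neighbours)
        # Steps 4 and 5: scale the difference (N - K) by a random factor in [0, 1)
        # and add it to K, which places the new point on the segment between them.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Step 2: derive the number of new instances from an oversampling percentage.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
oversampling_percent = 150
n_new = len(minority) * oversampling_percent // 100
synthetic = smote(minority, n_new, k=2)
print(len(synthetic))  # 6
```

Because every synthetic point lies on a line segment between two existing minority points, the new data stays inside the region already occupied by the minority class.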

Figure 5 illustrates the creation of synthetic data points in SMOTE [29].

Figure 5

Creating synthetic data points in SMOTE.

The dataset for the proposed study is denoted by \(B\) and defined by Eq. (1).

$${\text{B}} = {\text{B}}^{+} \cup {\text{B}}^{-},$$


where \(B^{+}\) are the mutated gene sequences that cause cancer, \(B^{-}\) are the normal gene sequences, and \(\cup\) is the union of the two sets.

Feature extraction

Here \(H\) denotes the gene sequence [25].

Equations (2) and (3) give the Hahn polynomial and the corresponding two-dimensional Hahn moment.

$${h}_{n}^{r,s}\left(P, Q\right)= {\left(Q+s-1\right)}_{n}{\left(Q-1\right)}_{n}\times \sum\limits_{z=0}^{n}{\left(-1\right)}^{z}\frac{{\left(-n\right)}_{z}{\left(-P\right)}_{z}{\left(2Q+r+s-n-1\right)}_{z}}{{\left(Q+s-1\right)}_{z}{\left(Q-1\right)}_{z}}\frac{1}{z!}.$$


Here \(P\) is an integer in the range 0 to \(Q-1\), \(Q\) is a positive integer, and \((\cdot)_{n}\) denotes the Pochhammer symbol [31]. The Hahn moment for two-dimensional data is found by Eq. (3).

$${H}_{xy}= \sum\nolimits_{j=0}^{G-1}\sum\nolimits_{i=0}^{G-1}{\delta }_{ij}\,{h}_{x}^{r,s}\left(j, Q\right){h}_{y}^{r,s}\left(i,Q\right),\quad x, y= 0, 1, 2, \ldots, Q-1.$$
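The Hahn polynomial can be evaluated directly from its series definition. The sketch below assumes the standard form of the polynomial, with prefactor \((Q+s-1)_{n}(Q-1)_{n}\) and third Pochhammer term \((2Q+r+s-n-1)_{z}\); the function names `pochhammer` and `hahn_polynomial` are illustrative and are not taken from the study.

```python
from math import factorial


def pochhammer(a, k):
    """Rising factorial (a)_k = a (a + 1) ... (a + k - 1), with (a)_0 = 1."""
    result = 1.0
    for i in range(k):
        result *= a + i
    return result


def hahn_polynomial(n, P, Q, r, s):
    """Evaluate the order-n Hahn polynomial at P from its series form (assumed standard form)."""
    prefactor = pochhammer(Q + s - 1, n) * pochhammer(Q - 1, n)
    total = 0.0
    for z in range(n + 1):
        numerator = (pochhammer(-n, z) * pochhammer(-P, z)
                     * pochhammer(2 * Q + r + s - n - 1, z))
        denominator = pochhammer(Q + s - 1, z) * pochhammer(Q - 1, z)
        total += (-1) ** z * numerator / denominator / factorial(z)
    return prefactor * total


# Order 0 is always 1; order 1 with Q = 4, r = s = 0 at P = 1 works out to 3 by hand.
print(hahn_polynomial(0, 2, 4, 0, 0))
print(hahn_polynomial(1, 1, 4, 0, 0))
```

The direct series is adequate for small orders; for large \(n\) the Pochhammer products grow quickly, and a recurrence relation is usually preferred for numerical stability.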


The raw moment is used for data imputation. Imputation replaces missing values in the dataset with suitable substitute values to preserve information [32]. The raw moment of order \(a+b\) for 2D data is expressed by Eq. (4) [33].

$${U}_{ab}= \sum\nolimits_{e=1}^{n}\sum\nolimits_{f=1}^{n}{e}^{a}{f}^{b}{\delta }_{ef}.$$


The centroid \((\overline{x}, \overline{y})\), which can be visualized as the center of the data, is needed to calculate the central moments. Using the centroid, the central moments of order \(r+s\) can be calculated by Eq. (5).

$${V}_{rs}= \sum\nolimits_{e=1}^{n}\sum\nolimits_{f=1}^{n}{\left(e-\overline{x}\right)}^{r}{\left(f-\overline{y}\right)}^{s}{\delta }_{ef}.$$
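The raw and central moments can be computed with two short functions. This is a sketch under stated assumptions: the data are held in a square list of lists indexed from 1, and the centroid is taken as \(\overline{x} = U_{10}/U_{00}\), \(\overline{y} = U_{01}/U_{00}\), which is the usual definition but is not written out in the text. The function names are illustrative.

```python
def raw_moment(matrix, a, b):
    """Raw moment U_ab: sum of e^a * f^b * value over 1-based row e and column f."""
    return sum((e ** a) * (f ** b) * value
               for e, row in enumerate(matrix, start=1)
               for f, value in enumerate(row, start=1))


def central_moment(matrix, r, s):
    """Central moment V_rs about the centroid (assumed to be U10/U00, U01/U00)."""
    mass = raw_moment(matrix, 0, 0)
    x_bar = raw_moment(matrix, 1, 0) / mass
    y_bar = raw_moment(matrix, 0, 1) / mass
    return sum(((e - x_bar) ** r) * ((f - y_bar) ** s) * value
               for e, row in enumerate(matrix, start=1)
               for f, value in enumerate(row, start=1))


data = [[1, 2],
        [3, 4]]
print(raw_moment(data, 0, 0))  # 10
print(raw_moment(data, 1, 0))  # 17
```

With this choice of centroid the first-order central moments vanish, which gives a quick sanity check on an implementation.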


The Position Relative Incidence Matrix (PRIM) is used to determine the position of each gene in the breast adenocarcinoma gene sequence. The matrix formed by PRIM, with dimension 20 by 20, is shown in Eq. (6) [34].

$${R}_{PRIM}= \left[\begin{array}{cccccc}{R}_{1\to 1}& {R}_{1\to 2}& \cdots & {R}_{1\to q}& \cdots & {R}_{1\to M}\\ {R}_{2\to 1}& {R}_{2\to 2}& \cdots & {R}_{2\to q}& \cdots & {R}_{2\to M}\\ \vdots & \vdots & & \vdots & & \vdots \\ {R}_{p\to 1}& {R}_{p\to 2}& \cdots & {R}_{p\to q}& \cdots & {R}_{p\to M}\\ \vdots & \vdots & & \vdots & & \vdots \\ {R}_{M\to 1}& {R}_{M\to 2}& \cdots & {R}_{M\to q}& \cdots & {R}_{M\to M}\end{array}\right].$$


Feature scaling allows every data sample to participate in breast cancer detection [30]. A machine learning algorithm is most effective when the most relevant information has been extracted from the data. PRIM does not extract all of the information in the data, so the Reverse Position Relative Incidence Matrix (RPRIM) is also computed; it works the same way as PRIM, but on the sequence in reverse order.

The frequency matrix provides information about the presence of genes in the gene sequence. The Accumulative Absolute Position Incidence Vector (AAPIV) captures information about the composition of the gene sequence, and the relative positioning of the cancer gene is found using AAPIV. Equation (7) gives the AAPIV of a gene sequence [35].

$${\text{AAPIV}} = \left\{ {\upvarepsilon_{1} ,\;\upvarepsilon_{2} ,\;\upvarepsilon_{3} , \ldots ,\upvarepsilon_{N} } \right\}.$$


The Reverse Accumulative Absolute Position Incidence Vector (RAAPIV) works the same way as AAPIV, but on the sequence in reverse order. RAAPIV is given by Eq. (8).

$${\text{RAAPIV}} = \left\{ {\upvarepsilon_{1} ,\;\upvarepsilon_{2} ,\;\upvarepsilon_{3} , \ldots ,\upvarepsilon_{N} } \right\}.$$
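The frequency vector, AAPIV and RAAPIV can be illustrated for a nucleotide sequence as follows. The text does not spell out how each element is computed, so this sketch assumes the common formulation in which each element is the sum of the 1-based positions at which the corresponding symbol occurs. The four-letter alphabet and the function names are also assumptions.

```python
ALPHABET = "ACGT"  # assumed nucleotide alphabet


def frequency_vector(seq):
    """Number of occurrences of each symbol of ALPHABET in `seq`."""
    return [seq.count(symbol) for symbol in ALPHABET]


def aapiv(seq):
    """For each symbol, the sum of the 1-based positions where it occurs (assumed definition)."""
    return [sum(pos for pos, base in enumerate(seq, start=1) if base == symbol)
            for symbol in ALPHABET]


def raapiv(seq):
    """AAPIV computed on the reversed sequence."""
    return aapiv(seq[::-1])


seq = "ACGTA"
print(frequency_vector(seq))  # [2, 1, 1, 1]
print(aapiv(seq))             # [6, 2, 3, 4]
print(raapiv(seq))            # [6, 4, 3, 2]
```

Two sequences with the same composition have the same frequency vector but generally different AAPIV and RAAPIV values, which is why the positional vectors add information.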


Prediction algorithms

This study uses decision tree, Gaussian Naïve Bayes and random forest classifiers for the prediction of breast adenocarcinoma.

A decision tree is a supervised machine learning technique, mainly used for classification and regression problems. In a decision tree, the root node receives the input; the data are then filtered through decision nodes until they reach the leaf nodes, which give the desired output [35,36,37]. Entropy controls how the data are split in the decision tree, and information gain indicates how much information a feature provides about the class. Equations (9) and (10) give the formulas for entropy and information gain in a decision tree [38].


$${\text{IG}} = {\text{Entropy}}\left( {\text{Parent}} \right) - {\text{Average Entropy}}\left( {\text{Child}} \right).$$
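Entropy and information gain can be computed for a candidate split as in the sketch below. It assumes the usual Shannon entropy with base-2 logarithms and reads "average entropy" as the size-weighted average over the child nodes; the function names and the toy labels are illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted average entropy of the children."""
    total = len(parent)
    average_child_entropy = sum(len(child) / total * entropy(child)
                                for child in children)
    return entropy(parent) - average_child_entropy


# A split that separates the two classes perfectly recovers the full parent entropy.
parent = ["driver"] * 4 + ["passenger"] * 4
children = [["driver"] * 4, ["passenger"] * 4]
gain = information_gain(parent, children)
print(gain)  # 1.0
```

A decision tree learner evaluates this quantity for each candidate split and chooses the split with the largest gain.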


In the decision tree, data flows through the nodes. Figure 6 explains how the decision tree algorithm works [39].

Figure 6

Naïve Bayes is a classification algorithm, widely used in data mining, that is based on Bayes' theorem and uses simple probabilities. Bayes' theorem is given by Eq. (11) [40].

$$P\left(B|Y\right)= \frac{P\left(B\right)P\left(Y|B\right)}{P\left(Y\right)}.$$


Here, \(P\) denotes probability, \(B\) is the class, and \(Y\) is the attribute. Figure 7 explains the Naïve Bayes classification [41].

Figure 7

The algorithm for NB is

figure a

Here \({D}_{t}\) is the set of training examples, \(I\) is the instance, and \({X}_{i}\) is a random variable [42]. Naïve Bayes is a simple algorithm used in many fields of medical science [43].
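A minimal Gaussian Naïve Bayes classifier can be written in plain Python as below. It is a sketch, not the algorithm listing referred to above: it assumes that each feature is normally distributed within each class, works with log-probabilities to avoid underflow, and adds a small constant to each variance. The function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate the prior, per-feature mean and per-feature variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(X, y):
        grouped[label].append(row)
    model = {}
    for label, rows in grouped.items():
        prior = len(rows) / len(X)
        columns = list(zip(*rows))
        means = [sum(col) / len(col) for col in columns]
        # A small constant keeps the variance strictly positive (an assumption of this sketch).
        variances = [sum((v - m) ** 2 for v in col) / len(col) + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (prior, means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the largest log P(class) + sum of log P(feature | class)."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)


X = [[1.0, 2.0], [1.2, 1.8], [5.0, 6.0], [5.2, 5.9]]
y = [0, 0, 1, 1]
model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [1.1, 2.0]))  # 0
print(predict_gaussian_nb(model, [5.1, 6.1]))  # 1
```

The denominator of Bayes' theorem is the same for every class, so it can be dropped when only the most probable class is needed.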

Random Forest (RF) is the third algorithm applied to all evaluation methods. It is a collection of tree predictors, each built on a different sample of the data, so that each tree can lead to a different result. It is an ensemble learning method for regression and classification that builds a multitude of decision trees [44]. The results of the individual trees are merged into an averaged result. Figure 8 illustrates how the random forest algorithm works [45].

Figure 8

Working of the Random Forest algorithm.

The mean squared error (MSE) is the average of the squared differences between the actual values and the predicted values. The MSE in RF is given by Eq. (12).

$$MSE = \frac{1}{N}\sum\nolimits_{i=1}^{N}{({f}_{i}-{y}_{i})}^{2}.$$


In the equation, \({({f}_{i}-{y}_{i})}^{2}\) is the squared error of sample \(i\), where \({y}_{i}\) is the predicted value and \({f}_{i}\) is the actual value.
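The averaging of tree outputs and the MSE can be illustrated with two small helpers. This is a sketch only: the per-tree outputs below are made-up numbers standing in for the predictions of three trained trees, and the function names are hypothetical.

```python
def average_predictions(tree_outputs):
    """Average the predictions of several trees, sample by sample."""
    n_trees = len(tree_outputs)
    return [sum(per_sample) / n_trees for per_sample in zip(*tree_outputs)]


def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((f - y) ** 2 for f, y in zip(actual, predicted)) / len(actual)


# Made-up outputs of three trees for four samples (1 = driver, 0 = passenger).
tree_outputs = [
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]
forest_prediction = average_predictions(tree_outputs)
actual = [1, 0, 1, 1]
error = mean_squared_error(actual, forest_prediction)
print(forest_prediction)
print(error)
```

For classification, the averaged value can be thresholded (for example at 0.5) to obtain a class label, which is equivalent to a majority vote over the trees.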

$$Entropy = \sum\nolimits_{i=1}^{C}-{p}_{i}\times {\text{log}}_{2}{p}_{i}.$$


Entropy is used to measure uncertainty and disorder. In Eq. (13), \({p}_{i}\) is the prior probability of class \(i\) and \(C\) is the number of unique classes [46].

Sherry J. Basler