Mastering Classification Algorithms for Machine Learning: Learn how to apply Classification algorithms for effective Machine Learning solutions (English Edition)
Ebook · 597 pages · 4 hours

About this ebook

Classification algorithms are essential in machine learning as they allow us to make predictions about the class or category of an input by considering its features. These algorithms have a significant impact on multiple applications like spam filtering, sentiment analysis, image recognition, and fraud detection. If you want to expand your knowledge about classification algorithms, this book is the ideal resource for you.

The book starts with an introduction to problem-solving in machine learning and subsequently focuses on classification problems. It then explores the Naïve Bayes algorithm, a probabilistic method widely used in industrial applications. The application of Bayes Theorem and underlying assumptions in developing the Naïve Bayes algorithm for classification is also covered. Moving forward, the book centers its attention on the Logistic Regression algorithm, exploring the sigmoid function and its significance in binary classification. The book also covers Decision Trees and discusses the Gini Factor, Entropy, and their use in splitting trees and generating decision leaves. The Random Forest algorithm is also thoroughly explained as a cutting-edge method for classification (and regression). The book concludes by exploring practical applications such as Spam Detection, Customer Segmentation, Disease Classification, Malware Detection in JPEG and ELF Files, Emotion Analysis from Speech, and Image Classification.

By the end of the book, you will become proficient in utilizing classification algorithms for solving complex machine learning problems.
Language: English
Release date: May 23, 2023
ISBN: 9789355518484

    Book preview

    Mastering Classification Algorithms for Machine Learning - Partha Majumdar

    CHAPTER 1

    Introduction to Machine Learning

    Welcome to this book.

    In this book, we will explore models for classifying data. We need to classify data for various purposes. For example, from piles of credit card transaction data, we need to find out whether there are any fraudulent transactions; essentially, we are classifying the data into two classes – good transactions and fraudulent transactions. As another example, from pictures of food items, we need to figure out whether a food item would suit a diabetic patient.

    Human beings are experts at classification in most situations. However, in the modern world, the volume of data to classify is too large for humans alone. So, we need machines that can classify as effectively as humans, so that it is practical to meet the demand.

    This book will discuss various models with which machines can effectively classify data. Before we discuss these classification models, we start with a discussion of what machine learning is. We will also explore how machines can be made to learn.


    Structure

    In this chapter, we will discuss the following topics:

    Machine learning

    Traditional programming versus programming for machine learning

    The learning process of a machine

    Kinds of data machines can learn from

    Types of machine learning

    Supervised learning

    Unsupervised learning

    Objectives

    After reading this chapter, you will be able to differentiate between traditional programming and programming for machine learning. You will also understand the different kinds of problems that can be solved by machine learning.

    Machine learning

    Neuroscientist Warren S. McCulloch and logician Walter H. Pitts published A Logical Calculus of the Ideas Immanent in Nervous Activity in 1943 in the Bulletin of Mathematical Biophysics, Vol. 5. In this paper, they discussed a mathematical model of neural networks. This was one of the first attempts to make machines think like the human brain. ¹

    What it means to be able to think is a vast subject. We can make a simple abstraction, as shown in Figure 1.2. Thinking is a process of collecting data, finding patterns in the data, and making inferences from the patterns.


    1 https://www.cse.chalmers.se/~coquand/AUTOMATA/mcp.pdf.

    Figure 1.2: Abstraction of how thinking is performed

    Let us discuss the process of thinking through an example. Suppose the data provided to us is a massive pile of medicines. On receiving this data, we could find patterns such as which medicines are similar to each other. We may study the composition of the medicines, their manufacturers, and many other attributes. Based on the patterns we find, we may classify the medicines into groups according to which diseases they cure.

    Machine learning is like this. We present data to the machine and sometimes provide information about the data. Based on this information and knowledge, the machine finds patterns and treats these patterns as its rules. Once the machine has formulated its rules, it makes inferences about new situations.

    Machine learning is a branch of Artificial Intelligence (AI). In machine learning, using mathematical modeling on data, a machine is made to learn the patterns in the data without being explicitly programmed with rules.

    Traditional programming versus programming for machine learning

    Programming for machine learning is different from traditional programming.

    In traditional programming, we have data and rules. We apply the rules to the data to get the Output. Refer to Figure 1.3:

    Figure 1.3: Traditional Programming

    Consider this example from the world of Physics. When we want the computer to calculate the value of momentum, we tell the computer that the formula for momentum is the mass multiplied by the velocity, and we give the computer the values of mass and velocity. Here, the values of mass and velocity are the Data. On this data, the computer applies the Rule, that is, the formula for momentum, to find the value of momentum for us. The value of momentum calculated by the computer is the Output.

    Momentum = Mass * Velocity

    Generally written as,

    Momentum = mv
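    To make this concrete, here is a minimal illustrative sketch in Python (not code from this book) of the traditional-programming setup, where the programmer writes the Rule and the computer applies it to the Data:

    # Traditional programming: the programmer supplies the Rule;
    # the computer applies it to the Data to produce the Output.
    def momentum(mass, velocity):
        # Rule: Momentum = mass * velocity
        return mass * velocity

    # Data: mass = 7 kg, velocity = 8 km/h
    print(momentum(7, 8))    # Output: 56 (kg * km/h)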

    In contrast to traditional programming, in machine learning, we supply the computer with Data and Output, and we expect the computer to generate the Rules as shown in Figure 1.4:

    Figure 1.4: Programming for Machine Learning

    Suppose we had a mechanism to obtain values of momentum from some experiment, and we knew the values of mass and velocity in each of the experiments. Now, if we want the computer to determine the formula for momentum, that would be a machine learning situation. So, we would input the values of mass, velocity, and momentum and tell the machine to determine the formula for calculating momentum.

    The learning process of a machine

    Let us discuss, in a simplified way, how machines learn. As you would imagine, the actual process is much more complex.

    Consider that we have the following data from an experiment (refer to Figure 1.5). We ask the machine to provide a relationship between momentum, mass, and velocity.

    Figure 1.5: Input to a computer to create a Machine Learning Model

    For the machine to build a model, the data scientist must tell the machine what model to make. Generally, the data scientist first tries to understand the data. This step is called Exploratory Data Analysis. In the preceding situation, we have two independent variables, m and v, and one dependent variable, M. We can plot this data on a 2-dimensional chart as shown in Figure 1.6:

    Figure 1.6: Scatter plot based on the data in Figure 1.5

    Let us say that the data scientist decides to create a linear model of the form M = β0 + β1 * m + β2 * v. The machine needs to estimate the values of β0, β1, and β2.

    The data scientist provides starting values for β0, β1, and β2. Let us say that these values are β0 = 5, β1 = 5, and β2 = 5. Using these values, the machine calculates the values for M, as shown in Figure 1.7. We call the value calculated by the machine Mhat.
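    For instance, for a hypothetical data point with m = 2 and v = 3 (values assumed here purely for illustration; the actual data is in Figure 1.5), the machine would compute Mhat = 5 + 5 * 2 + 5 * 3 = 30.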

    Figure 1.7: Initial estimates of Momentum (M) made by the machine

    If we plot this data, we get the chart shown in Figure 1.8, where the dots are the actual values of M as provided to the machine. The stars are the values of M estimated by the machine:

    Figure 1.8: Plot of the machine's initial Momentum (M) estimates

    We can see that the machine did not do so well. However, the machine continues. The machine calculates its error in making the estimates, as shown in Figure 1.9. We see that the machine can overestimate or underestimate, so the error can be negative or positive. Instead of considering the value of the error, we consider the square of the error. Further, we calculate the mean squared error (MSE) across all the data points by averaging the squared errors.
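    In symbols, if there are n data points with actual values M1, M2, …, Mn and machine estimates Mhat1, Mhat2, …, Mhatn, the standard definition is:

    MSE = ((M1 − Mhat1)² + (M2 − Mhat2)² + … + (Mn − Mhatn)²) / n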

    Figure 1.9: Computation of error in estimates made by the machine

    Now, the machine considers other values of β0, β1, and β2 so that the value of the MSE is minimized. After some rounds of calculations, the machine gets the following values of β0, β1, and β2, as shown in Figure 1.10:

    Figure 1.10: Estimate of Momentum after minimizing MSE

    The estimates, though better, are still not reasonable. So, the data scientist considers another strategy. This time, the data scientist asks the computer to try to find a relationship between m*v and M. They want the machine to create an equation of the form M = β0 + β1 * m * v. As in the earlier case, the data scientist gives initial values for β0 and β1 as β0 = 5 and β1 = 5.

    The setup is shown in Figure 1.11:

    Figure 1.11: New setup

    The machine tries to minimize the MSE for this setup and calculates the values of β0 and β1, as shown in Figure 1.12:

    Figure 1.12: Estimate of Momentum after minimizing MSE for the new model devised in Figure 1.11

    The machine has done much better. Let us plot this data and check (Refer to Figure 1.13):

    Figure 1.13: Plot of new estimates made by the machine. The RED crosses are the estimates

    So, the machine has given us a formula for calculating momentum based on the data provided to the machine. According to the machine:

    Momentum = 2.46214616625777 + 0.991020053873143 * Mass * Velocity

    Now, for any new value of Mass and Velocity, say Mass = 7 kg and Velocity = 8 km/h, the machine would say that:

    Momentum = 2.46214616625777 + 0.991020053873143 * 7 kg * 8 km/h = 57.95926918 kg * km/h

    This is pretty good as, according to the formula from physics, the value of momentum for Mass = 7 kg and Velocity = 8 km/h should be 56 kg * km/h.
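    As a minimal sketch of the same idea (the mass, velocity, and momentum values below are assumed for illustration and are not the data behind the preceding figures), scikit-learn's LinearRegression can estimate β0 and β1 for the model M = β0 + β1 * m * v by minimizing the squared error:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Assumed experimental data: mass (kg), velocity (km/h), and measured momentum
    np.random.seed(0)
    mass = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    velocity = np.array([2.0, 3.0, 5.0, 4.0, 6.0, 7.0])
    momentum = mass * velocity + np.random.normal(0, 1, size=6)   # noisy "measurements"

    # Feature for the second model: the product m * v
    X = (mass * velocity).reshape(-1, 1)

    model = LinearRegression()
    model.fit(X, momentum)              # estimates beta0 (intercept) and beta1 (coefficient)
    print(model.intercept_, model.coef_[0])

    # Prediction for a new observation: mass = 7 kg, velocity = 8 km/h
    print(model.predict([[7 * 8]]))     # should be close to 56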

    Kinds of data the machines can learn from

    From nature, human beings can gather data through the five sense organs. We can see, hear, smell, taste, and feel. Out of these five types of data, human beings have been able to digitize what they see and hear. Likewise, machines can also understand data from images and sounds.

    Human beings have created a lot of digital data from various activities we perform. This data is either structured or unstructured.

    Structured data is organized in tabular form and follows definite semantics. It is by far the data most processed by machines. As of 2022, about 80% of the data machines learn from is structured data. Machines are extremely good with structured data. Machines are also very useful for working on structured data, as humans cannot cope with its sheer volume. Examples of structured data can be found in any system where transactions are conducted. For example, the data regarding credit card transactions is structured. In a credit card system, millions of transactions are performed daily. Tasks like detecting fraudulent transactions are extremely difficult for human beings, so machines are best suited for the job.

    Unstructured data is a more recent phenomenon. It has exploded mainly due to social media. Unstructured data has no definite semantics, so such data must be expressed with some semantics before machines can work on it. Over the years, many representations of unstructured data have emerged; thus, machines can now work efficiently on such data. Examples of unstructured data include tweets and newspaper articles. Images and audio/video clips are also unstructured data.

    We can also categorize data as semi-structured, which contains portions of both structured and unstructured data. For example, data from emails has a structure in that it contains structured fields such as the date the email was sent, who sent it, whom it was sent to, the subject, and whether it has attachments. However, the body of an email contains unstructured data. As machines work well with both structured and unstructured data, they work well with semi-structured data too.

    No matter the type of data, it must be understood that machines can only work on numbers. So, any data the machines need to understand must be presented to them as numbers. In this book, we will discuss various techniques for converting non-numeric data to numbers without any loss of context, allowing machines to learn from them. These discussions are spread across the remaining chapters as we discuss the different problems to be solved by the machines.
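    As one small, illustrative example of such a technique (shown here as a sketch; the specific methods used in this book are introduced in the later chapters), categorical text values can be converted to numbers with one-hot encoding, for instance using pandas:

    import pandas as pd

    # A toy column of non-numeric (categorical) data
    df = pd.DataFrame({'Colour': ['Red', 'Green', 'Blue', 'Green']})

    # One-hot encoding: each category becomes a separate 0/1 column
    encoded = pd.get_dummies(df, columns=['Colour'])
    print(encoded)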

    Types of machine learning

    Machine learning can be classified into two main types. They are Supervised learning and Unsupervised learning.

    In Supervised learning, we can perform two tasks: regression and classification.

    In Unsupervised learning, we can do two tasks: clustering and dimensionality reduction.

    There is a special case of Clustering tasks called Anomaly Detection.

    Figure 1.14 summarizes all types of machine learning and the tasks that can be performed:

    Figure 1.14: Types of machine learning

    Some people also consider Reinforcement learning to be another type of machine learning, while others argue that Reinforcement learning is approximate dynamic programming.

    Let us discuss each type of machine learning in more detail. However, this book focuses on the classification task, a type of supervised learning.

    Supervised learning

    In Supervised learning, the machine is provided data along with labels. The machine learns based on the data and the associated labels and then makes inferences. So, we are providing the machine with prior knowledge, and then after the machine learns from this knowledge, it can make decisions within the boundaries of this provided knowledge.

    Labels are the answers associated with the data, as determined by humans. For example, if we want the machine to learn to differentiate between images of dogs and cats, we need to provide data regarding dogs and cats to the machine. Along with this data, we need to provide labels stating which are the images of dogs and which are the images of cats. Suppose we want the machine to predict the marks in an exam. In that case, we need to provide historical data along with labels stating how many marks were obtained under the circumstances provided in the data.

    The bottom line in Supervised learning is that we provide existing knowledge to the machine and expect the machine to find patterns in the provided knowledge and make rules that the machine can use to answer future questions asked on the same subject.

    Let us understand this with an example. Consider that we want the machine to be able to detect spam emails. So, we gather the data as shown in Table 1.1:

    Table 1.1 : Example dataset of emails for spam detection

    In this example, the dataset in Table 1.1 contains only 6 data points. In real situations, datasets have thousands or even millions of data points. In either case, the dataset contains data on 4 variables: Contains spelling mistakes, Contains the word Urgent, Contains the word ASAP, and Contains a link to click. In machine learning parlance, these variables are called Independent Variables. For these four variables, there is data in each data point. In normal circumstances, experts would have studied real emails and gathered these four characteristics for each email. Apart from collecting data regarding the characteristics of the emails, experts would also assign a label as to whether each email is benign or spam. The variable we refer to as the label is also called the dependent variable in machine learning parlance.

    In Supervised learning, the machine would form patterns from the independent variables considering the associated dependent variable. From the pattern would emerge a rule that the machine will use when given new values for the independent variables.

    The preceding example is a Classification problem where the machine needs to decide whether an email is benign or spam. This type of Classification problem is called a Binary classification problem, as the machine must decide between two options or classes.

    There are classification problems where the machine needs to choose between more than two classes. Such classification problems are called Multi-class classification problems.

    Implementation of classification on the example data provided in Table 1.1 is as follows:

    import pandas as pd

    df = pd.DataFrame([['NO', 'NO', 'NO', 'YES', 'Benign'],

                      ['NO', 'NO', 'NO', 'NO', 'Benign'],

                      ['YES', 'NO', 'YES', 'NO', 'Spam'],

                      ['NO', 'YES', 'YES', 'YES', 'Spam'],

                      ['YES', 'NO', 'NO', 'YES', 'Benign'],

                      ['YES', 'YES', 'YES', 'YES', 'Spam']

                      ],

                      columns = ['ContainsSpellingMistakes', 'ContainsUrgent', 'ContainsASAP',  'ContainsLink', 'Label']

                    )

    df

      ContainsSpellingMistakes ContainsUrgent ContainsASAP ContainsLink  Label

    0                  NO            NO          NO          YES    Benign

    1                  NO            NO          NO          NO    Benign

    2                  YES            NO          YES        NO    Spam

    3                  NO            YES          YES        YES    Spam

    4                  YES            NO          NO          YES    Benign

    5                  YES            YES          YES        YES    Spam

    X = df.drop('Label', axis = 1, inplace = False)

    y = df['Label']

    print(X, '\n\n', y)

      ContainsSpellingMistakes ContainsUrgent ContainsASAP ContainsLink

    0                      NO            NO          NO          YES

    1                      NO            NO          NO          NO

    2                      YES            NO          YES          NO

    3                      NO            YES          YES          YES

    4                      YES            NO          NO          YES

    5                      YES            YES          YES          YES

    0    Benign

    1    Benign

    2      Spam

    3      Spam

    4    Benign

    5      Spam

    Name: Label, dtype: object

    from sklearn.preprocessing import LabelEncoder

    # Convert all data to numbers

    leX = LabelEncoder()

    XL = X.apply(leX.fit_transform)

    leY = LabelEncoder()

    yL = leY.fit_transform(y)

    print(XL, '\n\n', yL)

      ContainsSpellingMistakes  ContainsUrgent  ContainsASAP  ContainsLink

    0                        0              0            0            1

    1                        0              0            0            0

    2                        1              0            1            0

    3                        0              1            1            1

    4                        1              0            0            1

    5                        1              1            1            1

    [0 0 1 1 0 1]

    from sklearn.linear_model import LogisticRegression

    # Build Model

    lr = LogisticRegression()

    lr.fit(XL, yL)

    # Prepare Test Data

    testData = ['NO', 'YES', 'NO', 'YES']

    Xtest = leX.transform(testData)

    prediction = lr.predict(Xtest.reshape(1, -1))

    print('Prediction =', leY.inverse_transform(prediction))

    Prediction = ['Benign']
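    A small caveat on the preceding listing (an observation, not a point made in the original text): because leX was reused across all columns via apply, it retains only the fit from the last column it processed; this works here because every column has the same two categories, NO and YES. A more robust sketch, continuing from the variables prepared above and using scikit-learn's OrdinalEncoder, fits a single encoder on all four columns together:

    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import OrdinalEncoder

    # One encoder fitted on all four independent variables together
    encX = OrdinalEncoder()
    XL2 = encX.fit_transform(X)          # X and yL as prepared above

    lr2 = LogisticRegression()
    lr2.fit(XL2, yL)

    # Encode the test email with the same encoder and predict
    Xtest2 = encX.transform(pd.DataFrame([['NO', 'YES', 'NO', 'YES']], columns=X.columns))
    print('Prediction =', leY.inverse_transform(lr2.predict(Xtest2)))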

    Take another example. Suppose we have the temperatures of a city, say Bengaluru, every day for many years. We have three attributes, that is, the date, whether it was cloudy on that date, and the temperature on that date, as shown in Table 1.2:

    Table 1.2 : Example dataset of temperatures in a city

    Suppose we have this data from 01-Jan-2001 till 31-Dec-2015, and we want to know what the temperature will be on 25-Oct-2022. We should be able to predict it using a Machine learning system with a Regression model. In Regression problems, we need historical data for all the independent variables and the associated dependent variable. In the example in Table 1.2, the date and whether it was cloudy are the independent variables, or features. From some independent variables, we can derive many more independent variables. For example, from the date feature in our data in Table 1.2, we can derive other independent variables like the month and the day of the year. So, instead of using the date as the independent variable, we could use the month and the day of the year as our features. Generating independent variable(s) or feature(s) from existing independent variable(s) is called Feature Engineering (refer to Table 1.3).

    Table 1.3 : Feature Engineered dataset of temperatures in a city

    The temperature is the dependent variable or target variable. Given this data, we want the machine to learn the patterns and create a rule. Then, given any date in the future and whether it is cloudy, the machine should predict the temperature on that day. So, this is also Supervised Learning.

    A regression implementation on the example data provided in Table 1.2 is as follows:

    import pandas as pd

    df = pd.DataFrame([['01-01-2001', 'YES', 14.3],

                      ['01-02-2001', 'NO', 13.7],

                      ['01-03-2001', 'NO', 13.6],

                      ['01-04-2001', 'YES', 14.3],

                      ['01-05-2001', 'NO', 14.2],

                      ['01-06-2001', 'YES', 12.8],

                      ['01-07-2001', 'NO', 14.7],

                      ['01-08-2001', 'NO', 11.3],

                      ['01-09-2001', 'NO', 11.7],

                      ['01-10-2001', 'NO', 12.1],

                      ],

                      columns = ['Date', 'Cloudy', 'Temperature']

                    )

    df

            Date Cloudy  Temperature

    0  01-01-2001    YES        14.3

    1  01-02-2001    NO        13.7

    2  01-03-2001    NO        13.6

    3  01-04-2001    YES        14.3

    4  01-05-2001    NO        14.2

    5  01-06-2001    YES        12.8

    6  01-07-2001    NO        14.7

    7  01-08-2001    NO        11.3

    8  01-09-2001    NO        11.7

    9  01-10-2001    NO        12.1

    import datetime

    import numpy as np

    from sklearn.preprocessing import LabelEncoder

    # Feature Engineering

    # Get Month and Day of the Year

    df['Month'] = pd.to_datetime(df['Date']).dt.month

    referenceDate = np.array([datetime.datetime(2001, 1, 1)] * len(df))

    df['DayOfYear'] = (pd.to_datetime(df['Date']) - referenceDate).dt.days

    # Convert Cloudy to numbers

    leC = LabelEncoder()

    df['Cloudy'] = leC.fit_transform(df['Cloudy'])

    df

            Date  Cloudy  Temperature  Month  DayOfYear

    0  01-01-2001     
