Vijay Kothandaraman
3 min read · Nov 8, 2021


(pic credit- Shaitris)

Machine Learning, as I am learning it-

Until a few months ago, the term ML was daunting to me, as it must still be for some people. So here is my attempt to explain it in plain language.

But let’s begin with data-

Data is a collection of raw facts.

Information is processed data.

Information in a specific context is Knowledge.

Data can be-

  • Nominal- A nominal scale classifies data into distinct categories in which no ranking is implied, e.g., gender, marital status.
  • Ordinal- An ordinal scale classifies data into distinct categories in which a ranking is implied, e.g., product satisfaction (satisfied, neutral, not satisfied).
  • Interval- An interval scale is an ordered scale in which the difference between measurements is a meaningful quantity but the measurements have no true zero point, e.g., temperature in Fahrenheit or Celsius, calendar year.
  • Ratio- A ratio scale is an ordered scale in which the difference between measurements is a meaningful quantity and the measurements have a true zero point, e.g., weight, age, salary.
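
To make the nominal/ordinal distinction concrete, here is a minimal sketch using pandas (an assumption on my part; any tool with categorical types would do). The category values are purely illustrative.

```python
# Minimal sketch of nominal vs ordinal data, assuming pandas is installed.
import pandas as pd

# Nominal: distinct categories with no implied ranking.
marital_status = pd.Categorical(
    ["single", "married", "single"], ordered=False
)

# Ordinal: distinct categories with an implied ranking.
satisfaction = pd.Categorical(
    ["Satisfied", "Neutral", "Not satisfied"],
    categories=["Not satisfied", "Neutral", "Satisfied"],
    ordered=True,
)

print(marital_status)      # labels only, no order between them
print(satisfaction.min())  # 'Not satisfied' -- ordering is meaningful here
```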

Data can either be-

  • Categorical- e.g., marital status, eye color.
  • Numerical- which can be of two types: discrete (counted), e.g., number of children, defects per hour; and continuous (measured), e.g., weight, voltage.

ML is the science of training a system to learn from data and act on it. The logic that drives an ML-based system is not explicitly programmed; rather, it is learnt from data. A simple analogy is an infant learning to speak by observing others. Babies are not born with any language skills, but they learn to understand and speak words by observing others. Similarly, in ML, we train a system with data instead of explicitly programming its behavior. To be specific, an ML algorithm infers patterns and relationships between different variables in a dataset. It then uses that knowledge to generalize beyond the training dataset; in other words, an ML algorithm learns to predict from data.
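Here is a tiny sketch of that idea, assuming scikit-learn is available (the data and the model choice are illustrative, not prescriptive). The rule relating the input to the output is never written in code; the model infers it from the examples and then predicts for an input it has never seen.

```python
# "Learning from data instead of programming the rule" -- a minimal sketch.
from sklearn.linear_model import LinearRegression

# Training data: hours studied (feature) vs exam score (target).
# The rule "score = 10 * hours + 40" is never stated anywhere;
# the model infers it from these examples.
hours = [[1], [2], [3], [4], [5]]
scores = [50, 60, 70, 80, 90]

model = LinearRegression()
model.fit(hours, scores)        # learn the pattern from data

print(model.predict([[6]]))     # generalize to an unseen input -> ~[100.]
```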

In a tabular dataset, a row represents an observation and a column represents a feature. For example, consider a tabular dataset of user profiles with fields such as age, gender, profession, city and income. Each field in this dataset is a feature in the context of ML, and each row containing a user profile is an observation. Features are also collectively referred to as dimensions. A categorical feature or variable is a descriptive feature that can take on one of a fixed number of discrete values. It represents a qualitative value, which is a name or a label, and these values have no ordering. For example, gender is a categorical feature that can take on only one of two (or, to be more inclusive, a few) values, each of which is a label. Profession is also a categorical variable, but it can take on one of several hundred values.

A numerical feature or variable is a quantitative variable that can take on any numerical value; it describes a measurable quantity as a number. These values have a mathematical ordering. For example, income is a numerical feature. Numerical features can be further classified into discrete and continuous features. A discrete numerical feature can take on only certain values, e.g., number of children. A continuous numerical feature can take on any value within a finite or infinite interval, e.g., temperature.
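A short sketch of observations, features and feature types using the user-profile example above, assuming pandas is installed; the values below are made up purely for illustration.

```python
# Observations (rows), features (columns) and feature types -- a minimal sketch.
import pandas as pd

profiles = pd.DataFrame({
    "age":        [25, 34, 41],                         # numerical (discrete)
    "gender":     ["F", "M", "F"],                      # categorical (nominal)
    "profession": ["dentist", "teacher", "engineer"],   # categorical
    "city":       ["Chennai", "Pune", "Delhi"],         # categorical
    "income":     [52000.0, 61000.5, 78000.0],          # numerical (continuous)
})

print(profiles.shape)   # (3, 5) -> 3 observations, 5 features (dimensions)
print(profiles.dtypes)  # object columns are categorical, int/float are numerical
```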

The data used for evaluating the predictive performance of a model (the output of an ML algorithm) is called test data or the test set. After a model has been trained, its predictive capabilities should be tested on a known dataset before it is used on new data. Test data should be set aside before training a model, and it should NOT be used at all during the training phase, neither for training nor for optimizing the model. In fact, it should not influence the training phase in any manner; do not even look at it during training. A corollary is that a model should not be evaluated on its training dataset, because it will naturally perform very well on observations it has already seen. It should be tested on data that was not used to train it. Generally, a small proportion of a dataset is held out for testing before training a model; a common rule of thumb is to use 80% of the data for training and 20% for testing.
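A minimal sketch of that 80/20 hold-out, assuming scikit-learn; the built-in diabetes dataset and the linear model are just placeholders for whatever data and algorithm you actually use.

```python
# Hold out 20% of the data for testing before any training happens.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# Set the test set aside first; it is not touched again until evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)          # training uses only the 80% split

print(model.score(X_test, y_test))   # evaluate on the held-out 20%
```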

ML applications- Machine learning is used for a wide variety of tasks in different fields, and these tasks can be broadly grouped into the following categories-

  • Classification
  • Regression
  • Clustering
  • Anomaly detection
  • Recommendation
  • Dimensionality reduction

(to be continued)


Vijay Kothandaraman

Freelancer, word aggregator, Dentist, wannabe data analyst.