💡⏳ Mining deep into Data Mining - PART II ⏳💡

Hurray 💥, we have seen the basics of data mining in Part I 😃

Let's get into the phases involved in the KDD process step by step. To start with, let's explore the Data Preprocessing phase.

What actually is DATA in data mining? 🤔

In data mining, data refers to a collection of objects and their attributes. Umm, confusing, right? 😨

👉 An object is just like an entry (row) in a table, or an instance. It is also known as a record, point, entity, or sample.
👉 An attribute is any property or characteristic of an object.
👉 For example, if a person's eye is considered the object, then eye colour and blink rate are its attributes.
👉 An attribute can also be called a feature, field, characteristic, or variable in data mining.
👉 Here, the data is organized in tabular form: one row per object and one column per attribute.

Let's get more technical 💻🙌

👉 Each row can be called a vector, i.e. an object vector or feature vector.
👉 The number of attributes determines the dimension of the vector.
👉 Thus, our data is a collection of N-dimensional vectors.
👉 Let's understand with an example.


The example table holds fraudulent loan transaction details. Each row (entity) can be called a vector. Tid is only a transaction identifier, so the attributes Refund, Marital status, and Taxable income determine the dimensions, i.e. each vector in this example is three-dimensional. The target attribute is Cheat.
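
To make the idea of feature vectors concrete, here is a minimal Python sketch. The two records and all of their values are made up purely for illustration (they are not taken from the table), and the helper name to_vector is hypothetical.

```python
# Hypothetical records: the values are illustrative, not taken from the table.
records = [
    {"Tid": 1, "Refund": "Yes", "Marital status": "Single",
     "Taxable income": 125000, "Cheat": "No"},
    {"Tid": 2, "Refund": "No", "Marital status": "Married",
     "Taxable income": 100000, "Cheat": "No"},
]

# Tid only identifies a row, and Cheat is the target, so neither is a feature.
FEATURES = ["Refund", "Marital status", "Taxable income"]


def to_vector(record):
    """Return the feature vector of one object (one row of the table)."""
    return [record[name] for name in FEATURES]


vectors = [to_vector(r) for r in records]
targets = [r["Cheat"] for r in records]
dimensions = len(FEATURES)

print(vectors)
print("dimensions:", dimensions)
```

Each inner list is one object vector, and its length equals the number of feature attributes.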

Attributes, properties and their types ✊ :

As discussed already, an attribute is any characteristic or feature of an object.

The type of an attribute depends on which of the following properties it possesses:
👉 Distinctness - equal or not equal (=, !=)
👉 Order - greater than or less than (<, >)
👉 Addition, subtraction (+, -)
👉 Multiplication, division (*, /)

The different types of attributes include

👉 Nominal - values are just names or labels that tell objects apart, such as ID, eye colour, zip code
                        Property: Distinctness - since the values are only labels, only equality can be checked
                        Operations: mode, entropy, contingency correlation, chi-square test

👉 Ordinal - values can be ranked (e.g. ratings from 1 to 5, preferences)
                        Property: Distinctness, Order - since the values follow a hierarchy, the data can be ranked from least to most
                        Operations: median, rank correlation, sign tests

👉 Interval - calendar dates, temperature in Celsius or Fahrenheit
                        Property: Distinctness, Order, Addition/Subtraction - differences between values are meaningful, but there is no true zero
                        Operations: mean, standard deviation, t and F tests

👉 Ratio - temperature in Kelvin, counts, elapsed time, length
                        Property: Distinctness, Order, Addition/Subtraction, Multiplication/Division - for example, we can divide one length by another, but we cannot do that with date values
                        Operations: geometric mean, harmonic mean, percent variation
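
As a rough illustration of which summary statistic suits which attribute type, here is a small sketch using Python's standard statistics module. All the sample values are invented for the example.

```python
import statistics

# Invented sample values, one list per attribute type.
eye_colour = ["brown", "blue", "brown", "green"]  # nominal: labels only
ratings = [1, 3, 3, 4, 5]                         # ordinal: can be ranked
temps_celsius = [20.0, 22.0, 24.0]                # interval: differences matter
lengths_cm = [1.0, 10.0, 100.0]                   # ratio: true zero, ratios matter

# Nominal values support only equality checks, so the mode is the natural summary.
most_common_colour = statistics.mode(eye_colour)

# Ordinal values add order, which makes the median meaningful.
middle_rating = statistics.median(ratings)

# Interval values add meaningful differences, which the mean relies on.
mean_temp = statistics.mean(temps_celsius)

# Ratio values add meaningful ratios, which the geometric mean relies on.
geo_mean_length = statistics.geometric_mean(lengths_cm)

print(most_common_colour, middle_rating, mean_temp, geo_mean_length)
```

Each attribute type also supports the operations of the types listed before it, so a ratio attribute such as length has a valid mode, median, and mean as well.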

Attributes can also be grouped by the number of values they can take: discrete or continuous

👉 Discrete Attribute
                    ➡ Has a finite or countably infinite set of values
                    ➡ Generally represented as integer values
                    ➡ Examples: zip codes, count of words in a document, number of states in a country

👉 Continuous Attribute
                    ➡ Has real numbers as attribute values
                    ➡ Generally represented as floating-point values
                    ➡ Examples: height, weight, temperature, or air quality metrics

Types of data sets other than the table format 😃:

👉 Graph data - social network data, molecular structures, the World Wide Web
👉 Ordered data - spatial (geographical or map data), sequential (makes sense only when viewed in linear order, such as speech, sound, or a genome sequence), temporal (records tied to time, such as time-stamped events)

What should we do before data pre-processing? 🤔

👉 Determine the data quality. Some factors that affect the data quality are
👉 Noise - random errors that distort the original values
👉 Outliers - data objects whose characteristics differ considerably from the common pattern
👉 Missing values - eliminate the affected objects, estimate the missing values, or ignore them during analysis
👉 Duplicate data - detect and remove duplicates as part of data cleaning
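
The quality checks above can be sketched in a few lines of Python. The measurements are hypothetical, filling gaps with the median is just one simple estimation choice, and the 1.5 * IQR rule is one common (assumed, not prescribed) way to flag outliers.

```python
import statistics

# Hypothetical measurements; None marks a missing value.
values = [10.0, 11.0, None, 10.5, 10.0, 95.0, 11.5, 10.8, 11.2]

# Missing values: count them, then estimate each one with the observed median.
present = [v for v in values if v is not None]
missing_count = len(values) - len(present)
fill_value = statistics.median(present)
filled = [fill_value if v is None else v for v in values]

# Outliers: flag values further than 1.5 * IQR outside the quartiles.
q1, _, q3 = statistics.quantiles(present, n=4)
iqr = q3 - q1
outliers = [v for v in present if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

# Duplicate data: drop exact duplicate records, keeping first-seen order.
records = [("A", 1), ("B", 2), ("A", 1)]
deduplicated = list(dict.fromkeys(records))

print(missing_count, outliers, deduplicated)
```

Whether a flagged outlier is an error or a genuinely interesting object depends on the application, so this sketch only flags it and does not delete it.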

Data pre-processing techniques 💭

👉Aggregation

                    ➡ Combining two or more attributes or objects into a single entity
                    ➡ The main purpose is to reduce the number of data points, obtain more stable (less variable) data, and change the scale of the data, e.g. cities can be aggregated into regions, states, etc.
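
Here is a minimal sketch of aggregating city-level rows into regions. The city names, regions, and sales figures are all hypothetical.

```python
from collections import defaultdict

# Hypothetical city-level rows: (city, region, sales).
city_sales = [
    ("City A", "South", 120),
    ("City B", "South", 80),
    ("City C", "North", 150),
    ("City D", "North", 50),
]

# Aggregate: sum the sales of all cities that belong to the same region.
totals = defaultdict(int)
for _city, region, sales in city_sales:
    totals[region] += sales
region_totals = dict(totals)

print(region_totals)
```

Four city rows collapse into two region rows, which is the reduction in data points described above.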


👉Sampling

                   ➡ Sampling is the process of selecting a subset (a sample) of the data to analyse
                   ➡ Choose a representative sample, i.e. one with roughly the same properties as the full data set, so that it is effective

Types of sampling
                   ➡ Simple random sampling: every item has an equal probability of being selected
                   ➡ Sampling without replacement: each selected object is removed from the population, so it can be picked only once
                   ➡ Sampling with replacement: selected objects are not removed from the population, so the same object can be picked more than once
                   ➡ Stratified sampling: split the data into partitions and draw random samples from each partition
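
The sampling types above can be tried out with Python's standard random module. This is only a sketch: the population, the class labels, the 20% fraction, and the helper stratified_sample are assumptions made for the example.

```python
import random

rng = random.Random(42)  # fixed seed so repeated runs give the same samples
population = list(range(1, 101))

# Without replacement: a picked object leaves the pool, so no repeats.
without_replacement = rng.sample(population, k=10)

# With replacement: a picked object stays in the pool, so repeats are possible.
with_replacement = rng.choices(population, k=10)

# Hypothetical labelled data with an imbalanced class: 90 "No" and 10 "Yes".
labelled = [("No", i) for i in range(90)] + [("Yes", i) for i in range(10)]


def stratified_sample(items, fraction, rng):
    """Draw the same fraction at random from every partition (stratum)."""
    strata = {}
    for label, value in items:
        strata.setdefault(label, []).append((label, value))
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


stratified = stratified_sample(labelled, 0.2, rng)
print(len(without_replacement), len(with_replacement), len(stratified))
```

Stratifying by label keeps the rare "Yes" class in the sample, which plain random sampling could miss on a small draw.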

👉Dimensionality Reduction

                   ➡ The higher the dimensionality of the data, the more complex the analysis becomes
                   ➡ Techniques involved are:
                                ➡ PCA (Principal Component Analysis)
                                ➡ SVD (Singular Value Decomposition)
                                ➡ Other supervised and unsupervised techniques
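
To give a feel for PCA, here is a rough pure-Python sketch that finds only the first principal component using power iteration on the covariance matrix. It is a teaching sketch under simplifying assumptions (tiny made-up data, a fixed number of iterations, no handling of degenerate input); in practice you would normally reach for a numerical library instead.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, projections) for the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]

    # Sample covariance matrix (d x d) of the centred data.
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
           for a in range(d)]

    # Power iteration converges to the eigenvector with the largest eigenvalue,
    # which is the direction of maximum variance.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    # Project every centred point onto that direction: d dimensions become 1.
    projections = [sum(c[j] * v[j] for j in range(d)) for c in centred]
    return v, projections


# Made-up 2-D points that lie close to the line y = 2x.
points = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0)]
direction, reduced = first_principal_component(points)
print(direction, reduced)
```

Each two-dimensional point is replaced by a single number, its position along the direction in which the data varies the most.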

👉Feature subset selection

                 ➡ Redundant features duplicate much or all of the information already contained in other attributes. Example: the salary of an employee and the tax paid
                 ➡ Irrelevant features carry no useful information for the task at hand. Example: a student's date of birth for CGPA calculation
                 ➡ Techniques involved are:
                                 ➡ Brute force approach
                                 ➡ Embedded approach
                                 ➡ Filter approach
                                 ➡ Wrapper approach
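
As one possible flavour of the filter approach, the sketch below drops a feature when it is almost perfectly correlated with a feature that has already been kept. The data, the 0.95 threshold, and the helper names pearson and drop_redundant are assumptions for illustration only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical employee data; tax_paid is exactly 10% of salary, so redundant.
features = {
    "salary": [30.0, 45.0, 60.0, 80.0, 100.0],
    "tax_paid": [3.0, 4.5, 6.0, 8.0, 10.0],
    "years_of_experience": [1.0, 7.0, 3.0, 10.0, 4.0],
}


def drop_redundant(columns, threshold=0.95):
    """Keep a feature only if it is not highly correlated with a kept one."""
    kept = []
    for name, column in columns.items():
        if all(abs(pearson(column, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


selected = drop_redundant(features)
print(selected)
```

A filter like this runs before any mining algorithm and independently of it, which is what separates it from the wrapper and embedded approaches.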

👉Feature creation

                     ➡ Create new attributes from the existing original attributes
                     ➡ One technique is data discretization - converting a continuous attribute into a set of discrete intervals (bins).
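
Discretization can be sketched with simple equal-width binning. The ages and the choice of three bins are made up for the example, and equal-width is only one of several possible binning strategies.

```python
def equal_width_bins(values, n_bins):
    """Map each value to the index of its equal-width interval (0-based)."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    bins = []
    for v in values:
        index = int((v - low) / width) if width else 0
        bins.append(min(index, n_bins - 1))  # the maximum falls in the last bin
    return bins


ages = [18, 22, 35, 41, 58, 60]  # hypothetical continuous attribute
age_bins = equal_width_bins(ages, 3)
print(age_bins)
```

The new attribute age_bins is discrete (bin 0, 1, or 2) even though it was created from a continuous one.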

👉Attribute transformation

                    ➡ Any function that maps the entire set of values of an attribute to a new set of replacement values, such that each old value can be identified with one of the new values
                    ➡ Examples: Normalization, Standardization
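
Here is a minimal sketch of the two transformations named above, reading normalization as min-max scaling to [0, 1] and standardization as z-scores (other definitions exist). The income values are hypothetical.

```python
import statistics


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def standardize(values):
    """Shift and scale values to zero mean and unit (population) std deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


incomes = [20.0, 40.0, 60.0, 80.0, 100.0]  # hypothetical attribute values
normalized = min_max_normalize(incomes)
z_scores = standardize(incomes)
print(normalized)
print(z_scores)
```

Both functions keep the order of the values, so every new value can still be traced back to the old one it came from.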

We are done with a basic understanding of data pre-processing. Let's get more mathematical 💥😃 in the upcoming posts.

                    💡 Let's keep mining deeper !!! 💡


