💡⏳ Mining deep into Data Mining - PART II ⏳💡

Hurray 💥, we have seen the basics of data mining in Part I 😃

Let's get into the phases involved in the KDD process step by step. To start with, let's explore the Data Preprocessing phase.

What actually is DATA in data mining? 🤔

In data mining, data refers to a collection of objects and their attributes. Umm, confusing, right? 😨

👉 An Object is just like an entry in a table or an instance. It is also known as record, point, entity or sample.
👉 Attribute is any property or characteristic of an object.
👉 For example, if the eye of a person is considered as an object, then the eye colour and blink rate are regarded as its attributes.
👉 Attribute can also be called a feature, field, characteristic, or variable in data mining.
👉 Here, the organization of data is in a tabular form.
                                                

Let's get more technical 💻🙌

👉 Each row can be called a vector, i.e., an object vector or feature vector.
👉 The number of attributes determines the dimension of the vector.
👉 Thus, our data will be a collection of N-dimensional vectors.
👉 Let's understand with an example

[Table: sample loan transaction records with the attributes Tid, Refund, Marital Status, Taxable Income and the target Cheat]

The above table denotes fraudulent loan transaction details. Each row or entity can be called a vector. The descriptive attributes Refund, Marital Status and Taxable Income determine the dimensions, i.e., in this example each vector is three-dimensional (Tid is only a record identifier). The target attribute is Cheat. A small code sketch of this idea follows.
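To make this concrete, here is a minimal Python sketch using a tiny made-up version of the table (all values are purely illustrative):

```python
# A made-up mini version of the loan table above (values are illustrative).
# Tid is only a record identifier; Cheat is the target attribute.
records = [
    # (Tid, Refund, Marital Status, Taxable Income, Cheat)
    (1, "Yes", "Single",  125_000, "No"),
    (2, "No",  "Married", 100_000, "No"),
    (3, "No",  "Single",   70_000, "Yes"),
]

# Each row becomes a 3-dimensional feature vector
# (Refund, Marital Status, Taxable Income); Tid and Cheat are excluded.
feature_vectors = [(refund, status, income)
                   for _, refund, status, income, _ in records]
targets = [cheat for *_, cheat in records]

print(feature_vectors[0])       # ('Yes', 'Single', 125000)
print(len(feature_vectors[0]))  # 3 -> a three-dimensional vector
```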

Attributes, properties and their types ✊:

As discussed already, the attribute is any characteristic or feature of an object. 

The type of attribute depends on the properties it possesses:
👉 Distinctness - equal or not equal (=, !=)
👉 Order - greater than or less than (<, >)
👉 Addition/Subtraction (+, -)
👉 Multiplication/Division (*, /)

The different types of attributes are listed below (a small code sketch matching statistics to each type follows the list)

👉 Nominal - values are just distinct names or labels, such as IDs, eye colour, country zip codes
                        Property: Distinctness - the values have no order, so only equality or inequality can be checked
                        Operations:  mode, entropy, correlation, chi-square test

👉 Ordinal - based on rankings (eg: ratings from 1 to 5, preferences)
                        Property: Distinctness, Order - since the values have a meaningful order, the data can be ranked from least to most
                        Operations: median, rank correlation, sign tests

👉 Interval - calendar dates, temperature in Celsius or Fahrenheit
                        Property: Distinctness, Order, Addition/Subtraction - differences between values are meaningful, but there is no true zero point
                        Operations:  Mean, standard deviation, t, and F tests

👉 Ratio - temperature in Kelvin, counts, elapsed time, length
                        Property: Distinctness, Order, Addition/Subtraction, Multiplication/Division - a true zero exists, so ratios are meaningful; for example, one length can be divided by another, but the same cannot be done for calendar dates
                        Operations: Geometric mean, harmonic mean, percent variation
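As a quick illustration of which summary statistics suit which attribute type, here is a small Python sketch; all the sample values are made up:

```python
import statistics

# Hypothetical attribute values, one small sample per attribute type.
nominal  = ["blue", "brown", "brown", "green"]   # eye colour
ordinal  = [1, 3, 3, 4, 5]                        # ratings from 1 to 5
interval = [21.0, 23.5, 19.0, 25.0]               # temperature in Celsius
ratio    = [1.2, 3.4, 2.2, 5.6]                   # lengths in metres

# Nominal: only distinctness -> mode is meaningful, mean is not.
print(statistics.mode(nominal))

# Ordinal: distinctness + order -> median is meaningful.
print(statistics.median(ordinal))

# Interval: differences are meaningful -> mean and standard deviation.
print(statistics.mean(interval), statistics.stdev(interval))

# Ratio: a true zero exists -> geometric/harmonic means also make sense.
print(statistics.geometric_mean(ratio), statistics.harmonic_mean(ratio))
```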

Independent of the four types above, attributes can also be classified by the number of values they can take, as Discrete or Continuous

👉 Discrete Attribute 
                    ➡ Has finite or countably infinite values
                    ➡ Generally represented as integer values
                    ➡ Examples: zip codes, counts of words in a document, the number of states in a country

👉 Continuous Attribute 
                    ➡ Has real numbers as attribute values
                    ➡ Generally represented as floating-point values
                    ➡ Examples: height, weight, temperature, air quality metrics

Types of data sets other than the tabular format 😃:

👉 Graph data - social network data, molecular structures, the World Wide Web
👉 Ordered data - spatial (geographical or map data), sequential (makes sense when viewed as linear data - speech, sound, genome sequences), temporal (time-stamped data based on events)

What should we do before data pre-processing? 🤔

👉 Determine the data quality. Some factors that affect data quality are:
👉 Noise - random error or distortion in measured values
👉 Outliers - data that does not fit the common characteristics or pattern of the rest of the data
👉 Missing values - estimate them, replace them with plausible values, or eliminate the affected records
👉 Duplicate data - perform data cleaning to detect and remove them (see the sketch below)
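Here is a minimal pandas sketch of these quality checks, assuming a small, deliberately messy made-up table (the income and age columns are hypothetical):

```python
import numpy as np
import pandas as pd

# A hypothetical, deliberately messy table to illustrate the checks above.
df = pd.DataFrame({
    "income": [50_000, np.nan, 52_000, 52_000, 48_000, 51_000, 1_000_000],
    "age":    [25, 30, 35, 35, 40, 45, 50],
})

# Duplicate data: drop exact duplicate records first.
df = df.drop_duplicates()

# Missing values: either estimate them (here: fill with the median) ...
df["income"] = df["income"].fillna(df["income"].median())
# ... or eliminate the affected rows instead: df = df.dropna()

# Outliers: a simple z-score check flags values far from the mean.
z = (df["income"] - df["income"].mean()) / df["income"].std()
print(df[z.abs() > 2])   # flags the 1,000,000 row
```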


Data pre-processing techniques 💭

👉 Aggregation

                    ➡ Combining two or more attributes or objects into a single attribute or object
                    ➡ The main purposes are to reduce the number of data points, obtain more stable data, and change the scale of the data (e.g., cities can be aggregated into regions or states), as sketched below
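A small pandas sketch of this idea, with made-up city populations:

```python
import pandas as pd

# Hypothetical city-level data aggregated up to the state level.
cities = pd.DataFrame({
    "state":      ["TN", "TN", "KA", "KA"],
    "city":       ["Chennai", "Coimbatore", "Bengaluru", "Mysuru"],
    "population": [7_000_000, 2_100_000, 8_400_000, 1_000_000],
})

# Aggregation changes the scale of the data: many city rows collapse
# into one row per state, reducing the number of data points.
by_state = cities.groupby("state")["population"].sum()
print(by_state)
```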

👉 Sampling

                   ➡ Sampling is the process of selecting a subset of the data to work with
                   ➡ Choose a representative sample so that the analysis on the sample remains effective

Types of sampling
                    
                   ➡ Simple random sampling: every item has an equal probability of being selected
                   ➡ Sampling with replacement: a selected object is not removed from the population, so it can be picked more than once
                   ➡ Sampling without replacement: each object is removed from the population once it has been selected
                   ➡ Stratified sampling: split the data into partitions (strata) and draw random samples from each (see the sketch below)
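Here is a small pandas sketch of the four sampling types, on a made-up 100-row table:

```python
import pandas as pd

df = pd.DataFrame({"income": range(100, 200), "group": ["A", "B"] * 50})

# Simple random sampling: every row has an equal chance of selection.
simple = df.sample(n=10, random_state=0)

# Sampling WITH replacement: a row can be drawn more than once.
with_repl = df.sample(n=10, replace=True, random_state=0)

# Sampling WITHOUT replacement (the default): each row is drawn at most once.
without_repl = df.sample(n=10, replace=False, random_state=0)

# Stratified sampling: partition by 'group', then sample within each stratum.
stratified = df.groupby("group").sample(frac=0.1, random_state=0)
print(stratified["group"].value_counts())  # both strata are represented
```

Note how stratified sampling guarantees that both groups appear in the sample, which simple random sampling does not.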

👉 Dimensionality Reduction

                   ➡ The higher the dimensionality of the data, the higher the complexity of the analysis (the curse of dimensionality)
                   ➡ Techniques involved are (a PCA sketch follows below):
                                ➡ PCA (Principal Component Analysis)
                                ➡ SVD (Singular Value Decomposition)
                                ➡ Other supervised and unsupervised techniques
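Below is a minimal PCA-via-SVD sketch in plain NumPy, assuming a small random data matrix (50 objects, 5 attributes):

```python
import numpy as np

# A minimal PCA sketch via SVD, on a made-up data matrix X
# with 50 objects and 5 attributes (dimensions).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))

# 1. Centre the data so each attribute has zero mean.
Xc = X - X.mean(axis=0)

# 2. SVD of the centred matrix; rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# 3. Keep the top-2 components: project 5-D vectors down to 2-D.
X_reduced = Xc @ Vt[:2].T
print(X_reduced.shape)  # (50, 2)
```

In practice, libraries such as scikit-learn provide a ready-made PCA, but the centring-plus-SVD steps above are what happens underneath.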
            

👉 Feature subset selection

                 ➡ Redundant features duplicate much or all of the information contained in other attributes. Example: an employee's salary and the tax paid, since one largely determines the other
                 ➡ Irrelevant features contain no information useful for the task at hand. Example: a student's date of birth for CGPA calculation
                 ➡ Techniques involved are (a filter-approach sketch follows this list):
                                 ➡ Brute force approach
                                 ➡ Embedded approach
                                 ➡ Filter approach
                                 ➡ Wrapper approach
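As a rough illustration of the filter approach, here is a sketch that ranks features by their correlation with the target, independently of any model; salary, tax_paid and birth_month are invented columns mirroring the examples above:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one useful feature, one redundant, one irrelevant.
rng = np.random.default_rng(0)
n = 200
salary = rng.normal(60_000, 10_000, n)
df = pd.DataFrame({
    "salary":      salary,
    "tax_paid":    salary * 0.2 + rng.normal(0, 500, n),  # redundant with salary
    "birth_month": rng.integers(1, 13, n),                # irrelevant feature
})
target = salary * 0.5 + rng.normal(0, 2_000, n)           # e.g. a loan amount

# Filter approach: keep features whose |correlation| with the target is high.
scores = df.apply(lambda col: abs(np.corrcoef(col, target)[0, 1]))
print(scores.sort_values(ascending=False))
# salary and tax_paid score high (and are redundant with each other);
# birth_month scores near zero and can be dropped.
```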

👉 Feature creation

                     ➡ Create new attributes from the existing original attributes
                     ➡ One technique is Data Discretization - converting a continuous attribute into a set of discrete intervals (bins), as sketched below
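A minimal discretization sketch, assuming a hypothetical continuous age attribute:

```python
import pandas as pd

# Discretization: a continuous attribute (age) is converted into
# discrete intervals (bins) with pandas.cut.
ages = pd.Series([3, 17, 25, 34, 48, 61, 72])

bins = pd.cut(ages, bins=[0, 18, 40, 65, 100],
              labels=["child", "young adult", "adult", "senior"])
print(bins.tolist())
# ['child', 'child', 'young adult', 'young adult', 'adult', 'adult', 'senior']
```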

👉 Attribute transformation

                    ➡ A function that maps the entire set of values of an attribute to a new set of replacement values, such that each old value corresponds to one of the new values
                    ➡ Examples: Normalization, Standardization (sketched below)
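A minimal sketch of both transformations in NumPy, on a made-up attribute:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max normalization: map values into the range [0, 1].
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): zero mean, unit standard deviation.
x_std = (x - x.mean()) / x.std()

print(x_norm)  # [0.   0.25 0.5  0.75 1.  ]
print(x_std)   # mean ~0, std ~1
```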

We are done with a basic understanding of data pre-processing. Let's get more mathematical 💥😃 in the upcoming posts


                    💡 Let's keep mining deeper  !!! 💡


