💡⏳ Mining deep into Data Mining

Hurray💥 , we have seen the basics of data mining in Part I 😃

Let's get into the phases involved in the KDD process step by step. To start with, let's explore the Data Preprocessing phase.

What actually is DATA in data mining? 🤔

In data mining, Data refers to the collection of objects and their attributes. Umm, Confusing right? 😨

👉 An Object is just like an entry in a table or an instance. It is also known as record, point, entity or sample.

👉 Attribute is any property or characteristic of an object.

👉 For example, If the eye of a person is considered as an object then, the eye color, blink rate are regarded as the attributes.

👉 Attribute can also be called a feature, field, characteristic, or variable in data mining.

👉 Here, the organization of data is in a tabular form.

Let's get more technical 💻🙌

👉 Each of the rows can be called vectors, (ie) object vectors or feature vectors.

👉 The number of attributes will determine the dimensions of the vector.

👉 Thus, our data will be a collection of N-dimensional vectors.

👉 Let's understand with an example

The above table denotes fraudulent loan transaction details. Each of the rows or entities can be called a vector. The attributes such as Tid, Refund, Marital status, Taxable income determine the dimensions (ie) according to the given example, each of the vectors is three-dimensional. The target attribute is Cheat.

Attributes, properties and their types ✊ :

As discussed already, the attribute is any characteristic or feature of an object.

The type of attribute depends on the properties it possesses:

👉 Distinctness - equal or not equal (=,!=)

👉 Order - Greater than or less than (< >)

👉 Addition, subtraction (+, -)

👉 Multiplication, division (*, /)

The different types of attributes include

👉 Nominal - any unique value of an object such as ID, eye colour, country zip code

Property: Distinctness - since it is a static value, only equality can be checked

Operations: mode, entropy, correlation, chi-square test

👉 Ordinal - based on rankings (eg: ratings from1 to 5, preferences)

Property: Distinctness, Order - since it is based on hierarchy, the data can be ranked as most and least

Operations: median, rank correlation, sign tests

👉 Interval - Calendar dates, Temperature in Celcius or Kelvin.

Property: Distinctness, Order, Addition/Subtraction - intervals can grow by adding

Operations: Mean, standard deviation, t, and F tests

👉 Ratio - Interval, count and time, length

Property: Distinctness, Order, Addition/Subtraction, Multiplication/Division - for example, we can divide a length by another length but cannot be done for a date value.

Operations: Geometric mean, harmonic mean, percent variation

The ratio attribute can be divided into Discrete or Continuous values

👉 Discrete Attribute

➡ Has finite or countably infinite values

➡ Generally represented as integer values

➡ Example, zip codes, count of words in a document, number of states in the country.

👉 Continuous Attribute

➡ Has real numbers as attribute values

➡ Generally represented as floating-point values

➡ Example, height, weight, temperature or air quality metrics

Types of Data sets other than table format 😃:

👉 Graph data - social network data, molecular structure, World wide web

👉 Ordered data - spatial(geographical or map data), sequential (makes sense when viewed as a linear data - speech, or sound, genome sequence), temporal (based on events)

What should we do before data pre-processing? 🤔

👉 Determine the data quality. Some factors that affect the data quality are

👉 Noise - inconsistent data

👉 Outliers - data that does not belong to the common characteristics or pattern

👉 Missing values - Replace with probability values or estimate and eliminate them

👉 Duplicate data - Perform data cleaning

Data pre-processing techniques 💭

👉Aggregation

➡ Combining two or more attributes or objects as a single entity

➡ The main purpose is to reduce the data points, obtain more stable data, and change of scale of data (eg) cities can be aggregated into regions, states, etc

👉Sampling

➡ Sampling is basically the process of data selection and obtaining a sample of data

➡ Choose a representative sample so that it is effective

Types of sampling

➡ Simple Random Sampling: the equal probability of selecting an item

➡ Sampling with replacement: Remove the sample from the entire sample

➡ Sampling without replacement: objects are not removed from the whole data

➡ Stratified sampling: Split the data into partitions and draw random samples from them

👉Dimensionality Reduction

➡ Higher the dimensions of data more will be the complexity of analysis

➡ Techniques involved are:

➡ PCA (Principal Component Analysis)

➡ Singular value decomposition

➡ Supervised and unsupervised techniques

👉Feature subset selection

➡ Redundant features are the ones that are commonly found in many object entries with high influence. Example: The salary of an employee and tax paid

➡ Irrelevant features include student's date of birth for CGPA calculation

➡ Techniques involved are:

➡ Brute force approach

➡ Embedded approach

➡ Filter approach

➡ Wrapper approach

👉Feature creation

➡ Create new attributes from the existing original attributes

➡ One technique is Data Discretization - converting and defining discrete data into continuous intervals.

👉Attribute transformation

➡ Any function that maps the set of input values to obtain with a new set of replacement values where the new values are identified by the old ones

➡ Example: Normalization, Standardization

We are done with a basic understanding of data pre-processing. Let's get more mathematical 💥😃 in the upcoming posts

Search This Blog

✨Incisive Tech-Zone✨