⏳ Mining deep into Data Mining - PART II ⏳
Hurray, we have seen the basics of data mining in Part I.
Let's get into the phases involved in the KDD process step by step. To start with, let's explore the Data Preprocessing phase.
What actually is DATA in data mining?
In data mining, data refers to a collection of objects and their attributes. Confusing, right?
• An object is like a row in a table, that is, a single instance. It is also known as a record, point, entity or sample.
• An attribute is any property or characteristic of an object.
• For example, if a person's eye is considered the object, then eye color and blink rate are its attributes.
• An attribute can also be called a feature, field, characteristic or variable.
• Here, the data is organized in tabular form: one row per object and one column per attribute.
Let's get more technical
• Each row can be called a vector, i.e. an object vector or feature vector.
• The number of attributes determines the dimension of the vector.
• Thus, data with N attributes is a collection of N-dimensional vectors.
• Let's understand this with an example.
The above table contains fraudulent loan transaction details. Each row (entity) can be treated as a vector. Tid only identifies a transaction, so the remaining attributes, Refund, Marital status and Taxable income, determine the dimensions, i.e. in this example each vector is three-dimensional. The target attribute is Cheat.
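To make the vector idea concrete, here is a minimal Python sketch. The two records are made-up values for illustration only (they are not copied from the table), and the sketch assumes that Tid is just a row identifier and Cheat is the target.

```python
# Hypothetical records; the values are invented for illustration only.
records = [
    {"Tid": 1, "Refund": "Yes", "Marital status": "Single",
     "Taxable income": 125000, "Cheat": "No"},
    {"Tid": 2, "Refund": "No", "Marital status": "Married",
     "Taxable income": 100000, "Cheat": "No"},
]

# Assumption: Tid only identifies a row and Cheat is the target,
# so neither of them counts as a dimension.
feature_names = ["Refund", "Marital status", "Taxable income"]

# Each object becomes one feature vector (here a tuple).
vectors = [tuple(r[name] for name in feature_names) for r in records]
targets = [r["Cheat"] for r in records]

n_objects = len(vectors)
n_dimensions = len(feature_names)
print(n_objects, n_dimensions)  # 2 3
print(vectors[0])               # ('Yes', 'Single', 125000)
```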
Attributes, properties and their types ✊:
As discussed already, an attribute is any characteristic or feature of an object.
The type of an attribute depends on which of the following properties it possesses:
• Distinctness - equal or not equal (=, !=)
• Order - less than or greater than (<, >)
• Addition, subtraction (+, -)
• Multiplication, division (*, /)
The different types of attributes are:
• Nominal - values are just names or labels that tell objects apart, such as ID numbers, eye color and zip codes
Property: Distinctness - the values carry no order, so only equality can be checked
Operations: mode, entropy, contingency correlation, chi-square test
• Ordinal - values can be ranked (e.g. ratings from 1 to 5, preferences)
Property: Distinctness, Order - the values can be sorted from least to most, but the gaps between them are not meaningful
Operations: median, rank correlation, sign tests
• Interval - calendar dates, temperature in Celsius or Fahrenheit
Property: Distinctness, Order, Addition/Subtraction - differences between values are meaningful, but there is no true zero
Operations: mean, standard deviation, t and F tests
• Ratio - temperature in Kelvin, counts, elapsed time, length
Property: Distinctness, Order, Addition/Subtraction, Multiplication/Division - for example, we can divide one length by another, but dividing one date by another is meaningless
Operations: geometric mean, harmonic mean, percent variation
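As a rough illustration of which summary statistic fits which attribute type, here is a small sketch using Python's standard `statistics` module. The sample values are hypothetical, and only one representative operation is shown per type.

```python
import statistics

# Hypothetical sample values, one list per attribute type.
eye_color = ["brown", "blue", "brown", "green"]  # nominal
rating = [1, 3, 3, 4, 5]                         # ordinal (1 = worst, 5 = best)
temp_celsius = [10.0, 20.0, 30.0]                # interval
length_cm = [2.0, 8.0]                           # ratio

# Nominal: only equality is meaningful, so the mode is a valid summary.
mode_color = statistics.mode(eye_color)

# Ordinal: order is meaningful, so the median is valid.
median_rating = statistics.median(rating)

# Interval: differences are meaningful, so the mean is valid.
mean_temp = statistics.mean(temp_celsius)

# Ratio: ratios are meaningful, so the geometric mean is valid.
geo_mean_length = statistics.geometric_mean(length_cm)

print(mode_color, median_rating, mean_temp)  # brown 3 20.0
print(round(geo_mean_length, 6))             # 4.0
```

Each operation stays valid for the richer types as well: a median makes sense for interval and ratio data, but a mean does not make sense for nominal data.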
Independently of these types, an attribute can take Discrete or Continuous values
• Discrete Attribute
➡ Has a finite or countably infinite set of values
➡ Generally represented as integer values
➡ Examples: zip codes, count of words in a document, number of states in a country
• Continuous Attribute
➡ Has real numbers as attribute values
➡ Generally represented as floating-point values
➡ Examples: height, weight, temperature or air quality metrics
Types of data sets other than the table format:
• Graph data - social network data, molecular structures, the World Wide Web
• Ordered data - spatial (geographical or map data), sequential (data that makes sense when viewed as a linear sequence, e.g. speech, sound or a genome sequence), temporal (based on events over time)
What should we do before data pre-processing?
• Determine the data quality. Some factors that affect data quality are:
➡ Noise - random errors or distortions in the recorded values
➡ Outliers - data objects whose characteristics differ considerably from the common pattern
➡ Missing values - eliminate the affected objects, or estimate the missing values (for example, replace them with likely values weighted by their probabilities)
➡ Duplicate data - remove it as part of data cleaning
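Here is a minimal sketch of two of these quality checks, duplicates and missing values, in plain Python. The rows are hypothetical, and using the mean to estimate a missing value is just one simple choice among many.

```python
import statistics

# Hypothetical rows of (name, age); None marks a missing value.
rows = [("Ann", 30), ("Bob", None), ("Ann", 30), ("Cid", 40)]

# Duplicate data: keep the first occurrence of each row, preserving order.
deduplicated = list(dict.fromkeys(rows))

# Missing values, option 1: eliminate objects that have a missing value.
complete_rows = [row for row in deduplicated if row[1] is not None]

# Missing values, option 2: estimate them, here with the mean of the known ages.
mean_age = statistics.mean(age for _, age in complete_rows)
imputed_rows = [
    (name, age if age is not None else mean_age) for name, age in deduplicated
]

print(deduplicated)   # [('Ann', 30), ('Bob', None), ('Cid', 40)]
print(complete_rows)  # [('Ann', 30), ('Cid', 40)]
print(imputed_rows)   # [('Ann', 30), ('Bob', 35), ('Cid', 40)]
```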
Data pre-processing techniques
• Aggregation
➡ Combining two or more attributes or objects into a single attribute or object
➡ The main purposes are to reduce the number of data points, to obtain more stable (less variable) data and to change the scale of the data, e.g. cities can be aggregated into regions, states, etc.
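The cities-to-states idea can be sketched in a few lines of Python. The sales figures are hypothetical, and summing is only one possible way to aggregate (an average or a count would work the same way).

```python
from collections import defaultdict

# Hypothetical city-level sales figures.
city_sales = [("Chennai", 120), ("Madurai", 80), ("Mumbai", 200), ("Pune", 100)]

# Mapping used to change the scale from city to state.
city_to_state = {
    "Chennai": "Tamil Nadu",
    "Madurai": "Tamil Nadu",
    "Mumbai": "Maharashtra",
    "Pune": "Maharashtra",
}

# Aggregate: four city-level objects become two state-level objects.
state_sales = defaultdict(int)
for city, sales in city_sales:
    state_sales[city_to_state[city]] += sales

print(dict(state_sales))  # {'Tamil Nadu': 200, 'Maharashtra': 300}
```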
• Sampling
➡ Sampling is the process of selecting a subset of the data objects to analyze
➡ The sample must be representative of the whole data set for it to be effective
Types of sampling
➡ Simple random sampling: every item has an equal probability of being selected
➡ Sampling without replacement: each selected item is removed from the population, so it cannot be picked again
➡ Sampling with replacement: selected items are not removed from the population, so the same item can be picked more than once
➡ Stratified sampling: split the data into partitions and draw random samples from each partition
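These sampling schemes map quite directly onto Python's standard `random` module. The sketch below uses a toy population of the numbers 1 to 10, and the even/odd split used as strata is an assumption made purely for illustration.

```python
import random

random.seed(42)  # fixed seed so that repeated runs give the same samples
population = list(range(1, 11))

# Without replacement: a picked item leaves the pool and cannot be picked again.
without_replacement = random.sample(population, k=5)

# With replacement: a picked item stays in the pool, so repeats are possible.
with_replacement = random.choices(population, k=5)

# Stratified: partition the data, then sample randomly inside each partition.
strata = {
    "even": [x for x in population if x % 2 == 0],
    "odd": [x for x in population if x % 2 == 1],
}
stratified = {name: random.sample(items, k=2) for name, items in strata.items()}

print(without_replacement)
print(with_replacement)
print(stratified)
```

Both `random.sample` and `random.choices` pick items with equal probability by default, so each of them is also a simple random sample.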
• Dimensionality Reduction
➡ The higher the dimensionality of the data, the more complex the analysis
➡ Techniques involved are:
➡ PCA (Principal Component Analysis)
➡ Singular value decomposition (SVD)
➡ Other supervised and unsupervised techniques
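To give a feel for PCA, here is a deliberately small sketch that reduces 2-D points to 1-D. It finds only the first principal component, using power iteration on the covariance matrix, which is one simple way to do it and not how production libraries work. The function name and the data points are hypothetical.

```python
import math

def first_principal_component(points, iterations=100):
    """Return the unit direction of largest variance and the centered points."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Power iteration converges to the eigenvector with the largest eigenvalue.
    # Assumption: the start vector (1, 1) is not orthogonal to that eigenvector
    # and the points are not all identical.
    vx, vy = 1.0, 1.0
    for _ in range(iterations):
        nx = sxx * vx + sxy * vy
        ny = sxy * vx + syy * vy
        norm = math.hypot(nx, ny)
        vx, vy = nx / norm, ny / norm
    return (vx, vy), centered

# Hypothetical points lying exactly on the line y = 2x.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, centered = first_principal_component(points)

# Project each centered 2-D point onto the component: 2 dimensions become 1.
reduced = [x * direction[0] + y * direction[1] for x, y in centered]

print(direction)  # roughly (0.447, 0.894), i.e. the direction of y = 2x
print(reduced)
```

Because these points lie on a single line, one dimension keeps all of the variance. With real data some information is lost, and the trade-off is a simpler analysis.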
• Feature subset selection
➡ Redundant features duplicate much or all of the information contained in one or more other attributes. Example: the salary of an employee and the tax paid
➡ Irrelevant features contain no information that is useful for the task at hand. Example: a student's date of birth when calculating CGPA
➡ Techniques involved are:
➡ Brute force approach
➡ Embedded approach
➡ Filter approach
➡ Wrapper approach
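As one possible illustration of the filter approach, the sketch below scores each feature by its correlation with the target before any mining algorithm is run. The data, the feature names and the choice of Pearson correlation as the score are all assumptions made for this example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: tax_paid carries the same information as salary,
# and shoe_size has little to do with the target.
features = {
    "salary": [30.0, 40.0, 50.0, 60.0, 70.0],
    "tax_paid": [3.0, 4.0, 5.0, 6.0, 7.0],
    "shoe_size": [9.0, 7.0, 10.0, 8.0, 9.0],
}
target = [1.0, 2.0, 3.0, 4.0, 5.0]

# Filter approach: score every feature independently of the mining algorithm.
scores = {name: abs(pearson(values, target)) for name, values in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

# Redundancy check: a correlation near 1 suggests one of the two can be dropped.
redundancy = pearson(features["salary"], features["tax_paid"])

print(ranked[-1])            # shoe_size
print(round(redundancy, 6))  # 1.0
```

A wrapper approach would instead train the actual mining algorithm on candidate subsets and keep the subset that performs best, which is usually more expensive.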
• Feature creation
➡ Create new attributes from the existing original attributes
➡ One technique is data discretization - converting a continuous attribute into a discrete one by splitting its range into intervals.
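Discretization is easy to try out. This sketch uses equal-width binning, which is just one of several ways to choose the intervals; the function name, the heights and the three labels are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Map each continuous value to a bin index from 0 to n_bins - 1."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins  # assumes the values are not all equal
    bins = []
    for value in values:
        index = int((value - low) / width)
        bins.append(min(index, n_bins - 1))  # the maximum goes in the last bin
    return bins

heights_cm = [150.0, 158.0, 165.0, 171.0, 180.0]  # hypothetical values
labels = ["short", "medium", "tall"]

# The continuous height becomes a discrete attribute with three values.
binned = [labels[i] for i in equal_width_bins(heights_cm, n_bins=3)]
print(binned)  # ['short', 'short', 'medium', 'tall', 'tall']
```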
• Attribute transformation
➡ Any function that maps the entire set of values of an attribute to a new set of replacement values, such that each old value can be identified with one of the new values
➡ Examples: normalization, standardization
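Both examples can be sketched with the standard library. Here normalization is taken to mean min-max scaling to the range [0, 1] and standardization to mean the z-score. Other definitions exist, so treat this as one common reading. The input values are hypothetical.

```python
import statistics

values = [10.0, 20.0, 30.0, 40.0, 50.0]  # hypothetical attribute values

# Min-max normalization: rescale the values to the range [0, 1].
low, high = min(values), max(values)
normalized = [(v - low) / (high - low) for v in values]

# Standardization (z-score): subtract the mean, divide by the standard deviation.
mean = statistics.mean(values)
std = statistics.pstdev(values)  # population standard deviation
standardized = [(v - mean) / std for v in values]

print(normalized)  # [0.0, 0.25, 0.5, 0.75, 1.0]
print([round(z, 3) for z in standardized])
```

In both cases every new value maps back to exactly one old value, which is what makes these valid attribute transformations.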
We are done with a basic understanding of data pre-processing. Let's get more mathematical in the upcoming posts.