Posts

⏳ Mining deep into Data Mining - Statistics - PART II ⏳

Before analyzing distributions in statistics, let's understand the essential basics.
💭 MEAN
The mean is an essential concept in statistics. In common terms, it is the average of a collection of values. It is a measure of central tendency for a probability distribution; that is, it denotes the center of a series of values.
💭 STANDARD DEVIATION
The standard deviation is a statistic that measures the dispersion of a dataset relative to its mean. It is calculated as the square root of the variance, which is in turn determined from each data point's deviation from the mean.
💭 VARIANCE
Variance measures how far the data points are spread out from the mean, i.e., it is the mean squared difference between every data point and the center of the distribution (the mean). This yields the degree of dispersion of the data points. Variance is also the square of the standard deviation. For example, let's consider a series of price list valu
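
As a rough illustration of the three definitions above, here is a minimal Python sketch. The price list is hypothetical (it is not the one from the post), and the variance shown is the population variance, dividing by n; a sample variance would divide by n - 1 instead.

```python
import math


def mean(values):
    """Average of a collection of values."""
    return sum(values) / len(values)


def variance(values):
    """Population variance: mean squared difference from the mean."""
    mu = mean(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def standard_deviation(values):
    """Square root of the (population) variance."""
    return math.sqrt(variance(values))


# Hypothetical price list, chosen only for illustration.
prices = [10, 20, 30, 40, 50]

print("mean:", mean(prices))                     # 30.0
print("variance:", variance(prices))             # 200.0
print("std dev:", standard_deviation(prices))    # about 14.14
```

Squaring the standard deviation gives back the variance, which matches the relationship stated above.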

⏳ Mining deep into Data Mining - Statistics - PART I ⏳

Why do we have to know statistics? 🤔
As mentioned in the previous posts, we live in a world of data from which we can derive insightful information. Statistics therefore plays a vital role in processing and analyzing data to make decisions and predictions.
What actually is statistics? 👀 Let's get more technical.
Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. There are two types or classes of statistics:
👉 Descriptive statistics
👉 Inferential statistics
Descriptive Statistics 😀
👉 Descriptive statistics focuses on analyzing, summarizing, and organizing data in the form of numbers or graphs.
👉 Bar plots, histograms, and pie charts are used to visualize descriptive data and to determine the PDF (probability density function), CDF (cumulative distribution function), and normal distribution.
👉 Measure of central tendency is determin

💡⏳ Mining deep into Data Mining - PART II ⏳💡

Hurray 💥, we have seen the basics of data mining in Part I 😃 Let's get into the phases involved in the KDD process step by step. To start with, let's explore the Data Preprocessing phase.
What actually is DATA in data mining? 🤔
In data mining, data refers to a collection of objects and their attributes. Umm, confusing, right? 😨
👉 An object is like an entry in a table, or an instance. It is also known as a record, point, entity, or sample.
👉 An attribute is any property or characteristic of an object.
👉 For example, if the eye of a person is considered an object, then the eye color and blink rate are its attributes.
👉 An attribute can also be called a feature, field, characteristic, or variable in data mining.
👉 Here, the data is organized in tabular form.
Let's get more technical 💻🙌
👉 Each row can be called a vector, i.e., an object vector or feature vector.
👉 The number of attributes wil
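
To make the object/attribute vocabulary concrete, here is a small Python sketch. The table contents and the names `eyes`, `attributes`, and `feature_vectors` are made up for illustration; only the eye color / blink rate example comes from the text above.

```python
# Each dict is one object (record, sample); each key is an attribute (feature).
# The values are hypothetical and exist only to illustrate the terminology.
eyes = [
    {"eye_color": "brown", "blink_rate": 15},
    {"eye_color": "blue", "blink_rate": 12},
    {"eye_color": "green", "blink_rate": 18},
]

# The attribute names act as the column headers of the table.
attributes = list(eyes[0].keys())

# Each row of the table, read in column order, is a feature vector.
feature_vectors = [[row[name] for name in attributes] for row in eyes]

# The number of attributes is the length of every feature vector.
num_attributes = len(attributes)

print(attributes)        # ['eye_color', 'blink_rate']
print(feature_vectors)   # [['brown', 15], ['blue', 12], ['green', 18]]
print(num_attributes)    # 2
```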

💡⏳ Mining deep into Data Mining - PART I ⏳💡

"Necessity is the mother of invention"
The need for knowledge is the root of data collection, discovery, and analysis. To be precise, we could say that the current technological world is drowning in data but starving for knowledge. This is where data mining comes in handy.
What is Data Mining?
It is the extraction of interesting, non-trivial, previously unknown, and potentially useful patterns or knowledge from huge amounts of data.
Want to know the alternative names of Data Mining?
👉 Knowledge Discovery in Databases (KDD)
👉 Data or Pattern analysis
👉 Data archeology
👉 Data dredging
👉 Information harvesting
👉 Business Intelligence
Data mining is indeed a confluence of multiple disciplines, mainly:
👉 Statistics
👉 Algorithms
👉 Data visualization
👉 Machine learning
👉 Pattern recognition
👉 Database Technology
Why not follow traditional data analysis?
👉 Traditional data analysis cannot handle terabytes of data
👉 High dimensional data add complexity to the a

Market Basket Analysis using Association Rule-Mining in R language

Association mining is usually done on transaction data from a retail market or an online e-commerce store. Since most transaction data is large, the Apriori algorithm makes it easier to find these patterns or rules quickly. Association rules are widely used to analyze retail basket or transaction data, and are intended to identify strong rules in that data using measures of interestingness.
Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found.
DATASET: Groceries_dataset
Let's code and analyse the algorithm 💪
👉 Import the groceries dataset
👉 Explore the data
👉 Perform data preparation such as checking for null values, normalising the format of the data to numeric values, and grouping the data of similar values
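
The bottom-up idea described above can be sketched in a few lines. This is a hand-rolled Python illustration, not the R code used in the post: the `apriori` function, the `baskets` data, and the `min_support` threshold (an absolute count) are all assumptions made for the example, and the baskets are not taken from Groceries_dataset.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset.

    min_support is an absolute count of transactions (an assumption of
    this sketch; libraries often use a fraction instead).
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Candidate generation: extend frequent (k-1)-itemsets by one item.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune: every (k-1)-subset of a candidate must be frequent.
            if all(frozenset(sub) in current
                   for sub in combinations(union, k - 1)):
                candidates.add(union)

        # Test the candidates against the data.
        current = {}
        for cand in candidates:
            count = sum(1 for t in transactions if cand <= t)
            if count >= min_support:
                current[cand] = count

        frequent.update(current)
        k += 1  # Stops once no candidate survives at some level.

    return frequent


# Hypothetical baskets, for illustration only.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"milk", "bread", "butter"},
]

frequent = apriori(baskets, min_support=3)
for itemset, count in sorted(frequent.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With these baskets, every single item and every pair reaches the threshold of 3, but the three-item set appears in only two baskets, so the loop stops after the third level.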

Performance Analysis of Weather Data using Machine Learning

Weather and Climate Observations
Weather data is of primary importance for determining the climate of a region. Climate is determined by a number of factors. The formation and advancement of storm systems, the amount of precipitation an area gets, and the number of cloudy days are all influenced by air pressure, temperature, and humidity at various altitudes. These influences affect the environment on a local, international, and global scale over time.
Why is performance analysis of weather data important?
The value of weather data analytics in human life is immense. Accurate weather forecasting benefits the agricultural industry, tourism, and preparation for natural disasters such as floods and droughts. Weather forecasting also has a lot of economic appeal for news organisations, government agencies, and industrial agriculture.
Performance analysis of meteorological data: we can use weather and climate datasets to better understand and forecast the effect on shipping and logistics pr

Comprehending the state-of-the-art Digit Recognizer dataset using machine learning

Handwriting Recognition
Handwritten text recognition has been a challenge since the first automatic machines were required to identify individual characters in handwritten texts. Consider the five-digit ZIP codes on letters at the post office and the automation used to identify them. To sort mail automatically and efficiently, these codes must be read accurately. Another application that may come to mind is OCR (Optical Character Recognition) software. OCR software must read handwritten text, or pages of printed books, for general electronic documents in which each character is well defined. But the problem of handwriting recognition goes further back in time, more precisely to the early 20th century (1920s), when Emanuel Goldberg (1881–1970) began his studies on this issue and suggested that a statistical approach would be an optimal choice.
To address this issue in Python, the scikit-learn library provides a good example to better understand th