Comprehending the state-of-art Digit Recognizer dataset using machine learning

March 31, 2021

Handwriting Recognition

Handwritten text recognition has been a challenge since the first automatic machines were required to identify individual characters in handwritten texts.Consider the five-digit ZIP codes on letters at the post office and the automation used to identify them.To sort mail automatically and efficiently, perfect understanding of these codes is required.

Included among the other applications that may come to mind is OCR (Optical Character Recognition) software. OCR software must read handwritten text, or pages of printed books, for general electronic documents in which each character is well defined. But the problem of handwriting recognition goes farther back in time, more precisely to the early 20th Century (1920s), when Emanuel Goldberg (1881–1970) began his studies regarding this issue and suggested that a statistical approach would be an optimal choice.

To address this issue in Python, the scikit-learn library provides a good example to better understand this technique, the issues involved, and the possibility of making predictions.

Recognizing Handwritten Digits with scikit-learn

The problem to be addressed involves predicting a numeric value, and then reading and interpreting an image that uses a handwritten font. So even in this case you will have an estimator with the task of learning through a fit() function, and once it has reached a degree of predictive capability (a model sufficiently valid), it will produce a prediction with the predict() function.

An estimator that is useful in this case is sklearn.svm.SVC, which uses the technique of Support Vector Classification (SVC). Thus, you have to import the svm module of the scikit-learn library. You can create an estimator of SVC type and then choose an initial setting, assigning the values C and gamma generic values. These values can then be adjusted in a different way during the course of the analysis.

Digits Dataset

The scikit-learn library provides numerous datasets that are useful for testing many problems of data analysis and prediction of the results. Also in this case there is a dataset of images called Digits. This dataset consists of 1,797 images that are 8x8 pixels in size. Each image is a handwritten digit in grayscale. After loading the dataset, you can analyze the content. First, you can read lots of information about the datasets by calling the DESCR attribute.

The images of the handwritten digits are contained in a digits.images array. Each element of this array is an image that is represented by an 8x8 matrix of numerical values that correspond to a grayscale from white, with a value of 0, to black, with the value 15.