The popularity of digital images is increasing rapidly. Every day, many images are generated by groups such as students, engineers, and doctors, and the need for and use of these images varies among them. Users can access images based on their primitive features or on associated text, and the text in such images can provide meaningful information. Our goal is to retrieve content and automatically summarize visual information from images. Optical character recognition (OCR) requires the development of several different algorithms. Tesseract, originally developed by HP Labs and now owned by Google, is currently among the most accurate OCR engines available. In this paper, we extract text from images using text localization, segmentation, and binarization techniques.

Text extraction is achieved by applying text detection, which identifies the parts of the image that contain text; text localization, which determines the exact position of the text; text segmentation, which separates the text from its background; and binarization, which converts a colored image into a binary image of black text on a white background. Character recognition is then applied to this binary image to convert it to ASCII text. The method also considers features such as color, shape, and texture to retrieve the relevant document. Text extraction is used in creating e-books from scanned books, searching for images in a collection of visual data, and similar applications.

Research History

In recent years, the need to retrieve data from images and store it for future reference has grown rapidly. A great deal of research has investigated approaches that might be useful for extracting data from images.
These approaches include methods for the various extraction stages, such as text detection, text localization, and text segmentation. Furthermore, various properties of an image, such as color, intensity, connected components, orientation, and text style, are used to distinguish text regions from their backgrounds and from other regions within the image. Automatic recognition of handwritten text has long been a subject of pattern recognition research. Tesseract has previously been used to run user-specific training on both isolated and free-flowing text, particularly in lowercase Roman script. Although its development lay dormant for more than 10 years, Tesseract now rivals major commercial engines in accuracy. Tesseract was developed by HP but was never used commercially by it; it was subsequently modified, improved, and maintained by Google. Although Tesseract extracts data from images with relatively high precision, it has some flaws: over-segmentation of some characters, and under-segmentation or rejection of cursive word segments, are among them. Its unusual choice of features is probably its main strength, while its use of polygonal approximation instead of raw outlines is its main weakness. Various projects have used Tesseract in real-world scenarios involving manuscripts, data extraction and storage from images, efficient manipulation of image databases, language processing, and much more.

Introduction

Due to the rapid development of digital technology, a huge amount of information is stored in the form of images, leading to the digitization of assets in various industries. Recent studies on image processing show the importance of retrieving content from images. Extracting text from images and converting it to ASCII text can be achieved using OCR systems.
OCR is very useful and popular in many applications, including digital libraries, information retrieval systems, multimedia systems, and geographical information systems. The OCR system has largely reduced the need for a keyboard interface between man and machine and helps automate office work, saving a great deal of time and human effort. The accuracy of an OCR system often depends on its text pre-processing and segmentation algorithms. The difficulty of extracting text depends on variations in image style, size, orientation, background complexity, and so on. Various methodologies are used to perform text extraction from images, such as text detection, text localization, and text segmentation. Text detection plays a vital role in determining and highlighting regions that contain only text; these regions can also be fed into the optical character reading module for recognition. Images captured for OCR systems often include skew and perspective distortions, caused by human error, that need to be removed. Skew-corrected images are binarized using simple yet efficient binarization techniques before the segmentation process. After processing the input image, we obtain a binarized image, that is, one in which the text appears in black on a white background. Text localization is then performed on this binarized image; it involves separating each character of each word in the image by scanning the pixels sequentially. Sometimes components of adjacent characters touch or overlap, which complicates the segmentation task. This problem occurs frequently because of variation in the upper and lower zones; it is therefore an important phase. Tesseract is open-source software that extracts text with relatively high accuracy compared to other OCR systems. Tesseract does not perform its own page layout analysis, so it assumes its input is a binary image with optionally defined polygonal text areas.
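As an illustrative sketch of this binarization step (a minimal global-threshold version; the threshold value here is an assumption, and real systems usually choose it adaptively, e.g. with Bernsen's method mentioned later):

```python
def binarize(gray, threshold=128):
    """Map a grayscale image (rows of 0-255 intensities) to a binary
    image: pixels darker than the threshold become 0 (black text),
    the rest become 255 (white background)."""
    return [[0 if px < threshold else 255 for px in row] for row in gray]

# A tiny 2x3 "image": dark strokes on a light background.
page = [[20, 200, 35],
        [180, 10, 240]]
binary = binarize(page)
```

The result is exactly the black-text-on-white-background form that the later localization and segmentation stages expect as input.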
The processing that follows is a step-by-step pipeline: connected components (called blobs) are analyzed and the text is recognized as black-on-white text, blobs are organized into text lines, text lines are split into words based on the kind of character spacing, and so on.

Classification

Character Recognition

Character recognition is generally classified into two categories: online and offline character recognition. Online character recognition involves managing a stream of data from a transducer while the user is writing. When the user writes on an electromagnetic, pressure-sensitive digital tablet (the typical hardware used for data collection), the movements of the pen are transformed into a series of electronic signals that are stored and analyzed by the computer. Offline handwriting recognition, on the other hand, deals with the automatic conversion of text in images into letter codes usable by computers and text-processing applications.

Image Text

Text within images can be classified into two categories: artificial text (e.g., caption or superimposed text) and scene text (e.g., graphic text). Artificial text is superimposed onto images at a later stage, such as titles displayed on television. Scene text, however, exists naturally in images, like text printed on T-shirts. Text has various attributes that change its appearance, such as font style, size, orientation, color, texture, contrast, alignment, and background; all variation in appearance increases the difficulty of text extraction. Size: the size of text can vary greatly within an image, although assumptions can be made based on the application domain. Distance between characters: typically, a line of text has a uniform distance between characters. Color: the characters in a line of text tend to share the same or similar colors.
This property makes the connected-component approach to text detection very useful, although detecting text strings that contain more than two colors (polychrome text) in images and documents remains challenging. Alignment: caption text characters usually appear in groups aligned horizontally, but special effects can sometimes produce non-planar text. Scene text may additionally exhibit perspective distortions, such as alignment in an arbitrary direction, and geometric distortions.

OCR System

The OCR system is a step-by-step process for extracting the text of an image into an editable format. The basic steps are: digitization, pre-processing, text localization, text segmentation, classification, and character recognition.

Digitization

In the initial and most important stage of an OCR system, a handwritten or printed paper document is digitized (converted into electronic format). Digitization is achieved by scanning a document and processing the resulting image file to obtain an electronic representation of the original. It can be done with scanners such as drum scanners, flatbed scanners, sheet-fed scanners, face-up scanners, and digital cameras, using processes such as photocopying and microfilming. Along with its advantages, however, scanning can degrade the quality of documents. An expensive but very effective approach, used by the world's leading archives, is to microfilm the documents and then scan the film with a medical scanner or a high-definition film scanner.

Pre-processing

Once the image has been scanned, it must be pre-processed. This pre-processing is also sometimes referred to as the text detection step. Pre-processing improves the performance of the OCR system through skew correction and through clutter and noise removal.
Skew correction: images captured by a camera or other means often suffer from skew and perspective distortions. Such distortions occur when the axes and/or planes are not parallel at the time of capture. The effect of perspective distortion, however, is spread across the whole image and is rarely visible within small parts of it. Since the segmentation process generates only small text areas, they can be corrected computationally using fast and efficient skew correction techniques.

Clutter and noise removal: in this phase, all clutter and noise are removed from the image in a step-by-step process: grayscale conversion, followed by removal of lines and discontinuities, and finally removal of points.

Grayscale conversion. Grayscale conversion is performed on a skew-corrected text area as a step toward producing a binary image, making it easier to distinguish the text from the background; it is thus used in binarizing text. While grayscale is an improvement over monochrome, it requires more memory, as each pixel takes 4 to 8 bits. The conversion maps each RGB pixel to its grayscale equivalent by weighting its red, green, and blue components at 30%, 59%, and 11%, respectively. Besides grayscale-based thresholding, various other binarization methods (e.g., Bernsen's binarization method) are constantly being studied and refined for better efficiency.

Line removal. When the image contains noise in the form of horizontal and vertical lines running across it, such lines must be removed to improve the efficiency of the OCR system. The image is scanned progressively to detect rows and columns made up of black pixels across the entire width or height of the image; each detected row or column is then removed by changing its pixels from black to white.

Discontinuity removal.
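The weighted conversion just described can be sketched as follows (illustrative only; the 30/59/11 split approximates the standard ITU-R BT.601 luma weights of 0.299/0.587/0.114):

```python
def to_grayscale(rgb_image):
    """Convert an RGB image (rows of (r, g, b) tuples, each 0-255) to
    grayscale using the 30% red / 59% green / 11% blue weighting."""
    return [[int(0.30 * r + 0.59 * g + 0.11 * b) for (r, g, b) in row]
            for row in rgb_image]

# Pure red, green, and blue pixels map to their respective luma weights.
row = to_grayscale([[(255, 0, 0), (0, 255, 0), (0, 0, 255)]])[0]
```

The green channel dominates the weighting because the human eye is most sensitive to green; a plain average of the three channels would make text on colored backgrounds harder to separate.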
Here it is important to note that line removal has no negative effect on images without such distortion, but it does create discontinuities in the text areas that the lines intersected. These discontinuities hinder text recognition, so they must be repaired. A connectivity algorithm over 8-connected pixels is applied: if the diagonally, horizontally, or vertically opposite pairs of neighbors of a pixel are black, that pixel is also set to black. The algorithm is applied iteratively until all discontinuity pixels have been processed.

Point removal. In the final stage of pre-processing, remaining noise, such as unwanted black pixels not yet eliminated, is removed. Again the 8-connected-pixel connectivity algorithm is used, on the assumption that any group of black pixels forming noise is significantly smaller than any group forming a text character. All connected groups of black pixels below a threshold size, indicating that the cluster is unwanted noise rather than part of a character, are eliminated.

Text Localization

After pre-processing, we are left with text on a plain background. Localization enhances the text area by removing non-text areas. Text in an image has the property that its characters appear close together, forming clusters. Exploiting this property, a morphological dilation operation can group text pixels together, and pixels far from a candidate text area can be deleted.

Text Segmentation

In this phase, individual glyphs (basic units representing one or more characters, usually contiguous) are identified.
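The point-removal step can be sketched as a small 8-connected flood fill (a toy version; the min_size threshold is an assumption that would be tuned per document):

```python
def remove_specks(binary, min_size=3):
    """Remove 8-connected clusters of black (0) pixels smaller than
    min_size by turning them white (255), assuming that real character
    strokes form larger clusters than noise does."""
    h, w = len(binary), len(binary[0])
    out = [row[:] for row in binary]
    seen = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if out[y][x] == 0 and not seen[y][x]:
                # Flood-fill one 8-connected cluster of black pixels.
                stack, cluster = [(y, x)], []
                seen[y][x] = True
                while stack:
                    cy, cx = stack.pop()
                    cluster.append((cy, cx))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = cy + dy, cx + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and out[ny][nx] == 0 and not seen[ny][nx]):
                                seen[ny][nx] = True
                                stack.append((ny, nx))
                if len(cluster) < min_size:  # too small: treat as noise
                    for cy, cx in cluster:
                        out[cy][cx] = 255
    return out
```

A production system would compute connected components once and reuse them for both point removal and text detection rather than re-scanning the image.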
Segmenting handwritten characters into zones (top, middle, and bottom) and into individual characters is more difficult than for standard printed documents because of variations in paragraphs, in the words of a line, in the characters of a word, and in slant, inclination, size, and curvature. For this process, the horizontal histogram profiles of the binarized areas are analyzed to segment them into these zones.
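A minimal sketch of such a horizontal profile, assuming a binary image in which black text pixels are 0: rows containing black pixels form candidate text bands, while runs of all-white rows mark the boundaries between lines or zones.

```python
def horizontal_profile(binary):
    """Count the black (0) pixels in each row of a binary image."""
    return [sum(1 for px in row if px == 0) for row in binary]

def text_bands(binary):
    """Return (start, end) row ranges whose profile is non-zero,
    i.e. candidate text lines/zones separated by blank rows."""
    profile = horizontal_profile(binary)
    bands, start = [], None
    for i, count in enumerate(profile):
        if count and start is None:
            start = i                      # a text band begins
        elif not count and start is not None:
            bands.append((start, i - 1))   # a blank row ends the band
            start = None
    if start is not None:
        bands.append((start, len(profile) - 1))
    return bands
```

Within each detected band, the same idea applied column-wise (a vertical profile) would separate individual characters.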