Text Detection and Recognition

Dipdgrt · 6 min read · May 18, 2021

Content

  1. Introduction
  2. Business Constraints
  3. Data set analysis
  4. Text Detection
  5. Text Recognition
  6. Quantization Techniques
  7. Pipeline
  8. Future Works
  9. Deployment with Streamlit
  10. Profile
  11. Reference

1 Introduction

A typical data center contains thousands of network devices, fibers, cables, and patch cords. Each network device, for example a router or a switch, is made up of various parts such as slots and shelves for different types of cards and connection modules. For any robust network, augmentation, migration, and decommissioning are an integral part of operations, but imagine the amount of manual work needed to maintain such a huge network inventory.

Network inventory is normally updated with digital tools. Although digital tools are advantageous for updating and tracking changes and managing inventory, the device details still have to be entered manually every time a change is made, which increases the chance of manual errors. My vision is to design a tool that can update the inventory just by scanning images of a device, where a model detects and recognizes the text in the images using a deep learning framework.

Although the text detection and recognition task has received huge attention in the computer vision community, extracting text from real-world images is still considered one of the more challenging tasks due to the complexity of natural images.

We have two main tasks to perform, text detection and text recognition, and both are handled with deep learning models.

Text Detection: the model localizes text by drawing bounding boxes around it. Here a modified EAST algorithm is used to detect the text regions in an image.

Text Recognition: after detection comes the recognition part, where the detected text regions are further processed to recognize the English text they contain. Here an OCR engine such as Tesseract helps recognize the text.

2 Business Constraints

Training a model on images of critical network devices is not possible, as it would have security implications, so the model is trained on the ICDAR 2015 dataset instead.

Natural images can be blurred, noisy, and of low quality with complex backgrounds, which makes it difficult for the model to detect the textual regions.

Low latency is needed, as the model has to detect and recognize the text in the images in real time.

3 Data set analysis

The dataset used here is ICDAR 2015, which contains mainly English text and is commonly used for text detection and spotting tasks.

The dataset has 1,500 images in total: 1,000 for training the model and 500 for testing.

For text spotting, each annotation provides eight text localization coordinates (the four corners of the text box) and the ground-truth transcription of the text.

There are "Do Not Care" regions in the dataset, indicated in the ground truth with "###", and we ignore those.

Plotting the bounding boxes on the images with the given coordinates and overlaying the text shows that the boxes accurately locate the text in every image in the dataset. The "Do Not Care" labels and coordinates are removed, along with any images that have no coordinates at all, as in the parsing sketch below.
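For illustration, here is a minimal way the ICDAR 2015 annotation files can be parsed, assuming the standard one-line-per-box format of eight comma-separated coordinates followed by the transcription; the helper name is mine:

```python
import numpy as np

def load_annotation(gt_path):
    """Parse one ICDAR 2015 ground-truth file into boxes and transcriptions,
    skipping the "###" Do Not Care regions."""
    boxes, texts = [], []
    # The ground-truth files are UTF-8 with a BOM, hence "utf-8-sig".
    with open(gt_path, encoding="utf-8-sig") as f:
        for line in f:
            parts = line.strip().split(",")
            if len(parts) < 9:
                continue  # skip malformed or empty lines
            text = ",".join(parts[8:])  # the transcription may contain commas
            if text == "###":
                continue  # ignore Do Not Care regions
            boxes.append(np.array(parts[:8], dtype=np.float32).reshape(4, 2))
            texts.append(text)
    return np.array(boxes), texts
```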

The image size and channels are the same for all images in the dataset: 1280 x 720 pixels and 3 channels, i.e. RGB.

4 Text Detection

Text detection is an important part of this work: the accuracy of the overall system depends directly on detecting the text regions correctly, since the detected text is fed on to the recognition part. Accurate text detection is therefore the priority.

Our text detector module is a fully convolutional neural network that localizes text regions, inspired by the EAST model. It works well even on the challenging ICDAR 2015 dataset, where most of the images are blurry and the text is sometimes very small, as the images were taken with wearable cameras.

[Figure: EAST architecture]

Instead of PVANet we use VGG16 as the backbone, and for the loss function we use dice loss instead of the pixel-wise balanced cross-entropy used in the original EAST paper. We implement the RBOX geometry only.

Before training, the input images and ground-truth coordinates are prepared: the input images are resized with padding to 512 x 512 x 3, and the eight coordinates of each text region are encoded into 128 x 128 x 6 training labels. All preprocessed training images and their corresponding annotation files are then saved in two separate directories. One plausible version of the resizing step is sketched below.
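This resize-with-padding helper is illustrative; the exact padding scheme in the original code may differ:

```python
import cv2
import numpy as np

def resize_with_padding(image, boxes, target=512):
    """Scale the longer side to `target`, zero-pad the rest, and rescale
    the box coordinates to match."""
    h, w = image.shape[:2]
    scale = target / max(h, w)
    resized = cv2.resize(image, (int(w * scale), int(h * scale)))
    padded = np.zeros((target, target, 3), dtype=resized.dtype)
    padded[:resized.shape[0], :resized.shape[1]] = resized
    return padded, boxes * scale
```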

A simple data pipeline is built for training with tf.data, where a generator pulls a batch of preprocessed images and their corresponding annotations from the saved directories and feeds them to the model, roughly as sketched below.
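A minimal version of such a pipeline; `iterate_saved_pairs` is a hypothetical loader standing in for the code that reads the saved images and labels from disk:

```python
import tensorflow as tf

def sample_generator():
    # Yield (image, label) pairs produced by the preprocessing step above;
    # iterate_saved_pairs is a hypothetical loader over the two directories.
    for image, label in iterate_saved_pairs():
        yield image, label

dataset = (
    tf.data.Dataset.from_generator(
        sample_generator,
        output_signature=(
            tf.TensorSpec(shape=(512, 512, 3), dtype=tf.float32),
            tf.TensorSpec(shape=(128, 128, 6), dtype=tf.float32),
        ),
    )
    .batch(8)
    .prefetch(tf.data.AUTOTUNE)
)
```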

Dice loss is used to optimize the model: of the 128 x 128 x 6 labels, one channel (128 x 128 x 1) is the score map and the remaining five are the four distances of the geometry map plus one rotation angle.

[Figure: dice loss and total loss formulas]
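For concreteness, a minimal sketch of the dice loss on the score map, plus a simplified total loss; the geometry term here is a plain masked L1 stand-in, not the IoU-plus-angle loss from the EAST paper:

```python
import tensorflow as tf

def dice_loss(score_true, score_pred, eps=1e-5):
    # Dice loss: 1 - 2*|intersection| / (|truth| + |prediction|).
    inter = tf.reduce_sum(score_true * score_pred)
    union = tf.reduce_sum(score_true) + tf.reduce_sum(score_pred)
    return 1.0 - (2.0 * inter + eps) / (union + eps)

def total_loss(y_true, y_pred, geo_weight=1.0):
    # Channel 0 is the score map; channels 1-5 are the geometry map and angle.
    score_loss = dice_loss(y_true[..., 0], y_pred[..., 0])
    mask = y_true[..., 0:1]  # only penalize geometry inside text regions
    geo_loss = tf.reduce_mean(mask * tf.abs(y_true[..., 1:] - y_pred[..., 1:]))
    return score_loss + geo_weight * geo_loss
```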

The detection model is inspired by the EAST architecture: we use VGG16 as the backbone and keep the model as simple as possible by defining it with tf.keras.Model and training it with a plain fit call, roughly as sketched below.
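A rough sketch of such a model; the single feature tap here is a simplification of EAST's multi-scale feature merging, and the layer choices are illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_detector(input_size=512):
    # VGG16 backbone instead of the PVANet used in the original EAST paper.
    backbone = tf.keras.applications.VGG16(
        include_top=False, input_shape=(input_size, input_size, 3))
    features = backbone.get_layer("block3_pool").output  # 1/8 resolution
    x = layers.UpSampling2D(2)(features)                 # back to 1/4 (128x128)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    score = layers.Conv2D(1, 1, activation="sigmoid", name="score")(x)
    geo = layers.Conv2D(4, 1, activation="sigmoid", name="geometry")(x)
    angle = layers.Conv2D(1, 1, activation="sigmoid", name="angle")(x)
    # Concatenate into the 128x128x6 output matching the training labels.
    outputs = layers.Concatenate(name="east_output")([score, geo, angle])
    return tf.keras.Model(backbone.input, outputs)

model = build_detector()
model.compile(optimizer="adam", loss=total_loss)  # total_loss sketched above
model.fit(dataset, epochs=20)                     # dataset from the tf.data sketch
```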

[Figure: EAST model]

The detection model predicts text regions with good overlap against the ground-truth coordinates.

[Figure: detection model predictions]

5 Text Recognition

Once the detection part is done, the detected regions are fed to the recognition part, where optical character recognition (OCR) is a good fit for our case. We use Tesseract for the recognition task.

Each detected text region is first cropped out of the image and fed to pytesseract to recognize the text. However, we observed that the crops can be too blurry for pytesseract to read, so we preprocess the cropped images before feeding them in, which improves performance. One version of this step is sketched below.
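A sketch of the crop-and-recognize step; the exact preprocessing (upscaling plus Otsu binarization) is an assumption on my part, one common recipe for blurry crops:

```python
import cv2
import pytesseract

def recognize_region(image, box):
    """Crop one detected region (a 4x2 array of corner points), clean it up,
    and run Tesseract on it."""
    x, y, w, h = cv2.boundingRect(box.astype("int32"))
    crop = image[y:y + h, x:x + w]
    gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
    # Upscale and binarize so small, blurry text is easier for Tesseract.
    gray = cv2.resize(gray, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # --psm 7 treats the crop as a single line of text.
    return pytesseract.image_to_string(binary, config="--psm 7").strip()
```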

[Figure: text recognition examples]

6 Quantization Techniques

Quantization techniques such as TensorFlow Lite (tf.lite) are used to reduce the model's prediction time and size without degrading its performance. With a smaller model that predicts faster, we can now deploy the end-to-end pipeline for real-world use. A minimal conversion sketch is below.
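This uses TensorFlow Lite's default post-training quantization, assuming `model` is the trained Keras detector from the sketch above:

```python
import tensorflow as tf

# Convert the trained Keras detector to TensorFlow Lite with default
# post-training quantization.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("east_detector.tflite", "wb") as f:
    f.write(tflite_model)
```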

[Figure: quantization results]

7 Pipeline

The text detection and recognition models are combined into an end-to-end pipeline. It is simple: each detected text region is fed straight to the pretrained pytesseract recognition model. The combined output shows the detected text inside its bounding box and the recognized text, and also displays each cropped detection alongside its predicted text. A sketch of the glue code follows.
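Putting the pieces together, roughly; `detect_text` is a hypothetical stand-in for the detector plus box decoding, and `recognize_region` is the recognition helper sketched earlier:

```python
def run_pipeline(image):
    # Detect text boxes, then recognize each crop with pytesseract.
    results = []
    for box in detect_text(image):  # detector + NMS decoding (hypothetical)
        results.append((box, recognize_region(image, box)))
    return results
```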

[Figure: pipeline output]

Text detection and recognition on ICDAR 2015 test images

[Figures: original and predicted image]

Text detection and recognition on a Cisco modular switch

[Figures: original and predicted image]

8 Future Works

The text detection model can be improved to handle blurry images, perform better on text with unusual fonts, and produce more accurate bounding boxes.

The text recognition part can be evaluated with other OCR engines, or a custom OCR model can be built for this task for further improvement.

Model performance can also be improved by training on larger image datasets.

9 Deployment with Streamlit
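A minimal sketch of what a Streamlit front end for this pipeline could look like, reusing the hypothetical `run_pipeline` helper from the pipeline sketch:

```python
import cv2
import numpy as np
import streamlit as st

st.title("Text Detection and Recognition")

uploaded = st.file_uploader("Upload a device image", type=["jpg", "jpeg", "png"])
if uploaded is not None:
    data = np.frombuffer(uploaded.read(), dtype=np.uint8)
    image = cv2.imdecode(data, cv2.IMREAD_COLOR)
    for box, text in run_pipeline(image):  # pipeline sketched above
        cv2.polylines(image, [box.astype("int32")], True, (0, 255, 0), 2)
        st.write(text)
    st.image(cv2.cvtColor(image, cv2.COLOR_BGR2RGB), caption="Predicted image")
```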

10 Profile

GitHub Link-

LinkedIn Link-

11 Reference

  1. https://arxiv.org/abs/1801.01671
  2. https://arxiv.org/abs/1704.03155
  3. https://github.com/argman/EAST
  4. https://github.com/kurapan/EAST
  5. https://github.com/zyasjtu/EAST
  6. https://github.com/huoyijie/AdvancedEAST
  7. https://www.geeksforgeeks.org/python-opencv-cv2-puttext-method/
  8. https://jaafarbenabderrazak-info.medium.com/opencv-east-model-and-tesseract-for-detection-and-recognition-of-text-in-natural-scene-1fa48335c4d1
  9. https://pypi.org/project/pytesseract/
  10. https://www.pyimagesearch.com/2018/08/20/opencv-text-detection-east-text-detector/
  11. https://stackabuse.com/pytesseract-simple-python-optical-character-recognition/
  12. https://www.appliedaicourse.com/course/11/Applied-Machine-learning-course
