Fast Oriented Text Spotting (FOTS)

sugam verma
Apr 22, 2021

Incidental scene text spotting is considered one of the most difficult and valuable challenges in the document analysis community. In daily life, text appears in a variety of natural scenes such as road signs, shop names, posters, and signboards. Sometimes the text in an image even helps one understand the context of the image. In this work we implement the research paper FOTS, which focuses on detecting and recognizing text at the same time in real-world images.


  1. Business Problem
  2. ML formulation
  3. Source of Data
  4. EDA
  5. Data Generation
  6. Overall architecture of FOTS
  7. Losses used
  8. Model Training
  9. Inference Pipeline
  10. Deployment
  11. Future work
  12. References
  13. Links to profile

1. Explanation of business problem

Reading text in natural images has attracted increasing attention in the computer vision community. Since this problem combines Computer Vision and NLP, it has numerous practical applications in document analysis, scene understanding, robot navigation, image retrieval, self-driving cars, etc. It is also one of the most challenging tasks because of the different fonts, sizes/scales, and alignments of text in real-life images.

2. ML Problem formulation

According to the FOTS research paper, we can split this problem into two sub-problems: text detection (locating where the text is in a real-world image by generating bounding boxes around it) and text recognition (recognizing what the text inside each generated bounding box says).

We are going to design a unified fast oriented text spotting network using deep learning techniques such as CNNs, ResNet, LSTMs, and a sequence decoder, which solves these two sub-problems at the same time. The goal is an end-to-end trainable model that performs detection and recognition together with minimum loss.

3. Source of Data

We use two datasets for this problem: SynthText and ICDAR 2015.

SynthText dataset - This is a synthetically generated dataset in which word instances are placed in natural scene images while taking the scene layout into account. It is a very large dataset containing 800K images with different texts.

ICDAR 2015 dataset - This is a real-world dataset of images captured with wearable cameras. It is very small compared to SynthText (only 1000 training images).

4. Exploratory Data Analysis

EDA on SynthText dataset

Visualizing the images

Distribution of image sizes

From the plot above we can observe:

  1. Most images have a size between 20 and 80.
  2. The average size across all images is 40 bytes.
  3. There are very few images whose size is greater than 80.

Above are images from the SynthText dataset.

Visualizing the bounding boxes per image

Above are the ground truth bounding boxes drawn on SynthText images.

From the distribution we can observe:

  1. Most images have fewer than 15 bounding boxes.
  2. A few images have more than 15 bounding boxes.
  3. The average number of bounding boxes per image is approx. 9.

EDA on ICDAR 2015 dataset

Visualizing the images

Above are images from the ICDAR 2015 dataset. As we can see, these are real-world images with different backgrounds and text of different font sizes at different angles.

Visualizing the bounding boxes per image

Above are the ground truth bounding boxes drawn on ICDAR images.

  1. Most images have between 1 and 25 bounding boxes.
  2. A few images have more than 50 bounding boxes.
  3. The average number of bounding boxes per image is approx. 8.

5. Data Generation/Ground Truth Generation

We create separate generator functions for the text detection branch and the text recognition branch, which feed data to each branch during training.


To train the text detection component of the FOTS model, the following ground truth masks/images must be generated for each original image:

  1. Score Map: a single-channel image indicating, for each pixel of the given image, whether it belongs to a text region (1) or a non-text region (0).
  2. Geo Map: the geometry map contains 5 channels. For each pixel that is part of the text, the first 4 channels give its distances to the top, bottom, left, and right sides of the bounding box that contains it, and the last channel gives the orientation of that bounding box.
  3. Training Mask: marks the Don’t Care text regions that must be ignored while training the detection branch.
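
To make the three targets concrete, here is a minimal sketch of how they could be built for axis-aligned boxes. The function name `make_detection_targets` and the box format are assumptions made for illustration, not the article's generator; rotation and any shrinking of the text region that an implementation may apply are left out.

```python
def make_detection_targets(height, width, boxes, ignore_boxes=()):
    """Build a score map, a 5-channel geo map and a training mask.

    boxes / ignore_boxes hold (x_min, y_min, x_max, y_max) in inclusive pixels.
    Assumption: boxes are axis-aligned, so the angle channel stays at 0.0.
    """
    score = [[0] * width for _ in range(height)]
    geo = [[[0.0] * 5 for _ in range(width)] for _ in range(height)]
    mask = [[1] * width for _ in range(height)]

    for x0, y0, x1, y1 in boxes:
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                score[y][x] = 1
                # Distances to the top, bottom, left and right sides, then the angle.
                geo[y][x] = [float(y - y0), float(y1 - y),
                             float(x - x0), float(x1 - x), 0.0]

    # Don't-care regions are excluded from the loss by zeroing the mask.
    for x0, y0, x1, y1 in ignore_boxes:
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                mask[y][x] = 0
    return score, geo, mask


score, geo, mask = make_detection_targets(
    8, 8, boxes=[(2, 1, 5, 3)], ignore_boxes=[(6, 6, 7, 7)]
)
```

For the pixel at row 2, column 3, the geo map holds the four distances to the sides of the box followed by a zero angle.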

We also resize all images to a resolution of (512,512,3) in this generator so that the text detection branch works with input images of different resolutions. Below are visualizations of the original image, the score map, the 5 geo map channels, and the training mask.

Original image, score map, 5 geo map, training mask


For the text recognition model we created 2 generators (train and test). Each word image is resized to (64,128,3), and the corresponding word is converted to a vector of size 23. To build these vectors we created a vocabulary containing all characters that can appear in a word.
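
The word-to-vector step can be sketched as follows. The vocabulary below is hypothetical (the article does not list its characters), and the padding index is an assumption. Only the vector length of 23 comes from the text.

```python
import string

# Hypothetical vocabulary: the article does not list its characters.
VOCAB = string.ascii_letters + string.digits + string.punctuation
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(VOCAB)}
MAX_LEN = 23        # vector length stated in the article
PAD = len(VOCAB)    # assumed padding index, one past the last character


def encode_word(word):
    """Map a word to a fixed-length list of character indices."""
    indices = [CHAR_TO_INDEX[ch] for ch in word[:MAX_LEN] if ch in CHAR_TO_INDEX]
    return indices + [PAD] * (MAX_LEN - len(indices))


encoded = encode_word("FOTS")
```

Characters outside the vocabulary are simply skipped in this sketch. A real generator might instead map them to a dedicated unknown symbol.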

6. Overall architecture of FOTS

According to the paper, the overall architecture is divided into 4 parts: Shared Convolutions, Text Detection Branch, RoIRotate, and Text Recognition Branch.

Shared Convolutions

Shared convolutions extract the high-level features from the image. They use ResNet50 as the backbone, pretrained on ImageNet. They are called shared because the feature maps they produce are used by both the text detection branch and the text recognition branch. The image above shows the architecture of the shared convolutions. The first orange block contains one convolution layer followed by a max-pool layer and then ResNet50 layers; the other orange blocks contain only ResNet50 layers. The light green blocks are the output features of the corresponding blocks. The blue blocks are deconv blocks, which perform deconvolution on the corresponding output features, i.e. they upsample the features and increase their spatial size. Since natural scene images contain many small text boxes, the feature maps are upscaled from 1/32 to 1/4 of the original input image size in the shared convolutions.

Text Detection Branch

For the text detection branch, the authors adopted the idea from the Efficient and Accurate Scene Text detector (EAST) paper, in which a fully convolutional network is used as the text detector. Its output has 6 channels per pixel: 1 for the score map and 5 for the geo map. Once bounding boxes are proposed by the text detection branch, thresholding and locality-aware NMS (non-maximum suppression) are applied to merge or discard heavily overlapping candidates, so that only the highest-scoring box remains for each text region. No ground truth is involved in this step.
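
The suppression step can be illustrated with standard greedy NMS on axis-aligned boxes. This is a simplified stand-in, not the locality-aware variant from EAST (which first merges neighbouring boxes) and not the article's code. The function names and the 0.5 threshold are assumptions.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, visiting boxes from highest to lowest score."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 50, 30), (12, 11, 52, 31), (100, 40, 160, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Here the second box overlaps the first with an IoU of about 0.82, so it is dropped, while the third box is far away and survives.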

RoIRotate (Region of Interest Rotate)


Note that the above image is just for visualization. The actual implementation of RoIRotate operates over feature maps extracted by shared convolutions instead of raw images.

The main function of this part is to transform an oriented text region into a normal axis-aligned text region by applying an affine transformation. In this work, the output height is fixed and the aspect ratio is kept unchanged to handle variation in text length. In contrast to RRoI pooling, which converts a rotated region to a fixed-size region by max-pooling, RoIRotate uses bilinear interpolation to compute the output values. This avoids misalignment between the RoI and the extracted features, and it allows the length of the output features to vary, which is better suited to text recognition. The process can be divided into two steps. First, the affine transformation parameters are computed from the predicted or ground truth coordinates of the text proposals. Then, the affine transformations are applied to the shared feature maps for each region, and canonical horizontal feature maps of the text regions are obtained.
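
The two steps can be sketched on a single-channel feature map stored as a list of lists. The box parameterisation (centre, size, angle) and the names `bilinear` and `roi_rotate` are assumptions made for illustration. A real implementation would work on multi-channel tensors and batch the sampling.

```python
import math


def bilinear(feature, x, y):
    """Bilinearly interpolate a 2D list at real-valued (x, y); zero outside the map."""
    h, w = len(feature), len(feature[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yy, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= xx < w and 0 <= yy < h:
                value += wx * wy * feature[yy][xx]
    return value


def roi_rotate(feature, cx, cy, box_w, box_h, angle, out_h=8):
    """Resample a rotated box into an axis-aligned patch of fixed height."""
    out_w = max(1, round(out_h * box_w / box_h))  # keep the aspect ratio
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    patch = []
    for j in range(out_h):
        row = []
        for i in range(out_w):
            # Offset of this output cell from the box centre, in source pixels.
            u = ((i + 0.5) / out_w - 0.5) * box_w
            v = ((j + 0.5) / out_h - 0.5) * box_h
            # Affine transform: rotate by the box angle, then shift to the centre.
            src_x = cx + u * cos_a - v * sin_a
            src_y = cy + u * sin_a + v * cos_a
            row.append(bilinear(feature, src_x, src_y))
        patch.append(row)
    return patch


# Toy feature map whose value equals the column index.
feature = [[float(x) for x in range(16)] for _ in range(16)]
patch = roi_rotate(feature, cx=8, cy=8, box_w=8, box_h=4, angle=0.0, out_h=4)
```

Because the output width is derived from the box's aspect ratio, longer words yield wider patches, which is the variable-length property described above.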

The text recognition branch

The text recognition branch predicts text labels using region features extracted by the shared convolutions and transformed by RoIRotate. Considering the length of the label sequence in text regions, the input features to the LSTM are reduced only twice along the width axis from the original image by the shared convolutions. Otherwise, discriminative features in compact text regions, especially those of narrow characters, would be lost. The text recognition branch consists of VGG-like sequential convolutions, poolings that reduce only along the height axis, a bidirectional LSTM, a fully connected layer, and a final CTC decoder. This part is largely similar to CRNN, and its structure is shown in the figure below.

The FOTS architecture above works as follows. First, the image is fed into the shared convolutions, which extract the shared features. These shared features are fed into the text detection branch, which predicts the bounding boxes of text in the image. The shared features and the predicted boxes are then passed to the RoIRotate operator, which extracts the text proposal features. These features go to the text recognition branch, made of a recurrent neural network (RNN) encoder and a connectionist temporal classification (CTC) decoder, which recognizes and predicts the text. Finally, the outputs of the text detection branch and the text recognition branch are merged to produce the image with the predicted bounding boxes and the predicted text of each box. Since all losses used in this network are differentiable, the whole network can be trained end-to-end.
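
To show what the final CTC decoder does with the per-time-step outputs, here is a minimal greedy decoding sketch. It picks the best class per step, collapses repeats, and removes blanks. The tiny vocabulary and the choice of the last class as blank are assumptions. Beam search decoding is not covered.

```python
def ctc_greedy_decode(frame_scores, vocab, blank_index):
    """Pick the best class per time step, collapse repeats, then drop blanks."""
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    chars = []
    previous = None
    for index in best:
        if index != previous and index != blank_index:
            chars.append(vocab[index])
        previous = index
    return "".join(chars)


vocab = "abc"
blank = len(vocab)  # assumed: the blank symbol is the last class
frames = [
    [0.1, 0.7, 0.1, 0.1],  # b
    [0.1, 0.7, 0.1, 0.1],  # b again (repeat, collapsed)
    [0.1, 0.1, 0.1, 0.7],  # blank
    [0.1, 0.7, 0.1, 0.1],  # b (kept, because a blank separates it)
    [0.7, 0.1, 0.1, 0.1],  # a
]
decoded = ctc_greedy_decode(frames, vocab, blank)
```

The blank symbol is what allows genuinely repeated characters to survive the collapsing step.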

7. Losses Used

In this implementation we train the detection and recognition parts separately, using different losses for each:

  1. Dice Loss: This loss is used in the text detection branch to classify whether each pixel in the input image belongs to a text or non-text region.
  2. IoU (Intersection over Union) Loss: This is the second loss used while training the text detection branch; it drives the network to generate proper bounding boxes around text regions.
  3. CTC (Connectionist Temporal Classification) Loss: This loss is used to train the text recognition branch to convert the text inside the bounding boxes predicted by the detection branch into actual text.
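
The two detection losses can be written down in a few lines. The sketch below uses flat Python lists instead of tensors. The IoU loss follows the common `-log(IoU)` formulation on the four side distances, and the `+1` smoothing and `eps` values are assumptions, not values taken from the article. The orientation term is left out.

```python
import math


def dice_loss(pred, target, mask, eps=1e-5):
    """1 - Dice coefficient over flat pixel lists; pixels with mask 0 are ignored."""
    inter = sum(p * t * m for p, t, m in zip(pred, target, mask))
    total = sum(p * m for p, m in zip(pred, mask)) + sum(
        t * m for t, m in zip(target, mask)
    )
    return 1.0 - 2.0 * inter / (total + eps)


def iou_loss(pred, target):
    """-log(IoU) for one pixel; inputs are (top, bottom, left, right) distances."""
    pt, pb, pl, pr = pred
    tt, tb, tl, tr = target
    area_pred = (pt + pb) * (pl + pr)
    area_target = (tt + tb) * (tl + tr)
    inter = (min(pt, tt) + min(pb, tb)) * (min(pl, tl) + min(pr, tr))
    union = area_pred + area_target - inter
    return -math.log((inter + 1.0) / (union + 1.0))  # +1 avoids log(0); an assumption


perfect = dice_loss([1, 0, 1, 0], [1, 0, 1, 0], [1, 1, 1, 1])
too_small_box = iou_loss((1, 1, 1, 1), (2, 2, 2, 2))
```

A perfect score map gives a dice loss close to 0, and a predicted box that is smaller than the ground truth gives a positive IoU loss.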

8. Model training

The SynthText dataset is very large (800K images, 41 GB), so training our model on all of it is not possible with the limited computation power available. As a workaround, we randomly select 10K images from the dataset and use them for training.

We divide the training process into 2 parts, detection and recognition. Both models are trained separately, and for inference they are combined through RoIRotate. An overview of the training of each branch is given below.
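
A reproducible random subset can be drawn with the standard library, as in the sketch below. The file names are hypothetical placeholders for the SynthText image paths, and the seed is an arbitrary choice.

```python
import random


def sample_subset(paths, k, seed=42):
    """Return k distinct items chosen uniformly at random, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(paths, k)


# Hypothetical file names standing in for the 800K SynthText images.
all_images = [f"synthtext/img_{i:06d}.jpg" for i in range(800_000)]
subset = sample_subset(all_images, 10_000)
```

Fixing the seed means the same 10K images are selected on every run, which keeps experiments comparable.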


In this branch we use the same architecture as described above, i.e. shared convolutions with a ResNet50 backbone pretrained on the ImageNet dataset. The feature maps are first downscaled to 1/32 of the input resolution and then upscaled again by the deconv blocks. While training this branch we used 2 losses: Dice Loss and IoU Loss.

To prevent the dice loss from dominating the IoU loss, we define the total loss as:

Loss = 0.01 * Dice Loss + IoU Loss

Below is the code for detection branch

We train the detection branch on 10K SynthText images for 50 epochs and then fine-tune it on the ICDAR dataset with real-world images until convergence.

During training we used several TensorFlow callbacks:

  1. Reduce On Plateau callback: reduces the learning rate when the monitored loss stops improving, which helps when training gets stuck in a local minimum.
  2. Model Checkpoint callback: saves the model weights during training.
  3. TensorBoard callback: used to visualize the loss and the layer weights in TensorBoard.
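
The idea behind the reduce-on-plateau callback can be sketched in plain Python. This illustrates the logic only, not the TensorFlow callback itself. The class name `PlateauScheduler` and the default values for `factor`, `patience`, and `min_lr` are assumptions.

```python
class PlateauScheduler:
    """Cut the learning rate when the monitored loss stops improving."""

    def __init__(self, lr, factor=0.5, patience=2, min_lr=1e-6):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.wait = 0

    def step(self, val_loss):
        """Call once per epoch; returns the learning rate for the next epoch."""
        if val_loss < self.best:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                # No improvement for `patience` epochs: shrink the learning rate.
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0
        return self.lr


scheduler = PlateauScheduler(lr=1e-3)
history = [scheduler.step(loss) for loss in [1.0, 0.8, 0.8, 0.81, 0.7]]
```

In this run the loss fails to improve for two consecutive epochs, so the learning rate is halved from the fourth epoch on.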


We use 5K SynthText data points combined with the ICDAR data points and extract the text boxes from these images to train the recognition branch. This is because the ICDAR data alone is too small, and training the model only on it causes overfitting.

We use the same recognition architecture as described above: a series of convolution operations, each followed by batch normalization, ReLU activation, and max pooling (which halves the dimensions along the height axis only).

Finally, after these operations we use 2 layers of bidirectional GRU and a final dense layer with 100 units (100 because our vocabulary has 99 characters and we need one extra class for the blank symbol used by the CTC loss and CTC decoder). Below is the recognition network architecture.

We are using CTC Loss for training this network architecture.

We trained this recognition architecture on a combination of the SynthText and ICDAR datasets to avoid overfitting.

Above we can see the training and validation loss plot. We trained the model for 30 epochs with a batch size of 128. At the end we were able to reach a loss of 5.1 on the training data and 6.4 on the validation data. Here too we used callbacks during training: TensorBoard, Reduce On Plateau, and Model Checkpoint.

9. Inference pipeline

The final inference pipeline consists of all 3 parts: text detection (including the shared convolutions), RoIRotate, and text recognition. In RoIRotate we generate the coordinates of the boxes where text is present in the image using the score map and geo maps predicted by the text detection branch. After obtaining these predicted bounding box coordinates, we pass the corresponding regions to the text recognition branch to get the text from each text region in the image.

Here are a few results from the final pipeline.

Since we were not able to train our detection and recognition models on the whole data due to a lack of computational resources, the results are not perfect. They can be improved by training the models on the whole data.
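
As a rough sketch of how box coordinates can be recovered from the predicted maps, the function below turns every confident pixel into one candidate box. It assumes an angle of 0 and a hypothetical threshold of 0.8. The name `restore_boxes` is illustrative, and the candidates would still need NMS afterwards.

```python
def restore_boxes(score_map, geo_map, threshold=0.8):
    """Turn per-pixel predictions into (x_min, y_min, x_max, y_max, score) candidates.

    geo_map[y][x] holds (top, bottom, left, right, angle).
    Assumption: the angle is 0, so each box is axis-aligned.
    """
    candidates = []
    for y, row in enumerate(score_map):
        for x, score in enumerate(row):
            if score >= threshold:
                top, bottom, left, right, _angle = geo_map[y][x]
                candidates.append((x - left, y - top, x + right, y + bottom, score))
    return candidates


score_map = [[0.0] * 6 for _ in range(4)]
geo_map = [[(0.0, 0.0, 0.0, 0.0, 0.0)] * 6 for _ in range(4)]
score_map[1][2] = 0.9
geo_map[1][2] = (1.0, 2.0, 2.0, 3.0, 0.0)
boxes = restore_boxes(score_map, geo_map)
```

Each confident pixel votes for a box by adding its four predicted distances to its own position, which is why many near-duplicate candidates appear before suppression.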

10. Deployment

For deployment, I have built a web app and deployed it in the cloud using Streamlit. Here is a demo of the web app.

11. Future Work

  1. Use the full SynthText data to train both the detection and recognition parts once the resources are available.
  2. Improve the detection part to produce more accurate bounding boxes, which will help the recognition branch produce better results.

12. References


13. Link to profile

Github Repo -

Linkedin -