Signzy — Building a Global Digital Trust System Using AI & Blockchain
As Andrew Ng puts it, "AI is the new electricity." We at Signzy are true believers in this and have witnessed what AI can do, how deep its impact can be, and the shift it has brought in the way we approach our problems.
Being in the finance and regulatory domain, we have been identifying a series of customer problems that could potentially be addressed through AI.
One such interesting use case we encountered recently was the classification of identity cards. Given an image of an identity card, the algorithm has to classify it into one of the following classes:
- Aadhaar
- PAN
- Driving License
- Passport
- Voter Id
In this blog post we will take you behind the scenes of our state-of-the-art system and show how we tackled the problem, ultimately surpassing the accuracy required for real-world use.
Knowing the beast we are to fight
As soon as we began to dive deeper into understanding the problem and identifying techniques to attack it, we realised the most important constraints we had to work within and the aim we were striving to achieve.
The pipeline had to be deployed into financial institutions with every possible input variation, and yet it had to match or surpass the accuracy of a human being. It had to work on data arriving from the most rural parts of the country, with pictures taken on cameras as low as 0.3 megapixels and travelling over dramatically slow connections. We knew the toughest challenge would be catering to the variations that could arrive in the input.
Humans have evolved intelligence over thousands of years and have designed their systems to be easily processed by themselves. Take an identity card, for instance: it is sized to sit in a pocket wallet, coloured to be soothing to human eyes, and laid out to be read comfortably by humans. If identity cards had been designed to be consumed by computer vision software, this would have been an easier game; since that is not the case, it becomes especially challenging.
We talked with different on-ground stakeholders to identify variations in the input. Collecting initial samples wasn't that hard, since many of these variations were reported by our end users, but we knew creating training data was not going to be easy. We realised this quickly and started creating exhaustive training data in heavily curated, precisely controlled laboratory settings. We successfully got the training sets we needed, which was half the problem solved.
The world is not a cozy laboratory, we know that!
Our target was to create an architecture that could be more than 99% accurate and yet be fast enough to make an impact. This isn't easy when you know your input is coming from the rural end of India and you won't have high-end GPUs to process it on (as a matter of fact, our largest implementation of this solution runs without GPUs).
The system is expected to perform well in all sorts of real-world scenarios: varying viewpoints, illumination, deformation, occlusion, background clutter, low inter-class variation, and high intra-class variation (e.g. driving licenses).
You can't reject an application from an old rural lady who has brought you a photocopy of a printout, which in turn was obtained from a scanned copy of a long-faded PAN card. We took it as a challenge to create a system that can serve even the rural Indian masses.
Here are a few samples of the kind of input we expect into our system:
The number of samples we have for training is a huge constraint; you only have so much time and so many resources to prepare your training data.
Creating the technology
Baby steps ahead
We tried out various approaches to solving the problem. First, we extracted features using the Histogram of Oriented Gradients (HOG) feature extractor from OpenCV and trained a Support Vector Machine (SVM) classifier on top of the extracted features. The results improved further when we switched to an XGBoost classifier. We were able to reach about 72% accuracy. We used the scikit-learn machine learning framework for this.
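To make this first approach concrete, here is a minimal sketch of the HOG + classifier pipeline. The `image_paths` and `labels` lists, the 64x128 resize, and the SVM hyperparameters are illustrative assumptions, not the exact settings we used.

```python
import cv2
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

hog = cv2.HOGDescriptor()  # default 64x128 detection window

def extract_hog(image_path):
    """Read a card image, resize to the HOG window and return the feature vector."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    img = cv2.resize(img, (64, 128))
    return hog.compute(img).flatten()

# `image_paths` and `labels` are assumed to be built from the labelled training set
X = np.array([extract_hog(p) for p in image_paths])
y = np.array(labels)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

clf = SVC(kernel="rbf", C=1.0)  # an xgboost.XGBClassifier can be swapped in here
clf.fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))
```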
Not enough, let’s try something else
In our second approach, we tried a "bag of words" model, where we built a corpus containing the unique words from each identity card. We then fed the test identity cards to an in-house OCR pipeline to extract their text, and finally passed the extracted text to a Naive Bayes classifier for prediction. This method boosted the accuracy to 96%. The drawback of this approach was that it can easily be fooled by handwritten text.
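A rough sketch of this text-based approach is below. Here `ocr_extract` stands in for our in-house OCR pipeline, and `train_image_paths`, `train_labels` and `test_image_path` are assumed to exist; the vectorizer and classifier choices are the standard scikit-learn ones, shown for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Build the corpus: OCR text extracted from labelled identity-card images
train_texts = [ocr_extract(p) for p in train_image_paths]

# Bag-of-words counts followed by a Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Prediction on a new card: OCR the image, then classify the text
predicted_class = model.predict([ocr_extract(test_image_path)])[0]
print(predicted_class)
```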
Taking the deep learning leap
“The electric light did not come from the continuous improvement of candles.” — Oren Harari
In the next approach we trained a classical Convolutional Neural Network for this image classification task. We benchmarked various existing state-of-the-art architectures to find out which worked best for our dataset, e.g. Inception V4, VGG-16, ResNet and GoogLeNet. We also tried the RMSProp and Stochastic Gradient Descent optimizers, which did not turn out to be good. We settled on ResNet-50 with the Adam optimizer, a learning rate of 0.001 and a decay of 1e-5. But since we had little data, our model could not converge. So we used transfer learning from ImageNet, starting from existing weights originally trained on 1 million images. We replaced the last layer with our identity labels, froze the remaining layers and trained. The validation error was still high, so we then ran 5 epochs with all layers unfrozen. We finally reached an accuracy of around 91%, still lagging 9% behind our target.
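The sketch below illustrates this two-stage transfer-learning setup in Keras. The learning rate and the 5 fine-tuning epochs follow the text; the input size, `train_generator`/`val_generator` data feeds, first-stage epoch count and fine-tuning learning rate are assumptions (the decay of 1e-5 mentioned above would be applied via the optimizer or a learning-rate schedule, depending on the Keras version).

```python
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

# Stage 1: ImageNet weights, all convolutional layers frozen
base = ResNet50(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False

x = GlobalAveragePooling2D()(base.output)
out = Dense(5, activation="softmax")(x)  # 5 identity-card classes
model = Model(base.input, out)

model.compile(optimizer=Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_generator, epochs=10, validation_data=val_generator)

# Stage 2: unfreeze everything and fine-tune for a few epochs
for layer in base.layers:
    layer.trainable = True
model.compile(optimizer=Adam(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_generator, epochs=5, validation_data=val_generator)
```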
Hit the right nail kid, treat them as objects
The final approach is where the novelty of our algorithm lies. The idea is to use an ensemble of object detections for the image classification task. For example, an Aadhaar card carries the Indian Emblem and a QR code. We train an object detector to find these objects in the card, and when they are present with a certain level of confidence we classify the card as an Aadhaar. In this way we found 8 objects that were unique to each identity. We trained the state-of-the-art Faster Region Proposal CNN (FRCNN) architecture: feature maps are extracted by a CNN model and fed into an ROI proposal network and a classifier; the ROI network predicts the object bounding boxes and the classifier (softmax) predicts the class labels, with the errors back-propagated through the softmax and L2 loss functions. We got good results on both precision and recall, but the network still performed badly on rotated images, so we rotated those 8 objects to various angles and trained again. We finally reached an accuracy of about 99.46%. We used TensorFlow as the tool.
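A simplified sketch of how detector outputs can be turned into a card class is shown below. Here `detect_objects` stands in for the trained detector, and the object-to-card mapping and confidence threshold are illustrative, not the exact 8 objects or thresholds used in production.

```python
# Hypothetical mapping from detected objects to identity-card classes
OBJECT_TO_CARD = {
    "indian_emblem": "aadhaar",
    "aadhaar_qr_code": "aadhaar",
    "income_tax_logo": "pan",
    # ...one or more distinctive objects per card type
}

CONFIDENCE_THRESHOLD = 0.8

def classify_card(image):
    """Classify a card image from the objects the detector finds in it."""
    votes = {}
    for obj_label, score in detect_objects(image):  # [(label, confidence), ...]
        if score < CONFIDENCE_THRESHOLD:
            continue
        card = OBJECT_TO_CARD.get(obj_label)
        if card:
            votes[card] = max(votes.get(card, 0.0), score)
    if not votes:
        return "unknown"
    return max(votes, key=votes.get)  # card class with the strongest evidence
```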
But we still had one final problem to solve: execution time. FRCNN took approximately 10 seconds to classify an image on a 4-core CPU, while the target was 3 seconds; the ROI pooling made the model slow. We explored and found that the Single Shot MultiBox Detector (SSD) architecture is much faster than FRCNN, as it is an end-to-end pipeline with no ROI layer. We re-trained the model on this architecture and reached an accuracy of about 99.15%, while the execution time came down to 2.8 seconds.
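The latency comparison itself is simple to reproduce: wrap the end-to-end classification call in a wall-clock timer on the target CPU. The snippet below is a sketch using the same hypothetical `classify_card` helper from above on an assumed `test_image`.

```python
import time

start = time.time()
result = classify_card(test_image)
print(f"predicted {result} in {time.time() - start:.2f} s")
```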
Good work lad! What next?
While the pipeline we had built up to this point had very high accuracy and efficient processing time, it was still far from productionised software. We conducted multiple rounds of quality checks and real-world simulations on the entire pipeline. By fine-tuning the most impactful parameters and refining the stages, we have recently been able to develop a production-ready, world-class classifier with an error rate lower than a human's, at a much, much lower cost.
We are clearly seeing the impact deep learning can have on solving problems that we were once unable to address through technology. We were able to gauge the huge margin of improvement that deep learning provides over traditional image processing algorithms. It's truly the new technological wave. And that's for the good.
In upcoming posts, we will share the story of how we tackled another very difficult problem — Optical Character Recognition (OCR). We are competing with global giants in this space, including Google, Microsoft, IBM and ABBYY, and clearly surpassing them in our use cases. We have an interesting story to tell about "How we became the global best in enterprise-grade OCR". Stay tuned.