NUBOMEDIA: FP7/2007-2013 GA-610576

Computer Vision as a Service (CVaaS): example and trends.

  • Computer Vision

Computer Vision (CV) is a rapidly growing field devoted to the analysis, modification, and high-level understanding of images. Its objective is to determine what is happening in front of a camera and to use that understanding to control a computer or robotic system, to provide people with new, more informative images, or to index content in a way that enables richer and more efficient searching. Application areas for CV technology include video surveillance, biometrics, automotive, photography, movie production, Web search, medicine, augmented reality, gaming, new user interfaces and many more. However, CV is computationally expensive: it tends to consume a prohibitive amount of resources, including memory and CPU. This is probably the main obstacle to mass-scale usage of CV.

For this reason, the emergence of cloud technologies may enable CV to be applied in scenarios where it was previously not possible. Running CV algorithms in the cloud, where they can interoperate with each other and are accessible through simple, comprehensive APIs, can open new horizons for CV applications. Some of the benefits that cloud technology offers to computer vision are:

  • Access to on-demand computational power and storage.
  • Access to on-demand CV algorithms or services.
  • Access to simple-to-use APIs for creating CV applications.
  • Pay-per-use models, both for computational resources and for access to algorithms or other intellectual property.

Although these advantages are clear, there does not seem to be a mature solution for running CV in the cloud. To understand the current state of the art, we classify the currently available solutions into two groups. The first contains technologies that bring generic CV capabilities to the cloud. The second contains cloud systems that provide specific and very specialized algorithms. In the first group, we can find some open source solutions such as CloudCV, a large-scale distributed CV as a Service. CloudCV is an open source initiative that emerged from the GraphLab ecosystem in summer 2012. The first version of CloudCV became available in summer 2013. It only provided an image stitching algorithm, suitable for combining multiple photographic images with overlapping fields of view to produce a segmented panorama. Since this first version, CloudCV has introduced new algorithms, including:

  • Object detection.
  • Object classification, through which different objects in the image can be automatically identified.
  • Decaf: A deep convolutional activation feature for generic visual recognition.

Figure 1. CloudCV Architecture

CloudCV provides computer vision algorithms as a service to researchers, students and app developers through its Matlab, Python and web APIs. GraphLab provides a high-level programming interface that allows rapid deployment of distributed machine learning algorithms. In this case, GraphLab carries out the deployment through Amazon Web Services.

Another platform provides CloudCV-like features and exposes CV capabilities through a Ruby API. It is still in beta, but it already gives access to the following algorithms:

  • Object Detection.
  • Text recognition (OCR).
  • Detection of image similarities.
  • Face recognition (work in progress).

The project started with a lot of energy, but activity around it seems to have decreased in the last few months: the latest commits in its GitHub repository are more than one year old.

Looking at the second group, the one devoted to specific algorithms, the two most popular solutions are those devoted to face recognition and license plate recognition. For instance, the paper “Face Recognition for Social Media with Mobile Cloud Computing” proposes a cloud solution for face recognition using mobile devices. In it, mobile devices are in charge of detecting faces in the images. Once a face has been detected, the part of the image where the face was found is sent to the cloud, where face recognition is performed. Another example of face recognition is described in the paper “Cloud-Vision: Real-Time Face Recognition Using a Mobile-Cloudlet-Cloud Acceleration Architecture”. It describes different strategies for partitioning CV tasks between mobile devices and the cloud, and also for distributing the computing load among cloud servers to minimize the response time. An interesting aspect of the paper is that, with the aim of decreasing bandwidth consumption, the device only sends metadata corresponding to the Haar features of the face. This metadata is later used for matching against the corresponding face databases.
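The device/cloud split described in the first paper can be illustrated with a small sketch. It is not taken from either paper: the frame is a plain nested list of grayscale values, the bounding box is assumed to come from an on-device face detector that is not shown, and `crop_face_region` and `payload_ratio` are hypothetical helper names chosen for this example.

```python
from typing import List, Tuple

Image = List[List[int]]          # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (top, left, height, width), in pixels


def crop_face_region(image: Image, box: Box) -> Image:
    """Return only the pixels inside the detected face box.

    The box is assumed to be produced on the device; only this crop
    would then be uploaded to the cloud for recognition.
    """
    top, left, height, width = box
    if top < 0 or left < 0 or height <= 0 or width <= 0:
        raise ValueError("invalid bounding box")
    if top + height > len(image) or left + width > len(image[0]):
        raise ValueError("bounding box falls outside the image")
    return [row[left:left + width] for row in image[top:top + height]]


def payload_ratio(image: Image, crop: Image) -> float:
    """Fraction of the original pixel count that the crop represents."""
    full_pixels = len(image) * len(image[0])
    crop_pixels = len(crop) * len(crop[0])
    return crop_pixels / full_pixels


if __name__ == "__main__":
    # Synthetic 640x480 frame and an assumed 96x120 face box.
    frame = [[(r * 640 + c) % 256 for c in range(640)] for r in range(480)]
    face = crop_face_region(frame, (100, 200, 120, 96))
    print(f"crop is {len(face[0])}x{len(face)} pixels")
    print(f"fraction of the frame uploaded: {payload_ratio(frame, face):.4f}")
```

With these assumed sizes the crop is a small fraction of the raw frame, which is the motivation for detecting on the device and recognizing in the cloud. Sending Haar feature metadata instead of pixels, as in the second paper, would shrink the upload further, but that step is not modeled here.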

With regard to license plate recognition, the paper “Cloud Based Anti Vehicle Theft by Using Number Plate Recognition” describes a specific system for efficient automatic identification of stolen vehicles based on their plate number. The proposed algorithm follows the typical phases of license plate recognition: vehicle identification, number plate region extraction, and plate character recognition and comparison. Only the last step is implemented in the cloud.

After this brief review, we conclude that there is not much information available describing mature CV solutions for the cloud, and almost none dealing with CV technologies for real-time video. This may be because cloud computing technologies have been used extensively for IT problems but have not yet reached the appropriate degree of maturity for the specific (and complex) problems of CV. For this reason, the Kurento initiative and, as an extension of it, the NUBOMEDIA research project have emerged with the aim of creating a cloud platform specifically devoted to these problems. NUBOMEDIA brings some fresh ideas to the area, which include:

  • Kurento takes care of all the "plumbing" required for sending and receiving real-time video through the network following the latest standards, such as WebRTC. It also provides abstractions for managing encoding and decoding operations in a transparent and efficient way, so that automatic transcoding among H.264, VP8, H.263 and raw formats happens only when necessary, without developers needing to specify it. Recording of video streams and their recovery is also provided off the shelf.
  • Kurento exposes CV capabilities through a simple-to-use API that encapsulates and abstracts the complexities of CV technologies for real-time video. This API makes it possible to create applications just by chaining individual media functions known as “Media Elements”. Such chains (called Media Pipelines in the jargon), built from different CV services, are suitable for tackling complex problems. For example, a motion detector media element and a face recognition media element can be combined with the media pipeline (chain of media elements) shown in the figure below. In particular, every time the motion detector detects motion in the image, the face detector will try to detect faces in it.

  • NUBOMEDIA, which is a cloudification of Kurento, enables CV operations on video to be executed in real time in the cloud in an elastic and adaptive way. This is essential for fields such as video surveillance or video games, which may require different computing resources depending on system load.
  • NUBOMEDIA and Kurento are Free Open Source Software (FOSS). This guarantees that the platform is open and can be openly accessed, which helps to create a community of contributors. As a result, the number of CV services, elements or algorithms could grow considerably, generating a large library of computer vision functionalities.
  • NUBOMEDIA makes it possible to combine CV with Augmented Reality (AR), which can enable an even wider range of useful applications.
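The idea of chaining media elements into a pipeline, as in the motion detector and face detector example above, can be sketched in a few lines. This is a conceptual illustration only and does not use the Kurento API: `MediaElement`, `MotionDetector`, `FaceDetectorStub` and `run_demo` are hypothetical names, frames are flattened lists of grayscale values, and the face detector is a stub that merely counts the frames it receives.

```python
from typing import List, Optional

Frame = List[int]  # flattened, non-empty grayscale frame (stand-in for video)


class MediaElement:
    """Hypothetical pipeline stage: processes a frame and forwards the result."""

    def __init__(self) -> None:
        self._sinks: List["MediaElement"] = []

    def connect(self, sink: "MediaElement") -> "MediaElement":
        """Attach a downstream element and return it so calls can be chained."""
        self._sinks.append(sink)
        return sink

    def process(self, frame: Frame) -> Optional[Frame]:
        """Return the frame to forward, or None to drop it."""
        return frame

    def push(self, frame: Frame) -> None:
        out = self.process(frame)
        if out is not None:
            for sink in self._sinks:
                sink.push(out)


class MotionDetector(MediaElement):
    """Forwards a frame only when it differs enough from the previous one."""

    def __init__(self, threshold: float) -> None:
        super().__init__()
        self.threshold = threshold
        self._previous: Optional[Frame] = None

    def process(self, frame: Frame) -> Optional[Frame]:
        previous, self._previous = self._previous, frame
        if previous is None:
            return None  # nothing to compare the first frame against
        diff = sum(abs(a - b) for a, b in zip(frame, previous)) / len(frame)
        return frame if diff > self.threshold else None


class FaceDetectorStub(MediaElement):
    """Placeholder for a real face detector: counts the frames it analyzes."""

    def __init__(self) -> None:
        super().__init__()
        self.frames_analyzed = 0

    def process(self, frame: Frame) -> Optional[Frame]:
        self.frames_analyzed += 1
        return frame


def run_demo() -> int:
    motion = MotionDetector(threshold=10.0)
    faces = FaceDetectorStub()
    motion.connect(faces)  # pipeline: motion detector -> face detector
    for value in (0, 0, 50, 50, 0):  # five synthetic 16-pixel frames
        motion.push([value] * 16)
    return faces.frames_analyzed


if __name__ == "__main__":
    print("frames passed to the face detector:", run_demo())
```

In this sketch the face detector runs only on the frames in which the motion detector sees a change, so the more expensive stage is skipped for static scenes. A real pipeline would additionally handle media transport and transcoding, which the sketch leaves out.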