Thursday, May 23, 2024

5 Trends in Computer Vision for 2022!

Table of Contents 

  • Machine Learning Engineer Sayak Paul introduces key trends in computer vision
    • Computer vision goals
    • Example of use
    • computer vision trends
  • Trend I: Resource Efficiency Model
    • why it’s trending
    • building process
    • Action plan up to model implementation
  • Trend II: Generative Deep Learning for Creative Applications
    • why it’s trending
    • application
    • Action plan up to model implementation
  • Trend III: Self-Supervised Learning
    • Limitations of supervised learning
    • Training on unlabeled data
    • Disadvantages of self-supervised learning
    • References
  • Trend IV: Use of Transformers and Self-Attention
    • why it’s trending
    • Pros and Cons of Leveraging Transformer
    • Exploring the Vision Transformer
  • Trend V: A Robust Vision Model
    • Problems faced by vision models
    • to be robust

Machine Learning Engineer Sayak Paul introduces key trends in computer vision

Computer vision is a fascinating area of artificial intelligence with enormous real-world value. Billion-dollar startups are popping up in the space, with Forbes predicting the market will reach $49 billion by 2022.

Computer vision goals

A major goal of computer vision is to give computers the ability to understand the world through vision and make decisions based on that understanding.

The technology can be applied to automate or augment human vision, resulting in a myriad of use cases.

If AI enables computers to think, computer vision enables computers to see, observe, and understand. – IBM

Example of use

Use cases for computer vision range from transportation to retail.

A typical use case in transportation can be found at Tesla. The company builds electric, self-driving cars, relying solely on cameras driven by computer vision models.

Computer vision is also revolutionizing the retail industry and making it even more convenient, with things like the Amazon Go program using smart sensors and computer vision systems to enable cashier-less shopping.

Computer vision clearly has a lot of potential in terms of contributing to practical applications. As a practitioner or deep learning enthusiast, it’s important to keep an eye on the latest advances in the field and stay on top of the latest trends.

Computer vision trends

In this article, I share the thoughts of Sayak Paul, Machine Learning Engineer at Carted and a recent speaker at a Bitgrit talk. You can also find him on LinkedIn and Twitter.

(*Translation Note 1) Sayak Paul is a machine learning engineer at Carted, which develops and provides shopping APIs. He is an active participant in open source projects and received the Google Open Source Peer Bonus Award in 2020 and 2021.

This article is not an exhaustive account of the talk, but rather a summary of its gist. You can view the slides from the talk here; they include useful links on related topics. The talk is also available on YouTube, where you can get more information.

The purpose of this article, like his talk, is to help readers:

  • Discover more exciting things to work on.
  • Find inspiration for their next project.
  • Catch up on what’s happening in the field.


Trend I: Resource-Efficient Models

Why it’s trending

  • The latest models can be very difficult to run offline on small microprocessor-based devices such as mobile phones and the Raspberry Pi.
  • Heavy models tend to have high latency (here, latency is the time a model takes to perform a single forward pass), which has a significant impact on infrastructure costs.
  • If cloud-based model hosting is not an option (due to cost, network connectivity, privacy concerns, etc.), what models are available?
(*Translation Note 2) For the relationship between AI model size and latency, see the AINOW translated article “From Research to Product: Scaling State-of-the-Art Machine Learning Systems.” In general, there is a trade-off between model size and latency.
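
To make the latency definition above concrete, here is a minimal sketch of how the time of one forward pass could be measured. The `toy_forward` function is a hypothetical stand-in for a real model, and the warm-up and run counts are arbitrary assumptions.

```python
import statistics
import time


def measure_latency_ms(forward, example, warmup=3, runs=20):
    """Return the median wall-clock time of one forward pass, in milliseconds."""
    for _ in range(warmup):  # warm-up calls are not timed
        forward(example)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        forward(example)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def toy_forward(pixels):
    """Hypothetical stand-in for a model: one weighted sum over the input."""
    return sum(0.5 * p for p in pixels)


latency = measure_latency_ms(toy_forward, [0.1] * 10_000)
print(f"median latency: {latency:.3f} ms")
```

The median is used rather than the mean so that a single slow run does not dominate the reported figure.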

Building process

1. Sparse training

  • Sparse training introduces zeros into the matrices used to train the neural network. This is possible because not all dimensions interact with the others; in other words, only some dimensions are important.
  • Performance may degrade somewhat, but the number of multiplications drops sharply and the network trains faster.
  • A closely related technique is pruning, which discards network parameters that fall below a certain threshold (other discarding criteria exist as well).
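
As a rough illustration of threshold-based pruning, the sketch below zeroes out small weights in a tiny matrix. The matrix values and the threshold of 0.05 are made up for the example, and real frameworks apply this per layer with their own utilities.

```python
def prune_by_magnitude(weights, threshold):
    """Zero out every weight whose absolute value is below `threshold`."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]


def sparsity(weights):
    """Fraction of entries that are exactly zero."""
    flat = [w for row in weights for w in row]
    return sum(1 for w in flat if w == 0.0) / len(flat)


# Hypothetical 2x3 weight matrix, for illustration only.
layer = [[0.8, -0.02, 0.3],
         [0.01, -0.6, 0.04]]

pruned = prune_by_magnitude(layer, threshold=0.05)
print(pruned)            # [[0.8, 0.0, 0.3], [0.0, -0.6, 0.0]]
print(sparsity(pruned))  # 0.5
```

Half of the entries become zero here, which is the kind of sparsity that lets the multiplications be skipped.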

2. Inference after training

  • Use quantization to reduce the numerical precision of the model (for example, from FP16 to INT8) and thereby reduce its size.
  • Quantization-Aware Training (QAT) can compensate for the loss of information caused by reduced precision.
  • Pruning combined with quantization works best for many use cases.
(*Translation Note 3) Quantization is the conversion of the numbers used in machine learning calculations from floating-point numbers to integers. Quantization can reduce model size and inference time.
Quantization-aware training is provided by Google as a feature of TensorFlow.
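
The following is a minimal sketch of the idea behind quantization, assuming a simple symmetric scheme with one shared scale per tensor. It is not how TensorFlow implements it; the function names and sample weights are illustrative only.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


# Hypothetical weights, for illustration only.
weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)         # [50, -127, 0, 100]
print(restored)  # close to the originals; 0.003 is rounded away to 0
```

The small weight that collapses to zero shows the information loss that quantization-aware training tries to compensate for.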

3. Knowledge Distillation

  • Train a high-performing teacher model, extract its “knowledge”, and train another small student model to match the labels obtained from the teacher.
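
One common way to make a student "match" a teacher is to compare their temperature-softened output distributions. The sketch below assumes that soft-label formulation (the talk summary does not specify one); the logits and the temperature of 2.0 are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperatures give softer distributions."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence from the teacher's soft labels to the student's predictions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical logits for a three-class problem.
teacher = [4.0, 1.0, 0.5]
good_student = [3.8, 1.1, 0.4]
poor_student = [0.5, 4.0, 1.0]

print(distillation_loss(good_student, teacher))
print(distillation_loss(poor_student, teacher))
```

The loss is close to zero when the student agrees with the teacher and grows as their predictions diverge, so minimizing it pulls the small model toward the large one.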

Action plan up to model implementation

  1. Train a larger, higher-performing teacher model.
  2. Perform knowledge distillation, using QAT if possible.
  3. Prune and quantize the distilled model.
  4. Deploy the model.


Trend II: Generative Deep Learning for Creative Applications

Why it’s trending

  • Generative deep learning has really advanced.
  • has an example of what can be achieved with generative deep learning.
(*Translation Note 5) is a site that collects non-existent objects (humans, cats, rooms, etc.) generated by various GAN models.

Application

1. Image super-resolution

  • Image resolution can be upscaled for applications such as surveillance cameras.

2. Domain transfer

  • Transfer an image to another domain
  • Example: cartoonize or animate an image of a person

3. Extrapolation

  • Generate new content for a masked region of an image.
  • It is used in domains such as image editing, to reproduce functionality found in applications like Photoshop.

4. Implicit Neural Representation and CLIP

  • Images can be generated from captions (e.g., generating an image that matches the text ‘People riding bicycles in New York City’).
  • GitHub repository
(*Translation Note 6) A representative model for generating images from captions is DALL·E, announced by OpenAI. The model in the GitHub repository is deep-daze, an image generation model built on CLIP, the image recognition model announced by OpenAI, and Siren (short for sinusoidal representation networks). With deep-daze, the following image is output for the caption “mist over green hills”. While DALL·E produces realistic images, deep-daze produces dreamlike images.

Image source: deep-daze GitHub repository

Action plan up to model implementation

  • Study existing applications like these and, optionally, implement them.
  • Develop end-to-end projects.
  • Improving the components used in generative deep learning may lead to new discoveries.

Trend III: Self-Supervised Learning

Self-supervised learning does not use ground truth labels at all; instead, it uses pretext tasks. The model is then trained on a large unlabeled dataset.

(*Translation Note 7) For self-supervised learning, see “Self-supervised Deep Learning” in the AINOW translated article “5 Deep Learning Trends to Take Artificial Intelligence to the Next Stage”.

How does self-supervised learning compare to supervised learning?

Limitations of supervised learning

  • It requires a huge amount of labeled data to improve performance.
  • Labeled data is expensive to prepare and can be biased.
  • With such large datasets, training time becomes very long.

Training on unlabeled data

  • We want the model to be invariant to different views of the same image.
  • Intuitively, the model learns what makes two images (e.g., a cat and a mountain) visually different.
  • Unlabeled datasets are much cheaper to obtain!
  • In computer vision, the self-supervised model SEER outperforms supervised models on object detection and semantic segmentation.
On March 4, 2021, Facebook announced SEER, a self-supervised learning model. It was trained on 1 billion unlabeled Instagram images. The model reached 84.2% on ImageNet classification, the best performance among self-supervised models at the time of publication.
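
The idea of being invariant to views of the same image while telling different images apart can be sketched with a contrastive loss. This is a simplified, InfoNCE-style illustration for a single anchor, not the training objective used by SEER; the two-dimensional "embeddings" and the temperature are made-up values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, temperature=0.5):
    """Low when the anchor is close to its positive and far from the negatives."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Hypothetical embeddings: two augmented views of a cat, and one mountain.
cat_view_a = [1.0, 0.1]
cat_view_b = [0.9, 0.2]
mountain = [-0.2, 1.0]

aligned = contrastive_loss(cat_view_a, cat_view_b, [mountain])
misaligned = contrastive_loss(cat_view_a, mountain, [cat_view_b])
print(aligned, misaligned)
```

Training drives the loss toward the `aligned` case, so no labels are needed: the pairing of views supplies the learning signal.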

Disadvantages of self-supervised learning

  • Self-supervised learning requires a very large amount of data to perform well on real-world tasks such as image classification.
  • In addition, self-supervised learning is computationally intensive.

References

  • Self-Supervised Learning: The Dark Matter of Intelligence
  • Understanding self-supervised learning using controlled datasets with known structure

Trend IV: Use of Transformers and Self-Attention

Why it’s trending

  • Attention helps a network learn to focus on the important context in the data by quantifying pairwise interactions between entities.
  • The idea of attention has existed in computer vision in various forms, such as GC blocks and SE networks. However, the gains were marginal.
  • The self-attention block is the foundation of the Transformer.
(*Translation Note 9) For the mechanics and importance of attention and the Transformer, see the AINOW translated article “Transformer Explained: Understanding the Models Behind GPT-3, BERT, and T5”.
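
To show what "quantifying pairwise interactions" means in practice, here is a minimal sketch of scaled dot-product self-attention. In a real Transformer the queries, keys, and values come from learned projections of the tokens; this sketch assumes identity projections and uses made-up token vectors.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """For each query, mix the values using softmax-normalized pairwise scores."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
    return outputs


# Hypothetical sequence of three 2-dimensional tokens (e.g., image patches).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens, tokens, tokens)
print(attended)
```

Every output row is a weighted average of all tokens, which is how each patch can draw context from anywhere in the image.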

Pros and Cons of Leveraging Transformer

Strong Points

  • Its weak inductive priors make it a general computational primitive for a variety of learning tasks.
  • It can match the performance of CNNs with better parameter efficiency.
(*Translation Note 10) In this article, inductive priors refer to the structure built into an AI model’s network before training. For example, CNNs have a built-in two-dimensional structure that specializes them for image recognition. An image recognition model based on the Transformer, such as Vision Transformer, has no such built-in two-dimensional structure.

Weak Points

  • Because the Transformer lacks the well-defined inductive priors of a CNN, a large amount of data is essential for pre-training.

Another trend is that when self-attention is combined with CNN, it establishes a strong baseline (BoTNet).

Exploring the Vision Transformer

  • facebookresearch/deit
  • google-research/vision_transformer
  • jeonsworld/ViT-pytorch
  • Image classification with Vision Transformer (Keras)
(*Translation Note 11) For Vision Transformer, see the AINOW translated article “[Google AI Research Blog Article] Transformers for Large-Scale Image Recognition”.


Trend V: A Robust Vision Model

Vision models are subject to many vulnerabilities that affect their performance.

Problems faced by vision models

1. Perturbation

  • Deep models are vulnerable to even small changes in the input data.
  • Imagine a model predicting a pedestrian on an empty road because of a perturbation-induced misperception!
(*Translation Note 12) For image recognition failures caused by perturbations, see “Abolishing Convolutional Neural Networks” in the AINOW translated article “5 Deep Learning Trends That Will Take Artificial Intelligence to the Next Stage.”
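
The sketch below illustrates how a small, bounded change to every input can flip a prediction. It uses an FGSM-style step on a toy linear classifier, where the gradient of the score with respect to the input is simply the weight vector. The weights, inputs, and epsilon are invented for the example.

```python
def predict(weights, bias, x):
    """Toy linear classifier: 1 means 'pedestrian', 0 means 'empty road'."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0


def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def fgsm_perturb(weights, x, epsilon):
    """Shift each input by at most epsilon in the direction that raises the score."""
    return [xi + epsilon * sign(w) for xi, w in zip(x, weights)]


# Hypothetical model parameters and input features.
weights = [0.5, -0.4, 0.3, -0.6]
bias = -0.1
empty_road = [0.2, 0.3, 0.1, 0.2]

adversarial = fgsm_perturb(weights, empty_road, epsilon=0.15)
print(predict(weights, bias, empty_road))   # 0
print(predict(weights, bias, adversarial))  # 1
```

No single feature moved by more than 0.15, yet the combined effect crosses the decision boundary, which is the failure mode described above.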

2. Corruption

  • Deep models readily latch onto high-frequency regions, which makes them vulnerable to common corruptions such as blur, contrast changes, and zoom.
(*Translation Note 13) High-frequency regions of an image are high-contrast areas with sharp transitions between light and dark. Contrast is measured by converting the brightness transitions of an image into frequencies.

3. Out of Distribution Data

There are two types of out-of-distribution data:

  1. Domain shifts but labels stay the same: we want the model to perform consistently based on what it has learned.
  2. Anomalous data points: when faced with anomalous data points, we want the model to make low-confidence predictions.

Becoming robust

There are many techniques that address specific issues such as these for building robust vision models.

1. Perturbation

  • Adversarial training: akin to Byzantine fault tolerance, this essentially prepares the system to cope when faced with the absolute worst conditions.
  • Paper
Byzantine fault tolerance (BFT) is the ability of a distributed computing system to tolerate failures. A typical practical example of BFT is blockchain. The cited paper, “Adversarial Examples Improve Image Recognition”, discusses how adding adversarial examples to the training data can improve the performance of image recognition models. Adversarial training is likened to BFT because BFT, too, tolerates adversarial faults.

2. Corruption

  • Consistency regularization: you want your model’s predictions to stay consistent when the inputs are noisy.
  • Examples that implement consistency regularization: RandAugment, Noisy Student Training, FixMatch
(*Translation Note 15) Regularization is a technique that limits the possible range of parameters to prevent overfitting. The example methods use regularization and added noise to improve model performance, as follows:

  • RandAugment: adjusts regularization strength according to model and dataset size.
  • Noisy Student Training: improves the generalization of the student model by adding noise during knowledge distillation.
  • FixMatch: uses consistency regularization and pseudo-labeling to perform semi-supervised learning.
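
A consistency term can be sketched as a penalty on the gap between predictions for a clean input and a noisy copy of it. The example below is a simplified illustration, not the exact loss of any of the methods listed; `toy_model`, its weights, and the Gaussian noise level are assumptions.

```python
import math
import random


def softmax(logits):
    """Convert logits into class probabilities."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_model(x):
    """Hypothetical two-class linear classifier, for illustration only."""
    weights = [[1.0, -0.5], [-0.3, 0.8]]
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return softmax(logits)


def consistency_loss(model, x, noise_std, rng):
    """Squared difference between predictions on clean and noisy inputs."""
    noisy = [xi + rng.gauss(0.0, noise_std) for xi in x]
    clean_pred = model(x)
    noisy_pred = model(noisy)
    return sum((c - n) ** 2 for c, n in zip(clean_pred, noisy_pred))


rng = random.Random(0)  # fixed seed so the run is repeatable
loss = consistency_loss(toy_model, [0.4, 0.6], noise_std=0.1, rng=rng)
print(loss)
```

In practice a term like this is added to the ordinary training loss, so the model is rewarded for giving the same answer whether or not the input is corrupted.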

3. Out-of-distribution data

  • Detect anomalous data points right away.
  • Paper (*Translation Note 16)
The paper cited above, “Generalized ODIN: Detecting Out-of-distribution Image without Learning from Out-of-distribution Data”, proposes two changes for detecting out-of-distribution data, a decomposed confidence score and a modified input preprocessing method, and reports that they improve out-of-distribution detection.
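
As a point of reference, the simplest way to act on low-confidence predictions is to threshold the maximum softmax probability. The sketch below shows that baseline, not the decomposed confidence method of the cited paper; the logits and the 0.7 threshold are illustrative assumptions.

```python
import math


def max_softmax_confidence(logits):
    """Probability the model assigns to its most likely class."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return max(exps) / sum(exps)


def is_out_of_distribution(logits, threshold=0.7):
    """Flag an input when the model's top confidence falls below the threshold."""
    return max_softmax_confidence(logits) < threshold


# Hypothetical logits: one confident prediction, one nearly uniform.
familiar = [6.0, 0.5, 0.2]
unfamiliar = [1.0, 0.9, 1.1]

print(is_out_of_distribution(familiar))    # False
print(is_out_of_distribution(unfamiliar))  # True
```

This baseline only works if the model really is less confident on unusual inputs, which is the property that dedicated out-of-distribution methods try to strengthen.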

I’d like to close with a clever adaptation of George Box’s famous quote, on the principles of robust models.

“All models are wrong, but some models that know when they are wrong are useful.” – Balaji Lakshminarayanan (NeurIPS 2020)

(*Translation Note 17) George Box was a statistician from England who moved to the United States. He has been called “one of the great statisticians of the 20th century”. One of his best-known quotes is “All models are wrong, but some are useful.”
Balaji Lakshminarayanan is a staff research scientist at Google Brain. In recent years, his research has focused on robust deep learning, GANs, and related topics.


This concludes the article. Thank you for reading! I hope you learned something new. If you like articles like this, be sure to follow the Bitgrit Data Science Publication.


