Seeing through Computer Vision: Convolution 101

Have you ever wondered how modern computer vision applications do their magic? 

Are you curious to learn about the key building block that underlies some of the most impressive recent applications, from object detection to segmentation and even image generation models?

What is the secret ingredient common to edge detection, blurring, and locating a person?

It is convolution.

In this article, we’ll take a deep dive into this powerful building block, exploring its definition, practical uses, and different implementations. You’ll discover why convolution is essential to unlocking the full potential of computer vision, and we’ll even showcase a real-world example to help bring it all to life.

Let’s start!

What is a convolution?

 

It is probably one of the most popular operations in signal and image processing. In fact, it is often encountered as a first step in computer vision processing pipelines.

In a few words, a convolution is a mathematical transformation that combines two signals by multiplying and summing them, with one of the signals reflected and shifted. To be more accurate, here is the continuous equation of the convolution of two functions f and g (these two functions should belong to a subset of well-behaved functions):

$$(f * g)(t) = \int_{-\infty}^{+\infty} f(\tau)\, g(t - \tau)\, d\tau$$

Since we are working with discrete data (image pixels are represented as arrays), here is the discrete equivalent: 

$$(f * g)[n] = \sum_{m=-\infty}^{+\infty} f[m]\, g[n - m]$$

Notice that we won’t be using this convolution equation but rather the cross-correlation one. In fact, as is the convention in computer vision, we refer to cross-correlation as convolution for historical reasons, so we will stick to this convention. Here is the cross-correlation equation we will be using:

$$(f \star g)[n] = \sum_{m=-\infty}^{+\infty} f[m]\, g[n + m]$$

As you can see, the cross-correlation and convolution formulas are very similar: the only difference is the sign of the shift (g[n − m] for convolution versus g[n + m] for cross-correlation).

The symbol denoting this operation is often a star (∗). If the two signals are similar, the resulting output will be similar to them; if not, the result will be a mix of both (an interpolation between the two signals).

Below is a 1D example (source): 

Comparison of convolution, cross-correlation, and autocorrelation

From this 1D example, we can extend the operation to an n-dimensional array: the convolving signal (the kernel) becomes an n-dimensional array as well, and the sum runs over every kernel index. That’s what we do when working with black-and-white images. Indeed, an image is a 2D array of pixels: think about the width as the columns of a matrix and the height as its rows. For colored images (i.e. 3D arrays), we will see below how the convolution can be extended.
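
Concretely, for a 2D image I and a kernel K, the cross-correlation we will use reads as follows (a sketch using one common indexing convention; some references shift the indices by the kernel’s center):

$$(K \star I)[x, y] = \sum_{i}\sum_{j} K[i, j]\; I[x + i,\, y + j]$$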

Here is an animated 2D convolution (source), showing the convolving signal as a square window sliding over a 2D array:

Animated 2D convolution

As you can see from the diagram above, there are four important parameters that control a 2D convolution: 

  • how big the window is (i.e. the size of the convolving signal, reorganized into a 2D matrix) that we use to compute one output value;
  • by how much we slide the window between two computations;
  • how far apart the values we use are within the window;
  • how we fill empty values at the borders.

These four parameters translate into the following four concepts:

  • kernel size: the dimension of the kernel, usually the same for width and height (square kernel);
  • stride: by how much the kernel is moved from one computation to the next;
  • dilation: by how much we skip values within the same kernel, usually set to 1 (i.e. no values skipped);
  • padding: how to complete empty values around the borders of the input, usually we pad with 0.

We will encounter these concepts again in the implementation section.
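
As a quick check on how these four concepts interact, here is the standard output-size formula (the one PyTorch’s Conv2d documentation uses), for an input of size i, kernel size k, padding p, stride s, and dilation d:

$$o = \left\lfloor \frac{i + 2p - d(k - 1) - 1}{s} \right\rfloor + 1$$

For instance, a 5×5 input with a 3×3 kernel, padding 0, stride 1, and dilation 1 yields a 3×3 output.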

Now, if a convolution is applied to a colored image, we apply the same operation as in the 2D GIF to each color channel, then we sum the 3 contributions. Notice that we can use the same kernel for all channels or 3 different kernels (one per color channel). We will see in the implementation section a generalization of this: input and output channels (or planes) that control how many kernels are used.
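
In equation form (a sketch using the cross-correlation above, with K_c and I_c denoting channel c of the kernel and the image):

$$\text{out}[x, y] = \sum_{c=1}^{3} (K_c \star I_c)[x, y]$$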

 

Why is it useful for computer vision?

 

Now that we have introduced the concept of convolution, let’s explore why it is useful. It is indeed used in many general computer vision applications and deep learning models in particular. 

In fact, it can be useful for the following applications:

  • altering an image: for example using Gaussian blurring;
  • selecting a region: for example using edge detection;
  • learning a generic representation: for example as a layer in a neural network.

How can we explain why these applications work?
One intuitive way to explain edge detection is to notice that a convolution can approximate spatial gradients: with the appropriate kernel, we can detect edges, which are strong variations in color between neighboring pixels. More generally, since a convolution is a weighted sum of neighboring values, it can efficiently approximate many local mathematical operators (differences, averages, gradients), and that’s why it works.
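
To make the gradient intuition concrete, here is a minimal 1D sketch (the signal and kernel are illustrative): a finite-difference kernel produces a strong response exactly where the signal jumps.

import numpy as np

# A 1D "image" with a sharp step (an edge) in the middle
signal_1d = np.array([0, 0, 0, 0, 255, 255, 255, 255], dtype=float)

# A finite-difference kernel approximating the first spatial derivative
kernel = np.array([1, -1], dtype=float)

# np.convolve flips the kernel, matching the convolution definition
response = np.convolve(signal_1d, kernel, mode="same")
print(response)
# Zero on flat regions, large at the step:
# [  0.   0.   0.   0. 255.   0.   0.   0.]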

Regarding the third application, it can be understood with this quick reasoning: a machine learning model aims to learn the mathematical function f that links inputs to outputs, and since a convolution can be successfully used to select and alter regions, we can hope that chaining enough convolutions will be helpful. It turns out this is the case, and convolutional neural networks (CNNs for short) are very efficient models.

The convolution is thus a versatile tool. Let’s explore how one simple application works: edge detection.

 

Application: Edge Detection

 

As seen above, one use of convolution is to detect edges. As you might have guessed, there are many convolution kernels that can be used for edge detection, such as the Laplace and Sobel kernels, as well as complete algorithms like Canny that build on them.

For example, here is the Laplace kernel:

Laplace 2D Kernel (in matrix format)
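
For reference, a common variant of this kernel (the 4-neighbour discrete Laplacian) is:

$$\begin{pmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{pmatrix}$$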

Here is an image of a typical industrial situation, and the result of applying the Sobel-Feldman kernel to this image: 

Original image (man welding a bar)
Image through a Sobel-Feldman kernel

You can see the edges detected, and you can also try it yourself online here or here. For more implementation details, check this PyTorch example.
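
If you’d rather reproduce this locally, here is a minimal sketch using scipy on a synthetic image with a vertical edge (the image and kernels here are illustrative, not the welding photo above):

import numpy as np
from scipy import signal

# Synthetic 8x8 image: dark on the left, bright on the right
image = np.zeros((8, 8))
image[:, 4:] = 255.0

# Sobel-Feldman kernels for horizontal and vertical gradients
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

gx = signal.convolve2d(image, sobel_x, boundary="symm", mode="same")
gy = signal.convolve2d(image, sobel_y, boundary="symm", mode="same")

# Gradient magnitude: large values indicate edges
edges = np.hypot(gx, gy)
print(edges.round(1))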

 

Convolution Implementations

 

Let’s finish this blog post by exploring different ways to implement a convolution.

Given the formula above, we can translate it into a simple naive implementation.

 

Naive with loops

 

The first thing that comes to mind when looking at the formula above is to implement the convolution with an inner for loop over the values being multiplied (to compute one single output value) and an outer for loop over all output positions: 2 loops in total.

Extending this to a black-and-white image (a 2D array), we need twice as many loops: two inner loops and two outer loops, thus 4 loops.

Here is one possible implementation using numpy to store arrays:

import numpy as np

def looped_cross_correlation(image, kernel):
    hi, wi = image.shape
    hk, wk = kernel.shape
    # Zero-pad the image so the kernel can be applied at the borders
    image_padded = np.zeros(shape=(hi + hk, wi + wk))
    image_padded[hk//2:-hk//2, wk//2:-wk//2] = image
    out = np.zeros(shape=image.shape)
    # Iterate over the image values (rows and columns)
    for row in range(hi):
        for col in range(wi):
            # Iterate over the kernel values
            for i in range(hk):
                for j in range(wk):
                    out[row, col] += image_padded[row + i, col + j] * kernel[i, j]
    return out


img = np.random.randint(255, size=(3, 3))
kernel = np.array([[0, 1, 0], [0, 0, 0], [0, -1, 0]])
cross_correlated_img = looped_cross_correlation(img, kernel)

print(cross_correlated_img)
print(cross_correlated_img.shape)

# You should get something similar to:
"""
[[-142.   -2. -108.]
[ -74. -152. -212.]
[ 142.    2.  108.]]
"""
# and: 
# (3, 3)

One additional thing we notice above is that we had to pad the image with zeros around its borders so that we can access values outside its boundaries: to do this, we create a bigger array of zeros and then assign the image values into its center.
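
As a side note, numpy’s built-in np.pad achieves the same zero-padding more directly; here is a minimal sketch assuming an odd-sized kernel:

import numpy as np

hk, wk = 3, 3
image = np.arange(9, dtype=float).reshape(3, 3)
# Pad hk // 2 rows above and below, wk // 2 columns left and right, with zeros
image_padded = np.pad(image, ((hk // 2, hk // 2), (wk // 2, wk // 2)))
print(image_padded.shape)  # (5, 5)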

Let’s extend this implementation to a colored image with three color channels. For that, we will need an additional for loop over the color channels:

import numpy as np

def looped_cross_correlation_colored(image, kernel):
    hi, wi, _ = image.shape
    hk, wk, channels = kernel.shape
    image_padded = np.zeros(shape=(hi + hk, wi + wk, channels))
    image_padded[hk//2:-hk//2, wk//2:-wk//2, :] = image
    # The channel contributions are summed, so the output is a 2D array
    out = np.zeros(shape=(hi, wi))
    # Iterate over the image values (rows and columns)
    for row in range(hi):
        for col in range(wi):
            # Iterate over the kernel values
            for i in range(hk):
                for j in range(wk):
                    # Iterate over the color channels and sum their contributions
                    for c in range(channels):
                        out[row, col] += image_padded[row + i, col + j, c] * kernel[i, j, c]
    return out

img = np.random.randint(255, size=(3, 3, 3))
kernel = np.array([[0, 1, 0], [0, 0, 0], [0, -1, 0]])
# Stack along the last axis so that the channel dimension comes last,
# matching the (height, width, channels) layout of the image
colored_kernel = np.stack([kernel, kernel, kernel], axis=-1)
cross_correlated_img = looped_cross_correlation_colored(img, colored_kernel)

print(cross_correlated_img)
print(cross_correlated_img.shape)


# You should get something similar to:
"""
[[-357.  220.  357.]
 [-462.  320.  462.]
 [-249.  310.  249.]]
"""
# and
# (3, 3)

 

 

Vectorized with scipy

 

A better implementation is to use what is known as vectorization (also known as array programming). For that, we can use scipy’s signal.convolve2d function.

Here is the official convolution example from the scipy documentation page using a Scharr kernel: 

import numpy as np
from scipy import signal
from scipy import misc

# Sample grayscale image shipped with scipy
ascent = misc.ascent()
# Scharr edge-detection kernel, encoding Gx + j*Gy as complex numbers
scharr = np.array([[ -3-3j, 0-10j,  +3 -3j],
                   [-10+0j, 0+ 0j, +10 +0j],
                   [ -3+3j, 0+10j,  +3 +3j]])
grad = signal.convolve2d(ascent, scharr, boundary='symm', mode='same')

Two things to notice here (illustrated by the quick shape check below):

  • the mode keyword controls how big the output array is;
  • the boundary keyword controls how to pad the input array.
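
Here is a quick shape check (with a random input and an illustrative box-blur kernel) showing how mode changes the output size:

import numpy as np
from scipy import signal

img = np.random.rand(8, 8)
kernel = np.ones((3, 3)) / 9.0  # a simple box-blur kernel

# 'same' keeps the input size, 'valid' keeps only fully overlapping
# positions, and 'full' returns every partial overlap as well
print(signal.convolve2d(img, kernel, mode="same").shape)   # (8, 8)
print(signal.convolve2d(img, kernel, mode="valid").shape)  # (6, 6)
print(signal.convolve2d(img, kernel, mode="full").shape)   # (10, 10)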

 

GPU-boosted with PyTorch

 

Finally, let’s close this section with a more optimized implementation, especially useful when having access to a GPU: PyTorch’s torch.nn.Conv2d.

from PIL import Image
from torchvision.transforms import ToTensor, ToPILImage
from torch.nn import Conv2d
import torch

# So that the convolution is reproducible since
# it is a learnable layer and thus has a random
# initialization. 

torch.manual_seed(314)

img_path = "path/to/img"
pil_img = Image.open(img_path)
torch_img = ToTensor()(pil_img)

# We need to add a batch dimension for the convolution to work:
# Conv2d expects inputs of shape (batch, channels, height, width).

torch_img = torch_img.unsqueeze(0)

print(torch_img.shape)

# Here we keep the remaining kernel arguments at their defaults: stride=1, padding=0, dilation=1.
# The number of input channels is 3 since we have an RGB image;
# we use 1 output channel and a kernel size of 25.

convolution_operator = Conv2d(3, 1, 25)

# We select the first element since we have a batch of 1
torch_conv = convolution_operator(torch_img)[0]

# Get back an image if we want to plot it
img_conv = ToPILImage()(torch_conv)
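
To see the effect of this layer on shapes, here is a quick sketch with a random tensor (the 224×224 size is illustrative):

import torch
from torch.nn import Conv2d

x = torch.randn(1, 3, 224, 224)  # (batch, channels, height, width)
conv = Conv2d(in_channels=3, out_channels=1, kernel_size=25)
y = conv(x)
# With no padding and stride 1: 224 - 25 + 1 = 200
print(y.shape)  # torch.Size([1, 1, 200, 200])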

How is this implemented under the hood? As with most of PyTorch’s code, the low-level details are implemented in C++ and tailored to work well with CUDA (short for Compute Unified Device Architecture) and cuDNN (the CUDA Deep Neural Network library) in particular. CUDA is the low-level platform for programming NVIDIA GPUs, and cuDNN provides GPU-optimized deep learning primitives on top of it. Check this code snippet to see what the convolution C++ code looks like. Here is an extract (from here):

at::Tensor conv2d(
    const Tensor& input_, const Tensor& weight, const c10::optional<Tensor>& bias,
    IntArrayRef stride, c10::string_view padding, IntArrayRef dilation,
    int64_t groups) {
  Tensor input;
  bool is_batched;
  std::tie(input, is_batched) = batchify(input_, /*num_spatial_dims=*/ 2, "conv2d");
  Tensor output;
  if (at::isComplexType(input_.scalar_type())) {
    output = complex_convolution_mode(input, weight, bias, stride, std::move(padding), dilation, groups);
  } else {
    output = at::_convolution_mode(input, weight, bias, stride, std::move(padding), dilation, groups);
  }
  return is_batched ? std::move(output) : output.squeeze(0);
}

One interesting thing to notice is that there is an even more optimized implementation using the FFT, which can be found here and works best for larger kernels.

However, for small kernels (up to around 20 for 2D convolutions, as seen here), PyTorch’s direct implementation remains the fastest.

 

Suez Smart City Application

 

As we have seen, convolution is a fundamental technique for image recognition and a key building block of many machine learning algorithms. At Fieldbox, we specialize in delivering artificial intelligence (AI) services that leverage the power of convolution and many other advanced techniques to solve real-world problems.

One of our recent smart city projects involved designing a convolutional neural network (CNN) for Suez to detect defects in sewage pipes. By chaining together multiple convolutional layers, the CNN was able to learn increasingly complex patterns, from simple edges to curved, disconnected, and blurry lines. This enabled the model to accurately identify cracks and other defects in the pipeline, helping Suez operators improve their maintenance and repair processes.

Sewage pipe
Sewage pipe with detected defects

If you’re interested in learning more about this use case and how we’re using convolution and other advanced techniques to drive innovation in the smart city space, visit this section of our website.

 

Conclusion

 

In conclusion, convolution is a versatile and powerful technique that forms the backbone of many computer vision applications. By allowing us to extract meaningful information from visual data, convolution enables us to detect objects, recognize faces, and generate realistic images with remarkable accuracy.

As a company specializing in delivering data-driven solutions for industrial applications, we have extensive experience harnessing the power of computer vision, including the many ways in which convolution can help unlock new insights and capabilities. Our team is uniquely positioned to deliver cutting-edge solutions in this field, thanks to our mastery of both the theoretical and practical aspects of convolution, from mathematical equations to optimized implementation using PyTorch and other tools.

Whether you’re looking to optimize manufacturing processes, improve quality control, or enhance safety and security, our team of experts can help you leverage the full potential of this transformative technology.

Thank you for joining us on this deep dive into convolution. We hope you’ve gained a better understanding of this foundational building block and its many practical applications. Be sure to check back with us next time for another exciting concept explained!

 

Resources

 

Here are some additional resources to dig deeper:

Article contributors
Yassine Alouini