
Computer vision is one of the core areas of artificial intelligence (AI), and focuses on creating solutions that enable AI applications to see the world and make sense of it.
- To a computer, an image is an array of numeric pixel values. For example, consider the following array:
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
- This array represents an image with a resolution of 7x7 pixels.
- Each pixel value ranges from 0 (black) to 255 (white), so the resulting image is grayscale.
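The grayscale array above can be reproduced with a short NumPy sketch (a minimal illustration, assuming NumPy is available):

```python
import numpy as np

# The 7x7 grayscale image from the notes: a white square on a black background.
image = np.zeros((7, 7), dtype=np.uint8)
image[2:5, 2:5] = 255  # rows 2-4, columns 2-4 are white

print(image.shape)  # (7, 7)
print(image[3, 3])  # 255 (a white pixel inside the square)
```

Storing pixels as `uint8` matches the 0-255 value range described above.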
- Grayscale images use a 2D array of pixel values, while most digital images are RGB, consisting of three layers representing red, green, and blue color channels.
- For a color image, each of the three channels is a 7x7 array of pixel values, matching the shape of the grayscale example.
Red:
150 150 150 150 150 150 150
150 150 150 150 150 150 150
150 150 255 255 255 150 150
150 150 255 255 255 150 150
150 150 255 255 255 150 150
150 150 150 150 150 150 150
150 150 150 150 150 150 150
Green:
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
Blue:
255 255 255 255 255 255 255
255 255 255 255 255 255 255
255 255 0 0 0 255 255
255 255 0 0 0 255 255
255 255 0 0 0 255 255
255 255 255 255 255 255 255
255 255 255 255 255 255 255
- The resulting image is a yellow square on a purple background:
- The purple squares are represented by the combination Red: 150, Green: 0, Blue: 255.
- The yellow squares are represented by the combination Red: 255, Green: 255, Blue: 0.
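Combining the three channels into a single color image can be sketched with NumPy (a minimal illustration, assuming NumPy is available):

```python
import numpy as np

# Build the three 7x7 channels from the notes.
red = np.full((7, 7), 150, dtype=np.uint8)
green = np.zeros((7, 7), dtype=np.uint8)
blue = np.full((7, 7), 255, dtype=np.uint8)

# The central 3x3 square differs per channel.
red[2:5, 2:5] = 255
green[2:5, 2:5] = 255
blue[2:5, 2:5] = 0

# Stack the channels along a third axis to get an RGB image.
rgb = np.stack([red, green, blue], axis=-1)  # shape (7, 7, 3)
print(rgb[0, 0])  # background pixel: 150, 0, 255 (purple)
print(rgb[3, 3])  # center pixel: 255, 255, 0 (yellow)
```

This is the same layout most image libraries use: height x width x channels.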
- Filters: Alter pixel values to create visual effects.
- Filter kernels: A filter is defined by one or more arrays of weight values. For instance, a 3x3 kernel can define a filter:
-1 -1 -1
-1 8 -1
-1 -1 -1
- The kernel is convolved across the image, calculating a weighted sum for each 3x3 patch of pixels and assigning the result to a new image.
- Let's start with the grayscale image we explored previously:
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
- First, we apply the filter kernel to the top left patch of the image, multiplying each pixel value by the corresponding weight value in the kernel and adding the results:
(0 x -1) + (0 x -1) + (0 x -1) +
(0 x -1) + (0 x 8) + (0 x -1) +
(0 x -1) + (0 x -1) + (255 x -1) = -255
- The result (-255) becomes the first value in a new array. Then we move the filter kernel one pixel to the right and repeat the operation.
- The resulting array represents a transformed image in which the filter has highlighted the edges of the shape, effectively altering the original image.
- Filters like the Laplace filter (used in the example) highlight edges because the weighted sum is large where pixel intensity changes sharply and near zero in uniform regions.
- Convolutional filtering enables various effects like blurring, sharpening, and color inversion by using different filter types
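The convolution walkthrough above can be sketched in NumPy. The `convolve2d` helper here is a hypothetical minimal implementation (no padding, so the output is smaller than the input; real libraries usually pad the image to preserve its size, and clip negative results into the 0-255 range):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over each patch and record the weighted sum."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image[r:r + kh, c:c + kw]
            out[r, c] = np.sum(patch * kernel)
    return out

# The grayscale image and Laplace kernel from the notes.
image = np.zeros((7, 7))
image[2:5, 2:5] = 255

laplace = np.array([[-1, -1, -1],
                    [-1,  8, -1],
                    [-1, -1, -1]])

result = convolve2d(image, laplace)
print(result[0, 0])  # -255.0, matching the worked example above
```

The top-left value is -255, exactly the weighted sum computed step by step in the text; the interior of the white square comes out to 0, so only the edges carry large values.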
- Computer vision aims to extract meaning or actionable insights from images by training models on large datasets.
- These models learn to recognize features and patterns in images, enabling them to understand and interpret visual data effectively.
- Convolutional Neural Networks (CNNs) are common machine learning model architectures for computer vision.
- CNNs use filters to extract numeric feature maps from images. These feature values are then fed into a deep learning model to generate label predictions.
- The following steps illustrate how a CNN for an image classification model works:
- Training data with known labels (e.g., 0: apple, 1: banana, 2: orange) is used to train the CNN.
- During training,
multiple filter layers extract featuresfrom images. Initially,filter kernels start with random weights, producing numeric arrays called feature maps.- Feature maps are flattened into a single-dimensional array of feature values.
- These feature values are inputted into a fully connected neural network.
- The output layer of the neural network uses a softmax function to produce probability values for each class (e.g., [0.2, 0.5, 0.3]).
- The model calculates
lossby comparing predicted and actual class scores.- Weights in the fully connected neural network and filter kernels in feature extraction layers are adjusted to minimize this loss.
- The training process iterates over multiple epochs until optimal weights are learned.
- Afterward, these weights are saved, enabling the model to predict labels for new images with unknown labels.
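The forward pass described above (feature map -> flatten -> fully connected layer -> softmax -> loss) can be sketched in NumPy. The feature map and weights here are random stand-ins for what convolution and training would actually produce:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 5x5 feature map produced by convolutional filtering.
feature_map = rng.random((5, 5))

# 1. Flatten the feature map into a one-dimensional feature vector.
features = feature_map.flatten()  # shape (25,)

# 2. Fully connected layer with (initially random) weights for 3 classes.
weights = rng.random((3, 25))
bias = rng.random(3)
logits = weights @ features + bias

# 3. Softmax turns the raw scores into class probabilities.
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()  # e.g. something like [0.2, 0.5, 0.3]

# 4. Cross-entropy loss against the true label (say, class 1: banana).
true_class = 1
loss = -np.log(probs[true_class])

print(probs.round(3))  # three probabilities that sum to 1
```

Training would repeat this forward pass over many images and epochs, using the loss to adjust `weights`, `bias`, and the filter kernels.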
- Multi-modal models are trained on captioned images without fixed labels.
- An image encoder extracts features from images based on pixel values
- Text embeddings from a language encoder are combined with image features.
- The model captures relationships between natural language token embeddings and image features.
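One way to picture that relationship: a multi-modal model maps images and text into a shared embedding space, where matching pairs score high on similarity. The embeddings below are hand-made illustrations, not outputs of a real encoder:

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity of two embedding vectors, ignoring their magnitudes."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical embeddings; in a real model these come from the image
# encoder and the language encoder respectively.
image_embedding = np.array([0.9, 0.1, 0.0])
captions = {
    "a photo of an apple": np.array([0.8, 0.2, 0.1]),
    "a photo of a banana": np.array([0.1, 0.9, 0.2]),
}

scores = {text: cosine_similarity(image_embedding, vec)
          for text, vec in captions.items()}
best = max(scores, key=scores.get)
print(best)  # the caption whose embedding best matches the image's
```

Ranking captions (or labels) by similarity to an image embedding is how such a model can classify images without fixed labels.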
- The Microsoft Florence model is just such a model. Florence serves as a foundation model for adaptive models that perform:
  - Image classification: Identifying the category to which an image belongs.
  - Object detection: Locating individual objects within an image.
  - Captioning: Generating appropriate descriptions of images.
  - Tagging: Compiling a list of relevant text tags for an image.
- Multi-modal models like Florence are at the cutting edge of computer vision and AI in general, and are expected to drive advances in the kinds of solutions that AI makes possible.
- Microsoft's Azure AI Vision service offers prebuilt and customizable computer vision models built on the Florence foundation model, providing diverse capabilities. It enables the rapid creation of sophisticated computer vision solutions.
- To use Azure AI Vision, first create a resource in your Azure subscription from the available resource types. You can use either of the following resource types:
  - Azure AI Vision: A specific resource for the Azure AI Vision service. Use this resource type if you don't intend to use any other Azure AI services, or if you want to track utilization and costs for your Azure AI Vision resource separately.
  - Azure AI services: A general resource that includes Azure AI Vision along with many other Azure AI services, such as Azure AI Language, Azure AI Custom Vision, Azure AI Translator, and others. Use this resource type if you plan to use multiple AI services and want to simplify administration and development.
- Azure AI Vision offers various image analysis capabilities, including:
- Optical character recognition (OCR) - extracting text from images.
- Generating captions and descriptions of images.
- Detection of thousands of common objects in images.
- Tagging visual features in images.
- If the built-in models aren't sufficient, you can train a custom model for image classification or object detection using pre-trained foundation models, requiring only a few training images.
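The resource creation described above can be sketched with the Azure CLI. The resource group, region, and resource names here are hypothetical placeholders; substitute your own (a config sketch, not a definitive setup script):

```shell
# Option 1: a dedicated Azure AI Vision resource.
az cognitiveservices account create \
  --name my-vision-resource \
  --resource-group my-resource-group \
  --kind ComputerVision \
  --sku S1 \
  --location eastus

# Option 2: a multi-service Azure AI services resource.
az cognitiveservices account create \
  --name my-ai-services-resource \
  --resource-group my-resource-group \
  --kind CognitiveServices \
  --sku S0 \
  --location eastus
```

Either resource exposes endpoint and key values that client applications use to call the service.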
Face detection and analysis uses AI algorithms to locate and analyze human faces in images or videos.
Face detection: Involves identifying the regions of an image that contain faces, typically by returning bounding box coordinates.
Face analysis: Facial features such as the nose, eyes, and lips can be used to train machine learning models that extract additional information from a face.
Facial recognition: A model trained on multiple images of an individual can learn to identify that person in new images.
Microsoft Azure offers several AI services for face detection and analysis:
  - Azure AI Vision: Provides face detection and basic analysis, like bounding box coordinates.
  - Azure AI Video Indexer: Detects and identifies faces in videos.
  - Azure AI Face: Offers the most comprehensive facial analysis, including detection, recognition, and analysis.
- Optical character recognition (OCR) extracts printed and handwritten text from images line-by-line and word-by-word.
