
Computer vision is one of the core areas of artificial intelligence (AI), and focuses on creating solutions that enable AI applications to see the world and make sense of it.
- To a computer, an image is an array of numeric pixel values. For example, consider the following array:
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
- This array represents an image with a resolution of 7x7 pixels.
- Each pixel value ranges from 0 (black) to 255 (white), so the resulting image is grayscale.
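The grayscale array above can be reproduced with a short NumPy sketch (a minimal illustration, assuming NumPy is available):

```python
import numpy as np

# The 7x7 grayscale image from the notes: a white square on a black background.
image = np.zeros((7, 7), dtype=np.uint8)
image[2:5, 2:5] = 255  # rows 2-4, columns 2-4 are white

print(image.shape)  # (7, 7)
print(image[3, 3])  # 255 (a white pixel inside the square)
```

Storing pixels as `uint8` matches the 0-255 value range described above.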
- Grayscale images use a 2D array of pixel values, while most digital images are RGB, consisting of three layers representing red, green, and blue color channels.
- For a color image, each of the three channels is a 7x7 array of pixel values, matching the shape of the grayscale example.
Red:
150 150 150 150 150 150 150
150 150 150 150 150 150 150
150 150 255 255 255 150 150
150 150 255 255 255 150 150
150 150 255 255 255 150 150
150 150 150 150 150 150 150
150 150 150 150 150 150 150
Green:
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
Blue:
255 255 255 255 255 255 255
255 255 255 255 255 255 255
255 255 0 0 0 255 255
255 255 0 0 0 255 255
255 255 0 0 0 255 255
255 255 255 255 255 255 255
255 255 255 255 255 255 255
- The resulting image is a yellow square on a purple background:
- The purple squares are represented by the combination Red: 150, Green: 0, Blue: 255.
- The yellow squares are represented by the combination Red: 255, Green: 255, Blue: 0.
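Combining the three channels into a single color image can be sketched with NumPy (a minimal illustration, assuming NumPy is available):

```python
import numpy as np

# Build the three 7x7 channels from the notes.
red = np.full((7, 7), 150, dtype=np.uint8)
green = np.zeros((7, 7), dtype=np.uint8)
blue = np.full((7, 7), 255, dtype=np.uint8)

# The central 3x3 square differs per channel.
red[2:5, 2:5] = 255
green[2:5, 2:5] = 255
blue[2:5, 2:5] = 0

# Stack the channels along a third axis to get an RGB image.
rgb = np.stack([red, green, blue], axis=-1)  # shape (7, 7, 3)
print(rgb[0, 0])  # background pixel: 150, 0, 255 (purple)
print(rgb[3, 3])  # center pixel: 255, 255, 0 (yellow)
```

This is the same layout most image libraries use: height x width x channels.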
- Filters: Alter pixel values to create visual effects.
- Filter kernels: A filter is defined by one or more arrays of weight values. For instance, a 3x3 kernel can define a filter:
-1 -1 -1
-1 8 -1
-1 -1 -1
- The kernel is convolved across the image, calculating a weighted sum for each 3x3 patch of pixels and assigning the result to a new image.
- Let's start with the grayscale image we explored previously:
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 255 255 255 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
- First, we apply the filter kernel to the top left patch of the image, multiplying each pixel value by the corresponding weight value in the kernel and adding the results:
(0 x -1) + (0 x -1) + (0 x -1) +
(0 x -1) + (0 x 8) + (0 x -1) +
(0 x -1) + (0 x -1) + (255 x -1) = -255
- The result (-255) becomes the first value in a new array. Then we move the filter kernel one pixel to the right and repeat the operation.
- The resulting array represents a transformed image in which the filter has highlighted the edges of the shape, effectively altering the original image.
- Filters like the Laplace filter (used in the example) highlight edges because the weighted sum is large where pixel intensity changes sharply and near zero in uniform regions.
- Convolutional filtering enables various effects like blurring, sharpening, and color inversion by using different filter types
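The convolution walkthrough above can be sketched in NumPy. The `convolve2d` helper here is a hypothetical minimal implementation (no padding, so the output is smaller than the input; real libraries usually pad the image to preserve its size, and clip negative results into the 0-255 range):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over each patch and record the weighted sum."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image[r:r + kh, c:c + kw]
            out[r, c] = np.sum(patch * kernel)
    return out

# The grayscale image and Laplace kernel from the notes.
image = np.zeros((7, 7))
image[2:5, 2:5] = 255

laplace = np.array([[-1, -1, -1],
                    [-1,  8, -1],
                    [-1, -1, -1]])

result = convolve2d(image, laplace)
print(result[0, 0])  # -255.0, matching the worked example above
```

The top-left value is -255, exactly the weighted sum computed step by step in the text; the interior of the white square comes out to 0, so only the edges carry large values.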
- Computer vision aims to extract meaning or actionable insights from images by training models on large datasets.
- These models learn to recognize features and patterns in images, enabling them to understand and interpret visual data effectively.
- Convolutional Neural Networks (CNNs) are common machine learning model architectures for computer vision.
- CNNs use filters to extract numeric feature maps from images. These feature values are then fed into a deep learning model to generate label predictions.
- The following steps illustrate how a CNN for an image classification model works:
- Training data with known labels (e.g., 0: apple, 1: banana, 2: orange) is used to train the CNN.
- During training,
multiple filter layers extract featuresfrom images. Initially,filter kernels start with random weights, producing numeric arrays called feature maps.- Feature maps are flattened into a single-dimensional array of feature values.
- These feature values are inputted into a fully connected neural network.
- The output layer of the neural network uses a softmax function to produce probability values for each class (e.g., [0.2, 0.5, 0.3]).
- The model calculates
lossby comparing predicted and actual class scores.- Weights in the fully connected neural network and filter kernels in feature extraction layers are adjusted to minimize this loss.
- The training process iterates over multiple epochs until optimal weights are learned.
- Afterward, these weights are saved, enabling the model to predict labels for new images with unknown labels.
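The forward pass described above (feature map -> flatten -> fully connected layer -> softmax -> loss) can be sketched in NumPy. The feature map and weights here are random stand-ins for what convolution and training would actually produce:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 5x5 feature map produced by convolutional filtering.
feature_map = rng.random((5, 5))

# 1. Flatten the feature map into a one-dimensional feature vector.
features = feature_map.flatten()  # shape (25,)

# 2. Fully connected layer with (initially random) weights for 3 classes.
weights = rng.random((3, 25))
bias = rng.random(3)
logits = weights @ features + bias

# 3. Softmax turns the raw scores into class probabilities.
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()  # e.g. something like [0.2, 0.5, 0.3]

# 4. Cross-entropy loss against the true label (say, class 1: banana).
true_class = 1
loss = -np.log(probs[true_class])

print(probs.round(3))  # three probabilities that sum to 1
```

Training would repeat this forward pass over many images and epochs, using the loss to adjust `weights`, `bias`, and the filter kernels.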
- Multi-modal models are trained on captioned images without fixed labels.
- An image encoder extracts features from images based on pixel values
- Text embeddings from a language encoder are combined with image features.
- The model captures relationships between natural language token embeddings and image features.
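One way to picture that relationship: a multi-modal model maps images and text into a shared embedding space, where matching pairs score high on similarity. The embeddings below are hand-made illustrations, not outputs of a real encoder:

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity of two embedding vectors, ignoring their magnitudes."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical embeddings; in a real model these come from the image
# encoder and the language encoder respectively.
image_embedding = np.array([0.9, 0.1, 0.0])
captions = {
    "a photo of an apple": np.array([0.8, 0.2, 0.1]),
    "a photo of a banana": np.array([0.1, 0.9, 0.2]),
}

scores = {text: cosine_similarity(image_embedding, vec)
          for text, vec in captions.items()}
best = max(scores, key=scores.get)
print(best)  # the caption whose embedding best matches the image's
```

Ranking captions (or labels) by similarity to an image embedding is how such a model can classify images without fixed labels.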
- The Microsoft Florence model is just such a model. Florence serves as a foundation model for adaptive models that perform:
  - Image classification: Identifying the category to which an image belongs.
  - Object detection: Locating individual objects within an image.
  - Captioning: Generating appropriate descriptions of images.
  - Tagging: Compiling a list of relevant text tags for an image.
- Multi-modal models like Florence are at the cutting edge of computer vision and AI in general, and are expected to drive advances in the kinds of solutions that AI makes possible.
- Microsoft's Azure AI Vision service offers prebuilt and customizable computer vision models built on the Florence foundation model, providing diverse capabilities. It enables the rapid creation of sophisticated computer vision solutions.
- To use Azure AI Vision, first create a resource in your Azure subscription from the available resource types. You can use either of the following resource types:
  - Azure AI Vision: A specific resource for the Azure AI Vision service. Use this resource type if you don't intend to use any other Azure AI services, or if you want to track utilization and costs for your Azure AI Vision resource separately.
  - Azure AI services: A general resource that includes Azure AI Vision along with many other Azure AI services, such as Azure AI Language, Azure AI Custom Vision, Azure AI Translator, and others. Use this resource type if you plan to use multiple AI services and want to simplify administration and development.
- Azure AI Vision offers various image analysis capabilities, including:
- Optical character recognition (OCR) - extracting text from images.
- Generating captions and descriptions of images.
- Detection of thousands of common objects in images.
- Tagging visual features in images.
- If the built-in models aren't sufficient, you can train a custom model for image classification or object detection using pre-trained foundation models, requiring only a few training images.
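The resource creation described above can be sketched with the Azure CLI. The resource group, region, and resource names here are hypothetical placeholders; substitute your own (a config sketch, not a definitive setup script):

```shell
# Option 1: a dedicated Azure AI Vision resource.
az cognitiveservices account create \
  --name my-vision-resource \
  --resource-group my-resource-group \
  --kind ComputerVision \
  --sku S1 \
  --location eastus

# Option 2: a multi-service Azure AI services resource.
az cognitiveservices account create \
  --name my-ai-services-resource \
  --resource-group my-resource-group \
  --kind CognitiveServices \
  --sku S0 \
  --location eastus
```

Either resource exposes endpoint and key values that client applications use to call the service.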
Face detection and analysis uses AI algorithms to locate and analyze human faces in images or videos.
Face detection: Involves identifying the regions of an image that contain faces, typically by returning bounding box coordinates.
Face analysis: Facial features such as the nose, eyes, and lips can be used to train machine learning models that extract additional information from a face.
Facial recognition: A model trained on multiple images of an individual can learn to identify that person in new images.
Microsoft Azure offers several AI services for face detection and analysis:
  - Azure AI Vision: Provides face detection and basic analysis, like bounding box coordinates.
  - Azure AI Video Indexer: Detects and identifies faces in videos.
  - Azure AI Face: Offers the most comprehensive facial analysis, including detection, recognition, and analysis.
- Optical character recognition (OCR) extracts printed and handwritten text from images line-by-line and word-by-word.
