Finding problems using unsupervised image categorization
Needle in a Haystack
The most tedious part of supervised machine learning is providing sufficient supervision. However, if the samples come from a restricted sample space, unsupervised learning might be fine for the task.
In any classification project, it is of course possible to have someone review a number of images and build a classification list. However, when entering a new domain, it can be difficult to identify domain knowledge experts or to develop a ground truth for classification on which all experts can agree. This is true whether you are looking at the backside of a silicon wafer for the first time or trying to identify the presence of volcanoes in radar images of the surface of Venus [1].
Alternatively, you can bypass all these problems and kick-start a classification project with unsupervised machine learning. Unsupervised machine learning is particularly applicable to environments where the typical images are largely identical, much like the pieces of hay in the haystack that you need to ignore when looking for needles.
In this article, I examine the potential for using unsupervised machine learning in Python (version 3.8.3 64-bit) to identify image categories for a restricted image space without resorting to training neural networks. This technique follows from the long tradition within engineering of finding the simplest solution to a problem. In this particular case, the solution relies upon the ability of the functions within the OpenCV and mahotas computer vision libraries to generate parameters for the texture and form within an image.
As an example, I'll look at the images obtained from the bevel of silicon wafers during the semiconductor manufacturing process. As part of the quality control procedure, a series of photos is taken of each wafer. Ideally, the wafers are normal and the images are identical, but occasionally a dissimilar photo can reveal a manufacturing problem that generates defects on the affected wafer. Of course, you could train a human to wade through all these photos and look for problems, which would certainly be thorough, but it would take a lot of time and introduce the possibility of human error, especially as tedium sets in. You could also train a neural network to look for dissimilar images, but neural networks need large amounts of compute resources, not to mention the expertise required for programming and training, as well as a sufficiently broad library of examples for each classification.
A simpler solution is to check the images using unsupervised data analysis techniques. The first step is to derive digital parameters for each of the photos for easier comparison. In this case, I used the Hu Moments and Haralick texture features, which are available through the cv2 and mahotas computer vision libraries. Hu Moments is an image descriptor that characterizes the shape of an object within an image. The Haralick texture features unsurprisingly describe the texture.
I then use principal component analysis (PCA) to reduce the data's dimensionality while still preserving as much of the variance as possible. The points are then grouped using a density-based clustering algorithm to identify the main categories of images as well as abnormal images. I relied upon the Seaborn and Matplotlib libraries to generate the visualizations.
PCA
PCA, a machine learning technique used for dimensionality reduction, is sometimes also described as a form of feature extraction. PCA is most appropriate for datasets with an unwieldy number of parameters but no classification labels, which is what allows PCA to be used in an unsupervised context. PCA's goal is to preserve the salient information within a dataset while generating a more manageable number of virtual parameters, each constructed as a combination of the original parameters. The number of these virtual parameters is configurable, depending on the amount of variance you want to preserve, but typically two are generated because they are convenient to visualize.
The first principal component usually explains 75-80 percent of the total variance, and the second can be expected to represent a further 12-20 percent. PCA can also be described as finding new projection axes within the parameter space that maximize the variance of the data projected onto them; these axes are the principal components.
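To make this concrete, here is a minimal sketch of how the reduction to two components might look with scikit-learn (the library choice and variable names are my assumptions, not taken from the article's own listings); it assumes a numeric feature table such as the imgFeatures DataFrame assembled later in Listing 4:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import seaborn as sns
import matplotlib.pyplot as plt

# Standardize the Haralick/Hu feature columns, as PCA is sensitive to scale
scaled = StandardScaler().fit_transform(imgFeatures)

# Reduce to two virtual parameters for convenient visualization
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)
print("Explained variance ratio:", pca.explained_variance_ratio_)

# Scatterplot of the two principal components
sns.scatterplot(x=components[:, 0], y=components[:, 1])
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()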
Once the clusters have been defined, it is then possible to display sample images from the boundary region of each cluster as examples of the range of images that are typical within each cluster. Defining the boundaries allows you to make a convenient comparison of the typical images in each cluster and to clearly identify how effective the method is for unsupervised image categorization.
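As a rough sketch of that workflow (DBSCAN and the distance-from-center heuristic are assumptions here, not necessarily the exact method used later in the article), the clusters and their outermost members could be found like this:

import numpy as np
from sklearn.cluster import DBSCAN

# Density-based clustering of the two principal components;
# eps and min_samples must be tuned for the dataset at hand
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(components)

for lbl in sorted(set(labels) - {-1}):      # label -1 marks noise, i.e., abnormal images
    members = components[labels == lbl]
    centre = members.mean(axis=0)
    dists = np.linalg.norm(members - centre, axis=1)
    edge_rows = dists.argsort()[-3:]        # three samples nearest the cluster boundary
    print(f"Cluster {lbl}: {len(members)} images, boundary samples {edge_rows}")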
Background
The bevel inspection tools generate three distinct image types: specular, phase, and scatter. A single image is generated for the entire wafer bevel. However, to simplify the viewing, the images are segmented into 36 equal parts, each representing 10 degrees of the wafer bevel. Figure 1 shows five examples of each of these image types.
Goal
These images may contain localized defects (see the third row in Figure 1). However, these defects are not the primary focus of this work. Instead, the main concern is the general structure and texture within the images and how these may change on and between individual wafers. Artifacts of interest can clearly be seen in the images shown in the third and fifth columns in Figure 1. The importance of these artifacts requires input from a domain knowledge expert; however, you can attempt to segregate these from the less noteworthy images.
Data Collection
The identification and collection of the relevant data from a primary source is often given short shrift, even though it could be considered the most essential part of data wrangling. The code snippet in Listing 1 demonstrates a succinct way to connect to an Oracle database [2] and use a cursor to extract your username. This simple query (i.e., select user from dual) needs to be replaced by an appropriate query. (I did not include the actual queries that were used here because they are not generally applicable.) The login credentials and database identification strings were all stored separately in a config module. (You could use a password manager such as PyKeePass for storing login information.)
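For illustration only, a hypothetical local_config module could fetch those credentials from a KeePass database with PyKeePass; the file name, entry title, and DSN below are invented, not taken from the article:

# hypothetical local_config.py -- all names and paths here are illustrative
from pykeepass import PyKeePass

kp = PyKeePass('secrets.kdbx', password='master-passphrase')
entry = kp.find_entries(title='Oracle YMS', first=True)

username = entry.username
password = entry.password
dsn = 'dbhost.example.com/orclpdb1'   # Oracle connect string (made up)
encoding = 'UTF-8'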
Listing 1
Querying Username from the Database
import cx_Oracle
import local_config   # credentials and DSN are stored separately (see text)

try:
    with cx_Oracle.connect(local_config.username,
                           local_config.password,
                           local_config.dsn,
                           encoding=local_config.encoding) as conn:
        with conn.cursor() as cursor:
            # Now execute the SQL query
            cursor.execute("select user from dual")
            print(cursor.fetchmany(20))
except cx_Oracle.DatabaseError as e:
    print("There was a problem with the YMS query ", e)
The data returned by a query can be conveniently imported into a pandas DataFrame with the following command:
my_df = pd.DataFrame(cursor.fetchall())
The essential pandas library [3] provides access to many data structures and methods that greatly simplify manipulation. The DataFrame and its methods are primary among these and should be familiar to users of the R programming language.
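If you also want meaningful column names, the cursor's description attribute can supply them, as in this small sketch (not part of the original listings):

# Name the DataFrame columns after the columns returned by the query
cols = [col[0] for col in cursor.description]
my_df = pd.DataFrame(cursor.fetchall(), columns=cols)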
To avoid storing all images on the local machine, a pseudo data pipeline was created to pull down each image via a URL, extract the relevant features, and then discard the image. The get_URL_img function in Listing 2 uses the Python Image Library (PIL) [4] to load each image from its URL directly into memory.
Listing 2
Loading an Image into Memory from a URL
import requests
from PIL import Image
from io import BytesIO

def get_URL_img(URL):
    # Create a Session to contain your basic AUTH and persist your cookies
    authed_session = requests.Session()
    authed_session.auth = (local_config.WEBUSERNAME, local_config.WEBPASSWORD)
    USER_AGENT = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"
    authed_session.headers.update({'User-Agent': USER_AGENT})
    # Fetch the actual data
    fetched_data = authed_session.get(URL)
    # Convert into an image
    return Image.open(BytesIO(fetched_data.content))
The functions shown in Listing 3 extract the Hu Moments (extract_hu_moments()) and Haralick texture features (extract_texture()) from a specified image. The underlying routines come from the cv2 [5] (the import name for opencv-python) and mahotas [6] libraries, both of which require NumPy [7].
Listing 3
Feature Extraction Functions
import numpy as np
import cv2
import mahotas as mt

# Function to extract Hu Moments
def extract_hu_moments(image):
    feature = cv2.HuMoments(cv2.moments(np.array(image))).flatten()
    return feature

# Function to extract Haralick texture features
def extract_texture(image):
    # Calculate Haralick texture features for 4 types of adjacency
    textures = mt.features.haralick(np.array(image))
    # Average over the four adjacency directions
    ht_mean = textures.mean(axis=0)
    return ht_mean
Listing 4 then applies the extract functions to each image in turn, using various pandas methods [8] to combine all results into a single DataFrame (i.e., imgFeatures).
Listing 4
Extracting Features from All Images
import pandas as pd

imgHaralickFeatures = pd.DataFrame()
imgHuMoments = pd.DataFrame()

for idx, wafstepIMG in wafStepIMGList_df.iterrows():
    imgURI = local_config.imgServerRoot + wafstepIMG['FILENAME']
    print("This URI:", imgURI, wafstepIMG['STEP_ID'])
    img = get_URL_img(imgURI)
    if img.size == (1820, 1002):
        print("Correct image size: ", img.size)
        # Extract and store Haralick features from bevel image
        har = extract_texture(img)
        if imgHaralickFeatures.empty:
            imgHaralickFeatures = pd.DataFrame(har.reshape(-1, len(har)),
                                               [wafstepIMG['IMAGE_ID']]
                                               ).add_prefix('haralick_')
        else:
            imgHaralickFeatures = imgHaralickFeatures.append(
                pd.DataFrame(har.reshape(-1, len(har)),
                             [wafstepIMG['IMAGE_ID']]
                             ).add_prefix('haralick_'))
        # Extract and store Hu Moments from bevel image
        hum = extract_hu_moments(img)
        if imgHuMoments.empty:
            imgHuMoments = pd.DataFrame(hum.reshape(-1, len(hum)),
                                        [wafstepIMG['IMAGE_ID']]
                                        ).add_prefix('hu_')
        else:
            imgHuMoments = imgHuMoments.append(
                pd.DataFrame(hum.reshape(-1, len(hum)),
                             [wafstepIMG['IMAGE_ID']]
                             ).add_prefix('hu_'))
    else:
        print("Wrong image size: ", img.size)

imgFeatures = imgHaralickFeatures.join(imgHuMoments, how='outer')
The resulting DataFrames can be pickled (serialized) for later analysis using the following command:
my_df.to_pickle("fileName.pkl")
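The pickled features can then be reloaded in a later analysis session with:
my_df = pd.read_pickle("fileName.pkl")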