A dataset specifically designed for the optical inspection of textured surfaces was proposed by Wieler and Hahn (2007). They provide ten classes of artificially generated gray-scale textures with defects weakly annotated in the form of ellipses. Each class comprises 1000 defect-free texture patches for training and 150 defective patches for testing. The annotations, however, are coarse, and since the textures were generated by very similar texture models, the variance in appearance between the different textures is low. Furthermore, artificially generated datasets can only be seen as an approximation of the real world.
Convolutional Autoencoders (CAEs) (Goodfellow et al. 2016) are commonly used as a base architecture in unsupervised anomaly detection settings. They attempt to reconstruct defect-free training samples through a bottleneck (latent space). During testing, they should be unable to reproduce images that differ from the data that was observed during training. Anomalies are detected by a per-pixel comparison of the input with its reconstruction. Recently, Bergmann et al. (2019b) pointed out the disadvantages of per-pixel loss functions in autoencoding frameworks when used in anomaly segmentation scenarios and proposed to incorporate spatial information of local patch regions using structural similarity (Wang et al. 2004) for improved segmentation results.
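To make the detection rule concrete, the following is a minimal sketch of reconstruction-based anomaly scoring, assuming a trained model is available as a callable; the function name and array conventions are illustrative and not part of any cited work.

```python
import numpy as np

def anomaly_map_l2(image, autoencoder):
    """Per-pixel squared reconstruction error of a trained autoencoder.

    image: float array of shape (H, W, C) with values in [0, 1].
    autoencoder: callable mapping an image to its reconstruction
    (a hypothetical stand-in for any trained CAE).
    """
    reconstruction = autoencoder(image)
    # Anomalies are assumed to reconstruct poorly, yielding large residuals.
    return ((image - reconstruction) ** 2).sum(axis=-1)
```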
Napoletano et al. (2018) propose to use clustered feature descriptions obtained from the activations of a ResNet-18 (He et al. 2016) classification network pretrained on ImageNet to distinguish normal from anomalous data. We refer to their method as CNN Feature Dictionary. Training features are extracted from patches that are cropped at random locations from the input images, and their distribution is modeled with a K-Means classifier. Since feature extraction for all possible image patches quickly becomes prohibitively expensive and the capacity of K-Means classifiers is limited, the total number of available training features is typically heavily subsampled. This method achieves state-of-the-art results on the NanoTWICE dataset. Being designed for one-class classification, it only provides a binary decision on whether an input image contains an anomaly. In order to obtain a spatial anomaly map, the classifier must be evaluated at multiple image locations, ideally at every single pixel. This quickly becomes a performance bottleneck for large images. To keep runtimes practical, not every pixel location is evaluated, and the resulting anomaly maps are therefore coarse.
Anomaly maps are obtained by a per-pixel \(\ell ^2\)-comparison of the input image with the generated output. For all evaluated dataset categories, training, validation, and testing images are zoomed to size 256 \(\times \) 256 pixels. 50 000 training patches of size 64 \(\times \) 64 pixels are randomly cropped from the training images. During testing, a patchwise evaluation is performed with a horizontal and vertical stride of 64 pixels.
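This setup can be illustrated with a short NumPy sketch of the random patch extraction and the stride-64 test grid; the helper names are hypothetical and images are assumed to be arrays already zoomed to 256 \(\times \) 256.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_patches(images, n_patches=50_000, size=64):
    """Crop training patches at random locations from 256 x 256 images."""
    patches = []
    for _ in range(n_patches):
        img = images[rng.integers(len(images))]
        y = rng.integers(img.shape[0] - size + 1)
        x = rng.integers(img.shape[1] - size + 1)
        patches.append(img[y:y + size, x:x + size])
    return np.stack(patches)

def test_grid(image, size=64, stride=64):
    """Yield the non-overlapping 64 x 64 patches evaluated during testing."""
    for y in range(0, image.shape[0] - size + 1, stride):
        for x in range(0, image.shape[1] - size + 1, stride):
            yield (y, x), image[y:y + size, x:x + size]
```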
For the evaluation of the \(\ell ^2\)- and SSIM-autoencoder, we build on the same CAE architecture that was described by Bergmann et al. (2019b). They reconstruct patches of size 128 \(\times \) 128, employing either a per-pixel \(\ell ^2\) loss or a loss based on the structural similarity index (SSIM). We extend the architecture by an additional convolution layer to process images at resolution 256 \(\times \) 256. We find an SSIM window size of 11 \(\times \) 11 pixels to work well in our experiments. The latent space dimension is chosen to be 128. Larger latent space dimensions do not yield significant improvements in reconstruction quality while lower dimensions lead to degenerate reconstructions. Training is run for 100 epochs using the Adam optimizer with an initial learning rate of \(2 \times 10^{-4}\) and a batch size of 128.
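Since the full layer configuration is given by Bergmann et al. (2019b) rather than repeated here, the following PyTorch sketch only illustrates a CAE of this general shape together with the training hyperparameters stated above; the exact encoder and decoder layers are assumptions.

```python
import torch
import torch.nn as nn

# Illustrative CAE for 256 x 256 inputs with a 128-dimensional latent space.
# The precise architecture of Bergmann et al. (2019b) is not reproduced here;
# only the hyperparameters from the text are.
class CAE(nn.Module):
    def __init__(self, channels=3, latent_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, 32, 4, stride=2, padding=1), nn.ReLU(),       # 128
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),             # 64
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),            # 32
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(),           # 16
            nn.Conv2d(256, 256, 4, stride=2, padding=1), nn.ReLU(),           # 8
            nn.Conv2d(256, latent_dim, 8),                                    # 1x1 bottleneck
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 256, 8), nn.ReLU(),                # 8
            nn.ConvTranspose2d(256, 256, 4, stride=2, padding=1), nn.ReLU(),  # 16
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),  # 32
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 64
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),    # 128
            nn.ConvTranspose2d(32, channels, 4, stride=2, padding=1),
            nn.Sigmoid(),                                                     # 256
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = CAE()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)  # as stated above
# Training runs for 100 epochs with batch size 128 (data loader not shown):
# loss = ((model(batch) - batch) ** 2).mean()   # per-pixel l2 variant
```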
For the dataset objects, anomaly maps are generated by passing an image through the autoencoder and comparing the reconstruction with its respective input, using either per-pixel \(\ell ^2\) comparisons or SSIM. For textures, we reconstruct patches at a stride of 64 pixels and average the resulting anomaly maps. Since SSIM does not operate on color images, all images are converted to grayscale for the training and evaluation of the SSIM-autoencoder.
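For the SSIM variant, the per-pixel comparison can be sketched with scikit-image's structural_similarity, assuming grayscale float images in [0, 1]; the 11 \(\times \) 11 window matches the setting above.

```python
import numpy as np
from skimage.metrics import structural_similarity

def anomaly_map_ssim(image, reconstruction, win_size=11):
    """Per-pixel anomaly map from the structural similarity index.

    Both inputs are grayscale float arrays in [0, 1]. Regions that are
    structurally dissimilar to their reconstruction receive high scores.
    """
    _, ssim_map = structural_similarity(
        image, reconstruction, win_size=win_size, data_range=1.0, full=True
    )
    return 1.0 - ssim_map  # low similarity -> high anomaly score
```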
Feature Dictionary We use our own implementation of the CNN Feature Dictionary proposed by Napoletano et al. (2018), which extracts features from the 512-dimensional average pooling layer of a ResNet-18 pretrained on ImageNet. Principal Component Analysis (PCA) is performed on the extracted features to explain 95% of the variance. K-Means is run with 50 cluster centers, and the nearest descriptor to each center is stored as a dictionary vector. We extract 100 000 patches of size 128 \(\times \) 128 for both the textures and objects. All images are evaluated at their original resolution. A stride of 8 pixels is chosen to create a spatially resolved anomaly map. For grayscale images, the single channel is replicated three times for feature extraction, since the ResNet-18 used operates on three-channel input images.
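Under the stated settings, this pipeline can be approximated with torchvision and scikit-learn as follows; this is an illustrative re-implementation, not the authors' code, and the patch tensor train_patches is assumed to be prepared elsewhere.

```python
import numpy as np
import torch
from torchvision.models import resnet18
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Feature extractor: ResNet-18 pretrained on ImageNet, truncated after the
# global average pooling layer (512-dimensional output).
backbone = resnet18(weights="IMAGENET1K_V1")
extractor = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()

@torch.no_grad()
def features(patches):
    """patches: float tensor (N, 3, 128, 128) -> (N, 512) descriptors."""
    return extractor(patches).flatten(1).numpy()

# Training: descriptors of 100 000 random patches (loader assumed to exist).
train_feats = features(train_patches)
pca = PCA(n_components=0.95).fit(train_feats)    # keep 95% of the variance
reduced = pca.transform(train_feats)
kmeans = KMeans(n_clusters=50, n_init=10).fit(reduced)
# Dictionary: the training descriptor closest to each cluster center.
dictionary = np.stack([
    reduced[np.argmin(np.linalg.norm(reduced - c, axis=1))]
    for c in kmeans.cluster_centers_
])

def anomaly_score(patch):
    """Distance of a test patch descriptor to its nearest dictionary vector.

    Evaluating this at a stride of 8 pixels over the image yields the
    spatially resolved anomaly map described above.
    """
    f = pca.transform(features(patch[None]))
    return np.linalg.norm(dictionary - f, axis=1).min()
```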
GMM-Based Texture Inspection Model For the Texture Inspection Model (Böttger and Ulrich 2016), an optimized implementation is available in the HALCON machine vision library. Images are converted to grayscale, zoomed to an input size of 400 \(\times \) 400 pixels, and a four-level image pyramid is constructed for training and evaluation. On each pyramid level, a separate GMM with a dense covariance matrix is trained. The patch size of the examined texture regions on each pyramid level is set to 7 \(\times \) 7 pixels. We use a maximum of 50 randomly selected images from the original training set to train the Texture Inspection Model. Anomaly maps for each pyramid level are obtained by evaluating the negative log-likelihood of each image pixel under the corresponding trained GMM. We normalize the anomaly scores of each level on the validation set such that their mean is 0 and their standard deviation is 1. The levels are then combined into a single anomaly map by averaging the four normalized anomaly scores at each pixel position.
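A rough scikit-learn sketch of this scheme is given below; the pyramid construction, the number of mixture components, and all function names are assumptions, since the optimized HALCON implementation is not reproduced here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def pyramid(image, levels=4):
    """Simple image pyramid via 2 x 2 mean pooling (illustrative)."""
    pyr = [image]
    for _ in range(levels - 1):
        img = pyr[-1]
        h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
        pyr.append(img[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    return pyr

def patches_7x7(image):
    """All 7 x 7 patches of a grayscale image, flattened to 49-d vectors."""
    h, w = image.shape
    return np.stack([
        image[y:y + 7, x:x + 7].ravel()
        for y in range(h - 6) for x in range(w - 6)
    ])

def train_level_gmms(train_images, levels=4, components=5):
    """One GMM with full ("dense") covariance per pyramid level.

    The component count is not specified in the text and is a placeholder.
    """
    gmms = []
    for lvl in range(levels):
        data = np.concatenate([patches_7x7(pyramid(im, levels)[lvl])
                               for im in train_images])
        gmms.append(GaussianMixture(components, covariance_type="full").fit(data))
    return gmms

def level_anomaly_map(image, gmm):
    """Negative log-likelihood of every 7 x 7 patch under one level's GMM.

    The resulting maps are normalized to zero mean and unit variance on a
    validation set and averaged across levels (not shown here).
    """
    h, w = image.shape
    nll = -gmm.score_samples(patches_7x7(image))
    return nll.reshape(h - 6, w - 6)
```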
Feature Dictionary The CNN Feature Dictionary was originally designed to model the distribution of repetitive texture patches. However, it also yields promising results for anomaly segmentation on objects when anomalies manifest themselves in features that deviate strongly from the local descriptors of the training data manifold. For example, the small crack on the capsule is well detected. However, since the method randomly subsamples training patches, it yields increased anomaly scores in regions that are underrepresented in the training set, e.g., on the imprint on the left half of the capsule. Additionally, due to the limited capacity of K-Means, the training feature distribution is often insufficiently well approximated. The method does not capture the global context of an object. Hence, it fails to detect the anomaly on the cable cross section, where the inner insulation on the bottom left shows the wrong color, as it is brown instead of blue.