《1. Introduction》

1. Introduction

Much of the critical infrastructure that serves society today, including bridges, dams, highways, lifeline systems, and buildings, was erected several decades ago and is well past its design life. For example, in the United States, according to the 2017 Infrastructure Report Card published by the American Society of Civil Engineers, there are over 56 000 structurally deficient bridges, requiring a massive 123 billion USD for rehabilitation [1]. The economic implications of repair necessitate systematic prioritization achieved through a careful understanding of the current state of infrastructure.

Civil infrastructure condition assessment is performed by leveraging the information obtained by inspection and/or monitoring processes. Traditional techniques to assess the condition of civil infrastructure typically involve visual inspection by trained inspectors, combined with relevant decision-making criteria (e.g., ATC-20 [2], national bridge inspection standards [3]). However, such inspection can be time-consuming, laborious, expensive, and/or dangerous (Fig. 1) [4]. Monitoring can be used to obtain a quantitative understanding of the current state of the structure through measurement of physical quantities, such as accelerations, strains, and/or displacements; such approaches offer the ability to continuously observe structural integrity in real time, with the goal of enhanced safety and reliability, and reduced maintenance and inspection costs [5–8]. While these methods have been shown to produce reliable data, they typically have limited spatial resolution or require the installation of dense sensor arrays. Another issue is that once installed, access to the sensors is often limited, making regular system maintenance challenging. If only occasional monitoring is required, the installation of contact sensors is difficult and time-consuming. To address some of these problems, improved inspection and monitoring approaches with less human intervention, lower cost, and higher spatial resolution must be developed and tested to advance and realize the full benefits of automated civil infrastructure condition assessment.

《Fig. 1》

Fig. 1. Inspectors from the US Army Corps of Engineers rappel down a Tainter gate to inspect for surface damage [4].

Computer vision techniques have been recognized in the civil engineering field as a key component of improved inspection and monitoring. Images and videos are the two major modes of data analyzed by computer vision techniques. Images capture visual information similar to that obtained by human inspectors; because of this similarity, computer implementations of structural inspection analogous to visual inspection by human inspectors are anticipated. In addition, images can encode information from the entire field of view in a non-contact manner, potentially addressing the challenges of monitoring with contact sensors. Videos are sequences of images in which the extra dimension of time provides important information for both inspection and monitoring applications, ranging from the assimilation of context when images are collected from multiple views, to the dynamic response of the structure when high sampling rates are used. A significant amount of research in the civil engineering community has focused on developing and adapting computer vision techniques for inspection and monitoring tasks. Moreover, such vision-based approaches, used in conjunction with cameras and unmanned aerial vehicles (UAVs), offer the potential for rapid and automated inspection and monitoring for civil infrastructure condition assessment.

The present paper reviews recent research on vision-based condition assessment of civil infrastructure. To put the research described in this paper in its proper technical perspective, a brief history of computer vision research is first provided in Section 2. Section 3 then reviews in detail several recent efforts dealing with inspection applications of computer vision techniques for civil infrastructure assessment. Section 4 focuses on monitoring applications, and Section 5 outlines challenges for the realization of automated structural inspection and monitoring. Section 6 discusses ongoing work by the authors toward the goal of automated inspections. Finally, Section 7 provides conclusions.

《2. A brief history of computer vision research》

2. A brief history of computer vision research

Computer vision is an interdisciplinary scientific field concerned with the automatic extraction of useful information from image data in order to understand or represent the underlying physical world, either qualitatively or quantitatively. Computer vision methods can be used to automate tasks typically performed by the human visual system. Initial efforts toward applying computer vision methods began in the 1960s and sought to extract shape information about objects using edges and primitive shapes (e.g., boxes) [9]. Computer vision methods began to consider more complex perception problems with the development of different representations of image patterns. Optical character recognition (OCR) was of major interest, as characters and digits in any font needed to be recognized for increased automation in the United States Postal Service [10], license plate recognition [11], and so forth. Facial recognition, in which the input image is evaluated in a feature space obtained by applying hand-crafted or learned filters with the aim of detecting patterns representing human faces, has also been an active area of research [12,13]. Other object detection problems, such as pedestrian detection and car detection, have shown significant improvement in recent years (e.g., Ref. [14]), motivated by increased demand for surveillance and traffic monitoring. Computer vision techniques have also been used in sports broadcasting for applications such as ball tracking and virtual replays [15].

Recent advances in computer vision techniques have largely been fueled through end-to-end learning using artificial neural networks (ANNs) and convolutional neural networks (CNNs). In ANNs and CNNs, a complex input–output relation of data is approximated by a parametrized nonlinear function defined using units called nodes [16]. The output of each ANN node is computed by the following:

$$ y_n = \sigma\left(\mathbf{w}_n^{\top}\mathbf{x}_n + b_n\right) $$

where $\mathbf{x}_n$ is the vector of inputs to node $n$; $y_n$ is the scalar output of the node; and $\mathbf{w}_n$ and $b_n$ are the weight and bias parameters, respectively. $\sigma(\cdot)$ is a nonlinear activation function, such as the sigmoid function or the rectifier (rectified linear unit, or ReLU [17]). Similarly, for CNNs, each node applies convolution to its input feature map, followed by a nonlinear activation function:

$$ \mathbf{Y}_n = \sigma\left(\mathbf{W}_n * \mathbf{X}_n + b_n\right) $$

where $*$ denotes convolution, $\mathbf{X}_n$ is the input feature map, and $\mathbf{W}_n$ is the convolution kernel. The final layer of a CNN is typically a fully connected layer (FCL), which has dense connections to the output, similar to the layers of an ANN. CNNs are particularly effective for image and video data, because recognition using CNNs is robust to translation while requiring a limited number of parameters. By increasing the number of interconnected nodes, an arbitrarily complex parametrization of the input–output relation can be realized (e.g., multilayer perceptrons with many hidden layers and/or many nodes in each layer, deep convolutional neural networks (DCNNs), etc.). The parameters of ANNs/CNNs are optimized using a collection of input and output data (training data) (e.g., Refs. [18,19]).
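To make these node computations concrete, the following is a minimal NumPy sketch of one ANN node and one single-channel CNN node as defined above; the array sizes, the sigmoid/ReLU pairing, and the loop-based convolution are illustrative assumptions, not any particular library's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ann_node(x, w, b):
    # y_n = sigma(w_n^T x_n + b_n): weighted sum followed by activation
    return sigmoid(np.dot(w, x) + b)

def cnn_node(X, W, b):
    # Y_n = sigma(W_n * X_n + b_n): 2D "valid" convolution (without the
    # kernel flip, as is conventional in deep learning) followed by ReLU
    kh, kw = W.shape
    H, Wd = X.shape
    out = np.zeros((H - kh + 1, Wd - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(X[i:i + kh, j:j + kw] * W) + b
    return np.maximum(out, 0.0)  # ReLU activation

x = np.random.rand(8)                       # input vector to one ANN node
print(ann_node(x, np.random.rand(8), 0.1))  # scalar output y_n
X = np.random.rand(28, 28)                  # e.g., an MNIST-sized image
print(cnn_node(X, np.random.rand(3, 3), 0.0).shape)  # (26, 26) feature map
```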

These algorithms have achieved remarkable success in building perception systems for highly complex visual problems. CNNs have achieved more than 99.5% accuracy on the Modified National Institute of Standards and Technology (MNIST) handwritten digit classification problem (Fig. 2(a)) [20]. Moreover, state-of-the-art CNN architectures have achieved less than 5% top-five error (the fraction of images for which the true class is not among the five highest-scoring classes) [21] on the 1000-class ImageNet classification problem (Fig. 2(b)) [22].

《Fig. 2》

Fig. 2. Popular image classification datasets. (a) Example images from the MNIST dataset [20]; (b) ImageNet example images, visualized by t-distributed stochastic neighbor embedding (t-SNE) [22].

Use of CNNs is not limited to image classification (i.e., inferring a single label per image). A DCNN applies multiple nonlinear filters and computes a map of filter responses (F in Fig. 3, which is called a “feature map”). Instead of using all filter responses simultaneously to get a per-image class (upper flow in Fig. 3), filter responses at each location in the map can be used separately to extract information about both object categories and their locations. Using the feature map, semantic segmentation algorithms assign an appropriate label to each pixel of the image [23–26]. Object detection algorithms use the feature map to detect and localize objects of interest, typically by drawing their bounding boxes [27–31]. Instance-level segmentation algorithms [32–34] further process the feature map to differentiate each instance of an object (e.g., assigning a separate label to each person in an image, instead of assigning the same label to all people in the input image). When dealing with video data, additional temporal information can also be used to conduct spatiotemporal analysis [35–37] for segmentation.
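As a concrete illustration of per-pixel inference, the sketch below runs a pretrained FCN from the torchvision library to produce a label map at the input resolution; the model choice and the file name scene.jpg are assumptions for illustration, not tied to the cited studies.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Pretrained fully convolutional network (ResNet-50 backbone)
model = models.segmentation.fcn_resnet50(pretrained=True).eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open("scene.jpg").convert("RGB")   # hypothetical input image
batch = preprocess(img).unsqueeze(0)           # shape: (1, 3, H, W)

with torch.no_grad():
    scores = model(batch)["out"]               # per-pixel class scores (1, C, H, W)
labels = scores.argmax(dim=1).squeeze(0)       # (H, W): one class label per pixel
```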

《Fig. 3》

Fig. 3. Fully convolutional neural networks (FCNs) [23].

The Achilles heel of supervised learning techniques is the need for high-quality labeled data (i.e., images in which the objects are already identified) for training purposes. While many software applications have been created to help ease the labeling process (e.g., Refs. [38,39]), manual labeling still remains a very cumbersome task. Weakly supervised training has been proposed to perform object detection and localization tasks without pixel-level or object-level labeling of images [40,41]; here, a CNN is trained on image-wise labels to obtain the object category and approximate location in the image.

Unsupervised learning techniques further reduce the need for labeled data by identifying the underlying probabilistic structure in the observed data. For example, clustering algorithms (e.g., the k-means algorithm [11]) assume that the data (e.g., image patches) are generated by multiple sources (e.g., different material types), and allocate each data sample to one of the sources based on the maximum likelihood (ML) framework. DeGol et al. [42], for instance, used the k-means algorithm to perform material recognition for imaged surfaces. More complex probabilistic structures can be extracted by fitting parametrized probabilistic models to the observed data (e.g., Gaussian mixture models (GMMs) [16], Boltzmann machines [43,44]). In the image processing context, CNN-based architectures for unsupervised learning have been actively investigated, such as auto-encoders [45–47] and generative adversarial networks (GANs) [48–50]. These methods can automatically learn a compact representation of the input image and/or an image recovery/generation process from compact representations, without manually labeled ground truth. A thorough and concise review of different supervised and unsupervised learning algorithms can be found in Ref. [51].
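A minimal sketch of the clustering idea follows, assuming simple per-patch color statistics as features (the cited work, e.g., DeGol et al. [42], uses richer material descriptors); the patch size and number of sources are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_patches(image, patch=16, n_sources=3):
    """Group image patches by color statistics, unsupervised (k-means)."""
    H, W, _ = image.shape                      # image: (H, W, 3) RGB array
    feats, coords = [], []
    for i in range(0, H - patch, patch):
        for j in range(0, W - patch, patch):
            p = image[i:i + patch, j:j + patch].reshape(-1, 3)
            # Feature vector: per-channel mean and standard deviation
            feats.append(np.hstack([p.mean(axis=0), p.std(axis=0)]))
            coords.append((i, j))
    labels = KMeans(n_clusters=n_sources, n_init=10).fit_predict(np.array(feats))
    return labels, coords   # each patch assigned to one presumed source
```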

Another set of algorithms that have spurred significant advances in computer vision and artificial intelligence (AI) across many applications are optical flow techniques. Optical flow estimates a motion field through pixel correspondences between two image frames. Four main classes of algorithms can be used to compute optical flow: ① differential methods, ② region matching, ③ energy methods, and ④ phase-based techniques; details and references can be found in Ref. [52]. Optical flow has wide-ranging applications in processing video data, from video compression [53] and video segmentation [54] to motion magnification [55] and vision-based navigation of UAVs [56].
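For instance, a dense differential optical flow field between two frames can be computed with OpenCV's Farneback implementation, as in the sketch below; the frame file names and parameter values are placeholders.

```python
import cv2
import numpy as np

prev = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)  # hypothetical frames
curr = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# Dense differential optical flow (Farneback): one (dx, dy) vector per pixel
flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                    pyr_scale=0.5, levels=3, winsize=15,
                                    iterations=3, poly_n=5, poly_sigma=1.2,
                                    flags=0)
magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
print("median pixel motion:", np.median(magnitude))
```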

With these advances, computer vision techniques have been used to realize a wide variety of cutting-edge applications. For example, computer vision techniques are used in self-driving cars (Fig. 4) [57–59] to identify and react to potential risks encountered during driving. Accurate face recognition algorithms empower social media [60] and are also used in surveillance applications (e.g., law enforcement in airports [61]). Other successful applications include automated urban mapping [62] and enhanced medical imaging [63]. The significant improvements and successful applications of computer vision techniques in many fields provide increasing motivation for researchers to develop computer vision solutions to civil engineering problems. Indeed, using computer vision is a natural step toward improved monitoring and inspection of civil infrastructure. With this brief history as background, the following sections describe research efforts to adapt and further develop computer vision techniques for the inspection and monitoring of civil infrastructure.

《Fig. 4》

Fig. 4. A self-driving car system by Waymo [58,59]. All image copyrights belong to Waymo Inc.; image used under 17 US Code § 107 (“Limitations on exclusive rights: fair use”).

《3. Inspection applications》

3. Inspection applications

Researchers frequently envision an automated inspection framework that consists of two main steps: ① utilizing UAVs for remote automated data acquisition; and ② data processing and inspection using computer vision techniques. Intelligent UAVs are no longer a thing of the future, and the rapid growth in the drone industry over the last few years has made UAVs a viable option for data acquisition. Indeed, UAVs are being deployed by several federal and state agencies, as well as other research organizations in the United States (e.g., Minnesota Department of Transportation [64,65], Florida Department of Transportation [66], University of Florida [67], Michigan Department of Transportation [68], South Dakota State University [69]). These efforts have primarily focused on taking photographs and videos that are used for onsite evaluation or subsequent virtual inspections by engineers. However, automatically and robustly converting images or video data into actionable information remains challenging. Toward this goal, the first major subsection below reviews the literature on damage detection, and the second reviews structural component recognition. The third major subsection briefly reviews a demonstration that combines both these aspects: damage detection with structure-level consistency.

《3.1. Damage detection》

3.1. Damage detection

Automated damage detection is a crucial component of any automated or semi-automated inspection system. When characterized by the ratio of pixels representing damage to those representing the undamaged portion of the structure’s surface, the presence of defects in images of a structure is a relatively rare occurrence. Thus, the detection of visual defects with high precision and recall is a challenging task. This problem is further complicated by the presence of damage-like features (e.g., dark edges such as a groove can be mistaken for a crack). As described below, a great deal of research has been devoted to developing methods and techniques to reliably identify different visual defects, including concrete cracks, concrete spalling and delamination, fatigue cracks, steel corrosion, and asphalt cracks. Three different approaches for damage detection are discussed below: ① heuristic feature-extraction methods, ② deep learning-based damage detection, and ③ change detection.

3.1.1. Heuristic feature-extraction methods

Researchers have developed different heuristic methods for damage detection using image data. In principle, these methods work by applying a threshold or a machine learning classifier to the output of a hand-crafted filter for the particular damage type (DT) of interest. This section describes some of the key DTs for which heuristic feature-extraction methods have been developed.

(1) Concrete cracks. Much of the early work on vision-based damage detection focused on identifying concrete cracks based on heuristic filters (e.g., Refs. [70–80]). Edge detection filters were the first type of heuristics to be used (e.g., Ref. [70]). An early survey of approaches can be found in Ref. [71]. Jahanshahi and Masri [72] used morphological features, together with classifiers (neural networks and support vector machines), to identify cracks of different thicknesses. The results from this study are presented in Fig. 5 [72,81], where the first column shows the original images used in the study and the subsequent columns show the results from the application of the bottom-hat method, the Canny method, and the algorithm from Ref. [72]. The same paper also puts forth a method for quantifying crack thickness by identifying the centerline of each crack and computing the distance to the edges. Nishikawa et al. [74] proposed multiple sequential image filtering for crack detection and property estimation. Other researchers have also developed methods to estimate the properties of concrete cracks. Liu et al. [79] proposed a method for automated crack assessment using adaptive image processing, in which a median filter was used to decompose a crack into its skeleton and edges. Depth and three-dimensional (3D) information was also incorporated to conduct quantitative damage evaluation in Refs. [80,81]. Erkal and Hajjar [82] developed and evaluated a clustering process to automatically classify defects such as cracks, corrosion, ruptures, and spalling in colorized laser scan data using surface-normal-based damage detection. Binarization is a step typically employed in many of the crack-detection pipelines discussed here; Kim et al. [83] compared different binarization methods for crack detection. These methods have been applied to a variety of civil infrastructure, including bridges (e.g., Refs. [84,85]), tunnel linings (e.g., Ref. [76]), and post-earthquake building assessment (e.g., Ref. [86]).
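A minimal pipeline in the spirit of the heuristic methods above, combining bottom-hat filtering, Otsu binarization, and a distance-transform width estimate, is sketched below; this is a generic baseline under assumed thresholds, not the exact algorithm of any cited reference.

```python
import cv2
import numpy as np

img = cv2.imread("concrete.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical image

# 1) Bottom-hat filtering emphasizes thin dark features (cracks) against a
#    brighter, slowly varying background
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (15, 15))
bottom_hat = cv2.morphologyEx(img, cv2.MORPH_BLACKHAT, kernel)

# 2) Binarization: Otsu's threshold separates crack from background pixels
_, binary = cv2.threshold(bottom_hat, 0, 255,
                          cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# 3) Clean-up: discard small connected components (noise, surface texture)
n, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
mask = np.zeros_like(binary)
for k in range(1, n):
    if stats[k, cv2.CC_STAT_AREA] > 50:      # illustrative area threshold
        mask[labels == k] = 255

# 4) Rough thickness estimate: the distance transform gives the half-width
#    of the crack along its centerline
dist = cv2.distanceTransform(mask, cv2.DIST_L2, 3)
print("max crack width (px):", 2 * dist.max())
```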

《Fig. 5》

Fig. 5. A comparison of different crack detection methods conducted in Ref. [81].

(2) Concrete spalling. Methods have also been proposed to identify other defects in concrete, such as spalling. A novel orthogonal transformation approach combined with a bridge condition index was used by Adhikari et al. [87] to quantify degradation and subsequently map to condition ratings. The authors were able to achieve a reasonable accuracy of 85% for the detection of spalling in their dataset, but were unable to address situations in which both cracks and spalling were present. Paal et al. [88] employed a combination of segmentation, template matching, and morphological preprocessing, for both spall detection and concrete column assessment.

(3) Fatigue cracks in steel. Fatigue cracks are a critical problem for steel deck bridges because they can significantly shorten the lifespan of a structure. However, research on steel fatigue crack detection in civil infrastructure has been fairly limited. Yeum and Dyke [89] manually created defects on a steel beam to give the appearance of fatigue cracks (Fig. 6). They then used a combination of region localization by object detection and filtering techniques to identify the created fatigue-crack-like defects. The authors made an interesting and useful assumption that fatigue cracks generally develop around bolt holes; however, this assumption may not be valid for other steel structures, for which critical members are usually welded—including, for example, navigational infrastructure such as miter gates [90]. Jahanshahi et al. [91] proposed a region growing method to segment microcracks on internal components of nuclear reactors.

《Fig. 6》

Fig. 6. Vision-based automated crack detection for bridge inspection in Ref. [89].

(4) Steel corrosion. Researchers have used textural, spectral, and color information for the identification of corrosion. Ghanta et al. [92] proposed the use of wavelet features together with principal component analysis for corrosion percent estimation in images. Jahanshahi and Masri [93] parametrically evaluated the performance of wavelet-based corrosion algorithms. Methods using textural and color features have also been proposed and evaluated (e.g., Refs. [94,95]). Automated algorithms for robotic and smartphone-based maintenance systems have also been proposed for image-based corrosion detection (e.g., Refs. [96,97]). A survey of corrosion detection approaches using computer vision can be found in Ref. [98].

(5) Asphalt defects. Numerous techniques exist for the detection and assessment of asphalt pavement cracks and defects using heuristic feature-extraction techniques [99–105]. Hu and Zhao [101] used a local binary pattern (LBP)-based algorithm to identify pavement cracks. Salman et al. [100] proposed the use of Gabor filtering. Koch and Brilakis [99] used histogram shape-based thresholding to automatically detect potholes in pavements. In addition to RGB data (where RGB refers to the three color channels representing red, green, and blue wavelengths of light), depth data has been used for the condition assessment of roads. For example, Chen et al. [106] reported the use of an inexpensive RGB-D sensor (Microsoft Kinect) to detect, quantify, and localize pavement defects. A detailed review of methods for asphalt defect detection can be found in Ref. [107].

For further study on identification methods for some of these defects, Koch et al. [108] provided an excellent review of computer vision defect detection techniques developed prior to 2015, classified based on the structure to which they are applied.

3.1.2. Deep learning-based damage detection

The studies and techniques described thus far can be categorized as either using machine learning techniques or relying on a combination of heuristic features together with a classifier. The application of such techniques in an automated structural inspection environment is limited, however, because these techniques do not exploit the contextual information available in the regions surrounding the defect, such as the nature of the material or structural components. These heuristic filtering-based techniques must be manually or semi-automatically tuned, depending on the appearance of the target structure being monitored. Real-world situations vary extensively, and hand-crafting a general algorithm that can succeed across such variety is quite difficult. The recent success of deep learning for computer vision [21,51] in a number of fields, such as general image classification [109], autonomous transportation systems [57], and medical imaging [63], has driven its application in civil infrastructure inspection and monitoring. Deep learning has greatly extended the capability and robustness of traditional vision-based damage detection for a wide variety of visual defects, ranging from cracks and spalling to corrosion. Different approaches for detection have been studied, including ① image classification, ② object detection or region-proposal methods, and ③ semantic segmentation methods. These applications are discussed below.

(1) Image classification. CNNs have been employed for the application of crack detection in steel decks [110], asphalt pavements [111], and concrete surfaces [112], with very high accuracy being achieved in all cases. Kim et al. [113] proposed a classification framework for identifying cracks in the presence of crack-like patterns using a CNN and speeded-up robust features (SURF), and determined the pixel-wise location using image binarization. Architectures such as AlexNet have been fine-tuned for crack detection [114,115], and GoogLeNet has been similarly fine-tuned for spalling [116]. Atha and Jahanshahi [117] evaluated different deep learning techniques for corrosion detection, and Chen and Jahanshahi [118] proposed the use of naïve Bayes data fusion with a CNN for crack detection. Yeum [119] utilized CNNs for the extraction of important regions of interest in highway truss structures in order to ease the inspection process.
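The fine-tuning recipe common to these classification studies can be sketched as follows; a ResNet-18 backbone stands in here for the AlexNet/GoogLeNet architectures used in the cited work, and the two-class crack/no-crack label set is an assumption.

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from an ImageNet-pretrained backbone and replace the classifier head
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)   # classes: {crack, no crack}

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """images: (N, 3, 224, 224) batch of patches; labels: (N,) with 0/1."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```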

Xu et al. [110,120] systematically investigated the detection of steel fatigue cracks for long-span bridges using deep learning neural networks, including a restricted Boltzmann machine and a fusion CNN. The novel fusion CNN proposed was able to identify minor cracks at multiple scales, with high accuracy, and under complex backgrounds present during in-field testing. Maguire et al. [121] developed a concrete crack image dataset for machine learning applications containing 56 000 images that were classified as either having a crack or not.

Bao et al. [122] proposed the use of DCNNs as an anomaly detector to aid inspectors in filtering out anomalous data from a bridge structural health monitoring (SHM) system recording acceleration data. Dang et al. [123] used a UAV to collect close-up images of bridges, and then automatically detected structural damage using CNNs applied to image patches.

(2) Object detection. Object detection methods have recently been applied for damage detection [112,119]. Instead of classifying an entire image, object detection methods create a bounding box around the damaged region. Regions with CNN features (R-CNN) have been used for spall detection in a post-disaster scenario by Yeum et al. [124], although the results (59.39% true-positive accuracy) left some room for improvement. The methods discussed thus far have only been applied to a single DT. In contrast, deep learning methods have the ability to learn general representations of identifiable characteristics in images over a large number of classes; for example, DCNNs have been successful in classification problems with over 1000 classes [21]. Limited research is available on techniques for the detection of multiple DTs. Cha et al. [125] studied the use of Faster R-CNN, a region-based method proposed by Ren et al. [30], to identify multiple DTs, including concrete cracks and different levels of corrosion and delamination.
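A hedged sketch of adapting an off-the-shelf detector to damage classes, following torchvision's standard Faster R-CNN fine-tuning pattern, is shown below; the six-class label set and score threshold are hypothetical.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Pretrained detector; swap the box predictor head for damage classes
model = fasterrcnn_resnet50_fpn(pretrained=True)
num_classes = 6   # e.g., background + five damage types (assumed label set)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# After fine-tuning on labeled damage images, inference yields bounding boxes
model.eval()
image = torch.rand(3, 512, 512)        # placeholder for a real inspection image
with torch.no_grad():
    pred = model([image])[0]           # dict with 'boxes', 'labels', 'scores'
boxes = pred["boxes"][pred["scores"] > 0.5]
```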

(3) Semantic segmentation. Object-detection-based methods cannot precisely delineate the shape of the damage they are isolating, as they only aim to fit a rectangular box around the region of interest. Another method to isolate regions of interest in an image is termed semantic segmentation. More precisely, semantic segmentation is the act of classifying each pixel in an image into one of a fixed number of classes. The result is a segmented image in which each segment is assigned a certain class. Thus, when applied to damage detection, the precise location and shape of the damage can be delineated.

Zhang et al. [126] proposed CrackNet, an efficient architecture for the semantic segmentation of pavement cracks. An object instance segmentation technique, Mask R-CNN [32], has also been recently adapted [116] for the detection of cracks, spalling, rebar exposure, and efflorescence. Although Mask R-CNN does provide pixel-level delineation of damage, it only segments the parts of the image where “objects” have been found, rather than performing a full semantic segmentation of the image.

Hoskere et al. [127,128] evaluated two methods for the general localization and classification of multiple DTs: ① a multiscale pixel-wise DCNN, and ② a fully convolutional neural network (FCN). As shown in Fig. 7 [127], six different DTs were considered: concrete cracks, concrete spalling, exposed reinforcement bars, steel corrosion, steel fracture and fatigue cracks, and asphalt cracks. In Ref. [127], a new network configuration and dataset were proposed. The dataset comprises images from a wide variety of structures, including bridges, buildings, pavements, dams, and laboratory specimens. The proposed technique consists of a parallel configuration of two networks, a damage presence (DP) network and a DT network, to increase the efficiency of damage detection. The scale invariance of the technique is demonstrated through the variety of scales of damage present in the proposed dataset.

《Fig. 7》

Fig. 7. Deep learning-based semantic segmentation of multiple structural DTs [127].

3.1.3. Change detection

When a structure must be regularly inspected, a baseline representation of the structure can first be established; data from subsequent inspections can then be compared against this baseline. Any new visual damage to the structure will manifest itself as a change relative to the baseline. Identifying and localizing changes can help to reduce the workload for processing data from UAV inspections. Because any new damage must manifest as a change, applying change detection before damage detection can help to reduce the number of false positives, as damage-like textures are likely to be present in both states. Change detection is a well-studied problem in computer vision, with applications ranging from environmental monitoring to video surveillance. This subsection examines two main approaches to change detection: ① point-cloud-based change detection and ② image-based change detection.

(1) Point-cloud-based change detection. Structure from motion (SFM) and multi-view stereo (MVS) [129] are vision-based techniques that can be used to generate point clouds of a structure. To conduct change detection, a baseline point cloud must first be established. These point clouds can be highly accurate, even for complex civil infrastructure such as truss bridges or dams, as described in Refs. [130,131]. Subsequent scans are then registered with the baseline cloud, with alignment typically carried out by the iterative closest point (ICP) algorithm [132]. The ICP algorithm has been implemented in open-source software such as MeshLab [133] and CloudCompare [134]. After alignment, different change detection procedures can be implemented. These techniques can be applied to both laser-scanning point clouds and point clouds generated from photogrammetry. Early studies used the Hausdorff distance from cloud to cloud (C2C) as a metric to identify changes in three dimensions [135]. Other techniques include digital elevation models of difference (DoD), cloud-to-mesh (C2M) methods, and Multiscale Model to Model Cloud Comparison (M3C2) methods. A review of these methods can be found in Ref. [136].
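A minimal sketch of this align-then-compare workflow using the Open3D library follows; the file names, ICP correspondence distance, and 2 cm change threshold are illustrative assumptions.

```python
import numpy as np
import open3d as o3d

baseline = o3d.io.read_point_cloud("baseline.ply")      # hypothetical scans
current = o3d.io.read_point_cloud("survey_new.ply")

# Align the new survey to the baseline with point-to-point ICP
est = o3d.pipelines.registration.TransformationEstimationPointToPoint()
reg = o3d.pipelines.registration.registration_icp(
    current, baseline, 0.05, estimation_method=est)  # 5 cm correspondence radius
current.transform(reg.transformation)

# Cloud-to-cloud (C2C) distances: large values flag candidate changes
d = np.asarray(current.compute_point_cloud_distance(baseline))
changed = np.where(d > 0.02)[0]        # illustrative 2 cm change threshold
print(f"{changed.size} points moved more than 2 cm")
```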

Combined with UAV data acquisition, change detection methods have been applied to civil infrastructure. For example, Morgenthal and Hallermann [137] identified manually inserted changes on a retaining wall using aligned orthomosaics for in-plane changes and C2C comparison using the CloudCompare package for out-of-plane changes [138]. Khaloo and Lattanzi [139] utilized the hue values of the pixels in different color spaces to aid the detection of important changes to a gravity dam. Jafari et al. [140] proposed a new method for measuring deformations using the direct pointwise distance in conjunction with statistical sampling to maximize data integrity. Another interesting application of point-cloud-based change detection is finite-element model updating. Ghahremani et al. [141] used vision-based methods to automatically locate, identify, and quantify damage through comparative analysis of two point clouds of a laboratory structural component, and subsequently used that information to update a finite-element model of the component. Point-cloud-based change detection methods are applicable when the change in point depth is sufficiently large to be identified. When looking for visual changes that may not cause sufficient geometric change, image-based change detection methods can be used.

(2) Image-based change detection. Change detection on images is a commonly studied problem in computer vision due to its widespread use across a number of applications [142]. One of the most prevalent use-cases for image-based change detection is seen in remotely sensed satellite imagery, where applications range from land-cover and land-use monitoring to damage assessment and disaster monitoring. An in-depth review of change detection methods for very high-resolution satellite images can be found in Ref. [143]. Before change detection is performed, the images are preprocessed to remove environmental variations such as atmospheric and radiometric effects, followed by image registration. Similar to damage detection, heuristic and deep learning-based techniques are available, as well as point-cloud and object-detection-based techniques. Ref. [144] provides a review of these approaches.

While remotely sensed satellite imagery can provide a city-scale understanding of damage, for individual structures, the resolution and views available from such images limit the amount of useful information that can be extracted. For images from UAV or ground vehicle surveys, change detection methods can be applied as a precursor to damage detection, in order to help localize candidate pixels or regions that may represent damage. To this end, Sakurada et al. [145] proposed a method to detect 3D changes in an outdoor scene from multi-view images taken at different time instances using a probabilistically estimated scene depth. CNNs have also been used to identify changes in urban scenes [146]. Stent et al. [147] proposed the use of a CNN to identify changes on tunnel linings, and subsequently ranked the changes in importance using clustering. A schematic of the method proposed by Stent et al. [147] is shown in Fig. 8.
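For roughly planar surfaces (e.g., retaining walls or tunnel linings), a simple registration-then-differencing baseline can be sketched as follows; the homography assumption and fixed difference threshold are simplifications relative to the cited methods, and the file names are placeholders.

```python
import cv2
import numpy as np

before = cv2.imread("before.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical pair
after = cv2.imread("after.jpg", cv2.IMREAD_GRAYSCALE)

# 1) Register the two views: ORB features, brute-force matching, homography
orb = cv2.ORB_create(4000)
k1, d1 = orb.detectAndCompute(before, None)
k2, d2 = orb.detectAndCompute(after, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
warped = cv2.warpPerspective(before, H, after.shape[::-1])  # dsize = (W, H)

# 2) Difference the registered images and threshold to flag changed pixels
diff = cv2.absdiff(after, warped)
_, change_mask = cv2.threshold(diff, 40, 255, cv2.THRESH_BINARY)
```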

《Fig. 8》

Fig. 8. Illustration of the system proposed in Ref. [147]. (a) Hardware for data capture; (b) changes detected in new images by localizing within a reconstructed reference model; (c) sample output, in which detected changes are clustered by appearance and ranked.

《3.2. Structural component recognition》

3.2. Structural component recognition

Structural component recognition is the process of detecting, localizing, and classifying characteristic parts of a structure, and is expected to be a key step toward the automated inspection of civil infrastructure. Information about structural components adds semantics to raw images or 3D point-cloud data. Such semantics help humans understand the current state of the structure; they also impose consistency onto error-prone data from field environments [148–150]. For example, by assigning the label “column” to point-cloud data, a collection of points becomes recognizable as a single structural component (as-built modeling). In the construction progress monitoring context, the column of the as-built model can be matched with the corresponding column of the 3D model developed at the design stage (the as-planned model), thus permitting the reference-based evaluation of the current state of the column. During the evaluation, points not labeled “column” can be safely neglected, because such points are thought to originate from irrelevant objects or erroneous data. In this sense, information about structural components is one of the essential attributes of as-built models used to represent the current state of structures in an effective and consistent manner.

Structural component recognition also provides strong supportive information for the automated vision-based damage evaluation of civil structures. Similar to as-built modeling, information regarding structural components can be used to improve the consistency of the automated damage detection methods by removing damage-like patterns on objects other than the structural components of interest (e.g., a crack detected in a tree is false-positive detection). Furthermore, information about structural components is necessary for the safety evaluation of the entire structure, because damage and the structural components on which the damage appears are jointly evaluated in order to derive the safety ratings in most current structural inspection guidelines (ATC-20 [2], national bridge inspection standards [3]).

Toward fully autonomous inspection, structural component recognition is expected to be a building block of autonomous navigation and data collection algorithms for robotic platforms (e.g., UAVs). Based on the types and positions of structural components recognized by the on-board camera(s), autonomous robots are expected to plan appropriate navigation paths and data collection behaviors. Although full automation of structural inspection has not been achieved yet, an example of an autonomous robot based on vision-based recognition of the surrounding environment can be found in the field of agriculture (i.e., the TerraSentia [151]).

3.2.1. Heuristic-based structural component recognition using image data

Early studies used hand-crafted image filters and heuristics to extract structural components from images. For example, reinforced concrete (RC) columns in an image can be recognized using line-segment pairs (Fig. 9) [152,153]. To distinguish columns from other irrelevant line-segment pairs, this method applies a threshold to select near-parallel pairs within a predetermined range of aspect ratios. The authors applied the method to 20 images in which columns were photographed as the main objects, and detected 38 out of 51 columns with seven false-positive detections. Despite its simplicity, this approach depends strongly on the threshold values, and tends to miss partially occluded or relatively distant columns. Furthermore, the method recognizes any line-segment pairs satisfying the thresholds as columns, without further contextual understanding. To improve the results and achieve fewer false-positive detections, high-level context within a scene needs to be incorporated at different scales.
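A loose illustration of this line-segment-pair heuristic is sketched below; all thresholds (angle tolerances, lengths, aspect-ratio range) are assumptions for illustration, not the values used in Refs. [152,153].

```python
import cv2
import numpy as np

img = cv2.imread("building.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical image
edges = cv2.Canny(img, 50, 150)
segs = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=80,
                       minLineLength=100, maxLineGap=10)
segs = [s[0] for s in segs] if segs is not None else []

def angle_len(s):
    x1, y1, x2, y2 = s
    return np.arctan2(y2 - y1, x2 - x1), np.hypot(x2 - x1, y2 - y1)

candidates = []
for i in range(len(segs)):
    for j in range(i + 1, len(segs)):
        a1, L1 = angle_len(segs[i])
        a2, L2 = angle_len(segs[j])
        gap = abs(int(segs[i][0]) - int(segs[j][0]))     # horizontal spacing
        near_parallel = abs(a1 - a2) < np.radians(5)
        near_vertical = abs(abs(a1) - np.pi / 2) < np.radians(10)
        aspect = min(L1, L2) / max(gap, 1)               # column height/width
        if near_parallel and near_vertical and 2 < aspect < 20:
            candidates.append((segs[i], segs[j]))        # candidate column edges
```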

《Fig. 9》

Fig. 9. RC column recognition results [153].

3.2.2. Structural component recognition using 3D point-cloud data

Another important scenario for structural component recognition is performing the task when dense 3D point-cloud data are available. Different segmentation and classification approaches have been proposed for building component recognition using dense 3D point-cloud data. Xiong et al. [154] investigated an automated method for converting dense 3D point-cloud data from a room to a semantically rich 3D model represented by planar walls, floors, ceilings, and rectangular openings (a process referred to as Scan-to-Building Information Modeling (BIM)). Perez-Perez et al. [155] adopted higher dimensional features (193 dimensions for semantic features and 553 dimensions for geometric features) to perform structural and nonstructural component recognition in indoor spaces. With the rich information carried by the extracted features and post-processing performed using conditional random fields, this method was able to accurately label both planar and complex nonplanar surfaces, as shown in Fig. 10 [155]. Armeni et al. [156] proposed a method of filtering, segmenting, and classifying dense 3D point-cloud data, and demonstrated the method by parsing an entire building into planar components.

《Fig. 10》

Fig. 10. Indoor semantic segmentation by Perez-Perez et al. [155] using 3D dense point-cloud data.

Golparvar-Fard et al. [157] conducted a detailed comparison of image-based point clouds and laser scanning for automated performance monitoring, considering accuracy and usability for 3D reconstruction, shape modeling, and as-built visualizations; they found that although image-based techniques were not as accurate, they provided tremendous opportunities for visualization and the extraction of rich semantic information. Golparvar-Fard et al. [158] proposed an automated method to monitor changes of 3D building elements by fusing unordered photo collections with building information models, using SFM followed by voxel-based scene quantization. Recently, Lu et al. [159] proposed a top-down approach to robustly detect four types of bridge components from point clouds of RC bridges.

The effectiveness of the 3D approaches discussed in this section depends on the available data for solving the problem at hand. Compared with image data, dense 3D point-cloud data carry richer information with the extra dimension, which enables the recognition of complex-shaped structural components and/or recognition tasks that require high localization accuracies. On the other hand, in order to obtain accurate and dense 3D point-cloud data, every part of the structure under inspection needs to be photographed with sufficient resolution and overlap, which requires increased effort in data collection. Also, off-line post-processing is often required, posing challenges to the application of 3D approaches to real-time processing tasks. For such scenarios, deep learning-based structural component recognition using image data is another promising approach to perform structural component recognition tasks, as discussed in the next section.

3.2.3. Deep learning-based structural component recognition using image data

Approaches based on machine learning methods for structural component recognition have recently come under active investigation. Image classification is one of the major applications of CNNs, in which a single representative label is estimated from an input image. Yeum et al. [160] used CNNs to classify candidate image patches of the welded joints of a highway sign truss structure to reliably recognize the region of interest. Gao and Mosalam [161] used CNNs to classify input images into appropriate structural component and damage categories. The authors then inferred rough localization of the target object based on the output of the last convolutional layer (weakly supervised learning; see Fig. 11 [161] for structural component recognition results). Object detection algorithms can also be used for structural component recognition. Liang [162] applied the Faster R-CNN algorithm to detect and localize bridge components by drawing bounding boxes around them automatically.

《Fig. 11》

Fig. 11. Structural component recognition results using weakly supervised learning [161].

Semantic segmentation is another promising approach for solving structural component recognition problems [163–165]. Instead of drawing bounding boxes or inferring approximate object locations from per-image labels, semantic segmentation algorithms output label maps with the same resolution as the input images, which is particularly effective for accurately detecting, localizing, and classifying complex-shaped structural components. To obtain high-resolution bridge-component recognition results consistent with a high-level scene structure, Narazaki et al. [164] investigated three different configurations of FCNs: ① a naïve configuration, which estimates a label map directly from an input image; ② a parallel configuration, which estimates a label map based on semantic segmentation of high-level scene classes and bridge-component classes running in parallel (Fig. 12(a)); and ③ a sequential configuration, which estimates a label map based on the scene segmentation results and the input image (Fig. 12(b)). Example bridge-component recognition results are shown in Fig. 13. Except for the third and seventh images, all configurations were able to identify the structural components, including distant or partially occluded columns. Notable differences were observed for non-bridge images (see the last two images in Fig. 13 [164]). For the naïve and parallel configurations, false-positive detections were observed in building and pavement pixels. In contrast, FCNs with the sequential configuration did not show false positives. (Table 1 presents the rates of false-positive detections for non-bridge images.) Therefore, the sequential configuration is considered to be effective in imposing high-level scene consistency onto bridge-component recognition, improving the robustness of recognition for images of complex scenes.

《Fig. 12》

Fig. 12. Network configurations to impose scene-level consistency [164].

《Fig. 13》

Fig. 13. Example bridge-component recognition results [164].

《Table 1》

Table 1 False-positive rates for the nine scene classes.

《3.3. Damage detection with structure-level consistency》

3.3. Damage detection with structure-level consistency

To conduct automated assessments, combining information about both the structural components and the damage state of those components will be vital. German et al. [166] proposed an automated framework for rapid post-earthquake building evaluation. In this framework, a video is recorded from inside the damaged building, each frame is searched for the presence of columns [152], and each detected column is assigned a damage index. The damage index [88] is estimated by classifying the column’s failure mode as shear- or flexure-critical, based on the location and severity of cracks, spalling, and exposed reinforcement identified by the methods proposed in Refs. [73,167]. The structural layout of the building is then manually recorded and used to query a fragility database, which provides the probability of the building being in a certain damage state.

Anil et al. [168] identified the information requirements needed to adequately represent visual damage to structural walls in the aftermath of an earthquake, grouping them into five categories based on 17 damage parameters with varying damage sensitivity. This information was then used in Ref. [169] to describe a BIM-based approach that helps automate the engineering analyses, introducing heuristics to incorporate both strength analyses and visual damage assessment information. Wei et al. [170] provided a detailed review of the current status and the persistent and emergent challenges related to 3D imaging techniques for construction and infrastructure management.

Hoskere et al. [128] used FCNs for the segmentation of damage and building component images to produce inspection-like semantic information. Three different networks were used: one for scene and building (SB) information, one to identify DP, and another to identify DT. The SB network had an average accuracy of 88.8% and the combined DP and DT networks together had an average accuracy of 91.1%. The proposed method was able to successfully identify the location and type of damage, together with some context about the SB; thus, this method is applicable in a much more general setting than prior work. Qualitative results for several images are shown in Fig. 14 [128], where the right column provides comments on successes and false positives and negatives.

《Fig. 14》

Fig. 14. Qualitative results from Ref. [128].

《3.4. Summary》

3.4. Summary

Damage detection, change detection, and structural component recognition are key steps toward enabling automated structural inspections. While structural inspections provide valuable indicators to assess the condition of infrastructure, more quantitative measurement of structural responses is often required. Quantities such as displacement and strain can also be measured using vision-based techniques to enable structural condition assessment. The next section provides a review of civil infrastructure monitoring applications using vision-based techniques.

《4. Monitoring applications》

4. Monitoring applications

Monitoring is performed to obtain a quantitative understanding of the current state of civil infrastructure through the measurement of physical quantities such as accelerations, strains, and/or displacements. Monitoring is often accomplished using wired or wireless contact sensors [171–178]. Although numerous applications can use contact sensors to gather data effectively, these sensors are often expensive to install and difficult to maintain. Vision-based techniques offer the advantages of non-contact methods, which potentially address some of the challenges of using contact sensors. As mentioned in Section 2, a key computer vision algorithm that can perform measurement tasks is optical flow, which estimates the translational motion of each pixel between two image frames [179]. Optical flow algorithms are general computer vision techniques that associate pixels in a reference image with corresponding pixels in another image of the same scene from a slightly different viewpoint by optimizing an objective function, such as the sum of squared differences (SSD) error, the normalized cross-correlation (NCC) criterion, a global cost function [179], or combined local and global functionals [180,181]. A comparison of methods with different cost functions and optimization algorithms can be found in Ref. [182]. The rest of this section discusses studies of vision-based techniques for monitoring civil infrastructure, divided into two main subsections: ① static applications and ② dynamic applications.

《4.1. Static applications》

4.1. Static applications

Measurement of static displacements and strains for civil infrastructure using vision-based techniques is often carried out using digital image correlation (DIC). According to Sutton et al. [183], DIC refers to “the class of non-contacting methods that acquire images of an object, store images in digital form, and perform image analysis to extract full-field shape, deformation, and/or motion measurements” (p. 1). In addition to estimating the displacement field in the image plane, DIC algorithms can include different post-processing steps to compute two-dimensional (2D) in-plane strain fields (2D DIC), out-of-plane displacement and strain fields (3D DIC), and volumetric measurements (VDIC). Highly reliable commercial DIC solutions are available today (e.g., VIC-2D™ [184] and GOM Correlate [185]). A detailed review of general DIC applications can be found in Refs. [183,186].
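Conceptually, the core of 2D DIC is subset matching between a reference and a deformed image; the sketch below tracks subsets by maximizing the NCC and returns integer-pixel displacements only (production DIC codes add sub-pixel interpolation and subset shape functions). File names and subset/search sizes are assumptions.

```python
import cv2
import numpy as np

ref = cv2.imread("ref.png", cv2.IMREAD_GRAYSCALE)        # hypothetical speckle
cur = cv2.imread("deformed.png", cv2.IMREAD_GRAYSCALE)   # image pair

def track_subset(ref, cur, center, half=15, search=20):
    """Integer-pixel displacement of one subset by maximizing the NCC."""
    y, x = center
    subset = ref[y - half:y + half + 1, x - half:x + half + 1]
    window = cur[y - half - search:y + half + search + 1,
                 x - half - search:x + half + search + 1]
    ncc = cv2.matchTemplate(window, subset, cv2.TM_CCOEFF_NORMED)
    _, _, _, peak = cv2.minMaxLoc(ncc)          # (x, y) of the best match
    return peak[0] - search, peak[1] - search   # (u, v) displacement in pixels

# Displacement field on a coarse grid of subset centers
grid = [(y, x) for y in range(60, ref.shape[0] - 60, 30)
               for x in range(60, ref.shape[1] - 60, 30)]
disp = np.array([track_subset(ref, cur, c) for c in grid])
```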

DIC methods have been applied for displacement and strain measurement in civil engineering. Hoult et al. [187] evaluated the performance of the 2D DIC technique using a steel specimen under uniaxial loading by comparing results with strain gage measurements (Fig. 15). The authors then proposed methods to compensate for the effect of out-of-plane deformation. The performance of the 2D DIC technique was also tested using steel and reinforced concrete beam specimens, for which theoretical values of strain and strain measurement data from strain gages were available [188]. In Ref. [189], a 3D DIC system was used as a reference to measure the static displacement of laboratory specimens. In these tests, sub-pixel accuracy for displacements was achieved, and the strain estimation agreed with the measurement by strain gages and the theoretical values.

《Fig. 15》

Fig. 15. A steel plate specimen used for uniaxial testing by Hoult et al. [187].

DIC methods have also been applied to the field measurement of displacement and strain of civil structures. McCormick and Lord [190] applied the 2D DIC technique to vertical displacement measurement of a highway bridge deck statically loaded with four 32-ton trucks. Yoneyama et al. [191] applied the 2D DIC technique to estimate the deflection of a bridge girder loaded by a 20-ton truck. The authors evaluated the accuracy of the deflection measurement with and without artificial patterns using data from displacement transducers. Yoneyama and Ueda [192] also applied the 2D DIC to bridge deflection measurement under operational loading. Helfrick et al. [193] used 3D DIC for full-field vibration measurement. Reagan [194] used a UAV carrying a stereo camera to apply 3D DIC to the long-term monitoring of bridge deformations.

Another promising civil engineering application of the DIC methods is crack mapping, in which 3D DIC is employed to extract cracked regions characterized by large strains. Mahal et al. [195] successfully extracted cracks on an RC specimen, and Ghorbani et al. [196] extended this crack-mapping method to a full-scale masonry wall specimen under cyclic loading (Fig. 16). The crack maps thus obtained are useful for analyzing laboratory test results, as well as for augmenting the information used for the structural inspection.

《Fig. 16》

Fig. 16. Crack maps created using 3D DIC [196]. (a) First crack; (b) peak load; (c) ultimate state. Red corresponds to +3000 μm·m⁻¹.

《4.2. Dynamic applications》

4.2. Dynamic applications

System identification and modal analysis are powerful tools for SHM, and can provide valuable information about the dynamic properties of structural systems. System identification and other SHM-related tasks are often accomplished using wired or wireless accelerometers, due to the reliability of such sensors and their ease of installation [171–178]. Vision-based techniques offer the advantage of non-contact measurement, as compared with traditional approaches. Combined with the proliferation of low-cost cameras on the market and increased computational capabilities, video-based methods have become convenient approaches for displacement measurement in structures. Several algorithms are available to accomplish displacement extraction; these work in principle by template matching or by tracking either contours of constant phase or intensity through time [55,197,198]. Optical flow methods have been used to measure dynamic and pseudo-static responses for several applications, including system identification, modal analysis, model updating, and direct indication of changes in serviceability based on thresholds. For further information, recent reviews of dynamic monitoring applications using computer vision techniques have been provided by Ye et al. [199] and Feng and Feng [200]. This section discusses studies that have proposed and/or evaluated different algorithms for displacement measurement with ① laboratory experiments and ② field validation.

4.2.1. Laboratory experiments

Early optical flow algorithms used for dynamic monitoring focused on natural frequency estimation [201] and displacement measurement [202,203]. Markers are commonly used to obtain streamlined and precise detection and tracking of points of interest. Min et al. [204] designed high-contrast markers to be used for displacement measurement with smartphone devices and telephoto lenses, and obtained excellent results in laboratory testing (Fig. 17).

《Fig. 17》

Fig. 17. A smartphone-based displacement measurement system including telephoto lenses and high-contrast markers proposed by Min et al. [204]. B: blue; G: green; P: pink; Y: yellow.

Dong et al. [205] proposed a method for the multipoint synchronous measurement of dynamic structural displacement. Celik et al. [206] evaluated different vision-based techniques to measure human loads on structures. Lee et al. [207] presented an approach for displacement measurement tailored for field testing, with enhanced robustness to strong sunlight. Park et al. [208] demonstrated the efficacy of fusing vision-based measurements with accelerations to extend the dynamic range and lower signal noise. The application of vision algorithms has also been extended to the system identification of laboratory structures. Schumacher and Shariati [209] introduced the concept of virtual visual sensors that can be used for the modal analysis of a structure. Yoon et al. [210] implemented a Kanade–Lucas–Tomasi (KLT) tracker to perform system identification of a laboratory-scale six-story building model (Fig. 18). Ye et al. [211] conducted multipoint displacement measurement on a scale-model arch bridge and validated the measurements using linear variable differential transformers (LVDTs). Abdelbarr et al. [212] measured 3D dynamic displacements using an inexpensive RGB-D sensor. Ye et al. [213] conducted studies on a shake table to determine the factors that influence system performance for vision-based measurements. Feng and Feng [214] implemented a template tracking approach using up-sampled cross-correlations to obtain displacements at multiple points on a vibrating structure. System identification has also been carried out on laboratory structures using vision data captured by UAVs [215]. The same authors proposed a method to measure dynamic displacements using UAVs with a stationary reference in the background.
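A minimal target-free displacement-tracking sketch in the spirit of the KLT-based approach is given below; the video file name and tracking parameters are placeholders, and status-based rejection of lost points is omitted for brevity.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("shake_table_test.mp4")       # hypothetical video
ok, frame = cap.read()
prev = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Detect corner features on the structure (or on high-contrast markers)
pts = cv2.goodFeaturesToTrack(prev, maxCorners=50, qualityLevel=0.01,
                              minDistance=10)
history = [pts.reshape(-1, 2)]

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # KLT: sparse optical flow of the tracked points between frames
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev, gray, pts, None)
    history.append(nxt.reshape(-1, 2))
    prev, pts = gray, nxt

# Pixel displacement time histories (scale by mm/pixel for engineering units)
disp = np.stack(history) - history[0]    # shape: (n_frames, n_points, 2)
```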

《Fig. 18》

Fig. 18. The target-free approach for vision-based structural system identification using consumer-grade cameras [210]. (a) Screenshots of target tracking; (b) extracted mode shapes from different sensors. GoPro and LG G3 are the cameras used in the test.

Wadhwa et al. [55] proposed a technique called motion magnification that band-passes videos with small deformations in order to extract and amplify motion at certain frequencies. The procedure involves decomposing the video at multiple scales, applying a filter to each scale, and then recombining the filtered spatial bands. Subsequently, a number of papers have appeared on the full-field modal analysis of structures using vision-based methods inspired by motion magnification. Chen et al. [216] successfully applied motion magnification to visualize operational deflection shapes of laboratory structures (Fig. 19). Cha et al. [217] used a phase-based approach together with unscented Kalman filters for system identification using noisy displacement measurements. Yang et al. [218] adapted the method to include blind source separation with multiscale wavelet filters, as opposed to complex steerable filters, in order to obtain the full-field modes of a laboratory specimen. Yang et al. [219] proposed a method for the high-fidelity and realistic simulation of structural response using a high-spatial-resolution modal model and video manipulation.

《Fig. 19》

Fig. 19. Screenshots of a motion-magnified video of a vibrating cantilever beam from Ref. [216].

4.2.2. Field validation

The success of vision-based vibration measurement techniques in the laboratory has led to a number of field applications over the past few years. The most common application has been to measure displacements on full-scale bridge structures [220–223], including the measurement of different components such as bridge decks, trusses, and hanger cables. Phase-based methods have also been applied for the displacement and frequency estimation of an antenna tower, and to obtain partial mode shapes of a truss bridge structure [224].

Some researchers have investigated the use of vision sensors to measure displacements at multiple points on a structure. Yoon [225] used a camera system to measure railroad bridge displacement during train crossings. As shown in Fig. 20 [225], the measured displacement was very close to the values predicted from a finite-element model that used the train load as input; the differences were primarily attributed to the train not having a constant speed. Mas et al. [226] developed a method for the simultaneous multipoint measurement of vibration frequencies through the analysis of high-speed video sequences, and demonstrated their algorithm on a steel pedestrian bridge.
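
Once displacement time histories have been extracted at several points, vibration frequencies can be estimated from their power spectra. The sketch below illustrates this with Welch's method; it is a schematic stand-in for the algorithm of Mas et al. [226], and the synthetic signals merely substitute for real tracked displacements.

```python
# Schematic multipoint frequency estimation from displacement histories
# (not the implementation of Ref. [226]). The synthetic 2.2 Hz signals
# below stand in for displacements extracted by a vision-based tracker.
import numpy as np
from scipy.signal import welch

fs = 120.0                        # assumed camera frame rate (Hz)
t = np.arange(0, 30, 1 / fs)
# Three noisy stand-in displacement signals, one per measurement point.
disp = np.stack([np.sin(2 * np.pi * 2.2 * t + p) for p in (0.0, 0.5, 1.0)])
disp += 0.05 * np.random.randn(*disp.shape)

for i, x in enumerate(disp):
    f, pxx = welch(x, fs=fs, nperseg=1024)    # power spectral density
    print(f"point {i}: dominant frequency = {f[np.argmax(pxx)]:.2f} Hz")
```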

《Fig. 20》

Fig. 20. Railroad bridge displacement measurement using computer vision [225]. (a) Images of the optical tracking of the railroad component; (b) comparison of vision-based displacement measurement to FE simulation estimation. FEsim: finite-element simulation.

An interesting application of load estimation was studied by Chen et al. [227], who used computer vision to identify the distribution of vehicle loads across a bridge in both space and time by automatically detecting the type of each vehicle entering the bridge and combining this information with measurements from the weigh-in-motion system at one cross section of the bridge. The developed system enables accurate load determination for the entire bridge, in contrast with previous methods, which could measure vehicle loads at only one cross section.

The use of computer vision techniques for the system identification of structures has been somewhat limited; obtaining measurements of all points on a large structure within a single video frame usually results in a pixel resolution that is insufficient to yield accurate structural displacements. Moreover, finding a good vantage point to place a camera also proves difficult in urban environments, resulting in video data that contains perspective and atmospheric distortions due to monitoring from afar with zoom lenses [216]. Finally, when using a remote camera, only points on the structure that are readily visible from the selected camera location can be monitored. Isolating modal information from video data often involves the selection of manually generated masks or regions of interest, which makes the process cumbersome.

Xu et al. [222] proposed a low-cost, non-contact vision-based system for multipoint displacement measurement based on a consumer-grade camera for video acquisition, and were able to obtain the mode shapes of a pedestrian bridge using the proposed system. Hoskere et al. [228] developed a divide-and-conquer approach to acquire the mode shapes of full-scale infrastructure using UAVs. The proposed approach directly addressed a number of difficulties associated with the modal analysis of full-scale infrastructure using vision-based methods. Initial evaluation was carried out using a six-story shear-building model excited on a shaking table in a laboratory environment. Field tests of a full-scale pedestrian suspension bridge were subsequently conducted to obtain natural frequencies and mode shapes (Fig. 21) [151].

《Fig. 21》

Fig. 21. (a) Image of Phantom 4 recording video of a vibrating bridge (3840 × 2160 pixels at 30 fps); (b) Lake of the Woods pedestrian bridge, Mahomet, IL; (c) finite-element model of the bridge; (d) extracted mode shapes [151].

《5. Challenges to the realization of vision-based automated inspection and monitoring of civil infrastructure》

5. Challenges to the realization of vision-based automated inspection and monitoring of civil infrastructure

Although the research community has made significant progress in recent years, a number of technical hurdles must be overcome before the use of vision-based techniques for automated SHM can be fully realized. The key difficulties broadly lie in converting the features and signals extracted by vision-based methods into more actionable information that can aid in higher-level decision-making.

《5.1. Automated structural inspections necessitate integral understanding of damage and context》

5.1. Automated structural inspections necessitate integral understanding of damage and context

Humans performing visual inspections have remarkable perception capabilities that are still very difficult for vision and deep learning algorithms to replicate. Trained human inspectors are able to identify regions of importance to the overall health of the structure (e.g., critical structural components, the presence of structurally significant damage, etc.). When a structure is damaged, depending on the shape, size, and location of the damage, and the type and importance of the component in which it occurs, a trained inspector can infer the structural importance of the damage identified. The inspector can also understand the implications of the presence of multiple kinds of damage. Thus, while significant progress has been made in this field, higher-accuracy damage detection and component recognition are still needed. Moreover, studies on interpreting the structural significance of identified damage and on assimilating local information with global information to make structure-level assessments are almost completely absent from the literature. Addressing these challenges will be vital for the realization of fully automated vision-based inspections.

《5.2. The generality of deep networks depends on the generality of data》

5.2. The generality of deep networks depends on the generality of data

Trained DCNN models tend to perform poorly if the features extracted from the data on which inference is conducted differ significantly from those in the training data. Thus, the quality of a trained deep model is directly dependent on the underlying dataset. DCNN models that have not been trained to be robust to damage-like features, such as grooves or joints, cannot be relied upon to distinguish such textures from true damage during inference. The limited number of datasets available for the detection of structural damage is a challenge that must be overcome in order to advance the perception capabilities of DCNNs for automated inspections.

《5.3. Human-like perception for inspections requires an understanding of sequential views》

5.3. Human-like perception for inspections requires an understanding of sequential views

A single image does not always provide sufficient context for performing damage detection and component-recognition tasks. For example, damage recognition is most likely to be successful when the image is a close-up view of a component; however, component recognition from such images is very difficult. In an extreme case, the inspector may be so close to the component that concrete columns would be indistinguishable from concrete beams or concrete walls. During inspection by humans, this problem is easily resolved by first examining the entire structure, and then moving close to the structural components while keeping in mind the target structural component. To replicate this human function, the viewing sequence (e.g., using video data) must be incorporated into the process and the recognition tasks must be performed based on the current frame as well as on previous frames.

《5.4. Displacements are often small and difficult to capture》

5.4. Displacements are often small and difficult to capture

For monitoring applications, recent work has successfully demonstrated the feasibility of vision-based methods for measuring modal information, as well as the displacements and strains of structures, both in the laboratory and in the field. On the other hand, accurate displacement and strain measurement for in situ civil infrastructure is rarely straightforward. The displacement and strain ranges expected during field tests are often smaller than those in laboratory tests, because the target structures in the field are responding to operational loading. In a field environment, the accessibility of the structural components of interest is often limited. In such cases, optimal camera locations for high-quality measurement cannot be realized, and markers guiding the displacement measurement cannot be placed. For static applications, surface textures are often added artificially (e.g., speckle patterns) to help with the image-matching step in DIC methods [183], which is also difficult for structures with limited accessibility. Further research and development effort is needed, in terms of both hardware and software, to apply vision-based static displacement/strain measurement in such operational situations.

《5.5. Lighting and environmental effects》

5.5. Lighting and environmental effects

Vision-based methods are highly susceptible to visibility-related environmental effects such as the presence of rain, mist, or fog. While such conditions are difficult to circumvent, other environmental factors, such as changes in lighting, shadows, and atmospheric interference, are amenable to normalization, although more work is required to improve robustness.

《5.6. Big data needs big data management》

5.6. Big data needs big data management

The realization of continuous and automated vision-based monitoring poses challenges regarding the massive amount of data produced, which becomes difficult to store and process for long-term applications. Automated real-time signal extraction will be necessary to reduce the amount of data that is stored. Methods to handle and process the full-field modal information obtained from video-bandpassing techniques are also an open area of research.

《6. Ongoing work toward automated inspections》

6. Ongoing work toward automated inspections

Toward the goal of automated inspections, vision-based perception is still an open research problem that requires a great deal of attention. This section discusses ongoing work at the University of Illinois aimed at addressing the following challenges, which were outlined in Section 5: ① incorporating context to generate condition-aware models; ② generating synthetic, labeled data using photorealistic physics-based graphics models to address the need for more general data; and ③ leveraging video sequences for human-like recognition of structural components.

《6.1. Incorporating context to generate condition-aware models》

6.1. Incorporating context to generate condition-aware models

As discussed in Section 5.1, understanding the context in which damage occurs is a key aspect of being able to make automated, high-level assessments that render detailed inspection judgements. To address this issue, Hoskere et al. [128] proposed a new procedure in which information about the type of structure, its various components, and the condition of each of the components is combined into a single model, referred to as a condition-aware model. Such models can be viewed as analogous to the as-built models used in the construction and design industry, but for the purpose of inspection and maintenance. Condition-aware models are generated automatically and annotated to show the presence of visual defects on the structure. Depending on the particular inspection application being considered, the required fidelity of the condition-aware model also varies. The main advantage of building a condition-aware model, as opposed to just using images directly, is that the context of the structure and the scale of the damage are easily identifiable. Moreover, the availability of global 3D geometry information can aid the assessment process. The model serves as a convenient entity with which to rapidly and automatically document visually identifiable defects on the structure.

The framework proposed by Hoskere et al. [128] for the generation of condition-aware models for rapid automated post-disaster inspections is shown in Fig. 22. A 3D mesh model is generated using multi-view stereo from a UAV survey of the structure. Deep learning-based condition inference is then conducted on the same set of images for the semantic segmentation of the damage and building context. The generated labels are projected onto the mesh using UV mapping (a 3D modeling process of projecting a 2D image onto a 3D model), which results in a condition-aware model with averaged damage and context labels superimposed on each cell. Fig. 23 shows the condition-aware model that was developed using this procedure for a building damaged during the central Mexico earthquake in September 2017.
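
The label-projection step can be illustrated schematically as follows: each mesh face centroid is projected into an image with a pinhole camera model, and the semantic label is sampled at that pixel. This simplified sketch ignores occlusion and the full UV-mapping machinery of Ref. [128]; all names and conventions are illustrative.

```python
# Schematic sketch of projecting per-image semantic labels onto a mesh
# (not the UV-mapping pipeline of Ref. [128]; occlusion is ignored).
import numpy as np

def project_labels(centroids, K, R, t, label_map):
    """centroids: (F, 3) world coords of face centroids; K: 3x3 camera
    intrinsics; R, t: extrinsics; label_map: (H, W) integer class labels
    produced by semantic segmentation of one image."""
    cam = R @ centroids.T + t.reshape(3, 1)        # world -> camera frame
    pix = K @ cam                                  # camera -> pixel coords
    u = (pix[0] / pix[2]).round().astype(int)      # perspective divide
    v = (pix[1] / pix[2]).round().astype(int)
    H, W = label_map.shape
    labels = np.full(len(centroids), -1)           # -1 = face not visible
    ok = (cam[2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    labels[ok] = label_map[v[ok], u[ok]]
    return labels   # one damage/context label per visible face

# Per-face labels from many survey images can then be averaged (or decided
# by majority vote) to texture the mesh, yielding the condition-aware model.
```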

《Fig. 22》

Fig. 22. A framework to generate condition-aware models for rapid post-disaster inspections.

《Fig. 23》

Fig. 23. A condition-aware model of a building damaged during the central Mexico earthquake in September 2017 [128].

《6.2. Generating synthetic labeled data using photorealistic physics-based graphics models》

6.2. Generating synthetic labeled data using photorealistic physics-based graphics models

As discussed in Section 5.2, for deep learning techniques aimed at automated inspections, the lack of vast quantities of labeled data makes it difficult to generalize training models across a wide variety of structures and environmental conditions. Each civil engineering structure is unique, which makes the identification of damage more challenging. For example, buildings could be painted in a wide range of colors (a parameter that will almost certainly have an impact in the damage detection result, especially for corrosion); thus, developing a general algorithm for damage detection is difficult without accounting for such issues. To compound the problem, high-quality data from damaged structures is also relatively difficult to obtain, as damaged structures are not a common occurrence.

The significant progress that has been made in computer graphics over the past decade has allowed the creation of photorealistic images and videos. Here, synthetic data refers to data generated from a graphics model, as opposed to a camera in the real world. Synthetic data has recently been used in computer vision to train deep neural networks for the semantic segmentation of urban scenes, and models trained on synthetic data have shown promising performance on real data [229]. The use of synthetic data offers many benefits. Two types of platforms are available to produce synthetic data: ① real-time game engines that use rasterization to render graphics images at low computational cost, at the expense of accuracy and realism; and ② renderers that use physics-based ray tracing engines for accurate simulations of light and materials in order to produce realistic images, at a high computational cost. The generation of synthetic data can help to solve the problem of data labeling, as any data from algorithmically generated graphics models will be automatically labeled, both at the pixel and image levels. Graphics models can also provide a testbed for vision algorithms with readily repeatable conditions. Different environmental conditions, such as lighting, can be simulated, and algorithms can be studied with different camera parameters and UAV data-acquisition flight paths. Algorithms that are effective in these virtual testbeds will be more likely to work on real-world datasets.

3D modeling, simulation, and rendering tools such as Blender [230] can be used to better simulate real-world environmental effects. Combined with deformed meshes from finite-element models, these tools can be used to create graphics models of damaged structures. Understanding the damage condition of a structure requires context awareness. For example, identical cracks at different locations on the same structure could have different implications for the overall health of the structure. Similarly, cracks in bridge columns must be treated differently from cracks in the wall of a building. Hoskere et al. [231] proposed a novel framework (Fig. 24) using physics-based models of structures to create synthetic graphics images of representative damaged structures. The proposed framework has five main steps: ① the use of parameterized finite-element models to structurally model representative structures of various shapes, sizes, and materials; ② nonlinear finite-element analysis to identify structural hotspots on the generated models; ③ the application of material graphic properties for realistic rendering of the generated model; ④ procedural damage generation using hotspots from the finite-element models; and ⑤ the training of deep learning models for assessment using the generated synthetic data.
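
As a toy illustration of step ④, the sketch below grows a crack mask as a meandering random walk seeded at a hotspot pixel identified by finite-element analysis. The actual procedural damage generation in Ref. [231] is considerably more sophisticated; every parameter here is an assumption.

```python
# Illustrative procedural crack generation: a random-walk crack mask
# seeded at a finite-element "hotspot" pixel (a simplified stand-in for
# the procedure of Ref. [231]; all parameters are assumptions).
import numpy as np

def grow_crack(mask, seed, n_steps=400, drift=0.15, rng=None):
    """mask: (H, W) uint8 crack mask; seed: (row, col) hotspot location."""
    if rng is None:
        rng = np.random.default_rng()
    H, W = mask.shape
    r, c = seed
    angle = rng.uniform(0, 2 * np.pi)     # initial propagation direction
    for _ in range(n_steps):
        angle += rng.normal(0, drift)     # meandering of the crack path
        r += np.sin(angle)
        c += np.cos(angle)
        if not (0 <= r < H and 0 <= c < W):
            break                         # crack reached the texture edge
        mask[int(r), int(c)] = 255
    return mask

mask = np.zeros((512, 512), dtype=np.uint8)
mask = grow_crack(mask, seed=(256, 256))  # seed taken from an FE hotspot
# The mask can then be composited into the rendered surface texture and
# reused verbatim as the pixel-level ground-truth label.
```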

《Fig. 24》

Fig. 24. Framework for physics-based graphics generation for automated assessment using deep learning [231].

Physics-based graphics models can be used to generate a wide variety of damage scenarios. As the generated data will be similar to real data, the limits of deep learning methods to identify important damage and structure features can be established. These models provide high-quality labeled data at multiple levels of context, including: ① global structure properties, such as the number of stories and bays, and the structural system; ② structural and nonstructural components and critical regions; and ③ the presence of different types of local and global damage, such as cracks, spalling, soft stories, and buckled columns, along with other hazards such as falling parts. This higher level of context information can enable reliable automated inspections where approaches trained on images of local damage have failed. The inherent subjectivity of onsite human inspectors can be greatly reduced through the use of objective physics-based models as the underlying basis for the training data, rather than the use of subjective hand-labeled data.

Research on the use of synthetic data for vision-based inspection applications has been limited. Hoskere et al. [232] created a physics-based graphics model of a miter gate and trained deep semantic segmentation networks to identify important changes to the gate in the synthetic environment. Training data for the networks was generated using physics-based graphics models, and included defects such as cracks and corrosion, while accommodating variations in lighting (Fig. 25).

《Fig. 25》

Fig. 25. Deep learning-based change detection [232].

Research is currently under way to transfer deep learning models that have been trained successfully on synthetic data so that they also work with real data.

《6.3. Leveraging video sequences for human-like recognition of structural components》

6.3. Leveraging video sequences for human-like recognition of structural components

Human inspectors first survey an entire structure, and then move closer to make detailed assessments of damaged structural components. When performing a detailed inspection, they keep in mind how the damaged component fits into the global structural context; this is essential in order to understand the relevance of the damage to structural safety. However, as discussed in Section 5.3, computer vision strategies for damage detection usually operate on a frame-by-frame basis—that is, independently using single images; close-up images do not contain the necessary information about global structural context. For the human inspector, the viewing history (i.e., sequences of dependent images such as in videos) provides this context. This section discusses the incorporation of the viewing history embedded in video sequences in order to achieve more accurate structural component recognition throughout the inspection process.

Narazaki et al. [233] applied recurrent neural networks (RNNs) to implement bridge-component recognition using video data, which incorporates views of the global structure and close-up details of structural component surfaces. The network architecture used in the study is illustrated in Fig. 26. First, a deep single-image-based FCN is applied to extract a map of label predictions. Next, three smaller RNN layers are added after the lowest resolution prediction layer. Finally, the output from the RNN layers and other skipped layers with higher resolution are combined to generate the final estimated label maps. The RNN units are inserted only after the lowest resolution prediction layer, because the RNN units in the study are used to memorize where the video is focused, rather than to improve the level of detail of the estimated map.
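
The following PyTorch sketch conveys the overall idea of such an architecture: a small convolutional encoder, a ConvLSTM cell at the lowest-resolution feature map to carry memory across frames, and an upsampling head that produces per-frame label maps. It is a schematic stand-in rather than the architecture of Ref. [233]; the layer sizes and class count are assumptions.

```python
# Skeletal recurrent FCN: encoder -> ConvLSTM bottleneck -> upsampling head.
# A conceptual illustration only, not the network of Ref. [233].
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: one convolution produces all four gates."""
    def __init__(self, ch):
        super().__init__()
        self.gates = nn.Conv2d(2 * ch, 4 * ch, kernel_size=3, padding=1)

    def forward(self, x, state):
        h, c = state
        i, f, g, o = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class RecurrentFCN(nn.Module):
    def __init__(self, n_classes=5, ch=32):
        super().__init__()
        self.ch = ch
        self.encoder = nn.Sequential(                 # downsample by 4x
            nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU())
        self.lstm = ConvLSTMCell(ch)                  # memory across frames
        self.head = nn.Sequential(                    # back to input size
            nn.Upsample(scale_factor=4, mode="bilinear",
                        align_corners=False),
            nn.Conv2d(ch, n_classes, kernel_size=1))

    def forward(self, video):                         # video: (T, B, 3, H, W)
        T, B, _, H, W = video.shape
        h = c = video.new_zeros(B, self.ch, H // 4, W // 4)
        outputs = []
        for frame in video:                           # recur over time steps
            h, c = self.lstm(self.encoder(frame), (h, c))
            outputs.append(self.head(h))              # per-frame logits
        return torch.stack(outputs)                   # (T, B, classes, H, W)

# Example: 8 frames of 240 x 320 video, batch size 1.
logits = RecurrentFCN()(torch.randn(8, 1, 3, 240, 320))
```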

《Fig. 26》

Fig. 26. Illustration of the network architecture used in Ref. [233].

Two types of RNN units were tested in the study [233]: simple RNN units and convolutional long short-term memory (ConvLSTM) [234] units. For the simple RNN units, the input to each unit was augmented by the output of the unit at the previous time step, and a convolution with a ReLU activation function was applied. Alternatively, ConvLSTM units were inserted into the RNN portion of the architecture to model long-term patterns effectively.

One of the major challenges of training and testing RNNs for video processing is collecting video data and the corresponding ground-truth labels. In particular, manually labeling every frame of video data is impractical. Following the discussion in Section 6.2 on the benefits of synthetic data, the real-time rendering capabilities of the game engine Unity3D [235] were used in the study [233] to address this challenge. A video dataset was created from a simulation of a UAV navigating around a concrete-girder bridge. The steps used to create the dataset were similar to those used to create the SYNTHIA dataset [229]; however, for this dataset, the simulated camera navigates randomly in 3D space with a larger variety of headings, pitches, and altitudes. The resolution of the video was set to 240 × 320 pixels, and 37 081 training images and 2000 test images were automatically generated along with the corresponding ground-truth labels. Example frames of the video are shown in Fig. 27. The depth map was also retrieved, although those data were not used in the study.
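
The randomized navigation described above can be mimicked with a simple pose-sampling routine such as the sketch below; the sampling ranges are illustrative assumptions, and the actual dataset was rendered in Unity3D [235].

```python
# Toy sketch of randomized camera-pose sampling around a bridge model
# (illustrative ranges only; not the sampling scheme of Ref. [233]).
import numpy as np

rng = np.random.default_rng(0)

def sample_poses(n, radius=(10.0, 60.0), altitude=(2.0, 30.0)):
    theta = rng.uniform(0, 2 * np.pi, n)      # position angle around bridge
    r = rng.uniform(*radius, n)               # horizontal distance (m)
    return {
        "x": r * np.cos(theta),
        "y": r * np.sin(theta),
        "z": rng.uniform(*altitude, n),       # flight altitude (m)
        "heading": rng.uniform(-180, 180, n), # yaw (degrees)
        "pitch": rng.uniform(-45, 15, n),     # look up/down (degrees)
    }

poses = sample_poses(37081)   # one pose per rendered training frame
```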

《Fig. 27》

Fig. 27. Example frames of the new video dataset with ground-truth labels and ground-truth depth maps [233].

The example results in Fig. 28 show the effectiveness of the recurrent units when the FCN fails to recognize the bridge components correctly. These results indicate that ConvLSTM units combined with a pre-trained FCN are an effective approach for performing automated bridge-component recognition, even when visual cues of the global structure are temporarily unavailable. The total pixel-wise accuracy for the single-image-based FCN was 65.0%. In contrast, the total pixel-wise accuracies for the simple RNN and ConvLSTM units were 74.9% and 80.5%, respectively. Other details of the dataset, training, and testing are provided in Ref. [233].

《Fig. 28》

Fig. 28. Example results. (a) Input images; (b) FCN; (c) FCN–simple RNN; (d) FCN–ConvLSTM [233].

At present, this research is being leveraged to develop rapid post-earthquake inspection strategies for transportation infrastructure.

《7. Summary and conclusions》

7. Summary and conclusions

This paper presented an overview of recent advances in computer vision-based civil infrastructure inspection and monitoring. Manual visual inspection is currently the main means of assessing the condition of civil infrastructure. Computer vision for civil infrastructure inspection and monitoring is a natural step forward and can be readily adopted to aid, and eventually replace, manual visual inspection, while offering new advantages and opportunities. However, the use of image data can be a double-edged sword: although rich spatial, textural, and contextual information is available in each image, the process of extracting actionable information from these images is challenging. The research community has been successful in demonstrating the feasibility of vision algorithms, ranging from deep learning to optical flow. The inspection applications discussed in this article were classified into three categories: characterizing local and global visible damage, detecting changes from a reference image, and structural component recognition. Recent progress in automated inspections has stemmed from the replacement of heuristic-based methods with data-driven detection, where deep models are built by training on large sets of data. The monitoring applications discussed spanned both static and dynamic applications. The application of full-field measurement techniques and the extension of laboratory techniques to full-scale infrastructure have provided impetus for further growth.

This paper also presented key challenges that the research community faces toward the realization of automated vision-based inspection and monitoring. These challenges broadly lie in converting the features and signals extracted by vision-based methods into actionable data that can aid decision-making at a higher level.

Finally, three areas of ongoing research aimed at enabling automated inspections were presented: the generation of condition-aware models, the development of synthetic data through graphics models, and methods for the assimilation of data from video. The rapid advances in computer vision-based inspection and monitoring of civil infrastructure described in this paper will enable time-efficient, cost-effective, and eventually automated civil infrastructure inspection and monitoring, heralding a coming revolution in the way that infrastructure is maintained and managed, and ultimately leading to safer and more resilient cities throughout the world.

《Acknowledgements》

Acknowledgements

This research was supported in part by funding from the US Army Corps of Engineers under a project entitled “Cybermodeling: A Digital Surrogate Approach for Optimal Risk-Based Operations and Infrastructure” (W912HZ-17-2-0024).

《Compliance with ethics guidelines》

Compliance with ethics guidelines

Billie F. Spencer Jr., Vedhus Hoskere, and Yasutaka Narazaki declare that they have no conflict of interest or financial conflicts to disclose.