VoxelMorph++: a convolutional neural network architecture for unsupervised CBCT to CT deformable image registration

We use an unsupervised method based on the VoxelMorph architecture for cone-beam computed tomography (CBCT) to CT deformable image registration (DIR), and propose VoxelMorph++, a new architecture for predicting the deformation vector field (DVF). The proposed architecture (1) overcomes the limitation that the optimal depth of the encoder-decoder is unknown by forming a nested structure in which each feature depth in the encoder path has a decoder of corresponding depth; and (2) fuses features of varying semantic scales more flexibly through redesigned skip connections. In the testing phase, we used the ITK-SNAP software to semi-automatically segment the patients' lung regions as labels, avoiding expensive manual labelling. We evaluated both architectures using lung-region registration results from the CBCT and CT images of 10 patients. After registration, the mean Dice score improved from 0.8556 to 0.9412 and 0.9430 for VoxelMorph and the proposed architecture, respectively. The results show that both architectures perform well on our dataset and that the proposed architecture outperforms VoxelMorph in registration accuracy.


Introduction
CBCT and CT deformable image registration (DIR) is one of the key technologies for image-guided radiotherapy (IGRT) [1][2][3]. However, there are two challenges in practical applications. First, it is difficult to obtain the ground-truth transformation. Second, CBCT images contain large artifacts. To overcome these two challenges, we chose an unsupervised MRI brain DIR method proposed by Balakrishnan et al. [4] to achieve CBCT to CT DIR. This method predicts the deformation vector field (DVF) through a CNN architecture named 'VoxelMorph' and generates warped images with the help of the spatial transformer network (STN) [5]. The overall registration framework is shown in figure 1. The artifacts in CBCT can be regarded as a kind of noise, and the encoder-decoder architecture of VoxelMorph is robust to noise. In addition, this method selects local cross-correlation (CC) as the main part of the loss function; the robustness of local CC to intensity variations [6] further reduces the interference of artifacts.
However, for CBCT to CT DIR, the VoxelMorph architecture has two limitations. On the one hand, the depth of the current network architecture is not necessarily optimal. On the other hand, the U-Net-like [7] structure of VoxelMorph aggregates only the upsampled output of the preceding node in the decoder path with the feature map of the same resolution in the encoder path; this restrictive pairwise fusion scheme may likewise be suboptimal. Therefore, drawing on UNet++ [8], we propose a new architecture, VoxelMorph++, as shown in figure 2. The proposed architecture consists of VoxelMorph architectures of varying depths that share the same encoder, making it convenient for the network to learn and adjust the weights of features of varying depths in the encoder path according to the task. By redesigning the architecture's long and short skip connections, the proposed architecture can integrate features of varying semantic scales more flexibly, while gradients can be back-propagated from the deeper decoders to the shallower decoders during training.
In the testing phase, to avoid the high time and labor cost of making labels manually, we use the 3D medical image segmentation function of the ITK-SNAP [9] software to semi-automatically generate lung-region masks for all images in the test set, which serve as the labels used to calculate the Dice score. We evaluate VoxelMorph and VoxelMorph++ on our own dataset. The experimental results show that both architectures achieve high registration performance even under the interference of artifacts, and that the performance of our proposed architecture is slightly better than that of VoxelMorph.

Overview of the registration framework
Similar to the framework in [4], our unsupervised CBCT to CT deformable image registration framework is shown in figure 1. We take the input CBCT image as the moving image M and the CT image as the fixed image F; each voxel p of M and F is defined in the 3D spatial domain Ω ⊂ ℝ³. We concatenate M and F into a 2-channel 3D image and feed it to the registration network, a convolutional neural network (CNN) that models the function g_θ(F, M) = φ, where θ are the learnable parameters of g and φ is the registration field output by the network. φ is used to deform the moving image M through the spatial transformation function, yielding the warped image M(φ). In the training phase, the network finds the optimal parameters θ̂ by minimizing the loss function L.
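As a minimal numpy sketch of how the 2-channel network input is formed (the helper `make_network_input` and the random volumes are illustrative stand-ins for a CBCT/CT patch pair, not the actual pipeline code):

```python
import numpy as np

def make_network_input(moving, fixed):
    """Concatenate the moving (CBCT) and fixed (CT) volumes into the
    2-channel 3D image that is fed to the registration network."""
    assert moving.shape == fixed.shape  # both defined on the same grid
    return np.stack([moving, fixed], axis=0)

# Toy volumes standing in for a CBCT/CT patch pair.
M = np.random.rand(160, 128, 64).astype(np.float32)
F = np.random.rand(160, 128, 64).astype(np.float32)
x = make_network_input(M, F)
print(x.shape)  # (2, 160, 128, 64)
```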
All figures depicting anatomy in this paper show 2D transverse slices for visualization purposes only; all registration is performed in 3D.

Proposed registration network architecture: VoxelMorph++
Based on VoxelMorph-2, the variant of VoxelMorph that favors registration accuracy, and inspired by UNet++, we propose a new registration network architecture, VoxelMorph++, as shown in figure 2. In this architecture, two separate convolutional layers (followed by Leaky ReLU activations) in the decoder stage are removed to save GPU memory. The parameter settings of each layer in the proposed network are the same as in VoxelMorph-2. All convolutional layers are 3D convolutions with a kernel size of 3 × 3 × 3, and each Leaky ReLU activation layer has a negative slope of 0.2.
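The shape-preserving behavior of these 'same'-padded 3 × 3 × 3 convolutions, together with the Leaky ReLU activation, can be illustrated with a small numpy/scipy stand-in (the random kernel is purely illustrative; the real layers are learned):

```python
import numpy as np
from scipy.ndimage import convolve

def leaky_relu(x, negative_slope=0.2):
    # Leaky ReLU with the paper's negative slope of 0.2
    return np.where(x > 0, x, negative_slope * x)

# One 3x3x3 kernel applied with zero padding keeps the spatial size,
# mirroring the 'same'-padded 3D convolutions used throughout the network.
vol = np.random.randn(16, 16, 8)
kernel = np.random.randn(3, 3, 3) / 27.0
out = leaky_relu(convolve(vol, kernel, mode='constant', cval=0.0))
print(out.shape)  # (16, 16, 8)
```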

Fig. 1.
Overview of the registration framework for CBCT to CT deformable image registration.

Fig. 2.
Proposed architecture VoxelMorph++. Each circle represents a 3D volume. The number of channels is shown inside the circle. All 3D volumes in the same horizontal direction have the same spatial resolution, which is printed at the beginning of each row with respect to the input volume.
The input of the network is a 2-channel 3D image obtained by concatenating M and F. In our experiments, the input is concatenated from a pair of patches taken at corresponding positions in M and F. Each patch has a size of 160 × 128 × z, where z ∈ {64, 128, 160} is the z-axis size of M and F, so the multi-channel input has size 160 × 128 × z × 2. The multi-channel input first goes through a convolutional layer, followed by Leaky ReLU activation, and then through a down-sampling block consisting of a 2 × 2 × 2 max pooling layer with a stride of 2 and a convolutional layer followed by Leaky ReLU activation.
In the down-sampling block, the size of the input is first halved by the max pooling layer, and then the convolutional layer followed by Leaky ReLU activation extracts deeper features. We apply the down-sampling block 4 times in succession in the encoder stage, reducing the input to 1/16 of its original size along each axis and extracting features of varying depths. Unlike VoxelMorph-2, which only upsamples the deepest features in the encoder, we attach a decoder of corresponding depth to each feature depth extracted in the encoder, and restore these features to the original image size by upsampling. In this way, the proposed network architecture embeds VoxelMorph-2 architectures of varying depths that share the same encoder. Each decoder alternates between upsampling, concatenation of skip connections, and convolutions (followed by Leaky ReLU activations), and redesigned skip connections concatenate the feature maps of the encoder and the different decoders at the same resolution. This design not only enables the shallower decoders to update their parameters via back-propagation, but also provides additional features of varying semantic scales to the aggregation layers. The output of the final aggregation layer goes through an extra convolutional layer without Leaky ReLU activation to obtain φ, which in our experiments has size 160 × 128 × z × 3. In the proposed architecture, deep supervision [10] is not required and is not used.
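The effect of the four 2 × 2 × 2 max pooling steps on the patch size can be checked with a small numpy sketch (a plain reshape-based pooling, not the actual network code):

```python
import numpy as np

def max_pool_2x2x2(x):
    """2x2x2 max pooling with stride 2 (each dimension must be even)."""
    d, h, w = x.shape
    return x.reshape(d // 2, 2, h // 2, 2, w // 2, 2).max(axis=(1, 3, 5))

x = np.random.rand(160, 128, 64)
for _ in range(4):  # the four down-sampling blocks of the encoder
    x = max_pool_2x2x2(x)
print(x.shape)  # (10, 8, 4): 1/16 of the input along each axis
```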

Spatial transformation function
Following [4], the spatial transformation function computes the warped image M(φ) by resampling M at the displaced voxel locations using trilinear interpolation. Because these operations are differentiable almost everywhere, we can use back-propagation to train the registration network.
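A numpy stand-in for this warping step, using trilinear (order-1) resampling at the displaced voxel locations (`warp` and the DVF layout here are illustrative, not the actual STN implementation):

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp(moving, dvf):
    """Resample `moving` at the displaced locations p + dvf(p) using
    trilinear (order-1) interpolation; `dvf` has shape (3, D, H, W)."""
    grid = np.indices(moving.shape, dtype=np.float64)  # identity grid
    return map_coordinates(moving, grid + dvf, order=1, mode='nearest')

M = np.random.rand(8, 8, 8)
phi = np.zeros((3, 8, 8, 8))         # zero displacement field
assert np.allclose(warp(M, phi), M)  # identity warp recovers M
```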

Loss function
Same as the method in [4], the loss function L of our method consists of two parts: L_sim, which penalizes appearance differences, and L_smooth, which penalizes local spatial variation in φ. The total loss is

L(F, M, φ) = L_sim(F, M(φ)) + λ L_smooth(φ),  (1)

where λ is the regularization parameter.
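The two loss terms can be sketched in numpy as follows; the box-filter window size (9) and λ = 1 are illustrative assumptions, not the settings used in our experiments:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_cc(a, b, win=9):
    """Windowed (local) cross-correlation computed with box filters."""
    f = lambda x: uniform_filter(x, size=win, mode='constant')
    am, bm = f(a), f(b)
    cross = f(a * b) - am * bm  # local covariance
    va = f(a * a) - am * am     # local variance of a
    vb = f(b * b) - bm * bm     # local variance of b
    return np.mean(cross * cross / (va * vb + 1e-5))

def smoothness(dvf):
    """Gradient penalty: sum of squared forward differences of the
    displacement field u (shape (3, D, H, W))."""
    return sum(np.sum(np.diff(dvf, axis=ax) ** 2) for ax in (1, 2, 3))

def loss(fixed, warped, dvf, lam=1.0):
    # lam is a placeholder; the paper's lambda value is not reproduced here
    return -local_cc(fixed, warped) + lam * smoothness(dvf)

F_img = np.random.rand(12, 12, 12)
phi = np.zeros((3, 12, 12, 12))
print(loss(F_img, F_img, phi))  # close to -1: perfect match, smooth field
```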
Where  ( ) The smoothness of φ is enforced with smooth L . Finally, we write the complete loss as: 3 Experiments

Data acquisition and preprocessing
The entire dataset was randomly selected from the clinical database and includes whole-body planning CT (pCT) and CBCT scans from 34 patients scanned on different days. The pCT scans were acquired at 120 kVp with an in-plane resolution ranging from 1.01 mm to 1.33 mm and a slice thickness of 3.00 mm; the corresponding parameters of the CBCT scans were 110 kVp, 0.51 mm to 0.91 mm, and 1.98 mm to 2.00 mm, respectively. Among these patients, 24 were used for the training set and 10 for the testing set. For training and testing, we resampled all CBCT and pCT scans to a 1.0 mm × 1.0 mm × 1.0 mm grid to achieve consistent spatial resolution, and clipped their HU values to [-1000, 2000]. Then, we rigidly aligned all pCT and CBCT scans using AIRLab [11]. For each patient, we extracted a pair of CBCT and pCT volumes with corresponding lung positions from the scans; that is, there are 34 CBCT-CT image pairs in the total dataset. Finally, all CBCT and pCT volumes were cropped or padded to 416 × 320 × 160 voxels, except for two patients in the test set, whose volumes were 416 × 320 × 64 and 416 × 320 × 128, respectively.
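A rough numpy/scipy sketch of this resampling-and-clipping step (`preprocess` is a hypothetical helper; in practice the resampling would use the scanner geometry, e.g. via a medical imaging toolkit):

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess(volume, spacing, target=(1.0, 1.0, 1.0),
               hu_range=(-1000.0, 2000.0)):
    """Resample to an isotropic 1 mm grid (trilinear) and clip HU values."""
    factors = [s / t for s, t in zip(spacing, target)]
    resampled = zoom(volume, factors, order=1)
    return np.clip(resampled, *hu_range)

# Toy volume with 2 mm slice spacing along z and out-of-range intensities.
vol = np.random.uniform(-2000.0, 4000.0, size=(20, 20, 10))
out = preprocess(vol, spacing=(1.0, 1.0, 2.0))
print(out.shape)  # (20, 20, 20)
```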

Label making based on the ITK-SNAP software and the Dice score
Because the data used in this paper are clinical data, we evaluate the registration accuracy of the network architectures on lung regions using the Dice score rather than landmarks. To achieve this, labels of the lung regions are required for all CBCT-CT image pairs in the test set, but making them manually is arduous and time-consuming. To overcome this challenge, we use the 3D medical image segmentation function of the ITK-SNAP [9] software to semi-automatically segment the lung regions of these images to make the labels; the labels made in this way are shown in the figure. In our experiments, the Dice score quantifies the degree of spatial overlap of the lung regions between M(φ) and F, with a value between 0 (no overlap) and 1 (complete overlap).
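The Dice score we report can be computed with a straightforward numpy implementation:

```python
import numpy as np

def dice(mask_a, mask_b):
    """Dice score 2|A∩B| / (|A| + |B|) between two binary masks."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # convention: two empty masks overlap completely
    return 2.0 * np.logical_and(a, b).sum() / denom

a = np.zeros((4, 4, 4), dtype=bool); a[:2] = True   # 32 voxels
b = np.zeros((4, 4, 4), dtype=bool); b[1:3] = True  # 32 voxels, 16 shared
print(dice(a, a), dice(a, b))  # 1.0 0.5
```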

Implementation
We implemented the proposed architecture in the PyTorch deep learning framework. To demonstrate the effectiveness of the proposed VoxelMorph++ relative to VoxelMorph-2, the same parameter settings were used for both architectures during training and testing. We used the Adam optimizer [12] with a learning rate of 1 × 10⁻⁴.
The value of the hyperparameter λ in equation (2) was kept the same for both architectures. Figure 3 shows the visual registration results of VoxelMorph-2 and the proposed architecture VoxelMorph++ on our dataset. As seen, the registration of the lung regions between M and F is significantly improved by both architectures. We further quantify the registration results with Dice scores. For the 10 CBCT-CT image pairs corresponding to the 10 patients in the test set, the individual and average Dice scores before and after registration are shown in table 1. The proposed architecture achieves higher registration accuracy than VoxelMorph-2, and both architectures improve significantly on rigid alignment. In terms of individual Dice scores, the proposed architecture outperforms VoxelMorph-2 in 7 of the 10 CBCT-CT image pairs and is comparable to VoxelMorph-2 in the remaining 3. Moreover, VoxelMorph++ performs slightly better than VoxelMorph-2 in terms of the average Dice score.

Conclusion
In this study, we used an unsupervised method based on the VoxelMorph-2 architecture for CBCT to CT deformable image registration, aiming to overcome two challenges: ground-truth acquisition and large artifacts. Inspired by UNet++, we proposed an improved architecture, named VoxelMorph++, for more accurate image registration. Through its nested structure and redesigned skip connections, the proposed architecture addresses two limitations of VoxelMorph-2: the unknown optimal depth of the architecture and its restrictive pairwise feature fusion strategy. In the testing phase, we used the ITK-SNAP software to semi-automatically segment the lung regions of the input images as labels, avoiding the expensive manpower and time cost of making labels manually. The experimental results demonstrate that both architectures show good registration performance on our dataset and that the proposed architecture is slightly better than VoxelMorph-2 in terms of registration accuracy.