Noise-induced self-supervised hybrid UNet transformer for ischemic stroke segmentation with limited data annotations
Scientific Reports volume 15, Article number: 19783 (2025)
We extend the Hybrid UNet Transformer (HUT) foundation model, which combines the advantages of CNN and Transformer architectures, with a noisy self-supervised approach and demonstrate it on an ischemic stroke lesion segmentation task. We introduce a self-supervised approach using a noise anchor and show that it can outperform a supervised approach when only a limited amount of annotated data is available. We supplement our pre-training process with an additional unannotated CT perfusion dataset to validate our approach. Compared to the supervised version, the noisy self-supervised HUT (HUT-NSS) outperforms its counterpart by a margin of 2.4% in Dice score. HUT-NSS, on average, gained a further 7.2% in Dice score and 28.1% in Hausdorff distance over the state-of-the-art network USSLNet on the CT perfusion scans of the Ischemic Stroke Lesion Segmentation (ISLES2018) dataset. With limited annotated data, HUT-NSS gained 7.87% in Dice score over USSLNet when 50% of the annotated data was used for training, 7.47% when 10% was used, and 5.34% when 1% was used. The code is available at https://github.com/vicsohntu/HUTNSS_CT.
Stroke is a predominant cause of acquired disability and mortality worldwide1. The main factor leading to ischemic stroke is the diminution of blood flow in brain tissues, which prevents the necessary transport of glucose and oxygen to brain cells. Two significant causes of decreased blood supply are thrombotic and embolic conditions2. A thrombotic condition typically arises from the narrowing of blood vessels, for instance through the accumulation of fatty deposits. An embolic condition, in contrast, occurs when blood flow is obstructed by a clot lodged in the vessel pathway. Early intervention is critical to halt the progression of ischemic stroke from penumbra to cerebral infarction, the process of brain cells dying from a deficiency of oxygen. Modern imaging technology provides radiologists with a tool to readily identify regions of anomalies and allows clinicians to administer appropriate treatments and therapies. Timely and accurate remedies enable the reversal of penumbra or impede further infarction. Among the various imaging technologies, magnetic resonance imaging (MRI) and computerised tomography (CT) are commonly used for diagnosis, and both can determine the progression of a stroke. MRI provides good resolution and clarity for soft tissues such as grey and white matter. CT imaging is inexpensive and fast to acquire and is widely available in healthcare facilities3. In contrast, MRI is not as readily available and takes more time to acquire, which can compromise the provision of an accurate prognosis.
CT imaging can be non-contrast or contrast. Non-contrast CT imaging is usually limited in sensitivity for acute ischemic stroke, so it may not yield a precise diagnosis. Contrast CT imaging through perfusion (CTP) is commonly used to detect acute cerebral ischemia. Radiologists typically obtain CT angiography (CTA) in tandem with CTP imaging, which enables visualisation of possible blood vessel occlusion. Several perfusion parameter maps identify the regions of ischemic lesions, namely cerebral blood volume (CBV), cerebral blood flow (CBF), mean transit time (MTT), and time to peak (TTP). Regions with prolonged MTT but preserved CBV indicate an ischemic penumbra. In contrast, an irreversible infarct core can be deduced if a matching anomaly exists in both CBV and MTT4.
Deep learning techniques have hastened the pace of disease classification and lesion segmentation in recent years. Supervised deep learning methods have gained considerable performance over traditional machine learning methods such as regression, support vector machines (SVM)5, and decision tree models. However, supervised deep learning requires large amounts of annotated data that are not always easy to acquire: experts with domain knowledge must carefully curate and label the dataset, especially in the medical domain. Efficient methods must therefore be developed to cope with the limited annotated data available in medical imaging. Unsupervised learning alleviates this limitation and can be introduced as a precursor to supervised learning when the amount of ground truth is limited.
Unsupervised learning uses learning strategies that do not depend on explicit ground truth. Earlier developments in unsupervised learning involved generating synthetic data to address data scarcity. Generative models such as variational autoencoders6, generative adversarial networks (GAN)7,8,9, and diffusion models10 learn the distribution of the data and produce a vast amount of similar-looking images. This form of data augmentation11 gives the network access to a larger pool of data, enabling better learning of downstream tasks.
Self-supervised learning is a form of unsupervised learning in which the model develops its own labels from the input data based on certain assumptions. It derives useful representations without relying on manually annotated labels. One of the earlier developments in self-supervised learning is context prediction learning. Context prediction attempts to learn meaningful representations by accomplishing hand-crafted tasks, such as solving jigsaw puzzles12 or predicting the rotation of an image13. It uses pretext information to generate reference supervisory signals that the learning model's output must match. These tasks occur during the pre-training phase, where no annotated label is available.
Another widely used self-supervised method, contrastive learning, involves learning by comparing the output features of different views of the same input image, i.e., augmented versions of the original image. An early example is SimCLR14, which has a simple framework but typically relies on a large batch size because it requires negative samples to prevent learning collapse. The Momentum Contrast (MoCo)15 method overcame this deficit by maintaining negative samples in a momentum-encoded queue. Other instance discrimination methods, such as DINO16 and BYOL17, do not require negative samples during pre-training; instead, they distil the model without labels at every training epoch, averaging the model weights between a teacher and student network with a momentum parameter. Zbontar et al.18 reduce redundancy in the covariance matrix of feature vectors during training by minimising the cross-correlation between feature embeddings from different views. Similarly, Bardes et al.19 use the Barlow twins approach with additional regularisation of the variance, invariance, and covariance of embeddings. Self-prediction is another self-supervised technique that allows the model to estimate the outcome when particular parts of the input data are missing; masked autoencoders20,21,22 are a prominent example. Masking is similar to an injection of noise through Bernoulli-dropping of pixels or patches. Injecting noise during training helps regularise the network, thereby increasing robustness and preventing overfitting. The encoder is typically used for classification and other downstream tasks like segmentation, and an encoder that learns rich, expressive, and compressed representations of the input data generally performs better on such tasks.
Self-supervised learning has been naturally adopted in the medical domain due to its merits. Wang et al.23 introduced a deep-learning framework called Annotation-Efficient Deep Learning (AIDE), which generated additional pseudo-labels for existing samples with only minor loss. This mechanism allowed the network to train with less annotated data, improving the efficiency of medical image segmentation. Azizi et al.24 discussed a unified large-scale self-supervised framework called Robust and Efficient MEDical Imaging with Self-supervision (REMEDIS), which combined supervised transfer learning on large image datasets with additional self-supervised learning. The method was data-efficient in generalising to other medical imaging tasks, such as classification and segmentation with site-specific data. Taleb et al.25 introduced a novel approach that used multiple imaging modalities and a multimodal puzzle task for self-supervised learning, with the puzzle task assisting representation learning across image modalities. The approach showed that cross-learning of representations improved segmentation performance compared to training on individual modalities, and when a smaller annotated dataset was utilised, the method improved learning between single and multiple modalities.
Self-supervised pre-training methods have recently gained much attention in medical imaging due to their ability to learn meaningful structure from a larger amount of unlabelled data. For instance, Hao et al.26 improved the classification accuracy of pneumonia detection from chest X-ray images by pre-training the model on a large corpus of unlabelled chest X-rays. A self-supervised method such as contrastive learning was used in the pre-training, and the model was thereafter fine-tuned on a smaller, labelled dataset specific to COVID-19 detection. In another recent work, Felfeliyan et al.27 proposed a self-supervised learning approach called Self-Supervised Mask R-CNN (SS-MRCNN) on the Osteoarthritis Initiative (OAI) dataset. It similarly attempted to learn meaningful representations by pre-training the model with random rotation distortions, masking, and downsampling applied to patches of knee MRI images, a subset of the OAI dataset; the model learned to localise and recover the distortions. The pre-trained network was then fine-tuned on a labelled dataset containing effusion segmentations and greatly gained segmentation performance with minimal labelled data. In another study, Zhou et al.28 proposed Preservational Contrastive Representation Learning v2 (PCRLv2), a self-supervised learning framework that unified local detail and multi-scale context using multi-scale pixel restoration, contrastive Siamese network learning, and a modified UNet without any skip connections. It performed well on tasks like brain tumour segmentation and disease detection in low labelled-data settings.
Self-supervised learning in imaging using convolutional networks has proven effective, approaching the performance of supervised learning12,14,15,17,29,30,31,32. In the study by Ma et al.33, the Vision Transformer (ViT) was shown to perform better than its CNN counterpart on various benchmarks. Some variants of the ViT34,35,36,37,38 also showed remarkable results on medical classification and segmentation tasks. Tang et al.39 extended the Swin Transformer framework with self-supervised learning and showed a slight gain in performance on a medical imaging segmentation task. In other studies40,41, a hybrid UNet and ViT architecture performed well in ischemic stroke lesion and brain tumour segmentation.
To address the issue of limited annotated data, we establish a three-fold approach that benefits from self-supervision. First, we extend the HUT architecture from previous work40 with a novel self-supervised framework with a noise anchor to learn features from the unlabelled dataset and fine-tune it on a smaller labelled dataset. Second, we align the distribution of another dataset to the target dataset to ensure the effectiveness of self-supervised training. Third, we introduce a dynamic weighting mechanism to tune the weight between the dice loss and cross-entropy loss to achieve the best outcome when tuning the downstream task. The main contributions of this paper are as follows:
We introduce a novel approach to self-supervised learning via a noise anchor that regularises the pre-training.
We implement a domain alignment method to match additional unannotated datasets with the target labelled dataset.
We introduce a dynamic weighting method that adjusts the weighting components of the loss functions.
The Hybrid UNet Transformer (HUT) is a supervised learning system introduced by Soh et al.40. As shown in Fig. 1, the overall architecture consists of a parallel UNet stage (UNS) and Vision Transformer stage (VTS). It encapsulates both the inductive bias of image identification from the CNN features and the Transformer's ability to capture global correlations between image patches. A cross-resolution transformer within the VTS generates two different resolutions, which are then combined with the skip connections in the UNS. The authors discovered that using two transformers, one for small patches and another for larger patches, followed by the cross-transformer, helps improve performance with additional training on the CLS tokens representing the classification vectors, as illustrated in Fig. 2. HUT overcomes the ViT's need for extensive training data by introducing a hybrid architecture of Transformer and UNet. Different scales of the attention maps from the ViT are combined with the earlier stages of the UNet decoder. This arrangement achieves notable improvements on various brain segmentation tasks40,41. Nevertheless, challenges remain in handling smaller datasets. This work adopts a novel self-supervised training method to improve the performance of the original HUT framework.
The simplified architecture of the original HUT.
Injection of noise into the input is widely used as a training regularisation in machine learning42. In supervised learning, techniques like weight dropout, input masking, and injection of Gaussian noise are commonly deployed to mitigate overfitting. Within medical imaging specifically, similar approaches have been applied to enhance robustness against data variation, artifacts, and intrinsic noise in clinical images.
In the context of self-supervised learning, noise-based methods43,44,45 have been shown to reduce the amount of image data required for image denoising. Mansour et al.44 trained a denoising model without any clean or noisy image data, utilising only random noise vectors. Pang et al.45 showed another self-supervised approach to image denoising that synthetically corrupts versions of the input data with noise and does not rely on any clean data. Noise2Info46 accurately estimates the noise profile and improves denoising performance. Pfaff et al.47 introduced a self-supervised denoising approach that recovers MRI scans corrupted with thermal noise without clean target data, using Stein's unbiased risk estimator (SURE) to quantify the noise level. InfoNCE48 is a form of noise-contrastive estimation deployed in self-supervised methods to enhance feature learning and improve image classification. It works as a loss function that improves models by contrasting positive pairs (the real data) against negative pairs (unrelated data inputs or noise). Our method is radically different from InfoNCE: we assume an energy model that treats Gaussian noise as spectrally constant, which results in a probability function similar to a supervised version.
Self-supervised learning is susceptible to mode collapse, where it fails to learn anything expressive and representative from the training data. One contributing factor is a dataset with too few examples, which may lead to overfitting and prevent the model from learning meaningful structures. One way to compensate for this deficiency is to train with both negative and positive samples. The model architecture also plays a role in effective learning: some models may exploit shortcuts and produce trivial representations, mapping various inputs to one common feature rather than producing meaningful features. Self-supervised tasks are also limited by their pretext tasks, whose selection is paramount to downstream performance. For classification, we are chiefly concerned with a clear partition of distinguishable features. For segmentation, pretext tasks like reconstructing masked patches or inpainting allow the network to learn contextual prediction beneficial to its downstream performance.
The primary motivation for employing noise regularization in training is to enhance the model's robustness, which is especially necessary when working with a limited annotated dataset. Our proposed approach is governed by two propositions, supported by definitions. With the first proposition, we condition and regularise the model to recognise Gaussian noise and its properties. With the second proposition, we let the model focus on the underlying structures and disregard the noise, since it can identify the nature of a perturbation. These steps lead to reliable and robust learning.
The model is thus compelled not to rely too heavily on the training data, which would cause it to fail to generalise and adapt to unseen inputs that can vary significantly in medical imaging applications. In self-supervised learning, where models are trained without labelled data, noise regularization is particularly beneficial. It enhances the model's ability to extract meaningful feature representations from unlabelled data, especially when dealing with smaller medical imaging datasets of high variability, where the risk of overfitting is higher.
The noise-induced self-supervised (NSS) learning method, on the other hand, does not rely on pretext tasks. It uses a noise anchor mechanism to ensure uniform probability outputs for noisy inputs and shows robustness against training collapse observed in other approaches, even when trained on limited data. As an extension to HUT, we train the model without labels to incorporate a broader, readily available, unlabelled dataset. Integrating NSS into the HUT network reduces overfitting and improves the model’s generalisation ability across limited labelled datasets. In the subsequent sections, we formalise the mathematical foundations of the NSS framework and substantiate its merits with experiments and results.
We present a novel approach by introducing an energy model for noise regularization in self-supervised learning. Firstly, Definition 1.1 establishes the probability density function using an energy function normalised by a partition function, which is foundational in energy-based models and determines the likelihood of input data points. Secondly, Definition 1.2 represents the distribution of signal energy across frequencies as an energy spectral density, computed as an integral or sum of squared Fourier transform magnitudes and essential for analysing signal properties in the frequency domain. Lastly, Definition 1.3 relates autocorrelation to the energy spectral density via the Wiener-Khinchin theorem49, allowing the analysis of signal statistics and structure.
A common noise source is Gaussian noise, a signal that follows a normal probability distribution with mean \(\mu\) and standard deviation \(\sigma\). A signal with statistically independent and normally distributed components forms white noise, whose energy intensity is flat across the frequency domain.
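As a quick numerical illustration (our own sketch, not part of the original derivation), the flat spectrum of white Gaussian noise can be checked in a few lines of Python by averaging squared FFT magnitudes over many independent noise realisations:

```python
import numpy as np

# Average the squared FFT magnitudes of many white-noise realisations;
# the result approximates the energy spectral density, which should be flat.
rng = np.random.default_rng(0)
n_trials, n_samples = 2000, 256

esd = np.zeros(n_samples)
for _ in range(n_trials):
    x = rng.normal(loc=0.0, scale=1.0, size=n_samples)  # Gaussian white noise
    esd += np.abs(np.fft.fft(x)) ** 2
esd /= n_trials

# The mean is close to n_samples * sigma^2 and the relative deviation across
# frequency bins is small, confirming the flat profile.
print(esd.mean(), esd.std() / esd.mean())
```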
The connotation of developing the training framework through the propositions is paved firstly by stating that we can produce uniformly distributed noise when such an energy model is used. The reason is simply because the Gaussian noise leads to uniform output probabilities across classes, governed by the assumption in the earlier definitions. By ensuring uniform uncertainty, our approach prevents model collapse. It stabilizes training, which is especially beneficial under limited data or noisy conditions, such as in the case of SimCLR where the absence of sufficient negative samples can lead to collapse. This establishes the first proposition 1.4.
With proposition 1.5, the model maintains consistent predictions despite additive Gaussian noise and enhances adaptability in real-world scenarios where data quality varies significantly and samples are deficient. The model’s output with a noisy input is similar to that with an original input. Proposition 1.5 guides the model’s training in a self-supervised setting. Training with a small sample count becomes stable with noise regularisation, as seen in our experiments.
Further proof of the propositions is detailed in the supplementary material.
The probability density function of an energy model, given data \(x\), is defined as

$$p_\theta (x)=\frac{e^{-E_\theta (x)}}{Z}$$

where \(E_\theta\) is the energy function parameterized by \(\theta\), and \(Z\) is the unknown normalising partition function.
The energy spectral density function defines the energy distribution of a signal over frequency, i.e.,

$$E=\int _{-\infty }^{\infty }\Theta _x(f)\,df$$

where \(\Theta _x(f)=|X(f)|^2\) and \(X\) is the Fourier transform of \(x\).
The Fourier transform of the autocorrelation function of a signal, \(R_x(\tau )\), is the energy spectral density function, i.e.,

$$\Theta _x(f)=\int _{-\infty }^{\infty }R_x(\tau )e^{-i2\pi f\tau }\,d\tau$$
Let the energy model be represented as an energy spectral density function in the frequency domain, and assume the input components are Gaussian noise with variance \(\sigma ^2\) and mean \(\mu\). Then, the resulting probability density function will have a constant output value of \(\frac{1}{K}\), where \(K\) denotes the number of discrete frequency bins or classes.
In this proposition, K represents the number of classes at the softmax output of the final linear projection layer. Hence, we assign \(p(n)=\frac{1}{K}\), a uniform probability, for the noise output n. It also implies that the energy function produces a flat output when the input is noise, akin to the flat profile of the power spectral representation of white noise in the frequency domain. In other words, the system cannot classify or differentiate the category of the input when it is noise. For instance, the noise we used in the self-supervised framework is Gaussian noise with a normal distribution of mean 0.0 and variance 0.1. The proposition provides a prelude to the progression of noise-contrastive learning with noisy image input data and an additional regulariser with a noise anchor. This method prevents training collapse, as is observed in SimCLR when the number of batch samples for pre-training is much lower.
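The intuition behind this proposition can be sketched in one step (our own summary; the formal proof is in the supplementary material): if a noise input yields a near-constant energy across all \(K\) output bins, the softmax over those energies collapses to the uniform distribution.

```latex
% If a noise input n gives (approximately) constant energies across the K
% bins, E_theta^{(k)}(n) = c for all k, the softmax output is uniform:
\[
  p(k \mid n)
  = \frac{e^{-E_\theta^{(k)}(n)}}{\sum_{j=1}^{K} e^{-E_\theta^{(j)}(n)}}
  = \frac{e^{-c}}{K\,e^{-c}}
  = \frac{1}{K}, \qquad k = 1,\dots ,K.
\]
```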
Assuming the energy spectral model described in Proposition 1.4, consider an input signal corrupted by additive Gaussian noise n. Then, the resulting probability density function will be similar to a supervised probability density function followed by an output layer comprising a softmax activation function that generates a probability distribution over multiple classes.
Overall, the definitions and propositions collectively establish a theoretical framework for using Gaussian noise in self-supervised learning. They show that incorporating noise as a regularizer prevents training collapse by ensuring uniform class probabilities (Proposition 1.4) and maintains similar classification performance even with noisy inputs (Proposition 1.5). This method addresses key issues in self-supervised learning, such as mode collapse and robustness to noise. It provides a grounded method for improving model stability and generalisation.
Learning in HUT-NSS framework: (a) the UNS module denoises input scans while the VTS component is trained with contrastive learning in a noisy setting. According to proposition 1.5, the impact of noise is minimised within the learned representations; and (b) noise anchor regularisation: a noise anchor is used during pre-training to regularise the learning, which results in a uniform output distribution.
Domain adaptation is an essential step for self-supervised learning in medical imaging. The primary motivation for this step is the difference between the datasets used. The UniTOBrain50 dataset (unlabelled) and ISLES201851,52 dataset (labelled) originate from different sources, which leads to variations in image characteristics such as pixel intensities and spatial features.
Transfer normalisation is essential for the network to learn from a different dataset. Minimising the difference in distribution between the unlabelled and labelled datasets is also crucial for self-supervised pre-training to be significantly effective. One of the primary measures to detect such differences is the Maximum Mean Discrepancy (MMD)53, which is commonly adopted in domain adaptation. Aligning second-order statistics is another essential factor in matching the two domains, and aligning the distributions of two datasets acquired from different equipment improves classification performance in medical imaging54. Domain adaptation aims to reduce the distribution shift between datasets. Yu et al.55 and Jin et al.56 attempted to map the source to the target domain by matching the data distributions, employing adversarial learning to derive domain-invariant feature representations. With the leverage of domain adaptation, Yu et al.55 showed an improvement in the recall and overall accuracy of an X-ray image classification task.
Brain imaging scans acquired by different scanning equipment frequently differ in intensity and contrast. Brain CT datasets, such as the UniTOBrain and ISLES2018 datasets, are obtained with dissimilar CT scanners under different protocols, parameters, and settings. It is typically difficult for models to learn over a broader intensity range, especially when the distribution profiles of the images from the two domains are vastly disparate. Models may therefore converge slowly in training and struggle to generalise across datasets with differing intensity distributions. To address this issue, various domain adaptation methods have been proposed to align the distribution profiles of the datasets used in training. Low-level features, such as texture, intensity, and contrast, are usually captured at the lower layers of the encoder, while the higher layers represent the anatomical structure of the brain. Techniques including style transfer and intensity normalisation have been shown to mitigate such discrepancies effectively.

We have experimented with two approaches. The first is Gram matrix57 alignment, in which the encoder is initially pre-trained on the ISLES2018 dataset. The method aligns the Gram matrices of the first two layers of a VGG encoder to reduce the differences in feature representations between the target and source domains. The difference in feature statistics is backpropagated through these two layers to modify the input image of the source domain, achieving the desired shift in the intensity distribution of the source image. The second approach is AdaIN alignment, which is preferred when a UNet is implemented instead of just an encoder. We chose the AdaIN approach for its implementation efficiency and effectiveness. The UNet is initially pre-trained as an autoencoder on the target domain; Adaptive Instance Normalization (AdaIN)58 is then deployed at the first two layers of this pre-trained UNet to align the feature statistics of the source dataset. In doing so, the style characteristics of the target domain are effectively transferred to the source domain. In essence, our proposed approach uses AdaIN to address intensity variations in the brain CT datasets, summarised in the steps below:
Determine the CT datasets to use: UniTOBrain (Source) and ISLES2018 (Target). The source domain attempts to match the style distribution of the target domain.
Obtain feature vectors of the early layers of the UNet encoder by first pre-training a UNet on target images as an autoencoder and employing the encoder to extract feature maps from the input images. The layers near the bottleneck contain higher-level semantic information, while the first two layers of the encoder capture features that represent edges, texture, and intensity.
Apply AdaIN on the first two layers of the UNet, with the AdaIN operation defined as

$$\text{AdaIN}(\hat{x},\hat{y})=\sigma (\hat{y})\left( \frac{\hat{x}-\mu (\hat{x})}{\sigma (\hat{x})}\right) +\mu (\hat{y})$$
where \(\hat{x}\) is the input feature vector per channel, \(\hat{y}\) is the style feature vector per channel, \(\mu (\hat{x})\) and \(\sigma (\hat{x})\) are the mean and standard deviation of the input feature vector \(\hat{x}\), and \(\mu (\hat{y})\) and \(\sigma (\hat{y})\) are the mean and standard deviation of the style feature vector \(\hat{y}\). The mean \(\mu (\hat{x})\) and standard deviation \(\sigma (\hat{x})\) of the source domain's output feature vector per channel are computed at the first two layers of the encoder, and the feature vectors are normalised to standard statistics. The normalised features are then scaled and biased with \(\sigma (\hat{y})\) and \(\mu (\hat{y})\) to align the intensity distributions of the feature maps with the target domain.
The input image of the source domain now carries the target domain's feature statistics at the first two layers of the UNet. The final image with aligned distribution is reconstructed at the output of the UNet. The processed output is visually similar to the ISLES2018 dataset in style while preserving the semantic content characteristic of the UniTOBrain dataset, as depicted in Fig. S1 in the supplementary materials. The pre-training of the HUT-NSS model is performed on the processed UniTOBrain dataset; a minimal code sketch of the AdaIN operation follows below.
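A minimal PyTorch sketch of the AdaIN operation above, assuming feature maps of shape (batch, channels, height, width); the usage lines are hypothetical placeholders, not the exact pipeline:

```python
import torch

def adain(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Adaptive Instance Normalization: renormalise the per-channel statistics
    of source features x to match those of style (target) features y."""
    mu_x = x.mean(dim=(2, 3), keepdim=True)
    std_x = x.std(dim=(2, 3), keepdim=True) + eps  # eps avoids division by zero
    mu_y = y.mean(dim=(2, 3), keepdim=True)
    std_y = y.std(dim=(2, 3), keepdim=True) + eps
    return std_y * (x - mu_x) / std_x + mu_y

# Hypothetical usage on the first two encoder layers of the pre-trained UNet:
# feats_src = encoder_layer(x_unito)   # UniTOBrain (source) features
# feats_trg = encoder_layer(x_isles)   # ISLES2018 (target) features
# aligned   = adain(feats_src, feats_trg)
```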
In the section on experiments and results, we show that the self-supervised framework works in tandem with the alignment of the distributions between the source and target domains; in other words, it matches the two datasets.
We introduce an alternative dynamic weighting method for the loss functions during the second training session of the HUT with the labelled dataset, which in this case is the ISLES2018 dataset. The loss functions are the dice loss and the cross-entropy loss. Optimally, the dice loss ensures maximum overlap between the predicted and ground truth labels, whereas the cross-entropy loss provides accurate pixel classification at the output. We approximate the weighting factor of each loss with the accumulated gradient in each epoch, computed as the difference in aggregated loss between subsequent epochs. The intuition is that a higher gradient signals greater importance, adding more weight to one loss function than the other. The weighting factors are updated at every epoch instead of every iteration: updating every iteration degrades performance because of the larger variance between successive losses, whereas computing the accumulated gradient at each epoch effectively averages out the noise variation. We describe the weighted combination of the losses with the following equation:

$$\mathcal {L}=\lambda _{dsc}\mathcal {L}_{dsc}+\lambda _{ce}\mathcal {L}_{ce}$$
where \(\lambda _{dsc}=\frac{\partial \mathcal {L}_{dsc}}{\partial t}\), \(\lambda _{ce}=\frac{\partial \mathcal {L}_{ce}}{\partial t}\), \(\mathcal {L}_{dsc}\) is the dice loss, \(\mathcal {L}_{ce}\) is the cross-entropy loss, and t is the computation time step, which in this case is one epoch of training.
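A minimal sketch (our own, not the authors' exact code) of this dynamic weighting, approximating each \(\lambda\) by a finite difference of the epoch-averaged loss:

```python
class DynamicLossWeights:
    """Approximate lambda_dsc and lambda_ce by the change in each
    epoch-averaged loss between consecutive epochs, so the faster-changing
    loss receives more weight."""

    def __init__(self) -> None:
        self.prev_dsc = None
        self.prev_ce = None
        self.lam_dsc = 1.0  # start from equal weighting
        self.lam_ce = 1.0

    def update(self, epoch_avg_dsc: float, epoch_avg_ce: float) -> None:
        # Called once per epoch with the accumulated (averaged) losses;
        # per-iteration updates would be too noisy, as noted in the text.
        if self.prev_dsc is not None:
            self.lam_dsc = abs(epoch_avg_dsc - self.prev_dsc)
            self.lam_ce = abs(epoch_avg_ce - self.prev_ce)
        self.prev_dsc, self.prev_ce = epoch_avg_dsc, epoch_avg_ce

    def combine(self, loss_dsc, loss_ce):
        # L = lambda_dsc * L_dsc + lambda_ce * L_ce
        return self.lam_dsc * loss_dsc + self.lam_ce * loss_ce
```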
As shown in Fig. 2a and b, our proposed self-supervised learning of HUT comprises three essential parts: (i) contrastive training, (ii) regularisation of the VTS, and (iii) denoised training of the UNS. The training of the VTS involves contrastive learning on the original CT perfusion scans corrupted with stronger Gaussian noise of variance 3.0 and mean centred at 0.0. The affine-transformed versions of the CTP scans are infused with Gaussian noise drawn from a standard normal distribution. In addition, we regularise the training of the VTS with a noise anchor mechanism, as illustrated in Fig. 2b. The probability density distribution at the output of the projection network from the CLS token branch is matched to a uniform distribution when the input image is noise. Intuitively, the network would have no certainty to classify noise as any of the classes, and therefore the projection network should output equal probabilities at the output nodes. In every training iteration, we include the noise regularisation process with independently and easily generated input noise.
As for the UNS, we incorporate denoising UNet training as the unsupervised training mechanism for the CNN within the HUT-NSS model. The same input images to the VTS are used to train the UNS. The UNS is trained to denoise the input images by comparing the network output to the original clean images. We used the \(L_2\)-norm loss function for the training to allow the network to reconstruct the original images from a noisy image and learn robust feature representations. Self-supervised learning includes all three training losses during each iteration, weighted equally.
The pseudo-code of the HUT-NSS algorithm is illustrated in Algorithm 1. It comprises several key steps for the pre-training of HUT, such as dataset preparation, data augmentation and noise regularisation, embedding extraction, output and noise projection into lower dimensional space, and loss gradients backpropagation:
Data Preparation and Augmentation:
Firstly, noise \(n_1\), sampled from a standard normal distribution, is added to each unlabelled training sample x from the processed UniTOBrain dataset \(S_{trg}\). Two augmented views of each input are then created: the noise-corrupted sample \(x_1\), and a second sample \(x_2\) generated through a random augmentation procedure incorporating independent noise \(n_2\).
Embedding Extraction:
Both augmented samples \(x_1\) and \(x_2\) are passed through a Transformer-based network function \(f_{VTP}\), which acts as the feature extractor. This step produces embeddings \(emb_1\) and \(emb_2\) for inputs \(x_1\) and \(x_2\), respectively.
Output and Noise Projection:
These embeddings are subsequently projected into a lower-dimensional representation using a projection network function \(f_{PROJ}\), resulting in projected softmax outputs \(proj_{x_1}\) and \(proj_{x_2}\).
In addition, noise \(n_0\), when passed through the Transformer \(f_{VTP}\), produces a noise embedding, which is also projected via \(f_{PROJ}\) to produce a lower-dimensional softmax output \(proj_{n_0}\).
Loss Computation and Backpropagation:
The training objective comprises three components. Firstly, a cross-entropy loss \(L_{proj_{12}}\) is computed between the projected embeddings \(proj_{x_1}\) and \(proj_{x_2}\), which imposes consistency across different input views. During optimization, gradients are backpropagated only through the branch originating from \(x_1\) to avoid learning collapse. Secondly, a cross-entropy loss \(L_{Nproj_{0}}\) is calculated between the noise projection \(proj_{n_0}\) and a uniform probability distribution \(p_K\), where K denotes the number of classes. Lastly, a mean squared error (MSE) loss \(L_{unet}\) is applied between the UNet's predicted output and the original input x, with gradients propagated through the UNet branch, ensuring reconstruction of the clean signal at the output. This constraint guides the pre-training flow according to the propositions in the earlier subsection.
HUT-NSS using noise anchor and contrastive noise pre-training on UniTOBrain dataset
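A minimal PyTorch-style sketch of one pre-training step of Algorithm 1; the component names (f_vtp, f_proj, unet, augment) are placeholders for the paper's modules, and the noise follows the standard-normal sampling described above:

```python
import torch
import torch.nn.functional as F

def hut_nss_step(x, f_vtp, f_proj, unet, augment, K):
    """One HUT-NSS pre-training step: contrastive view consistency,
    noise-anchor regularisation, and UNS denoising, equally weighted."""
    n0 = torch.randn_like(x)                # independent noise for the anchor
    x1 = x + torch.randn_like(x)            # noise-corrupted view (n1)
    x2 = augment(x) + torch.randn_like(x)   # augmented view with its own noise (n2)

    p1 = f_proj(f_vtp(x1))                  # softmax projection of view 1
    p2 = f_proj(f_vtp(x2)).detach()         # stop-gradient: only the x1 branch learns
    pn = f_proj(f_vtp(n0))                  # softmax projection of pure noise

    # (i) contrastive consistency between views (cross-entropy on soft targets)
    loss_proj = -(p2 * torch.log(p1 + 1e-8)).sum(dim=-1).mean()
    # (ii) noise anchor: the noise projection should be uniform over K classes
    uniform = torch.full_like(pn, 1.0 / K)
    loss_anchor = -(uniform * torch.log(pn + 1e-8)).sum(dim=-1).mean()
    # (iii) denoising reconstruction in the UNS branch
    loss_unet = F.mse_loss(unet(x1), x)

    return loss_proj + loss_anchor + loss_unet  # equal weighting, as in the text
```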
The UniTOBrain dataset50 is an open-access ischemic stroke imaging dataset containing 109 unlabelled CTP scan volumes. Each volume includes CBF, CBV, MTT, and Tmax image parameters derived from different time snapshots. The dataset is mainly used to train the self-supervised framework, increasing the diversity of data seen by the network during pre-training. We show the merit of the method on the downstream task of lesion segmentation using the labelled ISLES2018 dataset, which comprises 94 CT perfusion scans of patient subjects, split into 84 subjects for training and 10 subjects for testing. Each volume is typically 240 pixels high by 240 pixels wide, with 2, 4, or 8 slices in depth. Expert radiologists carefully and manually annotated the lesions of each volume.
Part of the self-supervised framework includes the necessary domain transfer of another unannotated dataset (source) to match the distribution of the downstream input dataset (target). The UniTOBrain dataset (source) is processed by the procedure described in the "Domain adaptation technique for unlabeled dataset" section. We produce a dataset perceptually similar to the ISLES2018 dataset (target) in style while semantically retaining the content of the UniTOBrain dataset, as illustrated in Fig. S1 in the supplementary material. Experimentally, this step is essential for the self-supervised method to learn and discover structures and anomalies similar to those in the target dataset (ISLES2018).
In this section, we conduct experiments focusing on ischemic stroke lesion segmentation with the four parameter maps of the CT perfusion images in the ISLES2018 dataset. The metrics used to evaluate performance are the Dice score, HD95 score, IOU, precision, and recall. All experiments are conducted with equal weighting of soft dice and cross-entropy loss, except for the proposed dynamic weighting method for the self-supervised HUT system.
During training, data augmentation methods such as random affine transformation with scaling factors ranging from 75% to 125% and rotation of up to 15 degrees were used, along with random flipping of the scans across the brain's two hemispheres. We conducted all experiments with identical testing sets to maintain consistency. The training and testing sets were randomly selected: 90% of the total 94 subjects were used for training and 10% for testing, and we averaged results over 5 different runs. Dropout of weights was set at 25%, and data augmentation was implemented to improve diversity. We ran 1000 epochs for all experiments, each with an Adam optimiser with a learning rate of 3e-4 and a decay rate of 1e-7. These hyperparameters were fixed empirically, adjusted and determined over many trials to provide the best performance.
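A sketch of this training configuration is given below; the library calls are illustrative, the dummy module stands in for HUT-NSS, and we read the stated decay rate as Adam weight decay (it could equally denote a learning-rate decay):

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Augmentation as described: affine with up to 15 degrees rotation and
# 75-125% scaling, plus a random flip across the brain's hemispheres.
augment = transforms.Compose([
    transforms.RandomAffine(degrees=15, scale=(0.75, 1.25)),
    transforms.RandomHorizontalFlip(p=0.5),
])

# Dummy stand-in for the HUT-NSS model (4 CTP parameter maps as input
# channels), with the stated 25% dropout.
model = nn.Sequential(nn.Conv2d(4, 16, kernel_size=3, padding=1), nn.Dropout(p=0.25))
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=1e-7)
NUM_EPOCHS = 1000
```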
Similar to the experiments conducted in the earlier study40, we observed that slice-by-slice training for the self-supervised training provided superior and more consistent outcomes, since there are a limited number of slices (2, 4, or 8) in each subject within the CTP dataset. Therefore, throughout the experiments on the HUT-NSS methodology, we used a 2D training technique instead of a 3D one.
Our proposed method shows that, when trained with the noise regulariser, the energy spectral density model works particularly well when the labelled or unlabelled dataset is limited in quantity, as observed in Tables S1 and S2 of the supplementary material. We further demonstrate the framework's foundation with Propositions 1.4 and 1.5 and provide a simple framework for HUT-NSS learning, as illustrated in Fig. 2a and b, respectively.
There are several segmentation methods for medical imaging that we compare against: the CNN-based nnUnet59, ERFNet60, UShape61, and USSLNet62, as well as hybrid Transformer-based approaches such as TransBTS63 and UNETR37. Table 1 compares the Dice scores and Hausdorff distances of these methods against the proposed self-supervised method using 100% of the ISLES2018 dataset, i.e., 84 CTP scans for training and 10 scans for testing. In supervised learning, HUT holds a 4.6% Dice-score advantage over the competitive, state-of-the-art USSLNet, with a slight advantage in HD95 score. Our self-supervised version, HUT-NSS, has an even larger margin of 7.2% in Dice score and 28.1% in HD95 score over USSLNet, and it gains 2.5% in Dice score over its supervised counterpart even when 100% of the dataset is used. In supervised learning, nnUnet achieves results comparable to USSLNet but performs worse on the HD95 score. In comparison, HUT-NSS surpasses nnUnet by 8.1% in Dice score and 61.9% in HD95 score, and it widens the margin against ERFNet and UShape in both Dice and HD95 scores. Hybrid Transformer-based networks like TransBTS and UNETR exhibit suboptimal performance on this dataset because of limited training data, despite their hybrid architecture; Transformer-based models are known to require more data to train. By reducing the training dataset by half, we show that HUT-NSS can still achieve a result comparable to the supervised version of HUT trained on the entire set. Table 2 compares the metrics between the methods when 50% of the annotated dataset was used for supervised training: HUT-NSS gained 7.9% in Dice score over USSLNet and 7.7% over nnUnet.
Table 3 compares the metrics between the methods when 10% of the annotated dataset was used; only nine sets of CTP volumes were used for training. HUT-NSS gained 7.5% in Dice score over USSLNet and nnUnet. Meanwhile, the supervised version of HUT performed slightly worse than nnUnet and USSLNet, signifying that the Transformer does not work as well when only a minimal amount of annotated data is available for training.
Likewise, Transformer-based methods performed worst when only a limited annotated dataset was available for the downstream task of lesion segmentation. Table 4 compares the metrics between the methods when only 1% of the annotated dataset was used: a single set of CTP volumes was used for training. HUT-NSS gained 12.8% and 16.3% in Dice score over USSLNet and nnUnet, respectively, while the supervised version of HUT performed comparably to nnUnet and USSLNet. A larger margin was observed when the annotated dataset was reduced to single-shot training.
Figure 3 visually shows the differences in lesion segmentation of subject 13 across the various methods, with 50% of the dataset used for training. The CT perfusion parameters, along with the CT scan, are included in the figure; the images labelled c, d, e, and f are the CT perfusion images taken 8 hours after the contrast agent is injected into the patient's bloodstream. As observed in Fig. 3, even when training is limited to half of the original dataset, the HUT-NSS segmentation closely resembles the ground truth despite being smaller in size. The original supervised HUT and ERFNet methods also resemble the label quite well. USSLNet misses parts of the segmentation, whereas the transformer-based TransBTS and UNETR underestimate the lesion segment.
Visual comparison of lesion segmentation of our proposed HUT-NSS method against the various methods for subject 13.
As illustrated in Fig. 4, we compare the differences in lesion segmentation for subject 33 when only nine sets of scans were used during training of the downstream task. With 10% of the training dataset, HUT-NSS closely follows the outline of the ground truth, although it fails to provide finer detail. TransBTS is similar in shape but does not cover the lesion accurately. ERFNet and UNETR are more optimistic and give larger segmentations. In contrast, nnUnet covers the ground truth correctly despite not providing accurate segmentation details. The original HUT is more conservative and covers less area; however, it is similar to HUT-NSS.
Visual comparison of lesion segmentation of our proposed HUT-NSS method against the various methods for subject 33.
Dynamic weighting and self-supervised learning for limited annotated dataset

In the ablation study, we show the performance of our proposed model with different configurations.
Table 5 compares the performance of the self-supervised pre-training methods, namely DINO16, SimCLR14, and our proposed noise anchor method. We observe that our method gains 5.2% and 9.5% in Dice score over DINO and SimCLR, respectively, when all the annotated data are used in the segmentation task. DINO and SimCLR perform slightly worse than the supervised version, indicating some degradation when these self-supervision methods are deployed. For the case of 50% labels, HUT-NSS gains 5.8% and 9.6% over the DINO and SimCLR methods. HUT-NSS gains 4.9% and 6.8% over the DINO and SimCLR methods when 10% of labels are used for the downstream task. A more considerable margin can be seen when only 1% of labels was used in the training: HUT-NSS gained 9.1% and 7.6% in Dice score over the DINO and SimCLR methods, respectively. When only one sample is used, SimCLR edges out DINO slightly in terms of performance.
In Table 6, we compare runs of the self-supervised and supervised versions of HUT with and without the dynamic weighting mechanism of the loss functions during the downstream task of lesion segmentation. For experiments without the dynamic weighting factor, we employ equal weighting of the loss functions. The table illustrates the merit of the dynamic weighting factor, particularly for HUT-NSS. At 100% labels, the difference between the dynamic and equal weighting factors is negligible: the gain from the dynamic version is 0.6% for HUT-NSS and 1.8% without self-supervised training. For 50% labels, the gain is 5.4% and 3.0% for the self-supervised and supervised versions of HUT, respectively. Interestingly, for 10% labels, the gain is 7.7% for the self-supervised HUT but a degradation of 4.5% for the supervised version. For 1% labels, the gain is 4.5% for the self-supervised HUT and a degradation of 2.9% for the supervised version.
In the study using the MNIST dataset, provided in the supplementary material (Table S1), we show that noise-based self-supervision generalises to perform well on an image classification task even when the amount of labelled and unlabelled data for training is limited, compared with methods such as SimCLR and DINO.
In segmenting ischemic strokes from CT perfusion scans with a limited annotated dataset, we incorporate self-supervised learning into the hybrid UNet and vision transformer called HUT40 and show that the self-supervised version of HUT gains a further 2.5% in Dice score over its supervised version, even when 100% of the training dataset is used. This indicates that the number of CTP scans in the ISLES2018 dataset may not be sufficient for the model to reach its learning capacity. Self-supervised learning without any labels on another unannotated set, such as the UniTOBrain dataset, improves the ability of HUT to learn the underlying structure of the input data. However, the distribution of the UniTOBrain dataset is quite different from that of the ISLES2018 dataset. Therefore, we employ a domain adaptation transformation so that the mapped UniTOBrain dataset resembles the CTP scans of the ISLES2018 dataset. Matching the distributions allows the self-supervised model to readily learn similar structures in the target dataset.
HUT-NSS has surpassed the performance of state-of-the-art methods in smaller CTP datasets for several reasons. It is designed to address the local and long-range correlations between the patches, and it considerably exceeds the capabilities of current methods using transformers, UNet, and CNN for medical image segmentation. The VTS’s output attends to information at various resolutions. The introduction of the noise anchor acts as a training regulariser during self-supervised training. It effectively prevents training collapse and enhances the network’s performance capacity. Learning from a larger unlabelled dataset improves the ability of the Transformer.
Numerous experiments and comparative analyses show that the self-supervised version of our model, HUT-NSS, outperforms its supervised counterpart by a clear margin. Moreover, when the full annotated dataset is used in training, HUT-NSS surpasses the state-of-the-art network, USSLNet, by 7.2% in Dice score and 28.1% in HD95 score.
We show the robustness of HUT-NSS with noise regularisation under limited annotated datasets. HUT-NSS consistently outperforms HUT, USSLNet, and nnUnet when only 50%, 10% or even 1% of the annotated dataset is used during the downstream segmentation task’s training.
From the ablation study, we observe that only the proposed method benefits the overall performance of HUT; other self-supervised methods, such as DINO and SimCLR, do not provide any gain from training on the UniTOBrain dataset. From the study of the dynamic weighting method introduced in this work, we observe that it is more beneficial for the self-supervised version than the supervised version when the number of annotated samples available is very limited, such as 1 to 9 samples. This indicates that the dynamic weighting factor is more sensitive when the variation of the training samples is low.
From the study on image classification of the MNIST dataset in section 3 of the supplementary material, we demonstrate the significance of cluster distance and representation compactness in evaluating the method's efficacy through UMAP visualisation plots. The noise-based self-supervised method presents more consistent and compact clustering. We observe that although some maps are mirrored versions of others (vertically, horizontally, or both), they still indicate consistent feature representation at a lower dimension. The MNIST experiment showcases the effectiveness of the proposed method, particularly when labelled data is limited, and it also proves more robust with smaller unlabelled data for pre-training. Furthermore, in previous studies40, HUT was well-adapted to MRI images in brain tumour and lesion segmentation, whether multi-modal or single-modal. The model should therefore have similar merits on different sub-types of ischemic stroke and other medical imaging tasks.
In a broader sense, although the framework has shown promising results, mainly on brain scans, we should also address the limitations that may arise when such a method is deployed in a clinical environment. Accurate lesion segmentation determines the ischemic areas or infarcts, which is important for diagnosing severity so that clinicians can decide whether to carry out procedures such as thrombolysis or thrombectomy. Slight errors in volume measurement from segmentation can affect the judgment of whether a patient is eligible for particular treatments64. Precise diagnosis and review using MRI reduce risks such as hemorrhage and edema and are vital for safely expanding treatment windows65. Administering tissue plasminogen activator (tPA) to dissolve blood clots is highly time-sensitive in cases of acute ischemic stroke. There are challenges in terms of technical, regulatory, ethical, and practical considerations. Humans may still exhibit superiority in learning from few observations, but machine learning tools provide valuable insights and suggestions, and they are readily available to help humans in applications such as brain imaging segmentation so that timely and sound decisions about medical intervention can be made. HUT, nevertheless, is designed to augment clinicians in medical imaging: it is intended to support clinicians in their decision-making processes, not replace them. We still rely on experts to make informed decisions based on the results the tool provides. Other ethical considerations include patient data privacy, data curation, and regulatory compliance. Patient data privacy involves the non-disclosure of information about patient data, and the deployment of machine learning models in healthcare must comply with stringent regulations such as GDPR in Europe or HIPAA in the United States. Data curation ensures the dataset's quality to reduce bias and improve representativeness. Regulatory compliance, such as with the U.S. Food and Drug Administration (FDA), is essential as it maintains patient safety and builds trust with clinicians and patients by protecting their sensitive data.
In this study, we introduce HUT-NSS by combining the strengths of both CNN and Transformer architectures with self-supervised learning and applying it to the ischemic stroke lesion segmentation task. The original Hybrid UNet Transformer (HUT) network enhances MRI and CT perfusion image segmentation tasks. HUT features parallel UNet and transformer stages, exploiting the CNN's ability to identify image features and the Transformer's ability to capture global dependencies.
We propose a novel method for noise-based self-supervised pre-training that uses a noise anchor and operates under an energy spectral density model. We have demonstrated that this method can produce better performance, mainly when working with only limited annotated data, by using additional unannotated CTP datasets that can be inexpensive to acquire.
To sum up, our results accentuate the effectiveness of HUT-NSS in segmenting ischemic stroke lesions. The proposed framework addresses the local and long-range correlations between voxels and provides various unsupervised strategies to improve performance when limited annotated data is available. We can significantly boost the network's performance in both Dice score and Hausdorff distance at 1%, 10%, 50%, and 100% of the annotated data. This study shows the method's ability to gain a margin over supervised methods in medical imaging applications and its robustness with smaller annotated datasets.
The Annotated Dataset ISLES2018 used in the preparation of this article was obtained from SICAS Medical Image Repository http://www.isles-challenge.org/ISLES2018/51,52. The additional unannotated dataset UniTOBrain was obtained from the University of Turin https://ieee-dataport.org/open-access/UniTOBrain50.
Kuriakose, D. & Xiao, Z. Pathophysiology and treatment of stroke: Present status and future perspectives. Int. J. Mol. Sci. 21, 7609 (2020).
Hui, C., Tadi, P. & Patti, L. Ischemic stroke. In StatPearls [Internet] (StatPearls Publishing, 2022).
Pelc, N. J. Recent and future directions in CT imaging. Ann. Biomed. Eng. 42, 260–268 (2014).
Allmendinger, A. M., Tang, E. R., Lui, Y. W. & Spektor, V. Imaging of stroke: Part 1, perfusion CT - overview of imaging technique, interpretation pearls, and common pitfalls. Am. J. Roentgenol. 198, 52–62 (2012).
Cortes, C. & Vapnik, V. Support-vector networks. Mach. Learn. 20, 273–297 (1995).
Kingma, D. P. & Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013).
Goodfellow, I. et al. Generative adversarial networks. Commun. ACM 63, 139–144 (2020).
Sandfort, V., Yan, K., Pickhardt, P. J. & Summers, R. M. Data augmentation using generative adversarial networks (CycleGAN) to improve generalizability in CT segmentation tasks. Sci. Rep. 9, 1–9 (2019).
Frid-Adar, M. et al. Gan-based synthetic medical image augmentation for increased CNN performance in liver lesion classification. Neurocomputing 321, 321–331 (2018).
Kingma, D., Salimans, T., Poole, B. & Ho, J. Variational diffusion models. Adv. Neural. Inf. Process. Syst. 34, 21696–21707 (2021).
Chlap, P. et al. A review of medical image data augmentation techniques for deep learning applications. J. Med. Imaging Radiat. Oncol. 65, 545–563 (2021).
Article PubMed Google Scholar
Noroozi, M. & Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, 69–84 (Springer, 2016).
Feng, Z., Xu, C. & Tao, D. Self-supervised representation learning by rotation feature decoupling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10364–10374 (2019).
Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, 1597–1607 (PMLR, 2020).
He, K., Fan, H., Wu, Y., Xie, S. & Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9729–9738 (2020).
Caron, M. et al. Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294 (2021).
Grill, J.-B. et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733 (2020).
Zbontar, J., Jing, L., Misra, I., LeCun, Y. & Deny, S. Barlow twins: Self-supervised learning via redundancy reduction. In International Conference on Machine Learning, 12310–12320 (PMLR, 2021).
Bardes, A., Ponce, J. & LeCun, Y. Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906 (2021).
He, K. et al. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16000–16009 (2022).
Bao, H., Dong, L., Piao, S. & Wei, F. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254 (2021).
Xie, Z. et al. Simmim: A simple framework for masked image modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9653–9663 (2022).
Wang, S. et al. Annotation-efficient deep learning for automatic medical image segmentation. Nat. Commun. 12, 5915 (2021).
Article ADS CAS PubMed PubMed Central Google Scholar
Azizi, S. et al. Robust and data-efficient generalization of self-supervised machine learning for diagnostic imaging. Nat. Biomed. Eng. 7, 756–779 (2023).
Article PubMed Google Scholar
Taleb, A., Lippert, C., Klein, T. & Nabi, M. Multimodal self-supervised learning for medical image analysis. In International Conference on Information Processing in Medical Imaging, 661–673 (Springer, 2021).
Hao, Y., Wang, Y. & Wang, X. Self-supervised pretraining for covid-19 and other pneumonia detection from chest x-ray images. In Advances in Natural Computation, Fuzzy Systems and Knowledge Discovery: Proceedings of the ICNC-FSKD 2021 17, 1000–1007 (Springer, 2022).
Felfeliyan, B. et al. Self-supervised-RCNN for medical image segmentation with limited data annotation. Comput. Med. Imaging Graph. 109, 102297 (2023).
Article PubMed Google Scholar
Zhou, H.-Y., Lu, C., Chen, C., Yang, S. & Yu, Y. A unified visual information preservation framework for self-supervised pre-training in medical image analysis. IEEE Trans. Pattern Anal. Mach. Intell. 45, 8020–8035 (2023).
PubMed Google Scholar
Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
Doersch, C., Gupta, A. & Efros, A. A. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, 1422–1430 (2015).
Chopra, S., Hadsell, R. & LeCun, Y. Learning a similarity metric discriminatively, with application to face verification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 1, 539–546 (IEEE, 2005).
Van den Oord, A., Li, Y. & Vinyals, O. Representation learning with contrastive predictive coding. arXiv e-prints arXiv–1807 (2018).
Ma, D. et al. Benchmarking and boosting transformers for medical image classification. In MICCAI Workshop on Domain Adaptation and Representation Transfer, 12–22 (Springer, 2022).
Valanarasu, J. M. J., Oza, P., Hacihaliloglu, I. & Patel, V. M. Medical transformer: Gated axial-attention for medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 36–46 (Springer, 2021).
Xie, Y., Zhang, J., Shen, C. & Xia, Y. Cotr: Efficiently bridging cnn and transformer for 3d medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 171–180 (Springer, 2021).
Cao, H. et al. Swin-unet: Unet-like pure transformer for medical image segmentation. arXiv preprint arXiv:2105.05537 (2021).
Hatamizadeh, A. et al. Unetr: Transformers for 3d medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 574–584 (2022).
Chen, J. et al. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306 (2021).
Tang, Y. et al. Self-supervised pre-training of swin transformers for 3d medical image analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 20730–20740 (2022).
Soh, W. K. & Rajapakse, J. C. Hybrid UNet transformer architecture for ischemic stoke segmentation with MRI and CT datasets. Front. Neurosci. 17, 1298514 (2023).
Article PubMed PubMed Central Google Scholar
Soh, W. K., Yuen, H. Y. & Rajapakse, J. C. Hut: Hybrid unet transformer for brain lesion and tumour segmentation. Heliyon 9 (2023).
Bishop, C. M. Training with noise is equivalent to tikhonov regularization. Neural Comput. 7, 108–116 (1995).
Article Google Scholar
Calvarons, A. F. Improved noise2noise denoising with limited data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 796–805 (2021).
Mansour, Y. & Heckel, R. Zero-shot noise2noise: Efficient image denoising without any data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14018–14027 (2023).
Pang, T., Zheng, H., Quan, Y. & Ji, H. Recorrupted-to-recorrupted: unsupervised deep learning for image denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2043–2052 (2021).
Wang, J., Di, S., Chen, L. & Ng, C. W. W. Noise2info: Noisy image to information of noise for self-supervised image denoising. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 16034–16043 (2023).
Pfaff, L. et al. Self-supervised MRI denoising: Leveraging Stein’s unbiased risk estimator and spatially resolved noise maps. Sci. Rep. 13, 22629 (2023).
Article ADS CAS PubMed PubMed Central Google Scholar
Oord, A. v. d., Li, Y. & Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).
Wiener, N. Extrapolation, interpolation, and smoothing of stationary time series (The MIT Press, 1964).
Perlo, D. et al. Unitobrain dataset: a brain perfusion dataset, https://doi.org/10.21227/x8ea-vh16, arXiv:2208.00650 (2022).
Cereda, C. W. et al. A benchmarking tool to evaluate computer tomography perfusion infarct core predictions against a DWI standard. J. Cereb. Blood Flow Metab. 36, 1780–1789. https://doi.org/10.1177/0271678X15610586 (2016).
Article PubMed Google Scholar
Hakim, A. et al. Predicting infarct core from computed tomography perfusion in acute ischemia with machine learning: Lessons from the isles challenge. Stroke 52, 2328–2337. https://doi.org/10.1161/STROKEAHA.120.030696 (2021).
Article CAS PubMed PubMed Central Google Scholar
Gretton, A. et al. Optimal kernel choice for large-scale two-sample tests. Adv. Neural Inf. Process. Syst. 25 (2012).
Zheng, Z., Li, R. & Liu, C. Learning robust features alignment for cross-domain medical image analysis. Complex Intell. Syst. 1–15 (2023).
Yu, C. et al. Learning to match distributions for domain adaptation. arXiv preprint arXiv:2007.10791 (2020).
Jin, X., Yang, X., Fu, B. & Chen, S. Joint distribution matching embedding for unsupervised domain adaptation. Neurocomputing 412, 115–128 (2020).
Article Google Scholar
Gatys, L. A., Ecker, A. S. & Bethge, M. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576 (2015).
Huang, X. & Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, 1501–1510 (2017).
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J. & Maier-Hein, K. H. nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 18, 203–211 (2021).
Article CAS PubMed Google Scholar
Romera, E., Alvarez, J. M., Bergasa, L. M. & Arroyo, R. Erfnet: Efficient residual factorized convnet for real-time semantic segmentation. IEEE Trans. Intell. Transp. Syst. 19, 263–272 (2017).
Article Google Scholar
Clerigues, A. et al. Acute ischemic stroke lesion core segmentation in ct perfusion images using fully convolutional neural networks. Comput. Biol. Med. 115, 103487 (2019).
Article PubMed Google Scholar
Jiang, Z. & Chang, Q. Ussl net: Focusing on structural similarity with light u-structure for stroke lesion segmentation. J. Shanghai Jiaotong Univ. (Sci.) 27, 485–497 (2022).
Article Google Scholar
Wang, W. et al. Transbts: Multimodal brain tumor segmentation using transformer. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 109–119 (Springer, 2021).
Warach, S. et al. Effect of citicoline on ischemic lesions as measured by diffusion-weighted magnetic resonance imaging. Ann. Neurol. 48, 713–722 (2000).
3.0.CO;2-#" data-track-item_id="10.1002/1531-8249(200011)48:53.0.CO;2-#" data-track-value="article reference" data-track-action="article reference" href="https://doi.org/10.1002%2F1531-8249%28200011%2948%3A5%3C713%3A%3AAID-ANA4%3E3.0.CO%3B2-%23" aria-label="Article reference 64" data-doi="10.1002/1531-8249(200011)48:53.0.CO;2-#">Article CAS PubMed Google Scholar
Kim, B. J. et al. Magnetic resonance imaging in acute ischemic stroke treatment. J. Stroke 16, 131 (2014).
Article PubMed PubMed Central Google Scholar
This research/project is supported by AcRF Tier-2 grant MOE T2EP20121-0003 and Tier-1 grant RG15/24 of the Ministry of Education, Singapore. The annotated dataset ISLES2018 was downloaded from the SICAS Medical Image Repository51,52; the investigators within SICAS contributed to the design and implementation of ISLES2018 and/or provided data but did not participate in the analysis or writing of this report. The unannotated dataset UniTOBrain was downloaded from the University of Turin (https://ieee-dataport.org/open-access/UniTOBrain)50; the investigators within the University of Turin contributed to the design and implementation of UniTOBrain and/or provided data but did not participate in the analysis or writing of this report.
College of Computing and Data Science, Nanyang Technological University, Singapore, 639798, Singapore
Wei Kwek Soh & Jagath C. Rajapakse
W.K.S.: Conceptualisation, Methodology, Software, Formal analysis, Investigation, Writing–Original Draft, Writing–Review and Editing, Visualisation, Data Curation. J.C.R.: Conceptualisation, Writing–Review and Editing, Supervision, Funding acquisition.
Correspondence to Jagath C. Rajapakse.
The authors declare no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Soh, W.K., Rajapakse, J.C. Noise-induced self-supervised hybrid UNet transformer for ischemic stroke segmentation with limited data annotations. Sci Rep 15, 19783 (2025). https://doi.org/10.1038/s41598-025-04819-2
Received: 29 May 2024
Accepted: 29 May 2025
Published: 05 June 2025
DOI: https://doi.org/10.1038/s41598-025-04819-2
