Keywords: dual student model; Jensen-Shannon divergence; uncertain loss
CLC number: TP391  Document code: A  DOI: 10.19907/j.0490-6756.2023.042004
Uncertain cross pseudo supervision for semi-supervised semantic segmentation
LIU Meng-Jie, PU Yi-Fei, ZHANG Wei-Hua
(College of Computer Science, Sichuan University, Chengdu 610065, China)
Predictions of deep learning algorithms are often blindly assumed to be accurate, and this weakness is more pronounced in semi-supervised learning. To address this problem, the authors introduce a simple but effective regularization approach, termed uncertain cross pseudo supervision. The approach imposes a consistency constraint on dual student networks perturbed with different initializations, where the one-hot segmentation map output by one student is used as the pseudo label to supervise the other. The uncertainty of the pseudo labels is estimated by calculating the Jensen-Shannon divergence between the dual predictions. Furthermore, a novel loss, called uncertain loss, is introduced to down-weight the loss assigned to high-uncertainty pseudo labels. Extensive experiments show that the proposed method achieves state-of-the-art semi-supervised segmentation performance.
1 Introduction
Semantic segmentation is a fundamental computer vision task used in applications such as autonomous driving [1]. Semantic segmentation models have recently achieved state-of-the-art performance, but only with enormous amounts of annotated data. To reduce this annotation cost, numerous researchers have begun to segment images via semi-supervised learning.
There are two major families of methods in semi-supervised learning: consistency regularization and pseudo-labeling. Consistency regularization enforces consistency constraints on predictions under various perturbations, e.g., input perturbations obtained by applying different augmentations to the input data, network perturbations, and feature perturbations [2,3]. Pseudo-labeling methods have also been studied for semi-supervised segmentation [4]. This process is usually implemented in two stages: (1) training is performed on the labeled set; (2) retraining is performed on a mixed dataset in which pseudo labels are assigned to the unlabeled data. Semi-supervised methods thus leverage extra unlabeled data. However, predictions on unlabeled data are often blindly assigned high confidence [5].
How to alleviate this blind confidence remains an open question. Approximate Bayesian inference is widely used to determine prediction uncertainty [6-8], yet it is difficult to implement. Several recent works have used MC-dropout [9,10], which randomly drops network units to obtain Monte Carlo samples. The standard deviation of the Monte Carlo samples is then used to estimate the uncertainty of the predictions. For example, reference [5] leveraged the MC-dropout strategy and drastically improved pseudo-labeling accuracy by ignoring pseudo labels of high uncertainty. However, semantic segmentation is a dense, pixel-level task for which dropout is ill-suited. With a dual student approach [11], Monte Carlo sampling becomes possible.
This paper proposes a novel framework, termed Uncertain Cross Pseudo Supervision (UCPS), for semi-supervised semantic segmentation. The predictions of dual networks are used to generate segmentation maps. As the primary benefit of UCPS, the Jensen-Shannon (JS) divergence between the maps from the dual students is used to estimate the pixel-level uncertainty of the pseudo labels. The uncertain loss then addresses unreliable pseudo labels by reshaping the cross entropy loss with this uncertainty, so as to down-weight the loss assigned to high-uncertainty examples. Experiments on the Cityscapes and Camvid datasets show that this approach provides significant improvements over baseline methods. UCPS based on Ddrnet23 achieves state-of-the-art performance. Moreover, an ablation study verifies the effectiveness of each component.
2 Related works
2.1 Uncertainty estimations
There are two primary types of uncertainty in Bayesian modeling: aleatoric uncertainty and epistemic uncertainty [12]. Aleatoric uncertainty accounts for noise inherent in the data. By contrast, epistemic uncertainty represents uncertainty in the model parameters: predictions with high confidence values may be untrustworthy because neural networks are poorly calibrated. The goal of a network is to find the posterior distribution $P(w|D)$, where $w$ and $D$ denote the network parameters and the data. $P(w|D)$ is defined as:
$$P(w|D) = \frac{P(D|w)P(w)}{P(D)} \tag{1}$$
However, $P(D)$ is the distribution of the data, which is unavailable, and $P(w)$ does not exist in a standard network because every parameter is a definite number rather than a distribution. Following a Bayesian approach, $P(D)$ is computed as:
$$P(D) = \sum_i P(D|w_i)P(w_i) \tag{2}$$
Many researchers leverage MC-dropout to obtain $w_i$ [13-16]. Once the dropout layers of the network are enabled, some units are randomly dropped, which yields $w_i$ and $P(D|w_i)$.
However, the performance achieved by applying MC-dropout to semantic segmentation is unsatisfactory. First, Bayesian modeling is difficult to implement. Second, the need for dense outputs makes dropout unsuitable. Inspired by Bayesian modeling and MC-dropout, this paper introduces dual networks to generate Monte Carlo samples and estimates the uncertainty directly with the JS divergence.
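The Monte Carlo view of uncertainty described in this subsection can be illustrated with a minimal sketch. It assumes a hypothetical helper `mc_uncertainty` that receives several sampled class-probability vectors for one pixel (one per sampled set of weights $w_i$) and returns the per-class mean and standard deviation. The sample values are invented for illustration and do not come from the paper.

```python
import statistics

def mc_uncertainty(samples):
    """Return (mean, std) per class over Monte Carlo samples.

    samples: list of probability vectors, one per sampled set of weights w_i.
    """
    per_class = list(zip(*samples))  # regroup the values by class
    mean = [statistics.fmean(vals) for vals in per_class]
    std = [statistics.pstdev(vals) for vals in per_class]
    return mean, std

# Hypothetical predictions for a single pixel with two classes
agree = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]
disagree = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]

_, std_agree = mc_uncertainty(agree)
_, std_disagree = mc_uncertainty(disagree)
print(std_agree, std_disagree)
```

Samples that agree give a standard deviation of zero, while samples that disagree give a large one, which is the signal used to flag unreliable predictions.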
2.2 Teacher-student structure
The teacher-student structure is widely used in recent semi-supervised learning methods. These methods involve two roles: a teacher model and a student model. The predictions of the two networks are tied together by a consistency constraint. However, the weights of the student and teacher networks are tightly coupled when an exponential moving average (EMA) [17] is used. In the dual student model [11], the teacher is replaced with another student, and each network can be optimized independently in each iteration.
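The coupling caused by EMA can be seen in a small sketch. It assumes the usual EMA update rule, teacher = decay * teacher + (1 - decay) * student, with a hypothetical decay of 0.99 and toy weight vectors. The function name `ema_update` is illustrative only.

```python
def ema_update(teacher, student, decay=0.99):
    """EMA step: teacher <- decay * teacher + (1 - decay) * student."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

teacher = [0.0, 0.0]
student = [1.0, -1.0]  # hypothetical student weights, held fixed here

for _ in range(500):
    teacher = ema_update(teacher, student)

# Largest remaining difference between teacher and student weights
gap = max(abs(t - s) for t, s in zip(teacher, student))
print(gap)
```

Because the teacher is pulled toward the student at every step, the two sets of weights end up nearly identical and cannot serve as independent samples.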
This work argues that the parameters of the two networks in the dual student structure can be viewed as $w_1$ and $w_2$ and used to obtain a satisfactory estimate of the uncertainty between the two networks.
3 Approach
3.1 Framework overview
As illustrated in Fig.1, uncertain cross pseudo supervision involves dual student networks: $Student_1$ and $Student_2$. Labeled samples $\{data_l\}$ and unlabeled samples $\{data_u\}$ are fed to both networks to generate one-hot segmentation maps, $Onehot\_pre$.
The proposed method is bidirectional. One direction is from $Student_1$ to $Student_2$, where the one-hot map $Onehot\_pre_1$ is used as the pseudo label to supervise $Output_2$. The second direction is from $Student_2$ to $Student_1$. In addition, the uncertainty $UNC_{js}$ is computed as the JS divergence between $Output_1$ and $Output_2$ for unlabeled data and is used in the uncertain loss to down-weight the loss assigned to high-uncertainty examples.
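The two supervision directions can be sketched for a single unlabeled pixel. This is an illustration only: the helper `one_hot_argmax` and the probability values are hypothetical, and the sketch assumes the one-hot map is obtained by taking the most probable class.

```python
def one_hot_argmax(probs):
    """Turn a class-probability vector into a one-hot pseudo label."""
    k = max(range(len(probs)), key=probs.__getitem__)
    return [1 if j == k else 0 for j in range(len(probs))]

# Hypothetical softmax outputs of the two students for one unlabeled pixel
output_1 = [0.7, 0.2, 0.1]
output_2 = [0.4, 0.5, 0.1]

onehot_pre_1 = one_hot_argmax(output_1)  # pseudo label that supervises Student 2
onehot_pre_2 = one_hot_argmax(output_2)  # pseudo label that supervises Student 1
print(onehot_pre_1, onehot_pre_2)
```

In this example the two students disagree on the class, which is exactly the situation in which the pseudo labels should be trusted less.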
3.2 Uncertain metrics
For each pixel $i$ of the segmentation maps, we denote the prediction $prob_i$ as $[prob_i^1, prob_i^2, \ldots, prob_i^c]$ with $c$ class categories, where $prob_i^c$ is the confidence probability that pixel $i$ is assigned to category $c$. Thus, $prob_i$ can be seen as a probability distribution. In our work, $p_i$ and $q_i$ denote the predictions obtained from the dual networks.
JS divergence builds upon the Kullback-Leibler (KL) divergence and measures the gap between two distributions. We use JS divergence for two reasons. First, the range of JS divergence values is [0, 1] when the logarithm is taken to base 2, which allows us to represent the uncertainty more accurately and smoothly. A smaller value of $JS(p_i, q_i)$ represents greater similarity between the distributions $p_i$ and $q_i$. Second, JS divergence is symmetric, so that $JS(p_i, q_i)$ is equal to $JS(q_i, p_i)$, whereas KL divergence and cross entropy are both asymmetric.
$$KL(p_i, q_i) = \sum_j p_{ij} \log\left(\frac{p_{ij}}{q_{ij}}\right) \tag{3}$$
$$JS(p_i, q_i) = \frac{1}{2} KL\left(p_i, \frac{1}{2}(p_i + q_i)\right) + \frac{1}{2} KL\left(q_i, \frac{1}{2}(p_i + q_i)\right) \tag{4}$$
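Equations (3) and (4) can be checked numerically with a short sketch. It assumes base-2 logarithms, which is the choice that bounds the JS divergence by 1. The function names `kl` and `js` and the example distributions are illustrative.

```python
import math

def kl(p, q):
    """Eq. (3) with log base 2; terms with p_j == 0 contribute 0."""
    return sum(pj * math.log2(pj / qj) for pj, qj in zip(p, q) if pj > 0)

def js(p, q):
    """Eq. (4), where m is the midpoint distribution (p + q) / 2."""
    m = [(pj + qj) / 2 for pj, qj in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.7, 0.2, 0.1]  # hypothetical prediction of Student 1 for one pixel
q = [0.4, 0.5, 0.1]  # hypothetical prediction of Student 2 for the same pixel

print(js(p, q), js(q, p))       # identical values: JS is symmetric
print(js(p, p))                 # identical distributions give 0
print(js([1.0, 0.0], [0.0, 1.0]))  # fully disjoint distributions give 1
```

The midpoint distribution is nonzero wherever either input is nonzero, so the divergence stays finite even when one student assigns zero probability to a class.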
3.3 Uncertain loss
The Uncertain Loss (UL) is designed to address cases of semi-supervised semantic segmentation in which some incorrect pixel predictions arise during training. We introduce the uncertain loss by starting from the weighted cross entropy (WCE) loss for multi-class classification:
$$WCE(p, y) = -\sum_{i=0}^{c-1} \alpha_i y_i \log(p_i) \tag{5}$$
In the above, $p_i \in [0, 1]$ is the model's estimated probability for class $i$, $y \in \{0, 1, 2, \ldots, c-1\}$ specifies the ground-truth class, $c$ is the number of classes, and $(y_0, y_1, \ldots, y_{c-1})$ is the corresponding one-hot vector. The weighted cross entropy loss is commonly applied to address class imbalance by introducing a weighting factor $\alpha_i$ for each category. This weighting factor down-weights the loss assigned to easy examples. While $\alpha_i$ balances the importance of hard and easy examples, it does not differentiate between uncertain and certain examples. Instead, we propose to reshape the loss function to down-weight uncertain examples and thus reduce the training noise introduced by pseudo labels. More formally, we propose adding a modulating factor $(1 - JS)^{\gamma}$ to the weighted cross entropy loss, with a tunable uncertainty parameter $\gamma$.
The uncertain loss is defined as:
$$UL(p, y) = -\sum_{i=0}^{c-1} (1 - JS(p, q))^{\gamma} \alpha_i y_i \log(p_i) \tag{6}$$
where $y$ is the one-hot representation of $q$.
We note two properties of the uncertain loss.
First, when a label is the ground truth, its uncertainty is zero, and the uncertain loss is equivalent to the weighted cross entropy loss. Second, the uncertainty parameter $\gamma$ smoothly adjusts the rate at which uncertain examples are down-weighted.
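Both properties can be seen in a per-pixel sketch of the uncertain loss. This is a minimal illustration, not the authors' implementation: the function name `uncertain_loss`, the class weights, and the probability values are hypothetical, while $\gamma = 2$ follows the setting reported in the experiments.

```python
import math

def uncertain_loss(p, y, js_value, alpha, gamma=2.0):
    """Per-pixel uncertain loss: (1 - JS)^gamma times the weighted cross entropy."""
    weight = (1.0 - js_value) ** gamma
    # Only the class marked in the one-hot label y contributes to the sum
    wce = -sum(a * yi * math.log(pi) for a, yi, pi in zip(alpha, y, p) if yi > 0)
    return weight * wce

p = [0.7, 0.2, 0.1]      # hypothetical prediction of the supervised student
y = [1, 0, 0]            # one-hot pseudo label from the other student
alpha = [1.0, 1.0, 1.0]  # hypothetical class weights

wce = uncertain_loss(p, y, js_value=0.0, alpha=alpha)  # zero uncertainty
ul = uncertain_loss(p, y, js_value=0.5, alpha=alpha)   # weight (1 - 0.5)^2 = 0.25
print(wce, ul)
```

With zero uncertainty the value equals the weighted cross entropy, and with a JS divergence of 0.5 the same pixel contributes only a quarter of that loss.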
Following previous works, UCPS imposes a consistency constraint on unlabeled data based on the uncertain loss. The total loss of our approach is a combination of the supervised loss and the uncertain loss:
$$L_{total} = L_{sup} + \lambda L_{ul} \tag{7}$$
where $L_{sup}$ is the weighted cross entropy loss for labeled data and $\lambda$ is a hyperparameter introduced to balance the constraint.
The segmentation networks $Student_1$ and $Student_2$ share the same architecture but use different initializations. The augmented samples are fed into the dual networks to obtain the predictions $Output_1$ and $Output_2$, respectively. $Onehot\_pre$ is the one-hot result of the softmaxed Output and is used to supervise the other network. In addition, the pixel-level uncertainty of $Onehot\_pre$ is estimated by the Jensen-Shannon divergence ($UNC_{js}$) between $Output_1$ and $Output_2$. This uncertainty enables UCPS to leverage pseudo labels more accurately. The ground truth GT is used to supervise labeled data.
4 Experiments
4.1 Dataset
Cityscapes: The Cityscapes [18] dataset is a large urban street scene dataset containing data from 50 different cities. It contains 5000 finely annotated (pixel-level) images and 20 000 coarsely annotated images, with a resolution of 2048*1024. The standard training, validation, and test sets consist of 2975, 500, and 1525 images, respectively. Each pixel of these images is annotated with one of 19 pre-defined classes. In our experiments, we randomly sample 3000 images from the coarse set as the unlabeled set.
Camvid: The Camvid [19] dataset contains 367 images for training, 101 images for validation, and 233 images for testing. The images have a resolution of 960*720 and cover 11 semantic categories. We merge the validation set and the test set to form the unlabeled set for training.
4.2 Experimental settings
UCPS is implemented using the PyTorch framework on a single Nvidia TitanX GPU, where Ddrnet23s and Ddrnet23 [20] were selected as both the segmentation networks and the supervised baselines. The mean of the category-wise Intersection-over-Union (mIoU) is used as the evaluation metric. Only one of the two networks in UCPS is used to generate the results for evaluation, via single-scale testing.
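For reference, mIoU can be computed from a confusion matrix as sketched here. The function name `mean_iou` and the two-class matrix are hypothetical, and skipping classes that appear in neither the ground truth nor the prediction is an assumption of this sketch rather than something the paper specifies.

```python
def mean_iou(conf):
    """mIoU from a confusion matrix conf[gt][pred] of pixel counts."""
    n = len(conf)
    ious = []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                          # class c pixels predicted as another class
        fp = sum(conf[r][c] for r in range(n)) - tp     # other pixels predicted as class c
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both ground truth and prediction
            ious.append(tp / denom)
    return sum(ious) / len(ious)

# Hypothetical two-class confusion matrix (rows: ground truth, columns: prediction)
conf = [[8, 2],
        [1, 9]]
print(mean_iou(conf))
```

Here the per-class IoU values are 8/11 and 9/12, and mIoU is their unweighted mean, so rare classes count as much as frequent ones.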
Cityscapes: The learning rate is initially set to 2.5e-3, and a learning rate policy with a power of 0.9 is employed to decrease it. We use SGD [21] with a batch size of 4, a momentum of 0.9, and a weight decay of 5e-4 in training. Mean subtraction, random horizontal flipping, and color jitter are employed for data augmentation. Scales are randomly selected from 0.5, 0.75, 1, 1.25, 1.5, and 1.75 for multi-scale data augmentation, and the training samples are randomly cropped to 512*1024. $\lambda$ and $\gamma$ are set to 1.0 and 2.0, respectively. For testing, both the training set and the validation set are used for training. UCPS was trained for 140 epochs.
Camvid: The initial learning rate is set to 1e-4, and all models are trained for 242 epochs. The crop size is set to 960*720. Other training details are identical to those for Cityscapes.
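A learning rate policy with a power of 0.9 usually refers to the "poly" schedule, and the sketch here assumes that reading. The function name `poly_lr` and the total number of steps are hypothetical. Only the initial rate of 2.5e-3 and the power of 0.9 come from the Cityscapes settings.

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    """Poly schedule: base_lr * (1 - step / max_steps) ** power."""
    return base_lr * (1.0 - step / max_steps) ** power

base_lr = 2.5e-3   # initial Cityscapes learning rate from the text
max_steps = 1000   # hypothetical total number of training iterations

# Learning rate at the start, midpoint, and end of training
lrs = [poly_lr(base_lr, s, max_steps) for s in (0, 500, 1000)]
print(lrs)
```

The rate starts at the initial value and decays smoothly to zero at the final step.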
4.3 Results
Cityscapes: We first compare UCPS with the supervised baselines on the Cityscapes validation set. The mIoU results of UCPS are reported in Tab.1.
All results are generated via single-scale testing. The best results are presented in bold, with the second-best results underlined. UCPS with either Ddrnet23s or Ddrnet23 outperforms the baselines on this dataset. We also note that UCPS performs significantly better on rare categories, such as wall, fence, and rider, indicating that our method handles severe class imbalance well.
Tab.2 shows the accuracy obtained by UCPS and several state-of-the-art methods on the Cityscapes test set.
In Tab.2, "-" indicates that a method did not report the corresponding result, and all results are generated by models trained only on the Cityscapes training set. UCPS achieves significant progress over the other methods. In the evaluation process, we do not employ any testing technique other than single-scale testing. Moreover, only the official Cityscapes training set is used for training, whereas Ddrnet trains on a merged training set. Fig.2 shows the baseline and UCPS results on Cityscapes images.
Camvid: The results of UCPS are shown in Tab.3. DDRNet-23-slim achieves 74.7% mIoU on the Camvid test set, and DDRNet-23 obtains the highest accuracy. In Tab.3, MSFNet runs at 1024*768 while the other methods run at 960*720.
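Tab.1 Per-class IoU and mIoU (%) on the Cityscapes validation set

| Method | Road | Sidewalk | Building | Wall | Fence | Pole | Traffic light | Traffic sign | Vegetation | Terrain | Sky | Person | Rider | Car | Truck | Bus | Train | Motorcycle | Bicycle | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ddrnet23s | 98.1 | 84.6 | 92.3 | 51.4 | 61.7 | 64.1 | 70.2 | 77.6 | 92.5 | 66.2 | 94.5 | 81.5 | 63.3 | 95.0 | 79.7 | 88.2 | 81.3 | 60.3 | 76.5 | 77.8 |
| Ddrnet23 | 98.2 | 85.2 | 92.9 | 56.0 | 63.4 | 66.8 | 73.7 | 80.8 | 92.7 | 64.7 | 95.1 | 83.7 | 67.6 | 95.5 | 79.6 | 90.3 | 81.5 | 65.5 | 78.5 | 79.5 |
| UCPS (Ddrnet23s) | 98.2 | 85.6 | 92.5 | 61.7 | 62.5 | 63.9 | 69.9 | 77.5 | 92.5 | 67.8 | 94.6 | 81.5 | 63.4 | 94.9 | 82.4 | 89.3 | 81.2 | 60.6 | 76.0 | 78.7 |
| UCPS (Ddrnet23) | 98.3 | 86.0 | 93.1 | 63.6 | 65.6 | 67.1 | 73.6 | 80.6 | 92.9 | 67.1 | 94.6 | 83.7 | 68.8 | 95.7 | 85.1 | 91.8 | 82.9 | 65.3 | 78.6 | 80.7 |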
5 Ablation study
To demonstrate the influence of the different components of the proposed UCPS, the performance of several UCPS variants on the Cityscapes validation set is reported in Tab.4. In Tab.4, DS denotes the dual student structure, $UL_u$ the uncertain weight in UL, and $UL_\alpha$ the weight $\alpha$ in UL.
Ablation for dual student structure: We adopt a dual student structure to provide two samples of a semantic segmentation network. The reason for using the dual student structure is twofold. First, it is difficult to maintain three or more students in our framework because of memory limitations. Second, the EMA strategy is unsatisfactory for uncertainty estimation because of the strong coupling between the two networks. To demonstrate the superiority of a non-coupled approach, we replaced the dual student structure with the mean teacher structure, and the resulting performance is reported in Tab.4. The performance gap indicates that the two structures perform at clearly different levels within UCPS.
Ablation for uncertain loss: As stated above, pseudo labels face the challenge of high confidence for inaccurate predictions. We use the JS divergence between $Output_1$ and $Output_2$ to estimate the uncertainty. We expect the uncertain loss to up-weight the loss assigned to lower-uncertainty pseudo labels and down-weight the loss assigned to higher-uncertainty pseudo labels. This approach improves the performance from 78.3% to 78.7% compared with the standard cross entropy loss, as shown in Tab.4.
Ablation for uncertain and $\alpha$ weights: There are two balance weights in the uncertain loss: the uncertain weight and the $\alpha$ weight. The uncertain weight is a monotonically decreasing function of the uncertainty of the pseudo labels, and we argue that the credibility of the pseudo labels is closely correlated with this uncertainty. The loss assigned to cases of lower uncertainty should be up-weighted in the training process. The $\alpha$ weight is used to address category imbalance in the Cityscapes dataset. The effects of the uncertain weight and the $\alpha$ weight are presented in Tab.4, showing improvements of 0.3% and 0.5%, respectively.
6 Conclusions
This paper proposes UCPS to exploit unlabeled data in semi-supervised semantic segmentation. The proposed UCPS contains dual student networks and a novel uncertain loss. The segmentation map output by one network is used to supervise the other. Meanwhile, the JS divergence between the dual predictions is used to determine the uncertainty of the pseudo labels, and the uncertain loss uses it to down-weight the loss assigned to high-uncertainty pseudo labels. With these more accurate pseudo labels, the approach based on Ddrnet23 achieves an mIoU of 79.6% on the Cityscapes test set.
References:
[1] Hecker S, Dai D, Van G L. End-to-end learning of driving models with surround-view cameras and route planners [C]//Proceedings of the European Conference on Computer Vision (ECCV). Munich: Springer, 2018.
[2] Xie Q, Dai Z, Hovy E, et al. Unsupervised data augmentation for consistency training [C]//Proceedings of Advances in Neural Information Processing Systems (NIPS). Virtual-only: MIT Press, 2020.
[3] Mendel R, De S A, Rauber D, et al. Semi-supervised segmentation based on error-correcting supervision [C]//Proceedings of the European Conference on Computer Vision (ECCV). Glasgow: Springer, 2020.
[4] Xie Q, Luong M T, Hovy E, et al. Self-training with noisy student improves imagenet classification [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle: IEEE, 2020.
[5] Rizve M N, Duarte K, Rawat Y S, et al. In defense of pseudo-labeling: an uncertainty-aware pseudo-label selection framework for semi-supervised learning [EB/OL]. (2021-04-19) [2022-05-01]. https://arxiv.org/pdf/2101.06329.pdf.
[6] Graves A. Practical variational inference for neural networks [C]//Proceedings of Advances in Neural Information Processing Systems (NIPS). Granada: MIT Press, 2011.
[7] Blundell C, Cornebise J, Kavukcuoglu K, et al. Weight uncertainty in neural networks [EB/OL]. (2015-05-21) [2022-05-01]. https://arxiv.org/pdf/1505.05424.pdf.
[8] Louizos C, Welling M. Structured and efficient variational deep learning with matrix gaussian posteriors [EB/OL]. (2016-03-15) [2022-05-01]. https://arxiv.org/pdf/1603.04733.pdf.
[9] Gal Y, Ghahramani Z. Dropout as a bayesian approximation: representing model uncertainty in deep learning [EB/OL]. (2015-06-06) [2022-05-01]. https://arxiv.org/pdf/1506.02142.pdf.
[10] Li Z W, Bai L X, Zhang Y Y. Application of uncertainty evaluation method in parameter correction of high-purity germanium detector [J]. J Sichuan Univ: Nat Sci Ed, 2020, 57: 961.
[11] Ke Z, Wang D, Yan Q, et al. Dual student: breaking the limits of the teacher in semi-supervised learning [C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Seoul: IEEE, 2019.
[12] Kendall A, Gal Y. What uncertainties do we need in bayesian deep learning for computer vision? [C]//Proceedings of Advances in Neural Information Processing Systems (NIPS). Long Beach: MIT Press, 2017.
[13] Tompson J, Goroshin R, Jain A, et al. Efficient object localization using convolutional networks [C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. [S.l.]: IEEE, 2015: 648.
[14] Althoff D, Rodrigues L N, Bazame H C. Uncertainty quantification for hydrological models based on neural networks: the dropout ensemble [J]. Stoch Env Res Risk A, 2021, 35: 1051.
[15] Wen M, Tadmor E B. Uncertainty quantification in molecular simulations with dropout neural network potentials [J]. NPJ Comput Mater, 2020, 6: 1.
[16] Camarasa R, Bos D, Hendrikse J, et al. Quantitative comparison of monte-carlo dropout uncertainty measures for multi-class segmentation [M]//Uncertainty for Safe Utilization of Machine Learning in Medical Imaging, and Graphs in Biomedical Image Analysis. Lima: Springer, 2020.
[17] Tarvainen A, Valpola H. Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results [C]//Proceedings of Advances in Neural Information Processing Systems (NIPS). Long Beach: MIT Press, 2017.
[18] Cordts M, Omran M, Ramos S, et al. The cityscapes dataset for semantic urban scene understanding [C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016.
[19] Brostow G J, Fauqueur J, Cipolla R. Semantic object classes in video: a high-definition ground truth database [J]. Pattern Recogn Lett, 2009, 30: 88.
[20] Hong Y, Pan H, Sun W, et al. Deep dual-resolution networks for real-time and accurate semantic segmentation of road scenes [EB/OL]. (2021-09-01) [2022-05-01]. https://arXiv.org/pdf/2101.06085.pdf.
[21] Krizhevsky A, Sutskever I, Hinton G E. Imagenet classification with deep convolutional neural networks [C]//Proceedings of Advances in Neural Information Processing Systems (NIPS). Harrahs and Harveys: MIT Press, 2012.
[22] Chen L C, Papandreou G, Kokkinos I, et al. Semantic image segmentation with deep convolutional nets and fully connected crfs [EB/OL]. (2014-12-22) [2022-05-01]. https://arXiv.org/pdf/1412.7062.pdf.
[23] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation [C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015.
[24] Lin G, Shen C, Van D H A, et al. Efficient piecewise training of deep structured models for semantic segmentation [C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas: IEEE, 2016.
[25] Yu F, Koltun V. Multi-scale context aggregation by dilated convolutions [EB/OL]. (2015-11-23) [2022-05-01]. https://arXiv.org/pdf/1511.07122.pdf.
[26] Ghiasi G, Fowlkes C C. Laplacian pyramid reconstruction and refinement for semantic segmentation [C]//Proceedings of the European Conference on Computer Vision (ECCV). Amsterdam: Springer, 2016.
[27] Chen L C, Papandreou G, Kokkinos I, et al. Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs [EB/OL]. (2017-05-12) [2022-05-01]. https://arxiv.org/pdf/1606.00915.pdf.
[28] Lin G, Milan A, Shen C, et al. Refinenet: multi-path refinement networks for high-resolution semantic segmentation [C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017.
[29] Zhao H, Shi J, Qi X, et al. Pyramid scene parsing network [C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu: IEEE, 2017.
[30] Yu C, Wang J, Peng C, et al. Bisenet: bilateral segmentation network for real-time semantic segmentation [C]//Proceedings of the European Conference on Computer Vision (ECCV). Munich: Springer, 2018.
[31] Li X, You A, Zhu Z, et al. Semantic flow for fast and accurate scene parsing [C]//Proceedings of the European Conference on Computer Vision (ECCV). Glasgow: Springer, 2020.
[32] Si H, Zhang Z, Lv F, et al. Real-time semantic segmentation via multiply spatial fusion network [EB/OL]. (2018-10-16) [2022-05-01]. https://arXiv.org/pdf/1911.07217.pdf.