Xianwen Ren*, Liangtao Zheng, Zemin Zhang*
BIOPIC, Beijing Advanced Innovation Center for Genomics, and School of Life Sciences, Peking University, Beijing 100871, China
KEYWORDS Single cell; RNA-seq; Clustering; Subsampling; Classification
Abstract Clustering is a prevalent analytical means to analyze single cell RNA sequencing (scRNA-seq) data, but the rapidly expanding data volume can make this process computationally challenging. New methods for both accurate and efficient clustering are of pressing need. Here we propose Spearman subsampling-clustering-classification (SSCC), a new clustering framework based on random projection and feature construction, for large-scale scRNA-seq data. SSCC greatly improves clustering accuracy, robustness, and computational efficacy for various state-of-the-art algorithms benchmarked on multiple real datasets. On a dataset with 68,578 human blood cells, SSCC achieved a 20% improvement in clustering accuracy and a 50-fold acceleration, while consuming only 66% of the memory used by the widely used software package SC3. Compared to k-means, the accuracy improvement of SSCC can reach 3-fold. An R implementation of SSCC is available at https://github.com/Japrin/sscClust.
Single cell RNA sequencing (scRNA-seq) has revolutionized transcriptomic studies by revealing the heterogeneity of individual cells with high resolution [1-6]. Clustering has become a routine analytical means to identify cell types, depict their functional states, and infer potential cellular dynamics [4-10]. Multiple clustering algorithms have been developed, including Seurat [11], SC3 [12], SIMLR [13], ZIFA [14], CIDR [15], SNN-Cliq [16], and Corr [17]. These algorithms greatly improve the clustering accuracy of scRNA-seq data but often have high computational complexity, impeding the extension of these elegant algorithms to large-scale scRNA-seq datasets. With the rapid development of scRNA-seq technologies, the throughput has increased from initially hundreds of cells to tens of thousands of cells per run nowadays [18]. Integrative analyses of scRNA-seq datasets from multiple runs or even across multiple studies further exacerbate the computational difficulties. Thus, algorithms that can cluster single cells both efficiently and accurately are needed.
To handle multiple large-scale scRNA-seq datasets, ad hoc computational strategies have been proposed, either by downsampling or convoluting large datasets into small ones [12,19-21] or by accelerating the computation with new software implementation [22]. Such strategies have reached variable levels of success but have not adequately addressed the challenges. Considering the importance of efficient and accurate clustering tools for analyses of large-scale scRNA-seq data, here we propose a new computational framework, Spearman subsampling-clustering-classification (SSCC), based on machine learning techniques, including feature engineering and random projection, to achieve both improved clustering accuracy and efficacy. Benchmarking on various scRNA-seq datasets demonstrates that, compared to the current solutions, SSCC can reduce the computational complexity from O(n²) to O(n) while maintaining high clustering accuracy. Moreover, the flexibility of the new computational framework allows our methods to be further extended and adapted to a wide range of applications for scRNA-seq data analysis.
Among the available solutions to handle large scRNA-seq datasets, clustering with subsampling and classification [12,19] has linear complexity, i.e., O(n). Such a framework generally consists of four steps (Figure 1A): (1) a gene expression matrix is constructed by data preprocessing techniques including gene and cell filtration and normalization; (2) cells are divided into two subsets for clustering and classification separately by subsampling; (3) the subsetted cells for clustering are grouped into clusters using k-means [23], hierarchical clustering [24], density clustering [25], or algorithms developed specially for scRNA-seq; and (4) supervised algorithms such as k-nearest neighbors [26], support vector machines (SVMs) [27], or random forests [28] are used to predict the labels of the other cells based on the clustering results of the third step. For simplicity, we refer to this existing framework as subsampling-clustering-classification (SCC). Because clustering is time-consuming and memory-exhaustive, limiting this step to a small subset of cells through subsampling can greatly reduce the computational cost from O(n²) to O(n) by leveraging the efficiency of supervised machine learning algorithms. However, classifiers built on the original gene expression data of a small subset of cells may be flawed and biased due to noise in the raw data and the small number of cells, thus impairing the accuracy of label assignment for the total cell population.
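For illustration, a minimal R sketch of this SCC workflow is given below. It assumes a preprocessed log-expression matrix expr (genes × cells) and a known cluster number k, and uses k-means for step 3 and k-nearest neighbors for step 4; the function name scc_cluster is ours and not that of any published package.

# Minimal SCC sketch: subsample, cluster the subsample, classify the rest.
# Assumes `expr` is a genes x cells log-expression matrix and `k` is given.
library(class)  # provides knn()

scc_cluster <- function(expr, k, subsample_rate = 0.1, knn_k = 5) {
  n <- ncol(expr)
  idx <- sample(n, ceiling(subsample_rate * n))        # step 2: subsampling
  sub  <- t(expr[, idx, drop = FALSE])                 # cells x genes
  rest <- t(expr[, -idx, drop = FALSE])
  km <- kmeans(sub, centers = k, nstart = 25)          # step 3: clustering
  labels <- integer(n)
  labels[idx]  <- km$cluster
  labels[-idx] <- as.integer(class::knn(train = sub,   # step 4: classification
                                        test  = rest,
                                        cl = factor(km$cluster),
                                        k = knn_k))
  labels
}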
Here we propose a new computational framework for clustering large scRNA-seq data by adding a feature engineering/projecting step into SCC (Figure 1B). Similar to SCC, a gene expression matrix is first constructed through gene and cell filtration and normalization (Step 1, Figure 1B), and is then split randomly into two subsets for clustering and classification separately (Step 2, Figure 1B). Unlike SCC, which directly uses the raw gene expression data, our new framework projects cells into a feature space (Step 3, Figure 1B) for clustering (Step 4, Figure 1B) and classification (Step 5, Figure 1B). As the new framework is characterized by a subsampling-featuring-clustering-classification strategy, we named it SFCC. Specifically, we divide feature construction into two steps: (1) feature extraction techniques are applied to cells subject to clustering; and (2) according to the selected feature extraction method, cells for classification are then projected into the built feature space. Many established techniques in the machine learning field can be exploited in these two steps. For example, principal component analysis (PCA) [29] can be used to first construct features for cells undergoing clustering, while the resultant loading vectors can be used as linear transformations to project cells for classification into the same feature space (see the sketch below). Selecting different algorithms in each step of the SFCC framework would then form different pipelines for clustering large-scale scRNA-seq datasets. To reduce the total number of algorithmic combinations, here we focus on comparing the performance of various feature engineering algorithms, while holding the algorithms for gene and cell filtration, normalization, subsampling, and classification fixed to those frequently used in practice. The existing SCC strategy can be treated as a special case of SFCC in which the original data space is the feature space.
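As an illustration of the PCA variant mentioned above, the following sketch fits PCA on the clustering subset and reuses its loading vectors to project the classification subset into the same feature space. The helper name project_pca and the default of 20 components are our assumptions, not part of SC3 or sscClust.

# SFCC feature step with PCA (illustrative): fit PCA on the clustering subset,
# then project the classification subset with the same loadings.
project_pca <- function(sub, rest, n_pc = 20) {
  # sub, rest: cells x genes matrices (clustering and classification subsets)
  pc <- prcomp(sub, center = TRUE, scale. = FALSE)
  d <- min(n_pc, ncol(pc$rotation))
  feat_sub  <- pc$x[, 1:d, drop = FALSE]
  feat_rest <- scale(rest, center = pc$center, scale = FALSE) %*%
               pc$rotation[, 1:d, drop = FALSE]
  list(clustering = feat_sub, classification = feat_rest)
}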
Feature engineering techniques involved in this study include distance-based methods (Euclidean and cosine), correlation-based methods (Pearson [30] and Spearman [31] correlations), and a neural network-based method (autoencoder) [32]. For the distance- and correlation-based methods, the distance/correlation matrix among cells subject to clustering is directly used as their features, and the distance/correlation matrix between cells subject to classification and cells subject to clustering is used to construct features for cells undergoing classification. For the autoencoder, the gene expression data of cells for clustering are first used to train a neural network model, and then all cells are projected into a feature space through the encoding function of the trained model. To obtain evaluation results independent of clustering algorithms, we use silhouette values [33] to examine the global performance of these feature engineering methods. Based on the global evaluation, we then select the most effective method, SSCC, i.e., SFCC with Spearman correlation as the feature construction method, for further evaluations.
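A sketch of the correlation-based feature construction described above, written in R; spearman_features is an illustrative helper name, and the inputs are assumed to be cells × genes matrices for the clustering and classification subsets.

# Spearman feature construction (the idea behind SSCC):
# features of clustering cells = their pairwise Spearman correlations;
# features of classification cells = their Spearman correlations to the
# clustering cells, so both subsets live in the same feature space.
spearman_features <- function(sub, rest) {
  # sub, rest: cells x genes matrices; cor() works column-wise, so transpose
  feat_sub  <- cor(t(sub), method = "spearman")           # n_sub x n_sub
  feat_rest <- cor(t(rest), t(sub), method = "spearman")  # n_rest x n_sub
  list(clustering = feat_sub, classification = feat_rest)
}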
We used seven scRNA-seq datasets to evaluate the clustering performance in feature space. These include the Kolodziejczyk dataset [34], Pollen dataset [8], Usoskin dataset [9], Zeisel dataset [10], Zheng dataset [5], PBMC 68k dataset [18], and Macosko dataset [19]. Detailed descriptions of these datasets are given below.
The Kolodziejczyk dataset [34] contains 704 cells with three clusters, which were obtained from mouse embryonic stem cells under different culture conditions. About 10,000 genes were profiled with high sequencing depth (on average 9,000,000 reads per cell, >80% of reads mapped to the Mus musculus genome GRCm38, with >60% mapped to exons) using the Fluidigm C1 system, applying the SMARTer Kit to obtain cDNA and the Nextera XT Kit for Illumina library preparation.
The Pollen dataset [8] contains 249 cells with 11 clusters, which were obtained from skin cells, pluripotent stem cells, blood cells, neural cells, etc. Either low or high sequencing depth, based on the C1 Single-Cell Auto Prep Integrated Fluidic Circuit, the SMARTer Ultra Low RNA Kit, and the Nextera XT DNA Sample Preparation Kit, was used to depict the gene expression profiles of individual cells (~50,000 reads per cell).
Figure 1 Two computational frameworks for rapid clustering of large-scale scRNA-seq datasets. A. The original computational framework proposed in SC3 (referred to as SCC) consists of four main steps: (1) constructing the gene expression matrix; (2) dividing the matrix into two parts through cell subsampling; (3) clustering the subsampled cells; and (4) classifying the unsampled cells into clusters. B. The new computational framework proposed in this study (referred to as SFCC). A feature construction step is added before clustering and classification. The whole framework comprises five steps: (1) constructing the gene expression matrix; (2) dividing the matrix into two parts through cell subsampling; (3) projecting the subsampled/unsampled cells into a feature space; (4) clustering the subsampled cells in the feature space; and (5) classifying the unsampled cells into clusters in the feature space. scRNA-seq, single cell RNA sequencing; SC3, single-cell consensus clustering; SCC, subsampling-clustering-classification; SFCC, subsampling-featuring-clustering-classification.
The Usoskin dataset [9] contains 622 mouse neuronal cells with four clusters, i.e., peptidergic nociceptor-containing, non-peptidergic nociceptor-containing, neurofilament-containing, and tyrosine hydroxylase-containing cells. The neuronal cells were picked with a robotic cell-picking setup and positioned in wells of 96-well plates before RNA-seq (1,140,000 reads and 3574 genes per cell).
The Zeisel dataset [10] contains 3005 cells from the mouse brain with nine major subtypes. The gene expression levels were estimated by counting the number of unique molecular identifiers (UMIs) obtained by Drop-seq.
The Zheng dataset [5] contains 5063 T cells from five patients with hepatocellular carcinoma. Nine subtypes of samples were prepared according to the tissue types and cell types, and then subjected to Smart-seq2 for gene expression profiling (~1,290,000 uniquely mapped read pairs per cell).
The PBMC 68k dataset [18] contains 68,578 peripheral blood mononuclear cells (PBMCs) from a healthy human subject. This cell population includes eleven major immune cell types. Gene expression was profiled using the 10× Genomics GemCode platform, and 3′ UMI counts were used to quantify gene expression levels with their customized computational pipeline.
The Macosko dataset [19] contains 49,300 mouse retina cells without known distinct clusters. The gene expression levels were estimated by counting the number of UMIs obtained by Drop-seq. Cells were further clustered into 39 subtypes by the authors based on the Seurat algorithm.
The first four datasets (i.e., the Kolodziejczyk, Pollen, Usoskin, and Zeisel datasets) have been widely used for evaluating clustering algorithms, and their preprocessed data have been included in the SIMLR software package for test use (https://github.com/BatzoglouLabSU/SIMLR). We downloaded these four datasets from the MATLAB subdirectory of the SIMLR package, and then selected the top 5000 most informative genes (with both the average and the standard deviation of log2-transformed expression values >1) for subsequent analysis. If the number of genes in a dataset was smaller than 5000, then all the genes in the dataset were retained for further analysis. For the Zheng dataset, one patient (P0508) was selected for comparison of different clustering algorithms, who had 1020 T cells with eight subtypes defined by the tissue sources and the cell surface markers. Genes with both the average and the standard deviation of log2-transformed expression values >1 were retained, and then the transcripts per million (TPM) values were used for clustering evaluation. For the PBMC 68k dataset, the preprocessing pipeline described in the original report [18] was used to prepare data for clustering (https://github.com/10XGenomics/single-cell-3prime-paper). For the Macosko dataset, the UMI counts were used for evaluation without gene filtering.
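The gene filter described above can be expressed roughly as the R sketch below. Because the text does not specify how the "top 5000" genes are ranked when more than 5000 genes pass the mean/SD thresholds, the variance-based ranking here is our assumption.

# Gene filter sketch for the benchmark datasets: keep up to 5000 genes whose
# log2-transformed expression has both mean > 1 and SD > 1.
# `expr` is assumed to be a genes x cells matrix of non-log expression values.
filter_genes <- function(expr, max_genes = 5000) {
  lg <- log2(expr + 1)
  keep <- rowMeans(lg) > 1 & apply(lg, 1, sd) > 1
  filtered <- expr[keep, , drop = FALSE]
  if (nrow(filtered) > max_genes) {
    # assumption: rank retained genes by log2 variance and keep the top ones
    ord <- order(apply(log2(filtered + 1), 1, var), decreasing = TRUE)
    filtered <- filtered[ord[1:max_genes], , drop = FALSE]
  }
  filtered
}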
The silhouette value [33] is used to measure the consistency between the true labels and the original as well as the projected data. Given a dataset with n samples and a clustering scheme, a silhouette value is calculated for each sample. For a sample i, its silhouette value s_i is calculated according to the following formula:

s_i = \frac{b_i - a_i}{\max(a_i, b_i)}

where a_i is the average dissimilarity of sample i to samples in its own cluster and b_i is the lowest average dissimilarity of sample i to any other cluster of which sample i is not a member. The values of s_i range from -1 to 1. A value close to 1 means that sample i is well matched to its cluster, whereas a value close to -1 means that sample i would be more appropriately classified into its neighboring cluster. For each feature construction method, the median silhouette value of all the cells after projection was used to evaluate its consistency with the true cluster labels. The fraction of cells whose silhouette values increased after projection compared to the original data (i.e., the fraction of cells above the diagonal in Figure 2) was also used to evaluate the feature construction methods.
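A sketch of how the median silhouette summary might be computed in R with the cluster package; the choice of Euclidean dissimilarity within the feature space is our assumption, since other dissimilarities (e.g., 1 − correlation) could equally be used.

# Per-cell silhouette values against the true labels, summarized by the median.
library(cluster)

median_silhouette <- function(features, true_labels) {
  # features: cells x feature-dimensions matrix; true_labels: label vector
  d <- dist(features)  # Euclidean dissimilarity in feature space (assumption)
  sil <- cluster::silhouette(as.integer(factor(true_labels)), d)
  median(sil[, "sil_width"])  # summary statistic used in the comparisons above
}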
Normalized mutual information (NMI) [35] was used to evaluate the accuracy of various clustering results. Given two clustering schemes A = {A_1, ..., A_R} and B = {B_1, ..., B_S}, the overlap between A and B can be represented through the contingency table C (also named the confusion matrix) of size R × S, where C_{ij} denotes the number of cells shared by clusters A_i and B_j. Then the normalized mutual information NMI(A, B) of the two clustering schemes A and B is defined as follows:

NMI(A,B) = \frac{\sum_{i=1}^{R}\sum_{j=1}^{S} C_{ij}\log\frac{n\,C_{ij}}{C_{i\cdot}\,C_{\cdot j}}}{\sqrt{\left(\sum_{i=1}^{R} C_{i\cdot}\log\frac{C_{i\cdot}}{n}\right)\left(\sum_{j=1}^{S} C_{\cdot j}\log\frac{C_{\cdot j}}{n}\right)}}

where n is the total number of cells, C_{i·} is the number of cells assigned to cluster i in clustering scheme A, and C_{·j} is the number of cells assigned to cluster j in clustering scheme B. If A is identical to B, NMI(A, B) = 1. If A and B are completely independent, NMI(A, B) = 0. When true cluster labels were available, the NMI values between true cluster labels and various clustering results were used to evaluate the clustering accuracy. When true cluster labels were not available, NMI was used to evaluate clustering consistency between different subsampling rates in this study. Besides NMI, we also used the Rand index and adjusted Rand index to evaluate clustering accuracy and consistency, and obtained similar observations.
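The formula above can be transcribed directly into R; the helper below is illustrative and assumes two label vectors of equal length.

# NMI from a contingency table (direct transcription of the formula above).
nmi <- function(labels_a, labels_b) {
  C  <- table(labels_a, labels_b)  # contingency table C_ij
  n  <- sum(C)
  ri <- rowSums(C)                 # C_i.
  cj <- colSums(C)                 # C_.j
  num <- sum(ifelse(C > 0, C * log(n * C / outer(ri, cj)), 0))
  den <- sqrt(sum(ri * log(ri / n)) * sum(cj * log(cj / n)))
  num / den
}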
Many clustering algorithms are available. We selected five widely used clustering algorithms in this study to evaluate the impact of the Spearman correlation-based feature construction method. These five algorithms include three general clustering algorithms that were not initially designed for scRNA-seq data, i.e., affinity propagation (AP) [36], k-means [23], and k-medoids [37], and two algorithms that were specially designed for clustering of scRNA-seq data, i.e., SC3 [12] and SIMLR [13]. k-means and k-medoids are pure clustering algorithms that partition samples into groups, while AP, SC3, and SIMLR inherently include feature construction techniques. All these clustering algorithms were evaluated on five small-scale datasets (the Kolodziejczyk, Pollen, Usoskin, Zeisel, and Zheng datasets), while only SC3 was evaluated on the PBMC 68k dataset and only k-means was evaluated on the Macosko dataset, for simplicity. Parameters (ks = 10:12, gene_filter = FALSE, biology = FALSE, svm_max = 5000) were used for SC3 (default), whereas parameters (ks = 11, gene_filter = FALSE, biology = FALSE, svm_max = 200) were used for SC3+SSCC. On the Macosko dataset, ~5% and ~10% of cells were randomly picked out for clustering analyses. We used the k-nearest neighbor algorithm for classifying unsubsampled cells, which is robust to parameter selection.
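Putting the pieces together, a minimal SSCC run with k-means as the clustering algorithm and k-nearest neighbors as the classifier might look like the sketch below, reusing the illustrative helpers filter_genes and spearman_features from above; this is a sketch of the strategy, not the API of the sscClust package.

# End-to-end SSCC sketch with k-means + k-NN, reusing the helpers sketched above.
library(class)

sscc_kmeans <- function(expr, k, subsample_rate = 0.1, knn_k = 5) {
  expr <- filter_genes(expr)                              # gene filtration
  n <- ncol(expr)
  idx <- sample(n, ceiling(subsample_rate * n))           # subsampling
  feats <- spearman_features(t(expr[, idx, drop = FALSE]),
                             t(expr[, -idx, drop = FALSE]))  # feature step
  km <- kmeans(feats$clustering, centers = k, nstart = 25)   # clustering
  labels <- integer(n)
  labels[idx]  <- km$cluster
  labels[-idx] <- as.integer(class::knn(feats$clustering,    # classification
                                        feats$classification,
                                        cl = factor(km$cluster),
                                        k = knn_k))
  labels
}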
First, we evaluated whether feature extraction methods can improve the clustering results of scRNA-seq data. We calculated silhouette values to evaluate the consistency between cell features extracted using various methods and the reference labels. Silhouette values are frequently used to indicate whether a sample is properly clustered. Here we use silhouette values reversely, to indicate whether the extracted features are consistent with the reference cell labels. By comparing with the silhouette values of the original scRNA-seq data, we observed that most of the evaluated feature extraction methods can improve the silhouette values for many cells in multiple datasets (Figure 2). For the Kolodziejczyk [34] and Pollen [8] datasets, all five feature extraction methods improved the silhouette values compared with the original data. For the Usoskin [9] dataset, all methods except Euclidean and cosine showed significantly better performance. For the Zeisel [10] dataset, only Spearman correlation resulted in improvement for >80% of cells compared with the original data, while the other feature extraction methods except Euclidean resulted in little improvement. Euclidean gave even worse results for the Zeisel dataset, indicating low robustness. For the Zheng [5] dataset, most methods failed except the Spearman correlation method. The Spearman correlation-based feature extraction method consistently improved the accordance between cell features and labels on all five datasets. Considering the robustness of the Spearman correlation-based method and the great improvement of the silhouette values of single cells, we evaluated the accuracy, robustness, and efficacy of SSCC in the next section.
Figure 2 Consistency with true cluster labels between engineered features and the original data of five datasets. In each plot, each dot represents a cell. Silhouette values calculated using true cluster labels and the original data are shown on the X axis, whereas silhouette values calculated using true cluster labels and the engineered features are shown on the Y axis. A silhouette value of 1 represents a perfect match between labels and features, whereas a silhouette value of -1 indicates that the cell might be mis-clustered. The percentage in the plotting area of each plot indicates the fraction of cells above the diagonal. The five datasets tested are the Kolodziejczyk dataset [34], Pollen dataset [8], Usoskin dataset [9], Zeisel dataset [10], and Zheng dataset [5].
While subsampling can greatly boost the efficiency of clustering large scRNA-seq data, it often compromises the clustering accuracy. We observed that the improvements of silhouette scores by SSCC were robust to subsampling fluctuations (Figure 3). For all five datasets evaluated, the silhouette values of Spearman correlation-based features were almost unchanged across subsampling rates (Figure 3). These data suggest that features constructed using SSCC at low subsampling rates may contain information approximating that obtained with the total cell population.
Figure 3 Silhouette values between Spearman correlation features and true cluster labels are independent of subsampling rates in five datasets. Spearman correlation features were constructed at various subsampling rates of the original data in the five datasets. In each plot, each dot represents a cell. Silhouette values of Spearman correlation features constructed with 100% of cells are shown on the X axis, whereas silhouette values of Spearman correlation features constructed with 10%, 20%, 30%, 40%, and 50% of cells in each dataset are shown on the Y axis. The Pearson correlation between the X and Y axes was calculated, with the correlation coefficient (r) provided in the upper triangle and the corresponding P value provided in the lower triangle of each plot.
We further evaluated whether the improved silhouette values can be translated into clustering accuracy. By evaluating five clustering algorithms including k-means, k-medoids, AP, SC3, and SIMLR, we observed that compared to SCC, SSCC can significantly improve the clustering accuracy in terms of NMI for all five clustering algorithms on all the benchmark datasets tested (Figure 4). The accuracy improvements measured by ΔNMI range from 0.12 to 0.60 for the Kolodziejczyk dataset, 0.04 to 0.19 for the Pollen dataset, 0.14 to 0.37 for the Usoskin dataset, 0.02 to 0.28 for the Zeisel dataset, and 0.10 to 0.28 for the Zheng dataset, depending on the algorithms and subsampling rates chosen. Other accuracy metrics including the Rand index, adjusted Rand index, and adjusted mutual information reveal the same trends (data not shown), suggesting that SSCC can greatly enhance the power of multiple clustering algorithms when subsampling is used.
Figure 4 Clustering performance comparison between SCC and SSCC with varied subsampling rates in five datasets. Clustering accuracy using SCC and SSCC was measured at various subsampling rates of the original data in the five datasets, i.e., the percentage of cells used in clustering. The clustering accuracy is indicated using NMI. For each subsampling rate, calculations were repeated ten times, based on which the average and the standard deviation of the clustering accuracy were calculated and plotted. NMI, normalized mutual information; SSCC, Spearman subsampling-clustering-classification; AP, affinity propagation.
In practice, the reference cell labels are generally unknown. The confidence of clustering results is often evaluated by the consistency between different algorithms. Due to subsampling fluctuations, clustering results based on SCC are inconsistent among different subsampling operations. However, in the new framework of SSCC, the consistency was much improved for all evaluated clustering algorithms on all datasets (Figure 5). For the Kolodziejczyk dataset, all five clustering algorithms had consistency >0.5 (measured by NMI) in SSCC, while the corresponding consistency in SCC was much smaller. For the Pollen dataset, SSCC still showed better performance than SCC, although both frameworks had high clustering consistency. Similar trends were observed on the Usoskin, Zeisel, and Zheng datasets.
Besides the aforementioned five scRNA-seq datasets, we further tested SSCC on two additional large scRNA-seq datasets. One is the PBMC 68k dataset [18], which contains 10× Genomics-based expression data for 68,578 blood cells from a healthy donor. The other is the Macosko dataset [19], which contains 49,300 mouse retina cells lacking experimentally determined cell labels. The large cell numbers generally prohibit classic scRNA-seq clustering algorithms from running on a desktop computer, thus providing two realistic examples to demonstrate the performance of SCC and SSCC.
For the PBMC 68k dataset, we compared SSCC with SCC using SC3 [12] as the clustering algorithm. The SC3 software package inherently applies an SCC strategy to handle large scRNA-seq datasets. By default, if a dataset has more than 5000 cells, the SCC strategy will be triggered, with 5000 cells randomly subsampled for SC3 clustering and the other cells left for classification by SVM. We applied SC3 to the PBMC 68k dataset on a desktop computer with 8 GB memory and a 3 GHz 4-core CPU and repeated the run ten times. The average clustering accuracy of SC3 in terms of NMI was 0.48, the calculation took 99 min on average, and the maximum memory usage exceeded 5.6 GB (Figure 6A). With the SSCC strategy, the average clustering accuracy reached 0.59, representing a ~21% increase over SC3 with the default parameters. It is of note that the computation time was dramatically reduced to 2.2 min on average, representing a 50-fold acceleration. Meanwhile, the maximum memory usage of SC3+SSCC was 3.7 GB, saving >33% compared to that of SC3 with the default parameters. Compared to dropClust [20], a clustering algorithm specialized for large scRNA-seq datasets, SC3+SSCC also demonstrated superior performance in terms of clustering accuracy, speed, and memory usage (Figure 6A).
Figure 5 Comparison of clustering consistency between SSCC and SCC for five datasets. The consistency (measured by NMI) of clustering between using 10% of cells and using 50% of cells with SCC is shown on the X axis, whereas the consistency (measured by NMI) of clustering between using 10% of cells and using 50% of cells with SSCC is shown on the Y axis. Subsamplings were repeated ten times and each subsampling result was processed using the five clustering algorithms shown on the left.
For the Macosko dataset, using k-means as the clustering algorithm and k-nearest neighbors for classification, the SCC strategy resulted in a large average silhouette difference (0.29) between two subsampling schemes (-0.80 with 5% of cells and -0.51 with 10% of cells), whereas the difference using SSCC became negligible (0.01). The NMI values between the two subsampling schemes were 0.60 and 0.69 when using SCC and SSCC, respectively. The Pearson correlation coefficient of silhouette values between the two subsampling schemes increased from 0.47 to 0.58 when switching from SCC to SSCC (Figure 6B).
All these metrics demonstrate that SSCC not only greatly improves the clustering efficiency and accuracy for large-scale scRNA-seq datasets, but also greatly improves the consistency.
The availability of large-scale scRNA-seq data raises an urgent need for efficient and accurate clustering tools. Currently, a few scRNA-seq data analysis packages have been proposed to address this challenge. Of these tools, SC3 [12], Seurat [11], and dropClust [20] adopt an SCC strategy, bigScale [21] employs a convolution strategy to merge similar single cells into mega-cells by a greedy-searching algorithm, and SCANPY [22] uses Python as the programming language to accelerate the clustering process. Although these strategies greatly boost the efficiency of large scRNA-seq data analysis, there remains much room for further improvement. In particular, the SCC strategy suffers from biases introduced by subsampling, which may greatly decrease the clustering accuracy and robustness, although it can reduce the computational complexity from O(n²) to O(n). Here we introduce feature engineering and projecting techniques into the SCC framework and propose SFCC as an alternative. Specifically, with Spearman correlation as the feature engineering and projecting method, we formulate a framework named SSCC, which can significantly improve the clustering accuracy and consistency of many general and specially designed clustering algorithms. Evaluations on real scRNA-seq datasets, which cover a wide range of scRNA-seq technologies, sequencing depths, and organisms, demonstrate the robustness of the superior performance of SSCC. Therefore, SSCC is expected to be a useful computational framework that can further unleash the great power of scRNA-seq in the future.
Figure 6 Clustering performance evaluation of SSCC on two extremely large scRNA-seq datasets. A. Performance comparison between SC3 (default), dropClust, and SC3+SSCC on the PBMC 68k dataset [18] in terms of clustering accuracy, running time, and maximum memory required. In total, 5000 cells were subsampled for SC3 (default), while 200 cells were subsampled for SC3+SSCC. B. Consistency comparison between SSCC (right) and SCC (left) evaluated on 49,300 mouse retina cells in the Macosko dataset [19]. Silhouette values of two clustering schemes (using 2000 cells and 4930 cells, respectively) were plotted and then Pearson correlation coefficients were calculated. The 39 cell clusters were colored according to cluster labels based on ~10% of cells and the original expression data.
XR and ZZ designed the study. XR and LZ collected the data, implemented the software, and performed the analysis. XR and ZZ wrote the manuscript. All authors read and approved the final manuscript.
The authors have declared no competing interests.
This project was supported by grants from the Beijing Advanced Innovation Center for Genomics at Peking University, the Key Technologies R&D Program (Grant No. 2016YFC0900100) by the Ministry of Science and Technology of China, and the National Natural Science Foundation of China (Grant Nos. 81573022 and 31530036).