scimap.tl.cluster
Function Call
scimap.tl.cluster
(
adata,
method='kmeans',
subset_genes=None,
sub_cluster=False,
sub_cluster_column='phenotype',
sub_cluster_group=None,
parc_small_pop=50,
parc_too_big_factor=0.4,
k=10,
n_pcs=None,
resolution=1,
phenograph_clustering_metric='euclidean',
nearest_neighbors=30,
use_raw=True,
random_state=0,
collapse_labels=False)
Short description
The cluster
function allows users to cluster single-cell data.
The function currently supports four clustering algorithms- kmeans, phenograph, leiden and parc.
The function also allows users to sub-cluster existing clusters by setting sub_cluster=True
. Check arguments
sub_cluster_column
and sub_cluster_group
for more information.
Additionally, if the user wishes to use only a subset of genes for the purpose of clustering, it can be
acheived by passing the genes as a list to subset_genes
.
The resultant clusters are saved under adata.obs[method used]
.
Parameters
adata
: AnnData Object
method
: string, optional (The default is 'kmeans')
Clustering method to be used- Implemented methods- kmeans, phenograph, leiden and parc.
subset_genes
: list, optional (The default is None)
Pass a list of genes ['CD3D', 'CD20', 'KI67'] that should be included for the purpose of clustering.
By default the algorithm uses all genes in the dataset.
sub_cluster
: bool, optional (The default is False)
If the user has already performed clustering or phenotyping previously and would like to sub-cluster within a particular cluster/phenotype, this option can be used.
sub_cluster_column
: string, optional (The default is 'phenotype')
The column name that contains the cluster/phenotype information to be sub-clustered. This is only required when sub_cluster is set to True.
sub_cluster_group
: list, optional (The default is None)
By default the program will sub-cluster all groups within column passed through the argument sub_cluster_column. If user wants to sub cluster only a subset of phenotypes/clusters this option can be used. Pass them as list e.g. ["tumor", "b cells"].
parc_small_pop
: int, optional (The default is 50)
Smallest cluster population to be considered a community in PARC clustering.
parc_too_big_factor
: float, optional (The default is 0.4)
If a cluster exceeds this share of the entire cell population, then the PARC will be run on the large cluster. at 0.4 it does not come into play.
k
: int, optional (The default is 10)
Number of clusters to return when using K-Means clustering.
n_pcs
: int, optional (The default is None)
Number of PC's to be used in leiden clustering. By default it uses all PC's.
resolution
: float, optional (The default is 1)
A parameter value controlling the coarseness of the clustering. Higher values lead to more clusters.
phenograph_clustering_metric
: string, optional (The default is 'euclidean')
Distance metric to define nearest neighbors. Note that performance will be slower for correlation and cosine.
Available methods- cityblock’, ‘cosine’, ‘euclidean’, ‘manhattan’, braycurtis’, ‘canberra’, ‘chebyshev’, ‘correlation’, ‘dice’, ‘hamming’, ‘jaccard’, ‘kulsinski’, ‘mahalanobis’, ‘minkowski’, ‘rogerstanimoto’, ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘sqeuclidean’, ‘yule’
nearest_neighbors
: int, optional (The default is 30)
Number of nearest neighbors to use in first step of graph construction. This parameter is used both in leiden
and phenograph
clustering.
use_raw
: bool, optional (The default is True)
If True, log transformed raw data will be used for clustering. If False, normalized/scaled data within adata.X
will be used.
random_state
: int, optional (The default is 0)
Change the initialization of the optimization.
collapse_labels
: bool, optional (The default is False)
While sub clustering only a few phenotypes/clusters, this argument helps to group all the other phenotypes/clusters into a single category - Helps in visualisation.
label
: string, optional (The default is None)
Key or optional column name for the returned data, stored in adata.obs
. The default is adata.obs[method used].
Returns
AnnData
object with the results stored in adata.obs[method used]
.
Example
# Running clustering on entire data using K-means method
adata = sm.tl.cluster (adata, method = 'kmeans', k= 10, use_raw = True)
# Sub-cluster a already named cluster called `Tumor` using leiden clustering
adata = sm.tl.cluster (adata, method = 'leiden', resolution = 0.5,
nearest_neighbors = 20, use_raw = True,
sub_cluster=True, sub_cluster_column='phenotype', sub_cluster_group='Tumor')
# Run phenograph clustering by only using a subset of genes
gene_subset = ['CD25', 'CD2', 'CD10', 'CD163', 'CD3D', 'CD5', 'CD30', 'ACTIN', 'CD45',
'CD206', 'CD68', 'PD1', 'KI67', 'CD11C', 'CD7', 'CD8A', 'FOXP3', 'CD20']
adata = sm.tl.cluster (adata, subset_genes = gene_subset, method = 'phenograph',
nearest_neighbors = 10, use_raw = True)