Skip to content

Exploratory analysis (Jupyter)

We will do some exploratory data analysis on the adult mouse hippocampus dataset that you just preprocessed.

Here, we show our code and results as a jupyter notebook, but you can copy and paste the code into a standalone script, and it will still work.

Loading the modules

import scanpy as sc
import matplotlib.pyplot as plt
import seaborn as sns

Loading the data

adata = sc.read_h5ad("alignment/openst_demo_e13_mouse_head_by_cell.h5ad")

After transfering the segmentation information, the cell with the identifier 0 will have all the transcripts belonging to the background. Therefore, make sure to omit it from the dataset.


We all know that views in anndata can be a bit problematic sometimes... You might want to append .copy() at the end of the following line of code if you encounter issues.

adata = adata[adata.obs.cell_ID_mask != 0].copy()

Calculating QC metrics of the sample

Now, we can show different plots (histograms) to assess the number of unique transcripts (UMIs), genes, percentage of mitochondrial transcripts per segmented cell, as a quick way of assessing the quality of the dataset. This will help to decide on filtering thresholds, to remove potential low quality cells (i.e., due to poor sequencing coverage, or wrong segmentation of background as cells).

sc.pp.calculate_qc_metrics(adata, inplace=True)
adata.var["mt"] = adata.var_names.str.startswith("mt-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)
fig, axs = plt.subplots(1, 3, figsize=(12, 4))
sns.histplot(adata.obs["total_counts"], kde=False, bins=60, ax=axs[0])
axs[0].axvline(250, 0,19000)
sns.histplot(adata.obs["pct_counts_mt"], kde=False, bins=60, ax=axs[1])
sns.histplot(adata.obs["n_genes_by_counts"], kde=False, bins=60, ax=axs[2])
<AxesSubplot: xlabel='n_genes_by_counts', ylabel='Count'>
No description has been provided for this image

Here we apply the following filters, decided by looking at the histogram. You can be more conservative, we chose these in order to keep most of the cells while removing very obvious bad quality data points.

# Filter data
sc.pp.filter_cells(adata, min_counts=250)
sc.pp.filter_cells(adata, max_counts=10000)
# sc.pp.filter_cells(adata, max_counts=35000)
adata = adata[adata.obs["pct_counts_mt"] &lt; 20]
print(f"#cells after MT filter: {adata.n_obs}")
sc.pp.filter_genes(adata, min_cells=10) 
#cells after MT filter: 49687

/home/dleonpe/miniconda3/envs/napari-env/lib/python3.9/site-packages/scanpy/preprocessing/ ImplicitModificationWarning: Trying to modify attribute `.var` of view, initializing view as actual.
  adata.var['n_cells'] = number


From this point, some of the analysis decisions are just heuristics taken from single cell analysis. We would not say these are the best possible practices for analyzing this data, but the most common ones. Especially, regarding normalization (see Warning below).

Let's apply the common normalization to \(10^4\) counts per cell, and the log-normalization with pseudocount. This is supposed to stabilize the variance of genes and remove potential biases from sequencing depth per cell.

sc.pp.normalize_total(adata, inplace=True)


Normalization is necessary, i.e., to account for the differences in depth per cell and remove potential biases, or to estabilize the variance of genes and remove their dependency on the counts.

It is still under study, but normalization strategies for single-cell are most likely not well suited for this kind of high-resolution, single-cell spatial data. Especially, since counts have an additional source of (spatial) coviariance that does not exist in single-cell datasets. Therefore, it is possible that the covariance and errors from spatial components are propagated throughout the analysis pipeline, and some signal or significant results are just noise from the spatial autocorrelations. Anyway, we are using these common practices just as a way of getting a feeling about the data, and what could be genes playing a role at specific regions in space.

Now, we detect highly variable genes. Any of the flavor(s) are designed with single-cell data in mind, so the list of highly variable genes that they select will be likely affected by the spatial autocorrelation structure. Alternative strategies could be selecting these genes based on the spatial variable genes (e.g., by Moran's I value).

sc.pp.highly_variable_genes(adata, flavor="seurat", n_top_genes=2000)

Dimensionality reduction and clustering

We perform the dimensionality reduction with PCA, and community clustering of the nearest neighbors graph using the leiden algorithm. For this exploratory analysis, we keep the default parameters.

TODO: put the PCA loadings and select from this.

# Clustering &amp; embedding
sc.pp.neighbors(adata), resolution = 0.9, key_added="leiden")

Now we can take a look at the clusters in space. It seems like we recapitulate the major morphological structures from this tissue. We also show Ttr, a marker of the choroid plexus, which seems to be restricted to a specific location in space, that also clusters separately.

plt.rcParams["figure.figsize"] = (8, 8), img_key=None, color=["total_counts", "n_genes_by_counts", "leiden", "Ttr"], spot_size=40)