Preprocessing sequencing data¶
After the experimental protocol, there are two major computational steps:
- Mapping the barcodes from the flow cell to spatial coordinates. Important: this is done only once per flow cell, and will be useful for ~80-300 experiments when capture areas are sized 3x4 mm.
- Map the transcriptomic reads to reference genome and tissue space. This is done once per sample. In the case of 3D reconstruction, all steps are the same, but you will repeat them for each individual library (i.e., one per section).
First of all, make sure that the working directory is set to the one you created previously:
Also, we set the base URL for the server where data will be downloaded from
Generating barcode-to-coordinate map¶
For each flow cell (thus, done only once), we generate plain text files with three columns: cell_bc
, x_pos
, and y_pos
. These files are later used by spacemake
to reconstruct the spatial coordinates from transcriptomic libraries. This process is performed only once per barcoded flow cell.
Software dependencies
Running openst flowcell_map
below requires installing either bcl2fastq
or bclconvert
.
You can find instructions for bcl2fastq
,
and bclconvert
.
Then, make sure they are added to the PATH
environment variable.
For instance, in Linux:
Make sure you use a version of these softwares compatible with your sequencer.
To reproduce this human metastatic lymph node example, we provide the BCLs of the flow cell used for these tissue sections. When you run your own data, you or your sequencing facility will need to preprocess the flow-cell barcodes, only once.
Download the BCLs to your machine:
After the bcl2fastq
or bclconvert
dependencies have been installed, and you have downloaded the BCL, you can create the barcode-to-coordinate map for all tiles:
openst flowcell_map \
--bcl-in $WORKDIR/raw_data/flowcell_data \
--tiles-out $WORKDIR/raw_data/tiles \
--crop-seq 5:30 \
--rev-comp
This command will barcode-to-coordinate maps at $WORKDIR/raw_data/tiles
- as many .txt.gz
files as tiles in the barcoded flow cell.
Alternatively, you can skip the step above by downloading the expected results and putting them directly into the $WORKDIR/raw_data/tiles
folder, e.g.:
wget ${BASE_URL}/tile_data.tar.gz
tar xvf tile_data.tar.gz --strip-components=1 -C $WORKDIR/raw_data/tiles/.
This is most likely not compatible with your own barcoded flow cell, as it will have completely different barcode-to-coordinate maps.
Warning
spacemake
, when using the default openst
mode, needs a coordinate system whose column cell_bc
matches the names of the files in the root of $WORKDIR/raw_data/tiles
.
By default, spacemake
includes a coordinate system that follows the naming:
However, the files provided in our example follow the names:
To accomodate for this, you can use the coordinate system provided here
This can be downloaded directly with:
mkdir -p $WORKDIR/raw_data/coordinate_system
wget http://bimsbstatic.mdc-berlin.de/rajewsky/openst-public-data/openst_metastatic_lymph_node/coordinate_system/openst_coordinate_system.csv -O $WORKDIR/raw_data/coordinate_system/openst_coordinate_system.csv
Later, we will configure it in spacemake
.
Preparing the data¶
The complete dataset that inspires this tutorial is available to download from GEO. Here, to make all processing quicker, we will use a downsampled version (~50M reads instead of 500M-1B reads), that is available from our server.
Below we use the wget
command to download the files to the correct relative locations:
# create folder for the raw data
mkdir -p $WORKDIR/raw_data/reads
mkdir -p $WORKDIR/raw_data/images
# download all raw data
## reads
cd $WORKDIR/raw_data/reads
wget ${BASE_URL}/reads/mLN_S2_R1.fastq.gz &
wget ${BASE_URL}/reads/mLN_S2_R2.fastq.gz &
wget ${BASE_URL}/reads/mLN_S3_R1.fastq.gz &
wget ${BASE_URL}/reads/mLN_S3_R2.fastq.gz &
wget ${BASE_URL}/reads/mLN_S4_R2.fastq.gz &
wget ${BASE_URL}/reads/mLN_S4_R1.fastq.gz
## images
cd $WORKDIR/raw_data/images
wget ${BASE_URL}/images/mLN_S2.tif &
wget ${BASE_URL}/images/mLN_S3.tif &
wget ${BASE_URL}/images/mLN_S4.tif
Transcriptomic & spatial mapping with spacemake
¶
Initialize¶
Create the folder where spacemake
will be initialized and run
Then, intialize the conda environment we created for spacemake
(see notebook 0_environment.ipynb
)
Then, following the spacemake
Quick start guide, browse to the spacemake directory you just created in the openst_demo
folder, and run the initialization.
Configure¶
As spacemake
comes with no default value for species, before anything can be done, a new species has to be added. In this case, we add mouse; you will need to download the correct fa
and gtf
files. For instance, you can download the mouse genome from gencode, as well as the annotation.
Then, you need to run the following commands:
mkdir -p $WORKDIR/genomes
wget http://bimsbstatic.mdc-berlin.de/rajewsky/openst-public-data/genomes/GRCh38p13.genome.fa -O $WORKDIR/genomes/GRCh38p13.genome.fa
wget http://bimsbstatic.mdc-berlin.de/rajewsky/openst-public-data/genomes/gencodev41.annotation.gtf -O $WORKDIR/genomes/gencodev41.annotation.gtf
wget http://bimsbstatic.mdc-berlin.de/rajewsky/openst-public-data/genomes/human.rRNA.fa -O $WORKDIR/genomes/human.rRNA.fa
spacemake config add_species \
--name human \
--reference genome \
--sequence $WORKDIR/genomes/GRCh38p13.genome.fa \
--annotation $WORKDIR/genomes/gencodev41.annotation.gtf
spacemake config add_species \
--name human \
--reference rRNA \
--sequence $WORKDIR/genomes/human.rRNA.fa
Additionally, make sure you use the correct coordinate system for the tile data relevant for your flow cell. In this specific example, you will need to copy the coordinate system downloaded above, as:
Add sample¶
Now you add the sample data and metadata to spacemake
.
For simplicity, we provide the tile barcode files that are related to this sample, as well as the coordinate system for the Illumina flow cell that was used to generate the capture area of this experiment. Notice that we wrap it inside a for
loop to add the three sections (with IDs 2, 3, 4) at once.
for sample in mLN_S{2..4}; do
spacemake projects add_sample \
--project_id mLN \
--sample_id "$sample" \
--R1 $WORKDIR/raw_data/reads/"${sample}"_R1.fastq.gz \
--R2 $WORKDIR/raw_data/reads/"${sample}"_R2.fastq.gz \
--species human \
--puck openst \
--run_mode openst \
--barcode_flavor openst \
--puck_barcode_file $WORKDIR/raw_data/tiles/*.txt.gz \
--map_strategy "bowtie2:rRNA->STAR:genome:final"
done
Run¶
That's it! Now, you can run spacemake
:
Modify the number of --cores
depending on your machine (minimum of 4 cores). Using more cores will also use more memory.
Warning
Since this is a subsampled dataset (<100M reads, too few for these tissue section size), there's very few cells that go beyond the UMI cutoffs set at the run-mode, for the generation of automated reports. You can change these to lower values. Otherwise, some rules (concerning automated reports) might fail if you run spacemake==0.7.9
. If only rules that fail are related to automated_analysis
, you can ignore the errors (important to use --keep-going
), and still proceed with the rest of the workflow.
Quality control & troubleshooting¶
spacemake
automatically creates html
reports with convenient information about library QC and automated analysis (clustering, gene markers...).
These are found at the sample's folders (e.g., for mLN_S2
):
$WORKDIR/spacemake/projects/mLN/processed_data/mLN_S2/illumina/complete_data/
inside the qc_sheets
and automated_analysis
subfolders.
- The folder
qc_sheets
containshtml
reports with basic visualizations like histograms of unique molecules and genes per spatial unit (e.g., meshed/pseudo-cells of default size), and other metrics such as PCR bias - The folder
automated_analysis
contains different subfolders with different UMI thresholds, and the results of automated clustering, neighborhood analysis, and differential gene expression between clusters (i.e., marker gene analysis)
Here you can browse the QC reports we obtained for these (downsampled) data:
Taking a look at these files gives a first impression of the quality of the data:
- Did the data yield the expected genes or molecules per ~cell with the chosen sequencing depth?
- Can one tell the tissue structure apart from the background by looking at UMIs or genes?
- Was the library efficiently amplified, or did the capture work well?
- Are there any noticeable spatial artifacts, e.g., missing areas of tissue?
Next steps¶
After you successfully run spacemake
, you can proceed with the specific steps of Open-ST data with the openst
package.
Otherwise, you can use the hexbin objects, which are a good approximation for expression at cellular resolution (but not suitable for subcellular localization, or neighborhood analysis).
These files are found at (e.g., for mLN_S2
):
$WORKDIR/spacemake/projects/mLN/processed_data/mLN_S2/illumina/complete_data/dge/dge.all.polyA_adapter_trimmed.mm_included.spatial_beads.mesh_7_hexagon_puck_collection
These can be used for pairwise alignment to the images, but cannot be used for segmentation. In that case, you need to use the 0.6 µm-resolved spots, and not this 7 µm-side hexagon binning.
Also, these can be used for 3D registration, but we recommend using the segmented objects.