Skip to content

Preprocessing sequencing data

After the experimental protocol, there are two major computational steps:

  1. Mapping the barcodes from the flow cell to spatial coordinates. Important: this is done only once per flow cell, and will be useful for ~80-300 experiments when capture areas are sized 3x4 mm.
  2. Map the transcriptomic reads to reference genome and tissue space. This is done once per sample. In the case of 3D reconstruction, all steps are the same, but you will repeat them for each individual library (i.e., one per section).

First of all, make sure that the working directory is set to the one you created previously:

WORKDIR="/home/user/openst_demo"
mkdir -p $WORKDIR
cd $WORKDIR

Also, we set the base URL for the server where data will be downloaded from

BASE_URL="http://bimsbstatic.mdc-berlin.de/rajewsky/openst-public-data/openst_metastatic_lymph_node"

Generating barcode-to-coordinate map

For each flow cell (thus, done only once), we generate plain text files with three columns: cell_bc, x_pos, and y_pos. These files are later used by spacemake to reconstruct the spatial coordinates from transcriptomic libraries. This process is performed only once per barcoded flow cell.

Software dependencies

Running openst flowcell_map below requires installing either bcl2fastq or bclconvert. You can find instructions for bcl2fastq, and bclconvert.

Then, make sure they are added to the PATH environment variable.

For instance, in Linux:

export PATH=/path/to/bcl2fastq:$PATH
# or
# export PATH=/path/to/bclconvert:$PATH

Make sure you use a version of these softwares compatible with your sequencer.

To reproduce this human metastatic lymph node example, we provide the BCLs of the flow cell used for these tissue sections. When you run your own data, you or your sequencing facility will need to preprocess the flow-cell barcodes, only once.

Download the BCLs to your machine:

mkdir -p $WORKDIR/raw_data/flowcell_data
mkdir -p $WORKDIR/raw_data/tiles

wget ${BASE_URL}/flowcell_data.tar.gz
tar xvf flowcell_data.tar.gz $WORKDIR/raw_data/flowcell_data

After the bcl2fastq or bclconvert dependencies have been installed, and you have downloaded the BCL, you can create the barcode-to-coordinate map for all tiles:

openst flowcell_map \
    --bcl-in $WORKDIR/raw_data/flowcell_data \
    --tiles-out $WORKDIR/raw_data/tiles \
    --crop-seq 5:30 \
    --rev-comp

This command will barcode-to-coordinate maps at $WORKDIR/raw_data/tiles - as many .txt.gz files as tiles in the barcoded flow cell.

Alternatively, you can skip the step above by downloading the expected results and putting them directly into the $WORKDIR/raw_data/tiles folder, e.g.:

wget ${BASE_URL}/tile_data.tar.gz
tar xvf tile_data.tar.gz --strip-components=1 -C $WORKDIR/raw_data/tiles/.

This is most likely not compatible with your own barcoded flow cell, as it will have completely different barcode-to-coordinate maps.

Warning

spacemake, when using the default openst mode, needs a coordinate system whose column cell_bc matches the names of the files in the root of $WORKDIR/raw_data/tiles.

By default, spacemake includes a coordinate system that follows the naming:

fc_1_L1_tile_1101
fc_1_L1_tile_1102
...
fc_1_L4_tile_2678

However, the files provided in our example follow the names:

fc_1_1_1101.txt.gz
fc_1_1_1102.txt.gz
...
fc_1_4_2678.txt.gz

To accomodate for this, you can use the coordinate system provided here

This can be downloaded directly with:

mkdir -p $WORKDIR/raw_data/coordinate_system
wget http://bimsbstatic.mdc-berlin.de/rajewsky/openst-public-data/openst_metastatic_lymph_node/coordinate_system/openst_coordinate_system.csv -O $WORKDIR/raw_data/coordinate_system/openst_coordinate_system.csv

Later, we will configure it in spacemake.

Preparing the data

The complete dataset that inspires this tutorial is available to download from GEO. Here, to make all processing quicker, we will use a downsampled version (~50M reads instead of 500M-1B reads), that is available from our server.

Below we use the wget command to download the files to the correct relative locations:

# create folder for the raw data
mkdir -p $WORKDIR/raw_data/reads
mkdir -p $WORKDIR/raw_data/images

# download all raw data
## reads
cd $WORKDIR/raw_data/reads
wget ${BASE_URL}/reads/mLN_S2_R1.fastq.gz &
wget ${BASE_URL}/reads/mLN_S2_R2.fastq.gz &
wget ${BASE_URL}/reads/mLN_S3_R1.fastq.gz &
wget ${BASE_URL}/reads/mLN_S3_R2.fastq.gz &
wget ${BASE_URL}/reads/mLN_S4_R2.fastq.gz &
wget ${BASE_URL}/reads/mLN_S4_R1.fastq.gz

## images
cd $WORKDIR/raw_data/images
wget ${BASE_URL}/images/mLN_S2.tif &
wget ${BASE_URL}/images/mLN_S3.tif &
wget ${BASE_URL}/images/mLN_S4.tif

Transcriptomic & spatial mapping with spacemake

Initialize

Create the folder where spacemake will be initialized and run

mkdir -p $WORKDIR/spacemake

Then, intialize the conda environment we created for spacemake (see notebook 0_environment.ipynb)

cd $WORKDIR/spacemake

mamba activate openst

You will have a folder structure like:

/home/user # or other root folder
|-- openst_demo
|   |-- raw_data
|   |   |-- raw_reads
|   |   |   |-- mLN_S2_R1.fastq.gz
|   |   |   `-- ...
|   |   `-- ... 
|   `-- spacemake

Then, following the spacemake Quick start guide, browse to the spacemake directory you just created in the openst_demo folder, and run the initialization.

wget https://github.com/broadinstitute/Drop-seq/releases/download/v2.5.1/Drop-seq_tools-2.5.1.zip -O Drop-seq_tools-2.5.1.zip
unzip Drop-seq_tools-2.5.1.zip

spacemake init \
    --dropseq_tools Drop-seq_tools-2.5.1

Configure

As spacemake comes with no default value for species, before anything can be done, a new species has to be added. In this case, we add mouse; you will need to download the correct fa and gtf files. For instance, you can download the mouse genome from gencode, as well as the annotation.

Then, you need to run the following commands:

mkdir -p $WORKDIR/genomes
wget http://bimsbstatic.mdc-berlin.de/rajewsky/openst-public-data/genomes/GRCh38p13.genome.fa -O $WORKDIR/genomes/GRCh38p13.genome.fa
wget http://bimsbstatic.mdc-berlin.de/rajewsky/openst-public-data/genomes/gencodev41.annotation.gtf -O $WORKDIR/genomes/gencodev41.annotation.gtf
wget http://bimsbstatic.mdc-berlin.de/rajewsky/openst-public-data/genomes/human.rRNA.fa -O $WORKDIR/genomes/human.rRNA.fa

spacemake config add_species \
   --name human \
   --reference genome \
   --sequence $WORKDIR/genomes/GRCh38p13.genome.fa \
   --annotation $WORKDIR/genomes/gencodev41.annotation.gtf

spacemake config add_species \
   --name human \
   --reference rRNA \
   --sequence $WORKDIR/genomes/human.rRNA.fa

Additionally, make sure you use the correct coordinate system for the tile data relevant for your flow cell. In this specific example, you will need to copy the coordinate system downloaded above, as:

cp $WORKDIR/raw_data/coordinate_system/openst_coordinate_system.csv $WORKDIR/spacemake/puck_data/openst_coordinate_system.csv

Add sample

Now you add the sample data and metadata to spacemake.

For simplicity, we provide the tile barcode files that are related to this sample, as well as the coordinate system for the Illumina flow cell that was used to generate the capture area of this experiment. Notice that we wrap it inside a for loop to add the three sections (with IDs 2, 3, 4) at once.

for sample in mLN_S{2..4}; do
    spacemake projects add_sample \
        --project_id mLN \
        --sample_id "$sample" \
        --R1 $WORKDIR/raw_data/reads/"${sample}"_R1.fastq.gz \
        --R2 $WORKDIR/raw_data/reads/"${sample}"_R2.fastq.gz \
        --species human \
        --puck openst \
        --run_mode openst \
        --barcode_flavor openst \
        --puck_barcode_file $WORKDIR/raw_data/tiles/*.txt.gz \
        --map_strategy "bowtie2:rRNA->STAR:genome:final"
done

Run

That's it! Now, you can run spacemake:

spacemake run --cores 16 --keep-going

Modify the number of --cores depending on your machine (minimum of 4 cores). Using more cores will also use more memory.

Warning

Since this is a subsampled dataset (<100M reads, too few for these tissue section size), there's very few cells that go beyond the UMI cutoffs set at the run-mode, for the generation of automated reports. You can change these to lower values. Otherwise, some rules (concerning automated reports) might fail if you run spacemake==0.7.9. If only rules that fail are related to automated_analysis, you can ignore the errors (important to use --keep-going), and still proceed with the rest of the workflow.

Quality control & troubleshooting

spacemake automatically creates html reports with convenient information about library QC and automated analysis (clustering, gene markers...).

These are found at the sample's folders (e.g., for mLN_S2):

$WORKDIR/spacemake/projects/mLN/processed_data/mLN_S2/illumina/complete_data/

inside the qc_sheets and automated_analysis subfolders.

  1. The folder qc_sheets contains html reports with basic visualizations like histograms of unique molecules and genes per spatial unit (e.g., meshed/pseudo-cells of default size), and other metrics such as PCR bias
  2. The folder automated_analysis contains different subfolders with different UMI thresholds, and the results of automated clustering, neighborhood analysis, and differential gene expression between clusters (i.e., marker gene analysis)

Here you can browse the QC reports we obtained for these (downsampled) data:

Taking a look at these files gives a first impression of the quality of the data:

  • Did the data yield the expected genes or molecules per ~cell with the chosen sequencing depth?
  • Can one tell the tissue structure apart from the background by looking at UMIs or genes?
  • Was the library efficiently amplified, or did the capture work well?
  • Are there any noticeable spatial artifacts, e.g., missing areas of tissue?

Next steps

After you successfully run spacemake, you can proceed with the specific steps of Open-ST data with the openst package.

Otherwise, you can use the hexbin objects, which are a good approximation for expression at cellular resolution (but not suitable for subcellular localization, or neighborhood analysis).

These files are found at (e.g., for mLN_S2):

$WORKDIR/spacemake/projects/mLN/processed_data/mLN_S2/illumina/complete_data/dge/dge.all.polyA_adapter_trimmed.mm_included.spatial_beads.mesh_7_hexagon_puck_collection

These can be used for pairwise alignment to the images, but cannot be used for segmentation. In that case, you need to use the 0.6 µm-resolved spots, and not this 7 µm-side hexagon binning.

Also, these can be used for 3D registration, but we recommend using the segmented objects.