Preprocessing transcriptomic library

After sequencing the Open-ST transcriptomic library, you will obtain basecall files in bcl format or raw reads in fastq format (see the sequence file formats described on Illumina's website).

For the Open-ST workflow, we leverage spacemake, an automated pipeline designed for the preprocessing, alignment, and quantification of single-cell and spatial transcriptomics data.

Configuring spacemake

We refer to the official documentation for a complete tutorial on how to install and initialize spacemake.

Once spacemake is installed and initialized, and species data have been added, an Open-ST sample can be added:

# --R1 and --R2 each accept a single file or several files.
# Comments are kept out of the command itself: text after a trailing
# backslash would break the line continuation.
spacemake projects add_sample \
   --project_id <project_id> \
   --sample_id <sample_id> \
   --R1 <path_to_R1.fastq.gz> \
   --R2 <path_to_R2.fastq.gz> \
   --species <species> \
   --puck openst \
   --puck_barcode_file <path_to_puck_barcode_file.tsv.gz> \
   --run_mode openst \
   --barcode_flavor openst

The above adds a new Open-ST sample to the project, with barcode_flavor, run_mode, and puck all set to openst.

How to populate --puck_barcode_file

With Open-ST data, each sample covers a piece of the capture area, which contains at least one tile (puck).

Thus, we need to provide --puck_barcode_file (each tile in a sample has different barcodes, unlike for Visium samples). This file should be comma- or tab-separated, with column names in the first row. Acceptable column names are:

  • cell_bc, barcodes or barcode for cell-barcode
  • xcoord or x_pos for x-positions
  • ycoord or y_pos for y-positions

These are generated by the openst package as previously described.
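Before adding a sample, it can help to confirm that a tile file's header uses the accepted column names. The following is a minimal sketch using only the Python standard library; the helper names `read_header` and `missing_columns` are hypothetical and are not part of spacemake or openst. It assumes the delimiter can be guessed from the first row.

```python
import csv
import gzip

# Accepted header names, taken from the list above.
BARCODE_COLS = {"cell_bc", "barcodes", "barcode"}
X_COLS = {"xcoord", "x_pos"}
Y_COLS = {"ycoord", "y_pos"}


def read_header(path):
    """Return the column names from the first row of a (gzipped) CSV/TSV file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        first_line = handle.readline().rstrip("\r\n")
    # Assumption: a tab anywhere in the header means the file is tab-separated.
    delimiter = "\t" if "\t" in first_line else ","
    return next(csv.reader([first_line], delimiter=delimiter))


def missing_columns(header):
    """Return the labels of required column groups that the header lacks."""
    names = set(header)
    groups = {"barcode": BARCODE_COLS, "x": X_COLS, "y": Y_COLS}
    return [label for label, accepted in groups.items() if not names & accepted]
```

An empty list from `missing_columns(read_header(path))` means the header contains one accepted name for each of the barcode, x-position, and y-position columns.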

All puck barcode files generated in the previous step in the folder /path/to/fc_tiles need to be passed to --puck_barcode_file, e.g., with a wildcard such as /path/to/fc_tiles/*.txt.gz.
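When the command is assembled from a script rather than typed in a shell, the wildcard has to be expanded explicitly. The sketch below builds the add_sample call shown earlier, using only the flags that appear in it; the function names are hypothetical, and all argument values are placeholders to be replaced with your own.

```python
import glob
import shlex


def build_add_sample_command(project_id, sample_id, r1, r2, species, tiles_glob):
    """Assemble the add_sample call with every tile file matched by tiles_glob."""
    tile_files = sorted(glob.glob(tiles_glob))
    if not tile_files:
        # Fail early instead of adding a sample without any tile.
        raise FileNotFoundError(f"no tile barcode files match {tiles_glob!r}")
    return [
        "spacemake", "projects", "add_sample",
        "--project_id", project_id,
        "--sample_id", sample_id,
        "--R1", r1,
        "--R2", r2,
        "--species", species,
        "--puck", "openst",
        "--puck_barcode_file", *tile_files,
        "--run_mode", "openst",
        "--barcode_flavor", "openst",
    ]


def as_shell_string(command):
    """Return the argument list as one shell-quoted line, for inspection."""
    return shlex.join(command)
```

Printing `as_shell_string(...)` before running anything makes it easy to check that every expected tile file was matched.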

To generate output files and reports only for the relevant tiles per sample, you can configure the variable spatial_barcode_min_matches under run_mode (see spacemake documentation). This represents the minimum proportion of spatial barcodes that a tile must have in common with the sample transcriptomic data to be further included during quantification and downstream analysis.
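To make the role of this threshold concrete, the following sketch filters tiles by the fraction of their barcodes that also occur in the sample. It illustrates the idea only and is not spacemake's implementation: the function names are hypothetical, and it assumes the proportion is taken relative to the number of barcodes on the tile.

```python
def tile_match_fraction(tile_barcodes, sample_barcodes):
    """Fraction of a tile's barcodes that also occur among the sample's barcodes."""
    tile = set(tile_barcodes)
    if not tile:
        return 0.0
    return len(tile & set(sample_barcodes)) / len(tile)


def select_tiles(tiles, sample_barcodes, min_matches):
    """Return the IDs of tiles whose match fraction is at least min_matches.

    tiles maps a tile ID to an iterable of that tile's spatial barcodes.
    """
    sample = set(sample_barcodes)
    return [
        tile_id
        for tile_id, barcodes in tiles.items()
        if tile_match_fraction(barcodes, sample) >= min_matches
    ]
```

Under this reading, a threshold of zero keeps every tile, while a higher threshold keeps only tiles that share a larger part of their barcodes with the sample.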

Tip

If some tiles are wrongly missing, the threshold may have been too high; if some tiles are wrongly present, it may have been too low. You can update the sample to add missing tiles (see spacemake documentation), then rerun spacemake with spatial_barcode_min_matches set to zero.

Running spacemake

After a sample is added, spacemake can be run with:

spacemake run --cores <n_cores> --keep-going

The --keep-going flag is optional. It ensures that spacemake runs all the jobs it can, even if one job fails (this behavior is inherited directly from snakemake).

Expected output

After running all the steps of this section, spacemake generates the following folder structure (e.g., for a single sample):

spacemake_folder
`-- projects
    `-- <project_id>
        |-- processed_data
        |   `-- <sample_id>
        |       `-- illumina
        |           `-- complete_data
        |               |-- dge # folder, spatial gene expression as h5ad files 
        |               |-- qc_sheets # folder, sequencing QC as HTML
        |               |-- automated_analysis # folder, automated analysis results as HTML
        |               `-- ... # intermediate output files and folders
        `-- raw_data # folder, contains R1 and R2 reads (fastq)

The most important outputs for the Open-ST pipeline are the h5ad file(s) per sample (under the dge folder), which contain the gene expression and spatial coordinates of each barcoded spot.
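Given the folder structure above, the h5ad files of a sample can be located programmatically. This is a minimal sketch with hypothetical helper names; it searches the dge folder recursively because the exact file names and nesting inside it are not specified here.

```python
from pathlib import Path


def dge_folder(spacemake_folder, project_id, sample_id):
    """Build the path to a sample's dge folder, following the tree shown above."""
    return (
        Path(spacemake_folder) / "projects" / project_id / "processed_data"
        / sample_id / "illumina" / "complete_data" / "dge"
    )


def list_h5ad_files(spacemake_folder, project_id, sample_id):
    """Return all h5ad files found under the sample's dge folder, sorted by path."""
    folder = dge_folder(spacemake_folder, project_id, sample_id)
    if not folder.is_dir():
        raise FileNotFoundError(f"dge folder not found: {folder}")
    return sorted(folder.rglob("*.h5ad"))
```

Each returned path can then be opened with a library that reads h5ad, for example `anndata.read_h5ad` if the anndata package is installed.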

In the following sections, you will learn how to merge and align these h5ad files with imaging data, so that transcriptomic information can later be aggregated into single cells rather than into a more arbitrary regular binning of squares or hexagons (which spacemake also outputs for QC purposes).