Preprocessing transcriptomic library
After sequencing the transcriptomic library of an Open-ST sample, you will get basecall files in bcl format, or raw reads in fastq format (see sequence file formats on Illumina's website).
For the Open-ST workflow, we leverage spacemake, an automated pipeline designed for the preprocessing, alignment, and quantification of single-cell and spatial transcriptomics data.
Configuring spacemake
We refer to the official documentation for a complete tutorial on how to install and initialize spacemake.
Once spacemake is installed and initialized, and species data have been added, an Open-ST sample can be added:
# --R1 and --R2 each accept a single file or several files
spacemake projects add_sample \
    --project_id <project_id> \
    --sample_id <sample_id> \
    --R1 <path_to_R1.fastq.gz> \
    --R2 <path_to_R2.fastq.gz> \
    --species <species> \
    --puck openst \
    --puck_barcode_file <path_to_puck_barcode_file.tsv.gz> \
    --run_mode openst \
    --barcode_flavor openst
The above will add a new Open-ST project with barcode_flavor, run_mode, and puck all set to openst.
How to populate --puck_barcode_file
With Open-ST data, each sample covers a piece of capture area, which contains at least one tile (puck).
Thus, we need to provide --puck_barcode_file (each tile in a sample has different barcodes, unlike for Visium samples).
This file should be comma- or tab-separated, with column names in the first row. Acceptable column names are:
- cell_bc, barcodes or barcode for the cell barcode
- xcoord or x_pos for x-positions
- ycoord or y_pos for y-positions
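As a quick sanity check before adding a sample, the header of a puck barcode file can be compared against the column names listed above. The following is a minimal sketch using only the Python standard library; the helper name check_puck_barcode_header is hypothetical and is not part of spacemake or the openst package.

```python
import csv
import gzip
import io

# Accepted column names, taken from the list above.
BARCODE_COLS = {"cell_bc", "barcodes", "barcode"}
X_COLS = {"xcoord", "x_pos"}
Y_COLS = {"ycoord", "y_pos"}


def check_puck_barcode_header(path):
    """Return True if the first row names a barcode, an x and a y column.

    Hypothetical helper for illustration only.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        first_line = handle.readline()
    # The file may be comma- or tab-separated; use whichever the header contains.
    delimiter = "\t" if "\t" in first_line else ","
    header = set(next(csv.reader(io.StringIO(first_line), delimiter=delimiter), []))
    return (
        bool(header & BARCODE_COLS)
        and bool(header & X_COLS)
        and bool(header & Y_COLS)
    )
```

The check only looks at the first row, so it stays fast even for large tile files.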
These files are generated by the openst package as previously described.
All puck barcode files generated in the previous step in the folder /path/to/fc_tiles need to be specified after --puck_barcode_file, e.g., with the wildcard /path/to/fc_tiles/*.txt.gz.
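When the sample is added from a script rather than an interactive shell, the wildcard has to be expanded explicitly. The sketch below assembles the add_sample call shown earlier as an argument list, using only the flags from that command; the function name build_add_sample_command and its parameters are hypothetical, and the result still has to be passed to a shell or to subprocess.run on a machine where spacemake is installed.

```python
import glob


def build_add_sample_command(project_id, sample_id, r1, r2, species, tiles_glob):
    """Assemble the add_sample call with every tile file listed explicitly.

    Hypothetical helper; it only builds the argument list and runs nothing.
    """
    tile_files = sorted(glob.glob(tiles_glob))
    if not tile_files:
        raise FileNotFoundError(f"no puck barcode files match {tiles_glob!r}")
    return [
        "spacemake", "projects", "add_sample",
        "--project_id", project_id,
        "--sample_id", sample_id,
        "--R1", r1,
        "--R2", r2,
        "--species", species,
        "--puck", "openst",
        # All tile files follow the single --puck_barcode_file flag.
        "--puck_barcode_file", *tile_files,
        "--run_mode", "openst",
        "--barcode_flavor", "openst",
    ]
```

Passing the returned list to shlex.join gives a command line that can be pasted into a terminal; failing early on an empty match avoids adding a sample without any tiles.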
To generate output files and reports only for the relevant tiles per sample, you can configure the variable spatial_barcode_min_matches under run_mode (see spacemake documentation).
This represents the minimum proportion of spatial barcodes that a tile must have in common with the sample transcriptomic data to be included in quantification and downstream analysis.
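To make the role of this threshold concrete, here is a small illustration of how a per-tile proportion could be computed and compared against it. This is a sketch under assumptions: it divides the number of shared barcodes by the number of barcodes in the tile, which is one plausible reading of the description above and not necessarily how spacemake computes it internally. The function names are hypothetical.

```python
def tile_match_proportion(tile_barcodes, sample_barcodes):
    """Fraction of a tile's spatial barcodes that also occur in the sample.

    Assumption: the denominator is the number of distinct barcodes in the tile.
    """
    tile = set(tile_barcodes)
    if not tile:
        return 0.0
    return len(tile & set(sample_barcodes)) / len(tile)


def select_tiles(tiles, sample_barcodes, min_matches):
    """Return the names of tiles whose proportion reaches the threshold."""
    return [
        name
        for name, barcodes in tiles.items()
        if tile_match_proportion(barcodes, sample_barcodes) >= min_matches
    ]
```

With a threshold of zero every tile passes the comparison, which is why setting spatial_barcode_min_matches to zero keeps all tiles that were specified for the sample.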
Tip
If some tiles are wrongly missing, the threshold may have been too high; if some tiles are wrongly present, it may have been too low.
You can update the sample to add missing tiles (see spacemake documentation).
Then, rerun spacemake after setting spatial_barcode_min_matches to zero.
Running spacemake
After a sample is added, spacemake can be run with:
spacemake run --cores <n_cores> --keep-going
The --keep-going flag is optional; however, it ensures that spacemake runs all the jobs it can, even if one job fails (this logic is taken directly from snakemake).
Expected output
After running all the steps of this section, spacemake generates the following folder structure (e.g., for a single sample):
spacemake_folder
`-- projects
    `-- <project_id>
        |-- processed_data
        |   `-- <sample_id>
        |       `-- illumina
        |           `-- complete_data
        |               |-- dge                 # folder, spatial gene expression as h5ad files
        |               |-- qc_sheets           # folder, sequencing QC as HTML
        |               |-- automated_analysis  # folder, automated analysis results as HTML
        |               `-- ...                 # intermediate output files and folders
        `-- raw_data                            # folder, contains R1 and R2 reads (fastq)
Important for the openst pipeline are the h5ad file(s) per sample (under the dge folder), which contain the gene expression and spatial coordinates of each barcoded spot.
In the following sections, you will learn how to merge and align these with imaging data, and later aggregate transcriptomic information into single cells, rather than using a more arbitrary regular binning of spatial data into squares or hexagons (which is part of the output from spacemake for QC purposes).
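The h5ad files of a sample can be located by following the folder structure shown above. The sketch below uses only pathlib; the function name find_dge_files is hypothetical, and the path components are taken from the tree in this section, so the helper needs adjusting if your layout differs.

```python
from pathlib import Path


def find_dge_files(spacemake_folder, project_id, sample_id):
    """List the h5ad files under a sample's dge folder, sorted by name.

    Hypothetical helper; the path mirrors the folder structure shown above.
    """
    dge_dir = (
        Path(spacemake_folder)
        / "projects"
        / project_id
        / "processed_data"
        / sample_id
        / "illumina"
        / "complete_data"
        / "dge"
    )
    return sorted(dge_dir.glob("*.h5ad"))
```

An empty list means that either the path is wrong or spacemake has not yet produced the dge output for that sample.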