GESSO Demo - 10x Visium BRCA Dataset
Gene set expression quantification with GESSO
GESSO (Gene sEt activity Score analysis with Spatial lOcation) is a computational method for quantifying the overall expression levels of gene sets for spatial transcriptomics data.
Given spatial transcriptomics data and a user-provided gene set, GESSO returns a gene set activity score (GAS) for each spot. GESSO can also systematically identify spots with elevated gene set activity through a permutation-based hypothesis test.
This notebook provides an overview of GESSO on a small dataset (a subset of human tumor 10x Visium BRCA from https://www.10xgenomics.com/datasets). This small dataset can be immediately downloaded from https://github.com/YMa-lab/GESSO/tree/main/demo_data.
Import the gesso package.
The gesso Python package can be easily downloaded from source. Simply run the following script in your terminal after ensuring Python and pip are available in your environment. We recommend installing GESSO in a new Python environment.
git clone https://github.com/YMa-Lab/GESSO.git
cd gesso
pip install .
cd ..
[1]:
from pathlib import Path
import pandas as pd
import sys
project_directory = Path("__notebook__").resolve().parent.parent
sys.path.append(str(project_directory))
from gesso import GESSO
Configure logging (optional)
GESSO uses Python’s standard logging module under the gesso.* hierarchy. By default the package is silent. Use the gesso.logging submodule to turn on console output, attach a log file, or silence the chatty per-geneset progress messages. You can also just call print(...) from your own code as usual - GESSO’s logging is independent of standard print.
Common patterns:
from gesso import logging as glog
glog.enable() # print INFO messages to stderr
glog.set_level("DEBUG") # show debug-level messages
glog.silence_per_geneset() # mute per-geneset progress, keep summaries
glog.unsilence_per_geneset() # re-enable per-geneset progress
h = glog.add_file_handler("gesso.log", level="DEBUG") # also log to a file
glog.remove_handler(h) # detach when done
glog.disable() # back to silent
Sub-loggers gesso.init, gesso.compute, and gesso.compute.geneset can be configured independently via glog.set_level(level, logger=...) if you want even finer control.
[2]:
from gesso import logging as glog
# enable console logging at INFO level for the rest of the notebook
glog.enable();
Load the spatial transcriptomics data.
In this demo, we’ll analyze the 10x Visium BRCA dataset. Let’s first load and inspect the gene expression count data. The gene expression dataframe should be of dimension \(N \times G\), where \(N\) denotes the number of spots, and \(G\) denotes the number of genes.
[3]:
expression_df: pd.DataFrame = pd.read_pickle(
project_directory / "demo_data" / "10xVisiumBRCA-count_data.pkl"
)
display(expression_df.head())
| gene | DVL1 | PIK3CD | MTOR | CASP9 | WNT4 | E2F2 | HDAC1 | CSF3R | SLC2A1 | PTCH2 | ... | TLR8 | VEGFD | ARAF | AR | IL2RG | FGF16 | COL4A6 | XIAP | FGF13 | IKBKG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| barcode | |||||||||||||||||||||
| GTAGACAACCGATGAA-1 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ACAGATTAGGTTAGTG-1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| TGGTATCGGTCTGTAT-1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| ATTATCTCGACAGATC-1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 |
| TGAGATCAAATACTCA-1 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
5 rows × 334 columns
Next, let’s load and inspect the spatial location data. The location data is of dimension \(N \times 2\). Note: the index of the location dataframe must match the columns of the expression dataframe.
[4]:
locations_df: pd.DataFrame = pd.read_pickle(
project_directory / "demo_data" / "10xVisiumBRCA-location_data.pkl"
)
display(locations_df.head())
# note: GESSO requires spatial locations to be in the form of a DataFrame containing columns "x" and "y"
locations_df = locations_df.rename(columns={"array_row": "x", "array_col": "y"})
display(locations_df.head())
| array_row | array_col | |
|---|---|---|
| barcode | ||
| GTAGACAACCGATGAA-1 | 7 | 55 |
| ACAGATTAGGTTAGTG-1 | 7 | 57 |
| TGGTATCGGTCTGTAT-1 | 7 | 59 |
| ATTATCTCGACAGATC-1 | 7 | 61 |
| TGAGATCAAATACTCA-1 | 7 | 63 |
| x | y | |
|---|---|---|
| barcode | ||
| GTAGACAACCGATGAA-1 | 7 | 55 |
| ACAGATTAGGTTAGTG-1 | 7 | 57 |
| TGGTATCGGTCTGTAT-1 | 7 | 59 |
| ATTATCTCGACAGATC-1 | 7 | 61 |
| TGAGATCAAATACTCA-1 | 7 | 63 |
Finally, let’s load the gene set/pathway data. This data may be stored as a \(G \times n_\text{genesets}\) binary matrix.
[5]:
genesets_df: pd.DataFrame = pd.read_pickle(
project_directory / "demo_data" / "10xVisiumBRCA-pathways_data.pkl"
)
display(genesets_df.head())
# we can also work with gene sets as dictionaries mapping gene set name to list of genes
genesets_dict = {
geneset: list(genesets_df.index[genesets_df[geneset] == 1])
for geneset in genesets_df.columns
}
print(genesets_dict)
| KEGG_PATHWAYS_IN_CANCER | WP_INTERACTIONS_BETWEEN_IMMUNE_CELLS_AND_MICRORNAS_IN_TUMOR_MICROENVIRONMENT | |
|---|---|---|
| DVL1 | 1 | 0 |
| PIK3CD | 1 | 0 |
| MTOR | 1 | 0 |
| CASP9 | 1 | 0 |
| WNT4 | 1 | 0 |
{'KEGG_PATHWAYS_IN_CANCER': ['DVL1', 'PIK3CD', 'MTOR', 'CASP9', 'WNT4', 'E2F2', 'HDAC1', 'CSF3R', 'SLC2A1', 'PTCH2', 'PIK3R3', 'JUN', 'JAK1', 'WNT2B', 'NRAS', 'PIAS3', 'ARNT', 'NTRK1', 'RXRG', 'FASLG', 'LAMC1', 'LAMC2', 'TPR', 'PTGS2', 'RASSF5', 'LAMB3', 'TRAF5', 'TGFB2', 'WNT9A', 'WNT3A', 'EGLN1', 'FH', 'AKT3', 'SOS1', 'EPAS1', 'MSH2', 'MSH6', 'TGFA', 'CTNNA2', 'TCF7L1', 'PAX8', 'RALB', 'GLI2', 'ITGA6', 'ITGAV', 'STAT1', 'CASP8', 'FZD7', 'FZD5', 'FN1', 'STK36', 'WNT6', 'WNT10A', 'COL4A4', 'VHL', 'PPARG', 'RAF1', 'WNT7A', 'RARB', 'TGFBR2', 'MLH1', 'CTNNB1', 'LAMB2', 'RHOA', 'RASSF1', 'WNT5A', 'APPL1', 'MITF', 'TFG', 'CBLB', 'GSK3B', 'PIK3CB', 'MECOM', 'PLD1', 'PIK3CA', 'DVL3', 'FGF12', 'CTBP1', 'FGFR3', 'PDGFRA', 'KIT', 'CXCL8', 'FGF5', 'MAPK10', 'NFKB1', 'LEF1', 'EGF', 'FGF2', 'HHIP', 'VEGFC', 'CASP3', 'SKP2', 'FGF10', 'ITGA2', 'PIK3R1', 'MSH3', 'APC', 'TCF7', 'WNT8A', 'CTNNA1', 'FGF1', 'CSF1R', 'PDGFRB', 'FGF18', 'MAPK9', 'E2F3', 'RXRB', 'PPARD', 'CDKN1A', 'VEGFA', 'HSP90AB1', 'LAMA4', 'HDAC2', 'LAMA2', 'PDGFA', 'RAC1', 'IL6', 'CYCS', 'RALA', 'GLI3', 'EGFR', 'FZD9', 'HGF', 'FZD1', 'CDK6', 'PIK3CG', 'LAMB1', 'LAMB4', 'MET', 'WNT2', 'WNT16', 'SMO', 'BRAF', 'SHH', 'FGF20', 'FGF17', 'NKX3-1', 'FZD3', 'FGFR1', 'IKBKB', 'CCNE2', 'FZD6', 'MYC', 'PTK2', 'CDKN2B', 'DAPK1', 'PTCH1', 'TGFBR1', 'TRAF1', 'ABL1', 'LAMC3', 'RALGDS', 'RXRA', 'TRAF2', 'ITGB1', 'CUL2', 'FZD8', 'RET', 'NCOA4', 'MAPK8', 'CCDC6', 'CTNNA3', 'PTEN', 'FAS', 'CHUK', 'WNT8B', 'FGF8', 'NFKB2', 'SUFU', 'TCF7L2', 'FGFR2', 'HRAS', 'TRAF6', 'SPI1', 'VEGFB', 'BAD', 'RELA', 'GSTP1', 'CCND1', 'FGF19', 'FGF4', 'FGF3', 'FADD', 'WNT11', 'FZD4', 'BIRC3', 'BIRC2', 'MMP1', 'ZBTB16', 'CBL', 'ETS1', 'WNT5B', 'FGF23', 'FGF6', 'CDKN1B', 'KRAS', 'WNT10B', 'WNT1', 'CDK2', 'GLI1', 'CDK4', 'MDM2', 'KITLG', 'IGF1', 'HSP90B1', 'FGF9', 'FLT3', 'BRCA2', 'CCNA1', 'FOXO1', 'RB1', 'FGF14', 'COL4A1', 'COL4A2', 'EGLN3', 'NFKBIA', 'SOS2', 'BMP4', 'HIF1A', 'MAX', 'PGF', 'FOS', 'TGFB3', 'HSP90AA1', 'TRAF3', 'AKT1', 'RAD51', 'FGF7', 'DAPK2', 'MAP2K1', 'SMAD3', 'PIAS1', 'PML', 'ARNT2', 'IGF1R', 'AXIN1', 'CREBBP', 'PRKCB', 'MAPK3', 'MMP2', 'CDH1', 'PLCG2', 'CRK', 'DVL2', 'FGF11', 'TP53', 'PIK3R5', 'NOS2', 'TRAF4', 'ERBB2', 'RARA', 'JUP', 'STAT5B', 'STAT5A', 'STAT3', 'ITGA2B', 'FZD2', 'WNT3', 'WNT9B', 'ITGA3', 'AXIN2', 'PRKCA', 'GRB2', 'BIRC5', 'LAMA1', 'RALBP1', 'LAMA3', 'PIAS2', 'SMAD2', 'SMAD4', 'DCC', 'BCL2', 'FGF22', 'APC2', 'DAPK3', 'PIAS4', 'MAP2K2', 'PIK3R2', 'CCNE1', 'CEBPA', 'AKT2', 'EGLN2', 'TGFB1', 'CBLC', 'FGF21', 'BAX', 'FLT3LG', 'KLK3', 'PRKCG', 'BMP2', 'BCL2L1', 'E2F1', 'PLCG1', 'STK4', 'MMP9', 'LAMA5', 'RUNX1', 'BID', 'CRKL', 'MAPK1', 'BCR', 'RAC2', 'PDGFB', 'RBX1', 'EP300', 'WNT7B', 'CSF2RA', 'VEGFD', 'ARAF', 'AR', 'FGF16', 'COL4A6', 'XIAP', 'FGF13', 'IKBKG'], 'WP_INTERACTIONS_BETWEEN_IMMUNE_CELLS_AND_MICRORNAS_IN_TUMOR_MICROENVIRONMENT': ['PIAS3', 'TGFB2', 'CTLA4', 'PDCD1', 'TGFBR2', 'CD80', 'CD86', 'CXCL10', 'NFKB1', 'IL4', 'CD274', 'TLR4', 'IL2RA', 'NFKB2', 'TRAF6', 'IRAK4', 'STAT6', 'TGFB3', 'SOCS1', 'IL4R', 'CCL2', 'CCL5', 'STAT3', 'TGFB1', 'IL2RB', 'TLR7', 'TLR8', 'IL2RG']}
Use GESSO to compute gene set activity scores
This is the simplest use case for GESSO. Simply inpute the data and generate an activity scores report.
[6]:
model = GESSO(
expression_df=expression_df,
locations_df=locations_df,
genesets_df=genesets_df,
k=6, # increase k to increase spatial smoothing effect
normalize_counts_method="normalize-log1p", # easily normalize counts for non-preprocessed data
)
# compute activity scores for all gene sets in genesets_df
gas_report = model.compute_gas(
beta=0.33, # increase beta to increase spatial smoothing strength; must be between 0 and 1
compute_method="cpu", # either "cpu" or "lowres"
)
# quickly plot spatial map of activity scores for a specific gene set
figure = gas_report.plot_gas_spatial_map(
geneset="KEGG_PATHWAYS_IN_CANCER", # specify the gene set of interest
size=20, # dot size
cmap="viridis",
figsize=(5, 5),
) # returns a matplotlib Figure object
display(figure)
GESSO (info): Identified 334 common genes in the gene set and expression data.
GESSO (info): Identified 2518 common spots in the location and expression data.
GESSO (info): Normalized expression data with strategy 'normalize-log1p'.
GESSO (info): Model initialization complete.
GESSO (info): Beginning activity score computation for 2 gene sets with 2 jobs. Method
used: cpu.
GESSO (info): (Job 2:
WP_INTERACTIONS_BETWEEN_IMMUNE_CELLS_AND_MICRORNAS_IN_TUMOR_MICROENVIRONMENT)
Activity score computation for
WP_INTERACTIONS_BETWEEN_IMMUNE_CELLS_AND_MICRORNAS_IN_TUMOR_MICROENVIRONMENT
completed in 1.09 seconds.
GESSO (info): (Job 1: KEGG_PATHWAYS_IN_CANCER) Activity score computation for
KEGG_PATHWAYS_IN_CANCER completed in 4.11 seconds.
You can easily retrieve relevant data in pd.DataFrame form from the GESSO gene set activity scores report.
[7]:
display(gas_report.gas_df())
display(gas_report.gene_contributions_df(geneset="KEGG_PATHWAYS_IN_CANCER"))
| KEGG_PATHWAYS_IN_CANCER | WP_INTERACTIONS_BETWEEN_IMMUNE_CELLS_AND_MICRORNAS_IN_TUMOR_MICROENVIRONMENT | |
|---|---|---|
| barcode | ||
| GTAGACAACCGATGAA-1 | -1.137452 | 0.044172 |
| ACAGATTAGGTTAGTG-1 | -0.076933 | -1.285475 |
| TGGTATCGGTCTGTAT-1 | -1.564224 | -1.565298 |
| ATTATCTCGACAGATC-1 | -1.624066 | -0.296113 |
| TGAGATCAAATACTCA-1 | -2.743630 | -0.461900 |
| ... | ... | ... |
| GCCCTGAGGATGGGCT-1 | -3.237531 | 1.022693 |
| CGGGCGATGGATCACG-1 | -2.021051 | -1.913347 |
| TGCGGACTTGACTCCG-1 | -4.610999 | 0.047913 |
| TCGCTGCCAATGCTGT-1 | -3.376165 | 0.271323 |
| GGTGTAGGTAAGTAAA-1 | -1.997408 | 1.049836 |
2518 rows × 2 columns
| KEGG_PATHWAYS_IN_CANCER | |
|---|---|
| ERBB2 | 0.209451 |
| RARA | 0.174657 |
| PRKCA | 0.174332 |
| CDH1 | 0.168812 |
| VEGFA | 0.157994 |
| ... | ... |
| GSTP1 | -0.127424 |
| PDGFRB | -0.139686 |
| CSF1R | -0.144355 |
| FOS | -0.167397 |
| MMP2 | -0.192850 |
315 rows × 1 columns
[8]:
model = GESSO(
expression_df=expression_df,
locations_df=locations_df,
genesets_df=genesets_df,
k=6, # increase k to increase spatial smoothing effect
normalize_counts_method="normalize-log1p", # easily normalize counts for non-preprocessed data
)
# compute activity scores for all gene sets in genesets_df
gas_report = model.compute_gas(
beta=0.33, # increase beta to increase spatial smoothing strength; must be between 0 and 1
compute_method="cpu", # either "cpu" or "lowres"
)
# quickly plot spatial map of activity scores for a specific gene set
figure = gas_report.plot_gas_spatial_map(
geneset="KEGG_PATHWAYS_IN_CANCER", # specify the gene set of interest
size=20, # dot size
cmap="viridis",
figsize=(5, 5),
) # returns a matplotlib Figure object
display(gas_report.gas_df())
display(gas_report.gene_contributions_df(geneset="KEGG_PATHWAYS_IN_CANCER"))
GESSO (info): Identified 334 common genes in the gene set and expression data.
GESSO (info): Identified 2518 common spots in the location and expression data.
GESSO (info): Normalized expression data with strategy 'normalize-log1p'.
GESSO (info): Model initialization complete.
GESSO (info): Beginning activity score computation for 2 gene sets with 2 jobs. Method
used: cpu.
GESSO (info): (Job 2:
WP_INTERACTIONS_BETWEEN_IMMUNE_CELLS_AND_MICRORNAS_IN_TUMOR_MICROENVIRONMENT)
Activity score computation for
WP_INTERACTIONS_BETWEEN_IMMUNE_CELLS_AND_MICRORNAS_IN_TUMOR_MICROENVIRONMENT
completed in 1.08 seconds.
GESSO (info): (Job 1: KEGG_PATHWAYS_IN_CANCER) Activity score computation for
KEGG_PATHWAYS_IN_CANCER completed in 4.11 seconds.
| KEGG_PATHWAYS_IN_CANCER | WP_INTERACTIONS_BETWEEN_IMMUNE_CELLS_AND_MICRORNAS_IN_TUMOR_MICROENVIRONMENT | |
|---|---|---|
| barcode | ||
| GTAGACAACCGATGAA-1 | -1.137452 | 0.044172 |
| ACAGATTAGGTTAGTG-1 | -0.076933 | -1.285475 |
| TGGTATCGGTCTGTAT-1 | -1.564224 | -1.565298 |
| ATTATCTCGACAGATC-1 | -1.624066 | -0.296113 |
| TGAGATCAAATACTCA-1 | -2.743630 | -0.461900 |
| ... | ... | ... |
| GCCCTGAGGATGGGCT-1 | -3.237531 | 1.022693 |
| CGGGCGATGGATCACG-1 | -2.021051 | -1.913347 |
| TGCGGACTTGACTCCG-1 | -4.610999 | 0.047913 |
| TCGCTGCCAATGCTGT-1 | -3.376165 | 0.271323 |
| GGTGTAGGTAAGTAAA-1 | -1.997408 | 1.049836 |
2518 rows × 2 columns
| KEGG_PATHWAYS_IN_CANCER | |
|---|---|
| ERBB2 | 0.209451 |
| RARA | 0.174657 |
| PRKCA | 0.174332 |
| CDH1 | 0.168812 |
| VEGFA | 0.157994 |
| ... | ... |
| GSTP1 | -0.127424 |
| PDGFRB | -0.139686 |
| CSF1R | -0.144355 |
| FOS | -0.167397 |
| MMP2 | -0.192850 |
315 rows × 1 columns
You can also work directly with gene sets in dictionary form.
[9]:
model_alt = GESSO(
expression_df=expression_df,
locations_df=locations_df,
k=6,
normalize_counts_method="normalize-log1p",
verbose=False, # per-call override: suppress all messages from this model
# (for finer control, use gesso.logging - see above)
)
gas_report = model_alt.compute_gas(
genesets_dict=genesets_dict,
beta=0.33,
compute_method="cpu"
)
gas_report.plot_gas_spatial_map(
geneset="KEGG_PATHWAYS_IN_CANCER",
size=20,
cmap="viridis",
figsize=(5, 5)
)
[9]:
Use GESSO to systematically identify spots with significantly elevated gene set activity
Please note that the output of the permutation-based hypothesis test varies with the total number of genes in the dataset. The example dataset used in this demo contains a subset of all genes from the original dataset.
[10]:
model = GESSO(
expression_df=expression_df,
locations_df=locations_df,
genesets_df=genesets_df,
k=6,
normalize_counts_method="normalize-log1p",
verbose=False
)
htest_report = model.htest_elevated_gas(
geneset="KEGG_PATHWAYS_IN_CANCER",
beta=0.33,
n_permutations=100, # number of random gene sets to sample for the test
n_jobs=4 # set number of CPUs
)
[11]:
display(htest_report.htest_df())
htest_report.plot_pval_spatial_map(
size=20,
significance_threshold=0.05
)
| x | y | gas | p | |
|---|---|---|---|---|
| barcode | ||||
| GTAGACAACCGATGAA-1 | 7 | 55 | -1.137452 | 0.97 |
| ACAGATTAGGTTAGTG-1 | 7 | 57 | -0.076933 | 0.32 |
| TGGTATCGGTCTGTAT-1 | 7 | 59 | -1.564224 | 1.00 |
| ATTATCTCGACAGATC-1 | 7 | 61 | -1.624066 | 1.00 |
| TGAGATCAAATACTCA-1 | 7 | 63 | -2.743630 | 1.00 |
| ... | ... | ... | ... | ... |
| GCCCTGAGGATGGGCT-1 | 68 | 74 | -3.237531 | 0.96 |
| CGGGCGATGGATCACG-1 | 69 | 75 | -2.021051 | 0.98 |
| TGCGGACTTGACTCCG-1 | 68 | 76 | -4.610999 | 0.98 |
| TCGCTGCCAATGCTGT-1 | 68 | 80 | -3.376165 | 0.95 |
| GGTGTAGGTAAGTAAA-1 | 70 | 50 | -1.997408 | 0.93 |
2518 rows × 4 columns
[11]:
[ ]: