GESSO Demo - Stereo-seq E12.5 Mouse Embryo Dataset

Gene set expression quantification on a large spatial transcriptomics dataset

GESSO (Gene sEt activity Score analysis with Spatial lOcation) is a computational method for quantifying the overall expression levels of gene sets for spatial transcriptomics data. Given spatial transcriptomics data and a user-provided gene set, GESSO returns a gene set activity score (GAS) for each spot. The lowres compute method makes GESSO scalable to high-resolution platforms with tens of thousands of spots.

This notebook demonstrates GESSO on the Stereo-seq mouse embryo dataset at embryonic day 12.5 (E12.5) (Chen et al., Cell, 2022), which contains ~51,000 spots. We score the curated panel of organ-system-specific gene sets used in the GESSO manuscript, visualize spatial maps for representative organ systems, and compute local spatial gradients of the activity scores to highlight tissue boundaries.

All paths below point to the original locations of the data on the OSCAR compute system; replace them with your own when re-running.

Import the gesso package.

The gesso Python package can be easily downloaded from source. Simply run the following script in your terminal after ensuring Python and pip are available in your environment. We recommend installing GESSO in a new Python environment.

git clone https://github.com/YMa-Lab/GESSO.git
cd gesso
pip install .
cd ..

Reading Stereo-seq .Rds files additionally requires pyreadr:

pip install pyreadr
[1]:
from pathlib import Path
import sys
import time

import numpy as np
import pandas as pd
import pyreadr
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.spatial import cKDTree

project_directory = Path("__notebook__").resolve().parent.parent
sys.path.append(str(project_directory))

from gesso import GESSO

Configure logging (optional)

GESSO uses Python’s standard logging module under the gesso.* hierarchy. By default the package is silent. Because this dataset is large and we score hundreds of gene sets, we silence the per-geneset progress messages and only keep top-level summaries.

[ ]:
from gesso import logging as glog

glog.enable()
glog.silence_per_geneset();  # keep summaries, mute per-geneset progress

Load the spatial transcriptomics data.

The Stereo-seq mouse embryo data ships as R .Rds files. We read them with pyreadr. The gene expression matrix is provided as \(G \times N\) (genes by spots); GESSO expects \(N \times G\), so we transpose.

[3]:
stereoseq_data_dir = Path("/users/ayang103/data/Project/SPLAGE/ST_Data/StereoSeq_Mouse")
pathways_dir = Path("/users/ayang103/data/Project/SPLAGE/Target_Pathway_List/PathwaysTable")

expression_rds = stereoseq_data_dir / "E12.5.gene.express.mat.Rds"
locations_rds = stereoseq_data_dir / "E12.5.location.Rds"
full_pathways_csv = pathways_dir / "StereoSeq.MouseEmbryo.PathwaysTable.1006.csv"
organ_system_pathways_csv = pathways_dir / "StereoSeq.MouseEmbryo.TargetTest.Pathways.1126.csv"
for p in (expression_rds, locations_rds, full_pathways_csv, organ_system_pathways_csv):
    assert p.exists(), p
[4]:
# expression matrix: genes x spots, transpose to spots x genes for GESSO
expression_df: pd.DataFrame = pyreadr.read_r(str(expression_rds))[None]
expression_df.columns.name = None
expression_df.index.name = None
expression_df = expression_df.T
display(expression_df.iloc[:5, :8])
print("expression shape (spots x genes):", expression_df.shape)
1700007G11Rik 1700123O20Rik 1810030O07Rik 2010107E04Rik 2210016F16Rik 2600014E21Rik 2610001J05Rik 2810429I04Rik
100_124-3 0.000000 0.0 0.0 2.923904 0.0 0.000000 0.0 0.0
100_125-3 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0
100_126-3 0.000000 0.0 0.0 1.424620 0.0 0.000000 0.0 0.0
100_127-3 0.000000 0.0 0.0 1.355934 0.0 1.355934 0.0 0.0
100_128-3 1.186376 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0
expression shape (spots x genes): (51365, 23761)

Next, let’s load and inspect the spatial location data. GESSO requires spatial locations to be in the form of a DataFrame containing columns "x" and "y".

[5]:
locations_df: pd.DataFrame = pyreadr.read_r(str(locations_rds))[None][["x", "y"]]
locations_df.columns.name = None
locations_df.index.name = None

# align expression row index with location row index
expression_df.index = locations_df.index

display(locations_df.head())
print("locations shape:", locations_df.shape)
x y
100_124-3 587.642906 -316.80058
100_125-3 586.776881 -316.30058
100_126-3 585.910855 -315.80058
100_127-3 585.044830 -315.30058
100_128-3 584.178805 -314.80058
locations shape: (51365, 2)

Finally, let’s load the gene set/pathway data. This data is stored as a \(G \times n_\text{genesets}\) binary matrix.

For this demo we score the curated panel of organ-system-specific gene sets used in the manuscript. The panel is read from a separate CSV listing the gene-set names to keep.

[6]:
full_pathways_df: pd.DataFrame = pd.read_csv(full_pathways_csv, index_col=0)
full_pathways_df.columns.name = None
full_pathways_df.index.name = None
print("full pathway membership table:", full_pathways_df.shape, "(genes x pathways)")

organ_system_pathways = (
    pd.read_csv(organ_system_pathways_csv, index_col=0)["geneset"].tolist()
)
organ_system_pathways = [p for p in organ_system_pathways if p in full_pathways_df.columns]
print(f"organ-system panel: {len(organ_system_pathways)} gene sets")

# restrict to the organ-system panel for this demo
genesets_df = full_pathways_df[organ_system_pathways]
display(genesets_df.iloc[:5, :3])
full pathway membership table: (23761, 9499) (genes x pathways)
organ-system panel: 313 gene sets
GOBP_CARDIAC_MUSCLE_CELL_ACTION_POTENTIAL GOBP_CARDIAC_MUSCLE_CELL_CONTRACTION GOBP_CARDIAC_MUSCLE_CELL_DIFFERENTIATION
1700007G11Rik 0 0 0
1700123O20Rik 0 0 0
1810030O07Rik 0 0 0
2010107E04Rik 0 0 0
2210016F16Rik 0 0 0

Use GESSO to compute gene set activity scores

For high-resolution platforms (~\(10^4\)+ spots), the lowres compute method partitions spots into low-resolution subsets, computes GAS within each subset, and aggregates back to spot level. This keeps both runtime and memory tractable for the ~51k spots in E12.5. Here we use n_partitions=10 and stratified_kmeans partitioning, the settings used in the manuscript.

This computation scores all 313 organ-system pathways across ~51k spots and typically takes ~30-60 minutes on 20 CPUs.

[7]:
model = GESSO(
    expression_df=expression_df,
    locations_df=locations_df,
    genesets_df=genesets_df,
    k=20,  # number of nearest-neighbor edges for the spatial graph
)

start = time.time()
gas_report = model.compute_gas(
    beta=0.33,                       # spatial smoothing strength; manuscript default
    compute_method="lowres",         # scalable estimator; required for ~51k spots
    partition_method="stratified_kmeans",
    n_partitions=10,
    n_jobs=20,
    store_gene_contributions=True,
)
print(f"compute_gas done in {time.time() - start:.1f} s for {len(organ_system_pathways)} gene sets")
GESSO (info): Identified 23761 common genes in the gene set and expression data.
GESSO (info): Identified 51365 common spots in the location and expression data.
GESSO (info): Model initialization complete.
GESSO (info): Beginning low resolution activity score computation for 313 gene sets with
               20 jobs. Method used: lowres.
compute_gas done in 2769.9 s for 313 gene sets

You can retrieve relevant data in pd.DataFrame form from the GESSO gene set activity scores report.

[8]:
gas_df = gas_report.gas_df()
display(gas_df.iloc[:, :3])
display(gas_report.gene_contributions_df(geneset="GOBP_HEART_DEVELOPMENT").head(10))
GOBP_CARDIAC_MUSCLE_CELL_ACTION_POTENTIAL GOBP_CARDIAC_MUSCLE_CELL_CONTRACTION GOBP_CARDIAC_MUSCLE_CELL_DIFFERENTIATION
100_124-3 1.079260 0.790949 -1.585441
100_125-3 -1.394282 -1.426940 -1.603940
100_126-3 -1.156461 -1.159831 -1.836268
100_127-3 -1.663883 -1.684750 -1.634100
100_128-3 -0.851946 -0.739961 -1.005923
... ... ... ...
99_272-3 -1.058986 -0.991000 -1.081240
99_273-3 -0.547736 0.384818 -1.580013
99_274-3 -1.252604 -1.290861 0.046630
99_275-3 -1.622624 -1.466577 -1.202910
99_276-3 -1.225003 -1.405247 -1.651946

51365 rows × 3 columns

GOBP_HEART_DEVELOPMENT
Tnnt2 0.199859
Myh6 0.187223
Actc1 0.181760
Myh7 0.181544
Tnni3 0.177076
Csrp3 0.175717
Tnni1 0.175315
Myl7 0.174126
Ttn 0.171646
Ankrd1 0.169321

Visualize spatial maps for representative organ-system gene sets

We plot four representative gene sets covering distinct organ systems in the E12.5 embryo: heart, brain, skeletal, and liver.

[9]:
focal_genesets = [
    "GOBP_HEART_DEVELOPMENT",
    "MANNO_MIDBRAIN_NEUROTYPES_HNPROG",
    "GOBP_EMBRYONIC_SKELETAL_SYSTEM_DEVELOPMENT",
    "AIZARANI_LIVER_C11_HEPATOCYTES_1",
]

for geneset in focal_genesets:
    fig = gas_report.plot_gas_spatial_map(
        geneset=geneset,
        size=2,
        cmap="inferno",
        figsize=(5, 5),
    )
    ax = fig.axes[0]
    ax.set_title(geneset, fontsize=9)
    ax.invert_yaxis()  # anatomical orientation
    display(fig)
    plt.close(fig)
_images/gesso_demo_stereoseq_17_0.png
_images/gesso_demo_stereoseq_17_1.png
_images/gesso_demo_stereoseq_17_2.png
_images/gesso_demo_stereoseq_17_3.png

Spatial gradients of GESSO activity scores

GESSO’s smoothed activity scores form a scalar field over the tissue. Local spatial gradients \((\partial \text{GAS} / \partial x,\ \partial \text{GAS} / \partial y)\) point from low-activity to high-activity regions and so highlight tissue boundaries. The gradient direction tells us where an organ system is forming, and the gradient magnitude tells us how sharp the boundary is.

For each spot we estimate the gradient by fitting a local plane to its \(k\) nearest neighbors (unweighted least squares).

[10]:
def compute_spatial_gradients(
    coords: np.ndarray, values: np.ndarray, k_neighbors: int = 30, min_neighbors: int = 4
) -> np.ndarray:
    """Estimate (df/dx, df/dy) at each point via local unweighted least squares.

    For each point, fits delta_f ~= grad_x * delta_x + grad_y * delta_y over its k
    nearest neighbors and returns (grad_x, grad_y).
    """
    n_points = coords.shape[0]
    k_query = min(k_neighbors + 1, n_points)
    tree = cKDTree(coords)
    distances, indices = tree.query(coords, k=k_query)

    gradients = np.full((n_points, 2), np.nan, dtype=np.float64)
    for i in range(n_points):
        neighbor_idx = indices[i, 1:]
        neighbor_dist = distances[i, 1:]
        valid = neighbor_dist > 1e-12
        if np.count_nonzero(valid) < min_neighbors:
            continue
        delta_xy = (coords[neighbor_idx] - coords[i])[valid]
        delta_values = (values[neighbor_idx] - values[i])[valid]
        coef, *_ = np.linalg.lstsq(delta_xy, delta_values, rcond=None)
        gradients[i, 0] = coef[0]
        gradients[i, 1] = coef[1]
    return gradients


coords = locations_df[["x", "y"]].to_numpy(dtype=np.float64)
gradient_records: dict[str, pd.DataFrame] = {}
for geneset in focal_genesets:
    values = gas_df[geneset].to_numpy(dtype=np.float64)
    grads = compute_spatial_gradients(coords, values)
    grad_df = locations_df.copy()
    grad_df["score"] = values
    grad_df["grad_x"] = grads[:, 0]
    grad_df["grad_y"] = grads[:, 1]
    grad_df["grad_mag"] = np.sqrt(grad_df["grad_x"] ** 2 + grad_df["grad_y"] ** 2)
    gradient_records[geneset] = grad_df
    print(f"{geneset}: median |grad| = {grad_df['grad_mag'].median():.4f}")
GOBP_HEART_DEVELOPMENT: median |grad| = 0.1476
MANNO_MIDBRAIN_NEUROTYPES_HNPROG: median |grad| = 0.1208
GOBP_EMBRYONIC_SKELETAL_SYSTEM_DEVELOPMENT: median |grad| = 0.2156
AIZARANI_LIVER_C11_HEPATOCYTES_1: median |grad| = 0.0776

We render each gradient field as a heatmap of \(|\nabla \text{GAS}|\) overlaid with binned down-gradient arrows. Bins keep only the strongest gradients per gene set (per-gene-set percentile threshold, since each organ has a different signal-to-background ratio). Arrows point down the gradient so they flow away from the highest-activity region.

[ ]:
N_BINS_X = N_BINS_Y = 22
MIN_POINTS_PER_BIN = 6
QUIVER_MAG_PERCENTILE_BY_GENESET = {
    "MANNO_MIDBRAIN_NEUROTYPES_HNPROG": 88,
    "GOBP_HEART_DEVELOPMENT": 93,
    "GOBP_EMBRYONIC_SKELETAL_SYSTEM_DEVELOPMENT": 80,
    "AIZARANI_LIVER_C11_HEPATOCYTES_1": 94,
}
BG_VMAX_PERCENTILE = 99
QUIVER_LENGTH_FRAC = 2.2
QUIVER_LENGTH_FLOOR = 0.4


def bin_strongest_gradient(df: pd.DataFrame, mag_percentile: float):
    x, y = df["x"].to_numpy(float), df["y"].to_numpy(float)
    x_min, x_max, y_min, y_max = x.min(), x.max(), y.min(), y.max()
    x_span, y_span = max(x_max - x_min, 1e-12), max(y_max - y_min, 1e-12)
    bin_pitch = min(x_span / N_BINS_X, y_span / N_BINS_Y)
    df = df.copy()
    df["bin_x"] = np.clip(((x - x_min) / x_span * N_BINS_X).astype(int), 0, N_BINS_X - 1)
    df["bin_y"] = np.clip(((y - y_min) / y_span * N_BINS_Y).astype(int), 0, N_BINS_Y - 1)
    max_idx = df.groupby(["bin_x", "bin_y"], observed=True)["grad_mag"].idxmax()
    tails = df.loc[max_idx, ["bin_x", "bin_y", "x", "y", "grad_x", "grad_y", "grad_mag"]]
    sizes = (
        df.groupby(["bin_x", "bin_y"], observed=True).size()
        .rename("n_points").reset_index()
    )
    g = tails.merge(sizes, on=["bin_x", "bin_y"]).reset_index(drop=True)
    g = g[g["n_points"] >= MIN_POINTS_PER_BIN].copy()
    g["bin_grad_mag"] = np.sqrt(g["grad_x"] ** 2 + g["grad_y"] ** 2)
    if mag_percentile > 0 and len(g) > 0:
        thr = np.percentile(g["bin_grad_mag"], mag_percentile)
        g = g[g["bin_grad_mag"] >= thr].copy()
    return g, bin_pitch


fig, axes = plt.subplots(1, len(focal_genesets), figsize=(5 * len(focal_genesets), 5))
for ax, geneset in zip(axes, focal_genesets):
    df = gradient_records[geneset].dropna(subset=["grad_x", "grad_y"]).copy()
    pct = QUIVER_MAG_PERCENTILE_BY_GENESET.get(geneset, 90)
    quiver_df, bin_pitch = bin_strongest_gradient(df, pct)

    vmin = float(df["grad_mag"].min())
    vmax = float(np.percentile(df["grad_mag"], BG_VMAX_PERCENTILE))
    sns.scatterplot(
        data=df, x="x", y="y", hue="grad_mag", ax=ax, s=4, edgecolor=None, legend=False,
        palette="inferno", alpha=0.85, linewidth=0, hue_norm=(vmin, vmax),
    )
    if not quiver_df.empty:
        mag = quiver_df["bin_grad_mag"].to_numpy()
        mag_safe = np.where(mag > 0, mag, 1.0)
        ux = -quiver_df["grad_x"].to_numpy() / mag_safe  # negate -> down-gradient arrows
        uy = -quiver_df["grad_y"].to_numpy() / mag_safe
        if mag.max() > mag.min():
            weight = QUIVER_LENGTH_FLOOR + (1.0 - QUIVER_LENGTH_FLOOR) * (mag - mag.min()) / (mag.max() - mag.min())
        else:
            weight = np.ones_like(mag)
        L = bin_pitch * QUIVER_LENGTH_FRAC * weight
        ax.quiver(
            quiver_df["x"], quiver_df["y"], ux * L, uy * L,
            color="white", edgecolor="black", linewidth=0.6, alpha=0.95,
            angles="xy", scale_units="xy", scale=1,
            width=0.008, headwidth=4, headlength=5, headaxislength=4.5, zorder=3,
        )
    ax.set_aspect("equal", adjustable="box")
    ax.invert_yaxis()
    ax.set_xlabel(""); ax.set_ylabel("")
    ax.tick_params(axis="both", which="both", length=0, labelbottom=False, labelleft=False)
    for spine in ax.spines.values():
        spine.set_visible(False)
    ax.set_title(geneset, fontsize=9)
fig.tight_layout()
display(fig)
plt.close(fig)
_images/gesso_demo_stereoseq_21_0.png

Each panel shows the local gradient magnitude as a heatmap (bright = sharp boundary) and the strongest down-gradient arrows pointing from low to high activity in reverse, i.e. arrows point away from each organ’s center. The cardiac, midbrain, skeletal, and hepatic signals each localize to anatomically expected regions of the E12.5 embryo.