Cell-Cell Interaction Analysis with Xenium Data

This tutorial runs through stLearn CCI analysis on extremely large data, containing >166,000 single cells with gene expression measured in space.

To increase computation speed, we grid the cells and perform stLearn CCI analysis while taking into account the proportion of different cell types detected per grid.

Environment setup

[1]:
import os
import platform

# Only constrain threads on macOS where BLAS/numba deadlocks are common.
# Must run before any numpy/scanpy import.
if platform.system() == "Darwin":
    os.environ["OPENBLAS_NUM_THREADS"] = "1"
    os.environ["MKL_NUM_THREADS"] = "1"
    os.environ["NUMBA_NUM_THREADS"] = "1"
    n_cpus = 1
else:
    n_cpus = None
[2]:
import stlearn as st
import pathlib as pathlib
import matplotlib.pyplot as plt
import numpy as np
import random as random
import os as os

st.settings.set_figure_params(dpi=120)

# Make sure all the seeds are set
seed = 0
np.random.seed(seed)
random.seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)

# Ignore all warnings
import warnings
warnings.filterwarnings("ignore")

Loading the data

For this tutorial purpose, we don’t perform the clustering. Please run the clustering method with the clustering tutorial or a part of spatial trajectory inference tutorial.

[3]:
# Setup download directory and get data
st.settings.datasetdir =  pathlib.Path.cwd().parent / "data"
library_id = "Xenium_FFPE_Human_Breast_Cancer_Rep1"
data_dir = st.settings.datasetdir / "Xenium_FFPE_Human_Breast_Cancer_Rep1"
[4]:
st.datasets.xenium_sge(library_id=library_id, include_hires_tiff=True)
[5]:
adata = st.read_xenium(feature_cell_matrix_file=data_dir / "cell_feature_matrix.h5",
                      cell_summary_file=data_dir / "cells.csv.gz",
                      library_id=library_id,
                      image_path=data_dir / "he_image.ome.tif",
                      scale=1,
                      spot_diameter_fullres=15,
                      alignment_matrix_file=data_dir / "he_imagealignment.csv",
                      experiment_xenium_file=data_dir / "experiment.xenium",
                      )
Added tissue image to the object!
[6]:
# QC - Filter genes and cells with at least 10 counts
st.pp.filter_genes(adata, min_counts=10)
st.pp.filter_cells(adata, min_counts=10)
[7]:
adata.X.toarray()
[7]:
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [4., 1., 5., ..., 0., 0., 3.],
       [1., 0., 0., ..., 0., 0., 0.],
       [4., 1., 3., ..., 0., 0., 0.]], shape=(164000, 313), dtype=float32)
[8]:
# Store the raw data for using PSTS
adata.raw = adata
[9]:
# Run PCA, neighbors and clustering.
st.em.run_pca(adata, n_comps=50, random_state=0)
st.pp.neighbors(adata, n_neighbors=25, use_rep='X_pca', random_state=0)
st.tl.clustering.leiden(adata, resolution=1.1, random_state=0)
PCA is done! Generated in adata.obsm['X_pca'], adata.uns['pca'] and adata.varm['PCs']
Created k-Nearest-Neighbor graph in adata.uns['neighbors']
Applying Leiden cluster ...
Leiden cluster is done! The labels are stored in adata.obs['leiden']
[10]:
st.pl.cluster_plot(adata, use_label="leiden", image_alpha=0, size=4, figsize=(10, 10), bbox_to_anchor=(1.1, 1))
[10]:
AnnData object with n_obs × n_vars = 164000 × 313
    obs: 'imagecol', 'imagerow', 'n_counts', 'leiden'
    var: 'gene_ids', 'feature_types', 'genome', 'n_counts'
    uns: 'spatial', 'pca', 'neighbors', 'leiden', 'leiden_colors'
    obsm: 'spatial', 'X_pca'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
../_images/tutorials_cell_cell_interaction_xenium_11_1.png

Note on normalisation

No log1p or shrinking to make genes of similar expression range. In our case, for calling hotspots, we want genes to be more separate, since we select background genes with similar expression levels to detect hotspots.

[11]:
#### Normalize total...
st.pp.normalize_total(adata)
Normalization step is finished in adata.X
[12]:
adata.X.toarray()
[12]:
array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 1.787234  , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [1.6926951 , 0.4231738 , 2.115869  , ..., 0.        , 0.        ,
        1.2695214 ],
       [1.4358974 , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [1.7777778 , 0.44444445, 1.3333334 , ..., 0.        , 0.        ,
        0.        ]], shape=(164000, 313), dtype=float32)

Gridding

Now performing the gridding. The resolution chosen here may effect the results. The higher resolution, the better this represents the single cell data but the longer the computation takes.

To summarise the gene expression across cells in a grid, we sum the library size normalised gene expression. Summing allows for representing the fact there are multiple cells in a given spot.

[13]:
### Calculating the number of grid spots we will generate
n_ = 125
print(f'{n_} by {n_} has this many spots:\n', n_ * n_)
125 by 125 has this many spots:
 15625

By providing ‘use_label’ to the function below, the cell type information is saved as deconvolution information per spot, and also the dominant cell annotation. That way we can perform stLearn CCI with the cell type information!

[14]:
### Gridding.
grid = st.tl.cci.grid(adata, n_row=n_, n_col=n_, use_label='leiden', n_cpus=n_cpus)
print(grid.shape)  # Slightly less than the above calculation, since we filter out spots with 0 cells.
Gridding...
(14364, 313)

Checking the gridding

Comparing the gridded data to the original, to make sure it makes sense.

It’s recommend to visualise the dominant cell types per spot, in order to gauge whether the tissue structure is adequately maintained after gridding (i.e. to make sure it is not too low resolution!).

[15]:
fig, axes = plt.subplots(ncols=2, figsize=(20, 8))
st.pl.cluster_plot(grid, use_label='leiden', size=10, ax=axes[0], show_plot=False, bbox_to_anchor=(1.1, 1))
st.pl.cluster_plot(adata, use_label='leiden', ax=axes[1], show_plot=False, bbox_to_anchor=(1.1, 1))
axes[0].set_title(f'Grid leiden dominant spots')
axes[1].set_title(f'Cell leiden labels')
plt.show()
../_images/tutorials_cell_cell_interaction_xenium_20_0.png
[16]:
groups = list(grid.obs['leiden'].cat.categories)
idc_cluster = groups[12]
dcis_cluster = groups[6]
for group in [idc_cluster, dcis_cluster]:
    fig, axes = plt.subplots(ncols=3, figsize=(20, 8))
    group_props = grid.uns['leiden'][group].values
    grid.obs['group'] = group_props
    st.pl.feat_plot(grid, feature='group', ax=axes[0], show_plot=False, vmax=1, show_color_bar=False)
    st.pl.cluster_plot(grid, use_label='leiden', list_clusters=[group], ax=axes[1], show_plot=False)
    st.pl.cluster_plot(adata, use_label='leiden', list_clusters=[group], ax=axes[2], show_plot=False)
    axes[0].set_title(f'Grid {group} proportions (max = 1)')
    axes[1].set_title(f'Grid {group} max spots')
    axes[2].set_title(f'Individual cell {group}')
    plt.show()
../_images/tutorials_cell_cell_interaction_xenium_21_0.png
../_images/tutorials_cell_cell_interaction_xenium_21_1.png
[17]:
fig, axes = plt.subplots(ncols=2, figsize=(20, 5))
st.pl.gene_plot(grid, gene_symbols='CXCL12', ax=axes[0], show_color_bar=False, show_plot=False)
st.pl.gene_plot(adata, gene_symbols='CXCL12', ax=axes[1], show_color_bar=False, show_plot=False, vmax=80)
axes[0].set_title(f'Grid CXCL12 expression')
axes[1].set_title(f'Cell CXLC12 expression')
plt.show()
../_images/tutorials_cell_cell_interaction_xenium_22_0.png

LR Permutation Test

Running the LR permutation test, to determine regions of high LR co-expression.

[18]:
# Loading the LR databases available within stlearn (from NATMI)
lrs = st.tl.cci.load_lrs(['connectomeDB2020_lit'], species='human')
print(len(lrs))
2293
[19]:
# Running the analysis #
st.tl.cci.run(grid, lrs,
              min_spots=20,    # Filter out any LR pairs with no scores for less than min_spots
              distance=250,    # None defaults to spot+immediate neighbours; distance=0 for within-spot mode
              n_pairs=1000,    # Number of random pairs to generate; low as example, recommend ~10,000
              n_cpus=n_cpus,   # Number of CPUs for parallel. If None, detects & use all available.
              random_state=seed
              )
Calculating neighbours...
1 spots with no neighbours, 10 median spot neighbours.
Spot neighbour indices stored in adata.obsm['spot_neighbours'] & adata.obsm['spot_neigh_bcs'].
Altogether 20 valid L-R pairs
Generating backgrounds & testing each LR pair...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ [ time left: 00:00 ]

Storing results:

lr_scores stored in adata.obsm['lr_scores'].
p_vals stored in adata.obsm['p_vals'].
p_adjs stored in adata.obsm['p_adjs'].
-log10(p_adjs) stored in adata.obsm['-log10(p_adjs)'].
lr_sig_scores stored in adata.obsm['lr_sig_scores'].

Per-spot results in adata.obsm have columns in same order as rows in adata.uns['lr_summary'].
Summary of LR results in adata.uns['lr_summary'].

[20]:
lr_info = grid.uns['lr_summary']  # A dataframe detailing the LR pairs ranked by number of significant spots.
print(lr_info.shape)
print(lr_info)
(20, 3)
                 n_spots  n_spots_sig  n_spots_sig_pval
CXCL12_CXCR4       14215          465              2858
CXCL12_CD4         14057          340              4122
SLAMF7_SLAMF7       8162          320               837
MMRN2_CLEC14A       7961          233               859
EDN1_EDNRB          8249          175               909
MMRN2_CD93         10928          105               818
EPCAM_EPCAM        13271           99              2135
PTPRC_MRC1         13055           92              1093
MRC1_PTPRC         13055           90              1085
C1QA_CD93          12527           81               908
CEACAM6_CEACAM8    10562           78              1564
CD274_CD80          4944           41               247
CD80_CD274          4944           40               261
SLAMF1_SLAMF1       7660           40               360
CD86_CTLA4         10168           34               234
CD80_CTLA4          6094           31               266
CCL5_SDC4          12907           24               219
CD274_PDCD1         1756           11                66
CDH1_EGFR          13712            7               365
PDCD1LG2_PDCD1      1902            6                59
[21]:
# Showing the rankings of the LR from a global and local perspective.
# Ranking based on number of significant hotspots.
st.pl.lr_summary(grid, n_top=500)
st.pl.lr_summary(grid, n_top=50, figsize=(10, 3))
Note: max_text (100) exceeds available LRs (20)
../_images/tutorials_cell_cell_interaction_xenium_27_1.png
Note: max_text (100) exceeds available LRs (20)
../_images/tutorials_cell_cell_interaction_xenium_27_3.png
[22]:
### Can adjust significance thresholds.
st.tl.cci.adj_pvals(grid, correct_axis='spot',
                    pval_adj_cutoff=0.05, adj_method='fdr_bh')
Updated adata.uns[lr_summary]
Updated adata.obsm[lr_scores]
Updated adata.obsm[lr_sig_scores]
Updated adata.obsm[p_vals]
Updated adata.obsm[p_adjs]
Updated adata.obsm[-log10(p_adjs)]

Downstream visualisations from LR analsyis

For more downstream visualisations, please see the stLearn CCI tutorial:

https://stlearn.readthedocs.io/en/latest/tutorials/stLearn-CCI.html

[23]:
best_lr = grid.uns['lr_summary'].index.values[0]  # Just choosing one of the top from lr_summary
[24]:
stats = ['lr_scores', '-log10(p_adjs)', 'lr_sig_scores']
fig, axes = plt.subplots(ncols=len(stats), figsize=(12, 6))
for i, stat in enumerate(stats):
    st.pl.lr_result_plot(grid, use_result=stat, use_lr=best_lr, show_color_bar=False, ax=axes[i])
    axes[i].set_title(f'{best_lr} {stat}')
../_images/tutorials_cell_cell_interaction_xenium_31_0.png

Predicting Significant Cell-Cell Interactions

For this analysis, we are using within-spot mode, with the deconvolution information which is stored in grid.uns, as automatically determined by st.tl.cci.grid. The permutation is performed to permute the cell proportions associated with each spot, to determine spots located in LR hotspots that have certain cell types over-represented.

[25]:
grid
[25]:
AnnData object with n_obs × n_vars = 14364 × 313
    obs: 'imagecol', 'imagerow', 'n_cells', 'leiden', 'group'
    uns: 'spatial', 'leiden', 'leiden_colors', 'grid_counts', 'grid_xedges', 'grid_yedges', 'lrfeatures', 'lr_summary'
    obsm: 'spatial', 'spot_neighbours', 'spot_neigh_bcs', 'lr_scores', 'p_vals', 'p_adjs', '-log10(p_adjs)', 'lr_sig_scores'
[26]:
st.tl.cci.run_cci(grid, 'leiden',  # Spot cell information either in data.obs or data.uns
                  min_spots=2,  # Minimum number of spots for LR to be tested.
                  spot_mixtures=True,  # If True will use the deconvolution data,
                  # so spots can have multiple cell types if score>cell_prop_cutoff
                  cell_prop_cutoff=0.1,  # Spot considered to have cell type if score>0.1
                  sig_spots=True,  # Only consider neighbourhoods of spots which had significant LR scores.
                  n_perms=100,  # Permutations of cell information to get background, recommend ~1000
                  n_cpus=n_cpus,
                  random_state=seed
                  )
Getting cached neighbourhood information...
Getting information for CCI counting...
Counting celltype-celltype interactions per LR and permuting 100 times.: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████ [ time left: 00:00 ]
Significant counts of cci_rank interactions for all LR pairs in data.uns[lr_cci_leiden]
Significant counts of cci_rank interactions for each LR pair stored in dictionary data.uns[per_lr_cci_leiden]

Diagnostic: checking for interaction and cell type frequency correlation

Should be little to no correlation; indicating the permutation has adequately controlled for cell type frequency.

[27]:
st.pl.cci_check(grid, 'leiden', figsize=(16, 5))
../_images/tutorials_cell_cell_interaction_xenium_36_0.png

CCI Visualisations

CCI Network Plot

[28]:
# Visualising the no. of interactions between cell types across all LR pairs #
pos_1 = st.pl.ccinet_plot(grid, 'leiden', return_pos=True, min_counts=30)

# Just examining the cell type interactions between selected pairs #
lrs = grid.uns['lr_summary'].index.values[0:3]
for best_lr in lrs[0:2]:
    st.pl.ccinet_plot(grid, 'leiden', best_lr, min_counts=2,
                      figsize=(10, 7.5), pos=pos_1,
                      )
../_images/tutorials_cell_cell_interaction_xenium_39_0.png
../_images/tutorials_cell_cell_interaction_xenium_39_1.png
../_images/tutorials_cell_cell_interaction_xenium_39_2.png

CCI Chord Plot

[29]:
st.pl.lr_chord_plot(grid, 'leiden')

for lr in lrs[0:2]:
    st.pl.lr_chord_plot(grid, 'leiden', lr)
../_images/tutorials_cell_cell_interaction_xenium_41_0.png
../_images/tutorials_cell_cell_interaction_xenium_41_1.png
../_images/tutorials_cell_cell_interaction_xenium_41_2.png

Acknowledgement

We would like to thank Soo Hee Lee and 10X Genomics team for their able support, contribution and feedback to this tutorial

Tutorial by Brad Balderson