TFcomb.tools module

Performs link recovery using a GAT-based graph neural network model.

This function trains a GAT model to recover gene regulatory network (GRN) links, filters low-confidence links, and updates the GRN and links for further analysis.

Parameters:
  • g (dgl.DGLGraph, optional) – Input graph representing the GRN. Defaults to None.

  • save_dir_GNN (str, optional) – Directory to save GNN models and predictions. Defaults to None.

  • tf_list (list, optional) – List of transcription factors. Defaults to None.

  • gene_list (list, optional) – List of all genes in the GRN. Defaults to None.

  • tf_GRN_dict (dict, optional) – Dictionary of TF-to-target regulatory relationships. Defaults to None.

  • oracle_part (co.Oracle, optional) – Oracle object for GRN inference and simulation. Defaults to None.

  • links_part (Links, optional) – Links object containing GRN links. Defaults to None.

  • combine (str, optional) – Cluster or condition to analyze. Defaults to None.

  • threshold_number (int, optional) – Number of links to consider during filtering. Defaults to 10000.

  • alpha_fit_GRN (float, optional) – Parameter for refitting the GRN. Defaults to 10.

  • n_splits (int, optional) – Number of cross-validation splits for training. Defaults to 10.

  • seed (int, optional) – Random seed for reproducibility. Defaults to 42.

  • neg_link_split (str, optional) – Strategy for negative link sampling (‘all’ or other). Defaults to ‘all’.

  • model_name (str, optional) – Name of the GNN model. Defaults to ‘GAT’.

  • pred_name (str, optional) – Name of the prediction module. Defaults to ‘mlp’.

  • hidden_dim (int, optional) – Dimension of hidden layers in the GNN. Defaults to 16.

  • out_dim (int, optional) – Output dimension of the GNN. Defaults to 7.

  • num_heads (list, optional) – List of attention heads for each GNN layer. Defaults to [4, 4, 6].

  • epochs (int, optional) – Number of training epochs. Defaults to 1500.

  • lr (float, optional) – Learning rate for the optimizer. Defaults to 0.01.

  • patience (int, optional) – Early stopping patience. Defaults to 300.

  • device (str, optional) – Device to run the model (‘cuda’ or ‘cpu’). Defaults to ‘cuda’.

  • verbose (bool, optional) – Whether to print training progress. Defaults to True.

  • count_fold (int, optional) – Minimum number of folds where a link must appear to be considered. Defaults to 9.

  • link_score_quantile (float, optional) – Quantile threshold for filtering link scores. Defaults to 0.1.

Returns:

A tuple containing:
  • gene_GRN_mtx (pd.DataFrame): Updated gene GRN matrix.

  • tf_GRN_mtx (pd.DataFrame): Updated TF GRN matrix.

  • tf_GRN_dict (dict): Updated TF-to-target regulatory dictionary.

  • links_recover (Links): Updated links object with recovered links.

Return type:

tuple
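
The interaction between count_fold and link_score_quantile can be pictured with a minimal sketch. It assumes each cross-validation fold yields a dictionary of predicted links with scores, that a link is kept only if it appears in at least count_fold folds, and that the remaining links are cut at a quantile of their mean score. The function name filter_links, the input layout, and the gene names are hypothetical; the actual filtering rule in TFcomb may differ.

```python
from collections import defaultdict


def filter_links(fold_predictions, count_fold=9, link_score_quantile=0.1):
    """Keep links seen in >= count_fold folds whose mean score clears a quantile cut.

    fold_predictions: list of dicts, one per fold, mapping (tf, target) -> score.
    Illustrative only; not the library's implementation.
    """
    scores = defaultdict(list)
    for fold in fold_predictions:
        for link, score in fold.items():
            scores[link].append(score)

    # Step 1: require the link to be predicted in enough folds.
    stable = {
        link: sum(vals) / len(vals)
        for link, vals in scores.items()
        if len(vals) >= count_fold
    }
    if not stable:
        return {}

    # Step 2: keep links whose mean score is at or above the quantile cutoff.
    ranked = sorted(stable.values())
    cutoff = ranked[int(link_score_quantile * (len(ranked) - 1))]
    return {link: s for link, s in stable.items() if s >= cutoff}


# Hypothetical predictions from three folds.
folds = [
    {("GATA1", "KLF1"): 0.9, ("GATA1", "TAL1"): 0.6, ("SPI1", "CEBPA"): 0.2},
    {("GATA1", "KLF1"): 0.8, ("GATA1", "TAL1"): 0.4},
    {("GATA1", "KLF1"): 0.7, ("GATA1", "TAL1"): 0.5, ("SPI1", "CEBPA"): 0.3},
]
kept = filter_links(folds, count_fold=3, link_score_quantile=0.5)
```

In this toy input the SPI1 link is dropped because it appears in only two of the three folds, before any score cutoff is applied.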

class TFcomb.tools.GRN_Dataset(adata_part, tf_GRN_mtx, tf_list)[source]

Bases: DGLDataset

A custom dataset class for representing Gene Regulatory Networks (GRN) as DGL graphs.

This class constructs a DGL graph dataset using gene expression data, transcription factors (TFs), and regulatory relationships encoded in a GRN matrix.

Parameters:
  • adata_part (anndata.AnnData) – A subset of AnnData containing gene expression data.

  • tf_GRN_mtx (pd.DataFrame) – A GRN matrix with TFs as rows and target genes as columns. The values represent regulatory weights.

  • tf_list (list) – A list of transcription factors to label nodes.

adata_part

Input AnnData object.

Type:

anndata.AnnData

tf_GRN_mtx

Input GRN matrix.

Type:

pd.DataFrame

tf_list

List of transcription factors.

Type:

list

graph

Constructed DGL graph with node and edge features.

Type:

dgl.DGLGraph

process()[source]

Constructs the DGL graph from the GRN matrix and gene expression data.

__getitem__(i)[source]

Returns the constructed graph. Since this dataset only contains one graph, i is ignored.

__len__()[source]

Returns the number of graphs in the dataset. Always 1 for this class.

process()[source]

Overrides DGLDataset.process() with the graph construction described in the method list above.
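
As a rough illustration of the kind of conversion this class performs, the sketch below turns a TF-by-gene weight table into node ids, TF labels, and weighted edges, and wraps the result in a one-item dataset. It deliberately avoids DGL so that it runs with the standard library alone. The names grn_to_edges and SingleGraphDataset, and the convention that a zero weight means "no edge", are assumptions for illustration rather than the class's actual internals.

```python
def grn_to_edges(tf_grn, tf_list):
    """Convert a TF-by-gene weight table into node ids, TF labels, and edges.

    tf_grn: dict mapping TF -> {target gene -> weight}.
    Zero weights are treated as "no edge" (an assumption of this sketch).
    """
    genes = sorted(set(tf_grn) | {g for targets in tf_grn.values() for g in targets})
    node_id = {gene: i for i, gene in enumerate(genes)}
    tfs = set(tf_list)
    is_tf = [int(gene in tfs) for gene in genes]  # node label: 1 for TFs, else 0
    edges = [
        (node_id[tf], node_id[target], weight)
        for tf, targets in tf_grn.items()
        for target, weight in targets.items()
        if weight != 0
    ]
    return node_id, is_tf, edges


class SingleGraphDataset:
    """Mimics the one-graph contract: any index returns the same graph."""

    def __init__(self, graph):
        self.graph = graph

    def __getitem__(self, i):
        return self.graph  # i is ignored, as in GRN_Dataset

    def __len__(self):
        return 1


tf_grn = {"A": {"B": 0.5, "C": 0.0}, "B": {"C": -1.2}}
node_id, is_tf, edges = grn_to_edges(tf_grn, tf_list=["A", "B"])
dataset = SingleGraphDataset((node_id, is_tf, edges))
```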

TFcomb.tools.TF_inference(adata, cluster_name_for_GRN_unit, tf_list, tf_GRN_mtx, gene_GRN_mtx, tf_GRN_mtx_ori, gene_GRN_mtx_ori, init_cluster, control_cluster, prop_mode='soft', layer_use='normalized_count', model='ridge', alpha=1, plot=True, a1=0.8, a2=0.15, a3=0.05, regression_percentile=90, annot_shifts=(2, 0.01), xlabel='index of TFs', ylabel='Expected alteration', save=None)[source]

Infers transcription factor (TF) variation between an initial and control state.

This function performs regression analysis using GRN matrices and imputed gene expression data to infer the expected TF activity alterations between two cell states.

Parameters:
  • adata (anndata.AnnData) – Input AnnData object containing single-cell data.

  • cluster_name_for_GRN_unit (str) – Column in adata.obs specifying cluster labels.

  • tf_list (list) – List of transcription factors to filter rows of the GRN matrix.

  • tf_GRN_mtx (pd.DataFrame) – GRN matrix with TFs as rows and genes as columns.

  • gene_GRN_mtx (pd.DataFrame) – GRN matrix with all genes as rows and columns.

  • tf_GRN_mtx_ori (pd.DataFrame) – Original GRN matrix with TFs as rows and genes as columns.

  • gene_GRN_mtx_ori (pd.DataFrame) – Original GRN matrix with all genes as rows and columns.

  • init_cluster (str) – Name of the initial cluster/state.

  • control_cluster (str) – Name of the control cluster/state.

  • prop_mode (str, optional) – Mode of propagation. Defaults to ‘soft’.
    • ‘fix’: Considers only direct links.

    • ‘soft’: Includes propagated links.

  • layer_use (str, optional) – Layer in adata to use for gene expression. Defaults to ‘normalized_count’.

  • model (str, optional) – Regression model to use (‘ridge’, ‘linear’, ‘lasso’). Defaults to ‘ridge’.

  • alpha (float, optional) – Regularization strength for Ridge/Lasso regression. Defaults to 1.

  • plot (bool, optional) – Whether to generate a plot of the results. Defaults to True.

  • a1 (float, optional) – Weight for the first propagation level. Defaults to 0.8.

  • a2 (float, optional) – Weight for the second propagation level. Defaults to 0.15.

  • a3 (float, optional) – Weight for the third propagation level. Defaults to 0.05.

  • regression_percentile (int, optional) – Percentile of regression coefficients to highlight in the plot. Defaults to 90.

  • annot_shifts (tuple, optional) – Annotation shifts for the plot (x, y). Defaults to (2, 0.01).

  • xlabel (str, optional) – Label for the x-axis of the plot. Defaults to ‘index of TFs’.

  • ylabel (str, optional) – Label for the y-axis of the plot. Defaults to ‘Expected alteration’.

  • save (str, optional) – Path to save the plot. If None, the plot is not saved. Defaults to None.

Returns:

  • rr (sklearn.base.RegressorMixin): Trained regression model.

  • X (numpy.ndarray): Design matrix used for regression.

  • init_ave (numpy.ndarray): Average gene expression in the initial cluster.

  • control_ave (numpy.ndarray): Average gene expression in the control cluster.

Return type:

tuple
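
One plausible reading of the ‘soft’ mode and the a1, a2, a3 weights is a weighted sum of one-step, two-step, and three-step regulatory paths. The sketch below shows that reading on a small gene-by-gene matrix, with rows as regulators and columns as targets. Both the formula and the helper names matmul and propagate are assumptions; the exact propagation used by TF_inference is not specified here.

```python
def matmul(a, b):
    """Plain-Python matrix product for small dense matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def propagate(adj, a1=0.8, a2=0.15, a3=0.05):
    """Combine direct, two-step, and three-step links with weights a1, a2, a3.

    Assumed formula: a1 * A + a2 * A^2 + a3 * A^3 (illustrative only).
    """
    two = matmul(adj, adj)
    three = matmul(two, adj)
    n = len(adj)
    return [
        [a1 * adj[i][j] + a2 * two[i][j] + a3 * three[i][j] for j in range(n)]
        for i in range(n)
    ]


# Toy chain: gene 0 -> gene 1 -> gene 2 -> gene 3, all weights 1.
adj = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
soft = propagate(adj)
```

Under this reading, gene 0 influences gene 1 with weight a1, gene 2 with weight a2, and gene 3 with weight a3, whereas a ‘fix’ mode would keep only the direct link to gene 1.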

TFcomb.tools.get_GRN_parameters(oracle, combine)[source]

Extracts GRN (Gene Regulatory Network) parameters for downstream analysis.

This function processes the Oracle object’s GRN data and calculates key GRN-related matrices and dictionaries required for further analysis.

Parameters:
  • oracle (co.Oracle) – The Oracle object containing the GRN data.

  • combine (str) – The specific cluster or condition to extract GRN information from.

Returns:

A tuple containing the following:
  • gene_GRN_mtx_ori (pd.DataFrame): Original GRN matrix with all genes as rows and columns.

  • tf_GRN_mtx_ori (pd.DataFrame): Filtered GRN matrix with TFs (transcription factors) as rows and all genes as columns.

  • tf_GRN_dict (dict): Dictionary where keys are TFs and values are dictionaries. The nested dictionary maps target genes to their regulatory scores.

Return type:

tuple
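
The nested layout of tf_GRN_dict can be sketched as follows. The example assumes the full gene-by-gene GRN is available as a dictionary of regulator rows, keeps only regulators that are TFs, and drops zero scores. The helper name matrix_to_tf_dict, the zero-dropping rule, and the gene names are hypothetical.

```python
def matrix_to_tf_dict(gene_grn, tf_list):
    """Build {TF: {target gene: score}} from a regulator-by-target table.

    gene_grn: dict mapping regulator -> {target -> score}.
    Zero scores are omitted (an assumption of this sketch).
    """
    tfs = set(tf_list)
    return {
        regulator: {target: s for target, s in targets.items() if s != 0}
        for regulator, targets in gene_grn.items()
        if regulator in tfs
    }


gene_grn = {
    "GATA1": {"KLF1": 0.7, "TAL1": 0.0},
    "KLF1": {"HBB": 1.1},
    "HBB": {"GATA1": 0.2},  # not a TF, so this row is dropped
}
tf_grn_dict = matrix_to_tf_dict(gene_grn, tf_list=["GATA1", "KLF1"])
score = tf_grn_dict["GATA1"]["KLF1"]  # regulatory score of GATA1 on KLF1
```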

TFcomb.tools.get_benchmark_score(gt_list, tf_list)[source]

Computes the benchmark score based on the ranking of predicted transcription factors (TFs) compared to the ground truth list of TFs.

The score rewards TFs that appear earlier in the predicted list if they are present in the ground truth list. The benchmark score is based on a weighted sum where the weight decreases as the position in the predicted list increases.

Parameters:
  • gt_list (list) – A list of ground truth transcription factors (TFs), which represents the correct order of TFs.

  • tf_list (list) – A list of predicted transcription factors (TFs) based on the model.

Returns:

The benchmark score indicating the quality of the prediction. The score is between 0 and 1, with higher values indicating better performance.

Return type:

float
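
A rank-weighted score of this kind can be sketched as below. The linear weights and the normalization by the best achievable sum are assumptions chosen so that the result falls between 0 and 1; the formula used by get_benchmark_score may be different. The function name rank_weighted_score is hypothetical, and the predicted list is assumed to contain no duplicates.

```python
def rank_weighted_score(gt_list, tf_list):
    """Score a ranked prediction against a ground-truth set, in [0, 1].

    Position i (0-based) in tf_list gets weight (n - i) / n, so earlier hits
    count more. Illustrative weighting only.
    """
    n = len(tf_list)
    if n == 0 or not gt_list:
        return 0.0
    gt = set(gt_list)
    weights = [(n - i) / n for i in range(n)]
    achieved = sum(w for tf, w in zip(tf_list, weights) if tf in gt)
    # Best case: every ground-truth TF occupies the top positions.
    best = sum(weights[: min(len(gt), n)])
    return achieved / best


perfect = rank_weighted_score(["A", "B"], ["A", "B", "C", "D"])
partial = rank_weighted_score(["A", "B"], ["A", "C", "B", "D"])
missed = rank_weighted_score(["A", "B"], ["C", "D"])
```

Here the prediction that ranks both ground-truth TFs first scores highest, the one that pushes "B" to third place scores lower, and the one with no hits scores zero.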

TFcomb.tools.get_de_genes(adata, cluster_name_for_GRN_unit, init_cluster, control_cluster, tf_list, p_val=0.05)[source]

Identifies differentially expressed genes (DEGs) between two clusters in an AnnData object.

This function performs differential expression analysis between the init_cluster and the control_cluster, returning genes that are upregulated (positive) and downregulated (negative) in the init_cluster relative to the control_cluster, as well as those that are transcription factors (TFs).

Parameters:
  • adata (anndata.AnnData) – Annotated data matrix containing the gene expression data.

  • cluster_name_for_GRN_unit (str) – The column name in adata.obs that contains cluster labels.

  • init_cluster (str) – The name of the initial cluster to compare.

  • control_cluster (str) – The name of the control cluster to compare.

  • tf_list (list) – List of known transcription factors (TFs) to filter the DEGs.

  • p_val (float, optional) – The adjusted p-value threshold for determining significant genes. Defaults to 0.05.

Returns:

A tuple containing four lists:
  • pos_gene (list): Genes upregulated in init_cluster relative to control_cluster.

  • neg_gene (list): Genes downregulated in init_cluster relative to control_cluster.

  • pos_gene_tf (list): Transcription factors among the upregulated genes.

  • neg_gene_tf (list): Transcription factors among the downregulated genes.

Return type:

tuple
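
The four returned lists can be pictured with a small sketch that splits a differential-expression table by the sign of the fold change and then intersects each side with the TF list. It assumes the table rows are (gene, log fold change of init_cluster versus control_cluster, adjusted p-value) and that significance means an adjusted p-value strictly below p_val. The function name split_de_genes and the example values are hypothetical.

```python
def split_de_genes(de_table, tf_list, p_val=0.05):
    """Split significant genes into up/down lists and their TF subsets.

    de_table: iterable of (gene, log_fold_change, adjusted_p_value).
    """
    tfs = set(tf_list)
    significant = [(g, lfc) for g, lfc, padj in de_table if padj < p_val]
    pos_gene = [g for g, lfc in significant if lfc > 0]  # up in init_cluster
    neg_gene = [g for g, lfc in significant if lfc < 0]  # down in init_cluster
    pos_gene_tf = [g for g in pos_gene if g in tfs]
    neg_gene_tf = [g for g in neg_gene if g in tfs]
    return pos_gene, neg_gene, pos_gene_tf, neg_gene_tf


de_table = [
    ("GATA1", 2.1, 0.001),
    ("HBB", 3.0, 0.01),
    ("SPI1", -1.5, 0.02),
    ("ACTB", 0.1, 0.8),  # not significant
    ("CD34", -2.0, 0.04),
]
pos_gene, neg_gene, pos_gene_tf, neg_gene_tf = split_de_genes(
    de_table, tf_list=["GATA1", "SPI1"]
)
```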

TFcomb.tools.get_directing_score(change_tf, rr, tf_GRN_mtx, tf_GRN_dict, diff_ave=None, mode='single', if_print=False, X=None)[source]

Computes the predictive correlation (PCC) and accuracy of transcription factor (TF) changes.

This function evaluates the correlation and/or accuracy of predicted changes in target gene expression based on TF alterations, using a trained regression model and the GRN matrix.

Parameters:
  • change_tf (list) – List of TFs to modify.

  • rr (sklearn.base.RegressorMixin) – Trained regression model.

  • tf_GRN_mtx (pd.DataFrame) – GRN matrix where rows are genes and columns are TFs.

  • tf_GRN_dict (dict) – Dictionary mapping each TF to its regulated genes and their scores.

  • diff_ave (numpy.ndarray, optional) – Observed differences between control and initial states. Defaults to None.

  • mode (str, optional) – Evaluation mode. Defaults to ‘single’.
    • ‘single’: Compute PCC and accuracy for each TF individually.

    • ‘multi’: Compute PCC for all input TFs together.

  • if_print (bool, optional) – Whether to print the results. Defaults to False.

  • X (numpy.ndarray, optional) – Optional design matrix for regression. If None, it is calculated from tf_GRN_mtx. Defaults to None.

Returns:

  • If mode=‘single’: A tuple containing:
    • TF_pcc_dict (dict): Dictionary mapping each TF to its PCC.

    • TF_acc_dict (dict): Dictionary mapping each TF to its accuracy score.

  • If mode=‘multi’: A single PCC value.

Return type:

Union[Tuple[dict, dict], float]
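
The two quantities can be illustrated on a toy vector of predicted and observed expression changes. This sketch assumes PCC denotes the Pearson correlation coefficient and that accuracy is the fraction of genes whose predicted change has the same sign as the observed change. Both are assumptions about the metric definitions, and the helper names pearson and sign_accuracy are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def sign_accuracy(pred, obs):
    """Fraction of entries where predicted and observed changes share a sign."""
    return sum((p > 0) == (o > 0) for p, o in zip(pred, obs)) / len(pred)


predicted_change = [1.0, 2.0, -1.0, 0.5]
observed_change = [2.0, 4.0, -2.0, -1.0]
pcc = pearson(predicted_change, observed_change)
acc = sign_accuracy(predicted_change, observed_change)
```

In this toy case three of the four genes move in the predicted direction, so the sign accuracy is 0.75 while the correlation stays positive but below 1.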

TFcomb.tools.get_multi_TF(tf_list, rr_corr, tf_GRN_mtx, tf_GRN_dict, source_ave, target_ave, direction, number)[source]

Retrieves and ranks all possible combinations of transcription factors (TFs) of a specified size (number of TFs) based on their directing scores.

This function considers TFs based on the direction of their coefficients in the regression model (rr_corr) and computes their directing score for all combinations of a specified size.

Parameters:
  • tf_list (list) – A list of transcription factors (TFs) to be considered for analysis.

  • rr_corr (sklearn.linear_model.Ridge) – A fitted regression model with coefficients for each TF in tf_list.

  • tf_GRN_mtx (pandas.DataFrame) – A matrix of TF-gene interactions used to calculate directing scores.

  • tf_GRN_dict (dict) – A dictionary containing the TF-gene relationships used for score calculations.

  • source_ave (numpy.ndarray) – The average expression of the source cluster (initial state).

  • target_ave (numpy.ndarray) – The average expression of the target cluster (control state).

  • direction (str) – The direction of TF correlation to filter by.
    • ‘pos’: Selects TFs with positive coefficients.

    • ‘neg’: Selects TFs with negative coefficients.

  • number (int) – The number of TFs to combine in each group.

Returns:

A DataFrame containing the following columns:
  • ‘TF_combination’: A tuple of TFs in the combination.

  • ‘directing_score’: The directing score for each TF combination, which reflects the predicted effect of the combination on gene expression.

Return type:

pandas.DataFrame
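
The enumeration step can be sketched with itertools.combinations: filter TFs by the sign of their coefficient, score every combination of the requested size, and sort by score. The scoring function here is a stand-in supplied by the caller (a simple sum of toy per-TF values), not the directing score computed by the library, and the name rank_combinations is hypothetical.

```python
from itertools import combinations


def rank_combinations(tf_list, coefs, score_fn, direction="pos", number=2):
    """Rank all size-`number` TF combinations among sign-filtered TFs.

    coefs: regression coefficients aligned with tf_list.
    score_fn: callable taking a tuple of TFs and returning a float.
    """
    if direction not in ("pos", "neg"):
        raise ValueError("direction must be 'pos' or 'neg'")
    keep = [
        tf
        for tf, c in zip(tf_list, coefs)
        if (c > 0 if direction == "pos" else c < 0)
    ]
    rows = [(combo, score_fn(combo)) for combo in combinations(keep, number)]
    return sorted(rows, key=lambda row: row[1], reverse=True)


toy_score = {"A": 0.4, "C": 0.3, "D": 0.1}  # placeholder per-TF values
ranked = rank_combinations(
    ["A", "B", "C", "D"],
    [0.5, -0.2, 0.3, 0.1],
    score_fn=lambda combo: sum(toy_score[tf] for tf in combo),
    direction="pos",
    number=2,
)
```

The number of combinations grows quickly with the size of the filtered TF list, so a sketch like this is practical only for small values of number.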

TFcomb.tools.get_percentile_thre(df, value, fillna_with_zero, cluster1, cluster2, TF_number)[source]

Determines the optimal percentile threshold to select a specific number of transcription factors (TFs) based on their expression levels in two clusters.

This function calculates the percentile threshold that selects a set of TFs whose number is closest to the desired TF_number, based on their expression in two clusters (cluster1 and cluster2). It iteratively adjusts the threshold until the number of selected TFs matches TF_number.

Parameters:
  • df (pandas.DataFrame) – A dataframe containing the expression data for genes. It should have at least the following columns: index, cluster, and the column specified by value representing expression values.

  • value (str) – The name of the column in df that contains the expression values for the genes.

  • fillna_with_zero (bool) – If True, fills NaN values with zeros. If False, fills NaN values with the mean expression across clusters.

  • cluster1 (str) – The name of the first cluster to use for selecting TFs.

  • cluster2 (str) – The name of the second cluster to use for selecting TFs.

  • TF_number (int) – The target number of TFs to select based on their expression.

Returns:

The percentile threshold that results in selecting TF_number TFs.

Return type:

float
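
A threshold search of this kind can be sketched as a scan over candidate percentiles that keeps the one whose selection size is closest to the target. The sketch assumes a TF is selected when its expression reaches the percentile cutoff in either cluster, and that a TF missing from the second cluster is treated as zero (mirroring fillna_with_zero=True). The selection rule, the integer percentile grid, and the names percentile and find_percentile_threshold are all assumptions.

```python
def percentile(values, p):
    """Linearly interpolated percentile for p in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * p / 100
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def find_percentile_threshold(expr1, expr2, tf_number, step=1):
    """Return the percentile whose selection size is closest to tf_number.

    expr1, expr2: dicts mapping TF -> expression in cluster1 / cluster2.
    Ties are resolved in favour of the lowest percentile.
    """
    best_p, best_gap = 0, None
    for p in range(0, 101, step):
        t1 = percentile(expr1.values(), p)
        t2 = percentile(expr2.values(), p)
        n_selected = sum(
            1 for tf in expr1 if expr1[tf] >= t1 or expr2.get(tf, 0.0) >= t2
        )
        gap = abs(n_selected - tf_number)
        if best_gap is None or gap < best_gap:
            best_p, best_gap = p, gap
    return float(best_p)


expr1 = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0}
expr2 = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0}
threshold = find_percentile_threshold(expr1, expr2, tf_number=2)
```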

TFcomb.tools.get_single_TF(tf_list, rr_corr, TF_ds_dict, direction='pos')[source]

Retrieves a list of transcription factors (TFs) that are either positively or negatively correlated with the expected alteration and ranks them based on their directing score.

This function filters the list of TFs based on their coefficients in the regression model (rr_corr) and returns a DataFrame containing the selected TFs along with their directing scores and expected alterations.

Parameters:
  • tf_list (list) – A list of transcription factors (TFs) to be considered for analysis.

  • rr_corr (sklearn.linear_model.Ridge) – A fitted regression model with coefficients for each TF in tf_list.

  • TF_ds_dict (dict) – A dictionary where keys are TFs and values are their directing scores.

  • direction (str, optional) – The direction of TF correlation to filter by. Defaults to ‘pos’.
    • ‘pos’: Selects TFs with positive correlation (coefficients > 0).

    • ‘neg’: Selects TFs with negative correlation (coefficients < 0).

Returns:

A DataFrame containing the following columns:
  • ‘TF’: The transcription factor.

  • ‘directing_score’: The directing score for each TF from TF_ds_dict.

  • ‘expected_alteration’: The expected alteration value (i.e., the regression coefficient for each TF).

Return type:

pandas.DataFrame
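
The filter-then-rank behaviour described above can be sketched with plain Python records in place of a DataFrame. The sketch assumes the coefficients are passed as a list aligned with tf_list (in the real call they would come from the fitted model) and that TFs without a directing score are skipped. The function name rank_single_tfs is hypothetical.

```python
def rank_single_tfs(tf_list, coefs, tf_ds_dict, direction="pos"):
    """Filter TFs by coefficient sign and sort by directing score, descending.

    Returns a list of dicts with the same three fields as the documented
    DataFrame columns.
    """
    rows = [
        {"TF": tf, "directing_score": tf_ds_dict[tf], "expected_alteration": c}
        for tf, c in zip(tf_list, coefs)
        if tf in tf_ds_dict and ((c > 0) if direction == "pos" else (c < 0))
    ]
    return sorted(rows, key=lambda row: row["directing_score"], reverse=True)


tf_ds = {"A": 0.6, "B": 0.5, "C": 0.2}
pos_rows = rank_single_tfs(["A", "B", "C"], [0.4, -0.3, 0.9], tf_ds, direction="pos")
neg_rows = rank_single_tfs(["A", "B", "C"], [0.4, -0.3, 0.9], tf_ds, direction="neg")
```

Note that the ranking follows the directing score, not the coefficient: "C" has the largest coefficient but ranks below "A" because its directing score is lower.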

TFcomb.tools.import_TF_data(TF_info_matrix=None, TF_info_matrix_path=None, TFdict=None)[source]

Loads data about potential regulatory TFs. You can import either TF_info_matrix or TFdict. For more information on how to make these files, see the motif analysis module in the CellOracle tutorial.

Parameters:
  • TF_info_matrix (pandas.DataFrame) – TF_info_matrix.

  • TF_info_matrix_path (str) – File path for TF_info_matrix (pandas.DataFrame).

  • TFdict (dictionary) – Python dictionary of TF info.

TFcomb.tools.select_add_TF(adata_anno, cluster_name_for_GRN_unit, init_cluster, control_cluster, base_GRN_dir, subset=False, filter_num_1=100, filter_num_2=100, filter_num_3=400)[source]

Selects additional transcription factors (TFs) based on different gene grouping criteria from the given data.

The function performs multiple filtering steps based on various conditions such as highly variable genes, marker genes, and high-expression genes. It then returns a list of TFs that satisfy the filtering criteria across these groupings.

Parameters:
  • adata_anno (AnnData) – Annotated data matrix containing gene expression and metadata.

  • cluster_name_for_GRN_unit (str) – The name of the cluster column in the adata_anno object.

  • init_cluster (str) – The name of the initial cluster to be compared in the analysis.

  • control_cluster (str) – The name of the control cluster to be compared in the analysis.

  • base_GRN_dir (str) – The directory path for the base GRN file in Parquet format.

  • subset (bool, optional) – Whether to use a subset of the data for filtering. Defaults to False.

  • filter_num_1 (int, optional) – Number of TFs to be filtered based on marker gene grouping. Defaults to 100.

  • filter_num_2 (int, optional) – Number of TFs to be filtered based on high-expression gene grouping. Defaults to 100.

  • filter_num_3 (int, optional) – Number of TFs to be filtered based on highly variable gene grouping. Defaults to 400.

Returns:

A list of transcription factors (TFs) that satisfy all the filtering conditions.

Return type:

list