ArchR process scATAC-seq data

Usually the scATAC-seq data is provided in the format of ‘fragments.tsv’. In this tutorial we showed the users how to utilize ArchR to process fragment files into a cell-by-peak matrix. The cell-by-peak matrix will be the input of CellOracle to generate the base GRN. Note that this notebook was excuted under an R environment.

The details of usage of ArchR can be refered at https://www.archrproject.com/articles/Articles/tutorial.html.

If you have already got the cell-by-peak matrix, you can directly skip to 1.2 to generate the base GRN.

import packages

[3]:
library(ArchR)
set.seed(1)
library(parallel)

load fragment files & create arrow files

[4]:
# - initial the ArchR folder to save temporary files
path = "/nfs/public/lichen/code/TFcomb_github_doc/TFcomb/tutorial_example/scATAC_process" # replace this to your own path
prefix = 'ArchR_files'
dir_ = file.path(dirname(dirname(path)), "data/iPSC_example", prefix)
dir.create(dir_, recursive=T)
setwd(dir=dir_)
[5]:
# - load the fragment files
inputFiles=c(
    'day00' = '/nfs/public/lichen/data/single_cell/reprogramming_case/fib_ipsc/ATAC/fastq/6231STDY9258270/Day00/outs/fragments.tsv.gz',
    'day14' = '/nfs/public/lichen/data/single_cell/reprogramming_case/fib_ipsc/ATAC/fastq/6231STDY9258270/Day14/outs/fragments.tsv.gz'
)

addArchRGenome("hg38") # be aware of the species version of refrence genome
# addArchRGenome("hg19")
# addArchRGenome("mm10")
# addArchRGenome("mm9")

addArchRThreads(threads = 16)
ArrowFiles <- createArrowFiles(
  inputFiles = inputFiles,
  sampleNames = names(inputFiles),
  minTSS = 4, #Dont set this too high because you can always increase later
  minFrags = 1000,
  addTileMat = TRUE,
  addGeneScoreMat = TRUE
)
Setting default genome to Hg38.

Setting default number of Parallel threads to 16.

Using GeneAnnotation set by addArchRGenome(Hg38)!

Using GeneAnnotation set by addArchRGenome(Hg38)!

ArchR logging to : ArchRLogs/ArchR-createArrows-1cdb03743a2877-Date-2023-09-20_Time-18-46-04.log
If there is an issue, please report to github with logFile!

Cleaning Temporary Files

2023-09-20 18:46:04 : Batch Execution w/ safelapply!, 0 mins elapsed.

ArchR logging successful to : ArchRLogs/ArchR-createArrows-1cdb03743a2877-Date-2023-09-20_Time-18-46-04.log

[6]:
doubScores <- addDoubletScores(
    input = ArrowFiles,
    k = 10, #Refers to how many cells near a "pseudo-doublet" to count.
    knnMethod = "UMAP", #Refers to the embedding to use for nearest neighbor search with doublet projection.
    LSIMethod = 1
)

proj <- ArchRProject(
  ArrowFiles = ArrowFiles,
  outputDirectory = prefix,
  copyArrows = TRUE #This is recommened so that if you modify the Arrow files you have an original copy for later usage.
)
ArchR logging to : ArchRLogs/ArchR-addDoubletScores-1cdb0332496e6f-Date-2023-09-20_Time-18-59-37.log
If there is an issue, please report to github with logFile!

2023-09-20 18:59:37 : Batch Execution w/ safelapply!, 0 mins elapsed.

2023-09-20 18:59:37 : day14 (1 of 2) :  Computing Doublet Statistics, 0.001 mins elapsed.

day14 (1 of 2) : UMAP Projection R^2 = 0.97165

day14 (1 of 2) : UMAP Projection R^2 = 0.97165

Warning message:
“`guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.”
2023-09-20 19:03:17 : day00 (2 of 2) :  Computing Doublet Statistics, 3.674 mins elapsed.

Filtering 1 dims correlated > 0.75 to log10(depth + 1)

Filtering 1 dims correlated > 0.75 to log10(depth + 1)

day00 (2 of 2) : UMAP Projection R^2 = 0.95947

day00 (2 of 2) : UMAP Projection R^2 = 0.95947

Warning message:
“`guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.”
ArchR logging successful to : ArchRLogs/ArchR-addDoubletScores-1cdb0332496e6f-Date-2023-09-20_Time-18-59-37.log

Using GeneAnnotation set by addArchRGenome(Hg38)!

Using GeneAnnotation set by addArchRGenome(Hg38)!

Validating Arrows...

Getting SampleNames...



Copying ArrowFiles to Ouptut Directory! If you want to save disk space set copyArrows = FALSE

1
2


Getting Cell Metadata...



Merging Cell Metadata...

Initializing ArchRProject...


                                                   / |
                                                 /    \
            .                                  /      |.
            \\\                              /        |.
              \\\                          /           `|.
                \\\                      /              |.
                  \                    /                |\
                  \\#####\           /                  ||
                ==###########>      /                   ||
                 \\##==......\    /                     ||
            ______ =       =|__ /__                     ||      \\\
        ,--' ,----`-,__ ___/'  --,-`-===================##========>
       \               '        ##_______ _____ ,--,__,=##,__   ///
        ,    __==    ___,-,__,--'#'  ==='      `-'    | ##,-/
        -,____,---'       \\####\\________________,--\\_##,/
           ___      .______        ______  __    __  .______
          /   \     |   _  \      /      ||  |  |  | |   _  \
         /  ^  \    |  |_)  |    |  ,----'|  |__|  | |  |_)  |
        /  /_\  \   |      /     |  |     |   __   | |      /
       /  _____  \  |  |\  \\___ |  `----.|  |  |  | |  |\  \\___.
      /__/     \__\ | _| `._____| \______||__|  |__| | _| `._____|


[7]:
proj <- filterDoublets(ArchRProj = proj)
Filtering 1053 cells from ArchRProject!

        day14 : 636 of 7975 (8%)

        day00 : 417 of 6465 (6.5%)

Clustering & UMAP plotting

[8]:
# call peak and save peak matrix

pathToMacs2='/data1/lichen/anaconda3/envs/R4.0.2/bin/macs2' # replace this with your own directory path

proj <- addIterativeLSI(ArchRProj = proj, useMatrix = "TileMatrix", name = "IterativeLSI",
                       varFeatures = 50000,iteration=4,force = TRUE)
Checking Inputs...

ArchR logging to : ArchRLogs/ArchR-addIterativeLSI-1cdb036a6fa807-Date-2023-09-20_Time-19-09-48.log
If there is an issue, please report to github with logFile!

2023-09-20 19:09:49 : Computing Total Across All Features, 0.009 mins elapsed.

2023-09-20 19:09:51 : Computing Top Features, 0.055 mins elapsed.

###########
2023-09-20 19:09:53 : Running LSI (1 of 4) on Top Features, 0.086 mins elapsed.
###########

2023-09-20 19:09:53 : Sampling Cells (N = 10001) for Estimated LSI, 0.087 mins elapsed.

2023-09-20 19:09:53 : Creating Sampled Partial Matrix, 0.087 mins elapsed.

2023-09-20 19:10:33 : Computing Estimated LSI (projectAll = FALSE), 0.75 mins elapsed.

2023-09-20 19:11:13 : Identifying Clusters, 1.421 mins elapsed.

2023-09-20 19:11:30 : Identified 6 Clusters, 1.694 mins elapsed.

2023-09-20 19:11:30 : Saving LSI Iteration, 1.694 mins elapsed.

2023-09-20 19:11:49 : Creating Cluster Matrix on the total Group Features, 2.008 mins elapsed.

2023-09-20 19:12:10 : Computing Variable Features, 2.366 mins elapsed.

###########
2023-09-20 19:12:10 : Running LSI (2 of 4) on Variable Features, 2.369 mins elapsed.
###########

2023-09-20 19:12:10 : Sampling Cells (N = 10001) for Estimated LSI, 2.37 mins elapsed.

2023-09-20 19:12:10 : Creating Sampled Partial Matrix, 2.37 mins elapsed.

2023-09-20 19:12:52 : Computing Estimated LSI (projectAll = FALSE), 3.061 mins elapsed.

2023-09-20 19:13:26 : Identifying Clusters, 3.639 mins elapsed.

2023-09-20 19:13:42 : Identified 6 Clusters, 3.898 mins elapsed.

2023-09-20 19:13:42 : Saving LSI Iteration, 3.898 mins elapsed.

2023-09-20 19:14:01 : Creating Cluster Matrix on the total Group Features, 4.221 mins elapsed.

2023-09-20 19:14:23 : Computing Variable Features, 4.575 mins elapsed.

###########
2023-09-20 19:14:23 : Running LSI (3 of 4) on Variable Features, 4.578 mins elapsed.
###########

2023-09-20 19:14:23 : Sampling Cells (N = 10001) for Estimated LSI, 4.579 mins elapsed.

2023-09-20 19:14:23 : Creating Sampled Partial Matrix, 4.579 mins elapsed.

2023-09-20 19:15:18 : Computing Estimated LSI (projectAll = FALSE), 5.494 mins elapsed.

2023-09-20 19:15:47 : Identifying Clusters, 5.986 mins elapsed.

2023-09-20 19:16:03 : Identified 6 Clusters, 6.251 mins elapsed.

2023-09-20 19:16:03 : Saving LSI Iteration, 6.252 mins elapsed.

2023-09-20 19:16:22 : Creating Cluster Matrix on the total Group Features, 6.569 mins elapsed.

2023-09-20 19:17:06 : Computing Variable Features, 7.299 mins elapsed.

###########
2023-09-20 19:17:06 : Running LSI (4 of 4) on Variable Features, 7.302 mins elapsed.
###########

2023-09-20 19:17:06 : Creating Partial Matrix, 7.302 mins elapsed.

2023-09-20 19:18:14 : Computing LSI, 8.43 mins elapsed.

2023-09-20 19:18:52 : Finished Running IterativeLSI, 9.057 mins elapsed.

[9]:
proj <- addClusters(input = proj, reducedDims = "IterativeLSI",
                       method = "Seurat",
                        name = "Clusters",
                        resolution = 0.8,
                        force = TRUE
                   )
ArchR logging to : ArchRLogs/ArchR-addClusters-1cdb032e879c77-Date-2023-09-20_Time-19-18-52.log
If there is an issue, please report to github with logFile!

2023-09-20 19:18:52 : Running Seurats FindClusters (Stuart et al. Cell 2019), 0.002 mins elapsed.

Computing nearest neighbor graph

Computing SNN

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 13387
Number of edges: 511225

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8593
Number of communities: 18
Elapsed time: 1 seconds
2023-09-20 19:19:11 : Testing Biased Clusters, 0.316 mins elapsed.

2023-09-20 19:19:11 : Testing Outlier Clusters, 0.316 mins elapsed.

2023-09-20 19:19:11 : Assigning Cluster Names to 18 Clusters, 0.316 mins elapsed.

2023-09-20 19:19:11 : Finished addClusters, 0.318 mins elapsed.

[10]:
proj <- addUMAP(
    ArchRProj = proj,
    reducedDims = "IterativeLSI",
    name = "UMAP",
    nNeighbors = 30,
    minDist = 0.5,
    metric = "cosine",
    force = TRUE
)
19:19:11 UMAP embedding parameters a = 0.583 b = 1.334

19:19:11 Read 13387 rows and found 30 numeric columns

19:19:11 Using Annoy for neighbor search, n_neighbors = 30

19:19:11 Building Annoy index with metric = cosine, n_trees = 50

0%   10   20   30   40   50   60   70   80   90   100%

[----|----|----|----|----|----|----|----|----|----|

*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
|

19:19:13 Writing NN index file to temp file /tmp/Rtmp4Pb07o/file1cdb034c71a7b8

19:19:13 Searching Annoy index using 48 threads, search_k = 3000

19:19:13 Annoy recall = 100%

19:19:14 Commencing smooth kNN distance calibration using 48 threads

19:19:15 Initializing from normalized Laplacian + noise

19:19:16 Commencing optimization for 200 epochs, with 637890 positive edges

19:19:23 Optimization finished

19:19:23 Creating temp model dir /tmp/Rtmp4Pb07o/dir1cdb033622ed54

19:19:23 Creating dir /tmp/Rtmp4Pb07o/dir1cdb033622ed54

19:19:24 Changing to /tmp/Rtmp4Pb07o/dir1cdb033622ed54

19:19:24 Creating /nfs/public/lichen/data/single_cell/reprogramming_case/fib_ipsc/ATAC/E-MTAB-11616/process/peak_calling/fib_ipsc_day00_day14/fib_ipsc_day00_day14/Embeddings/Save-Uwot-UMAP-Params-IterativeLSI-1cdb0328d210f-Date-2023-09-20_Time-19-19-23.tar

[11]:
# - check the umap of the clustering result.
# This step is important to check the batch of the data. If the data is from different batches,
# we recommend users to subset specific donor or tissue to generate the cell-by-peak matrix


p <- plotEmbedding(ArchRProj = proj, colorBy = "cellColData", name = "Clusters", embedding = "UMAP")
p
ArchR logging to : ArchRLogs/ArchR-plotEmbedding-1cdb031ad709f3-Date-2023-09-20_Time-19-19-24.log
If there is an issue, please report to github with logFile!

Getting UMAP Embedding

ColorBy = cellColData

Plotting Embedding

1


ArchR logging successful to : ArchRLogs/ArchR-plotEmbedding-1cdb031ad709f3-Date-2023-09-20_Time-19-19-24.log

../../_images/tutorial_example_scATAC_process_ArchR_process_13_1.png

Generate cell-by-peak matrix

[12]:
proj2 <- addGroupCoverages(ArchRProj = proj, groupBy = "Clusters")
ArchR logging to : ArchRLogs/ArchR-addGroupCoverages-1cdb032bb9681d-Date-2023-09-20_Time-19-19-30.log
If there is an issue, please report to github with logFile!

C1 (1 of 18) : CellGroups N = 2

C2 (2 of 18) : CellGroups N = 2

C3 (3 of 18) : CellGroups N = 2

C4 (4 of 18) : CellGroups N = 2

C5 (5 of 18) : CellGroups N = 2

C6 (6 of 18) : CellGroups N = 2

C7 (7 of 18) : CellGroups N = 2

C8 (8 of 18) : CellGroups N = 2

C9 (9 of 18) : CellGroups N = 2

C10 (10 of 18) : CellGroups N = 2

C11 (11 of 18) : CellGroups N = 2

C12 (12 of 18) : CellGroups N = 2

C13 (13 of 18) : CellGroups N = 2

C14 (14 of 18) : CellGroups N = 2

C15 (15 of 18) : CellGroups N = 2

C16 (16 of 18) : CellGroups N = 2

C17 (17 of 18) : CellGroups N = 2

C18 (18 of 18) : CellGroups N = 2

2023-09-20 19:19:31 : Creating Coverage Files!, 0.031 mins elapsed.

2023-09-20 19:19:31 : Batch Execution w/ safelapply!, 0.031 mins elapsed.

2023-09-20 19:24:39 : Adding Kmer Bias to Coverage Files!, 5.156 mins elapsed.

Completed Kmer Bias Calculation

Adding Kmer Bias (1 of 36)

Adding Kmer Bias (2 of 36)

Adding Kmer Bias (3 of 36)

Adding Kmer Bias (4 of 36)

Adding Kmer Bias (5 of 36)

Adding Kmer Bias (6 of 36)

Adding Kmer Bias (7 of 36)

Adding Kmer Bias (8 of 36)

Adding Kmer Bias (9 of 36)

Adding Kmer Bias (10 of 36)

Adding Kmer Bias (11 of 36)

Adding Kmer Bias (12 of 36)

Adding Kmer Bias (13 of 36)

Adding Kmer Bias (14 of 36)

Adding Kmer Bias (15 of 36)

Adding Kmer Bias (16 of 36)

Adding Kmer Bias (17 of 36)

Adding Kmer Bias (18 of 36)

Adding Kmer Bias (19 of 36)

Adding Kmer Bias (20 of 36)

Adding Kmer Bias (21 of 36)

Adding Kmer Bias (22 of 36)

Adding Kmer Bias (23 of 36)

Adding Kmer Bias (24 of 36)

Adding Kmer Bias (25 of 36)

Adding Kmer Bias (26 of 36)

Adding Kmer Bias (27 of 36)

Adding Kmer Bias (28 of 36)

Adding Kmer Bias (29 of 36)

Adding Kmer Bias (30 of 36)

Adding Kmer Bias (31 of 36)

Adding Kmer Bias (32 of 36)

Adding Kmer Bias (33 of 36)

Adding Kmer Bias (34 of 36)

Adding Kmer Bias (35 of 36)

Adding Kmer Bias (36 of 36)

2023-09-20 19:26:16 : Finished Creation of Coverage Files!, 6.782 mins elapsed.

ArchR logging successful to : ArchRLogs/ArchR-addGroupCoverages-1cdb032bb9681d-Date-2023-09-20_Time-19-19-30.log

[13]:
table(proj2$Clusters)

  C1  C10  C11  C12  C13  C14  C15  C16  C17  C18   C2   C3   C4   C5   C6   C7
1081 2123  305 1052 1055  383  155  563  284   96  334  847 1196  356  157  556
  C8   C9
 691 2153
[14]:
proj2 <- addReproduciblePeakSet(
    ArchRProj = proj2,
    groupBy = "Clusters",
    pathToMacs2 = pathToMacs2
)
ArchR logging to : ArchRLogs/ArchR-addReproduciblePeakSet-1cdb0359e50823-Date-2023-09-20_Time-19-26-18.log
If there is an issue, please report to github with logFile!

Calling Peaks with Macs2

2023-09-20 19:26:18 : Peak Calling Parameters!, 0.003 mins elapsed.

    Group nCells nCellsUsed nReplicates nMin nMax maxPeaks
C1     C1   1081        540           2   40  500   150000
C2     C2    334        334           2   40  294   150000
C3     C3    847        540           2   40  500   150000
C4     C4   1196        540           2   40  500   150000
C5     C5    356        356           2   40  316   150000
C6     C6    157        157           2   40  117    78500
C7     C7    556        540           2   40  500   150000
C8     C8    691        540           2   40  500   150000
C9     C9   2153        540           2   40  500   150000
C10   C10   2123        561           2   61  500   150000
C11   C11    305        305           2   40  265   150000
C12   C12   1052        540           2   40  500   150000
C13   C13   1055        984           2  484  500   150000
C14   C14    383        383           2   40  343   150000
C15   C15    155        155           2   40  115    77500
C16   C16    563        540           2   40  500   150000
C17   C17    284        284           2   40  244   142000
C18   C18     96         96           2   40   56    48000
2023-09-20 19:26:18 : Batching Peak Calls!, 0.003 mins elapsed.

2023-09-20 19:26:18 : Batch Execution w/ safelapply!, 0 mins elapsed.

2023-09-20 19:32:23 : Identifying Reproducible Peaks!, 6.082 mins elapsed.

2023-09-20 19:32:35 : Creating Union Peak Set!, 6.282 mins elapsed.

Converged after 7 iterations!

Plotting Ggplot!

2023-09-20 19:32:43 : Finished Creating Union Peak Set (246132)!, 6.414 mins elapsed.

[15]:
proj3 <- addPeakMatrix(proj2)
ArchR logging to : ArchRLogs/ArchR-addPeakMatrix-1cdb0372812190-Date-2023-09-20_Time-19-32-43.log
If there is an issue, please report to github with logFile!

2023-09-20 19:32:43 : Batch Execution w/ safelapply!, 0 mins elapsed.

ArchR logging successful to : ArchRLogs/ArchR-addPeakMatrix-1cdb0372812190-Date-2023-09-20_Time-19-32-43.log

[16]:
PeakMatrix=getMatrixFromProject(
  ArchRProj = proj3,
  useMatrix = "PeakMatrix",
  useSeqnames = NULL,
  verbose = TRUE,
  binarize = FALSE,
  threads = getArchRThreads(),
  logFile = createLogFile("getMatrixFromProject")
)
ArchR logging to : ArchRLogs/ArchR-getMatrixFromProject-1cdb036f0ac7cc-Date-2023-09-20_Time-19-35-39.log
If there is an issue, please report to github with logFile!

2023-09-20 19:36:46 : Organizing colData, 1.119 mins elapsed.

2023-09-20 19:36:46 : Organizing rowData, 1.119 mins elapsed.

2023-09-20 19:36:46 : Organizing rowRanges, 1.12 mins elapsed.

2023-09-20 19:36:46 : Organizing Assays (1 of 1), 1.12 mins elapsed.

2023-09-20 19:36:49 : Constructing SummarizedExperiment, 1.169 mins elapsed.

2023-09-20 19:36:50 : Finished Matrix Creation, 1.186 mins elapsed.

[17]:
pm_peak=getPeakSet(proj3)
pm_barcode=colData(PeakMatrix)
pm_matrix=assay(PeakMatrix)
[18]:
dim(pm_matrix)
  1. 246132
  2. 13387

save the results

[20]:
# - save the cell-by-peak matrix generated by ArchR
write.csv(pm_peak,paste0(dir_,'/',"pm_peak.csv"),row.names=TRUE)
write.csv(pm_barcode,paste0(dir_,'/',"pm_barcode.csv"),row.names=TRUE)
writeMM(as(pm_matrix, "dgTMatrix"), paste0(dir_,'/',"pm_matrix.mtx"))
NULL
[ ]: