ArchR process scATAC-seq data¶
Usually the scATAC-seq data is provided in the format of ‘fragments.tsv’. In this tutorial we showed the users how to utilize ArchR to process fragment files into a cell-by-peak matrix. The cell-by-peak matrix will be the input of CellOracle to generate the base GRN. Note that this notebook was excuted under an R environment.
The details of usage of ArchR can be refered at https://www.archrproject.com/articles/Articles/tutorial.html.
If you have already got the cell-by-peak matrix, you can directly skip to 1.2 to generate the base GRN.
import packages¶
[3]:
library(ArchR)
set.seed(1)
library(parallel)
load fragment files & create arrow files¶
[4]:
# - initial the ArchR folder to save temporary files
path = "/nfs/public/lichen/code/TFcomb_github_doc/TFcomb/tutorial_example/scATAC_process" # replace this to your own path
prefix = 'ArchR_files'
dir_ = file.path(dirname(dirname(path)), "data/iPSC_example", prefix)
dir.create(dir_, recursive=T)
setwd(dir=dir_)
[5]:
# - load the fragment files
inputFiles=c(
'day00' = '/nfs/public/lichen/data/single_cell/reprogramming_case/fib_ipsc/ATAC/fastq/6231STDY9258270/Day00/outs/fragments.tsv.gz',
'day14' = '/nfs/public/lichen/data/single_cell/reprogramming_case/fib_ipsc/ATAC/fastq/6231STDY9258270/Day14/outs/fragments.tsv.gz'
)
addArchRGenome("hg38") # be aware of the species version of refrence genome
# addArchRGenome("hg19")
# addArchRGenome("mm10")
# addArchRGenome("mm9")
addArchRThreads(threads = 16)
ArrowFiles <- createArrowFiles(
inputFiles = inputFiles,
sampleNames = names(inputFiles),
minTSS = 4, #Dont set this too high because you can always increase later
minFrags = 1000,
addTileMat = TRUE,
addGeneScoreMat = TRUE
)
Setting default genome to Hg38.
Setting default number of Parallel threads to 16.
Using GeneAnnotation set by addArchRGenome(Hg38)!
Using GeneAnnotation set by addArchRGenome(Hg38)!
ArchR logging to : ArchRLogs/ArchR-createArrows-1cdb03743a2877-Date-2023-09-20_Time-18-46-04.log
If there is an issue, please report to github with logFile!
Cleaning Temporary Files
2023-09-20 18:46:04 : Batch Execution w/ safelapply!, 0 mins elapsed.
ArchR logging successful to : ArchRLogs/ArchR-createArrows-1cdb03743a2877-Date-2023-09-20_Time-18-46-04.log
[6]:
doubScores <- addDoubletScores(
input = ArrowFiles,
k = 10, #Refers to how many cells near a "pseudo-doublet" to count.
knnMethod = "UMAP", #Refers to the embedding to use for nearest neighbor search with doublet projection.
LSIMethod = 1
)
proj <- ArchRProject(
ArrowFiles = ArrowFiles,
outputDirectory = prefix,
copyArrows = TRUE #This is recommened so that if you modify the Arrow files you have an original copy for later usage.
)
ArchR logging to : ArchRLogs/ArchR-addDoubletScores-1cdb0332496e6f-Date-2023-09-20_Time-18-59-37.log
If there is an issue, please report to github with logFile!
2023-09-20 18:59:37 : Batch Execution w/ safelapply!, 0 mins elapsed.
2023-09-20 18:59:37 : day14 (1 of 2) : Computing Doublet Statistics, 0.001 mins elapsed.
day14 (1 of 2) : UMAP Projection R^2 = 0.97165
day14 (1 of 2) : UMAP Projection R^2 = 0.97165
Warning message:
“`guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.”
2023-09-20 19:03:17 : day00 (2 of 2) : Computing Doublet Statistics, 3.674 mins elapsed.
Filtering 1 dims correlated > 0.75 to log10(depth + 1)
Filtering 1 dims correlated > 0.75 to log10(depth + 1)
day00 (2 of 2) : UMAP Projection R^2 = 0.95947
day00 (2 of 2) : UMAP Projection R^2 = 0.95947
Warning message:
“`guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.”
ArchR logging successful to : ArchRLogs/ArchR-addDoubletScores-1cdb0332496e6f-Date-2023-09-20_Time-18-59-37.log
Using GeneAnnotation set by addArchRGenome(Hg38)!
Using GeneAnnotation set by addArchRGenome(Hg38)!
Validating Arrows...
Getting SampleNames...
Copying ArrowFiles to Ouptut Directory! If you want to save disk space set copyArrows = FALSE
1
2
Getting Cell Metadata...
Merging Cell Metadata...
Initializing ArchRProject...
/ |
/ \
. / |.
\\\ / |.
\\\ / `|.
\\\ / |.
\ / |\
\\#####\ / ||
==###########> / ||
\\##==......\ / ||
______ = =|__ /__ || \\\
,--' ,----`-,__ ___/' --,-`-===================##========>
\ ' ##_______ _____ ,--,__,=##,__ ///
, __== ___,-,__,--'#' ===' `-' | ##,-/
-,____,---' \\####\\________________,--\\_##,/
___ .______ ______ __ __ .______
/ \ | _ \ / || | | | | _ \
/ ^ \ | |_) | | ,----'| |__| | | |_) |
/ /_\ \ | / | | | __ | | /
/ _____ \ | |\ \\___ | `----.| | | | | |\ \\___.
/__/ \__\ | _| `._____| \______||__| |__| | _| `._____|
[7]:
proj <- filterDoublets(ArchRProj = proj)
Filtering 1053 cells from ArchRProject!
day14 : 636 of 7975 (8%)
day00 : 417 of 6465 (6.5%)
Clustering & UMAP plotting¶
[8]:
# call peak and save peak matrix
pathToMacs2='/data1/lichen/anaconda3/envs/R4.0.2/bin/macs2' # replace this with your own directory path
proj <- addIterativeLSI(ArchRProj = proj, useMatrix = "TileMatrix", name = "IterativeLSI",
varFeatures = 50000,iteration=4,force = TRUE)
Checking Inputs...
ArchR logging to : ArchRLogs/ArchR-addIterativeLSI-1cdb036a6fa807-Date-2023-09-20_Time-19-09-48.log
If there is an issue, please report to github with logFile!
2023-09-20 19:09:49 : Computing Total Across All Features, 0.009 mins elapsed.
2023-09-20 19:09:51 : Computing Top Features, 0.055 mins elapsed.
###########
2023-09-20 19:09:53 : Running LSI (1 of 4) on Top Features, 0.086 mins elapsed.
###########
2023-09-20 19:09:53 : Sampling Cells (N = 10001) for Estimated LSI, 0.087 mins elapsed.
2023-09-20 19:09:53 : Creating Sampled Partial Matrix, 0.087 mins elapsed.
2023-09-20 19:10:33 : Computing Estimated LSI (projectAll = FALSE), 0.75 mins elapsed.
2023-09-20 19:11:13 : Identifying Clusters, 1.421 mins elapsed.
2023-09-20 19:11:30 : Identified 6 Clusters, 1.694 mins elapsed.
2023-09-20 19:11:30 : Saving LSI Iteration, 1.694 mins elapsed.
2023-09-20 19:11:49 : Creating Cluster Matrix on the total Group Features, 2.008 mins elapsed.
2023-09-20 19:12:10 : Computing Variable Features, 2.366 mins elapsed.
###########
2023-09-20 19:12:10 : Running LSI (2 of 4) on Variable Features, 2.369 mins elapsed.
###########
2023-09-20 19:12:10 : Sampling Cells (N = 10001) for Estimated LSI, 2.37 mins elapsed.
2023-09-20 19:12:10 : Creating Sampled Partial Matrix, 2.37 mins elapsed.
2023-09-20 19:12:52 : Computing Estimated LSI (projectAll = FALSE), 3.061 mins elapsed.
2023-09-20 19:13:26 : Identifying Clusters, 3.639 mins elapsed.
2023-09-20 19:13:42 : Identified 6 Clusters, 3.898 mins elapsed.
2023-09-20 19:13:42 : Saving LSI Iteration, 3.898 mins elapsed.
2023-09-20 19:14:01 : Creating Cluster Matrix on the total Group Features, 4.221 mins elapsed.
2023-09-20 19:14:23 : Computing Variable Features, 4.575 mins elapsed.
###########
2023-09-20 19:14:23 : Running LSI (3 of 4) on Variable Features, 4.578 mins elapsed.
###########
2023-09-20 19:14:23 : Sampling Cells (N = 10001) for Estimated LSI, 4.579 mins elapsed.
2023-09-20 19:14:23 : Creating Sampled Partial Matrix, 4.579 mins elapsed.
2023-09-20 19:15:18 : Computing Estimated LSI (projectAll = FALSE), 5.494 mins elapsed.
2023-09-20 19:15:47 : Identifying Clusters, 5.986 mins elapsed.
2023-09-20 19:16:03 : Identified 6 Clusters, 6.251 mins elapsed.
2023-09-20 19:16:03 : Saving LSI Iteration, 6.252 mins elapsed.
2023-09-20 19:16:22 : Creating Cluster Matrix on the total Group Features, 6.569 mins elapsed.
2023-09-20 19:17:06 : Computing Variable Features, 7.299 mins elapsed.
###########
2023-09-20 19:17:06 : Running LSI (4 of 4) on Variable Features, 7.302 mins elapsed.
###########
2023-09-20 19:17:06 : Creating Partial Matrix, 7.302 mins elapsed.
2023-09-20 19:18:14 : Computing LSI, 8.43 mins elapsed.
2023-09-20 19:18:52 : Finished Running IterativeLSI, 9.057 mins elapsed.
[9]:
proj <- addClusters(input = proj, reducedDims = "IterativeLSI",
method = "Seurat",
name = "Clusters",
resolution = 0.8,
force = TRUE
)
ArchR logging to : ArchRLogs/ArchR-addClusters-1cdb032e879c77-Date-2023-09-20_Time-19-18-52.log
If there is an issue, please report to github with logFile!
2023-09-20 19:18:52 : Running Seurats FindClusters (Stuart et al. Cell 2019), 0.002 mins elapsed.
Computing nearest neighbor graph
Computing SNN
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 13387
Number of edges: 511225
Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8593
Number of communities: 18
Elapsed time: 1 seconds
2023-09-20 19:19:11 : Testing Biased Clusters, 0.316 mins elapsed.
2023-09-20 19:19:11 : Testing Outlier Clusters, 0.316 mins elapsed.
2023-09-20 19:19:11 : Assigning Cluster Names to 18 Clusters, 0.316 mins elapsed.
2023-09-20 19:19:11 : Finished addClusters, 0.318 mins elapsed.
[10]:
proj <- addUMAP(
ArchRProj = proj,
reducedDims = "IterativeLSI",
name = "UMAP",
nNeighbors = 30,
minDist = 0.5,
metric = "cosine",
force = TRUE
)
19:19:11 UMAP embedding parameters a = 0.583 b = 1.334
19:19:11 Read 13387 rows and found 30 numeric columns
19:19:11 Using Annoy for neighbor search, n_neighbors = 30
19:19:11 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
|
19:19:13 Writing NN index file to temp file /tmp/Rtmp4Pb07o/file1cdb034c71a7b8
19:19:13 Searching Annoy index using 48 threads, search_k = 3000
19:19:13 Annoy recall = 100%
19:19:14 Commencing smooth kNN distance calibration using 48 threads
19:19:15 Initializing from normalized Laplacian + noise
19:19:16 Commencing optimization for 200 epochs, with 637890 positive edges
19:19:23 Optimization finished
19:19:23 Creating temp model dir /tmp/Rtmp4Pb07o/dir1cdb033622ed54
19:19:23 Creating dir /tmp/Rtmp4Pb07o/dir1cdb033622ed54
19:19:24 Changing to /tmp/Rtmp4Pb07o/dir1cdb033622ed54
19:19:24 Creating /nfs/public/lichen/data/single_cell/reprogramming_case/fib_ipsc/ATAC/E-MTAB-11616/process/peak_calling/fib_ipsc_day00_day14/fib_ipsc_day00_day14/Embeddings/Save-Uwot-UMAP-Params-IterativeLSI-1cdb0328d210f-Date-2023-09-20_Time-19-19-23.tar
[11]:
# - check the umap of the clustering result.
# This step is important to check the batch of the data. If the data is from different batches,
# we recommend users to subset specific donor or tissue to generate the cell-by-peak matrix
p <- plotEmbedding(ArchRProj = proj, colorBy = "cellColData", name = "Clusters", embedding = "UMAP")
p
ArchR logging to : ArchRLogs/ArchR-plotEmbedding-1cdb031ad709f3-Date-2023-09-20_Time-19-19-24.log
If there is an issue, please report to github with logFile!
Getting UMAP Embedding
ColorBy = cellColData
Plotting Embedding
1
ArchR logging successful to : ArchRLogs/ArchR-plotEmbedding-1cdb031ad709f3-Date-2023-09-20_Time-19-19-24.log
Generate cell-by-peak matrix¶
[12]:
proj2 <- addGroupCoverages(ArchRProj = proj, groupBy = "Clusters")
ArchR logging to : ArchRLogs/ArchR-addGroupCoverages-1cdb032bb9681d-Date-2023-09-20_Time-19-19-30.log
If there is an issue, please report to github with logFile!
C1 (1 of 18) : CellGroups N = 2
C2 (2 of 18) : CellGroups N = 2
C3 (3 of 18) : CellGroups N = 2
C4 (4 of 18) : CellGroups N = 2
C5 (5 of 18) : CellGroups N = 2
C6 (6 of 18) : CellGroups N = 2
C7 (7 of 18) : CellGroups N = 2
C8 (8 of 18) : CellGroups N = 2
C9 (9 of 18) : CellGroups N = 2
C10 (10 of 18) : CellGroups N = 2
C11 (11 of 18) : CellGroups N = 2
C12 (12 of 18) : CellGroups N = 2
C13 (13 of 18) : CellGroups N = 2
C14 (14 of 18) : CellGroups N = 2
C15 (15 of 18) : CellGroups N = 2
C16 (16 of 18) : CellGroups N = 2
C17 (17 of 18) : CellGroups N = 2
C18 (18 of 18) : CellGroups N = 2
2023-09-20 19:19:31 : Creating Coverage Files!, 0.031 mins elapsed.
2023-09-20 19:19:31 : Batch Execution w/ safelapply!, 0.031 mins elapsed.
2023-09-20 19:24:39 : Adding Kmer Bias to Coverage Files!, 5.156 mins elapsed.
Completed Kmer Bias Calculation
Adding Kmer Bias (1 of 36)
Adding Kmer Bias (2 of 36)
Adding Kmer Bias (3 of 36)
Adding Kmer Bias (4 of 36)
Adding Kmer Bias (5 of 36)
Adding Kmer Bias (6 of 36)
Adding Kmer Bias (7 of 36)
Adding Kmer Bias (8 of 36)
Adding Kmer Bias (9 of 36)
Adding Kmer Bias (10 of 36)
Adding Kmer Bias (11 of 36)
Adding Kmer Bias (12 of 36)
Adding Kmer Bias (13 of 36)
Adding Kmer Bias (14 of 36)
Adding Kmer Bias (15 of 36)
Adding Kmer Bias (16 of 36)
Adding Kmer Bias (17 of 36)
Adding Kmer Bias (18 of 36)
Adding Kmer Bias (19 of 36)
Adding Kmer Bias (20 of 36)
Adding Kmer Bias (21 of 36)
Adding Kmer Bias (22 of 36)
Adding Kmer Bias (23 of 36)
Adding Kmer Bias (24 of 36)
Adding Kmer Bias (25 of 36)
Adding Kmer Bias (26 of 36)
Adding Kmer Bias (27 of 36)
Adding Kmer Bias (28 of 36)
Adding Kmer Bias (29 of 36)
Adding Kmer Bias (30 of 36)
Adding Kmer Bias (31 of 36)
Adding Kmer Bias (32 of 36)
Adding Kmer Bias (33 of 36)
Adding Kmer Bias (34 of 36)
Adding Kmer Bias (35 of 36)
Adding Kmer Bias (36 of 36)
2023-09-20 19:26:16 : Finished Creation of Coverage Files!, 6.782 mins elapsed.
ArchR logging successful to : ArchRLogs/ArchR-addGroupCoverages-1cdb032bb9681d-Date-2023-09-20_Time-19-19-30.log
[13]:
table(proj2$Clusters)
C1 C10 C11 C12 C13 C14 C15 C16 C17 C18 C2 C3 C4 C5 C6 C7
1081 2123 305 1052 1055 383 155 563 284 96 334 847 1196 356 157 556
C8 C9
691 2153
[14]:
proj2 <- addReproduciblePeakSet(
ArchRProj = proj2,
groupBy = "Clusters",
pathToMacs2 = pathToMacs2
)
ArchR logging to : ArchRLogs/ArchR-addReproduciblePeakSet-1cdb0359e50823-Date-2023-09-20_Time-19-26-18.log
If there is an issue, please report to github with logFile!
Calling Peaks with Macs2
2023-09-20 19:26:18 : Peak Calling Parameters!, 0.003 mins elapsed.
Group nCells nCellsUsed nReplicates nMin nMax maxPeaks
C1 C1 1081 540 2 40 500 150000
C2 C2 334 334 2 40 294 150000
C3 C3 847 540 2 40 500 150000
C4 C4 1196 540 2 40 500 150000
C5 C5 356 356 2 40 316 150000
C6 C6 157 157 2 40 117 78500
C7 C7 556 540 2 40 500 150000
C8 C8 691 540 2 40 500 150000
C9 C9 2153 540 2 40 500 150000
C10 C10 2123 561 2 61 500 150000
C11 C11 305 305 2 40 265 150000
C12 C12 1052 540 2 40 500 150000
C13 C13 1055 984 2 484 500 150000
C14 C14 383 383 2 40 343 150000
C15 C15 155 155 2 40 115 77500
C16 C16 563 540 2 40 500 150000
C17 C17 284 284 2 40 244 142000
C18 C18 96 96 2 40 56 48000
2023-09-20 19:26:18 : Batching Peak Calls!, 0.003 mins elapsed.
2023-09-20 19:26:18 : Batch Execution w/ safelapply!, 0 mins elapsed.
2023-09-20 19:32:23 : Identifying Reproducible Peaks!, 6.082 mins elapsed.
2023-09-20 19:32:35 : Creating Union Peak Set!, 6.282 mins elapsed.
Converged after 7 iterations!
Plotting Ggplot!
2023-09-20 19:32:43 : Finished Creating Union Peak Set (246132)!, 6.414 mins elapsed.
[15]:
proj3 <- addPeakMatrix(proj2)
ArchR logging to : ArchRLogs/ArchR-addPeakMatrix-1cdb0372812190-Date-2023-09-20_Time-19-32-43.log
If there is an issue, please report to github with logFile!
2023-09-20 19:32:43 : Batch Execution w/ safelapply!, 0 mins elapsed.
ArchR logging successful to : ArchRLogs/ArchR-addPeakMatrix-1cdb0372812190-Date-2023-09-20_Time-19-32-43.log
[16]:
PeakMatrix=getMatrixFromProject(
ArchRProj = proj3,
useMatrix = "PeakMatrix",
useSeqnames = NULL,
verbose = TRUE,
binarize = FALSE,
threads = getArchRThreads(),
logFile = createLogFile("getMatrixFromProject")
)
ArchR logging to : ArchRLogs/ArchR-getMatrixFromProject-1cdb036f0ac7cc-Date-2023-09-20_Time-19-35-39.log
If there is an issue, please report to github with logFile!
2023-09-20 19:36:46 : Organizing colData, 1.119 mins elapsed.
2023-09-20 19:36:46 : Organizing rowData, 1.119 mins elapsed.
2023-09-20 19:36:46 : Organizing rowRanges, 1.12 mins elapsed.
2023-09-20 19:36:46 : Organizing Assays (1 of 1), 1.12 mins elapsed.
2023-09-20 19:36:49 : Constructing SummarizedExperiment, 1.169 mins elapsed.
2023-09-20 19:36:50 : Finished Matrix Creation, 1.186 mins elapsed.
[17]:
pm_peak=getPeakSet(proj3)
pm_barcode=colData(PeakMatrix)
pm_matrix=assay(PeakMatrix)
[18]:
dim(pm_matrix)
- 246132
- 13387
save the results¶
[20]:
# - save the cell-by-peak matrix generated by ArchR
write.csv(pm_peak,paste0(dir_,'/',"pm_peak.csv"),row.names=TRUE)
write.csv(pm_barcode,paste0(dir_,'/',"pm_barcode.csv"),row.names=TRUE)
writeMM(as(pm_matrix, "dgTMatrix"), paste0(dir_,'/',"pm_matrix.mtx"))
NULL
[ ]: