新闻动态
4个NC杂志的空间转录组数据分析(GSE190811)
我们生信马拉松授课群里有个学员问空间转录组数据如何读取,给出的数据编号为 GSE190811,由于不是10x spaceranger 的标准格式,对于初学者来说有一些难度,下面一起来看看这个数据集!
数据背景GSE190811 这个数据来自2022年11月10号发表在NC杂志上,标题为《Single cell profiling of primary and paired metastatic lymph node tumors in breast cancer patients》,文章仅用这个数据绘制了一幅图:上皮细胞的特征基因在空间切片上的打分分布。
图片
这个数据的GEO链接为:https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE190811,总共是4个空间转录组的样本:
GSM5732357 PT_3 LNM STGSM5732358 PT_6 LNM STGSM5732359 PT_7 LNM STGSM5732360 PT_8 LNM ST
提供的附件 GSE190811_RAW.tar 包含的文件如下:
GSM5732357_A_1938345_11.jpg.gz 52.6 MbGSM5732357_A_raw_feature_bc_matrix.h5 29.6 MbGSM5732357_A_scalefactors_json.json.gz 161 bGSM5732357_A_tissue_hires_image.png.gz 14.5 MbGSM5732357_A_tissue_positions_list.csv.gz 66.5 KbGSM5732358_B_1938529_9.jpg.gz 68.6 MbGSM5732358_B_raw_feature_bc_matrix.h5 34.7 MbGSM5732358_B_scalefactors_json.json.gz 164 bGSM5732358_B_tissue_hires_image.png.gz 16.9 MbGSM5732358_B_tissue_positions_list.csv.gz 64.4 KbGSM5732359_C_2000752_23.jpg.gz 75.1 MbGSM5732359_C_raw_feature_bc_matrix.h5 40.5 MbGSM5732359_C_scalefactors_json.json.gz 164 bGSM5732359_C_tissue_hires_image.png.gz 17.3 MbGSM5732359_C_tissue_positions_list.csv.gz 62.6 KbGSM5732360_D_2000910_33.jpg.gz 65.7 MbGSM5732360_D_raw_feature_bc_matrix.h5 38.5 MbGSM5732360_D_scalefactors_json.json.gz 160 bGSM5732360_D_tissue_hires_image.png.gz 16.1 MbGSM5732360_D_tissue_positions_list.csv.gz 59.7 Kb目录中的文件说明raw_feature_bc_matrix.h5:基因表达矩阵的h5格式tissue_positions.csv:记录了对应的空间切片里面的每个barcode所在的切片的空间坐标,理论上有了这个csv文件其实就可以相当于有了空间切片的图片的细胞定位作用tissue_hires_image.png:分辨率很高,所以图片文件就很大tissue_lowres_image.png:文件很小,因为图片分辨率低,官网说它 is always 600 pixels. 【这个数据没有提供这个文件】scalefactors_json.json:它关联了tissue_hires_image.png and tissue_lowres_image.png,文件内容如下:
{"tissue_hires_scalef": 0.06628223, "tissue_lowres_scalef": 0.019884668, "fiducial_diameter_fullres": 383.95734, "spot_diameter_fullres": 237.68787000000003}下载下来后整理成如下格式:
├── GSM5732357_A│ ├── 1938345_11.jpg│ ├── raw_feature_bc_matrix.h5│ ├── scalefactors_json.json│ ├── tissue_hires_image.png│ └── tissue_positions_list.csv├── GSM5732358_B│ ├── 1938529_9.jpg│ ├── raw_feature_bc_matrix.h5│ ├── scalefactors_json.json│ ├── tissue_hires_image.png│ └── tissue_positions_list.csv├── GSM5732359_C│ ├── 2000752_23.jpg│ ├── raw_feature_bc_matrix.h5│ ├── scalefactors_json.json│ ├── tissue_hires_image.png│ └── tissue_positions_list.csv├── GSM5732360_D│ ├── 2000910_33.jpg│ ├── raw_feature_bc_matrix.h5│ ├── scalefactors_json.json│ ├── tissue_hires_image.png│ └── tissue_positions_list.csv使用Seurat读取
这里用到的文件为 raw_feature_bc_matrix.h5,以及 图片tissue_hires_image.png:
###### Create: Jianming Zeng### Date: 2023-12-31 ### Email: jmzeng1314@163.com### Blog: http://www.bio-info-trainee.com/### Forum: http://www.biotrainee.com/thread-1376-1-1.html### CAFS/SUSTC/Eli Lilly/University of Macau### Update Log: 2023-12-31 First version ### Update Log: 2024-12-09 by juan zhang (492482942@qq.com)### rm(list=ls())options(stringsAsFactors = F)library(ggsci)library(dplyr) library(future)library(Seurat)library(clustree)library(cowplot)library(data.table)library(ggplot2)library(patchwork)library(stringr)library(qs)library(Matrix)getwd()# 创建目录getwd()gse <- "GSE190811"dir.create(gse)################## 读取数据,有h5与tissue_hires_image.pngsamples <- list.dirs("GSE190811/", recursive = F, full.names = F)samplesscRNAlist <- lapply(samples, function(pro){# pro <- samples[1]print(pro)# 先读取 h5 data <- Read10X_h5(filename = paste0("GSE190811/",pro,"/raw_feature_bc_matrix.h5")) dim(data) data[1:5,1:5] object <- CreateSeuratObject(counts = data, assay = "Spatial", min.cells = 3, project = pro) object# 再读取 image <- Read10X_Image(image.dir = paste0("GSE190811/",pro), image.name = "tissue_hires_image.png", filter.matrix = TRUE) image dim(image) image <- image[Cells(x = object)] DefaultAssay(object = image) <- "Spatial"# 添加图片到object中 object[[pro]] <- image object@images[[1]]@scale.factors$lowres <- object@images[[1]]@scale.factors$hiresreturn(object)})names(scRNAlist) <- samplesscRNAlist# mergesce.all <- merge(scRNAlist[[1]], y=scRNAlist[-1], add.cell.ids=samples)sce.all <- JoinLayers(sce.all) # seurat v5sce.all## 简单探索一下数据结构as.data.frame(sce.all[["Spatial"]]$counts[1:4,1:4])as.data.frame(LayerData(sce.all, assay = "Spatial", layer = "counts")[1:5,1:5])head(sce.all@meta.data)table(sce.all$orig.ident)Layers(sce.all)Assays(sce.all)library(qs)qsave(sce.all, file="GSE190811/sce.all.qs")简单查看一下数据的分布每个样本中 nCount_Spatial 和nFeature_Spatial的小提琴分布:
# Data preprocessingplot1 <- VlnPlot(sce.all, features = c("nCount_Spatial","nFeature_Spatial"), pt.size = 0.1) + NoLegend()plot1ggsave(filename = "1-QC/VlnPlot.pdf", width = 12,height = 6, plot = plot1)图片
空间分布特征模式:nCount_Spatial
plot2 <- SpatialFeaturePlot(sce.all, features = "nCount_Spatial",pt.size.factor = 2) ggsave(filename = "1-QC/FeaturePlot_nCount.pdf", width = 12,height = 6, plot = plot2)
图片
nFeature_Spatial 分布:
plot3 <- SpatialFeaturePlot(sce.all, features = "nFeature_Spatial",pt.size.factor = 2) ggsave(filename = "1-QC/FeaturePlot_nFeature.pdf", width = 12,height = 6, plot = plot3)
图片
然后就是后面的空转标准分析降维聚类分群:
## 计算线粒体/核糖体/特定基因集的比例mito_genes <- grep("^MT-", rownames(sce.all),ignore.case = T, value = T) # 可能是13个线粒体基因print(mito_genes)sce.all <- PercentageFeatureSet(sce.all, features = mito_genes, col.name = "percent_mito")fivenum(sce.all@meta.data$percent_mito)# 另外一种方法# sce.all$percent_mito <- (Matrix::colSums(sce.all[["Spatial"]]$counts[mt.genes, ])/Matrix::colSums(sce.all[["Spatial"]]$counts))*100# fivenum(sce.all$percent_mito)## 过滤spotsce.all <- subset(sce.all, subset = nFeature_Spatial > 200 )## 标准化options(future.globals.maxSize= 5*1024*1024^2)sce.all <- SCTransform(sce.all, assay = "Spatial", verbose = T)sce.all <- RunPCA(sce.all, assay = "SCT", verbose = FALSE)sce.all <- FindNeighbors(sce.all, reduction = "pca", dims = 1:30)sce.all <- FindClusters(sce.all, verbose = FALSE,resolution = 0.8)sce.all <- RunUMAP(sce.all, reduction = "pca", dims = 1:30)p1 <- DimPlot(sce.all, reduction = "umap", label = TRUE,label.size = 7)p2 <- SpatialDimPlot(sce.all, label = TRUE, label.size = 3,pt.size.factor = 2,image.alpha = 0.6)p <- p1 + p2pggsave(filename = "1-QC/DimPlot.pdf", width = 12,height = 6, plot = p)降维聚类结果如下:
图片
图片
上面的代码运行过程中踩了两个坑,对应的报错以及github上面给出的解决方案如下:
踩坑1:矩阵在log后含NA值,https://github.com/satijalab/sctransform/issues/86,我这里 设置了 CreateSeuratObject(counts = data, assay = "Spatial", min.cells = 3, project = pro)的时候min.cells过滤掉在所有细胞中表达都为0的基因。
Running SCTransform on assay: SpatialRunning SCTransform on layer: countsvst.flavor='v2' set. Using model with fixed slope and excluding poisson genes.Error in make_cell_attr(umi, cell_attr, latent_var, batch_var, latent_var_nonreg, : cell attribute "log_umi" contains NA, NaN, or infinite value
踩坑2:运行内存不足 https://github.com/satijalab/seurat/discussions/9191
Error in getGlobalsAndPackages(expr, envir = envir, globals = globals) : The total size of the 19 globals exported for future expression ('FUN()’) is 555.16 MiB.. This exceeds the maximum allowed size of 500.00 MiB (option 'future.globals.maxSize'). The three largest globals are 'FUN’ (538.70 MiB of class 'function’), 'umi_bin’ (16.12 MiB of class 'numeric’) and 'data_step1’ (270.82 KiB of class 'list’)到这里简单的基础分析就做完了~
本站仅提供存储服务,所有内容均由用户发布,如发现有害或侵权内容,请点击举报。上一篇:评测:埃德·布朗公司Kobra Carry 25周年纪念版手枪 1911便携型典范(附视频)
下一篇:没有了
