Source Format Details
HDF5 Input
For HDF5 input (no conversion necessary), you can do your analysis with scanpy to create an anndata object ad. SCelVis will use embedding coordinates from ad.obsm, cell annotation from ad.obs and expression data directly from ad.raw.X (if present) or ad.X (this should contain normalized and log-transformed expression values for all genes and should be sparse, otherwise performance will suffer).
If present, information about the dataset will be extracted from strings stored in ad.uns['about_title'], ad.uns['about_short_title'] and ad.uns['about_readme'] (assumed to be Markdown).
Information about marker genes will be taken from entries starting with marker_ in ad.uns: entries called marker_gene (required!), marker_cluster, marker_padj, marker_LFC will create a table with the columns gene, cluster, padj, and LFC. SCelVis will also extract marker information from ad.uns['rank_genes_groups']. However, certain datatypes in ad.uns together with version mismatches of scanpy, h5py and anndata can lead to .h5ad files that are not readable by SCelVis (see issue #832). To be on the safe side, it’s recommended to delete unnecessary slots in ad.uns (e.g., del ad.uns['rank_genes_groups']). Also, ad.obsm['X_pca'], ad.varm['PCs'] and entries in ad.obsp are likely dispensable.
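The naming convention for ad.uns can be illustrated with a plain dictionary standing in for ad.uns. This is a sketch of the convention only, not SCelVis's own code: the helper names marker_table and prune_uns are hypothetical, all values are made up, and keeping only about_ and marker_ entries is an assumption about which slots you still need.

```python
def marker_table(uns):
    """Collect marker_* entries into columns named after the part following the prefix."""
    if "marker_gene" not in uns:
        raise KeyError("marker_gene is required")
    prefix = "marker_"
    return {
        key[len(prefix):]: list(values)
        for key, values in uns.items()
        if key.startswith(prefix)
    }


def prune_uns(uns, keep_prefixes=("about_", "marker_")):
    """Delete, in place, every entry whose key starts with none of keep_prefixes."""
    for key in list(uns):
        if not key.startswith(keep_prefixes):
            del uns[key]
    return uns


# A plain dict standing in for ad.uns; in practice the marker values would be arrays.
uns = {
    "about_title": "An Example Data Set",
    "about_short_title": "example",
    "about_readme": "A short description in *Markdown*.",
    "marker_gene": ["gene_1", "gene_2", "gene_3"],
    "marker_cluster": ["cluster_1", "cluster_1", "cluster_2"],
    "marker_padj": [1.5e-6, 4e-5, 5.3e-9],
    "marker_LFC": [3.4, 1.3, 2.1],
    "rank_genes_groups": {"params": {}},  # the kind of slot one might delete
}

prune_uns(uns)
table = marker_table(uns)
print(list(table))  # ['gene', 'cluster', 'padj', 'LFC']
```

Since ad.uns behaves like a dictionary, the same del-based pruning can be applied to it directly before writing the .h5ad file.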
If you prepared your data with Seurat (v2), you can use Convert(from = sobj, to = "anndata", filename = "data.h5ad") to get an HDF5 file. Alternatively, you can use sceasy to convert your objects into anndata HDF5 format.
Text Input
For “raw” text input, you need to prepare files with expression values, cell metadata and coordinates, and optionally information about the dataset as well as cluster markers.
- normalized expression values for each gene (rows) in each cell (columns) can be given either as a tab-separated file (dense) or in matrix-market format:
  - if your file is called expression.tsv.gz, SCelVis expects a tab-separated file, e.g., like this:

        .       cell_1  cell_2  cell_3  ...
        gene_1  0.13    0.0     1.5     ...
        gene_2  0.0     3.1     0.3     ...
        gene_3  0.0     0.0     0.0     ...

  - if your file is called expression.mtx, SCelVis expects this to be a sparse matrix-market file and additional files called barcodes.tsv (containing a list of cell names / barcodes, one per line, no header or row names) and genes.tsv (containing a list of gene names, one per line, no header or row names) to be present.
- annotations for each cell can be provided in a tab-separated file called annotation.tsv, e.g., like this:

      .       cluster    genotype  ...
      cell_1  cluster_1  WT        ...
      cell_2  cluster_2  KO        ...
- embedding coordinates for each cell can be provided in a tab-separated file called coords.tsv, e.g., like this:

      .       tSNE_1  tSNE_2  UMAP_1  UMAP_2  ...
      cell_1  20.53   -10.05  3.9     2.4     ...
      cell_2  -5.34   13.94   -1.3    3.4     ...
- an optional tab-separated file called markers.tsv can contain information about marker genes. It needs to have a column named gene, e.g., like this:

      gene    cluster    log2FC  adj_pval  ...
      gene_1  cluster_1  3.4     1.5e-6    ...
      gene_2  cluster_1  1.3     0.00004   ...
      gene_3  cluster_2  2.1     5.3e-9    ...
- finally, a markdown file (e.g., text_input.md) can provide information about this dataset:

      ----
      title: An Optional Long Data Set Title
      short_title: optional short title
      ----

      A verbose description of the data in Markdown format.
Conversion to .h5ad is then performed like so:
$ scelvis convert --input-dir text_input --output data/text_input.h5ad --about-md text_input.md
In examples/dummy_raw.zip and examples/dummy_about.md we provide raw data for a simulated dummy dataset.
If you prepared your data with Seurat, you can export to raw text like this:
library(Matrix)  # provides writeMM
library(Seurat)  # provides Cells
# normalized expression as a sparse matrix-market file, plus cell and gene names
writeMM(sobj@assays$RNA@data, file = 'expression.mtx')
write.table(Cells(sobj), file = 'barcodes.tsv', col.names = FALSE, row.names = FALSE, sep = ',')
write.table(row.names(sobj@assays$RNA@data), file = 'genes.tsv', col.names = FALSE, row.names = FALSE, sep = ',')
# cell annotation, with a cluster column derived from seurat_clusters
sobj@meta.data$cluster <- paste0('cluster_', sobj@meta.data$seurat_clusters)
write.table(sobj@meta.data, file = 'annotation.tsv', sep = '\t')
# embedding coordinates (here: UMAP)
write.table(sobj@reductions$umap@cell.embeddings, file = 'coords.tsv', sep = '\t')
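The dense text layout described in this section can also be written from Python using only the standard library. The sketch below mirrors the example tables in this section, including the "." in the top-left header cell. The helper names write_tsv and write_text_input and all values are made up for illustration, and whether SCelVis requires that exact placeholder is an assumption taken from the examples.

```python
import csv
import gzip
from pathlib import Path


def write_tsv(path, header, rows, compress=False):
    """Write a header and rows as a tab-separated table, optionally gzip-compressed."""
    opener = gzip.open if compress else open
    with opener(path, "wt", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)


def write_text_input(out_dir):
    """Write a tiny, made-up text input directory and return the sorted file names."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    cells = ["cell_1", "cell_2", "cell_3"]

    # Genes in rows, cells in columns.
    write_tsv(out_dir / "expression.tsv.gz", [".", *cells],
              [["gene_1", 0.13, 0.0, 1.5],
               ["gene_2", 0.0, 3.1, 0.3],
               ["gene_3", 0.0, 0.0, 0.0]], compress=True)
    # One row per cell for annotation and embedding coordinates.
    write_tsv(out_dir / "annotation.tsv", [".", "cluster", "genotype"],
              [["cell_1", "cluster_1", "WT"],
               ["cell_2", "cluster_2", "KO"],
               ["cell_3", "cluster_1", "WT"]])
    write_tsv(out_dir / "coords.tsv", [".", "tSNE_1", "tSNE_2"],
              [["cell_1", 20.53, -10.05],
               ["cell_2", -5.34, 13.94],
               ["cell_3", 1.2, 0.4]])
    # Optional marker table; the gene column is the required one.
    write_tsv(out_dir / "markers.tsv", ["gene", "cluster", "log2FC", "adj_pval"],
              [["gene_1", "cluster_1", 3.4, 1.5e-6],
               ["gene_2", "cluster_2", 1.3, 0.00004]])
    return sorted(p.name for p in out_dir.iterdir())


if __name__ == "__main__":
    print(write_text_input("text_input"))
```

The resulting directory can then be passed to scelvis convert via --input-dir as in the command above.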
Loom Input
For loompy or loomR input, you can convert your data like this:
$ scelvis convert --input-dir input.loom -m markers.tsv -a about.md -o loom_input.h5ad
If you prepared your data with Seurat (v3), you can use as.loom(sobj, filename = "output.loom") to get a .loom file and then convert to .h5ad with the above command (this is quite slow, however, and exact format specifications for .loom and .h5ad are not always compatible between versions).
CellRanger Input
Alternatively, the output directory of CellRanger can be used. This is the directory called outs containing either a file called filtered_gene_bc_matrices_h5.h5 (version 2) or a file called filtered_feature_bc_matrix.h5 (version 3), and a folder analysis with clustering, embedding and differential expression results. The conversion does not perform any further processing except log-normalization. Additionally, a markdown file provides meta information about the dataset (see above).
$ mkdir -p data
$ cat <<EOF > data/cellranger.md
----
title: My Project
short_title: my_project
----
This is my project data.
EOF
$ scelvis convert --input-dir cellranger-out --output data/cellranger_input.h5ad --about-md data/cellranger.md
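Before running the conversion, it can help to check that the directory has the layout described above. The following is a small pre-flight sketch and not part of SCelVis: the function name check_cellranger_outs is hypothetical, and it only tests for the two matrix file names and the analysis folder named in this section.

```python
from pathlib import Path

# Matrix file names named in this section, mapped to the CellRanger version they indicate.
MATRIX_FILES = {
    "filtered_gene_bc_matrices_h5.h5": 2,
    "filtered_feature_bc_matrix.h5": 3,
}


def check_cellranger_outs(outs_dir):
    """Return the CellRanger version implied by the matrix file found in outs_dir."""
    outs_dir = Path(outs_dir)
    if not (outs_dir / "analysis").is_dir():
        raise FileNotFoundError(f"no analysis folder in {outs_dir}")
    for name, version in MATRIX_FILES.items():
        if (outs_dir / name).is_file():
            return version
    raise FileNotFoundError(f"no filtered matrix .h5 file in {outs_dir}")


if __name__ == "__main__":
    # "cellranger-out" is the input directory used in the command above.
    print(check_cellranger_outs("cellranger-out"))
```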
In examples/hgmm_1k_raw we provide CellRanger output for the 1k 1:1 human mouse mix. Specifically, from the outs folder we selected
- filtered_feature_bc_matrix.h5
- tSNE and PCA projections from analysis/tsne and analysis/pca
- clustering from analysis/clustering/graphclust and
- markers from analysis/diffexp/graphclust

examples/hgmm_1k_about.md contains information about this dataset.