Quality Control & Filtering of Single Cells

Learn about Quality Control & Filtering of Single Cells as part of Computational Biology and Bioinformatics Research

Quality Control and Filtering of Single Cells in Single-Cell Sequencing Analysis

Single-cell RNA sequencing (scRNA-seq) generates vast amounts of data, but not all of it is reliable. Quality control (QC) and filtering remove low-quality cells and technical artifacts so that downstream analyses reflect true biological variation. This module walks through the essential QC metrics and filtering strategies.

Why is Quality Control Essential?

Low-quality cells can arise from various sources, including cell death, lysis during sample preparation, or technical issues during library construction. These cells often exhibit distinct molecular profiles that can skew results, leading to erroneous conclusions about cell populations, gene expression patterns, and biological processes. Proper QC helps to:

  • Improve data integrity: Remove noise and artifacts.
  • Enhance downstream analysis: Ensure robust clustering, differential expression, and trajectory inference.
  • Reduce computational burden: Work with a cleaner, more manageable dataset.
  • Increase biological interpretability: Focus on genuine biological signals.

Key Quality Control Metrics

Several metrics are commonly used to assess the quality of individual cells. Understanding these metrics is the first step in identifying problematic cells.

Number of detected genes per cell.

This metric counts how many distinct genes have at least one read or UMI (Unique Molecular Identifier) in a single cell. Cells with very few detected genes may be dead cells or empty droplets, while cells with an exceptionally high number may be doublets or multiplets.

The number of detected genes per cell is a primary indicator of cell capture and library complexity. Within a given dataset, healthy, viable cells of the same type tend to fall within a characteristic range of detected genes. Cells with extremely low values often represent empty droplets or cells in which reverse transcription and amplification failed. Conversely, an unusually high number of detected genes can be a sign of doublets (two cells captured in the same droplet) or multiplets (more than two cells). The acceptable range for this metric depends heavily on the experimental protocol, cell type, and sequencing depth.
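As a minimal sketch of how this metric is computed, the example below counts genes with at least one UMI for each cell in a tiny count matrix. The barcodes, gene names, counts, and the `min_count` parameter are all made up for illustration; real analyses would compute the same quantity on a sparse matrix inside a package such as Seurat or Scanpy.

```python
# Toy count matrix: one row of UMI counts per cell, one column per gene.
# All names and values are illustrative, not real data.
genes = ["ACTB", "GAPDH", "CD3E", "MS4A1", "LYZ"]
counts = {
    "cell_A": [12, 8, 3, 0, 0],    # typical cell
    "cell_B": [1, 0, 0, 0, 0],     # near-empty droplet
    "cell_C": [20, 15, 6, 9, 11],  # many genes, possibly a doublet
}


def genes_detected(row, min_count=1):
    """Return the number of genes with at least `min_count` counts."""
    return sum(1 for c in row if c >= min_count)


n_genes = {cell: genes_detected(row) for cell, row in counts.items()}
print(n_genes)  # {'cell_A': 3, 'cell_B': 1, 'cell_C': 5}
```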

Total UMI counts (or Reads) per cell.

This metric reflects how much RNA was captured and sequenced from a cell. As with gene counts, very low or very high values can indicate problems.

The total number of UMIs (or reads) per cell is a measure of the total molecular content captured and sequenced from that cell. It is often correlated with the number of detected genes. Cells with very low UMI counts might be poorly captured or have low RNA content. Cells with very high UMI counts could be indicative of doublets or multiplets, or simply cells with exceptionally high transcriptional activity. It's important to consider this metric in conjunction with the number of detected genes.
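The following sketch, again on made-up counts, computes total UMI counts alongside the number of detected genes so the two can be inspected together. The function name `qc_summary` and the data are assumptions for illustration only.

```python
# Toy UMI counts per cell (illustrative values only).
counts = {
    "cell_A": [12, 8, 3, 0, 0],
    "cell_B": [1, 0, 0, 0, 0],
    "cell_C": [20, 15, 6, 9, 11],
}


def qc_summary(row):
    """Return total UMI counts and number of detected genes for one cell."""
    return {
        "total_counts": sum(row),
        "n_genes": sum(1 for c in row if c > 0),
    }


summary = {cell: qc_summary(row) for cell, row in counts.items()}
for cell, s in summary.items():
    print(cell, s["total_counts"], s["n_genes"])
```

Here `cell_B` is low on both metrics and `cell_C` is high on both, which is the pattern the text describes for poorly captured cells and possible doublets respectively.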

Percentage of mitochondrial reads.

High mitochondrial gene expression often signifies compromised cell membranes, leading to leakage of cytoplasmic mRNA and relative enrichment of mitochondrial transcripts.

Mitochondrial genes are encoded in the mitochondrial genome and are transcribed independently of nuclear genes. In healthy, intact cells, cytoplasmic mRNA is abundant, and mitochondrial transcripts represent a small fraction of the total. However, if a cell's membrane is compromised (e.g., due to cell death or lysis), cytoplasmic mRNA can leak out, while mitochondrial transcripts, being enclosed within the mitochondria, remain relatively enriched. A high percentage of reads mapping to mitochondrial genes is therefore a strong indicator of poor cell viability or membrane integrity. Filtering thresholds are commonly set at around 10-20%, but the appropriate value varies with cell type and tissue.
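A minimal sketch of the calculation, assuming human gene symbols in which mitochondrial genes carry the `MT-` prefix (mouse symbols use `mt-`, which the case-insensitive match below also catches). The cells and counts are invented for illustration.

```python
# Toy data: two mitochondrial genes (MT- prefix) among five genes.
genes = ["ACTB", "GAPDH", "MT-CO1", "MT-ND1", "CD3E"]
counts = {
    "healthy": [40, 30, 5, 3, 22],
    "damaged": [4, 2, 18, 12, 4],
}


def percent_mito(row, gene_names, prefix="MT-"):
    """Percentage of a cell's counts that come from mitochondrial genes."""
    total = sum(row)
    if total == 0:
        return 0.0  # avoid division by zero for empty barcodes
    mito = sum(
        c for c, g in zip(row, gene_names) if g.upper().startswith(prefix)
    )
    return 100.0 * mito / total


pct_mito = {cell: percent_mito(row, genes) for cell, row in counts.items()}
print(pct_mito)  # {'healthy': 8.0, 'damaged': 75.0}
```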

Filtering Strategies

Once QC metrics are calculated, filtering strategies are applied to remove low-quality cells. This is typically done by setting thresholds for the metrics discussed above.

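| Metric | Values suggesting low quality | Values typical of good-quality cells | Typical filtering action |
| --- | --- | --- | --- |
| Number of genes / UMI counts | Very low (empty droplets, damaged cells) or very high (possible doublets) | Within the main distribution for the dataset | Remove cells below a minimum threshold and potentially above a maximum threshold (to remove doublets). |
| Mitochondrial percentage | High (>10-20%) | Low (<10-20%) | Remove cells with a high percentage of mitochondrial reads. |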
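Threshold-based filtering of this kind can be sketched as follows. The cutoffs `MIN_GENES`, `MAX_GENES`, and `MAX_PCT_MITO` are placeholder assumptions chosen only for the example (the mitochondrial cutoff sits inside the 10-20% range mentioned above); the per-cell values are invented.

```python
# Per-cell QC metrics (illustrative values only).
cells = [
    {"barcode": "AAAC", "n_genes": 1800, "total_counts": 6000, "pct_mito": 4.0},
    {"barcode": "AAAG", "n_genes": 150, "total_counts": 300, "pct_mito": 3.0},     # too few genes
    {"barcode": "AAAT", "n_genes": 2100, "total_counts": 7500, "pct_mito": 35.0},  # high mito
    {"barcode": "AACA", "n_genes": 7200, "total_counts": 60000, "pct_mito": 5.0},  # possible doublet
]

# Placeholder thresholds; real values should come from the data distribution.
MIN_GENES = 200
MAX_GENES = 6000
MAX_PCT_MITO = 15.0


def passes_qc(cell):
    """Keep cells inside the gene-count window and below the mito cutoff."""
    return (
        MIN_GENES <= cell["n_genes"] <= MAX_GENES
        and cell["pct_mito"] <= MAX_PCT_MITO
    )


kept = [c["barcode"] for c in cells if passes_qc(c)]
print(kept)  # ['AAAC']
```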

The specific thresholds are often determined empirically by visualizing the distribution of these metrics across all cells. Scatter plots of gene counts vs. UMI counts, or gene counts vs. mitochondrial percentage, are invaluable for this.

Visualizing QC metrics is crucial for setting appropriate filtering thresholds. A common approach is to plot the number of detected genes (y-axis) against the total UMI counts (x-axis) for each cell. Cells that fall below a certain diagonal line (indicating fewer genes than expected for their UMI count) or are very far from the main cluster are often considered low quality. Similarly, plotting the percentage of mitochondrial reads against the number of genes can reveal cells with high mitochondrial content that should be removed. These plots help identify outliers and define the boundaries for filtering.
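Alongside visual inspection, thresholds can also be derived from the distribution itself. The sketch below uses a median absolute deviation (MAD) rule on log-transformed counts; this is a swapped-in technique chosen for illustration, not one described in the text, and the `n_mads=3.0` cutoff and the counts are assumptions.

```python
import math
from statistics import median

# Total UMI counts for nine cells (illustrative values only).
total_counts = [5200, 6100, 4800, 5600, 5900, 6400, 5100, 120, 48000]


def mad_bounds(values, n_mads=3.0):
    """Return (lower, upper) bounds at n_mads MADs around the median."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return med - n_mads * mad, med + n_mads * mad


# Work on the log scale because count distributions are strongly right-skewed.
log_counts = [math.log10(c) for c in total_counts]
lower, upper = mad_bounds(log_counts)

flagged = [
    c for c, lc in zip(total_counts, log_counts) if lc < lower or lc > upper
]
print(flagged)  # [120, 48000]
```

Cells flagged this way are candidates for removal; the plots described above remain the check on whether the resulting bounds are sensible for the dataset.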

Doublet Detection and Removal

Doublets (two cells captured in the same droplet) and multiplets (more than two) are a common artifact. They can manifest as cells with higher UMI counts and more detected genes than typical single cells. Specialized computational tools are often used to identify and remove potential doublets.
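To illustrate why doublets stand out on these metrics, the sketch below builds an artificial doublet by summing the counts of two toy cells. Combining pairs of observed cells is, as far as we understand, also the starting point of several doublet-detection tools, but this example shows only that one step and is not any tool's actual implementation; the gene panel and counts are invented.

```python
# Toy expression profiles for two different cell types (illustrative only).
genes = ["CD3E", "CD3D", "MS4A1", "CD79A", "LYZ", "S100A8"]
t_cell = [9, 7, 0, 0, 0, 0]
b_cell = [0, 0, 8, 6, 0, 0]


def simulate_doublet(a, b):
    """Sum two cells' counts, as if both were captured in one droplet."""
    return [x + y for x, y in zip(a, b)]


def summarize(row):
    """Return (total UMI counts, number of detected genes)."""
    return sum(row), sum(1 for c in row if c > 0)


doublet = simulate_doublet(t_cell, b_cell)
print(summarize(t_cell))   # (16, 2)
print(summarize(b_cell))   # (14, 2)
print(summarize(doublet))  # (30, 4)
```

The simulated doublet has more UMIs and more detected genes than either parent, and it co-expresses markers that the single cells express separately.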

The choice of QC thresholds is not absolute and can depend on the specific biological question, experimental design, and cell type. It's often an iterative process.

Tools for QC and Filtering

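Several popular bioinformatics packages support single-cell QC and filtering, often as part of larger scRNA-seq analysis workflows. Examples covered in the learning resources below include Seurat, Scanpy, and scater for computing and filtering on QC metrics, DropletUtils for identifying empty droplets, and DoubletFinder for doublet detection.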

What are the three primary QC metrics discussed for single cells?

Number of detected genes, total UMI counts (or reads) per cell, and percentage of mitochondrial reads.

Why is a high percentage of mitochondrial reads a concern?

It often indicates compromised cell membrane integrity, leading to leakage of cytoplasmic mRNA and relative enrichment of mitochondrial transcripts.

Learning Resources

Seurat - Quality Control (documentation)

The official Seurat vignette on quality control, providing practical examples and explanations for performing QC within the Seurat framework.

Scanpy - Quality Control (documentation)

Scanpy's documentation on quality control, detailing common metrics and filtering strategies used in single-cell RNA sequencing analysis.

Single-Cell RNA Sequencing: A Practical Guide (paper)

A comprehensive review article covering various aspects of scRNA-seq, including essential steps for data processing and quality control.

DoubletFinder: A Tool for Detecting and Removing Doublets (documentation)

The GitHub repository for DoubletFinder, a popular tool for identifying and removing doublet cells in scRNA-seq data.

DropletUtils: An R Package for Identifying Cells in Droplet-Based Single-Cell Assays (documentation)

Vignette for the DropletUtils package, which includes methods for identifying empty droplets and cells based on UMI counts.

Single-cell RNA sequencing data analysis: a practical approach (paper)

A practical guide to analyzing single-cell RNA sequencing data, with a section dedicated to quality control and filtering.

Introduction to Single-Cell RNA Sequencing Analysis (video)

A YouTube video providing an overview of single-cell RNA sequencing analysis, including the importance of QC.

Cell Ranger Documentation - Quality Control (documentation)

10x Genomics' official documentation on quality control metrics and interpretation for their Cell Ranger software.

The Landscape of Single-Cell RNA Sequencing Technologies (paper)

A review discussing different scRNA-seq technologies and their implications for data quality and analysis, including QC considerations.

scater: Preprocessing and Quality Control for Single-Cell RNA-Sequencing Data (documentation)

Vignette for the scater package, a widely used R package for preprocessing and quality control of scRNA-seq data.