Quality Scores and Their Interpretation

Learn about Quality Scores and Their Interpretation as part of Genomics and Next-Generation Sequencing Analysis

Understanding Quality Scores in Genomics and NGS Data

Next-Generation Sequencing (NGS) technologies generate vast amounts of data, but not all of it is equally reliable. Quality scores are a crucial metric used to assess the accuracy of individual base calls made by sequencing machines. Understanding these scores is fundamental for interpreting NGS data and ensuring the validity of downstream genomic analyses.

What are Quality Scores?

Quality scores, often referred to as Phred scores, are a logarithmic encoding of the estimated probability that a base call made by the sequencer is incorrect. A higher Phred score indicates a lower probability of error, and thus higher confidence in the base call: each 10-point increase corresponds to a tenfold drop in error probability. A score is assigned to each base in a sequencing read.
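
The standard definition behind these scores is Q = -10 × log10(P), where P is the estimated probability that the base call is wrong. A minimal Python sketch of the conversion in both directions follows; the function names are illustrative and not taken from any particular library.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in the interval (0, 1]")
    return -10 * math.log10(p)

# Print the error probability for a few commonly quoted scores.
for q in (10, 20, 30, 40):
    print(f"Q{q}: P = {phred_to_error_prob(q):.4f}")
```

Because the scale is logarithmic, small differences in Q at the low end (for example Q10 versus Q20) correspond to much larger absolute differences in error probability than the same gap at the high end.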

Interpreting Quality Scores

Interpreting quality scores involves looking at the distribution of scores across reads and over the length of the reads. This helps identify potential issues with the sequencing run or the library preparation.
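
Any per-read summary starts from the quality string stored alongside each read. The sketch below assumes the Phred+33 ASCII encoding used in Sanger-style FASTQ files and by current Illumina software; some older files use an offset of 64, so the offset is exposed as a parameter. The example quality string and the function names are made up for illustration.

```python
def decode_quality(qual, offset=33):
    """Turn a FASTQ quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in qual]

def mean_quality(qual, offset=33):
    """Arithmetic mean of the Phred scores in one read."""
    scores = decode_quality(qual, offset)
    if not scores:
        raise ValueError("quality string is empty")
    return sum(scores) / len(scores)

# Toy quality string: 'I' = Q40, '?' = Q30, '5' = Q20, '+' = Q10 under Phred+33.
example = "IIII??55+"
print(decode_quality(example))
print(mean_quality(example))
```

Averaging Phred scores is a convenient summary, but it is not the same as averaging error probabilities, so a read's mean Q tends to look better than its true average error rate.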

Phred Score (Q) | Error Probability (P) | Interpretation
10 | 0.1 (10%) | Low confidence; likely error
20 | 0.01 (1%) | Moderate confidence; acceptable for some analyses
30 | 0.001 (0.1%) | High confidence; generally considered good quality
40 | 0.0001 (0.01%) | Very high confidence; excellent quality

Quality Score Distributions

Visualizing quality score distributions is common practice. A typical plot shows the average quality score at each base position across all reads. Ideally, quality scores remain high throughout the read. A drop in quality towards the 3' end of reads is common; on sequencing-by-synthesis platforms it mainly reflects signal decay and phasing errors that accumulate over successive cycles.

A common visualization for NGS data quality is a 'per-base quality score plot'. This plot typically shows the average Phred score on the y-axis and the base position within a read on the x-axis. Different colored lines might represent different samples or lanes. A healthy plot shows high and relatively stable quality scores across most of the read length, with a potential gradual decline towards the end. Sharp drops or consistently low scores can indicate problems with the sequencing run, library preparation, or instrument performance. For example, a plateau at Q30 or higher for the majority of the read length is desirable.
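
The numbers behind such a plot can be sketched in a few lines. The example below computes the mean Phred score at each read position from a list of quality strings, assuming Phred+33 encoding. The reads are toy data and the function name is hypothetical. Tools such as FastQC report richer per-position summaries than a plain mean.

```python
def per_position_mean_quality(quality_strings, offset=33):
    """Mean Phred score at each position; reads may differ in length."""
    totals = []  # sum of scores observed at each position
    counts = []  # number of reads that reach each position
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(totals):  # first read long enough to reach position i
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]

# Toy reads whose quality declines towards the 3' end.
reads = ["IIII?5", "III?5+", "IIII5"]
for pos, mean_q in enumerate(per_position_mean_quality(reads), start=1):
    print(f"position {pos}: mean Q = {mean_q:.1f}")
```

Plotting these means against position (y-axis: mean Q, x-axis: position) gives the curve described above.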

Common Quality Issues and Their Impact

Several factors can lead to low quality scores, including:

  • Sequencing chemistry issues: Problems with reagents or flow cell.
  • Library preparation artifacts: Inefficient adapter ligation or PCR amplification bias.
  • Instrument calibration: Drift in instrument performance over time.
  • Contamination: Presence of foreign DNA or reagents.

Low-quality bases can lead to incorrect variant calls, misassemblies, and unreliable downstream analyses. Therefore, quality control and filtering are essential steps in any NGS workflow.

A Phred score of 20, which corresponds to an estimated 1% chance that a base call is wrong, is often considered a minimum threshold for reliable base calls in many genomic applications. However, for critical applications like variant detection, higher thresholds (e.g., Q30) are often preferred.
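
Thresholds like these are typically applied in two ways: as a summary (what fraction of bases reach Q20 or Q30) and as a filter (trimming low-quality bases). The sketch below shows both under stated assumptions: Phred+33 encoding, toy data, and a deliberately naive trailing trim that removes bases from the 3' end until one meets the threshold. It is not the algorithm of any specific trimming tool.

```python
def fraction_at_or_above(qual, threshold, offset=33):
    """Fraction of bases in a read whose Phred score is >= threshold."""
    scores = [ord(ch) - offset for ch in qual]
    if not scores:
        raise ValueError("quality string is empty")
    return sum(s >= threshold for s in scores) / len(scores)

def trim_trailing_low_quality(seq, qual, threshold, offset=33):
    """Drop bases from the 3' end while their score is below threshold."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality string differ in length")
    end = len(qual)
    while end > 0 and ord(qual[end - 1]) - offset < threshold:
        end -= 1
    return seq[:end], qual[:end]

# Toy read: scores are 40,40,40,40,30,30,20,20,10,10 under Phred+33.
seq = "ACGTACGTAC"
qual = "IIII??55++"
print(fraction_at_or_above(qual, 30))
print(trim_trailing_low_quality(seq, qual, 20))
```

A stricter threshold keeps fewer bases, so the choice between Q20 and Q30 is a trade-off between accuracy and retained data.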

Tools for Quality Control

Numerous bioinformatics tools are available to assess and visualize NGS data quality, including FastQC, MultiQC, and QualiMap. These tools generate comprehensive reports that summarize various quality metrics, including base quality scores, adapter content, GC content, and sequence duplication levels.

What does a Phred score of 30 represent in terms of error probability?

A Phred score of 30 represents an error probability of 0.001, or 0.1%: on average, 1 in 1,000 base calls with this score is expected to be wrong.
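
The same reasoning extends from single bases to whole reads: summing the per-base error probabilities gives the expected number of miscalled bases in a read. A short sketch, assuming Phred+33 encoding and a hypothetical 150-base read with uniform Q30 quality:

```python
def expected_errors(qual, offset=33):
    """Expected number of wrong base calls: the sum of P = 10^(-Q/10) over the read."""
    return sum(10 ** (-(ord(ch) - offset) / 10) for ch in qual)

# '?' encodes Q30 under Phred+33, so this read has 150 bases at P = 0.001 each.
uniform_q30_read = "?" * 150
print(f"{expected_errors(uniform_q30_read):.3f}")
```

For this read the sum is 150 × 0.001 = 0.15, so roughly one read in seven of this kind would be expected to contain an error.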

Learning Resources

FastQC: A Quality Control Tool for High Throughput Sequence Data (documentation)

The official documentation for FastQC, a widely used tool for generating quality control reports for raw sequencing data. It provides detailed explanations of the metrics it assesses, including base quality scores.

MultiQC: Aggregate Bioinformatics Analyses across Samples (documentation)

Learn how to use MultiQC to consolidate and summarize results from multiple FastQC reports and other bioinformatics tools, providing an aggregated view of sequencing quality across many samples.

Illumina Quality Score Basics (documentation)

An official Illumina document explaining the fundamentals of their quality score system, its meaning, and how it's used to assess sequencing data accuracy.

Understanding Phred Quality Scores (documentation)

A concise explanation of Phred quality scores, their mathematical basis, and their importance in interpreting sequencing data. Useful for grasping the core concept.

NGS Quality Control: A Practical Guide (video)

A video tutorial that walks through the practical aspects of NGS quality control, including interpreting quality score plots and common issues. (Note: Specific video content may vary, but this type of resource is valuable).

Introduction to Next-Generation Sequencing Data Analysis (video)

A lecture from a Coursera course that provides an overview of NGS data, including the importance of quality scores as a foundational concept in data analysis. (Note: Access may require Coursera account).

Wikipedia: Phred quality score (wikipedia)

The Wikipedia entry for Phred quality scores, offering a comprehensive overview of their definition, history, and application in bioinformatics.

QualiMap: Quality Control for Genomics (documentation)

The official website for QualiMap, another powerful tool for assessing and visualizing NGS data quality, with a focus on detailed statistical analysis and graphical reports.

Best Practices for NGS Data Analysis (paper)

A scientific paper discussing best practices in NGS data analysis, often covering the critical role of quality control and interpretation of quality scores. (Note: May be behind a paywall or require institutional access).

Galaxy Project: Quality Control (tutorial)

A tutorial from the Galaxy Project that guides users through performing NGS quality control using various tools, often including explanations of quality score interpretation within a user-friendly platform.