Variant Filtering and Annotation in Genomics

After identifying potential genetic variations (variants) through sequencing, the critical next step is to filter and annotate these variants. This process helps us distinguish true biological signals from noise and understand the potential impact of these variations on an organism's traits or disease susceptibility.

The Importance of Filtering

Next-generation sequencing (NGS) technologies, while powerful, generate vast amounts of data. Errors introduced during library preparation, sequencing, or alignment can show up in that data as variant calls that do not reflect real biology. Variant filtering aims to remove these false positives, so that the variants we analyze are likely to be real.

The Role of Annotation

Once variants are filtered, annotation provides context. It enriches the variant data with information from various biological databases, helping us understand the potential functional consequences of each variant.

Common Filtering Strategies

Filter Type	Purpose	Example Metrics
Quality Score	Remove low-confidence variant calls	Phred-scaled variant quality (the QUAL field in a VCF file)
Read Depth	Ensure sufficient coverage for reliable variant detection	Minimum total reads, minimum alternate allele reads
Allele Balance	Check that the fraction of reads supporting each allele matches the called genotype	Ratio of alternate allele reads to total reads (e.g., 0.4-0.6 for heterozygotes)
Mapping Quality	Filter variants in regions with ambiguous alignments	Minimum mapping quality score
Strand Bias	Identify variants supported almost entirely by reads from one strand	Ratio of alternate allele reads on forward vs. reverse strand
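
The filters in the table can be sketched as a small set of hard thresholds applied to each call. The example below is a minimal illustration, not a recommended pipeline: the `VariantCall` record, its field names, the filter labels, and every threshold value are assumptions chosen for the example, and the allele balance check assumes the call is heterozygous.

```python
from dataclasses import dataclass


@dataclass
class VariantCall:
    """Hypothetical per-variant summary; field names are illustrative."""
    qual: float        # Phred-scaled variant quality
    depth: int         # total reads covering the site
    alt_reads: int     # reads supporting the alternate allele
    mapq: float        # mean mapping quality of the covering reads
    alt_forward: int   # alternate-allele reads on the forward strand
    alt_reverse: int   # alternate-allele reads on the reverse strand


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled score Q to an error probability, 10^(-Q/10)."""
    return 10 ** (-q / 10)


def failed_filters(v, min_qual=30.0, min_depth=10, min_alt_reads=3,
                   ab_range=(0.4, 0.6), min_mapq=40.0, max_strand_frac=0.9):
    """Return the names of the filters a call fails (empty list means pass).

    All thresholds are illustrative defaults, not tuned recommendations.
    """
    failed = []
    if v.qual < min_qual:
        failed.append("LowQual")
    if v.depth < min_depth or v.alt_reads < min_alt_reads:
        failed.append("LowDepth")
    # Allele balance: fraction of reads supporting the alternate allele.
    # The range check below only makes sense for heterozygous calls.
    allele_balance = v.alt_reads / v.depth if v.depth else 0.0
    if not (ab_range[0] <= allele_balance <= ab_range[1]):
        failed.append("AlleleBalance")
    if v.mapq < min_mapq:
        failed.append("LowMapQ")
    # Strand bias: share of alternate-allele reads on the dominant strand.
    alt_total = v.alt_forward + v.alt_reverse
    if alt_total and max(v.alt_forward, v.alt_reverse) / alt_total > max_strand_frac:
        failed.append("StrandBias")
    return failed


good = VariantCall(qual=50, depth=40, alt_reads=20, mapq=60,
                   alt_forward=10, alt_reverse=10)
bad = VariantCall(qual=12, depth=6, alt_reads=6, mapq=20,
                  alt_forward=6, alt_reverse=0)
print(failed_filters(good))
print(failed_filters(bad))
```

Returning the list of failed filters, rather than a single pass/fail flag, makes it easier to see which threshold is removing most calls when the filters are tuned later.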

Annotation Tools and Databases

A variety of bioinformatics tools and databases are used for variant annotation. These tools often integrate information from multiple sources to provide a comprehensive view of each variant.

Variant annotation is like adding labels and descriptions to a map. A raw map might show locations, but annotation adds street names, points of interest, and population density. Similarly, variant annotation adds gene names, functional impacts, and population frequencies to raw variant calls, making them interpretable.
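
To make the map analogy concrete, here is a minimal sketch of annotation as a lookup step. The gene intervals, the frequency table, and the `annotate` function are all hypothetical stand-ins for real resources; actual annotation tools also predict functional consequences, which this sketch leaves out.

```python
# Hypothetical gene intervals: (chrom, start, end, gene_name), 1-based inclusive.
GENES = [
    ("chr1", 1000, 5000, "GENE_A"),
    ("chr1", 8000, 9000, "GENE_B"),
]

# Hypothetical population allele frequencies keyed by (chrom, pos, ref, alt).
POP_AF = {
    ("chr1", 1200, "A", "G"): 0.12,
    ("chr1", 8500, "C", "T"): 0.0004,
}


def annotate(chrom, pos, ref, alt):
    """Attach gene names and a population frequency to one raw variant call."""
    genes = [name for c, start, end, name in GENES
             if c == chrom and start <= pos <= end]
    return {
        "chrom": chrom,
        "pos": pos,
        "ref": ref,
        "alt": alt,
        "genes": genes,  # empty list: the position overlaps no listed gene
        "pop_af": POP_AF.get((chrom, pos, ref, alt)),  # None: not in the table
    }


print(annotate("chr1", 1200, "A", "G"))
print(annotate("chr1", 6000, "T", "C"))
```

The linear scan over `GENES` is fine for a toy table; with genome-scale annotation, an interval index would be the usual choice.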

What is the primary goal of variant filtering?

To remove false positive variant calls caused by sequencing or alignment errors.

Name two types of information provided by variant annotation.

Gene context and predicted functional consequence (e.g., missense, synonymous).

Advanced Filtering and Interpretation

Beyond basic quality metrics, advanced filtering can involve using population genetics principles, comparing variants across different samples (e.g., case vs. control), or leveraging known functional annotations. The ultimate goal is to prioritize variants that are most likely to be biologically relevant to the research question.
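
As a rough illustration of combining population frequency with a case vs. control comparison, the sketch below keeps variants that are rare in the general population and seen in cases but not in controls. The record layout, the `prioritize` function, and the cutoffs are assumptions made for this example. Treating a variant with no known population frequency as rare is also an assumption, and a real study would typically apply a statistical test rather than simple carrier counts.

```python
def prioritize(variants, max_pop_af=0.01, min_case_carriers=2,
               max_control_carriers=0):
    """Return IDs of variants that are rare and enriched in cases.

    Each variant is a dict with hypothetical keys: "id", "pop_af",
    "case_carriers", and "control_carriers".
    """
    kept = []
    for v in variants:
        pop_af = v.get("pop_af")
        # Assumption: a variant absent from the frequency table counts as rare.
        is_rare = pop_af is None or pop_af <= max_pop_af
        is_enriched = (v["case_carriers"] >= min_case_carriers
                       and v["control_carriers"] <= max_control_carriers)
        if is_rare and is_enriched:
            kept.append(v["id"])
    return kept


candidates = [
    {"id": "var1", "pop_af": 0.0004, "case_carriers": 3, "control_carriers": 0},
    {"id": "var2", "pop_af": 0.12, "case_carriers": 5, "control_carriers": 0},
    {"id": "var3", "pop_af": None, "case_carriers": 2, "control_carriers": 1},
    {"id": "var4", "pop_af": None, "case_carriers": 2, "control_carriers": 0},
]
print(prioritize(candidates))
```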

The process of filtering and annotation is iterative. Initial filters might be too stringent or too lenient, requiring adjustments based on the observed results and biological context.
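
One simple way to support this kind of adjustment is to count how many calls survive at several candidate thresholds before committing to one. The sketch below does this for a quality cutoff; the scores and thresholds are made-up values, and `pass_counts` is a hypothetical helper.

```python
def pass_counts(quals, thresholds):
    """Map each candidate threshold to the number of calls at or above it."""
    return {t: sum(q >= t for q in quals) for t in thresholds}


# Made-up Phred-scaled quality scores for seven variant calls.
quals = [5, 12, 25, 31, 48, 60, 90]
counts = pass_counts(quals, thresholds=(10, 20, 30, 50))
for threshold, n_pass in counts.items():
    print(f"QUAL >= {threshold}: {n_pass} of {len(quals)} calls pass")
```

A sharp drop between two neighboring thresholds is a hint that the stricter one may be discarding a large group of calls worth inspecting by hand.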

Learning Resources

GATK Best Practices for Variant Filtering (documentation)

Official documentation from the Genome Analysis Toolkit (GATK) detailing recommended strategies for filtering variants identified by their tools.

Variant Annotation with VEP (Variant Effect Predictor) (documentation)

Learn how to use Ensembl's Variant Effect Predictor (VEP) to annotate variants with their potential functional consequences and population frequencies.

Understanding Variant Quality Scores (paper)

A technical note from Illumina explaining the concept of variant quality scores and their importance in variant calling and filtering.

Introduction to Variant Annotation (video)

A YouTube video providing a conceptual overview of variant annotation and its significance in genomic analysis.

The gnomAD Browser: Exploring Human Genetic Variation (wikipedia)

Access and explore population allele frequencies for millions of genetic variants from large-scale sequencing projects, crucial for filtering rare variants.

ClinVar: A Public Archive of Relationships Among Variants and Phenotypes (documentation)

A database that aggregates information on the relationships between human genetic variations and observed phenotypes, essential for disease-gene association studies.

SnpEff: Variant Effect Prediction (documentation)

Documentation for SnpEff, a popular tool for annotating the effects of genetic variants on genes and proteins.

Best Practices for Variant Calling and Filtering in Human Genomics (paper)

A review article discussing current best practices in variant calling, filtering, and annotation for human genomic data.

ANNOVAR: Functional Annotation of Genetic Variants (documentation)

Learn about ANNOVAR, a widely used software package for annotating genetic variants from high-throughput sequencing data.

Understanding Allele Frequency (wikipedia)

A clear explanation of allele frequency, a fundamental concept used in filtering variants based on their prevalence in populations.