Computational approaches to identify and explain sources of error in cancer somatic mutation data
Date
2024-04-18
Author
O’Sullivan, Brian
Abstract
Errors in the identification of somatic mutations in cancer samples can have critical
implications in both research and clinical applications. Failure to detect potential
variants of interest can lead to missed opportunities in patient treatment or scientific research. Incorrectly identifying a somatic variant may result in inaccurate
prognosis, unsuitable treatment selection, or misleading research. By understanding
the sources of error in somatic mutation calling, we are better placed to mitigate
these risks. The reevaluation of variants that have been excluded from analysis by
mutation calling methodologies can provide valuable insights in this regard. By
considering the allele frequency, nucleotide context, and potential impact on protein of a mutation that has been discarded from analysis, we can incorporate the
overall biological context into our assessment of the variant call. This approach
enables us to identify putative somatic variants that were overlooked by the caller
and, importantly, investigate the reason for their omission.
In Chapter 2, we outline vcfView, an interactive R Shiny tool designed to support
the evaluation and exploratory analysis of somatic mutation records from cancer sequencing data. We use vcfView to reevaluate the TCGA acute myeloid leukaemia
data and identify clinically actionable mutation records in patients that were incorrectly excluded from analysis due to the presence of tumour sample DNA in the
matched normal sample.
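To illustrate why tumour DNA in a matched normal sample can cause a genuine somatic variant to be excluded, the following minimal sketch estimates how many variant-supporting reads would be expected in the normal sample. It is not taken from vcfView or the thesis. The function names, the simple multiplicative contamination model, and the example values (tumour allele frequency, contamination fraction, read depth, and the filter threshold) are all assumptions chosen for illustration.

```python
def expected_normal_vaf(tumour_vaf: float, contamination: float) -> float:
    """Expected variant allele fraction in the normal sample.

    Assumes (hypothetically) that a fraction `contamination` of the DNA in
    the normal sample comes from the tumour and the rest carries no variant.
    """
    return tumour_vaf * contamination


def expected_normal_alt_reads(tumour_vaf: float, contamination: float,
                              normal_depth: int) -> float:
    """Expected number of variant-supporting reads in the normal sample."""
    return expected_normal_vaf(tumour_vaf, contamination) * normal_depth


# Hypothetical example values, not taken from the TCGA data.
tumour_vaf = 0.40      # variant allele fraction observed in the tumour
contamination = 0.10   # fraction of tumour DNA in the normal sample
normal_depth = 60      # read depth at the site in the normal sample

alt_reads = expected_normal_alt_reads(tumour_vaf, contamination, normal_depth)

# Hypothetical filter: reject a call if the normal shows more than one
# variant-supporting read. Real callers use more elaborate models.
max_normal_alt_reads = 1
would_be_rejected = alt_reads > max_normal_alt_reads

print(f"expected alt reads in normal: {alt_reads:.1f}")
print(f"rejected by the hypothetical filter: {would_be_rejected}")
```

Under these assumed values, a modest level of contamination is enough for the normal sample to carry more variant reads than a strict filter would tolerate, which is the kind of exclusion the reevaluation above is intended to surface.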
The validation of somatic mutation calling pipelines is a critical step in ensuring the accuracy and reliability of the results obtained from the analysis of cancer
genomic sequencing data. However, the trustworthiness of the validation results is
directly linked to the quality of the truth set used for validation. In Chapter 3, we
introduce a simulation framework designed to generate comprehensive and realistic
tumour genomic sequencing data. This framework takes into account the inherent
randomness of genomic sequencing, providing an accurate representation of the frequency profile as it is observed in real sequencing data. It generates a corresponding
truth set alongside the simulated sequencing data, documenting the true source of
each non-reference base in the data. Unlike existing validation methods, this truth
set not only identifies variant caller errors but, crucially, enables us to understand
the reasons behind the erroneous calls. Using the GATK Mutect2 variant calling
pipeline, we apply this framework to highlight and explain sources of error in somatic
mutation data and biases in the estimation of somatic allele frequency.
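The effect of sequencing randomness on observed allele frequency can be sketched with a simple sampling model. The example below is not the simulation framework described in the thesis. It assumes, for illustration only, that each read at a site independently carries the variant with probability equal to the true allele frequency (a binomial model), and the function names, depth, frequency, and seed are hypothetical.

```python
import random


def simulate_alt_reads(depth: int, true_af: float, rng: random.Random) -> int:
    """Count variant-supporting reads among `depth` reads.

    Each read independently carries the variant with probability `true_af`,
    so the count follows a Binomial(depth, true_af) distribution.
    """
    return sum(1 for _ in range(depth) if rng.random() < true_af)


def observed_afs(depth: int, true_af: float, n_sites: int,
                 seed: int = 0) -> list[float]:
    """Observed allele frequencies at `n_sites` sites sharing one true value."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [simulate_alt_reads(depth, true_af, rng) / depth
            for _ in range(n_sites)]


# Hypothetical settings: 1000 sites, 100x depth, true allele frequency 0.25.
afs = observed_afs(depth=100, true_af=0.25, n_sites=1000)
mean_af = sum(afs) / len(afs)

print(f"mean observed AF: {mean_af:.3f}")
print(f"range of observed AF: {min(afs):.2f} to {max(afs):.2f}")
```

Even with a single true allele frequency, the observed values scatter around it, so a truth set that records the true source of each non-reference base is needed to separate sampling noise from caller error.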
Finally, in Chapter 4, we analyse tumour-only sequencing and somatic variant
data from an unpublished dataset comprising 60 individuals diagnosed with early-onset and aggressive pancreatic ductal adenocarcinoma. We apply the tools and
methods we have developed previously to recover somatic variant information from
sequence data obtained from heavily damaged FFPE samples. We provide an improved estimate of the true incidence of pathogenic KRAS variants within the cohort
that accounts for the sequencing strategy and sample preparation methods used. We
also highlight recurrent mutations in several other cancer-associated genes that may
have played a role in disease progression in these patients.