The overall goal of this procedure is to identify genes within a population of individuals showing a preponderance of rare functional variation. This is accomplished by first pooling a population of DNA samples. The second step is to create and sequence a next-generation sequencing library.
This is followed by the alignment of reads to the reference sequence and the creation of an error model. The final step is computational analysis using the SPLINTER algorithm. Ultimately, SPLINTER analysis of pooled next-generation sequencing is used to reveal genes within populations harboring a preponderance of rare functional variants.
Demonstrating the procedure today will be Francesco Vallania, a graduate student in the lab of my mentor and our collaborator, Rob Mitra; he will be joined by Enrique Ramos, a graduate student in my laboratory. The main advantage of this technique over existing methods, such as single individual genotyping, is that it allows you to detect rare sequence variants very precisely in a mixed population of DNA molecules without requiring any prior information. This method can help answer key questions in the genetics and genomics fields, such as how to determine the frequency of novel disease-causing rare variants in large cohort studies.
Every SPLINTER experiment requires the presence of a negative and a positive control. To obtain optimal accuracy, prepare the PCR reaction mix using PfuUltra high-fidelity DNA polymerase. The negative control is a PCR product from any DNA sequence known to be without genetic variation, such as a cloned vector backbone.
Here, a 1,934 base pair amplicon from the M13mp18 vector is used. The positive control can be any set of previously validated sequence variants present in the entire population. If such data are not available, this lab has designed an artificial positive control consisting of a 331 base pair PCR product from a blend of engineered sequences cloned into the pGEM-T Easy vector, as listed in this table.
These sequences are combined to mimic various minor allele frequencies of true variants within the patient pool. Following PCR amplification of samples, as discussed in the written protocol accompanying this video, clean each PCR product of excess primers using Qiagen QIAquick column purification, or 96-well filter plates with a vacuum manifold for large-scale cleanup. Once purified, quantify each PCR product using standard techniques.
Prepare to combine all PCR products and controls into a pool normalized by molecule number. Pooling by concentration would result in overrepresentation of small amplicons relative to larger products; instead, pool a normalized number of molecules per amplicon.
Choose target numbers that are large enough to maintain accuracy during pipetting, then pool the PCR products and controls.
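To illustrate the molecule-number calculation, here is a minimal Python sketch, with made-up concentrations and a hypothetical target of 1e10 molecules per amplicon, that converts each product's concentration and length into the volume to add to the pool:

```python
# Illustrative only: pooling normalized by molecule number rather than by
# concentration. Assumes double-stranded DNA at ~660 g/mol per base pair.
AVOGADRO = 6.022e23

def molecules_per_ul(conc_ng_per_ul, amplicon_bp):
    """Molecules per microliter for a dsDNA amplicon of a given length."""
    grams_per_ul = conc_ng_per_ul * 1e-9
    grams_per_mol = amplicon_bp * 660.0
    return grams_per_ul / grams_per_mol * AVOGADRO

def volume_to_pool(conc_ng_per_ul, amplicon_bp, target_molecules):
    """Microliters needed to contribute the target molecule count to the pool."""
    return target_molecules / molecules_per_ul(conc_ng_per_ul, amplicon_bp)

# Example: pool 1e10 molecules of each amplicon regardless of length, so that
# short products are not overrepresented relative to long ones.
for name, conc, length in [("amplicon_A", 25.0, 331), ("amplicon_B", 25.0, 1934)]:
    print(f"{name}: {volume_to_pool(conc, length, 1e10):.2f} uL")
```

Note how the longer amplicon requires proportionally more volume at the same mass concentration, which is exactly the bias that pooling by concentration alone would introduce.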
Ligation of the PCR products is necessary because fragmentation of small PCR amplicons would likely bias the representation toward their ends. For this reason, the pooled PCR products are ligated into large concatemers prior to fragmentation. Prepare the mix for blunt-ended ligation using T4 ligase, T4 polynucleotide kinase, and PEG as listed in the protocol. Incubate the reaction at 22 degrees Celsius for 17 hours.
Follow with incubation at 65 degrees Celsius for 20 minutes, and hold at 4 degrees Celsius thereafter. Check the ligation by loading 50 nanograms of sample onto an agarose gel. Successful ligation will result in a high molecular weight band in the lane.
Prepare for DNA fragmentation through a random sonication strategy by diluting the sample 10 to 1 in Qiagen PB buffer to make it less viscous. Then fragment the large concatemers of PCR products using a 24-sample Diagenode Bioruptor; sonicate at high power over the course of 25 minutes, with 40 seconds on and 20 seconds off per minute. Check the results of DNA fragmentation on an agarose gel and proceed with Illumina sequencing as described in the text.
To begin sequencing read alignment, either convert the raw sequencing read files into SCARF format or compress them. Compression is optional: it saves time and space for the subsequent analysis steps without losing any relevant information. Using the included alignment tool, align the raw reads to the annotated FASTA reference sequence specific to the targeted regions, including the PCR amplicons as well as the positive and negative controls.
The input must be in SCARF format or compressed. Next, perform file tagging as described in the text.
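For reference, SCARF is a one-read-per-line format. The sketch below converts standard four-line FASTQ records into a SCARF-like layout, assuming colon-delimited fields ending in the read sequence and quality string; verify the exact layout expected by the alignment tool against the SPLINTER documentation before use.

```python
# Hypothetical FASTQ-to-SCARF conversion sketch. The colon-delimited layout
# with sequence and quality as the last two fields is an assumption.
import sys

def fastq_to_scarf(fastq_path):
    with open(fastq_path) as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:          # end of file
                break
            seq = fh.readline().rstrip()
            fh.readline()           # '+' separator line, ignored
            qual = fh.readline().rstrip()
            read_id = header[1:].replace(" ", ":")  # drop '@', colon-join id
            print(f"{read_id}:{seq}:{qual}")

if __name__ == "__main__":
    fastq_to_scarf(sys.argv[1])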
Each run generates a unique profile of sequencing errors that must be characterized for accurate variant calling. To model the errors for each run, an internal control known to be devoid of sequence variation is included in each pooled sample library. From the aligned, tagged file, generate an error model file using the included tool with the negative control reference sequence. All of the negative control sequence can be used, or alternatively only a subset specified by its 5 prime and 3 prime ends. Unique reads and pseudocounts should always be applied.
The tool will generate three files, named with the output file name parameter ending in zero, one, or two. These files correspond to a zeroth-, first-, and second-order error model, respectively. For variant calling with SPLINTER, the second-order error model should always be used. For visualization of the run's error rate profile, the Perl script provided for plotting the error model can be employed to generate a PDF error plot from the zeroth-order error model file. The plot will reveal run-specific error trends and can be utilized to infer the maximum number of read bases for the analysis.
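To build intuition for what the error model captures, the following simplified sketch estimates a per-cycle mismatch rate from reads aligned to the invariant negative control. The paired read/reference input is a simplifying assumption; SPLINTER's own tools should be used for the real error model.

```python
# Illustrative per-position error profile: any base disagreeing with the
# negative control reference must be a sequencing error, since the control
# is known to carry no true variants.
from collections import defaultdict

def error_profile(alignments):
    """Return mismatch rate per read position (sequencing cycle)."""
    mismatches = defaultdict(int)
    totals = defaultdict(int)
    for read, ref in alignments:
        for pos, (r, f) in enumerate(zip(read, ref)):
            totals[pos] += 1
            if r != f:                  # disagreement with invariant control
                mismatches[pos] += 1
    return {pos: mismatches[pos] / totals[pos] for pos in sorted(totals)}

# Tiny made-up example: one error at the last cycle of the second read.
print(error_profile([("ACGT", "ACGT"), ("ACGA", "ACGT")]))
# {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.5}
```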
The following section will demonstrate how to run SPLINTER on the aligned file, using the error model to detect rare sequence variants. The first step in the analysis is to run SPLINTER on the aligned file using the reference sequence and the error model. Individual read bases can be excluded from the analysis if found to be defective. The p-value cutoff dictates how stringent the variant calling analysis will be.
A minimum cutoff of -1.301, which corresponds to p = 0.05 on a log10 scale, is a good start. The pool size option optimizes the algorithm's signal-to-noise discrimination by eliminating potential variants with minor allele frequencies lower than that of a single allele in the actual pool. The pool size option should be set to the closest value that is greater than the actual number of alleles analyzed in the experiment. Variants called at lower frequencies will be ignored as noise.
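As a worked illustration of these two parameters, the short Python sketch below shows the cutoff-to-p-value conversion and the lowest minor allele frequency retained for a given pool size; the cohort numbers are hypothetical.

```python
# The SPLINTER p-value cutoff is given on a log10 scale:
# -1.301 corresponds to p = 10**-1.301, i.e. about 0.05.
cutoff_log10 = -1.301
print(10 ** cutoff_log10)     # ~0.05

# Pool size is counted in alleles (two per diploid individual). Setting it
# to the closest value greater than the true allele count means variants
# with minor allele frequency below 1/pool_size are ignored as noise.
individuals = 480             # hypothetical cohort size
alleles = 2 * individuals     # 960 alleles in the pool
pool_size = 1000              # closest value greater than 960 (assumed choice)
print(1 / pool_size)          # 0.001 = lowest MAF treated as signal
```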
After inputting all the parameters and file names, run SPLINTER. The output file returns all hits that are statistically significant across the sample, with a description of the position of the variant, the type of variant,
the p-value per DNA strand, the frequency of the variant, and the total coverage per DNA strand. The list file is used by SPLINTER to normalize coverage across the sample. The first field indicates the amplicon of interest, whereas the second field indicates the position at which the mutation is present.
N indicates that the rest of the sequence does not contain any mutation. After normalization, analysis of the positive control is key to maximizing sensitivity and specificity for a particular run. This is important because the initial cutoff of -1.301 will most likely not be enough to eliminate all the false positives.
Every SPLINTER analysis will show the actual p-value for each called variant, which cannot be predicted a priori. However, the entire analysis can be repeated using the least stringent p-value displayed in the initial output for the known true positive base positions. This will serve to retain all true positives while excluding most, if not all, false positives, which typically have much less significant p-values than true positives.
To automate this process, the cutoff tester script can be used. The cutoff tester script requires a SPLINTER output file and a list of positive control hits, in the form of a tab-delimited file like the one used for normalization. The resulting output will be a list of cutoffs that progressively approach the optimal one.
The last line represents the optimal cutoff for the run and can therefore be used for data analysis. The optimal result is to achieve a sensitivity and specificity of one. If this is not reached, however, the SPLINTER analysis can be optimized by changing the number of incorporated read bases.
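The sketch below illustrates the idea behind this step, sweeping candidate cutoffs and scoring sensitivity and specificity against known positive control positions; the data structures are hypothetical simplifications of the actual script's inputs.

```python
# Sweep log10 p-value cutoffs against known positive control positions.
# A call passes a cutoff if its log10 p-value is at least that significant.
def sweep_cutoffs(calls, true_positives):
    """calls: list of (position, log10_p); true_positives: set of positions."""
    n_pos = len(true_positives)
    n_neg = sum(1 for pos, _ in calls if pos not in true_positives)
    results = []
    for _, cutoff in sorted(calls, key=lambda c: c[1]):  # most significant first
        kept = [pos for pos, p in calls if p <= cutoff]
        tp = sum(1 for pos in kept if pos in true_positives)
        fp = len(kept) - tp
        sens = tp / n_pos if n_pos else 0.0
        spec = 1 - fp / n_neg if n_neg else 1.0
        results.append((cutoff, sens, spec))
    return results

# Made-up example: two true positives and two noise calls.
calls = [(101, -9.2), (250, -7.8), (333, -1.5), (410, -1.4)]
for cutoff, sens, spec in sweep_cutoffs(calls, {101, 250}):
    print(f"cutoff={cutoff:6.1f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

In this toy example, -7.8 is the least stringent cutoff that keeps both true positives while excluding both false positives, which is exactly the value the cutoff tester is designed to find.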
The final cutoff can be applied to the data using the cutoff cut script, which will filter hits below the optimal cutoff out of the SPLINTER output file. This step will generate the final SPLINTER output file, which will contain the SNPs and indels present in the sample. Please note that the output for insertions is slightly different than for substitutions or deletions.
Accuracy as a function of coverage for a single allele in a pooled sample is visualized in this type of plot. Accuracy is estimated as the area under the curve, abbreviated AUC, of a receiver operating characteristic curve, and ranges from a random accuracy of 0.5 to a perfect accuracy of 1.0. In this example, AUC is plotted as a function of coverage per allele for the detection of single mutant alleles in pools of 200, 500, and 1,000 alleles.
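For readers who want to reproduce this kind of accuracy estimate, here is a minimal sketch of an AUC calculation using the rank-based definition, the probability that a randomly chosen true variant scores higher than a randomly chosen non-variant position; the scores are invented for illustration.

```python
# Rank-based AUC: 0.5 = random discrimination, 1.0 = perfect discrimination.
def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Scores here are -log10 p-values, so larger means more significant.
print(auc([9.2, 7.8, 6.5], [1.5, 1.4, 0.3, 2.1]))  # 1.0: perfectly separated
```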
Here, AUC is plotted as a function of total coverage for insertions, deletions, and substitutions. This error plot shows the probability of incorporating an erroneous base at a given position. The error profile shows low error rates, with an increasing trend toward the 3 prime end of the sequencing read.
Notably, different reference nucleotides display different error probabilities. This plot reveals the accuracy of SPLINTER in estimating allele frequencies for positions with greater than 25-fold coverage per allele. A comparison between pooled DNA allele frequencies estimated by SPLINTER and allele counts measured by genome-wide association studies, or GWAS,
results in a very high correlation. A population of 974 individuals was pooled and over 20 kilobases were targeted for sequencing. SPLINTER was applied for the detection of rare variants following the standard protocol. Each individual had genotyping previously performed by GWAS, and concordance between that genotyping and the tagged and novel variants
called in the pooled sample was excellent. Three variants, two of which were rare in the population, were called de novo from the sequencing results and were validated by individual pyrosequencing. Minor allele frequency, or MAF, concordance between pyrosequencing and pooled sequencing was excellent.
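A concordance check of this kind can be reduced to a simple correlation between the two sets of frequency estimates. The sketch below uses made-up values and the Pearson correlation from Python 3.10's statistics module.

```python
# Compare MAFs estimated from the pooled data against an independent
# measurement (e.g., pyrosequencing). Values below are illustrative only.
from statistics import correlation  # Pearson's r; requires Python 3.10+

pooled_maf = [0.001, 0.005, 0.010, 0.050, 0.120]
pyroseq_maf = [0.001, 0.006, 0.009, 0.048, 0.125]
print(round(correlation(pooled_maf, pyroseq_maf), 4))  # close to 1.0
```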
Once you have finished finding the rare variants in your pooled sample, many people want to know the functional consequences of the variants that have been identified, so annotation of your variants becomes the next step in the process. After its development, this technique paved the way for researchers in the field of DNA sequencing to study rare variants in a rapid and cost-effective way and to characterize rare variants in large population studies. After watching this video, you should have a good understanding of how to detect rare sequence variants in a pooled DNA sample using SPLINTER.