The overall goal of the following experiment is to use the density of mapped read positions from chromatin immunoprecipitation sequencing (ChIP-seq) data to estimate the posterior mean read density across the genome. This is achieved by first pre-processing the mapped ChIP-seq reads into blocked density profiles from the number of reads falling within 200 base pair non-overlapping bins.
Any adjacent bins with the same density are merged into a larger block. As a second step, posterior mean densities of each block are calculated recursively within the context of all surrounding blocks using a Bayesian model with forward and backward filters, where the read count for a block is modeled with a Poisson distribution with a theta parameter that takes on a gamma prior distribution with alpha and beta parameters. Next, posterior mean density estimates of each block are evaluated for significance based on whether or not they exceed the 90th quantile with respect to the input control background density, in order to generate the final enriched genome segments. Results are obtained that illustrate the progression from raw sequenced reads to posterior mean read density estimates, and finally to enriched islands in ChIP-seq data during BCP analysis.
Furthermore, the results show that BCP outperforms a competitor tool, SICER. The main advantage of this technique over existing methods like SICER is that BCP uses the most recent advances in hidden Markov models, so it better characterizes the nuances of ChIP-seq data analysis than previous heuristic methods. This method can help answer key questions in the epigenomics field, such as the role of histone modifications, by way of characterizing their genome-wide enrichment patterns.
Though this method can provide insight into ChIP-seq data analysis, the basic framework may also be applied to other next-generation sequencing data analysis, such as identifying differentially methylated regions in bisulfite sequencing data, novel transcription loci in RNA-Seq, copy number variation, or any number of microarray tiling data. Visual demonstration of this method is critical for a clear understanding of the methodology and its advantages, since the theoretical advantages are hidden within the software.
All of the procedural steps demonstrated here have been packaged into a single executable in the BCP software package, which is available for download. In this video, the steps executed by the program are described. To run the software, three parameters are required.
These are a file containing uniquely mapped reads from a ChIP sample, a similar file for input control reads, and an output file name. To prepare input files for BCP analysis, first align the short reads produced from sequencing runs to the appropriate reference genome using the preferred short read alignment software. The mapped locations should be converted to the six-column browser extensible data (BED) format, a tab-delimited line per mapped read indicating the mapped chromosome, start position, end position, read name, score, and strand.
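The six-column layout described above can be read with a few lines of Python. This is only a minimal sketch for illustration: the `BedRead` type, the `parse_bed6` helper, and the example read are hypothetical and are not part of the BCP package. BED start positions are 0-based and end positions are exclusive.

```python
from typing import NamedTuple


class BedRead(NamedTuple):
    """One mapped read in six-column BED form (hypothetical helper type)."""
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    name: str
    score: str
    strand: str  # "+" or "-"


def parse_bed6(line: str) -> BedRead:
    """Split one tab-delimited BED line into its six fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        raise ValueError(f"expected 6 tab-delimited columns, got {len(fields)}")
    chrom, start, end, name, score, strand = fields[:6]
    return BedRead(chrom, int(start), int(end), name, score, strand)


# Made-up example line, for illustration only.
read = parse_bed6("chr2\t1635000\t1635036\tread_001\t0\t+\n")
```

A line with fewer than six columns raises `ValueError`, so truncated input is caught before any counting begins.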
Extend the ChIP and input mapped locations to a predetermined fragment length, for example the fragment size targeted during enzyme digestion or sonication of the DNA, usually around 200 base pairs. Fragment counts are then aggregated in adjacent bins.
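The extension and binning step can be sketched as follows. The function names are hypothetical, and assigning each fragment to a bin by its midpoint is an assumption made for this illustration. The narration does not say how BCP itself assigns a fragment that spans two bins.

```python
from collections import Counter

FRAGMENT_LEN = 200  # assumed fragment length in base pairs
BIN_SIZE = 200      # bin size in base pairs


def extend_read(start, end, strand, fragment_len=FRAGMENT_LEN):
    """Extend a mapped read to fragment_len in its 5'-to-3' direction."""
    if strand == "+":
        return start, start + fragment_len
    # Minus-strand reads extend toward lower coordinates; clamp at 0.
    return max(0, end - fragment_len), end


def bin_counts(fragments, bin_size=BIN_SIZE):
    """Count fragments per non-overlapping bin (midpoint rule: an assumption)."""
    counts = Counter()
    for start, end in fragments:
        counts[((start + end) // 2) // bin_size] += 1
    return counts


# Made-up reads as (start, end, strand) on a single chromosome.
reads = [(1000, 1036, "+"), (1150, 1186, "-"), (1390, 1426, "+")]
fragments = [extend_read(*r) for r in reads]
counts = bin_counts(fragments)
```

Here the first two fragments fall into bin 5 and the third into bin 7. The same procedure would be applied to the ChIP and input files separately.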
By default, bin size is set to the estimated fragment length of 200 base pairs. Any possible change points in a set of bins with identical read counts will most likely fall at the outermost boundaries. Accordingly, it is improbable that a change point will occur at an internal boundary between two bins with the same read counts.
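Because a change point is unlikely between two bins with equal counts, runs of such bins can be collapsed into blocks. A minimal sketch, assuming the bin counts for one chromosome are held in a plain list ordered by position (the `merge_equal_bins` name and the tuple layout are hypothetical):

```python
from itertools import groupby


def merge_equal_bins(counts, bin_size=200):
    """Collapse runs of adjacent equal-count bins into blocks.

    Returns tuples of (start, end, reads_per_bin, n_bins).
    """
    blocks = []
    pos = 0  # index of the first bin in the current run
    for value, run in groupby(counts):
        n = len(list(run))
        blocks.append((pos * bin_size, (pos + n) * bin_size, value, n))
        pos += n
    return blocks


# Made-up per-bin counts, for illustration only.
bins = [0, 0, 3, 3, 3, 1, 0]
blocks = merge_equal_bins(bins)
```

Seven bins collapse into four blocks here, which reduces the number of candidate change-point boundaries that the later estimation step has to consider.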
Therefore, group adjacent bins with identical reads per bin into a single block. After preparing the input files, invoke the BCP estimation by simply typing the command shown at the bottom of the screen. The read density of each block is modeled as a Poisson distribution with a mean parameter theta following a mixture of gamma distributions with alpha and beta parameters and a prior probability p of a change point occurring at any block boundary.
Conditioning each block this way effectively renders an infinite-state hidden Markov model, or HMM. The hyperparameters alpha, beta, and p are estimated using maximum posterior likelihood. The Bayes estimates are explicitly calculated for each block theta sub t as the expectation of theta sub t given y sub t. The more traditional but time-consuming forward and backward filters often used in HMMs are replaced with the more computationally efficient bounded complexity mixture approximation to estimate posterior means theta hat sub t. The resulting posterior means will be smoothed into an approximate piecewise constant profile, so blocks with identical theta hat sub t should be further blocked together with updated boundary coordinates.
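To make the Poisson and gamma pairing concrete, the sketch below computes the posterior mean for a single segment whose boundaries are taken as known. It is the textbook conjugate update, not the bounded complexity mixture approximation that BCP uses, and it ignores the change-point probability p entirely. It also assumes beta is a rate parameter, and the hyperparameter values are made up.

```python
def gamma_poisson_posterior_mean(counts, alpha, beta):
    """Posterior mean of a Poisson rate under a Gamma(alpha, rate=beta) prior.

    All counts are assumed to share one rate (no change point inside the
    segment), so the posterior is Gamma(alpha + sum(counts), beta + n).
    """
    return (alpha + sum(counts)) / (beta + len(counts))


# Three bins with 3 reads each; alpha and beta are illustrative values.
theta_hat = gamma_poisson_posterior_mean([3, 3, 3], alpha=1.0, beta=1.0)
```

With no observations the function returns the prior mean alpha / beta. As more bins are added, the estimate moves toward the observed reads per bin.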
BCP uses the number of input reads per block as the background rate and determines enrichment using a simple hypothesis test based on whether the ChIP posterior mean density for a block exceeds some significance threshold. The 90th quantile is the default threshold and is appropriate in most cases.
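One way to picture this test is to compare the ChIP posterior mean against the 90th quantile of a Poisson distribution whose rate is the input read density. The narration does not spell out the exact form of the test or how input depth is scaled to ChIP depth, so the sketch below is an assumption, and both function names are hypothetical.

```python
import math


def poisson_quantile(rate, q=0.9):
    """Smallest k with P(X <= k) >= q for X ~ Poisson(rate), 0 < q < 1."""
    if rate <= 0:
        return 0
    k, cdf = 0, 0.0
    while True:
        # Evaluate the pmf in log space to avoid underflow for large rates.
        cdf += math.exp(k * math.log(rate) - rate - math.lgamma(k + 1))
        if cdf >= q:
            return k
        k += 1


def is_enriched(chip_posterior_mean, input_rate, q=0.9):
    """Call a block enriched if its ChIP estimate exceeds the background quantile."""
    return chip_posterior_mean > poisson_quantile(input_rate, q)


threshold = poisson_quantile(2.0)    # background of 2 reads per bin
strong = is_enriched(6.5, 2.0)
weak = is_enriched(3.0, 2.0)
```

With a background of 2 reads per bin the threshold is 4, so an estimate of 6.5 is called enriched and an estimate of 3.0 is not.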
BCP then merges adjacent posterior mean density blocks that exceed the enrichment threshold into a single region and reports the merged coordinates in the browser extensible data (BED) format. BCP excels at identifying regions of broad enrichment in histone modification data. Here, BCP results are compared to those of SICER, an existing tool which has demonstrated strong performance. Preceding work from this lab studying H3K36 trimethylation demonstrated a tendency for much larger island size in BCP than in SICER.
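The merging of adjacent enriched blocks into reported regions can be sketched as below. The input tuple layout and the three-column output are assumptions for illustration. The narration does not list the columns BCP actually writes.

```python
def merge_enriched(blocks):
    """Merge enriched blocks that touch end-to-start on the same chromosome.

    blocks: iterable of (chrom, start, end, enriched), sorted by chrom, start.
    Returns a list of (chrom, start, end) islands.
    """
    islands = []
    for chrom, start, end, enriched in blocks:
        if not enriched:
            continue
        if islands and islands[-1][0] == chrom and islands[-1][2] == start:
            # Extend the previous island instead of opening a new one.
            islands[-1] = (chrom, islands[-1][1], end)
        else:
            islands.append((chrom, start, end))
    return islands


# Made-up blocks, for illustration only.
example_blocks = [
    ("chr2", 0, 400, False),
    ("chr2", 400, 1000, True),
    ("chr2", 1000, 1200, True),
    ("chr2", 1200, 1400, False),
    ("chr2", 1400, 1600, True),
]
islands = merge_enriched(example_blocks)
bed_lines = ["\t".join(map(str, island)) for island in islands]
```

The two touching enriched blocks become one island from 400 to 1200, while the enriched block at 1400 stays separate because a non-enriched block lies between them.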
Larger islands are more in line with the conventional expectation of broad, diffuse islands of H3K36 trimethylation enrichment. Larger islands do not alone indicate accuracy. Therefore, the known association of H3K36 trimethylation islands with actively transcribed gene bodies, as well as their mutual exclusivity with H3K27 trimethylation islands, was used to evaluate the performance of BCP and SICER. Compared to SICER, BCP called larger contiguous islands that better capture gene bodies without incurring increased overlap with H3K27 trimethylation islands.
BCP maintains the high overlap of active genes by H3K36 trimethylation islands, with boundaries closely aligned to gene bodies, without increasing the degree of false positive overlap with intergenic space, genes with repressed transcription, or the H3K27 trimethylation repressive mark. While assessing the reproducibility of BCP island calls in two replicate data sets, it was observed that BCP did not suffer from the heavy dependence on read coverage depth seen in the competing algorithm SICER. Additional evidence of BCP's robustness and reproducibility is provided by examining additional distinct regions, demonstrating consistent island boundaries despite the reduced coverage depth. To fully demonstrate the versatility of BCP, a broad spectrum of histone modification data was obtained, including the punctate marks H3K27 acetylation, H3K9 acetylation, and H3K4 trimethylation, and the diffuse mark H3K9 trimethylation, in addition to H3K27 trimethylation and H3K36 trimethylation. These data sets were analyzed using the default parameter settings for both BCP and SICER.
At the center lies H3K36 trimethylation enrichment at the PXDN gene, marking active transcription. Falling expectedly at the transcription start site are the additional punctate active marks H3K27 acetylation, H3K9 acetylation, and H3K4 trimethylation. Just downstream of PXDN is repressed intergenic space marked by H3K27 trimethylation enrichment. On the opposite flank lies an H3K27 trimethylation repressed gene.
Moving one more step out are regions of silenced chromatin, as indicated by the presence of H3K9 trimethylation enrichment, which appears to indicate silencing of SNTG2 and MYT1L, perhaps in a less transient sense than H3K27 trimethylation repression. This region encompasses the majority of phenomena encountered in ChIP-seq of histone modifications. It illustrates how the dynamic nature of BCP can identify both punctate acetylation and H3K4 trimethylation marks, while at the same time distinguishing large contiguous islands of H3K27 trimethylation and H3K9 trimethylation repression, as well as H3K36 trimethylation active transcription.
This algorithm can be performed in roughly 30 minutes, depending on the number of reads and the genome size, without any significant optimization as is often required with other methods. Following this procedure, many different target proteins of chromatin immunoprecipitation can be studied using BCP, including various other histone modifications as well as DNA-binding transcription factors, to answer additional questions about epigenomic mechanisms and gene regulation.
After watching this video, you should have a good understanding of how BCP is used to identify regions enriched for diffuse histone marks in ChIP-seq data analysis.