Bind-n-Seq PWMs


-----------------------------------------------
Niels Schlusser, July 28th, 2022
-----------------------------------------------

This directory contains the (procedural) C++ code that runs the actual expectation maximization (EM) procedure optimizing the model parameters.
The code consists of a main file "main.cpp", a parameter file "params.h", a header file "inout.h" containing all the methods for input and output, and a header file "update.h" containing the methods for the EM optimization. Additionally, there is a Makefile and a version of the main file that is meant to be run on supercomputing clusters.

The code was run on the following system configuration:
	- Ubuntu 20.04.1
	- gcc 9.4.0
	- GSL 2.6
	- C++11 standard
The code also uses OpenMP, which is included in the installation of gcc.

The parameter file specifies the following external parameters:
	- the number of compute node threads
	- the length of the RBP binding site that we would like to investigate
	- the maximum read length
	- the number of different files (usually corresponding to different concentrations)
	- the number of masked sequences
	- the quadratic accuracy (stopping criterion of the EM algorithm)
	- the name of the file with masked sequences
	- the array of tuples of input file name and corresponding number of reads (=number of lines)
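
To make the list above concrete, here is a minimal sketch of how such a parameter header could look. All identifiers, file names, and values are hypothetical placeholders chosen for illustration; the actual "params.h" in this directory may name and organize them differently.

```cpp
// Hypothetical sketch of a parameter header. Every identifier, file name,
// and value below is an illustrative assumption, not the actual params.h.
#include <cstddef>

constexpr int         kNumThreads    = 4;     // number of compute node threads
constexpr std::size_t kSiteLength    = 6;     // length of the RBP binding site
constexpr std::size_t kMaxReadLength = 40;    // maximum read length
constexpr std::size_t kNumFiles      = 2;     // number of input files (e.g. concentrations)
constexpr std::size_t kNumMasked     = 0;     // number of masked sequences
constexpr double      kAccuracy      = 1e-6;  // quadratic accuracy (EM stopping criterion)

constexpr const char* kMaskedFile = "masked.txt";  // file with masked sequences

// One entry per input file: file name and number of reads (= number of lines).
struct InputFile {
    const char* name;
    std::size_t numReads;
};

constexpr InputFile kInputFiles[kNumFiles] = {
    {"reads_conc1.tsv", 100000},
    {"reads_conc2.tsv", 120000},
};
```

Keeping these values as compile-time constants means the code has to be recompiled after every change to the parameters.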
	
The input files are expected to contain one read per line, in the following format:
	<length of the read>	<copy number of the read>	<background frequency of the read>	<sequence of the read>
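
As an illustration of this format, one line could be parsed roughly as follows. This is a sketch and not the code in "inout.h": the struct, the function name, the choice of `double` for the copy number, and the length consistency check are all assumptions.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// One record of an input file (struct and field names are illustrative).
struct Read {
    std::size_t length = 0;   // length of the read
    double copyNumber = 0.0;  // copy number of the read (type is an assumption)
    double background = 0.0;  // background frequency of the read
    std::string sequence;     // nucleotide sequence of the read
};

// Parse one whitespace-separated line into `out`. Returns false if a field
// is missing or if the stated length disagrees with the sequence length.
bool parseRead(const std::string& line, Read& out) {
    std::istringstream in(line);
    if (!(in >> out.length >> out.copyNumber >> out.background >> out.sequence)) {
        return false;
    }
    return out.sequence.size() == out.length;
}
```

Because stream extraction skips any whitespace, the sketch accepts both tabs and spaces between the fields.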
Ideally, the background frequencies would be constructed from the overlap of foreground datasets (protein concentration != 0) and background datasets (control experiments with no protein supplied). Because the overlap of the two pools is small, the frequencies have to be constructed from a Markov model instead (see code attached).
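
The idea of a Markov background model can be sketched as follows. This is not the attached code: the model order (first order), the pseudocount of 1, and all names are assumptions made for illustration. The model is estimated from the background reads and then assigns a probability to any read, including reads that never occur in the background pool.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Map A, C, G, T to 0..3; any other character gives -1.
inline int nucIndex(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

// First-order Markov model (the order is an assumption of this sketch).
struct MarkovModel {
    std::array<double, 4> start{};                 // P(first nucleotide)
    std::array<std::array<double, 4>, 4> trans{};  // P(next | previous)
};

// Scale a row of counts so that it sums to one.
inline void normalize(std::array<double, 4>& row) {
    double sum = 0.0;
    for (double v : row) sum += v;
    for (double& v : row) v /= sum;
}

// Estimate the model from background reads, using a pseudocount of 1.
MarkovModel trainMarkov(const std::vector<std::string>& reads) {
    MarkovModel m;
    m.start.fill(1.0);
    for (auto& row : m.trans) row.fill(1.0);
    for (const std::string& r : reads) {
        int prev = -1;
        for (std::size_t i = 0; i < r.size(); ++i) {
            const int cur = nucIndex(r[i]);
            if (cur >= 0) {
                if (i == 0) m.start[cur] += 1.0;
                else if (prev >= 0) m.trans[prev][cur] += 1.0;
            }
            prev = cur;  // an unknown character breaks the chain
        }
    }
    normalize(m.start);
    for (auto& row : m.trans) normalize(row);
    return m;
}

// Probability of a read under the model; 0 if it contains unknown characters.
double readProbability(const MarkovModel& m, const std::string& read) {
    double p = 1.0;
    int prev = -1;
    for (std::size_t i = 0; i < read.size(); ++i) {
        const int cur = nucIndex(read[i]);
        if (cur < 0) return 0.0;
        p *= (i == 0) ? m.start[cur] : m.trans[prev][cur];
        prev = cur;
    }
    return p;
}
```

For long reads the product underflows quickly, so a practical version would probably accumulate log-probabilities instead.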
	
The GSL directory has to be specified in the Makefile. The code is then compiled via <make>.
The code can be run via <./main.exe>. When run without any arguments, the PWM and the unspecific binding term are initialized completely with random numbers. To search for a particular motif, one can pass a dedicated starting point sequence, e.g. "TGCATG", by running <./main.exe TGCATG>. A nucleotide letter initializes the PWM strongly polarized towards that nucleotide at the corresponding position. In addition, there are the options "N" (even distribution at that position) and "R" (random distribution at that position). The position-unspecific binding term e^E_0 is always initialized randomly.
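
The initialization rules described above can be sketched as follows. The function name, the column order A, C, G, T, and the peak weight of 0.97 for a polarized position are assumptions of this sketch; "main.cpp" may use different values and a GSL random number generator instead of `std::mt19937`.

```cpp
#include <array>
#include <cstddef>
#include <random>
#include <stdexcept>
#include <string>
#include <vector>

using Column = std::array<double, 4>;  // probabilities for A, C, G, T

// Build an initial PWM from a starting point string such as "TGCATG".
// 'N' gives an even column, 'R' a random column, and A/C/G/T a column
// polarized towards that nucleotide. The peak weight is an assumption.
std::vector<Column> initPwm(const std::string& start, std::mt19937& rng,
                            double peak = 0.97) {
    const std::string alphabet = "ACGT";
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    std::vector<Column> pwm;
    pwm.reserve(start.size());
    for (char c : start) {
        Column col;
        if (c == 'N') {
            col.fill(0.25);
        } else if (c == 'R') {
            double sum = 0.0;
            for (double& v : col) {
                v = uniform(rng) + 1e-12;  // small offset avoids an all-zero column
                sum += v;
            }
            for (double& v : col) v /= sum;
        } else {
            const std::size_t idx = alphabet.find(c);
            if (idx == std::string::npos) {
                throw std::invalid_argument("unknown symbol in starting sequence");
            }
            col.fill((1.0 - peak) / 3.0);
            col[idx] = peak;
        }
        pwm.push_back(col);
    }
    return pwm;
}
```

With this sketch, a call such as `initPwm("TGNATR", rng)` would fix four positions, leave the third one even, and randomize the last one.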
When using the cluster main file, the code requires another input parameter as the seed of the random number generator. This is necessary because time-seeding is unreliable on supercomputers: they often start jobs at the exact same time, which leads to identical initializations and therefore to redundant runs. The cluster version of the code is thus run via <./main.exe -i $RANDOM>, which takes the seed from the shell variable $RANDOM (in bash, a pseudo-random integer between 0 and 32767). Just as in the local case, initializations can be given in front of the "$RANDOM".
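
Reading such a seed from the command line could look roughly like this. The function name and the fallback to the wall-clock time are assumptions of this sketch, not the behavior of the actual cluster main file.

```cpp
#include <cstdlib>
#include <cstring>
#include <ctime>

// Return the seed that follows "-i" in the argument list. If no
// "-i <seed>" pair is present, fall back to the wall-clock time
// (the fallback is an assumption of this sketch).
unsigned long seedFromArgs(int argc, const char* const argv[]) {
    for (int i = 1; i + 1 < argc; ++i) {
        if (std::strcmp(argv[i], "-i") == 0) {
            return std::strtoul(argv[i + 1], nullptr, 10);
        }
    }
    return static_cast<unsigned long>(std::time(nullptr));
}
```

The returned value would then be handed to whichever random number generator the code uses. Since bash's $RANDOM only yields 32768 distinct values, seed collisions become likely once a few hundred jobs are submitted.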