Write a succinct (bullet point style) but sufficiently detailed plan of individual work packages that are needed to implement the desired functionality of the software. Add the plan as a comment to this issue. After one or more rounds of reviews, the "definite" plan can be used as a guide to create individual issues that you can distribute among yourselves to take care of. More info on how to write issues, assign them etc. will follow in the "version control" session on Oct 26.
Note that writing the project design plan is one of the milestones of the course.
Hi guys, this is the first time I'm doing this kind of lecture with a multi-person programming project, so please be aware that I could be totally wrong. Nevertheless, I'm happy to contribute as well as I can, and I'm eager to learn from anyone who is more experienced. Below you can read my design idea of how the program could be structured (together with your ideas, of course). - RobinC
Task: Write a software package that predicts priming sites for oligo dT primers on mRNA sequences with added poly-A tails, and integrate it into a Nextflow process.
Input: Fasta files with added poly-A tail, library of available oligo dT primers
Step by Step Software Package Design Idea:
• First of all, we have to check that the input data is correct: that we get a valid set of transcript sequences (fasta) and that they have added poly-A tails. (Do we get sequences without poly-A tails? If yes, we have to deal with this case.) This check is needed based on the testing done by the former group.
• If the input data has passed the test, the transcript sequences can be stored in an array/list.
• Create a library of possible oligo dT primers if it is not given (idk).
• Incorporate a function that uses the RIBlast algorithm to make a priming score sheet for each transcript sequence and for each cell. Unfortunately I have not read much about the algorithm, but it calculates the accessibility energy and the hybridisation energy of the RNA-RNA binding. I assume that the algorithm picks out only the priming sites with a binding probability > 0.
• If the RIBlast algorithm does not calculate the binding probability p, we have to incorporate a function that calculates it from the calculated energies.
• Incorporate some sort of test function that evaluates the RIBlast results, in case RIBlast does not check its own output. Either way, the results have to be checked somehow.
• Run the priming site function on the input data in some sort of loop, i.e. all the oligo dT primers against each transcript sequence of every cell. (Actually I don't know what the actual input data looks like, so it is hard to be specific.)
• Evaluate the results from the RIBlast search with the test function. What has to be tested? p > 0 for all the priming sites, and probably some more things, depending on what the RIBlast algorithm is capable of.
• If necessary, calculate the binding probabilities with the incorporated function and check them with a test function.
• Store the generated data in a gene transfer format (GTF) file and mark it as the output of the software package. Incorporate the software package (PredictPrimingSite) into a Nextflow process.
Yes, will do. Great ideas indeed, will iron out small details tomorrow. Overview 1 is actually solving the task, overview 2 is doing more (unnecessary for this task, could be tackled as a separate task). Some functions that have emerged from the discussion above:
error checking the inputs
wrapper around RIBlast call (check the types of inputs to RIBlast to determine whether it operates on entire files of sequences or one sequence at a time)
function to convert the free energies of interaction into probabilities of priming
function to prepare output lines, containing appropriate info and formatted correctly.
A few observations:
there is a single primer sequence
it is not necessary that transcripts contain poly(A) tails (the package should be able to compute probabilities of priming for any primer sequence to any target sequence)
the most meaningful tests for this task are that priming probabilities are calculated correctly and written out in the correct format
Check input files if they are of the format fasta, else, raise an error.
Generate RIBlast readable file to generate RIBlast database of primer sequence(s).
Regarding inputs for RIBlast:
Generate RIBlast database of (all) primer sequence(s) (start with 1 sequence) with the RIblast db command
For each transcript in list of transcripts:
Run each transcript against the primer database using the RIblast ris command which also takes input energy threshold as a variable.
Investigate the output file format. Each interaction is expressed in 5 columns: interaction id, name of the query RNA, name of the target RNA, interaction energy, and interacted regions ([region in query]:[region in target]).
Store data in dictionary with transcript as key and list with 5 values as value.
End for loop
For each transcript, count the priming sites, then calculate the probability of each priming site and append it to that site's list as a new, 6th value.
End for loop
Generate a GFF-formatted file for each transcript (parent sequence) with the primer binding sites as child features, containing the location in the transcript and the associated probability.
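The "store data in dictionary" step above could be sketched as follows. This is only a sketch under assumptions: it takes the five-column layout described above at face value, assumes the columns are comma-separated, and skips every line that does not fit (for example header lines). The function name parse_riblast_output and the key_field parameter are hypothetical, and the real RIblast output should be inspected before relying on this.

```python
def parse_riblast_output(lines, key_field="query"):
    """Group interaction records by transcript name.

    Assumes (not verified against a real RIblast run) that each interaction
    line has five comma-separated columns: id, query name, target name,
    interaction energy, interacted regions.
    """
    interactions = {}
    for line in lines:
        fields = [field.strip() for field in line.strip().split(",")]
        # Skip headers, blank lines and anything that is not a 5-column record.
        if len(fields) != 5 or not fields[0].isdigit():
            continue
        record = {
            "id": int(fields[0]),
            "query": fields[1],
            "target": fields[2],
            "energy": float(fields[3]),
            "regions": fields[4],  # kept as raw text, e.g. "(10-25:0-15)"
        }
        # key_field decides whether the transcript is the query or the target.
        interactions.setdefault(record[key_field], []).append(record)
    return interactions
```

With the transcripts run as queries against the primer database, the default key_field="query" gives one list of priming sites per transcript. If the database is built from the transcripts instead, key_field="target" would be the matching choice.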
Here's my attempt at how the program could be structured. I'd suggest writing the program in the style of a script rather than in a heavily object-oriented style.
Input:
Fasta file with sequences, oligo dT primer, interaction energy threshold value.
Create a test function that verifies that the input files are in fasta format; otherwise raise an error for wrong input files. According to Prof. Zavolan, we do not have to check for poly-A tails, since sequences without poly-A tails also occur and are accepted. I suppose we get the oligo dT primer as a fasta file as well. Otherwise we have to convert it into a fasta file for the RIBlast search.
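A minimal sketch of such an input check, assuming plain-text fasta with ">" header lines and nucleotide sequences. The function names parse_fasta_text and read_fasta and the allowed alphabet (A, C, G, T, U, N) are assumptions made for illustration, not something fixed by the course material.

```python
VALID_BASES = set("ACGTUN")  # assumed alphabet; extend if IUPAC codes are needed


def parse_fasta_text(text):
    """Return {sequence_id: sequence}; raise ValueError if text is not valid fasta."""
    records = {}
    current_id = None
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError(f"line {line_no}: empty fasta header")
            current_id = header[0]
            if current_id in records:
                raise ValueError(f"line {line_no}: duplicate id {current_id!r}")
            records[current_id] = []
        else:
            if current_id is None:
                raise ValueError(f"line {line_no}: sequence before first header")
            unexpected = set(line.upper()) - VALID_BASES
            if unexpected:
                raise ValueError(
                    f"line {line_no}: unexpected characters {sorted(unexpected)}"
                )
            records[current_id].append(line.upper())
    if not records:
        raise ValueError("no fasta records found")
    sequences = {seq_id: "".join(chunks) for seq_id, chunks in records.items()}
    empty = [seq_id for seq_id, seq in sequences.items() if not seq]
    if empty:
        raise ValueError(f"records without sequence: {empty}")
    return sequences


def read_fasta(path):
    """Read a fasta file from disk and validate it with parse_fasta_text."""
    with open(path) as handle:
        return parse_fasta_text(handle.read())
```

Keeping the parser separate from the file access makes the check easy to unit test with short strings, and the same function can validate both the transcript file and the primer file.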
Generate Communication Protocol to OS
Further, we have to import a module that lets the program communicate with the OS. The older options are
os.system,
os.spawn*,
but the subprocess module with subprocess.run() is the recommended replacement for both and gives direct communication with the OS. In addition, another test function should verify the communication with the terminal (e.g. subprocess.run(["echo", "Hello World"])) and raise an error if the communication is not working.
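A sketch of how the OS communication and its check could look with subprocess.run(). The helper names run_command and check_os_communication are hypothetical. As an assumption, the check starts the Python interpreter itself instead of echo, because echo is a shell builtin on some platforms and may not exist as a standalone program.

```python
import subprocess
import sys


def run_command(args):
    """Run a command (list of strings, no shell) and return its stripped stdout."""
    try:
        result = subprocess.run(args, capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError) as error:
        # OSError covers a missing executable, CalledProcessError a non-zero exit.
        raise RuntimeError(f"command failed: {args!r}") from error
    return result.stdout.strip()


def check_os_communication():
    """Raise RuntimeError if a trivial child process cannot be run and read."""
    output = run_command([sys.executable, "-c", "print('Hello World')"])
    if output != "Hello World":
        raise RuntimeError(f"unexpected output: {output!r}")
    return True
```

The same run_command helper could later wrap the RIBlast calls, so that a missing executable or a failing run surfaces as one clearly labelled error.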
Conversion of Fasta File to RIBlast Database (RIBlast db)
In order to create a readable database file for the RIBlast search, we have to create a database for the output files and use;
whereas -I and -O are the minimum inputs. After this step, everything should be ready to test the RNA-RNA interaction of the dT primer and the sequences in the database.
RIBlast Search (RIBlast ris):
As far as I understood, we do not need a for-loop, since the primer sequence gets automatically tested for all the sequences in the previously generated database. Nevertheless, we need to introduce an output variable (.txt) for the RNA-RNA search. The RIBlast search can be made with the following terminal command;
whereas the first three inputs are mandatory. The others are set by default or are not used/can be made use of.
Investigate RIBlast Output:
Well described by @max.baer! I agree with the concept of converting the output .txt file into a dictionary.
Calculate Binding Probabilities:
Using the formula from the lecture, we can calculate the binding probabilities in a for-loop and add the resulting values to the list as a 6th column.
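The lecture formula is not written out in this thread, so the following is only a sketch under a loudly stated assumption: that the probability of each priming site on a transcript is its Boltzmann weight exp(-E/RT), normalised over all sites on that transcript. The function name, the default temperature (37 °C) and the energy unit (kcal/mol) are assumptions as well. If the lecture formula differs, only the weight calculation needs to change.

```python
import math

GAS_CONSTANT = 0.0019872  # kcal/(mol*K)


def priming_probabilities(energies, temperature=310.15):
    """Convert interaction energies (kcal/mol) into normalised probabilities.

    ASSUMPTION: Boltzmann weighting, p_i = exp(-E_i/RT) / sum_j exp(-E_j/RT).
    This stands in for the lecture formula, which is not given here.
    """
    if not energies:
        return []
    rt = GAS_CONSTANT * temperature
    # Shift by the lowest energy so the exponentials cannot overflow.
    lowest = min(energies)
    weights = [math.exp(-(energy - lowest) / rt) for energy in energies]
    total = sum(weights)
    return [weight / total for weight in weights]
```

Called once per transcript with the energies of all its priming sites, this returns one probability per site in the same order, which can then be appended as the 6th value.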
Generate Output Files
Final function to generate the GFF output files, possibly with a test function (see @max.baer). It should return correctly formatted GFF files!
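A sketch of a function that prepares one output line, assuming GFF3 with its nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes). The function name, the source label "RIblast", the feature type "primer_binding_site" and the attribute key "probability" are placeholders chosen for illustration. The final names should be agreed on in the group.

```python
def format_gff_line(transcript_id, start, end, probability,
                    primer_name="oligo_dT", strand="+"):
    """Return one GFF3 line describing a priming site on a transcript.

    Coordinates are 1-based and inclusive, as GFF3 expects. The source,
    type and attribute names below are illustrative assumptions.
    """
    if start > end:
        start, end = end, start  # GFF requires start <= end
    if start < 1:
        raise ValueError("GFF coordinates are 1-based")
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    attributes = (
        f"ID={transcript_id}_{start}_{end};"
        f"Name={primer_name};"
        f"probability={probability:.4f}"
    )
    columns = [
        transcript_id,            # seqid: the transcript the site lies on
        "RIblast",                # source
        "primer_binding_site",    # type
        str(start),
        str(end),
        f"{probability:.4f}",     # score column reused for the probability
        strand,
        ".",                      # phase: not applicable
        attributes,
    ]
    return "\t".join(columns)
```

A test for the output format can then split each line on tabs and check that it has exactly nine columns with start <= end.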