Design project

Hmm, I think yours is more sophisticated xD Will check it in more detail tomorrow before our discussion!

thank you for your ideas

Here is my idea:

Inputs: transcripts (fasta) and poly(dT) primer

Workflow:

Determine the minimal coverage that allows priming by the poly(dT) primer, including the number of mismatches that still allows priming
Identify priming sites for each transcript
Determine the probability of binding for priming sites
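A minimal sketch of the "identify priming sites" step, assuming a priming site is any transcript window that pairs with the primer with at most `max_mismatches` mismatches. The function names and the mismatch threshold are hypothetical, and this is a plain string scan, not an energy-based model.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA/RNA sequence (U is treated as T)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    dna = seq.upper().replace("U", "T")
    return "".join(comp.get(base, "N") for base in reversed(dna))


def find_priming_sites(transcript: str, primer: str, max_mismatches: int = 1):
    """Return (start, end, mismatches) for each candidate site.

    Coordinates are 0-based, end-exclusive. The primer binds the
    transcript region equal to its reverse complement, so a poly(dT)
    primer is matched against runs of A.
    """
    target = reverse_complement(primer)
    seq = transcript.upper().replace("U", "T")
    sites = []
    for start in range(len(seq) - len(target) + 1):
        window = seq[start:start + len(target)]
        mismatches = sum(a != b for a, b in zip(window, target))
        if mismatches <= max_mismatches:
            sites.append((start, start + len(target), mismatches))
    return sites
```

For example, `find_priming_sites("GGCAAAAAGG", "TTTTT", max_mismatches=0)` reports the single perfect site covering the run of A.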

Note:

Hi guys, this is my first lecture with a multi-person programming project, so please be aware that I could be totally wrong. Nevertheless, I'm happy to contribute as well as I can, and I'm eager to learn from anyone who is more experienced. Below you can read my design idea of how the program could be structured (together with your ideas, of course). - RobinC

Task: Program a software package that predicts priming sites for oligo(dT) primers on mRNA sequences with added poly(A) tails, and integrate it into a Nextflow process.

Input: Fasta files with added poly-A tail, library of available oligo dT primers

Step by Step Software Package Design Idea:

• First of all, we have to check the input data for correctness. Hence we have to verify that we get a correct set of transcript sequences (fasta) and that they have added poly-A tails (do we get sequences without poly-A tails? If yes, we have to deal with this issue). This check is necessary based on the testing done by the former group.

• If the input data has passed the test, the transcript sequences can be stored in an array/list.

• Create library of possible oligo dT primers if not given (idk).

• Incorporate a function that uses the RIblast algorithm to produce a priming score sheet for each transcript sequence and for each cell. Unfortunately I have not read much about the algorithm, but it calculates the accessibility energy and hybridization energy of the RNA-RNA binding. I assume that the algorithm reports only the priming sites with a binding probability > 0.

• If the RIblast algorithm does not calculate the binding probability p, we have to incorporate a function that calculates the binding probability from the calculated energies.

• Incorporate some sort of test function that evaluates the RIblast results, in case RIblast does not validate them itself. Either way, they have to be checked somehow.

• Run the priming site function on the input data in some sort of loop, i.e. all the oligo(dT) primers on each transcript sequence of every cell. (Actually I don't know how the actual input data looks, so it is hard to specify.)

• Evaluate the results from the RIblast search with the test function. What has to be tested? p > 0 for all the priming sites, and probably some more things… depends on what the RIblast algorithm is capable of.

• If necessary, calculate the binding probabilities with the incorporated function and check them with a test function.

• Store the generated data in a Gene Transfer Format file (GTF file) and mark it as the output of the software package. Incorporate the software package (PredictPrimingSite) into a Nextflow process.


First merged draft

Great discussion. Please have a look at these issues (and discussions, if available) to further refine your project design:


New version, have quite some questions. Maybe we can discuss it quickly in the lesson tomorrow?

Yes, will do. Great ideas indeed, will iron out small details tomorrow. Overview 1 is actually solving the task, overview 2 is doing more (unnecessary for this task, could be tackled as a separate task). Some functions that have emerged from the discussion above:

error checking the inputs
wrapper around the RIblast call (check the types of inputs to RIblast to determine whether it operates on entire files of sequences or one sequence at a time)
function to convert the free energies of interaction into probabilities of priming
function to prepare output lines, containing appropriate info and formatted correctly.

A few observations:

there is a single primer sequence
it is not necessary that transcripts contain poly(A) tails (the package should be able to compute probabilities of priming for any primer sequence to any target sequence)
the most meaningful tests for this task are that priming probabilities are calculated correctly and written out in the correct format

Check whether the input files are in fasta format; otherwise, raise an error.

Generate RIBlast readable file to generate RIBlast database of primer sequence(s).

Regarding inputs for RIBlast: Generate RIBlast database of (all) primer sequence(s) (start with 1 sequence) with the RIblast db command

For each transcript in list of transcripts:

Run each transcript against the primer database using the RIblast ris command which also takes input energy threshold as a variable.

Investigate the output file format (each interaction is expressed in 5 columns): the first, second, third, fourth and fifth columns of an interaction describe the interaction ID, name of the query RNA, name of the target RNA, interaction energy, and interacting regions ([region in query]:[region in target]), respectively. Store the data in a dictionary with the transcript as key and a list of the 5 values as value.

End for loop

For each transcript, count the priming sites and, from there, calculate the probability of each priming site and append it to that site's list of values (as a new, sixth entry).

End for loop

Generate a GFF file for each transcript (parent sequence), with the priming sites as child features in the GFF-formatted file, containing their location in the transcript and the associated probability

Output: List of GFF files of each transcript.

Hang on, do we only have 1 primer sequence?

Also, not sure if I got db and query mixed up (now db=primer and query=transcript, not sure if other way round).

Clarify primer sequence, where do we get it from and in what format is it, or do we create it ourselves somehow?

Here's my attempt at how the program could be structured. I'd suggest writing the program in the style of a script rather than as a heavily object-oriented program.

Input: Fasta file with sequences, oligo dT primer, interaction energy threshold value.

Create a test function that verifies that the input files are in fasta format; otherwise, raise an error for wrong input files. According to Prof. Zavolan, we do not have to check for poly-A sequences, since sequences without poly-A also occur and are accepted. I suppose we get the oligo dT primer as a fasta file as well. Otherwise we have to convert it into a fasta file for the RIblast search.
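The input check described here could look roughly like the sketch below. It is only an illustration: `read_fasta` and the accepted alphabet (`ACGTUN`) are my assumptions, and it deliberately does not check for poly-A tails.

```python
VALID_BASES = set("ACGTUN")  # assumed alphabet; extend if IUPAC codes are needed


def read_fasta(path):
    """Read a FASTA file into {record_id: sequence}.

    Raises ValueError if the file is not valid FASTA (sequence before
    the first header, empty or duplicate IDs, unexpected characters,
    or records without sequence).
    """
    records = {}
    current = None
    with open(path) as handle:
        for line_no, raw in enumerate(handle, start=1):
            line = raw.strip()
            if not line:
                continue
            if line.startswith(">"):
                # The record ID is the first word after '>'.
                current = line[1:].split()[0] if line[1:].strip() else ""
                if not current:
                    raise ValueError(f"line {line_no}: empty FASTA header")
                if current in records:
                    raise ValueError(f"line {line_no}: duplicate ID {current!r}")
                records[current] = ""
            else:
                if current is None:
                    raise ValueError(f"line {line_no}: sequence before first header")
                if not set(line.upper()) <= VALID_BASES:
                    raise ValueError(f"line {line_no}: invalid characters in sequence")
                records[current] += line.upper()
    if not records:
        raise ValueError("no FASTA records found")
    empty = [name for name, seq in records.items() if not seq]
    if empty:
        raise ValueError(f"records without sequence: {empty}")
    return records
```

The same function can be used for the transcript file and for the primer file, since both are expected in fasta format.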

Generate Communication Protocol to OS

Further, we have to import modules that allow us to communicate with the OS. For this we can use the subprocess module, which is intended to replace the older functions

os.system, os.spawn*

and call subprocess.run() for direct communication with the OS. In addition, another test function should be implemented to verify the communication with the terminal (e.g. subprocess.run(["echo", "Hello World"])). Otherwise, raise an error that the communication is not working.
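A small sketch of such a communication check. To keep it portable I swapped `echo` for a call to the running Python interpreter; the function names are hypothetical, and `shutil.which` is used as one possible way to check that RIblast is on the PATH before calling it.

```python
import shutil
import subprocess
import sys


def check_os_communication() -> bool:
    """Run a trivial child process and verify that its output comes back."""
    result = subprocess.run(
        [sys.executable, "-c", "print('Hello World')"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0 or result.stdout.strip() != "Hello World":
        raise RuntimeError("communication with the OS is not working")
    return True


def tool_available(name: str) -> bool:
    """True if an executable called `name` is found on the PATH."""
    return shutil.which(name) is not None
```

Calling `tool_available("RIblast")` before the first RIblast call would give a clearer error than a failing subprocess.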

Conversion of Fasta File to RIblast Database (RIblast db)

In order to create a readable database file for the RIblast search, we have to build a database from the input fasta file with:

RIblast db [-i InputFastaFile] [-o OutputDbName] [-r RepeatMaskingStyle]
[-s LookupTableSize] [-w MaximalSpan] [-d MinAccessibleLength],

where -i and -o are the minimum required inputs. After this step, everything should be ready to test the RNA-RNA interaction between the dT primer and the sequences in the database.

RIblast Search (RIblast ris):

As far as I understood, we do not need a for-loop, since the primer sequence is automatically tested against all the sequences in the previously generated database. Nevertheless, we need to specify an output file (.txt) for the RNA-RNA search. The RIblast search can be run with the following terminal command:

RIblast ris [-i InputFastaFile] [-o OutputFileName] [-d DatabaseFileName] [-l MaxSeedLength] [-e HybridizationEnergyThreshold] [-f InteractionEnergyThreshold] [-x DropOutLengthInGappedExtension] [-y DropOutLengthInUngappedExtension] [-g OutputEnergyThreshold] [-s OutputStyle],

where the first three inputs are mandatory. The others have default values and only need to be set if we want to change them.
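A hedged sketch of a wrapper around the two RIblast calls quoted in this comment. Only flags that appear in the usage lines above are used; the function names are hypothetical, and mapping our "interaction energy threshold" input to `-f` is my reading, not something confirmed in this thread.

```python
import subprocess


def build_db_command(fasta_path, db_name):
    """Argument list for `RIblast db` with the two mandatory inputs."""
    return ["RIblast", "db", "-i", str(fasta_path), "-o", str(db_name)]


def build_ris_command(query_fasta, output_path, db_name, energy_threshold=None):
    """Argument list for `RIblast ris` with the three mandatory inputs.

    If given, the threshold is passed as -f (InteractionEnergyThreshold).
    """
    cmd = [
        "RIblast", "ris",
        "-i", str(query_fasta),
        "-o", str(output_path),
        "-d", str(db_name),
    ]
    if energy_threshold is not None:
        cmd += ["-f", str(energy_threshold)]
    return cmd


def run_command(cmd):
    """Run a command; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True)
```

Building the argument lists separately from running them makes the wrapper easy to test without having RIblast installed.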

Investigate RIblast Output:

Well described by @max.baer! I agree with the concept of converting the output .txt file into a dictionary.
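A possible sketch of that conversion, based on the 5-column description above. The comma delimiter, the skipping of non-data lines, and the choice of key column are assumptions that should be checked against a real RIblast output file; the example lines in the usage note are made up.

```python
from collections import defaultdict


def parse_riblast_output(lines, delimiter=",", key_column=2):
    """Group interactions by RNA name.

    key_column=2 groups by target name (transcripts in the database);
    use key_column=1 if the transcripts are the query instead.
    Lines without 5 fields or without a numeric energy are skipped.
    """
    interactions = defaultdict(list)
    for line in lines:
        # maxsplit=4 keeps the region field intact even if it contains the delimiter
        fields = [field.strip() for field in line.strip().split(delimiter, 4)]
        if len(fields) != 5:
            continue
        try:
            energy = float(fields[3])
        except ValueError:
            continue
        interactions[fields[key_column]].append({
            "id": fields[0],
            "query": fields[1],
            "target": fields[2],
            "energy": energy,
            "regions": fields[4],
        })
    return dict(interactions)
```

With made-up lines such as `"0,primer,tx1,-10.5,(0-14:100-86)"`, the result maps `"tx1"` to a list with one record per interaction, which also covers transcripts with several priming sites.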

Calculate Binding Probabilities:

Using the formula from the lecture, we can calculate the binding probabilities in a for-loop and add the resulting values to the list as a 6th column.
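The lecture formula is not written out in this thread, so the sketch below uses a Boltzmann weighting as a stand-in: each site gets a weight exp(-E / RT) and the weights are normalised over all sites of one transcript. Both the formula and the default temperature (37 °C) are assumptions and should be replaced by whatever the lecture specifies.

```python
import math

GAS_CONSTANT = 1.987e-3  # kcal / (mol * K)


def priming_probabilities(energies, temperature=310.15):
    """Convert interaction energies (kcal/mol) into probabilities.

    Assumed model: Boltzmann weights normalised over the sites of one
    transcript, so the returned values sum to 1.
    """
    if not energies:
        return []
    rt = GAS_CONSTANT * temperature
    # Shift by the minimum energy for numerical stability; the shift
    # cancels in the normalisation.
    e_min = min(energies)
    weights = [math.exp(-(energy - e_min) / rt) for energy in energies]
    total = sum(weights)
    return [weight / total for weight in weights]
```

Under this model, sites with lower (more negative) interaction energy receive higher probability, and two sites with equal energy receive equal probability.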

Generate Output Files

Final function to generate the GFF output files, plus possibly a test function (see @max.baer). Return correctly formatted GFF files!

Test function should actually open file and check content and format.
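A sketch of the output step and of a test that opens the file and checks its content. It follows the 9 tab-separated columns of GFF3 (seqid, source, type, start, end, score, strand, phase, attributes); the feature type `priming_site`, the source label and the use of the score column for the probability are placeholder choices of mine.

```python
def format_gff_line(seqid, start, end, score, strand="+",
                    source="RIblast", feature_type="priming_site",
                    attributes=None):
    """Build one GFF3 feature line; start/end are 1-based, inclusive."""
    attrs = ";".join(f"{key}={value}" for key, value in (attributes or {}).items())
    return "\t".join([
        seqid, source, feature_type,
        str(start), str(end), f"{score:.4f}",
        strand, ".", attrs or ".",
    ])


def check_gff_file(path):
    """Open a GFF file and check column count, coordinates and scores."""
    with open(path) as handle:
        for line_no, line in enumerate(handle, start=1):
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 9:
                raise ValueError(f"line {line_no}: expected 9 columns")
            start, end = int(fields[3]), int(fields[4])
            if not 1 <= start <= end:
                raise ValueError(f"line {line_no}: invalid coordinates")
            if not 0.0 <= float(fields[5]) <= 1.0:
                raise ValueError(f"line {line_no}: score is not a probability")
    return True
```

Since RIblast regions may be reported in a different coordinate convention, the conversion to 1-based inclusive positions has to happen before `format_gff_line` is called.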

Clarify primer sequence, where do we get it from and in what format is it, or do we create it ourselves somehow?

There is software that converts your sequence to a fasta file

Are the primers naturally occurring ones or do we "add" one specific primer in the experiment?

>Testprimer (15 bp)
TTTTTTTTTTTTTTT

Let's stick with standard formats: the primer is provided in a fasta-formatted file.
