Skip to content
Snippets Groups Projects

fix: add cli to sample_from_input

Closed Michele Garioni requested to merge feature into main
7 unresolved threads

Merge request reports

Pipeline #13842 failed

Pipeline failed for f9d2e0df on feature

Approval is optional

Closed by Alex KanitzAlex Kanitz 2 years ago (Sep 28, 2022 8:52am UTC)

Merge details

  • The changes were not merged into main.

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
src/gene_count.py 0 → 100644
1 """Module containing function to count reads per gene.
2
3 Function:
4 count_reads: Take as input an aligned BAM file and an
5 annotated GTF file, and counts the reads belonging to
6 each gene.
7 """
8 import logging
  • Alex Kanitz
  • 71 71 myfile.write(line + '\n')
    72 72 myfile.close()
    73 73 LOG.info('output written.')
    74
    75
    76 def main():
  • Alex Kanitz
  • 1 pysam 0.18.0
  • The CI pipeline fails early because of a syntax error in requirements.txt, making it hard to evaluate all of the code. From a first glance though, it looks like there are only minor issues. Can you please check whether the code does what it should, @zavolan?

  • Alex Kanitz mentioned in merge request !24 (closed)

    mentioned in merge request !24 (closed)

  • src/gene_count.py 0 → 100644
    26 with open(gtf_file) as f1:
    27 LOG.info("Reading annotations...")
    28 gtf_dc = {}
    29 # Read gtf skipping header and non-gene lines
    30 for line in f1:
    31 if line.startswith('#!'):
    32 continue
    33 fields = line.split('\t')
    34 if fields[2] != 'gene':
    35 continue
    36 # extract gene id
    37 gene_id = fields[8].split(';')
    38 gene_id = gene_id[0]
    39 quote_indx = [i for i, c in enumerate(gene_id) if c == '"']
    40 gene_id = gene_id[quote_indx[0]+1:quote_indx[1]]
    41 # build gtf dictionary
    • "Magic numbers" like 8, 3, etc. will make the code difficult to read by another development without checking the format of the input file. Would be better to give these field indices meaningful names that indicate what they are.

    • Please register or sign in to reply
  • src/gene_count.py 0 → 100644
    45 gtf_dc[fields[0]].append((fields[3], fields[4], gene_id))
    46 LOG.info("Annotations read.")
    47
    48 # Initialize empty gene count dictionary {gene:count}
    49 count_dc = {}
    50 # Initialize empty 'history' dictionary
    51 # to detect multiple alignments {id_read:[{gene:count},n_occurence]}
    52 history = {}
    53 # Read trough BAM file and add counts to genes
    54 # we assume that the BAM is formatted as STAR output
    55 bam_seq = pysam.AlignmentFile(bam_file, 'rb', ignore_truncation=True, require_index=False)
    56 LOG.info("Reading alignments...")
    57 for line in bam_seq.fetch(until_eof=True):
    58 line_fields = str(line).split('\t')
    59 read_start = int(line_fields[3])
    60 read_end = read_start+len(line_fields[9])
  • src/gene_count.py 0 → 100644
    66 g = []
    67 n = []
    68 # run through the gene intervals, find the overlaps with the reads
    69 # and memorize the found genes
    70 flag = 0
    71 for i in genes:
    72 gene_range = (int(i[0]), int(i[1]))
    73 if read_range[1] < gene_range[0] or read_range[0] > gene_range[1]:
    74 continue
    75 flag = 1
    76 g.append(i[2])
    77 # assign to the gene the value of the overlap between gene and read
    78 overlap = range(max(read_range[0], gene_range[0]), min(read_range[1], gene_range[1]))
    79 n.append(len(overlap)/len(line_fields[9]))
    80 # normalize per n genes
    81 n = [x/len(g) for x in n]
  • src/gene_count.py 0 → 100644
    92 history[line_fields[0]] = [dict(zip(g, n)), 1]
    93 # else, update history and correct the count dictionary
    94 else:
    95 genes_matched = history[line_fields[0]]
    96 old_occurrence = genes_matched[1]
    97 # update occurrence
    98 new_occurence = old_occurrence + 1
    99 genes_matched[1] = new_occurence
    100 # update mapping history of this read
    101 for g_iter in range(len(g)):
    102 if g[g_iter] not in genes_matched[0]:
    103 genes_matched[0][g[g_iter]] = n[g_iter]
    104 else:
    105 genes_matched[0][g[g_iter]] += n[g_iter]
    106 # correct count dictionary by mapping history
    107 for (gene_m, score) in genes_matched[0].items():
    • Looks like double work here. What you want to do is to add each read to each gene to which it maps proportionally, taking into account all places to which the read can be assigned. So you could first determine the likelihood of the read being assigned to a particular gene (e.g. overlap(read, gene)/len(read)) and then normalize these values so that they add up to 1 read.

    • Thank you for reviewing my code. Indeed, I think I can avoid to add the read counts to the count dicitonary while reading the bam file. I can do it at the end and directly calculate the normalized reads from the history. I will work on fixing the code following your suggestions

    • Please register or sign in to reply
  • closed

  • Please register or sign in to reply
    Loading