fix: add cli to sample_from_input
requested review from @zavolan
- src/gene_count.py 0 → 100644
1 """Module containing function to count reads per gene.
2
3 Function:
4     count_reads: Take as input an aligned BAM file and an
5         annotated GTF file, and counts the reads belonging to
6         each gene.
7 """
8 import logging

Please see this comment and the comment linked in it (as well as the course materials) for how to group and sort imports: !22 (comment 25061)
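For reference, the grouping the comment asks for usually follows PEP 8: standard-library imports first, then third-party, then local imports, each group separated by a blank line and alphabetized within the group. A minimal sketch (the third-party and local lines are placeholders, not the module's actual imports, and pysam is commented out so the snippet runs without the dependency):

```python
# standard library
import logging
import sys

# third party
# import pysam

# local
# from . import helpers
```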
The CI pipeline fails early because of a syntax error in requirements.txt, making it hard to evaluate all of the code. At first glance, though, it looks like there are only minor issues. Can you please check whether the code does what it should, @zavolan?

mentioned in merge request !24 (closed)
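For context, pip expects one requirement per line in requirements.txt, each an installable package name with an optional version specifier. A minimal hypothetical example of well-formed lines (the pinned versions here are illustrative; the actual error in this MR's file is not shown):

```text
pysam>=0.21
numpy==1.26.4
```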
- src/gene_count.py 0 → 100644
26     with open(gtf_file) as f1:
27         LOG.info("Reading annotations...")
28         gtf_dc = {}
29         # Read gtf skipping header and non-gene lines
30         for line in f1:
31             if line.startswith('#!'):
32                 continue
33             fields = line.split('\t')
34             if fields[2] != 'gene':
35                 continue
36             # extract gene id
37             gene_id = fields[8].split(';')
38             gene_id = gene_id[0]
39             quote_indx = [i for i, c in enumerate(gene_id) if c == '"']
40             gene_id = gene_id[quote_indx[0]+1:quote_indx[1]]
41             # build gtf dictionary

- src/gene_count.py 0 → 100644
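The quote-index extraction in the hunk above can be exercised on a sample GTF attribute field. A standalone sketch with a made-up attribute string (the gene id is illustrative):

```python
# Hypothetical 9th field of a GTF gene line (the attributes column)
attr = 'gene_id "ENSG00000139618"; gene_name "BRCA2";'

# Same approach as the diff: take the first attribute, then slice
# between the first and second double quote
gene_id = attr.split(';')[0]
quote_indx = [i for i, c in enumerate(gene_id) if c == '"']
gene_id = gene_id[quote_indx[0]+1:quote_indx[1]]
# gene_id is now "ENSG00000139618"
```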
45             gtf_dc[fields[0]].append((fields[3], fields[4], gene_id))
46     LOG.info("Annotations read.")
47
48     # Initialize empty gene count dictionary {gene:count}
49     count_dc = {}
50     # Initialize empty 'history' dictionary
51     # to detect multiple alignments {id_read:[{gene:count},n_occurence]}
52     history = {}
53     # Read through BAM file and add counts to genes
54     # we assume that the BAM is formatted as STAR output
55     bam_seq = pysam.AlignmentFile(bam_file, 'rb', ignore_truncation=True, require_index=False)
56     LOG.info("Reading alignments...")
57     for line in bam_seq.fetch(until_eof=True):
58         line_fields = str(line).split('\t')
59         read_start = int(line_fields[3])
60         read_end = read_start+len(line_fields[9])

- src/gene_count.py 0 → 100644
66             g = []
67             n = []
68             # run through the gene intervals, find the overlaps with the reads
69             # and memorize the found genes
70             flag = 0
71             for i in genes:
72                 gene_range = (int(i[0]), int(i[1]))
73                 if read_range[1] < gene_range[0] or read_range[0] > gene_range[1]:
74                     continue
75                 flag = 1
76                 g.append(i[2])
77                 # assign to the gene the value of the overlap between gene and read
78                 overlap = range(max(read_range[0], gene_range[0]), min(read_range[1], gene_range[1]))
79                 n.append(len(overlap)/len(line_fields[9]))
80             # normalize per n genes
81             n = [x/len(g) for x in n]

- src/gene_count.py 0 → 100644
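The overlap arithmetic in the loop above amounts to clipping the read interval against the gene interval and dividing by the read length. A standalone sketch under the same half-open interval convention (function name hypothetical):

```python
def overlap_fraction(read, gene):
    """Fraction of the read overlapping the gene interval.

    read, gene: (start, end) tuples, half-open, end > start.
    """
    start = max(read[0], gene[0])
    end = min(read[1], gene[1])
    # max(0, ...) mirrors len(range(start, end)), which is 0
    # when the intervals do not overlap
    return max(0, end - start) / (read[1] - read[0])
```

For example, a read spanning 100-150 against a gene spanning 120-200 overlaps by 30 of its 50 bases, giving 0.6.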
 92                 history[line_fields[0]] = [dict(zip(g, n)), 1]
 93             # else, update history and correct the count dictionary
 94             else:
 95                 genes_matched = history[line_fields[0]]
 96                 old_occurrence = genes_matched[1]
 97                 # update occurrence
 98                 new_occurence = old_occurrence + 1
 99                 genes_matched[1] = new_occurence
100                 # update mapping history of this read
101                 for g_iter in range(len(g)):
102                     if g[g_iter] not in genes_matched[0]:
103                         genes_matched[0][g[g_iter]] = n[g_iter]
104                     else:
105                         genes_matched[0][g[g_iter]] += n[g_iter]
106                 # correct count dictionary by mapping history
107                 for (gene_m, score) in genes_matched[0].items():

Looks like double work here. What you want to do is to add each read to each gene to which it maps proportionally, taking into account all places to which the read can be assigned. So you could first determine the likelihood of the read being assigned to a particular gene (e.g. overlap(read, gene)/len(read)) and then normalize these values so that they add up to 1 read.
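The suggested scheme can be sketched as follows. The function name is hypothetical; `overlaps` stands for the per-gene overlap(read, gene)/len(read) fractions collected across all alignments of a single read, and the result distributes exactly one read across those genes:

```python
def assign_read(overlaps):
    """Distribute one read across genes proportionally.

    overlaps: dict mapping gene id -> overlap(read, gene)/len(read),
        gathered over every place the read aligns.
    Returns a dict of per-gene contributions summing to 1 read.
    """
    total = sum(overlaps.values())
    if total == 0:
        return {}
    return {gene: frac / total for gene, frac in overlaps.items()}
```

With these fractional contributions in hand, the count dictionary can simply be incremented once per read, with no history dictionary or later correction pass needed.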