Commit 98bc811b authored by Gina

Prebedtools script to sort exons per strand type

parent ccde0764
1 merge request: !50 Prebedtools script to sort exons per strand type
"""Build a BED file from the exon annotation in a GTF, so that sequences extracted with bedtools carry the transcript ID as header.

For each transcript, keep exon features only and sort the exons by start position
(ascending for the + strand, descending for the - strand).

Input: GTF file
Columns needed for BED: chr, start, end, transcript_id, score, strand, gene_id
...
Output: BED file (bed_file.bed), written from a pandas DataFrame
"""
import pandas as pd
from gtfparse import read_gtf
gtf = read_gtf('../scrna-seq-simulation-main/inputs/ref_annotation.gtf')
# Keep exon features and only the columns needed for the BED file
gtf_exons = gtf[gtf["feature"] == "exon"]
gtf_exons = gtf_exons[["seqname", "start", "end", "transcript_id", "score", "strand", "gene_id"]].copy()
# GTF coordinates are 1-based inclusive; BED starts are 0-based
gtf_exons["start"] = gtf_exons["start"] - 1
# - strand: exons in descending start order within each transcript
gtf_df_neg = gtf_exons[gtf_exons["strand"] == "-"]
gtf_df_neg = gtf_df_neg.sort_values(['transcript_id', 'start'], ascending=False)
# + strand: exons in ascending start order within each transcript
gtf_df_pos = gtf_exons[gtf_exons["strand"] == "+"]
gtf_df_pos = gtf_df_pos.sort_values(['transcript_id', 'start'], ascending=True)
# gtf_df_pos and gtf_df_neg must be dataframes; no header row, since bedtools expects plain BED records
pd.concat([gtf_df_pos, gtf_df_neg]).to_csv("bed_file.bed", sep="\t", index=False, header=False)
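
The per-strand ordering used in this script can be checked on a small in-memory table without a GTF file. The following is a minimal sketch, not part of the commit: the helper name `sort_exons_per_strand` and the toy transcripts `T1` and `T2` are hypothetical, and only the column names are taken from the script.

```python
import pandas as pd


def sort_exons_per_strand(exons: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: order exons by start within each transcript,
    ascending on the + strand and descending on the - strand."""
    pos = exons[exons["strand"] == "+"].sort_values(
        ["transcript_id", "start"], ascending=True
    )
    neg = exons[exons["strand"] == "-"].sort_values(
        ["transcript_id", "start"], ascending=False
    )
    # + strand rows first, then - strand rows, as in the script above
    return pd.concat([pos, neg], ignore_index=True)


# Toy exon table (made-up coordinates), deliberately out of order
toy = pd.DataFrame(
    {
        "seqname": ["chr1", "chr1", "chr1", "chr1"],
        "start": [500, 100, 300, 900],
        "end": [600, 200, 400, 1000],
        "transcript_id": ["T1", "T1", "T2", "T2"],
        "strand": ["+", "+", "-", "-"],
    }
)

ordered = sort_exons_per_strand(toy)
print(ordered)
```

In the result, the `T1` exons (+ strand) appear with increasing start and the `T2` exons (- strand) with decreasing start, so each transcript's exons are listed in transcription order.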