Python - Merge fields in a file
I have a file with 7 columns, a GFF file containing chromosomal regions. I want to collapse the rows whose region is "exon" into one row in the file. Rows should be collapsed when their regions overlap each other.
    region start end score strand frame attribute
    exon 26453 26644 . + . transcript "xm_092971"; name "xm_092971"
    exon 26842 27020 . + . transcript "xm_092971"; name "xm_092971"
    exon 30355 30899 . - . transcript "xm_104663"; name "xm_104663"
    gs_tran 30355 34083 . - . gs_tran "hs22_30444_28_1_1"; name "hs22_30444_28_1_1"
    snp 30847 30847 . + . snp "rs2971719"; name "rs2971719"
    exon 31012 31409 . - . transcript "xm_104663"; name "xm_104663"
    exon 34013 34083 . - . transcript "xm_104663"; name "xm_104663"
    exon 40932 41071 . + . transcript "xm_092971"; name "xm_092971"
    snp 44269 44269 . + . snp "rs2873227"; name "rs2873227"
    snp 45723 45723 . + . snp "rs2227095"; name "rs2227095"
    exon 134031 134495 . - . transcript "xm_086913"; name "xm_086913"
    exon 134034 134457 . - . transcript "xm_086914"; name "xm_086914"
Looking at the sample data above, only the last 2 rows can be merged into 1 row. So the new row becomes:

    exon 134031 134495 . - . transcript "xm_086913"; name "xm_086913"
If the end of the second row had been greater than the end of the first, that value would have been the end of the merged region. Basically, if there is an overlap, take the start of the region that starts earlier and the end of the one that ends later.
There can be multiple instances of such rows; here only the last 2 rows qualify. One more thing: the attribute column shows different transcript names for such rows, whereas the names are the same in other cases.
I have to do this in Python, and I am a beginner in Python.
Break it down into simpler steps:
- read the file and parse it into a list of data
- loop over the list and check each row against the next
- append the ones that fulfil the requirements to a new list
- save the new list to a new file or print it to the console
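The first and last steps could look roughly like the sketch below. The function names `parse_rows` and `format_row` and the dict keys are made up for illustration, and it assumes the file has the 7 whitespace-separated columns shown in the question, with the attribute column (which contains spaces) last.

```python
def parse_rows(lines):
    """Turn lines of a 7-column file into a list of dicts (hypothetical helper)."""
    rows = []
    for line in lines:
        # Split on any whitespace, but keep the attribute column (which
        # contains spaces) in one piece as the 7th field.
        fields = line.strip().split(None, 6)
        # Skip blank lines and the header row, whose "start" is not a number.
        if len(fields) < 7 or not fields[1].isdigit():
            continue
        region, start, end, score, strand, frame, attribute = fields
        rows.append({
            "region": region,
            "start": int(start),  # numbers, so they can be compared later
            "end": int(end),
            "score": score,
            "strand": strand,
            "frame": frame,
            "attribute": attribute,
        })
    return rows


def format_row(row):
    """Render one parsed row back into a tab-separated line (hypothetical helper)."""
    keys = ("region", "start", "end", "score", "strand", "frame", "attribute")
    return "\t".join(str(row[key]) for key in keys)
```

Passing an open file handle to `parse_rows` gives the list to loop over, and printing `format_row(row)` for each row of the new list (or writing it to a second file) covers the last step.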
You might want to move through the list manually instead of using for row in mylist,
like this:
    newlist = []
    i = 0
    while i < len(mylist):
        # only look ahead when a next row exists
        if i + 1 < len(mylist) and can_collapse(mylist[i], mylist[i+1]):
            newlist.append(collapse(mylist[i], mylist[i+1]))
            i += 2
        else:
            newlist.append(mylist[i])
            i += 1
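The loop above relies on two helpers, can_collapse and collapse, that you have to write yourself. Here is one possible sketch of them. It assumes each row is a dict with "region", "start" and "end" keys (start and end as integers), that the rows are sorted by start as in the sample, and that the merged row keeps the attribute of the first row, as in the expected output in the question. `collapse_all` is a hypothetical variant of the loop that compares each row with the last row already kept, so a run of three or more overlapping exons also ends up as one row.

```python
def can_collapse(a, b):
    """True if both rows are exons and their start..end ranges overlap."""
    return (
        a["region"] == "exon"
        and b["region"] == "exon"
        and a["start"] <= b["end"]
        and b["start"] <= a["end"]
    )


def collapse(a, b):
    """Merge two overlapping rows: earliest start, latest end.

    All other columns, including the attribute, are copied from the first row.
    """
    merged = dict(a)
    merged["start"] = min(a["start"], b["start"])
    merged["end"] = max(a["end"], b["end"])
    return merged


def collapse_all(rows):
    """Merge each row into the previously kept row when the two can collapse."""
    result = []
    for row in rows:
        if result and can_collapse(result[-1], row):
            result[-1] = collapse(result[-1], row)
        else:
            result.append(row)
    return result


# The last three rows of the sample data, reduced to the keys used here.
rows = [
    {"region": "snp", "start": 45723, "end": 45723},
    {"region": "exon", "start": 134031, "end": 134495},
    {"region": "exon", "start": 134034, "end": 134457},
]
for row in collapse_all(rows):
    print(row["region"], row["start"], row["end"])
# snp 45723 45723
# exon 134031 134495
```

Like the while loop, this only merges exons that sit next to each other in the list. If overlapping exons can be separated by snp or other rows, or must also share a strand, the check would need to be extended.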