python - Merging DataFrames on multiple conditions - not specifically on equal values -
firstly, sorry if bit lengthy, wanted describe have having problems , have tried already.
i trying join (merge) 2 dataframe objects on multiple conditions. know how if conditions met 'equals' operators, however, need make use of less , more than.
the dataframes represent genetic information: 1 list of mutations in genome (referred snps) , other provides information on locations of genes on human genome. performing df.head() on these returns following:
snp dataframe (snp_df):
chromosome snp bp 0 1 rs3094315 752566 1 1 rs3131972 752721 2 1 rs2073814 753474 3 1 rs3115859 754503 4 1 rs3131956 758144
this shows snp reference id , locations. 'bp' stands 'base-pair' position.
gene dataframe (gene_df):
chromosome chr_start chr_stop feature_id 0 1 10954 11507 geneid:100506145 1 1 12190 13639 geneid:100652771 2 1 14362 29370 geneid:653635 3 1 30366 30503 geneid:100302278 4 1 34611 36081 geneid:645520
this dataframe shows locations of genes of interest.
what want find out of snps fall within gene regions in genome, , discard outside of these regions.
if wanted merge 2 dataframes based on multiple (equals) conditions, following:
merged_df = pd.merge(snp_df, gene_df, on=['chromosome', 'other_columns'])
however, in instance - need find snps chromosome values match in gene dataframe, , bp value falls between 'chr_start' , 'chr_stop'. makes challenging these dataframes quite large. in current dataset snp_df has 6795021 rows, , gene_df has 34362.
i have tried tackle either looking @ chromosomes or genes seperately. there 22 different chromosome values (ints 1-22) sex chromosomes not used. both methods taking extremely long time. 1 uses pandasql
module, while other approach loop through separate genes.
sql method
import pandas pd import pandasql psql pysqldf = lambda q: psql.sqldf(q, globals()) q = """ select s.snp, g.feature_id this_snp s inner join this_genes g s.bp >= g.chr_start , s.bp <= g.chr_stop; """ all_dfs = [] chromosome in snp_df['chromosome'].unique(): this_snp = snp_df.loc[snp_df['chromosome'] == chromosome] this_genes = gene_df.loc[gene_df['chromosome'] == chromosome] genic_snps = pysqldf(q) all_dfs.append(genic_snps) all_genic_snps = pd.concat(all_dfs)
gene iteration method
all_dfs = [] line in gene_df.iterrows(): info = line[1] # getting series object this_snp = snp_df.loc[(snp_df['chromosome'] == info['chromosome']) & (snp_df['bp'] >= info['chr_start']) & (snp_df['bp'] <= info['chr_stop'])] if this_snp.shape[0] != 0: this_snp = this_snp[['snp']] this_snp.insert(len(this_snp.columns), 'feature_id', info['feature_id']) all_dfs.append(this_snp) all_genic_snps = pd.concat(all_dfs)
can give suggestions of more effective way of doing this?
you can use following accomplish you're looking for:
merged_df=snp_df.merge(gene_df,on=['chromosome'],how='inner') merged_df=merged_df[(merged_df.bp>=merged_df.chr_start) & (merged_df.bp<=merged_df.chr_stop)][['snp','feature_id']]
note: example dataframes not meet join criteria. here example using modified dataframes:
snp_df out[193]: chromosome snp bp 0 1 rs3094315 752566 1 1 rs3131972 30400 2 1 rs2073814 753474 3 1 rs3115859 754503 4 1 rs3131956 758144 gene_df out[194]: chromosome chr_start chr_stop feature_id 0 1 10954 11507 geneid:100506145 1 1 12190 13639 geneid:100652771 2 1 14362 29370 geneid:653635 3 1 30366 30503 geneid:100302278 4 1 34611 36081 geneid:645520 merged_df out[195]: snp feature_id 8 rs3131972 geneid:100302278
Comments
Post a Comment