python - Merging DataFrames on multiple conditions - not specifically on equal values -

- February 15, 2014

firstly, sorry if bit lengthy, wanted describe have having problems , have tried already.

i trying join (merge) 2 dataframe objects on multiple conditions. know how if conditions met 'equals' operators, however, need make use of less , more than.

the dataframes represent genetic information: 1 list of mutations in genome (referred snps) , other provides information on locations of genes on human genome. performing df.head() on these returns following:

snp dataframe (snp_df):

   chromosome        snp      bp 0           1  rs3094315  752566 1           1  rs3131972  752721 2           1  rs2073814  753474 3           1  rs3115859  754503 4           1  rs3131956  758144

this shows snp reference id , locations. 'bp' stands 'base-pair' position.

gene dataframe (gene_df):

   chromosome  chr_start  chr_stop        feature_id 0           1      10954     11507  geneid:100506145 1           1      12190     13639  geneid:100652771 2           1      14362     29370     geneid:653635 3           1      30366     30503  geneid:100302278 4           1      34611     36081     geneid:645520

this dataframe shows locations of genes of interest.

what want find out of snps fall within gene regions in genome, , discard outside of these regions.

if wanted merge 2 dataframes based on multiple (equals) conditions, following:

merged_df = pd.merge(snp_df, gene_df, on=['chromosome', 'other_columns'])

however, in instance - need find snps chromosome values match in gene dataframe, , bp value falls between 'chr_start' , 'chr_stop'. makes challenging these dataframes quite large. in current dataset snp_df has 6795021 rows, , gene_df has 34362.

i have tried tackle either looking @ chromosomes or genes seperately. there 22 different chromosome values (ints 1-22) sex chromosomes not used. both methods taking extremely long time. 1 uses pandasql module, while other approach loop through separate genes.

sql method

import pandas pd import pandasql psql  pysqldf = lambda q: psql.sqldf(q, globals())  q           = """ select s.snp, g.feature_id this_snp s inner join this_genes g s.bp >= g.chr_start , s.bp <= g.chr_stop; """  all_dfs = []  chromosome in snp_df['chromosome'].unique():     this_snp    = snp_df.loc[snp_df['chromosome'] == chromosome]     this_genes  = gene_df.loc[gene_df['chromosome'] == chromosome]     genic_snps  = pysqldf(q)     all_dfs.append(genic_snps)  all_genic_snps  = pd.concat(all_dfs)

gene iteration method

all_dfs = [] line in gene_df.iterrows():     info    = line[1] # getting series object     this_snp = snp_df.loc[(snp_df['chromosome'] == info['chromosome']) &             (snp_df['bp'] >= info['chr_start']) & (snp_df['bp'] <= info['chr_stop'])]     if this_snp.shape[0] != 0:         this_snp = this_snp[['snp']]         this_snp.insert(len(this_snp.columns), 'feature_id', info['feature_id'])         all_dfs.append(this_snp)   all_genic_snps = pd.concat(all_dfs)

can give suggestions of more effective way of doing this?

you can use following accomplish you're looking for:

merged_df=snp_df.merge(gene_df,on=['chromosome'],how='inner') merged_df=merged_df[(merged_df.bp>=merged_df.chr_start) & (merged_df.bp<=merged_df.chr_stop)][['snp','feature_id']]

note: example dataframes not meet join criteria. here example using modified dataframes:

snp_df out[193]:     chromosome        snp      bp 0           1  rs3094315  752566 1           1  rs3131972   30400 2           1  rs2073814  753474 3           1  rs3115859  754503 4           1  rs3131956  758144  gene_df out[194]:     chromosome  chr_start  chr_stop        feature_id 0           1      10954     11507  geneid:100506145 1           1      12190     13639  geneid:100652771 2           1      14362     29370     geneid:653635 3           1      30366     30503  geneid:100302278 4           1      34611     36081     geneid:645520  merged_df out[195]:           snp        feature_id 8  rs3131972  geneid:100302278

Search This Blog

Bay WIKI