scala-spark: How to filter an RDD after groupBy
I started with an RDD of pipe-separated strings. I have processed the data and got it into the following format:
((0001f46468,239394055),(7665710590658745,-414963169),0,1420276980302)
((0001f46468,239394055),(8016905020647641,183812619),1,1420347885727)
((0001f46468,239394055),(6633110906332136,294201185),1,1420398323110)
((0001f46468,239394055),(6633110906332136,294201185),0,1420451687525)
((0001f46468,239394055),(7722056727387069,1396896294),1,1420537469065)
((0001f46468,239394055),(7722056727387069,1396896294),1,1420623297340)
((0001f46468,239394055),(8045651092287275,-4814845),1,1420720722185)
((0001f46468,239394055),(5170029699836178,-1332814297),0,1420750531018)
((0001f46468,239394055),(7722056727387069,1396896294),0,1420807545137)
((0001f46468,239394055),(4784119468604853,1287554938),1,1421050087824)
To give a high-level description of the data: you can think of the first element of each record (the first tuple) as a user identifier, the second tuple as a product identifier, and the third element as the user's preference for that product. (For future reference I will call the above data set val userData.)
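To make the shape concrete, userData can be thought of as roughly an RDD of the following type; the exact element types here are an assumption based only on the printed records above, not something shown in the question:

import org.apache.spark.rdd.RDD

// assumed shape: ((user id pair), (product id pair), preference, timestamp)
val userData: RDD[((String, String), (String, String), Int, Long)] = ??? // built from the pipe-separated input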
My goal is this: if a user has cast both a positive (1) and a negative (0) preference for the same product, keep only the record with the positive preference. For example, out of:
((0001f46468,239394055),(6633110906332136,294201185),1,1420398323110)
((0001f46468,239394055),(6633110906332136,294201185),0,1420451687525)
I want to keep only
((0001f46468,239394055),(6633110906332136,294201185),1,1420398323110)
So I grouped the records by the user-product tuple, e.g. ((0001f46468,239394055),(6633110906332136,294201185)):

val groupedFiltered = userData.groupBy(x => (x._1, x._2)).map(u => { for(k <- u._2) { if(k._3 > 0) u } })
But this returns empty tuples.
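The reason is that a for loop without yield evaluates to Unit, so the map produces nothing useful. A minimal sketch of the same idea using yield (my rewrite, not the code from the question) would at least return the matching records, although it still only filters on the preference value:

val groupedFiltered = userData
  .groupBy(x => (x._1, x._2))
  .flatMap { case (_, records) =>
    // yield back every record in the group whose preference is positive
    for (k <- records if k._3 > 0) yield k
  }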
So instead I took the following approach:
val groupedFiltered = userData.groupBy(x => (x._1, x._2)).flatMap(u => u._2).filter(m => m._3 > 0)

((47734739656882457,-1782798434),(7585453414177905,-461779195),1,1422013413082)
((47734739656882457,-1782798434),(7585453414177905,-461779195),1,1422533237758)
((55218449094787901,-1374432022),(6227831620534109,1195766703),1,1420410603596)
((71212122719822610,-807015489),(6769904840922490,1642054117),1,1422549467554)
((75414197560031509,1830213715),(6724015489416254,-1389654186),1,1420196951100)
((60422797294995441,734266951),(6335216393920738,1528026712),1,1421161253600)
((35091051395844216,451349158),(8135854751464083,-1751839326),1,1422083101033)
((16647193023519619,990937787),(5384884550662007,-910998857),1,1420659873572)
((43355867025936022,-945669937),(7336240855866885,518993644),1,1420880078266)
((12188366927481231,-2007889717),(5336507724485344,363519858),1,1420827788022)
This looks promising, but it simply drops every record with a 0. What I want is: if a user has both a 1 and a 0 for the same item, keep only the 1.
You can keep the maximum user preference among the grouped results:
userData
  // group by user and product
  .groupBy(x => (x._1, x._2))
  // keep the maximum user preference per user/product
  .mapValues(_.maxBy(_._3))
  // keep only the values
  .values
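Here is a self-contained sketch of that solution applied to a small subset of the sample records above; the SparkContext setup and the element types are assumptions for illustration, so adapt them to however userData is actually built:

import org.apache.spark.{SparkConf, SparkContext}

object KeepPositivePreference {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("keep-positive-pref").setMaster("local[*]"))

    // a few of the sample records, shaped as ((user), (product), preference, timestamp)
    val userData = sc.parallelize(Seq(
      (("0001f46468", "239394055"), ("6633110906332136", "294201185"), 1, 1420398323110L),
      (("0001f46468", "239394055"), ("6633110906332136", "294201185"), 0, 1420451687525L),
      (("0001f46468", "239394055"), ("5170029699836178", "-1332814297"), 0, 1420750531018L)
    ))

    val result = userData
      .groupBy(x => (x._1, x._2))   // group by (user, product)
      .mapValues(_.maxBy(_._3))     // within each group keep the record with the highest preference
      .values                       // drop the grouping key, keep the records

    result.collect().foreach(println)
    // keeps the preference-1 record for product 6633110906332136
    // and the lone preference-0 record for product 5170029699836178

    sc.stop()
  }
}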