scala-spark: How to filter RDD after groupBy

- April 15, 2015

I started with an RDD of pipe-separated strings. I have processed the data and gotten it into the following format:

((0001f46468,239394055),(7665710590658745,-414963169),0,1420276980302)
((0001f46468,239394055),(8016905020647641,183812619),1,1420347885727)
((0001f46468,239394055),(6633110906332136,294201185),1,1420398323110)
((0001f46468,239394055),(6633110906332136,294201185),0,1420451687525)
((0001f46468,239394055),(7722056727387069,1396896294),1,1420537469065)
((0001f46468,239394055),(7722056727387069,1396896294),1,1420623297340)
((0001f46468,239394055),(8045651092287275,-4814845),1,1420720722185)
((0001f46468,239394055),(5170029699836178,-1332814297),0,1420750531018)
((0001f46468,239394055),(7722056727387069,1396896294),0,1420807545137)
((0001f46468,239394055),(4784119468604853,1287554938),1,1421050087824)

To give a high-level description of the data: the first element of the main tuple (the first inner tuple) is the user identification, the second tuple is the product identification, and the third element is the user's preference for the product. (For future reference, I will refer to the data set above as val userdata.)

My goal: if a user has cast both a positive (1) and a negative (0) preference for the same product, keep only the record with the positive preference. For example, given:

((0001f46468,239394055),(6633110906332136,294201185),1,1420398323110)
((0001f46468,239394055),(6633110906332136,294201185),0,1420451687525)

I want to keep only:

((0001f46468,239394055),(6633110906332136,294201185),1,1420398323110)

So I grouped the records by the user-product tuple, e.g. ((0001f46468,239394055),(6633110906332136,294201185)):

val groupedfiltered = userdata.groupBy(x => (x._1, x._2)).map(u => {
  for (k <- u._2) {
    if (k._3 > 0)
      u
  }
})

But this returns empty tuples.
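The empty results most likely come from the `for` loop: in Scala, a `for` without `yield` runs only for its side effects and evaluates to `Unit`, which prints as `()`. The following is a minimal sketch on a plain local `Seq` rather than an RDD; the object name and the `prefs` values are made up for illustration.

```scala
object ForWithoutYield {
  // Hypothetical preference values, standing in for the third tuple element
  val prefs = Seq(1, 0, 1)

  // No `yield`: the loop body's value is discarded and the whole expression is Unit
  val noYield: Unit = for (p <- prefs) { if (p > 0) p }

  // With a guard and `yield`, the comprehension builds a new collection
  val withYield: Seq[Int] = for (p <- prefs if p > 0) yield p

  def main(args: Array[String]): Unit = {
    println(noYield)    // ()
    println(withYield)  // List(1, 1)
  }
}
```

Inside `map`, the same thing happens once per group, so every group is mapped to `()`.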

So I took the following approach:

val groupedfiltered = userdata.groupBy(x => (x._1, x._2)).flatMap(u => u._2).filter(m => m._3 > 0)

This gives:

((47734739656882457,-1782798434),(7585453414177905,-461779195),1,1422013413082)
((47734739656882457,-1782798434),(7585453414177905,-461779195),1,1422533237758)
((55218449094787901,-1374432022),(6227831620534109,1195766703),1,1420410603596)
((71212122719822610,-807015489),(6769904840922490,1642054117),1,1422549467554)
((75414197560031509,1830213715),(6724015489416254,-1389654186),1,1420196951100)
((60422797294995441,734266951),(6335216393920738,1528026712),1,1421161253600)
((35091051395844216,451349158),(8135854751464083,-1751839326),1,1422083101033)
((16647193023519619,990937787),(5384884550662007,-910998857),1,1420659873572)
((43355867025936022,-945669937),(7336240855866885,518993644),1,1420880078266)
((12188366927481231,-2007889717),(5336507724485344,363519858),1,1420827788022)

This is promising, but it looks like it is dropping every record that has a 0. What I want is: if a user has both a 1 and a 0 for the same item, keep only the record with the 1.

You can keep the maximum user preference from the grouped results.

userdata
  // group by user and product
  .groupBy(x => (x._1, x._2))
  // keep the maximum user preference per user/product
  .mapValues(_.maxBy(_._3))
  // keep only the values
  .values
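To see what this does to the sample data, here is a minimal sketch of the same group-then-max logic on a plain Scala `Seq` instead of an RDD, so it runs without a Spark context. The `Record` alias, the object name and `keepMaxPreference` are hypothetical names for illustration, and the element types (String user id, Long product id and timestamp) are assumptions inferred from the printed tuples.

```scala
object KeepPositivePreference {
  // Assumed shape of one record: (user, product, preference, fourth field)
  type Record = ((String, Int), (Long, Int), Int, Long)

  def keepMaxPreference(records: Seq[Record]): Seq[Record] =
    records
      .groupBy(r => (r._1, r._2)) // group by user and product
      .values                     // drop the grouping keys
      .map(_.maxBy(_._3))         // one record per group: the highest preference
      .toSeq

  // Three records taken from the question: one product with both 1 and 0,
  // and one product with only a 0
  val sample: Seq[Record] = Seq(
    (("0001f46468", 239394055), (6633110906332136L, 294201185), 1, 1420398323110L),
    (("0001f46468", 239394055), (6633110906332136L, 294201185), 0, 1420451687525L),
    (("0001f46468", 239394055), (7665710590658745L, -414963169), 0, 1420276980302L)
  )

  def main(args: Array[String]): Unit =
    keepMaxPreference(sample).foreach(println)
}
```

Two things are worth noting. A product that only has a 0 record keeps that record, unlike with the `filter` approach. Also, `maxBy` returns a single record per user/product pair, so several 1 records for the same pair collapse into one. On an actual RDD, keying by (user, product) and using `reduceByKey` to keep the higher preference is a commonly suggested alternative to `groupBy`, since it avoids collecting each whole group in memory; that variant is not shown here.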
