Need some inputs in feature extraction in Apache Spark


I am new to Apache Spark and am trying to use the MLlib utility to do some analysis. I collated some code to convert my data into features and then apply a linear regression algorithm to that. I am facing some issues. Please excuse me if this is a silly question.

My person data looks like this:

1,1000.00,36
2,2000.00,35
3,2345.50,37
4,3323.00,45

This is just a simple example to get the code working:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

case class Person(rating: String, income: Double, age: Int)

// Parse each "rating,income,age" line into a Person (this yields an RDD[Person])
val personData = sc.textFile("d:/spark/mydata/persondata.txt").map(_.split(",")).map(p => Person(p(0), p(1).toDouble, p(2).toInt))

// Map the rating to a fixed score and scale income and age by their maxima
def prepareFeatures(people: Seq[Person]): Seq[org.apache.spark.mllib.linalg.Vector] = {
  val maxIncome = people.map(_.income).max
  val maxAge = people.map(_.age).max

  people.map(p =>
    Vectors.dense(
      if (p.rating == "a") 0.7 else if (p.rating == "b") 0.5 else 0.3,
      p.income / maxIncome,
      p.age.toDouble / maxAge))
}

// Pair each feature vector with an evenly spaced label in [0, 1)
def prepareFeaturesWithLabels(features: Seq[org.apache.spark.mllib.linalg.Vector]): Seq[LabeledPoint] =
  (0d to 1 by (1d / features.length)) zip(features) map(l => LabeledPoint(l._1, l._2))

// It works up to here.
// It breaks in the code below:

val data = sc.parallelize(prepareFeaturesWithLabels(prepareFeatures(people)))

scala> val data = sc.parallelize(prepareFeaturesWithLabels(prepareFeatures(people)))
<console>:36: error: not found: value people
Error occurred in an application involving default arguments.
       val data = sc.parallelize(prepareFeaturesWithLabels(prepareFeatures(people)))
                                                                           ^

Please advise.

You seem to be going in the right direction, but there are a few minor problems. First off, you are trying to reference a value (people) that you haven't defined. More generally, you seem to be writing your code to work with sequences, and instead you should modify your code to work with RDDs (or DataFrames). You also seem to be using parallelize to try and parallelize your operation, but parallelize is a helper method that takes a local collection and makes it available as a distributed RDD. I'd recommend looking at the programming guides or some additional documentation to get a better understanding of the Spark APIs. Best of luck with your adventures with Spark.
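As a rough illustration of what "work with RDDs" could look like here, the sketch below rewrites the two helpers from the question so that they take an RDD[Person] rather than a Seq[Person]. It is only one possible arrangement, not the definitive fix. The object name PersonFeatures, the parsePeople helper and the local[*] master are assumptions made for this example. The labels copy the question's evenly spaced scheme (index divided by count) and are not real regression targets.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

case class Person(rating: String, income: Double, age: Int)

// Hypothetical wrapper object, introduced only for this sketch.
object PersonFeatures {

  // Parse "rating,income,age" lines into Person records.
  def parsePeople(lines: RDD[String]): RDD[Person] =
    lines.map(_.split(",")).map(p => Person(p(0), p(1).toDouble, p(2).toInt))

  // Same scaling as in the question, but computed on the RDD itself.
  def prepareFeaturesWithLabels(people: RDD[Person]): RDD[LabeledPoint] = {
    // These are actions: each one runs a job and returns a plain local value.
    val maxIncome = people.map(_.income).max()
    val maxAge = people.map(_.age).max()
    val count = people.count()

    // zipWithIndex replaces the local "(0d to 1 by step) zip features" trick:
    // record i gets the label i / count, mirroring the question's labels.
    people.zipWithIndex().map { case (p, idx) =>
      val features = Vectors.dense(
        if (p.rating == "a") 0.7 else if (p.rating == "b") 0.5 else 0.3,
        p.income / maxIncome,
        p.age.toDouble / maxAge)
      LabeledPoint(idx.toDouble / count, features)
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master for experimentation; in spark-shell, reuse `sc`.
    val conf = new SparkConf().setAppName("PersonFeatures").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // parallelize is used here only to turn a small local sample into an RDD.
      val lines = sc.parallelize(Seq(
        "1,1000.00,36", "2,2000.00,35", "3,2345.50,37", "4,3323.00,45"))
      val data = prepareFeaturesWithLabels(parsePeople(lines))
      data.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

With real data, the sample lines would presumably be replaced by sc.textFile on the input file, and the resulting RDD[LabeledPoint] can be passed to an MLlib regression trainer without calling parallelize. Because the input RDD is traversed several times (three actions plus zipWithIndex), caching it first may be worthwhile for larger files.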

