sql - Regexp in PySpark
I am trying to reproduce the results of a Django ORM query in PySpark:
social_filter = '(facebook|flipboard|linkedin|pinterest|reddit|twitter)'
collection.objects.filter(social__iregex=social_filter)
My main problem is that the match should be case-insensitive.
I have tried this:
social_filter = "social ilike 'facebook' or social ilike 'flipboard' or social ilike 'linkedin' or social ilike 'pinterest' or social ilike 'reddit' or social ilike 'twitter'"
df = sessions.filter(social_filter)
which results in the following error:
Py4JJavaError: An error occurred while calling o31.filter.
: java.lang.RuntimeException: [1.22] failure: end of input expected
social ilike 'facebook' or social ilike 'flipboard' or social ilike 'linkedin' or social ilike 'pinterest' or social ilike 'reddit' or social ilike 'twitter'
and the following expression:
social_filter = "social ~* (facebook|flipboard|linkedin|pinterest|reddit|twitter)"
df = sessions.filter(social_filter)
crashes with this:
Py4JJavaError: An error occurred while calling o31.filter.
: java.lang.RuntimeException: [1.17] failure: identifier expected
social ~* (facebook|flipboard|linkedin|pinterest|reddit|twitter)
                ^
    at scala.sys.package$.error(package.scala:27)
    at org.apache.spark.sql.catalyst.SqlParser.parseExpression(SqlParser.scala:45)
    at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:652)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
Please help!
How about the following:
>>> from pyspark.sql import Row
>>> rdd = sc.parallelize([Row(name='bob', social='twitter'), Row(name='steve', social='facebook')])
>>> df = sqlContext.createDataFrame(rdd)
>>> df.where("lower(social) like 'twitter'").collect()
[Row(name=u'bob', social=u'twitter')]
You can add one such condition for each of the social networks you want if you do not need an actual regular expression. Alternatively, if the match is exact, you can do this:
>>> df.where("lower(social) in ('twitter', 'facebook')").collect()
[Row(name=u'bob', social=u'twitter'), Row(name=u'steve', social=u'facebook')]
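Writing the filter string by hand gets tedious with six networks, so it may help to generate it. The sketch below is only an illustration: `build_social_filter` is a hypothetical helper, not part of PySpark, and it assumes the network names are plain words without quote characters. It produces either the exact-match `in (...)` form or a chain of `like '%...%'` conditions joined with `or`.

```python
def build_social_filter(column, networks, exact=True):
    """Build a case-insensitive SQL filter string for DataFrame.where().

    Hypothetical helper: lowercases the names and compares them against
    lower(column), as in the examples above.
    """
    names = []
    for name in networks:
        # Reject quotes so the generated SQL string stays well-formed.
        if "'" in name:
            raise ValueError("quote characters are not supported: %r" % name)
        names.append(name.lower())

    if exact:
        # Exact match: lower(col) in ('a', 'b', ...)
        quoted = ", ".join("'%s'" % n for n in names)
        return "lower(%s) in (%s)" % (column, quoted)

    # Substring match: lower(col) like '%a%' or lower(col) like '%b%' ...
    return " or ".join(
        "lower(%s) like '%%%s%%'" % (column, n) for n in names
    )


networks = ["facebook", "flipboard", "linkedin", "pinterest", "reddit", "twitter"]
print(build_social_filter("social", networks))
print(build_social_filter("social", networks, exact=False))
```

The resulting string can be passed to `df.where(...)` in the same way as the hand-written ones. If a real regular expression is needed, `Column.rlike` with Java's inline `(?i)` flag may be worth trying, though that has not been checked against the Spark version in the question.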