R - Setting weightages for jarowinkler in compare.linkage
I'm using the compare.linkage method in the RecordLinkage package in R to compare the similarity of 2 sets of strings. The default string comparison method is jarowinkler, with its 3 default weightages set at 1/3, 1/3 and 1/3.
I want to overwrite the default weightages with 4/9, 4/9 and 1/9. How do I do that? Thanks in advance.
The default script is:
library(RecordLinkage)
rpairs <- compare.linkage(stringset1, stringset2, strcmp = TRUE, strcmpfun = jarowinkler)
You have to create your own comparison function that compares 2 strings. In that function you can call jarowinkler. The easiest way is to create a closure:
jw <- function(w_1, w_2, w_3) {
  # Return a two-argument comparison function with the weights fixed
  function(str1, str2) {
    jarowinkler(str1, str2, w_1, w_2, w_3)
  }
}
With this function you pass in the weight parameters you want to use. It returns a comparison function that you can use in the compare.linkage call:
rpairs <- compare.linkage(stringset1, stringset2, strcmp = TRUE, strcmpfun = jw(4/9, 4/9, 1/9))
The Jaro-Winkler algorithm first counts the number of characters that match (within a bandwidth), m. For the 2 strings john and johan there are 4 characters that match (j, o, h and n). Then, taking only the matched characters from each string, it counts the number of transpositions t. For example, if the matched characters are
john jonh
there is 1 transposition (the h and n are switched).
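The counting step can be sketched in base R as follows. This is only an illustration of the textbook Jaro definition, not the code RecordLinkage runs: the helper name jaro_counts and the match window (half the longer string's length, minus 1) are assumptions. Under that definition the matched characters of john and johan come out in the same order, so the pair john and jonh is included to show a non-zero transposition count.

```r
# Sketch of the counting step of the textbook Jaro definition (base R only).
# jaro_counts is a hypothetical helper, not part of RecordLinkage.
jaro_counts <- function(s1, s2) {
  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]
  # Assumed match window: characters may be at most this far apart
  window <- max(floor(max(length(a), length(b)) / 2) - 1, 0)
  a_hit <- rep(FALSE, length(a))
  b_hit <- rep(FALSE, length(b))
  for (i in seq_along(a)) {
    lo <- max(1, i - window)
    hi <- min(length(b), i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      # Take the first unused, equal character inside the window
      if (!b_hit[j] && a[i] == b[j]) {
        a_hit[i] <- TRUE
        b_hit[j] <- TRUE
        break
      }
    }
  }
  # Compare the matched characters in each string's own order;
  # half the number of differing positions is the transposition count.
  list(m = sum(a_hit), t = sum(a[a_hit] != b[b_hit]) / 2)
}

jaro_counts("john", "johan")  # m = 4, t = 0
jaro_counts("john", "jonh")   # m = 4, t = 1
```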
The Jaro similarity is then given by:
w1 * m/l1 + w2 * m/l2 + w3 * (m-t)/m
with l1 and l2 the lengths of the 2 strings. With the weights all equal to 1/3 this results in a score between 0 and 1 (1 = perfect match).
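To see what changing the weights does to the score, the formula can be evaluated directly from the counts. The sketch below assumes the three weights themselves sum to 1 (so the defaults of 1/3 each give a score between 0 and 1). The helper weighted_jaro and the counts used are hypothetical and only illustrate the arithmetic, not the package's internals.

```r
# Weighted Jaro similarity from pre-computed counts.
# weighted_jaro is an illustrative helper, not part of RecordLinkage.
weighted_jaro <- function(m, t, l1, l2, w = c(1/3, 1/3, 1/3)) {
  if (m == 0) return(0)  # no matching characters: similarity is zero
  w[1] * m / l1 + w[2] * m / l2 + w[3] * (m - t) / m
}

# Hypothetical counts: two 4-character strings, 4 matches, 1 transposition
weighted_jaro(4, 1, 4, 4)                        # 11/12, about 0.917
weighted_jaro(4, 1, 4, 4, w = c(4/9, 4/9, 1/9))  # 35/36, about 0.972
```

With 4/9, 4/9 and 1/9 the transposition term counts for less, so a pair that differs only by a swapped character scores higher than under the default weights.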
The Jaro-Winkler measure adds a 'bonus' for characters that match at the beginning of the string, as there are usually fewer errors at the beginning (the measure was created for names). For more information see, for example, M.P.J. van der Loo (2014), The stringdist package for approximate string matching.
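The prefix bonus is often written as sim + prefix * p * (1 - sim), where prefix is the length of the common prefix. A minimal sketch of that form is given below. The cap of 4 prefix characters, the scaling factor p = 0.1 and the helper name winkler_bonus are assumptions taken from the commonly cited definition, and RecordLinkage's jarowinkler may differ in its details.

```r
# Winkler prefix bonus in its commonly cited form.
# Assumptions: prefix capped at 4 characters, scaling factor p = 0.1.
# winkler_bonus is a hypothetical helper, not part of RecordLinkage.
winkler_bonus <- function(sim, s1, s2, p = 0.1, max_prefix = 4) {
  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]
  n <- min(length(a), length(b), max_prefix)
  same <- a[seq_len(n)] == b[seq_len(n)]
  # Length of the common prefix: positions before the first mismatch
  prefix <- if (all(same)) n else which(!same)[1] - 1
  sim + prefix * p * (1 - sim)
}

# john and johan share the prefix "joh" (3 characters)
winkler_bonus(0.9, "john", "johan")  # 0.9 + 3 * 0.1 * 0.1 = 0.93
```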