Python fuzzywuzzy error: expected string or buffer

I'm using fuzzywuzzy to find near matches in a CSV of company names. I'm comparing manually matched strings to unmatched strings in the hope of finding some useful proximity matches; however, I'm getting a string or buffer error within fuzzywuzzy. My code is:

from fuzzywuzzy import process
from pandas import read_csv

if __name__ == '__main__':
    df = read_csv("usm_clean.csv", encoding = "iso-8859-1")
    df_false = df[df['match_manual'].isnull()]
    df_true = df[df['match_manual'].notnull()]
    sss_false = df_false['sss'].values.tolist()
    sss_true = df_true['sss'].values.tolist()

    for sssf in sss_false:
        mmm = process.extractOne(sssf, sss_true) # find best choice
        print sssf + str(tuple(mmm))

This creates the following error:

Traceback (most recent call last):
  File "fuzzywuzzy_usm2_csv_test.py", line 21, in <module>
    mmm = process.extractOne(sssf, sss_true) # find best choice
  File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/process.py", line 123, in extractOne
    best_list = extract(query, choices, processor, scorer, limit=1)
  File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/process.py", line 84, in extract
    processed = processor(choice)
  File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/utils.py", line 63, in full_process
    string_out = StringProcessor.replace_non_letters_non_numbers_with_whitespace(s)
  File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/string_processing.py", line 25, in replace_non_letters_non_numbers_with_whitespace
    return cls.regex.sub(u" ", a_string)
TypeError: expected string or buffer

This seems to be a side effect of importing with pandas with the encoding specified, which I added to prevent UnicodeDecodeErrors but which has had the knock-on effect of causing this error. I've tried to force the object to a string using str(sssf), but that doesn't work.

So, I've isolated the line that is causing the error: #N/A,,,,,, (line 29 in the data pasted below). I assumed the # was causing the error but, strangely, it is not; it is the A character that causes the problem, because the file works when it is removed. What is strange to me is that the string two rows below, N/A, parses fine, whereas row 29 won't parse even when I delete the # symbol, although the field then appears identical to the field below.

sss,sid,match_manual,notes,match_date,source,match_by
n20 kids,1095543_cha,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
n21 festival,08190588_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
n21 ltd,,,,,,
n21 ltd.,04615294_com,true,,2014-12-02,,opencorps
n2 check,08105000_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
n2 check limited,06139690_com,true,,2014-12-02,,opencorps
n2check limited,08184223_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 3)
n 2 check ltd,05729595_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
n2 check ltd,06139690_com,true,,2014-12-02,,opencorps
n2check ltd,05729595_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
n2e & ltd,05218805_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
n2 group llc,04627044_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
n2 group ltd,04475764_com,true,,2014-05-05,data taken u_supplier_match,20140429_fuzzy_match.ktr (stream 2)
n2r productions,sc266951_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
n2 visual communications limited,,,,,,
n2 visual communications ltd,03144224_com,true,,2014-12-02,data taken u_supplier_match,opencorps
n2web,07636689_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
n3 display graphics ltd,04008480_com,true,,2014-12-02,data taken u_supplier_match,opencorps
n3o limited,06561158_com,true,,2014-12-02,,opencorps
n3o ltd,,,,,,
n400138,,,,,,
n400360,,,,,,
n4k ltd,07054740_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
n51 ltd,,,,,,
n68 ltd,,,,,,
n8 ltd,,,,,,
n9 design,07342091_com,true,,2015-02-07,openrefine/opencorporates,im
#N/A,,,,,,
n a,,,,,,
N/A,red_general_xtr,true,matches done manually,2015-04-16,manual matching,im
(n) & builders ltd,,,,,,

By default, pandas.read_csv parses the string 'N/A' (and likewise '#N/A') as Not a Number (NaN).

In your case, this means that you end up with a NaN value rather than a string. In your sample data set, this happens in two places:

The third line from the bottom (the line you highlight in your question) results in sss_false[-3] == nan.

The N/A line (second from the bottom) results in sss_true[-1] == nan.
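To see this in isolation, here is a minimal sketch. It assumes Python 3 and a recent pandas, whereas the question uses Python 2.7. The three-row CSV is a made-up cut-down of the sample data, and the regex is only a stand-in for the one inside fuzzywuzzy's string processing, not a copy of it.

```python
import io
import re

import pandas as pd

# Hypothetical three-row sample modelled on the data in the question.
csv_text = "sss,sid,match_manual\n#N/A,,\nn a,,\nN/A,red_general_xtr,true\n"

df = pd.read_csv(io.StringIO(csv_text))
names = df["sss"].values.tolist()

# '#N/A' and 'N/A' are on pandas' default NA list, so they arrive as float NaN;
# 'n a' is not on the list and stays a string.
kinds = [type(x).__name__ for x in names]
print(kinds)

# Stand-in regex (an assumption, not fuzzywuzzy's own pattern):
# running a substitution on a float raises a TypeError.
pattern = re.compile(r"\W")
try:
    pattern.sub(" ", names[0])
    error = None
except TypeError as exc:
    error = exc
print(repr(error))
```

Note that the traceback in the question fails inside processor(choice), so the NaN that triggers it is one of the choices (sss_true), not only the query. That would explain why wrapping the query in str(sssf) makes no difference.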

Option 1

If you want to parse the string 'N/A' as a string instead of as NaN, the way to do this is to replace

df = read_csv("usm_clean.csv", encoding = "iso-8859-1")

with

df = read_csv("usm_clean.csv", encoding = "iso-8859-1", keep_default_na=False, na_values='')

The meaning of these options is described in the pandas docs:

na_values : list-like or dict, default None

Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values

keep_default_na : bool, default True

If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they’re appended to

So, the above modification tells pandas to recognize an empty string as NA, and to discard the default value 'N/A'.
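As a rough check of what these two options do, the sketch below reads a made-up three-row sample (not the real usm_clean.csv) from memory. It assumes Python 3 and a recent pandas, and passes na_values as a one-element list rather than a bare string.

```python
import io

import pandas as pd

# Hypothetical three-row sample modelled on the data in the question.
csv_text = "sss,sid,match_manual\n#N/A,,\nn a,,\nN/A,red_general_xtr,true\n"

# Drop pandas' default NA strings and treat only the empty string as missing.
df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[""])

# The company-name column now keeps '#N/A' and 'N/A' as ordinary strings.
names = df["sss"].tolist()
print(names)

# Empty match_manual cells are still missing, so the isnull()/notnull()
# split used in the question keeps working.
missing = df["match_manual"].isnull().tolist()
print(missing)
```

With this change every entry in the sss column is a real string, so nothing non-string reaches fuzzywuzzy.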

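Option 2

If you want to discard the lines with 'N/A' in the first column, you need to remove the NaN members from sss_true and sss_false. One way to do this is to keep only the string values (basestring covers both str and unicode on Python 2; use str on Python 3):

sss_true = [x for x in sss_true if isinstance(x, basestring)]
sss_false = [x for x in sss_false if isinstance(x, basestring)]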
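The same filter can be sketched for Python 3 as follows. The helper name drop_nan and the two short lists are made up for illustration; the lists stand in for the ones built from the CSV.

```python
def drop_nan(values):
    """Keep only real strings, dropping the float NaN pandas uses for missing cells."""
    return [x for x in values if isinstance(x, str)]


# Hypothetical stand-ins for the lists built from the CSV.
sss_true = ["n21 ltd.", "n2 check limited", float("nan")]
sss_false = ["n20 kids", float("nan"), "n a"]

clean_true = drop_nan(sss_true)
clean_false = drop_nan(sss_false)

print(clean_true)
print(clean_false)
```

An alternative is to drop the rows before building the lists, for example with df.dropna(subset=['sss']), which removes every row whose sss value is NaN.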
