java - Apache POI Anomalous Whitespace (Resolved: \u00A0 non-breaking space)
Edit: resolved. The character is the non-breaking space U+00A0 (\u00A0 in Java), not \uC2A0; C2 A0 is only its UTF-8 byte encoding.
After using Apache POI to convert a docx to plaintext, then reading the plaintext into Java and trying to parse it, I've run into the following problem.
Output:
	" "
	first characterequals space or tab 
	false
[B@5e481248
[B@66d3c617
arraytostring space: [32]
arraytostring ?????: [-62, -96]
For this code:
System.out.println("\t\"" + line.substring(0,1) + "\"\n\tfirst characterequals space or tab \n\t"
        + (line.substring(0,1).equals(" ") || line.substring(0,1).equals("\t")));
// Printing a byte[] directly shows only its type and hash code, e.g. [B@5e481248
System.out.println(line.substring(0,1).getBytes());
System.out.println(" ".getBytes());
// Arrays.toString (java.util.Arrays) shows the actual byte values
System.out.println("arraytostring space: " + Arrays.toString(" ".getBytes()));
System.out.println("arraytostring ?????: " + Arrays.toString(line.substring(0,1).getBytes()));
String.trim() does not get rid of it.
String.replaceAll("\\s", "") does not get rid of it either.
I'm trying to parse an enormous materials document, and this is turning into a major hurdle. I have no idea what this character is or how to deal with it. Can anyone shed light on what's going on here?
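The signed bytes [-62, -96] translate to the hex codes C2 A0, which, according to this answer, is the UTF-8 encoded non-breaking space (U+00A0). Note that it is not an ordinary space: String.trim() only strips characters up to U+0020, and \s does not match it by default.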
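A minimal sketch of how one might confirm the character and get rid of it, assuming the input really contains U+00A0 as the bytes above suggest. The class name NbspDemo, the helper names normalize and stripAll, and the sample line are made up for illustration; they are not part of Apache POI or of the original code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class NbspDemo {

    // Turns every non-breaking space into an ordinary space, then trims.
    static String normalize(String line) {
        return line.replace('\u00A0', ' ').trim();
    }

    // Removes ordinary whitespace and non-breaking spaces alike.
    static String stripAll(String line) {
        return line.replaceAll("[\\s\\u00A0]+", "");
    }

    public static void main(String[] args) {
        // Hypothetical sample line starting with two non-breaking spaces.
        String line = "\u00A0\u00A0Materials list";

        // Encode explicitly as UTF-8 so the result does not depend on the platform default.
        byte[] utf8 = line.substring(0, 1).getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(utf8));                         // [-62, -96]
        System.out.printf("%02X %02X%n", utf8[0] & 0xFF, utf8[1] & 0xFF);  // C2 A0

        // U+00A0 is a space separator, but Java does not treat it as whitespace.
        System.out.println(Character.isWhitespace('\u00A0'));  // false
        System.out.println(Character.isSpaceChar('\u00A0'));   // true

        System.out.println("[" + line.trim() + "]");           // leading characters still there
        System.out.println("[" + normalize(line) + "]");       // [Materials list]
        System.out.println("[" + stripAll(line) + "]");        // [Materialslist]
    }
}
```

Replacing the character before trimming keeps the rest of the parsing code unchanged. If the document may contain other Unicode spaces as well, compiling the pattern with Pattern.UNICODE_CHARACTER_CLASS (or prefixing it with (?U)) should make \s cover them too.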