python - Using Pandas how do I deduplicate a file being read in chunks?


I have a large fixed-width file being read into pandas in chunks of 10000 lines. This works great except for removing duplicates, because duplicates can appear in different chunks. The file is being read in chunks because it is too large to fit into memory in its entirety.

My first attempt at deduplicating the file was to bring in only the two columns needed to deduplicate and build a list of rows not to read. Those two columns (out of 500) easily fit in memory, and I was able to use the id column to find duplicates and an eligibility column to decide which of the two or three rows with the same id to keep. I then used the skiprows flag of the read_fwf() command to skip those rows.
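
A minimal sketch of that first attempt. The file name, the column positions, and ranking rows by eligibility descending are assumptions for illustration, not details from the original post:

import pandas as pd

# Read only the two deciding columns; they fit in memory even though
# the full 500-column file does not. (colspecs are hypothetical.)
keys = pd.read_fwf('data.txt', colspecs=[(0, 10), (10, 12)],
                   names=['id', 'eligibility'])

# Keep the row with the highest eligibility for each id and collect
# the row numbers of everything else to skip.
keys = keys.sort_values(['id', 'eligibility'], ascending=[True, False])
to_skip = keys.index[keys.duplicated(subset=['id'])].tolist()

# The failing step: read_fwf() won't accept skiprows together with
# iterator=True.
reader = pd.read_fwf('data.txt', skiprows=to_skip, iterator=True, chunksize=10000)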

The problem I ran into is that the pandas fixed-width file reader doesn't work with skiprows = [list] and iterator = True at the same time.

So, how do I deduplicate a file that is being processed in chunks?

My solution was to bring in only the columns needed to find the duplicates I want to drop, and to build a bitmask based on that information. Then, knowing the chunksize and which chunk I'm on, I reindex the current chunk so that it matches its correct position on the bitmask. Passing the chunk through the bitmask drops the duplicate rows.

First, bring in the entire column to deduplicate on, in this case 'id', and create a bitmask of the rows that aren't duplicates. DataFrame.duplicated() returns the rows that are duplicates, and ~ inverts that. Now we have our 'dupemask':

dupemask = ~df.duplicated(subset=['id'])
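
The answer doesn't show how that single column is loaded; one way, assuming the id field occupies a known byte range, is to pass a one-entry colspecs so read_fwf() parses only that slice of each line:

import pandas as pd

# Hypothetical: the id field is assumed to occupy columns 0-10.
df = pd.read_fwf('data.txt', colspecs=[(0, 10)], names=['id'])

# True for the first occurrence of each id, False for every repeat.
dupemask = ~df.duplicated(subset=['id'])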

Then create an iterator to bring the file in in chunks. Once that's done, loop over the iterator and create a new index for each chunk. The new index matches the small chunk dataframe to its position in the 'dupemask' bitmask, which can then be used to keep only the lines that aren't duplicates:

for i, df in enumerate(chunked_data_iterator):
    # Renumber the chunk so its index lines up with its slice of dupemask.
    df.index = range(i * chunksize, i * chunksize + len(df.index))
    # Boolean indexing aligns on the index, keeping only non-duplicate rows.
    df = df[dupemask]
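
For completeness, here is how the pieces fit together. The chunk size matches the question; the file name, colspec, and appending the kept rows to an output CSV are assumptions for illustration:

import pandas as pd

chunksize = 10000

# Pass 1: build the keep-mask from the id column alone (colspec assumed).
ids = pd.read_fwf('data.txt', colspecs=[(0, 10)], names=['id'])
dupemask = ~ids.duplicated(subset=['id'])

# Pass 2: stream the full-width file and filter each chunk with the mask.
chunked_data_iterator = pd.read_fwf('data.txt', chunksize=chunksize)
for i, df in enumerate(chunked_data_iterator):
    df.index = range(i * chunksize, i * chunksize + len(df.index))
    df = df[dupemask]
    df.to_csv('deduped.csv', mode='a', header=(i == 0), index=False)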

This approach works in my case because the data is only large because it is wide. It still has to read in the deduplication column in its entirety in order to work.

