Unix day 11: Remove duplicates from multiple files

How to merge multiple files and dedupe the lines in the files

Two questions answered here:

  • How do I merge multiple files and remove the duplicate lines?

  • What should I do if I bring in a file from Windows to Unix to do text operations?

First things first: dedupe the entries from several files into a single file.

The simplest way to do that is to:

  1. merge all the files

  2. sort the merged lines

  3. remove the duplicate lines

Let's see how.

Unix Solution:

Let's say you have 3 files (file1.txt, file2.txt, file3.txt) from which you want to extract the unique lines:

Step 1: merge all the files into merge.txt

cat file1.txt file2.txt file3.txt > merge.txt 

Steps 2 & 3: sort and dedupe

sort merge.txt | uniq > final.txt
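The same result can usually be had without the intermediate merge.txt, because sort accepts several input files and its -u option keeps only one copy of each line. A minimal sketch, assuming the three files are in the current directory; the sample contents and the temporary directory are made up so the example runs on its own:

```shell
# Work in a scratch directory so no real files are overwritten
cd "$(mktemp -d)"

# Hypothetical sample data, for illustration only
printf 'apple\nbanana\n' > file1.txt
printf 'banana\ncherry\n' > file2.txt
printf 'apple\ncherry\ndate\n' > file3.txt

# sort reads all three files; -u drops duplicate lines
sort -u file1.txt file2.txt file3.txt > final.txt

cat final.txt
```

The two-step version with cat and uniq is easier to follow when you are learning; the one-liner is handy once the idea is clear.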

That's it.

But wait: sometimes these commands might not give you the desired outcome. Read on.

Word of caution:

Some entries might contain special characters, e.g. ^M, which is how a carriage return is displayed on Linux. Such entries might not get deduped, because a line that ends in a carriage return is not identical to the same line without one.

One reason this can happen is that the files originated in a DOS (Windows) text editor and were later brought into a Unix environment.

To check this, you can view the file with the hidden characters made visible:

cat -v merge.txt
1^M

In the output, lines that end in ^M carry a carriage return. Here you can also find a line without a ^M (the second line).
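On a long file, scanning the cat -v output by eye gets tedious. One way to count the affected lines instead is grep. This is only a sketch: sample.txt and its contents are invented for the demonstration, and the cr variable is just a portable way to hold a literal carriage return.

```shell
# Work in a scratch directory so no real files are touched
cd "$(mktemp -d)"

# Hypothetical sample: the first line ends in CR+LF, the second in LF only
printf '1\r\n1\n' > sample.txt

# Store a literal carriage return (command substitution strips only newlines)
cr=$(printf '\r')

# Count the lines that end with a carriage return
crlf_lines=$(grep -c "${cr}\$" sample.txt)
echo "lines ending in CR: $crlf_lines"

# With the carriage return still present, the two entries do not dedupe
remaining=$(sort sample.txt | uniq | wc -l | tr -d ' ')
echo "lines left after sort | uniq: $remaining"
```

A non-zero count from grep means the file still has Windows line endings somewhere.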

You can solve this by converting the file to Unix format:

dos2unix merge.txt

This removes the ^M from the end of each line, and you can then rerun the sort and uniq command as above.
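If dos2unix is not installed, tr can strip the carriage returns as part of the pipeline. This is a sketch under the assumption that the files contain no carriage returns worth keeping, since tr -d '\r' deletes every one of them and not only those at the end of a line. The sample data is invented.

```shell
# Work in a scratch directory so no real files are overwritten
cd "$(mktemp -d)"

# Hypothetical sample: file1.txt has Windows line endings, file2.txt has Unix ones
printf 'apple\r\nbanana\r\n' > file1.txt
printf 'apple\ncherry\n' > file2.txt

# Merge, delete all carriage returns, then sort and dedupe
cat file1.txt file2.txt | tr -d '\r' | sort | uniq > final.txt

# Show the result with hidden characters visible; no ^M should remain
cat -v final.txt
```

Unlike dos2unix, this leaves the original files unchanged and only cleans the data flowing through the pipe.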

