Unix Day 11: Remove duplicates from multiple files
Two questions are answered here:
- What should I do if I bring a file from Windows into Unix to do text operations?
- What are the steps to dedupe the entries in multiple files?
First things first: dedupe the entries from several files into a single file.
The simplest way to do that is to
- merge all the files
- sort them
- dedupe the entries in the file
Let's see the Unix solution:
Let's say we have 3 files (file1.txt, file2.txt, file3.txt) from which you want to extract the unique lines.
Step 1 : merge all the files into a single file
cat file1.txt file2.txt file3.txt > merge.txt
Step 2 & 3 : sort and dedupe
sort merge.txt | uniq > final.txt
That's it.
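The three steps above can also be collapsed into one command. This is a minimal sketch, not the only way to do it: it relies on sort accepting several input files and on its -u option, which keeps one copy of each distinct line. The file contents are made up for illustration, and the commands overwrite file1.txt, file2.txt, file3.txt and final.txt in the current directory, so try it in a scratch directory.

```shell
# Sample inputs (hypothetical contents, for illustration only)
printf 'apple\nbanana\n' > file1.txt
printf 'banana\ncherry\n' > file2.txt
printf 'apple\ndate\n' > file3.txt

# sort reads all three files; -u keeps a single copy of each line,
# so no intermediate merge.txt is needed
sort -u file1.txt file2.txt file3.txt > final.txt

# Show the deduped result
cat final.txt
```

The two-step version with merge.txt is still handy when you want to inspect the merged file before deduping.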
But wait: sometimes these commands might not give you the desired outcome. Read on.
Word of caution :
Some entries might contain special characters, e.g. ^M, which is how a carriage return is displayed on Linux. Two lines that differ only by this character count as different lines, so they will not get deduped.
One reason this can happen is that the files originated in a DOS (Windows) text editor and were later brought into a Unix environment.
To check this, you can view the file with the hidden characters shown:
cat -v merge.txt
1^M
Lines with a DOS line ending show a ^M at the end, and there you can also find the line without a ^M (the second line in this case).
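A small sketch of that check, assuming a made-up two-line file called sample.txt (a hypothetical name) in which the same entry appears once with a DOS ending and once with a Unix ending. It also counts the affected lines with grep, which is an addition to the cat -v check described above.

```shell
# Hypothetical sample: line 1 ends in \r\n (DOS), line 2 ends in \n (Unix)
printf '1\r\n1\n' > sample.txt

# cat -v displays the carriage return as ^M at the end of line 1
cat -v sample.txt

# Store a literal carriage return, then count the lines that end with one
cr=$(printf '\r')
crlf_lines=$(grep -c "$cr\$" sample.txt)
echo "lines ending in a carriage return: $crlf_lines"

# sort | uniq still sees two different lines here
sort sample.txt | uniq | wc -l
```

Both lines look identical in a normal listing, which is why the duplicate survives until the carriage return is removed.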
You can solve this by converting the file to the Unix text format.
This removes the ^M from the end of each line, and you can then run the sort and uniq commands as above.
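One possible way to do the conversion is sketched here with tr, which is an assumption on my part since no specific tool is named above. Note that tr -d '\r' deletes every carriage return in the file, not only the ones at the end of a line, which is normally fine for plain text. Where it is installed, the dos2unix utility does the same kind of conversion in place. The contents of merge.txt are made up for illustration.

```shell
# Hypothetical sample: "apple" appears with a DOS and a Unix line ending
printf 'apple\r\napple\nbanana\r\n' > merge.txt

# Delete the carriage returns, then sort and dedupe as before
tr -d '\r' < merge.txt | sort | uniq > final.txt

# Show the result with hidden characters visible; no ^M should remain
cat -v final.txt
```

After the conversion the two "apple" lines are byte-for-byte identical, so uniq collapses them into one.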