Unix day 11: Remove duplicates from multiple files
How to merge multiple files and dedupe the lines they contain
Two questions answered here:
- What should I do if I bring a file from Windows to Unix to do text operations?
- What are the steps to dedupe the entries in multiple files?
First things first: dedupe the entries in some files into a single file. The simplest way to do that would be to:
- merge all the files
- sort them
- dedupe the entries in the merged file
Unix Solution:
Let's say we have 3 files (file1.txt, file2.txt, file3.txt) from which you want to extract the unique lines.
Step 1: merge all files into merged.txt
cat file1.txt file2.txt file3.txt > merged.txt
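(A side note: if your input files share a naming pattern, a shell glob saves some typing; file*.txt here is just an example pattern, and since everything gets sorted in the next step, the concatenation order doesn't matter.)
cat file*.txt > merged.txt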
Steps 2 & 3: sort and dedupe
sort merged.txt | uniq > final.txt
That's it.
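As a side note, sort's -u flag keeps only unique lines, so the whole job can also be done in a single pipeline (equivalent to sort | uniq for exact duplicate lines):
cat file1.txt file2.txt file3.txt | sort -u > final.txt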
But wait: sometimes these commands might not give you the desired outcome. Read along.
Word of caution:
Some entries might end with special characters, eg: ^M, which represents a carriage return in Linux. Such lines might not get deduped.
One reason this can happen is that the files originated in a DOS (Windows) text editor and were later brought into a Unix environment.
To check this, you can view the file with the hidden characters shown:
cat -v merged.txt
1^M
2
2^M
2^M
3^M
3^M
4^M
5^M
9^M
And there you can find the line without a ^M (the second line).
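Another quick check (assuming the file command is available on your system) is to ask it about the line terminators:
file merged.txt
A file with Windows line endings is typically reported as something like: ASCII text, with CRLF line terminators.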
You can solve this by converting the file to the Unix text format:
dos2unix merged.txt
This will remove the ^M from the end of the lines, and now you can run the sort and uniq commands as above.
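If dos2unix isn't installed on your machine, a minimal alternative with standard tools is to delete the carriage returns with tr before deduping (cleaned.txt here is just an example name for the intermediate file):
tr -d '\r' < merged.txt > cleaned.txt
sort cleaned.txt | uniq > final.txt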