Unix day 11: Remove duplicates from multiple files

How to merge multiple files and dedupe their lines

Ned Poplaski (CISSP)
2 min read · Jun 7, 2021

Two questions answered here:

  • What should I do if I bring a file from Windows to Unix to do text operations?
  • What are the steps to dedupe the entries in multiple files?

First things first: the goal is to dedupe the entries from several files into a single file.

The simplest way to do that is to:

  1. merge all the files
  2. sort them
  3. dedupe the entries in the file

Let's see how it's done.

Unix Solution:

Let's say we have three files (file1.txt, file2.txt, file3.txt) from which you want to extract the unique lines.

Step 1: merging all files into merge.txt

cat file1.txt file2.txt file3.txt > merge.txt 
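
A quick side note: since the three files share a naming pattern, a shell glob would do the same job with less typing:

cat file*.txt > merge.txt   # the glob expands to file1.txt file2.txt file3.txt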

Steps 2 & 3: sort and dedupe

sort merge.txt | uniq > final.txt

That's it.
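
Worth knowing: sort can handle the deduping on its own via its -u flag, and it accepts multiple input files, so the whole job can be done as a one-liner:

sort -u file1.txt file2.txt file3.txt > final.txt   # -u keeps only unique lines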

But wait. Sometimes these commands might not give you the desired outcome. Read on.

Word of caution:

Some entries might end with special characters, e.g. ^M, which is how a carriage return appears in Linux. Lines that differ only by this character will not get deduped.

One reason this can happen is that the files originated in a DOS (Windows) text editor and were later brought into a Unix environment.

To check this, you can view the file with its hidden characters:

cat -v merge.txt
1^M
2
2^M
2^M
3^M
3^M
4^M
5^M
9^M

And there you can spot the line without a ^M (the second line).
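
If the file is long, scanning the cat -v output by eye gets tedious. In a bash shell (the $'\r' quoting below is a bash feature), you can count the affected lines instead:

grep -c $'\r' merge.txt   # prints the number of lines containing a carriage return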

You can solve this by converting the file to Unix line endings:

dos2unix merge.txt

This will remove the ^M from the end of each line, and now you can run the sort and uniq commands as above.
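
If dos2unix isn't available on your machine, tr can strip the carriage returns just as well; this writes to a new file (merge_clean.txt is just an example name) so the original stays untouched:

tr -d '\r' < merge.txt > merge_clean.txt   # delete every carriage return character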

Read other Unix Day tips:

Unix day — 3 : Removing dupes from a single file

Unix day 4 — Careful when using grep
