Problem
I’ve written a shell script that checks whether two files contain the same data. I do this for a lot of files, and the diff command appears to be the bottleneck in my script.
Here’s the line:
diff -q $dst $new > /dev/null
if ($status) then ...
Could a custom algorithm, rather than the default diff, be used to compare the files more quickly?
Asked by JDS
Solution #1
I believe cmp stops at the first byte that differs:
cmp --silent $old $new || echo "files are different"
Answered by Alex Howansky
Solution #2
@Alex Howansky used 'cmp --silent' for this, which I like. However, I need both a positive and a negative response, so I use:
cmp --silent file1 file2 && echo '### SUCCESS: Files Are Identical! ###' || echo '### WARNING: Files Are Different! ###'
I can run this in a terminal or over ssh to check files against a constant file.
Answered by pn1 dude
Solution #3
To compare any two files quickly and safely:
if cmp --silent -- "$FILE1" "$FILE2"; then
    echo "files contents are identical"
else
    echo "files differ"
fi
It’s easy to understand, quick, and works with any file name, including names that contain " ' $ ( ).
Answered by VasiliNovikov
Solution #4
I can’t add this tidbit as a comment because I’m a jerk and don’t have enough reputation points.
However, if you’re going to use cmp (and don’t need/want to be verbose), you can just get the exit status. According to the cmp man page, the exit status is 0 if the inputs are the same, 1 if they differ, and 2 if there was trouble.
As an example, you could do the following:
STATUS="$(cmp --silent -- "$FILE1" "$FILE2"; echo $?)"  # "$?" holds the exit status of the comparison
if [[ $STATUS -ne 0 ]]; then  # non-zero status: the files differ, or cmp hit an error
    echo "do a command on $FILE1"  # placeholder for the real command
else
    echo "do something else"  # placeholder for the real command
fi
EDIT: Thank you all for your feedback! The test syntax has been updated. If you’re looking for something similar to my answer in terms of readability, style, and syntax, I recommend Vasili’s solution.
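The three exit codes can also be handled separately, so that a missing or unreadable file is not reported as "different". This is a minimal sketch, not a definitive implementation: compare_files is a hypothetical helper name, and it uses -s, the short form of --silent.

```shell
# compare_files is a hypothetical helper, not something cmp provides.
# It prints "identical" or "different" and returns cmp's own exit status.
compare_files() {
    cmp -s -- "$1" "$2"
    rc=$?
    case $rc in
        0) echo "identical" ;;
        1) echo "different" ;;
        *) echo "cmp failed with status $rc" >&2 ;;  # e.g. a file is missing
    esac
    return "$rc"
}
```

Calling compare_files "$FILE1" "$FILE2" prints one of the two words on standard output. Any status above 1 means the comparison itself could not be completed.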
Answered by Gregory Martin
Solution #5
For files that are not different, any method requires reading both files completely, even if the read happened in the past.
There is no alternative. So creating hashes or checksums at some point still means reading the entire file, and large files take time to read.
File metadata retrieval is much faster than reading a large file.
So, is there anything in the metadata that can tell you the files are different? The file size, for example, or even the output of the file command, which reads only a small part of the file?
Example code segment for file size:
# Field 5 of each "ls -l" line is the file size in bytes
ls -l -- "$1" "$2" |
    awk 'NR==1{a=$5} NR==2{b=$5}
         END{val=(a==b)?0 :1; exit( val) }'
[ $? -eq 0 ] && echo 'same' || echo 'different'
You’re stuck with complete file reads if the files are the same size.
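The size check and the full comparison can be combined so that the expensive read only happens when the sizes match. The following is a sketch under stated assumptions: same_contents is a hypothetical function name, and wc -c is used for the size because it is portable. Whether wc answers from metadata alone depends on the implementation.

```shell
# same_contents is a hypothetical helper name.
# Returns 0 if the contents match, 1 if they differ, 2 if a file is unusable.
same_contents() {
    # Only regular, readable files are handled in this sketch.
    [ -f "$1" ] && [ -r "$1" ] && [ -f "$2" ] && [ -r "$2" ] || return 2
    size1=$(wc -c < "$1")
    size2=$(wc -c < "$2")
    # Different sizes: the contents cannot match, so skip the full read.
    [ "$size1" -eq "$size2" ] || return 1
    # Same size: a byte-by-byte comparison is still required.
    cmp -s -- "$1" "$2"
}
```

A call such as same_contents "$1" "$2" && echo 'same' || echo 'different' then behaves like the snippet above, except that files of equal size are actually compared.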
Answered by jim mcnamara
Post is based on https://stackoverflow.com/questions/12900538/fastest-way-to-tell-if-two-files-have-the-same-contents-in-unix-linux