No Support => Feature requests => Topic started by: PsycoEwok on November 12, 2004, 10:38:52 pm

Title: duplicate image checker
Post by: PsycoEwok on November 12, 2004, 10:38:52 pm
I'm an administrator for a coppermine-powered gallery that has over 2000 pictures in it. Users are free to upload pictures as they wish and then an admin approves the pictures, but I try to not allow duplicate images. This wasn't so hard back when we had only a few hundred pictures in the gallery, but with 2000+, I simply can't keep track of what pictures we already have in the gallery. So sometimes I'll approve a picture only to find out later that we already had a higher quality version of the same picture in the gallery. Since I'm a neat-freak, I'll then proceed to track down the duplicate picture and delete it. But as we get more and more pictures in the gallery, I'm sure more and more duplicate images are slipping into the gallery unnoticed.

So my request is for some sort of feature that will scan the gallery for duplicate images, if such a feature is possible. Even a feature as simple as an md5 checker would help (though something a bit more powerful is preferred of course).
Title: Re: duplicate image checker
Post by: Joachim Müller on November 13, 2004, 01:18:22 pm
In theory, a check for identical filenames would be possible, or a binary check for identical files. Of course, this will not solve the issue of the same pic being submitted in different qualities, as the files themselves differ greatly. Visual comparison is beyond what a webserver is capable of (at least nowadays). I agree that a scan of the database for identical files would be helpful; I'm not sure, though, how easy it would be to implement and especially how resource-consuming it would be. That is probably the crucial point, as the script would have to read every image from the webserver's hard drive and compare them, checksum method or not.
Not a bad idea imo, code submissions welcome.
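
For anyone who wants to experiment with the binary check mentioned above, here is a rough Python sketch (not Coppermine code, which is PHP). It assumes a plain list of file paths; the function name `identical_file_pairs` is made up for illustration. Grouping by file size first means only files of equal size are ever read in full, which may ease the resource concern somewhat.

```python
import filecmp
import os
from collections import defaultdict
from itertools import combinations


def identical_file_pairs(paths):
    """Return pairs of paths whose contents are byte-for-byte identical."""
    # Files of different sizes cannot be identical, so group by size first.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    pairs = []
    for same_size in by_size.values():
        for a, b in combinations(same_size, 2):
            # shallow=False forces a real content comparison instead of
            # trusting the os.stat() signature alone.
            if filecmp.cmp(a, b, shallow=False):
                pairs.append((a, b))
    return pairs
```

As noted in the post, this only catches exact copies; a re-compressed or resized version of the same picture will not match.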

Title: Re: duplicate image checker
Post by: PsycoEwok on November 13, 2004, 08:49:57 pm
Well, I was thinking that maybe it could have a little database of md5 checksums. That way it would only have to get the md5 for the picture in question, and then compare it to the md5 database. Of course, this would still require that the script scan the entire gallery in order to make the database, but you'd only need to do that once obviously.
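
A minimal sketch of that idea in Python, purely as an illustration: the "database" here is just an in-memory dictionary, and the names `md5_of_file`, `build_index`, and `find_duplicates` are hypothetical. A real gallery would presumably store the checksum in a database column next to each picture.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=65536):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_index(gallery_dir):
    """One-time scan: map each checksum to the files that have it."""
    index = {}
    for path in sorted(Path(gallery_dir).rglob("*")):
        if path.is_file():
            index.setdefault(md5_of_file(path), []).append(str(path))
    return index


def find_duplicates(index, new_file):
    """Hash only the new upload and look it up in the existing index."""
    return index.get(md5_of_file(new_file), [])
```

After the one-time `build_index` scan, approving an upload only costs one hash of the new file plus one lookup; the index would also need a new entry each time a picture is approved.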

Maybe you could even go so far as to divide the database up into sections. Like, if the first character of a picture's md5 checksum is '4', then the script only scans the section of the database that has md5 checksums that begin with '4'. This way, it only has to scan maybe a few hundred numbers for a match instead of the entire database of 2000+ (in my gallery's case). It seems to me that this would greatly reduce how resource-consuming this duplicate checker would be.
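
Roughly what that sectioning could look like, as a Python sketch with made-up names (`split_into_sections`, `is_known`). MD5 checksums are written in hexadecimal, so there are 16 possible first characters; 2000 checksums would land at about 125 per section on average.

```python
from collections import defaultdict


def split_into_sections(checksums):
    """Group checksums by their first hex character ('0'-'9', 'a'-'f')."""
    sections = defaultdict(list)
    for checksum in checksums:
        sections[checksum[0]].append(checksum)
    return sections


def is_known(sections, checksum):
    """Scan only the section that shares the checksum's first character."""
    for candidate in sections.get(checksum[0], []):
        if candidate == checksum:
            return True
    return False
```

In practice, a database index on the checksum column (or a hash table) narrows the search in a similar way without any manual sectioning, so this is mostly useful for seeing why the lookup can stay cheap.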

Anyways, just tossing ideas out there for anyone that wants to try coding this. :)

Edit: Also, another idea. Maybe instead of md5 checksums, a method of looking at colors can be used. Possibly something like a database of what color (rgb value) is used the most in each picture. And maybe have a small margin of error for the rgb values (like + or - 5 in the red value, the green value, and the blue value), since color can be lost due to compression. If this method is used, it would be much more powerful than simply checking md5's since the resolution, filesize, and other such things wouldn't matter. And I don't think it would be too much more resource-consuming than checking md5's as long as a database is used. Again, just an idea. I'm not a web-developer in any way, shape, or form. :P
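
A rough Python sketch of the color idea, under the assumption that each picture has already been decoded into a list of `(r, g, b)` tuples by some image library (that step is not shown). The names `dominant_color` and `colors_match` are hypothetical, and the default tolerance of 5 is the margin suggested above.

```python
from collections import Counter


def dominant_color(pixels):
    """Return the (r, g, b) value that occurs most often in the picture."""
    return Counter(pixels).most_common(1)[0][0]


def colors_match(color_a, color_b, tolerance=5):
    """True if every channel differs by at most `tolerance`."""
    return all(abs(a - b) <= tolerance for a, b in zip(color_a, color_b))


# Usage: store dominant_color() per picture once, then compare a new
# upload's dominant color against the stored values.
original = [(200, 30, 30)] * 5 + [(0, 0, 0)] * 2
recompressed = [(204, 26, 35)] * 5 + [(2, 1, 0)] * 2
same = colors_match(dominant_color(original), dominant_color(recompressed))
```

Two unrelated pictures can share a dominant color, so a match here would only flag candidates for an admin to look at, not confirm a duplicate.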