
Author Topic: Removing duplicates from gallery.  (Read 10788 times)


remdex

  • Coppermine newbie
  • Offline
  • Posts: 17
    • My anime wallpapers database
Removing duplicates from gallery.
« on: June 15, 2008, 09:00:13 am »

Hi,

I wanted to share a simple script that removes duplicate images from a gallery. The script does not depend on the gallery itself and should be run from the console.
To speed up the script you can add an index on the `filesize` column: "ALTER TABLE `cpg<coppermine version>_pictures` ADD INDEX ( `filesize` )". This is not necessary, but if done it will make the script run noticeably faster.
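For example, with the cpg1410 table prefix used in the settings below, the statement would be:
Code: [Select]
ALTER TABLE `cpg1410_pictures` ADD INDEX (`filesize`);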
You should change these settings to match your gallery configuration.
Code: [Select]
$HomePath = './albums/';
$TableName = 'cpg1410_pictures';
$DatabaseName = 'copermine';
$DBHost = 'localhost';
$DBUser = '<enter your username here>';
$DBPaswd = '<enter password>';

Also, the default limit is set to 150000 records:
Code: [Select]
LIMIT 0,150000
For testing purposes you can change it to 1.
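For example, to process just one group of matching file sizes while testing:
Code: [Select]
LIMIT 0,1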

Run the script from the console like this, where removeduplicates.php contains the script body below:
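Code: [Select]
php -f removeduplicates.php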
Script body:
Code: [Select]
<?php
/**
 * Script for removing duplicate pictures from a Coppermine gallery database
 */

/**
 * User settings
 */
$HomePath     = './albums/';
$TableName    = 'cpg1410_pictures';
$DatabaseName = 'copermine';
$DBHost       = 'localhost';
$DBUser       = '<enter your username here>';
$DBPaswd      = '<enter password>';
/** END USER SETTINGS. Do not change below. **/

$connect = mysql_connect($DBHost, $DBUser, $DBPaswd) or die("Could not connect: " . mysql_error());
mysql_select_db($DatabaseName);
echo "Connected to database\n";

// Find every file size that occurs more than once; each group is a set of candidate duplicates
$sql = "SELECT pid, filesize, COUNT(*) AS n FROM $TableName GROUP BY filesize HAVING n > 1 LIMIT 0,150000";
$result = mysql_query($sql);

while ($row = mysql_fetch_assoc($result))
{
    // Fetch all pictures with this file size, oldest first; the oldest record is kept as the original
    $SQL = "SELECT pid, ctime, aid, filepath, filename FROM $TableName WHERE filesize = {$row['filesize']} ORDER BY ctime";
    $resultDuplicates = mysql_query($SQL);

    $Original = array();
    while ($rowDuplicate = mysql_fetch_assoc($resultDuplicates))
    {
        // The first (oldest) record in the group becomes the original
        if (count($Original) == 0) { $Original = $rowDuplicate; continue; }

        if (!file_exists($HomePath . $Original['filepath'] . $Original['filename'])) // Original does not exist. Critical error
        {
            echo "ERROR: original does not exist -\n";
            echo "Original -\n";
            print_r($Original);
            echo "Duplicate -\n";
            print_r($rowDuplicate);
            exit;
        }
        elseif (!file_exists($HomePath . $rowDuplicate['filepath'] . $rowDuplicate['filename'])) // Duplicate file does not exist
        {
            echo "-----------------------------------------------------\n";
            if (is_dir(dirname($HomePath . $rowDuplicate['filepath'] . $rowDuplicate['filename']))) // Duplicate directory exists
            {
                // The file is gone but its directory exists: remove the orphaned DB record
                echo "Deleting duplicate DB record\n";
                $SQL = "DELETE FROM $TableName WHERE pid = {$rowDuplicate['pid']}";
                mysql_query($SQL);
            }
            else
            {
                echo "ERROR: duplicate does not exist -\n";
                echo "Original -\n";
                print_r($Original);
                echo "Duplicate -\n";
                print_r($rowDuplicate);
                exit;
            }
        }
        elseif ($Original['filepath'] . $Original['filename'] == $rowDuplicate['filepath'] . $rowDuplicate['filename'])
        {
            // Two DB records point to the same file: delete only the extra DB record
            echo "-----------------------------------------------------\n";
            echo "Duplicate points to the same file. Deleting only the DB record\n";
            $SQL = "DELETE FROM $TableName WHERE pid = {$rowDuplicate['pid']}";
            mysql_query($SQL);
        }
        elseif (sha1_file($HomePath . $Original['filepath'] . $Original['filename']) == sha1_file($HomePath . $rowDuplicate['filepath'] . $rowDuplicate['filename']))
        {
            // Same size and same SHA-1 checksum: a true duplicate. Delete the record, the file and its thumbnails
            echo "-----------------------------------------------------\n";
            $SQL = "DELETE FROM $TableName WHERE pid = {$rowDuplicate['pid']}";
            mysql_query($SQL);

            echo "Deleting duplicate file - " . $rowDuplicate['pid'] . ' ' . $rowDuplicate['filepath'] . $rowDuplicate['filename'] . "\n";
            echo "Original file - " . $HomePath . $Original['filepath'] . $Original['filename'] . "\n";

            $DuplicateFilename = $HomePath . $rowDuplicate['filepath'] . $rowDuplicate['filename'];

            if (unlink($DuplicateFilename))
                echo "OK\n";
            else
                echo "FAILED\n";

            // Coppermine stores an intermediate-size copy with the 'normal_' prefix
            $NormalThumbnail = $HomePath . $rowDuplicate['filepath'] . 'normal_' . $rowDuplicate['filename'];
            if (file_exists($NormalThumbnail))
            {
                echo "Normal thumbnail found. Proceeding to delete\n";
                if (unlink($NormalThumbnail))
                    echo "OK\n";
                else
                    echo "FAILED\n";
            }
            else
            {
                echo "Normal thumbnail not found, skipping\n";
            }

            // ...and a thumbnail with the 'thumb_' prefix
            $smallThumbnail = $HomePath . $rowDuplicate['filepath'] . 'thumb_' . $rowDuplicate['filename'];
            if (file_exists($smallThumbnail))
            {
                echo "Small thumbnail found. Proceeding to delete\n";
                if (unlink($smallThumbnail))
                    echo "OK\n";
                else
                    echo "FAILED\n";
            }
            else
            {
                echo "Small thumbnail not found, skipping\n";
            }
        }
        else // Sizes match but SHA-1 sums do not, so the images are different
        {
            echo "Skipping -> " . $rowDuplicate['pid'] . "\n";
        }
    }
}
?>
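If you want to double-check the result afterwards, you can re-run the grouping query and see how many file sizes are still shared; anything left should be only the non-identical images the script skipped. A minimal sketch, assuming the same settings as above:
Code: [Select]
<?php
// Sanity check: count the file sizes still shared by more than one picture
$TableName = 'cpg1410_pictures';
mysql_connect('localhost', '<enter your username here>', '<enter password>') or die(mysql_error());
mysql_select_db('copermine');

$result = mysql_query("SELECT filesize, COUNT(*) AS n FROM $TableName GROUP BY filesize HAVING n > 1");
echo mysql_num_rows($result) . " file sizes are still shared by more than one picture\n";
?>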


I hope it will be useful for someone. I ran the script on a gallery containing 150 000 records; it ran fine and deleted over 25 000 duplicates.
Important: you should make a backup of the database and the gallery files first, in case something goes wrong.
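For example, assuming shell access to the server and the database name from the settings above, a backup could look like this (names are illustrative):
Code: [Select]
mysqldump -u <enter your username here> -p copermine > copermine_backup.sql
cp -a albums albums_backup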
Logged

net

  • Coppermine regular visitor
  • **
  • Offline
  • Posts: 88
Re: Removing duplicates from gallery.
« Reply #1 on: August 11, 2008, 04:25:02 pm »

I have a question about how this script knows something is a dupe. Does it check the file size or just the filename? I mean, if only the filename is the same, it is not necessarily a dupe.
Logged

Hein Traag

  • Dev Team member
  • Coppermine addict
  • ****
  • Country: nl
  • Offline
  • Gender: Male
  • Posts: 2166
  • A, B, Cpg
    • Personal website - Spintires.nl
Re: Removing duplicates from gallery.
« Reply #2 on: August 11, 2008, 04:30:13 pm »

//Sizes match but SHA-1 sums do not, so the images are different

That is the part that does the trick of comparing two pictures that appear to be duplicates: candidates are first matched by file size, and then their SHA-1 checksums are compared. The filename itself is never used for the comparison.
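In other words, the core test boils down to something like this (a minimal sketch with hypothetical file paths, not the script itself, which takes the sizes from the database):
Code: [Select]
<?php
// Two pictures are treated as duplicates only if both checks pass
$a = './albums/userpics/10001/example_a.jpg'; // hypothetical paths
$b = './albums/userpics/10002/example_b.jpg';

if (filesize($a) == filesize($b) && sha1_file($a) == sha1_file($b)) {
    echo "Duplicates\n";
} else {
    echo "Different images\n";
}
?>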
Logged