Hi Joachim,
As you allready noticed the intension is the following one:
- on each picture upload a histogram vector is created, which is stored in the database. (e.g. v(n1,n2,n3,...)
of course when we later want to search for the most similar pictures we have to calculate a metric distance, e.g. euclidean, manhatten, and so on.., for every vector in the database. Although this operation is linear in complexity it would take really long if you have thousands of pictures in you database.
- to speed this up I would suggest to calculate a certain value of each vector and store it in the database too, and also index it with a btree or something like that. This value could be the euclidean distance on the vector itsself: value=(n1^2+n2^2+n3^2)^(1/2). Now there comes the great thing about this value (at least i think so): If two pictures are similar in their histograms they surely will have a value which is similar (difference of the two values is not big). Of course there are a lot of other not similar pictures where the value will be similar although the pictures are not. But this doesn't matter because we surely will reduce the resultset to just a fraction of all the pictures in the database.
We first determine the pictures which are similar considering the calculated single value. Lets say we get 30 Pictures as result out of thousands. Because the value discribed above should be a lower bound considering the similarity, it is garanted that the most similar pictures will among this 30 pictures. Now we do some more calculation on similarity which will be ok for a low number of pictures and then we have the most similar pictures.
When this certain value is stored with a tree index, which should be able in mysql, this thing could work well.
Here is a PDF from some lecture where the idea of distance functions is described a bit.
Any questions on this?
cu