I have a really big problem with my image storage server.
It holds about 2 million product images and keeps growing, but many of them are near-duplicates. For example, the same iPad photo is stored at many slightly different sizes: 120 * 120, 118 * 120, 131 * 125, etc. This wastes a lot of disk space and makes for a poor user experience on my site (similar images show up side by side in the gallery).
These images are indexed in the database, so I can query them by conditions such as product, category, etc. I need a way to mark the similar images in the database and then delete them.
What I did: I found a library called pHash that can calculate the similarity of two images, so I can use it to compare the images one by one. But done that way, it will take a very long time to find all the similar images, and I do not know how to make the process faster.
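For reference, the one-by-one comparison can be sketched roughly as follows. This is only a minimal Python sketch, not the pHash API: it assumes each image has already been reduced to a 64-bit perceptual hash stored as an integer, and the file names, hash values, and the `max_distance` threshold are made up for illustration.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two integer hashes."""
    return bin(a ^ b).count("1")


def find_similar_pairs(
    hashes: Dict[str, int], max_distance: int = 8
) -> List[Tuple[str, str]]:
    """Compare every pair of hashes and keep the pairs that are close.

    max_distance is a hypothetical threshold; a real value would need tuning.
    """
    pairs = []
    for (name_a, hash_a), (name_b, hash_b) in combinations(hashes.items(), 2):
        if hamming_distance(hash_a, hash_b) <= max_distance:
            pairs.append((name_a, name_b))
    return pairs


# Hypothetical precomputed hashes; real values would come from the hashing library.
example_hashes = {
    "ipad_120x120.jpg": 0xF0F0F0F0F0F0F0F0,
    "ipad_118x120.jpg": 0xF0F0F0F0F0F0F0F1,
    "other_131x125.jpg": 0x0F0F0F0F0F0F0F0F,
}

similar_pairs = find_similar_pairs(example_hashes)
print(similar_pairs)
```

With n images this loop performs n(n-1)/2 comparisons, which is roughly 2 × 10^12 for 2 million images, and that is why the pairwise approach is too slow for me.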
Any ideas?