Friday, April 19, 2013

Similar image identification theory

Let's assume that we've got thousands of images. If we're applying post-processing techniques to these images, we're probably also in possession of duplicates. What's a good method for finding these duplicates?

Let's consider what "fixes" we may do to a photo. 

  • JPEG compression level - Introduces artifacts and banding; colors may change a little; file size changes.
  • De-noise - Reduces noise, generally reduces file size a little, and removes some detail/micro-contrast.
  • Exposure/color correction - Changes the average brightness of an image channel (red/green/blue). Color correction changes the ratio of one channel to another over samples of the image.
  • Scale - Changing the size of the image means we have to look at areas of the image proportional to the width and height.
  • HDR - HDR is, I think, misunderstood. The term means capturing more stops of light than a digital camera normally permits. Displaying that many levels of light on a screen which only shows eight stops of light per channel is tone mapping. Tone mapping can be done in many ways, and can be a type of local contrast manipulation - i.e. making a dark area a little brighter to show its details better. Unlike simply lowering the contrast of the entire image (which will effectively do the same thing), local contrast manipulation means a solid dark or bright area may not be the same brightness throughout. E.g. for a dark area surrounded by a bright area, the dark area may be brighter overall, but not uniformly so, in order to keep contrast high at the border with the brighter area of the image. More on playing with this in a future post ;)
  • Cropping - This can be quite difficult, or at least computationally intensive, to check.

So, looking at these, we can rule out:
  • file-size and checksum comparisons
  • image-size and aspect-ratio checks
  • pixel-by-pixel comparisons - noise and JPEG artifacts will render this unusable
One check that can be made is a relative comparison between the average of a clump of pixels and the overall image average. If the clump size is chosen proportional to the image's width and height, then even re-sized images can be found.

Assume a picture is divided by a grid into sections. The average brightness of each section (summing the red, green, and blue values of each pixel) is then compared to the overall average. If the section is brighter than the overall average, call it '1'; if darker, '0'. If it's within a threshold of the average, a third value can be used.
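As a toy illustration of the encoding (using a synthetic 4x4 grayscale array rather than a real photo - the data here is made up), each tile becomes one character of the hash string:

```python
import numpy

# synthetic 4x4 "image": one brightness value per pixel (hypothetical data)
img = numpy.array([[ 10,  20, 200, 210],
                   [ 30,  40, 220, 230],
                   [200, 210,  10,  20],
                   [220, 230,  30,  40]], dtype=float)

overall = numpy.average(img)          # global average brightness (120.0 here)
bits = ""
for ty in range(2):                   # walk a 2x2 grid of tiles
    for tx in range(2):
        tile = img[ty*2:(ty+1)*2, tx*2:(tx+1)*2]
        bits += "1" if numpy.average(tile) > overall else "0"

print(bits)  # -> "0110": dark, bright, bright, dark tiles
```

The same string would be produced for a scaled copy of this image, since each tile is averaged relative to its own size.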

Here's how it works:

I've compared some pictures and added them below.
The first and third are the same - just color and exposure corrections, some curves used. The picture in the center is from a slightly different position and the position of my baby's head is different. The transparent yellow/green overlays highlight the sections where the local:global exposure ratio is different.

Writing out the flags for the exposure checks,
Picture 1:
Picture 2:
Picture 1-color corrected:

I'm using a 5% threshold value - i.e. if the difference between local and global exposure of a tile is under 5%, that "bit" is marked "x" and not compared against the alternate image string.

From the image, the colored sections show the tiles which have been found to be different. As we can see, the color-corrected image differs only slightly. This can be considered within a threshold, so the image can be added to a collection of similar images for later review.

Here's what the python code for the image comparison looks like.

Requirements: Python, NumPy, PIL (or Pillow).
from PIL import Image
import numpy
#------------------------------ DEFINED CONSTANTS ----------------------------
grid=10              # 10x10 grid of sample tiles
unclear=1            # permit use of 'x' in hash string if variance threshold is met
unclearvariance=0.05 # 5% threshold - under this local:global difference, mark 'x'
localcontrast=0      # 1 = compare each tile against its local average instead of the global one
debugger=0           # 1 = print debug messages
#------------------------------ DEFINED CONSTANTS ----------------------------

def debug(string):
    global debugger
    if (debugger==1):
        print(string)

def imgsum(filename):

    def fuzzsamxy(px,py,prx,pry): # at x,y, get average brightness within radius prx,pry
        x1=max(px-prx,0)
        y1=max(py-pry,0)
        x2=min(px+prx+1,ax) #quirk: array[a:b] apparently operates on indexes a to b-1
        y2=min(py+pry+1,ay)
        pavg = int(3*numpy.average(im1arr[y1:y2,x1:x2]))
        return pavg

    global grid
    #read image
    debug("Opening "+filename)
    img_a = Image.open(filename)
    im1arr = numpy.asarray(img_a)
    ax, ay = img_a.size
    debug("Size of image"+str([ax, ay]))
    debug("Determining average brightness")
    avg = int(3*numpy.average(im1arr)) # sum of the per-channel averages
    debug("Grid comparison")
    radx = ax//(2*grid) # half-width of a tile
    rady = ay//(2*grid) # half-height of a tile
    debug("radius of :"+str([radx, rady]))
    idstring = ""
    cx = 0
    while cx < grid:
        cy = 0
        while cy < grid:
            chkpntx = radx + cx*ax//grid # tile centre
            chkpnty = rady + cy*ay//grid
            if (localcontrast==1):
                avg = fuzzsamxy(chkpntx,chkpnty,radx,rady) #tile average replaces the global one
            #get a small sample about chkpntx-chkpnty
            sampavg = fuzzsamxy(chkpntx,chkpnty,max(1,radx//8),max(1,rady//8))
            if float(abs(sampavg-avg))/avg < unclearvariance and unclear==1:
                idstring = idstring + "x"
            elif sampavg > avg:
                idstring = idstring + "1"
            else:
                idstring = idstring + "0"
            cy = cy + 1
        cx = cx + 1
    return idstring

def stringdiffs(str1, str2):
    if len(str1) != len(str2):
        return -1
    a = 0
    cum = 0
    while a<len(str1):
        if str1[a] != str2[a]:
            if (str1[a] != "x" and str2[a] != "x"):
                cum = cum + 1
        a = a + 1
    return [cum,(float(cum)/len(str1))]

##########################  MAIN HERE ################################
# replace with your own pictures
print("identifying image 1")
id1 = imgsum("image1.jpg")
print("identifying image 3")
id3 = imgsum("image3.jpg")
print("identifying image 4")
id4 = imgsum("image4.jpg")

# output the ID strings - these "hash" values are simple strings
print (id1)
print (id3)
print (id4)

# stringdiffs(str1, str2) computes the total number of differences and the fraction of the total checks
print("ID differential similar stuff    :"+str(stringdiffs(id1,id3)))
print("ID differential little different :"+str(stringdiffs(id1,id4)))


And here's the output:
>pythonw -u ""
identifying image 1
identifying image 3
identifying image 4
ID differential similar stuff    :[1, 0.01]
ID differential little different :[18, 0.18]
>Exit code: 0    Time: 1.896

Averaging 0.63 seconds per image isn't bad, considering each image is fully loaded and all pixels are used in the averages. NumPy is extremely efficient at speeding this up: accessing individual pixels instead of using NumPy's built-in array averages is several times slower.
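To illustrate the speed difference (a sketch on synthetic data, not the photos used above), here is the same average computed the vectorized way and with a per-pixel Python loop:

```python
import time
import numpy

# hypothetical 500x500 single-channel image of random brightness values
img = numpy.random.randint(0, 256, (500, 500))

t0 = time.time()
fast = numpy.average(img)             # vectorized: one C-level pass over the array
t1 = time.time()

total = 0
for y in range(img.shape[0]):         # per-pixel Python loop, same arithmetic
    for x in range(img.shape[1]):
        total += img[y, x]
slow = total / img.size
t2 = time.time()

print("numpy:", t1 - t0, "loop:", t2 - t1)  # the loop takes far longer
```

Both paths produce the same average; only the time spent differs.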

From this point, it's pretty simple to create a main function that outputs filename and image-hash-string when passed an image, so it's trivial to use this to get a list of hash strings.

The next difficulty is more one of UI layout. Using "stringdiffs" to identify similar images, I believe forming collections and letting the user commit actions to those collections would be efficient.

Either way, the above functions would permit someone to handle it in a manner of their choosing.

For lower contrast images, I've also added a flag to the function for doing local contrast comparisons instead of global contrast. The difference check here will be the tile's average vs the center of the tile - though this has some weaknesses.
