Beomagi's thoughts...: Similar image identification theory

Let's assume that we've got thousands of images. If we're working on these images applying post process techniques, we're probably also in possession of duplicates. What's a good method to find these duplicates?

Let's consider what "fixes" we may do to a photo.

JPEG compression level - Introduces artifacts, colors may change a little, banding, file size.
De-noise - Reduces noise, generally reduces filesize a little, removes some detail/micro-contrast.
Exposure/color correction - Changes average brightness of image channel (red/green/blue). Color correction changes the ratio of one channel to another over samples of the image
Scale - Changing the size of the image means we have to look at areas of the image proportional to the width and height.
HDR - HDR I think is misunderstood. The term means capturing more stops of light than normally permitted my a digital camera. The act of displaying that many levels of light on a screen which only shows eight stops of light per channel is tone mapping. Tone mapping can be done in many ways, and can be a type of local contrast manipulation - i.e. making a dark area a little brighter to show details there better. Unlike simply lowering the contrast of the entire image which will effectively do the same thing, local contrast manipulation means a solid dark or bright area, may not be the same brightness throughout. e.g. a dark area surrounded by a bright area - the dark area may be brighter overall, but not uniformly so, in order to keep contrast high at the border to the brighter area of the image. More on playing with this in a future post ;)
Cropping - this can be quite difficult or at least intensive to check.

So, looking at these, we can rule out

file-size and checksums
image size and aspect ratio checks
pixel by pixel comparisons - noise and jpeg artifacts will render this unusable

One check that can be made is relative average comparisons between clumps of pixels and the overall average. If the clump of pixels is done relative to the change in width and height, then even re-sized images can be found.

Assuming a picture is divided by a grid into sections. The average brightness of each section is then compared to the overall average (add red, green, blue values of each pixel). If the section is brighter than overall average, call it '1', if darker it's '0'. If it's within a threshold, a third value can be used.

Here's how it works:

I've compared some pictures and added them below.

The first and third are the same - just color and exposure corrections, some curves used. The picture in the center is from a slightly different position and the position of my baby's head is different. The transparent yellow/green overlays highlight the sections where the local:global exposure ratio is different.

Writing out the flags for the exposure checks,

Picture 1:

00000x010000x11101000001111100000000010000000x0x00000000111001110010x11100011111x000001111x110111011

Picture 2:

00000x0100001111010000x1111100001000010000000x0x0000000011100111001011110x011111x000001111x110111011

Picture 1-color corrected:

00000001000001111100000011110000010000000000010000000011000011110001110111110111xx000001x1x110111011

I'm using a 5% threshold value - i.e. if the difference between local and global exposure of a tile is under 5%, that "bit" is marked "x" and not compared against the alternate image string.

From the image, the colored sections show the tiles which have been found to be different. As we can see the color corrected image only had a slight difference. This can be considered within a threshold and added to a collection of similar images for later review.

Here's what the python code for the image comparison looks like.

Requirements:

Python, Numpy, PIL (or pillow).

___________________________________________________

from PIL import Image
import numpy
#------------------------------ DEFINED CONSTANTS ----------------------------
grid=10  # 10x10 dot sample
debugger=0
unclear=1 # permit use of 'x' in hash string if variance threshold is met
localcontrast=0
unclearvariance=0.05
#------------------------------ DEFINED CONSTANTS ----------------------------

def debug(string):
  global debugger
  if (debugger==1):
  print(string)

def imgsum(filename):

  def fuzzsamxy(px,py,prx,pry): # at x,y, get average in radius pr
  cntx = px-prx
  psum = 0
  ptot = 0
  x1=max(px-prx,0)
  x2=min(px+prx+1,ax) #quirk: operation.array[a:b] apparently operates on indexes a to b-1
  y1=max(py-pry,0)
  y2=min(py+pry+1,ay)
  pavg=int(3*numpy.average(im1arr[x1:x2,y1:y2]))
  return pavg

  global grid
  #read image
  debug("Opening "+filename)
  img_a = Image.open(filename)
  im1arr = numpy.asarray(img_a)
  ax,ay,colors=im1arr.shape
  debug("Size of image"+str([ax, ay]))
  debug("Determining average brightness")
  avg = int(3*numpy.average(im1arr))
  debug("Grid comparison")
  cx=0
  signature=""
  radius=(ax//(grid*4))
  debug("radius of :"+str(radius))
  while cx < grid:
  cy = 0
  while cy < grid:
  chkpntx=int((float(cx)+0.5)*ax)//grid
  chkpnty=int((float(cy)+0.5)*ay)//grid
  radx=ax//(grid*2)
  rady=ay//(grid*2)
  if (localcontrast==1):
  avg = fuzzsamxy(chkpntx,chkpnty,radx,rady) #get sample about chkpntx-chkpnty

  #get sample about chkpntx-chkpnty
  sampavg = fuzzsamxy(chkpntx,chkpnty,min(1,radx//8),min(1,rady//8))
  if float(abs(sampavg-avg))/avg < unclearvariance and unclear==1:
  signature=signature+"x"
  else:
  if sampavg > avg:
  signature=signature+"1"
  else:
  signature=signature+"0"

  cy = cy + 1
  cx = cx + 1
  return(signature)

def stringdiffs(str1, str2):
  if len(str1) != len(str2):
  return -1
  a=0
  cum=0
  while a<len(str1):
  if  str1[a] != str2[a]:
  if (str1[a] != "x" and str2[a] != "x"):
  cum = cum + 1;
  a = a + 1
  return [cum,(float(cum)/len(str1))]

########################## MAIN HERE ################################
# replace with your own pictures
print("identifying image 1")
id1=imgsum('1.jpg')
print("identifying image 3")
id3=imgsum('1b.jpg')
print("identifying image 4")
id4=imgsum('2.jpg')

# output the ID strings - these "hash" values are simple strings
print (id1)
print (id3)
print (id4)

# string diffs(str1, str2) will compute the total number of differences and the fraction of the total checks)
print("ID differential similar stuff    :"+str(stringdiffs(id1,id3)))
print("ID differential little different :"+str(stringdiffs(id1,id4)))

___________________________________________________________

And here's the output:

>pythonw -u "imageiden2b.py"

identifying image 1

identifying image 3

identifying image 4

00000x010000x11101000001111100000000010000000x0x00000000111001110010x11100011111x000001111x110111011

00000x0100001111010000x1111100001000010000000x0x0000000011100111001011110x011111x000001111x110111011

00000001000001111100000011110000010000000000010000000011000011110001110111110111xx000001x1x110111011

ID differential similar stuff :[1, 0.01]

ID differential little different :[18, 0.18]

>Exit code: 0 Time: 1.896

Averaging 0.63 seconds per image isn't bad considering the image is fully loaded, and all pixels are used in averages. Numpy is extremely efficient at speeding this up. Accessing individual pixels, instead of Numpy built in array averages is several times slower.

From this point, it's pretty simple to create a main function that will output filename, image-hash-string if passed an image, so it's trivial to use this to get a list of hash strings.

The next difficulty is more of a UI layout. Using "stringdiffs" to identify similar images, I believe forming collections and letting the user commit actions to images would be efficient.

Either way, the above functions would permit someone to handle it in a manner of their choosing.

For lower contrast images, I've also added a flag to the function for doing local contrast comparisons instead of global contrast. The difference check here will be the tile's average vs the center of the tile - though this has some weaknesses.

Beomagi's thoughts...

Friday, April 19, 2013

Similar image identification theory

No comments:

Post a Comment