Pastebin launched a little side project called VERYVIRAL.com, check it out ;-) Want more features on Pastebin? Sign Up, it's FREE!
Guest

cython-damerau-levenshtein

By: a guest on Aug 7th, 2010  |  syntax: Python  |  size: 25.45 KB  |  views: 77  |  expires: Never
download  |  raw  |  embed  |  report abuse  |  print
Text below is selected. Please press Ctrl+C to copy to your clipboard. (⌘+C on Mac)
  1.  
  2.  
  3. i'm back with another longish question. having experimented with a number of Python-based Damerau-Levenshtein
  4. edit distance implementations, [i finally found the one listed below][1] as ``editdistance_reference()``. it
  5. seems to deliver correct results and appears to have an efficient implementation.
  6.  
  7.  [1]: http://mwh.geek.nz/2009/04/26/python-damerau-levenshtein-distance/
  8.  
  9. so i set down to convert the code to Cython. on my test data, the reference method manages to deliver results
  10. for 11,000 comparisons (for pairs of words aound 12 letters long), while the Cythonized method does over
  11. 200,000 comparisons per second. sadly, the results are incorrect: when you look at the variable ``thisrow``
  12. which i print out for debugging, my version has it filled with ones no matter what data i throw at it,
  13. while the reference output shows another picture. for example, testing ``'helo'`` against ``'world'``
  14. produces the following output (``ED`` marks my function, ``EDR`` is the correctly working reference):
  15.  
  16. from ``editdistance()``:
  17.  
  18.    #ED  A [0, 0, 0, 0, 0, 1]
  19.    #ED  B [1, 0, 0, 0, 0, 1]
  20.    #ED  B [1, 1, 0, 0, 0, 1]
  21.    #ED  B [1, 1, 1, 0, 0, 1]
  22.    #ED  B [1, 1, 1, 1, 0, 1]
  23.    #ED  B [1, 1, 1, 1, 1, 1]
  24.  
  25.    #ED  A [0, 0, 0, 0, 0, 2]
  26.    #ED  B [1, 0, 0, 0, 0, 2]
  27.    #ED  B [1, 1, 0, 0, 0, 2]
  28.    #ED  B [1, 1, 1, 0, 0, 2]
  29.    #ED  B [1, 1, 1, 1, 0, 2]
  30.    #ED  B [1, 1, 1, 1, 1, 2]
  31.  
  32.    #ED  A [0, 0, 0, 0, 0, 3]
  33.    #ED  B [1, 0, 0, 0, 0, 3]
  34.    #ED  B [1, 1, 0, 0, 0, 3]
  35.    #ED  B [1, 1, 1, 0, 0, 3]
  36.    #ED  B [1, 1, 1, 1, 0, 3]
  37.    #ED  B [1, 1, 1, 1, 1, 3]
  38.  
  39.    #ED  A [0, 0, 0, 0, 0, 4]
  40.    #ED  B [1, 0, 0, 0, 0, 4]
  41.    #ED  B [1, 1, 0, 0, 0, 4]
  42.    #ED  B [1, 1, 1, 0, 0, 4]
  43.    #ED  B [1, 1, 1, 1, 0, 4]
  44.    #ED  B [1, 1, 1, 1, 1, 4]
  45.  
  46. from ``editdistance_reference()``:
  47.  
  48.  
  49.    #EDR A [0, 0, 0, 0, 0, 1]
  50.    #EDR B [1, 0, 0, 0, 0, 1]
  51.    #EDR B [1, 2, 0, 0, 0, 1]
  52.    #EDR B [1, 2, 3, 0, 0, 1]
  53.    #EDR B [1, 2, 3, 4, 0, 1]
  54.    #EDR B [1, 2, 3, 4, 5, 1]
  55.  
  56.    #EDR A [0, 0, 0, 0, 0, 2]
  57.    #EDR B [2, 0, 0, 0, 0, 2]
  58.    #EDR B [2, 2, 0, 0, 0, 2]
  59.    #EDR B [2, 2, 3, 0, 0, 2]
  60.    #EDR B [2, 2, 3, 4, 0, 2]
  61.    #EDR B [2, 2, 3, 4, 5, 2]
  62.  
  63.    #EDR A [0, 0, 0, 0, 0, 3]
  64.    #EDR B [3, 0, 0, 0, 0, 3]
  65.    #EDR B [3, 3, 0, 0, 0, 3]
  66.    #EDR B [3, 3, 3, 0, 0, 3]
  67.    #EDR B [3, 3, 3, 3, 0, 3]
  68.    #EDR B [3, 3, 3, 3, 4, 3]
  69.  
  70.    #EDR A [0, 0, 0, 0, 0, 4]
  71.    #EDR B [4, 0, 0, 0, 0, 4]
  72.    #EDR B [4, 4, 0, 0, 0, 4]
  73.    #EDR B [4, 4, 4, 0, 0, 4]
  74.    #EDR B [4, 4, 4, 4, 0, 4]
  75.    #EDR B [4, 4, 4, 4, 4, 4]
  76.  
  77. i must be very dumb as the error is probably one of those very very obvious things. but i can't seem to find it.
  78.  
  79. there is a second problem: i ``malloc`` space for the three arrays ``twoago``, ``oneago``, and ``thisrow``,
  80. then they get swapped around in a circular fashion. when i try to ``free( twoago )`` and so on, i get a
  81. line where glibc complains about ``double free or corruption``. i googled for that; could it be that the
  82. pointer-swapping business makes glibc a bit dizzy so it becomes unable to correctly free memory?
  83.  
  84. below i list first the ``setup.py`` that is needed to do the compilation
  85. (``/path/to/python3.1 ./setup.py build_ext --inplace``), then the edit distance code proper, so interested
  86. people find it easier to replicate.
  87.  
  88. one more thing: this is run with Python3.1; one funny thing is that inside the ``*.pyx`` file we do have
  89. bare unicode strings, but ``print`` is still a statement, not a function.
  90.  
  91. and yes i know this is a lot of code to paste here but the thing is the code is simply not runnable when you
  92. cut it down all too much. i believe all methods except ``editdistance()`` to work correctly, but feel free
  93. to point out any problems you perceive.
  94.  
  95. ``setup.py``:
  96.  
  97.     from distutils.core import setup
  98.     from distutils.extension import Extension
  99.     from Cython.Distutils import build_ext
  100.  
  101.     setup(
  102.       name            = 'cython_dameraulevenshtein',
  103.       ext_modules     = [
  104.         Extension( 'cython_dameraulevenshtein', [ 'cython_dameraulevenshtein.pyx', ] ), ],
  105.       cmdclass        = {
  106.         'build_ext': build_ext }, )
  107.  
  108.  
  109. ``cython_dameraulevenshtein.pyx`` (scroll all the way to the end to see the interesting stuff):
  110.  
  111.  
  112.  
  113.     ############################################################################################################
  114.     cdef extern from "stdlib.h":
  115.       ctypedef  unsigned int size_t
  116.       void      *malloc(size_t size)
  117.       void      *realloc( void *ptr, size_t size )
  118.       void      free(void *ptr)
  119.  
  120.     #-----------------------------------------------------------------------------------------------------------
  121.     cdef inline unsigned int _minimum_of_two_uints( unsigned int a, unsigned int b ):
  122.       if a < b: return a
  123.       return b
  124.  
  125.     #-----------------------------------------------------------------------------------------------------------
  126.     cdef inline unsigned int _minimum_of_three_uints( unsigned int a, unsigned int b, unsigned int c ):
  127.       if a < b:
  128.         if c < a:
  129.           return c
  130.         return a
  131.       if c < b:
  132.         return c
  133.       return b
  134.  
  135.     #-----------------------------------------------------------------------------------------------------------
  136.     cdef inline int _warp( unsigned int limit, int value ):
  137.       return value if value >= 0 else limit + value
  138.  
  139.     ############################################################################################################
  140.     # ARRAYS THAT SAY SIZE ;-)
  141.     #-----------------------------------------------------------------------------------------------------------
  142.     cdef class Array_of_unsigned_int:
  143.       cdef unsigned int *data
  144.       cdef unsigned int length
  145.  
  146.       #---------------------------------------------------------------------------------------------------------
  147.       def __cinit__( self, unsigned int length, fill_value = None ):
  148.         self.length = length
  149.         self.data   = <unsigned int *>malloc( length * sizeof( unsigned int ) )  ###OBS### must check malloc doesn't return NULL pointer
  150.         if fill_value is not None:
  151.           self.fill( fill_value )
  152.  
  153.       #---------------------------------------------------------------------------------------------------------
  154.       cdef fill( self, unsigned int value ):
  155.         cdef unsigned int idx
  156.         cdef unsigned int *d    = self.data
  157.         for idx from 0 <= idx < self.length:
  158.           d[ idx ] = value
  159.  
  160.       #---------------------------------------------------------------------------------------------------------
  161.       cdef resize( self, unsigned int length ):
  162.         self.data   = <unsigned int *>realloc( self.data, length * sizeof( unsigned int ) )  ###OBS### must check realloc doesn't return NULL pointer
  163.         self.length = length
  164.  
  165.       #---------------------------------------------------------------------------------------------------------
  166.       def free( self ):
  167.         """Always remember the milk: Free up memory."""
  168.         free( self.data )  ###OBS### should free memory here
  169.  
  170.       #---------------------------------------------------------------------------------------------------------
  171.       def as_list( self ):
  172.         """Return the array as a Python list."""
  173.         R                       = []
  174.         cdef unsigned int idx
  175.         cdef unsigned int *d    = self.data
  176.         for idx from 0 <= idx < self.length:
  177.           R.append( d[ idx ] )
  178.         return R
  179.  
  180.  
  181.     ############################################################################################################
  182.     # CONVERTING UNICODE TO CHARACTER IDs (CIDs)
  183.     #---------------------------------------------------------------------------------------------------------
  184.     cdef unsigned int _UMX_surrogate_lower_bound    = 0x10000
  185.     cdef unsigned int _UMX_surrogate_upper_bound    = 0x10ffff
  186.     cdef unsigned int _UMX_surrogate_hi_lower_bound = 0xd800
  187.     cdef unsigned int _UMX_surrogate_hi_upper_bound = 0xdbff
  188.     cdef unsigned int _UMX_surrogate_lo_lower_bound = 0xdc00
  189.     cdef unsigned int _UMX_surrogate_lo_upper_bound = 0xdfff
  190.     cdef unsigned int _UMX_surrogate_foobar_factor  = 0x400
  191.  
  192.     #---------------------------------------------------------------------------------------------------------
  193.     cdef Array_of_unsigned_int _cids_from_text( text ):
  194.       """Givn a ``text`` either as a Unicode string or as a ``bytes`` or ``bytearray``, return an instance of
  195.      ``Array_of_unsigned_int`` that enumerates either the Unicode codepoints of each character or the value of
  196.      each byte. Surrogate pairs will be condensed into single values, so on narrow Python builds the length of
  197.      the array returned may be less than ``len( text )``."""
  198.       #.........................................................................................................
  199.       # Make sure ``text`` is either a Unicode string (``str``) or a ``bytes``-like thing:
  200.       is_bytes = isinstance( text, ( bytes, bytearray, ) )
  201.       assert is_bytes or isinstance( text, str ), '#121'
  202.       #.........................................................................................................
  203.       # Whether it is a ``str`` or a ``bytes``, we know the result can only have at most as many elements as
  204.       # there are characters in ``text``, so we can already reserve that much space (in the case of a Unicode
  205.       # text, there may be fewer CIDs if there happen to be surrogate characters):
  206.       cdef unsigned int           length  = <unsigned int>len( text )
  207.       cdef Array_of_unsigned_int  R       = Array_of_unsigned_int( length )
  208.       #.........................................................................................................
  209.       # If ``text`` is empty, we can return an empty array right away:
  210.       if length == 0: return R
  211.       #.........................................................................................................
  212.       # Otherwise, prepare to copy data:
  213.       cdef unsigned int idx               = 0
  214.       #.........................................................................................................
  215.       # If ``text`` is a ``bytes``-like thing, use simplified processing; we just have to copy over all byte
  216.       # values and are done:
  217.       if is_bytes:
  218.         for idx from 0 <= idx < length:
  219.           R.data[ idx ] = <unsigned int>text[ idx ]
  220.         return R
  221.       #.........................................................................................................
  222.       cdef unsigned int cid               = 0
  223.       cdef bool         is_surrogate      = False
  224.       cdef unsigned int hi                = 0
  225.       cdef unsigned int lo                = 0
  226.       cdef unsigned int chr_count         = 0
  227.       #.........................................................................................................
  228.       # Iterate over all indexes in text:
  229.       for idx from 0 <= idx < length:
  230.         #.......................................................................................................
  231.         # If we met with a surrogate CID in the last cycle, then that was a high surrogate CID, and the
  232.         # corresponding low CID is on the current position. Having both, we can compute the intended CID
  233.         # and reset the flag:
  234.         if is_surrogate:
  235.           lo = <unsigned int>ord( text[ idx ] )
  236.           # IIRC, this formula was documented in Unicode 3:
  237.           cid = ( ( hi - _UMX_surrogate_hi_lower_bound ) * _UMX_surrogate_foobar_factor
  238.                 + ( lo - _UMX_surrogate_lo_lower_bound ) + _UMX_surrogate_lower_bound )
  239.           is_surrogate = False
  240.         #.......................................................................................................
  241.         else:
  242.           # Otherwise, we retrieve the CID from the current position:
  243.           cid = <unsigned int>ord( text[ idx ] )
  244.           #.....................................................................................................
  245.           if _UMX_surrogate_hi_lower_bound <= cid <= _UMX_surrogate_hi_upper_bound:
  246.             # If this CID is a high surrogate CID, set ``hi`` to this value and set a flag so we'll come back
  247.             # in the next cycle:
  248.             hi                = cid
  249.             is_surrogate      = True
  250.             continue
  251.         #.......................................................................................................
  252.         R.data[ chr_count ] = cid
  253.         chr_count     += 1
  254.       #.........................................................................................................
  255.       # Surrogate CIDs take up two characters but end up as a single resultant CID, so the return value may
  256.       # have fewer elements than the naive string length indicated; in this case, we want to free some memory
  257.       # and correct array length data:
  258.       if chr_count != length:
  259.         R.resize( chr_count )
  260.       #.........................................................................................................
  261.       return R
  262.  
  263.     #---------------------------------------------------------------------------------------------------------
  264.     def cids_from_text( text ):
  265.       cdef Array_of_unsigned_int c_R  =_cids_from_text( text )
  266.       R                               = c_R.as_list()
  267.       c_R.free() ###OBS### should free memory here
  268.       return R
  269.  
  270.  
  271.     ############################################################################################################
  272.     # SECOND-ORDER SIMILARITY
  273.     #-----------------------------------------------------------------------------------------------------------
  274.     cpdef float similarity( char *a, char *b ):
  275.       """Given two byte strings ``a`` and ``b``, return their Damerau-Levenshtein similarity as a float between
  276.      0.0 and 1.1. Similarity is computed as ``1 - relative_editdistance( a, b )``, so a result of ``1.0``
  277.      indicates identity, while ``0.0`` indicates complete dissimilarity."""
  278.       return 1.0 - relative_editdistance( a, b )
  279.  
  280.     #-----------------------------------------------------------------------------------------------------------
  281.     cpdef float relative_editdistance( char *a, char *b ):
  282.       """Given two byte strings ``a`` and ``b``, return their relative Damerau-Levenshtein distance. The return
  283.      value is a float between 0.0 and 1.0; it is calculated as the absolute edit distance, divided by the
  284.      length of the longer string. Therefore, ``0.0`` indicates identity, while ``1.0`` indicates complete
  285.      dissimilarity."""
  286.       cdef int length = max( len( a ), len( b ) )
  287.       if length == 0: return 0.0
  288.       return editdistance( a, b ) / <float>length
  289.  
  290.     ############################################################################################################
  291.     # EDIT DISTANCE
  292.     #-----------------------------------------------------------------------------------------------------------
  293.     cpdef unsigned int editdistance( text_a, text_b ):
  294.       """Given texts as Unicode strings or ``bytes`` / ``bytearray`` objects, return their absolute
  295.      Damerau-Levenshtein distance. Each deletion, insertion, substitution, and transposition is counted as one
  296.      difference, so the edit distance between ``abc`` and ``ab``, ``abcx``, ``abx``, ``acb``, respectively, is
  297.      ``1``."""
  298.       #.........................................................................................................
  299.       # This should be fast in Python, as it can (and probably is) implemented by doing an identity check in
  300.       # the case of ``bytes`` and ``str`` objects:
  301.       if text_a == text_b: return 0
  302.       #.........................................................................................................
  303.       # Convert Unicode text to C array of unsigned integers:
  304.       cdef Array_of_unsigned_int a  = _cids_from_text( text_a )
  305.       cdef Array_of_unsigned_int b  = _cids_from_text( text_b )
  306.       R                             = c_editdistance( a, b )
  307.       #.........................................................................................................
  308.       # Always remember the milk:
  309.       a.free()
  310.       b.free()
  311.       #.........................................................................................................
  312.       return R
  313.  
  314.     #-----------------------------------------------------------------------------------------------------------
  315.     cdef unsigned int c_editdistance( Array_of_unsigned_int cids_a, Array_of_unsigned_int cids_b ):
  316.       # Conceptually, this is based on a len(a) + 1 * len(b) + 1 matrix.
  317.       # However, only the current and two previous rows are needed at once,
  318.       # so we only store those.
  319.       #.........................................................................................................
  320.       # This shortcut is pretty useless if comparison is not very fast; therefore, it is done in the function
  321.       # that deals with the Python objects, q.v.
  322.       # if cids_a.equals( cids_b ): return 0
  323.       #.........................................................................................................
  324.       cdef unsigned int a_length            = cids_a.length
  325.       cdef unsigned int b_length            = cids_b.length
  326.       #.........................................................................................................
  327.       # Another shortcut: if one of the texts is empty, then the edit distance is trivially the length of the
  328.       # other text. This also works for two empty texts, but those have already been taken care of by the
  329.       # previous shortcut:
  330.       #.........................................................................................................
  331.       if a_length == 0: return b_length
  332.       if b_length == 0: return a_length
  333.       #.........................................................................................................
  334.       cdef unsigned int row_length          = b_length   + 1
  335.       cdef unsigned int row_length_1        = row_length - 1
  336.       cdef unsigned int row_bytecount       = sizeof( unsigned int ) * row_length
  337.       cdef unsigned int *oneago             = <unsigned int *>malloc( row_bytecount ) ###OBS### must check malloc doesn't return NULL pointer
  338.       cdef unsigned int *twoago             = <unsigned int *>malloc( row_bytecount ) ###OBS### must check malloc doesn't return NULL pointer
  339.       cdef unsigned int *thisrow            = <unsigned int *>malloc( row_bytecount ) ###OBS### must check malloc doesn't return NULL pointer
  340.       cdef unsigned int idx                 = 0
  341.       cdef unsigned int idx_a               = 0
  342.       cdef unsigned int idx_b               = 0
  343.       cdef          int idx_a_1_text        = 0
  344.       cdef          int idx_b_1_row         = 0
  345.       cdef          int idx_b_2_row         = 0
  346.       cdef          int idx_b_1_text        = 0
  347.       cdef unsigned int deletion_cost       = 0
  348.       cdef unsigned int addition_cost       = 0
  349.       cdef unsigned int substitution_cost   = 0
  350.       #.........................................................................................................
  351.       # Equivalent of ``thisrow = list( range( 1, b_length + 1 ) ) + [ 0 ]``:
  352.       #print( '#305', cids_a.as_list(), cids_b.as_list(), a_length, b_length, row_length, row_length_1 )
  353.       for idx from 1 <= idx < row_length:
  354.         thisrow[ idx - 1 ] = idx
  355.       thisrow[ row_length - 1 ] = 0
  356.       #.........................................................................................................
  357.       for idx_a from 0 <= idx_a < a_length:
  358.         idx_a_1_text      = _warp(   a_length, idx_a - 1 )
  359.         twoago, oneago = oneago, thisrow
  360.         #.......................................................................................................
  361.         # Equivalent of ``thisrow = [ 0 ] * b_length + [ idx_a + 1 ]``:
  362.         for idx from 0 <= idx < row_length_1:
  363.           thisrow[ idx ] = 0
  364.         thisrow[ row_length - 1 ] = idx_a + 1
  365.         #.......................................................................................................
  366.         # some diagnostic output:
  367.         x = []
  368.         for idx from 0 <= idx < row_length: x.append( thisrow[ idx ] )
  369.         print
  370.         print '#ED  A', x
  371.         #.......................................................................................................
  372.         for idx_b from 0 <= idx_b < b_length:
  373.           #.....................................................................................................
  374.           idx_b_1_row       = _warp( row_length, idx_b - 1 )
  375.           idx_b_1_text      = _warp(   b_length, idx_b - 1 )
  376.           #.....................................................................................................
  377.           assert 0 <= idx_b_1_row  < row_length, ( '#323', idx_b_1_row, )
  378.           assert 0 <= idx_a_1_text <   a_length, ( '#324', idx_a_1_text, )
  379.           assert 0 <= idx_b_1_text <   b_length, ( '#325', idx_b_1_text, )
  380.           #.....................................................................................................
  381.           deletion_cost     = oneago[  idx_b       ] + 1
  382.           addition_cost     = thisrow[ idx_b_1_row ] + 1
  383.           substitution_cost = oneago[  idx_b_1_row ] + ( 1 if    cids_a.data[ idx_a ]
  384.                                                               != cids_b.data[ idx_b ] else 0 )
  385.           thisrow[ idx_b ]  = _minimum_of_three_uints( deletion_cost, addition_cost, substitution_cost )
  386.           #.....................................................................................................
  387.           # Transpositions:
  388.           if (  idx_a > 0
  389.             and idx_b > 0
  390.             and cids_a.data[ idx_a        ] == cids_b.data[ idx_b_1_text ]
  391.             and cids_a.data[ idx_a_1_text ] == cids_b.data[ idx_b        ]
  392.             and cids_a.data[ idx_a        ] != cids_b.data[ idx_b        ] ):
  393.             #...................................................................................................
  394.             idx_b_2_row       = _warp( row_length, idx_b - 2 )
  395.             assert 0 <= idx_b_2_row  < row_length, ( '#340', idx_b_2_row, )
  396.             thisrow[ idx_b ]  = _minimum_of_two_uints( thisrow[ idx_b ], twoago[ idx_b_2_row ] + 1 )
  397.           #.....................................................................................................
  398.           # some diagnostic output:
  399.           x = []
  400.           for idx from 0 <= idx < row_length: x.append( thisrow[ idx ] )
  401.           print '#ED  B', x
  402.       #.........................................................................................................
  403.       # Here, ``b_length - 1`` can't become negative, since we already tested for ``b_length == 0`` in the
  404.       # shortcut above:
  405.       cdef unsigned int R = thisrow[ b_length - 1 ]
  406.       #.........................................................................................................
  407.       # Always remember the milk:
  408.       # BUG: Activating below lines leads to glibc failing with ``double free or corruption``
  409.       #free( twoago )
  410.       #free( oneago )
  411.       #free( thisrow )e
  412.       #.........................................................................................................
  413.       return R
  414.  
  415.     #-----------------------------------------------------------------------------------------------------------
  416.     def editdistance_reference( text_a, text_b ):
  417.       """This method is believed to compute a correct Damerau-Levenshtein edit distance, with deletions,
  418.      insertions, substitutions, and transpositions. Do not touch it; it is here to validate results returned
  419.      from the above method. Code adapted from
  420.      http://mwh.geek.nz/2009/04/26/python-damerau-levenshtein-distance"""
  421.       # Conceptually, the implementation is based on a ``( len( seq1 ) + 1 ) * ( len( seq2 ) + 1 )`` matrix.
  422.       # However, only the current and two previous rows are needed at once, so we only store those. Python
  423.       # lists wrap around for negative indices, so we put the leftmost column at the *end* of the list. This
  424.       # matches with the zero-indexed strings and saves extra calculation.
  425.       b_length  = len( text_b )
  426.       oneago    = None
  427.       thisrow   = list( range( 1, b_length + 1 ) ) + [ 0 ]
  428.       for idx_a in range( len( text_a ) ):
  429.         twoago, oneago, thisrow = oneago, thisrow, [ 0 ] * b_length + [ idx_a + 1 ]
  430.         #.......................................................................................................
  431.         # some diagnostic output:
  432.         print
  433.         print '#EDR A', thisrow
  434.         #.......................................................................................................
  435.         for idx_b in range( b_length ):
  436.           deletion_cost     = oneago[  idx_b     ] + 1
  437.           addition_cost     = thisrow[ idx_b - 1 ] + 1
  438.           substitution_cost = oneago[  idx_b - 1 ] + ( text_a[ idx_a ] != text_b[ idx_b ] )
  439.           thisrow[ idx_b ]  = min( deletion_cost, addition_cost, substitution_cost )
  440.           if (  idx_a > 0
  441.             and idx_b > 0
  442.             and text_a[ idx_a     ] == text_b[ idx_b - 1 ]
  443.             and text_a[ idx_a - 1 ] == text_b[ idx_b     ]
  444.             and text_a[ idx_a     ] != text_b[ idx_b     ] ):
  445.             thisrow[ idx_b ] = min( thisrow[ idx_b ], twoago[ idx_b - 2 ] + 1 )
  446.           #.....................................................................................................
  447.           # some diagnostic output:
  448.           print '#EDR B', thisrow
  449.           #.....................................................................................................
  450.       return thisrow[ len( text_b ) - 1 ]