DETECTION OF IDENTICAL OBJECTS

in the MapInfo environment :

 

 

comments to help in the formulation of search strategies

and implementation in SKIMALL 1.0

Jacques Paris, October 1998

 

 

 

Detecting identical objects comes to finding objects that have the same characteristics; the determination of these characteristics is in itself a challenge. If many users are primarily concerned about finding objects that have been duplicated in some stage of generating a table, generally speaking, this domain is a much more complex one that would seem at first glance. In this paper, we will try to identify different aspects of this question and to formulate some recommendations for defining search strategies of identical objects in a MapInfo table.

The fact that MapInfo works with geographic objects that are displayed with graphic attributes and to which are attached tabular information forces us to deal with these three dimensions. Strategies must be formulated for each one, as well as a supra strategy is required for calling them one at the time or in any combination. Some technical considerations that must be taken into account are mentioned as a last dimension.

  • By strategy we mean a set of tests that are going to be applied to a pair of objects in order to find if they are different; these tests should be ranked by their increasing power to detect differences. The increase in power results generally in a greater cost of use because they normally require comparisons of growing complexity, involving more elements or detailed functions.

    As a consequence of not using all the available information, the simpler tests would introduce some risks of arriving at wrong decisions (not finding a difference when there may be one). The user would have to trade between the costs of implementing and using a given strategy and the more or less greater certitude of getting good decisions, i.e. that identical objects are properly "found" and that objects that are "found" are really identical.

  • Because the information relative to an object depends in a large measure on the nature of the object, each type of object must be dealt with separately. In this document, we are going to respect the order with which they are generally presented by MapInfo. The only type in the MI set that is left out is the "frame" object, the reason being that it is not a geographical object and it is not found in a MI table.

    MapInfo definitions of terms and the information one can extract from a MI table are the basis for some observations and comments on the potentialities for comparisons. They are followed by the outline of strategies for implementation, with in some cases, a breakdown into several levels of increasing strictness.

     

     

    THE GEOGRAPHIC DIMENSION

    Geographic identity has a double aspect that we will translate into three levels of requirements. The first level takes into account the location and size of the objects in their geographic space. the elements entering into play are generally speaking global characteristics (limits, measurements of some sorts).

    The internal structure of the objects enters into play at the second level. We mean by internal structure the sequence of elements in a composite object for example, or the tracing direction of a line (a polyline or the border of a polygon). This level would be of particular interest for example when direction of flow must be reflected in a network by doubling the arcs with the proper direction.

    A strict geographic definition would lead to a complete node to node verification of each pair of objects and constitutes the third level. This most secure procedure is also the most time-consuming one and strategies should call on it only after screening out objects with simpler and faster techniques that can detect obvious differences. It would also apply only to polylines and regions.

     

    ARC

    By definition, an arc is a portion of an ellipse. In a full ellipse, the MBR "frames" the principal axes. If start angle and end angle are in the same quadrant and start angle is smaller than end angle, the MBR will be that of a full ellipse because the two axis are intersected in both directions by such an arc. Any pair of arcs respecting that definition, even if their "missing'" part is not in the same quadrant, will have the same MBR. It seems essential to use start and end angles even in a basic comparison.

    Internal structure is reduced here to line direction; that characteristic is fully accounted for by the use of the start and end angles.

    strategy for implementation:

    There is a unique strategy: the bounding rectangle and the start and end angles are the required elements for a complete geographic identity.

     

    ELLIPSE

    The main axes of a MI ellipse are always horizontal/vertical. The MBR defines them, and the ellipse, entirely.

    Drawing direction is under MI control and is irrelevant.

    strategy for implementation:

    For a full, unambiguous definition the bounding rectangle is enough.

     

    LINE

    A line is fully defined by its start and end node coordinates.

    Drawing direction results from that definition.

    If simple geometric identity is required, one should compare start node of one with end node or start node of the other, and end node of one with start node or end node of the other. If structure is also taken into account, a direct start for start and end for end nodes comparison will reflect drawing direction.

    strategies for implementation:

    level 1 :

    simple geometric identity by superposition, using start and end node coordinates (START with start and END with end) or (START with end and END with start) .

    level 2 :

    adding line direction, using start and end node coordinates (START with start and END with end)

     

    POLYLINE

    Much detailed and global information is available for a polyline. First, at the most detailed level, each node coordinate is accessible. Then, globally, one can get the MBR, the total number of nodes and the total length of the polyline. If it is a multiple-segment object, one can add the number of segments and the number of nodes in each segment, but one cannot get directly at individual segment length.

    For the internal structure, one must proceed in two steps. If the polyline is a single-segment object, the line direction is given, as with a simple line, by end and start nodes. If it is a multiple-segment object, one can consider the sequence in which the segments are assembled and the line direction of each segment.

     

     strategies for implementation:

    level 1 :

    single-segment polyline:

  • bounding rectangle, object length, number of nodes, start and end node coordinates of unique segment (START with start and END with end) or (START with end and END with start)
  • multiple-segment polyline:

  • bounding rectangle, number of segments, object length, number of nodes
  • Testing for start/end node coordinates would require determining which segment to consider, thus having to deal first with the internal structure treated only at level 2.
  • level 2 :

    single-segment polyline:

  • adding line direction, start and end node coordinates of unique segment (START with start and END with end)
  • multiple-segment polyline:

  • sequence of the number of nodes in the segments.
  • level 3 :

    node coordinates for node to node comparison

  • For a multiple-segment polyline, identity of nodes requires comparisons between related segments, therefore a "breakdown" of multiple segment polylines in their basic parts, then the matching of these parts and finally the comparisons of their nodes taking into account or not the line direction.
  • global procedures

  • There are other resources that can be brought into play to avoid specific node to node comparisons, such as some object manipulating functions. These should yield some information indicating differences (with certitude) or similarities (with possible risk). If these should be proved efficient and not to "expensive" to implement, they may even be implemented at level 1.

    Here are some comments on the subject:

    If OVERLAYNODES() returns more nodes in than the original object, the objects are different. Given the fact that the 2 objects would already have the same MBR, even if they had the same length and the same number of nodes, they could have different forms, but there are cases where the objects can be different and no node is added (e.g. lines formed by 2 adjacent sides of the MBR on opposite corners).

    The polyline obtained by a COMBINE() of the two objects will form a one-segment polyline because end and start nodes have already been found as being coincidental. If the other nodes are coincidental, then the polyline is doubled up on itself, whatever be the line directions. As there is no "cancellation" of nodes but at its "hinge" (end of first original), its number of nodes will always be twice that of the original minus 1. It is of no help in our query.

    OBJECTLEN() could be used to compare two polylines, but it has certain constraints due to floating point operations. It should not be used without some 'fuzziness", and because of that, this test is not fully conclusive (see "technical note" in Geographic identity parameters level 1, in the implementation section)

    Other functions call for closed objects and are not applicable directly in this case. There is however an opening if polylines were transformed in "regions" with the BUFFER() function. Then the ERASE() function would be interesting : convert both objects to buffered regions with the same parameters and unit type, erase one buffered object by the other. If the area of the "erased object" is different from zero both polylines are different; if it is zero, they must be identical.

  • One may argue that MI "non-perfect precision" may yield a non-null area for an "ERASE()" of two objects considered identical. If these objects have the same exact nodes, the buffers would be identical, MI doing the same "non-precise" manipulations on both.

    There are, however, circumstances where this test will fail (non-null area for identical objects), in particular when lines are drawn in opposite directions; one would have to build in some "fuzziness" as one must do if using OBJECTLEN(), but the operations would be much simpler with a "fuzzy" AREA() comparison.

    One possibility using the ERASE(buffers) technique would be to reverse the direction of one of the polyline. It works but only on single-segment polylines; with multiple-segments polylines, the required programming and the resulting processing time do not justify the investment. Besides, it would require to go into the internal structure of a complex object, and that is beyond level 1 tests, as stated already in similar conditions.

  • We should note to finish that these functions work on single- as well as multiple-segment polylines, but do not take into account line direction or the sequence of segments in the polyline; if they are accepted for implementation, they should be introduced at level 1.

  • POINT

    A point is fully defined by its x and y coordinates.

    strategy for implementation:

    There is a unique strategy : the x and y coordinates are the required

     

    REGION

    Let us deal with the centroid right away. In MI, the centroid is not an element that can be relied upon for geographical comparisons because it is not attached rigidly to the object geometry nor determined uniquely by it. It is an element added by MI to any region but that can be moved at will by the user relative to the object itself within some limits. Because of that relatively loose "geometric" connection with the object form, it is too unreliable to be considered in a serious strategy.

  • This restriction could be lifted if all the centroids in the table were recomputed by MapInfo in order to guarantee that identical objects would be 'represented" by identical points. To achieve that, all regions must be selected, they should be transformed to polylines, then back to regions.
  • Besides the centroid, much detailed and global information is available for a region. First, at the most detailed level, each node coordinate is accessible. Then, globally, one can get the MBR, the total number of nodes and the total area of the region. If it is a multiple-polygon object, one can add the number of polygons and the number of nodes in each polygon, but one cannot get directly at individual polygon area.

    For the internal structure, one must proceed in two steps. One must first remember that the first and the last nodes of the polyline forming the border of a polygon are the same. If the region is a single-polygon object, the (border) line direction is given by start to penultimate nodes or second to end nodes. If it is a multiple-polygon region, then one can consider the sequence in which the polygons are assembled and the (border) line direction of each polygon.

    strategies for implementation:

    level 1 :

    single-polygon region:

  • bounding rectangle, object area, number of nodes, start and end node coordinates of unique polygon (START with start and END with end) or (START with end and END with start)

    START|END is first|penultimate or second|last nodes

  • multiple-polygon region:

  • bounding rectangle, number of polygons, object area, number of nodes
  • Testing for start/end node coordinates would require determining which polygon to consider, thus having to deal first with the internal structure treated only at level 2.
  • level 2 :

    single-polygon region:

  • adding line direction, start and end node coordinates of unique polygon (START with start and END with end)

    START|END is first|penultimate or second|last nodes

  • multiple-polygon region:

  • sequence of the number of nodes in the polygons.
  • level 3 :

    node coordinates for node for node comparison

  • For a multiple-polygon region, identity of nodes requires comparisons between related polygons, therefore a "breakdown" of multiple-polygon regions in their basic parts, then the matching of these parts and finally the comparisons of their nodes taking into account or not the line direction.
  • global procedures

  • We can refer to the polyline section for the introduction to that subject. OVERLAYNODES() has the same drawback : an increase in the number of nodes will detect a difference but the same number after the operation does not mean identity.

    COMBINE() could be used to determine some "values" for the resulting object. If "combined" values (in particular number of nodes, number of polygons, and area) are different from one of the two original objects, those objects are certainly different; but identity of values should not necessarily imply identity of objects.

    AREA() opens the same reservation with the fuzziness attached to its calculation (see "technical note" in Geographic identity parameters level 1, in the implementation section)

    ERASE() of one region by the other should yield an area of zero if they are identical. But that zero result could also be obtained when a smaller object is entirely erased by a larger one; therefore, if used alone, the ERASE() should be carried out in both directions (A to erase B and B to erase A)

  • However, if "equality" of area has already been proven, what is tested here are differences of shapes of two equal area objects. With this qualification, no object can hide totally behind the other, and one look at the area of the "erased" object should be enough.

    This test seems to be not affected by the direction with which the polygons have been drawn.

  • The same remarks as those made for polylines apply here : these functions work on single- as well as multiple-polygon regions, but do not take into account line direction or the sequence of polygons in the region; if they are accepted for implementation, they should be introduced at the level 1.

  • RECTANGLE

    A rectangle is fully defined by the x and y coordinates of two opposite corners.

    Drawing direction is under MI control and is irrelevant.

    strategy for implementation:

    There is a unique strategy : using the bounding rectangle

     

    ROUNDED RECTANGLE

    A rounded rectangle is defined by the x and y coordinates of two opposite corners and by the corner radius.

    Drawing direction is under MI control and is irrelevant.

    strategy for implementation:

    There is a unique strategy : using the bounding rectangle and the corner radius

     

    TEXT

    The MBR of a text object is not reliable in itself because its width and its height are defined not only by the text length but also by style considerations (point size, font, character style..). There is however a constant, the so-called anchor point, i.e. the upper left corner of MBR (XMIN, YMAX) that serves as the center for rotating the text box and that remains in its position when style parameters are changed.

    strategy for implementation:

    There is a unique strategy : using the anchor point (XMIN, YMAX of the MBR), the angle of rotation, and the text string itself.

     

     

     

    THE GRAPHIC DIMENSION

    Generally speaking, the graphic parameters of MI objects are contained in style variables (pen, brush, symbol, font). Some formatting parameters are added for text.

    Style variables are complex entities and cannot be compared globally; one should extract each style component and do a multiple comparison. This technique necessary when dealing with style manipulations is not practical here. The alternative is to transform each style clause into a string variable making one-shot comparisons possible.

    ARC pen

    ELLIPSE pen + brush

    LINE pen

    POLYLINE pen

    POINT symbol

    REGION pen + brush

    RECTANGLE pen + brush

    ROUNDED RECTANGLE pen + brush

    TEXT font

  • also available as text formatting parameters : text justification and text spacing, could be eventually treated as a level 2 strategy.
  • THE CONTENTS DIMENSION

    Each MI table contains at least one data column. Comparisons can be done on the tabular contents of a table. The search can involve a single column, or several in succession or all together. There are no known limitations or constraints to such comparisons.

     

     

     

    THE TECHNICAL DIMENSION

    Systematic comparison in a table implies that each object must be compared to each other one and that means that (n-1)*n/2 basic comparisons have to be carried out in a table containing n objects. Performance will be influenced by two main considerations: access time to need data and comparison algorithm.

    For most geographic information, the data has to be extracted via calls to some functions. Overall performance can be improved if the data to be compared is stored in memory under the format required for the comparisons. If data manipulation functions are used, speed can be improved by storing object definitions for objects (of specific types only) and using object variables for manipulations and calculations.. There may be, however, a limit to the size of temporary arrays with some hardware configurations.

    The comparison algorithm reveals the way and the order in which the various elements enter into play. The faster two objects can be tagged as different and the algorithm exited, the shorter will be the elapsed time. Ranking the elements by their capacity at distinguishing differences may reveal various priorities among the users, but may also be the source on improved performance.

    Implementation of search strategies for identical objects in

    SKIMALL v.1.00

     

     

    limitation on table size

    As all the operations are carried out in memory, the maximum size of the table that can be processed by this application is of 32767 objects, which is the MapBasic upper limit for an array.

    If a table exceed that size, the user should split it in smaller files. If geographic comparisons are used, the split can be done on the basis of continuous sub-regions; it there were identical cases, they would have to be in the same sub-region. Other criteria will have to be used according to the chosen strategy.

    Even if splitting the original file requires some extra work, the time that can be saved in actual processing

    of smaller files may largely compensate for the extra preparatory work.

     

    global principles

    Comparisons are made between objects of the same type.

    Elements are checked out in the order of the list.

    The bounding rectangle coordinates are treated in the order Xmin, Ymin, Xmax, Ymax.

    The general sequence (it can be reduced by the user's choices) is

    geographic identity parameters level 1

    geographic identity parameters level 2

    style identity parameters level 1

    style identity parameters level 2

    contents identity parameters

     

     

    geographic identity parameters

    level 1

    Geographic information of level 1 is applied to all objects of the same type.

    ARC bounding rectangle, beginning and end angles

    ELLIPSE bounding rectangle

    LINE end points (no line direction)

    POLYLINE bounding rectangle, number of nodes, number of segments, length ("fuzzy"

    technique, see below "technical note")

    POINT x-y coordinates

    REGION bounding rectangle, number of nodes, number of regions, area and shape

  • (ERASE() technique)
  • RECTANGLE bounding rectangle

    ROUNDED RECTANGLE bounding rectangle, rounded corner radius

    TEXT anchor point, text string, rotation angle

  • technical note :
  • As the AREA() and OBJECTLEN() functions of MI yield slightly different results depending on the "line direction", equality is assumed when the difference between the two measures is found to be less than a threshold.

    MI uses float variables (8 byte IEEE format) to store the results of these functions. With such a format and given the cumulative building of the results, the last digits may be variously affected by the sequence with which the terms of the sum are entered. The threshold must then be stated in function of the size of the result itself; the larger the value, the smaller the number of significant decimals. As it seems that 12 digits is the maximum, we can make a pretty sure bet that the difference would affect only the last 3. the threshold can then be fixed at 1 billionth (e-9) of the value.

  • level 2

    Geographic information of level 2 is added to those of level 1.

    LINE line direction

    POLYLINE sequence of number of nodes in multiple-segment polyline

    REGION sequence of number of nodes in multiple-polygon region

     

     

    style identity parameters

    level 1

    ARC : PEN style

    ELLIPSE : PEN and BRUSH styles

    LINE : PEN style

    POLYLINE : PEN style

    POINT : SYMBOL style

    REGION : PEN and BRUSH styles

    RECTANGLE : PEN and BRUSH styles

    ROUNDED RECTANGLE : PEN and BRUSH styles

    TEXT : FONT style

    level 2

    Style information of level 2 is added to those of level 1.

    TEXT : line spacing and text justification

     

    contents identity parameters

    The contents of all the columns (up to six) are checked in the order they have been selected by the user.