3

I have a points data set of toponyms with several thousand elements. There are a lot of duplicate names that fall in two categories:

  1. legit ones, i.e. different places with the same name at several km of distance
  2. points that have been entered two (rarely more) times at a somewhat slightly different position but actually refer to the same place

I'd like to filter the data to show only the nearest homonyms (which are likely in category 2) so that I can decide which one to keep, without the distant ones getting in the way. I looked at all the clustering, matrix and nearest neighbor tools I could find in QGIS (including SAGA and GRASS in the toolbox), but they all work on the entire data set, while I want to measure distances only between points that share the same value in the name column.

Any tips? I can switch to other (free) software if necessary.

Tartamillo

3 Answers

2

I eventually solved my problem in a kludgy way. I'll briefly explain the procedure below for those interested, but I'm sure that there is a simpler way.

First I added a column to count duplicates using the expression: Count(1, "name")

Then I extracted only the names that had repetitions and split the file by the name attribute. The next step was to calculate the distance between points with v.distance in batch mode from the toolbox. It's not as easy as it seems, because QGIS dies if you batch process too many files; I had to do around 200 files per batch, and it's a pain to set up. At this point I merged the files, which now had the distance field added. Points with a distance above a threshold were automatically marked OK; the others went to manual review.
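The same logic (group by name, measure distances only inside each group, compare against a threshold) can also be sketched in plain Python. This is an illustrative sketch, not the procedure actually used above: the function names, the `(name, x, y)` tuple layout and the threshold value are all assumptions, and the coordinates are assumed to be in a projected CRS so that Euclidean distance makes sense.

```python
import math
from collections import defaultdict


def nearest_homonym_distances(points):
    """Return, for each point, the distance to the nearest OTHER point
    with the same name, or None if the name is unique.

    `points` is a list of (name, x, y) tuples (hypothetical layout).
    """
    # Group point indices by name so distances are only measured
    # within each group of homonyms.
    by_name = defaultdict(list)
    for idx, (name, _x, _y) in enumerate(points):
        by_name[name].append(idx)

    result = [None] * len(points)
    for idxs in by_name.values():
        if len(idxs) < 2:
            continue  # unique name, nothing to compare against
        for i in idxs:
            _, xi, yi = points[i]
            result[i] = min(
                math.hypot(xi - points[j][1], yi - points[j][2])
                for j in idxs
                if j != i
            )
    return result


def needs_review(points, threshold):
    """True for points whose nearest homonym lies within `threshold`."""
    return [
        d is not None and d <= threshold
        for d in nearest_homonym_distances(points)
    ]


# Three "foo" and one "bar"; the last "foo" is far away.
sample = [("foo", 0, 0), ("bar", 1, 0), ("foo", 2, 0), ("foo", 5000, 0)]
print(nearest_homonym_distances(sample))
print(needs_review(sample, threshold=100))
```

The brute-force comparison inside each group is quadratic in the group size, which is usually fine here because each name is shared by only a handful of points.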

Tartamillo
  • can you please show the demonstration in a brief way – Bruno B Apr 21 '21 at 09:27
  • I don't think it would be useful. As said it's a kludge I've resorted to because I couldn't find any better, but it was very time consuming. Since then I experimented with the overlay_nearest function suggested by @Babel and made it work for my case. I posted a separate answer using that function, try that one. – Tartamillo May 12 '21 at 21:02
1

Since QGIS version 3.16, you can use the expression function overlay_nearest to find the nearest neighbouring point. With its help, you can build an expression that compares the name of the current feature with the name of its nearest neighbour and, if they are the same, highlights the feature. Edit: As requested in one comment, a detailed explanation of how this expression works and is built can be found at the bottom of this answer.

That's what I did in the screenshot below, highlighting the points that have the same name as their nearest neighbour with a blue circle. I added a symbol layer of the type geometry generator, geometry style: point and added this expression:

if ( 
    "name" =
    attribute (
        get_feature_by_id (
            @layer,  
            array_first (
                overlay_nearest( @layer, $id)
            )
        ), 
        'name' ) , 
        $geometry, 
        ''
)

You can use the same expression to create a new field in the attribute table using the field calculator. Just replace the third-to-last and second-to-last lines. Instead of

        $geometry, 
        ''

insert

        'true', 
        'false'

to get true in case of the features highlighted on the screenshot and false for the others.

On this screenshot you see the two QGIS and OpenSource names highlighted with blue dots, but not the other two instances of these names (red rectangles) further away because their nearest neighbour has another name:

[Screenshot: points with the same name as their nearest neighbour highlighted with blue dots]

Edit: Explanation of the expression. The step-by-step reconstruction starts on line 7 of the expression. When writing complex expressions, you start somewhere and then use that output as the input of the next function, and so on. This gives you a nested hierarchy of functions (visually structured by indentation; unlike in Python, indentation in QGIS expressions is not mandatory and is purely visual), where the "beginning" is somewhere "inside" the complex expression. Logically, though, it starts from there, and you can try to reproduce each step separately in the expression editor.

  1. overlay_nearest( @layer, $id) gets the nearest item on the same layer (@layer) the expression is applied on and returns the id ($id) of this item.

  2. As the function in step 1 returns an array, you must convert it (get one of the items in the array) to be able to use this further. The array contains only one item (the id of the nearest feature), thus get the first element of the array: array_first ([1]) - [1] is the expression from step 1

  3. We now have the id of the nearest feature; next we need to get the feature itself. It is on the current layer (@layer), thus we use this expression: get_feature_by_id (@layer, [2]) - [2] is the expression from step 2, thus the id value we calculated in steps 1 and 2.

  4. Now, from the feature of step 3, we want to get the value of its attribute name, thus we use this expression: attribute ( [3], 'name' ), where [3] is the expression from step 3 (= the nearest feature to our current feature).

  5. Now we use an if-clause, referring to the value contained in the field name of the current feature. We refer to a field using double quotes, thus "name". If the name of the current feature is equal to the name of the nearest feature (which we calculated in step 4), then QGIS should plot the actual geometry ($geometry), otherwise nothing (empty, written as two single quotes: ''). All together: if ( "name" = [4], $geometry, '' ). And here you are with the expression from above.
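For readers who find the nesting easier to follow as ordinary code, the five steps above can be mimicked in plain Python. This is only a rough analogue of the logic, not QGIS code: the function name and the `(name, x, y)` tuples are hypothetical, and a brute-force search stands in for overlay_nearest.

```python
import math


def same_name_as_nearest(points):
    """For each (name, x, y) point, report whether its single nearest
    neighbour (any name) carries the same name."""
    flags = []
    for i, (name, x, y) in enumerate(points):
        nearest = None
        best = math.inf
        # Steps 1-3: find the nearest other feature on the same "layer".
        for j, (_other_name, ox, oy) in enumerate(points):
            if j == i:
                continue  # never compare a point with itself
            d = math.hypot(x - ox, y - oy)
            if d < best:
                best, nearest = d, j
        # Steps 4-5: read the neighbour's name and compare it with ours.
        flags.append(nearest is not None and points[nearest][0] == name)
    return flags


sample = [
    ("QGIS", 0, 0), ("QGIS", 1, 0),
    ("OpenSource", 10, 0), ("OpenSource", 11, 0),
    ("QGIS", 50, 0), ("Other", 51, 0),
]
print(same_name_as_nearest(sample))
```

In this made-up sample the third "QGIS" point is not flagged, because its nearest neighbour has a different name. That mirrors the red rectangles described for the screenshot.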

Babel
  • Thanks @babel, it's an interesting function but that's not exactly what I'm after. I probably wasn't clear, I'll make an example. Let's say I have three points named "foo" and one "bar" arranged like this: foo bar foo foo I'd like to calculate the distance between the various "foo" ignoring the "bar". Then I can filter the points to only show duplicate names that fall within a certain radius to validate that they are actually duplicates. Actually I've since kind of "solved" the problem in a rather inelegant way, I'll post it as an answer. – Tartamillo Dec 24 '20 at 21:12
  • @babel can you please explain it in detail. Thanks in advance.... – Bruno B Mar 18 '21 at 11:52
  • @BrunoB : see above, I added an explanation – Babel Mar 18 '21 at 13:44
  • @babel no point data is visible after applying this expression in geometric generator – Bruno B Mar 18 '21 at 17:34
  • Don't change the function get_feature_by_id to get_feature_by_Unique_id or something else; this is a pre-defined function name. Only rename the field name (in my example once 'name' and once "name"); the rest remains exactly the same – Babel Mar 18 '21 at 18:31
  • @babel It will be helpful to me if you can share me the attribute table of your data or screenshorts of the application process in detail. – Bruno B Mar 19 '21 at 08:47
  • @babel its working fine!!! – Bruno B Mar 19 '21 at 09:23
  • @babel cant we export the highlighted points separately – Bruno B Mar 19 '21 at 09:27
  • Yes, you can select the points using select by expression and replace the $geometry, '' part in the third-to-last and second-to-last lines with true, false. Then export the layer and check the box next to Save only selected features. Or you can use the changed expression to create a new boolean field, so that the result is permanently saved in your data. – Babel Mar 19 '21 at 09:50
  • @babel can't we add any distance to this expression provided – Bruno B Mar 22 '21 at 06:02
  • Yes, you can add a maximum distance, see the syntax of the function (the help in the expression editor is quite good): overlay_nearest(layer[,expression][,filter][,limit=1][,max_distance][,cache=false]) - thus replace the overlay_nearest part of the expression with something like: overlay_nearest( @layer, $id, max_distance:= 10) - arguments in square brackets [] are optional. You have to add the name of the argument if you don't use all of them in the intended order: as we skip the filter and limit arguments, we have to explicitly name the max_distance argument. – Babel Mar 22 '21 at 08:08
  • @Babel some duplicates are not highlighted when using this. Is there anything wrong with versions or something else? – Bruno B Apr 21 '21 at 06:10
  • No idea - without seeing the data, your expression and the use case, it's extremely difficult to say what went wrong. Maybe ask a new question here for this and refer to this one here. In this case clearly describe what you've tried and where you're stuck. Including data and screenshots is always a good idea. – Babel Apr 21 '21 at 07:19
1

Eventually I found a simple and effective way. The expression

array_contains(
                overlay_nearest( @layer ,
                    max_distance:=1000 ,
                    expression:="name" ,
                    limit:= 40 )
                , "name")

will return true if, within a radius of 1000 layer units, there is at least one other feature with the same name. Note, however, the limit parameter in the example: the expression will only examine the 40 closest features. According to the inline help, setting the limit to -1 should return all the features within the radius, but in my QGIS 3.16.5 it doesn't, not sure why. Anyway, just set a reasonably high value and it works.
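To make the effect of the limit parameter concrete, here is a plain Python analogue of the expression. It is a sketch of the logic only, not QGIS code: the function name and the `(name, x, y)` tuples are hypothetical, and treating the radius as inclusive is an assumption.

```python
import math


def has_homonym_nearby(points, max_distance, limit):
    """For each (name, x, y) point, collect the names of up to `limit`
    nearest other points within `max_distance` and report whether the
    point's own name is among them."""
    flags = []
    for i, (name, x, y) in enumerate(points):
        candidates = []
        for j, (other_name, ox, oy) in enumerate(points):
            if j == i:
                continue  # the current feature is not its own neighbour
            d = math.hypot(x - ox, y - oy)
            if d <= max_distance:  # inclusive radius: an assumption
                candidates.append((d, other_name))
        candidates.sort(key=lambda c: c[0])  # nearest first
        nearest_names = [n for _d, n in candidates[:limit]]
        flags.append(name in nearest_names)
    return flags


sample = [("foo", 0, 0), ("bar", 1, 0), ("foo", 2, 0), ("foo", 5000, 0)]
print(has_homonym_nearby(sample, max_distance=1000, limit=40))
print(has_homonym_nearby(sample, max_distance=1000, limit=1))
```

With a generous limit the two close "foo" points are flagged and the distant one is not. With limit=1 only the single nearest neighbour ("bar") is examined, so nothing is flagged, which is why the limit has to be set high enough.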

Thanks again @Babel for pointing me to the overlay_nearest function, I've used it a lot for other stuff and it's really cool.

Tartamillo