Improving the speed of ApplyExpressions

Hi,

I’m trying to speed up a part of our software that runs too slowly to be acceptable to the end user. The main holdup is the call to Shapefile.Categories.ApplyExpressions, so I’m looking at ways to speed it up.

What I am doing is drawing (for instance) approx 300,000 polygons (filled).

Do you have any suggestions? Will using an OgrLayer make any difference? I would go ahead and try it, but I’m having trouble getting an OgrLayer to display at all right now, so I want to check it’s worth the effort before pressing ahead any further.

Would it be worth looking into using a DrawingLayer? One issue I foresee is that I need to be able to apply transparency to the layer, which I don’t think is possible. (Edit: I’ve just tried this; once drawn, it seems too slow when zooming, panning, etc.)

Any other ideas? Can I somehow cache the results of ApplyExpressions?

Regards,

Rob H

Hi Rob,
300,000 polygons are far too many to display… I have a similar problem, but what I do is load only the polygons I need via SQL, like:

```pascal
sql := 'SELECT * ' +
       'FROM "table" ' +
       'WHERE "table".geom && ' +
       'ST_MakeEnvelope(' + ToStr(XMin,1,2) + ',' +
       ToStr(YMin,1,2) + ',' +
       ToStr(XMax,1,2) + ',' +
       ToStr(YMax,1,2) + ')';

layer := OGR.RunQuery(sql);
Lay := Map1.AddLayer(layer, true);
```

Then, if I need another portion of the map, I do:

```pascal
sql := { define the new query here };
Map1.OgrLayer[Lay].RedefineQuery(sql);
```

I am still thinking about how to handle the data…

Alex

Hello @robhoney

In addition to minimizing how much data is loaded, it may help us to understand more about your context. But here are some initial things to consider:

  1. How volatile is the data? Is it changing often? How much of it changes at any given time?
  2. You can use this information to determine how often to call ApplyExpressions, which is only necessary when the data changes.
  3. If possible, you could perhaps pre-identify which category of data has changed and then call ApplyExpression for that category only.
  4. You can also set a category on a single shape (set_ShapeCategory), so if changes come in less frequently, you could perhaps deduce the category for the singleton, and only make the single change.
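Items 3 and 4 could be combined by diffing the new timestep against the old one: find which categories are affected and which individual shapes moved. Here is a minimal sketch in Python, kept independent of the OCX (the `bin_of` categorizer and the value lists are hypothetical):

```python
def changed_categories(old_values, new_values, categorize):
    """Compare two timesteps: return the set of category indices that
    gained or lost members (candidates for a per-category
    ApplyExpression), plus (shape index, new category) pairs suitable
    for individual set_ShapeCategory calls."""
    touched = set()
    moves = []
    for i, (old, new) in enumerate(zip(old_values, new_values)):
        if old == new:
            continue
        before, after = categorize(old), categorize(new)
        if before != after:
            touched.update((before, after))
            moves.append((i, after))
    return touched, moves

# Hypothetical categorizer: 10 equal-width bins over 0..1000
bin_of = lambda v: min(int(v // 100), 9)
touched, moves = changed_categories([10, 150, 950], [10, 450, 955], bin_of)
```

If only a handful of shapes moved, `moves` can be applied with set_ShapeCategory calls instead of re-running the whole classification.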

In response to some of your other questions:

  • An OgrLayer may perform updates and reload more quickly, but in and of itself would not affect the category generation, since that is done in the in-memory Shapefile.
  • You said that your concern is in the ApplyExpressions call, not the rendering itself? If there were any concerns about rendering, you could tag the layer as Volatile, in which case you could redraw that layer without drawing any of the base data layers that may be underneath.

Hi @jerryfaust,

Thanks for your response. I’m not sure I can minimize the data I’m loading. It is very common for the user to look at an overview of the whole thing.

To expand on what I’m doing a bit further: I have a set of results I display from a simulation at a range of time steps. For each time step, I have values for a range of “cells” (each being, e.g., a 5m square area). In the case mentioned above, there are 300,000 of these. To do this I create a shapefile containing 300,000 polygons (squares), adding a field to the attribute table on which to apply the category. I’m then creating typically 10 categories based on this field, each one being of the:

lower bound < value < upper bound

variety. I then call ApplyExpressions.

So this is all for 1 timestep. The user can click anywhere on a list of timesteps (can be in the 100s) to load the data for that timestep (so potentially every cell/polygon’s data has changed). Then the creation and populating of the data is run again for the new timestep. In this process it is the “ApplyExpressions” that is the bottleneck.

So to answer #1 - it is very volatile, it changes often, and a lot of it can change at any given time!

For what it’s worth - there is another piece of software out there that does the same thing and does it very fast, so in theory there’s a way…

Thanks @AlexBV too, for your response. All the data I’m loading is from a binary format, I’m not using a spatial db.

Regards,

Rob

Hello Rob. Thanks for the detail.

I need a day or so to look further into this. With the little time I’ve spent in the Expression parsing code, I believe it is just that: string parsing to sift out whether or not each shape satisfies the conditions. If it is indeed always parsing the expressions, then it is not going to be inherently fast for that many items. I will verify.

There are a few options, however, for defining categories. Look at ShapefileCategory.ValueType (tkCategoryValue).

  1. If you are using tkCategoryValue.cvExpression, and passing in a string expression, that is likely the slowest.
  2. The next option to try would be to use tkCategoryValue.cvRange, since you are using a Range, you can specify the MinValue and MaxValue. If that is still too slow, my suggestion would be #3.
  3. You could try specifying tkCategoryValue.cvSingleValue. But to do this, you would have to do some programmatic preprocessing. Add a single integer field to the shapefile, effectively indicating which category each shape is in. You would then preprocess the data prior to loading the data for the next timestep: precalculate which category each shape falls into based on its value. Then when you load the data, you have already done the work and each shape is already in its category. Hopefully in this way, presuming that you can process the data faster than the OCX, the single-value category rendering will be the fastest. Does this make sense?
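The preprocessing in #3 can be sketched in Python, independent of the OCX (the bin edges here are hypothetical; in practice they would be your 10 category bounds):

```python
import bisect

def precompute_category_field(values, edges):
    """Precalculate the integer category for every cell before loading
    the timestep, so the rendering side only has to match single
    values. `edges` holds the range boundaries (n+1 numbers for n
    categories); category i covers edges[i] <= value < edges[i+1].
    Out-of-range values are clamped into the first/last category."""
    last = len(edges) - 2
    return [min(max(bisect.bisect_right(edges, v) - 1, 0), last)
            for v in values]

edges = [i * 100 for i in range(11)]   # hypothetical bounds 0..1000
field = precompute_category_field([5.0, 250.0, 999.9, 1200.0], edges)
```

A single pass like this over 300,000 values is cheap; the resulting integers would be written to the extra shapefile field before the timestep is loaded.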

Regards,
Jerry.

Hi @jerryfaust

Thank you for your help and for looking into it further. I am indeed using the expression rather than the range option, so I’m keen to try out #2 and #3. I have only used the expression before, and as of yet I haven’t got the range working. I assumed I would need to set Shapefile.Categories.ClassificationField to the index of the field I want to use, set the value type of each category, and set the min and max values.

I am generating these manually, not using the “Generate” functions. Should I be calling something else afterwards, instead of ApplyExpressions? I’ve experimented with ApplyColorScheme, but in every case all cells are given the same colour. I think I’m missing something simple.

Regards,

Rob

Hello Rob.

You’re correct in that you don’t need to call Generate or ApplyColorScheme. I don’t know enough about AddRange, either, but each of these automates the process to some extent. I would just try defining the categories very specifically, so that you know exactly what you’re getting.

Also, the Min and Max values are actual values, not field indices. So you would do something like:

```vb
Dim cat As ShapefileCategory = sf.Categories.Add("Less than 100")
cat.ValueType = tkCategoryValue.cvRange
cat.Expression = "[FieldName]"
cat.MinValue = 0
cat.MaxValue = 100
cat.DrawingOptions.FillColor = 255 ' Red
```

After adding each of the categories, you would call sf.Categories.ApplyExpressions, as you are doing.
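For the bookkeeping side, building all ten contiguous range definitions can be sketched as plain data in Python, independent of the OCX (each tuple corresponds to one cvRange category like the snippet above; the bounds are hypothetical):

```python
def make_range_categories(lo, hi, n):
    """Build n contiguous (name, MinValue, MaxValue) definitions
    covering [lo, hi): each category's MaxValue is the next
    category's MinValue, so no value falls between categories."""
    width = (hi - lo) / n
    return [(f"{lo + i * width:g} to {lo + (i + 1) * width:g}",
             lo + i * width,
             lo + (i + 1) * width)
            for i in range(n)]

cats = make_range_categories(0, 1000, 10)
```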

Thanks @jerryfaust, what you’ve spelt out is exactly what I’m doing though.

It’s a very minor change from what I was doing really - changing from the expression string to the above. I’ll keep looking, but I can’t see anything wrong.

Regards,

Rob

I’ve never actually used the Range values; so maybe I’m overlooking something as well.

And it may well be that this is not much faster than the full Expression, since in the code, it passes through the same parser logic. Hopefully it’s not too much trouble for you to find out. Otherwise the single-value method will be your best bet.

Regards,
Jerry.

Hi @jerryfaust,

I’ve now tried the single value approach and that has sped things up significantly - about 40x faster. I still have to see if I can actually use that approach in production, but I’ll do that next.

One thing to note: you have to set the ClassificationField property (rather than the Expression) for it to work. Without it, it doesn’t work (and for some reason runs 5x more slowly).

Interestingly, for both cvSingleValue and cvRange, ApplyExpressions runs about 5x slower than with cvExpression when they “don’t work”.

I’d still be interested in trying the cvRange out because, as I mentioned, using the cvSingleValue may be an issue for my production code. Note - setting ClassificationField didn’t help for cvRange.

Regards,

Rob

Thanks Rob.

When I get a chance, I will try to improve the documentation as it relates to the Categories; specifically, which fields to use for each category type.

Interestingly, reviewing the code, here’s what I see.

For a single-value category

  1. iterate the categories 1 time to get each value (let’s say 10 iterations in your case)
  2. iterate the shapefile 1 time to place each shape into a category (300,000 iterations in your case)
  3. to summarize, 1 full iteration of the shapefile

For an expression-based category (which includes the Range type)

  1. iterate the categories (x 10)
    a. build the expression
    b. iterate ALL shapes to determine if a shape satisfies the expression (x 300,000)
    c. at least it bypasses shapes that have already been categorized, but it still has to iterate over all of them
  2. to summarize, 10 full iterations of the shapefile, or 3,000,000 iterations
  3. perhaps the only benefit to the Range expression is that it can build the expression more quickly than a full string parsing of the same range expression, if that makes sense
    a. Addendum: it may be that it can also more quickly evaluate the expression on each of the 300,000 iterations; but I’d have to dig deeper to find out
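The two passes described above can be modeled in a few lines of Python to see the iteration counts side by side (toy data only; this is not the OCX’s actual internals):

```python
def categorize_by_expression(values, predicates):
    """Expression/range path: each of the C categories scans all N
    shapes, skipping shapes already assigned to an earlier category."""
    cats, iterations = [None] * len(values), 0
    for c, pred in enumerate(predicates):
        for i, v in enumerate(values):
            iterations += 1
            if cats[i] is None and pred(v):
                cats[i] = c
    return cats, iterations

def categorize_single_value(values, lookup):
    """Single-value path: one direct lookup per shape."""
    return [lookup[v] for v in values], len(values)

values = [0, 1, 2, 1, 0]
preds = [lambda v, c=c: v == c for c in range(3)]  # one predicate per category
by_expr = categorize_by_expression(values, preds)
by_single = categorize_single_value(values, {0: 0, 1: 1, 2: 2})
```

Both produce the same assignments, but the expression path costs C×N iterations (3,000,000 for 10 categories over 300,000 shapes) versus a single pass of N for the single-value path.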