Reading Geoprocessing considerations for shapefile output by Esri, I see that the 2GB limit translates to roughly 70 million Point features.

When looking at the ESRI Shapefile Technical Description, the header has a field at Byte 24 representing the file length in 16-bit words. It is a signed 4-byte integer, so the maximum positive value it can represent is 2,147,483,647. This value is the total number of 16-bit words in the file, including the 50 16-bit words of the header.

If we start with that maximum value and subtract the 50-word header, we get 2,147,483,597. That is the maximum number of 16-bit words available for Point records.

Per the spec, a record header is 8 bytes, i.e. four 16-bit words. The content length field of a record header gives the length of that record's content in 16-bit words. A Point record's content is 20 bytes, i.e. ten 16-bit words. Therefore each Point feature takes a total of fourteen 16-bit words (4 for the record header, 10 for the Point content).
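
A quick sanity check of that bookkeeping in Python (a sketch; the constant names are mine, not from the spec):

RECORD_HEADER_BYTES = 8          # record number (4) + content length (4)
POINT_CONTENT_BYTES = 20         # shape type (4) + X (8) + Y (8)

words_per_point = (RECORD_HEADER_BYTES + POINT_CONTENT_BYTES) // 2
print(words_per_point)           # 14 16-bit words, i.e. 28 bytes per Point feature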

From this, how is a rough maximum of 70 million Point features derived?

It would appear the 2GB limit was Esri-imposed. If we assume the following:

limit = (1024^3) * 2 = 2147483648 bytes
limit - 100 = 2147483548 bytes (header removed)
(limit - 100) / 28 bytes = 76695841 Points
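
The same arithmetic as a small Python sketch (assuming "2GB" here means 2 * 1024^3 bytes):

limit = 2 * 1024**3              # 2,147,483,648 bytes
payload = limit - 100            # drop the fixed 100-byte file header
print(payload // 28)             # 76695841 Points at 28 bytes per Point record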

So it actually has nothing to do with Byte 24, but rather with the .SHX 32-bit offsets.

However, Byte 24 in the main file and the .SHX offsets represent the number of 16-bit words in the file. Assuming our file were all Points and Byte 24 were maxed at 2,147,483,647, that implies there are 2 * ((2^31) - 1) bytes in the file available for addressing. This would put the physical limit of the file, in terms of memory addressing, at roughly 4GB. That is to say:

byte_24 = (2^31) - 1 = 2147483647 total 16-bit words
byte_24 - 50 = 2147483597 16-bit words of record data (header removed)
byte_24 * 2 = 4294967294 total bytes of file size
(byte_24 * 2) / 1024^3 = ~4GB
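
Checked in Python (the variable names are mine):

byte_24 = 2**31 - 1                  # 2,147,483,647 16-bit words (max signed 32-bit value)
content_words = byte_24 - 50         # 2,147,483,597 words left after the 100-byte header
file_bytes = byte_24 * 2             # 4,294,967,294 bytes
print(file_bytes / 1024**3)          # ~4.0, i.e. roughly 4GB of addressable file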

So there is nothing preventing a .SHP or .SHX file (other than conformance) from growing beyond the 2GB limit; the .DBF (dBase) side may have a different impact. This could even be increased to 8GB if unsigned integers were used (replace byte_24 with (2^32) - 1).

pstatix
  • I am sure that this is a fairly gross approximation. Historically, this limit has to do with dBase4 (dbf) addressing limitations. There is some incidental overhead in the dbf attribute requirements. There is the bit requirement for a given coordinate precision and then the bit requirement for minimum attribution of the points in the dbf. I would, however, not take this as an exact threshold, thus the verbiage "roughly 70 million" point features. – Jeffrey Evans Jan 24 '20 at 18:52
  • In the GDAL documentation the limit is indeed said to be 8 GB, and no limit at all for the .dbf part (https://gdal.org/drivers/vector/shapefile.html): "Geometry: The Shapefile format explicitly uses 32bit offsets and so cannot go over 8GB (it actually uses 32bit offsets to 16bit words), but the OGR shapefile implementation has a limitation to 4GB. Attributes: The dbf format does not have any offsets in it, so it can be arbitrarily large." – user30184 Jan 24 '20 at 21:29
  • @user30184 If I'm not mistaken, it was documented somewhere that GDAL "hacks" the specification to use unsigned integers, that's why. – pstatix Jan 24 '20 at 22:08

1 Answer


The "equate" in your title is probably too strong a representation for a document which uses "roughly 70 million" points.

The maximum file size for a .shp or .shx is 2^31-2, not 2^31, because Esri chose to keep the 2^31-1 filesystem limitations in existence when the format was published (see this answer to Why are Shapefiles limited to 2GB in size?), and the 16-bit word size assures that the file size will always be even.[1]

The general formula, subtracting the fixed 100-byte file header from the limit, is:

floor((2**31-2 - 100) / bytes_per_feature)

so the correct maximum for 2D points (fixed 28 bytes/feature) is

floor((2**31-2 - 100) / 28) = 76,695,840 features
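
That formula as a minimal Python sketch (the function name and default limit argument are mine, not Esri's):

def max_shp_features(bytes_per_feature, shp_limit=2**31 - 2):
    """Features that fit in one .shp after the fixed 100-byte file header."""
    return (shp_limit - 100) // bytes_per_feature

print(max_shp_features(28))      # 76695840 2D Point features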

Sure, "roughly 76 million" might be a better way to word this, but "roughly 80m" and even "roughly 77 million" would be rounding up too much.

If the 2D points were stored as degenerate MultiPoint records (don't do this!) then the cost from the .shp side is 64 bytes per feature (56 bytes of record content plus the 8-byte record header), which works out to:

floor((2**31-2 - 100) / 64) = 33,554,430 features

3D points (PointZ) have a 44-byte footprint (the 8-byte record header plus 36 bytes of content: shape type, X, Y, Z, and measure), which works out to:

floor((2**31-2 - 100) / 44) = 48,806,444 features

More complex geometries and additional dimensions further limit the feature count. For 2D single-part quadrilateral polygons (5 vertices, for closure), the .shp limit would be:

floor((2**31-2 - 100) / 136) = 15,790,320 features
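
Reusing the max_shp_features sketch above, the same per-feature footprints can be built up from the spec's component sizes (each value is the 8-byte record header plus the record content; the constant names are mine):

RECORD_HEADER = 8                                    # record number + content length
POINT_2D   = RECORD_HEADER + 4 + 2 * 8               # 28 bytes
POINT_Z    = RECORD_HEADER + 4 + 4 * 8               # 44 bytes: X, Y, Z, M
MULTIPOINT = RECORD_HEADER + 4 + 32 + 4 + 1 * 16     # 64 bytes for a single point
QUAD_POLY  = RECORD_HEADER + 4 + 32 + 4 + 4 + 4 + 5 * 16   # 136 bytes, one 5-vertex ring

for size in (POINT_2D, POINT_Z, MULTIPOINT, QUAD_POLY):
    print(size, max_shp_features(size))              # 76695840, 48806444, 33554430, 15790320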

But this is not the full story, because the dBase-III+(ish) flavor of .dbf used by Esri also has a 2^31-1 limit (a .dbf can have an odd record size, so it's a -1 and not -2), and since the only tie between .shp and .dbf is the record number, the actual restriction is the smaller of the two record counts. With a single 100-character text field the limit would be:

floor((2**31-1 - 32*2) / 101) = 21,262,213 features

and with 100 fields [2] at a maximum of 4000 bytes/record, the limit would be:

floor((2**31-1 - 32*101) / 4001) = 536,735 features

(dBase-III has a 32-byte table header, and a 32-byte field header for each field, plus a 1-byte deletion marker per record.)
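
The same bookkeeping as a Python sketch (the function name is mine, and the field widths are purely illustrative; any mix adding up to the same record size gives the same count):

def max_dbf_records(field_widths, dbf_limit=2**31 - 1):
    # 32-byte table header, a 32-byte descriptor per field,
    # and one deletion-marker byte per record.
    header = 32 + 32 * len(field_widths)
    record = 1 + sum(field_widths)
    return (dbf_limit - header) // record

print(max_dbf_records([100]))        # 21262213 records with one 100-character text field
print(max_dbf_records([40] * 100))   # 536735 records with 100 fields totalling 4000 bytes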

There is no way for the .shx size limit to impact record count, even if all the stored shapes were null, since the .shp limit at 12 bytes per null-shape record (8-byte record header plus the 4-byte shape type) is already smaller than the .shx limit at 8 bytes per index record:

 floor((2**31-2 - 100) / 12) < floor((2**31-2 - 100) / 8)
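
Or, reusing the earlier sketch:

# 12 bytes = 8-byte record header + 4-byte shape type for a null shape;
# 8 bytes = one .shx index record.
print(max_shp_features(12) < max_shp_features(8))    # True: the .shp fills up first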

In the real world, shapefiles are too inefficient to effectively utilize more than 5-10 million features, and I try to limit even well-compressed file geodatabases to 20-40 million rows.

It would be far better to abandon the use of shapefile altogether than to start creating polluted files with a .shp suffix that do not conform to the shapefile specification (by exploiting the 16-bit wordsize for a 4GB limit or unsigned integers for an 8GB limit).


[1] Since the record sizes are all multiples of four bytes, you could argue for a limit of 2^31-4.

[2] Technically, the dBase-III+ spec limits the field count to 100. However, ArcGIS doesn't choose to enforce that, permitting the field count to reach the one-byte maximum of 255. For this reason (and other de facto quirks), I generally add the "-ish" to the flavor descriptor "III+" (note that BLOB-like Memo fields are not supported).

Vince
  • Shapefile is just easy for us to work with (several tools are built around it). We use Python libraries outside the arcpy realm (which is agonizingly slow and not optimized) to read/write them. I wanted to understand the math so that files may be written to conform (i.e. compute how many shapes a user's program would generate, and tell them if it's beyond 2GB). – pstatix Jan 24 '20 at 20:10
  • While the .SHP file length and .SHX offsets represent the number of 16-bit words, thus actually allowing for file addressing up to 4GB, I don't plan on doing that and will restrict users. May I ask why you used (2^31) - 2 and not (2^31) - 1? – pstatix Jan 24 '20 at 20:13
  • I see you edited the post, but still don't understand why you're using (2^31) - 2. You've also linked my own question that was closed! But still, not sure why you have some as (2^31) - 1 and some as (2^31) - 2. I agree with the formulas, just not the maximum byte size, I think you're 2 bytes off. Is it to keep it an even number since all the byte footprints of features are even numbers? – pstatix Jan 24 '20 at 20:28
  • I've written scores of data translators, including shapefile implementations in 'C', Java, and JavaScript, and I've rarely, if ever, known what the total record count was going to be. I've found that predicting overflow on stream-oriented solutions takes a significant amount of time, and that it's generally better to just test for overflow before writing each feature than to make assumptions about null geometries and vertex counts which could end up being incorrect. – Vince Jan 25 '20 at 03:45
  • You mention the dBase-III+ spec, but where did you get it? I couldn't find the actual spec regardless of how much searching I do. As for your comment about 16-bit words ensuring an even .shp, what do you mean? The file can have an odd number of records. – pstatix Mar 31 '20 at 22:29
  • I don't follow why you use 2^31 - 2 over 2^31 - 1 either. – pstatix Mar 31 '20 at 22:55
  • There isn't a spec for the dBase-III flavor used by Esri, but there is Xbase documentation, which comes close. An odd number of records with an even record size will always have an even length, so the shp and shx will always be even (see footnote 1). – Vince Mar 31 '20 at 23:13
  • Since each record (shape) is an even number of bytes, yes, the file size would be an even number of bytes. I was just unsure of what you meant would be even (the total file size or the number of records). Still, how do 16-bit words guarantee that? – pstatix Apr 01 '20 at 14:19
  • Further, I'd argue that your formula here shows the maximum addressable features. Byte 24 of the header is a 32-bit signed integer, giving 2^31-1 addresses (for some reason you're using 2^31-2). Regardless, using that formula you resolved that the maximum number of points is 76,695,840. However, each 2D point record uses 28 bytes, so the total file size is (100 + 76,695,840 * 28) = 2.147GB. I'd state that a more accurate formula is ((2*1000**3 - 100) // 28) = 71428567. The specification limits the file size to 2GB, which a 32-bit integer would exceed. – pstatix Apr 01 '20 at 14:23
  • If the size is measured in 16bit words (2 bytes each), then file size, in 8-bit bytes, must be even (no fractional words). GB is used to denote 2^30 (1073741824) bytes. I just created a valid 74m row 2D point shapefile, which used 2072000100 bytes, and matches my formula exactly; my shapefile generation tool successfully created 76695840 features (with a single integer attribute, .shp size 2147483620 bytes, 1.999999974GB). – Vince Apr 01 '20 at 15:21
  • Not to be pedantic, but your unit of measurement is gibibytes (GiB), not gigabytes (GB). Binary prefixes were introduced in 1998 by the IEC, but I doubt that ESRI meant 1GB = 2^30 rather than 10^9. – pstatix Apr 01 '20 at 16:01
  • If your own tool for developing shapefiles was written around that math, of course it would work. That math would not change the accuracy of the result, but changing the math obviously changes the result. Which math is correct (as it pertains to file size limits) is unclear. I was merely pointing out that the limit of "roughly 70 million" makes more sense using my formula than yours under the notion of what a GB represents (2^30 vs 10^9). Regardless, if you generate a shapefile out of spec, many GIS applications will still open it. So no tool today will tell you whether the file is valid or not. – pstatix Apr 01 '20 at 16:02
  • Since the "2GB" was published in 1995 (though the format is older than that), I'm quite sure they meant 2^31-1, which was the filesystem limit at the time of creation. – Vince Apr 01 '20 at 17:03
  • Where did you get 1995 from? – pstatix Apr 01 '20 at 17:38
  • For consistency, your number for "2D single-part quadrilateral polygons" is incorrect. You have them using 128 bytes, but that's just the geometry, not the record; it's actually 136. For example, 2D points use 20 bytes with an 8-byte header (hence you use 28). – pstatix Apr 17 '20 at 13:43