
I'm hoping somebody can clarify for me why .shp files are limited to a 2GB file size. Having read through the ESRI considerations and technical description, I cannot see why the limit exists.

Since the format uses dBASE for the .dbf component of the multifile format, it must abide by dBASE limits, which cap a file at 2GB. But that points to the same question: why does that limit exist? Does it have something to do with these formats being created when 32-bit OSes were widely used, and if so, how does that influence the limit? I've seen posts explaining this as 2^31 - 1, which is ~2.1GB, but that just shows 32-bit addressing is used; I'm not sure how it fits here. Other posts mention that these formats use 32-bit offsets, specifically "32-bit offsets to 16-bit words", but I don't follow that either.
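
To make the numbers concrete, here is the arithmetic as I currently understand it, written as a small C++ sketch of my own (the assumptions in it may well be exactly what I have wrong):

    #include <cstdint>
    #include <iostream>

    int main() {
        // Maximum value of a signed 32-bit integer, which (as I understand it)
        // is the type used for the .shp header's file-length field and for the
        // .shx record offsets.
        const std::int64_t max_i32 = 2147483647;  // 2^31 - 1

        // If that value counted bytes, the ceiling would be ~2.1GB...
        std::cout << "as bytes:        " << max_i32     << " bytes (~2.1GB)\n";

        // ...but if it counts 16-bit words, the ceiling would be ~4.3GB,
        // which is exactly where I lose the thread.
        std::cout << "as 16-bit words: " << max_i32 * 2 << " bytes (~4.3GB)\n";
    }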

pstatix
  • Because they use 32-bit ints for the addressing (and change byte order halfway through the header) – Ian Turton Feb 19 '19 at 12:12
  • @IanTurton Can you possibly expand with an example via an answer, if you have the time? I'm thoroughly reading the technical description, but I don't see anywhere that 32-bit addressing is used. Why does changing from Big Endian to Little Endian affect that too? I've also updated the last sentence in my OP with something that could hopefully be clarified. – pstatix Feb 19 '19 at 12:15
  • Table 1, Byte 24: Integer File Length. Four bytes, 32 bits. The offsets in the SHX file are four-byte integers too, so they can't offset anything past the 2Gb limit, assuming signed binary. – Spacedman Feb 19 '19 at 12:32
  • Offsets are encoded as 32-bit signed integers, but in units of 2-byte words, so they could have addressed 4Gb (8Gb if unsigned), which is why the specification explicitly limits the size. – Vince Feb 19 '19 at 12:38
  • @Vince Now onto the offsets and 2-byte words. I'm not sure what this means in relation to the file size limitation. Aren't the offsets simply 32-bit numbers representing the total number of 16-bit words between the main file start and the specific feature record header? How does this tie back to the 2GB limit? – pstatix Feb 19 '19 at 15:17
  • Cross-posted as https://stackoverflow.com/q/54766096/820534 – PolyGeo Feb 19 '19 at 20:20
  • @Spacedman I'm not following how the Table 1 Byte 24 field indicates a maximum of 2GB simply because it's 32 bits. You're referencing Gigabits but the file maximum is Gigabytes. Perhaps you can expand? – pstatix Feb 20 '19 at 06:49
  • @Spacedman I'm with Datta on this one: how did you come to a limit of 2Gb from 32-bit offsets? The SHX offsets represent the number of 16-bit words from the file start to the record start. If an offset held its maximum value, that would mean 2*((2^31)-1) bytes, i.e. ~4.3GB, between the file start and that record in the main file. The same question applies to your Table 1 Byte 24 statement, since that field represents the number of 16-bit words in the file. (I've tried to lay this arithmetic out in the sketch after these comments.) – pstatix Mar 05 '19 at 06:36
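
To put the arithmetic from the comments above in one place, here is a minimal sketch of the word-to-byte conversion being discussed. The helper name words_to_bytes is invented for illustration, and the sketch assumes, as the comments describe, that the file-length field at byte 24 and the SHX offsets are signed 32-bit counts of 16-bit words:

    #include <cstdint>

    // Illustrative only: converts a count of 16-bit words (as stored in the
    // .shp header's file-length field at byte 24, and in each .shx offset)
    // into a count of bytes.
    constexpr std::int64_t words_to_bytes(std::int32_t words) {
        return static_cast<std::int64_t>(words) * 2;
    }

    // An empty .shp (header only) is 100 bytes, i.e. a file length of 50 words.
    static_assert(words_to_bytes(50) == 100, "header-only file is 50 words");

    // At the signed 32-bit maximum, the encoding could describe ~4.29GB...
    static_assert(words_to_bytes(2147483647) == 4294967294LL, "~4.29GB");

    // ...so the 2GB ceiling is not forced by the field widths themselves; the
    // specification (and the 32-bit OS/file APIs of the era) imposes it.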

1 Answer


You're asking several History of Computing questions here. All the reasons you've listed are true. The maximum file size on the OS was 2GB. The maximum value of a signed 32-bit integer corresponds to 2GB. The maximum file offset in the OSes of the day was 2GB. But even once those were no longer obstacles, Esri explicitly stated that the format has a 2GB limit. Isn't that enough of a reason?

There are scads of new formats that out-perform shapefile. File geodatabase is so much better that I haven't created an output shapefile this decade. But I've used input shapefiles because that was what was available, and I've generated new shapefiles with turn-of-the-millennium tools, because that's what was available then.

Has computing changed? Of course it has. Can you hack the shapefile format to 4Gb or 8Gb? Yes, but not without being non-conformant. And conformance is the shapefile's greatest strength; violating it is what will destroy whatever utility remains of the format.

Vince
  • I appreciate your input. The driving reason behind this question is a C++ header I am writing for a library; I wanted to add some implementation details regarding file creation for a personal project. I was trying to understand things like "Why use 32-bit signed?" since your offsets will never be negative. Are things like that just safeguards so that they could limit the file to 2GB? Why did they specify 16-bit words? So on and so forth. – pstatix Feb 19 '19 at 14:36
  • The design was a combination of "simple is best" and "platform independence is hard". The header has both big-endian and little-endian values to ensure that any translator, on Intel or Motorola, would need to implement endian swapping. Unsigned offers more opportunities for undetected corruption. Processing 70M features in one file was so far beyond the 80386 chip on which it was implemented that it wasn't worth worrying about. – Vince Feb 19 '19 at 14:42
  • Thanks for sharing the background. I find this stuff very important when determining how to implement a library so I can share what I learned with others in the implementation details and reasoning. – pstatix Feb 19 '19 at 15:19
  • You do not need this information to implement a standard, especially such an ancient standard. – Vince Feb 19 '19 at 16:15
  • Fair enough. As I said, it's a personal project. Now I am interested to know how the 32-bit offset works with the 2-byte words to establish a file size limit of 2GB. Do you think you could expand on that final piece? You've answered everything else in great detail, and that's the last component I'm fuzzy on. – pstatix Feb 19 '19 at 17:46
  • Part of my library contains a writer, which will raise errors when a user attempts to create a non-conforming file (roughly along the lines of the guard sketched after this comment thread). So understanding (to some extent) how the offsets are used will help with this. – pstatix Feb 19 '19 at 17:52
  • The offsets are a critical part of the shx content, and are described in the specification. The best way to be sure you have it right is to write a validator based on known input. User data should not have any role in a writer with respect to offsets. – Vince Feb 19 '19 at 18:27
  • From this similar post, I'm just lost on how Byte 24 in the .shp header leads to a maximum file size of 2GB. It is supposed to represent the total count of 16-bit words, which would allow 2*((2^31)-1) bytes, i.e. ~4GB, right? Perhaps I'm messing up my understanding of Byte 24. – pstatix Feb 20 '19 at 08:21
  • The year in which it was created leads to a 2GB maximum, not the file length indicator. – Vince Feb 20 '19 at 11:26
  • Alright, then why the comment to the OP regarding the 32-bit offsets to 16-bit words? How do they fit? – pstatix Feb 20 '19 at 15:05
  • Offsets use signed long (32-bit) integers in units of 16-bit words. The PDF states that the empty SHP (just the header) has a length of 50, and the SHX has a length of 50 + (4 * nRecs) (because it's fixed-width). The file sizes are limited to 2Gb because Esri said so. It's not documented anywhere why short words were chosen, but the offsets are always a multiple of two, so a parity check is possible there as well. – Vince Feb 20 '19 at 15:41
  • If I'm following your comments, you're saying that they could've addressed 4GB (not Gb like you've stated above) but enforced a 2GB limit via the spec and the application. As Datta pointed out, the offsets represent the number of 16-bit words. So if an offset held the value 2,147,483,647 (max val), that means there are (max val * 2) bytes in the main file. Had they used unsigned integers, that would be ((2^32) - 1) * 2 bytes, or ~8GB. Is my thinking about right? Or did you intentionally mean Gb and not GB? – pstatix Mar 05 '19 at 06:52
  • I'd like to suggest that "Gb" and "GB" are interchangeable to some folks and that being pedantic about a difference may reduce the willingness of people to help you. – Vince Mar 05 '19 at 11:44
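
Tying the thread together, here is a rough sketch of the kind of conformance guard pstatix describes for the writer. The names check_record_fits and kMaxShpBytes are invented for illustration, and taking the ceiling as exactly 2^31 - 1 bytes is an assumption; the specification simply says 2GB:

    #include <cstdint>
    #include <stdexcept>

    // Hypothetical guard for a shapefile writer: refuse any record that would
    // push the .shp past the 2GB limit the specification imposes, even though
    // a signed 32-bit count of 16-bit words could technically describe ~4GB.
    constexpr std::int64_t kMaxShpBytes = 2147483647;  // 2^31 - 1 bytes (~2GB)

    void check_record_fits(std::int64_t current_file_bytes,
                           std::int64_t record_bytes) {
        // Record contents must occupy a whole number of 16-bit words, since
        // lengths and offsets are stored in word units.
        if (record_bytes % 2 != 0) {
            throw std::invalid_argument(
                "record length must be a whole number of 16-bit words");
        }
        if (current_file_bytes + record_bytes > kMaxShpBytes) {
            throw std::length_error(
                ".shp would exceed the specification's 2GB limit");
        }
    }

A validator built from known-good input, as suggested in the comments above, would be the complementary check on the read side.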