All the different data words a transmitter can send and a receiver can detect can be imagined as dots arranged in a large space.
Selecting a data encoding for error detection and correction is about keeping valid code words at a certain distance from each other. As a result, a slight change ("movement") of a valid word doesn't make it look like another valid word.
As I'm not going to draw 24-dimensional spheres and hypercubes, I'm going to restrict this answer to three dimensions.
3-bit example
Imagine a data word consisting of three binary digits.
We can arrange them in the form of a cube by treating each digit as the coordinate in one dimension:
Each transmission error, that is, a '0' mistaken for a '1' or vice versa, corresponds to moving one step along one of the edges of this cube.
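The cube picture can be reproduced in a few lines. This is only a minimal sketch: the helper names `hamming` and `neighbours` are mine, and words are written as bit strings for readability.

```python
from itertools import product

def hamming(a, b):
    """Number of positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

# All eight corners of the cube, written as 3-bit strings.
corners = ["".join(bits) for bits in product("01", repeat=3)]

def neighbours(word):
    """Corners reachable from `word` by flipping exactly one bit (one edge)."""
    return [c for c in corners if hamming(word, c) == 1]

print(neighbours("000"))  # ['001', '010', '100']
```

Every corner has exactly three neighbours, one per dimension, which is why a single bit error can end up in three different places.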
In normal, everyday communication with sufficiently low error rates, we can treat all code points as valid:
But, every flipped bit leads to another valid word. So, if we remove every second valid word, we get this:
Now, all the valid words are two edges apart. One flipped bit gets us to an invalid word and we know there was an error, but we are not able to correct it, because there are three possible bits that could have flipped. This is called a "0 error correcting, 1 error detecting" code.
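As a sketch of why this code detects but cannot correct, assume the four remaining words are the even-parity corners (the image is not reproduced here; the odd-parity set behaves the same way). The names `VALID` and `candidates` are illustrative only.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

# Assumption: the valid words are the corners with an even number of ones.
VALID = ["000", "011", "101", "110"]

# Smallest number of edges between any two valid words.
min_distance = min(hamming(a, b) for a, b in combinations(VALID, 2))

def candidates(received):
    """Valid words that are exactly one bit flip away from `received`."""
    return [w for w in VALID if hamming(w, received) == 1]

print(min_distance)       # 2
print(candidates("001"))  # ['000', '011', '101']
```

The received word `001` is invalid, so the error is noticed, but three valid words are equally close and there is no way to pick the right one.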
To improve robustness, remove another two valid code words:
Now, all valid words are three edges apart. If one bit flips, we get to an invalid word, but we can still tell where we came from. If two bits flip, we can't correct the word, because another valid word is closer to the received word than the correct one. Hence, this code is called "1 error correcting, 2 error detecting"[*]. This is the best we can get with our simple 3-bit code words.
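A nearest-word decoder makes this concrete. The sketch below assumes the two remaining words are `000` and `111` (any pair of opposite corners would do); `decode` is a name I made up for illustration.

```python
def hamming(a, b):
    """Number of positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

# Assumption: the two valid words are opposite corners of the cube.
VALID = ["000", "111"]

def decode(received):
    """Return the valid word with the fewest differing bits."""
    return min(VALID, key=lambda w: hamming(w, received))

print(decode("010"))  # 000: one flipped bit is corrected
print(decode("011"))  # 111: if 000 was sent, two flips are mis-corrected
```

With two flipped bits the decoder confidently returns the wrong word, which is why correction only works for a single error. Used purely as a detector (reject anything that is not `000` or `111`), the same code notices up to two errors.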
4-bit code extension
If you add another bit to get 4-bit code words, you can increase the distance between valid points to four edges of the then 4-dimensional cube. This gives another level of error mitigation: if two bits flip, you reach an invalid word right "in the middle" between valid words. You can't decide which one might be the correct one, but at least you know that two bits flipped. This type of code is called "1 error correcting, 3 error detecting".
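The "in the middle" case shows up as a tie between valid words. Here is a minimal sketch, assuming the valid 4-bit words are `0000` and `1111`; `classify` and its return labels are hypothetical names, not a standard API.

```python
def hamming(a, b):
    """Number of positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

# Assumption: opposite corners of the 4-dimensional cube.
VALID = ["0000", "1111"]

def classify(received):
    """Decide whether a word is valid, correctable, or only detectably wrong."""
    (d0, w0), (d1, _) = sorted((hamming(w, received), w) for w in VALID)
    if d0 == 0:
        return ("ok", w0)
    if d0 < d1:
        return ("corrected", w0)   # a unique nearest valid word exists
    return ("detected", None)      # tie: error known, origin unknown

print(classify("0001"))  # ('corrected', '0000')
print(classify("0011"))  # ('detected', None)
```

A word two flips away from both valid words cannot be repaired, but unlike in the 3-bit case it is not silently turned into the wrong word either.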
Voyager
In the case of Voyager, this wouldn't be enough, so they went for a 24-bit code word. Of the roughly 16.8 million ($2^{24}$) possible words, only 4096 ($2^{12}$) were defined as valid. That is, 12 bits carry actual information and another 12 are used for error correction. This resulted in a "3 error correcting, 7 error detecting" code: if up to 3 bits in any word were received wrongly, the word could still be corrected properly, and if up to 7 bits flipped, at least it would be known that something is wrong. Its valid words can be pictured in the same way as above, as corners of a 24-dimensional hypercube.
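These parameters match the extended binary Golay code. Assuming that is the code meant here, the numbers can be checked by brute force. The sketch below builds the 23-bit cyclic Golay code from the generator polynomial $x^{11}+x^{10}+x^6+x^5+x^4+x^2+1$ (`0xC75`) and appends an overall parity bit. It is not Voyager's actual encoder; bit ordering and the non-systematic encoding are my own choices for brevity.

```python
G_POLY = 0xC75  # x^11 + x^10 + x^6 + x^5 + x^4 + x^2 + 1, bit i = coefficient of x^i

def clmul(a, b):
    """Multiply two polynomials over GF(2), coefficients packed into ints."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

def weight(x):
    """Number of one bits in x."""
    return bin(x).count("1")

def encode(data):
    """Map a 12-bit integer to a 24-bit code word (non-systematic sketch)."""
    word23 = clmul(G_POLY, data)                 # 23-bit Golay code word
    return (word23 << 1) | (weight(word23) & 1)  # append overall parity bit

codewords = [encode(d) for d in range(1 << 12)]

# The code is linear, so its minimum distance equals the smallest
# weight of any non-zero code word.
min_distance = min(weight(c) for c in codewords if c)

print(len(set(codewords)), "valid words out of", 1 << 24)
print("minimum distance:", min_distance)
print("correctable:", (min_distance - 1) // 2, "detectable:", min_distance - 1)
```

A minimum distance of 8 edges is what gives "3 error correcting, 7 error detecting" under the definitions used above.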
Now, how does this relate to packing of spheres? In fact, the three images show the densest possible packing of spheres with diameters of $\sqrt 1$, $\sqrt 2$ and $\sqrt 3$, respectively, under the constraint that their centers need to be located on corners of the cube.[**]
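The three diameters are simply the smallest straight-line distances between valid corners. A short check, assuming the same three sets of valid words as in my sketches above (all corners, even-parity corners, two opposite corners):

```python
from itertools import combinations, product
from math import dist

# Assumed sets of valid words, given as corner coordinates of the unit cube.
codes = {
    "all corners": list(product((0, 1), repeat=3)),
    "even parity": [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)],
    "two corners": [(0, 0, 0), (1, 1, 1)],
}

def min_euclidean(points):
    """Smallest straight-line distance between any two points."""
    return min(dist(p, q) for p, q in combinations(points, 2))

for name, points in codes.items():
    # Spheres of this diameter centred on the points touch but do not overlap.
    print(name, round(min_euclidean(points), 3))
```

For corners of a cube, the Euclidean distance is the square root of the number of differing bits, which is where $\sqrt 1$, $\sqrt 2$ and $\sqrt 3$ come from.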
Obviously, this doesn't look too spectacular, but it gets a lot more challenging if we're no longer looking at binary data and instead use a transmitter that also supports values in between, e.g., by adding amplitude modulation on top of a simple on/off modulation. By adding one more level (e.g., power off/low/high) for each of the three digits in our example, we no longer have eight code points, but $3^3 = 27$. Start packing your spheres onto that grid!
[*] Note that "detecting" doesn't mean that you will be able to tell how many bit errors there are exactly. It refers to the number of bit errors that can occur before you end up with another valid code word.
[**] Strictly speaking, we don't deal with spheres in a regular Euclidean space here, but with Hamming spheres. These are defined as the set of corners that are a given number of edges away from their center. This accounts for the fact that in a binary world only the corners of the cube represent valid points, while any other point would have fractional coordinates and simply doesn't exist.
Practically, there is no difference between the two in the examples given here.
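For the curious, a Hamming sphere is easy to enumerate. The sketch below takes "a given number of edges away" to mean at most `radius` edges, and `hamming_sphere` is an illustrative name of my own.

```python
from itertools import product

def hamming(a, b):
    """Number of positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

def hamming_sphere(center, radius):
    """All corners at most `radius` edges away from `center`."""
    words = ("".join(bits) for bits in product("01", repeat=len(center)))
    return {w for w in words if hamming(w, center) <= radius}

a = hamming_sphere("000", 1)
b = hamming_sphere("111", 1)

print(sorted(a))               # ['000', '001', '010', '100']
print(sorted(b))               # ['011', '101', '110', '111']
print(a.isdisjoint(b), len(a | b))  # True 8
```

In the 3-bit, two-word example the two spheres of radius 1 do not overlap and together cover every corner of the cube, so no point is wasted.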