Indeed - there'd be more information associated with each pixel than could be stored in a color value. I've mainly been approaching the problem from the point of view of the scripts I've been writing, which aren't integrated with any image editor like PS or GIMP, so defining a place to store extra data isn't the tricky part. That said, you probably could fully automate the process in one of them, but working inside a much more heavyweight system like that has a good chance of incurring significant efficiency costs (especially if you need to use image components like extra layers in large numbers as your data storage). Not all of that extra information would ever be output to any one image; it would be used to construct them. By 'pixel-by-pixel vector field' I meant a data structure with a vector for every pixel in the source image. antillies' mention of "pixel objects" is pretty much right on for what would ultimately be needed, although the precise layout of that data might differ (I'm not sure at the moment, but there may be relevant tradeoffs between storing one object per pixel and keeping a set of parallel arrays that each hold one data value). For the moment I've been doing more of the latter since I've been working on single steps, but 'pixel objects' is a natural progression toward a more generalized, multistep formulation of the problem.
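To make the two layouts concrete, here's a rough sketch of what I mean. The names (`Pixel`, `PixelGrid`, `elevation`, `wind_x`, `wind_y`) are made up for illustration and aren't from my actual scripts; it only assumes a row-major flat index and uses the standard library.

```python
from array import array
from dataclasses import dataclass


@dataclass
class Pixel:
    """Object-per-pixel layout: all values for one pixel live together."""
    elevation: float = 0.0
    wind_x: float = 0.0
    wind_y: float = 0.0


class PixelGrid:
    """Parallel-array layout: one flat array of doubles per data value."""

    def __init__(self, width, height):
        self.width = width
        self.height = height
        n = width * height
        # Each array holds one value per pixel, stored as packed doubles.
        self.elevation = array("d", [0.0]) * n
        self.wind_x = array("d", [0.0]) * n
        self.wind_y = array("d", [0.0]) * n

    def index(self, x, y):
        """Row-major flat index of pixel (x, y)."""
        return y * self.width + x

    def pixel(self, x, y):
        """Assemble a 'pixel object' view on demand from the parallel arrays."""
        i = self.index(x, y)
        return Pixel(self.elevation[i], self.wind_x[i], self.wind_y[i])
```

The parallel arrays keep the per-pixel overhead down when a step only touches one value, and the `pixel()` accessor shows that you can still hand out 'pixel objects' where that's more convenient.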
The reason efficiency is a concern is that these images can be fairly large, so the cost of whatever operations you do per pixel gets multiplied by a huge number. I'll admit my perception is a bit colored by the fact that testing the code involves running it quite a bit, whereas in general use you're right that a command-line process in this style, launched in the background, can reasonably take a bit of time to run. Space efficiency can be somewhat of a concern too, since it doesn't make a big difference up to a certain point and then becomes enormously important - and with the input sizes of the maps it can become an issue. For, say, a 4000x2000 image (which isn't really all that big), every byte of information stored per pixel is ~8MB of memory. That's not all that huge in and of itself, but it can add up, especially if one starts looking at higher-res images for source data. Storing a vector, which in Python is probably two doubles, means 16 bytes per pixel -> 128MB for the 4000x2000 example - again not a ton by itself, but starting to look more concerning in the presence of multiple other similar data values. And if you move up to a significantly larger resolution like, say, 16000x8000, then it's ~128MB per byte per pixel, and as soon as the total approaches the amount of RAM on the system and the OS starts swapping memory to disk, performance will become much, much worse.
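The arithmetic above is easy to sanity-check with a throwaway helper. The function name `per_pixel_memory_mb` is just something I made up here, and it counts decimal megabytes (10**6 bytes) and ignores any per-object overhead the language adds on top.

```python
def per_pixel_memory_mb(width, height, bytes_per_pixel):
    """Memory in decimal MB (10**6 bytes) for data stored once per pixel."""
    return width * height * bytes_per_pixel / 1_000_000


# The cases mentioned above: one byte per pixel, and a two-double vector.
for width, height, nbytes in [(4000, 2000, 1), (4000, 2000, 16), (16000, 8000, 1)]:
    mb = per_pixel_memory_mb(width, height, nbytes)
    print(f"{width}x{height} at {nbytes} byte(s) per pixel: {mb:g} MB")
```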
The shortest-path tree idea comes from how a lot of flood-fill algorithms work - they treat pixels as nodes and do a graph traversal. Here we'd just do a weighted graph traversal and annotate each pixel/node with the distance we found from the 'supernode' (which represents all the oceanic pixels together). Practically we might need to separate closed seas (like the Caspian Sea on Earth) from the ocean first in constructing that set, but that can just be union-find again (and that can be implemented in effectively linear time in this context; it just takes one pass over the image, since we only care about adjacency and not distance for 'what pixels are part of which separated bodies of water?').
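Roughly, the weighted traversal could look like the sketch below: a multi-source Dijkstra where every ocean pixel starts at distance 0, which is equivalent to having one supernode connected to all of them. Everything here is an assumption for illustration: the `is_ocean` grid of booleans, the 4-neighbor connectivity, no east-west wraparound, and the hypothetical `step_cost` hook (where a projection-aware distance between neighboring pixels could be plugged in).

```python
import heapq


def distance_from_ocean(is_ocean, step_cost=lambda a, b: 1.0):
    """Return a grid of shortest distances from any ocean pixel.

    is_ocean: list of rows of booleans (True = ocean).
    step_cost: hypothetical hook giving the cost of moving between two
               adjacent pixels, each passed as an (x, y) tuple.
    """
    height = len(is_ocean)
    width = len(is_ocean[0]) if height else 0
    dist = [[float("inf")] * width for _ in range(height)]

    # Seed the queue with every ocean pixel at distance 0 (the 'supernode').
    heap = []
    for y in range(height):
        for x in range(width):
            if is_ocean[y][x]:
                dist[y][x] = 0.0
                heap.append((0.0, x, y))
    heapq.heapify(heap)

    while heap:
        d, x, y = heapq.heappop(heap)
        if d > dist[y][x]:
            continue  # stale entry; a shorter path was already found
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height:
                nd = d + step_cost((x, y), (nx, ny))
                if nd < dist[ny][nx]:
                    dist[ny][nx] = nd
                    heapq.heappush(heap, (nd, nx, ny))
    return dist
```

With the default unit cost this is just pixel-count distance; the point of `step_cost` is that it could be swapped for something that accounts for the map projection.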
As for separating landmasses joined by a land-bridge, you could do union-find for major landmasses on only those pixels meeting a threshold of distance-from-ocean (not just in pixel count, but actual distance - this both helps with projection distortions and keeps the algorithm's output agnostic to the image resolution). Then you could make these 'major landmass' pixel collections into supernodes to use as roots for a new graph traversal - labeling each pixel with the supernode it has the closest land distance to and using that to allocate pixels to discrete landmasses. Obviously this couldn't be flawless - it would depend significantly on what the threshold for considering something a narrow land-bridge is (i.e. how close to the ocean a pixel must be to count as coastal as opposed to central to a major landmass), but with the right tuning I'd expect this approach could reliably identify North America and South America, or Eurasia and Africa, as separate.
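Here's a minimal sketch of that two-stage idea, assuming you already have a per-pixel distance-from-ocean grid (0 for ocean), however it was computed. I've used a flood fill in place of union-find for the component labeling since it's shorter to write, and a unit step cost for the second traversal; the function name `label_landmasses`, the 4-neighbor connectivity, and the requirement that `threshold` be greater than 0 are all my own assumptions.

```python
import heapq
from collections import deque

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def label_landmasses(dist, threshold):
    """Assign each land pixel to its nearest 'core' landmass.

    dist: grid of distance-from-ocean values (0 means ocean).
    threshold: minimum distance for a pixel to count as core (assumed > 0).
    Returns a grid of ids; -1 marks ocean, or land with no reachable core
    (e.g. islands too small to contain any core pixel).
    """
    height = len(dist)
    width = len(dist[0]) if height else 0
    label = [[-1] * width for _ in range(height)]

    # Stage 1: flood-fill connected components of core pixels.
    next_id = 0
    for y in range(height):
        for x in range(width):
            if dist[y][x] >= threshold and label[y][x] == -1:
                label[y][x] = next_id
                queue = deque([(x, y)])
                while queue:
                    cx, cy = queue.popleft()
                    for dx, dy in NEIGHBORS:
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and label[ny][nx] == -1
                                and dist[ny][nx] >= threshold):
                            label[ny][nx] = next_id
                            queue.append((nx, ny))
                next_id += 1

    # Stage 2: multi-source Dijkstra over land, rooted at every core pixel.
    best = [[float("inf")] * width for _ in range(height)]
    heap = []
    for y in range(height):
        for x in range(width):
            if label[y][x] != -1:
                best[y][x] = 0.0
                heap.append((0.0, x, y))
    heapq.heapify(heap)
    while heap:
        d, x, y = heapq.heappop(heap)
        if d > best[y][x]:
            continue  # stale entry
        for dx, dy in NEIGHBORS:
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height and dist[ny][nx] > 0:
                nd = d + 1.0  # unit land distance; an assumption
                if nd < best[ny][nx]:
                    best[ny][nx] = nd
                    label[ny][nx] = label[y][x]  # inherit the nearest core's id
                    heapq.heappush(heap, (nd, nx, ny))
    return label
```

Pixels exactly equidistant from two cores (the middle of a bridge, say) go to whichever core the queue happens to reach them from first, so the boundary along a land-bridge is only as principled as the threshold and the distance measure you feed in.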
Regarding generating winds - I hadn't thought of it in terms of overlaying a base 'natural winds' framework (I think I recall seeing fairly straightforward diagrams of those before) with another layer for the pressure systems, which could have systematically defined behavior based on an input of a human-readable color-by-pressure-system map like the one already in the tutorial.