I'm not a mathematician or anything, so I haven't read the paper (the maths would probably go over my head), but I've been thinking about how to build my own simple Photosynth-style viewer.
Pictures exist as infinite pyramids in 3D space, with the focal point at the apex and the lens a little way down; the distance to the lens determines the field of view (FoV). A pixel exists as a cone projected out through the pyramid, and since the FoV is only known approximately, the cone's X and Y radii are bounded by the minimum and maximum FoV estimates. If you have four known points across two images, the difference between the two trapezoids can give you the relative rotation of the cameras. The intersections between the cones give you ellipsoids of estimated point position, while going back the other way gives you a worst-possible distance from the estimated camera position.
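To make that concrete, here's a minimal sketch of the ray part of the idea: cast a ray from each camera's focal point through a pixel (pinhole model, cameras assumed to look down +z, rotation omitted for brevity), then take the midpoint of the shortest segment between two such rays as the point estimate, roughly the centre of the ellipsoid where the two cones overlap. All function and parameter names here are made up for illustration, not from any real library:

```python
import math

def sub(a, b): return [x - y for x, y in zip(a, b)]
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def norm(a):
    n = math.sqrt(dot(a, a))
    return [x / n for x in a]

def pixel_ray(cam_pos, px, py, width, height, fov_x):
    """Ray from the camera's focal point (the pyramid's apex) out
    through pixel (px, py). fov_x is the horizontal FoV in radians."""
    f = (width / 2) / math.tan(fov_x / 2)   # focal length in pixel units
    return cam_pos, norm([px - width / 2, py - height / 2, f])

def closest_point(p1, d1, p2, d2):
    """Midpoint of the shortest segment between two rays: the best
    single-point estimate where two pixel cones intersect."""
    w = sub(p1, p2)
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    denom = a * c - b * b                   # ~0 when the rays are parallel
    t1 = (b * dot(d2, w) - c * dot(d1, w)) / denom
    t2 = (a * dot(d2, w) - b * dot(d1, w)) / denom
    q1 = [p + t1 * d for p, d in zip(p1, d1)]
    q2 = [p + t2 * d for p, d in zip(p2, d2)]
    return [(x + y) / 2 for x, y in zip(q1, q2)]
```

With real cameras the two rays never quite meet, which is why the shortest-segment midpoint (rather than an exact intersection) is the usual trick; the segment's length also gives you a crude measure of how bad the estimate is.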
Identifying points for matching is a harder problem: what constitutes a point? I'm thinking some kind of edge detection at multiple zoom levels, and store it all in a single...
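The "edge detection at multiple zooms" idea could look something like the sketch below: blur the image repeatedly (a naive box blur standing in for a proper Gaussian), compute a gradient magnitude at each level, and record (x, y, level) keys for strong edges. This is only an illustration of the idea, everything here is hypothetical, and a real system would build a descriptor around each key so it can be matched across images:

```python
def box_blur(img, r=1):
    """Naive box blur; a stand-in for a proper Gaussian."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[cy][cx]
                    for cy in range(max(0, y - r), min(h, y + r + 1))
                    for cx in range(max(0, x - r), min(w, x + r + 1))]
            out[y][x] = sum(vals) / len(vals)
    return out

def edge_strength(img, x, y):
    """Central-difference gradient magnitude (poor man's Sobel)."""
    gx = img[y][x + 1] - img[y][x - 1]
    gy = img[y + 1][x] - img[y - 1][x]
    return (gx * gx + gy * gy) ** 0.5

def multiscale_edges(img, levels=3, threshold=0.5):
    """Find strong edges at several blur levels ('zooms'),
    returning (x, y, level) tuples."""
    keys = []
    for level in range(levels):
        h, w = len(img), len(img[0])
        for y in range(1, h - 1):
            for x in range(1, w - 1):
                if edge_strength(img, x, y) > threshold:
                    keys.append((x, y, level))
        img = box_blur(img)   # coarser scale for the next pass
    return keys
```

Points that stay strong across several levels are the stable ones worth keeping, since they'll survive the image being viewed at different distances.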