very interesting direction. One thing I’m curious about with extremely long videos is how you handle temporal drift over time. Do you periodically re-anchor the reconstruction or rely purely on accumulated frame consistency?
In a traditional SLAM pipeline you do periodically correct drift by detecting when you've revisited an area you've already mapped (loop closure). This lets you align your submaps so they are globally consistent.
In an area you've visited before you have two estimates of your position: one from your frame-to-frame tracking, and another from the map you built the first time through. You can then solve an optimization problem that pulls those two estimates toward each other, spreading the accumulated drift across the trajectory.
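To make that concrete, here's a toy sketch of the idea in one dimension: poses linked by odometry steps, plus one loop-closure constraint saying the last pose should coincide with the first. The `optimize` function, the gradient-descent solver, and all the names here are my own illustration, not anything from the paper — real systems solve this as a sparse nonlinear least-squares problem over 6-DoF poses.

```python
# Toy 1D "pose graph": poses x0..xN linked by odometry constraints,
# plus loop-closure constraints between revisited poses.
# Illustrative sketch only; real SLAM backends use sparse nonlinear
# least squares (e.g. Gauss-Newton) over full 6-DoF poses.

def optimize(odometry, loop_closures, iters=2000, lr=0.05):
    # Integrate odometry to get the initial (drifted) estimate.
    poses = [0.0]
    for d in odometry:
        poses.append(poses[-1] + d)
    # Minimize the sum of squared residuals over both constraint types
    # with plain gradient descent, keeping x0 fixed as the anchor.
    for _ in range(iters):
        grad = [0.0] * len(poses)
        for i, d in enumerate(odometry):      # odometry: x[i+1] - x[i] ~ d
            r = poses[i + 1] - poses[i] - d
            grad[i + 1] += 2 * r
            grad[i] -= 2 * r
        for i, j, d in loop_closures:         # closure: x[j] - x[i] ~ d
            r = poses[j] - poses[i] - d
            grad[j] += 2 * r
            grad[i] -= 2 * r
        for k in range(1, len(poses)):
            poses[k] -= lr * grad[k]
    return poses

# Odometry says we stepped +1 four times, but a loop closure says we
# ended up back at the start: the optimizer spreads the drift evenly
# across the whole chain instead of dumping it all at the end.
poses = optimize([1.0, 1.0, 1.0, 1.0], [(0, 4, 0.0)])
```

The key property is that no single measurement is trusted absolutely; the final trajectory is the compromise that best satisfies all constraints at once.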
To find out whether you've already visited an area, you store a descriptor for each location in a database and search it for matches. The paper says they use a compressed representation of the "maps" and use test-time training to optimize the global consistency between their submaps.
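The retrieval side can be sketched roughly like this: one descriptor vector per visited place, and a query that returns the best match above a similarity threshold. The `PlaceDB` class, the cosine-similarity scheme, and the threshold are all my own assumptions for illustration — not the paper's actual representation or search method.

```python
# Toy place-recognition lookup (hypothetical names/scheme, not the
# paper's method): store one descriptor per visited location, and
# match a new frame's descriptor against the database.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class PlaceDB:
    def __init__(self, threshold=0.9):
        self.entries = []          # list of (place_id, descriptor)
        self.threshold = threshold

    def add(self, place_id, descriptor):
        self.entries.append((place_id, descriptor))

    def query(self, descriptor):
        # Linear scan for clarity; real systems use approximate
        # nearest-neighbour indexes to stay fast at scale.
        if not self.entries:
            return None
        best = max(self.entries, key=lambda e: cosine(e[1], descriptor))
        if cosine(best[1], descriptor) >= self.threshold:
            return best[0]         # loop closure candidate found
        return None                # no match: probably a new place

db = PlaceDB()
db.add("hallway", [1.0, 0.0, 0.2])
db.add("kitchen", [0.0, 1.0, 0.1])
match = db.query([0.9, 0.1, 0.2])  # descriptor close to "hallway"
```

A hit here is what triggers the optimization step above: the matched place gives you the second, drift-free position estimate to pull the trajectory toward.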