We first had the idea of Monuments Mode about four years ago, in one of those “what if…” discussions where we explore crazy thoughts about potential new features for our app. We imagined a system that would automatically shoot static subjects and remove moving objects, like people walking by and getting in the way. The basic idea was: what if we grab a bunch of high-resolution video frames from the camera feed, align them perfectly on top of each other, and then somehow select the best parts, without the moving distractions?
We were not total newcomers to the problem. Some time back we had already created a Slow Shutter mode for long exposure photography. iPhone cameras can’t keep the shutter open for more than a few tenths of a second, so slow shutter mode works by grabbing a bunch of video frames and combining all of them together. For maximum performance, we implemented Slow Shutter using Apple’s low-level Metal framework, which was pretty new at the time. The solution looked similar to what we wanted to do in Monuments Mode, except for a couple of details.
First, in slow shutter mode we do not attempt to align the frames. It is a feature meant for photographers who know what they are doing, so we figured that someone wanting to capture a waterfall for 10 seconds would bring a tripod, or prop their phone against a rock or something. This wouldn’t fly for Monuments Mode: it is a feature for people on a trip who just want to grab the cleanest photo they possibly can. They shouldn’t need to know how the feature is implemented under the hood.
Then we had the problem of how to combine the frames. In slow shutter mode, every pixel of the final image is the average of the pixel at that same position across all the captured frames. Using the average is what produces that silky look in waterfalls or rivers, but it won’t work for people or bikes or cars. Unless you capture a lot of frames (which forces the exposure time to be very long), there’s bound to be some ghosting in the areas where people used to be: users would frequently see dirty areas or outlines in the final shots.
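To make the ghosting problem concrete, here is a tiny single-pixel sketch. The gray levels are made up, and this is an illustration, not the app’s actual code:

```python
# Toy illustration with made-up gray levels: a passer-by (value 40)
# covers this pixel in 2 of the 8 captured frames. Averaging all the
# frames leaves a visible trace -- the "ghosting" described above.
clean, passerby = 200, 40
samples = [clean] * 6 + [passerby] * 2

average = sum(samples) / len(samples)
print(average)  # 160.0 -- noticeably darker than the clean 200
```

Even with only a quarter of the frames obstructed, the averaged pixel ends up visibly off from the clean background value.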
Aligning the frames is a problem computer engineers like to call “registration”. We know that because we had already looked into it at some point in the past. We experimented with some algorithms that seemed feasible, but the computation requirements were high. Assuming we had a solution for alignment, we also thought about how to combine the frames. We knew that the average wouldn’t work. Instead, we thought about selecting the most frequent pixel value among all the captured frames, at each and every position of the image. If the moving distraction is traveling left to right, and we grab 10 frames half a second apart and look at a particular pixel, we see that most of the time the pixel is clear. That’s the reason for using the most frequent value at each location. However, there are practical difficulties. Calculating the average is easy because you can set up the computation so that you only need the current frame and the previous average. To get the most frequent pixel, however, you need to store all the frames and analyze them after you have saved them all. That meant high memory (or storage) requirements, and a potentially slow post-processing step.
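The storage trade-off can be sketched in a few lines. This is an illustrative toy with invented pixel values (200 = clean background, 40 = a passer-by), not the production code:

```python
import statistics

# The running average needs only the previous average and a frame counter,
# so memory use is constant no matter how many frames arrive:
def running_average(avg, count, new_value):
    return (avg * count + new_value) / (count + 1)

samples = [200, 200, 40, 200, 200, 40, 200, 200]

avg, n = 0.0, 0
for v in samples:
    avg = running_average(avg, n, v)
    n += 1
print(round(avg, 2))  # 160.0 -- constant memory, but the ghost remains

# Selecting the "most frequent" value (in practice, the median) requires
# keeping every sample around before sorting and selecting:
print(statistics.median(samples))  # 200.0 -- clean, but needs all frames stored
```

The median rejects the passer-by entirely, but only because all eight samples were available at once, which is exactly the memory problem described above.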
So we shelved the project.
In the following years we applied the Monuments Mode moniker to anything that looked impossible. When facing a really hard problem, my dear friends at LateNiteSoft would say, “Yeah, we are going to ace this, just like Pedro did Monuments Mode that time. Ha ha.” Every single time.
This year we decided to give it another go. Four years is a long time in computing, with Moore’s Law and all. This is especially true in Apple’s camp: just look at the M1 computers, or the new iPad Pro. Insane.
First, how to tackle the alignment problem. This turned out to be easier than expected. At some point Apple had introduced homography-based registration algorithms in the Vision framework in iOS, so we tested them before attempting to hand-craft a proprietary solution. After a bit of experimenting and fine-tuning, the Vision method worked pretty well! It gives us a homography matrix between two consecutive frames, which we apply to the second frame of the pair using a custom vertex shader written in Metal. We keep track of the image edges that are only present in one of the frames, and we slightly crop the end result after the process is over. If you were to use a tripod and there were no camera shake, alignment would be perfect and no crop would be necessary. We repeat this process for all the frames in the sequence.
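To illustrate what “applying a homography” means, here is a plain-Python sketch. The 3×3 matrix is a made-up example (a simple translation), not output from the Vision framework, and in the app this mapping runs on the GPU:

```python
# A homography maps pixel coordinates of one frame onto another through
# a 3x3 matrix, followed by a perspective divide.

def apply_homography(H, x, y):
    """Map pixel (x, y) through the 3x3 homography H (row-major)."""
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w  = H[2][0] * x + H[2][1] * y + H[2][2]
    return xh / w, yh / w  # perspective divide

# Hypothetical matrix: the second frame is shifted by (5, -3) pixels,
# as if the hand holding the phone drifted slightly between frames.
H = [[1.0, 0.0,  5.0],
     [0.0, 1.0, -3.0],
     [0.0, 0.0,  1.0]]

print(apply_homography(H, 100.0, 100.0))  # (105.0, 97.0)
```

A real homography from handheld footage also encodes rotation and perspective change, which is why the warped frame no longer covers the full rectangle and a small crop is needed at the end.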
Regarding the selection of the most frequent pixels, we still needed to keep all the frames we grab in memory. Fortunately, this is possible on modern devices without resorting to disk storage. To actually select the “most frequent” pixel we can use the median, but you need to sort all the pixel values to find it. In order to sort efficiently, we wrote a custom Metal kernel that uses “sorting networks”, which are much faster than general-purpose sorting algorithms when the number of items to sort is small and fixed. We wrote a bunch of sorting networks for 5 to 16 items. Again, we used Metal so we could process all pixels in parallel.
With the two major components of the solution in place, the next step was to provide visual feedback as the capture progresses. Instead of waiting to begin processing until all the frames have been grabbed, we start aligning and sorting as we gather them. As soon as we have a few frames, we display the current ongoing result in the viewfinder, and we refresh it as each new frame arrives. It’s kind of mesmerizing to watch objects disappear in front of your eyes.
This method works well under the assumptions we made: if most of the frames are clean for any particular pixel, the median will select a clean pixel in the final result. However, what happens if a certain area is busy or “dirty” for most of the frames? Then the final result will select a dirty pixel and the object will not be removed. So we kept working on the problem.
We came up with the idea of using a sophisticated machine-learning object-detection model to identify areas in the frames that are busy with obstacles. The model was trained on millions of images, and we configured it to look for subjects such as people, animals, vehicles, and the like. If we realize that a certain area is busy for most frames but clean in just a couple of shots, we favor the clean shots for the final compositing instead of blindly using the median. This AI-assisted optimization improves Monuments Mode in situations where the initial algorithm would fail.
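As a rough sketch of this idea, here is a toy per-pixel selection that favors samples a detector marked as clean. The values and masks are invented, and the real model and compositing are far more involved:

```python
import statistics

# Hedged sketch of the AI-assisted fallback: "busy_flags" stands in for
# the detector's output (True = this sample overlaps a detected person,
# vehicle, etc.). If any clean samples exist, take the median of those;
# otherwise fall back to the plain median over everything.

def pick_pixel(samples, busy_flags):
    clean = [v for v, busy in zip(samples, busy_flags) if not busy]
    if clean:
        return statistics.median(clean)
    return statistics.median(samples)

# A pixel that is obstructed in 6 of 8 frames: the plain median would
# pick a dirty value, but the detector rescues the two clean frames.
samples    = [40, 42, 41, 200, 43, 39, 44, 201]
busy_flags = [True, True, True, False, True, True, True, False]

print(pick_pixel(samples, busy_flags))  # 200.5 -- built from the clean frames
```

The plain median over these eight samples would land around 42, i.e. on the obstacle, which is exactly the failure mode the detector-guided selection works around.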
Before we wrap up, here is another video we made during development to showcase the use of Monuments Mode.
Keep in mind that Monuments Mode is not perfect, because it’s designed to work with moving subjects. If a person is standing there talking on the phone for a long time, we can’t remove them. We are not attempting to “invent” the pixels behind a person in the background (that’s an interesting project for another time). We do our best to identify clean areas, but there will be scenes that can’t be solved, and you’ll still see artifacts, ghosting, or static subjects. Rachel has a few tips on how to best use the new mode; we recommend you take a look at her advice!
The amount of computation involved in making Monuments Mode a reality is impressive. It requires a modern device with at least 4GB of RAM, and iOS 14.