How we created and deployed a super-resolution model on iPhone

Pedro Cuenca

Programmer. DL and photography aficionado.

We recently released a new version of Camera+. UltraRes, our super-resolution feature, is a big part of it. UltraRes allows you to capture images at a resolution of 48 Mpx, even though the iPhone sensor is only capable of 12 Mpx. It also allows you to upscale and edit any photo in your library to produce enlarged versions of up to 48 Mpx.

UltraRes screenshot

Creating UltraRes has been our largest ML effort after Magic ML. This post is about the challenges we faced and how we solved them. I’ll pay some attention to the compromises required to turn an experimental model into a feature that can be enjoyed by thousands of people. It is not enough that a model exists and that it works; it must also be fast, lightweight and straightforward to use. I think that deployment and production are still areas that receive relatively little attention, especially when compared with the big news around impressive (but usually impractical) research results.

Our Requirements

From the project inception, we decided on a simple set of requirements that we thought made sense for Camera+.

  • The feature would double the width and the height of the input image, for a total resolution increase factor of 4. Thus, a photo taken with a modern iPhone would go from 3024×4032 to 6048×8064 pixels, or 12 to 48 Mpx. There are machine learning models that are capable of larger scales, but we didn’t want to impose a huge penalty on performance. Creating a 100 Mpx image requires more time and more memory, of course. But it also requires a lot more disk storage space, and makes every future operation (editing, browsing, sharing) slower, more resource intensive, and more energy hungry. We didn’t want to melt your phone or drain its battery faster, and found 48 Mpx to be a good compromise.
  • We want models to run locally on your device, instead of being hosted in some computing cloud. Cloud inference requires connectivity, increases latency, and implicitly requires users to forgo some privacy, as photos need to travel back and forth to the inference server. A local model provides a better experience for Camera+ users, but it comes at a cost. Computing power is limited to the device hardware, and model size must be contained so the application doesn’t become extremely large. There are mechanisms to download models on demand, but they still require storage space and introduce other complexities. In this regard, our mandatory requirement was for the model to run locally, but we were open to exploring download options if we needed them. We ended up not using any.
  • The model must be fast enough that the feature can be used interactively. We could have designed some sort of background upscaling queue to which UltraRes jobs were submitted, but it’s a mess and it detracts a lot from the user experience. We want our users to be encouraged to explore the tools in the app, and the best way to do that is to make every action cheap, fast and reversible.
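To put the first requirement in numbers, here is a small sketch of the arithmetic. The function names are illustrative, and the 4 bytes per pixel (8-bit RGBA, uncompressed) is an assumption made only to get a feel for the memory involved; the post does not say how images are held in memory.

```python
def upscaled_size(width, height, factor=2):
    """Doubling width and height multiplies the pixel count by factor squared."""
    return width * factor, height * factor


def megapixels(width, height):
    return width * height / 1_000_000


def uncompressed_megabytes(width, height, bytes_per_pixel=4):
    # bytes_per_pixel=4 assumes 8-bit RGBA; this is an assumption, not from the post.
    return width * height * bytes_per_pixel / 1_000_000


if __name__ == "__main__":
    big_w, big_h = upscaled_size(3024, 4032)
    print(big_w, big_h, megapixels(big_w, big_h), uncompressed_megabytes(big_w, big_h))
```

Under that assumption, the upscaled image alone occupies close to 200 MB before any editing buffers are allocated, which is one reason larger scale factors get expensive quickly.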

Go / No Go

Before jumping into training a model we decided to explore existing options and find out what their strong points and limitations were. This would give us a sense of what type of quality we could expect, and at what cost. We knew that high-quality super-resolution models exist, but we didn’t know if we could make them run fast on an iPhone.

The first thing I did was to convert a few pre-trained models to Core ML, Apple’s native machine learning API and model format. The conversion process has improved a lot recently (it used to require an intermediate step through the ONNX format, which is no longer needed), but sometimes you still find obstacles. The reason is that not all the primitive operations in your ML framework of choice have a translation to Core ML, and you have to express them in a different way or implement the translation. I was using PyTorch, and managed to successfully convert and test Real-ESRGAN, BSRGAN and a few others. I was also able to convert SwinIR, but I couldn’t make it work on a real device.

I also wrote a Mac app to easily test model variants on a bunch of assorted photos, including many shot on iPhone. This gave us some initial insights on model limitations, such as excessively soft and flat skin textures in some cases, or slightly subdued colors in some areas. See below for a couple of examples on one of our test images.

Detail of results from two super-resolution models applied to a photo of a girl's face, zoomed up to show an area around one of her eyes and her nose. The results show flat skin areas and minor texture differences between the two.
Initial proof-of-concept tests. Notice the flat skin areas, especially in the nose and around her eye. Note the differences in hat texture and slight color differences in her hair. There’s also a tiling artifact in her forehead because my Mac app was not dealing with tiling at the time.

Visual testing is a crucial step. Research papers do report quality figures, but they are measured on certain images under certain conditions using specific metrics, and it’s hard to compare. For example, some papers report PSNR on images that were originally degraded using bicubic interpolation, and then upscaled using the model. PSNR conceptually measures how far away pixel values are between the upscaled image and the original one (before the degradation was applied). It is not particularly good at identifying images that are pleasing to the human eye. In addition, bicubic degradation is easy to apply, but unrealistic. If we train our model using that degradation only, it will learn about the artifacts and particular characteristics of the algorithm, but it might not be good enough for real-life photos that have not gone through a bicubic process. More on this later.
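For reference, PSNR is derived from the mean squared error between two images of the same size. The following is a minimal sketch using nested lists of 8-bit values; it is only meant to show what the metric measures, not how any particular paper computes it (papers differ in color space, cropping and so on).

```python
import math


def psnr(reference, candidate, max_value=255.0):
    """Peak signal-to-noise ratio, in dB, between two same-sized grayscale images."""
    squared_error = 0.0
    count = 0
    for ref_row, cand_row in zip(reference, candidate):
        for ref, cand in zip(ref_row, cand_row):
            squared_error += (ref - cand) ** 2
            count += 1
    if count == 0:
        raise ValueError("images must not be empty")
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the score depends only on per-pixel differences, a slightly blurry result can score higher than a sharper one whose invented texture does not line up pixel by pixel with the original.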

At the end of this phase I also created a proof-of-concept upscaling feature in Camera+, to see what it felt like. It took 68s for a BSRGAN model to process a 12 Mpx iPhone photo on my iPhone 13 Pro, while the user had to stare at the screen waiting for it to complete. Jorge set the impossible goal of doing it in 10s or less, and we decided to keep working on it.

Training a Model

As we said before, degradation plays a key role in super-resolution training. You want your model to learn how to invent a large number of pixels by looking at a low-resolution image. The way to do this is:

  • Start with a high-resolution image.
  • Downscale it using some degradation method.
  • Ask the model to generate a large image starting from the small one.
  • Measure the difference between the image generated by the model and the one we started with, and make the model do it better in the next training iteration.
  • Repeat many times with many pictures until the model (hopefully) is good enough to solve the problem in a general way.
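The steps above can be sketched as a toy training loop. This is deliberately tiny and is not the training code used for UltraRes: the "degradation" is a plain 2×2 box average, the "model" is nearest-neighbour upscaling with a single learnable gain, and the loss is mean squared error. All of these are stand-ins chosen so the loop runs with the standard library only.

```python
import random


def downscale_2x(image):
    """Degradation step: average every 2x2 block (stand-in for a real degradation)."""
    return [
        [
            (image[y][x] + image[y][x + 1] + image[y + 1][x] + image[y + 1][x + 1]) / 4.0
            for x in range(0, len(image[0]), 2)
        ]
        for y in range(0, len(image), 2)
    ]


def upscale_2x(image, gain):
    """Toy "model": nearest-neighbour 2x upscaling scaled by one learnable gain."""
    out = []
    for row in image:
        wide = [gain * value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def loss_and_gradient(low, high, gain):
    """Mean squared error against the original, and its derivative w.r.t. gain."""
    prediction = upscale_2x(low, gain)
    unscaled = upscale_2x(low, 1.0)  # derivative of the prediction w.r.t. gain
    count = len(high) * len(high[0])
    loss = 0.0
    gradient = 0.0
    for pred_row, high_row, base_row in zip(prediction, high, unscaled):
        for pred, target, base in zip(pred_row, high_row, base_row):
            loss += (pred - target) ** 2
            gradient += 2.0 * (pred - target) * base
    return loss / count, gradient / count


def train(images, steps=50, learning_rate=0.5, gain=0.2, seed=0):
    rng = random.Random(seed)
    losses = []
    for _ in range(steps):
        high = rng.choice(images)        # start with a high-resolution image
        low = downscale_2x(high)         # downscale it with a degradation
        loss, gradient = loss_and_gradient(low, high, gain)  # upscale and compare
        gain -= learning_rate * gradient  # do better in the next iteration
        losses.append(loss)
    return gain, losses
```

In a real setup the model has millions of parameters, the update is done by an optimizer through automatic differentiation, and the degradation is far more elaborate, but the shape of the loop is the same.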

Research papers on general-purpose super-resolution abound. In recent years, improvements usually come from two areas: better and larger model architectures; and more elaborate degradation schemes. A larger architecture increases the computational capacity, while a better degradation makes the model generalize better. After our initial tests, we observed that degradation was possibly the most important factor of the two. A recent, large model will be better than the same model using fewer layers, but if the degradation is good enough the small model will approach the quality of the large one. This is not something I measured numerically, but an intuition I got by observing our training runs.

With this hypothesis, I decided to focus on a few key architecture components, collected a large image dataset, selected several degradation algorithms from different sources, and started launching training combinations. The clear consensus on experimenting on a problem is to move as fast as you can to try many ideas and quickly be able to accept or reject them. I have to say that I failed to do this. The problem was that progress was slow and training losses decreased very slowly, so you could barely see the difference between epochs. I would have liked to come up with a metric that could measure how well we were doing, but I couldn’t find it and resorted to looking at examples. I would launch a few runs and save model checkpoints every so often. Every hour I would run the checkpoints on the test set, and visually inspect how they were doing. Fortunately, I could automate everything (except the visual inspection) using Weights & Biases. I logged training losses, uploaded model checkpoints and submitted image predictions from cron jobs. Then I could filter by any training parameter to easily examine results. I was slow to try this service out, but I found it tremendously useful during this project.

Screenshot of four images that represent results from 4 different training runs, visualized in an interactive widget for exploring model and image variants.
Visualization using Weights & Biases. Each image comes from a different training run. I could use the Step slider to see what images looked like as training progressed. The Index slider cycles through various test images. It’s hard to see differences at this scale, but I spent hours looking at these plots.

In terms of computing power, I used whatever I could get my hands on:

  • My RTX 3090 GPU in my Linux box.
  • GPU Cloud Computing from Coreweave. I really like their UI.
  • TPU instances from Google Cloud. I was lucky enough to receive some free TPU credits as part of the TPU Research Cloud program, which I highly recommend applying to. I had to port some code to PyTorch-XLA to take full advantage of the hardware, but it was worthwhile. With some more time, I would have ported some models to JAX/Flax for much faster training.

If I had to repeat the process now, I would put more effort into achieving shorter iteration times. Perhaps I could have used FID (Fréchet Inception Distance) as a metric, but I didn’t know about it yet. I didn’t know about Jarvislabs either; I would definitely have used them for GPU computing.

The goal of this process was to find a small model with high quality. We considered several candidates, but they usually had shortcomings in some areas. One of our test images was my GitHub user profile avatar. It’s a terrible image: initially shot on an iPhone, I cropped it to upload to some social media site, then downloaded it again and submitted it to GitHub. Having gone through this process made it blurry and out of focus. It also has difficult features: a face, and a striped fabric texture in the hat. So it was great for testing. The following is an example of some runs showing limitations in the reconstruction of the hat texture.

Grid of four images: a low quality original and three upscaled versions. The upscaled versions show different artifacts, like flat areas or unnatural skin regions.
Top left: original image. Top right: large flat areas in the hat. Bottom left: my skin doesn’t have enough texture, it looks almost posterized. Bottom right is more balanced overall. The original is 512×512 and the rest were upscaled to 1024×1024 and downscaled for visualization.

At the end of this process, we found a winner using:

  • An architecture based on Residual Dense Blocks, with some additional tweaks.
  • The use of LPIPS (perceptual loss), in addition to pixel distance. Many papers seem to use an initial training phase with some sort of distance loss, and then a separate GAN phase where they try to make results more visually appealing. I did not use the GAN phase because I found it confusing, and it doubled the training time requirements. Adopting LPIPS in the first phase seemed to be enough for my tests.
  • A combination of degradations from several papers. I relaxed some that were so strong that they sometimes produced unrealistic-looking results, and accentuated others that might have been too subtle.
  • The smallest architecture we could get away with that still produced high quality results on our test set.

Making it Fast. Use the ANE.

When I mentioned Core ML I didn’t say that a converted model may run on any of the computing processors inside modern iPhones: the CPU (you don’t usually want this), the GPU (fast!), or the ANE (very fast). ANE stands for Apple Neural Engine. It is a custom piece of hardware that is highly optimized for the type of operations that are required for ML tasks. A feature like UltraRes works like this: you tile a large 12 Mpx photo into much smaller pieces, process each tile through the model and stitch all the pieces together to assemble the final 48 Mpx image. The slowest operation is the model’s upscaling, so you want it to be as fast as possible, because it has to run many times for a single image. We wanted the ANE.
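The tile, upscale and stitch idea can be sketched in a few lines. This is an illustration, not the Camera+ implementation: the helper names are made up, nearest-neighbour upscaling stands in for the model, and the way the last tile in each row or column is shifted back to stay inside the image is an assumption (the post does not describe how edges or tile seams are handled). The sketch also assumes the image is at least one tile wide and tall.

```python
def tile_starts(length, tile):
    """Start offsets so that fixed-size tiles cover [0, length)."""
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile, tile))
    starts.append(length - tile)  # last tile is shifted back to stay inside the image
    return starts


def nearest_2x(patch):
    """Stand-in for the model: nearest-neighbour 2x upscaling of one tile."""
    out = []
    for row in patch:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def upscale_tiled(image, tile, scale, model):
    """Cut the image into tiles, run each through the model, stitch the results."""
    height, width = len(image), len(image[0])
    out = [[None] * (width * scale) for _ in range(height * scale)]
    for y0 in tile_starts(height, tile):
        for x0 in tile_starts(width, tile):
            patch = [row[x0:x0 + tile] for row in image[y0:y0 + tile]]
            big = model(patch)
            for dy, big_row in enumerate(big):
                x_out = x0 * scale
                out[y0 * scale + dy][x_out:x_out + len(big_row)] = big_row
    return out
```

With 256×256 tiles, `tile_starts` yields 12 offsets for 3024 pixels and 16 for 4032, so a 12 Mpx photo needs 192 model invocations under these assumptions. That is why the per-tile cost dominates everything else.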

Unfortunately, there is no way to tell the system to please run this model in the ANE at all costs. Instead, you prepare your Core ML model, and if the system deems it appropriate to run in the ANE, it will usually be sent there. As it turns out, the ANE was built using a different set of constraints than the GPU, and there are some operations that it cannot run. There is no easy way to know whether the model is running in the ANE, other than examining logs or setting breakpoints in your code.

In our case, our model was not compatible with the ANE. The reason was in the initial layers, which are reproduced below.

Schematic diagram of the initial layers of the model under discussion. The model receives an image as its input, and a block of three layers (highlighted), converts it to a tensor of a different shape.

The input image (a tensor with shape (3, 256, 256)) is converted to a matrix sized (12, 128, 128), but the operation requires an intermediate transformation to a matrix with 6 dimensions. Unfortunately, the maximum number of dimensions the ANE seems to support is 5, so the model could not run there and was relegated to GPU hardware. Upscaling an image took 36s in this setting.
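The shapes involved match a space-to-depth rearrangement (what PyTorch calls pixel unshuffle) with a factor of 2. That identification is an assumption on my part, since only the shapes are given here. In frameworks this kind of operation is typically expressed as a reshape to (batch, C, H/2, 2, W/2, 2), a permutation of axes, and a final reshape, which is where a 6-dimensional intermediate would come from. The following plain Python sketch shows the resulting index mapping without any intermediate tensor.

```python
def pixel_unshuffle(image, factor=2):
    """Rearrange [C][H][W] nested lists into [C*factor*factor][H//factor][W//factor].

    Output channel c*factor*factor + i*factor + j holds the pixels of input
    channel c found at rows y*factor + i and columns x*factor + j.
    """
    height, width = len(image[0]), len(image[0][0])
    if height % factor or width % factor:
        raise ValueError("height and width must be divisible by factor")
    out = []
    for channel in image:
        for i in range(factor):
            for j in range(factor):
                out.append([
                    [channel[y * factor + i][x * factor + j]
                     for x in range(width // factor)]
                    for y in range(height // factor)
                ])
    return out
```

No pixel values change: each 2×2 neighbourhood is simply spread across four channels, which halves the spatial size the following layers have to process.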

What I did to try to overcome this problem was to remove those layers and, instead of supplying the tile input as an image, prepare the tensor outside the network and supply it as an MLMultiArray object, which is Apple’s spelling for tensor. However, this is usually not a good idea. The usual recommendation is to provide image inputs whenever the input to your model is an image. Lo and behold, upscaling using this technique now took 85s (instead of 36s), despite the fact that the model was now running in the ANE! (I forced it to run on GPU for comparison’s sake, and the tensor-based model took 129s to run, which is directly comparable to the 36s it took for the image input to run on the same device.)

The problem was that all the pixel manipulation required to convert the image to the desired input tensor involved a slow loop in which pixels were copied and reorganized. I split this operation into three parts: two of them were just a matter of managing the strides of the tensors to refer to the memory using a different layout (which is nearly instantaneous), and the other one was a loop to make all pixels contiguous, just like PyTorch requires contiguous() for view() to work.
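To illustrate the difference between the cheap and the expensive parts, here is a small sketch of how strides work, using a flat Python list as the memory buffer. The function names are illustrative. Permuting axes only swaps shape and stride entries and touches no data, whereas making the result contiguous has to visit and copy every element.

```python
import itertools


def contiguous_strides(shape):
    """Row-major strides (in elements) for a densely packed tensor."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def permute(shape, strides, order):
    """Reorder axes by shuffling metadata only; the buffer is untouched."""
    return tuple(shape[i] for i in order), tuple(strides[i] for i in order)


def make_contiguous(buffer, shape, strides):
    """Copy elements into a new row-major buffer; this is the slow loop."""
    out = []
    for index in itertools.product(*(range(dim) for dim in shape)):
        offset = sum(i * s for i, s in zip(index, strides))
        out.append(buffer[offset])
    return out
```

For example, transposing a 2×3 tensor is just `permute((2, 3), (3, 1), (1, 0))`, but reading it back in row-major order requires `make_contiguous`, which is the per-element work that benefits from running in parallel on the GPU.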

The trick was to implement the contiguous operation as a Metal kernel instead of a loop. This is a gist of the kernel I used.

Therefore, the full process was:

  • Cut the original image into tiles.
  • Prepare the model input leveraging a Metal kernel that runs in the GPU, in addition to some reshaping operations.
  • Run the input through the model, in the ANE.
  • Collect the results and stitch the image.

This was now taking 4s for a 12 Mpx iPhone photo, running the process on my iPhone 13 Pro. We were golden!

I’m proud of that result, so just allow me to show a summary table with the evolution:

Model       Input                                  Runs in ANE   Time
BSRGAN-x2   Proof of concept                       Yes           68s
Custom      Image input                            No            36s
Custom      Tensor input (preparation overhead)    No (forced)   129s
Custom      Tensor input (preparation overhead)    Yes           85s
Custom      Tensor input, Metal kernel             Yes           4s
The optimization path

Final Thoughts

  • Creating an ML feature is hard. It’s also difficult to assess and plan for. In my work as a software developer, I may underestimate the time to create a new feature, but I know that the feature is possible and I usually have a good idea from the outset of how to do it. ML is a different beast, as I don’t even know if what we want to do is feasible. And if it is, I have no idea how long it will take to create.
  • Turning a model into a production-grade feature is still challenging, despite many advances in recent years. Going deep inside the bottom layers of the software stack helps.
  • It’s useful to set up a full pipeline from the beginning of the project: test dataset, training, model preparation for the desired hardware. Make sure you can complete the cycle and have reasonable checkpoints at every step of the way. Be prepared to cancel or adapt early.
  • A few things I would have liked to do but didn’t:
    • I didn’t try Diffusion models. Too new, possibly too slow. That’s what I thought, at least.
    • I should have tried newer architectures like ViT or ConvNeXt, possibly in a U-Net setting.
    • As mentioned before, I should have made a better effort to run a much shorter development cycle to validate ideas.
  • Thanks to the team for incredible ideas and suggestions! A great team is always the best way to push further.
