Every year new iPhones are released, and every year the new cameras in those iPhones are arguably their most important selling points. Advances in optics and sensors are obvious when you compare the latest models with those from just a few years back. But the new software that accompanies the new cameras is becoming increasingly important as well. Consider, for example, how features like Night Mode would have been considered unrealizable science fiction just 4 or 5 years ago.
This is why, over the past couple of years, our team has had frequent internal discussions about how this new concept of Computational Photography is bound to change the way we experience photography, whether you are an amateur photographer or a pro. We have dabbled in Machine Learning / AI in the past – Camera+ 2 has had features like Smile Mode and Action Mode for a while. Smile Mode automatically shoots a picture when everybody in the frame is smiling, and Action Mode tracks a moving subject to ensure exposure and focus are always correct for that subject.
These features taught us about techniques such as feature detection and object recognition, and about the intricacies involved in making ML models run fast and efficiently on a mobile device.
Magic ML is the most ambitious AI project we’ve undertaken so far. It has been as much a learning process as a product development activity, and in certain respects it has also included some research. Having completed the feature we set out to achieve, we are now much more comfortable with our understanding of the technology and what it can do for our users in years to come. Magic ML is the first step in a completely new photographic direction, and we’d like to share some of the details of the journey with you.
Designing Magic ML
The idea behind Magic ML is simple: take a picture shot with an iPhone and try to make it better by applying some of the photographic adjustments Camera+ 2 already has. As usual, the details in the previous sentence are what matter most.
First, we did not want our definition of “better” to be too opinionated. Everyone has their own taste and style, and we did not want our app to impose its own vision of what makes a good picture. Some people, myself included, tend to like vivid colors, so they show up frequently in the photos I take for fun. Others prefer a more subdued style; others emphasize texture or geometry. We wanted Magic ML to “just” look at the optical qualities of the image, come up with reasonable suggestions regarding exposure and color balance, and then have the user apply their own style on top.
For similar reasons, we also wanted the changes to be completely configurable and non-destructive. Magic ML had to be the easiest way to apply a few different filters at once, and those filters should make most photos better exposed, but the user should be able to tweak them to her own liking and not just accept whatever values Magic ML suggests. This requirement ruled out a number of popular techniques, such as those under the Style Transfer umbrella, as well as most other generative approaches. Those methods can achieve impressive results in some domains, but their results are completely opaque and cannot be easily tuned or tweaked. They are also slow and impractical for high-resolution pictures. In short, we did not want to replace the pixels in your photo with new ones; what we wanted was to recommend a set of values for the essential photographic adjustments in The Lab.
Finally, we also recognized that some of our users don’t have a lot of interest in obsessing over fine-tuning their photos – they just want the best photo their iPhone can possibly take. Another requirement we had for Magic ML was that it should be able to work as a camera preset, so it would be applied to all of the photos taken by users who select that preset. This requires an efficient implementation that runs fast and does not take up a lot of resources.
There’s a common conception that AI requires millions and millions of photos to successfully train a model for a given task. Well, let me let you in on a secret – this is not always the case. It is indeed true if you create a neural network from scratch, but most times you can take networks created by others and tweak them to solve your particular problem. That’s because modern neural networks are very deep (they have a lot of layers), and all those layers learn how to detect certain features in any photo. Some layers may learn to detect a particular combination of colors in a certain area; others might learn about recognizable shapes or patterns. As long as the training images are sufficiently diverse, those learned features are representative of images in general, not only of the ones used to train the network. Therefore, a model trained on a dataset comprising millions of images can be adapted to work on other images, because chances are those images will exhibit features and patterns similar to the ones in the original dataset.
I learned about this technique from fast.ai’s Jeremy Howard. They constantly create fantastic courses that explain in easy terms how neural networks work and provide lots of practical guidance that demystifies the practice of applying AI to solve real problems. They also author a software library for AI, whose version 2 (still under development) I used to run most of the experiments that led to the creation of Magic ML. I also used fast.ai’s nbdev (github), a fabulous programming environment for Jupyter notebooks that made my life much easier while I was exploring ideas. I cannot recommend those resources enough to anyone interested in learning or applying AI & Deep Learning.
By applying this idea of transfer learning (take a previously trained network, and adapt it to solve a different problem), we were able to perform all our experiments and training using just a few thousand photos, instead of millions.
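As a hypothetical sketch of what transfer learning looks like in PyTorch (the layer names and sizes here are illustrative stand-ins, not the actual Magic ML architecture): freeze the pretrained backbone, attach a small new head, and train only the head.

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained backbone (in practice you would load a real
# network, such as a ResNet, with its pretrained weights).
backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

# Freeze the pretrained layers: their learned features are reused as-is.
for param in backbone.parameters():
    param.requires_grad = False

# Attach a small new head that predicts, say, 6 filter parameters.
head = nn.Linear(16, 6)
model = nn.Sequential(backbone, head)

# Only the head's parameters are trainable now.
trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable))  # 16*6 weights + 6 biases = 102
```

Because only the small head needs to learn from scratch, a few thousand examples go a long way.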
Photo Selection and Training Process
To do that, we first selected several thousand pictures from public sources, making sure that they were of good quality. This is key, because most of the datasets frequently used for ML purposes do not have great photographic quality – they are intended for other uses, such as object detection, where good exposure or color are not very important. We then manually culled all those photos to select the best ones in terms of exposure and color, making sure at the same time that they were relatively homogeneous in style. For example, there were great artistic black & white photos that had to be discarded, because that’s a very particular visual decision and we didn’t want the network to make it on its own.
This was a relatively lengthy task, but it would have been almost impossible if we had attempted to collect millions of images!
In order to train the network, even starting from a pre-trained one, you need to show it a lot of different examples so it can learn the “rules” behind the examples. I borrowed another great technique from Jeremy Howard: starting from the beautiful pictures we had selected, we ran them through a process that made them much worse, by randomly making changes to exposure, brightness, contrast, saturation, color temperature and other parameters. This is what the fastai team jokingly calls the crappification process. We then created a neural network and showed it a lot of pairs, each one consisting of two items:
- An ugly picture.
- The parameters used to create the ugly picture from the corresponding beautiful one.
This way, the network learns the crappification parameters by just looking at the ugly pictures, without ever seeing the beautiful ones. After the training process runs for a while, you can show the network an ugly picture, and the model predicts the parameters that made that image worse. In a way, the model is predicting how the bad picture originated from a reference, “ideal” photo that is unknown.
The great thing about the crappification process is that you can apply it as many times as you like. For a single set of a few thousand images, you can generate a lot of ugly images from which the network can learn.
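A minimal sketch of the crappification idea in plain Python (the parameters, ranges and pixel representation are made up for illustration): degrade an image with random adjustments, and keep the parameters alongside the result as a training pair.

```python
import random

def crappify(pixels, rng):
    """Degrade an image (a list of values in 0..1) with random exposure
    and contrast shifts, returning the ugly image and the parameters used."""
    params = {
        "exposure": rng.uniform(-1.0, 1.0),  # in stops (EV)
        "contrast": rng.uniform(0.5, 1.5),   # scale around mid-gray
    }
    degraded = []
    for v in pixels:
        v = v * (2.0 ** params["exposure"])        # exposure shift
        v = 0.5 + (v - 0.5) * params["contrast"]   # contrast shift
        degraded.append(min(1.0, max(0.0, v)))     # clamp to the valid range
    return degraded, params

rng = random.Random(42)
beautiful = [0.2, 0.5, 0.8]
# One beautiful photo can yield many different (ugly, params) training pairs.
pairs = [crappify(beautiful, rng) for _ in range(3)]
```

Each call draws fresh random parameters, which is exactly why one curated image can be reused many times.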
The prediction of a few numbers is what’s called a regression problem. It was the first approach we tried in order to assess the feasibility of Magic ML. It did work, as you can see in the following figures.
However, even though regression worked, we soon found that it was not good enough to be used for Magic ML. There were two main problems:
Reversion of filter parameters
Our model did successfully predict the crappification values, i.e., the parameters that make a photo look worse than the original. What we really need are the opposite values: the ones that make a photo look better. For very simple transformations, such as an increase of +1 EV in exposure, there is a direct mathematical relationship between the way to increase the value and the way to decrease it. In the case of EV, an increase of one stop can be reverted by symmetrically decreasing one stop.
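The EV case is easy to see in code, because each stop is just a power of two, so the inverse is exact (a toy single-value example):

```python
def apply_ev(value, stops):
    """Exposure in stops (EV): each stop doubles or halves the intensity."""
    return value * (2.0 ** stops)

v = 0.25
brighter = apply_ev(v, +1)         # 0.5: one stop up doubles the value
restored = apply_ev(brighter, -1)  # 0.25: the symmetric inverse is exact
```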
The bad news is that there are no such simple filters in Camera+ 2, or in most other modern apps. In real life, filters are not created using simple mathematical formulas or physical concepts. Instead, they are crafted to look great in a wide range of situations, and very often a single filter performs a lot of primitive operations on the image. Many of these operations cannot be easily reverted with a simple equation. Therefore, being able to predict what happened to the image to make it look worse does not help much in identifying what needs to be done to make it better.
Some transformations are destructive and can never be fully reverted. This is the case, for example, if you manually increase exposure by several stops: the light areas of the image become pure white, and we then refer to those areas as being burnt out. All texture in the affected area is lost – a lot of slightly different colors were all converted to white, and there is no going back from white to the original whitish colors.
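A tiny numeric illustration of why clipping is irreversible (a toy one-channel pipeline, not Camera+ 2’s actual filters): three distinct highlight tones all clip to pure white, and pulling the exposure back down cannot tell them apart anymore.

```python
def apply_ev_clamped(value, stops):
    """Exposure with the clamping a real pixel pipeline performs."""
    return min(1.0, value * (2.0 ** stops))

highlights = [0.6, 0.7, 0.9]
burnt = [apply_ev_clamped(v, +2) for v in highlights]  # all clip to 1.0
recovered = [apply_ev_clamped(v, -2) for v in burnt]   # all become 0.25
# Three distinct tones collapsed to a single value; the texture is gone.
```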
When this happens, there is really no point in predicting the reverse transformations, because the information is already lost. In the regression problem, the loss of information produced by a single filter that goes out of range cannot be compensated by the other filters, because they are all predicted independently. Even though it is impossible to recover the lost range, we can at least try to minimize the impact and do the best we can with the tools we have. For example, if the original image was heavily overexposed, we can try to play with the brightness and the saturation to make it a bit better, and that would be much better than just decreasing the exposure by a large amount.
FilterNet: let the network do the filtering
In order to improve the quality of the model and overcome the obstacles identified during our regression tests, we took a different approach, which we ended up calling FilterNet. It renders the filters inside the neural network, so the model can learn the best combination of filter values for a particular image. We are not aware of any other AI model that performs image generation by explicitly rendering filters as part of the network computation.
It works like this:
- Show the network the ugly pictures and the corresponding beautiful ones (instead of the transformation parameters).
- Modify the network architecture so it looks at the ugly pictures and predicts the values to apply to improve them. It will initially try random values (not totally random, actually, but that’s a story for a different post), and will progressively learn to make them better.
- Render the predicted filter values. This produces a transformed, and hopefully improved version of the initial ugly photo.
- Compare the improved version with the original, beautiful one, and measure the difference. This produces a loss figure that is used to slightly tune the improvement parameters so they work a bit better on the next iteration.
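The steps above can be sketched in PyTorch roughly like this. Everything here is a toy stand-in: two made-up filters (exposure and contrast) rendered as differentiable tensor ops, and tiny random “images” instead of real photos.

```python
import torch
import torch.nn as nn

class TinyFilterNet(nn.Module):
    """Predicts two filter values per image and renders them in-network."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(12, 2))

    def forward(self, ugly):
        params = self.net(ugly)                 # predict the filter values
        exposure = params[:, 0].view(-1, 1, 1)
        contrast = params[:, 1].view(-1, 1, 1)
        rendered = ugly.view(-1, 3, 4)
        # "Render" the filters inside the network, as differentiable ops:
        rendered = rendered * torch.exp2(exposure)          # exposure
        rendered = 0.5 + (rendered - 0.5) * (1 + contrast)  # contrast
        return rendered

torch.manual_seed(0)
model = TinyFilterNet()
ugly = torch.rand(8, 3, 4)       # batch of tiny "ugly" images
beautiful = torch.rand(8, 3, 4)  # corresponding "beautiful" references

# Compare the rendered result with the reference and backpropagate: the
# gradients flow through the rendered filters back to the predictions.
loss = nn.functional.mse_loss(model(ugly), beautiful)
loss.backward()
```

Running this step in a loop, with an optimizer updating the weights after each `backward()`, is the training process described above.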
As this is a novel process (as far as we are aware), we needed to solve a few problems we encountered along the way. We don’t want to bore you with all the technical details in what’s already a very long post, but a couple of topics are worth outlining.
Do not use sRGB
Everybody uses sRGB. It acts as the default colorspace if nothing else is indicated. The problem with sRGB is that, for historical reasons, colors are encoded using a non-linear relationship.
This is usually not a problem, unless you want to perform mathematical operations on the colors. For example, multiplying a color intensity by a factor of 2 does not make it twice as bright – the end result is something else. In our case, we need to render our filters as part of the computations the network performs, so we need those operations to be accurate. This means that we first need to convert all images to a linear colorspace, and then we can safely perform all computations in that linear space.
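The standard sRGB decoding formula, written as a small Python helper, shows how non-linear the encoding is (this is the published sRGB transfer function, applied per channel):

```python
def srgb_to_linear(c):
    """Decode an sRGB-encoded channel value (0..1) to linear light."""
    if c <= 0.04045:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4

# sRGB 0.5 corresponds to only ~21% linear light, so doubling the encoded
# value would not double the actual brightness.
mid = srgb_to_linear(0.5)  # ≈ 0.214
```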
And while we are at it, we also use 10-bit precision instead of the usual 8 bits per channel that are commonplace in many domains, including Deep Learning. This ensures that all our filters compose adequately with each other, with minimal precision errors.
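A quick way to see the benefit (a toy calculation, not our actual pipeline): 10 bits gives 1023 levels per channel instead of 255, so the error introduced each time a value is rounded to the nearest representable level shrinks by roughly a factor of four.

```python
def quantize(value, bits):
    """Round a 0..1 value to the nearest representable level."""
    levels = (1 << bits) - 1
    return round(value * levels) / levels

v = 0.123456
err8 = abs(quantize(v, 8) - v)    # error at 8 bits (255 levels)
err10 = abs(quantize(v, 10) - v)  # error at 10 bits (1023 levels)
```

Those rounding errors accumulate as filters are composed, which is why the extra precision matters.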
As already indicated, filters need to be applied as part of the computations the network performs. This is because we have to compare the rendered images with the reference ones at every step of the training process. In order to do that, we had to port all of our filters and implement them as custom network layers, so they can run as any other network module and operate on multidimensional tensors rather than 2D images. We had to be really careful to ensure that the ported filters produced exactly the same results as their native Camera+ 2 counterparts.
But the main obstacle was not the porting itself, it was that filters had to be differentiable for the network to be able to learn. Differentiability is the way to measure how a small modification of a value in the network impacts the final result, and it is essential to decide by how much we need to update the filter predictions at every step of the training process, to make them slightly better each time.
Of course, filters are not required to be differentiable in any editing application, so this was a completely new and unfamiliar requirement. We implemented our network filters as a combination of image processing primitives, and we had to dig out our high-school calculus to write custom differentiability code for each of those primitives. If we had skipped this crucial step, our model would never have learned. Thankfully, the built-in automatic differentiation features in PyTorch allowed us to write the differentiability code in a high-level language (Python) instead of requiring specialized low-level GPU code.
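As a hypothetical illustration of this kind of hand-written differentiability code (the primitive itself is made up for this example), here is a toy exposure primitive written as a custom `torch.autograd.Function` with an explicit backward pass, checked against numerical derivatives:

```python
import math
import torch

class Exposure(torch.autograd.Function):
    """out = image * 2**stops, with hand-written derivatives."""

    @staticmethod
    def forward(ctx, image, stops):
        scale = torch.exp2(stops)
        ctx.save_for_backward(image, scale)
        return image * scale

    @staticmethod
    def backward(ctx, grad_output):
        image, scale = ctx.saved_tensors
        # d(out)/d(image) = 2**stops
        grad_image = grad_output * scale
        # d(out)/d(stops) = image * 2**stops * ln(2), summed to a scalar
        grad_stops = (grad_output * image * scale * math.log(2)).sum()
        return grad_image, grad_stops

image = torch.rand(3, 4, dtype=torch.double, requires_grad=True)
stops = torch.tensor(0.5, dtype=torch.double, requires_grad=True)
# gradcheck compares the analytic backward with numerical derivatives.
ok = torch.autograd.gradcheck(Exposure.apply, (image, stops))
```

A `gradcheck` like this is a cheap way to catch a wrong derivative before it silently ruins training.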
With all these pieces in place, we could finally attempt to train our model.
Training a model is not just pushing an ignition button. There are a ton of network topology variants, parameters to select, ways to initialize the layers, methods to normalize the network weights, and features to select for your loss function. We trained more than 100 different model variants and tested each of them on 300 images we had taken ourselves. Some of the models were great for inanimate subjects such as landscapes or buildings, but produced unnaturally accentuated skin tones when applied to people. Our brains are trained to reject unnatural-looking representations of people, so those models had to be rejected, refined or fine-tuned. Finally, 6 models made it to the shortlist and we chose the best one for Magic ML 1.0.
The final step was to convert the model and adapt it to run efficiently on iOS. The model was created on a Linux box with an NVIDIA GPU, the most common configuration these days. We performed some model surgery (a term I borrowed from the excellent CoreML Survival Guide by Matthijs Hollemans) to remove the rendering layers and to ensure image normalization was accurate. The custom rendering layers we created are only necessary during the training process; at inference time (i.e., when the model is used in your iPhone), we just need the prediction of the filter values – the final image is rendered at full resolution using those values. After some tinkering and working around a couple of bugs and inconsistencies in the conversion process, we got a CoreML model suitable for iOS 13. Because CoreML relies on Metal and uses either the GPU or the dedicated Neural Engine (in devices that have one), predictions are fast and energy efficient. And, very importantly, the whole process runs inside your iPhone or iPad – your photos are never uploaded to a server.
Magic ML has been a great learning experience. We achieved what we set out to do: it is reasonable, judicious and cautious – it will improve your underexposed pictures and correct color casts, but it will not make gratuitous changes that destroy an image that was already fine to begin with.
We have a much more solid grasp of AI now and how to apply it. As a consequence, we are starting to identify areas where it could be used to make your life as a photographer easier without getting in your way. It’s time to get back to the drawing board and design the next feature!