Pablo Picasso? I wish.

Sarrah Rose
7 min read · Jan 26, 2021

“To draw you must close your eyes & sing.” — Pablo Picasso

Reaching back through the hallowed halls of art museums & human history, “style” is a notion oft-regarded as beautifully intangible, & yet a simultaneously visceral thing.

It is Picasso’s Cubism and impressionist paintings by Monet. It is the confusingly entrancing images of Salvador Dali and baroque architecture which overwhelmed 17th Century Europe. And while one could certainly attempt to imitate these styles, there exists a sense of “soul” within art, something incredibly hard to pin down despite being exceedingly obvious at the same time.

Introduction to Style Transfer

This is why Neural Style Transfer is so, incredibly cool! In recent years, we’ve seen increasing attempts at “style transfer”, where the contents of one image are recomposed in the style of another. Techniques such as non-photorealistic rendering existed before this, but those methods were pretty inflexible and inefficient. Then came Leon.

In 2015, Leon Gatys et al. released the paper “A Neural Algorithm of Artistic Style”, proving the viability of using Deep Neural Networks (DNNs) to conduct style transfer. This was exciting because it demonstrated that DNNs could disentangle the representation of the “content” (structure) of an image from its “style” (appearance). Essentially, they had found a way to encode these somewhat abstract concepts into concrete mathematical functions that could be manipulated & computed.

Neural Style Transfer

The objective of the style transfer algorithm is to (i) synthesise a generated image by (ii) minimising the difference in “content” between the content image & generated image, while simultaneously (iii) minimising the difference in “style” between the style image & generated image.

A represents the content image, while the smaller picture represents the style image — both of which are synthesised to produce composite image B.

To see how the model figures out what “content” & “style” are, we’d first have to understand the model itself. Gatys et al. used the VGG-19 architecture, a type of Convolutional Neural Network (CNN), which is itself a subset of the Deep Neural Networks we mentioned earlier.

Convolutional Neural Networks

CNNs can be thought of as self-assembling robots which train themselves to get really good at image recognition. Sounds like a weird analogy? It probably is — but I couldn’t think of a better one and we’re getting off-track.

This CNN is a model composed of multiple convolutional layers in the same way that a robot is a structure composed of a bunch of modular components. (There’s other stuff too like non-linearity & max pooling, but that’s non-essential right now.)

VGG-19 Architecture — A type of CNN
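
To make the “modular components” idea a little more concrete, here’s a minimal sketch (in PyTorch, which is my choice here rather than anything the original work prescribes) of what a single block of convolutional layers, non-linearities & max pooling might look like. VGG-19 is essentially a much deeper stack of blocks like this one.

```python
import torch.nn as nn

# One toy "block": convolutional layers interleaved with non-linearities,
# followed by max pooling -- the modular components a CNN is assembled from.
toy_block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1),   # image filters
    nn.ReLU(inplace=True),                                                  # non-linearity
    nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),                                  # max pooling
)
```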

At each convolutional layer, multiple image filters perform convolutions across the input image, each outputting a feature map. Image filters can be thought of as feature detectors, with each image filter being trained to extract a specific feature from the input image.

The images with a grey background represent the feature maps of each image filter (i.e. the representation of the colorised images) — We can see how each image filter has detected a specific feature (e.g. lines, yellow, stripes, curves)

Convolutions can be simply understood as a mathematical operation in which an image filter is slid across the input image; at each position, the values of the filter and the overlapping patch of the image are multiplied element-wise and summed to obtain a single value.

The image filter is being slid across the input image, outputting the convolved feature — the feature map!

And finally, these values come together to form the feature map: a representation of the feature detected by that image filter.
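
For the curious, here’s a tiny sketch of that sliding-and-multiplying operation in plain NumPy. The image and filter values are made up purely for illustration (real CNNs learn their filter values during training), and strictly speaking CNNs compute a cross-correlation, since the filter isn’t flipped before sliding.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` across `image`; at each position, multiply the overlapping
    values element-wise and sum them. The grid of results is the feature map."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    feature_map = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(feature_map.shape[0]):
        for x in range(feature_map.shape[1]):
            patch = image[y:y + kh, x:x + kw]
            feature_map[y, x] = np.sum(patch * kernel)
    return feature_map

# A toy 5x5 image with a vertical edge down the middle, and a filter that
# responds strongly wherever that edge appears.
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)
edge_filter = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]], dtype=float)
print(convolve2d(image, edge_filter))  # large magnitudes mark where the edge sits
```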

Interestingly, as the number of convolutional layers increases, the feature maps (i.e. representations of the image) produced become increasingly high-level and complex. For instance, lower convolutional layers tend to detect simple lines & edges, while higher layers detect composite objects & faces. We see this in the image below, where reconstructions of the input image from lower-level feature maps are more precise than reconstructions from higher-level feature maps.

This occurs because lower convolutional layers, which output feature maps of simpler features, correspond almost precisely to the exact pixel values of the original image. In contrast, higher layers in these networks tend to care more about the content of the image itself, capturing the high-level details in these images instead.
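
As a rough sketch of how one might peek at these feature maps in practice, here’s how the pretrained VGG-19 that ships with torchvision could be used to collect activations at a few layers. The specific layer indices are my own illustrative choices, not something prescribed by the paper.

```python
import torch
from torchvision import models

# Pretrained VGG-19; we only need its convolutional "features" stack.
vgg = models.vgg19(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad = False  # the network itself is never trained here

def get_feature_maps(image, layer_indices):
    """Run a (1, 3, H, W) image tensor through VGG-19 and collect the feature
    maps produced at the requested layer indices."""
    feats = {}
    x = image
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in layer_indices:
            feats[i] = x
    return feats

# Lower indices capture simple edges & colours; higher indices capture
# increasingly abstract, content-level structure.
dummy_image = torch.randn(1, 3, 224, 224)  # placeholder for a preprocessed photo
feature_maps = get_feature_maps(dummy_image, layer_indices={0, 5, 10, 19, 28})
```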

Now we’ve seen how feature maps encode the “content” of the image, relating mostly to its shape & positioning. This links back to part (ii) of our original objective: how do we actually compute these values to minimise content loss?

Content Loss

After we extract these values from the feature maps of corresponding layers in the content image & generated image, they’re plugged into the following loss function.

The loss function takes in 3 inputs — C (content image), G (generated image) and L, the layer at which the feature maps are compared. aL(C) therefore denotes the activations (aka feature maps) of the content image at layer L, while aL(G) represents the corresponding activations of the generated image.

We calculate the squared Euclidean distance between these two sets of activations, essentially taking the mean squared error between them. This function is especially useful because minimising it during the later gradient descent steps preserves the content of the target image!
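
In code, the content loss really is as simple as it sounds. A minimal sketch, assuming the two activation tensors have already been extracted at the chosen layer (for example with a helper like the get_feature_maps sketch above):

```python
import torch.nn.functional as F

def content_loss(content_activations, generated_activations):
    """Mean squared error between the content image's activations and the
    generated image's activations at the same layer."""
    return F.mse_loss(generated_activations, content_activations)
```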


Style Loss

Finally, this brings us to encoding the nebulous concept that is style. Unlike “content loss”, we can’t simply compare the raw values of the feature maps. Instead, we have to use something called the gram matrix. (Vastly different from a similarly named English biscuit. Hah.)

The gram matrix is interesting because it essentially allows us to find the correlation between features across different feature maps within the same layer. Put another way, it’s encoding the “overlap” between different features by matching the distribution of specific features. This therefore captures the tendency of features to co-occur in the image, instead of evaluating the presence of these features themselves.

We obtain the gram matrix by flattening each feature map in a layer into a vector and computing the dot product between every pair of these vectors. This basically means multiplying the matrix of flattened feature maps against its own transposed (flipped on its side) copy, to obtain the output — the gram matrix!
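
A minimal sketch of that computation, assuming the feature maps of a single layer arrive as a (channels, height, width) tensor (some implementations also normalise the result by the number of elements; that detail is omitted here):

```python
import torch

def gram_matrix(feature_maps):
    """feature_maps: (C, H, W) tensor holding one layer's C feature maps.
    Flatten each feature map into a row vector, then multiply the resulting
    matrix by its own transpose. Entry (i, j) measures how strongly feature
    maps i and j co-occur across the image."""
    c, h, w = feature_maps.shape
    flat = feature_maps.view(c, h * w)  # one row per feature map
    return flat @ flat.t()              # (C, C) gram matrix
```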

Once we’ve gotten the respective gram matrices of the style image & the generated image, we’re able to compute the style loss within a single layer.

It looks a little confusing, but it’s really quite simple. Nl represents the number of feature maps we’re deriving gram matrices from in layer l, while Ml represents the size (height × width) of each of these feature maps. We then compute the mean squared error as we did with “content loss” and we’re done!

To gain a better representation of the desired style, we then compute the above function across multiple layers of the CNN. This all comes together in our Style Loss function, where we take the sum of all the individual loss functions across individual layers.
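
Putting the previous two ideas together, a sketch of the summed style loss might look like the following. It reuses the hypothetical gram_matrix helper from above and applies the 1/(4·Nl²·Ml²) normalisation from the paper; equal layer weights are assumed for simplicity.

```python
def style_loss(style_feats, generated_feats):
    """Sum the per-layer style losses over all chosen style layers.
    Both arguments are dicts mapping a layer index to a (C, H, W) tensor."""
    total = 0.0
    for layer, s in style_feats.items():
        g = generated_feats[layer]
        c, h, w = s.shape                        # Nl = c feature maps, Ml = h * w values each
        gram_diff = gram_matrix(g) - gram_matrix(s)
        # Squared error between gram matrices, scaled by 1 / (4 * Nl^2 * Ml^2)
        total = total + (gram_diff ** 2).sum() / (4 * (c ** 2) * ((h * w) ** 2))
    return total
```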

Loss & Back Propagation

The final loss function we use during our image synthesis is the sum of weighted content & style functions.

Since both are functions of the generated image’s pixels, they’re essentially “competing”, and will hence never both be completely optimised. To manage this trade-off, we use the hyperparameters α & β to adjust how much the synthesised image emphasises content versus style.

In this context p represents the content image, a represents the style image and x represents the generated image.
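
As a sketch, the combined loss is just that weighted sum. The α and β defaults below are placeholders; in practice it’s the ratio between them that gets tuned.

```python
def total_loss(c_loss, s_loss, alpha=1.0, beta=1e3):
    """Weighted sum of content and style losses; the alpha/beta ratio decides
    whether the result leans towards the content image or the style image."""
    return alpha * c_loss + beta * s_loss
```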

Once we’ve obtained our loss function, we then use back-propagation to iteratively optimise the “generated image”, with the objective of matching the generated image’s CNN features to the desired ones. Basically, this means minimising “content loss” (ensuring that most of the shape & positioning from the content image remains intact) while minimising “style loss” (largely recomposing the image in the desired style).

A useful visualisation of how the model optimises the generated image through back-propagation!

Unlike typical neural networks, which use back-propagation to adjust the weights & biases within the network itself, the NST algorithm optimises the pixel values of the generated image instead, until the overall loss function is sufficiently reduced. The model therefore takes 3 inputs — the content image, the style image, and the generated image!
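
Stitching the earlier sketches together, the optimisation loop might look roughly like this. Everything here (content_image, get_feature_maps, the loss helpers, the layer choices, the optimiser and iteration count) comes from the hypothetical snippets above or is an illustrative assumption; content_feats and style_feats are assumed to have been precomputed from the content & style images with the same helper.

```python
import torch

# The generated image's pixels are the only parameters being optimised;
# the pretrained VGG-19 weights stay frozen the entire time.
generated = content_image.clone().requires_grad_(True)  # start from the content image
optimizer = torch.optim.Adam([generated], lr=0.01)       # L-BFGS is another common choice

content_layer = 21                  # conv4_2, a common choice for content (illustrative)
style_layers = {0, 5, 10, 19, 28}   # conv1_1 .. conv5_1, common choices for style (illustrative)

for step in range(300):             # iteration count is illustrative
    optimizer.zero_grad()
    gen_feats = get_feature_maps(generated, style_layers | {content_layer})
    c_loss = content_loss(content_feats[content_layer], gen_feats[content_layer])
    s_loss = style_loss({l: style_feats[l][0] for l in style_layers},
                        {l: gen_feats[l][0] for l in style_layers})
    loss = total_loss(c_loss, s_loss)
    loss.backward()                 # gradients flow back to the image's pixels
    optimizer.step()                # nudge the pixels to reduce the combined loss
```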

Conclusion

And voila, we’re done! It’s been incredibly cool seeing all the research flood into the field following the publication of this paper — from the faster, optimised models out of Dr Fei-Fei Li’s lab to Google’s crazy psychedelic experience that is Deep Dream.

Synthetic images generated from Deep Dream

A particularly interesting use case is style transfer on videos, which has been shown to be super effective and can even be done live!

A Two Minute Papers video on video-based style transfer

There’s obviously still room to improve the stylised quality across a range of styles & content — for instance, the NST model does particularly well when transferring the style of paintings but doesn’t perform nearly as effectively on pixelated or photorealistic images.

Even so, I think it’s an incredibly cool field with a ton of cool applications, and it raises interesting questions about our understanding of creativity, art, and, perhaps more confusingly, the intrinsic value behind it.
