Neural style transfer uses Convolutional Neural Networks (CNNs) to apply the stylistic elements of one image to the content of another, producing a new artistic image that combines the style of the first image with the content of the second.
So let's say we have a content image C and a style image S, and we combine these images to generate an image called G.
The TensorFlow website has a good Neural Style Transfer tutorial explaining this, and the images below are taken from that tutorial.
Content Image + Style Image = Generated Image
Neural Style Transfer (NST) builds on a pre-trained convolutional neural network. This approach, which involves applying a network trained for one task to a different task, is known as transfer learning.
For our purposes, we can utilize VGG-19, a 19-layer version of the VGG network. This model has been trained on the extensive ImageNet database and has learned to recognize a wide range of both low-level features (found in the shallower layers) and high-level features (identified in the deeper layers).
The cost function used for this algorithm is defined as
\(J(G) = \alpha J_{content}(C,G) + \beta J_{style}(S,G)\) where \(\alpha\) and \(\beta\) are hyperparameters and where
\(J_{content}(C,G)\) is the content cost function
\(J_{style}(S,G)\) is the style cost function
Content Cost Function
The goal of the content cost function is to make the content of the generated image G match the content of image C.
It is calculated as a squared L2 norm: the sum of squared differences over all activations in a chosen hidden layer. It is normally best to choose a layer from the middle of the network, which captures content at an intermediate level of abstraction.
Let's assume that after the convolution in layer \(l\) of the CNN we get a 3D volume, where each block represents a filter's activations for the content or generated image, \(a^{(C)}\) or \(a^{(G)}\), and \(n_H\), \(n_W\) and \(n_C\) are the height, width and number of channels of the hidden layer we have chosen.
The content cost function is then calculated as
\(J_{content}(C,G) = \frac{1}{4 \times n_H \times n_W \times n_C}\sum _{ \text{all entries}} (a^{(C)} - a^{(G)})^2\tag{1}\)
Style Cost Function
For the style cost function \(J_{style}(S,G)\), the goal is to minimize the distance between the Gram matrix of the "style" image S and the Gram matrix of the "generated" image G.
In linear algebra, the Gram matrix G of a set of vectors \((v_{1},\dots ,v_{n})\) is the matrix of dot products, whose entries are \(G_{ij} = v_{i}^T v_{j}\)
The Gram matrix entry \(G_{ij}\) quantifies the similarity (correlation) between the activations of filter \(i\) and those of filter \(j\), while the diagonal entry \(G_{ii}\) measures how active filter \(i\) is across the image as a whole. For example, if filter \(i\) detects vertical textures, a large \(G_{ii}\) indicates that the image contains a significant amount of vertical texture.
The style matrix \(G\) thus quantifies the style of an image by capturing both how prevalent each type of feature is (\(G_{ii}\)) and how much different features occur together (\(G_{ij}\)).
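To make this concrete, here is a minimal sketch of the Gram matrix computation in TensorFlow; the tiny activation matrix is a made-up example, and the same computation reappears inside compute_layer_style_cost below.
import tensorflow as tf

def gram_matrix(A):
    # A holds unrolled filter activations with shape (n_C, n_H * n_W);
    # the Gram matrix is A times its transpose, with shape (n_C, n_C)
    return tf.linalg.matmul(A, tf.transpose(A))

# Made-up activations: 3 filters, 4 spatial positions each
A = tf.constant([[1., 0., 1., 0.],
                 [0., 1., 0., 1.],
                 [2., 2., 2., 2.]])
print(gram_matrix(A))  # entry (i, j) is the dot product of filters i and j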
For a hidden layer \(l\) the style cost for this layer is given as
\(J_{style}^{[l]}(S,G) = \frac{1}{4 \times {n_C}^2 \times (n_H \times n_W)^2} \sum _{i=1}^{n_C}\sum_{j=1}^{n_C}(G^{(S)}_{(gram)i,j} - G^{(G)}_{(gram)i,j})^2\tag{2}\)
\(G_{gram}^{(S)}\) is Gram matrix of the "style" image.
\(G_{gram}^{(G)}\) is the Gram matrix of the "generated" image.
The cost is calculated using the activations from a specific hidden layer in the network, denoted as \(a^{[l]}\)
We combine the style costs for the different layers as
\(J_{style}(S,G) = \sum_{l} \lambda^{[l]} J^{[l]}_{style}(S,G)\)
Let's implement this using Keras and TensorFlow.
Load the libraries
import os
import sys
import scipy.io
import scipy.misc
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow
from PIL import Image
import numpy as np
import tensorflow as tf
import pprint
%matplotlib inline
Load parameters from VGG19 pre-trained model
As we are going to use transfer learning, we will load the pre-trained VGG19 model with its weights.
tf.random.set_seed(272)
img_size = 400
vgg = tf.keras.applications.VGG19(include_top=False, input_shape=(img_size, img_size, 3), weights='imagenet')
vgg.trainable = False
Let's check the layers loaded with this VGG19 model:
for layer in vgg.layers:
    print(layer.name)
input_1
block1_conv1
block1_conv2
block1_pool
block2_conv1
block2_conv2
block2_pool
block3_conv1
block3_conv2
block3_conv3
block3_conv4
block3_pool
block4_conv1
block4_conv2
block4_conv3
block4_conv4
block4_pool
block5_conv1
block5_conv2
block5_conv3
block5_conv4
block5_pool
Let's choose the layers that will represent the style of the image and assign each a style cost weight; more on this later.
STYLE_LAYERS = [
    ('block1_conv1', 0.2),
    ('block2_conv1', 0.2),
    ('block3_conv1', 0.2),
    ('block4_conv1', 0.2),
    ('block5_conv1', 0.2)]
We will define a function that takes the VGG19 model and returns a model producing a list of outputs for the chosen intermediate layers.
def get_layer_outputs(vgg, layer_names):
    """Creates a vgg model that returns a list of intermediate output values."""
    outputs = [vgg.get_layer(layer[0]).output for layer in layer_names]
    model = tf.keras.Model([vgg.input], outputs)
    return model
We will use the ‘block5_conv4’ layer as the content layer
content_layer = [('block5_conv4', 1)]
and build the model.
vgg_model_outputs = get_layer_outputs(vgg, STYLE_LAYERS + content_layer)
Let's validate the output layers:
print(vgg_model_outputs.outputs)
[<KerasTensor: shape=(None, 400, 400, 64) dtype=float32 (created by layer 'block1_conv1')>,
<KerasTensor: shape=(None, 200, 200, 128) dtype=float32 (created by layer 'block2_conv1')>,
<KerasTensor: shape=(None, 100, 100, 256) dtype=float32 (created by layer 'block3_conv1')>,
<KerasTensor: shape=(None, 50, 50, 512) dtype=float32 (created by layer 'block4_conv1')>,
<KerasTensor: shape=(None, 25, 25, 512) dtype=float32 (created by layer 'block5_conv1')>,
<KerasTensor: shape=(None, 25, 25, 512) dtype=float32 (created by layer 'block5_conv4')>]
Calculate the Content Cost function
The content and style images used in this example are loaded as shown below.
content_path = tf.keras.utils.get_file('YellowLabradorLooking_new.jpg', 'https://storage.googleapis.com/download.tensorflow.org/example_images/YellowLabradorLooking_new.jpg')
style_path = tf.keras.utils.get_file('kandinsky5.jpg','https://storage.googleapis.com/download.tensorflow.org/example_images/Vassily_Kandinsky%2C_1913_-_Composition_7.jpg')
content_image = np.array(Image.open(content_path).resize((img_size, img_size)))
print(content_image.shape)
content_image = tf.constant(np.reshape(content_image, ((1,) + content_image.shape)))
print(content_image.shape)
imshow(content_image[0])
plt.show()
style_image = np.array(Image.open(style_path).resize((img_size, img_size)))
style_image = tf.constant(np.reshape(style_image, ((1,) + style_image.shape)))
print(style_image.shape)
imshow(style_image[0])
plt.show()
Generated image to be used
Initialize the "generated" image as a noisy image created from the content_image. By initializing the pixels of the generated image to be mostly noise but slightly correlated with the content image, we help the content of the "generated" image match the content of the "content" image more rapidly.
generated_image = tf.Variable(tf.image.convert_image_dtype(content_image, tf.float32))
noise = tf.random.uniform(tf.shape(generated_image), -0.25, 0.25)
generated_image = tf.add(generated_image, noise)
generated_image = tf.clip_by_value(generated_image, clip_value_min=0.0, clip_value_max=1.0)
print(generated_image.shape)
imshow(generated_image.numpy()[0])
plt.show()
Content Cost is calculated as shown below
\(J_{content}(C,G) = \frac{1}{4 \times n_H \times n_W \times n_C}\sum _{ \text{all entries}} (a^{(C)} - a^{(G)})^2\tag{1}\)
\(a^{(C)}\) and \(a^{(G)}\) are the 3D volumes of a hidden layer's activations, representing the content of image C and image G, with height \(n_H\), width \(n_W\) and \(n_C\) channels.
To compute the content cost, we encode the content image using the appropriate hidden layer activations and assign this encoding to the variable a_C. We do the same for the generated image, setting the variable a_G to its corresponding hidden layer activations. For a_C we use the hidden layer named block5_conv4: a_C is the tensor of that layer's activations computed on the content image.
We forward propagate image C by setting it as the input to the pre-trained VGG network and running forward propagation.
preprocessed_content = tf.Variable(tf.image.convert_image_dtype(content_image, tf.float32))
a_C = vgg_model_outputs(preprocessed_content)
Similarly, we compute the style image encoding a_S by setting image S as the input to the pre-trained VGG network and running forward propagation: a_S is the tensor of hidden layer activations for STYLE_LAYERS, computed on our style image.
preprocessed_style = tf.Variable(tf.image.convert_image_dtype(style_image, tf.float32))
a_S = vgg_model_outputs(preprocessed_style)
Compute Content cost
Now we define a function to calculate the content cost:
def compute_content_cost(content_output, generated_output):
    # The content layer is the last element of the outputs list
    a_C = content_output[-1]
    a_G = generated_output[-1]
    # Retrieve dimensions from a_G
    _, n_H, n_W, n_C = a_G.get_shape().as_list()
    # Unroll 'a_C' and 'a_G' to shape (batch, n_H * n_W, n_C);
    # -1 lets TensorFlow infer the batch dimension
    a_C_unrolled = tf.reshape(a_C, shape=[-1, n_H * n_W, n_C])
    a_G_unrolled = tf.reshape(a_G, shape=[-1, n_H * n_W, n_C])
    # Sum of squared differences, scaled by 1 / (4 * n_H * n_W * n_C)
    J_content = (1 / (4 * n_H * n_W * n_C)) * tf.reduce_sum(tf.square(tf.subtract(a_C_unrolled, a_G_unrolled)))
    return J_content
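As a quick sanity check (with made-up activation shapes matching the block5_conv4 output above), the function should return a scalar, and identical activations should give a cost of zero:
tf.random.set_seed(1)
a_C_test = [tf.random.normal([1, 25, 25, 512])]
a_G_test = [tf.random.normal([1, 25, 25, 512])]
print(compute_content_cost(a_C_test, a_G_test))  # some positive scalar
print(compute_content_cost(a_C_test, a_C_test))  # 0.0 for identical inputs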
Compute the Style Cost
\(J_{style}^{[l]}(S,G) = \frac{1}{4 \times {n_C}^2 \times (n_H \times n_W)^2} \sum _{i=1}^{n_C}\sum_{j=1}^{n_C}(G^{(S)}_{(gram)i,j} - G^{(G)}_{(gram)i,j})^2\tag{2}\)
Style cost for a specific layer \(l\):
def compute_layer_style_cost(a_S, a_G):
    # Retrieve dimensions from a_G
    _, n_H, n_W, n_C = a_G.get_shape().as_list()
    # Reshape the tensors from (1, n_H, n_W, n_C) to (n_C, n_H * n_W)
    a_S = tf.transpose(tf.reshape(a_S, shape=[-1, n_C]))
    a_G = tf.transpose(tf.reshape(a_G, shape=[-1, n_C]))
    # Compute the Gram matrices for both images S and G
    GS = tf.linalg.matmul(a_S, tf.transpose(a_S))
    GG = tf.linalg.matmul(a_G, tf.transpose(a_G))
    # Compute the loss
    J_style_layer = (1 / (4 * n_C**2 * (n_H * n_W)**2)) * tf.reduce_sum(tf.square(tf.subtract(GS, GG)))
    return J_style_layer
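The same kind of sanity check works here, again with made-up activations: the layer style cost should be zero when both inputs are identical and positive otherwise:
a_S_test = tf.random.normal([1, 25, 25, 512])
print(compute_layer_style_cost(a_S_test, a_S_test))  # 0.0 for identical inputs
print(compute_layer_style_cost(a_S_test, tf.random.normal([1, 25, 25, 512])))  # > 0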
Style cost for all chosen layers
\(J_{style}(S,G) = \sum_{l} \lambda^{[l]} J^{[l]}_{style}(S,G)\)
where the values for \(\lambda^{[l]}\) are given in STYLE_LAYERS.
For each layer from STYLE_LAYERS:
Get the style of the style image "S" from the current layer.
Get the style of the generated image "G" from the current layer.
Compute the "style cost" for the current layer
Add the weighted style cost to the overall style cost (J_style)
def compute_style_cost(style_image_output, generated_image_output, STYLE_LAYERS=STYLE_LAYERS):
    # Initialize the overall style cost
    J_style = 0
    # Set a_S to the hidden layer activations from the layers we have selected.
    # The last element of the list is the content layer output, which must not be used.
    a_S = style_image_output[:-1]
    # Set a_G to the outputs of the chosen hidden layers, again excluding
    # the content layer output at the end of the list.
    a_G = generated_image_output[:-1]
    for i, weight in zip(range(len(a_S)), STYLE_LAYERS):
        # Compute the style cost for the current layer
        J_style_layer = compute_layer_style_cost(a_S[i], a_G[i])
        # Add weight * J_style_layer of this layer to the overall style cost
        J_style += weight[1] * J_style_layer
    return J_style
Total cost to optimize
\(J(G) = \alpha J_{content}(C,G) + \beta J_{style}(S,G)\)
@tf.function()
def total_cost(J_content, J_style, alpha = 10, beta = 40):
    J = alpha * J_content + beta * J_style
    return J
Now we will implement the train_step function, which uses the Adam optimizer to minimize the total cost J. We will use a learning rate of 0.01.
Within the tf.GradientTape():
We compute the encoding of the generated image using vgg_model_outputs and assign the result to a_G.
We compute the total cost J using the global variables a_C and a_S and the local a_G, with \(\alpha=10\) and \(\beta=40\).
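The train_step function below calls a clip_0_1 helper to keep pixel values in the valid [0, 1] range after each update, and the training loop further down calls tensor_to_image; neither is defined elsewhere in this post, so here is a minimal sketch of both, modeled on the helpers in the TensorFlow tutorial.
def clip_0_1(image):
    # Clamp every pixel of the generated image into [0, 1]
    return tf.clip_by_value(image, clip_value_min=0.0, clip_value_max=1.0)

def tensor_to_image(tensor):
    # Scale a float tensor in [0, 1] to [0, 255] and convert to a PIL Image
    tensor = tensor * 255
    tensor = np.array(tensor, dtype=np.uint8)
    if np.ndim(tensor) > 3:
        tensor = tensor[0]  # drop the batch dimension
    return Image.fromarray(tensor)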
optimizer = tf.keras.optimizers.legacy.Adam(learning_rate=0.01)
@tf.function()
def train_step(generated_image):
    with tf.GradientTape() as tape:
        # In this function we use the precomputed encodings a_S and a_C.
        # Compute a_G as the vgg_model_outputs for the current generated image
        a_G = vgg_model_outputs(generated_image)
        # Compute the style cost
        J_style = compute_style_cost(a_S, a_G)
        # Compute the content cost
        J_content = compute_content_cost(a_C, a_G)
        # Compute the total cost
        J = total_cost(J_content, J_style)
    grad = tape.gradient(J, generated_image)
    optimizer.apply_gradients([(grad, generated_image)])
    generated_image.assign(clip_0_1(generated_image))
    return J
Create a tf.Variable for the generated image, so the optimizer can update it during training:
generated_image = tf.Variable(generated_image)
Train the Model
epochs = 2500
for i in range(epochs):
    train_step(generated_image)
    if i % 500 == 0:
        print(f"Epoch {i} ")
        image = tensor_to_image(generated_image)
        imshow(image)
        image.save(f"output/image_{i}.jpg")
        plt.show()
# Show the 3 images in a row
fig = plt.figure(figsize=(16, 4))
ax = fig.add_subplot(1, 3, 1)
imshow(content_image[0])
ax.title.set_text('Content image')
ax = fig.add_subplot(1, 3, 2)
imshow(style_image[0])
ax.title.set_text('Style image')
ax = fig.add_subplot(1, 3, 3)
imshow(generated_image[0])
ax.title.set_text('Generated image')
plt.show()
Better neural style transfer images can be generated by tuning hyperparameters such as the number of epochs (e.g., 2500) and the learning rate. The result below was produced with 500 epochs.