Content-Based Filtering Using Neural Networks¶

This notebook relies on files created in the content_based_preproc.ipynb notebook. Be sure to run the code in there before completing this notebook.
Also, we'll be using the python3 kernel from here on out so don't forget to change the kernel if it's still Python2.

This lab illustrates:

how to build feature columns for a model using tf.feature_column
how to create custom evaluation metrics and add them to Tensorboard
how to train a model and make predictions with the saved model

Tensorflow Hub should already be installed. You can check that it is by using "pip freeze".

%%bash
pip freeze | grep tensor

tensorboard==1.13.1
tensorflow==1.13.1
tensorflow-data-validation==0.21.5
tensorflow-datasets==1.2.0
tensorflow-estimator==1.13.0
tensorflow-hub==0.4.0
tensorflow-io==0.8.1
tensorflow-metadata==0.21.2
tensorflow-model-analysis==0.21.6
tensorflow-probability==0.8.0
tensorflow-serving-api==1.15.0
tensorflow-transform==0.21.2

Let's make sure we install the necessary version of tensorflow-hub. After doing the pip install below, click "Reset Session" on the notebook so that the Python environment picks up the new packages.

!pip3 install tensorflow-hub==0.4.0
!pip3 install --upgrade tensorflow==1.13.1

Requirement already satisfied: tensorflow-hub==0.4.0 in /opt/conda/lib/python3.7/site-packages (0.4.0)
Requirement already satisfied: protobuf>=3.4.0 in /opt/conda/lib/python3.7/site-packages (from tensorflow-hub==0.4.0) (3.11.4)
Requirement already satisfied: six>=1.10.0 in /opt/conda/lib/python3.7/site-packages (from tensorflow-hub==0.4.0) (1.14.0)
Requirement already satisfied: numpy>=1.12.0 in /opt/conda/lib/python3.7/site-packages (from tensorflow-hub==0.4.0) (1.18.3)
Requirement already satisfied: setuptools in /opt/conda/lib/python3.7/site-packages (from protobuf>=3.4.0->tensorflow-hub==0.4.0) (46.1.3.post20200325)
Requirement already up-to-date: tensorflow==1.13.1 in /opt/conda/lib/python3.7/site-packages (1.13.1)
Requirement already satisfied, skipping upgrade: tensorboard<1.14.0,>=1.13.0 in /opt/conda/lib/python3.7/site-packages (from tensorflow==1.13.1) (1.13.1)
Requirement already satisfied, skipping upgrade: absl-py>=0.1.6 in /opt/conda/lib/python3.7/site-packages (from tensorflow==1.13.1) (0.8.1)
Requirement already satisfied, skipping upgrade: keras-preprocessing>=1.0.5 in /opt/conda/lib/python3.7/site-packages (from tensorflow==1.13.1) (1.1.0)
Requirement already satisfied, skipping upgrade: protobuf>=3.6.1 in /opt/conda/lib/python3.7/site-packages (from tensorflow==1.13.1) (3.11.4)
Requirement already satisfied, skipping upgrade: gast>=0.2.0 in /opt/conda/lib/python3.7/site-packages (from tensorflow==1.13.1) (0.2.2)
Requirement already satisfied, skipping upgrade: grpcio>=1.8.6 in /opt/conda/lib/python3.7/site-packages (from tensorflow==1.13.1) (1.28.1)
Requirement already satisfied, skipping upgrade: termcolor>=1.1.0 in /opt/conda/lib/python3.7/site-packages (from tensorflow==1.13.1) (1.1.0)
Requirement already satisfied, skipping upgrade: keras-applications>=1.0.6 in /opt/conda/lib/python3.7/site-packages (from tensorflow==1.13.1) (1.0.8)
Requirement already satisfied, skipping upgrade: astor>=0.6.0 in /opt/conda/lib/python3.7/site-packages (from tensorflow==1.13.1) (0.8.1)
Requirement already satisfied, skipping upgrade: tensorflow-estimator<1.14.0rc0,>=1.13.0 in /opt/conda/lib/python3.7/site-packages (from tensorflow==1.13.1) (1.13.0)
Requirement already satisfied, skipping upgrade: numpy>=1.13.3 in /opt/conda/lib/python3.7/site-packages (from tensorflow==1.13.1) (1.18.3)
Requirement already satisfied, skipping upgrade: wheel>=0.26 in /opt/conda/lib/python3.7/site-packages (from tensorflow==1.13.1) (0.34.2)
Requirement already satisfied, skipping upgrade: six>=1.10.0 in /opt/conda/lib/python3.7/site-packages (from tensorflow==1.13.1) (1.14.0)
Requirement already satisfied, skipping upgrade: markdown>=2.6.8 in /opt/conda/lib/python3.7/site-packages (from tensorboard<1.14.0,>=1.13.0->tensorflow==1.13.1) (3.2.1)
Requirement already satisfied, skipping upgrade: werkzeug>=0.11.15 in /opt/conda/lib/python3.7/site-packages (from tensorboard<1.14.0,>=1.13.0->tensorflow==1.13.1) (1.0.1)
Requirement already satisfied, skipping upgrade: setuptools in /opt/conda/lib/python3.7/site-packages (from protobuf>=3.6.1->tensorflow==1.13.1) (46.1.3.post20200325)
Requirement already satisfied, skipping upgrade: h5py in /opt/conda/lib/python3.7/site-packages (from keras-applications>=1.0.6->tensorflow==1.13.1) (2.10.0)
Requirement already satisfied, skipping upgrade: mock>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from tensorflow-estimator<1.14.0rc0,>=1.13.0->tensorflow==1.13.1) (3.0.5)

import os
import tensorflow as tf
import numpy as np
import tensorflow_hub as hub
import shutil

PROJECT = 'my-project' # REPLACE WITH YOUR PROJECT ID
BUCKET = 'my-bucket' # REPLACE WITH YOUR BUCKET NAME
REGION = 'us-west1' # REPLACE WITH YOUR BUCKET REGION e.g. us-central1

# do not change these
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = '1.13.1'

%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

Updated property [core/project].
Updated property [compute/region].

Build the feature columns for the model.¶

To start, we'll load the list of categories, authors and article ids we created in the previous Create Datasets notebook.

categories_list = open("categories.txt").read().splitlines()
authors_list = open("authors.txt").read().splitlines()
content_ids_list = open("content_ids.txt").read().splitlines()
mean_months_since_epoch = 523

In the cell below we'll define the feature columns to use in our model. If necessary, remind yourself the various feature columns to use.
For the embedded_title_column feature column, use a Tensorflow Hub Module to create an embedding of the article title. Since the articles and titles are in German, you'll want to use a German language embedding module.
Explore the text embedding Tensorflow Hub modules available here. Filter by setting the language to 'German'. The 50 dimensional embedding should be sufficient for our purposes.

embedded_title_column = hub.text_embedding_column(
    key="title", 
    module_spec="https://tfhub.dev/google/nnlm-de-dim50/1",
    trainable=False)

content_id_column = tf.feature_column.categorical_column_with_hash_bucket(
    key="content_id",
    hash_bucket_size= len(content_ids_list) + 1)
embedded_content_column = tf.feature_column.embedding_column(
    categorical_column=content_id_column,
    dimension=10)

author_column = tf.feature_column.categorical_column_with_hash_bucket(key="author",
    hash_bucket_size=len(authors_list) + 1)
embedded_author_column = tf.feature_column.embedding_column(
    categorical_column=author_column,
    dimension=3)

category_column_categorical = tf.feature_column.categorical_column_with_vocabulary_list(
    key="category",
    vocabulary_list=categories_list,
    num_oov_buckets=1)
category_column = tf.feature_column.indicator_column(category_column_categorical)

months_since_epoch_boundaries = list(range(400,700,20))
months_since_epoch_column = tf.feature_column.numeric_column(
    key="months_since_epoch")
months_since_epoch_bucketized = tf.feature_column.bucketized_column(
    source_column = months_since_epoch_column,
    boundaries = months_since_epoch_boundaries)

crossed_months_since_category_column = tf.feature_column.indicator_column(tf.feature_column.crossed_column(
  keys = [category_column_categorical, months_since_epoch_bucketized], 
  hash_bucket_size = len(months_since_epoch_boundaries) * (len(categories_list) + 1)))

feature_columns = [embedded_content_column,
                   embedded_author_column,
                   category_column,
                   embedded_title_column,
                   crossed_months_since_category_column]

Create the input function.¶

Next we'll create the input function for our model. This input function reads the data from the csv files we created in the previous labs.

record_defaults = [["Unknown"], ["Unknown"],["Unknown"],["Unknown"],["Unknown"],[mean_months_since_epoch],["Unknown"]]
column_keys = ["visitor_id", "content_id", "category", "title", "author", "months_since_epoch", "next_content_id"]
label_key = "next_content_id"
def read_dataset(filename, mode, batch_size = 512):
  def _input_fn():
      def decode_csv(value_column):
          columns = tf.decode_csv(value_column,record_defaults=record_defaults)
          features = dict(zip(column_keys, columns))          
          label = features.pop(label_key)         
          return features, label

      # Create list of files that match pattern
      file_list = tf.gfile.Glob(filename)

      # Create dataset from file list
      dataset = tf.data.TextLineDataset(file_list).map(decode_csv)

      if mode == tf.estimator.ModeKeys.TRAIN:
          num_epochs = None # indefinitely
          dataset = dataset.shuffle(buffer_size = 10 * batch_size)
      else:
          num_epochs = 1 # end-of-input after this

      dataset = dataset.repeat(num_epochs).batch(batch_size)
      return dataset.make_one_shot_iterator().get_next()
  return _input_fn

Create the model and train/evaluate¶

Next, we'll build our model which recommends an article for a visitor to the Kurier.at website. Look through the code below. We use the input_layer feature column to create the dense input layer to our network. This is just a sigle layer network where we can adjust the number of hidden units as a parameter.

Currently, we compute the accuracy between our predicted 'next article' and the actual 'next article' read next by the visitor. We'll also add an additional performance metric of top 10 accuracy to assess our model. To accomplish this, we compute the top 10 accuracy metric, add it to the metrics dictionary below and add it to the tf.summary so that this value is reported to Tensorboard as well.

def model_fn(features, labels, mode, params):
  net = tf.feature_column.input_layer(features, params['feature_columns'])
  for units in params['hidden_units']:
        net = tf.layers.dense(net, units=units, activation=tf.nn.relu)
   # Compute logits (1 per class).
  logits = tf.layers.dense(net, params['n_classes'], activation=None) 

  predicted_classes = tf.argmax(logits, 1)
  from tensorflow.python.lib.io import file_io
    
  with file_io.FileIO('content_ids.txt', mode='r') as ifp:
    content = tf.constant([x.rstrip() for x in ifp])
  predicted_class_names = tf.gather(content, predicted_classes)
  if mode == tf.estimator.ModeKeys.PREDICT:
    predictions = {
        'class_ids': predicted_classes[:, tf.newaxis],
        'class_names' : predicted_class_names[:, tf.newaxis],
        'probabilities': tf.nn.softmax(logits),
        'logits': logits,
    }
    return tf.estimator.EstimatorSpec(mode, predictions=predictions)
  table = tf.contrib.lookup.index_table_from_file(vocabulary_file="content_ids.txt")
  labels = table.lookup(labels)
  # Compute loss.
  loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

  # Compute evaluation metrics.
  accuracy = tf.metrics.accuracy(labels=labels,
                                 predictions=predicted_classes,
                                 name='acc_op')
  top_10_accuracy = tf.metrics.mean(tf.nn.in_top_k(predictions=logits, 
                                                   targets=labels, 
                                                   k=10))
  
  metrics = {
    'accuracy': accuracy,
    'top_10_accuracy' : top_10_accuracy}
  
  tf.summary.scalar('accuracy', accuracy[1])
  tf.summary.scalar('top_10_accuracy', top_10_accuracy[1])

  if mode == tf.estimator.ModeKeys.EVAL:
      return tf.estimator.EstimatorSpec(
          mode, loss=loss, eval_metric_ops=metrics)

  # Create training op.
  assert mode == tf.estimator.ModeKeys.TRAIN

  optimizer = tf.train.AdagradOptimizer(learning_rate=0.1)
  train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
  return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

Train and Evaluate¶

outdir = 'content_based_model_trained'
shutil.rmtree(outdir, ignore_errors = True) # start fresh each time
tf.summary.FileWriterCache.clear() # ensure filewriter cache is clear for TensorBoard events file
estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir = outdir,
    params={
     'feature_columns': feature_columns,
      'hidden_units': [200, 100, 50],
      'n_classes': len(content_ids_list)
    })

train_spec = tf.estimator.TrainSpec(
    input_fn = read_dataset("training_set.csv", tf.estimator.ModeKeys.TRAIN),
    max_steps = 2000)

eval_spec = tf.estimator.EvalSpec(
    input_fn = read_dataset("test_set.csv", tf.estimator.ModeKeys.EVAL),
    steps = None,
    start_delay_secs = 30,
    throttle_secs = 60)

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

INFO:tensorflow:Using default config.

INFO:tensorflow:Using default config.

INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_global_id_in_cluster': 0, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f82c82af710>, '_model_dir': 'content_based_model_trained', '_protocol': None, '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_device_fn': None, '_session_creation_timeout_secs': 7200, '_experimental_distribute': None, '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_experimental_max_worker_delay_secs': None, '_evaluation_master': '', '_eval_distribute': None, '_train_distribute': None, '_master': ''}

INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_global_id_in_cluster': 0, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f82c82af710>, '_model_dir': 'content_based_model_trained', '_protocol': None, '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_tf_random_seed': None, '_save_summary_steps': 100, '_device_fn': None, '_session_creation_timeout_secs': 7200, '_experimental_distribute': None, '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_experimental_max_worker_delay_secs': None, '_evaluation_master': '', '_eval_distribute': None, '_train_distribute': None, '_master': ''}

INFO:tensorflow:Not using Distribute Coordinator.

INFO:tensorflow:Not using Distribute Coordinator.

INFO:tensorflow:Running training and evaluation locally (non-distributed).

INFO:tensorflow:Running training and evaluation locally (non-distributed).

INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps None or save_checkpoints_secs 600.

INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps None or save_checkpoints_secs 600.

This takes a while to complete but in the end, I get about 30% top 10 accuracy.

Make predictions with the trained model.¶

With the model now trained, we can make predictions by calling the predict method on the estimator. Let's look at how our model predicts on the first five examples of the training set.
To start, we'll create a new file 'first_5.csv' which contains the first five elements of our training set. We'll also save the target values to a file 'first_5_content_ids' so we can compare our results.

%%bash
head -5 training_set.csv > first_5.csv
head first_5.csv
awk -F "\"*,\"*" '{print $2}' first_5.csv > first_5_content_ids

1000148716229112932,299913368,News,U4-Störung legt Wiener Frühverkehr lahm ,Yvonne Widler,574,299931241
1000148716229112932,299931241,Stars & Kultur,Regisseur Michael Haneke kritisiert Flüchtlingspolitik,,574,299913879
1000360453832106474,299925700,Lifestyle,Nach Tod von Vater: Tochter bekommt jedes Jahr Blumen,Marlene Patsalidis,574,299922662
1000360453832106474,299922662,Lifestyle,Australischer Fernsehstar rechtfertigt Sexismus mit Asperger-Syndrom,Marlene Patsalidis,574,299826775
1001846185946874596,299930679,News,Wintereinbruch naht: Erster Schnee im Osten möglich,Daniela Wahl,574,299930679

Recall, to make predictions on the trained model we pass a list of examples through the input function. Complete the code below to make predicitons on the examples contained in the "first_5.csv" file we created above.

output = list(estimator.predict(input_fn=read_dataset("first_5.csv", tf.estimator.ModeKeys.PREDICT)))

INFO:tensorflow:Calling model_fn.

INFO:tensorflow:Calling model_fn.

INFO:tensorflow:Saver not created because there are no variables in the graph to restore

INFO:tensorflow:Saver not created because there are no variables in the graph to restore

INFO:tensorflow:Saver not created because there are no variables in the graph to restore

INFO:tensorflow:Saver not created because there are no variables in the graph to restore

INFO:tensorflow:Done calling model_fn.

INFO:tensorflow:Done calling model_fn.

INFO:tensorflow:Graph was finalized.

INFO:tensorflow:Graph was finalized.

import numpy as np
recommended_content_ids = [np.asscalar(d["class_names"]).decode('UTF-8') for d in output]
content_ids = open("first_5_content_ids").read().splitlines()

Finally, we map the content id back to the article title. Let's compare our model's recommendation for the first example. This can be done in BigQuery. Look through the query below and make sure it is clear what is being returned.

from google.cloud import bigquery
recommended_title_sql="""
#standardSQL
SELECT
(SELECT MAX(IF(index=6, value, NULL)) FROM UNNEST(hits.customDimensions)) AS title
FROM `cloud-training-demos.GA360_test.ga_sessions_sample`,   
  UNNEST(hits) AS hits
WHERE 
  # only include hits on pages
  hits.type = "PAGE"
  AND (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) = \"{}\"
LIMIT 1""".format(recommended_content_ids[0])

current_title_sql="""
#standardSQL
SELECT
(SELECT MAX(IF(index=6, value, NULL)) FROM UNNEST(hits.customDimensions)) AS title
FROM `cloud-training-demos.GA360_test.ga_sessions_sample`,   
  UNNEST(hits) AS hits
WHERE 
  # only include hits on pages
  hits.type = "PAGE"
  AND (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) = \"{}\"
LIMIT 1""".format(content_ids[0])
recommended_title = bigquery.Client().query(recommended_title_sql).to_dataframe()['title'].tolist()[0].encode('utf-8').strip()
current_title = bigquery.Client().query(current_title_sql).to_dataframe()['title'].tolist()[0].encode('utf-8').strip()
print("Current title: {} ".format(current_title))
print("Recommended title: {}".format(recommended_title))

Current title: U4-Störung legt Wiener Frühverkehr lahm 
Recommended title: Fahnenskandal von Mailand: Die Austria zeigt Flagge