Create datasets for the Content-based Filter

This notebook builds the data we will use for creating our content based model. We'll collect the data via a collection of SQL queries from the publicly avialable Kurier.at dataset in BigQuery. Kurier.at is an Austrian newsite. The goal of these labs is to recommend an article for a visitor to the site. In this lab we collect the data for training, in the subsequent notebook we train the recommender model.

This notebook illustrates

  • how to pull data from BigQuery table and write to local files
  • how to make reproducible train and test splits
In [1]:
import os
import tensorflow as tf
import numpy as np
from google.cloud import bigquery 

PROJECT = 'my-project' # REPLACE WITH YOUR PROJECT ID
BUCKET = 'my-bucket' # REPLACE WITH YOUR BUCKET NAME
REGION = 'us-central1' # REPLACE WITH YOUR BUCKET REGION e.g. us-central1

# do not change these
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = '1.8'
In [2]:
%%bash
gcloud  config  set project $PROJECT
gcloud config set compute/region $REGION
Updated property [core/project].
Updated property [compute/region].

We will use this helper funciton to write lists containing article ids, categories, and authors for each article in our database to local file.

In [3]:
def write_list_to_disk(my_list, filename):
  with open(filename, 'w') as f:
    for item in my_list:
        line = "%s\n" % item
        f.write(line.encode('utf8'))

Pull data from BigQuery

The cell below creates a local text file containing all the article ids (i.e. 'content ids') in the dataset.

Have a look at the original dataset in BigQuery. Then read through the query below and make sure you understand what it is doing.

In [4]:
sql="""
#standardSQL

SELECT  
  (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) AS content_id 
FROM `cloud-training-demos.GA360_test.ga_sessions_sample`,   
  UNNEST(hits) AS hits
WHERE 
  # only include hits on pages
  hits.type = "PAGE"
  AND (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) IS NOT NULL
GROUP BY
  content_id
  
"""

content_ids_list = bigquery.Client().query(sql).to_dataframe()['content_id'].tolist()
write_list_to_disk(content_ids_list, "content_ids.txt")
print("Some sample content IDs {}".format(content_ids_list[:3]))
print("The total number of articles is {}".format(len(content_ids_list)))
Some sample content IDs [u'299824032', u'299865757', u'299918857']
The total number of articles is 15634

There should be 15,634 articles in the database.
Next, we'll create a local file which contains a list of article categories and a list of article authors.

Note the change in the index when pulling the article category or author information. Also, we are using the first author of the article to create our author list.
Refer back to the original dataset, use the hits.customDimensions.index field to verify the correct index.

In [5]:
sql="""
#standardSQL
SELECT  
  (SELECT MAX(IF(index=7, value, NULL)) FROM UNNEST(hits.customDimensions)) AS category  
FROM `cloud-training-demos.GA360_test.ga_sessions_sample`,   
  UNNEST(hits) AS hits
WHERE 
  # only include hits on pages
  hits.type = "PAGE"
  AND (SELECT MAX(IF(index=7, value, NULL)) FROM UNNEST(hits.customDimensions)) IS NOT NULL
GROUP BY   
  category
"""
categories_list = bigquery.Client().query(sql).to_dataframe()['category'].tolist()
write_list_to_disk(categories_list, "categories.txt")
print(categories_list)
[u'News', u'Lifestyle', u'Stars & Kultur']

The categories are 'News', 'Stars & Kultur', and 'Lifestyle'.
When creating the author list, we'll only use the first author information for each article.

In [6]:
sql="""
#standardSQL
SELECT
  REGEXP_EXTRACT((SELECT MAX(IF(index=2, value, NULL)) FROM UNNEST(hits.customDimensions)), r"^[^,]+")  AS first_author  
FROM `cloud-training-demos.GA360_test.ga_sessions_sample`,   
  UNNEST(hits) AS hits
WHERE 
  # only include hits on pages
  hits.type = "PAGE"
  AND (SELECT MAX(IF(index=2, value, NULL)) FROM UNNEST(hits.customDimensions)) IS NOT NULL
GROUP BY   
  first_author
"""
authors_list = bigquery.Client().query(sql).to_dataframe()['first_author'].tolist()
write_list_to_disk(authors_list, "authors.txt")
print("Some sample authors {}".format(authors_list[:10]))
print("The total number of authors is {}".format(len(authors_list)))
Some sample authors [u'Marlene Patsalidis', u'Yvonne Widler', u'Thomas  Trescher', u'Johanna Hager', u'Elisabeth Spitzer', u'Daniela Wahl', u'Stefan Berndl', u'Heinz Wagner', u'Brigitte Schokarth', u'Mathias Kainz']
The total number of authors is 385

There should be 385 authors in the database.

Create train and test sets.

In this section, we will create the train/test split of our data for training our model. We use the concatenated values for visitor id and content id to create a farm fingerprint, taking approximately 90% of the data for the training set and 10% for the test set.

In [7]:
sql="""
WITH site_history as (
  SELECT
      fullVisitorId as visitor_id,
      (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) AS content_id,
      (SELECT MAX(IF(index=7, value, NULL)) FROM UNNEST(hits.customDimensions)) AS category, 
      (SELECT MAX(IF(index=6, value, NULL)) FROM UNNEST(hits.customDimensions)) AS title,
      (SELECT MAX(IF(index=2, value, NULL)) FROM UNNEST(hits.customDimensions)) AS author_list,
      SPLIT(RPAD((SELECT MAX(IF(index=4, value, NULL)) FROM UNNEST(hits.customDimensions)), 7), '.') as year_month_array,
      LEAD(hits.customDimensions, 1) OVER (PARTITION BY fullVisitorId ORDER BY hits.time ASC) as nextCustomDimensions
  FROM 
    `cloud-training-demos.GA360_test.ga_sessions_sample`,   
     UNNEST(hits) AS hits
   WHERE 
     # only include hits on pages
      hits.type = "PAGE"
      AND
      fullVisitorId IS NOT NULL
      AND
      hits.time != 0
      AND
      hits.time IS NOT NULL
      AND
      (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) IS NOT NULL
)
SELECT
  visitor_id,
  content_id,
  category,
  REGEXP_REPLACE(title, r",", "") as title,
  REGEXP_EXTRACT(author_list, r"^[^,]+") as author,
  DATE_DIFF(DATE(CAST(year_month_array[OFFSET(0)] AS INT64), CAST(year_month_array[OFFSET(1)] AS INT64), 1), DATE(1970,1,1), MONTH) as months_since_epoch,
  (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(nextCustomDimensions)) as next_content_id
FROM
  site_history
WHERE (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(nextCustomDimensions)) IS NOT NULL
      AND ABS(MOD(FARM_FINGERPRINT(CONCAT(visitor_id, content_id)), 10)) < 9
"""
training_set_df = bigquery.Client().query(sql).to_dataframe()
training_set_df.to_csv('training_set.csv', header=False, index=False, encoding='utf-8')
training_set_df.head()
Out[7]:
visitor_id content_id category title author months_since_epoch next_content_id
0 1000148716229112932 299913368 News U4-Störung legt Wiener Frühverkehr lahm Yvonne Widler 574 299931241
1 1000148716229112932 299931241 Stars & Kultur Regisseur Michael Haneke kritisiert Flüchtling... None 574 299913879
2 1000360453832106474 299925700 Lifestyle Nach Tod von Vater: Tochter bekommt jedes Jahr... Marlene Patsalidis 574 299922662
3 1000360453832106474 299922662 Lifestyle Australischer Fernsehstar rechtfertigt Sexismu... Marlene Patsalidis 574 299826775
4 1001846185946874596 299930679 News Wintereinbruch naht: Erster Schnee im Osten mö... Daniela Wahl 574 299930679
In [8]:
sql="""
WITH site_history as (
  SELECT
      fullVisitorId as visitor_id,
      (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) AS content_id,
      (SELECT MAX(IF(index=7, value, NULL)) FROM UNNEST(hits.customDimensions)) AS category, 
      (SELECT MAX(IF(index=6, value, NULL)) FROM UNNEST(hits.customDimensions)) AS title,
      (SELECT MAX(IF(index=2, value, NULL)) FROM UNNEST(hits.customDimensions)) AS author_list,
      SPLIT(RPAD((SELECT MAX(IF(index=4, value, NULL)) FROM UNNEST(hits.customDimensions)), 7), '.') as year_month_array,
      LEAD(hits.customDimensions, 1) OVER (PARTITION BY fullVisitorId ORDER BY hits.time ASC) as nextCustomDimensions
  FROM 
    `cloud-training-demos.GA360_test.ga_sessions_sample`,   
     UNNEST(hits) AS hits
   WHERE 
     # only include hits on pages
      hits.type = "PAGE"
      AND
      fullVisitorId IS NOT NULL
      AND
      hits.time != 0
      AND
      hits.time IS NOT NULL
      AND
      (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) IS NOT NULL
)
SELECT
  visitor_id,
  content_id,
  category,
  REGEXP_REPLACE(title, r",", "") as title,
  REGEXP_EXTRACT(author_list, r"^[^,]+") as author,
  DATE_DIFF(DATE(CAST(year_month_array[OFFSET(0)] AS INT64), CAST(year_month_array[OFFSET(1)] AS INT64), 1), DATE(1970,1,1), MONTH) as months_since_epoch,
  (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(nextCustomDimensions)) as next_content_id
FROM
  site_history
WHERE (SELECT MAX(IF(index=10, value, NULL)) FROM UNNEST(nextCustomDimensions)) IS NOT NULL
      AND ABS(MOD(FARM_FINGERPRINT(CONCAT(visitor_id, content_id)), 10)) >= 9
"""
test_set_df = bigquery.Client().query(sql).to_dataframe()
test_set_df.to_csv('test_set.csv', header=False, index=False, encoding='utf-8')
test_set_df.head()
Out[8]:
visitor_id content_id category title author months_since_epoch next_content_id
0 1004996140885326201 299918278 News Skipässe in Wintersport-Hochburgen massiv teurer Stefan Hofer 574 299451483
1 1007264386119876541 299965853 News ÖVP und FPÖ dementieren geplante Kassenreduktion Stefan Berndl 574 299975649
2 1008065814070719756 299931241 Stars & Kultur Regisseur Michael Haneke kritisiert Flüchtling... None 574 299899819
3 1010974920021844642 299972800 News Strengere Regeln für Schüler Eltern – und auch... None 574 299972800
4 1012414335303532764 299972248 News Bildungsexperte Salcher: "Lehrer nach Leistung... Bernhard Gaul 574 298972803

Let's have a look at the two csv files we just created containing the training and test set. We'll also do a line count of both files to confirm that we have achieved an approximate 90/10 train/test split.
In the next notebook, Content Based Filtering we will build a model to recommend an article given information about the current article being read, such as the category, title, author, and publish date.

In [9]:
%%bash
wc -l *_set.csv
   25599 test_set.csv
  232308 training_set.csv
  257907 total
In [10]:
!head *_set.csv
==> test_set.csv <==
1004996140885326201,299918278,News,Skipässe in Wintersport-Hochburgen massiv teurer,Stefan Hofer,574,299451483
1007264386119876541,299965853,News,ÖVP und FPÖ dementieren geplante Kassenreduktion,Stefan Berndl,574,299975649
1008065814070719756,299931241,Stars & Kultur,Regisseur Michael Haneke kritisiert Flüchtlingspolitik,,574,299899819
1010974920021844642,299972800,News,Strengere Regeln für Schüler Eltern – und auch Lehrer,,574,299972800
1012414335303532764,299972248,News,"Bildungsexperte Salcher: ""Lehrer nach Leistung bezahlen""",Bernhard Gaul,574,298972803
1013636511271824035,299965853,News,ÖVP und FPÖ dementieren geplante Kassenreduktion,Stefan Berndl,574,299836255
1019846149190726736,299957318,News,27-Jähriger soll betagte Nachbarin vergewaltigt haben,Yvonne Widler,574,299907275
1020572535543246728,299917726,News,OÖ: Drogendealer agierten von Asylquartier aus,Daniela Wahl,574,299410466
1021150165661122687,299816215,News,Fahnenskandal von Mailand: Die Austria zeigt Flagge,Alexander Strecha,574,299930032
1022059616427871901,299906166,Lifestyle,"Foodwatch vergibt ""Goldenen Windbeutel"" an Alete-Kinderkekse",Anita Kattinger,574,299826767

==> training_set.csv <==
1000148716229112932,299913368,News,U4-Störung legt Wiener Frühverkehr lahm ,Yvonne Widler,574,299931241
1000148716229112932,299931241,Stars & Kultur,Regisseur Michael Haneke kritisiert Flüchtlingspolitik,,574,299913879
1000360453832106474,299925700,Lifestyle,Nach Tod von Vater: Tochter bekommt jedes Jahr Blumen,Marlene Patsalidis,574,299922662
1000360453832106474,299922662,Lifestyle,Australischer Fernsehstar rechtfertigt Sexismus mit Asperger-Syndrom,Marlene Patsalidis,574,299826775
1001846185946874596,299930679,News,Wintereinbruch naht: Erster Schnee im Osten möglich,Daniela Wahl,574,299930679
1002283924872562598,299796775,News,Wiener kaufen heuer neun Millionen Geschenke,Bernhard Ichner,574,299853016
1002283924872562598,299853016,News,Schröcksnadel gegen Werdenigg: Keine Aussprache,,574,299899396
1002283924872562598,299899396,News,Suche nach U-Boot: Wasser im Schnorchel verursachte Brand,,574,299798467
1004209053768679755,18976804,News,Heimskandal - Brigitte Wanker: Die Landesverräterin,Georg Hönigsberger,522,299695400
1004209053768679755,299695400,News,Formel 1: Niki Lauda verkündete Abschied als RTL-Experte,Mirad Odobasic,574,299802551