2. Creating a sampled dataset

This notebook illustrates:

  1. Sampling a BigQuery dataset to create datasets for ML
  2. Preprocessing with Pandas
In [1]:
# change these to try this notebook out
BUCKET = 'cloud-training-demos-ml'
PROJECT = 'cloud-training-demos'
REGION = 'us-central1'
In [2]:
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION
In [3]:
%%bash
if ! gsutil ls | grep -q gs://${BUCKET}/; then
  gsutil mb -l ${REGION} gs://${BUCKET}
fi

Create ML dataset by sampling using BigQuery

Let's sample the BigQuery data to create smaller datasets.

In [4]:
# Create SQL query using natality data after the year 2000
from google.cloud import bigquery
query = """
SELECT
  weight_pounds,
  is_male,
  mother_age,
  plurality,
  gestation_weeks,
  FARM_FINGERPRINT(CONCAT(CAST(year AS STRING), CAST(month AS STRING))) AS hashmonth
FROM
  publicdata.samples.natality
WHERE year > 2000
"""

There are only a limited number of years and months in the dataset. Let's see what the hashmonths are.

In [5]:
# Call BigQuery, but GROUP BY the hashmonth and count the records in each group,
# so that we can pick sampling rates that give us the train/eval split we want
df = bigquery.Client().query("SELECT hashmonth, COUNT(weight_pounds) AS num_babies FROM (" + query + ") GROUP BY hashmonth").to_dataframe()
print("There are {} unique hashmonths.".format(len(df)))
df.head()
There are 96 unique hashmonths.
Out[5]:
hashmonth num_babies
0 6392072535155213407 323758
1 8387817883864991792 331629
2 8391424625589759186 364497
3 9183605629983195042 329975
4 328012383083104805 359891

Here's a way to get a well distributed portion of the data in such a way that the test and train sets do not overlap:

In [6]:
# Added the RAND() so that we can now subsample from each of the hashmonths to get approximately the record counts we want
trainQuery = "SELECT * FROM (" + query + ") WHERE ABS(MOD(hashmonth, 4)) < 3 AND RAND() < 0.0005"
evalQuery = "SELECT * FROM (" + query + ") WHERE ABS(MOD(hashmonth, 4)) = 3 AND RAND() < 0.0005"
traindf = bigquery.Client().query(trainQuery).to_dataframe()
evaldf = bigquery.Client().query(evalQuery).to_dataframe()
print("There are {} examples in the train dataset and {} in the eval dataset".format(len(traindf), len(evaldf)))
There are 13405 examples in the train dataset and 3262 in the eval dataset
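The hash-and-modulo trick can be sketched in plain Python. Because the hash of a month is deterministic, every record from a given month always lands in the same bucket, so the train and eval sets can never share a month. This is an illustrative stand-in for FARM_FINGERPRINT, using `hashlib` instead; the function name is hypothetical:

```python
import hashlib

def assign_split(year, month, eval_bucket=3, num_buckets=4):
    """Deterministically assign a (year, month) to 'train' or 'eval'.

    Illustrative stand-in for FARM_FINGERPRINT + ABS(MOD(hashmonth, 4)):
    buckets 0-2 go to train (~75%), bucket 3 goes to eval (~25%).
    """
    key = '{}{}'.format(year, month).encode('utf-8')
    bucket = int(hashlib.md5(key).hexdigest(), 16) % num_buckets
    return 'eval' if bucket == eval_bucket else 'train'

# The assignment is stable: asking twice gives the same answer,
# so records from the same month can never straddle the split.
assert assign_split(2005, 7) == assign_split(2005, 7)
```

The extra `RAND() < 0.0005` filter in the queries above is independent of this: the hash decides *which* split a month belongs to, while RAND() merely thins each split down to a workable size.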

Preprocess data using Pandas

Let's add extra rows to simulate the lack of ultrasound. In the process, we'll also change the plurality column to be a string.

In [8]:
traindf.head()
Out[8]:
weight_pounds is_male mother_age plurality gestation_weeks hashmonth
0 7.874912 True 27 1 40.0 774501970389208065
1 9.312326 True 33 1 38.0 774501970389208065
2 9.376260 True 31 1 40.0 774501970389208065
3 7.374463 True 31 1 38.0 774501970389208065
4 8.509843 False 34 1 38.0 774501970389208065

Also notice that some very important numeric fields are missing in some rows (the count in Pandas excludes missing data).

In [9]:
# Let's look at summary statistics of the training data
traindf.describe()
Out[9]:
weight_pounds mother_age plurality gestation_weeks hashmonth
count 13391.000000 13405.000000 13405.000000 13314.000000 1.340500e+04
mean 7.238220 27.337635 1.036255 38.614466 4.403132e+18
std 1.328578 6.170848 0.196276 2.576437 2.786276e+18
min 0.500449 12.000000 1.000000 17.000000 1.244589e+17
25% 6.563162 22.000000 1.000000 38.000000 1.622638e+18
50% 7.312733 27.000000 1.000000 39.000000 4.329667e+18
75% 8.062305 32.000000 1.000000 40.000000 7.108882e+18
max 13.459221 50.000000 4.000000 47.000000 9.183606e+18
In [10]:
# It is always crucial to clean raw data before using it for ML, so we have a preprocessing step
import pandas as pd
def preprocess(df):
  # clean up data we don't want to train on:
  # users will have to tell us the mother's age and similar fields,
  # otherwise our ML service won't work.
  # these fields were chosen because they are good predictors
  # and because they are easy enough to collect
  df = df[df.weight_pounds > 0]
  df = df[df.mother_age > 0]
  df = df[df.gestation_weeks > 0]
  df = df[df.plurality > 0]
  
  # modify plurality field to be a string
  twins_etc = dict(zip([1,2,3,4,5],
                   ['Single(1)', 'Twins(2)', 'Triplets(3)', 'Quadruplets(4)', 'Quintuplets(5)']))
  df['plurality'] = df['plurality'].replace(twins_etc)
  
  # now create extra rows to simulate lack of ultrasound
  nous = df.copy(deep=True)
  nous.loc[nous['plurality'] != 'Single(1)', 'plurality'] = 'Multiple(2+)'
  nous['is_male'] = 'Unknown'
  
  return pd.concat([df, nous])
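The "extra rows" step can be seen in isolation on a tiny hand-built dataframe (the values below are made up): a deep copy of the data gets its `is_male` field masked to 'Unknown' and any multiple birth collapsed to 'Multiple(2+)', and concatenating the copy back doubles the row count.

```python
import pandas as pd

# Toy data (made-up values), already carrying string plurality labels
df = pd.DataFrame({'is_male': [True, False],
                   'plurality': ['Single(1)', 'Twins(2)']})

# Simulate lack of ultrasound: the sex is unknown, and all we can
# tell about plurality is single vs. multiple
nous = df.copy(deep=True)
nous.loc[nous['plurality'] != 'Single(1)', 'plurality'] = 'Multiple(2+)'
nous['is_male'] = 'Unknown'

both = pd.concat([df, nous])
# both now has 4 rows: the 2 originals plus 2 'no ultrasound' copies
```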
In [11]:
# Let's see a small sample of the training data now after our preprocessing
traindf = preprocess(traindf)
evaldf = preprocess(evaldf)
traindf.head()
Out[11]:
weight_pounds is_male mother_age plurality gestation_weeks hashmonth
0 7.874912 True 27 Single(1) 40.0 774501970389208065
1 9.312326 True 33 Single(1) 38.0 774501970389208065
2 9.376260 True 31 Single(1) 40.0 774501970389208065
3 7.374463 True 31 Single(1) 38.0 774501970389208065
4 8.509843 False 34 Single(1) 38.0 774501970389208065
In [12]:
traindf.tail()
Out[12]:
weight_pounds is_male mother_age plurality gestation_weeks hashmonth
13400 6.624891 Unknown 17 Single(1) 39.0 6637442812569910270
13401 5.313141 Unknown 30 Single(1) 40.0 6637442812569910270
13402 8.750147 Unknown 22 Single(1) 40.0 6637442812569910270
13403 8.366543 Unknown 32 Single(1) 38.0 6637442812569910270
13404 8.311427 Unknown 31 Single(1) 41.0 6637442812569910270
In [13]:
# describe() covers only numeric columns, so you won't see plurality here
traindf.describe()
Out[13]:
weight_pounds mother_age gestation_weeks hashmonth
count 26606.000000 26606.000000 26606.000000 2.660600e+04
mean 7.239026 27.343231 38.619409 4.404453e+18
std 1.328190 6.170775 2.558954 2.784172e+18
min 0.500449 12.000000 17.000000 1.244589e+17
25% 6.563162 22.000000 38.000000 1.622638e+18
50% 7.312733 27.000000 39.000000 4.329667e+18
75% 8.062305 32.000000 40.000000 7.108882e+18
max 13.459221 50.000000 47.000000 9.183606e+18

Write out

In the final version, we want to read from files, not Pandas dataframes. So, write the Pandas dataframes out as CSV files. Using CSV files gives us the advantage of shuffling during read. This is important for distributed training: some workers might be slower than others, and shuffling prevents the slower workers from repeatedly being assigned the same data.
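One way to picture that shuffling advantage is a buffered shuffle over the CSV lines, which is essentially what a training input pipeline (e.g. `tf.data`'s `Dataset.shuffle`) does with its shuffle buffer. A minimal pure-Python sketch, with a hypothetical function name:

```python
import csv
import random

def shuffled_rows(filename, buffer_size=1000, seed=None):
    """Yield CSV rows in a randomized order using a fixed-size shuffle buffer.

    Rows are read sequentially into a buffer; once the buffer is full,
    a random element is yielded for each new row read, so memory stays
    bounded while the output order is shuffled.
    """
    rng = random.Random(seed)
    buffer = []
    with open(filename) as f:
        for row in csv.reader(f):
            buffer.append(row)
            if len(buffer) >= buffer_size:
                yield buffer.pop(rng.randrange(len(buffer)))
    # drain whatever is left, in random order
    rng.shuffle(buffer)
    yield from buffer
```

With a larger `buffer_size` the output order is closer to a true uniform shuffle, at the cost of memory; every input row is still yielded exactly once.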

In [14]:
traindf.to_csv('train.csv', index=False, header=False)
evaldf.to_csv('eval.csv', index=False, header=False)
In [16]:
%%bash
wc -l *.csv
head *.csv
tail *.csv
   6444 eval.csv
  26606 train.csv
  33050 total
==> eval.csv <==
7.25100379718,False,23,Single(1),39.0,7146494315947640619
7.936641432,True,19,Single(1),39.0,6244544205302024223
6.6248909731,True,27,Single(1),37.0,1891060869255459203
7.31273323054,True,33,Single(1),38.0,2246942437170405963
6.13326012884,True,42,Single(1),36.0,6365946696709051755
7.43839671988,False,23,Single(1),40.0,4740473290291881219
7.936641432,False,23,Single(1),40.0,7146494315947640619
6.52788757782,True,39,Single(1),36.0,4740473290291881219
6.75055446244,False,25,Single(1),34.0,8904940584331855459
6.20821729792,False,33,Single(1),38.0,6365946696709051755

==> train.csv <==
7.87491199864,True,27,Single(1),40.0,774501970389208065
9.31232594688,True,33,Single(1),38.0,774501970389208065
9.37626000286,True,31,Single(1),40.0,774501970389208065
7.3744626639,True,31,Single(1),38.0,774501970389208065
8.5098433132,False,34,Single(1),38.0,774501970389208065
7.3744626639,False,28,Single(1),39.0,774501970389208065
7.1870697412,False,33,Single(1),38.0,774501970389208065
8.75014717878,False,22,Single(1),41.0,774501970389208065
7.35903030556,True,18,Single(1),42.0,774501970389208065
6.686620406459999,False,30,Single(1),39.0,774501970389208065
==> eval.csv <==
7.31273323054,Unknown,24,Single(1),40.0,1639186255933990135
6.3118345610599995,Unknown,40,Single(1),38.0,74931465496927487
7.1870697412,Unknown,33,Single(1),37.0,74931465496927487
8.24969784404,Unknown,33,Single(1),39.0,3182182455926341111
8.0689187892,Unknown,24,Single(1),41.0,74931465496927487
8.421658408399999,Unknown,32,Single(1),41.0,6910174677251748583
6.80787465056,Unknown,25,Single(1),39.0,6141045177192779423
6.8122838958,Unknown,39,Single(1),40.0,6141045177192779423
7.16281889238,Unknown,22,Single(1),37.0,1639186255933990135
7.5618555866,Unknown,30,Single(1),42.0,8904940584331855459

==> train.csv <==
8.000575487979999,Unknown,21,Single(1),39.0,6637442812569910270
8.56275425608,Unknown,27,Single(1),40.0,6637442812569910270
5.93704871566,Unknown,28,Single(1),36.0,6637442812569910270
6.7902376696,Unknown,29,Single(1),40.0,6637442812569910270
8.0358494499,Unknown,27,Single(1),41.0,6637442812569910270
6.6248909731,Unknown,17,Single(1),39.0,6637442812569910270
5.3131405142,Unknown,30,Single(1),40.0,6637442812569910270
8.75014717878,Unknown,22,Single(1),40.0,6637442812569910270
8.3665428429,Unknown,32,Single(1),38.0,6637442812569910270
8.3114272774,Unknown,31,Single(1),41.0,6637442812569910270

Copyright 2017-2018 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.