I looked at all the quickstart tutorials and adapted the basic_retrieval example to my dataset.
views_df contains pairs of user_ids and content_ids and represents when a user viewed a piece of content.
The dataset is fairly small (1026 views from 63 users on 187 content items), but the code seems to work and my results are as follows:
Train:
factorized_top_k/top_1_categorical_accuracy: 0.0012
factorized_top_k/top_5_categorical_accuracy: 0.0816
factorized_top_k/top_10_categorical_accuracy: 0.2046
factorized_top_k/top_50_categorical_accuracy: 0.7430
factorized_top_k/top_100_categorical_accuracy: 0.8965
loss: 494.7287

Test:
factorized_top_k/top_1_categorical_accuracy: 0.0
factorized_top_k/top_5_categorical_accuracy: 0.0243
factorized_top_k/top_10_categorical_accuracy: 0.0585
factorized_top_k/top_50_categorical_accuracy: 0.3804
factorized_top_k/top_100_categorical_accuracy: 0.6146
loss: 31.29269790649414
I am unsure whether I created the query embeddings and candidate embeddings correctly from my dataset for the FactorizedTopK metric. I am also having trouble understanding how the FactorizedTopK metric is computed in general. I looked at the source code, but I don't understand its explanation of how it is calculated:
"The main argument are pairs of query and candidate embeddings: the first row of query_embeddings denotes a query for which the candidate from the first row of candidate embeddings was selected by the user. The task will try to maximize the affinity of these query, candidate pairs while minimizing the affinity between the query and candidates belonging to other queries in the batch."
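If I understand that correctly, for one batch of views the task would receive something like this (my own illustration with made-up ids, not code from the tutorial):

# e.g. a batch of 3 views: user 7 viewed content 12, user 3 viewed content 55, user 7 viewed content 9
user_embeddings = user_model(tf.constant([7, 3, 7]))          # shape (3, embedding_dimension)
content_embeddings = content_model(tf.constant([12, 55, 9]))  # shape (3, embedding_dimension)
loss = task(user_embeddings, content_embeddings)              # row i of both tensors comes from the same view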
Where does it take the ground truth from? Aren't the query and candidate embeddings just lists of all users and contents? Does the order of the lists matter? Can someone explain the computation of the FactorizedTopK metric in simpler terms?
Thanks in advance
import os
import pprint
import tempfile
from typing import Dict, Text
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs
import pandas as pd
import matplotlib.pyplot as plt
# Variables
seed = 42
test_percentage = 20
train_percentage = 100-test_percentage
embedding_dimension = 32 # 64 ?
metrics_batchsize = 16
train_batchsize = 128
test_batchsize = 64
learning_rate = 0.1 # 0.5 ?
epochs = 3
index_batchsize = 100
views_df = pd.read_csv('filepath')
views_df = views_df[['user_id','content_id']]
# unique user and content ids, used as lookup vocabularies below
users_df = views_df['user_id'].unique()
contents_df = views_df['content_id'].unique()
# dataset of all views (for training/testing) and of all contents (the candidate catalogue)
views_ds = tf.data.Dataset.from_tensor_slices(dict(views_df))
contents_ds = tf.data.Dataset.from_tensor_slices(contents_df)
view_size = len(views_df)
train_size = round(view_size/100*train_percentage)
test_size = view_size-train_size
tf.random.set_seed(seed)
# shuffle once with a fixed seed, then split into train and test
views_ds_shuffled = views_ds.shuffle(len(views_df), seed=seed, reshuffle_each_iteration=False)
train = views_ds_shuffled.take(train_size)
test = views_ds_shuffled.skip(train_size).take(test_size)
# map raw user ids to trainable embedding vectors
user_model = tf.keras.Sequential([
    tf.keras.layers.experimental.preprocessing.IntegerLookup(vocabulary=users_df, mask_value=None),
    tf.keras.layers.Embedding(input_dim=len(users_df) + 1, output_dim=embedding_dimension)
])
# map raw content ids to trainable embedding vectors
content_model = tf.keras.Sequential([
    tf.keras.layers.experimental.preprocessing.IntegerLookup(vocabulary=contents_df, mask_value=None),
    tf.keras.layers.Embedding(input_dim=len(contents_df) + 1, output_dim=embedding_dimension)
])
# embeddings of every content, used as the candidate set for the FactorizedTopK metric
candidates = contents_ds.batch(metrics_batchsize).map(content_model)
metrics = tfrs.metrics.FactorizedTopK(
    candidates=candidates
)
task = tfrs.tasks.Retrieval(
    metrics=metrics
)
class ContentModel(tfrs.Model):

    def __init__(self, user_model, content_model):
        super().__init__()
        self.content_model: tf.keras.Model = content_model
        self.user_model: tf.keras.Model = user_model
        self.task: tf.keras.layers.Layer = task

    def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
        # row i of user_embeddings is paired with row i of content_embeddings (the same view)
        content_embeddings = self.content_model(features["content_id"])
        user_embeddings = self.user_model(features["user_id"])
        return self.task(user_embeddings, content_embeddings)
model = ContentModel(user_model, content_model)
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=learning_rate))
cached_train = train.shuffle(view_size).batch(train_batchsize).cache()
cached_test = test.batch(test_batchsize).cache()
model.fit(cached_train, epochs=epochs)
model.evaluate(cached_test, return_dict=True)
# brute-force retrieval index over all content embeddings, used to serve recommendations
index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
index.index(contents_ds.batch(index_batchsize).map(model.content_model), contents_ds)
user_id = 2
_, contents = index(tf.constant([user_id]))
print(f"Recommendations for user {user_id}: {contents}")
You should give only positive examples in your dataset; for example, purchase data can serve as the positive samples for each user. TensorFlow Recommenders then uses in-batch negative sampling to train the model: for each (user, content) pair, the contents belonging to the other pairs in the same batch are treated as negatives.
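To make that concrete, here is a rough NumPy sketch of the idea (my own illustration, not the actual TFRS implementation):

import numpy as np

# toy embeddings: a batch of 4 views and a catalogue of 187 contents
query_embeddings = np.random.rand(4, 32)     # user embedding for each view in the batch
positive_embeddings = np.random.rand(4, 32)  # embedding of the content that was actually viewed
all_candidates = np.random.rand(187, 32)     # embeddings of every content passed via metrics=

# Retrieval loss: score every query against every candidate *in the batch*.
# The ground truth is the pairing itself: row i's true candidate is column i,
# so the label matrix is the identity and the other columns act as in-batch negatives.
scores = query_embeddings @ positive_embeddings.T  # shape (4, 4)
labels = np.eye(4)

# FactorizedTopK: for each query, compare the score of its true candidate against
# the scores of *all* candidates. (Here the true items are not literally contained
# in all_candidates because everything is random; in the real model they are.)
true_score = np.sum(query_embeddings * positive_embeddings, axis=1, keepdims=True)  # (4, 1)
candidate_scores = query_embeddings @ all_candidates.T                              # (4, 187)
rank = np.sum(candidate_scores >= true_score, axis=1)  # candidates scoring at least as high
top_10_accuracy = np.mean(rank <= 10)

So the order of the rows only matters in the sense that row i of the query embeddings and row i of the candidate embeddings must come from the same (user, content) view; that pairing is the ground truth.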