I looked at all the quickstart tutorials and adapted the basic_retrieval example to my dataset.
views_df contains pairs of user_ids and content_ids and represents when a user viewed a piece of content.
The dataset is fairly small (1026 views from 63 users on 187 content items), but the code seems to work and my results are as follows:
Train:
factorized_top_k/top_1_categorical_accuracy: 0.0012
factorized_top_k/top_5_categorical_accuracy: 0.0816
factorized_top_k/top_10_categorical_accuracy: 0.2046
factorized_top_k/top_50_categorical_accuracy: 0.7430
factorized_top_k/top_100_categorical_accuracy: 0.8965
loss: 494.7287

Test:
factorized_top_k/top_1_categorical_accuracy: 0.0
factorized_top_k/top_5_categorical_accuracy: 0.0243
factorized_top_k/top_10_categorical_accuracy: 0.0585
factorized_top_k/top_50_categorical_accuracy: 0.3804
factorized_top_k/top_100_categorical_accuracy: 0.6146
loss: 31.29269790649414
I am unsure whether I created the query embeddings and candidate embeddings correctly from my dataset for the FactorizedTopK metric. I am also having trouble understanding how the FactorizedTopK metric is computed in general. I looked at the source code, but I don't understand its explanation of how it is calculated:
"The main argument are pairs of query and candidate embeddings: the first row of query_embeddings denotes a query for which the candidate from the first row of candidate embeddings was selected by the user. The task will try to maximize the affinity of these query, candidate pairs while minimizing the affinity between the query and candidates belonging to other queries in the batch."
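If I understand that correctly, for one batch of views the task would receive something like this (my own illustration with made-up ids, not code from the tutorial):

# e.g. a batch of 3 views: user 7 viewed content 12, user 3 viewed content 55, user 7 viewed content 9
user_embeddings = user_model(tf.constant([7, 3, 7]))          # shape (3, embedding_dimension)
content_embeddings = content_model(tf.constant([12, 55, 9]))  # shape (3, embedding_dimension)
loss = task(user_embeddings, content_embeddings)              # row i of both tensors comes from the same view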
Where does it take the ground truth from? Aren't the query and candidate embeddings just lists of all users and contents? Does the order of the lists matter? Can someone explain the computation of the FactorizedTopK metric in simpler terms?
Thanks in advance
import os
import pprint
import tempfile
from typing import Dict, Text
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs
import pandas as pd
import matplotlib.pyplot as plt
# Variables
seed = 42
test_percentage = 20
train_percentage = 100-test_percentage
embedding_dimension = 32 # 64 ?
metrics_batchsize = 16
train_batchsize = 128
test_batchsize = 64
learning_rate = 0.1 # 0.5 ?
epochs = 3
index_batchsize = 100
views_df = pd.read_csv('filepath')
views_df = views_df[['user_id','content_id']]
# unique user and content ids, used as lookup vocabularies below
users_df = views_df['user_id'].unique()
contents_df = views_df['content_id'].unique()
# dataset of all views (for training/testing) and of all contents (the candidate catalogue)
views_ds = tf.data.Dataset.from_tensor_slices(dict(views_df))
contents_ds = tf.data.Dataset.from_tensor_slices(contents_df)
view_size = len(views_df)
train_size = round(view_size/100*train_percentage)
test_size = view_size-train_size
tf.random.set_seed(seed)
# shuffle once with a fixed seed, then split into train and test
views_ds_shuffled = views_ds.shuffle(len(views_df), seed=seed, reshuffle_each_iteration=False)
train = views_ds_shuffled.take(train_size)
test = views_ds_shuffled.skip(train_size).take(test_size)
# map raw user ids to trainable embedding vectors
user_model = tf.keras.Sequential([
    tf.keras.layers.experimental.preprocessing.IntegerLookup(vocabulary=users_df, mask_value=None),
    tf.keras.layers.Embedding(input_dim=len(users_df) + 1, output_dim=embedding_dimension)
])
# map raw content ids to trainable embedding vectors
content_model = tf.keras.Sequential([
    tf.keras.layers.experimental.preprocessing.IntegerLookup(vocabulary=contents_df, mask_value=None),
    tf.keras.layers.Embedding(input_dim=len(contents_df) + 1, output_dim=embedding_dimension)
])
# embeddings of every content, used as the candidate set for the FactorizedTopK metric
candidates = contents_ds.batch(metrics_batchsize).map(content_model)
metrics = tfrs.metrics.FactorizedTopK(
    candidates=candidates
)
task = tfrs.tasks.Retrieval(
    metrics=metrics
)
class ContentModel(tfrs.Model):

    def __init__(self, user_model, content_model):
        super().__init__()
        self.content_model: tf.keras.Model = content_model
        self.user_model: tf.keras.Model = user_model
        self.task: tf.keras.layers.Layer = task

    def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
        # row i of user_embeddings is paired with row i of content_embeddings (the same view)
        content_embeddings = self.content_model(features["content_id"])
        user_embeddings = self.user_model(features["user_id"])
        return self.task(user_embeddings, content_embeddings)
model = ContentModel(user_model, content_model)
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=learning_rate))
cached_train = train.shuffle(view_size).batch(train_batchsize).cache()
cached_test = test.batch(test_batchsize).cache()
model.fit(cached_train, epochs=epochs)
model.evaluate(cached_test, return_dict=True)
# brute-force retrieval index over all content embeddings, used to serve recommendations
index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
index.index(contents_ds.batch(index_batchsize).map(model.content_model), contents_ds)
user_id = 2
_, contents = index(tf.constant([user_id]))
print(f"Recommendations for user {user_id}: {contents}")
You should give only positive examples in your dataset; for example, purchase data can serve as the positive samples for each user. TensorFlow Recommenders then uses in-batch negative sampling to train the model: for each (user, content) pair, the contents belonging to the other pairs in the same batch are treated as negatives.
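To make that concrete, here is a rough NumPy sketch of the idea (my own illustration, not the actual TFRS implementation):

import numpy as np

# toy embeddings: a batch of 4 views and a catalogue of 187 contents
query_embeddings = np.random.rand(4, 32)     # user embedding for each view in the batch
positive_embeddings = np.random.rand(4, 32)  # embedding of the content that was actually viewed
all_candidates = np.random.rand(187, 32)     # embeddings of every content passed via metrics=

# Retrieval loss: score every query against every candidate *in the batch*.
# The ground truth is the pairing itself: row i's true candidate is column i,
# so the label matrix is the identity and the other columns act as in-batch negatives.
scores = query_embeddings @ positive_embeddings.T  # shape (4, 4)
labels = np.eye(4)

# FactorizedTopK: for each query, compare the score of its true candidate against
# the scores of *all* candidates. (Here the true items are not literally contained
# in all_candidates because everything is random; in the real model they are.)
true_score = np.sum(query_embeddings * positive_embeddings, axis=1, keepdims=True)  # (4, 1)
candidate_scores = query_embeddings @ all_candidates.T                              # (4, 187)
rank = np.sum(candidate_scores >= true_score, axis=1)  # candidates scoring at least as high
top_10_accuracy = np.mean(rank <= 10)

So the order of the rows only matters in the sense that row i of the query embeddings and row i of the candidate embeddings must come from the same (user, content) view; that pairing is the ground truth.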