I wanted to try my hand at building a recommendation system and have outlined the steps I took to do so. The data I worked with was in the following format:
Where the shopping_profile_id refers to the profile of a user, who could have purchased multiple brands. I utilised a technique called Alternating Least Squares (ALS), which is appropriate for this recommender system since we only have purchase data and no ratings. It takes a large matrix of user/item interactions and, through matrix factorisation, tries to find the hidden features that relate the products to each other. The raw purchase data is converted into confidence levels on preferences: unseen items are treated as a zero preference with low confidence, while purchased items are treated as a positive preference with a confidence that grows with the number of purchases. ALS then minimises a sum of squared errors loss function weighted by these confidence values. The beauty of ALS is that by alternating between holding the user factors fixed and holding the brand factors fixed, each step becomes a quadratic problem that can be solved exactly with ordinary least squares, rather than needing gradient-based optimisation. To read more about how ALS works, I have linked the paper titled “Collaborative Filtering for Implicit Feedback Datasets” here. With that, let’s get started…
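To make the confidence weighting concrete, here is a tiny illustrative sketch of the objective the paper describes. It is not part of the pipeline below; the purchase counts, alpha and lambda are made-up numbers purely for illustration.

import numpy as np

# Made-up purchase counts: 3 users x 2 brands
r = np.array([[3, 0],
              [0, 1],
              [2, 5]], dtype=float)

alpha, lam = 40.0, 0.1
p = (r > 0).astype(float)   # binary preference: did the user buy the brand at all?
c = 1 + alpha * r           # confidence grows with the number of purchases

# Random low-rank user and brand factors (2 hidden features each)
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 2))  # user factors
y = rng.normal(size=(2, 2))  # brand factors

# Confidence-weighted sum of squared errors plus L2 regularisation: the loss ALS minimises
loss = np.sum(c * (p - x @ y.T) ** 2) + lam * (np.sum(x ** 2) + np.sum(y ** 2))
print(loss)

ALS alternates between holding y fixed while solving exactly for x, and holding x fixed while solving exactly for y, driving this loss down at every step.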
import pandas as pd
import numpy as np
import scipy
from scipy.sparse import coo_matrix
import implicit

# Read in the data file
path = 'brands_filtered.txt'
df = pd.read_table(path, sep='\t')
I used the implicit package to implement the recommendation system. Below, I set up a ‘values’ column which assigns a 1 for each purchase of a brand by a user. Since there are many data points, I built a sparse matrix with coo_matrix, which stores only the non-zero entries and so takes up very little memory.
def create_sparse(dataframe):
    dataframe['values'] = 1
    shopping_profile_id_u = list(np.sort(dataframe.shopping_profile_id.unique()))  # Create sorted profile list
    brand_id_u = list(np.sort(dataframe.brand_id.unique()))  # Create sorted brand list
    data = dataframe['values'].astype(float).tolist()  # Create list of all the 1's I assigned for a purchase
    # Create row and column objects to build the sparse matrix with
    row = pd.Categorical(dataframe.brand_id, categories=brand_id_u).codes
    col = pd.Categorical(dataframe.shopping_profile_id, categories=shopping_profile_id_u).codes
    sparse = coo_matrix((data, (row, col)), shape=(len(brand_id_u), len(shopping_profile_id_u)))
    return sparse
Essentially, I created sorted lists of the unique brand and shopping profile ids and then built a sparse matrix with the unique brand ids as rows and the unique shopping profile ids as columns.
sparse_matrix = create_sparse(df) #Create the sparse matrix
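To see what this produces, here is a toy example with made-up ids (purely illustrative, not part of the real dataset):

toy = pd.DataFrame({'shopping_profile_id': [10, 10, 11, 12],
                    'brand_id': [1, 2, 2, 1]})
print(create_sparse(toy).toarray())
# Rows are the 2 unique brands, columns the 3 unique profiles:
# [[1. 0. 1.]
#  [1. 1. 0.]]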
For the fitting of the model, I would normally perform some form of hyperparameter search to fine-tune the algorithm, but due to time constraints I stuck to the defaults (a rough sketch of what such a search might look like follows the fitting code below).
def fit_model(sparse):
    model = implicit.als.AlternatingLeastSquares(factors=100)
    print("Fitting the model... \n")
    model.fit(sparse)
    print("Done!")
    return model

model = fit_model(sparse_matrix)
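Had there been time, a minimal search might have looked roughly like the sketch below. This is only a sketch: it assumes a recent version of implicit, whose fit() and evaluation helpers expect a user x item matrix (hence the transpose of the brand x profile matrix built above), and the grid values are placeholders rather than tuned choices.

from implicit.evaluation import train_test_split, precision_at_k

user_items = sparse_matrix.T.tocsr()  # users as rows, brands as columns
train, test = train_test_split(user_items, train_percentage=0.8)

best_score, best_params = 0.0, None
for factors in (50, 100, 200):        # placeholder grid values
    for reg in (0.01, 0.1, 1.0):
        candidate = implicit.als.AlternatingLeastSquares(factors=factors, regularization=reg)
        candidate.fit(train)
        score = precision_at_k(candidate, train, test, K=10)  # ranking quality on held-out purchases
        if score > best_score:
            best_score, best_params = score, (factors, reg)

print("Best parameters:", best_params, "precision@10:", best_score)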
Finally, I built a function that takes the fitted model and a brand name entered by the user, and outputs the most similar brands based on the purchase data.
def predict(brand_name, models):
    unique_brands = np.sort(df.brand_id.unique())
    try:
        b_id = df.at[df[df['name'] == brand_name].index[0], 'brand_id']  # Get the brand id of the input
    except IndexError:
        print("brand does not exist")
        return
    arr_val = np.where(unique_brands == b_id)  # Get the array position in the sparse matrix of the brand
    related = models.similar_items(arr_val[0][0])  # Feed the value into the similarity calculation of the model
    similar = []
    scores = []
    # Convert the sparse matrix positions back to brand names
    for i in related:
        value = int(i[0])
        scores.append(str(i[1]))
        similar_id = unique_brands[value]
        similar.append(df.loc[df['brand_id'] == similar_id, 'name'].tolist()[0])
    return list(zip(similar, scores))

unique = df['name'].unique()
list1 = []
for name in unique[:4186]:
    list1.append(predict(name, model))
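To query a single brand, the call looks like this (the brand name here is hypothetical; substitute one that actually appears in the data):

print(predict('ExampleBrand', model))
# Returns a list of (brand name, similarity score) pairs, with the brand itself typically first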
The output of this loop is a list pairing each brand with the brands most similar to it, based on what consumers bought in this dataset. The full notebook can be found on my github here.