- {
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 6 Adding rows"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "So far we have seen how to factorize a given matrix into two smaller ones."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": true
- },
- "outputs": [],
- "source": [
- "import numpy as np\n",
- "import itertools\n",
- "import time\n",
- "\n",
- "class NMF:\n",
- "\n",
- "    \"\"\"\n",
- "    Fit a matrix factorization model for a matrix X with missing values,\n",
- "    such that\n",
- "        X = W H.T + E\n",
- "    where\n",
- "        X is of shape (m, n)    - data matrix\n",
- "        W is of shape (m, rank) - approximated row space\n",
- "        H is of shape (n, rank) - approximated column space\n",
- "        E is of shape (m, n)    - residual (error) matrix\n",
- "    \"\"\"\n",
- "\n",
- "    def __init__(self, rank=10, max_iter=100, eta=0.01):\n",
- "        \"\"\"\n",
- "        :param rank: Rank of the matrices of the model.\n",
- "        :param max_iter: Maximum number of SGD iterations.\n",
- "        :param eta: SGD learning rate.\n",
- "        \"\"\"\n",
- "        self.rank = rank\n",
- "        self.max_iter = max_iter\n",
- "        self.eta = eta\n",
- "\n",
- "\n",
- "    def fit(self, X, verbose=False):\n",
- "        \"\"\"\n",
- "        Fit model parameters W, H.\n",
- "        :param X:\n",
- "            Non-negative data matrix of shape (m, n).\n",
- "            Unknown values are assumed to take the value of zero (0).\n",
- "        :param verbose: If True, print the error after each iteration.\n",
- "        \"\"\"\n",
- "        m, n = X.shape\n",
- "\n",
- "        W = np.random.rand(m, self.rank)\n",
- "        H = np.random.rand(n, self.rank)\n",
- "\n",
- "        # Indices to model variables\n",
- "        w_vars = list(itertools.product(range(m), range(self.rank)))\n",
- "        h_vars = list(itertools.product(range(n), range(self.rank)))\n",
- "\n",
- "        # Indices to nonzero rows/columns\n",
- "        nzcols = dict([(j, X[:, j].nonzero()[0]) for j in range(n)])\n",
- "        nzrows = dict([(i, X[i, :].nonzero()[0]) for i in range(m)])\n",
- "\n",
- "        # Errors\n",
- "        self.error = np.zeros((self.max_iter,))\n",
- "\n",
- "        for t in range(self.max_iter):\n",
- "            t1 = time.time()\n",
- "            np.random.shuffle(w_vars)\n",
- "            np.random.shuffle(h_vars)\n",
- "\n",
- "            # Gradient step on W[i, k]: the residual is weighted by H[j, k];\n",
- "            # max(0, .) projects back onto the non-negative values.\n",
- "            for i, k in w_vars:\n",
- "                wgrad = sum([(X[i, j] - W[i, :].dot(H[j, :])) * H[j, k] for j in nzrows[i]])\n",
- "                W[i, k] = max(0, W[i, k] + self.eta * wgrad)\n",
- "\n",
- "            # Gradient step on H[j, k]: the residual is weighted by W[i, k].\n",
- "            for j, k in h_vars:\n",
- "                hgrad = sum([(X[i, j] - W[i, :].dot(H[j, :])) * W[i, k] for i in nzcols[j]])\n",
- "                H[j, k] = max(0, H[j, k] + self.eta * hgrad)\n",
- "\n",
- "            # Squared error over the known (nonzero) entries only\n",
- "            self.error[t] = sum([sum([(X[i, j] - W[i, :].dot(H[j, :]))**2 for j in nzrows[i]])\n",
- "                                 for i in range(X.shape[0])])\n",
- "\n",
- "            if verbose:\n",
- "                print(t, self.error[t])\n",
- "\n",
- "        self.W = W\n",
- "        self.H = H\n",
- "\n",
- "\n",
- "    def predict(self, i, j):\n",
- "        \"\"\"\n",
- "        Predict score for row i and column j.\n",
- "        :param i: Row index.\n",
- "        :param j: Column index.\n",
- "        \"\"\"\n",
- "        # H has shape (n, rank), so column j of X corresponds to row j of H.\n",
- "        return self.W[i, :].dot(self.H[j, :])\n",
- "\n",
- "\n",
- "    def predict_all(self):\n",
- "        \"\"\"\n",
- "        Return approximated matrix for all\n",
- "        columns and rows.\n",
- "        \"\"\"\n",
- "        return self.W.dot(self.H.T)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
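- "In recommender systems we often add new users (rows). We first compute the factorization:\n",
- "\n",
- "$X\\approx W H$\n",
- "\n",
- "When a new row is added to $X$, we can find a new matrix $W$ while keeping the matrix $H$ unchanged.\n",
- "\n",
- "We repeat the update $W \\leftarrow W \\circ \\frac{X H^T}{W H H^T}$, where the product $\\circ$ and the division are element-wise. Here $H$ has shape $(k, n)$; a factor stored with shape $(n, k)$, as in the `NMF` class above, must be transposed first."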
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "##### Question 6-2-1\n",
- "Write a function that, given the matrices $X$ and $H$, computes the new matrix $W$."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": true
- },
- "outputs": [],
- "source": [
- "def nmf_fix(X, H, k, max_iter=100):\n",
- "    \"\"\"\n",
- "    :param X: matrix with descriptions of test examples.\n",
- "    :param H: matrix for a pre-built model.\n",
- "    :param k: Factorization rank.\n",
- "    :param max_iter: Max. number of iterations.\n",
- "    :return:\n",
- "        W: A predicted row clustering matrix.\n",
- "    \"\"\"\n",
- "    m, N = X.shape\n",
- "    W = np.random.rand(m, k)\n",
- "\n",
- "    for itr in range(max_iter):\n",
- "        # TODO: your code here\n",
- "        pass\n",
- "    return W"
- ]
- },
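- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "One possible way to fill in the loop in `nmf_fix` is sketched below. It is not a reference solution: the name `nmf_fix_sketch` and the small constant `eps` (added to avoid division by zero) are assumptions made for illustration, and $H$ is assumed to have shape $(k, n)$ so that $X \\approx W H$. The check at the end uses made-up random data and only prints the reconstruction error."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import numpy as np\n",
- "\n",
- "def nmf_fix_sketch(X, H, k, max_iter=100, eps=1e-9):\n",
- "    # Hypothetical helper: multiplicative update of W with H held fixed.\n",
- "    # Assumes X is (m, n), H is (k, n) and both are non-negative.\n",
- "    m, n = X.shape\n",
- "    W = np.random.rand(m, k)\n",
- "    for _ in range(max_iter):\n",
- "        numer = X.dot(H.T)               # shape (m, k)\n",
- "        denom = W.dot(H).dot(H.T) + eps  # shape (m, k), eps avoids 0/0\n",
- "        W *= numer / denom               # element-wise update\n",
- "    return W\n",
- "\n",
- "# Toy check on synthetic data that has an exact rank-3 factorization\n",
- "rng = np.random.RandomState(0)\n",
- "W0 = rng.rand(5, 3)\n",
- "H0 = rng.rand(3, 8)\n",
- "X0 = W0.dot(H0)\n",
- "W1 = nmf_fix_sketch(X0, H0, k=3, max_iter=500)\n",
- "print('reconstruction error:', np.linalg.norm(X0 - W1.dot(H0)))"
- ]
- },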
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Data fusion with matrix factorization"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import skfusion\n",
- "from skfusion import fusion as skf\n",
- "\n",
- "def scale(X, amin, amax):\n",
- "    return (X - X.min()) / (X.max() - X.min()) * (amax - amin) + amin\n",
- "\n",
- "def rmse(y_true, y_pred):\n",
- "    return np.sqrt(np.sum((y_true - y_pred)**2) / y_true.size)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We load a subset of the movie data.\n",
- "\n",
- "$R_{12}$ -> user-movie\n",
- "\n",
- "$R_{23}$ -> movie-genre\n",
- "\n",
- "$R_{24}$ -> movie-actor"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "R12_true = np.load(\"R12.npy\")\n",
- "R23 = np.load(\"R23.npy\")\n",
- "R24 = np.load(\"R24.npy\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "In `R12_true`, missing values are represented by -1."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "R12_true"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Before continuing, we create a mask that tells us which cells do not yet have a rating. We then scale the data to values between 0 and 1."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": true
- },
- "outputs": [],
- "source": [
- "R12_true = np.ma.masked_equal(R12_true, -1)\n",
- "R12_true = scale(R12_true, 0, 1)"
- ]
- },
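- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "To see what these two steps do, here is a small sketch on a toy array (the values are made up for illustration). `np.ma.masked_equal` masks the -1 entries, and because `min` and `max` of a masked array skip masked entries, `scale` maps only the known ratings onto the interval from 0 to 1."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import numpy as np\n",
- "\n",
- "# Same definition as earlier, repeated so this cell runs on its own\n",
- "def scale(X, amin, amax):\n",
- "    return (X - X.min()) / (X.max() - X.min()) * (amax - amin) + amin\n",
- "\n",
- "toy = np.array([[5, -1, 3],\n",
- "                [-1, 1, 4]])           # made-up ratings, -1 marks a missing value\n",
- "toy = np.ma.masked_equal(toy, -1)      # hide the missing cells\n",
- "toy_scaled = scale(toy, 0, 1)\n",
- "\n",
- "print(toy_scaled)                      # masked cells stay masked\n",
- "print(toy_scaled.min(), toy_scaled.max())"
- ]
- },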
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We prepare the data for evaluating the predictions: we use 90% of the known ratings and hide the rest."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": true
- },
- "outputs": [],
- "source": [
- "frac = 0.9\n",
- "R12 = R12_true.copy()\n",
- "hidden = np.logical_and(np.random.random(R12_true.shape) > frac, ~R12_true.mask)\n",
- "R12 = np.ma.masked_where(hidden, R12)"
- ]
- },
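- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "A quick sanity check of such a hold-out split, sketched on random toy data (the names `toy_true`, `toy_frac`, `toy_hidden`, `toy_train` and the seed are assumptions for illustration): every hidden cell should come from a known rating, and the hidden share of known ratings should be close to `1 - toy_frac`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "import numpy as np\n",
- "\n",
- "rng = np.random.RandomState(0)\n",
- "# Made-up ratings in 0..5, with -1 marking a missing value\n",
- "toy_true = np.ma.masked_equal(rng.randint(-1, 6, size=(200, 50)), -1)\n",
- "\n",
- "toy_frac = 0.9\n",
- "known = ~np.ma.getmaskarray(toy_true)\n",
- "toy_hidden = np.logical_and(rng.random_sample(toy_true.shape) > toy_frac, known)\n",
- "toy_train = np.ma.masked_where(toy_hidden, toy_true.copy())\n",
- "\n",
- "# Every hidden cell was a known rating and is masked in the training copy\n",
- "assert not np.any(toy_hidden & ~known)\n",
- "assert np.all(np.ma.getmaskarray(toy_train)[toy_hidden])\n",
- "print('hidden share of known ratings:', toy_hidden.sum() / known.sum())"
- ]
- },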
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We define the object types and set the factorization rank of each."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": true
- },
- "outputs": [],
- "source": [
- "p = 0.05\n",
- "t1 = skf.ObjectType('User', max(int(p*R12.shape[0]), 5))\n",
- "t2 = skf.ObjectType('Movie', max(int(p*R12.shape[1]), 5))\n",
- "t3 = skf.ObjectType('Genre', max(int(p*R23.shape[1]), 5))\n",
- "t4 = skf.ObjectType('Actor', max(int(p*R24.shape[1]), 5))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We place all relations into a graph."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "relations = [skf.Relation(R12, t1, t2, name='User ratings'),\n",
- " skf.Relation(R23, t2, t3, name='Movie genres'),\n",
- " skf.Relation(R24, t2, t4, name='Movie actors')]\n",
- "graph = skf.FusionGraph(relations)\n",
- "print('Ranks:', ''.join(['\\n{}: {}'.format(o.name, o.rank)\n",
- " for o in graph.object_types]))\n",
- "\n",
- "graph_small = skf.FusionGraph([skf.Relation(R12, t1, t2, name='User ratings')])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "To start, we compute the error of the mean rating (mean regressor) per user, per movie, and over the whole matrix."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": true
- },
- "outputs": [],
- "source": [
- "n_users, n_movies = R12_true.shape\n",
- "\n",
- "mean_user = np.mean(R12, 1)\n",
- "mean_movie = np.mean(R12, 0)\n",
- "mean_rating = np.mean(R12)\n",
- "\n",
- "# mean rating\n",
- "score = rmse(R12_true[hidden], mean_rating)\n",
- "print('RMSE(mean rating): {}'.format(score))\n",
- "\n",
- "# mean user\n",
- "R12_pred = np.tile(mean_user.reshape((n_users, 1)), (1, n_movies))\n",
- "score = rmse(R12_true[hidden], R12_pred[hidden])\n",
- "print('RMSE(mean user): {}'.format(score))\n",
- "\n",
- "# mean movie\n",
- "R12_pred = np.tile(mean_movie.reshape((1, n_movies)), (n_users, 1))\n",
- "score = rmse(R12_true[hidden], R12_pred[hidden])\n",
- "print('RMSE(mean movie): {}'.format(score))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Next we perform fusion using a single relation only (this is essentially a tri-factorization of one matrix)."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "collapsed": true
- },
- "outputs": [],
- "source": [
- "# DFMF on ratings data only (it benefits if unknown values are set to mean)\n",
- "scores = []\n",
- "for _ in range(10):\n",
- "    dfmf_fuser = skf.Dfmf(max_iter=100, init_type='random')\n",
- "    dfmf_mod = dfmf_fuser.fuse(graph_small)\n",
- "    R12_pred = dfmf_mod.complete(graph_small['User ratings'])\n",
- "    # R12_pred = scale(R12_pred, 0, 1)\n",
- "    R12_pred += np.tile(mean_user.reshape((n_users, 1)), (1, n_movies))\n",
- "    R12_pred += np.tile(mean_movie.reshape((1, n_movies)), (n_users, 1))\n",
- "    R12_pred = scale(R12_pred, 0, 1)\n",
- "    scores.append(rmse(R12_true[hidden], R12_pred[hidden]))\n",
- "print('RMSE(ratings; out-of-sample dfmf): {}'.format(np.mean(scores)))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Finally, we fuse all three relations."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "# DFMF (it benefits if unknown values are set to mean)\n",
- "scores = []\n",
- "for _ in range(10):\n",
- "    dfmf_fuser = skf.Dfmf(max_iter=100, init_type='random')\n",
- "    dfmf_mod = dfmf_fuser.fuse(graph)\n",
- "    R12_pred = dfmf_mod.complete(graph['User ratings'])\n",
- "    # R12_pred = scale(R12_pred, 0, 1)\n",
- "    R12_pred += np.tile(mean_user.reshape((n_users, 1)), (1, n_movies))\n",
- "    R12_pred += np.tile(mean_movie.reshape((1, n_movies)), (n_users, 1))\n",
- "    R12_pred = scale(R12_pred, 0, 1)\n",
- "    scores.append(rmse(R12_true[hidden], R12_pred[hidden]))\n",
- "print('RMSE(all relations; out-of-sample dfmf): {}'.format(np.mean(scores)))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "We can also obtain all the $G_i$ and $S_{ij}$ matrices from the model."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "G1 = dfmf_fuser.factor(t1)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": [
- "for chain in dfmf_fuser.chain(t1, t2):\n",
- "    S12 = dfmf_fuser.backbone(dfmf_fuser.fusion_graph[chain[0]][chain[1]][0])\n",
- "for chain in dfmf_fuser.chain(t2, t3):\n",
- "    S23 = dfmf_fuser.backbone(dfmf_fuser.fusion_graph[chain[0]][chain[1]][0])"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.3"
- },
- "nbTranslate": {
- "displayLangs": [
- "en",
- "sl"
- ],
- "hotkey": "alt-t",
- "langInMainMenu": true,
- "sourceLang": "sl",
- "targetLang": "en",
- "useGoogleTranslate": true
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
- }