a guest
May 27th, 2018
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6 Adding rows"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So far we have seen how to factorize a given matrix into two smaller ones."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import itertools\n",
    "import time\n",
    "\n",
    "class NMF:\n",
    "\n",
    "    \"\"\"\n",
    "    Fit a matrix factorization model for a matrix X with missing values,\n",
    "    such that\n",
    "        X = W H.T + E\n",
    "    where\n",
    "        X is of shape (m, n)    - data matrix\n",
    "        W is of shape (m, rank) - approximated row space\n",
    "        H is of shape (n, rank) - approximated column space\n",
    "        E is of shape (m, n)    - residual (error) matrix\n",
    "    \"\"\"\n",
    "\n",
    "    def __init__(self, rank=10, max_iter=100, eta=0.01):\n",
    "        \"\"\"\n",
    "        :param rank: Rank of the matrices of the model.\n",
    "        :param max_iter: Maximum number of SGD iterations.\n",
    "        :param eta: SGD learning rate.\n",
    "        \"\"\"\n",
    "        self.rank = rank\n",
    "        self.max_iter = max_iter\n",
    "        self.eta = eta\n",
    "\n",
    "\n",
    "    def fit(self, X, verbose=False):\n",
    "        \"\"\"\n",
    "        Fit model parameters W, H.\n",
    "        :param X:\n",
    "            Non-negative data matrix of shape (m, n)\n",
    "            Unknown values are assumed to take the value of zero (0).\n",
    "        \"\"\"\n",
    "        m, n = X.shape\n",
    "\n",
    "        W = np.random.rand(m, self.rank)\n",
    "        H = np.random.rand(n, self.rank)\n",
    "\n",
    "        # Indices to model variables\n",
    "        w_vars = list(itertools.product(range(m), range(self.rank)))\n",
    "        h_vars = list(itertools.product(range(n), range(self.rank)))\n",
    "\n",
    "        # Indices to nonzero rows/columns\n",
    "        nzcols = dict([(j, X[:, j].nonzero()[0]) for j in range(n)])\n",
    "        nzrows = dict([(i, X[i, :].nonzero()[0]) for i in range(m)])\n",
    "\n",
    "        # Errors\n",
    "        self.error = np.zeros((self.max_iter,))\n",
    "\n",
    "        for t in range(self.max_iter):\n",
    "            t1 = time.time()\n",
    "            np.random.shuffle(w_vars)\n",
    "            np.random.shuffle(h_vars)\n",
    "\n",
    "            # The derivative of the residual w.r.t. W[i, k] involves H[j, k]\n",
    "            for i, k in w_vars:\n",
    "                wgrad = sum([(X[i, j] - W[i, :].dot(H[j, :])) * H[j, k] for j in nzrows[i]])\n",
    "                W[i, k] = max(0, W[i, k] + self.eta * wgrad)\n",
    "\n",
    "            # The derivative of the residual w.r.t. H[j, k] involves W[i, k]\n",
    "            for j, k in h_vars:\n",
    "                hgrad = sum([(X[i, j] - W[i, :].dot(H[j, :])) * W[i, k] for i in nzcols[j]])\n",
    "                H[j, k] = max(0, H[j, k] + self.eta * hgrad)\n",
    "\n",
    "            # Squared error over the known (nonzero) entries only\n",
    "            self.error[t] = sum([sum([(X[i, j] - W[i, :].dot(H[j, :]))**2 for j in nzrows[i]])\n",
    "                                 for i in range(X.shape[0])])\n",
    "\n",
    "            if verbose: print(t, self.error[t])\n",
    "\n",
    "        self.W = W\n",
    "        self.H = H\n",
    "\n",
    "\n",
    "    def predict(self, i, j):\n",
    "        \"\"\"\n",
    "        Predict score for row i and column j\n",
    "        :param i: Row index.\n",
    "        :param j: Column index.\n",
    "        \"\"\"\n",
    "        # H has shape (n, rank), so column j of X corresponds to row j of H\n",
    "        return self.W[i, :].dot(self.H[j, :])\n",
    "\n",
    "\n",
    "    def predict_all(self):\n",
    "        \"\"\"\n",
    "        Return approximated matrix for all\n",
    "        columns and rows.\n",
    "        \"\"\"\n",
    "        return self.W.dot(self.H.T)"
   ]
  },
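  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sketch, a coordinate-wise projected gradient step for this model can be written as follows. It assumes the objective is the squared error over the nonzero (known) entries and that the constant factor 2 is absorbed into the learning rate $\\eta$. The symbols $w_i$ and $h_j$ (rows of $W$ and $H$) are notation introduced here for illustration.\n",
    "\n",
    "$$E = \\sum_{(i,j):\\, X_{ij} \\neq 0} \\left(X_{ij} - w_i^T h_j\\right)^2, \\qquad \\frac{\\partial E}{\\partial W_{ik}} = -2 \\sum_{j:\\, X_{ij} \\neq 0} \\left(X_{ij} - w_i^T h_j\\right) H_{jk}$$\n",
    "\n",
    "$$W_{ik} \\leftarrow \\max\\Big(0,\\; W_{ik} + \\eta \\sum_{j:\\, X_{ij} \\neq 0} \\left(X_{ij} - w_i^T h_j\\right) H_{jk}\\Big)$$\n",
    "\n",
    "The update for $H_{jk}$ is symmetric, with the sum running over the known entries $i$ of column $j$ and the factor $W_{ik}$ in place of $H_{jk}$. The $\\max(0, \\cdot)$ projection keeps both factors non-negative."
   ]
  },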
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In recommender systems we often add new users (rows). First we compute the factorization:\n",
    "\n",
    "$X \\approx W H$\n",
    "\n",
    "Here $H$ has shape (rank, n), i.e. it is the transpose of `H` as stored in the `NMF` class.\n",
    "\n",
    "When a new row is added to $X$, we can find a new matrix $W$ while keeping $H$ unchanged.\n",
    "\n",
    "We iterate $W \\leftarrow W \\circ \\frac{X H^T}{W H H^T}$, where the multiplication $\\circ$ and the division are element-wise."
   ]
  },
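  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Written out per entry, the update rule reads as below. This is a sketch under the assumption that the objective is the squared Frobenius error $\\lVert X - W H \\rVert_F^2$ with $H$ held fixed. The shapes $X \\in \\mathbb{R}^{m \\times n}$, $W \\in \\mathbb{R}^{m \\times r}$, $H \\in \\mathbb{R}^{r \\times n}$ and the symbols $x$, $w$ for a single new row are notation introduced here.\n",
    "\n",
    "$$W_{ik} \\leftarrow W_{ik} \\, \\frac{(X H^T)_{ik}}{(W H H^T)_{ik}}$$\n",
    "\n",
    "Row $i$ of $W$ depends only on row $i$ of $X$, so for one new user $x \\in \\mathbb{R}^{1 \\times n}$ it is enough to iterate on a single new row $w \\in \\mathbb{R}^{1 \\times r}$:\n",
    "\n",
    "$$w \\leftarrow w \\circ \\frac{x H^T}{w H H^T}$$\n",
    "\n",
    "If $x = w H$ holds exactly, the ratio equals 1 in every entry and $w$ no longer changes."
   ]
  },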
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Question 6-2-1\n",
    "Write a function that, given the matrices $X$ and $H$, computes a new matrix $W$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def nmf_fix(X, H, k, max_iter=100):\n",
    "    \"\"\"\n",
    "    :param X: matrix with descriptions of test examples.\n",
    "    :param H: matrix for a pre-built model.\n",
    "    :param k: Factorization rank\n",
    "    :param max_iter: Max. number of iterations.\n",
    "    :return:\n",
    "        W: A predicted row clustering matrix.\n",
    "    \"\"\"\n",
    "    m, N = X.shape\n",
    "    W = np.random.rand(m, k)\n",
    "\n",
    "    for itr in range(max_iter):\n",
    "        # TODO: your code here\n",
    "        pass\n",
    "    return W"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data fusion with matrix factorization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import skfusion\n",
    "from skfusion import fusion as skf\n",
    "\n",
    "def scale(X, amin, amax):\n",
    "    return (X - X.min()) / (X.max() - X.min()) * (amax - amin) + amin\n",
    "\n",
    "def rmse(y_true, y_pred):\n",
    "    return np.sqrt(np.sum((y_true - y_pred)**2) / y_true.size)"
   ]
  },
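  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, the two helpers above correspond to the formulas below. The numbers in the worked examples are hypothetical (a 1 to 5 rating scale and two made-up prediction errors), chosen only to illustrate the arithmetic.\n",
    "\n",
    "$$\\mathrm{scale}(x) = \\frac{x - x_{\\min}}{x_{\\max} - x_{\\min}} \\, (a_{\\max} - a_{\\min}) + a_{\\min}, \\qquad \\mathrm{RMSE}(y, \\hat{y}) = \\sqrt{\\frac{1}{N} \\sum_{i=1}^{N} (y_i - \\hat{y}_i)^2}$$\n",
    "\n",
    "With $x_{\\min} = 1$, $x_{\\max} = 5$ and the target interval $[0, 1]$, a rating of 4 maps to $(4 - 1) / (5 - 1) = 0.75$. Two predictions that are off by 0.1 and 0.3 give $\\mathrm{RMSE} = \\sqrt{(0.01 + 0.09) / 2} = \\sqrt{0.05} \\approx 0.224$."
   ]
  },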
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We load a subset of the movie data.\n",
    "\n",
    "$R_{12}$ -> user-movie\n",
    "\n",
    "$R_{23}$ -> movie-genre\n",
    "\n",
    "$R_{24}$ -> movie-actor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "R12_true = np.load(\"R12.npy\")\n",
    "R23 = np.load(\"R23.npy\")\n",
    "R24 = np.load(\"R24.npy\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In `R12_true`, missing values are represented by -1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "R12_true"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before continuing, we create a mask that tells us which cells do not have a rating yet. We then scale the data to values between 0 and 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "R12_true = np.ma.masked_equal(R12_true, -1)\n",
    "R12_true = scale(R12_true, 0, 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We prepare the data for evaluating the predictions: we use 90% of the known ratings and hide the rest."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "frac = 0.9\n",
    "R12 = R12_true.copy()\n",
    "# Hide each known rating with probability 1 - frac\n",
    "hidden = np.logical_and(np.random.random(R12_true.shape) > frac, ~R12_true.mask)\n",
    "R12 = np.ma.masked_where(hidden, R12)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We set up the object types and assign each of them a factorization rank."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "p = 0.05\n",
    "t1 = skf.ObjectType('User', max(int(p*R12.shape[0]), 5))\n",
    "t2 = skf.ObjectType('Movie', max(int(p*R12.shape[1]), 5))\n",
    "t3 = skf.ObjectType('Genre', max(int(p*R23.shape[1]), 5))\n",
    "t4 = skf.ObjectType('Actor', max(int(p*R24.shape[1]), 5))"
   ]
  },
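  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The rank rule used above can be summarized as follows, where $n_i$ is the number of objects of type $i$ (notation introduced here). The object counts in the example are hypothetical and are not taken from the data files.\n",
    "\n",
    "$$k_i = \\max\\left(\\lfloor p \\, n_i \\rfloor,\\; 5\\right), \\qquad p = 0.05$$\n",
    "\n",
    "For instance, 400 users would give $k = \\max(\\lfloor 0.05 \\cdot 400 \\rfloor, 5) = \\max(20, 5) = 20$, while 18 genres would give $k = \\max(\\lfloor 0.9 \\rfloor, 5) = \\max(0, 5) = 5$, so small object types fall back to the minimum rank of 5."
   ]
  },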
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We place all relations in a graph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "relations = [skf.Relation(R12, t1, t2, name='User ratings'),\n",
    "             skf.Relation(R23, t2, t3, name='Movie genres'),\n",
    "             skf.Relation(R24, t2, t4, name='Movie actors')]\n",
    "graph = skf.FusionGraph(relations)\n",
    "print('Ranks:', ''.join(['\\n{}: {}'.format(o.name, o.rank)\n",
    "                         for o in graph.object_types]))\n",
    "\n",
    "graph_small = skf.FusionGraph([skf.Relation(R12, t1, t2, name='User ratings')])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To start, we compute the error of the mean rating (mean regressor) per user, per movie, and over the whole matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "n_users, n_movies = R12_true.shape\n",
    "\n",
    "mean_user = np.mean(R12, 1)\n",
    "mean_movie = np.mean(R12, 0)\n",
    "mean_rating = np.mean(R12)\n",
    "\n",
    "# mean rating\n",
    "score = rmse(R12_true[hidden], mean_rating)\n",
    "print('RMSE(mean rating): {}'.format(score))\n",
    "\n",
    "# mean user\n",
    "R12_pred = np.tile(mean_user.reshape((n_users, 1)), (1, n_movies))\n",
    "score = rmse(R12_true[hidden], R12_pred[hidden])\n",
    "print('RMSE(mean user): {}'.format(score))\n",
    "\n",
    "# mean movie\n",
    "R12_pred = np.tile(mean_movie.reshape((1, n_movies)), (n_users, 1))\n",
    "score = rmse(R12_true[hidden], R12_pred[hidden])\n",
    "print('RMSE(mean movie): {}'.format(score))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we run the fusion using only one relation (in essence, this is a tri-factorization of a single matrix)."
   ]
  },
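  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sketch of what tri-factorization of a single matrix means here, assuming the usual formulation with one factor per object type: the sizes $n_1, n_2$ (number of users and movies) and ranks $k_1, k_2$ are notation introduced here.\n",
    "\n",
    "$$R_{12} \\approx G_1 S_{12} G_2^T, \\qquad G_1 \\in \\mathbb{R}^{n_1 \\times k_1}, \\quad S_{12} \\in \\mathbb{R}^{k_1 \\times k_2}, \\quad G_2 \\in \\mathbb{R}^{n_2 \\times k_2}$$\n",
    "\n",
    "$G_1$ describes users and $G_2$ describes movies in their respective latent spaces, while the small matrix $S_{12}$ links the two latent spaces."
   ]
  },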
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# DFMF on ratings data only (it benefits if unknown values are set to mean)\n",
    "scores = []\n",
    "for _ in range(10):\n",
    "    dfmf_fuser = skf.Dfmf(max_iter=100, init_type='random')\n",
    "    dfmf_mod = dfmf_fuser.fuse(graph_small)\n",
    "    R12_pred = dfmf_mod.complete(graph_small['User ratings'])\n",
    "    # R12_pred = scale(R12_pred, 0, 1)\n",
    "    R12_pred += np.tile(mean_user.reshape((n_users, 1)), (1, n_movies))\n",
    "    R12_pred += np.tile(mean_movie.reshape((1, n_movies)), (n_users, 1))\n",
    "    R12_pred = scale(R12_pred, 0, 1)\n",
    "    scores.append(rmse(R12_true[hidden], R12_pred[hidden]))\n",
    "print('RMSE(ratings; out-of-sample dfmf): {}'.format(np.mean(scores)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we fuse all three relations."
   ]
  },
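  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With all three relations in the graph, the model can be sketched as a set of tri-factorizations that share the movie factor. This assumes one factor $G_i$ per object type and one matrix $S_{ij}$ per relation; the indices 1 to 4 stand for users, movies, genres and actors.\n",
    "\n",
    "$$R_{12} \\approx G_1 S_{12} G_2^T, \\qquad R_{23} \\approx G_2 S_{23} G_3^T, \\qquad R_{24} \\approx G_2 S_{24} G_4^T$$\n",
    "\n",
    "Because the same $G_2$ appears in all three approximations, the genre and actor data influence the movie factor and, through it, the reconstruction of the ratings $R_{12}$."
   ]
  },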
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# DFMF (it benefits if unknown values are set to mean)\n",
    "scores = []\n",
    "for _ in range(10):\n",
    "    dfmf_fuser = skf.Dfmf(max_iter=100, init_type='random')\n",
    "    dfmf_mod = dfmf_fuser.fuse(graph)\n",
    "    R12_pred = dfmf_mod.complete(graph['User ratings'])\n",
    "    # R12_pred = scale(R12_pred, 0, 1)\n",
    "    R12_pred += np.tile(mean_user.reshape((n_users, 1)), (1, n_movies))\n",
    "    R12_pred += np.tile(mean_movie.reshape((1, n_movies)), (n_users, 1))\n",
    "    R12_pred = scale(R12_pred, 0, 1)\n",
    "    scores.append(rmse(R12_true[hidden], R12_pred[hidden]))\n",
    "print('RMSE(ratings; out-of-sample dfmf): {}'.format(np.mean(scores)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the model we can also obtain all of the $G_i$ and $S_{ij}$ matrices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "G1 = dfmf_fuser.factor(t1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for chain in dfmf_fuser.chain(t1, t2):\n",
    "    S12 = dfmf_fuser.backbone(dfmf_fuser.fusion_graph[chain[0]][chain[1]][0])\n",
    "for chain in dfmf_fuser.chain(t2, t3):\n",
    "    S23 = dfmf_fuser.backbone(dfmf_fuser.fusion_graph[chain[0]][chain[1]][0])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  },
  "nbTranslate": {
   "displayLangs": [
    "en",
    "sl"
   ],
   "hotkey": "alt-t",
   "langInMainMenu": true,
   "sourceLang": "sl",
   "targetLang": "en",
   "useGoogleTranslate": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}