matrix

a guest
Jun 29th, 2017
# Assumes `frame` is a pandas DataFrame, defined elsewhere, with "Title" and "Synopsis" columns.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Collect every synopsis into a list of strings named 'corpus'
corpus = frame["Synopsis"].tolist()

# Create the tf–idf matrix
# min_df = 0.2 means that a term must appear in at least 20% of the documents
vectorizer = TfidfVectorizer(stop_words='english', min_df=0.2)
X = vectorizer.fit_transform(corpus)

k = 2  # Number of clusters into which we want to partition the data

# Cosine distance is a natural notion of distance between documents.
# Note: 'dist' is not passed to KMeans below, which clusters X with Euclidean distance.
dist = 1 - cosine_similarity(X)

# Run the k-means algorithm
model = KMeans(n_clusters=k)
model.fit(X)

no_words = 4  # Number of words to print per cluster
# For each cluster center, sort the term indices by descending tf–idf weight
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
terms = vectorizer.get_feature_names_out()
labels = model.labels_  # Cluster label assigned to each document

print("Top terms per cluster:\n")
for i in range(k):
    print("Cluster %d movies:" % i, end='')
    for title in frame["Title"][labels == i]:
        print(' %s,' % title, end='')
    print()  # End the line

    print("Cluster %d words:" % i, end='')
    for ind in order_centroids[i, :no_words]:
        print(' %s' % terms[ind], end=',')
    print()
    print()

# Output:
# Top terms per cluster:
#
# Cluster 0 movies: Mad Max: Fury Road, The Matrix, No Country for Old Men, A Beautiful Mind, Inception, Frozen, Finding Nemo, Toy Story,
# Cluster 0 words: room, tank, says, joe,
#
# Cluster 1 movies: The King's Speech, The Lion King, Aladdin, Cinderella, Robin Hood,
# Cluster 1 words: king, prince, john, palace,
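
The script computes a cosine distance matrix but fits KMeans directly on the tf–idf matrix, and KMeans uses Euclidean distance. The two are closely related here: TfidfVectorizer L2-normalizes each row by default (norm='l2'), and for unit vectors the squared Euclidean distance equals twice the cosine distance. The sketch below checks that identity numerically. It is only an illustration: the toy corpus and the names toy_corpus, X_toy, cosine_dist, squared_euclidean and identity_holds are assumptions and do not come from the paste.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# Hypothetical toy corpus; these are NOT the synopses used in the paste.
toy_corpus = [
    "the king rules the palace",
    "the prince leaves the palace",
    "a hacker escapes the simulation",
    "a thief enters a dream",
]

# With the default norm='l2', every row of the tf-idf matrix has unit length.
X_toy = TfidfVectorizer().fit_transform(toy_corpus)

# Pairwise cosine distance, as in the script, and pairwise squared Euclidean distance.
cosine_dist = 1 - cosine_similarity(X_toy)
squared_euclidean = euclidean_distances(X_toy, squared=True)

# For unit vectors a and b: ||a - b||^2 = 2 * (1 - cos(a, b))
identity_holds = np.allclose(squared_euclidean, 2 * cosine_dist, atol=1e-8)
print(identity_holds)
```

If the identity holds, ranking document pairs by Euclidean distance on these rows gives the same order as ranking them by cosine distance. That may be why the unused 'dist' matrix does not change the clustering, although the paste itself does not say so.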