Text Classification with Decision Trees

# coding: utf-8

# Text classification with decision trees
#
# Using decision trees, classify the input text according to the training set.

import re
import math
from math import log
import pprint


# In[31]:


train_data = [
("""What Are We Searching for on Mars?
Martians terrified me growing up. I remember watching the 1996 movie Mars Attacks! and fearing that the Red Planet harbored hostile alien neighbors. Though I was only 6 at the time, I was convinced life on Mars meant little green men wielding vaporizer guns. There was a time, not so long ago, when such an assumption about Mars wouldn’t have seemed so far-fetched.
Like a child watching a scary movie, people freaked out after listening to “The War of the Worlds,” the now-infamous 1938 radio drama that many listeners believed was a real report about an invading Martian army. Before humans left Earth, humanity’s sense of what—or who—might be in our galactic neighborhood was, by today’s standards, remarkably optimistic.
""",
"science"),
("""Mountains of Ice are Melting, But Don't Panic (Op-Ed)
If the planet lost the entire West Antarctic ice sheet, global sea level would rise 11 feet, threatening nearly 13 million people worldwide and affecting more than $2 trillion worth of property.
Ice loss from West Antarctica has been increasing nearly three times faster in the past decade than during the previous one — and much more quickly than scientists predicted.
This unprecedented ice loss is occurring because warm ocean water is rising from below and melting the base of the glaciers, dumping huge volumes of additional water — the equivalent of a Mt. Everest every two years — into the ocean.
""",
"science"),
("""Some scientists think we'll find signs of aliens within our lifetimes. Here's how.
Finding extraterrestrial life is the essence of science fiction. But it's not so far-fetched to predict that we might find evidence of life on a distant planet within a generation.
"With new telescopes coming online within the next five or ten years, we'll really have a chance to figure out whether we're alone in the universe," says Lisa Kaltenegger, an astronomer and director of Cornell's new Institute for Pale Blue Dots, which will search for habitable planets. "For the first time in human history, we might have the capability to do this."
""",
"science"),
("""'Magic' Mushrooms in Royal Garden: What Is Fly Agaric?
Hallucinogenic mushrooms are perhaps the last thing you'd expect to find growing in the Queen of England's garden.
Yet a type of mushroom called Amanita muscaria — commonly known as fly agaric, or fly amanita — was found growing in the gardens of Buckingham Palace by the producers of a television show, the Associated Press reported on Friday (Dec. 12).
A. muscaria is a bright red-and-white mushroom, and the fungus is psychoactive when consumed.
""",
"science"),
("""Upcoming Parks : 'Lost Corner' Finds New Life in Sandy Springs
At the corner of Brandon Mill Road, where Johnson Ferry Road turns into Dalrymple Road, tucked among 24 forested acres, sits an early 20th Century farmhouse. A vestige of Sandy Springs' past, the old home has found new life as the centerpiece of Lost Forest Preserve. While the preserve isn't slated to officially debut until some time next year, the city has opened the hiking trails to the public until construction begins on the permanent parking lot (at the moment the parking lot is a mulched area). The new park space includes community garden plots, a 4,000-foot-long hiking trail and an ADA-accessible trail through the densely wooded site. For Atlantans seeking an alternate escape to serenity (or those who dig local history), it's certainly worth a visit.
""",
"science"),
("""Stargazers across the world got a treat this weekend when the Geminids meteor shower gave the best holiday displays a run for their money.
The meteor shower is called the "Geminids" because they appear as though they are shooting out of the constellation of Gemini. The meteors are thought to be small pieces of an extinct comet called 3200 Phaethon, a dust cloud revolving around the sun. Phaethon is thought to have lost all of its gas and to be slowly breaking apart into small particles.
Earth runs into a stream of debris from 3200 Phaethon every year in mid-December, causing a shower of meteors, which hit its peak over the weekend.
""",
"science"),
("""Envisioning a River of Air
By the classification rules of the world of physics, we all know that the Earth's atmosphere is made of gas (rather than liquid, solid, or plasma). But in the world of flying it's often useful to think
""",
"science"),
("""Following Sunday's 17-7 loss to the Seattle Seahawks, the San Francisco 49ers were officially eliminated from playoff contention, and they have referee Ed Hochuli to blame. OK, so they have a lot of folks to point the finger at for their 7-7 record, but Hochuli's incorrect call is the latest and easiest scapegoat.
""",
"sport"),
("""Kobe Bryant and his teammates have an odd relationship. That makes sense: Kobe Bryant is an odd guy, and the Los Angeles Lakers are an odd team.
They’re also, for the first time this season, the proud owners of a three-game winning streak. On top of that, you may have heard, Kobe Bryant passed Michael Jordan on Sunday evening to move into third place on the NBA’s all-time scoring list.
""",
"sport"),
("""The Patriots continued their divisional dominance and are close to clinching home-field advantage throughout the AFC playoffs. Meanwhile, both the Colts and Broncos again won their division titles with head-to-head wins. The Bills' upset of the Packers delivered a big blow to Green Bay's shot at clinching home-field advantage throughout the NFC playoffs. Detroit seized on the opportunity and now leads the NFC North.
""",
"sport"),
("""If you thought the Washington Redskins secondary was humbled by another scintillating performance from New York Giants rookie wide receiver sensation Odell Beckham Jr., think again. In what is becoming a weekly occurrence, Beckham led NFL highlight reels on Sunday, collecting 12 catches for 143 yards and three touchdowns in Sunday's 24-13 victory against an NFC East rival.
""",
"sport"),
("""That was two touchdowns and 110 total yards for the three running backs. We break down the fantasy implications. The New England Patriots' rushing game has always been tough to handicap. Sunday, all three of the team's primary running backs put up numbers, and all in different ways, but it worked for the team, as the Patriots beat the Miami Dolphins, 41-13.
""",
"sport"),
("""General Santos (Philippines) (AFP) - Philippine boxing legend Manny Pacquiao vowed to chase Floyd Mayweather into ring submission after his US rival offered to fight him next year in a blockbuster world title face-off. "He (Mayweather) has reached a dead end. He has nowhere to run but to fight me," Pacquiao told AFP late Saturday, hours after the undefeated Mayweather issued the May 2 challenge on US television. The two were long-time rivals as the "best pound-for-pound" boxers of their generation, but the dream fight has never materialised to the disappointment of the boxing world.
""",
"sport"),
("""When St. John's landed Rysheed Jordan, the consensus was that he would be an excellent starter.
So far, that's half true.
Jordan came off the bench Sunday and tied a career high by scoring 24 points to lead No. 24 St. John's to a 74-53 rout of Fordham in the ECAC Holiday Festival.
''I thought Rysheed played with poise,'' Red Storm coach Steve Lavin said. ''Played with the right pace. Near perfect game.''
""",
"sport"),
("""Five-time world player of the year Marta scored three goals to lead Brazil to a 3-2 come-from-behind win over the U.S. women's soccer team in the International Tournament of Brasilia on Sunday. Carli Lloyd and Megan Rapinoe scored a goal each in the first 10 minutes to give the U.S. an early lead, but Marta netted in the 19th, 55th and 66th minutes to guarantee the hosts a spot in the final of the four-team competition.
""",
"sport")
]



p_words = re.compile(r'\w+')


def parse_line(line):
    # Tokenize a line into lowercase words.
    words = p_words.findall(line)
    return [w.lower() for w in words]


df = {}        # document frequency: in how many documents each word occurs
vocab = set()  # all words in the corpus (across all documents)
documents = []
for doc_text, label in train_data:
    words = parse_line(doc_text)
    documents.append(words)
    words_set = set(words)
    vocab.update(words_set)
    for word in words_set:
        df.setdefault(word, 0)
        df[word] += 1
# pprint.pprint(df)

# Inverse document frequency: rarer words get higher weights.
idf = {}
N = float(len(train_data))
for word, cdf in df.items():
    idf[word] = log(N / cdf)
# pprint.pprint(idf)
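# Worked example (illustrative, assuming this 15-document corpus): a word
# occurring in 2 of the 15 training documents gets idf = log(15/2) ≈ 2.01,
# while a word present in all 15 gets idf = log(15/15) = 0 and therefore
# carries no discriminative weight.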


# # Calculate TF, normalized TF, and TF * IDF

# ## This is the training process

# In[33]:


def calc_vector(cur_tf_idf, vocabular):
    # Build a dense vector over the whole vocabulary; words missing
    # from this document get weight 0.
    vec = []
    for word in vocabular:
        tf_idf = cur_tf_idf.get(word, 0)
        vec.append(tf_idf)
    return vec

# In[34]:


def proccess_document(doc, idf, vocabular):
    # Turn a document (raw string or token list) into a TF-IDF vector.
    if isinstance(doc, str):
        words = parse_line(doc)
    else:
        words = doc
    f = {}  # how many times each word occurs in this document
    for word in words:
        f.setdefault(word, 0)
        f[word] += 1
    max_f = max(f.values())  # frequency of the most common word in this document
    tf_idf = {}
    for word, cnt in f.items():
        ctf = cnt * 1.0 / max_f  # normalized term frequency
        tf_idf[word] = ctf * idf.get(word, 0)  # words unseen in training get IDF 0
    vec = calc_vector(tf_idf, vocabular)
    return vec

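# Worked example (illustrative, hypothetical counts): in a document whose
# most frequent word appears 4 times, a word seen twice gets a normalized
# TF of 2/4 = 0.5, so its weight in the vector is 0.5 * idf[word].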

# In[35]:


# Vectorize every training document.
doc_vectors = []
for words in documents:
    vec = proccess_document(words, idf, vocab)
    doc_vectors.append(vec)
# print(len(vocab))
# print(len(doc_vectors[0]))

# In[37]:


def cosine_similarity(v1, v2):
    """Compute cosine similarity of v1 and v2: (v1 . v2) / (||v1|| * ||v2||)."""
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]
        y = v2[i]
        sumxx += x * x
        sumyy += y * y
        sumxy += x * y
    if sumxx == 0 or sumyy == 0:
        return 0.0  # guard: a zero vector has no direction
    return sumxy / math.sqrt(sumxx * sumyy)

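# Minimal sanity check (illustrative vectors, not from the corpus):
# parallel vectors give 1.0, orthogonal vectors give 0.0.
# cosine_similarity([1, 2], [2, 4])  # -> 1.0
# cosine_similarity([1, 0], [0, 1])  # -> 0.0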
# # Calculate the similarities between documents

# In[38]:


# Note: cosine_similarity is a similarity, not a distance;
# higher values mean more similar documents.
distances = {}
for i in range(len(train_data) - 1):
    for j in range(i + 1, len(train_data)):
        v1 = doc_vectors[i]
        v2 = doc_vectors[j]
        dist = cosine_similarity(v1, v2)
        distances[(i, j)] = dist

# # Print the most similar document pairs

# In[40]:


sort_dist = sorted(distances.items(), key=lambda x: x[1], reverse=True)
# print(sort_dist)
# for (i, j), d in sort_dist[:5]:
#     print('SIMILARITY:', d, 'TEXT 1:', train_data[i][0], 'TEXT 2:', train_data[j][0])
#     print('\n' * 4)

# In[42]:


def rank_documents(doc, idf, vocabular, doc_vectors):
    # Vectorize the query, then score it against every training document.
    query_vec = proccess_document(doc, idf, vocabular)
    similarities = []
    for i, doc_vec in enumerate(doc_vectors):
        dist = cosine_similarity(query_vec, doc_vec)
        similarities.append((dist, i))
    similarities.sort(reverse=True)  # most similar first
    return similarities

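# Example usage (illustrative): the three most similar training documents
# to an ad-hoc query, as (similarity, document_index) pairs.
# rank_documents('meteor shower over Gemini', idf, vocab, doc_vectors)[:3]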
# In[43]:


def print_top(query, idf, vocab, doc_vectors, topN=5):
    res = rank_documents(query, idf, vocab, doc_vectors)
    for dist, i in res[:topN]:
        print(dist, 'DOCUMENT:', train_data[i][0])

# In[46]:


# query = 'can we find a new galaxy in space'
# print_top(query, idf, vocab, doc_vectors)


# In[47]:


# print_top('Messi is a good sportsman. He and his teammates won the Liga competition', idf, vocab, doc_vectors)

# In[48]:


def classify(query, idf, vocab, doc_vectors, topN=5):
    # k-nearest-neighbour vote over cosine similarity: the topN most
    # similar training documents vote with their labels.
    # (Note: this name is redefined below by the decision-tree classifier.)
    res = rank_documents(query, idf, vocab, doc_vectors)
    labels = {}
    for dist, i in res[:topN]:
        label = train_data[i][1]
        labels.setdefault(label, 0)
        labels[label] += 1
    results = sorted(labels.items(), key=lambda x: x[1], reverse=True)
    final_label = results[0][0]
    prob = results[0][1] / topN
    return final_label, prob

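# Vote arithmetic (illustrative): with topN=5, if 4 of the 5 nearest
# documents are labeled 'sport', the result is ('sport', 4/5 = 0.8).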
# In[49]:


print(classify('Messi is a good sportsman. He and his teammates won the Liga competition', idf, vocab, doc_vectors))


# In[50]:


print(classify('can we find a new galaxy in space', idf, vocab, doc_vectors))

# In[52]:


# Append each document's label as the last column, producing rows of the
# form [tf_idf_1, ..., tf_idf_V, label] for the decision tree.
# (Note: this mutates the vectors in doc_vectors in place.)
dataset = []
for i, doc_vec in enumerate(doc_vectors):
    doc_vec.append(train_data[i][1])
    dataset.append(doc_vec)

# print(dataset)

# In[61]:


def uniquecounts(rows):
    # Count the rows of each class; the label is the last column.
    d = {}
    for r in rows:
        d.setdefault(r[-1], 0)
        d[r[-1]] += 1
    return d

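# Example (illustrative rows; the class label is the last column):
# uniquecounts([[0.1, 'science'], [0.3, 'sport'], [0.2, 'science']])
# -> {'science': 2, 'sport': 1}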

class decisionnode(object):
    def __init__(self, col=-1, value=None, results=None, tb=None, fb=None):
        self.col = col          # index of the column tested at this node
        self.value = value      # value the column is compared against
        self.results = results  # class counts (set only for leaf nodes)
        self.tb = tb            # branch followed when the test is true
        self.fb = fb            # branch followed when the test is false

def sporedi_broj(value1, value2):
    # Numeric comparison: the "true" branch is taken when value1 >= value2.
    return value1 >= value2


def sporedi_string(value1, value2):
    # String comparison: the "true" branch is taken on exact equality.
    return value1 == value2


def get_compare_func(value):
    # Pick the comparison function appropriate for the value's type.
    if isinstance(value, (int, float)):
        comparer = sporedi_broj
    else:
        comparer = sporedi_string
    return comparer


def compare_values(v1, v2):
    sporedi = get_compare_func(v1)
    return sporedi(v1, v2)

def divideset(rows, column, value):
    # Split rows into two sets: those whose value in `column` passes the
    # comparison against `value`, and those that fail it.
    sporedi = get_compare_func(value)
    set_false = []
    set_true = []
    for row in rows:
        uslov = sporedi(row[column], value)  # uslov = condition
        if uslov:
            set_true.append(row)
        else:
            set_false.append(row)
    return (set_true, set_false)

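# Example (illustrative): splitting on column 0 against the value 0.5,
# the numeric comparison is >=, so:
# divideset([[0.7, 'science'], [0.2, 'sport']], 0, 0.5)
# -> ([[0.7, 'science']], [[0.2, 'sport']])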
def entropy(rows):
    # Shannon entropy (in bits) of the class distribution over rows.
    results = uniquecounts(rows)
    ent = 0.0
    n = len(rows)
    for label, cnt in results.items():
        p = float(cnt) / n
        ent -= p * math.log(p, 2)
    return ent

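# Worked example (illustrative): an even two-class split is maximally impure:
# entropy([[1, 'science'], [2, 'sport']])
# = -(0.5 * log2(0.5)) - (0.5 * log2(0.5)) = 1.0 bit,
# while a pure set (all rows the same class) has entropy 0.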
def info_gain(current_score, sets, scoref=entropy):
    # Information gain: the parent's impurity minus the size-weighted
    # impurity of the child sets.
    m = sum([len(s) for s in sets])
    gain = current_score
    for s in sets:
        n = len(s)
        p = 1. * n / m
        gain -= p * scoref(s)
    return gain

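# Worked example (illustrative): a parent with entropy 1.0 split into two
# pure halves recovers the full bit:
# info_gain(1.0, ([[1, 'a'], [1, 'a']], [[2, 'b'], [2, 'b']]))  # -> 1.0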
def buildtree(rows, scoref=entropy):
    if len(rows) == 0:
        return decisionnode()
    current_score = scoref(rows)

    # Track the best split criterion found so far
    best_gain = 0.0
    best_column = -1
    best_value = None
    best_subsetf = None
    best_subsett = None

    column_count = len(rows[0]) - 1  # the last column is the label
    for col in range(column_count):
        # Candidate split values: the distinct values in this column
        column_values = set([row[col] for row in rows])
        # Try dividing the rows on each value in this column
        for value in column_values:
            sets = divideset(rows, col, value)
            gain = info_gain(current_score, sets, scoref)
            # Accept the split only if it improves the gain and
            # both branches are non-empty
            if gain > best_gain and len(sets[0]) > 0 and len(sets[1]) > 0:
                best_gain = gain
                best_column = col
                best_value = value
                best_subsett = sets[0]
                best_subsetf = sets[1]

    # Create the subbranches recursively
    if best_gain > 0:
        trueBranch = buildtree(best_subsett, scoref)
        falseBranch = buildtree(best_subsetf, scoref)
        return decisionnode(col=best_column, value=best_value,
                            tb=trueBranch, fb=falseBranch)
    else:
        # Terminal node: store the class counts
        return decisionnode(results=uniquecounts(rows))

def printtree(tree, indent=''):
    # Is this a leaf node?
    if tree.results is not None:
        print(indent + str(sorted(tree.results.items())))
    else:
        # Print the criterion
        print(indent + str(tree.col) + ':' + str(tree.value) + '? ')
        # Print the branches
        print(indent + 'T->')
        printtree(tree.tb, indent + '  ')
        print(indent + 'F->')
        printtree(tree.fb, indent + '  ')

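# Toy example (illustrative, assuming a 2-column dataset with the label last):
# toy = [[0.9, 'science'], [0.8, 'science'], [0.1, 'sport'], [0.2, 'sport']]
# printtree(buildtree(toy))
# would print roughly:
# 0:0.8?
# T->
#   [('science', 2)]
# F->
#   [('sport', 2)]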
def classify(observation, tree):
    # Walk the tree until a leaf is reached, then return (label, confidence).
    if tree.results is not None:
        if len(tree.results) == 1:
            return list(tree.results.keys())[0], 1.0
        else:
            # Mixed leaf: return the majority label and its share of the counts.
            inv = sorted([(cnt, label) for label, cnt in tree.results.items()], reverse=True)
            label = inv[0][1]
            cnt = inv[0][0]
            total_count = sum(tree.results.values())
            return label, cnt / total_count
    else:
        vrednost = observation[tree.col]  # vrednost = the observed value in the tested column
        if compare_values(vrednost, tree.value):
            branch = tree.tb
        else:
            branch = tree.fb
        return classify(observation, branch)


def classify_text(doc, tree):
    # Vectorize the text with the training IDF and vocabulary,
    # then classify the vector with the decision tree.
    query_vec = proccess_document(doc, idf, vocab)
    return classify(query_vec, tree)


tree = buildtree(dataset)
# printtree(tree)
text = input()
print(classify_text(text, tree))
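# Example session (illustrative; the exact label and confidence depend on
# the learned tree):
#   input:  The quarterback threw three touchdowns in the playoff game
#   output: ('sport', 1.0)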