DECROMAX

Categorise text series

Oct 1st, 2022 (edited)
"""
Categorise a text Series based on whether a specific string is contained within the text.
Example: 'I like big red apples' -> category 'Red Apple'
"""

import re

import pandas as pd

# example df (titles capitalised to match the printed tables)
df = pd.DataFrame({
    "first_name": ['Bruce', 'Clark', 'Bruce', 'James', 'Nanny', 'Dot'],
    "last_name": ['Lee', 'Kent', 'Banner', 'Bond', 'Mc Phee', 'Cotton'],
    "title": ['Mr', 'Mr', 'Mr', 'Mr', 'Mrs', 'Mrs'],
    "text": ["He is a Kung Fu master", "Wears capes and tight Pants", "Cocktails shaken not stirred", "angry Green man", "suspect scottish accent", "East end legend"],
    "age": [32, 33, 28, 30, 42, 80]
})

"""
  first_name last_name title                          text  age
0      Bruce       Lee    Mr        He is a Kung Fu master   32
1      Clark      Kent    Mr   Wears capes and tight Pants   33
2      Bruce    Banner    Mr  Cocktails shaken not stirred   28
3      James      Bond    Mr               angry Green man   30
4      Nanny   Mc Phee   Mrs       suspect scottish accent   42
5        Dot    Cotton   Mrs               East end legend   80
"""

"""Create category dict
keys: substrings to search for within the Series text
vals: the category assigned to each key"""

category_dict = {
    "Kung Fu": "Martial Art",
    "capes": "Clothing",
    "cocktails": "Drink",
    "green": "Colour",
    "scottish": "Scotland",
    "East": "Direction",
}

# lower-case the keys so the lookup is case-insensitive
cd = {k.lower(): v for k, v in category_dict.items()}

# extract the first matching key (case-insensitive), lower-case it,
# then map it to its category via the lower-cased dict
pattern = fr"\b({'|'.join(re.escape(k) for k in category_dict)})\b"
df['category'] = (
    df['text']
    .str.extract(pattern, flags=re.IGNORECASE)[0]
    .str.lower()
    .map(cd)
)  # add .fillna('Other') to capture unassigned rows

# outputs

"""
  first_name last_name title                          text  age     category
0      Bruce       Lee    Mr        He is a Kung Fu master   32  Martial Art
1      Clark      Kent    Mr   Wears capes and tight Pants   33     Clothing
2      Bruce    Banner    Mr  Cocktails shaken not stirred   28        Drink
3      James      Bond    Mr               angry Green man   30       Colour
4      Nanny   Mc Phee   Mrs       suspect scottish accent   42     Scotland
5        Dot    Cotton   Mrs               East end legend   80    Direction
"""
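
The paste's closing comment suggests adding .fillna('Other') for rows where no key matches. A minimal sketch of that variant follows, wrapped in a reusable function. The name `categorise` and its `default` parameter are hypothetical and not part of the original paste; escaping the keys and sorting them longest-first are added assumptions, not the author's method.

```python
import re

import pandas as pd


def categorise(text: pd.Series, mapping: dict, default: str = "Other") -> pd.Series:
    """Hypothetical helper: map each text to the category of the first key it contains."""
    # Lower-case the keys so the final lookup is case-insensitive.
    lookup = {key.lower(): value for key, value in mapping.items()}
    # Try longer keys first so a longer key wins over a shorter one that is its prefix.
    keys = sorted(lookup, key=len, reverse=True)
    # Escape each key so any regex metacharacters in it are matched literally.
    pattern = r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b"
    # expand=False returns a Series (NaN where nothing matched).
    found = text.str.extract(pattern, flags=re.IGNORECASE, expand=False)
    # Unmatched rows stay NaN through map() and are filled with the default.
    return found.str.lower().map(lookup).fillna(default)


texts = pd.Series(["He is a Kung Fu master", "angry Green man", "nothing to see"])
result = categorise(texts, {"Kung Fu": "Martial Art", "green": "Colour"})
print(result.tolist())  # ['Martial Art', 'Colour', 'Other']
```

With the example frame above, the equivalent call would be df['category'] = categorise(df['text'], category_dict).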
Tags: pandas