Web Scraping & Data Parsing with Python

###########################################
# Required libs to make the tutorial work #
##########################################
sudo apt-get install -y python-pip python-virtualenv python-lxml libxslt1-dev libxml2 libxml2-dev python-bs4
pip install beautifulsoup4
pip install lxml
pip install cssselect
pip install requests
virtualenv venv
source venv/bin/activate


#######################
# Regular Expressions #
#######################
References:
http://www.pythonforbeginners.com/regex/regular-expressions-in-python
http://www.tutorialspoint.com/python/python_reg_expressions.htm


re.findall - finds every occurrence of whatever you are searching for
---------------------------------------------------------------------
python
import re
str = 'purple [email protected], blah monkey [email protected] blah dishwasher'
emails = re.findall(r'[\w\.-]+@[\w\.-]+', str)
for email in emails:
	# do something with each found email string
	print email


re.search - This function searches for first occurrence a pattern within a string.
----------------------------------------------------------------------------------
python
import re
str = 'an example word:cat!!'

match = re.search(r'word:www', str)

# If-statement after search() tests if it succeeded
if match:
	print 'found', match.group()

else:
	print 'did not find'  ## Hit enter 2 times


str = 'an example word:cat!!'

match = re.search(r'word:cat', str)

# If-statement after search() tests if it succeeded
if match:
	print 'found', match.group()

else:
	print 'did not find'  ## Hit enter 2 times


re.sub - replaces all occurrences of the pattern in the string
--------------------------------------------------------------
import re
text = "Python for beginner is a very cool website"
pattern = re.sub("cool", "good", text)
print pattern


re.compile - replaces all occurrences of the pattern in the string
--------------------------------------------------------------
vi namecheck.py


#!/usr/bin/python
import re

name_check = re.compile(r"[^A-Za-zs.]")

name = raw_input ("Please, enter your name: ")

while name_check.search(name):
    print "Please enter your name correctly!"
    name = raw_input ("Please, enter your name: ")


Quick website connect via URLLib2 and IP address parse via re.compile
---------------------------------------------------------------------
vi getip.py


#!/usr/bin/env python
import re
import urllib2

def getIP():
    ip_checker_url = "http://checkip.dyndns.org/"
    address_regexp = re.compile ('\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
    response = urllib2.urlopen(ip_checker_url).read()
    result = address_regexp.search(response)

    if result:
            return result.group()
    else:
            return None


################
# Web Scraping #
################

Reference:
http://www.pythonforbeginners.com/python-on-the-web/web-scraping-with-beautifulsoup/


vi urlget.py
----------------------------------------------------
#!/usr/bin/env python
from bs4 import BeautifulSoup

import requests

url = raw_input("Enter a website to extract the URL's from: ")

r  = requests.get("http://" +url)

data = r.text

soup = BeautifulSoup(data)

for link in soup.find_all('a'):
    print(link.get('href'))

-----------------------------------------------------


Reference:
http://www.pythonforbeginners.com/beautifulsoup/beautifulsoup-4-python


-----------------------------------------------------

Reference:
http://www.pythonforbeginners.com/beautifulsoup/scraping-websites-with-beautifulsoup


-----------------------------------------------------

Reference:
http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/


-----------------------------------------------------

Reference:
http://www.pythonforbeginners.com/api/python-api-and-json


-----------------------------------------------------
Reference:
http://www.pythonforbeginners.com/scripts/using-the-youtube-api/


-----------------------------------------------------
Reference:
http://www.pythonforbeginners.com/code-snippets-source-code/imdb-crawler


#############
# XML Intro #
#############
Basics:
http://effbot.org/zone/element-index.htm
http://lxml.de/tutorial.html
http://lxml.de/parsing.html


Reference:
http://www.blog.pythonlibrary.org/2010/11/12/python-parsing-xml-with-minidom/

vi example.xml
-------------------------------
<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.</description>
   </book>
   <book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology
      society in England, the young survivors lay the
      foundation for a new society.</description>
   </book>
   <book id="bk104">
      <author>Corets, Eva</author>
      <title>Oberon's Legacy</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-03-10</publish_date>
      <description>In post-apocalypse England, the mysterious
      agent known only as Oberon helps to create a new life
      for the inhabitants of London. Sequel to Maeve
      Ascendant.</description>
   </book>
   <book id="bk105">
      <author>Corets, Eva</author>
      <title>The Sundered Grail</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-09-10</publish_date>
      <description>The two daughters of Maeve, half-sisters,
      battle one another for control of England. Sequel to
      Oberon's Legacy.</description>
   </book>
   <book id="bk106">
      <author>Randall, Cynthia</author>
      <title>Lover Birds</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-09-02</publish_date>
      <description>When Carla meets Paul at an ornithology
      conference, tempers fly as feathers get ruffled.</description>
   </book>
   <book id="bk107">
      <author>Thurman, Paula</author>
      <title>Splish Splash</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>A deep sea diver finds true love twenty
      thousand leagues beneath the sea.</description>
   </book>
   <book id="bk108">
      <author>Knorr, Stefan</author>
      <title>Creepy Crawlies</title>
      <genre>Horror</genre>
      <price>4.95</price>
      <publish_date>2000-12-06</publish_date>
      <description>An anthology of horror stories about roaches,
      centipedes, scorpions  and other insects.</description>
   </book>
   <book id="bk109">
      <author>Kress, Peter</author>
      <title>Paradox Lost</title>
      <genre>Science Fiction</genre>
      <price>6.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems
      of being quantum.</description>
   </book>
   <book id="bk110">
      <author>O'Brien, Tim</author>
      <title>Microsoft .NET: The Programming Bible</title>
      <genre>Computer</genre>
      <price>36.95</price>
      <publish_date>2000-12-09</publish_date>
      <description>Microsoft's .NET initiative is explored in
      detail in this deep programmer's reference.</description>
   </book>
   <book id="bk111">
      <author>O'Brien, Tim</author>
      <title>MSXML3: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>36.95</price>
      <publish_date>2000-12-01</publish_date>
      <description>The Microsoft MSXML3 parser is covered in
      detail, with attention to XML DOM interfaces, XSLT processing,
      SAX and more.</description>
   </book>
   <book id="bk112">
      <author>Galos, Mike</author>
      <title>Visual Studio 7: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>49.95</price>
      <publish_date>2001-04-16</publish_date>
      <description>Microsoft Visual Studio 7 is explored in depth,
      looking at how Visual Basic, Visual C++, C#, and ASP+ are
      integrated into a comprehensive development
      environment.</description>
   </book>
</catalog>

-------------------------------


vi xmlparse.py
-------------------------------
#!/usr/bin/python
import xml.dom.minidom as minidom


def getTitles(xml):
    """
    Print out all titles found in xml
    """
    doc = minidom.parse(xml)
    node = doc.documentElement
    books = doc.getElementsByTagName("book")

    titles = []
    for book in books:
        titleObj = book.getElementsByTagName("title")[0]
        titles.append(titleObj)

    for title in titles:
        nodes = title.childNodes
        for node in nodes:
            if node.nodeType == node.TEXT_NODE:
                print node.data

if __name__ == "__main__":
    document = 'example.xml'
    getTitles(document)


This code is just one short function that accepts one argument, the XML file. We import the minidom module and give it the same name to make it easier to reference. Then we parse the XML. The first two lines in the function are pretty much the same as the previous example. We use getElementsByTagName to grab the parts of the XML that we want, then iterate over the result and extract the book titles from them. This actually extracts title objects, so we need to iterate over that as well and pull out the plain text, which is what the second nested for loop is for.


-------------------------------
#####################
# Python and SQLite #
#####################
http://zetcode.com/db/sqlitepythontutorial/


####################
# Python and MySQL #
####################
Reference:
http://zetcode.com/db/mysqlpython/

################################
# Lesson 16: Parsing XML Files #
################################

/---------------------------------------------------/
--------------------PARSING XML FILES----------------
/---------------------------------------------------/


Type the following commands:
---------------------------------------------------------------------------------------------------------

wget https://s3.amazonaws.com/SecureNinja/Python/samplescan.xml

wget https://s3.amazonaws.com/SecureNinja/Python/application.xml

wget https://s3.amazonaws.com/SecureNinja/Python/security.xml

wget https://s3.amazonaws.com/SecureNinja/Python/system.xml

wget https://s3.amazonaws.com/SecureNinja/Python/sc_xml.xml


-------------TASK 1------------
vi readxml1.py

#!/usr/bin/python
from xmllib import attributes
from xml.dom.minidom import toxml
from xml.dom.minidom import firstChild
from xml.dom import minidom
xmldoc = minidom.parse('sc_xml.xml')
grandNode = xmldoc.firstChild
nodes = grandNode.getElementsByTagName('host')
count = 0

for node in nodes:
    os = node.getElementsByTagName('os')[0]
    osclasses = os.getElementsByTagName('osclass')
    for osclass in osclasses:
        if osclass.attributes['osfamily'].value == 'Windows' and osclass.attributes['osgen'].value == 'XP':
            try:
                print '%-8s: %s -> %-8s: %s' % ('Host',node.getElementsByTagName('hostnames')[0].getElementsByTagName('hostname')[0].attributes['name'].value,'OS',os.getElementsByTagName('osmatch')[0].attributes['name'].value)
            except:
                print '%-8s: %s -> %-8s: %s' % ('Host','Unable to find Hostname','OS',os.getElementsByTagName('osmatch')[0].attributes['name'].value)


-------------TASK 2------------
vi readxml2.py

#!/usr/bin/python
from xmllib import attributes
from xml.dom.minidom import toxml
from xml.dom.minidom import firstChild
from xml.dom import minidom
xmldoc = minidom.parse('sc_xml.xml')
grandNode = xmldoc.firstChild
nodes = grandNode.getElementsByTagName('host')
count = 0
for node in nodes:
    portsNode = node.getElementsByTagName('ports')[0]
    ports = portsNode.getElementsByTagName('port')
    for port in ports:
        if port.attributes['portid'].value == '22' and port.attributes['protocol'].value == 'tcp':
            state = port.getElementsByTagName('state')[0]
            if state.attributes['state'].value == 'open':
                try:
                    print '%-8s: %s -> %-8s: %s' % ('Host',node.getElementsByTagName('hostnames')[0].getElementsByTagName('hostname')[0].attributes['name'].value,'Ports','open : tcp : 22')
                except:
                    print '%-8s: %s -> %-8s: %s' % ('Host','Unable to find Hostname','Ports','open : tcp : 22')


-------------TASK 3------------
vi readxml3.py

#!/usr/bin/python
from xmllib import attributes
from xml.dom.minidom import toxml
from xml.dom.minidom import firstChild
from xml.dom import minidom
xmldoc = minidom.parse('sc_xml.xml')
grandNode = xmldoc.firstChild
nodes = grandNode.getElementsByTagName('host')
count = 0
for node in nodes:
    portsNode = node.getElementsByTagName('ports')[0]
    ports = portsNode.getElementsByTagName('port')
    flag = 0
    for port in ports:
        if flag == 0:
            if port.attributes['protocol'].value == 'tcp' and (port.attributes['portid'].value == '443' or port.attributes['portid'].value == '80'):
                state = port.getElementsByTagName('state')[0]
                if state.attributes['state'].value == 'open':
                    try:
                        print '%-8s: %s -> %-8s: %s' % ('Host',node.getElementsByTagName('hostnames')[0].getElementsByTagName('hostname')[0].attributes['name'].value,'Ports','open : tcp : '+port.attributes['portid'].value)
                    except:
                        print '%-8s: %s -> %-8s: %s' % ('Host','Unable to find Hostname','Ports','open : tcp : '+port.attributes['portid'].value)
                    flag = 1


-------------TASK 4------------
vi readxml4.py

#!/usr/bin/python
from xmllib import attributes
from xml.dom.minidom import toxml
from xml.dom.minidom import firstChild
from xml.dom import minidom
xmldoc = minidom.parse('sc_xml.xml')
grandNode = xmldoc.firstChild
nodes = grandNode.getElementsByTagName('host')
count = 0
for node in nodes:
    flag = 0
    naddress = ''
    addresses = node.getElementsByTagName('address')
    for address in addresses:
        if address.attributes['addrtype'].value == 'ipv4' and address.attributes['addr'].value[0:6] == '10.57.':
            naddress = address.attributes['addr'].value
            flag = 1
    if flag == 1:
        portsNode = node.getElementsByTagName('ports')[0];
        ports = portsNode.getElementsByTagName('port')
        flag = 0
        for port in ports:
                status = {}
                if port.attributes['protocol'].value == 'tcp' and port.attributes['portid'].value[0:2] == '22':
                    state = port.getElementsByTagName('state')[0]
                    if "open" in state.attributes['state'].value:
                        status[0] = state.attributes['state'].value
                        status[1] = port.attributes['portid'].value
                        flag = 1
                else:
                    flag = 0
                if port.attributes['protocol'].value == 'tcp' and flag == 1:
                    if port.attributes['portid'].value == '80' or port.attributes['portid'].value == '443':
                        state = port.getElementsByTagName('state')[0]
                        if state.attributes['state'].value == 'open':
                            flag = 0
                            try:
                                print '%-8s: %s -> %-8s: %s -> %-8s: %s' % ('Host',node.getElementsByTagName('hostnames')[0].getElementsByTagName('hostname')[0].attributes['name'].value,'IP',naddress,'Ports',status[0]+' : tcp : '+status[1]+' and open : tcp : '+port.attributes['portid'].value)
                            except:
                                print '%-8s: %s -> %-8s: %s -> %-8s: %s' % ('Host','Unable to find Hostname','IP',naddress,'Ports',status[0]+' : tcp : '+status[1]+' and open : tcp : '+port.attributes['portid'].value)


################################
# Lesson 17: Parsing EVTX Logs #
################################
/---------------------------------------------------/
--------------------PARSING EVTX FILES----------------
/---------------------------------------------------/


Type the following commands:
---------------------------------------------------------------------------------------------------------

wget https://s3.amazonaws.com/SecureNinja/Python/Program-Inventory.evtx

wget https://s3.amazonaws.com/SecureNinja/Python/WIN-M751BADISCT_Application.evtx

wget https://s3.amazonaws.com/SecureNinja/Python/WIN-M751BADISCT_Security.evtx

wget https://s3.amazonaws.com/SecureNinja/Python/WIN-M751BADISCT_System.evtx


-------------TASK 1------------
vi readevtx1.py

import mmap
import re
import contextlib
import sys
import operator
import HTMLParser
from xml.dom import minidom
from operator import itemgetter, attrgetter

from Evtx.Evtx import FileHeader
from Evtx.Views import evtx_file_xml_view

pars = HTMLParser.HTMLParser()
print pars.unescape('<Data Name="MaxPasswordAge">&amp;12856;"</Data>')
file_name = str(raw_input('Enter EVTX file name without extension : '))
file_name = 'WIN-M751BADISCT_System'
with open(file_name+'.evtx', 'r') as f:
    with contextlib.closing(mmap.mmap(f.fileno(), 0,
                                      access=mmap.ACCESS_READ)) as buf:
        fh = FileHeader(buf, 0x0)
        xml_file = "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\" ?><Events>"
        try:
            for xml, record in evtx_file_xml_view(fh):
                xml_file += xml
        except:
            pass
        xml_file += "</Events>"
xml_file = re.sub('<NULL>', '<NULL></NULL>', xml_file)
xml_file = re.sub('<local>', '<local></local>', xml_file)
xml_file = re.sub('&amp;', '&amp;', xml_file)
f = open(file_name+'.xml', 'w')
f.write(xml_file)
f.close()
try:
    xmldoc = minidom.parse(file_name+'.xml')
except:
    sys.exit('Invalid file...')
grandNode = xmldoc.firstChild
nodes = grandNode.getElementsByTagName('Event')


event_num = int(raw_input('How many events you want to show : '))
length = int(len(nodes)) - 1
event_id = 0
if event_num > length:
    sys.exit('You have entered an ivalid num...')
while True:
    if event_num > 0 and length > -1:
        try:
            event_id = nodes[length].getElementsByTagName('EventID')[0].childNodes[0].nodeValue
            try:
                print '%-8s: %s - %-8s: %s' % ('Event ID',event_id,'Event',node.getElementsByTagName('string')[1].childNodes[0].nodeValue)
            except:
                print '%-8s: %s - %-8s: %s' % ('Event ID',event_id,'Event','Name not found')
            event_num -= 1
            length -= 1
        except:
            length -= 1
    else:
        sys.exit('...Search Complete...')


-------------TASK 2------------
vi readevtx2.py

import mmap
import re
import contextlib
import sys
import operator
import HTMLParser
from xml.dom import minidom
from operator import itemgetter, attrgetter

from Evtx.Evtx import FileHeader
from Evtx.Views import evtx_file_xml_view

pars = HTMLParser.HTMLParser()
print pars.unescape('<Data Name="MaxPasswordAge">&amp;12856;"</Data>')
file_name = str(raw_input('Enter EVTX file name without extension : '))
file_name = 'WIN-M751BADISCT_System'
with open(file_name+'.evtx', 'r') as f:
    with contextlib.closing(mmap.mmap(f.fileno(), 0,
                                      access=mmap.ACCESS_READ)) as buf:
        fh = FileHeader(buf, 0x0)
        xml_file = "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\" ?><Events>"
        try:
            for xml, record in evtx_file_xml_view(fh):
                xml_file += xml
        except:
            pass
        xml_file += "</Events>"
xml_file = re.sub('<NULL>', '<NULL></NULL>', xml_file)
xml_file = re.sub('<local>', '<local></local>', xml_file)
xml_file = re.sub('&amp;', '&amp;', xml_file)
f = open(file_name+'.xml', 'w')
f.write(xml_file)
f.close()
try:
    xmldoc = minidom.parse(file_name+'.xml')
except:
    sys.exit('Invalid file...')
grandNode = xmldoc.firstChild
nodes = grandNode.getElementsByTagName('Event')

event = int(raw_input('Enter Event ID : '))
event_id = 0
for node in nodes:
    try:
        event_id = node.getElementsByTagName('EventID')[0].childNodes[0].nodeValue
        if int(event_id) == event:
            try:
                print '%-8s: %s - %-8s: %s' % ('Event ID',event_id,'Event',node.getElementsByTagName('string')[1].childNodes[0].nodeValue)
            except:
                print '%-8s: %s - %-8s: %s' % ('Event ID',event_id,'Event','Name not found')
    except:
        continue
sys.exit('...Search Complete...')


-------------TASK 3------------
vi readevtx3.py

import mmap
import re
import contextlib
import sys
import operator
import HTMLParser
from xml.dom import minidom
from operator import itemgetter, attrgetter

from Evtx.Evtx import FileHeader
from Evtx.Views import evtx_file_xml_view

pars = HTMLParser.HTMLParser()
print pars.unescape('<Data Name="MaxPasswordAge">&amp;12856;"</Data>')
file_name = str(raw_input('Enter EVTX file name without extension : '))
file_name = 'WIN-M751BADISCT_System'
with open(file_name+'.evtx', 'r') as f:
    with contextlib.closing(mmap.mmap(f.fileno(), 0,
                                      access=mmap.ACCESS_READ)) as buf:
        fh = FileHeader(buf, 0x0)
        xml_file = "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\" ?><Events>"
        try:
            for xml, record in evtx_file_xml_view(fh):
                xml_file += xml
        except:
            pass
        xml_file += "</Events>"
xml_file = re.sub('<NULL>', '<NULL></NULL>', xml_file)
xml_file = re.sub('<local>', '<local></local>', xml_file)
xml_file = re.sub('&amp;', '&amp;', xml_file)
f = open(file_name+'.xml', 'w')
f.write(xml_file)
f.close()
try:
    xmldoc = minidom.parse(file_name+'.xml')
except:
    sys.exit('Invalid file...')
grandNode = xmldoc.firstChild
nodes = grandNode.getElementsByTagName('Event')

event = int(raw_input('Enter Event ID : '))
event_id = 0
event_count = 0;
for node in nodes:
    try:
        event_id = node.getElementsByTagName('EventID')[0].childNodes[0].nodeValue
        if int(event_id) == event:
            event_count += 1
    except:
        continue
print '%-8s: %s - %-8s: %s' % ('Event ID',event,'Count',event_count)
sys.exit('...Search Complete...')


-------------TASK 4------------
vi readevtx4.py

import mmap
import re
import contextlib
import sys
import operator
import HTMLParser
from xml.dom import minidom
from operator import itemgetter, attrgetter

from Evtx.Evtx import FileHeader
from Evtx.Views import evtx_file_xml_view

pars = HTMLParser.HTMLParser()
print pars.unescape('<Data Name="MaxPasswordAge">&amp;12856;"</Data>')
file_name = str(raw_input('Enter EVTX file name without extension : '))
file_name = 'WIN-M751BADISCT_System'
with open(file_name+'.evtx', 'r') as f:
    with contextlib.closing(mmap.mmap(f.fileno(), 0,
                                      access=mmap.ACCESS_READ)) as buf:
        fh = FileHeader(buf, 0x0)
        xml_file = "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\" ?><Events>"
        try:
            for xml, record in evtx_file_xml_view(fh):
                xml_file += xml
        except:
            pass
        xml_file += "</Events>"
xml_file = re.sub('<NULL>', '<NULL></NULL>', xml_file)
xml_file = re.sub('<local>', '<local></local>', xml_file)
xml_file = re.sub('&amp;', '&amp;', xml_file)
f = open(file_name+'.xml', 'w')
f.write(xml_file)
f.close()
try:
    xmldoc = minidom.parse(file_name+'.xml')
except:
    sys.exit('Invalid file...')
grandNode = xmldoc.firstChild
nodes = grandNode.getElementsByTagName('Event')

events = []
event_id = 0
count = 0
for node in nodes:
    try:
        event_id = node.getElementsByTagName('EventID')[0].childNodes[0].nodeValue
        try:
            events.append({'event_id' : int(event_id), 'event_name' : node.getElementsByTagName('string')[1].childNodes[0].nodeValue})
        except:
            events.append({'event_id' : int(event_id), 'event_name' : 'Name not found...'})
        count += 1
    except:
        continue
events = sorted(events, key=itemgetter('event_id'))
for e in events:
    print e
sys.exit('...Search Complete...')


############################
# XML to MySQL with Python #
############################
http://programmazioneit.altervista.org/Programmazione_script/Python/How-to-insert-XML-data-into-MYSQL-through-PYTHON.php
http://stackoverflow.com/questions/10128921/xml-to-mysql-using-python
http://stackoverflow.com/questions/15784208/storing-data-from-xml-to-database-using-python


#################
# Numpy & Scipy #
################
http://www.engr.ucsb.edu/~shell/che210d/numpy.pdf


#################
# Data Analysis #
#################
Reference:
http://www.randalolson.com/2012/08/06/statistical-analysis-made-easy-in-python/

Reference (Lesson 1):
http://nbviewer.ipython.org/urls/bitbucket.org/hrojas/learn-pandas/raw/master/lessons/01%20-%20Lesson.ipynb

Reference (Lesson 2):
http://nbviewer.ipython.org/urls/bitbucket.org/hrojas/learn-pandas/raw/master/lessons/02%20-%20Lesson.ipynb

Reference (Lesson 3):
http://nbviewer.ipython.org/urls/bitbucket.org/hrojas/learn-pandas/raw/master/lessons/03%20-%20Lesson.ipynb

Reference (Lesson 4):
http://nbviewer.ipython.org/urls/bitbucket.org/hrojas/learn-pandas/raw/master/lessons/04%20-%20Lesson.ipynb

Reference (Lesson 5):
http://nbviewer.ipython.org/urls/bitbucket.org/hrojas/learn-pandas/raw/master/lessons/05%20-%20Lesson.ipynb

Reference (Lesson 6):
http://nbviewer.ipython.org/urls/bitbucket.org/hrojas/learn-pandas/raw/master/lessons/06%20-%20Lesson.ipynb

Reference (Lesson 7):
http://nbviewer.ipython.org/urls/bitbucket.org/hrojas/learn-pandas/raw/master/lessons/07%20-%20Lesson.ipynb

Reference (Lesson 8):
http://nbviewer.ipython.org/urls/bitbucket.org/hrojas/learn-pandas/raw/master/lessons/08%20-%20Lesson.ipynb

Reference (Lesson 9):
http://nbviewer.ipython.org/urls/bitbucket.org/hrojas/learn-pandas/raw/master/lessons/09%20-%20Lesson.ipynb

Reference (Lesson 10):
http://nbviewer.ipython.org/urls/bitbucket.org/hrojas/learn-pandas/raw/master/lessons/10%20-%20Lesson.ipynb


##################################
# Data Visualization with Python #
##################################
Reference:
http://jakevdp.github.io/mpl_tutorial/


Reference:
http://machinelearningmastery.com/machine-learning-in-python-step-by-step/


sudo apt install -y python-scipy python-numpy python-matplotlib python-matplotlib-data python-pandas python-sklearn python-sklearn-pandas python-sklearn-lib python-scikits-learn


vi libcheck.py

----------------------------------------------------------
#!/usr/bin/env python

# Check the versions of libraries

# Python version
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))
----------------------------------------------------------


python

import pandas csv
from pandas.tools.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC


url = "https://s3.amazonaws.com/infosecaddictsfiles/sampleSubmission.csv"
names = ["Id", "Prediction1", "Prediction2", "Prediction3", "Prediction4", "Prediction5", "Prediction6", "Prediction7", "Prediction8", "Prediction9"]
dataset = pandas.read_csv(url, names=names)


Summarize the Dataset
---------------------
We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the shape property.

>>> print(dataset.shape)


>>> print(dataset.head(20))

You should see the first 20 rows of the data:


Statistical Summary
-------------------

Now we can take a look at a summary of each attribute.

This includes the count, mean, the min and max values as well as some percentiles.

>>> print(dataset.describe())


Class Distribution
------------------
Let’s now take a look at the number of instances (rows) that belong to each class. We can view this as an absolute count.

>>> print(dataset.groupby('class').size())

We can see that each class has the same number of instances


Data Visualization
------------------

We now have a basic idea about the data. We need to extend that with some visualizations.

We are going to look at two types of plots:

- Univariate plots to better understand each attribute.
- Multivariate plots to better understand the relationships between attributes.


Univariate Plots

We start with some univariate plots, that is, plots of each individual variable.

Given that the input variables are numeric, we can create box and whisker plots of each.

>>> dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
>>> plt.show()

This gives us a much clearer idea of the distribution of the input attributes:


******************* INSERT DIAGRAM SCREENSHOT *******************


We can also create a histogram of each input variable to get an idea of the distribution.


>>> dataset.hist()
>>> plt.show()

It looks like perhaps two of the input variables have a Gaussian distribution. This is useful to note as we can use algorithms that can exploit this assumption.


******************* INSERT DIAGRAM SCREENSHOT *******************


Multivariate Plots
------------------
Now we can look at the interactions between the variables.

First let’s look at scatterplots of all pairs of attributes. This can be helpful to spot structured relationships between input variables.


>>> scatter_matrix(dataset)
>>> plt.show()

Note the diagonal grouping of some pairs of attributes. This suggests a high correlation and a predictable relationship.

******************* INSERT DIAGRAM SCREENSHOT *******************


Create a Validation Dataset
---------------------------

We need to know that the model we created is any good.

Later, we will use statistical methods to estimate the accuracy of the models that we create on unseen data. We also want a more concrete estimate of the accuracy of the best model on unseen data by evaluating it on actual unseen data.

That is, we are going to hold back some data that the algorithms will not get to see and we will use this data to get a second and independent idea of how accurate the best model might actually be.

We will split the loaded dataset into two, 80% of which we will use to train our models and 20% that we will hold back as a validation dataset.


>>> array = dataset.values
>>> X = array[:,0:4]
>>> Y = array[:,4]
>>> validation_size = 0.20
>>> seed = 7
>>> X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)


Test Harness
------------
We will use 10-fold cross validation to estimate accuracy.

This will split our dataset into 10 parts, train on 9 and test on 1 and repeat for all combinations of train-test splits.


>>> seed = 7
>>> scoring = 'accuracy'

We are using the metric of ‘accuracy‘ to evaluate models.
This is a ratio of the number of correctly predicted instances in divided by the total number of instances in the dataset multiplied by 100 to give a percentage (e.g. 95% accurate).
We will be using the scoring variable when we run build and evaluate each model next.


Build Models
------------
We don’t know which algorithms would be good on this problem or what configurations to use. We get an idea from the plots that some of the classes are partially linearly separable in some dimensions, so we are expecting generally good results.

Let’s evaluate 6 different algorithms:

- Logistic Regression (LR)
- Linear Discriminant Analysis (LDA)
- K-Nearest Neighbors (KNN).
- Classification and Regression Trees (CART).
- Gaussian Naive Bayes (NB).
- Support Vector Machines (SVM).

This is a good mixture of simple linear (LR and LDA), nonlinear (KNN, CART, NB and SVM) algorithms. We reset the random number seed before each run to ensure that the evaluation of each algorithm is performed using exactly the same data splits. It ensures the results are directly comparable.

Let’s build and evaluate our five models:


# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# evaluate each model in turn
results = []
names = []
for name, model in models:
	kfold = model_selection.KFold(n_splits=10, random_state=seed)
	cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
	results.append(cv_results)
	names.append(name)
	msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
	print(msg)


Select Best Model
-----------------
We now have 6 models and accuracy estimations for each. We need to compare the models to each other and select the most accurate.

Running the example above, we get the following raw results:


******************* INSERT DIAGRAM SCREENSHOT *******************


We can see that it looks like KNN has the largest estimated accuracy score.

We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model.
There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (10 fold cross validation).


# Compare Algorithms
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

You can see that the box and whisker plots are squashed at the top of the range, with many samples achieving 100% accuracy.


Make Predictions
----------------