#######################
# VMs for this course #
#######################
https://s3.amazonaws.com/infosecaddictsvirtualmachines/Win7x64.zip
	username: workshop
	password: password

https://s3.amazonaws.com/infosecaddictsvirtualmachines/InfoSecAddictsVM.zip
user:      infosecaddicts
pass:      infosecaddicts

You don't have to, but you can do the updates in the Win7 VM (yes, it is a lot of updates).

You'll need to create a directory in the Win7 VM called "c:\ps".

In this file you will also need to change the text '192.168.200.144' to the IP address of your Ubuntu host.



##############################################
# Log Analysis with Linux command-line tools #
##############################################
The following command-line executables are found on the Mac as well as most Linux distributions.

cat  - prints the content of a file in the terminal window
grep - searches and filters based on patterns
awk  - can split each row into fields and display only what is needed
sed  - performs find-and-replace functions
sort - arranges output in an order
uniq - compares adjacent lines and can report, filter, or provide a count of duplicates


##############
# Cisco Logs #
##############

wget https://s3.amazonaws.com/infosecaddictsfiles/cisco.log


AWK Basics
----------
To quickly demonstrate the print feature in awk, we can instruct it to show only the 5th whitespace-separated field of each line. Here we will print $5. Only the last 4 lines are shown for brevity.

cat cisco.log | awk '{print $5}' | tail -n 4


Looking at a large file would still produce a large amount of output. A more useful thing to do might be to output every entry found in "$5", group them together, count them, then sort them from the greatest to the least number of occurrences. This can be done by piping the output through "sort", using "uniq -c" to count the like entries, then using "sort -rn" to sort the counts in reverse numeric order.

cat cisco.log | awk '{print $5}'| sort | uniq -c | sort -rn
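
For comparison, the same count-and-rank step can be sketched in Python. This is only an illustration and not part of the course material: the helper name count_field and the sample lines are made up, and unlike awk it skips lines that have fewer fields than requested.

```python
from collections import Counter

def count_field(lines, field_number):
    """Count the values of a 1-based whitespace-separated field, most common first."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= field_number:
            counts[fields[field_number - 1]] += 1
    return counts.most_common()

# Made-up sample lines standing in for the contents of cisco.log.
sample = [
    "Jan 1 00:00:01 router1 %LINK-3-UPDOWN: Interface Gi0/1, changed state to up",
    "Jan 1 00:00:02 router1 %LINK-3-UPDOWN: Interface Gi0/2, changed state to up",
    "Jan 1 00:00:03 router1 %SYS-5-CONFIG_I: Configured from console by vty0",
]

for value, count in count_field(sample, 5):
    print(count, value)
```

Counter.most_common() plays the role of "sort | uniq -c | sort -rn" in a single call.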


While that's sort of cool, it is obvious that we have some garbage in our output. Evidently we have a few lines that don't conform to the output we expect to see in $5. We can insert grep to filter the file prior to feeding it to awk. This ensures that we are at least looking at lines of text that contain "facility-level-mnemonic". The pattern is quoted so that the shell passes it to grep unchanged.

cat cisco.log | grep '%[a-zA-Z]*-[0-9]-[a-zA-Z]*' | awk '{print $5}' | sort | uniq -c | sort -rn



Now that the output is cleaned up a bit, it is a good time to investigate some of the entries that appear most often. One way to see all occurrences is to use grep.

cat cisco.log | grep %LINEPROTO-5-UPDOWN:

cat cisco.log | grep %LINEPROTO-5-UPDOWN: | awk '{print $10}' | sort | uniq -c | sort -rn

cat cisco.log | grep %LINEPROTO-5-UPDOWN: | sed 's/,//g' | awk '{print $10}' | sort | uniq -c | sort -rn

cat cisco.log | grep %LINEPROTO-5-UPDOWN:| sed 's/,//g' | awk '{print $10 " changed to " $14}' | sort | uniq -c | sort -rn
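
The grep, sed and awk steps of that last pipeline can also be sketched in Python. The function name and sample lines below are hypothetical; the field positions (10 for the interface, 14 for the new state) are taken from the awk command above and assume the log lines are laid out like the made-up samples.

```python
from collections import Counter

def interface_state_changes(lines, mnemonic="%LINEPROTO-5-UPDOWN:"):
    """Count 'interface changed to state' pairs for lines containing the mnemonic."""
    counts = Counter()
    for line in lines:
        if mnemonic not in line:                   # grep step
            continue
        fields = line.replace(",", "").split()     # sed 's/,//g' step, then split into fields
        if len(fields) >= 14:
            counts[f"{fields[9]} changed to {fields[13]}"] += 1   # awk $10 and $14
    return counts.most_common()

# Made-up sample lines; the real cisco.log may differ in detail.
sample = [
    "Jan 1 00:00:01 router1 %LINEPROTO-5-UPDOWN: Line protocol on Interface Gi0/1, changed state to down",
    "Jan 1 00:00:05 router1 %LINEPROTO-5-UPDOWN: Line protocol on Interface Gi0/1, changed state to up",
    "Jan 1 00:00:09 router1 %LINEPROTO-5-UPDOWN: Line protocol on Interface Gi0/1, changed state to down",
    "Jan 1 00:00:10 router1 %SYS-5-CONFIG_I: Configured from console by vty0",
]

for change, count in interface_state_changes(sample):
    print(count, change)
```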




#########
# EGrep #
#########
egrep is an acronym for "Extended Global Regular Expressions Print". It is a program which scans a specified file line by line, returning lines that contain a pattern matching a given regular expression. On current systems the same behavior is available as "grep -E".

The standard egrep command looks like:

egrep <flags> '<regular expression>' <filename>

To specify a set or range of characters, use square brackets. To negate the set, use the caret symbol ^ as the first character. For example:

[a9A05] is the set { a, 9, A, 0, 5 }
[^a9A05] is the complementary set ASCII - { a, 9, A, 0, 5 } (everything except a, 9, A, 0 and 5).
[a-z] is the set of all lowercase letters { a, b, c, d, ..., z }
[^a-z4-9QR] is the set of all ASCII characters except for the lowercase letters, the numerals between 4 and 9, and the uppercase letters Q and R.

A few examples:

egrep '^(0|1)+ [a-zA-Z]+$' searchfile.txt
match all lines in searchfile.txt which start with a non-empty bitstring, followed by a space, followed by a non-empty alphabetic word which ends the line

egrep -c '^1|01$' lots_o_bits
count the number of lines in lots_o_bits which either start with 1 or end with 01

egrep -c '10*10*10*10*10*10*10*10*10*10*1' lots_o_bits
count the number of lines with at least eleven 1's

egrep -i '\<the\>' myletter.txt
list all the lines in myletter.txt containing the word "the", regardless of case
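
As a rough cross-check, the four egrep examples can be reproduced with Python's re module. This is a sketch under assumptions: the sample lists are made up to stand in for searchfile.txt, lots_o_bits and myletter.txt, count_matching is a hypothetical helper, and because Python's re does not support \< and \>, the word boundaries are written as \b.

```python
import re

def count_matching(pattern, lines, flags=0):
    """Count the lines that contain a match, like `egrep -c`."""
    regex = re.compile(pattern, flags)
    return sum(1 for line in lines if regex.search(line))

# Made-up sample data standing in for the files named above.
words = ["0101 hello", "0101 hello world", "hello 0101"]
bits = ["1000", "0101", "0110", "1" * 11]
letter = ["The cat", "other things", "see the end"]

bitstring_then_word = count_matching(r"^(0|1)+ [a-zA-Z]+$", words)
starts_1_or_ends_01 = count_matching(r"^1|01$", bits)
eleven_ones = count_matching(r"10*10*10*10*10*10*10*10*10*10*1", bits)
# \b marks a word boundary on either side of "the"; re.IGNORECASE mirrors -i.
the_lines = [line for line in letter if re.search(r"\bthe\b", line, re.IGNORECASE)]

print(bitstring_then_word, starts_1_or_ends_01, eleven_ones, the_lines)
```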




#####################
# PowerShell Basics #
#####################

PowerShell is Microsoft's scripting language and command shell. It has been installed with Windows by default since Windows 7 and was available as a separate download for Vista.

PowerShell script files end in .ps1.

An important note is that, by default, you cannot double-click a PowerShell script to execute it.

To open a PowerShell command prompt, either hit Windows Key + R and type in PowerShell, or go to Start -> All Programs -> Accessories -> Windows PowerShell -> Windows PowerShell.

dir
cd
ls
cd c:\


To obtain a list of cmdlets, use the Get-Command cmdlet:

Get-Command


You can use the Get-Alias cmdlet to see a full list of aliased commands.

Get-Alias


Don't worry, you won't blow up your machine with PowerShell.
Get-Process | Stop-Process 				Don't press [ ENTER ]. What will this command do?
Get-Process | Stop-Process -WhatIf


To get help with a cmdlet, use the Get-Help cmdlet along with the cmdlet you want information about.

Get-Help Get-Command

Get-Help Get-Service -Online

Get-Service -Name TermService, Spooler

Get-Service -N BITS



PowerShell variables begin with the $ symbol. First let's create a variable.

To see the value of a variable you can just call it in the terminal.

$serv

$serv.gettype().fullname

Get-Member is another extremely useful cmdlet that will enumerate the available methods and properties of an object. You can pipe the object to Get-Member or pass it in with -InputObject.

Get-Member -InputObject $serv

Methods can return properties and properties can have sub-properties. You can chain them together by appending them to the first call.

$serv.Status
$serv.Stop()
$serv.Refresh()
$serv.Status

- Run a cmdlet through a pipe and refer to the properties of the current object as $_
Get-Service | Where-Object { $_.Status -eq "Running" }

Variables
---------
$vs1 = 1
$vs1.GetType().Name

$vs1 = "string "
$vs1.GetType().Name

- Get a listing of variables
Get-Variable
Get-ChildItem variable:

For Loops
---------
1..5 | ForEach-Object { $Sum = 0 } { $Sum += $_ }

$Numbers = 4..7
1..10 | ForEach-Object { if ($Numbers -contains $_) { continue }; $_ }

foreach ($i in (1..10)) {
	if ($i -gt 5) {
		continue
	}
	$i
}


#############################
# Simple Event Log Analysis #
#############################

To dump the event log, you can use the Get-EventLog and the Export-Clixml cmdlets if you are working with a traditional event log such as the Security, Application, or System event logs.
If you need to work with one of the trace logs, use the Get-WinEvent and the Export-Clixml cmdlets.

The % symbol is an alias for the ForEach-Object cmdlet. It is often used when working interactively from the Windows PowerShell console.

$logs | % { get-eventlog -LogName $_ | Export-Clixml "$_.xml" }


Step 2: Import the event log of interest
To parse the event logs, use the Import-Clixml cmdlet to read the stored XML files.
Let's take a look at the cmdlets Where-Object, Group-Object, and Select-Object.
The following two commands first read the exported security log contents into a variable named $seclog, and then obtain the five oldest entries.

$seclog = Import-Clixml security.xml
$seclog | select -Last 5


By default, an ordinary user does not have permission to read the security log.

-----------------------------------

$seclog | select -first 1 | fl *


To obtain this information, pipe the contents of the security log to Where-Object to filter the events, and then send the results to the Measure-Object cmdlet to determine the number of events:

If you want to ensure that only event log entries that contain SeSecurityPrivilege in their text are returned, use Group-Object to gather the matches by the EventID property.

$seclog | ? { $_.message -match 'SeSecurityPrivilege'} | group eventid

Because importing the event log into a variable from the stored XML results in a collection of event log entries, the Count property is also present.


PSDrives
--------
To get a list of the PSDrives currently available on a system, use the Get-PSDrive cmdlet.
To get a list of the providers available to the current session with the modules it has loaded, use the Get-PSProvider cmdlet.
The default PSDrives created when a shell session is started are:
- Alias - Represents all aliases valid for the current PowerShell session.
- Cert - Certificate store for the user represented in Current Location.
- Env - All environment variables for the current PowerShell session
- Function - All functions available for the current PowerShell session
- HKLM - Registry HKEY_LOCAL_MACHINE hive
- HKCU - Registry HKEY_CURRENT_USER hive
- WSMan - WinRM (Windows Remote Management) configuration and credentials


Playing with WMI
----------------
# List all namespaces in the default root/cimv2
Get-WmiObject -Class __namespace | Select-Object Name

# List all namespaces under root/microsoft
Get-WmiObject -Class __namespace -Namespace root/microsoft | Select-Object Name

# To list classes under the default namespace
Get-WmiObject -List *

# To filter classes with the word network in their name
Get-WmiObject -List *network*

# To list classes in another namespace
Get-WmiObject -List * -Namespace root/microsoft/homenet

# To get a description of a class
(Get-WmiObject -list win32_service -Amended).qualifiers | Select-Object name, value | ft -AutoSize -Wrap

PowerShell treats WMI objects the same as .NET objects, so we can use Select-Object, Where-Object, ForEach-Object and the formatting cmdlets like we do with any other .NET object type.
In the case of WMI with Get-WmiObject we also have the ability to use filters based on WQL operators with the -Filter parameter.

$wmishare = [wmiclass] "win32_process"
$wmishare.Methods

Invoke-WmiMethod -Class Win32_Process -Name create -ArgumentList 'calc.exe'


Get-PSProvider Registry

- To list sub-keys of a registry path
Get-ChildItem -Path hkcu:\

- To copy a key and all sub-keys
Copy-Item -Path 'HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion' -Destination hkcu: -Recurse

- To create a key
New-Item -Path HKCU:\_DeleteMe

- To remove keys
Remove-Item -Path HKCU:\_DeleteMe
Remove-Item -Path HKCU:\CurrentVersion

- Selecting Objects
- Selecting specific objects from a list
Get-Process | Sort-Object workingset -Descending

$str = "my string"
$str.Contains(" ")

- Selecting a range of objects from a list
Get-Process | Sort-Object workingset -Descending | Select-Object -Index (0..4)
Get-Process | Sort-Object workingset -Descending | Select-Object -Index 0,1,2,3,4

- Creating/Renaming a property
Get-Process | Select-Object -Property name,@{name = 'PID'; expression = {$_.id}}


############################
# Simple Log File Analysis #
############################

You'll need to create the directory c:\ps and download the sample IIS log http://pastebin.com/raw.php?i=LBn64cyA

(new-object System.Net.WebClient).DownloadFile("http://pastebin.com/raw.php?i=LBn64cyA", "c:\ps\u_ex1104.log")

Select-String 192.168.208.63 .\CiscoLogFileExamples.txt

Select-String 192.168.208.63 .\CiscoLogFileExamples.txt | select line

Select-String 192.168.208.63 .\CiscoLogFileExamples.txt | select line | Measure-Object

Select-String '\b(?:\d{1,3}\.){3}\d{1,3}\b' .\CiscoLogFileExamples.txt | select -ExpandProperty matches | select -ExpandProperty value | Sort-Object -Unique | Measure-Object

The Measure-Object command counts the unique IP addresses. Removing Measure-Object shows the individual IP addresses instead of just their count.

Select-String '\b(?:\d{1,3}\.){3}\d{1,3}\b' .\CiscoLogFileExamples.txt | select -ExpandProperty matches | select -ExpandProperty value | Sort-Object -Unique

To determine which IP addresses communicate the most, drop the last commands so that the value of every match is kept. Then group the piped output by IP address (value) and sort the groups with the alias for Sort-Object: sort count -des.
This lists each IP address with its count, in descending order of count.

Select-String '\b(?:\d{1,3}\.){3}\d{1,3}\b' .\CiscoLogFileExamples.txt | select -ExpandProperty matches | select value | group value | sort count -des
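
The same "extract every IP address, group, and sort by count" idea can be sketched in Python with the regular expression used above. The function name top_ips and the sample lines are made up for illustration; note that this pattern only checks the shape of an address, so something like 999.999.999.999 would also match.

```python
import re
from collections import Counter

# Same pattern as in the Select-String commands above.
IP_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def top_ips(lines):
    """Return (ip, count) pairs for every IP-shaped string, most frequent first."""
    counts = Counter()
    for line in lines:
        counts.update(IP_PATTERN.findall(line))
    return counts.most_common()

# Made-up log lines for illustration only.
sample = [
    "192.168.208.63 -> 10.0.0.5 permitted",
    "10.0.0.5 -> 192.168.208.63 permitted",
    "192.168.208.63 -> 8.8.8.8 denied",
]

for ip, count in top_ips(sample):
    print(count, ip)
```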


##############################################
# Parsing Log files using Windows PowerShell #
##############################################

Download the sample IIS log http://pastebin.com/LBn64cyA

(new-object System.Net.WebClient).DownloadFile("http://pastebin.com/raw.php?i=LBn64cyA", "c:\ps\u_ex1104.log")

Get-Content ".\*log" | ? { ($_ | Select-String "WebDAV")}

The above command would give us all the WebDAV requests.

To narrow this further with a second string (a user name, or the OPTIONS method as shown here), use the below command:

Get-Content ".\*log" | ? { ($_ | Select-String "WebDAV") -and ($_ | Select-String "OPTIONS")}

Some more options that will be more commonly required:

For Outlook Web Access: Replace WebDAV with OWA

For EAS: Replace WebDAV with Microsoft-server-activesync

For ECP: Replace WebDAV with ECP



#######################################
# Regex Characters you might run into #
#######################################

^	Start of string, or start of line in a multiline pattern
$	End of string, or end of line in a multiline pattern
\b	Word boundary
\d	Digit
\	Escape the following character
*	0 or more	{3}	Exactly 3
+	1 or more	{3,}	3 or more
?	0 or 1		{3,5}	3, 4 or 5



####################################################################
# Windows PowerShell: Extracting Strings Using Regular Expressions #
####################################################################
To build a script that will extract data from a text file and place the extracted text into another file, we need three main elements:

1) The input file that will be parsed

(new-object System.Net.WebClient).DownloadFile("http://pastebin.com/raw.php?i=rDN3CMLc", "c:\ps\emails.txt")
(new-object System.Net.WebClient).DownloadFile("http://pastebin.com/raw.php?i=XySD8Mi2", "c:\ps\ip_addresses.txt")
(new-object System.Net.WebClient).DownloadFile("http://pastebin.com/raw.php?i=v5Yq66sH", "c:\ps\URL_addresses.txt")

2) The regular expression that the input file will be compared against

3) The output file where the extracted data will be placed.

Windows PowerShell has a "select-string" cmdlet which can be used to quickly scan a file to see if a certain string value exists.
Using some of the parameters of this cmdlet, we are able to search through a file to see whether any strings match a certain pattern, and then output the results to a separate file.

To demonstrate this concept, below is a Windows PowerShell script I created to search through a text file for strings that match the Regular Expression (or RegEx for short) pattern belonging to e-mail addresses.

$input_path = 'c:\ps\emails.txt'
$output_file = 'c:\ps\extracted_addresses.txt'
$regex = '\b[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b'
select-string -Path $input_path -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } > $output_file

In this script, we have the following variables:

1) $input_path to hold the path to the input file we want to parse

2) $output_file to hold the path to the file we want the results to be stored in

3) $regex to hold the regular expression pattern to be used when the strings are being matched.

The select-string cmdlet is used with the following parameters:

1) "-Path" which takes as input the full path to the input file

2) "-Pattern" which takes as input the regular expression used in the matching process

3) "-AllMatches" which searches for more than one match per line (without this parameter only the first match on each line is returned). The result is piped to "$_.Matches" and then "$_.Value", which return the matched text of every match.

Using ">" the results are written to the destination specified in the $output_file variable.
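
For readers who want to try the same three elements (input file, regular expression, output file) outside PowerShell, here is a minimal Python sketch. It reuses the e-mail pattern from the script above; the function name extract_to_file, the temporary file names and the sample addresses are assumptions made for the demonstration.

```python
import re
import tempfile
from pathlib import Path

# Same e-mail pattern as in the PowerShell script above.
EMAIL_REGEX = r'\b[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b'

def extract_to_file(input_path, output_path, pattern):
    """Write every match of pattern found in input_path to output_path, one per line."""
    text = Path(input_path).read_text()
    matches = re.findall(pattern, text)   # all matches, like -AllMatches
    Path(output_path).write_text("\n".join(matches) + "\n")
    return matches

# Demo on a temporary file containing made-up addresses.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "emails.txt"
    target = Path(tmp) / "extracted_addresses.txt"
    source.write_text("Contact alice@example.com or bob.smith@mail.example.org today.\n")
    found = extract_to_file(source, target, EMAIL_REGEX)
    written = target.read_text().splitlines()

print(found)
```

Because the pattern contains no capturing groups, re.findall returns the full matched strings.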

Here are two further examples of this script which incorporate a regular expression for extracting IP addresses and URLs.

IP addresses
------------
For the purposes of this example, I ran the tracert command to trace the route from my host to google.com and saved the results into a file called ip_addresses.txt. You may choose to use this script for extracting IP addresses from router logs, firewall logs, debug logs, etc.

$input_path = 'c:\ps\ip_addresses.txt'
$output_file = 'c:\ps\extracted_ip_addresses.txt'
$regex = '\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b'
select-string -Path $input_path -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } > $output_file

URLs
----
For the purposes of this example, I created a couple of dummy web server log entries and saved them into URL_addresses.txt.
You may choose to use this script for extracting URL addresses from proxy logs, network packet capture logs, debug logs, etc.

$input_path = 'c:\ps\URL_addresses.txt'
$output_file = 'c:\ps\extracted_URL_addresses.txt'
$regex = '([a-zA-Z]{3,})://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)*?'
select-string -Path $input_path -Pattern $regex -AllMatches | % { $_.Matches } | % { $_.Value } > $output_file


In addition to the examples above, many other types of strings can be extracted using this script.
All you need to do is switch the regular expression in the "$regex" variable!
In fact, the beauty of such a PowerShell script is its simplicity and speed of execution.



###############################################
# Intrusion Analysis Using Windows PowerShell #
###############################################
Download sample file http://pastebin.com/raw.php?i=ysnhXxTV into the c:\ps directory

netsh advfirewall show allprofiles | Select-String FileName | select -ExpandProperty line | Select-String '%systemroot%.+\.log' | select -ExpandProperty matches | select -ExpandProperty value | sort -Unique

This will get the setting for logs in the Windows Firewall, which should be enabled in Group Policy for analysis.
The command shows that the firewall log is at:
%systemroot%\system32\LogFiles\Firewall\pfirewall.log. In order to open the file, PowerShell will need to be run with administrative privileges.

The first step is to get the above command into a variable using script logic.
Thankfully PowerShell has a built-in integrated scripting environment, the PowerShell ISE (powershell_ise.exe).



###################
# Regex in Python #
###################


**************************************************
* What is Regular Expression and how is it used? *
**************************************************

Simply put, a regular expression is a sequence of characters mainly used to find and replace patterns in a string or file.

Regular expressions use two types of characters:

a) Meta characters: As the name suggests, these characters have a special meaning, similar to * in a wildcard.

b) Literals (like a, b, 1, 2, ...)

In Python, the "re" module helps with regular expressions. So you need to import re before you can use regular expressions in Python.

--------------------------------------------------
The most common uses of regular expressions are:
- Search a string (search and match)
- Finding a string (findall)
- Break a string into substrings (split)
- Replace part of a string (sub)


Let's look at the methods that library "re" provides to perform these tasks.



****************************************************
* What are various methods of Regular Expressions? *
****************************************************

The 're' package provides multiple methods to perform queries on an input string. Here are the most commonly used methods, which I will discuss:

re.match()
re.search()
re.findall()
re.split()
re.sub()
re.compile()

Let's look at them one by one.


re.match(pattern, string):
-------------------------------------------------

This method finds a match only if it occurs at the start of the string. For example, calling match() on the string 'AV Analytics ESET AV' and looking for the pattern 'AV' will match. However, if we look for only 'Analytics', the pattern will not match. Let's perform it in Python now.

Code

import re
result = re.match(r'AV', 'AV Analytics ESET AV')
print result

Output:
<_sre.SRE_Match object at 0x0000000009BE4370>

Above, it shows that a pattern match has been found. (The examples here use the Python 2 print statement; under Python 3, write print(result), and from Python 3.7 the match object is displayed as <re.Match object; span=(0, 2), match='AV'>.) To print the matching string we'll use the method group(), which returns the matching string. Use "r" at the start of the pattern string; it designates a Python raw string.


result = re.match(r'AV', 'AV Analytics ESET AV')
print result.group(0)

Output:
AV


Let's now find 'Analytics' in the given string. Here we see that the string does not start with 'Analytics', so it should return no match. Let's see what we get:


Code

result = re.match(r'Analytics', 'AV Analytics ESET AV')
print result


Output:
None


There are methods like start() and end() to get the start and end position of the matching pattern in the string.

Code

result = re.match(r'AV', 'AV Analytics ESET AV')
print result.start()
print result.end()

Output:
0
2

Above you can see the start and end position of the matching pattern 'AV' in the string; this is sometimes very helpful when manipulating the string.




re.search(pattern, string):
-----------------------------------------------------


It is similar to match(), but it doesn't restrict us to finding matches at the beginning of the string only. Unlike the previous method, here searching for the pattern 'Analytics' will return a match.

Code

result = re.search(r'Analytics', 'AV Analytics ESET AV')
print result.group(0)

Output:
Analytics

Here you can see that the search() method is able to find a pattern at any position in the string, but it only returns the first occurrence of the search pattern.





re.findall(pattern, string):
------------------------------------------------------


It returns a list of all matches of the pattern. It has no constraint of searching from the start or the end. If we use findall() to search for 'AV' in the given string, it will return both occurrences of AV. When you only need the matched text, re.findall() is a convenient default, since it covers what re.search() and re.match() find; note that it returns strings rather than match objects.


Code

result = re.findall(r'AV', 'AV Analytics ESET AV')
print result

Output:
['AV', 'AV']
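
The difference between the three calls can be seen side by side in the short sketch below. It is written for Python 3 (print is a function there), unlike the Python 2 snippets in these notes, and it additionally uses re.finditer, which is not covered above, to show how positions can be recovered when findall's plain strings are not enough.

```python
import re

text = 'AV Analytics ESET AV'

first_at_start = re.match(r'AV', text)           # only looks at the start of the string
first_anywhere = re.search(r'Analytics', text)   # first occurrence anywhere
all_matches = re.findall(r'AV', text)            # every occurrence, as plain strings
# finditer yields match objects, so start/end positions are available for every match.
positions = [(m.start(), m.end()) for m in re.finditer(r'AV', text)]

print(first_at_start.group(0))
print(first_anywhere.span())
print(all_matches)
print(positions)
```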




re.split(pattern, string, [maxsplit=0]):
------------------------------------------------------


This method splits the string at each occurrence of the given pattern.


Code

result=re.split(r'y','Analytics')
result

Output:
['Anal', 'tics']

Above, we have split the string "Analytics" by "y". The split() method has another argument, "maxsplit". It has a default value of zero, in which case it performs as many splits as possible; if we give maxsplit a value, the string is split at most that many times. Let's look at the example below:


Code

result=re.split(r's','Analytics eset')
print result

Output:
['Analytic', ' e', 'et'] #It has performed all the splits that can be done by pattern "s".

Code

result=re.split(r's','Analytics eset',maxsplit=1)
result

Output:
['Analytic', ' eset']

Here, you can notice that we have fixed maxsplit to 1. As a result, the output has only two values, whereas the first example has three.
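
A small Python 3 sketch of the same calls follows (the notes themselves use Python 2 syntax). The last line is an extra illustration that is not in the original text: it splits on a run of whitespace, which is where a regular expression pattern does something str.split with a fixed separator cannot.

```python
import re

text = 'Analytics eset'

all_splits = re.split(r's', text)              # maxsplit defaults to 0: split everywhere
one_split = re.split(r's', text, maxsplit=1)   # stop after the first split
# Extra illustration: the separator can be a pattern, here one or more whitespace characters.
fields = re.split(r'\s+', 'AV   Analytics \t ESET')

print(all_splits)
print(one_split)
print(fields)
```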




re.sub(pattern, repl, string):
----------------------------------------------------------

It searches for a pattern and replaces each match with a new substring. If the pattern is not found, the string is returned unchanged.

Code

result=re.sub(r'Ruby','Python','Joe likes Ruby')
result

Output:
'Joe likes Python'
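
The "returned unchanged" case is easy to check with a short Python 3 sketch. The count argument and re.subn shown here are part of the standard re module but are not discussed in these notes; they are included only as a hedged extra, and the card-number string is made up.

```python
import re

replaced = re.sub(r'Ruby', 'Python', 'Joe likes Ruby')
unchanged = re.sub(r'Perl', 'Python', 'Joe likes Ruby')   # no match: input comes back as is
# count limits how many matches are replaced (0, the default, means all of them).
masked = re.sub(r'\d', '#', 'card 1234 5678', count=4)
# subn also reports how many replacements were made.
new_text, how_many = re.subn(r'\d', '#', 'card 1234 5678')

print(replaced)
print(unchanged)
print(masked)
print(new_text, how_many)
```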





re.compile(pattern):
----------------------------------------------------------


We can compile a regular expression pattern into a pattern object, which can then be used for pattern matching. This also lets us search for the same pattern again without rewriting it.


Code

import re
pattern=re.compile('XSS')
result=pattern.findall('XSS is Cross Site Scripting, XSS')
print result
result2=pattern.findall('XSS is Cross Site Scripting, SQLi is Sql Injection')
print result2

Output:
['XSS', 'XSS']
['XSS']

Till now, we looked at various methods of regular expressions using a constant pattern (fixed characters). But what if we do not have a constant search pattern and we want to return a specific set of characters (defined by a rule) from a string? Don't be intimidated.

This can easily be solved by defining an expression with the help of pattern operators (meta and literal characters). Let's look at the most common pattern operators.




**********************************************
* What are the most commonly used operators? *
**********************************************


Regular expressions can specify patterns, not just fixed characters. Here are the most commonly used operators that help to build an expression representing the required characters in a string or file. They are commonly used in web scraping and text mining to extract required information.

Operators	Description
.	        Matches any single character except newline '\n'.
?	        Matches 0 or 1 occurrence of the pattern to its left
+	        Matches 1 or more occurrences of the pattern to its left
*	        Matches 0 or more occurrences of the pattern to its left
\w	        Matches an alphanumeric character (or underscore), whereas \W (upper case W) matches a non-alphanumeric character.
\d	        Matches a digit [0-9], and \D (upper case D) matches a non-digit.
\s	        Matches a single white space character (space, newline, return, tab, form feed), and \S (upper case S) matches any non-white space character.
\b	        Matches the boundary between word and non-word characters, and \B is the opposite of \b
[..]	        Matches any single character in the square brackets, and [^..] matches any single character not in the square brackets
\	        Escapes characters with a special meaning, like \. to match a period or \+ to match a plus sign.
^ and $	        ^ and $ match the start and end of the string respectively
{n,m}	        Matches at least n and at most m occurrences of the preceding expression; written as {,m}, it matches from 0 up to m occurrences.
a| b	        Matches either a or b
( )	        Groups regular expressions and captures the matched text
\t, \n, \r	Match tab, newline, return


For more details on the meta characters "(", ")", "|" and others, you can refer to this link (https://docs.python.org/2/library/re.html).
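
A few of the operators from the table can be tried out with the following Python 3 sketch. The sample strings are made up for illustration, and only a handful of operators are exercised.

```python
import re

text = 'Order 66 shipped on 2016-04-01 to bob@example.com\tOK'

digit_runs = re.findall(r'\d+', text)                           # + : one or more digits
date_parts = re.search(r'(\d{4})-(\d{2})-(\d{2})', text).groups()  # {n} and ( ) groups
address = re.findall(r'\S+@\S+', text)                          # \S : any non-whitespace character
spellings = re.findall(r'colou?r', 'color colour colr')         # ? : the "u" is optional
pets = re.findall(r'cat|dog', 'cat dog bird')                   # | : either alternative

print(digit_runs)
print(date_parts)
print(address)
print(spellings)
print(pets)
```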
724
725
Now, let's understand the pattern operators by looking at the below examples.
726
727
 
728
729
****************************************
730
* Some Examples of Regular Expressions *
731
****************************************
732
733
******************************************************
734
* Problem 1: Return the first word of a given string *
735
******************************************************
736
737
738
Solution-1  Extract each character (using "\w")
739
---------------------------------------------------------------------------
740
741
Code
742
743
import re
744
result=re.findall(r'.','Python is the best scripting language')
745
print result
746
747
Output:
748
['P', 'y', 't', 'h', 'o', 'n', ' ', 'i', 's', ' ', 't', 'h', 'e', ' ', 'b', 'e', 's', 't', ' ', 's', 'c', 'r', 'i', 'p', 't', 'i', 'n', 'g', ' ', 'l', 'a', 'n', 'g', 'u', 'a', 'g', 'e']
749
750
751
Above, space is also extracted, now to avoid it use "\w" instead of ".".
752
753
754
Code
755
756
result=re.findall(r'\w','Python is the best scripting language')
757
print result
758
759
Output:
760
['P', 'y', 't', 'h', 'o', 'n', 'i', 's', 't', 'h', 'e', 'b', 'e', 's', 't', 's', 'c', 'r', 'i', 'p', 't', 'i', 'n', 'g', 'l', 'a', 'n', 'g', 'u', 'a', 'g', 'e']
761
762
763
764
765
Solution-2  Extract each word (using "*" or "+")
---------------------------------------------------------------------------

Code

result=re.findall(r'\w*','Python is the best scripting language')
print result

Output:
['Python', '', 'is', '', 'the', '', 'best', '', 'scripting', '', 'language', '']


Again, the result is not clean: it contains empty strings, because "*" matches zero or more occurrences of the pattern to its left and therefore also succeeds, with an empty match, at every space and at the end of the string. To drop the empty strings, use "+", which requires at least one character.

Code

result=re.findall(r'\w+','Python is the best scripting language')
print result

Output:
['Python', 'is', 'the', 'best', 'scripting', 'language']
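
To see "*" and "+" side by side, the sketch below runs both quantifiers on a shorter string. The sample text and the variable names are illustrative assumptions, not part of the course files, and the code is written so that it should run under Python 2 or Python 3.

```python
import re

text = 'log 42 ok'  # hypothetical sample string

# "*" may match zero characters, so empty matches appear in the result.
star = re.findall(r'\w*', text)

# "+" needs at least one word character, so only real words are returned.
plus = re.findall(r'\w+', text)

# Filtering the empty strings out of the "*" result gives the "+" result.
star_without_empties = [m for m in star if m]

print(star)
print(plus)
print(star_without_empties == plus)
```

In practice "+" is usually the simpler choice when empty matches carry no meaning.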


Solution-3  Extract the first word (using "^")
-------------------------------------------------------------------------------------

Code

result=re.findall(r'^\w+','Python is the best scripting language')
print result

Output:
['Python']


If we use "$" instead of "^", it returns the last word of the string. Let's look at it.

result=re.findall(r'\w+$','Python is the best scripting language')
print result

Output:
['language']
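
re.findall always returns a list, even when at most one match is possible. As a possible alternative for Problem 1, the sketch below uses re.search, which returns a match object or None, to get the first and last word as plain strings. The variable names are my own, and the code should run under Python 2 or Python 3.

```python
import re

text = 'Python is the best scripting language'

# re.search returns a match object (or None), so .group() yields a string
# rather than a one-element list.
first = re.search(r'^\w+', text)
last = re.search(r'\w+$', text)

first_word = first.group() if first else None
last_word = last.group() if last else None

print(first_word)  # Python
print(last_word)   # language
```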



***********************************************************
* Problem 2: Return the first two characters of each word *
***********************************************************


Solution-1  Extract each pair of consecutive characters, excluding spaces (using "\w")
------------------------------------------------------------------------------------------------------

Code

result=re.findall(r'\w\w','Python is the best')
print result

Output:
['Py', 'th', 'on', 'is', 'th', 'be', 'st']



Solution-2  Extract the two characters at the start of each word (using "\b")
------------------------------------------------------------------------------------------------------

Code

result=re.findall(r'\b\w.','Python is the best')
print result

Output:
['Py', 'is', 'th', 'be']
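
One detail of the pattern in Solution-2: the "." after "\w" accepts any character, including a space. The sketch below uses a sample string of my own with a one-letter word to show how that differs from requiring two word characters. It should run under Python 2 or Python 3.

```python
import re

text = 'I am here'  # hypothetical sample with a one-letter word

# Second character may be anything, so "I" drags the following space along.
dot_version = re.findall(r'\b\w.', text)

# Second character must be a word character, so one-letter words are skipped.
word_version = re.findall(r'\b\w\w', text)

print(dot_version)   # ['I ', 'am', 'he']
print(word_version)  # ['am', 'he']
```

Which version is appropriate depends on whether one-letter words should appear in the result at all.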



********************************************************
* Problem 3: Return the domain type of given email-ids *
********************************************************


To explain it simply, I will again take a stepwise approach:


Solution-1  Extract all characters after "@"
------------------------------------------------------------------------------------------------------------------

Code

result=re.findall(r'@\w+','abc.test@gmail.com, xyz@test.com, test.first@strategicsec.com, first.test@rest.biz')
print result

Output:
['@gmail', '@test', '@strategicsec', '@rest']


Above, you can see that the ".com" and ".biz" parts are not extracted. To include them, use the code below.

result=re.findall(r'@\w+\.\w+','abc.test@gmail.com, xyz@test.com, test.first@strategicsec.com, first.test@rest.biz')
print result

Output:
['@gmail.com', '@test.com', '@strategicsec.com', '@rest.biz']



Solution-2  Extract only the domain type using "( )"
-----------------------------------------------------------------------------------------------------------------------

Code

result=re.findall(r'@\w+\.(\w+)','abc.test@gmail.com, xyz@test.com, test.first@strategicsec.com, first.test@rest.biz')
print result

Output:
['com', 'com', 'com', 'biz']
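
A note on the dot that separates the domain name from its type: an unescaped "." matches any character, while "\." matches only a literal dot. The sketch below compares the two on a made-up string (the sample text is an assumption for illustration). It should run under Python 2 or Python 3.

```python
import re

text = 'ok: xyz@test.com, broken: xyz@test com'  # hypothetical sample

loose = re.findall(r'@\w+.\w+', text)    # "." matches any character
strict = re.findall(r'@\w+\.\w+', text)  # "\." matches only a literal dot

print(loose)   # ['@test.com', '@test com']
print(strict)  # ['@test.com']
```

On the well-formed addresses used in Problem 3 both patterns return the same list; the difference only shows up on malformed input.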



********************************************
* Problem 4: Return date from given string *
********************************************


Here we will use "\d" to extract digits.


Solution:
----------------------------------------------------------------------------------------------------------------------

Code

result=re.findall(r'\d{2}-\d{2}-\d{4}','Joe 34-3456 12-05-2007, XYZ 56-4532 11-11-2016, ABC 67-8945 12-01-2009')
print result

Output:
['12-05-2007', '11-11-2016', '12-01-2009']


If you want to extract only the year, parentheses "( )" will again help.

Code

result=re.findall(r'\d{2}-\d{2}-(\d{4})','Joe 34-3456 12-05-2007, XYZ 56-4532 11-11-2016, ABC 67-8945 12-01-2009')
print result

Output:
['2007', '2016', '2009']
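
When a pattern contains more than one group, re.findall returns one tuple per match instead of a plain string. The sketch below, using the same sample string, captures all three parts of each date and then picks out the years. The variable names are my own, and the code should run under Python 2 or Python 3.

```python
import re

text = 'Joe 34-3456 12-05-2007, XYZ 56-4532 11-11-2016, ABC 67-8945 12-01-2009'

# With several groups, findall returns one tuple per match.
parts = re.findall(r'(\d{2})-(\d{2})-(\d{4})', text)

# The third element of each tuple is the four-digit year.
years = [year for (_first, _second, year) in parts]

print(parts)  # [('12', '05', '2007'), ('11', '11', '2016'), ('12', '01', '2009')]
print(years)  # ['2007', '2016', '2009']
```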



*******************************************************************
* Problem 5: Return all words of a string that start with a vowel *
*******************************************************************


Solution-1  Return each word
-----------------------------------------------------------------------------------------------------------------

Code

result=re.findall(r'\w+','Python is the best')
print result

Output:
['Python', 'is', 'the', 'best']



Solution-2  Return matches that start with a vowel (using "[]")
------------------------------------------------------------------------------------------------------------------

Code

result=re.findall(r'[aeiouAEIOU]\w+','I love Python')
print result

Output:
['ove', 'on']


Above you can see that it has returned "ove" and "on" from the middle of words. To drop these two, we need to use "\b" to anchor the match at a word boundary.



Solution-3  Return words that start with a vowel (using "\b")
------------------------------------------------------------------------------------------------------------------

Code

result=re.findall(r'\b[aeiouAEIOU]\w+','I love Python')
print result

Output:
[]
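
The empty list in Solution-3 is expected: the only word that starts with a vowel is "I", and "\w+" demands at least one more character after the vowel. As a sketch of one way to keep one-letter words, "\w+" can be swapped for "\w*". The variable names are my own, and the code should run under Python 2 or Python 3.

```python
import re

text = 'I love Python'

# "\w+" needs at least one character after the vowel, so "I" cannot match.
with_plus = re.findall(r'\b[aeiouAEIOU]\w+', text)

# "\w*" allows zero further characters, so one-letter words are kept.
with_star = re.findall(r'\b[aeiouAEIOU]\w*', text)

print(with_plus)  # []
print(with_star)  # ['I']
```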


In a similar way, we can extract words that start with a consonant by using "^" inside the square brackets.

Code

result=re.findall(r'\b[^aeiouAEIOU]\w+','I love Python')
print result

Output:
[' love', ' Python']


Above you can see that the returned words start with a space. To drop it from the output, include a space inside the square brackets.

Code

result=re.findall(r'\b[^aeiouAEIOU ]\w+','I love Python')
print result

Output:
['love', 'Python']



***********************************************************************************************
* Problem 6: Validate a phone number (phone number must have 10 digits and start with 8 or 9) *
***********************************************************************************************


We have a list of phone numbers in the list "li", and here we will validate them using regular expressions.


Solution
-------------------------------------------------------------------------------------------------------------------------------------

Code

import re
li=['9999999999','999999-999','99999x9999']
for val in li:
    if re.match(r'[8-9]{1}[0-9]{9}',val) and len(val) == 10:
        print 'yes'
    else:
        print 'no'

Output:
yes
no
no
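
The solution above pairs re.match with a separate len() check, because re.match only anchors at the start of the string. A possible variant, sketched below, anchors the end inside the pattern with "\Z" so the length check is no longer needed. The helper name is_valid_phone and the extra 11-digit test value are my own assumptions, and the code should run under Python 2 or Python 3.

```python
import re

# Hypothetical helper; the pattern follows the rule stated in Problem 6.
PHONE = re.compile(r'[89][0-9]{9}\Z')

def is_valid_phone(val):
    # re.match anchors at the start; "\Z" anchors at the very end,
    # so anything longer or shorter than 10 digits is rejected.
    return PHONE.match(val) is not None

numbers = ['9999999999', '999999-999', '99999x9999', '99999999990']
results = [is_valid_phone(n) for n in numbers]

print(results)  # [True, False, False, False]
```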



******************************************************
* Problem 7: Split a string with multiple delimiters *
******************************************************


Solution
---------------------------------------------------------------------------------------------------------------------------

Code

import re
line = 'asdf fjdk;afed,fjek,asdf,foo' # String has multiple delimiters (";",","," ").
result= re.split(r'[;,\s]', line)
print result

Output:
['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']


We can also use the re.sub() method to replace each of these delimiters with a single space " ".

Code

import re
line = 'asdf fjdk;afed,fjek,asdf,foo'
result= re.sub(r'[;,\s]',' ', line)
print result

Output:
asdf fjdk afed fjek asdf foo
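
The character class above treats every delimiter character as its own split point, which matters once two delimiters sit next to each other (for example a comma followed by a space). The sketch below uses a variant of the sample line, made up for illustration, to show how adding "+" collapses a run of delimiters into one. It should run under Python 2 or Python 3.

```python
import re

# Hypothetical variant of the sample line, with spaces after some delimiters.
line = 'asdf fjdk; afed, fjek,asdf, foo'

single = re.split(r'[;,\s]', line)    # each delimiter character splits on its own
grouped = re.split(r'[;,\s]+', line)  # a run of delimiters counts as one

print(single)   # ['asdf', 'fjdk', '', 'afed', '', 'fjek', 'asdf', '', 'foo']
print(grouped)  # ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
```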



**************************************************
* Problem 8: Retrieve Information from HTML file *
**************************************************


I want to extract information from an HTML file (see the sample data below). We need the text between <td> and </td>, except for the first numerical index. I assume here that the HTML below is stored in a string named str.


Sample HTML file (str)

<tr align="center"><td>1</td> <td>Noah</td> <td>Emma</td></tr>
<tr align="center"><td>2</td> <td>Liam</td> <td>Olivia</td></tr>
<tr align="center"><td>3</td> <td>Mason</td> <td>Sophia</td></tr>
<tr align="center"><td>4</td> <td>Jacob</td> <td>Isabella</td></tr>
<tr align="center"><td>5</td> <td>William</td> <td>Ava</td></tr>
<tr align="center"><td>6</td> <td>Ethan</td> <td>Mia</td></tr>
<tr align="center"><td>7</td> <td>Michael</td> <td>Emily</td></tr>

Solution:

Code

result=re.findall(r'<td>\w+</td>\s<td>(\w+)</td>\s<td>(\w+)</td>',str)
print result

Output:
[('Noah', 'Emma'), ('Liam', 'Olivia'), ('Mason', 'Sophia'), ('Jacob', 'Isabella'), ('William', 'Ava'), ('Ethan', 'Mia'), ('Michael', 'Emily')]
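
The pattern above only matches a bare <td> tag, so a cell written with attributes would cause its whole row to be skipped. The sketch below is one way to loosen it, using "<td[^>]*>" to accept optional attributes. The variable name html_rows (chosen to avoid shadowing the built-in str) and the class="x" attribute are assumptions added for illustration. It should run under Python 2 or Python 3.

```python
import re

# Three of the sample rows; the class="x" attribute on row 3 is hypothetical.
html_rows = (
    '<tr align="center"><td>1</td> <td>Noah</td> <td>Emma</td></tr>\n'
    '<tr align="center"><td>2</td> <td>Liam</td> <td>Olivia</td></tr>\n'
    '<tr align="center"><td class="x">3</td> <td>Mason</td> <td>Sophia</td></tr>\n'
)

# "<td[^>]*>" also accepts a <td> tag that carries attributes.
pattern = r'<td[^>]*>\w+</td>\s<td[^>]*>(\w+)</td>\s<td[^>]*>(\w+)</td>'
names = re.findall(pattern, html_rows)

print(names)  # [('Noah', 'Emma'), ('Liam', 'Olivia'), ('Mason', 'Sophia')]
```

This still depends on the exact layout of the rows (one whitespace character between cells), so it suits small, regular snippets like the sample rather than arbitrary HTML.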



You can read an HTML file with the urllib2 library (see the code below).

Code

import urllib2
response = urllib2.urlopen('')
html = response.read()