Scrape-Files.ps1 (v2.1)

<#
Sequential Web Scraper (for PowerShell) v2.1
Written by Aaron Loessberg-Zahl
Last modified 3 April 2015

Scrapes images/files/etc. that are named sequentially. Properly skips files that return an HTTP 404 error.

For comments/questions/bugs, please contact <[email protected]>.

----------------------------------------------------------------------------
"THE LUNCH-WARE LICENSE" (Revision 2659):
<[email protected]> wrote this file. As long as you retain this
notice, you can do whatever you want with this stuff. If we meet some day,
and you think this stuff is worth it, you can buy me lunch in return.
----------------------------------------------------------------------------

Changelog:
v1.0    2012-10-04      amloessb        Created and debugged
v2.0    2014-08-15      amloessb        Converted to use a template, with the string "\NUM\" where the counter should be placed
                                        Added a random (100 ms - 1 s) wait between downloads, to attempt to avoid throttling
v2.1    2015-04-03      amloessb        Added help documentation
#>

<#
.SYNOPSIS
Scrapes images/files/etc. that are named sequentially.

.DESCRIPTION
Given a URL template and a range of numbers, this script attempts to access each web resource sequentially and, if it exists, saves it to disk.

.PARAMETER UrlTemplate
The template form of the URLs to be scraped. Replace the actual file/image number with the string "\NUM\".

.PARAMETER Padding
The number of digits to which the counter should be zero-padded. For example, a Padding value of 3 will produce 009, 010, 011, etc.

.PARAMETER StartNumber
The number from which the script should start scraping.

.PARAMETER EndNumber
The last number the script should try to scrape.

.PARAMETER SaveTo
The local directory path where the scraped files should be saved.

.EXAMPLE
.\Scrape-Files.ps1 -UrlTemplate "http://www.site.com/images/adorable_cats_\NUM\.jpg" -Padding 3 -StartNumber 6 -EndNumber 53 -SaveTo "C:\Users\$($ENV:USERNAME)\Pictures\Cats"
Attempts to save http://www.site.com/images/adorable_cats_006.jpg through http://www.site.com/images/adorable_cats_053.jpg to the Cats folder in the current user's My Pictures.

.INPUTS
None. You cannot pipe objects to this script.

.OUTPUTS
None. This script does not produce any pipeable output.

.LINK
None.
#>

Param (
    [String] $UrlTemplate,
    [Int] $Padding,
    [Int] $StartNumber,
    [Int] $EndNumber,
    [String] $SaveTo
)

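# Note on Padding: the counter is zero-padded using the .NET "D" numeric format
# specifier, built from $Padding in the download loop below. For example,
# "{0:D3}" -f 9 yields "009" and "{0:D3}" -f 53 yields "053".
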
# Returns every element of $arr except the last. If $sep is given, the kept
# elements are concatenated into a single string with $sep appended after each
# one; otherwise they are returned as an array.
Function butLast ([Array] $arr, [String] $sep) {
    $return = ""
    $num = 0
    If ($sep) {
        While ($num -le ($arr.Length - 2)) {
            $return += $arr[$num] + $sep
            $num ++
        }
        Return $return
    }
    Return $arr[0..($arr.Length - 2)]
}

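# For example (values taken from the help example above): given
#   $splitURL = @("http:", "", "www.site.com", "images", "adorable_cats_\NUM\.jpg")
# the call (butLast $splitURL "/") returns "http://www.site.com/images/",
# i.e. everything up to and including the final slash of the template.
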
# Stop on all errors so that failed downloads land in the Try/Catch below
$ErrorActionPreference = "Stop"

Add-Type -AssemblyName System.Web

# Split the template into its path segments, then split the filename around
# the "\NUM\" placeholder (backslash-escaped because -split takes a regex)
$splitURL = $UrlTemplate.split("/")
$filename = $splitURL[-1]
$arrURL = $filename -split "\\NUM\\"
$currentLink = $StartNumber

$wc = New-Object System.Net.WebClient

$badLinks = 0

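# Worked example (values from the help example above): for the template
# "http://www.site.com/images/adorable_cats_\NUM\.jpg" with -Padding 3,
#   $arrURL[0] = "adorable_cats_"    $arrURL[1] = ".jpg"
# so the first iteration with -StartNumber 6 requests
#   http://www.site.com/images/adorable_cats_006.jpg
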
While ($currentLink -le $EndNumber) {
    # Rebuild the URL with the current counter, zero-padded to $Padding digits
    $url = (butLast $splitURL "/") + $arrURL[0] + ("{0:D$Padding}" -f $currentLink) + $arrURL[1]
    $nameOnDisk = $url.split("/")[-1]
    Write-Progress -Activity "Sequential Web Scraper" -Status "Scraping files..." -CurrentOperation $nameOnDisk
    Try {
        $wc.DownloadFile($url, (Join-Path $SaveTo $nameOnDisk))
    } Catch {
        # Typically an HTTP 404; count it and move on to the next number
        $badLinks ++
    }
    $currentLink ++
    # Random 100 ms - 1 s pause between downloads, to attempt to avoid throttling
    Start-Sleep -Milliseconds (Get-Random -Minimum 100 -Maximum 1000)
}

Write-Progress -Activity "Sequential Web Scraper" -Status "Completed" -Completed
Write-Host "Done. $badLinks link(s) were skipped due to download errors."