<#
Sequential Web Scraper (for PowerShell) v2.1
Written by Aaron Loessberg-Zahl
Last modified 3 April 2015

Scrapes images/files/etc. that are named sequentially. Properly skips files
that return an HTTP 404 error.

For comments/questions/bugs, please contact <[email protected]>.

----------------------------------------------------------------------------
"THE LUNCH-WARE LICENSE" (Revision 2659):
<[email protected]> wrote this file. As long as you retain this
notice, you can do whatever you want with this stuff. If we meet some day,
and you think this stuff is worth it, you can buy me lunch in return.
----------------------------------------------------------------------------

Changelog:
v1.0  2012-10-04  amloessb  Created and debugged
v2.0  2014-08-15  amloessb  Converted to use a template, with the string "\NUM\" where the counter should be placed
                            Added a random (100ms-1s) wait between downloads, to attempt to avoid throttling
v2.1  2015-04-03  amloessb  Added help documentation
#>
<#
.SYNOPSIS
Scrapes images/files/etc. that are named sequentially.

.DESCRIPTION
Given a URL template and a range of numbers, this script attempts to access each web resource in sequence and, if it exists, saves it to disk.

.PARAMETER UrlTemplate
The template form of the URLs to be scraped. Replace the actual file/image number with the string "\NUM\".

.PARAMETER Padding
The number of digits to which the counter should be zero-padded. For example, a Padding value of 3 will produce 009, 010, 011, etc.

.PARAMETER StartNumber
The number from which the script should start scraping.

.PARAMETER EndNumber
The last number the script should try to scrape.

.PARAMETER SaveTo
The local directory path where the scraped files should be saved.

.EXAMPLE
.\Scrape-Files.ps1 -UrlTemplate "http://www.site.com/images/adorable_cats_\NUM\.jpg" -Padding 3 -StartNumber 6 -EndNumber 53 -SaveTo "C:\Users\$($ENV:USERNAME)\Pictures\Cats"

Attempts to save http://www.site.com/images/adorable_cats_006.jpg through http://www.site.com/images/adorable_cats_053.jpg to the Cats folder in the current user's Pictures directory.

.INPUTS
None. You cannot pipe objects to this script.

.OUTPUTS
None. This script does not produce any pipeable output.

.LINK
None.
#>
Param (
    [String] $UrlTemplate,
    [Int] $Padding,
    [Int] $StartNumber,
    [Int] $EndNumber,
    [String] $SaveTo
)
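
# The Padding parameter feeds the .NET "D" (decimal) format specifier used
# in the download loop below. A quick illustration of what it produces:
#
#   "{0:D3}" -f 9    # -> "009"
#   "{0:D3}" -f 53   # -> "053"
#   "{0:D1}" -f 53   # -> "53"  (numbers are only padded, never truncated)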
# Returns every element of $arr except the last. If $sep is given, the
# elements are joined into a single string with $sep appended after each one
# (including a trailing separator); otherwise an array slice is returned.
Function butLast ([Array] $arr, [String] $sep) {
    $return = ""
    $num = 0
    If ($sep) {
        While ($num -le ($arr.Length - 2)) {
            $return += $arr[$num] + $sep
            $num ++
        }
        Return $return
    }
    Return $arr[0..($arr.Length - 2)]
}
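
# A quick sketch of what butLast returns (illustrative values only):
#
#   butLast @("http:", "", "www.site.com", "images", "cats.jpg") "/"
#   # -> "http://www.site.com/images/"   (note the trailing separator)
#
#   butLast @(1, 2, 3)
#   # -> 1, 2   (no separator: an array slice is returned instead)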
$ErrorActionPreference = "Stop"

# System.Web is loaded here, though nothing below appears to depend on it.
Add-Type -AssemblyName System.Web

# Split the template into its path segments, then split the filename portion
# around the "\NUM\" placeholder (the -split pattern is a regex, so the
# backslashes are escaped).
$splitURL = $UrlTemplate.split("/")
$filename = $splitURL[-1]
$arrURL = $filename -split "\\NUM\\"

# Make sure the destination directory exists before downloading into it.
If (-not (Test-Path $SaveTo)) {
    New-Item -ItemType Directory -Path $SaveTo | Out-Null
}

$currentLink = $StartNumber
$wc = New-Object System.Net.WebClient
$badLinks = 0
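
# Using the template from the help example above, the pieces built so far
# would look like this (illustrative only):
#
#   $splitURL -> "http:", "", "www.site.com", "images", "adorable_cats_\NUM\.jpg"
#   $arrURL   -> "adorable_cats_", ".jpg"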
While ($currentLink -le $EndNumber) {
    # Rebuild the URL: base path + filename prefix + zero-padded counter + suffix.
    $url = (butLast $splitURL "/") + $arrURL[0] + ("{0:D$Padding}" -f $currentLink) + $arrURL[1]
    $nameOnDisk = $url.split("/")[-1]
    Write-Progress -Activity "Sequential Web Scraper" -Status "Scraping files..." -CurrentOperation $nameOnDisk
    Try {
        $wc.DownloadFile($url, (Join-Path $SaveTo $nameOnDisk))
    } Catch {
        # Any failed download (e.g. an HTTP 404) is counted and skipped.
        $badLinks ++
    }
    $currentLink ++
    # Random 100ms-1s pause between downloads, to attempt to avoid throttling.
    Start-Sleep -Milliseconds (Get-Random -Minimum 100 -Maximum 1000)
}
Write-Progress -Activity "Sequential Web Scraper" -Status "Completed" -Completed

# Report how many URLs in the range could not be downloaded and were skipped.
Write-Host "Done. Skipped $badLinks bad link(s)."
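
# For the help example (Padding 3, StartNumber 6, EndNumber 53), the loop
# above requests, in order:
#
#   http://www.site.com/images/adorable_cats_006.jpg
#   http://www.site.com/images/adorable_cats_007.jpg
#   ...
#   http://www.site.com/images/adorable_cats_053.jpg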