dimipan80

Java Regex: Extract Hyperlinks

Aug 5th, 2017
/*
 * Write a program to extract all hyperlinks (<href=…>) from a given text.
 * The text comes from the console on a variable number of lines and ends with the command "END".
 * Print to the console the href values in the text. The input text is standard HTML code.
 * It may hold many tags and can be formatted in many different ways (with or without whitespace).
 * The <a> elements may have many attributes, not only href.
 * You should extract only the values of the href attributes of all <a> elements.
 * The input will be a well-formed HTML fragment (all tags and attributes will be correctly closed).
 * Attribute values will never hold tags and hyperlinks, e.g. "<img alt='<a href="hello">' />" is invalid.
 * Commented links are also extracted.
 * The number of input lines will be in the range [1 ... 100].
 * Print each href value on a separate line, in the order they come from the input.
 */

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractHyperlinks_Regex {
    public static void main(String[] args) throws IOException {
        StringBuilder htmlText = new StringBuilder();

        BufferedReader reader =
                new BufferedReader(new InputStreamReader(System.in));

        // Read until "END"; the null check guards against input that ends without it.
        String inputLine = reader.readLine();
        while (inputLine != null && !inputLine.equals("END")) {
            // Keep the line break so a tag split across lines ("<a" / "href=...") is not glued into "<ahref".
            htmlText.append(inputLine).append('\n');
            inputLine = reader.readLine();
        }

        // "<a" must be followed by whitespace, so tags such as <area> or <abbr> are skipped,
        // and href must directly follow "<a " or another whitespace-separated attribute.
        // Group 1 is the opening quote (", ' or nothing); group 2 is the href value.
        Pattern hyperlinkPatt = Pattern.compile(
                "<a\\s+(?:[^>]*?\\s)?href\\s*=\\s*(\"|'|\\s?)(.+?)\\1(?=\\s|>)");
        Matcher match = hyperlinkPatt.matcher(htmlText);

        while (match.find()) {
            System.out.println(match.group(2));
        }
    }
}
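
To try the matching logic without typing input at the console, the extraction can be pulled into a helper and run against a fixed fragment. This is a minimal sketch rather than part of the original solution: the class name HrefExtractorSketch, the method extractHrefs, and the sample HTML are all hypothetical, and the pattern is a variant that assumes "<a" is always followed by whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefExtractorSketch {
    // Assumption: "<a" is followed by whitespace, which keeps <area>, <abbr> etc. out.
    // Group 1 captures the opening quote (", ' or nothing); group 2 captures the href value.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s+(?:[^>]*?\\s)?href\\s*=\\s*(\"|'|\\s?)(.+?)\\1(?=\\s|>)");

    // Hypothetical helper: returns every href value in the order it appears.
    public static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            hrefs.add(m.group(2));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Made-up sample covering double-quoted, single-quoted and unquoted values.
        String html = "<a href=\"http://softuni.bg\" class=\"new\">x</a>\n"
                + "<a class='c' href = '/forum'>y</a> <a href=/about>z</a>";
        for (String href : extractHrefs(html)) {
            System.out.println(href);
        }
    }
}
```

Run on this sample, the sketch prints http://softuni.bg, /forum and /about, one per line. Returning a list instead of printing inside the loop keeps the regex logic easy to check against other fragments.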