Untitled

a guest
Oct 17th, 2014
  1. PCRE & UTF
  6.  
  7.  
  8. @link http://www.pcre.org/pcre.txt @author Philip Hazel - University of Cambridge
  9. UTF-8 AND UNICODE PROPERTY SUPPORT
  10.  
  11. From release 3.3, PCRE has had some support for character strings encoded in the UTF-8 format. For release 4.0
  12. this was greatly extended to cover most common requirements, and in release 5.0 additional support for Unicode
  13. general category properties was added.
  14.  
  15. In order to process UTF-8 strings, you must build PCRE to include UTF-8 support in the code, and, in addition,
  16. you must call pcre_compile() with the PCRE_UTF8 option flag. When you do this, both the pattern and any subject
  17. strings that are matched against it are treated as UTF-8 strings instead of just strings of bytes.
  18.  
  19. If you compile PCRE with UTF-8 support, but do not use it at run time, the library will be a bit bigger, but the
  20. additional run time overhead is limited to testing the PCRE_UTF8 flag occasionally, so should not be very big.
  21.  
  22. If you are using PCRE in a non-UTF application that permits users to supply arbitrary patterns for compilation, you
  23. should be aware of a feature that allows users to turn on UTF support from within a pattern, provided that PCRE was
  24. built with UTF support. For example, an 8-bit pattern that begins with "(*UTF8)" or "(*UTF)" turns on UTF-8 mode,
  25. which interprets patterns and subjects as strings of UTF-8 characters instead of individual 8-bit characters. This
  26. causes both the pattern and any data against which it is matched to be checked for UTF-8 validity. If the data string
  27. is very long, such a check might use sufficiently many resources as to cause your application to lose performance.
  28.  
  29. Alternatively, from release 8.33, you can set the PCRE_NEVER_UTF option at compile time. This
  30. causes a compile-time error if a pattern contains a UTF-setting sequence.
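
For builds older than 8.33, where PCRE_NEVER_UTF is not available, an application could pre-screen user-supplied patterns itself. The following is only a rough Python sketch of that idea, not part of PCRE: the helper name has_utf_setting() and the regular expression for leading "(*NAME)" items are assumptions made for illustration. The screen is deliberately conservative and may also flag patterns in which PCRE would not actually treat the sequence as an option setting.

```python
import re

# A leading special item looks like (*NAME) or (*NAME=digits).
# This shape is an assumption for illustration; it is not taken from PCRE's parser.
_LEADING_ITEM = re.compile(r"\(\*([A-Z0-9_]+)(?:=\d+)?\)")

# UTF-setting sequences named in the notes above.
UTF_VERBS = {"UTF8", "UTF16", "UTF32", "UTF"}


def has_utf_setting(pattern: str) -> bool:
    """Return True if a UTF-setting item appears among the leading (*...) items."""
    pos = 0
    while True:
        item = _LEADING_ITEM.match(pattern, pos)
        if item is None:
            # No further leading items: the rest is ordinary pattern text.
            return False
        if item.group(1) in UTF_VERBS:
            return True
        pos = item.end()


if __name__ == "__main__":
    for candidate in ("(*UTF8)abc", "(*LIMIT_MATCH=10)(*UTF)x", "abc(*UTF8)"):
        print(candidate, has_utf_setting(candidate))
```

A pattern that is flagged can then be rejected before it ever reaches pcre_compile().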
  31.  
  32. In order to process UTF-8 strings, you must build PCRE to include UTF-8 support in the code, and, in addition, you
  33. must call pcre_compile() with the PCRE_UTF8 option flag, or the pattern must start with the sequence (*UTF8). When
  34. either of these is the case, both the pattern and any subject strings that are matched against it are treated as
  35. UTF-8 strings instead of strings of 1-byte characters.
  36.  
  37.  
  38. VALIDITY OF UTF-8 STRINGS
  39.  
  40. When you set the PCRE_UTF8 flag, the byte strings passed as patterns and subjects are (by default) checked for
  41. validity on entry to the relevant functions. The entire string is checked before any other processing takes
  42. place. From release 7.3 of PCRE, the check is according to the rules of RFC 3629, which are themselves derived from
  43. the Unicode specification. Earlier releases of PCRE followed the rules of RFC 2279, which allows the full range
  44. of 31-bit values (0 to 0x7FFFFFFF). The current check allows only values in the range U+0 to U+10FFFF, excluding
  45. the surrogate area. (From release 8.33 the so-called "non-character" code points are no longer excluded because
  46. Unicode corrigendum #9 makes it clear that they should not be.)
  47.  
  48. Characters in the "Surrogate Area" of Unicode are reserved for use by UTF-16, where they are used in pairs to
  49. encode codepoints with values greater than 0xFFFF. The code points that are encoded by UTF-16 pairs are available
  50. independently in the UTF-8 and UTF-32 encodings. (In other words, the whole surrogate thing is a fudge for UTF-16
  51. which unfortunately messes up UTF-8 and UTF-32.)
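
As a worked example of the surrogate mechanism, here is a small Python sketch. The function name utf16_surrogate_pair() is made up for illustration; the arithmetic is the standard UTF-16 rule (subtract 0x10000, split the remaining 20 bits into two 10-bit halves). It shows that a code point above 0xFFFF needs a surrogate pair in UTF-16 but is encoded directly in UTF-8.

```python
def utf16_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above 0xFFFF into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points 0x10000..0x10FFFF use surrogate pairs")
    offset = code_point - 0x10000           # 20 significant bits remain
    high = 0xD800 + (offset >> 10)          # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits
    return high, low


if __name__ == "__main__":
    cp = 0x1F600
    high, low = utf16_surrogate_pair(cp)
    print(f"UTF-16 pair: {high:04X} {low:04X}")
    # UTF-8 encodes the same code point directly, with no surrogates involved.
    print("UTF-8 bytes:", chr(cp).encode("utf-8").hex())
```

Because the values 0xD800 to 0xDFFF exist only to serve this pairing, they are not valid code points on their own, which is why a UTF-8 validity check rejects them.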
  52.  
  53. If an invalid UTF-8 string is passed to PCRE, an error return is given.
  54.  
  55.  
  56.  
  57.  
  58.  
  59.  
  60.  
  61. PCRE CHANGELOG
  66. // Release 8.33 28-May-2013
  67.  
  68. Version 8.33 28-May-2013
  69. ---------------------
  70. 00. (*LIMIT_MATCH=d), (*LIMIT_RECURSION=d) added so the pattern can specify lower limits for the matching process.
  71. 35. Implement PCRE_NEVER_UTF to lock out the use of UTF, in particular, blocking (*UTF) etc.
  72.  
  73. Version 8.32 30-November-2012
  74. ---------------------
  75. 14. Applied user-supplied patch to pcrecpp.cc to allow PCRE_NO_UTF8_CHECK to be set
  76. 24. Add support for 32-bit character strings, and UTF-32
  77. 25. (*UTF) can now be used to start a pattern in any of the three libraries.
  78. 30. In 8-bit UTF-8 mode, pcretest failed to give an error for data codepoints greater than 0x7fffffff (which cannot be
  79. represented in UTF-8, even under the "old" RFC 2279). Instead, it ended up passing a negative length to pcre_exec()
  80.  
  81. Version 7.9 11-Apr-09
  82. ---------------------
  83. 28. Added support for (*UTF8) at the start of a pattern.
  84.  
  85. Version 7.3 28-Aug-07
  86. ---------------------
  87. 15. Updated the test for a valid UTF-8 string to conform to the later RFC 3629.
  88. This restricts code points to be within the range 0 to 0x10FFFF, excluding
  89. the surrogate range 0xD800 to 0xDFFF. Previously, PCRE allowed the
  90. full range 0 to 0x7FFFFFFF, as defined by RFC 2279. Internally, it still
  91. does: it's just the validity check that is more restrictive.
  92.  
  93. Version 4.4 21-Aug-03
  94. ---------------------
  96. PCRE checks UTF-8 strings for validity by default. There is an option to suppress
  97. this, just in case anybody wants that teeny extra bit of performance.
  98.  
  99. Version 4.4 13-Aug-03
  100. ---------------------
  101. 10. By default, when in UTF-8 mode, PCRE now checks for valid UTF-8 strings at
  102. both compile and run time, and gives an error if an invalid UTF-8 sequence
  103. is found. There is an option for disabling this check in cases where the
  104. string is known to be correct and/or the maximum performance is wanted.
  105.  
  106. Version 3.3 01-Aug-00
  107. ---------------------
  108. 7. Added the beginnings of support for UTF-8 character strings.
  109.  
  110.  
  111.  
  112.  
  113.  
  114. PCRE PHP.INI CONFIGURATION OPTIONS
  115.  
  116. @link http://php.net/manual/en/pcre.configuration.php "PCRE Configuration Options"
  117.  
  118. Two PCRE INI options have been available since PHP 5.2.0:
  119.  
  120. pcre.backtrack_limit 1000000
  121. PCRE's backtracking limit. Defaults to 100000 for PHP < 5.3.7.
  122.  
  123. pcre.recursion_limit 100000
  124. PCRE's recursion limit. Please note that if you set this value too high you may consume all the available
  125. process stack and eventually crash PHP (due to reaching the stack size limit imposed by the OS).
  126.  
  127.  
  128.  
  129.  
  130.  
  131. PCRE CRASHES FROM REGEXES
  132.  
  133. // Release 8.33 28-May-2013
  134. // (*LIMIT_MATCH=d) and (*LIMIT_RECURSION=d) have been added so that the creator of a pattern can specify lower (but not higher) limits for the matching process.
  135.  
  136.  
  137. PCRE_EXTRA_MATCH_LIMIT can be accessed through the set_match_limit()
  138. and match_limit() member functions. Setting match_limit to a non-zero value will limit the execution of
  139. pcre to keep it from doing bad things like blowing the stack or taking an eternity to return a result. A value
  140. of 5000 is good enough to stop stack blowup in a 2MB thread stack. Setting match_limit to zero disables match
  141. limiting. Alternatively, you can call match_limit_recursion() which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to limit
  142. how much PCRE recurses. match_limit() limits the number of matches PCRE does; match_limit_recursion() limits the
  143. depth of internal recursion, and therefore the amount of stack that is used.
  144.  
  145. The match_limit field provides a means of preventing PCRE from using up a vast amount of resources when running
  146. patterns that are not going to match, but which have a very large number of possibilities in their search trees. The
  147. classic example is the use of nested unlimited repeats.
  148.  
  149. Internally, PCRE uses a function called match() which it calls repeatedly (sometimes recursively). The limit set
  150. by match_limit is imposed on the number of times this function is called during a match, which has the effect of
  151. limiting the amount of backtracking that can take place. For patterns that are not anchored, the count restarts
  152. from zero for each position in the subject string.
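
To make the effect of nested unlimited repeats concrete, here is a toy Python model. It is not PCRE's match() function: match_nested_repeat() and MatchLimitExceeded are hypothetical names, and the matcher only handles the single anchored pattern ^(a+)+b. It counts calls to its own recursive match() and gives up once a caller-supplied limit is exceeded, in the spirit of match_limit.

```python
class MatchLimitExceeded(Exception):
    """Toy stand-in for the PCRE_ERROR_MATCHLIMIT return value."""


def match_nested_repeat(subject: str, match_limit: int):
    """Naively match ^(a+)+b and return (matched, number_of_match_calls)."""
    calls = 0

    def match(pos: int) -> bool:
        # One call handles one iteration of the group (a+) starting at pos.
        nonlocal calls
        calls += 1
        if calls > match_limit:
            raise MatchLimitExceeded(f"more than {match_limit} calls to match()")
        end = pos
        while end < len(subject) and subject[end] == "a":
            end += 1
        # Greedy first, then backtrack to shorter runs of 'a'.
        for stop in range(end, pos, -1):
            if stop < len(subject) and subject[stop] == "b":
                return True
            if match(stop):  # try another iteration of the group
                return True
        return False

    matched = match(0)
    return matched, calls


if __name__ == "__main__":
    print(match_nested_repeat("aaab", 100))   # matches immediately
    print(match_nested_repeat("aaac", 100))   # fails after trying every split
    try:
        match_nested_repeat("a" * 20 + "c", 1000)
    except MatchLimitExceeded as exc:
        print("gave up:", exc)
```

In this model the number of calls doubles with every extra "a" in a non-matching subject, which is the kind of search tree the limit is meant to cut short.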
  153.  
  154. The default value for the limit can be set when PCRE is built; the default default is 10 million, which handles all
  155. but the most extreme cases. You can override the default by supplying pcre_exec() with a pcre_extra block in which
  156. match_limit is set, and PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is exceeded, pcre_exec()
  157. returns PCRE_ERROR_MATCHLIMIT.
  158.  
  159. The match_limit_recursion field is similar to match_limit, but instead of limiting the total number of times
  160. that match() is called, it limits the depth of recursion. The recursion depth is a smaller number than the total
  161. number of calls, because not all calls to match() are recursive. This limit is of use only if it is set smaller
  162. than match_limit.
  163.  
  164. Limiting the recursion depth limits the amount of stack that can be used, or, when PCRE has been compiled to use
  165. memory on the heap instead of the stack, the amount of heap memory that can be used.
  166.  
  167. The default value for match_limit_recursion can be set when PCRE is built; the default default is the same value
  168. as the default for match_limit. You can override the default by supplying pcre_exec() with a pcre_extra block in
  169. which match_limit_recursion is set, and PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the limit
  170. is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
  171.  
  172.  
  173.  
  174.  
  175.  
  176. preg_match()
  181.  
  182. preg_match() returns 1 if the pattern matches given subject, 0 if it does not, or FALSE if an error occurred.
  183.  
  184. u (PCRE_UTF8) This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and
  185. subject strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP
  186. 4.2.3 on win32. UTF-8 validity of the pattern and the subject is checked since PHP 4.3.5. An invalid subject will
  187. cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and
  188. six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have
  189. been regarded as valid UTF-8.
  190.  
  191. With the PCRE_UTF8 modifier 'u', preg_match() fails silently on strings containing invalid UTF-8 byte sequences. It
  192. does not reject character codes above U+10FFFF (represented by 4 or more octets), though.
  193.  
  194. Originally, this function checked according to RFC 2279, allowing for values in the range 0 to 0x7fffffff, up to 6
  195. bytes long, but ensuring that they were in the canonical format. Once somebody had pointed out RFC 3629 to me (it
  196. obsoletes 2279), additional restrictions were applied. The values are now limited to be between 0 and 0x0010ffff,
  197. no more than 4 bytes long, and the subrange 0xd800 to 0xdfff is excluded. However, the format of 5-byte and 6-byte
  198. characters is still checked.
  199.  
  200.  
  201.  
  202. BACKTRACKING CONTROL
  203.  
  204. The following are recognized only at the start of a pattern:
  205.  
  206. (*LIMIT_MATCH=d) set the match limit to d (decimal number) ( added 8.33 28-May-2013 )
  207. (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number) ( added 8.33 28-May-2013 )
  208.  
  209. (*UTF8) set UTF-8 mode: 8-bit library (PCRE_UTF8) ( added 7.9 11-Apr-09 )
  210. (*UTF16) set UTF-16 mode: 16-bit library (PCRE_UTF16) ( added with the 16-bit library in 8.30 )
  211. (*UTF32) set UTF-32 mode: 32-bit library (PCRE_UTF32) ( added 8.32 30-November-2012 )
  212. (*UTF) set appropriate UTF mode for the library in use ( added 8.32 30-November-2012 )
  213.  
  214. In order to process UTF-8 strings, you must build PCRE's 8-bit library with UTF support, and, in addition, you
  215. must call pcre_compile() with the PCRE_UTF8 option flag, or the pattern must start with the sequence (*UTF8) or
  216. (*UTF). When either of these is the case, both the pattern and any subject strings that are matched against it
  217. are treated as UTF-8 strings instead of strings of individual 1-byte characters.
  218.  
  219.  
  220.  
  221. PCRE UTF ERRORS
  222.  
  223. From release 8.13, more information about the details of the error is passed back in the returned value:
  224.  
  225. PCRE_UTF8_ERR0 No error
  226. PCRE_UTF8_ERR1 Missing 1 byte at the end of the string
  227. PCRE_UTF8_ERR2 Missing 2 bytes at the end of the string
  228. PCRE_UTF8_ERR3 Missing 3 bytes at the end of the string
  229. PCRE_UTF8_ERR4 Missing 4 bytes at the end of the string
  230. PCRE_UTF8_ERR5 Missing 5 bytes at the end of the string
  231. PCRE_UTF8_ERR6 2nd-byte's two top bits are not 0x80
  232. PCRE_UTF8_ERR7 3rd-byte's two top bits are not 0x80
  233. PCRE_UTF8_ERR8 4th-byte's two top bits are not 0x80
  234. PCRE_UTF8_ERR9 5th-byte's two top bits are not 0x80
  235. PCRE_UTF8_ERR10 6th-byte's two top bits are not 0x80
  236. PCRE_UTF8_ERR11 5-byte character is not permitted by RFC 3629
  237. PCRE_UTF8_ERR12 6-byte character is not permitted by RFC 3629
  238. PCRE_UTF8_ERR13 4-byte character with value > 0x10ffff is not permitted
  239. PCRE_UTF8_ERR14 3-byte character with value 0xd800-0xdfff is not permitted
  240. PCRE_UTF8_ERR15 Overlong 2-byte sequence
  241. PCRE_UTF8_ERR16 Overlong 3-byte sequence
  242. PCRE_UTF8_ERR17 Overlong 4-byte sequence
  243. PCRE_UTF8_ERR18 Overlong 5-byte sequence (won't ever occur)
  244. PCRE_UTF8_ERR19 Overlong 6-byte sequence (won't ever occur)
  245. PCRE_UTF8_ERR20 Isolated 0x80 byte (not within UTF-8 character)
  246. PCRE_UTF8_ERR21 Byte with the illegal value 0xfe or 0xff
  247. PCRE_UTF8_ERR22 Unused (was non-character)
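
The codes above can be illustrated with a simplified Python classifier. This is a sketch written for these notes, not a port of PCRE's validity function: first_utf8_error() is a hypothetical name, the order in which the checks run is an assumption, and ERR18/ERR19 are never produced because well-formed 5-byte and 6-byte sequences are reported as ERR11/ERR12.

```python
def first_utf8_error(data: bytes):
    """Return (code_name, offset) of the first invalid sequence, or ("PCRE_UTF8_ERR0", -1)."""
    i, n = 0, len(data)
    while i < n:
        c = data[i]
        if c < 0x80:                      # plain ASCII byte
            i += 1
            continue
        if c < 0xC0:                      # continuation byte with no lead byte
            return ("PCRE_UTF8_ERR20", i)
        if c >= 0xFE:                     # 0xfe and 0xff never occur in UTF-8
            return ("PCRE_UTF8_ERR21", i)
        # Number of continuation bytes announced by the lead byte.
        extra = 1 if c < 0xE0 else 2 if c < 0xF0 else 3 if c < 0xF8 else 4 if c < 0xFC else 5
        available = n - i - 1
        if available < extra:             # ERR1..ERR5: bytes missing at the end
            return (f"PCRE_UTF8_ERR{extra - available}", i)
        for k in range(1, extra + 1):     # ERR6..ERR10: bad continuation byte
            if data[i + k] & 0xC0 != 0x80:
                return (f"PCRE_UTF8_ERR{5 + k}", i)
        if extra >= 4:                    # 5- and 6-byte forms, banned by RFC 3629
            return ("PCRE_UTF8_ERR11" if extra == 4 else "PCRE_UTF8_ERR12", i)
        # Decode the code point to check its range.
        value = c & (0x7F >> (extra + 1))
        for k in range(1, extra + 1):
            value = (value << 6) | (data[i + k] & 0x3F)
        if value < (0x80, 0x800, 0x10000)[extra - 1]:
            return (f"PCRE_UTF8_ERR{14 + extra}", i)   # ERR15..ERR17: overlong
        if value > 0x10FFFF:
            return ("PCRE_UTF8_ERR13", i)
        if 0xD800 <= value <= 0xDFFF:
            return ("PCRE_UTF8_ERR14", i)
        i += extra + 1
    return ("PCRE_UTF8_ERR0", -1)


if __name__ == "__main__":
    for sample in (b"abc", b"\xe2\x82", b"\xc0\x80", b"\xed\xa0\x80", b"a\x80"):
        print(sample, first_utf8_error(sample))
```

The offset in the result is the index of the lead byte of the offending sequence, which is a choice made for this sketch.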
  248.  
  249.  
  250. PHP PCRE CONSTANTS
  251.  
  252. PREG_NO_ERROR - Returned by preg_last_error() if there were no errors. (since PHP 5.2.0)
  253. PREG_INTERNAL_ERROR - Returned by preg_last_error() if there was an internal PCRE error. (since PHP 5.2.0)
  254. PREG_BACKTRACK_LIMIT_ERROR - Returned by preg_last_error() if backtrack limit was exhausted. (since PHP 5.2.0)
  255. PREG_RECURSION_LIMIT_ERROR - Returned by preg_last_error() if recursion limit was exhausted. (since PHP 5.2.0)
  256. PREG_BAD_UTF8_ERROR - Returned by preg_last_error() if the last error was caused by malformed UTF-8 data (only when
  257. running a regex in UTF-8 mode). (since PHP 5.2.0)
  258. PREG_BAD_UTF8_OFFSET_ERROR - Returned by preg_last_error() if the offset didn't correspond to the beginning of a valid
  259. UTF-8 code point (only when running a regex in UTF-8 mode). (since PHP 5.3.0)
  260. PCRE_VERSION - PCRE version and release date (e.g. "7.0 18-Dec-2006"). (since PHP 5.2.4)
  261.  
  262. PCRE CONSTANTS ON MY INSTALL get_defined_constants()
  263.  
  264. 'PREG_PATTERN_ORDER' => 1,
  265. 'PREG_SET_ORDER' => 2,
  266. 'PREG_OFFSET_CAPTURE' => 256,
  267. 'PREG_SPLIT_NO_EMPTY' => 1,
  268. 'PREG_SPLIT_DELIM_CAPTURE' => 2,
  269. 'PREG_SPLIT_OFFSET_CAPTURE' => 4,
  270. 'PREG_GREP_INVERT' => 1,
  271. 'PREG_NO_ERROR' => 0,
  272. 'PREG_INTERNAL_ERROR' => 1,
  273. 'PREG_BACKTRACK_LIMIT_ERROR' => 2,
  274. 'PREG_RECURSION_LIMIT_ERROR' => 3,
  275. 'PREG_BAD_UTF8_ERROR' => 4,
  276. 'PREG_BAD_UTF8_OFFSET_ERROR' => 5,
  277. 'PCRE_VERSION' => '8.34 2013-12-15',
  278.  
  279.  
  280.  
  281.  
  282. iconv()
  287.  
  288. https://www.gnu.org/software/libiconv/
  289.  
  290. If you append the string //IGNORE, characters that cannot be represented in the target charset are silently discarded.
  291. Otherwise, str is cut from the first illegal character and an E_NOTICE is generated. ( since GNU libiconv 2002-01-13 )
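
The two behaviours described above (discard unrepresentable characters, or cut at the first one) can be mimicked with Python's own codecs. This is only an analogy to show the difference in output: it does not call libiconv, and convert_ignore() and convert_cut() are names invented for this sketch.

```python
def convert_ignore(text: str, target: str) -> bytes:
    """Drop characters that the target charset cannot represent (like //IGNORE)."""
    return text.encode(target, errors="ignore")


def convert_cut(text: str, target: str) -> bytes:
    """Stop at the first character that the target charset cannot represent."""
    try:
        return text.encode(target)
    except UnicodeEncodeError as exc:
        # exc.start is the index of the first unencodable character.
        return text[:exc.start].encode(target)


if __name__ == "__main__":
    sample = "na\u00efve caf\u00e9"
    print(convert_ignore(sample, "ascii"))
    print(convert_cut(sample, "ascii"))
```

With //IGNORE-style handling the rest of the string survives, while the cutting behaviour loses everything after the first offending character.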
  292.  
  293. In other words, iconv() appears to be intended for use when converting the contents of files - whereas mb_convert_encoding() is intended
  294. for use when juggling strings internally, e.g. strings that aren't being read/written to/from files, but exchanged with some other media.
  295.  
  296. ICONV CHARACTER SET ENCODINGS CONTAINING "UTF"
  297.  
  298. $ iconv -l
  299. - ISO-10646UTF-8
  300. - ISO-10646UTF8
  301. - UTF-7
  302. - UTF-8
  303. - UTF-16
  304. - UTF-16BE
  305. - UTF-16LE
  306. - UTF-32
  307. - UTF-32BE
  308. - UTF-32LE
  309. - UTF7
  310. - UTF8
  311. - UTF16
  312. - UTF16BE
  313. - UTF16LE
  314. - UTF32
  315. - UTF32BE
  316. - UTF32LE
  317.  
  318. If the string //IGNORE is appended to to-encoding, characters that cannot be converted are discarded and an error is printed after conversion.
  319.  
  320. ICONV IMPLEMENTATIONS - ICONV_IMPL CONSTANT
  321.  
  322. @link http://www.gnu.org/software/libc/manual/html_node/Other-iconv-Implementations.html "Some Details about other iconv Implementations"
  323. @link http://www.gnu.org/software/libc/manual/html_node/Locales.html "Locales and Internationalization"
  324.  
  325. "libiconv" - GNU libiconv is the native FreeBSD iconv implementation since 2002.
  326. "BSD iconv" - Konstantin Chuguev's iconv
  327. "glibc" - GNU glibc's iconv
  328. "unknown" - Not one of the above