- PCRE & UTF
- @link http://www.pcre.org/pcre.txt @author Philip Hazel - University of Cambridge
- UTF-8 AND UNICODE PROPERTY SUPPORT
- From release 3.3, PCRE has had some support for character strings encoded in the UTF-8 format. For release 4.0
- this was greatly extended to cover most common requirements, and in release 5.0 additional support for Unicode
- general category properties was added.
- In order to process UTF-8 strings, you must build PCRE to include UTF-8 support in the code, and, in addition,
- you must call pcre_compile() with the PCRE_UTF8 option flag. When you do this, both the pattern and any subject
- strings that are matched against it are treated as UTF-8 strings instead of just strings of bytes.
- If you compile PCRE with UTF-8 support, but do not use it at run time, the library will be a bit bigger, but the
- additional run time overhead is limited to testing the PCRE_UTF8 flag occasionally, so should not be very big.
- If you are using PCRE in a non-UTF application that permits users to supply arbitrary patterns for compilation, you
- should be aware of a feature that allows users to turn on UTF support from within a pattern, provided that PCRE was
- built with UTF support. For example, an 8-bit pattern that begins with "(*UTF8)" or "(*UTF)" turns on UTF-8 mode,
- which interprets patterns and subjects as strings of UTF-8 characters instead of individual 8-bit characters. This
- causes both the pattern and any data against which it is matched to be checked for UTF-8 validity. If the data string
- is very long, such a check might use sufficiently many resources as to cause your application to lose performance.
- Alternatively, from release 8.33, you can set the PCRE_NEVER_UTF option at compile time. This
- causes a compile-time error if a pattern contains a UTF-setting sequence.
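For PCRE builds older than 8.33, where PCRE_NEVER_UTF is not available, an application could screen user-supplied patterns itself before compiling them. The following is a minimal Python sketch of that idea, not part of PCRE: the function name has_utf_setting_prefix is hypothetical, and the sketch assumes that start-of-pattern items have the form (*NAME) or (*NAME=digits) and may be stacked one after another.

```python
import re

# One leading item such as (*UTF8), (*CRLF) or (*LIMIT_MATCH=1000).
# Assumption: items at the start of a pattern look like this and can be stacked.
_START_ITEM = re.compile(r"\(\*([A-Z0-9_]+)(?:=\d+)?\)")
_UTF_VERBS = {"UTF", "UTF8", "UTF16", "UTF32"}


def has_utf_setting_prefix(pattern):
    """Return True if the leading (*...) items of pattern include a UTF-setting one."""
    pos = 0
    while True:
        item = _START_ITEM.match(pattern, pos)
        if item is None:
            return False          # reached ordinary pattern text
        if item.group(1) in _UTF_VERBS:
            return True
        pos = item.end()          # skip this item and look at the next one


print(has_utf_setting_prefix("(*UTF8)abc"))
print(has_utf_setting_prefix("abc(*UTF8)"))
```

A caller would reject the pattern when the function returns True, so that a user cannot force a UTF-8 validity check on a very long subject string.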
- In order to process UTF-8 strings, you must build PCRE to include UTF-8 support in the code, and, in addition, you
- must call pcre_compile() with the PCRE_UTF8 option flag, or the pattern must start with the sequence (*UTF8). When
- either of these is the case, both the pattern and any subject strings that are matched against it are treated as
- UTF-8 strings instead of strings of 1-byte characters.
- VALIDITY OF UTF-8 STRINGS
- When you set the PCRE_UTF8 flag, the byte strings passed as patterns and subjects are (by default) checked for
- validity on entry to the relevant functions. The entire string is checked before any other processing takes
- place. From release 7.3 of PCRE, the check is according to the rules of RFC 3629, which are themselves derived from
- the Unicode specification. Earlier releases of PCRE followed the rules of RFC 2279, which allows the full range
- of 31-bit values (0 to 0x7FFFFFFF). The current check allows only values in the range U+0 to U+10FFFF, excluding
- the surrogate area. (From release 8.33 the so-called "non-character" code points are no longer excluded because
- Unicode corrigendum #9 makes it clear that they should not be.)
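The difference between the two rule sets can be stated at the level of code points. This is a minimal Python sketch of the ranges described above, not PCRE's own check; both function names are made up for illustration.

```python
def allowed_by_rfc2279(code_point):
    """Old rule: any 31-bit value, 0 to 0x7FFFFFFF."""
    return 0 <= code_point <= 0x7FFFFFFF


def allowed_by_rfc3629(code_point):
    """Current rule: U+0 to U+10FFFF, excluding the surrogate area."""
    in_range = 0 <= code_point <= 0x10FFFF
    is_surrogate = 0xD800 <= code_point <= 0xDFFF
    return in_range and not is_surrogate


for cp in (0x41, 0xD800, 0xFFFE, 0x10FFFF, 0x110000):
    print(hex(cp), allowed_by_rfc2279(cp), allowed_by_rfc3629(cp))
```

Note that the non-character U+FFFE passes the second check, matching the behaviour described for release 8.33 onwards.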
- Characters in the "Surrogate Area" of Unicode are reserved for use by UTF-16, where they are used in pairs to
- encode codepoints with values greater than 0xFFFF. The code points that are encoded by UTF-16 pairs are available
- independently in the UTF-8 and UTF-32 encodings. (In other words, the whole surrogate thing is a fudge for UTF-16
- which unfortunately messes up UTF-8 and UTF-32.)
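As a worked example of how UTF-16 uses the surrogate area, here is a short Python sketch that computes the surrogate pair for a code point above 0xFFFF. The function name utf16_surrogate_pair is hypothetical; the arithmetic is the standard UTF-16 encoding rule, and Python's own codecs are used only to cross-check it.

```python
def utf16_surrogate_pair(code_point):
    """Return the (high, low) surrogates that UTF-16 uses for code_point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need a surrogate pair")
    offset = code_point - 0x10000           # 20-bit value
    high = 0xD800 + (offset >> 10)          # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits
    return high, low


high, low = utf16_surrogate_pair(0x1F600)
print(hex(high), hex(low))                  # the UTF-16 pair
print(chr(0x1F600).encode("utf-8").hex())   # same code point, encoded directly in UTF-8
```

In UTF-8 the same code point is encoded directly as a 4-byte sequence, which is why the surrogate values themselves never need to appear in a valid UTF-8 string.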
- If an invalid UTF-8 string is passed to PCRE, an error return is given.
- ___ ___ ___ ___ ___ _ _
- | _ \/ __| _ \ __| / __| |_ __ _ _ _ __ _ ___| |___ __ _
- | _/ (__| / _| | (__| ' \/ _` | ' \/ _` / -_) / _ \/ _` |
- |_| \___|_|_\___| \___|_||_\__,_|_||_\__, \___|_\___/\__, |
- |___/ |___/
- Version 8.33 28-May-2013
- ---------------------
- 00. (*LIMIT_MATCH=d), (*LIMIT_RECURSION=d) added so the pattern can specify lower limits for the matching process.
- 35. Implement PCRE_NEVER_UTF to lock out the use of UTF, in particular, blocking (*UTF) etc.
- Version 8.32 30-November-2012
- ---------------------
- 14. Applied user-supplied patch to pcrecpp.cc to allow PCRE_NO_UTF8_CHECK to be set
- 24. Add support for 32-bit character strings, and UTF-32
- 25. (*UTF) can now be used to start a pattern in any of the three libraries.
- 30. In 8-bit UTF-8 mode, pcretest failed to give an error for data codepoints greater than 0x7fffffff (which cannot be
- represented in UTF-8, even under the "old" RFC 2279). Instead, it ended up passing a negative length to pcre_exec()
- Version 7.9 11-Apr-09
- ---------------------
- 28. Added support for (*UTF8) at the start of a pattern.
- Version 7.3 28-Aug-07
- ---------------------
- 15. Updated the test for a valid UTF-8 string to conform to the later RFC 3629.
- This restricts code points to be within the range 0 to 0x10FFFF, excluding
- the "low surrogate" sequence 0xD800 to 0xDFFF. Previously, PCRE allowed the
- full range 0 to 0x7FFFFFFF, as defined by RFC 2279. Internally, it still
- does: it's just the validity check that is more restrictive.
- Version 4.4 21-Aug-03
- ---------------------
- PCRE checks UTF-8 strings for validity by default. There is an option to suppress
- this, just in case anybody wants that teeny extra bit of performance.
- Version 4.4 13-Aug-03
- ---------------------
- 10. By default, when in UTF-8 mode, PCRE now checks for valid UTF-8 strings at
- both compile and run time, and gives an error if an invalid UTF-8 sequence
- is found. There is an option for disabling this check in cases where the
- string is known to be correct and/or the maximum performance is wanted.
- Version 3.3 01-Aug-00
- ---------------------
- 7. Added the beginnings of support for UTF-8 character strings.
- PCRE PHP.INI CONFIGURATION OPTIONS
- @link http://php.net/manual/en/pcre.configuration.php "PCRE Configuration Options"
- 2 PCRE INI options are available since PHP 5.2.0
- pcre.backtrack_limit 1000000
- PCRE's backtracking limit. Defaults to 100000 for PHP < 5.3.7.
- pcre.recursion_limit 100000
- PCRE's recursion limit. Please note that if you set this value too high you may consume all the available
- process stack and eventually crash PHP (due to reaching the stack size limit imposed by the OS).
- PCRE CRASHES FROM REGEXES
- // Release 8.33 28-May-2013
- // (*LIMIT_MATCH=d) and (*LIMIT_RECURSION=d) have been added so that the creator of a pattern can specify lower (but not higher) limits for the matching process.
- PCRE_EXTRA_MATCH_LIMIT can be accessed through the set_match_limit()
- and match_limit() member functions. Setting match_limit to a non-zero value will limit the execution of
- pcre to keep it from doing bad things like blowing the stack or taking an eternity to return a result. A value
- of 5000 is good enough to stop stack blowup in a 2MB thread stack. Setting match_limit to zero disables match
- limiting. Alternatively, you can call match_limit_recursion() which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to limit
- how much PCRE recurses. match_limit() limits the number of matches PCRE does; match_limit_recursion() limits the
- depth of internal recursion, and therefore the amount of stack that is used.
- The match_limit field provides a means of preventing PCRE from using up a vast amount of resources when running
- patterns that are not going to match, but which have a very large number of possibilities in their search trees. The
- classic example is the use of nested unlimited repeats.
- Internally, PCRE uses a function called match() which it calls repeatedly (sometimes recursively). The limit set
- by match_limit is imposed on the number of times this function is called during a match, which has the effect of
- limiting the amount of backtracking that can take place. For patterns that are not anchored, the count restarts
- from zero for each position in the subject string.
- The default value for the limit can be set when PCRE is built; the default default is 10 million, which handles all
- but the most extreme cases. You can override the default by supplying pcre_exec() with a pcre_extra block in which
- match_limit is set, and PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is exceeded, pcre_exec()
- returns PCRE_ERROR_MATCHLIMIT.
- The match_limit_recursion field is similar to match_limit, but instead of limiting the total number of times
- that match() is called, it limits the depth of recursion. The recursion depth is a smaller number than the total
- number of calls, because not all calls to match() are recursive. This limit is of use only if it is set smaller
- than match_limit.
- Limiting the recursion depth limits the amount of stack that can be used, or, when PCRE has been compiled to use
- memory on the heap instead of the stack, the amount of heap memory that can be used.
- The default value for match_limit_recursion can be set when PCRE is built; the default default is the same value
- as the default for match_limit. You can override the default by supplying pcre_exec() with a pcre_extra block in
- which match_limit_recursion is set, and PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the limit
- is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
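The effect of a match limit on nested unlimited repeats can be illustrated with a toy matcher. The Python sketch below is not PCRE's match() function: it is a hand-written backtracking matcher for the single pattern (a+)+b, anchored at offset 0, that counts its own calls and gives up once a caller-supplied limit is exceeded. The names match_nested_repeat and MatchLimitExceeded are made up for this example.

```python
class MatchLimitExceeded(Exception):
    """Raised when the toy matcher makes more calls than match_limit allows."""


def match_nested_repeat(subject, match_limit):
    """Toy backtracking matcher for (a+)+b, anchored at offset 0.

    Returns (matched, calls), where calls is the number of match() calls made.
    """
    calls = 0

    def match(pos):
        nonlocal calls
        calls += 1
        if calls > match_limit:
            raise MatchLimitExceeded(f"more than {match_limit} calls")
        # Find how far a run of "a" extends from pos.
        end = pos
        while end < len(subject) and subject[end] == "a":
            end += 1
        # Try every length for this a+ group, longest first (greedy).
        for stop in range(end, pos, -1):
            if subject[stop:stop + 1] == "b":
                return True                  # group followed by the final b
            if match(stop):
                return True                  # group followed by another group
        return False

    matched = match(0)
    return matched, calls


print(match_nested_repeat("aaab", match_limit=1000))
print(match_nested_repeat("aaac", match_limit=1000))
try:
    match_nested_repeat("a" * 20 + "c", match_limit=10000)
except MatchLimitExceeded as exc:
    print("gave up:", exc)
```

In this toy, a failing subject doubles the number of calls with each additional "a", which is the kind of growth the match limit is meant to cut short. The recursion depth stays small here, which is why a separate recursion limit is a different control from the call-count limit.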
- preg_match()
- preg_match() returns 1 if the pattern matches given subject, 0 if it does not, or FALSE if an error occurred.
- u (PCRE_UTF8) This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and
- subject strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP
- 4.2.3 on win32. UTF-8 validity of the pattern and the subject is checked since PHP 4.3.5. An invalid subject will
- cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and
- six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have
- been regarded as valid UTF-8.
- With the PCRE_UTF8 modifier 'u', preg_match() fails silently on strings containing invalid UTF-8 byte sequences. It
- does not reject character codes above U+10FFFF (represented by 4 or more octets), though.
- Originally, this function checked according to RFC 2279, allowing for values in the range 0 to 0x7fffffff, up to 6
- bytes long, but ensuring that they were in the canonical format. Once somebody had pointed out RFC 3629 to me (it
- obsoletes 2279), additional restrictions were applied. The values are now limited to be between 0 and 0x0010ffff,
- no more than 4 bytes long, and the subrange 0xd800 to 0xdfff is excluded. However, the format of 5-byte and 6-byte
- characters is still checked.
- BACKTRACKING CONTROL
- The following are recognized only at the start of a pattern:
- (*LIMIT_MATCH=d) set the match limit to d (decimal number) ( added 8.33 28-May-2013 )
- (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number) ( added 8.33 28-May-2013 )
- (*UTF8) set UTF-8 mode: 8-bit library (PCRE_UTF8) ( added 7.9 11-Apr-09 )
- (*UTF16) set UTF-16 mode: 16-bit library (PCRE_UTF16) ( added 8.30 )
- (*UTF32) set UTF-32 mode: 32-bit library (PCRE_UTF32) ( added 8.32 30-November-2012 )
- (*UTF) set appropriate UTF mode for the library in use ( added 8.32 30-November-2012 )
- In order to process UTF-8 strings, you must build PCRE's 8-bit library with UTF support, and, in addition, you
- must call pcre_compile() with the PCRE_UTF8 option flag, or the pattern must start with the sequence (*UTF8) or
- (*UTF). When either of these is the case, both the pattern and any subject strings that are matched against it
- are treated as UTF-8 strings instead of strings of individual 1-byte characters.
- PCRE UTF ERRORS
- From release 8.13 more information about the details of the error are passed back in the returned value:
- PCRE_UTF8_ERR0 No error
- PCRE_UTF8_ERR1 Missing 1 byte at the end of the string
- PCRE_UTF8_ERR2 Missing 2 bytes at the end of the string
- PCRE_UTF8_ERR3 Missing 3 bytes at the end of the string
- PCRE_UTF8_ERR4 Missing 4 bytes at the end of the string
- PCRE_UTF8_ERR5 Missing 5 bytes at the end of the string
- PCRE_UTF8_ERR6 2nd-byte's two top bits are not 0x80
- PCRE_UTF8_ERR7 3rd-byte's two top bits are not 0x80
- PCRE_UTF8_ERR8 4th-byte's two top bits are not 0x80
- PCRE_UTF8_ERR9 5th-byte's two top bits are not 0x80
- PCRE_UTF8_ERR10 6th-byte's two top bits are not 0x80
- PCRE_UTF8_ERR11 5-byte character is not permitted by RFC 3629
- PCRE_UTF8_ERR12 6-byte character is not permitted by RFC 3629
- PCRE_UTF8_ERR13 4-byte character with value > 0x10ffff is not permitted
- PCRE_UTF8_ERR14 3-byte character with value 0xd800-0xdfff is not permitted
- PCRE_UTF8_ERR15 Overlong 2-byte sequence
- PCRE_UTF8_ERR16 Overlong 3-byte sequence
- PCRE_UTF8_ERR17 Overlong 4-byte sequence
- PCRE_UTF8_ERR18 Overlong 5-byte sequence (won't ever occur)
- PCRE_UTF8_ERR19 Overlong 6-byte sequence (won't ever occur)
- PCRE_UTF8_ERR20 Isolated 0x80 byte (not within UTF-8 character)
- PCRE_UTF8_ERR21 Byte with the illegal value 0xfe or 0xff
- PCRE_UTF8_ERR22 Unused (was non-character)
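The categories in this table can be reproduced with a small validator. The following Python sketch is an independent illustration, not PCRE's validation code: the function name first_utf8_error is hypothetical, the reason strings are paraphrases of the table rather than PCRE's numeric codes, and the order in which the checks run is an assumption.

```python
# Smallest code point that legitimately needs 2, 3 or 4 bytes,
# keyed by the number of continuation bytes.
_MIN_CODE_POINT = {1: 0x80, 2: 0x800, 3: 0x10000}


def first_utf8_error(data):
    """Return (offset, reason) for the first invalid sequence, or None if valid."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:                      # plain ASCII byte
            i += 1
            continue
        if lead < 0xC0:
            return i, "isolated continuation byte"
        if lead >= 0xFE:
            return i, "illegal byte 0xfe or 0xff"
        # Number of continuation bytes implied by the lead byte.
        extra = 1 if lead < 0xE0 else 2 if lead < 0xF0 else 3 if lead < 0xF8 \
            else 4 if lead < 0xFC else 5
        available = n - i - 1
        if available < extra:
            return i, f"missing {extra - available} byte(s) at end of string"
        code_point = lead & (0x7F >> (extra + 1))   # payload bits of the lead byte
        for k in range(1, extra + 1):
            byte = data[i + k]
            if (byte & 0xC0) != 0x80:
                return i, f"byte {k + 1} is not a continuation byte"
            code_point = (code_point << 6) | (byte & 0x3F)
        if extra >= 4:
            return i, f"{extra + 1}-byte character not permitted by RFC 3629"
        if code_point < _MIN_CODE_POINT[extra]:
            return i, f"overlong {extra + 1}-byte sequence"
        if code_point > 0x10FFFF:
            return i, "value above 0x10ffff"
        if 0xD800 <= code_point <= 0xDFFF:
            return i, "surrogate value"
        i += extra + 1
    return None


print(first_utf8_error("héllo".encode("utf-8")))
print(first_utf8_error(b"abc\xed\xa0\x80"))
```

The offset in the result is the position of the lead byte of the offending sequence, which is the kind of detail the more specific error codes are meant to expose.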
- PHP PCRE CONSTANTS
- PREG_NO_ERROR Returned by preg_last_error() if there were no errors. 5.2.0
- PREG_INTERNAL_ERROR Returned by preg_last_error() if there was an internal PCRE error. 5.2.0
- PREG_BACKTRACK_LIMIT_ERROR Returned by preg_last_error() if backtrack limit was exhausted. 5.2.0
- PREG_RECURSION_LIMIT_ERROR Returned by preg_last_error() if recursion limit was exhausted. 5.2.0
- PREG_BAD_UTF8_ERROR Returned by preg_last_error() if the last error was caused by malformed UTF-8 data (only when
- running a regex in UTF-8 mode). 5.2.0
- PREG_BAD_UTF8_OFFSET_ERROR Returned by preg_last_error() if the offset didn't correspond to the beginning of a valid
- UTF-8 code point (only when running a regex in UTF-8 mode). 5.3.0
- PCRE_VERSION PCRE version and release date (e.g. "7.0 18-Dec-2006"). 5.2.4
- PCRE CONSTANTS ON MY INSTALL get_defined_constants()
- 'PREG_PATTERN_ORDER' => 1,
- 'PREG_SET_ORDER' => 2,
- 'PREG_OFFSET_CAPTURE' => 256,
- 'PREG_SPLIT_NO_EMPTY' => 1,
- 'PREG_SPLIT_DELIM_CAPTURE' => 2,
- 'PREG_SPLIT_OFFSET_CAPTURE' => 4,
- 'PREG_GREP_INVERT' => 1,
- 'PREG_NO_ERROR' => 0,
- 'PREG_INTERNAL_ERROR' => 1,
- 'PREG_BACKTRACK_LIMIT_ERROR' => 2,
- 'PREG_RECURSION_LIMIT_ERROR' => 3,
- 'PREG_BAD_UTF8_ERROR' => 4,
- 'PREG_BAD_UTF8_OFFSET_ERROR' => 5,
- 'PCRE_VERSION' => '8.34 2013-12-15',
- iconv()
- https://www.gnu.org/software/libiconv/
- If you append the string //IGNORE, characters that cannot be represented in the target charset are silently discarded.
- Otherwise, str is cut from the first illegal character and an E_NOTICE is generated. ( since GNU libiconv 2002-01-13 )
- In other words, iconv() appears to be intended for use when converting the contents of files - whereas mb_convert_encoding() is intended
- for use when juggling strings internally, e.g. strings that aren't being read/written to/from files, but exchanged with some other media.
- ICONV CHARACTER SET ENCODINGS CONTAINING "UTF"
- $ iconv -l
- - ISO-10646UTF-8
- - ISO-10646UTF8
- - UTF-7
- - UTF-8
- - UTF-16
- - UTF-16BE
- - UTF-16LE
- - UTF-32
- - UTF-32BE
- - UTF-32LE
- - UTF7
- - UTF8
- - UTF16
- - UTF16BE
- - UTF16LE
- - UTF32
- - UTF32BE
- - UTF32LE
- If the string //IGNORE is appended to to-encoding, characters that cannot be converted are discarded and an error is printed after conversion.
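The "discard what cannot be converted" behaviour of //IGNORE has a close analogue in Python's codec error handlers. The sketch below uses Python's built-in str.encode with errors="ignore" rather than iconv itself, so it only illustrates the idea; the helper name convert_ignoring is made up for this example.

```python
def convert_ignoring(text, to_encoding):
    """Encode text, silently dropping characters the target charset lacks."""
    return text.encode(to_encoding, errors="ignore")


sample = "naïve café €"
print(convert_ignoring(sample, "ascii"))     # accented letters and the euro sign are dropped
print(convert_ignoring(sample, "latin-1"))   # only the euro sign is dropped

# Without the "ignore" handler the conversion stops with an error instead.
try:
    sample.encode("ascii")
except UnicodeEncodeError as exc:
    print("strict conversion failed at offset", exc.start)
```

As with //IGNORE, the output is silently shorter than the input, so this mode is only suitable when losing characters is acceptable.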
- ICONV IMPLEMENTATIONS - ICONV_IMPL CONSTANT
- @link http://www.gnu.org/software/libc/manual/html_node/Other-iconv-Implementations.html "Some Details about other iconv Implementations"
- @link http://www.gnu.org/software/libc/manual/html_node/Locales.html "Locales and Internationalization"
- "libiconv" - GNU libiconv is the native FreeBSD iconv implementation since 2002.
- "BSD iconv" - Konstantin Chuguev's iconv
- "glibc" - GNU Glibc's
- "unknown" - Not one of the above