<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
 <head>
  <meta http-equiv="content-type" content="text/html; charset=UTF-8">
  <title>Unicode character properties</title>

 </head>
 <body><div class="manualnavbar" style="text-align: center;">
 <div class="prev" style="text-align: left; float: left;"><a href="regexp.reference.escape.html">Escape sequences</a></div>
 <div class="next" style="text-align: right; float: right;"><a href="regexp.reference.anchors.html">Anchors</a></div>
 <div class="up"><a href="reference.pcre.pattern.syntax.html">PCRE regex syntax</a></div>
 <div class="home"><a href="index.html">PHP Manual</a></div>
</div><hr /><div id="regexp.reference.unicode" class="section">
  <h2 class="title">Unicode character properties</h2>
  <p class="para">
   Since PHP 5.1.0, three additional escape sequences for matching generic
   character types are available when <em class="emphasis">UTF-8 mode</em>
   is selected (with the <em>u</em> pattern modifier). They are:
  </p>
  <dl>

   <dt>
    <span class="term"><em class="emphasis">\p{xx}</em></span>
   </dt>
   <dd>
    <span class="simpara">a character with the xx property</span>
   </dd>

   <dt>
    <span class="term"><em class="emphasis">\P{xx}</em></span>
   </dt>
   <dd>
    <span class="simpara">a character without the xx property</span>
   </dd>

   <dt>
    <span class="term"><em class="emphasis">\X</em></span>
   </dt>
   <dd>
    <span class="simpara">an extended Unicode sequence</span>
   </dd>

  </dl>

  <p class="para">
   The property names represented by <em>xx</em> above are the Unicode
   general category properties, plus the script names described later in this
   section. Each character has exactly one general category property,
   specified by a two-letter abbreviation. For compatibility with
   Perl, negation can be specified by including a circumflex between the
   opening brace and the property name. For example, <em>\p{^Lu}</em> 
   is the same as <em>\P{Lu}</em>.
  </p>
  <p class="para">
   If only one letter is specified with <em>\p</em> or 
   <em>\P</em>, it covers all the general category properties that
   start with that letter. In this case, in the absence of negation, the curly
   brackets in the escape sequence are optional; these two examples have the
   same effect:
  </p>
  <div class="informalexample">
   <div class="example-contents">
<div class="cdata"><pre>
\p{L}
\pL
</pre></div>
   </div>

  </div>
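  <p class="para">
   The equivalences described above can be checked with a short script. This
   is only a sketch: the sample string and the variable names are made up for
   illustration, and the <em>u</em> modifier is used to select UTF-8 mode.
  </p>
  <div class="informalexample">
   <div class="example-contents">
<div class="cdata"><pre>
&lt;?php
// Hypothetical sample text; any valid UTF-8 string would do.
$text = 'Ab1 çD';

// \p{L} and \pL are two spellings of the same pattern.
preg_match_all('/\p{L}/u', $text, $braced);
preg_match_all('/\pL/u', $text, $bare);
var_dump($braced[0] === $bare[0]);   // bool(true)

// \p{^Lu} is the Perl-compatible spelling of \P{Lu}.
preg_match_all('/\p{^Lu}/u', $text, $caret);
preg_match_all('/\P{Lu}/u', $text, $negated);
var_dump($caret[0] === $negated[0]); // bool(true)
</pre></div>
   </div>

  </div>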
  <table class="doctable table">
   <caption><strong>Supported property codes</strong></caption>
   
    <thead>
     <tr>
      <th>Property</th>
      <th>Matches</th>
      <th>Notes</th>
     </tr>

    </thead>

    <tbody class="tbody">
     <tr>
      <td><em>C</em></td>
      <td>Other</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>Cc</em></td>
      <td>Control</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>Cf</em></td>
      <td>Format</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>Cn</em></td>
      <td>Unassigned</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>Co</em></td>
      <td>Private use</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>Cs</em></td>
      <td>Surrogate</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>L</em></td>
      <td>Letter</td>
      <td>
       Includes the following properties: <em>Ll</em>, 
       <em>Lm</em>, <em>Lo</em>, <em>Lt</em> and 
       <em>Lu</em>.
      </td>
     </tr>

     <tr>
      <td><em>Ll</em></td>
      <td>Lower case letter</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>Lm</em></td>
      <td>Modifier letter</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>Lo</em></td>
      <td>Other letter</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>Lt</em></td>
      <td>Title case letter</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>Lu</em></td>
      <td>Upper case letter</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>M</em></td>
      <td>Mark</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>Mc</em></td>
      <td>Spacing mark</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>Me</em></td>
      <td>Enclosing mark</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>Mn</em></td>
      <td>Non-spacing mark</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>N</em></td>
      <td>Number</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>Nd</em></td>
      <td>Decimal number</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>Nl</em></td>
      <td>Letter number</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>No</em></td>
      <td>Other number</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>P</em></td>
      <td>Punctuation</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>Pc</em></td>
      <td>Connector punctuation</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>Pd</em></td>
      <td>Dash punctuation</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>Pe</em></td>
      <td>Close punctuation</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>Pf</em></td>
      <td>Final punctuation</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>Pi</em></td>
      <td>Initial punctuation</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>Po</em></td>
      <td>Other punctuation</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>Ps</em></td>
      <td>Open punctuation</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>S</em></td>
      <td>Symbol</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>Sc</em></td>
      <td>Currency symbol</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>Sk</em></td>
      <td>Modifier symbol</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>Sm</em></td>
      <td>Mathematical symbol</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>So</em></td>
      <td>Other symbol</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>Z</em></td>
      <td>Separator</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>Zl</em></td>
      <td>Line separator</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>Zp</em></td>
      <td>Paragraph separator</td>
      <td class="empty">&nbsp;</td>
     </tr>

     <tr>
      <td><em>Zs</em></td>
      <td>Space separator</td>
      <td class="empty">&nbsp;</td>
     </tr>

    </tbody>
   
  </table>
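  <p class="para">
   As an illustration of how the codes in the table might be used, the sketch
   below removes punctuation with the one-letter class <em>P</em> and
   counts upper case letters with <em>Lu</em>. The input string and
   variable names are assumptions chosen for the example.
  </p>
  <div class="informalexample">
   <div class="example-contents">
<div class="cdata"><pre>
&lt;?php
// Hypothetical input string, chosen only for illustration.
$text = 'Hello, World! Ünïcode';

// P is the one-letter class covering Pc, Pd, Pe, Pf, Pi, Po and Ps.
$noPunctuation = preg_replace('/\p{P}+/u', '', $text);
echo $noPunctuation, "\n"; // Hello World Ünïcode

// Lu matches upper case letters only, including non-ASCII ones.
$upperCount = preg_match_all('/\p{Lu}/u', $text, $matches);
var_dump($upperCount);     // int(3)
</pre></div>
   </div>

  </div>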

  <p class="para">
   Extended properties such as <em>InMusicalSymbols</em> are not
   supported by PCRE.
  </p>
  <p class="para">
   Specifying case-insensitive (caseless) matching does not affect these escape sequences.
   For example, <em>\p{Lu}</em> always matches only upper case letters.
  </p>
  <p class="para">
   Sets of Unicode characters are defined as belonging to certain scripts.  A
   character from one of these sets can be matched using a script name.  For
   example:
  </p>
  <ul class="itemizedlist">
   <li class="listitem">
    <span class="simpara"><em>\p{Greek}</em></span>
   </li>
   <li class="listitem">
    <span class="simpara"><em>\P{Han}</em></span>
   </li>
  </ul>
  <p class="para">
   Characters that are not part of an identified script are grouped together
   as <em>Common</em>. The current list of scripts is:
  </p>
  <table class="doctable table">
   <caption><strong>Supported scripts</strong></caption>
   
    <tbody class="tbody">
     <tr>
      <td><em>Arabic</em></td>
      <td><em>Armenian</em></td>
      <td><em>Avestan</em></td>
      <td><em>Balinese</em></td>
      <td><em>Bamum</em></td>
     </tr>

     <tr>
      <td><em>Batak</em></td>
      <td><em>Bengali</em></td>
      <td><em>Bopomofo</em></td>
      <td><em>Brahmi</em></td>
      <td><em>Braille</em></td>
     </tr>

     <tr>
      <td><em>Buginese</em></td>
      <td><em>Buhid</em></td>
      <td><em>Canadian_Aboriginal</em></td>
      <td><em>Carian</em></td>
      <td><em>Chakma</em></td>
     </tr>

     <tr>
      <td><em>Cham</em></td>
      <td><em>Cherokee</em></td>
      <td><em>Common</em></td>
      <td><em>Coptic</em></td>
      <td><em>Cuneiform</em></td>
     </tr>

     <tr>
      <td><em>Cypriot</em></td>
      <td><em>Cyrillic</em></td>
      <td><em>Deseret</em></td>
      <td><em>Devanagari</em></td>
      <td><em>Egyptian_Hieroglyphs</em></td>
     </tr>

     <tr>
      <td><em>Ethiopic</em></td>
      <td><em>Georgian</em></td>
      <td><em>Glagolitic</em></td>
      <td><em>Gothic</em></td>
      <td><em>Greek</em></td>
     </tr>

     <tr>
      <td><em>Gujarati</em></td>
      <td><em>Gurmukhi</em></td>
      <td><em>Han</em></td>
      <td><em>Hangul</em></td>
      <td><em>Hanunoo</em></td>
     </tr>

     <tr>
      <td><em>Hebrew</em></td>
      <td><em>Hiragana</em></td>
      <td><em>Imperial_Aramaic</em></td>
      <td><em>Inherited</em></td>
      <td><em>Inscriptional_Pahlavi</em></td>
     </tr>

     <tr>
      <td><em>Inscriptional_Parthian</em></td>
      <td><em>Javanese</em></td>
      <td><em>Kaithi</em></td>
      <td><em>Kannada</em></td>
      <td><em>Katakana</em></td>
     </tr>

     <tr>
      <td><em>Kayah_Li</em></td>
      <td><em>Kharoshthi</em></td>
      <td><em>Khmer</em></td>
      <td><em>Lao</em></td>
      <td><em>Latin</em></td>
     </tr>

     <tr>
      <td><em>Lepcha</em></td>
      <td><em>Limbu</em></td>
      <td><em>Linear_B</em></td>
      <td><em>Lisu</em></td>
      <td><em>Lycian</em></td>
     </tr>

     <tr>
      <td><em>Lydian</em></td>
      <td><em>Malayalam</em></td>
      <td><em>Mandaic</em></td>
      <td><em>Meetei_Mayek</em></td>
      <td><em>Meroitic_Cursive</em></td>
     </tr>

     <tr>
      <td><em>Meroitic_Hieroglyphs</em></td>
      <td><em>Miao</em></td>
      <td><em>Mongolian</em></td>
      <td><em>Myanmar</em></td>
      <td><em>New_Tai_Lue</em></td>
     </tr>

     <tr>
      <td><em>Nko</em></td>
      <td><em>Ogham</em></td>
      <td><em>Old_Italic</em></td>
      <td><em>Old_Persian</em></td>
      <td><em>Old_South_Arabian</em></td>
     </tr>

     <tr>
      <td><em>Old_Turkic</em></td>
      <td><em>Ol_Chiki</em></td>
      <td><em>Oriya</em></td>
      <td><em>Osmanya</em></td>
      <td><em>Phags_Pa</em></td>
     </tr>

     <tr>
      <td><em>Phoenician</em></td>
      <td><em>Rejang</em></td>
      <td><em>Runic</em></td>
      <td><em>Samaritan</em></td>
      <td><em>Saurashtra</em></td>
     </tr>

     <tr>
      <td><em>Sharada</em></td>
      <td><em>Shavian</em></td>
      <td><em>Sinhala</em></td>
      <td><em>Sora_Sompeng</em></td>
      <td><em>Sundanese</em></td>
     </tr>

     <tr>
      <td><em>Syloti_Nagri</em></td>
      <td><em>Syriac</em></td>
      <td><em>Tagalog</em></td>
      <td><em>Tagbanwa</em></td>
      <td><em>Tai_Le</em></td>
     </tr>

     <tr>
      <td><em>Tai_Tham</em></td>
      <td><em>Tai_Viet</em></td>
      <td><em>Takri</em></td>
      <td><em>Tamil</em></td>
      <td><em>Telugu</em></td>
     </tr>

     <tr>
      <td><em>Thaana</em></td>
      <td><em>Thai</em></td>
      <td><em>Tibetan</em></td>
      <td><em>Tifinagh</em></td>
      <td><em>Ugaritic</em></td>
     </tr>

     <tr>
      <td><em>Vai</em></td>
      <td><em>Yi</em></td>
      <td class="empty">&nbsp;</td>
      <td class="empty">&nbsp;</td>
      <td class="empty">&nbsp;</td>
      <td class="empty">&nbsp;</td>
     </tr>

    </tbody>
   
  </table>
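  <p class="para">
   A minimal sketch of script matching follows, assuming a made-up
   mixed-script string. It uses <em>\p{Greek}</em> to extract Greek
   text and <em>\P{Han}</em> to strip everything that is not Han.
  </p>
  <div class="informalexample">
   <div class="example-contents">
<div class="cdata"><pre>
&lt;?php
// Hypothetical mixed-script sample.
$text = 'alpha αβγ beta 漢字';

// Extract the runs of characters that belong to the Greek script.
preg_match_all('/\p{Greek}+/u', $text, $greek);
echo $greek[0][0], "\n"; // αβγ

// \P{Han} matches any character outside the Han script,
// so deleting those characters leaves only the Han text.
$hanOnly = preg_replace('/\P{Han}+/u', '', $text);
echo $hanOnly, "\n";     // 漢字
</pre></div>
   </div>

  </div>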

  <p class="para">
   The <em>\X</em> escape matches a Unicode extended grapheme
   cluster. An extended grapheme cluster is one or more Unicode characters
   that combine to form a single glyph. In effect, this can be thought of as
   the Unicode equivalent of <em>.</em> as it will match one
   composed character, regardless of how many individual characters are
   actually used to render it.
  </p>
  <p class="para">
   In versions of PCRE older than 8.32 (which corresponds to PHP versions
   before 5.4.14 when using the bundled PCRE library), <em>\X</em>
   is equivalent to <em>(?&gt;\PM\pM*)</em>.  That is, it matches a
   character without the &quot;mark&quot; property, followed by zero or more characters
   with the &quot;mark&quot; property, and treats the sequence as an atomic group (see
   below). Characters with the &quot;mark&quot; property are typically accents that
   affect the preceding character.
  </p>
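  <p class="para">
   The difference between <em>.</em> and <em>\X</em> can be
   seen with a decomposed character. The sketch below assumes a sample string
   made of the letter e followed by U+0301 (combining acute accent, written
   here as its UTF-8 bytes) and the letter a; the variable names are
   illustrative only.
  </p>
  <div class="informalexample">
   <div class="example-contents">
<div class="cdata"><pre>
&lt;?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT (bytes CC 81), then "a".
$text = "e\xCC\x81a";

// In UTF-8 mode, . matches one code point at a time.
$codePoints = preg_match_all('/./u', $text, $byCodePoint);

// \X keeps the base letter and its combining mark together.
$clusters = preg_match_all('/\X/u', $text, $byCluster);

var_dump($codePoints); // int(3)
var_dump($clusters);   // int(2)
</pre></div>
   </div>

  </div>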
  <p class="para">
   Matching characters by Unicode property is relatively slow, because PCRE has
   to search a structure that contains data for over fifteen thousand
   characters. That is why, by default, the traditional escape sequences such as 
   <em>\d</em> and <em>\w</em> do not use Unicode properties 
   in PCRE.
  </p>
 </div></body></html>