<!doctype html> |
|
|
|
<html lang="en-us"> |
|
|
|
<head> |
|
<meta charset="utf-8"> |
|
<title>Unicode NamesList Format</title> |
|
<link rel="stylesheet" type="text/css" href="https://www.unicode.org/reports/reports-v2.css"> |
|
<style> |
|
a.headernav { |
|
font-size: 90%; |
|
} |
|
a.headernav:link { |
|
color: white; |
|
} |
|
a.headernav:visited { |
|
color: white; |
|
} |
|
a.headernav:active { |
|
color: white; |
|
} |
|
a.headernav:hover { |
|
color: #B0B0B0; |
|
} |
|
.pageheader { |
|
margin-top: 0; |
|
padding: 0 .5em 0 0; |
|
display: flex; |
|
flex-direction: row; |
|
flex-wrap: nowrap; |
|
justify-content: flex-start; |
|
background-color: #5555FF; |
|
color: white; |
|
font-family: arial, geneva, sans-serif; |
|
font-weight:bold; |
|
align-items: center; |
|
} |
|
.pageicon { |
|
padding : 2px 4px 0 2px; |
|
} |
|
.pagelogo { |
|
height: 33px; width: 34px; |
|
border: 0; |
|
padding-bottom: 0px; |
|
margin-bottom:-2px; |
|
} |
|
.pagetitle { |
|
font-size: 115%; |
|
flex-grow: 4; |
|
padding-left: 1em; |
|
} |
|
.headernav { padding-top: 0px; |
|
font-weight: bold; |
|
font-size: 100%; |
|
color: white; font-family: arial, geneva, sans-serif; |
|
text-align:right; |
|
} |
|
.graybar { |
|
width: 100%;padding:0; |
|
font-size:50%; |
|
background-color: #EEEEFE; |
|
} |
|
.pagecontents { |
|
padding-left: 3.25em; |
|
padding-right: 3.25em; |
|
padding-bottom: 1.75em; |
|
padding-top: 1em; |
|
} |
|
.pagebottom img |
|
{ |
|
padding-top: 2px; |
|
width:216px; |
|
height:50px; |
|
border: 0; |
|
} |
|
.pagebottom |
|
{ |
|
margin: auto; |
|
text-align:center; |
|
} |
|
</style> |
|
</head> |
|
|
|
<body> |
|
|
|
<div class="pageheader"> |
|
<div class="pageicon"><a href="https://www.unicode.org/"><img class="pagelogo" |
|
src="https://www.unicode.org/webscripts/logo60s2.gif" |
|
alt="[Unicode]" ></a></div> |
|
|
|
<div class="pagetitle"><a class="headernav" |
|
href="https://www.unicode.org/ucd/">Unicode Character Database</a></div> |
|
|
|
</div> |
|
<div class="graybar"> </div> |
|
|
|
<div class="body"> |
|
<h1>Unicode® NamesList File Format</h1> |
|
<table class="simple"> |
|
<tbody> |
|
<tr> |
|
<td>Revision</td> |
|
<td>15.1.0</td> |
|
</tr> |
|
<tr> |
|
<td>Authors</td> |
|
<td>Asmus Freytag, Ken Whistler</td> |
|
</tr> |
|
<tr> |
|
<td>Date</td> |
|
<td>2023-08-23</td> |
|
</tr> |
|
<tr> |
|
<td>This Version</td> |
|
<td > |
|
<a href="https://www.unicode.org/Public/15.1.0/ucd/NamesList.html"> |
|
https://www.unicode.org/Public/15.1.0/ucd/NamesList.html</a></td> |
|
</tr> |
|
<tr> |
|
<td>Previous Version</td> |
|
<td> |
|
<a href="https://www.unicode.org/Public/15.0.0/ucd/NamesList.html"> |
|
https://www.unicode.org/Public/15.0.0/ucd/NamesList.html</a></td> |
|
</tr> |
|
<tr> |
|
<td>Latest Version</td> |
|
<td><a href="https://www.unicode.org/Public/UCD/latest/ucd/NamesList.html">https://www.unicode.org/Public/UCD/latest/ucd/NamesList.html</a></td> |
|
</tr> |
|
</tbody> |
|
</table> |
|
<p> </p> |
|
<h3><i>Summary</i></h3> |
|
<blockquote> |
|
<p>This file describes the format and contents of NamesList.txt.</p>
|
</blockquote> |
|
<h3><i>Status</i></h3> |
|
<blockquote> |
|
<p><i>This file and the files described herein are part of the <a href="https://www.unicode.org/ucd/">Unicode
|
Character Database</a> (UCD). The Unicode <a href="https://www.unicode.org/terms_of_use.html"> |
|
Terms of Use</a> apply.</i></p> |
|
</blockquote> |
|
<hr style="width:50%"> |
|
|
|
<h2 id="Introduction">1.0 <a href="#Introduction">Introduction</a></h2> |
|
|
|
<p>The Unicode name list file NamesList.txt (also NamesList.lst) is a plain |
|
text file used to drive the layout of the character code charts in the Unicode |
|
Standard. The information in this file is a combination of several fields from |
|
the UnicodeData.txt and Blocks.txt files, together with additional annotations |
|
for many characters.</p> |
|
<p>This document describes the syntax rules for the file |
|
format, but also gives brief information on how each construct is rendered |
|
when laid out for the code charts. Some of the syntax elements are used only in |
|
preparation of the drafts of the code charts and are not present in the final, |
|
released form of the NamesList.txt file.</p> |
|
|
|
<p>Over time, the syntax has been extended by adding new features. The syntax for formal aliases and index tabs was introduced with Unicode |
|
5.0. The syntax for marginal sidebar comments is utilized extensively in |
|
draft versions of the NamesList.txt file. The support for UTF-8 encoded files and the syntax for the UTF-8 charset |
|
declaration in a comment at the head of the file were introduced after Unicode |
|
6.1.0 was published, as was the syntax for the specification of variation sequences and alternate glyphs and their respective summaries. The repertoire restriction |
|
in comments and aliases in the names list format was loosened from the prior |
|
limitation to U+0020..U+00FF, to include the wider range U+0020..U+02FF, as of Unicode 11.0.</p> |
|
|
|
<p>The same input file can be used for the preparation of drafts and final editions for ISO/IEC |
|
10646. Earlier versions of that standard used a different style, referred to below as ISO-style. That style necessitated the presence of some |
|
information in the name list file that is not needed (and in fact removed |
|
during parsing) for the Unicode code charts.</p> |
|
|
|
<p>With access to the layout program (<a href="https://www.unicode.org/unibook/">Unibook</a>), it is a simple matter to
create name lists for formatting working drafts or other documents containing
proposed characters.</p>
|
<p>The content of the NamesList.txt file is optimized for code chart creation. |
|
Some information that can be inferred by the reader from context has been |
|
suppressed to make the code charts more readable. See the chapter on Code |
|
Charts in the <a href="https://www.unicode.org/versions/latest">Unicode |
|
Standard</a>.</p> |
|
|
|
<h3 id="Overview">1.1 <a href="#Overview">NamesList File Overview</a></h3> |
|
|
|
<p>The NamesList files are plain text files which in their simplest form look
like this:</p>
|
|
|
<p>@@&lt;tab&gt;0020&lt;tab&gt;BASIC LATIN&lt;tab&gt;007F<br>
; this is a file comment (ignored)<br>
0020&lt;tab&gt;SPACE<br>
0021&lt;tab&gt;EXCLAMATION MARK<br>
0022&lt;tab&gt;QUOTATION MARK<br>
. . . <br>
007F&lt;tab&gt;DELETE</p>
|
|
|
<p>The semicolon (as first character), @ and &lt;tab&gt; characters are used
by the file syntax and must be provided as shown. Hexadecimal digits must be
in UPPERCASE. A double @@ introduces a block header, followed by the first
code point, the title, and the last code point of the block, in that order.</p>
|
|
|
<p>For a minimal name list, only the NAME_LINE and BLOCKHEADER and |
|
their constituent syntax elements are needed.</p> |
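<p>As an illustration only, a minimal name list of this kind can be read with a few lines of Python. The sketch below is not part of the file format or of Unibook: the names parse_minimal, BLOCKHEADER_RE and NAME_LINE_RE are hypothetical, the text is assumed to be decoded already, and every line type other than BLOCKHEADER, NAME_LINE and FILE_COMMENT is silently skipped.</p>

```python
import re

# CHAR is 4 to 6 uppercase hexadecimal digits.
CHAR = r"[0-9A-F]{4,6}"
# BLOCKHEADER: "@@" TAB BLOCKSTART TAB BLOCKNAME TAB BLOCKEND
BLOCKHEADER_RE = re.compile(rf"^@@\t+({CHAR})\t+([^\t]+)\t+({CHAR})$")
# NAME_LINE: CHAR TAB NAME (trailing comments are not split off in this sketch)
NAME_LINE_RE = re.compile(rf"^({CHAR})\t+(.+)$")


def parse_minimal(text):
    """Return (blocks, names) from a minimal name list.

    blocks: list of (start, name, end) tuples with integer code points.
    names:  dict mapping an integer code point to its name text.
    """
    blocks, names = [], {}
    for line in text.splitlines():
        if line.startswith(";"):
            continue  # FILE_COMMENT: ignored
        m = BLOCKHEADER_RE.match(line)
        if m:
            blocks.append((int(m.group(1), 16), m.group(2), int(m.group(3), 16)))
            continue
        m = NAME_LINE_RE.match(line)
        if m:
            names[int(m.group(1), 16)] = m.group(2)
    return blocks, names


sample = (
    "@@\t0020\tBASIC LATIN\t007F\n"
    "; this is a file comment (ignored)\n"
    "0020\tSPACE\n"
    "0021\tEXCLAMATION MARK\n"
)
blocks, names = parse_minimal(sample)
```

<p>Here blocks holds one tuple for the BASIC LATIN block and names maps 0x20 and 0x21 to their names. A real parser would need the full syntax given in the following sections.</p>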
|
|
|
<p>The full syntax with all the options is provided in the following sections.</p> |
|
|
|
<h2 id="FileStructure">2.0 <a href="#FileStructure">NamesList File Structure</a></h2> |
|
|
|
<p>This section defines the overall file structure.</p>
|
|
|
<pre><strong>NAMELIST: FILE_COMMENT* TITLE_PAGE* EXTENDED_BLOCK*</strong> |
|
|
|
<strong>TITLE_PAGE: TITLE |
|
| TITLE_PAGE SUBTITLE |
|
| TITLE_PAGE SUBHEADER |
|
| TITLE_PAGE IGNORED_LINE |
|
| TITLE_PAGE EMPTY_LINE |
|
| TITLE_PAGE NOTICE_LINE |
|
| TITLE_PAGE COMMENT_LINE |
|
| TITLE_PAGE PAGEBREAK |
|
| TITLE_PAGE FILE_COMMENT |
|
|
|
|
|
EXTENDED_BLOCK: BLOCK |
|
| BLOCK SUMMARY |
|
|
|
|
|
BLOCK: BLOCKHEADER |
|
| BLOCKHEADER INDEX_TAB |
|
| BLOCK CHAR_ENTRY |
|
| BLOCK SUBHEADER |
|
| BLOCK NOTICE_LINE |
|
| BLOCK EMPTY_LINE |
|
| BLOCK IGNORED_LINE |
|
| BLOCK SIDEBAR_LINE |
|
| BLOCK PAGEBREAK |
|
| BLOCK FILE_COMMENT |
|
| BLOCK CROSS_REF |
|
|
|
|
|
CHAR_ENTRY: NAME_LINE | RESERVED_LINE |
|
| CHAR_ENTRY ALIAS_LINE |
|
| CHAR_ENTRY FORMALALIAS_LINE |
|
| CHAR_ENTRY COMMENT_LINE |
|
| CHAR_ENTRY CROSS_REF |
|
| CHAR_ENTRY DECOMPOSITION |
|
| CHAR_ENTRY COMPAT_MAPPING |
|
| CHAR_ENTRY IGNORED_LINE |
|
| CHAR_ENTRY EMPTY_LINE |
|
| CHAR_ENTRY NOTICE_LINE |
|
| CHAR_ENTRY FILE_COMMENT |
|
| CHAR_ENTRY VARIATION_LINE</strong> |
|
</pre> |
|
|
|
<p>In other words:</p> |
|
<p> |
|
Neither TITLE nor SUBTITLE may occur after the first BLOCKHEADER. </p> |
|
<p>Only TITLE, SUBTITLE, SUBHEADER, PAGEBREAK, COMMENT_LINE, NOTICE_LINE, |
|
EMPTY_LINE, IGNORED_LINE and FILE_COMMENT may occur before the first BLOCKHEADER.</p> |
|
<ul> |
|
<li>CROSS_REF, DECOMPOSITION, COMPAT_MAPPING, VARIATION_LINE, ALIAS_LINE and FORMALALIAS_LINE lines
|
occurring before the first block header are treated as if they were |
|
COMMENT_LINEs.</li> |
|
</ul> |
|
<p>Directly following either a NAME_LINE or a RESERVED_LINE, an uninterrupted
sequence of the following lines may occur (in any order and repeated as often
as needed): ALIAS_LINE, COMMENT_LINE, CROSS_REF, DECOMPOSITION, COMPAT_MAPPING, FORMALALIAS_LINE, NOTICE_LINE,
EMPTY_LINE, IGNORED_LINE, VARIATION_LINE and FILE_COMMENT.</p>
|
<ul> |
|
<li>The conventional order of elements in a char entry is NAME_LINE,
FORMALALIAS_LINE, ALIAS_LINE, COMMENT_LINE or NOTICE_LINE, CROSS_REF, VARIATION_LINE, optionally
ending in either DECOMPOSITION or COMPAT_MAPPING. This order is not enforced by the layout program
(<a href="https://www.unicode.org/unibook/">Unibook</a>).</li>
|
</ul> |
|
<p>Except for CROSS_REF, NOTICE_LINE, SIDEBAR_LINE, EMPTY_LINE, IGNORED_LINE and |
|
FILE_COMMENT, none of these lines may |
|
occur in any other place.</p> |
|
<ul> |
|
<li>A NOTICE_LINE or CROSS_REF displays differently depending on whether it follows a header or title |
|
or is part of a CHAR_ENTRY</li> |
|
</ul> |
|
<p>A PAGEBREAK may appear anywhere, except in the middle of a CHAR_ENTRY.
|
A PAGEBREAK before the file title lines may not be supported. INDEX_TABs may |
|
appear after any block header.</p> |
|
<p>If the first line of a file is a file comment, it may contain a UTF-8 |
|
charset declaration (see below). Alternatively, or in addition, a BOM may be |
|
present at the very beginning of the file, forcing the encoding to be |
|
interpreted as UTF-16 (little-endian only) or UTF-8. When |
|
declared as UTF-8, the names list format will support use of characters in |
|
the range U+0020..U+02FF in LINE and LABEL elements. Otherwise, |
|
the supported repertoire is limited to Latin-1, and attempted use of characters outside |
|
the Latin-1 range will result in data corruption.</p> |
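<p>The encoding rules in the preceding paragraph can be sketched as a small detection routine. This is a reading of the text, not the logic Unibook actually uses: the function name detect_encoding is hypothetical, the returned strings are Python codec names, and only the cases described above (UTF-8 BOM, little-endian UTF-16 BOM, a charset declaration in a first-line file comment, and the Latin-1 fallback) are considered.</p>

```python
import codecs


def detect_encoding(raw: bytes) -> str:
    """Guess the encoding of a names list from its leading bytes."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # UTF-8 with BOM; this codec strips the BOM
    if raw.startswith(codecs.BOM_UTF16_LE):
        return "utf-16"     # the BOM tells the codec the byte order
    # No BOM: only the first line may carry a charset declaration.
    lines = raw.splitlines()
    first_line = lines[0] if lines else b""
    if first_line.startswith(b";") and b"utf-8" in first_line.lower():
        return "utf-8"
    return "latin-1"
```

<p>A caller would then decode with raw.decode(detect_encoding(raw)) before splitting the text into lines.</p>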
|
<p>Several of these elements, while part of the formal definition of the |
|
file format, do not occur in final published versions of |
|
NamesList.txt in the <a href="https://www.unicode.org/Public/UCD/latest/">UCD</a>.</p> |
|
|
|
<h4>Blocks followed by Summaries</h4> |
|
<p>A block may be extended by a summary of standard variation sequences or selected alternate glyphs (or both) defined for characters in the block:</p> |
|
<pre><strong> |
|
SUMMARY: ALTGLYPH_SUMMARY |
|
| VARIATION_SUMMARY
|
| ALTGLYPH_SUMMARY VARIATION_SUMMARY |
|
| MIXED_SUMMARY |
|
|
|
ALTGLYPH_SUMMARY: ALTGLYPH_SUBHEADER |
|
| ALTGLYPH_SUMMARY SUMMARY_LINE |
|
|
|
VARIATION_SUMMARY: VARIATION_SUBHEADER |
|
| VARIATION_SUMMARY SUMMARY_LINE |
|
|
|
MIXED_SUMMARY: MIXED_SUBHEADER |
|
| MIXED_SUMMARY SUMMARY_LINE |
|
|
|
SUMMARY_LINE: SUBHEADER |
|
| NOTICE_LINE |
|
| FILE_COMMENT |
|
| EMPTY_LINE</strong> |
|
</pre> |
|
|
|
<p>When formatted for display, each summary will recap the information presented in the VARIATION_LINE elements |
|
of the preceding block, grouped by alternate glyph variants and standardized variation sequences, and |
|
preceded by the corresponding subheader. Additional SUBHEADER and NOTICE lines, if provided, immediately |
|
follow the ALTGLYPH_SUBHEADER, VARIATION_SUBHEADER or MIXED_SUBHEADER. There is no provision to provide subheaders that are |
|
interspersed between items in the summary.</p> |
|
|
|
<p>These syntax constructs are entirely optional. If the ALTGLYPH_SUBHEADER or VARIATION_SUBHEADER are |
|
omitted from the names list, but the preceding block nevertheless contains VARIATION_LINE elements |
|
as described below, Unibook will automatically generate any required summaries using a default format for the headers.</p> |
|
|
|
<p>Thus, the main purpose of providing ALTGLYPH_SUBHEADER or VARIATION_SUBHEADER elements is to
supply specific contents for these summary titles and to allow additional
information to be added via SUBHEADER and NOTICE_LINE elements. The final published version of the Unicode names list
is machine generated and will always explicitly provide any summary subheaders.</p>
|
|
|
<h3 id="FileElements">2.1 <a href="#FileElements">NamesList File Elements</a></h3> |
|
|
|
<p>This section provides the details of the syntax for the individual elements.</p> |
|
|
|
<pre><strong>ELEMENT SYNTAX</strong> // How rendered |
|
|
|
<strong>NAME_LINE: CHAR TAB NAME LF</strong> |
|
// The CHAR and the corresponding image are echoed, |
|
// followed by the name as given in NAME |
|
|
|
<strong> | CHAR TAB "<" LCNAME ">" LF</strong> |
|
// Control and noncharacters use this form of |
|
// lowercase, bracketed pseudo character name |
|
|
|
<strong> | CHAR TAB NAME SP COMMENT LF</strong> |
|
// Names may have a comment, which is stripped off |
|
// unless the file is parsed for an ISO style list |
|
|
|
<strong> | CHAR TAB "<" LCNAME ">" SP COMMENT LF</strong> |
|
// Control and noncharacters may also have comments |
|
|
|
<strong>RESERVED_LINE: CHAR TAB "&lt;reserved&gt;" LF</strong>
// The CHAR is echoed followed by an icon for the
// reserved character and a fixed string e.g. "&lt;reserved&gt;"
|
|
|
<strong>COMMENT_LINE: TAB "*" SP EXPAND_LINE</strong> |
|
// * is replaced by BULLET, output line as comment |
|
|
|
<strong> | TAB EXPAND_LINE</strong> |
|
// Output line as comment |
|
|
|
<strong>ALIAS_LINE: TAB "=" SP LINE</strong> |
|
// Replace = by itself, output line as alias |
|
|
|
<strong>FORMALALIAS_LINE: |
|
TAB "%" SP NAME LF</strong> |
|
// Replace % by U+203B, output line as formal alias |
|
|
|
<strong>CROSS_REF: TAB "x" SP CHAR SP LCNAME LF |
|
| TAB "x" SP CHAR SP "<" LCNAME ">" LF</strong> |
|
// x is replaced by a right arrow |
|
|
|
<strong> | TAB "x" SP "(" LCNAME SP "-" SP CHAR ")" LF |
|
| TAB "x" SP "(" "<" LCNAME ">" SP "-" SP CHAR ")" LF</strong> |
|
// x is replaced by a right arrow; |
|
// (second type as used for control and noncharacters) |
|
|
|
// In the forms with parentheses the "(","-" and ")" are removed |
|
// and the order of CHAR and LCNAME is reversed; |
|
// i.e. all inputs result in the same order of output |
|
|
|
<strong> | TAB "x" SP CHAR LF</strong> |
|
// x is replaced by a right arrow |
|
// (this type is the only one without LCNAME |
|
// and is used for ideographs) |
|
|
|
<strong>VARIATION_LINE: TAB "~" SP CHAR VARSEL SP LABEL LF |
|
| TAB "~" SP CHAR VARSEL SP LABEL "(" LCTAG ")" LF</strong> |
|
// output standardized variation sequence or simply the char code in case of alternate |
|
// glyphs, followed by the alternate glyph or variation glyph and the label and context |
|
|
|
<strong>FILE_COMMENT: ";" LINE</strong> |
|
|
|
<strong>EMPTY_LINE: LF</strong> |
|
// Empty and ignored lines as well as |
|
// file comments are ignored |
|
|
|
<strong>IGNORED_LINE: TAB ";" LINE</strong> |
|
// Ignore LINE |
|
|
|
<strong>SIDEBAR_LINE: ";;" LINE</strong> |
|
// Output LINE as marginal note |
|
|
|
<strong>DECOMPOSITION: TAB ":" SP EXPAND_LINE |
|
| TAB ":" SP "<" TAG ">" SP EXPAND_LINE</strong> |
|
// Replace ':' by EQUIV, expand line into decomposition |
|
// The &lt;tag&gt; gives optional information,
|
// e.g., about composition exclusion. |
|
// by convention the tag has initial lowercase |
|
|
|
<strong>COMPAT_MAPPING: TAB "#" SP EXPAND_LINE |
|
| TAB "#" SP "<" TAG ">" SP EXPAND_LINE</strong> |
|
// Replace '#' by APPROX, output line as mapping |
|
// The &lt;tag&gt; is the optional compatibility decomposition tag.
|
// by convention the tag has initial lowercase |
|
|
|
<strong>NOTICE_LINE: "@+" TAB LINE</strong> |
|
// Output LINE as notice |
|
|
|
<strong> | "@+" TAB "*" SP LINE</strong> |
|
// Output LINE as notice |
|
// "*" expands to a bullet character |
|
// Notices following a character code apply to the |
|
// character and are indented. Notices not following |
|
// a character code apply to the page/block/column |
|
// and are italicized, but not indented |
|
|
|
<strong>TITLE: "@@@" TAB LINE</strong> |
|
// Output LINE as text |
|
// Title is used in page headers |
|
|
|
<strong>SUBTITLE: "@@@+" TAB LINE</strong> |
|
// Output LINE as subtitle |
|
|
|
<strong>SUBHEADER: "@" TAB LINE</strong> |
|
// Output LINE as column header |
|
|
|
<strong>VARIATION_SUBHEADER:</strong> <strong>"@~" TAB LINE</strong> |
|
// Output LINE as column header (summary subheader) |
|
<strong>| "@~" LF</strong> |
|
// Output a default standard variation sequences summary subheader |
|
<strong>| "@~" TAB "!" LF</strong> |
|
// Suppress output of a default standard variation sequences summary subheader
|
// and disable display of summary |
|
<strong>| "@~" TAB "!" VARSEL_LIST LF</strong> |
|
<strong>| "@~" TAB "!" VARSEL_LIST LINE</strong> |
|
// Output a standard summary subheader, using default or LINE respectively |
|
// Suppress any std variation sequences using selectors from the list |
|
|
|
<strong>ALTGLYPH_SUBHEADER:</strong> <strong>"@@~" TAB LINE</strong> |
|
// Output LINE as column header (summary subheader) |
|
<strong>| "@@~" LF</strong> |
|
// Output a default alternate glyph summary subheader |
|
<strong>| "@@~" TAB "!" LF</strong> |
|
// Suppress output of a default alternate glyph summary subheader |
|
// and disable display of summary |
|
|
|
<strong>MIXED_SUBHEADER: </strong><strong>"@@@~" TAB LINE</strong> |
|
// Output LINE as column header (summary subheader) |
|
<strong>| "@@@~" LF</strong> |
|
// Output a default combined variation and alternate glyph summary subheader |
|
<strong>| "@@@~" TAB "!" LF</strong> |
|
// Suppress output of a default combined variation and alternate glyph summary subheader
|
// and disable display of summary |
|
<strong>| "@@@~" TAB "!" VARSEL_LIST LF</strong> |
|
<strong>| "@@@~" TAB "!" VARSEL_LIST LINE</strong> |
|
// Output a combined summary subheader, using default or LINE respectively |
|
// Suppress any std variation sequences using selectors from the list |
|
|
|
<strong>BLOCKHEADER: "@@" TAB BLOCKSTART TAB BLOCKNAME TAB BLOCKEND LF</strong> |
|
// Cause a page break and optional |
|
// blank page, then output one or more charts |
|
// followed by the list of character names. |
|
// Use BLOCKSTART and BLOCKEND to define |
|
// what characters belong to a block. |
|
// Use BLOCKNAME in page and table headers |
|
|
|
<strong>BLOCKNAME: LABEL |
|
| LABEL SP "(" LABEL ")"</strong> |
|
// If an alternate label is present it replaces |
|
// the BLOCKNAME when an ISO-style names list is |
|
// laid out; it is ignored in the Unicode charts |
|
|
|
<strong>BLOCKSTART: CHAR</strong> // First character position in block |
|
<strong>BLOCKEND: CHAR</strong> // Last character position in block |
|
<strong>PAGEBREAK: "@@"</strong> // Insert a (column) break |
|
<strong>INDEX_TAB: "@@+"</strong> // Start a new index tab at latest BLOCKSTART |
|
|
|
<strong>EXPAND_LINE: {ESC_CHAR | CHAR | STRING | ESC +}+ LF</strong> |
|
// Instances of CHAR (see Notes) are replaced by |
|
// CHAR NBSP x NBSP where x is the single Unicode |
|
// character corresponding to CHAR. |
|
// If character is combining, it is replaced with |
|
// CHAR NBSP &lt;circ&gt; x NBSP where &lt;circ&gt; is the
|
// dotted circle |
|
</pre> |
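<p>Most elements above can be told apart by their leading marker. The following Python sketch shows one way to do that; it is an assumption-laden illustration, not the Unibook parser. The function name classify and the tables AT_MARKERS and TAB_MARKERS are hypothetical, the line is assumed to have its terminator removed, and the contents after the marker are not validated.</p>

```python
import re

CHAR_TAB = re.compile(r"^[0-9A-F]{4,6}\t")
# A cross reference is "x" SP followed by a CHAR or an opening parenthesis.
CROSS_REF = re.compile(r"^x (?:[0-9A-F]{4,6}\b|\()")

# Longer markers are tested before their prefixes ("@@@~" before "@@@").
AT_MARKERS = [
    ("@@@~", "MIXED_SUBHEADER"),
    ("@@@+", "SUBTITLE"),
    ("@@@", "TITLE"),
    ("@@~", "ALTGLYPH_SUBHEADER"),
    ("@@+", "INDEX_TAB"),
    ("@~", "VARIATION_SUBHEADER"),
    ("@+", "NOTICE_LINE"),
]
TAB_MARKERS = {
    "* ": "COMMENT_LINE",
    "= ": "ALIAS_LINE",
    "% ": "FORMALALIAS_LINE",
    "~ ": "VARIATION_LINE",
    ": ": "DECOMPOSITION",
    "# ": "COMPAT_MAPPING",
}


def classify(line):
    """Return the element name for one line (terminator already removed)."""
    if line == "":
        return "EMPTY_LINE"
    if line.startswith(";;"):
        return "SIDEBAR_LINE"
    if line.startswith(";"):
        return "FILE_COMMENT"
    if line.startswith("@"):
        for marker, kind in AT_MARKERS:
            if line.startswith(marker):
                return kind
        if line.startswith("@@"):
            # "@@" alone is a PAGEBREAK; followed by TAB it is a BLOCKHEADER.
            return "BLOCKHEADER" if line.startswith("@@\t") else "PAGEBREAK"
        return "SUBHEADER" if line.startswith("@\t") else "UNKNOWN"
    if line.startswith("\t"):
        body = line.lstrip("\t")
        if body.startswith(";"):
            return "IGNORED_LINE"
        if CROSS_REF.match(body):
            return "CROSS_REF"
        # An unmarked comment line is the lowest-priority fallback.
        return TAB_MARKERS.get(body[:2], "COMMENT_LINE")
    if CHAR_TAB.match(line):
        rest = line.split("\t", 1)[1].lstrip("\t")
        return "RESERVED_LINE" if rest == "<reserved>" else "NAME_LINE"
    return "UNKNOWN"
```

<p>Testing the marked forms first and falling back to an unmarked COMMENT_LINE last keeps a comment that happens to begin with a special character from shadowing the more specific line types.</p>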
|
|
|
|
|
<b>Notes:</b><ul> |
|
<li>Blocks must be aligned on a 16-code point boundary and contain an integer
|
multiple of 16-code point columns. The exception to that rule is for blocks of |
|
ideographs, <i>etc.</i>, for which no names are listed in the file. The BLOCKEND for such blocks |
|
must correspond to the last assigned character, and not the actual end of the block.</li> |
|
<li>Blocks must be non-overlapping and in ascending order. NAME_LINEs |
|
must be in ascending order and follow the block header for the block to |
|
which they belong. </li> |
|
<li>Reserved entries are optional, and will normally be supplied automatically. They are |
|
required whenever followed by ALIAS_LINE, COMMENT_LINE, NOTICE_LINE or CROSS_REF. |
|
</li> |
|
<li>An empty alternate glyph summary subheader expression will result in the default header "Selected Alternative Glyphs"</li>
|
<li>An empty standard variation subheader expression will result in the default header "Standardized Variation Sequences"</li> |
|
<li>A VARSEL_LIST may only contain code points for standard variation selectors (including script specific ones)</li> |
|
<li>When displaying a VARIATION_LINE for alternate glyphs, the "ALTn" selector is not displayed. </li> |
|
<li>If a glyph is unavailable for the variant glyph in a VARIATION_LINE it is replaced by the glyph for U+2591 LIGHT SHADE.</li> |
|
<li>Because a LINE or an EXPAND_LINE can itself start with a special character followed |
|
by a SP or LF, an "unmarked" COMMENT_LINE should match the input in lower priority than line |
|
types that require a special character or have a more restrictive set of characters than EXPAND_LINE. |
|
Similarly, a SUBHEADER containing TAB "!" LF should match with a higher priority than those |
|
where the TAB is followed by a LINE.</li> |
|
</ul> |
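<p>The first two notes above (alignment, whole columns, no overlap, ascending order) lend themselves to a mechanical check. The Python sketch below is one possible reading of them; the function name check_blocks and the tuple layout are hypothetical. It does not know which blocks fall under the exception for ideographs and similar blocks, so such blocks would be reported and need to be filtered out by the caller.</p>

```python
def check_blocks(blocks):
    """Return a list of messages for blocks that break the rules above.

    blocks: iterable of (start, end, name) tuples in file order, with
    start and end given as integer code points.
    """
    problems = []
    prev_end = -1
    for start, end, name in blocks:
        if start % 16 != 0:
            problems.append(
                f"{name}: start {start:04X} is not on a 16-code point boundary")
        if (end - start + 1) % 16 != 0:
            problems.append(
                f"{name}: {start:04X}..{end:04X} is not a whole number of columns")
        if start <= prev_end:
            problems.append(f"{name}: overlaps or precedes an earlier block")
        prev_end = max(prev_end, end)
    return problems
```

<p>For example, check_blocks([(0x0000, 0x007F, "Basic Latin"), (0x0080, 0x00FF, "Latin-1 Supplement")]) returns an empty list.</p>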
|
|
|
|
|
<h3 id="FilePrimitives">2.2 <a href="#FilePrimitives">NamesList File Primitives</a></h3> |
|
|
|
<p>The following are the primitives and terminals for the NamesList syntax.</p> |
|
|
|
<pre><strong>LINE</strong>: <strong>STRING LF |
|
COMMENT: "(" LABEL ")" |
|
| "(" LABEL ")" SP "*" |
|
| "*"</strong> |
|
|
|
<strong>NAME</strong>: &lt;sequence of uppercase ASCII letters, digits, space and hyphen&gt;
<strong>LCNAME</strong>: &lt;sequence of lowercase ASCII letters, digits, space and hyphen&gt; <strong> ("-" CHAR)?</strong>

<strong>TAG</strong>: &lt;sequence of ASCII letters&gt;
<strong>LCTAG</strong>: &lt;sequence of lowercase ASCII letters&gt;
<strong>STRING</strong>: &lt;sequence of characters in the range U+0020..U+02FF, except controls&gt;
<strong>LABEL</strong>: &lt;sequence of characters in the range U+0020..U+02FF, except controls, "(" or ")"&gt;
|
<strong>VARSEL</strong>: <strong>CHAR |
|
| "ALT" ( "1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9" )</strong> |
|
<strong>VARSEL_LIST</strong>: <strong>"{" CHAR_LIST "}"</strong> |
|
<strong>CHAR_LIST</strong>: <strong>CHAR |
|
| CHAR_LIST SP CHAR</strong> |
|
<strong>CHAR</strong>: <strong>X X X X</strong> |
|
<strong>| X X X X X </strong> |
|
<strong>| X X X X X X </strong> |
|
<strong>X</strong>: <strong>"0"|"1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9"|"A"|"B"|"C"|"D"|"E"|"F"</strong> |
|
<strong>ESC_CHAR</strong>: <strong>ESC CHAR</strong> |
|
<strong>ESC</strong>: <strong>"\"</strong> |
|
// Special semantics of backslash (\) are supported |
|
// only in EXPAND_LINE. |
|
<strong>TAB</strong>: &lt;sequence of one or more ASCII tab characters 0x09&gt;
<strong>SP</strong>: &lt;ASCII 20&gt;
<strong>LF</strong>: &lt;any sequence of a single ASCII 0A or 0D, or both&gt;
|
</pre> |
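<p>Several of these primitives translate directly into regular expressions. The Python sketch below covers CHAR, VARSEL and VARSEL_LIST only; the helper names is_char, is_varsel and parse_varsel_list are hypothetical, and the sketch does not check that the code points in a list really are variation selectors.</p>

```python
import re

# CHAR: four, five or six uppercase hexadecimal digits.
CHAR_RE = re.compile(r"[0-9A-F]{4,6}")
# VARSEL: a CHAR, or "ALT" followed by a digit 1..9.
VARSEL_RE = re.compile(r"[0-9A-F]{4,6}|ALT[1-9]")
# VARSEL_LIST: "{" CHAR (SP CHAR)* "}"
VARSEL_LIST_RE = re.compile(r"\{([0-9A-F]{4,6}(?: [0-9A-F]{4,6})*)\}")


def is_char(token):
    """True if token matches the CHAR production exactly."""
    return CHAR_RE.fullmatch(token) is not None


def is_varsel(token):
    """True if token matches the VARSEL production exactly."""
    return VARSEL_RE.fullmatch(token) is not None


def parse_varsel_list(text):
    """Return the code points in a VARSEL_LIST, or None if it is malformed."""
    m = VARSEL_LIST_RE.fullmatch(text)
    if m is None:
        return None
    return [int(code, 16) for code in m.group(1).split(" ")]
```

<p>Using fullmatch rather than match keeps a seven-digit or lowercase token from being accepted on the strength of a valid prefix.</p>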
|
|
|
<p><b>Notes:</b></p> |
|
<ul> |
|
<li>Multiple or leading spaces, multiple or leading hyphens, as well as |
|
word-initial digits in NAMEs or LCNAMEs are illegal.</li> |
|
<li>The French version of the names list uses French rules, which allow |
|
apostrophe and accented letters in character names.</li> |
|
<li>When names containing code points are lowercased to make them LCNAMEs, |
|
the code point values remain uppercase. Such code points by convention |
|
follow a hyphen and are the last element in the name.</li> |
|
<li>Special limited lookbehind logic prevents a four-digit number for a standard, such
as ISO 9999, from being misinterpreted as ISO CHAR. Currently recognized are
"ISO", "DIN", "IEC" and "S X" as well as "S C" for the JIS X and JIS C series of
standards. (In addition, "EEE" and "S X" are recognized for use with IEEE and KSC X standards. For the GB series of standards, " GB" is defined to prevent conversion to CHAR, but has no effect at the start of a line.) For other standards, or for four-digit years in a comment, use a
NOTICE_LINE instead, which prevents expansion, or use "\" to escape the digits.</li>
|
<li>Single and double straight quotes in an EXPAND_LINE are replaced by curly quotes using English rules. |
|
Smart apostrophes are supported, but nested quotes are not. |
|
Single quotes can only be applied around a single word.</li> |
|
<li>A CHAR inside ' or " is expanded, but only its glyph image is printed; the
code value is not echoed.</li>
|
<li>Inside an EXPAND_LINE, backslash is treated as an escape character that |
|
removes the special meaning of any literal character and also prevents |
|
the following digit sequence from being expanded. A backslash character in |
|
isolation is never displayed. A sequence of two backslash characters results |
|
in display of a single backslash, but has no effect on the interpretation |
|
of following characters.</li> |
|
<li>The hyphen in a character range CHAR-CHAR is replaced by an EN DASH on |
|
output.</li> |
|
<li>In a STRING or LABEL, a Unicode character outside the range |
|
U+0000..U+02FF is displayed as is, with a glyph matching |
|
the chart font, and not with the font that is otherwise defined for that element.</li> |
|
<li>The NamesList.txt file is encoded in UTF-8 if the <i>first line</i> is a |
|
FILE_COMMENT containing the declaration "UTF-8" or any casemap variation |
|
thereof. Otherwise the file is encoded in Latin-1 (older versions). Beyond |
|
detecting the charset declaration (typically: "; charset=utf-8") the |
|
remainder of that comment is ignored. |
|
If the file is not encoded as |
|
UTF-8, the character repertoire for running text (anything |
|
other than CHAR) is effectively restricted to the repertoire of Latin-1. |
|
Otherwise, characters in the range U+0020..U+02FF |
|
are allowed in STRING or LABEL elements, and elements derived from them.</li> |
|
<li>The code chart layout program |
|
(<a href="https://www.unicode.org/unibook/">Unibook</a>) |
|
can accept files in several other formats. These include little-endian UTF-16, |
|
prefixed with a BOM, or UTF-8 prefixed with the UTF-8 BOM.</li> |
|
<li>While the format allows multiple &lt;tab&gt; characters, by convention the
|
actual number of tabs is always one or two, chosen to provide the best |
|
layout of the plain text file.</li> |
|
<li>Earlier published versions of the NamesList.txt file may contain trailing or otherwise extraneous |
|
spaces or tab characters; while these are errors in the files, they are not |
|
being corrected, to retain stability of the published versions. Anyone |
|
writing a parser for older versions of this file may need to be prepared to |
|
handle such exceptions.</li> |
|
<li>Lines are terminated by \r, \n, \r\n or \n\r. Repeated terminators imply empty lines, e.g. \r\r\n is treated as 2 lines, as is \r\n\r\n.</li> |
|
<li>The final LF in the file must be present.</li> |
|
</ul> |
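<p>The line termination rule in the notes above can be expressed as a single regular expression that tries the two-character terminators before the single-character ones. The Python sketch below is illustrative only; the names TERMINATOR and split_lines are hypothetical, and the input is assumed to be decoded text.</p>

```python
import re

# Two-character terminators are listed first so that "\r\n" and "\n\r"
# each count as one terminator rather than two.
TERMINATOR = re.compile(r"\r\n|\n\r|\r|\n")


def split_lines(text):
    """Split decoded text into lines according to the LF rule."""
    parts = TERMINATOR.split(text)
    # A final terminator leaves one empty trailing piece; drop it.
    if parts and parts[-1] == "":
        parts.pop()
    return parts
```

<p>With this rule, "a\r\r\nb\n" and "a\r\n\r\nb\r\n" both yield the three lines "a", "" and "b", which matches the statement that repeated terminators imply empty lines.</p>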
|
<h2 id="Modifications"><a href="#Modifications">Modifications</a></h2> |
|
|
|
<p><b>Version 15.1.0</b></p> |
|
<ul> |
|
<li>Reissued for Unicode 15.1.0.</li>
|
<li>Adjusted NAMELIST definition to account for positions of FILE_COMMENT.</li> |
|
<li>Added a note to the bullets in Section 2.1 to clarify priority of matching for |
|
some line types.</li> |
|
<li>In Section 2.2, added a note clarifying the font handling for characters |
|
outside the range U+0000..U+02FF occurring in STRING or LABEL elements.</li>
|
<li>Also in Section 2.2, updated the bullet about lookbehind logic |
|
for identifying digit sequences that are part of identifiers for various standards, |
|
to include the detection of IEEE, KSC X, and GB standards.</li> |
|
<li>Added missing quotation marks around * in second expansion for |
|
NOTICE_LINE.</li> |
|
<li>Corrected and clarified the BNF statement of nameslist syntax.</li> |
|
<li>Some literals had not been quoted, and some productions were missing the trailing LF.</li>
<li>The LF and LCNAME productions were clarified.</li>
|
<li>Updated to HTML5</li> |
|
</ul> |
|
<p><b>Version 15.0.0</b></p> |
|
<ul> |
|
<li>Reissued for Unicode 15.0.0.</li> |
|
</ul> |
|
<p><b>Version 14.0.0</b></p> |
|
<ul> |
|
<li>Reissued for Unicode 14.0.0.</li> |
|
<li>Corrected character name LIGHT SCREEN to LIGHT SHADE.</li> |
|
</ul> |
|
<p><b>Version 13.0.0</b></p> |
|
<ul> |
|
<li>Reissued for Unicode 13.0.0.</li> |
|
<li>Added a second expansion for DECOMPOSITION, for possible future |
|
use to designate specific subtypes of canonical decompositions |
|
in the names list output.</li> |
|
</ul> |
|
<p><b>Version 12.1.0</b></p> |
|
<ul> |
|
<li>Reissued for Unicode 12.1.0.</li> |
|
</ul> |
|
<p><b>Version 12.0.0</b></p> |
|
<ul> |
|
<li>Reissued for Unicode 12.0.0.</li> |
|
<li>Added definition of TAG (allowing uppercase letters), distinct from LCTAG.</li> |
|
<li>Corrected definition of VARIATION_LINE to use LCTAG instead of LCNAME.</li> |
|
<li>Corrected definition of COMPAT_MAPPING to use TAG instead of LCTAG.</li> |
|
<li>Corrected the documentation regarding which elements allow use of characters |
|
in the range U+0020..U+02FF.</li> |
|
</ul> |
|
<p><b>Version 11.0.0</b></p> |
|
<ul> |
|
<li>Reissued for Unicode 11.0.0.</li> |
|
<li>Loosened the limitation on repertoire allowed in LINE and LABEL |
|
elements to include characters outside Latin-1, in the range |
|
U+0100..U+02FF.</li> |
|
</ul> |
|
<p><b>Version 10.0.0</b></p> |
|
<ul> |
|
<li>Reissued for Unicode 10.0.0.</li> |
|
</ul> |
|
<p><b>Version 9.0.0</b></p> |
|
<ul> |
|
<li>Reissued for Unicode 9.0.0.</li> |
|
</ul> |
|
<p><b>Version 8.0.0</b></p> |
|
<ul> |
|
<li>Reissued for Unicode 8.0.0.</li> |
|
<li>Added MIXED_SUBHEADER, VARSEL_LIST, and CHAR_LIST to the syntax.</li> |
|
<li>Tweaked BNF and notes for variation summaries.</li> |
|
</ul> |
|
<p><b>Version 7.0.0</b></p> |
|
<ul> |
|
<li>Reissued for Unicode 7.0.0.</li> |
|
</ul> |
|
<p><b>Version 6.3.0</b></p> |
|
<ul> |
|
<li>Reissued for Unicode 6.3.0.</li> |
|
</ul> |
|
<p><b>Version 6.2.0</b></p> |
|
<ul> |
|
<li>Edited the variation syntax definitions, description and corresponding notes for wording.</li> |
|
<li>Minor tweaks to the layout of BNF syntax, mostly adding tabs and | characters as needed.</li> |
|
<li>Fixed some typographical errors and minor inconsistencies.</li> |
|
<li>Added syntax for elements required by variation sequence and alternate glyph summaries.</li> |
|
<li>Edited and reformatted some notes for readability.</li> |
|
<li>Documented the permitted presence of CROSS_REF outside character entries within blocks. |
|
Such CROSS_REFs have been present in published names lists, but that information was missing in |
|
the syntax description. For an example see the Currency Symbols block in the code charts.</li> |
|
<li>Added description of UTF-8 charset declaration and file encoding.</li> |
|
</ul> |
|
<p><b>Version 6.1.0</b></p> |
|
<ul> |
|
<li>Removed constraint that LCTAG consist only of lowercase letters, |
|
because of the existence of the "noBreak" tag.</li> |
|
</ul> |
|
<p><b>Version 6.0.0</b></p> |
|
<ul> |
|
<li>Added definitions for ESC_CHAR and ESC primitives.</li> |
|
<li>Clarified interpretation of backslash escapes in EXPAND_LINE.</li> |
|
</ul> |
|
<p><b>Version 5.2.0</b></p> |
|
<ul> |
|
<li>Better aligned the rules section with the actual published files and |
|
behavior of existing parsers. This included fixing some obvious typos |
|
and clarifying some notes as well as the following changes, which are |
|
listed individually.</li> |
|
<li>Replaced instances of &lt;tab&gt; by TAB throughout.</li>
|
<li>NAME_LINE for special names may have trailing COMMENTs including COMMENTs |
|
consisting entirely of "*".</li> |
|
<li>In CROSS_REF added the form without LCNAME, fixed the literal to the |
|
correct lowercase "x" and noted that LCNAME may have "<" and ">" around |
|
it in the data. Also added missing LF in the rules.</li> |
|
<li>Removed a redundant rule for BLOCKHEADER.</li> |
|
<li>Changed FORMALALIAS_LINE from LINE to NAME to match actual restriction |
|
on contents.</li> |
|
<li>Extended the documentation of lookahead logic for CHAR.</li> |
|
<li>Accounted for FILE_COMMENT in overall file structure.</li> |
|
</ul> |
|
<p><b>Version 5.1.0</b></p> |
|
<ul> |
|
<li>Noted that comments in NAME_LINEs must be preceded by SP.</li> |
|
<li>Provided additional information on allowable characters in names.</li> |
|
<li>Added SIDEBAR_LINE.</li> |
|
<li>Noted that CROSS_REF must contain a SP and CHAR, and that |
|
COMPAT_MAPPING must contain a SP and may contain a &lt;tag&gt;.</li>
|
<li>Noted that LCNAME may contain uppercase characters under |
|
exceptional circumstances.</li> |
|
<li>Relaxed the restriction on lines starting with #, :, %, x and = on |
|
the TITLE_PAGE. These are now treated as comments.</li> |
|
</ul> |
|
<p><b>Version 5.0.0</b></p> |
|
<ul> |
|
<li>Added FORMALALIAS_LINE and INDEX_TAB to syntax.</li> |
|
<li>Fixed the list of lines that may appear before a BLOCKHEADER by |
|
adding NOTICE_LINE.</li> |
|
<li>Minor fixes to the wording of several syntax definitions.</li> |
|
</ul> |
|
<p><b>Version 4.0.0</b></p> |
|
<ul> |
|
<li>Fixed syntax to better reflect restrictions on characters |
|
in character and block names.</li> |
|
<li>Better documented the treatment of comments in block names, plus
French name rules.</li>
|
</ul> |
|
<p><b>Version 3.2.0</b></p> |
|
<ul> |
|
<li>Fixed several broken links, added a left margin, |
|
changed version numbering.</li> |
|
</ul> |
|
<p><b>Version 3.1.0 (2)</b></p> |
|
<ul> |
|
<li>Use of 4-6 digit hex notation is now supported.</li> |
|
</ul> |
|
</div> |
|
|
|
<div class="pagebottom"> |
|
<hr style="width:50%"> |
|
<a href="https://www.unicode.org/copyright.html"> |
|
<img src="https://www.unicode.org/img/hb_notice.gif" |
|
alt="Access to Copyright and terms of use" ></a> |
|
</div> |
|
|
|
</body> |
|
|
|
</html> |
|
|
|
|