notepad-plus-plus/scintilla/doc/StyleMetadata.html

205 lines
13 KiB
HTML

<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org" />
<meta name="generator" content="SciTE" />
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>
Scintilla Style Metadata
</title>
<style type="text/css">
<!--
/*<![CDATA[*/
CODE { font-weight: bold; font-family: Menlo,Consolas,Bitstream Vera Sans Mono,Courier New,monospace; }
/*]]>*/
-->
</style>
</head>
<body bgcolor="#FFFFFF" text="#000000">
<table bgcolor="#000000" width="100%" cellspacing="0" cellpadding="0" border="0">
<tr>
<td>
<img src="SciTEIco.png" border="3" height="64" width="64" alt="Scintilla icon" />
</td>
<td>
<a href="index.html" style="color:white;text-decoration:none"><font size="5">Scintilla</font></a>
</td>
</tr>
</table>
<h2>
Language Types
</h2>
<p>
Scintilla contains lexers for various types of languages:
<ul>
<li>Programming languages like C++, Java, and Python.</li>
<li>Assembler languages are low-level programming languages which may additionally include instructions and registers.</li>
<li>Markup languages like HTML, TeX, and Markdown.</li>
<li>Data languages like EDIFACT and YAML.</li>
</ul>
</p>
<p>
Some languages can be used in different ways. JavaScript is a programming language but also
the basis of JSON data files. Similarly,
<a href="https://en.wikipedia.org/wiki/S-expression">Lisp s expressions</a> can be used for both source code and data.
</p>
<p>
Each language type has common elements such as identifiers in programming languages.
These common elements should be identified so that languages can be displayed with common
styles for these elements.
Style tags are used for this purpose in Scintilla.
</p>
<h2>
Style Tags
</h2>
<p>
Every style has a list of tags where a tag is a lower-case word containing only the common ASCII letters 'a'-'z'
such as "comment" or "operator".
</p>
<p>
Tags are ordered from most important to least important.
</p>
<p>
While applications may assign visual attributes for tag lists in many different ways, one reasonable technique is to
apply tag-specific attributes in reverse order so that earlier and more important tags override less important tags.
For example, the tag list <code>"error comment documentation keyword"</code> with
a set of tag attributes <br />
<code>{ comment=fore:green,back:very-light-green,font:Serif documentation=fore:light-green error=strikethrough keyword=bold }</code><br />
could be rendered as <br />
<code>bold,fore:light-green,back:very-light-green,font:Serif,strikethrough</code>.
</p>
<p>
Alternative renderings could check for multi-tag combinations like
<code>{ comment.documentation=fore:light-green comment.line=dark-green comment=green }.</code>
</p>
<p>
Commonly, a tag list will contain an optional embedded language; optional statuses; a base type; and a set of type modifiers:<br />
<code>embedded-language? status* base-type modifiers*</code>
</p>
<h3>Embedded language</h3>
<p>
The embedded language may be a source <code>(client | server)</code> followed by a language name
<code>(javascript | php | python | basic)</code>.
This may be extended in the future with other programming languages and style-definition languages like CSS.
</p>
<h3>Status</h3>
<p>
The statuses may be <code>(error | unused | predefined | inactive)</code>.<br />
The <code>error</code> status is used for lexical statuses that indicate errors in the source code such as unterminated quoted strings.<br />
The <code>unused</code> status may indicate a gap in the lexical states, possibly because an old lexical class is no longer used or an upcoming lexical class may fill that position.<br />
The <code>predefined</code> status indicates a style in the range 32.39 that is used for non-lexical purposes in Scintilla.<br />
The <code>inactive</code> status is used for text that is not currently interpreted such as C++ code that is contained within a '#if 0' preprocessor block.
</p>
<h3>Basic Types</h3>
<p>
The basic types for programming languages are <code>(default | operator | keyword | identifier | literal | comment | preprocessor | label)</code>.<br />
The <code>default</code> type is commonly used for spaces and tabs between tokens although it may cover other characters in some languages.
</p>
<p>
Assembler languages add <code>(instruction | register)</code>. to the basic types from programming languages.<br />
</p>
<p>
The basic types for markup languages are <code>(default | tag | attribute | comment | preprocessor)</code>.<br />
</p>
<p>
The basic types for data languages are <code>(default | key | data | comment)</code>.<br />
</p>
<h3>Comments</h3>
<p>
Programming languages may differentiate between line and stream comments and treat documentation comments as distinct from other comments.
Documentation comments may be marked up with documentation keywords.<br />
The additional attributes commonly used are <code>(line | documentation | keyword | taskmarker)</code>.
</p>
<h3>Literals</h3>
<p>
Programming and assembler languages contain a rich set of literals including numbers like <code>7</code> and <code>3.89e23</code>; <code>"string\n"</code>; and <code>nullptr</code>
and differentiating between these is often wanted.<br />
The common literal types are <code>(numeric | boolean | string | regex | date | time | uuid | nil | compound)</code>.<br />
Numeric literal types are subdivided into <code>(integer | real)</code>.<br />
String literal types may add (perhaps multiple) further attributes from <code> (heredoc | character | escapesequence | interpolated | multiline | raw)</code>.<br />
</p>
<p>
An escape sequence within an interpolated heredoc may thus be <code>literal string heredoc escapesequence</code>.
</p>
<h3>
List of known tags
</h3>
<table>
<tr><td><code>attribute</code></td><td>Markup attribute</td></tr>
<tr><td><code>basic</code></td><td>Embedded Basic</td></tr>
<tr><td><code>boolean</code></td><td>True or false literal</td></tr>
<tr><td><code>character</code></td><td>Single character literal as opposed to a string literal</td></tr>
<tr><td><code>client</code></td><td>Script executed on client</td></tr>
<tr><td><code>comment</code></td><td>The standard comment type in a language: may be stream or line</td></tr>
<tr><td><code>compound</code></td><td>Literal containing multiple subliterals such as a tuple or complex number</td></tr>
<tr><td><code>data</code></td><td>A value in a data file</td></tr>
<tr><td><code>date</code></td><td>Literal representing a data such as '19/November/1975'</td></tr>
<tr><td><code>default</code></td><td>Starting state commonly also used for white space</td></tr>
<tr><td><code>documentation</code></td><td>Comment that can be extracted into documentation</td></tr>
<tr><td><code>error</code></td><td>State indicating an invalid or erroneous element</td></tr>
<tr><td><code>escapesequence</code></td><td>Parts of a string that are not literal such as '\t' for tab in C</td></tr>
<tr><td><code>heredoc</code></td><td>Lengthy text literal marked by a word at both ends</td></tr>
<tr><td><code>identifier</code></td><td>Name that identifies an object or class of object</td></tr>
<tr><td><code>inactive</code></td><td>Code that is not currently interpreted</td></tr>
<tr><td><code>instruction</code></td><td>Mnemonic in assembler languages like 'addc'</td></tr>
<tr><td><code>integer</code></td><td>Numeric literal with no fraction or exponent like '738'</td></tr>
<tr><td><code>interpolated</code></td><td>String that can contain expressions</td></tr>
<tr><td><code>javascript</code></td><td>Embedded Javascript</td></tr>
<tr><td><code>key</code></td><td>Element which allows finding associated data</td></tr>
<tr><td><code>keyword</code></td><td>Reserved word with special meaning like 'while'</td></tr>
<tr><td><code>label</code></td><td>Destination for jumps in programming and assembler languages</td></tr>
<tr><td><code>line</code></td><td>Differentiates between stream comments and line comments in languages that have both</td></tr>
<tr><td><code>literal</code></td><td>Fixed value in source code</td></tr>
<tr><td><code>multiline</code></td><td>Differentiates between single line and multiline elements, commonly strings</td></tr>
<tr><td><code>nil</code></td><td>Literal for the null pointer such as nullptr in C++ or NULL in C</td></tr>
<tr><td><code>numeric</code></td><td>Literal number like '16'</td></tr>
<tr><td><code>operator</code></td><td>Punctuation character such as '&amp;' or '['</td></tr>
<tr><td><code>php</code></td><td>Embedded PHP</td></tr>
<tr><td><code>predefined</code></td><td>Style in the range 32.39 that is used for non-lexical purposes</td></tr>
<tr><td><code>preprocessor</code></td><td>Element that is recognized in an early stage of translation</td></tr>
<tr><td><code>python</code></td><td>Embedded Python</td></tr>
<tr><td><code>raw</code></td><td>String type that avoids interpretation: may be used for regular expressions in languages without a specific regex type</td></tr>
<tr><td><code>real</code></td><td>Numeric literal which may have a fraction or exponent like '3.84e-15'</td></tr>
<tr><td><code>regex</code></td><td>Regular expression literal like '^[a-z]+'</td></tr>
<tr><td><code>register</code></td><td>CPU register in assembler languages</td></tr>
<tr><td><code>server</code></td><td>Script executed on server</td></tr>
<tr><td><code>string</code></td><td>Sequence of characters</td></tr>
<tr><td><code>tag</code></td><td>Markup tag like '&lt;br /&gt;'</td></tr>
<tr><td><code>taskmarker</code></td><td>Word in comment that marks future work like 'FIXME'</td></tr>
<tr><td><code>time</code></td><td>Literal representing a time such as '9:34:31'</td></tr>
<tr><td><code>unused</code></td><td>Style that is not currently used</td></tr>
<tr><td><code>uuid</code></td><td>Universally unique identifier often used in interface definition files which may look like '{098f2470-bae0-11cd-b579-08002b30bfeb}'</td></tr>
</table>
<h2>
Extension
</h2>
<p>
Each element in this scheme may be extended in the future. This may be done by revising this document to provide a common approach to new features.
Individual lexers may also choose to expose unique language features through new tags.
</p>
<h2>
Translation
</h2>
<p>
Tags could be exposed directly in user interfaces or configuration languages.
However, an application may also translate these to match its naming schema.
Capitalization and punctuation could be different (like <code>Here-Doc</code> instead of <code>heredoc</code>),
terminology changed ("constant" instead of "literal"),
or human language changed from English to Chinese or Spanish.
</p>
<p>
Starting from a common set of tags makes these modifications tractable.
</p>
<h2>
Open issues
</h2>
<p>
The C++ lexer (for example) has inactive states and dynamically allocated substyles.
These should be exposed through the metadata mechanism but are not currently.
</p>
</body>
</html>