
===Strings===
{{main | C string handling}}

In C, string literals are surrounded by double quotes ({{code|"}}), e.g. {{code|"Hello world!"}}, and are compiled to an array of the specified {{code|char}} values followed by an additional [[null terminating character]] (value 0) that marks the end of the string.
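The effect of the appended terminator can be checked by comparing the size of the array with the length reported by {{code|strlen}}. The following is a minimal sketch; the names {{code|greeting}}, {{code|greeting_array_size}} and {{code|greeting_length}} are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The literal initialises an array of 13 chars: 12 visible characters
   plus the terminating 0 that the compiler appends. */
static const char greeting[] = "Hello world!";

/* Number of array elements, terminator included. */
size_t greeting_array_size(void)
{
    return sizeof greeting;
}

/* Number of characters before the terminator, as counted by strlen. */
size_t greeting_length(void)
{
    size_t n = strlen(greeting);
    assert(greeting[n] == '\0');   /* strlen stops at the terminator */
    return n;
}
```

Here {{code|greeting_array_size()}} yields 13 and {{code|greeting_length()}} yields 12; the difference of one is the terminator.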

[[String literal]]s may not contain embedded newlines; this proscription somewhat simplifies parsing of the language. To include a newline in a string, the [[#Backslash escapes|backslash escape]] {{code|\n}} may be used, as below.

There are several standard library functions for operating on string data (not necessarily constant) organized as an array of {{code|char}} using this null-terminated format; see [[#Library functions|below]].

C's string-literal syntax has been very influential, and has made its way into many other languages, such as C++, Objective-C, Perl, Python, PHP, Java, JavaScript, C#, and Ruby. Many newer languages adopt or build upon C-style string syntax. Languages that lack this syntax tend to precede C.

====Backslash escapes====
{{main|Escape sequences in C}}
Because certain characters cannot be part of a literal string expression directly, they are instead identified by an escape sequence starting with a backslash ({{code|\}}). For example, the backslashes in {{code|"This string contains \"double quotes\"."}} indicate (to the compiler) that the inner pair of quotes is intended as an actual part of the string, rather than being read by default as a delimiter (endpoint) of the string itself.

Backslashes may be used to enter various control characters, etc., into a string:
{| class="wikitable"
! align="left" |Escape
! align="left" |Meaning
|-
| {{code|\\}} || Literal backslash
|-
| {{code|\"}} || Double quote
|-
| {{code|\'}} || Single quote
|-
| {{code|\n}} || Newline (line feed)
|-
| {{code|\r}} || Carriage return
|-
| {{code|\b}} || Backspace
|-
| {{code|\t}} || Horizontal tab
|-
| {{code|\f}} || Form feed
|-
| {{code|\a}} || Alert (bell)
|-
| {{code|\v}} || Vertical tab
|-
| {{code|\?}} || Question mark (used to escape [[C trigraph|trigraphs]], obsolete feature dropped in C23)
|-
| <code>\''OOO''</code> || Character with octal value ''OOO'' (where ''OOO'' is 1-3 octal digits, '0'-'7')
|-
| <code>\x''hh''</code> || Character with hexadecimal value ''hh'' (where ''hh'' is 1 or more hex digits, '0'-'9','A'-'F','a'-'f')
|-
| <code>\u''hhhh''</code> || [[Unicode]] [[code point]] below 10000 hexadecimal, where ''hhhh'' is four hexadecimal digits (added in C99)
|-
| <code>\U''hhhhhhhh''</code> || Unicode code point where ''hhhhhhhh'' is eight hexadecimal digits (added in C99)
|}
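Each escape sequence in the table stands for a single character of the resulting array. The sketch below illustrates this with a few of the escapes; the identifiers are hypothetical, and the comparison of the octal and hexadecimal forms assumes an ASCII-compatible execution character set.

```c
#include <string.h>

/* Five characters before the terminator:
   'a', horizontal tab, 'b', backslash, newline. */
static const char escaped[] = "a\tb\\\n";

/* Octal and hexadecimal escapes give the character code directly.
   Assuming ASCII, both of these hold the single character 'A' (65). */
static const char octal_a[] = "\101";
static const char hex_a[]   = "\x41";

/* Length of the escaped string as seen by the library. */
size_t escaped_length(void)
{
    return strlen(escaped);
}

/* Returns 1 if the two numeric escapes denote the same single character. */
int numeric_escapes_agree(void)
{
    return strlen(octal_a) == 1 && strlen(hex_a) == 1
        && octal_a[0] == hex_a[0];
}
```

Although the source text of {{code|escaped}} spans nine characters between the quotes, {{code|escaped_length()}} reports 5.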

The use of other backslash escapes is not defined by the C standard, although compiler vendors often provide additional escape codes as language extensions. One of these is the escape sequence <code>\e</code> for the [[escape character]] (ASCII hex value 1B), which was not added to the C standard because it lacks a representation in other [[character set]]s (such as [[EBCDIC]]). It is available in [[GNU Compiler Collection|GCC]], [[clang]] and [[Tiny C Compiler|tcc]].

Note that [[printf format string]]s use {{code|%%}} to represent a literal {{code|%}} character; there is no {{code|\%}} escape sequence in standard C.
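The doubling of the percent sign is handled by the formatting function, not by the compiler. A minimal sketch using {{code|snprintf}}; the helper name {{code|format_percent}} is hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/* Writes "<value>% done" into buf. The "%%" in the format string is
   turned into a single '%' by snprintf itself.
   Returns whatever snprintf returns (the length of the full result). */
int format_percent(char *buf, size_t size, int value)
{
    return snprintf(buf, size, "%d%% done", value);
}
```

With a sufficiently large buffer and the value 50, the buffer receives the eight characters {{code|50% done}}.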

====String literal concatenation====
C has [[string literal concatenation]], meaning that adjacent string literals are concatenated at compile time; this allows long strings to be split over multiple lines, and also allows string literals resulting from [[C preprocessor]] defines and macros to be appended to strings at compile time:
<syntaxhighlight lang=C>
    printf(__FILE__ ": %d: Hello "
           "world\n", __LINE__);
</syntaxhighlight>
will expand to
<syntaxhighlight lang=C>
    printf("helloworld.c" ": %d: Hello "
           "world\n", 10);
</syntaxhighlight>
which is syntactically equivalent to
<syntaxhighlight lang=C>
    printf("helloworld.c: %d: Hello world\n", 10);
</syntaxhighlight>
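The same mechanism can be used to assemble a constant string from several macros. The following sketch assumes two hypothetical macros, {{code|APP_NAME}} and {{code|APP_VERSION}}, which are not part of any standard header.

```c
#include <string.h>

/* APP_NAME and APP_VERSION are hypothetical macros for this illustration. */
#define APP_NAME    "demo"
#define APP_VERSION "1.0"

/* Adjacent literals, including those produced by macro expansion,
   are joined into one literal during translation. */
static const char banner[] = APP_NAME " version " APP_VERSION
                             " (built from " __FILE__ ")";

/* Only the fixed prefix is compared, since __FILE__ depends on the build. */
int banner_has_expected_prefix(void)
{
    static const char prefix[] = "demo version 1.0 (built from ";
    return strncmp(banner, prefix, sizeof prefix - 1) == 0;
}
```

No run-time concatenation takes place; {{code|banner}} is a single array whose contents are fixed when the program is compiled.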

====Character constants====
Individual character constants are single-quoted, e.g. {{code|'A'}}, and have type {{code|int}} (in C++, {{code|char}}). A character constant differs from a one-character string literal: {{code|"A"}} represents a null-terminated array of two characters, 'A' and '\0', whereas {{code|'A'}} directly represents the character value (65 if ASCII is used). The same backslash escapes are supported as for strings, except that {{code|"}} can validly be used as a character without being escaped, whereas {{code|'}} must now be escaped.
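These differences can be observed with {{code|sizeof}}. The sketch below is valid only when compiled as C, since in C++ a character constant has type {{code|char}}; the function names are hypothetical.

```c
#include <stddef.h>

/* In C a character constant has type int, so its size equals sizeof(int). */
int char_constant_is_int_sized(void)
{
    return sizeof 'A' == sizeof(int);
}

/* The string literal "A" is an array of two chars: 'A' and the terminator. */
size_t one_letter_string_size(void)
{
    return sizeof "A";
}

/* Inside a character constant the double quote needs no escape,
   although escaping it is still permitted and gives the same value. */
int double_quote_forms_agree(void)
{
    return '"' == '\"';
}
```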

A character constant cannot be empty (i.e. {{code|''}} is invalid syntax), although a string may be (it still has the null terminating character). Multi-character constants (e.g. {{code|'xy'}}) are valid, although rarely useful; they let one store several characters in an integer (e.g. 4 ASCII characters can fit in a 32-bit integer, 8 in a 64-bit one). Since the order in which the characters are packed into an {{code|int}} is not specified (left to the implementation to define), portable use of multi-character constants is difficult.

Nevertheless, in situations limited to a specific platform and compiler implementation, multi-character constants do find use in specifying signatures. One common use case is the [[OSType]], where the combination of Classic Mac OS compilers and the platform's inherent big-endianness means that bytes in the integer appear in the exact order of the characters in the literal. The definitions used by popular implementations are in fact consistent: in GCC, Clang, and [[Visual C++]], {{code|'1234'}} yields <code>0x3'''1'''3'''2'''3'''3'''3'''4'''</code> under ASCII.<ref>{{cite web |title=The C Preprocessor: Implementation-defined behavior |url=https://gcc.gnu.org/onlinedocs/cpp/Implementation-defined-behavior.html |website=gcc.gnu.org}}</ref><ref>{{cite web |title=String and character literals (C++) |url=https://docs.microsoft.com/en-us/cpp/cpp/string-and-character-literals-cpp?view=vs-2019#code-try-2 |website=Visual C++ 19 Documentation |access-date=20 November 2019 |language=en-us}}</ref>
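Where a fixed layout is needed without relying on implementation-defined behaviour, the value can be built explicitly from single-character constants. The following is a sketch of one such approach, not a standard facility; {{code|fourcc}} is a hypothetical helper name.

```c
#include <stdint.h>

/* Packs four characters into a 32-bit value, first character in the
   most significant byte. Each char is converted through unsigned char
   so that negative char values do not sign-extend. */
uint32_t fourcc(char a, char b, char c, char d)
{
    return ((uint32_t)(unsigned char)a << 24) |
           ((uint32_t)(unsigned char)b << 16) |
           ((uint32_t)(unsigned char)c << 8)  |
            (uint32_t)(unsigned char)d;
}
```

Assuming ASCII, {{code|fourcc('1', '2', '3', '4')}} yields {{code|0x31323334}}, the same value the compilers named above produce for {{code|'1234'}}.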

Like string literals, character constants can also be modified by prefixes, for example {{code|L'A'}} has type {{code|wchar_t}} and represents the character value of "A" in the wide character encoding.

====Wide character strings====
Since type {{code|char}} is 1 byte wide, a single {{code|char}} value typically can represent at most 256 distinct character codes, not nearly enough for all the characters in use worldwide. To provide better support for international characters, the first C standard (C89) introduced [[wide character]]s (encoded in type {{code|wchar_t}}) and wide character strings, which are written as {{code|L"Hello world!"}}.

Wide characters are most commonly either 2 bytes (using a 2-byte encoding such as [[UTF-16]]) or 4 bytes (usually [[UTF-32]]), but Standard C does not specify the width for {{code|wchar_t}}, leaving the choice to the implementor. [[Microsoft Windows]] generally uses UTF-16, thus the above string would be 26 bytes long for a Microsoft compiler; the [[Unix]] world prefers UTF-32<!-- dubious?! See also new in C23:  char8_t type for storing UTF-8 encoded data -->, thus compilers such as GCC would generate a 52-byte string. A 2-byte wide {{code|wchar_t}} suffers the same limitation as {{code|char}}, in that certain characters (those outside the [[Basic Multilingual Plane|BMP]]) cannot be represented in a single {{code|wchar_t}} and must instead be represented using [[surrogate pair]]s.
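The byte counts quoted above follow from 13 elements (12 characters plus the terminator) multiplied by the size of {{code|wchar_t}}. A minimal sketch that expresses the size without assuming a particular width; the function names are hypothetical.

```c
#include <stddef.h>
#include <wchar.h>

/* Total storage of the wide literal in bytes, terminator included. */
size_t wide_literal_bytes(void)
{
    return sizeof L"Hello world!";
}

/* Number of wide characters before the terminator. */
size_t wide_literal_length(void)
{
    return wcslen(L"Hello world!");
}
```

On an implementation with a 2-byte {{code|wchar_t}}, {{code|wide_literal_bytes()}} gives 26; with a 4-byte {{code|wchar_t}} it gives 52. In both cases {{code|wide_literal_length()}} is 12.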

The original C standard specified only minimal functions for operating with wide character strings; in 1995 the standard was modified to include much more extensive support, comparable to that for {{code|char}} strings. The relevant functions are mostly named after their {{code|char}} equivalents, with the addition of a "w" or the replacement of "str" with "wcs"; they are specified in {{code|<wchar.h>}}, with {{code|<wctype.h>}} containing wide-character classification and mapping functions.
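The parallel naming can be seen in a short sketch that upper-cases a wide string in place, using {{code|towupper}} from {{code|<wctype.h>}} where a {{code|char}} version would use {{code|toupper}}. The helper name {{code|wide_upcase}} is hypothetical, and the result for non-ASCII letters depends on the current locale.

```c
#include <assert.h>
#include <wchar.h>
#include <wctype.h>

/* Upper-cases a null-terminated wide string in place. */
void wide_upcase(wchar_t *s)
{
    assert(s != NULL);
    for (; *s != L'\0'; ++s)
        *s = (wchar_t)towupper((wint_t)*s);
}
```

A caller might copy a literal into a buffer with {{code|wcscpy}}, call {{code|wide_upcase}}, and compare the result with {{code|wcscmp}}, the wide counterparts of {{code|strcpy}} and {{code|strcmp}}.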

The now generally recommended method<ref group="note">see [[UTF-8]] first section for references</ref> of supporting international characters is through [[UTF-8]], which is stored in {{code|char}} arrays, and can be written directly in the source code if using a UTF-8 editor, because UTF-8 is a direct [[Extended ASCII|ASCII extension]].

====Variable width strings====
A common alternative to {{code|wchar_t}} is to use a [[variable-width encoding]], whereby a logical character may extend over multiple positions of the string. Variable-width strings may be encoded into literals verbatim, at the risk of confusing the compiler, or using numerical backslash escapes (e.g. {{code|"\xc3\xa9"}} for "é" in UTF-8). The [[UTF-8]] encoding was specifically designed (under [[Plan 9 from Bell Labs|Plan 9]]) for compatibility with the standard library string functions; supporting features of the encoding include a lack of embedded nulls, no valid interpretations for subsequences, and trivial resynchronisation. Encodings lacking these features are likely to prove incompatible with the standard library functions; encoding-aware string functions are often used in such cases.
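Because a logical character may occupy several {{code|char}} positions, the length in bytes and the number of code points can differ. The sketch below counts code points by skipping UTF-8 continuation bytes; it assumes the input is valid UTF-8, and {{code|utf8_count}} is a hypothetical helper name rather than a standard library function.

```c
#include <assert.h>
#include <stddef.h>

/* Counts code points in a null-terminated UTF-8 string.
   Continuation bytes have the bit pattern 10xxxxxx and are skipped,
   so only the first byte of each encoded character is counted.
   Assumes the input is valid UTF-8. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; ++s)
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    return n;
}
```

For {{code|"\xc3\xa9"}} the array holds two bytes plus the terminator, while {{code|utf8_count}} reports a single code point.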

====Library functions====
[[String (computer science)|Strings]], both constant and variable, can be manipulated without using the [[standard library]]. However, the library contains many [[C string handling|useful functions]] for working with null-terminated strings.
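As an illustration of manipulating a string without the library, the length of a null-terminated string can be found with a plain loop. This is a sketch only; {{code|my_length}} is a hypothetical name, and in practice the library function {{code|strlen}} serves the same purpose.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the number of chars before the terminating 0,
   walking the array by hand instead of calling strlen. */
size_t my_length(const char *s)
{
    const char *p = s;
    assert(s != NULL);
    while (*p != '\0')
        ++p;
    return (size_t)(p - s);
}
```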